## From KNN LR Lesson

# Modeling - KNN & Logistic Regression

What is it?
- two machine learning algorithms used for classification, i.e. predicting a categorical target variable
- Pipeline: Plan - Acquire - Prepare - Explore - **Model** - Deliver

Why do we care?
- we can predict future target variables based on the model we build! 

How does it work?
- [KNN](https://www.canva.com/design/DAF1kLSg27I/d5RDlzsWE9fGUyBiYHsCMw/view?utm_content=DAF1kLSg27I&utm_campaign=designshare&utm_medium=link&utm_source=editor)
- [Logistic Regression](https://www.canva.com/design/DAF1kruKRPA/Io1xEv-v0Ucf0htssSZ-uA/view?utm_content=DAF1kruKRPA&utm_campaign=designshare&utm_medium=link&utm_source=editor)
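
As a rough illustration of the KNN idea (classify a new observation by majority vote among its nearest training points), here is a minimal sketch in plain NumPy. The function name `knn_predict` and the toy measurements are made up for this example; sklearn's `KNeighborsClassifier` is what the demo below actually uses.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one observation by majority vote among its k nearest training rows."""
    # Euclidean distance from x_new to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # positions of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# toy data (made up): petal_length, petal_width
X_toy = np.array([[1.4, 0.2], [1.5, 0.2], [1.3, 0.3],
                  [5.2, 2.0], [4.9, 1.8], [5.5, 2.1]])
y_toy = np.array(['not virginica', 'not virginica', 'not virginica',
                  'virginica', 'virginica', 'virginica'])

prediction = knn_predict(X_toy, y_toy, np.array([5.0, 1.9]), k=3)
print(prediction)
```

The new point sits next to the three large-petal rows, so all three votes go to 'virginica'.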

How do we use it?
- acquire, prepare, explore our data
    - determine features to go into model
- split data for modeling
- build models on train
    - create rules based on our input data
- evaluate models on train & validate
    - see how our rules work on unseen data
- pick the best model, and evaluate only that model on test
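
The steps above can be sketched end to end as follows. This is a sketch under assumptions, not the lesson's code: it loads iris from `sklearn.datasets` instead of the `acquire`/`prepare` modules, so the column names differ, and the `random_state` value is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# acquire + prepare (assumption: sklearn's bundled copy of iris)
iris = load_iris(as_frame=True)
X = iris.data
y = (iris.target == 2).map({True: 'virginica', False: 'not virginica'})

# split for modeling: 60% train, 20% validate, 20% test
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=123)

# build models on train, evaluate on train & validate
models = {
    'knn_5': KNeighborsClassifier(n_neighbors=5),
    'logistic': LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # fit on train only
    scores[name] = {
        'train': model.score(X_train, y_train),
        'validate': model.score(X_validate, y_validate),
    }

# pick the best model by validate accuracy, then score it once on test
best_name = max(scores, key=lambda name: scores[name]['validate'])
test_accuracy = models[best_name].score(X_test, y_test)
print(scores)
print(best_name, test_accuracy)
```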

## Demo: Iris Data

In [1]:
#data things
import numpy as np
import pandas as pd

#to evaluate
from sklearn.metrics import classification_report

#my own py files
import acquire
import prepare

#new imports! 
#for classification
from sklearn.linear_model import LogisticRegression #logistic not linear!
from sklearn.neighbors import KNeighborsClassifier #pick the classifier one

### Acquire

In [3]:
df = acquire.get_iris_data()

this file exists, reading csv


In [4]:
df.head()

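| | species_id | measurement_id | sepal_length | sepal_width | petal_length | petal_width | species_name |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 1 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 1 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 1 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |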


In [5]:
df.shape

(150, 7)

### Prepare

In [8]:
df = prepare.prep_iris(df)

In [9]:
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [12]:
train, validate, test = prepare.splitting_data(df, 'species')

In [13]:
train.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 24 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 88 | 5.6 | 3.0 | 4.1 | 1.3 | versicolor |
| 123 | 6.3 | 2.7 | 4.9 | 1.8 | virginica |
| 31 | 5.4 | 3.4 | 1.5 | 0.4 | setosa |


In [14]:
train.shape

(90, 5)

In [15]:
validate.shape

(30, 5)

In [16]:
test.shape

(30, 5)
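
The source of `prepare.splitting_data` is not shown in this lesson. One way a function like it could produce the 90/30/30 shapes above is a two-step stratified split with `train_test_split`. The function name `split_data_sketch`, the 60/20/20 proportions, and the seed are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def split_data_sketch(df, target, seed=123):
    """Hypothetical stand-in for prepare.splitting_data: 60/20/20 stratified split."""
    # first carve off 20% of all rows as test
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target])
    # then 25% of the remaining 80% (20% of the total) becomes validate
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed,
        stratify=train_validate[target])
    return train, validate, test

# demo data (assumption: sklearn's bundled iris, not acquire.get_iris_data)
iris = load_iris(as_frame=True)
demo_df = iris.data.copy()
demo_df['species'] = iris.target_names[iris.target.to_numpy()]

train_demo, validate_demo, test_demo = split_data_sketch(demo_df, 'species')
print(train_demo.shape, validate_demo.shape, test_demo.shape)
```

Stratifying on the target keeps the three species in the same proportions in every split.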

## Explore 

**ONLY USING TRAIN!**

completed the following steps on my features and target variable
1. hypothesize
2. visualize
3. analyze (with stats)
4. summarize

determined features that have a relationship with the target variable

## Modeling

#### verify my data is ready for modeling aka no string features

In [22]:
train.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 24 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 88 | 5.6 | 3.0 | 4.1 | 1.3 | versicolor |
| 123 | 6.3 | 2.7 | 4.9 | 1.8 | virginica |
| 31 | 5.4 | 3.4 | 1.5 | 0.4 | setosa |


#### split into features and target variable
- need to do this on my train, validate, and test dataframe
- will end up with the following variables:
    - X_train, X_validate, X_test: all the features we plan to put into our model
    - y_train, y_validate, y_test: the target variable

In [21]:
train.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 24 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 88 | 5.6 | 3.0 | 4.1 | 1.3 | versicolor |
| 123 | 6.3 | 2.7 | 4.9 | 1.8 | virginica |
| 31 | 5.4 | 3.4 | 1.5 | 0.4 | setosa |


In [25]:
#isolate my target variable
y_train = train.species
y_train.head()

24         setosa
147     virginica
88     versicolor
123     virginica
31         setosa
Name: species, dtype: object

In [26]:
#repeat for validate and test
y_validate = validate.species
y_test = test.species

In [30]:
#isolate our features, using all of them for now
X_train = train.drop(columns='species')
X_train.head()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 24 | 4.8 | 3.4 | 1.9 | 0.2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 88 | 5.6 | 3.0 | 4.1 | 1.3 |
| 123 | 6.3 | 2.7 | 4.9 | 1.8 |
| 31 | 5.4 | 3.4 | 1.5 | 0.4 |


In [31]:
#repeat for validate and test
X_validate = validate.drop(columns='species')
X_test = test.drop(columns='species')

#### binary classification
- necessary for logistic regression
- will predict if species is virginica or not

In [33]:
y_train.value_counts()

setosa        30
virginica     30
versicolor    30
Name: species, dtype: int64

In [40]:
#np.where(condition, value if true, value if false)
y_train = pd.Series(np.where(y_train == 'virginica', 
                             'virginica', 'not virginica'))

In [41]:
y_train.value_counts()

not virginica    60
virginica        30
dtype: int64

In [42]:
#repeat for validate and test
y_validate = pd.Series(np.where(y_validate == 'virginica', 
                             'virginica', 'not virginica'))
y_test = pd.Series(np.where(y_test == 'virginica', 
                             'virginica', 'not virginica'))
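
Note that `np.where` returns a plain array, so wrapping it in `pd.Series` gives a fresh 0..n-1 index instead of the original row labels. That is harmless here because the rows stay in the same order as the features. If keeping the original index matters, `Series.where` is one alternative. A small sketch with made-up values:

```python
import pandas as pd

# toy target with the same row labels as the first rows of train
y_demo = pd.Series(
    ['setosa', 'virginica', 'versicolor', 'virginica', 'setosa'],
    index=[24, 147, 88, 123, 31], name='species')

# keep the value where the condition is True, otherwise replace it
y_binary = y_demo.where(y_demo == 'virginica', 'not virginica')
print(y_binary)
```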

## create Baseline

In [86]:
y_train.value_counts()

not virginica    60
virginica        30
dtype: int64

In [82]:
y_train.mode()

0    not virginica
dtype: object

> my baseline prediction is 'not virginica'

> so if i predicted 'not virginica' every single time, how often would i be correct?

In [85]:
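#baseline accuracy: share of the most common class
#.iloc[0] selects by position; plain [0] on a label index is deprecated in newer pandas
y_train.value_counts(normalize=True).iloc[0]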

0.6666666666666666

> my baseline accuracy is 67%
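
The same baseline can also be expressed as a model. A minimal sketch using sklearn's `DummyClassifier`, which always predicts the most frequent class; the toy target below just mirrors the 60/30 class counts above, and the placeholder feature column is made up because the dummy model ignores features.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# toy target with the same class counts as y_train
y_demo = pd.Series(['not virginica'] * 60 + ['virginica'] * 30)
# placeholder features: DummyClassifier ignores them
X_demo = pd.DataFrame({'placeholder': [0] * len(y_demo)})

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

# the same number by hand: how often the mode matches the actual value
manual_accuracy = (y_demo == y_demo.mode()[0]).mean()
print(round(baseline_accuracy, 2), round(manual_accuracy, 2))
```

Any real model should beat this number to be worth keeping.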

## KNN

#### sklearn modeling process

1. create the object
2. fit the object
3. use the object 

#### create it

In [55]:
knn = KNeighborsClassifier(n_neighbors=5)
knn

In [56]:
#knn.classes_ is only available after fitting, so it would raise an error here
# knn.classes_

#### fit it

In [57]:
#only fit on our TRAIN DATA!!!!!!!
#don't have to save it back to the knn variable
knn.fit(X_train, y_train)
#rules have been built based on our train dataset

In [58]:
knn.feature_names_in_

array(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
      dtype=object)

In [59]:
knn.classes_

array(['not virginica', 'virginica'], dtype=object)

#### use it

In [63]:
y_train.head()

0    not virginica
1        virginica
2    not virginica
3        virginica
4    not virginica
dtype: object

In [68]:
#predicted values
predicted = knn.predict(X_train)
predicted[:10]

array(['not virginica', 'virginica', 'not virginica', 'virginica',
       'not virginica', 'not virginica', 'not virginica', 'not virginica',
       'not virginica', 'not virginica'], dtype=object)

In [65]:
#probabilities
knn.predict_proba(X_train)[:10]

array([[1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0.2, 0.8],
       [1. , 0. ],
       [0.6, 0.4],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ]])

In [66]:
knn.classes_

array(['not virginica', 'virginica'], dtype=object)
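
With the default uniform weights, each KNN probability is the share of the 5 nearest neighbors that belong to that class, which is why the values above come in steps of 0.2. The columns follow the order of `classes_`. A sketch that checks this with `kneighbors`, assuming sklearn's bundled iris data instead of the lesson's train split:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# demo data (assumption: sklearn's bundled iris)
iris = load_iris(as_frame=True)
X_demo = iris.data
y_demo = np.where(iris.target.to_numpy() == 2, 'virginica', 'not virginica')

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)

# positions of the 5 nearest training rows for each observation
_, neighbor_idx = knn_demo.kneighbors(X_demo)
# share of those neighbors labeled 'virginica'
manual_proba = (y_demo[neighbor_idx] == 'virginica').mean(axis=1)

# second column of predict_proba corresponds to classes_[1], 'virginica'
sklearn_proba = knn_demo.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```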

In [67]:
#accuracy
knn.score(X_train,y_train)

0.9777777777777777

> model is accurately predicting virginica or not virginica 98% of the time on train

In [72]:
#use the predicted values to manually calculate
(predicted == y_train).mean()

0.9777777777777777

In [74]:
#y_true = y_train
#y_pred = predicted
print(classification_report(y_train, predicted))

               precision    recall  f1-score   support

not virginica       0.98      0.98      0.98        60
    virginica       0.97      0.97      0.97        30

     accuracy                           0.98        90
    macro avg       0.97      0.97      0.97        90
 weighted avg       0.98      0.98      0.98        90
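
To see where the report's numbers come from, precision and recall can be computed by hand from a confusion matrix. The toy labels below are constructed so that their counts (29 true positives, 1 false negative, 1 false positive, 59 true negatives) are consistent with the rounded report above; the actual counts are inferred from those rounded figures, not taken from the notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# toy labels built to match the counts inferred from the report
y_true = np.array(['virginica'] * 30 + ['not virginica'] * 60)
y_pred = np.array(['virginica'] * 29 + ['not virginica'] * 1
                  + ['not virginica'] * 59 + ['virginica'] * 1)

# rows = actual, columns = predicted, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=['not virginica', 'virginica'])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted virginica, how many were right
recall = tp / (tp + fn)      # of the actual virginica, how many were found
accuracy = (tp + tn) / cm.sum()
print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```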



### another knn

In [77]:
#create the object
#hyperparameters are the arguments we send in
knn2 = KNeighborsClassifier(n_neighbors=10)
knn2

In [78]:
#fit the object with TRAIN
knn2.fit(X_train, y_train)

In [80]:
#use the object
knn2.score(X_train, y_train)

0.9666666666666667

> my knn with 10 neighbors performed at 97% accuracy on train

## Logistic Regression

#### sklearn modeling process

1. create the object
2. fit the object
3. use the object 

#### create it

In [87]:
lr = LogisticRegression()
lr

#### fit it 

In [88]:
#fit only on TRAIN! 
#DO NOT FIT ON VALIDATE OR TEST
lr.fit(X_train, y_train)

In [89]:
lr.feature_names_in_

array(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
      dtype=object)

In [90]:
lr.classes_

array(['not virginica', 'virginica'], dtype=object)

In [91]:
lr.coef_

array([[-0.0836365 , -0.20447008,  2.40501367,  2.03830433]])

In [92]:
lr.intercept_

array([-13.87931336])
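
The coefficients and intercept define a linear score (the log-odds of `classes_[1]`, here 'virginica'), and the sigmoid function turns that score into a probability. A sketch that reproduces `predict_proba` by hand for a binary model, assuming sklearn's bundled iris data, so the fitted numbers will differ from the ones above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# demo data (assumption: sklearn's bundled iris)
iris = load_iris(as_frame=True)
X_demo = iris.data
y_demo = np.where(iris.target.to_numpy() == 2, 'virginica', 'not virginica')

lr_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# linear score: intercept + sum of coefficient * feature value
log_odds = X_demo.to_numpy() @ lr_demo.coef_[0] + lr_demo.intercept_[0]
# the sigmoid squeezes the score into a probability between 0 and 1
manual_proba = 1 / (1 + np.exp(-log_odds))

sklearn_proba = lr_demo.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

A positive coefficient pushes the probability of 'virginica' up as that feature grows, which matches the large positive petal coefficients above.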

#### use it 

In [95]:
#probabilities
lr.predict_proba(X_train).round(2)[:10]

array([[1.  , 0.  ],
       [0.18, 0.82],
       [0.92, 0.08],
       [0.38, 0.62],
       [1.  , 0.  ],
       [0.71, 0.29],
       [1.  , 0.  ],
       [0.88, 0.12],
       [1.  , 0.  ],
       [1.  , 0.  ]])

In [97]:
#predictions
lr.predict(X_train)[:10]

array(['not virginica', 'virginica', 'not virginica', 'virginica',
       'not virginica', 'not virginica', 'not virginica', 'not virginica',
       'not virginica', 'not virginica'], dtype=object)

In [102]:
#calculate accuracy
lr.score(X_train, y_train)

0.9777777777777777

In [103]:
print(classification_report(y_train, lr.predict(X_train)))

               precision    recall  f1-score   support

not virginica       0.98      0.98      0.98        60
    virginica       0.97      0.97      0.97        30

     accuracy                           0.98        90
    macro avg       0.97      0.97      0.97        90
 weighted avg       0.98      0.98      0.98        90



## Evaluate models using Validate dataset

- use to tune hyperparameters
- select the best model
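
Tuning a hyperparameter with validate can be sketched as a loop: fit one model per value on train, score each on train and validate, and compare. The data loading, the split seed, and the range of `k` values below are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# demo data and split (assumption: sklearn's bundled iris, arbitrary seed)
iris = load_iris(as_frame=True)
X = iris.data
y = (iris.target == 2).map({True: 'virginica', False: 'not virginica'})
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=123)

rows = []
for k in range(1, 21):
    # fit on train only, then score on both train and validate
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    rows.append({
        'k': k,
        'train_accuracy': model.score(X_train, y_train),
        'validate_accuracy': model.score(X_validate, y_validate),
    })

results = pd.DataFrame(rows)
# a large gap between train and validate suggests overfitting
results['difference'] = results.train_accuracy - results.validate_accuracy
print(results)
```

The test split is never touched in this loop; it stays reserved for the single final evaluation.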

#### knn - 5 neighbors

In [104]:
#first model accuracy
knn.score(X_train, y_train)

0.9777777777777777

In [105]:
#compare to validate to check for overfitting
knn.score(X_validate, y_validate)

0.9333333333333333

#### knn - 10 neighbors

In [106]:
#compare my second model
knn2.score(X_train, y_train)

0.9666666666666667

In [107]:
#compare to validate
knn2.score(X_validate, y_validate)

1.0

#### logistic regression

In [108]:
lr.score(X_train, y_train)

0.9777777777777777

In [109]:
lr.score(X_validate, y_validate)

0.9666666666666667

## Evaluate BEST model using Test dataset
ONLY GOING TO EVALUATE ONE TIME!


In [110]:
#selected knn with 10 neighbors as the best model: highest validate accuracy (1.0)

In [111]:
knn2.score(X_test, y_test)

0.9666666666666667

> in the real world, i can expect my model to accurately predict virginica or not 97% of the time