# KNN Model Exercises

1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

2. Evaluate your results using the model score, confusion matrix, and classification report.

3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

4. Run through steps 1-3 setting k to 10

5. Run through steps 1-3 setting k to 20

6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

7. Which model performs best on our out-of-sample data from validate?

In [2]:
# Importing modules

# ds libs for tab data
import pandas as pd
import numpy as np

# data viz
import matplotlib.pyplot as plt
import seaborn as sns

# knn modules - models
from sklearn.metrics import classification_report, confusion_matrix, plot_confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# data to use
from acquire import get_titanic_data
from prepare import prep_titanic, split_data

### Setup

In [3]:
# getting data and stacking all of the changes on top of each other

train, validate, test = split_data(prep_titanic(get_titanic_data()), target = 'survived')

In [8]:
features = [col for col in train if train[col].dtype != 'O']
features.remove('survived')

In [52]:
# assigning feature
X_train = train[features]
X_val = validate[features]
X_test = test[features]

In [11]:
# assigning targets
y_train = train.survived
y_val = validate.survived
y_test = test.survived

### Work

> #### 1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

In [14]:
# creating
knn = KNeighborsClassifier()

# fiting
knn.fit(X_train, y_train)

# use
y_pred = knn.predict(X_train)

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


> #### 2. Evaluate your results using the model score, confusion matrix, and classification report.

In [16]:
knn.score(X_train, y_train)

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


0.785140562248996

In [17]:
# confusion matrix
confusion_matrix(y_train, y_pred)

array([[264,  43],
       [ 64, 127]])

In [19]:
# crosstab so I can make sense of the sonfucion matrix
pd.crosstab(y_train, y_pred)

col_0,0,1
survived,Unnamed: 1_level_1,Unnamed: 2_level_1
0,264,43
1,64,127


In [21]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.86      0.83       307
           1       0.75      0.66      0.70       191

    accuracy                           0.79       498
   macro avg       0.78      0.76      0.77       498
weighted avg       0.78      0.79      0.78       498



> #### 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [29]:
tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()

In [34]:
accuracy = (tp + tn) / (tp + tn + fn + fp)

tp_rate =  tp / (tp + fn)

fp_rate = fp / (fp + tn)

tn_rate = tn / (tn + fp)

fn_rate = fn / (tp + fn)

precision_ = tp / (tp + fp)

recall = tp / (tp + fn)

f1 = 2 * (precision_ * recall) / (precision_ + recall)

In [36]:
 print(f'''
        
        Model: {knn}
        Accuracy: {accuracy:.2%}
        True Postive Rate: {tp_rate:.2%}
        False Positive Rate: {fp_rate:.2%}
        True Negative Rate: {tn_rate:.2%}
        False Negative Rate: {fn_rate:.2%}
        Precision: {precision_:.2%}
        Recall: {recall:.2%}
        F1: {f1:.2%}
        Validation Set Classification Report:
        
        {classification_report(y_train, y_pred)}
        ''')


       
       Model: KNeighborsClassifier()
       Accuracy: 78.51%
       True Postive Rate: 66.49%
       False Positive Rate: 14.01%
       True Negative Rate: 85.99%
       False Negative Rate: 33.51%
       Precision: 74.71%
       Recall: 66.49%
       F1: 70.36%
       Validation Set Classification Report:
       
                     precision    recall  f1-score   support

           0       0.80      0.86      0.83       307
           1       0.75      0.66      0.70       191

    accuracy                           0.79       498
   macro avg       0.78      0.76      0.77       498
weighted avg       0.78      0.79      0.78       498

       


> #### 4. Run through steps 1-3 setting k to 10

In [39]:
# create
knn10 = KNeighborsClassifier(n_neighbors=10)

# fit
knn10.fit(X_train,y_train )


# prediction
y_preds10 = knn10.predict(X_train)

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


In [46]:
tn, fp, fn, tp = confusion_matrix(y_train, y_preds10).ravel()

accuracy = (tp + tn) / (tp + tn + fn + fp)

tp_rate =  tp / (tp + fn)

fp_rate = fp / (fp + tn)

tn_rate = tn / (tn + fp)

fn_rate = fn / (tp + fn)

precision_ = tp / (tp + fp)

recall = tp / (tp + fn)

f1 = 2 * (precision_ * recall) / (precision_ + recall)

In [50]:
 print(f'''
        
        Model: {knn}
        Accuracy: {accuracy:.2%}
        True Postive Rate: {tp_rate:.2%}
        False Positive Rate: {fp_rate:.2%}
        True Negative Rate: {tn_rate:.2%}
        False Negative Rate: {fn_rate:.2%}
        Precision: {precision_:.2%}
        Recall: {recall:.2%}
        F1: {f1:.2%}
        Validation Set Classification Report:
        
        {classification_report(y_train, y_preds10)}
        ''')


       
       Model: KNeighborsClassifier()
       Accuracy: 72.09%
       True Postive Rate: 42.41%
       False Positive Rate: 9.45%
       True Negative Rate: 90.55%
       False Negative Rate: 57.59%
       Precision: 73.64%
       Recall: 42.41%
       F1: 53.82%
       Validation Set Classification Report:
       
                     precision    recall  f1-score   support

           0       0.75      0.91      0.82       307
           1       0.77      0.52      0.62       191

    accuracy                           0.76       498
   macro avg       0.76      0.71      0.72       498
weighted avg       0.76      0.76      0.74       498

       


> #### 5. Run through steps 1-3 setting k to 20

In [44]:
# create
knn20 = KNeighborsClassifier(n_neighbors=20)

# fit
knn20.fit(X_train,y_train)


# prediction
y_preds20 = knn20.predict(X_train)

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


In [49]:
tn, fp, fn, tp = confusion_matrix(y_train, y_preds20).ravel()

accuracy = (tp + tn) / (tp + tn + fn + fp)

tp_rate =  tp / (tp + fn)

fp_rate = fp / (fp + tn)

tn_rate = tn / (tn + fp)

fn_rate = fn / (tp + fn)

precision_ = tp / (tp + fp)

recall = tp / (tp + fn)

f1 = 2 * (precision_ * recall) / (precision_ + recall)

In [51]:
 print(f'''
        
        Model: {knn}
        Accuracy: {accuracy:.2%}
        True Postive Rate: {tp_rate:.2%}
        False Positive Rate: {fp_rate:.2%}
        True Negative Rate: {tn_rate:.2%}
        False Negative Rate: {fn_rate:.2%}
        Precision: {precision_:.2%}
        Recall: {recall:.2%}
        F1: {f1:.2%}
        Validation Set Classification Report:
        
        {classification_report(y_train, y_preds20)}
        ''')


       
       Model: KNeighborsClassifier()
       Accuracy: 72.09%
       True Postive Rate: 42.41%
       False Positive Rate: 9.45%
       True Negative Rate: 90.55%
       False Negative Rate: 57.59%
       Precision: 73.64%
       Recall: 42.41%
       F1: 53.82%
       Validation Set Classification Report:
       
                     precision    recall  f1-score   support

           0       0.72      0.91      0.80       307
           1       0.74      0.42      0.54       191

    accuracy                           0.72       498
   macro avg       0.73      0.66      0.67       498
weighted avg       0.72      0.72      0.70       498

       


> #### 6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

In [59]:
neighbors = [1, 10, 20]
knn_dict = {}

for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    knn
    knn_dict[f'{i} neighbors'] = {'Train Score': round(knn.score(X_train, y_train), 2), 'Validate Score': round(knn.score(X_val, y_val), 2), 'Difference': round(knn.score(X_train, y_train) - knn.score(X_val, y_val), 2) }

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


In [68]:
for i in neighbors:
    print(knn_dict[f'{i} neighbors']['Train Score'])
    
# the highest score for the first ds is the nearest neighbor of 1
# Why? I'm not sure, maybe overfitting? becasue there is a huges difference on the validat eset

0.99
0.76
0.72


> #### 7. Which model performs best on our out-of-sample data from validate?

In [70]:
for i in neighbors:
    print(knn_dict[f'{i} neighbors']['Validate Score'])
    
# the model with 10 performs the best 

0.68
0.68
0.67


> ### 1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)