# K Nearest Neighbors Modeling

## Objectives
* Load data
* Tune hyperparameters for each version of the data
* Select a model
* Examine results
* Save results

#### Load data

I will start by importing the necessary libraries.

In [11]:
import pickle # loading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from imblearn.under_sampling import NearMiss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, accuracy_score, roc_curve, auc, precision_recall_curve, confusion_matrix
from sklearn.model_selection import GridSearchCV
import time
from modeling_functions import *

import warnings
warnings.filterwarnings("ignore") 

Now I will load the data.

In [2]:
# Load data
with open("olr_keys_n_components.pickle", "rb") as pickle_in:
    olr_keys_n_components = list(pickle.load(pickle_in))

# Sanity Check
print(olr_keys_n_components)

[('s1_r1_o1', 16), ('s1_r1_o3', 16), ('s1_r2_o1', 16), ('s1_r2_o3', 15), ('s2_r1_o1', 24), ('s2_r1_o3', 25), ('s2_r2_o1', 19), ('s2_r2_o3', 20), ('s3_r1_o1', 23), ('s3_r1_o3', 24), ('s3_r2_o1', 19), ('s3_r2_o3', 18), ('s4_r1_o1', 21), ('s4_r1_o3', 22), ('s4_r2_o1', 16), ('s4_r2_o3', 18)]


In [3]:
# Load data
with open("cleaned_data.pickle", "rb") as pickle_in:
    clean_data = pickle.load(pickle_in)

X = clean_data['X']
y = clean_data['y']

#### Tune hyperparameters

My goal for this project is to create a model that can help alert a credit lender to suspicious activity. <br><br>

For this reason I want few false negatives, so I will use recall as my main metric. High recall means that few fraudulent transactions go undetected. <br><br>

My second metric will be precision, because I do not want many false positives either. Low precision would cause the model to flag too much of the data as likely fraudulent. If the credit lender took preventative action on, say, every other transaction, that would be a nuisance to both the credit lender and its clients. <br><br>

However, precision does not need to be nearly as high as recall. If recall were, say, 80%, then I would potentially have stopped 80% of fraud. If precision were, say, 20%, then fewer than 1 out of 100 transactions would be flagged as suspected fraud, because fraud accounts for only 0.17% of the dataset I was given (0.80 × 0.17% / 0.20 ≈ 0.68% of all transactions). <br><br>
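The arithmetic behind that estimate can be checked with a short sketch. The variable names are my own, and the 80% recall and 20% precision are the illustrative values from the paragraph above, not measured results.

```python
# Illustrative values from the text, not measured results
fraud_rate = 0.0017   # fraud is about 0.17% of all transactions
recall = 0.80         # share of fraudulent transactions that get flagged
precision = 0.20      # share of flagged transactions that are truly fraudulent

# True positives as a share of ALL transactions
true_positive_share = recall * fraud_rate

# precision = true positives / flagged, so flagged = true positives / precision
flagged_share = true_positive_share / precision

print(f"Flagged share of all transactions: {flagged_share:.2%}")
```

Under these assumptions the flagged share stays well below 1 in 100 transactions, even with precision as low as 20%.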

F1-score is the harmonic mean of recall and precision. It is not the best metric here, though, because recall must be high while precision can get away with being much lower.<br><br>
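A minimal sketch of why F1 fits poorly here: the harmonic mean treats the two metrics symmetrically, so it cannot tell "high recall, low precision" apart from the reverse. The helper name `f1_from` is hypothetical, and the inputs are the illustrative 80% and 20% figures used above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The case I am aiming for: high recall, low precision
f1_high_recall = f1_from(precision=0.20, recall=0.80)

# The reverse case, which would miss most fraud
f1_high_precision = f1_from(precision=0.80, recall=0.20)

# Both round to the same score, so F1 alone cannot separate them
print(round(f1_high_recall, 2), round(f1_high_precision, 2))
```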

The metrics mentioned above are calculated by comparing the known values to the model's predicted values. The simplified formulas for precision and recall are shown below. 
<img src="../Images/Precision_Recall.png"><br>
Image Source: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
<br><br>
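In words, precision is true positives divided by everything the model flagged, and recall is true positives divided by everything that was actually positive. The sketch below computes both from plain Python lists. The function name and the toy labels are made up for illustration; the notebook itself relies on the scikit-learn metric functions imported at the top.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels (1 = fraud, 0 = legitimate), invented for this example
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

# 3 true positives, 2 false positives, 1 false negative
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```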

To tune the hyperparameters I will use my own function, customGridSearchCV, which has a docstring attached. The function transforms the data according to its arguments and returns the cross-validation scores for each method as well as for each combination of parameters.

The parameter grid in the cell below was inspired by the <a href=https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets>top-rated Kaggle post</a>, which I cite as Source 1.

In [7]:
# Reference the K nearest neighbors classifier class (not instantiated here)
clf = KNeighborsClassifier

# Create parameter grid (Source 1)
params = {
    "n_neighbors": list(range(3,10,2)), 
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}
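As a rough sketch of what searching this grid involves, the snippet below expands an identical grid into every parameter combination using only the standard library. This is an assumption about how a grid search typically enumerates candidates, not a description of how customGridSearchCV is implemented.

```python
from itertools import product

# Same grid as in the cell above
params = {
    "n_neighbors": list(range(3, 10, 2)),
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}

# Sort the keys so the enumeration order is deterministic
keys = sorted(params)
combos = [
    dict(zip(keys, values))
    for values in product(*(params[key] for key in keys))
]

print(len(combos))  # 4 algorithms x 4 neighbor counts
print(combos[0])
```

Each combination is then cross-validated once per scaler, which explains the long run times of the tuning cells.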

Warning: The following cell took 20 minutes to run!

In [8]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# No PCA

t1 = time.time()

# Record results
results_o1 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss())
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9302856835711791 
 precision: 0.004979637923784965 
 f1-score: 0.009882896639106551 


 {'algorithm': 'auto', 'n_neighbors': 5} 
 recall 0.9027923351877233 
 precision: 0.007202711757391823 
 f1-score: 0.014212882589033221 


 {'algorithm': 'auto', 'n_neighbors': 7} 
 recall 0.8858878228385606 
 precision: 0.011190741318014385 
 f1-score: 0.021823881698992176 


 {'algorithm': 'auto', 'n_neighbors': 9} 
 recall 0.871093015130748 
 precision: 0.02292080918390552 
 f1-score: 0.04298733589887324 


 {'algorithm': 'ball_tree', 'n_neighbors': 3} 
 recall 0.9302856835711791 
 precision: 0.004979637923784965 
 f1-score: 0.009882896639106551 


 {'algorithm': 'ball_tree', 'n_neighbors': 5} 
 recall 0.9027923351877233 
 precision: 0.007202711757391823 
 f1-score: 0.014212882589033221 


 {'algorithm': 'ball_tree', 'n_neighbors': 7} 
 recall 0.8858878228385606 
 pre

Warning: The following cell took 20 minutes to run!

In [9]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# No PCA

t1 = time.time()

# Record results
results_o3 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), outlier_removal=True)
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9302856835711791 
 precision: 0.005367727655402031 
 f1-score: 0.010651056133316834 


 {'algorithm': 'auto', 'n_neighbors': 5} 
 recall 0.9006691929371926 
 precision: 0.007786736951632391 
 f1-score: 0.015361833634364863 


 {'algorithm': 'auto', 'n_neighbors': 7} 
 recall 0.8795183960869682 
 precision: 0.012411212766093997 
 f1-score: 0.02419987482535109 


 {'algorithm': 'auto', 'n_neighbors': 9} 
 recall 0.8668467306296863 
 precision: 0.026443058428713802 
 f1-score: 0.049410074948044114 


 {'algorithm': 'ball_tree', 'n_neighbors': 3} 
 recall 0.9302856835711791 
 precision: 0.005367727655402031 
 f1-score: 0.010651056133316834 


 {'algorithm': 'ball_tree', 'n_neighbors': 5} 
 recall 0.9006691929371926 
 precision: 0.007786736951632391 
 f1-score: 0.015361833634364863 


 {'algorithm': 'ball_tree', 'n_neighbors': 7} 
 recall 0.8795183960869682 
 p

Warning: The following cell took 20 minutes to run!

In [12]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o1_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o1' in key:
        o1_n_components.append(n)
        

# Record results
results_o1_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o1_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9302856835711791 
 precision: 0.004282428530172016 
 f1-score: 0.008507871073086166 


 {'algorithm': 'auto', 'n_neighbors': 5} 
 recall 0.9070251820796044 
 precision: 0.0062918449543638766 
 f1-score: 0.012441181486201727 


 {'algorithm': 'auto', 'n_neighbors': 7} 
 recall 0.8943535166223225 
 precision: 0.009609835926880283 
 f1-score: 0.01882796464034441 


 {'algorithm': 'auto', 'n_neighbors': 9} 
 recall 0.8816684135558601 
 precision: 0.018152282495877692 
 f1-score: 0.034580263210354435 


 {'algorithm': 'ball_tree', 'n_neighbors': 3} 
 recall 0.9302856835711791 
 precision: 0.004282428530172016 
 f1-score: 0.008507871073086166 


 {'algorithm': 'ball_tree', 'n_neighbors': 5} 
 recall 0.9070251820796044 
 precision: 0.0062918449543638766 
 f1-score: 0.012441181486201727 


 {'algorithm': 'ball_tree', 'n_neighbors': 7} 
 recall 0.8943535166223225 


Warning: The following cell took 20 minutes to run!

In [13]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o3_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o3' in key:
        o3_n_components.append(n)
        

# Record results
results_o3_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o3_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), 
                                                  outlier_removal=True, pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9281759789298288 
 precision: 0.00492936332847839 
 f1-score: 0.009788907730652746 


 {'algorithm': 'auto', 'n_neighbors': 5} 
 recall 0.9070386196887851 
 precision: 0.0069656386981167785 
 f1-score: 0.013768849574137305 


 {'algorithm': 'auto', 'n_neighbors': 7} 
 recall 0.8858609476201994 
 precision: 0.01072662056196094 
 f1-score: 0.021011099193867105 


 {'algorithm': 'auto', 'n_neighbors': 9} 
 recall 0.8752989868042679 
 precision: 0.02107209893834623 
 f1-score: 0.040013032886252416 


 {'algorithm': 'ball_tree', 'n_neighbors': 3} 
 recall 0.9281759789298288 
 precision: 0.00492936332847839 
 f1-score: 0.009788907730652746 


 {'algorithm': 'ball_tree', 'n_neighbors': 5} 
 recall 0.9070386196887851 
 precision: 0.0069656386981167785 
 f1-score: 0.013768849574137305 


 {'algorithm': 'ball_tree', 'n_neighbors': 7} 
 recall 0.8858609476201994 
 pr

#### Model selection

Now I will look through each model's scores manually and decide which one performs best.

My choice is the model that uses data scaled with StandardScaler, with outliers not removed and with PCA applied, because it had a high cross-validated recall of ~88%, although unfortunately a low precision of ~2%.
The model's parameters are as follows: <br> <br>
algorithm: 'auto'<br>
n_neighbors: 9
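The manual comparison can also be written as a simple rule: keep only candidates whose recall clears a floor, then take the one with the highest precision. The sketch below applies that rule to the `'auto'` rows printed by cell In [12] (scaler s1, outliers kept, PCA applied), rounded to four decimals. The 0.85 recall floor and all variable names are hypothetical choices for illustration.

```python
# Mean cross-validated scores copied (rounded) from the In [12] output
cv_scores = [
    {"n_neighbors": 3, "recall": 0.9303, "precision": 0.0043},
    {"n_neighbors": 5, "recall": 0.9070, "precision": 0.0063},
    {"n_neighbors": 7, "recall": 0.8944, "precision": 0.0096},
    {"n_neighbors": 9, "recall": 0.8817, "precision": 0.0182},
]

MIN_RECALL = 0.85  # hypothetical floor, not a value taken from the notebook

# Recall is the main metric, so filter on it first
eligible = [row for row in cv_scores if row["recall"] >= MIN_RECALL]

# Among acceptable candidates, prefer the highest precision
best = max(eligible, key=lambda row: row["precision"])
print(best)
```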

##### Final Model

Now I will check the scores at each split of the model to make sure it is not overfitting.

In [38]:
# Best model
# Print cross val scores
for key, n in olr_keys_n_components:
    if key == 's1_r2_o1':  # match the 's1' scaler passed to customCV below
        n_components = n

model = KNeighborsClassifier(algorithm='auto', n_neighbors=9)
customCV(model, X, y, 's1', NearMiss(), outlier_removal=False,
         pca=PCA(n_components), print_splits=True)

split 1
recall: 0.8987
precision: 0.1276
f1: 0.2234
split 2
recall: 0.7532
precision: 0.7212
f1: 0.7368
Mean Scores:
Mean recall: 0.8054
Mean precision: 0.5627
Mean f1: 0.5868 



[0.8054099814560992, 0.5626520230293816, 0.5867627368973181]

Looks good.

#### Save Data

Now I will save the model along with a string that represents the transformations applied to the data.

In [40]:
# Save the KNN model under its own name (the original path pointed at the SVM file)
with open("Models/KNearestNeighbors.pickle", "wb") as pickle_out:
    pickle.dump([model, 's1_r2_o1, pca'+str(n_components)], pickle_out)
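To use the saved pair later, it can be read back and unpacked in one step. The sketch below shows the round trip in a temporary directory. The placeholder string stands in for the fitted model, and the label assumes `n_components = 16`, taken from the `('s1_r2_o1', 16)` entry printed earlier; both are illustrative.

```python
import os
import pickle
import tempfile

# Placeholder payload: in the notebook the first item is the fitted model
n_components = 16  # assumed, from the ('s1_r2_o1', 16) entry
payload = ["<model placeholder>", "s1_r2_o1, pca" + str(n_components)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")

    # Write the [model, label] pair
    with open(path, "wb") as pickle_out:
        pickle.dump(payload, pickle_out)

    # Read it back and unpack the two items
    with open(path, "rb") as pickle_in:
        loaded_model, loaded_label = pickle.load(pickle_in)

print(loaded_label)
```

Keeping the transformation label next to the model lets a later notebook know which scaling, resampling, and PCA steps to repeat before predicting.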

## Sources
1) Top-rated Kaggle post: https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets