# K Nearest Neighbors Modeling

## Objectives
* Load data
* Tune hyperparameters for each version of the data
* Select a model
* Examine results
* Save results

#### Load data

I will start by importing the necessary libraries.

In [2]:
import pickle # loading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from imblearn.under_sampling import NearMiss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, accuracy_score, roc_curve, auc, precision_recall_curve, confusion_matrix
from sklearn.model_selection import GridSearchCV
import time
from modeling_functions import *

import warnings
warnings.filterwarnings("ignore") 

Now I will load the data.

## Pickle

In [4]:
# Load data (the with-block closes the file automatically)
with open("olr_keys_n_components.pickle", "rb") as pickle_in:
    olr_keys_n_components = list(pickle.load(pickle_in))

# Sanity Check
print(olr_keys_n_components)

[('s1_r1_o1', 16), ('s1_r1_o3', 16), ('s1_r2_o1', 16), ('s1_r2_o3', 15), ('s2_r1_o1', 24), ('s2_r1_o3', 25), ('s2_r2_o1', 19), ('s2_r2_o3', 20), ('s3_r1_o1', 23), ('s3_r1_o3', 24), ('s3_r2_o1', 19), ('s3_r2_o3', 18), ('s4_r1_o1', 21), ('s4_r1_o3', 22), ('s4_r2_o1', 16), ('s4_r2_o3', 18)]
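
Later cells pick out component counts by matching part of each key. Here is a small sketch of that filtering, using the list printed above. The helper name `components_for` is my own and is not part of `modeling_functions`.

```python
# The (key, n_components) pairs printed above
olr_keys_n_components = [
    ('s1_r1_o1', 16), ('s1_r1_o3', 16), ('s1_r2_o1', 16), ('s1_r2_o3', 15),
    ('s2_r1_o1', 24), ('s2_r1_o3', 25), ('s2_r2_o1', 19), ('s2_r2_o3', 20),
    ('s3_r1_o1', 23), ('s3_r1_o3', 24), ('s3_r2_o1', 19), ('s3_r2_o3', 18),
    ('s4_r1_o1', 21), ('s4_r1_o3', 22), ('s4_r2_o1', 16), ('s4_r2_o3', 18),
]


def components_for(suffix, pairs):
    """Return n_components for every key containing `suffix`, in list order."""
    return [n for key, n in pairs if suffix in key]


o1_n_components = components_for('r2_o1', olr_keys_n_components)
o3_n_components = components_for('r2_o3', olr_keys_n_components)
print(o1_n_components)  # [16, 19, 19, 16]
print(o3_n_components)  # [15, 20, 18, 18]
```

Because the list is ordered s1 to s4, the filtered values line up with the scaler strings when zipped.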


In [3]:
# Load data (the with-block closes the file automatically)
with open("../clean_data.pickle", "rb") as pickle_in:
    df = pickle.load(pickle_in)

# Separate X and y
X = df.drop('Class', axis=1)
y = df.Class

df.head()


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


## Tune hyperparameters

My goal for this project is to create a model that can help alert a credit lender to suspicious activity. <br><br>

For this reason I want few false negatives, so recall will be my main metric. High recall means that few fraudulent transactions go undetected. <br><br>

My second metric will be precision, because I do not want false positives either. Low precision would cause the model to flag too much of the data as likely fraud. If the credit lender took preventative action on, say, every other transaction, that would be a nuisance to both the lender and its clients. <br><br>

However, precision does not need to be nearly as high as recall. If recall were, say, 80%, the model would potentially have stopped 80% of fraud, and if precision were, say, 20%, fewer than 1 in 100 transactions would be flagged as suspicious, because fraud accounts for only 0.17% of the data I was given. <br><br>
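
The arithmetic behind that estimate can be checked with a short sketch. The recall and precision values are the hypothetical ones from the paragraph above, not model results, and the function name is my own.

```python
def flagged_fraction(fraud_rate, recall, precision):
    """Fraction of all transactions that would be flagged as suspicious.

    caught fraud (true positives) = fraud_rate * recall
    everything flagged            = true positives / precision
    """
    true_positive_fraction = fraud_rate * recall
    return true_positive_fraction / precision


# Hypothetical scores: 80% recall and 20% precision,
# with fraud making up 0.17% of the data
share = flagged_fraction(fraud_rate=0.0017, recall=0.80, precision=0.20)
print(f"{share:.2%} of transactions flagged")
```

Under these assumptions roughly 0.68% of transactions are flagged, which is below the 1 in 100 mark.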

F1-score is the harmonic mean of recall and precision. It is not the best metric here, though, because recall must be high while precision can afford to be much lower.<br><br>
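
To illustrate why F1 alone can mislead here, the sketch below compares two made-up score pairs. Both pairs are hypothetical and are not results from this notebook.

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision (0.0 if both are zero)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# A model that catches most fraud but raises many false alarms
high_recall = f1(recall=0.80, precision=0.20)
# A model that is balanced but misses half of the fraud
balanced = f1(recall=0.50, precision=0.50)
print(round(high_recall, 2), round(balanced, 2))
```

The balanced model gets the higher F1 (0.5 versus 0.32) even though it would let half of the fraud through, which is why I rank recall first.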

The metrics mentioned above are calculated by comparing the known values to the model's predicted values. The simplified formulas for precision and recall are shown below. 
<img src="../Images/Precision_Recall.png"><br>
Image Source: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
<br><br>
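
In text form, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP and FN are the counts of true positives, false positives and false negatives. Here is a minimal sketch with made-up counts. The numbers are illustrative only.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up counts: 100 fraud cases, 80 of them caught, 320 false alarms
precision, recall = precision_recall(tp=80, fp=320, fn=20)
print(precision, recall)
```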

To tune the hyperparameters I will use my own function, customGridSearchCV, which has a docstring attached. The function transforms the data according to its parameters and returns the cross-validation scores for each method and for each combination of parameters.
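
The internals of customGridSearchCV are not shown in this notebook, so the following is only a rough sketch of the general idea and not that function. It assumes the scaler is fit and the resampling is applied on the training fold only, with scoring on the untouched validation fold. Plain random undersampling stands in for NearMiss, synthetic data stands in for the credit card set, and all function names are my own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def undersample(X, y, rng):
    """Randomly drop majority-class (0) rows until both classes are equal in size."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]


def cv_scores(X, y, n_neighbors, n_splits=3, seed=0):
    """Mean recall and precision over stratified folds."""
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    recalls, precisions = [], []
    for train_idx, val_idx in folds.split(X, y):
        # Fit the scaler on the training fold only, then apply it to both folds
        scaler = StandardScaler().fit(X[train_idx])
        X_train, y_train = undersample(scaler.transform(X[train_idx]), y[train_idx], rng)
        X_val = scaler.transform(X[val_idx])

        model = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
        pred = model.predict(X_val)
        # Score on the validation fold, which keeps its original class balance
        recalls.append(recall_score(y[val_idx], pred))
        precisions.append(precision_score(y[val_idx], pred, zero_division=0))
    return float(np.mean(recalls)), float(np.mean(precisions))


# Synthetic imbalanced data (about 95% class 0) standing in for the real set
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95], random_state=0)
scores = {k: cv_scores(X_demo, y_demo, n_neighbors=k) for k in (3, 7, 11)}
for k, (rec, prec) in scores.items():
    print(k, round(rec, 3), round(prec, 3))
```

Resampling inside the loop keeps the validation folds at their original class balance, so the precision reported reflects the imbalance the model would face in practice.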

The parameter grid in the cell below was inspired by the <a href=https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets>top-rated Kaggle post</a>, which I cite as Source 1.

## KNeighborsClassifier

In [8]:
# KNeighborsClassifier class (passed uninstantiated to customGridSearchCV)
clf = KNeighborsClassifier

# Create parameter grid (Source 1)
params = {
    "n_neighbors": list(range(3,32,4)), 
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}
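
To get a feel for the size of this grid, the combinations can be enumerated with `itertools.product`. This is only an illustration of the grid itself and makes no claim about how customGridSearchCV iterates over it.

```python
from itertools import product

params = {
    "n_neighbors": list(range(3, 32, 4)),
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}

# Build one dict per combination of parameter values
keys = list(params)
combinations = [dict(zip(keys, values))
                for values in product(*(params[k] for k in keys))]

print(params["n_neighbors"])  # [3, 7, 11, 15, 19, 23, 27, 31]
print(len(combinations))      # 32
```

Eight neighbor counts times four algorithms gives 32 candidate settings per scaler, which goes some way toward explaining the long run times below.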

Warning: The following cell took 20 minutes to run!

In [9]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# No PCA

t1 = time.time()

# Record results
results_o1 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss())
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'algorithm': 'auto', 'n_neighbors': 11} 
 recall 0.8732027197720983 
 precision: 0.046128550767900824 
 f1-score: 0.08029813406819665 


 {'algorithm': 'auto', 'n_neighbors': 15} 
 recall 0.8499556558897042 
 precision: 0.15717900418549108 
 f1-score: 0.20274354337932554 


 {'algorithm': 'auto', 'n_neighbors': 19} 
 recall 0.8330242683221801 
 precision: 0.21600960505432273 
 f1-score: 0.25304347538958233 


 {'algorithm': 'auto', 'n_neighbors': 23} 
 recall 0.8245585745384182 
 precision: 0.24326832248412467 
 f1-score: 0.2850828264177117 


 {'algorithm': 'auto', 'n_neighbors': 27} 
 recall 0.8096965787847026 
 precision: 0.2659179588068615 
 f1-score: 0.31450685742664375 


 {'algorithm': 'auto', 'n_neighbors': 31} 
 recall 0.7991211803595905 
 precision: 0.28799612457905105 
 f1-score: 0.34488480373061847 


 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9302856835711791 
 precision: 0.004979637


 {'algorithm': 'auto', 'n_neighbors': 11} 
 recall 0.8816146631191378 
 precision: 0.018919840122573913 
 f1-score: 0.03700589853535074 


 {'algorithm': 'auto', 'n_neighbors': 15} 
 recall 0.8646967131607944 
 precision: 0.027701438079800838 
 f1-score: 0.05355556792397753 


 {'algorithm': 'auto', 'n_neighbors': 19} 
 recall 0.8520250477035126 
 precision: 0.0382916482969183 
 f1-score: 0.07287315128331713 


 {'algorithm': 'auto', 'n_neighbors': 23} 
 recall 0.8478056384208122 
 precision: 0.04888062498886867 
 f1-score: 0.0915534653160082 


 {'algorithm': 'auto', 'n_neighbors': 27} 
 recall 0.8435593539197507 
 precision: 0.06144939810931172 
 f1-score: 0.1134512268758469 


 {'algorithm': 'auto', 'n_neighbors': 31} 
 recall 0.8329570802762772 
 precision: 0.07228955396714769 
 f1-score: 0.1316298856614221 


 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9197102851460667 
 precision: 0.004919447873318995 
 f1-score: 0.009785891007103696 


 {'algorithm': 'auto', 'n_neighbor

Warning: The following cell took 20 minutes to run!

In [10]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# No PCA

t1 = time.time()

# Record results
results_o3 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), outlier_removal=True)
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'algorithm': 'auto', 'n_neighbors': 11} 
 recall 0.8626004461286249 
 precision: 0.05548920050148132 
 f1-score: 0.09531569756972384 


 {'algorithm': 'auto', 'n_neighbors': 15} 
 recall 0.8393399446370502 
 precision: 0.18016129290049343 
 f1-score: 0.23148365386581127 


 {'algorithm': 'auto', 'n_neighbors': 19} 
 recall 0.8224219946787068 
 precision: 0.24847730807710458 
 f1-score: 0.29787131723919075 


 {'algorithm': 'auto', 'n_neighbors': 23} 
 recall 0.8054502942836411 
 precision: 0.2871979177815153 
 f1-score: 0.34564481713767115 


 {'algorithm': 'auto', 'n_neighbors': 27} 
 recall 0.796957725281518 
 precision: 0.33658538877143523 
 f1-score: 0.3986721837159497 


 {'algorithm': 'auto', 'n_neighbors': 31} 
 recall 0.7778897578542825 
 precision: 0.39508031417222794 
 f1-score: 0.4456185172108724 


 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9302856835711791 
 precision: 0.005367727655


 {'algorithm': 'auto', 'n_neighbors': 11} 
 recall 0.879491520868607 
 precision: 0.01929636031386868 
 f1-score: 0.03771719398966893 


 {'algorithm': 'auto', 'n_neighbors': 15} 
 recall 0.8646967131607944 
 precision: 0.029430833446908954 
 f1-score: 0.05672491131888765 


 {'algorithm': 'auto', 'n_neighbors': 19} 
 recall 0.8520250477035126 
 precision: 0.04148760858454986 
 f1-score: 0.07848997941101915 


 {'algorithm': 'auto', 'n_neighbors': 23} 
 recall 0.8478056384208122 
 precision: 0.057791151720300714 
 f1-score: 0.10676481865975658 


 {'algorithm': 'auto', 'n_neighbors': 27} 
 recall 0.8414362116692198 
 precision: 0.08114252516428846 
 f1-score: 0.14443692310191308 


 {'algorithm': 'auto', 'n_neighbors': 31} 
 recall 0.826614528743046 
 precision: 0.10424550760464008 
 f1-score: 0.1790586904439715 


 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9197102851460667 
 precision: 0.004926487729531081 
 f1-score: 0.009799812341771068 


 {'algorithm': 'auto', 'n_neighbo

Warning: The following cell took 20 minutes to run!

In [11]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o1_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o1' in key:
        o1_n_components.append(n)
        

# Record results
results_o1_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o1_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'algorithm': 'auto', 'n_neighbors': 11} 
 recall 0.8732027197720983 
 precision: 0.03611925242542257 
 f1-score: 0.06482982031255329 


 {'algorithm': 'auto', 'n_neighbors': 15} 
 recall 0.854161627563224 
 precision: 0.13309234454823787 
 f1-score: 0.18120008090541193 


 {'algorithm': 'auto', 'n_neighbors': 19} 
 recall 0.8414765244967616 
 precision: 0.20764572821860197 
 f1-score: 0.2447229908159638 


 {'algorithm': 'auto', 'n_neighbors': 23} 
 recall 0.8266817167889489 
 precision: 0.23952565444911203 
 f1-score: 0.27723653556136263 


 {'algorithm': 'auto', 'n_neighbors': 27} 
 recall 0.816066005536295 
 precision: 0.256375677819868 
 f1-score: 0.3006649976518561 


 {'algorithm': 'auto', 'n_neighbors': 31} 
 recall 0.8012308850009408 
 precision: 0.2705938382252346 
 f1-score: 0.32368941696354525 


 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9302856835711791 
 precision: 0.004282428530172


 {'algorithm': 'auto', 'n_neighbors': 11} 
 recall 0.890107232121261 
 precision: 0.013408069499108763 
 f1-score: 0.02635966428936431 


 {'algorithm': 'auto', 'n_neighbors': 15} 
 recall 0.8752721115859066 
 precision: 0.022486398141853187 
 f1-score: 0.04365625931347996 


 {'algorithm': 'auto', 'n_neighbors': 19} 
 recall 0.8562713322045742 
 precision: 0.03280780854035482 
 f1-score: 0.06269390522310385 


 {'algorithm': 'auto', 'n_neighbors': 23} 
 recall 0.8520384853126931 
 precision: 0.04335969589862662 
 f1-score: 0.0816004298842939 


 {'algorithm': 'auto', 'n_neighbors': 27} 
 recall 0.8478056384208122 
 precision: 0.05527000784446076 
 f1-score: 0.10249765517524421 


 {'algorithm': 'auto', 'n_neighbors': 31} 
 recall 0.8478056384208122 
 precision: 0.06586766316271299 
 f1-score: 0.12054161617819899 


 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9366282351044103 
 precision: 0.0036279373647990026 
 f1-score: 0.007225264814055829 


 {'algorithm': 'auto', 'n_neigh

Warning: The following cell took 20 minutes to run!

In [12]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o3_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o3' in key:
        o3_n_components.append(n)
        

# Record results (separate dict so the no-outlier-removal PCA results are not overwritten)
results_o3_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o3_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), 
                                                  outlier_removal=True, pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'algorithm': 'auto', 'n_neighbors': 11} 
 recall 0.8626004461286249 
 precision: 0.04493753450019084 
 f1-score: 0.07947355261550017 


 {'algorithm': 'auto', 'n_neighbors': 15} 
 recall 0.8414496492784004 
 precision: 0.1620546494925699 
 f1-score: 0.21538946026060116 


 {'algorithm': 'auto', 'n_neighbors': 19} 
 recall 0.8224219946787068 
 precision: 0.23612362438395354 
 f1-score: 0.2843984308037106 


 {'algorithm': 'auto', 'n_neighbors': 23} 
 recall 0.8054502942836411 
 precision: 0.275804790865868 
 f1-score: 0.33009195972741456 


 {'algorithm': 'auto', 'n_neighbors': 27} 
 recall 0.7948345830309872 
 precision: 0.3218841897488765 
 f1-score: 0.38312615588374627 


 {'algorithm': 'auto', 'n_neighbors': 31} 
 recall 0.7821226047461636 
 precision: 0.37526017668514755 
 f1-score: 0.43250421814584494 


 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.9281759789298288 
 precision: 0.0049293633284


 {'algorithm': 'auto', 'n_neighbors': 11} 
 recall 0.8922303743717919 
 precision: 0.01371823577185645 
 f1-score: 0.026965282938318624 


 {'algorithm': 'auto', 'n_neighbors': 15} 
 recall 0.8795049584777876 
 precision: 0.02376322370272751 
 f1-score: 0.04603858282536339 


 {'algorithm': 'auto', 'n_neighbors': 19} 
 recall 0.8541481899540434 
 precision: 0.03443583164736213 
 f1-score: 0.06558159967501459 


 {'algorithm': 'auto', 'n_neighbors': 23} 
 recall 0.8520384853126931 
 precision: 0.05143120639881411 
 f1-score: 0.09542088468109088 


 {'algorithm': 'auto', 'n_neighbors': 27} 
 recall 0.8478056384208122 
 precision: 0.07169922160836406 
 f1-score: 0.12852133176644814 


 {'algorithm': 'auto', 'n_neighbors': 31} 
 recall 0.8393399446370502 
 precision: 0.09152246057393859 
 f1-score: 0.1586816408795139 


 {'algorithm': 'auto', 'n_neighbors': 3} 
 recall 0.93451853046306 
 precision: 0.00363626025149778 
 f1-score: 0.007242080416002733 


 {'algorithm': 'auto', 'n_neighbors

#### Model selection

Now I will look through each model's scores manually and decide which model performs best.

My choice is the model trained on data scaled with StandardScaler, with the outliers kept and without PCA, because it had a high cross-validated recall of ~88%, although unfortunately a low precision of ~2%.
The model's parameters are as follows: <br> <br>
algorithm: 'auto'<br>
n_neighbors: 9

##### Final Model

Now I will check the scores at each split of the model to make sure it is not overfitting.

In [38]:
# Best model
# Print cross val scores
for key, n in olr_keys_n_components:
    if key == 's2_r2_o1':
        n_components = n

model = KNeighborsClassifier(algorithm='auto', n_neighbors=9)
customCV(model, X, y, 's1', NearMiss(), outlier_removal=False,
         pca=PCA(n_components), print_splits=True)

split 1
recall: 0.8987
precision: 0.1276
f1: 0.2234
split 2
recall: 0.7532
precision: 0.7212
f1: 0.7368
Mean Scores:
Mean recall: 0.8054
Mean precision: 0.5627
Mean f1: 0.5868 



[0.8054099814560992, 0.5626520230293816, 0.5867627368973181]

Looks good.

#### Save Data

Now I will save the model along with a string representing the transformations applied to the data.

In [40]:
with open("Models/SupportVectorMachine.pickle", "wb") as pickle_out:
    pickle.dump([model, 's1_r2_o1, pca'+str(n_components)], pickle_out)
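
For reference, a file saved this way can be read back by unpacking the two-item list. The sketch below round-trips a placeholder through a temporary file. The placeholder string, the temporary file name and the example transformation string are stand-ins, not the real model or path.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the fitted classifier and its description
saved = ["model placeholder", "s1_r2_o1, pca19"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pickle")  # hypothetical file name

    # Save the [model, description] pair
    with open(path, "wb") as pickle_out:
        pickle.dump(saved, pickle_out)

    # Load it back and unpack both items
    with open(path, "rb") as pickle_in:
        loaded_model, transformations = pickle.load(pickle_in)

print(transformations)
```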

## Sources
1) Top-rated Kaggle post: https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets