# Support Vector Machine Modeling

## Objectives
* Load data
* Tune hyperparameters for each version of the data
* Select a model
* Examine results
* Save results

#### Load data

I will start by importing the necessary libraries.

In [1]:
import pickle # loading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE # RFE
from sklearn.svm import SVC
from imblearn.under_sampling import NearMiss
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, accuracy_score, roc_curve, auc, precision_recall_curve, confusion_matrix
from sklearn.model_selection import GridSearchCV
import time
from modeling_functions import *

import warnings
warnings.filterwarnings("ignore") 

Using TensorFlow backend.


Now I will load the data.

In [2]:
# Load the (key, n_components) pairs; the context manager closes the file automatically
with open("olr_keys_n_components.pickle", "rb") as pickle_in:
    olr_keys_n_components = list(pickle.load(pickle_in))

# Sanity Check
print(olr_keys_n_components)

[('s1_r1_o1', 16), ('s1_r1_o3', 16), ('s1_r2_o1', 16), ('s1_r2_o3', 15), ('s2_r1_o1', 24), ('s2_r1_o3', 25), ('s2_r2_o1', 19), ('s2_r2_o3', 20), ('s3_r1_o1', 23), ('s3_r1_o3', 24), ('s3_r2_o1', 19), ('s3_r2_o3', 18), ('s4_r1_o1', 21), ('s4_r1_o3', 22), ('s4_r2_o1', 16), ('s4_r2_o3', 18)]
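The suffix filtering that the tuning cells later apply to these pairs can be sketched as a small lookup. This is a minimal sketch, assuming every key starts with the scaler string followed by an underscore; the helper name `components_for` is hypothetical and is not part of `modeling_functions`.

```python
# Subset of the (key, n_components) pairs printed above.
olr_keys_n_components = [
    ('s1_r1_o1', 16), ('s1_r1_o3', 16), ('s1_r2_o1', 16), ('s1_r2_o3', 15),
    ('s2_r1_o1', 24), ('s2_r1_o3', 25), ('s2_r2_o1', 19), ('s2_r2_o3', 20),
]


def components_for(pairs, suffix):
    """Hypothetical helper: map scaler string -> n_components for keys ending in `suffix`."""
    return {key.split('_')[0]: n for key, n in pairs if key.endswith(suffix)}


o1 = components_for(olr_keys_n_components, 'r2_o1')
o3 = components_for(olr_keys_n_components, 'r2_o3')
print(o1)  # {'s1': 16, 's2': 19}
print(o3)  # {'s1': 15, 's2': 20}
```

Keying the result by scaler string avoids relying on list order when the component counts are paired with `'s1'` to `'s4'`.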


In [3]:
# Load the cleaned data; the context manager closes the file automatically
with open("cleaned_data.pickle", "rb") as pickle_in:
    clean_data = pickle.load(pickle_in)

X = clean_data['X']
y = clean_data['y']

#### Tune hyperparameters

My goal for this project is to create a model that can help alert a credit lender to suspicious activity. <br><br>

For this reason I want few false negatives, so I will use recall as my main metric. High recall means that few fraudulent transactions are left undetected. <br><br>

My second metric will be precision because I do not want false positives either. Low precision would cause the model to flag too much of the data as likely fraudulent. If the credit lender took preventative action on, say, every other transaction, that would be a nuisance to both the lender and its clients. <br><br>

However, precision does not need to be nearly as high as recall. If recall were, say, 80%, the model would potentially have stopped 80% of fraud. If precision were, say, 20%, fewer than 1 in 100 transactions would be flagged as suspected fraud, because fraud accounts for only 0.17% of the data I was given. <br><br>
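The arithmetic behind that estimate can be checked directly. This is a small sketch using the illustrative recall and precision values from the paragraph above and the 0.17% fraud rate; the variable names are mine.

```python
fraud_rate = 0.0017  # share of fraudulent transactions in the dataset (0.17%)
recall = 0.80        # illustrative value from the text
precision = 0.20     # illustrative value from the text

# Caught fraud as a share of all transactions: recall * fraud rate.
true_positive_share = recall * fraud_rate

# precision = true positives / flagged, so flagged = true positives / precision.
flagged_share = true_positive_share / precision

print(f"{flagged_share:.2%} of transactions flagged")  # 0.68% of transactions flagged
```

At these values roughly 7 in every 1,000 transactions would be flagged, which is below the 1 in 100 threshold mentioned above.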

F1-score is the harmonic mean of recall and precision. It is not the best metric to use here, though, because recall must be high while precision can afford to be much lower.<br><br>

The metrics mentioned above are calculated by comparing the known values to the model's predicted values. The simplified formulas for precision and recall are shown below. 
<img src="../Images/Precision_Recall.png"><br>
Image Source: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
<br><br>
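As a complement to the formulas, here is a minimal sketch that computes precision, recall, and F1 from raw counts. The function name and the counts are illustrative only and do not come from this dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts: 80 frauds caught, 20 missed, 320 false alarms.
p, r, f1 = precision_recall_f1(tp=80, fp=320, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.2 0.8 0.32
```

With recall at 0.8 and precision at 0.2, F1 is only 0.32, which illustrates why F1 alone would undervalue a model that meets the recall goal.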

To tune the hyperparameters I will use my own function, customGridSearchCV, which has a docstring attached. The function transforms the data according to its arguments and returns the cross-validation scores for each method as well as for each combination of parameters.

The parameter grid in the cell below was inspired by the <a href="https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets">top-rated Kaggle post</a>, which is listed as source 1 under Sources.

In [4]:
# Support vector classifier (the class itself, not an instance, is passed to customGridSearchCV)
clf = SVC

# Create parameter grid (Source 1)
params = {
    'C': [30, 50, 75, 100, 125, 150, 200], 
    'kernel': ['rbf', 'poly', 'sigmoid', 'linear']
}
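To get a feel for the size of this search, the grid can be expanded into its individual combinations. This is a sketch using only the standard library; it is not how `customGridSearchCV` is implemented, which this notebook does not show.

```python
from itertools import product

params = {
    'C': [30, 50, 75, 100, 125, 150, 200],
    'kernel': ['rbf', 'poly', 'sigmoid', 'linear'],
}

# Build one dict per combination of parameter values.
names = sorted(params)
combinations = [
    dict(zip(names, values))
    for values in product(*(params[name] for name in names))
]

print(len(combinations))  # 28
print(combinations[0])    # {'C': 30, 'kernel': 'rbf'}
```

Each of the 28 combinations is cross-validated once per scaler, which helps explain the long run times noted for the tuning cells.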

Warning: The following cell took 18 minutes to run!

In [5]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# No PCA

t1 = time.time()

# Record results
results_o1 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss())
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 100, 'kernel': 'linear'} 
 recall 0.9155446263000888 
 precision: 0.00850831784325763 
 f1-score: 0.01676627917064486 


 {'C': 100, 'kernel': 'poly'} 
 recall 0.8097100163938832 
 precision: 0.3001933183577212 
 f1-score: 0.3858290054970775 


 {'C': 100, 'kernel': 'rbf'} 
 recall 0.8986132387325648 
 precision: 0.019259003148995594 
 f1-score: 0.03658851532729069 


 {'C': 100, 'kernel': 'sigmoid'} 
 recall 0.8668467306296863 
 precision: 0.03152976164065854 
 f1-score: 0.05895474720685615 


 {'C': 125, 'kernel': 'linear'} 
 recall 0.9155446263000888 
 precision: 0.007953423959064091 
 f1-score: 0.015685180586126616 


 {'C': 125, 'kernel': 'poly'} 
 recall 0.8203122900373566 
 precision: 0.20958677264651496 
 f1-score: 0.27865748747199975 


 {'C': 125, 'kernel': 'rbf'} 
 recall 0.900722943373915 
 precision: 0.017849790462428662 
 f1-score: 0.03407869166731136 


 {'C': 125, 'kernel': 'sigmoid'}


 {'C': 100, 'kernel': 'linear'} 
 recall 0.9153430621623801 
 precision: 0.0037258708865092033 
 f1-score: 0.0074191618194768515 


 {'C': 100, 'kernel': 'poly'} 
 recall 0.8963154075626866 
 precision: 0.005226463478137937 
 f1-score: 0.010379396306202653 


 {'C': 100, 'kernel': 'rbf'} 
 recall 0.9682334918971217 
 precision: 0.0020821998720829816 
 f1-score: 0.004155389089216382 


 {'C': 100, 'kernel': 'sigmoid'} 
 recall 0.7122604746163562 
 precision: 0.015351675579743049 
 f1-score: 0.029982230945959365 


 {'C': 125, 'kernel': 'linear'} 
 recall 0.9110967776613185 
 precision: 0.003665393057142023 
 f1-score: 0.007299023346337821 


 {'C': 125, 'kernel': 'poly'} 
 recall 0.8942057029213363 
 precision: 0.004895442014707068 
 f1-score: 0.009727634900753625 


 {'C': 125, 'kernel': 'rbf'} 
 recall 0.9682334918971217 
 precision: 0.0021493119308610637 
 f1-score: 0.004288973425322932 


 {'C': 125, 'kernel': 'sigmoid'} 
 recall 0.7122604746163562 
 precision: 0.015344862856477895

Warning: The following cell took 18 minutes to run!

In [6]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# No PCA

t1 = time.time()

# Record results
results_o3 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), outlier_removal=True)
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 100, 'kernel': 'linear'} 
 recall 0.9112983417990271 
 precision: 0.009764123460969706 
 f1-score: 0.019145636634917505 


 {'C': 100, 'kernel': 'poly'} 
 recall 0.796984600499879 
 precision: 0.4123200902742292 
 f1-score: 0.5239615863113217 


 {'C': 100, 'kernel': 'rbf'} 
 recall 0.8901475449488027 
 precision: 0.020350431810553897 
 f1-score: 0.03856728218735537 


 {'C': 100, 'kernel': 'sigmoid'} 
 recall 0.8456824961702815 
 precision: 0.04756391010359967 
 f1-score: 0.08537016511236846 


 {'C': 125, 'kernel': 'linear'} 
 recall 0.9134214840495579 
 precision: 0.00923051475125893 
 f1-score: 0.018129119182212484 


 {'C': 125, 'kernel': 'poly'} 
 recall 0.8033405896422909 
 precision: 0.30900204778982515 
 f1-score: 0.41559687096529196 


 {'C': 125, 'kernel': 'rbf'} 
 recall 0.894366954231503 
 precision: 0.020356917900182646 
 f1-score: 0.03858459357727557 


 {'C': 125, 'kernel': 'sigmoid'}


 {'C': 100, 'kernel': 'linear'} 
 recall 0.9216587384772502 
 precision: 0.003579308793799652 
 f1-score: 0.007129445705200843 


 {'C': 100, 'kernel': 'poly'} 
 recall 0.8942057029213363 
 precision: 0.005196119081254891 
 f1-score: 0.01031958627014491 


 {'C': 100, 'kernel': 'rbf'} 
 recall 0.9682334918971217 
 precision: 0.0020890375377021035 
 f1-score: 0.004169002626129206 


 {'C': 100, 'kernel': 'sigmoid'} 
 recall 0.7270955951517105 
 precision: 0.014859874160083853 
 f1-score: 0.029060784567165945 


 {'C': 125, 'kernel': 'linear'} 
 recall 0.9153161869440188 
 precision: 0.003503365950103562 
 f1-score: 0.00697865176797683 


 {'C': 125, 'kernel': 'poly'} 
 recall 0.8942057029213363 
 precision: 0.004894658497289643 
 f1-score: 0.009726166198417189 


 {'C': 125, 'kernel': 'rbf'} 
 recall 0.9661103496465909 
 precision: 0.0021501412728122866 
 f1-score: 0.004290600647019299 


 {'C': 125, 'kernel': 'sigmoid'} 
 recall 0.7270955951517105 
 precision: 0.014859296470911013 
 f

Warning: The following cell took 18 minutes to run!

In [7]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o1_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o1' in key:
        o1_n_components.append(n)
        

# Record results
results_o1_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o1_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 100, 'kernel': 'linear'} 
 recall 0.9176005805047165 
 precision: 0.011727431337702135 
 f1-score: 0.02288967701825666 


 {'C': 100, 'kernel': 'poly'} 
 recall 0.19413313983176114 
 precision: 0.9108527131782945 
 f1-score: 0.2788340460917082 


 {'C': 100, 'kernel': 'rbf'} 
 recall 0.9134080464403773 
 precision: 0.014815139492300747 
 f1-score: 0.028532333117239007 


 {'C': 100, 'kernel': 'sigmoid'} 
 recall 0.8922438119809724 
 precision: 0.018898640184169816 
 f1-score: 0.03603173446795359 


 {'C': 125, 'kernel': 'linear'} 
 recall 0.9260393990701173 
 precision: 0.01224693621498899 
 f1-score: 0.023875673923167227 


 {'C': 125, 'kernel': 'poly'} 
 recall 0.20890107232121258 
 precision: 0.917562724014337 
 f1-score: 0.29262587331832096 


 {'C': 125, 'kernel': 'rbf'} 
 recall 0.9112849041898466 
 precision: 0.013740244325322974 
 f1-score: 0.026582611404371136 


 {'C': 125, 'kernel': 'sigmo


 {'C': 100, 'kernel': 'linear'} 
 recall 0.9282028541481898 
 precision: 0.003959356155754845 
 f1-score: 0.007881942193779285 


 {'C': 100, 'kernel': 'poly'} 
 recall 0.9239162568195867 
 precision: 0.0034216130306434747 
 f1-score: 0.006817148010242406 


 {'C': 100, 'kernel': 'rbf'} 
 recall 0.9830954876508372 
 precision: 0.0021830603979385234 
 f1-score: 0.004356413433442269 


 {'C': 100, 'kernel': 'sigmoid'} 
 recall 0.7375634927033782 
 precision: 0.0964821634492286 
 f1-score: 0.1466596709870852 


 {'C': 125, 'kernel': 'linear'} 
 recall 0.9282028541481898 
 precision: 0.0038957196911280525 
 f1-score: 0.0077560751690491 


 {'C': 125, 'kernel': 'poly'} 
 recall 0.9218065521782365 
 precision: 0.0035370049111236846 
 f1-score: 0.007046041327513576 


 {'C': 125, 'kernel': 'rbf'} 
 recall 0.9830954876508372 
 precision: 0.002258567030500831 
 f1-score: 0.004506755158770724 


 {'C': 125, 'kernel': 'sigmoid'} 
 recall 0.7375634927033782 
 precision: 0.09706769454211449 
 f1-s

Warning: The following cell took 18 minutes to run!

In [8]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o3_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o3' in key:
        o3_n_components.append(n)
        

# Record results
# Stored under its own name so the PCA results without outlier removal are not overwritten
results_o3_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o3_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), 
                                                  outlier_removal=True, pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 100, 'kernel': 'linear'} 
 recall 0.9155043134725469 
 precision: 0.009331425322353369 
 f1-score: 0.018329929456359438 


 {'C': 100, 'kernel': 'poly'} 
 recall 0.18147491198365986 
 precision: 0.9116465863453816 
 f1-score: 0.2627228369448286 


 {'C': 100, 'kernel': 'rbf'} 
 recall 0.904955790265796 
 precision: 0.01274266872190189 
 f1-score: 0.024695242424769694 


 {'C': 100, 'kernel': 'sigmoid'} 
 recall 0.8773952538364375 
 precision: 0.01553160702637557 
 f1-score: 0.029995683246450983 


 {'C': 125, 'kernel': 'linear'} 
 recall 0.9155043134725469 
 precision: 0.009279102985110133 
 f1-score: 0.018227531305012475 


 {'C': 125, 'kernel': 'poly'} 
 recall 0.202571958397162 
 precision: 0.9184397163120567 
 f1-score: 0.2818121693121693 


 {'C': 125, 'kernel': 'rbf'} 
 recall 0.9070654949071462 
 precision: 0.011417193347324764 
 f1-score: 0.022226099824664166 


 {'C': 125, 'kernel': 'sigmoid


 {'C': 100, 'kernel': 'linear'} 
 recall 0.9344782176355183 
 precision: 0.00532162443849184 
 f1-score: 0.010567673517521926 


 {'C': 100, 'kernel': 'poly'} 
 recall 0.9238490687736838 
 precision: 0.003244623977155303 
 f1-score: 0.0064653824886602325 


 {'C': 100, 'kernel': 'rbf'} 
 recall 0.9724797763981833 
 precision: 0.002105789217374081 
 f1-score: 0.004202436726010961 


 {'C': 100, 'kernel': 'sigmoid'} 
 recall 0.81805477169502 
 precision: 0.021771457350560203 
 f1-score: 0.04115522797292414 


 {'C': 125, 'kernel': 'linear'} 
 recall 0.9323550753849874 
 precision: 0.005316764352752172 
 f1-score: 0.010557994132175598 


 {'C': 125, 'kernel': 'poly'} 
 recall 0.9301781826977343 
 precision: 0.003297382903917326 
 f1-score: 0.006570233977494445 


 {'C': 125, 'kernel': 'rbf'} 
 recall 0.9703566341476524 
 precision: 0.002179385354634736 
 f1-score: 0.004348978422236433 


 {'C': 125, 'kernel': 'sigmoid'} 
 recall 0.81805477169502 
 precision: 0.02174596447774191 
 f1-scor

#### Model selection

Now I will look through each model's scores manually and decide which model performs best.

My choice is the model trained on data scaled with MinMaxScaler, with the outliers not removed and with no PCA, because it had a high cross-validated recall of ~79% and a precision of ~63%.
The model's parameters are as follows: <br> <br>
C: 30 <br>
kernel: 'poly'

##### Final Model

Now I will check the scores at each split of the model to make sure it is not overfitting.

In [9]:
# Best model
# Print cross val scores
for key, n in olr_keys_n_components:
    if key == 's1_r2_o1':
        n_components = n

model = SVC(C=30, kernel='poly')
customCV(model, X, y, 's1', NearMiss(),  print_splits=True)

split 1
recall: 0.8734
precision: 0.4742
f1: 0.6147
split 2
recall: 0.7405
precision: 0.7091
f1: 0.7245
Mean Scores:
Mean recall: 0.7928
Mean precision: 0.634
Mean f1: 0.6933 



[0.7927517536079981, 0.6339601958220391, 0.6932994256412194]

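Looks good: recall stays above 0.74 in the splits shown, although precision varies noticeably between them (0.47 to 0.71).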

#### Save Data

Now I will save the model along with a string that represents the transformations applied to the data.

In [10]:
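# Note: the label records the PCA component count for 's1_r2_o1',
# although the model above was cross-validated without PCA.
with open("Models/SupportVectorMachine.pickle", "wb") as pickle_out:
    pickle.dump([model, 's1_r2_o1, pca'+str(n_components)], pickle_out)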
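For reference, reading a saved pair back follows the same pattern in reverse. This is a minimal round-trip sketch with a stand-in object in place of the fitted model and a temporary, hypothetical file path, so it runs without the real `Models/` directory.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object is handled the same way.
payload = [{'C': 30, 'kernel': 'poly'}, 's1_r2_o1']

# Hypothetical path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example_model.pickle")

with open(path, "wb") as f:
    pickle.dump(payload, f)

# Unpack the two-element list back into the model and its transformation label.
with open(path, "rb") as f:
    restored_model, transform_label = pickle.load(f)

print(restored_model, transform_label)
```

Unpacking into two names keeps the transformation label next to the model, so a later notebook can tell which preprocessing to reapply before predicting.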

## Sources
1) Top-rated Kaggle post: https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets