# Logistic Regression Modeling

## Objectives
* Load data
* Tune hyperparameters for each version of the data
* Select a model
* Examine results
* Save results

#### Load data

I will start by importing the necessary libraries.

In [2]:
import pickle # loading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE # RFE
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import NearMiss
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, accuracy_score, roc_curve, auc, precision_recall_curve, confusion_matrix
from sklearn.model_selection import GridSearchCV
import time
from modeling_functions import *

import warnings
warnings.filterwarnings("ignore") 

## Pickle

In [31]:
# Load data
with open("olr_keys_n_components.pickle", "rb") as pickle_in:
    olr_keys_n_components = list(pickle.load(pickle_in))

# Sanity Check
print(olr_keys_n_components)

[('s1_r1_o1', 16), ('s1_r1_o3', 16), ('s1_r2_o1', 16), ('s1_r2_o3', 15), ('s2_r1_o1', 24), ('s2_r1_o3', 25), ('s2_r2_o1', 19), ('s2_r2_o3', 20), ('s3_r1_o1', 23), ('s3_r1_o3', 24), ('s3_r2_o1', 19), ('s3_r2_o3', 18), ('s4_r1_o1', 21), ('s4_r1_o3', 22), ('s4_r2_o1', 16), ('s4_r2_o3', 18)]


In [3]:
# Load data
with open("../clean_data.pickle", "rb") as pickle_in:
    df = pickle.load(pickle_in)

# Separate X and y
X = df.drop('Class', axis=1)
y = df.Class

df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


## Tune hyperparameters

My goal for this project is to create a model that can help alert a credit lender to suspicious activity. <br><br>

For this reason I want few false negatives, so I will use recall as my main metric. High recall means that few fraudulent transactions go undetected. <br><br>

My second metric will be precision, because I do not want many false positives either. Low precision would cause the model to flag too much of the data as likely fraudulent. If the credit lender chose to take preventative action on, say, every other transaction, that would be a nuisance to both the credit lender and its clients. <br><br>

However, precision does not need to be nearly as high as recall. If recall were, say, 80%, then I would have potentially stopped 80% of fraud; and if precision were, say, 20%, then fewer than 1 in 100 transactions would be flagged as suspected fraud, because fraud accounts for only 0.17% of the data I was given. <br><br>
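The arithmetic behind that estimate can be checked with a short sketch. The function name `flagged_fraction` is my own and not part of `modeling_functions`; the inputs are the illustrative recall, precision, and fraud rate quoted in the paragraph above.

```python
def flagged_fraction(prevalence, recall, precision):
    """Fraction of all transactions that the model flags as fraud.

    True positives make up prevalence * recall of the data, and
    precision = true positives / all flagged transactions.
    """
    true_positive_fraction = prevalence * recall
    return true_positive_fraction / precision


# Illustrative values from the text: 0.17% fraud, 80% recall, 20% precision
rate = flagged_fraction(prevalence=0.0017, recall=0.80, precision=0.20)
print(f"{rate:.2%} of transactions flagged")  # 0.68% of transactions flagged
```

Under these assumed values roughly 0.68% of transactions would be flagged, which is below the 1-in-100 threshold mentioned above.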

F1-score is the harmonic mean of recall and precision. It is not the best metric for this problem, though, because recall needs to be high while precision can get away with being much lower.<br><br>
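A small sketch of why F1 alone cannot express that preference: the harmonic mean is symmetric, so swapping recall and precision leaves the score unchanged. The helper `harmonic_f1` and the numbers are illustrative assumptions, not values from the grid search.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A model that catches most fraud but raises many false alarms
f1_high_recall = harmonic_f1(precision=0.20, recall=0.80)

# A model that misses most fraud but rarely raises a false alarm
f1_high_precision = harmonic_f1(precision=0.80, recall=0.20)

print(f"{f1_high_recall:.2f} {f1_high_precision:.2f}")  # 0.32 0.32
```

Both hypothetical models receive the same F1-score even though only the first one would suit this project, which is why recall and precision are reported separately.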

The metrics mentioned above are calculated by comparing the known values to the model's predicted values. The simplified formulas for precision and recall are shown below.
<img src="../Images/Precision_Recall.png"><br>
Image Source: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
<br><br>
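For readers who cannot see the image, here is a minimal sketch of the same two formulas written in terms of confusion-matrix counts. The function name and the counts are hypothetical, chosen only so that the result matches the 80% recall and 20% precision example used earlier.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts.

    precision = TP / (TP + FP): share of flagged transactions that are fraud
    recall    = TP / (TP + FN): share of fraud that gets flagged
    A zero denominator yields 0.0 here (a convention chosen for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 100 fraud cases, 80 caught, 320 false alarms
precision, recall = precision_recall(tp=80, fp=320, fn=20)
print(precision, recall)  # 0.2 0.8
```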

To tune the hyperparameters I will use my own function, customGridSearchCV, which has a docstring attached. The function goes through the data, transforming it according to the function's parameters, and returns the cross-validation scores for each method as well as for each combination of parameters.

The parameter grid in the cell below was taken from the <a href=https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets>top rated kaggle post</a>, which I cite as Source 1.

## Logistic Regression

In [12]:
# Reference the logistic regression classifier class
# (the class itself, not an instance, is passed to customGridSearchCV)
clf = LogisticRegression

# Create parameter grid (Source 1)
params = {
    "penalty": ['l1', 'l2'], 
    'C': [0.01, 0.05, 0.1, 0.5, 1, 10, 100]
}
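To make the size of the search concrete, the grid above can be expanded into its individual combinations. This is only a sketch of the idea using `itertools.product`; how customGridSearchCV enumerates the grid internally is not shown in this notebook, and the names `grid` and `expand_grid` are assumptions of mine.

```python
from itertools import product

# Same values as the parameter grid above, copied so the sketch is self-contained
grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.05, 0.1, 0.5, 1, 10, 100],
}


def expand_grid(grid):
    """Return one dict per combination of values in the grid."""
    keys = sorted(grid)  # fixed key order so the output is reproducible
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


combinations = expand_grid(grid)
print(len(combinations))  # 14
print(combinations[0])    # {'C': 0.01, 'penalty': 'l1'}
```

Each of the 14 combinations is evaluated once per scaler, which helps explain the run times reported for the cells that follow. If this notebook is rerun on a recent scikit-learn release, note that the default `lbfgs` solver does not support the `l1` penalty, so a solver such as `liblinear` or `saga` may need to be specified.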

Warning: The following cell took 8 minutes to run!

In [13]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# No PCA

t1 = time.time()

# Record results
results_o1 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss())
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 0.01, 'penalty': 'l1'} 
 recall 0.0 
 precision: 0.0 
 f1-score: 0.0 


 {'C': 0.01, 'penalty': 'l2'} 
 recall 0.6739498508425381 
 precision: 0.844290081875631 
 f1-score: 0.7192879089439513 


 {'C': 0.05, 'penalty': 'l1'} 
 recall 0.7226746217313016 
 precision: 0.4189304169984232 
 f1-score: 0.3782060892066823 


 {'C': 0.05, 'penalty': 'l2'} 
 recall 0.7587680399903248 
 precision: 0.7149055058217776 
 f1-score: 0.6839753283252765 


 {'C': 0.1, 'penalty': 'l1'} 
 recall 0.7629471364454835 
 precision: 0.26152504532557275 
 f1-score: 0.297831389057803 


 {'C': 0.1, 'penalty': 'l2'} 
 recall 0.7842054341691526 
 precision: 0.670313022167762 
 f1-score: 0.6599272825029454 


 {'C': 0.5, 'penalty': 'l1'} 
 recall 0.8371764895589777 
 precision: 0.09596266505313532 
 f1-score: 0.15925821157796674 


 {'C': 0.5, 'penalty': 'l2'} 
 recall 0.809656265957161 
 precision: 0.36982066232455973 
 f1-score:

Warning: The following cell took 8 minutes to run!

In [14]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# No PCA

t1 = time.time()

# Record results
results_o3 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), outlier_removal=True)
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 0.01, 'penalty': 'l1'} 
 recall 0.0 
 precision: 0.0 
 f1-score: 0.0 


 {'C': 0.01, 'penalty': 'l2'} 
 recall 0.5152785616383132 
 precision: 0.9045343868018286 
 f1-score: 0.6145250965630091 


 {'C': 0.05, 'penalty': 'l1'} 
 recall 0.6527318659464107 
 precision: 0.5482395126746591 
 f1-score: 0.4816985363156892 


 {'C': 0.05, 'penalty': 'l2'} 
 recall 0.7099760810556587 
 precision: 0.826395173453997 
 f1-score: 0.7364863050429135 


 {'C': 0.1, 'penalty': 'l1'} 
 recall 0.7396731973447284 
 precision: 0.3126174696088722 
 f1-score: 0.34213338233291424 


 {'C': 0.1, 'penalty': 'l2'} 
 recall 0.7524254884570937 
 precision: 0.7622014537902388 
 f1-score: 0.7252338179717067 


 {'C': 0.5, 'penalty': 'l1'} 
 recall 0.8265607783063237 
 precision: 0.10864678338755863 
 f1-score: 0.17702937166418609 


 {'C': 0.5, 'penalty': 'l2'} 
 recall 0.8054099814560992 
 precision: 0.5371265955176471 
 f1-scor

Warning: The following cell took 8 minutes to run!

In [33]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o1_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o1' in key:
        o1_n_components.append(n)
        

# Record results
results_o1_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o1_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 0.01, 'penalty': 'l1'} 
 recall 0.16348195329087048 
 precision: 0.32083333333333336 
 f1-score: 0.21659634317862167 


 {'C': 0.01, 'penalty': 'l2'} 
 recall 0.7141551775108171 
 precision: 0.6903372838218907 
 f1-score: 0.6180975333517706 


 {'C': 0.05, 'penalty': 'l1'} 
 recall 0.7331022064554275 
 precision: 0.5877803824232396 
 f1-score: 0.47559930195413563 


 {'C': 0.05, 'penalty': 'l2'} 
 recall 0.7587142895536026 
 precision: 0.7035725519212277 
 f1-score: 0.6702705541242647 


 {'C': 0.1, 'penalty': 'l1'} 
 recall 0.7607702437582305 
 precision: 0.5630537765315834 
 f1-score: 0.5007045567278419 


 {'C': 0.1, 'penalty': 'l2'} 
 recall 0.7820688543094412 
 precision: 0.6920898992169876 
 f1-score: 0.6761414061925569 


 {'C': 0.5, 'penalty': 'l1'} 
 recall 0.8286973581660352 
 precision: 0.3055952642774076 
 f1-score: 0.42164322214878336 


 {'C': 0.5, 'penalty': 'l2'} 
 recall 0.8054099814

Warning: The following cell took 8 minutes to run!

In [35]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o3_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o3' in key:
        o3_n_components.append(n)
        

# Record results
# Stored under its own name so the PCA results without outlier removal are not overwritten
results_o3_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o3_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), 
                                                  outlier_removal=True, pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 0.01, 'penalty': 'l1'} 
 recall 0.0 
 precision: 0.0 
 f1-score: 0.0 


 {'C': 0.01, 'penalty': 'l2'} 
 recall 0.6781423849068774 
 precision: 0.8579925229409766 
 f1-score: 0.7278956970586927 


 {'C': 0.05, 'penalty': 'l1'} 
 recall 0.7394447579886586 
 precision: 0.5803513254106427 
 f1-score: 0.4730784470232264 


 {'C': 0.05, 'penalty': 'l2'} 
 recall 0.7120589104786476 
 precision: 0.7887852745032364 
 f1-score: 0.7116591582109776 


 {'C': 0.1, 'penalty': 'l1'} 
 recall 0.7375097422666559 
 precision: 0.585984914073966 
 f1-score: 0.518635547389541 


 {'C': 0.1, 'penalty': 'l2'} 
 recall 0.7587142895536026 
 precision: 0.7582124345772251 
 f1-score: 0.7211156260801651 


 {'C': 0.5, 'penalty': 'l1'} 
 recall 0.8287376709935769 
 precision: 0.309234345428736 
 f1-score: 0.42427326535756554 


 {'C': 0.5, 'penalty': 'l2'} 
 recall 0.8011771345642184 
 precision: 0.6076711411459376 
 f1-score: 0

## Model selection

Now I will look through each model's scores manually and decide which model performs best.

My choice is the model that uses the data scaled with MinMaxScaler, with outliers not removed and with PCA, because it had a high cross-validated recall of ~81% and a precision of ~56%.
The model's parameters are as follows: <br> <br>
C: 0.5<br>
penalty: l2

##### Final Model

Now I will check the scores at each split of the model to make sure it is not overfitting.

In [38]:
# Best model
# Print cross val scores
for key, n in olr_keys_n_components:
    if key == 's1_r2_o1':
        n_components = n

model = LogisticRegression(C=0.5, penalty='l2')
customCV(model, X, y, 's1', NearMiss(), outlier_removal=False,
         pca=PCA(n_components), print_splits=True)

split 1
recall: 0.8987
precision: 0.1276
f1: 0.2234
split 2
recall: 0.7532
precision: 0.7212
f1: 0.7368
Mean Scores:
Mean recall: 0.8054
Mean precision: 0.5627
Mean f1: 0.5868 



[0.8054099814560992, 0.5626520230293816, 0.5867627368973181]

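Looks good. Recall, the metric I care most about, holds up across the splits shown, although precision varies noticeably from split to split.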

#### Save Data

Now I will save the model along with a string that represents the transformations applied to the data.

In [40]:
with open("Models/LogisticRegression.pickle", "wb") as pickle_out:
    pickle.dump([model, 's1_r2_o1, pca'+str(n_components)], pickle_out)
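For later notebooks, a minimal sketch of reading the file back, assuming it keeps the two-item `[model, description]` layout written above. The helper name `load_model_bundle` is hypothetical and not part of `modeling_functions`.

```python
import pickle


def load_model_bundle(path):
    """Load a pickled [model, description] pair and return it as a tuple.

    Assumes the file was written as a two-item list, as in the cell above.
    """
    with open(path, "rb") as pickle_in:
        model, description = pickle.load(pickle_in)
    return model, description
```

Calling `load_model_bundle("Models/LogisticRegression.pickle")` would then return the model and the string describing its transformations. Unpickling a scikit-learn model generally requires a compatible scikit-learn version to be installed.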

## Sources
1) Top rated kaggle post: https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets