# Logistic Regression Modeling

## Objectives
* Load data
* Tune hyperparameters for each version of the data
* Select a model
* Examine results
* Save results

## Tools Used
* Pickle
* Numpy
* Pandas
* Matplotlib
* Sklearn
* Imblearn

## Load data

I will start by importing the necessary libraries.

In [2]:
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from imblearn.under_sampling import NearMiss
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedShuffleSplit
import time
from modeling_functions import *

import warnings
warnings.filterwarnings("ignore") 

## Pickle

In [4]:
# Load data
with open("engineered_data.pickle", "rb") as pickle_in:
    df = pickle.load(pickle_in)

# Separate X and y
X = df.drop('Class', axis=1)
y = df.Class

# stratified train test split
train_index, test_index = next(StratifiedShuffleSplit(test_size=0.1).split(X,y))
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]

X_train.shape, X_test.shape

((254968, 53), (28330, 53))

In [3]:
# Load data
with open("olr_keys_n_components.pickle", "rb") as pickle_in:
    olr_keys_n_components = list(pickle.load(pickle_in))

# Sanity Check
print(olr_keys_n_components)

[('s1_r1_o1', 19), ('s1_r1_o3', 18), ('s1_r2_o1', 17), ('s1_r2_o3', 16), ('s2_r1_o1', 26), ('s2_r1_o3', 26), ('s2_r2_o1', 21), ('s2_r2_o3', 22), ('s3_r1_o1', 23), ('s3_r1_o3', 25), ('s3_r2_o1', 17), ('s3_r2_o3', 20)]


## Tune hyperparameters

My goal for this project is to create a model that can help alert a credit lender to suspicious activity. <br><br>

For this reason I want few false negatives, so recall will be my main metric. High recall means that few fraudulent transactions go undetected. <br><br>

My second metric will be precision because I do not want false positives either. Low precision would cause the model to flag too much of the data as likely fraudulent. If the credit lender took preventative action on, say, every other transaction, that would be a nuisance to both the lender and its clients. <br><br>

However, precision does not need to be nearly as high as recall. If recall were, say, 80%, the model would potentially stop 80% of fraud. If precision were, say, 20%, fewer than 1 in 100 transactions would be flagged as suspicious (0.17% × 0.80 / 0.20 ≈ 0.68%), because fraud accounts for only 0.17% of the data I was given. <br><br>
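As a quick check of that arithmetic, here is a minimal sketch using the figures quoted above. The variable names are my own and purely illustrative.

```python
# Figures quoted in the paragraph above
fraud_rate = 0.0017   # share of transactions that are fraudulent (0.17%)
recall = 0.80         # share of fraud that the model catches
precision = 0.20      # share of flagged transactions that are truly fraud

# True positives as a share of all transactions
caught_share = fraud_rate * recall

# precision = true positives / flagged, so flagged = true positives / precision
flagged_share = caught_share / precision

print(f"{flagged_share:.2%} of transactions flagged")
```

Under these assumptions roughly 7 in every 1,000 transactions would be flagged, which is below the 1 in 100 mentioned above.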

F1-score is the harmonic mean of recall and precision. It is not the best metric here, though, because it weights the two equally, whereas I need recall to be high and can accept much lower precision.<br><br>

The metrics mentioned above are calculated by comparing the known labels to the model's predicted labels. The simplified formulas for precision and recall are shown below. 
<img src="../Images/Precision_Recall.png"><br>
<a href="https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9">Image Source</a> 
<br><br>
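For reference, the same definitions written out in symbols, where TP, FP and FN are the counts of true positives, false positives and false negatives for the fraud class (standard definitions, not taken from the linked image):

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```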

To tune the hyperparameters I will use my own function, customGridSearchCV, which has a docstring attached. The function transforms the data according to its arguments and returns the cross-validation scores for each method as well as for each combination of parameters.

The parameter grid in the cell below was inspired by the <a href=https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets>top rated Kaggle post</a>, which I cite as Source 1.

## Logistic Regression 

In [5]:
# Logistic regression classifier class (passed uninstantiated to customGridSearchCV)
clf = LogisticRegression

# Create parameter grid (Source 1)
params = {
    "penalty": ["l1", "l2"], 
    'C': [0.01, 0.1, 1, 10, 100],
    "solver": ["liblinear"]
}
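
How customGridSearchCV walks this grid is not shown in this notebook. A minimal sketch, assuming it tries every combination of the listed values (the same expansion that sklearn's `ParameterGrid` performs), shows how many candidate models the grid implies. The names `keys` and `combos` are illustrative only.

```python
from itertools import product

# Same grid as in the cell above
params = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1, 10, 100],
    "solver": ["liblinear"],
}

# Cartesian product of the value lists, one dict per combination
keys = list(params)
combos = [dict(zip(keys, values)) for values in product(*(params[k] for k in keys))]

print(len(combos))  # 2 penalties x 5 values of C x 1 solver = 10
print(combos[0])
```

Each of these 10 combinations is then scored with cross-validation for every scaler, which helps explain the long run times of the tuning cells.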

Warning: the following cell took about 17.5 minutes to run!

In [6]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# No PCA

t1 = time.time()

# Record results
results_o1 = {}
scaler_str = ["Min-Max", "Standard", "Robust"]
for n, scaler in enumerate([MinMaxScaler(), StandardScaler(), RobustScaler()]):
    print(scaler_str[n], '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1[scaler_str[n]] = customGridSearchCV(clf, params, X_train, y_train, 'custom', scaler, NearMiss())
    
t2 = time.time()

print((t2 - t1)/60)

Min-Max ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Standard ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Robust ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17.543113509813946


Warning: the following cell took about 17 minutes to run!

In [7]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# No PCA

t1 = time.time()

# Record results
results_o3 = {}
scaler_str = ["Min-Max", "Standard", "Robust"]
for n, scaler in enumerate([MinMaxScaler(), StandardScaler(), RobustScaler()]):
    print(scaler_str[n], '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3[scaler_str[n]] = customGridSearchCV(clf, params, X_train, y_train, 'custom', scaler, NearMiss(), outlier_removal=True)
    
t2 = time.time()

print((t2 - t1)/60)

Min-Max ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'} 
 recall 0.6679728317659352 
 precision: 0.6396700632438357 
 f1-score: 0.5593222783287597 

Standard ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Robust ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17.16404824256897


Warning: the following cell took about 17 minutes to run!

In [12]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components (for PCA)
o1_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o1' in key:
        o1_n_components.append(n)
        

# Record results
results_o1_p = {}
scaler_str = ["Min-Max", "Standard", "Robust"]
for n, scaler, k in zip([0,1,2],[MinMaxScaler(), StandardScaler(), RobustScaler()], o1_n_components):
    print(scaler_str[n], '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1_p[scaler_str[n]] = customGridSearchCV(clf, params, X_train, y_train, 'custom', scaler, NearMiss(), pca=PCA(k))
t2 = time.time()

print((t2 - t1)/60)

Min-Max ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Standard ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Robust ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17.133947738011678


Warning: the following cell took about 17 minutes to run!

In [13]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components (for PCA)
o3_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o3' in key:
        o3_n_components.append(n)
        

# Record results
# Store under its own name so results_o1_p from the previous cell is not overwritten
results_o3_p = {}
for n, scaler, k in zip([0,1,2],[MinMaxScaler(), StandardScaler(), RobustScaler()], o3_n_components):
    print(scaler_str[n], '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3_p[scaler_str[n]] = customGridSearchCV(clf, params, X_train, y_train, 'custom', scaler, NearMiss(), 
                                                     outlier_removal=True, pca=PCA(k))
    
t2 = time.time()

print((t2 - t1)/60)

Min-Max ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'} 
 recall 0.7505224660397074 
 precision: 0.530077052560172 
 f1-score: 0.5118384578980918 

Standard ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 0.1, 'penalty': 'l1', 'solver': 'liblinear'} 
 recall 0.7941222570532915 
 precision: 0.4868455336966675 
 f1-score: 0.5284148094658674 

Robust ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'C': 0.01, 'penalty': 'l1', 'solver': 'liblinear'} 
 recall 0.7321577847439916 
 precision: 0.8283664597735001 
 f1-score: 0.7745360787125343 

17.110976183414458


## Model selection

Now I will look through each model's scores manually and decide which model performs best.

My choice is the model trained on data scaled with StandardScaler, with outliers removed and with PCA, because it had a high cross-validated recall of ~79% and a precision of ~49%.
The model's parameters are as follows: <br> <br>
C: 0.1<br>
penalty: "l1"<br>
solver: "liblinear"

## Final Model

Now I will check the scores at each split of the model to make sure it is not overfitting to a specific split.

In [5]:
# Best model
# Print cross val scores
for key, n in olr_keys_n_components:
    if key == 's2_r2_o3':
        n_components = n

model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
customCV(model, X_train, y_train, StandardScaler(), NearMiss(), outlier_removal=True,
         pca=PCA(n_components), print_splits=True)

split 1
recall: 0.8736
precision: 0.0987
f1: 0.1774
split 2
recall: 0.7955
precision: 0.5072
f1: 0.6195
split 3
recall: 0.7273
precision: 0.8649
f1: 0.7901
split 4
recall: 0.8046
precision: 0.5036
f1: 0.6195
split 5
recall: 0.8391
precision: 0.5034
f1: 0.6293
Mean Scores:
Mean recall: 0.808
Mean precision: 0.4956
Mean f1: 0.5671 



[0.8079937304075235, 0.49557158770839704, 0.5671469497061382]

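Recall is fairly stable across the splits (0.73 to 0.87), so the model does not appear to be overfitting to a particular split. Precision varies much more, from 0.10 in split 1 to 0.86 in split 3.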
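
One way to put a number on how much the scores move between splits is to look at their standard deviation and range. This is a small sketch using the per-split values printed above; the list names are my own.

```python
from statistics import mean, stdev

# Per-split scores copied from the cross-validation output above
recall_scores = [0.8736, 0.7955, 0.7273, 0.8046, 0.8391]
precision_scores = [0.0987, 0.5072, 0.8649, 0.5036, 0.5034]

for name, scores in [("recall", recall_scores), ("precision", precision_scores)]:
    print(
        f"{name}: mean={mean(scores):.4f}, stdev={stdev(scores):.4f}, "
        f"range={min(scores):.4f} to {max(scores):.4f}"
    )
```

The means reproduce the reported averages, and the much larger spread in precision than in recall is visible directly.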

## Run on test data
Now I will run the model on the test data, which the model has not yet seen.

In [7]:
# Fit model to the entire training set
# (raw features: no scaling, NearMiss, outlier removal, or PCA is applied here)
model.fit(X_train, y_train)

# run model on test set
y_hat = model.predict(X_test)

# get results
print(classification_report(y_test, y_hat))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28281
           1       0.91      0.63      0.75        49

    accuracy                           1.00     28330
   macro avg       0.96      0.82      0.87     28330
weighted avg       1.00      1.00      1.00     28330



On the test set the model scores 63% recall and 91% precision on the fraud class.
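
Note that the cell above calls `model.fit(X_train, y_train)` on the untransformed features, while the cross-validated scores came from data that was scaled, resampled with NearMiss, cleaned of outliers and reduced with PCA. Below is a minimal sketch of how scaling and PCA could be bundled with the classifier so that the transformations fitted on the training data are reused on the test data. It runs on synthetic stand-in data, and the data, the variable names and `n_components=10` are all assumptions rather than values from this project. NearMiss and the custom outlier removal are left out; the `Pipeline` class in `imblearn.pipeline` accepts samplers such as NearMiss if resampling is needed.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# Scaler and PCA are fitted on the training data only, then reused at predict time
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(penalty="l1", C=0.1, solver="liblinear")),
])
pipe.fit(X_tr, y_tr)

y_pred = pipe.predict(X_te)
demo_recall = recall_score(y_te, y_pred)
demo_precision = precision_score(y_te, y_pred, zero_division=0)
print(demo_recall, demo_precision)
```

The scores printed here describe the synthetic data only and say nothing about the fraud dataset.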

## Save Data

Now I will save the model along with a string describing the transformations applied to the data.

In [10]:
# Tag matches the configuration selected above ('s2_r2_o3')
with open("Models/LogReg.pickle", "wb") as pickle_out:
    pickle.dump([model, 's2_r2_o3, pca'+str(n_components)], pickle_out)
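
To read such a file back later, the two-element list can be unpacked directly. This is a small round-trip sketch that uses a temporary file and placeholder values instead of the real model; `payload`, `demo_path` and the tag text are hypothetical.

```python
import os
import pickle
import tempfile

# Placeholder for [fitted model, transformation tag] (hypothetical values)
payload = [{"C": 0.1, "penalty": "l1"}, "example_tag, pca22"]

# Write to a temporary location rather than the project's Models/ folder
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pickle")
with open(demo_path, "wb") as f:
    pickle.dump(payload, f)

# Load and unpack in the same order the list was saved
with open(demo_path, "rb") as f:
    restored_model, restored_tag = pickle.load(f)

print(restored_tag)
```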

## Sources
1) Top rated Kaggle post: https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets