# Random Forest Modeling

## Objectives
* Load data
* Tune hyperparameters for each version of the data
* Select a model
* Examine results
* Save results

#### Load data

I will start by importing the necessary libraries.

In [1]:
import pickle # loading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE # RFE
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import NearMiss
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, accuracy_score, roc_curve, auc, precision_recall_curve, confusion_matrix
from sklearn.model_selection import GridSearchCV
import time
from modeling_functions import *

import warnings
warnings.filterwarnings("ignore") 

## Pickle

In [2]:
# Load data
with open("olr_keys_n_components.pickle", "rb") as pickle_in:
    olr_keys_n_components = list(pickle.load(pickle_in))

# Sanity Check
print(olr_keys_n_components)

[('s1_r1_o1', 16), ('s1_r1_o3', 16), ('s1_r2_o1', 16), ('s1_r2_o3', 15), ('s2_r1_o1', 24), ('s2_r1_o3', 25), ('s2_r2_o1', 19), ('s2_r2_o3', 20), ('s3_r1_o1', 23), ('s3_r1_o3', 24), ('s3_r2_o1', 19), ('s3_r2_o3', 18), ('s4_r1_o1', 21), ('s4_r1_o3', 22), ('s4_r2_o1', 16), ('s4_r2_o3', 18)]


In [3]:
# Load data
with open("../clean_data.pickle", "rb") as pickle_in:
    df = pickle.load(pickle_in)

# Separate X and y
X = df.drop('Class', axis=1)
y = df.Class

df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


## Tune hyperparameters

My goal for this project is to create a model that can help alert a credit lender to suspicious activity. <br><br>

For this reason I want few false negatives, so I will use recall as my main metric. High recall means that few fraudulent transactions are left undetected. <br><br>

My second metric will be precision because I do not want false positives either. Low precision would cause the model to flag too much of the data as likely fraudulent. If the credit lender chose to take preventative action on, say, every other transaction, that would be a nuisance to both the credit lender and the clients. <br><br>

However, precision does not need to be nearly as high as recall. If recall were, say, 80%, then I would have potentially stopped 80% of fraud, and if precision were, say, 20%, then fewer than 1 out of 100 transactions would be flagged as suspicious, because fraud accounts for only 0.17% of the data I was given. <br><br>
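The arithmetic behind that estimate can be checked directly. This is a minimal sketch using only the illustrative figures from the paragraph above (80% recall, 20% precision, 0.17% fraud rate); the variable names are my own and the numbers are not measured results.

```python
# Illustrative figures from the text, not measured results
fraud_rate = 0.0017   # share of all transactions that are fraudulent
recall = 0.80         # share of fraudulent transactions the model catches
precision = 0.20      # share of flagged transactions that are really fraud

# True positives as a share of all transactions
true_positive_share = fraud_rate * recall

# precision = true positives / flagged, so flagged = true positives / precision
flagged_share = true_positive_share / precision

print(f"Flagged share of all transactions: {flagged_share:.2%}")
```

Under these assumptions roughly 0.68% of transactions would be flagged, which is below the 1 in 100 threshold.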

F1-score is the harmonic mean of recall and precision. It is not the best metric here, though, because recall must be high while precision can get away with being much lower.<br><br>

The metrics mentioned above are calculated by comparing the known values to the model's predicted values. The simplified formulas for precision and recall are shown below. 
<img src="../Images/Precision_Recall.png"><br>
Image Source: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
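
The same formulas can be written as a few lines of code. This is a small sketch with a hypothetical helper, `precision_recall_f1`, and made-up counts; it is not part of `modeling_functions`.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # precision: of everything flagged as fraud, how much really was fraud
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: of all real fraud, how much was flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 frauds caught, 20 frauds missed, 320 false alarms
precision, recall, f1 = precision_recall_f1(tp=80, fp=320, fn=20)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

With these counts recall is 0.80 but F1 is only 0.32, because the harmonic mean is pulled toward the lower of the two values. That is why F1 undersells a model that is deliberately tuned for recall.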
<br><br>

To tune the hyperparameters I will use my own function called customGridSearchCV. It has a docstring attached. The function transforms the data according to its parameters and returns the cross-validation scores for each method as well as for each combination of parameters.

The parameter grid in the cell below was taken from the <a href=https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets>top rated Kaggle post</a>, which I cite as source 1.

## Random Forest Classifier

In [4]:
# Random forest classifier (the class itself is passed to customGridSearchCV)
clf = RandomForestClassifier

# Create parameter grid (Source 1)
params = {
    "criterion": ["gini", "entropy"], 
    "max_depth": list(range(2,4,1)), 
    "min_samples_leaf": list(range(5,7,1)),
    "n_estimators": [10]
}
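
To get a feel for the size of this search, the grid can be expanded by hand. This sketch assumes that customGridSearchCV tries every combination in the grid, which the printed output suggests but which is not shown here; the expansion uses `itertools.product` and is not the function's actual code.

```python
from itertools import product

params = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(2, 4, 1)),         # [2, 3]
    "min_samples_leaf": list(range(5, 7, 1)),  # [5, 6]
    "n_estimators": [10],
}

# Build one dict per combination of parameter values
keys = sorted(params)
combinations = [
    dict(zip(keys, values))
    for values in product(*(params[key] for key in keys))
]

print(len(combinations), "combinations per scaler")
for combo in combinations:
    print(combo)
```

The grid yields 2 x 2 x 2 x 1 = 8 combinations, so each of the tuning cells below evaluates 8 parameter sets for each of the four scalers.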

Warning: The following cell (cell 5) took minutes to run!

In [5]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# No PCA

t1 = time.time()

# Record results
results_o1 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss())
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.8561907065494907 
 precision: 0.016915973345694387 
 f1-score: 0.032962233790608676 


 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 6, 'n_estimators': 10} 
 recall 0.8499019054529819 
 precision: 0.049176798505088354 
 f1-score: 0.09141070307340372 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.8815340374640543 
 precision: 0.009597594792957619 
 f1-score: 0.018971067884175744 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 6, 'n_estimators': 10} 
 recall 0.9027923351877233 
 precision: 0.004862972609499556 
 f1-score: 0.009672825225653173 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.8413824612324975 
 precision: 0.01162661841576189 
 f1-score: 0.022884592558373048 


 {'criteri

Warning: The following cell took 4 minutes to run!

In [6]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# No PCA

t1 = time.time()

# Record results
results_o3 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), outlier_removal=True)
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.8561772689403101 
 precision: 0.02601376317503507 
 f1-score: 0.05005798171249858 


 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 6, 'n_estimators': 10} 
 recall 0.8435056034830284 
 precision: 0.06595397360835854 
 f1-score: 0.1130338080678706 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.862533258082722 
 precision: 0.012803445284855913 
 f1-score: 0.025180689502753442 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 6, 'n_estimators': 10} 
 recall 0.9048886022198931 
 precision: 0.005823538874727411 
 f1-score: 0.011571296273044929 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.8604369910505523 
 precision: 0.013780363849539562 
 f1-score: 0.02712611895028498 


 {'criterion': 

Warning: The following cell took 8 minutes to run!

In [7]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o1_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o1' in key:
        o1_n_components.append(n)
        

# Record results
results_o1_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o1_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.9384557499529684 
 precision: 0.004249756052705766 
 f1-score: 0.008428428054906135 


 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 6, 'n_estimators': 10} 
 recall 0.9174393291945497 
 precision: 0.006532853451885703 
 f1-score: 0.012898250701960545 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.989384288747346 
 precision: 0.0023763715960965046 
 f1-score: 0.004740177399083042 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 6, 'n_estimators': 10} 
 recall 0.9809589077911257 
 precision: 0.0026952272123693575 
 f1-score: 0.005374446286262067 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.9575371549893843 
 precision: 0.003372731599676626 
 f1-score: 0.006709068290510367 


 {'crit

Warning: The following cell took 8 minutes to run!

In [8]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o3_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o3' in key:
        o3_n_components.append(n)
        

# Record results
results_o3_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o3_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), 
                                                  outlier_removal=True, pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.9341825902335458 
 precision: 0.004526528092221553 
 f1-score: 0.008967910927366795 


 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 6, 'n_estimators': 10} 
 recall 0.9023354564755839 
 precision: 0.0017670278530406327 
 f1-score: 0.003526910858629473 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.9640409578327823 
 precision: 0.002040012260141031 
 f1-score: 0.004071171383196943 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 6, 'n_estimators': 10} 
 recall 0.9662178505200355 
 precision: 0.0024267426572209196 
 f1-score: 0.004839714238909065 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 10} 
 recall 0.9237684431186004 
 precision: 0.0024170237276189797 
 f1-score: 0.004820351644035101 


 {'cr

# Model selection

Now I will look through each model's scores manually and decide which one performs best.

My choice is a model using the data scaled with StandardScaler, with the outliers removed and with PCA, because it had a high cross-validated recall of ~93%, although it had a very low precision of ~1%.
The model's parameters are as follows: <br> <br>
criterion: 'gini' <br>
max_depth: 3<br>
min_samples_leaf: 6<br>
n_estimators: 10

##### Final Model

Now I will check the scores at each split of the model to make sure it is not overfitting.

In [9]:
# Best model
# Print cross val scores
for key, n in olr_keys_n_components:
    if key == 's1_r2_o1':
        n_components = n

model = RandomForestClassifier(criterion='entropy', max_depth=3, min_samples_leaf=6, n_estimators=10)
customCV(model, X, y, 's1', NearMiss(), outlier_removal=True,
         pca=PCA(n_components), print_splits=True)

split 1
recall: 1.0
precision: 0.0017
f1: 0.0035
split 2
recall: 0.943
precision: 0.0029
f1: 0.0058
split 3
recall: 0.8917
precision: 0.0036
f1: 0.0072
Mean Scores:
Mean recall: 0.9449
Mean precision: 0.0027
Mean f1: 0.0055 



[0.9449192399688249, 0.0027484282851757955, 0.005479248226513808]
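
One rough way to read these numbers is to look at how much the scores vary from split to split. The sketch below copies the rounded per-split values from the printed output above; the choice of standard deviation and range as summaries is my own. A fuller overfitting check would also compare training scores against validation scores, which are not printed here.

```python
from statistics import mean, pstdev

# Per-split scores copied from the printed output above (rounded values)
recalls = [1.0, 0.943, 0.8917]
precisions = [0.0017, 0.0029, 0.0036]

for name, scores in [("recall", recalls), ("precision", precisions)]:
    spread = max(scores) - min(scores)  # gap between best and worst split
    print(f"{name}: mean={mean(scores):.4f}, "
          f"std={pstdev(scores):.4f}, range={spread:.4f}")
```

Recall stays between roughly 0.89 and 1.0 across the three splits, while precision stays below 0.004 in every split.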

#### Save Data

Now I will save the model along with a string representing the transformations applied to the data.

In [10]:
with open("Models/RandomForest.pickle", "wb") as pickle_out:
    pickle.dump([model, 's1_r2_o1, pca'+str(n_components)], pickle_out)
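
For reference, here is a sketch of how a file saved in this two-element format could be read back and its transformation string parsed. It uses a temporary file and a placeholder dictionary in place of the fitted model, so the path, the placeholder and the parsing logic are all assumptions for illustration.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the fitted RandomForestClassifier
model_placeholder = {"criterion": "entropy", "max_depth": 3}
saved = [model_placeholder, "s1_r2_o1, pca" + str(16)]

# Write to a temporary location rather than Models/RandomForest.pickle
path = os.path.join(tempfile.mkdtemp(), "RandomForest.pickle")
with open(path, "wb") as pickle_out:
    pickle.dump(saved, pickle_out)

# Read the list back and unpack the model and its description
with open(path, "rb") as pickle_in:
    loaded_model, transform_str = pickle.load(pickle_in)

# Split "s1_r2_o1, pca16" into the data key and the PCA component count
data_key, pca_part = [part.strip() for part in transform_str.split(",")]
n_components_loaded = int(pca_part.replace("pca", ""))
print(data_key, n_components_loaded)
```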

## Sources
1) Top rated kaggle post: https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets