# Decision Tree Modeling

## Objectives
* Load data
* Tune hyperparameters for each version of the data
* Select a model
* Examine results
* Save results

#### Load data

I will start by importing the necessary libraries.

In [3]:
import pickle # loading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE # RFE
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import NearMiss
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, accuracy_score, roc_curve, auc, precision_recall_curve, confusion_matrix
from sklearn.model_selection import GridSearchCV
import time
from modeling_functions import *

import warnings
warnings.filterwarnings("ignore") 

Using TensorFlow backend.


In [4]:
# Load the (key, n_components) pairs; the file is closed automatically
with open("olr_keys_n_components.pickle", "rb") as pickle_in:
    olr_keys_n_components = list(pickle.load(pickle_in))

# Sanity Check
print(olr_keys_n_components)

[('s1_r1_o1', 16), ('s1_r1_o3', 16), ('s1_r2_o1', 16), ('s1_r2_o3', 15), ('s2_r1_o1', 24), ('s2_r1_o3', 25), ('s2_r2_o1', 19), ('s2_r2_o3', 20), ('s3_r1_o1', 23), ('s3_r1_o3', 24), ('s3_r2_o1', 19), ('s3_r2_o3', 18), ('s4_r1_o1', 21), ('s4_r1_o3', 22), ('s4_r2_o1', 16), ('s4_r2_o3', 18)]


In [5]:
# Load the cleaned data; the file is closed automatically
with open("cleaned_data.pickle", "rb") as pickle_in:
    clean_data = pickle.load(pickle_in)

X = clean_data['X']
y = clean_data['y']

#### Tune hyperparameters

My goal for this project is to create a model that can help alert a credit lender to suspicious activity. <br><br>

For this reason I want few false negatives, so recall will be my main metric. High recall means that few fraudulent transactions go undetected. <br><br>

My second metric will be precision, because I do not want false positives either. Low precision would cause the model to flag too much of the data as likely fraudulent. If the credit lender took preventative action on, say, every other transaction, that would be a nuisance to both the lender and its clients. <br><br>

However, precision does not need to be nearly as high as recall. If recall were, say, 80%, the model would potentially stop 80% of fraud, and if precision were, say, 20%, fewer than 1 in 100 transactions would be flagged as suspicious, because fraud accounts for only 0.17% of the data I was given. <br><br>
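The "fewer than 1 in 100" figure can be checked with a few lines of arithmetic. This is a minimal sketch using only the illustrative recall and precision values and the fraud rate stated in the text; the variable names are my own.

```python
# Back-of-the-envelope check of how many transactions would be flagged.
fraud_rate = 0.0017  # fraud prevalence stated in the text (0.17%)
recall = 0.80        # illustrative value from the text
precision = 0.20     # illustrative value from the text

# Share of ALL transactions that are fraudulent and caught by the model.
true_positive_share = recall * fraud_rate

# precision = true positives / flagged, so flagged = true positives / precision.
flagged_share = true_positive_share / precision

print(f"flagged share: {flagged_share:.2%}")  # prints: flagged share: 0.68%
```

Roughly 0.68% of transactions would be flagged under these assumptions, which is below 1 in 100.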

The F1-score is the harmonic mean of recall and precision. It is not the best metric here, though, because recall must be high while precision can get away with being much lower.<br><br>

The metrics mentioned above are calculated by comparing the known values to the model's predicted values. The simplified formulas for precision and recall are shown below. 
<img src="../Images/Precision_Recall.png"><br>
Image Source: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
<br><br>
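For readers who prefer code to a picture, the same definitions can be written as a small function. This is a sketch with made-up counts, not output from this project; `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Precision: of everything flagged as fraud, how much really was fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real fraud, how much was flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts: 80 frauds caught, 320 false alarms, 20 frauds missed.
p, r, f1 = precision_recall_f1(tp=80, fp=320, fn=20)
print(p, r, f1)
```

With these counts precision is 0.2 and recall is 0.8, yet F1 is only 0.32, which illustrates why F1 undersells a model that is deliberately tuned for recall.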

To tune the hyperparameters I will use my own function, customGridSearchCV, which has a docstring attached. The function transforms the data according to its parameters and returns the cross-validation scores for each method and for each combination of parameters.
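The source of customGridSearchCV lives in modeling_functions and is not shown here, so the following is only a rough sketch of the general idea: scale and rebalance the training folds, then score on an untouched validation fold. Everything in it is an assumption made for illustration. `undersample` and `cv_scores` are hypothetical names, plain random undersampling stands in for NearMiss, and the demo data is synthetic.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def undersample(X, y, rng):
    """Randomly drop majority-class (0) rows until both classes are equal in size."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]


def cv_scores(X, y, params, n_splits=3, seed=0):
    """Return mean recall and precision over stratified folds."""
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    recalls, precisions = [], []
    for train_idx, test_idx in folds.split(X, y):
        # Fit the scaler on the training fold only to avoid leakage.
        scaler = StandardScaler().fit(X[train_idx])
        X_train = scaler.transform(X[train_idx])
        X_test = scaler.transform(X[test_idx])
        # Rebalance the training fold; the validation fold keeps its real class mix.
        X_bal, y_bal = undersample(X_train, y[train_idx], rng)
        model = DecisionTreeClassifier(random_state=seed, **params).fit(X_bal, y_bal)
        pred = model.predict(X_test)
        recalls.append(recall_score(y[test_idx], pred))
        precisions.append(precision_score(y[test_idx], pred, zero_division=0))
    return float(np.mean(recalls)), float(np.mean(precisions))


# Synthetic, imbalanced demo data (about 5% positives), not the fraud dataset.
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(2000, 5))
y_demo = (demo_rng.random(2000) < 0.05).astype(int)
X_demo[y_demo == 1] += 2.0  # shift positives so the classes are separable

recall, precision = cv_scores(
    X_demo, y_demo, {"criterion": "entropy", "max_depth": 3, "min_samples_leaf": 5}
)
print(recall, precision)
```

Scoring on a validation fold that keeps the original class imbalance is one plausible reason for precision values far below recall, but whether the real function works this way is not confirmed by the notebook.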

The parameter grid in the cell below was taken from the <a href="https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets">top rated kaggle post</a>. I marked the URL as source 1.

In [6]:
# Decision tree classifier class (not instantiated here; it is passed to customGridSearchCV)
clf = DecisionTreeClassifier

# Create parameter grid (Source 1)
params = {
    "criterion": ["gini", "entropy"], 
    "max_depth": list(range(2,4,1)), 
    "min_samples_leaf": list(range(5,7,1))
}
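To see how many candidates this grid produces, the combinations can be enumerated with the standard library. This is a small sketch that repeats the grid so it runs on its own; the ordering of the combinations is an artifact of this sketch and need not match the order customGridSearchCV uses.

```python
from itertools import product

params = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(2, 4, 1)),         # range excludes the stop value: [2, 3]
    "min_samples_leaf": list(range(5, 7, 1)),  # [5, 6]
}

# Cartesian product of all value lists, one dict per combination.
keys = sorted(params)
combos = [dict(zip(keys, values)) for values in product(*(params[k] for k in keys))]

print(len(combos))  # prints: 8
for combo in combos:
    print(combo)
```

The grid is small: 2 criteria x 2 depths x 2 leaf sizes gives 8 combinations per data version.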

Warning: The following cell took 5 minutes to run!

In [7]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# No PCA

t1 = time.time()

# Record results
results_o1 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss())
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 5} 
 recall 0.8836706173237658 
 precision: 0.004867095882645422 
 f1-score: 0.009679644958261173 


 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 6} 
 recall 0.8836706173237658 
 precision: 0.004867095882645422 
 f1-score: 0.009679644958261173 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 5} 
 recall 0.9026579590959177 
 precision: 0.004462327788962611 
 f1-score: 0.008880360346696677 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 6} 
 recall 0.885780321965116 
 precision: 0.0048638860239000395 
 f1-score: 0.009673425844366297 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5} 
 recall 0.8836706173237658 
 precision: 0.004600227671537817 
 f1-score: 0.009152292376072449 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 6} 
 recall 0.8836706173237658 
 precision: 0.004

Warning: The following cell took 4 minutes to run!

In [8]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# outliers removed
# No PCA

t1 = time.time()

# Record results
results_o3 = {}
for scaler_str in ['s1', 's2', 's3', 's4']:
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), outlier_removal=True)
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 5} 
 recall 0.8836706173237658 
 precision: 0.004674433115584351 
 f1-score: 0.009298217160070727 


 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 6} 
 recall 0.8836706173237658 
 precision: 0.004488488536557425 
 f1-score: 0.00892973209275079 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 5} 
 recall 0.9026579590959177 
 precision: 0.0042612758819192366 
 f1-score: 0.008482432770083446 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 6} 
 recall 0.885780321965116 
 precision: 0.004488184235194659 
 f1-score: 0.008929205882395095 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5} 
 recall 0.885780321965116 
 precision: 0.004479571025357598 
 f1-score: 0.008912377442247058 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 6} 
 recall 0.8836706173237658 
 precision: 0.00486

Warning: The following cell took 8 minutes to run!

In [9]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# No outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o1_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o1' in key:
        o1_n_components.append(n)
        

# Record results
results_o1_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o1_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o1_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 5} 
 recall 0.9384557499529684 
 precision: 0.004971061058805759 
 f1-score: 0.009839004022137745 


 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 6} 
 recall 0.9384557499529684 
 precision: 0.004971061058805759 
 f1-score: 0.009839004022137745 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 5} 
 recall 0.9575505925985649 
 precision: 0.0025784857992051783 
 f1-score: 0.005140490678361256 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 6} 
 recall 0.9575505925985649 
 precision: 0.0025784857992051783 
 f1-score: 0.005140490678361256 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5} 
 recall 0.9405654545943186 
 precision: 0.004900966646975602 
 f1-score: 0.009699370997523389 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 6} 
 recall 0.9405654545943186 
 precision: 0.0

Warning: this cell took 8 minutes to run!

In [10]:
# Tune hyperparameters for all scalers 
# Implementing NearMiss
# Outliers removed
# PCA

t1 = time.time()


# Get correct scaler_str's and n_components
o3_n_components = []
for key, n in olr_keys_n_components:
    if 'r2_o3' in key:
        o3_n_components.append(n)
        

# Record results
results_o3_p = {}
for scaler_str, n in zip(['s1', 's2', 's3', 's4'], o3_n_components):
    print(scaler_str, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    results_o3_p[scaler_str] = customGridSearchCV(clf, params, X, y, 'manual', scaler_str, NearMiss(), 
                                                  outlier_removal=True, pca=PCA(n))
    
t2 = time.time()

print((t2 - t1)/60)

s1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 5} 
 recall 0.9363191700932569 
 precision: 0.005318035390584327 
 f1-score: 0.01050822525490222 


 {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 6} 
 recall 0.9363191700932569 
 precision: 0.005318035390584327 
 f1-score: 0.01050822525490222 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 5} 
 recall 0.9110027143970546 
 precision: 0.005450162346088447 
 f1-score: 0.010771171234230657 


 {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 6} 
 recall 0.9299900561692063 
 precision: 0.005299415211196146 
 f1-score: 0.010471077048846263 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5} 
 recall 0.9363191700932569 
 precision: 0.0053106255340101514 
 f1-score: 0.010493458054980883 


 {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 6} 
 recall 0.9363191700932569 
 precision: 0.0053

#### Model selection

Now I will look through each model's scores manually and decide which model performs best.
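The manual comparison can also be expressed as a ranking: highest recall first, with precision as the tie-breaker. The sketch below uses three rows copied from the printed s1 output above; the flat record layout and the labels are hypothetical and do not reflect the actual structure returned by customGridSearchCV.

```python
# Three rows copied from the s1 output printed earlier in this notebook.
rows = [
    {
        "label": "s1, no PCA, outliers kept",
        "params": {"criterion": "entropy", "max_depth": 3, "min_samples_leaf": 5},
        "recall": 0.9026579590959177,
        "precision": 0.004462327788962611,
    },
    {
        "label": "s1, PCA, outliers kept",
        "params": {"criterion": "entropy", "max_depth": 3, "min_samples_leaf": 5},
        "recall": 0.9575505925985649,
        "precision": 0.0025784857992051783,
    },
    {
        "label": "s1, PCA, outliers removed",
        "params": {"criterion": "entropy", "max_depth": 2, "min_samples_leaf": 5},
        "recall": 0.9363191700932569,
        "precision": 0.005318035390584327,
    },
]

# Rank by recall first (the main metric), then by precision.
best = max(rows, key=lambda row: (row["recall"], row["precision"]))
print(best["label"], best["params"])
```

Among these three rows the PCA run with outliers kept has the highest recall, but it also has the lowest precision, which is the trade-off to weigh when choosing.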

My choice is a model using the data scaled with StandardScaler, with the outliers removed and with PCA, because it had a high cross-validated recall of ~97%, although it had a very low precision of ~0.2%.
The model's parameters are as follows: <br> <br>
criterion: 'entropy'<br>
max_depth: 3<br>
min_samples_leaf: 5

##### Final Model

Now I will check the scores at each split of the model to make sure it is not overfitting.

In [11]:
# Best model
# Print cross val scores
for key, n in olr_keys_n_components:
    if key == 's2_r2_o3':
        n_components = n

model = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
customCV(model, X, y, 's2', NearMiss(), outlier_removal=True,
         pca=PCA(n_components), print_splits=True)

split 1
recall: 0.962
precision: 0.0026
f1: 0.0052
split 2
recall: 0.943
precision: 0.0018
f1: 0.0035
split 3
recall: 0.9936
precision: 0.0017
f1: 0.0034
Mean Scores:
Mean recall: 0.9662
Mean precision: 0.002
Mean f1: 0.004 



[0.9662312881292161, 0.0020172410914646517, 0.004025721278692943]
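One quick way to read the split scores is to look at how much they vary. The sketch below takes the rounded per-split values printed above and computes their mean and spread; it checks stability across validation splits only and does not compare training scores against validation scores.

```python
from statistics import mean, pstdev

# Rounded per-split scores as printed above.
split_recall = [0.962, 0.943, 0.9936]
split_precision = [0.0026, 0.0018, 0.0017]

recall_mean = mean(split_recall)
recall_range = max(split_recall) - min(split_recall)

print(f"mean recall:     {recall_mean:.4f}")
print(f"recall range:    {recall_range:.4f}")
print(f"recall std dev:  {pstdev(split_recall):.4f}")
print(f"mean precision:  {mean(split_precision):.4f}")
```

Recall stays within about five percentage points across the three splits, so the high mean is not driven by a single lucky split.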

#### Save Data

Now I will save the model along with a string representing the transformations applied to the data.

In [12]:
# The label matches the pipeline key 's2_r2_o3' selected in the cell above
with open("Models/DecisionTree.pickle", "wb") as pickle_out:
    pickle.dump([model, 's2_r2_o3, pca'+str(n_components)], pickle_out)
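Reading the file back later is the mirror image of the save step. The sketch below shows the round trip with a stand-in payload and a temporary path so that it runs on its own; in the notebook the first element would be the DecisionTreeClassifier and the path would be the one used above.

```python
import os
import pickle
import tempfile

# Stand-in payload: [model, transformation label]. A dict replaces the real
# classifier so the example has no dependencies.
payload = [
    {"criterion": "entropy", "max_depth": 3, "min_samples_leaf": 5},
    "example transformation label",
]

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "DecisionTree.pickle")

with open(path, "wb") as pickle_out:
    pickle.dump(payload, pickle_out)

# Loading returns the same two-element list, which unpacks directly.
with open(path, "rb") as pickle_in:
    stored_model, transform_label = pickle.load(pickle_in)

print(transform_label)
```

Storing the label next to the model makes it possible to rebuild the matching preprocessing steps before calling the model on new data.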

## Sources
1) Top rated kaggle post: https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets