# Modeling - Sklearn Decision Tree

## Summary
If a simpler model can get results that are as good as a more complex model, it is better to use the simpler one, to maintain interpretability and keep computational costs low. In this notebook, I'm going to spend a little time seeing how good a base decision tree from sklearn can get. In the following cells I will:
- Recreate the baseline DT model
- Add a max depth to the model
- Gridsearch for a max depth
- Add class weights to the model

# DT Modeling

In [37]:
# Import statements
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import balanced_accuracy_score
# Importing metrics function from functions.py
from functions import metrics as custom_score
from functions import improvement as custom_change


In [17]:
# Load in cleaned data.

# Training Data
X_train = pd.read_csv('../Data/train/X_train.csv', index_col=0)
y_train = pd.read_csv('../Data/train/y_train.csv', index_col=0)

# Testing Data
X_test = pd.read_csv('../Data/test/X_test.csv', index_col=0)
y_test = pd.read_csv('../Data/test/y_test.csv', index_col=0)

## Base DT Model
Re-creating a base decision tree to compare our models too going forward

In [18]:
# Instantiating Tree
FSM_DT = DecisionTreeClassifier(random_state=15)

# Fitting Model
FSM_DT.fit(X_train, y_train)

# Score on the testing data.
FSM_results = custom_score(X_test, y_test, FSM_DT)

Accuracy: 0.90
Precision: 0.50
Recall: 0.54
F1: 0.52
ROC AUC: 0.74


### Analysis
Remember that **guessing one class, a "modeless baseline," would result in 90% accuracy**. There are some quick things we could do to improve this model's performance

## DT with Maximum Depth
The model is dramatically overfitting to the training data, making it less generalizable when it comes to unseen data. Setting a maximum depth should help.

In [19]:
# Checking depth of previous tree
FSM_DT.get_depth()

36

In [20]:
# Let's start with 25, and see if that leads to improvement.
DT_depth = DecisionTreeClassifier(max_depth=10, random_state=15)

# Fitting on training data
DT_depth.fit(X_train, y_train)

# Printing test results
print('Test results')
DT_depth_results = custom_score(X_test, y_test, DT_depth)

Test results
Accuracy: 0.92
Precision: 0.62
Recall: 0.52
F1: 0.56
ROC AUC: 0.74


In [21]:
# Printing the improvement
custom_change(FSM_results, DT_depth_results)

Accuracy        +0.02
Precision       +0.12
Recall          -0.02
F1              +0.04
ROCAUC          +0.00


### Analysis
An overall improvement, let's do a quick gridsearch to see what gives us the best results here.

## Gridsearch Depth

In [39]:
# Create search parameters, going up by 2, starting at 2 and going to 30
depth_parameters = {
    'max_depth': list(range(2, 30, 2))
}
# Creating the gridsearch object, using a DT as a classifier and accuracy as scoring metric
depth_search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=15),
                            param_grid=depth_parameters,
                            scoring='roc_auc')

# Fitting the model on traiining data
depth_search.fit(X_train, y_train)

GridSearchCV(estimator=DecisionTreeClassifier(random_state=15),
             param_grid={'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22,
                                       24, 26, 28, 30]},
             scoring='roc_auc')

In [40]:
# Finding best estimator
depth_search.best_estimator_

DecisionTreeClassifier(max_depth=6, random_state=15)

In [41]:
# Getting Score of gridsearch model
depth_search_results = custom_score(X_test, y_test, depth_search)

Accuracy: 0.93
Precision: 0.68
Recall: 0.50
F1: 0.58
ROC AUC: 0.74


In [42]:
# Comparing to previous model
custom_change(DT_depth_results, depth_search_results)

Accuracy        +0.01
Precision       +0.06
Recall          -0.01
F1              +0.02
ROCAUC          -0.00


## Analysis
That was a lot of time for small gain, but it still seems to be an overall improvement.

## Class Weights

Adding class weights should greatly help our model while requiring little effort implementing.

In [50]:
# Setting class weight to 'balanced' is equivalent to: n_samples / (n_classes * np.bincount(y))
DT_weighted = DecisionTreeClassifier(max_depth=6, class_weight='balanced', random_state=15)

# Fitting on training data
DT_weighted.fit(X_train, y_train)

# Printing test results
DT_weighted_results = custom_score(X_test, y_test, DT_depth)

Accuracy: 0.92
Precision: 0.62
Recall: 0.52
F1: 0.56
ROC AUC: 0.74


In [51]:
# Printing difference from last model.
custom_change(depth_search_results, DT_weighted_results)

Accuracy        -0.01
Precision       -0.06
Recall          +0.01
F1              -0.02
ROCAUC          +0.00


## Analysis
It seems that adding class weights resulted in worse performance overall. I think that the decision tree is going to continue to struggle from here, so rather then continuing to spend time on it, lets more to a more robust model type.