# Hyperparameter Tuning and Pruning in Decision Trees - Lab

## Introduction

In this lab, you will use the titanic dataset to see the impact of tree pruning and hyperparameter tuning on the predictive performance of a decision tree classifier. Pruning reduces the size of decision trees by removing nodes of the tree that do not provide much predictive power to classify instances. Decision trees are the most susceptible out of all the machine learning algorithms to overfitting and effective pruning can reduce this likelihood. 

## Objectives

In this lab you will: 

- Determine the optimal hyperparameters for a decision tree model and evaluate the model performance

## Import necessary libraries

Let's first import the libraries you'll need for this lab. 

In [228]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
plt.style.use('seaborn-v0_8-darkgrid')

## Import the data

The titanic dataset, available in `'titanic.csv'`, is all cleaned up and preprocessed for you so that you can focus on pruning and optimization. Import the dataset and print the first five rows of the data: 

In [229]:
# Import the data
titanic = pd.read_csv("titanic.csv")
titanic

Unnamed: 0,PassengerId,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Survived
0,1,22.000000,1,0,7.2500,0,0,1,0,1,0,0,1,0
1,2,38.000000,1,0,71.2833,1,0,0,1,0,1,0,0,1
2,3,26.000000,0,0,7.9250,0,0,1,1,0,0,0,1,1
3,4,35.000000,1,0,53.1000,1,0,0,1,0,0,0,1,1
4,5,35.000000,0,0,8.0500,0,0,1,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,27.000000,0,0,13.0000,0,1,0,0,1,0,0,1,0
887,888,19.000000,0,0,30.0000,1,0,0,1,0,0,0,1,1
888,889,29.699118,1,2,23.4500,0,0,1,1,0,0,0,1,0
889,890,26.000000,0,0,30.0000,1,0,0,0,1,1,0,0,1


## Create training and test sets

- Assign the `'Survived'` column to `y` 
- Drop the `'Survived'` and `'PassengerId'` columns from `df`, and assign the resulting DataFrame to `X` 
- Split `X` and `y` into training and test sets. Assign 30% to the test set and set the `random_state` to `SEED` 

In [230]:
# Create X and y 
y = titanic["Survived"]
X = titanic.drop(columns= "Survived", axis = 1)

# Split into training and test sets
SEED = 1
X_train, X_test, y_train, y_test = train_test_split(X,y)

## Train a vanilla classifier

__Note:__ The term "vanilla" is used for a machine learning algorithm with its default settings (no tweaking/tuning).

- Instantiate a decision tree 
  - Use the `'entropy'` criterion and set the `random_state` to `SEED` 
- Fit this classifier to the training data 

In [231]:
# Train the classifier using training data
model = DecisionTreeClassifier(criterion="entropy", random_state= SEED)
model.fit(X_train, y_train)


## Make predictions 
- Create a set of predictions using the test set 
- Using `y_test` and `y_pred`, calculate the AUC (Area under the curve) to check the predictive performance

In [232]:
# Make predictions using test set 
vanilla_train = model.predict(X_train)
vanilla_test = model.predict(X_test)

# Check the AUC of predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, vanilla_train)
vanilla_train_accuracy = auc(false_positive_rate, true_positive_rate)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, vanilla_test)
vanilla_test_accuracy = auc(false_positive_rate, true_positive_rate)


## Maximum Tree Depth

Let's first check for the best depth parameter for our decision tree: 

- Create an array for `max_depth` values ranging from 1 - 32  
- In a loop, train the classifier for each depth value (32 runs) 
- Calculate the training and test AUC for each run 
- Plot a graph to show under/overfitting and the optimal value 
- Interpret the results 

In [233]:
# Identify the optimal tree depth for given data
depth = range(1,32)
train = []
test =[]
for d in depth:
    model = DecisionTreeClassifier(criterion="entropy", random_state= SEED, max_depth= d)
    model.fit(X_train, y_train)
    
    # Predict on training and test set
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # Calculate accuracy
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    train_acc = auc(false_positive_rate, true_positive_rate)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, test_pred)
    test_acc = auc(false_positive_rate, true_positive_rate)

    # Store accuracies
    train.append(train_acc)
    test.append(test_acc)
    # for x in train.index:
    #     for y in test.index:
    #         #find absolute error
    #         Error = (x - y)

In [234]:
# Your observations here 
#"Error": Error}
Accuracy = pd.DataFrame({"Training_accuracy": train, "Test_accuracy": test})
Accuracy[:5]

Unnamed: 0,Training_accuracy,Test_accuracy
0,0.766358,0.76958
1,0.750733,0.719274
2,0.80964,0.822552
3,0.813457,0.819056
4,0.838388,0.774344


In [235]:
Accuracy["Test_accuracy"].max()

0.8225524475524476

## Minimum Sample Split

Now check for the best `min_samples_splits` parameter for our decision tree 

- Create an array for `min_sample_splits` values ranging from 0.1 - 1 with an increment of 0.1 
- In a loop, train the classifier for each `min_samples_splits` value (10 runs) 
- Calculate the training and test AUC for each run 
- Plot a graph to show under/overfitting and the optimal value 
- Interpret the results

In [236]:
# Identify the optimal min-samples-split for given data
min_sample_splits = np.arange(0.1, 1, 0.1).astype(float)
train_AUC = []
test_AUC = []
SEED = 1
for x in min_sample_splits:
    model2 = DecisionTreeClassifier(criterion="entropy", random_state=SEED, min_samples_split = x)
    model2.fit(X_train, y_train)
    
    train_pred = model2.predict(X_train)
    test_pred = model2.predict(X_test)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    train_acc = auc(false_positive_rate, true_positive_rate)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, test_pred)
    test_acc = auc(false_positive_rate, true_positive_rate)
    
    train_AUC.append(train_acc)
    test_AUC.append(test_acc)
    
    

In [237]:
# Your observations here
AUC = pd.DataFrame({"Train accuracy":train_AUC, "Test accuracy": test_AUC})
AUC


Unnamed: 0,Train accuracy,Test accuracy
0,0.824841,0.756862
1,0.75455,0.712281
2,0.752642,0.719274
3,0.785254,0.777098
4,0.785254,0.777098
5,0.785254,0.777098
6,0.766358,0.76958
7,0.766358,0.76958
8,0.766358,0.76958


In [238]:
AUC["Test accuracy"].max()

0.7770979020979021

## Minimum Sample Leafs

Now check for the best `min_samples_leafs` parameter value for our decision tree 

- Create an array for `min_samples_leafs` values ranging from 0.1 - 0.5 with an increment of 0.1 
- In a loop, train the classifier for each `min_samples_leafs` value (5 runs) 
- Calculate the training and test AUC for each run 
- Plot a graph to show under/overfitting and the optimal value 
- Interpret the results

In [239]:
# Calculate the optimal value for minimum sample leafs
min_leaf = np.arange(0.1, 0.5, 0.1)

train_auc = []
test_auc = []

for x in min_leaf:
    model = DecisionTreeClassifier(criterion="entropy", random_state=SEED, min_samples_leaf= x)
    model.fit(X_train, y_train)
    
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    train_auc2 = auc(false_positive_rate, true_positive_rate)
    false_positive_rate, true_positive_rate,thresholds = roc_curve(y_test, test_pred)
    test_auc2 = auc(false_positive_rate, true_positive_rate)
    
    train_auc.append(train_auc2)
    test_auc.append(test_auc2)
    

In [240]:
# Your observations here 
Results = pd.DataFrame({"Train accuracy": train_auc, "Test accuracy": test_auc})
Results

Unnamed: 0,Train accuracy,Test accuracy
0,0.731837,0.711757
1,0.766358,0.76958
2,0.766358,0.76958
3,0.673993,0.633654


In [241]:
Results["Test accuracy"].max()

0.7695804195804196

## Maximum Features

Now check for the best `max_features` parameter value for our decision tree 

- Create an array for `max_features` values ranging from 1 - 12 (1 feature vs all)
- In a loop, train the classifier for each `max_features` value (12 runs) 
- Calculate the training and test AUC for each run 
- Plot a graph to show under/overfitting and the optimal value 
- Interpret the results

In [242]:
# Find the best value for optimal maximum feature size
max_feat = np.arange(1, 13)

Train_auc = []
Test_auc = []

for x in max_feat:
    model = DecisionTreeClassifier(criterion= "entropy", random_state= SEED, max_features= x)
    model.fit(X_train, y_train)
    
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    auc_train_score =  auc(false_positive_rate, true_positive_rate)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, test_pred)
    auc_test_score = auc(false_positive_rate, true_positive_rate)
    
    Train_auc.append(auc_train_score)
    Test_auc.append(auc_test_score)


In [243]:
# Your observations here
max_results = pd.DataFrame({"Train_score": Train_auc, "Test_score": Test_auc})
max_results

Unnamed: 0,Train_score,Test_score
0,1.0,0.766084
1,1.0,0.721372
2,1.0,0.726879
3,1.0,0.737369
4,1.0,0.773601
5,1.0,0.737369
6,1.0,0.719362
7,1.0,0.748601
8,1.0,0.740865
9,1.0,0.767351


In [244]:
max_results["Test_score"].max()

0.7736013986013985

## Re-train the classifier with chosen values

Now we will use the best values from each training phase above and feed it back to our classifier. Then we can see if there is any improvement in predictive performance. 

- Train the classifier with the optimal values identified 
- Compare the AUC of the new model with the earlier vanilla decision tree AUC 
- Interpret the results of the comparison

In [245]:
# Train a classifier with optimal values identified above
model3 = DecisionTreeClassifier(criterion="entropy",max_depth=3,
                                min_samples_split=4, min_samples_leaf=2, max_features=7)
model3.fit(X_train, y_train)

train_predict = model3.predict(X_train)
test_predict = model3.predict(X_test)


false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_predict)
Train_accuracy = auc(false_positive_rate, true_positive_rate)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, test_predict)
Test_accuracy = auc(false_positive_rate, true_positive_rate)


# false_positive_rate, true_positive_rate, thresholds = None
# roc_auc = None
# roc_auc

In [249]:
# Your observations here
print(f"The vanilla model train accuracy is {vanilla_train_accuracy}, and for the test is {vanilla_test_accuracy}")
print(f"The final model train auc is {Train_accuracy}, and the test accuracy is {Test_accuracy}")

The vanilla model train accuracy is 1.0, and for the test is 0.7213723776223776
The final model train auc is 0.7968920392584515, and the test accuracy is 0.791083916083916


In order to address the issue of a baseline classifier performing better than a tuned one like this, a more-sophisticated technique is called a "grid search" and this will be introduced in a future lesson.

## Summary 

In this lesson, we looked at tuning a decision tree classifier in order to avoid overfitting and increasing the generalization capabilities of the classifier. For the titanic dataset, we see that identifying optimal parameter values can result in some improvements towards predictions. This idea will be exploited further in upcoming lessons and labs. 