# Hyperparameter Tuning and Pruning in Decision Trees - Lab

## Introduction

In this lab, you will use the titanic dataset to see the impact of tree pruning and hyperparameter tuning on the predictive performance of a decision tree classifier. Pruning reduces the size of decision trees by removing nodes of the tree that do not provide much predictive power to classify instances. Decision trees are the most susceptible out of all the machine learning algorithms to overfitting and effective pruning can reduce this likelihood. 

## Objectives

In this lab you will: 

- Determine the optimal hyperparameters for a decision tree model and evaluate the model performance

## Import necessary libraries

Let's first import the libraries you'll need for this lab. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
#plt.style.use('seaborn-v0_8-darkgrid')

## Import the data

The titanic dataset, available in `'titanic.csv'`, is all cleaned up and preprocessed for you so that you can focus on pruning and optimization. Import the dataset and print the first five rows of the data: 

In [2]:
# Import the data
df = pd.read_csv('titanic.csv')
df.head()

Unnamed: 0,PassengerId,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Survived
0,1,22.0,1,0,7.25,0,0,1,0,1,0,0,1,0
1,2,38.0,1,0,71.2833,1,0,0,1,0,1,0,0,1
2,3,26.0,0,0,7.925,0,0,1,1,0,0,0,1,1
3,4,35.0,1,0,53.1,1,0,0,1,0,0,0,1,1
4,5,35.0,0,0,8.05,0,0,1,0,1,0,0,1,0


## Create training and test sets

- Assign the `'Survived'` column to `y` 
- Drop the `'Survived'` and `'PassengerId'` columns from `df`, and assign the resulting DataFrame to `X` 
- Split `X` and `y` into training and test sets. Assign 30% to the test set and set the `random_state` to `SEED` 

In [3]:
# Create X and y 
y = df['Survived']
X = df.drop(columns=['Survived', 'PassengerId'])

# Split into training and test sets
SEED = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.030, random_state=SEED)

## Train a vanilla classifier

__Note:__ The term "vanilla" is used for a machine learning algorithm with its default settings (no tweaking/tuning).

- Instantiate a decision tree 
  - Use the `'entropy'` criterion and set the `random_state` to `SEED` 
- Fit this classifier to the training data 

In [4]:
# Train the classifier using training data
dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED)
dt.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=1)

## Make predictions 
- Create a set of predictions using the test set 
- Using `y_test` and `y_pred`, calculate the AUC (Area under the curve) to check the predictive performance

In [5]:
# Make predictions using test set 
y_pred = dt.predict(X_test)

# Check the AUC of predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc

0.8131868131868133

## Maximum Tree Depth

Let's first check for the best depth parameter for our decision tree: 

- Create an array for `max_depth` values ranging from 1 - 32  
- In a loop, train the classifier for each depth value (32 runs) 
- Calculate the training and test AUC for each run 
- Plot a graph to show under/overfitting and the optimal value 
- Interpret the results 

In [6]:
# Identify the optimal tree depth for given data

depth_array = np.array(range(1, 33))
for i in depth_array:
    # Train the classifier using training data
    dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED, max_depth=i)
    dt.fit(X_train, y_train)
    # Make predictions using test set 
    y_pred = dt.predict(X_test)

    # Check the AUC of predictions
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print(f"roc_auc for {i} is {roc_auc}")
    

roc_auc for 1 is 0.8131868131868133
roc_auc for 2 is 0.7692307692307692
roc_auc for 3 is 0.8131868131868133
roc_auc for 4 is 0.7747252747252749
roc_auc for 5 is 0.7747252747252749
roc_auc for 6 is 0.7747252747252749
roc_auc for 7 is 0.8104395604395604
roc_auc for 8 is 0.8104395604395604
roc_auc for 9 is 0.8489010989010989
roc_auc for 10 is 0.8489010989010989
roc_auc for 11 is 0.8131868131868133
roc_auc for 12 is 0.7390109890109889
roc_auc for 13 is 0.7390109890109889
roc_auc for 14 is 0.8131868131868133
roc_auc for 15 is 0.7774725274725275
roc_auc for 16 is 0.8131868131868133
roc_auc for 17 is 0.8131868131868133
roc_auc for 18 is 0.8131868131868133
roc_auc for 19 is 0.8131868131868133
roc_auc for 20 is 0.7774725274725275
roc_auc for 21 is 0.7417582417582418
roc_auc for 22 is 0.8131868131868133
roc_auc for 23 is 0.8131868131868133
roc_auc for 24 is 0.8131868131868133
roc_auc for 25 is 0.8131868131868133
roc_auc for 26 is 0.8131868131868133
roc_auc for 27 is 0.8131868131868133
roc_auc fo

In [7]:
# Your observations here 
#the best model to choose from is 9 and 10 as it has the larger auc

## Minimum Sample Split

Now check for the best `min_samples_splits` parameter for our decision tree 

- Create an array for `min_sample_splits` values ranging from 0.1 - 1 with an increment of 0.1 
- In a loop, train the classifier for each `min_samples_splits` value (10 runs) 
- Calculate the training and test AUC for each run 
- Plot a graph to show under/overfitting and the optimal value 
- Interpret the results

In [13]:
# Identify the optimal min-samples-split for given data
split_array = np.arange(0.1, 1.1, 0.1)
for i in split_array:
    # Train the classifier using training data
    dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED, min_samples_split=i)
    dt.fit(X_train, y_train)
    # Make predictions using test set 
    y_pred = dt.predict(X_test)

    # Check the AUC of predictions
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print(f"roc_auc for {i} is {roc_auc}")
    

roc_auc for 0.1 is 0.8131868131868133
roc_auc for 0.2 is 0.7692307692307692
roc_auc for 0.30000000000000004 is 0.7692307692307692
roc_auc for 0.4 is 0.8131868131868133
roc_auc for 0.5 is 0.8131868131868133
roc_auc for 0.6 is 0.8131868131868133
roc_auc for 0.7000000000000001 is 0.8131868131868133
roc_auc for 0.8 is 0.8131868131868133
roc_auc for 0.9 is 0.8131868131868133
roc_auc for 1.0 is 0.8131868131868133


In [14]:
# Your observations here
#for some samples the auc is fairly constant but changes for only 0.2 and 0.3
#best result will be given from any but 0.2 and 0.3

## Minimum Sample Leafs

Now check for the best `min_samples_leafs` parameter value for our decision tree 

- Create an array for `min_samples_leafs` values ranging from 0.1 - 0.5 with an increment of 0.1 
- In a loop, train the classifier for each `min_samples_leafs` value (5 runs) 
- Calculate the training and test AUC for each run 
- Plot a graph to show under/overfitting and the optimal value 
- Interpret the results

In [17]:
# Calculate the optimal value for minimum sample leafs
split_array = np.arange(0.1, 0.5, 0.1)
for i in split_array:
    # Train the classifier using training data
    dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED, min_samples_leaf=i)
    dt.fit(X_train, y_train)
    # Make predictions using test set 
    y_pred = dt.predict(X_test)

    # Check the AUC of predictions
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print(f"roc_auc for {i} is {roc_auc}")
    

roc_auc for 0.1 is 0.7692307692307692
roc_auc for 0.2 is 0.8131868131868133
roc_auc for 0.30000000000000004 is 0.8131868131868133
roc_auc for 0.4 is 0.7032967032967032


In [None]:
# Your observations here 
#the auc only changes for 0.1 and 0.4

## Maximum Features

Now check for the best `max_features` parameter value for our decision tree 

- Create an array for `max_features` values ranging from 1 - 12 (1 feature vs all)
- In a loop, train the classifier for each `max_features` value (12 runs) 
- Calculate the training and test AUC for each run 
- Plot a graph to show under/overfitting and the optimal value 
- Interpret the results

In [19]:
# Find the best value for optimal maximum feature size
split_array = np.array(range(1, 13))
for i in split_array:
    # Train the classifier using training data
    dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED, max_features=i)
    dt.fit(X_train, y_train)
    # Make predictions using test set 
    y_pred = dt.predict(X_test)

    # Check the AUC of predictions
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print(f"roc_auc for {i} is {roc_auc}")
    

roc_auc for 1 is 0.7774725274725275
roc_auc for 2 is 0.8159340659340659
roc_auc for 3 is 0.7774725274725275
roc_auc for 4 is 0.7802197802197803
roc_auc for 5 is 0.7417582417582418
roc_auc for 6 is 0.8131868131868133
roc_auc for 7 is 0.8159340659340659
roc_auc for 8 is 0.7774725274725275
roc_auc for 9 is 0.8159340659340659
roc_auc for 10 is 0.8131868131868133
roc_auc for 11 is 0.8131868131868133
roc_auc for 12 is 0.8131868131868133


In [None]:
# Your observations here
#best model to choose from is 2 and 7 as they have a hgher auc value

## Re-train the classifier with chosen values

Now we will use the best values from each training phase above and feed it back to our classifier. Then we can see if there is any improvement in predictive performance. 

- Train the classifier with the optimal values identified 
- Compare the AUC of the new model with the earlier vanilla decision tree AUC 
- Interpret the results of the comparison

In [21]:
# Train a classifier with optimal values identified above
dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED, max_features=2, max_depth=9, min_samples_split=0.2)
dt.fit(X_train, y_train)
# Make predictions using test set 
y_pred = dt.predict(X_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc

0.7692307692307692

In [None]:
# Your observations here
#our model has a lower auc as the vanila model which is 0.8131868131868133

In order to address the issue of a baseline classifier performing better than a tuned one like this, a more-sophisticated technique is called a "grid search" and this will be introduced in a future lesson.

## Summary 

In this lesson, we looked at tuning a decision tree classifier in order to avoid overfitting and increasing the generalization capabilities of the classifier. For the titanic dataset, we see that identifying optimal parameter values can result in some improvements towards predictions. This idea will be exploited further in upcoming lessons and labs. 