# Implementing a Decision Tree Classifier

First, we import the usual libraries.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset

For this example, we will use seaborn's built-in `"titanic"` dataset. We can load the dataset and save it to a dataframe using `sns.load_dataset()`.

In [2]:
df = sns.load_dataset("titanic")

Let's get an idea of what this dataset contains.

In [3]:
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


For this example, let's predict whether a given passenger survived or not. There are many features here, so to simplify things, let's just go with four features, `"sex"`, `"pclass"`, `"age"`, and `"fare"`, plus the target `"survived"`.

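Note that some entries have missing values (`"age"`, for example). One option is to use `.dropna()` when we filter to get rid of those entries. The cell below does not do this, so the rows with missing values stay in the data; decision trees in recent versions of scikit-learn (1.3 and later) accept missing values in numeric features.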
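As a minimal sketch of what `.dropna()` does, here is a small made-up frame (the name `toy` and its values are illustrative assumptions, not rows from the dataset). It counts the missing values per column and then drops every row that contains one:

```python
import numpy as np
import pandas as pd

# Small made-up frame; the values are illustrative only.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "fare": [7.25, 71.28, np.nan],
    "survived": [0, 1, 1],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()

# dropna() returns a new frame without any row that contains a missing value.
toy_clean = toy.dropna()

print(missing_per_column)
print(toy_clean)
```

Because `.dropna()` returns a new dataframe, the result has to be assigned to a variable for the change to stick.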

In [5]:
df_filtered = df[["sex", "pclass", "age", "fare", "survived"]]
df_filtered.head()

,sex,pclass,age,fare,survived
0,male,3,22.0,7.25,0
1,female,1,38.0,71.2833,1
2,female,3,26.0,7.925,1
3,female,1,35.0,53.1,1
4,male,3,35.0,8.05,0


Now let's define our feature matrix `X` and target vector `y` using this data.

In [6]:
X = df_filtered[["sex", "pclass", "age", "fare"]]
y = df_filtered["survived"]

In [8]:
y.head()

,survived
0,0
1,1
2,1
3,1
4,0


### Trying to Run the Decision Tree Classifier

Now we import the model and try to fit it to the data:

- For linear and logistic regression, we had to import `LinearRegression` or `LogisticRegression` from `sklearn.linear_model`.

- To make a decision tree, we have to import `DecisionTreeClassifier` from `sklearn.tree`.

In [9]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

Let's fit it to the data:

In [10]:
model.fit(X, y)

ValueError: could not convert string to float: 'male'

Looks like we got an error...

### One Hot Encoding


We got an error because every feature passed to `sklearn`'s `DecisionTreeClassifier` (even the categorical ones) must be either:
- numerical, or
- something that can be converted to numbers automatically
  - like `True`/`False`, which are converted to `1` and `0`

Currently, the `"sex"` feature is not of this form. There is no automatic way to convert the strings `"male"` and `"female"` to numbers.
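One way to spot such columns before fitting is to ask pandas which columns are neither numeric nor boolean. The sketch below uses a small made-up frame (`toy_X` is a hypothetical stand-in for the feature matrix):

```python
import pandas as pd

# Made-up stand-in for the feature matrix; values are illustrative only.
toy_X = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "pclass": [3, 1, 3],
    "age": [22.0, 38.0, 26.0],
})

# Columns that are neither numeric nor boolean still need to be encoded.
non_numeric = toy_X.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric)
```

Here only `"sex"` is reported, which is the same column the error message complained about.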


This is where one-hot encoding comes in. Recall that one-hot encoding does the following:
- Take each *value* of a categorical variable (e.g. `"male"` is a value of the variable `"sex"`)
- Make a new feature just for that value.
- Examples with that value get a `1` for the new feature
- Examples with a different value get a `0` for the new feature

To do this, we run `pd.get_dummies` on our feature matrix and store the result in another variable.  Let's see what happens when we do this:

In [11]:
X_encoded = pd.get_dummies(X)

In [12]:
X_encoded.head()

,pclass,age,fare,sex_female,sex_male
0,3,22.0,7.25,False,True
1,1,38.0,71.2833,True,False
2,3,26.0,7.925,True,False
3,1,35.0,53.1,True,False
4,3,35.0,8.05,False,True
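To make the encoding steps concrete, here is a minimal sketch that builds the indicator columns by hand on a small made-up frame and checks them against `pd.get_dummies`. The names `toy_X`, `manual`, and `auto` are assumptions for this illustration:

```python
import pandas as pd

# Made-up frame with one categorical column; values are illustrative only.
toy_X = pd.DataFrame({
    "pclass": [3, 1, 3],
    "sex": ["male", "female", "female"],
})

# Build one indicator column per distinct value of "sex".
manual = toy_X.drop(columns="sex")
for value in sorted(toy_X["sex"].unique()):
    manual[f"sex_{value}"] = toy_X["sex"] == value

# Compare against pd.get_dummies. Casting to int means the check does not
# depend on whether this pandas version returns booleans or 0/1 integers.
auto = pd.get_dummies(toy_X)
same = (
    list(manual.columns) == list(auto.columns)
    and bool((manual.astype(int).to_numpy() == auto.astype(int).to_numpy()).all())
)

print(manual)
print(same)
```

The hand-built columns carry the same information as the `sex_female` and `sex_male` columns in the output above.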


But notice that these two new features are redundant:
- If a passenger is `"male"`, you know they aren't `"female"`, and vice versa.

Thus, we can drop one of the new features from the one-hot encoding without losing information. To do this automatically, we pass the argument `drop_first=True`.

In [15]:
X_encoded = pd.get_dummies(X, drop_first=True)

In [16]:
X_encoded.head()

,pclass,age,fare,sex_male
0,3,22.0,7.25,True
1,1,38.0,71.2833,False
2,3,26.0,7.925,False
3,1,35.0,53.1,False
4,3,35.0,8.05,True


### Fitting the model

Now that we've done the one-hot encoding, we can actually fit the model. First, let's follow good practice and split the data into a training set and a testing set. (We need to import `train_test_split` from `sklearn.model_selection` first.)
  - Don't forget that we should stratify by `y` in any classification problem, so that both sets keep roughly the same class proportions.
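To see what stratifying does, here is a minimal sketch on made-up labels (`X_toy` and `y_toy` are hypothetical; the 70/30 class balance is an assumption chosen for illustration). It compares the share of class `1` in the full data, the training set, and the testing set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70 zeros and 30 ones.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratify, the share of class 1 is about the same in both splits.
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the split is purely random, so the class shares in the two sets can drift apart, especially on small datasets.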

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded,
    y,
    test_size=.2,
    random_state=2026,
    stratify=y
)

In [19]:
X_train.head()

,pclass,age,fare,sex_male
413,2,,0.0,True
118,1,24.0,247.5208,True
848,2,28.0,33.0,True
399,2,28.0,12.65,False
626,2,57.0,12.35,True


Now fit the model on the training data:

In [20]:
model.fit(X_train, y_train)

To evaluate our model we make predictions on the train and test sets:

In [21]:
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

We can use the confusion matrix or classification report to see how well we did:

In [22]:
from sklearn.metrics import confusion_matrix, classification_report

In [24]:
print(confusion_matrix(y_train, y_pred_train))
print(confusion_matrix(y_test, y_pred_test))

[[437   2]
 [ 10 263]]
[[85 25]
 [19 50]]
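In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. As a worked example, the sketch below recomputes accuracy, precision, and recall for class `1` from the test-set matrix printed above (the variable names are assumptions for this illustration):

```python
import numpy as np

# Test-set confusion matrix from the output above.
# Rows are true classes, columns are predicted classes.
cm = np.array([[85, 25],
               [19, 50]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions out of all passengers
precision_1 = tp / (tp + fp)      # of those predicted to survive, how many did
recall_1 = tp / (tp + fn)         # of those who survived, how many we caught

print(round(float(accuracy), 2), round(float(precision_1), 2), round(float(recall_1), 2))
```

These values round to 0.75, 0.67, and 0.72, which are the numbers the classification report shows for the test set.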


In [25]:
print(classification_report(y_train, y_pred_train))
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       439
           1       0.99      0.96      0.98       273

    accuracy                           0.98       712
   macro avg       0.99      0.98      0.98       712
weighted avg       0.98      0.98      0.98       712

              precision    recall  f1-score   support

           0       0.82      0.77      0.79       110
           1       0.67      0.72      0.69        69

    accuracy                           0.75       179
   macro avg       0.74      0.75      0.74       179
weighted avg       0.76      0.75      0.76       179



Our model seems to be overfitting the training data: accuracy is 0.98 on the training set but only 0.75 on the test set.

This could be because, with the default settings (`max_depth=None`, `min_samples_split=2`), sklearn's decision tree classifier keeps splitting until every leaf is pure (Gini impurity of 0) or has fewer than `min_samples_split` samples.
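As a reminder of what is being driven down, here is a minimal sketch of Gini impurity, $1 - \sum_k p_k^2$, where $p_k$ is the share of class $k$ in a node. The helper `gini_impurity` is a hypothetical function written for this illustration, not part of scikit-learn:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 for a 1-D collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))

print(gini_impurity([0, 0, 1, 1]))  # evenly mixed node
print(gini_impurity([1, 1, 1]))     # pure node
```

An evenly mixed two-class node has impurity 0.5 and a pure node has impurity 0, so a tree that is allowed to reach pure leaves can memorize the training set.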

Let's look at the depth of our tree. We can do this with `model.get_depth()`

In [26]:
model.get_depth()

21

In [27]:
model.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}

Let's try another decision tree where we set the max depth of the tree.

In [48]:
model2 = DecisionTreeClassifier(max_depth=8)

The max depth is called a *hyperparameter*: it is not learned from the data as part of the actual model's "formula", but it is a value we choose beforehand that affects the training of the model.

Thus, adjusting things like the max depth is called *hyperparameter tuning*. It is an important aspect of machine learning, which we will talk more about in the future.

In [49]:
model2.fit(X_train, y_train)

In [50]:
y_pred_train2 = model2.predict(X_train)
y_pred_test2 = model2.predict(X_test)

In [51]:
print(classification_report(y_train, y_pred_train2))
print(classification_report(y_test, y_pred_test2))

              precision    recall  f1-score   support

           0       0.86      0.98      0.92       439
           1       0.97      0.75      0.84       273

    accuracy                           0.89       712
   macro avg       0.91      0.87      0.88       712
weighted avg       0.90      0.89      0.89       712

              precision    recall  f1-score   support

           0       0.82      0.89      0.85       110
           1       0.80      0.68      0.73        69

    accuracy                           0.81       179
   macro avg       0.81      0.79      0.79       179
weighted avg       0.81      0.81      0.81       179
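The depth of 8 used here is only one possible choice. The sketch below shows how several depths could be compared in a loop. To stay self-contained it uses synthetic data from `make_classification` as a stand-in (an assumption for this illustration, not the Titanic data), so its scores will not match the reports above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic data).
X_syn, y_syn = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

# Fit one tree per candidate depth and record train/test accuracy.
scores = {}
for depth in [2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A growing gap between training and testing accuracy as the depth increases is the overfitting pattern seen earlier. In practice the depth would usually be chosen on a separate validation set or with cross-validation rather than on the test set.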



Let's try a model that uses _entropy_ instead of _Gini impurity_ as the splitting criterion. Note that this model also uses a smaller max depth (6 instead of 8).
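For reference, here is a minimal sketch of the entropy of a node, $\sum_k p_k \log_2(1/p_k)$, where $p_k$ is the share of class $k$. The helper `entropy` is a hypothetical function written for this illustration, and the base-2 logarithm is an assumption of the sketch:

```python
import numpy as np

def entropy(labels):
    """Entropy sum_k p_k * log2(1 / p_k) for a 1-D collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    # np.unique only reports classes that occur, so every proportion is > 0.
    return float(np.sum(proportions * np.log2(1.0 / proportions)))

print(entropy([0, 0, 1, 1]))  # evenly mixed node
print(entropy([1, 1, 1]))     # pure node
```

Like Gini impurity, entropy is 0 for a pure node and largest for an evenly mixed one, so the two criteria often pick similar splits.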

In [60]:
model3 = DecisionTreeClassifier(max_depth=6, criterion='entropy')

In [61]:
model3.fit(X_train, y_train)

In [62]:
y_pred_train3 = model3.predict(X_train)
y_pred_test3 = model3.predict(X_test)

In [63]:
print(classification_report(y_train, y_pred_train3))
print(classification_report(y_test, y_pred_test3))

              precision    recall  f1-score   support

           0       0.83      0.96      0.89       439
           1       0.92      0.69      0.79       273

    accuracy                           0.86       712
   macro avg       0.88      0.83      0.84       712
weighted avg       0.87      0.86      0.85       712

              precision    recall  f1-score   support

           0       0.78      0.92      0.85       110
           1       0.82      0.59      0.69        69

    accuracy                           0.79       179
   macro avg       0.80      0.76      0.77       179
weighted avg       0.80      0.79      0.79       179



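Seems like Gini impurity is the way to go for this dataset: on this split, the Gini model reaches a test accuracy of 0.81 versus 0.79 for the entropy model. Keep in mind that the two models also differ in max depth (8 versus 6) and the gap is small, so this comparison does not isolate the effect of the criterion.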