# Decision Trees

### Guided Tutorial

For each step, read the explanation, then **run the code cell(s)** right below it.

You will practice:
- Loading and inspecting data for a classification problem  
- Visualizing features to build intuition about possible splits  
- Computing **Gini** and **Entropy** (impurity) for a candidate split  
- Training and interpreting a **shallow** decision tree vs. a **fully grown** tree


#### Import libraries

**Run the next 2 cells**

In [None]:
import os
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from dmba import plotDecisionTree
%matplotlib inline

# Set random seed variable for code reproducibility
SEED = 0

In [None]:
# Import local libraries
root_dir = Path.cwd().resolve().parents[0]
sys.path.append(str(root_dir))

# Visualization functions
from src.utils.helpers import *

# Load the "autoreload" extension so that code can change
%load_ext autoreload
#%reload_ext autoreload

# Always reload modules so that as you change code in src, it gets loaded
%autoreload 2

### Example 1: Lecture

**Create a dataframe for the `RidingMowers.csv` data**

In the next cell, we load the dataset from a `.csv` file into a pandas DataFrame so we can explore it and model it.


**Run the next cell**

This is the data for the Riding Mowers Example in lecture.


In [None]:
mowers_df = pd.read_csv(os.path.join('..','data','RidingMowers.csv'))
mowers_df.head()

**Create a scatterplot for Income and Lot Size with Ownership as the Color**


Next, we visualize the relationship between **Income** and **Lot Size** and use color to show the class label (**Ownership**). This helps us see whether a simple split might separate the classes.


**Run the next cell**

This is the scatter plot used in lecture to visualize the splits.


In [None]:
sns.scatterplot(x='Income',y='Lot_Size',hue='Ownership', data=mowers_df)
plt.show()

**Calculate Gini Index for First Split Condition**

Now we try a **candidate split** (a threshold on Income). We separate the data into the left/right child nodes and compute impurity for each side.
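
For reference, the Gini impurity of a node is 1 minus the sum of squared class proportions. The `gini_index` helper imported from `src.utils.helpers` is not shown in this notebook, so the following is a minimal standalone sketch rather than that helper's actual implementation. The function name `gini_impurity` and the toy labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Toy labels for illustration only (not taken from RidingMowers.csv)
left = ["Nonowner", "Nonowner", "Nonowner", "Owner"]
right = ["Owner", "Owner", "Owner", "Owner"]

print(f"Left Gini:  {gini_impurity(left):.3f}")
print(f"Right Gini: {gini_impurity(right):.3f}")
```

A pure node scores 0, and a two-class node split 50/50 scores 0.5, the maximum for two classes.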

**Run the next cell**

This is the code used for the first split using Gini index.


In [None]:
split_value = 59.7
split_condition = mowers_df['Income'] <= split_value

split_true = list(mowers_df['Ownership'][split_condition])
split_false = list(mowers_df['Ownership'][~split_condition])

print(f"Left Split: Income <= {split_value}, Gini Index = {gini_index(split_true)[1]:.3f}")
print(f"Right Split: Income > {split_value}, Gini Index = {gini_index(split_false)[1]:.3f}")

**Calculate Entropy for First Split Condition**

Here we compute **entropy** for the left and right child nodes created by the Income split, and print the results.
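
Entropy for a node is the negative sum of each class proportion times its base-2 logarithm. As with Gini, the `entropy_loss` helper is not shown here, so this is only a sketch of the standard formula under that assumption. The name `entropy` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a node: sum of -p * log2(p) over classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [c / n for c in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)

# Toy labels for illustration only (not taken from RidingMowers.csv)
left = ["Nonowner", "Nonowner", "Nonowner", "Owner"]
right = ["Owner", "Owner", "Owner", "Owner"]

print(f"Left Entropy:  {entropy(left):.3f}")
print(f"Right Entropy: {entropy(right):.3f}")
```

For two classes, entropy ranges from 0 (pure node) to 1 (50/50 node), whereas Gini tops out at 0.5.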

**Run the next cell**

This is the code used for the first split using Entropy measure.


In [None]:
print(f"Left Split: Income <= {split_value}, Entropy = {entropy_loss(split_true)[1]:.3f}")
print(f"Right Split: Income > {split_value}, Entropy = {entropy_loss(split_false)[1]:.3f}")

**Calculate Overall Gini and Entropy for First Split Condition**

Finally, we compute the **overall (weighted)** impurity for this split by weighting each child node’s impurity by its share of samples.
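
The weighting can be written as (n_left / n) × impurity(left) + (n_right / n) × impurity(right). The sketch below illustrates that formula with Gini on toy labels. It is not the notebook's `weighted_impurity` helper, whose implementation is not shown, and the names `gini_impurity` and `weighted_gini` are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a single node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Combine child impurities, weighting each by its share of the samples."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Toy labels for illustration only (not taken from RidingMowers.csv)
left = ["Nonowner"] * 3 + ["Owner"]   # 4 samples, impure
right = ["Owner"] * 6                 # 6 samples, pure

print(f"Combined Gini: {weighted_gini(left, right):.4f}")
```

Because the pure right node holds 60% of the samples, the combined impurity is pulled well below the left node's own impurity.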

**Run the next cell**

This is the code used to calculate the combined impurity of the two nodes for both Gini and Entropy.


In [None]:
print(f"Combined Gini for Income Split on {split_value} is {weighted_impurity(split_true, split_false, mowers_df['Income'], 'gini'):.3f}")
print(f"Combined Entropy for Income Split on {split_value} is {weighted_impurity(split_true, split_false, mowers_df['Income'], 'entropy'):.3f}")

**Create Decision Tree with a Depth of 2**

Use Gini as the splitting criterion and display the impurity value on each node.


Next, we fit a small decision tree with **max_depth=2** (a shallow, interpretable tree) using **Gini** as the split criterion, then visualize it.


**Run the next cell**

This is the code used to create the tree after 3 splits.


In [None]:
mowers_X = mowers_df.drop(columns=['Ownership'])
mowers_y = mowers_df['Ownership']
dt = DecisionTreeClassifier(max_depth=2, criterion='gini', random_state=SEED)
dt.fit(mowers_X, mowers_y)

fig, ax = plt.subplots(figsize=(7, 5))
plot_tree(dt, 
          feature_names=list(mowers_X.columns),  # list() works across scikit-learn versions
          class_names=['Nonowner', 'Owner'], 
          filled=True, 
          impurity=True, 
          ax=ax)
plt.tight_layout()
plt.show()

**Create Full Decision Tree**

Use Gini as the splitting criterion and display the impurity value on each node.


Lastly, we fit a **fully grown** decision tree (no max depth) to see how the model continues splitting when not constrained.
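
One way to quantify the difference between a shallow and a fully grown tree is to compare their depth, leaf count, and training accuracy. The sketch below does this on synthetic data from `make_classification`, which is an assumption made so the example runs on its own. It is not the RidingMowers data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data for illustration only
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
full = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)

for name, model in [("shallow", shallow), ("full", full)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
          f"training accuracy={model.score(X, y):.3f}")
```

An unconstrained tree keeps splitting until its leaves are pure, so it usually fits the training data almost perfectly, including any label noise.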


**Run the next cell**

This is the code used to create the tree after all splits.


In [None]:
dt2 = DecisionTreeClassifier(max_depth=None, criterion='gini', random_state=SEED)
dt2.fit(mowers_X, mowers_y)

fig, ax = plt.subplots(figsize=(7, 5))
plot_tree(dt2, 
          feature_names=list(mowers_X.columns),  # list() works across scikit-learn versions
          class_names=['Nonowner', 'Owner'], 
          filled=True, 
          impurity=True, 
          ax=ax)
plt.tight_layout()
plt.show()

## Key DecisionTreeClassifier Parameters

Below we will discuss **some** of the key parameters for the `DecisionTreeClassifier` that control **how the tree grows**, **how complex it becomes**, and **how splits are chosen**. Check out the [API Reference](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for the full list of parameters.

**`criterion`**
Determines how the quality of a split is measured.

* `"gini"` → Uses Gini impurity (default)
* `"entropy"` → Uses Shannon entropy (information gain)
* `"log_loss"` → Equivalent to `"entropy"` in scikit-learn; both measure Shannon information gain

This controls how "purity" is defined.

**`splitter`**
Strategy used to choose splits.

* `"best"` → Chooses the best split (default)
* `"random"` → Draws a random candidate threshold for each feature considered and keeps the best of those

`random` can add variability and sometimes reduce overfitting.

**`max_depth`**
Maximum depth of the tree.

* `None` → Grow until pure or minimum samples reached
* Integer (e.g., `max_depth=5`)

Limits tree complexity: a smaller depth gives a simpler model.

**`min_samples_split`**
Minimum samples required to split a node.

* Integer → exact number (e.g., `10`)
* Float → fraction of dataset (e.g., `0.05`)

A larger value means fewer splits, which can reduce overfitting.

**`min_samples_leaf`**
Minimum samples required in a leaf node.

Prevents leaves with very few observations.

**`max_leaf_nodes`**
Limits total number of leaf nodes.

Controls tree size directly.
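
To see how these size-related parameters constrain growth, the sketch below fits one tree per setting on synthetic data and reports the resulting depth and leaf count. The data from `make_classification` and the specific parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data for illustration only
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)

settings = {
    "unconstrained": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_split=40": {"min_samples_split": 40},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "max_leaf_nodes=6": {"max_leaf_nodes": 6},
}

results = {}
for label, params in settings.items():
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    results[label] = (tree.get_depth(), tree.get_n_leaves())
    print(f"{label:>22}: depth={results[label][0]}, leaves={results[label][1]}")
```

Each constraint stops growth in a different way: by depth, by node size before a split, by leaf size after a split, or by the total number of leaves.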

**`max_features`**
Number of features considered at each split.

* `None` → All features
* `"sqrt"` → √(#features)
* `"log2"` → log₂(#features)
* Integer / Float

Adds randomness and is useful in ensembles such as Random Forest.

**`random_state`**
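Controls the randomness of the estimator, which makes results reproducible.

Fix this value for consistent results across runs. Even with `splitter="best"`, scikit-learn randomly permutes the features at each split, so ties between equally good splits can be resolved differently from run to run.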
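
The effect of the seed is easiest to see with `splitter="random"`, where thresholds are drawn at random. The sketch below, on synthetic data (an assumption for illustration), checks that two fits with the same seed produce the same rules. The helper name `fit_rules` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

def fit_rules(seed):
    """Fit a randomized tree with the given seed and return its rules as text."""
    tree = DecisionTreeClassifier(splitter="random", max_depth=3, random_state=seed)
    return export_text(tree.fit(X, y))

same_seed_matches = fit_rules(0) == fit_rules(0)
print("Same seed, same tree:     ", same_seed_matches)
print("Different seed, same tree:", fit_rules(0) == fit_rules(1))
```

Trees fitted with different seeds may or may not coincide, which is why a fixed seed matters when results need to be compared or shared.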

### Example 2: Lecture

**Create a dataframe for the `UniversalBank.csv` data**

In the next cell, we load the dataset from a `.csv` file into a pandas DataFrame so we can explore it and model it.

**Run the next cell**

This is the data for the Personal Loan Example in lecture.


In [None]:
bank_df = pd.read_csv(os.path.join('..','data','UniversalBank.csv'))
bank_df.head()

**Create crosstab summaries of the `Securities Account`, `CD Account`, `CreditCard` variables.**


We start by building contingency tables (crosstabs) between each binary predictor and the target (**Personal Loan**). This lets us see how informative each split might be.
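
Raw counts can be hard to compare when the groups differ in size, so it can help to also view row proportions using the `normalize="index"` option of `pd.crosstab`. The sketch below uses a small made-up DataFrame, not rows from `UniversalBank.csv`.

```python
import pandas as pd

# Toy data for illustration only (not taken from UniversalBank.csv)
toy_df = pd.DataFrame({
    "CD Account":    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "Personal Loan": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})

# Counts of each (predictor value, target value) pair
counts = pd.crosstab(toy_df["CD Account"], toy_df["Personal Loan"])

# Share of each target value within each predictor value (rows sum to 1)
rates = pd.crosstab(toy_df["CD Account"], toy_df["Personal Loan"], normalize="index")

print(counts)
print(rates.round(2))
```

A predictor whose rows show very different target proportions is a stronger candidate for a split.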


**Run the next 3 cells**

This is the code used to create the tables for the calculations for the first split.


In [None]:
pd.crosstab(bank_df['Securities Account'], bank_df['Personal Loan'])

In [None]:
pd.crosstab(bank_df['CD Account'], bank_df['Personal Loan'])

In [None]:
pd.crosstab(bank_df['CreditCard'], bank_df['Personal Loan'])

**Calculate overall impurity for `Securities Account`, `CD Account`, `CreditCard` variables.**

Now we try a **candidate split** on each of the three binary variables. For each one, we separate the data into left/right child nodes and compute the combined (weighted) Gini impurity of the split.
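
The comparison across predictors can also be sketched with plain pandas: compute the weighted Gini for each candidate variable and pick the smallest. The toy DataFrame and the function names `gini` and `weighted_gini` below are hypothetical. This is not the notebook's `weighted_impurity` helper.

```python
import pandas as pd

def gini(series):
    """Gini impurity of the values in a pandas Series."""
    p = series.value_counts(normalize=True)
    return 1.0 - (p ** 2).sum()

def weighted_gini(df, feature, target):
    """Sum over child nodes of (share of rows) * (Gini of the target in that node)."""
    n = len(df)
    return sum(len(g) / n * gini(g[target]) for _, g in df.groupby(feature))

# Toy data for illustration only (not taken from UniversalBank.csv)
toy_df = pd.DataFrame({
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 1, 0, 1, 0, 1, 0, 1],
    "y": [0, 0, 0, 1, 1, 1, 1, 1],
})

scores = {f: weighted_gini(toy_df, f, "y") for f in ["A", "B"]}
best = min(scores, key=scores.get)

for f, s in scores.items():
    print(f"Combined Gini for {f} split: {s:.4f}")
print("Best first split:", best)
```

The variable with the lowest combined impurity gives the purest child nodes and is the one a Gini-based tree would choose first.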


**Run the next cell**

This calculates the combined impurity for each of the 3 variables to evaluate which variable to use for the first split.


In [None]:
vs = ['Securities Account', 'CD Account', 'CreditCard']
for var in vs:
    split_condition = bank_df[var] == 0

    split_true = list(bank_df['Personal Loan'][split_condition])
    split_false = list(bank_df['Personal Loan'][~split_condition])

    sa = weighted_impurity(split_true, split_false, bank_df[var], 'gini')
    print(f"Combined Gini for {var} Split is {sa:.4f}")

**Create Decision Tree with a Depth of 2 with only the Securities Account, CD Account, CreditCard variables.**

Use Gini as the splitting criterion and display the impurity value on each node.


Next, we fit a small decision tree with **max_depth=2** (a shallow, interpretable tree) using **Gini** as the split criterion, then visualize it.


**Run the next cell**

This is the code used to create the final tree for the Personal Loan Example.


In [None]:
bank_X = bank_df[vs]
bank_y = bank_df['Personal Loan']
dt3 = DecisionTreeClassifier(max_depth=2, criterion='gini', random_state=SEED)
dt3.fit(bank_X, bank_y)

fig, ax = plt.subplots(figsize=(7, 5))
plot_tree(dt3, 
          feature_names=list(bank_X.columns),  # list() works across scikit-learn versions
          class_names=['Declined', 'Accepted'], 
          filled=True, 
          impurity=True, 
          ax=ax)
plt.tight_layout()
plt.show()

## Create a Text Version of Tree

The `export_text()` function converts the trained tree into a readable rule-based format.

Instead of a visual diagram, this shows:

* The sequence of splits
* The decision rules
* The predicted class at each leaf
* (Optional) Sample counts / weights
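
To make the output format concrete, here is a minimal sketch on a tiny made-up dataset with a single binary feature. The feature name `flag` and the data are hypothetical and unrelated to the bank data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data for illustration only: one binary feature that separates the classes
X = [[0], [0], [1], [1]]
y = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# show_weights=True adds the per-class sample weights at each leaf
rules = export_text(toy_tree, feature_names=["flag"], show_weights=True)
print(rules)
```

Each level of indentation in the printed rules corresponds to one level of depth in the tree.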

**Run the next cell**

In a Jupyter Notebook you typically don't need `print()` to display a cell's last expression, but here it is needed to preserve the line breaks and indentation of the text.

In [None]:
# list() keeps feature_names compatible across scikit-learn versions
print(export_text(dt3, feature_names=list(bank_X.columns), show_weights=True))

**Create a fully developed Decision Tree with all variables except for `ID` and `Zip Code`.**

Use Gini as the splitting criterion and use the dmba `plotDecisionTree()` function to display the tree.

Finally, we build a fuller model using all available predictors (excluding identifier fields) and plot the resulting tree.


**Run the next cell**

This example was not shown in the lecture but is included as a way to effectively visualize a very complex tree.


In [None]:
bank_X = bank_df.drop(columns=['ID', 'ZIP Code', 'Personal Loan'])
bank_y = bank_df['Personal Loan']

fullClassTree = DecisionTreeClassifier(random_state=SEED).fit(bank_X, bank_y)
plotDecisionTree(fullClassTree, feature_names=bank_X.columns)