# Decision Trees: Foundations and Interpretability

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/25_decision_trees.ipynb)

This notebook contains hands-on code examples and exercises from the **Decision Trees: Foundations and Interpretability** chapter. Use it to follow along, experiment with variations, and complete the practice activities.

**What you'll practice:** building and interpreting CART decision trees for both classification and regression, controlling tree complexity, and translating tree paths into business rules.

**Learning Objectives**
- Explain how decision trees split data using Gini impurity (classification) and SSE (regression)
- Build classification and regression trees with scikit-learn
- Control complexity with `max_depth`, `min_samples_split`, and `min_samples_leaf`
- Interpret tree structure and extract business rules from paths
- Evaluate train/test performance and identify overfitting

**How to use this notebook**
- Run cells from top to bottom. When you see **🏃‍♂️ Try It Yourself**, write your code below the prompt.
- If you’re in Colab, use `Runtime → Restart & run all` to test from a clean environment.


## Setup

Install and import the required libraries. (Skip the install cell if you already have these packages locally.)

In [None]:
# If running in Colab or a fresh environment, uncomment to install
# !pip -q install scikit-learn pandas numpy matplotlib ISLP


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn import tree
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             mean_absolute_error, r2_score, mean_squared_error)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Display options
pd.set_option('display.max_columns', 100)


## 1) Why Decision Trees?

Decision trees automatically discover non-linear patterns and context-dependent rules by asking a series of yes/no questions. Below, we start with a tiny single-variable example to see how a tree creates thresholds.


In [None]:
# --- Single-variable example: Income → Loan approval (simulated) ---
np.random.seed(42)
n = 200
income = np.random.uniform(30_000, 120_000, n)
approval_prob = np.where(income < 50_000, 0.2,
                         np.where(income < 80_000, 0.7, 0.9))
approval_prob += np.random.normal(0, 0.1, n)
approved = (approval_prob > 0.5).astype(int)

df_simple = pd.DataFrame({'income': income, 'approved': approved})
X = df_simple[['income']]
y = df_simple['approved']

clf_simple = DecisionTreeClassifier(max_depth=2, min_samples_split=20, random_state=42)
clf_simple.fit(X, y)

print(f"Single-variable tree accuracy (train): {clf_simple.score(X, y):.3f}")

plt.figure(figsize=(8, 5))
tree.plot_tree(clf_simple, feature_names=['Income'], class_names=['Denied', 'Approved'],
               filled=True, rounded=True, fontsize=10)
plt.title("Simple Decision Tree: Approval by Income Only")
plt.show()
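
The thresholds shown in the plot can also be read as plain-text rules. The sketch below rebuilds the same simulated income data and prints the fitted tree with `sklearn.tree.export_text`. The names `X_demo`, `clf_demo`, and `rules_text` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Recreate the simulated income data from the cell above
np.random.seed(42)
n = 200
income = np.random.uniform(30_000, 120_000, n)
approval_prob = np.where(income < 50_000, 0.2,
                         np.where(income < 80_000, 0.7, 0.9))
approval_prob += np.random.normal(0, 0.1, n)
approved = (approval_prob > 0.5).astype(int)

X_demo = pd.DataFrame({'income': income})
clf_demo = DecisionTreeClassifier(max_depth=2, min_samples_split=20, random_state=42)
clf_demo.fit(X_demo, approved)

# Each indented line is one yes/no question; leaves show the predicted class (0 = Denied, 1 = Approved)
rules_text = export_text(clf_demo, feature_names=['income'], decimals=0)
print(rules_text)
```

Reading one root-to-leaf path of this printout from top to bottom gives an if-then rule that can be restated in business terms.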


### 🏃‍♂️ Try It Yourself
- Change `max_depth` and `min_samples_split` above and re-run. How do thresholds and colors (purity) change?
- Add noise: try `approval_prob += np.random.normal(0, 0.2, n)` and observe the impact on splits.


## 2) Interactions Emerge Naturally (Two Variables)

Trees pick the split (feature + threshold) that best reduces impurity at each step, automatically capturing interactions.
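
To make "best reduces impurity" concrete, here is a minimal sketch of a Gini-based threshold search on one feature. It is a simplified illustration of the idea, not scikit-learn's actual implementation. The helper names (`gini`, `best_threshold`) and the toy data are assumptions made for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity, 1 - sum(p_k^2), for an array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Try midpoints between sorted unique values; return (threshold, weighted child Gini)."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    best_t, best_score = None, np.inf
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        # Weight each child's impurity by its share of the samples
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data (hypothetical): credit scores and approval labels
toy_credit = np.array([520, 580, 610, 640, 660, 700, 730, 780])
toy_approved = np.array([0, 0, 0, 0, 1, 1, 1, 1])

parent_gini = gini(toy_approved)
threshold, child_gini = best_threshold(toy_credit, toy_approved)
print(f"Parent Gini: {parent_gini:.3f}")
print(f"Best threshold: {threshold:.0f} | weighted child Gini: {child_gini:.3f}")
```

With two features, the same search runs over every feature and the single best (feature, threshold) pair wins. Repeating this inside each child node is what lets interactions appear without being specified in advance.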

In [None]:
# --- Two-variable example: Income + Credit Score ---
np.random.seed(123)
n = 300
income = np.random.uniform(30_000, 120_000, n)
credit = np.random.uniform(500, 800, n)

approval_prob = (0.3 * (income > 60_000) +
                 0.4 * (credit > 650) +
                 0.2 * ((income > 60_000) & (credit > 650)) +
                 np.random.normal(0, 0.1, n))
approved = (approval_prob > 0.5).astype(int)

df_two = pd.DataFrame({'income': income, 'credit_score': credit, 'approved': approved})
X2 = df_two[['income', 'credit_score']]
y2 = df_two['approved']

clf_two = DecisionTreeClassifier(max_depth=3, min_samples_split=30, random_state=42)
clf_two.fit(X2, y2)

print(f"Two-variable tree accuracy (train): {clf_two.score(X2, y2):.3f}")

plt.figure(figsize=(10, 6))
tree.plot_tree(clf_two, feature_names=['Income', 'Credit Score'],
               class_names=['Denied', 'Approved'], filled=True, rounded=True, fontsize=9)
plt.title("Decision Tree: Approval by Income & Credit Score")
plt.show()


### 🏃‍♂️ Try It Yourself
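- Lower `max_depth` to 1 or raise `min_samples_split`, then check which feature the root split uses. These settings limit how far the tree grows; the root split is still chosen by impurity reduction. Does the tree pick income or credit score first?
- Increase `max_depth` to 4–5. Do you see signs of overfitting, such as very pure leaves that hold only a few samples?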

## 3) Classification Tree in Practice: Heart Disease

We’ll predict heart disease (1/0) from clinical features. This dataset is included in the course repo.


In [None]:
# Load heart disease data
heart = pd.read_csv("https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/heart.csv")
heart.head()


In [None]:
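# Encode non-numeric columns as integer codes. LabelEncoder assigns codes in sorted
# order, which is usually adequate for trees (OrdinalEncoder is the idiomatic choice for features).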
heart_enc = heart.copy()
categorical_cols = heart_enc.select_dtypes(include=['object']).columns
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    heart_enc[col] = le.fit_transform(heart_enc[col])
    label_encoders[col] = le

# Define features/target
Xh = heart_enc.drop('disease', axis=1)
yh = heart_enc['disease']

Xh_train, Xh_test, yh_train, yh_test = train_test_split(
    Xh, yh, test_size=0.3, random_state=42, stratify=yh
)

# Default tree
heart_tree = DecisionTreeClassifier(random_state=42)
heart_tree.fit(Xh_train, yh_train)

yh_pred = heart_tree.predict(Xh_test)
print("Classification Report (Default Tree):")
print(classification_report(yh_test, yh_pred, target_names=['No Disease', 'Disease']))

print("\nComplexity:")
print(f"Depth: {heart_tree.get_depth()}")
print(f"Leaves: {heart_tree.get_n_leaves()}")
print(f"Train acc: {heart_tree.score(Xh_train, yh_train):.3f} | Test acc: {heart_tree.score(Xh_test, yh_test):.3f}")


In [None]:
# (Optional) Visualize the full tree — can be large!
plt.figure(figsize=(20, 12))
plot_tree(heart_tree, feature_names=list(Xh.columns), class_names=['No Disease', 'Disease'],
          filled=True, rounded=True, fontsize=6)
plt.title("Heart Disease Tree (Default Parameters)")
plt.show()


In [None]:
# Constrain complexity to improve generalization
heart_tree_tuned = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42
).fit(Xh_train, yh_train)

print("Model Comparison")
print("Default Tree:  Train {:.3f} | Test {:.3f}".format(heart_tree.score(Xh_train, yh_train),
                                                         heart_tree.score(Xh_test, yh_test)))
print("Tuned   Tree:  Train {:.3f} | Test {:.3f}".format(heart_tree_tuned.score(Xh_train, yh_train),
                                                         heart_tree_tuned.score(Xh_test, yh_test)))
print(f"Depth (tuned): {heart_tree_tuned.get_depth()} | Leaves: {heart_tree_tuned.get_n_leaves()}")

plt.figure(figsize=(16, 9))
plot_tree(heart_tree_tuned, feature_names=list(Xh.columns), class_names=['No Disease', 'Disease'],
          filled=True, rounded=True, fontsize=8)
plt.title("Heart Disease Tree (Tuned Parameters)")
plt.show()


### 🏃‍♂️ Try It Yourself
- Change `max_depth` and `min_samples_leaf` for the tuned model. Where is the bias–variance sweet spot?
- Use `classification_report` to compare precision/recall across settings.

## 4) Regression Tree in Practice: Ames Housing

We’ll predict sale price from a subset of intuitive home features.
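
For regression, the tree scores each candidate split by the sum of squared errors (SSE) around the mean of each child node. The sketch below works through that search on a tiny made-up dataset. The helper names (`sse`, `best_sse_split`) and the toy values are assumptions for illustration, and the code is a simplification of what scikit-learn does internally.

```python
import numpy as np

def sse(values):
    """Sum of squared errors around the mean (0 for an empty array)."""
    if len(values) == 0:
        return 0.0
    return float(np.sum((values - values.mean()) ** 2))

def best_sse_split(x, y):
    """Try midpoints between sorted unique values; return (threshold, total child SSE)."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [sse(y[x <= t]) + sse(y[x > t]) for t in candidates]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]

# Toy data (hypothetical): living area in sq ft and sale price in $1,000s
toy_area = np.array([900, 1100, 1300, 1500, 2100, 2300, 2500, 2700])
toy_price = np.array([110.0, 120.0, 130.0, 140.0, 250.0, 260.0, 270.0, 280.0])

total_sse = sse(toy_price)
split_at, split_sse = best_sse_split(toy_area, toy_price)
print(f"SSE before split: {total_sse:,.0f}")
print(f"Best split: area <= {split_at:,.0f} | SSE after split: {split_sse:,.0f}")
```

Each leaf then predicts the mean target value of the training samples that land in it, which is why regression tree predictions look like a step function.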


In [None]:
ames = pd.read_csv("https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/ames_clean.csv")
features = ['GrLivArea', 'OverallQual', 'TotalBsmtSF', 'GarageArea', 'YearBuilt', 'LotArea', 'FullBath', 'BedroomAbvGr']
df_house = ames[features + ['SalePrice']].dropna().copy()
df_house.head()


In [None]:
Xr = df_house.drop('SalePrice', axis=1)
yr = df_house['SalePrice']

Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.3, random_state=42)

reg_tree = DecisionTreeRegressor(random_state=42).fit(Xr_train, yr_train)
pred_train = reg_tree.predict(Xr_train)
pred_test = reg_tree.predict(Xr_test)

def eval_reg(y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    return r2, mae, rmse

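# Report train and test metrics side by side to make overfitting visible
r2_tr, mae_tr, rmse_tr = eval_reg(yr_train, pred_train)
r2_t, mae_t, rmse_t = eval_reg(yr_test, pred_test)
print("Ames Price Prediction (Default Tree)")
print(f"Train R^2: {r2_tr:.3f} | MAE: ${mae_tr:,.0f} | RMSE: ${rmse_tr:,.0f}")
print(f"Test  R^2: {r2_t:.3f} | MAE: ${mae_t:,.0f} | RMSE: ${rmse_t:,.0f}")
print(f"Depth: {reg_tree.get_depth()} | Leaves: {reg_tree.get_n_leaves()}")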

plt.figure(figsize=(16, 9))
plot_tree(reg_tree, feature_names=list(Xr.columns), filled=True, rounded=True, fontsize=7, max_depth=3)
plt.title("Ames Regression Tree (Top 3 Levels)")
plt.show()


### 🏃‍♂️ Try It Yourself
- Add `min_samples_leaf=20` or set `max_depth=6` on a new model and compare metrics.
- Which features tend to appear near the root? Why does that make sense economically?

## 5) Practical Controls for Overfitting

The most common levers:
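- `max_depth`: caps the length of the longest root-to-leaf path (the number of successive splits on any path)
- `min_samples_split`: minimum number of samples a node must hold before it can be split
- `min_samples_leaf`: minimum number of samples each leaf must keep after a split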

Use validation (or cross-validation) to pick values that generalize well.
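
One way to do this is a small cross-validated grid search on the training data only, keeping the test set for a final check. The sketch below uses synthetic data from `make_classification` as a stand-in. The grid values and variable names are assumptions chosen to keep the run fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; swap in your own training set)
X_syn, y_syn = make_classification(n_samples=500, n_features=8, n_informative=4,
                                   random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3,
                                          random_state=42, stratify=y_syn)

# Small illustrative grid over two complexity controls
param_grid = {
    'max_depth': [2, 3, 5, 8],
    'min_samples_leaf': [1, 10, 25],
}

# 5-fold cross-validation on the training split only
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print(f"CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.3f}")
```

Because the test set plays no part in choosing the parameters, the final test score stays an honest estimate of how the selected tree generalizes.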


In [None]:
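# Quick sweep example (illustrative; keep small to run fast in class).
# Note: this scores on the test set for demonstration only. For real tuning, pick max_depth with validation or CV.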
depths = [3, 4, 5, 6, 8, 10]
results = []
for d in depths:
    m = DecisionTreeClassifier(max_depth=d, random_state=42)
    m.fit(Xh_train, yh_train)
    results.append((d, m.score(Xh_train, yh_train), m.score(Xh_test, yh_test)))

pd.DataFrame(results, columns=['max_depth', 'train_acc', 'test_acc'])


---

## ✅ Exercise 1: Baseball Salary Prediction (Regression)

**Goal:** Build a regression tree to predict `Salary` using the Hitters dataset from ISLP.

**Steps** (starter code provided):
1. Load and clean data (drop NAs in `Salary`).
2. Select features `['Years','Hits','RBI','Walks','PutOuts']`.
3. Build a default tree, then a constrained tree (`max_depth=4`).
4. Evaluate Train/Test R², MAE, RMSE.
5. Visualize the constrained tree and extract 2–3 if-then rules.


In [None]:
# Starter code
from ISLP import load_data

Hitters = load_data('Hitters')
Hitters_clean = Hitters.dropna(subset=['Salary']).copy()

features = ['Years', 'Hits', 'RBI', 'Walks', 'PutOuts']
Xb = Hitters_clean[features]
yb = Hitters_clean['Salary']

Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.3, random_state=42)

# TODO: build default tree
# default_reg = DecisionTreeRegressor(random_state=42).fit(Xb_train, yb_train)

# TODO: evaluate metrics with helper
def eval_reg(y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    return r2, mae, rmse

# TODO: build constrained tree (e.g., max_depth=4), visualize with plot_tree
# constrained_reg = DecisionTreeRegressor(max_depth=4, random_state=42).fit(Xb_train, yb_train)
# plt.figure(figsize=(14, 8)); plot_tree(constrained_reg, feature_names=features, filled=True, rounded=True, fontsize=9); plt.show()


---

## ✅ Exercise 2: Credit Default Classification

**Goal:** Predict default using balance, income, and student status.

**Steps** (starter code provided):
1. Create two models: (A) default params, (B) constrained (`max_depth=3`, `min_samples_split=50`, `min_samples_leaf=25`).
2. Compare train vs. test accuracy, classification report, and complexity.
3. Discuss class imbalance (~3% default): focus on precision/recall for the positive class.
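
To see why accuracy alone can mislead when positives are rare, the sketch below compares a majority-class baseline with a shallow tree on synthetic data that has roughly 3% positives. The data comes from `make_classification`, not the Default dataset, and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: about 97% negatives, 3% positives
X_imb, y_imb = make_classification(n_samples=5000, n_features=5, n_informative=3,
                                   n_redundant=0, weights=[0.97], flip_y=0,
                                   random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, test_size=0.3,
                                          random_state=42, stratify=y_imb)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, model in [('Always majority', baseline), ('Tree (max_depth=3)', tree_clf)]:
    pred = model.predict(X_te)
    print(f"{name:<20} acc={accuracy_score(y_te, pred):.3f} "
          f"precision={precision_score(y_te, pred, zero_division=0):.3f} "
          f"recall={recall_score(y_te, pred, zero_division=0):.3f}")
```

The baseline scores high accuracy while catching none of the positive cases, so precision and recall for the positive class are the numbers to compare between Model A and Model B.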


In [None]:
# Starter code
from ISLP import load_data  # repeated from Exercise 1 so this cell runs on its own

Default = load_data('Default')
Default_enc = pd.get_dummies(Default, columns=['student'], drop_first=True)
Default_enc['default_binary'] = (Default_enc['default'] == 'Yes').astype(int)

Xd = Default_enc[['balance', 'income', 'student_Yes']]
yd = Default_enc['default_binary']

Xd_train, Xd_test, yd_train, yd_test = train_test_split(Xd, yd, test_size=0.3, random_state=42, stratify=yd)

# TODO: Model A (default) and Model B (constrained)
# clfA = DecisionTreeClassifier(random_state=42).fit(Xd_train, yd_train)
# clfB = DecisionTreeClassifier(max_depth=3, min_samples_split=50, min_samples_leaf=25, random_state=42).fit(Xd_train, yd_train)

# TODO: Compare performance
# for name, model in [('A', clfA), ('B', clfB)]: ...


---

## ⭐ Optional Challenge: Stock Market Direction (Weekly)

**Goal:** Predict market direction (Up/Down) using lagged returns.

- Split chronologically (first 80% train, last 20% test).
- Compare to a naive "always Up" rule.
- Interpret which lags matter (if any) and discuss why this problem is hard.
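
The chronological split and the naive comparison can be sketched on synthetic returns as follows. The random series below is an assumption standing in for real market data, and the names (`demo`, `ts_tree`, `naive_acc`) are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic weekly returns (an assumption standing in for real market data)
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.1, 2.0, 400))

# Lagged returns as features; direction of the current week as the target
demo = pd.DataFrame({
    'Lag1': returns.shift(1),
    'Lag2': returns.shift(2),
    'up': (returns > 0).astype(int),
}).dropna()

# Chronological split: no shuffling, so every test row is later than every train row
cut = int(0.8 * len(demo))
train, test = demo.iloc[:cut], demo.iloc[cut:]

ts_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
ts_tree.fit(train[['Lag1', 'Lag2']], train['up'])

tree_acc = ts_tree.score(test[['Lag1', 'Lag2']], test['up'])
naive_acc = (test['up'] == 1).mean()  # accuracy of always predicting Up
print(f"Tree accuracy: {tree_acc:.3f} | Always-Up accuracy: {naive_acc:.3f}")
```

A random split would let the model train on weeks that come after some test weeks, which leaks future information. Comparing against the always-Up rule shows whether the tree adds anything beyond the base rate.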


In [None]:
# Starter code
from ISLP import load_data  # repeated from Exercise 1 so this cell runs on its own

Weekly = load_data('Weekly').copy()
Weekly['Direction_binary'] = (Weekly['Direction'] == 'Up').astype(int)
lag_features = ['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']

Xw = Weekly[lag_features]
yw = Weekly['Direction_binary']

# Chronological split: first 80% train, last 20% test
split_idx = int(0.8 * len(Weekly))
Xw_train, Xw_test = Xw.iloc[:split_idx], Xw.iloc[split_idx:]
yw_train, yw_test = yw.iloc[:split_idx], yw.iloc[split_idx:]

# TODO: Fit a tree and evaluate vs. a naive baseline that predicts all Up


---

## Summary

- CART selects (feature, threshold) splits that minimize impurity (classification) or SSE (regression).
- Trees easily model thresholds and interactions, but can overfit without constraints.
- Use `max_depth`, `min_samples_split`, and `min_samples_leaf` to control complexity.
- Translate paths to if–then business rules for stakeholder communication.
