# Understanding Feature Importance: Peeking Inside the Black Box

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/27_feature_importance.ipynb)

This companion notebook provides hands-on exercises for the **Feature Importance** chapter. You'll compute impurity-based and permutation importance, compare their rankings, and use partial dependence plots (PDPs) to understand how features influence predictions.

**What you'll practice**
- Train Random Forests for classification and regression
- Extract impurity-based importance (`.feature_importances_`)
- Compute permutation importance on a held-out test set
- Visualize and compare importance rankings
- Create PDPs to explore feature effects
- Translate findings into business insights

**How to use**
- Run from top to bottom. When you see **🏃‍♂️ Try It Yourself**, add your code beneath the prompt.
- In Colab: `Runtime → Restart and run all` to test from a clean environment.


## 0) Setup

Install and import the required packages. In local environments where these libraries are already installed, you can skip the install cell.


In [None]:
# If using Colab/a fresh env, uncomment to install
# !pip -q install scikit-learn pandas numpy matplotlib ISLP


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from ISLP import load_data

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, r2_score, mean_absolute_error, mean_squared_error
from sklearn.inspection import permutation_importance, PartialDependenceDisplay


## 1) Warm‑up: Build a Random Forest and Extract Impurity-Based Importance

We’ll start with the **Default** dataset (classification). Features: `balance`, `income`, and `student` (binary). Target: `default` (1 = Yes, 0 = No).


In [None]:
# Load and prepare Default dataset
Default = load_data('Default')

# Convert to numeric (student/default as binary; cast features to float for PDPs)
Default = Default.copy()
Default['default'] = (Default['default'] == 'Yes').astype(int)
Default['student'] = (Default['student'] == 'Yes').astype(float)

X = Default[['student', 'balance', 'income']].astype(float)
y = Default['default']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf_clf = RandomForestClassifier(n_estimators=300, random_state=42)
rf_clf.fit(X_train, y_train)

y_pred = rf_clf.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))


In [None]:
# Impurity-based importance (from training)
impurity_df = (pd.DataFrame({
    'feature': X_train.columns,
    'impurity_importance': rf_clf.feature_importances_
})
                .sort_values('impurity_importance', ascending=False)
                .reset_index(drop=True))

impurity_df


In [None]:
# Plot impurity-based importance (sorted horizontal bar)
ordered = impurity_df.sort_values('impurity_importance')
plt.figure(figsize=(6, 3.5))
plt.barh(ordered['feature'], ordered['impurity_importance'])
plt.xlabel("Impurity-based importance")
plt.title("Random Forest Feature Importance (Training)")
plt.tight_layout()
plt.show()


### 🏃‍♂️ Try It Yourself
- Change the number of trees (`n_estimators`) or set a `max_depth` and re-run. Do rankings change?
- Add a noisy feature (e.g., a random normal column) and see where it ranks.


## 2) Permutation Importance (Model‑Agnostic)

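Permutation importance measures the **drop in test performance** when one feature's values are shuffled. Shuffling breaks that feature's relationship with the target while leaving the other features untouched. Because it only needs predictions and a score, it works with any fitted model.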
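To make the shuffling idea concrete, here is a minimal hand-rolled sketch of the same procedure. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Default dataset), and the helper `manual_permutation_importance` is a hypothetical name introduced only for illustration. It is not how scikit-learn implements `permutation_importance` internally; it only mirrors the logic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): with shuffle=False the first two
# columns are informative and the last two are pure noise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, shuffle=False, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
demo_model = RandomForestClassifier(n_estimators=50, random_state=0)
demo_model.fit(Xd_train, yd_train)


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Hypothetical helper: mean/std drop in accuracy per shuffled column."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column only; all other columns stay intact
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - accuracy_score(y, model.predict(X_perm))
    return drops.mean(axis=1), drops.std(axis=1)


manual_mean, manual_std = manual_permutation_importance(demo_model, Xd_test, yd_test)
print(np.round(manual_mean, 3))
```

The informative columns should show a clear drop in accuracy, while the noise columns should hover around zero (small negative values are possible by chance).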


In [None]:
perm = permutation_importance(
    rf_clf, X_test, y_test,
    n_repeats=10, random_state=42, scoring='accuracy'
)

perm_df = (pd.DataFrame({
    'feature': X_test.columns,
    'perm_importance_mean': perm.importances_mean,
    'perm_importance_std': perm.importances_std
})
           .sort_values('perm_importance_mean', ascending=False)
           .reset_index(drop=True))

perm_df


In [None]:
# Plot permutation importance with error bars
ordered = perm_df.sort_values('perm_importance_mean')
plt.figure(figsize=(6, 3.5))
plt.barh(ordered['feature'], ordered['perm_importance_mean'], xerr=ordered['perm_importance_std'])
plt.xlabel("Permutation importance (Δ accuracy)")
plt.title("Permutation Importance (Test Set)")
plt.tight_layout()
plt.show()


### Compare methods side‑by‑side

In [None]:
compare = (impurity_df
           .merge(perm_df, on='feature', how='inner')
           .sort_values('perm_importance_mean', ascending=False)
           .reset_index(drop=True))
compare


### 🏃‍♂️ Try It Yourself
- Do impurity and permutation rankings agree? Where do they disagree?
- If they disagree strongly for a continuous vs. binary feature, why might that be?


## 3) Partial Dependence Plots (PDPs)

A PDP shows **how** the model's average prediction changes as one feature varies, averaging over the observed values of the other features (the feature's marginal effect).
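The averaging step can be sketched by hand. The example below uses toy data (an assumption) and a hypothetical helper, `manual_partial_dependence`. For each grid value it overwrites one feature for every row, then averages the predicted probability of class 1. It is meant to show the idea only; scikit-learn's own grid construction and options may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (assumption): only the first feature drives the target
rng = np.random.default_rng(0)
X_pd = rng.normal(size=(500, 2))
y_pd = (X_pd[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

pd_model = RandomForestClassifier(n_estimators=50, random_state=0)
pd_model.fit(X_pd, y_pd)


def manual_partial_dependence(model, X, feature_idx, grid):
    """Hypothetical helper: average P(class 1) with one feature forced to each grid value."""
    averaged = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # set the feature to the same value for every row
        averaged.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averaged)


# Grid between the 5th and 95th percentiles to avoid sparse extremes
grid = np.linspace(np.percentile(X_pd[:, 0], 5), np.percentile(X_pd[:, 0], 95), 20)
manual_pd = manual_partial_dependence(pd_model, X_pd, 0, grid)
print(np.round(manual_pd, 2))
```

Plotting `grid` against `manual_pd` gives a curve of the same kind as the PDPs in this section.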


In [None]:
# Single-feature PDP for the most important feature (by permutation importance)
top_feat = perm_df.iloc[0]['feature']

fig, ax = plt.subplots(1, 1, figsize=(5, 3.5))
PartialDependenceDisplay.from_estimator(
    rf_clf, X_train, features=[top_feat], grid_resolution=60, ax=ax
)
ax.set_title(f"Partial Dependence: {top_feat}")
plt.tight_layout()
plt.show()


In [None]:
# PDPs for all features (side-by-side)
fig, ax = plt.subplots(1, len(X.columns), figsize=(12, 3.5))
PartialDependenceDisplay.from_estimator(
    rf_clf, X_train, features=list(X.columns), grid_resolution=50, ax=ax
)
plt.tight_layout()
plt.show()


### 🏃‍♂️ Try It Yourself
- Identify any threshold or saturation effects in the PDPs.
- Do PDP insights align with the importance rankings and your domain intuition?


## 4) (Optional) Regression Example: Ames Housing

Repeat the workflow for a regression task: compute **impurity-based importance** (for regression trees, the reduction in squared error, i.e., MSE) and **permutation importance** scored by R².


In [None]:
# Load and prep a subset of Ames features (via raw URL for Colab friendliness)
ames_url = "https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/ames_clean.csv"

try:
    ames = pd.read_csv(ames_url)
    feats = ['GrLivArea','OverallQual','TotalBsmtSF','GarageArea','YearBuilt','LotArea','FullBath','BedroomAbvGr']
    df_house = ames[feats + ['SalePrice']].dropna().copy()

    Xr = df_house[feats]
    yr = df_house['SalePrice']

    Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.3, random_state=42)

    rf_reg = RandomForestRegressor(n_estimators=300, random_state=42)
    rf_reg.fit(Xr_train, yr_train)

    pred = rf_reg.predict(Xr_test)
    r2 = r2_score(yr_test, pred)
    mae = mean_absolute_error(yr_test, pred)
    rmse = mean_squared_error(yr_test, pred) ** 0.5

    print(f"Test R^2: {r2:.3f} | MAE: ${mae:,.0f} | RMSE: ${rmse:,.0f}")

    imp_reg = (pd.DataFrame({'feature': Xr.columns, 'impurity_importance': rf_reg.feature_importances_})
               .sort_values('impurity_importance', ascending=False).reset_index(drop=True))
    display(imp_reg)  # a bare expression inside `try` is not auto-displayed by the notebook
except Exception as e:
    print("Ames dataset could not be loaded. If offline, skip this section.\n", e)


In [None]:
# Plot regression impurity-based importance (if computed above)
try:
    ordered = imp_reg.sort_values('impurity_importance')
    plt.figure(figsize=(6, 3.5))
    plt.barh(ordered['feature'], ordered['impurity_importance'])
    plt.xlabel("Impurity-based importance (regression)")
    plt.title("Random Forest Regressor Feature Importance")
    plt.tight_layout()
    plt.show()
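except NameError:
    # imp_reg only exists if the Ames cell above ran successfully
    print("Skipping plot: `imp_reg` is not defined (the Ames data did not load).")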


In [None]:
# Permutation importance for regression (R^2 drop)
try:
    perm_r = permutation_importance(
        rf_reg, Xr_test, yr_test,
        n_repeats=10, random_state=42, scoring='r2'
    )
    perm_reg = (pd.DataFrame({
        'feature': Xr_test.columns,
        'perm_importance_mean': perm_r.importances_mean,
        'perm_importance_std': perm_r.importances_std
    })
                .sort_values('perm_importance_mean', ascending=False)
                .reset_index(drop=True))
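    display(perm_reg)  # a bare expression inside `try` is not auto-displayed by the notebook
except NameError:
    # rf_reg only exists if the Ames cell above ran successfully
    print("Skipping: `rf_reg` is not defined (the Ames data did not load).")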


In [None]:
# PDPs for top regression feature
try:
    top_r = perm_reg.iloc[0]['feature']
    fig, ax = plt.subplots(1, 1, figsize=(5, 3.5))
    PartialDependenceDisplay.from_estimator(rf_reg, Xr_train, features=[top_r], grid_resolution=50, ax=ax)
    ax.set_title(f"Partial Dependence (Regression): {top_r}")
    plt.tight_layout()
    plt.show()
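except NameError:
    # perm_reg / rf_reg only exist if the cells above ran successfully
    print("Skipping PDP: regression results are not defined (the Ames data did not load).")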


## 5) Quick Checklist for Trustworthy Feature Importance

- Use **permutation importance on a held-out test set** as your primary ranking: it reflects what the model relies on for unseen data, not what it memorized during training
- Compare with **impurity-based importance** for tree models (fast signal; validate later)
- Watch for **high-cardinality bias** and **correlated features** splitting importance
- Remember: **importance ≠ causation**; pair with domain expertise
- Use **PDPs** (and, later, SHAP/LIME) to understand *how* features influence predictions
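The high-cardinality bias in the checklist can be seen in a small experiment. This is a sketch on made-up data (an assumption): one real signal plus two pure-noise columns, one continuous and one binary. All variable names are hypothetical. With fully grown trees, the continuous noise column offers many more possible split points, so impurity-based importance tends to credit it more than the binary noise column even though neither is predictive.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)                              # truly predictive
noise_continuous = rng.normal(size=n)                    # pure noise, many unique values
noise_binary = rng.integers(0, 2, size=n).astype(float)  # pure noise, two unique values
y_bias = (signal + rng.normal(size=n) > 0).astype(int)

names = ['signal', 'noise_continuous', 'noise_binary']
X_bias = np.column_stack([signal, noise_continuous, noise_binary])
Xn_tr, Xn_te, yn_tr, yn_te = train_test_split(
    X_bias, y_bias, test_size=0.3, random_state=0
)

bias_model = RandomForestClassifier(n_estimators=100, random_state=0)
bias_model.fit(Xn_tr, yn_tr)

# Impurity importance comes from training; permutation importance from the test set
impurity = dict(zip(names, bias_model.feature_importances_))
perm_bias = permutation_importance(
    bias_model, Xn_te, yn_te, n_repeats=10, random_state=0, scoring='accuracy'
)
permuted = dict(zip(names, perm_bias.importances_mean))

for name in names:
    print(f"{name:>17}: impurity={impurity[name]:.3f}  permutation={permuted[name]:.3f}")
```

On the test set, permutation importance should stay near zero for both noise columns, which is why the checklist recommends validating impurity-based rankings with it.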


---

## ✅ End‑of‑Chapter Exercises

These extend your Chapter 26 models with feature importance analysis.


### Exercise 1 — Baseball Salary (Regression)

Use your **Hitters** random forest from Chapter 26.

**Tasks**
1. Extract and plot impurity-based importance
2. Compute permutation importance (`scoring='r2'`); compare to impurity
3. Create PDPs for the top 3–5 features
4. Write 3 business insights + 2 caveats (importance ≠ causation)


In [None]:
# Starter
Hitters = load_data('Hitters')
Hitters_clean = Hitters.dropna(subset=['Salary']).copy()

features = ['Years','Hits','RBI','Walks','PutOuts']
Xb = Hitters_clean[features]
yb = Hitters_clean['Salary']

Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.3, random_state=42)

# TODO: Fit RandomForestRegressor; compute impurity and permutation importance; make PDPs


### Exercise 2 — Credit Default (Classification)

Re-use your **Default** random forest.

**Tasks**
1. Plot impurity-based importance
2. Compute permutation importance (`scoring='accuracy'`); compare
3. Discuss high-cardinality bias (continuous vs. binary)
4. Create PDPs for `balance`, `income`, `student`
5. Draft a 1-page memo with thresholds and actions


In [None]:
# Starter (you already trained rf_clf above)
# TODO: Recreate charts/tables here as your final deliverables for Exercise 2


### Exercise 3 — Market Direction (Optional)

Work with **Weekly** data (lags 1–5).

**Tasks**
1. Compare impurity vs. permutation importance
2. Examine correlations among lags
3. Create PDPs for top lags
4. Compare model accuracy to baseline (always "Up"); reflect on limits
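For the baseline in task 4, one option is scikit-learn's `DummyClassifier` with a constant prediction. The sketch below uses randomly generated toy labels (an assumption, not the Weekly data) with a chronological 80/20 split like the starter cell. All variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in (assumption): 5 "lag" columns and a label that is 1 ("Up") ~55% of the time
rng = np.random.default_rng(0)
n_weeks = 200
X_toy = rng.normal(size=(n_weeks, 5))
y_toy = (rng.random(n_weeks) < 0.55).astype(int)

# Chronological split: first 80% for training, last 20% for testing
split = int(0.8 * n_weeks)
X_toy_train, X_toy_test = X_toy[:split], X_toy[split:]
y_toy_train, y_toy_test = y_toy[:split], y_toy[split:]

# Baseline that predicts "Up" (1) for every week
always_up = DummyClassifier(strategy='constant', constant=1)
always_up.fit(X_toy_train, y_toy_train)
baseline_acc = accuracy_score(y_toy_test, always_up.predict(X_toy_test))

print(f"Always-'Up' baseline accuracy: {baseline_acc:.3f}")
```

The baseline accuracy equals the share of "Up" weeks in the test period, so a model only adds value if it beats that number.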


In [None]:
# Starter
Weekly = load_data('Weekly').copy()
Weekly['Direction_binary'] = (Weekly['Direction'] == 'Up').astype(int)
lag_features = ['Lag1','Lag2','Lag3','Lag4','Lag5']

Xw = Weekly[lag_features]
yw = Weekly['Direction_binary']

split_idx = int(0.8 * len(Weekly))
Xw_train, Xw_test = Xw.iloc[:split_idx], Xw.iloc[split_idx:]
yw_train, yw_test = yw.iloc[:split_idx], yw.iloc[split_idx:]

# TODO: Fit RandomForestClassifier; compute and compare importances; PDPs; baseline comparison


---

## Summary
- Use **impurity-based importance** for fast, model-specific signals
- Validate with **permutation importance** on test data
- Use **PDPs** to understand *how* important features affect predictions
- Always communicate **insights + caveats** to stakeholders
