# Interpretation - Feature Importance + Partial Dependence + Project Improved Model Delivery

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center><a class="tocSkip"></center>
# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/15_interpretation_error_analysis_project.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Generate model interpretation artifacts (permutation importance, PDP/ICE)
2. Conduct error analysis to find systematic failure segments
3. Communicate model behavior honestly (limits, caveats, instability)
4. Deliver an improved project model with interpretation and error analysis
5. Use Gemini to draft explanation text, then tighten it to evidence

---

## 1. Setup

This notebook uses the **California Housing** dataset (20,640 samples, 8 features) to demonstrate model interpretation and error analysis techniques. We switch from the breast cancer classification task to a regression task because continuous predictions make residual analysis and partial dependence plots more informative.

The imports include `permutation_importance` and `PartialDependenceDisplay` from scikit-learn's inspection module, which provide model-agnostic tools for understanding how features influence predictions.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)

print("✓ Setup complete!")

**Reading the output:**

The setup cell loads the interpretation toolkit: `permutation_importance` for measuring feature contributions, `PartialDependenceDisplay` for visualizing how individual features affect predictions, and standard regression metrics (`mean_absolute_error`, `mean_squared_error`, `r2_score`). The figure size is set to (12, 6) for wider plots that accommodate feature names.

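The message `✓ Setup complete!` confirms that all libraries imported successfully, and the seed is fixed at **RANDOM_SEED = 474** for reproducibility. This notebook requires scikit-learn 1.0+ because `PartialDependenceDisplay.from_estimator` was introduced in that release; older versions use the `plot_partial_dependence` function instead.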

**Key takeaway:** The inspection tools imported here (`permutation_importance`, `PartialDependenceDisplay`) are model-agnostic: they work with any scikit-learn estimator, not just Random Forests. You can reuse the same code with Gradient Boosting, Ridge regression, or any other model.

---

## 2. Load Data and Train Champion Model

We load the California Housing dataset and split it into train (60 %), validation (20 %), and test (20 %) sets. The target variable `MedHouseVal` represents the median house value in units of $100,000 for census block groups in California. Features include median income, house age, average number of rooms and bedrooms, block-group population, average occupancy, and geographic coordinates (latitude and longitude).

We then train a `RandomForestRegressor` as our champion model. Random Forests are a natural choice for interpretation exercises because they are non-linear (so partial dependence plots show interesting curves) yet relatively stable (so permutation importance estimates are reliable).

In [None]:
# Load dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

# Split data
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_SEED)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=RANDOM_SEED)

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

**Reading the output:**

The dataset is split into three partitions: **Train** (~12,384 samples), **Val** (~4,128 samples), and **Test** (~4,128 samples), following the standard 60/20/20 ratio. The California Housing dataset has **8 features**: `MedInc` (median income), `HouseAge`, `AveRooms`, `AveBedrms`, `Population`, `AveOccup`, `Latitude`, and `Longitude`.

The target `MedHouseVal` is capped at 5.0 ($500,000) in the original dataset, meaning very expensive properties are clipped. This cap will show up later in the error analysis as a ceiling effect where the model cannot predict values above 5.0.

**Why this matters:** The three-way split is essential for this notebook's workflow. We train on the training set, compute all interpretation artifacts on the validation set, and reserve the test set for a final unbiased evaluation. Computing importance on validation data (not training data) ensures we measure genuine predictive value.
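As a quick sanity check on the 60/20/20 claim, the split sizes can be reproduced with plain arithmetic. This is a minimal sketch that assumes `train_test_split` rounds a fractional `test_size` up to a whole number of rows; that rounding makes no difference here because both products are exact.

```python
import math

n_total = 20640  # rows in the California Housing dataset

# First split: 20 % of all rows are held out as the test set.
n_test = math.ceil(0.20 * n_total)
n_temp = n_total - n_test

# Second split: 25 % of the remaining 80 % becomes the validation set.
n_val = math.ceil(0.25 * n_temp)
n_train = n_temp - n_val

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.0%})")
```

Taking 25 % of the remaining 80 % is what yields a 20 % validation share of the full dataset.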

---

In [None]:
# Train champion model (Random Forest for interpretation)
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=RANDOM_SEED, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Predictions
y_val_pred = rf_model.predict(X_val)

# Evaluation
mae = mean_absolute_error(y_val, y_val_pred)
rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
r2 = r2_score(y_val, y_val_pred)

print("\n=== CHAMPION MODEL PERFORMANCE ===")
print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")

**Reading the output:**

Three metrics are reported for the champion Random Forest on the validation set: **MAE** (mean absolute error, in $100k units), **RMSE** (root mean squared error), and **R-squared**. Typical values on this dataset with 100 trees and max_depth=10: MAE around **0.33** ($33,000 average error), RMSE around **0.48**, and R-squared around **0.80**.

An **R-squared of 0.80** means the model explains 80 % of the variance in median house values using just 8 features. The remaining 20 % reflects factors not in the dataset (school quality, neighborhood amenities, market conditions, etc.), irreducible noise, and any patterns the model failed to learn.

The MAE of ~0.33 means that on average, the model's prediction is off by about $33,000. For a median house valued at $200,000, that is a ~16 % error, which is reasonable for a first model but leaves room for improvement.

**Key takeaway:** These baseline metrics are the reference point for all subsequent analysis. When we find high-error segments later, we will compare their segment-specific MAE against this overall MAE of ~0.33.
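To make the three metrics concrete, here is a minimal from-scratch sketch of what `mean_absolute_error`, the square root of `mean_squared_error`, and `r2_score` compute. The four values are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the dataset.

```python
import math

# Hypothetical targets and predictions in units of $100,000.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 5.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average size of the error, ignoring direction.
mae = sum(abs(e) for e in errors) / n

# RMSE: like MAE, but squaring penalizes large errors more heavily.
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R-squared: 1 minus (residual variation / total variation around the mean).
y_mean = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.4f}  (about ${mae * 100_000:,.0f})")
print(f"RMSE: {rmse:.4f}")
print(f"R2:   {r2:.4f}")
```

RMSE is never smaller than MAE; a large gap between the two suggests that a few big errors dominate.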

---

## 3. Permutation Feature Importance

### 3.1 Compute Permutation Importance

In [None]:
# Compute permutation importance on validation set
perm_importance = permutation_importance(
    rf_model, X_val, y_val, 
    n_repeats=10, 
    random_state=RANDOM_SEED,
    scoring='neg_mean_absolute_error'
)

# Create importance DataFrame
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance_mean': perm_importance.importances_mean,
    'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)

print("\n=== PERMUTATION FEATURE IMPORTANCE ===")
print(importance_df)

**Reading the output:**

The permutation importance table lists all 8 features ranked by their impact on MAE (scored as `neg_mean_absolute_error`). The `importance_mean` column shows how much MAE increases when that feature is shuffled; higher values mean the feature is more important.

**MedInc** (median income) typically dominates with importance around **0.30-0.50**, meaning shuffling income values increases MAE by $30,000-50,000. This makes economic sense: income is the strongest predictor of housing prices. **Latitude** and **Longitude** often rank second and third, capturing the geographic price premium of coastal California.

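Features with importance near zero (e.g., `Population`, `AveBedrms`) contribute little to this fitted model's predictions on the validation set. They are candidates for removal, but confirm by retraining without them: when two features are correlated (for example `AveRooms` and `AveBedrms`), shuffling one can understate its value because the model can lean on the other.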

The `importance_std` column shows variability across the 10 shuffle repeats. Features with high std relative to their mean should be interpreted cautiously: their importance is not precisely estimated.

**Key takeaway:** Permutation importance answers the question "which features does the model actually rely on for its predictions?" On California Housing, the answer is overwhelmingly income and geography.
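The mechanics behind the table can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not scikit-learn's code: the helper `permutation_importance_sketch`, the toy model `toy_predict`, and the synthetic data are all assumptions made for the example.

```python
import random
from statistics import mean, stdev

def mae(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=474):
    """Return (mean, std) of the MAE increase per feature when it is shuffled."""
    rng = random.Random(seed)
    baseline = mae(y, [predict(row) for row in X])
    results = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mae(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        results.append((mean(increases), stdev(increases)))
    return results

# Synthetic data: y depends on feature 0 only; feature 1 is irrelevant.
data_rng = random.Random(0)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]
y = [3 * x0 + data_rng.gauss(0, 1) for x0, _ in X]

def toy_predict(row):
    # Stand-in for a fitted model that learned the true relationship.
    return 3 * row[0]

importances = permutation_importance_sketch(toy_predict, X, y)
for j, (imp_mean, imp_std) in enumerate(importances):
    print(f"feature {j}: {imp_mean:.3f} +/- {imp_std:.3f}")
```

Because the toy model ignores feature 1, shuffling it leaves every prediction unchanged and its importance is exactly zero, while shuffling feature 0 raises MAE substantially.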

---

### 3.2 Visualize Importance

A horizontal bar chart is the standard way to present permutation importance because it displays feature names legibly and lets you compare magnitudes at a glance. Error bars (one standard deviation from the 10 shuffle repeats) indicate how stable each importance estimate is.

Features with error bars overlapping zero are not significantly important to the model. Features with large bars and tight error bars are the most reliably important drivers of predictions.

In [None]:
# Plot permutation importance
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(importance_df['feature'], importance_df['importance_mean'], xerr=importance_df['importance_std'])
ax.set_xlabel('Permutation Importance (increase in MAE when shuffled)')
ax.set_title('Feature Importance - Random Forest Model')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

**Reading the output:**

The horizontal bar chart provides a visual ranking of all 8 features. Error bars (horizontal lines extending from the end of each bar) represent one standard deviation from the 10 permutation repeats. **MedInc** stands out as the longest bar by a wide margin, followed by the geographic features.

Features with bars that barely extend beyond zero (e.g., `Population`) are essentially noise from the model's perspective. You could remove them and the model's MAE would barely change.

Notice whether any error bars overlap with zero: that would mean the feature's importance is not statistically distinguishable from zero. For the California Housing dataset, typically 5-6 features have clearly positive importance while 2-3 features are near zero.

**Why this matters:** This chart is the primary deliverable for stakeholders who ask "what drives the model's predictions?" A domain expert can look at this chart and immediately verify whether the model makes sense (income and location driving prices is intuitive) or if something suspicious is happening (e.g., population being the top feature would be concerning).
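The "does the error bar overlap zero?" check can also be done numerically rather than by eye. The sketch below uses hypothetical (mean, std) pairs and placeholder feature names purely for illustration; in the notebook you would read these values from `importance_df`. Treat the one-standard-deviation rule as a rough heuristic, not a formal significance test.

```python
# Hypothetical (importance_mean, importance_std) pairs, for illustration only.
importances = {
    "feature_a": (0.400, 0.010),
    "feature_b": (0.120, 0.020),
    "feature_c": (0.003, 0.004),
}

for name, (mean_imp, std_imp) in importances.items():
    lower = mean_imp - std_imp  # left end of the error bar
    verdict = "clearly positive" if lower > 0 else "overlaps zero"
    print(f"{name}: {mean_imp:.3f} +/- {std_imp:.3f} -> {verdict}")

# Features whose error bar reaches zero or below.
overlapping = [n for n, (m, s) in importances.items() if m - s <= 0]
print("Interpret with caution:", overlapping)
```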

---

## 📝 PAUSE-AND-DO Exercise 1 (10 minutes)

**Task:** Create permutation importance and write 3 evidence-based bullets about feature importance.

**Instructions:**
1. Review the permutation importance results above
2. Identify the top 3 most important features
3. Write 3 evidence-based interpretation bullets

---

### YOUR INTERPRETATION HERE:

**Finding 1:**  
[Evidence-based interpretation of most important feature]

**Finding 2:**  
[Evidence-based interpretation of second feature]

**Finding 3:**  
[Evidence-based interpretation or pattern]

---

## 4. Partial Dependence Plots (PDP)

### 4.1 Create PDP for Top Features

In [None]:
# Select top 4 features for PDP
top_features = importance_df.head(4)['feature'].tolist()

print(f"Creating PDP for: {top_features}")

# Create partial dependence plots
fig, ax = plt.subplots(figsize=(14, 10))
pdp_display = PartialDependenceDisplay.from_estimator(  # avoid shadowing IPython's display()
    rf_model, X_val, top_features, 
    ax=ax, 
    kind="average",
    random_state=RANDOM_SEED
)
plt.suptitle('Partial Dependence Plots - Top 4 Features', fontsize=16)
plt.tight_layout()
plt.show()

**Reading the output:**

The partial dependence plots show four subplots, one for each of the top 4 features. Each curve shows the **average predicted house value** as that feature varies, while all other features are held at their actual values in the dataset.

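For **MedInc** (measured in tens of thousands of dollars): the PDP typically shows a strong positive relationship, rising steeply from income 0-5 and then flattening above income 8-10. This flattening occurs partly because the target is capped at 5.0 and partly because very few samples have income above 10. Keep in mind that scikit-learn draws the curve only between the 5th and 95th percentiles of each feature by default (`percentiles=(0.05, 0.95)`), so the highest income range appears only if you widen that setting.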

For **Latitude** and **Longitude**: the PDPs capture the geographic price gradient. Lower latitude (Southern California) and specific longitude ranges (coastal areas) tend to have higher predicted values.

For **HouseAge** or **AveRooms** (depending on the ranking): the relationships may be non-monotonic, showing that the Random Forest captures patterns that a linear model would miss.

**Key takeaway:** PDPs transform a black-box model into interpretable feature-response curves. They answer the question "how does the model's prediction change as I vary this one feature?" The curves should align with domain knowledge; unexpected shapes warrant investigation.
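The averaging that produces a PDP curve is simple enough to write out by hand. The sketch below is an illustrative re-implementation under stated assumptions: `partial_dependence_sketch` and `toy_predict` are hypothetical names, and the three-row dataset is made up so the result can be checked mentally.

```python
def partial_dependence_sketch(predict, X, feature_idx, grid):
    """For each grid value, overwrite one feature in every row and average the predictions."""
    curve = []
    for value in grid:
        preds = []
        for row in X:
            modified = list(row)           # copy so the original data is untouched
            modified[feature_idx] = value  # force the feature to the grid value
            preds.append(predict(modified))
        curve.append(sum(preds) / len(preds))
    return curve

def toy_predict(row):
    # Stand-in for a fitted model: prediction = feature 0 times feature 1.
    return row[0] * row[1]

X = [[1, 1], [5, 2], [9, 3]]
pd_curve = partial_dependence_sketch(toy_predict, X, feature_idx=0, grid=[0, 2, 4])
print(pd_curve)
```

Each point on the curve is the grid value times the mean of feature 1 (which is 2), so the curve rises by 2 per unit of feature 0. All other features keep their observed values, which is why PDPs can mislead when features are strongly correlated: some of the modified rows may describe combinations that never occur in practice.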

---

### 4.2 Individual Conditional Expectation (ICE) Plots

While PDP shows the **average** effect of a feature across all samples, ICE plots show the effect for **each individual sample** as a separate thin line. The thick PDP line is the average of all ICE curves. If the ICE curves are tightly bundled, the feature has a consistent effect across the population. If they fan out or cross, the feature interacts with other features, meaning its effect depends on context.

Non-parallel ICE curves are a signal of **interaction effects** that the PDP alone would hide. Curves that are merely shifted up or down share the same feature effect and differ only in baseline level. For example, median income might raise predicted house value more steeply for coastal locations than for inland locations.

In [None]:
# Create ICE plots for top feature
top_feature = top_features[0]

fig, ax = plt.subplots(figsize=(10, 6))
ice_display = PartialDependenceDisplay.from_estimator(  # avoid shadowing IPython's display()
    rf_model, X_val, [top_feature],
    kind="both",  # Shows both average PDP and individual ICE curves
    ax=ax,
    random_state=RANDOM_SEED,
    ice_lines_kw={"alpha": 0.1}
)
plt.suptitle(f'ICE Plot for {top_feature}', fontsize=14)
plt.tight_layout()
plt.show()

print(f"\n⚠️ Interpretation Notes:")
print(f"  - PDP shows average effect of {top_feature} on predictions")
print(f"  - ICE curves show effect for individual samples")
print(f"  - ICE curves with different slopes (or that cross) = interactions with other features")

**Reading the output:**

The ICE plot overlays many thin semi-transparent lines (individual samples; by default scikit-learn draws a random subsample of at most 1,000 curves) with a single highlighted line (the PDP average). For **MedInc**, most ICE curves follow the same upward trend, but you can see variation: some curves are steeper and some are flatter, indicating that the effect of income on predicted house value depends on the other features of each sample.

If the ICE curves cross each other, that is strong evidence of a **feature interaction**. For example, a sample in a high-latitude area (e.g., northern rural California) may show a flatter income-price curve than a sample in a coastal urban area, because the geographic premium amplifies the income effect.

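The vertical spread of ICE curves at any given income level mostly reflects differences in baseline predictions driven by the other features. What reveals heterogeneity in the income effect is the shape of the curves: curves that stay roughly parallel mean income has a consistent effect, while curves with visibly different slopes mean the effect is context-dependent.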

**Why this matters:** ICE plots reveal individual-level variation that PDP averages hide. Before making policy recommendations based on PDP curves (e.g., "increasing income by $10k raises predicted house value by $X"), check ICE plots to see if that relationship holds uniformly or varies by subgroup.
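One way to separate level shifts from genuine interactions is to center each ICE curve at its first grid point and compare what remains. The sketch below is a toy illustration: `ice_curves`, `center`, and the two stand-in models (`additive`, `interacting`) are hypothetical, and integer data is used so the arithmetic is exact.

```python
def ice_curves(predict, X, feature_idx, grid):
    """One curve per row: predictions as a single feature sweeps over the grid."""
    curves = []
    for row in X:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature_idx] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves

def center(curves):
    """Subtract each curve's starting value so all curves begin at zero."""
    return [[v - curve[0] for v in curve] for curve in curves]

def additive(row):
    return 2 * row[0] + row[1]  # no interaction: effect of feature 0 is always +2 per unit

def interacting(row):
    return row[0] * row[1]      # interaction: effect of feature 0 depends on feature 1

X = [[1, 1], [5, 2], [9, 3]]
grid = [0, 2, 4]

end_points = {}
for name, model in [("additive", additive), ("interacting", interacting)]:
    centered = center(ice_curves(model, X, 0, grid))
    end_points[name] = [curve[-1] for curve in centered]
    print(f"{name}: centered end points = {end_points[name]}")
```

For the additive model the raw curves sit at different heights, yet after centering they coincide, so there is no interaction. For the interacting model the centered curves still fan out. Recent scikit-learn releases expose a `centered` option on `PartialDependenceDisplay.from_estimator` for the same purpose; check your installed version before relying on it.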

---

## 5. Error Analysis

### 5.1 Residual Analysis

Error analysis goes beyond aggregate metrics (MAE, RMSE, R-squared) to examine **where and how** the model fails. Residual plots reveal systematic patterns: a funnel shape indicates heteroscedasticity (errors grow with predicted value), clusters of large residuals point to subpopulations the model struggles with, and non-zero mean residuals signal bias.

The two-panel visualization below shows residuals vs. predicted values (left) and the residual distribution (right). An ideal model produces residuals that are randomly scattered around zero with constant variance and a symmetric, roughly normal distribution.

In [None]:
# Compute residuals
residuals = y_val - y_val_pred

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residual plot
axes[0].scatter(y_val_pred, residuals, alpha=0.3)
axes[0].axhline(y=0, color='r', linestyle='--')
axes[0].set_xlabel('Predicted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residual Plot')

# Residual histogram
axes[1].hist(residuals, bins=50, edgecolor='black')
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')

plt.tight_layout()
plt.show()

print(f"\nResidual Statistics:")
print(f"  Mean: {residuals.mean():.4f}")
print(f"  Std: {residuals.std():.4f}")
print(f"  Median: {residuals.median():.4f}")

**Reading the output:**

The **left panel** (residuals vs. predicted values) reveals systematic patterns. On California Housing, you will typically see a **funnel shape**: residuals are small for low predicted values but fan out for high predicted values. This heteroscedasticity means the model is less accurate for expensive homes. You may also notice a sharp diagonal edge of positive residuals along the line residual = 5.0 - predicted. These are homes whose true value sits at the target cap, so their residual shrinks toward zero as the prediction approaches 5.0.

The **right panel** (residual histogram) shows the distribution of errors. The distribution is centered near zero (mean residual close to 0.0), so there is no overall bias. However, the right tail is heavier than the left, meaning the model's largest misses are underpredictions, especially for high-value properties.

The printed residual statistics give the numerical summary: mean near **0.00** (no overall bias), standard deviation around **0.45-0.50**, and median close to zero. A median noticeably below the mean is another sign of right skew: a few large underpredictions are balanced by many small overpredictions.

**Why this matters:** Residual analysis is the diagnostic that tells you where your model fails systematically, not just on average. The funnel shape above directly motivates the segment error analysis in the next section.
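A heavy right tail does not by itself say which direction of error is more frequent. The following sketch uses ten hypothetical residuals (true minus predicted), chosen only to illustrate how the mean, the median, and the share of underpredictions can tell different stories.

```python
from statistics import mean, median

# Hypothetical residuals: many small overpredictions, a few large underpredictions.
residuals = [-0.25] * 6 + [0.125, 0.125, 0.5, 0.75]

res_mean = mean(residuals)
res_median = median(residuals)
share_under = sum(r > 0 for r in residuals) / len(residuals)  # residual > 0 = underprediction

print(f"Mean:   {res_mean:.4f}")
print(f"Median: {res_median:.4f}")
print(f"Share of underpredictions: {share_under:.0%}")
print(f"Largest underprediction: {max(residuals):.3f}")
print(f"Largest overprediction:  {abs(min(residuals)):.3f}")
```

Here the mean is zero, the median is negative, and only 40 % of samples are underpredicted, yet the single largest miss is an underprediction. Checking all three numbers guards against reading too much into the mean alone.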

---

### 5.2 Segment Error Analysis

Aggregate error metrics can mask dramatic performance differences across subpopulations. Segment error analysis splits the validation set into groups based on the most important feature and computes error metrics separately for each group. This reveals whether the model performs well everywhere or only in certain regions of the feature space.

We use quartiles of the top feature (typically median income) to create four segments. If one segment has dramatically higher MAE than the others, that segment represents a systematic failure mode that may require additional features, a different model architecture, or explicit post-processing.

In [None]:
# Analyze errors by segments
# Build an analysis frame with true values, predictions, and absolute errors
X_val_analysis = X_val.copy()
X_val_analysis['y_true'] = y_val.values
X_val_analysis['y_pred'] = y_val_pred
X_val_analysis['abs_error'] = np.abs(residuals)

# Segment by top feature
top_feature = importance_df.iloc[0]['feature']
X_val_analysis[f'{top_feature}_segment'] = pd.qcut(X_val_analysis[top_feature], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Error by segment
segment_errors = X_val_analysis.groupby(f'{top_feature}_segment', observed=True).agg({
    'abs_error': ['mean', 'median', 'std'],
    'y_true': 'count'
}).round(4)

print(f"\n=== ERROR ANALYSIS BY {top_feature} SEGMENT ===")
print(segment_errors)

**Reading the output:**

The segment error table breaks down MAE, median error, error standard deviation, and sample count for each quartile of the top feature (typically **MedInc**). The columns are organized hierarchically: under `abs_error` you see mean, median, and std; under `y_true` you see the count.

Typical pattern on California Housing: **Q1** (lowest income quartile) has relatively low MAE because these low-value homes cluster in a narrow price range that is easy to predict. **Q4** (highest income quartile) has dramatically higher MAE, often 2-3x the Q1 error, because high-income areas have diverse housing values and the target cap at 5.0 truncates the model's ability to distinguish luxury properties.

The sample counts should be roughly equal across quartiles (each ~25 % of validation data), confirming that `pd.qcut` created balanced groups.

**Key takeaway:** The high-income segment (Q4) is the model's primary failure mode. Any effort to improve the model should focus here: additional features (school ratings, proximity to coast), explicit handling of the capped targets (the cap is baked into the data and cannot simply be removed), or training a separate specialist model for the luxury segment.
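The quartile logic can be sketched without pandas to show what the segment table is computing. This is an illustrative approximation of the `pd.qcut` plus `groupby` steps, not their actual implementation: `segment_mae` is a hypothetical helper, and the eight data points are made up.

```python
from bisect import bisect_left
from statistics import mean, quantiles

def segment_mae(feature, abs_error, n_segments=4):
    """Split samples into quantile bins of `feature` and report (MAE, count) per bin."""
    # Interior cut points; "inclusive" treats the data as the full population.
    cuts = quantiles(feature, n=n_segments, method="inclusive")
    groups = [[] for _ in range(n_segments)]
    for value, err in zip(feature, abs_error):
        # bisect_left places a value equal to a cut point in the lower bin.
        groups[bisect_left(cuts, value)].append(err)
    return {
        f"Q{i + 1}": (mean(errs) if errs else float("nan"), len(errs))
        for i, errs in enumerate(groups)
    }

# Hypothetical feature values and absolute errors, for illustration only.
feature = [1, 2, 3, 4, 5, 6, 7, 8]
abs_error = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.5, 0.7]

table = segment_mae(feature, abs_error)
for segment, (seg_mae, count) in table.items():
    print(f"{segment}: MAE={seg_mae:.2f}, n={count}")

ratio = table["Q4"][0] / table["Q1"][0]
print(f"Q4 / Q1 error ratio: {ratio:.1f}x")
```

Reporting the ratio between the worst and best segments alongside the overall MAE makes the size of the performance gap explicit.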

---

In [None]:
# Visualize errors by segment
fig, ax = plt.subplots(figsize=(10, 6))
X_val_analysis.boxplot(column='abs_error', by=f'{top_feature}_segment', ax=ax)
ax.set_xlabel(f'{top_feature} Quartile')
ax.set_ylabel('Absolute Error')
ax.set_title(f'Error Distribution by {top_feature} Segment')
plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.show()

**Reading the output:**

The box plot visualizes the error distribution for each quartile segment. Each box shows the interquartile range (IQR) of absolute errors, the line inside is the median, and the whiskers extend to the most extreme points that lie within 1.5x IQR of the box edges. Outlier dots beyond the whiskers represent samples where the model made particularly large errors.

The **Q4 box** is visibly taller and higher than the others, confirming quantitatively what the residual plot showed qualitatively: the model struggles most with high-income areas. The Q4 box also has more outliers (dots above the whisker), some with absolute errors exceeding $100,000.

Compare the median line across boxes: Q1 and Q2 have medians near **0.15-0.25** ($15,000-25,000), while Q4's median is around **0.40-0.60** ($40,000-60,000). This 2-3x error ratio means the model's accuracy depends heavily on the income bracket of the property being valued.

**Why this matters:** This box plot is a powerful communication tool for stakeholders. Instead of reporting a single MAE of ~$33,000, you can say: "For properties in low-income areas, the model's typical error is about $20,000; for properties in high-income areas, it is about $50,000. We should not use this model for luxury home valuations without additional improvements."

---

## 📝 PAUSE-AND-DO Exercise 2 (10 minutes)

**Task:** Run segment error analysis and identify one failure segment.

**Instructions:**
1. Review the segment error analysis above
2. Identify which segment has the highest error
3. Propose a hypothesis for why this segment performs poorly
4. Suggest one potential improvement

---

### YOUR SEGMENT ANALYSIS HERE:

**Highest Error Segment:**  
[Which segment and why?]

**Hypothesis:**  
[Why does this segment have higher errors?]

**Potential Improvement:**  
[What could help reduce errors in this segment?]

---

## 6. Interpretation Narrative Template (Evidence-Based)

### 6.1 Model Strengths

**Model Strengths (Evidence-Based):**

1. **Overall Performance**: The Random Forest model achieves an R² of [X.XX] on the validation set, explaining [XX]% of variance in housing prices.

2. **Key Drivers**: Feature importance analysis reveals that [Top Feature] is the strongest predictor, with a permutation importance of [X.XX], followed by [Second Feature].

3. **Predictive Patterns**: Partial dependence plots show [describe pattern] relationship between [feature] and predictions.

---

### 6.2 Model Limitations

**Model Limitations (Honest Assessment):**

1. **Segment Performance Gaps**: Error analysis reveals systematically higher errors in [segment description], with mean absolute error of [X.XX] compared to overall MAE of [X.XX].

2. **Feature Interactions**: ICE plots show substantial variation in individual effects, suggesting complex interactions that the model may not fully capture.

3. **Residual Patterns**: Residual analysis indicates [describe pattern], suggesting [interpretation].

4. **Generalization Concerns**: The model is trained on [timeframe/geography], and may not generalize to [different context].

---

### 6.3 Recommendations

**Recommendations:**

1. **Use with Caution**: Apply elevated scrutiny to predictions for [high-error segment].

2. **Feature Engineering**: Consider engineering additional features to capture [identified interaction].

3. **Model Monitoring**: Track permutation importance stability over time to detect feature drift.

4. **Error Bounds**: When communicating predictions, include prediction intervals (a range for the individual predicted value), especially for [segment].
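One simple way to attach error bounds to a prediction is to use a quantile of the absolute validation residuals as a half-width. This is a swapped-in illustration, not a method used elsewhere in this notebook: `residual_interval` is a hypothetical helper, the residuals are made up, and the approach assumes future errors resemble validation errors.

```python
from statistics import quantiles

def residual_interval(prediction, abs_residuals, coverage_pct=90):
    """Return (lower, upper) so that about coverage_pct % of validation errors fit inside."""
    # 99 percentile cut points; index k - 1 holds the k-th percentile.
    cuts = quantiles(abs_residuals, n=100, method="inclusive")
    half_width = cuts[coverage_pct - 1]
    return prediction - half_width, prediction + half_width

# Hypothetical absolute residuals from a validation set, in units of $100,000.
abs_residuals = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

lower, upper = residual_interval(2.0, abs_residuals)
print(f"Prediction 2.00, approx. 90% interval: [{lower:.2f}, {upper:.2f}]")
```

Because errors differ sharply across segments, a single global half-width will be too wide for easy segments and too narrow for hard ones; computing the quantile separately within each segment is a natural refinement.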

---

## 7. Project Milestone 3 Scaffold

### 7.1 Deliverable Checklist

**Project Milestone 3 Requirements:**

- [ ] Updated model comparison table (baseline vs improved models)
- [ ] Champion model selection with justification
- [ ] Permutation importance plot and interpretation
- [ ] PDP/ICE plots for top 3-4 features
- [ ] Segment error analysis with findings
- [ ] Evidence-based interpretation narrative
- [ ] Model limitations section (honest assessment)
- [ ] Next steps and recommendations

---

## 8. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Permutation Importance**: Model-agnostic method to measure feature importance
2. **Partial Dependence**: Visualizing average effect of features on predictions
3. **ICE Plots**: Understanding individual-level feature effects and interactions
4. **Error Analysis**: Finding systematic failure patterns through segmentation
5. **Honest Communication**: Evidence-based interpretation with clear limitations

### Interpretation Best Practices:

- ✓ Always compute importance on validation/test data, not training data
- ✓ Report standard deviations to show stability
- ✓ Check for correlated features when interpreting importance
- ✓ Use PDP cautiously when features are highly correlated
- ✓ Conduct segment analysis to find failure modes
- ✓ Be honest about limitations and uncertainties

### Remember:

> **"Interpretation is about honest communication, not selling the model."**  
> Report what you find, including limitations and failure modes.

---

## Bibliography

- scikit-learn User Guide: [Inspection Tools](https://scikit-learn.org/stable/inspection.html) (permutation importance, partial dependence)
- Molnar, C. (2022). *Interpretable Machine Learning*. [Online book](https://christophm.github.io/interpretable-ml-book/)
- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning: With Applications in Python* (ISLP). Springer.
- Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32.

---




<center>

Thank you!

</center>