# 2.06: Model Performance Diagnostics - Executive Level

## OPENING

Alright — we've computed metrics, we've built the scoreboard, and we've compared our baselines.

But here's the thing: **No model is signed off yet.**

Before we deploy anything — before we call a baseline "good enough" — we need due diligence. This is inspection before deployment.

Even if a baseline passes the Sufficiency Gate, diagnostics are required for risk management.

Executives don't want 30,000 individual SKU metrics. They want a system health report. They want to know: **Where is the fire?**

This notebook takes us from scoreboard rankings to diagnostic triage — the repeatable workflow real forecasting teams use to assess health at the portfolio level.

## SETUP: Load Dependencies and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_theme()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [None]:
# Load CV results from Module 2.05
# This should contain cross-validation metrics for each model across SKUs
cv_results = pd.read_csv('path/to/cv_results.csv')
sku_metadata = pd.read_csv('path/to/sku_metadata.csv')

print(f"Loaded {len(cv_results)} cross-validation records")
print(f"Loaded {len(sku_metadata)} SKU metadata records")
print(f"\nCV Results columns: {cv_results.columns.tolist()}")
print(f"\nSKU Metadata columns: {sku_metadata.columns.tolist()}")

---
## SECTION 1: THE DIAGNOSTIC MINDSET

### Diagnostics Are a Roadmap, Not a Ranking

The scoreboard told us who scored best. But that doesn't tell us what to do next.

Diagnostics answer different questions:
- **Where** does the error live? Segments? Volume tiers?
- **Does performance decay by horizon?** Short-term vs mid-term?
- **Is the error systematic or random?** Bias vs variance?
- **Is the model stable over time?** Consistent or volatile?
- **Is the problem even forecastable?** Modeling failure or reality?

### The 5-Question Triage Checklist

**1. Where?** Find the 20% of segments causing 80% of the pain.

**2. When?** Check for cliffs in horizon accuracy.

**3. What Type?** Bias (fixable) vs Variance (manageable).

**4. How Stable?** Check for planner-killing volatility.

**5. Why?** Is this a modeling problem or a noise floor problem?

**Pro tip:** Sections 2 through 6 each answer one of these five questions.

---
## SECTION 2: THE PARETO OF ERROR (WHERE THE FIRE IS)

### Find the High-Impact Failure Zones Fast

The first question: **Where does the error live?**

We slice total portfolio error three ways:
- **Category:** Does most error come from Produce? Frozen?
- **Volume Tier:** Are we failing on A-items (high impact) or C-items (noise)?
- **Forecastability Quadrants:** Does the model fail where it should succeed? (Covered in Section 6.)
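
One way to make the 80/20 idea concrete is to rank segments by their share of total absolute error and accumulate. This is a minimal sketch on made-up numbers; the `abs_error` column and the category totals are illustrative assumptions, not columns from `cv_results`.

```python
import pandas as pd

# Toy data: total absolute forecast error per category (made-up numbers)
toy = pd.DataFrame({
    'category': ['Produce', 'Frozen', 'Dairy', 'Bakery', 'Snacks'],
    'abs_error': [520.0, 310.0, 90.0, 50.0, 30.0],
})

# Rank segments by error, then compute each segment's share and the running total
pareto = toy.sort_values('abs_error', ascending=False).reset_index(drop=True)
pareto['error_share'] = pareto['abs_error'] / pareto['abs_error'].sum()
pareto['cumulative_share'] = pareto['error_share'].cumsum()

# Smallest set of top-ranked segments that covers at least 80% of total error
n_needed = int((pareto['cumulative_share'] < 0.80).sum()) + 1
fire_zones = pareto.head(n_needed)

print(pareto)
print(f"\n{n_needed} of {len(pareto)} segments account for "
      f"{fire_zones['error_share'].sum():.0%} of total absolute error")
```

Ranking by share of absolute error, rather than by average wMAPE, keeps small but noisy segments from crowding out the high-volume ones.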

In [None]:
# Merge CV results with metadata to get segment information
diagnostics_df = cv_results.merge(sku_metadata, on='sku_id', how='left')

# Calculate total portfolio error as the unweighted mean of per-record wMAPE
# (every record counts equally, regardless of sales volume)
total_portfolio_wmape = diagnostics_df['wmape'].mean()
print(f"Total Portfolio wMAPE: {total_portfolio_wmape:.2%}")
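
Averaging a per-record `wmape` column weights every record equally, so a tiny SKU can move the portfolio number as much as a large one. If forecasts and actuals are available, a volume-weighted figure can be computed as total absolute error over total actuals. The sketch below uses made-up data, and the `actual` and `forecast` columns are assumptions (the same ones Section 3 checks for).

```python
import pandas as pd

# Toy data: one high-volume SKU (A1) and one low-volume SKU (C9)
toy = pd.DataFrame({
    'sku_id':   ['A1', 'A1', 'C9', 'C9'],
    'actual':   [1000.0, 1200.0, 10.0, 5.0],
    'forecast': [900.0, 1300.0, 20.0, 10.0],
})
toy['abs_error'] = (toy['forecast'] - toy['actual']).abs()

# Per-SKU wMAPE: sum of absolute errors divided by sum of actuals
per_sku = toy.groupby('sku_id')[['abs_error', 'actual']].sum()
per_sku['wmape'] = per_sku['abs_error'] / per_sku['actual']

# Two ways to roll up to the portfolio level
simple_mean_wmape = per_sku['wmape'].mean()                    # each SKU counts equally
pooled_wmape = toy['abs_error'].sum() / toy['actual'].sum()    # weighted by volume

print(per_sku)
print(f"\nSimple mean of SKU wMAPE: {simple_mean_wmape:.2%}")
print(f"Pooled (volume-weighted) wMAPE: {pooled_wmape:.2%}")
```

In this toy case the low-volume SKU drags the simple mean far above the pooled figure, which is why the two numbers should not be used interchangeably in a health report.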

In [None]:
# ANALYSIS 1: Error by Category
error_by_category = diagnostics_df.groupby('category').agg({
    'wmape': ['mean', 'count'],
    'sku_id': 'nunique'  # distinct SKUs, not record count
}).round(4)

error_by_category.columns = ['avg_wmape', 'num_records', 'num_skus']
error_by_category['pct_of_portfolio'] = (error_by_category['num_records'] / error_by_category['num_records'].sum() * 100).round(1)
error_by_category = error_by_category.sort_values('avg_wmape', ascending=False)

print("Error Contribution by Category:")
print(error_by_category)
print(f"\nHighest-error category: {error_by_category.index[0]} with {error_by_category.iloc[0]['avg_wmape']:.2%} average wMAPE")

In [None]:
# ANALYSIS 2: Error by Volume Tier (A, B, C items)
# Assuming 'volume_tier' column exists in metadata
if 'volume_tier' in diagnostics_df.columns:
    error_by_tier = diagnostics_df.groupby('volume_tier').agg({
        'wmape': ['mean', 'count'],
        'sku_id': 'nunique'
    }).round(4)
    
    error_by_tier.columns = ['avg_wmape', 'num_records', 'num_skus']
    error_by_tier['pct_of_portfolio'] = (error_by_tier['num_records'] / error_by_tier['num_records'].sum() * 100).round(1)
    error_by_tier = error_by_tier.sort_values('avg_wmape', ascending=False)
    
    print("\nError Contribution by Volume Tier:")
    print(error_by_tier)
else:
    print("\nVolume tier column not found. Skipping this analysis.")

---
## SECTION 3: BIAS VS VARIANCE (THE TWO FAILURE MODES)

### The Key Insight

**"Bias tells you what to build. Variance tells you what to buffer."**

- **Bias (Systematic Error):** Consistently high or low. Missing structure — a promo, a holiday, a trend. *Fixable via features.*
- **Variance (Random Error):** Forecast jumps around wildly. Weak signal or noisy world. *Managed via smoothing or inventory buffers.*
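
To see the two failure modes side by side, compare the mean of the signed errors (bias) with their spread (variance). This is a small sketch on made-up numbers; the arrays and the `bias_and_spread` helper are hypothetical and exist only for illustration.

```python
import numpy as np

actual = np.array([100.0, 100.0, 100.0, 100.0])
biased_forecast = np.array([110.0, 112.0, 109.0, 111.0])  # always too high
noisy_forecast = np.array([80.0, 125.0, 90.0, 105.0])     # centered, but erratic

def bias_and_spread(forecast, actual):
    """Return (mean signed error, sample std of signed errors)."""
    errors = forecast - actual
    return errors.mean(), errors.std(ddof=1)

biased_mean, biased_std = bias_and_spread(biased_forecast, actual)
noisy_mean, noisy_std = bias_and_spread(noisy_forecast, actual)

print(f"Biased forecast: mean error {biased_mean:+.1f}, std {biased_std:.1f}")
print(f"Noisy forecast:  mean error {noisy_mean:+.1f}, std {noisy_std:.1f}")
```

The first forecast misses in one direction with little scatter, which points to missing structure. The second averages out to zero error but swings widely, which points to buffering rather than new features.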

In [None]:
# Calculate Bias: signed percentage error (directional), plus absolute percentage error
# Assuming 'forecast' and 'actual' columns exist
if 'forecast' in diagnostics_df.columns and 'actual' in diagnostics_df.columns:
    # Treat zero actuals as missing so the division cannot produce inf
    safe_actual = diagnostics_df['actual'].replace(0, np.nan)
    diagnostics_df['signed_pct_error'] = (diagnostics_df['forecast'] - diagnostics_df['actual']) / safe_actual
    diagnostics_df['ape'] = diagnostics_df['signed_pct_error'].abs()  # magnitude only, no direction
    
    # Bias = signed percentage error per record; its mean per segment is the MPE
    diagnostics_df['bias'] = diagnostics_df['signed_pct_error']
    
    # Variance proxy = std of absolute error magnitudes, computed per segment below
    diagnostics_df['error_magnitude'] = np.abs(diagnostics_df['forecast'] - diagnostics_df['actual'])
    
    # Segment analysis
    bias_variance_by_category = diagnostics_df.groupby('category').agg({
        'bias': 'mean',
        'error_magnitude': 'std',
        'wmape': 'mean'
    }).round(4)
    
    bias_variance_by_category.columns = ['mean_bias', 'error_std', 'wmape']  # second column is a std, not a variance
    bias_variance_by_category = bias_variance_by_category.sort_values('wmape', ascending=False)
    
    print("Bias vs Variance by Category:")
    print(bias_variance_by_category)
else:
    print("Forecast and actual columns not found. Cannot compute bias/variance.")

In [None]:
# Classify segments into action categories
if 'bias' in diagnostics_df.columns:
    by_category_summary = diagnostics_df.groupby('category').agg({
        'bias': lambda x: abs(x.mean()),  # size of the systematic (net) error; offsetting misses cancel out
        'error_magnitude': 'std',
        'wmape': 'mean'
    }).round(4)
    
    by_category_summary.columns = ['avg_bias', 'error_volatility', 'wmape']
    
    # Classification
    by_category_summary['action'] = 'Investigate'
    high_bias = by_category_summary['avg_bias'] > by_category_summary['avg_bias'].median()
    high_variance = by_category_summary['error_volatility'] > by_category_summary['error_volatility'].median()
    
    by_category_summary.loc[high_bias & ~high_variance, 'action'] = 'Feature Engineering'
    by_category_summary.loc[~high_bias & high_variance, 'action'] = 'Buffer/Policy'
    by_category_summary.loc[high_bias & high_variance, 'action'] = 'Both'
    
    print("\nAction Plan by Segment:")
    print(by_category_summary)

---
## SECTION 4: STABILITY ACROSS ORIGINS (PLANNER PAIN)

### The Fourth Question: Is the Model Stable Over Time?

Planners hate volatility. A model that sits consistently at 12% error is easier to plan around than one that swings between 8% and 18% every week.

We measure stability using **standard deviation of wMAPE across cross-validation windows**.
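
As an illustration of why the mean alone is not enough, the sketch below compares two hypothetical models with the same average wMAPE across folds. The fold values and model names are made up.

```python
import pandas as pd

# Toy per-fold wMAPE for two hypothetical models (made-up numbers)
fold_wmape = pd.DataFrame({
    'fold_id': [1, 2, 3, 4],
    'steady_model': [0.12, 0.12, 0.13, 0.11],
    'erratic_model': [0.08, 0.18, 0.09, 0.13],
}).set_index('fold_id')

# Mean, sample standard deviation, and coefficient of variation per model
summary = pd.DataFrame({
    'mean_wmape': fold_wmape.mean(),
    'std_wmape': fold_wmape.std(),
})
summary['coef_of_variation'] = summary['std_wmape'] / summary['mean_wmape']

print(summary.round(4))
```

Both models average 12% wMAPE, but the erratic one has a much larger standard deviation across folds. That difference is what a planner experiences as week-to-week whiplash.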

In [None]:
# Assuming 'fold_id' or 'cv_window' column exists
if 'fold_id' in cv_results.columns or 'cv_window' in cv_results.columns:
    fold_col = 'fold_id' if 'fold_id' in cv_results.columns else 'cv_window'
    
    # Calculate wMAPE by fold
    wmape_by_fold = cv_results.groupby(fold_col)['wmape'].mean()
    
    stability_metrics = {
        'mean_wmape': wmape_by_fold.mean(),
        'std_wmape': wmape_by_fold.std(),
        'min_wmape': wmape_by_fold.min(),
        'max_wmape': wmape_by_fold.max(),
        'cv_coefficient': wmape_by_fold.std() / wmape_by_fold.mean()
    }
    
    print("Model Stability Metrics:")
    for metric, value in stability_metrics.items():
        print(f"  {metric}: {value:.4f}")
    
    if stability_metrics['cv_coefficient'] > 0.1:
        print("\n⚠️  WARNING: High volatility detected. Coefficient of variation across folds > 0.1")
    else:
        print("\n✓ Stability looks good. Model is consistent across folds.")
else:
    print("Fold information not found. Skipping stability analysis.")

---
## SECTION 5: HORIZON DIAGNOSTICS (SHORT VS MID-TERM)

### The Second Question: Does Performance Decay by Horizon?

Different horizons drive different decisions.
- **Weeks 1-4 (Short Term):** Execution and replenishment
- **Weeks 5-13 (Mid Term):** Ordering and planning

If a model is great at week 1 but falls off a cliff at week 5, that matters.
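
The short-term and mid-term buckets above can be sketched with `pd.cut`. The error curve below is made up, and the column names `horizon` and `wmape` are assumptions about how horizon-level results might be stored.

```python
import pandas as pd

# Toy error curve: wMAPE by forecast horizon in weeks (made-up numbers)
toy = pd.DataFrame({
    'horizon': list(range(1, 14)),
    'wmape': [0.10, 0.11, 0.11, 0.12, 0.19, 0.20, 0.20,
              0.21, 0.21, 0.22, 0.22, 0.23, 0.23],
})

# Bucket horizons into the two decision windows: weeks 1-4 and weeks 5-13
toy['horizon_bucket'] = pd.cut(
    toy['horizon'],
    bins=[0, 4, 13],
    labels=['Short Term (1-4)', 'Mid Term (5-13)'],
)
bucket_wmape = toy.groupby('horizon_bucket', observed=True)['wmape'].mean()

# Horizon with the largest week-over-week increase in wMAPE
cliff_horizon = toy.set_index('horizon')['wmape'].diff().idxmax()

print(bucket_wmape.round(4))
print(f"\nLargest jump occurs at horizon {cliff_horizon}")
```

Reporting one number per decision window keeps the executive view short, while the week-over-week difference still pinpoints where the cliff sits.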

In [None]:
# Assuming 'horizon' or 'forecast_horizon' column exists
if 'horizon' in cv_results.columns or 'forecast_horizon' in cv_results.columns:
    horizon_col = 'horizon' if 'horizon' in cv_results.columns else 'forecast_horizon'
    
    # Calculate wMAPE by horizon
    wmape_by_horizon = cv_results.groupby(horizon_col)['wmape'].mean().sort_index()
    
    print("Error Curve by Forecast Horizon:")
    print(wmape_by_horizon)
    
    # Detect cliff
    horizon_diffs = wmape_by_horizon.diff()
    max_cliff = horizon_diffs.max()
    max_cliff_horizon = horizon_diffs.idxmax()
    
    print(f"\nLargest wMAPE increase: {max_cliff:.4f} at horizon {max_cliff_horizon}")
    
    if max_cliff > 0.05:  # wMAPE jumps by more than 5 percentage points between adjacent horizons
        print(f"⚠️  CLIFF DETECTED: Sharp performance drop at horizon {max_cliff_horizon}")
    else:
        print("✓ Gradual decay pattern observed.")
else:
    print("Horizon information not found. Skipping horizon diagnostics.")

In [None]:
# Visualize horizon decay if available
if 'horizon' in cv_results.columns or 'forecast_horizon' in cv_results.columns:
    fig, ax = plt.subplots(figsize=(10, 6))
    wmape_by_horizon.plot(ax=ax, marker='o', linewidth=2, markersize=8)
    ax.set_xlabel('Forecast Horizon (weeks)')
    ax.set_ylabel('wMAPE')
    ax.set_title('Performance Decay by Horizon')
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

---
## SECTION 6: FORECASTABILITY QUADRANTS (STRUCTURE × CHAOS)

### Connecting Back to Module 1

We map performance back to our Structure vs Chaos plane.

- **Easy Wins (High Structure / Low Chaos):** Baseline should win. If it fails, something is broken.
- **Variance Trap (High Structure / High Chaos):** Expect some jitter.
- **Sparse Zone (Low Structure / Low Chaos):** Precision is impossible. Focus on bias.
- **Unforecastable Core (Low Structure / High Chaos):** The baseline is expected to struggle here.

Often we don't need "a better model" — we need **routing**.
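
Routing here means choosing a treatment per quadrant instead of forcing one approach onto every SKU. The sketch below is hypothetical: the `ROUTING` table, its treatment labels, and the SKU ids are illustrative placeholders that paraphrase the bullets above, not a prescribed policy.

```python
import pandas as pd

# Hypothetical routing table: quadrant -> how to treat SKUs in that quadrant
ROUTING = {
    'Easy Wins': 'Baseline should win; investigate any failure',
    'Variance Trap': 'Baseline with tolerance for jitter',
    'Sparse Zone': 'Monitor bias rather than precision',
    'Unforecastable Core': 'Expect baseline to struggle',
}

# Toy SKUs with quadrant labels already assigned (made-up ids)
skus = pd.DataFrame({
    'sku_id': ['S1', 'S2', 'S3', 'S4'],
    'quadrant': ['Easy Wins', 'Unforecastable Core', 'Sparse Zone', 'Variance Trap'],
})

# Look up each SKU's treatment from its quadrant
skus['treatment'] = skus['quadrant'].map(ROUTING)

print(skus.to_string(index=False))
```

Keeping the policy in a single lookup table makes it easy to review with planners and to change without touching the diagnostics code.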

In [None]:
# Assuming 'structure_score' and 'chaos_score' exist in metadata
if 'structure_score' in sku_metadata.columns and 'chaos_score' in sku_metadata.columns:
    quad_data = diagnostics_df.copy()
    
    # Define quadrants
    structure_median = quad_data['structure_score'].median()
    chaos_median = quad_data['chaos_score'].median()
    
    # Vectorized quadrant assignment (same rules as a row-wise apply, but faster)
    high_structure = quad_data['structure_score'] >= structure_median
    high_chaos = quad_data['chaos_score'] >= chaos_median
    
    quad_data['quadrant'] = np.select(
        [
            high_structure & ~high_chaos,   # High Structure / Low Chaos
            high_structure & high_chaos,    # High Structure / High Chaos
            ~high_structure & ~high_chaos,  # Low Structure / Low Chaos
        ],
        ['Easy Wins', 'Variance Trap', 'Sparse Zone'],
        default='Unforecastable Core',      # Low Structure / High Chaos
    )
    
    # Analyze by quadrant
    quadrant_analysis = quad_data.groupby('quadrant').agg({
        'wmape': ['mean', 'std', 'count'],
        'sku_id': 'nunique'
    }).round(4)
    
    quadrant_analysis.columns = ['avg_wmape', 'wmape_std', 'num_records', 'num_skus']
    
    print("Performance by Forecastability Quadrant:")
    print(quadrant_analysis)
else:
    print("Structure/Chaos scores not found in metadata.")

---
## SECTION 7: SYSTEM HEALTH SUMMARY

### Key Findings

In [None]:
# Generate health summary report
health_report = f"""
SYSTEM HEALTH REPORT
{'='*60}

1. PORTFOLIO OVERVIEW
   - Total wMAPE: {total_portfolio_wmape:.2%}
   - Number of SKUs: {len(sku_metadata)}
   - Number of Records: {len(cv_results)}

2. KEY INSIGHTS

   Where is the fire?
   - Highest-error category: {error_by_category.index[0] if len(error_by_category) > 0 else 'N/A'}
   - That category's average wMAPE: {error_by_category.iloc[0]['avg_wmape']:.2%}
   
   What kind of error?
   - Bias vs Variance analysis available in Section 3
   
   Is it stable?
   - Check stability metrics in Section 4
   
   Performance by horizon?
   - Check decay pattern in Section 5

3. PRIORITY ACTIONS
   - See action plan by segment (Section 3)
   - Bias-driven segments → Feature Engineering
   - Variance-driven segments → Buffering/Policy

{'='*60}
"""

print(health_report)

---
## SECTION 8: OUTPUTS AND HANDOFF

### What We've Built

We've answered our five diagnostic questions:
1. ✓ **Where** - Pareto analysis (Section 2)
2. ✓ **When** - Horizon analysis (Section 5)
3. ✓ **What Type** - Bias vs Variance (Section 3)
4. ✓ **How Stable** - Stability metrics (Section 4)
5. ✓ **Why** - Forecastability quadrants (Section 6)

### Deliverable: The System Health Report

You now have:
- **Top segments for feature engineering** (high bias)
- **Top segments for inventory policy** (high variance)
- **Horizon watchlist** (decay cliffs)
- **Stability assessment** (planner trust factor)

### What's Next

We know *where* to focus. The next step is to know *why*.

In Module 2.7, we go detective mode: visual inspection, SKU-level analysis, and root cause identification.

In [None]:
# Export priority backlog
priority_segments = []

# Add top segments by error
top_error_segments = error_by_category.head(3).index.tolist()
for segment in top_error_segments:
    priority_segments.append({
        'segment': segment,
        'reason': 'High Error Contribution',
        'priority': 'CRITICAL'
    })

priority_backlog = pd.DataFrame(priority_segments)

print("\nPRIORITY BACKLOG FOR MODULE 2.7:")
print(priority_backlog.to_string(index=False))

# Save for next module
priority_backlog.to_csv('priority_segments_for_2.7.csv', index=False)
print("\n✓ Priority backlog saved to 'priority_segments_for_2.7.csv'")