# 2.07: Model Performance Diagnostics - Detailed Forensics (Micro Level)

## OPENING

In the last notebook (2.06), we ran portfolio diagnostics. We found where the fire is: which segments are driving 80% of the error, whether that error is bias or variance, and whether the model is stable.

That's **triage**. Now we **investigate**.

In this notebook, we go detective mode: pulling representative SKUs from high-impact segments, inspecting their forecasts visually, analyzing residuals, and finding the **smoking guns** — specific patterns the model is missing.

**Critical:** We're not cherry-picking. We're not debugging anecdotes. We're looking for **failure signatures that generalize** — patterns we can fix systematically in Module 3.

The output isn't a score. It's a **bug report** — the requirements document for feature engineering.

## SETUP: Load Dependencies and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from scipy import stats

sns.set_theme()
plt.rcParams['figure.figsize'] = (14, 6)
pd.set_option('display.max_columns', None)

In [None]:
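# Load forecasts and actuals from cross-validation
# Expected columns: sku_id, date, actual, forecast, model_name, fold_id,
# plus category (used below to match SKUs to segments)
cv_forecasts = pd.read_csv('path/to/cv_forecasts.csv', parse_dates=['date'])  # Parse dates so sorting and plotting are chronological
diagnostics_backlog = pd.read_csv('priority_segments_for_2.7.csv')  # From 2.06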

print(f"Loaded {len(cv_forecasts)} forecast records")
print(f"Columns: {cv_forecasts.columns.tolist()}")

---
## SECTION 1: THE SNIFF TEST IS A DISCIPLINE

### Visual Trust Matters

Sometimes a model passes all the metrics but produces a forecast that just looks... wrong. Straight line through obvious seasonality. Weird step functions. Negative demand.

This is the **Uncanny Valley of Forecasting**. The math passes, but reality fails.

Visual trust is part of system performance. A forecast that looks plausible will get used. A forecast that looks crazy will get overridden.

### The Sniff Test Checklist

When you look at a forecast plot, you're checking for:
- **Level:** Does it anchor to recent reality, or does it jump?
- **Shape:** Does the seasonal shape match history?
- **Trend:** Does it drift reasonably, or explode into infinity?
- **Negatives:** Negative demand = immediate fail
- **Step Changes:** Does it acknowledge known discontinuities?
- **Smoothness:** Is it overly smooth or overly reactive?

---
## SECTION 2: CASE SELECTION STRATEGY (DON'T CHERRY PICK)

In [None]:
# Get SKUs from high-impact segments identified in 2.06
# Representative sampling: up to 5 random SKUs per segment

rng = np.random.default_rng(42)  # Fixed seed so the sample is reproducible
selected_skus = []

# From diagnostics_backlog, sample representative SKUs
for segment in diagnostics_backlog['segment'].unique():
    segment_skus = cv_forecasts[cv_forecasts['category'] == segment]['sku_id'].unique()
    # Sample up to 5 random SKUs from each high-impact segment
    sample_size = min(5, len(segment_skus))
    sampled = rng.choice(segment_skus, size=sample_size, replace=False)
    selected_skus.extend(sampled)

# De-duplicate while preserving order (set() does not guarantee a stable order)
selected_skus = list(dict.fromkeys(selected_skus))
print(f"Selected {len(selected_skus)} SKUs for forensic investigation")
print(f"SKU List: {selected_skus[:10]}...")
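
Random sampling avoids hand-picking, but a small random draw can still miss the worst or the best SKUs in a segment. One possible complement is to stratify by error size so the sample spans low-, mid-, and high-error SKUs. The sketch below is an illustration only: `sample_by_error_strata` is a hypothetical helper, the toy data is made up, and it assumes there are at least as many SKUs as strata.

```python
import numpy as np
import pandas as pd


def sample_by_error_strata(df, n_strata=3, per_stratum=1, seed=42):
    """Pick SKUs from low-, mid-, and high-error strata (hypothetical helper)."""
    # Per-SKU mean absolute error
    sku_mae = (df['actual'] - df['forecast']).abs().groupby(df['sku_id']).mean()
    # Rank first so tied errors do not produce duplicate bin edges
    strata = pd.qcut(sku_mae.rank(method='first'), q=n_strata, labels=False)
    rng = np.random.default_rng(seed)
    picked = []
    for _, members in sku_mae.groupby(strata):
        k = min(per_stratum, len(members))
        picked.extend(rng.choice(members.index.to_numpy(), size=k, replace=False).tolist())
    return picked


# Toy data: six SKUs with increasing error (illustrative values only)
toy = pd.DataFrame({
    'sku_id': np.repeat(['A', 'B', 'C', 'D', 'E', 'F'], 4),
    'actual': 100.0,
    'forecast': np.repeat([99.0, 97.0, 90.0, 85.0, 60.0, 40.0], 4),
})
picked = sample_by_error_strata(toy)
print(picked)
```

With three strata and one pick each, the result contains one SKU from the two most accurate, one from the middle pair, and one from the two least accurate.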

---
## SECTION 3: VISUAL INSPECTION WORKFLOW

In [None]:
def plot_timeseries_inspection(sku_id, cv_forecasts, num_periods=52):
    """
    Visual inspection plot: actuals vs forecasts for the last num_periods dates
    """
    sku_data = cv_forecasts[cv_forecasts['sku_id'] == sku_id].sort_values('date')
    
    if len(sku_data) == 0:
        print(f"No data for SKU {sku_id}")
        return None
    
    # Keep the last num_periods distinct dates. A plain tail() counts rows,
    # and each date can appear once per model and fold.
    recent_dates = sku_data['date'].drop_duplicates().tail(num_periods)
    sku_data = sku_data[sku_data['date'].isin(recent_dates)]
    
    fig, ax = plt.subplots(figsize=(14, 6))
    
    # Plot actuals once per date (they repeat across models and folds)
    actuals = sku_data.drop_duplicates('date')
    ax.plot(actuals['date'], actuals['actual'], 
            marker='o', label='Actual', linewidth=2, color='black')
    
    # Plot forecasts by model
    for model in sku_data['model_name'].unique():
        model_data = sku_data[sku_data['model_name'] == model]
        ax.plot(model_data['date'], model_data['forecast'], 
                marker='s', label=f'Forecast: {model}', alpha=0.7, linewidth=1.5)
    
    ax.set_xlabel('Date')
    ax.set_ylabel('Demand')
    ax.set_title(f'Sniff Test: SKU {sku_id}')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    return sku_data

In [None]:
# Sniff test checklist for visual inspection
def run_sniff_test(sku_data):
    """
    Structured checklist for visual inspection
    """
    checks = {}
    
    # Level check: recent forecast vs actual (last 4 rows)
    recent_actual = sku_data['actual'].tail(4).mean()
    recent_forecast = sku_data['forecast'].tail(4).mean()
    if recent_actual > 0:
        level_ratio = recent_forecast / recent_actual
    else:
        # No recent demand: only a zero forecast counts as anchored
        level_ratio = 1.0 if recent_forecast == 0 else float('inf')
    checks['level_check'] = 'PASS' if 0.8 <= level_ratio <= 1.2 else 'FAIL'
    
    # Negative check
    has_negatives = (sku_data['forecast'] < 0).any()
    checks['negative_check'] = 'FAIL' if has_negatives else 'PASS'
    
    # Volatility check (proxy for Trend and Smoothness): forecast variability
    # should be comparable to actuals, neither flat-lined nor wildly reactive
    forecast_std = sku_data['forecast'].std()
    actual_std = sku_data['actual'].std()
    volatility_ratio = forecast_std / actual_std if actual_std > 0 else 0
    checks['trend_check'] = 'PASS' if 0.1 <= volatility_ratio <= 3 else 'FLAG'
    
    return checks
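
The checklist's **Shape** item is not covered by the checks above. One possible way to quantify it is the correlation between actuals and forecasts over the same periods: a forecast that follows the seasonal cycle correlates strongly, while a flat line does not. This is a minimal sketch on made-up weekly data; `shape_correlation` is a hypothetical helper, and any pass threshold would need tuning on real SKUs.

```python
import numpy as np


def shape_correlation(actual, forecast):
    """Pearson correlation between actuals and forecasts over the same periods."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    if a.std() == 0 or f.std() == 0:
        return 0.0  # A flat line carries no shape information
    return float(np.corrcoef(a, f)[0, 1])


# Illustrative weekly data: seasonal actuals, one forecast that tracks
# the cycle and one flat forecast at the average level
weeks = np.arange(52)
actual = 100 + 20 * np.sin(2 * np.pi * weeks / 52)
seasonal_forecast = 95 + 18 * np.sin(2 * np.pi * weeks / 52)
flat_forecast = np.full(52, 100.0)

print(shape_correlation(actual, seasonal_forecast))
print(shape_correlation(actual, flat_forecast))
```

Correlation ignores level and amplitude, so it complements the level and volatility checks rather than replacing them.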

---
## SECTION 4: RESIDUALS ARE THE CRIME SCENE

### What Residuals Tell You

Residuals = Actual - Forecast. A positive residual means the model under-forecast; a negative one means it over-forecast.

If residuals are random noise, the model captured everything forecastable. If residuals have patterns, the model missed something.

**Residuals are the crime scene.** We're looking for smoking guns.
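
A rough way to test the "random noise" claim is to compare the residual autocorrelations with the approximate 95% white-noise band of ±1.96/√n, a common large-sample rule of thumb. The sketch below uses made-up residuals; `residual_acf` and `lags_outside_noise_band` are hypothetical helpers, not part of the notebook.

```python
import numpy as np


def residual_acf(residuals, max_lag=8):
    """Sample autocorrelation of residuals at lags 1..max_lag."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    denom = np.sum(r ** 2)
    return np.array([np.sum(r[lag:] * r[:-lag]) / denom for lag in range(1, max_lag + 1)])


def lags_outside_noise_band(residuals, max_lag=8):
    """Return lags whose autocorrelation exceeds the approximate 95% white-noise band."""
    n = len(residuals)
    band = 1.96 / np.sqrt(n)
    acf = residual_acf(residuals, max_lag)
    return [lag for lag, value in enumerate(acf, start=1) if abs(value) > band]


# Illustrative residuals: a steadily drifting error, as if the model missed a trend
drifting = np.linspace(-10, 10, 60)
flagged = lags_outside_noise_band(drifting)
print(flagged)
```

If many lags fall outside the band, the residuals still carry structure the model could have learned. An occasional lag outside the band is expected by chance, since about 1 in 20 will cross it even for pure noise.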

In [None]:
def analyze_residuals(sku_data):
    """
    Comprehensive residual analysis
    """
    # Calculate residuals
    sku_data = sku_data.copy()
    sku_data['residual'] = sku_data['actual'] - sku_data['forecast']
    sku_data['abs_error'] = np.abs(sku_data['residual'])
    # +1 in the denominator avoids division by zero when actual is 0
    sku_data['pct_error'] = (sku_data['residual'] / (sku_data['actual'] + 1)) * 100
    
    # Summary statistics
    print("Residual Summary:")
    print(f"  Mean (Bias): {sku_data['residual'].mean():.2f}")
    print(f"  Std Dev: {sku_data['residual'].std():.2f}")
    print(f"  Min: {sku_data['residual'].min():.2f}")
    print(f"  Max: {sku_data['residual'].max():.2f}")
    
    return sku_data

In [None]:
def plot_residuals(sku_data, sku_id):
    """
    Three-view residual plot: signed error, abs error, rolling bias
    """
    fig, axes = plt.subplots(3, 1, figsize=(14, 10))
    
    # View 1: Signed residuals (error over time)
    axes[0].bar(range(len(sku_data)), sku_data['residual'], 
                color=['red' if x < 0 else 'green' for x in sku_data['residual']])
    axes[0].axhline(y=0, color='black', linestyle='--', linewidth=1)
    axes[0].set_title(f'SKU {sku_id}: Signed Residuals (Actual - Forecast)')
    axes[0].set_ylabel('Error')
    axes[0].grid(True, alpha=0.3)
    
    # View 2: Absolute error over time
    axes[1].plot(range(len(sku_data)), sku_data['abs_error'], 
                 marker='o', color='orange', linewidth=2)
    axes[1].set_title('Absolute Error Over Time')
    axes[1].set_ylabel('|Error|')
    axes[1].grid(True, alpha=0.3)
    
    # View 3: Rolling bias (4-period centered rolling average; edge periods are NaN)
    rolling_bias = sku_data['residual'].rolling(window=4, center=True).mean()
    axes[2].plot(range(len(sku_data)), rolling_bias, 
                 marker='s', color='purple', linewidth=2, label='4-period rolling bias')
    axes[2].axhline(y=0, color='black', linestyle='--', linewidth=1)
    axes[2].set_title('Rolling Bias (4-Period Window)')
    axes[2].set_xlabel('Time Period')
    axes[2].set_ylabel('Bias')
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

In [None]:
def detect_failure_signatures(sku_data):
    """
    Identify smoking guns in residuals
    """
    signatures = {}
    
    # 1. Autocorrelation (missed trend)
    residual_corr = sku_data['residual'].autocorr(lag=1)
    signatures['autocorrelation'] = {
        'value': residual_corr,
        'signal': 'HIGH' if abs(residual_corr) > 0.5 else 'LOW',
        'smoking_gun': 'Missed trend or momentum' if abs(residual_corr) > 0.5 else None
    }
    
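    # 2. Seasonality in residuals (missed seasonal component)
    # A lag-52 autocorrelation uses only len - 52 overlapping pairs, so require
    # two full years of weekly history; with exactly 52 rows the result is NaN
    if len(sku_data) >= 104: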
        seasonal_corr = sku_data['residual'].autocorr(lag=52)
        signatures['seasonality'] = {
            'value': seasonal_corr,
            'signal': 'HIGH' if seasonal_corr > 0.3 else 'LOW',
            'smoking_gun': 'Missed seasonal component' if seasonal_corr > 0.3 else None
        }
    
    # 3. Level shift (regime change)
    first_half = sku_data['residual'].iloc[:len(sku_data)//2].mean()
    second_half = sku_data['residual'].iloc[len(sku_data)//2:].mean()
    shift_magnitude = abs(second_half - first_half)
    signatures['level_shift'] = {
        'value': shift_magnitude,
        'signal': 'HIGH' if shift_magnitude > sku_data['residual'].std() else 'LOW',
        'smoking_gun': 'Regime change or structural shift' if shift_magnitude > sku_data['residual'].std() else None
    }
    
    return signatures
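
The level-shift rule above compares the two half-means against the overall residual standard deviation. As a possible cross-check, a Welch t-test from `scipy.stats` puts a p-value on the same split. This is a sketch on synthetic residuals; `level_shift_test` is a hypothetical helper, the midpoint split is arbitrary, and autocorrelated residuals violate the test's independence assumption, so the p-value is best read as a rough indicator.

```python
import numpy as np
from scipy import stats


def level_shift_test(residuals, alpha=0.05):
    """Welch t-test comparing mean residual in the first and second half."""
    r = np.asarray(residuals, dtype=float)
    mid = len(r) // 2
    first, second = r[:mid], r[mid:]
    t_stat, p_value = stats.ttest_ind(first, second, equal_var=False)
    return {
        'shift': float(second.mean() - first.mean()),
        'p_value': float(p_value),
        'significant': bool(p_value < alpha),
    }


# Synthetic residuals: unbiased at first, then a persistent
# under-forecast of about 5 units
rng = np.random.default_rng(0)
residuals = np.concatenate([rng.normal(0, 1, 26), rng.normal(5, 1, 26)])
result = level_shift_test(residuals)
print(result)
```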

---
## SECTION 5: MODEL VS DATA VS POLICY TRIAGE

In [None]:
def triage_problem_type(smoking_gun_description):
    """
    Classify problem into three categories: Model, Data, or Policy
    """
    # This is a rule-based classification that can be expanded
    
    model_keywords = ['trend', 'seasonal', 'holiday', 'promo', 'promotion', 'event']
    data_keywords = ['stockout', 'gap', 'null', 'error', 'freeze']
    policy_keywords = ['sporadic', 'intermittent', 'chaotic', 'unforecastable']
    
    description_lower = smoking_gun_description.lower()
    
    if any(keyword in description_lower for keyword in model_keywords):
        return 'MODEL_PROBLEM', 'Add features in Module 3'
    elif any(keyword in description_lower for keyword in data_keywords):
        return 'DATA_PROBLEM', 'Clean the data pipeline'
    elif any(keyword in description_lower for keyword in policy_keywords):
        return 'POLICY_PROBLEM', 'Use min/max inventory rules, not precision forecasting'
    else:
        return 'UNKNOWN', 'Requires further investigation'

# Example: triage a potential smoking gun
example_smoking_gun = "Residual spikes align with holidays and promotions"
problem_type, fix = triage_problem_type(example_smoking_gun)
print(f"Example Smoking Gun: {example_smoking_gun}")
print(f"Problem Type: {problem_type}")
print(f"Fix: {fix}")
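
Plain substring matching can misfire: `'gap'` is a substring of "singapore", and `'event'` is a substring of "prevent". One possible refinement is to anchor each keyword at the start of a word with a regular expression, which still matches plurals such as "holidays". The sketch below is an assumption about how the rules could be tightened; `has_keyword` is a hypothetical helper, and the keyword list is copied from the classifier above.

```python
import re

# Keyword list copied from the rule-based classifier for illustration
DATA_KEYWORDS = ['stockout', 'gap', 'null', 'error', 'freeze']


def has_keyword(description, keywords):
    """True if any keyword starts a word in the description (case-insensitive)."""
    # \b anchors the match at a word start, so 'gap' no longer hits 'Singapore'
    pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')'
    return re.search(pattern, description, flags=re.IGNORECASE) is not None


print(has_keyword("Sales gap during stockout week", DATA_KEYWORDS))
print(has_keyword("Singapore stores show drift", DATA_KEYWORDS))
print(has_keyword("Frequent stockouts in Q4", DATA_KEYWORDS))
```

The trade-off is that a leading anchor alone still matches longer words that begin with a keyword, such as "nullify", so the keyword lists need review either way.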

---
## SECTION 6: FORENSIC INVESTIGATION SUMMARY

In [None]:
# Compile forensic findings into backlog for Module 3
forensic_backlog = []

# For each selected SKU, run the full investigation one model at a time.
# Mixing rows from several models would interleave their residuals and
# distort the autocorrelation and level-shift checks.
for sku_id in selected_skus[:3]:  # Start with first 3 for demonstration
    sku_rows = cv_forecasts[cv_forecasts['sku_id'] == sku_id]
    
    for model_name, sku_data in sku_rows.groupby('model_name'):
        sku_data = sku_data.sort_values('date')
        if len(sku_data) <= 10:
            continue  # Too few points for a meaningful residual analysis
        
        sku_data = analyze_residuals(sku_data)
        sniff_test = run_sniff_test(sku_data)
        signatures = detect_failure_signatures(sku_data)
        sniff_test_pass = all(v == 'PASS' for v in sniff_test.values())
        
        forensic_backlog.append({
            'sku_id': sku_id,
            'model_name': model_name,
            'sniff_test_pass': sniff_test_pass,
            'primary_smoking_gun': next((v['smoking_gun'] for v in signatures.values() if v['smoking_gun']), None),
            'requires_investigation': not sniff_test_pass
        })

forensic_df = pd.DataFrame(forensic_backlog)
print("\nForensic Investigation Results:")
print(forensic_df.to_string(index=False))

# Save for Module 3
forensic_df.to_csv('forensic_findings_for_module_3.csv', index=False)
print("\n✓ Forensic findings saved to 'forensic_findings_for_module_3.csv'")
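
To check whether a failure signature generalizes rather than being a one-off anecdote, the findings can be counted by signature across SKUs and segments. The sketch below uses a made-up findings table; the `segment` column is an assumption (it is not part of `forensic_df` as built above) and would need to be joined in from the segment mapping.

```python
import pandas as pd

# Illustrative findings table; values are made up and the
# 'segment' column is a hypothetical addition
findings = pd.DataFrame({
    'sku_id': ['A', 'B', 'C', 'D', 'E'],
    'segment': ['Dairy', 'Dairy', 'Dairy', 'Snacks', 'Snacks'],
    'primary_smoking_gun': [
        'Missed seasonal component', 'Missed seasonal component', None,
        'Missed trend or momentum', 'Missed seasonal component',
    ],
})

# Count how many SKUs and segments share each signature
signature_counts = (
    findings.dropna(subset=['primary_smoking_gun'])
    .groupby('primary_smoking_gun')
    .agg(n_skus=('sku_id', 'nunique'), n_segments=('segment', 'nunique'))
    .sort_values('n_skus', ascending=False)
)
print(signature_counts)
```

A signature that shows up across several SKUs and more than one segment is a stronger candidate for a systematic fix in Module 3 than one seen on a single SKU.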