# 2.07: Model Performance Diagnostics - Detailed Forensics (Micro Level)

## OPENING

In the last module, we ran portfolio diagnostics. We found where the fire is — which segments are driving 80% of the error, whether that error is bias or variance, and whether the model is stable.

That's **triage**. Now we **investigate**.

In this notebook, we go detective mode: pulling representative SKUs from high-impact segments, inspecting their forecasts visually, analyzing residuals, and finding the **smoking guns** — specific patterns the model is missing.

**Critical:** We're not cherry-picking. We're not debugging anecdotes. We're looking for **failure signatures that generalize** — patterns we can fix systematically in Module 3.

The output isn't a score. It's a **bug report** — the requirements document for feature engineering.

## SETUP: Load Dependencies and Data

In [None]:
# Standard Library
import sys
import warnings
from pathlib import Path

# Data manipulation and statistics
import pandas as pd
import numpy as np
import scipy.stats as st

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# --- Path Configuration (bootstrap: put the project root on sys.path so `src` imports resolve) ---
MODULE_DIR = Path().resolve()
PROJECT_ROOT = MODULE_DIR.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

# --- Local Imports ---
from src.forecast_foundations import (
    CacheManager,
    get_notebook_name,
    find_project_root,
)

# --- Settings ---
warnings.filterwarnings("ignore")
sns.set_theme()
plt.rcParams['figure.figsize'] = (14, 6)
pd.set_option('display.max_columns', None)

# --- Paths ---
PROJECT_ROOT = find_project_root()
DATA_DIR = PROJECT_ROOT / "data"
DATA_DIR.mkdir(exist_ok=True)

# --- Managers ---
NB_NAME = get_notebook_name()
cache = CacheManager(PROJECT_ROOT / ".cache" / NB_NAME)

# --- Output Path ---
OUTPUT_DIR = PROJECT_ROOT / "artifacts" / "02_baselines" / "output"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# --- Configuration ---
MODEL = 'AutoTheta'

## Load Data
Read the enriched cross-validation results and the ABC classification matrix from previous notebooks, plus the M5 SNAP calendar.

In [251]:
# Load forecasts and actuals from cross-validation, keeping only the model under review
# Columns: unique_id, ds, y, cutoff, model, y_pred, error, abs_error
# error = y_pred - y, so a positive error is an over-forecast
cv_forecasts = (
    pd.read_parquet(OUTPUT_DIR / "enriched_backtest.parquet")
    .query("model == @MODEL")
    .copy()
)
abc_df = pd.read_parquet(OUTPUT_DIR / 'abc_matrix.parquet')  # From 2.06
snap_calendar = pd.read_csv(
    DATA_DIR / "m5/datasets/calendar.csv",
    usecols=['date','snap_CA','snap_TX','snap_WI'],
    parse_dates=['date'],
)

# Add the ABC tier to each forecast row
abc_map = abc_df.set_index("unique_id")['abc_class'].to_dict()
cv_forecasts['ABC_tier'] = cv_forecasts['unique_id'].map(abc_map)

print(f"Loaded {len(cv_forecasts)} forecast records")
print(f"Columns: {cv_forecasts.columns.tolist()}")

Loaded 1585480 forecast records
Columns: ['unique_id', 'ds', 'y', 'cutoff', 'model', 'y_pred', 'error', 'abs_error', 'ABC_tier']


---
## SECTION 1: THE SNIFF TEST IS A DISCIPLINE

### Visual Trust Matters

Sometimes a model passes all the metrics but produces a forecast that just looks... wrong. Straight line through obvious seasonality. Weird step functions. Negative demand.

This is the **Uncanny Valley of Forecasting**. The math passes, but reality fails.

Visual trust is part of system performance. A forecast that looks plausible will get used. A forecast that looks crazy will get overridden.

### The Sniff Test Checklist

When you look at a forecast plot, you're checking for:
- **Level:** Does it anchor to recent reality, or does it jump?
- **Shape:** Does the seasonal shape match history?
- **Trend:** Does it drift reasonably, or explode into infinity?
- **Negatives:** Does any forecast go below zero? Negative demand is an immediate fail.
- **Step Changes:** Does it acknowledge known discontinuities?
- **Smoothness:** Is it overly smooth or overly reactive?
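
Parts of this checklist can be screened automatically before anyone looks at a plot. The sketch below is a hypothetical helper, not part of this notebook's pipeline: `sniff_test`, the 4-observation lookback, and the 50% jump tolerance are all assumptions chosen for illustration. It covers only the level, negatives, and flat-line checks; shape and step changes still need eyes on a chart.

```python
import pandas as pd


def sniff_test(history: pd.Series, forecast: pd.Series, jump_tol: float = 0.5) -> dict:
    """Flag a few checklist items for one series (hypothetical helper)."""
    recent_level = history.tail(4).mean()   # average of the last 4 observations
    first_step = forecast.iloc[0]           # first forecasted value
    # Relative gap between where history ends and where the forecast starts
    level_jump = abs(first_step - recent_level) / max(recent_level, 1e-9)
    return {
        "has_negatives": bool((forecast < 0).any()),
        "level_jump": bool(level_jump > jump_tol),
        "is_flat": bool(forecast.nunique() == 1),
    }


# Synthetic data for illustration only
history = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 12.0])
good = pd.Series([12.0, 12.5, 11.5, 12.0])
bad = pd.Series([30.0, 30.0, -2.0, 30.0])

good_flags = sniff_test(history, good)
bad_flags = sniff_test(history, bad)
print(good_flags)
print(bad_flags)
```

Series that trip a flag are candidates for visual inspection; a clean result does not mean the forecast passes the sniff test.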

In [252]:
# Derive hierarchy columns from unique_id (e.g. FOODS_1_001_CA_1)
# dept_id holds the leading token only (FOODS, HOBBIES, HOUSEHOLD)
cv_forecasts['dept_id'] = cv_forecasts['unique_id'].str.split("_",expand=True)[0]
# item_id drops the trailing state and store tokens (FOODS_1_001)
cv_forecasts['item_id'] = cv_forecasts['unique_id'].str.rsplit("_",n=2,expand=True)[0]

# Aggregate the daily SNAP calendar to weeks ending Sunday: total SNAP days per week
snap_calendar = snap_calendar.groupby(
    pd.Grouper(key='date',freq="W-SUN")
    )[['snap_WI','snap_TX','snap_CA']].sum().reset_index()
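
The weekly aggregation only lines up with the forecasts if the week labels match the `ds` values, which in the sample rows fall on Sundays (for example 2015-06-28). A minimal sketch on synthetic daily flags, assuming nothing beyond pandas itself, shows how `freq="W-SUN"` labels each week by its closing Sunday:

```python
import pandas as pd

# Two synthetic weeks of daily flags: Monday 2015-06-22 through Sunday 2015-07-05
daily = pd.DataFrame({
    "date": pd.date_range("2015-06-22", "2015-07-05", freq="D"),
    "snap_CA": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
})

# Each bin covers Monday..Sunday and is labeled with the Sunday that closes it
weekly = (
    daily.groupby(pd.Grouper(key="date", freq="W-SUN"))["snap_CA"]
    .sum()
    .reset_index()
)
print(weekly)
```

If the forecast dates were labeled by week start instead, a join on equal dates would silently attach the wrong week's SNAP count, so it is worth checking the weekday of `ds` before merging.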

---
## SECTION 2: ERROR DISTRIBUTION DIAGNOSTICS (WHERE ARE WE FAILING BIG?)

With the hierarchy columns extracted and the SNAP calendar aggregated to weekly totals in the cell above, identify heavy-tailed errors in high-priority (A-tier) items and quantify horizon decay.

In [253]:
cv_forecasts.head()

| | unique_id | ds | y | cutoff | model | y_pred | error | abs_error | ABC_tier | dept_id | item_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 39637000 | FOODS_1_001_CA_1 | 2015-06-28 | 2.0 | 2015-06-21 | AutoTheta | 4.031191 | 2.031191 | 2.031191 | B | FOODS | FOODS_1_001 |
| 39637001 | FOODS_1_001_CA_1 | 2015-07-05 | 2.0 | 2015-06-21 | AutoTheta | 4.011292 | 2.011292 | 2.011292 | B | FOODS | FOODS_1_001 |
| 39637002 | FOODS_1_001_CA_1 | 2015-07-12 | 7.0 | 2015-06-21 | AutoTheta | 3.991393 | -3.008607 | 3.008607 | B | FOODS | FOODS_1_001 |
| 39637003 | FOODS_1_001_CA_1 | 2015-07-19 | 4.0 | 2015-06-21 | AutoTheta | 3.971494 | -0.028506 | 0.028506 | B | FOODS | FOODS_1_001 |
| 39637004 | FOODS_1_001_CA_1 | 2015-07-26 | 2.0 | 2015-06-21 | AutoTheta | 3.951595 | 1.951595 | 1.951595 | B | FOODS | FOODS_1_001 |


In [254]:
# Inspect the shape of the error distribution for A-tier items
a_tier_errors = cv_forecasts.loc[(cv_forecasts['ABC_tier'] == 'A')]

# Excess kurtosis (0 for a normal distribution) and skewness of errors per item,
# pooled across stores and CV windows
id_dists = a_tier_errors.groupby(["item_id","dept_id"],as_index=False).agg(
    error_kurtosis = ('error',lambda x: st.kurtosis(x,fisher=True,bias=False)),
    skew_error = ('error',lambda x: st.skew(x,bias=False)),
).sort_values(by="error_kurtosis")

# Take the top 20% of items by excess kurtosis (heaviest-tailed errors)
largest_offenders = id_dists.nlargest(n=int(len(id_dists) * .2),columns="error_kurtosis")

# Most of the largest offenders are in FOODS. Could this be SNAP effects that a
# univariate statistical model like AutoTheta cannot capture? Compare against each
# department's share of all A-tier items before concluding.
largest_offenders['dept_id'].value_counts(normalize=True)

dept_id
FOODS        0.680108
HOUSEHOLD    0.255376
HOBBIES      0.064516
Name: proportion, dtype: float64
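
To read the kurtosis ranking, it helps to see what the statistic returns on known distributions. The sketch below uses synthetic data only (the seed, sample size, and the Student-t with 3 degrees of freedom are illustrative assumptions) and calls `st.kurtosis` with the same `fisher=True, bias=False` settings as the cell above:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
normal_errors = rng.normal(size=20_000)            # thin-tailed reference
heavy_errors = rng.standard_t(df=3, size=20_000)   # Student-t, 3 df: heavy tails

# fisher=True reports excess kurtosis, so a normal sample lands near 0
k_normal = st.kurtosis(normal_errors, fisher=True, bias=False)
k_heavy = st.kurtosis(heavy_errors, fisher=True, bias=False)
print(round(k_normal, 2), round(k_heavy, 2))
```

Values well above 0 mean a few extreme errors dominate the distribution, which is the "failing big" pattern this section is hunting for.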

In [255]:
# ============================================================================
# 1. ERROR ACCUMULATION BY FORECAST HORIZON (Degradation curve)
# ============================================================================

# Horizon step h = position of each row within its (unique_id, cutoff) window;
# this assumes rows are sorted by ds inside each window
horizon_diagnostics = (
    a_tier_errors
    .assign(h=lambda df: df.groupby(['unique_id','cutoff']).cumcount() + 1)
    .groupby(['dept_id','h'],as_index=False)
    .agg(mean_error = ('error','mean'),std_error=('error','std'),kurt_error=("error",lambda x: st.kurtosis(x,bias=False)))
    .round(3)
    .melt(id_vars=['dept_id','h'],var_name='metric',value_name='val')
)

# One facet per metric, one line per department
horizon_diagnostics.plot(
    x='h',
    y='val',
    facet_col='metric',
    backend='plotly',
    color='dept_id',
    title=f'Error Diagnostics Across Forecast Horizon ({MODEL})'
)

## Section 3: Residual Autocorrelation
Test whether forecast errors show systematic patterns (autocorrelation) that a learner should have captured.

In [256]:
def acf_lag_value(x, lag):
    """Correlation between the error series and itself shifted by `lag` steps."""
    # Series.autocorr is the Pearson correlation of x with x.shift(lag)
    return x.autocorr(lag=lag)

# Lag-1 to lag-3 autocorrelation of errors for each A-tier series.
# Note: errors from all CV windows of a series are treated as one sequence.
residual_corr = a_tier_errors.groupby(["dept_id","unique_id"]).agg(
    acf_lag1=('error',lambda x: acf_lag_value(x, 1)),
    acf_lag2=('error',lambda x: acf_lag_value(x, 2)),
    acf_lag3=('error',lambda x: acf_lag_value(x, 3)),
)

In [257]:
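# Approximate 95% white-noise band for an autocorrelation: 1.96 / sqrt(n).
# n = 52 assumes about one year of weekly errors per series.
residual_corr = residual_corr.assign(
    significance_threshold = (1 / np.sqrt(52)) * 1.96,
)

# Long format: one row per (series, lag), keeping only autocorrelations outside the band
long_df_corr = (
    residual_corr
    .reset_index()
    .melt(id_vars=['unique_id','dept_id','significance_threshold'],var_name='lag',value_name='acf')
    .query('acf.abs() > significance_threshold')
    .sort_values('acf')
)

# Each department's share of all significant (series, lag) pairs.
# This is a share of the total, not the rate of significant lags within a department.
(
    long_df_corr['dept_id'].value_counts(normalize=True)
    .to_frame("percentage_of_significant_lags")
)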

| dept_id | percentage_of_significant_lags |
|---|---|
| FOODS | 0.697046 |
| HOUSEHOLD | 0.238425 |
| HOBBIES | 0.064529 |
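
The threshold used above is the usual large-sample band for the autocorrelation of white noise, roughly 1.96 / sqrt(n). As a rough illustration on synthetic series (the seed, the AR(1) coefficient of 0.8, and n = 52 are assumptions for this sketch, not values measured from the backtest), errors that carry over from one week to the next sit far outside that band:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 52                          # assumed series length: one year of weekly errors
band = 1.96 / np.sqrt(n)        # approximate 95% band for white-noise autocorrelation

# White-noise errors: nothing left for a model to learn
white = pd.Series(rng.normal(size=n))

# AR(1) errors: each error carries over 0.8 of the previous one
shocks = rng.normal(size=n)
ar1_values = np.zeros(n)
for t in range(1, n):
    ar1_values[t] = 0.8 * ar1_values[t - 1] + shocks[t]
ar1 = pd.Series(ar1_values)

white_acf1 = white.autocorr(lag=1)
ar1_acf1 = ar1.autocorr(lag=1)
print(round(band, 3), round(white_acf1, 3), round(ar1_acf1, 3))
```

Because the band is a 95% interval, about 1 in 20 pure-noise lags will exceed it by chance, so a handful of significant lags across thousands of series is expected even when the model has captured everything learnable.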


# Now let's inspect the SNAP data

* Can we link SNAP benefit days to our large errors in A-tier items?

In [258]:
cv_w_snap = cv_forecasts.merge(
    snap_calendar,
    left_on='ds',
    right_on='date',
    how='left'
)

# Keep only each series' own state: SNAP days in other states should not apply
cv_w_snap = cv_w_snap.assign(
    snap_days = np.where(
        cv_w_snap['unique_id'].str.contains("_CA_"),
        cv_w_snap['snap_CA'],
        np.where(
            cv_w_snap['unique_id'].str.contains("_WI_"),
            cv_w_snap['snap_WI'],
            cv_w_snap['snap_TX']  # default/fallback
        )
    )
)

cv_w_snap['snap_ind'] = cv_w_snap['snap_days'] > 0
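
The nested `np.where` above falls back to `snap_TX` for any ID that matches neither `_CA_` nor `_WI_`. One alternative, sketched here on a toy frame, parses the state token out of `unique_id` and fails loudly on anything unexpected. The `state_id` column and the `states` list are illustrative names, and the sketch assumes the state is always the second-to-last token of `unique_id`, as in `FOODS_1_001_CA_1`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only
df = pd.DataFrame({
    "unique_id": ["FOODS_1_001_CA_1", "HOBBIES_1_002_TX_3", "HOUSEHOLD_2_010_WI_2"],
    "snap_CA": [3, 3, 3],
    "snap_TX": [1, 1, 1],
    "snap_WI": [0, 0, 0],
})

states = ["CA", "TX", "WI"]

# Assumption: the state code is the second-to-last token of unique_id
df["state_id"] = df["unique_id"].str.split("_").str[-2]

# Fail loudly instead of silently defaulting to one state
unknown = ~df["state_id"].isin(states)
if unknown.any():
    raise ValueError(f"Unexpected state codes: {sorted(df.loc[unknown, 'state_id'].unique())}")

# For each row, pick the value from that row's own state column
col_idx = df["state_id"].map({s: i for i, s in enumerate(states)}).to_numpy()
values = df[[f"snap_{s}" for s in states]].to_numpy()
df["snap_days"] = values[np.arange(len(df)), col_idx]
print(df[["unique_id", "state_id", "snap_days"]])
```

The row-wise lookup stays vectorized, so it should scale to the full backtest frame, and adding a state means extending one list rather than nesting another condition.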

In [259]:
cv_w_snap.groupby('snap_days',as_index=False).agg(
    mean_error = ('error','mean'),
    mean_abs_error = ('abs_error','mean'),
    max_abs_error = ('abs_error','max'),
    ).melt(
        id_vars='snap_days',
        var_name='metric',
        value_name='val'
    )

| | snap_days | metric | val |
|---|---|---|---|
| 0 | 0 | mean_error | 0.059839 |
| 1 | 1 | mean_error | -0.376594 |
| 2 | 2 | mean_error | 0.626035 |
| 3 | 3 | mean_error | -0.56221 |
| 4 | 4 | mean_error | -0.497513 |
| 5 | 5 | mean_error | -0.638802 |
| 6 | 6 | mean_error | -0.2355 |
| 7 | 7 | mean_error | -0.310846 |
| 8 | 0 | mean_abs_error | 4.468731 |
| 9 | 1 | mean_abs_error | 4.574986 |


In [260]:
snap_day_corr = cv_w_snap.groupby('snap_days',as_index=False).agg(
    mean_error = ('error','mean'),
    mean_abs_error = ('abs_error','mean'),
    max_abs_error = ('abs_error','max'),
    stddev_error = ('error','std'),  # spread of errors within each SNAP-day count
    ).melt(
        id_vars='snap_days',
        var_name='metric',
        value_name='val'
    )

# One stacked panel per metric
fig = snap_day_corr.plot(
    x='snap_days',
    y='val',
    facet_col='metric',
    facet_col_wrap=1,
    backend='plotly',
    title='Error Metrics vs. Number of SNAP Days per Week')

# Make y-axes independent
fig.update_yaxes(matches=None)

fig.show()

In [262]:
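# A-tier subset of the SNAP-enriched frame. This definition is an assumption:
# the cell that originally created cv_w_snap_A is not shown in this notebook.
cv_w_snap_A = cv_w_snap.loc[cv_w_snap['ABC_tier'] == 'A']

# Bias, spread, and total absolute error by department, with and without SNAP days
snap_summary = cv_w_snap_A.groupby(['dept_id','snap_ind'])['error'].agg(
    mean_error='mean',
    std_error='std',
    total_abs_error = lambda x: x.abs().sum()
).reset_index()

snap_summary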

| | dept_id | snap_ind | mean_error | std_error | total_abs_error |
|---|---|---|---|---|---|
| 0 | FOODS | False | 0.438489 | 23.008306 | 1465524.0 |
| 1 | FOODS | True | -1.126339 | 24.867163 | 2434903.0 |
| 2 | HOBBIES | False | -0.034535 | 12.266741 | 148144.3 |
| 3 | HOBBIES | True | 0.046093 | 11.508209 | 213702.2 |
| 4 | HOUSEHOLD | False | -0.329216 | 12.3167 | 381637.3 |
| 5 | HOUSEHOLD | True | -0.055012 | 12.41841 | 588520.6 |


In [273]:
# Flag a row as a large error when its absolute error exceeds
# that series' own 90th percentile
cv_w_snap = cv_w_snap.assign(
    large_error = cv_w_snap['abs_error'] > cv_w_snap.groupby("unique_id")['abs_error'].transform(lambda x: x.quantile(.9))
    )

In [280]:
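# Among large errors, what share falls in weeks with at least one SNAP day?
# Read this against the overall share of SNAP weeks before calling it a SNAP effect.
(cv_w_snap
.loc[cv_w_snap['large_error']]
.groupby('dept_id')['snap_ind'].value_counts(normalize=True)
.unstack()
)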

| dept_id | snap_ind = False | snap_ind = True |
|---|---|---|
| FOODS | 0.353772 | 0.646228 |
| HOBBIES | 0.382181 | 0.617819 |
| HOUSEHOLD | 0.384061 | 0.615939 |
