# 2.10: Prediction Interval Metrics - Coverage, Width, and Winkler

## OPENING: PREDICTION INTERVALS ARE PROMISES

In the last video, we built prediction intervals. We generated model-based intervals from AutoETS and AutoTheta. We wrapped those same forecasts with conformal intervals calibrated to historical reality.

But here's the question: **Can we trust them?**

A prediction interval is a promise. It says: "80% of the time, demand will fall inside this range."

That's a big claim. And if the claim is wrong — if we set safety stock based on a lie — we stock out. Or we overstock. Either way, we lose money.

So before we deploy any interval method in production, we audit it. We measure whether it kept its promise.

**Forecasting isn't about being right. It's about being responsibly wrong.**

Intervals are how we quantify that responsibility. Metrics are how we verify it.

## SETUP: Load Dependencies and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.set_theme()
plt.rcParams['figure.figsize'] = (14, 6)
pd.set_option('display.max_columns', None)

In [None]:
# Load prediction intervals and actuals from previous module
intervals_df = pd.read_csv('prediction_intervals_80pct.csv')
cv_forecasts = pd.read_csv('path/to/cv_forecasts.csv')

print(f"Loaded {len(intervals_df)} interval predictions")
print(f"Columns: {intervals_df.columns.tolist()}")

---
## SECTION 1: THE TWO JOBS OF AN INTERVAL

### Job 1: Validity (Calibration)

**The question:** Does the interval hit the coverage level it claims?

If we ask for 80%, do we actually get around 80%?

This is the **promise**. An interval that says "80%" but only covers 65% of outcomes is lying.

### Job 2: Efficiency (Usefulness)

**The question:** Is the interval narrow enough to be operationally useful?

An interval that says "demand will be between 0 and 10,000 units" might have perfect coverage — but it's useless.
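
To make the tension between the two jobs concrete, here is a minimal sketch on made-up numbers. The demand values and both interval sets are invented for illustration and do not come from the course data; `coverage_and_width` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical demand values, invented purely for illustration
actual = np.array([95, 102, 110, 88, 120, 99, 105, 93, 101, 97])

# Interval A: absurdly wide, identical for every period
lower_a = np.full(actual.shape, 0)
upper_a = np.full(actual.shape, 10_000)

# Interval B: a much tighter band around 100
lower_b = np.full(actual.shape, 90)
upper_b = np.full(actual.shape, 110)

def coverage_and_width(actual, lower, upper):
    """Return (share of actuals inside the interval, mean interval width)."""
    inside = (actual >= lower) & (actual <= upper)
    return inside.mean(), (upper - lower).mean()

cov_a, width_a = coverage_and_width(actual, lower_a, upper_a)
cov_b, width_b = coverage_and_width(actual, lower_b, upper_b)

print(f"A: coverage={cov_a:.0%}, mean width={width_a:.0f}")  # A: coverage=100%, mean width=10000
print(f"B: coverage={cov_b:.0%}, mean width={width_b:.0f}")  # B: coverage=80%, mean width=20
```

Interval A never misses, yet nobody could size safety stock from it. Interval B misses twice in ten periods, which is exactly what an 80% interval promises, and it is narrow enough to act on.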

---
## SECTION 2: VALIDITY AND SHARPNESS METRICS (DOES IT KEEP ITS PROMISE, AND IS IT USEFUL?)

In [None]:
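def calculate_coverage(actuals, lower_bounds, upper_bounds, target_coverage=0.80, tolerance=0.05):
    """
    Calculate empirical coverage of prediction intervals
    
    Coverage = share of actuals that fall within [lower, upper] (bounds inclusive)
    Status is PASS when coverage is within `tolerance` of the target, in either direction
    """
    # Accept lists, pandas Series, or NumPy arrays
    actuals = np.asarray(actuals)
    lower_bounds = np.asarray(lower_bounds)
    upper_bounds = np.asarray(upper_bounds)
    
    within_interval = (actuals >= lower_bounds) & (actuals <= upper_bounds)
    coverage = within_interval.mean()
    
    # How close are we to target? (absolute gap, so over- and under-coverage both count)
    coverage_error = abs(coverage - target_coverage)
    
    return {
        'coverage': coverage,
        'target': target_coverage,
        'coverage_error': coverage_error,
        'status': 'PASS' if coverage_error < tolerance else 'FAIL',
        'num_within': int(within_interval.sum()),
        'total': len(actuals)
    }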

# Calculate coverage
if all(col in intervals_df.columns for col in ['actual', 'lower_80', 'upper_80']):
    coverage_metrics = calculate_coverage(
        intervals_df['actual'].values,
        intervals_df['lower_80'].values,
        intervals_df['upper_80'].values,
        target_coverage=0.80
    )
    
    print("\nCOVERAGE VALIDITY TEST (Job 1: Does it keep its promise?):")
    print(f"  Target Coverage: {coverage_metrics['target']:.1%}")
    print(f"  Actual Coverage: {coverage_metrics['coverage']:.1%}")
    print(f"  Coverage Error: {coverage_metrics['coverage_error']:.1%}")
    print(f"  Result: {coverage_metrics['status']}")
    print(f"  ({coverage_metrics['num_within']} out of {coverage_metrics['total']} actuals within interval)")
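
The PASS/FAIL rule above uses a fixed gap of 5 percentage points, regardless of how many observations the coverage was measured on. As a rough complement, the sketch below asks how far observed coverage could drift from 80% through sampling noise alone. It relies on a normal approximation to the binomial and assumes hits and misses are independent, which forecast errors across horizons and series often are not, so treat the band as optimistic. `coverage_noise_band` is a hypothetical helper, not part of the course code.

```python
import math

def coverage_noise_band(n_within, n_total, target=0.80, z=1.96):
    """Approximate ~95% band for observed coverage if the true rate equals `target`.

    Assumption: independent hits/misses (normal approximation to the binomial).
    """
    observed = n_within / n_total
    std_err = math.sqrt(target * (1 - target) / n_total)
    low, high = target - z * std_err, target + z * std_err
    return {
        'observed': observed,
        'std_err': std_err,
        'band': (low, high),
        'consistent': low <= observed <= high,
    }

# Same 75% observed coverage, very different sample sizes (invented counts)
small = coverage_noise_band(15, 20)
large = coverage_noise_band(1500, 2000)

for name, res in [('n=20', small), ('n=2000', large)]:
    low, high = res['band']
    print(f"{name}: observed={res['observed']:.1%}, "
          f"band=[{low:.1%}, {high:.1%}], consistent={res['consistent']}")
```

With 20 observations, 75% coverage is well inside the noise band around 80%. With 2,000 observations the same 75% is a clear shortfall, even though both cases sit exactly on the edge of the fixed 5-point rule.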

In [None]:
def calculate_sharpness(lower_bounds, upper_bounds):
    """
    Calculate sharpness (tightness) of intervals
    Lower width = sharper (better) intervals
    """
    widths = upper_bounds - lower_bounds
    
    return {
        'mean_width': widths.mean(),
        'median_width': np.median(widths),
        'min_width': widths.min(),
        'max_width': widths.max(),
        'std_width': widths.std()
    }

# Calculate sharpness
if all(col in intervals_df.columns for col in ['lower_80', 'upper_80']):
    sharpness_metrics = calculate_sharpness(
        intervals_df['lower_80'].values,
        intervals_df['upper_80'].values
    )
    
    print("\nSHARPNESS METRICS (Job 2: How useful is it?):")
    # Widths are full widths (upper - lower), not half-widths, so no "±" prefix
    print(f"  Mean Interval Width: {sharpness_metrics['mean_width']:.2f} units")
    print(f"  Median Width: {sharpness_metrics['median_width']:.2f} units")
    print(f"  Range: [{sharpness_metrics['min_width']:.2f}, {sharpness_metrics['max_width']:.2f}]")
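
A width in raw units is hard to judge without knowing the scale of demand: 20 units is enormous for a slow mover and negligible for a high-volume item. One possible complement, sketched here under that assumption, divides mean width by mean absolute demand. `relative_width` is a hypothetical helper and the numbers are invented.

```python
import numpy as np

def relative_width(actuals, lower_bounds, upper_bounds):
    """Mean interval width divided by mean absolute actual (unitless)."""
    actuals = np.asarray(actuals, dtype=float)
    widths = (np.asarray(upper_bounds, dtype=float)
              - np.asarray(lower_bounds, dtype=float))
    scale = np.abs(actuals).mean()
    # Avoid dividing by zero when every actual is zero
    return np.nan if scale == 0 else widths.mean() / scale

# Same 20-unit width, very different demand levels
slow_mover = relative_width([10, 12, 8], [0, 2, 0], [20, 22, 20])
fast_mover = relative_width([1000, 1200, 800], [990, 1190, 790], [1010, 1210, 810])

print(f"Slow mover: width is {slow_mover:.0%} of mean demand")  # 200%
print(f"Fast mover: width is {fast_mover:.0%} of mean demand")  # 2%
```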

---
## SECTION 3: INTERVAL SCORE (WINKLER SCORE)

In [None]:
def calculate_winkler_score(actuals, lower_bounds, upper_bounds, alpha=0.20):
    """
    Winkler Score: Combines validity + efficiency
    
    WS = (Upper - Lower)
         + (2/α) * (Lower - Actual)   if Actual < Lower
         + (2/α) * (Actual - Upper)   if Actual > Upper
    
    Lower score = Better intervals (narrow + covers actual)
    
    Args:
        alpha: 1 - confidence_level (0.20 for 80% confidence)
    
    Returns:
        NumPy array with one score per observation
    """
    actuals = np.asarray(actuals, dtype=float)
    lower_bounds = np.asarray(lower_bounds, dtype=float)
    upper_bounds = np.asarray(upper_bounds, dtype=float)
    
    width = upper_bounds - lower_bounds
    
    # Distance by which the actual misses the interval (0 when it is inside)
    miss_below = np.clip(lower_bounds - actuals, 0, None)
    miss_above = np.clip(actuals - upper_bounds, 0, None)
    
    # Width plus a penalty of (2/alpha) per unit of miss
    scores = width + (2 / alpha) * (miss_below + miss_above)
    
    return scores

# Calculate Winkler scores
if all(col in intervals_df.columns for col in ['actual', 'lower_80', 'upper_80']):
    winkler_scores = calculate_winkler_score(
        intervals_df['actual'].values,
        intervals_df['lower_80'].values,
        intervals_df['upper_80'].values,
        alpha=0.20  # For 80% intervals
    )
    
    print("\nWINKLER SCORE (Combined Metric):")
    print(f"  Mean Winkler Score: {winkler_scores.mean():.2f}")
    print(f"  Median: {np.median(winkler_scores):.2f}")
    print(f"  Std Dev: {winkler_scores.std():.2f}")
    print(f"\n  Interpretation:")
    print(f"    - Penalizes intervals that miss actuals (validity)")
    print(f"    - Penalizes wide intervals (efficiency)")
    print(f"    - Lower score = Better intervals")
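
A hand-checkable example may help build intuition for the score. The sketch below applies the same formula to single intervals with invented numbers; `winkler_single` is a hypothetical scalar helper written only for this illustration. With `alpha=0.20`, every unit of miss costs `2 / 0.20 = 10` units of score.

```python
def winkler_single(actual, lower, upper, alpha=0.20):
    """Winkler score for one interval (scalar form of the formula above)."""
    width = upper - lower
    if actual < lower:
        return width + (2 / alpha) * (lower - actual)
    if actual > upper:
        return width + (2 / alpha) * (actual - upper)
    return width

inside = winkler_single(100, 90, 110)         # covered: score is just the width
above = winkler_single(120, 90, 110)          # misses the upper bound by 10 units
below = winkler_single(85, 90, 110)           # misses the lower bound by 5 units
wide_cover = winkler_single(120, 0, 10_000)   # covered, but absurdly wide

print(f"Inside:      {inside:.1f}")      # 20.0
print(f"Above by 10: {above:.1f}")       # 120.0  (20 + 10 * 10)
print(f"Below by 5:  {below:.1f}")       # 70.0   (20 + 10 * 5)
print(f"Wide cover:  {wide_cover:.1f}")  # 10000.0
```

The last case shows why coverage alone is not enough: the very wide interval never misses, yet it scores far worse than the narrow interval that missed by 10 units.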

---
## SECTION 4: INTERVAL METRICS BY SEGMENT

In [None]:
def interval_quality_scorecard(intervals_df, segment_column='category'):
    """
    Generate interval quality metrics by segment
    """
    
    if segment_column not in intervals_df.columns:
        print(f"Column {segment_column} not found. Using overall metrics only.")
        return None
    
    scorecard = []
    
    for segment in intervals_df[segment_column].unique():
        seg_data = intervals_df[intervals_df[segment_column] == segment]
        
        # Calculate metrics
        coverage = ((
            (seg_data['actual'] >= seg_data['lower_80']) & 
            (seg_data['actual'] <= seg_data['upper_80'])
        ).sum() / len(seg_data))
        
        width = (seg_data['upper_80'] - seg_data['lower_80']).mean()
        
        scorecard.append({
            'segment': segment,
            'coverage': coverage,
            'interval_width': width,
            'sample_size': len(seg_data)
        })
    
    return pd.DataFrame(scorecard)

# If intervals_df has category column, generate scorecard
if 'category' in intervals_df.columns:
    scorecard = interval_quality_scorecard(intervals_df, 'category')
    print("\nINTERVAL QUALITY BY SEGMENT:")
    print(scorecard.to_string(index=False))
else:
    print("\nCategory column not found. Add segment information to intervals_df for detailed analysis.")
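
The scorecard above reports coverage and width per segment but leaves out the Winkler score. One way to add it, sketched here with a pandas `groupby` instead of an explicit loop, is shown below. It assumes the same `actual`, `lower_80`, and `upper_80` column names used in this notebook; `scorecard_with_winkler` and the demo data are hypothetical.

```python
import pandas as pd

def scorecard_with_winkler(df, segment_column='category', alpha=0.20):
    """Per-segment coverage, mean width, mean Winkler score, and sample size."""
    work = df.copy()
    work['covered'] = (work['actual'] >= work['lower_80']) & (work['actual'] <= work['upper_80'])
    work['width'] = work['upper_80'] - work['lower_80']

    # Units by which the actual falls outside the interval (0 when covered)
    miss = ((work['lower_80'] - work['actual']).clip(lower=0)
            + (work['actual'] - work['upper_80']).clip(lower=0))
    work['winkler'] = work['width'] + (2 / alpha) * miss

    return (work.groupby(segment_column)
                .agg(coverage=('covered', 'mean'),
                     interval_width=('width', 'mean'),
                     winkler=('winkler', 'mean'),
                     sample_size=('actual', 'size'))
                .reset_index())

# Invented demo data: segment A misses once, segment B always covers
demo = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'actual':   [100, 120, 50, 52],
    'lower_80': [90, 90, 45, 45],
    'upper_80': [110, 110, 55, 55],
})

demo_scorecard = scorecard_with_winkler(demo)
print(demo_scorecard.to_string(index=False))
```

Note that `groupby` drops rows whose segment value is missing by default, so the sample sizes may not add up to the full row count if the segment column has gaps.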

---
## SECTION 5: VISUALIZING INTERVAL PERFORMANCE

In [None]:
def plot_interval_performance(intervals_df, sample_size=100):
    """
    Visualize interval performance: coverage + width
    """
    sample = intervals_df.head(sample_size)
    
    fig, axes = plt.subplots(2, 1, figsize=(14, 10))
    
    # Plot 1: Intervals vs Actuals
    ax = axes[0]
    x = range(len(sample))
    
    ax.plot(x, sample['actual'], 'ko-', label='Actual', markersize=4, linewidth=1)
    ax.plot(x, sample['forecast'], 'b^-', label='Point Forecast', alpha=0.7, markersize=4)
    ax.fill_between(x, sample['lower_80'], sample['upper_80'], 
                     alpha=0.3, color='blue', label='80% Interval')
    
    ax.set_title('Interval Coverage: Do Actuals Fall Within Predicted Range?')
    ax.set_ylabel('Demand')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Plot 2: Interval width over time
    widths = sample['upper_80'] - sample['lower_80']
    ax = axes[1]
    ax.bar(x, widths, color='orange', alpha=0.7)
    ax.axhline(y=widths.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean Width: {widths.mean():.2f}')
    ax.set_title('Interval Width (Sharpness): Narrower = More Useful')
    ax.set_xlabel('Time Period')
    ax.set_ylabel('Interval Width')
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()

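# Visualize (guarded like the cells above: the plot needs all four columns)
plot_cols = ['actual', 'forecast', 'lower_80', 'upper_80']
if all(col in intervals_df.columns for col in plot_cols):
    plot_interval_performance(intervals_df)
else:
    print(f"Skipping plot. Required columns: {plot_cols}")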

---
## SECTION 6: DECISION: DEPLOY OR RECALIBRATE?

In [None]:
def interval_audit_decision(coverage_metrics, sharpness_metrics):
    """
    Make go/no-go decision based on interval metrics
    """
    
    decision = {
        'validity_pass': coverage_metrics['status'] == 'PASS',
        'validity_score': f"{coverage_metrics['coverage']:.1%} vs {coverage_metrics['target']:.1%}",
        'sharpness_score': f"{sharpness_metrics['mean_width']:.2f} units",
        'recommendation': None
    }
    
    # Decision logic
    if decision['validity_pass']:
        if sharpness_metrics['mean_width'] < 50:  # Illustrative threshold in demand units; set it per product scale
            decision['recommendation'] = 'DEPLOY - Intervals are valid and sharp'
        else:
            decision['recommendation'] = 'DEPLOY with warning - Valid but wide intervals may reduce usefulness'
    else:
        if coverage_metrics['coverage'] < coverage_metrics['target']:
            decision['recommendation'] = 'RECALIBRATE - Intervals are too narrow (missing actuals)'
        else:
            decision['recommendation'] = 'RECALIBRATE - Intervals are too wide (over-conservative)'
    
    return decision

# Make decision
if 'coverage_metrics' in locals() and 'sharpness_metrics' in locals():
    decision = interval_audit_decision(coverage_metrics, sharpness_metrics)
    
    print("\nINTERVAL AUDIT DECISION:")
    print(f"\n  Validity (Coverage): {decision['validity_score']}")
    print(f"  Validity Status: {'✓ PASS' if decision['validity_pass'] else '✗ FAIL'}")
    print(f"\n  Efficiency (Sharpness): {decision['sharpness_score']}")
    print(f"\n  Decision: {decision['recommendation']}")

---
## SECTION 7: FINAL DELIVERABLE - INTERVAL AUDIT REPORT

In [None]:
# Generate audit report
audit_report = f"""
PREDICTION INTERVAL AUDIT REPORT
{'='*70}

OBJECTIVE
--------
Verify that 80% prediction intervals keep their promise:
"80% of the time, actuals fall within the predicted range"

RESULTS - VALIDITY (Job 1: Does it keep its promise?)
---
Target Coverage:     {coverage_metrics['target']:.0%}
Actual Coverage:     {coverage_metrics['coverage']:.1%}
Coverage Error:      {coverage_metrics['coverage_error']:.2%}
Status:              {coverage_metrics['status']}

RESULTS - EFFICIENCY (Job 2: Is it useful?)
---
Mean Interval Width: {sharpness_metrics['mean_width']:.2f} units
Median Width:        {sharpness_metrics['median_width']:.2f} units
Min Width:           {sharpness_metrics['min_width']:.2f} units
Max Width:           {sharpness_metrics['max_width']:.2f} units

COMBINED METRIC - WINKLER SCORE
---
Mean Score:          {winkler_scores.mean():.2f}
Median Score:        {np.median(winkler_scores):.2f}
(Lower is better)

RECOMMENDATION
---
{decision['recommendation']}

NEXT STEPS
---
1. Review this audit with stakeholders
2. If deploying: Use intervals for safety stock decisions
3. If recalibrating: Adjust interval method and re-test
4. Monitor coverage in production (ongoing validation)

{'='*70}
"""

print(audit_report)

# Save report
with open('interval_audit_report_2.10.txt', 'w') as f:
    f.write(audit_report)

print("\n✓ Audit report saved to 'interval_audit_report_2.10.txt'")
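
Next step 4 in the report calls for monitoring coverage in production. A minimal sketch of what that could look like is a rolling coverage check that raises a flag when recent coverage drops too far below target. The window length, the tolerance, and the choice to alert only on under-coverage are all assumptions made for this illustration; `rolling_coverage` and the data are hypothetical.

```python
import numpy as np
import pandas as pd

def rolling_coverage(df, window=10, target=0.80, tolerance=0.05):
    """Rolling share of actuals inside the interval, with an under-coverage flag."""
    inside = ((df['actual'] >= df['lower_80'])
              & (df['actual'] <= df['upper_80'])).astype(float)
    # NaN until a full window of observations is available
    rolled = inside.rolling(window=window, min_periods=window).mean()
    return pd.DataFrame({
        'rolling_coverage': rolled,
        'alert': rolled < (target - tolerance),  # NaN compares as False
    })

# Invented data: one isolated miss, then demand shifts and every interval misses
actual = np.full(20, 100.0)
actual[4] = 130.0
actual[10:] = 130.0
history = pd.DataFrame({
    'actual': actual,
    'lower_80': np.full(20, 90.0),
    'upper_80': np.full(20, 110.0),
})

monitor = rolling_coverage(history)
first_alert = monitor['alert'].idxmax()  # first period where the flag is raised
print(monitor.tail(11).to_string())
print(f"First alert at period {first_alert}")
```

In this toy series the flag is raised two periods after the shift, once rolling coverage falls to 70%. A shorter window reacts faster but is noisier, so the window length is a trade-off to tune rather than a fixed rule.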