# The Persistence Baseline Problem

## Why "Low MAE" Doesn't Mean Your Model is Good

---

## 🚨 If You Know sklearn But Not Time Series Baselines, Read This First

**What you already know (from standard ML)**:
- Classification baseline: random chance (50% for balanced binary classes)
- Regression baseline: predict the mean
- Lower error = better model
- If you beat the baseline, you've learned something

**What's different with time series**:

In time series, your baseline is **persistence**: "tomorrow = today"

```python
# Classification baseline: random guess
baseline = np.random.choice([0, 1], size=n)  # ~50% accuracy

# Regression baseline: predict mean
baseline = np.full(n, y.mean())  # High MAE

# TIME SERIES baseline: predict previous value
baseline = y[:-1]  # Forecast for y[1:]; can have VERY low MAE!
```

| Data Type | Baseline | Easy to Beat? |
|-----------|----------|---------------|
| Classification | Random (50%) | Yes |
| Regression | Mean | Usually |
| Low-persistence TS (φ=0.3) | Persistence | Yes |
| High-persistence TS (φ=0.98) | Persistence | **Nearly impossible** |

**The trap**: Your model has MAE = 0.05. Impressive! But persistence has MAE = 0.048.
You've learned to predict "tomorrow = today", which is trivial and useless.

---

**What you'll learn:**
1. Why persistence (naive forecast) is the natural baseline for time series
2. Why persistence is nearly impossible to beat on high-autocorrelation data
3. How to use MASE for scale-invariant evaluation
4. How to detect "too good to be true" results with `gate_suspicious_improvement`

**Prerequisites:** Notebooks 01-02

---

In [None]:
# Setup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

from temporalcv.cv import WalkForwardCV
from temporalcv.gates import gate_suspicious_improvement, GateStatus
from temporalcv.metrics import compute_mase, compute_naive_error

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Setup complete.")

---

## Section 0: Should You Even Try to Beat Persistence?

Before building a model, ask: **Is prediction even the right task?**

### Decision Tree: When to Attempt Beating Persistence

```
1. Check ACF(1) of your series
   │
   ├─ ACF(1) > 0.99 ───> VERY HARD
   │   │
   │   └─ Theoretical max improvement: <0.02%
   │       Consider: Is prediction the right task?
   │       Alternative: Focus on direction or regime detection
   │
   ├─ 0.95 < ACF(1) < 0.99 ───> HARD
   │   │
   │   └─ Theoretical max improvement: <1%
   │       Use: Move-conditional metrics (MC-SS)
   │       Expect: MASE ≈ 1.0 is normal
   │
   ├─ 0.90 < ACF(1) < 0.95 ───> DIFFICULT
   │   │
   │   └─ Theoretical max improvement: <2.5%
   │       Use: MASE as primary metric
   │       Expect: Small but meaningful gains possible
   │
   ├─ 0.70 < ACF(1) < 0.90 ───> MODERATE
   │   │
   │   └─ Theoretical max improvement: <10%
   │       Standard metrics (MAE, RMSE) are meaningful
   │       Models can add real value
   │
   └─ ACF(1) < 0.70 ───> ACHIEVABLE
       │
       └─ Theoretical max improvement: >10%
           Standard ML approaches work well
           Persistence is a weak baseline
```
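
The tree above can be turned into a small lookup. This is a minimal sketch, not part of temporalcv: `lag1_autocorr` and `persistence_difficulty` are hypothetical helper names, and sending boundary values (for example exactly 0.95) to the easier tier is an assumption, since the tree leaves them unspecified.

```python
import numpy as np

def lag1_autocorr(series):
    """Sample lag-1 autocorrelation via the correlation of y[t] with y[t-1]."""
    series = np.asarray(series, dtype=float)
    return float(np.corrcoef(series[1:], series[:-1])[0, 1])

def persistence_difficulty(acf1):
    """Map ACF(1) to the tier names used in the decision tree."""
    # Assumption: a value exactly on a boundary falls into the easier tier.
    if acf1 > 0.99:
        return "VERY HARD"
    if acf1 > 0.95:
        return "HARD"
    if acf1 > 0.90:
        return "DIFFICULT"
    if acf1 > 0.70:
        return "MODERATE"
    return "ACHIEVABLE"

# Illustrative input: a noisy linear trend, which is highly persistent
rng = np.random.default_rng(0)
trend = np.arange(200) + rng.normal(size=200)
acf1 = lag1_autocorr(trend)
print(f"ACF(1) = {acf1:.4f} -> {persistence_difficulty(acf1)}")
```

Running this check before modelling tells you which metrics and expectations from the tree apply to your series.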

### Sample Size Guidance [T3]

| Data Frequency | Minimum Observations | Minimum per CV Fold |
|----------------|---------------------|---------------------|
| Daily | 500+ | 100 |
| Weekly | 104+ (2 years) | 52 |
| Monthly | 60+ (5 years) | 24 |
| Quarterly | 40+ (10 years) | 16 |

**Rule of thumb**: `n_train >= 10 × n_features` for stable estimates.

**High persistence penalty**: When ACF(1) > 0.9, effective sample size is reduced. Consider 2× the minimum observations.
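
To get a feel for how much persistence shrinks the effective sample size, here is a rough sketch. It uses a standard approximation for the sample mean of an AR(1) series, `n_eff ≈ n (1 - φ) / (1 + φ)`. The helper name `effective_sample_size` is hypothetical, and the approximation covers only the mean, so treat the numbers as indicative rather than as a replacement for the rule of thumb above.

```python
def effective_sample_size(n, phi):
    """Approximate effective sample size for the mean of an AR(1) series."""
    if not -1 < phi < 1:
        raise ValueError("phi must be strictly between -1 and 1")
    return n * (1 - phi) / (1 + phi)

for phi in (0.3, 0.7, 0.9, 0.98):
    n_eff = effective_sample_size(500, phi)
    print(f"phi={phi}: 500 observations behave like ~{n_eff:.0f} independent ones")
```

Under this approximation the penalty grows quickly as φ approaches 1, which is one reason to collect more data than the table's minimums for highly persistent series.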

In [None]:
# Generate AR(1) processes with different persistence levels
def generate_ar1(n=500, phi=0.9, sigma=1.0, seed=42):
    """Generate AR(1) process with specified autocorrelation."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    y[0] = rng.normal(0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, n):
        y[t] = phi * y[t-1] + sigma * rng.normal()
    return y

def create_lag_features(series, n_lags=5):
    """Create lag features for prediction."""
    n = len(series)
    X = np.column_stack([
        np.concatenate([[np.nan]*lag, series[:-lag]]) 
        for lag in range(1, n_lags + 1)
    ])
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], series[valid]

# Generate three series with different persistence
series_low = generate_ar1(n=500, phi=0.3, seed=42)   # Low persistence
series_mid = generate_ar1(n=500, phi=0.7, seed=42)   # Medium persistence
series_high = generate_ar1(n=500, phi=0.98, seed=42) # High persistence (Treasury-like)

print(f"Low persistence (phi=0.3):  ACF(1) = {np.corrcoef(series_low[1:], series_low[:-1])[0,1]:.3f}")
print(f"Medium persistence (phi=0.7): ACF(1) = {np.corrcoef(series_mid[1:], series_mid[:-1])[0,1]:.3f}")
print(f"High persistence (phi=0.98): ACF(1) = {np.corrcoef(series_high[1:], series_high[:-1])[0,1]:.3f}")

---

## Section 1: What is Persistence?

**Persistence** (aka naive forecast) predicts:

$$\hat{y}_{t+1} = y_t$$

In words: "Tomorrow will be the same as today."

For changes: persistence predicts zero change:

$$\hat{\Delta y}_{t+1} = 0$$

**This is your baseline.** Any model must beat this to be useful.
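
The two formulations above describe the same forecast, so they produce identical errors. A minimal sketch with made-up illustrative values:

```python
import numpy as np

y = np.array([10.0, 10.5, 10.2, 10.9, 11.0])  # illustrative values only

# Levels: the forecast for y[t+1] is y[t]
level_forecast = y[:-1]
level_errors = y[1:] - level_forecast

# Changes: the forecast for every change is 0
actual_changes = np.diff(y)
change_errors = actual_changes - 0.0

mae_levels = np.mean(np.abs(level_errors))
mae_changes = np.mean(np.abs(change_errors))
print(f"MAE on levels:  {mae_levels:.4f}")
print(f"MAE on changes: {mae_changes:.4f}")
```

Whichever target you model, the persistence MAE is simply the mean absolute one-step change of the series.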

In [None]:
# Demonstrate what persistence means
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Left: Level prediction
ax = axes[0]
t = np.arange(50)
sample = series_high[:50]
persistence_pred = sample[:-1]  # Forecast for y[t+1] is y[t]

ax.plot(t, sample, 'b-', linewidth=2, label='Actual', alpha=0.8)
ax.plot(t[1:], persistence_pred, 'r--', linewidth=2, label='Persistence (y[t-1])', alpha=0.8)
ax.set_xlabel('Time')
ax.set_ylabel('Value')
ax.set_title('Persistence on Levels: y_hat[t] = y[t-1]', fontsize=12, fontweight='bold')
ax.legend()

# Right: Change prediction
ax = axes[1]
changes = np.diff(sample)
persistence_change = np.zeros(len(changes))  # Persistence predicts 0 change

ax.plot(changes, 'b-', linewidth=2, label='Actual change', alpha=0.8)
ax.axhline(y=0, color='r', linestyle='--', linewidth=2, label='Persistence (0)')
ax.fill_between(range(len(changes)), changes, 0, alpha=0.3)
ax.set_xlabel('Time')
ax.set_ylabel('Change (Δy)')
ax.set_title('Persistence on Changes: Δy_hat = 0', fontsize=12, fontweight='bold')
ax.legend()

plt.tight_layout()
plt.show()

print(f"\nPersistence MAE on changes: {np.mean(np.abs(changes)):.4f}")
print(f"(This IS the baseline to beat)")

---

## Section 2: Why Persistence is Hard to Beat [T1]

### The Mathematics of Persistence

For an AR(1) process with autocorrelation $\phi$:

$$y_{t+1} = \phi \cdot y_t + \epsilon_{t+1}$$

The **optimal 1-step forecast** is:

$$\hat{y}_{t+1}^* = \phi \cdot y_t$$

The **persistence forecast** is:

$$\hat{y}_{t+1}^{\text{persist}} = y_t$$

The **difference in expected squared error**:

$$\text{MSE}_{\text{persist}} - \text{MSE}_{\text{optimal}} = (1-\phi)^2 \cdot \text{Var}(y)$$

When $\phi \to 1$, this difference approaches **zero**.
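
The gap follows directly from the AR(1) definition. The sketch below assumes a stationary, zero-mean process with $|\phi| < 1$ and writes $\sigma_\epsilon^2$ for the innovation variance (a symbol introduced here for illustration):

$$
\begin{aligned}
y_{t+1} - \hat{y}_{t+1}^{\text{persist}} &= (\phi - 1)\, y_t + \epsilon_{t+1} \\
\text{MSE}_{\text{persist}} &= (1-\phi)^2 \cdot \text{Var}(y) + \sigma_\epsilon^2 \\
\text{MSE}_{\text{optimal}} &= \sigma_\epsilon^2
\end{aligned}
$$

The cross term vanishes because $\epsilon_{t+1}$ is independent of $y_t$. Substituting $\text{Var}(y) = \sigma_\epsilon^2 / (1 - \phi^2)$ gives $\text{MSE}_{\text{persist}} = 2\sigma_\epsilon^2 / (1 + \phi)$, so the same gap expressed relative to persistence's own MSE is

$$\frac{\text{MSE}_{\text{persist}} - \text{MSE}_{\text{optimal}}}{\text{MSE}_{\text{persist}}} = \frac{1-\phi}{2}$$

The two normalizations give different percentages: as a share of $\text{Var}(y)$ the gap is $(1-\phi)^2$, while relative to persistence's MSE it is $(1-\phi)/2$. Both go to zero as $\phi \to 1$.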

### The Intuition

- **Low phi (0.3)**: Tomorrow is mostly noise → persistence is bad → easy to beat
- **High phi (0.98)**: Tomorrow is almost the same as today → persistence is great → nearly impossible to beat
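
The MSE gap formula can be checked by simulation. This is a sketch under stated assumptions: Gaussian innovations, a known true φ for the optimal forecast, and a hypothetical helper `simulate_ar1` that mirrors the notebook's generator but is defined here so the block runs on its own.

```python
import numpy as np

def simulate_ar1(n, phi, sigma=1.0, seed=0):
    """Simulate a stationary AR(1) series with Gaussian innovations."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n)
    y = np.empty(n)
    y[0] = eps[0] / np.sqrt(1 - phi**2)  # start from the stationary distribution
    for t in range(1, n):
        y[t] = phi * y[t - 1] + eps[t]
    return y

phi, sigma = 0.7, 1.0
y = simulate_ar1(100_000, phi, sigma)

mse_persist = np.mean((y[1:] - y[:-1]) ** 2)
mse_optimal = np.mean((y[1:] - phi * y[:-1]) ** 2)  # uses the true phi, not an estimate

simulated_gap = mse_persist - mse_optimal
theoretical_gap = (1 - phi) ** 2 * sigma**2 / (1 - phi**2)  # (1 - phi)^2 * Var(y)

print(f"simulated gap:   {simulated_gap:.4f}")
print(f"theoretical gap: {theoretical_gap:.4f}")
```

The two numbers should agree up to sampling noise. Rerunning with φ closer to 1 shows the gap shrinking toward zero.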

In [None]:
# Show the theoretical bound: as phi -> 1, persistence becomes optimal
phi_values = np.linspace(0.1, 0.99, 50)

# Theoretical improvement possible over persistence (as fraction of variance)
improvement_possible = (1 - phi_values)**2

fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(phi_values, improvement_possible * 100, 'b-', linewidth=2)
ax.fill_between(phi_values, 0, improvement_possible * 100, alpha=0.3)

# Mark common phi values
for phi, label in [(0.3, 'Low'), (0.7, 'Medium'), (0.98, 'High\n(Treasury)')]:
    improvement = (1 - phi)**2 * 100
    ax.axvline(x=phi, color='red', linestyle='--', alpha=0.5)
    ax.scatter([phi], [improvement], color='red', s=100, zorder=5)
    ax.annotate(f'{label}\n{improvement:.1f}%', xy=(phi, improvement),
                xytext=(phi + 0.03, improvement + 5), fontsize=10)

ax.set_xlabel('Autocorrelation (φ)', fontsize=12)
ax.set_ylabel('MSE Gap: Persistence vs. Optimal\n(% of Var(y))', fontsize=12)
ax.set_title('[T1] Theoretical Bound: High φ = Hard to Beat Persistence',
             fontsize=13, fontweight='bold')
ax.set_xlim(0, 1)
ax.set_ylim(0, 100)

plt.tight_layout()
plt.show()

print("\nKey insight:")
print(f"  At φ=0.3: the MSE gap to the optimal forecast is {(1-0.3)**2*100:.1f}% of Var(y)")
print(f"  At φ=0.98: the gap is only {(1-0.98)**2*100:.2f}% of Var(y)")
print(f"\n  Claims of >20% improvement on high-φ data are SUSPICIOUS.")

In [None]:
# Empirical demonstration: train models on different phi series
def evaluate_vs_persistence(series, phi_label):
    """Train a model and compare to persistence."""
    X, y = create_lag_features(series, n_lags=5)
    
    # Train-test split (80-20)
    split_idx = int(len(X) * 0.8)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    
    # Model predictions
    model = Ridge(alpha=1.0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    
    # Persistence predictions (last known value)
    persistence_preds = X_test[:, 0]  # First lag = y[t-1]
    
    # Calculate MAEs
    model_mae = mean_absolute_error(y_test, preds)
    persistence_mae = mean_absolute_error(y_test, persistence_preds)
    
    # Improvement
    improvement = (persistence_mae - model_mae) / persistence_mae * 100
    
    return {
        'phi': phi_label,
        'model_mae': model_mae,
        'persistence_mae': persistence_mae,
        'improvement': improvement
    }

# Test all three series
results = [
    evaluate_vs_persistence(series_low, 0.3),
    evaluate_vs_persistence(series_mid, 0.7),
    evaluate_vs_persistence(series_high, 0.98),
]

print("MODEL vs PERSISTENCE BASELINE")
print("=" * 60)
print(f"{'φ':<10} {'Model MAE':<15} {'Persist. MAE':<15} {'Improvement':<15}")
print("-" * 60)
for r in results:
    print(f"{r['phi']:<10} {r['model_mae']:<15.4f} {r['persistence_mae']:<15.4f} {r['improvement']:>+.1f}%")
print("-" * 60)
print("\nAs φ increases, beating persistence becomes nearly impossible.")

---

## Section 3: The MAE Mirage

**The Problem:** Your model might have impressively low MAE, but so does persistence!

Low absolute error doesn't mean the model has learned anything. What matters is whether you beat the baseline.

In [None]:
# Demonstrate the MAE mirage
X_high, y_high = create_lag_features(series_high, n_lags=5)

# Train-test split
split_idx = int(len(X_high) * 0.8)
X_train, X_test = X_high[:split_idx], X_high[split_idx:]
y_train, y_test = y_high[:split_idx], y_high[split_idx:]

# Train model
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
model_preds = model.predict(X_test)

# Persistence predictions
persistence_preds = X_test[:, 0]

# Compare
model_mae = mean_absolute_error(y_test, model_preds)
persistence_mae = mean_absolute_error(y_test, persistence_preds)

print("THE MAE MIRAGE")
print("=" * 55)
print(f"\nHigh-persistence series (φ=0.98, like Treasury rates)")
print(f"\n  Model MAE:       {model_mae:.4f}  <-- Looks great!")
print(f"  Persistence MAE: {persistence_mae:.4f}  <-- Also great...")
print(f"\n  Improvement: {(persistence_mae - model_mae) / persistence_mae * 100:.1f}%")
print(f"\n  The model's 'impressive' MAE is just matching persistence.")
print(f"  It learned to predict 'tomorrow = today', which is trivial.")

In [None]:
# Visualize: model predictions vs persistence
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Predictions over time
ax = axes[0]
t = np.arange(len(y_test))
ax.plot(t, y_test, 'b-', linewidth=2, label='Actual', alpha=0.8)
ax.plot(t, model_preds, 'g--', linewidth=2, label='Model', alpha=0.8)
ax.plot(t, persistence_preds, 'r:', linewidth=2, label='Persistence', alpha=0.8)
ax.set_xlabel('Test Index')
ax.set_ylabel('Value')
ax.set_title('Model vs Persistence: Nearly Identical!', fontsize=12, fontweight='bold')
ax.legend()

# Right: Model predictions vs persistence predictions
ax = axes[1]
ax.scatter(persistence_preds, model_preds, alpha=0.5, s=30)
min_val = min(persistence_preds.min(), model_preds.min())
max_val = max(persistence_preds.max(), model_preds.max())
ax.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='y=x')
ax.set_xlabel('Persistence Prediction', fontsize=11)
ax.set_ylabel('Model Prediction', fontsize=11)
ax.set_title('Model Learned to Mimic Persistence', fontsize=12, fontweight='bold', color='red')
ax.legend()

# Calculate correlation
corr = np.corrcoef(persistence_preds, model_preds)[0, 1]
ax.annotate(f'Correlation: {corr:.3f}', xy=(0.05, 0.95), xycoords='axes fraction',
            fontsize=12, fontweight='bold', color='red')

plt.tight_layout()
plt.show()

print(f"\nCorrelation between model and persistence: {corr:.3f}")
print(f"The model is essentially just predicting y[t+1] = y[t].")

---

## Section 4: MASE, the Scale-Invariant Answer [T1]

**MASE (Mean Absolute Scaled Error)** normalizes MAE by the persistence baseline:

$$\text{MASE} = \frac{\text{MAE}}{\text{MAE}_{\text{naive}}}$$

Where $\text{MAE}_{\text{naive}}$ is the in-sample MAE of the persistence forecast.

**Interpretation:**
- MASE < 1: Model beats persistence → **Good!**
- MASE = 1: Model equals persistence → Learned nothing
- MASE > 1: Model is worse than persistence → **Bad!**
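
The definition above fits in a few lines of NumPy. This is a from-scratch sketch for intuition, not temporalcv's `compute_mase`; the name `mase_from_scratch` is hypothetical, and the scale is the one-step persistence MAE on the training data.

```python
import numpy as np

def mase_from_scratch(y_train, y_test, preds):
    """MASE: test MAE divided by the in-sample one-step persistence MAE."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    preds = np.asarray(preds, dtype=float)

    # Persistence error on the training data: mean absolute one-step change
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    if naive_mae == 0:
        raise ValueError("training series is constant; MASE is undefined")

    return float(np.mean(np.abs(y_test - preds)) / naive_mae)

# Training changes are 1, 2, 3 (naive MAE = 2); test errors are 1 and 2 (MAE = 1.5)
print(mase_from_scratch([1, 2, 4, 7], [8, 10], [7, 12]))  # 1.5 / 2 = 0.75
```

Because the scale comes from the training data only, no test information leaks into the denominator.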

### Reference [T1]
Hyndman & Koehler (2006): "Another look at measures of forecast accuracy"

In [None]:
# Compute MASE for all three series
def evaluate_with_mase(series, phi_label):
    """Evaluate using MASE."""
    X, y = create_lag_features(series, n_lags=5)
    
    # Train-test split
    split_idx = int(len(X) * 0.8)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    
    # Naive MAE from training data
    naive_mae = compute_naive_error(y_train, method='persistence')
    
    # Model predictions
    model = Ridge(alpha=1.0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    
    # MASE
    mase = compute_mase(preds, y_test, naive_mae)
    
    return {
        'phi': phi_label,
        'mae': mean_absolute_error(y_test, preds),
        'naive_mae': naive_mae,
        'mase': mase
    }

mase_results = [
    evaluate_with_mase(series_low, 0.3),
    evaluate_with_mase(series_mid, 0.7),
    evaluate_with_mase(series_high, 0.98),
]

print("MASE EVALUATION")
print("=" * 65)
print(f"{'φ':<10} {'Model MAE':<15} {'Naive MAE':<15} {'MASE':<15} {'Verdict'}")
print("-" * 65)
for r in mase_results:
    verdict = '✓ Beats baseline' if r['mase'] < 1 else '✗ No skill'
    print(f"{r['phi']:<10} {r['mae']:<15.4f} {r['naive_mae']:<15.4f} {r['mase']:<15.3f} {verdict}")
print("-" * 65)
print("\nMASE reveals the truth: high-φ series don't improve over persistence.")

In [None]:
# Visualize MASE across different phi levels
fig, ax = plt.subplots(figsize=(10, 5))

phis = [r['phi'] for r in mase_results]
mases = [r['mase'] for r in mase_results]

colors = ['green' if m < 1 else 'red' for m in mases]
bars = ax.bar(range(len(phis)), mases, color=colors, alpha=0.7, edgecolor='black')

# Add baseline line at MASE=1
ax.axhline(y=1, color='black', linestyle='--', linewidth=2, label='Persistence baseline')

# Labels
ax.set_xticks(range(len(phis)))
ax.set_xticklabels([f'φ={p}' for p in phis])
ax.set_ylabel('MASE', fontsize=12)
ax.set_title('MASE by Persistence Level\n(< 1 = Better than persistence)', 
             fontsize=13, fontweight='bold')

# Add value labels on bars
for i, (bar, mase) in enumerate(zip(bars, mases)):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'{mase:.3f}', ha='center', fontsize=11, fontweight='bold')

ax.legend(loc='upper left')
ax.set_ylim(0, max(mases) * 1.2)

plt.tight_layout()
plt.show()

---

## Section 5: `gate_suspicious_improvement`

temporalcv provides `gate_suspicious_improvement` to automatically flag results that are "too good to be true."

**Thresholds [T2/T3]:**
- >20% improvement over baseline: **HALT** (likely leakage)
- 10-20% improvement: **WARN** (investigate)
- <10% improvement: **PASS** (plausible)

**Why these thresholds?**
- Per theoretical bounds, >20% improvement is nearly impossible for high-φ data
- In practice, >20% improvements usually indicate leakage or bugs
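
The threshold logic can be sketched as a plain function. This is not temporalcv's implementation: `classify_improvement` is a hypothetical name, the defaults are the thresholds listed above, and the handling of values exactly at 10% or 20% is an assumption that the library may resolve differently.

```python
def classify_improvement(model_mae, baseline_mae, warn_at=0.10, halt_at=0.20):
    """Return 'PASS', 'WARN', or 'HALT' from the relative improvement over a baseline."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")

    # Relative improvement: positive when the model has lower error than the baseline
    improvement = (baseline_mae - model_mae) / baseline_mae

    if improvement > halt_at:
        return "HALT"
    if improvement >= warn_at:
        return "WARN"
    return "PASS"

for model_mae in (0.95, 0.85, 0.75):
    print(f"model MAE {model_mae} vs baseline 1.0 -> {classify_improvement(model_mae, 1.0)}")
```

A model that is worse than the baseline has negative improvement and passes this check; the gate is meant to catch implausibly good results, not poor ones.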

In [None]:
# Demonstrate gate_suspicious_improvement
print("gate_suspicious_improvement Demo")
print("=" * 55)

# Scenario 1: Realistic improvement (5%)
model_mae_realistic = 0.95  # 5% better
baseline_mae = 1.0

result1 = gate_suspicious_improvement(
    model_metric=model_mae_realistic,
    baseline_metric=baseline_mae
)
print(f"\nScenario 1: 5% improvement")
print(f"  Model MAE: {model_mae_realistic}, Baseline: {baseline_mae}")
print(f"  Status: {result1.status.value}")
print(f"  Message: {result1.message}")

# Scenario 2: Suspicious improvement (25%)
model_mae_suspicious = 0.75  # 25% better

result2 = gate_suspicious_improvement(
    model_metric=model_mae_suspicious,
    baseline_metric=baseline_mae
)
print(f"\nScenario 2: 25% improvement")
print(f"  Model MAE: {model_mae_suspicious}, Baseline: {baseline_mae}")
print(f"  Status: {result2.status.value}")
print(f"  Message: {result2.message}")
if result2.recommendation:
    print(f"  Recommendation: {result2.recommendation}")

# Scenario 3: Edge case (15%)
model_mae_warn = 0.85  # 15% better

result3 = gate_suspicious_improvement(
    model_metric=model_mae_warn,
    baseline_metric=baseline_mae
)
print(f"\nScenario 3: 15% improvement")
print(f"  Model MAE: {model_mae_warn}, Baseline: {baseline_mae}")
print(f"  Status: {result3.status.value}")
print(f"  Message: {result3.message}")

In [None]:
# Test gate on our actual results
print("Applying gate_suspicious_improvement to our models")
print("=" * 60)

for r in results:  # From Section 2
    gate_result = gate_suspicious_improvement(
        model_metric=r['model_mae'],
        baseline_metric=r['persistence_mae']
    )
    
    status_color = {
        'PASS': '✓',
        'WARN': '⚠',
        'HALT': '✗'
    }.get(gate_result.status.value, '?')
    
    print(f"\nφ={r['phi']}:")
    print(f"  Improvement: {r['improvement']:.1f}%")
    print(f"  Gate status: {status_color} {gate_result.status.value}")

---

## Section 6: Complete Evaluation Workflow

Always evaluate with both:
1. **MASE**: to see if you beat persistence
2. **gate_suspicious_improvement**: to catch "too good" results

In [None]:
def evaluate_forecast_properly(series, horizon=1, n_lags=5, random_state=42):
    """
    Complete forecast evaluation with MASE and gate checks.
    
    Parameters
    ----------
    series : array-like
        Time series to forecast
    horizon : int
        Forecast horizon
    n_lags : int
        Number of lag features
    random_state : int
        Random seed
        
    Returns
    -------
    dict
        Comprehensive evaluation results
    """
    np.random.seed(random_state)
    
    # Prepare data
    X, y = create_lag_features(series, n_lags=n_lags)
    
    # Walk-forward CV with proper gap
    cv = WalkForwardCV(
        n_splits=5,
        extra_gap=horizon,
        window_type='expanding',
        test_size=50
    )
    
    model_maes = []
    persistence_maes = []
    all_preds = []
    all_actuals = []
    all_train_data = []  # Collect training data for MASE denominator
    
    for train_idx, test_idx in cv.split(X):
        X_train, y_train = X[train_idx], y[train_idx]
        X_test, y_test = X[test_idx], y[test_idx]
        
        # Train model
        model = Ridge(alpha=1.0)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        
        # Persistence predictions
        persist_preds = X_test[:, 0]  # y[t-1]
        
        # Collect errors
        model_maes.append(mean_absolute_error(y_test, preds))
        persistence_maes.append(mean_absolute_error(y_test, persist_preds))
        all_preds.extend(preds)
        all_actuals.extend(y_test)
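        # Expanding window: the latest training set already contains the earlier ones.
        # Keep only it; concatenating folds would duplicate data and add artificial
        # jumps at the joins, which distorts the naive error.
        all_train_data = y_train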
    
    # Aggregate
    model_mae = np.mean(model_maes)
    persistence_mae = np.mean(persistence_maes)
    
    # MASE: naive error from TRAINING data per Hyndman & Koehler 2006
    # Using collected training data instead of full series to avoid leakage
    naive_mae = compute_naive_error(np.array(all_train_data), method='persistence')
    mase = compute_mase(np.array(all_preds), np.array(all_actuals), naive_mae)
    
    # Gate check
    gate_result = gate_suspicious_improvement(
        model_metric=model_mae,
        baseline_metric=persistence_mae
    )
    
    return {
        'model_mae': model_mae,
        'persistence_mae': persistence_mae,
        'improvement_pct': (persistence_mae - model_mae) / persistence_mae * 100,
        'mase': mase,
        'beats_persistence': mase < 1,
        'gate_status': gate_result.status.value,
        'gate_message': gate_result.message
    }

# Run complete evaluation
print("COMPLETE FORECAST EVALUATION")
print("=" * 70)

for series, phi in [(series_low, 0.3), (series_mid, 0.7), (series_high, 0.98)]:
    result = evaluate_forecast_properly(series)
    
    print(f"\nφ={phi}:")
    print(f"  Model MAE:      {result['model_mae']:.4f}")
    print(f"  Persist. MAE:   {result['persistence_mae']:.4f}")
    print(f"  Improvement:    {result['improvement_pct']:+.1f}%")
    print(f"  MASE:           {result['mase']:.3f} ({'✓ beats baseline' if result['beats_persistence'] else '✗ no skill'})")
    print(f"  Gate Status:    {result['gate_status']}")

---

## Pitfall Section

### Pitfall 1: Reporting MAE Without Context

```python
# WRONG: MAE alone is meaningless for time series
print(f"Model MAE: {mae:.4f}")  # Looks impressive!

# RIGHT: Always compare to persistence
print(f"Model MAE: {model_mae:.4f}")
print(f"Persistence MAE: {persistence_mae:.4f}")
print(f"MASE: {mase:.3f}")
```

### Pitfall 2: Claiming "X% Improvement" Without Verification

```python
# WRONG: Trust the improvement
improvement_pct = 25.0  # Great!

# RIGHT: Verify with gate
result = gate_suspicious_improvement(
    model_metric=model_mae,
    baseline_metric=persistence_mae
)
if result.status == GateStatus.HALT:
    raise ValueError(f"Suspicious improvement: {result.message}")
```

### Pitfall 3: Using RMSE Instead of MAE for MASE

```python
# WRONG: MASE uses MAE, not RMSE
mase = rmse / naive_rmse  # Incorrect!

# RIGHT: MASE definition
mase = mae / naive_mae  # Correct
```

In [None]:
# Demonstrate the pitfalls
print("Pitfall Demonstrations")
print("=" * 55)

# Pitfall 1: MAE without context
model_mae = 0.05
print("\nPitfall 1: MAE without context")
print(f"  'Model MAE: {model_mae}'")
print(f"  ↑ Is this good? You have no idea without persistence baseline!")
print(f"  ↓ With context:")
print(f"     Model MAE: {model_mae}")
print(f"     Persistence MAE: 0.048")
print(f"     MASE: 1.04 (worse than persistence!)")

# Pitfall 2: Trusting big improvements
print("\nPitfall 2: Trusting big improvements")
print(f"  '25% improvement!' → Run gate_suspicious_improvement first!")

---

## Key Insights

### 1. Persistence is the Natural Baseline [T1]
For time series, "no change" is the simplest forecast. All models must beat this.

### 2. High-φ Data is Hard to Beat [T1]
When autocorrelation is high (φ > 0.9), the theoretical improvement over persistence approaches zero.

### 3. Low MAE ≠ Good Model
A model can have low MAE but no skill: it just learned to predict "tomorrow = today."

### 4. MASE Reveals the Truth [T1]
MASE < 1 means better than persistence. MASE ≥ 1 means no skill.

### 5. >20% Improvement is Suspicious [T2]
For high-persistence data, large improvements usually indicate leakage.

---

## Next Steps

- **04_autocorrelation_matters.ipynb**: HAC variance for correlated forecast errors
- **05_shuffled_target_gate.ipynb**: Definitive leakage detection
- **10_high_persistence_metrics.ipynb**: MC-SS and move-conditional evaluation

---

*"A low MAE is meaningless without a baseline. Always compute MASE."*