# Why Time Series ML is Different

## The Four Traps That Fool Even Experienced ML Practitioners

---

## Prerequisites

**New to time-series ML?** Start with [00_time_series_fundamentals.ipynb](00_time_series_fundamentals.ipynb) first. It covers:
- Why autocorrelation matters
- What ACF(1)=0.9 means practically  
- The three types of leakage

**Feature engineering?** See the [Feature Engineering Safety Guide](../docs/tutorials/feature_engineering_safety.md).

---

## If You Know sklearn But Not Time Series, Read This First

**What you already know (from standard ML)**:
- `KFold(shuffle=True)` shuffles rows randomly → protects against order bias
- `train_test_split(shuffle=True)` is standard practice
- MAE/RMSE tells you how good your model is
- Lower error = better model

**What's different in time series (and why it matters)**:

| Standard ML | Time Series | Why |
|-------------|-------------|-----|
| Rows are independent (Customer A ≠ Customer B) | Rows are the *same entity* at different times | Today's Treasury rate ≈ yesterday's |
| Shuffle freely | **Never shuffle** | A model must not train on data from after the period it is tested on |
| KFold is safe | KFold is **dangerous** | Creates "time travel" |
| MAE = model quality | MAE can be **meaningless** | A persistence baseline may already reach a near-zero MAE |

**The traps we'll expose, and the fix**:
1. **The Shuffle Catastrophe**: Shuffled KFold can create a fake improvement of 30% or more
2. **The Gap Trap**: Multi-step forecasts leak without proper gaps
3. **The Persistence Mirage**: MAE=0.001 can mean zero skill
4. **The Solution**: temporalcv gates catch all of this

---

In [None]:
# Setup and imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# temporalcv - the solution
from temporalcv.cv import WalkForwardCV
from temporalcv.gates import (
    gate_signal_verification,
    gate_temporal_boundary,
    gate_suspicious_improvement,
    run_gates,
    GateStatus,
)

# Reproducibility
np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Setup complete. Let's expose some traps.")

---

## Section 1: The Shuffle Catastrophe

**The Problem:** `KFold` with `shuffle=True` assigns rows to folds at random. For time series, this means:
- Future observations can appear in the training set
- Past observations can appear in the test set
- The model "sees the answer" before the test

**Why it's dangerous:** Results look *amazing*... but they're fake.
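
To make the leak concrete, the sketch below counts, for each fold, how many training rows come after the earliest test row. It is written in plain Python rather than scikit-learn so the index logic stays visible. `shuffled_folds`, `walk_forward_folds`, and `future_train_rows` are hypothetical helper names for illustration, not part of any library.

```python
import random

def shuffled_folds(n, k, seed=0):
    """Yield (train, test) index lists the way a shuffled k-fold split would."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    fold_size = n // k  # assumes n is divisible by k, for simplicity
    for f in range(k):
        test = sorted(idx[f * fold_size:(f + 1) * fold_size])
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        yield train, test

def walk_forward_folds(n, k, test_size):
    """Yield expanding-window (train, test) index lists; train always precedes test."""
    for f in range(k):
        test_start = n - (k - f) * test_size
        yield list(range(test_start)), list(range(test_start, test_start + test_size))

def future_train_rows(train, test):
    """Count training rows that occur after the earliest test row."""
    first_test = min(test)
    return sum(1 for i in train if i > first_test)

shuffled_leaks = [future_train_rows(tr, te) for tr, te in shuffled_folds(100, 5)]
forward_leaks = [future_train_rows(tr, te) for tr, te in walk_forward_folds(100, 5, 15)]
print("future rows in training, shuffled:    ", shuffled_leaks)
print("future rows in training, walk-forward:", forward_leaks)
```

The walk-forward counts are zero by construction. With shuffling, the last row of the series sits in the training set of every fold but one, so at least four of the five folds train on the future.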

In [None]:
def generate_ar1(n=500, phi=0.9, sigma=1.0, seed=42):
    """
    Generate AR(1) process: y[t] = phi * y[t-1] + noise
    
    This mimics real-world persistent time series like:
    - Interest rates (phi ~ 0.99)
    - Unemployment rates (phi ~ 0.95)
    - Stock volatility (phi ~ 0.90)
    """
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    # Start from stationary distribution
    y[0] = rng.normal(0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, n):
        y[t] = phi * y[t-1] + sigma * rng.normal()
    return y

def create_lag_features(series, n_lags=5):
    """
    Create lag features: y[t-1], y[t-2], ..., y[t-n_lags]
    
    These are the "safe" features for time series prediction.
    """
    n = len(series)
    X = np.column_stack([
        np.concatenate([[np.nan]*lag, series[:-lag]]) 
        for lag in range(1, n_lags + 1)
    ])
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], series[valid]

# Generate our time series
series = generate_ar1(n=500, phi=0.9)
X, y = create_lag_features(series, n_lags=5)

acf1 = np.corrcoef(series[1:], series[:-1])[0, 1]
print(f"Generated AR(1) series with ACF(1) = {acf1:.3f}")
print(f"Features shape: {X.shape}, Target shape: {y.shape}")

In [None]:
# The comparison that exposes the trap
model = Ridge(alpha=1.0)

# WRONG: Standard KFold (shuffles time!)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_scores = -cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_absolute_error')

# CORRECT: Walk-forward (respects time)
wfcv = WalkForwardCV(n_splits=5, window_type='expanding', test_size=50)
wfcv_scores = []
for train_idx, test_idx in wfcv.split(X):
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    wfcv_scores.append(mean_absolute_error(y[test_idx], preds))

kfold_mae = np.mean(kfold_scores)
wfcv_mae = np.mean(wfcv_scores)
fake_improvement = (wfcv_mae - kfold_mae) / wfcv_mae * 100

print("=" * 55)
print("THE SHUFFLE CATASTROPHE")
print("=" * 55)
print(f"KFold MAE:        {kfold_mae:.4f}  <-- LOOKS GREAT!")
print(f"WalkForward MAE:  {wfcv_mae:.4f}  <-- REALITY")
print(f"")
print(f"Fake improvement from shuffling: {fake_improvement:.1f}%")
print("")
print("The KFold result is an ILLUSION caused by temporal leakage.")

In [None]:
# Visualize WHY shuffling leaks information (THE KEY DIAGRAM)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: What KFold does (WRONG) - show multiple folds
ax = axes[0]
ax.set_title('❌ KFold: Future Leaks into Training\n(WRONG for time series)', fontsize=12, fontweight='bold', color='red')

# Simulate 5 KFold splits on 100 points
np.random.seed(42)
n_points = 100

for fold in range(5):
    # Draw 20 random test points per fold to illustrate how a shuffled split
    # scatters test rows through time (real KFold test folds do not overlap)
    indices = np.arange(n_points)
    shuffled_test = np.sort(np.random.choice(n_points, 20, replace=False))
    shuffled_train = np.setdiff1d(indices, shuffled_test)
    
    y_pos = fold
    ax.scatter(shuffled_train, [y_pos]*len(shuffled_train), c='steelblue', s=8, alpha=0.7, marker='s')
    ax.scatter(shuffled_test, [y_pos]*len(shuffled_test), c='salmon', s=12, alpha=0.9, marker='o')

ax.set_xlabel('Time Index', fontsize=11)
ax.set_ylabel('Fold Number', fontsize=11)
ax.set_yticks(range(5))
ax.set_yticklabels([f'Fold {i+1}' for i in range(5)])

# Add legend
from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0], [0], marker='s', color='w', markerfacecolor='steelblue', markersize=10, label='Train'),
    Line2D([0], [0], marker='o', color='w', markerfacecolor='salmon', markersize=10, label='Test')
]
ax.legend(handles=legend_elements, loc='upper right')

# Highlight the problem
ax.annotate('Test points are\nSCATTERED in time!\nFuture trains past.', 
            xy=(50, 2), fontsize=10, color='red', fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='lightyellow', edgecolor='red'))

# Right: Walk-forward proper split (CORRECT)
ax = axes[1]
ax.set_title('✓ WalkForward: Train THEN Test\n(CORRECT)', fontsize=12, fontweight='bold', color='green')

# Show 5 expanding window splits with consecutive, non-overlapping test windows
train_ends = [25, 40, 55, 70, 85]
test_sizes = [15, 15, 15, 15, 15]

for fold, (train_end, test_size) in enumerate(zip(train_ends, test_sizes)):
    train_idx = np.arange(0, train_end)
    test_idx = np.arange(train_end, min(train_end + test_size, n_points))
    
    y_pos = fold
    ax.barh(y_pos, train_end, left=0, height=0.6, color='steelblue', alpha=0.7, label='Train' if fold==0 else '')
    ax.barh(y_pos, len(test_idx), left=train_end, height=0.6, color='salmon', alpha=0.9, label='Test' if fold==0 else '')
    
    # Draw boundary line in data coordinates so it stays aligned with this fold's bar
    ax.vlines(train_end, y_pos - 0.3, y_pos + 0.3, colors='black', linestyles='--', linewidth=1.5)

ax.set_xlabel('Time Index', fontsize=11)
ax.set_ylabel('Fold Number', fontsize=11)
ax.set_yticks(range(5))
ax.set_yticklabels([f'Fold {i+1}' for i in range(5)])
ax.legend(loc='upper right')

# Highlight the solution
ax.annotate('Train ALWAYS\nprecedes Test.\nNo time travel!', 
            xy=(70, 2), fontsize=10, color='green', fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='honeydew', edgecolor='green'))

plt.tight_layout()
plt.show()

print("\n★ Key Insight: In KFold, test points are scattered throughout time.")
print("  The model can 'see' future data points during training.")
print("  WalkForward ensures train data ALWAYS precedes test data.")

### The Revelation

KFold's "excellent" MAE is an illusion. The model learned to interpolate because:
1. When predicting y[t=50], the training set contained its neighbours on both sides, such as y[t=49] and y[t=51]; the training row for t=51 even uses y[50] as its lag-1 feature
2. With φ = 0.9 the series is highly autocorrelated, so nearby values are nearly identical
3. The model isn't predicting the future; it's peeking at it

**This is data leakage through temporal contamination.**
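
The contamination can be checked directly from the split indices. A minimal sketch, assuming row t has target y[t] and lag features y[t-1], ..., y[t-n_lags] as in `create_lag_features`; `leaked_test_targets` is a hypothetical helper, not a temporalcv function.

```python
def leaked_test_targets(train, test, n_lags):
    """Return test rows whose target value is used as a lag feature by a training row.

    Row t has target y[t] and features y[t-1], ..., y[t-n_lags], so y[t]
    is a feature of rows t+1, ..., t+n_lags.
    """
    train_set = set(train)
    return [t for t in test
            if any(t + k in train_set for k in range(1, n_lags + 1))]

# Shuffled-style split: test rows sit between training rows.
print(leaked_test_targets(train=[0, 1, 3, 4, 6], test=[2, 5], n_lags=2))  # [2, 5]

# Walk-forward split: every training row precedes every test row.
print(leaked_test_targets(train=[0, 1, 2, 3, 4], test=[5, 6], n_lags=2))  # []
```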

---

### The Solution: `gate_signal_verification`

temporalcv provides a definitive test: if your model beats a shuffled target, something is wrong.
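
For intuition, here is a minimal sketch of the permutation idea behind a shuffled-target test. It is not temporalcv's implementation: the one-feature least-squares fit, the in-sample scoring, and the names `fit_predict` and `shuffled_target_pvalue` are all simplifying assumptions made for illustration. A real check would score out-of-sample.

```python
import random

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def fit_predict(x, y):
    """Least-squares slope through the origin, then in-sample predictions."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return [slope * a for a in x]

def shuffled_target_pvalue(x, y, n_shuffles=99, seed=0):
    """Share of fits (counting the real one) whose MAE is at least as low as the real fit's."""
    rng = random.Random(seed)
    real_mae = mae(y, fit_predict(x, y))
    as_good = 0
    for _ in range(n_shuffles):
        y_perm = list(y)
        rng.shuffle(y_perm)  # destroys any real link between x and y
        if mae(y_perm, fit_predict(x, y_perm)) <= real_mae:
            as_good += 1
    return (1 + as_good) / (1 + n_shuffles)

# Toy series: y[t] = t, with the lag-1 value as the only feature.
series = [float(t) for t in range(60)]
x, y = series[:-1], series[1:]
p = shuffled_target_pvalue(x, y)
print(f"p-value: {p:.3f}")
```

A small p-value says the features carry information about the real targets that they do not carry about shuffled ones. The gate's own thresholds and its interpretation of that result are documented with the library.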

In [None]:
# The definitive leakage detector
result = gate_signal_verification(
    model=Ridge(alpha=1.0),
    X=X, y=y,
    n_shuffles=20,
    method="permutation",
    random_state=42,
)

print("VALIDATION GATE: Shuffled Target Test")
print("=" * 50)
print(f"Status: {result.status.value}")
print(f"p-value: {result.details['pvalue']:.4f}")
print(f"")
print(f"Interpretation: {result.message}")
print(f"")
if result.status == GateStatus.PASS:
    print("The model does NOT beat shuffled targets - no evidence of leakage.")
else:
    print(f"Recommendation: {result.recommendation}")

---

## Section 1.5: Diagnose Your Data First

Before building models, run this diagnostic to understand what you're dealing with.

**Key Questions**:
1. Is this high-persistence data? (ACF(1) > 0.9)
2. Should you even try to beat persistence?
3. What metrics will be meaningful?

In [None]:
def diagnose_your_data(series, name="Your Series"):
    """
    Quick diagnostic to understand what you're dealing with.
    
    Run this BEFORE building models to set expectations.
    """
    n = len(series)
    
    # ACF at lag 1
    acf1 = np.corrcoef(series[1:], series[:-1])[0, 1]
    
    # Persistence baseline MAE
    persistence_errors = np.abs(series[1:] - series[:-1])
    persistence_mae = np.mean(persistence_errors)
    
    # Volatility
    std = np.std(series)
    
    print(f"DIAGNOSTIC: {name}")
    print("=" * 55)
    print(f"Length:              {n} observations")
    print(f"Standard deviation:  {std:.4f}")
    print(f"Persistence MAE:     {persistence_mae:.4f}")
    print(f"ACF(1):              {acf1:.3f}")
    print(f"")
    
    # Classification
    if acf1 > 0.95:
        persistence_level = "VERY HIGH"
        guidance = "Beating persistence is extremely difficult. Consider move-conditional metrics."
    elif acf1 > 0.9:
        persistence_level = "HIGH"
        guidance = "Use MASE to measure true skill. Raw MAE will be misleading."
    elif acf1 > 0.7:
        persistence_level = "MODERATE"
        guidance = "Models can add value, but still compare to persistence."
    else:
        persistence_level = "LOW"
        guidance = "Standard metrics (MAE, RMSE) are meaningful."
    
    print(f"Persistence Level:   {persistence_level}")
    print(f"")
    print(f"Guidance: {guidance}")
    print(f"")
    
    # Visual
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Time series plot
    ax = axes[0]
    ax.plot(series, 'b-', linewidth=0.8)
    ax.set_title(f'{name}', fontsize=11)
    ax.set_xlabel('Time')
    ax.set_ylabel('Value')
    
    # ACF plot
    ax = axes[1]
    max_lag = min(30, n // 5)
    acf_values = [1.0]  # ACF(0) = 1
    for lag in range(1, max_lag + 1):
        acf_values.append(np.corrcoef(series[lag:], series[:-lag])[0, 1])
    ax.bar(range(max_lag + 1), acf_values, color='steelblue', alpha=0.7)
    ax.axhline(y=0.9, color='red', linestyle='--', label='High persistence threshold')
    ax.axhline(y=0, color='black', linewidth=0.5)
    ax.set_title('Autocorrelation Function (ACF)', fontsize=11)
    ax.set_xlabel('Lag')
    ax.set_ylabel('ACF')
    ax.legend(loc='upper right')
    
    plt.tight_layout()
    plt.show()
    
    return {"acf1": acf1, "persistence_mae": persistence_mae, "level": persistence_level}

# Diagnose our generated series
diag = diagnose_your_data(series, "AR(1) with φ=0.9")
print(f"\nStored diagnostic: ACF(1)={diag['acf1']:.3f}, Level={diag['level']}")

---

## Section 2: The Gap Trap

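**The Problem:** For h-step-ahead forecasting, the target for row t is built from the value at t+h. The last h training rows therefore have targets that reach into the test period unless you enforce a gap.

**Example:** Predicting the 12-week change in Treasury rates:
- Train ends at t=100
- Test starts at t=101
- But the last training target is y[100] = rate[112] - rate[100], and rate[112] lies inside the test period. The same holds for every training row from t=89 onward.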

Without gap enforcement, the model sees part of the answer.
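
The overlap is plain index arithmetic. The sketch below assumes row t's target is rate[t+h] - rate[t], as in the example above, and lists the training rows whose targets touch the test period. `leaking_train_rows` is a hypothetical helper.

```python
def leaking_train_rows(train_end, test_start, horizon):
    """Training rows t <= train_end whose target (built from t + horizon) reaches the test period."""
    return [t for t in range(train_end + 1) if t + horizon >= test_start]

# No gap: train ends at 100, test starts at 101, 12-step horizon.
no_gap = leaking_train_rows(train_end=100, test_start=101, horizon=12)
print(f"{len(no_gap)} leaking rows: {no_gap[0]}..{no_gap[-1]}")

# Gap of 12: the first test row moves to 113.
with_gap = leaking_train_rows(train_end=100, test_start=113, horizon=12)
print(f"{len(with_gap)} leaking rows")
```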

In [None]:
# Demonstrate the gap problem with h-step forecasting
horizon = 12  # 12-step ahead forecast (e.g., 12 weeks)

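# Create h-step change targets aligned with the lag features.
# Row i of X holds series[i+4], ..., series[i] (lags 1..5 of series[i+5]), so the
# latest value known to row i is series[i+4]. The target is the change over the
# next `horizon` steps from that value: series[i+4+h] - series[i+4].
last_known = X.shape[1] - 1  # offset of the most recent lag within `series`
y_changes = series[last_known + horizon:] - series[last_known:-horizon]
X_changes = X[:len(y_changes)]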

print(f"Horizon: {horizon} steps")
print(f"Target: predicting {horizon}-step changes")
print(f"")

# WITHOUT gap (WRONG)
cv_no_gap = WalkForwardCV(n_splits=3, horizon=1, extra_gap=0, test_size=horizon)

# WITH gap (CORRECT)
cv_with_gap = WalkForwardCV(n_splits=3, extra_gap=horizon, test_size=horizon)

print("--- Gap=0 (WRONG) ---")
for info in cv_no_gap.get_split_info(X_changes):
    boundary = gate_temporal_boundary(
        train_end_idx=info.train_end,
        test_start_idx=info.test_start,
        horizon=horizon,
    )
    status = "LEAKAGE!" if boundary.status == GateStatus.HALT else "OK"
    print(f"  Split {info.split_idx}: train[..{info.train_end}] test[{info.test_start}..] gap={info.gap} -> {status}")

print(f"")
print(f"--- Gap={horizon} (CORRECT) ---")
for info in cv_with_gap.get_split_info(X_changes):
    boundary = gate_temporal_boundary(
        train_end_idx=info.train_end,
        test_start_idx=info.test_start,
        horizon=horizon,
    )
    status = "LEAKAGE!" if boundary.status == GateStatus.HALT else "OK"
    print(f"  Split {info.split_idx}: train[..{info.train_end}] test[{info.test_start}..] gap={info.gap} -> {status}")

In [None]:
# Show the performance difference
model = Ridge(alpha=1.0)

# Train with horizon=1, extra_gap=0 (WRONG)
scores_no_gap = []
for train_idx, test_idx in cv_no_gap.split(X_changes):
    model.fit(X_changes[train_idx], y_changes[train_idx])
    preds = model.predict(X_changes[test_idx])
    scores_no_gap.append(mean_absolute_error(y_changes[test_idx], preds))

# Train with gap=horizon (CORRECT)
scores_with_gap = []
for train_idx, test_idx in cv_with_gap.split(X_changes):
    model.fit(X_changes[train_idx], y_changes[train_idx])
    preds = model.predict(X_changes[test_idx])
    scores_with_gap.append(mean_absolute_error(y_changes[test_idx], preds))

print("THE GAP TRAP")
print("=" * 50)
print(f"MAE without gap: {np.mean(scores_no_gap):.4f}  <-- INFLATED")
print(f"MAE with gap:    {np.mean(scores_with_gap):.4f}  <-- REALISTIC")
print(f"")
gap_effect = (np.mean(scores_with_gap) - np.mean(scores_no_gap)) / np.mean(scores_with_gap)
print(f"The {gap_effect:.1%} 'improvement' without gap is fake.")

In [None]:
# Visualize the overlap problem
fig, axes = plt.subplots(2, 1, figsize=(14, 6), sharex=True)

train_end = 80
test_start_no_gap = 81
test_start_with_gap = 81 + horizon

# Top: Gap=0 (WRONG)
ax = axes[0]
ax.axvspan(0, train_end, alpha=0.3, color='blue', label='Training window')
ax.axvspan(test_start_no_gap, test_start_no_gap + horizon, alpha=0.3, color='red', label='Test window')

# Show that the last training target reaches into the test window
leak_t = train_end + horizon
ax.annotate(f'Last training target y[{train_end}] uses data at t={leak_t}',
            xy=(leak_t, 0.5), xytext=(train_end - 60, 0.7),
            xycoords=ax.get_xaxis_transform(), textcoords=ax.get_xaxis_transform(),
            arrowprops=dict(arrowstyle='->', color='red', lw=2),
            fontsize=10, color='red', fontweight='bold')
ax.axvline(x=leak_t, color='red', linestyle='--', linewidth=2,
           label=f'Training target reaches t={leak_t}')

ax.set_title('Gap=0: Training targets overlap the test window (LEAKAGE)', fontsize=12, fontweight='bold', color='red')
ax.set_ylabel('(conceptual)')
ax.legend(loc='upper left')
ax.set_xlim(0, 120)

# Bottom: Gap=horizon (CORRECT)
ax = axes[1]
ax.axvspan(0, train_end, alpha=0.3, color='blue', label='Training window')
ax.axvspan(train_end, test_start_with_gap, alpha=0.3, color='gray', label=f'Gap ({horizon} periods)')
ax.axvspan(test_start_with_gap, test_start_with_gap + horizon, alpha=0.3, color='red', label='Test window')

ax.set_title(f'Gap={horizon}: Clean separation (NO LEAKAGE)', fontsize=12, fontweight='bold', color='green')
ax.set_xlabel('Time Index')
ax.set_ylabel('(conceptual)')
ax.legend(loc='upper left')
ax.set_xlim(0, 120)

plt.tight_layout()
plt.show()

### The Revelation

Without proper gap enforcement:
- The model "knows" part of the answer before the test
- Performance appears better than it truly is
- In production, the model will underperform expectations

**Rule [T1]:** For h-step forecasting, always set `gap >= horizon`.
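
To show what the rule means for the indices, here is a minimal expanding-window splitter with a gap. It is a sketch under stated assumptions, not `WalkForwardCV` itself: `walk_forward_splits` is a hypothetical function, and `gap` counts the rows dropped between the last training row and the first test row.

```python
def walk_forward_splits(n, n_splits, test_size, gap):
    """Yield (train_end, test_start, test_end) for expanding-window splits.

    Training rows are 0..train_end (inclusive), test rows are
    test_start..test_end-1, and `gap` rows between them are dropped.
    """
    for f in range(n_splits):
        test_start = n - (n_splits - f) * test_size
        train_end = test_start - gap - 1
        if train_end < 0:
            raise ValueError("not enough data for this many splits")
        yield train_end, test_start, test_start + test_size

horizon = 12
splits = list(walk_forward_splits(n=500, n_splits=3, test_size=12, gap=horizon))
for train_end, test_start, test_end in splits:
    # The last training target uses data at train_end + horizon,
    # which must fall before the first test row.
    clean = train_end + horizon < test_start
    print(f"train[0..{train_end}] test[{test_start}..{test_end - 1}] clean={clean}")
```

With `gap=horizon - 1` the last training target would land exactly on the first test row, which is why the rule is `gap >= horizon` and not one less.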

---

## Section 3: The Persistence Mirage

**The Problem:** On "sticky" data (high autocorrelation), even naive models have excellent MAE. Your complex model's MAE looks similar... but it may have learned nothing useful.

**Example: Treasury rates (ACF(1) ~ 0.99)**
- Persistence baseline: y_pred[t] = y[t-1] ("predict no change")
- Persistence MAE = 0.001 (incredibly small!)
- Your model: MAE = 0.0009 (10% "better")
- Did you actually learn anything? Often: no.
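
A short calculation shows how little room there is. For a Gaussian AR(1) process with coefficient phi, the persistence error has variance 2σ²/(1+phi), while the best one-step predictor (phi · y[t-1]) has error variance σ². MAE is proportional to the error standard deviation for Gaussian errors, so the best achievable MAE ratio is sqrt((1+phi)/2). The sketch below evaluates that bound. It applies only under these assumptions: a model with genuinely informative external features could do better, and real rate series are not exactly AR(1). `best_possible_improvement` is a hypothetical helper name.

```python
import math

def best_possible_improvement(phi):
    """Largest expected MAE improvement over persistence for a Gaussian AR(1).

    Persistence error variance:            2 * sigma^2 / (1 + phi)
    Optimal (phi * y[t-1]) error variance: sigma^2
    Ratio of MAEs (optimal / persistence): sqrt((1 + phi) / 2)
    """
    return 1.0 - math.sqrt((1.0 + phi) / 2.0)

for phi in (0.90, 0.98, 0.99):
    bound = best_possible_improvement(phi)
    print(f"phi={phi:.2f}: at most {bound:.2%} better than persistence")
```

Under these assumptions, a 10% gain on a series with phi near 0.99 is far more than the series' own history can deliver, which is a reason to look for leakage before celebrating.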

In [None]:
# Generate very sticky data (like Treasury rates)
series_sticky = generate_ar1(n=500, phi=0.98, sigma=0.05, seed=42)
X_sticky, y_sticky = create_lag_features(series_sticky, n_lags=5)

acf1_sticky = np.corrcoef(series_sticky[1:], series_sticky[:-1])[0, 1]
print(f"Generated high-persistence series with ACF(1) = {acf1_sticky:.4f}")
print(f"(This mimics Treasury rates, unemployment, many economic indicators)")
print(f"")

# Persistence baseline: predict y[t] = y[t-1] (the first lag feature)
persistence_preds = X_sticky[:, 0]  # y[t-1]
persistence_mae = mean_absolute_error(y_sticky, persistence_preds)

print(f"Persistence baseline MAE: {persistence_mae:.6f}")
print(f"")
print(f"This is the bar you need to clear. It's already VERY low.")

In [None]:
# Train a "sophisticated" model
model = Ridge(alpha=1.0)

# Proper temporal split
train_size = int(0.7 * len(y_sticky))
X_train, X_test = X_sticky[:train_size], X_sticky[train_size:]
y_train, y_test = y_sticky[:train_size], y_sticky[train_size:]

model.fit(X_train, y_train)
model_preds = model.predict(X_test)

# Persistence on test set
persistence_test_preds = X_test[:, 0]  # y[t-1]

# Metrics
model_mae = mean_absolute_error(y_test, model_preds)
persistence_test_mae = mean_absolute_error(y_test, persistence_test_preds)
improvement = (persistence_test_mae - model_mae) / persistence_test_mae * 100

print("THE PERSISTENCE MIRAGE")
print("=" * 50)
print(f"Persistence MAE: {persistence_test_mae:.6f}")
print(f"Ridge Model MAE: {model_mae:.6f}")
print(f"Improvement:     {improvement:.2f}%")
print(f"")
print(f"Both MAEs are TINY. But did the model learn anything useful?")

In [None]:
# MASE reveals the truth
# MASE < 1 means model beats persistence
# MASE = 1 means model equals persistence
# MASE > 1 means model is WORSE than persistence

# Compute naive error from training data (the proper way)
naive_errors = np.abs(np.diff(y_train))  # |y[t] - y[t-1]| on training
naive_mae = np.mean(naive_errors)

# MASE = model_mae / naive_mae
mase = model_mae / naive_mae

print("MASE: Mean Absolute Scaled Error [T1]")
print("=" * 50)
print(f"Naive MAE (from training): {naive_mae:.6f}")
print(f"Model MAE (on test):       {model_mae:.6f}")
print(f"")
print(f"MASE = {mase:.3f}")
print(f"")
if np.isclose(mase, 1.0):
    print(f"MASE = 1: Model equals naive - no skill.")
elif mase < 1:
    print(f"MASE < 1: Model beats naive by {(1-mase)*100:.1f}% - genuine skill!")
else:
    print(f"MASE > 1: Model is WORSE than naive!")
print(f"")
print(f"MASE normalizes by persistence difficulty, revealing true skill.")

In [None]:
# Visualize: predictions follow persistence almost exactly
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

test_indices = np.arange(len(y_test))

# Top: Full test period
ax = axes[0]
ax.plot(test_indices, y_test, 'k-', linewidth=1.5, label='Actual', alpha=0.8)
ax.plot(test_indices, model_preds, 'b--', linewidth=1.5, label='Model predictions')
ax.plot(test_indices, persistence_test_preds, 'r:', linewidth=1.5, label='Persistence (y[t-1])')
ax.set_title('High-Persistence Data: Model vs Persistence (Full Test Period)', fontsize=12, fontweight='bold')
ax.set_xlabel('Test Index')
ax.set_ylabel('Value')
ax.legend()

# Bottom: Zoomed to show they're nearly identical
ax = axes[1]
zoom = slice(0, 50)
ax.plot(test_indices[zoom], y_test[zoom], 'k-', linewidth=2, label='Actual')
ax.plot(test_indices[zoom], model_preds[zoom], 'b--', linewidth=2, label='Model')
ax.plot(test_indices[zoom], persistence_test_preds[zoom], 'r:', linewidth=2, label='Persistence')
ax.set_title('Zoomed: Model predictions follow persistence almost exactly', fontsize=12, fontweight='bold')
ax.set_xlabel('Test Index')
ax.set_ylabel('Value')
ax.legend()

plt.tight_layout()
plt.show()

print("\nVisual inspection: The model essentially learned to copy persistence!")
print("On sticky data, this is easy but provides no forecasting value.")

### The Revelation

On high-persistence data:
- MAE can be tiny for completely useless predictions
- A model that just copies y[t-1] achieves "state of the art" MAE
- MASE (Mean Absolute Scaled Error) reveals true skill by comparing to naive

**Rule [T1]:** Always report MASE (or improvement over persistence), not just MAE, on time series data.
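
For reporting, the calculation can be wrapped in a small reusable helper. This is a sketch in plain Python; `mase` here is a hypothetical function, not a temporalcv or scikit-learn API, and it scales by the one-step naive error on the training series.

```python
def mase(y_true, y_pred, y_train):
    """Mean Absolute Scaled Error: test MAE divided by the in-sample one-step naive MAE."""
    model_mae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)
    naive_mae = sum(abs(b - a) for a, b in zip(y_train, y_train[1:])) / (len(y_train) - 1)
    if naive_mae == 0:
        raise ValueError("training series is constant; MASE is undefined")
    return model_mae / naive_mae

# Naive training errors are 1, 2, 3 (mean 2); test errors are 1, 1 (mean 1).
print(mase(y_true=[8, 10], y_pred=[9, 9], y_train=[1, 2, 4, 7]))  # 0.5
```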

---

### The Solution: `gate_suspicious_improvement`

If improvement over persistence is too large (>20%), something is probably wrong.
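
The decision rule itself is simple. The sketch below mirrors the 20% and 10% thresholds used in this notebook; it is not the library's implementation, and `classify_improvement` is a hypothetical name.

```python
def classify_improvement(model_mae, baseline_mae, halt_above=0.20, warn_above=0.10):
    """Return (status, improvement) where improvement = 1 - model_mae / baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    improvement = 1.0 - model_mae / baseline_mae
    if improvement > halt_above:
        return "HALT", improvement
    if improvement > warn_above:
        return "WARN", improvement
    return "PASS", improvement

for model_mae in (0.70, 0.85, 0.95, 1.20):
    status, gain = classify_improvement(model_mae, baseline_mae=1.0)
    print(f"model MAE {model_mae:.2f} vs baseline 1.00: {gain:+.0%} -> {status}")
```

A model that is worse than the baseline passes this particular check: the gate looks for results that are too good, not for weak models.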

In [None]:
# Check if improvement is suspiciously large
result = gate_suspicious_improvement(
    model_metric=model_mae,
    baseline_metric=persistence_test_mae,
    threshold=0.20,  # >20% improvement is suspicious
    warn_threshold=0.10,  # 10-20% warrants caution
)

print("VALIDATION GATE: Suspicious Improvement")
print("=" * 50)
print(f"Status: {result.status.value}")
print(f"Improvement: {result.metric_value:.1%}")
print(f"")
print(f"Message: {result.message}")
if result.recommendation:
    print(f"Recommendation: {result.recommendation}")

---

## Section 4: The Validation Solution

Now that you've seen the three traps, here's how to protect yourself:

1. **`gate_signal_verification`**: Detects any temporal leakage (definitive test)
2. **`gate_temporal_boundary`**: Ensures proper gap for h-step forecasts
3. **`gate_suspicious_improvement`**: Flags "too good to be true" results

These gates follow **HALT > WARN > PASS** priority:
- **HALT**: Stop everything, investigate immediately
- **WARN**: Proceed with caution, verify results
- **PASS**: Validation passed, safe to proceed
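
The priority rule means the most severe individual result decides the overall outcome. A minimal sketch of that aggregation, using a hypothetical `Status` enum as a stand-in for the library's own status type:

```python
from enum import IntEnum

class Status(IntEnum):
    """Hypothetical stand-in for a gate status, ordered by severity."""
    PASS = 0
    WARN = 1
    HALT = 2

def overall_status(statuses):
    """The most severe individual status decides the overall outcome."""
    return max(statuses, default=Status.PASS)

print(overall_status([Status.PASS, Status.WARN, Status.PASS]).name)  # WARN
print(overall_status([Status.PASS, Status.HALT, Status.WARN]).name)  # HALT
```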

In [None]:
# Run all gates on our clean model
print("Running validation gates on clean model...")
print("")

# Get predictions from walk-forward CV
cv = WalkForwardCV(n_splits=5, window_type='expanding', horizon=1, extra_gap=0, test_size=50)
train_idx, test_idx = list(cv.split(X))[-1]  # Last split for demo

model = Ridge(alpha=1.0)
model.fit(X[train_idx], y[train_idx])
preds = model.predict(X[test_idx])

# Compute metrics
model_mae = mean_absolute_error(y[test_idx], preds)
persistence_mae = mean_absolute_error(y[test_idx], X[test_idx, 0])

# Run all gates
gates = [
    gate_signal_verification(
        model=Ridge(alpha=1.0),
        X=X, y=y,
        n_shuffles=20,
        method="permutation",
        random_state=42,
    ),
    gate_temporal_boundary(
        train_end_idx=train_idx[-1],
        test_start_idx=test_idx[0],
        horizon=1,  # 1-step forecast
    ),
    gate_suspicious_improvement(
        model_metric=model_mae,
        baseline_metric=persistence_mae,
        threshold=0.20,
    ),
]

report = run_gates(gates)
print(report.summary())

In [None]:
# Decision framework
print("""
VALIDATION DECISION FRAMEWORK
=============================

        START
          |
          v
+-------------------+
| Run shuffled      |
| target test       |
+-------------------+
          |
    HALT? -----> YES -----> STOP: Features leak target info
          |
          NO
          |
          v
+-------------------+
| Check temporal    |
| boundary (gap)    |
+-------------------+
          |
    HALT? -----> YES -----> STOP: Gap < horizon (increase gap)
          |
          NO
          |
          v
+-------------------+
| Check improvement |
| vs baseline       |
+-------------------+
          |
    HALT? -----> YES -----> STOP: >20% improvement (investigate!)
          |
    WARN? -----> YES -----> CAUTION: 10-20% improvement
          |
          NO
          |
          v
      DEPLOY
""")

In [None]:
# Demonstrate HALT on intentionally leaky features
print("Now let's create LEAKY features and watch the gate catch it...")
print("")

def create_leaky_features(series, n_lags=5):
    """
    Create features WITH intentional leakage.
    
    BUG: Centered rolling mean includes future values!
    This is a common real-world mistake.
    """
    n = len(series)
    features = []
    
    # Safe lag features
    for lag in range(1, n_lags + 1):
        lagged = np.full(n, np.nan)
        lagged[lag:] = series[:-lag]
        features.append(lagged)
    
    # LEAKY FEATURE: Centered rolling mean (includes the target and the future!)
    smoothed = np.full(n, np.nan)
    for t in range(3, n - 3):
        smoothed[t] = np.mean(series[t-3:t+4])  # Includes t itself plus t+1, t+2, t+3!
    features.append(smoothed)
    
    X = np.column_stack(features)
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], series[valid]

# Generate fresh series and create leaky features
series_fresh = generate_ar1(n=500, phi=0.9, seed=123)
X_leaky, y_leaky = create_leaky_features(series_fresh)

result = gate_signal_verification(
    model=Ridge(alpha=1.0),
    X=X_leaky, y=y_leaky,
    n_shuffles=20,
    method="permutation",
    random_state=42,
)

print("LEAKY FEATURES DETECTED!")
print("=" * 50)
print(f"Status: {result.status.value}")
print(f"p-value: {result.details['pvalue']:.4f}")
print(f"")
print(f"Message: {result.message}")
print(f"")
print(f"Recommendation: {result.recommendation}")
print(f"")
print("The gate correctly identified the centered rolling mean as leakage!")
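
The safe alternative to a centered window is a trailing one that stops before time t. The sketch below contrasts the two and checks each by perturbing the series from t onward: a feature that changes depends on information unavailable at prediction time. All three function names are hypothetical helpers for illustration.

```python
def centered_mean(series, t, half_window=3):
    """Mean of series[t-3 .. t+3]: includes the target y[t] and three future values."""
    return sum(series[t - half_window:t + half_window + 1]) / (2 * half_window + 1)

def trailing_mean(series, t, window=7):
    """Mean of series[t-7 .. t-1]: uses only values known before time t."""
    return sum(series[t - window:t]) / window

def depends_on_future(feature, series, t):
    """True if changing series[t:] changes the feature computed for time t."""
    altered = series[:t] + [v + 100.0 for v in series[t:]]
    return feature(series, t) != feature(altered, t)

series = [float(v) for v in range(20)]
print(depends_on_future(centered_mean, series, t=10))  # True
print(depends_on_future(trailing_mean, series, t=10))  # False
```

The same perturbation check works as a quick unit test for any hand-built feature.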

---

## Key Insights

### 1. KFold Shuffles Time (Bad!)
Random splits contaminate training with future data. Use `WalkForwardCV` instead of `KFold`.

### 2. Gap Must Match Horizon [T1]
For h-step forecasts, `gap >= h` is mandatory. `gate_temporal_boundary` catches violations.

### 3. MAE Lies on Sticky Data
High-persistence data makes naive baselines look great. Use MASE (Mean Absolute Scaled Error) instead.

### 4. Too Good = Probably Wrong [T3]
More than 20% improvement over persistence? Investigate! `gate_suspicious_improvement` automates this check.

### 5. Shuffled Target Test is Definitive [T2]
If model beats shuffled targets: **LEAKAGE**. Always run before trusting results.

---

## Complete Validation Template

In [None]:
def validate_time_series_model(model, X, y, horizon=1):
    """
    Complete validation pipeline for time series models.
    
    Assumes the first column of X is the lag-1 value y[t-1], which
    serves as the persistence baseline.
    
    Returns ValidationReport with HALT/WARN/PASS status.
    Copy this template for your own projects!
    """
    from temporalcv.cv import WalkForwardCV
    from temporalcv.gates import (
        gate_signal_verification,
        gate_temporal_boundary,
        gate_suspicious_improvement,
        run_gates,
    )
    from sklearn.metrics import mean_absolute_error
    from sklearn.base import clone
    
    # Proper CV with gap enforcement
    cv = WalkForwardCV(n_splits=5, extra_gap=horizon, test_size=50)
    train_idx, test_idx = list(cv.split(X))[-1]
    
    # Train and evaluate
    model_clone = clone(model)
    model_clone.fit(X[train_idx], y[train_idx])
    preds = model_clone.predict(X[test_idx])
    
    model_mae = mean_absolute_error(y[test_idx], preds)
    persistence_mae = mean_absolute_error(y[test_idx], X[test_idx, 0])
    
    # Run gates
    gates = [
        gate_signal_verification(model, X, y, n_shuffles=20, method="permutation"),
        gate_temporal_boundary(train_idx[-1], test_idx[0], horizon),
        gate_suspicious_improvement(model_mae, persistence_mae),
    ]
    
    return run_gates(gates)

# Example usage
report = validate_time_series_model(Ridge(alpha=1.0), X, y, horizon=1)
print(f"Overall Status: {report.status}")
print(f"")
if report.status == "HALT":
    print("STOP: Investigation required before proceeding.")
    for failure in report.failures:
        print(f"  - {failure.name}: {failure.recommendation}")
elif report.status == "WARN":
    print("CAUTION: Proceed carefully and verify results.")
else:
    print("PASS: Safe to proceed with deployment.")

---

## Next Steps

You've learned the three critical pitfalls in time series ML and how to guard against them:

1. **The Shuffle Catastrophe** - Don't use KFold on time series
2. **The Gap Trap** - Enforce gap >= horizon for h-step forecasts
3. **The Persistence Mirage** - Use MASE, not just MAE
4. **The Validation Solution** - Run temporalcv gates before trusting results

### Continue Learning

- **02_gap_enforcement.ipynb**: Deep dive into multi-step forecasting
- **03_persistence_baseline.ipynb**: Advanced metrics for sticky data (MC-SS, move-conditional)
- **04_autocorrelation_matters.ipynb**: HAC variance and statistical testing
- **05_shuffled_target_gate.ipynb**: When and how to use the shuffled target test

### Resources

- [temporalcv Documentation](https://github.com/...)
- [SPECIFICATION.md](../SPECIFICATION.md): All thresholds and parameters
- [Examples Directory](../examples/): Runnable Python scripts

---

*"Most time series models I've seen have subtle leakage bugs. The shuffled target test is the definitive detector."*