# Gap Enforcement for Multi-Step Forecasting

## Why `gap >= horizon` is Non-Negotiable

---

## 🚨 If You Know sklearn But Not h-Step Forecasting, Read This First

**What you already know (from standard ML)**:
- Train/test split separates data
- You predict immediately after training
- No gap needed between train and test

**What's different with h-step forecasting**:

In standard ML, you predict "now" from "before now". Simple.

In h-step forecasting (e.g., predicting 12 weeks ahead), your **target** is:
```
target[t] = value[t + 12] - value[t]
```

This target **reaches into the future** by 12 periods! Suppose training ends at t=100 and the test starts at t=101:
- The training target for t=100 uses the value at t=112
- But t=112 is in the test period!
- **The model gets partial answers before the test even starts**

| Scenario | Train Ends | Test Starts | Last Training Target Uses | Problem? |
|----------|------------|-------------|---------------------------|----------|
| gap=0 | t=100 | t=101 | t=100+h=112 | Yes: t=112 is inside the test period |
| gap=h | t=100 | t=113 | t=100+h=112 | No: t=112 falls in the gap |

**The rule**: For h-step forecasts, `gap >= h` is mandatory.
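
The rule can be sanity-checked with a few lines of index arithmetic. The following is a minimal sketch in plain Python; the helper name `training_target_leaks` is illustrative and not part of temporalcv, and it assumes the change target defined above, so the target at time t reads `value[t + h]`.

```python
def training_target_leaks(train_end: int, test_start: int, h: int) -> bool:
    """Return True if the last training target reads a value from the test period.

    Assumes target[t] = value[t + h] - value[t], so the training target at
    t = train_end reads value[train_end + h]. (Illustrative helper only.)
    """
    last_index_read = train_end + h
    return last_index_read >= test_start


h = 12
for label, test_start in [("gap=0", 101), ("gap=h", 113)]:
    leaks = training_target_leaks(train_end=100, test_start=test_start, h=h)
    print(f"{label}: test starts at t={test_start}, leaks={leaks}")
```

With `train_end=100` and `h=12`, the last training target reads index 112, so any test period starting at or before t=112 overlaps it.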

---

**What you'll learn:**
1. Why h-step ahead forecasting requires a gap between train and test sets
2. How to detect and fix temporal boundary violations with `gate_temporal_boundary`
3. How to configure proper gaps in `WalkForwardCV` for any forecast horizon

**Prerequisites:** Notebook 01 (Why Temporal CV Matters)

---

In [None]:
# Setup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

from temporalcv.cv import WalkForwardCV
from temporalcv.gates import gate_temporal_boundary, GateStatus

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Setup complete.")

In [None]:
# Generate AR(1) process (same as notebook 01)
def generate_ar1(n=500, phi=0.9, sigma=1.0, seed=42):
    """Generate AR(1) process mimicking persistent time series."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    y[0] = rng.normal(0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, n):
        y[t] = phi * y[t-1] + sigma * rng.normal()
    return y

def create_lag_features(series, n_lags=5):
    """Create lag features for prediction."""
    n = len(series)
    X = np.column_stack([
        np.concatenate([[np.nan]*lag, series[:-lag]]) 
        for lag in range(1, n_lags + 1)
    ])
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], series[valid]

# Generate base series
series = generate_ar1(n=600, phi=0.95, seed=42)
print(f"Generated series with {len(series)} observations")
print(f"ACF(1) = {np.corrcoef(series[1:], series[:-1])[0,1]:.3f}")

---

## Section 1: The h-Step Target Problem

**The Problem:** When forecasting h steps ahead, your target is:

$$y_t^{(h)} = \text{series}[t+h] - \text{series}[t]$$

This target **uses data from time t+h**. If training ends at t and the test set starts at t+1, the training target at t uses data from t+h, which is in the test period!
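
To make the target definition concrete, here is a small worked example on a six-point toy series. It is only a sketch: the function name `h_step_change_targets` and the toy values are made up for illustration.

```python
def h_step_change_targets(series, h):
    """Build targets y[t] = series[t + h] - series[t] for every t where t + h exists."""
    return [series[t + h] - series[t] for t in range(len(series) - h)]


toy_series = [10, 11, 13, 16, 20, 25]
h = 2
targets = h_step_change_targets(toy_series, h)
print(targets)  # [3, 5, 7, 9]

# The last target sits at t = 3 but reads toy_series[5], two steps later.
last_t = len(targets) - 1
print(f"target at t={last_t} reads index {last_t + h}")
```

Each target is indexed by t but depends on an observation h steps later, which is why the train/test boundary has to account for h.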

### Example: 12-Week Yield Curve Forecast

Imagine forecasting Treasury rate changes 12 weeks ahead:
- Train ends at week 100
- Test starts at week 101
- Target for week 100 (the last training row) = rate[112] - rate[100]
- But rate[112] is in the test period!

In [None]:
# Create h-step change targets
horizon = 12  # 12-step ahead forecast (e.g., 12 weeks for quarterly Treasury forecasts)

# Target: y[t] = series[t+h] - series[t] (h-step change)
y_hstep = series[horizon:] - series[:-horizon]

# Features: lag values of the LEVEL, aligned to target
X_hstep, _ = create_lag_features(series[:-horizon], n_lags=5)
y_hstep = y_hstep[5:]  # Align with valid features

print(f"Horizon: {horizon} steps")
print(f"Features shape: {X_hstep.shape}")
print(f"Target shape: {y_hstep.shape}")
print(f"")
print(f"Target interpretation: {horizon}-step change in the series")
print(f"Target std: {np.std(y_hstep):.4f} (larger than 1-step changes due to cumulative movement)")

In [None]:
# Demonstrate the overlap problem visually
fig, ax = plt.subplots(figsize=(14, 5))

# Timeline
train_end = 100
test_start_no_gap = 101

# Show the problem
ax.axvspan(0, train_end, alpha=0.3, color='blue', label='Training period')
ax.axvspan(test_start_no_gap, test_start_no_gap + 30, alpha=0.3, color='red', label='Test period')

# Mark the target computation
target_obs = test_start_no_gap  # First test observation
target_uses = target_obs + horizon  # Data used by target

ax.axvline(x=target_obs, color='red', linestyle='-', linewidth=2)
ax.axvline(x=target_uses, color='darkred', linestyle='--', linewidth=2)

# Annotation
ax.annotate(f'Test obs t={target_obs}',
            xy=(target_obs, 0.7), xytext=(target_obs - 20, 0.85),
            arrowprops=dict(arrowstyle='->', color='red', lw=2),
            fontsize=11, color='red', fontweight='bold',
            transform=ax.get_xaxis_transform())

ax.annotate(f'Target uses data at t={target_uses}!\n(in test period)',
            xy=(target_uses, 0.5), xytext=(target_uses + 5, 0.65),
            arrowprops=dict(arrowstyle='->', color='darkred', lw=2),
            fontsize=11, color='darkred', fontweight='bold',
            transform=ax.get_xaxis_transform())

# Draw the "leakage arrow"
ax.annotate('', xy=(target_uses, 0.4), xytext=(target_obs, 0.4),
            arrowprops=dict(arrowstyle='<->', color='orange', lw=3),
            transform=ax.get_xaxis_transform())
ax.text((target_obs + target_uses) / 2, 0.35, f'h={horizon} steps', 
        ha='center', fontsize=12, color='orange', fontweight='bold',
        transform=ax.get_xaxis_transform())

ax.set_xlim(80, 150)
ax.set_xlabel('Time Index', fontsize=12)
ax.set_title(f'The Gap Problem: h={horizon} Forecast Without Gap\nTarget computation overlaps with test period!',
             fontsize=13, fontweight='bold', color='red')
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

---

## Section 2: The Performance Difference

**Without gap:** Model sees part of the answer → inflated performance  
**With gap:** Clean separation → realistic performance

In [None]:
# Compare gap=0 vs gap=horizon
model = Ridge(alpha=1.0)

# CV WITHOUT gap (WRONG)
cv_no_gap = WalkForwardCV(
    n_splits=5,
    window_type='expanding',
    gap=0,  # NO GAP - WRONG!
    test_size=horizon
)

# CV WITH gap (CORRECT)
cv_with_gap = WalkForwardCV(
    n_splits=5,
    window_type='expanding',
    gap=horizon,  # gap >= horizon - CORRECT!
    test_size=horizon
)

# Evaluate both
scores_no_gap = []
for train_idx, test_idx in cv_no_gap.split(X_hstep):
    model.fit(X_hstep[train_idx], y_hstep[train_idx])
    preds = model.predict(X_hstep[test_idx])
    scores_no_gap.append(mean_absolute_error(y_hstep[test_idx], preds))

scores_with_gap = []
for train_idx, test_idx in cv_with_gap.split(X_hstep):
    model.fit(X_hstep[train_idx], y_hstep[train_idx])
    preds = model.predict(X_hstep[test_idx])
    scores_with_gap.append(mean_absolute_error(y_hstep[test_idx], preds))

mae_no_gap = np.mean(scores_no_gap)
mae_with_gap = np.mean(scores_with_gap)
fake_improvement = (mae_with_gap - mae_no_gap) / mae_with_gap * 100

print("GAP ENFORCEMENT COMPARISON")
print("=" * 55)
print(f"Horizon: {horizon} steps")
print(f"")
print(f"MAE without gap (gap=0):  {mae_no_gap:.4f}  <-- INFLATED")
print(f"MAE with gap (gap={horizon}):    {mae_with_gap:.4f}  <-- REALISTIC")
print(f"")
print(f"Fake improvement from missing gap: {fake_improvement:.1f}%")
print(f"")
print(f"The {abs(fake_improvement):.1f}% 'better' performance without gap is leakage.")

In [None]:
# Show per-split details
print("Per-Split Analysis")
print("=" * 60)
print(f"{'Split':<8} {'Train End':<12} {'Test Start':<12} {'Actual Gap':<12} {'Status'}")
print("-" * 60)

print("\nWithout gap (gap=0):")
for i, info in enumerate(cv_no_gap.get_split_info(X_hstep)):
    result = gate_temporal_boundary(
        train_end_idx=info.train_end,
        test_start_idx=info.test_start,
        horizon=horizon
    )
    status = "LEAKAGE!" if result.status == GateStatus.HALT else "OK"
    print(f"{i:<8} {info.train_end:<12} {info.test_start:<12} {info.gap:<12} {status}")

print(f"\nWith gap (gap={horizon}):")
for i, info in enumerate(cv_with_gap.get_split_info(X_hstep)):
    result = gate_temporal_boundary(
        train_end_idx=info.train_end,
        test_start_idx=info.test_start,
        horizon=horizon
    )
    status = "LEAKAGE!" if result.status == GateStatus.HALT else "OK"
    print(f"{i:<8} {info.train_end:<12} {info.test_start:<12} {info.gap:<12} {status}")

---

## Section 3: The Theory [T1]

### Why gap >= horizon?

For an h-step change target:
$$y_t^{(h)} = \text{series}[t+h] - \text{series}[t]$$

The **last valid training target** is at index `train_end - h`, because:
- Target at `train_end` would use `series[train_end + h]`
- If `test_start = train_end + 1`, then `series[train_end + h]` is in the test period when `h >= 1`

**The rule [T1]:** To prevent target leakage:
$$\text{test\_start} \geq \text{train\_end} + h + 1$$

Which means:
$$\text{gap} \geq h \quad \text{(where gap = test\_start - train\_end - 1)}$$

### Reference
Per Bergmeir & Benitez (2012): "h-block" cross-validation requires excluding h observations between train and test to prevent the overlap of training targets with test features.
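
The inequality can also be verified by brute force. The sketch below is plain Python with made-up helper names (`gap_between`, `has_overlap`); it is not temporalcv code. It enumerates the series indices that training targets read and checks that they intersect the test period exactly when `gap < h`.

```python
def gap_between(train_end: int, test_start: int) -> int:
    """Number of skipped observations between the last train index and first test index."""
    return test_start - train_end - 1


def has_overlap(train_end: int, test_start: int, h: int, test_size: int) -> bool:
    """True if any index read by a training row (t or t + h) lies in the test period."""
    read_by_training = set()
    for t in range(train_end + 1):
        read_by_training.add(t)      # series[t], the base of the change
        read_by_training.add(t + h)  # series[t + h], the future value
    test_period = set(range(test_start, test_start + test_size))
    return bool(read_by_training & test_period)


train_end = 20
for h in range(1, 6):
    for gap in range(0, 10):
        test_start = train_end + gap + 1
        assert gap_between(train_end, test_start) == gap
        # Overlap occurs exactly when the gap is shorter than the horizon.
        assert has_overlap(train_end, test_start, h, test_size=5) == (gap < h)

print("gap >= h removes the overlap for every case checked")
```

The check passes because training rows read indices up to `train_end + h`, while the test period begins at `train_end + gap + 1`.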

In [None]:
# Visualize the correct gap
fig, axes = plt.subplots(2, 1, figsize=(14, 7), sharex=True)

train_end = 100

# Top: Without gap (WRONG)
ax = axes[0]
test_start_bad = train_end + 1
target_reach = test_start_bad + horizon

ax.axvspan(0, train_end, alpha=0.3, color='blue', label='Training')
ax.axvspan(test_start_bad, test_start_bad + 20, alpha=0.3, color='red', label='Test')

# Show target reaching into test
ax.plot([test_start_bad, target_reach], [0.5, 0.5], 'r-', linewidth=3, 
        transform=ax.get_xaxis_transform())
ax.scatter([target_reach], [0.5], color='darkred', s=100, marker='X', 
           zorder=5, transform=ax.get_xaxis_transform())
ax.annotate(f'Target at t={test_start_bad}\nreaches to t={target_reach}',
            xy=(target_reach, 0.5), xytext=(target_reach + 5, 0.65),
            arrowprops=dict(arrowstyle='->', color='darkred'),
            transform=ax.get_xaxis_transform(), fontsize=10, color='darkred')

ax.set_title(f'Gap=0: Target computation leaks into test period',
             fontsize=12, fontweight='bold', color='red')
ax.legend(loc='upper left')
ax.set_xlim(80, 140)

# Bottom: With gap (CORRECT)
ax = axes[1]
test_start_good = train_end + horizon + 1
target_reach_good = test_start_good + horizon

ax.axvspan(0, train_end, alpha=0.3, color='blue', label='Training')
ax.axvspan(train_end + 1, test_start_good, alpha=0.3, color='gray', label=f'Gap ({horizon} periods)')
ax.axvspan(test_start_good, test_start_good + 20, alpha=0.3, color='red', label='Test')

# Show target safely within test
ax.plot([test_start_good, target_reach_good], [0.5, 0.5], 'g-', linewidth=3,
        transform=ax.get_xaxis_transform())
ax.scatter([target_reach_good], [0.5], color='green', s=100, marker='o',
           zorder=5, transform=ax.get_xaxis_transform())
ax.annotate(f'Target at t={test_start_good}\nreaches to t={target_reach_good}\n(safely in test)',
            xy=(target_reach_good, 0.5), xytext=(target_reach_good + 5, 0.65),
            arrowprops=dict(arrowstyle='->', color='green'),
            transform=ax.get_xaxis_transform(), fontsize=10, color='green')

ax.set_title(f'Gap={horizon}: Clean separation, no leakage',
             fontsize=12, fontweight='bold', color='green')
ax.legend(loc='upper left')
ax.set_xlabel('Time Index', fontsize=12)

plt.tight_layout()
plt.show()

---

## Section 4: Using `gate_temporal_boundary`

temporalcv provides `gate_temporal_boundary` to automatically verify your splits.

**Rule:** Run this gate on every split before trusting your results.

In [None]:
# Demonstrate gate_temporal_boundary
print("gate_temporal_boundary Demo")
print("=" * 55)

# Test case 1: Insufficient gap
result_bad = gate_temporal_boundary(
    train_end_idx=100,
    test_start_idx=101,  # gap = 0
    horizon=12
)

print(f"\nCase 1: train_end=100, test_start=101, horizon=12")
print("  Actual gap: 101 - 100 - 1 = 0")
print(f"  Required gap: {12}")
print(f"  Status: {result_bad.status.value}")
print(f"  Message: {result_bad.message}")
if result_bad.recommendation:
    print(f"  Recommendation: {result_bad.recommendation}")

# Test case 2: Correct gap
result_good = gate_temporal_boundary(
    train_end_idx=100,
    test_start_idx=113,  # gap = 12
    horizon=12
)

print(f"\nCase 2: train_end=100, test_start=113, horizon=12")
print("  Actual gap: 113 - 100 - 1 = 12")
print(f"  Required gap: {12}")
print(f"  Status: {result_good.status.value}")
print(f"  Message: {result_good.message}")

In [None]:
# Validate all splits in a CV scheme
def validate_cv_gaps(cv, X, horizon):
    """Check all splits for temporal boundary violations."""
    all_valid = True
    results = []
    
    for info in cv.get_split_info(X):
        result = gate_temporal_boundary(
            train_end_idx=info.train_end,
            test_start_idx=info.test_start,
            horizon=horizon
        )
        results.append((info.split_idx, result))
        if result.status == GateStatus.HALT:
            all_valid = False
    
    return all_valid, results

# Test on our two CV schemes
print("Validating CV schemes for h=12 forecasting")
print("=" * 55)

valid_no_gap, results_no_gap = validate_cv_gaps(cv_no_gap, X_hstep, horizon)
print(f"\nCV with gap=0: {'VALID' if valid_no_gap else 'INVALID (has leakage)'}")

valid_with_gap, results_with_gap = validate_cv_gaps(cv_with_gap, X_hstep, horizon)
print(f"CV with gap={horizon}: {'VALID' if valid_with_gap else 'INVALID (has leakage)'}")

---

## Section 5: The `horizon` Parameter in WalkForwardCV

WalkForwardCV has a `horizon` parameter that **automatically validates** gap >= horizon.

If you try to create a CV with insufficient gap, it raises an error immediately.

In [None]:
# WalkForwardCV with horizon parameter
print("WalkForwardCV horizon parameter")
print("=" * 55)

# This works: gap >= horizon
try:
    cv_safe = WalkForwardCV(
        n_splits=5,
        horizon=12,  # Explicit horizon declaration
        gap=12,      # gap >= horizon ✓
        test_size=12
    )
    print("✓ Created CV with horizon=12, gap=12 (safe)")
except ValueError as e:
    print(f"✗ Error: {e}")

# This fails: gap < horizon
try:
    cv_unsafe = WalkForwardCV(
        n_splits=5,
        horizon=12,  # Explicit horizon declaration
        gap=0,       # gap < horizon ✗
        test_size=12
    )
    print("✓ Created CV with horizon=12, gap=0 (unsafe)")
except ValueError as e:
    print(f"\n✗ Error caught: {e}")
    print("\n  The horizon parameter prevents you from making this mistake!")

---

## Pitfall Section

### Pitfall 1: Forgetting to Set Gap for Multi-Step Forecasts

```python
# WRONG: No gap for 12-step forecast
cv = WalkForwardCV(n_splits=5, test_size=12)  # gap defaults to 0!

# RIGHT: Gap matches horizon
cv = WalkForwardCV(n_splits=5, gap=12, test_size=12)

# EVEN BETTER: Use horizon parameter for automatic validation
cv = WalkForwardCV(n_splits=5, horizon=12, gap=12, test_size=12)
```

### Pitfall 2: Gap Too Small for Change Targets

```python
# Target: 12-week change
y = series[12:] - series[:-12]

# WRONG: Gap of 5 for 12-step target
cv = WalkForwardCV(n_splits=5, gap=5, test_size=12)  # Leakage!

# RIGHT: Gap at least equals the target horizon
cv = WalkForwardCV(n_splits=5, gap=12, test_size=12)
```

### Pitfall 3: Not Validating After Custom Splits

```python
# If you create custom splits, ALWAYS validate
for train_idx, test_idx in custom_splits:
    result = gate_temporal_boundary(
        train_end_idx=train_idx[-1],
        test_start_idx=test_idx[0],
        horizon=your_horizon
    )
    if result.status == GateStatus.HALT:
        raise ValueError(f"Temporal leakage: {result.message}")
```
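
For readers who do build splits by hand, the sketch below shows one way to produce expanding-window splits that respect the gap. It is plain Python under stated assumptions: `expanding_splits_with_gap` is a hypothetical helper, not temporalcv's splitter, and the final check is the Section 3 arithmetic rather than a call to `gate_temporal_boundary`.

```python
def expanding_splits_with_gap(n_samples, n_splits, test_size, gap):
    """Return (train_idx, test_idx) pairs for expanding-window splits.

    Test blocks are placed back to back at the end of the data. Each training
    set runs from index 0 up to test_start - gap - 1 (inclusive), so exactly
    `gap` observations are skipped before every test block.
    """
    splits = []
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_end = test_start - gap - 1  # last training index, inclusive
        if train_end < 0:
            raise ValueError("Not enough data for this many splits with this gap")
        train_idx = list(range(0, train_end + 1))
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits


h = 12
splits = expanding_splits_with_gap(n_samples=200, n_splits=5, test_size=12, gap=h)
for train_idx, test_idx in splits:
    actual_gap = test_idx[0] - train_idx[-1] - 1
    assert actual_gap >= h, "gap shorter than horizon"
    print(f"train_end={train_idx[-1]}, test_start={test_idx[0]}, gap={actual_gap}")
```

Computing `train_end` from `test_start` (rather than the other way round) keeps the gap explicit in one place, which makes it harder to drop by accident.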

In [None]:
# Demonstration of the pitfalls
print("Pitfall Demonstrations")
print("=" * 55)

# Pitfall 1: Default gap=0
cv_default = WalkForwardCV(n_splits=5, test_size=12)  # gap=0 by default!
print("\nPitfall 1: Default gap")
print("  WalkForwardCV(n_splits=5, test_size=12)")
print(f"  Actual gap: {cv_default.gap} <-- Dangerous for h-step forecasting!")

# The fix: Always specify gap for h-step
cv_fixed = WalkForwardCV(n_splits=5, gap=12, test_size=12)
print("\n  Fixed: WalkForwardCV(n_splits=5, gap=12, test_size=12)")
print(f"  Actual gap: {cv_fixed.gap} ✓")

---

## Complete Example: Treasury Rate Forecasting

Let's put it all together with a realistic scenario: forecasting 12-week Treasury rate changes.

In [None]:
# Complete workflow for h-step forecasting
def forecast_with_proper_gap(series, horizon, n_lags=5):
    """
    Complete h-step forecasting workflow with proper gap enforcement.
    
    Parameters
    ----------
    series : array-like
        Time series to forecast
    horizon : int
        Forecast horizon (h)
    n_lags : int
        Number of lag features
        
    Returns
    -------
    dict
        Results including MAE and validation status
    """
    # Step 1: Create h-step change targets
    y_target = series[horizon:] - series[:-horizon]
    
    # Step 2: Create features (aligned)
    X_features, _ = create_lag_features(series[:-horizon], n_lags=n_lags)
    y_target = y_target[n_lags:]  # Align
    
    # Step 3: Create CV with PROPER gap
    cv = WalkForwardCV(
        n_splits=5,
        horizon=horizon,  # Automatic validation!
        gap=horizon,
        window_type='expanding',
        test_size=horizon
    )
    
    # Step 4: Validate all splits
    for info in cv.get_split_info(X_features):
        result = gate_temporal_boundary(
            train_end_idx=info.train_end,
            test_start_idx=info.test_start,
            horizon=horizon
        )
        if result.status == GateStatus.HALT:
            raise ValueError(f"Temporal leakage in split {info.split_idx}: {result.message}")
    
    # Step 5: Train and evaluate
    model = Ridge(alpha=1.0)
    scores = []
    
    for train_idx, test_idx in cv.split(X_features):
        model.fit(X_features[train_idx], y_target[train_idx])
        preds = model.predict(X_features[test_idx])
        scores.append(mean_absolute_error(y_target[test_idx], preds))
    
    return {
        'horizon': horizon,
        'mae': np.mean(scores),
        'mae_std': np.std(scores),
        'n_splits': len(scores),
        'gap_enforced': True
    }

# Run the complete workflow
print("Complete h-Step Forecasting Workflow")
print("=" * 55)

# Generate "Treasury rate-like" series (high persistence)
treasury_like = generate_ar1(n=600, phi=0.98, sigma=0.1, seed=42)

# Test multiple horizons
for h in [4, 12, 26]:
    result = forecast_with_proper_gap(treasury_like, horizon=h)
    print(f"\nHorizon h={h} (e.g., {h} weeks):")
    print(f"  MAE: {result['mae']:.4f} ± {result['mae_std']:.4f}")
    print(f"  Gap enforced: {result['gap_enforced']} ✓")

---

## Key Insights

### 1. h-Step Targets Reach Into the Future [T1]
When predicting changes h steps ahead, the target uses data at t+h. Without gap, this overlaps with test.

### 2. gap >= horizon is the Rule [T1]
Per Bergmeir & Benitez (2012), "h-block" CV requires gap >= h to prevent target leakage.

### 3. Use the `horizon` Parameter
`WalkForwardCV(horizon=h, gap=h)` automatically validates that gap >= horizon.

### 4. Always Validate with `gate_temporal_boundary`
Even with proper settings, run the gate to confirm no leakage — especially with custom splits.

### 5. Performance Difference is Real
Missing the gap can inflate performance by 10-30%, leading to unrealistic expectations in production.

---

## Next Steps

- **03_persistence_baseline.ipynb**: Why persistence is so hard to beat (MASE, scale-invariant metrics)
- **04_autocorrelation_matters.ipynb**: HAC variance for multi-step forecast comparison
- **05_shuffled_target_gate.ipynb**: Definitive leakage detection with permutation tests

---

*"For h-step forecasting, gap >= horizon isn't optional — it's a mathematical requirement to prevent target leakage."*