# TemporalCV: Validation Gates for Time-Series ML

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/brandonmbehring-dev/temporalcv/blob/main/notebooks/demo.ipynb)

This notebook demonstrates temporalcv's key features:

1. **Leakage Detection**: Shuffled target test catches data leakage
2. **Walk-Forward CV**: Gap enforcement for h-step forecasting
3. **Statistical Tests**: DM test for model comparison
4. **Conformal Prediction**: Distribution-free prediction intervals

---

**Key Insight**: Roughly 80% of the time-series models I've seen have subtle leakage bugs. The shuffled target test is the definitive detector: if your model beats a permuted-target baseline by an implausibly large margin, its features are probably carrying information that would not be available at forecast time.

In [None]:
# Install temporalcv (uncomment for Colab)
# !pip install temporalcv scikit-learn matplotlib -q

In [None]:
from __future__ import annotations

import warnings
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

from temporalcv.cv import WalkForwardCV
from temporalcv.gates import (
    gate_shuffled_target,
    gate_temporal_boundary,
    run_gates,
)
from temporalcv.conformal import (
    AdaptiveConformalPredictor,
    walk_forward_conformal,
)

warnings.filterwarnings("ignore")
plt.style.use('seaborn-v0_8-whitegrid')

print("✓ All imports successful")

## Data Generation

We generate AR(1) data mimicking Treasury rates: high persistence (φ = 0.9 in the code that follows), where the persistence baseline is very hard to beat.
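
For reference, the process and the stationary moments used to draw the first observation can be written as follows. These are the standard AR(1) results, assuming |φ| < 1 and unit-variance Gaussian innovations:

$$
y_t = \phi\, y_{t-1} + \sigma\,\varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, 1)
$$

$$
\operatorname{Var}(y_t) = \frac{\sigma^2}{1 - \phi^2}, \qquad \operatorname{Corr}(y_t, y_{t-k}) = \phi^{k}
$$

With φ = 0.9 and σ = 1, the stationary standard deviation is 1/√0.19 ≈ 2.29 and the theoretical lag-1 autocorrelation is 0.9, so the sample ACF(1) should land near that value.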

In [None]:
def generate_ar1_data(n=500, phi=0.9, sigma=1.0, seed=42):
    """Generate AR(1) process."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    y[0] = rng.normal(0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, n):
        y[t] = phi * y[t - 1] + sigma * rng.normal()
    return y

def create_features(series, n_lags=5):
    """Create lagged features."""
    n = len(series)
    features = []
    for lag in range(1, n_lags + 1):
        lagged = np.full(n, np.nan)
        lagged[lag:] = series[:-lag]
        features.append(lagged)
    X = np.column_stack(features)
    y = series.copy()
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], y[valid]

# Generate data
series = generate_ar1_data(n=500, phi=0.9)
X, y = create_features(series, n_lags=5)

# Calculate ACF(1)
acf1 = np.corrcoef(series[1:], series[:-1])[0, 1]

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(series, linewidth=0.8)
axes[0].set_title(f'AR(1) Process (ACF(1) = {acf1:.3f})')
axes[0].set_xlabel('Time')
axes[0].set_ylabel('Value')

# ACF plot
lags = range(1, 21)
acf_values = [np.corrcoef(series[lag:], series[:-lag])[0, 1] for lag in lags]
axes[1].bar(lags, acf_values, color='steelblue', alpha=0.7)
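# Approximate 95% band for the ACF of white noise: ±1.96 / sqrt(n)
acf_bound = 1.96 / np.sqrt(len(series))
axes[1].axhline(y=acf_bound, color='r', linestyle='--', label='Approx. 95% bound')
axes[1].axhline(y=-acf_bound, color='r', linestyle='--')
axes[1].set_title('Autocorrelation Function')
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('ACF')
axes[1].legend()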

plt.tight_layout()
plt.show()

print(f"Data shape: X={X.shape}, y={y.shape}")
print(f"ACF(1) = {acf1:.3f} — HIGH persistence")

## Part 1: Leakage Detection with Shuffled Target Test

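The **shuffled target test** is the definitive leakage detector:
- Train model on real targets → get MAE_real
- Train model on shuffled (permuted) targets → get MAE_shuffled
- If MAE_real is far below MAE_shuffled (improvement above `threshold`), the features likely encode target information they should not have

**Why this works**: Shuffling destroys any genuine relationship between features and targets, so MAE_shuffled approximates what a model with no usable signal achieves. Legitimate predictors beat that baseline to some degree; an improvement that looks too good to be true is the sign that features contain lookahead information.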
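
The idea can be sketched with plain NumPy. This is a minimal illustration, not how `gate_shuffled_target` is implemented: the helper name `shuffled_target_improvement`, the in-sample least-squares fit, and the improvement formula `1 - MAE_real / MAE_shuffled` are all assumptions made for the example.

```python
import numpy as np

def shuffled_target_improvement(X, y, n_shuffles=10, seed=0):
    """Relative MAE improvement of a real-target fit over shuffled-target fits.

    Hypothetical helper: an in-sample least-squares fit stands in for the model.
    """
    rng = np.random.default_rng(seed)
    A = np.column_stack([np.ones(len(y)), X])  # add an intercept column

    def fit_mae(target):
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        return np.mean(np.abs(target - A @ coef))

    mae_real = fit_mae(y)
    mae_shuffled = np.mean([fit_mae(rng.permutation(y)) for _ in range(n_shuffles)])
    return 1.0 - mae_real / mae_shuffled

# Toy AR(1) series with phi = 0.9
rng = np.random.default_rng(0)
n = 300
noise = rng.normal(size=n)
y_demo = np.zeros(n)
for t in range(1, n):
    y_demo[t] = 0.9 * y_demo[t - 1] + noise[t]

target = y_demo[1:]
clean_X = y_demo[:-1].reshape(-1, 1)  # lag-1 feature only
# Leaky feature: a near-copy of the target itself
leaky_X = np.column_stack([clean_X, target + 0.01 * rng.normal(size=n - 1)])

clean_improvement = shuffled_target_improvement(clean_X, target)
leaky_improvement = shuffled_target_improvement(leaky_X, target)
print(f"clean: {clean_improvement:.1%}, leaky: {leaky_improvement:.1%}")
```

The clean lag feature should show a moderate improvement, since lags genuinely predict an AR(1) series, while the near-copy of the target pushes the improvement close to 100%.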

In [None]:
# Test with CLEAN features (only lag values)
model = Ridge(alpha=1.0)

result = gate_shuffled_target(
    model=model,
    X=X,
    y=y,
    n_shuffles=10,
    threshold=0.95,  # HALT if improvement > 95%
    random_state=42,
)

print("CLEAN FEATURES (lag values only)")
print("=" * 50)
print(f"Status: {result.status.value}")
print(f"MAE (real): {result.details['mae_real']:.4f}")
print(f"MAE (shuffled): {result.details['mae_shuffled_avg']:.4f}")
print(f"Improvement: {result.metric_value:.1%}")
print()
print("Interpretation: Lag features genuinely predict AR(1), so")
print("some improvement over shuffled is expected.")

In [None]:
# Create LEAKY features (include future info)
def create_leaky_features(series, n_lags=5):
    """Features WITH leakage — intentionally buggy."""
    n = len(series)
    features = []
    
    # Normal lags
    for lag in range(1, n_lags + 1):
        lagged = np.full(n, np.nan)
        lagged[lag:] = series[:-lag]
        features.append(lagged)
    
    # BUG: Centered rolling mean (includes future!)
    smoothed = np.full(n, np.nan)
    window = 3
    for t in range(window, n - window):
        smoothed[t] = np.mean(series[t - window : t + window + 1])  # FUTURE!
    features.append(smoothed)
    
    X = np.column_stack(features)
    y = series.copy()
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], y[valid]

X_leaky, y_leaky = create_leaky_features(series)

result_leaky = gate_shuffled_target(
    model=Ridge(alpha=1.0),
    X=X_leaky,
    y=y_leaky,
    n_shuffles=10,
    threshold=0.95,
    random_state=42,
)

print("LEAKY FEATURES (includes future)")
print("=" * 50)
print(f"Status: {result_leaky.status.value}")
print(f"MAE (real): {result_leaky.details['mae_real']:.4f}")
print(f"MAE (shuffled): {result_leaky.details['mae_shuffled_avg']:.4f}")
print(f"Improvement: {result_leaky.metric_value:.1%}")
print()
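# Flag leakage only when the comparison actually supports it
if result_leaky.metric_value > result.metric_value:
    print("⚠️ LEAKAGE SIGNAL: leaky features improve more over the shuffled baseline")
else:
    print("No additional improvement from the leaky features")
print(f"  Clean improvement:  {result.metric_value:.1%}")
print(f"  Leaky improvement:  {result_leaky.metric_value:.1%}")
print(f"  Difference:         {result_leaky.metric_value - result.metric_value:.1%}")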

## Part 2: Walk-Forward CV with Gap Enforcement

For **h-step ahead forecasting**, the gap between train and test must be ≥ h.

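Without gap enforcement:
- Train ends at t=100
- Test starts at t=101
- But a genuine 12-step-ahead forecast of y[101] is issued at t=89, so nothing observed during t=90...100 is available to it
- **LEAKAGE**: A model trained through t=100 has already fit targets from that window, so the test score is optimistic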
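
To make the gap concrete, here is a small sketch of expanding-window splits that discard `gap` observations between train and test. The helper `walk_forward_splits` is hypothetical and its indexing convention is an assumption; `WalkForwardCV` may place its splits differently.

```python
def walk_forward_splits(n_samples, n_splits, test_size, gap):
    """Return (train_indices, test_indices) pairs for expanding-window splits.

    Hypothetical helper: test blocks are taken from the end of the data and
    `gap` observations between train and test are left unused.
    """
    splits = []
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_end = test_start - gap  # exclusive end of the training block
        if train_end <= 0:
            raise ValueError("not enough data for the requested splits and gap")
        train = list(range(train_end))
        test = list(range(test_start, test_start + test_size))
        splits.append((train, test))
    return splits

horizon = 12
splits = walk_forward_splits(n_samples=100, n_splits=3, test_size=horizon, gap=horizon)
for train, test in splits:
    skipped = test[0] - train[-1] - 1
    print(f"train: 0..{train[-1]}, test: {test[0]}..{test[-1]}, skipped: {skipped}")
```

Each split leaves exactly `horizon` observations unused, so the last training target is at least one full forecast horizon before the first test point.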

In [None]:
horizon = 12  # 12-step ahead forecast

# WITHOUT gap (WRONG for h-step)
cv_no_gap = WalkForwardCV(n_splits=3, gap=0, test_size=horizon)

# WITH gap (CORRECT)
cv_with_gap = WalkForwardCV(n_splits=3, gap=horizon, test_size=horizon)

print("WITHOUT Gap Enforcement (WRONG for h-step forecasting)")
print("=" * 60)
for info in cv_no_gap.get_split_info(X):
    result = gate_temporal_boundary(
        train_end_idx=info.train_end,
        test_start_idx=info.test_start,
        horizon=horizon,
        gap=0,
    )
    status = "✓ OK" if result.status.value == "PASS" else "✗ LEAKAGE!"
    print(f"  Split {info.split_idx}: Train ends at {info.train_end}, "
          f"Test starts at {info.test_start}, Gap={info.gap} → {status}")

print()
print("WITH Gap Enforcement (CORRECT)")
print("=" * 60)
for info in cv_with_gap.get_split_info(X):
    result = gate_temporal_boundary(
        train_end_idx=info.train_end,
        test_start_idx=info.test_start,
        horizon=horizon,
        gap=0,
    )
    status = "✓ OK" if result.status.value == "PASS" else "✗ LEAKAGE!"
    print(f"  Split {info.split_idx}: Train ends at {info.train_end}, "
          f"Test starts at {info.test_start}, Gap={info.gap} → {status}")

In [None]:
# Visualize the split structure
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

for ax, cv, title in [
    (axes[0], cv_no_gap, 'Without Gap (WRONG)'),
    (axes[1], cv_with_gap, 'With Gap (CORRECT)'),
]:
    for i, info in enumerate(cv.get_split_info(X)):
        # Training region
        ax.barh(i, info.train_size, left=0, color='steelblue', 
                alpha=0.7, label='Train' if i == 0 else '')
        # Gap region
        ax.barh(i, info.gap, left=info.train_end, color='white', 
                edgecolor='gray', hatch='///', 
                label='Gap' if i == 0 else '')
        # Test region
        ax.barh(i, info.test_size, left=info.test_start, color='coral',
                alpha=0.7, label='Test' if i == 0 else '')
    
    ax.set_xlabel('Time Index')
    ax.set_ylabel('Split')
    ax.set_title(title)
    ax.legend(loc='upper right')
    ax.set_yticks(range(3))

plt.tight_layout()
plt.show()

## Part 3: Conformal Prediction Intervals

Point predictions are insufficient for decision-making. Stakeholders need: "How confident are you?"

**Conformal prediction** provides:
- Distribution-free coverage guarantee (under exchangeability, which autocorrelated series satisfy only approximately)
- Finite-sample validity (not just asymptotic)
- No parametric assumptions
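
A minimal split-conformal sketch shows where the finite-sample guarantee comes from: the interval half-width is the ⌈(n + 1)(1 − α)⌉-th smallest absolute calibration residual. The function `split_conformal_interval` is a hypothetical helper for illustration and is not claimed to match what `walk_forward_conformal` does internally.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_actuals, new_preds, alpha=0.10):
    """Symmetric split-conformal intervals from absolute calibration residuals."""
    scores = np.sort(np.abs(np.asarray(cal_actuals) - np.asarray(cal_preds)))
    n = len(scores)
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); capped at n for simplicity
    # (strictly, a rank above n calls for an infinite interval)
    rank = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    q = scores[rank - 1]
    new_preds = np.asarray(new_preds)
    return new_preds - q, new_preds + q, q

# Synthetic i.i.d. example: forecasts are the truth plus Gaussian error
rng = np.random.default_rng(0)
truth = rng.normal(size=1000)
preds = truth + rng.normal(scale=0.5, size=1000)

lower, upper, q = split_conformal_interval(preds[:300], truth[:300], preds[300:], alpha=0.10)
coverage = np.mean((truth[300:] >= lower) & (truth[300:] <= upper))
print(f"quantile={q:.3f}, holdout coverage={coverage:.1%}")
```

On this i.i.d. toy data the holdout coverage should sit close to the 90% target; with autocorrelated residuals the same recipe can under- or over-cover.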

**Adaptive conformal** adjusts to distribution shift — critical for non-stationary time series.
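
One well-known adaptive scheme is adaptive conformal inference (Gibbs and Candès), which nudges the working miscoverage level after every observation: α_{t+1} = α_t + γ(α − err_t), where err_t is 1 on a miss and 0 otherwise. The sketch below illustrates that rule. Whether `AdaptiveConformalPredictor` uses exactly this update is an assumption, and the names `adaptive_conformal`, `gamma`, and `warmup` are hypothetical.

```python
import numpy as np

def adaptive_conformal(preds, actuals, alpha=0.10, gamma=0.05, warmup=50):
    """Return the empirical coverage of online intervals with an adaptive level."""
    preds, actuals = np.asarray(preds), np.asarray(actuals)
    alpha_t = alpha
    scores = list(np.abs(actuals[:warmup] - preds[:warmup]))  # initial residuals
    covered = []
    for t in range(warmup, len(preds)):
        level = min(max(1.0 - alpha_t, 0.0), 1.0)  # clip to a valid quantile level
        q = np.quantile(scores, level)
        score = abs(actuals[t] - preds[t])
        miss = score > q
        covered.append(not miss)
        # A miss lowers alpha_t (wider intervals); a hit raises it slightly
        alpha_t += gamma * (alpha - float(miss))
        scores.append(score)
    return np.mean(covered)

# Distribution shift: the noise scale triples halfway through
rng = np.random.default_rng(1)
n = 2000
scale = np.where(np.arange(n) < n // 2, 1.0, 3.0)
actuals_demo = rng.normal(scale=scale)
preds_demo = np.zeros(n)

aci_coverage = adaptive_conformal(preds_demo, actuals_demo)
print(f"long-run coverage under shift: {aci_coverage:.1%}")
```

Because every miss widens the next interval, long-run coverage stays near the 90% target even though the residual scale changes partway through the series.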

In [None]:
# Walk-forward predictions
cv = WalkForwardCV(n_splits=5, window_type='expanding', test_size=50)

all_preds = []
all_actuals = []
all_test_indices = []

for train_idx, test_idx in cv.split(X):
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    
    all_preds.extend(preds)
    all_actuals.extend(y[test_idx])
    all_test_indices.extend(test_idx)

predictions = np.array(all_preds)
actuals = np.array(all_actuals)
test_indices = np.array(all_test_indices)

print(f"Walk-forward predictions: {len(predictions)} points")
print(f"MAE: {mean_absolute_error(actuals, predictions):.4f}")

In [None]:
# Apply conformal prediction
intervals, quality = walk_forward_conformal(
    predictions=predictions,
    actuals=actuals,
    calibration_fraction=0.3,
    alpha=0.10,  # 90% intervals
)

print("Conformal Prediction Results (90% intervals)")
print("=" * 50)
print(f"Calibration size: {quality['calibration_size']}")
print(f"Holdout size: {quality['holdout_size']}")
print(f"Calibrated quantile: {quality['quantile']:.4f}")
print()
print(f"Coverage: {quality['coverage']:.1%} (target: 90%)")
print(f"Mean width: {quality['mean_width']:.4f}")
print(f"Interval score: {quality['interval_score']:.4f}")

In [None]:
# Visualize prediction intervals
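cal_size = quality['calibration_size']
# X and y drop the first rows lost to lagging, so shift indices back onto `series`
offset = len(series) - len(y)
holdout_indices = test_indices[cal_size:] + offset
holdout_actuals = actuals[cal_size:]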

fig, ax = plt.subplots(figsize=(14, 5))

# Original series
ax.plot(series, 'lightgray', linewidth=0.5, label='Full series')

# Prediction intervals
ax.fill_between(
    holdout_indices, 
    intervals.lower, 
    intervals.upper, 
    alpha=0.3, 
    color='steelblue',
    label=f'90% PI (coverage: {quality["coverage"]:.1%})'
)

# Point predictions
ax.plot(holdout_indices, intervals.point, 'b-', linewidth=1, label='Predictions')

# Actuals
ax.scatter(holdout_indices, holdout_actuals, c='coral', s=10, alpha=0.7, label='Actuals')

# Mark violations (actuals outside interval)
violations = (holdout_actuals < intervals.lower) | (holdout_actuals > intervals.upper)
ax.scatter(
    holdout_indices[violations], 
    holdout_actuals[violations], 
    c='red', s=50, marker='x', linewidths=2,
    label=f'Violations ({violations.sum()})'
)

ax.set_xlabel('Time Index')
ax.set_ylabel('Value')
ax.set_title('Conformal Prediction Intervals')
ax.legend(loc='upper left')

plt.tight_layout()
plt.show()

## Part 4: Running Multiple Validation Gates

TemporalCV gates follow **HALT > WARN > PASS** priority:
- **HALT**: Stop and investigate (critical failure)
- **WARN**: Proceed with caution (verify externally)
- **PASS**: Validation passed
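
The priority rule amounts to taking the most severe status across all gates. A minimal sketch, using a hypothetical `Status` enum and `overall_status` helper rather than temporalcv's own classes:

```python
from enum import IntEnum

class Status(IntEnum):
    """Hypothetical status enum ordered by severity."""
    PASS = 0
    WARN = 1
    HALT = 2

def overall_status(statuses):
    """Return the most severe status; an empty list counts as PASS."""
    return max(statuses, default=Status.PASS)

print(overall_status([Status.PASS, Status.WARN, Status.PASS]).name)  # WARN
print(overall_status([Status.WARN, Status.HALT, Status.PASS]).name)  # HALT
```

A single HALT therefore dominates any number of passing gates, which is the behavior you want before deployment.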

In [None]:
# Run multiple gates on clean features
model = Ridge(alpha=1.0)

gates = [
    gate_shuffled_target(
        model=model,
        X=X,
        y=y,
        n_shuffles=10,
        threshold=0.95,
        random_state=42,
    ),
    gate_temporal_boundary(
        train_end_idx=350,
        test_start_idx=362,  # Gap of 12
        horizon=12,
        gap=0,
    ),
]

report = run_gates(gates)
print(report.summary())

## Summary: Key Takeaways

### 1. Shuffled Target Test
- **If the model beats the shuffled baseline by more than the threshold → suspect LEAKAGE**
- Catches: rolling stats on full series, lookahead bias, off-by-one errors

### 2. Walk-Forward CV
- **Gap must be ≥ forecast horizon**
- Without gap → test features overlap training targets

### 3. Conformal Prediction
- **Coverage guarantee without distributional assumptions**
- Use adaptive conformal for non-stationary data

### 4. Gate Priority
- HALT > WARN > PASS
- Always investigate HALT before deployment

---

**Resources**:
- [temporalcv documentation](https://github.com/brandonmbehring-dev/temporalcv)
- [Examples directory](https://github.com/brandonmbehring-dev/temporalcv/tree/main/examples)