# Threshold and Regime Leakage

## Computing Percentiles and Regimes Without Lookahead

---

## 🚨 If You Know sklearn But Not Regime Classification, Read This First

**What you already know (from standard ML)**:
- Thresholds are just numbers you pick
- Percentiles help you understand data distribution
- Classification boundaries are features like any other

**What's different with time series regimes**:

In time series, we classify observations into **regimes**:
- UP/DOWN/FLAT based on move magnitude
- HIGH/MEDIUM/LOW volatility periods

**The trap**: The threshold that *defines* these regimes is itself information!

```python
# SEEMS INNOCENT:
threshold = np.percentile(np.abs(y), 70)  # 70th percentile of all changes

# ACTUALLY LEAKY:
# This threshold uses FUTURE changes to classify PAST observations!
# The threshold "knows" what volatility looks like in the test period.
```

**Real-world example (BUG-005)**:

A team classified volatility using `std(levels)` instead of `std(changes)`:
- A steady 3% drift series was labeled "HIGH volatility" (values spread 3.0 → 4.0)
- Actually, it was the most PREDICTABLE series (constant drift = easy to forecast)
- Model looked great in "HIGH volatility" regime (because it wasn't actually volatile!)

**The fix**: Use `basis='changes'` and compute thresholds from training only.

---

**What you'll learn:**
1. Why computing thresholds from the full series creates subtle leakage
2. How to compute move thresholds correctly from training data
3. The BUG-005 volatility basis error and why `basis='changes'` matters
4. How regime stratification exposes hidden model failures

**Prerequisites:** Notebooks 01, 05, 06

---

In [None]:
# Setup
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

from temporalcv.cv import WalkForwardCV
from temporalcv.gates import gate_shuffled_target, gate_suspicious_improvement, GateStatus

# Check if regimes module is available
try:
    from temporalcv.regimes import classify_volatility_regime, classify_direction_regime
    HAS_REGIMES = True
except ImportError:
    HAS_REGIMES = False
    print("Note: regimes module not available. Using local implementations.")

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Setup complete.")

In [None]:
# Generate AR(1) process with regime-like behavior
def generate_ar1(n=500, phi=0.9, sigma=1.0, seed=42):
    """Generate AR(1) process."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    y[0] = rng.normal(0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, n):
        y[t] = phi * y[t-1] + sigma * rng.normal()
    return y

def generate_regime_switching(n=600, seed=42):
    """Generate series with known regime changes (volatility shifts)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    
    # Three regimes: LOW (0-200), HIGH (200-400), LOW (400-600)
    sigma_low, sigma_high = 0.5, 2.0
    phi = 0.9
    
    y[0] = rng.normal(0, sigma_low)
    for t in range(1, n):
        if t < 200 or t >= 400:
            sigma = sigma_low
        else:
            sigma = sigma_high
        y[t] = phi * y[t-1] + sigma * rng.normal()
    
    return y

# Generate data
series = generate_ar1(n=600, phi=0.9, seed=42)
series_regime = generate_regime_switching(n=600, seed=42)

print(f"Generated standard series: {len(series)} observations")
print(f"Generated regime-switching series: {len(series_regime)} observations")

---

## Section 1: Move Threshold Leakage

### The Problem

The **move threshold** (typically 70th percentile of |changes|) defines what counts as a significant move (UP or DOWN vs FLAT).

```python
# WRONG: Threshold from full series
threshold = np.percentile(np.abs(y_all), 70)

# Bug: This threshold uses test data information!
```

### Why It Matters [T2]

The 70th percentile threshold (per SPECIFICATION.md) determines:
- Which observations are classified as UP/DOWN/FLAT
- How MC-SS (Move-Conditional Skill Score) is computed
- Which regime-stratified metrics are applied

If computed from full data, you're using future volatility information to classify past observations.
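The effect is largest when the test period behaves differently from the training period. The sketch below uses synthetic changes (a calm training period followed by a turbulent test period; the sigmas and sample sizes are assumptions chosen only for illustration) to show how a full-series threshold drifts toward the test distribution and relabels test observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical changes: calm training period, turbulent test period
train_changes = rng.normal(0, 0.5, 400)
test_changes = rng.normal(0, 2.0, 200)
all_changes = np.concatenate([train_changes, test_changes])

# Threshold from training only vs. threshold that has "seen" the test period
thr_train = np.percentile(np.abs(train_changes), 70)
thr_full = np.percentile(np.abs(all_changes), 70)

# Share of test observations labelled as significant moves under each threshold
share_train_thr = np.mean(np.abs(test_changes) > thr_train)
share_full_thr = np.mean(np.abs(test_changes) > thr_full)

print(f"Training-only threshold: {thr_train:.3f}")
print(f"Full-series threshold:   {thr_full:.3f}")
print(f"Test moves flagged (training-only): {share_train_thr:.1%}")
print(f"Test moves flagged (full-series):   {share_full_thr:.1%}")
```

The full-series threshold is pulled upward by the turbulent test period, so fewer test observations count as significant moves than an analyst working in real time would have seen.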

In [None]:
# Compute changes
changes = np.diff(series)
split_idx = 400  # 400 train, ~200 test

# WRONG: Threshold from full series
threshold_wrong = np.percentile(np.abs(changes), 70)

# RIGHT: Threshold from training only
threshold_right = np.percentile(np.abs(changes[:split_idx]), 70)

print("MOVE THRESHOLD COMPARISON")
print("=" * 60)
print(f"\nSplit index: {split_idx}")
print(f"Training observations: {split_idx}")
print(f"Test observations: {len(changes) - split_idx}")
print(f"\nThreshold (WRONG - full series):   {threshold_wrong:.4f}")
print(f"Threshold (RIGHT - training only): {threshold_right:.4f}")
print(f"\nDifference: {abs(threshold_wrong - threshold_right):.4f}")
print(f"  ({abs(threshold_wrong - threshold_right) / threshold_right * 100:.1f}% relative difference)")

In [None]:
# Show how classification differs
def classify_moves(changes, threshold):
    """Classify moves as UP, DOWN, or FLAT."""
    labels = np.array(['FLAT'] * len(changes))
    labels[changes > threshold] = 'UP'
    labels[changes < -threshold] = 'DOWN'
    return labels

# Classify test period with both thresholds
test_changes = changes[split_idx:]

labels_wrong = classify_moves(test_changes, threshold_wrong)
labels_right = classify_moves(test_changes, threshold_right)

print("CLASSIFICATION COMPARISON (Test Period)")
print("=" * 60)

for label in ['UP', 'DOWN', 'FLAT']:
    n_wrong = np.sum(labels_wrong == label)
    n_right = np.sum(labels_right == label)
    diff = n_wrong - n_right
    print(f"  {label:<6}: Wrong={n_wrong:<4} Right={n_right:<4} Diff={diff:+d}")

# How many observations changed classification?
n_changed = np.sum(labels_wrong != labels_right)
print(f"\nObservations with different classification: {n_changed}")
print(f"  ({n_changed / len(labels_wrong) * 100:.1f}% of test period)")
print(f"\nThese classification differences are LEAKAGE.")

In [None]:
# Visualize the impact on regime classification
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

t = np.arange(len(changes))

# Top: Full series with thresholds
ax = axes[0]
ax.plot(t, changes, 'b-', alpha=0.6, linewidth=0.8)
ax.axhline(y=threshold_wrong, color='red', linestyle='--', linewidth=2,
           label=f'WRONG threshold: {threshold_wrong:.3f}')
ax.axhline(y=-threshold_wrong, color='red', linestyle='--', linewidth=2)
ax.axhline(y=threshold_right, color='green', linestyle='-', linewidth=2,
           label=f'RIGHT threshold: {threshold_right:.3f}')
ax.axhline(y=-threshold_right, color='green', linestyle='-', linewidth=2)
ax.axvline(x=split_idx, color='black', linestyle=':', linewidth=2, label='Train/Test split')
ax.set_ylabel('Change', fontsize=11)
ax.set_title('Move Thresholds: WRONG (full series) vs RIGHT (training only)',
             fontsize=13, fontweight='bold')
ax.legend(loc='upper right')

# Bottom: Classification differences
ax = axes[1]
# Color by whether classification differs
colors = np.where(labels_wrong != labels_right, 'red', 'gray')
ax.scatter(t[split_idx:], test_changes, c=colors, s=10, alpha=0.7)
ax.axhline(y=0, color='black', linewidth=0.5)
ax.axvline(x=split_idx, color='black', linestyle=':', linewidth=2)
ax.set_xlabel('Time Index', fontsize=11)
ax.set_ylabel('Change', fontsize=11)
ax.set_title('Test Period: Red = Different Classification (Leakage Impact)',
             fontsize=12, fontweight='bold', color='red')

plt.tight_layout()
plt.show()

---

## Section 2: BUG-005 (Volatility Basis Error) [T3]

### The Problem

When computing volatility regimes, the **basis** matters critically:

- `basis='levels'`: Computes std of the series levels → **WRONG for drift**
- `basis='changes'`: Computes std of the changes → **CORRECT**

### Why It Matters

Consider a series with steady drift (e.g., 3.0 → 4.0 over 100 periods):
- **std(levels)** is HIGH because values range from 3.0 to 4.0
- **std(changes)** is LOW because each change is approximately +0.01

The first definition is misleading: the series is **predictable** (steady drift), not volatile!
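For a noise-free linear drift the gap can be computed in closed form: the population standard deviation of `n` equally spaced levels with step `d` is `d * sqrt((n**2 - 1) / 12)`, while every change equals `d`, so the standard deviation of the changes is zero. A minimal check, assuming `n = 100` and `d = 0.01` to match the example above:

```python
import numpy as np

n, drift = 100, 0.01                    # assumed values matching the example above
levels = 3.0 + drift * np.arange(n)     # noise-free linear drift starting at 3.0
changes = np.diff(levels)

std_levels = np.std(levels)             # population std (ddof=0)
closed_form = drift * np.sqrt((n**2 - 1) / 12)
std_changes = np.std(changes)           # every change equals `drift`, so this is ~0

print(f"std(levels):  {std_levels:.4f} (closed form: {closed_form:.4f})")
print(f"std(changes): {std_changes:.2e}")
```

The spread of the levels grows with the length of the series even though nothing about the series is hard to forecast, which is why levels are the wrong basis for a volatility regime.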

In [None]:
# Demonstrate BUG-005: volatility basis error
def generate_steady_drift(n=100, drift_per_step=0.01, noise=0.001, seed=42):
    """Generate series with steady drift but low volatility."""
    rng = np.random.default_rng(seed)
    y = np.cumsum(np.ones(n) * drift_per_step + rng.normal(0, noise, n))
    return y + 3.0  # Start at 3.0

def generate_high_volatility(n=100, sigma=0.1, seed=42):
    """Generate stationary series with high volatility."""
    rng = np.random.default_rng(seed)
    return 3.5 + rng.normal(0, sigma, n)

# Generate two contrasting series
drift_series = generate_steady_drift(n=100)
volatile_series = generate_high_volatility(n=100)

print("BUG-005: Volatility Basis Comparison")
print("=" * 60)

# Compute volatility both ways
print("\nSeries 1: Steady Drift (predictable)")
print(f"  std(levels):  {np.std(drift_series):.4f}  ← Looks HIGH")
print(f"  std(changes): {np.std(np.diff(drift_series)):.4f}  ← Actually LOW")

print("\nSeries 2: Stationary Noise (unpredictable)")
print(f"  std(levels):  {np.std(volatile_series):.4f}  ← Moderate")
print(f"  std(changes): {np.std(np.diff(volatile_series)):.4f}  ← Actually HIGH")

print("\n[T3] The WRONG basis (levels) would classify steady drift as HIGH volatility!")

In [None]:
# Visualize the two series
fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# Top: Drift series
axes[0, 0].plot(drift_series, 'b-', linewidth=1.5)
axes[0, 0].set_title('Steady Drift: Levels', fontsize=11, fontweight='bold')
axes[0, 0].set_ylabel('Value')
axes[0, 0].annotate(f'std(levels) = {np.std(drift_series):.3f}\n(LOOKS high)',
                    xy=(0.7, 0.3), xycoords='axes fraction', fontsize=10, color='red')

axes[0, 1].plot(np.diff(drift_series), 'b-', linewidth=1.5)
axes[0, 1].axhline(y=0, color='gray', linestyle='--')
axes[0, 1].set_title('Steady Drift: Changes', fontsize=11, fontweight='bold')
axes[0, 1].set_ylabel('Change')
axes[0, 1].annotate(f'std(changes) = {np.std(np.diff(drift_series)):.4f}\n(Actually LOW)',
                    xy=(0.7, 0.3), xycoords='axes fraction', fontsize=10, color='green')

# Bottom: Volatile series
axes[1, 0].plot(volatile_series, 'r-', linewidth=1.5)
axes[1, 0].set_title('Stationary Noise: Levels', fontsize=11, fontweight='bold')
axes[1, 0].set_xlabel('Time')
axes[1, 0].set_ylabel('Value')
axes[1, 0].annotate(f'std(levels) = {np.std(volatile_series):.3f}',
                    xy=(0.7, 0.3), xycoords='axes fraction', fontsize=10)

axes[1, 1].plot(np.diff(volatile_series), 'r-', linewidth=1.5)
axes[1, 1].axhline(y=0, color='gray', linestyle='--')
axes[1, 1].set_title('Stationary Noise: Changes', fontsize=11, fontweight='bold')
axes[1, 1].set_xlabel('Time')
axes[1, 1].set_ylabel('Change')
axes[1, 1].annotate(f'std(changes) = {np.std(np.diff(volatile_series)):.4f}\n(Actually HIGH)',
                    xy=(0.7, 0.3), xycoords='axes fraction', fontsize=10, color='red')

plt.tight_layout()
plt.show()

print("\nKey insight: Use basis='changes' to measure TRUE volatility (unpredictability).")

In [None]:
# Implement correct volatility classification
def classify_volatility_regime_local(values, window=13, basis='changes',
                                     low_percentile=33, high_percentile=67):
    """
    Classify each observation into LOW, MEDIUM, or HIGH volatility regime.
    
    Parameters
    ----------
    values : array-like
        Time series values
    window : int, default=13
        Rolling window for volatility calculation (13 weeks ≈ quarterly) [T3]
    basis : {'changes', 'levels'}, default='changes'
        Whether to compute volatility from changes or levels.
        CRITICAL: Use 'changes' to avoid BUG-005.
    low_percentile : float, default=33
        Percentile below which volatility is LOW [T2]
    high_percentile : float, default=67
        Percentile above which volatility is HIGH [T2]
        
    Returns
    -------
    regimes : array
        Regime labels: 'LOW', 'MEDIUM', or 'HIGH'
    """
    values = np.asarray(values)
    n = len(values)
    
    # Compute what we're measuring volatility of
    if basis == 'changes':
        # Use signed changes so the rolling std matches std(changes) as described above
        vol_input = np.diff(values)
        # Pad to match original length
        vol_input = np.concatenate([[np.nan], vol_input])
    else:  # basis == 'levels' (WRONG for most use cases)
        vol_input = values
    
    # Rolling volatility
    rolling_vol = pd.Series(vol_input).rolling(window, min_periods=1).std().values
    
    # Compute thresholds from whatever data was passed in.
    # To avoid leakage, callers should pass training data only when deriving thresholds.
    valid_vol = rolling_vol[~np.isnan(rolling_vol)]
    low_thresh = np.percentile(valid_vol, low_percentile)
    high_thresh = np.percentile(valid_vol, high_percentile)
    
    # Classify
    regimes = np.array(['MEDIUM'] * n)
    regimes[rolling_vol < low_thresh] = 'LOW'
    regimes[rolling_vol > high_thresh] = 'HIGH'
    
    return regimes

# Test on regime-switching series
regimes_right = classify_volatility_regime_local(series_regime, basis='changes')
regimes_wrong = classify_volatility_regime_local(series_regime, basis='levels')

print("Volatility Regime Classification")
print("=" * 60)
print(f"\nSeries has known regime changes at t=200 and t=400")
print(f"  t=0-200: LOW volatility")
print(f"  t=200-400: HIGH volatility")
print(f"  t=400-600: LOW volatility")

print(f"\nClassification counts (basis='changes' - CORRECT):")
for regime in ['LOW', 'MEDIUM', 'HIGH']:
    print(f"  {regime}: {np.sum(regimes_right == regime)}")

print(f"\nClassification counts (basis='levels' - WRONG):")
for regime in ['LOW', 'MEDIUM', 'HIGH']:
    print(f"  {regime}: {np.sum(regimes_wrong == regime)}")

---

## Section 3: Regime Stratification Exposes Hidden Failures

### The Insight

A model might **pass overall** but **fail in specific regimes**.

- Overall MAE looks good
- But HIGH volatility regime has terrible performance
- Or FLAT regime has artificially good metrics (persistence works)

**Stratification** reveals these hidden issues.
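A toy calculation makes the masking effect concrete. The numbers below are invented purely for illustration (90 calm observations with small errors, 10 turbulent ones with large errors); they are not results from this notebook.

```python
import numpy as np

# Hypothetical absolute errors: 90 calm observations, 10 turbulent ones
regimes = np.array(['LOW'] * 90 + ['HIGH'] * 10)
abs_errors = np.where(regimes == 'LOW', 0.1, 2.0)

# The overall average is dominated by the many easy observations
overall_mae = abs_errors.mean()

# Per-regime averages expose the failure in the small HIGH group
mae_by_regime = {r: abs_errors[regimes == r].mean() for r in ('LOW', 'HIGH')}

print(f"Overall MAE: {overall_mae:.2f}")
for regime, mae in mae_by_regime.items():
    print(f"  {regime}: {mae:.2f}")
```

The overall figure sits close to the LOW-regime error because LOW observations dominate the sample, while the HIGH-regime error is roughly seven times larger.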

In [None]:
# Fit a simple lag model and compare overall vs per-regime performance
def create_lag_features(series, n_lags=5):
    """Create lag features."""
    n = len(series)
    X = np.column_stack([
        np.concatenate([[np.nan]*lag, series[:-lag]]) 
        for lag in range(1, n_lags + 1)
    ])
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], series[valid]

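# Use regime-switching series
X, y = create_lag_features(series_regime, n_lags=5)
# NOTE: for brevity, regime thresholds here come from the full series, which is itself
# a mild form of the leakage discussed above. Section 4 computes thresholds from training only.
regimes = classify_volatility_regime_local(series_regime, basis='changes')[5:]  # Align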

# Train/test split
split_idx = int(len(X) * 0.7)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
regimes_test = regimes[split_idx:]

# Train model
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Evaluate overall
overall_mae = mean_absolute_error(y_test, preds)
persist_mae = mean_absolute_error(y_test, X_test[:, 0])  # Persistence: y[t-1]

print("OVERALL vs STRATIFIED EVALUATION")
print("=" * 60)
print(f"\nOverall Performance:")
print(f"  Model MAE:       {overall_mae:.4f}")
print(f"  Persistence MAE: {persist_mae:.4f}")
improvement = (persist_mae - overall_mae) / persist_mae * 100
print(f"  Improvement: {improvement:.1f}%")

# Evaluate by regime
print(f"\nStratified Performance:")
for regime in ['LOW', 'MEDIUM', 'HIGH']:
    mask = regimes_test == regime
    if np.sum(mask) > 0:
        regime_mae = mean_absolute_error(y_test[mask], preds[mask])
        regime_persist = mean_absolute_error(y_test[mask], X_test[mask, 0])
        regime_improvement = (regime_persist - regime_mae) / regime_persist * 100
        print(f"  {regime:6s}: n={np.sum(mask):3d}, MAE={regime_mae:.4f}, "
              f"Improvement={regime_improvement:+.1f}%")

In [None]:
# Visualize stratified performance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Predictions by regime
ax = axes[0]
t = np.arange(len(y_test))

for regime, color in [('LOW', 'green'), ('MEDIUM', 'orange'), ('HIGH', 'red')]:
    mask = regimes_test == regime
    ax.scatter(t[mask], y_test[mask] - preds[mask], c=color, alpha=0.5, s=20, label=regime)

ax.axhline(y=0, color='black', linestyle='--', linewidth=0.5)
ax.set_xlabel('Test Index')
ax.set_ylabel('Error (Actual - Predicted)')
ax.set_title('Prediction Errors by Volatility Regime', fontsize=12, fontweight='bold')
ax.legend()

# Right: MAE by regime (bar chart)
ax = axes[1]
regimes_list = ['LOW', 'MEDIUM', 'HIGH']
maes = []
persist_maes = []
for regime in regimes_list:
    mask = regimes_test == regime
    if np.sum(mask) > 0:
        maes.append(mean_absolute_error(y_test[mask], preds[mask]))
        persist_maes.append(mean_absolute_error(y_test[mask], X_test[mask, 0]))
    else:
        maes.append(0)
        persist_maes.append(0)

x = np.arange(len(regimes_list))
width = 0.35
ax.bar(x - width/2, maes, width, label='Model', color='steelblue')
ax.bar(x + width/2, persist_maes, width, label='Persistence', color='gray', alpha=0.7)

ax.set_xticks(x)
ax.set_xticklabels(regimes_list)
ax.set_ylabel('MAE')
ax.set_title('MAE by Volatility Regime', fontsize=12, fontweight='bold')
ax.legend()

plt.tight_layout()
plt.show()

print("\nStratification reveals how the model performs in different conditions.")

---

## Section 4: Correct Threshold Computation

### The Rule

**All thresholds must be computed from TRAINING data only.**

```python
# WRONG
threshold = np.percentile(np.abs(y_all), 70)

# RIGHT
threshold = np.percentile(np.abs(y_train), 70)
```

### Inside CV Folds

```python
for train_idx, test_idx in cv.split(X, y):
    # Here y holds the changes being classified (not the levels).
    # Compute threshold from THIS fold's training data
    fold_threshold = np.percentile(np.abs(y[train_idx]), 70)
    
    # Apply to both train and test
    train_regimes = classify_moves(y[train_idx], fold_threshold)
    test_regimes = classify_moves(y[test_idx], fold_threshold)
```
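The fragment above assumes `cv`, `X`, and `y` already exist. A self-contained version is sketched below using only NumPy. The `expanding_splits` generator is a hypothetical stand-in for a walk-forward splitter (it is not the `WalkForwardCV` API), and the changes are synthetic.

```python
import numpy as np

def expanding_splits(n_samples, n_splits=5, test_size=50):
    """Hypothetical stand-in for a walk-forward splitter (expanding window)."""
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        # Training set is everything strictly before the test block
        yield np.arange(test_start), np.arange(test_start, test_start + test_size)

def classify_moves(changes, threshold):
    """Label each change as UP, DOWN, or FLAT relative to a fixed threshold."""
    labels = np.full(len(changes), 'FLAT', dtype='<U4')
    labels[changes > threshold] = 'UP'
    labels[changes < -threshold] = 'DOWN'
    return labels

rng = np.random.default_rng(0)
changes = rng.normal(0, 1.0, 600)   # synthetic changes, for illustration only

fold_thresholds = []
for train_idx, test_idx in expanding_splits(len(changes)):
    # Threshold is derived from this fold's training changes only
    thr = np.percentile(np.abs(changes[train_idx]), 70)
    test_labels = classify_moves(changes[test_idx], thr)
    fold_thresholds.append(thr)
    print(f"train={len(train_idx):3d}  threshold={thr:.3f}  "
          f"test moves={np.mean(test_labels != 'FLAT'):.1%}")
```

Each fold gets its own threshold, and the threshold changes as the training window grows; that variation is expected and is not something to smooth away with a single global value.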

In [None]:
# Complete workflow with correct threshold handling
#
# ⚠️ CRITICAL: LAG OFFSET ALIGNMENT
# When we create lag features with n_lags=5, we lose the first 5 observations.
# The indices from cv.split(X) are relative to X, not the original series!
# We must track this offset when mapping back to original series indices.

N_LAGS = 5  # Define constant for clarity

def evaluate_with_correct_thresholds(series, n_splits=5, window=13, n_lags=N_LAGS):
    """
    Evaluate model with thresholds computed correctly per fold.
    
    Note on index alignment:
        - create_lag_features drops first n_lags observations
        - CV indices are relative to the feature matrix X, not original series
        - Original series index = X index + n_lags
    """
    X, y = create_lag_features(series, n_lags=n_lags)
    
    cv = WalkForwardCV(n_splits=n_splits, window_type='expanding', test_size=50)
    model = Ridge(alpha=1.0)
    
    overall_results = []
    regime_results = {'UP': [], 'DOWN': [], 'FLAT': []}  # labels produced by classify_moves
    
    for fold_idx, (train_idx, test_idx) in enumerate(cv.split(X)):
        # Get data
        X_train, y_train = X[train_idx], y[train_idx]
        X_test, y_test = X[test_idx], y[test_idx]
        
        # ⚠️ LAG OFFSET: Map from X indices back to original series indices
        # X[0] corresponds to series[n_lags], so:
        #   original_train_end = train_idx[-1] + n_lags
        #   original_test_start = test_idx[0] + n_lags
        original_train_end = train_idx[-1] + n_lags
        
        # Compute threshold from TRAINING data only (using original series)
        # Changes array has length = series length - 1
        train_changes = np.diff(series[:original_train_end + 1])
        fold_threshold = np.percentile(np.abs(train_changes), 70)
        
        # Compute regimes for test period using training threshold
        # Map test indices to original series
        original_test_start = test_idx[0] + n_lags
        original_test_end = test_idx[-1] + n_lags
        
        # Get changes for test period (aligned with predictions)
        test_series_slice = series[original_test_start - 1:original_test_end + 1]
        test_changes = np.diff(test_series_slice)
        test_regimes = classify_moves(test_changes, fold_threshold)
        
        # Train and predict
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        
        # Overall MAE
        overall_mae = mean_absolute_error(y_test, preds)
        overall_results.append(overall_mae)
        
        # Stratified MAE by move regime (ensure lengths match).
        # classify_moves returns UP/DOWN/FLAT, so stratify on those labels;
        # LOW/MEDIUM/HIGH would never match and every regime MAE would be NaN.
        n_test = len(y_test)
        aligned_regimes = test_regimes[:n_test] if len(test_regimes) >= n_test else test_regimes
        
        for regime in ['UP', 'DOWN', 'FLAT']:
            if len(aligned_regimes) == n_test:
                mask = aligned_regimes == regime
                if np.sum(mask) >= 5:  # Need enough samples
                    regime_mae = mean_absolute_error(y_test[mask], preds[mask])
                    regime_results.setdefault(regime, []).append(regime_mae)
    
    return {
        'overall_mae': np.mean(overall_results),
        'regime_maes': {k: np.mean(v) if v else np.nan for k, v in regime_results.items()}
    }

# Run evaluation
print("WALK-FORWARD CV WITH CORRECT THRESHOLDS")
print("=" * 60)
print(f"\nNote: Lag offset = {N_LAGS} (indices properly tracked)")

results = evaluate_with_correct_thresholds(series_regime)

print(f"\nOverall MAE: {results['overall_mae']:.4f}")
print(f"\nMAE by Regime:")
for regime, mae in results['regime_maes'].items():
    if not np.isnan(mae):
        print(f"  {regime}: {mae:.4f}")

print(f"\n[T2] Thresholds computed from training data in each fold.")

---

## Pitfall Section

### Pitfall 1: Full-Series Percentiles

```python
# WRONG: Percentile from all data
threshold = np.percentile(np.abs(y), 70)

# RIGHT: Percentile from training only
threshold = np.percentile(np.abs(y_train), 70)
```

### Pitfall 2: Wrong Volatility Basis

```python
# WRONG: Volatility from levels
vol = series.rolling(13).std()  # BUG-005!

# RIGHT: Volatility from changes
changes = series.diff()
vol = changes.rolling(13).std()
```

### Pitfall 3: Ignoring Regime Stratification

```python
# WRONG: Only look at overall metrics
print(f"Overall MAE: {mae}")

# RIGHT: Check performance by regime
for regime in ['LOW', 'MEDIUM', 'HIGH']:
    mask = regimes == regime
    print(f"{regime} MAE: {mean_absolute_error(y[mask], preds[mask])}")
```

### Pitfall 4: Thresholds Computed Outside CV

```python
# WRONG: Same threshold for all folds
threshold = np.percentile(np.abs(y), 70)
for train_idx, test_idx in cv.split(X):
    regimes = classify(y[test_idx], threshold)  # Leakage!

# RIGHT: Threshold per fold
for train_idx, test_idx in cv.split(X):
    threshold = np.percentile(np.abs(y[train_idx]), 70)  # Training only
    regimes = classify(y[test_idx], threshold)
```

### Pitfall 5: Lag Offset Misalignment ⚠️

When creating lag features, the first `n_lags` observations are dropped. This creates an **index offset** between the feature matrix and the original series:

```python
# DANGEROUS: Ignoring lag offset when mapping back to series
n_lags = 5
X, y = create_lag_features(series, n_lags=n_lags)  # X[0] = series[n_lags]!

for train_idx, test_idx in cv.split(X):
    # WRONG: Using X indices directly on original series
    train_changes = np.diff(series[:train_idx[-1]])  # Off by n_lags!
    
# CORRECT: Track the lag offset explicitly
for train_idx, test_idx in cv.split(X):
    # Map X indices back to original series indices
    original_train_end = train_idx[-1] + n_lags
    original_test_start = test_idx[0] + n_lags
    
    # Now use correct indices for threshold computation
    train_changes = np.diff(series[:original_train_end + 1])
    threshold = np.percentile(np.abs(train_changes), 70)
```

**Key formula**: `original_series_index = feature_matrix_index + n_lags`

This is easy to overlook and causes subtle bugs where thresholds or regimes are computed on the wrong portion of data.
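One way to confirm the offset is to build lag features on a series whose values equal their own indices, so any misalignment is visible immediately. The sketch below repeats this notebook's `create_lag_features` so that it runs on its own; the toy series is an assumption made for the check.

```python
import numpy as np

def create_lag_features(series, n_lags=5):
    """Same construction as in this notebook: column k holds the lag-(k+1) value."""
    X = np.column_stack([
        np.concatenate([[np.nan] * lag, series[:-lag]])
        for lag in range(1, n_lags + 1)
    ])
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], series[valid]

n_lags = 5
series = np.arange(20, dtype=float)   # series[t] == t, so values double as indices
X, y = create_lag_features(series, n_lags=n_lags)

i = 3                                      # any row of the feature matrix
assert y[i] == series[i + n_lags]          # target maps to original index i + n_lags
assert X[i, 0] == series[i + n_lags - 1]   # first column is the lag-1 value

print(f"Row {i} of X -> original series index {int(y[i])}")
```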

In [None]:
# Quick reference: threshold computation rules
print("THRESHOLD COMPUTATION RULES")
print("=" * 70)

rules = [
    ("Move threshold", "70th percentile of |changes|", "Training data only"),
    ("Volatility window", "13 periods (~quarterly)", "[T3] Assumption"),
    ("Volatility basis", "'changes' not 'levels'", "BUG-005 prevention"),
    ("Regime boundaries", "33rd/67th percentiles", "[T2] Training data only"),
    ("Normalization", "mean/std from training", "Applied to test"),
]

print(f"\n{'Threshold Type':<20} {'Value/Rule':<30} {'Source'}")
print("-" * 70)
for threshold_type, rule, source in rules:
    print(f"{threshold_type:<20} {rule:<30} {source}")

---

## Key Insights

### 1. Thresholds Are Features [T2]
Any threshold used for classification is effectively a feature. It must be computed from training data only.
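For readers used to sklearn, the same discipline that applies to `StandardScaler` applies here: fit on training data, then apply the frozen statistic everywhere else. The class below is a hypothetical helper written for illustration (it is not part of `temporalcv` or sklearn) and only mirrors the fit/transform pattern.

```python
import numpy as np

class MoveThreshold:
    """Hypothetical helper: fit a move threshold on training changes, apply anywhere."""

    def __init__(self, percentile=70):
        self.percentile = percentile
        self.threshold_ = None

    def fit(self, train_changes):
        # The only place where data statistics are computed
        self.threshold_ = np.percentile(np.abs(train_changes), self.percentile)
        return self

    def transform(self, changes):
        if self.threshold_ is None:
            raise RuntimeError("Call fit() on training changes first.")
        changes = np.asarray(changes)
        labels = np.full(changes.shape, 'FLAT', dtype='<U4')
        labels[changes > self.threshold_] = 'UP'
        labels[changes < -self.threshold_] = 'DOWN'
        return labels

rng = np.random.default_rng(1)
changes = rng.normal(0, 1.0, 300)            # synthetic changes, for illustration only
clf = MoveThreshold().fit(changes[:200])     # statistics from training only
test_labels = clf.transform(changes[200:])   # frozen threshold applied to test
```

Keeping the threshold inside a fitted object makes it harder to recompute it accidentally on the full series.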

### 2. BUG-005: Use `basis='changes'` [T3]
Volatility measures unpredictability of changes, not spread of levels. Steady drift is predictable.

### 3. Stratification Reveals Hidden Failures
A model may pass overall but fail in specific regimes. Always check stratified metrics.

### 4. Threshold Computation Happens Per Fold
In walk-forward CV, each fold computes its own thresholds from its training data.

### 5. 70th Percentile is the Standard [T2]
Per SPECIFICATION.md, moves above 70th percentile of |changes| are considered significant.

---

## Next Steps

- **08_validation_workflow.ipynb**: Complete HALT/WARN/PASS pipeline with `run_gates`
- **12_regime_stratified_evaluation.ipynb**: Deep dive into stratified metrics (Tier 3)
- **10_high_persistence_metrics.ipynb**: MC-SS and move-conditional evaluation

---

*"Thresholds computed from future data aren't thresholds; they're answers."*