# The Shuffled Target Gate: Definitive Leakage Detection

## If Your Model Beats Shuffled Data, Something Is Wrong

---

## 🚨 If You Know sklearn But Not Time Series Leakage, Read This First

**What you already know (from standard ML)**:
- Feature engineering is about extracting predictive signals
- If your model beats the baseline, you've built something useful
- Cross-validation protects against overfitting

**What's different with time series leakage**:

In time series, you can accidentally create features that **encode target position** instead of **predictive signal**:

```python
# LOOKS INNOCENT:
X['rolling_mean'] = y.rolling(7).mean()  # Computed on FULL series; window includes y[t]

# ACTUALLY LEAKY:
# rolling_mean at t=100 averages y[94:101], which includes y[100] itself.
# The feature used to predict y[100] already contains y[100]: the target has leaked!
```

**The shuffled target test detects this**:
- Shuffle the target randomly → destroy time structure
- If model STILL predicts well → features must encode position, not value
- That's leakage!
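
The leaky rolling feature shown earlier includes `y[t]` in the window used to predict `y[t]`. As a minimal sketch of the alternative, assuming numpy and a hypothetical helper named `trailing_mean`, a window that stops at `t-1` might look like this:

```python
import numpy as np

def trailing_mean(y, window):
    """Mean of y[t-window:t] for each t; y[t] itself is excluded (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    out = np.full(len(y), np.nan)  # the first `window` rows have no full history
    csum = np.concatenate([[0.0], np.cumsum(y)])  # csum[k] = sum of y[:k]
    # csum[t] - csum[t - window] sums y[t-window], ..., y[t-1]
    out[window:] = (csum[window:-1] - csum[:-window - 1]) / window
    return out

y = np.arange(10, dtype=float)
feat = trailing_mean(y, window=3)
# feat[3] averages y[0], y[1], y[2]; y[3] is not used
```

In pandas the same idea is usually written as `y.shift(1).rolling(7).mean()`, which shifts the series by one step before the window is applied.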

**What you'll learn:**
1. Why permutation testing is the definitive test for feature-target leakage
2. How to use `gate_signal_verification` with block vs IID permutation
3. When to use effect_size (fast) vs permutation (rigorous) methods
4. How to interpret HALT results and fix the underlying bugs

**Prerequisites:** Notebooks 01, 04

---

In [None]:
# Setup
import warnings
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

from temporalcv.cv import WalkForwardCV
from temporalcv.gates import gate_signal_verification, GateStatus

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Setup complete.")

In [None]:
# Generate AR(1) process
def generate_ar1(n=500, phi=0.9, sigma=1.0, seed=42):
    """Generate AR(1) process with specified autocorrelation."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    y[0] = rng.normal(0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, n):
        y[t] = phi * y[t-1] + sigma * rng.normal()
    return y

# Generate base series
series = generate_ar1(n=500, phi=0.9, seed=42)
print(f"Generated AR(1) series with ACF(1) = {np.corrcoef(series[1:], series[:-1])[0,1]:.3f}")

---

## Section 1: The Problem — Features Can Encode Target Position

**The trap:** When features are computed from the full series (before train/test split), they can inadvertently encode information about **where** in the series each observation falls.

### Example: Leaky Features

Consider these "features" that encode the target:
```python
# Feature 1: target + small noise
X[:, 0] = y + noise

# Feature 2: scaled target + noise  
X[:, 1] = y * 0.5 + noise
```

These features encode the target value directly. A model trained on these will achieve impossibly good performance.

In [None]:
# Create clean features (legitimate lag features)
def create_lag_features(series, n_lags=5):
    """Create clean lag features - no lookahead."""
    n = len(series)
    X = np.column_stack([
        np.concatenate([[np.nan]*lag, series[:-lag]]) 
        for lag in range(1, n_lags + 1)
    ])
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], series[valid]

# Create clean features
X_clean, y_clean = create_lag_features(series, n_lags=5)
print(f"Clean features: X shape = {X_clean.shape}")

# Create leaky features (built directly from the target values)
rng = np.random.default_rng(42)
noise = rng.normal(0, 0.1, (len(y_clean), 3))

X_leaky = np.column_stack([
    y_clean + noise[:, 0],       # Feature 1: target + noise
    y_clean * 0.5 + noise[:, 1], # Feature 2: scaled target + noise
    noise[:, 2],                  # Feature 3: pure noise (legitimate)
])

print(f"Leaky features: X shape = {X_leaky.shape}")
print(f"\nCorrelation between features and target:")
for i in range(X_leaky.shape[1]):
    corr = np.corrcoef(X_leaky[:, i], y_clean)[0, 1]
    status = "LEAKED!" if abs(corr) > 0.5 else "OK"
    print(f"  Feature {i+1}: r = {corr:.3f} {status}")

In [None]:
# CRITICAL VISUALIZATION: Leaky vs Clean Features Heatmap
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Clean features - correlation with target
ax = axes[0]
ax.set_title('✓ Clean Features (Legitimate Lags)\nCorrelation with Target', fontsize=12, fontweight='bold', color='green')

# Compute correlations for clean features
clean_corrs = np.array([np.corrcoef(X_clean[:, i], y_clean)[0, 1] for i in range(X_clean.shape[1])])
clean_labels = [f'y[t-{i+1}]' for i in range(X_clean.shape[1])]

# Plot as horizontal bar
colors = ['green' if abs(c) < 0.5 else 'orange' if abs(c) < 0.8 else 'red' for c in clean_corrs]
bars = ax.barh(clean_labels, clean_corrs, color=colors, edgecolor='black', alpha=0.7)
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
ax.axvline(x=0.5, color='orange', linestyle='--', linewidth=1, alpha=0.7, label='Warning threshold')
ax.axvline(x=-0.5, color='orange', linestyle='--', linewidth=1, alpha=0.7)
ax.axvline(x=0.8, color='red', linestyle='--', linewidth=1, alpha=0.7, label='Danger threshold')
ax.axvline(x=-0.8, color='red', linestyle='--', linewidth=1, alpha=0.7)
ax.set_xlabel('Correlation with Target', fontsize=11)
ax.set_xlim(-1.1, 1.1)
ax.legend(loc='lower right', fontsize=9)

# Annotate
ax.annotate('Lag correlations are\nexpected and legitimate', 
            xy=(0.5, 0.2), xycoords='axes fraction',
            fontsize=10, color='green', fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='honeydew', edgecolor='green'))

# Right: Leaky features - correlation with target
ax = axes[1]
ax.set_title('❌ Leaky Features (Target Encoded)\nCorrelation with Target', fontsize=12, fontweight='bold', color='red')

# Compute correlations for leaky features
leaky_corrs = np.array([np.corrcoef(X_leaky[:, i], y_clean)[0, 1] for i in range(X_leaky.shape[1])])
leaky_labels = ['target+noise', 'target*0.5+noise', 'pure_noise']

colors = ['green' if abs(c) < 0.5 else 'orange' if abs(c) < 0.8 else 'red' for c in leaky_corrs]
bars = ax.barh(leaky_labels, leaky_corrs, color=colors, edgecolor='black', alpha=0.7)
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
ax.axvline(x=0.5, color='orange', linestyle='--', linewidth=1, alpha=0.7)
ax.axvline(x=-0.5, color='orange', linestyle='--', linewidth=1, alpha=0.7)
ax.axvline(x=0.8, color='red', linestyle='--', linewidth=1, alpha=0.7)
ax.axvline(x=-0.8, color='red', linestyle='--', linewidth=1, alpha=0.7)
ax.set_xlabel('Correlation with Target', fontsize=11)
ax.set_xlim(-1.1, 1.1)

# Annotate
ax.annotate('SUSPICIOUSLY HIGH!\nFeatures encode target directly', 
            xy=(0.5, 0.6), xycoords='axes fraction',
            fontsize=10, color='red', fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='lightyellow', edgecolor='red'))

plt.tight_layout()
plt.show()

print("\n★ Key Insight: Clean lag features are also correlated with the target")
print("  (roughly phi**k at lag k for an AR(1) process), so correlation alone is not proof of leakage.")
print("  The two target-derived features sit near |r| = 1 because they are built from the target itself.")
print("  Correlation > 0.8 is a red flag that warrants investigation.")

In [None]:
# Train on both feature sets and compare
model = Ridge(alpha=1.0)

# CV setup
cv = WalkForwardCV(n_splits=5, window_type='expanding', test_size=50)

# Evaluate clean features
clean_maes = []
for train_idx, test_idx in cv.split(X_clean):
    model.fit(X_clean[train_idx], y_clean[train_idx])
    preds = model.predict(X_clean[test_idx])
    clean_maes.append(mean_absolute_error(y_clean[test_idx], preds))

# Evaluate leaky features
leaky_maes = []
for train_idx, test_idx in cv.split(X_leaky):
    model.fit(X_leaky[train_idx], y_clean[train_idx])
    preds = model.predict(X_leaky[test_idx])
    leaky_maes.append(mean_absolute_error(y_clean[test_idx], preds))

print("CLEAN vs LEAKY FEATURES")
print("=" * 55)
print(f"\nClean features MAE:  {np.mean(clean_maes):.4f}")
print(f"Leaky features MAE:  {np.mean(leaky_maes):.4f}  <-- Suspiciously low!")
print(f"\nImprovement: {(np.mean(clean_maes) - np.mean(leaky_maes)) / np.mean(clean_maes) * 100:.1f}%")
print(f"\nThe leaky features achieve impossibly good performance.")
print(f"How do we DETECT this leakage systematically?")

---

## Section 2: Permutation Test Basics [T1]

### The Key Insight

**Shuffle the target** and see if the model can still predict it.

- **If model beats shuffled target** → Features encode target position → LEAKAGE!
- **If model fails on shuffled target** → Features capture legitimate patterns → SAFE

### The Test

1. Train model on original data, compute MAE
2. Shuffle target many times, train model, compute MAE each time
3. Calculate p-value: proportion of shuffled MAEs ≤ original MAE
4. If p-value < α (typically 0.05) → HALT

### Reference [T1]
Phipson & Smyth (2010): p-value = (1 + count(shuffled ≤ original)) / (1 + n_shuffles)
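
The formula above can be sketched as a small function. This is an illustrative helper (the name `permutation_pvalue` is hypothetical, not part of temporalcv), and the MAE values are made-up numbers in arbitrary units:

```python
def permutation_pvalue(original_mae, shuffled_maes):
    """p = (1 + #{shuffled <= original}) / (1 + n_shuffles), per the formula above."""
    count = sum(1 for s in shuffled_maes if s <= original_mae)
    return (1 + count) / (1 + len(shuffled_maes))

shuffled = [float(i) for i in range(1, 100)]  # 99 shuffled MAEs: 1.0, 2.0, ..., 99.0

# Original MAE beats every shuffled MAE: the smallest p-value 99 shuffles can give
p_low = permutation_pvalue(0.0, shuffled)

# Original MAE sits in the middle of the shuffled distribution
p_mid = permutation_pvalue(50.0, shuffled)
```

The `1 +` in numerator and denominator means the p-value can never be exactly zero, however extreme the original MAE is.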

In [None]:
# Visualize the permutation test concept
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Clean features
ax = axes[0]
rng = np.random.default_rng(42)

# Compute shuffled-target MAEs for clean features (IID shuffle of y; X is left untouched)
original_mae_clean = np.mean(clean_maes)
shuffled_maes_clean = []
for _ in range(100):
    y_shuffled = rng.permutation(y_clean)
    maes = []
    for train_idx, test_idx in cv.split(X_clean):
        model.fit(X_clean[train_idx], y_shuffled[train_idx])
        preds = model.predict(X_clean[test_idx])
        maes.append(mean_absolute_error(y_shuffled[test_idx], preds))
    shuffled_maes_clean.append(np.mean(maes))

ax.hist(shuffled_maes_clean, bins=20, alpha=0.7, color='steelblue', 
        edgecolor='black', label='Shuffled MAEs')
ax.axvline(x=original_mae_clean, color='green', linestyle='--', linewidth=2,
           label=f'Original MAE = {original_mae_clean:.3f}')
ax.set_xlabel('MAE', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('Clean Features: Original MAE Within Shuffled Distribution\n(No Leakage)',
             fontsize=12, fontweight='bold', color='green')
ax.legend()

# Right: Leaky features
ax = axes[1]
original_mae_leaky = np.mean(leaky_maes)
shuffled_maes_leaky = []
for _ in range(100):
    y_shuffled = rng.permutation(y_clean)
    maes = []
    for train_idx, test_idx in cv.split(X_leaky):
        model.fit(X_leaky[train_idx], y_shuffled[train_idx])
        preds = model.predict(X_leaky[test_idx])
        maes.append(mean_absolute_error(y_shuffled[test_idx], preds))
    shuffled_maes_leaky.append(np.mean(maes))

ax.hist(shuffled_maes_leaky, bins=20, alpha=0.7, color='steelblue',
        edgecolor='black', label='Shuffled MAEs')
ax.axvline(x=original_mae_leaky, color='red', linestyle='--', linewidth=2,
           label=f'Original MAE = {original_mae_leaky:.3f}')
ax.set_xlabel('MAE', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('Leaky Features: Original MAE BEATS Shuffled Distribution\n(LEAKAGE DETECTED!)',
             fontsize=12, fontweight='bold', color='red')
ax.legend()

plt.tight_layout()
plt.show()

# Calculate p-values
pvalue_clean = (1 + sum(s <= original_mae_clean for s in shuffled_maes_clean)) / (1 + len(shuffled_maes_clean))
pvalue_leaky = (1 + sum(s <= original_mae_leaky for s in shuffled_maes_leaky)) / (1 + len(shuffled_maes_leaky))

print(f"\nP-values (Phipson & Smyth formula):")
print(f"  Clean features: p = {pvalue_clean:.4f} → {'PASS' if pvalue_clean >= 0.05 else 'HALT'}")
print(f"  Leaky features: p = {pvalue_leaky:.4f} → {'PASS' if pvalue_leaky >= 0.05 else 'HALT'}")

---

## Section 3: Using `gate_signal_verification`

temporalcv provides `gate_signal_verification` to run this test automatically.

### Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `method` | "permutation" | "permutation" for p-value, "effect_size" for quick check |
| `n_shuffles` | 100 (perm) / 5 (effect) | Number of shuffle iterations |
| `permutation` | "block" | "block" for time series, "iid" for standard |
| `alpha` | 0.05 | Significance level (permutation mode) |
| `strict` | False | If True, uses ≥199 shuffles, giving a p-value resolution of 0.005 |

In [None]:
# Demonstrate gate_signal_verification
print("gate_signal_verification Demo")
print("=" * 60)

model = Ridge(alpha=1.0)

# Test clean features
print("\n1. Clean Features (Legitimate Lags):")
result_clean = gate_signal_verification(
    model=model,
    X=X_clean,
    y=y_clean,
    method='permutation',
    n_shuffles=100,  # For permutation mode
    random_state=42
)
print(f"   Status: {result_clean.status.value}")
print(f"   p-value: {result_clean.details.get('pvalue', float('nan')):.4f}")
print(f"   Message: {result_clean.message}")

# Test leaky features
print("\n2. Leaky Features (Target Encoded):")
result_leaky = gate_signal_verification(
    model=model,
    X=X_leaky,
    y=y_clean,
    method='permutation',
    n_shuffles=100,
    random_state=42
)
print(f"   Status: {result_leaky.status.value}")
print(f"   p-value: {result_leaky.details.get('pvalue', float('nan')):.4f}")
print(f"   Message: {result_leaky.message}")
if result_leaky.recommendation:
    print(f"   Recommendation: {result_leaky.recommendation}")

---

## Section 4: Block vs IID Permutation [T1]

### The Problem with IID Shuffling

Standard (IID) shuffling destroys **all** temporal structure. For persistent time series:
- The shuffled target is essentially white noise
- Any model with legitimate predictive ability should beat white noise
- This produces **false positives** (HALT when there's no leakage)

### The Solution: Block Permutation [T1]

Block permutation:
1. Divide series into blocks of size $b \approx n^{1/3}$
2. Shuffle the blocks (not individual observations)
3. Preserves local autocorrelation structure

**Reference [T1]**: Künsch (1989), Politis & Romano (1994)
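
The three steps above can be sketched in a few lines of numpy. This is an illustration under stated assumptions, not temporalcv's implementation: the function name `block_permute` is hypothetical, and letting the last block be shorter when `n` is not a multiple of `b` is one choice among several.

```python
import numpy as np

def block_permute(y, block_size=None, rng=None):
    """Shuffle contiguous blocks of y, keeping the order inside each block."""
    y = np.asarray(y)
    n = len(y)
    rng = np.random.default_rng() if rng is None else rng
    if block_size is None:
        block_size = max(1, int(round(n ** (1 / 3))))  # b ~ n^(1/3)
    # Cut every `block_size` observations; the final block may be shorter
    blocks = np.array_split(y, np.arange(block_size, n, block_size))
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[i] for i in order])

y = np.arange(27)  # n = 27 gives b = 3, so nine blocks of three
y_perm = block_permute(y, rng=np.random.default_rng(0))
```

Each run of three consecutive values survives intact in `y_perm`; only the order of the runs changes, which is what keeps the short-range autocorrelation.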

In [None]:
# Visualize block vs IID permutation
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Original series (subset)
t = np.arange(100)
original = series[:100]

axes[0].plot(t, original, 'b-', linewidth=1.5)
axes[0].set_title('Original Series\n(Autocorrelated)', fontsize=11, fontweight='bold')
axes[0].set_xlabel('Time')
axes[0].set_ylabel('Value')

# IID permutation
rng = np.random.default_rng(42)
iid_shuffled = rng.permutation(original)
axes[1].plot(t, iid_shuffled, 'r-', linewidth=1.5)
axes[1].set_title('IID Shuffled\n(Structure Destroyed)', fontsize=11, fontweight='bold', color='red')
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Value')

# Block permutation (block_size = 10)
block_size = 10
n_blocks = len(original) // block_size
blocks = [original[i*block_size:(i+1)*block_size] for i in range(n_blocks)]
rng.shuffle(blocks)
block_shuffled = np.concatenate(blocks)

axes[2].plot(np.arange(len(block_shuffled)), block_shuffled, 'g-', linewidth=1.5)
axes[2].set_title(f'Block Shuffled (b={block_size})\n(Local Structure Preserved)', 
                  fontsize=11, fontweight='bold', color='green')
axes[2].set_xlabel('Time')
axes[2].set_ylabel('Value')

plt.tight_layout()
plt.show()

# Show ACF comparison
print("\nAutocorrelation at lag 1:")
print(f"  Original:      {np.corrcoef(original[1:], original[:-1])[0,1]:.3f}")
print(f"  IID Shuffled:  {np.corrcoef(iid_shuffled[1:], iid_shuffled[:-1])[0,1]:.3f} (destroyed)")
print(f"  Block Shuffled: {np.corrcoef(block_shuffled[1:], block_shuffled[:-1])[0,1]:.3f} (preserved)")

In [None]:
# Compare block vs IID permutation results
print("Block vs IID Permutation Comparison")
print("=" * 60)

# Create legitimate features for a persistent series
high_persist = generate_ar1(n=400, phi=0.95, seed=123)
X_legit, y_legit = create_lag_features(high_persist, n_lags=5)

model = Ridge(alpha=1.0)

# Test with block permutation (correct for time series)
print("\n1. Block Permutation (preserves autocorrelation):")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result_block = gate_signal_verification(
        model=model,
        X=X_legit,
        y=y_legit,
        method='permutation',
        permutation='block',
        n_shuffles=100,
        random_state=42
    )
print(f"   Status: {result_block.status.value}")
print(f"   p-value: {result_block.details.get('pvalue', float('nan')):.4f}")

# Test with IID permutation (may produce false positive)
print("\n2. IID Permutation (destroys autocorrelation):")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result_iid = gate_signal_verification(
        model=model,
        X=X_legit,
        y=y_legit,
        method='permutation',
        permutation='iid',
        n_shuffles=100,
        random_state=42
    )
print(f"   Status: {result_iid.status.value}")
print(f"   p-value: {result_iid.details.get('pvalue', float('nan')):.4f}")

print("\n[T1] Block permutation is the correct choice for time series.")
print("     IID permutation may flag legitimate models as leaky.")

---

## Section 5: Effect Size vs Permutation Methods

### Two Modes for Different Use Cases

| Method | Speed | Rigor | Use Case |
|--------|-------|-------|----------|
| `effect_size` | Fast (5 shuffles) | Heuristic | Development, quick checks |
| `permutation` | Slower (100+ shuffles) | Statistical p-value | Publication, final validation |

### Effect Size Mode
- Computes: `improvement_ratio = 1 - (model_mae / mean_shuffled_mae)`
- HALT if: `improvement_ratio > threshold` (default 5%)
- Fast but heuristic

### Permutation Mode [T1]
- Computes exact p-value: `(1 + count(shuffled ≤ original)) / (1 + n_shuffles)`
- HALT if: `p-value < alpha` (default 0.05)
- Statistically rigorous
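
The effect-size rule can be written out directly from the formula above. The sketch below only restates that formula; the function names are hypothetical, the library's internals may differ, and the MAE values are made up.

```python
def improvement_ratio(model_mae, shuffled_maes):
    """1 - model_mae / mean(shuffled_maes), as defined above."""
    mean_shuffled = sum(shuffled_maes) / len(shuffled_maes)
    return 1.0 - model_mae / mean_shuffled

def effect_size_halts(model_mae, shuffled_maes, threshold=0.05):
    """HALT when the improvement over the shuffled baseline exceeds the threshold."""
    return improvement_ratio(model_mae, shuffled_maes) > threshold

shuffled = [1.0, 1.0, 1.0, 1.0, 1.0]          # five shuffles, as in effect_size mode
ratio = improvement_ratio(0.5, shuffled)      # model MAE is half the shuffled MAE
halt_big = effect_size_halts(0.5, shuffled)   # large improvement
halt_small = effect_size_halts(0.98, shuffled)  # about 2% improvement
```

With only five shuffles the mean shuffled MAE is a noisy estimate, which is why this mode is treated as a heuristic rather than a test.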

In [None]:
# Compare effect_size vs permutation methods
print("Effect Size vs Permutation Methods")
print("=" * 60)

model = Ridge(alpha=1.0)

# Test on leaky features with both methods
print("\nTesting LEAKY features:")

# Effect size (fast)
import time
start = time.time()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result_effect = gate_signal_verification(
        model=model,
        X=X_leaky,
        y=y_clean,
        method='effect_size',
        random_state=42
    )
time_effect = time.time() - start

print(f"\n1. Effect Size Method:")
print(f"   Status: {result_effect.status.value}")
print(f"   Improvement: {result_effect.details.get('improvement_ratio', 0)*100:.1f}%")
print(f"   Time: {time_effect:.2f}s")

# Permutation (rigorous)
start = time.time()
result_perm = gate_signal_verification(
    model=model,
    X=X_leaky,
    y=y_clean,
    method='permutation',
    n_shuffles=100,
    random_state=42
)
time_perm = time.time() - start

print(f"\n2. Permutation Method:")
print(f"   Status: {result_perm.status.value}")
print(f"   p-value: {result_perm.details.get('pvalue', float('nan')):.4f}")
print(f"   Time: {time_perm:.2f}s")

print(f"\nRecommendation:")
print(f"  - Use effect_size during development (fast feedback)")
print(f"  - Use permutation for final validation (statistically rigorous)")

In [None]:
# Demonstrate strict mode for publication
print("Strict Mode for Publication")
print("=" * 60)

print("\nStrict mode ensures adequate statistical power:")
print("- Overrides n_shuffles to ≥199")
print("- Provides p-value resolution of 0.005")
print("- Recommended for publication-quality results")

# Without strict mode
print("\nWithout strict (n_shuffles=100):")
result_normal = gate_signal_verification(
    model=model,
    X=X_clean,
    y=y_clean,
    method='permutation',
    n_shuffles=100,
    strict=False,
    random_state=42
)
print(f"   n_shuffles used: {result_normal.details.get('n_shuffles', 100)}")
print(f"   Min p-value resolution: 1/{100+1} ≈ {1/101:.4f}")

# With strict mode
print("\nWith strict=True:")
result_strict = gate_signal_verification(
    model=model,
    X=X_clean,
    y=y_clean,
    method='permutation',
    n_shuffles=100,  # Will be overridden
    strict=True,
    random_state=42
)
print(f"   n_shuffles used: {result_strict.details.get('n_shuffles', 199)}")
print(f"   Min p-value resolution: 1/{199+1} = {1/200:.4f}")

---

## Section 6: Interpreting HALT Results

When `gate_signal_verification` returns HALT, your features encode target position.

### Common Causes

1. **Direct target leakage**: Features computed from target values
2. **Rolling stats on full series**: Statistics computed before train/test split
3. **Feature selection on target**: Using target to select features
4. **Centered windows**: Using future values in window calculations

### Debugging Steps

1. Check feature-target correlations
2. Review feature engineering pipeline
3. Ensure train/test separation happens BEFORE feature computation
4. Use backward-only windows for rolling statistics
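
Step 3 is the one most often missed with scaling and other fitted statistics. As a minimal sketch, assuming numpy and a hypothetical helper `standardize_train_only`, the mean and standard deviation are estimated on the training rows only and then applied to every row:

```python
import numpy as np

def standardize_train_only(x, train_end):
    """Scale x with mean/std estimated on x[:train_end] only (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    mu = x[:train_end].mean()
    sd = x[:train_end].std()
    return (x - mu) / sd

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # the last point is the test observation
z = standardize_train_only(x, train_end=4)
# The training part has mean 0 and std 1; the test point is scaled with the
# same statistics and had no influence on them.
```

Computing `mu` and `sd` on all five points instead would let the test value of 100 shift the scaling applied to the training rows.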

In [None]:
# Debugging workflow when HALT is triggered
def debug_leakage(X, y, feature_names=None):
    """Debug potential feature-target leakage."""
    if feature_names is None:
        feature_names = [f"Feature_{i}" for i in range(X.shape[1])]
    
    print("LEAKAGE DEBUGGING REPORT")
    print("=" * 60)
    
    # Step 1: Check correlations
    print("\n1. Feature-Target Correlations:")
    print("-" * 40)
    suspicious = []
    for i in range(X.shape[1]):
        corr = np.corrcoef(X[:, i], y)[0, 1]
        status = "SUSPICIOUS!" if abs(corr) > 0.8 else ("Warning" if abs(corr) > 0.5 else "OK")
        print(f"   {feature_names[i]}: r = {corr:+.3f} [{status}]")
        if abs(corr) > 0.5:
            suspicious.append(feature_names[i])
    
    # Step 2: Recommendations
    print(f"\n2. Recommendations:")
    print("-" * 40)
    if suspicious:
        print(f"   Investigate these features: {suspicious}")
        print(f"   \n   Possible causes:")
        print(f"   - Feature computed from target values")
        print(f"   - Rolling statistics on full series")
        print(f"   - Centered window including future")
    else:
        print(f"   No obviously suspicious correlations found.")
        print(f"   Check for subtle leakage in feature engineering pipeline.")
    
    return suspicious

# Run debugging on leaky features
suspicious = debug_leakage(
    X_leaky, y_clean, 
    feature_names=['target+noise', 'target*0.5+noise', 'pure_noise']
)

---

## Pitfall Section

### Pitfall 1: n_shuffles Too Low

```python
# WRONG: Too few shuffles for statistical power
result = gate_signal_verification(model, X, y, n_shuffles=5)  # Can't get p < 0.1!

# RIGHT: Enough shuffles for desired significance
result = gate_signal_verification(model, X, y, n_shuffles=100)  # Can get p < 0.01

# BETTER: Use strict mode for publication
result = gate_signal_verification(model, X, y, strict=True)  # ≥199 shuffles
```

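**Rule of thumb [T1]:** the smallest attainable p-value is 1 / (n_shuffles + 1), so
- n_shuffles ≥ 20 for p < 0.05 (19 shuffles give exactly p = 0.05, which never falls below α = 0.05)
- n_shuffles ≥ 100 for p < 0.01
- n_shuffles ≥ 199 for p ≤ 0.005 (strict mode)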
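
Because the smallest attainable p-value is 1 / (n_shuffles + 1), the shuffle count needed for a strict p < α follows directly. A small sketch using exact fractions to avoid floating-point edge cases (the helper names are hypothetical):

```python
from fractions import Fraction
import math

def min_pvalue(n_shuffles):
    """Smallest p-value the formula (1 + count) / (1 + n_shuffles) can produce."""
    return Fraction(1, n_shuffles + 1)

def shuffles_needed(alpha):
    """Smallest n_shuffles whose minimum p-value is strictly below alpha."""
    a = Fraction(str(alpha))  # exact decimal, e.g. 0.05 -> 1/20
    return math.floor(1 / a)

needed = {alpha: shuffles_needed(alpha) for alpha in (0.05, 0.01, 0.005)}
```

With exactly 19 shuffles the minimum is 1/20 = 0.05, which equals α = 0.05 but does not fall below it, so a `p < alpha` rule could never trigger.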

### Pitfall 2: Using IID Permutation on Time Series

```python
# WRONG: IID permutation on autocorrelated data
result = gate_signal_verification(model, X, y, permutation='iid')  # False positives!

# RIGHT: Block permutation preserves local structure
result = gate_signal_verification(model, X, y, permutation='block')  # Default
```

### Pitfall 3: Ignoring HALT

```python
# WRONG: Ignore the warning
result = gate_signal_verification(model, X, y)
if result.status == GateStatus.HALT:
    pass  # Hope for the best?

# RIGHT: Investigate and fix
result = gate_signal_verification(model, X, y)
if result.status == GateStatus.HALT:
    raise ValueError(f"Leakage detected: {result.message}")
```

In [None]:
# Demonstrate the n_shuffles pitfall
print("Pitfall: n_shuffles and p-value Resolution")
print("=" * 60)

print("\nMinimum p-value achievable with different n_shuffles:")
print("-" * 40)
for n in [5, 10, 19, 50, 99, 199]:
    min_p = 1 / (n + 1)
    can_achieve = "p < 0.05" if min_p < 0.05 else "p < 0.1" if min_p < 0.1 else "p ≥ 0.1"
    print(f"   n_shuffles={n:3d} → min p = {min_p:.4f} ({can_achieve})")

print("\n[T1] Per Phipson & Smyth (2010): p = (1 + count) / (1 + n_shuffles)")

---

## Complete Example: Validation Pipeline

Here's how to integrate `gate_signal_verification` into your validation workflow:

In [None]:
def validate_features(model, X, y, feature_names=None, strict=False):
    """
    Validate features for leakage using shuffled target test.
    
    Parameters
    ----------
    model : estimator
        sklearn-compatible model
    X : array-like
        Features
    y : array-like
        Target
    feature_names : list, optional
        Names for debugging
    strict : bool, default=False
        If True, use publication-quality settings
        
    Returns
    -------
    dict
        Validation results
    """
    # Run gate
    result = gate_signal_verification(
        model=model,
        X=X,
        y=y,
        method='permutation',
        permutation='block',
        strict=strict,
        random_state=42
    )
    
    # Build report - using current API keys (mae_real, mae_shuffled_avg)
    report = {
        'status': result.status.value,
        'passed': result.status == GateStatus.PASS,
        'pvalue': result.details.get('pvalue'),
        'mae_real': result.details.get('mae_real'),
        'mae_shuffled_avg': result.details.get('mae_shuffled_avg'),
        'message': result.message,
    }
    
    # If failed, add debugging info
    if result.status == GateStatus.HALT:
        report['suspicious_features'] = debug_leakage(X, y, feature_names)
    
    return report

# Test on clean and leaky features
print("FEATURE VALIDATION RESULTS")
print("=" * 60)

model = Ridge(alpha=1.0)

print("\n1. Clean Features:")
report_clean = validate_features(
    model, X_clean, y_clean,
    feature_names=[f'lag_{i}' for i in range(1, 6)]
)
print(f"   Status: {report_clean['status']}")
print(f"   Passed: {report_clean['passed']}")

print("\n" + "="*60)
print("\n2. Leaky Features:")
report_leaky = validate_features(
    model, X_leaky, y_clean,
    feature_names=['target+noise', 'target*0.5+noise', 'pure_noise']
)
print(f"\n   Status: {report_leaky['status']}")
print(f"   Passed: {report_leaky['passed']}")

---

## Key Insights

### 1. The Shuffled Target Test is Definitive [T1]
If your model beats a shuffled target, your features encode target position. Period.

### 2. Use Block Permutation for Time Series [T1]
Block permutation preserves local autocorrelation. IID permutation produces false positives.

### 3. n_shuffles Determines p-value Resolution [T1]
- The smallest attainable p-value is 1 / (n_shuffles + 1)
- n_shuffles ≥ 20 for p < 0.05 (19 shuffles give exactly p = 0.05)
- n_shuffles ≥ 100 for p < 0.01
- Use `strict=True` for publication

### 4. Effect Size for Development, Permutation for Production
Quick checks during development, rigorous testing before deployment.

### 5. HALT Means Stop and Investigate
Never ignore a HALT. Find and fix the leakage source.

---

## Next Steps

- **06_feature_engineering_pitfalls.ipynb**: Safe lag features and rolling stats
- **07_threshold_leakage.ipynb**: Regime and percentile computation
- **08_validation_workflow.ipynb**: Complete HALT/WARN/PASS pipeline

---

*"If your model beats shuffled data, it's not skill — it's leakage."*