# 08: Complete Validation Workflow

**Capstone: HALT/WARN/PASS Pipeline with `run_gates`**

---

## 🚨 If You Know sklearn But Not Systematic Validation, Read This First

**What you already know (from standard ML)**:
- Train a model, compute MAE, done
- Lower error = better model
- Cross-validation protects against overfitting
- If accuracy is high, ship it!

**What's different with time series validation**:

| Standard ML Approach | Time Series Reality |
|---------------------|---------------------|
| "MAE is 0.05, ship it!" | 0.05 means nothing without baseline |
| "Cross-validation passed" | KFold leaks future data |
| "Model beats baseline" | Maybe it learned to predict y[t-1] |
| "95% accuracy overall" | But 0% accuracy in volatility spikes |
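
The "KFold leaks future data" row can be checked with plain index arithmetic. The sketch below uses only the standard library; `contiguous_kfold`, `walk_forward`, and `sees_future` are illustrative helpers, not sklearn or temporalcv functions. It flags every fold whose training set contains an index later than the first test index.

```python
def contiguous_kfold(n, k):
    """Yield (train, test) index lists for unshuffled K-fold blocks."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in test]
        yield train, test


def walk_forward(n, k):
    """Yield (train, test) index lists for an expanding-window split."""
    fold = n // (k + 1)
    for i in range(1, k + 1):
        yield list(range(i * fold)), list(range(i * fold, (i + 1) * fold))


def sees_future(train, test):
    """True if any training index is later than the first test index."""
    return max(train) > min(test)


kfold_leaks = [sees_future(tr, te) for tr, te in contiguous_kfold(20, 4)]
wf_leaks = [sees_future(tr, te) for tr, te in walk_forward(20, 4)]
print("K-fold sees future:      ", kfold_leaks)
print("walk-forward sees future:", wf_leaks)
```

On 20 observations and 4 folds, three of the four K-fold splits train on observations that come after the test block, while none of the expanding-window splits do.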

**The systematic approach** (this notebook):

```
Step 1: EXTERNAL VALIDATION (hardest to game)
  └─ gate_signal_verification: Features leak target? → HALT
  └─ gate_synthetic_ar1: Beat theoretical bounds? → HALT

Step 2: INTERNAL VALIDATION (sanity checks)  
  └─ gate_suspicious_improvement: >20% better? → HALT
  └─ gate_temporal_boundary: Gap < horizon? → HALT

Step 3: STRATIFIED VALIDATION (hidden failures)
  └─ Per-regime metrics: Pass overall but fail in HIGH vol? → WARN
```

**Priority rule**: Any HALT stops the pipeline. Warnings are logged, but the pipeline proceeds.

---

## What You'll Learn

1. **GateStatus hierarchy** — HALT > WARN > PASS > SKIP priority order
2. **Gate execution order** — External validation before internal validation
3. **Aggregated validation** — `run_gates()` and `run_gates_stratified()` for systematic checks

**Prerequisites**: Notebooks 01-07 (all Tier 1 + Tier 2 concepts)

---

## The Problem: Ad-Hoc Validation Misses Systematic Issues

Consider a typical validation workflow:

```python
# Ad-hoc approach (common but dangerous)
mae = mean_absolute_error(y_test, predictions)
if mae < 0.1:
    print("Model is good!")
```

**Problems with ad-hoc validation:**

1. **No baseline comparison** — 0.1 MAE means nothing without context
2. **No leakage detection** — Features may encode target position
3. **No regime awareness** — Model may fail in specific conditions
4. **No consistent severity** — When should we halt vs. warn?

**temporalcv's solution**: A systematic pipeline with defined severity levels and execution order.
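
To see why a raw MAE threshold is not enough, here is a small standard-library sketch that compares a model against a persistence baseline (forecast `y[t]` with `y[t-1]`). The series and the model forecasts are made-up numbers chosen for illustration, and `mae` is a local helper.

```python
def mae(actual, predicted):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


y = [1.00, 1.10, 1.05, 1.20, 1.15, 1.30]   # toy series (hypothetical)
model_pred = [1.01, 1.14, 1.11, 1.24, 1.21]  # hypothetical forecasts for y[1:]

actual = y[1:]
persistence_pred = y[:-1]  # persistence: repeat the previous observation

model_mae = mae(actual, model_pred)
baseline_mae = mae(actual, persistence_pred)
improvement = 1 - model_mae / baseline_mae  # relative improvement over baseline

print(f"Ad-hoc check passes: {model_mae < 0.1}")
print(f"Model MAE: {model_mae:.3f}, persistence MAE: {baseline_mae:.3f}")
print(f"Improvement over persistence: {improvement:.0%}")
```

The model clears the ad-hoc `mae < 0.1` check, yet it is only about 10% better than repeating the last value, which is the context the raw number hides.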

In [None]:
import numpy as np
from sklearn.linear_model import Ridge

from temporalcv.gates import (
    GateStatus,
    GateResult,
    ValidationReport,
    StratifiedValidationReport,
    gate_signal_verification,
    gate_suspicious_improvement,
    gate_temporal_boundary,
    run_gates,
    run_gates_stratified,
)
from temporalcv.cv import WalkForwardCV

np.random.seed(42)
print("temporalcv validation workflow")

---

## 1. GateStatus: The Severity Hierarchy

Every gate returns one of four status values, with strict priority order:

| Status | Priority | Meaning | Action |
|--------|----------|---------|--------|
| **HALT** | 1 (highest) | Critical failure | Stop and investigate |
| **WARN** | 2 | Caution flag | Continue but verify |
| **PASS** | 3 | Validation passed | Safe to proceed |
| **SKIP** | 4 (lowest) | Insufficient data | Cannot run gate |

**Key principle**: Any HALT overrides all other statuses.
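
The priority table can be read as a "lowest priority number wins" rule. The following is a minimal sketch of that rule using a stand-in enum; `Status`, `PRIORITY`, and `overall_status` are hypothetical names, and the real `run_gates()` may treat edge cases (for example, a list containing only SKIP) differently.

```python
from enum import Enum


class Status(Enum):  # hypothetical stand-in for GateStatus
    HALT = "HALT"
    WARN = "WARN"
    PASS = "PASS"
    SKIP = "SKIP"


# Priority numbers copied from the table above (1 = highest).
PRIORITY = {Status.HALT: 1, Status.WARN: 2, Status.PASS: 3, Status.SKIP: 4}


def overall_status(statuses):
    """Return the status with the highest priority (smallest number)."""
    return min(statuses, key=PRIORITY.__getitem__)


print(overall_status([Status.PASS, Status.WARN, Status.SKIP]).name)
print(overall_status([Status.PASS, Status.HALT, Status.WARN]).name)
```

The function raises `ValueError` on an empty list, which is a deliberate choice in this sketch: an empty set of gates should not silently count as a pass.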

In [None]:
# GateStatus is an Enum with defined values
print("GateStatus values:")
for status in GateStatus:
    print(f"  {status.name}: {status.value}")

### GateResult: What Each Gate Returns

Every gate function returns a `GateResult` dataclass with:

- `name`: Gate identifier (e.g., "shuffled_target")
- `status`: GateStatus (HALT/WARN/PASS/SKIP)
- `message`: Human-readable description
- `metric_value`: Primary metric for this gate
- `threshold`: Threshold used for decision
- `details`: Additional diagnostics
- `recommendation`: What to do if not PASS
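
As a mental model, the fields above map onto a small dataclass. This is an illustrative stand-in only: the field names come from the list, but the types, defaults, and the `describe` helper are assumptions, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class GateResultSketch:  # hypothetical; not temporalcv's GateResult
    name: str
    status: str
    message: str
    metric_value: Optional[float] = None
    threshold: Optional[float] = None
    details: Dict[str, Any] = field(default_factory=dict)
    recommendation: str = ""

    def describe(self) -> str:
        """One-line summary suitable for a log entry."""
        return f"[{self.status}] {self.name}: {self.message}"


example = GateResultSketch(
    name="suspicious_improvement",
    status="PASS",
    message="Improvement within expected range",
    metric_value=0.10,
    threshold=0.20,
)
print(example.describe())
```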

In [None]:
# Example: gate_suspicious_improvement returns a GateResult
result = gate_suspicious_improvement(
    model_metric=0.08,      # Model MAE
    baseline_metric=0.10,   # Persistence MAE
)

print(f"Name: {result.name}")
print(f"Status: {result.status}")
print(f"Message: {result.message}")
print(f"Metric value: {result.metric_value:.2%}")
print(f"Threshold: {result.threshold:.0%}")
print(f"Recommendation: {result.recommendation or '(none)'}")

---

## 2. Gate Execution Order: External Before Internal

**Critical principle**: Run external validation gates BEFORE internal validation gates.

### Stage 1: External Validation (Hardest to Game)

| Gate | What It Detects |
|------|----------------|
| `gate_signal_verification` | Features that encode target position |
| `gate_synthetic_ar1` | Model beating theoretical bounds |
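
To make "theoretical bounds" concrete: for an AR(1) process with Gaussian noise, the best one-step forecast is `phi * y[t-1]`, and its error is just the noise term, so its expected MAE is `sigma * sqrt(2 / pi)`. The sketch below checks this by simulation with the standard library. Whether `gate_synthetic_ar1` uses exactly this bound is an assumption; `simulate_ar1` is a local helper.

```python
import math
import random


def simulate_ar1(n, phi, sigma, seed=0):
    """Simulate y[t] = phi * y[t-1] + e[t] with e[t] ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, sigma)]
    for _ in range(1, n):
        y.append(phi * y[-1] + rng.gauss(0.0, sigma))
    return y


phi, sigma = 0.9, 0.1
y = simulate_ar1(20_000, phi, sigma)

# Oracle forecast: knows the true phi, so its error equals the noise term.
oracle_errors = [abs(y[t] - phi * y[t - 1]) for t in range(1, len(y))]
oracle_mae = sum(oracle_errors) / len(oracle_errors)

# E|e| for e ~ N(0, sigma^2) is sigma * sqrt(2 / pi).
theoretical_mae = sigma * math.sqrt(2 / math.pi)

print(f"Oracle MAE:      {oracle_mae:.4f}")
print(f"Theoretical MAE: {theoretical_mae:.4f}")
```

A model that reports a one-step MAE clearly below this value on synthetic AR(1) data is using information it should not have.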

### Stage 2: Internal Validation (Sanity Checks)

| Gate | What It Detects |
|------|----------------|
| `gate_suspicious_improvement` | Improvement too good to be true |
| `gate_temporal_boundary` | Gap enforcement violations |

**Why this order?** External gates are harder to accidentally "pass" with leaky data.

In [None]:
# Generate synthetic data
def generate_ar1(n: int, phi: float = 0.9, sigma: float = 0.1) -> np.ndarray:
    """Generate AR(1) process: y[t] = phi * y[t-1] + epsilon[t]"""
    y = np.zeros(n)
    y[0] = np.random.normal(0, sigma)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + np.random.normal(0, sigma)
    return y

# Create clean data
n = 500
y = generate_ar1(n, phi=0.9)

# Create CLEAN features (lagged values only)
X_clean = np.column_stack([y[:-2], y[1:-1]])
y_target = y[2:]

# Create LEAKY features (encode target position)
noise = np.random.normal(0, 0.01, len(y_target))
X_leaky = np.column_stack([y_target + noise, y_target * 0.5 + noise])

print(f"Clean features shape: {X_clean.shape}")
print(f"Leaky features shape: {X_leaky.shape}")

In [None]:
# Stage 1: External validation (shuffled target)
model = Ridge()

print("="*60)
print("STAGE 1: External Validation")
print("="*60)

# Test clean features
result_clean = gate_signal_verification(
    model, X_clean, y_target, 
    method="effect_size",  # Quick check
    random_state=42
)
print(f"\nClean features: {result_clean}")

# Test leaky features
result_leaky = gate_signal_verification(
    model, X_leaky, y_target,
    method="effect_size",
    random_state=42
)
print(f"Leaky features: {result_leaky}")

In [None]:
# Stage 2: Internal validation (suspicious improvement)
# Only run if Stage 1 passes!

print("="*60)
print("STAGE 2: Internal Validation")
print("="*60)

# First check if Stage 1 passed
if result_clean.status == GateStatus.HALT:
    print("\nHALT: Skipping Stage 2 - features leaked!")
else:
    # Fit model and compute metrics
    train_size = int(len(y_target) * 0.8)
    X_train, y_train = X_clean[:train_size], y_target[:train_size]
    X_test, y_test = X_clean[train_size:], y_target[train_size:]
    
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    model_mae = np.mean(np.abs(y_test - predictions))
    # Persistence baseline: forecast y[t] with y[t-1], so its errors are the first differences
    persistence_mae = np.mean(np.abs(np.diff(y_test)))
    
    result_improvement = gate_suspicious_improvement(
        model_metric=model_mae,
        baseline_metric=persistence_mae
    )
    print(f"\n{result_improvement}")
    print(f"\nModel MAE: {model_mae:.4f}")
    print(f"Persistence MAE: {persistence_mae:.4f}")

---

## 3. Aggregating Gates with `run_gates()`

`run_gates()` combines multiple gate results into a single `ValidationReport`.

**Key features:**
- Returns overall status (HALT if any gate HALTs, otherwise WARN if any gate WARNs)
- Provides `.failures` and `.warnings` properties
- Generates human-readable `.summary()`

In [None]:
# Run multiple gates and aggregate
def validate_model(model, X, y, X_train, y_train, X_test, y_test):
    """Run complete validation pipeline."""
    gates = []
    
    # Stage 1: External validation
    gates.append(gate_signal_verification(
        model, X, y,
        method="effect_size",
        random_state=42
    ))
    
    # Stage 2: Internal validation
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    model_mae = np.mean(np.abs(y_test - predictions))
    persistence_mae = np.mean(np.abs(np.diff(y_test)))
    
    gates.append(gate_suspicious_improvement(
        model_metric=model_mae,
        baseline_metric=persistence_mae
    ))
    
    # Check temporal boundary
    # Note: train_end and test_start are inclusive indices
    train_end = len(X_train) - 1   # Last training index (inclusive)
    test_start = len(X_train)       # First test index (inclusive)
    
    # The ACTUAL gap in observations skipped = test_start - train_end - 1
    # Here: actual_gap = len(X_train) - (len(X_train) - 1) - 1 = 0
    # This means train and test are adjacent (no observations skipped)
    actual_gap_in_data = test_start - train_end - 1
    
    gates.append(gate_temporal_boundary(
        train_end_idx=train_end,
        test_start_idx=test_start,
        horizon=1,
        extra_gap=0  # No additional gap beyond horizon is required here
    ))
    
    return run_gates(gates)

# Define train/test split
train_size = int(len(y_target) * 0.8)

# Validate clean model
report = validate_model(
    model=Ridge(),
    X=X_clean, y=y_target,
    X_train=X_clean[:train_size], y_train=y_target[:train_size],
    X_test=X_clean[train_size:], y_test=y_target[train_size:]
)

print(report.summary())

In [None]:
# ValidationReport properties
print(f"Overall status: {report.status}")
print(f"Number of failures: {len(report.failures)}")
print(f"Number of warnings: {len(report.warnings)}")

# Access individual gates
for gate in report.gates:
    print(f"\n{gate.name}:")
    print(f"  Status: {gate.status.value}")
    print(f"  Metric: {gate.metric_value}")

### Handling Failures Programmatically

Use the report status to control pipeline flow:

In [None]:
def deploy_or_halt(report: ValidationReport) -> bool:
    """Decision logic based on validation report."""
    
    if report.status == "HALT":
        print("DEPLOYMENT BLOCKED")
        print("\nFailures:")
        for failure in report.failures:
            print(f"  - {failure.name}: {failure.message}")
            print(f"    Recommendation: {failure.recommendation}")
        return False
    
    elif report.status == "WARN":
        print("DEPLOYMENT ALLOWED (with warnings)")
        print("\nWarnings:")
        for warning in report.warnings:
            print(f"  - {warning.name}: {warning.message}")
        return True  # Proceed but log warnings
    
    else:
        print("DEPLOYMENT APPROVED")
        return True

# Test with clean data
deploy_or_halt(report)

---

## 4. Regime-Stratified Validation

**Key insight**: Models may pass overall but fail in specific regimes.

`run_gates_stratified()` runs validation both overall AND per-regime:

- **Overall**: All gates on full dataset
- **Per-regime**: Numeric gates (suspicious_improvement) on each regime

**Why stratify?** Aggregate metrics hide regime-specific failures.
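
A toy calculation shows how an aggregate can mask a regime failure. The numbers below are invented for illustration and `mae_by_regime` is a local helper, not part of temporalcv.

```python
from collections import defaultdict


def mae_by_regime(actuals, predictions, regimes):
    """Group absolute errors by regime label and average each group."""
    errors = defaultdict(list)
    for a, p, r in zip(actuals, predictions, regimes):
        errors[r].append(abs(a - p))
    return {r: sum(e) / len(e) for r, e in errors.items()}


# Hypothetical test window: 8 calm observations, 2 turbulent ones.
regimes = ["LOW"] * 8 + ["HIGH"] * 2
actuals = [1.0] * 8 + [2.0] * 2
predictions = [1.01] * 8 + [1.5] * 2

overall = sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)
per_regime = mae_by_regime(actuals, predictions, regimes)

print(f"Overall MAE: {overall:.3f}")
for regime, value in per_regime.items():
    print(f"{regime} MAE: {value:.3f}")
```

The overall MAE is dominated by the eight calm points, while the HIGH regime error is fifty times the LOW regime error.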

In [None]:
# Generate data with volatility regimes
np.random.seed(123)

# Low volatility regime (periods 0-150)
y_low = generate_ar1(150, phi=0.9, sigma=0.05)

# High volatility regime (periods 150-350) 
y_high = generate_ar1(200, phi=0.9, sigma=0.3)
y_high = y_high + y_low[-1]  # Continue from previous value

# Medium volatility regime (periods 350-500)
y_med = generate_ar1(150, phi=0.9, sigma=0.15)
y_med = y_med + y_high[-1]

# Combine
y_regimes = np.concatenate([y_low, y_high, y_med])

# Create regime labels
regime_labels = np.array(
    ["LOW"] * 150 + ["HIGH"] * 200 + ["MEDIUM"] * 150
)

# Create features
X_regimes = np.column_stack([y_regimes[:-2], y_regimes[1:-1]])
y_regimes_target = y_regimes[2:]
regime_labels = regime_labels[2:]  # Align with target

print(f"Data shape: {X_regimes.shape}")
print(f"Regime counts: {np.unique(regime_labels, return_counts=True)}")

In [None]:
# Train model and get predictions
train_size = int(len(y_regimes_target) * 0.8)
X_train = X_regimes[:train_size]
y_train = y_regimes_target[:train_size]
X_test = X_regimes[train_size:]
y_test = y_regimes_target[train_size:]
regimes_test = regime_labels[train_size:]

model = Ridge()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

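# Compute MAE by regime
# Note: this chronological 80/20 split puts the test window at original
# indices 400-499, which are all labeled MEDIUM, so only MEDIUM prints here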
for regime in ["LOW", "MEDIUM", "HIGH"]:
    mask = regimes_test == regime
    if mask.sum() > 0:
        mae = np.mean(np.abs(y_test[mask] - predictions[mask]))
        print(f"{regime} regime MAE: {mae:.4f} (n={mask.sum()})")

In [None]:
# Run stratified validation
# First, compute overall gates
overall_gates = [
    gate_signal_verification(
        model, X_regimes, y_regimes_target,
        method="effect_size",
        random_state=42
    ),
    gate_suspicious_improvement(
        model_metric=np.mean(np.abs(y_test - predictions)),
        baseline_metric=np.mean(np.abs(np.diff(y_test)))
    )
]

# Run stratified gates
stratified_report = run_gates_stratified(
    overall_gates=overall_gates,
    actuals=y_test,
    predictions=predictions,
    regimes=regimes_test,  # Use our regime labels
    min_n_per_regime=10
)

print(stratified_report.summary())

### Auto-Detect Volatility Regimes

If you don't have regime labels, use `regimes="auto"` to classify based on volatility:

In [None]:
# Auto-detect regimes from actuals
auto_report = run_gates_stratified(
    overall_gates=overall_gates,
    actuals=y_test,
    predictions=predictions,
    regimes="auto",  # Auto-classify volatility
    volatility_window=13,  # ~1 quarter [T3]
    min_n_per_regime=10
)

print("Regime counts (auto-detected):")
for regime, count in auto_report.regime_counts.items():
    print(f"  {regime}: {count}")

if auto_report.masked_regimes:
    print(f"\nMasked regimes (n < 10): {auto_report.masked_regimes}")
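
One plausible way to classify volatility automatically is to compute a trailing standard deviation of first differences and cut it into terciles. The sketch below illustrates that idea with the standard library; it is an assumption about the general approach, not a description of what `regimes="auto"` does internally, and `rolling_volatility` and `label_regimes` are hypothetical helpers.

```python
import itertools
import statistics


def rolling_volatility(values, window):
    """Trailing std. dev. of first differences; None until the window is full."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    vols = [None] * len(values)
    for t in range(window, len(values)):
        vols[t] = statistics.pstdev(diffs[t - window:t])  # uses data up to t only
    return vols


def label_regimes(vols):
    """Label each known volatility as LOW / MEDIUM / HIGH by tercile."""
    known = sorted(v for v in vols if v is not None)
    lo_cut = known[len(known) // 3]
    hi_cut = known[2 * len(known) // 3]
    labels = []
    for v in vols:
        if v is None:
            labels.append(None)
        elif v < lo_cut:
            labels.append("LOW")
        elif v < hi_cut:
            labels.append("MEDIUM")
        else:
            labels.append("HIGH")
    return labels


# Toy series whose step size grows over time, so volatility rises steadily.
steps = [(1 if i % 2 else -1) * 0.01 * (i + 1) for i in range(60)]
values = [0.0] + list(itertools.accumulate(steps))

labels = label_regimes(rolling_volatility(values, window=5))
print({name: labels.count(name) for name in ("LOW", "MEDIUM", "HIGH")})
```

The tercile cut points here are computed on the evaluation window purely for reporting. If regime labels feed into the model as features, the thresholds should come from training data only.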

---

## 5. Complete Validation Template

Here's a reusable validation function that applies the gate ordering from the sections above:

In [None]:
from typing import Optional, Union

def full_validation_pipeline(
    model,
    X: np.ndarray,
    y: np.ndarray,
    train_end_idx: int,
    horizon: int = 1,
    gap: int = 0,
    regimes: Optional[Union[np.ndarray, str]] = None,
    random_state: int = 42,
) -> Union[ValidationReport, StratifiedValidationReport]:
    """
    Run complete validation pipeline with proper gate ordering.
    
    Parameters
    ----------
    model : sklearn estimator
        Model with fit/predict methods
    X : np.ndarray
        Feature matrix
    y : np.ndarray
        Target values
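    train_end_idx : int
        Exclusive end of the training slice: rows [0, train_end_idx) are
        training data, and row train_end_idx is the first test row
    horizon : int
        Forecast horizon
    gap : int
        Extra gap beyond the horizon, passed to gate_temporal_boundary as
        extra_gap (the split itself is adjacent and does not skip rows)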
    regimes : array | "auto" | None
        Regime labels for stratified validation
    random_state : int
        Random seed for reproducibility
    
    Returns
    -------
    ValidationReport or StratifiedValidationReport
    """
    # Split data
    X_train, y_train = X[:train_end_idx], y[:train_end_idx]
    X_test, y_test = X[train_end_idx:], y[train_end_idx:]
    
    # Stage 1: External validation (FIRST)
    gates = []
    
    # 1a. Shuffled target gate
    shuffled_result = gate_signal_verification(
        model, X, y,
        method="effect_size",  # Quick check
        random_state=random_state
    )
    gates.append(shuffled_result)
    
    # HALT early if external validation fails
    if shuffled_result.status == GateStatus.HALT:
        return run_gates(gates)
    
    # Stage 2: Internal validation (only if Stage 1 passes)
    # Fit model
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    # 2a. Suspicious improvement
    model_mae = np.mean(np.abs(y_test - predictions))
    persistence_mae = np.mean(np.abs(np.diff(y_test)))
    
    gates.append(gate_suspicious_improvement(
        model_metric=model_mae,
        baseline_metric=persistence_mae
    ))
    
    # 2b. Temporal boundary
    gates.append(gate_temporal_boundary(
        train_end_idx=train_end_idx - 1,
        test_start_idx=train_end_idx,
        horizon=horizon,
        extra_gap=gap
    ))
    
    # Return stratified if regimes provided
    if regimes is not None:
        test_regimes = regimes[train_end_idx:] if isinstance(regimes, np.ndarray) else "auto"
        return run_gates_stratified(
            overall_gates=gates,
            actuals=y_test,
            predictions=predictions,
            regimes=test_regimes
        )
    
    return run_gates(gates)

In [None]:
# Test the template
print("="*60)
print("CLEAN DATA VALIDATION")
print("="*60)

clean_report = full_validation_pipeline(
    model=Ridge(),
    X=X_clean,
    y=y_target,
    train_end_idx=train_size,
    horizon=1,
    gap=0
)
print(clean_report.summary())

In [None]:
print("="*60)
print("LEAKY DATA VALIDATION")
print("="*60)

leaky_report = full_validation_pipeline(
    model=Ridge(),
    X=X_leaky,
    y=y_target,
    train_end_idx=train_size,
    horizon=1,
    gap=0  # Same adjacent split as the clean run; only the features differ
)
print(leaky_report.summary())

---

## Pitfall Section: Common Mistakes

### Pitfall 1: Wrong Gate Order

```python
# WRONG: Run internal before external
gates = [
    gate_suspicious_improvement(model_mae, persistence_mae),  # Internal first
    gate_signal_verification(model, X, y),  # External second
]

# RIGHT: External gates FIRST
gates = [
    gate_signal_verification(model, X, y),  # External first
    gate_suspicious_improvement(model_mae, persistence_mae),  # Internal second
]
```

**Why it matters**: External gates are harder to game. Run them first to catch leakage early.

---

### Pitfall 2: Ignoring WARN Status

```python
# WRONG: Only check for HALT
if report.status != "HALT":
    deploy()  # Ignores warnings!

# RIGHT: Handle warnings appropriately
if report.status == "HALT":
    raise ValueError("Validation failed")
elif report.status == "WARN":
    log_warnings(report.warnings)  # Log for monitoring
    deploy()  # Proceed with caution
else:
    deploy()  # Clean deployment
```

**Why it matters**: Warnings indicate edge cases that may become failures in production.

---

### Pitfall 3: Aggregate-Only Validation

```python
# WRONG: Only check overall metrics
report = run_gates(gates)  # No stratification

# RIGHT: Check per-regime performance
report = run_gates_stratified(
    gates, actuals, predictions, regimes="auto"
)
```

**Why it matters**: A model may pass overall but fail in HIGH volatility regimes.

---

### Pitfall 4: Misunderstanding Gap Calculation ⚠️

The **actual gap** (observations skipped between train and test) is:

```
actual_gap = test_start_idx - train_end_idx - 1
```

Example:
- `train_end_idx = 99` (last training observation)
- `test_start_idx = 100` (first test observation)
- `actual_gap = 100 - 99 - 1 = 0` ← Train and test are adjacent!

For h-step forecasting, the gate checks `test_start >= train_end + horizon + gap`:

```python
# CONFUSING: What does extra_gap=2 mean?
gate_temporal_boundary(
    train_end_idx=99,
    test_start_idx=102,  # actual_gap = 102 - 99 - 1 = 2
    horizon=1,
    extra_gap=2  # Additional gap beyond horizon
)
)
# Checks: 102 >= 99 + 1 + 2 = 102 → PASS

# CLEAR: Show actual gap in comments
train_end = len(X_train) - 1
test_start = len(X_train)
actual_gap = test_start - train_end - 1  # = 0 (adjacent)
```

**Why it matters**: Off-by-one errors in gap calculation are a common source of subtle leakage.
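
The two formulas in this pitfall can be written as small helpers and checked against the worked examples. This sketch only encodes the inequality as stated in the text above; `actual_gap` and `boundary_ok` are hypothetical names, and the library's internal check has not been verified here.

```python
def actual_gap(train_end_idx, test_start_idx):
    """Observations skipped between the last train row and the first test row."""
    return test_start_idx - train_end_idx - 1


def boundary_ok(train_end_idx, test_start_idx, horizon=1, gap=0):
    """Check test_start >= train_end + horizon + gap (both indices inclusive)."""
    return test_start_idx >= train_end_idx + horizon + gap


# Adjacent split: nothing skipped, passes for horizon=1 with no extra gap.
print(actual_gap(99, 100), boundary_ok(99, 100, horizon=1, gap=0))

# Two observations skipped: satisfies horizon=1 plus an extra gap of 2.
print(actual_gap(99, 102), boundary_ok(99, 102, horizon=1, gap=2))

# One observation skipped: too few for horizon=1 plus an extra gap of 2.
print(actual_gap(99, 101), boundary_ok(99, 101, horizon=1, gap=2))
```

Keeping both quantities side by side makes the off-by-one visible: the stated rule is satisfied with an actual gap of 0 when `horizon=1` and no extra gap is requested.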

---

## Key Insights

```
★ Insight ────────────────────────────────────────────

1. GateStatus has strict priority: HALT > WARN > PASS > SKIP
   - Any HALT overrides all other statuses

2. Run external validation BEFORE internal validation
   - gate_signal_verification catches leakage early
   - Internal gates are sanity checks, not definitive

3. Use run_gates_stratified() to expose hidden failures
   - Aggregate metrics hide regime-specific issues
   - regimes="auto" classifies volatility automatically

4. Handle WARN status explicitly
   - Log warnings for monitoring
   - Today's warning may become tomorrow's failure

─────────────────────────────────────────────────────
```

---

## What You've Learned (Tier 2 Complete)

| Notebook | Key Concept |
|----------|-------------|
| 05 | Shuffled target gate for definitive leakage detection |
| 06 | Safe feature engineering (rolling stats, feature selection) |
| 07 | Threshold and regime leakage prevention |
| **08** | **Complete validation pipeline with gate aggregation** |

---

## Next Steps: Tier 3 - Advanced Evaluation

| Notebook | Topic |
|----------|-------|
| 09 | Statistical tests (DM, PT) for model comparison |
| 10 | High-persistence metrics (MC-SS, Theil's U) |
| 11 | Conformal prediction for uncertainty quantification |
| 12 | Regime-stratified evaluation (capstone) |

---

**Congratulations!** You've completed Tier 2: Leakage Prevention. You now know:

1. How to detect leakage with `gate_signal_verification`
2. How to engineer features safely
3. How to avoid threshold/regime leakage
4. How to build a systematic validation pipeline