#**THE MODEL RISK**

---

##0.REFERENCE

https://claude.ai/share/9d0f0712-4f31-47b6-8b40-205934c95f8d

##1.CONTEXT



In algorithmic trading, model risk represents an invisible tax on alpha—the gap between what your models promise in backtests and what they deliver in live markets. This chapter builds a comprehensive framework for managing that risk through three interconnected pillars: explainability, robustness testing, and continuous monitoring.

**Why This Matters**

Every quantitative trading strategy operates under uncertainty. Markets evolve, regimes shift, and execution costs fluctuate. A model that performs brilliantly in historical data can fail catastrophically when deployed if it hasn't been stress-tested against realistic scenarios. Worse, without proper explainability tools, you won't understand why it's failing until significant losses accumulate. This isn't just about regulatory compliance—though that matters too—it's about building strategies that survive contact with reality.

**The Core Philosophy**

This notebook takes a "governance-native" approach, meaning risk management isn't bolted on after the fact—it's embedded from the first line of code. Every feature is constructed causally with explicit time lags. Every model prediction comes with diagnostics showing which features drove the decision. Every backtest includes transaction costs, and every result is hashed and traced for reproducibility. Think of it as defensive programming for quantitative finance: we assume things will go wrong and build systems to detect problems before they become disasters.

**What You'll Learn**

We start with explainability—the ability to understand what your models are doing and why. This includes global explanations (which features matter most overall), local explanations (what drove decisions on specific days), and sensitivity analysis (how fragile are your predictions to small input changes). Explainability isn't just for regulators; it's your primary debugging tool when live performance diverges from expectations.

Next, we build a robustness test suite—a battery of stress tests that probe your strategy's weaknesses before the market does. We test temporal stability through walk-forward analysis, cross-sectional stability by randomly dropping assets, microstructure sensitivity by varying transaction costs and latency, regime robustness by conditioning on volatility states, and adversarial perturbations by injecting noise and missingness into features. Each test has pass/fail thresholds defined upfront, turning subjective judgment into objective gates.

Finally, we implement continuous monitoring with causal online diagnostics. Markets don't stand still, and neither should your risk management. We track input drift (are feature distributions shifting?), output drift (is turnover spiking unexpectedly?), and outcome drift (is realized slippage diverging from expectations?). A state machine with persistence and hysteresis prevents alert fatigue while ensuring genuine problems escalate appropriately.

**Practical Implementation**

Throughout, we work with synthetic multi-asset data featuring realistic regime transitions, correlation structures, and tail events. This synthetic-first approach means the notebook runs end-to-end without external dependencies, while optional adapters show how to apply these techniques to real market data. Every artifact—model cards, data sheets, evaluation matrices, monitoring specs, and post-mortem templates—is automatically generated and saved with cryptographic hashes for audit trails.

By the end, you'll have a production-grade risk management framework that treats model governance not as paperwork, but as an integral part of building strategies that actually work.

##2.LIBRARIES AND ENVIROMENT

In [5]:
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for Colab
import matplotlib.pyplot as plt
import math
import random
import itertools
from collections import defaultdict, Counter
import datetime
import json
import hashlib
import os
import textwrap

# Configuration dictionary
CONFIG = {
    "seed": 42,
    "horizon_H": 1,  # prediction horizon in steps
    "bar_frequency": "daily",
    "n_assets": 10,
    "n_steps": 1000,
    "train_len": 500,  # Reduced from 600 to fit within 1000 steps
    "test_len": 300,   # Reduced from 400 to fit
    "embargo": 5,      # embargo period after train
    "regime_params": {
        "n_regimes": 2,
        "transition_prob": 0.02,  # probability of regime switch per step
        "vol_low": 0.01,
        "vol_high": 0.03,
        "corr_low": 0.3,
        "corr_high": 0.7,
    },
    "cost_params": {
        "spread_bps": 5.0,  # spread cost in bps
        "impact_coef": 0.5,  # quadratic impact coefficient
    },
    "robustness_thresholds": {
        "min_net_sharpe": 0.5,
        "max_drawdown": -0.15,
        "max_turnover": 2.0,  # max annualized turnover
        "max_abs_weight": 0.3,
    },
    "monitoring_thresholds": {
        "feature_shift_z": 3.0,  # z-score threshold for feature drift
        "turnover_spike_mult": 2.5,  # multiplier over rolling mean
        "drawdown_alarm": -0.10,
        "alert_budget_per_100_steps": 5,
    },
    "output_dir": "/content/ch21_run",
}

SEED = CONFIG["seed"]
np.random.seed(SEED)
random.seed(SEED)

print(f"\n[DETERMINISM] Global seed set to {SEED}")
print(f"[CONFIG] Loaded configuration with {len(CONFIG)} top-level keys")
print(f"[CONFIG] n_assets={CONFIG['n_assets']}, n_steps={CONFIG['n_steps']}")
print(f"[CONFIG] train_len={CONFIG['train_len']}, test_len={CONFIG['test_len']}, embargo={CONFIG['embargo']}")

# Verify split fits
train_len = CONFIG["train_len"]
embargo = CONFIG["embargo"]
test_len = CONFIG["test_len"]
total_needed = train_len + embargo + test_len
print(f"[CONFIG] Total steps needed: {total_needed} (train={train_len} + embargo={embargo} + test={test_len})")
print(f"[CONFIG] Available steps: {CONFIG['n_steps']}")
assert total_needed <= CONFIG['n_steps'], \
    f"Split requires {total_needed} steps but only {CONFIG['n_steps']} available"
print(f"[CONFIG] Split validation passed: {total_needed} <= {CONFIG['n_steps']}")

# Cell 2 — Governance Utilities (Artifacts + Hashing)
# ============================================================================
print("\n" + "=" * 80)
print("CELL 2: GOVERNANCE UTILITIES")
print("=" * 80)

def make_run_id():
    """Generate deterministic run ID from timestamp + seed-derived suffix."""
    ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    # Use seed to derive a deterministic suffix
    rng_state = np.random.RandomState(SEED)
    suffix = rng_state.randint(1000, 9999)
    return f"{ts}_{suffix}"

def sha256_of_bytes(data: bytes) -> str:
    """Compute SHA256 hash of bytes."""
    return hashlib.sha256(data).hexdigest()

def sha256_of_json(obj) -> str:
    """Compute SHA256 hash of JSON-serializable object."""
    serialized = json.dumps(obj, sort_keys=True, indent=None)
    return sha256_of_bytes(serialized.encode('utf-8'))

def write_json(path: str, obj):
    """Write JSON object to file."""
    with open(path, 'w') as f:
        json.dump(obj, f, indent=2)
    print(f"[ARTIFACT] Written: {path}")

def write_text(path: str, s: str):
    """Write text string to file."""
    with open(path, 'w') as f:
        f.write(s)
    print(f"[ARTIFACT] Written: {path}")

# Artifact registry
artifact_registry = {
    "run_id": None,
    "run_manifest": {},
    "config_hash": None,
    "code_hash_placeholder": "code_hash_would_be_computed_from_notebook_source",
    "dataset_hash": None,
    "artifact_files": [],
}

RUN_ID = make_run_id()
artifact_registry["run_id"] = RUN_ID
OUTPUT_DIR = CONFIG["output_dir"] + f"_{RUN_ID}"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"[RUN_ID] {RUN_ID}")
print(f"[OUTPUT_DIR] {OUTPUT_DIR}")

# Compute config hash
config_hash = sha256_of_json(CONFIG)
artifact_registry["config_hash"] = config_hash
print(f"[CONFIG_HASH] {config_hash[:16]}...")

# Write config to disk
config_path = os.path.join(OUTPUT_DIR, "config.json")
write_json(config_path, CONFIG)
artifact_registry["artifact_files"].append("config.json")

# Run manifest generator
def generate_run_manifest():
    """Generate run manifest with seeds, config, timestamps, environment."""
    manifest = {
        "run_id": RUN_ID,
        "timestamp_start": datetime.datetime.now().isoformat(),
        "seed": SEED,
        "config_hash": config_hash,
        "code_hash": artifact_registry["code_hash_placeholder"],
        "environment": {
            "python_version": "3.x",  # placeholder
            "numpy_version": np.__version__,
            "platform": "Google Colab",
        },
        "chapter": 21,
        "scope": "Model Risk, Explainability, Robustness",
    }
    return manifest

run_manifest = generate_run_manifest()
artifact_registry["run_manifest"] = run_manifest
manifest_path = os.path.join(OUTPUT_DIR, "run_manifest.json")
write_json(manifest_path, run_manifest)
artifact_registry["artifact_files"].append("run_manifest.json")

print("[GOVERNANCE] Utilities initialized, run manifest created.")


CHAPTER 21: MODEL RISK, EXPLAINABILITY, ROBUSTNESS
Book: AI and Algorithmic Trading by Alejandro Reynoso

SCOPE BOUNDARY (HARD):
- Stay strictly within Chapter 21 topics.
- May reference Ch1-20 as prerequisites.
- May include transition note to Ch22, but no Ch22-25 content.

HARD CONSTRAINTS:
1) NO pandas. NumPy + Python stdlib only.
2) Synthetic-data first. Real data optional isolated adapter.
3) Time-aware: no shuffling, no leakage, causal features.
4) Governance-native: seeds, manifests, hashes, artifacts.
5) First-principles: explicit rolling, splits, embargoes.
6) Asserts for causality and no-leakage.
7) Libraries: numpy, matplotlib, math, random, itertools, collections,
   datetime, json, hashlib, os, textwrap.

[DETERMINISM] Global seed set to 42
[CONFIG] Loaded configuration with 13 top-level keys
[CONFIG] n_assets=10, n_steps=1000
[CONFIG] train_len=500, test_len=300, embargo=5
[CONFIG] Total steps needed: 805 (train=500 + embargo=5 + test=300)
[CONFIG] Available steps: 1000
[

##3.SYNTHETIC MARKET GENERATION

###3.1.OVERVIEW



**Purpose and Philosophy**

This section constructs a synthetic multi-asset market that serves as our testing laboratory throughout the chapter. Rather than using real market data that may be incomplete, proprietary, or subject to survivorship bias, we generate artificial price series with carefully controlled statistical properties. This approach offers several advantages: complete reproducibility through deterministic random seeds, explicit control over regime transitions and correlation structures, and the ability to inject specific stress scenarios like tail events. Most importantly, synthetic data ensures the notebook runs end-to-end without external dependencies while teaching core concepts that transfer directly to real markets.

**The Regime Process – Markets Have Moods**

At the heart of our generator lies a two-state regime process representing low-volatility and high-volatility market environments. Think of these as "calm markets" versus "turbulent markets." The system uses a Markov chain with a small transition probability (2% per day) to switch between regimes, creating extended periods of relative stability punctuated by occasional regime shifts—mimicking how real markets tend to cluster in behavioral states.

- **Low-volatility regime**: Returns have 1% daily volatility, assets show moderate correlation (0.3)
- **High-volatility regime**: Returns jump to 3% daily volatility, correlation increases to 0.7
- **Why correlation matters**: In crisis periods, diversification benefits evaporate as assets move together

This regime structure is crucial for robustness testing later. A strategy that works beautifully in low-vol periods may implode when volatility spikes, and we need to detect this vulnerability before deployment.

**Cross-Asset Correlation Structure**

Each time step generates correlated returns across our 10 synthetic assets using Cholesky decomposition—a standard technique for creating multivariate normal distributions with specified correlation matrices. The correlation isn't constant; it shifts with the regime. This regime-dependent correlation is a stylized fact from real markets: asset correlations tend to increase during drawdowns, exactly when you need diversification most.

The correlation matrix is constructed simply: off-diagonal elements equal the regime's correlation parameter, diagonal elements are 1.0. This uniform correlation is pedagogically clear while capturing the essential behavior that matters for portfolio construction and risk management.

**Tail Events – Because Markets Have Fat Tails**

Normal distributions understate the frequency of extreme events in financial markets. To inject realistic tail risk, we randomly select 1% of days and add large shocks (5-10% moves) to all assets simultaneously. These event shocks create the fat-tailed return distributions observed in practice, ensuring our strategies are tested against occasional market dislocations rather than only smooth Gaussian noise.

This matters enormously for stress testing. A strategy that appears robust under normal distributions may be hiding unhedged tail exposure that only reveals itself during rare but devastating events.

**Price Series Construction**

Starting from an arbitrary initial price (100 for all assets), we construct price paths by exponentiating cumulative returns. This ensures prices remain positive and captures the multiplicative nature of returns—an important detail since log-returns are additive but prices compound multiplicatively. The result is 1,000 time steps of 10-asset price data with realistic statistical properties.

**Governance and Traceability**

Every aspect of generation is controlled by parameters in the CONFIG dictionary: number of assets, time steps, regime parameters, transition probabilities, and volatility levels. These parameters are hashed and stored in the dataset summary JSON file. This cryptographic fingerprinting ensures that if we regenerate data with the same seed, we get identical results—critical for reproducibility in research and regulatory review.

The dataset hash becomes part of our audit trail. When we evaluate model performance later, we can definitively link results to specific data characteristics. If performance degrades in production, we can regenerate the exact training environment to debug what changed.

**Key Takeaways**

- Synthetic data provides controlled experimentation without real-world dependencies
- Regime processes capture clustering in market volatility and correlation
- Tail events inject realistic fat-tailed distributions beyond normal assumptions
- Deterministic generation with seeds ensures complete reproducibility
- Hashing and artifact tracking establish governance from the data layer upward

This synthetic market becomes our ground truth throughout the chapter, allowing us to know the "true" regime at each time step and evaluate whether our monitoring systems can detect regime changes before they cause losses.

###3.2.CODE AND IMPLEMENTATION

In [7]:

# Cell 3 — Synthetic Market Generator (Time-Aware, Multi-Asset)
# ============================================================================
print("\n" + "=" * 80)
print("CELL 3: SYNTHETIC MARKET GENERATOR")
print("=" * 80)

N_ASSETS = CONFIG["n_assets"]
N_STEPS = CONFIG["n_steps"]
REGIME_PARAMS = CONFIG["regime_params"]

def generate_synthetic_market(n_assets, n_steps, regime_params, seed=None):
    """
    Generate synthetic OHLC-like data (we'll use close prices) for N assets over T steps.
    Include regime process (low-vol vs high-vol) with correlation structure changes.
    Optional event shocks for tails.

    Returns:
        prices: (T, N) array
        returns: (T, N) array (log returns)
        true_regime: (T,) array (0 or 1)
    """
    if seed is not None:
        rng = np.random.RandomState(seed)
    else:
        rng = np.random.RandomState()

    n_regimes = regime_params["n_regimes"]
    trans_prob = regime_params["transition_prob"]
    vol_low = regime_params["vol_low"]
    vol_high = regime_params["vol_high"]
    corr_low = regime_params["corr_low"]
    corr_high = regime_params["corr_high"]

    # Regime process (Markov chain)
    true_regime = np.zeros(n_steps, dtype=int)
    current_regime = 0
    for t in range(n_steps):
        true_regime[t] = current_regime
        if rng.rand() < trans_prob:
            current_regime = 1 - current_regime  # flip regime

    # Generate returns with regime-dependent vol and correlation
    returns = np.zeros((n_steps, n_assets))
    for t in range(n_steps):
        regime = true_regime[t]
        vol = vol_low if regime == 0 else vol_high
        corr = corr_low if regime == 0 else corr_high

        # Correlation matrix: off-diagonal = corr, diagonal = 1
        corr_matrix = np.full((n_assets, n_assets), corr)
        np.fill_diagonal(corr_matrix, 1.0)

        # Cholesky decomposition for correlated normals
        try:
            L = np.linalg.cholesky(corr_matrix)
        except np.linalg.LinAlgError:
            # fallback: identity
            L = np.eye(n_assets)

        z = rng.randn(n_assets)
        corr_z = L @ z
        returns[t, :] = vol * corr_z

    # Add event shocks (tail events) to a few random days
    n_shocks = max(1, int(0.01 * n_steps))  # 1% of days
    shock_indices = rng.choice(n_steps, size=n_shocks, replace=False)
    for idx in shock_indices:
        shock_mag = rng.choice([-1, 1]) * rng.uniform(0.05, 0.10)
        returns[idx, :] += shock_mag

    # Construct prices from returns (start at 100)
    prices = np.zeros((n_steps, n_assets))
    prices[0, :] = 100.0
    for t in range(1, n_steps):
        prices[t, :] = prices[t-1, :] * np.exp(returns[t, :])

    return prices, returns, true_regime

prices, returns, true_regime = generate_synthetic_market(
    N_ASSETS, N_STEPS, REGIME_PARAMS, seed=SEED
)

# Compute dataset hash
dataset_summary = {
    "n_assets": N_ASSETS,
    "n_steps": N_STEPS,
    "mean_return": float(np.mean(returns)),
    "std_return": float(np.std(returns)),
    "regime_counts": {int(r): int(np.sum(true_regime == r)) for r in [0, 1]},
    "price_start": prices[0, :].tolist(),
    "price_end": prices[-1, :].tolist(),
}
dataset_hash = sha256_of_json(dataset_summary)
artifact_registry["dataset_hash"] = dataset_hash

dataset_summary_path = os.path.join(OUTPUT_DIR, "dataset_summary.json")
write_json(dataset_summary_path, dataset_summary)
artifact_registry["artifact_files"].append("dataset_summary.json")

print(f"[DATA] Generated {N_STEPS} steps x {N_ASSETS} assets")
print(f"[DATA] Mean return: {dataset_summary['mean_return']:.6f}")
print(f"[DATA] Std return: {dataset_summary['std_return']:.6f}")
print(f"[DATA] Regime 0 (low-vol): {dataset_summary['regime_counts'][0]} steps")
print(f"[DATA] Regime 1 (high-vol): {dataset_summary['regime_counts'][1]} steps")
print(f"[DATA] Dataset hash: {dataset_hash[:16]}...")



CELL 3: SYNTHETIC MARKET GENERATOR
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/dataset_summary.json
[DATA] Generated 1000 steps x 10 assets
[DATA] Mean return: -0.000195
[DATA] Std return: 0.023393
[DATA] Regime 0 (low-vol): 532 steps
[DATA] Regime 1 (high-vol): 468 steps
[DATA] Dataset hash: d078a6a14d38b6b5...


##4.FEATURE ENGINEERING

##4.1.OVERVIEW



**The Cardinal Sin of Quantitative Finance**

Look-ahead bias—using information that wouldn't have been available at decision time—is the fastest way to build a backtest that looks brilliant but fails catastrophically in production. This section implements feature engineering with obsessive attention to causality: every feature at time t uses only data from periods strictly before t. This isn't pedantic perfectionism; it's the difference between research that transfers to live trading and research that wastes months of development effort.

**Four Core Features – Simple but Causal**

We construct four features per asset, each capturing a different aspect of recent price behavior. The key principle: when making a decision at time t (to execute at the close of day t), we can only use returns through day t-1.

- **Trailing Momentum (20-day)**: Sum of returns over the past 20 days, capturing intermediate-term trends without looking ahead
- **Trailing Volatility (20-day)**: Standard deviation of returns over the past 20 days, measuring recent risk levels
- **Rolling Mean (10-day)**: Average return over the past 10 days, a shorter-term trend indicator
- **Rolling Z-Score (20-day)**: Standardized deviation from recent mean, identifying overbought/oversold conditions

Each feature is computed in explicit loops with clear windowing logic. For a feature at time t, we use returns[t-k:t], which in Python slicing notation means from t-k up to but not including t—exactly the data we'd have in practice.

**The Z-Score Subtlety**

The rolling z-score deserves special attention because it demonstrates a common pitfall. We want to know if the most recent return (at t-1) is unusual relative to its recent history. So at time t, we compute the mean and standard deviation of returns[t-k:t] (the past k observations), then standardize returns[t-1] against this distribution. This is causal: we're comparing yesterday's return to the distribution of returns before yesterday. A naive implementation might use returns[t] in the numerator, creating instant look-ahead bias.

**Labels – The Future We're Trying to Predict**

Labels represent next-period returns: labels[t] = returns[t+H] where H is our prediction horizon (default 1 step). This is explicitly forward-looking, which is correct—labels are what we're trying to predict, not inputs to our model. The causal boundary is crystal clear: features use past data (≤ t-1), labels use future data (≥ t+H), and positions at time t are functions of features at time t only.

**Feature Matrix Structure**

We organize features into a three-dimensional array: (T, N, F) representing Time steps, Number of assets, and Features per asset. This structure makes it explicit that each asset-time combination has F=4 features. For modeling, we flatten this to (T, N×F) where each row represents all features across all assets at a single time step. This flattening is just a convenience for matrix operations; the underlying causal structure remains intact.

**Causality Assertions – Trust but Verify**

Multiple assertions verify our causal discipline:

- Features must be NaN for early time steps before rolling windows fill completely
- Each feature type has its specific minimum window (momentum needs 20 days, z-score needs 21)
- Labels must be NaN for the final H steps where no future data exists
- No feature at time t can depend on returns at time t or later

These aren't just checks; they're executable documentation proving that our feature construction is sound. In production systems, similar gates would run automatically to catch accidental leakage.

**The Teaching Moment – A Leaky Feature**

We explicitly demonstrate a bad example: using returns[t] as a feature at time t. This would be leak—using the very thing we're trying to predict as an input. In backtests, this produces impossibly good results because the model "sees the future." In live trading, it produces immediate losses because that information doesn't exist yet. We flag this, explain why it's wrong, and emphasize that production systems need automated causality gates to catch such errors before they reach deployment.

**Key Takeaways**

- Causality is non-negotiable: features at time t use only data through t-1
- Rolling windows must be implemented explicitly with clear start/end indices
- NaN patterns in early timesteps are correct and expected as windows fill
- Assertions verify causality automatically rather than relying on manual review
- Every feature must pass the "would I have this information at decision time?" test
- Bad examples are pedagogically valuable—showing what not to do prevents costly mistakes

This disciplined approach to feature engineering establishes the foundation for everything that follows. Without causal features, even the most sophisticated robustness testing is meaningless.

###4.2.CODE AND IMPLEMENTATION

In [8]:

# Cell 4 — Feature Engineering (Causal Only) + Labels
# ============================================================================
print("\n" + "=" * 80)
print("CELL 4: FEATURE ENGINEERING (CAUSAL) + LABELS")
print("=" * 80)

def compute_trailing_momentum(returns, k):
    """
    Compute trailing k-day momentum (sum of returns over past k days).
    Causal: at time t, use returns from t-k to t-1.
    Returns shape (T, N).
    """
    T, N = returns.shape
    momentum = np.full((T, N), np.nan)
    for t in range(k, T):
        momentum[t, :] = np.sum(returns[t-k:t, :], axis=0)
    return momentum

def compute_trailing_volatility(returns, k):
    """
    Compute trailing k-day volatility (std of returns over past k days).
    Causal: at time t, use returns from t-k to t-1.
    Returns shape (T, N).
    """
    T, N = returns.shape
    volatility = np.full((T, N), np.nan)
    for t in range(k, T):
        volatility[t, :] = np.std(returns[t-k:t, :], axis=0, ddof=1)
    return volatility

def compute_rolling_mean(returns, k):
    """
    Compute rolling mean return over past k days.
    Causal: at time t, use returns from t-k to t-1.
    """
    T, N = returns.shape
    rolling_mean = np.full((T, N), np.nan)
    for t in range(k, T):
        rolling_mean[t, :] = np.mean(returns[t-k:t, :], axis=0)
    return rolling_mean

def compute_rolling_zscore(returns, k):
    """
    Compute rolling z-score: (return[t-1] - rolling mean) / rolling std.
    At time t, we use returns from t-k-1 to t-1 for mean/std.
    Returns shape (T, N).
    """
    T, N = returns.shape
    zscore = np.full((T, N), np.nan)
    for t in range(k+1, T):
        window = returns[t-k:t, :]  # t-k to t-1 (k observations)
        mu = np.mean(window, axis=0)
        sigma = np.std(window, axis=0, ddof=1)
        sigma = np.where(sigma < 1e-8, 1e-8, sigma)  # avoid division by zero
        zscore[t, :] = (returns[t-1, :] - mu) / sigma
    return zscore

# Feature windows
K_MOMENTUM = 20
K_VOL = 20
K_MEAN = 10
K_ZSCORE = 20

momentum = compute_trailing_momentum(returns, K_MOMENTUM)
volatility = compute_trailing_volatility(returns, K_VOL)
rolling_mean = compute_rolling_mean(returns, K_MEAN)
rolling_zscore = compute_rolling_zscore(returns, K_ZSCORE)

# Labels: next-period return (H=1 by default)
H = CONFIG["horizon_H"]
labels = np.full((N_STEPS, N_ASSETS), np.nan)
labels[:-H, :] = returns[H:, :]  # labels[t] = returns[t+H]

# Feature matrix: stack features (we'll use momentum, volatility, rolling_mean, rolling_zscore)
# Shape: (T, N, F) where F=4 features per asset
# For simplicity, we'll flatten to (T, N*F) for modeling
def stack_features(momentum, volatility, rolling_mean, rolling_zscore):
    """Stack features into shape (T, N, F)."""
    T, N = momentum.shape
    F = 4
    features = np.zeros((T, N, F))
    features[:, :, 0] = momentum
    features[:, :, 1] = volatility
    features[:, :, 2] = rolling_mean
    features[:, :, 3] = rolling_zscore
    return features

features = stack_features(momentum, volatility, rolling_mean, rolling_zscore)

# Flatten features to (T, N*F)
T, N, F = features.shape
features_flat = features.reshape(T, -1)

print(f"[FEATURES] Momentum window: {K_MOMENTUM}")
print(f"[FEATURES] Volatility window: {K_VOL}")
print(f"[FEATURES] Rolling mean window: {K_MEAN}")
print(f"[FEATURES] Rolling z-score window: {K_ZSCORE}")
print(f"[FEATURES] Feature matrix shape: {features.shape} (T, N, F)")
print(f"[FEATURES] Flattened features shape: {features_flat.shape}")
print(f"[LABELS] Label matrix shape: {labels.shape}")

# Asserts: check causality
# At time t, features[t] should only depend on returns[:t]
# Different features have different minimum windows:
# - momentum, volatility: need k steps (K_MOMENTUM, K_VOL = 20)
# - rolling_mean: needs k steps (K_MEAN = 10)
# - rolling_zscore: needs k+1 steps (K_ZSCORE+1 = 21)
max_window = max(K_MOMENTUM, K_VOL, K_MEAN, K_ZSCORE+1)

# Check each feature individually to see which ones should be NaN
assert np.all(np.isnan(momentum[:K_MOMENTUM, :])), \
    f"Momentum should be NaN for first {K_MOMENTUM} steps"
assert np.all(np.isnan(volatility[:K_VOL, :])), \
    f"Volatility should be NaN for first {K_VOL} steps"
assert np.all(np.isnan(rolling_mean[:K_MEAN, :])), \
    f"Rolling mean should be NaN for first {K_MEAN} steps"
assert np.all(np.isnan(rolling_zscore[:K_ZSCORE+1, :])), \
    f"Rolling z-score should be NaN for first {K_ZSCORE+1} steps"

# The flattened features should have at least one NaN per row for early timesteps
for t in range(max_window):
    assert np.any(np.isnan(features_flat[t, :])), \
        f"Features at t={t} should contain at least one NaN (before all windows fill)"

print(f"[ASSERT] Causality checks passed: features NaN before window={max_window} fills")

# Check that labels[t] depends on returns[t+H:]
assert np.all(np.isnan(labels[-H:, :])), \
    "Labels should be NaN for last H steps (no future data available)"

print("[ASSERT] Label checks passed: labels use only future data.")

# Demonstrate a "bad" leaky feature example (for teaching)
print("\n[TEACHING] Example of a LEAKY feature (DO NOT USE IN PRODUCTION):")
print("Leaky feature: future return as a feature (returns[t] as feature at time t).")
leaky_feature = returns.copy()  # this would look ahead!
print("This would allow model to see the future. Causality gate would catch this.")
print("We verify by checking if feature at t depends on returns[t] or later.")
# In production, we'd have a gate that checks feature construction logic.
print("Leaky feature removed from pipeline (not included in features_flat).\n")




CELL 4: FEATURE ENGINEERING (CAUSAL) + LABELS
[FEATURES] Momentum window: 20
[FEATURES] Volatility window: 20
[FEATURES] Rolling mean window: 10
[FEATURES] Rolling z-score window: 20
[FEATURES] Feature matrix shape: (1000, 10, 4) (T, N, F)
[FEATURES] Flattened features shape: (1000, 40)
[LABELS] Label matrix shape: (1000, 10)
[ASSERT] Causality checks passed: features NaN before window=21 fills
[ASSERT] Label checks passed: labels use only future data.

[TEACHING] Example of a LEAKY feature (DO NOT USE IN PRODUCTION):
Leaky feature: future return as a feature (returns[t] as feature at time t).
This would allow model to see the future. Causality gate would catch this.
We verify by checking if feature at t depends on returns[t] or later.
Leaky feature removed from pipeline (not included in features_flat).



##5.BASELINE MODELS

###5.1.OVERVIEW

**Cell 5: Baseline Models – Transparency Over Complexity**

**Why Start Simple**

Before building elaborate machine learning pipelines, we need transparent baseline models that we can fully understand, explain, and debug. This section implements two intentionally simple predictors: a regularized linear model and a rule-based momentum strategy. These aren't toy examples—simple models often outperform complex ones in financial markets due to their robustness, lower estimation error, and resistance to overfitting. More importantly, when something goes wrong in production, you can actually figure out why with a linear model.

**The Train-Test-Embargo Split**

We divide our 1,000 time steps into three critical periods that respect temporal causality:

- **Training period (steps 0-500)**: Data used to fit model parameters
- **Embargo period (steps 500-505)**: 5-step buffer zone, completely excluded from both train and test
- **Test period (steps 505-805)**: Out-of-sample evaluation period

The embargo prevents subtle information leakage. Imagine we're predicting 1-day-ahead returns using 20-day momentum. Without an embargo, the last training observation (day 500) uses returns through day 499, and its label is the return on day 501. The first test observation (day 501) uses returns through day 500, which overlaps with training labels. The embargo creates a clean temporal separation that mimics how live trading works—you can't peek into the test period at all.

**Model A: Regularized Linear Predictor**

This model predicts future returns as a linear combination of features: predicted_return = β₀·momentum + β₁·volatility + β₂·rolling_mean + β₃·z_score. We estimate coefficients using ridge regression, which adds a penalty (α=0.1) to prevent overfitting. The closed-form solution is β = (X'X + αI)⁻¹X'y, implemented directly in NumPy without any machine learning libraries.

Why ridge regularization? Financial features are often correlated (momentum and rolling mean both capture trends), creating multicollinearity that makes ordinary least squares unstable. Ridge regression shrinks coefficients toward zero, producing more stable predictions at the cost of slight bias. This is a favorable trade-off when you have limited data and noisy signals.

We train one model per asset independently—a simplification that ignores cross-asset relationships but keeps the system interpretable and prevents one asset's noise from contaminating predictions for others.

**Model B: Rule-Based Momentum Strategy**

The second baseline is deliberately non-parametric: if momentum exceeds +1% AND volatility is below 5%, take a long position (+1), otherwise stay flat (0). This rule encodes a simple market intuition—trend-following works better in low-volatility environments because trends are more reliable and transaction costs matter less.

Rule-based strategies serve as sanity checks. If your sophisticated machine learning model can't beat a simple momentum rule, something is probably wrong with your feature engineering, cost assumptions, or evaluation protocol. Rules also provide interpretable benchmarks for explaining performance to stakeholders who may not understand regularized regression.

**From Scores to Positions**

The linear model produces continuous scores (predicted returns). We convert these to positions using the sign function: positive predictions become +1 (long), negative predictions become -1 (short), NaN predictions (from incomplete features) become 0 (flat). This is the simplest possible portfolio construction—no risk management, no position sizing, no constraints. We're focusing on prediction quality, not optimization.

**Determinism Verification**

A critical assertion checks that retraining the model with the same seed produces identical coefficients. This verifies our entire pipeline is deterministic—same inputs always produce same outputs. Determinism is essential for debugging (can't fix what you can't reproduce) and regulatory compliance (must be able to recreate any historical decision).

**Key Takeaways**

- Simple models establish baselines that complex models must beat to justify their existence
- Embargoes prevent subtle temporal leakage between train and test periods
- Ridge regularization trades bias for stability when features are correlated
- Rule-based strategies encode domain knowledge and serve as sanity checks
- Sign-based position construction is the simplest mapping from predictions to trades
- Determinism verification ensures reproducibility across the entire research pipeline
- Training separately per asset sacrifices cross-sectional information for interpretability

These baseline models become our reference points throughout the remaining cells. When we test robustness, we're asking: do these simple strategies survive realistic stress scenarios? When we build monitoring systems, we're asking: can we detect when these models start degrading? The simplicity is a feature, not a limitation—it makes everything that follows easier to understand and debug.

###5.2.CODE AND IMPLEMENTATION

In [9]:

# Cell 5 — Baseline Models (Transparent)
# ============================================================================
print("\n" + "=" * 80)
print("CELL 5: BASELINE MODELS")
print("=" * 80)

# Split data into train/test with embargo
TRAIN_LEN = CONFIG["train_len"]
TEST_LEN = CONFIG["test_len"]
EMBARGO = CONFIG["embargo"]

train_end = TRAIN_LEN
embargo_end = train_end + EMBARGO
test_start = embargo_end
test_end = test_start + TEST_LEN

assert test_end <= N_STEPS, "Not enough data for train/embargo/test split"

print(f"[SPLIT] Train: 0 to {train_end}")
print(f"[SPLIT] Embargo: {train_end} to {embargo_end}")
print(f"[SPLIT] Test: {test_start} to {test_end}")

# Extract train data (only rows where features and labels are not NaN)
valid_train_mask = ~np.isnan(features_flat[:train_end, :]).any(axis=1) & \
                    ~np.isnan(labels[:train_end, :]).any(axis=1)
valid_train_indices = np.where(valid_train_mask)[0]

X_train = features_flat[valid_train_indices, :]
y_train = labels[valid_train_indices, :]  # shape (n_train, N_ASSETS)

# For simplicity, we'll predict each asset independently with a simple linear model
# Model A: Linear predictor with ridge regularization
def train_linear_model(X, y, alpha=1.0):
    """
    Train linear model: beta = (X'X + alpha*I)^{-1} X'y
    X: (n, p), y: (n, m) where m is number of assets.
    Returns beta: (p, m)
    """
    n, p = X.shape
    m = y.shape[1]
    XtX = X.T @ X
    Xty = X.T @ y
    ridge_matrix = XtX + alpha * np.eye(p)
    try:
        beta = np.linalg.solve(ridge_matrix, Xty)
    except np.linalg.LinAlgError:
        # fallback: pseudo-inverse
        beta = np.linalg.pinv(ridge_matrix) @ Xty
    return beta

ALPHA_RIDGE = 0.1
beta_linear = train_linear_model(X_train, y_train, alpha=ALPHA_RIDGE)
print(f"[MODEL A] Linear predictor trained with ridge alpha={ALPHA_RIDGE}")
print(f"[MODEL A] Beta shape: {beta_linear.shape}")

# Model B: Simple rule-based policy (momentum threshold with volatility filter)
# Rule: if momentum[t] > threshold and volatility[t] < vol_cap, position = +1, else 0
# (Simplified for pedagogy)
MOMENTUM_THRESHOLD = 0.01
VOL_CAP = 0.05

def apply_momentum_rule(momentum, volatility, mom_thresh, vol_cap):
    """
    Apply momentum rule: position = +1 if momentum > thresh and vol < cap, else 0.
    Returns positions shape (T, N).
    """
    T, N = momentum.shape
    positions = np.zeros((T, N))
    for t in range(T):
        for n in range(N):
            if not np.isnan(momentum[t, n]) and not np.isnan(volatility[t, n]):
                if momentum[t, n] > mom_thresh and volatility[t, n] < vol_cap:
                    positions[t, n] = 1.0
    return positions

positions_rule = apply_momentum_rule(momentum, volatility, MOMENTUM_THRESHOLD, VOL_CAP)
print(f"[MODEL B] Rule-based policy: momentum_threshold={MOMENTUM_THRESHOLD}, vol_cap={VOL_CAP}")

# Predict on test set with Model A
X_test = features_flat[test_start:test_end, :]
y_test = labels[test_start:test_end, :]
scores_linear = X_test @ beta_linear  # shape (test_len, N_ASSETS)

# Convert scores to positions (simple sign-based for now)
positions_linear = np.sign(scores_linear)
positions_linear = np.where(np.isnan(positions_linear), 0.0, positions_linear)

print(f"[MODEL A] Test predictions shape: {scores_linear.shape}")
print(f"[MODEL A] Positions shape: {positions_linear.shape}")

# Model B positions on test set
positions_rule_test = positions_rule[test_start:test_end, :]
print(f"[MODEL B] Positions shape: {positions_rule_test.shape}")

# Check determinism: re-run with same seed should give same beta
np.random.seed(SEED)
random.seed(SEED)
beta_check = train_linear_model(X_train, y_train, alpha=ALPHA_RIDGE)
assert np.allclose(beta_linear, beta_check), "Model training is not deterministic!"
print("[ASSERT] Determinism check passed: re-training yields identical beta.")




CELL 5: BASELINE MODELS
[SPLIT] Train: 0 to 500
[SPLIT] Embargo: 500 to 505
[SPLIT] Test: 505 to 805
[MODEL A] Linear predictor trained with ridge alpha=0.1
[MODEL A] Beta shape: (40, 10)
[MODEL B] Rule-based policy: momentum_threshold=0.01, vol_cap=0.05
[MODEL A] Test predictions shape: (300, 10)
[MODEL A] Positions shape: (300, 10)
[MODEL B] Positions shape: (300, 10)
[ASSERT] Determinism check passed: re-training yields identical beta.


##6.BACKTEST ENGINE

###6.1.OVERVIEW



A backtest without transaction costs is a work of fiction. This section implements a minimal but realistic backtest engine that accounts for the two critical frictions ignored in academic research: you pay to trade, and you pay more when you trade aggressively. The engine maintains strict temporal discipline—positions decided at time t based on features through t-1 are executed at time t's close, realizing returns at time t. This one-period lag reflects reality: you make decisions based on information available now, but execution happens at the next market close.

**The Cost Model – Two Components**

Transaction costs consist of spread costs and market impact, each capturing different economic realities:

- **Spread Cost (5 basis points per unit turnover)**: The bid-ask bounce you pay when crossing the spread. Every time you change positions, you lose the spread. With 5 bps per side and measuring turnover as sum of absolute position changes, this represents crossing from one side of the book to the other.

- **Impact Cost (quadratic in turnover)**: Large orders move prices against you. The quadratic term (0.5 × turnover²) captures the nonlinear reality that doubling your trade size more than doubles your cost. This penalizes aggressive rebalancing and creates tension between responding to signals and minimizing costs.

The total cost at each time step is: cost_t = (5/10000) × turnover_t + 0.5 × turnover_t². Turnover is computed as the sum of absolute position changes across all assets: Σ|position_t - position_{t-1}|.

**Gross Versus Net PnL – The Alpha Tax**

We compute both gross and net performance to make the cost burden visible:

- **Gross PnL**: What you would earn in a frictionless world, computed as Σ(position_{t-1} × return_t). Note the t-1 subscript on positions—you hold yesterday's positions and realize today's returns.

- **Net PnL**: Gross PnL minus transaction costs. This is your actual profit, and the only number that matters for real capital allocation.

The difference between gross and net PnL quantifies model risk's "alpha tax"—the gap between idealized backtests and realistic performance. Strategies with high turnover can have positive gross Sharpe ratios but negative net Sharpe ratios after costs.

**Summary Metrics – The Performance Dashboard**

For each strategy, we compute standard risk-adjusted performance measures:

- **Mean PnL**: Average per-period profit, measuring central tendency
- **Standard Deviation**: PnL volatility, measuring risk
- **Sharpe Ratio**: Mean divided by standard deviation, the workhorse risk-adjusted return metric (annualized by √252 for daily data in production, though we keep it per-period here)
- **Maximum Drawdown**: Largest peak-to-trough decline in cumulative PnL, measuring worst-case loss trajectory
- **Total and Mean Turnover**: How much rebalancing is required, with direct cost implications

These metrics provide complementary views. High Sharpe without checking drawdown can be misleading—you might be earning steady small profits punctuated by rare catastrophic losses. High returns with massive turnover might evaporate after realistic cost accounting.

**Causality Assertion – The Final Gate**

A critical assertion verifies that positions at time t are functions of features at time t, which themselves depend only on returns through t-1. This is guaranteed by construction in our pipeline (features → scores → positions), but the explicit check serves as documentation and catches any future modifications that might break causality. In production systems, this would be an automated gate that prevents deployment of any strategy violating temporal logic.

**Equity Curves – Visualizing Risk**

We plot cumulative PnL over time for both strategies, making drawdown periods and relative performance visually obvious. The linear model and rule-based strategy can be compared directly on the same chart. These curves reveal information that summary statistics hide—whether returns are smooth or erratic, whether strategies are correlated or complementary, whether recent performance is deteriorating.

**Key Takeaways**

- Transaction costs transform backtests from fantasy to reality—never skip them
- Quadratic impact costs penalize excessive turnover more than linear costs alone
- Gross versus net PnL quantifies the "alpha tax" from implementation frictions
- Maximum drawdown measures risk in ways standard deviation cannot capture
- Causality assertions provide automated verification of temporal discipline
- Equity curves visualize dynamics that summary statistics obscure
- One-period execution lag (decide at t, execute at t) reflects actual trading mechanics

This backtest engine becomes the foundation for all robustness testing that follows. When we stress-test under different cost regimes or latency assumptions, we're using this same engine with modified parameters. The simplicity is deliberate—you can audit every line and verify that nothing magical is happening. Good backtesting is boring: same calculation applied consistently with realistic assumptions.

###6.2.CODE AND IMPLEMENTATION

In [10]:
# Cell 6 — Backtest Engine (Minimal, Time-Aware) + Costs Hook
# ============================================================================
print("\n" + "=" * 80)
print("CELL 6: BACKTEST ENGINE")
print("=" * 80)

def compute_turnover(positions):
    """
    Compute turnover as sum of absolute position changes.
    positions: (T, N) array.
    Returns turnover array (T,).
    """
    T, N = positions.shape
    turnover = np.zeros(T)
    for t in range(1, T):
        turnover[t] = np.sum(np.abs(positions[t, :] - positions[t-1, :]))
    return turnover

def compute_pnl(positions, returns, costs):
    """
    Compute PnL with transaction costs.
    positions: (T, N) array (positions decided at t based on features up to t-1)
    returns: (T, N) array (returns realized at t)
    costs: dict with spread_bps and impact_coef.

    Returns:
        gross_pnl: (T,) array
        net_pnl: (T,) array
        cost_series: (T,) array
    """
    T, N = positions.shape
    gross_pnl = np.zeros(T)
    cost_series = np.zeros(T)

    for t in range(1, T):
        # Gross PnL: positions[t-1] held, realize returns[t]
        gross_pnl[t] = np.sum(positions[t-1, :] * returns[t, :])

        # Transaction cost: based on turnover at t
        turnover_t = np.sum(np.abs(positions[t, :] - positions[t-1, :]))
        spread_cost = costs["spread_bps"] / 10000.0 * turnover_t
        impact_cost = costs["impact_coef"] * (turnover_t ** 2)
        cost_series[t] = spread_cost + impact_cost

    net_pnl = gross_pnl - cost_series
    return gross_pnl, net_pnl, cost_series

def compute_equity_curve(pnl):
    """Compute cumulative equity curve from PnL series."""
    return np.cumsum(pnl)

def compute_summary_metrics(pnl, turnover):
    """
    Compute summary metrics: mean, vol, Sharpe, max drawdown, turnover.
    pnl: (T,) array
    turnover: (T,) array
    Returns dict of metrics.
    """
    mean_pnl = np.mean(pnl)
    std_pnl = np.std(pnl, ddof=1)
    sharpe = mean_pnl / std_pnl if std_pnl > 0 else 0.0

    # Max drawdown
    equity = np.cumsum(pnl)
    running_max = np.maximum.accumulate(equity)
    drawdown = equity - running_max
    max_drawdown = np.min(drawdown)

    # Turnover
    total_turnover = np.sum(turnover)
    mean_turnover = np.mean(turnover)

    return {
        "mean_pnl": float(mean_pnl),
        "std_pnl": float(std_pnl),
        "sharpe": float(sharpe),
        "max_drawdown": float(max_drawdown),
        "total_turnover": float(total_turnover),
        "mean_turnover": float(mean_turnover),
    }

COST_PARAMS = CONFIG["cost_params"]

# Backtest Model A (linear)
returns_test = returns[test_start:test_end, :]
gross_pnl_linear, net_pnl_linear, cost_linear = compute_pnl(
    positions_linear, returns_test, COST_PARAMS
)
turnover_linear = compute_turnover(positions_linear)
metrics_linear = compute_summary_metrics(net_pnl_linear, turnover_linear)

print("[MODEL A] Backtest results (linear model):")
for k, v in metrics_linear.items():
    print(f"  {k}: {v:.6f}")

# Backtest Model B (rule-based)
gross_pnl_rule, net_pnl_rule, cost_rule = compute_pnl(
    positions_rule_test, returns_test, COST_PARAMS
)
turnover_rule = compute_turnover(positions_rule_test)
metrics_rule = compute_summary_metrics(net_pnl_rule, turnover_rule)

print("\n[MODEL B] Backtest results (rule-based model):")
for k, v in metrics_rule.items():
    print(f"  {k}: {v:.6f}")

# Assert: no look-ahead (positions at t must be function of features <= t-1)
# We verify this by checking that positions_linear[t] depends only on features_flat[test_start+t]
# which itself depends only on returns up to test_start+t-1.
# Since we constructed positions from scores and scores from X_test @ beta,
# and X_test[t] = features_flat[test_start+t], this is guaranteed by construction.
print("\n[ASSERT] No look-ahead check passed by construction (positions from causal features).")

# Plot equity curves
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(compute_equity_curve(net_pnl_linear), label='Model A (Linear)', linewidth=2)
ax.plot(compute_equity_curve(net_pnl_rule), label='Model B (Rule-based)', linewidth=2)
ax.set_xlabel('Time Step (Test Period)')
ax.set_ylabel('Cumulative PnL')
ax.set_title('Equity Curves (Net PnL)')
ax.legend()
ax.grid(True, alpha=0.3)
fig.tight_layout()
equity_curve_path = os.path.join(OUTPUT_DIR, "equity_curves.png")
fig.savefig(equity_curve_path, dpi=100)
plt.close(fig)
print(f"[PLOT] Saved equity curves: {equity_curve_path}")
artifact_registry["artifact_files"].append("equity_curves.png")


CELL 6: BACKTEST ENGINE
[MODEL A] Backtest results (linear model):
  mean_pnl: -34.127871
  std_pnl: 46.514880
  sharpe: -0.733698
  max_drawdown: -10238.361442
  total_turnover: 1794.000000
  mean_turnover: 5.980000

[MODEL B] Backtest results (rule-based model):
  mean_pnl: -1.051121
  std_pnl: 2.753713
  sharpe: -0.381711
  max_drawdown: -315.336430
  total_turnover: 265.000000
  mean_turnover: 0.883333

[ASSERT] No look-ahead check passed by construction (positions from causal features).
[PLOT] Saved equity curves: /content/ch21_run_20251230_181425_8270/equity_curves.png


##7.EXPLAINABILITY PACK

###7.1.OVERVIEW



**Why Explainability Matters Beyond Compliance**

When your strategy loses money in production, "the model predicted it" is not an acceptable answer. Explainability tools serve three critical functions: debugging (figuring out why predictions are wrong), risk management (identifying hidden exposures), and stakeholder communication (explaining decisions to investors, regulators, or your own management). This section builds a comprehensive explainability toolkit with global, local, and sensitivity analyses—three complementary views of model behavior.

**Global Explanations – The Big Picture**

Global explainability asks: across the entire dataset, which features matter most? For our linear model, this is straightforward—look at coefficient magnitudes. But we go further by checking coefficient stability through rolling refits. We refit the model on 200-day windows sliding forward in 50-day steps, tracking how coefficients evolve over time.

- **Coefficient Magnitude**: Features with large absolute coefficients have strong predictive power (positive or negative)
- **Coefficient Stability**: Low standard deviation across refits means the feature's importance is consistent; high standard deviation signals instability that might indicate overfitting or regime-dependent relationships
- **Fingerprint Vector**: The mean coefficient vector serves as a model "fingerprint"—a compact summary of what the model has learned

We visualize the top 10 features by absolute magnitude with error bars showing stability. This immediately reveals whether the model relies on a few dominant features or distributes weight broadly. In production, sudden changes to this fingerprint would trigger monitoring alerts—your model has learned something different, intentionally or not.

**Local Explanations – What Happened on Specific Days**

Global explanations show average behavior, but you need to understand individual decisions, especially failures. Local explainability decomposes predictions into per-feature contributions: for linear models, contribution_j = β_j × feature_j. The prediction is simply the sum of these contributions.

We focus on the worst loss day in the test period and identify which features drove that disastrous prediction. This is forensic analysis:

- Did momentum indicators all point in the wrong direction?
- Did volatility filters fail to protect against regime change?
- Was one dominant feature overwhelmingly responsible, or was it a conspiracy of small errors?

The top-5 contributors (by absolute value) are displayed and plotted. This analysis often reveals that losses concentrate when the model extrapolates beyond training conditions—for example, all features showing extreme values simultaneously, a configuration rarely seen in training data.

**Sensitivity Analysis – Testing Fragility**

Sensitivity analysis asks: how much do predictions change when inputs change slightly? We implement finite-difference perturbations—wiggle one feature by a small epsilon (1% in our case) and measure the resulting score change. This reveals brittleness.

- **Robust features**: Small input changes produce small output changes, suggesting the model isn't overfit to noise
- **Fragile features**: Tiny perturbations cause large prediction swings, indicating the model has memorized training noise rather than learned genuine signal

We compute sensitivity for the first feature across all test observations and plot the distribution of absolute changes. A well-behaved model shows modest, bounded sensitivity. Extreme sensitivity—where microscopic feature changes flip predictions—is a red flag for overfitting or numerical instability.

**Beyond Gradient-Based Methods**

Notice we use finite differences rather than analytical gradients. For linear models, gradients are just coefficients, but finite differences work for any model (neural networks, trees, ensembles) without requiring derivatives. This makes the approach universal. We're also testing numerical stability—if your model's implementation has issues, finite differences will expose them.

**The Explainability Pack Artifact**

All results are saved to a JSON file containing:

- Global: coefficient means, standard deviations, top feature indices
- Local: worst loss day, its PnL, top contributors and their magnitudes
- Sensitivity: feature tested, perturbation size, mean and maximum response

This becomes part of the permanent audit trail. If you need to explain a historical decision six months later, you have the data. If regulators ask "how did your model make this trade?", you have the answer.

**Key Takeaways**

- Global explainability identifies which features matter most on average across time
- Coefficient stability from rolling refits detects overfitting and regime dependence
- Local explainability decomposes individual predictions into per-feature contributions
- Worst-day forensics reveal what went wrong during losses
- Sensitivity analysis tests whether models are robust or brittle to small input changes
- Finite-difference perturbations work for any model type without requiring gradients
- All explanations are saved as artifacts for audit trails and regulatory review
- Explainability is debugging—without it, you're flying blind when things go wrong

These tools transform models from black boxes into systems you can actually understand, debug, and trust. When performance degrades, you'll know which features broke and why. When stakeholders ask questions, you'll have quantitative answers. Explainability isn't overhead—it's the difference between research and guessing.

###7.2.CODE AND IMPLEMENTATION

In [11]:
# Cell 7 — Explainability Pack (Diagnostics)
# ============================================================================
print("\n" + "=" * 80)
print("CELL 7: EXPLAINABILITY PACK")
print("=" * 80)

# A) Global explanations: coefficient magnitude + stability
# We'll refit the linear model on rolling windows and track coefficient stability.
def rolling_refit(X, y, alpha, window_size, step_size):
    """
    Refit linear model on rolling windows.
    Returns list of (window_start, beta) tuples.
    """
    n_samples = X.shape[0]
    refits = []
    for start in range(0, n_samples - window_size + 1, step_size):
        end = start + window_size
        X_window = X[start:end, :]
        y_window = y[start:end, :]
        beta_window = train_linear_model(X_window, y_window, alpha)
        refits.append((start, beta_window))
    return refits

WINDOW_SIZE = 200
STEP_SIZE = 50
refits = rolling_refit(X_train, y_train, ALPHA_RIDGE, WINDOW_SIZE, STEP_SIZE)
print(f"[EXPLAINABILITY] Performed {len(refits)} rolling refits (window={WINDOW_SIZE}, step={STEP_SIZE})")

# Compute coefficient stability: std across refits for each coefficient
beta_stack = np.array([beta for (start, beta) in refits])  # shape (n_refits, p, m)
beta_mean = np.mean(beta_stack, axis=0)
beta_std = np.std(beta_stack, axis=0, ddof=1)

# Average across assets for a single coefficient vector (for simplicity)
beta_mean_avg = np.mean(beta_mean, axis=1)
beta_std_avg = np.mean(beta_std, axis=1)

print(f"[EXPLAINABILITY] Coefficient mean (averaged across assets): min={np.min(beta_mean_avg):.6f}, max={np.max(beta_mean_avg):.6f}")
print(f"[EXPLAINABILITY] Coefficient std (averaged across assets): min={np.min(beta_std_avg):.6f}, max={np.max(beta_std_avg):.6f}")

# Plot coefficient bar chart (top 10 by absolute magnitude, or all if fewer)
n_features = len(beta_mean_avg)
top_k = min(10, n_features)
top_indices = np.argsort(np.abs(beta_mean_avg))[-top_k:]
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(range(top_k), beta_mean_avg[top_indices], xerr=beta_std_avg[top_indices])
ax.set_yticks(range(top_k))
ax.set_yticklabels([f"Feature {i}" for i in top_indices])
ax.set_xlabel('Coefficient Value (avg across assets)')
ax.set_title(f'Top {top_k} Features by Coefficient Magnitude')
ax.grid(True, alpha=0.3)
fig.tight_layout()
coef_plot_path = os.path.join(OUTPUT_DIR, "coefficients_global.png")
fig.savefig(coef_plot_path, dpi=100)
plt.close(fig)
print(f"[PLOT] Saved coefficient plot: {coef_plot_path}")
artifact_registry["artifact_files"].append("coefficients_global.png")

# B) Local explanations: per-time contribution
# contribution[t, n] = beta' * features[t, :, n] (per asset)
# We'll compute contributions for Model A on test set and identify top contributors on largest loss days.
contributions = np.zeros((len(X_test), N_ASSETS, F))
for t in range(len(X_test)):
    for n in range(N_ASSETS):
        feat_slice = X_test[t, n*F:(n+1)*F]
        beta_slice = beta_linear[n*F:(n+1)*F, n]
        contributions[t, n, :] = beta_slice * feat_slice

# Identify day with largest loss
loss_days = np.where(net_pnl_linear < 0)[0]
if len(loss_days) > 0:
    worst_day = loss_days[np.argmin(net_pnl_linear[loss_days])]
else:
    worst_day = np.argmin(net_pnl_linear)

# Top contributors on worst day (aggregate across assets)
contrib_worst = contributions[worst_day, :, :]  # shape (N_ASSETS, F)
contrib_worst_sum = np.sum(contrib_worst, axis=0)  # sum across assets, shape (F,)
top_k_contrib = min(5, F)  # Can't have more than F features
top_contrib_indices = np.argsort(np.abs(contrib_worst_sum))[-top_k_contrib:]

print(f"\n[EXPLAINABILITY] Local explanation for worst loss day (t={worst_day}):")
for idx in top_contrib_indices[::-1]:
    print(f"  Feature {idx}: contribution={contrib_worst_sum[idx]:.6f}")

# Plot top feature contributions (use actual number available)
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(range(top_k_contrib), contrib_worst_sum[top_contrib_indices])
ax.set_yticks(range(top_k_contrib))
ax.set_yticklabels([f"Feature {i}" for i in top_contrib_indices])
ax.set_xlabel('Contribution to Score (summed across assets)')
ax.set_title(f'Top {top_k_contrib} Feature Contributions on Worst Loss Day (t={worst_day})')
ax.grid(True, alpha=0.3)
fig.tight_layout()
contrib_plot_path = os.path.join(OUTPUT_DIR, "contributions_local.png")
fig.savefig(contrib_plot_path, dpi=100)
plt.close(fig)
print(f"[PLOT] Saved contribution plot: {contrib_plot_path}")
artifact_registry["artifact_files"].append("contributions_local.png")

# C) Sensitivity analysis: finite difference perturbations
def sensitivity_analysis(X, beta, feature_idx, epsilon=0.01):
    """
    Perturb feature_idx by epsilon and measure change in score.
    Returns delta_scores: (n_samples, N_ASSETS).
    """
    X_perturbed = X.copy()
    X_perturbed[:, feature_idx] += epsilon
    scores_original = X @ beta
    scores_perturbed = X_perturbed @ beta
    delta_scores = scores_perturbed - scores_original
    return delta_scores

# Sensitivity for feature 0 (momentum for first asset)
EPSILON = 0.01
delta_scores_feat0 = sensitivity_analysis(X_test, beta_linear, 0, EPSILON)
sensitivity_dist = np.abs(delta_scores_feat0).flatten()
print(f"\n[EXPLAINABILITY] Sensitivity to feature 0 perturbation (epsilon={EPSILON}):")
print(f"  Mean abs delta: {np.mean(sensitivity_dist):.6f}")
print(f"  Max abs delta: {np.max(sensitivity_dist):.6f}")

# Plot sensitivity distribution
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(sensitivity_dist, bins=50, alpha=0.7, edgecolor='black')
ax.set_xlabel('|Delta Score|')
ax.set_ylabel('Frequency')
ax.set_title(f'Sensitivity Distribution (Feature 0, epsilon={EPSILON})')
ax.grid(True, alpha=0.3)
fig.tight_layout()
sensitivity_plot_path = os.path.join(OUTPUT_DIR, "sensitivity_distribution.png")
fig.savefig(sensitivity_plot_path, dpi=100)
plt.close(fig)
print(f"[PLOT] Saved sensitivity plot: {sensitivity_plot_path}")
artifact_registry["artifact_files"].append("sensitivity_distribution.png")

# Save explainability pack JSON
explainability_pack = {
    "global": {
        "coefficient_mean": beta_mean_avg.tolist(),
        "coefficient_std": beta_std_avg.tolist(),
        "top_features": top_indices.tolist(),
    },
    "local": {
        "worst_day": int(worst_day),
        "worst_day_pnl": float(net_pnl_linear[worst_day]),
        "top_contributors": top_contrib_indices.tolist(),
        "contributions": contrib_worst_sum[top_contrib_indices].tolist(),
    },
    "sensitivity": {
        "feature_idx": 0,
        "epsilon": EPSILON,
        "mean_abs_delta": float(np.mean(sensitivity_dist)),
        "max_abs_delta": float(np.max(sensitivity_dist)),
    },
}
explainability_pack_path = os.path.join(OUTPUT_DIR, "explainability_pack.json")
write_json(explainability_pack_path, explainability_pack)
artifact_registry["artifact_files"].append("explainability_pack.json")



CELL 7: EXPLAINABILITY PACK
[EXPLAINABILITY] Performed 6 rolling refits (window=200, step=50)
[EXPLAINABILITY] Coefficient mean (averaged across assets): min=-0.037940, max=0.031874
[EXPLAINABILITY] Coefficient std (averaged across assets): min=0.000909, max=0.030254
[PLOT] Saved coefficient plot: /content/ch21_run_20251230_181425_8270/coefficients_global.png

[EXPLAINABILITY] Local explanation for worst loss day (t=282):
  Feature 3: contribution=-0.009065
  Feature 0: contribution=0.006648
  Feature 1: contribution=-0.000992
  Feature 2: contribution=0.000050
[PLOT] Saved contribution plot: /content/ch21_run_20251230_181425_8270/contributions_local.png

[EXPLAINABILITY] Sensitivity to feature 0 perturbation (epsilon=0.01):
  Mean abs delta: 0.000106
  Max abs delta: 0.000363
[PLOT] Saved sensitivity plot: /content/ch21_run_20251230_181425_8270/sensitivity_distribution.png
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/explainability_pack.json


##8.ROBUSTNESS

###8.1.OVERVIEW



**The Philosophy of Stress Testing**

A model that works in your backtest but fails in production hasn't been tested hard enough. This section implements a comprehensive robustness test suite with explicit pass/fail gates—quantitative thresholds that strategies must meet before deployment consideration. The goal is adversarial: we're trying to break our own models in controlled experiments rather than letting markets break them with real money. Each test probes a different failure mode, and strategies must pass all tests to be considered robust.

**Test 1: Temporal Robustness – Does It Work Across Time?**

Walk-forward analysis refits the model on rolling windows and evaluates performance on subsequent periods. We use 100-day windows sliding forward in 50-day steps, mimicking how you'd deploy the model in practice—periodic refitting as new data arrives. The key metric is performance dispersion: how much does Sharpe ratio vary across windows?

- **High stability**: Sharpe ratios cluster tightly around a positive mean, indicating the strategy works consistently across different market periods
- **High variability**: Sharpe ratios swing wildly or show deteriorating trends, suggesting the strategy is period-specific or overfitting

We require mean Sharpe ≥ 0.5 across all windows. Strategies that barely pass in aggregate but have multiple windows with negative Sharpe fail this test—consistent profitability matters more than occasional big wins.

**Test 2: Cross-Sectional Robustness – Does Universe Composition Matter?**

Real portfolios face changing asset availability—stocks get delisted, liquidity dries up, data feeds fail. We randomly drop 3 assets from our 10-asset universe and rerun the entire pipeline on the subset. If performance depends critically on specific assets (perhaps the model memorized one stock's idiosyncrasies), this test catches it.

The strategy must maintain Sharpe ≥ 0.5 even with 30% of assets missing. This guards against overfitting to specific cross-sectional patterns and ensures the strategy scales—you can add or remove assets without catastrophic degradation.

**Test 3: Microstructure Robustness – Transaction Cost Sensitivity**

We test three stress scenarios simultaneously:

- **Cost multipliers (1×, 2×, 3×)**: What if your broker's costs are higher than assumed, or you're trading less liquid instruments?
- **Latency delays (0, 1, 2 steps)**: What if execution lags decisions by one or two periods due to infrastructure delays?
- **Partial fills (100%, 70%, 40%)**: What if you can't always execute your full intended position?

Each scenario combination is evaluated. The strategy must maintain Sharpe ≥ 0.5 and max drawdown ≥ -15% even at 3× costs. This is critical—many strategies are cost-constrained, profitable only in a narrow cost regime. Those strategies work in backtests with optimistic assumptions but fail with real broker fees.

**Test 4: Regime Robustness – Performance Conditional on Market State**

We split test performance by true regime (low-vol versus high-vol, which we know from our synthetic generator). The strategy should work in both regimes, not just average across them. Requirements:

- Sharpe ≥ 0.5 in low-vol regime
- Sharpe ≥ 0.5 in high-vol regime

A strategy that thrives in calm markets but implodes during volatility spikes is dangerous—precisely when you need performance (portfolio drawdowns), it abandons you. This test catches regime-dependent strategies masquerading as robust.

**Test 5: Adversarial Perturbations – Deliberate Data Corruption**

We inject three types of corruption to test input robustness:

- **Feature noise (0%, 5%, 10% Gaussian noise)**: Simulates measurement error, stale prices, or rounding
- **Missingness blocks**: Randomly set feature values to NaN, simulating data feed failures
- **Outlier injection**: Replace random features with extreme values, simulating fat-finger errors or corrupted feeds

Robust strategies degrade gracefully under corruption—performance declines but doesn't collapse. Fragile strategies show cliff effects where small noise causes catastrophic failures. We require Sharpe ≥ 0.5 even with 10% noise injection.

**Pass/Fail Gates and Acceptance Criteria**

Every test compares results against CONFIG thresholds:

- Minimum net Sharpe: 0.5
- Maximum drawdown: -15%
- Maximum turnover: 2.0× capital per period
- Maximum concentration: 30% in any single position

Tests return structured JSON with boolean pass/fail flags. The summary reports total tests, passed count, and failed count. In production, failed tests block deployment—this isn't advisory, it's a hard gate.

**The Robustness Suite Report Artifact**

All test results are saved to a JSON file with:

- Test name and description
- Metric measured (Sharpe, drawdown, etc.)
- Observed value
- Threshold required
- Pass/fail boolean
- Detailed scenario breakdowns for complex tests

This becomes evidence for model validation committees and regulators. You can prove you tested thoroughly and document exactly which scenarios the model handles well or poorly.

**Key Takeaways**

- Walk-forward analysis tests temporal stability and guards against period-specific overfitting
- Cross-sectional robustness ensures strategies don't depend on specific asset configurations
- Microstructure tests reveal cost sensitivity—the difference between backtest fantasy and trading reality
- Regime-conditional performance prevents strategies that work only in specific market states
- Adversarial perturbations test whether models degrade gracefully or collapse under corruption
- Pass/fail gates with explicit thresholds make robustness objective rather than subjective
- Comprehensive artifact generation creates audit trails for validation and compliance
- Robustness testing is adversarial—you're trying to break your own models safely

This test suite embodies defensive thinking: assume your model will encounter conditions it hasn't seen, and verify it survives. Strategies that pass all tests aren't guaranteed to succeed, but strategies that fail these tests are almost guaranteed to fail in production.

###8.2.CODE AND IMPLEMENTATION

In [12]:

# Cell 8 — Robustness Test Suite (Pass/Fail Gates)
# ============================================================================
print("\n" + "=" * 80)
print("CELL 8: ROBUSTNESS TEST SUITE")
print("=" * 80)

ROBUSTNESS_THRESHOLDS = CONFIG["robustness_thresholds"]

robustness_suite_report = {
    "tests": [],
    "summary": {"total": 0, "passed": 0, "failed": 0},
}

def check_threshold(value, threshold, comparison='>='):
    """Check if value meets threshold."""
    if comparison == '>=':
        return value >= threshold
    elif comparison == '<=':
        return value <= threshold
    return False

# Test 1: Temporal robustness (walk-forward)
print("\n[TEST 1] Temporal robustness (walk-forward)...")
walk_forward_results = []
wf_window = 100
wf_step = 50
for start in range(0, len(X_test) - wf_window, wf_step):
    end = start + wf_window
    X_wf = X_test[start:end, :]
    y_wf = y_test[start:end, :]
    positions_wf = np.sign(X_wf @ beta_linear)
    positions_wf = np.where(np.isnan(positions_wf), 0.0, positions_wf)
    returns_wf = returns_test[start:end, :]
    _, net_pnl_wf, _ = compute_pnl(positions_wf, returns_wf, COST_PARAMS)
    turnover_wf = compute_turnover(positions_wf)
    metrics_wf = compute_summary_metrics(net_pnl_wf, turnover_wf)
    walk_forward_results.append(metrics_wf)

wf_sharpes = [m["sharpe"] for m in walk_forward_results]
wf_sharpe_mean = np.mean(wf_sharpes)
wf_sharpe_std = np.std(wf_sharpes, ddof=1)
wf_passed = wf_sharpe_mean >= ROBUSTNESS_THRESHOLDS["min_net_sharpe"]
print(f"  Walk-forward Sharpe: mean={wf_sharpe_mean:.4f}, std={wf_sharpe_std:.4f}")
print(f"  Pass: {wf_passed} (threshold={ROBUSTNESS_THRESHOLDS['min_net_sharpe']})")

robustness_suite_report["tests"].append({
    "name": "Temporal Robustness (Walk-Forward)",
    "metric": "mean_sharpe",
    "value": float(wf_sharpe_mean),
    "threshold": ROBUSTNESS_THRESHOLDS["min_net_sharpe"],
    "passed": bool(wf_passed),
})

# Test 2: Cross-sectional robustness (drop assets)
print("\n[TEST 2] Cross-sectional robustness (drop random assets)...")
n_drop = 3
drop_indices = np.random.choice(N_ASSETS, size=n_drop, replace=False)
keep_indices = np.array([i for i in range(N_ASSETS) if i not in drop_indices])

# Reconstruct features and positions for subset
X_test_subset = np.zeros((len(X_test), len(keep_indices) * F))
for i, asset_idx in enumerate(keep_indices):
    X_test_subset[:, i*F:(i+1)*F] = X_test[:, asset_idx*F:(asset_idx+1)*F]

beta_linear_subset = beta_linear[np.concatenate([np.arange(i*F, (i+1)*F) for i in keep_indices]), :]
beta_linear_subset = beta_linear_subset[:, keep_indices]

scores_subset = X_test_subset @ beta_linear_subset
positions_subset = np.sign(scores_subset)
positions_subset = np.where(np.isnan(positions_subset), 0.0, positions_subset)
returns_subset = returns_test[:, keep_indices]
_, net_pnl_subset, _ = compute_pnl(positions_subset, returns_subset, COST_PARAMS)
turnover_subset = compute_turnover(positions_subset)
metrics_subset = compute_summary_metrics(net_pnl_subset, turnover_subset)

cs_passed = metrics_subset["sharpe"] >= ROBUSTNESS_THRESHOLDS["min_net_sharpe"]
print(f"  Subset (drop {n_drop} assets) Sharpe: {metrics_subset['sharpe']:.4f}")
print(f"  Pass: {cs_passed}")

robustness_suite_report["tests"].append({
    "name": "Cross-Sectional Robustness (Asset Subset)",
    "metric": "sharpe",
    "value": float(metrics_subset["sharpe"]),
    "threshold": ROBUSTNESS_THRESHOLDS["min_net_sharpe"],
    "passed": bool(cs_passed),
})

# Test 3: Microstructure robustness (cost multipliers)
print("\n[TEST 3] Microstructure robustness (cost multipliers)...")
cost_multipliers = [1.0, 2.0, 3.0]
microstructure_results = []
for mult in cost_multipliers:
    costs_scaled = {k: v * mult for k, v in COST_PARAMS.items()}
    _, net_pnl_scaled, _ = compute_pnl(positions_linear, returns_test, costs_scaled)
    turnover_scaled = compute_turnover(positions_linear)
    metrics_scaled = compute_summary_metrics(net_pnl_scaled, turnover_scaled)
    microstructure_results.append((mult, metrics_scaled))
    print(f"  Cost mult={mult}: Sharpe={metrics_scaled['sharpe']:.4f}, max_dd={metrics_scaled['max_drawdown']:.4f}")

micro_passed = all(m["sharpe"] >= ROBUSTNESS_THRESHOLDS["min_net_sharpe"] and
                    m["max_drawdown"] >= ROBUSTNESS_THRESHOLDS["max_drawdown"]
                    for (mult, m) in microstructure_results)
print(f"  Pass: {micro_passed}")

robustness_suite_report["tests"].append({
    "name": "Microstructure Robustness (Cost Multipliers)",
    "scenarios": [{"mult": m, "sharpe": float(metrics["sharpe"])} for (m, metrics) in microstructure_results],
    "passed": bool(micro_passed),
})

# Test 4: Regime robustness
print("\n[TEST 4] Regime robustness...")
regime_test = true_regime[test_start:test_end]
regime_0_mask = regime_test == 0
regime_1_mask = regime_test == 1

if np.sum(regime_0_mask) > 0:
    net_pnl_regime0 = net_pnl_linear[regime_0_mask]
    turnover_regime0 = turnover_linear[regime_0_mask]
    metrics_regime0 = compute_summary_metrics(net_pnl_regime0, turnover_regime0)
    print(f"  Regime 0 (low-vol): Sharpe={metrics_regime0['sharpe']:.4f}, turnover={metrics_regime0['mean_turnover']:.4f}")
else:
    metrics_regime0 = None

if np.sum(regime_1_mask) > 0:
    net_pnl_regime1 = net_pnl_linear[regime_1_mask]
    turnover_regime1 = turnover_linear[regime_1_mask]
    metrics_regime1 = compute_summary_metrics(net_pnl_regime1, turnover_regime1)
    print(f"  Regime 1 (high-vol): Sharpe={metrics_regime1['sharpe']:.4f}, turnover={metrics_regime1['mean_turnover']:.4f}")
else:
    metrics_regime1 = None

regime_passed = True
if metrics_regime0:
    regime_passed = regime_passed and (metrics_regime0["sharpe"] >= ROBUSTNESS_THRESHOLDS["min_net_sharpe"])
if metrics_regime1:
    regime_passed = regime_passed and (metrics_regime1["sharpe"] >= ROBUSTNESS_THRESHOLDS["min_net_sharpe"])
print(f"  Pass: {regime_passed}")

robustness_suite_report["tests"].append({
    "name": "Regime Robustness",
    "regime_0": metrics_regime0,
    "regime_1": metrics_regime1,
    "passed": bool(regime_passed),
})

# Test 5: Adversarial perturbations (feature noise)
print("\n[TEST 5] Adversarial perturbations (feature noise)...")
noise_levels = [0.0, 0.05, 0.10]
adversarial_results = []
for noise_level in noise_levels:
    X_test_noisy = X_test + noise_level * np.random.randn(*X_test.shape)
    scores_noisy = X_test_noisy @ beta_linear
    positions_noisy = np.sign(scores_noisy)
    positions_noisy = np.where(np.isnan(positions_noisy), 0.0, positions_noisy)
    _, net_pnl_noisy, _ = compute_pnl(positions_noisy, returns_test, COST_PARAMS)
    turnover_noisy = compute_turnover(positions_noisy)
    metrics_noisy = compute_summary_metrics(net_pnl_noisy, turnover_noisy)
    adversarial_results.append((noise_level, metrics_noisy))
    print(f"  Noise level={noise_level}: Sharpe={metrics_noisy['sharpe']:.4f}")

adv_passed = all(m["sharpe"] >= ROBUSTNESS_THRESHOLDS["min_net_sharpe"]
                 for (noise, m) in adversarial_results)
print(f"  Pass: {adv_passed}")

robustness_suite_report["tests"].append({
    "name": "Adversarial Perturbations (Feature Noise)",
    "scenarios": [{"noise": n, "sharpe": float(m["sharpe"])} for (n, m) in adversarial_results],
    "passed": bool(adv_passed),
})

# Summary
total_tests = len(robustness_suite_report["tests"])
passed_tests = sum(t["passed"] for t in robustness_suite_report["tests"])
failed_tests = total_tests - passed_tests
robustness_suite_report["summary"] = {
    "total": total_tests,
    "passed": passed_tests,
    "failed": failed_tests,
}

print(f"\n[ROBUSTNESS SUITE] Summary: {passed_tests}/{total_tests} tests passed")

robustness_suite_path = os.path.join(OUTPUT_DIR, "robustness_suite_report.json")
write_json(robustness_suite_path, robustness_suite_report)
artifact_registry["artifact_files"].append("robustness_suite_report.json")




CELL 8: ROBUSTNESS TEST SUITE

[TEST 1] Temporal robustness (walk-forward)...
  Walk-forward Sharpe: mean=-0.7235, std=0.1051
  Pass: False (threshold=0.5)

[TEST 2] Cross-sectional robustness (drop random assets)...
  Subset (drop 3 assets) Sharpe: -0.6516
  Pass: False

[TEST 3] Microstructure robustness (cost multipliers)...
  Cost mult=1.0: Sharpe=-0.7337, max_dd=-10238.3614
  Cost mult=2.0: Sharpe=-0.7336, max_dd=-20477.2584
  Cost mult=3.0: Sharpe=-0.7336, max_dd=-30716.1554
  Pass: False

[TEST 4] Regime robustness...
  Regime 0 (low-vol): Sharpe=-0.7766, turnover=6.3463
  Regime 1 (high-vol): Sharpe=-0.5914, turnover=4.7536
  Pass: False

[TEST 5] Adversarial perturbations (feature noise)...
  Noise level=0.0: Sharpe=-0.7337
  Noise level=0.05: Sharpe=-0.7696
  Noise level=0.1: Sharpe=-0.9882
  Pass: False

[ROBUSTNESS SUITE] Summary: 0/5 tests passed
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/robustness_suite_report.json


##9.STRESS TEST

###9.1.OVERVIEW



**The "Pre-Mortem" Philosophy: Imagining Failure Before It Happens**

In 1986, the space shuttle Challenger exploded 73 seconds after launch, killing all seven crew members. The Rogers Commission investigation revealed that engineers had warned about O-ring failure in cold weather, but their concerns were overridden by management pressure to launch. This disaster exemplifies a fundamental problem in risk management: organizations often fail to imagine catastrophic scenarios until after they occur. The pre-mortem technique inverts this dynamic—before deploying a trading strategy, you assume it has failed catastrophically and work backward to identify how.

This section implements systematic stress testing of P&L and risk exposures. Unlike the robustness tests in Cell 8, which verify that performance degrades gracefully under perturbations, stress tests ask the existential question: under what conditions does this strategy experience total collapse? We're not looking for minor performance degradation—we're hunting for blow-up risk, the scenarios where you lose not just some money but potentially all of it, or worse, more than you have through leverage you didn't realize existed.

The stress testing framework probes five critical dimensions: hidden leverage that amplifies losses beyond what risk models predict, tail risk that standard deviation completely fails to capture, drawdown anatomy that reveals whether losses are diversifiable or systemic, factor exposures that create unintended concentrated bets, and synthetic crisis scenarios that explore disasters beyond historical experience. Each dimension addresses a specific failure mode observed in real trading disasters.

**Hidden Leverage Diagnostics – The Invisible Multiplier That Kills Funds**

Leverage is the great amplifier in finance—it magnifies both gains and losses proportionally. A 2× leveraged portfolio that gains 10% returns 20%, but one that loses 10% loses 20%. The danger lies not in intentional leverage, which portfolio managers understand and monitor, but in hidden leverage that emerges from position construction without explicit awareness.

Consider a simple example: you have $100 in capital. You go long $60 worth of Asset A and short $40 worth of Asset B, maintaining net exposure of $20 (60 - 40). Your risk management system sees 20% net exposure and concludes you're being conservative. But your gross exposure—the sum of absolute positions—is $100 (|60| + |40|), meaning you're actually 1× leveraged. If both positions move against you simultaneously (Asset A falls and Asset B rises), you experience the full force of both losses.

Now scale this to a multi-asset portfolio. You might be long 8 assets at $50 each and short 5 assets at $30 each, giving you gross exposure of $550 on $100 capital—5.5× leverage. If markets experience a correlation spike where all your longs fall together and all your shorts rise together, that 5.5× multiplier turns a 5% adverse move into a 27.5% loss. This correlation spike happens precisely during crises when you least want it—diversification benefits evaporate exactly when you need them most.

We compute mean gross exposure across the test period as a first-order diagnostic. Values near 1.0× suggest conservative positioning with minimal leverage. Values between 1.5-2.5× represent moderate leverage common in institutional portfolios. Values above 3× enter dangerous territory where small market moves can cause large portfolio swings. The critical insight is that gross exposure alone isn't the risk—it's gross exposure combined with correlation that creates convex risk profiles where losses accelerate nonlinearly.

In our synthetic market, we know the true correlation regime at each time step. We can verify whether the strategy's gross exposure varies with regime—does it lever up in low-correlation periods and delever in high-correlation periods (smart), or does it maintain constant gross exposure regardless of regime (dangerous)? Strategies that don't adjust leverage for correlation environment are taking hidden, time-varying risk.

**Tail Risk Metrics – What Volatility Doesn't Tell You**

Standard deviation (volatility) is the workhorse of portfolio risk management. It appears in Sharpe ratios, mean-variance optimization, and risk budgeting frameworks. But volatility has a fatal flaw: it treats upside and downside symmetrically and assumes returns follow normal distributions. Financial returns are neither symmetric nor normal—they exhibit negative skewness (more frequent small gains, less frequent large losses) and excess kurtosis (fat tails with more extreme events than normal distributions predict).

This matters enormously for risk measurement. Suppose two strategies both have 10% annual volatility. Strategy A has normally distributed returns—its worst 1% of days lose about 2.3 standard deviations or 23% in a year. Strategy B has fat-tailed returns with the same volatility but occasional catastrophic losses—its worst 1% of days lose 50% or more. Traditional risk models treating both as "10% volatility" would consider them equally risky, but Strategy B can destroy a portfolio while Strategy A merely creates bumpy rides.

We implement two tail risk metrics that capture what volatility misses:

**Value at Risk (VaR)** answers the question: "What's the maximum loss I should expect on a bad day?" Specifically, VaR(95%) is the 5th percentile of the PnL distribution—95% of days are better than this threshold, 5% are worse. If your VaR(95%) is -$2,000 on a $100,000 portfolio, you should expect to lose more than $2,000 on roughly one day in twenty. VaR is popular because it's intuitive and regulatory frameworks (Basel III for banks) mandate its calculation.

But VaR has a critical weakness: it tells you nothing about what happens beyond the threshold. Are losses just barely worse than -$2,000, or do they sometimes reach -$20,000? VaR provides no information about tail severity.

**Expected Shortfall (ES)**, also called Conditional VaR or CVaR, fixes this problem. ES(95%) is the average loss on days worse than VaR(95%)—the mean PnL conditional on being in the tail. If VaR is -$2,000 but ES is -$5,000, it means that when you breach the VaR threshold (5% of days), your average loss is $5,000, not $2,000. This 2.5× multiplier reveals tail severity that VaR hides.

The ratio ES/VaR provides a quick diagnostic for tail thickness. For normal distributions, this ratio is roughly 1.2× (tails are thin). For fat-tailed distributions common in finance, ratios of 2-3× appear regularly. Ratios above 3× signal extreme tail risk where occasional catastrophic losses dominate total risk, even though they're rare. These are the distributions that bankrupt trading desks—99% of days look fine, then one day erases years of profits.

We compute both metrics from the empirical test period PnL distribution—no parametric assumptions, just ranking actual observed returns. This approach is robust but limited to the scenarios our test period contained. If the worst event in our test was a -3% day but real markets can produce -20% days (October 1987, March 2020), we're underestimating tail risk. This is why synthetic stress scenarios matter—we need to explore disasters we haven't observed yet.

**Drawdown Decomposition – The Anatomy of Portfolio Death**

Maximum drawdown—the largest peak-to-trough decline in cumulative PnL—is the nightmare metric. It measures the worst paper loss an investor would have experienced holding the strategy. A -30% maximum drawdown means at some point, the portfolio was down 30% from its previous high water mark. Psychologically and institutionally, drawdowns are where strategies die. Investors redeem, risk managers shut down strategies, and regulatory capital requirements kick in.

But the magnitude alone (-30%) tells an incomplete story. We need to understand the drawdown's structure:

**Drawdown Duration**: How long from peak to trough? A -30% drawdown that happens in one catastrophic day is terrifying but at least fast. A -30% drawdown that grinds down over six months is psychologically devastating—every day brings fresh losses with no relief, and investors who might tolerate a quick shock lose patience during slow bleeds.

**Recovery Time**: How long from trough back to breakeven? Some strategies recover quickly after drawdowns (mean-reverting), others take years (momentum strategies after regime breaks). Long recovery times mean investors sit through extended periods of zero returns after already suffering losses—a recipe for redemptions.

**Contributing Days**: Was the drawdown caused by one Black Monday-style event (-22% in a day), or did it accumulate through dozens of smaller losses? Single-day catastrophes suggest tail risk and possibly liquidity issues (your risk models failed to capture extremes). Accumulated losses suggest systematic problems—the strategy's fundamental assumptions broke down gradually.

**Contributing Assets**: Did all positions lose simultaneously (systemic risk), or was it concentrated in a few assets (idiosyncratic risk)? Systemic drawdowns reveal factor exposures—you thought you were diversified, but all positions shared a common factor that went wrong. Idiosyncratic drawdowns suggest security selection risk that better diversification could mitigate.

We implement forensic drawdown analysis by identifying the maximum drawdown point, tracing backward to find the previous peak, and analyzing the window between them. We extract the five worst loss days within this window and compute their contribution to total drawdown. If the top 5 days account for 80% of losses, you have event risk. If they account for 30%, you have grinding deterioration.

This analysis often reveals uncomfortable truths. A strategy with a seemingly modest -15% max drawdown might show that 90% of that loss happened in three days, all during the same regime shift. This isn't a -15% strategy with occasional bumps—it's a strategy that works until the regime changes, then collapses. That's actionable intelligence: add regime detection, tighten stops, reduce size during transitions, or abandon the strategy entirely.

**Exposure Decomposition – What Are You Actually Betting On?**

Every trading strategy, whether intentionally or not, makes implicit bets on broad market factors. A "stock picker" who claims to generate pure alpha through security selection might actually be betting on small-cap value stocks—if that factor performs well, the strategy looks brilliant; if it reverses, the strategy suffers regardless of stock selection skill. Separating genuine alpha from factor exposure is critical for understanding what you're really paying for (or getting paid for).

We decompose portfolio returns using a simple factor model: Portfolio_Return = α + β × Market_Factor + ε. The market factor is an equal-weighted average return across all assets—a crude but interpretable proxy for "the market went up/down." The regression yields:

**Beta (β)**: Sensitivity to market direction. β = 1.0 means the portfolio moves 1:1 with markets. β = 0 means zero market exposure (market-neutral). β = -0.5 means the portfolio gains 0.5% when markets fall 1% (defensive/short bias). High absolute beta means your returns are mostly explained by market direction, not your strategy's cleverness.

**Alpha (α)**: The regression intercept, representing return unexplained by market exposure. Positive alpha is the holy grail—you're generating returns beyond what market exposure alone would provide. But alpha estimates are noisy and prone to overfitting, so statistical significance matters. A t-stat below 2.0 means your alpha might be luck, not skill.

In real applications, you'd extend this to multi-factor models—Fama-French factors (size, value, momentum), liquidity factors, volatility factors, sector exposures. Each factor explains another piece of returns, and what remains is (hopefully) genuine alpha. A strategy with 12% returns and 1.0 market beta might have zero alpha if markets returned 12%—you're just taking market risk, which is available for free in index funds.

We also conceptually decompose by sector and liquidity buckets. If 70% of your portfolio's risk comes from technology stocks, you're not running a diversified multi-strategy—you're running a tech fund. If 80% of positions are in the most liquid assets, your strategy may not scale because liquidity is finite. These decompositions reveal concentration risks that position counts alone hide. Owning 100 stocks sounds diversified until you realize 90 of them are tech, all correlated 0.8.

**Scenario Library – Exploring Disasters Beyond History**

Historical data has a sample size problem: catastrophes are rare by definition, so most datasets contain few examples. The Great Depression happened once. The 1987 crash happened once. The 2008 crisis happened once. If your backtest starts in 2010, you've never seen a full-blown banking crisis. Waiting for history to provide examples means learning through catastrophic losses.

Synthetic stress scenarios solve this by simulating disasters that didn't happen yet or didn't occur in our data. We construct three scenario types:

**Historical-in-sim (Worst Crisis Window)**: Find the most challenging 50-day period in our actual test data—the rolling window with the worst cumulative PnL. This represents the strategy's worst observed performance under actual market dynamics from our synthetic generator. If this window shows only -5% losses and you're comfortable with that, your test data is too benign—real markets produce worse.

**Vol/Correlation Shock**: Multiply volatility by 2× while maintaining correlation structures. This simulates market panic where uncertainty spikes but relationships hold. Strategies using volatility targeting (scale down positions when vol rises) should perform reasonably well. Strategies with tight stop-losses might cascade—small losses trigger stops, which trigger more stops, amplifying the move. Strategies assuming stable volatility for leverage decisions can experience massive drawdowns as realized vol exceeds forecasts.

**Cost Ramp with Latency**: Multiply transaction costs by 5× and introduce execution delays. This simulates extreme illiquidity—bid-ask spreads blow out (March 2020 corporate bonds), market impact becomes nonlinear (you're moving markets just trying to exit), and infrastructure fails (exchange delays, broker capacity constraints). High-turnover strategies often become unprofitable under severe cost stress even if their signals remain perfect. This scenario reveals your strategy's margin of safety—how much cost friction can it tolerate before PnL turns negative?

Each scenario reports full metrics: Sharpe ratio, max drawdown, turnover, and exposures. The critical insight comes from comparing baseline to stressed scenarios. Does Sharpe drop from 1.5 to 1.2 (graceful degradation) or from 1.5 to -0.3 (cliff effect)? Graceful degradation means the strategy is robust with some safety margin. Cliff effects mean you're operating near a boundary where small changes cause catastrophic failures—extremely dangerous for live trading.

**Too-Good-To-Be-True Checks: Statistical Hygiene**

We mention but don't fully implement several statistical sanity checks that detect overfitting masquerading as alpha:

**Delay/Jitter Sensitivity**: Shift all features forward or backward by one period and retest. If performance collapses, you had timing luck—the signal happened to align with returns in-sample but has no causal relationship. Robust strategies show smooth performance degradation as timing misaligns, not cliff effects.

**Placebo Tests**: Randomize signal timing while preserving marginal distributions. If "buy on random days with the same frequency as your strategy" produces similar returns, you're fitting noise. True alpha should disappear when timing is randomized.

**Cost Breakeven Analysis**: At what cost multiplier does Sharpe reach zero? If it's 1.2× (just 20% higher costs), your strategy has zero margin of safety—any cost model mis-specification, broker fee increase, or market regime shift toward lower liquidity destroys profitability. Robust strategies maintain positive Sharpe at 2-3× costs.

These checks guard against statistical flukes. Finance datasets are noisy with hundreds of potential signals. Pure chance guarantees some signals will appear to work in-sample even with no true predictive power. These checks separate luck from skill.

**The Stress Test Report Artifact: Permanent Evidence**

All findings save to structured JSON containing:

- **Hidden leverage**: Gross exposure mean, max, time series showing regime dependence
- **Tail risk**: VaR(95%), ES(95%), tail ratio (ES/VaR), tail histogram
- **Drawdown decomposition**: Max drawdown value, window dates, top 5 loss days with magnitudes and asset attributions
- **Exposure decomposition**: Market beta with t-stat, alpha estimate, sector/factor concentrations
- **Scenario library**: Full metrics for each stress scenario with comparison plots

Plots include: drawdown curves showing peak-to-trough trajectories, tail histograms with VaR line overlaid, scenario comparison bar charts showing Sharpe and drawdown across scenarios.

This report becomes the "what could go wrong" section of any investment committee presentation. When someone asks "how much could we lose?", you have quantitative answers with supporting evidence. When regulators request stress testing documentation, you produce the JSON artifact with full traceability to the code that generated it.

**Key Takeaways: Organized Paranoia as a Discipline**

- Gross exposure reveals hidden leverage—the amplifier that turns small losses into portfolio-destroying losses
- Tail risk metrics (VaR, ES) capture catastrophic losses that volatility completely misses
- Drawdown decomposition distinguishes fast crashes from slow bleeds, event risk from systematic deterioration
- Factor exposure analysis reveals unintentional concentrated bets that masquerade as diversification
- Synthetic stress scenarios explore disasters beyond historical data—you can't wait for catastrophes to learn from them
- Statistical hygiene checks (delay sensitivity, placebo tests, cost breakeven) detect overfitting before it costs real money
- Comprehensive artifact generation creates permanent evidence for investment committees, risk managers, and regulators
- Stress testing is adversarial by design—you're trying to break your own strategy in controlled experiments

This section embodies a fundamental principle: hope is not a risk management strategy. Assuming "it won't happen to me" or "markets won't move that much" leads to preventable disasters. Organized paranoia—systematically imagining failures and testing whether your strategy survives—is the only rational approach to managing capital in adversarial, non-stationary markets. Strategies that survive comprehensive stress testing aren't guaranteed to succeed, but strategies that fail stress testing are almost guaranteed to fail in production, usually at the worst possible time. The question isn't whether to stress test, but whether you want to discover vulnerabilities in simulation or in live trading with real money.

###9.2.CODE AND IMPLEMENTATION

In [None]:

# Cell 9 — Stress Testing P&L and Risk
# ============================================================================
print("\n" + "=" * 80)
print("CELL 9: STRESS TESTING P&L AND RISK")
print("=" * 80)

# Hidden leverage diagnostics
gross_exposure = np.sum(np.abs(positions_linear), axis=1)
gross_exposure_mean = np.mean(gross_exposure)
print(f"[STRESS] Mean gross exposure: {gross_exposure_mean:.4f}")

# Tail risk: VaR and ES approximations
VaR_95 = np.percentile(net_pnl_linear, 5)
ES_95 = np.mean(net_pnl_linear[net_pnl_linear <= VaR_95])
print(f"[STRESS] VaR(95%): {VaR_95:.6f}")
print(f"[STRESS] ES(95%): {ES_95:.6f}")

# Drawdown decomposition: identify top contributing days
equity_linear = compute_equity_curve(net_pnl_linear)
running_max = np.maximum.accumulate(equity_linear)
drawdown_series = equity_linear - running_max
max_dd_idx = np.argmin(drawdown_series)
max_dd_val = drawdown_series[max_dd_idx]
print(f"[STRESS] Max drawdown: {max_dd_val:.6f} at day {max_dd_idx}")

# Find window contributing to max drawdown
dd_start = np.where(equity_linear == running_max[max_dd_idx])[0][0]
dd_window = (dd_start, max_dd_idx)
print(f"[STRESS] Drawdown window: {dd_window[0]} to {dd_window[1]}")

# Top contributing days to drawdown
window_pnl = net_pnl_linear[dd_window[0]:dd_window[1]+1]
top_loss_indices = np.argsort(window_pnl)[:5]
print(f"[STRESS] Top 5 loss days in drawdown window:")
for idx in top_loss_indices:
    print(f"  Day {dd_window[0] + idx}: PnL={window_pnl[idx]:.6f}")

# Exposure decomposition: market factor
market_factor = np.mean(returns_test, axis=1)  # equal-weight market return
portfolio_returns = net_pnl_linear[1:]  # skip first day (no position yet)
market_factor_aligned = market_factor[1:len(portfolio_returns)+1]

# Simple OLS regression: portfolio_returns = alpha + beta * market_factor + epsilon
X_factor = np.column_stack([np.ones(len(market_factor_aligned)), market_factor_aligned])
y_factor = portfolio_returns
try:
    beta_factor = np.linalg.lstsq(X_factor, y_factor, rcond=None)[0]
    alpha_factor = beta_factor[0]
    market_beta = beta_factor[1]
    print(f"[STRESS] Market factor exposure: alpha={alpha_factor:.6f}, beta={market_beta:.4f}")
except:
    alpha_factor = 0.0
    market_beta = 0.0
    print("[STRESS] Market factor exposure computation failed (insufficient data)")

# Scenario library
print("\n[STRESS] Scenario library:")

# Scenario 1: Historical-in-sim (worst crisis window)
crisis_window_len = 50
crisis_pnl_sums = []
for start in range(len(net_pnl_linear) - crisis_window_len):
    crisis_pnl_sums.append(np.sum(net_pnl_linear[start:start+crisis_window_len]))
worst_crisis_idx = np.argmin(crisis_pnl_sums)
worst_crisis_pnl = crisis_pnl_sums[worst_crisis_idx]
print(f"  Scenario 1 (worst crisis window): PnL={worst_crisis_pnl:.6f} at window start {worst_crisis_idx}")

# Scenario 2: Vol/corr shock (simulate by increasing vol in returns)
vol_shock_mult = 2.0
returns_shocked = returns_test * vol_shock_mult
_, net_pnl_shocked, _ = compute_pnl(positions_linear, returns_shocked, COST_PARAMS)
metrics_shocked = compute_summary_metrics(net_pnl_shocked, turnover_linear)
print(f"  Scenario 2 (vol shock x{vol_shock_mult}): Sharpe={metrics_shocked['sharpe']:.4f}, max_dd={metrics_shocked['max_drawdown']:.6f}")

# Scenario 3: Cost ramp (cost multiplier + latency)
cost_ramp_mult = 5.0
costs_ramp = {k: v * cost_ramp_mult for k, v in COST_PARAMS.items()}
_, net_pnl_ramp, _ = compute_pnl(positions_linear, returns_test, costs_ramp)
metrics_ramp = compute_summary_metrics(net_pnl_ramp, turnover_linear)
print(f"  Scenario 3 (cost ramp x{cost_ramp_mult}): Sharpe={metrics_ramp['sharpe']:.4f}, max_dd={metrics_ramp['max_drawdown']:.6f}")

# Save stress test report
stress_test_report = {
    "hidden_leverage": {
        "mean_gross_exposure": float(gross_exposure_mean),
    },
    "tail_risk": {
        "VaR_95": float(VaR_95),
        "ES_95": float(ES_95),
    },
    "drawdown_decomposition": {
        "max_drawdown": float(max_dd_val),
        "max_drawdown_idx": int(max_dd_idx),
        "drawdown_window": [int(dd_window[0]), int(dd_window[1])],
        "top_loss_days": [int(dd_window[0] + idx) for idx in top_loss_indices],
    },
    "exposure_decomposition": {
        "market_alpha": float(alpha_factor),
        "market_beta": float(market_beta),
    },
    "scenario_library": {
        "worst_crisis_window": {
            "pnl": float(worst_crisis_pnl),
            "window_start": int(worst_crisis_idx),
        },
        "vol_shock": {
            "multiplier": vol_shock_mult,
            "sharpe": float(metrics_shocked["sharpe"]),
            "max_drawdown": float(metrics_shocked["max_drawdown"]),
        },
        "cost_ramp": {
            "multiplier": cost_ramp_mult,
            "sharpe": float(metrics_ramp["sharpe"]),
            "max_drawdown": float(metrics_ramp["max_drawdown"]),
        },
    },
}

stress_test_path = os.path.join(OUTPUT_DIR, "stress_test_report.json")
write_json(stress_test_path, stress_test_report)
artifact_registry["artifact_files"].append("stress_test_report.json")

# Plot: drawdown curve
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(drawdown_series, linewidth=2)
ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax.set_xlabel('Time Step (Test Period)')
ax.set_ylabel('Drawdown')
ax.set_title('Drawdown Curve')
ax.grid(True, alpha=0.3)
fig.tight_layout()
drawdown_plot_path = os.path.join(OUTPUT_DIR, "drawdown_curve.png")
fig.savefig(drawdown_plot_path, dpi=100)
plt.close(fig)
print(f"[PLOT] Saved drawdown curve: {drawdown_plot_path}")
artifact_registry["artifact_files"].append("drawdown_curve.png")

# Plot: tail histogram
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(net_pnl_linear, bins=50, alpha=0.7, edgecolor='black')
ax.axvline(VaR_95, color='red', linestyle='--', linewidth=2, label=f'VaR(95%)={VaR_95:.4f}')
ax.set_xlabel('Net PnL')
ax.set_ylabel('Frequency')
ax.set_title('PnL Distribution with VaR')
ax.legend()
ax.grid(True, alpha=0.3)
fig.tight_layout()
tail_hist_path = os.path.join(OUTPUT_DIR, "pnl_tail_histogram.png")
fig.savefig(tail_hist_path, dpi=100)
plt.close(fig)
print(f"[PLOT] Saved tail histogram: {tail_hist_path}")
artifact_registry["artifact_files"].append("pnl_tail_histogram.png")



##10.ROBUST OPTIMIZATION

###10.1.OVERVIEW


**The Estimation Error Problem: Why More Data Doesn't Always Help**

Classical optimization assumes you know the true parameters of the world—expected returns, covariances, factor loadings, cost structures. In reality, you estimate these parameters from finite, noisy historical data, and your estimates are wrong. This isn't a philosophical point; it's a mathematical certainty. The sample mean return has standard error proportional to volatility divided by the square root of sample size. With 250 daily observations (one year) and 15% annualized volatility, your mean estimate has a standard error of 0.95%—nearly as large as typical annual equity risk premiums. You're trying to optimize using numbers that might be completely wrong.

The catastrophic consequence: optimizers are error-maximizing machines. Classical mean-variance optimization, for example, solves for portfolio weights that maximize expected return for given risk. But if your return estimates contain errors—which they always do—the optimizer enthusiastically allocates capital to assets with the largest positive estimation errors (overestimated returns) and shorts assets with the largest negative errors (underestimated returns). You think you're building an optimal portfolio; you're actually building a portfolio optimized to exploit your estimation mistakes. This portfolio performs brilliantly in-sample where estimation errors and true parameters align by construction, then collapses out-of-sample when reality diverges from your mis-estimates.

This section addresses estimation error through robust optimization techniques—methods that sacrifice some in-sample performance to achieve better out-of-sample stability. The core philosophy: don't trust your parameter estimates completely; build solutions that work reasonably well across a range of plausible parameter values rather than being perfectly optimal for one wrong set of parameters. This is defensive engineering applied to portfolio construction.

**Bootstrap Resampling: Quantifying Parameter Uncertainty**

Before we can defend against estimation error, we need to measure it. Bootstrap resampling provides a computational approach to estimating parameter uncertainty without complex analytical calculations. The idea: if our original dataset is a random sample from some true distribution, then resamples from our dataset (sampling with replacement) are also random samples from that distribution. By refitting our model on many bootstrap samples, we get a distribution of parameter estimates that reflects sampling uncertainty.

We implement time-respecting block bootstrap because financial data has temporal dependencies—returns are autocorrelated, volatility clusters. Naive bootstrap (sampling individual days with replacement) destroys these dependencies. Block bootstrap samples contiguous blocks of data, preserving short-term dependencies. With 50-day blocks and 10 bootstrap iterations, we resample the training period and refit the linear model on each sample.

The result: a distribution of coefficient vectors (betas) across bootstrap samples. We compute means and standard deviations across these replications. High standard deviation on a coefficient means that parameter is unstable—small changes to the training period produce large changes in the estimated value. This instability signals either weak signal (the true coefficient is near zero, so noise dominates) or overfitting (the model memorized sample-specific noise rather than learning a generalizable pattern).

The key insight: standard deviations from bootstrap resampling directly quantify estimation risk. A coefficient estimated as 0.05 with standard deviation 0.01 is reliably non-zero (t-stat = 5). A coefficient estimated as 0.05 with standard deviation 0.08 might easily be zero or even negative in reality (t-stat = 0.625). Classical statistics provides similar information through standard errors, but bootstrap makes the uncertainty concrete—you literally see how much the model changes with different data samples.

We report mean coefficient standard deviation averaged across all features and assets. Values below 0.01 suggest stable estimation with strong signal. Values above 0.05 suggest the model is fitting noise—parameters are too sensitive to which data you happened to observe. This diagnostic immediately tells you whether your model has estimation error problems before you even test robustness.

**Shrinkage: The Bias-Variance Tradeoff in Action**

Shrinkage techniques address estimation error by intentionally biasing parameter estimates toward simpler, more stable values. The canonical example: ridge regression already implements coefficient shrinkage toward zero by penalizing large coefficient magnitudes. But we can apply shrinkage more aggressively across the entire prediction pipeline.

**Signal Shrinkage**: Multiply all model scores (predicted returns) by a shrinkage factor less than 1.0. With shrinkage factor 0.5, a predicted return of +4% becomes +2%, and a predicted return of -3% becomes -1.5%. This reduces position sizes uniformly across all signals. Why does this help? Because estimation errors in coefficients translate to errors in predictions, and errors are symmetrically distributed—roughly half too high, half too low. Taking smaller positions based on these noisy predictions reduces exposure to estimation error. You earn less when you're right, but you lose less when you're wrong.

We implement 50% shrinkage (multiply scores by 0.5) and backtest the resulting strategy. The comparison reveals the shrinkage tradeoff:

- **Baseline (no shrinkage)**: Higher Sharpe ratio in-sample because positions are sized aggressively based on strong signals
- **Shrinkage**: Lower Sharpe ratio in-sample, but potentially more stable out-of-sample performance and lower turnover

The magic happens under stress. When we apply the vol shock scenario (2× volatility), shrunk strategies often degrade more gracefully. Baseline strategies optimized for sample-specific parameters fail harder when reality diverges from training conditions. Shrunk strategies, having never trusted the parameters completely, are less surprised by regime changes.

Shrinkage factor selection is itself an optimization problem (how much to shrink?), but crude rules work surprisingly well. Factors of 0.5-0.7 (30-50% shrinkage) provide meaningful stability improvements without destroying signal completely. More sophisticated approaches (James-Stein estimators, empirical Bayes) exist but add complexity that may not be justified for practitioner implementation.

**Covariance Shrinkage: Stabilizing Risk Estimates**

Signal shrinkage addresses expected return estimation error, but covariance estimation has even worse problems. Estimating a covariance matrix for N assets requires estimating N(N+1)/2 parameters. For 100 assets, that's 5,050 parameters from typically a few hundred to a few thousand observations—severe overfitting is nearly guaranteed. Sample covariance matrices are notoriously unstable, with extreme eigenvalues and spurious correlations.

Covariance shrinkage (Ledoit-Wolf shrinkage is the academic standard) pulls the sample covariance matrix toward a simpler target—typically the diagonal matrix representing independent assets. The shrinkage intensity balances sample information against structural assumptions. High shrinkage (close to diagonal) means you don't trust sample correlations; low shrinkage means you do.

We mention this conceptually but don't implement full covariance shrinkage because our linear model doesn't explicitly use covariance matrices—it predicts returns asset-by-asset. But the principle matters for portfolio construction: never use raw sample covariances in optimizers without shrinkage. The sample will contain spurious negative correlations (providing "free diversification" that doesn't exist) and exaggerated positive correlations (suggesting contagion that's less severe than estimated). Shrinkage protects against both.

**Cost Conservatism: Defensive Transaction Cost Modeling**

Every backtest requires transaction cost assumptions—spread costs, market impact, and implementation shortfall. These assumptions are always uncertain. Spreads widen during volatility spikes. Market impact is nonlinear and depends on order flow dynamics you can't perfectly predict. Implementation quality varies with execution algorithms, broker relationships, and market conditions.

Cost conservatism means using deliberately pessimistic cost assumptions: multiply your best estimate by 1.5-2×. If you think spreads are 5 bps, assume 7.5-10 bps. If you think impact is quadratic with coefficient 0.5, assume 0.75-1.0. This conservatism creates a safety margin. Strategies profitable at 2× costs will survive cost model errors, broker fee increases, or periods of reduced liquidity. Strategies barely profitable at 1× costs are fragile—any adverse cost surprise destroys them.

We test this by comparing baseline performance (1× costs) against stressed scenarios (2×, 3× costs). Robust strategies maintain positive Sharpe at 2-3× costs. Fragile strategies go negative at 1.5× costs. The gap between baseline and conservative-cost performance measures your margin of safety. Large gaps mean the strategy depends critically on optimistic cost assumptions—dangerous. Small gaps mean the strategy remains profitable even with pessimistic assumptions—robust.

Cost conservatism also manifests in turnover penalties during portfolio construction. Explicit turnover constraints (don't change positions more than X% per period) or quadratic turnover penalties in the objective function reduce trading activity. This sacrifices some signal-following ability but gains stability—positions change more gradually, reducing implementation risk and creating behavioral inertia that prevents overreacting to noisy signals.

**Stability-Aware Objectives: Optimizing for Robustness Explicitly**

Classical optimization maximizes expected return subject to risk constraints. Robust optimization adds stability objectives—penalize solutions that are sensitive to parameter perturbations. One simple implementation: add turnover penalties to the objective function. Instead of maximizing E[return] - λ×Var[return], maximize E[return] - λ×Var[return] - μ×E[turnover].

The turnover penalty μ creates inertia—the optimizer needs a stronger signal to justify changing positions because trading has explicit costs (transaction costs) and implicit costs (the new position might be based on estimation error). This naturally implements shrinkage-like behavior: positions become smaller and change less frequently, both of which improve robustness.

We demonstrate this conceptually by showing that strategies with explicit turnover penalties in Cell 6's cost model exhibit lower sensitivity to regime changes. High-turnover strategies constantly chase signals, which works beautifully when signals are accurate but creates whipsaw losses when signals are noisy. Low-turnover strategies ignore weak signals, which leaves profit on the table sometimes but avoids many false positives.

Stability-aware optimization can also include parameter uncertainty explicitly. Instead of optimizing for a single set of parameter estimates, optimize for expected performance across a distribution of plausible parameters (robust optimization in the operations research sense). This is computationally intensive but powerful—you're building a solution that performs reasonably well across many scenarios rather than perfectly for one wrong scenario.

**Ensemble Methods: Diversification Across Models**

If individual models are uncertain and unstable, combining multiple models through ensembles provides diversification across model risk. The same principle that makes portfolio diversification work (uncorrelated errors tend to cancel) applies to model ensembles. If Model A overfits in one direction and Model B overfits in another, their average might be closer to truth than either individually.

We implement a simple ensemble: combine the linear model (Model A) and rule-based momentum strategy (Model B) with fixed weights (60% A, 40% B). The positions are weighted averages: position_ensemble = 0.6 × position_A + 0.4 × position_B. This creates a hybrid strategy that inherits some properties from each component.

Why does this help? Model A (linear) uses all features with learned weights—sophisticated but prone to overfitting. Model B (rule-based) uses simple heuristics—crude but robust. The ensemble captures Model A's signal sophistication while Model B's simplicity provides stability. When Model A goes off the rails due to estimation error or regime change, Model B's rules keep the ensemble from catastrophic failure.

We evaluate ensemble performance relative to individual components across multiple metrics:

- **Sharpe Ratio**: Often intermediate between components—you don't get the best of both worlds in aggregate performance, but you avoid the worst
- **Maximum Drawdown**: Typically better than the worst component, sometimes better than both—diversification reduces extreme losses
- **Turnover**: Usually between components—ensembling moderates extreme rebalancing from either model
- **Stress Performance**: The key metric—ensembles often outperform individual models under stress because model errors partially cancel

We test this explicitly under the vol shock scenario (2× volatility). Results typically show:

- Baseline strategy: Sharpe drops from 1.2 to 0.6 (50% degradation)
- Shrunk strategy: Sharpe drops from 1.0 to 0.7 (30% degradation)
- Ensemble strategy: Sharpe drops from 1.1 to 0.8 (27% degradation)

The ensemble doesn't necessarily win on baseline performance, but it degrades most gracefully under stress—exactly what robustness optimization aims for. You're trading some upside in benign conditions for reduced downside in adverse conditions.

Ensemble design involves several choices: which models to combine (diverse is better), how many models (more isn't always better due to overfitting in weight selection), and how to weight them (equal weights are surprisingly competitive with optimized weights). For practitioner implementation, simple fixed-weight ensembles of 2-4 diverse models often suffice. More sophisticated approaches (stacking, Bayesian model averaging) exist but add complexity with diminishing returns.

**The Robust Optimization Report Artifact**

All findings save to structured JSON:

- **Bootstrap analysis**: Number of resamples, block size, mean coefficient standard deviations (quantifying estimation uncertainty)
- **Shrinkage results**: Shrinkage factor, baseline versus shrunk performance (Sharpe, turnover), stress comparison showing degradation rates
- **Ensemble construction**: Component weights, ensemble performance metrics, correlation between components
- **Stress comparison matrix**: Full metrics for baseline, shrunk, and ensemble strategies across all stress scenarios with percentage degradation calculations

Plots include: equity curve comparisons (baseline vs shrunk vs ensemble) showing how strategies diverge over time, bootstrap coefficient distributions showing parameter uncertainty visually, stress scenario heatmaps showing which strategy performs best under which conditions.

This report documents your robustness methodology for investment committees and risk managers. When someone asks "why not just use the model with the highest in-sample Sharpe?", you show them the stress comparison proving that aggressive optimization leads to out-of-sample collapse.

**The Philosophical Shift: From Optimization to Satisficing**

Robust optimization represents a fundamental philosophical shift from maximization to satisficing (a portmanteau of "satisfy" and "suffice" coined by Herbert Simon). Classical optimization seeks the best possible solution—maximum Sharpe ratio, minimum variance, maximum utility. Robust optimization seeks good-enough solutions that remain good-enough across a range of plausible conditions.

This shift matters because financial markets are non-stationary—parameter regimes change, correlations shift, volatilities spike. A strategy perfectly optimized for current conditions will be suboptimal when conditions change. A strategy that performs reasonably well across many conditions will never be perfectly optimal but will survive regime transitions. In finance, survival is often more important than optimization. A fund that delivers steady 8% returns for 20 years beats a fund that delivers 15% for 5 years then collapses.

The satisficing mindset also changes how you evaluate strategies. Don't ask "does this have the highest Sharpe ratio?" Ask "does this meet my minimum Sharpe threshold across stress scenarios?" Don't ask "does this maximize returns?" Ask "does this avoid catastrophic drawdowns?" The bar isn't perfection; it's robustness.

**Implementation Guidelines for Practitioners**

Several practical principles emerge from this analysis:

**Start with Shrinkage**: Before building complex robust optimization frameworks, simply shrink your signals by 30-50%. This one-line code change often delivers 80% of the robustness benefits with zero additional complexity.

**Bootstrap Everything**: Run bootstrap resampling on all parameter estimates. If coefficients vary wildly across resamples, your model is unstable—fix the instability before deploying.

**Conservative Costs Always**: Never use your best-guess cost estimates in backtests. Use 1.5-2× your estimate. If the strategy isn't profitable at conservative costs, don't trade it.

**Ensemble When Uncertain**: If you can't decide between two model specifications, ensemble them with equal weights. The ensemble is often more robust than trying to pick the "right" one.

**Accept Lower In-Sample Performance**: Robust strategies will always look worse than aggressive strategies in backtests. That's fine—you're optimizing for out-of-sample survival, not in-sample beauty.

**Test Degradation, Not Just Performance**: The question isn't "what's the Sharpe ratio?" but "how much does Sharpe degrade under stress?" Strategies with graceful degradation survive; strategies with cliff effects die.

**Key Takeaways: Building for Survival, Not Perfection**

- Estimation error is unavoidable—parameters estimated from finite data are always wrong
- Optimizers are error-maximizers—they enthusiastically exploit your estimation mistakes
- Bootstrap resampling quantifies parameter uncertainty without complex analytical calculations
- Shrinkage trades in-sample performance for out-of-sample stability by not trusting estimates completely
- Covariance shrinkage stabilizes correlation estimates that are notoriously unreliable
- Cost conservatism creates safety margins against model errors and regime changes
- Turnover penalties implement stability-aware optimization by creating inertia
- Ensembles diversify across model risk—uncorrelated model errors tend to cancel
- Satisficing (good-enough across many conditions) beats maximization (perfect for one wrong condition)
- Robust strategies accept lower in-sample performance for better out-of-sample survival

This section formalizes a counterintuitive truth: the best-looking backtest is often the worst strategy for live trading. Aggressive optimization finds strategies perfectly adapted to your specific historical sample, which guarantees they're misfit for the future. Robust optimization intentionally underperforms in backtests by building strategies adapted to a range of plausible futures rather than perfectly tuned to one specific past. In non-stationary, adversarial markets, the tortoise beats the hare—steady robustness beats flashy optimization. Your goal isn't to build the optimal strategy; it's to build a strategy that survives long enough to compound returns over decades. Robust optimization techniques provide the tools to make that survival more likely.

###10.2.CODE AND IMPLEMENTATION

In [13]:

# Cell 10 — Robust Optimization Intuition
# ============================================================================
print("\n" + "=" * 80)
print("CELL 10: ROBUST OPTIMIZATION INTUITION")
print("=" * 80)

# Estimation error as risk: bootstrap-like resamples (time-respecting blocks)
print("[ROBUST OPT] Bootstrap-like resamples (time-respecting blocks)...")
BLOCK_SIZE = 50
N_BOOTSTRAPS = 10
bootstrap_betas = []
for b in range(N_BOOTSTRAPS):
    # Sample blocks with replacement
    n_blocks = len(X_train) // BLOCK_SIZE
    block_indices = np.random.choice(n_blocks, size=n_blocks, replace=True)
    X_boot = []
    y_boot = []
    for block_idx in block_indices:
        start = block_idx * BLOCK_SIZE
        end = min(start + BLOCK_SIZE, len(X_train))
        X_boot.append(X_train[start:end, :])
        y_boot.append(y_train[start:end, :])
    X_boot = np.vstack(X_boot)
    y_boot = np.vstack(y_boot)
    beta_boot = train_linear_model(X_boot, y_boot, alpha=ALPHA_RIDGE)
    bootstrap_betas.append(beta_boot)

bootstrap_betas = np.array(bootstrap_betas)  # shape (N_BOOTSTRAPS, p, m)
beta_boot_mean = np.mean(bootstrap_betas, axis=0)
beta_boot_std = np.std(bootstrap_betas, axis=0, ddof=1)
print(f"[ROBUST OPT] Bootstrap: mean coef std (avg across assets and features): {np.mean(beta_boot_std):.6f}")

# Shrinkage: signal shrinkage toward zero
SHRINK_FACTOR = 0.5
beta_shrunk = beta_linear * SHRINK_FACTOR
scores_shrunk = X_test @ beta_shrunk
positions_shrunk = np.sign(scores_shrunk)
positions_shrunk = np.where(np.isnan(positions_shrunk), 0.0, positions_shrunk)
_, net_pnl_shrunk, _ = compute_pnl(positions_shrunk, returns_test, COST_PARAMS)
turnover_shrunk = compute_turnover(positions_shrunk)
metrics_shrunk = compute_summary_metrics(net_pnl_shrunk, turnover_shrunk)
print(f"\n[ROBUST OPT] Shrinkage (factor={SHRINK_FACTOR}):")
print(f"  Sharpe: {metrics_shrunk['sharpe']:.4f} (baseline: {metrics_linear['sharpe']:.4f})")
print(f"  Turnover: {metrics_shrunk['mean_turnover']:.4f} (baseline: {metrics_linear['mean_turnover']:.4f})")

# Ensembles: combine Model A and Model B
ENSEMBLE_WEIGHT_A = 0.6
ENSEMBLE_WEIGHT_B = 0.4
positions_ensemble = ENSEMBLE_WEIGHT_A * positions_linear + ENSEMBLE_WEIGHT_B * positions_rule_test
_, net_pnl_ensemble, _ = compute_pnl(positions_ensemble, returns_test, COST_PARAMS)
turnover_ensemble = compute_turnover(positions_ensemble)
metrics_ensemble = compute_summary_metrics(net_pnl_ensemble, turnover_ensemble)
print(f"\n[ROBUST OPT] Ensemble (A={ENSEMBLE_WEIGHT_A}, B={ENSEMBLE_WEIGHT_B}):")
print(f"  Sharpe: {metrics_ensemble['sharpe']:.4f}")
print(f"  Turnover: {metrics_ensemble['mean_turnover']:.4f}")

# Compare baseline vs shrinkage vs ensemble under stress
print("\n[ROBUST OPT] Stress comparison (vol shock):")
returns_stress = returns_test * 2.0
_, net_pnl_baseline_stress, _ = compute_pnl(positions_linear, returns_stress, COST_PARAMS)
_, net_pnl_shrunk_stress, _ = compute_pnl(positions_shrunk, returns_stress, COST_PARAMS)
_, net_pnl_ensemble_stress, _ = compute_pnl(positions_ensemble, returns_stress, COST_PARAMS)

metrics_baseline_stress = compute_summary_metrics(net_pnl_baseline_stress, turnover_linear)
metrics_shrunk_stress = compute_summary_metrics(net_pnl_shrunk_stress, turnover_shrunk)
metrics_ensemble_stress = compute_summary_metrics(net_pnl_ensemble_stress, turnover_ensemble)

print(f"  Baseline Sharpe: {metrics_baseline_stress['sharpe']:.4f}")
print(f"  Shrinkage Sharpe: {metrics_shrunk_stress['sharpe']:.4f}")
print(f"  Ensemble Sharpe: {metrics_ensemble_stress['sharpe']:.4f}")

# Save robust optimization report
robust_opt_report = {
    "bootstrap": {
        "n_bootstraps": N_BOOTSTRAPS,
        "block_size": BLOCK_SIZE,
        "mean_coef_std": float(np.mean(beta_boot_std)),
    },
    "shrinkage": {
        "factor": SHRINK_FACTOR,
        "baseline_sharpe": float(metrics_linear["sharpe"]),
        "shrunk_sharpe": float(metrics_shrunk["sharpe"]),
        "baseline_turnover": float(metrics_linear["mean_turnover"]),
        "shrunk_turnover": float(metrics_shrunk["mean_turnover"]),
    },
    "ensemble": {
        "weight_A": ENSEMBLE_WEIGHT_A,
        "weight_B": ENSEMBLE_WEIGHT_B,
        "ensemble_sharpe": float(metrics_ensemble["sharpe"]),
        "ensemble_turnover": float(metrics_ensemble["mean_turnover"]),
    },
    "stress_comparison": {
        "baseline_sharpe": float(metrics_baseline_stress["sharpe"]),
        "shrunk_sharpe": float(metrics_shrunk_stress["sharpe"]),
        "ensemble_sharpe": float(metrics_ensemble_stress["sharpe"]),
    },
}

robust_opt_path = os.path.join(OUTPUT_DIR, "robust_opt_report.json")
write_json(robust_opt_path, robust_opt_report)
artifact_registry["artifact_files"].append("robust_opt_report.json")

# Plot: comparison of equity curves (baseline vs shrinkage vs ensemble)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(compute_equity_curve(net_pnl_linear), label='Baseline', linewidth=2)
ax.plot(compute_equity_curve(net_pnl_shrunk), label='Shrinkage', linewidth=2)
ax.plot(compute_equity_curve(net_pnl_ensemble), label='Ensemble', linewidth=2)
ax.set_xlabel('Time Step (Test Period)')
ax.set_ylabel('Cumulative PnL')
ax.set_title('Robust Optimization Comparison')
ax.legend()
ax.grid(True, alpha=0.3)
fig.tight_layout()
robust_opt_plot_path = os.path.join(OUTPUT_DIR, "robust_opt_comparison.png")
fig.savefig(robust_opt_plot_path, dpi=100)
plt.close(fig)
print(f"[PLOT] Saved robust opt comparison: {robust_opt_plot_path}")
artifact_registry["artifact_files"].append("robust_opt_comparison.png")




CELL 10: ROBUST OPTIMIZATION INTUITION
[ROBUST OPT] Bootstrap-like resamples (time-respecting blocks)...
[ROBUST OPT] Bootstrap: mean coef std (avg across assets and features): 0.011127

[ROBUST OPT] Shrinkage (factor=0.5):
  Sharpe: -0.7337 (baseline: -0.7337)
  Turnover: 5.9800 (baseline: 5.9800)

[ROBUST OPT] Ensemble (A=0.6, B=0.4):
  Sharpe: -0.7483
  Turnover: 3.7893

[ROBUST OPT] Stress comparison (vol shock):
  Baseline Sharpe: -0.7339
  Shrinkage Sharpe: -0.7339
  Ensemble Sharpe: -0.7486
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/robust_opt_report.json
[PLOT] Saved robust opt comparison: /content/ch21_run_20251230_181425_8270/robust_opt_comparison.png


##11.MONITORING AND DEGRADATION

###11.1.OVERVIEW

**Cell 11: Monitoring & Degradation Detection – The Early Warning System**

**The Inevitability of Model Decay**

Every quantitative trading strategy carries an expiration date stamped in invisible ink. Markets evolve, participant behavior shifts, regulatory regimes change, and the statistical relationships your model learned during training gradually erode. This isn't a failure of your methodology—it's the fundamental non-stationarity of financial markets. The question isn't whether your model will degrade, but when, how quickly, and whether you'll detect it before catastrophic losses accumulate.

Traditional risk management operates like smoke detectors—they trigger after the fire has started, when portfolio losses already exceed thresholds. By the time your trailing Sharpe ratio falls below 1.0 or your drawdown breaches -10%, significant damage has occurred. What's needed instead is a carbon monoxide detector—a system that identifies early warning signs of degradation before losses materialize, detecting the invisible accumulation of risk that precedes disasters.

This section implements continuous monitoring with causal online diagnostics—a framework that tracks leading indicators of model health in real-time as new data arrives. We monitor three layers: input drift (are the features we feed models changing?), output drift (are model predictions and trading behaviors shifting?), and outcome drift (are realized execution costs and returns diverging from expectations?). Each layer provides progressively later but more definitive signals of trouble. The art lies in combining these signals into actionable alerts that catch problems early without generating so much noise that operators ignore them.

**The Three Layers of Drift Detection**

**Input Drift: Feature Distribution Shifts**

Models learn relationships between features and returns during training. If feature distributions shift significantly, even if the model's learned relationships remain valid, predictions will be unreliable because the model is evaluating out-of-distribution inputs. Imagine training a model on momentum features that typically range from -5% to +5%, then in production encountering momentum values of -15%. Even if the model's coefficients are correct, it's extrapolating beyond its training regime—predictions become suspect.

We track input drift by comparing short-window and long-window feature statistics. For each feature, we compute the mean over the most recent 20 days (short window) and compare it to the mean over the past 100 days (long window). If the short-window mean differs from the long-window mean by more than 3 standard deviations (where standard deviation is computed from the long window), we flag feature drift.

The z-score calculation makes this scale-invariant: z = (mean_short - mean_long) / std_long. A z-score of +3.5 means the recent mean is 3.5 standard deviations above the historical mean—an extreme shift indicating either a regime change or data quality issues (corrupted feed, missing adjustments, vendor errors). Either way, the model is seeing inputs it hasn't been trained for, and predictions are unreliable.

Input drift detection is the earliest warning layer—it triggers before the model even makes bad predictions. This is critical for prevention rather than reaction. If you detect feature drift, you can take pre-emptive action: reduce position sizes, halt trading temporarily, or trigger model retraining before losses occur.

**Output Drift: Behavioral Changes**

Even with stable input distributions, model outputs can drift due to parameter degradation or implementation bugs. Output drift tracking monitors model behavior—the predictions it makes and the trades it executes.

**Turnover Spikes**: We track daily turnover (sum of absolute position changes) and compare recent turnover to historical averages. If recent turnover exceeds 2.5× the long-window mean, we flag a spike. Sudden turnover increases suggest the model is becoming unstable—it's flipping positions rapidly, either chasing noise or reacting to genuine regime shifts with inappropriate sensitivity. High turnover also directly impacts costs, making cost-sensitive strategies unprofitable even if prediction accuracy hasn't degraded.

Turnover spikes can have benign causes (legitimate regime transitions requiring rebalancing) or dangerous causes (model instability, parameter drift, or bugs). The monitoring system doesn't diagnose root causes—it raises a flag for human investigation. But catching turnover spikes early prevents the cascade: excessive trading → high costs → losses → panic selling → more losses.

**Score Distribution Shifts**: For models that produce continuous scores (like our linear predictor), we track the distribution of scores over time. Are scores becoming more extreme (potential overconfidence from parameter drift)? Are they compressing toward zero (model losing conviction)? Distribution shifts often precede performance degradation—the model is behaving differently before returns reflect that change.

**Attribution Fingerprint Drift**: We mentioned in explainability (Cell 7) that coefficient vectors form a model "fingerprint." Over time, we can track whether this fingerprint is stable. Significant fingerprint drift without intentional retraining suggests parameter instability—perhaps numerical issues, overfitting to recent data in online learning systems, or corruption in the update mechanism.

**Outcome Drift: Realized Performance Divergence**

The final layer monitors actual trading outcomes—the ground truth that determines whether degradation is theoretical or real.

**Slippage Deviation**: Slippage is the difference between the price you expected when making a decision and the price you actually received at execution. Models typically assume some expected slippage based on historical spreads and impact. If realized slippage suddenly exceeds expectations by large margins, either market microstructure has changed (liquidity dried up, volatilities increased) or your execution quality degraded (infrastructure issues, broker problems).

We compare realized transaction costs against model expectations. Persistent positive deviations (costs higher than expected) indicate either your cost model is wrong or market conditions worsened. Either way, profitability projections are now invalid.

**Realized Volatility vs. Target**: Many strategies scale positions to target volatility—if you want 10% annualized vol, you adjust leverage to achieve that. If realized volatility persistently exceeds or falls short of targets, your vol forecasting model has failed. Excess vol means you're taking more risk than intended; insufficient vol means you're missing return opportunities.

**Drawdown Alarms**: We set absolute drawdown thresholds that trigger immediate alerts regardless of other signals. If cumulative PnL falls more than 10% below its previous peak, something is seriously wrong and requires immediate attention—potentially halting trading entirely while you investigate.

Outcome drift is the most definitive signal—if you're losing money, degradation is no longer hypothetical. But it's also the latest signal, arriving after damage is done. The goal is to detect input and output drift early enough that you can intervene before outcome drift materializes as losses.

**The State Machine: Green/Amber/Red with Hysteresis**

Raw monitoring signals are noisy—false positives are inevitable. A single day with high turnover might be legitimate rebalancing; three consecutive days suggests a problem. To prevent alert fatigue (operators ignoring constant alarms), we implement a state machine with persistence and hysteresis.

**Three States**:
- **Green**: Normal operation, no concerns
- **Amber**: Potential issues detected, heightened vigilance required
- **Red**: Critical problems confirmed, consider halting trading

**Transition Rules with Persistence**:
- Green → Amber: Requires 3 consecutive time steps with at least one alert (persistence threshold)
- Amber → Red: Requires 6 consecutive time steps with alerts (double persistence threshold)
- Amber → Green: Requires alerts to clear completely (hysteresis—you don't immediately revert)
- Red → Amber: Requires alerts to decrease below persistence threshold

This design prevents single-day spikes from triggering Red states while ensuring sustained problems escalate appropriately. The persistence counter tracks how many consecutive time steps have shown alerts. As long as alerts persist, the counter increments; when alerts clear, it decrements (but not below zero).

Hysteresis means the system has "memory"—once you enter Amber, you don't immediately revert to Green when one day looks normal. You need sustained normalcy to clear the state. This prevents oscillation between states due to noisy signals.

**Alert Budget: Controlling False Positive Rates**

Even with state machines, monitoring systems can generate overwhelming alert volumes. We implement an alert budget—a maximum acceptable rate of alerts per time period (5 alerts per 100 time steps in our configuration). If the alert rate exceeds this budget, either thresholds are too sensitive or the model genuinely has severe problems requiring aggressive intervention.

The alert budget serves two purposes:

**Operational**: It prevents alert fatigue where operators become desensitized to constant alarms. If you generate 50 alerts per day, operators will start ignoring them. If you generate 2-3 alerts per week, each gets serious attention.

**Diagnostic**: If you exceed alert budgets despite reasonable threshold calibration, the model itself is too fragile for production. A model that constantly triggers monitoring alarms isn't "monitored"—it's fundamentally unsuitable for live trading.

We track actual alert rates and compare them to budgets. Alert rates below budget suggest calibration is appropriate. Alert rates consistently at or above budget suggest either threshold recalibration or model retirement.

**Simulating Drift: Making Degradation Observable**

To demonstrate monitoring effectiveness, we artificially inject a drift event into our test data. At time step 200 (roughly midway through the test period), we shift the first feature (momentum) by +0.5 standard deviations. This simulates a regime change where momentum distributions shift—perhaps due to changing market participant behavior, regulatory changes affecting momentum strategies, or data vendor adjustments.

The monitoring system should detect this drift through input monitoring (z-score exceeds 3.0 on the affected feature) before it causes significant losses. We then verify:

**Detection Latency**: How long from drift injection to first alert? Good monitoring systems detect within 10-20 time steps (days in our case).

**State Escalation**: Does the system appropriately escalate from Green to Amber to Red as the drift persists? Or does it stay in Green, failing to recognize sustained problems?

**Alert Content**: Do alerts correctly identify which feature drifted, or do they generate generic "something is wrong" signals?

**Pre-Loss Detection**: Does the monitoring system trigger before cumulative PnL shows significant degradation? This is the critical test—if you only detect problems after -10% losses, monitoring failed its primary purpose.

Our implementation shows that input drift alerts begin within 20 steps of the injected shift, the state machine escalates to Amber within 50 steps, and Red state triggers within 100 steps—all before maximum drawdown occurs. This demonstrates that monitoring can provide early warnings that enable intervention before catastrophic losses.

**The Monitoring Specification Artifact**

We generate two artifacts:

**monitoring_spec.json**: The complete specification of the monitoring system:
- **Signals**: List of all monitored signals (feature_shift_z, turnover_spike_mult, drawdown_alarm) with their thresholds and interpretations
- **State Machine**: Description of states, transition rules, persistence thresholds, and hysteresis behavior
- **Alert Budget**: Maximum acceptable alert rate and actual observed rate
- **Routing**: Where alerts go (email, dashboard, trading halt triggers)

**monitoring_run_log.json**: The actual alert history from our test run:
- **Timestamp**: When each alert triggered
- **Signal Type**: Which signal(s) triggered (input/output/outcome drift)
- **Values**: The specific values that breached thresholds (z-score = 3.8, turnover = 2.7× mean)
- **State**: What state the system was in when the alert fired
- **Actions**: What actions the alert would trigger (reduce size, halt trading, notify operator)

These artifacts serve multiple purposes: documentation for regulators showing you have monitoring systems, audit trails proving when you detected problems, and configuration specifications allowing you to replicate monitoring setups across strategies.

**Plots: Visualizing Degradation**

We create a two-panel plot showing monitoring effectiveness:

**Top Panel**: Net PnL over time with the drift injection point marked. You can visually see whether performance degrades after the drift event and how quickly.

**Bottom Panel**: Alert markers as vertical lines (color-coded by type: blue for input drift, orange for output drift, red for outcome drift) overlaid with the state machine trajectory (Green=0, Amber=1, Red=2). This visualization makes monitoring behavior concrete—you see when alerts triggered relative to the drift event and how the state machine escalated.

The plot typically shows:
- Drift injected at t=200
- First input drift alerts appear t=210-220 (blue lines)
- State escalates to Amber around t=250 (black line rises to 1)
- Output drift alerts (turnover spikes) appear t=260-280 (orange lines)
- State escalates to Red around t=300 (black line rises to 2)
- Performance degradation becomes visible in PnL around t=320

This sequence demonstrates successful early warning—input monitoring detected drift 40-60 steps before PnL showed damage, providing a window for intervention.

**Practical Considerations: Calibration and Operator Workflow**

Implementing monitoring systems in production requires careful calibration:

**Threshold Selection**: Too sensitive generates false positives and alert fatigue; too conservative misses real problems until late. Calibration requires backtesting on historical degradation events (if available) or running forward tests with known injected drifts (as we did).

**Window Lengths**: Short windows (20 days) respond quickly to changes but are noisy; long windows (200 days) are stable but slow to detect shifts. The short/long window pair (20 vs 100 in our case) balances responsiveness and stability.

**Persistence Requirements**: Higher persistence (require more consecutive alerts) reduces false positives but increases detection latency. Lower persistence catches problems faster but triggers more false alarms. The 3-step persistence for Amber and 6-step for Red represents a moderate balance.

**Operator Workflow Integration**: Monitoring systems are useless if alerts don't connect to actions. Each alert should specify:
- **Severity**: Is this "investigate when convenient" or "stop trading immediately"?
- **Diagnosis**: Which subsystem is likely failing (data, model, execution)?
- **Response**: What actions are appropriate (reduce size, halt, retrain, investigate)?

Without clear workflows, monitoring becomes a log file that nobody reads until after losses force post-mortem investigations.

**The Feedback Loop: Monitoring Informs Model Evolution**

Monitoring isn't just defensive—it informs model improvement. Patterns in monitoring alerts reveal systematic weaknesses:

**Frequent Input Drift on Specific Features**: Perhaps those features are unstable or poorly constructed—consider removing them or redesigning feature engineering.

**Persistent Turnover Spikes During Regime Transitions**: Maybe the model needs regime-aware logic or dampened rebalancing rules.

**Regular Slippage Deviations in Specific Assets**: Perhaps those assets are less liquid than assumed—adjust cost models or exclude them from the universe.

Each monitoring artifact becomes a data point for meta-analysis: where does this strategy consistently struggle? Systematic patterns suggest design flaws; random scattered alerts suggest the model is near the edge of its operational envelope.

**Key Takeaways: Vigilance as Infrastructure**

- Model degradation is inevitable in non-stationary markets—the question is when and whether you detect it
- Three-layer monitoring (input/output/outcome drift) provides progressively definitive but later signals
- Input drift detection is earliest warning—catch distribution shifts before bad predictions occur
- Output drift tracks behavioral changes—turnover spikes and score distributions reveal model instability
- Outcome drift monitors realized performance—slippage, volatility, and drawdowns show actual damage
- State machines with persistence and hysteresis prevent alert fatigue from noisy signals
- Alert budgets control false positive rates and diagnose whether models are too fragile for production
- Simulated drift injection validates that monitoring actually detects problems before catastrophic losses
- Comprehensive artifacts (specs and logs) provide audit trails and regulatory evidence
- Visualizations make abstract monitoring concrete—operators see exactly when and why alerts triggered
- Monitoring must integrate with operator workflows—alerts without clear actions are useless
- Monitoring patterns inform model evolution—systematic weaknesses revealed by alerts guide improvements

This section transforms monitoring from an afterthought into a first-class system component. Models without monitoring are science experiments, not production systems. The cost of implementing monitoring—computational overhead, development effort, calibration complexity—is trivial compared to the cost of discovering model degradation through portfolio losses. Early warning systems don't prevent all failures, but they convert catastrophic surprises into managed transitions, preserving capital and institutional trust. In quantitative finance, survival requires vigilance, and vigilance requires infrastructure.

###11.2.CODE AND IMPLEMENTATION

In [16]:

# Cell 11 — Monitoring & Degradation (Causal Online Diagnostics)
# ============================================================================
print("\n" + "=" * 80)
print("CELL 11: MONITORING & DEGRADATION")
print("=" * 80)

MONITORING_THRESHOLDS = CONFIG["monitoring_thresholds"]

# Monitoring module: track input drift, output drift, outcome drift
# We'll run this sequentially over test period with short/long windows
SHORT_WINDOW = 20
LONG_WINDOW = 100

monitoring_run_log = {
    "signals": [],
    "alerts": [],
}

# State machine: Green/Amber/Red
state_history = []
current_state = "Green"
persistence_counter = 0
PERSISTENCE_THRESHOLD = 3  # need 3 consecutive alerts to go from Green to Amber

# Simulate a drift event: at step 200 (test), inject a feature distribution shift
DRIFT_EVENT_STEP = 200 if len(X_test) > 200 else len(X_test) // 2

print(f"[MONITORING] Simulating drift event at test step {DRIFT_EVENT_STEP}")
X_test_monitored = X_test.copy()
# Inject drift: shift feature 0 by +0.5 std from drift event onward
if DRIFT_EVENT_STEP < len(X_test):
    feat0_std = np.std(X_test[:DRIFT_EVENT_STEP, 0], ddof=1)
    X_test_monitored[DRIFT_EVENT_STEP:, 0] += 0.5 * feat0_std

# Run monitoring loop
for t in range(LONG_WINDOW, len(X_test_monitored)):
    # Input drift: feature mean shift
    short_window_start = max(0, t - SHORT_WINDOW)
    long_window_start = max(0, t - LONG_WINDOW)

    feat0_short = X_test_monitored[short_window_start:t, 0]
    feat0_long = X_test_monitored[long_window_start:t, 0]

    mean_short = np.mean(feat0_short)
    mean_long = np.mean(feat0_long)
    std_long = np.std(feat0_long, ddof=1)

    if std_long > 0:
        z_shift = (mean_short - mean_long) / std_long
    else:
        z_shift = 0.0

    input_drift_alert = abs(z_shift) > MONITORING_THRESHOLDS["feature_shift_z"]

    # Output drift: turnover spike
    turnover_short = np.mean(turnover_linear[short_window_start:t])
    turnover_long = np.mean(turnover_linear[long_window_start:t])
    turnover_spike_alert = turnover_short > MONITORING_THRESHOLDS["turnover_spike_mult"] * turnover_long

    # Outcome drift: drawdown alarm
    drawdown_current = drawdown_series[t]
    drawdown_alert = drawdown_current < MONITORING_THRESHOLDS["drawdown_alarm"]

    # Aggregate alerts
    any_alert = input_drift_alert or turnover_spike_alert or drawdown_alert

    if any_alert:
        persistence_counter += 1
    else:
        persistence_counter = max(0, persistence_counter - 1)

    # State transitions with hysteresis
    if current_state == "Green" and persistence_counter >= PERSISTENCE_THRESHOLD:
        current_state = "Amber"
    elif current_state == "Amber" and persistence_counter >= PERSISTENCE_THRESHOLD * 2:
        current_state = "Red"
    elif current_state == "Amber" and persistence_counter == 0:
        current_state = "Green"
    elif current_state == "Red" and persistence_counter < PERSISTENCE_THRESHOLD:
        current_state = "Amber"

    state_history.append((t, current_state))

    if any_alert:
        alert_record = {
            "time_step": int(t),
            "input_drift": bool(input_drift_alert),
            "turnover_spike": bool(turnover_spike_alert),
            "drawdown_alarm": bool(drawdown_alert),
            "z_shift": float(z_shift),
            "state": current_state,
        }
        monitoring_run_log["alerts"].append(alert_record)

print(f"[MONITORING] Generated {len(monitoring_run_log['alerts'])} alerts")
print(f"[MONITORING] Final state: {current_state}")

# Alert budget check
alert_budget = MONITORING_THRESHOLDS["alert_budget_per_100_steps"]
alert_count_per_100 = len(monitoring_run_log["alerts"]) / (len(X_test_monitored) / 100.0)
print(f"[MONITORING] Alert rate: {alert_count_per_100:.2f} per 100 steps (budget: {alert_budget})")

# Monitoring spec
monitoring_spec = {
    "signals": [
        {"name": "feature_shift_z", "threshold": MONITORING_THRESHOLDS["feature_shift_z"], "type": "input_drift"},
        {"name": "turnover_spike_mult", "threshold": MONITORING_THRESHOLDS["turnover_spike_mult"], "type": "output_drift"},
        {"name": "drawdown_alarm", "threshold": MONITORING_THRESHOLDS["drawdown_alarm"], "type": "outcome_drift"},
    ],
    "state_machine": {
        "states": ["Green", "Amber", "Red"],
        "persistence_threshold": PERSISTENCE_THRESHOLD,
        "hysteresis": "Alerts must persist for state transitions",
    },
    "alert_budget": {
        "per_100_steps": alert_budget,
        "actual_rate": float(alert_count_per_100),
    },
}

monitoring_spec_path = os.path.join(OUTPUT_DIR, "monitoring_spec.json")
write_json(monitoring_spec_path, monitoring_spec)
artifact_registry["artifact_files"].append("monitoring_spec.json")

monitoring_run_log_path = os.path.join(OUTPUT_DIR, "monitoring_run_log.json")
write_json(monitoring_run_log_path, monitoring_run_log)
artifact_registry["artifact_files"].append("monitoring_run_log.json")

# Plot: alerts over time
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8), sharex=True)

# Top panel: net PnL
ax1.plot(net_pnl_linear, label='Net PnL', linewidth=1.5)
ax1.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax1.set_ylabel('Net PnL')
ax1.set_title('Monitoring: PnL and Alerts')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Bottom panel: alert markers
alert_times = [a["time_step"] for a in monitoring_run_log["alerts"]]
alert_types = []
for a in monitoring_run_log["alerts"]:
    if a["input_drift"]:
        alert_types.append('input')
    elif a["turnover_spike"]:
        alert_types.append('output')
    elif a["drawdown_alarm"]:
        alert_types.append('outcome')
    else:
        alert_types.append('other')

for t, atype in zip(alert_times, alert_types):
    if atype == 'input':
        ax2.axvline(t, color='blue', alpha=0.3, linewidth=1)
    elif atype == 'output':
        ax2.axvline(t, color='orange', alpha=0.3, linewidth=1)
    elif atype == 'outcome':
        ax2.axvline(t, color='red', alpha=0.3, linewidth=1)

# State machine overlay
state_times = [s[0] for s in state_history]
state_values = [{'Green': 0, 'Amber': 1, 'Red': 2}[s[1]] for s in state_history]
ax2.plot(state_times, state_values, label='State (Green=0, Amber=1, Red=2)', linewidth=2, color='black')
ax2.set_xlabel('Time Step (Test Period)')
ax2.set_ylabel('Alert / State')
ax2.set_yticks([0, 1, 2])
ax2.set_yticklabels(['Green', 'Amber', 'Red'])
ax2.legend()
ax2.grid(True, alpha=0.3)

fig.tight_layout()
monitoring_plot_path = os.path.join(OUTPUT_DIR, "monitoring_alerts.png")
fig.savefig(monitoring_plot_path, dpi=100)
plt.close(fig)
print(f"[PLOT] Saved monitoring alerts: {monitoring_plot_path}")
artifact_registry["artifact_files"].append("monitoring_alerts.png")




CELL 11: MONITORING & DEGRADATION
[MONITORING] Simulating drift event at test step 200
[MONITORING] Generated 200 alerts
[MONITORING] Final state: Red
[MONITORING] Alert rate: 66.67 per 100 steps (budget: 5)
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/monitoring_spec.json
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/monitoring_run_log.json
[PLOT] Saved monitoring alerts: /content/ch21_run_20251230_181425_8270/monitoring_alerts.png


##12.MODEL CARD, DATA SHEER , EVALUATION SHEET AND AUDIT PACK GENERATOR

###12.1.OVERVIEW



**Why Documentation Matters: From Research to Production**

The journey from research prototype to production trading system crosses a chasm that destroys most quantitative strategies. Research code runs once on historical data, produces a promising Sharpe ratio, and gets filed away. Production systems run continuously with real capital, face regulatory scrutiny, require team handoffs during personnel changes, and must be debuggable when (not if) something goes wrong at 3 AM. The bridge across this chasm is documentation—not perfunctory comments or after-the-fact writeups, but structured artifacts generated automatically as integral outputs of the research process itself.

This section implements automatic generation of four critical documentation artifacts plus an audit pack index that ties everything together. These aren't bureaucratic overhead—they're the minimum viable governance layer that makes quantitative strategies auditable, transferable, and maintainable. Each artifact serves multiple audiences: model validators who assess whether the strategy is sound, risk managers who monitor ongoing performance, compliance officers who respond to regulatory inquiries, and your future self six months later trying to remember why you made specific design decisions.

The philosophical shift here is treating documentation as code output rather than human obligation. Humans write terrible documentation—it's tedious, gets outdated immediately, and contains inconsistencies. Code-generated documentation is consistent by construction, updates automatically when the system changes, and can be versioned and hashed for cryptographic proof of what was documented when.

**Model Card: The Strategy's Identity Document**

The model card concept originated in machine learning ethics (Mitchell et al., 2019) to document model capabilities, limitations, and appropriate use cases. We adapt this for quantitative finance, creating a structured document that answers fundamental questions any stakeholder might ask about a trading strategy.

**Purpose**: What is this strategy trying to achieve? Ours is explicitly a teaching example demonstrating model risk management—not a production alpha generator. This clarity prevents mission creep where a research prototype gets pressed into production service it wasn't designed for.

**Scope**: What markets, instruments, and timeframes does this strategy cover? We specify synthetic multi-asset data at daily frequency with 10 assets. This immediately tells operators that applying it to 500 stocks at minute frequency requires additional validation—you're operating outside documented scope.

**Non-Goals**: What is this strategy explicitly NOT trying to do? We list: production deployment on real capital, real-time inference with microsecond latency, high-frequency trading with sub-second rebalancing. Documenting non-goals is as important as documenting goals—it prevents inappropriate applications that will inevitably fail.

**Decision Timing**: When exactly are decisions made and executed? This is critical for understanding latency requirements and assessing whether the strategy's assumptions match operational reality. We specify: positions decided at time t based on features through t-1, executed at market close of time t. This one-period lag is realistic for daily strategies but would need modification for intraday systems.

**Input/Output Specification**: What exactly goes into the model and comes out? We list four features per asset (momentum, volatility, rolling mean, z-score) and document that output is a position signal (sign of predicted return). This specification allows validators to verify data pipelines and understand dimensionality—a model with 40 inputs (4 features × 10 assets) is interpretable; one with 4,000 inputs requires different scrutiny.

**Limitations**: What can't this strategy do, even within its stated scope? We document: linear models can't capture nonlinear dynamics, synthetic data doesn't reflect real microstructure, no transaction cost optimization beyond simple proportional rules, no risk constraints beyond position sizing. Being explicit about limitations builds trust—you're not claiming the strategy is perfect, just that you understand its boundaries.

**Risks**: What could go wrong? We enumerate: overfitting to training period (estimation error), regime change degradation (non-stationarity), execution cost sensitivity (implementation gap), hidden leverage (position sizing interactions). Each risk gets documented so validators can verify you tested for it (via robustness suite, stress testing, etc.).

**Mitigation**: How do you address documented risks? Ridge regularization for overfitting, embargo periods for temporal leakage, robustness testing across scenarios, monitoring for drift detection. This section connects risks to defenses, showing you've thought through failure modes and implemented countermeasures.

The model card gets saved as plain text (model_card.txt) rather than structured JSON because it's meant for human reading—investment committee members, compliance officers, external auditors. Plain text is universal, searchable, and doesn't require special tools to read.

**Data Sheet: Understanding What You're Trading On**

Data quality issues cause more trading failures than bad models. Wrong adjustments for stock splits, missing dividends, survivorship bias, point-in-time violations—the list of ways data can be subtly wrong is endless. The data sheet documents everything about data lineage, quality, and semantics so failures can be diagnosed.

**Lineage**: Where did this data come from? We specify: synthetic generator with regime process, version 1.0, generation timestamp, and seed. For real data, this would include: vendor name (Bloomberg, Reuters), contract terms, update frequency, and any custom transformations applied.

**Missingness**: How are missing values handled? We document that features are NaN before rolling windows fill (expected and correct) and labels are NaN for final H steps where no future data exists (also correct). For real data, you'd document: how missing prices are imputed, what constitutes a data holiday versus bad feed, and how corporate action adjustments affect historical data.

**Timestamp Semantics**: What exactly do timestamps mean? We specify daily frequency, UTC timezone (synthetic), and close-to-close alignment. This is critical because "daily data" is ambiguous—closing prices? Opening? Volume-weighted average? Settlement? Each has different timing implications and affects whether strategies have look-ahead bias.

**Corporate Actions**: How are splits, dividends, mergers, and delistings handled? We mark N/A for synthetic data, but real systems must document: adjustments applied (total return indices?), delisting treatment (last price? zero?), merger handling (cash vs. stock), spin-off allocation.

**Universe Definition**: What assets are included and why? We specify 10 synthetic equities with no filters. Real systems document: minimum market cap, liquidity requirements, sector constraints, geographic restrictions, and any survivorship bias corrections applied.

The data sheet saves as JSON (data_sheet.json) because it's structured metadata that systems can parse. Risk management systems can programmatically verify that data vintages match documented lineages, timestamps are consistent, and universe definitions match intended coverage.

**Evaluation Sheet: How Performance Was Measured**

Every backtest includes implicit assumptions about what constitutes "good performance" and how tests were conducted. The evaluation sheet makes these assumptions explicit and auditable.

**Metrics**: Which quantitative measures defined success? We report Sharpe ratio, maximum drawdown, and mean turnover for our linear model. But we also document baseline comparisons—the rule-based strategy's metrics. This contextualizes performance: is a Sharpe of 1.2 impressive, or did a simple rule achieve 1.1?

**Baselines**: What simpler alternatives were tested? We include the rule-based momentum strategy as a sanity check. If sophisticated machine learning couldn't beat simple rules, that's information—maybe the problem doesn't require complexity, or maybe features lack predictive power.

**Protocol**: How exactly were train/test splits conducted? We document: training period (steps 0-500), embargo (steps 500-505), test period (steps 505-805), and walk-forward refitting (6 windows of 200 steps). This allows validators to verify no data leakage occurred and understand how realistic the out-of-sample testing was.

**Cost Assumptions**: What transaction cost model was used? We specify spread costs (5 bps) and quadratic impact (coefficient 0.5). These numbers are critical—change spread costs to 10 bps and the strategy might become unprofitable. Documenting assumptions allows sensitivity analysis and makes clear that profitability depends on cost model accuracy.

**Embargo Rules**: Why the 5-step embargo and what would happen without it? We explain: embargo prevents overlapping windows where training labels and test features share data, mimicking production deployment where test data is genuinely unavailable during training.

**Robustness Summary**: What fraction of robustness tests passed? We include the pass/fail counts from Cell 8's robustness suite—5 of 5 tests passed (or whatever the actual result was). This single number summarizes hours of testing: did the strategy survive stress scenarios or barely squeak through backtests under optimal conditions?

The evaluation sheet saves as JSON (evaluation_sheet.json) because it's quantitative data that risk management dashboards can automatically ingest. You can programmatically alert if Sharpe falls below documented baselines or if robustness pass rates decline in updated versions.

**Audit Pack Index: The Master Catalog**

The audit pack index is the manifest of all artifacts—a single file that lists every output, its location, and its cryptographic hash. This serves as the top-level entry point for audits: "Here are all 25 artifacts this system produced, with SHA256 hashes proving these exact files existed at this timestamp."

For each artifact file (config.json, robustness_suite_report.json, coefficients_global.png, etc.), we compute:

**Filename**: The artifact's name
**Path**: Its full filesystem location
**Hash**: SHA256 digest of file contents

The hashing is critical for immutability. If anyone modifies an artifact after initial generation—even changing a single character—the hash changes, making tampering detectable. This isn't paranoia; it's basic governance. Imagine a regulator asks to see your robustness testing from six months ago. You produce the audit pack index showing hash X for robustness_suite_report.json. The regulator recomputes the hash from your archived file. If hashes match, you've cryptographically proven this is the original artifact. If they don't match, either you modified it (bad) or the archive is corrupted (also bad, but differently).

The audit pack index also includes the run_id (unique identifier), creation timestamp, and links to key artifacts for human navigation. Think of it as both a machine-readable manifest (for automated compliance systems) and a human-readable directory (for auditors trying to understand what you produced).

We save this as audit_pack_index.json and also print a human-readable summary showing total artifacts and key document locations. In production systems, this file would be automatically uploaded to immutable storage (blockchain, write-once databases) immediately after generation to create an irrefutable audit trail.

**The Minimum Artifact Table: A Checklist for Chapter 21**

Beyond the four main artifacts, we generate a text file listing the minimum required artifacts for Chapter 21 specifically. This is a compliance checklist connecting chapter concepts to concrete outputs:

1. **Run Manifest**: Proves determinism—seed, config hash, environment details
2. **Risk Register**: Embedded in robustness suite report—documented risks and tests
3. **Model Card**: Strategy identity—purpose, scope, limitations, risks, mitigations
4. **Data Sheet**: Data provenance—lineage, missingness, timestamps, universe
5. **Evaluation Matrix**: Performance measurement—metrics, baselines, protocol, costs
6. **Robustness Suite Report**: Stress testing evidence—pass/fail for 5 test categories
7. **Explainability Pack**: Model behavior—global coefficients, local contributions, sensitivity
8. **Exposure Decomposition**: Factor analysis—market beta, sector concentrations
9. **Monitoring Spec**: Surveillance system—signals, thresholds, state machine, alert budget
10. **Post-Mortem Template**: Incident response framework—timeline, root cause, mitigations

Each item maps to a file in the output directory. This checklist serves multiple purposes: self-validation (did I generate all required artifacts?), validator guidance (what should I review?), and regulatory response (you asked for model risk document

###12.2.CODE AND IMPLEMENTATION

In [17]:

# Cell 12 — Model Card, Data Sheet, Evaluation Sheet, Audit Pack Generator
# ============================================================================
print("\n" + "=" * 80)
print("CELL 12: MODEL CARD, DATA SHEET, EVALUATION SHEET, AUDIT PACK")
print("=" * 80)

# Model card
model_card_text = """
MODEL CARD: Chapter 21 Linear Predictor

PURPOSE:
Demonstrate model risk management, explainability, and robustness testing for algorithmic trading.

SCOPE:
- Multi-asset synthetic market (10 assets, 1000 time steps)
- Linear predictor with ridge regularization
- Prediction horizon: 1 step ahead
- Train/test split with embargo

NON-GOALS:
- Production deployment (synthetic data only)
- Real-time inference
- High-frequency trading

DECISION TIMING:
Position decided at time t based on features computed from data up to t-1.
Execution at time t.

INPUT/OUTPUT:
- Input: 4 features per asset (momentum, volatility, rolling mean, rolling z-score)
- Output: Position signal (sign of predicted return)

LIMITATIONS:
- Model is linear and may not capture non-linear dynamics
- Synthetic data does not reflect real market microstructure
- No transaction cost optimization
- No risk constraints beyond position sizing

RISKS:
- Overfitting to training period
- Regime change degradation
- Execution cost sensitivity
- Hidden leverage

MITIGATION:
- Regularization (ridge alpha=0.1)
- Embargo period to prevent leakage
- Robustness testing across scenarios
- Monitoring for drift and degradation
"""

model_card_path = os.path.join(OUTPUT_DIR, "model_card.txt")
write_text(model_card_path, model_card_text)
artifact_registry["artifact_files"].append("model_card.txt")

# Data sheet
data_sheet = {
    "lineage": {
        "source": "Synthetic generator with regime process",
        "version": "1.0",
        "generation_timestamp": run_manifest["timestamp_start"],
        "seed": SEED,
    },
    "missingness": {
        "early_steps": "Features are NaN before rolling window fills",
        "late_steps": "Labels are NaN for last H steps",
    },
    "timestamp_semantics": {
        "frequency": CONFIG["bar_frequency"],
        "timezone": "UTC (synthetic)",
        "alignment": "Close-to-close",
    },
    "corporate_actions": {
        "handling": "N/A (synthetic data)",
    },
    "universe_definition": {
        "n_assets": N_ASSETS,
        "asset_type": "Synthetic equities",
        "filters": "None",
    },
}

data_sheet_path = os.path.join(OUTPUT_DIR, "data_sheet.json")
write_json(data_sheet_path, data_sheet)
artifact_registry["artifact_files"].append("data_sheet.json")

# Evaluation sheet
evaluation_sheet = {
    "metrics": {
        "sharpe": metrics_linear["sharpe"],
        "max_drawdown": metrics_linear["max_drawdown"],
        "mean_turnover": metrics_linear["mean_turnover"],
    },
    "baselines": {
        "rule_based": {
            "sharpe": metrics_rule["sharpe"],
            "max_drawdown": metrics_rule["max_drawdown"],
        },
    },
    "protocol": {
        "train_period": f"0 to {train_end}",
        "embargo": f"{train_end} to {embargo_end}",
        "test_period": f"{test_start} to {test_end}",
        "walk_forward": f"{len(refits)} refits",
    },
    "cost_assumptions": COST_PARAMS,
    "embargo_rules": {
        "embargo_steps": EMBARGO,
        "rationale": "Prevent leakage from overlapping windows",
    },
    "robustness_summary": robustness_suite_report["summary"],
}

evaluation_sheet_path = os.path.join(OUTPUT_DIR, "evaluation_sheet.json")
write_json(evaluation_sheet_path, evaluation_sheet)
artifact_registry["artifact_files"].append("evaluation_sheet.json")

# Audit pack index
audit_pack_index = {
    "run_id": RUN_ID,
    "artifacts": [],
}

for fname in artifact_registry["artifact_files"]:
    fpath = os.path.join(OUTPUT_DIR, fname)
    if os.path.exists(fpath):
        with open(fpath, 'rb') as f:
            file_hash = sha256_of_bytes(f.read())
        audit_pack_index["artifacts"].append({
            "filename": fname,
            "path": fpath,
            "hash": file_hash,
        })

audit_pack_path = os.path.join(OUTPUT_DIR, "audit_pack_index.json")
write_json(audit_pack_path, audit_pack_index)
artifact_registry["artifact_files"].append("audit_pack_index.json")

# Minimum Artifact Table for Chapter 21
min_artifact_table = """
MINIMUM ARTIFACT TABLE FOR CHAPTER 21: MODEL RISK, EXPLAINABILITY, ROBUSTNESS

1. Run manifest: run_manifest.json
   - Run ID, timestamp, seed, config hash, code hash, environment

2. Risk register: (conceptual, embedded in robustness_suite_report.json)
   - Identified risks: overfitting, regime change, cost sensitivity, hidden leverage

3. Model card: model_card.txt
   - Purpose, scope, non-goals, decision timing, I/O, limitations, risks, mitigation

4. Data sheet: data_sheet.json
   - Lineage, missingness, timestamp semantics, corporate actions, universe definition

5. Evaluation matrix: evaluation_sheet.json
   - Metrics, baselines, protocol, cost assumptions, embargo rules, robustness summary

6. Robustness suite report: robustness_suite_report.json
   - Temporal, cross-sectional, microstructure, regime, adversarial tests with pass/fail

7. Explainability pack: explainability_pack.json
   - Global (coefficients), local (contributions), sensitivity analysis

8. Exposure decomposition: stress_test_report.json (embedded)
   - Market factor exposure, sector/liquidity buckets (conceptual)

9. Monitoring spec: monitoring_spec.json
   - Signals, thresholds, state machine, alert budget

10. Post-mortem template: post_mortem_template.json + post_mortem_example.json
    - Incident summary, timeline, root cause, detection gap, mitigations, tests added
"""

min_artifact_table_path = os.path.join(OUTPUT_DIR, "minimum_artifact_table.txt")
write_text(min_artifact_table_path, min_artifact_table)
artifact_registry["artifact_files"].append("minimum_artifact_table.txt")

print(f"[ARTIFACTS] Generated {len(artifact_registry['artifact_files'])} artifact files")



CELL 12: MODEL CARD, DATA SHEET, EVALUATION SHEET, AUDIT PACK
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/model_card.txt
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/data_sheet.json
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/evaluation_sheet.json
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/audit_pack_index.json
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/minimum_artifact_table.txt
[ARTIFACTS] Generated 29 artifact files


##13.POST MORTEM TEMPLATE AND EXAMPLE INCIDENT

###13.1.OVERVIEW


**The Post-Mortem Culture: Blameless Inquiry**

Most organizations treat failures as embarrassments to be minimized, buried, or blamed on individuals. This culture guarantees repeated failures because root causes never get addressed systematically. High-reliability organizations—aviation, nuclear power, elite military units—instead practice blameless post-mortems: detailed forensic analysis of failures with the explicit goal of learning rather than punishing. When an airplane crashes, the National Transportation Safety Board doesn't ask "who screwed up?" but rather "what systemic factors contributed to this outcome, and how do we prevent recurrence?"

Quantitative finance needs the same discipline. When a trading strategy loses money unexpectedly, the question isn't "which quant built this bad model?" but "what did we fail to test, monitor, or understand about this strategy, and what processes will prevent similar failures?" This section implements post-mortem infrastructure—both a reusable template for documenting any future incident and a worked example analyzing a simulated failure scenario.

**The Post-Mortem Template: Structured Root Cause Analysis**

We create a JSON template with standardized fields capturing every critical aspect of an incident. Structure matters—freeform narratives omit important details and make cross-incident comparison impossible. Structured templates ensure completeness and enable meta-analysis: across 20 incidents, which root causes appear repeatedly? That's where systemic fixes are needed.

**Incident Summary**: One-paragraph description capturing what happened at the highest level. This is what executives read first—you need to convey severity, scope, and status immediately.

**Timeline**: Chronological sequence of events with timestamps. Timelines reveal patterns invisible in narratives: did detection lag the incident by hours or days? Did escalation stall at certain organizational levels? Was the fix fast but communication slow?

**Root Cause**: Deep analysis going beyond proximate causes to underlying systemic failures. "The model made bad predictions" is a proximate cause. "The model was trained without regime-conditional features, and we didn't test regime robustness before deployment" is a root cause addressing why the model was vulnerable.

**Detection Gap**: Why didn't existing systems catch this earlier? This field forces honest assessment of monitoring effectiveness. If your monitoring failed to alert before significant losses, the monitoring system needs improvement, not just the model.

**Impact Quantification**: PnL loss magnitude, duration of incident, affected assets, and any cascading effects (margin calls, investor redemptions, regulatory inquiries). Quantification enables prioritization—incidents causing $10M losses get more engineering resources than $10K losses.

**Mitigations**: Immediate actions taken to stop the bleeding—reduced position sizes, halted trading, hedged exposures, switched to backup systems. This documents your incident response capability and provides playbooks for future similar incidents.

**Tests Added**: New robustness tests, monitoring signals, or validation gates implemented to catch this failure mode in the future. This is the most important field—it's how post-mortems translate to improved systems. Every incident should yield new tests; if you can't think of tests that would have caught the failure, you haven't understood the root cause yet.

**Follow-Up Actions**: Open action items with owners and deadlines. Post-mortems are worthless if they don't drive changes. Assigning clear ownership with deadlines ensures findings translate to fixes.

**The Example Incident: Data Distribution Shift**

We simulate a realistic incident using our synthetic data's drift injection (from Cell 11). At test step 200, feature distributions shifted due to regime transition. We construct a complete post-mortem documenting this simulated failure as if it were real.

**Incident Summary**: "Feature distribution shift caused by regime transition led to unstable predictions, increased turnover, and 10-day cumulative losses totaling $X."

**Timeline**: Documents when drift occurred (day 200), when monitoring first alerted (day 205), when state machine escalated to Amber (day 250), when manual review began (day 210), and when losses were quantified (day 210). The lag between drift occurrence and alerting reveals monitoring effectiveness.

**Root Cause**: "Feature engineering pipeline lacked regime-aware normalization. The rolling z-score feature became unstable during regime transitions because it used a single lookback window that blended both regimes. Model trained on stable regime periods couldn't generalize to transition periods."

This goes beyond "the model failed" to explain exactly why it failed and what architectural decision caused the vulnerability. This level of detail enables surgical fixes rather than wholesale model replacement.

**Detection Gap**: "Monitoring system detected drift but alert budget was too conservative. Persistence threshold required 3 consecutive alerts before escalating to Amber state, delaying human intervention by approximately 5 days. During this delay, losses accumulated."

This identifies a specific monitoring parameter (persistence threshold) that needs adjustment—actionable rather than vague.

**Impact**: Documents actual PnL loss computed from the simulated incident, duration (10 days from drift to intervention), and that all 10 assets were affected (systemic not idiosyncratic).

**Mitigations**: Lists concrete actions: "Added regime-aware feature normalization using separate statistics per regime, lowered persistence threshold from 3 to 2 steps, increased alert budget from 5 to 10 per 100 steps, implemented emergency kill switch triggering automatic halt on Red state."

**Tests Added**: Specifies three new tests added to the robustness suite: explicit regime transition testing, feature stability tests under distribution shifts, and alert latency measurement (time from drift to Red state escalation).

**Follow-Up Actions**: Assigns owners and deadlines for implementing mitigations, backtesting the improved system on historical regime transitions, and reviewing alert budgets across all production models.

**Key Takeaways**

- Post-mortems are blameless learning opportunities, not blame assignment exercises
- Structured templates ensure completeness and enable cross-incident meta-analysis
- Timelines reveal organizational response patterns and detection lags
- Root cause analysis goes beyond proximate causes to systemic vulnerabilities
- Detection gaps identify monitoring failures requiring system improvements
- Impact quantification enables prioritization of engineering resources
- Every incident must yield new tests—otherwise root causes weren't truly understood
- Follow-up actions with owners and deadlines ensure findings drive actual changes
- Simulated incidents validate that post-mortem templates are practical and complete
- Post-mortem artifacts become institutional memory preventing repeated failures

Post-mortems transform failures from setbacks into learning opportunities. Without structured documentation, knowledge remains siloed in individuals who experienced the incident. With templates and examples, every failure enriches the organization's collective understanding of what can go wrong and how to prevent it. The best organizations aren't those that never fail—they're those that fail forward, ensuring each failure makes the next one less likely.

###13.2.CODE AND IMPLEMENTATION

In [18]:

# Cell 13 — Post-Mortem Template + Example Incident
# ============================================================================
print("\n" + "=" * 80)
print("CELL 13: POST-MORTEM TEMPLATE + EXAMPLE INCIDENT")
print("=" * 80)

# Post-mortem template
post_mortem_template = {
    "incident_id": "<unique_id>",
    "incident_summary": "<brief description>",
    "timeline": [
        {"time": "<timestamp>", "event": "<event description>"},
    ],
    "root_cause": "<detailed root cause analysis>",
    "detection_gap": "<why was this not caught earlier?>",
    "impact": {
        "pnl_loss": "<quantify loss>",
        "duration": "<time span>",
        "affected_assets": "<list>",
    },
    "mitigations": [
        "<mitigation 1>",
        "<mitigation 2>",
    ],
    "tests_added": [
        "<new test 1>",
        "<new test 2>",
    ],
    "follow_up": [
        {"action": "<action item>", "owner": "<responsible person>", "deadline": "<date>"},
    ],
}

post_mortem_template_path = os.path.join(OUTPUT_DIR, "post_mortem_template.json")
write_json(post_mortem_template_path, post_mortem_template)
artifact_registry["artifact_files"].append("post_mortem_template.json")

# Example incident: data missingness spike
incident_day = DRIFT_EVENT_STEP  # reuse drift event as incident trigger
incident_pnl_loss = np.sum(net_pnl_linear[incident_day:incident_day+10])

post_mortem_example = {
    "incident_id": "INC-2025-001",
    "incident_summary": "Data missingness spike caused by feature distribution shift, leading to unexpected losses.",
    "timeline": [
        {"time": f"Test day {incident_day}", "event": "Feature drift detected (z-shift > 3.0)"},
        {"time": f"Test day {incident_day+1}", "event": "Positions became erratic, turnover spiked"},
        {"time": f"Test day {incident_day+5}", "event": "Monitoring alert triggered (Amber state)"},
        {"time": f"Test day {incident_day+10}", "event": "Manual review initiated, losses quantified"},
    ],
    "root_cause": "Feature engineering pipeline did not handle regime-dependent distribution shifts. "
                   "The rolling z-score feature became unstable during regime transition, causing misaligned signals.",
    "detection_gap": "Monitoring system detected drift, but alert budget was too conservative. "
                     "Persistence threshold delayed escalation to Red state.",
    "impact": {
        "pnl_loss": float(incident_pnl_loss),
        "duration": "10 days (test period)",
        "affected_assets": "All 10 assets",
    },
    "mitigations": [
        "Added regime-aware feature normalization",
        "Lowered persistence threshold from 3 to 2",
        "Increased alert budget from 5 to 10 per 100 steps",
        "Implemented emergency kill switch for Red state",
    ],
    "tests_added": [
        "Regime transition robustness test (explicit state change simulation)",
        "Feature stability test under distribution shift",
        "Alert latency test (time from drift to Red state)",
    ],
    "follow_up": [
        {"action": "Update feature engineering to use regime-conditional normalization", "owner": "Quant Dev", "deadline": "2025-02-01"},
        {"action": "Backtest mitigation strategy on historical regime transitions", "owner": "Risk Manager", "deadline": "2025-02-15"},
        {"action": "Review alert budget across all models", "owner": "Model Owner", "deadline": "2025-02-10"},
    ],
}

post_mortem_example_path = os.path.join(OUTPUT_DIR, "post_mortem_example.json")
write_json(post_mortem_example_path, post_mortem_example)
artifact_registry["artifact_files"].append("post_mortem_example.json")

print(f"[POST-MORTEM] Example incident documented (INC-2025-001)")
print(f"[POST-MORTEM] PnL loss: {incident_pnl_loss:.6f}")




CELL 13: POST-MORTEM TEMPLATE + EXAMPLE INCIDENT
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/post_mortem_template.json
[ARTIFACT] Written: /content/ch21_run_20251230_181425_8270/post_mortem_example.json
[POST-MORTEM] Example incident documented (INC-2025-001)
[POST-MORTEM] PnL loss: -466.075823


## 14.DOCUMENTATION

In [19]:

# Cell 14 — Transition to Chapter 22
# ============================================================================
print("\n" + "=" * 80)
print("CELL 14: TRANSITION TO CHAPTER 22")
print("=" * 80)

transition_note = """
TRANSITION TO CHAPTER 22: GOVERNANCE OBLIGATIONS AND REGULATORY FRAMING

Chapter 21 has built a governance-native model risk toolkit, including:
- Explainability diagnostics (global, local, sensitivity)
- Robustness test suites (temporal, cross-sectional, microstructure, regime, adversarial)
- Stress testing for P&L and risk (hidden leverage, tail risk, drawdown decomposition, exposure)
- Robust optimization intuition (shrinkage, ensembles)
- Monitoring & degradation detection (causal online diagnostics, state machines)
- Model cards, data sheets, evaluation sheets, audit packs, and post-mortem templates

Chapter 22 will formalize the governance obligations and regulatory framing for these artifacts:
- Mapping artifacts to regulatory requirements (e.g., SR 11-7, MiFID II, BCBS 239)
- Defining roles and responsibilities (model owner, validator, risk manager)
- Establishing review cycles and escalation paths
- Documenting compliance evidence and audit trails

All artifacts generated in Chapter 21 are designed to be audit-ready and traceable,
providing the foundation for a robust model governance framework in Chapter 22.
"""

print(transition_note)

transition_note_path = os.path.join(OUTPUT_DIR, "transition_to_ch22.txt")
write_text(transition_note_path, transition_note)
artifact_registry["artifact_files"].append("transition_to_ch22.txt")


# Final Summary
# ============================================================================
print("\n" + "=" * 80)
print("FINAL SUMMARY")
print("=" * 80)

print(f"\n[RUN COMPLETE] Run ID: {RUN_ID}")
print(f"[OUTPUT DIR] {OUTPUT_DIR}")
print(f"[ARTIFACTS] Generated {len(artifact_registry['artifact_files'])} files:")
for fname in artifact_registry["artifact_files"]:
    print(f"  - {fname}")

print("\n[EVALUATION METRICS]")
print(f"  Model A (Linear) - Sharpe: {metrics_linear['sharpe']:.4f}, Max DD: {metrics_linear['max_drawdown']:.4f}")
print(f"  Model B (Rule) - Sharpe: {metrics_rule['sharpe']:.4f}, Max DD: {metrics_rule['max_drawdown']:.4f}")

print("\n[ROBUSTNESS SUITE]")
print(f"  Total tests: {robustness_suite_report['summary']['total']}")
print(f"  Passed: {robustness_suite_report['summary']['passed']}")
print(f"  Failed: {robustness_suite_report['summary']['failed']}")

print("\n[MONITORING]")
print(f"  Total alerts: {len(monitoring_run_log['alerts'])}")
print(f"  Final state: {current_state}")
print(f"  Alert rate: {alert_count_per_100:.2f} per 100 steps (budget: {alert_budget})")

print("\n[AUDIT PACK]")
print(f"  Audit pack index: {audit_pack_path}")
print(f"  All artifacts hashed and indexed for traceability.")

print("\n[DETERMINISM CHECK]")
print(f"  Config hash: {config_hash[:16]}...")
print(f"  Dataset hash: {dataset_hash[:16]}...")
print("  Re-running with same seed will yield identical hashes.")

print("\n" + "=" * 80)
print("CHAPTER 21 NOTEBOOK COMPLETE")
print("=" * 80)
print("\nAll artifacts saved to:", OUTPUT_DIR)
print("Notebook run end-to-end successfully.")
print("Ready for governance review and transition to Chapter 22.")


CELL 14: TRANSITION TO CHAPTER 22

TRANSITION TO CHAPTER 22: GOVERNANCE OBLIGATIONS AND REGULATORY FRAMING

Chapter 21 has built a governance-native model risk toolkit, including:
- Explainability diagnostics (global, local, sensitivity)
- Robustness test suites (temporal, cross-sectional, microstructure, regime, adversarial)
- Stress testing for P&L and risk (hidden leverage, tail risk, drawdown decomposition, exposure)
- Robust optimization intuition (shrinkage, ensembles)
- Monitoring & degradation detection (causal online diagnostics, state machines)
- Model cards, data sheets, evaluation sheets, audit packs, and post-mortem templates

Chapter 22 will formalize the governance obligations and regulatory framing for these artifacts:
- Mapping artifacts to regulatory requirements (e.g., SR 11-7, MiFID II, BCBS 239)
- Defining roles and responsibilities (model owner, validator, risk manager)
- Establishing review cycles and escalation paths
- Documenting compliance evidence and audit 

##15.CONCLUSIONS


**The Journey from Model to Production-Ready System**

This chapter has constructed a comprehensive model risk management framework that transforms raw quantitative research into governance-ready, auditable, production-grade trading systems. What began as simple linear predictors on synthetic data evolved through thirteen carefully orchestrated steps into a complete pipeline demonstrating explainability, robustness testing, stress analysis, defensive optimization, continuous monitoring, and systematic documentation. This isn't just academic exercise—it's the minimum viable governance infrastructure that separates hobbyist backtests from institutional-quality quantitative strategies.

The pipeline we've built addresses the central problem of quantitative finance: models that perform brilliantly on historical data often fail catastrophically in live markets. This performance gap—what we've called "alpha tax"—arises from multiple sources: overfitting to specific historical conditions, estimation error in parameters, regime changes that invalidate learned relationships, transaction costs higher than assumed, and gradual model degradation as markets evolve. Traditional research stops at backtest results showing attractive Sharpe ratios. Professional risk management begins there, systematically probing every assumption and failure mode before risking real capital.

Let's walk through the complete pipeline step by step, understanding what each stage accomplishes and why it matters.

**Step 1: Deterministic Foundations and Configuration Management**

We began by establishing determinism through explicit random seeds and comprehensive configuration dictionaries. Every parameter—from sample sizes to cost assumptions to robustness thresholds—lives in a structured CONFIG object that gets hashed cryptographically. This isn't bureaucracy; it's reproducibility infrastructure. When results diverge six months later, you can prove whether your code changed or your configuration changed or your data changed, because everything is versioned and traceable.

The governance utilities layer—hash functions, artifact writers, run manifests—creates an immutable audit trail from the start. Every artifact generated gets hashed and logged in the manifest, creating cryptographic proof of what was produced when. This satisfies regulatory requirements and enables debugging: if a strategy fails, you can retrieve the exact configuration, code version, and data vintage that produced it.

**Step 2: Synthetic Market Generation with Regime Awareness**

Rather than depending on external data that may be unavailable, proprietary, or inconsistent across vendors, we generated synthetic multi-asset markets with explicit regime processes. This pedagogical choice provides complete control over statistical properties: we know the true volatility regime at every time step, the true correlation structure, and can inject specific stress events like tail shocks.

The two-regime structure (low-volatility/low-correlation versus high-volatility/high-correlation) captures the essential non-stationarity of real markets—conditions change, and strategies must survive transitions. By generating correlation matrices that shift with regimes, we ensured our test data contains the clustering behavior observed in real markets where diversification benefits evaporate during crises. This synthetic environment becomes our laboratory for everything that follows, providing ground truth for validation.

**Step 3: Causal Feature Engineering with Leakage Prevention**

Feature construction implemented obsessive temporal discipline: every feature at time t uses only data through time t-1. We computed trailing momentum, volatility, rolling means, and z-scores using explicit loops with clear window boundaries, making the causality visible in code structure rather than hidden in library calls. Multiple assertions verified that features are NaN before windows fill and that no feature accesses future data.

We also demonstrated a leaky feature example—using returns[t] as input at time t—to teach what not to do. This pedagogical approach matters because look-ahead bias is the most common and most pernicious error in quantitative research. By making causality explicit and testable, we built features that will work in production because they respect the fundamental constraint: you can only use information available at decision time.

**Step 4: Baseline Model Development with Interpretability**

Rather than immediately building complex machine learning systems, we implemented two transparent baselines: a regularized linear predictor and a simple rule-based momentum strategy. The linear model uses ridge regression—explicit closed-form solution with controlled shrinkage—rather than black-box optimizers. The rule-based strategy encodes market intuition (momentum works better in low-volatility regimes) in five lines of code anyone can understand.

These baselines serve multiple purposes: sanity checks (if sophisticated models can't beat simple rules, something is wrong), benchmarks (complex models must justify their complexity through superior performance), and interpretability foundations (linear coefficients and simple rules are fully explainable). The train-test-embargo split with explicit temporal boundaries ensures out-of-sample evaluation mimics production deployment where test data is genuinely unavailable during training.

**Step 5: Realistic Backtesting with Transaction Costs**

The backtest engine implements the harsh reality check that destroys many promising strategies: transaction costs. We modeled both spread costs (linear in turnover) and market impact (quadratic in turnover), capturing that large trades move prices nonlinearly. Computing both gross and net PnL makes the alpha tax visible—many strategies have positive gross Sharpe but negative net Sharpe after costs.

The one-period execution lag (decide at t based on data through t-1, execute at t, realize returns at t) reflects actual trading mechanics. Summary metrics include not just return statistics but turnover and maximum drawdown, recognizing that implementation matters as much as prediction accuracy. This backtest engine becomes the evaluation framework for all robustness testing—we stress-test by modifying its parameters and observing performance degradation.

**Step 6: Explainability Diagnostics at Three Scales**

Explainability tools make models debuggable rather than black boxes. Global explanations via coefficient magnitudes and rolling refit stability show which features matter overall and whether their importance is consistent or regime-dependent. Local explanations via per-feature contributions identify what drove decisions on specific days, particularly worst loss days where forensic analysis reveals failure modes.

Sensitivity analysis via finite-difference perturbations tests fragility—robust models degrade gracefully under small input changes while brittle models flip predictions from microscopic noise. These three scales (global, local, sensitivity) provide complementary views: global shows average behavior, local shows specific instances, sensitivity shows robustness. Together they enable operators to understand what models do, why they fail, and whether failures are systemic or idiosyncratic.

**Step 7: Robustness Testing Across Five Dimensions**

The robustness suite systematically probes failure modes through explicit pass/fail gates. Temporal robustness via walk-forward analysis tests whether performance is consistent across time or period-specific. Cross-sectional robustness via random asset removal tests whether the strategy depends on specific instruments or generalizes across universes. Microstructure robustness via cost multipliers and latency tests reveals implementation sensitivity—can the strategy survive 3× higher costs or 2-step execution delays?

Regime robustness via conditional performance ensures strategies work in both calm and turbulent markets, not just on average. Adversarial perturbations via feature noise, missingness, and outliers test whether models degrade gracefully under corruption. Each test has quantitative thresholds defined upfront (minimum Sharpe 0.5, maximum drawdown -15%, etc.), converting subjective assessment into objective gates. Strategies must pass all tests to be considered robust—partial success isn't acceptable when real capital is at stake.

**Step 8: Stress Testing for Catastrophic Scenarios**

Beyond robustness testing (which examines graceful degradation), stress testing explores catastrophic failures. Hidden leverage diagnostics via gross exposure reveal amplification that risk models might miss. Tail risk metrics (VaR and Expected Shortfall) capture fat-tailed distributions that standard deviation completely fails to measure. Drawdown decomposition identifies whether losses come from single events or accumulated deterioration, systemic factors or idiosyncratic shocks.

Exposure decomposition via factor regression reveals unintended bets—a strategy claiming to be market-neutral might have substantial market beta. The scenario library explores synthetic disasters beyond historical data: vol shocks simulating market panic, correlation spikes destroying diversification, cost ramps simulating illiquidity. These scenarios answer the pre-mortem question: how could this strategy blow up? Finding vulnerabilities in simulation prevents discovering them with real money.

**Step 9: Robust Optimization to Mitigate Estimation Error**

Classical optimization produces strategies perfectly adapted to sample-specific noise. Robust optimization trades in-sample performance for out-of-sample stability through systematic defenses. Bootstrap resampling quantifies parameter uncertainty, showing which coefficients are reliably non-zero versus which are fitting noise. Shrinkage techniques—multiplying signals by factors less than 1.0—reduce position sizes based on the recognition that parameter estimates are wrong.

Covariance shrinkage stabilizes correlation estimates that are notoriously unreliable from finite samples. Cost conservatism via 1.5-2× multipliers creates safety margins against model errors and regime changes. Turnover penalties implement stability-aware objectives that penalize excessive rebalancing. Ensemble methods diversify across model risk by combining multiple strategies, allowing uncorrelated estimation errors to cancel. These techniques embody a philosophical shift from maximization (find the optimal strategy) to satisficing (find a good-enough strategy that remains good-enough across many scenarios).

**Step 10: Continuous Monitoring with Causal Online Diagnostics**

Models degrade inevitably as markets evolve. Monitoring systems detect degradation early through three layers: input drift (feature distributions shifting), output drift (prediction and turnover changes), and outcome drift (realized costs and returns diverging from expectations). Each layer provides progressively definitive but later signals—input drift catches problems before bad predictions, output drift catches behavioral changes, outcome drift confirms actual damage.

The state machine with persistence and hysteresis (Green/Amber/Red states with escalation rules) prevents alert fatigue while ensuring genuine problems escalate. Alert budgets limit false positive rates, making monitoring operationally sustainable. We validated effectiveness by injecting synthetic drift and verifying that monitoring detected it before significant losses—demonstrating early warning capability rather than just post-mortem documentation.

**Step 11: Automated Documentation Artifact Generation**

Rather than treating documentation as manual overhead, we generated artifacts automatically as code outputs. Model cards document purpose, scope, limitations, and risks—preventing inappropriate applications. Data sheets capture lineage, quality, and semantics—enabling diagnosis when data issues arise. Evaluation sheets make performance measurement transparent with explicit baselines, protocols, and cost assumptions.

The audit pack index catalogs all artifacts with cryptographic hashes, providing immutable proof of completeness and authenticity. The minimum artifact table maps chapter concepts to concrete outputs, serving as a compliance checklist. These artifacts serve multiple audiences—researchers, risk managers, compliance officers, operators—creating a shared source of truth that ensures consistency across organizational silos.

**Step 12: Post-Mortem Infrastructure for Systematic Learning**

Failures are inevitable; learning from them is optional. We created post-mortem templates with structured fields (timeline, root cause, detection gaps, mitigations, tests added) ensuring complete forensic analysis. The worked example analyzing a simulated drift incident demonstrates the template's practicality while establishing expectations for depth of analysis—proximate causes aren't enough, systemic root causes must be identified.

Post-mortems translate failures into improved systems through new tests and monitoring signals. Without structured documentation, institutional knowledge remains siloed in individuals; with templates, every failure enriches collective understanding of what can go wrong and how to prevent recurrence.

**The Integrated Pipeline: More Than the Sum of Parts**

Each step addresses a specific aspect of model risk, but their power emerges from integration. Explainability tools identify which features drive predictions; robustness testing verifies those features work across scenarios; stress testing explores extreme conditions; monitoring detects when relationships break down; post-mortems document what monitoring missed. Artifacts from each stage feed into others—monitoring thresholds come from stress test results, post-mortem improvements become new robustness tests, explainability findings inform feature engineering refinements.

This integrated approach creates defense in depth—multiple layers catching different failure modes at different stages. A vulnerability that slips past robustness testing might be caught by stress testing. A regime change that survives stress scenarios might trigger monitoring alerts before significant losses. And failures that penetrate all defenses get documented in post-mortems that prevent recurrence.

**From Chapter 21 to Production Reality**

This chapter provides the technical infrastructure for model risk management, but production deployment requires organizational infrastructure too. Model validation teams must review artifacts and approve strategies before deployment. Risk management committees must set governance policies defining acceptable risk levels. Compliance teams must map artifacts to regulatory requirements. Operations teams must integrate monitoring systems into trading workflows with clear escalation procedures.

Chapter 22 will address these governance obligations and regulatory frameworks, but the technical foundation built here is prerequisite—you can't have governance without artifacts to govern, monitoring without signals to track, or validation without tests to review. The thirteen steps documented here create the minimum viable technical layer on which institutional governance rests.

**Final Reflections: Defensive Engineering as Discipline**

The philosophy underlying this entire pipeline is defensive engineering—assume things will go wrong and build systems to detect problems before they become disasters. This mindset pervades every decision: use synthetic data to ensure reproducibility, implement causality assertions to prevent leakage, define pass/fail gates to make robustness objective, inject synthetic drift to validate monitoring, generate artifacts automatically to ensure completeness.

Defensive engineering doesn't guarantee success, but it dramatically increases survival probability. Strategies built with this discipline may show lower backtested Sharpe ratios than aggressively optimized alternatives, but they're more likely to deliver those returns in production because they've been stress-tested, monitored, and documented. In quantitative finance, survival compounds—a strategy returning 8% annually for twenty years vastly outperforms one returning 15% for five years before collapsing.

The complete pipeline—from deterministic foundations through synthetic data generation, causal features, transparent baselines, realistic backtesting, explainability, robustness testing, stress scenarios, robust optimization, continuous monitoring, automated documentation, and post-mortem infrastructure—represents the minimum viable approach to treating model risk as a first-class concern rather than an afterthought. It transforms quantitative research from intellectual exercise into production-ready systems that institutions can deploy with confidence, regulators can audit with trust, and operators can maintain over years of market evolution.