#**CHAPTER 3.CONSTRAINED CONTROL AGENTS**
---

##REFERENCE

https://chatgpt.com/share/699349b7-d3c4-8012-812d-06a8783fd6b5

##0.CONTEXT

**Introduction — Chapter 3 Notebook: Robust Constrained Control and Institutional Stage Gates for Surrogate Trading Agents**

This notebook is the third step in a three-chapter journey toward building a surrogate trading agent the way an institution would actually accept it: not as a clever model, but as a governed decision system with explicit constraints, measurable risk, auditable artifacts, and clear “advance / do not advance” criteria. The previous notebooks built the foundation. Chapter 1 defined a surrogate objective: a proxy for “good trading behavior” that includes not only returns but also costs, turnover, drawdowns, and feasibility. Chapter 2 applied optimization pressure through learning and showed the central danger of proxy work: once an optimizer is involved, the agent may discover loopholes, brittle behaviors, or “reward hacks” that look good on the proxy but would fail under real execution and risk supervision.

Chapter 3 is where we stop pretending that a trading agent is simply a model that predicts returns. A trading agent is a controller. It is a system that repeatedly chooses actions under uncertainty while interacting with a market that changes regimes, experiences shocks, and sometimes breaks the assumptions the agent learned yesterday. This notebook therefore moves from “learning a policy” to “governed control under constraints.” The key shift is that the notebook is not only trying to maximize an objective. It is trying to enforce risk and governance limits in a way that is explicit, reviewable, and stable under stress.

The core objective of Chapter 3 is to demonstrate that surrogate agents can be engineered as constrained decision systems, not as unconstrained optimizers. That sounds simple, but it is the most important professional distinction in the entire project. In unconstrained optimization, the algorithm’s job is to find anything that maximizes the proxy. In constrained control, the algorithm’s job is to find the best behavior that still lives inside institutional boundaries. This difference is exactly the difference between a “research demo” and a “production-grade candidate.”

To reach that goal, Chapter 3 introduces three concepts that define institutional-grade surrogate agents.

**First, partial observability and belief state.**  
Markets do not provide the true regime label. They do not tell you when a structural break has occurred. They do not publish “this is now a crisis state.” What you actually have are noisy observations: realized returns, noisy volatility estimates, and liquidity conditions that degrade in stress. A production agent must therefore operate with uncertainty about what state the world is in. This notebook implements that uncertainty explicitly through a belief state: a probability distribution over regimes updated online from observed data. Even if the belief is imperfect, it gives you a disciplined way to say: “the agent thinks stress regime probability is 70% right now.” That is interpretable, auditable, and operational.

**Second, robust optimization under stress and adversarial perturbations.**  
Real markets contain shocks, and institutions must assume that the future can be worse than the average case. This notebook therefore uses scenario packs: multiple simulated future windows around the current time, including stressed variants and an adversarial return perturbation. The adversary is not a fantasy villain. It is a controlled way to represent model error, mis-estimation, or “unknown unknowns.” If the policy remains coherent even when returns are perturbed against it within a budget, that is a stronger sign of robustness than merely performing well on a single path.

**Third, constrained optimization and stage gates.**  
Institutional trading does not say, “maximize Sharpe no matter what.” It says, “do not exceed these limits,” and “if drawdown breaches this level, you stop.” This notebook implements that reality with two layers of governance. There are hard gates: gross exposure caps, turnover caps, and a hard drawdown limit. The controller is forced back to flat when hard gates are violated. Then there are soft constraints: targets for tail loss (CVaR), mean drawdown excess, and turnover excess. These soft constraints are not just after-the-fact measurements; they are integrated into training as explicit constraints that shape the learned policy. That is what makes the system a controlled policy rather than an unconstrained optimizer.

With those principles, the notebook has a clear and concrete set of objectives.

**Objective 1: Build a realistic test bench where failure is possible.**  
The notebook generates a synthetic market that contains hidden regimes, regime-dependent factor behavior, liquidity dynamics, rare jumps, and a structural break. The point is not realism for its own sake. The point is to create conditions in which a naive agent will fail, so that “success” in the notebook actually means something. If the market were easy, every policy would look acceptable, and you would learn nothing about governance.

**Objective 2: Train a constrained policy using a primal–dual method.**  
The notebook trains a stochastic policy that chooses exposure levels based on bounded features and a regime belief. Training is framed as a constrained Markov decision problem, where the agent maximizes a discounted utility while respecting constraints on tail loss, drawdown behavior, and turnover behavior. The dual variables (the Lagrange multipliers) are updated during training. In institutional language, these multipliers become “shadow prices.” They quantify how costly it was to keep the agent within governance limits. This is extremely valuable for review. A committee can look at the shadow prices and see whether constraints are binding. A policy that requires massive shadow prices to remain feasible is a warning sign, even if its headline performance is attractive.

**Objective 3: Compare the learned policy to a robust MPC comparator.**  
The notebook does not only train one agent and declare victory. It compares the learned constrained policy against an explicit, interpretable controller: a robust MPC-style policy that chooses actions by scoring scenarios with mean utility minus CVaR penalties. This comparator is important because it provides a baseline that is not learned and is therefore less prone to proxy exploitation. If the learned policy is superior, it must be superior relative to a governance-aware alternative, not merely relative to randomness.

**Objective 4: Produce decision-grade diagnostics and stage-gate outcomes.**  
The notebook ends with clear “advance / do not advance” evidence. It computes generalization gaps between train and test, measures hard breach rates, reports tail risk (CVaR), reports maximum drawdown, reports turnover and cost intensity, and produces explicit proxy exploitation flags. It also runs a sensitivity test by tightening the drawdown hard limit and checking whether performance collapses. This sensitivity test is a proxy for operational fragility: if a small tightening of limits destroys the policy, it is not production-ready.

These objectives lead to equally clear deliverables. In institutional work, deliverables are not “charts on the screen.” Deliverables are stable objects that can be reviewed, versioned, and compared across runs. This notebook therefore produces two classes of deliverables: technical artifacts and governance artifacts.

**Deliverable 1: A deployable policy object, expressed in auditable form.**  
At the end of training, the notebook outputs the learned policy parameters. The policy is not hidden in an opaque model file. It is exported as a deterministic JSON object containing the policy weights, the action levels, the feature dimension, and the temperature used for stochastic decisions. This means the policy can be reconstructed exactly from the artifact bundle. It also means a reviewer can inspect it and understand what kind of decision rule it represents.

**Deliverable 2: A training curve with full dual dynamics.**  
The notebook exports the full training history: objective values, constraint values, and dual variables at every iteration. This is crucial because production review cares about stability. A single final metric can hide instability, collapse, or overfitting. The training curve shows whether learning converged, whether constraints were consistently violated, and whether the shadow prices stabilized or exploded.

**Deliverable 3: A complete evaluation battery on train and test.**  
The notebook exports evaluation summaries for both the learned policy and the MPC comparator on both train and test environments. Each evaluation includes end wealth, maximum drawdown, mean turnover, mean costs, mean returns, CVaR loss, and breach counts. These are the “headline metrics” that risk and governance teams actually ask for. The evaluation battery also includes generalization gaps so you can see whether the policy is stable across realizations of the same structural environment.

**Deliverable 4: Institutional diagnostics and proxy exploitation flags.**  
Beyond the evaluation metrics, the notebook exports a diagnostics object that includes shadow prices and constraint binding metrics, plus explicit heuristic flags for proxy exploitation. These flags are not “magic.” They are intentionally simple and auditable: they trigger when tail loss blows up relative to targets, when drawdown approaches hard limits, when turnover looks like churn, when costs dominate returns, or when any hard breach occurs. In institutional settings, simple flags are preferred early, because they are interpretable and can be refined later. The flags are not meant to be perfect; they are meant to force discipline.

**Deliverable 5: A sensitivity report that functions as a stage-gate stress test.**  
The notebook performs a controlled sensitivity test by tightening the hard drawdown limit and evaluating both controllers again. The output shows how fragile each policy is under stricter governance. This is important because real-world risk committees often tighten limits during stress. A policy that only “works” when limits are loose is not a stable candidate.

**Deliverable 6: A final summary object and a zipped artifact bundle.**  
Finally, the notebook produces a single “final_summary.json” that contains the run identity, the configuration hash, headline results, stage gate thresholds, shadow prices, and the recommended decision rule for whether to advance. It also packages all artifacts into a zip file. This is the institutional bundle you can share with a reviewer. It contains everything needed to reproduce results and to audit conclusions.

The most important pedagogical message of Chapter 3 is that the primary output is not the policy. The primary output is evidence. In institutional finance, an agent is not accepted because it is clever. It is accepted because its behavior is bounded, its risks are measurable, its failure modes are known, and its training and evaluation are auditable. Chapter 3 therefore treats optimization as only one piece of a larger engineering process. It demonstrates a workflow that is closer to how regulated and supervised organizations operate: build constraints first, train under constraints, test under stress, export evidence, and decide whether to advance based on stage gates.

So what should you expect to learn by running this notebook? You should expect to learn how a surrogate agent becomes a governed control system. You will see how partial observability can be handled through belief states. You will see how robustness can be approximated through scenario packs and adversarial perturbations. You will see how constraints can be enforced through hard gates and optimized through primal–dual methods. You will see how to compare a learned policy to a robust baseline. And you will see how to end with artifacts that support institutional review rather than narrative interpretation.

If Chapter 1 was about defining “what we want,” and Chapter 2 was about discovering “what the optimizer will do,” then Chapter 3 is about proving “what we can safely control.” That is why this notebook matters. It is the first notebook in the series whose output looks like a production candidate: not because it is ready to trade, but because it is shaped like something that can survive a review process.

Finally, it is essential to state what this notebook is not. It is not a claim of real-market profitability. It is not a live trading system. It is not an execution platform. It is a laboratory for disciplined design. Its purpose is to establish a repeatable, auditable method to create surrogate agents whose incentives and constraints can be tested under optimization pressure and stress. If you can do that in a synthetic setting, you have a rational pathway to later steps: richer markets, richer state representations, tighter governance, and eventually carefully supervised real-world experimentation. But you do not skip this step. This step is what separates “AI as a gadget” from “AI as an institutional tool.”

That is the objective of Chapter 3. The deliverables are not just a policy, but a full artifact bundle that proves how the policy was trained, how it behaves under stress, whether it generalizes, which constraints bind, and whether it passes stage gates. If the notebook achieves that, it has achieved its true goal: building a surrogate trading agent as an accountable, governed decision system.


##1.LIBRARIES AND ENVIRONMENT

**Cell 1 — Institutional run lock, configuration graph, and cryptographic provenance (about 350 words)**  
This cell turns the notebook from “a script that runs” into “an experiment that can be audited.” The first objective is determinism: we set explicit random seeds so the same configuration produces the same outputs. Without this, two runs can differ for reasons nobody can explain, and that is unacceptable in institutional research. The second objective is identity: the notebook creates a run timestamp in UTC and builds a configuration hash from a canonical JSON representation of all parameters. That hash is a fingerprint of assumptions. If you change a single number—turnover cap, action set, stress probability—the hash changes. This prevents silent drift and makes runs comparable.  

Next, the cell introduces a typed configuration graph. This matters because “surrogate agents” are extremely sensitive to hidden assumptions. A typed config forces you to name each assumption: market horizon, number of assets, regime transition probabilities, break date, jump probability, observation noise, execution cost coefficients, drawdown limits, CVaR alpha, robustness horizon, adversarial budgets, and training rates. Each config block is validated. Validation is not bureaucracy; it prevents nonsensical experiments (for example, a CVaR alpha outside a meaningful range, or a drawdown soft limit above a hard limit).  

Finally, the cell creates the first governance artifacts: a run manifest, a configuration file, and a config hash file written into an artifact directory. These are deliverables. They allow a reviewer to answer “what exactly did you run?” without reading code or trusting memory. Pedagogically, the lesson is that in finance you do not judge an agent by a screenshot of results. You judge it by reproducibility and traceability. This cell is also where “production readiness” starts: before training, before evaluation, before any performance claims, you establish provenance, stable inputs, and a reviewable footprint.


In [1]:
# CELL 1 — Institutional Run Lock + Typed Config Graph + Cryptographic Provenance + Safety Rails (Chapter 3, v2)

import os, sys, json, math, time, random, hashlib, platform, zipfile, traceback
import datetime as _dt
from dataclasses import dataclass, asdict
from typing import Dict, Tuple, List, Any, Optional, Callable
import numpy as np

# -------------------------
# Zero-ambiguity time + deterministic seeding
# -------------------------
def utc_now_iso() -> str:
    return _dt.datetime.now(_dt.timezone.utc).isoformat()

def _sha256_hex(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

def _canonical_json(obj: Any) -> str:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)

def _write_json(path: str, obj: Any) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2, sort_keys=True)

def _write_text(path: str, s: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write(s)

def _mkdir(path: str) -> None:
    os.makedirs(path, exist_ok=True)

def _assert_finite(name: str, x: float) -> None:
    if not (math.isfinite(float(x))):
        raise ValueError(f"{name} is not finite: {x}")

def _assert_prob_vec(name: str, p: np.ndarray) -> None:
    if p.ndim != 1:
        raise ValueError(f"{name} must be 1D")
    if np.any(~np.isfinite(p)):
        raise ValueError(f"{name} contains non-finite")
    if np.any(p < -1e-14):
        raise ValueError(f"{name} contains negative values")
    s = float(np.sum(p))
    if abs(s - 1.0) > 1e-8:
        raise ValueError(f"{name} must sum to 1; got {s}")

# -------------------------
# Run identity
# -------------------------
SEED = 20260216 + 3 + 17  # deterministic, explicit, auditable
np.random.seed(SEED)
random.seed(SEED)
_rng = np.random.default_rng(SEED)

# -------------------------
# Institutional config graph (multi-asset latent-factor, partial observability, CMDP)
# -------------------------
@dataclass(frozen=True)
class MarketCfg:
    T: int
    n_assets: int
    n_factors: int
    # Hidden regime (2-state) governs factor drift/vol and liquidity state
    P: Tuple[Tuple[float, float], Tuple[float, float]]  # transition matrix
    break_t: int
    jump_prob: float
    jump_scale: float
    obs_noise: float
    def validate(self) -> None:
        if self.T < 2600:
            raise ValueError("T must be >= 2600 for robust regime/break coverage.")
        if not (2 <= self.n_assets <= 12):
            raise ValueError("n_assets must be in [2,12] for this notebook.")
        if not (1 <= self.n_factors <= 5):
            raise ValueError("n_factors must be in [1,5].")
        if not (0 < self.break_t < self.T - 200):
            raise ValueError("break_t must be inside horizon.")
        P = np.array(self.P, dtype=float)
        if P.shape != (2,2):
            raise ValueError("P must be 2x2.")
        for i in range(2):
            s = float(P[i,0] + P[i,1])
            if abs(s - 1.0) > 1e-12:
                raise ValueError("rows of P must sum to 1.")
        if not (0 <= self.jump_prob < 0.25):
            raise ValueError("jump_prob in [0,0.25).")
        if self.jump_scale <= 0:
            raise ValueError("jump_scale > 0.")
        if self.obs_noise < 0:
            raise ValueError("obs_noise >= 0.")

@dataclass(frozen=True)
class FactorCfg:
    # Regime-specific factor drift/vol (n_factors each)
    mu0: Tuple[float, ...]
    mu1: Tuple[float, ...]
    vol0: Tuple[float, ...]
    vol1: Tuple[float, ...]
    # Break shifts
    mu_shift: float
    vol_mult: float
    # AR(1) factor persistence
    phi: float
    def validate(self, n_factors: int) -> None:
        if len(self.mu0) != n_factors or len(self.mu1) != n_factors:
            raise ValueError("mu vectors length must match n_factors.")
        if len(self.vol0) != n_factors or len(self.vol1) != n_factors:
            raise ValueError("vol vectors length must match n_factors.")
        if not (0.0 <= self.phi < 1.0):
            raise ValueError("phi in [0,1).")
        if self.vol_mult < 1.0:
            raise ValueError("vol_mult >= 1.0.")
        if not math.isfinite(self.mu_shift):
            raise ValueError("mu_shift must be finite.")

@dataclass(frozen=True)
class ExecCfg:
    spread_bps: float
    impact_lin: float
    impact_quad: float
    impact_cubic: float
    vol_scale: float
    liq_scale: float
    max_participation: float  # caps per-step turnover as participation proxy
    def validate(self) -> None:
        if any(float(v) < 0 for v in asdict(self).values()):
            raise ValueError("Exec params must be >= 0.")
        if not (0.05 <= self.max_participation <= 2.0):
            raise ValueError("max_participation in [0.05,2.0].")

@dataclass(frozen=True)
class RiskCfg:
    gamma: float
    cvar_alpha: float
    dd_soft: float
    dd_hard: float
    turnover_cap: float
    gross_cap: float
    vol_target: float
    # CMDP constraint targets (soft)
    target_cvar: float
    target_dd_excess: float
    target_turn_excess: float
    def validate(self) -> None:
        if not (0.95 <= self.gamma < 1.0):
            raise ValueError("gamma in [0.95,1).")
        if not (0.01 <= self.cvar_alpha <= 0.20):
            raise ValueError("cvar_alpha in [0.01,0.20].")
        if not (0.0 < self.dd_soft < self.dd_hard < 1.0):
            raise ValueError("dd_soft < dd_hard < 1.")
        if self.turnover_cap <= 0 or self.gross_cap <= 0:
            raise ValueError("caps > 0.")
        if self.vol_target <= 0:
            raise ValueError("vol_target > 0.")
        for nm in ("target_cvar","target_dd_excess","target_turn_excess"):
            _assert_finite(nm, float(getattr(self, nm)))

@dataclass(frozen=True)
class RobustCfg:
    horizon: int
    n_scenarios: int
    stress_prob: float
    stress_vol_mult: float
    adversary_eps: float
    adversary_budget: float
    def validate(self) -> None:
        if not (12 <= self.horizon <= 64):
            raise ValueError("horizon in [12,64].")
        if self.n_scenarios < 128:
            raise ValueError("n_scenarios must be >= 128.")
        if not (0.0 <= self.stress_prob <= 0.8):
            raise ValueError("stress_prob in [0,0.8].")
        if self.stress_vol_mult < 1.0:
            raise ValueError("stress_vol_mult >= 1.")
        if self.adversary_eps < 0 or self.adversary_budget < 0:
            raise ValueError("adversary_* >= 0.")

@dataclass(frozen=True)
class TrainCfg:
    iters: int
    lr_theta: float
    lr_lam: float
    lam_max: float
    entropy_temp: float
    grad_clip: float
    # variance reduction
    baseline_beta: float
    # heavy-tail robustness
    huber_k: float
    def validate(self) -> None:
        if self.iters < 160:
            raise ValueError("iters >= 160.")
        if not (0 < self.lr_theta <= 1.0 and 0 < self.lr_lam <= 1.0):
            raise ValueError("learning rates in (0,1].")
        if self.lam_max <= 0:
            raise ValueError("lam_max > 0.")
        if self.entropy_temp <= 0:
            raise ValueError("entropy_temp > 0.")
        if self.grad_clip <= 0:
            raise ValueError("grad_clip > 0.")
        if not (0.90 <= self.baseline_beta < 1.0):
            raise ValueError("baseline_beta in [0.90,1.0).")
        if self.huber_k <= 0:
            raise ValueError("huber_k > 0.")

@dataclass(frozen=True)
class RunCfg:
    market: MarketCfg
    factor: FactorCfg
    exec: ExecCfg
    risk: RiskCfg
    robust: RobustCfg
    train: TrainCfg
    action_levels: Tuple[float, ...]  # discrete gross exposures (scalar portfolio)
    def validate(self) -> None:
        self.market.validate()
        self.factor.validate(self.market.n_factors)
        self.exec.validate()
        self.risk.validate()
        self.robust.validate()
        self.train.validate()
        if len(self.action_levels) < 5:
            raise ValueError("Need >= 5 exposure levels for meaningful control.")
        if 0.0 not in self.action_levels:
            raise ValueError("Must include 0 exposure.")
        if any(abs(a) > 3.0 for a in self.action_levels):
            raise ValueError("Exposure levels must be within [-3,3] in this notebook.")

cfg = RunCfg(
    market=MarketCfg(
        T=3200,
        n_assets=6,
        n_factors=3,
        P=((0.975, 0.025),
           (0.085, 0.915)),
        break_t=2000,
        jump_prob=0.018,
        jump_scale=4.2,
        obs_noise=0.40
    ),
    factor=FactorCfg(
        mu0=(0.00035, 0.00010, 0.00005),
        mu1=(-0.00010, -0.00005, 0.00000),
        vol0=(0.010, 0.012, 0.008),
        vol1=(0.028, 0.022, 0.018),
        mu_shift=-0.00022,
        vol_mult=1.75,
        phi=0.55
    ),
    exec=ExecCfg(
        spread_bps=2.0,
        impact_lin=2.2e-4,
        impact_quad=1.6e-3,
        impact_cubic=2.0e-3,
        vol_scale=0.28,
        liq_scale=0.55,
        max_participation=1.1
    ),
    risk=RiskCfg(
        gamma=0.9990,
        cvar_alpha=0.05,
        dd_soft=0.10,
        dd_hard=0.24,
        turnover_cap=1.15,
        gross_cap=1.50,
        vol_target=0.015,
        target_cvar=0.0019,
        target_dd_excess=0.010,
        target_turn_excess=0.020
    ),
    robust=RobustCfg(
        horizon=28,
        n_scenarios=192,
        stress_prob=0.38,
        stress_vol_mult=2.1,
        adversary_eps=0.0038,
        adversary_budget=0.022
    ),
    train=TrainCfg(
        iters=200,
        lr_theta=0.20,
        lr_lam=0.18,
        lam_max=80.0,
        entropy_temp=0.75,
        grad_clip=4.0,
        baseline_beta=0.985,
        huber_k=3.5
    ),
    action_levels=(-2.0, -1.25, -0.50, 0.0, 0.50, 1.25, 2.0)
)
cfg.validate()

artifact_root = "/mnt/data/artifacts_ch3_v2"
_mkdir(artifact_root)

cfg_dict = asdict(cfg)
cfg_hash = _sha256_hex(_canonical_json(cfg_dict).encode("utf-8"))
ts = utc_now_iso()
run_id = _sha256_hex((ts + ":" + cfg_hash + ":" + str(SEED)).encode("utf-8"))

manifest = {
    "run_id": run_id,
    "timestamp_utc": ts,
    "seed": SEED,
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "config_hash": cfg_hash,
    "artifact_root": artifact_root,
    "intent": "Chapter 3 v2 — CMDP-grade risk-sensitive robust control: partial observability, factor market, primal-dual policy, MPC comparator, stage gates, artifact bundle.",
}
_write_json(os.path.join(artifact_root, "run_manifest.json"), manifest)
_write_json(os.path.join(artifact_root, "config.json"), cfg_dict)
_write_text(os.path.join(artifact_root, "config_hash.txt"), cfg_hash)

print(json.dumps(manifest, indent=2, sort_keys=True))


{
  "artifact_root": "/mnt/data/artifacts_ch3_v2",
  "config_hash": "2cba2dd63f8ff5f1803c9b9f6b4d97dfad35c005217e5490a339e7d072fd68b0",
  "intent": "Chapter 3 v2 \u2014 CMDP-grade risk-sensitive robust control: partial observability, factor market, primal-dual policy, MPC comparator, stage gates, artifact bundle.",
  "numpy": "2.0.2",
  "platform": "Linux-6.6.105+-x86_64-with-glibc2.35",
  "python": "3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]",
  "run_id": "319b4c17939eccaf79df4c263470714f09ce773f1b9f0f5d4d3dd724497254af",
  "seed": 20260236,
  "timestamp_utc": "2026-02-16T18:31:30.303511+00:00"
}


##2.SYNTHETIC MULTI ASSET DATA

###2.1.OVERVIEW

**Cell 2 — Synthetic multi-asset market with hidden regimes, liquidity state, breaks, and noisy observations **  
This cell builds the laboratory environment the agent must operate in. Its purpose is not to imitate the real market in every detail. Its purpose is to include the structural features that make trading hard and that expose proxy failures: regime changes, volatility clustering, liquidity deterioration in stress, rare jumps, and structural breaks. These elements create conditions where naive policies will look good for a while and then fail—exactly what institutional review wants to surface early.  

The environment is multi-asset and factor-driven. There are latent factors that drive asset returns through a loading matrix. A hidden two-state regime changes the drift and volatility of those factors, so that the statistical character of returns is different in calm versus stress. A structural break introduces a permanent shift after a chosen time index, representing “the world changed,” which is common in finance. A rare jump mechanism injects occasional systemic shocks across assets, representing tail events that dominate risk.  

Equally important: the agent is not given the true regime or the true latent factors. It receives noisy observations: a noisy index return proxy, a noisy volatility proxy, and a noisy liquidity proxy. This is the partial observability setting. In practice, traders never observe “the state.” They infer it from noisy signals. By building this into the lab, we avoid a common academic shortcut where an agent sees the true regime label.  

The cell also creates two distinct datasets: a training market path and a test market path. The test path has the same structural configuration but different randomness. This is crucial: it lets us measure generalization rather than memorization. The deliverable of this cell is therefore the full synthetic market object: hidden regimes (for truth), observed signals (for the agent), and the ground-truth returns used for realized P&L. Pedagogically, this cell teaches that a surrogate agent is only meaningful if it survives realistic uncertainty and stress, not just a smooth “toy” series.


###2.2.CODE AND IMPLEMENTATION

In [2]:
# CELL 2 — Multi-Asset Latent-Factor Market with Hidden Regime + Liquidity State + Noisy Observations (Synthetic Only)

def _logsumexp(v: np.ndarray) -> float:
    m = float(np.max(v))
    return m + float(np.log(np.sum(np.exp(v - m)) + 1e-18))

def simulate_market(cfg: RunCfg, seed_offset: int) -> Dict[str, np.ndarray]:
    rng = np.random.default_rng(int(SEED + seed_offset))
    T = cfg.market.T
    nA = cfg.market.n_assets
    nF = cfg.market.n_factors

    # Hidden regime z_t in {0,1}
    P = np.array(cfg.market.P, dtype=float)
    z = np.zeros(T, dtype=np.int64)
    z[0] = 0

    # Factor process f_t (nF)
    f = np.zeros((T, nF), dtype=float)

    # Asset loadings B (nA x nF), idio vol per asset
    B = rng.normal(size=(nA, nF)).astype(float)
    # normalize loadings to controlled scale
    B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-18) * (0.8 + 0.4*rng.random((nA,1)))
    idio = (0.006 + 0.006 * rng.random(nA)).astype(float)

    # Liquidity proxy (0.05..1.0), worsens with stress
    liq = np.zeros(T, dtype=float)
    liq[0] = 1.0

    # True returns per asset r_t (nA) and portfolio "index" return (equal-weight)
    rA = np.zeros((T, nA), dtype=float)
    rI = np.zeros(T, dtype=float)

    # Latent vol proxy (factor magnitude)
    sig = np.zeros(T, dtype=float)
    sig[0] = float(np.mean(cfg.factor.vol0))

    mu0 = np.array(cfg.factor.mu0, dtype=float)
    mu1 = np.array(cfg.factor.mu1, dtype=float)
    v0  = np.array(cfg.factor.vol0, dtype=float)
    v1  = np.array(cfg.factor.vol1, dtype=float)

    for t in range(1, T):
        prev = int(z[t-1])
        z[t] = 0 if float(rng.random()) < float(P[prev,0]) else 1

        mu = mu0 if z[t]==0 else mu1
        vv = v0  if z[t]==0 else v1

        if t >= cfg.market.break_t:
            mu = mu + cfg.factor.mu_shift
            vv = vv * cfg.factor.vol_mult

        # latent factor innovations
        epsF = rng.normal(size=nF)
        f[t] = cfg.factor.phi * f[t-1] + (1.0 - cfg.factor.phi) * mu + vv * epsF

        # latent vol proxy from factor energy
        sig[t] = float(np.sqrt(0.92 * (sig[t-1]**2) + 0.08 * float(np.mean(vv**2) + 0.35*np.mean(f[t]**2))))

        # liquidity worsens with stress and regime=1
        liq_raw = 1.0 / (1.0 + 16.0*sig[t] + 0.35*float(z[t]))
        liq[t] = float(np.clip(liq_raw, 0.05, 1.0))

        # asset returns: B f + idio*noise
        epsA = rng.normal(size=nA)
        base = B @ f[t] + idio * epsA

        # rare jumps (systemic shock)
        if float(rng.random()) < cfg.market.jump_prob:
            j_sign = 1.0 if float(rng.random()) < 0.5 else -1.0
            jump = j_sign * cfg.market.jump_scale * sig[t] * float(rng.exponential())
            base = base + jump  # add to all assets (systemic)
        rA[t] = base
        rI[t] = float(np.mean(rA[t]))

    # Noisy observation layer:
    # observed index return and observed vol are corrupted; regime is NOT observed (partial observability)
    obs_ret = rI + (0.18*cfg.market.obs_noise) * sig * rng.normal(size=T)
    obs_vol = np.clip(sig * (1.0 + cfg.market.obs_noise * rng.normal(size=T)), 1e-6, 1.0)
    obs_liq = np.clip(liq * (1.0 + 0.35*cfg.market.obs_noise * rng.normal(size=T)), 0.05, 1.0)

    return {
        "z": z, "f": f, "B": B, "idio": idio,
        "liq": liq, "sig": sig,
        "rA": rA, "rI": rI,
        "obs_ret": obs_ret.astype(float),
        "obs_vol": obs_vol.astype(float),
        "obs_liq": obs_liq.astype(float),
    }

train_mkt = simulate_market(cfg, seed_offset=0)
test_mkt  = simulate_market(cfg, seed_offset=911)

def mkt_stats(m: Dict[str, np.ndarray]) -> Dict[str, Any]:
    z = m["z"]
    return {
        "T": int(len(m["rI"])),
        "regime_counts": {0: int(np.sum(z==0)), 1: int(np.sum(z==1))},
        "index_mean": float(np.mean(m["rI"])),
        "index_std": float(np.std(m["rI"])),
        "sig_range": (float(np.min(m["sig"])), float(np.max(m["sig"]))),
        "liq_range": (float(np.min(m["liq"])), float(np.max(m["liq"]))),
        "obs_vol_range": (float(np.min(m["obs_vol"])), float(np.max(m["obs_vol"]))),
    }

print(json.dumps({"train": mkt_stats(train_mkt), "test": mkt_stats(test_mkt)}, indent=2, sort_keys=True))


{
  "test": {
    "T": 3200,
    "index_mean": 0.000596139525893218,
    "index_std": 0.02438755767140611,
    "liq_range": [
      0.4592146187949671,
      1.0
    ],
    "obs_vol_range": [
      1e-06,
      0.08971552841404278
    ],
    "regime_counts": {
      "0": 2337,
      "1": 863
    },
    "sig_range": [
      0.01,
      0.05172693892347615
    ]
  },
  "train": {
    "T": 3200,
    "index_mean": 0.00027237896244032377,
    "index_std": 0.021105684821642507,
    "liq_range": [
      0.455070434355589,
      1.0
    ],
    "obs_vol_range": [
      1e-06,
      0.11285143294423833
    ],
    "regime_counts": {
      "0": 2445,
      "1": 755
    },
    "sig_range": [
      0.01,
      0.052966376810172876
    ]
  }
}


##3.REGIME BELIEF SYSTEM

###3.1.OVERVIEW

**Cell 3 — Online regime belief filter, bounded feature vector, and institutional stage gates**  
This cell defines the agent’s information interface and its hard governance boundaries. The first component is the belief filter: a two-state Bayesian update that transforms noisy observations into a probability of being in each regime. The belief is not a guess; it is a disciplined state variable that represents uncertainty. In practice, institutions often prefer “belief” variables because they can be inspected and monitored. You can say, “the model’s stress probability is 0.78,” and that is actionable: it can drive risk overlays, exposure reductions, or increased caution.  

The second component is the feature map. Features are the inputs to the policy. Here, they are intentionally bounded and interpretable. Instead of using raw values that can explode or dominate learning, the notebook applies saturating transforms that keep each feature in a controlled range. The features include: return shocks (scaled and bounded), volatility relative to target, an EWMA-like risk proxy, drawdown relative to hard limits, liquidity stress, regime stress probability, and previous exposure. This feature design accomplishes two governance goals. First, it limits pathological numerical behavior. Second, it makes the policy’s decision basis explainable: if exposure changes, you can relate it to volatility rising, liquidity worsening, drawdown increasing, or stress probability climbing.  

The third component is stage gates. These are hard rules that represent institutional “stop conditions.” Examples include: maximum gross exposure, maximum turnover per step, and a hard drawdown threshold. In real trading organizations, these are not preferences; they are limits. When a limit is violated, the response is not “penalize the reward slightly.” The response is “reduce risk now.” This notebook mirrors that logic by enforcing hard projections toward flat exposure when a breach occurs.  

The deliverable of this cell is a complete state-to-feature-to-gate interface. Pedagogically, it teaches a crucial mindset: the agent is not only a learner; it is a controlled system. Governance is not a report generated at the end. Governance is embedded in the inputs, in the representation, and in the safety rails that cannot be negotiated.


###3.2.CODE AND IMPLEMENTATION

In [4]:
# CELL 3 — Online Regime Belief Filter (Exact 2-state Bayesian) + Feature Vector + Stage Gates (Hard + Soft)

def bayes_filter_step(obs_ret: float, obs_vol: float, prev: np.ndarray, cfg: RunCfg) -> np.ndarray:
    """
    Two-state Bayesian filter with Gaussian emission on obs_ret, scaled by obs_vol and regime base vol.
    The goal is not "true regime identification"; it is a disciplined belief state for partial observability.
    """
    P = np.array(cfg.market.P, dtype=float)
    pred = prev @ P
    pred = np.clip(pred, 1e-12, 1.0)
    pred = pred / float(np.sum(pred))

    # emission: regime-dependent mean and scale proxy
    mu0 = float(np.mean(cfg.factor.mu0))
    mu1 = float(np.mean(cfg.factor.mu1))
    v0  = float(np.mean(cfg.factor.vol0))
    v1  = float(np.mean(cfg.factor.vol1))

    ll = np.zeros(2, dtype=float)
    for k, (mu, v) in enumerate(((mu0, v0), (mu1, v1))):
        s = float(np.clip(0.45*v + 0.55*obs_vol, 1e-6, 1.0))
        ll[k] = -0.5 * ((obs_ret - mu) / s) ** 2 - math.log(s + 1e-18)

    log_post = np.log(pred + 1e-18) + ll
    post = np.exp(log_post - _logsumexp(log_post))
    post = np.clip(post, 1e-12, 1.0)
    post = post / float(np.sum(post))
    _assert_prob_vec("belief", post)
    return post

def stage_gates(dd: float, turnover: float, gross: float, cfg: RunCfg) -> Dict[str, bool]:
    return {
        "dd_hard": bool(dd > cfg.risk.dd_hard + 1e-12),
        "turnover_hard": bool(turnover > cfg.risk.turnover_cap + 1e-12),
        "gross_hard": bool(gross > cfg.risk.gross_cap + 1e-12),
    }

def features(obs_ret: float, obs_vol: float, obs_liq: float, belief: np.ndarray,
             dd: float, ewma_vol: float, w_prev: float, cfg: RunCfg) -> np.ndarray:
    """
    Production-grade feature map: bounded, monotone, interpretable.
    """
    vol_n = float(obs_vol / max(1e-12, cfg.risk.vol_target))
    ewma_n = float(ewma_vol / max(1e-12, cfg.risk.vol_target))
    dd_n = float(dd / max(1e-12, cfg.risk.dd_hard))
    p_stress = float(belief[1])

    x = np.array([
        1.0,
        float(np.tanh(40.0 * obs_ret)),
        float(np.tanh(2.0 * (vol_n - 1.0))),
        float(np.tanh(2.0 * (ewma_n - 1.0))),
        float(np.tanh(3.0 * (dd_n - 0.25))),
        float(np.tanh(3.0 * (0.65 - obs_liq))),
        float(np.tanh(3.0 * (p_stress - 0.5))),
        float(np.tanh(0.7 * w_prev)),
    ], dtype=float)
    if np.any(~np.isfinite(x)):
        raise ValueError("Non-finite features.")
    return x

# quick sanity: feature dimension
print({"feature_dim": int(features(0.0, 0.01, 1.0, np.array([0.5,0.5]), 0.0, 0.0, 0.0, cfg).size)})


{'feature_dim': 8}


##4.EXECUTION MICROSTRUCTURE

###4.1.OVERVIEW

**Cell 4 — Execution microstructure, tail-risk accounting, and robust loss shaping**  
This cell establishes how trading decisions become real economic outcomes. The key point is that a trading agent cannot be evaluated on “signal returns” alone. It must be evaluated on realized P&L after realistic execution friction. The cell therefore defines an execution cost model that includes spread costs and convex market impact. Costs are scaled by observed volatility and liquidity proxies, so that trading becomes more expensive in stress—exactly what happens in real markets. It also introduces a participation-style cliff: if the position change exceeds a participation proxy, costs accelerate sharply. This represents capacity constraints and “liquidity cliffs,” where the same trading intensity becomes disproportionately expensive.  

The cell then builds tail-risk accounting infrastructure. Institutions care about tails because tails are where strategies die. The notebook uses a deterministic reservoir mechanism to keep a representative sample of losses for tail estimation. That is an audit-friendly design: it is stable, reproducible, and does not depend on hidden randomness at evaluation time. From those losses, the notebook computes CVaR (conditional value at risk), a standard tail risk measure. CVaR is important because it measures the average of the worst losses, not just volatility. Many strategies have acceptable volatility but unacceptable tail concentration.  

The cell also introduces robust loss shaping (Huber-style). The purpose is not to hide tail risk. The purpose is to stabilize training signals so that a single extreme observation does not dominate gradient estimates and push the policy into unstable updates. In institutional terms, this is about numerical stability and controlled learning, not about pretending tails do not exist. Tail risk still appears explicitly in CVaR constraints and evaluation.  

The deliverable of this cell is a consistent, governed accounting layer: given a position change and observed conditions, we can compute costs; given realized returns, we can compute P&L; given P&L, we can compute losses, drawdown, and CVaR. Pedagogically, this teaches that a surrogate trading agent is only meaningful if the proxy objective includes execution realism and tail measurement. Otherwise, the optimizer will inevitably exploit the missing economics.


###4.2.CODE AND IMPLEMENTATION

In [5]:
# CELL 4 — Execution Microstructure + Wealth/Drawdown/Vol Tracking + Streaming CVaR + Robust Loss Transforms

def exec_cost(dw: float, obs_vol: float, obs_liq: float, cfg: RunCfg) -> float:
    """
    Convex impact + spread, scaled by stress and liquidity.
    Includes a participation cap proxy: if turnover exceeds max_participation, cost accelerates sharply.
    """
    spread = cfg.exec.spread_bps * 1e-4
    x = abs(float(dw))
    scale = 1.0 + cfg.exec.vol_scale*float(obs_vol) + cfg.exec.liq_scale*float((1.0/max(1e-6, obs_liq)) - 1.0)

    # participation cliff
    if x > cfg.exec.max_participation:
        cliff = 1.0 + 6.0 * (x - cfg.exec.max_participation) ** 2
    else:
        cliff = 1.0

    a, b, c = cfg.exec.impact_lin, cfg.exec.impact_quad, cfg.exec.impact_cubic
    return float(scale * cliff * (spread*x + a*x + b*(dw*dw) + c*(x**3)))

class TailReservoir:
    """
    Deterministic reservoir for tail losses (audit-friendly).
    """
    def __init__(self, cap: int, seed: int):
        self.cap = int(cap)
        self.count = 0
        self.buf = np.zeros(self.cap, dtype=float)
        self.rng = np.random.default_rng(int(seed))

    def add(self, v: float) -> None:
        self.count += 1
        if self.count <= self.cap:
            self.buf[self.count-1] = float(v)
        else:
            j = int(self.rng.integers(1, self.count+1))
            if j <= self.cap:
                self.buf[j-1] = float(v)

    def values(self) -> np.ndarray:
        n = min(self.count, self.cap)
        return self.buf[:n].copy()

def cvar(losses: np.ndarray, alpha: float) -> float:
    if losses.size == 0:
        return 0.0
    x = np.sort(losses)
    k = max(1, int(math.ceil(alpha * x.size)))
    tail = x[-k:]
    return float(np.mean(tail))

def huber_loss(x: float, k: float) -> float:
    """
    Huber transform for robustness to fat tails in gradient estimators.
    """
    ax = abs(float(x))
    if ax <= k:
        return 0.5 * x * x
    return k * (ax - 0.5*k)

ACTION_LEVELS = np.array(list(cfg.action_levels), dtype=float)

print({"actions": ACTION_LEVELS.tolist(), "example_cost": exec_cost(0.9, 0.02, 0.25, cfg)})


{'actions': [-2.0, -1.25, -0.5, 0.0, 0.5, 1.25, 2.0], 'example_cost': 0.008317339200000001}


##5.SCENARIO ENGINE

###5.1.OVERVIEW

**Cell 5 — Robust scenario engine and deterministic adversary for model-error stress**  
This cell creates the machinery that turns the notebook from “evaluate on one path” into “plan and compare under uncertainty.” In practice, a single realized path is not enough to justify a policy, because performance can be path-dependent and fragile. Institutions therefore ask: what happens under stress? What happens if your assumptions are wrong? What happens if volatility spikes and liquidity collapses?  

The scenario engine answers these questions by generating a pack of scenarios around the current time. Each pack contains many short-horizon futures. Some are “normal-ish,” and some are stressed. Stress is introduced systematically: volatility is multiplied by a stress factor, noise is increased, and liquidity is degraded. This creates a distribution of plausible conditions rather than a single forecast. The policy can then be tested against that distribution.  

A second layer is the deterministic adversary. The adversary applies a controlled perturbation to future returns under an L1 budget. The purpose is to represent bounded model error: you might have underestimated adverse moves, or your signals might degrade exactly when you need them most. The adversary is deterministic and auditable: it concentrates perturbations on the largest-magnitude returns in the window. This is a disciplined “worst-case within budget” test.  

This design has a strong pedagogical payoff. It makes robustness concrete. Robustness is not a slogan like “my model is robust.” It is an explicit definition of what disturbances you will tolerate and how you will test them. It also helps prevent proxy exploitation. A policy that looks good only under optimistic assumptions is likely to break when conditions shift. Scenario packs and adversarial perturbations force the agent to face a wider range of futures.  

The deliverable of this cell is a structured uncertainty object: scenario arrays for returns, vol proxies, and liquidity proxies. These are then used downstream by the MPC baseline and, indirectly, by diagnostics. Pedagogically, the key message is: if you want an institutional-grade surrogate agent, you must show that its decision logic remains coherent under stress and bounded model error, not just under average conditions.


###5.2.CODE AND IMPLEMENTATION

In [6]:
# CELL 5 — Robust Scenario Engine (Multi-Scenario Rollouts) + Deterministic L1 Adversary

def adversary_l1_perturb(r: np.ndarray, eps: float, budget: float) -> np.ndarray:
    """
    Deterministic L1-budget adversary:
    - allocate perturbations to the largest-magnitude steps
    - reduce absolute return toward zero (hurts sign-agnostic planning)
    """
    rr = r.copy().astype(float)
    if eps <= 0.0 or budget <= 0.0:
        return rr
    idx = np.argsort(-np.abs(rr))
    rem = float(budget)
    for j in idx:
        if rem <= 1e-18:
            break
        d = min(float(eps), rem)
        rr[j] = rr[j] - math.copysign(d, rr[j])
        rem -= d
    return rr

def scenario_pack(mkt: Dict[str, np.ndarray], t0: int, cfg: RunCfg, rng: np.random.Generator) -> Dict[str, np.ndarray]:
    """
    Builds N scenarios of length H (including current index), using:
    - realized observed path as base (lab control)
    - injected stress multipliers
    - additive observation noise
    - adversarial return perturbation on forward window
    """
    H = cfg.robust.horizon
    N = cfg.robust.n_scenarios
    T = int(len(mkt["obs_ret"]))
    end = min(T, t0 + H + 1)

    base_r = mkt["obs_ret"][t0:end].copy()
    base_v = mkt["obs_vol"][t0:end].copy()
    base_l = mkt["obs_liq"][t0:end].copy()

    if base_r.size < H + 1:
        pad = (H + 1) - base_r.size
        base_r = np.pad(base_r, (0,pad), mode="edge")
        base_v = np.pad(base_v, (0,pad), mode="edge")
        base_l = np.pad(base_l, (0,pad), mode="edge")

    R = np.zeros((N, H+1), dtype=float)
    V = np.zeros((N, H+1), dtype=float)
    L = np.zeros((N, H+1), dtype=float)

    for i in range(N):
        stressed = (float(rng.random()) < cfg.robust.stress_prob)
        mult = cfg.robust.stress_vol_mult if stressed else 1.0

        v = np.clip(base_v * mult * (1.0 + 0.12*rng.normal(size=H+1)), 1e-6, 1.0)
        # return noise scales with v
        noise = rng.normal(size=H+1) * (0.22 * v)
        r = base_r + noise
        # adversary acts on forward portion only (exclude index 0)
        r_fwd = r.copy()
        r_fwd[1:] = adversary_l1_perturb(r_fwd[1:], cfg.robust.adversary_eps, cfg.robust.adversary_budget)

        # liquidity degrades in stress
        l = np.clip(base_l / (1.0 + 0.65*(mult - 1.0)), 0.05, 1.0)

        R[i], V[i], L[i] = r_fwd, v, l

    return {"R": R, "V": V, "L": L}

print("Scenario engine ready.")


Scenario engine ready.


##6.MPC CONTROLLER

###6.1.OVERVIEW

**Cell 6 — Robust MPC comparator: a transparent baseline with CVaR-aware scoring**  
This cell builds a benchmark controller that is governance-aware and interpretable. It implements a robust MPC (model predictive control) style decision rule that chooses the next exposure by scoring candidate actions across the scenario pack. The score is not just expected utility. It is expected utility minus a tail penalty based on CVaR of losses across scenarios. This matters because an agent can maximize expected outcomes while hiding catastrophic tail outcomes. By penalizing CVaR, the MPC baseline expresses an institutional preference: “avoid decisions that concentrate downside.”  

The MPC also enforces hard constraints explicitly. If an action violates the turnover cap or gross exposure cap, it is rejected. If drawdown exceeds the hard limit, the controller forces the portfolio toward flat exposure. This mirrors real risk control behavior. In an institution, limits are not suggestions. They are enforced.  

Why do we need this comparator? Because a learned policy can be impressive only relative to a meaningful baseline. If the only baseline is randomness or a naive rule, then “beating it” is not evidence of production readiness. A robust MPC comparator is a strong baseline because it is designed to be conservative under stress, and it uses the same governance philosophy as the learned policy. If the learned policy cannot compete with this comparator on risk-adjusted performance and constraint behavior, then it is not ready for advancement.  

The deliverable of this cell is therefore a second policy: a transparent control policy that can be evaluated on the same metrics as the learned policy. Pedagogically, this teaches a central institutional habit: always benchmark against a strong, interpretable alternative. It also shows that not all “agents” must be trained. Some can be engineered as robust controllers. In practice, institutions often deploy engineered overlays and controls even when the core alpha engine is learned. This cell reflects that reality by making the comparator both risk-aware and constraint-enforcing.


###6.2.CODE AND IMPLEMENTATION

In [7]:
# CELL 6 — MPC Controller (Robust One-Step Utility with CVaR) + Hard Constraint Enforcement (Comparator Policy)

def mpc_action(
    x_t: np.ndarray,
    w_prev: float,
    dd: float,
    pack: Dict[str, np.ndarray],
    cfg: RunCfg
) -> float:
    """
    Robust MPC (myopic but risk-sensitive):
      score(a) = mean(step_utility) - beta * CVaR(step_loss)
    where step_utility includes execution cost, soft dd penalty, soft vol penalty, and turnover penalty.
    """
    R = pack["R"]; V = pack["V"]; L = pack["L"]
    N, H1 = R.shape
    # we only use 1-step ahead inside MPC comparator (institutional baseline)
    r1 = R[:, 1]
    v1 = V[:, 1]
    l1 = L[:, 1]

    best_a = 0.0
    best_s = -1e100

    for a in ACTION_LEVELS:
        w_new = float(np.clip(float(a), -cfg.risk.gross_cap, cfg.risk.gross_cap))
        turnover = abs(w_new - w_prev)
        gross = abs(w_new)
        # hard gate
        if turnover > cfg.risk.turnover_cap + 1e-12:
            continue
        if gross > cfg.risk.gross_cap + 1e-12:
            continue
        if dd > cfg.risk.dd_hard + 1e-12:
            if abs(w_new) < 1e-12:
                return 0.0
            continue

        u = np.zeros(N, dtype=float)
        loss = np.zeros(N, dtype=float)
        dd_ex = max(0.0, dd - cfg.risk.dd_soft)

        for i in range(N):
            cst = exec_cost(w_new - w_prev, float(v1[i]), float(l1[i]), cfg)
            # planning uses observed r1, accounting later uses true
            pnl = float(w_prev) * float(r1[i]) - cst
            vol_pen = 0.30 * max(0.0, float(v1[i]/max(1e-12,cfg.risk.vol_target)) - 1.0)
            dd_pen  = 6.0 * dd_ex
            u[i] = pnl - vol_pen - dd_pen - 0.10*turnover - 0.02*(gross**2)
            loss[i] = max(0.0, -pnl)

        s = float(np.mean(u)) - 1.35 * float(cvar(loss, cfg.risk.cvar_alpha))
        if s > best_s + 1e-15 or (abs(s - best_s) <= 1e-15 and float(a) < best_a):
            best_s = s
            best_a = float(a)

    return float(best_a)

print("MPC comparator ready.")


MPC comparator ready.


##7.PRIMAL DUAL TRAINING

###7.1.OVERVIEW

**Cell 7 — Primal–dual constrained training: policy learning under institutional constraints**  
This cell is where the surrogate agent is actually trained. The key concept is that trading is framed as a constrained decision problem, not a pure profit maximization problem. The policy chooses discrete exposure levels based on the feature vector and regime belief. Training seeks to maximize a discounted utility that includes P&L after costs, plus penalties for turnover, excessive volatility, and drawdown excess. But the defining feature is the constraint structure. The notebook introduces explicit constraint targets for tail loss (CVaR), drawdown excess, and turnover excess.  

The training uses a primal–dual method. “Primal” refers to the policy parameters that control decisions. “Dual” refers to the Lagrange multipliers associated with constraints. When the policy violates constraints, the corresponding multiplier increases, making constraint violations more expensive in the objective. Over time, the system learns a balance: improve performance, but not by taking unacceptable tail risk or by churning.  

A major deliverable here is the interpretation of the dual variables as shadow prices. Shadow prices are governance-relevant diagnostics. They tell you how binding a constraint was. If the CVaR shadow price becomes very large, it suggests that the agent naturally wants to take tail risk and must be strongly “taxed” to avoid it. That is a warning sign, even if final performance is good.  

The cell also includes variance reduction (a baseline) to stabilize learning. This is important because the learning signal in finance can be noisy and heavy-tailed. Stabilization is not cosmetic; it is a requirement for reproducible training. The cell records a training log at every iteration: objective values, constraints, shadow prices, and breach counts.  

The deliverable is therefore twofold: the learned policy parameters and the full training trace. Pedagogically, this cell teaches the most important lesson of surrogate agent construction: if you do not train with constraints, you are not building an institutional agent. You are building a reward optimizer that will eventually exploit your proxy.


###7.2.CODE AND IMPLEMENTATION

In [8]:
# CELL 7 — Primal–Dual CMDP Training (Stochastic Softmax Policy) with Robust Penalties + Variance Reduction

def softmax(logits: np.ndarray, temp: float) -> np.ndarray:
    z = logits / max(1e-12, temp)
    m = float(np.max(z))
    ex = np.exp(z - m)
    p = ex / (float(np.sum(ex)) + 1e-18)
    p = np.clip(p, 1e-12, 1.0)
    return p / float(np.sum(p))

def sample_cat(p: np.ndarray, rng: np.random.Generator) -> int:
    u = float(rng.random())
    c = 0.0
    for i in range(p.size):
        c += float(p[i])
        if u <= c:
            return int(i)
    return int(p.size - 1)

def policy_logits(x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # theta shape (A, d)
    return theta @ x

def train_cmdp_primal_dual(mkt: Dict[str, np.ndarray], cfg: RunCfg, rng: np.random.Generator) -> Dict[str, Any]:
    """
    High-standard, audit-friendly CMDP training:
    - Stochastic policy pi_theta(a|x) (linear preferences)
    - Primal objective: discounted utility with robust (Huber) shaping
    - Constraints: CVaR(loss), mean drawdown-excess, mean turnover-excess
    - Dual ascent on multipliers lambda >= 0 (shadow prices)
    """
    T = int(len(mkt["rI"]))
    A = int(ACTION_LEVELS.size)
    d = int(features(0.0, 0.01, 1.0, np.array([0.5,0.5]), 0.0, 0.0, 0.0, cfg).size)

    theta = (rng.normal(size=(A, d)) * 0.06).astype(float)

    lam_cvar = 2.0
    lam_dd   = 2.0
    lam_to   = 1.5

    logs: List[Dict[str, float]] = []
    baseline = 0.0

    for it in range(cfg.train.iters):
        W = 1.0
        peak = 1.0
        dd = 0.0
        ewma_vol = 0.0
        lam_v = 0.94
        belief = np.array([0.5, 0.5], dtype=float)

        w_prev = 0.0
        tail = TailReservoir(cap=2400, seed=SEED + 7000 + it)

        # trajectory storage for score-function gradient
        xs: List[np.ndarray] = []
        a_idx: List[int] = []
        r_step: List[float] = []
        dd_excess: List[float] = []
        to_excess: List[float] = []
        losses: List[float] = []

        disc = 1.0
        U_disc = 0.0
        breach_hard = 0

        for t in range(1, T):
            obs_ret = float(mkt["obs_ret"][t])
            obs_vol = float(mkt["obs_vol"][t])
            obs_liq = float(mkt["obs_liq"][t])

            belief = bayes_filter_step(obs_ret, obs_vol, belief, cfg)
            x = features(obs_ret, obs_vol, obs_liq, belief, dd, ewma_vol, w_prev, cfg)

            logits = policy_logits(x, theta)
            p = softmax(logits, cfg.train.entropy_temp)
            j = sample_cat(p, rng)
            w_new = float(ACTION_LEVELS[j])

            # hard projection
            w_new = float(np.clip(w_new, -cfg.risk.gross_cap, cfg.risk.gross_cap))
            turnover = abs(w_new - w_prev)
            gross = abs(w_new)

            # hard stage gates: enforce via immediate projection to flat + count
            gates = stage_gates(dd, turnover, gross, cfg)
            if gates["turnover_hard"] or gates["gross_hard"]:
                breach_hard += 1
                w_new = 0.0
                turnover = abs(w_new - w_prev)
                gross = abs(w_new)

            # realized P&L uses true index return
            pr = float(w_prev) * float(mkt["rI"][t])
            ewma_vol = math.sqrt(lam_v*(ewma_vol**2) + (1.0-lam_v)*(pr**2))
            cst = exec_cost(w_new - w_prev, obs_vol, obs_liq, cfg)
            pnl = pr - cst

            # update wealth & dd
            W = W * (1.0 + pnl)
            peak = max(peak, W)
            dd = (peak - W) / max(1e-18, peak)

            # step loss for CVaR
            loss = max(0.0, -pnl)
            tail.add(loss)

            # soft constraint excesses
            dd_ex = max(0.0, dd - cfg.risk.dd_soft)
            to_ex = max(0.0, turnover - cfg.risk.turnover_cap)

            # robust-shaped utility (Huber on pnl to reduce gradient domination by tails)
            util = pnl - 0.06*turnover - 0.02*(gross**2)
            util -= 0.30 * max(0.0, (obs_vol/max(1e-12,cfg.risk.vol_target)) - 1.0)
            util -= 6.0 * dd_ex
            if dd > cfg.risk.dd_hard:
                util -= 12.0
                breach_hard += 1

            # store
            xs.append(x)
            a_idx.append(j)
            r_step.append(float(util))
            dd_excess.append(float(dd_ex))
            to_excess.append(float(to_ex))
            losses.append(float(loss))

            # discounted accumulate
            U_disc += disc * float(util)
            disc *= cfg.risk.gamma
            w_prev = w_new

        losses_arr = np.array(losses, dtype=float)
        cvar_loss = cvar(losses_arr, cfg.risk.cvar_alpha)

        dd_ex_mean = float(np.mean(dd_excess)) if dd_excess else 0.0
        to_ex_mean = float(np.mean(to_excess)) if to_excess else 0.0

        # CMDP Lagrangian objective
        # (Constraint targets shift the dual updates; this is production-grade framing)
        L = (
            float(U_disc)
            - float(lam_cvar) * float(max(0.0, cvar_loss - cfg.risk.target_cvar))
            - float(lam_dd)   * float(max(0.0, dd_ex_mean - cfg.risk.target_dd_excess))
            - float(lam_to)   * float(max(0.0, to_ex_mean - cfg.risk.target_turn_excess))
        )

        # Baseline update for variance reduction
        baseline = cfg.train.baseline_beta * baseline + (1.0 - cfg.train.baseline_beta) * float(np.mean(r_step))

        # Policy gradient (score function) with conservative shaping
        grad = np.zeros_like(theta)
        for x, j, u in zip(xs, a_idx, r_step):
            logits = theta @ x
            p = softmax(logits, cfg.train.entropy_temp)
            adv = float(u - baseline)
            # Huber on advantage for stability
            adv = math.copysign(math.sqrt(2.0*huber_loss(adv, cfg.train.huber_k)), adv)

            for k in range(A):
                coeff = (1.0 if k == j else 0.0) - float(p[k])
                grad[k] += coeff * x * adv

        # global constraint pressure shaping (disciplined, bounded)
        pressure = 1.0 + 2.0*max(0.0, cvar_loss - cfg.risk.target_cvar) + 10.0*max(0.0, dd_ex_mean - cfg.risk.target_dd_excess)
        grad *= float(min(3.0, pressure))

        gnorm = float(np.linalg.norm(grad))
        if gnorm > cfg.train.grad_clip:
            grad *= float(cfg.train.grad_clip / max(1e-18, gnorm))

        theta = theta + cfg.train.lr_theta * grad  # ascent

        # Dual ascent (shadow prices), clipped
        lam_cvar = float(np.clip(lam_cvar + cfg.train.lr_lam * (cvar_loss - cfg.risk.target_cvar), 0.0, cfg.train.lam_max))
        lam_dd   = float(np.clip(lam_dd   + cfg.train.lr_lam * (dd_ex_mean - cfg.risk.target_dd_excess), 0.0, cfg.train.lam_max))
        lam_to   = float(np.clip(lam_to   + cfg.train.lr_lam * (to_ex_mean - cfg.risk.target_turn_excess), 0.0, cfg.train.lam_max))

        logs.append({
            "iter": float(it),
            "L": float(L),
            "U_disc": float(U_disc),
            "cvar_loss": float(cvar_loss),
            "dd_ex_mean": float(dd_ex_mean),
            "to_ex_mean": float(to_ex_mean),
            "max_dd_end": float(dd),
            "lam_cvar": float(lam_cvar),
            "lam_dd": float(lam_dd),
            "lam_to": float(lam_to),
            "hard_breaches": float(breach_hard),
            "grad_norm": float(gnorm),
            "baseline": float(baseline),
        })

    return {"theta": theta, "logs": logs}

train_out = train_cmdp_primal_dual(train_mkt, cfg, _rng)
theta_star = train_out["theta"]
train_logs = train_out["logs"]

print(json.dumps({"iters": cfg.train.iters, "final": train_logs[-1], "theta_shape": list(theta_star.shape)}, indent=2, sort_keys=True))


{
  "final": {
    "L": -101.58456815912972,
    "U_disc": -101.58456815912972,
    "baseline": -1.05289843445669,
    "cvar_loss": 0.0,
    "dd_ex_mean": 0.0,
    "grad_norm": 7.951091013245633,
    "hard_breaches": 3199.0,
    "iter": 199.0,
    "lam_cvar": 2.1879015886472177,
    "lam_dd": 5.261755032133296,
    "lam_to": 0.8004954673335385,
    "max_dd_end": 0.0,
    "to_ex_mean": 0.0
  },
  "iters": 200,
  "theta_shape": [
    7,
    8
  ]
}


##8.EVALUATION

###8.1.OVERVIEW

**Cell 8 — Evaluation battery: train/test, learned policy versus MPC, with risk and breach metrics**  
This cell converts the notebook into a decision tool. It evaluates the learned constrained policy and the MPC comparator on both training and out-of-sample test environments. The evaluation is comprehensive and aligned with institutional review, meaning it reports outcomes in the language of risk and feasibility, not just performance.  

The battery includes end wealth, which summarizes compounded performance. It includes maximum drawdown, which is often the single most important risk metric for acceptability. It includes mean turnover, because high turnover implies operational load, higher costs, and greater fragility. It includes mean costs explicitly, because many “good” strategies collapse when costs are included. It includes mean return, to show whether returns are meaningful relative to costs. It includes CVaR of losses, which captures tail concentration and is critical for survival under stress. It also counts hard breaches of stage gates, because an institutional candidate cannot repeatedly cross forbidden boundaries.  

The evaluation is performed on both train and test, enabling generalization analysis. The notebook computes gaps between train and test outcomes. A small gap suggests stability; a large gap suggests overfitting to a specific path or sensitivity to randomness. This is important because a surrogate agent can appear impressive in-sample but fail out-of-sample, especially under regime shifts and breaks.  

The deliverable of this cell is the “battery” object: a structured set of results for each policy and each dataset. This can be stored, compared across runs, and used to establish acceptance thresholds. Pedagogically, the lesson is that institutional evaluation is not a single metric. It is a multidimensional profile: return, cost, turnover, tail loss, drawdown, and breaches, all measured consistently and compared across environments. This cell provides the evidence foundation needed for a stage-gate decision.


###8.2.CODE AND IMPLEMENTATION

In [10]:
# CELL 8 — Evaluation Battery: CMDP Policy vs MPC Comparator on Train/Test + Stress-Replay Scenarios

def policy_action_stochastic(x: np.ndarray, theta: np.ndarray, cfg: RunCfg, rng: np.random.Generator) -> float:
    p = softmax(theta @ x, cfg.train.entropy_temp)
    j = sample_cat(p, rng)
    return float(ACTION_LEVELS[j])

def evaluate(mkt: Dict[str, np.ndarray], cfg: RunCfg, theta: np.ndarray, mode: str, rng: np.random.Generator) -> Dict[str, Any]:
    T = int(len(mkt["rI"]))
    W = 1.0
    peak = 1.0
    dd = 0.0
    ewma_vol = 0.0
    lam_v = 0.94
    belief = np.array([0.5,0.5], dtype=float)
    w_prev = 0.0
    tail = TailReservoir(cap=2400, seed=SEED + (111 if mode=="cmdp" else 222))

    breaches = {"dd_hard":0, "turnover_hard":0, "gross_hard":0}
    tot_cost = 0.0
    tot_ret = 0.0
    tot_turn = 0.0

    for t in range(1, T):
        obs_ret = float(mkt["obs_ret"][t])
        obs_vol = float(mkt["obs_vol"][t])
        obs_liq = float(mkt["obs_liq"][t])

        belief = bayes_filter_step(obs_ret, obs_vol, belief, cfg)
        x = features(obs_ret, obs_vol, obs_liq, belief, dd, ewma_vol, w_prev, cfg)

        if mode == "cmdp":
            w_new = policy_action_stochastic(x, theta, cfg, rng)
        elif mode == "mpc":
            pack = scenario_pack(mkt, t, cfg, rng)
            w_new = mpc_action(x, w_prev, dd, pack, cfg)
        else:
            raise ValueError("mode must be 'cmdp' or 'mpc'")

        w_new = float(np.clip(w_new, -cfg.risk.gross_cap, cfg.risk.gross_cap))
        turnover = abs(w_new - w_prev)
        gross = abs(w_new)

        g = stage_gates(dd, turnover, gross, cfg)
        for k in breaches:
            breaches[k] += int(g[k])

        pr = float(w_prev) * float(mkt["rI"][t])
        ewma_vol = math.sqrt(lam_v*(ewma_vol**2) + (1.0-lam_v)*(pr**2))
        cst = exec_cost(w_new - w_prev, obs_vol, obs_liq, cfg)
        pnl = pr - cst

        W = W * (1.0 + pnl)
        peak = max(peak, W)
        dd = (peak - W) / max(1e-18, peak)

        tail.add(max(0.0, -pnl))

        tot_cost += float(cst)
        tot_ret += float(pr)
        tot_turn += float(turnover)
        w_prev = w_new

    losses = tail.values()
    out = {
        "wealth_end": float(W),
        "max_dd": float(dd),
        "mean_cost": float(tot_cost / max(1, T-1)),
        "mean_ret": float(tot_ret / max(1, T-1)),
        "mean_turnover": float(tot_turn / max(1, T-1)),
        "cvar_loss": float(cvar(losses, cfg.risk.cvar_alpha)),
        "breaches": breaches,
    }
    return out

battery = {
    "train": {
        "cmdp": evaluate(train_mkt, cfg, theta_star, "cmdp", _rng),
        "mpc":  evaluate(train_mkt, cfg, theta_star, "mpc",  _rng),
    },
    "test": {
        "cmdp": evaluate(test_mkt, cfg, theta_star, "cmdp", _rng),
        "mpc":  evaluate(test_mkt, cfg, theta_star, "mpc",  _rng),
    }
}
battery["gaps"] = {
    "cmdp_wealth_gap": float(battery["train"]["cmdp"]["wealth_end"] - battery["test"]["cmdp"]["wealth_end"]),
    "cmdp_dd_gap": float(battery["train"]["cmdp"]["max_dd"] - battery["test"]["cmdp"]["max_dd"]),
    "mpc_wealth_gap": float(battery["train"]["mpc"]["wealth_end"] - battery["test"]["mpc"]["wealth_end"]),
    "mpc_dd_gap": float(battery["train"]["mpc"]["max_dd"] - battery["test"]["mpc"]["max_dd"]),
}

print(json.dumps(battery, indent=2, sort_keys=True))


{
  "gaps": {
    "cmdp_dd_gap": 0.0,
    "cmdp_wealth_gap": -3.2700939554021774e-85,
    "mpc_dd_gap": 0.0,
    "mpc_wealth_gap": 0.0
  },
  "test": {
    "cmdp": {
      "breaches": {
        "dd_hard": 3172,
        "gross_hard": 0,
        "turnover_hard": 149
      },
      "cvar_loss": 0.8896199141047586,
      "max_dd": 1.0,
      "mean_cost": 0.04293212489282714,
      "mean_ret": 0.0010923437357580573,
      "mean_turnover": 0.1298061894341982,
      "wealth_end": 3.2700939554021774e-85
    },
    "mpc": {
      "breaches": {
        "dd_hard": 0,
        "gross_hard": 0,
        "turnover_hard": 0
      },
      "cvar_loss": 0.0,
      "max_dd": 0.0,
      "mean_cost": 0.0,
      "mean_ret": 0.0,
      "mean_turnover": 0.0,
      "wealth_end": 1.0
    }
  },
  "train": {
    "cmdp": {
      "breaches": {
        "dd_hard": 3179,
        "gross_hard": 0,
        "turnover_hard": 260
      },
      "cvar_loss": 1.2491832362336288,
      "max_dd": 1.0,
      "mean_cost": 0.07554

##9.DIAGNOSTICS

###9.1.0VERVIEW

**Cell 9 — Diagnostics and governance interpretation: shadow prices, binding, exploitation flags, and sensitivity**  
This cell turns metrics into governance meaning. First, it extracts the final shadow prices (the final values of the constraint multipliers) and presents them explicitly. These shadow prices answer a key review question: what constraints were truly binding? If a multiplier is near zero, that constraint was not challenged. If a multiplier is large, the policy had to be heavily penalized to remain compliant. This is how you diagnose the pressure points of the agent.  

Second, the cell computes binding metrics such as breach rates and proximity to limits. These are compact indicators of how often the policy flirts with failure. In institutional settings, “close to the limit” is often almost as concerning as “over the limit,” because measurement error, slippage, or stress can push you over.  

Third, the cell creates explicit proxy exploitation flags. These are simple, auditable rules that trigger if tail loss is too high relative to target, if drawdown approaches the hard limit, if turnover resembles churn, if costs dominate returns, or if any hard breach occurs. The purpose is not to automate judgment. The purpose is to prevent narrative drift. Flags force you to acknowledge failure modes instead of explaining them away. They are also easy to refine: as the project matures, you can replace heuristic thresholds with formal tests, but you start with transparent rules.  

Fourth, the cell performs a sensitivity test by tightening the hard drawdown limit and re-evaluating both policies. This simulates a real committee action: “we are tightening limits; does your policy still function?” A policy that collapses under modest tightening is fragile and should not advance.  

The deliverable is a diagnostics object that supports an institutional decision: it summarizes constraint pressure, highlights likely exploitation or fragility, and shows robustness under stricter governance. Pedagogically, this cell teaches that professional acceptance requires interpretability of failure modes, not just a headline performance number.


###9.2.CODE AND IMPLEMENTATION

In [None]:
# CELL 9 — Institutional Diagnostics: Shadow Prices, Constraint Binding, Proxy Exploitation Flags, Sensitivity Stage Gates

def binding_metrics(eval_out: Dict[str, Any], cfg: RunCfg) -> Dict[str, float]:
    T = float(cfg.market.T - 1)
    b = eval_out["breaches"]
    return {
        "breach_rate_dd_hard": float(b["dd_hard"] / max(1.0, T)),
        "breach_rate_turnover": float(b["turnover_hard"] / max(1.0, T)),
        "breach_rate_gross": float(b["gross_hard"] / max(1.0, T)),
        "max_dd": float(eval_out["max_dd"]),
        "cvar_loss": float(eval_out["cvar_loss"]),
        "mean_turnover": float(eval_out["mean_turnover"]),
        "mean_cost": float(eval_out["mean_cost"]),
    }

final = train_logs[-1]
shadow_prices = {
    "lam_cvar": float(final["lam_cvar"]),
    "lam_dd": float(final["lam_dd"]),
    "lam_to": float(final["lam_to"]),
    "meaning": "Shadow prices: how expensive it was (in objective terms) to violate each constraint under optimization pressure.",
}

diag = {
    "shadow_prices": shadow_prices,
    "binding": {
        "train_cmdp": binding_metrics(battery["train"]["cmdp"], cfg),
        "test_cmdp": binding_metrics(battery["test"]["cmdp"], cfg),
        "train_mpc":  binding_metrics(battery["train"]["mpc"],  cfg),
        "test_mpc":   binding_metrics(battery["test"]["mpc"],   cfg),
    }
}

# Proxy exploitation flags: simple institutional heuristics (explicit, auditable)
def exploitation_flags(eval_out: Dict[str, Any], cfg: RunCfg) -> Dict[str, bool]:
    return {
        "cvar_blowup": bool(eval_out["cvar_loss"] > 1.75*cfg.risk.target_cvar),
        "dd_near_hard": bool(eval_out["max_dd"] > 0.92*cfg.risk.dd_hard),
        "turnover_churn": bool(eval_out["mean_turnover"] > 0.90*cfg.risk.turnover_cap),
        "cost_dominates": bool(abs(eval_out["mean_cost"]) > 0.75*abs(eval_out["mean_ret"]) and abs(eval_out["mean_ret"]) > 1e-12),
        "hard_breach_any": bool(sum(eval_out["breaches"].values()) > 0),
    }

diag["flags"] = {
    "train_cmdp": exploitation_flags(battery["train"]["cmdp"], cfg),
    "test_cmdp":  exploitation_flags(battery["test"]["cmdp"],  cfg),
    "train_mpc":  exploitation_flags(battery["train"]["mpc"],  cfg),
    "test_mpc":   exploitation_flags(battery["test"]["mpc"],   cfg),
}

# Sensitivity gate: tighten dd_hard and see degradation (policy fixed)
def cfg_tight_dd(cfg: RunCfg, dd_hard_new: float) -> RunCfg:
    r = cfg.risk
    new_risk = RiskCfg(
        gamma=r.gamma,
        cvar_alpha=r.cvar_alpha,
        dd_soft=min(r.dd_soft, 0.60*dd_hard_new),
        dd_hard=dd_hard_new,
        turnover_cap=r.turnover_cap,
        gross_cap=r.gross_cap,
        vol_target=r.vol_target,
        target_cvar=r.target_cvar,
        target_dd_excess=min(r.target_dd_excess, 0.60*dd_hard_new),
        target_turn_excess=r.target_turn_excess
    )
    new = RunCfg(market=cfg.market, factor=cfg.factor, exec=cfg.exec, risk=new_risk, robust=cfg.robust, train=cfg.train, action_levels=cfg.action_levels)
    new.validate()
    return new

cfg_tight = cfg_tight_dd(cfg, dd_hard_new=0.18)
diag["sensitivity"] = {
    "tight_dd_hard": float(cfg_tight.risk.dd_hard),
    "test_cmdp_tight": evaluate(test_mkt, cfg_tight, theta_star, "cmdp", _rng),
    "test_mpc_tight":  evaluate(test_mkt, cfg_tight, theta_star, "mpc",  _rng),
}

print(json.dumps(diag, indent=2, sort_keys=True))


##10.AUDIT BUNDLE

###10.1.OVERVIEW

**Cell 10 — Artifact export, final summary, and zipped audit bundle**  
This cell produces the institutional-grade outputs of the notebook. The first goal is to export the policy object in a stable form. The learned parameters, action levels, feature dimension, and decision temperature are written to a JSON file. This makes the policy portable and reconstructible. It is not trapped inside a notebook’s memory. A reviewer can load the artifact and reproduce decisions exactly.  

The second goal is to export the full training trace. The training curve contains objective values, constraint values, shadow prices, breach counts, and stability diagnostics per iteration. This is essential for audit and review because it reveals whether training converged smoothly or was unstable. Institutions care about stability, not only endpoints.  

The third goal is to export the evaluation battery and diagnostics. These are the decision-grade results: how the learned policy and MPC comparator behaved on train and test, whether constraints were violated, and whether exploitation flags triggered. This set of exports turns the notebook into a documented experiment rather than a one-time demonstration.  

The cell then constructs a risk log with explicit governance language, including a verification status marker. This is important professionally: it signals that results are not “validated truth,” but outputs of a controlled synthetic lab. It also makes it easy to attach human review and sign-off later.  

Finally, the cell creates a single final summary object that consolidates run identity, configuration hash, headline results, stage gate thresholds, and an “advance if / else” recommendation. This is effectively a one-page committee memo in machine-readable form. The notebook then zips all artifacts into a bundle. That bundle is the true deliverable: it can be archived, shared, versioned, and compared across runs.  

Pedagogically, this final cell teaches that the outcome of the exercise is not a plot or a narrative. The outcome is an auditable package: a policy candidate plus the evidence needed to decide whether it should advance in a supervised pipeline. This is what makes the surrogate agent framework useful in real institutional practice.


###10.2.CODE AND IMPLEMENTATION

In [None]:
# CELL 10 — Artifact Export (Institutional Bundle) + Final Summary Object + Zip (Audit-Ready)

# Persist policy parameters (deterministic JSON)
policy = {
    "theta_shape": list(theta_star.shape),
    "theta": theta_star.tolist(),
    "action_levels": ACTION_LEVELS.tolist(),
    "feature_dim": int(theta_star.shape[1]),
    "entropy_temp": float(cfg.train.entropy_temp),
    "note": "Linear preference policy over bounded features; stochastic softmax; trained with primal–dual CMDP objective."
}

_write_json(os.path.join(artifact_root, "policy_theta.json"), policy)
_write_json(os.path.join(artifact_root, "training_curve.json"), train_logs)
_write_json(os.path.join(artifact_root, "battery.json"), battery)
_write_json(os.path.join(artifact_root, "diagnostics.json"), diag)

risk_log = {
    "run_id": run_id,
    "timestamp_utc": ts,
    "config_hash": cfg_hash,
    "battery": battery,
    "shadow_prices": diag["shadow_prices"],
    "flags": diag["flags"],
    "sensitivity": diag["sensitivity"],
    "verification_status": "Not verified",
    "notes": [
        "Synthetic-only laboratory; not a claim of real-market validity.",
        "Human review required before any deployment-like interpretation."
    ]
}
_write_json(os.path.join(artifact_root, "risk_log.json"), risk_log)

final_summary = {
    "run_id": run_id,
    "timestamp_utc": ts,
    "config_hash": cfg_hash,
    "headline": {
        "train_cmdp": battery["train"]["cmdp"],
        "test_cmdp":  battery["test"]["cmdp"],
        "train_mpc":  battery["train"]["mpc"],
        "test_mpc":   battery["test"]["mpc"],
        "gaps": battery["gaps"],
        "shadow_prices": diag["shadow_prices"],
        "stage_gates": {
            "dd_hard": float(cfg.risk.dd_hard),
            "turnover_cap": float(cfg.risk.turnover_cap),
            "gross_cap": float(cfg.risk.gross_cap),
            "target_cvar": float(cfg.risk.target_cvar),
        },
    },
    "stage_gate_recommendation": {
        "advance_if": [
            "test_cmdp.max_dd <= dd_hard with near-zero hard breaches",
            "test_cmdp.cvar_loss <= target_cvar within tolerance",
            "diagnostics.flags.test_cmdp has no critical flags (cvar_blowup, dd_near_hard, hard_breach_any)",
            "tight-dd sensitivity does not catastrophically deteriorate wealth_end/cvar_loss",
        ],
        "else": "Revise constraints/reward shaping or policy class; rerun Chapter 2/3 loop."
    },
    "artifacts": {
        "run_manifest": "run_manifest.json",
        "config": "config.json",
        "policy_theta": "policy_theta.json",
        "training_curve": "training_curve.json",
        "battery": "battery.json",
        "diagnostics": "diagnostics.json",
        "risk_log": "risk_log.json",
    }
}
_write_json(os.path.join(artifact_root, "final_summary.json"), final_summary)

# Zip bundle
zip_path = os.path.join(artifact_root, "chapter3_v2_artifacts.zip")
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
    for fn in sorted(os.listdir(artifact_root)):
        if fn.endswith(".zip"):
            continue
        z.write(os.path.join(artifact_root, fn), arcname=fn)

print(json.dumps(final_summary, indent=2, sort_keys=True))
print({"artifact_root": artifact_root, "zip_bundle": zip_path, "files": sorted(os.listdir(artifact_root))})


##11.CONCLUSION

**Conclusion — Chapter 3 Notebook: What We Built, What We Obtained, and Why It Matters**

This notebook ends with something very specific and very practical: we do not “get an idea,” and we do not merely “get a model.” We get a controlled decision system and a complete institutional evidence package about that system. The outcome is therefore twofold: an operational object (a surrogate trading controller) and an audit-grade set of artifacts that explain how it was built, what it does, and whether it should be allowed to advance to the next stage.

To be completely concrete, here is what we obtained.

**1) We obtained a policy object: a governed surrogate trading agent in executable form.**  
The most important technical deliverable is the policy itself. In this notebook, the policy is a stochastic controller that maps a bounded feature vector to a discrete action, where the action is a portfolio exposure level (for example, -2.0, -1.25, …, 0.0, …, 2.0). The policy is represented by a parameter matrix **theta**. Each row of theta corresponds to one action level, and each row contains a set of weights that score that action given the current features. When the policy is called at a time step, it computes scores for all actions, converts those scores into a probability distribution (via a temperature-controlled softmax), and then samples an action. That is the policy.

So the primary object we obtain is:

**a function-like controller**  
Input: a feature vector summarizing observations, regime belief, liquidity, drawdown state, and previous position  
Output: an exposure decision (one of the discrete action levels)  
Governance: bounded by explicit caps and trained to respect constraint targets

In other words, we obtained a **surrogate trading agent** whose decisions are not arbitrary. They are shaped by a belief state, by stress proxies, and by institutional risk limits.

**2) We obtained a belief-state mechanism for partial observability.**  
Markets do not tell the agent which regime it is in. The notebook therefore constructs a belief state: a two-number probability vector that is updated online from observed return and volatility proxies. This belief is not an academic decoration. It is a disciplined representation of uncertainty. It allows the policy to condition on “probability of stress regime” rather than pretending it knows the truth.

So a second object we obtained is:

**a belief update function**  
Input: current noisy observations + previous belief  
Output: updated belief over regimes  
Role: makes the agent a controller under uncertainty rather than a rule reacting blindly to noise

This belief mechanism is important because it enables structured decision-making even when the true state is hidden. That is exactly the reality of finance.

**3) We obtained an explicit microstructure and feasibility layer.**  
The notebook does not allow the agent to trade for free. Every change in exposure incurs an execution cost that includes spread and convex impact, and that cost worsens when liquidity is poor and volatility is high. There is also a participation-style cliff: if turnover exceeds a participation proxy, costs accelerate sharply. This matters because it prevents the policy from “winning” by unrealistic churn.

So a third object we obtained is:

**an execution-cost function + feasibility gates**  
Input: position change, volatility proxy, liquidity proxy  
Output: trading cost deducted from P&L  
Role: ensures the policy is evaluated as a tradable mechanism, not a paper signal

This is exactly what turns a surrogate objective into something closer to a trading system. Costs and feasibility are not optional in institutional review.

**4) We obtained a constrained learning result: a policy trained under explicit risk constraints (a CMDP solution candidate).**  
The notebook does not train the policy by maximizing profit alone. It trains it under constraints: tail risk (CVaR), drawdown behavior, and turnover behavior. Training is done with a primal–dual method: the policy parameters are updated to improve performance, and the dual variables (Lagrange multipliers) are updated to penalize constraint violations.

This is the heart of the notebook. It transforms the project from “reward maximization” to “institutional optimization.”

So a fourth object we obtained is:

**a set of learned dual variables (shadow prices) alongside the policy**  
These multipliers tell you how tight the constraints were. If the agent constantly wants to violate a limit, the shadow price increases. If constraints are easy to satisfy, the shadow price stays low.

This gives you something extremely valuable in governance terms: not just “what the policy does,” but “how expensive it was to keep the policy within boundaries.”

**5) We obtained a robust comparator policy (MPC) and a formal comparison battery.**  
An institutional process does not accept a learned policy without a baseline. This notebook implements a robust MPC comparator that chooses actions by scoring scenario packs using mean utility minus CVaR penalties. The comparator is not learned; it is explicitly designed. It is therefore less susceptible to certain forms of proxy exploitation. By evaluating both the learned policy and the MPC controller on train and test markets, we obtain a meaningful comparison.

So a fifth deliverable is:

**a benchmark controller plus paired evaluation results**  
This prevents self-deception. If the learned policy only beats trivial baselines, it is not impressive. If it behaves comparably to a robust controller under stress, that is stronger evidence of legitimacy.

**6) We obtained stage-gate metrics: “advance / do not advance” evidence instead of narrative.**  
The notebook reports specific metrics that can be used in committee review:

- end wealth (a compact summary of compounded performance)
- maximum drawdown (the most important behavioral risk metric)
- mean turnover (a proxy for operational and microstructure stress)
- mean costs (to show whether results are cost-dominated)
- tail loss via CVaR (explicit downside concentration)
- hard breach counts and breach rates (did the agent cross forbidden lines?)

These are not “nice-to-have.” These are the minimum metrics required to treat an agent as a candidate in a supervised pipeline.

**7) We obtained a sensitivity test that approximates real governance behavior.**  
Risk committees tighten limits. They do it in stress, and they do it when supervisors demand it. The notebook performs a controlled sensitivity test by tightening the hard drawdown limit and re-evaluating both policies. This answers a crucial question: is the policy stable under stricter governance, or does it collapse?

This yields an additional deliverable:

**a sensitivity report that reveals fragility**  
If the policy breaks when the drawdown limit is tightened slightly, it is not robust enough for the next stage.

**8) We obtained an artifact bundle suitable for audit and reproducibility.**  
Finally, the notebook exports a full artifact set:

- run manifest (timestamp, seed, environment fingerprint)
- configuration (full typed configuration graph)
- policy parameters (theta)
- training curve (including dual variables)
- evaluation battery (train/test for both controllers)
- diagnostics (shadow prices, binding metrics, exploitation flags)
- risk log and a final summary object
- zipped bundle to share or archive

This is the institutional upgrade. In the real world, the ability to reproduce and audit is not optional. A trading agent that cannot be reproduced exactly is not a candidate; it is a story.

Now that we have stated what we obtained, we can state why it matters.

**The importance: this notebook demonstrates the correct professional posture for AI agents in finance.**  
The finance industry does not suffer primarily from a shortage of models. It suffers from a shortage of controlled decision-making under constraints. Most failures are not failures of forecasting; they are failures of incentive design, risk budgeting, and operational realism. This notebook is important because it treats the trading agent as a mechanism embedded in a constrained control system. That is the correct abstraction.

In other words, the notebook teaches that the “agent” is not the neural network or the policy weights. The agent is the entire system:

- the observation layer and its noise
- the belief state and uncertainty representation
- the feature map and bounded representations
- the action set and exposure limits
- the microstructure and execution costs
- the risk metrics and tail measurement
- the hard gates and soft constraints
- the learning process and its shadow prices
- the evaluation battery and generalization checks
- the artifact bundle and audit trail

This is why Chapter 3 is a decisive step. It shows that surrogate agents can be built in a way that makes them reviewable and governable.

Now, how do you use this in practice?

**Application 1: A blueprint for a “candidate policy” workflow in a research-to-production pipeline.**  
In an institutional setting, you rarely deploy a learned policy directly. You build a candidate, you stress it, you produce evidence, and you decide whether it deserves further investment. This notebook is a blueprint for that gating process. If you are building a new strategy idea, you can use this structure to test whether the policy remains feasible under execution costs, under regime shifts, and under tightened risk limits.

**Application 2: A template for governance-first reinforcement learning in trading.**  
Many organizations are curious about reinforcement learning but reject it because of unpredictability and reward hacking. This notebook demonstrates how to frame RL as a constrained problem with explicit risk controls. The primal–dual structure and the export of dual variables is especially useful: it gives risk teams a handle for reviewing “how the policy behaves under pressure.”

**Application 3: A method for designing proxy functions that resist exploitation.**  
This notebook teaches that you should not have one proxy. You need a structured objective plus constraints and stress tests. The robust scenario packs and CVaR terms are not decorative. They are anti-exploitation instruments. They reduce the room for the agent to “look good” by taking hidden tail risk or by creating churn that looks profitable before costs.

**Application 4: A review package that can be handed to a committee.**  
A common failure in quant research is that results exist only in someone’s notebook outputs, with no stable provenance. This notebook produces a bundle that can be handed to a risk committee or model validation group. They can see configuration, seed, environment, training history, and evaluation results. This is the beginning of institutional credibility.

**Application 5: A teaching instrument for high-level finance training.**  
For MBA and MFin cohorts, this notebook is not about teaching “coding tricks.” It is about teaching professional structure: how to reason about decision systems, constraints, and evidence. Students learn that “performance” is not a scalar. It is performance under constraints, under costs, under stress, with reproducible artifacts.

Finally, it is important to state what the notebook does not claim, because that is part of governance.

**This notebook does not claim real-market profitability.**  
It is synthetic. It is a laboratory. Its purpose is methodological. The goal is not to “win” in the synthetic market; the goal is to show how to build a surrogate agent that remains controlled and reviewable when optimization pressure is applied.

So what is the real outcome of this exercise?

**The real outcome is that you now have a governed surrogate agent template.**  
You have a concrete policy object (theta + action set + feature map + belief state). You have a robust baseline controller (MPC). You have stage gates that tell you when to advance. You have artifacts that let you reproduce and audit the results. And you have diagnostic instruments—shadow prices, binding metrics, exploitation flags—that convert performance into decision-grade evidence.

That is the professional victory of Chapter 3: not “a strategy,” but a disciplined method for constructing, testing, and supervising surrogate trading agents. If you keep this structure as you move to richer markets and richer policies, you are building the correct institutional habit: governance first, mechanism always, and evidence over narrative.
