#**CHAPTER 2.GOVERNED REINFORCEMENT LEARNING**
---

##REFERENCE

https://chatgpt.com/share/699349b7-d3c4-8012-812d-06a8783fd6b5

##0.CONTEXT

**Introduction: Objective and Meaning of the Chapter 2 Colab Exercise**

This notebook is the second step in the three-chapter journey. Chapter 1 built the surrogate objective in a disciplined way: we made the proxy explicit, decomposed it into interpretable components, stress-tested its parameter sensitivity, and exported audit artifacts. Chapter 2 now answers the next institutional question: what happens when we introduce optimization pressure? In other words, what happens when an algorithm is allowed to search for behavior that maximizes the surrogate objective you designed?

The purpose of Chapter 2 is not to “beat the market.” It is not to prove that reinforcement learning works. The purpose is to study the exact moment when a proxy becomes dangerous: when a learning system begins to exploit what the objective rewards and what the objective fails to penalize. Chapter 2 is the laboratory where we test the alignment of the surrogate objective under optimization.

In finance, this step is often misunderstood. Many people think the main difficulty is selecting a learning algorithm. Institutions know that the learning algorithm is not the main risk. The main risk is that optimization amplifies the proxy. If the proxy is imperfect — and it always is — an optimizer will find the imperfections. That is not a moral failure of the algorithm. It is the definition of optimization: it pushes probability mass toward whatever improves the score. If the score can be improved through fragile behavior, the learned policy will drift toward fragility.

So the core objective of this notebook is to make proxy amplification measurable and reviewable.

We do that by designing an intentionally constrained learning setup and then analyzing what it learns with institutional diagnostics. The learning setup is constrained on purpose. It uses a small action space and a discrete state representation so every part of the system is visible, inspectable, and auditable. This is not a limitation. It is a methodological decision: before you scale to large models, you must understand the failure modes at a scale where you can fully interpret the mechanism.

The learning task is framed as a sequential decision problem. At each time step, the policy chooses an exposure state: short, flat, or long. That exposure generates portfolio returns, and exposure changes generate trading costs. Risk and drawdowns are tracked online. The surrogate objective from Chapter 1 is then used as the reward signal: returns are good, but costs, volatility, drawdowns, turnover, leverage, and governance breaches are penalized. This means the learning system is not trained to maximize raw returns; it is trained to maximize the governed proxy of “acceptable performance.”
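
As a minimal sketch of the mechanics just described (not the notebook's reward engine, which is defined in Cell 5), the snippet below maps a short exposure path to per-step portfolio returns and turnover. The arrays `market_returns` and `exposure` are made-up illustrative values, and the convention that the exposure chosen at a step earns that step's return is an assumption.

```python
import numpy as np

# Hypothetical inputs for illustration only.
market_returns = np.array([0.010, -0.020, 0.005, 0.015])
exposure = np.array([1, 1, -1, 0])  # long, long, short, flat

# Portfolio return: exposure times market return (assumed timing convention).
portfolio_returns = exposure * market_returns

# Turnover: absolute change in exposure, starting from a flat position.
turnover = np.abs(np.diff(exposure, prepend=0))

print(portfolio_returns)
print(turnover)
```

The long-to-short flip at the third step produces a turnover of 2, the largest single change the three-action space allows, which is why turnover and cost penalties matter even in this small setup.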

The main meaning of Chapter 2 is that learning is a microscope. It reveals what your surrogate objective truly incentivizes. If the surrogate is well designed, learning should discover behavior that looks economically reasonable, feasible under costs, and stable under risk controls. If the surrogate is poorly designed, learning will produce behavior that looks “smart” on the proxy but unacceptable under institutional review.

This notebook therefore has four concrete goals.

First, it demonstrates that the surrogate objective is an operational object. It is not just a formula on paper. It is a signal that can drive learning and produce a policy. If you cannot use the surrogate as a learning signal, your objective is not an objective; it is an opinion.

Second, it measures whether the learned policy generalizes. We train on one synthetic history and test on a different synthetic history. The goal is not to claim real-world predictive power. The goal is to detect whether the policy has learned stable incentives or has overfit to the particular path of the training environment. Generalization in this setting is a proxy for stability under nonstationarity.

Third, it instruments proxy exploitation. We log reward components, turnover, costs, drawdowns, and other diagnostics so we can see precisely how the policy “earns” its score. We also compute stability signals such as action entropy, Bellman residual diagnostics, and regime-conditional fragility measures. These are not decorative metrics. They are the mechanism-level indicators that tell you whether the learning system is collapsing into a narrow, brittle behavior.

Fourth, it integrates governance into the learning loop. In institutional settings, governance is not a post-hoc report; it is an active constraint. This notebook includes hard penalties for breaching limits and exports all run artifacts so that a reviewer can audit both what the policy did and why it did it. The idea is to treat learning as something that must be supervised, logged, and reviewable at every step, even in a toy laboratory.
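
One of the stability signals named in the third goal, action entropy, can be sketched in a few lines. This version measures the Shannon entropy of the empirical action frequencies; it is an illustrative assumption, and the notebook's own diagnostic may be defined differently (Cell 4 provides softmax probabilities over Q-values for entropy diagnostics). The function name `action_entropy` and both sample sequences are hypothetical.

```python
import math
from collections import Counter

def action_entropy(actions):
    """Shannon entropy (in nats) of the empirical action distribution."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

balanced = [-1, 0, 1] * 100      # uses all three actions equally
collapsed = [1] * 299 + [0]      # almost always long

print(action_entropy(balanced))
print(action_entropy(collapsed))
```

With three actions the maximum is `log(3)`; a value close to zero indicates the policy has collapsed onto a single action, which is the narrow behavior the diagnostics are meant to flag.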

By the end of Chapter 2, you should understand a professional reality: learning does not create alignment. Learning tests alignment. The proxy you built in Chapter 1 becomes the single source of truth for what the system considers good. Chapter 2 shows whether that truth is stable when optimization pressure is applied.

If you take one lesson from this notebook, it should be this: the difference between a safe trading agent and a fragile trading agent is rarely “the algorithm.” It is the objective and the governance around it. Chapter 2 is where that lesson becomes concrete, because you will be able to see the policy’s behavior shift as it learns to maximize your surrogate.

This is why this notebook is structured as a governed laboratory rather than a performance demo. It is designed to generate audit artifacts, to compare train versus test behavior, and to expose proxy failure cliffs before you ever allow a larger or more powerful model into the system.

Run the notebook slowly. Inspect the component summaries. Compare the learned policy to the random baseline. Look at turnover and cost intensity. Look at drawdown behavior. Look at how sensitive the conclusions are across regimes and across the structural break. If the policy looks good only by exploiting a weakness in the proxy, you will see it here. If the proxy is disciplined, you will see disciplined behavior emerge here.

That is the objective and meaning of Chapter 2: introduce optimization pressure under a governed surrogate objective, and measure — in an auditable way — whether the proxy produces stable, feasible, institutionally acceptable learned behavior.


##1.LIBRARIES AND ENVIRONMENT

**Cell 1 — Institutional environment lock, configuration freeze, and audit identity**  
This cell establishes the notebook as an institutional experiment rather than an informal demonstration. The central idea is that learning results are meaningless if they cannot be reproduced, compared, and audited. So the cell’s purpose is to fix the run’s identity and to create an explicit “contract” describing what will happen in the entire notebook. It fixes the random seeds so that training behavior, evaluation behavior, and exported results are deterministic under the declared seed; the timestamp and the run identifier derived from it differ on every execution by design. It then collects every important setting into a single configuration object: how the synthetic market will be generated, how costs will be applied, how the surrogate objective will be computed, what governance limits exist, and how the learning algorithm will schedule exploration and step sizes.  

The cell also generates a run identifier and a configuration hash. Conceptually, the run identifier is the unique fingerprint of the experiment, and the configuration hash is the fingerprint of the design assumptions. This distinction matters: two runs can share the same design but differ in time or seed; the configuration hash proves the design, while the run id proves the specific execution. The cell also creates an artifacts directory and writes a run manifest. The manifest is the minimal institutional record: it states the run id, timestamp, environment details, seed, and configuration identity.  
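
The distinction between the two fingerprints can be illustrated with a small standalone sketch. The helper names (`sha256_of_text`, `config_hash_of`, `run_id_of`) and the toy configuration are hypothetical; only the general recipe (canonical JSON for the design, then timestamp plus hash plus seed for the execution) follows what the cell describes.

```python
import hashlib
import json

def sha256_of_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def config_hash_of(config: dict) -> str:
    # Canonical JSON: sorted keys, no whitespace, so key order does not matter.
    return sha256_of_text(json.dumps(config, sort_keys=True, separators=(",", ":")))

def run_id_of(timestamp: str, cfg_hash: str, seed: int) -> str:
    return sha256_of_text(f"{timestamp}:{cfg_hash}:{seed}")

design = {"gamma": 0.999, "episodes": 140}        # hypothetical toy config
same_design = {"episodes": 140, "gamma": 0.999}   # same content, different key order

h1, h2 = config_hash_of(design), config_hash_of(same_design)
id_a = run_id_of("2026-02-16T16:00:00+00:00", h1, 1)
id_b = run_id_of("2026-02-16T17:00:00+00:00", h1, 1)

print(h1 == h2)      # True: same design, same configuration hash
print(id_a == id_b)  # False: different execution time, different run id
```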

Pedagogically, the lesson is that agent training must be treated like a controlled laboratory experiment. Reinforcement learning is especially sensitive to small changes, and without strict locking you cannot separate genuine effects from noise. This cell also pushes a governance-first habit: every number that matters should be visible in configuration, not buried in the code. That makes reviews possible. A reviewer can read the config and know what the notebook intends to do before seeing any results.  
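
What "frozen" buys in practice can be shown with a toy example. The class `ToyConfig` is a hypothetical stand-in for the notebook's configuration classes; the behavior shown is that of Python's standard `dataclasses` module.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class ToyConfig:          # hypothetical stand-in for the notebook's config classes
    gamma: float
    episodes: int

toy = ToyConfig(gamma=0.999, episodes=140)

try:
    toy.episodes = 10_000   # attempted silent change after construction
    mutated = True
except FrozenInstanceError:
    mutated = False

print(mutated)  # False
```

Freezing is shallow: it blocks attribute reassignment but does not make mutable field values immutable. The notebook's configuration stores its sequences as tuples, which avoids that gap.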

By the end of the cell, you have not trained anything yet, but you have already achieved something essential: you have made the run auditable, reproducible, and comparable across versions. That is the foundation that allows everything else in Chapter 2 to be taken seriously.


In [1]:
# CELL 1 — Institutional Environment Lock, Frozen Config, Run Identity, Artifact Root (Chapter 2)

import os, sys, json, math, random, hashlib, platform
import datetime as _dt
from dataclasses import dataclass, asdict
from typing import Dict, Tuple, List, Any, Optional
import numpy as np

# -----------------------------
# Determinism (hard requirement)
# -----------------------------
SEED = 20260216 + 2
np.random.seed(SEED)
random.seed(SEED)
_rng = np.random.default_rng(SEED)

# -----------------------------
# Utilities (institutional)
# -----------------------------
def utc_now_iso() -> str:
    return _dt.datetime.now(_dt.timezone.utc).isoformat()

def sha256_hex(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

def canonical_json_dumps(obj: Any) -> str:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)

def safe_mkdir(path: str) -> None:
    os.makedirs(path, exist_ok=True)

def write_json(path: str, obj: Any) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2, sort_keys=True)

def write_text(path: str, s: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write(s)

# -----------------------------
# Config schema (frozen)
# -----------------------------
@dataclass(frozen=True)
class MarketConfig:
    T: int
    mu_regime: Tuple[float, float]
    vol_regime: Tuple[float, float]
    P: Tuple[Tuple[float, float], Tuple[float, float]]
    structural_break_t: int
    break_mu_shift: float
    break_vol_mult: float
    jump_prob: float
    jump_scale: float
    def validate(self) -> None:
        if self.T < 1500:
            raise ValueError("T must be >= 1500 for robust train/test evaluation.")
        if not (0 < self.structural_break_t < self.T - 100):
            raise ValueError("structural_break_t must be inside horizon.")
        for i in range(2):
            s = self.P[i][0] + self.P[i][1]
            if abs(s - 1.0) > 1e-12:
                raise ValueError("Each transition row must sum to 1.")
            if any(p < 0 or p > 1 for p in self.P[i]):
                raise ValueError("Transition probabilities must be in [0,1].")
        if not (0 <= self.jump_prob < 0.2):
            raise ValueError("jump_prob must be in [0,0.2).")
        if not (self.jump_scale > 0):
            raise ValueError("jump_scale must be > 0.")
        if not (self.break_vol_mult >= 1.0):
            raise ValueError("break_vol_mult must be >= 1.0.")

@dataclass(frozen=True)
class CostConfig:
    a_lin: float
    b_quad: float
    c_cubic: float
    spread_bps: float
    vol_scale: float
    def validate(self) -> None:
        if any(x < 0 for x in [self.a_lin, self.b_quad, self.c_cubic, self.spread_bps, self.vol_scale]):
            raise ValueError("All cost parameters must be nonnegative.")

@dataclass(frozen=True)
class RewardConfig:
    gamma: float
    lambda_vol: float
    lambda_dd: float
    lambda_cost: float
    lambda_turn: float
    lambda_lev: float
    def validate(self) -> None:
        if not (0.90 <= self.gamma < 1.0):
            raise ValueError("gamma must be in [0.90,1).")
        for k, v in asdict(self).items():
            if k != "gamma" and v < 0:
                raise ValueError("lambdas must be >= 0.")

@dataclass(frozen=True)
class GovernanceConfig:
    max_gross_leverage: float
    max_turnover: float
    max_drawdown: float
    gov_penalty: float
    def validate(self) -> None:
        if self.max_gross_leverage <= 0:
            raise ValueError("max_gross_leverage must be > 0.")
        if self.max_turnover <= 0:
            raise ValueError("max_turnover must be > 0.")
        if not (0 < self.max_drawdown < 1.0):
            raise ValueError("max_drawdown must be in (0,1).")
        if self.gov_penalty <= 0:
            raise ValueError("gov_penalty must be > 0.")

@dataclass(frozen=True)
class QLearnConfig:
    episodes: int
    alpha0: float
    alpha_min: float
    alpha_decay: float
    eps0: float
    eps_min: float
    eps_decay: float
    td_clip: float
    double_q: bool
    def validate(self) -> None:
        if self.episodes < 50:
            raise ValueError("episodes must be >= 50.")
        if not (0 < self.alpha_min <= self.alpha0 <= 1.0):
            raise ValueError("alpha bounds invalid.")
        if not (0 < self.eps_min <= self.eps0 <= 1.0):
            raise ValueError("epsilon bounds invalid.")
        if not (0 < self.alpha_decay <= 1.0 and 0 < self.eps_decay <= 1.0):
            raise ValueError("decays must be in (0,1].")
        if self.td_clip <= 0:
            raise ValueError("td_clip must be > 0.")

@dataclass(frozen=True)
class DiscretizeConfig:
    vol_bins: Tuple[float, ...]
    dd_bins: Tuple[float, ...]
    ret_sign_deadband: float
    def validate(self) -> None:
        if len(self.vol_bins) < 3:
            raise ValueError("vol_bins must have at least 3 cutpoints.")
        if len(self.dd_bins) < 3:
            raise ValueError("dd_bins must have at least 3 cutpoints.")
        if self.ret_sign_deadband < 0:
            raise ValueError("ret_sign_deadband must be >= 0.")

@dataclass(frozen=True)
class RunConfig:
    market: MarketConfig
    cost: CostConfig
    reward: RewardConfig
    gov: GovernanceConfig
    qlearn: QLearnConfig
    disc: DiscretizeConfig
    actions: Tuple[int, ...]
    def validate(self) -> None:
        self.market.validate()
        self.cost.validate()
        self.reward.validate()
        self.gov.validate()
        self.qlearn.validate()
        self.disc.validate()
        if len(self.actions) < 3 or any(a not in (-1, 0, 1) for a in self.actions):
            raise ValueError("actions must be subset of (-1,0,1) and length>=3")

cfg = RunConfig(
    market=MarketConfig(
        T=2200,
        mu_regime=(0.00035, -0.00010),
        vol_regime=(0.010, 0.030),
        P=((0.975, 0.025),
           (0.080, 0.920)),
        structural_break_t=1400,
        break_mu_shift=-0.00025,
        break_vol_mult=1.60,
        jump_prob=0.015,
        jump_scale=4.0
    ),
    cost=CostConfig(
        a_lin=2.0e-4,
        b_quad=1.2e-3,
        c_cubic=1.5e-3,
        spread_bps=1.8,
        vol_scale=0.25
    ),
    reward=RewardConfig(
        gamma=0.9990,
        lambda_vol=0.70,
        lambda_dd=2.20,
        lambda_cost=1.00,
        lambda_turn=0.55,
        lambda_lev=0.10
    ),
    gov=GovernanceConfig(
        max_gross_leverage=1.0,   # action space is {-1,0,1}, so gross=|w| <= 1 by construction
        max_turnover=1.0,         # turnover is |Δw|; with actions, max is 2, but we enforce penalty threshold
        max_drawdown=0.25,
        gov_penalty=25.0
    ),
    qlearn=QLearnConfig(
        episodes=140,
        alpha0=0.35,
        alpha_min=0.03,
        alpha_decay=0.985,
        eps0=0.45,
        eps_min=0.02,
        eps_decay=0.987,
        td_clip=5.0,
        double_q=True
    ),
    disc=DiscretizeConfig(
        vol_bins=(0.0, 0.006, 0.012, 0.020, 0.035, 1.0),
        dd_bins=(0.0, 0.03, 0.08, 0.15, 0.25, 1.0),
        ret_sign_deadband=1e-5
    ),
    actions=(-1, 0, 1)
)
cfg.validate()

artifact_root = "/mnt/data/artifacts_ch2"
safe_mkdir(artifact_root)

cfg_dict = asdict(cfg)
cfg_json = canonical_json_dumps(cfg_dict)
config_hash = sha256_hex(cfg_json.encode("utf-8"))

timestamp_utc = utc_now_iso()
run_id = sha256_hex((timestamp_utc + ":" + config_hash + ":" + str(SEED)).encode("utf-8"))

manifest = {
    "run_id": run_id,
    "timestamp_utc": timestamp_utc,
    "seed": SEED,
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "config_hash": config_hash,
    "artifact_root": artifact_root,
    "model_intent": "Chapter 2 — Tabular RL under surrogate objective; institutional audit artifacts; synthetic-only.",
}
write_json(os.path.join(artifact_root, "run_manifest.json"), manifest)
write_text(os.path.join(artifact_root, "config_hash.txt"), config_hash)

print(json.dumps(manifest, indent=2, sort_keys=True))


{
  "artifact_root": "/mnt/data/artifacts_ch2",
  "config_hash": "50b9a4b1a67adf03cd1124320f6bb2e0ab084ff25b4f3e024df0aa30d0504abc",
  "model_intent": "Chapter 2 \u2014 Tabular RL under surrogate objective; institutional audit artifacts; synthetic-only.",
  "numpy": "2.0.2",
  "platform": "Linux-6.6.105+-x86_64-with-glibc2.35",
  "python": "3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]",
  "run_id": "866597573132eb18e52cac1a5de73366d8699eb3815d6d36da6b9f06998d0fa3",
  "seed": 20260218,
  "timestamp_utc": "2026-02-16T16:52:26.972665+00:00"
}


##2.SYNTHETIC MARKET REGIMES

###2.1.OVERVIEW

**Cell 2 — Synthetic training and testing markets with regimes, breaks, and shocks**  
This cell builds the environment in which learning will occur and, equally importantly, the separate environment in which generalization will be tested. It generates a synthetic market process that is intentionally challenging for trading agents: it includes regime switching, time-varying volatility with clustering, and rare shock events. It also includes a structural break, meaning the market’s “rules of the game” change midstream. This is not decorative realism. It is a deliberate attempt to replicate the core difficulty of finance: nonstationarity.  

The cell produces two distinct market histories. One is used for training, and the other is used for testing. They share the same structural template but differ in random realization, because each path is drawn from its own seed offset. This separation is crucial. If you only evaluate on the same path you train on, you cannot tell whether the agent learned robust incentives or simply adapted to one particular history. By creating a second path, the notebook forces a minimal but meaningful generalization test: does the learned behavior remain coherent when the realized sequence differs?  

Pedagogically, this cell teaches that “data” is not just input; it defines what the agent can and cannot learn. In a synthetic environment, you can control what patterns exist and therefore test specific failure modes. Regime switching forces the agent to deal with changing drift and risk levels. Volatility clustering forces the agent to face persistent stress rather than isolated spikes. Rare shocks force the agent to face tail outcomes that can dominate drawdowns and governance breaches. The structural break forces the agent to confront a shift that invalidates prior assumptions.  

By the end of the cell, you obtain the environment objects that will be used throughout the notebook: return series, regime labels, and volatility proxies for both train and test. You also obtain basic summary statistics that confirm the environment is nontrivial. Institutionally, this cell is the “test bench construction.” It ensures that when you later interpret the learned policy, you can attribute behavior to a well-defined environment rather than to an unknown historical mix. It also ensures that when you observe proxy exploitation, it emerges under conditions that resemble real stress mechanisms rather than only under gentle, stable dynamics.


###2.2.CODE AND IMPLEMENTATION

In [2]:
# CELL 2 — Synthetic Market with Regimes + Structural Break + Jumps (Train/Test Splits)

def simulate_regime_returns(cfg: RunConfig, rng: np.random.Generator, seed_offset: int = 0) -> Dict[str, np.ndarray]:
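    # NOTE: the `rng` argument is currently unused. Each call builds its own generator
    # from SEED + seed_offset, so a path is reproducible regardless of call order.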
    local_rng = np.random.default_rng(int(SEED + seed_offset))
    T = cfg.market.T
    P = np.array(cfg.market.P, dtype=float)
    mu = np.array(cfg.market.mu_regime, dtype=float)
    vol = np.array(cfg.market.vol_regime, dtype=float)

    regimes = np.zeros(T, dtype=np.int64)
    r = np.zeros(T, dtype=float)
    sig = np.zeros(T, dtype=float)

    regimes[0] = 0
    sig[0] = vol[0]
    for t in range(1, T):
        prev = regimes[t-1]
        u = float(local_rng.random())
        regimes[t] = 0 if u < P[prev, 0] else 1

        # structural break shifts after configured time
        mu_t = mu[regimes[t]]
        vol_t = vol[regimes[t]]
        if t >= cfg.market.structural_break_t:
            mu_t = mu_t + cfg.market.break_mu_shift
            vol_t = vol_t * cfg.market.break_vol_mult

        # mild vol clustering proxy (EWMA-like)
        sig[t] = math.sqrt(0.96 * (sig[t-1] ** 2) + 0.04 * (vol_t ** 2))

        shock = float(local_rng.normal())
        base = mu_t + sig[t] * shock

        # rare jump
        if float(local_rng.random()) < cfg.market.jump_prob:
            j = (1.0 if float(local_rng.random()) < 0.5 else -1.0) * cfg.market.jump_scale * sig[t] * float(local_rng.exponential())
        else:
            j = 0.0

        r[t] = base + j

    return {"returns": r, "regimes": regimes, "sigma": sig}

# Train set A (seed_offset=0) and Test set B (seed_offset=999) to enforce separation
train_mkt = simulate_regime_returns(cfg, _rng, seed_offset=0)
test_mkt  = simulate_regime_returns(cfg, _rng, seed_offset=999)

r_tr = train_mkt["returns"]
reg_tr = train_mkt["regimes"]
sig_tr = train_mkt["sigma"]

r_te = test_mkt["returns"]
reg_te = test_mkt["regimes"]
sig_te = test_mkt["sigma"]

print({
    "T": cfg.market.T,
    "break_t": cfg.market.structural_break_t,
    "train_regime_counts": {0: int(np.sum(reg_tr==0)), 1: int(np.sum(reg_tr==1))},
    "test_regime_counts": {0: int(np.sum(reg_te==0)), 1: int(np.sum(reg_te==1))},
    "train_sigma_range": (float(np.min(sig_tr)), float(np.max(sig_tr))),
    "test_sigma_range": (float(np.min(sig_te)), float(np.max(sig_te))),
})


{'T': 2200, 'break_t': 1400, 'train_regime_counts': {0: 1630, 1: 570}, 'test_regime_counts': {0: 1551, 1: 649}, 'train_sigma_range': (0.01, 0.04227173553353449), 'test_sigma_range': (0.01, 0.04635371266983538)}
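
As a sanity check on the regime counts printed above, the long-run behavior implied by the transition matrix can be computed directly. This is a small sketch using the standard closed form for a two-state Markov chain; the matrix values are copied from the notebook's `MarketConfig`.

```python
import numpy as np

# Transition matrix from the notebook's MarketConfig (rows: current regime).
P = np.array([[0.975, 0.025],
              [0.080, 0.920]])

# Stationary distribution of a two-state chain: pi_0 = p10 / (p01 + p10).
p01, p10 = P[0, 1], P[1, 0]
pi = np.array([p10, p01]) / (p01 + p10)

# Expected time spent in a regime before leaving it: 1 / exit probability.
expected_duration = 1.0 / np.array([p01, p10])

print(pi)
print(expected_duration)
```

The realized calm-regime fractions in the output above (1630/2200 ≈ 0.74 for train and 1551/2200 ≈ 0.71 for test) are in the neighborhood of the long-run value of about 0.76. A finite path of a persistent chain is not expected to match it exactly.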


##3.STATE CONSTRUCTION

###3.1.OVERVIEW

**Cell 3 — Discrete state representation and disciplined feature buckets**  
This cell defines how the agent perceives the world. Reinforcement learning is not magic; it can only condition decisions on the state representation you provide. This cell builds that representation in an interpretable and auditable way by discretizing continuous signals into buckets. The state includes a small number of features designed to be meaningful in finance: a crude sign of recent return direction, a regime indicator, a bucketed volatility estimate, and a bucketed drawdown level. Together, these components represent a compressed picture of “where we are”: momentum direction, market regime, stress level, and damage level.  

Pedagogically, the lesson is that Chapter 2 is not trying to build a superhuman trader. It is trying to study proxy optimization under conditions where every design choice is visible. A discretized state space makes the system inspectable: you can list states, count how often they occur, and examine what action the policy chooses in each. This helps you audit what the agent is doing. It also helps you diagnose failure. If the policy behaves poorly, you can ask whether the state representation is missing crucial information, or whether the objective is misaligned, or whether learning collapsed.  
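
To make "you can list states" concrete, the sketch below enumerates the full state space implied by the configuration in Cell 1. The variable names are illustrative; the feature ranges and cutpoints are taken from the notebook's `DiscretizeConfig` and action set.

```python
from itertools import product

# Feature ranges implied by the configuration shown in Cell 1.
ret_signs = (-1, 0, 1)
regimes = (0, 1)
vol_bins = (0.0, 0.006, 0.012, 0.020, 0.035, 1.0)   # 6 cutpoints -> 5 buckets
dd_bins = (0.0, 0.03, 0.08, 0.15, 0.25, 1.0)        # 6 cutpoints -> 5 buckets

all_states = list(product(ret_signs, regimes,
                          range(len(vol_bins) - 1),
                          range(len(dd_bins) - 1)))

n_actions = 3
print(len(all_states))              # 150
print(len(all_states) * n_actions)  # 450
```

At most 150 states and 450 state-action values exist, so the whole learned table can be read by a reviewer. Not every state need be visited on a given path.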

This cell also introduces careful bucketing rules. The bins are chosen so that the agent can distinguish calm from stressed volatility and mild from severe drawdown. The bins are also defined in configuration so they are not accidental. The return sign uses a deadband to avoid treating tiny noise as direction, which prevents unstable flipping. These details matter because an agent will exploit any spurious regularity in the state. If the state representation is overly sensitive, the agent may learn to chase noise and create turnover.  
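
The effect of the deadband can be seen on a handful of near-zero returns. The function below mirrors the deadband idea under a different, illustrative name (`sign_with_deadband`), and the return values are hypothetical.

```python
def sign_with_deadband(x: float, deadband: float) -> int:
    # Values within +/- deadband count as "no direction".
    if x > deadband:
        return 1
    if x < -deadband:
        return -1
    return 0

def count_flips(signs):
    # Number of consecutive positions where the sign label changes.
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Hypothetical tiny returns hovering around zero.
noisy_returns = [4e-6, -3e-6, 2e-6, -5e-6, 6e-6]

raw = [sign_with_deadband(x, 0.0) for x in noisy_returns]
damped = [sign_with_deadband(x, 1e-5) for x in noisy_returns]

print(raw, count_flips(raw))        # [1, -1, 1, -1, 1] 4
print(damped, count_flips(damped))  # [0, 0, 0, 0, 0] 0
```

Without the deadband the state's direction feature changes at every step on pure noise; with the configured `1e-5` threshold it stays neutral, so the agent has no spurious signal to trade on.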

By the end of the cell, you have a formal “state object” and the deterministic functions that map raw environment and portfolio signals into that state. In institutional terms, you have defined the observation model. This is a core deliverable because the policy you learn is only meaningful relative to the state representation. Later, when you export the learned decision rule, it will be expressed over these states. This cell therefore makes the agent interpretable by construction: it limits the complexity so behavior can be inspected, reasoned about, and governed before scaling to richer representations.


###3.2.CODE AND IMPLEMENTATION

In [3]:
# CELL 3 — State Construction (Sign, Regime, Vol Bucket, Drawdown Bucket) + Discretizers

def bucketize(x: float, bins: Tuple[float, ...]) -> int:
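    # Returns a bucket index in [0, len(bins)-2].
    # Buckets are left-inclusive, right-exclusive. Values at or above the last
    # cutpoint fall into the last bucket; values below the first cutpoint fall
    # into bucket 0 (previously they were misrouted to the last bucket).
    if x < bins[0]:
        return 0
    for i in range(len(bins) - 1):
        if (x >= bins[i]) and (x < bins[i+1]):
            return i
    return len(bins) - 2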

def sign_bucket(x: float, deadband: float) -> int:
    if x > deadband:
        return 1
    if x < -deadband:
        return -1
    return 0

def compute_drawdown_bucket(W: float, peak: float, dd_bins: Tuple[float, ...]) -> Tuple[float, int]:
    dd = (peak - W) / max(1e-18, peak)
    b = bucketize(dd, dd_bins)
    return dd, b

@dataclass(frozen=True)
class State:
    ret_sign: int
    regime: int
    vol_bin: int
    dd_bin: int
    def as_tuple(self) -> Tuple[int, int, int, int]:
        return (self.ret_sign, self.regime, self.vol_bin, self.dd_bin)

# Quick sanity checks on discretization scheme
vol_bins = cfg.disc.vol_bins
dd_bins = cfg.disc.dd_bins

print({
    "vol_bins": vol_bins,
    "dd_bins": dd_bins,
    "example_bucket_vol_0.015": bucketize(0.015, vol_bins),
    "example_bucket_dd_0.10": bucketize(0.10, dd_bins),
})


{'vol_bins': (0.0, 0.006, 0.012, 0.02, 0.035, 1.0), 'dd_bins': (0.0, 0.03, 0.08, 0.15, 0.25, 1.0), 'example_bucket_vol_0.015': 2, 'example_bucket_dd_0.10': 2}


##4.ACTION SPACE

###4.1.OVERVIEW

**Cell 4 — Action space, policy selection rule, and value function storage**  
This cell defines what the agent is allowed to do and how learning information will be stored. The action space is intentionally constrained to a small set of exposure choices: short, flat, or long. This is a methodological choice, not a limitation. With a small action space, you can clearly see whether the objective rewards excessive switching, whether the agent is biased toward always being invested, and whether it learns asymmetric behavior across regimes. In institutional terms, a restricted action space is also a governance tool: you are limiting the scope of possible actions to reduce the chance of extreme or uncontrolled behavior.  

The cell also defines how the agent will represent its learned “preferences.” In tabular reinforcement learning, this appears as a table of state-action values. The table is a concrete object: for each state and each action, it stores an estimate of long-run surrogate utility. This table is not just technical machinery; it is an inspectable artifact. You can later export it, review which states have strong preferences, and identify whether learning created suspiciously extreme values that may signal exploitation or instability.  

Pedagogically, this cell teaches that learning is not a mysterious black box even when it is optimizing. The system’s learned knowledge has an explicit representation you can audit. It also teaches that policy selection must be deterministic under ties and must be defined clearly. In institutional settings, tie-breaking rules matter because small numeric differences can cause unstable policy changes. This cell therefore enforces deterministic tie breaks so that results remain reproducible and policy behavior does not depend on incidental ordering.  

Finally, this cell defines the exploration policy used during training: an epsilon-greedy rule that chooses a random action with probability epsilon to gather information, and otherwise chooses the best-known action. The exact schedule is not the main point here; the point is that exploration is a controlled part of the run, not a vague “the agent tried things.”  
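
To give a feel for how much exploration the configured parameters imply, the sketch below evaluates a decayed schedule over the 140 training episodes. The multiplicative per-episode form `max(floor, start * decay**k)` is an assumption made for illustration; the training loop, which is not shown in this section, may apply the decay differently. The helper name `decayed_schedule` is hypothetical, while the numeric values come from `QLearnConfig`.

```python
def decayed_schedule(start: float, floor: float, decay: float, episodes: int):
    # Assumed form: value_k = max(floor, start * decay**k) for episode k.
    return [max(floor, start * decay ** k) for k in range(episodes)]

# Values taken from the notebook's QLearnConfig.
eps = decayed_schedule(start=0.45, floor=0.02, decay=0.987, episodes=140)
alpha = decayed_schedule(start=0.35, floor=0.03, decay=0.985, episodes=140)

print(round(eps[0], 4), round(eps[-1], 4))
print(round(alpha[0], 4), round(alpha[-1], 4))
```

Under this assumed form, neither schedule reaches its floor within 140 episodes: epsilon ends near 0.07 and alpha near 0.04. If the real loop behaves the same way, the agent is still exploring in a small fraction of steps at the end of training, which is worth keeping in mind when reading the final-episode diagnostics.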

By the end of this cell, you have established the agent’s allowable control actions, the mechanism for choosing actions during training, and the storage structure that will hold learned values. Institutionally, you have defined the control boundary and the memory representation of the agent. This will later allow you to export a policy that is not only “working” but reviewable: you can explain what actions exist, what the agent believes about them, and how those beliefs evolved through training.


###4.2.CODE AND IMPLEMENTATION

In [4]:
# CELL 4 — Action Space, Q-Tables (Double Q optional), Greedy/Epsilon Policy, Stable Serialization

ACTIONS = list(cfg.actions)

def q_get(Q: Dict[Tuple[int,int,int,int], Dict[int, float]], s: Tuple[int,int,int,int], a: int) -> float:
    d = Q.get(s)
    if d is None:
        return 0.0
    return float(d.get(a, 0.0))

def q_set(Q: Dict[Tuple[int,int,int,int], Dict[int, float]], s: Tuple[int,int,int,int], a: int, v: float) -> None:
    d = Q.get(s)
    if d is None:
        d = {}
        Q[s] = d
    d[a] = float(v)

def q_argmax(Q: Dict[Tuple[int,int,int,int], Dict[int, float]], s: Tuple[int,int,int,int]) -> Tuple[int, float]:
    # deterministic tie-break for reproducibility: smallest action among ties
    best_a = ACTIONS[0]
    best_v = q_get(Q, s, best_a)
    for a in ACTIONS[1:]:
        v = q_get(Q, s, a)
        if v > best_v + 1e-15:
            best_v = v
            best_a = a
        elif abs(v - best_v) <= 1e-15 and a < best_a:
            best_a = a
            best_v = v
    return best_a, float(best_v)

def epsilon_greedy(Q: Dict[Tuple[int,int,int,int], Dict[int, float]], s: Tuple[int,int,int,int], eps: float, rng: np.random.Generator) -> int:
    if float(rng.random()) < eps:
        return int(rng.choice(ACTIONS))
    a, _ = q_argmax(Q, s)
    return int(a)

def softmax_policy_probs(Q: Dict[Tuple[int,int,int,int], Dict[int, float]], s: Tuple[int,int,int,int], temp: float = 1.0) -> Dict[int, float]:
    # For entropy diagnostics only (not used for action selection)
    vals = np.array([q_get(Q, s, a) for a in ACTIONS], dtype=float)
    # stabilize
    m = float(np.max(vals))
    ex = np.exp((vals - m) / max(1e-12, temp))
    Z = float(np.sum(ex)) + 1e-18
    probs = ex / Z
    return {ACTIONS[i]: float(probs[i]) for i in range(len(ACTIONS))}

# Initialize Q-tables
Q1: Dict[Tuple[int,int,int,int], Dict[int, float]] = {}
Q2: Dict[Tuple[int,int,int,int], Dict[int, float]] = {}  # used if double_q=True

print({
    "double_q": cfg.qlearn.double_q,
    "actions": ACTIONS,
    "q_tables_initialized": True
})


{'double_q': True, 'actions': [-1, 0, 1], 'q_tables_initialized': True}


##5.REWARD ENGINE

###5.1.OVERVIEW

**Cell 5 — Governed surrogate reward engine integrated with costs and online risk**  
This cell constructs the reward signal that drives learning, using the surrogate objective designed in Chapter 1 as the conceptual template. The core point is that the agent is not rewarded on raw returns alone. It is rewarded on a governed proxy that includes penalties for risk, drawdown progression, trading costs, turnover, leverage intensity, and governance breaches. This cell therefore operationalizes the philosophy of the entire project: optimization must be tied to feasibility and institutional acceptability from the beginning.  

Pedagogically, this cell teaches that rewards are not “numbers you compute at the end.” Rewards are the local incentives that shape every step of behavior. If you omit a penalty here, the agent may discover an unrealistic behavior that exploits the omission. If you include a penalty but do not compute it consistently, the agent may learn unstable tricks. That is why this cell emphasizes decomposed components. Even though the agent receives a single total reward at each step, the notebook records how that reward is constructed. This makes later review possible.  

This cell also integrates execution realism. Costs depend on how aggressively the agent changes exposure and may scale with market stress. This is crucial: without cost realism, an agent often learns to churn actions because frequent switching can create accidental advantage in synthetic settings. With convex costs, churning becomes expensive, and the policy must learn to trade off responsiveness against friction.  
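
To make the churn argument concrete, here is a minimal sketch of a convex cost with the same shape as the cost model in the code cell below: spread plus linear, quadratic, and cubic impact, with the impact terms scaled by volatility. The function name `exec_cost` and every coefficient value are illustrative assumptions, not the notebook's configuration.

```python
def exec_cost(dw: float, sigma: float, spread_bps: float = 1.0,
              a_lin: float = 1e-4, b_quad: float = 5e-4, c_cubic: float = 1e-3,
              vol_scale: float = 10.0) -> float:
    """Spread plus linear, quadratic, and cubic impact; impact scales with volatility."""
    abs_dw = abs(dw)
    stress = 1.0 + vol_scale * sigma
    impact = a_lin * abs_dw + b_quad * dw * dw + c_cubic * abs_dw ** 3
    return spread_bps * 1e-4 * abs_dw + stress * impact

calm, stressed = 0.01, 0.04           # hypothetical volatility levels
one_step = exec_cost(1.0, calm)       # flat -> long
flip = exec_cost(2.0, calm)           # short -> long in a single step

print(flip > 2 * one_step)                    # convexity: one flip costs more than two unit trades
print(exec_cost(1.0, stressed) > one_step)    # the same trade costs more under stress
```

Under these assumed coefficients, a policy that flips between -1 and +1 pays more than twice the cost of a unit trade each time, which is the mechanism that makes churn expensive.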

The reward engine also uses online risk tracking. Instead of computing risk with hindsight, the notebook updates risk proxies as time moves forward. This mirrors institutional monitoring: you do not know the future, but you do maintain a rolling view of current stress. The reward can then penalize behavior when stress rises, which is precisely what a governed policy should do.  
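
The forward-only update can be sketched in a few lines. This is a simplified illustration of the idea, assuming an exponentially weighted volatility estimate with decay 0.94 (the value used as `lam_v` in the training loop); the helper name `update_risk` and the return series are hypothetical.

```python
import math

def update_risk(W, peak, vol, port_ret, lam=0.94):
    """One forward step: EWMA volatility, wealth, running peak, and drawdown."""
    vol = math.sqrt(lam * vol ** 2 + (1.0 - lam) * port_ret ** 2)
    W = W * (1.0 + port_ret)
    peak = max(peak, W)
    dd = (peak - W) / peak
    return W, peak, vol, dd

W, peak, vol, dd_prev = 1.0, 1.0, 0.0, 0.0
for port_ret in [0.01, -0.03, -0.02, 0.04]:   # hypothetical realized portfolio returns
    W, peak, vol, dd = update_risk(W, peak, vol, port_ret)
    dd_inc = max(0.0, dd - dd_prev)           # only newly added drawdown is penalized
    dd_prev = dd
    print(f"W={W:.4f} vol={vol:.4f} dd={dd:.4f} dd_inc={dd_inc:.4f}")
```

Nothing in the update looks past the current step, so the penalty the agent sees at time t depends only on information available at time t.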

By the end of this cell, you have the most important object for learning: a function that takes the current environment state, the prior action, and the new action, and produces a scalar reward plus a transparent decomposition. Institutionally, this is the “incentive contract.” It ensures learning is constrained by the same economic realities that would apply in a real trading organization. It is also the bridge between Chapter 1 and Chapter 2: Chapter 1 created the surrogate objective; this cell makes it the live driving force of optimization.


###5.2.CODE AND IMPLEMENTATION

In [5]:
# CELL 5 — Surrogate Reward Engine (Decomposed) + Cost Model + Online Drawdown/Vol for RL Loop

def exec_cost_scalar(dw: float, sigma: float, cfg: RunConfig) -> float:
    # convex + spread, scaled by volatility
    a = cfg.cost.a_lin
    b = cfg.cost.b_quad
    c = cfg.cost.c_cubic
    spread = cfg.cost.spread_bps * 1e-4
    vol_scale = cfg.cost.vol_scale
    abs_dw = abs(dw)
    return float(
        spread * abs_dw
        + a * abs_dw * (1.0 + vol_scale * sigma)
        + b * (dw * dw) * (1.0 + vol_scale * sigma)
        + c * (abs_dw ** 3) * (1.0 + vol_scale * sigma)
    )

def step_reward_components(
    r_t: float,
    w_prev: float,
    w_new: float,
    sigma_t: float,
    dd_inc_t: float,
    gross_t: float,
    turnover_t: float,
    cfg: RunConfig
) -> Dict[str, float]:
    # realized portfolio return (single-asset weight in {-1,0,1})
    ret_comp = float(w_prev * r_t)

    # penalties
    vol_pen = float(cfg.reward.lambda_vol * abs(sigma_t))
    dd_pen  = float(cfg.reward.lambda_dd * dd_inc_t)

    # execution cost is based on trade from w_prev to w_new
    dw = float(w_new - w_prev)
    cst = exec_cost_scalar(dw, sigma_t, cfg)
    cost_pen = float(cfg.reward.lambda_cost * cst)

    turn_pen = float(cfg.reward.lambda_turn * turnover_t)
    lev_pen  = float(cfg.reward.lambda_lev * (gross_t ** 2))

    # governance penalty (hard discontinuity)
    breach = False
    if gross_t > cfg.gov.max_gross_leverage + 1e-12:
        breach = True
    if turnover_t > cfg.gov.max_turnover + 1e-12:
        breach = True
    # Drawdown breaches need the drawdown level dd_t, which is not passed to this function.
    # Callers apply that penalty themselves after this function returns.
    gov_pen = cfg.gov.gov_penalty if breach else 0.0

    total = ret_comp - vol_pen - dd_pen - cost_pen - turn_pen - lev_pen - gov_pen

    return {
        "return": ret_comp,
        "vol_pen": vol_pen,
        "dd_pen": dd_pen,
        "cost": cst,
        "cost_pen": cost_pen,
        "turn_pen": turn_pen,
        "lev_pen": lev_pen,
        "gov_pen": float(gov_pen),
        "total": float(total),
    }

print("Reward engine ready.")


Reward engine ready.


##6.TRAINING LOOP

###6.1.OVERVIEW

**Cell 6 — Training loop: learning dynamics, stability signals, and controlled exploration**  
This cell performs the core work of Chapter 2: it runs the learning algorithm over multiple episodes and updates the value tables that define the policy. The important idea is not merely that learning occurs, but that learning is instrumented. The loop tracks how the agent’s objective score changes across episodes, whether drawdowns become larger or smaller, whether turnover and costs drift upward or downward, and whether the learning process appears stable.  

Pedagogically, this cell teaches what optimization pressure means in practice. Each episode is a full run through the synthetic market. The agent chooses actions step by step, experiences the governed surrogate reward, and updates its values. Over time, the agent becomes more confident in some actions and less confident in others. This confidence is measurable. The cell records diagnostic signals that reveal whether the agent is collapsing into brittle behavior, such as a sharp decline in action diversity. It also records signals that indicate whether learning is internally consistent, such as the magnitude of update errors.  
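
As an illustration of the action-diversity signal, the sketch below computes the entropy of a softmax distribution over three action values, which is the same kind of quantity the training loop logs. The function name `softmax_entropy` and the example values are assumptions for illustration.

```python
import numpy as np

def softmax_entropy(q_values, temp=1.0):
    """Entropy (in nats) of the softmax distribution over action values."""
    q = np.asarray(q_values, dtype=float)
    z = np.exp((q - q.max()) / temp)   # subtract the max for numerical stability
    p = z / z.sum()
    return float(-np.sum(p * np.log(p + 1e-18)))

print(softmax_entropy([0.0, 0.0, 0.0]))    # indifferent between actions: close to ln(3)
print(softmax_entropy([0.0, 0.0, 25.0]))   # one action dominates: close to 0
```

With three actions the entropy is bounded above by ln(3), roughly 1.0986. Values near that bound mean the value table barely distinguishes the actions in that state, and values near zero mean it strongly prefers one.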

This cell is also where the relationship between exploration and exploitation becomes visible. Early on, the agent explores to learn which actions are valuable in which states. Later, it exploits what it has learned and reduces randomness. A production-grade notebook must log this schedule because different exploration settings can produce very different learned behaviors. Without logging, you cannot explain why results changed between runs.  
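
A geometric decay with a floor is one simple way to express such a schedule, and it matches the form used in the code cell below. The numeric settings here are hypothetical; the notebook reads its own values from `cfg.qlearn`.

```python
def decayed(x0: float, decay: float, floor: float, ep: int) -> float:
    """Geometric decay with a floor."""
    return max(floor, x0 * decay ** ep)

# Hypothetical settings, chosen only to show the shape of the schedule.
eps0, eps_decay, eps_min = 0.30, 0.97, 0.02
schedule = [decayed(eps0, eps_decay, eps_min, ep) for ep in range(140)]

# First episode at which exploration has reached its floor.
first_floor = next(ep for ep, e in enumerate(schedule) if e == eps_min)
print(schedule[0], schedule[-1], first_floor)
```

Logging the schedule alongside the results makes it possible to tell whether a change in learned behavior came from the objective or from a different amount of exploration.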

Institutionally, this cell treats the learning process as an auditable sequence of decisions. It does not hide the training dynamics. It produces episode-level summaries that later become artifacts. This matters because a learned policy can be unstable even if its final performance looks good. For example, it might improve sharply for a few episodes, then degrade due to overfitting to rare states. The training curve reveals that.  

By the end of the cell, you obtain trained value tables and a complete training history: the path of objective scores, drawdowns, turnover, cost intensity, and diagnostic signals through time. This is a core deliverable of Chapter 2. It demonstrates that the surrogate objective is “learnable” and, more importantly, it shows how the agent’s behavior emerges under optimization pressure. This is where proxy exploitation would begin to appear if the objective is vulnerable, because the optimizer will discover those vulnerabilities as it pushes the score upward.
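
The value update itself can be sketched in isolation. The following is a minimal tabular double Q-learning step with a clipped TD error, written with plain dictionaries; the state tuples, reward, and hyperparameters are hypothetical, and the helper names are not the notebook's.

```python
import random
from collections import defaultdict

ACTIONS = [-1, 0, 1]

def argmax_action(Q, s):
    # max() returns the first maximum, so ties resolve to the smallest action
    return max(ACTIONS, key=lambda a: Q[s][a])

def double_q_update(QA, QB, s, a, reward, s_next, alpha, gamma, td_clip):
    """Update QA: select the next action with QA, evaluate it with QB."""
    a_star = argmax_action(QA, s_next)
    td = reward + gamma * QB[s_next][a_star] - QA[s][a]
    td = max(-td_clip, min(td_clip, td))   # clip the TD error before applying it
    QA[s][a] += alpha * td
    return td

Q1 = defaultdict(lambda: defaultdict(float))
Q2 = defaultdict(lambda: defaultdict(float))
rng = random.Random(0)
s, s_next = (1, 0, 0, 0), (1, 0, 1, 0)     # hypothetical discretized states

for _ in range(10):
    # a coin flip decides which table is updated on this step
    if rng.random() < 0.5:
        double_q_update(Q1, Q2, s, 1, 0.5, s_next, alpha=0.1, gamma=0.95, td_clip=5.0)
    else:
        double_q_update(Q2, Q1, s, 1, 0.5, s_next, alpha=0.1, gamma=0.95, td_clip=5.0)

print(Q1[s][1], Q2[s][1])
```

Separating action selection from action evaluation is the standard motivation for double Q-learning: it reduces the upward bias that comes from taking a maximum over noisy estimates.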


###6.2.CODE AND IMPLEMENTATION

In [6]:
# CELL 6 — Training Loop (Double Q-Learning), Logged Decomposition, Bellman Residuals, Entropy

def train_qlearning(
    r: np.ndarray,
    reg: np.ndarray,
    sig: np.ndarray,
    cfg: RunConfig,
    rng: np.random.Generator,
    Q1: Dict[Tuple[int,int,int,int], Dict[int, float]],
    Q2: Dict[Tuple[int,int,int,int], Dict[int, float]],
) -> Dict[str, Any]:
    T = len(r)
    g = cfg.reward.gamma

    # logs (episode-level)
    ep_summary: List[Dict[str, float]] = []
    # for artifact export: limited snapshots
    q_snapshot_points = [int(cfg.qlearn.episodes * x) for x in (0.25, 0.50, 0.75, 1.00)]
    q_snapshots: List[Dict[str, Any]] = []

    for ep in range(cfg.qlearn.episodes):
        # schedules
        alpha = max(cfg.qlearn.alpha_min, cfg.qlearn.alpha0 * (cfg.qlearn.alpha_decay ** ep))
        eps   = max(cfg.qlearn.eps_min,   cfg.qlearn.eps0   * (cfg.qlearn.eps_decay   ** ep))

        # episode state
        W = 1.0
        peak = 1.0
        dd_prev = 0.0
        port_vol = 0.0
        lam_v = 0.94

        w_prev = 0  # previous action/weight
        total_reward = 0.0
        total_turnover = 0.0
        total_cost = 0.0
        max_dd = 0.0

        # proxy logs for diagnostics
        td_errors: List[float] = []
        entropies: List[float] = []
        visited_states = 0

        # Start at t=1 so the state can use the sign of the previous return r[t-1]
        for t in range(1, T):
            # update online portfolio vol estimator using realized portfolio return
            pr_t = float(w_prev * r[t])
            port_vol = math.sqrt(lam_v * (port_vol ** 2) + (1.0 - lam_v) * (pr_t ** 2))

            # update wealth and drawdown
            W = W * (1.0 + pr_t)
            peak = max(peak, W)
            dd_t = (peak - W) / max(1e-18, peak)
            max_dd = max(max_dd, dd_t)
            dd_inc = max(0.0, dd_t - dd_prev)
            dd_prev = dd_t

            # build current state
            s = State(
                ret_sign=sign_bucket(r[t-1], cfg.disc.ret_sign_deadband),
                regime=int(reg[t]),
                vol_bin=bucketize(float(port_vol), cfg.disc.vol_bins),
                dd_bin=bucketize(float(dd_t), cfg.disc.dd_bins)
            ).as_tuple()

            visited_states += 1  # counts time steps processed, not distinct states

            # entropy diagnostic (from combined Q if double)
            if cfg.qlearn.double_q:
                # combined estimate for entropy only, computed on the fly (no merged Q-table is built)
                vals = np.array([q_get(Q1, s, a) + q_get(Q2, s, a) for a in ACTIONS], dtype=float)
                m = float(np.max(vals))
                ex = np.exp(vals - m)
                p = ex / (float(np.sum(ex)) + 1e-18)
                ent = float(-np.sum(p * np.log(p + 1e-18)))
            else:
                probs = softmax_policy_probs(Q1, s, temp=1.0)
                p = np.array([probs[a] for a in ACTIONS], dtype=float)
                ent = float(-np.sum(p * np.log(p + 1e-18)))
            entropies.append(ent)

            # choose action
            if cfg.qlearn.double_q:
                # epsilon greedy on combined Q
                if float(rng.random()) < eps:
                    a = int(rng.choice(ACTIONS))
                else:
                    # argmax on Q1+Q2
                    best_a = ACTIONS[0]
                    best_v = q_get(Q1, s, best_a) + q_get(Q2, s, best_a)
                    for aa in ACTIONS[1:]:
                        vv = q_get(Q1, s, aa) + q_get(Q2, s, aa)
                        if vv > best_v + 1e-15 or (abs(vv-best_v) <= 1e-15 and aa < best_a):
                            best_a, best_v = aa, vv
                    a = int(best_a)
            else:
                a = epsilon_greedy(Q1, s, eps, rng)

            # transition
            w_new = a
            turnover_t = abs(float(w_new - w_prev))
            gross_t = abs(float(w_new))

            # reward decomposition at time t depends on action change and current risk state
            comps = step_reward_components(
                r_t=float(r[t]),
                w_prev=float(w_prev),
                w_new=float(w_new),
                sigma_t=float(port_vol),
                dd_inc_t=float(dd_inc),
                gross_t=float(gross_t),
                turnover_t=float(turnover_t),
                cfg=cfg
            )

            # add drawdown governance breach explicitly (institutional: hard)
            if dd_t > cfg.gov.max_drawdown + 1e-12:
                comps["gov_pen"] = float(comps["gov_pen"] + cfg.gov.gov_penalty)
                comps["total"] = float(comps["total"] - cfg.gov.gov_penalty)

            reward = comps["total"]

            # next state s'
            # compute next state features using current updated W/dd and next return sign proxy; for terminal, keep last
            if t < T - 1:
                pr_next = float(w_new * r[t+1])
                port_vol_next = math.sqrt(lam_v * (port_vol ** 2) + (1.0 - lam_v) * (pr_next ** 2))
                # next wealth / dd for bucket only (do not mutate W/peak for learning target)
                W_next = W * (1.0 + pr_next)
                peak_next = max(peak, W_next)
                dd_next = (peak_next - W_next) / max(1e-18, peak_next)
                sp = State(
                    ret_sign=sign_bucket(r[t], cfg.disc.ret_sign_deadband),
                    regime=int(reg[t+1]),
                    vol_bin=bucketize(float(port_vol_next), cfg.disc.vol_bins),
                    dd_bin=bucketize(float(dd_next), cfg.disc.dd_bins)
                ).as_tuple()
            else:
                sp = s

            # TD update
            if cfg.qlearn.double_q:
                # randomly update Q1 or Q2 (double Q-learning)
                if float(rng.random()) < 0.5:
                    # target uses Q2 evaluated at argmax of Q1
                    a_star, _ = q_argmax(Q1, sp)
                    target = reward + g * q_get(Q2, sp, a_star)
                    current = q_get(Q1, s, a)
                    td = target - current
                    td = float(np.clip(td, -cfg.qlearn.td_clip, cfg.qlearn.td_clip))
                    q_set(Q1, s, a, current + alpha * td)
                else:
                    a_star, _ = q_argmax(Q2, sp)
                    target = reward + g * q_get(Q1, sp, a_star)
                    current = q_get(Q2, s, a)
                    td = target - current
                    td = float(np.clip(td, -cfg.qlearn.td_clip, cfg.qlearn.td_clip))
                    q_set(Q2, s, a, current + alpha * td)
            else:
                _, best_next = q_argmax(Q1, sp)
                target = reward + g * best_next
                current = q_get(Q1, s, a)
                td = target - current
                td = float(np.clip(td, -cfg.qlearn.td_clip, cfg.qlearn.td_clip))
                q_set(Q1, s, a, current + alpha * td)

            td_errors.append(abs(td))

            # accumulate episode stats
            total_reward += float(reward)
            total_turnover += float(turnover_t)
            total_cost += float(comps["cost"])

            # advance
            w_prev = w_new

        ep_summary.append({
            "episode": float(ep),
            "alpha": float(alpha),
            "epsilon": float(eps),
            "total_reward": float(total_reward),
            "max_drawdown": float(max_dd),
            "mean_turnover": float(total_turnover / max(1, (T-1))),
            "mean_cost": float(total_cost / max(1, (T-1))),
            "mean_td_abs": float(np.mean(td_errors) if td_errors else 0.0),
            "mean_entropy": float(np.mean(entropies) if entropies else 0.0),
            "visited_states": float(visited_states),
        })

        if (ep + 1) in q_snapshot_points:
            # snapshot only visited states to keep size reasonable
            def pack_q(Q: Dict[Tuple[int,int,int,int], Dict[int, float]]) -> Dict[str, Dict[str, float]]:
                out: Dict[str, Dict[str, float]] = {}
                for s, d in Q.items():
                    if d:
                        out[str(s)] = {str(a): float(v) for a, v in d.items()}
                return out
            snap = {
                "episode": ep + 1,
                "alpha": float(alpha),
                "epsilon": float(eps),
                "Q1_states": len(Q1),
                "Q2_states": len(Q2) if cfg.qlearn.double_q else 0,
                "Q1": pack_q(Q1),
            }
            if cfg.qlearn.double_q:
                snap["Q2"] = pack_q(Q2)
            q_snapshots.append(snap)

    return {"ep_summary": ep_summary, "q_snapshots": q_snapshots}

train_out = train_qlearning(r_tr, reg_tr, sig_tr, cfg, _rng, Q1, Q2)

# quick training headline
last = train_out["ep_summary"][-1]
print({
    "episodes": cfg.qlearn.episodes,
    "final_total_reward": last["total_reward"],
    "final_max_drawdown": last["max_drawdown"],
    "final_mean_td_abs": last["mean_td_abs"],
    "final_mean_entropy": last["mean_entropy"],
    "Q1_states": len(Q1),
    "Q2_states": len(Q2) if cfg.qlearn.double_q else 0
})


{'episodes': 140, 'final_total_reward': -10159.175486074811, 'final_max_drawdown': 0.5370022304728858, 'final_mean_td_abs': 1.2589171420710958, 'final_mean_entropy': 0.6872679035742603, 'Q1_states': 112, 'Q2_states': 113}


##7.POLICY EVALUATION

###7.1.OVERVIEW

**Cell 7 — Policy evaluation: train versus test performance with decomposed accounting**  
This cell evaluates what was learned. It runs the learned policy greedily, without exploration noise, and measures outcomes under the same decomposed accounting used for the reward. It also compares the learned policy to a baseline, here a policy that picks actions uniformly at random, to check that learning created structure rather than noise. The evaluation is run on both the training environment and the separate test environment. That train-test split is the simplest institutional check against overfitting.  

Pedagogically, this cell teaches that policy quality is not one number. The notebook collects a decomposed summary: how much return was earned, how much was lost to costs, how much was lost to drawdown penalties, and whether governance penalties were triggered. This decomposition is the key interpretability tool. It tells you whether the policy is “winning” by sensible behavior or by a proxy loophole. For example, if the policy’s score is high but costs are extreme, that indicates churn. If the score is high but governance penalties are frequent, that indicates boundary violation. If the score collapses on test, that indicates instability under path variation.  

Institutionally, this cell creates decision-grade evidence. It produces comparable summaries for learned and baseline policies across train and test. That allows you to compute meaningful gaps: a generalization gap between train and test for the learned policy, and a baseline gap showing whether the learned policy is meaningfully better than naive behavior on test. These gaps are not “performance bragging.” They are stability diagnostics. In a governed setting, you would rather see modest gains that generalize than dramatic gains that disappear out of sample.  
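
The two gaps are simple differences, sketched here with the utilities this run prints in its Cell 7 output. The helper name `gap_report` is hypothetical, and the random-baseline utility is not printed by the run, so it is backed out from the reported baseline gap rather than observed directly.

```python
def gap_report(u_train: float, u_test: float, u_test_baseline: float) -> dict:
    return {
        # positive: the learned policy scores worse out of sample than in sample
        "generalization_gap_U": u_train - u_test,
        # positive: the learned policy beats the baseline on test
        "baseline_gap_test_U": u_test - u_test_baseline,
    }

u_train_greedy = -3318.5252527175644     # from the printed run output
u_test_greedy = -13493.04662737944       # from the printed run output
u_test_random = u_test_greedy - 5125.111357854199   # inferred from the reported gap

report = gap_report(u_train_greedy, u_test_greedy, u_test_random)
print({k: round(v, 2) for k, v in report.items()})
```

Read together, the two numbers say that the learned policy beats random behavior on test, yet its score deteriorates by roughly twice that margin when it moves from train to test. That pattern is the kind of instability this comparison is meant to surface.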

By the end of the cell, you obtain evaluation objects that include the final utility score, ending wealth, maximum drawdown, turnover intensity, and decomposed totals for each reward component. This becomes one of the most important artifacts: it is the clearest snapshot of what the agent does when deployed in the same synthetic world but without training noise. It also provides the first serious answer to the Chapter 2 question: does the surrogate objective, when optimized, produce behavior that is coherent, feasible, and stable? If the answer is no, this cell will show it in the decomposition and the generalization gap.


###7.2.CODE AND IMPLEMENTATION

In [7]:
# CELL 7 — Policy Evaluation (Train vs Test) + Baseline Comparison + Reward Decomposition Export

def greedy_action(Q1, Q2, s, double_q: bool) -> int:
    if not double_q:
        a, _ = q_argmax(Q1, s)
        return int(a)
    best_a = ACTIONS[0]
    best_v = q_get(Q1, s, best_a) + q_get(Q2, s, best_a)
    for aa in ACTIONS[1:]:
        vv = q_get(Q1, s, aa) + q_get(Q2, s, aa)
        if vv > best_v + 1e-15 or (abs(vv-best_v) <= 1e-15 and aa < best_a):
            best_a, best_v = aa, vv
    return int(best_a)

def run_policy_episode(
    r: np.ndarray, reg: np.ndarray,
    cfg: RunConfig,
    Q1, Q2,
    rng: np.random.Generator,
    policy: str = "greedy"
) -> Dict[str, Any]:
    T = len(r)
    g = cfg.reward.gamma

    W = 1.0
    peak = 1.0
    dd_prev = 0.0
    port_vol = 0.0
    lam_v = 0.94

    w_prev = 0
    disc = 1.0
    U = 0.0

    # time-series logs (kept modest; T is 2200)
    comps_series = {
        "return": np.zeros(T, dtype=float),
        "vol_pen": np.zeros(T, dtype=float),
        "dd_pen": np.zeros(T, dtype=float),
        "cost_pen": np.zeros(T, dtype=float),
        "turn_pen": np.zeros(T, dtype=float),
        "lev_pen": np.zeros(T, dtype=float),
        "gov_pen": np.zeros(T, dtype=float),
        "total": np.zeros(T, dtype=float),
        "w": np.zeros(T, dtype=float),
        "dd": np.zeros(T, dtype=float),
        "vol": np.zeros(T, dtype=float),
    }

    for t in range(1, T):
        pr_t = float(w_prev * r[t])
        port_vol = math.sqrt(lam_v * (port_vol ** 2) + (1.0 - lam_v) * (pr_t ** 2))

        W = W * (1.0 + pr_t)
        peak = max(peak, W)
        dd_t = (peak - W) / max(1e-18, peak)
        dd_inc = max(0.0, dd_t - dd_prev)
        dd_prev = dd_t

        s = State(
            ret_sign=sign_bucket(r[t-1], cfg.disc.ret_sign_deadband),
            regime=int(reg[t]),
            vol_bin=bucketize(float(port_vol), cfg.disc.vol_bins),
            dd_bin=bucketize(float(dd_t), cfg.disc.dd_bins)
        ).as_tuple()

        if policy == "greedy":
            a = greedy_action(Q1, Q2, s, cfg.qlearn.double_q)
        elif policy == "random":
            a = int(rng.choice(ACTIONS))
        else:
            raise ValueError("policy must be 'greedy' or 'random'.")

        w_new = a
        turnover_t = abs(float(w_new - w_prev))
        gross_t = abs(float(w_new))

        comps = step_reward_components(
            r_t=float(r[t]),
            w_prev=float(w_prev),
            w_new=float(w_new),
            sigma_t=float(port_vol),
            dd_inc_t=float(dd_inc),
            gross_t=float(gross_t),
            turnover_t=float(turnover_t),
            cfg=cfg
        )
        if dd_t > cfg.gov.max_drawdown + 1e-12:
            comps["gov_pen"] = float(comps["gov_pen"] + cfg.gov.gov_penalty)
            comps["total"] = float(comps["total"] - cfg.gov.gov_penalty)

        # discounted utility
        U += disc * comps["total"]
        disc *= g

        # store
        comps_series["return"][t] = comps["return"]
        comps_series["vol_pen"][t] = comps["vol_pen"]
        comps_series["dd_pen"][t] = comps["dd_pen"]
        comps_series["cost_pen"][t] = comps["cost_pen"]
        comps_series["turn_pen"][t] = comps["turn_pen"]
        comps_series["lev_pen"][t] = comps["lev_pen"]
        comps_series["gov_pen"][t] = comps["gov_pen"]
        comps_series["total"][t] = comps["total"]
        comps_series["w"][t] = float(w_new)
        comps_series["dd"][t] = float(dd_t)
        comps_series["vol"][t] = float(port_vol)

        w_prev = w_new

    return {
        "U": float(U),
        "wealth_end": float(W),
        "max_dd": float(np.max(comps_series["dd"])),
        "mean_turnover": float(np.mean(np.abs(np.diff(comps_series["w"])))),
        "series": comps_series,
    }

# Evaluate: trained greedy vs random baseline on TRAIN and TEST
eval_train_g = run_policy_episode(r_tr, reg_tr, cfg, Q1, Q2, _rng, policy="greedy")
eval_train_r = run_policy_episode(r_tr, reg_tr, cfg, Q1, Q2, _rng, policy="random")
eval_test_g  = run_policy_episode(r_te, reg_te, cfg, Q1, Q2, _rng, policy="greedy")
eval_test_r  = run_policy_episode(r_te, reg_te, cfg, Q1, Q2, _rng, policy="random")

# Component summaries (institutional: decomposed)
def summarize_component(series: Dict[str, np.ndarray]) -> Dict[str, float]:
    out = {}
    for k in ["return","vol_pen","dd_pen","cost_pen","turn_pen","lev_pen","gov_pen","total"]:
        x = series[k][1:]
        out[k] = float(np.sum(x))
    out["max_dd"] = float(np.max(series["dd"]))
    out["mean_vol"] = float(np.mean(series["vol"][1:]))
    out["mean_turnover"] = float(np.mean(np.abs(np.diff(series["w"]))))
    return out

train_g_sum = summarize_component(eval_train_g["series"])
train_r_sum = summarize_component(eval_train_r["series"])
test_g_sum  = summarize_component(eval_test_g["series"])
test_r_sum  = summarize_component(eval_test_r["series"])

reward_components = {
    "train": {"greedy": train_g_sum, "random": train_r_sum},
    "test":  {"greedy": test_g_sum,  "random": test_r_sum},
}

print(json.dumps({
    "train_greedy": {"U": eval_train_g["U"], "wealth_end": eval_train_g["wealth_end"], "max_dd": eval_train_g["max_dd"]},
    "test_greedy":  {"U": eval_test_g["U"],  "wealth_end": eval_test_g["wealth_end"],  "max_dd": eval_test_g["max_dd"]},
    "generalization_gap_U": float(eval_train_g["U"] - eval_test_g["U"]),
    "baseline_gap_test_U": float(eval_test_g["U"] - eval_test_r["U"])
}, indent=2, sort_keys=True))


{
  "baseline_gap_test_U": 5125.111357854199,
  "generalization_gap_U": 10174.521374661876,
  "test_greedy": {
    "U": -13493.04662737944,
    "max_dd": 0.9622922031235737,
    "wealth_end": 0.050802823641266086
  },
  "train_greedy": {
    "U": -3318.5252527175644,
    "max_dd": 0.808145023305782,
    "wealth_end": 0.6001112625020416
  }
}


##8.PROXY APPROXIMATION EVALUATION

###8.1.OVERVIEW

**Cell 8 — Proxy amplification diagnostics: exploitation detection and regime fragility**  
This cell is the institutional heart of Chapter 2. It does not merely report that the agent achieved a certain score. It asks whether the process of optimization amplified proxy weaknesses. It computes diagnostics designed specifically to detect reward hacking and fragility, using signals that are understandable to finance practitioners and risk reviewers.  

Pedagogically, this cell teaches that proxy exploitation is not a theoretical concern; it is observable. The notebook measures how concentrated the policy becomes, whether it collapses into extreme confidence, and whether the learned value structure is consistent with the reward dynamics. It also measures regime-conditional fragility: in calm conditions the agent may behave reasonably, but under stress conditions the penalty terms may dominate returns and behavior may become unstable. By separating diagnostics by regime, the notebook helps you see whether the agent is robust or whether it is implicitly relying on favorable regimes.  
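
One way to read "consistent with the reward dynamics" is through the Bellman residual, the difference between a stored value and its one-step target. The toy sketch below uses a single state with a single self-looping action, where the exact value is known. It also shows, as an observation under that toy assumption rather than a claim about the notebook's results, that combining two tables by summing them instead of averaging leaves a residual equal to the reward even when both tables are exact.

```python
def bellman_residual(q_sa: float, reward: float, gamma: float, q_next_best: float) -> float:
    """Stored value minus its one-step bootstrapped target."""
    return q_sa - (reward + gamma * q_next_best)

gamma, reward = 0.9, 1.0
q_star = reward / (1.0 - gamma)   # fixed point of a one-state, one-action loop
q1 = q2 = q_star                  # two tables that both hold the exact value

avg = 0.5 * (q1 + q2)
res_avg = bellman_residual(avg, reward, gamma, avg)
res_sum = bellman_residual(q1 + q2, reward, gamma, q1 + q2)

print(res_avg)   # approximately 0: the averaged estimate is self-consistent
print(res_sum)   # approximately equal to the reward: the summed estimate is not
```

When interpreting a residual computed on a combined double-Q estimate, it is worth checking which combination was used before comparing it to a single-table residual.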

Institutionally, this cell provides evidence for governance discussions. It computes signals that can be presented to a committee: how stable the learned value estimates appear, whether the policy’s internal consistency is acceptable, and whether the objective behaves differently under stress. It also produces an explicit “proxy exploitation” indicator: the increase in mean regime-1 fragility from train to test, floored at zero, so it is positive only when fragility rises out of sample. The exact label is less important than the concept: you need a scalar alert that says, “this is likely exploiting the proxy.”  
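
The fragility signal is a ratio of penalties to the absolute return component, and ratios of this kind are sensitive to near-zero denominators. The sketch below contrasts the per-step mean used in the diagnostics cell with an aggregate ratio. The aggregate form is an alternative offered here as an assumption, not the notebook's metric, and the step values are hypothetical.

```python
import numpy as np

def per_step_ratio(penalties, returns, eps=1e-12):
    """Mean of per-step penalty / |return|, the form used in the diagnostics cell."""
    penalties = np.asarray(penalties, dtype=float)
    returns = np.asarray(returns, dtype=float)
    return float(np.mean(penalties / (np.abs(returns) + eps)))

def aggregate_ratio(penalties, returns, eps=1e-12):
    """Alternative (an assumption): total penalty over total |return|."""
    return float(np.sum(penalties) / (np.sum(np.abs(returns)) + eps))

# Hypothetical steps. The second step is flat (previous weight 0),
# so its return component is exactly 0 while its penalties are not.
penalties = [0.002, 0.001, 0.003]
returns = [0.010, 0.000, -0.015]

print(per_step_ratio(penalties, returns))    # dominated by the flat step
print(aggregate_ratio(penalties, returns))   # stays on the scale of the other steps
```

Because the return component is the previous weight times the return, every step on which the policy was flat contributes a ratio on the order of the penalty divided by 1e-12. That likely explains why the fragility means and the exploitation score in the printed output are of order 1e9 to 1e11, so their magnitude should be read as a count of flat steps as much as a measure of penalty dominance.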

This cell also introduces the idea of testing the learned policy in a way that challenges its assumptions. Instead of only looking at end performance, you analyze the internal coherence of the learned decision system and its sensitivity to regime conditions. That is professional practice. In production settings, you must know whether a policy becomes cost-dominated when liquidity worsens, whether drawdown penalties begin to dominate in stress, and whether learning created behavior that would be unacceptable during crises.  

By the end of the cell, you obtain a diagnostics object that summarizes exploitation risk and fragility across train and test. This is a critical deliverable because it tells you whether you can responsibly move forward. If the diagnostics show that optimization found a loophole, Chapter 2 has succeeded in its true purpose: it has exposed a problem before you scaled the system. If the diagnostics look stable, Chapter 2 provides evidence that the surrogate objective is relatively well aligned under optimization pressure.


###8.2.CODE AND IMPLEMENTATION

In [8]:
# CELL 8 — Proxy Amplification Diagnostics (Leverage/Turnover Drift, Entropy Collapse, Regime Fragility, Bellman Residual)

def bellman_residual_sample(
    r: np.ndarray, reg: np.ndarray,
    cfg: RunConfig,
    Q1, Q2,
    rng: np.random.Generator,
    n_samples: int = 8000
) -> Dict[str, float]:
    T = len(r)
    g = cfg.reward.gamma
    lam_v = 0.94

    # roll the greedy policy forward to produce plausible states and vol/dd levels (capped at n_samples steps)
    W = 1.0
    peak = 1.0
    dd_prev = 0.0
    port_vol = 0.0
    w_prev = 0

    residuals: List[float] = []
    entropies: List[float] = []
    regime_counts = {0: 0, 1: 0}
    frag_by_reg = {0: [], 1: []}

    for t in range(2, T - 2):
        pr_t = float(w_prev * r[t])
        port_vol = math.sqrt(lam_v * (port_vol ** 2) + (1.0 - lam_v) * (pr_t ** 2))

        W = W * (1.0 + pr_t)
        peak = max(peak, W)
        dd_t = (peak - W) / max(1e-18, peak)
        dd_inc = max(0.0, dd_t - dd_prev)
        dd_prev = dd_t

        s = State(
            ret_sign=sign_bucket(r[t-1], cfg.disc.ret_sign_deadband),
            regime=int(reg[t]),
            vol_bin=bucketize(float(port_vol), cfg.disc.vol_bins),
            dd_bin=bucketize(float(dd_t), cfg.disc.dd_bins)
        ).as_tuple()

        # entropy from combined Q
        vals = np.array([q_get(Q1, s, a) + (q_get(Q2, s, a) if cfg.qlearn.double_q else 0.0) for a in ACTIONS], dtype=float)
        m = float(np.max(vals))
        ex = np.exp(vals - m)
        p = ex / (float(np.sum(ex)) + 1e-18)
        ent = float(-np.sum(p * np.log(p + 1e-18)))
        entropies.append(ent)

        # pick greedy a
        a = greedy_action(Q1, Q2, s, cfg.qlearn.double_q)
        w_new = a
        turnover_t = abs(float(w_new - w_prev))
        gross_t = abs(float(w_new))
        comps = step_reward_components(float(r[t]), float(w_prev), float(w_new), float(port_vol), float(dd_inc), float(gross_t), float(turnover_t), cfg)
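        # Drawdown breach: only "total" is adjusted here. Cells 6 and 7 also add the penalty to
        # "gov_pen" (with a 1e-12 tolerance), so the fragility ratio below excludes drawdown breaches.
        if dd_t > cfg.gov.max_drawdown:
            comps["total"] = float(comps["total"] - cfg.gov.gov_penalty)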

        # build s'
        pr_next = float(w_new * r[t+1])
        port_vol_next = math.sqrt(lam_v * (port_vol ** 2) + (1.0 - lam_v) * (pr_next ** 2))
        W_next = W * (1.0 + pr_next)
        peak_next = max(peak, W_next)
        dd_next = (peak_next - W_next) / max(1e-18, peak_next)

        sp = State(
            ret_sign=sign_bucket(r[t], cfg.disc.ret_sign_deadband),
            regime=int(reg[t+1]),
            vol_bin=bucketize(float(port_vol_next), cfg.disc.vol_bins),
            dd_bin=bucketize(float(dd_next), cfg.disc.dd_bins)
        ).as_tuple()

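        # compute Bellman residual for combined Q (policy evaluation style)
        # Note: in double-Q mode the combined value is the sum Q1 + Q2, not the average,
        # so the value terms are on twice the scale of a single table while the reward is not.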
        if cfg.qlearn.double_q:
            q_sa = q_get(Q1, s, a) + q_get(Q2, s, a)
            ap = greedy_action(Q1, Q2, sp, True)
            q_sp = q_get(Q1, sp, ap) + q_get(Q2, sp, ap)
        else:
            q_sa = q_get(Q1, s, a)
            ap, q_sp = q_argmax(Q1, sp)

        res = (q_sa - (comps["total"] + g * q_sp))
        residuals.append(float(res))

        # regime fragility proxy: penalty dominance ratio under stress
        rg = int(reg[t])
        regime_counts[rg] += 1
        pen = comps["vol_pen"] + comps["dd_pen"] + comps["cost_pen"] + comps["turn_pen"] + comps["lev_pen"] + comps["gov_pen"]
        denom = abs(comps["return"]) + 1e-12  # return is 0 whenever w_prev == 0, so flat steps yield very large ratios
        frag_by_reg[rg].append(float(pen / denom))

        w_prev = w_new
        if len(residuals) >= n_samples:
            break

    residuals = np.array(residuals, dtype=float)
    entropies = np.array(entropies, dtype=float)

    out = {
        "bellman_residual_mean": float(np.mean(residuals)),
        "bellman_residual_std": float(np.std(residuals)),
        "bellman_residual_rmse": float(math.sqrt(float(np.mean(residuals * residuals)))),
        "entropy_mean": float(np.mean(entropies)),
        "entropy_p05": float(np.quantile(entropies, 0.05)),
        "entropy_p50": float(np.quantile(entropies, 0.50)),
        "entropy_p95": float(np.quantile(entropies, 0.95)),
        "fragility_regime0_mean": float(np.mean(frag_by_reg[0])) if frag_by_reg[0] else 0.0,
        "fragility_regime1_mean": float(np.mean(frag_by_reg[1])) if frag_by_reg[1] else 0.0,
        "regime_counts": regime_counts,
    }
    return out

diag_train = bellman_residual_sample(r_tr, reg_tr, cfg, Q1, Q2, _rng, n_samples=8000)
diag_test  = bellman_residual_sample(r_te, reg_te, cfg, Q1, Q2, _rng, n_samples=8000)

amplification_metrics = {
    "train": diag_train,
    "test": diag_test,
    "generalization_gap_U": float(eval_train_g["U"] - eval_test_g["U"]),
    "generalization_gap_max_dd": float(eval_train_g["max_dd"] - eval_test_g["max_dd"]),
    "policy_turnover_train": float(eval_train_g["mean_turnover"]),
    "policy_turnover_test": float(eval_test_g["mean_turnover"]),
    "proxy_exploitation_score": float(max(0.0, (diag_test["fragility_regime1_mean"] - diag_train["fragility_regime1_mean"]))),
}

print(json.dumps(amplification_metrics, indent=2, sort_keys=True))


{
  "generalization_gap_U": 10174.521374661876,
  "generalization_gap_max_dd": -0.1541471798177917,
  "policy_turnover_test": 0.516143701682583,
  "policy_turnover_train": 0.29149613460663937,
  "proxy_exploitation_score": 8105092999.554581,
  "test": {
    "bellman_residual_mean": 22.05625173626453,
    "bellman_residual_rmse": 504.5856499400123,
    "bellman_residual_std": 504.10336230254524,
    "entropy_mean": 0.9653304308454388,
    "entropy_p05": 0.06563476967428153,
    "entropy_p50": 1.0896243120492473,
    "entropy_p95": 1.0986108933979684,
    "fragility_regime0_mean": 23326022350.761307,
    "fragility_regime1_mean": 111011076835.45274,
    "regime_counts": {
      "0": 1547,
      "1": 649
    }
  },
  "train": {
    "bellman_residual_mean": 9.631796790196635,
    "bellman_residual_rmse": 374.1443269801867,
    "bellman_residual_std": 374.0203281936013,
    "entropy_mean": 0.7371135462157979,
    "entropy_p05": 3.727880267566927e-07,
    "entropy_p50": 1.0293703666416583,
 

##9.GOVERNANCE ARTIFACTS

###9.1.OVERVIEW

**Cell 9 — Artifact export: policy, training history, evaluations, and diagnostic bundle**  
This cell turns the notebook into a production-grade research run by exporting all key results as auditable artifacts. It is not enough to print outcomes to the screen. Institutional work requires persistent records that can be reviewed by others, compared across versions, and used as evidence in governance decisions. This cell writes out the configuration identity, training curves, learned value tables, evaluation decompositions, fragility diagnostics, and risk logs.  

Pedagogically, this cell teaches a discipline: every claim should have a file behind it. If you say the policy improved, the training curve should be saved. If you say it generalizes, the train and test decompositions should be saved. If you say it does not exploit the proxy, the amplification diagnostics should be saved. This avoids narrative drift and makes disagreement productive, because reviewers can point to artifacts rather than to interpretations.  

The cell also addresses a practical concern: learned tables can become large. It therefore exports them in a bounded and deterministic manner, ranking states by their largest absolute Q-value and keeping at most a fixed number of them. This is an institutional compromise between completeness and operational usability. The goal is not to dump unlimited data; the goal is to export enough to audit behavior and reproduce conclusions.  
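
The bounded export described above can be made independent of dictionary insertion order by adding an explicit tie-break. The following is a minimal sketch, assuming Q-tables shaped as `state -> {action: value}`; the helper name `top_states` and the `str(state)` tie-break are illustrative and not taken from the notebook.

```python
from typing import Dict, Hashable, List, Tuple

QTable = Dict[Hashable, Dict[int, float]]

def top_states(Q: QTable, max_states: int) -> List[Tuple[Hashable, Dict[int, float]]]:
    """Keep the max_states states with the largest max |Q|.

    Ties are broken by the string form of the state, so the result
    does not depend on the order in which states were inserted.
    """
    def sort_key(item):
        state, actions = item
        score = max((abs(v) for v in actions.values()), default=0.0)
        return (-score, str(state))  # descending score, then ascending label
    return sorted(Q.items(), key=sort_key)[:max_states]

# Hypothetical toy table: two states tie on max |Q| = 2.0
Q_demo = {(1, 0): {0: 2.0}, (0, 1): {0: 0.5, 1: -2.0}, (1, 1): {0: 0.1}}
kept = top_states(Q_demo, max_states=2)
kept_states = [state for state, _ in kept]
```

With this key, two runs that visit the same states in a different order still export the same truncated table.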

Institutionally, this cell is what enables regression testing. Once artifacts exist, you can run the notebook after code changes and compare whether the learned policy became more fragile, whether generalization worsened, or whether governance breaches increased. That is how production research is maintained over time.  
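
A regression check over exported artifacts can be as small as a comparison of two metric dictionaries. This is a sketch under stated assumptions: the function `regression_flags`, the tolerance values, and the numbers in the example are hypothetical, and only the metric names come from the notebook.

```python
from typing import Dict, List

def regression_flags(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerances: Dict[str, float],
) -> List[str]:
    """Return the metrics whose increase over the baseline exceeds its tolerance."""
    flags = []
    for name, tol in tolerances.items():
        if candidate[name] - baseline[name] > tol:
            flags.append(name)
    return flags

# Illustrative values only; in practice these would be loaded from two
# versions of amplification_metrics.json.
baseline = {"generalization_gap_U": 100.0, "bellman_residual_rmse": 10.0}
candidate = {"generalization_gap_U": 180.0, "bellman_residual_rmse": 10.5}
tolerances = {"generalization_gap_U": 50.0, "bellman_residual_rmse": 1.0}

flags = regression_flags(baseline, candidate, tolerances)
```

Only increases are flagged here because, for these metrics, a larger value is the worse outcome; a metric where lower is worse would need the opposite sign.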

By the end of the cell, you have a complete artifact directory that includes the agent representation, its training trajectory, its evaluation outcomes, and the diagnostics that interpret those outcomes. This directory is the deliverable you would hand to a reviewer. It is also the evidence base you will use in Chapter 3. Chapter 3 will introduce richer notions of constrained control and institutional stage gates; those rely on artifacts as inputs. This cell therefore completes the governance-first cycle: training is not a hidden process; it is a logged, reviewable, reproducible procedure with documented outputs.


###9.2.CODE AND IMPLEMENTATION

In [9]:
# CELL 9 — Governance Artifacts Export (Q tables, reward components, fragility diagnostics, training curves)

def pack_qtables(Q1, Q2, double_q: bool, max_states: int = 20000) -> Dict[str, Any]:
    # deterministic truncation: keep states with largest |Q| magnitude (reviewable, bounded size)
    def state_score(d: Dict[int, float]) -> float:
        if not d:
            return 0.0
        return float(max(abs(v) for v in d.values()))
    items = list(Q1.items())
    items.sort(key=lambda kv: state_score(kv[1]), reverse=True)
    items = items[:max_states]
    outQ1 = {str(s): {str(a): float(v) for a, v in d.items()} for s, d in items if d}
    out = {"Q1": outQ1, "Q1_states_kept": len(outQ1), "Q1_total_states": len(Q1)}
    if double_q:
        items2 = list(Q2.items())
        items2.sort(key=lambda kv: state_score(kv[1]), reverse=True)
        items2 = items2[:max_states]
        outQ2 = {str(s): {str(a): float(v) for a, v in d.items()} for s, d in items2 if d}
        out.update({"Q2": outQ2, "Q2_states_kept": len(outQ2), "Q2_total_states": len(Q2)})
    return out

qtables_payload = pack_qtables(Q1, Q2, cfg.qlearn.double_q, max_states=20000)

# Training curve summary
train_curve = train_out["ep_summary"]

# Regime fragility object (explicit)
regime_fragility = {
    "train_fragility_regime0_mean": amplification_metrics["train"]["fragility_regime0_mean"],
    "train_fragility_regime1_mean": amplification_metrics["train"]["fragility_regime1_mean"],
    "test_fragility_regime0_mean": amplification_metrics["test"]["fragility_regime0_mean"],
    "test_fragility_regime1_mean": amplification_metrics["test"]["fragility_regime1_mean"],
    "interpretation_hint": "Higher means penalties dominate returns; large train-test divergence indicates proxy non-robustness.",
}

# Reward components export (summaries only; time-series can be huge but still manageable if needed)
write_json(os.path.join(artifact_root, "reward_components.json"), reward_components)
write_json(os.path.join(artifact_root, "q_table_snapshot.json"), {
    "qtables": qtables_payload,
    "snapshots": train_out["q_snapshots"]  # contains selective snapshots
})
write_json(os.path.join(artifact_root, "amplification_metrics.json"), amplification_metrics)
write_json(os.path.join(artifact_root, "regime_fragility.json"), regime_fragility)
write_json(os.path.join(artifact_root, "training_curve.json"), train_curve)

risk_log = {
    "train_greedy": {"U": eval_train_g["U"], "wealth_end": eval_train_g["wealth_end"], "max_dd": eval_train_g["max_dd"]},
    "test_greedy":  {"U": eval_test_g["U"],  "wealth_end": eval_test_g["wealth_end"],  "max_dd": eval_test_g["max_dd"]},
    "train_random": {"U": eval_train_r["U"], "wealth_end": eval_train_r["wealth_end"], "max_dd": eval_train_r["max_dd"]},
    "test_random":  {"U": eval_test_r["U"],  "wealth_end": eval_test_r["wealth_end"],  "max_dd": eval_test_r["max_dd"]},
    "governance": asdict(cfg.gov),
}
write_json(os.path.join(artifact_root, "risk_log.json"), risk_log)

print({
    "artifact_root": artifact_root,
    "files": sorted(os.listdir(artifact_root)),
    "Q1_states": len(Q1),
    "Q2_states": len(Q2) if cfg.qlearn.double_q else 0
})


{'artifact_root': '/mnt/data/artifacts_ch2', 'files': ['amplification_metrics.json', 'config_hash.txt', 'q_table_snapshot.json', 'regime_fragility.json', 'reward_components.json', 'risk_log.json', 'run_manifest.json', 'training_curve.json'], 'Q1_states': 112, 'Q2_states': 113}


##10.AUDIT BUNDLE

###10.1.OVERVIEW


**Cell 10 — Institutional stability report and overfitting challenge tests**  
This final cell produces a decision-grade summary and performs one additional robustness test designed to detect a subtle form of overfitting: reliance on a particular labeling or state signal. In a controlled synthetic environment, regime labels are available as an input feature. A learned policy might become overly dependent on that feature in a way that would not transfer if the regime indicator were noisy or misestimated. This cell challenges that dependence by randomly permuting the regime labels while holding returns fixed, then measuring how much utility changes relative to the unpermuted baseline. The goal is not to create a perfect test; the goal is to add a targeted stress that reveals reliance on a fragile input.  
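
A full permutation destroys all regime information at once. A milder complement, closer to the "noisy or misestimated" case mentioned above, is to flip only a fraction of the labels. The sketch below assumes binary regime labels (0 and 1, as in the diagnostics output); `flip_regime_labels` and `flip_prob` are hypothetical names, not part of the notebook.

```python
import numpy as np

def flip_regime_labels(
    reg: np.ndarray, flip_prob: float, rng: np.random.Generator
) -> np.ndarray:
    """Flip each binary regime label independently with probability flip_prob."""
    flips = rng.random(reg.shape[0]) < flip_prob
    return np.where(flips, 1 - reg, reg)

rng = np.random.default_rng(0)
reg = np.array([0, 0, 1, 1, 0, 1, 0, 0])
noisy = flip_regime_labels(reg, flip_prob=0.25, rng=rng)
```

Evaluating the greedy policy on `noisy` for a grid of `flip_prob` values would show how quickly utility degrades as the regime signal becomes less reliable.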

Pedagogically, this cell teaches how to conclude a learning experiment responsibly. You do not end with “the agent scored well.” You end with “here is the summary, here are the checks, here are the artifacts, and here is what we learned about stability.” The cell gathers the headline metrics into a single structured report: training utility, testing utility, generalization gap, drawdown behavior, diagnostic signals, and the outcomes of the robustness test. It also lists exactly where the artifacts are stored so a reviewer can jump directly to the evidence.  

Institutionally, this cell is the bridge from research to governance decision. It provides a compact set of indicators that can be used as stage gates. For example, a large generalization gap might trigger a “do not advance” decision. High governance penalties might require objective redesign. Large sensitivity to the regime test might require removing regime as an input or adding realism by estimating regimes rather than providing them.  
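
The stage-gate idea can be expressed as a small rule over the headline metrics. This is a sketch only: the function `stage_gate`, the limit names, and every threshold are hypothetical and would need to be set by the reviewing institution. The metric names `generalization_gap_U` and `proxy_exploitation_score` are the ones used in the notebook.

```python
from typing import Dict, List

def stage_gate(
    headlines: Dict[str, float],
    perm_delta: float,
    limits: Dict[str, float],
) -> Dict[str, object]:
    """Return an advance / do-not-advance decision with the reasons for it."""
    reasons: List[str] = []
    if headlines["generalization_gap_U"] > limits["max_gap_U"]:
        reasons.append("generalization gap too large")
    if headlines["proxy_exploitation_score"] > limits["max_exploitation"]:
        reasons.append("proxy exploitation score too high")
    if abs(perm_delta) > limits["max_regime_sensitivity"]:
        reasons.append("regime-label sensitivity too high")
    return {"advance": not reasons, "reasons": reasons}

# Illustrative numbers, not results from the run.
demo_headlines = {"generalization_gap_U": 120.0, "proxy_exploitation_score": 0.0}
demo_limits = {"max_gap_U": 100.0, "max_exploitation": 1.0, "max_regime_sensitivity": 10.0}
decision = stage_gate(demo_headlines, perm_delta=5.0, limits=demo_limits)
```

Returning the reasons alongside the boolean keeps the decision traceable: a committee sees which gate failed, not only that one did.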

By the end of the cell, you obtain the Chapter 2 “final summary object,” which is the official output of the run. It is the one file that a committee could read first, and then trace into deeper artifacts as needed. This final object completes the Chapter 2 goal: you applied optimization pressure to a governed surrogate objective, obtained a learned policy, evaluated it out of sample, measured proxy exploitation risk, and produced an audit-ready record of the entire process. This is what makes the work institutional: it is not merely that the agent exists, but that its creation and its behavior are documented in a way that supports supervision and accountable iteration.


###10.2.CODE AND IMPLEMENTATION

In [10]:
# CELL 10 — Institutional Stability Report (Overfitting Test via Regime Permutation + Final Summary)

def regime_permutation_test(
    r: np.ndarray, reg: np.ndarray,
    cfg: RunConfig,
    Q1, Q2,
    rng: np.random.Generator,
    n_perm: int = 40
) -> Dict[str, Any]:
    # Keep returns fixed, randomize regime labels to break regime conditioning.
    # If policy relies too heavily on regime label, performance should degrade sharply.
    base = run_policy_episode(r, reg, cfg, Q1, Q2, rng, policy="greedy")["U"]
    Us = []
    for _ in range(n_perm):
        perm_reg = reg.copy()
        rng.shuffle(perm_reg)
        Uperm = run_policy_episode(r, perm_reg, cfg, Q1, Q2, rng, policy="greedy")["U"]
        Us.append(float(Uperm))
    Us = np.array(Us, dtype=float)
    return {
        "U_base": float(base),
        "U_perm_mean": float(np.mean(Us)),
        "U_perm_std": float(np.std(Us)),
        "U_perm_p05": float(np.quantile(Us, 0.05)),
        "U_perm_p50": float(np.quantile(Us, 0.50)),
        "U_perm_p95": float(np.quantile(Us, 0.95)),
        "delta_base_minus_perm_mean": float(base - float(np.mean(Us))),
    }

perm_train = regime_permutation_test(r_tr, reg_tr, cfg, Q1, Q2, _rng, n_perm=40)
perm_test  = regime_permutation_test(r_te, reg_te, cfg, Q1, Q2, _rng, n_perm=40)

final_summary = {
    "run_id": run_id,
    "timestamp_utc": timestamp_utc,
    "config_hash": config_hash,
    "headlines": {
        "train_greedy_U": float(eval_train_g["U"]),
        "test_greedy_U": float(eval_test_g["U"]),
        "generalization_gap_U": float(eval_train_g["U"] - eval_test_g["U"]),
        "train_greedy_max_dd": float(eval_train_g["max_dd"]),
        "test_greedy_max_dd": float(eval_test_g["max_dd"]),
        "train_entropy_mean": float(amplification_metrics["train"]["entropy_mean"]),
        "test_entropy_mean": float(amplification_metrics["test"]["entropy_mean"]),
        "bellman_rmse_train": float(amplification_metrics["train"]["bellman_residual_rmse"]),
        "bellman_rmse_test": float(amplification_metrics["test"]["bellman_residual_rmse"]),
        "fragility_regime1_train": float(amplification_metrics["train"]["fragility_regime1_mean"]),
        "fragility_regime1_test": float(amplification_metrics["test"]["fragility_regime1_mean"]),
        "proxy_exploitation_score": float(amplification_metrics["proxy_exploitation_score"]),
    },
    "overfitting_checks": {
        "regime_permutation_train": perm_train,
        "regime_permutation_test": perm_test,
    },
    "audit_artifacts": {
        "run_manifest": os.path.join(artifact_root, "run_manifest.json"),
        "q_table_snapshot": os.path.join(artifact_root, "q_table_snapshot.json"),
        "reward_components": os.path.join(artifact_root, "reward_components.json"),
        "amplification_metrics": os.path.join(artifact_root, "amplification_metrics.json"),
        "regime_fragility": os.path.join(artifact_root, "regime_fragility.json"),
        "training_curve": os.path.join(artifact_root, "training_curve.json"),
        "risk_log": os.path.join(artifact_root, "risk_log.json"),
    }
}

write_json(os.path.join(artifact_root, "final_summary.json"), final_summary)
print(json.dumps(final_summary, indent=2, sort_keys=True))


{
  "audit_artifacts": {
    "amplification_metrics": "/mnt/data/artifacts_ch2/amplification_metrics.json",
    "q_table_snapshot": "/mnt/data/artifacts_ch2/q_table_snapshot.json",
    "regime_fragility": "/mnt/data/artifacts_ch2/regime_fragility.json",
    "reward_components": "/mnt/data/artifacts_ch2/reward_components.json",
    "risk_log": "/mnt/data/artifacts_ch2/risk_log.json",
    "run_manifest": "/mnt/data/artifacts_ch2/run_manifest.json",
    "training_curve": "/mnt/data/artifacts_ch2/training_curve.json"
  },
  "config_hash": "50b9a4b1a67adf03cd1124320f6bb2e0ab084ff25b4f3e024df0aa30d0504abc",
  "headlines": {
    "bellman_rmse_test": 504.5856499400123,
    "bellman_rmse_train": 374.1443269801867,
    "fragility_regime1_test": 111011076835.45274,
    "fragility_regime1_train": 102905983835.89816,
    "generalization_gap_U": 10174.521374661876,
    "proxy_exploitation_score": 8105092999.554581,
    "test_entropy_mean": 0.9653304308454388,
    "test_greedy_U": -13493.04662737944,

##11.CONCLUSION

**Conclusion: What You Obtain from Chapter 2 and What Goal You Achieved**

Chapter 2 ends with a deliverable that is even more concrete than Chapter 1. Chapter 1 produced an objective. Chapter 2 produces the first object that can legitimately be called an “agent,” but in a constrained and auditable form. The purpose of this conclusion is to state exactly what the outcome is, what objects you now possess, and what goal you have achieved at an institutional standard.

We can summarize the outcome in one sentence:

You obtained a learned policy trained to maximize your governed surrogate objective, plus the full audit trail documenting how it learned, how it behaves, and whether it generalizes.

That single sentence contains three important pieces: a learned policy, the governed objective it optimizes, and the evidence bundle that makes the result reviewable.

To make this precise, Chapter 2 produces five major outputs.

**First, you obtain a learned policy object.**  
This is the central product of the notebook. It is the mapping from states to actions that the learning system discovered. In this chapter the policy is represented as a tabular decision rule: given the discretized state (recent return sign, regime proxy, volatility bucket, drawdown bucket), the policy chooses an exposure action (short, flat, or long). This is a real agent in the technical sense: it takes a state description as input and outputs an action intended to maximize an objective over time.

In institutional terms, the importance is that you now have a policy that was not hand-designed. It was produced by optimization pressure against your surrogate objective. That means it is the first moment where proxy misalignment can actively appear. The policy is therefore not merely a “result.” It is a test instrument: it reveals what your objective truly incentivizes.

So if someone asks, “What did we build in Chapter 2?” the answer is:

We built a reproducible learned decision rule that optimizes the Chapter 1 surrogate objective.

**Second, you obtain a value function representation and learning traces.**  
The policy is not alone. The learning system also produces an internal representation of expected long-run utility under the surrogate objective. In tabular reinforcement learning, this appears as Q-values (or two Q-tables if double estimation is used). These tables are not just an implementation detail. They are part of the scientific output because they encode what the agent believes about the long-run consequences of actions.

You also obtain learning traces: episode-level logs of total reward, drawdowns, turnover, costs, and diagnostic signals. These traces prove whether learning stabilized or drifted, whether exploration collapsed too early, and whether the system improved the objective in the expected manner.

This matters institutionally because you can audit training quality. Without training traces, you cannot tell whether the policy is the result of stable learning or of noise.

**Third, you obtain a train-versus-test evaluation bundle.**  
A learned policy is not meaningful without evaluation. Chapter 2 therefore produces a controlled generalization test. The policy is trained on one synthetic path and evaluated on a different synthetic path generated under the same structural design but different randomness. This is not a claim about real-world performance. It is a stability test.

The output is a pair of comparable evaluation objects:
- behavior and score on the training environment
- behavior and score on the test environment

Each evaluation includes the same decomposition as Chapter 1: returns, cost penalties, volatility penalties, drawdown penalties, turnover penalties, leverage penalties, and governance penalties. The key is not the final score alone. The key is whether the decomposition remains reasonable on the test path. If the policy’s score is maintained only by exploiting peculiarities of the training path, the decomposition will change sharply out of sample.
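
One way to make "the decomposition will change sharply" concrete is to compare each component's share of the total absolute magnitude on the two paths. This is a minimal sketch with hypothetical helper names (`component_shares`, `largest_share_shift`) and made-up numbers; the notebook's own comparison may differ.

```python
from typing import Dict, Tuple

def component_shares(comps: Dict[str, float]) -> Dict[str, float]:
    """Share of each component in the total absolute magnitude."""
    total = sum(abs(v) for v in comps.values())
    if total == 0.0:
        return {k: 0.0 for k in comps}
    return {k: abs(v) / total for k, v in comps.items()}

def largest_share_shift(
    train: Dict[str, float], test: Dict[str, float]
) -> Tuple[str, float]:
    """Component whose share moved the most from train to test, with the signed shift."""
    s_tr, s_te = component_shares(train), component_shares(test)
    shifts = {k: s_te[k] - s_tr[k] for k in s_tr}
    name = max(shifts, key=lambda k: abs(shifts[k]))
    return name, shifts[name]

# Illustrative totals per component, not results from the run.
train_comps = {"return": 8.0, "cost_pen": 1.0, "dd_pen": 1.0}
test_comps = {"return": 4.0, "cost_pen": 2.0, "dd_pen": 4.0}
name, shift = largest_share_shift(train_comps, test_comps)
```

In this toy case the return share falls from 0.8 to 0.4, which is the kind of out-of-sample shift the paragraph above warns about.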

So you obtain a generalization report that is audit-ready.

**Fourth, you obtain proxy amplification diagnostics.**  
This is one of the most important outcomes of Chapter 2. The learned policy is used as a probe: it tells you whether your surrogate objective is robust under optimization or vulnerable to exploitation.

The notebook produces proxy amplification signals such as:
- action entropy behavior (does the policy collapse into a narrow, brittle behavior?)
- Bellman residual diagnostics (does the learned value structure look internally consistent?)
- regime-conditional fragility measures (does the policy become penalty-dominated in stress?)
- turnover and cost intensity drift (does learning “discover” excessive trading strategies?)
- governance breach rates under learning pressure (does the policy try to harvest reward by violating limits?)

These diagnostics are not optional. They are the institutional answer to the question: “Did the agent learn the mechanism we intended, or did it learn a loophole in the proxy?”
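
As an illustration of the entropy diagnostic, the sketch below computes the entropy of a softmax distribution over the Q-values of one state. The softmax form and the `temperature` parameter are assumptions; this section does not show how the notebook defines its action distribution. For three actions the maximum is ln 3, roughly 1.0986, which is consistent with the upper entropy quantiles reported by the diagnostics cell.

```python
import math
from typing import Sequence

def softmax_entropy(q_values: Sequence[float], temperature: float = 1.0) -> float:
    """Shannon entropy (in nats) of softmax(q / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(weights)
    probs = [w / z for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = softmax_entropy([0.0, 0.0, 0.0])   # indifferent between three actions
peaked = softmax_entropy([50.0, 0.0, 0.0])   # one action dominates
```

Values near zero indicate a state where the policy has collapsed onto a single action; values near ln 3 indicate a state where the Q-values barely distinguish the actions.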

In other words, Chapter 2 makes surrogate exploitation observable.

**Fifth, you obtain an institutional artifact bundle that proves reproducibility and reviewability.**  
This is the governance deliverable. The notebook exports run identity, config hash, training curves, learned tables, evaluation decompositions, risk logs, and stability diagnostics. This bundle is what converts the work from “a notebook someone ran” into “an institutional research record.”

If a reviewer asks what happened, you can point to the artifacts.
If someone asks whether results are reproducible, you can point to the run manifest and config hash.
If a risk reviewer asks why the agent scored well, you can point to the decomposition.
If a governance reviewer asks whether constraints were respected, you can point to breach metrics.
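
A config hash supports the reproducibility claim only if it is computed from a canonical serialization. The sketch below shows one common approach; it is an assumption, not the notebook's confirmed scheme. The 64-character hexadecimal hash printed by the run is consistent with SHA-256, but the hashing code is not shown in this section, and `config_fingerprint` is a hypothetical name.

```python
import hashlib
import json
from typing import Any, Dict

def config_fingerprint(config: Dict[str, Any]) -> str:
    """SHA-256 of a canonical JSON form (sorted keys, fixed separators)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical config fragments for illustration.
hash_a = config_fingerprint({"gamma": 0.99, "double_q": True})
hash_b = config_fingerprint({"double_q": True, "gamma": 0.99})  # same content, different key order
hash_c = config_fingerprint({"gamma": 0.95, "double_q": True})  # one parameter changed
```

Sorting keys makes the fingerprint insensitive to key order, so two runs with identical settings produce the same hash while any parameter change produces a different one.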

Now we can answer the second and more important question: what goal did we achieve?

The goal of Chapter 2 is **validation under optimization pressure**.

Chapter 1 specified an objective. Chapter 2 validates that objective by applying an optimizer and observing what it does. This is the key methodological shift. Many projects start with training and only later worry about alignment. Chapter 2 reverses the order: you treat training as a diagnostic tool to test alignment.

So the achieved goal is not “we trained a trading agent.” The achieved goal is:

We demonstrated that the surrogate objective can drive a learning process, and we measured whether the learned behavior is stable, feasible, and governed when optimization pressure is applied.

This goal has three sub-goals, each of which the notebook explicitly completes.

**Sub-goal 1: Convert the surrogate objective into a learning signal.**  
You achieved operationalization. The surrogate objective is no longer a conceptual model. It is a reward signal that an algorithm can optimize. This matters because only an operational objective can be audited in the context of learning.

**Sub-goal 2: Produce a policy and evaluate it out of sample.**  
You achieved a clean separation between training and testing in a controlled environment. You now have evidence about whether the learned policy is stable across paths and across structural break conditions. In institutional language, you performed a minimal generalization check.

**Sub-goal 3: Detect proxy exploitation and fragility modes.**  
You achieved the most important institutional capability: you can identify whether the agent is exploiting the proxy. If proxy exploitation exists, you can see it via decomposition shifts, regime-conditional fragility increases, entropy collapse, or breach behavior. This is what makes Chapter 2 a governance-first notebook rather than a “look at my RL results” demo.
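
The fragility ratio in the diagnostics cell divides total penalties by `abs(comps["return"]) + 1e-12`, and the reported means are on the order of 1e10. One plausible reading is that steps with a near-zero return term dominate the average. The sketch below shows a bounded alternative under that assumption; `bounded_fragility` and its share-style formula are illustrative and not part of the notebook.

```python
def bounded_fragility(pen: float, ret: float, eps: float = 1e-12) -> float:
    """Penalty share in [0, 1]: |pen| / (|pen| + |ret|)."""
    p, r = abs(pen), abs(ret)
    return p / (p + r + eps)

# A step with zero return and a small penalty (illustrative numbers).
raw_ratio = 0.01 / (abs(0.0) + 1e-12)    # unbounded ratio, as in the diagnostics cell
share = bounded_fragility(0.01, 0.0)     # bounded version of the same step
balanced = bounded_fragility(1.0, 1.0)   # penalties equal to returns
```

Because the share is bounded, a handful of zero-return steps cannot swamp the regime mean, which makes train and test values easier to compare.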

At the end of Chapter 2, the practical meaning is that you now possess an agentic object you can carry forward into Chapter 3 with clarity.

You have:
- an objective (from Chapter 1)
- a learned policy that optimizes that objective (from Chapter 2)
- evidence about how optimization interacts with the objective (from Chapter 2)

This is the correct staircase toward institutional-grade systems. You are not guessing whether the objective is safe. You are testing it with an optimizer in a controlled lab.

Finally, it is important to be explicit about what Chapter 2 does not claim, because institutional credibility depends on scope honesty.

Chapter 2 does not claim real-market profitability.  
Chapter 2 does not claim a deployable trading system.  
Chapter 2 does not claim that the synthetic market is reality.  

What it does claim is more defensible and more useful:

It claims that you have built a governed learning laboratory in which proxy objectives can be optimized and audited, and that you can now observe and measure alignment versus exploitation before scaling up.

If you want a single “success criterion” for Chapter 2, it is this:

When you compare train and test, the learned policy’s decomposition remains coherent, constraints remain respected, and the proxy does not reward pathological behavior that would fail institutional review.

If that criterion is met, you move to Chapter 3.
If it is not met, you revise the surrogate objective from Chapter 1 and rerun Chapter 2.

That is the institutional loop.

So the outcome of the exercise is not a vague “trained model.” It is a set of concrete objects: a learned policy, value tables, evaluation decompositions, amplification diagnostics, and a complete audit artifact bundle. And the goal achieved is equally concrete: you applied optimization pressure to your surrogate objective and produced evidence about whether that objective generates disciplined, governed behavior or fragile, exploitable behavior.

That is exactly what Chapter 2 is for.
