#**CHAPTER 1.SURROGATE OBJECTIVE ENGINEERING**
---

##REFERENCE

https://chatgpt.com/share/699349b7-d3c4-8012-812d-06a8783fd6b5

##0.CONTEXT

**Introduction: Objective and Meaning of the Chapter 1 Colab Exercise**

This notebook is the first step in a three-chapter journey. Its purpose is not to “train an AI trader” and it is not to “discover alpha.” The purpose is to build the intellectual and technical foundation required to speak honestly about trading agents. In this first chapter, we do something that many people skip: we isolate the most important design decision in the entire system — the surrogate objective — and we study it before we ever introduce learning.

In finance, we often talk as if profit is the objective. But in any real system, “profit” is not a single function you can optimize directly. Profit depends on the entire future path of prices, on costs, on liquidity, on constraints, on risk limits, on drawdowns, and on regimes that change. The true objective we would like to maximize is something like “long-horizon risk-adjusted utility under uncertainty with realistic execution.” That objective is not directly computable, and it is not directly optimizable. It is too complex, too path-dependent, and too tied to future states that cannot be known.

Because of that, every trading system — whether it uses machine learning or not — must replace the true objective with a tractable substitute. This substitute is called a surrogate objective (or proxy objective). It is a function that stands in for the “real thing” and is designed to be measurable, stable, and usable inside an algorithm. This is not a weakness of AI. It is a structural reality of finance. Even traditional quantitative strategies are built on surrogates: Sharpe ratio is a surrogate for utility, volatility targeting is a surrogate for risk control, drawdown limits are a surrogate for survivability, and transaction cost models are surrogates for market impact. The entire industry runs on proxies.

The problem is that most people do not treat surrogate design as a primary modeling act. They treat it as an afterthought. They assume the objective is “return” and then they bolt on constraints later. But if you build a trading agent — especially a learning agent — it will optimize whatever objective you actually give it, not the objective you wished you had given it. If the proxy is naive, optimization will amplify the naivety. If the proxy is misaligned, optimization will exploit the misalignment. In AI language, this is reward hacking. In finance language, it is the creation of fragile strategies that look good in backtests and collapse when exposed to real constraints.

This first notebook is designed to prevent that failure mode. It teaches you to see surrogate design as the core of the system.

The exercise has three major objectives.

The first objective is to make the surrogate objective explicit. We write down, in code, a reward function that decomposes into components. Instead of saying “the strategy performed well,” we break performance into a return component and a set of penalties. The return component represents participation in market opportunity. The penalties represent the costs and risks that make that participation meaningful and realistic. The surrogate objective in this notebook includes at least the following components:

Return: the realized portfolio return from the chosen exposures.  
Volatility penalty: a cost for risk variability, implemented through a volatility estimate.  
Drawdown penalty: a cost for losing capital relative to peak, implemented through drawdown increments.  
Execution cost penalty: a cost for trading, implemented with a convex impact model.  
Turnover penalty: a cost for changing positions too frequently.  
Leverage or exposure penalty: a cost for holding large gross exposure.  
Governance penalty: a hard penalty if predefined limits are violated.

The key point is not the exact formula. The key point is that the surrogate is not “one number.” It is a structured contract. It is a statement of what behavior is acceptable. If you cannot decompose the objective, you cannot audit it. And if you cannot audit it, you cannot claim it is institutional-grade.
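To make the idea of a "structured contract" concrete, here is a minimal sketch of a one-step reward that returns its components instead of a single number. The weights are the ones declared in this chapter's `RewardConfig`; everything else (the `RewardTerms` and `step_reward` names, the linear form of each penalty, and the flat `gov_penalty` value) is an illustrative assumption, not the notebook's actual implementation.

```python
from dataclasses import dataclass

# Weights copied from the chapter's RewardConfig; the functional forms are assumptions.
LAMBDAS = {"vol": 0.70, "dd": 2.20, "cost": 1.00, "turn": 0.55, "lev": 0.10}

@dataclass(frozen=True)
class RewardTerms:
    ret: float
    vol_pen: float
    dd_pen: float
    cost_pen: float
    turn_pen: float
    lev_pen: float
    gov_pen: float

    def total(self) -> float:
        # Surrogate reward = return minus the sum of all penalties.
        penalties = (self.vol_pen + self.dd_pen + self.cost_pen
                     + self.turn_pen + self.lev_pen + self.gov_pen)
        return self.ret - penalties

def step_reward(port_ret, vol_est, dd_increment, cost, turnover, gross,
                breached, lam=LAMBDAS, gov_penalty=1.0):
    """Hypothetical one-step reward, returned as auditable components."""
    return RewardTerms(
        ret=port_ret,
        vol_pen=lam["vol"] * vol_est,
        dd_pen=lam["dd"] * max(dd_increment, 0.0),   # only new drawdown is penalized
        cost_pen=lam["cost"] * cost,
        turn_pen=lam["turn"] * turnover,
        lev_pen=lam["lev"] * gross,
        gov_pen=gov_penalty if breached else 0.0,    # hard penalty on a limit breach
    )

terms = step_reward(port_ret=0.002, vol_est=0.001, dd_increment=0.0,
                    cost=0.0002, turnover=0.1, gross=1.0, breached=False)
print(terms)
print("total:", terms.total())
```

Because every term is stored separately, a reviewer can see which penalty dominates the total. In this made-up step, the leverage and turnover terms dwarf the return, which is exactly the kind of scale mismatch a decomposition is meant to expose.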

The second objective is to show that the surrogate objective shapes behavior even without learning. In this notebook, the portfolio policy is deterministic. We are not training a model. Yet the choice of penalties still changes what “good” means, and therefore changes what would be preferred if optimization were introduced. This is a subtle but crucial lesson: learning is not the beginning of behavior; the objective is.

You can think of it like this. Suppose we have two policies, A and B. Policy A has higher raw returns but also higher drawdowns and much higher trading costs. Policy B has lower raw returns but stable behavior, low turnover, and controlled drawdowns. If you say “maximize return,” policy A is better. If you say “maximize risk-adjusted utility with drawdown control and cost realism,” policy B may be better. The surrogate objective decides which policy is considered superior. That is why the surrogate is a primary design choice.
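The A-versus-B comparison can be reduced to a few lines of arithmetic. The numbers below are invented purely for illustration, and the penalty weights `lam_dd` and `lam_cost` are hypothetical; the point is only that the ranking flips when the scoring function changes while the policies stay the same.

```python
# Made-up summary statistics for two policies (illustration only).
policies = {
    "A": {"mean_ret": 0.12, "max_dd": 0.30, "cost": 0.04},
    "B": {"mean_ret": 0.08, "max_dd": 0.10, "cost": 0.01},
}

def raw_return(p):
    # "Maximize return": ignore risk and cost entirely.
    return p["mean_ret"]

def penalized(p, lam_dd=0.2, lam_cost=1.0):
    # A simple surrogate: return minus drawdown and cost penalties.
    return p["mean_ret"] - lam_dd * p["max_dd"] - lam_cost * p["cost"]

best_raw = max(policies, key=lambda k: raw_return(policies[k]))
best_pen = max(policies, key=lambda k: penalized(policies[k]))
print("preferred under raw return:", best_raw)
print("preferred under penalized surrogate:", best_pen)
```

Under raw return, A wins (0.12 against 0.08). Under the penalized score, A falls to about 0.02 and B to about 0.05, so B wins. Nothing about either policy changed; only the objective did.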

The third objective is to show that surrogate design must be stress-tested and sensitivity-tested. In real institutions, you do not trust a single parameter setting. You explore how results change when penalties change. You ask: if we increase the drawdown penalty, does the preferred configuration change dramatically? If the answer is yes, your objective landscape may be unstable. If we tighten the cost penalty, does the system collapse into inactivity? If the answer is yes, your strategy may be dependent on unrealistic execution. If small changes in penalty weights lead to large changes in preferred behavior, you have a fragile surrogate.

This notebook therefore includes a lambda sensitivity surface: a grid over key penalty weights. We evaluate the surrogate utility and the risk diagnostics across this grid. This creates a “reward landscape” — a map of how your proxy behaves as you vary the parameters that encode your economic intent. The goal is not to find the “best” lambdas. The goal is to understand whether your surrogate is stable, interpretable, and aligned with governance constraints.
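A sensitivity surface of this kind can be sketched with a small grid. The example below is a simplified stand-in, not the notebook's own surface: the two candidate policies and their aggregate statistics are made up, and the score is assumed to be linear in the penalty weights. It records which candidate is preferred at each grid point, so a preference boundary becomes visible.

```python
import numpy as np

# Made-up path aggregates per candidate: (total return, drawdown base, cost base).
candidates = {
    "aggressive": (0.127, 0.30, 0.04),
    "defensive":  (0.080, 0.10, 0.01),
}
names = list(candidates)
stats = np.array([candidates[k] for k in names])      # shape (2, 3)

lam_dd_grid = np.linspace(0.0, 0.4, 5)
lam_cost_grid = np.linspace(0.0, 2.0, 5)
LD, LC = np.meshgrid(lam_dd_grid, lam_cost_grid, indexing="ij")

# utility[k, i, j] = return_k - lam_dd[i] * dd_k - lam_cost[j] * cost_k
utility = (stats[:, 0, None, None]
           - LD[None] * stats[:, 1, None, None]
           - LC[None] * stats[:, 2, None, None])
preferred = utility.argmax(axis=0)                    # index into `names`

# Print the preference map: A = aggressive, D = defensive; columns follow lam_cost_grid.
for i, lam_dd in enumerate(lam_dd_grid):
    row = " ".join(names[k][0].upper() for k in preferred[i])
    print(f"lambda_dd={lam_dd:.1f}: {row}")
```

If the boundary between the two letters sits close to the weights you intend to use, a small change in a lambda flips the preferred behavior, which is the fragility the surface is meant to reveal.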

A crucial aspect of institutional work is that you must separate “mechanism” from “optimization.” This notebook is intentionally a mechanism notebook. We build:

A synthetic market environment with regimes and realistic features.  
A portfolio accounting engine that tracks wealth and exposures.  
An execution model that penalizes trading through convex costs.  
A risk system that measures volatility, drawdown, and tail risk.  
A surrogate objective that integrates all these components.  
A diagnostic layer that tests stability, curvature, and failure cliffs.  
A governance layer that records artifacts for audit and review.

We do not build an agent yet. That is deliberate. If you cannot specify and understand the objective in a controlled setting, you should not trust any learning process that optimizes it. Learning will not fix a poorly designed objective. It will amplify its flaws.

The synthetic market is also deliberate. Many people want to start with real data. But real data hides too many things at once. It hides the regime process, the true cost process, and the causality of results. Synthetic markets let us control the structure and therefore diagnose the system. In this notebook, we include nontrivial features: regime switching, time-varying volatility, correlation shifts across regimes, and rare jumps. These features are not included to make the notebook “fancy.” They are included because surrogate objectives must be tested under conditions that resemble the types of nonstationarity and stress that break strategies.

You should interpret the synthetic environment as a laboratory. It is not claiming to replicate the real market. It is designed to expose the weaknesses of objectives and policies. It is designed to force you to confront how your proxy behaves under stress, under changing volatility, and under changing liquidity.

The governance requirements in the notebook are equally important. This is not a casual exercise. Each run produces artifacts: a run manifest, a configuration hash, a reward decomposition, a sensitivity surface, stability metrics, and a risk log. The purpose is to ensure that results are reproducible and reviewable. In institutional settings, you need to be able to answer basic questions:

What configuration produced this output?  
When did the run occur?  
What code and parameters were used?  
What were the risk outcomes?  
How did the objective decompose?  
Were any governance thresholds breached?  

If you cannot answer these questions, you are not doing controlled research. You are doing narrative.

This notebook teaches a specific posture: treat objective design as a governance act. The surrogate objective is not just math. It is a contract. It specifies acceptable risk, acceptable turnover, acceptable drawdown behavior, and acceptable execution realism. If those constraints are not encoded and audited, the system is not institutionally credible.

You can also interpret this notebook as an antidote to a common confusion: the confusion between “good backtests” and “good mechanisms.” A backtest can be good because it accidentally exploited a period of low costs, or because it implicitly took too much tail risk, or because it assumed liquidity that disappears in stress. A surrogate objective that includes explicit costs, explicit drawdown penalties, and explicit governance thresholds forces you to confront these realities before you talk about performance.

The meaning of the exercise is therefore deeper than writing code. It is to change how you think about trading agents. A trading agent is not a predictor of returns. It is a control system that chooses actions under uncertainty. The surrogate objective defines what actions are considered good. If your objective is naive, the agent will be dangerous. If your objective is aligned and governed, the agent can be disciplined.

At the end of this notebook, you should have a concrete understanding of the following ideas:

First, surrogacy is not optional. It is a structural necessity.  
Second, surrogate objectives are multi-term contracts, not single metrics.  
Third, reward shaping controls behavior even before learning begins.  
Fourth, sensitivity analysis is essential to reveal objective fragility.  
Fifth, governance artifacts are required to make any result reviewable.  
Sixth, the purpose of the laboratory is to expose weaknesses, not to claim truth.

If you take only one lesson forward, it should be this: the most important design choice in a trading agent is not the model architecture. It is the objective. The surrogate objective is where finance becomes engineering. It is where you encode your economic intent. It is where you impose survivability. It is where you decide what kinds of behavior you will allow the system to learn.

This is why Chapter 1 begins here.

Only after the surrogate is explicit, decomposed, stress-tested, and auditable do we proceed to Chapter 2. In Chapter 2, we will introduce learning and show how optimization pressure amplifies whatever structure you have embedded in the surrogate. And only after that do we proceed to Chapter 3, where we treat the entire system as constrained optimal control under regimes and institutional governance.

So the objective of Chapter 1 is simple in statement and profound in implication: build the surrogate objective first, and treat it as the primary modeling act.

That is the meaning of the exercise.

Run the notebook slowly. Change penalty weights deliberately. Observe how the system’s behavior changes. Study the decomposition. Inspect the artifacts. Learn to see reward design as the center of gravity. If you do that well, everything that follows — learning, governance, stress testing, and institutional deployment — will be built on solid ground.


##1.LIBRARIES AND ENVIRONMENT

**Cell 1 — Environment lock and governance identity**  
This cell turns the notebook into a controlled experiment rather than an improvisation. It does four practical jobs. First, it fixes randomness so the same inputs produce the same outputs. That matters because finance research is full of “it worked once” stories that collapse when you try to reproduce them. Second, it defines the configuration as a single, explicit object: market settings, cost settings, reward settings, and governance thresholds. The configuration is not just convenience; it is the formal statement of what problem you are solving. Third, it creates a run identity and writes a run manifest. Think of the manifest as the lab notebook page that says what you did, when you did it, and under what conditions. Fourth, it creates an artifact directory where every output will be saved, so the run produces reviewable evidence rather than ephemeral screen text.  

Pedagogically, this cell teaches a mindset: before you debate results, you must be able to prove what you ran. If a colleague, a reviewer, or your future self cannot reconstruct the run, then the result is not knowledge, it is a rumor. This cell also establishes discipline about “no hidden constants.” Any number that influences behavior should live in the config and therefore be visible, hashable, and auditable.  

By the end of Cell 1 you do not yet have trading logic, but you do have institutional foundations: determinism, traceability, and artifact discipline. You can point to a unique run identifier, you can point to a frozen configuration, and you can point to files that record the context of the experiment. That is the prerequisite for the rest of the notebook, because everything later will be judged in relation to this declared configuration.  
This cell also defines accountability. If a limit is breached later, you can point to the exact threshold. If the objective feels too harsh, you can point to the exact weights. And when you compare runs, the configuration hash gives a clean proof of whether two experiments were truly identical or quietly different in ways that matter most.
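The "identical or quietly different" test can be tried in isolation. The sketch below mirrors the canonical-JSON-then-SHA-256 approach used in the cell that follows; the `config_hash` helper name and the three small dictionaries are illustrative assumptions rather than part of the notebook.

```python
import hashlib
import json

def config_hash(cfg: dict) -> str:
    # Canonical form: sorted keys and fixed separators, so key order cannot change the hash.
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"reward": {"lambda_dd": 2.20, "lambda_cost": 1.00}, "seed": 20260216}
b = {"seed": 20260216, "reward": {"lambda_cost": 1.00, "lambda_dd": 2.20}}  # same content, other key order
c = {"reward": {"lambda_dd": 2.21, "lambda_cost": 1.00}, "seed": 20260216}  # one weight nudged

print("a == b:", config_hash(a) == config_hash(b))
print("a == c:", config_hash(a) == config_hash(c))
```

Note that in Cell 1 the `run_id` also mixes in the timestamp, so it changes on every execution. The `config_hash` is the value to compare when asking whether two runs used the same declared configuration.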


In [5]:
# CELL 1 — Institutional Environment Lock, Config Freezing, Hashing, Artifact Root

import os, sys, json, math, random, hashlib, platform
import datetime as _dt
from dataclasses import dataclass, asdict
from typing import Dict, Tuple, List, Any, Optional
import numpy as np

# -----------------------------
# Determinism (hard requirement)
# -----------------------------
SEED = 20260216
np.random.seed(SEED)
random.seed(SEED)

# -----------------------------
# Utilities (institutional)
# -----------------------------
def utc_now_iso() -> str:
    return _dt.datetime.now(_dt.timezone.utc).isoformat()

def sha256_hex(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

def canonical_json_dumps(obj: Any) -> str:
    # Canonical representation for hashing/audit reproducibility
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)

def safe_mkdir(path: str) -> None:
    os.makedirs(path, exist_ok=True)

def write_json(path: str, obj: Any) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2, sort_keys=True)

def write_text(path: str, s: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write(s)

# -----------------------------
# Config schema (frozen)
# -----------------------------
@dataclass(frozen=True)
class GARCHConfig:
    omega: float
    alpha: float
    beta: float
    def validate(self) -> None:
        if not (self.omega > 0):
            raise ValueError("GARCH omega must be > 0")
        if not (0 <= self.alpha < 1 and 0 <= self.beta < 1 and self.alpha + self.beta < 1):
            raise ValueError("GARCH alpha,beta must be in [0,1) and alpha+beta<1")

@dataclass(frozen=True)
class RegimeConfig:
    P: Tuple[Tuple[float, float], Tuple[float, float]]  # 2x2 Markov transition matrix
    mu: Tuple[float, float]                              # per-regime drift (daily)
    base_vol: Tuple[float, float]                        # per-regime unconditional vol scale
    corr: Tuple[float, float]                            # per-regime constant correlation (single rho) for multi-asset
    liquidity: Tuple[float, float]                       # per-regime liquidity multiplier (higher => easier)
    def validate(self) -> None:
        for i in range(2):
            s = self.P[i][0] + self.P[i][1]
            if abs(s - 1.0) > 1e-12:
                raise ValueError("Each transition row must sum to 1")
            if any(p < 0 or p > 1 for p in self.P[i]):
                raise ValueError("Transition probabilities must be in [0,1]")
        if len(self.mu) != 2 or len(self.base_vol) != 2 or len(self.corr) != 2 or len(self.liquidity) != 2:
            raise ValueError("Regime arrays must have length 2")
        for rho in self.corr:
            if not (-0.95 < rho < 0.95):
                raise ValueError("rho must be in (-0.95,0.95) to keep correlation matrix well-conditioned")
        for L in self.liquidity:
            if not (L > 0):
                raise ValueError("liquidity multiplier must be > 0")

@dataclass(frozen=True)
class MarketConfig:
    T: int
    n_assets: int
    regime: RegimeConfig
    garch: GARCHConfig
    jump_prob: float
    jump_scale: float
    def validate(self) -> None:
        if not (self.T >= 500):
            raise ValueError("T must be >= 500 for meaningful diagnostics")
        if not (2 <= self.n_assets <= 10):
            raise ValueError("n_assets must be between 2 and 10 (institutional toy-lab but nontrivial)")
        self.regime.validate()
        self.garch.validate()
        if not (0 <= self.jump_prob < 0.2):
            raise ValueError("jump_prob should be in [0, 0.2)")
        if not (self.jump_scale > 0):
            raise ValueError("jump_scale must be > 0")

@dataclass(frozen=True)
class CostConfig:
    # Convex impact + spread + volatility scaling + liquidity scaling
    a_lin: float
    b_quad: float
    c_cubic: float
    spread_bps: float
    vol_scale: float
    def validate(self) -> None:
        if any(x < 0 for x in [self.a_lin, self.b_quad, self.c_cubic, self.spread_bps, self.vol_scale]):
            raise ValueError("Cost parameters must be nonnegative")

@dataclass(frozen=True)
class RewardConfig:
    gamma: float
    lambda_vol: float
    lambda_dd: float
    lambda_cost: float
    lambda_turn: float
    lambda_lev: float
    def validate(self) -> None:
        if not (0.90 <= self.gamma < 1.0):
            raise ValueError("gamma must be in [0.90,1)")
        for k, v in asdict(self).items():
            if k != "gamma" and v < 0:
                raise ValueError("All lambdas must be >= 0")

@dataclass(frozen=True)
class GovernanceConfig:
    max_gross_leverage: float
    max_turnover: float
    max_drawdown: float
    min_liquidity_multiplier: float
    def validate(self) -> None:
        if not (self.max_gross_leverage > 0):
            raise ValueError("max_gross_leverage must be > 0")
        if not (self.max_turnover > 0):
            raise ValueError("max_turnover must be > 0")
        if not (0 < self.max_drawdown < 1.0):
            raise ValueError("max_drawdown must be in (0,1)")
        if not (self.min_liquidity_multiplier > 0):
            raise ValueError("min_liquidity_multiplier must be > 0")

@dataclass(frozen=True)
class RunConfig:
    market: MarketConfig
    cost: CostConfig
    reward: RewardConfig
    gov: GovernanceConfig
    def validate(self) -> None:
        self.market.validate()
        self.cost.validate()
        self.reward.validate()
        self.gov.validate()

# -----------------------------
# Institutional config (nontrivial)
# -----------------------------
cfg = RunConfig(
    market=MarketConfig(
        T=2500,
        n_assets=4,
        regime=RegimeConfig(
            P=((0.975, 0.025),
               (0.080, 0.920)),
            mu=( 0.00035, -0.00015),      # calm vs stressed drift
            base_vol=(0.010, 0.028),      # calm vs stressed vol scales
            corr=(0.20, 0.70),            # calm vs stressed correlation
            liquidity=(1.00, 0.45)        # calm vs stressed liquidity multiplier
        ),
        garch=GARCHConfig(omega=1e-7, alpha=0.06, beta=0.92),
        jump_prob=0.015,
        jump_scale=4.0
    ),
    cost=CostConfig(
        a_lin=2.0e-4,
        b_quad=1.2e-3,
        c_cubic=2.0e-3,
        spread_bps=1.8,
        vol_scale=0.25
    ),
    reward=RewardConfig(
        gamma=0.9992,
        lambda_vol=0.70,
        lambda_dd=2.20,
        lambda_cost=1.00,
        lambda_turn=0.55,
        lambda_lev=0.10
    ),
    gov=GovernanceConfig(
        max_gross_leverage=2.2,
        max_turnover=0.35,
        max_drawdown=0.25,
        min_liquidity_multiplier=0.35
    )
)
cfg.validate()

# -----------------------------
# Run identity + artifact root
# -----------------------------
artifact_root = "/mnt/data/artifacts_ch1"
safe_mkdir(artifact_root)

cfg_dict = asdict(cfg)
cfg_json = canonical_json_dumps(cfg_dict)
config_hash = sha256_hex(cfg_json.encode("utf-8"))

timestamp_utc = utc_now_iso()
run_id = sha256_hex((timestamp_utc + ":" + config_hash + ":" + str(SEED)).encode("utf-8"))

manifest = {
    "run_id": run_id,
    "timestamp_utc": timestamp_utc,
    "seed": SEED,
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "config_hash": config_hash,
    "artifact_root": artifact_root,
    "governance": asdict(cfg.gov),
}

write_json(os.path.join(artifact_root, "run_manifest.json"), manifest)
write_text(os.path.join(artifact_root, "config_hash.txt"), config_hash)

print(json.dumps(manifest, indent=2, sort_keys=True))


{
  "artifact_root": "/mnt/data/artifacts_ch1",
  "config_hash": "f0c5677b30ea4d4c3625b7a60fcaad227e8b03f92faf1422234a2fc943b3e846",
  "governance": {
    "max_drawdown": 0.25,
    "max_gross_leverage": 2.2,
    "max_turnover": 0.35,
    "min_liquidity_multiplier": 0.35
  },
  "numpy": "2.0.2",
  "platform": "Linux-6.6.105+-x86_64-with-glibc2.35",
  "python": "3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]",
  "run_id": "19f290c4b62afc42ca15e550eb75b5f96334dec802cdf1c955616fcfa364f08a",
  "seed": 20260216,
  "timestamp_utc": "2026-02-16T16:08:55.558387+00:00"
}


##2.MARKET GENERATOR

###2.1.OVERVIEW

**Cell 2 — Synthetic market generator with regimes**  
This cell builds the laboratory world in which the rest of the notebook makes sense. Instead of pulling real market data, it creates synthetic returns that contain controlled structure: calm periods and stressed periods, changing volatility, changing cross-asset co-movement, and occasional shocks. The point is not to imitate reality perfectly. The point is to create a repeatable environment where you can ask precise questions about objectives. With real data, you never know whether an outcome came from the objective, from an unobserved regime shift, from a one-off event, or from a data quirk. In a synthetic generator, you can control those ingredients and therefore learn clean lessons.  

Pedagogically, this cell teaches that good evaluation requires a “known testbed.” You want a market process that is rich enough to challenge a strategy, but transparent enough that you can explain what happened. The regime path provides a simple way to represent nonstationarity: conditions change, sometimes slowly, sometimes abruptly. The volatility path provides a way to represent risk clustering: calm days are followed by calm days, and turbulent days tend to come in clusters. Correlation changes are crucial because diversification often disappears in stress, and an objective that ignores this can be dangerously optimistic. Rare shocks matter because tail behavior is where many strategies fail.  

By the end of this cell you obtain a complete synthetic dataset: returns for multiple assets, a regime label at each time step, a volatility proxy, a liquidity proxy, and a price path built from the returns. These objects are not “the answer” of the notebook; they are the controlled input that makes the surrogate objective meaningful. Later, when you change penalty weights or inspect risk metrics, you will be able to connect outcomes to a clearly defined environment rather than to an unknown historical mix of conditions.  
An institutional benefit is that the generator is reproducible under the run seed and config, so reviewers can recreate the same “market history” on demand. Using multiple assets forces objectives to confront allocation, concentration, and co-movement, which single-asset demos often hide.
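The claim that diversification fades in stress can be quantified for the equicorrelation structure this chapter uses. For an equal-weight portfolio of n assets with a common volatility and a single pairwise correlation rho, portfolio variance is the asset variance times 1/n + (1 - 1/n)·rho. The sketch below evaluates that for the configured calm and stressed correlations (0.20 and 0.70) with n = 4; the helper names are assumptions made for illustration.

```python
import numpy as np

def equicorr(n: int, rho: float) -> np.ndarray:
    # Equicorrelation matrix: ones on the diagonal, rho everywhere else.
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

def equal_weight_vol_ratio(n: int, rho: float) -> float:
    # Portfolio vol divided by single-asset vol for equal long-only weights.
    w = np.full(n, 1.0 / n)
    return float(np.sqrt(w @ equicorr(n, rho) @ w))

calm = equal_weight_vol_ratio(4, 0.20)
stressed = equal_weight_vol_ratio(4, 0.70)
print(f"calm ratio: {calm:.3f}, stressed ratio: {stressed:.3f}")
```

With four assets, the ratio rises from about 0.63 in the calm regime to about 0.88 in the stressed regime, so most of the diversification benefit is gone before the higher stressed `base_vol` is even taken into account.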


###2.2.CODE AND IMPLEMENTATION

In [6]:
# CELL 2 — Multi-Asset Regime-Switching Market with GARCH Vol + Jumps + Correlation Shifts

def _chol_from_rho(n: int, rho: float) -> np.ndarray:
    # Equicorrelation matrix: (1-rho)I + rho 11'
    # A closed form via the eigen-structure exists, but n is small here,
    # so build the matrix explicitly and factor it with np.linalg.cholesky.
    C = (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))
    # numerical jitter
    C = C + 1e-12 * np.eye(n)
    return np.linalg.cholesky(C)

def simulate_market(cfg: RunConfig) -> Dict[str, np.ndarray]:
    T, n = cfg.market.T, cfg.market.n_assets
    P = np.array(cfg.market.regime.P, dtype=float)
    mu = np.array(cfg.market.regime.mu, dtype=float)
    base_vol = np.array(cfg.market.regime.base_vol, dtype=float)
    rho = np.array(cfg.market.regime.corr, dtype=float)
    liq = np.array(cfg.market.regime.liquidity, dtype=float)

    omega = cfg.market.garch.omega
    alpha = cfg.market.garch.alpha
    beta  = cfg.market.garch.beta

    jump_prob = cfg.market.jump_prob
    jump_scale = cfg.market.jump_scale

    regimes = np.zeros(T, dtype=np.int64)
    sigma2 = np.zeros(T, dtype=float)   # common volatility factor (stochastic vol proxy)
    R = np.zeros((T, n), dtype=float)   # returns

    regimes[0] = 0
    # unconditional variance for GARCH
    sigma2[0] = omega / max(1e-12, (1.0 - alpha - beta))

    for t in range(1, T):
        prev = regimes[t-1]
        u = random.random()
        regimes[t] = 0 if u < P[prev, 0] else 1

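        # GARCH recursion uses previous common factor return proxy (market factor).
        # The proxy is the realized mean return, which is already scaled by base_vol, so the
        # alpha term is small next to omega and sigma2 drifts from its starting value
        # omega/(1-alpha-beta) toward roughly omega/(1-beta).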
        factor_ret_prev = float(np.mean(R[t-1]))
        sigma2[t] = omega + alpha * (factor_ret_prev ** 2) + beta * sigma2[t-1]

        # correlated Gaussian shock
        L = _chol_from_rho(n, float(rho[regimes[t]]))
        z = L @ np.random.normal(size=n)

        # idiosyncratic jump risk (rare, heavy)
        if random.random() < jump_prob:
            # jump direction random; magnitude scaled by regime
            jump = (np.random.choice([-1.0, 1.0], size=n) *
                    (jump_scale * base_vol[regimes[t]] * math.sqrt(sigma2[t])) *
                    np.random.exponential(scale=1.0, size=n))
        else:
            jump = np.zeros(n)

        # regime-dependent drift + vol scaling
        vol_t = base_vol[regimes[t]] * math.sqrt(max(1e-18, sigma2[t]))
        R[t] = mu[regimes[t]] + vol_t * z + jump

    # liquidity multiplier path (regime-dependent)
    liquidity_path = liq[regimes].astype(float)

    # compute synthetic prices from returns (for drawdown realism)
    prices = np.ones((T, n), dtype=float)
    for t in range(1, T):
        prices[t] = prices[t-1] * (1.0 + R[t])

    return {
        "returns": R,
        "prices": prices,
        "regimes": regimes,
        "sigma2": sigma2,
        "liquidity": liquidity_path,
    }

mkt = simulate_market(cfg)
R = mkt["returns"]
prices = mkt["prices"]
regimes = mkt["regimes"]
sigma2 = mkt["sigma2"]
liq_path = mkt["liquidity"]

print({
    "shape_returns": R.shape,
    "regime_counts": {0: int(np.sum(regimes==0)), 1: int(np.sum(regimes==1))},
    "sigma2_range": (float(np.min(sigma2)), float(np.max(sigma2))),
    "liq_range": (float(np.min(liq_path)), float(np.max(liq_path))),
})


{'shape_returns': (2500, 4), 'regime_counts': {0: 1959, 1: 541}, 'sigma2_range': (1.269610737251874e-06, 5.000000000000023e-06), 'liq_range': (0.45, 1.0)}
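A quick sanity check on the printed regime counts is to compare them with the long-run behavior implied by the configured transition matrix. The sketch below uses only the values of `P` from the config and the counts printed above; the variable names are illustrative.

```python
import numpy as np

P = np.array([[0.975, 0.025],
              [0.080, 0.920]])
p01, p10 = P[0, 1], P[1, 0]

# Stationary distribution of a 2-state chain: pi = (p10, p01) / (p01 + p10).
pi = np.array([p10, p01]) / (p01 + p10)

# Expected spell length in each regime is 1 / (probability of leaving it).
expected_durations = 1.0 / np.array([p01, p10])

observed = np.array([1959, 541]) / 2500   # shares from the printed regime_counts

print("stationary shares:", pi.round(4))
print("observed shares:  ", observed.round(4))
print("expected spell lengths (steps):", expected_durations)
```

The stationary shares are roughly 76% calm and 24% stressed, with expected spells of 40 and 12.5 steps. The realized path shows about 78% and 22%, which is a plausible draw from a single 2500-step sample rather than a sign of a bug.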


##3.PORTFOLIO ACCOUNTING ENGINE

###3.1.OVERVIEW

**Cell 3 — Portfolio accounting and deterministic policy**  
This cell creates a full trading trajectory without any learning. It chooses portfolio exposures through a deterministic rule and then computes what that rule implies for wealth, positions, leverage, and turnover over time. This is important because Chapter 1 is about objectives, not about training. If you introduce learning too early, you cannot tell whether behavior changes because the objective changed or because the learning algorithm changed. A deterministic policy gives you a stable “behavioral baseline” so the only moving part, later, is the scoring function that evaluates that behavior.  

Pedagogically, the key lesson is that trading is a sequence of decisions embedded in accounting. A portfolio is not just weights on paper; it is a state that evolves. Wealth updates depend on yesterday’s exposure and today’s returns. Exposures change through trades, and trades create turnover. Turnover matters because it is the gateway to costs, slippage, and capacity limits. The cell therefore records not only the end wealth, but also the path of exposures and the size of changes from one step to the next. This path-level view is essential for understanding drawdowns and for understanding why an objective might punish a strategy even when its average return looks fine.  

By the end of this cell you obtain a “portfolio path object”: a time series of weights for each asset, the resulting wealth curve, the gross exposure through time, the net exposure through time, and the turnover series. This object is the main input to later components: costs will be applied to the trading changes, risk measures will be computed from the realized portfolio returns, and the surrogate reward will score the path. The outcome is a concrete, audit-friendly representation of trading behavior that you can evaluate under different surrogate designs.  
A second lesson is exposure interpretation. Gross exposure captures how much risk you are taking, while net exposure captures directional bias. Many failures come from confusing the two. Because the policy is deterministic by design, if the score changes later, you know the change came from the objective, not from randomness.
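The accounting convention described above (yesterday's exposure earns today's return, and turnover is the absolute change in weights) can be traced by hand on a tiny example. The two-asset returns and target weights below are made up for illustration; only the update rules follow the description in this section.

```python
import numpy as np

# Made-up returns for 2 assets over 3 steps (row 0 is the starting step).
R = np.array([[ 0.00,  0.00],
              [ 0.01, -0.02],
              [-0.01,  0.03]])
# Made-up target weights chosen at each step.
targets = np.array([[ 0.0,  0.0],
                    [ 0.5, -0.5],
                    [-0.5,  0.5]])

wealth = [1.0]
turnover, gross, net = [], [], []
w_prev = targets[0]
for t in range(1, len(R)):
    port_ret = float(w_prev @ R[t])            # yesterday's weights earn today's return
    wealth.append(wealth[-1] * (1.0 + port_ret))
    turnover.append(float(np.abs(targets[t] - w_prev).sum()))
    gross.append(float(np.abs(targets[t]).sum()))
    net.append(float(targets[t].sum()))
    w_prev = targets[t]

print("wealth:", wealth)
print("turnover:", turnover)
print("gross:", gross, "net:", net)
```

At step 1 the portfolio is still flat, so wealth does not move even though the trade into the long/short book creates turnover of 1.0. At step 2 the held weights lose 2%, and flipping both legs costs turnover of 2.0 while gross stays at 1.0 and net stays at 0.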


###3.2.CODE AND IMPLEMENTATION

In [7]:
# CELL 3 — Portfolio Accounting Engine (Multi-Asset, Self-Financing, Gross/Net Exposure, Wealth)

def normalize_to_gross_cap(w: np.ndarray, gross_cap: float) -> np.ndarray:
    g = float(np.sum(np.abs(w)))
    if g <= 1e-18:
        return w.copy()
    scale = min(1.0, gross_cap / g)
    return w * scale

def portfolio_path(R: np.ndarray, liq_path: np.ndarray, cfg: RunConfig) -> Dict[str, np.ndarray]:
    """
    Deterministic policy (no learning): regime-aware tilt + mean-reversion overlay.
    Institutional goal: show that even without learning, reward shaping matters.
    """
    T, n = R.shape
    W = np.ones(T, dtype=float)
    w = np.zeros((T, n), dtype=float)
    turnover = np.zeros(T, dtype=float)
    gross = np.zeros(T, dtype=float)
    net = np.zeros(T, dtype=float)

    # Simple but nontrivial deterministic signal:
    # - cross-sectional momentum over 20 days
    # - mean-reversion over 5 days
    look_mom, look_mr = 20, 5

    # Initialize weights at t=0
    w[0] = np.zeros(n)

    for t in range(1, T):
        # compute signals with safe lookbacks
        t0_mom = max(0, t - look_mom)
        t0_mr  = max(0, t - look_mr)

        mom = np.sum(R[t0_mom:t], axis=0)
        mr  = -np.sum(R[t0_mr:t], axis=0)

        # regime-aware mixing coefficient (more defensive in stressed regime)
        # regimes: 0 calm, 1 stressed (in this chapter still 2-state)
        # NOTE: `regimes` is the module-level array created in Cell 2; it is not a function argument.
        reg = regimes[t]
        mix = 0.70 if reg == 0 else 0.35

        raw = mix * mom + (1.0 - mix) * mr

        # rank-based long/short construction
        # take top2 long, bottom2 short (n=4)
        idx = np.argsort(raw)
        longs = idx[-2:]
        shorts = idx[:2]

        w_t = np.zeros(n)
        w_t[longs] = 0.50
        w_t[shorts] = -0.50

        # liquidity-aware scaling (shrink risk when liquidity low)
        liq = float(liq_path[t])
        liq = max(liq, cfg.gov.min_liquidity_multiplier)
        w_t *= liq

        # enforce gross leverage cap (governance pre-control layer)
        w_t = normalize_to_gross_cap(w_t, cfg.gov.max_gross_leverage)

        # compute turnover
        dw = w_t - w[t-1]
        turnover[t] = float(np.sum(np.abs(dw)))

        # update wealth (self-financing, frictionless here; costs applied later)
        # gross return: w_{t-1}·R_t
        port_ret = float(np.dot(w[t-1], R[t]))
        W[t] = W[t-1] * (1.0 + port_ret)

        w[t] = w_t
        gross[t] = float(np.sum(np.abs(w_t)))
        net[t] = float(np.sum(w_t))

    return {"W": W, "w": w, "turnover": turnover, "gross": gross, "net": net}

pf = portfolio_path(R, liq_path, cfg)
W = pf["W"]
w = pf["w"]
turnover = pf["turnover"]
gross = pf["gross"]
net = pf["net"]

print({
    "wealth_end": float(W[-1]),
    "gross_mean": float(np.mean(gross[1:])),
    "turnover_mean": float(np.mean(turnover[1:])),
    "net_mean": float(np.mean(net[1:])),
})


{'wealth_end': 0.9998195634576709, 'gross_mean': 1.7618647458983592, 'turnover_mean': 0.4665066026410565, 'net_mean': 0.0}


##4.INSTITUTIONAL EXECUTION COST MODELS

###4.1.OVERVIEW

**Cell 4 — Execution and trading cost engine**  
This cell introduces the main realism filter: trading is not free. It takes the changes in portfolio exposures from Cell 3 and converts them into explicit costs. The notebook treats costs as structured, not as a single flat fee. It includes spread-like frictions that rise with trading intensity, and impact-like terms that become increasingly punitive as you trade more aggressively. It also allows costs to scale with market conditions such as volatility and liquidity, because the same trade is not equally easy in calm markets and stressed markets.  

Pedagogically, the purpose is to separate “opportunity” from “feasibility.” A strategy may look profitable in a frictionless world but become untradeable once you account for costs. This is one of the most common sources of overconfidence in backtests, and it is also one of the most common ways that naive AI systems learn to do unrealistic things. If costs are omitted, an optimizer can exploit rapid flipping, extreme rebalancing, or micro-timing artifacts that would never survive real execution. By computing costs as a time series, this cell makes the cost burden visible: you can see when costs spike, whether they cluster in stressed regimes, and whether certain styles of behavior are simply too expensive.  

By the end of this cell you obtain a “cost object”: the total cost per time step and its internal breakdown (for example, spread contribution and impact contributions). This object is later used in two ways. First, it becomes a penalty term inside the surrogate objective, making the objective sensitive to tradability. Second, it becomes a diagnostic in its own right, allowing you to measure cost intensity and identify capacity cliffs. The outcome is a disciplined bridge between portfolio actions and real-world implementation constraints.  
A key concept learned here is convexity: doubling turnover can more than double costs, which creates hidden nonlinear risk for fast-trading policies. Liquidity scaling teaches that the same behavior can be acceptable in calm conditions but unacceptable in stress. Finally, exporting cost summaries supports review: you can justify whether the objective punished behavior for good reasons.
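
The convexity and liquidity-scaling points can be checked with a small sketch. The coefficients below are hypothetical and chosen only to make the effect visible; the notebook's actual values live in `cfg.cost`. The function `impact_cost` is an illustrative stand-in that omits the spread term and the volatility scaling.

```python
import numpy as np

# Hypothetical coefficients, chosen only to make the convexity visible.
A_LIN, B_QUAD, C_CUBIC = 0.0005, 0.002, 0.01

def impact_cost(dw: np.ndarray, inv_liq: float = 1.0) -> float:
    """Linear + quadratic + cubic cost of a weight change, scaled by inverse liquidity."""
    abs_dw = np.abs(dw)
    core = A_LIN * abs_dw.sum() + B_QUAD * (dw ** 2).sum() + C_CUBIC * (abs_dw ** 3).sum()
    return float(core * inv_liq)

dw = np.array([0.25, -0.25, 0.10, -0.10])
base = impact_cost(dw)
doubled = impact_cost(2.0 * dw)                 # same trade direction, twice the size
stressed = impact_cost(dw, inv_liq=1.0 / 0.5)   # liquidity multiplier falls to 0.5

print({"ratio_doubled": doubled / base, "ratio_stressed": stressed / base})
```

Doubling the trade doubles the linear term but multiplies the quadratic term by four and the cubic term by eight, so the total ratio exceeds two. Halving the liquidity multiplier simply doubles the cost of the unchanged trade.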


###4.2.CODE AND IMPLEMENTATION

In [8]:
# CELL 4 — Institutional Execution Cost Model (Spread + Convex Impact + Vol/Liquidity Scaling)

def execution_costs(w: np.ndarray, sigma2: np.ndarray, liq_path: np.ndarray, cfg: RunConfig) -> Dict[str, np.ndarray]:
    """
    Institutional execution model:
    cost_t = spread_cost + linear + quadratic + cubic impact.
    All four terms scale with inverse liquidity; the three impact terms also scale with (1 + vol_scale * volatility).
    """
    T, n = w.shape
    a = cfg.cost.a_lin
    b = cfg.cost.b_quad
    c = cfg.cost.c_cubic
    spread = cfg.cost.spread_bps * 1e-4  # bps -> decimal
    vol_scale = cfg.cost.vol_scale

    cost = np.zeros(T, dtype=float)
    cost_lin = np.zeros(T, dtype=float)
    cost_quad = np.zeros(T, dtype=float)
    cost_cubic = np.zeros(T, dtype=float)
    cost_spread = np.zeros(T, dtype=float)

    for t in range(1, T):
        dw = w[t] - w[t-1]
        # volatility proxy: sqrt(sigma2)
        v = math.sqrt(max(1e-18, float(sigma2[t])))
        # liquidity scaling: low liquidity => higher cost
        liq = max(float(liq_path[t]), cfg.gov.min_liquidity_multiplier)
        inv_liq = 1.0 / liq

        abs_dw = np.abs(dw)
        # spread proportional to turnover (approx)
        cost_spread[t] = float(spread * np.sum(abs_dw) * inv_liq)
        # linear, quadratic, cubic impacts (convex)
        cost_lin[t] = float(a * np.sum(abs_dw) * (1.0 + vol_scale * v) * inv_liq)
        cost_quad[t] = float(b * np.sum(dw * dw) * (1.0 + vol_scale * v) * inv_liq)
        cost_cubic[t] = float(c * np.sum(abs_dw ** 3) * (1.0 + vol_scale * v) * inv_liq)

        cost[t] = cost_spread[t] + cost_lin[t] + cost_quad[t] + cost_cubic[t]

    return {
        "cost": cost,
        "cost_spread": cost_spread,
        "cost_lin": cost_lin,
        "cost_quad": cost_quad,
        "cost_cubic": cost_cubic
    }

cx = execution_costs(w, sigma2, liq_path, cfg)
cost = cx["cost"]

print({
    "avg_cost": float(np.mean(cost[1:])),
    "p95_cost": float(np.quantile(cost[1:], 0.95)),
    "max_cost": float(np.max(cost[1:])),
})


{'avg_cost': 0.001571251214834909, 'p95_cost': 0.007161968966095024, 'max_cost': 0.01483423272718156}


##5.RISK SYSTEM

###5.1.OVERVIEW

**Cell 5 — Risk system and stability measures**  
This cell measures the risk profile implied by the trading trajectory. It converts the wealth path and portfolio returns into interpretable risk diagnostics: how variable the strategy is, how deep it can fall from its peak, and how bad outcomes look in the tail of the distribution. The key is that risk is treated as path-dependent, not as a single average number. A strategy that looks fine on average can still be unacceptable if it experiences deep drawdowns or concentrated bursts of volatility.  

Pedagogically, this cell teaches that risk in trading is not only “volatility.” Institutions care about drawdowns because drawdowns reflect survivability, investor patience, and risk limits. Institutions care about tail outcomes because rare events are what break funding and trigger forced deleveraging. The notebook also includes measures of concentration, because concentration risk is another way that strategies fail quietly: performance may be driven by a few large bets rather than by a stable mechanism. The risk system therefore produces multiple lenses on the same behavior, so later you can decide which lens belongs inside the surrogate objective and which lens belongs in reporting and governance.  

By the end of this cell you obtain a “risk object”: time series of volatility-like estimates, the drawdown series, a maximum drawdown statistic, and tail summaries that describe how severe the worst outcomes are. You also obtain concentration indicators that help you interpret whether the strategy is behaving like a diversified mechanism or like a narrow gamble. These outputs serve two roles in the rest of the notebook. They become inputs to penalties in the surrogate objective, and they become audit metrics that allow you to interpret the consequences of any reward design. The outcome is a disciplined risk measurement layer that makes the objective economically grounded rather than purely performance driven.  
An additional lesson is consistency: risk is estimated using a procedure that updates through time, rather than recalculating with hindsight. This mirrors how real desks monitor risk. By tying risk measures to regimes, you can see whether the objective responds appropriately to stress.
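
The idea of risk that "updates through time" can be sketched with two recursive measures: drawdown from the running peak and an EWMA volatility estimate. The wealth path below is hypothetical, and the function names are illustrative rather than the notebook's own.

```python
import math

import numpy as np

def running_drawdown(wealth: np.ndarray) -> np.ndarray:
    """Drawdown from the running peak, using only information available up to each step."""
    peak = np.maximum.accumulate(wealth)
    return (peak - wealth) / peak

def ewma_vol(returns: np.ndarray, lam: float = 0.94) -> np.ndarray:
    """Recursive volatility estimate: var_t = lam * var_{t-1} + (1 - lam) * r_t^2."""
    var = 0.0
    out = np.zeros(len(returns))
    for t, r in enumerate(returns):
        var = lam * var + (1.0 - lam) * r * r
        out[t] = math.sqrt(var)
    return out

# Hypothetical wealth path: rises, falls 20% from its peak, partially recovers.
wealth = np.array([1.00, 1.10, 1.25, 1.00, 1.05])
returns = wealth[1:] / wealth[:-1] - 1.0

dd_toy = running_drawdown(wealth)
vol_toy = ewma_vol(returns)
print({"max_drawdown": float(dd_toy.max()), "last_ewma_vol": float(vol_toy[-1])})
```

Neither function looks ahead: the value at step `t` depends only on data up to `t`, which is the property the paragraph above describes.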


###5.2.CODE AND IMPLEMENTATION

In [9]:
# CELL 5 — Risk System (EWMA Covariance, Realized Vol, Drawdown, CVaR/ES, Concentration)

def ewma_cov(R: np.ndarray, lam: float) -> np.ndarray:
    """
    EWMA covariance estimator:
    S_t = lam S_{t-1} + (1-lam) r_t r_t'
    Returns full path S[t,:,:]
    """
    T, n = R.shape
    S = np.zeros((T, n, n), dtype=float)
    # initialize with sample cov of up to 50 early observations (R[1:m+1]), or a small scaled identity if too short
    m = min(50, T-1)
    if m >= 10:
        X = R[1:m+1] - np.mean(R[1:m+1], axis=0)
        S0 = (X.T @ X) / max(1, (m - 1))
    else:
        S0 = np.eye(n) * 1e-6
    S[0] = S0
    for t in range(1, T):
        rt = R[t].reshape(-1, 1)
        S[t] = lam * S[t-1] + (1.0 - lam) * (rt @ rt.T)
    return S

def path_drawdown(W: np.ndarray) -> Tuple[np.ndarray, np.ndarray, float]:
    T = W.shape[0]
    dd = np.zeros(T, dtype=float)
    dd_inc = np.zeros(T, dtype=float)
    peak = float(W[0])
    for t in range(1, T):
        peak = max(peak, float(W[t]))
        dd[t] = (peak - float(W[t])) / max(1e-18, peak)
        if dd[t] > dd[t-1]:
            dd_inc[t] = dd[t] - dd[t-1]
    return dd, dd_inc, float(np.max(dd))

def portfolio_returns_from_weights(R: np.ndarray, w: np.ndarray) -> np.ndarray:
    # realized portfolio return series using lagged weights
    T = R.shape[0]
    pr = np.zeros(T, dtype=float)
    for t in range(1, T):
        pr[t] = float(np.dot(w[t-1], R[t]))
    return pr

def cvar_es(x: np.ndarray, alpha: float) -> Tuple[float, float]:
    # left-tail CVaR/ES on returns (loss tail): take quantile at alpha
    # ES = E[x | x <= VaR_alpha]
    xs = np.sort(x.copy())
    k = int(max(1, math.floor(alpha * len(xs))))
    var = float(xs[k-1])
    es = float(np.mean(xs[:k]))
    return var, es

# EWMA covariance path (institutional)
LAMBDA_EWMA = 0.97
S = ewma_cov(R, LAMBDA_EWMA)

# portfolio returns and realized vol (rolling)
port_ret = portfolio_returns_from_weights(R, w)

# realized vol via EWMA on portfolio returns (not just raw)
port_vol = np.zeros_like(port_ret)
lam_v = 0.94
for t in range(1, len(port_ret)):
    port_vol[t] = math.sqrt(lam_v * port_vol[t-1]**2 + (1.0 - lam_v) * port_ret[t]**2)

dd, dd_inc, dd_max = path_drawdown(W)

# tail metrics
VaR_01, ES_01 = cvar_es(port_ret[1:], 0.01)
VaR_05, ES_05 = cvar_es(port_ret[1:], 0.05)

# concentration (Herfindahl on absolute weights)
herf = np.zeros(len(port_ret), dtype=float)
for t in range(1, len(port_ret)):
    a = np.abs(w[t])
    s = float(np.sum(a))
    if s <= 1e-18:
        herf[t] = 0.0
    else:
        p = a / s
        herf[t] = float(np.sum(p * p))

risk_snapshot = {
    "dd_max": dd_max,
    "VaR_01": VaR_01, "ES_01": ES_01,
    "VaR_05": VaR_05, "ES_05": ES_05,
    "mean_port_vol": float(np.mean(port_vol[1:])),
    "mean_herfindahl": float(np.mean(herf[1:])),
}

print(json.dumps(risk_snapshot, indent=2, sort_keys=True))


{
  "ES_01": -4.012642182046557e-05,
  "ES_05": -2.4426615756686602e-05,
  "VaR_01": -2.57906259911447e-05,
  "VaR_05": -1.727510251354637e-05,
  "dd_max": 0.0008564431512818341,
  "mean_herfindahl": 0.25,
  "mean_port_vol": 1.1824222576664884e-05
}


##6.SURROGATE REWARD

###6.1.OVERVIEW

**Cell 6 — Surrogate objective and full decomposition**  
This is the central cell of Chapter 1. It defines, in an explicit and auditable way, what the system considers “good performance.” Instead of using an informal phrase like “maximize profit,” it builds a surrogate objective that scores behavior by combining multiple components: realized returns and several penalties that represent risk, losses, and implementation friction. The output is not a mysterious single number. It is a structured decomposition that shows exactly why a score is high or low.  

Pedagogically, the message is simple: an agent will optimize the objective you implement, not the objective you intend in your head. If the objective ignores costs, the system will favor excessive trading. If the objective ignores drawdowns, the system may accept strategies that blow up occasionally. If the objective ignores leverage, the system may prefer unstable exposure. The surrogate objective is therefore the most important design decision. It is the contract that links economic intent to algorithmic behavior.  

This cell also introduces governance as part of the objective rather than as a vague afterthought. When predefined limits are breached, the objective applies a hard penalty. This teaches that institutions do not merely “prefer” certain behavior; they enforce boundaries. The governance penalty creates a clear signal that some trajectories are unacceptable regardless of how good they look on other metrics.  

By the end of this cell you obtain the main deliverable of the chapter: a surrogate scoring object that includes time series for each reward component and a total reward series. You also obtain a single aggregated score for the entire run, which is the quantity an optimizer would later try to maximize. Crucially, you can now audit the score: you can see whether returns dominated, whether costs dominated, whether drawdown penalties dominated, and whether governance breaches were frequent. The goal achieved here is objective specification with full transparency.  
A final point is comparability. The objective compresses the path into one score so policies can be ranked consistently. This same surrogate will later be used for learning in Chapter 2, so precision here prevents confusion.
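
As a rough illustration of why the decomposition matters, the sketch below scores a single time step twice: once within limits and once with a turnover breach. All penalty weights, the turnover limit, and the function name `step_reward` are hypothetical; only the penalty size of 50.0 mirrors the constant used in the cell that follows.

```python
from typing import Dict

# Hypothetical penalty weights and limit; the notebook's actual values live in cfg.
LAMBDAS = {"vol": 0.7, "dd": 1.6, "cost": 1.0, "turn": 0.05, "lev": 0.01}
MAX_TURNOVER = 1.0
GOV_PENALTY = 50.0

def step_reward(ret: float, vol: float, dd_inc: float, cost: float,
                turnover: float, gross: float) -> Dict[str, float]:
    """Score one time step and return every component, not just the total."""
    parts = {
        "return": ret,
        "vol_pen": LAMBDAS["vol"] * vol,
        "dd_pen": LAMBDAS["dd"] * dd_inc,
        "cost_pen": LAMBDAS["cost"] * cost,
        "turn_pen": LAMBDAS["turn"] * turnover,
        "lev_pen": LAMBDAS["lev"] * gross ** 2,
        "gov_pen": GOV_PENALTY if turnover > MAX_TURNOVER else 0.0,
    }
    penalties = sum(v for k, v in parts.items() if k != "return")
    parts["total"] = parts["return"] - penalties
    return parts

calm = step_reward(ret=0.001, vol=0.01, dd_inc=0.0, cost=0.0005, turnover=0.4, gross=1.5)
breach = step_reward(ret=0.001, vol=0.01, dd_inc=0.0, cost=0.0005, turnover=1.2, gross=1.5)
print({"calm_total": round(calm["total"], 4), "breach_total": round(breach["total"], 4)})
```

A single breach moves the step score by roughly the full penalty, which dwarfs every other term at these scales. When reading an aggregated score, it is therefore worth checking the governance component first.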


###6.2.CODE AND IMPLEMENTATION

In [10]:
# CELL 6 — Surrogate Reward (Full Decomposition + Discounted Utility + Governance Penalty Channel)

def surrogate_reward(
    port_ret: np.ndarray,
    port_vol: np.ndarray,
    dd_inc: np.ndarray,
    cost: np.ndarray,
    turnover: np.ndarray,
    gross: np.ndarray,
    cfg: RunConfig
) -> Dict[str, np.ndarray]:
    """
    r_t = R_t
        - λ_vol * vol_t
        - λ_dd * ΔDD_t
        - λ_cost * cost_t
        - λ_turn * turnover_t
        - λ_lev * gross_t^2
        - governance_penalty(t)
    governance_penalty triggers if hard thresholds are breached.
    """
    T = len(port_ret)
    lam = cfg.reward
    gov = cfg.gov

    r = np.zeros(T, dtype=float)
    comp = {
        "return": np.zeros(T, dtype=float),
        "vol_pen": np.zeros(T, dtype=float),
        "dd_pen": np.zeros(T, dtype=float),
        "cost_pen": np.zeros(T, dtype=float),
        "turn_pen": np.zeros(T, dtype=float),
        "lev_pen": np.zeros(T, dtype=float),
        "gov_pen": np.zeros(T, dtype=float),
        "total": r,
    }

    # governance penalty: hard, discontinuous (institutional reality)
    GOV_PENALTY = 50.0  # large enough to dominate (unitless proxy)
    for t in range(1, T):
        comp["return"][t] = port_ret[t]
        comp["vol_pen"][t] = lam.lambda_vol * port_vol[t]
        comp["dd_pen"][t] = lam.lambda_dd * dd_inc[t]
        comp["cost_pen"][t] = lam.lambda_cost * cost[t]
        comp["turn_pen"][t] = lam.lambda_turn * turnover[t]
        comp["lev_pen"][t] = lam.lambda_lev * (gross[t] ** 2)

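        # hard breaches (note: `dd` is read from notebook scope, computed in Cell 5;
        # it is not a parameter of this function):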
        breach = False
        if gross[t] > gov.max_gross_leverage + 1e-12:
            breach = True
        if turnover[t] > gov.max_turnover + 1e-12:
            breach = True
        if dd[t] > gov.max_drawdown + 1e-12:
            breach = True

        comp["gov_pen"][t] = GOV_PENALTY if breach else 0.0

        r[t] = (
            comp["return"][t]
            - comp["vol_pen"][t]
            - comp["dd_pen"][t]
            - comp["cost_pen"][t]
            - comp["turn_pen"][t]
            - comp["lev_pen"][t]
            - comp["gov_pen"][t]
        )

    return comp

comp = surrogate_reward(port_ret, port_vol, dd_inc, cost, turnover, gross, cfg)

# discounted surrogate utility
gamma = cfg.reward.gamma
disc = gamma ** np.arange(len(port_ret), dtype=float)
U = float(np.sum(disc * comp["total"]))

# component contribution ratios (institutional: scale awareness)
abs_total = float(np.sum(np.abs(comp["total"][1:]))) + 1e-18
contrib = {k: float(np.sum(np.abs(v[1:])))/abs_total for k, v in comp.items() if k != "total"}

print({
    "discounted_surrogate_utility": U,
    "component_abs_share": {k: round(v, 4) for k, v in sorted(contrib.items())},
})


{'discounted_surrogate_utility': -16614.246447318386, 'component_abs_share': {'cost_pen': 0.0001, 'dd_pen': 0.0, 'gov_pen': 0.9592, 'lev_pen': 0.0229, 'return': 0.0, 'turn_pen': 0.0178, 'vol_pen': 0.0}}


##7.LAMBDA HYPERGRID

###7.1.OVERVIEW

**Cell 7 — Sensitivity analysis over proxy weights**  
This cell asks a disciplined question: if you slightly change the penalty weights inside the surrogate objective, does your conclusion change dramatically? Instead of trusting one arbitrary set of weights, you evaluate a grid of settings and record how the surrogate score and key risk outcomes respond. The result is a landscape that shows how the proxy behaves as a design object.  

Pedagogically, this teaches that objectives can be fragile. In fragile objectives, small changes in weights flip the ranking of policies, or create extreme behavior where the objective becomes dominated by one term. For example, if a cost penalty is too light, the objective may reward high turnover. If it is too heavy, the objective may reward inactivity. Sensitivity analysis reveals whether there is a stable region where the objective meaningfully balances return, risk, and feasibility. It also exposes “knife-edge” designs where you can make the objective say almost anything by tweaking a single number. Institutions do not accept knife-edge surrogates, because they are easy to game and hard to defend.  

By the end of this cell you obtain a structured table of outcomes indexed by proxy parameters. For each grid point you have the surrogate score, drawdown severity, average volatility, average leverage, cost intensity, turnover intensity, and a governance breach rate. This is much more informative than a single “best” result, because it tells you whether the objective is robust and whether governance limits matter across the design space.  

Most importantly, this cell produces evidence for decision-making. If a reviewer asks why you chose a particular balance between drawdown control and return seeking, you can show the sensitivity surface and justify that your selection is not arbitrary. The goal achieved here is robustness analysis of the surrogate objective, which is a core requirement before you allow any optimization algorithm to push hard against the proxy.  
This cell also introduces transparent trade-offs. Instead of hiding preferences, you can use a stated scoring rule to pick candidates and then check that they are not pathological. That practice mirrors institutional selection committees directly.
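
A minimal sketch of a weight grid over a fixed path follows. The synthetic series, the two-dimensional grid, and the function `utility` are assumptions made for illustration; discounting and the other penalty terms are left out.

```python
import itertools

import numpy as np

# Hypothetical per-step series standing in for return, drawdown-increment and cost paths.
rng = np.random.default_rng(0)
ret = rng.normal(0.0005, 0.01, size=250)
dd_inc_toy = np.clip(-ret, 0.0, None) * 0.5
cost_toy = np.full(250, 0.0004)

def utility(lam_dd: float, lam_cost: float) -> float:
    """Undiscounted surrogate utility for one (lambda_dd, lambda_cost) pair."""
    return float(np.sum(ret - lam_dd * dd_inc_toy - lam_cost * cost_toy))

grid_dd = [0.8, 1.6, 3.2]
grid_cost = [0.5, 1.0, 2.5]
toy_surface = [
    {"lambda_dd": ld, "lambda_cost": lc, "U": utility(ld, lc)}
    for ld, lc in itertools.product(grid_dd, grid_cost)
]

# With a fixed path, U is linear and decreasing in each penalty weight,
# so the argmax sits at the smallest weights on the grid.
best = max(toy_surface, key=lambda s: s["U"])
print(best["lambda_dd"], best["lambda_cost"])
```

One caveat worth keeping in mind: when the trading path is computed once and only the weights vary, the highest-utility point is necessarily the corner with the lightest penalties. The grid then says more about the relative scale of each term than about which weights are "best", which is why secondary metrics and a stated selection rule matter.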


###7.2.CODE AND IMPLEMENTATION

In [11]:
# CELL 7 — Lambda Hypergrid Sensitivity (Multi-Metric Surface, Not Just "Total Reward")

def evaluate_lambda_point(
    lam_vol: float, lam_dd: float, lam_cost: float,
    base: Dict[str, np.ndarray],
    port_ret: np.ndarray, port_vol: np.ndarray, dd_inc: np.ndarray,
    cost: np.ndarray, turnover: np.ndarray, gross: np.ndarray,
    dd: np.ndarray, cfg: RunConfig
) -> Dict[str, float]:
    # Recompute total reward under a varied lambda triplet (lambda_turn, lambda_lev, gamma stay fixed).
    # Note: `base` is accepted but not used below.
    lam_turn = cfg.reward.lambda_turn
    lam_lev  = cfg.reward.lambda_lev
    gamma = cfg.reward.gamma
    GOV_PENALTY = 50.0

    T = len(port_ret)
    r = np.zeros(T, dtype=float)
    gov_breaches = 0
    for t in range(1, T):
        breach = (gross[t] > cfg.gov.max_gross_leverage) or (turnover[t] > cfg.gov.max_turnover) or (dd[t] > cfg.gov.max_drawdown)
        gov_breaches += 1 if breach else 0
        gov_pen = GOV_PENALTY if breach else 0.0
        r[t] = (
            port_ret[t]
            - lam_vol * port_vol[t]
            - lam_dd  * dd_inc[t]
            - lam_cost* cost[t]
            - lam_turn* turnover[t]
            - lam_lev * (gross[t]**2)
            - gov_pen
        )
    disc = (gamma ** np.arange(T, dtype=float))
    U = float(np.sum(disc * r))

    # institutional metrics
    # - robustness proxy: utility per unit drawdown (avoid divide by 0)
    robustness = U / (1e-12 + float(np.max(dd)))
    # - cost intensity
    cost_ratio = float(np.sum(cost[1:])) / (1e-18 + float(np.sum(np.abs(port_ret[1:]))))
    # - turnover intensity
    turn_mean = float(np.mean(turnover[1:]))

    return {
        "U": U,
        "max_dd": float(np.max(dd)),
        "mean_vol": float(np.mean(port_vol[1:])),
        "mean_gross": float(np.mean(gross[1:])),
        "mean_turnover": turn_mean,
        "cost_ratio": cost_ratio,
        "gov_breach_rate": float(gov_breaches) / max(1, (T-1)),
        "robustness_proxy": robustness
    }

# Hypergrid (institutional: sparse but informative; extend as needed)
grid_vol  = [0.35, 0.70, 1.20, 2.00]
grid_dd   = [0.80, 1.60, 2.20, 3.20]
grid_cost = [0.50, 1.00, 1.70, 2.50]

surface: List[Dict[str, Any]] = []
for lv in grid_vol:
    for ld in grid_dd:
        for lc in grid_cost:
            m = evaluate_lambda_point(lv, ld, lc, comp, port_ret, port_vol, dd_inc, cost, turnover, gross, dd, cfg)
            m.update({"lambda_vol": lv, "lambda_dd": ld, "lambda_cost": lc})
            surface.append(m)

# Identify efficient points (Pareto-ish): maximize U, minimize max_dd and gov_breach_rate
# Simple scalarization for reporting (institutional transparency)
for s in surface:
    s["score"] = s["U"] - 10.0*s["max_dd"] - 5.0*s["gov_breach_rate"] - 0.5*s["mean_turnover"]

best_by_score = max(surface, key=lambda x: x["score"])
best_by_U = max(surface, key=lambda x: x["U"])

print({
    "surface_points": len(surface),
    "best_by_score": {k: best_by_score[k] for k in ["lambda_vol","lambda_dd","lambda_cost","U","max_dd","gov_breach_rate","score"]},
    "best_by_U": {k: best_by_U[k] for k in ["lambda_vol","lambda_dd","lambda_cost","U","max_dd","gov_breach_rate"]},
})


{'surface_points': 64, 'best_by_score': {'lambda_vol': 0.35, 'lambda_dd': 0.8, 'lambda_cost': 0.5, 'U': -16613.34423611583, 'max_dd': 0.0008564431512818341, 'gov_breach_rate': 0.2769107643057223, 'score': -16614.97060767019}, 'best_by_U': {'lambda_vol': 0.35, 'lambda_dd': 0.8, 'lambda_cost': 0.5, 'U': -16613.34423611583, 'max_dd': 0.0008564431512818341, 'gov_breach_rate': 0.2769107643057223}}


##8.STRUCTURAL DIAGNOSTICS

###8.1.OVERVIEW

**Cell 8 — Structural diagnostics and failure cliffs**  
This cell interprets the sensitivity results and asks whether the surrogate objective behaves like a stable, defensible design or like a fragile scoring trick. It computes diagnostics that summarize stability: whether the best settings cluster in a region or jump around, whether performance changes smoothly or abruptly, and whether there are “failure cliffs” where the score looks attractive but the behavior violates governance limits or produces large losses.  

Pedagogically, this teaches you to distrust raw optimization outcomes. A high score is not automatically good. The correct question is: what kind of behavior produced the score, and would that behavior survive institutional review? Failure cliffs are especially important. They are regions where the objective can be fooled: it rewards something that looks good on paper while hiding a serious risk or infeasibility. In practice, such cliffs show up when the cost model is too forgiving, when drawdown penalties are too weak, or when leverage penalties are too light. The diagnostic layer helps you identify these regions before you introduce learning, because learning will actively search for them.  

This cell also introduces the idea of explaining sensitivity with simple summaries rather than drowning in tables. Instead of giving a reviewer hundreds of grid points, you extract decision-relevant signals: stability of the top candidates, signs of sharp transitions, and examples of configurations that look strong but breach constraints. These diagnostics are the bridge between engineering detail and governance conversation.  

By the end of this cell you obtain a “stability object” that contains the key findings: how stable the best configurations are, how sensitive the objective is to each penalty weight, and how many high-scoring configurations are actually unacceptable due to drawdowns or breaches. The goal achieved here is early risk identification for the proxy itself. You are not only measuring the strategy; you are measuring the reliability of the objective that will later define what the strategy is allowed to learn.  
Finally, the diagnostics provide a compact story for committees: which penalties matter most, which trade-offs are steep, and where the objective becomes unreliable operationally.
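
Two of these diagnostics, linear sensitivity of the score to each weight and failure-cliff detection, can be sketched on a tiny hypothetical surface. The four grid points, the median cutoff, and the breach tolerance are illustrative assumptions, and `np.linalg.lstsq` is used here in place of a ridge-stabilized normal-equation solve.

```python
import numpy as np

# Hypothetical sensitivity surface: one dict per grid point (illustrative numbers only).
toy_surface = [
    {"lambda_dd": 0.8, "lambda_cost": 0.5, "U": -1.0, "gov_breach_rate": 0.20},
    {"lambda_dd": 0.8, "lambda_cost": 2.5, "U": -1.8, "gov_breach_rate": 0.20},
    {"lambda_dd": 3.2, "lambda_cost": 0.5, "U": -2.2, "gov_breach_rate": 0.01},
    {"lambda_dd": 3.2, "lambda_cost": 2.5, "U": -3.0, "gov_breach_rate": 0.01},
]

# Least-squares fit: U ~ b_dd * lambda_dd + b_cost * lambda_cost + intercept.
X_toy = np.array([[s["lambda_dd"], s["lambda_cost"], 1.0] for s in toy_surface])
y_toy = np.array([s["U"] for s in toy_surface])
beta_toy, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# Failure cliff: utility at or above the median, yet the breach rate exceeds a tolerance.
BREACH_TOL = 0.05  # hypothetical tolerance
u_cut = float(np.median(y_toy))
toy_cliffs = [s for s in toy_surface
              if s["U"] >= u_cut and s["gov_breach_rate"] > BREACH_TOL]

print({
    "dU_dlambda_dd": float(beta_toy[0]),
    "dU_dlambda_cost": float(beta_toy[1]),
    "cliffs": len(toy_cliffs),
})
```

In this toy surface the two highest-scoring points are also the two that breach the tolerance, which is the pattern the text describes as a failure cliff.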


###8.2.CODE AND IMPLEMENTATION

In [12]:
# CELL 8 — Structural Diagnostics (Argmax Stability, Curvature Proxy, Elasticities, Failure Cliff Detection)

def argmax_stability(surface: List[Dict[str, Any]], key: str = "U") -> Dict[str, Any]:
    # Stability of optimizer under small perturbations: look at top-k dispersion in lambda space
    S = sorted(surface, key=lambda x: x[key], reverse=True)
    top = S[:10]
    # compute pairwise distances among top points
    dists = []
    for i in range(len(top)):
        for j in range(i+1, len(top)):
            di = abs(top[i]["lambda_vol"] - top[j]["lambda_vol"])
            dj = abs(top[i]["lambda_dd"]  - top[j]["lambda_dd"])
            dk = abs(top[i]["lambda_cost"]- top[j]["lambda_cost"])
            dists.append(di + dj + dk)
    return {
        "top_k": len(top),
        "top_key": key,
        "mean_pairwise_L1_dist": float(np.mean(dists)) if dists else 0.0,
        "min_pairwise_L1_dist": float(np.min(dists)) if dists else 0.0,
        "max_pairwise_L1_dist": float(np.max(dists)) if dists else 0.0,
        "best_point": {k: top[0][k] for k in ["lambda_vol","lambda_dd","lambda_cost", key, "max_dd", "gov_breach_rate"]},
    }

def local_curvature_proxy(surface: List[Dict[str, Any]], target: Dict[str, Any]) -> Dict[str, float]:
    # crude curvature proxy: compare target to nearest neighbors in grid
    lv, ld, lc = target["lambda_vol"], target["lambda_dd"], target["lambda_cost"]
    neigh = []
    for s in surface:
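        # neighbors at selected L1 distances (heuristic); this is an exact float membership test,
        # so distances outside the listed values (e.g. a single lambda_vol step of 0.35) are skipped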
        if (abs(s["lambda_vol"]-lv) + abs(s["lambda_dd"]-ld) + abs(s["lambda_cost"]-lc)) in (0.5, 0.8, 1.0, 1.2, 1.5, 2.0):
            neigh.append(s)
    if not neigh:
        return {"curvature_proxy": float("nan"), "neighbors": 0}
    vals = np.array([s["U"] for s in neigh], dtype=float)
    curv = float(target["U"] - np.mean(vals))
    return {"curvature_proxy": curv, "neighbors": len(neigh)}

# Sensitivities: linear slopes of U with respect to each lambda (regression-based; not true elasticities)
X = np.array([[s["lambda_vol"], s["lambda_dd"], s["lambda_cost"], 1.0] for s in surface], dtype=float)
y = np.array([s["U"] for s in surface], dtype=float)
# OLS via normal equations (small dimension, stable enough with ridge)
ridge = 1e-6
XtX = X.T @ X + ridge * np.eye(X.shape[1])
beta = np.linalg.solve(XtX, X.T @ y)

stU = argmax_stability(surface, "U")
stScore = argmax_stability(surface, "score")
curv = local_curvature_proxy(surface, best_by_score)

# Failure cliff detection: points with high U but breach governance or large DD
cliffs = [s for s in surface if (s["U"] > np.quantile(y, 0.90) and (s["gov_breach_rate"] > 0.05 or s["max_dd"] > 0.30))]

diagnostics = {
    "ols_U_sensitivity": {"dU_dlambda_vol": float(beta[0]), "dU_dlambda_dd": float(beta[1]), "dU_dlambda_cost": float(beta[2]), "intercept": float(beta[3])},
    "argmax_stability_U": stU,
    "argmax_stability_score": stScore,
    "local_curvature_proxy_best_score": curv,
    "failure_cliff_count": len(cliffs),
    "failure_cliff_examples": [
        {k: s[k] for k in ["lambda_vol","lambda_dd","lambda_cost","U","max_dd","gov_breach_rate"]} for s in cliffs[:5]
    ],
}

print(json.dumps(diagnostics, indent=2, sort_keys=True))


{
  "argmax_stability_U": {
    "best_point": {
      "U": -16613.34423611583,
      "gov_breach_rate": 0.2769107643057223,
      "lambda_cost": 0.5,
      "lambda_dd": 0.8,
      "lambda_vol": 0.35,
      "max_dd": 0.0008564431512818341
    },
    "max_pairwise_L1_dist": 3.2500000000000004,
    "mean_pairwise_L1_dist": 1.4133333333333338,
    "min_pairwise_L1_dist": 0.35,
    "top_k": 10,
    "top_key": "U"
  },
  "argmax_stability_score": {
    "best_point": {
      "gov_breach_rate": 0.2769107643057223,
      "lambda_cost": 0.5,
      "lambda_dd": 0.8,
      "lambda_vol": 0.35,
      "max_dd": 0.0008564431512818341,
      "score": -16614.97060767019
    },
    "max_pairwise_L1_dist": 3.2500000000000004,
    "mean_pairwise_L1_dist": 1.4133333333333338,
    "min_pairwise_L1_dist": 0.35,
    "top_k": 10,
    "top_key": "score"
  },
  "failure_cliff_count": 7,
  "failure_cliff_examples": [
    {
      "U": -16613.34423611583,
      "gov_breach_rate": 0.2769107643057223,
      "lambda_co

##9.GOVERNANCE ARTIFACT

###9.1.OVERVIEW

**Cell 9 — Artifact export for audit and review**  
This cell converts the notebook from an interactive exploration into an institutional record. It writes the results to files in a structured format so the run can be reviewed without rerunning the analysis and without relying on screenshots. The artifacts include the run manifest, the frozen configuration hash, summaries of the reward decomposition, the full sensitivity surface, the stability diagnostics, and the risk log.  

Pedagogically, this teaches that in finance, results must be portable and reviewable. A strategy discussion does not happen only between the person who ran the notebook and their own screen. It happens across teams: research, risk, execution, and governance. Those teams need stable artifacts. They need to compare runs, track changes, and reproduce conclusions. Saving artifacts also discourages narrative drift. When you can see the decomposition and the risk metrics in files, it is harder to selectively remember only the flattering parts.  

This cell also reinforces a good habit: export both raw outputs and summaries. Raw outputs support deep forensic review. Summaries support fast decision-making. Together, they provide an audit trail that is usable at different levels of attention. The goal is not “more files.” The goal is a clean package that answers common questions: what happened, why it happened, and whether it violates limits.  

By the end of this cell you obtain a complete artifact bundle stored in a predictable directory. It contains everything needed to reproduce the run’s claims: the identity of the run, the configuration used, the key risk outcomes, and the objective behavior across parameter choices. This is the moment where the notebook becomes production-oriented. You are no longer just experimenting; you are generating evidence. The goal achieved here is institutional auditability: anyone with the artifacts can review, challenge, and validate the run without guessing what you did.  
These artifacts also enable versioning and regression tests. After a code change, you can rerun and compare surfaces and diagnostics to detect unintended shifts. Over time, this supports simple stage gates: promote only if risk stays stable and breach rates do not rise materially.
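
A stage gate of this kind could look roughly like the sketch below. It writes two hypothetical surfaces to a temporary directory, reloads them, and flags grid points whose breach rate rose beyond a tolerance. The `write_json` defined here is a stand-in for the notebook's helper of the same name, and the file names, tolerance, and `breach_rate_regressions` function are assumptions for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, List

def write_json(path: str, obj: Any) -> None:
    """Write an artifact with sorted keys so two runs diff cleanly."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2, sort_keys=True)

def breach_rate_regressions(old: List[Dict[str, float]], new: List[Dict[str, float]],
                            tol: float = 0.01) -> List[int]:
    """Indices of grid points whose breach rate rose by more than tol between runs."""
    return [i for i, (a, b) in enumerate(zip(old, new))
            if b["gov_breach_rate"] - a["gov_breach_rate"] > tol]

# Hypothetical surfaces from a baseline run and a run after a code change.
baseline = [{"U": -1.0, "gov_breach_rate": 0.02}, {"U": -1.5, "gov_breach_rate": 0.03}]
candidate = [{"U": -0.9, "gov_breach_rate": 0.02}, {"U": -1.2, "gov_breach_rate": 0.08}]

with tempfile.TemporaryDirectory() as root:
    for name, surf in (("baseline.json", baseline), ("candidate.json", candidate)):
        write_json(os.path.join(root, name), surf)
    with open(os.path.join(root, "baseline.json"), encoding="utf-8") as f:
        old = json.load(f)
    with open(os.path.join(root, "candidate.json"), encoding="utf-8") as f:
        new = json.load(f)

flagged = breach_rate_regressions(old, new)
print({"promote": not flagged, "flagged_points": flagged})
```

The candidate run scores better on `U` at both points, yet the gate blocks promotion because one breach rate rose. Comparing saved artifacts rather than remembered headline numbers is what makes that visible.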


###9.2.CODE AND IMPLEMENTATION

In [13]:
# CELL 9 — Governance Artifact Export (Run-Complete, Decomposed, Review-Ready)

# Component time series are large; export per-component summary statistics for audit.
def summarize_series(x: np.ndarray) -> Dict[str, float]:
    return {
        "mean": float(np.mean(x[1:])),
        "std": float(np.std(x[1:])),
        "p05": float(np.quantile(x[1:], 0.05)),
        "p50": float(np.quantile(x[1:], 0.50)),
        "p95": float(np.quantile(x[1:], 0.95)),
        "max": float(np.max(x[1:])),
        "min": float(np.min(x[1:])),
    }

reward_decomp_summary = {k: summarize_series(v) for k, v in comp.items() if k != "total"}
reward_decomp_summary["total"] = summarize_series(comp["total"])

risk_log = {
    "risk_snapshot": risk_snapshot,
    "wealth_end": float(W[-1]),
    "wealth_min": float(np.min(W)),
    "wealth_max": float(np.max(W)),
    "dd_max": dd_max,
    "VaR_01": VaR_01, "ES_01": ES_01,
    "VaR_05": VaR_05, "ES_05": ES_05,
    "mean_turnover": float(np.mean(turnover[1:])),
    "mean_gross": float(np.mean(gross[1:])),
    "gov_thresholds": asdict(cfg.gov),
}

# Save: summaries + surface + diagnostics
write_json(os.path.join(artifact_root, "reward_decomposition.json"), {
    "discounted_surrogate_utility": U,
    "component_abs_share": contrib,
    "series_summary": reward_decomp_summary
})
write_json(os.path.join(artifact_root, "lambda_surface.json"), surface)
write_json(os.path.join(artifact_root, "stability_metrics.json"), diagnostics)
write_json(os.path.join(artifact_root, "risk_log.json"), risk_log)

print({
    "artifact_root": artifact_root,
    "files": sorted(os.listdir(artifact_root)),
})


{'artifact_root': '/mnt/data/artifacts_ch1', 'files': ['config_hash.txt', 'lambda_surface.json', 'reward_decomposition.json', 'risk_log.json', 'run_manifest.json', 'stability_metrics.json']}


##10.AUDIT BUNDLE

###10.1.OVERVIEW

**Cell 10 — Institutional summary and decision-grade output**  
This final cell produces the run’s decision-grade summary. It gathers the most important results into a single structured report that can be read quickly and compared across runs. The report includes the headline surrogate score, the final wealth, key risk statistics, the governance breach rate, and the best-performing proxy settings under a clearly stated selection rule. It also lists the artifact paths so a reviewer can immediately locate the supporting evidence.  

Pedagogically, this cell teaches how to end an experiment responsibly. Many notebooks end with a plot or a triumphant number. Institutional work ends with a compact statement of outcomes, limitations, and where the evidence lives. The summary is designed to prevent overinterpretation. It does not claim that the strategy is profitable in the real world. It claims something narrower and more defensible: given a specified synthetic environment, a specified policy path, and a specified surrogate objective, here is how the objective evaluated the behavior and how risk and governance behaved.  

This cell also clarifies what Chapter 1 has accomplished. You have not built an agent. You have built the objective that will define what an agent tries to do. The summary makes this explicit by highlighting objective-related diagnostics: whether the best proxy settings are stable, whether there are failure cliffs, and how sensitive the objective is to design choices. These are the right outputs to carry into Chapter 2, where learning will be introduced.  

By the end of this cell you obtain a standardized run report that can be archived, compared, and discussed. It is the “cover page” of the artifact bundle. The goal achieved here is closure with accountability: a clear statement of what was produced, what it means, and how it can be verified. This is the professional way to conclude research before moving to optimization.  
The summary also works as a review checklist. If a metric is missing, the run is incomplete. If breach rates are high, reject the run. If stability is weak, revise the proxy before training. This enforces discipline as the project scales across teams.
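
The review checklist can be made mechanical. The sketch below is one possible way to do it, under stated assumptions: the function name `review_summary`, the list of required metrics, and both thresholds are hypothetical, and it assumes that a larger mean pairwise L1 distance among the top-ranked lambda settings indicates weaker stability.

```python
from typing import Any, Dict, List

# Headline keys taken from the Cell 10 summary; treating them as mandatory is an assumption.
REQUIRED_METRICS = [
    "discounted_surrogate_utility", "wealth_end", "max_drawdown",
    "gov_breach_rate_baseline_policy", "ES_05",
]

def review_summary(summary: Dict[str, Any],
                   max_breach_rate: float,
                   max_mean_l1: float) -> List[str]:
    """Apply the three checklist rules and return the findings (empty list = pass)."""
    findings: List[str] = []
    headline = summary.get("headline_metrics", {})

    # Rule 1: a missing metric means the run is incomplete.
    missing = [m for m in REQUIRED_METRICS if m not in headline]
    if missing:
        findings.append(f"INCOMPLETE: missing {missing}")

    # Rule 2: a high breach rate means the run is rejected.
    breach = headline.get("gov_breach_rate_baseline_policy")
    if breach is not None and breach > max_breach_rate:
        findings.append(f"REJECT: breach rate {breach:.3f} > {max_breach_rate:.3f}")

    # Rule 3: weak stability means the proxy should be revised before training.
    by_score = summary.get("argmax_stability", {}).get("by_score", {})
    mean_l1 = by_score.get("mean_pairwise_L1_dist")
    if mean_l1 is not None and mean_l1 > max_mean_l1:
        findings.append(f"REVISE PROXY: top-k spread {mean_l1:.3f} > {max_mean_l1:.3f}")
    return findings

# Hypothetical summary and placeholder thresholds, for illustration only.
example = {
    "headline_metrics": {
        "discounted_surrogate_utility": -1.0,
        "wealth_end": 1.02,
        "max_drawdown": 0.01,
        "gov_breach_rate_baseline_policy": 0.28,
    },
    "argmax_stability": {"by_score": {"mean_pairwise_L1_dist": 1.41}},
}
findings = review_summary(example, max_breach_rate=0.05, max_mean_l1=1.0)
for finding in findings:
    print(finding)
```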

###10.2.CODE AND IMPLEMENTATION

In [14]:
# CELL 10 — Institutional Summary (Decision-Grade Output, No Handwaving)

summary = {
    "run_id": run_id,
    "timestamp_utc": timestamp_utc,
    "config_hash": config_hash,
    "headline_metrics": {
        "discounted_surrogate_utility": U,
        "wealth_end": float(W[-1]),
        "max_drawdown": dd_max,
        "mean_port_vol": float(np.mean(port_vol[1:])),
        "mean_turnover": float(np.mean(turnover[1:])),
        "mean_gross_leverage": float(np.mean(gross[1:])),
        "gov_breach_rate_baseline_policy": float(np.mean(
            ((gross[1:] > cfg.gov.max_gross_leverage) |
             (turnover[1:] > cfg.gov.max_turnover) |
             (dd[1:] > cfg.gov.max_drawdown)).astype(float)
        )),
        "ES_05": ES_05,
        "mean_herfindahl": float(np.mean(herf[1:])),
    },
    "best_lambda_by_score": {k: best_by_score[k] for k in ["lambda_vol","lambda_dd","lambda_cost","U","max_dd","gov_breach_rate","score","robustness_proxy"]},
    "best_lambda_by_utility": {k: best_by_U[k] for k in ["lambda_vol","lambda_dd","lambda_cost","U","max_dd","gov_breach_rate","robustness_proxy"]},
    "argmax_stability": {
        "by_U": diagnostics["argmax_stability_U"],
        "by_score": diagnostics["argmax_stability_score"],
    },
    "proxy_failure_cliffs": {
        "count": diagnostics["failure_cliff_count"],
        "examples": diagnostics["failure_cliff_examples"],
    },
    "audit_artifacts": {
        "run_manifest": os.path.join(artifact_root, "run_manifest.json"),
        "reward_decomposition": os.path.join(artifact_root, "reward_decomposition.json"),
        "lambda_surface": os.path.join(artifact_root, "lambda_surface.json"),
        "stability_metrics": os.path.join(artifact_root, "stability_metrics.json"),
        "risk_log": os.path.join(artifact_root, "risk_log.json"),
    }
}

print(json.dumps(summary, indent=2, sort_keys=True))


{
  "argmax_stability": {
    "by_U": {
      "best_point": {
        "U": -16613.34423611583,
        "gov_breach_rate": 0.2769107643057223,
        "lambda_cost": 0.5,
        "lambda_dd": 0.8,
        "lambda_vol": 0.35,
        "max_dd": 0.0008564431512818341
      },
      "max_pairwise_L1_dist": 3.2500000000000004,
      "mean_pairwise_L1_dist": 1.4133333333333338,
      "min_pairwise_L1_dist": 0.35,
      "top_k": 10,
      "top_key": "U"
    },
    "by_score": {
      "best_point": {
        "gov_breach_rate": 0.2769107643057223,
        "lambda_cost": 0.5,
        "lambda_dd": 0.8,
        "lambda_vol": 0.35,
        "max_dd": 0.0008564431512818341,
        "score": -16614.97060767019
      },
      "max_pairwise_L1_dist": 3.2500000000000004,
      "mean_pairwise_L1_dist": 1.4133333333333338,
      "min_pairwise_L1_dist": 0.35,
      "top_k": 10,
      "top_key": "score"
    }
  },
  "audit_artifacts": {
    "lambda_surface": "/mnt/data/artifacts_ch1/lambda_surface.json",
    

##11.CONCLUSION

**Conclusion: What You Obtain from Chapter 1 and What Goal You Achieved**

This chapter ends in a very concrete place. It is not a “conceptual introduction” that disappears once we move on to learning. Chapter 1 produces a tangible outcome that will be used, inspected, and stress-tested throughout the rest of the program. The goal of the exercise is to manufacture, in an auditable and reproducible way, the **surrogate objective** that a future trading agent will optimize.

So what do we obtain?

We obtain three things that matter, and each has a different level of importance.

**First, we obtain a formal surrogate objective functional.**  
This is the main deliverable. It is a computable, explicit mapping that takes a trading trajectory and returns a scalar score. In practice it takes time series of portfolio behavior (returns, risk measures, drawdown increments, trading costs, turnover, leverage) and produces a **single scalar utility value**. This value is the thing you would maximize if you later introduced optimization. You can call it a “reward,” a “proxy utility,” or a “surrogate objective,” but the meaning is the same: it is the computational stand-in for the true financial goal that cannot be directly optimized.

This is not just a function of one variable. It is a **functional** in the finance and control sense: it evaluates a path. It aggregates across time with discounting. It “judges” a full trading history. That is why it is the proper object for an agent: an agent is not judged on one decision; it is judged on the full sequence of decisions it makes through time.

To be precise in plain language: the notebook produces a piece of code that implements something like

U = Σ γ^t [ return_t − penalties_t − governance_penalty_t ]

and the penalties are decomposed into volatility, drawdown increments, execution costs, turnover, and leverage. This U is the surrogate objective.
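
As a minimal sketch of the formula above, assume the per-step return, penalty, and governance-penalty series are already available as arrays. The function name `discounted_surrogate_utility`, the argument names, and the choice to start discounting at t = 0 are illustrative assumptions; the notebook's own implementation may index time differently.

```python
import numpy as np

def discounted_surrogate_utility(returns, penalties, gov_penalty, gamma):
    """Compute U = sum_t gamma^t * (return_t - penalties_t - governance_penalty_t)."""
    returns = np.asarray(returns, dtype=float)
    penalties = np.asarray(penalties, dtype=float)
    gov_penalty = np.asarray(gov_penalty, dtype=float)

    reward = returns - penalties - gov_penalty      # per-step reward series
    discounts = gamma ** np.arange(len(reward))     # 1, gamma, gamma^2, ...
    return float(np.sum(discounts * reward))

# Hypothetical three-step trajectory, for illustration only.
U = discounted_surrogate_utility(
    returns=[0.01, 0.02, -0.01],
    penalties=[0.001, 0.002, 0.001],   # aggregate of volatility, drawdown, cost, turnover, leverage
    gov_penalty=[0.0, 0.0, 0.005],
    gamma=0.99,
)
print(U)
```

The important design point is that the function consumes whole series and returns one number: it scores a path, not a single decision.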

If you were asked, “What did we build?” the best one-line answer is:

We built the **objective** that defines what “good trading behavior” means in a computable and auditable way.

**Second, we obtain a complete reward decomposition object.**  
This is the second deliverable. It is not just the scalar U. We also obtain the full decomposition into components, stored as arrays and saved as artifacts. This matters because institutional systems do not accept a single opaque score. They require a breakdown: How much did returns contribute? How much did costs subtract? Did drawdown penalties dominate? Were governance penalties triggered?

In other words, we do not only compute “the reward.” We compute the reward **as an explained object**:

- return contribution series
- volatility penalty series
- drawdown-increment penalty series
- execution cost penalty series
- turnover penalty series
- leverage penalty series
- governance penalty series
- total reward series

This decomposition is the bridge between mathematics and governance. It lets you audit and reason about behavior. It is how you prevent self-deception. If the total looks good, but the decomposition shows that costs were ignored or drawdowns were tolerated, you immediately see the problem.

So the outcome is not just “a function.” It is “a function plus its full decomposition,” which is what makes it reviewable.
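
One simple way to read a decomposition is to ask what share of the total absolute reward mass each component carries. The sketch below illustrates that idea; the function name `abs_share` and the normalization (sum of absolute values per component, divided by the sum across components) are assumptions and may differ from how the notebook computes its `component_abs_share`.

```python
import numpy as np
from typing import Dict

def abs_share(components: Dict[str, np.ndarray]) -> Dict[str, float]:
    """Share of total absolute contribution per component (the "total" key is ignored)."""
    totals = {
        name: float(np.sum(np.abs(series)))
        for name, series in components.items()
        if name != "total"
    }
    denom = sum(totals.values())
    if denom == 0.0:
        return {name: 0.0 for name in totals}
    return {name: value / denom for name, value in totals.items()}

# Hypothetical component series, for illustration only.
components = {
    "return": np.array([0.01, -0.02, 0.01]),
    "cost": np.array([-0.005, -0.005, 0.0]),
    "drawdown": np.array([0.0, -0.03, -0.02]),
}
components["total"] = sum(components.values())

shares = abs_share(components)
print(shares)
```

In this toy example the drawdown penalty carries the largest share, which is exactly the kind of fact a single scalar score would hide.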

**Third, we obtain a sensitivity surface over proxy parameters.**  
This is the third deliverable. The notebook does not assume that one set of penalty weights is “the truth.” Instead, it explores a grid over key lambdas (for example: volatility penalty weight, drawdown penalty weight, cost penalty weight) and evaluates the surrogate outcome and risk outcomes across that grid.

This creates a “reward landscape” or “objective surface.”

That surface is important because it answers a question that institutions always ask:

Is your objective stable, or does it flip its preferences dramatically when parameters move slightly?

In a fragile system, tiny changes in penalty weights produce huge changes in what policy looks optimal. That is a warning sign. It tells you that your proxy is not a well-behaved representation of your intended economic goal; it is a brittle scoring function that can be gamed or that depends on narrow assumptions.

So the sensitivity surface is a diagnostic deliverable: it tells you whether your surrogate objective is robust enough to support learning in the next chapter.
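
To make the idea of a surface and its stability concrete, the sketch below builds a tiny lambda grid and measures how far apart the top-ranked settings are in parameter space. Everything here is illustrative: `toy_utility` is a placeholder objective and not the notebook's surrogate, and the helper names are hypothetical. The pairwise L1 distance among the top-k points mirrors the field names in the notebook's diagnostics, but the exact computation used there is an assumption.

```python
import itertools
import numpy as np

LAMBDA_KEYS = ["lambda_vol", "lambda_dd", "lambda_cost"]

def toy_utility(lv, ld, lc):
    # Placeholder objective for illustration; NOT the notebook's surrogate.
    return -(1.0 * lv + 2.0 * ld + 0.5 * lc)

def build_surface(grid):
    """Evaluate the toy objective on the full Cartesian grid of lambdas."""
    return [
        {"lambda_vol": lv, "lambda_dd": ld, "lambda_cost": lc, "U": toy_utility(lv, ld, lc)}
        for lv, ld, lc in itertools.product(grid, grid, grid)
    ]

def argmax_stability(points, key, top_k):
    """Spread (pairwise L1 distance) of the top_k grid points ranked by `key`."""
    ranked = sorted(points, key=lambda p: p[key], reverse=True)[:top_k]
    coords = np.array([[p[k] for k in LAMBDA_KEYS] for p in ranked], dtype=float)
    dists = [float(np.sum(np.abs(a - b))) for a, b in itertools.combinations(coords, 2)]
    return {
        "best_point": ranked[0],
        "top_k": len(ranked),
        "min_pairwise_L1_dist": min(dists) if dists else 0.0,
        "mean_pairwise_L1_dist": float(np.mean(dists)) if dists else 0.0,
        "max_pairwise_L1_dist": max(dists) if dists else 0.0,
    }

toy_surface = build_surface([0.0, 0.5])
stability = argmax_stability(toy_surface, key="U", top_k=3)
print(stability)
```

A small spread among the top settings suggests the objective's preferences are concentrated in one region of the grid; a large spread suggests that near-equivalent scores come from very different penalty weights.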

With those three deliverables in mind, we can now answer the most important question: what goal did we achieve?

The goal of Chapter 1 is **objective specification and objective governance**.

Before you can train an agent, you must be able to say exactly what it is optimizing. And you must be able to defend that objective in terms of economics and institutional risk.

Many people confuse the objective with performance. They try to “get good returns” and assume the objective is obvious. But in any agentic system, the objective is not what you say in English. The objective is what you implement in code.

Chapter 1 ensures that the objective is:

- explicit
- decomposed
- stress-aware
- cost-aware
- governance-aware
- reproducible
- auditable

This is the professional standard.

Now, what does this enable?

It enables two critical next steps, and these are why Chapter 1 is not optional.

**It enables learning (Chapter 2).**  
Reinforcement learning is an optimization engine. It will push hard. It will search for loopholes. If you give it a naive objective, it will find naive solutions. The reason Chapter 1 matters is that it manufactures an objective that is not naive. It includes costs, drawdowns, and constraints. It is designed to reduce the space of reward-hacking solutions.

So the deliverable of Chapter 1 becomes the reward function used in Chapter 2. It is literally the thing that appears inside the Bellman update or policy gradient objective. Without Chapter 1, Chapter 2 would be meaningless or dangerous.

**It enables governance (Chapter 3).**  
Governance is not a separate “overlay.” Governance begins at the objective level. In Chapter 1 we already introduced governance penalties and threshold tracking, and we exported artifacts. In Chapter 3 we will deepen this into regime-aware objectives, hard constraints, stress testing, and feasibility surfaces. But Chapter 3 needs a foundation: a disciplined way of decomposing, logging, and evaluating objectives.

So Chapter 1’s deliverables are also the seed of the governance layer.

At this point, it is worth making the outcome even more concrete by naming the “objects” you have in hand after running the notebook.

When you run Chapter 1, you obtain:

**A synthetic market object**  
This includes arrays for returns, prices, regimes, volatility paths, and liquidity multipliers. It is an engineered environment that can be reproduced exactly with the config hash and seed. This is not the main deliverable of the chapter, but it is the testbed that makes the objective meaningful. In production language, it is the controlled environment in which the objective is defined and evaluated.

**A portfolio trajectory object**  
This includes the weight path, the turnover path, the gross and net exposure paths, and the wealth path. Again, this is not learning; the trajectory is generated by a deterministic policy and is deliberately nontrivial. Why is it valuable? Because the objective must be evaluated on concrete trajectories, not on abstract assumptions. The portfolio trajectory is the “input” on which the objective operates.

**An execution and cost object**  
This includes a cost time series and its decomposition (spread, linear impact, quadratic impact, cubic impact). This matters because costs are not a single number in real systems. Costs have structure and convexity. By computing them explicitly, the objective becomes tied to feasibility rather than fantasy.
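
The convexity point can be illustrated with a polynomial cost model in trade size. This is a sketch under assumptions: the function name `cost_components`, the coefficient names, and all coefficient values are hypothetical, and the notebook's cost model may combine the four terms differently (for example by scaling them with liquidity).

```python
import numpy as np

def cost_components(trade, spread, k_lin, k_quad, k_cub):
    """Decompose execution cost into spread, linear, quadratic, and cubic terms."""
    x = np.abs(np.asarray(trade, dtype=float))   # absolute trade size per step
    parts = {
        "spread": spread * x,
        "linear_impact": k_lin * x,
        "quadratic_impact": k_quad * x ** 2,
        "cubic_impact": k_cub * x ** 3,
    }
    parts["total"] = sum(parts.values())
    return parts

# Hypothetical coefficients and trade sizes, for illustration only.
costs = cost_components(trade=[0.1, 0.2], spread=0.0005, k_lin=0.001, k_quad=0.01, k_cub=0.1)
print(costs["total"])

# Convexity: doubling the trade size more than doubles the total cost.
print(costs["total"][1] > 2 * costs["total"][0])
```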

**A risk object**  
This includes EWMA covariance estimates, realized volatility measures, drawdowns, and tail metrics such as VaR and expected shortfall. This matters because risk must be computed consistently and transparently. The objective uses these risk measures as penalty inputs.
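
For the tail metrics, a minimal historical estimator is sketched below. It is not the notebook's code: the function name `var_es` and the sign convention (losses reported as positive numbers) are assumptions, and `alpha=0.05` is chosen only to match the naming of `VaR_05` and `ES_05`.

```python
import numpy as np

def var_es(returns, alpha):
    """Historical VaR and expected shortfall at tail level alpha.

    Both are reported as positive loss magnitudes (an assumed convention).
    """
    r = np.asarray(returns, dtype=float)
    q = np.quantile(r, alpha)        # alpha-quantile of the return distribution
    tail = r[r <= q]                 # returns at or below the quantile
    return float(-q), float(-np.mean(tail))

# Hypothetical return sample: -10% to +9% in 1% steps, for illustration only.
sample = np.arange(-10, 10) / 100.0
VaR_05_demo, ES_05_demo = var_es(sample, alpha=0.05)
print(VaR_05_demo, ES_05_demo)
```

Expected shortfall averages the losses beyond the VaR cutoff, so under this convention it is never smaller than VaR at the same level.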

**The surrogate objective object itself**  
This is the core output: a collection of time-series components plus the final aggregated scalar score U (discounted surrogate utility). It is “the thing you optimize later.” It is the surrogate for the true, unreachable economic objective.

**A sensitivity and stability object**  
This includes the lambda grid results and diagnostics such as argmax stability, curvature proxies, elasticity estimates, and “failure cliffs” where high utility coincides with governance breaches or large drawdowns. This matters because it tells you whether the surrogate is structurally sane.
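
A failure cliff, as described above, is a grid point that scores well while violating governance or drawdown limits. One possible detector is sketched here; the function name `find_failure_cliffs`, the use of a utility quantile to define "high utility", and all thresholds are assumptions rather than the notebook's definition.

```python
import numpy as np

def find_failure_cliffs(points, u_quantile, max_breach_rate, max_drawdown):
    """Return grid points with high utility that also breach governance or drawdown limits."""
    u_cut = np.quantile([p["U"] for p in points], u_quantile)
    return [
        p for p in points
        if p["U"] >= u_cut
        and (p["gov_breach_rate"] > max_breach_rate or p["max_dd"] > max_drawdown)
    ]

# Hypothetical grid results, for illustration only.
grid_points = [
    {"U": 1.0, "gov_breach_rate": 0.30, "max_dd": 0.05},   # high utility, breaches governance
    {"U": 0.9, "gov_breach_rate": 0.01, "max_dd": 0.02},   # high utility, compliant
    {"U": 0.2, "gov_breach_rate": 0.40, "max_dd": 0.50},   # breaches, but utility is low
    {"U": 0.1, "gov_breach_rate": 0.00, "max_dd": 0.01},
]
cliffs = find_failure_cliffs(grid_points, u_quantile=0.5, max_breach_rate=0.1, max_drawdown=0.2)
print(cliffs)
```

Only the first point is flagged: the third point violates limits too, but the objective already scores it poorly, so it is not a case where the proxy rewards bad behavior.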

**A governance artifact bundle**  
This is the institutional outcome. It is not optional. It includes:
- run_manifest.json (what ran, when, with what environment)
- config_hash.txt (the configuration identity)
- reward_decomposition.json (what the surrogate objective did)
- lambda_surface.json (how sensitive the objective is)
- stability_metrics.json (how fragile or stable the objective is)
- risk_log.json (the risk outcomes and thresholds)

This bundle is what makes the chapter production-grade. If someone else runs the notebook with the same config and seed, they can reproduce the exact outputs. If a reviewer asks what drove results, you can point to the decomposition. If a risk committee asks whether the proxy is stable, you can point to the sensitivity surface.
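
The reproducibility claim rests on the configuration having a stable identity. One common way to obtain that identity is to hash a canonical serialization of the config, sketched below. The function name `compute_config_hash` and the choice of SHA-256 over sorted-key JSON are assumptions; the notebook's own hashing scheme is not shown in this section.

```python
import hashlib
import json

def compute_config_hash(config: dict) -> str:
    """SHA-256 of a canonical (sorted-key, compact) JSON serialization of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical configs, for illustration only.
config_a = {"seed": 7, "gamma": 0.99, "lambda_vol": 0.35}
config_b = {"lambda_vol": 0.35, "gamma": 0.99, "seed": 7}   # same content, different key order
config_c = {"seed": 8, "gamma": 0.99, "lambda_vol": 0.35}   # different seed

print(compute_config_hash(config_a) == compute_config_hash(config_b))  # identical content
print(compute_config_hash(config_a) == compute_config_hash(config_c))  # different content
```

Sorting keys before hashing means the identity depends on what the configuration says, not on the order in which its fields happened to be written.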

Now we can state the achieved goal in the clearest possible terms.

**The goal we achieved is: we created a governed surrogate objective that can be optimized without ambiguity.**

That is the meaning of Chapter 1.

We did not build an agent.  
We built the thing an agent would optimize.

This is an important distinction. If you skip this step, you end up in the common trap where the agent “learns” but nobody can clearly explain what it learned, why it behaves that way, or how it would behave if conditions change. In institutional finance, that is unacceptable. You must be able to explain behavior in terms of the objective and constraints.

Chapter 1 therefore establishes the causal chain:

Objective design → Behavior incentives → Risk and cost realism → Auditability

Learning comes later. Optimization comes later. But the “meaning” of what the system is doing is already determined here.

Finally, it is helpful to state what success looks like at the end of this chapter.

You are successful if you can answer all of the following questions using only the artifacts produced:

- What is the surrogate objective, written explicitly as a formula and implemented as code?  
- How does it decompose into return and penalties?  
- Which penalty dominates in typical conditions?  
- How does the preferred parameter setting change across a lambda grid?  
- Are there “failure cliffs” where utility is high but governance is violated?  
- What are the key risk outcomes (drawdown, volatility, expected shortfall)?  
- Were any governance thresholds breached, and at what rate?  
- Can the run be reproduced exactly from the config hash and seed?

If you can answer these questions, you have achieved the goal of Chapter 1.

You have turned a vague statement like “make a trading agent” into a concrete, auditable object:

A surrogate utility functional with explicit cost and risk structure, equipped with stability diagnostics and institutional artifacts.

That is what we obtained. That is the goal we achieved. And that is why Chapter 1 is the foundation for everything that follows.
