#**CHAPTER 4. HYPOTHESIS TO BACKTEST GATE**
---

##REFERENCE

https://chatgpt.com/share/699615c3-5128-8012-a718-a31519ea55b0

##0.CONTEXT

**Introduction — Notebook 4 (N4): Trading Hypothesis + Backtest Wrapper as a Tool-Augmented LangGraph System**

In real trading and research teams, the first thing that breaks is not a strategy. The first thing that breaks is the workflow. A junior analyst has an idea, runs a quick backtest, sends a chart, and a week later nobody can reproduce the result. Another analyst tweaks one parameter “just to see,” and the tweak becomes the new baseline without any trace of why. A portfolio manager asks, “What exactly did you test?” and the answer is a bundle of notebooks, half-remembered assumptions, and a spreadsheet with unexplained filters. The problem is not a lack of intelligence. The problem is that the work is not stateful, not reviewable, and not governed.

This notebook treats that workflow failure as the primary engineering objective. We implement a minimal, fast, auditable research loop using **LangGraph**: an agent proposes a hypothesis, a deterministic tool executes a backtest on synthetic data, and a reviewer decides whether the system should iterate or stop. The emphasis is not “prompting” or “alpha.” The emphasis is that the system behaves like a controlled process: each decision is recorded, each transition is explicit, each loop is bounded, and each run produces artifacts that another person can inspect.

The real-life problem we are modeling is familiar to any practitioner: you want to test a hypothesis quickly without turning the process into an ungoverned experimentation spiral. Suppose you are running a small systematic research pod. You have limited time, you cannot depend on production market data for classroom work, and you need to train analysts in the mechanics of disciplined iteration. Your goal is to make the research process repeatable and reviewable. You want to be able to answer, at any moment: What was the hypothesis? What parameters were tested? What did the backtest tool compute? Why did we decide to stop or continue? What was the final state of the analysis?

To achieve this, we build an explicit state machine. The state is not “the conversation.” The state is a **TypedDict** with named fields: hypothesis_json, backtest_json, review_json, decision, iter_count, and a bounded trace. That choice is architectural. It forces every node to read from and write to shared state in a disciplined way. The state becomes the single source of truth for routing. This prevents a common failure mode in LLM workflows: the model “remembers” something informally in text and later routing decisions become implicit and unreviewable.

The graph topology is intentionally small and pedagogical. It is a three-node pipeline with a bounded loop:

`__start__` → **HYPOTHESIS** → **BACKTEST_TOOL** → **REVIEW** → (if ITERATE: back to **HYPOTHESIS**; otherwise END) → `__end__`

This is the core logic:

**HYPOTHESIS** is an agent node. It receives the user request and the current iteration index. It outputs a strict JSON object describing the trading hypothesis in a constrained schema: strategy family, intuition, parameter values, risk notes, and a test plan. The schema is deliberately narrow. It is easier to govern a bounded object than freeform prose. The hypothesis node is allowed to be creative in the “intuition,” but it is not allowed to invent tools, fabricate performance, or change the strategy family. For this notebook we lock the family to a simple mean-reversion z-score policy and we bound parameter ranges (lookback, z threshold, leverage). The output is either a validated structured object or a deterministic fallback if the model fails to comply.
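
The validate-or-fall-back step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual validator: the name `validate_hypothesis`, the key `strategy_family`, and the ranges in `PARAM_BOUNDS` are assumptions chosen for the example. The fallback values reuse the backtest defaults from the configuration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical bounds for illustration only; the notebook's real ranges may differ.
PARAM_BOUNDS: Dict[str, Tuple[float, float]] = {
    "lookback": (5, 60),
    "threshold_z": (0.5, 3.0),
    "max_leverage": (0.1, 2.0),
}
LOCKED_FAMILY = "mean_reversion_zscore"

# Deterministic fallback used whenever the model output cannot be validated.
FALLBACK: Dict[str, Any] = {
    "strategy_family": LOCKED_FAMILY,
    "params": {"lookback": 20, "threshold_z": 1.0, "max_leverage": 1.0},
}

def validate_hypothesis(obj: Any) -> Tuple[Dict[str, Any], List[str]]:
    """Return (hypothesis, errors). Any error replaces the input with the fallback."""
    errors: List[str] = []
    if not isinstance(obj, dict):
        return dict(FALLBACK), ["not a JSON object"]
    if obj.get("strategy_family") != LOCKED_FAMILY:
        errors.append("strategy family is locked")
    params = obj.get("params")
    if not isinstance(params, dict):
        errors.append("params missing")
    else:
        for key, (lo, hi) in PARAM_BOUNDS.items():
            val = params.get(key)
            is_number = isinstance(val, (int, float)) and not isinstance(val, bool)
            if not is_number or not (lo <= val <= hi):
                errors.append(f"{key} out of bounds")
    return (dict(FALLBACK), errors) if errors else (obj, errors)
```

Returning the error list alongside the object keeps the rejection reason available for the audit trail, even when the fallback is what gets tested.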

**BACKTEST_TOOL** is not an LLM. It is a deterministic function. It generates a synthetic price series using a seeded random process and then runs a simplified mean-reversion backtest. This is the “tool-augmented node” dimension introduced in N4. The lesson is that an agentic system should not hallucinate results; it should call tools that compute results. The backtest tool produces inspectable outputs: total return, approximate Sharpe, max drawdown, average turnover, plus tail slices of position, turnover, PnL, and equity. Because the tool is deterministic given config and seed, it is reproducible in a classroom environment and suitable for audit.

**REVIEW** is a second agent node, but it is not “another opinion generator.” It is a gatekeeper. Its job is to decide whether the workflow should iterate or stop, and to do so under explicit policy. In professional settings, iteration is costly: more runs consume time, amplify p-hacking risk, and create narrative momentum. The reviewer therefore enforces a bounded control rule, and that rule is encoded in a way that the system can enforce deterministically. In the updated version of this notebook, we also include a pedagogical iterate trigger: if Sharpe is negative on the first pass and we still have iteration budget, we run one refinement pass. This teaches students how the loop behaves without turning the notebook into an endless parameter sweep.
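
A deterministic version of the pedagogical iterate trigger might look like the sketch below. The function name `review_decision` is hypothetical, and the sketch assumes `iter_count` is 1 on the first pass (that is, the hypothesis node increments it before review); the notebook may count differently.

```python
from typing import Literal

Decision = Literal["ITERATE", "STOP"]

def review_decision(sharpe: float, iter_count: int, max_iters: int) -> Decision:
    """Iterate once when the first pass has negative Sharpe and budget remains."""
    first_pass = iter_count == 1          # assumption: counter is 1-based
    budget_left = iter_count < max_iters  # never exceed the configured bound
    if first_pass and sharpe < 0.0 and budget_left:
        return "ITERATE"
    return "STOP"
```

Because the rule depends only on three numbers, the same inputs always yield the same decision, which is what lets the system enforce the policy regardless of what the reviewing model writes in its narrative.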

The routing itself is performed by LangGraph conditional edges, not by ad hoc if/else scattered across cells. The REVIEW node sets state["decision"] to either ITERATE or STOP. Then the router function reads only state fields (decision, iter_count, max_iters) and returns either HYPOTHESIS or END. That is the architectural principle: routing is driven by state, and state is created by nodes with explicit contracts. You can look at the final_state.json and reproduce exactly why the workflow took the path it did.
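
A router of the kind described above could be sketched like this. The name `route_after_review` is an assumption, and `END` is defined locally as a stand-in for `langgraph.graph.END` so the example runs without LangGraph installed.

```python
from typing import Any, Dict

END = "__end__"  # stand-in for langgraph.graph.END

def route_after_review(state: Dict[str, Any]) -> str:
    """Read only state fields and return the name of the next node."""
    wants_more = state.get("decision") == "ITERATE"
    within_budget = state.get("iter_count", 0) < state.get("max_iters", 0)
    return "HYPOTHESIS" if (wants_more and within_budget) else END
```

With LangGraph, a function like this would typically be registered on the `StateGraph` through `add_conditional_edges`, with REVIEW as the source node. Note that the router checks the bound itself, so a reviewer that mistakenly asks to iterate past the limit is still routed to the end.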

The “bounded loop” requirement is not cosmetic. In live research, unbounded loops are how teams end up with fragile results that nobody can defend. In this notebook, the loop is bounded by CFG["max_iters"] and the counter is explicit in state (iter_count). The router enforces the bound, and the review policy refuses to iterate once the limit is reached. This produces a system that is fast enough for classroom use, yet still demonstrates the essential behavior of an agentic research loop.

From the perspective of financial practitioners, this architecture is relevant because it maps directly onto how research is actually reviewed and operationalized. A research idea is not a result. A result is not deployable. Deployment requires traceability, reproducibility, and clear stopping rules. In risk committees, investment committees, model risk management, and even simple desk-level code reviews, what matters is the ability to show: “Here is the exact hypothesis, here is the exact tool run, here is the exact decision rule, here is the audit trail.” Without that, a backtest is a story, not evidence.

This notebook also teaches a second practitioner lesson: agents should be used where language is needed (hypothesis articulation, risk framing, review narrative), and tools should be used where computation is needed (backtests, metrics, transformations). The “tool-augmented node” is not a gimmick. It is the foundation for scaling: once you can swap in a real backtest engine, a market simulator, a cost model, or a risk report generator, the workflow topology stays stable. Only the tool implementation changes. That is how modular systems survive contact with production constraints.

Finally, the visualization is not decoration. The graph itself is a learning artifact and a governance artifact. In a real firm, a graph like this is the bridge between “what the code does” and “what the organization thinks the code does.” When you can point to a diagram and say, “This node proposes the hypothesis, this node runs the test, this node decides whether we iterate, and here is the explicit stopping bound,” you have a mechanism that can be communicated, reviewed, and controlled.

Notebook 4 is therefore a deliberate step in the course progression. N1 introduced conditional retry loops for missing information. N2 introduced suitability boundaries and early termination. N3 introduced critique loops for evidence gaps. N4 introduces the next critical dimension: a node that calls a deterministic tool and returns structured outputs, so the system can iterate on measurable results rather than on narrative. This is the minimum viable pattern for governed research in systematic trading: hypothesis → tool execution → review gate → bounded iteration → audited artifacts. It is simple enough to teach, strict enough to audit, and close enough to real workflows that practitioners can recognize it immediately.


##1.LIBRARIES AND ENVIRONMENT

**Cell 1 — Install, imports, determinism, and secret loading**

This first cell is the “bootloader” for the entire notebook. In professional workflows, most failures are not caused by the strategy logic; they are caused by environment drift. Someone runs the same notebook two weeks later, the package versions have changed, and the output is different. Cell 1 exists to reduce that drift. We pin exact versions of **LangGraph** (the state-machine engine), **LangChain** and **LangChain core** (shared interfaces), and **httpx/httpcore** (at versions that avoid common Colab conflicts). The **Anthropic** SDK, used to call the locked model, is only given a minimum version (`>=0.34.0`), so its installed version can still vary between runs; the version banner printed at the end of the cell records what was actually installed. The goal is not perfection, since Colab comes with many preinstalled libraries, but a stable baseline that does not break other common packages.

Next, we import exactly what we will use. This is more important than it sounds. Clean imports make the notebook auditable: a reviewer can see what dependencies exist without hunting across cells. Then we set determinism knobs: a global `SEED`, `random.seed`, and the `PYTHONHASHSEED` environment variable. Note that Python reads `PYTHONHASHSEED` at interpreter startup, so assigning it inside a running notebook affects only child processes, not the current kernel; the reproducibility of this lab rests on the seeded random generator. This notebook uses synthetic data, so determinism matters. If two students run the lab, they should get the same synthetic price path and the same backtest outputs. That makes learning consistent and makes comparisons fair.
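
The reproducibility claim can be checked with a few lines of standard-library Python. This is a small sketch; `sample_path` is a hypothetical helper, not a function from the notebook.

```python
import random

SEED = 7

def sample_path(seed: int, n: int = 5) -> list:
    """Draw n Gaussian values from a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

run_a = sample_path(SEED)
run_b = sample_path(SEED)
run_c = sample_path(SEED + 1)

print(run_a == run_b)  # same seed, same sequence
print(run_a == run_c)  # different seed, different sequence
```

Using a private `random.Random(seed)` instance rather than the module-level functions means other code that touches the global generator cannot shift the sequence.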

Finally, we load the API key from Colab secrets using `userdata.get("ANTHROPIC_API_KEY")`. This is governance-first for two reasons. First, secrets should never appear in the notebook text or outputs. Second, the key name is standardized and explicit, so the same notebook can run in different environments without editing code. We then fail fast if the key is missing. Silent failures are poison in professional systems; we want errors to be immediate and clear.

The last printout is a version banner. This is not “noise.” It is evidence. If someone shares results, they can also share the version banner and reproduce the environment. In regulated or review-heavy contexts, that is the difference between a credible analysis and an unverifiable one.


In [3]:
# CELL 1/10 — Install + core imports (Colab-ready, conflict-safe) + deterministic config
!pip -q install --upgrade "pip>=24.0"
!pip -q install "httpx==0.28.1" "httpcore==1.0.5"
!pip -q install "langgraph==0.2.39" "langchain==0.3.14" "langchain-core==0.3.40" "anthropic>=0.34.0"

import os, json, re, uuid, time, random, hashlib, platform, sys, math, base64
import datetime as _dt
from typing import TypedDict, Literal, Dict, Any, List, Optional, Callable, Tuple

from langgraph.graph import StateGraph, END
from google.colab import userdata
from IPython.display import HTML, display

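SEED = 7
random.seed(SEED)
# Note: PYTHONHASHSEED is read at interpreter startup, so setting it here
# affects only child processes, not the hash seed of the running kernel.
os.environ["PYTHONHASHSEED"] = str(SEED)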

import importlib.metadata as md
def _ver(pkg: str) -> str:
    try:
        return md.version(pkg)
    except Exception:
        return "missing"

print("VERSIONS:", {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "langgraph": _ver("langgraph"),
    "langchain": _ver("langchain"),
    "langchain-core": _ver("langchain-core"),
    "anthropic": _ver("anthropic"),
    "httpx": _ver("httpx"),
    "httpcore": _ver("httpcore"),
    "langgraph-prebuilt": _ver("langgraph-prebuilt"),  # observe only; do not use
})

API_KEY = userdata.get("ANTHROPIC_API_KEY")  # ALL CAPS (required)
if not API_KEY:
    raise RuntimeError('Missing Colab secret: userdata.get("ANTHROPIC_API_KEY") (ALL CAPS)')
print("ANTHROPIC_API_KEY loaded:", "yes" if API_KEY else "no")


VERSIONS: {'python': '3.12.12', 'platform': 'Linux-6.6.105+-x86_64-with-glibc2.35', 'langgraph': '0.2.39', 'langchain': '0.3.14', 'langchain-core': '0.3.40', 'anthropic': '0.81.0', 'httpx': '0.28.1', 'httpcore': '1.0.5', 'langgraph-prebuilt': '1.0.7'}
ANTHROPIC_API_KEY loaded: yes


##2.GOVERNANCE UTILITIES

###2.1.OVERVIEW

**Cell 2 — Configuration, governance utilities, and run manifest scaffolding**

Cell 2 turns the notebook from “a script” into “a governed run.” The key idea is that professional work needs a stable configuration object and a record of what was run. We create `CFG`, a dictionary that holds every parameter that defines the experiment: the model lock, temperature, iteration bound, synthetic market parameters, and backtest defaults. By putting these values in one place, we make the system easy to audit and easy to modify without accidentally changing logic elsewhere. In real settings, this prevents the classic problem of “parameters hidden inside functions.”

Next we define small governance utilities. `utc_now_iso()` returns timezone-aware UTC timestamps. That matters because timestamps often become evidence in audit trails, and naive timestamps create confusion across time zones. `stable_json_dumps()` produces a canonical JSON string with sorted keys, ensuring that hashing is consistent across machines. `sha256_hex()` is used to compute fingerprints. Together, these allow us to create a **config hash** that uniquely identifies the configuration used in the run.
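
To see why canonical serialization matters for the config hash, consider this short sketch. The two helpers mirror the definitions in the cell below; the toy dictionaries are illustrative only.

```python
import hashlib
import json
from typing import Any

def stable_json_dumps(obj: Any) -> str:
    # Sorted keys and fixed separators give one canonical string per object.
    return json.dumps(obj, sort_keys=True, ensure_ascii=False, separators=(",", ":"))

def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

cfg_a = {"max_iters": 2, "temperature": 0.0}
cfg_b = {"temperature": 0.0, "max_iters": 2}   # same content, different key order
cfg_c = {"max_iters": 3, "temperature": 0.0}   # one value changed

hash_a = sha256_hex(stable_json_dumps(cfg_a))
hash_b = sha256_hex(stable_json_dumps(cfg_b))
hash_c = sha256_hex(stable_json_dumps(cfg_c))

print(hash_a == hash_b)  # key order does not affect the fingerprint
print(hash_a == hash_c)  # any changed value does
```

Without `sort_keys=True`, two configurations with identical content could hash differently simply because their keys were inserted in a different order.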

Then we build `env_fingerprint()`. This is a minimal, reviewable snapshot of the runtime: Python version, platform, package versions, and the random seed. We intentionally keep it lightweight—no secrets, no personal data—because governance is about “minimum necessary information.” The point is to enable reproducibility without leaking anything sensitive.

Finally, we create three critical identifiers and artifacts. `RUN_ID` is a unique ID for the run. `CONFIG_HASH` is a cryptographic hash of the configuration. And `RUN_MANIFEST` is a structured JSON-ready object that will be exported later as `run_manifest.json`. The manifest records the project name, notebook name, model lock, config hash, environment fingerprint, and expected artifact filenames. This is the backbone of auditability: a reviewer can open one file and understand what happened, when it happened, and how to reproduce it.

In real finance workflows, this pattern maps to model risk management and research governance. You want to be able to say: “Here is the run ID, here is the exact configuration, here is the environment, and here are the outputs.” Cell 2 makes that statement true.


###2.2.CODE AND IMPLEMENTATION

In [12]:
# CELL 2/10 — Configuration + governance utilities + run_manifest.json scaffold (UTC tz-aware)
CFG: Dict[str, Any] = {
    "project": "AA-FIN-LG-2026",
    "notebook": "N4 Trading hypothesis + backtest wrapper (Tool-augmented node)",
    "model": "claude-haiku-4-5-20251001",   # strict lock
    "temperature": 0.0,
    "max_iters": 2,                        # bounded hypothesis→review loop
    "synthetic": {
        "n_days": 260,
        "mu_annual": 0.08,
        "sigma_annual": 0.20,
        "impact_bps_per_turnover": 8.0,
        "fee_bps_daily": 0.0,
    },
    "backtest": {
        "lookback": 20,
        "threshold_z": 1.0,
        "max_leverage": 1.0,
    },
}

def utc_now_iso() -> str:
    return _dt.datetime.now(_dt.timezone.utc).isoformat()

def stable_json_dumps(obj: Any) -> str:
    return json.dumps(obj, sort_keys=True, ensure_ascii=False, separators=(",", ":"))

def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def env_fingerprint() -> Dict[str, Any]:
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {
            "langgraph": _ver("langgraph"),
            "langchain": _ver("langchain"),
            "langchain-core": _ver("langchain-core"),
            "anthropic": _ver("anthropic"),
            "httpx": _ver("httpx"),
            "httpcore": _ver("httpcore"),
        },
        "seed": SEED,
    }

RUN_ID = str(uuid.uuid4())
CONFIG_HASH = sha256_hex(stable_json_dumps(CFG))

RUN_MANIFEST: Dict[str, Any] = {
    "run_id": RUN_ID,
    "ts_utc": utc_now_iso(),
    "project": CFG["project"],
    "notebook": CFG["notebook"],
    "model_lock": {"model": CFG["model"], "temperature": CFG["temperature"]},
    "config_hash_sha256": CONFIG_HASH,
    "env_fingerprint": env_fingerprint(),
    "artifacts": {
        "run_manifest_json": "run_manifest.json",
        "graph_spec_json": "graph_spec.json",
        "final_state_json": "final_state.json",
    },
    "notes": [
        "Synthetic-only data. Pedagogical backtest wrapper; not investment advice.",
        "State-driven routing via LangGraph conditional edges only.",
        "All loops bounded by CFG['max_iters'] and explicit counter 'iter_count'.",
    ],
}

print("RUN_ID:", RUN_ID)
print("CONFIG_HASH_SHA256:", CONFIG_HASH)


RUN_ID: 55d054f6-1440-4610-b200-179c581a0c52
CONFIG_HASH_SHA256: 9d8ef239f6baf1aca86fc5129ece19fee6bbd9e0e4612211913b2fa7081fa376


##3.VISUALIZATION STANDARD

###3.1.OVERVIEW

**Cell 3 — Graph visualization (Mermaid) as a first-class learning artifact**

Cell 3 exists because in agentic systems, the diagram is not decoration—it is the interface between code and understanding. When a workflow is represented as a graph, you can reason about it like an engineer: nodes are functions, edges are allowed transitions, and conditional edges represent decision points. This notebook requires visualization because the topology itself is part of what we are teaching. If students cannot see the graph, they cannot reliably explain the system.

Technically, we render the LangGraph topology using Mermaid, pinned to a specific version. Pinning matters because rendering behavior can change across Mermaid releases. If the visual representation changes, the learning artifact becomes unstable. We treat the visualization like a dependency, not like a “nice-to-have.” That is why the cell defines `MERMAID_VERSION` and uses an ESM import that references that exact version.

We also solve a subtle but important problem: HTML escaping can break Mermaid syntax. In particular, the arrow token `-->` can become `--&gt;` when injected into HTML, and Mermaid will fail to parse the diagram. To avoid that, we base64-encode the Mermaid text in Python and decode it in JavaScript using `TextDecoder`. This is a hardened pattern: it prevents the browser from “helpfully” rewriting the diagram text.
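
The escaping problem and the base64 workaround can be demonstrated on the Python side alone. The Mermaid snippet here is a toy example, not the graph produced by the notebook.

```python
import base64
import html

mermaid_src = "graph TD;\n  HYPOTHESIS --> BACKTEST_TOOL;"

# Naive HTML escaping rewrites the arrow, which Mermaid cannot parse.
escaped = html.escape(mermaid_src)

# Base64 output uses only letters, digits, '+', '/', and '=',
# so there is nothing for an HTML layer to escape.
b64 = base64.b64encode(mermaid_src.encode("utf-8")).decode("ascii")
restored = base64.b64decode(b64).decode("utf-8")

print("--&gt;" in escaped)        # the arrow was corrupted by escaping
print(restored == mermaid_src)    # the base64 round trip is lossless
```

In the notebook, the decode half of this round trip happens in the browser with `atob` and `TextDecoder` rather than in Python.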

The cell implements two required functions. `render_mermaid_locally()` is a safe renderer that places the output inside a bordered container and prints errors clearly if rendering fails. It also uses a **light theme** (white background, black text, dark lines) for readability in classrooms and PDFs. This is a practical detail: many projectors and printed handouts lose contrast with dark themes. We intentionally choose a presentation style that works in real teaching and committee review settings.

The second function, `display_langgraph_mermaid(graph)`, extracts Mermaid syntax from the compiled LangGraph object and renders it. This is the single “entry point” for visualization in later cells. That separation is deliberate: it keeps visualization logic in one place, makes it easy to upgrade later, and avoids duplicating rendering code across notebooks.

For financial practitioners, the relevance is direct. In real organizations, reviewers often do not read code first—they read diagrams, summaries, and control descriptions. A visible graph provides a shared language: research, risk, engineering, and management can all understand “what the system does” without relying on informal explanations. That is governance through clarity.


###3.2.CODE AND IMPLEMENTATION

In [18]:
# CELL 3/10 — Visualization Standard v1: hardened Mermaid ESM renderer (WHITE background, BLACK text) + display_langgraph_mermaid(graph)
MERMAID_VERSION = "10.6.1"

def render_mermaid_locally(mermaid_code: str, *, height_px: int = 560) -> None:
    """
    High-contrast LIGHT theme:
    - White background
    - Black text
    - Dark borders/lines
    - Base64 transport to avoid HTML escaping corrupting arrows (-->)
    """
    code = (mermaid_code or "").strip()
    if not code:
        display(HTML("<pre style='color:#b00020'>Empty Mermaid code.</pre>"))
        return

    diagram_id = f"mermaid_{sha256_hex(code)[:10]}"
    b64 = base64.b64encode(code.encode("utf-8")).decode("ascii")

    # Light theme variables (readable on projector / PDF)
    theme_vars = {
        "background": "#ffffff",
        "primaryColor": "#f2f5ff",
        "primaryBorderColor": "#111111",
        "primaryTextColor": "#000000",
        "secondaryColor": "#eef2ff",
        "secondaryBorderColor": "#111111",
        "secondaryTextColor": "#000000",
        "tertiaryColor": "#f7f7ff",
        "tertiaryBorderColor": "#111111",
        "tertiaryTextColor": "#000000",
        "lineColor": "#111111",
        "textColor": "#000000",
        "fontSize": "16px"
    }

    html = f"""
<div id="{diagram_id}_wrap"
     style="border:1px solid rgba(0,0,0,0.15);
            border-radius:12px;
            padding:12px;
            overflow:auto;
            max-height:{height_px}px;
            background:{theme_vars["background"]};">
  <div id="{diagram_id}_err"
       style="color:#b00020;
              font-family:ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace;
              white-space:pre-wrap;"></div>
  <div id="{diagram_id}_out"></div>
</div>

<script type="module">
  const out = document.getElementById("{diagram_id}_out");
  const err = document.getElementById("{diagram_id}_err");

  function b64ToUtf8(b64str) {{
    const bin = atob(b64str);
    const bytes = Uint8Array.from(bin, c => c.charCodeAt(0));
    return new TextDecoder("utf-8").decode(bytes);
  }}

  try {{
    const code = b64ToUtf8("{b64}");
    const mermaid = (await import("https://cdn.jsdelivr.net/npm/mermaid@{MERMAID_VERSION}/dist/mermaid.esm.min.mjs")).default;

    mermaid.initialize({{
      startOnLoad: false,
      securityLevel: "strict",
      theme: "base",
      fontFamily: "ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace",
      themeVariables: {json.dumps(theme_vars)},
      flowchart: {{
        curve: "basis",
        nodeSpacing: 44,
        rankSpacing: 60,
        padding: 12
      }}
    }});

    const {{ svg }} = await mermaid.render("{diagram_id}", code);

    // Post-process: enforce black strokes + thicker lines + full opacity for print/projector
    const svg2 = svg
      .replaceAll('stroke-width="1"', 'stroke-width="2.2"')
      .replaceAll('opacity="0.7"', 'opacity="1.0"')
      .replaceAll('stroke: rgb(153, 153, 153)', 'stroke: #111111');

    out.innerHTML = svg2;
  }} catch (e) {{
    err.textContent = "Mermaid render error:\\n" + (e?.stack || e?.message || String(e));
  }}
</script>
"""
    display(HTML(html))

def display_langgraph_mermaid(compiled_graph: Any) -> str:
    g = compiled_graph.get_graph()
    mermaid = g.draw_mermaid()
    render_mermaid_locally(mermaid)
    return mermaid

print("Mermaid pinned:", MERMAID_VERSION)


Mermaid pinned: 10.6.1


##4.STATE SCHEMA

###4.1.OVERVIEW

**Cell 4 — Explicit state schema, AgentNode abstraction, and strict JSON LLM calls**

Cell 4 is where the notebook becomes a true state-driven system. We define the workflow’s state using a **TypedDict** called `N4State`. This is not just “typing for style.” It is a discipline that forces us to name the pieces of information that drive the workflow: run identifiers, iteration counters, the structured hypothesis, the backtest outputs, the review decision, and a bounded trace. In agentic systems, ambiguity about state is the fastest path to confusion. A clear schema makes it obvious what the system knows at any step and what it is allowed to change.

We also implement `trace_append()`. This function appends a small record to a `trace` list inside the state, including a timestamp, node name, and a small payload. Importantly, the trace is bounded: we keep only the 40 most recent entries. This matters because uncontrolled traces can grow and slow down notebooks, and they can also become a privacy risk in real deployments. The goal is an audit trail that is informative but controlled.
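
The bounding behavior is easy to verify in isolation. In this sketch, `append_bounded` and `TRACE_LIMIT` are hypothetical names; the limit of 40 matches the slice used by `trace_append()` in the cell below.

```python
from typing import Any, Dict, List

TRACE_LIMIT = 40  # same bound as the slice in trace_append()

def append_bounded(trace: List[Dict[str, Any]], entry: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return a new list holding at most TRACE_LIMIT of the most recent entries."""
    return (trace + [entry])[-TRACE_LIMIT:]

trace: List[Dict[str, Any]] = []
for i in range(100):
    trace = append_bounded(trace, {"node": "DEMO", "step": i})

print(len(trace), trace[0]["step"], trace[-1]["step"])
```

After 100 appends only steps 60 through 99 remain, so the oldest records are the ones that drop out of the audit trail.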

Next, we define `llm_call_json()`. This notebook uses an LLM for hypothesis generation and review, but we enforce a strict contract: the model must return JSON, not prose. In professional contexts, freeform text is hard to validate and hard to route. JSON is structured and can be checked. The function makes an Anthropic call using the **locked model name** and **temperature 0.0** to reduce randomness. Then it attempts to parse JSON. If parsing fails, it falls back to a defensive extraction: a regular expression captures everything from the first `{` to the last `}` in the response, and that span is parsed instead. This is not “cleverness.” It is a practical guardrail: language models sometimes include leading or trailing text. The fallback tolerates that wrapper text, and anything that still fails to parse raises an error rather than being silently accepted.
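
One limitation of a first-brace-to-last-brace capture is that it fails when the response contains two separate JSON objects or a stray `}` in trailing prose. An alternative, shown here as a sketch and not as the notebook's method, is to let the standard library's `json.JSONDecoder.raw_decode` stop at the end of the first complete object. The function name `extract_first_json_object` is hypothetical.

```python
import json
from typing import Any, Dict

def extract_first_json_object(text: str) -> Dict[str, Any]:
    """Parse the first complete JSON object in text, ignoring surrounding prose."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    raise ValueError("No JSON object found.")

print(extract_first_json_object('Sure: {"decision": "STOP"} (see {notes})'))
```

The trade-off is a small loop instead of a one-line regex, in exchange for correct handling of trailing braces.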

Finally, we define the required `AgentNode` abstraction. Each node in the LangGraph is wrapped as a small callable class with a `name` and a `fn(state)->state` function. This enforces a consistent style across notebooks: nodes behave like pure transformations of state. They do not rely on hidden global memory, and they do not change the graph. They receive state, update state, and return state. This is the core mental model we want students to learn: an agentic workflow is a set of state transformers connected by explicit routing.

For financial practitioners, this cell is the “model risk control” layer. It enforces that every intermediate output is structured, inspectable, and stored in state. When someone asks “why did we stop?” or “what did we test?”, you do not point to chat logs—you point to `final_state.json` fields that are explicitly defined here.


###4.2.CODE AND IMPLEMENTATION

In [19]:
# CELL 4/10 — Explicit TypedDict state schema + AgentNode abstraction + strict JSON LLM caller
from anthropic import Anthropic
client = Anthropic(api_key=API_KEY)

Decision = Literal["ITERATE", "STOP"]

class N4State(TypedDict, total=False):
    run_id: str
    iter_count: int
    max_iters: int

    user_request: str

    hypothesis_json: Dict[str, Any]
    hypothesis_errors: List[str]

    backtest_json: Dict[str, Any]

    review_json: Dict[str, Any]
    decision: Decision
    termination_reason: str

    trace: List[Dict[str, Any]]

def trace_append(state: N4State, node: str, payload: Dict[str, Any]) -> N4State:
    tr = list(state.get("trace", []))
    tr.append({"ts_utc": utc_now_iso(), "node": node, "payload": payload})
    state["trace"] = tr[-40:]  # bounded audit trail
    return state

def llm_call_json(system: str, user: str, *, max_tokens: int = 700) -> Dict[str, Any]:
    """
    Calls Anthropic and parses a JSON object.
    Defensive: if direct parsing fails, parse the span from the first '{' to the last '}'.
    """
    resp = client.messages.create(
        model=CFG["model"],
        temperature=float(CFG["temperature"]),
        max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    text = ""
    for b in resp.content:
        if getattr(b, "type", None) == "text":
            text += b.text
    text = text.strip()

    try:
        return json.loads(text)
    except Exception:
        m = re.search(r"\{[\s\S]*\}", text)
        if not m:
            raise ValueError("LLM did not return JSON.")
        return json.loads(m.group(0))

class AgentNode:
    def __init__(self, name: str, fn: Callable[[N4State], N4State]):
        self.name = name
        self.fn = fn

    def __call__(self, state: N4State) -> N4State:
        return self.fn(state)

SYSTEM_HYPOTHESIS = (
    "You are a finance research assistant inside a governed agentic workflow. "
    "Return ONLY a single JSON object. No prose. No markdown."
)

SYSTEM_REVIEW = (
    "You are a risk-aware reviewer inside a governed workflow. "
    "Return ONLY a single JSON object. No prose. No markdown."
)

print("State schema + AgentNode ready.")


State schema + AgentNode ready.


##5.SYNTHETIC MARKET

###5.1.OVERVIEW

**Cell 5 — Deterministic synthetic market and backtest tool (the “tool” in tool-augmented)**

Cell 5 provides the computational backbone of Notebook 4. The point of N4 is that an agent should not “imagine” performance; it should call a tool that produces performance metrics. In professional quant research, the separation between idea generation and measurement is critical. If the same component can both propose an idea and fabricate the results, you have no control surface. This cell creates that control surface by implementing a deterministic tool.

We first define a small utility that converts annualized drift and volatility into daily parameters, dividing drift by 252 trading days and volatility by the square root of 252. This keeps the configuration intuitive (finance professionals usually think in annual terms) while keeping simulation consistent (the backtest operates on daily returns). Then we create `synthetic_prices()`, a seeded price generator. Because we fix the seed, the same configuration always produces the same price path. This is essential for teaching: if different students see different outcomes, it becomes harder to diagnose what changed. It is also essential for audit: if results cannot be reproduced, they cannot be defended.
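
As a worked example of the annual-to-daily conversion, the sketch below applies the same scaling to the configured values of 8% drift and 20% volatility. The `periods` parameter is added here for illustration; the notebook's helper hard-codes 252.

```python
import math

def daily_from_annual(mu_annual: float, sigma_annual: float, periods: int = 252):
    """Scale drift linearly and volatility by the square root of time."""
    return mu_annual / periods, sigma_annual / math.sqrt(periods)

mu_d, sig_d = daily_from_annual(0.08, 0.20)
print(f"daily drift = {mu_d:.6f}, daily vol = {sig_d:.6f}")
```

The daily drift comes out near 0.03% while daily volatility is about 1.26%, so on any single day the noise is roughly forty times larger than the drift. That ratio is worth keeping in mind when reading one-year backtest metrics on this data.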

Next, we implement `backtest_mean_reversion()`. This is a simplified mean-reversion z-score strategy designed to be fast and inspectable, not “alpha optimized.” It computes the rolling mean and standard deviation of returns over the lookback window, measures the most recent return as a z-score against that window, and when the absolute z-score exceeds the threshold it takes the opposite position (“fade extremes”); otherwise it stays flat. Position size is fixed at `max_leverage`, long or short. We compute turnover as the absolute change in position, then apply a simple cost model: impact is proportional to turnover in basis points, and an optional daily fee applies to the exposure held from the previous day. This cost model is intentionally simple, but it is directionally correct: turnover generates costs, and costs can dominate naive PnL.
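
The signal rule and the impact cost can be isolated into two small functions. This is a sketch for a single decision point, using the standard library's `statistics` module in place of the cell's inline helpers; the names `fade_signal` and `impact_cost` are hypothetical.

```python
import statistics
from typing import List

def fade_signal(window: List[float], threshold_z: float, max_leverage: float) -> float:
    """Next position: short after an unusually high return, long after an unusually low one."""
    m = statistics.mean(window)
    s = statistics.stdev(window)   # sample standard deviation (n - 1), needs len >= 2
    if s == 0.0:
        return 0.0
    z = (window[-1] - m) / s       # z-score of the most recent return
    if abs(z) <= threshold_z:
        return 0.0                 # move is not extreme enough; stay flat
    return -max_leverage if z > 0 else max_leverage

def impact_cost(prev_pos: float, new_pos: float, impact_bps: float) -> float:
    """Cost as a fraction of equity: basis points charged on the size of the position change."""
    return impact_bps * abs(new_pos - prev_pos) / 10000.0

quiet_then_spike = [0.0] * 9 + [0.05]
print(fade_signal(quiet_then_spike, threshold_z=1.0, max_leverage=1.0))
print(impact_cost(prev_pos=1.0, new_pos=-1.0, impact_bps=8.0))
```

A full flip from long to short is a turnover of 2, so at 8 bps per unit of turnover it costs 16 bps in one day. A strategy that flips often needs gross returns well above that to stay positive.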

The output of the tool is structured and reviewable. We return the strategy name, parameters used, and a `metrics` block containing total return, approximate Sharpe, max drawdown, and average turnover. We also include a small “series tail” snapshot (last 10 values) of equity, PnL, positions, and turnover. This is a powerful teaching device: students can see whether the strategy is trading frequently, whether it is flat most days, and whether costs are being applied.
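
Of the metrics listed, max drawdown is the one students most often compute incorrectly, so here is a minimal sketch on a toy equity curve. The logic follows the helper inside the backtest cell; the curve values are invented for the example.

```python
from typing import List

def max_drawdown(curve: List[float]) -> float:
    """Largest peak-to-trough decline, returned as a non-positive fraction."""
    peak = curve[0] if curve else 1.0
    mdd = 0.0
    for v in curve:
        peak = max(peak, v)              # running high-water mark
        mdd = min(mdd, v / peak - 1.0)   # deepest decline seen so far
    return mdd

equity = [1.00, 1.10, 0.99, 1.05, 1.21]
print(round(max_drawdown(equity), 4))    # trough 0.99 measured against peak 1.10
```

The drawdown is measured from the running peak of 1.10, not from the starting value, which is why the result is about -10% even though equity never falls far below 1.00.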

For financial practitioners, the relevance is that this cell represents the minimum standard of honesty in model testing: results must come from computation, not narrative. In real systems, the tool would be a full backtest engine, a simulator, or a production analytics library. The architecture does not change. This is the key lesson: once you have a deterministic tool wrapper and structured outputs, you can scale the workflow without losing governance.


###5.2.CODE AND IMPLEMENTATION

In [20]:
# CELL 5/10 — Deterministic synthetic market + backtest tool wrapper (fast, inspectable)
def _daily_from_annual(mu_a: float, sig_a: float) -> Tuple[float, float]:
    mu_d = mu_a / 252.0
    sig_d = sig_a / math.sqrt(252.0)
    return mu_d, sig_d

def synthetic_prices(*, n_days: int, mu_annual: float, sigma_annual: float, seed: int) -> List[float]:
    rng = random.Random(seed)
    mu_d, sig_d = _daily_from_annual(mu_annual, sigma_annual)
    p = 100.0
    out = [p]
    for _ in range(n_days - 1):
        r = rng.gauss(mu_d, sig_d)
        p = max(0.01, p * (1.0 + r))
        out.append(p)
    return out

def _returns(prices: List[float]) -> List[float]:
    return [(prices[i] / prices[i - 1]) - 1.0 for i in range(1, len(prices))]

def backtest_mean_reversion(
    prices: List[float],
    *,
    lookback: int,
    threshold_z: float,
    max_leverage: float,
    impact_bps_per_turnover: float,
    fee_bps_daily: float
) -> Dict[str, Any]:
    rets = _returns(prices)
    n = len(rets)

    pos = [0.0] * n
    pnl = [0.0] * n
    cost = [0.0] * n
    turnover = [0.0] * n

    def mean(xs: List[float]) -> float:
        return sum(xs) / max(1, len(xs))

    def std(xs: List[float], m: float) -> float:
        if len(xs) <= 1:
            return 0.0
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return math.sqrt(max(0.0, v))

    for t in range(n):
        if t < lookback:
            pos[t] = 0.0
        else:
            window = rets[t - lookback:t]
            m = mean(window)
            s = std(window, m)
            z = 0.0 if s == 0.0 else (rets[t - 1] - m) / s
            pos[t] = (float(max_leverage) * (-1.0 if z > 0 else 1.0)) if abs(z) > threshold_z else 0.0

        prev = pos[t - 1] if t > 0 else 0.0
        turnover[t] = abs(pos[t] - prev)

        gross = prev * rets[t]
        impact = (impact_bps_per_turnover * turnover[t]) / 10000.0
        fee = (fee_bps_daily * abs(prev)) / 10000.0

        cost[t] = impact + fee
        pnl[t] = gross - cost[t]

    eq = 1.0
    eq_curve = []
    for x in pnl:
        eq *= (1.0 + x)
        eq_curve.append(eq)

    def max_drawdown(curve: List[float]) -> float:
        peak = curve[0] if curve else 1.0
        mdd = 0.0
        for v in curve:
            peak = max(peak, v)
            dd = (v / peak) - 1.0
            mdd = min(mdd, dd)
        return mdd

    avg = sum(pnl) / max(1, len(pnl))
    var = sum((x - avg) ** 2 for x in pnl) / max(1, (len(pnl) - 1))
    vol = math.sqrt(max(0.0, var))
    sharpe = 0.0 if vol == 0.0 else (avg / vol) * math.sqrt(252.0)

    return {
        "strategy": "mean_reversion_zscore",
        "params": {
            "lookback": int(lookback),
            "threshold_z": float(threshold_z),
            "max_leverage": float(max_leverage),
            "impact_bps_per_turnover": float(impact_bps_per_turnover),
            "fee_bps_daily": float(fee_bps_daily),
        },
        "metrics": {
            "total_return": float(eq_curve[-1] - 1.0) if eq_curve else 0.0,
            "sharpe_approx": float(sharpe),
            "max_drawdown": float(max_drawdown(eq_curve)),
            "avg_turnover": float(sum(turnover) / max(1, len(turnover))),
        },
        "series_tail": {
            "equity_last_10": eq_curve[-10:],
            "pnl_last_10": pnl[-10:],
            "pos_last_10": pos[-10:],
            "turnover_last_10": turnover[-10:],
        },
    }

print("Deterministic backtest tool ready.")


Deterministic backtest tool ready.


##6.AGENT NODES

###6.1.OVERVIEW

**Cell 6 — The agentic core: Hypothesis node, Backtest node, Review node, and the router**

Cell 6 is the heart of the workflow: it defines the three node behaviors and the routing rule that will later be wired into LangGraph. The goal is to show how an agentic system can be both flexible (LLM-assisted) and controlled (state-driven, bounded, auditable). Each node is a small function that transforms `N4State` into an updated `N4State`. This is deliberate. It keeps the system modular and testable, and it prevents “hidden logic” from spreading across the notebook.

The **HYPOTHESIS** node is the idea generator. It calls the locked LLM and requests a strictly structured JSON hypothesis. The schema is intentionally narrow: we lock the strategy family to `mean_reversion_zscore`, and we restrict parameter ranges for lookback, z-threshold, and leverage. After the LLM returns JSON, we validate and clamp parameters deterministically. This means the model can suggest values, but the system enforces bounds. If JSON parsing fails, we fall back to a deterministic baseline hypothesis. This is an important governance principle: failure should be visible, and the system should remain runnable without producing junk.
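
The clamp step can be sketched in isolation. The version below is a variation on the notebook's approach, written under stated assumptions: `clamp_params`, `BOUNDS`, and `DEFAULTS` are hypothetical names, the default values are illustrative, and a bad individual field falls back to its default rather than triggering a whole-hypothesis fallback. The ranges are the ones declared in the hypothesis schema.

```python
from typing import Any, Dict

# (low, high, cast) per parameter; ranges as declared in the hypothesis schema.
BOUNDS = {
    "lookback": (5, 60, int),
    "threshold_z": (0.5, 3.0, float),
    "max_leverage": (0.25, 2.0, float),
}
# Illustrative defaults (assumed; the notebook reads its defaults from CFG).
DEFAULTS = {"lookback": 20, "threshold_z": 2.0, "max_leverage": 0.75}

def clamp_params(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce model-suggested params to the right type and force them into bounds."""
    out: Dict[str, Any] = {}
    for name, (lo, hi, cast) in BOUNDS.items():
        try:
            value = cast(raw.get(name, DEFAULTS[name]))
        except (TypeError, ValueError):
            value = DEFAULTS[name]  # unparseable suggestion: use the default
        out[name] = max(lo, min(hi, value))
    return out

suggested = {"lookback": 500, "threshold_z": "0.1", "max_leverage": None}
print(clamp_params(suggested))
```

The model is free to suggest anything, but whatever reaches the backtest tool is guaranteed to be typed and in range.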

The **BACKTEST_TOOL** node is purely deterministic. It reads the hypothesis parameters from state, generates a seeded synthetic price path, runs the backtest tool, and writes results back into state as `backtest_json`. It also writes a compact trace entry. This node is the “tool-augmented” dimension: measurement is computed, not narrated.

The **REVIEW** node is the gatekeeper. It reads the backtest metrics from state and decides whether to iterate or stop. Here we combine two ideas: an LLM produces a human-readable verdict and suggested parameter adjustments, and the node applies deterministic guards to the control decision. The policy stated in the review prompt is: iterate for high turnover with weak Sharpe (`avg_turnover > 0.35` and `sharpe_approx < 0.8`), iterate for large drawdown with weak Sharpe (`max_drawdown < -0.20` and `sharpe_approx < 0.7`), otherwise stop, and stop regardless once the iteration budget is exhausted. In the code as listed, the metric thresholds are applied by the LLM following that prompt, while the node itself enforces two things deterministically: the budget rule (`it >= max_iters` forces STOP) and normalization of the decision to exactly `ITERATE` or `STOP`, with anything unrecognized treated as STOP. This design addresses a real teaching need: you want the loop to be visible in action, but you also want it bounded. The decision is then stored in state as `decision`, and a termination reason is recorded when stopping.
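
As a sketch of what evaluating the review thresholds entirely in code could look like, the function below applies the rules stated in the Cell 6 review prompt and leaves the LLM's output as commentary only. This is an assumption about one possible hardening, not the notebook's implementation; `policy_decision` is a hypothetical name, and the thresholds are copied from the prompt text.

```python
from typing import Dict

def policy_decision(metrics: Dict[str, float], it: int, max_iters: int) -> str:
    """Return 'ITERATE' or 'STOP' from metrics alone (hypothetical helper)."""
    # Budget rule comes first: no metric can override an exhausted budget.
    if it >= max_iters:
        return "STOP"
    sharpe = float(metrics.get("sharpe_approx", 0.0))
    # High turnover with weak Sharpe: worth one refinement pass.
    if float(metrics.get("avg_turnover", 0.0)) > 0.35 and sharpe < 0.8:
        return "ITERATE"
    # Large drawdown with weak Sharpe: worth one refinement pass.
    if float(metrics.get("max_drawdown", 0.0)) < -0.20 and sharpe < 0.7:
        return "ITERATE"
    return "STOP"

churny = {"avg_turnover": 0.50, "max_drawdown": -0.05, "sharpe_approx": 0.2}
print(policy_decision(churny, it=1, max_iters=2))  # budget remains
print(policy_decision(churny, it=2, max_iters=2))  # budget exhausted
```

Because the function is pure, it can be unit-tested without an LLM call, and the same metrics always produce the same decision.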

Finally, the **router** function is the mechanism that converts state into a next-step label. This is crucial: routing is driven by `decision` and the iteration counter, not by text heuristics. The router returns either `HYPOTHESIS` (iterate) or `END` (stop). That is the architectural rule: state drives routing, and routing is executed by LangGraph conditional edges.

For financial practitioners, this cell models a real research workflow: an analyst proposes a trade, the system runs a quick test, and a reviewer gate decides whether the idea deserves one more refinement pass or should be parked. The point is not that the strategy is good; the point is that the process is reviewable and controlled.


###6.2.CODE AND IMPLEMENTATION

In [21]:
# CELL 6/10 — Agent nodes: Hypothesis (LLM) → Backtest (tool) → Review (LLM) + deterministic router
def hypothesis_node(state: N4State) -> N4State:
    it = int(state.get("iter_count", 0))
    max_iters = int(state.get("max_iters", CFG["max_iters"]))

    schema = {
        "hypothesis": {
            "strategy_family": "mean_reversion_zscore",
            "intuition": "string",
            "params": {"lookback": "int (5..60)", "threshold_z": "float (0.5..3.0)", "max_leverage": "float (0.25..2.0)"},
            "risk_notes": ["string"],
            "test_plan": {"what_to_check": ["string"], "failure_modes": ["string"]},
        },
        "constraints": {"synthetic_only": True, "bounded_iters": max_iters},
    }

    user = f"""
User request:
{state.get("user_request","")}

Iteration:
{it} of max {max_iters}

Return ONLY JSON with EXACT top-level shape:
{stable_json_dumps(schema)}

Rules:
- strategy_family MUST be "mean_reversion_zscore"
- params must be within the stated ranges
- keep short, specific, and testable
""".strip()

    errors: List[str] = []
    try:
        obj = llm_call_json(SYSTEM_HYPOTHESIS, user, max_tokens=650)
        hyp = obj.get("hypothesis", {}) if isinstance(obj.get("hypothesis", {}), dict) else {}
        params = hyp.get("params", {}) if isinstance(hyp.get("params", {}), dict) else {}

        lookback = max(5, min(60, int(params.get("lookback", CFG["backtest"]["lookback"]))))
        threshold_z = max(0.5, min(3.0, float(params.get("threshold_z", CFG["backtest"]["threshold_z"]))))
        max_leverage = max(0.25, min(2.0, float(params.get("max_leverage", CFG["backtest"]["max_leverage"]))))

        state["hypothesis_json"] = {
            "hypothesis": {
                "strategy_family": "mean_reversion_zscore",
                "intuition": str(hyp.get("intuition", "Fade extreme normalized moves; rely on rolling mean/std."))[:320],
                "params": {"lookback": lookback, "threshold_z": threshold_z, "max_leverage": max_leverage},
                "risk_notes": list(hyp.get("risk_notes", ["Regime shift breaks MR.", "Costs dominate under churn."]))[:6],
                "test_plan": {
                    "what_to_check": list((hyp.get("test_plan", {}) or {}).get("what_to_check", ["Sharpe vs turnover", "Drawdown"]))[:6],
                    "failure_modes": list((hyp.get("test_plan", {}) or {}).get("failure_modes", ["Trend regime losses", "Vol spikes"]))[:6],
                },
            },
            "constraints": {"synthetic_only": True, "bounded_iters": max_iters},
        }
    except Exception as e:
        errors.append(f"hypothesis_json_failed: {type(e).__name__}: {e}")
        state["hypothesis_json"] = {
            "hypothesis": {
                "strategy_family": "mean_reversion_zscore",
                "intuition": "Deterministic fallback due to JSON failure.",
                "params": {
                    "lookback": int(CFG["backtest"]["lookback"]),
                    "threshold_z": float(CFG["backtest"]["threshold_z"]),
                    "max_leverage": float(CFG["backtest"]["max_leverage"]),
                },
                "risk_notes": ["Fallback: LLM hypothesis unavailable."],
                "test_plan": {"what_to_check": ["Sharpe", "Max drawdown", "Turnover"], "failure_modes": ["Costs", "Trend"]},
            },
            "constraints": {"synthetic_only": True, "bounded_iters": max_iters},
        }

    state["hypothesis_errors"] = errors
    return trace_append(state, "HYPOTHESIS", {
        "iter": it,
        "errors": errors,
        "params": (state["hypothesis_json"]["hypothesis"]["params"]),
    })

def backtest_tool_node(state: N4State) -> N4State:
    hyp = state.get("hypothesis_json", {}).get("hypothesis", {})
    params = hyp.get("params", {})
    lookback = int(params.get("lookback", CFG["backtest"]["lookback"]))
    threshold_z = float(params.get("threshold_z", CFG["backtest"]["threshold_z"]))
    max_leverage = float(params.get("max_leverage", CFG["backtest"]["max_leverage"]))

    prices = synthetic_prices(
        n_days=int(CFG["synthetic"]["n_days"]),
        mu_annual=float(CFG["synthetic"]["mu_annual"]),
        sigma_annual=float(CFG["synthetic"]["sigma_annual"]),
        seed=SEED,
    )

    bt = backtest_mean_reversion(
        prices,
        lookback=lookback,
        threshold_z=threshold_z,
        max_leverage=max_leverage,
        impact_bps_per_turnover=float(CFG["synthetic"]["impact_bps_per_turnover"]),
        fee_bps_daily=float(CFG["synthetic"]["fee_bps_daily"]),
    )

    state["backtest_json"] = bt
    return trace_append(state, "BACKTEST_TOOL", {"metrics": bt["metrics"], "params": bt["params"]})

def review_node(state: N4State) -> N4State:
    it = int(state.get("iter_count", 0))
    max_iters = int(state.get("max_iters", CFG["max_iters"]))
    metrics = (state.get("backtest_json", {}) or {}).get("metrics", {})

    schema = {
        "review": {
            "verdict": "string",
            "key_metrics": {"sharpe_approx": "float", "max_drawdown": "float", "avg_turnover": "float", "total_return": "float"},
            "governance_notes": ["string"],
            "iterate_reason": "string or null",
            "decision": "ITERATE or STOP",
            "param_adjustment_hint": {"lookback": "int or null", "threshold_z": "float or null", "max_leverage": "float or null"},
        }
    }

    user = f"""
Iteration {it} of max {max_iters}

Hypothesis params:
{stable_json_dumps(state.get("hypothesis_json", {}).get("hypothesis", {}).get("params", {}))}

Backtest metrics:
{stable_json_dumps(metrics)}

Return ONLY JSON with EXACT shape:
{stable_json_dumps(schema)}

Deterministic policy:
- If avg_turnover > 0.35 and sharpe_approx < 0.8 => ITERATE
- If max_drawdown < -0.20 and sharpe_approx < 0.7 => ITERATE
- Otherwise STOP
- If it >= max_iters => STOP regardless
""".strip()

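    try:
        obj = llm_call_json(SYSTEM_REVIEW, user, max_tokens=520)
        review = obj.get("review", {}) if isinstance(obj.get("review", {}), dict) else {}
    except Exception as e:
        # Fail closed: a failed or malformed review has no "decision", so it defaults to STOP below,
        # and the failure is recorded in the verdict (which becomes the termination reason).
        review = {"verdict": f"review_json_failed: {type(e).__name__}: {e}"}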

    decision = str(review.get("decision", "STOP")).upper()
    if it >= max_iters:
        decision = "STOP"
    decision = "ITERATE" if decision == "ITERATE" else "STOP"

    review["decision"] = decision
    state["review_json"] = {"review": review}
    state["decision"] = decision

    if decision == "STOP":
        state["termination_reason"] = str(review.get("verdict", "Stopped by policy."))[:200]

    return trace_append(state, "REVIEW", {"decision": decision, "key_metrics": metrics})

def router(state: N4State) -> str:
    it = int(state.get("iter_count", 0))
    max_iters = int(state.get("max_iters", CFG["max_iters"]))
    if it >= max_iters:
        return "END"
    return "HYPOTHESIS" if state.get("decision") == "ITERATE" else "END"

print("Nodes + router ready.")


Nodes + router ready.


##7.BUILD GRAPH

###7.1.OVERVIEW

**Cell 7 — Build the LangGraph topology, compile it, visualize it, and serialize the graph spec**

Cell 7 is where we turn “functions” into a governed workflow. Up to this point, we have defined state and node behaviors. Here we define the **topology**: what nodes exist, how they connect, and where conditional routing happens. In LangGraph, topology is a first-class object. This matters because in agentic systems, topology is part of the contract. If topology changes, the system’s behavior changes—even if node code stays the same. Making topology explicit reduces ambiguity and improves reviewability.

We start by creating a `StateGraph(N4State)`. This binds the graph to our typed state schema. Then we add the three nodes using the required `AgentNode` abstraction: `HYPOTHESIS`, `BACKTEST_TOOL`, and `REVIEW`. The node IDs are not arbitrary; they become part of the audit trail and the Mermaid diagram. In professional codebases, consistent naming is a simple but powerful control: it makes logs and traces readable.

Next we set the entry point to `HYPOTHESIS`. This means every run starts by generating a structured hypothesis. We then add the linear edges `HYPOTHESIS → BACKTEST_TOOL → REVIEW`. This encodes the basic research cycle: propose, test, evaluate.

The most important step is adding **conditional edges** from `REVIEW`. This is where state-driven routing becomes concrete. We call `builder.add_conditional_edges()` and pass the `router(state)` function. The router returns a label, and LangGraph maps that label to the next node. In our case, the mapping is: if the router returns `HYPOTHESIS`, we loop back; if it returns `END`, we terminate at the explicit `END` node. This is the correct pattern for governed loops: the loop is visually obvious, logically obvious, and bounded by the state counter and policy.

After wiring the graph, we compile it into an executable object. Compilation is not just “making it runnable.” It freezes the topology into a concrete artifact that can be executed, visualized, and inspected. Immediately after compilation, we call `display_langgraph_mermaid(graph)`. This produces the Mermaid diagram that must match the topology exactly. The diagram is a learning artifact (students understand the flow) and a governance artifact (reviewers see the control structure).

Finally, we build `GRAPH_SPEC`, a JSON representation of the topology and loop bounds. This is separate from the Mermaid text. Mermaid is for humans; `graph_spec.json` is for machines and audits. It lists nodes, edges, and loop-bound metadata. Exporting this is important in real organizations: it lets you compare topologies across notebook versions, detect unexpected changes, and attach the workflow structure to a run ID.
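
Comparing topologies across versions can be done with a few set operations on the exported spec. The sketch below assumes two specs shaped like `GRAPH_SPEC` (a `nodes` list with `id` fields and an `edges` list with `from`, `to`, and an optional `condition`); `diff_topology` and the two sample specs are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List, Set, Tuple

Edge = Tuple[str, str, str]

def edge_set(spec: Dict[str, Any]) -> Set[Edge]:
    """Edges as hashable (from, to, condition) triples."""
    return {(e["from"], e["to"], e.get("condition", "")) for e in spec.get("edges", [])}

def diff_topology(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Report nodes and edges added or removed between two graph specs."""
    old_nodes = {n["id"] for n in old.get("nodes", [])}
    new_nodes = {n["id"] for n in new.get("nodes", [])}
    return {
        "nodes_added": sorted(new_nodes - old_nodes),
        "nodes_removed": sorted(old_nodes - new_nodes),
        "edges_added": sorted(edge_set(new) - edge_set(old)),
        "edges_removed": sorted(edge_set(old) - edge_set(new)),
    }

v1 = {
    "nodes": [{"id": "HYPOTHESIS"}, {"id": "REVIEW"}],
    "edges": [{"from": "HYPOTHESIS", "to": "REVIEW"}],
}
v2 = {
    "nodes": [{"id": "HYPOTHESIS"}, {"id": "BACKTEST_TOOL"}, {"id": "REVIEW"}],
    "edges": [
        {"from": "HYPOTHESIS", "to": "BACKTEST_TOOL"},
        {"from": "BACKTEST_TOOL", "to": "REVIEW"},
    ],
}
print(diff_topology(v1, v2))
```

An empty diff across two notebook versions is evidence that the control structure did not change, even if node code did.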

For financial practitioners, this cell mirrors what model governance teams want: not just “code,” but a declared process. The process is visible, inspectable, and consistent with the produced artifacts.


###7.2.CODE AND IMPLEMENTATION

In [22]:
# CELL 7/10 — Build topology (LangGraph conditional routing) + compile + visualize + graph_spec.json
builder = StateGraph(N4State)

builder.add_node("HYPOTHESIS", AgentNode("HYPOTHESIS", hypothesis_node))
builder.add_node("BACKTEST_TOOL", AgentNode("BACKTEST_TOOL", backtest_tool_node))
builder.add_node("REVIEW", AgentNode("REVIEW", review_node))

builder.set_entry_point("HYPOTHESIS")
builder.add_edge("HYPOTHESIS", "BACKTEST_TOOL")
builder.add_edge("BACKTEST_TOOL", "REVIEW")

builder.add_conditional_edges(
    "REVIEW",
    router,
    {"HYPOTHESIS": "HYPOTHESIS", "END": END},
)

graph = builder.compile()
mermaid_text = display_langgraph_mermaid(graph)

GRAPH_SPEC: Dict[str, Any] = {
    "project": CFG["project"],
    "notebook": CFG["notebook"],
    "run_id": RUN_ID,
    "nodes": [
        {"id": "HYPOTHESIS", "type": "agent"},
        {"id": "BACKTEST_TOOL", "type": "tool_wrapper"},
        {"id": "REVIEW", "type": "agent"},
        {"id": "END", "type": "terminal"},
    ],
    "edges": [
        {"from": "HYPOTHESIS", "to": "BACKTEST_TOOL"},
        {"from": "BACKTEST_TOOL", "to": "REVIEW"},
        {"from": "REVIEW", "to": "HYPOTHESIS", "condition": "router(state)=='HYPOTHESIS'"},
        {"from": "REVIEW", "to": "END", "condition": "router(state)=='END'"},
    ],
    "loop_bound": {"max_iters": int(CFG["max_iters"]), "counter_field": "iter_count"},
    "mermaid": mermaid_text,
}

print("Graph compiled. Mermaid length:", len(mermaid_text))


Graph compiled. Mermaid length: 456


##8.EXECUTION

###8.1.OVERVIEW

**Cell 8 — Execute the workflow as a bounded, state-driven run (the “live” system)**

Cell 8 is where the notebook stops being a specification and becomes an actual run. This cell takes the compiled graph and executes it against an initial state. In a classroom, this is the moment students see the system behave like a controlled research loop rather than a collection of functions.

We define a `run_workflow()` function that creates an initial `N4State`. The initial state includes the run ID, an iteration counter, the maximum iteration budget, the user request, and empty containers for trace and errors. Notice what is not present: there is no implicit conversational memory and no hidden global state driving decisions. Everything that matters is inside `state`.

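We then run a bounded execution loop around the graph. This may look slightly redundant because LangGraph itself has conditional edges, but it serves an important governance purpose: it makes the iteration counter progression explicit and guarantees a hard bound even if a future modification accidentally weakens the router logic. Each cycle increments `iter_count` deterministically, then calls `graph.invoke(state)`. Inside that invocation, the graph runs `HYPOTHESIS → BACKTEST_TOOL → REVIEW`, and the conditional edge from REVIEW will either loop or stop. The invocation also passes `recursion_limit=30`, which caps the number of graph steps within a single call. This backstop matters because the node functions listed in Cell 6 read `iter_count` but do not advance it, so a loop that stays inside one invocation is bounded by that limit rather than by the counter.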

After the graph invocation returns, we check the decision in state. If the reviewer set `decision` to STOP, we break. If the iteration counter has reached the maximum, we break. This is a belt-and-suspenders approach: bounded loops are a professional necessity. In production, a runaway loop can burn API budget, flood logs, and create unpredictable behavior. Here, boundedness is also pedagogical: students should see that the system does not “keep thinking forever.” It behaves like a controlled process with a budget.
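
The belt-and-suspenders bound is easy to demonstrate without LangGraph. In the sketch below, `step` stands in for `graph.invoke`, and `always_iterate` is a deliberately misbehaving stub that never returns STOP; both names, and `run_bounded` itself, are hypothetical. The point is that the outer loop terminates on its own budget regardless of what the step returns.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]

def run_bounded(step: Callable[[State], State], state: State, max_iters: int) -> State:
    """Call `step` at most `max_iters` times, stopping early on a non-ITERATE decision."""
    for _ in range(max_iters):
        state["iter_count"] = int(state.get("iter_count", 0)) + 1
        state = step(state)
        if state.get("decision") != "ITERATE":
            break
    return state

def always_iterate(state: State) -> State:
    """Worst case: a reviewer that never says STOP."""
    state["decision"] = "ITERATE"
    return state

def stop_immediately(state: State) -> State:
    """Best case: the reviewer stops on the first pass."""
    state["decision"] = "STOP"
    return state

runaway = run_bounded(always_iterate, {"iter_count": 0}, max_iters=3)
early = run_bounded(stop_immediately, {"iter_count": 0}, max_iters=3)
print(runaway["iter_count"], early["iter_count"])
```

Even with a reviewer that always asks for another pass, the run ends when the budget is spent.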

At the end of the run, we print a compact summary: the final decision, iteration count, key backtest metrics, and the termination reason. This is intentionally short. The full detail is stored in state and will be exported in the next cells. This matches real workflows: a run produces an executive summary for quick inspection and a full artifact bundle for deeper review.

The run recorded below stops after one iteration with a negative Sharpe. That is not an execution error; it is the policy doing what it was told. Average turnover (about 0.04) is below the 0.35 threshold and max drawdown (about -0.03) does not breach the -0.20 threshold, so neither ITERATE rule in the Cell 6 review prompt fires and the reviewer returns STOP. If you want a negative Sharpe to trigger one refinement pass while budget remains, add that rule to the policy in Cell 6. That is precisely the point of Cell 8: the system's behavior is controlled by explicit rules, so you change behavior by changing policy, not by adding ad hoc manual steps.

For financial practitioners, this cell demonstrates the operational pattern of governed research: you can run the same workflow repeatedly, compare results, and trust that the process is consistent. That is exactly what you need in a desk environment where multiple analysts must produce work that is reproducible and reviewable.


###8.2.CODE AND IMPLEMENTATION

In [23]:
# CELL 8/10 — Execute bounded runs (state-driven). Iter counter updated deterministically between cycles.
def run_workflow(user_request: str) -> N4State:
    state: N4State = {
        "run_id": RUN_ID,
        "iter_count": 0,
        "max_iters": int(CFG["max_iters"]),
        "user_request": user_request,
        "trace": [],
        "hypothesis_errors": [],
        "decision": "ITERATE",
    }

    # Each cycle is a full graph pass; conditional edge decides whether we loop.
    # We increment iter_count deterministically before each pass.
    for _ in range(int(CFG["max_iters"]) + 1):
        state["iter_count"] = int(state.get("iter_count", 0)) + 1
        state = graph.invoke(state, config={"recursion_limit": 30})
        if state.get("decision") != "ITERATE":
            break
        if int(state.get("iter_count", 0)) >= int(state.get("max_iters", CFG["max_iters"])):
            break
    return state

USER_TASK = "Propose a simple mean-reversion hypothesis and test it fast on synthetic data; reduce turnover while keeping Sharpe acceptable."
FINAL_STATE = run_workflow(USER_TASK)

print("DECISION:", FINAL_STATE.get("decision"))
print("ITER_COUNT:", FINAL_STATE.get("iter_count"), "/", FINAL_STATE.get("max_iters"))
print("BACKTEST_METRICS:", stable_json_dumps((FINAL_STATE.get("backtest_json", {}) or {}).get("metrics", {})))
print("TERMINATION_REASON:", FINAL_STATE.get("termination_reason", "n/a"))


DECISION: STOP
ITER_COUNT: 1 / 2
BACKTEST_METRICS: {"avg_turnover":0.04054054054054054,"max_drawdown":-0.028357685393366694,"sharpe_approx":-0.35617625353483345,"total_return":-0.00908812327420816}
TERMINATION_REASON: STOP - Policy criteria not triggered. Strategy underperforms with negative Sharpe and negative returns. Low turnover and controlled drawdown insufficient to justify continuation. Recommend strategy re


##9.EXPORT ARTIFACTS

###9.1.OVERVIEW

**Cell 9 — Export the audit artifacts (run_manifest.json, graph_spec.json, final_state.json)**

Cell 9 is where the notebook becomes “institutional.” In real finance teams, what matters is not just that you got an answer, but that you can prove what you did. This cell exports three artifacts that capture the run in a way another person can inspect without rerunning the notebook.

First, we define `write_json()`, a small utility that writes JSON with stable formatting: UTF-8 encoding, pretty indentation, and sorted keys. This matters for review. Sorted keys reduce noisy diffs when artifacts are compared in version control. Pretty formatting makes it readable for humans. Governance is not only about security; it is also about making review easy.
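
The effect of sorted keys on diffs can be shown in a few lines. The two dictionaries below are illustrative and hold the same data inserted in a different order; `stable_text` is a hypothetical helper that uses the same `json.dumps` arguments as `write_json()`.

```python
import json
from typing import Any

def stable_text(obj: Any) -> str:
    """Serialize with the same options as write_json(): readable and key-sorted."""
    return json.dumps(obj, ensure_ascii=False, indent=2, sort_keys=True)

# Same content, different insertion order (illustrative values).
a = {"run_id": "abc", "decision": "STOP", "iter_count": 1}
b = {"iter_count": 1, "run_id": "abc", "decision": "STOP"}

same_when_sorted = stable_text(a) == stable_text(b)
same_when_unsorted = json.dumps(a) == json.dumps(b)
print(same_when_sorted, same_when_unsorted)
```

Without `sort_keys=True`, two logically identical artifacts can differ byte for byte, which shows up as noise in version control.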

Next, we update the run manifest with completion information. We add a completion timestamp and an `outputs_summary` block that includes the final decision, iteration count, termination reason, and key backtest metrics. This is important because it gives a reviewer a “front page.” They can open `run_manifest.json` and immediately see what happened, without opening the full final state. In professional settings, this is how you structure evidence: one document summarizes, other documents support.

Then we write the three required files:

**run_manifest.json** is the metadata record: who/what/when, configuration hash, environment fingerprint, and summary outcomes. It ties everything together under `run_id`.

**graph_spec.json** is the structural record: nodes, edges, and loop bounds, plus the Mermaid text. It documents the process itself. This is important because results are meaningless without knowing the process that generated them. If someone changes topology later, you can detect that change by comparing graph specs.

**final_state.json** is the full execution record: the structured hypothesis, the deterministic tool outputs, the review object, the decision, and the bounded trace. This is the deepest level of auditability. If someone asks, “What parameters did you test?” you can point to `final_state.json["hypothesis_json"]["hypothesis"]["params"]`. If someone asks, “What exactly was the tool output?” you can point to `final_state.json["backtest_json"]`. If someone asks, “Why did we stop?” you can point to `final_state.json["review_json"]` and `termination_reason`.
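
Those lookups can be written so that a missing key produces a clear default instead of a `KeyError`. The sketch below uses a hypothetical `dig` helper and an abbreviated in-memory sample of the final state; the parameter values match the trace shown later in this notebook, and everything else about the sample is trimmed for brevity.

```python
from typing import Any, Dict

def dig(obj: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts by key, returning `default` if any level is missing."""
    cur = obj
    for k in keys:
        if not isinstance(cur, dict) or k not in cur:
            return default
        cur = cur[k]
    return cur

# Abbreviated sample shaped like final_state.json (not the full artifact).
final_state: Dict[str, Any] = {
    "decision": "STOP",
    "hypothesis_json": {
        "hypothesis": {
            "params": {"lookback": 20, "max_leverage": 0.75, "threshold_z": 2.0}
        }
    },
}

params = dig(final_state, "hypothesis_json", "hypothesis", "params", default={})
metrics = dig(final_state, "backtest_json", "metrics", default={})
print(params, metrics)
```

A reviewer script built this way can run against partial or older artifacts without crashing on the first absent field.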

Finally, we print a confirmation and list the JSON files present in the working directory. This is a small sanity signal that the artifacts exist. In a classroom, students learn to expect artifacts as part of execution, not as an optional step.

For financial practitioners, this cell is the difference between “a backtest” and “a reviewable research artifact.” In committees, risk reviews, and model governance, outputs without provenance are not accepted. Cell 9 enforces that provenance is produced every run, automatically, with no manual steps. That is how disciplined teams avoid the slow decay into unreproducible research.


###9.2.CODE AND IMPLEMENTATION

In [24]:
# CELL 9/10 — Export required artifacts: run_manifest.json, graph_spec.json, final_state.json
def write_json(path: str, obj: Any) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(obj, ensure_ascii=False, indent=2, sort_keys=True))

RUN_MANIFEST["completed_ts_utc"] = utc_now_iso()
RUN_MANIFEST["outputs_summary"] = {
    "decision": FINAL_STATE.get("decision"),
    "iter_count": FINAL_STATE.get("iter_count"),
    "termination_reason": FINAL_STATE.get("termination_reason", ""),
    "key_metrics": (FINAL_STATE.get("backtest_json", {}) or {}).get("metrics", {}),
}

write_json("run_manifest.json", RUN_MANIFEST)
write_json("graph_spec.json", GRAPH_SPEC)
write_json("final_state.json", FINAL_STATE)

print("WROTE: run_manifest.json, graph_spec.json, final_state.json")
print("FILES:", [p for p in os.listdir(".") if p.endswith(".json")])


WROTE: run_manifest.json, graph_spec.json, final_state.json
FILES: ['run_manifest.json', 'final_state.json', 'graph_spec.json']


##10.AUDIT BUNDLE

###10.1.OVERVIEW

**Cell 10 — Sanity checks and a readable audit view (trust but verify)**

Cell 10 is the final control layer. In professional workflows, exporting artifacts is necessary but not sufficient. You also need quick validation that the artifacts are internally consistent and that the run produced what you think it produced. This cell provides that validation in a simple, teachable way.

We start by defining `load_json()`, a minimal reader utility. Then we load the three exported artifacts: `run_manifest.json`, `graph_spec.json`, and `final_state.json`. Loading them back from disk is intentional. It verifies that what we wrote is valid JSON and that the files can be consumed by downstream processes. In real systems, this is how you test that your pipeline is not writing corrupted or partial outputs.

Next, we run a few assertions. We check that the `run_id` is consistent across all three files. This matters because the run ID is the join key that connects metadata, topology, and execution results. If run IDs diverge, you can accidentally pair the wrong graph spec with the wrong final state, which is a serious audit failure. We also assert that the Mermaid text exists in the graph spec. This ensures the visualization artifact is present and prevents “silent skipping” of the diagram requirement.
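
The same invariants can be collected into a function that reports every problem at once instead of halting on the first failed assertion. This is a sketch of an alternative style, not the notebook's code; `check_artifacts` and the sample dictionaries are hypothetical.

```python
from typing import Any, Dict, List

def check_artifacts(
    manifest: Dict[str, Any], graph_spec: Dict[str, Any], final_state: Dict[str, Any]
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    run_ids = {
        "run_manifest": manifest.get("run_id"),
        "graph_spec": graph_spec.get("run_id"),
        "final_state": final_state.get("run_id"),
    }
    if None in run_ids.values() or len(set(run_ids.values())) != 1:
        problems.append(f"run_id mismatch or missing: {run_ids}")
    mermaid = graph_spec.get("mermaid")
    if not isinstance(mermaid, str) or not mermaid:
        problems.append("graph_spec.mermaid missing or empty")
    return problems

ok = check_artifacts(
    {"run_id": "r1"}, {"run_id": "r1", "mermaid": "graph TD;"}, {"run_id": "r1"}
)
bad = check_artifacts({"run_id": "r1"}, {"run_id": "r2"}, {"run_id": "r1"})
print(ok, bad)
```

Returning a list makes the check reusable in a pipeline, where the caller decides whether a problem should fail the run or only be logged.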

Then we print a compact “audit view.” The goal is to give a human reviewer the key facts in a few lines: run ID, start and completion timestamps, model lock, configuration hash, the list of graph nodes, the final decision, and the final metrics. This printout is not meant to replace the JSON artifacts. It is meant to be the quick dashboard you look at immediately after a run to confirm nothing obviously broke.

Finally, we print the tail of the trace. The trace is a bounded list of node events with timestamps and small payloads. By printing only the last few entries, we keep the output readable and avoid flooding the notebook. This is important pedagogically: students should learn that logs exist, but logs must be bounded and curated to remain useful. The tail trace lets you see the last transitions—typically HYPOTHESIS, BACKTEST_TOOL, REVIEW—and confirm that the workflow actually passed through the intended path.
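
A bounded trace can be sketched with a slice on append. The helper below is an assumption about how such a bound might be implemented; it is not the notebook's `trace_append`, and the `MAX_TRACE` cap is an illustrative value. The entry shape (`node`, `payload`, `ts_utc`) matches the trace output printed by this cell.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

MAX_TRACE = 50  # assumed cap, for illustration only

def append_bounded(
    trace: List[Dict[str, Any]], node: str, payload: Dict[str, Any], max_len: int = MAX_TRACE
) -> List[Dict[str, Any]]:
    """Return a new trace with the entry appended, keeping only the newest `max_len` entries."""
    entry = {
        "node": node,
        "payload": payload,
        "ts_utc": datetime.now(timezone.utc).isoformat(),
    }
    return (trace + [entry])[-max_len:]

trace: List[Dict[str, Any]] = []
for i in range(10):
    trace = append_bounded(trace, f"N{i}", {"iter": i}, max_len=4)

print([e["node"] for e in trace])
```

The oldest entries are dropped first, so the tail always reflects the most recent transitions.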

For financial practitioners, this cell maps to a basic model control practice: validate artifacts and check invariants. You do not ship results that have not passed internal consistency checks. Even in research, you want immediate detection of mismatches: wrong model, wrong config, missing graph, or missing outputs. Cell 10 provides that minimum assurance.

In summary, Cell 10 closes the loop: the system is not only state-driven and tool-augmented, it is also self-checking. That is the mindset we want for real trading and research environments: deterministic where possible, explicit everywhere, and always producing evidence that can be reviewed.


###10.2.CODE AND IMPLEMENTATION

In [25]:
# CELL 10/10 — Audit sanity checks + readable tail trace (bounded)
def load_json(path: str) -> Any:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

rm = load_json("run_manifest.json")
gs = load_json("graph_spec.json")
fs = load_json("final_state.json")

assert rm["run_id"] == RUN_ID
assert gs["run_id"] == RUN_ID
assert fs["run_id"] == RUN_ID
assert "mermaid" in gs and isinstance(gs["mermaid"], str) and len(gs["mermaid"]) > 0

print("=== AUDIT VIEW ===")
print("RUN:", rm["run_id"])
print("TS_UTC:", rm["ts_utc"], "→", rm.get("completed_ts_utc"))
print("MODEL_LOCK:", rm["model_lock"])
print("CONFIG_HASH_SHA256:", rm["config_hash_sha256"])
print("GRAPH_NODES:", [n["id"] for n in gs["nodes"]])
print("FINAL_DECISION:", fs.get("decision"))
print("FINAL_METRICS:", stable_json_dumps((fs.get("backtest_json", {}) or {}).get("metrics", {})))

tail = (fs.get("trace", []) or [])[-6:]
print("TRACE_TAIL_JSON:")
print(json.dumps(tail, ensure_ascii=False, indent=2))


=== AUDIT VIEW ===
RUN: 55d054f6-1440-4610-b200-179c581a0c52
TS_UTC: 2026-02-18T19:08:17.589257+00:00 → 2026-02-18T19:13:48.199569+00:00
MODEL_LOCK: {'model': 'claude-haiku-4-5-20251001', 'temperature': 0.0}
CONFIG_HASH_SHA256: 9d8ef239f6baf1aca86fc5129ece19fee6bbd9e0e4612211913b2fa7081fa376
GRAPH_NODES: ['HYPOTHESIS', 'BACKTEST_TOOL', 'REVIEW', 'END']
FINAL_DECISION: STOP
FINAL_METRICS: {"avg_turnover":0.04054054054054054,"max_drawdown":-0.028357685393366694,"sharpe_approx":-0.35617625353483345,"total_return":-0.00908812327420816}
TRACE_TAIL_JSON:
[
  {
    "node": "HYPOTHESIS",
    "payload": {
      "errors": [],
      "iter": 1,
      "params": {
        "lookback": 20,
        "max_leverage": 0.75,
        "threshold_z": 2.0
      }
    },
    "ts_utc": "2026-02-18T19:12:35.555129+00:00"
  },
  {
    "node": "BACKTEST_TOOL",
    "payload": {
      "metrics": {
        "avg_turnover": 0.04054054054054054,
        "max_drawdown": -0.028357685393366694,
        "sharpe_approx": -0.35

##11.CONCLUSION

Imagine you are in a morning research meeting on a trading desk.

A junior analyst says: “I have an idea. When yesterday’s move is unusually big, the price might mean-revert tomorrow.”  
The PM replies: “Fine. Test it quickly. But don’t waste the whole day. And I want something I can review.”

This notebook is the **machine that produces that reviewable answer**.

It does **not** place real trades.
It does **not** send orders.
It runs a **controlled simulation** to decide whether the idea is worth any further work.

Here is what happens, like a story:

**Step 1: The system writes down the idea in a strict format**
The workflow starts at a node called **HYPOTHESIS**.
Think of it as the analyst being forced to write the idea clearly:

- What is the strategy family? (fixed here: mean-reversion z-score)
- What parameters are you proposing? (lookback, threshold, leverage)
- What are the risks?
- What should we check to see if it fails?

So the first deliverable is: **a structured hypothesis**, not a chart, not a narrative.
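
As a minimal sketch of what "a structured hypothesis" could look like in code, the snippet below uses a dataclass plus a validator. The class name, field names, and validation bounds are illustrative assumptions, not the notebook's actual schema; only the three parameter names (`lookback`, `threshold_z`, `max_leverage`) and their values come from the run trace shown earlier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Hypothesis:
    # Hypothetical schema: field names are assumptions made for illustration.
    strategy_family: str
    lookback: int
    threshold_z: float
    max_leverage: float
    risks: List[str] = field(default_factory=list)
    failure_checks: List[str] = field(default_factory=list)


def validate_hypothesis(h: Hypothesis) -> List[str]:
    """Return validation errors; an empty list means the hypothesis is well-formed."""
    errors = []
    if h.strategy_family != "mean_reversion_zscore":
        errors.append("strategy_family is fixed to 'mean_reversion_zscore'")
    if h.lookback < 2:
        errors.append("lookback must be at least 2")
    if h.threshold_z <= 0:
        errors.append("threshold_z must be positive")
    if not (0 < h.max_leverage <= 1.0):  # assumed bound, not taken from the notebook
        errors.append("max_leverage must be in (0, 1]")
    if not h.risks:
        errors.append("at least one risk must be stated")
    if not h.failure_checks:
        errors.append("at least one failure check must be stated")
    return errors


hypothesis = Hypothesis(
    strategy_family="mean_reversion_zscore",
    lookback=20,
    threshold_z=2.0,
    max_leverage=0.75,
    risks=["signal may only reflect noise in the synthetic series"],
    failure_checks=["negative approximate Sharpe after costs"],
)
errors = validate_hypothesis(hypothesis)
```

The point of the design is that a malformed idea is rejected before any backtest runs, and the reasons are captured as data rather than as a conversation.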

**Step 2: The system tests the idea using a calculator, not imagination**
Next the workflow goes to **BACKTEST_TOOL**.
This is a deterministic tool (pure code), like a “research calculator.”

It generates a **synthetic price series** (fake but consistent, like a flight simulator).  
Then it applies the hypothesis rules to create **simulated positions** (long/short/flat).  
Those positions create **simulated P&L**.

From this, the tool computes concrete numbers:

- total return
- approximate Sharpe
- max drawdown
- average turnover

So the second deliverable is: **a structured backtest result** produced by code.
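
The pipeline described above (synthetic prices, z-score rule, positions, P&L, metrics) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's actual tool: the function names, the seeded random-walk generator, the 252-day annualization, and the absence of transaction costs are all choices made here for brevity.

```python
import math
import random
import statistics


def synthetic_prices(n=300, seed=7):
    """Seeded geometric random walk: the same seed always yields the same series."""
    rng = random.Random(seed)
    prices = [100.0]
    for _ in range(n - 1):
        prices.append(prices[-1] * math.exp(rng.gauss(0.0, 0.01)))
    return prices


def run_backtest(prices, lookback=20, threshold_z=2.0, max_leverage=0.75):
    """Fade yesterday's move when its z-score exceeds the threshold."""
    rets = [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]
    positions = [0.0] * len(rets)
    for t in range(lookback + 1, len(rets)):
        window = rets[t - 1 - lookback:t - 1]  # history strictly before yesterday
        mu, sd = statistics.fmean(window), statistics.pstdev(window)
        z = (rets[t - 1] - mu) / sd if sd > 0 else 0.0
        if z > threshold_z:
            positions[t] = -max_leverage  # big up move yesterday: go short today
        elif z < -threshold_z:
            positions[t] = max_leverage   # big down move yesterday: go long today
    pnl = [p * r for p, r in zip(positions, rets)]

    # Walk the equity curve once to get total return and max drawdown.
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for x in pnl:
        equity *= 1.0 + x
        peak = max(peak, equity)
        max_dd = min(max_dd, equity / peak - 1.0)

    vol = statistics.pstdev(pnl)
    sharpe = statistics.fmean(pnl) / vol * math.sqrt(252) if vol > 0 else 0.0
    turnover = [abs(positions[i] - positions[i - 1]) for i in range(1, len(positions))]
    return {
        "total_return": equity - 1.0,
        "sharpe_approx": sharpe,
        "max_drawdown": max_dd,
        "avg_turnover": statistics.fmean(turnover),
    }


metrics = run_backtest(synthetic_prices())
```

Because the price generator is seeded and the rule uses only past returns, calling `run_backtest` twice with the same inputs returns identical metrics, which is what makes the result reviewable.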

**Step 3: The system decides whether to spend one more minute or stop**
Then we reach **REVIEW**.
This is like a senior researcher applying a desk rule:

- “If it’s clearly bad, stop.”
- “If it’s weak but might improve with one small, justified change, allow one more test.”
- “Never iterate forever. We have a time budget.”

So the system outputs one decision:

- **ITERATE** (go back and test once more with a parameter adjustment), or
- **STOP** (end the workflow at this gate, leaving a human to park, reject, or escalate the idea)

This is the third deliverable: **a documented decision and the reason for it**.
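
The desk rule above can be written as a small deterministic function. The sketch below is hypothetical: the Sharpe thresholds, the `max_refinements` budget, and the function signature are assumptions chosen to mirror the three quoted rules, not the notebook's actual review logic.

```python
def review(metrics, refinements_used, max_refinements=1,
           reject_sharpe=0.0, promising_sharpe=1.0):
    """Return a decision plus the reason for it, so the choice is documented."""
    sharpe = metrics.get("sharpe_approx", 0.0)
    if sharpe < reject_sharpe:
        # "If it's clearly bad, stop."
        return {"decision": "STOP", "reason": f"sharpe_approx {sharpe:.2f} is below {reject_sharpe}"}
    if sharpe >= promising_sharpe:
        # Good enough that further quick tweaks add nothing at this gate.
        return {"decision": "STOP", "reason": "promising; hand over for deeper evaluation"}
    if refinements_used >= max_refinements:
        # "Never iterate forever. We have a time budget."
        return {"decision": "STOP", "reason": "refinement budget exhausted"}
    # "Weak but might improve with one small, justified change."
    return {"decision": "ITERATE", "reason": "weak result; one refinement allowed"}


clearly_bad = review({"sharpe_approx": -0.356}, refinements_used=0)
weak_first_pass = review({"sharpe_approx": 0.4}, refinements_used=0)
weak_second_pass = review({"sharpe_approx": 0.4}, refinements_used=1)
```

Returning the reason alongside the decision is what turns the review into an auditable record instead of a judgment call that has to be remembered.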

**What does an iteration mean?**
One iteration means:
“We changed one small thing (like a threshold or leverage) and re-ran the same test.”

It does not mean “keep searching until we like the result.”
It means “one controlled refinement pass, then we stop.”
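
One way to enforce "one small thing changed" is to make the refinement step itself refuse anything else. The helper names and the new threshold value of 1.5 below are hypothetical, introduced only to illustrate the idea.

```python
def refine_once(params, name, new_value):
    """Return a copy of params with exactly one existing parameter changed."""
    if name not in params:
        raise KeyError(f"unknown parameter: {name}")
    if params[name] == new_value:
        raise ValueError("a refinement must actually change the parameter")
    refined = dict(params)  # copy, so the baseline stays intact for the audit trail
    refined[name] = new_value
    return refined


def diff_params(before, after):
    """Map each changed parameter to its (old, new) pair."""
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}


baseline = {"lookback": 20, "threshold_z": 2.0, "max_leverage": 0.75}
refined = refine_once(baseline, "threshold_z", 1.5)
change = diff_params(baseline, refined)
```

Recording `change` next to the new backtest result makes it possible to see later exactly what differed between the two passes.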

**What happens when we STOP?**
STOP means:
“We are done with this quick triage workflow.”

In practice, STOP leads to one of two outcomes:

1) **Reject/Park**  
The idea did not look good enough to justify more research time right now.

2) **Escalate to a deeper process (outside this notebook)**  
If results were promising, STOP can also mean:  
“Stop the quick test here and move to a more serious evaluation stage.”  
That stage would involve real data, more realistic cost assumptions, stress tests, and a governance review.

But this notebook’s job ends at the gate.
It produces evidence so a human can choose the next step.

**So what is the final deliverable, in plain language?**
At the end you receive a small “research folder”:

- A diagram showing the workflow steps (the graph)
- A file that records the run identity and environment (**run_manifest.json**)
- A file that records the workflow structure (**graph_spec.json**)
- A file that contains the hypothesis, the test results, and the stop/iterate decision (**final_state.json**)

That folder is the deliverable.

It answers, concretely:
“What was the idea, what test did we run, what did we measure, and why did we stop (or refine once)?”
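
A minimal sketch of how such a folder might be written is shown below. The three file names come from the list above; everything else is an assumption made for illustration: the `write_artifacts` helper, the manifest fields, and `stable_dumps`, which is a guessed stand-in for the notebook's `stable_json_dumps` and may not match it.

```python
import hashlib
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def stable_dumps(obj):
    """Canonical JSON: sorted keys and compact separators, so equal configs hash equally."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def write_artifacts(out_dir, config, graph_spec, final_state):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "run_id": str(uuid.uuid4()),
        "ts_utc": datetime.now(timezone.utc).isoformat(),
        "config_hash_sha256": hashlib.sha256(stable_dumps(config).encode("utf-8")).hexdigest(),
    }
    files = {
        "run_manifest.json": manifest,
        "graph_spec.json": graph_spec,
        "final_state.json": final_state,
    }
    for name, payload in files.items():
        text = json.dumps(payload, sort_keys=True, indent=2, ensure_ascii=False)
        (out_dir / name).write_text(text, encoding="utf-8")
    return manifest


config = {"model": "claude-haiku-4-5-20251001", "temperature": 0.0}
graph_spec = {"nodes": ["HYPOTHESIS", "BACKTEST_TOOL", "REVIEW", "END"]}
final_state = {
    "decision": "STOP",
    "params": {"lookback": 20, "threshold_z": 2.0, "max_leverage": 0.75},
}
out_dir = tempfile.mkdtemp()
manifest = write_artifacts(out_dir, config, graph_spec, final_state)
```

Hashing the canonical form of the configuration means two runs with the same settings share a hash regardless of key order, which gives a reviewer a quick way to confirm that two folders describe the same setup.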

