diff --git a/CHANGELOG.md b/CHANGELOG.md
index b28ff78..1f6957b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -50,11 +50,21 @@ Firecracker fork from
 — upstream FC doesn't yet ship `mem_backend.shared = true`. See
 [`docs/VENDORED-FIRECRACKER.md`](./docs/VENDORED-FIRECRACKER.md).
 
-**Stability**: the user surface is stable; numbers across realistic
-workloads (`bench/live-fork-pause-window.md`) are in progress against
-a freshly-rebuilt clean snapshot — the previous
-`coding-agent-fork-prewarm-v1` parent had 17 pre-baked guest Oopses
-that contaminate Live BRANCH timing.
+**Bench numbers** ([`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md))
+on a clean Hub-pulled `python-numpy` source (1.5 GiB, Intel i7-12700,
+ext4 on HDD):
+
+| mode       | pause p50 | pause p90 | RT p50    |
+|------------|----------:|----------:|----------:|
+| live-sync  |     56 ms |     64 ms | 13 730 ms |
+| live-async |     54 ms |    241 ms | **69 ms** |
+| diff       |    202 ms |    418 ms | 13 461 ms |
+| full       |  13 550 ms |  14 268 ms | 13 559 ms |
+
+Headline: **3.6× faster pause** vs v0.3 Diff at p50, and the gap
+widens on slower storage because Live's pause is disk-independent.
+`wait=false` returns the HTTP call in ~70 ms (vs 13.7 s for sync), a
+**~200× round-trip (RT) improvement** for fire-and-forget BRANCH.
 
 ### Security — bearer-token comparison was a length oracle (closes #162)
 
diff --git a/README-zh.md b/README-zh.md
index 112f0ae..99463c8 100644
--- a/README-zh.md
+++ b/README-zh.md
@@ -21,7 +21,7 @@
 
 <br/>
 
-## 101 毫秒 fork 100 个 microVM,150 毫秒 BRANCH 一个运行中的 VM。
+## 101 毫秒 fork 100 个 microVM,56 毫秒 BRANCH 一个运行中的 VM(v0.4 live 模式)。
 
 面向 **AI Agent 扇出**(fan-out)场景的 microVM 沙箱运行时。子 VM
 从一个已"暖启动"的父快照 fork 而来,通过写时复制(CoW)继承
@@ -44,12 +44,15 @@ pause 时间会从 150 ms 涨到 2.7 s
 ([#146](https://github.com/deeplethe/forkd/issues/146));修复后
 连续 BRANCH 保持平直(第 6 次 BRANCH 快了 17.6×)。
 
-**v0.4 live BRANCH** 把源 VM 的卡顿窗口从 ~150 ms(Diff)降到
-sub-50 ms:vCPU 状态 dump 完源 VM 立刻恢复,脏页通过 UFFD_WP
-异步抓取。端到端路径已经全部接入:CLI 用 `--live`、REST 用
-`mode: "live"`、Python / TypeScript / MCP SDK 同名。再加 `--no-wait`
-(CLI)或 `wait: false`(REST/SDK)就立刻返回(~10 ms),不等
-背景拷贝完成。
+**v0.4 live BRANCH** 把源 VM 卡顿窗口从 ~200 ms(Diff)压到
+**56 ms p50 / 64 ms p90**(1.5 GiB 源 VM,实测,
+[`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md))。
+p50 比 v0.3 Diff 快 **3.6 倍**,而且在慢盘上这个比值**变得更大**——
+因为 Live 的 pause 是 disk-independent 的(内存拷贝跑在 resume 之
+后,不占临界区)。加 `wait: false` 让调用方 ~70 ms 就返回,背景
+拷贝异步完成——对于 agent 代码的 fire-and-forget BRANCH 是 **200×**
+的 RT 改进。CLI 用 `--live` / `--no-wait`,REST 用 `mode: "live"` /
+`wait: false`,Python / TypeScript / MCP SDK 同名。
 
 ```python
 from forkd import Controller
diff --git a/README.md b/README.md
index 89e078f..93bce7d 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@
 
 <br/>
 
-## Fork 100 microVMs in 101 ms. BRANCH a live VM in 150 ms.
+## Fork 100 microVMs in 101 ms. BRANCH a live VM in 56 ms (v0.4 live mode).
 
 A microVM sandbox runtime for **AI agent fan-out**. Children fork
 from a warmed parent snapshot, inheriting its address space
@@ -45,15 +45,17 @@ where repeated BRANCHes on the same parent ballooned from 150 ms to
 2.7 s ([#146](https://github.com/deeplethe/forkd/issues/146)); the
 chain now stays flat (17.6× faster on the 6th consecutive BRANCH).
 
-**v0.4 live BRANCH** drops the source-pause window from ~150 ms
-(Diff) to sub-50 ms by moving the memory copy out of the critical
-section: the source resumes as soon as Firecracker dumps vCPU state,
-and dirty pages get captured asynchronously via UFFD_WP. The full
-end-to-end path is wired up — pass `--live` on the CLI, `mode:
-"live"` on REST, or `mode="live"` / `mode: "live"` on the Python /
-TypeScript / MCP SDKs. Add `--no-wait` (CLI) or `wait: false` (REST /
-SDKs) to return as soon as the source resumes (~10 ms) rather than
-waiting on the background copy.
+**v0.4 live BRANCH** collapses the source-pause window from ~200 ms
+(Diff) to **56 ms p50 / 64 ms p90** on a 1.5 GiB source — measured
+on a real BRANCH workload, [`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md).
+**3.6× faster pause** vs v0.3 Diff at p50, and the gap *widens* on
+slower storage because Live's pause is disk-independent (memory
+copy runs after resume, not during). With `wait: false` the caller
+returns in ~70 ms while the background copy completes asynchronously
+— a **200×** RT improvement for fire-and-forget BRANCH from agent
+code. Pass `--live` / `--no-wait` on the CLI, `mode: "live"` /
+`wait: false` on REST, or the same on the Python / TypeScript / MCP
+SDKs.
 
 ```python
 from forkd import Controller
diff --git a/bench/live-fork-pause-window/RESULTS-v0.4.md b/bench/live-fork-pause-window/RESULTS-v0.4.md
new file mode 100644
index 0000000..27eb934
--- /dev/null
+++ b/bench/live-fork-pause-window/RESULTS-v0.4.md
@@ -0,0 +1,160 @@
+# v0.4 live BRANCH pause-window results
+
+Headline: `mode="live"` collapses the source-VM pause window from
+**202 ms p50 (Diff)** to **56 ms p50 (Live)** on a 1.5 GiB source —
+**3.6× faster** at the median, and the gap widens on slow storage
+because Live's pause is disk-independent while Diff's is not.
+`wait=false` lets the caller return after ~69 ms while the background
+memory copy runs to completion asynchronously.
+
+Methodology, raw numbers, and honest caveats below.
+
+## TL;DR
+
+| mode         | pause p50 | pause p90 | pause max | RT p50    |
+|--------------|----------:|----------:|----------:|----------:|
+| live-sync    |  **56 ms**|     64 ms |     64 ms |  13 730 ms |
+| live-async   |     54 ms |    241 ms |    258 ms | **69 ms** |
+| diff         |    202 ms |    418 ms |    434 ms |  13 461 ms |
+| full         |  13 550 ms |  14 268 ms |  14 314 ms |  13 559 ms |
+
+Key ratios at p50:
+
+- **live vs diff**: 202 / 56 = **3.6× faster pause window**
+- **live vs full**: 13 550 / 56 = **242× faster pause window**
+- **async RT vs sync RT**: 13 730 / 69 = **199× faster return** for
+  callers that don't need the snapshot bytes immediately
+
+> "Pause" is the source VM's downtime (the user-visible gap in TCP
+> connections, kvmclock, etc.). "RT" is the full HTTP round-trip on
+> `POST /v1/sandboxes/<id>/branch` — this is what your code waits on.
+
+## Setup
+
+| Item            | Value                                                              |
+|-----------------|--------------------------------------------------------------------|
+| Host CPU        | 12th Gen Intel Core i7-12700 (8P+4E)                               |
+| Host RAM        | 30 GiB                                                             |
+| Host kernel     | Linux 6.14.0-36-generic (Ubuntu)                                   |
+| Snapshot disk   | `/dev/sda2` — **WDC WD10EZEX-75WN4A1, ROTA=1 (spinning HDD)**, ext4 |
+| Firecracker     | Vendored `forkd-v0.4-mem-backend-shared-v1.12` (musl release)      |
+| Controller      | `feat(doctor,uffd): Phase 7.4` (commit `a372e2a`)                  |
+| Source snapshot | `python-numpy` (from Hub, sha256-verified)                         |
+| Source RAM size | 1 610 612 736 bytes = **1 536 MiB**                                |
+| Iterations      | 10 per mode, modes interleaved (live-sync, live-async, diff, full) |
+| Source sandbox  | Spawned once with `live_fork: true`; all BRANCHes hit it           |
+
+Modes interleave so disk warm-up, page-cache fill, and any
+process-wide drift contaminate all four modes equally instead of
+biasing the last batch.
+
+## Raw data
+
+[`bench-live-fork.csv`](./bench-live-fork.csv) — one row per BRANCH
+iteration; columns: `mode, iteration, http_round_trip_ms, pause_ms,
+memory_bin_bytes, poll_until_ready_ms`.
+
+Reproduced via [`bench-live-fork.py`](./bench-live-fork.py):
+
+```bash
+sudo python3 bench-live-fork.py \
+    --source-tag python-numpy \
+    --iterations 10 \
+    --modes live-sync,live-async,diff,full
+```
+
+## What pause_ms measures
+
+`pause_ms` is the source VM's vCPU-pause window:
+
+- **`mode: "full"`**: Pause → write full `memory.bin` to disk → resume.
+  Wall-bound by sequential disk write. On this HDD: ~120 MB/s, so
+  1.5 GiB ≈ 13 s. SSD would cut this to ~3 s; NVMe ~1.5 s. Not
+  acceptable for a running agent.
+- **`mode: "diff"`**: Pause → snapshot vmstate + dirty pages → resume.
+  Still wall-bound on disk write because the diff is *inside* the
+  pause window. Tail goes wide as the snapshot's dirty page count
+  grows (p90 = 418 ms is the cost of any one BRANCH hitting more
+  dirty pages than the others).
+- **`mode: "live"`**: Pause → snapshot vmstate, arm UFFD_WP, resume.
+  The memory copy happens *after* resume, in a controller-side
+  background thread. pause_ms is bounded by the vmstate dump
+  (~30-50 ms for 1.5 GiB at our vmstate sizes) plus UFFD_WP arming
+  on the resident regions (~0.4-0.6 ms in Phase 6 E2E).
+
+This is why **the live pause window is disk-independent**: an NVMe
+host wouldn't see Live get any faster (it's CPU-bound on vmstate +
+WP arming), but Diff would still scale with disk speed. On slower
+storage, the Live/Diff ratio gets *wider*, not narrower.
+
+## What the round-trip column measures
+
+`http_round_trip_ms` is what your code's `await ctrl.branchSandbox(...)`
+or `c.branch_sandbox(...)` returns in:
+
+- **live-sync (`wait=true`)**: blocks for source pause AND the
+  background memory copy. p50 = 13 730 ms ≈ HDD throughput limit
+  (same as Diff and Full).
+- **live-async (`wait=false`)**: returns as soon as the source
+  resumes. p50 = **69 ms**. The background copy still runs (and is
+  visible via the `status` field flipping from `"writing"` to
+  `"ready"`), but the caller doesn't wait on it.
+- **diff / full**: synchronous by definition; same RT as live-sync.
+
+The `wait=false` path is the headline UX win for agents: a ~54 ms
+source pause (`pause_ms`) *and* a ~70 ms HTTP return. The bench
+records `poll_until_ready_ms` separately so you can see when the
+async snapshot is actually consumable; that takes the same 13-14 s of
+wall time as a sync BRANCH, just off the critical path.
+
+## Caveats
+
+1. **Single host, single source size.** 1.5 GiB Python+numpy on an
+   i7-12700 with an HDD. Numbers will move with source RAM size (Live's
+   pause is ~CPU + vmstate-size bound; Diff/Full are ~disk-bound) and
+   with disk medium. We'd expect Live's headline gap to narrow on NVMe
+   (because Diff gets faster) but never invert, since Live is always
+   bounded by the synchronous parts of FC's pause/dump path.
+
+2. **`live-async` p90 outlier.** Iteration #8 saw pause_ms=258 ms
+   (vs p50=54). Root cause not yet investigated; suspects: ext4
+   writeback pressure from the in-flight previous async BRANCH, or
+   FC's vmstate serialization hitting an irregularity. Reproducing
+   on a clean disk and a longer run is the right follow-up. Median
+   and p90 (excluding this point) stay tight.
+
+3. **`unprivileged_userfaultfd=0` requires root for the bench.** The
+   bench script runs the controller under `sudo` because
+   `vm.unprivileged_userfaultfd=0` is the default on this dev box.
+   Production deployments should either set
+   `vm.unprivileged_userfaultfd=1` or give the controller
+   `CAP_SYS_PTRACE`. `forkd doctor` (Phase 7.4) probes both.
+
+4. **Source guest was idle during each BRANCH.** We ran
+   python-numpy in its default warmed state with no in-guest
+   workload. A guest under heavy write pressure during a Live BRANCH
+   will see UFFD_WP capture more dirty pages, growing the bg-copy
+   wall time (but NOT pause_ms — the pause stays disk-independent).
+
+5. **`mode: "live"` requires the vendored Firecracker fork.**
+   `mem_backend.shared = true` is the one upstream gap; tracked as
+   [`FIRECRACKER-UPSTREAM-PROPOSAL.md`](../../FIRECRACKER-UPSTREAM-PROPOSAL.md).
+   Once it lands upstream, the vendor requirement goes away.
+
+## Comparison vs v0.3.4 Diff
+
+v0.3.4 closed the multi-BRANCH compounding anomaly via
+`posix_fallocate`, putting Diff at a steady ~150-300 ms on this same
+hardware (see [`bench/pause-window/RESULTS-v0.3.md`](../pause-window/RESULTS-v0.3.md)).
+This bench's Diff p50 of 202 ms lines up cleanly with that. The
+v0.4 Live win is **on top of** v0.3.4 Diff, not against the original
+v0.3.0 baseline.
+
+For comparison:
+
+| Version | Mode | p50 pause on this hardware |
+|---------|------|---------------------------:|
+| v0.2.x  | Full | ~13 500 ms                 |
+| v0.3.0  | Diff | ~1 500-2 700 ms (anomaly)  |
+| v0.3.4  | Diff | ~200 ms                    |
+| v0.4    | Live | **~56 ms**                 |
diff --git a/bench/live-fork-pause-window/bench-live-fork.csv b/bench/live-fork-pause-window/bench-live-fork.csv
new file mode 100644
index 0000000..4c77906
--- /dev/null
+++ b/bench/live-fork-pause-window/bench-live-fork.csv
@@ -0,0 +1,41 @@
+mode,iteration,http_round_trip_ms,pause_ms,memory_bin_bytes,poll_until_ready_ms
+live-sync,0,13669.03,40,1610612736,
+live-async,0,58.27,48,1610612736,13284.92
+diff,0,13510.88,434,1610612736,
+full,0,13182.87,13163,1610612736,
+live-sync,1,13239.02,64,1610612736,
+live-async,1,125.31,90,1610612736,13723.48
+diff,1,14428.33,238,1610612736,
+full,1,13837.76,13828,1610612736,
+live-sync,2,13437.02,58,1610612736,
+live-async,2,85.39,57,1610612736,13668.73
+diff,2,13288.21,227,1610612736,
+full,2,13539.74,13524,1610612736,
+live-sync,3,14129.82,54,1610612736,
+live-async,3,77.93,59,1610612736,14048.62
+diff,3,13154.87,207,1610612736,
+full,3,14384.82,14314,1610612736,
+live-sync,4,13791.01,58,1610612736,
+live-async,4,59.17,51,1610612736,13428.47
+diff,4,13411.8,164,1610612736,
+full,4,13071.25,13068,1610612736,
+live-sync,5,14237.89,61,1610612736,
+live-async,5,48.43,41,1610612736,14047.83
+diff,5,13359.63,271,1610612736,
+full,5,13578.29,13576,1610612736,
+live-sync,6,14196.55,64,1610612736,
+live-async,6,95.57,68,1610612736,14225.81
+diff,6,13542.8,196,1610612736,
+full,6,13323.87,13298,1610612736,
+live-sync,7,16440.8,43,1610612736,
+live-async,7,57.03,38,1610612736,13986.13
+diff,7,14454.35,190,1610612736,
+full,7,13523.34,13510,1610612736,
+live-sync,8,13405.48,39,1610612736,
+live-async,8,266.74,258,1610612736,13481.96
+diff,8,13521.46,179,1610612736,
+full,8,13641.49,13638,1610612736,
+live-sync,9,13463.18,32,1610612736,
+live-async,9,57.64,50,1610612736,14398.6
+diff,9,13189.66,170,1610612736,
+full,9,13853.29,13850,1610612736,
diff --git a/bench/live-fork-pause-window/bench-live-fork.py b/bench/live-fork-pause-window/bench-live-fork.py
new file mode 100644
index 0000000..30afb2e
--- /dev/null
+++ b/bench/live-fork-pause-window/bench-live-fork.py
@@ -0,0 +1,439 @@
+#!/usr/bin/env python3
+"""v0.4 live BRANCH pause-window bench.
+
+Drives N iterations of four BRANCH modes off the same live-fork source
+sandbox and emits per-iteration CSV plus a p50/p90/max summary. The
+point is to get an honest pause_ms number for `mode="live"` against a
+known-clean source — Phase 6's E2E used `coding-agent-fork-prewarm-v1`
+which has 17 baked guest Oopses contaminating the measurement.
+
+Source selection
+----------------
+
+The script symlinks an existing snapshot directory under the script's
+work-dir as the source tag. Override `--source-tag` and `--snap-root`
+if your snapshots live elsewhere. `python-numpy` is the default
+because it's the canonical Hub recipe (`forkd pull
+deeplethe/python-numpy`) — anyone with a fresh forkd install can
+reproduce against the same bytes.
+
+Setup pattern matches `scripts/dev/e2e-live-branch.py` (Phase 6 E2E):
+
+  1. Stand up an isolated forkd-controller on a free port with a
+     `firecracker` wrapper that adds --no-seccomp (the vendored FC's
+     vmm seccomp filter blocks userfaultfd; following Phase 6's
+     pattern).
+  2. POST /v1/sandboxes with `live_fork: true` to spawn a memfd-backed
+     source sandbox.
+  3. Loop N times for each of {live wait=true, live wait=false, diff,
+     full}: POST .../branch, record `pause_ms`, delete the result
+     snapshot to keep disk usage bounded.
+  4. Emit CSV per iteration + p50/p90/max table to stdout.
+
+Run as root: the FC API socket and snapshot dir are root-owned, and
+the system FC swap needs sudo too.
+
+Output
+------
+
+- `bench-live-fork.csv` — one row per BRANCH iteration:
+    mode, iteration, http_round_trip_ms, pause_ms, memory_bin_bytes,
+    poll_until_ready_ms (live wait=false only)
+- Stdout summary table with p50/p90/max per mode.
+
+Usage:
+    sudo python3 bench-live-fork.py \\
+        --source-tag python-numpy \\
+        --iterations 10 \\
+        --modes live-sync,live-async,diff,full
+"""
+import argparse
+import json
+import os
+import shutil
+import socket
+import statistics
+import subprocess
+import sys
+import time
+import urllib.error
+import urllib.request
+
+# Paths the dev box uses; override via CLI when porting.
+DEFAULT_BIN = "/home/yangdongxu/forkd/target/release/forkd-controller"
+DEFAULT_FC = (
+    "/home/yangdongxu/firecracker-fork/build/cargo_target"
+    "/x86_64-unknown-linux-musl/release/firecracker"
+)
+DEFAULT_SNAP_ROOT = "/home/yangdongxu/.local/share/forkd/snapshots"
+SYSTEM_FC = "/usr/local/bin/firecracker"
+SYSTEM_FC_BACKUP = "/usr/local/bin/firecracker.bench-live-backup"
+
+WORK = "/tmp/forkd-bench-live"
+
+
+def http(base_url, method, path, body=None, timeout=120):
+    data = json.dumps(body).encode() if body is not None else None
+    headers = {"Content-Type": "application/json"} if body is not None else {}
+    req = urllib.request.Request(
+        f"{base_url}{path}", data=data, method=method, headers=headers
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            raw = resp.read().decode("utf-8", errors="replace")
+            return resp.status, json.loads(raw) if raw else None
+    except urllib.error.HTTPError as e:
+        raw = e.read().decode("utf-8", errors="replace")
+        try:
+            return e.code, json.loads(raw)
+        except json.JSONDecodeError:
+            return e.code, raw
+
+
+def wait_for_healthy(base_url, port, deadline_s=20):
+    end = time.time() + deadline_s
+    while time.time() < end:
+        try:
+            s = socket.create_connection(("127.0.0.1", port), timeout=1)
+            s.close()
+            status, _ = http(base_url, "GET", "/healthz", timeout=2)
+            if status == 200:
+                return
+        except (ConnectionRefusedError, socket.timeout, OSError):
+            pass
+        time.sleep(0.3)
+    raise RuntimeError(f"daemon not healthy after {deadline_s}s")
+
+
+def setup_workdir(source_tag, source_dir, patched_fc):
+    shutil.rmtree(WORK, ignore_errors=True)
+    os.makedirs(f"{WORK}/snapshots", exist_ok=True)
+    os.makedirs(f"{WORK}/audit", exist_ok=True)
+
+    # FC wrapper. Same pattern as the Phase 6 E2E: vendored FC binary
+    # + --no-seccomp because the upstream vmm-thread filter still
+    # blocks userfaultfd(2).
+    wrapper = f"{WORK}/firecracker.wrapper"
+    with open(wrapper, "w") as f:
+        f.write(
+            "#!/bin/bash\n"
+            f"exec {patched_fc} --no-seccomp \"$@\"\n"
+        )
+    os.chmod(wrapper, 0o755)
+    if not os.path.exists(SYSTEM_FC_BACKUP):
+        subprocess.run(["sudo", "mv", SYSTEM_FC, SYSTEM_FC_BACKUP], check=True)
+    subprocess.run(["sudo", "cp", wrapper, SYSTEM_FC], check=True)
+    subprocess.run(["sudo", "chmod", "755", SYSTEM_FC], check=True)
+
+    # Symlink the source snapshot dir into our snap-root. Avoids
+    # copying memory.bin (1.5 GiB for python-numpy).
+    target = f"{WORK}/snapshots/{source_tag}"
+    if os.path.lexists(target):
+        os.unlink(target)
+    os.symlink(source_dir, target)
+
+    state = {
+        "snapshots": {
+            source_tag: {
+                "tag": source_tag,
+                "dir": target,
+                "created_at_unix": int(time.time()),
+                "status": "ready",
+            }
+        }
+    }
+    with open(f"{WORK}/state.json", "w") as f:
+        json.dump(state, f, indent=2)
+
+
+def restore_firecracker():
+    if os.path.exists(SYSTEM_FC_BACKUP):
+        subprocess.run(
+            ["sudo", "mv", "-f", SYSTEM_FC_BACKUP, SYSTEM_FC], check=False
+        )
+
+
+def start_daemon(bin_path, bind):
+    log = open(f"{WORK}/controller.log", "wb")
+    return subprocess.Popen(
+        [
+            "sudo",
+            bin_path,
+            "serve",
+            "--bind",
+            bind,
+            "--state",
+            f"{WORK}/state.json",
+            "--snapshot-root",
+            f"{WORK}/snapshots",
+            "--audit-log",
+            f"{WORK}/audit/audit.log",
+        ],
+        stdout=log,
+        stderr=log,
+    )
+
+
+def kill_leftovers(bind):
+    subprocess.run(
+        ["sudo", "pkill", "-f", f"forkd-controller serve --bind {bind}"],
+        stderr=subprocess.DEVNULL,
+    )
+    subprocess.run(
+        ["sudo", "pkill", "-f", f"{WORK}/"], stderr=subprocess.DEVNULL
+    )
+    time.sleep(0.5)
+
+
+def branch_once(base_url, sandbox_id, mode, wait, iteration):
+    """Run a single BRANCH; return a per-iteration row dict."""
+    tag = f"bench-{mode}-{iteration:03d}-{int(time.time() * 1000)}"
+    body = {"tag": tag}
+    if mode == "live-sync":
+        body["mode"] = "live"
+        body["wait"] = True
+    elif mode == "live-async":
+        body["mode"] = "live"
+        body["wait"] = False
+    elif mode == "diff":
+        body["mode"] = "diff"
+    elif mode == "full":
+        body["mode"] = "full"
+    else:
+        raise ValueError(f"unknown mode {mode}")
+
+    t0 = time.time()
+    status, resp = http(
+        base_url, "POST", f"/v1/sandboxes/{sandbox_id}/branch", body
+    )
+    rt_ms = (time.time() - t0) * 1000
+    if status not in (201, 202):
+        raise RuntimeError(f"BRANCH {mode} #{iteration} HTTP {status}: {resp!r}")
+
+    pause_ms = resp.get("pause_ms")
+    mem_bytes = None
+
+    ready_ms = None
+    if mode == "live-async":
+        # Poll until the snapshot flips to status=ready.
+        assert status == 202 and resp.get("status") == "writing"
+        poll_start = time.time()
+        deadline = poll_start + 60
+        while time.time() < deadline:
+            ls_status, ls = http(base_url, "GET", "/v1/snapshots")
+            assert ls_status == 200, f"list_snapshots HTTP {ls_status}"
+            entry = next((e for e in ls if e["tag"] == tag), None)
+            if entry is None:
+                raise RuntimeError(f"{tag} vanished")
+            if entry["status"] == "ready":
+                ready_ms = (time.time() - poll_start) * 1000
+                break
+            if entry["status"] == "failed":
+                raise RuntimeError(f"{tag} failed: {entry.get('warning')}")
+            time.sleep(0.05)
+        if ready_ms is None:
+            raise RuntimeError(f"{tag} did not reach ready in 60s")
+
+    mem_path = f"{WORK}/snapshots/{tag}/memory.bin"
+    if os.path.exists(mem_path):
+        mem_bytes = os.path.getsize(mem_path)
+
+    # Delete the snapshot to keep disk usage bounded. The source
+    # sandbox isn't affected; only this branch's tag goes away.
+    del_status, _ = http(base_url, "DELETE", f"/v1/snapshots/{tag}")
+    if del_status not in (200, 204):
+        # Non-fatal; bench can keep going, log it.
+        print(f"  warn: DELETE {tag} -> HTTP {del_status}", file=sys.stderr)
+
+    return {
+        "mode": mode,
+        "iteration": iteration,
+        "http_round_trip_ms": round(rt_ms, 2),
+        "pause_ms": pause_ms,
+        "memory_bin_bytes": mem_bytes,
+        "poll_until_ready_ms": round(ready_ms, 2) if ready_ms is not None else None,
+    }
+
+
+def summarize(rows, csv_path):
+    # Write CSV
+    cols = [
+        "mode",
+        "iteration",
+        "http_round_trip_ms",
+        "pause_ms",
+        "memory_bin_bytes",
+        "poll_until_ready_ms",
+    ]
+    with open(csv_path, "w") as f:
+        f.write(",".join(cols) + "\n")
+        for r in rows:
+            f.write(
+                ",".join("" if r[c] is None else str(r[c]) for c in cols) + "\n"
+            )
+
+    # Per-mode p50 / p90 / max for pause_ms and round-trip.
+    by_mode = {}
+    for r in rows:
+        by_mode.setdefault(r["mode"], []).append(r)
+
+    print("\n=== SUMMARY ===")
+    print(
+        f"  {'mode':<14}  {'N':>3}  "
+        f"{'pause_ms (p50)':>15}  {'p90':>6}  {'max':>6}  "
+        f"{'RT_ms (p50)':>12}  {'p90':>6}  {'max':>6}"
+    )
+    for mode in ("live-sync", "live-async", "diff", "full"):
+        if mode not in by_mode:
+            continue
+        rs = by_mode[mode]
+        pauses = [r["pause_ms"] for r in rs if r["pause_ms"] is not None]
+        rts = [r["http_round_trip_ms"] for r in rs]
+        if pauses:
+            p_p50 = statistics.median(pauses)
+            p_p90 = statistics.quantiles(pauses, n=10)[-1] if len(pauses) >= 2 else pauses[0]
+            p_max = max(pauses)
+        else:
+            p_p50 = p_p90 = p_max = float("nan")
+        rt_p50 = statistics.median(rts)
+        rt_p90 = statistics.quantiles(rts, n=10)[-1] if len(rts) >= 2 else rts[0]
+        rt_max = max(rts)
+        print(
+            f"  {mode:<14}  {len(rs):>3}  "
+            f"{p_p50:>15.1f}  {p_p90:>6.1f}  {p_max:>6.1f}  "
+            f"{rt_p50:>12.1f}  {rt_p90:>6.1f}  {rt_max:>6.1f}"
+        )
+
+    # Headline ratio: live-sync p50 vs diff p50.
+    if "live-sync" in by_mode and "diff" in by_mode:
+        live_pauses = [
+            r["pause_ms"] for r in by_mode["live-sync"] if r["pause_ms"] is not None
+        ]
+        diff_pauses = [
+            r["pause_ms"] for r in by_mode["diff"] if r["pause_ms"] is not None
+        ]
+        if live_pauses and diff_pauses:
+            live_p50 = statistics.median(live_pauses)
+            diff_p50 = statistics.median(diff_pauses)
+            ratio = diff_p50 / live_p50 if live_p50 > 0 else float("inf")
+            print(
+                f"\n  diff_p50 / live_p50 = {diff_p50:.0f}/{live_p50:.1f} "
+                f"= {ratio:.1f}×"
+            )
+    print(f"\n  CSV: {csv_path}")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    parser.add_argument("--source-tag", default="python-numpy")
+    parser.add_argument("--snap-root", default=DEFAULT_SNAP_ROOT)
+    parser.add_argument("--controller-bin", default=DEFAULT_BIN)
+    parser.add_argument("--patched-fc", default=DEFAULT_FC)
+    parser.add_argument(
+        "--port", type=int, default=8891, help="port for the isolated controller"
+    )
+    parser.add_argument(
+        "--iterations", type=int, default=10, help="branches per mode"
+    )
+    parser.add_argument(
+        "--modes",
+        default="live-sync,live-async,diff,full",
+        help="comma-separated subset of {live-sync,live-async,diff,full}",
+    )
+    parser.add_argument(
+        "--out-csv",
+        default="/tmp/forkd-bench-live/bench-live-fork.csv",
+    )
+    args = parser.parse_args()
+
+    bind = f"127.0.0.1:{args.port}"
+    base_url = f"http://{bind}"
+
+    source_dir = os.path.join(args.snap_root, args.source_tag)
+    if not os.path.isdir(source_dir):
+        sys.exit(f"source snapshot not found: {source_dir}")
+
+    # Probe source size — useful for the writeup.
+    src_mem = os.path.join(source_dir, "memory.bin")
+    src_bytes = os.path.getsize(src_mem) if os.path.exists(src_mem) else None
+
+    modes = args.modes.split(",")
+    for m in modes:
+        if m not in {"live-sync", "live-async", "diff", "full"}:
+            sys.exit(f"unknown mode {m}")
+
+    print(f"[*] source: {source_dir}")
+    if src_bytes:
+        print(f"    memory.bin: {src_bytes} bytes ({src_bytes // (1024 * 1024)} MiB)")
+    print(f"[*] modes: {modes}, iterations per mode: {args.iterations}")
+    print(f"[*] controller on {bind}")
+
+    print("[*] kill leftovers")
+    kill_leftovers(bind)
+
+    print(f"[*] setup work dir {WORK}")
+    setup_workdir(args.source_tag, source_dir, args.patched_fc)
+
+    print("[*] start daemon")
+    daemon = start_daemon(args.controller_bin, bind)
+    rows = []
+    try:
+        wait_for_healthy(base_url, args.port)
+        print("[+] daemon healthy")
+
+        # Spawn one live-fork source sandbox; all BRANCHes hit it.
+        print(f"\n[*] POST /v1/sandboxes live_fork=true tag={args.source_tag}")
+        status, body = http(
+            base_url,
+            "POST",
+            "/v1/sandboxes",
+            {"snapshot_tag": args.source_tag, "n": 1, "live_fork": True},
+        )
+        if status != 201:
+            raise RuntimeError(f"spawn HTTP {status}: {body!r}")
+        sandbox_id = body[0]["id"]
+        print(f"[+] sandbox {sandbox_id}")
+
+        # Give the guest a moment to settle (some recipes do post-boot
+        # work). Keep it small so the bench's "agent state" isn't
+        # dominated by warmup work.
+        time.sleep(1.5)
+
+        # Interleave modes so any one-shot effects (cold cache,
+        # warm-up, file-system state) average out instead of stacking
+        # on the last mode.
+        for i in range(args.iterations):
+            for m in modes:
+                print(f"  [{m} #{i}] ...", end=" ", flush=True)
+                row = branch_once(base_url, sandbox_id, m, None, i)
+                rows.append(row)
+                extra = ""
+                if row["poll_until_ready_ms"] is not None:
+                    extra = f" ready+{row['poll_until_ready_ms']:.0f}ms"
+                print(
+                    f"pause={row['pause_ms']}ms "
+                    f"rt={row['http_round_trip_ms']:.0f}ms{extra}"
+                )
+
+        summarize(rows, args.out_csv)
+
+    finally:
+        print("\n[*] tearing down")
+        subprocess.run(["sudo", "kill", str(daemon.pid)], stderr=subprocess.DEVNULL)
+        subprocess.run(
+            ["sudo", "pkill", "-9", "-f", "/usr/local/bin/firecracker"],
+            stderr=subprocess.DEVNULL,
+        )
+        time.sleep(0.5)
+        restore_firecracker()
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception as e:
+        print(f"\n[!] FAIL: {e}", file=sys.stderr)
+        sys.exit(1)