deeplethe · WaylandYang · May 31, 2026 · May 31, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -50,11 +50,21 @@ Firecracker fork from
 — upstream FC doesn't yet ship `mem_backend.shared = true`. See
 [`docs/VENDORED-FIRECRACKER.md`](./docs/VENDORED-FIRECRACKER.md).
 
-**Stability**: the user surface is stable; numbers across realistic
-workloads (`bench/live-fork-pause-window.md`) are in progress against
-a freshly-rebuilt clean snapshot — the previous
-`coding-agent-fork-prewarm-v1` parent had 17 pre-baked guest Oopses
-that contaminate Live BRANCH timing.
+**Bench numbers** ([`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md))
+on a clean Hub-pulled `python-numpy` source (1.5 GiB, Intel i7-12700,
+ext4 on HDD):
+
+| mode       | pause p50 | pause p90 | RT p50    |
+|------------|----------:|----------:|----------:|
+| live-sync  |     56 ms |     64 ms | 13 730 ms |
+| live-async |     54 ms |    241 ms | **69 ms** |
+| diff       |    202 ms |    418 ms | 13 461 ms |
+| full       | 13 550 ms | 14 268 ms | 13 559 ms |
+
+Headline: **3.6× faster pause** vs v0.3 Diff at p50, and the gap
+widens on slower storage because Live's pause is disk-independent.
+`wait=false` gives callers a ~70 ms HTTP return (vs 13.7 s for sync),
+a **~200× RT improvement** for fire-and-forget BRANCH.
 
 ### Security — bearer-token comparison was a length oracle (closes #162)
 

diff --git a/README-zh.md b/README-zh.md
@@ -21,7 +21,7 @@
 
 <br/>
 
-## 101 毫秒 fork 100 个 microVM,150 毫秒 BRANCH 一个运行中的 VM。
+## 101 毫秒 fork 100 个 microVM,56 毫秒 BRANCH 一个运行中的 VM(v0.4 live 模式)。
 
 面向 **AI Agent 扇出**(fan-out)场景的 microVM 沙箱运行时。子 VM
 从一个已"暖启动"的父快照 fork 而来,通过写时复制(CoW)继承
@@ -44,12 +44,15 @@ pause 时间会从 150 ms 涨到 2.7 s
 ([#146](https://github.com/deeplethe/forkd/issues/146));修复后
 连续 BRANCH 保持平直(第 6 次 BRANCH 快了 17.6×)。
 
-**v0.4 live BRANCH** 把源 VM 的卡顿窗口从 ~150 ms(Diff)降到
-sub-50 ms:vCPU 状态 dump 完源 VM 立刻恢复,脏页通过 UFFD_WP
-异步抓取。端到端路径已经全部接入:CLI 用 `--live`、REST 用
-`mode: "live"`、Python / TypeScript / MCP SDK 同名。再加 `--no-wait`
-(CLI)或 `wait: false`(REST/SDK)就立刻返回(~10 ms),不等
-背景拷贝完成。
+**v0.4 live BRANCH** 把源 VM 卡顿窗口从 ~200 ms(Diff)压到
+**56 ms p50 / 64 ms p90**(1.5 GiB 源 VM,实测,
+[`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md))。
+p50 比 v0.3 Diff 快 **3.6 倍**,而且在慢盘上这个比值**变得更大**——
+因为 Live 的 pause 是 disk-independent 的(内存拷贝跑在 resume 之
+后,不占临界区)。加 `wait: false` 让调用方 ~70 ms 就返回,背景
+拷贝异步完成——对于 agent 代码的 fire-and-forget BRANCH 是 **200×**
+的 RT 改进。CLI 用 `--live` / `--no-wait`,REST 用 `mode: "live"` /
+`wait: false`,Python / TypeScript / MCP SDK 同名。
 
 ```python
 from forkd import Controller

diff --git a/README.md b/README.md
@@ -21,7 +21,7 @@
 
 <br/>
 
-## Fork 100 microVMs in 101 ms. BRANCH a live VM in 150 ms.
+## Fork 100 microVMs in 101 ms. BRANCH a live VM in 56 ms (v0.4 live mode).
 
 A microVM sandbox runtime for **AI agent fan-out**. Children fork
 from a warmed parent snapshot, inheriting its address space
@@ -45,15 +45,17 @@ where repeated BRANCHes on the same parent ballooned from 150 ms to
 2.7 s ([#146](https://github.com/deeplethe/forkd/issues/146)); the
 chain now stays flat (17.6× faster on the 6th consecutive BRANCH).
 
-**v0.4 live BRANCH** drops the source-pause window from ~150 ms
-(Diff) to sub-50 ms by moving the memory copy out of the critical
-section: the source resumes as soon as Firecracker dumps vCPU state,
-and dirty pages get captured asynchronously via UFFD_WP. The full
-end-to-end path is wired up — pass `--live` on the CLI, `mode:
-"live"` on REST, or `mode="live"` / `mode: "live"` on the Python /
-TypeScript / MCP SDKs. Add `--no-wait` (CLI) or `wait: false` (REST /
-SDKs) to return as soon as the source resumes (~10 ms) rather than
-waiting on the background copy.
+**v0.4 live BRANCH** collapses the source-pause window from ~200 ms
+(Diff) to **56 ms p50 / 64 ms p90** on a 1.5 GiB source — measured
+on a real BRANCH workload, [`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md).
+**3.6× faster pause** vs v0.3 Diff at p50, and the gap *widens* on
+slower storage because Live's pause is disk-independent (memory
+copy runs after resume, not during). With `wait: false` the caller
+returns in ~70 ms while the background copy completes asynchronously
+— a **200×** RT improvement for fire-and-forget BRANCH from agent
+code. Pass `--live` / `--no-wait` on the CLI, `mode: "live"` /
+`wait: false` on REST, or the same on the Python / TypeScript / MCP
+SDKs.
 
 ```python
 from forkd import Controller

diff --git a/bench/live-fork-pause-window/RESULTS-v0.4.md b/bench/live-fork-pause-window/RESULTS-v0.4.md
@@ -0,0 +1,160 @@
+# v0.4 live BRANCH pause-window results
+
+Headline: `mode="live"` collapses the source-VM pause window from
+**202 ms p50 (Diff)** to **56 ms p50 (Live)** on a 1.5 GiB source —
+**3.6× faster** at the median, and the gap widens on slow storage
+because Live's pause is disk-independent while Diff's is not.
+`wait=false` lets the caller return after ~69 ms while the background
+memory copy runs to completion asynchronously.
+
+Methodology, raw numbers, and honest caveats below.
+
+## TL;DR
+
+| mode       | pause p50 | pause p90 | pause max | RT p50    |
+|------------|----------:|----------:|----------:|----------:|
+| live-sync  | **56 ms** |     64 ms |     64 ms | 13 730 ms |
+| live-async |     54 ms |    241 ms |    258 ms | **69 ms** |
+| diff       |    202 ms |    418 ms |    434 ms | 13 461 ms |
+| full       | 13 550 ms | 14 268 ms | 14 314 ms | 13 559 ms |
+
+Key ratios at p50:
+
+- **live vs diff**: 202 / 56 = **3.6× faster pause window**
+- **live vs full**: 13 550 / 56 = **242× faster pause window**
+- **async RT vs sync RT**: 13 730 / 69 = **199× faster return** for
+  callers that don't need the snapshot bytes immediately
+
+> "Pause" is the source VM's downtime (the user-visible gap in TCP
+> connections, kvmclock, etc.). "RT" is the full HTTP round-trip on
+> `POST /v1/sandboxes/<id>/branch` — this is what your code waits on.
+
+## Setup
+
+| Item            | Value                                                              |
+|-----------------|--------------------------------------------------------------------|
+| Host CPU        | 12th Gen Intel Core i7-12700 (8P+4E)                               |
+| Host RAM        | 30 GiB                                                             |
+| Host kernel     | Linux 6.14.0-36-generic (Ubuntu)                                   |
+| Snapshot disk   | `/dev/sda2` — **WDC WD10EZEX-75WN4A1, ROTA=1 (spinning HDD)**, ext4 |
+| Firecracker     | Vendored `forkd-v0.4-mem-backend-shared-v1.12` (musl release)      |
+| Controller      | `feat(doctor,uffd): Phase 7.4` (commit `a372e2a`)                  |
+| Source snapshot | `python-numpy` (from Hub, sha256-verified)                         |
+| Source RAM size | 1 610 612 736 bytes = **1 536 MiB**                                |
+| Iterations      | 10 per mode, modes interleaved (live-sync, live-async, diff, full) |
+| Source sandbox  | Spawned once with `live_fork: true`; all BRANCHes hit it           |
+
+Modes interleave so disk warm-up, page-cache fill, and any
+process-wide drift contaminate all four modes equally instead of
+biasing the last batch.
+
+## Raw data
+
+[`bench-live-fork.csv`](./bench-live-fork.csv) — one row per BRANCH
+iteration; columns: `mode, iteration, http_round_trip_ms, pause_ms,
+memory_bin_bytes, poll_until_ready_ms`.
+
+Reproduce with [`bench-live-fork.py`](./bench-live-fork.py):
+
+```bash
+sudo python3 bench-live-fork.py \
+    --source-tag python-numpy \
+    --iterations 10 \
+    --modes live-sync,live-async,diff,full
+```
+
+## What pause_ms measures
+
+`pause_ms` is the source VM's vCPU-pause window:
+
+- **`mode: "full"`**: Pause → write full `memory.bin` to disk → resume.
+  Wall-bound by sequential disk write. On this HDD: ~120 MB/s, so
+  1.5 GiB ≈ 13 s. SSD would cut this to ~3 s; NVMe ~1.5 s. Not
+  acceptable for a running agent.
+- **`mode: "diff"`**: Pause → snapshot vmstate + dirty pages → resume.
+  Still wall-bound on disk write because the diff is *inside* the
+  pause window. Tail goes wide as the snapshot's dirty page count
+  grows (p90 = 418 ms is the cost of any one BRANCH hitting more
+  dirty pages than the others).
+- **`mode: "live"`**: Pause → snapshot vmstate, arm UFFD_WP, resume.
+  The memory copy happens *after* resume, in a controller-side
+  background thread. pause_ms is bounded by the vmstate dump
+  (~30-50 ms for 1.5 GiB at our vmstate sizes) plus UFFD_WP arming
+  on the resident regions (~0.4-0.6 ms in Phase 6 E2E).
+
+This is why **the live pause window is disk-independent**: an NVMe
+host wouldn't see Live get any faster (it's CPU-bound on vmstate +
+WP arming), but Diff would still scale with disk speed. On slower
+storage, the Diff/Live pause ratio gets *wider*, not narrower.
+
+## What the round-trip column measures
+
+`http_round_trip_ms` is how long your code's `await ctrl.branchSandbox(...)`
+or `c.branch_sandbox(...)` call takes to return:
+
+- **live-sync (`wait=true`)**: blocks for source pause AND the
+  background memory copy. p50 = 13 730 ms ≈ HDD throughput limit
+  (same as Diff and Full).
+- **live-async (`wait=false`)**: returns as soon as the source
+  resumes. p50 = **69 ms**. The background copy still runs (and is
+  visible via the `status` field flipping from `"writing"` to
+  `"ready"`), but the caller doesn't wait on it.
+- **diff / full**: synchronous by definition; same RT as live-sync.
+
+The `wait=false` path is the headline UX win for agents: ~56 ms of
+source downtime (`pause_ms`) *and* a ~70 ms HTTP return. The bench
+records `poll_until_ready_ms` separately so you can see when the
+async snapshot is actually consumable: it takes the same 13-14 s of
+wall time as a sync BRANCH, just off the critical path.
+
+## Caveats
+
+1. **Single host, single source size.** 1.5 GiB Python+numpy on an
+   i7-12700 with an HDD. Numbers will move with source RAM size (Live's
+   pause is ~CPU + vmstate-size bound; Diff/Full are ~disk-bound) and
+   with disk medium. We'd expect Live's headline gap to narrow on NVMe
+   (because Diff gets faster) but never invert, since Live is always
+   bounded by the synchronous parts of FC's pause/dump path.
+
+2. **`live-async` tail outlier.** Iteration 8 saw pause_ms = 258 ms
+   (vs p50 = 54 ms), which alone pulls p90 up to 241 ms. Root cause
+   not yet investigated; suspects: ext4 writeback pressure from the
+   in-flight previous async BRANCH, or an irregularity in FC's vmstate
+   serialization. The right follow-up is a longer run on a clean disk.
+   Excluding this point, the other nine samples fall within 38-90 ms.
+
+3. **`unprivileged_userfaultfd=0` requires root for the bench.** The
+   bench script runs the controller under `sudo` because
+   `vm.unprivileged_userfaultfd=0` is the default on this dev box.
+   Production deployments should either set
+   `vm.unprivileged_userfaultfd=1` or give the controller
+   `CAP_SYS_PTRACE`. `forkd doctor` (Phase 7.4) probes both.
+
+4. **Source guest must be quiet during the BRANCH.** We ran
+   python-numpy in its default warmed state with no in-guest
+   workload. A guest under heavy write pressure during a Live BRANCH
+   will see UFFD_WP capture more dirty pages, growing the bg-copy
+   wall time (but NOT pause_ms — the pause stays disk-independent).
+
+5. **`mode: "live"` requires the vendored Firecracker fork.**
+   `mem_backend.shared = true` is the one upstream gap, tracked in
+   [`FIRECRACKER-UPSTREAM-PROPOSAL.md`](../../FIRECRACKER-UPSTREAM-PROPOSAL.md).
+   Once it lands upstream, the vendor requirement goes away.
+
+## Comparison vs v0.3.4 Diff
+
+v0.3.4 closed the multi-BRANCH compounding anomaly via
+`posix_fallocate`, putting Diff at a steady ~150-300 ms on this same
+hardware (see [`bench/pause-window/RESULTS-v0.3.md`](../pause-window/RESULTS-v0.3.md)).
+This bench's Diff p50 of 202 ms lines up cleanly with that. The
+v0.4 Live win is **on top of** v0.3.4 Diff, not against the original
+v0.3.0 baseline.
+
+For comparison:
+
+| Version | Mode | p50 pause on this hardware |
+|---------|------|---------------------------:|
+| v0.2.x  | Full | ~13 500 ms                 |
+| v0.3.0  | Diff | ~1 500-2 700 ms (anomaly)  |
+| v0.3.4  | Diff | ~200 ms                    |
+| v0.4    | Live | **~56 ms**                 |
diff --git a/bench/live-fork-pause-window/bench-live-fork.csv b/bench/live-fork-pause-window/bench-live-fork.csv
@@ -0,0 +1,41 @@
+mode,iteration,http_round_trip_ms,pause_ms,memory_bin_bytes,poll_until_ready_ms
+live-sync,0,13669.03,40,1610612736,
+live-async,0,58.27,48,1610612736,13284.92
+diff,0,13510.88,434,1610612736,
+full,0,13182.87,13163,1610612736,
+live-sync,1,13239.02,64,1610612736,
+live-async,1,125.31,90,1610612736,13723.48
+diff,1,14428.33,238,1610612736,
+full,1,13837.76,13828,1610612736,
+live-sync,2,13437.02,58,1610612736,
+live-async,2,85.39,57,1610612736,13668.73
+diff,2,13288.21,227,1610612736,
+full,2,13539.74,13524,1610612736,
+live-sync,3,14129.82,54,1610612736,
+live-async,3,77.93,59,1610612736,14048.62
+diff,3,13154.87,207,1610612736,
+full,3,14384.82,14314,1610612736,
+live-sync,4,13791.01,58,1610612736,
+live-async,4,59.17,51,1610612736,13428.47
+diff,4,13411.8,164,1610612736,
+full,4,13071.25,13068,1610612736,
+live-sync,5,14237.89,61,1610612736,
+live-async,5,48.43,41,1610612736,14047.83
+diff,5,13359.63,271,1610612736,
+full,5,13578.29,13576,1610612736,
+live-sync,6,14196.55,64,1610612736,
+live-async,6,95.57,68,1610612736,14225.81
+diff,6,13542.8,196,1610612736,
+full,6,13323.87,13298,1610612736,
+live-sync,7,16440.8,43,1610612736,
+live-async,7,57.03,38,1610612736,13986.13
+diff,7,14454.35,190,1610612736,
+full,7,13523.34,13510,1610612736,
+live-sync,8,13405.48,39,1610612736,
+live-async,8,266.74,258,1610612736,13481.96
+diff,8,13521.46,179,1610612736,
+full,8,13641.49,13638,1610612736,
+live-sync,9,13463.18,32,1610612736,
+live-async,9,57.64,50,1610612736,14398.6
+diff,9,13189.66,170,1610612736,
+full,9,13853.29,13850,1610612736,
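
The `pause_ms` column above is enough to recompute the pause figures in the RESULTS-v0.4.md summary table. The sketch below is a minimal cross-check, not the project's own tooling: it copies the forty `pause_ms` values from the CSV into a dict and derives p50, p90, and max per mode. The reported p90 values (64, 241, 418, 14 268 ms) happen to match the "exclusive" interpolation that Python's `statistics.quantiles` uses by default; whether `bench-live-fork.py` computes percentiles this way is an assumption. The names `PAUSE_MS`, `summarize`, and `SUMMARY` are hypothetical.

```python
import statistics

# pause_ms per mode, copied from bench-live-fork.csv (iterations 0-9).
PAUSE_MS = {
    "live-sync": [40, 64, 58, 54, 58, 61, 64, 43, 39, 32],
    "live-async": [48, 90, 57, 59, 51, 41, 68, 38, 258, 50],
    "diff": [434, 238, 227, 207, 164, 271, 196, 190, 179, 170],
    "full": [13163, 13828, 13524, 14314, 13068,
             13576, 13298, 13510, 13638, 13850],
}


def summarize(samples):
    """Return (p50, p90, max) for one mode's pause_ms samples."""
    p50 = statistics.median(samples)
    # quantiles(n=10) returns the nine decile cut points; the last is p90.
    # method="exclusive" is an assumption about how the bench interpolates.
    p90 = statistics.quantiles(samples, n=10, method="exclusive")[-1]
    return p50, p90, max(samples)


SUMMARY = {mode: summarize(vals) for mode, vals in PAUSE_MS.items()}

if __name__ == "__main__":
    for mode, (p50, p90, worst) in SUMMARY.items():
        print(f"{mode:<10} p50={p50:9.1f}  p90={p90:9.1f}  max={worst}")
    # Headline ratio: Diff p50 pause over Live p50 pause.
    ratio = SUMMARY["diff"][0] / SUMMARY["live-sync"][0]
    print(f"diff/live p50 pause ratio: {ratio:.1f}x")
```

With ten samples per mode, the exclusive method weights the largest sample at 0.9 when computing p90, which is why the single 258 ms `live-async` iteration moves p90 to about 241 ms while leaving the median at 54 ms.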