20 changes: 15 additions & 5 deletions CHANGELOG.md
@@ -50,11 +50,21 @@ Firecracker fork from
— upstream FC doesn't yet ship `mem_backend.shared = true`. See
[`docs/VENDORED-FIRECRACKER.md`](./docs/VENDORED-FIRECRACKER.md).

**Stability**: the user surface is stable; numbers across realistic
workloads (`bench/live-fork-pause-window.md`) are in progress against
a freshly-rebuilt clean snapshot — the previous
`coding-agent-fork-prewarm-v1` parent had 17 pre-baked guest Oopses
that contaminate Live BRANCH timing.
**Bench numbers** ([`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md))
on a clean Hub-pulled `python-numpy` source (1.5 GiB, Intel i7-12700,
ext4 on HDD):

| mode | pause p50 | pause p90 | RT p50 |
|------------|----------:|----------:|----------:|
| live-sync | 56 ms | 64 ms | 13 730 ms |
| live-async | 54 ms | 241 ms | **69 ms** |
| diff | 202 ms | 418 ms | 13 461 ms |
| full | 13 550 ms | 14 268 ms | 13 559 ms |

Headline: **3.6× faster pause** vs v0.3 Diff at p50, and the gap
widens on slower storage because Live's pause is disk-independent.
`wait=false` gives callers a ~70 ms HTTP return (vs 13.7 s for sync),
**~200× RT improvement** for fire-and-forget BRANCH.

### Security — bearer-token comparison was a length oracle (closes #162)

17 changes: 10 additions & 7 deletions README-zh.md
@@ -21,7 +21,7 @@

<br/>

## Fork 100 microVMs in 101 ms. BRANCH a running VM in 150 ms.
## Fork 100 microVMs in 101 ms. BRANCH a running VM in 56 ms (v0.4 live mode).

A microVM sandbox runtime for **AI agent fan-out** scenarios. Child VMs
fork from a pre-warmed parent snapshot and inherit, via copy-on-write (CoW),
@@ -44,12 +44,15 @@ pause time would grow from 150 ms to 2.7 s
([#146](https://github.com/deeplethe/forkd/issues/146)); after the fix,
consecutive BRANCHes stay flat (the 6th BRANCH is 17.6× faster).

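**v0.4 live BRANCH** cuts the source-VM pause window from ~150 ms (Diff) to
sub-50 ms: the source VM resumes as soon as the vCPU state is dumped, and
dirty pages are captured asynchronously via UFFD_WP. The end-to-end path is
fully wired up: `--live` on the CLI, `mode: "live"` on REST, and the same
name in the Python / TypeScript / MCP SDKs. Add `--no-wait` (CLI) or
`wait: false` (REST/SDK) to return immediately (~10 ms) without waiting for
the background copy to finish.
**v0.4 live BRANCH** cuts the source-VM pause window from ~200 ms (Diff) to
**56 ms p50 / 64 ms p90** (1.5 GiB source VM, measured,
[`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md)).
The p50 is **3.6×** faster than v0.3 Diff, and the ratio **grows** on slow
disks, because Live's pause is disk-independent (the memory copy runs after
resume, outside the critical section). Adding `wait: false` lets the caller
return in ~70 ms while the background copy completes asynchronously, a
**200×** RT improvement for fire-and-forget BRANCH from agent code. Use
`--live` / `--no-wait` on the CLI and `mode: "live"` / `wait: false` on REST;
the Python / TypeScript / MCP SDKs use the same names.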

```python
from forkd import Controller
```
22 changes: 12 additions & 10 deletions README.md
@@ -21,7 +21,7 @@

<br/>

## Fork 100 microVMs in 101 ms. BRANCH a live VM in 150 ms.
## Fork 100 microVMs in 101 ms. BRANCH a live VM in 56 ms (v0.4 live mode).

A microVM sandbox runtime for **AI agent fan-out**. Children fork
from a warmed parent snapshot, inheriting its address space
@@ -45,15 +45,17 @@ where repeated BRANCHes on the same parent ballooned from 150 ms to
2.7 s ([#146](https://github.com/deeplethe/forkd/issues/146)); the
chain now stays flat (17.6× faster on the 6th consecutive BRANCH).

**v0.4 live BRANCH** drops the source-pause window from ~150 ms
(Diff) to sub-50 ms by moving the memory copy out of the critical
section: the source resumes as soon as Firecracker dumps vCPU state,
and dirty pages get captured asynchronously via UFFD_WP. The full
end-to-end path is wired up — pass `--live` on the CLI, `mode:
"live"` on REST, or `mode="live"` / `mode: "live"` on the Python /
TypeScript / MCP SDKs. Add `--no-wait` (CLI) or `wait: false` (REST /
SDKs) to return as soon as the source resumes (~10 ms) rather than
waiting on the background copy.
**v0.4 live BRANCH** collapses the source-pause window from ~200 ms
(Diff) to **56 ms p50 / 64 ms p90** on a 1.5 GiB source — measured
on a real BRANCH workload, [`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md).
**3.6× faster pause** vs v0.3 Diff at p50, and the gap *widens* on
slower storage because Live's pause is disk-independent (memory
copy runs after resume, not during). With `wait: false` the caller
returns in ~70 ms while the background copy completes asynchronously
— a **200×** RT improvement for fire-and-forget BRANCH from agent
code. Pass `--live` / `--no-wait` on the CLI, `mode: "live"` /
`wait: false` on REST, or the same on the Python / TypeScript / MCP
SDKs.

```python
from forkd import Controller
```
160 changes: 160 additions & 0 deletions bench/live-fork-pause-window/RESULTS-v0.4.md
@@ -0,0 +1,160 @@
# v0.4 live BRANCH pause-window results

Headline: `mode="live"` collapses the source-VM pause window from
**202 ms p50 (Diff)** to **56 ms p50 (Live)** on a 1.5 GiB source —
**3.6× faster** at the median, and the gap widens on slow storage
because Live's pause is disk-independent while Diff's is not.
`wait=false` lets the caller return after ~69 ms while the background
memory copy runs to completion asynchronously.

Methodology, raw numbers, and honest caveats below.

## TL;DR

| mode | pause p50 | pause p90 | pause max | RT p50 |
|--------------|----------:|----------:|----------:|----------:|
| live-sync | **56 ms**| 64 ms | 64 ms | 13 730 ms |
| live-async | 54 ms | 241 ms | 258 ms | **69 ms** |
| diff | 202 ms | 418 ms | 434 ms | 13 461 ms |
| full | 13 550 ms | 14 268 ms | 14 314 ms | 13 559 ms |

Key ratios at p50:

- **live vs diff**: 202 / 56 = **3.6× faster pause window**
- **live vs full**: 13 550 / 56 = **242× faster pause window**
- **async RT vs sync RT**: 13 730 / 69 = **~199× faster return** for
callers that don't need the snapshot bytes immediately

> "Pause" is the source VM's downtime (the user-visible gap in TCP
> connections, kvmclock, etc.). "RT" is the full HTTP round-trip on
> `POST /v1/sandboxes/<id>/branch` — this is what your code waits on.
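
The TL;DR pause figures can be cross-checked against the `pause_ms` column of the raw CSV. The sketch below is not the bench script's own code: it assumes p50 is the plain median and p90 is the last decile cut point of Python's `statistics.quantiles` (default exclusive interpolation). That choice reproduces the table, but the script itself may compute percentiles differently.

```python
from statistics import median, quantiles

# pause_ms samples copied from bench-live-fork.csv, iterations 0-9.
PAUSE_MS = {
    "live-sync":  [40, 64, 58, 54, 58, 61, 64, 43, 39, 32],
    "live-async": [48, 90, 57, 59, 51, 41, 68, 38, 258, 50],
    "diff":       [434, 238, 227, 207, 164, 271, 196, 190, 179, 170],
    "full":       [13163, 13828, 13524, 14314, 13068,
                   13576, 13298, 13510, 13638, 13850],
}


def p50(samples):
    """Median of the samples."""
    return median(samples)


def p90(samples):
    """Last of the nine decile cut points (exclusive interpolation)."""
    return quantiles(samples, n=10)[-1]


def summarize(pause_ms=PAUSE_MS):
    """Map each mode to its (p50, p90, max) pause in milliseconds."""
    return {mode: (p50(s), p90(s), max(s)) for mode, s in pause_ms.items()}


if __name__ == "__main__":
    stats = summarize()
    for mode, (mid, tail, worst) in stats.items():
        print(f"{mode:<10} p50={mid:.0f} ms  p90={tail:.0f} ms  max={worst} ms")
    ratio = stats["diff"][0] / stats["live-sync"][0]
    print(f"live vs diff at p50: {ratio:.1f}x")
```

With only 10 samples per mode, p90 is an interpolation between the two largest values, so a single slow iteration moves it substantially.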

## Setup

| Item | Value |
|-----------------|--------------------------------------------------------------------|
| Host CPU | 12th Gen Intel Core i7-12700 (8P+4E) |
| Host RAM | 30 GiB |
| Host kernel | Linux 6.14.0-36-generic (Ubuntu) |
| Snapshot disk | `/dev/sda2` — **WDC WD10EZEX-75WN4A1, ROTA=1 (spinning HDD)**, ext4 |
| Firecracker | Vendored `forkd-v0.4-mem-backend-shared-v1.12` (musl release) |
| Controller | `feat(doctor,uffd): Phase 7.4` (commit `a372e2a`) |
| Source snapshot | `python-numpy` (from Hub, sha256-verified) |
| Source RAM size | 1 610 612 736 bytes = **1 536 MiB** |
| Iterations | 10 per mode, modes interleaved (live-sync, live-async, diff, full) |
| Source sandbox | Spawned once with `live_fork: true`; all BRANCHes hit it |

Modes interleave so disk warm-up, page-cache fill, and any
process-wide drift contaminate all four modes equally instead of
biasing the last batch.
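
The interleaving described above matches the row order in the CSV (all four modes at iteration 0, then all four at iteration 1, and so on). A minimal sketch of that ordering follows; the function names are hypothetical and this is not taken from `bench-live-fork.py`.

```python
from itertools import product

MODES = ["live-sync", "live-async", "diff", "full"]


def interleaved_schedule(iterations, modes=MODES):
    """Cycle through every mode before moving to the next iteration."""
    return [(i, mode) for i, mode in product(range(iterations), modes)]


def batched_schedule(iterations, modes=MODES):
    """The ordering the bench avoids: finish one mode, then start the next."""
    return [(i, mode) for mode, i in product(modes, range(iterations))]


if __name__ == "__main__":
    for iteration, mode in interleaved_schedule(2):
        print(iteration, mode)
```

In the batched ordering, whichever mode runs last sees the warmest page cache and the most accumulated drift; interleaving spreads both across all modes.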

## Raw data

[`bench-live-fork.csv`](./bench-live-fork.csv) — one row per BRANCH
iteration; columns: `mode, iteration, http_round_trip_ms, pause_ms,
memory_bin_bytes, poll_until_ready_ms`.

Reproduced via [`bench-live-fork.py`](./bench-live-fork.py):

```bash
sudo python3 bench-live-fork.py \
--source-tag python-numpy \
--iterations 10 \
--modes live-sync,live-async,diff,full
```
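
For readers who want to slice the raw data themselves, here is a small loader sketch using only the standard library. It assumes the column names listed above and treats an empty `poll_until_ready_ms` as "not applicable"; in the committed CSV that column is only populated for `live-async` rows. The `load_rows` helper is illustrative and not part of the repo.

```python
import csv
import io
from collections import defaultdict


def load_rows(fileobj):
    """Group CSV rows by mode; an empty poll_until_ready_ms becomes None."""
    by_mode = defaultdict(list)
    for row in csv.DictReader(fileobj):
        poll = row["poll_until_ready_ms"]
        by_mode[row["mode"]].append({
            "iteration": int(row["iteration"]),
            "http_round_trip_ms": float(row["http_round_trip_ms"]),
            "pause_ms": float(row["pause_ms"]),
            "memory_bin_bytes": int(row["memory_bin_bytes"]),
            "poll_until_ready_ms": float(poll) if poll else None,
        })
    return dict(by_mode)


# Two rows copied from bench-live-fork.csv so the example runs without the file.
SAMPLE = """\
mode,iteration,http_round_trip_ms,pause_ms,memory_bin_bytes,poll_until_ready_ms
live-sync,0,13669.03,40,1610612736,
live-async,0,58.27,48,1610612736,13284.92
"""

if __name__ == "__main__":
    rows = load_rows(io.StringIO(SAMPLE))
    for mode, entries in rows.items():
        print(mode, entries[0]["pause_ms"], entries[0]["poll_until_ready_ms"])
```

To read the real file, pass `open("bench-live-fork.csv", newline="")` instead of the `StringIO` sample.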

## What pause_ms measures

`pause_ms` is the source VM's vCPU-pause window:

- **`mode: "full"`**: Pause → write full `memory.bin` to disk → resume.
Bounded by sequential disk write speed. On this HDD: ~120 MB/s, so
1.5 GiB ≈ 13 s. An SSD would cut this to roughly 3 s and NVMe to
roughly 1.5 s (estimates, not measured here). Not acceptable for a
running agent.
- **`mode: "diff"`**: Pause → snapshot vmstate + dirty pages → resume.
Still bounded by disk write speed because the diff is written *inside*
the pause window. The tail widens as the snapshot's dirty page count
grows: the p90 of 418 ms is driven by a single BRANCH (iteration 0,
434 ms) that hit more dirty pages than the others.
- **`mode: "live"`**: Pause → snapshot vmstate, arm UFFD_WP, resume.
The memory copy happens *after* resume, in a controller-side
background thread. `pause_ms` is bounded by the vmstate dump
(~30-50 ms for 1.5 GiB at our vmstate sizes) plus UFFD_WP arming
on the resident regions (~0.4-0.6 ms in Phase 6 E2E).

This is why **the live pause window is disk-independent**: an NVMe
host wouldn't see Live get any faster (it's CPU-bound on vmstate +
WP arming), but Diff would still scale with disk speed. On slower
storage, the Live/Diff ratio gets *wider*, not narrower.
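
The disk-bound estimate for `mode: "full"` is simple arithmetic: source size divided by sequential write throughput. The sketch below reproduces the ~13 s HDD figure; the SSD and NVMe throughputs are illustrative assumptions picked to match the rough ~3 s and ~1.5 s estimates, not measurements.

```python
SOURCE_BYTES = 1_610_612_736  # 1 536 MiB, from the Setup table


def full_pause_seconds(memory_bytes, write_mb_per_s):
    """Seconds to write memory.bin sequentially (1 MB = 10**6 bytes)."""
    return memory_bytes / (write_mb_per_s * 1_000_000)


# 120 MB/s is the HDD figure quoted in the text. The other two rates are
# assumed round numbers, not benchmark results.
ASSUMED_THROUGHPUT_MB_S = {"hdd": 120, "ssd": 500, "nvme": 1000}

if __name__ == "__main__":
    for disk, rate in ASSUMED_THROUGHPUT_MB_S.items():
        seconds = full_pause_seconds(SOURCE_BYTES, rate)
        print(f"{disk}: full-mode pause ~ {seconds:.1f} s")
```

Live's pause has no such term, since the memory copy happens after resume; only the round-trip of a synchronous BRANCH still pays it.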

## What the round-trip column measures

`http_round_trip_ms` is how long your code's
`await ctrl.branchSandbox(...)` or `c.branch_sandbox(...)` call takes
to return:

- **live-sync (`wait=true`)**: blocks for source pause AND the
background memory copy. p50 = 13 730 ms ≈ HDD throughput limit
(same as Diff and Full).
- **live-async (`wait=false`)**: returns as soon as the source
resumes. p50 = **69 ms**. The background copy still runs (and is
visible via the `status` field flipping from `"writing"` to
`"ready"`), but the caller doesn't wait on it.
- **diff / full**: synchronous by definition; same RT as live-sync.
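
As a rough illustration of the REST shape behind these modes, the sketch below builds (but does not send) a `POST /v1/sandboxes/<id>/branch` request with `mode` and `wait` in the JSON body. The base URL and sandbox ID are hypothetical, authentication headers are omitted, and the real endpoint may accept or require other fields.

```python
import json
from urllib.request import Request


def build_branch_request(base_url, sandbox_id, mode="live", wait=False):
    """Build a BRANCH request; the caller decides how and when to send it."""
    body = json.dumps({"mode": mode, "wait": wait}).encode("utf-8")
    return Request(
        url=f"{base_url}/v1/sandboxes/{sandbox_id}/branch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical controller address and sandbox ID, for illustration only.
    req = build_branch_request("http://127.0.0.1:8080", "sb-123")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing `wait=True` with the same `mode` corresponds to the live-sync row; `mode="diff"` or `mode="full"` corresponds to the other two.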

The `wait=false` path is the headline UX win for agents: ~56 ms of
source downtime (`pause_ms`) *and* a ~70 ms HTTP return. The bench
records `poll_until_ready_ms` separately so you can see when the
async snapshot is actually consumable: it takes the same 13-14 s of
wall time as a sync BRANCH, just off the caller's critical path.
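
A caller using `wait=false` still has to find out when the snapshot becomes consumable. The sketch below shows one generic way to poll for the `"writing"` to `"ready"` transition. `get_status` is a hypothetical callable supplied by the caller (for example, a wrapper around whatever SDK or REST call exposes the `status` field); the helper is not part of the forkd SDK.

```python
import time


def poll_until_ready(get_status, timeout_s=60.0, interval_s=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Call get_status() until it returns "ready"; return elapsed seconds.

    Raises TimeoutError if the snapshot is not ready within timeout_s.
    """
    start = clock()
    while True:
        status = get_status()
        if status == "ready":
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"snapshot still {status!r} after {timeout_s} s")
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in status source: two "writing" responses, then "ready".
    responses = iter(["writing", "writing", "ready"])
    elapsed = poll_until_ready(lambda: next(responses), interval_s=0.01)
    print(f"ready after {elapsed:.3f} s")
```

On the hardware measured here the timeout needs to comfortably exceed the 13-14 s background copy.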

## Caveats

1. **Single host, single source size.** 1.5 GiB Python+numpy on i7-12700
+ HDD. Numbers will move with source RAM size (Live's pause is
~CPU + vmstate-size bound; Diff/Full are ~disk-bound) and with
disk medium. We'd expect Live's headline gap to narrow on NVMe
(because Diff gets faster) but never invert — Live is always
bounded by the synchronous parts of FC's pause/dump path.

2. **`live-async` p90 outlier.** Iteration #8 saw pause_ms=258 ms
(vs p50=54). Root cause not yet investigated; suspects: ext4
writeback pressure from the in-flight previous async BRANCH, or
FC's vmstate serialization hitting an irregularity. Reproducing
on a clean disk and a longer run is the right follow-up. With only
10 samples, that single point drives the 241 ms p90; the other
nine samples fall between 38 and 90 ms.

3. **The bench needs root when `vm.unprivileged_userfaultfd=0`.** The
bench script runs the controller under `sudo` because
`vm.unprivileged_userfaultfd=0` is the default on this dev box.
Production deployments should either set the sysctl to `1` or give
the controller `CAP_SYS_PTRACE`. `forkd doctor` (Phase 7.4) probes
both.

4. **Source guest was quiet during the BRANCH.** We ran
python-numpy in its default warmed state with no in-guest
workload. A guest under heavy write pressure during a Live BRANCH
will see UFFD_WP capture more dirty pages, growing the bg-copy
wall time (but NOT pause_ms: the pause stays disk-independent).

5. **`mode: "live"` requires the vendored Firecracker fork.**
`mem_backend.shared = true` is the one upstream gap; tracked as
[`FIRECRACKER-UPSTREAM-PROPOSAL.md`](../../FIRECRACKER-UPSTREAM-PROPOSAL.md).
Once it lands upstream, the vendor requirement goes away.
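
For caveat 3, the sysctl can be inspected without root by reading its procfs entry. The sketch below only reports the current value; it is not how `forkd doctor` is implemented, and it does not check capabilities.

```python
from pathlib import Path

SYSCTL_PATH = Path("/proc/sys/vm/unprivileged_userfaultfd")


def unprivileged_userfaultfd(path=SYSCTL_PATH):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe(value):
    """Turn the sysctl value into a short human-readable hint."""
    if value is None:
        return "sysctl not readable (non-Linux host or kernel without the knob)"
    if value == 1:
        return "unprivileged userfaultfd allowed"
    return "userfaultfd restricted: run privileged or grant CAP_SYS_PTRACE"


if __name__ == "__main__":
    print(describe(unprivileged_userfaultfd()))
```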

## Comparison vs v0.3.4 Diff

v0.3.4 closed the multi-BRANCH compounding anomaly via
`posix_fallocate`, putting Diff at a steady ~150-300 ms on this same
hardware (see [`bench/pause-window/RESULTS-v0.3.md`](../pause-window/RESULTS-v0.3.md)).
This bench's Diff p50 of 202 ms lines up cleanly with that. The
v0.4 Live win is **on top of** v0.3.4 Diff, not against the original
v0.3.0 baseline.

For comparison:

| Version | Mode | p50 pause on this hardware |
|---------|------|---------------------------:|
| v0.2.x | Full | ~13 500 ms |
| v0.3.0 | Diff | ~1 500-2 700 ms (anomaly) |
| v0.3.4 | Diff | ~200 ms |
| v0.4 | Live | **~56 ms** |
41 changes: 41 additions & 0 deletions bench/live-fork-pause-window/bench-live-fork.csv
@@ -0,0 +1,41 @@
mode,iteration,http_round_trip_ms,pause_ms,memory_bin_bytes,poll_until_ready_ms
live-sync,0,13669.03,40,1610612736,
live-async,0,58.27,48,1610612736,13284.92
diff,0,13510.88,434,1610612736,
full,0,13182.87,13163,1610612736,
live-sync,1,13239.02,64,1610612736,
live-async,1,125.31,90,1610612736,13723.48
diff,1,14428.33,238,1610612736,
full,1,13837.76,13828,1610612736,
live-sync,2,13437.02,58,1610612736,
live-async,2,85.39,57,1610612736,13668.73
diff,2,13288.21,227,1610612736,
full,2,13539.74,13524,1610612736,
live-sync,3,14129.82,54,1610612736,
live-async,3,77.93,59,1610612736,14048.62
diff,3,13154.87,207,1610612736,
full,3,14384.82,14314,1610612736,
live-sync,4,13791.01,58,1610612736,
live-async,4,59.17,51,1610612736,13428.47
diff,4,13411.8,164,1610612736,
full,4,13071.25,13068,1610612736,
live-sync,5,14237.89,61,1610612736,
live-async,5,48.43,41,1610612736,14047.83
diff,5,13359.63,271,1610612736,
full,5,13578.29,13576,1610612736,
live-sync,6,14196.55,64,1610612736,
live-async,6,95.57,68,1610612736,14225.81
diff,6,13542.8,196,1610612736,
full,6,13323.87,13298,1610612736,
live-sync,7,16440.8,43,1610612736,
live-async,7,57.03,38,1610612736,13986.13
diff,7,14454.35,190,1610612736,
full,7,13523.34,13510,1610612736,
live-sync,8,13405.48,39,1610612736,
live-async,8,266.74,258,1610612736,13481.96
diff,8,13521.46,179,1610612736,
full,8,13641.49,13638,1610612736,
live-sync,9,13463.18,32,1610612736,
live-async,9,57.64,50,1610612736,14398.6
diff,9,13189.66,170,1610612736,
full,9,13853.29,13850,1610612736,