diff --git a/CHANGELOG.md b/CHANGELOG.md
index b28ff78..1f6957b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -50,11 +50,21 @@ Firecracker fork from — upstream FC doesn't yet ship
 `mem_backend.shared = true`. See
 [`docs/VENDORED-FIRECRACKER.md`](./docs/VENDORED-FIRECRACKER.md).
 
-**Stability**: the user surface is stable; numbers across realistic
-workloads (`bench/live-fork-pause-window.md`) are in progress against
-a freshly-rebuilt clean snapshot — the previous
-`coding-agent-fork-prewarm-v1` parent had 17 pre-baked guest Oopses
-that contaminate Live BRANCH timing.
+**Bench numbers** ([`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md))
+on a clean Hub-pulled `python-numpy` source (1.5 GiB, Intel i7-12700,
+ext4 on HDD):
+
+| mode       | pause p50 | pause p90 |    RT p50 |
+|------------|----------:|----------:|----------:|
+| live-sync  |     56 ms |     64 ms | 13 730 ms |
+| live-async |     54 ms |    241 ms | **69 ms** |
+| diff       |    202 ms |    418 ms | 13 461 ms |
+| full       | 13 550 ms | 14 268 ms | 13 559 ms |
+
+Headline: **3.6× faster pause** vs v0.3 Diff at p50, and the gap
+widens on slower storage because Live's pause is disk-independent.
+`wait=false` gives callers a ~70 ms HTTP return (vs 13.7 s for sync),
+**~200× RT improvement** for fire-and-forget BRANCH.
 
 ### Security — bearer-token comparison was a length oracle (closes #162)
 
diff --git a/README-zh.md b/README-zh.md
index 112f0ae..99463c8 100644
--- a/README-zh.md
+++ b/README-zh.md
@@ -21,7 +21,7 @@
-## Fork 100 microVMs in 101 ms, BRANCH a running VM in 150 ms.
+## Fork 100 microVMs in 101 ms, BRANCH a running VM in 56 ms (v0.4 live mode).
 
 A microVM sandbox runtime for **AI agent fan-out**. Child VMs
 fork from an already "warmed" parent snapshot and, via copy-on-write (CoW), inherit
@@ -44,12 +44,15 @@ pause time would grow from 150 ms to 2.7 s
 ([#146](https://github.com/deeplethe/forkd/issues/146)); after the fix,
 consecutive BRANCHes stay flat (the 6th BRANCH is 17.6× faster).
 
-**v0.4 live BRANCH** cuts the source VM's stall window from ~150 ms (Diff) to
-sub-50 ms: the source VM resumes as soon as the vCPU state is dumped, and dirty pages
-are captured asynchronously via UFFD_WP. The end-to-end path is fully wired up: `--live` on the CLI,
-`mode: "live"` on REST, and the same name in the Python / TypeScript / MCP SDKs. Add `--no-wait`
-(CLI) or `wait: false` (REST/SDK) to return immediately (~10 ms) without waiting for
-the background copy to finish.
+**v0.4 live BRANCH** cuts the source VM's stall window from ~200 ms (Diff) to
+**56 ms p50 / 64 ms p90** (1.5 GiB source VM, measured,
+[`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md)).
+The p50 is **3.6×** faster than v0.3 Diff, and the ratio **grows** on slow disks
+because Live's pause is disk-independent (the memory copy runs after
+resume, outside the critical section). Adding `wait: false` lets the caller return in ~70 ms while the
+background copy completes asynchronously; for fire-and-forget BRANCH from agent code that is a **200×**
+RT improvement. Use `--live` / `--no-wait` on the CLI, `mode: "live"` /
+`wait: false` on REST, and the same names in the Python / TypeScript / MCP SDKs.
 
 ```python
 from forkd import Controller
diff --git a/README.md b/README.md
index 89e078f..93bce7d 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@
-## Fork 100 microVMs in 101 ms. BRANCH a live VM in 150 ms.
+## Fork 100 microVMs in 101 ms. BRANCH a live VM in 56 ms (v0.4 live mode).
 
 A microVM sandbox runtime for **AI agent fan-out**. Children fork
 from a warmed parent snapshot, inheriting its address space
@@ -45,15 +45,17 @@ where repeated BRANCHes on the same parent ballooned from 150 ms to
 2.7 s ([#146](https://github.com/deeplethe/forkd/issues/146)); the
 chain now stays flat (17.6× faster on the 6th consecutive BRANCH).
 
-**v0.4 live BRANCH** drops the source-pause window from ~150 ms
-(Diff) to sub-50 ms by moving the memory copy out of the critical
-section: the source resumes as soon as Firecracker dumps vCPU state,
-and dirty pages get captured asynchronously via UFFD_WP. The full
-end-to-end path is wired up — pass `--live` on the CLI, `mode:
-"live"` on REST, or `mode="live"` / `mode: "live"` on the Python /
-TypeScript / MCP SDKs. Add `--no-wait` (CLI) or `wait: false` (REST /
-SDKs) to return as soon as the source resumes (~10 ms) rather than
-waiting on the background copy.
+**v0.4 live BRANCH** collapses the source-pause window from ~200 ms
+(Diff) to **56 ms p50 / 64 ms p90** on a 1.5 GiB source — measured
+on a real BRANCH workload, [`bench/live-fork-pause-window/RESULTS-v0.4.md`](./bench/live-fork-pause-window/RESULTS-v0.4.md).
+**3.6× faster pause** vs v0.3 Diff at p50, and the gap *widens* on
+slower storage because Live's pause is disk-independent (memory
+copy runs after resume, not during). With `wait: false` the caller
+returns in ~70 ms while the background copy completes asynchronously
+— a **200×** RT improvement for fire-and-forget BRANCH from agent
+code. Pass `--live` / `--no-wait` on the CLI, `mode: "live"` /
+`wait: false` on REST, or the same on the Python / TypeScript / MCP
+SDKs.
 
 ```python
 from forkd import Controller
diff --git a/bench/live-fork-pause-window/RESULTS-v0.4.md b/bench/live-fork-pause-window/RESULTS-v0.4.md
new file mode 100644
index 0000000..27eb934
--- /dev/null
+++ b/bench/live-fork-pause-window/RESULTS-v0.4.md
@@ -0,0 +1,160 @@
+# v0.4 live BRANCH pause-window results
+
+Headline: `mode="live"` collapses the source-VM pause window from
+**202 ms p50 (Diff)** to **56 ms p50 (Live)** on a 1.5 GiB source —
+**3.6× faster** at the median, and the gap widens on slow storage
+because Live's pause is disk-independent while Diff's is not.
+`wait=false` lets the caller return after ~69 ms while the background
+memory copy runs to completion asynchronously.
+
+Methodology, raw numbers, and honest caveats below.
+
+## TL;DR
+
+| mode         | pause p50 | pause p90 | pause max |    RT p50 |
+|--------------|----------:|----------:|----------:|----------:|
+| live-sync    | **56 ms** |     64 ms |     64 ms | 13 730 ms |
+| live-async   |     54 ms |    241 ms |    258 ms | **69 ms** |
+| diff         |    202 ms |    418 ms |    434 ms | 13 461 ms |
+| full         | 13 550 ms | 14 268 ms | 14 314 ms | 13 559 ms |
+
+Key ratios at p50:
+
+- **live vs diff**: 202 / 56 = **3.6× faster pause window**
+- **live vs full**: 13 550 / 56 = **242× faster pause window**
+- **async RT vs sync RT**: 13 730 / 69 = **199× faster return** for
+  callers that don't need the snapshot bytes immediately
+
+> "Pause" is the source VM's downtime (the user-visible gap in TCP
+> connections, kvmclock, etc.). "RT" is the full HTTP round-trip on
+> `POST /v1/sandboxes/{sandbox_id}/branch` — this is what your code waits on.
+
+## Setup
+
+| Item            | Value                                                               |
+|-----------------|---------------------------------------------------------------------|
+| Host CPU        | 12th Gen Intel Core i7-12700 (8P+4E)                                |
+| Host RAM        | 30 GiB                                                              |
+| Host kernel     | Linux 6.14.0-36-generic (Ubuntu)                                    |
+| Snapshot disk   | `/dev/sda2` — **WDC WD10EZEX-75WN4A1, ROTA=1 (spinning HDD)**, ext4 |
+| Firecracker     | Vendored `forkd-v0.4-mem-backend-shared-v1.12` (musl release)       |
+| Controller      | `feat(doctor,uffd): Phase 7.4` (commit `a372e2a`)                   |
+| Source snapshot | `python-numpy` (from Hub, sha256-verified)                          |
+| Source RAM size | 1 610 612 736 bytes = **1 536 MiB**                                 |
+| Iterations      | 10 per mode, modes interleaved (live-sync, live-async, diff, full)  |
+| Source sandbox  | Spawned once with `live_fork: true`; all BRANCHes hit it            |
+
+Modes interleave so disk warm-up, page-cache fill, and any
+process-wide drift contaminate all four modes equally instead of
+biasing the last batch.
+
+## Raw data
+
+[`bench-live-fork.csv`](./bench-live-fork.csv) — one row per BRANCH
+iteration; columns: `mode, iteration, http_round_trip_ms, pause_ms,
+memory_bin_bytes, poll_until_ready_ms`.
+
+Reproduced via [`bench-live-fork.py`](./bench-live-fork.py):
+
+```bash
+sudo python3 bench-live-fork.py \
+    --source-tag python-numpy \
+    --iterations 10 \
+    --modes live-sync,live-async,diff,full
+```
+
+## What pause_ms measures
+
+`pause_ms` is the source VM's vCPU-pause window:
+
+- **`mode: "full"`**: Pause → write full `memory.bin` to disk → resume.
+  Wall-bound by sequential disk write. On this HDD: ~120 MB/s, so
+  1.5 GiB ≈ 13 s. SSD would cut this to ~3 s; NVMe ~1.5 s. Not
+  acceptable for a running agent.
+- **`mode: "diff"`**: Pause → snapshot vmstate + dirty pages → resume.
+  Still wall-bound on disk write because the diff is *inside* the
+  pause window. Tail goes wide as the snapshot's dirty page count
+  grows (p90 = 418 ms is the cost of any one BRANCH hitting more
+  dirty pages than the others).
+- **`mode: "live"`**: Pause → snapshot vmstate, arm UFFD_WP, resume.
+  The memory copy happens *after* resume, in a controller-side
+  background thread. pause_ms is bounded by the vmstate dump
+  (~30-50 ms for 1.5 GiB at our vmstate sizes) plus UFFD_WP arming
+  on the resident regions (~0.4-0.6 ms in Phase 6 E2E).
+
+This is why **the live pause window is disk-independent**: an NVMe
+host wouldn't see Live get any faster (it's CPU-bound on vmstate +
+WP arming), but Diff would still scale with disk speed. On slower
+storage, the Live/Diff ratio gets *wider*, not narrower.
+
+## What the round-trip column measures
+
+`http_round_trip_ms` is what your code's `await ctrl.branchSandbox(...)`
+or `c.branch_sandbox(...)` returns in:
+
+- **live-sync (`wait=true`)**: blocks for source pause AND the
+  background memory copy. p50 = 13 730 ms ≈ HDD throughput limit
+  (same as Diff and Full).
+- **live-async (`wait=false`)**: returns as soon as the source
+  resumes. p50 = **69 ms**. The background copy still runs (and is
+  visible via the `status` field flipping from `"writing"` to
+  `"ready"`), but the caller doesn't wait on it.
+- **diff / full**: synchronous by definition; same RT as live-sync.
+
+The `wait=false` path is the headline UX win for agents: a `pause_ms
+~ 56 ms` source downtime *and* a ~70 ms HTTP return. The bench
+records `poll_until_ready_ms` separately so you can see when the
+async snapshot is actually consumable — it's the same 13-14 s wall
+time as sync BRANCH, just out of the critical path.
+
+## Caveats
+
+1. **Single host, single source size.** 1.5 GiB Python+numpy on i7-12700
+   + HDD. Numbers will move with source RAM size (Live's pause is
+   ~CPU + vmstate-size bound; Diff/Full are ~disk-bound) and with
+   disk medium. We'd expect Live's headline gap to narrow on NVMe
+   (because Diff gets faster) but never invert — Live is always
+   bounded by the synchronous parts of FC's pause/dump path.
+
+2.
+   **`live-async` p90 outlier.** Iteration #8 saw pause_ms=258 ms
+   (vs p50=54). Root cause not yet investigated; suspects: ext4
+   writeback pressure from the in-flight previous async BRANCH, or
+   FC's vmstate serialization hitting an irregularity. Reproducing
+   on a clean disk and a longer run is the right follow-up. Median
+   and p90 (excluding this point) stay tight.
+
+3. **`unprivileged_userfaultfd=0` requires root for the bench.** The
+   bench script runs the controller under `sudo` because
+   `vm.unprivileged_userfaultfd=0` is the default on this dev box.
+   Production deployments should either set the sysctl or give the
+   controller `CAP_SYS_PTRACE`. `forkd doctor` (Phase 7.4) probes
+   both.
+
+4. **Source guest must be quiet during the BRANCH.** We ran
+   python-numpy in its default warmed state with no in-guest
+   workload. A guest under heavy write pressure during a Live BRANCH
+   will see UFFD_WP capture more dirty pages, growing the bg-copy
+   wall time (but NOT pause_ms — the pause stays disk-independent).
+
+5. **`mode: "live"` requires the vendored Firecracker fork.**
+   `mem_backend.shared = true` is the one upstream gap; tracked as
+   [`FIRECRACKER-UPSTREAM-PROPOSAL.md`](../../FIRECRACKER-UPSTREAM-PROPOSAL.md).
+   Once it lands upstream, the vendor requirement goes away.
+
+## Comparison vs v0.3.4 Diff
+
+v0.3.4 closed the multi-BRANCH compounding anomaly via
+`posix_fallocate`, putting Diff at a steady ~150-300 ms on this same
+hardware (see [`bench/pause-window/RESULTS-v0.3.md`](../pause-window/RESULTS-v0.3.md)).
+This bench's Diff p50 of 202 ms lines up cleanly with that. The
+v0.4 Live win is **on top of** v0.3.4 Diff, not against the original
+v0.3.0 baseline.
+
+For comparison:
+
+| Version | Mode | p50 pause on this hardware |
+|---------|------|---------------------------:|
+| v0.2.x  | Full |                 ~13 500 ms |
+| v0.3.0  | Diff |  ~1 500-2 700 ms (anomaly) |
+| v0.3.4  | Diff |                    ~200 ms |
+| v0.4    | Live |                 **~56 ms** |
diff --git a/bench/live-fork-pause-window/bench-live-fork.csv b/bench/live-fork-pause-window/bench-live-fork.csv
new file mode 100644
index 0000000..4c77906
--- /dev/null
+++ b/bench/live-fork-pause-window/bench-live-fork.csv
@@ -0,0 +1,41 @@
+mode,iteration,http_round_trip_ms,pause_ms,memory_bin_bytes,poll_until_ready_ms
+live-sync,0,13669.03,40,1610612736,
+live-async,0,58.27,48,1610612736,13284.92
+diff,0,13510.88,434,1610612736,
+full,0,13182.87,13163,1610612736,
+live-sync,1,13239.02,64,1610612736,
+live-async,1,125.31,90,1610612736,13723.48
+diff,1,14428.33,238,1610612736,
+full,1,13837.76,13828,1610612736,
+live-sync,2,13437.02,58,1610612736,
+live-async,2,85.39,57,1610612736,13668.73
+diff,2,13288.21,227,1610612736,
+full,2,13539.74,13524,1610612736,
+live-sync,3,14129.82,54,1610612736,
+live-async,3,77.93,59,1610612736,14048.62
+diff,3,13154.87,207,1610612736,
+full,3,14384.82,14314,1610612736,
+live-sync,4,13791.01,58,1610612736,
+live-async,4,59.17,51,1610612736,13428.47
+diff,4,13411.8,164,1610612736,
+full,4,13071.25,13068,1610612736,
+live-sync,5,14237.89,61,1610612736,
+live-async,5,48.43,41,1610612736,14047.83
+diff,5,13359.63,271,1610612736,
+full,5,13578.29,13576,1610612736,
+live-sync,6,14196.55,64,1610612736,
+live-async,6,95.57,68,1610612736,14225.81
+diff,6,13542.8,196,1610612736,
+full,6,13323.87,13298,1610612736,
+live-sync,7,16440.8,43,1610612736,
+live-async,7,57.03,38,1610612736,13986.13
+diff,7,14454.35,190,1610612736,
+full,7,13523.34,13510,1610612736,
+live-sync,8,13405.48,39,1610612736,
+live-async,8,266.74,258,1610612736,13481.96
+diff,8,13521.46,179,1610612736,
+full,8,13641.49,13638,1610612736,
+live-sync,9,13463.18,32,1610612736,
+live-async,9,57.64,50,1610612736,14398.6
+diff,9,13189.66,170,1610612736,
+full,9,13853.29,13850,1610612736,
diff --git a/bench/live-fork-pause-window/bench-live-fork.py b/bench/live-fork-pause-window/bench-live-fork.py
new file mode 100644
index 0000000..30afb2e
--- /dev/null
+++ b/bench/live-fork-pause-window/bench-live-fork.py
@@ -0,0 +1,439 @@
+#!/usr/bin/env python3
+"""v0.4 live BRANCH pause-window bench.
+
+Drives N iterations of four BRANCH modes off the same live-fork source
+sandbox and emits per-iteration CSV plus a p50/p90/max summary. The
+point is to get an honest pause_ms number for `mode="live"` against a
+known-clean source — Phase 6's E2E used `coding-agent-fork-prewarm-v1`
+which has 17 baked guest Oopses contaminating the measurement.
+
+Source selection
+----------------
+
+The script symlinks an existing snapshot directory under the script's
+work-dir as the source tag. Override `--source-tag` and `--snap-root`
+if your snapshots live elsewhere. `python-numpy` is the default
+because it's the canonical Hub recipe (`forkd pull
+deeplethe/python-numpy`) — anyone with a fresh forkd install can
+reproduce against the same bytes.
+
+Setup pattern matches `scripts/dev/e2e-live-branch.py` (Phase 6 E2E):
+
+  1. Stand up an isolated forkd-controller on a free port with a
+     `firecracker` wrapper that adds --no-seccomp (the vendored FC's
+     vmm seccomp filter blocks userfaultfd; following Phase 6's
+     pattern).
+  2. POST /v1/sandboxes with `live_fork: true` to spawn a memfd-backed
+     source sandbox.
+  3. Loop N times for each of {live wait=true, live wait=false, diff,
+     full}: POST .../branch, record `pause_ms`, delete the result
+     snapshot to keep disk usage bounded.
+  4. Emit CSV per iteration + p50/p90/max table to stdout.
+
+Run as root: the FC API socket and snapshot dir are root-owned, and
+the system FC swap needs sudo too.
+
+Output
+------
+
+- `bench-live-fork.csv` — one row per BRANCH iteration:
+  mode, iteration, http_round_trip_ms, pause_ms, memory_bin_bytes,
+  poll_until_ready_ms (live wait=false only)
+- Stdout summary table with p50/p90/max per mode.
+
+Usage:
+    sudo python3 bench-live-fork.py \\
+        --source-tag python-numpy \\
+        --iterations 10 \\
+        --modes live-sync,live-async,diff,full
+"""
+import argparse
+import json
+import os
+import shutil
+import socket
+import statistics
+import subprocess
+import sys
+import time
+import urllib.error
+import urllib.request
+
+# Paths the dev box uses; override via CLI when porting.
+DEFAULT_BIN = "/home/yangdongxu/forkd/target/release/forkd-controller"
+DEFAULT_FC = (
+    "/home/yangdongxu/firecracker-fork/build/cargo_target"
+    "/x86_64-unknown-linux-musl/release/firecracker"
+)
+DEFAULT_SNAP_ROOT = "/home/yangdongxu/.local/share/forkd/snapshots"
+SYSTEM_FC = "/usr/local/bin/firecracker"
+SYSTEM_FC_BACKUP = "/usr/local/bin/firecracker.bench-live-backup"
+
+WORK = "/tmp/forkd-bench-live"
+
+
+def http(base_url, method, path, body=None, timeout=120):
+    data = json.dumps(body).encode() if body is not None else None
+    headers = {"Content-Type": "application/json"} if body is not None else {}
+    req = urllib.request.Request(
+        f"{base_url}{path}", data=data, method=method, headers=headers
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            raw = resp.read().decode("utf-8", errors="replace")
+            return resp.status, json.loads(raw) if raw else None
+    except urllib.error.HTTPError as e:
+        raw = e.read().decode("utf-8", errors="replace")
+        try:
+            return e.code, json.loads(raw)
+        except json.JSONDecodeError:
+            return e.code, raw
+
+
+def wait_for_healthy(base_url, port, deadline_s=20):
+    end = time.time() + deadline_s
+    while time.time() < end:
+        try:
+            s = socket.create_connection(("127.0.0.1", port), timeout=1)
+            s.close()
+            status, _ = http(base_url, "GET", "/healthz", timeout=2)
+            if status == 200:
+                return
+        except (ConnectionRefusedError, socket.timeout, OSError):
+            pass
+        time.sleep(0.3)
+    raise RuntimeError(f"daemon not healthy after {deadline_s}s")
+
+
+def setup_workdir(source_tag, source_dir, patched_fc):
+    shutil.rmtree(WORK, ignore_errors=True)
+    os.makedirs(f"{WORK}/snapshots", exist_ok=True)
+    os.makedirs(f"{WORK}/audit", exist_ok=True)
+
+    # FC wrapper. Same pattern as the Phase 6 E2E: vendored FC binary
+    # + --no-seccomp because the upstream vmm-thread filter still
+    # blocks userfaultfd(2).
+    wrapper = f"{WORK}/firecracker.wrapper"
+    with open(wrapper, "w") as f:
+        f.write(
+            "#!/bin/bash\n"
+            f"exec {patched_fc} --no-seccomp \"$@\"\n"
+        )
+    os.chmod(wrapper, 0o755)
+    if not os.path.exists(SYSTEM_FC_BACKUP):
+        subprocess.run(["sudo", "mv", SYSTEM_FC, SYSTEM_FC_BACKUP], check=True)
+    subprocess.run(["sudo", "cp", wrapper, SYSTEM_FC], check=True)
+    subprocess.run(["sudo", "chmod", "755", SYSTEM_FC], check=True)
+
+    # Symlink the source snapshot dir into our snap-root. Avoids
+    # copying the multi-hundred-MB memory.bin.
+    target = f"{WORK}/snapshots/{source_tag}"
+    if os.path.lexists(target):
+        os.unlink(target)
+    os.symlink(source_dir, target)
+
+    state = {
+        "snapshots": {
+            source_tag: {
+                "tag": source_tag,
+                "dir": target,
+                "created_at_unix": int(time.time()),
+                "status": "ready",
+            }
+        }
+    }
+    with open(f"{WORK}/state.json", "w") as f:
+        json.dump(state, f, indent=2)
+
+
+def restore_firecracker():
+    if os.path.exists(SYSTEM_FC_BACKUP):
+        subprocess.run(
+            ["sudo", "mv", "-f", SYSTEM_FC_BACKUP, SYSTEM_FC], check=False
+        )
+
+
+def start_daemon(bin_path, bind):
+    log = open(f"{WORK}/controller.log", "wb")
+    return subprocess.Popen(
+        [
+            "sudo",
+            bin_path,
+            "serve",
+            "--bind",
+            bind,
+            "--state",
+            f"{WORK}/state.json",
+            "--snapshot-root",
+            f"{WORK}/snapshots",
+            "--audit-log",
+            f"{WORK}/audit/audit.log",
+        ],
+        stdout=log,
+        stderr=log,
+    )
+
+
+def kill_leftovers(bind):
+    subprocess.run(
+        ["sudo", "pkill", "-f", f"forkd-controller serve --bind {bind}"],
+        stderr=subprocess.DEVNULL,
+    )
+    subprocess.run(
+        ["sudo", "pkill", "-f", f"{WORK}/"], stderr=subprocess.DEVNULL
+    )
+    time.sleep(0.5)
+
+
+def branch_once(base_url, sandbox_id, mode, wait, iteration):
+    """Run a single BRANCH; return a per-iteration row dict."""
+    tag = f"bench-{mode}-{iteration:03d}-{int(time.time() * 1000)}"
+    body = {"tag": tag}
+    if mode == "live-sync":
+        body["mode"] = "live"
+        body["wait"] = True
+    elif mode == "live-async":
+        body["mode"] = "live"
+        body["wait"] = False
+    elif mode == "diff":
+        body["mode"] = "diff"
+    elif mode == "full":
+        body["mode"] = "full"
+    else:
+        raise ValueError(f"unknown mode {mode}")
+
+    t0 = time.time()
+    status, resp = http(
+        base_url, "POST", f"/v1/sandboxes/{sandbox_id}/branch", body
+    )
+    rt_ms = (time.time() - t0) * 1000
+    if status not in (201, 202):
+        raise RuntimeError(f"BRANCH {mode} #{iteration} HTTP {status}: {resp!r}")
+
+    pause_ms = resp.get("pause_ms")
+    mem_bytes = None
+
+    ready_ms = None
+    if mode == "live-async":
+        # Poll until the snapshot
+        # flips to status=ready.
+        assert status == 202 and resp.get("status") == "writing"
+        poll_start = time.time()
+        deadline = poll_start + 60
+        while time.time() < deadline:
+            ls_status, ls = http(base_url, "GET", "/v1/snapshots")
+            assert ls_status == 200, f"list_snapshots HTTP {ls_status}"
+            entry = next((e for e in ls if e["tag"] == tag), None)
+            if entry is None:
+                raise RuntimeError(f"{tag} vanished")
+            if entry["status"] == "ready":
+                ready_ms = (time.time() - poll_start) * 1000
+                break
+            if entry["status"] == "failed":
+                raise RuntimeError(f"{tag} failed: {entry.get('warning')}")
+            time.sleep(0.05)
+        if ready_ms is None:
+            raise RuntimeError(f"{tag} did not reach ready in 60s")
+
+    mem_path = f"{WORK}/snapshots/{tag}/memory.bin"
+    if os.path.exists(mem_path):
+        mem_bytes = os.path.getsize(mem_path)
+
+    # Delete the snapshot to keep disk usage bounded. The source
+    # sandbox isn't affected; only this branch's tag goes away.
+    del_status, _ = http(base_url, "DELETE", f"/v1/snapshots/{tag}")
+    if del_status not in (200, 204):
+        # Non-fatal; bench can keep going, log it.
+        print(f" warn: DELETE {tag} -> HTTP {del_status}", file=sys.stderr)
+
+    return {
+        "mode": mode,
+        "iteration": iteration,
+        "http_round_trip_ms": round(rt_ms, 2),
+        "pause_ms": pause_ms,
+        "memory_bin_bytes": mem_bytes,
+        "poll_until_ready_ms": round(ready_ms, 2) if ready_ms is not None else None,
+    }
+
+
+def summarize(rows, csv_path):
+    # Write CSV
+    cols = [
+        "mode",
+        "iteration",
+        "http_round_trip_ms",
+        "pause_ms",
+        "memory_bin_bytes",
+        "poll_until_ready_ms",
+    ]
+    with open(csv_path, "w") as f:
+        f.write(",".join(cols) + "\n")
+        for r in rows:
+            f.write(
+                ",".join("" if r[c] is None else str(r[c]) for c in cols) + "\n"
+            )
+
+    # Per-mode p50 / p90 / max for pause_ms and round-trip.
+    by_mode = {}
+    for r in rows:
+        by_mode.setdefault(r["mode"], []).append(r)
+
+    print("\n=== SUMMARY ===")
+    print(
+        f" {'mode':<14} {'N':>3} "
+        f"{'pause_ms (p50)':>15} {'p90':>6} {'max':>6} "
+        f"{'RT_ms (p50)':>12} {'p90':>6} {'max':>6}"
+    )
+    for mode in ("live-sync", "live-async", "diff", "full"):
+        if mode not in by_mode:
+            continue
+        rs = by_mode[mode]
+        pauses = [r["pause_ms"] for r in rs if r["pause_ms"] is not None]
+        rts = [r["http_round_trip_ms"] for r in rs]
+        if pauses:
+            p_p50 = statistics.median(pauses)
+            p_p90 = statistics.quantiles(pauses, n=10)[-1] if len(pauses) >= 2 else pauses[0]
+            p_max = max(pauses)
+        else:
+            p_p50 = p_p90 = p_max = float("nan")
+        rt_p50 = statistics.median(rts)
+        rt_p90 = statistics.quantiles(rts, n=10)[-1] if len(rts) >= 2 else rts[0]
+        rt_max = max(rts)
+        print(
+            f" {mode:<14} {len(rs):>3} "
+            f"{p_p50:>15.1f} {p_p90:>6.1f} {p_max:>6.1f} "
+            f"{rt_p50:>12.1f} {rt_p90:>6.1f} {rt_max:>6.1f}"
+        )
+
+    # Headline ratio: live-sync p50 vs diff p50.
+    if "live-sync" in by_mode and "diff" in by_mode:
+        live_pauses = [
+            r["pause_ms"] for r in by_mode["live-sync"] if r["pause_ms"] is not None
+        ]
+        diff_pauses = [
+            r["pause_ms"] for r in by_mode["diff"] if r["pause_ms"] is not None
+        ]
+        if live_pauses and diff_pauses:
+            live_p50 = statistics.median(live_pauses)
+            diff_p50 = statistics.median(diff_pauses)
+            ratio = diff_p50 / live_p50 if live_p50 > 0 else float("inf")
+            print(
+                f"\n diff_p50 / live_p50 = {diff_p50:.0f}/{live_p50:.1f} "
+                f"= {ratio:.1f}×"
+            )
+    print(f"\n CSV: {csv_path}")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    parser.add_argument("--source-tag", default="python-numpy")
+    parser.add_argument("--snap-root", default=DEFAULT_SNAP_ROOT)
+    parser.add_argument("--controller-bin", default=DEFAULT_BIN)
+    parser.add_argument("--patched-fc", default=DEFAULT_FC)
+    parser.add_argument(
+        "--port", type=int, default=8891, help="port for the isolated controller"
+    )
+    parser.add_argument(
+        "--iterations", type=int, default=10, help="branches per mode"
+    )
+    parser.add_argument(
+        "--modes",
+        default="live-sync,live-async,diff,full",
+        help="comma-separated subset of {live-sync,live-async,diff,full}",
+    )
+    parser.add_argument(
+        "--out-csv",
+        default="/tmp/forkd-bench-live/bench-live-fork.csv",
+    )
+    args = parser.parse_args()
+
+    bind = f"127.0.0.1:{args.port}"
+    base_url = f"http://{bind}"
+
+    source_dir = os.path.join(args.snap_root, args.source_tag)
+    if not os.path.isdir(source_dir):
+        sys.exit(f"source snapshot not found: {source_dir}")
+
+    # Probe source size — useful for the writeup.
+    src_mem = os.path.join(source_dir, "memory.bin")
+    src_bytes = os.path.getsize(src_mem) if os.path.exists(src_mem) else None
+
+    modes = args.modes.split(",")
+    for m in modes:
+        if m not in {"live-sync", "live-async", "diff", "full"}:
+            sys.exit(f"unknown mode {m}")
+
+    print(f"[*] source: {source_dir}")
+    if src_bytes:
+        print(f" memory.bin: {src_bytes} bytes ({src_bytes // (1024 * 1024)} MiB)")
+    print(f"[*] modes: {modes}, iterations per mode: {args.iterations}")
+    print(f"[*] controller on {bind}")
+
+    print("[*] kill leftovers")
+    kill_leftovers(bind)
+
+    print(f"[*] setup work dir {WORK}")
+    setup_workdir(args.source_tag, source_dir, args.patched_fc)
+
+    print("[*] start daemon")
+    daemon = start_daemon(args.controller_bin, bind)
+    rows = []
+    try:
+        wait_for_healthy(base_url, args.port)
+        print("[+] daemon healthy")
+
+        # Spawn one live-fork source sandbox; all BRANCHes hit it.
+        print(f"\n[*] POST /v1/sandboxes live_fork=true tag={args.source_tag}")
+        status, body = http(
+            base_url,
+            "POST",
+            "/v1/sandboxes",
+            {"snapshot_tag": args.source_tag, "n": 1, "live_fork": True},
+        )
+        if status != 201:
+            raise RuntimeError(f"spawn HTTP {status}: {body!r}")
+        sandbox_id = body[0]["id"]
+        print(f"[+] sandbox {sandbox_id}")
+
+        # Give the guest a moment to settle (some recipes do post-boot
+        # work). Keep it small so the bench's "agent state" isn't
+        # dominated by warmup work.
+        time.sleep(1.5)
+
+        # Interleave modes so any one-shot effects (cold cache,
+        # warm-up, file-system state) average out instead of stacking
+        # on the last mode.
+        for i in range(args.iterations):
+            for m in modes:
+                print(f" [{m} #{i}] ...", end=" ", flush=True)
+                row = branch_once(base_url, sandbox_id, m, None, i)
+                rows.append(row)
+                extra = ""
+                if row["poll_until_ready_ms"] is not None:
+                    extra = f" ready+{row['poll_until_ready_ms']:.0f}ms"
+                print(
+                    f"pause={row['pause_ms']}ms "
+                    f"rt={row['http_round_trip_ms']:.0f}ms{extra}"
+                )
+
+        summarize(rows, args.out_csv)
+
+    finally:
+        print("\n[*] tearing down")
+        subprocess.run(["sudo", "kill", str(daemon.pid)], stderr=subprocess.DEVNULL)
+        subprocess.run(
+            ["sudo", "pkill", "-9", "-f", "/usr/local/bin/firecracker"],
+            stderr=subprocess.DEVNULL,
+        )
+        time.sleep(0.5)
+        restore_firecracker()
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception as e:
+        print(f"\n[!] FAIL: {e}", file=sys.stderr)
+        sys.exit(1)
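
Review note: the pause-window figures in the TL;DR table can be cross-checked without rerunning the bench. The sketch below is not part of the PR; it hard-codes the `pause_ms` column from `bench-live-fork.csv` and applies the same `statistics.median` / `statistics.quantiles(n=10)[-1]` calls that `summarize()` uses. The names `PAUSE_MS` and `percentiles` are hypothetical helpers introduced only for this check.

```python
import statistics

# pause_ms per mode, copied from bench-live-fork.csv (iterations 0-9).
PAUSE_MS = {
    "live-sync": [40, 64, 58, 54, 58, 61, 64, 43, 39, 32],
    "live-async": [48, 90, 57, 59, 51, 41, 68, 38, 258, 50],
    "diff": [434, 238, 227, 207, 164, 271, 196, 190, 179, 170],
    "full": [13163, 13828, 13524, 14314, 13068,
             13576, 13298, 13510, 13638, 13850],
}


def percentiles(samples):
    """Return (p50, p90, max) using the same calls as summarize()."""
    p50 = statistics.median(samples)
    # quantiles(n=10) yields nine decile cut points; the last one is p90.
    p90 = statistics.quantiles(samples, n=10)[-1]
    return p50, p90, max(samples)


summary = {mode: percentiles(vals) for mode, vals in PAUSE_MS.items()}
for mode, (p50, p90, worst) in summary.items():
    print(f"{mode:<11} p50={p50:.1f} p90={p90:.1f} max={worst}")

# Headline ratio: Diff p50 over live-sync p50.
diff_over_live = summary["diff"][0] / summary["live-sync"][0]
print(f"diff p50 / live-sync p50 = {diff_over_live:.1f}x")
```

With ten samples per mode, `statistics.quantiles` (default exclusive method) interpolates p90 between the two largest values, which is why the live-async p90 of 241 ms sits just below the single 258 ms outlier rather than near the 90 ms runner-up.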