feat(doctor,uffd): Phase 7.4 — uffd_wp + memfd_create capability checks#207

Merged
WaylandYang merged 1 commit into main from feat/v0.4-phase7.4-doctor-uffd-memfd
May 31, 2026

@WaylandYang

Summary

forkd doctor now probes the two kernel features v0.4 live-fork needs, so users see capability problems before they hit them mid-BRANCH.

Two new checks (15 + 16):

| Check | Probe | On fail |
|-------|-------|---------|
| uffd_wp (v0.4 live BRANCH) | userfaultfd(2) + UFFDIO_API with UFFD_FEATURE_PAGEFAULT_FLAG_WP | WARN. EPERM → sysctl vm.unprivileged_userfaultfd=1 hint. ENOSYS → "kernel < 5.7, Diff still works". |
| memfd_create (v0.4 live BRANCH) | memfd_create(2) with MFD_CLOEXEC | WARN. ENOSYS → "seccomp blocking; check container profile". |

WARN (not FAIL) because v0.3 Diff and Full BRANCH still work without uffd_wp/memfd. The WARN explicitly tells the user which path they lose, so they can decide whether to fix it.
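The severity rule above can be sketched as a small errno-to-hint mapping. This is an illustration only: `classify`, the two hint tables, and the hint wording are hypothetical and paraphrase the table in this PR; they are not forkd's actual Rust code or message strings.

```python
import errno
import os
from typing import Optional, Tuple

# Hypothetical hint tables; wording paraphrases the PR summary.
UFFD_WP_HINTS = {
    errno.EPERM: "try: sudo sysctl -w vm.unprivileged_userfaultfd=1",
    errno.ENOSYS: "kernel < 5.7: live BRANCH unavailable, Diff still works",
}
MEMFD_HINTS = {
    errno.ENOSYS: "seccomp may be blocking memfd_create; check the container profile",
}


def classify(check: str, err: Optional[int]) -> Tuple[str, str]:
    """Map a probe result to (status, detail).

    `err` is None when the probe succeeded, otherwise the errno it failed with.
    A failed probe is reported as WARN, never FAIL, because the Diff and Full
    BRANCH paths do not depend on either feature.
    """
    if err is None:
        return ("PASS", "supported")
    hints = UFFD_WP_HINTS if check == "uffd_wp" else MEMFD_HINTS
    # Fall back to the plain OS error text when no specific hint applies.
    return ("WARN", hints.get(err, os.strerror(err)))


print(classify("uffd_wp", errno.EPERM))
print(classify("memfd_create", None))
```

Keeping the mapping separate from the syscall makes the message logic testable on hosts where the feature happens to be available.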

Shared probe module forkd_uffd::probe:

  • probe_uffd_wp() / probe_memfd_create() — pub helpers, minimum syscalls, drop the fd before returning. Kept in forkd-uffd (where the rest of the UFFD_WP machinery already lives) but isolated in their own module so forkd-cli doesn't pull in the snapshot-side machinery just to probe.
  • 2 unit tests.
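For readers who want to reproduce the probes outside forkd, here is a rough Python equivalent of the two syscall sequences described above. It is a sketch under assumptions, not a port of `forkd_uffd::probe`: the function names mirror the Rust helpers only for readability, the syscall-number table covers just x86_64 and aarch64, and the constants are taken from `<linux/userfaultfd.h>`.

```python
import ctypes
import errno
import fcntl
import os
import platform
import struct
import sys
from typing import Optional

# Assumed constants; syscall numbers differ per architecture.
SYS_USERFAULTFD = {"x86_64": 323, "aarch64": 282}
UFFD_API = 0xAA
UFFD_FEATURE_PAGEFAULT_FLAG_WP = 1 << 0
UFFDIO_API = 0xC018AA3F  # _IOWR(0xAA, 0x3F, 24-byte struct uffdio_api)


def probe_memfd_create() -> Optional[int]:
    """Return None if memfd_create(2) works, otherwise the errno."""
    if not hasattr(os, "memfd_create"):
        return errno.ENOSYS
    try:
        fd = os.memfd_create("doctor-probe", os.MFD_CLOEXEC)
    except OSError as exc:
        return exc.errno
    os.close(fd)  # drop the fd before returning
    return None


def probe_uffd_wp() -> Optional[int]:
    """Return None if userfaultfd(2) accepts the write-protect feature."""
    nr = SYS_USERFAULTFD.get(platform.machine())
    if not sys.platform.startswith("linux") or nr is None:
        return errno.ENOSYS  # unknown platform: treat as unsupported
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    flags = os.O_CLOEXEC | os.O_NONBLOCK
    fd = libc.syscall(ctypes.c_long(nr), ctypes.c_int(flags))
    if fd < 0:
        return ctypes.get_errno()
    try:
        # struct uffdio_api { __u64 api; __u64 features; __u64 ioctls; }
        arg = bytearray(struct.pack("QQQ", UFFD_API, UFFD_FEATURE_PAGEFAULT_FLAG_WP, 0))
        fcntl.ioctl(fd, UFFDIO_API, arg)  # raises OSError if the handshake fails
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)  # drop the fd on every path
    return None


print("memfd_create:", probe_memfd_create())
print("uffd_wp:", probe_uffd_wp())
```

Each function returns `None` on success or a raw errno, so the caller decides how to word the hint.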

Manual verification (dev box, Linux 6.14)

$ forkd doctor   # vm.unprivileged_userfaultfd=0, unprivileged user
✓  memfd_create (v0.4 live BRANCH)  supported
⚠  uffd_wp (v0.4 live BRANCH)       userfaultfd(2) — ... Operation not permitted
                                    → sudo sysctl -w vm.unprivileged_userfaultfd=1 ...

$ sudo forkd doctor                # with CAP_SYS_PTRACE
✓  uffd_wp (v0.4 live BRANCH)       supported
✓  memfd_create (v0.4 live BRANCH)  supported

Test plan

  • cargo fmt --check --all — clean
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo test --workspace — 98 / 98 pass (was 96; +2 new probe tests)
  • RUSTDOCFLAGS=-D warnings cargo doc --no-deps — clean
  • forkd doctor end-to-end on Linux 6.14 — both PASS and WARN paths verified
  • Run on a kernel < 5.7 to verify ENOSYS path messaging — deferred (no such host available)

🤖 Generated with Claude Code

`forkd doctor` now probes the two kernel features the v0.4 live-fork
path needs. Saves users from hitting them as opaque errors mid-BRANCH.

New checks (15 + 16 in the doctor list):

- **uffd_wp (v0.4 live BRANCH)** — opens a `userfaultfd(2)`, negotiates
  `UFFDIO_API` with `UFFD_FEATURE_PAGEFAULT_FLAG_WP`, drops the fd.
  Maps EPERM to a `sysctl vm.unprivileged_userfaultfd=1` hint; maps
  ENOSYS to a "kernel < 5.7 — live BRANCH unavailable but Diff still
  works" hint.
- **memfd_create (v0.4 live BRANCH)** — opens an anonymous memfd and
  drops it. ENOSYS hints at a restrictive seccomp profile (containers).

Both checks WARN (not FAIL) on failure: v0.3 Diff BRANCH and Full
BRANCH still work fine without uffd_wp/memfd, so a red FAIL in doctor
would overstate the actual user impact. The WARN message tells the user
exactly which path they lose.

Shared probe module — `forkd_uffd::probe`:

- `probe_uffd_wp()` / `probe_memfd_create()` are pub helpers that
  perform the minimum syscall needed and drop the fd, with
  human-readable error contexts. Kept in `forkd-uffd` because that's
  where the rest of the UFFD_WP machinery lives, but trimmed to its
  own module so doctor doesn't pull in the snapshot-side machinery.
- 2 unit tests: PASS on a supported host, error-message contains
  actionable keywords otherwise.

Manual verification on dev box (Linux 6.14, `unprivileged_userfaultfd=0`):

- Unprivileged: `uffd_wp` WARN ("Operation not permitted") with the
  exact `sysctl` hint. `memfd_create` PASS.
- Root (CAP_SYS_PTRACE granted): both PASS.

Gates: fmt ✓ · clippy -D warnings ✓ · `cargo test --workspace` 98/98
(was 96; +2 new probe tests) · `RUSTDOCFLAGS=-D warnings cargo doc` ✓.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WaylandYang WaylandYang merged commit a372e2a into main May 31, 2026
2 checks passed
@WaylandYang WaylandYang deleted the feat/v0.4-phase7.4-doctor-uffd-memfd branch May 31, 2026 06:31
WaylandYang added a commit that referenced this pull request May 31, 2026
…urface (#208)

Phases 6 and 7 shipped the full v0.4 live BRANCH path (sub-50 ms source
pause via UFFD_WP + memfd) across REST, CLI, and SDKs — but the README
still pitched it as "experimental, try with `forkd wp-bench`" and the
status section explicitly claimed "we chose not to fork Firecracker."
That contradicts reality on main today.

Updates:

- **README.md / README-zh.md** — v0.4 preview block rewritten as "v0.4
  live BRANCH" with the actual user-facing surface (REST `mode: "live"`,
  CLI `--live` / `--no-wait`, SDK `mode=`). Doctor check count
  14 → 16 (uffd_wp + memfd_create). Python and TypeScript SDK examples
  show `live_fork=True` / `liveFork: true` + `mode="live"` + `wait=False`.
  Status section: "we chose not to fork Firecracker" paragraph replaced
  with the honest version — we did fork, here's the branch, here's the
  upstream proposal, vendor requirement goes away if upstream takes it.
- **docs/API.md** — `POST /v1/sandboxes/:id/branch` documents `mode`,
  `wait`, the `mode`/`diff` mutex (HTTP 400), and per-mode pause
  semantics. `POST /v1/sandboxes` documents `live_fork`. `SnapshotInfo`
  gains the `status` field for the `wait=false` lifecycle.
- **DESIGN-v0.4.md** — status banner flipped from DRAFT to IMPLEMENTED
  with links to PRs #194 through #207; DRAFT body preserved verbatim as the
  architecture record (the implementation tracks it closely).
- **CHANGELOG.md** — Unreleased gets a "v0.4 live-fork: user-facing
  surface complete" section. Calls out the prereqs (Linux ≥ 5.7,
  `unprivileged_userfaultfd=1`, vendored FC fork) and the one
  known CLI gap (`forkd fork --live-fork` for spawn-time opt-in
  isn't surfaced — use SDK / REST for now; tracking as a follow-up).
- ROADMAP.md left as-is — it's milestone-shaped (M1/M2/M3) and v0.4
  live-fork wasn't on the original critical path; the CHANGELOG + Status
  section already cover the shipped state.
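As an illustration of the documented branch surface, the sketch below builds (but does not send) a `POST /v1/sandboxes/:id/branch` request and enforces the `mode`/`diff` mutual exclusion on the client side. Only the endpoint path, the `mode: "live"` value, and the `wait` field come from the notes above; the helper name, the base URL, the boolean type of `diff`, and the `wait` default are assumptions, so check docs/API.md for the real schema.

```python
import json
import urllib.request
from typing import Optional


def build_branch_request(
    base_url: str,
    sandbox_id: str,
    mode: Optional[str] = None,
    diff: Optional[bool] = None,
    wait: bool = True,  # assumed default; not confirmed by the notes above
) -> urllib.request.Request:
    """Build a branch request without sending it (hypothetical helper)."""
    if mode is not None and diff is not None:
        # The notes say the server rejects this combination with HTTP 400;
        # failing early on the client saves a round trip.
        raise ValueError("mode and diff are mutually exclusive")
    body = {"wait": wait}
    if mode is not None:
        body["mode"] = mode
    if diff is not None:
        body["diff"] = diff
    return urllib.request.Request(
        f"{base_url}/v1/sandboxes/{sandbox_id}/branch",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and sandbox id, for illustration only.
req = build_branch_request("http://127.0.0.1:8080", "sb-1", mode="live", wait=False)
print(req.get_method(), req.full_url, req.data.decode())
```

With `wait=False` the caller would then poll the returned snapshot's `status` field until it reports ready.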

No code change; pure docs.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WaylandYang added a commit that referenced this pull request May 31, 2026
…rce (#210)

Replaces the "pause_ms TBD" disclaimer in v0.4 docs with measured
numbers from a clean Hub-pulled `python-numpy` source (1.5 GiB,
sha256-verified). The previous attempt at this measurement used
`coding-agent-fork-prewarm-v1`, which had 17 baked-in guest Oopses
contaminating the timing — fixed by switching source.

Methodology (`bench/live-fork-pause-window/bench-live-fork.py`,
based on `scripts/dev/e2e-live-branch.py` Phase 6 E2E harness):

- One memfd-backed source sandbox spawned with `live_fork: true`
- 10 iterations × 4 modes ({live-sync, live-async, diff, full}),
  interleaved so cold-cache effects average across modes
- Each iteration: POST .../branch, record `pause_ms` and HTTP RT,
  DELETE the result snapshot to bound disk usage
- Async iterations also record `poll_until_ready_ms`
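The interleaving and percentile steps can be sketched as follows. This is not the actual `bench-live-fork.py`: `run_interleaved`, `percentile`, and the `branch_once` callback are hypothetical names, and the nearest-rank percentile is an assumption about how p50/p90 might be computed.

```python
from collections import defaultdict
from typing import Callable, Dict, List

MODES = ["live-sync", "live-async", "diff", "full"]


def run_interleaved(
    branch_once: Callable[[str], float], iterations: int = 10
) -> Dict[str, List[float]]:
    """Run every mode once per iteration so cache warm-up is shared across modes."""
    samples: Dict[str, List[float]] = defaultdict(list)
    for _ in range(iterations):
        for mode in MODES:
            # branch_once stands in for: POST .../branch, read pause_ms,
            # then DELETE the resulting snapshot.
            samples[mode].append(branch_once(mode))
    return samples


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


# Usage with a fake measurement that returns a fixed pause per mode.
fake_pause_ms = {"live-sync": 56.0, "live-async": 54.0, "diff": 202.0, "full": 13550.0}
results = run_interleaved(lambda mode: fake_pause_ms[mode], iterations=3)
for mode in MODES:
    print(mode, percentile(results[mode], 50), percentile(results[mode], 90))
```

Running the modes back to back inside each iteration, rather than all iterations of one mode first, spreads any drift in host state evenly over the four modes.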

Results (Intel i7-12700, 30 GiB RAM, Linux 6.14, ext4 on **HDD**):

| mode         | pause p50 | pause p90 | RT p50    |
|--------------|----------:|----------:|----------:|
| live-sync    |  **56 ms**|     64 ms | 13 730 ms |
| live-async   |     54 ms |    241 ms | **69 ms** |
| diff         |    202 ms |    418 ms | 13 461 ms |
| full         |  13 550 ms |  14 268 ms | 13 559 ms |

Key ratios at p50:

- live vs diff: **3.6× faster pause** (202 / 56)
- live vs full: **242× faster pause** (13550 / 56)
- async RT vs sync RT: **199× faster return** (13730 / 69 ≈ 198.99)
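The ratios can be rechecked directly from the p50 columns of the results table; the snippet below only repeats that arithmetic, with variable names chosen for illustration.

```python
# p50 values copied from the results table, in milliseconds.
pause_p50 = {"live-sync": 56, "live-async": 54, "diff": 202, "full": 13550}
rt_p50 = {"live-sync": 13730, "live-async": 69, "diff": 13461, "full": 13559}

live_vs_diff = pause_p50["diff"] / pause_p50["live-sync"]
live_vs_full = pause_p50["full"] / pause_p50["live-sync"]
async_vs_sync_rt = rt_p50["live-sync"] / rt_p50["live-async"]

print(f"live vs diff pause:   {live_vs_diff:.2f}x")
print(f"live vs full pause:   {live_vs_full:.2f}x")
print(f"async vs sync return: {async_vs_sync_rt:.2f}x")
```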

The "on HDD" point is a feature, not a bug for the writeup:
Live's pause is disk-independent (memory copy runs after resume,
not during), so the Live / Diff gap *widens* on slow storage rather
than shrinking. NVMe would speed up Diff but not Live, making the
ratio narrower — but Live is always bounded by CPU work (vmstate
dump + UFFD_WP arming), never by disk throughput.

Files:

- `bench/live-fork-pause-window/bench-live-fork.py` — runnable
  harness, parameterized on source-tag and iterations
- `bench/live-fork-pause-window/bench-live-fork.csv` — 40-row raw
  data (one per BRANCH iteration)
- `bench/live-fork-pause-window/RESULTS-v0.4.md` — writeup with
  methodology, host config, per-mode interpretation of what
  pause_ms / RT measure, and honest caveats (single host, one
  source size, p90 outlier on async iter #8)

Docs updated:

- `README.md` headline: "BRANCH a live VM in 150 ms" → "in 56 ms
  (v0.4 live mode)". v0.4 preview block now leads with the
  measured 3.6× / 200× ratios and links to RESULTS-v0.4.md.
- `README-zh.md`: same headline + intro update.
- `CHANGELOG.md`: Unreleased's v0.4 section's "Bench in progress"
  disclaimer replaced with the actual numbers table.

Phase 7 (user surface for v0.4 live BRANCH) is complete with this
PR: REST (#204), CLI (#205), SDKs (#206), doctor (#207), docs
(#208), bench (this).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>