fix(cli): three bug-bash findings — live-socket detection, temp leaks, NaN/Inf #36
Merged
Conversation
ci(release): publish Python SDK to PyPI via Trusted Publishers (OIDC)
New workflow `.github/workflows/publish-pypi.yml`:
Triggered when a GitHub Release is published (i.e. after release.yml
has built binaries + created the release on tag push). Builds sdist
+ wheel from sdk/python/ and uploads to PyPI using PyPI's Trusted
Publishers (OIDC) — no API token, no repo secret.
Trust relationship configured on PyPI side:
PyPI project = forkd
Owner = deeplethe
Repository = forkd
Workflow file = publish-pypi.yml
Environment = pypi (GitHub Actions env, also created in repo
settings with optional branch protection)
workflow_dispatch is enabled so maintainers can re-publish a tag
manually (e.g. after a transient PyPI outage).
A guard step compares the SDK version in pyproject.toml against
the release tag (with the leading v stripped) — refuses to publish
if they're out of sync, so `forkd-0.1.2` on PyPI is guaranteed to
match `git tag v0.1.2`.
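A guard of that shape fits in a few lines. A Python sketch (workflow wiring omitted; the function name and file handling are assumptions, not the actual workflow step):

```python
# Hypothetical sketch of the version-guard step: compare the SDK
# version in pyproject.toml against the release tag (leading "v"
# stripped) and refuse to publish on mismatch.
import re
import sys

def check_versions(pyproject_text: str, release_tag: str) -> str:
    m = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    if m is None:
        sys.exit("pyproject.toml has no version field")
    sdk_version = m.group(1)
    tag_version = release_tag.removeprefix("v")  # v0.1.2 -> 0.1.2
    if sdk_version != tag_version:
        sys.exit(f"refusing to publish: pyproject says {sdk_version}, "
                 f"tag says {tag_version}")
    return sdk_version
```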
Trigger chain:
push tag v0.x.y
→ release.yml: builds binaries, builds Docker image, creates GitHub Release
→ publish-pypi.yml fires on "release published"
→ sdist + wheel land on https://pypi.org/p/forkd
docs(readme): add PyPI version badge
forkd is now installable via pip install forkd; surface the current
PyPI version next to the GitHub release badge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs: add Chinese README and replace Firecracker badge with a zh-link
- README-zh.md: full Chinese translation of the README, keeping
  technical terms (Firecracker, KVM, microVM, mmap, CoW, etc.) intact.
- README.md: swap the "built on Firecracker" badge for a red 中文
  README badge that links to README-zh.md. Firecracker attribution
  is already in the prose; the badge slot is more valuable as a
  language-switch link for Chinese readers.
- README-zh.md mirrors back with an English README badge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs(bench): rebut nested-virt suspicion + clarify form-factor delta
A CubeSandbox maintainer suggested the 20.3s N=100 figure might be due
to 3-4 layers of nested virtualisation. It's not — the host is
bare-metal i7-12700 (systemd-detect-virt: none), and CubeSandbox ran
via the official one-click installer directly on the host, no dev-env
VM in between.
- bench/CUBESANDBOX.md: lead with a "Host" section that pastes the
  `systemd-detect-virt` / cpuinfo output so anyone suspecting nested
  virt can verify the claim themselves.
- README.md: tighten the CubeSandbox footnote to (a) show the host is
  bare metal up front, (b) explicitly call out that fork-from-warm
  (forkd) and cold-start (every other project) are different operating
  points being shown on one chart for shape comparison, not as
  equivalent primitives.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(roadmap): publish M1+M2 plan and scaffold playwright-browser recipe
- ROADMAP.md: M1 (browser recipe + snapshot hub + marketing pulse,
  ≈4 weeks) and M2 (diff snapshots + time-travel branching, gated on
  M1.3 user signal). Done criteria + risks per item.
- recipes/playwright-browser/: alpha scaffold for the M1.1 deliverable.
  build.sh layers a tiny Node warm-up script (launches headless
  Chromium + about:blank tab) onto the official
  mcr.microsoft.com/playwright rootfs, so the resulting parent VM has
  a fully initialised Chromium resident at snapshot time. README.md
  documents target shape + interim CLI driving path until the
  Playwright bridge in forkd-agent lands (see follow-up issue).
- recipes/README.md: add playwright-browser row + "browser-driving
  agent" entry in the chooser.
The recipe is marked alpha — the warm-up script will run as soon as
forkd-agent is taught to exec FORKD_WARMUP_CMD from
/etc/forkd-recipe.env before the snapshot. Until that lands, users can
drive Chromium via `forkd exec` (documented in the recipe README).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(agent): recipe-level eval bridge (closes #30)
Teach forkd-agent to delegate the `eval` action to a recipe-supplied
warmup subprocess via a tiny line-JSON protocol — enables Node-based
recipes (starting with playwright-browser) to expose `sb.eval(<js>)`
that runs against the parent VM's pre-launched Chromium + Playwright
state, instead of cold-spawning node/Chromium per call.
Wire protocol
-------------
Recipes drop `/etc/forkd-recipe.env` into the rootfs declaring:
FORKD_WARMUP_CMD="node /opt/forkd-warmup.js"
FORKD_AGENT_LANG=node
On boot, forkd-agent.py:
1. Reads /etc/forkd-recipe.env (env vars override file — handy for
dev smoke tests on the host).
2. If FORKD_WARMUP_CMD is set, spawns the warmup with pipes for
stdin/stdout (protocol) + stderr (forwarded to agent stdout as
"forkd-warmup: ..." for visibility).
3. Reads one JSON line from warmup stdout expecting {"ready": true};
once received, the bridge is open.
4. Routes the `eval` action to the warmup:
request {"id": <n>, "code": <str>}
reply {"id": <n>, "result": <json>}
error {"id": <n>, "error": <str>, "stack": <str>}
Serialised by _warmup_lock so concurrent connections on one VM
don't interleave on the shared pipes; cross-VM parallelism is
unaffected since each child has its own agent+warmup pair.
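The handshake plus request/reply loop is small enough to sketch in full. A minimal Python warmup in the spirit of the fake-warmup.py helper mentioned under Smoke tests (details assumed):

```python
# Minimal line-JSON warmup sketch: one JSON object per line on
# stdin/stdout. Sends {"ready": true} once, then answers each
# {"id": n, "code": s} request with a result or error reply.
import json
import sys

def serve(stdin=sys.stdin, stdout=sys.stdout):
    print(json.dumps({"ready": True}), file=stdout, flush=True)
    for line in stdin:
        req = json.loads(line)
        try:
            # Real warmups run recipe code here; eval keeps the sketch
            # short, and the result must be JSON-serialisable.
            result = eval(req["code"])
            reply = {"id": req["id"], "result": result}
        except Exception as e:
            reply = {"id": req["id"], "error": str(e), "stack": ""}
        print(json.dumps(reply), file=stdout, flush=True)
```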
The `eval` action remains a single action in the SDK surface — the
agent dispatches based on FORKD_AGENT_LANG. Python-recipe `eval` still
returns `{"result": <repr-string>}` (unchanged), Node-recipe `eval`
returns `{"result_json": <json-string>}` so the SDK can deserialise
into a native Python value cleanly without touching repr()-based
paths.
`ping` response gains `agent_lang` + `warmup_ready` for SDK-side
debug visibility.
SDK
---
Sandbox.eval() prefers `result_json` when present and json.loads()es
it. Legacy `result` (repr) path unchanged for all existing Python
recipes — fully backwards compatible. Docstring updated with both
semantics + a playwright-browser example.
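On the SDK side the reply handling reduces to a small dispatch. A sketch (only the result_json / result / error field names come from the text; the function name is an assumption):

```python
import json

def decode_eval_reply(reply):
    # Sketch of the reply handling: raise on the error path (JS stack
    # preferred when present), deserialise Node-recipe result_json to
    # a native value, and fall back to the legacy Python-recipe repr
    # string unchanged.
    if "error" in reply:
        raise RuntimeError(reply.get("stack") or reply["error"])
    if "result_json" in reply:
        return json.loads(reply["result_json"])  # Node recipes
    return reply.get("result")  # Python recipes: repr() string
```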
playwright-browser recipe
-------------------------
build.sh's warmup.js rewritten from a "park forever" stub into the
actual readline-based command loop: launches Chromium + about:blank,
sends ready handshake, evaluates incoming code as async functions
with (browser, context, page) in scope, top-level await supported.
README updated to drop the "interim shell path" workaround — the
SDK example now works.
Smoke tests
-----------
rootfs-init/tests/ (new):
- fake-warmup.py: reference protocol implementation in Python
- smoke-test.sh: end-to-end on a Linux host (agent + fake warmup +
nc-driven TCP requests). Verified on bare i7-12700 dev box —
agent_lang=node, warmup_ready=true, eval routed correctly,
result_json round-trips.
- smoke-sdk.py: exercises Sandbox.eval() stubbed against both
result_json and result paths plus the error path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(cli): Snapshot Hub MVP — pack / unpack / pull / images (closes #29)
Adds four CLI commands and a new `hub` module for moving warmed
parent snapshots between hosts without re-running each recipe's
`build.sh`.
Pack format v1
==============
`.forkd-snapshot.tar.zst` containing:
manifest.toml — tag, sha256 per file, format version,
reserved parent_tag for M2.1 diff chains
memory.bin — CoW source for child mmap
vmstate — Firecracker vCPU + device state
snapshot.json — forkd metadata (volumes)
rootfs.ext4 — block device for child overlays
Manifest's `forkd_pack_version` lets us evolve the format without
breaking older clients. `parent_tag` is reserved for the M2.1 diff
chain work (currently always None).
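For illustration, a v1 manifest might look like this (forkd_pack_version, parent_tag, the tag, and the per-file sha256 list come from the text; the exact key layout and the digests are invented):

```toml
# Illustrative manifest.toml for pack format v1 — key names beyond
# forkd_pack_version / parent_tag are assumptions.
forkd_pack_version = 1
tag = "pwb"
# parent_tag reserved for M2.1 diff chains; always absent in v1

[sha256]
"memory.bin" = "9f86d081884c7d65..."
"vmstate" = "60303ae22b998861..."
"snapshot.json" = "fd61a03af4f77d87..."
"rootfs.ext4" = "a665a45920422f9d..."
```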
CLI surface
===========
- `forkd pack --tag <local> [--out <file>] [--description <s>]
[--base-image <ref>]`
- `forkd unpack <file> [--tag <new>] [--force]`
- `forkd pull <url-or-short-form> [--tag <new>] [--force]
[--hub <base-url>]`
- `forkd images` (list local snapshots with sizes)
`pull` accepts either a plain HTTPS URL or a `<owner>/<tag>` short
form that resolves against `$FORKD_HUB_URL` (default
`https://forkd-hub.deeplethe.com`).
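The short-form resolution can be sketched as follows; the resolved path layout is not specified above, so the URL shape here is a guess:

```python
import os

DEFAULT_HUB = "https://forkd-hub.deeplethe.com"

def resolve_pull_ref(ref, hub=None):
    # Plain URLs pass through; <owner>/<tag> short forms resolve
    # against --hub, then $FORKD_HUB_URL, then the default hub.
    if ref.startswith(("http://", "https://")):
        return ref
    base = hub or os.environ.get("FORKD_HUB_URL") or DEFAULT_HUB
    owner, tag = ref.split("/", 1)
    # Hypothetical object layout; the real hub path may differ.
    return f"{base}/{owner}/{tag}.forkd-snapshot.tar.zst"
```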
Integrity guarantees
====================
- pack records sha256 per file in manifest.toml
- unpack verifies sha256 against the manifest after extraction;
partial extracts are visible for debugging
- unpack refuses path-traversal entries (`../escape`, absolute paths)
- pack-format version mismatch rejected with a clear error
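Both guarantees reduce to a hash check and a path check. A Python analogue of the Rust logic (function names assumed):

```python
import hashlib
import pathlib

def is_safe_entry(name: str) -> bool:
    # Reject path-traversal entries before extraction: absolute paths
    # and any ".." component would escape the destination directory.
    p = pathlib.PurePosixPath(name)
    return not p.is_absolute() and ".." not in p.parts

def verify_sha256(path, expected_hex: str) -> bool:
    # Post-extraction integrity check against the manifest's digest,
    # streamed in 1 MiB chunks so large files don't load into RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```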
Tests
=====
Adds unit tests covering pack ↔ unpack roundtrip with synthetic
snapshot bytes, manifest TOML roundtrip, path-traversal rejection,
human-byte formatting, and the epoch → RFC3339 stamper (rolled by
hand to avoid pulling chrono just for one timestamp).
E2E scaffolding
===============
rootfs-init/tests/e2e-playwright.sh: end-to-end recipe verification
script for the dev box (clone → build → recipe build → snapshot →
fork → eval). Currently blocks on passwordless sudo at the recipe
install step — separate dev-box config item.
Dependencies
============
- sha2, tar, zstd, toml, ureq (sync HTTP, rustls-tls by default)
- tempfile in dev-deps for pack roundtrip tests
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(cli): drop unreachable path-traversal unit test; keep the runtime check
The `tar` crate's `Builder::append_data()` refuses to write entries
with `..` segments, so the malicious archive can't be crafted via the
safe API. The unpack-side check in hub.rs remains as defense-in-depth
against tars produced by other tooling (raw bytes, the shell `tar(1)`,
language-mismatched implementations) — replace the broken test with an
inline comment explaining what would be needed to exercise it.
Also drop the unused PathBuf import flagged by cargo check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(cli): add `forkd push` + Snapshot Hub README section
`forkd push --tag <local> <url>` packs the snapshot to a temp file and
HTTP PUTs the body to the given URL (presigned PUT from R2/S3 is the
intended fit). Streams the body via a ProgressReader so a multi-GiB
pack doesn't materialise in RAM, and prints throughput + total time on
completion.
- hub.rs: `upload()` + `ProgressReader` for PUT-with-progress.
- main.rs: `Cmd::Push` + `push_cmd`. Cleans up the temp pack whether
  the upload succeeds or fails.
- README.md: adds a "Snapshot Hub" subsection under Quick start
  showing the pack/push/pull/fork flow; doesn't yet swap the primary
  quickstart over to `forkd pull` (that flip lands when the hub bucket
  goes live and recipes are pushed).
- ROADMAP.md: mark M1.2 CLI item ✓, bucket provisioning and
  README-quickstart-flip still pending.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
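The streaming upload hinges on the ProgressReader. A Python analogue (the real type is Rust; names and the report format here are assumptions):

```python
import time

class ProgressReader:
    # Wraps a binary file object; read() passes through while tracking
    # bytes sent, so a streaming HTTP PUT never buffers the whole pack.
    # At EOF it reports total bytes, elapsed time, and throughput.
    def __init__(self, inner, total, report=print):
        self.inner, self.total, self.sent = inner, total, 0
        self.report = report
        self.start = time.monotonic()

    def read(self, size=-1):
        chunk = self.inner.read(size)
        self.sent += len(chunk)
        if not chunk:  # EOF: report throughput + total time
            elapsed = max(time.monotonic() - self.start, 1e-9)
            self.report(f"{self.sent} bytes in {elapsed:.1f}s "
                        f"({self.sent / elapsed / 2**20:.1f} MiB/s)")
        return chunk
```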
feat(cli): --mem-size-mib + fix eval result_json print; M1.1 verified
End-to-end Playwright recipe is working. Pulling all the fixes
needed to get there into one commit:
CLI changes
-----------
- `forkd snapshot --mem-size-mib <u32>`: optional override on the
parent VM's memory. The 512 MiB default is fine for Python +
numpy but OOMs Chromium (the kernel logs
`__vm_enough_memory: ... comm: headless_shell, no enough memory`
and `traps: headless_shell[379] trap int3`). Recipe README now
requires 2048 for playwright-browser.
- `forkd eval`: also prints `result_json` (Node-recipe replies),
not just `result` (Python-recipe). Without this the bridge
worked but the CLI silently dropped output. Also surfaces JS
stack traces from `error.stack` on failure.
Recipe build.sh
---------------
- npm install -g playwright@1.50.0 inside the rootfs via chroot —
the official mcr.microsoft.com/playwright image ships ONLY the
browser binaries under /ms-playwright, not the JS module.
- /etc/forkd-recipe.env wraps node with `env NODE_PATH=...
PLAYWRIGHT_BROWSERS_PATH=/ms-playwright` so require('playwright')
resolves and the JS driver finds Chromium.
- Cleanup function unmounts proc/sys/dev/pts before the loopback
to avoid EBUSY when the build script's chroot leaves bind mounts
behind.
Verification on dev box (bare-metal i7-12700)
---------------------------------------------
- snapshot of warmed Chromium parent: 16 s wall-clock (2 GiB
memory.bin)
- fork 3 children, per-child netns: 56 ms wall-clock
- ping returns `agent_lang: "node"`, `warmup_ready: true`
- `forkd eval --child forkd-child-N -- "return await page.title()"`:
10–82 ms per call, output `"Example Domain"`
- Also verified pure JS evals (`Math.PI * 2`, `[1,2,3].map(x=>x*x)`)
return correctly across multiple children.
ROADMAP
-------
M1.1 done bullets updated; risk marked resolved (Chromium does
survive snapshot-restore cleanly when given enough RAM). Larger-N
benchmark remains open.
Recipe README and e2e-playwright.sh updated with the working
command lines (--mem-size-mib 2048, --memory-limit-mib 2560).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(cli): auto-clean work_dir + forkd cleanup + scripts/netns-teardown.sh
Addresses the leftover-state pile-up observed after multiple forkd
runs (11 orphan /tmp/forkd-{fork,parent}-*/ dirs + 5 stale netns).
Splits into three pieces, each safe in isolation.
1. Auto-clean work_dir at end of `forkd snapshot` / `forkd fork`
------------------------------------------------------------
On a successful run the temp work_dir (Firecracker API sockets +
console logs) is recursively removed. New `--keep-workdir` opt-out
for both commands when post-mortem inspection is desired. On any
error path the work_dir is preserved (early-return through ?).
cleanup_workdir() asserts the path stays under /tmp/forkd- before
recursive rm; refuses anything else with an inline warning.
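The guard is a prefix check in front of the recursive remove. A Python analogue (the real code is Rust):

```python
import shutil
import sys

def cleanup_workdir(path: str) -> bool:
    # Refuse to recursively delete anything outside the forkd temp
    # namespace — a typo'd or attacker-controlled path must never
    # reach the recursive remove.
    if not path.startswith("/tmp/forkd-"):
        print(f"warning: refusing to remove {path!r}: "
              "not under /tmp/forkd-", file=sys.stderr)
        return False
    shutil.rmtree(path, ignore_errors=True)
    return True
```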
2. `forkd cleanup` — sweep leaked work_dirs
----------------------------------------
Scans /tmp/{forkd-fork-*,forkd-parent-*}/. For each candidate it
does an lsof on any contained `.sock` to detect a live Firecracker
(false-positive on "live" is safe; false-negative would nuke a live
VM, so be conservative — if lsof is missing or unreadable, mark as
live and skip).
Dry-run by default; `-y/--yes` actually deletes. Path is re-checked
to start with `/tmp/forkd-fork-` or `/tmp/forkd-parent-` immediately
before each remove call.
3. scripts/netns-teardown.sh — reverse netns-setup.sh
--------------------------------------------------
New script, dry-run by default, removes the per-child network
namespaces created by netns-setup.sh. Multiple safety nets:
- Strict regex match on `^forkd-child-[0-9]+$` for netns names.
- Belt-and-suspenders `case "$ns" in forkd-child-*) ;; *) refuse`
immediately before each `ip netns delete`.
- Bridge/tap are NEVER deleted by default — gated on
--include-bridge / --include-tap with the same name-match check.
- Deleting a netns auto-destroys the paired veth, so the script
doesn't enumerate veths directly (reduces blast radius).
docker0, br-<hex> (docker), and any other user-owned interface or
netns are untouchable by name. Verified on a host with 22 docker
networks present.
Verified on dev box
-------------------
- `forkd cleanup` (dry): listed 11 dirs, 0 marked live, none deleted.
- `forkd cleanup --yes`: removed all 11, recovered ~1.4 MiB.
- `scripts/netns-teardown.sh` (dry): listed 5 netns, exited 0.
- `scripts/netns-teardown.sh --yes`: deleted forkd-child-1..5, all
forkd-v-Nh veths went with them; docker bridges + forkd-br0 +
forkd-tap0 untouched (verified post-hoc).
- Fresh `forkd fork --tag pwb -n 2 --per-child-netns`: ran in 57 ms,
on exit logged `cleaned work_dir /tmp/forkd-fork-pwb`, ls confirms
no leftover dir.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(cli): three bug-bash findings — live-socket detection, temp leaks, NaN/Inf
Bug-bash exposed three real issues in last week's cleanup + Snapshot
Hub work. All three fixed with retests on the dev box.
1. forkd cleanup nuked live VMs (CRITICAL)
----------------------------------------
`workdir_has_live_process()` used `lsof <socket-path>` to detect
in-use work_dirs. On Ubuntu 24.04 / lsof 4.95, `lsof` against a
Firecracker UNIX domain socket emits warnings to stderr and zero
rows on stdout, even while a process is actively holding it. Our
code redirected stderr to /dev/null and trusted empty stdout to
mean "no one is using this" — would have nuked a live VM's
socket directory under `forkd cleanup --yes`.
Replaced with a `/proc/*/cmdline` scan: Firecracker children pass
`--api-sock /tmp/forkd-fork-<tag>/child-N.sock` on argv, so the
work_dir path appears verbatim in cmdline while the VM is alive.
Errs on the side of "live" if /proc is unreadable. Reverified:
the previously-misclassified `forkd-fork-pwb` (with two live
firecracker children) now correctly shows
`SKIP ... (live socket — a forkd run looks active)`.
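The cmdline scan is a short /proc walk. A Python analogue of the fix (parameterised on the proc root so it can be tested against a fake tree; names assumed):

```python
import os

def workdir_has_live_process(work_dir: str, proc: str = "/proc") -> bool:
    # Scan /proc/*/cmdline for the work_dir path: Firecracker children
    # pass --api-sock /tmp/forkd-.../child-N.sock on argv, so a live
    # VM shows the path verbatim. Err on the side of "live" on error.
    try:
        pids = [p for p in os.listdir(proc) if p.isdigit()]
    except OSError:
        return True  # /proc unreadable: assume live, skip deletion
    for pid in pids:
        try:
            with open(os.path.join(proc, pid, "cmdline"), "rb") as f:
                argv = f.read().split(b"\0")
        except OSError:
            continue  # process exited or unreadable: ignore
        if any(work_dir.encode() in a for a in argv):
            return True
    return False
```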
2. forkd unpack leaked /tmp/forkd-unpack-<pid>/ on failure
------------------------------------------------------
On any error after `create_dir_all(&tmp)` (corrupted tar.zst,
truncated archive, sha256 mismatch, dest-exists-no-force), the
temp extraction dir was never removed. Two such dirs were on
the dev box already, 16 MiB combined.
Refactored into `unpack_into()` so the cleanup path is a single
`if result.is_err() { rm tmp }` wrapper. Same pattern applied to
`pull_cmd`, which had the parallel leak for the downloaded
`.tar.zst`.
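The wrapper pattern in Python form (unpack_into internals elided; on success the inner function is assumed to move the files into place):

```python
import shutil
import tempfile

def unpack(archive, unpack_into):
    # Create the scratch dir, run the fallible extraction, and remove
    # the dir on ANY failure — corrupted archive, sha256 mismatch,
    # dest-exists-no-force — so nothing leaks under /tmp. On success
    # unpack_into is assumed to have consumed the scratch contents.
    tmp = tempfile.mkdtemp(prefix="forkd-unpack-")
    try:
        return unpack_into(archive, tmp)
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)
        raise
```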
3. forkd cleanup didn't sweep forkd-unpack-* / forkd-pull-*
-------------------------------------------------------
`cleanup` only looked at `forkd-fork-*` and `forkd-parent-*`.
Added the other two prefixes to a shared PREFIXES table; the
"starts with /tmp/" + "name matches a known prefix" safety check
applied right before each remove draws on the same table.
Bonus: warmup.js Infinity/NaN sentinels
---------------------------------------
The Playwright bridge's JSON.stringify turned `Infinity` / `NaN`
/ `-Infinity` into `null` silently — so `sb.eval("return 1/0")`
came back as `null`, indistinguishable from a legitimately null
return. Replacer now emits `"__js_Infinity__"` / `"__js_-Infinity__"`
/ `"__js_NaN__"` sentinels. Takes effect on next rootfs rebuild
of the recipe (build.sh emits the new warmup.js).
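The same sentinel mapping, shown in Python for illustration (the real replacer is the JSON.stringify replacer in warmup.js; only the sentinel strings come from the text):

```python
import math

def js_safe(value):
    # Map non-finite floats to the sentinel strings the warmup.js
    # replacer emits, so Infinity/NaN survive JSON serialisation
    # instead of silently collapsing to null.
    if isinstance(value, float):
        if math.isnan(value):
            return "__js_NaN__"
        if value == math.inf:
            return "__js_Infinity__"
        if value == -math.inf:
            return "__js_-Infinity__"
    if isinstance(value, list):
        return [js_safe(v) for v in value]
    if isinstance(value, dict):
        return {k: js_safe(v) for k, v in value.items()}
    return value
```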
Verified on dev box with the still-live pwb pair: cleanup correctly
skipped the live fork, cleared 3 stale unpack scratch dirs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WaylandYang added a commit that referenced this pull request on May 13, 2026:
Two more findings from bug-bash, both around concurrent operations
on the same tag.
1. Concurrent forkd fork/snapshot on same tag cascaded into
confusing Firecracker errors
--------------------------------------------------------
Two simultaneous `forkd fork --tag X` runs produce overlapping
API sockets in /tmp/forkd-fork-X/. The second one's failure
surfaced as:
Error: restore_many_with failed
Caused by: firecracker API PUT /snapshot/load returned 400:
"Open tap device failed: Resource busy ...
Invalid TUN/TAP Backend provided by forkd-tap0."
That's three layers deep and doesn't tell the user the real
reason: "another forkd run is already using this tag".
Added `preflight_workdir()` to snapshot_cmd and fork_cmd. If
the work_dir exists AND has a live process holding fds inside
it, refuse up-front with:
Error: another `forkd fork` looks active on tag 'pwb' —
its work_dir at /tmp/forkd-fork-pwb still has a live
Firecracker process holding sockets. Wait for the other
run to finish (or kill it) before re-running. If you're
sure nothing's alive, run `forkd cleanup --yes`.
If the work_dir exists but no process is using it (--keep-workdir
from a previous run, or a crash), preflight cleans it before
proceeding. Logged so the user knows why we're touching it.
2. workdir_has_live_process() over-flagged any process whose
argv mentions the path
-------------------------------------------------------
The cmdline-substring approach from PR #36 had false positives:
ANY shell command mentioning the work_dir path — including the
shell running `forkd cleanup` itself — got flagged as "live".
Caught while bug-bashing the preflight code path.
Switched to /proc/<pid>/fd/* readlink scan: returns true iff
some process holds an open fd resolving to a path under the
work_dir. Firecracker children redirect stdout to
<work_dir>/child-N.console, so a real live VM is always
detectable. False-positive surface drops to zero.
Comments explain why we don't use lsof (the bug fixed in PR
#36) and why /proc cmdline (this bug). Errs on the side of
"live" if /proc is unreadable.
Verified
--------
- Stale /tmp/forkd-fork-pwb (no live firecracker) → preflight
cleans + new fork succeeds in 56 ms.
- Two concurrent `forkd fork --tag pwb` → first runs to completion,
second gets the clean "another forkd fork looks active" error.
- `forkd cleanup` no longer shows the running shell as a live
process holder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WaylandYang added a commit that referenced this pull request on May 13, 2026:
* docs(recipes): fix issue refs in playwright-browser to #30
The recipe README referenced #28 for the Playwright bridge in
forkd-agent; #28 ended up being the diff-snapshots tracker, with the
bridge filed as #30. Point all four references to #30.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: nudge re-run on PR #31
* docs(roadmap): mark M1.1 progress — recipe scaffold + agent bridge done
* style(cli): apply cargo fmt to hub.rs + main.rs
* style(cli): inline 'ROOTFS?' literal to silence clippy::print-literal
* feat(cli): auto-clean work_dir + forkd cleanup + scripts/netns-teardown.sh
Addresses the leftover-state pile-up observed after multiple forkd
runs (11 orphan /tmp/forkd-{fork,parent}-*/ dirs + 5 stale netns).
Splits into three pieces, each safe in isolation.
1. Auto-clean work_dir at end of `forkd snapshot` / `forkd fork`
------------------------------------------------------------
On a successful run the temp work_dir (Firecracker API sockets +
console logs) is recursively removed. New `--keep-workdir` opt-out
for both commands when post-mortem inspection is desired. On any
error path the work_dir is preserved (early-return through ?).
cleanup_workdir() asserts the path stays under /tmp/forkd- before
recursive rm; refuses anything else with an inline warning.
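A minimal sketch of that guard, in Python for illustration (the real cleanup_workdir() is Rust):

```python
import shutil
from pathlib import Path

def cleanup_workdir(work_dir: str) -> bool:
    """Recursively remove a work_dir, but only under /tmp/forkd-.

    Resolve the path first so `..` segments can't sneak the target
    outside the prefix; refuse anything else with a warning instead
    of deleting. Sketch of the guard described above.
    """
    resolved = Path(work_dir).resolve()
    if not str(resolved).startswith("/tmp/forkd-"):
        print(f"warning: refusing to remove {resolved}: not under /tmp/forkd-")
        return False
    shutil.rmtree(resolved, ignore_errors=True)
    return True
```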
2. `forkd cleanup` — sweep leaked work_dirs
----------------------------------------
Scans /tmp/{forkd-fork-*,forkd-parent-*}/. For each candidate it
does an lsof on any contained `.sock` to detect a live Firecracker
(false-positive on "live" is safe; false-negative would nuke a live
VM, so be conservative — if lsof is missing or unreadable, mark as
live and skip).
Dry-run by default; `-y/--yes` actually deletes. Path is re-checked
to start with `/tmp/forkd-fork-` or `/tmp/forkd-parent-` immediately
before each remove call.
3. scripts/netns-teardown.sh — reverse netns-setup.sh
--------------------------------------------------
New script, dry-run by default, removes the per-child network
namespaces created by netns-setup.sh. Multiple safety nets:
- Strict regex match on `^forkd-child-[0-9]+$` for netns names.
- Belt-and-suspenders `case "$ns" in forkd-child-*) ;; *) refuse`
immediately before each `ip netns delete`.
- Bridge/tap are NEVER deleted by default — gated on
--include-bridge / --include-tap with the same name-match check.
- Deleting a netns auto-destroys the paired veth, so the script
doesn't enumerate veths directly (reduces blast radius).
docker0, br-<hex> (docker), and every other user-owned interface
or netns are left untouched by the name checks. Verified on a host
with 22 docker networks present.
Verified on dev box
-------------------
- `forkd cleanup` (dry): listed 11 dirs, 0 marked live, none deleted.
- `forkd cleanup --yes`: removed all 11, recovered ~1.4 MiB.
- `scripts/netns-teardown.sh` (dry): listed 5 netns, exited 0.
- `scripts/netns-teardown.sh --yes`: deleted forkd-child-1..5, all
forkd-v-Nh veths went with them; docker bridges + forkd-br0 +
forkd-tap0 untouched (verified post-hoc).
- Fresh `forkd fork --tag pwb -n 2 --per-child-netns`: ran in 57 ms,
on exit logged `cleaned work_dir /tmp/forkd-fork-pwb`, ls confirms
no leftover dir.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cli): three bug-bash findings — live-socket detection, temp leaks, NaN/Inf
Bug-bash exposed three real issues in last week's cleanup + Snapshot
Hub work. All three fixed with retests on the dev box.
1. forkd cleanup nuked live VMs (CRITICAL)
----------------------------------------
`workdir_has_live_process()` used `lsof <socket-path>` to detect
in-use work_dirs. On Ubuntu 24.04 / lsof 4.95, `lsof` against a
Firecracker UNIX domain socket emits warnings to stderr and zero
rows on stdout, even while a process is actively holding it. Our
code redirected stderr to /dev/null and trusted empty stdout to
mean "no one is using this" — would have nuked a live VM's
socket directory under `forkd cleanup --yes`.
Replaced with a `/proc/*/cmdline` scan: Firecracker children pass
`--api-sock /tmp/forkd-fork-<tag>/child-N.sock` on argv, so the
work_dir path appears verbatim in cmdline while the VM is alive.
Errs on the side of "live" if /proc is unreadable. Reverified:
the previously-misclassified `forkd-fork-pwb` (with two live
firecracker children) now correctly shows
`SKIP ... (live socket — a forkd run looks active)`.
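The replacement scan can be sketched like so (Python stand-in; the real code is Rust, and this version is deliberately conservative):

```python
import os

def workdir_has_live_process(work_dir: str) -> bool:
    """Does any process mention work_dir on its argv?

    Firecracker children pass --api-sock <work_dir>/child-N.sock, so
    the path appears verbatim in /proc/<pid>/cmdline while the VM is
    alive. Errs on the side of "live": if /proc itself is unreadable,
    report a hit so cleanup skips instead of deleting.
    """
    try:
        pids = [p for p in os.listdir("/proc") if p.isdigit()]
    except OSError:
        return True  # /proc unreadable: assume live
    needle = work_dir.encode()
    for pid in pids:
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                argv = f.read().split(b"\0")
        except OSError:
            continue  # process exited mid-scan; no evidence either way
        if any(needle in arg for arg in argv):
            return True
    return False
```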
2. forkd unpack leaked /tmp/forkd-unpack-<pid>/ on failure
------------------------------------------------------
On any error after `create_dir_all(&tmp)` (corrupted tar.zst,
truncated archive, sha256 mismatch, dest-exists-no-force), the
temp extraction dir was never removed. Two such dirs were on
the dev box already, 16 MiB combined.
Refactored into `unpack_into()` so the cleanup path is a single
`if result.is_err() { rm tmp }` wrapper. Same pattern applied to
`pull_cmd`, which had the parallel leak for the downloaded
`.tar.zst`.
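The wrapper shape, sketched in Python (`unpack_into` here is a stand-in parameter for the real extraction + sha256-verification body, not the actual signature):

```python
import shutil
import tempfile
from pathlib import Path

def unpack(archive: str, dest: str, unpack_into) -> Path:
    """Single-error-path cleanup, as in the unpack_into() refactor.

    `unpack_into(archive, tmp, dest)` may fail at any step (bad
    tar.zst, truncated archive, sha256 mismatch, dest exists); one
    wrapper catches every failure mode and removes the scratch dir.
    """
    tmp = Path(tempfile.mkdtemp(prefix="forkd-unpack-"))
    try:
        unpack_into(archive, tmp, dest)
    except BaseException:
        # the single cleanup path: no more leaked scratch dirs
        shutil.rmtree(tmp, ignore_errors=True)
        raise
    return tmp
```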
3. forkd cleanup didn't sweep forkd-unpack-* / forkd-pull-*
-------------------------------------------------------
`cleanup` only looked at `forkd-fork-*` and `forkd-parent-*`.
Added the other two prefixes to a shared PREFIXES table; the
"starts with /tmp/" + "name matches a known prefix" safety check
that runs right before each remove draws on the same table.
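A sketch of the shared-table idea (hypothetical Python; the real table lives in the Rust CLI):

```python
import os

# One shared table: the sweep scan and the last-moment safety check
# both draw from it, so adding a prefix can't update one and miss
# the other.
PREFIXES = ("forkd-fork-", "forkd-parent-", "forkd-unpack-", "forkd-pull-")

def is_sweepable(path: str) -> bool:
    """True iff path sits directly under /tmp and matches a known prefix."""
    parent, name = os.path.split(os.path.normpath(path))
    return parent == "/tmp" and name.startswith(PREFIXES)
```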
Bonus: warmup.js Infinity/NaN sentinels
---------------------------------------
The Playwright bridge's JSON.stringify turned `Infinity` / `NaN`
/ `-Infinity` into `null` silently — so `sb.eval("return 1/0")`
came back as `null`, indistinguishable from a legitimately null
return. Replacer now emits `"__js_Infinity__"` / `"__js_-Infinity__"`
/ `"__js_NaN__"` sentinels. Takes effect on next rootfs rebuild
of the recipe (build.sh emits the new warmup.js).
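On the SDK side the sentinels can be mapped back to Python floats. A sketch (the sentinel strings are the ones above; the helper name and its placement in the SDK are invented for illustration):

```python
import json
import math

_SENTINELS = {
    "__js_Infinity__": math.inf,
    "__js_-Infinity__": -math.inf,
    "__js_NaN__": math.nan,
}

def decode_js_value(result_json: str):
    """Deserialise a Node-recipe result_json, restoring Infinity/NaN.

    Walks the decoded value and swaps sentinel strings back to their
    float counterparts, so `sb.eval("return 1/0")` is distinguishable
    from a legitimately null return.
    """
    def restore(v):
        if isinstance(v, str) and v in _SENTINELS:
            return _SENTINELS[v]
        if isinstance(v, list):
            return [restore(x) for x in v]
        if isinstance(v, dict):
            return {k: restore(x) for k, x in v.items()}
        return v
    return restore(json.loads(result_json))
```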
Verified on dev box with the still-live pwb pair: cleanup correctly
skipped the live fork, cleared 3 stale unpack scratch dirs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cli): tag path-traversal + better error messages
Two more findings from the bug-bash. First one is a security-class
issue (file write outside data dir via a tag flag), second is
about making the existing error chain visible.
1. Path-traversal via --tag (SECURITY)
----------------------------------
`snapshot_dir(tag)` was `data_dir().join("snapshots").join(tag)`.
Path::join silently keeps the right side when it's absolute, so:
forkd snapshot --tag /etc/forkd-bad → /etc/forkd-bad
forkd snapshot --tag ../../etc/x → ~/.local/share/../../etc/x
≡ /home/<user>/etc/x
Same risk in `forkd unpack` reading the manifest's `tag` field:
a malicious pack on the Snapshot Hub could declare
`tag = "../../etc/whatever"` and write anywhere.
Added `validate_tag()` (1-64 chars, must start with alnum/`_`,
chars allowed: alnum + `. _ -`). Called from every CLI surface
that accepts a tag — snapshot, fork, pack, push, unpack (both
--tag arg AND manifest tag), pull. Manifest validation fires
*after* read so we still show what the malicious pack tried.
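The rule can be sketched in Python (the real validate_tag() is Rust; pathlib happens to exhibit the same absolute-join pitfall as Path::join, which the first assertion demonstrates):

```python
import re
from pathlib import Path

# The pitfall being closed: an absolute right-hand side wins the join.
assert Path("/data/snapshots").joinpath("/etc/forkd-bad") == Path("/etc/forkd-bad")

# 1-64 chars; first char alnum or '_'; the rest alnum or '.', '_', '-'
_TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9._-]{0,63}")

def validate_tag(tag: str) -> None:
    """Reject tags that could escape the snapshots dir.

    Raises with the offending character pointed out, mirroring the
    rejection message described above. Sketch only.
    """
    if _TAG_RE.fullmatch(tag):
        return
    bad = next((c for c in tag if not re.fullmatch(r"[A-Za-z0-9._-]", c)), None)
    detail = f" (offending character: {bad!r})" if bad else ""
    raise ValueError(f"invalid tag {tag!r}{detail}")
```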
Verified: 5/5 traversal attempts now print a clean rejection
with the offending character pointed out. Follow-up tests
confirm legitimate tags (pwb, pyagent, run-python-3-12-slim)
still pass.
2. Error message quality
---------------------
anyhow's Debug impl prints the cause chain, but our contexts
were terse and the chain header was easy to miss when grepping
logs. T2.2 and T2.4 from the bug-bash both surfaced this:
before: Error: tar entry
after: Error: read an entry from /tmp/bogus.tar.zst — pack may
be corrupted, truncated, or not a forkd snapshot pack
Caused by:
Unknown frame descriptor
before: Error: GET https://example.invalid/foo.tar.zst
after: Error: HTTP GET failed for ... (check the URL, DNS,
and whether the server is reachable)
Caused by:
0: Dns Failed: resolve dns name '...': ...
1: failed to lookup address information: ...
Same treatment for: zstd init, manifest parse, integrity-check
failure (now hints at common causes like truncated download),
HTTP non-2xx status (hints at 403 = expired presigned URL,
404 = tag not published, etc.).
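The treatment is the same at each step: name the operation, hint at likely causes, keep the cause chain. A Python stand-in for anyhow's .context() (the real manifest is TOML; JSON is used here only to keep the sketch stdlib-only):

```python
import json

def parse_manifest(text: str) -> dict:
    """Parse a pack manifest with a context-carrying error.

    On failure, wrap the low-level error in a message that names the
    operation and common causes, chaining the original via `from` so
    the "Caused by" detail survives.
    """
    try:
        return json.loads(text)
    except ValueError as err:
        raise RuntimeError(
            "parse manifest from the pack -- pack may be corrupted, "
            "truncated, or not a forkd snapshot pack"
        ) from err
```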
Verified on dev box; live pwb fork still pings clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cli): preflight check on tag; tighten live-process detection
Two more findings from bug-bash, both around concurrent operations
on the same tag.
1. Concurrent forkd fork/snapshot on same tag cascaded into
confusing Firecracker errors
--------------------------------------------------------
Two simultaneous `forkd fork --tag X` runs produce overlapping
API sockets in /tmp/forkd-fork-X/. The second one's failure
surfaced as:
Error: restore_many_with failed
Caused by: firecracker API PUT /snapshot/load returned 400:
"Open tap device failed: Resource busy ...
Invalid TUN/TAP Backend provided by forkd-tap0."
That's three layers deep and doesn't tell the user the real
reason: "another forkd run is already using this tag".
Added `preflight_workdir()` to snapshot_cmd and fork_cmd. If
the work_dir exists AND has a live process holding fds inside
it, refuse up-front with:
Error: another `forkd fork` looks active on tag 'pwb' —
its work_dir at /tmp/forkd-fork-pwb still has a live
Firecracker process holding sockets. Wait for the other
run to finish (or kill it) before re-running. If you're
sure nothing's alive, run `forkd cleanup --yes`.
If the work_dir exists but no process is using it (--keep-workdir
from a previous run, or a crash), preflight cleans it before
proceeding. Logged so the user knows why we're touching it.
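Sketched in Python (illustrative; `has_live_process` stands in for the work_dir liveness predicate, and the message text paraphrases the one above):

```python
import shutil
from pathlib import Path

def preflight_workdir(work_dir: str, tag: str, has_live_process) -> None:
    """Refuse early if another run owns the tag; otherwise reclaim.

    Sketch of the preflight added to snapshot_cmd/fork_cmd (the real
    code is Rust). A live process inside the work_dir means another
    run is active; a dead leftover is cleaned, with a log line so the
    user knows why we touched it.
    """
    wd = Path(work_dir)
    if not wd.exists():
        return
    if has_live_process(work_dir):
        raise RuntimeError(
            f"another `forkd fork` looks active on tag '{tag}' -- its "
            f"work_dir at {work_dir} still has a live Firecracker process "
            "holding sockets. Wait for it (or kill it); if you're sure "
            "nothing's alive, run `forkd cleanup --yes`."
        )
    # Stale leftover (--keep-workdir or a crash): reclaim and say why.
    print(f"preflight: removing stale work_dir {work_dir}")
    shutil.rmtree(wd, ignore_errors=True)
```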
2. workdir_has_live_process() over-flagged any process whose
argv mentions the path
-------------------------------------------------------
The cmdline-substring approach from PR #36 had false positives:
ANY shell command mentioning the work_dir path — including the
shell running `forkd cleanup` itself — got flagged as "live".
Caught while bug-bashing the preflight code path.
Switched to /proc/<pid>/fd/* readlink scan: returns true iff
some process holds an open fd resolving to a path under the
work_dir. Firecracker children redirect stdout to
<work_dir>/child-N.console, so a real live VM is always
detectable. False-positive surface drops to zero.
Comments explain why we don't use lsof (the bug fixed in PR
#36) and why not the /proc cmdline scan (this bug). Errs on the
side of "live" if /proc is unreadable.
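A Python sketch of the fd scan (the real code is Rust; skipping per-pid permission errors here is a simplification of the "errs on live" policy, which applies to /proc as a whole):

```python
import os

def workdir_has_live_process(work_dir: str) -> bool:
    """True iff some process holds an open fd under work_dir.

    Readlinks every /proc/<pid>/fd/* entry; a Firecracker child always
    holds <work_dir>/child-N.console open, so a live VM is detectable,
    while merely mentioning the path on argv no longer counts.
    """
    prefix = os.path.abspath(work_dir) + os.sep
    try:
        pids = [p for p in os.listdir("/proc") if p.isdigit()]
    except OSError:
        return True  # /proc unreadable: assume live
    for pid in pids:
        try:
            fds = os.listdir(f"/proc/{pid}/fd")
        except OSError:
            continue  # exited, or not ours to inspect
        for fd in fds:
            try:
                target = os.readlink(f"/proc/{pid}/fd/{fd}")
            except OSError:
                continue
            if target.startswith(prefix):
                return True
    return False
```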
Verified
--------
- Stale /tmp/forkd-fork-pwb (no live firecracker) → preflight
cleans + new fork succeeds in 56 ms.
- Two concurrent `forkd fork --tag pwb` → first runs to completion,
second gets the clean "another forkd fork looks active" error.
- `forkd cleanup` no longer shows the running shell as a live
process holder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WaylandYang
added a commit
that referenced
this pull request
May 13, 2026
…er recipe (#39) * ci(release): publish Python SDK to PyPI via Trusted Publishers (OIDC) New workflow `.github/workflows/publish-pypi.yml`: Triggered when a GitHub Release is published (i.e. after release.yml has built binaries + created the release on tag push). Builds sdist + wheel from sdk/python/ and uploads to PyPI using PyPI's Trusted Publishers (OIDC) — no API token, no repo secret. Trust relationship configured on PyPI side: PyPI project = forkd Owner = deeplethe Repository = forkd Workflow file = publish-pypi.yml Environment = pypi (GitHub Actions env, also created in repo settings with optional branch protection) workflow_dispatch is enabled so maintainers can re-publish a tag manually (e.g. after a transient PyPI outage). A guard step compares the SDK version in pyproject.toml against the release tag (with the leading v stripped) — refuses to publish if they're out of sync, so `forkd-0.1.2` on PyPI is guaranteed to match `git tag v0.1.2`. Trigger chain: push tag v0.x.y → release.yml: builds binaries, builds Docker image, creates GitHub Release → publish-pypi.yml fires on "release published" → sdist + wheel land on https://pypi.org/p/forkd * docs(readme): add PyPI version badge forkd is now installable via pip install forkd; surface the current PyPI version next to the GitHub release badge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: add Chinese README and replace Firecracker badge with a zh-link - README-zh.md: full Chinese translation of the README, keeping technical terms (Firecracker, KVM, microVM, mmap, CoW, etc.) intact. - README.md: swap the "built on Firecracker" badge for a red 中文 README badge that links to README-zh.md. Firecracker attribution is already in the prose, the badge slot is more valuable as a language-switch link for Chinese readers. - README-zh.md mirrors back with an English README badge. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(bench): rebut nested-virt suspicion + clarify form-factor delta A CubeSandbox maintainer suggested the 20.3s N=100 figure might be due to 3-4 layers of nested virtualisation. It's not — the host is bare-metal i7-12700 (systemd-detect-virt: none), and CubeSandbox ran via the official one-click installer directly on the host, no dev-env VM in between. - bench/CUBESANDBOX.md: lead with a "Host" section that pastes the `systemd-detect-virt` / cpuinfo output so anyone suspecting nested virt can verify the claim themselves. - README.md: tighten the cubesandbox footnote to (a) show the host is bare metal up front, (b) explicitly call out that fork-from-warm (forkd) and cold-start (every other project) are different operating points being shown on one chart for shape comparison, not as equivalent primitives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(roadmap): publish M1+M2 plan and scaffold playwright-browser recipe - ROADMAP.md: M1 (browser recipe + snapshot hub + marketing pulse, ≈4 weeks) and M2 (diff snapshots + time-travel branching, gated on M1.3 user signal). Done criteria + risks per item. - recipes/playwright-browser/: alpha scaffold for the M1.1 deliverable. build.sh layers a tiny Node warm-up script (launches headless Chromium + about:blank tab) onto the official mcr.microsoft.com/ playwright rootfs, so the resulting parent VM has a fully initialised Chromium resident at snapshot time. README.md documents target shape + interim CLI driving path until the Playwright bridge in forkd-agent lands (see follow-up issue). - recipes/README.md: add playwright-browser row + "browser-driving agent" entry in the chooser. The recipe is marked alpha — the warm-up script will run as soon as forkd-agent is taught to exec FORKD_WARMUP_CMD from /etc/forkd-recipe.env before the snapshot. 
Until that lands, users can drive Chromium via `forkd exec` (documented in the recipe README). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(recipes): fix issue refs in playwright-browser to #30 The recipe README referenced #28 for the Playwright bridge in forkd-agent; #28 ended up being the diff-snapshots tracker, with the bridge filed as #30. Point all four references to #30. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(agent): recipe-level eval bridge (closes #30) Teach forkd-agent to delegate the `eval` action to a recipe-supplied warmup subprocess via a tiny line-JSON protocol — enables Node-based recipes (starting with playwright-browser) to expose `sb.eval(<js>)` that runs against the parent VM's pre-launched Chromium + Playwright state, instead of cold-spawning node/Chromium per call. Wire protocol ------------- Recipes drop `/etc/forkd-recipe.env` into the rootfs declaring: FORKD_WARMUP_CMD="node /opt/forkd-warmup.js" FORKD_AGENT_LANG=node On boot, forkd-agent.py: 1. Reads /etc/forkd-recipe.env (env vars override file — handy for dev smoke tests on the host). 2. If FORKD_WARMUP_CMD is set, spawns the warmup with pipes for stdin/stdout (protocol) + stderr (forwarded to agent stdout as "forkd-warmup: ..." for visibility). 3. Reads one JSON line from warmup stdout expecting {"ready": true}; once received, the bridge is open. 4. Routes the `eval` action to the warmup: request {"id": <n>, "code": <str>} reply {"id": <n>, "result": <json>} error {"id": <n>, "error": <str>, "stack": <str>} Serialised by _warmup_lock so concurrent connections on one VM don't interleave on the shared pipes; cross-VM parallelism is unaffected since each child has its own agent+warmup pair. The `eval` action stays single in the SDK surface — the agent dispatches based on FORKD_AGENT_LANG. 
Python-recipe `eval` still returns `{"result": <repr-string>}` (unchanged), Node-recipe `eval` returns `{"result_json": <json-string>}` so the SDK can deserialise into a native Python value cleanly without touching repr()-based paths. `ping` response gains `agent_lang` + `warmup_ready` for SDK-side debug visibility. SDK --- Sandbox.eval() prefers `result_json` when present and json.loads()es it. Legacy `result` (repr) path unchanged for all existing Python recipes — fully backwards compatible. Docstring updated with both semantics + a playwright-browser example. playwright-browser recipe ------------------------- build.sh's warmup.js rewritten from a "park forever" stub into the actual readline-based command loop: launches Chromium + about:blank, sends ready handshake, evaluates incoming code as async functions with (browser, context, page) in scope, top-level await supported. README updated to drop the "interim shell path" workaround — the SDK example now works. Smoke tests ----------- rootfs-init/tests/ (new): - fake-warmup.py: reference protocol implementation in Python - smoke-test.sh: end-to-end on a Linux host (agent + fake warmup + nc-driven TCP requests). Verified on bare i7-12700 dev box — agent_lang=node, warmup_ready=true, eval routed correctly, result_json round-trips. - smoke-sdk.py: exercises Sandbox.eval() stubbed against both result_json and result paths plus the error path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: nudge re-run on PR #31 * docs(roadmap): mark M1.1 progress — recipe scaffold + agent bridge done * feat(cli): Snapshot Hub MVP — pack / unpack / pull / images (closes #29) Adds four CLI commands and a new `hub` module for moving warmed parent snapshots between hosts without re-running each recipe's `build.sh`. 
Pack format v1 ============== `.forkd-snapshot.tar.zst` containing: manifest.toml — tag, sha256 per file, format version, reserved parent_tag for M2.1 diff chains memory.bin — CoW source for child mmap vmstate — Firecracker vCPU + device state snapshot.json — forkd metadata (volumes) rootfs.ext4 — block device for child overlays Manifest's `forkd_pack_version` lets us evolve the format without breaking older clients. `parent_tag` is reserved for the M2.1 diff chain work (currently always None). CLI surface =========== - `forkd pack --tag <local> [--out <file>] [--description <s>] [--base-image <ref>]` - `forkd unpack <file> [--tag <new>] [--force]` - `forkd pull <url-or-short-form> [--tag <new>] [--force] [--hub <base-url>]` - `forkd images` (list local snapshots with sizes) `pull` accepts either a plain HTTPS URL or a `<owner>/<tag>` short form that resolves against `$FORKD_HUB_URL` (default `https://forkd-hub.deeplethe.com`). Integrity guarantees ==================== - pack records sha256 per file in manifest.toml - unpack verifies sha256 against the manifest after extraction; partial extracts are visible for debugging - unpack refuses path-traversal entries (`../escape`, absolute paths) - pack-format version mismatch rejected with a clear error Tests ===== Adds unit tests covering pack ↔ unpack roundtrip with synthetic snapshot bytes, manifest TOML roundtrip, path-traversal rejection, human-byte formatting, and the epoch → RFC3339 stamper (rolled by hand to avoid pulling chrono just for one timestamp). E2E scaffolding =============== rootfs-init/tests/e2e-playwright.sh: end-to-end recipe verification script for the dev box (clone → build → recipe build → snapshot → fork → eval). Currently blocks on passwordless sudo at the recipe install step — separate dev-box config item. 
Dependencies ============ - sha2, tar, zstd, toml, ureq (sync HTTP, rustls-tls by default) - tempfile in dev-deps for pack roundtrip tests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cli): drop unreachable path-traversal unit test; keep the runtime check The `tar` crate's `Builder::append_data()` refuses to write entries with `..` segments, so the malicious archive can't be crafted via the safe API. The unpack-side check in hub.rs remains as defense- in-depth against tars produced by other tooling (raw bytes, the shell `tar(1)`, language-mismatched implementations) — replace the broken test with an inline comment explaining what would be needed to exercise it. Also drop the unused PathBuf import flagged by cargo check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style(cli): apply cargo fmt to hub.rs + main.rs * style(cli): inline 'ROOTFS?' literal to silence clippy::print-literal * feat(cli): add `forkd push` + Snapshot Hub README section `forkd push --tag <local> <url>` packs the snapshot to a temp file and HTTP PUTs the body to the given URL (presigned PUT from R2/S3 is the intended fit). Streams the body via a ProgressReader so a multi-GiB pack doesn't materialise in RAM, and prints throughput + total time on completion. - hub.rs: `upload()` + `ProgressReader` for PUT-with-progress. - main.rs: `Cmd::Push` + `push_cmd`. Cleans up the temp pack whether the upload succeeds or fails. - README.md: adds a "Snapshot Hub" subsection under Quick start showing the pack/push/pull/fork flow; doesn't yet swap the primary quickstart over to `forkd pull` (that flip lands when the hub bucket goes live and recipes are pushed). - ROADMAP.md: mark M1.2 CLI item ✓, bucket provisioning and README-quickstart-flip still pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cli): --mem-size-mib + fix eval result_json print; M1.1 verified End-to-end Playwright recipe is working. 
Pulling all the fixes needed to get there into one commit: CLI changes ----------- - `forkd snapshot --mem-size-mib <u32>`: optional override on the parent VM's memory. The 512 MiB default is fine for Python + numpy but OOMs Chromium (the kernel logs `__vm_enough_memory: ... comm: headless_shell, no enough memory` and `traps: headless_shell[379] trap int3`). Recipe README now requires 2048 for playwright-browser. - `forkd eval`: also prints `result_json` (Node-recipe replies), not just `result` (Python-recipe). Without this the bridge worked but the CLI silently dropped output. Also surfaces JS stack traces from `error.stack` on failure. Recipe build.sh --------------- - npm install -g playwright@1.50.0 inside the rootfs via chroot — the official mcr.microsoft.com/playwright image ships ONLY the browser binaries under /ms-playwright, not the JS module. - /etc/forkd-recipe.env wraps node with `env NODE_PATH=... PLAYWRIGHT_BROWSERS_PATH=/ms-playwright` so require('playwright') resolves and the JS driver finds Chromium. - Cleanup function unmounts proc/sys/dev/pts before the loopback to avoid EBUSY when the build script's chroot leaves bind mounts behind. Verification on dev box (bare-metal i7-12700) --------------------------------------------- - snapshot of warmed Chromium parent: 16 s wall-clock (2 GiB memory.bin) - fork 3 children, per-child netns: 56 ms wall-clock - ping returns `agent_lang: "node"`, `warmup_ready: true` - `forkd eval --child forkd-child-N -- "return await page.title()"`: 10–82 ms per call, output `"Example Domain"` - Also verified pure JS evals (`Math.PI * 2`, `[1,2,3].map(x=>x*x)`) return correctly across multiple children. ROADMAP ------- M1.1 done bullets updated; risk marked resolved (Chromium does survive snapshot-restore cleanly when given enough RAM). Larger-N benchmark remains open. Recipe README and e2e-playwright.sh updated with the working command lines (--mem-size-mib 2048, --memory-limit-mib 2560). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cli): auto-clean work_dir + forkd cleanup + scripts/netns-teardown.sh Addresses the leftover-state pile-up observed after multiple forkd runs (11 orphan /tmp/forkd-{fork,parent}-*/ dirs + 5 stale netns). Splits into three pieces, each safe in isolation. 1. Auto-clean work_dir at end of `forkd snapshot` / `forkd fork` ------------------------------------------------------------ On a successful run the temp work_dir (Firecracker API sockets + console logs) is recursively removed. New `--keep-workdir` opt-out for both commands when post-mortem inspection is desired. On any error path the work_dir is preserved (early-return through ?). cleanup_workdir() asserts the path stays under /tmp/forkd- before recursive rm; refuses anything else with an inline warning. 2. `forkd cleanup` — sweep leaked work_dirs ---------------------------------------- Scans /tmp/{forkd-fork-*,forkd-parent-*}/. For each candidate it does an lsof on any contained `.sock` to detect a live Firecracker (false-positive on "live" is safe; false-negative would nuke a live VM, so be conservative — if lsof is missing or unreadable, mark as live and skip). Dry-run by default; `-y/--yes` actually deletes. Path is re-checked to start with `/tmp/forkd-fork-` or `/tmp/forkd-parent-` immediately before each remove call. 3. scripts/netns-teardown.sh — reverse netns-setup.sh -------------------------------------------------- New script, dry-run by default, removes the per-child network namespaces created by netns-setup.sh. Multiple safety nets: - Strict regex match on `^forkd-child-[0-9]+$` for netns names. - Belt-and-suspenders `case "$ns" in forkd-child-*) ;; *) refuse` immediately before each `ip netns delete`. - Bridge/tap are NEVER deleted by default — gated on --include-bridge / --include-tap with the same name-match check. 
- Deleting a netns auto-destroys the paired veth, so the script doesn't enumerate veths directly (reduces blast radius). docker0, br-<hex> (docker), and any other user-owned interface or netns is untouchable by name. Verified on a host with 22 docker networks present. Verified on dev box ------------------- - `forkd cleanup` (dry): listed 11 dirs, 0 marked live, none deleted. - `forkd cleanup --yes`: removed all 11, recovered ~1.4 MiB. - `scripts/netns-teardown.sh` (dry): listed 5 netns, exited 0. - `scripts/netns-teardown.sh --yes`: deleted forkd-child-1..5, all forkd-v-Nh veths went with them; docker bridges + forkd-br0 + forkd-tap0 untouched (verified post-hoc). - Fresh `forkd fork --tag pwb -n 2 --per-child-netns`: ran in 57 ms, on exit logged `cleaned work_dir /tmp/forkd-fork-pwb`, ls confirms no leftover dir. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cli): three bug-bash findings — live-socket detection, temp leaks, NaN/Inf Bug-bash exposed three real issues in last week's cleanup + Snapshot Hub work. All three fixed with retests on the dev box. 1. forkd cleanup nuked live VMs (CRITICAL) ---------------------------------------- `workdir_has_live_process()` used `lsof <socket-path>` to detect in-use work_dirs. On Ubuntu 24.04 / lsof 4.95, `lsof` against a Firecracker UNIX domain socket emits warnings to stderr and zero rows on stdout, even while a process is actively holding it. Our code redirected stderr to /dev/null and trusted empty stdout to mean "no one is using this" — would have nuked a live VM's socket directory under `forkd cleanup --yes`. Replaced with a `/proc/*/cmdline` scan: Firecracker children pass `--api-sock /tmp/forkd-fork-<tag>/child-N.sock` on argv, so the work_dir path appears verbatim in cmdline while the VM is alive. Errs on the side of "live" if /proc is unreadable. Reverified: the previously-misclassified `forkd-fork-pwb` (with two live firecracker children) now correctly shows `SKIP ... 
(live socket — a forkd run looks active)`. 2. forkd unpack leaked /tmp/forkd-unpack-<pid>/ on failure ------------------------------------------------------ On any error after `create_dir_all(&tmp)` (corrupted tar.zst, truncated archive, sha256 mismatch, dest-exists-no-force), the temp extraction dir was never removed. Two such dirs were on the dev box already, 16 MiB combined. Refactored into `unpack_into()` so the cleanup path is a single `if result.is_err() { rm tmp }` wrapper. Same pattern applied to `pull_cmd`, which had the parallel leak for the downloaded `.tar.zst`. 3. forkd cleanup didn't sweep forkd-unpack-* / forkd-pull-* ------------------------------------------------------- `cleanup` only looked at `forkd-fork-*` and `forkd-parent-*`. Added the other two prefixes to a shared PREFIXES table; the "starts with /tmp/" + "name matches a known prefix" safety check used right before each remove uses the same table. Bonus: warmup.js Infinity/NaN sentinels The Playwright bridge's JSON.stringify turned `Infinity` / `NaN` / `-Infinity` into `null` silently — so `sb.eval("return 1/0")` came back as `null`, indistinguishable from a legitimately null return. Replacer now emits `"__js_Infinity__"` / `"__js_-Infinity__"` / `"__js_NaN__"` sentinels. Takes effect on next rootfs rebuild of the recipe (build.sh emits the new warmup.js). Verified on dev box with the still-live pwb pair: cleanup correctly skipped the live fork, cleared 3 stale unpack scratch dirs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cli): tag path-traversal + better error messages Two more findings from the bug-bash. First one is a security-class issue (file write outside data dir via a tag flag), second is about making the existing error chain visible. 1. Path-traversal via --tag (SECURITY) ---------------------------------- `snapshot_dir(tag)` was `data_dir().join("snapshots").join(tag)`. 
Path::join silently keeps the right side when it's absolute, so: forkd snapshot --tag /etc/forkd-bad → /etc/forkd-bad forkd snapshot --tag ../../etc/x → ~/.local/share/../../etc/x ≡ /home/<user>/../etc/x Same risk in `forkd unpack` reading the manifest's `tag` field: a malicious pack on the Snapshot Hub could declare `tag = "../../etc/whatever"` and write anywhere. Added `validate_tag()` (1-64 chars, must start with alnum/`_`, chars allowed: alnum + `. _ -`). Called from every CLI surface that accepts a tag — snapshot, fork, pack, push, unpack (both --tag arg AND manifest tag), pull. Manifest validation fires *after* read so we still show what the malicious pack tried. Verified: 5/5 traversal attempts now print a clean rejection with the offending character pointed out. Trailing tests confirm legitimate tags (pwb, pyagent, run-python-3-12-slim) still pass. 2. Error message quality --------------------- anyhow's Debug impl prints the cause chain, but our contexts were terse and the chain header was easy to miss when grepping logs. T2.2 and T2.4 from the bug-bash both surfaced this: before: Error: tar entry after: Error: read an entry from /tmp/bogus.tar.zst — pack may be corrupted, truncated, or not a forkd snapshot pack Caused by: Unknown frame descriptor before: Error: GET https://example.invalid/foo.tar.zst after: Error: HTTP GET failed for ... (check the URL, DNS, and whether the server is reachable) Caused by: 0: Dns Failed: resolve dns name '...': ... 1: failed to lookup address information: ... Same treatment for: zstd init, manifest parse, integrity-check failure (now hints at common causes like truncated download), HTTP non-2xx status (hints at 403 = expired presigned URL, 404 = tag not published, etc.). Verified on dev box; live pwb fork still pings clean. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cli): preflight check on tag; tighten live-process detection

Two more findings from the bug-bash, both around concurrent operations on the same tag.

1. Concurrent forkd fork/snapshot on the same tag cascaded into confusing Firecracker errors
--------------------------------------------------------------------------------------------
Two simultaneous `forkd fork --tag X` runs produce overlapping API sockets in /tmp/forkd-fork-X/. The second one's failure surfaced as:

    Error: restore_many_with failed
    Caused by: firecracker API PUT /snapshot/load returned 400:
        "Open tap device failed: Resource busy ... Invalid TUN/TAP Backend provided by forkd-tap0."

That's three layers deep and doesn't tell the user the real reason: "another forkd run is already using this tag".

Added `preflight_workdir()` to snapshot_cmd and fork_cmd. If the work_dir exists AND has a live process holding fds inside it, refuse up-front with:

    Error: another `forkd fork` looks active on tag 'pwb' — its work_dir at /tmp/forkd-fork-pwb
    still has a live Firecracker process holding sockets. Wait for the other run to finish
    (or kill it) before re-running. If you're sure nothing's alive, run `forkd cleanup --yes`.

If the work_dir exists but no process is using it (--keep-workdir from a previous run, or a crash), preflight cleans it before proceeding, and logs that it did so the user knows why we're touching it.

2. workdir_has_live_process() over-flagged any process whose argv mentions the path
-----------------------------------------------------------------------------------
The cmdline-substring approach from PR #36 had false positives: ANY shell command mentioning the work_dir path — including the shell running `forkd cleanup` itself — got flagged as "live". Caught while bug-bashing the preflight code path.

Switched to a /proc/&lt;pid&gt;/fd/* readlink scan: returns true iff some process holds an open fd resolving to a path under the work_dir.
Firecracker children redirect stdout to &lt;work_dir&gt;/child-N.console, so a real live VM is always detectable, and the false-positive surface drops to zero. Comments explain why we don't use lsof (the bug fixed in PR #36) and why not /proc cmdline (this bug). Errs on the side of "live" if /proc is unreadable.

Verified
--------
- Stale /tmp/forkd-fork-pwb (no live firecracker) → preflight cleans it and a new fork succeeds in 56 ms.
- Two concurrent `forkd fork --tag pwb` → the first runs to completion; the second gets the clean "another forkd fork looks active" error.
- `forkd cleanup` no longer shows the running shell as a live process holder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: 0.1.3 — security fix (path traversal) + Snapshot Hub + browser recipe

Highlights
----------
SECURITY: path traversal via `--tag` (CVE-class)

All forkd CLI commands that took a `--tag` flag derived their destination from `Path::join(tag)`, which silently keeps the right side when it's absolute and doesn't reject `..` segments. A tag like `/etc/forkd-bad` or `../../etc/x` could write Firecracker snapshot files outside the data directory. The same risk applied to the `tag` field of `manifest.toml` inside a Snapshot Hub pack — a malicious or compromised pack could write its files anywhere the running user can write. Affects 0.1.0–0.1.2; fixed by validating tags against `[A-Za-z0-9_][A-Za-z0-9._-]{0,63}` at every CLI surface and again on the manifest tag. Full advisory in docs/SECURITY.md.

forkd cleanup --yes used to be capable of tearing down live VMs

Detection used `lsof` against the Firecracker UNIX-domain API socket, which on recent Ubuntu returns warnings on stderr and zero rows on stdout. We trusted empty stdout to mean "no one is using this." Replaced with a `/proc/&lt;pid&gt;/fd/*` readlink scan that catches Firecracker's console-redirect fd.

Browser recipe

Lands recipes/playwright-browser/ plus a recipe-level eval bridge in forkd-agent.py.
Per-call `sb.eval("await page.title()")` returns in ~10-80 ms against a forked Chromium child vs ~2 s for cold-spawning a fresh browser. Requires `--mem-size-mib 2048` on the parent VM (Chromium OOMs at the 512 MiB default; the flag is also new in this release).

Snapshot Hub MVP

`forkd pack` / `forkd unpack` / `forkd pull` / `forkd push` / `forkd images`. Tag-resolved short form `&lt;owner&gt;/&lt;tag&gt;` over HTTPS, manifest-per-pack with a format version + per-file sha256. 23× compression is typical on memory.bin. Bucket provisioning + initial recipe uploads land separately.

Operational hygiene

- `forkd cleanup` sweep for `/tmp/forkd-{fork,parent,unpack,pull}-*` work_dirs, with the `/proc` fd live-check.
- `scripts/netns-teardown.sh` reverses netns-setup.sh, with multiple safety nets so docker / system interfaces are never reached.
- work_dirs are auto-cleaned on successful `forkd snapshot` / `forkd fork`, and preserved on failure for debugging.
- A pre-flight check refuses to start a fork/snapshot when another forkd run is active on the same tag, avoiding the cascade of confusing Firecracker "Resource busy" errors.

Files touched
-------------
- Cargo.toml: workspace.package.version 0.1.2 → 0.1.3
- sdk/python/pyproject.toml: version 0.1.2 → 0.1.3
- sdk/python/forkd/__init__.py: __version__ → 0.1.3
- docs/SECURITY.md: "Past advisories" section with the path-traversal advisory.
- CHANGELOG.md: new file; this release plus 0.1.0 / 0.1.1 / 0.1.2 back-fills.

Tagging this commit's eventual squash-merge as v0.1.3 will fire release.yml (binaries + GitHub Release) and publish-pypi.yml (sdist + wheel → PyPI via Trusted Publishers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug-bash session against last week's cleanup + Snapshot Hub work. Found three real issues; this PR fixes all three with retests on the dev box.
1. `forkd cleanup` would nuke live VMs (CRITICAL safety)
`workdir_has_live_process()` used `lsof` to detect in-use work_dirs. On Ubuntu 24.04 / lsof 4.95, `lsof` against a Firecracker UNIX domain socket emits warnings to stderr and zero rows on stdout, even while a process actively holds it. Our code redirected stderr to /dev/null and trusted empty stdout to mean "no one is using this" — so `forkd cleanup --yes` would have nuked a live VM's socket directory.
Fix: replaced with a `/proc/*/cmdline` scan. Firecracker children pass `--api-sock /tmp/forkd-fork-<tag>/child-N.sock` on argv, so the work_dir path appears verbatim in cmdline while the VM is alive. Errs on the side of "live" if /proc is unreadable.
Reverified: the previously-misclassified `forkd-fork-pwb` (with two live firecracker children) now correctly shows `SKIP ... (live socket — a forkd run looks active)`.
2. `forkd unpack` leaked `/tmp/forkd-unpack-<pid>/` on failure
On any error after `create_dir_all(&tmp)` (corrupted tar.zst, truncated archive, sha256 mismatch, dest-exists-no-force), the temp extraction dir was never removed. Two such dirs were already sitting on the dev box, 16 MiB combined.
Fix: refactored into `unpack_into()` so the cleanup is a single `if result.is_err() { rm tmp }` wrapper. Same pattern applied to `pull_cmd`, which had the parallel leak for the downloaded `.tar.zst`.
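The shape of that refactor, sketched in Python (the real code is Rust; the `unpack_into` callback here is illustrative — all fallible work lives inside it, so the wrapper owns exactly one cleanup decision and no early exit can leak the scratch dir):

```python
import os
import shutil
import tempfile

def unpack(archive_path: str, unpack_into) -> str:
    """Extract archive_path via unpack_into(archive, tmp); return the scratch dir.

    On any failure inside unpack_into (corrupted archive, sha256 mismatch,
    dest-exists-no-force, ...), the scratch dir is removed before re-raising
    — the Python analogue of `if result.is_err() { rm tmp }`.
    """
    tmp = tempfile.mkdtemp(prefix="forkd-unpack-")
    try:
        unpack_into(archive_path, tmp)
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)  # single cleanup point for every error path
        raise
    return tmp
```

Any error path added to `unpack_into` later automatically inherits the cleanup, which is the point of the refactor.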
3. `forkd cleanup` didn't sweep `forkd-unpack-*` / `forkd-pull-*`
Only looked at `forkd-fork-*` and `forkd-parent-*`. Added the other two prefixes to a shared `PREFIXES` table; the "starts with /tmp/" + "name matches a known prefix" safety check, run right before each remove, uses the same table.
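A Python sketch of that shared-table guard (the real code is Rust; `is_safe_to_remove` is a hypothetical name for the pre-remove safety check described above):

```python
import os

# One table drives both the candidate sweep and the pre-remove guard,
# so a prefix added in one place can't be forgotten in the other.
PREFIXES = ("forkd-fork-", "forkd-parent-", "forkd-unpack-", "forkd-pull-")

def is_safe_to_remove(path: str) -> bool:
    """Refuse anything that isn't a direct child of /tmp with a known work_dir prefix."""
    norm = os.path.normpath(path)
    if os.path.dirname(norm) != "/tmp":
        return False  # not directly under /tmp (also defuses embedded '..' segments)
    name = os.path.basename(norm)
    return any(name.startswith(p) for p in PREFIXES)
```

Normalising before checking means a path like `/tmp/forkd-fork-x/../../etc` collapses first and then fails the `/tmp` parent test, rather than sneaking through as a prefix match.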
Bonus: warmup.js `Infinity` / `NaN` sentinels
`JSON.stringify(Infinity)` is `null` per JSON spec. Same for `NaN` / `-Infinity`. So `sb.eval("return 1/0")` came back as plain `null`, indistinguishable from a legitimately null return. Replacer now emits sentinel strings:
```
Infinity  → "__js_Infinity__"
-Infinity → "__js_-Infinity__"
NaN       → "__js_NaN__"
```
Takes effect on next rootfs rebuild of the recipe.
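On the SDK side, a caller could map those sentinels back to Python floats. This decoder is a sketch — the commit only specifies the JS-side encoding, so the helper name and the idea of decoding in the Python SDK are assumptions:

```python
import math

# Reverse map for the warmup.js sentinels (hypothetical SDK-side helper).
_SENTINELS = {
    "__js_Infinity__": math.inf,
    "__js_-Infinity__": -math.inf,
    "__js_NaN__": math.nan,
}

def decode(value):
    """Recursively restore Infinity/NaN sentinels in a decoded JSON value."""
    if isinstance(value, str):
        return _SENTINELS.get(value, value)
    if isinstance(value, list):
        return [decode(v) for v in value]
    if isinstance(value, dict):
        return {k: decode(v) for k, v in value.items()}
    return value
```

With this in place, `sb.eval("return 1/0")` round-trips to `float('inf')` instead of an ambiguous `None`, while a genuinely null JS return still decodes to `None`.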
Verified on dev box (still-live pwb fork pair)
```
$ forkd cleanup
5 candidate work_dir(s):
SKIP /tmp/forkd-fork-pwb 4 KiB (live socket — a forkd run looks active)
DEL /tmp/forkd-fork-pyagent 3 KiB
DEL /tmp/forkd-unpack-2991226 0 KiB
DEL /tmp/forkd-unpack-2991309 16.0 MiB
DEL /tmp/forkd-unpack-test 0 KiB
dry run — pass `--yes` to delete the 4 dir(s) marked DEL above.
$ echo bogus > /tmp/bogus.tar.zst && forkd unpack /tmp/bogus.tar.zst
Error: tar entry
$ ls -d /tmp/forkd-unpack-* # no NEW dir added — old leaks plus pre-existing
$ forkd cleanup --yes
removed /tmp/forkd-unpack-2991226 / 2991309 / test
live fork untouched throughout
$ sudo -E forkd ping --child forkd-child-1
{ "pong": true, "agent_lang": "node", "warmup_ready": true }
```
Test plan
🤖 Generated with Claude Code