bench harness — 44 scenarios across 5 axes#592

Draft
G4614 wants to merge 13 commits into boxlite-ai:main from G4614:feat/bench

@G4614 G4614 commented May 26, 2026

Status: Draft. Mechanically clean; opening for review before un-drafting. Squash already done.

Summary

Adds a runtime performance harness under boxlite bench — 44 scenarios across 5 axes (latency, resource, density, throughput, stability), versioned JSON reports, direction-aware compare against a baseline, p99 regression gate.

Stack: 6 commits on top of main, +8558 LOC. See src/cli/src/commands/bench/README.md for the full scenario catalog, JSON schema, and compare semantics.

Commit layout

  1. feat(cli): bench harness foundation — runner / Scenario trait / JSON schema / compare / host_info / stats / scenarios skeleton.
  2. test(bench): latency axis — 10 scenarios — also ships scenarios/common.rs (BoxGuard, alpine_options, build_runtime) since it's the first axis to need them. throughput-export rides along because it shares lifecycle.rs's source-box prep helper with latency-clone.
  3. test(bench): resource axis — 6 scenarios
  4. test(bench): density axis — parallel-10
  5. test(bench): throughput axis — 19 scenarios
  6. test(bench): stability axis — 7 scenarios

Every commit individually: compiles, passes the registry / build_by_name lock-step tests, clippy -D warnings clean. mod.rs grows progressively so no commit has dead arms or dead module decls.
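
The registry / build_by_name lock-step invariant mentioned above could be pinned roughly as follows. This is a minimal sketch, not the PR's code: the `Scenario` trait shape, the `REGISTRY` constant, `check_lock_step`, and the two scenario structs are hypothetical stand-ins chosen for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the harness's scenario trait.
trait Scenario {
    fn name(&self) -> &'static str;
}

struct ColdStart;
struct WarmStart;

impl Scenario for ColdStart {
    fn name(&self) -> &'static str { "latency-cold-start" }
}
impl Scenario for WarmStart {
    fn name(&self) -> &'static str { "latency-warm-start" }
}

/// Names a `bench list` style command would print (illustrative subset).
const REGISTRY: &[&str] = &["latency-cold-start", "latency-warm-start"];

/// Constructor dispatch; it needs exactly one arm per registry entry.
fn build_by_name(name: &str) -> Option<Box<dyn Scenario>> {
    match name {
        "latency-cold-start" => Some(Box::new(ColdStart)),
        "latency-warm-start" => Some(Box::new(WarmStart)),
        _ => None,
    }
}

/// Reports the first registry entry that is duplicated, unbuildable, or mislabelled.
fn check_lock_step() -> Result<(), String> {
    let mut seen = HashSet::new();
    for &name in REGISTRY {
        if !seen.insert(name) {
            return Err(format!("duplicate scenario name: {name}"));
        }
        let scenario = build_by_name(name)
            .ok_or_else(|| format!("registered but not buildable: {name}"))?;
        if scenario.name() != name {
            return Err(format!("{name} builds a scenario named {}", scenario.name()));
        }
    }
    Ok(())
}
```

A unit test that asserts `check_lock_step().is_ok()` fails as soon as a registry entry is added without a matching `build_by_name` arm, which is what keeps each commit free of dead arms.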

Scenario coverage

  • Latency (10): cold/warm start (incl. jailed), clone (single + batch-10), snapshot, inspect-list, get-or-create dedup, image-pull (cached), runtime-shutdown.
  • Resource (6): idle / cpu-load / mem-pressure / density-10-idle / multi-vcpu-load / runtime-metrics-poll.
  • Density (1): parallel-10 burst spawn.
  • Throughput (20, counting export, which ships in the latency commit): image pull, disk (dd seq + fio random, R+W), virtiofs, net (tcp-sink, iperf3 host→guest + egress + parallel, tcp-cps, dns-latency, udp SKIPs), serve-rps, lifecycle (export, import, copy-into, copy-out), setup-cost (many-ports, volumes-multi).
  • Stability (7): churn (50 cycles), soak (idle + load), exec-loop (500), exec-parallel (20), restart-loop (20), snapshot-loop (20).

What's in this branch

  • Foundation: BenchArgs / Scenario trait / RunContext, versioned JSON schema (schema_version = "1.0"), nearest-rank percentiles (ISO 16269-4) + Bessel-corrected stdev — self-contained, zero crate-external deps (no criterion / statistical).
  • Direction-aware compare: higher_is_better flag on each MetricAggregate, inferred from _per_sec / _rps / _iops suffixes; comparator flips the sign so "positive regression_ratio = bad" holds for both latency and throughput.
  • compare double-guards: schema-version + scenario-name mismatch errors before producing a misleading delta.
  • Registry / build_by_name lock-step pinned by unit test; duplicate-name guard too.
  • E2E SKIP convention for host-prereq scenarios (iperf3 binary, AppArmor userns, KRUN_VIRTIO_FS_MAX, gvproxy TCP+UDP same-port bug) — scenario returns Ok with a marker metric instead of failing the runner.
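
For readers unfamiliar with the two statistics named in the Foundation bullet, here is a minimal std-only sketch of a nearest-rank percentile and a Bessel-corrected sample standard deviation. The function names and the `Option` return convention are assumptions for illustration; the harness's own stats module may be shaped differently.

```rust
/// Nearest-rank percentile: the value at 1-based rank ceil(p/100 * n)
/// of the sorted sample. None for an empty sample or p outside (0, 100].
fn percentile_nearest_rank(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Sample standard deviation with Bessel's correction (divide by n - 1).
/// None when fewer than two samples are available.
fn sample_stdev(samples: &[f64]) -> Option<f64> {
    let n = samples.len();
    if n < 2 {
        return None;
    }
    let mean = samples.iter().sum::<f64>() / n as f64;
    let sum_sq: f64 = samples.iter().map(|x| (x - mean).powi(2)).sum();
    Some((sum_sq / (n as f64 - 1.0)).sqrt())
}
```

Nearest-rank always returns an observed sample rather than an interpolated value, so with few runs the p99 is simply the maximum.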

Real boxlite quirks the bench surfaced (documented in scenario file headers + commit messages)

  • LiteBox::copy_out from /tmp (tmpfs) returns NotFound — needs a rootfs path like /root. Fixed in scenarios/copy_io.rs.
  • libkrun KRUN_VIRTIO_FS_MAX = 2; ≥ 3 -v mounts give status=-22 at start. scenarios/volumes_multi.rs capped at N=2.
  • After SnapshotHandle::create (or restore), the current disk depends on the snap → immediate remove fails with "current disk depends on this snapshot". latency-snapshot drops the trailing remove; stability-snapshot-loop drops restore and just measures create churn.
  • LiteBox handle invalidates on stop; the next start panics with "Handle invalidated after stop()". stability-restart-loop re-fetches via rt.get(id) between cycles.
  • gvproxy doesn't support TCP+UDP forward on same host_port → iperf3 -u control channel can't connect. throughput-net-udp SKIPs by default; BOXLITE_BENCH_UDP_FORCE=1 to attempt.
  • Guest exec subsystem regresses around exec #247 (InitReady / IntermediateReady(0) mismatch). stability-exec-loop is tolerant of partial completion and reports exec_completed_count so the boundary surfaces as a regression signal instead of an aborted run.
  • Ubuntu 24+ AppArmor restricts unprivileged userns by default; jailed cold-start preflights bwrap --unshare-user and SKIPs when denied.

Verification

  • cargo clippy -p boxlite-cli --tests --no-deps -- -D warnings — clean on Linux + macOS (test-mod cfg-gated for /proc-only assertions).
  • cargo test -p boxlite-cli --bin boxlite commands::bench:: — 28/28 (registry lock-step + uniqueness + stats + parse helpers).
  • 25+ scenarios E2E-run on this host (numbers in axis-commit messages).

Why still draft

  • Net-UDP scenario carries a BOXLITE_BENCH_UDP_FORCE=1 escape hatch but SKIPs by default until gvproxy gains TCP+UDP same-port support.
  • No CI integration (baseline comparison wired into a workflow) yet.
  • Want a review pass on the JSON schema before declaring it 1.0 stable for downstream tooling.

Test plan

  • make test:unit:rust clean
  • boxlite bench list shows all 44
  • Spot-run 3-4 scenarios per axis, confirm reports parse + compare gates trip on synthetic regressions
  • Sign off on the JSON schema shape

🤖 Generated with Claude Code

G4614 pushed a commit to G4614/boxlite that referenced this pull request May 26, 2026
macOS clippy on PR boxlite-ai#592 caught `unused import: super::*` in the
tests submodule: the only test was already `#[cfg(target_os =
"linux")]`-gated, so on macOS the module compiled empty and the
`use super::*` had nothing to import → `-D warnings` fails.

Move the cfg to the mod declaration so the whole tests block is
elided off-Linux. No behavior change on Linux.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gamnaansong and others added 6 commits May 26, 2026 03:45
Adds the `boxlite bench` subcommand scaffold — runner / scenario
trait / JSON schema / compare / host metadata. Scenario
implementations land in the per-axis follow-up commits.

Components:

- `BenchArgs` + dispatch (`bench list` / `bench run` / `bench compare`)
- `Scenario` trait (`async fn run_once` via `async-trait`), runner
  loop with warmup-drop, `RunContext`
- Versioned JSON report (`schema_version = "1.0"`): metadata,
  per-iteration samples, aggregates (min/p50/p90/p99/max/mean/stdev)
  with `unit` hint and `higher_is_better` flag
- Self-contained stats: nearest-rank percentile (ISO 16269-4),
  Bessel-corrected sample stdev — zero crate-external deps (no
  `criterion`/`statistical`)
- `HostInfo` snapshot from `/proc` — kernel / arch / CPU model+count
  / memory total; tests gated by `cfg(target_os = "linux")` so
  macOS clippy stays clean
- `compare BASELINE CURRENT` with `--threshold`/`--on` knobs:
  schema-version + scenario-name double guard; direction-aware
  (`higher_is_better` metrics get their delta sign flipped into
  `regression_ratio` so the "positive=bad" invariant holds for
  both latency and throughput metrics)
- Registry / `build_by_name` lock-step invariant pinned by unit
  test, plus a uniqueness check on scenario names

Plus the CLI plumbing (`src/cli/src/main.rs` dispatch, README
pointer at `src/cli/src/commands/bench/README.md`) and the
`async-trait = "0.1"` dep.

Subsequent commits will populate the registry per-axis:
latency / resource / density / throughput / stability.
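
The direction-aware comparison in the component list above can be sketched as below. The suffix list comes from the PR description; the function names, the zero-baseline handling, and the strict `>` threshold test are assumptions for illustration, not the harness's actual signatures.

```rust
/// Infer metric direction from its name, using the suffixes the description lists.
fn higher_is_better(metric: &str) -> bool {
    ["_per_sec", "_rps", "_iops"].iter().any(|s| metric.ends_with(s))
}

/// Relative change normalized so that a positive value always means "worse".
/// None when the baseline is zero, because the ratio is undefined.
fn regression_ratio(metric: &str, baseline: f64, current: f64) -> Option<f64> {
    if baseline == 0.0 {
        return None;
    }
    let delta = (current - baseline) / baseline;
    // Throughput-style metrics get their sign flipped: a drop is a regression.
    Some(if higher_is_better(metric) { -delta } else { delta })
}

/// Gate: trips when the normalized ratio exceeds the threshold (assumed strict).
fn is_regression(metric: &str, baseline: f64, current: f64, threshold: f64) -> bool {
    regression_ratio(metric, baseline, current).map_or(false, |r| r > threshold)
}
```

With this normalization a latency that grows from 1000 ms to 1250 ms and a bandwidth that drops from 200 MB/s to 150 MB/s both yield a ratio of 0.25, so one threshold covers both kinds of metric.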

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-operation timing across the box lifecycle:

- latency-cold-start / latency-warm-start — fresh vs shared `--home`
- latency-cold-start-jailed — `SecurityOptions::maximum()`; isolation
  tax delta. SKIPs when `is_full_isolation_available()` is false OR
  when a `bwrap --unshare-user` preflight fails (Ubuntu 24+ AppArmor
  default).
- latency-clone / latency-clone-batch-10 — single-call vs batch
  `clone_boxes(N=10)` per-clone amortized cost
- latency-snapshot — `SnapshotHandle` create+restore round-trip
  (remove omitted; after restore the current disk depends on the
  snapshot, so remove would fail with the qcow2 dep invariant)
- latency-inspect-list — `list_info` + `get_info` at N=20 boxes;
  SQLite index scaling signal
- latency-get-or-create-dedup — 100 `get_or_create(name)` hits on a
  pre-materialized box; µs floor for name→box-id lookup
- latency-image-pull-cached — shared `--home` warm-cache pull
  (counterpart to throughput-image-pull's cold-cache headline)
- latency-runtime-shutdown — `rt.shutdown` with N=3 running boxes;
  graceful-stop SLA floor

`throughput-export` rides along because it shares the source-box
prep helper with `latency-clone` in `lifecycle.rs`; rest of the
throughput axis lands separately.

Shared scenario plumbing in `scenarios/common.rs` (BoxGuard RAII,
`alpine_options`, `build_runtime`) ships with this axis since it's
the first one to use it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-box resource footprint, idle and under load:

- resource-idle — RSS / COW / CPU% after 3 s settle
- resource-cpu-load — one vCPU pegged by stress-ng --cpu 1 for 10 s;
  catches libkrun-shim RSS growth under work
- resource-mem-pressure — box capped at 256 MiB, stress-ng
  --vm-bytes 150m. Documented signal is the exit-code transition
  from `1` (clean baseline on alpine's stress-ng build) to `137`
  (SIGKILL by OOM-killer); reports `mem_pressure_limit_bytes` and
  `mem_pressure_alloc_bytes` for context
- resource-density-10-idle — 10 idle boxes coexisting; sums RSS +
  COW + host fd. Steady-state coexistence cost (vs density-parallel-
  10's concurrent-spawn signal)
- resource-multi-vcpu-load — 4 vCPUs all saturated; tests libkrun
  vCPU thread mapping + multi-core KVM exit path
- resource-runtime-metrics-poll — `rt.metrics()` cost at N=10 running
  boxes, 500 samples → µs mean/p50/p99/max. Floor for Prometheus
  scrape overhead

Shared `--home` across iterations for steady-state numbers;
`stress-ng` install via apk amortized across the scenario instance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`density-parallel-10`: concurrent spawn of 10 alpine boxes through
one runtime. Headline is the per-box max/mean latency under
contention — the init-pipeline surcharge over a single warm start
that catches lock-contention regressions invisible to a serial
scenario.

Distinct from `resource-density-10-idle` (which measures
steady-state coexistence cost once all 10 are up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bytes-per-second / requests-per-second across every external code
path.

**Image**
- throughput-image-pull — cold-cache pull MB/s

**Disk (in-box)**
- throughput-disk-write / -disk-read — sequential dd 64 MiB,
  qcow2-COW-over-virtio bandwidth
- throughput-disk-fio / -disk-fio-read — fio 4K random + fsync,
  IOPS + clat p50/p99/p999 tail latency. One-time apk add amortized
- throughput-virtiofs — dd write to a `-v` mount; distinct codepath
  from the qcow2 disk-write (host volumes bypass the COW overlay)

**Net (gvproxy)**
- throughput-net-tcp-sink — host→guest via `nc` sink. Catches
  gvproxy regressions without needing iperf3 on host
- throughput-net-iperf3 / -iperf3-egress — host→guest and
  guest→host iperf3 TCP, bps + retransmits
- throughput-net-iperf3-parallel — iperf3 -P 4 multi-stream,
  fairness signal via per-stream stdev
- throughput-net-udp — iperf3 -u 1 Gbps; SKIPs by default (gvproxy
  doesn't support TCP+UDP forward on same host_port; iperf3 -u needs
  both for control + data channels). `BOXLITE_BENCH_UDP_FORCE=1`
  escape hatch
- throughput-tcp-cps — TCP connection-establish rate via tight
  TcpStream::connect loop against in-box nc respawn
- throughput-dns-latency — `getent ahosts` lookups for internal +
  external targets; tests gvproxy embedded DNS + recursive forward

**REST**
- throughput-serve-rps — spawn `boxlite serve` child, hammer
  /v1/config with 16 reqwest workers

**Lifecycle**
- throughput-copy-into / -copy-out — 64 MiB tar-stream via
  `LiteBox::copy_into/out`. Payload staged on `/root` not `/tmp`
  because tmpfs isn't visible to the guest agent's file interface
- throughput-import — `rt.import_box` from a pre-exported archive;
  counterpart to `throughput-export` (which shipped with the latency
  axis to share lifecycle.rs's source-box helper)

**Setup-cost**
- throughput-many-ports-setup — 16 `-p` forwards; gvproxy
  port-table fan-out delta vs zero-ports warm-start
- throughput-volumes-multi-setup — 2 `-v` mounts (libkrun
  `KRUN_VIRTIO_FS_MAX = 2` cap; N=3 fails with status=-22)

Host-side prereqs (iperf3 on PATH) checked at scenario start; SKIPs
cleanly with marker metric instead of aborting the runner.
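
The SKIP convention described above (return `Ok` with a marker metric rather than an error) might look roughly like this. It is a sketch under stated assumptions: the marker name `skipped`, the metric name `bps`, and the `run_once` signature taking a `measure` closure are all hypothetical.

```rust
use std::collections::BTreeMap;
use std::env;
use std::path::{Path, PathBuf};

/// Look for a file named `bin` in the directories of a PATH-style string.
fn find_on_path(bin: &str, path_var: &str) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(bin))
        .find(|candidate| candidate.is_file())
}

/// Hypothetical scenario body: a missing host prerequisite yields a marker
/// metric and Ok, so the runner keeps going instead of aborting the sweep.
fn run_once<F>(path_var: &str, measure: F) -> Result<BTreeMap<String, f64>, String>
where
    F: FnOnce(&Path) -> Result<f64, String>,
{
    let mut metrics = BTreeMap::new();
    match find_on_path("iperf3", path_var) {
        None => {
            // Marker metric name is an assumption made for this sketch.
            metrics.insert("skipped".to_string(), 1.0);
        }
        Some(bin) => {
            metrics.insert("bps".to_string(), measure(bin.as_path())?);
        }
    }
    Ok(metrics)
}
```

Passing the PATH string in explicitly, instead of reading the environment inside the function, keeps the prerequisite check easy to exercise in a unit test.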

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Long-running / repeated-op leak detection:

- stability-churn — 50 create+start+stop cycles, per-cycle latency
  + host fd delta. Catches per-cycle leaks (fd / tempfile / DB row
  accounting)
- stability-soak — keep one box alive for BOXLITE_BENCH_SOAK_SECS
  (default 30 s), sample RSS/COW/fd every 2 s, report first→last
  deltas. Catches steady-state idle leaks churn misses
- stability-soak-load — soak with continuous in-box fio random
  reads. Catches under-load leaks (gvproxy goroutine pools, libkrun
  dirty-page buffers) that idle soak misses
- stability-exec-loop — 500 sequential execs on one box. Tolerant
  of partial completion: reports `exec_completed_count` and a
  per-iter failure-index marker. Documents the historical
  ~exec #247 InitReady/IntermediateReady mismatch in alpine x86_64
  so future regressions show as the failure boundary moving down
- stability-exec-parallel — 20 concurrent execs via tokio::spawn
  fan-out, batch wall + per-exec p99 under contention. Tests gRPC
  fairness and the guest's exec state-map lock
- stability-restart-loop — 20 stop+start cycles on the SAME box
  (distinct from churn's create-each-time). Re-fetches the LiteBox
  handle via `rt.get(id)` between cycles because `stop` invalidates
  the previous handle (would otherwise abort cycle 0)
- stability-snapshot-loop — 20 sequential SnapshotHandle::create
  calls. Headline is per-create mean/max + COW byte delta. Remove
  omitted: creating snapshot N moves the current overlay's parent
  to N, so remove(N) fails until N+1 exists; mixing the two ops
  would couple latencies
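
The partial-completion tolerance of stability-exec-loop can be illustrated with a small sketch. The report struct, its `first_failure_index` field, and the closure-based `exec_loop` signature are assumptions; only the `exec_completed_count` name comes from the commit message.

```rust
/// Summary of a tolerant exec loop (field names partly hypothetical).
#[derive(Debug, PartialEq)]
struct ExecLoopReport {
    exec_completed_count: usize,
    first_failure_index: Option<usize>,
}

/// Run up to `total` execs; stop at the first failure but still report
/// how far the loop got instead of returning an error.
fn exec_loop<F>(total: usize, mut exec: F) -> ExecLoopReport
where
    F: FnMut(usize) -> Result<(), String>,
{
    for i in 0..total {
        if exec(i).is_err() {
            return ExecLoopReport {
                exec_completed_count: i,
                first_failure_index: Some(i),
            };
        }
    }
    ExecLoopReport { exec_completed_count: total, first_failure_index: None }
}
```

Because the completed count is an ordinary metric, a failure boundary that moves earlier shows up in a baseline comparison rather than as a missing report.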

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 changed the title WIP: bench harness — 44 scenarios + cross-cutting fixes bench harness — 44 scenarios across 5 axes May 26, 2026
gamnaansong and others added 7 commits May 26, 2026 05:10
Sweep surfaced `throughput-net-iperf3-parallel` hanging > 600 s when
run after other iperf3 scenarios + with /tmp under pressure. Root
cause is a rare gvproxy state-leak race that wedges the iperf3 -P 4
control channel — iperf3's own `-t 5` budget should cap the run, but
on the failure path the client never exits. With `.output().await`
unbounded, one hang consumes a whole sweep slot.

Switch the three iperf3 client invocations (net_iperf3,
net_iperf3_parallel, net_iperf3_egress) from `.output().await` to
`spawn() + timeout(TRANSFER_SECS + 25s, wait_with_output())`. On
timeout: capture host iperf3 process list for diagnostics and bail
with a clear error.

For net_iperf3_egress (the in-box client variant) the same bound is
applied to the stdout drain loop + the exec wait separately —
hangs can wedge either.
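
The bounding idea can be shown without an async runtime. The sketch below is a std-only analogue that polls `try_wait` against a deadline; it is not the tokio `spawn() + timeout(...)` code the commit describes, and `output_with_timeout` is a hypothetical helper name.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Run `cmd`, but kill it and return an error if it has not exited within `limit`.
fn output_with_timeout(cmd: &mut Command, limit: Duration) -> Result<String, String> {
    let mut child = cmd
        .stdout(Stdio::piped())
        .stderr(Stdio::null())
        .spawn()
        .map_err(|e| format!("spawn failed: {e}"))?;

    // Drain stdout on a helper thread so a chatty child cannot block on a full pipe.
    let mut stdout = child.stdout.take().expect("stdout was piped");
    let reader = thread::spawn(move || {
        let mut buf = String::new();
        let _ = stdout.read_to_string(&mut buf);
        buf
    });

    let deadline = Instant::now() + limit;
    loop {
        match child.try_wait().map_err(|e| format!("wait failed: {e}"))? {
            Some(_status) => break,
            None if Instant::now() >= deadline => {
                let _ = child.kill();
                let _ = child.wait(); // reap the child so no zombie is left behind
                return Err(format!("timed out after {limit:?}"));
            }
            None => thread::sleep(Duration::from_millis(20)),
        }
    }
    Ok(reader.join().unwrap_or_default())
}
```

On a Unix host, `output_with_timeout(Command::new("sleep").arg("5"), Duration::from_millis(200))` returns the timeout error after roughly 200 ms instead of holding the caller for five seconds.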

Verified E2E on this host: all three pass cleanly with the bound in
place — net-iperf3 1.42 Gbps, egress 8.16 Gbps, parallel 1.45 Gbps.
Manual repro confirmed `iperf3 -s -1 -p 5201 + -P 4 -J -t 5` works
fine outside the scenario harness, ruling out an iperf3 server-flag
incompatibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep-style single-sample mode (`--runs 1 --warmup 0`) was reporting
the SAME `total_create_ms` as cold-start, defeating the whole point
of the scenario. Root cause: with one call to `run_once`, `self.home`
is freshly initialized and never populated by a prior iteration, so
the measured cycle pays full image pull + base disk build + guest
rootfs bootstrap.

Fix: track a `prewarmed` flag on `WarmStart`. The first `run_once`
call drives a hidden box cycle to populate the home, then drives
the measured cycle. From the second call onward `prewarmed=true`
so no extra cycle. The runner's `--warmup` knob still works on top
— those iterations are just additional warm cycles, harmless.
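
The `prewarmed` flag pattern described above reduces to a few lines. In this sketch the box cycles are replaced by counters so the control flow is visible; the struct fields other than `prewarmed` and the `drive_cycle` helper are hypothetical.

```rust
/// Hypothetical stand-in for the warm-start scenario; counters replace real box cycles.
struct WarmStart {
    prewarmed: bool,
    hidden_cycles: u32,
    measured_cycles: u32,
}

impl WarmStart {
    fn new() -> Self {
        WarmStart { prewarmed: false, hidden_cycles: 0, measured_cycles: 0 }
    }

    /// Stand-in for one box cycle against the shared home.
    fn drive_cycle(&mut self, measured: bool) {
        if measured {
            self.measured_cycles += 1;
        } else {
            self.hidden_cycles += 1;
        }
    }

    /// The first call pays for one unmeasured cycle that populates the home;
    /// every call, including the first, then runs exactly one measured cycle.
    fn run_once(&mut self) {
        if !self.prewarmed {
            self.drive_cycle(false);
            self.prewarmed = true;
        }
        self.drive_cycle(true);
    }
}
```

Keeping the flag on the scenario struct rather than in the runner means a single-sample invocation still yields a warm measurement.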

Verified E2E single-sample on this host:
  before: total_create=25,275 ms, image_prepare=24,263, rootfs=20,868
  after:  total_create=1,076  ms, image_prepare=3,      rootfs=1

The pre-warm cycle still costs ~20s on the first call's wall_ms;
that's expected and is the cost of producing one valid warm sample
from a fresh runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-call cold state

Sanity check on the 44-scenario sweep numbers showed that several
setup-cost scenarios were reporting cold-pull cost under a warm-cost
label at `--runs 1 --warmup 0`:

- `density-parallel-10`: per-box max latency 23s on first iter (cold
  pull contention) vs the headline-grade ~5s init-pipeline number it
  claims to measure
- `throughput-many-ports-setup`: `start_ms` 14,703 (image pull) vs
  the gvproxy port-table fanout cost it advertises
- `throughput-volumes-multi-setup`: `start_ms` 15,584 (image pull) vs
  virtiofs-fanout
- `latency-image-pull-cached`: `pull_cached_ms` 4,397 (cold pull on
  first call) vs the ms-scale warm-cache hit it's named for

Same pattern as the earlier `latency-warm-start` fix: track a
`prewarmed: bool` on each scenario struct. First `run_once` does a
hidden pre-warm cycle (throwaway box, no ports/volumes, or a
throwaway pull) to populate the shared home; subsequent calls skip
the prewarm. The runner's `--warmup` knob still works on top.

Verified E2E (single-sample):
- density max latency: 23s → 5,493 ms
- many-ports start_ms: 14,703 → 1,060 ms (now ~= warm-start floor,
  confirming gvproxy port-table fanout is nearly free)
- volumes start_ms: 15,584 → 1,045 ms (same conclusion for virtiofs)
- image-pull-cached: 4,397 → 0.5 ms (manifest cache short-circuit)

The remaining 40 scenarios were re-checked: their measured metrics
(disk dd/fio, iperf3 bps, RSS-after-settle, exec ms, etc.) are
captured inside the running box, decoupled from first-call setup
cost. No pre-warm needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
44 scenario reports from a clean re-sweep at 251397a after the
pre-warm fixes landed. Schema 1.0, captured on:

  CPU:    Intel(R) Xeon(R) 6975P-C, 4 vCPU
  Kernel: Linux 6.17.0-1015-aws
  Mem:    7.6 GB
  Registry mirrors: docker.m.daocloud.io, docker.1ms.run, docker.io

Used as the reference baseline for `boxlite bench compare`:

  boxlite bench run latency-warm-start --runs 10 --warmup 1 \
      --out /tmp/current.json
  boxlite bench compare bench/baselines/latency-warm-start.json \
      /tmp/current.json --threshold 0.20 --on p99

Headline numbers:
  latency-warm-start total_create_ms ............ 1,012 ms
  latency-cold-start total_create_ms ........... 15,573 ms
  latency-image-pull-cached pull_cached_ms ......... 0.5 ms
  throughput-disk-write mb_per_sec ............... 232 MB/s
  throughput-disk-read mb_per_sec .............. 4,608 MB/s
  throughput-disk-fio iops_per_sec ............ 872,064 IOPS
  throughput-net-iperf3 bps ..................... 1.59 Gbps
  throughput-net-iperf3-egress bps .............. 8.93 Gbps
  throughput-net-iperf3-parallel bps ............ 1.36 Gbps
  throughput-serve-rps .......................... 9,613 rps
  throughput-tcp-cps ............................ 6,889 conn/s
  resource-idle rss ............................ 248.9 MB
  resource-density-10-idle total_rss ........... 2,492 MB

Two SKIPs by design:
  latency-cold-start-jailed — AppArmor blocks unprivileged userns
  throughput-net-udp        — gvproxy TCP+UDP dual-fwd unsupported

Refresh procedure (when boxlite engine changes meaningfully):

  rm -rf /tmp/bench-sweep && mkdir -p /tmp/bench-sweep
  for s in $(boxlite bench list | awk '/^  / {print $1}'); do
    boxlite bench run "$s" --runs 1 --warmup 0 --out "/tmp/bench-sweep/${s}.json"
  done
  cp /tmp/bench-sweep/*.json bench/baselines/

Host hardware is recorded in metadata.host of each JSON. Numbers
compared across hosts are expected to differ widely; the baseline
is a reference, not an absolute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `async fn teardown(&mut self, ctx: &TeardownContext) -> Result<()>`
to the `Scenario` trait (default no-op) and runner integration: the
hook fires once after the iteration loop, on BOTH success and error
paths, before the report writes out. Teardown errors are surfaced as
warnings — they don't mask the iteration result.
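
The runner contract described above (teardown fires once on both paths, and its errors become warnings) can be sketched synchronously. The trait below is a simplified, non-async stand-in for the PR's `Scenario` trait, and `RunOutcome`, `run_scenario`, and `Flaky` are hypothetical names.

```rust
/// Hypothetical synchronous stand-in for the async `Scenario` trait.
trait Scenario {
    fn run_once(&mut self, iter: usize) -> Result<f64, String>;
    /// Default no-op, matching the description of the hook.
    fn teardown(&mut self) -> Result<(), String> {
        Ok(())
    }
}

/// The iteration result plus any teardown warnings, kept separate.
struct RunOutcome {
    result: Result<Vec<f64>, String>,
    warnings: Vec<String>,
}

fn run_scenario(scenario: &mut dyn Scenario, runs: usize) -> RunOutcome {
    // Collecting into Result stops at the first iteration error without returning early.
    let result: Result<Vec<f64>, String> =
        (0..runs).map(|i| scenario.run_once(i)).collect();

    // Teardown runs exactly once, whether the loop succeeded or failed.
    let mut warnings = Vec::new();
    if let Err(e) = scenario.teardown() {
        warnings.push(format!("teardown failed: {e}"));
    }
    RunOutcome { result, warnings }
}

/// Demo scenario: fails on iteration `fail_at`, and its teardown also fails.
struct Flaky {
    fail_at: usize,
    torn_down: bool,
}

impl Scenario for Flaky {
    fn run_once(&mut self, iter: usize) -> Result<f64, String> {
        if iter == self.fail_at {
            Err(format!("iteration {iter} failed"))
        } else {
            Ok(iter as f64)
        }
    }
    fn teardown(&mut self) -> Result<(), String> {
        self.torn_down = true;
        Err("leftover box".to_string())
    }
}
```

Keeping the teardown error out of `result` is what preserves the original iteration error for the report.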

Implementations for the 8 scenarios with cross-iteration persistent
state:

- lifecycle.rs (latency-clone, throughput-export): rt.remove(source)
- clone_batch.rs: rt.remove(source)
- snapshot.rs: rt.remove(source) — cascades accumulated snapshots
- snapshot_loop.rs: rt.remove(source) — cascades 20 snaps per iter
- inspect_list.rs: rt.remove each of the 20 unstarted boxes
- dedup_lookup.rs: rt.remove("bench-dedup-target")
- runtime_metrics_poll.rs: stop + remove the 10 idle boxes held in Vec
- net_iperf3_egress.rs: pkill any host `iperf3 -s -D` daemons by port
  (the `-1` flag self-exits on clean disconnect, but errors mid-
  handshake can leave the daemon waiting forever)

Verified E2E:
  latency-inspect-list:        DB rows = 0, box dirs on disk = 0
  resource-runtime-metrics-poll: DB rows = 0, box dirs on disk = 0
(both previously left 20 / 10 boxes around until TempDir drop, which
only fires on clean process exit — not on SIGTERM or panic.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three more cleanup paths on top of the per-scenario teardown hook:

1. **SIGINT/SIGTERM handler**: race the iteration loop against
   `tokio::signal::unix` futures via `select!`. When a sweep
   wrapper's `timeout 600` sends SIGTERM, the select-cancellation
   drops the iter_fut (which runs BoxGuard's Drop synchronously
   via `block_in_place + rt.remove`) AND we get to run the explicit
   teardown hook before exiting. Previously the bench would just
   die with no teardown.

2. **`catch_unwind` around the iteration loop**: a `panic!` inside
   a scenario (unwrap, index OOB, etc.) used to bring down the
   binary mid-cleanup. Now panics are caught, the panic message
   surfaces as a regular `Err`, and teardown still runs. Uses
   `AssertUnwindSafe` because after the panic we only invoke
   teardown — never read partially-mutated scenario state.

3. **`kill_descendants_of_self` last-ditch reaper**: walks
   `/proc/<pid>/task/<pid>/children` BFS and SIGKILLs any
   `libkrun VM` or `boxlite-shim` descendant after a failed run.
   On the happy path BoxGuard's Drop + boxlite engine self-cleanup
   already handle this and the reaper finds 0 procs to kill; this
   covers the edge case where SIGTERM hits during VMM spawn before
   BoxGuard wraps the box.
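
Item 2 can be illustrated with the synchronous `std::panic::catch_unwind`. The commit itself uses `futures::FutureExt::catch_unwind` on a future; this std-only sketch only shows the same shape, and `run_guarded` is a hypothetical name.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Run `iterations`, turn a panic into a regular error, and always run `teardown`.
fn run_guarded<I, T>(iterations: I, teardown: T) -> Result<(), String>
where
    I: FnOnce() -> Result<(), String>,
    T: FnOnce(),
{
    // AssertUnwindSafe: after a panic only `teardown` runs; state touched by
    // `iterations` is not read again.
    let outcome = panic::catch_unwind(AssertUnwindSafe(iterations));
    teardown();
    match outcome {
        Ok(result) => result,
        Err(payload) => {
            // Panic payloads are usually a &str or a String.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            Err(format!("scenario panicked: {msg}"))
        }
    }
}
```

This only works when the binary is built with unwinding panics; under `panic = "abort"` the process exits before the teardown closure is reached.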

Verified E2E with mid-iteration SIGTERM on stability-churn:
  before: 7s exit, 2 libkrun + 3 shim orphans, 0 box dirs (TempDir
          drop ran but VMM children leaked)
  after:  <1s exit, 0 orphans, 0 box dirs, clean DB

`futures::FutureExt::catch_unwind` brings in `futures` crate
(already in deps). `nix::sys::signal::kill` for the SIGKILL
(already in deps with `signal` feature).
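
The breadth-first walk behind the last-ditch reaper (item 3) might be structured like the sketch below. It covers only the enumeration step; filtering by process name and the `nix` SIGKILL call are left out, and `proc_children` and `descendants` are hypothetical names. The children file is Linux-specific and may be absent on kernels built without it, in which case the sketch treats the process as having no children.

```rust
use std::collections::{HashSet, VecDeque};
use std::fs;

/// Direct children of `pid` as listed by the kernel (empty on any read error).
fn proc_children(pid: u32) -> Vec<u32> {
    fs::read_to_string(format!("/proc/{pid}/task/{pid}/children"))
        .map(|s| {
            s.split_whitespace()
                .filter_map(|t| t.parse::<u32>().ok())
                .collect::<Vec<u32>>()
        })
        .unwrap_or_default()
}

/// Breadth-first walk of the process tree below `root` (root itself excluded).
/// `children_of` is injected so the walk can be tested without /proc.
fn descendants<F: Fn(u32) -> Vec<u32>>(root: u32, children_of: F) -> Vec<u32> {
    let mut seen = HashSet::from([root]);
    let mut queue = VecDeque::from([root]);
    let mut found = Vec::new();
    while let Some(pid) = queue.pop_front() {
        for child in children_of(pid) {
            // The `seen` set guards against revisiting a pid.
            if seen.insert(child) {
                found.push(child);
                queue.push_back(child);
            }
        }
    }
    found
}
```

On Linux, `descendants(std::process::id(), proc_children)` lists the current process's descendants, which a reaper would then filter and signal.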

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g-image

Adds the high-ROI surfaces from the code-coverage gap analysis:

**REST / WS path (SDK clients live here)**
- `latency-rest-cold-start` — full cold-start through axum + tower +
  serde + HTTP round-trips. Delta vs `latency-cold-start` quantifies
  the REST tax that every Python/Node/Go SDK call pays.
- `latency-ws-exec` — 100 echo execs through the REST/WebSocket exec
  channel. Per-exec mean/p50/p99/max ms; delta vs in-process
  `stability-exec-loop` ≈ tungstenite + axum framing cost.
- `throughput-rest-metrics-rps` — Prometheus-scrape-shaped hammer
  against `/v1/metrics` (16 workers × 5s). Caps scrape density.

**Workload-variant cold-starts**
- `latency-cold-start-no-net` — `NetworkSpec::Disabled` skips
  gvproxy. Delta vs `latency-cold-start` = gvproxy boot cost; useful
  for compute-only workloads that don't need network.
- `latency-cold-start-big-image` — `python:3.12-alpine` (~50 MB,
  multi-layer). Stresses layer-tarball-extraction at non-trivial
  scale; size-dependent stage scaling falls out of the diff vs
  alpine.

**Background-task overhead**
- `resource-healthcheck-overhead` — `HealthCheckOptions { interval:
  500ms }`. Sample CPU%/RSS over 10s. Extrapolate × (real interval
  / 500ms) for production tuning.

Shared helper `common::ServeChild` (probe-port + spawn `boxlite
serve` + poll-ready) extracted so the 3 REST scenarios don't
re-implement the lifecycle. Forwards `--registry` from
`GlobalFlags` to the child so its image pulls go through the same
mirrors the parent was given (avoids docker.io rate limit during
sweeps).

Verified E2E on this host:
  no-net total_create     = 19,900 ms
  big-image total_create  = 17,583 ms  (python:3.12-alpine)
  healthcheck cpu_mean    =      0.6 % at 500ms interval
  rest-cold-start         = 18,759 ms  (+REST tax)
  ws-exec mean_ms         =     13.7 ms (vs in-proc 9.1 ms)
  rest-metrics-rps        =  3,716 rps (vs /v1/config 9,613 rps)

50 scenarios total now; registry lockstep + uniqueness tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>