bench harness — 44 scenarios across 5 axes by G4614 · Pull Request #592 · boxlite-ai/boxlite

G4614 · 2026-05-26T03:29:39Z

Status: Draft. Mechanically clean; opening for review before un-drafting. Squash already done.

Summary

Adds a runtime performance harness under boxlite bench — 44 scenarios across 5 axes (latency, resource, density, throughput, stability), versioned JSON reports, direction-aware compare against a baseline, p99 regression gate.

Stack: 6 commits on top of main, +8558 LOC. See src/cli/src/commands/bench/README.md for the full scenario catalog, JSON schema, and compare semantics.

Commit layout

feat(cli): bench harness foundation — runner / Scenario trait / JSON schema / compare / host_info / stats / scenarios skeleton.
test(bench): latency axis — 10 scenarios — also ships scenarios/common.rs (BoxGuard, alpine_options, build_runtime) since it's the first axis to need them. throughput-export rides along because it shares lifecycle.rs's source-box prep helper with latency-clone.
test(bench): resource axis — 6 scenarios
test(bench): density axis — parallel-10
test(bench): throughput axis — 19 scenarios
test(bench): stability axis — 7 scenarios

Every commit individually: compiles, passes the registry / build_by_name lock-step tests, clippy -D warnings clean. mod.rs grows progressively so no commit has dead arms or dead module decls.

Scenario coverage

Latency (10): cold/warm start (incl. jailed), clone (single + batch-10), snapshot, inspect-list, get-or-create dedup, image-pull (cached), runtime-shutdown.
Resource (6): idle / cpu-load / mem-pressure / density-10-idle / multi-vcpu-load / runtime-metrics-poll.
Density (1): parallel-10 burst spawn.
Throughput (19 in the throughput commit, plus throughput-export from the latency commit): image pull, disk (dd seq + fio random, R+W), virtiofs, net (tcp-sink, iperf3 host→guest + egress + parallel, tcp-cps, dns-latency, udp SKIPs), serve-rps, lifecycle (export, import, copy-into, copy-out), setup-cost (many-ports, volumes-multi).
Stability (7): churn (50 cycles), soak (idle + load), exec-loop (500), exec-parallel (20), restart-loop (20), snapshot-loop (20).

What's in this branch

Foundation: BenchArgs / Scenario trait / RunContext, versioned JSON schema (schema_version = "1.0"), nearest-rank percentiles (ISO 16269-4) + Bessel-corrected stdev — self-contained, zero crate-external deps (no criterion / statistical).
Direction-aware compare: higher_is_better flag on each MetricAggregate, inferred from _per_sec / _rps / _iops suffixes; comparator flips the sign so "positive regression_ratio = bad" holds for both latency and throughput.
compare double-guards: schema-version + scenario-name mismatch errors before producing a misleading delta.
Registry / build_by_name lock-step pinned by unit test; duplicate-name guard too.
E2E SKIP convention for host-prereq scenarios (iperf3 binary, AppArmor userns, KRUN_VIRTIO_FS_MAX, gvproxy TCP+UDP same-port bug) — scenario returns Ok with a marker metric instead of failing the runner.
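
As an illustration of the two statistics named above, here is a minimal sketch of a nearest-rank percentile and a Bessel-corrected sample standard deviation using only the standard library. The function names and the `Option` return convention are assumptions for this example, not the harness's actual `stats` API.

```rust
/// Nearest-rank percentile: the value at 1-based rank ceil(p/100 * n)
/// in the sorted sample. Returns None for an empty sample or for p
/// outside the range (0, 100].
pub fn percentile_nearest_rank(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    // Clamp guards against a rank of 0 from floating-point rounding.
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Sample standard deviation with Bessel's correction (divide by n - 1).
/// Returns None when fewer than two samples are available.
pub fn sample_stdev(samples: &[f64]) -> Option<f64> {
    let n = samples.len();
    if n < 2 {
        return None;
    }
    let mean = samples.iter().sum::<f64>() / n as f64;
    let sum_sq: f64 = samples.iter().map(|x| (x - mean).powi(2)).sum();
    Some((sum_sq / (n - 1) as f64).sqrt())
}
```

With nearest-rank, a reported p99 is always one of the observed samples rather than an interpolated value, which keeps small-sample reports easy to trace back to a specific iteration.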

Real boxlite quirks the bench surfaced (documented in scenario file headers + commit messages)

LiteBox::copy_out from /tmp (tmpfs) returns NotFound — needs a rootfs path like /root. Fixed in scenarios/copy_io.rs.
libkrun KRUN_VIRTIO_FS_MAX = 2; ≥ 3 -v mounts give status=-22 at start. scenarios/volumes_multi.rs capped at N=2.
After SnapshotHandle::create (or restore), the current disk depends on the snap → immediate remove fails with "current disk depends on this snapshot". latency-snapshot drops the trailing remove; stability-snapshot-loop drops restore and just measures create churn.
LiteBox handle invalidates on stop; the next start panics with "Handle invalidated after stop()". stability-restart-loop re-fetches via rt.get(id) between cycles.
gvproxy doesn't support TCP+UDP forward on same host_port → iperf3 -u control channel can't connect. throughput-net-udp SKIPs by default; BOXLITE_BENCH_UDP_FORCE=1 to attempt.
Guest exec subsystem regresses around exec #247 → InitReady / IntermediateReady(0) mismatch. stability-exec-loop is tolerant of partial completion and reports exec_completed_count so the boundary surfaces as a regression signal instead of an aborted run.
Ubuntu 24+ AppArmor restricts unprivileged userns by default; jailed cold-start preflights bwrap --unshare-user and SKIPs when denied.

Verification

cargo clippy -p boxlite-cli --tests --no-deps -- -D warnings — clean on Linux + macOS (test-mod cfg-gated for /proc-only assertions).
cargo test -p boxlite-cli --bin boxlite commands::bench:: — 28/28 (registry lock-step + uniqueness + stats + parse helpers).
25+ scenarios E2E-run on this host (numbers in axis-commit messages).

Why still draft

Net-UDP scenario carries BOXLITE_BENCH_UDP_FORCE=1 escape hatch but is SKIP by default until gvproxy gains TCP+UDP same-port support.
No CI integration (baseline comparison wired into a workflow) yet.
Want a review pass on the JSON schema before declaring it 1.0 stable for downstream tooling.

Test plan

make test:unit:rust clean
boxlite bench list shows all 44
Spot-run 3-4 scenarios per axis, confirm reports parse + compare gates trip on synthetic regressions
Sign off on the JSON schema shape

🤖 Generated with Claude Code

macOS clippy on PR boxlite-ai#592 caught `unused import: super::*` in the tests submodule: the only test was already `#[cfg(target_os = "linux")]`-gated, so on macOS the module compiled empty and the `use super::*` had nothing to import → `-D warnings` fails. Move the cfg to the mod declaration so the whole tests block is elided off-Linux. No behavior change on Linux.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the `boxlite bench` subcommand scaffold: runner / scenario trait / JSON schema / compare / host metadata. Scenario implementations land in the per-axis follow-up commits.
Components:
- `BenchArgs` + dispatch (`bench list` / `bench run` / `bench compare`)
- `Scenario` trait (`async fn run_once` via `async-trait`), runner loop with warmup-drop, `RunContext`
- Versioned JSON report (`schema_version = "1.0"`): metadata, per-iteration samples, aggregates (min/p50/p90/p99/max/mean/stdev) with `unit` hint and `higher_is_better` flag
- Self-contained stats: nearest-rank percentile (ISO 16269-4), Bessel-corrected sample stdev; zero crate-external deps (no `criterion`/`statistical`)
- `HostInfo` snapshot from `/proc`: kernel / arch / CPU model+count / memory total; tests gated by `cfg(target_os = "linux")` so macOS clippy stays clean
- `compare BASELINE CURRENT` with `--threshold`/`--on` knobs: schema-version + scenario-name double guard; direction-aware (`higher_is_better` metrics get their delta sign flipped into `regression_ratio` so the "positive=bad" invariant holds for both latency and throughput metrics)
- Registry / `build_by_name` lock-step invariant pinned by unit test, plus a uniqueness check on scenario names
Plus the CLI plumbing (`src/cli/src/main.rs` dispatch, README pointer at `src/cli/src/commands/bench/README.md`) and the `async-trait = "0.1"` dep.
Subsequent commits will populate the registry per-axis: latency / resource / density / throughput / stability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
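
The direction-aware compare described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names, the zero-baseline handling, and the strict `>` threshold comparison are hypothetical and not taken from the harness; only the suffix convention and the "positive = bad" invariant come from the description.

```rust
/// Infers metric direction from its name using the suffix convention
/// from the PR description (`_per_sec`, `_rps`, `_iops` mean higher is better).
pub fn higher_is_better(metric_name: &str) -> bool {
    ["_per_sec", "_rps", "_iops"]
        .iter()
        .any(|suffix| metric_name.ends_with(suffix))
}

/// Signed regression ratio where a positive value always means
/// "worse than baseline", for latency and throughput metrics alike.
/// Returns None for a zero baseline (assumed handling; no ratio exists).
pub fn regression_ratio(baseline: f64, current: f64, higher_is_better: bool) -> Option<f64> {
    if baseline == 0.0 {
        return None;
    }
    let delta = (current - baseline) / baseline;
    // For throughput-style metrics a drop is the regression, so flip the sign.
    Some(if higher_is_better { -delta } else { delta })
}

/// Gate: true when the metric got worse by more than `threshold`
/// (for example 0.20 for a 20 % budget).
pub fn is_regression(baseline: f64, current: f64, metric_name: &str, threshold: f64) -> bool {
    regression_ratio(baseline, current, higher_is_better(metric_name))
        .map_or(false, |ratio| ratio > threshold)
}
```

Under this sketch a latency p99 that grows from 100 ms to 125 ms and a throughput that falls from 200 to 150 both yield a ratio of 0.25, so a single threshold can gate both kinds of metric.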

Single-operation timing across the box lifecycle:
- latency-cold-start / latency-warm-start: fresh vs shared `--home`
- latency-cold-start-jailed: `SecurityOptions::maximum()`; isolation tax delta. SKIPs when `is_full_isolation_available()` is false OR when a `bwrap --unshare-user` preflight fails (Ubuntu 24+ AppArmor default).
- latency-clone / latency-clone-batch-10: single-call vs batch `clone_boxes(N=10)` per-clone amortized cost
- latency-snapshot: `SnapshotHandle` create+restore round-trip (remove omitted; after restore the current disk depends on the snapshot, so remove would fail with the qcow2 dep invariant)
- latency-inspect-list: `list_info` + `get_info` at N=20 boxes; SQLite index scaling signal
- latency-get-or-create-dedup: 100 `get_or_create(name)` hits on a pre-materialized box; µs floor for name→box-id lookup
- latency-image-pull-cached: shared `--home` warm-cache pull (counterpart to throughput-image-pull's cold-cache headline)
- latency-runtime-shutdown: `rt.shutdown` with N=3 running boxes; graceful-stop SLA floor
`throughput-export` rides along because it shares the source-box prep helper with `latency-clone` in `lifecycle.rs`; rest of the throughput axis lands separately.
Shared scenario plumbing in `scenarios/common.rs` (BoxGuard RAII, `alpine_options`, `build_runtime`) ships with this axis since it's the first one to use it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per-box resource footprint, idle and under load:
- resource-idle: RSS / COW / CPU% after 3 s settle
- resource-cpu-load: one vCPU pegged by stress-ng --cpu 1 for 10 s; catches libkrun-shim RSS growth under work
- resource-mem-pressure: box capped at 256 MiB, stress-ng --vm-bytes 150m. Documented signal is the exit-code transition from `1` (clean baseline on alpine's stress-ng build) to `137` (SIGKILL by OOM-killer); reports `mem_pressure_limit_bytes` and `mem_pressure_alloc_bytes` for context
- resource-density-10-idle: 10 idle boxes coexisting; sums RSS + COW + host fd. Steady-state coexistence cost (vs density-parallel-10's concurrent-spawn signal)
- resource-multi-vcpu-load: 4 vCPUs all saturated; tests libkrun vCPU thread mapping + multi-core KVM exit path
- resource-runtime-metrics-poll: `rt.metrics()` cost at N=10 running boxes, 500 samples → µs mean/p50/p99/max. Floor for Prometheus scrape overhead
Shared `--home` across iterations for steady-state numbers; `stress-ng` install via apk amortized across the scenario instance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`density-parallel-10`: concurrent spawn of 10 alpine boxes through one runtime. Headline is the per-box max/mean latency under contention, i.e. the init-pipeline surcharge over a single warm start that catches lock-contention regressions invisible to a serial scenario. Distinct from `resource-density-10-idle` (which measures steady-state coexistence cost once all 10 are up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bytes-per-second / requests-per-second across every external code path.
**Image**
- throughput-image-pull: cold-cache pull MB/s
**Disk (in-box)**
- throughput-disk-write / -disk-read: sequential dd 64 MiB, qcow2-COW-over-virtio bandwidth
- throughput-disk-fio / -disk-fio-read: fio 4K random + fsync, IOPS + clat p50/p99/p999 tail latency. One-time apk add amortized
- throughput-virtiofs: dd write to a `-v` mount; distinct codepath from the qcow2 disk-write (host volumes bypass the COW overlay)
**Net (gvproxy)**
- throughput-net-tcp-sink: host→guest via `nc` sink. Catches gvproxy regressions without needing iperf3 on host
- throughput-net-iperf3 / -iperf3-egress: host→guest and guest→host iperf3 TCP, bps + retransmits
- throughput-net-iperf3-parallel: iperf3 -P 4 multi-stream, fairness signal via per-stream stdev
- throughput-net-udp: iperf3 -u 1 Gbps; SKIPs by default (gvproxy doesn't support TCP+UDP forward on same host_port; iperf3 -u needs both for control + data channels). `BOXLITE_BENCH_UDP_FORCE=1` escape hatch
- throughput-tcp-cps: TCP connection-establish rate via tight TcpStream::connect loop against in-box nc respawn
- throughput-dns-latency: `getent ahosts` lookups for internal + external targets; tests gvproxy embedded DNS + recursive forward
**REST**
- throughput-serve-rps: spawn `boxlite serve` child, hammer /v1/config with 16 reqwest workers
**Lifecycle**
- throughput-copy-into / -copy-out: 64 MiB tar-stream via `LiteBox::copy_into/out`. Payload staged on `/root` not `/tmp` because tmpfs isn't visible to the guest agent's file interface
- throughput-import: `rt.import_box` from a pre-exported archive; counterpart to `throughput-export` (which shipped with the latency axis to share lifecycle.rs's source-box helper)
**Setup-cost**
- throughput-many-ports-setup: 16 `-p` forwards; gvproxy port-table fan-out delta vs zero-ports warm-start
- throughput-volumes-multi-setup: 2 `-v` mounts (libkrun `KRUN_VIRTIO_FS_MAX = 2` cap; N=3 fails with status=-22)
Host-side prereqs (iperf3 on PATH) checked at scenario start; SKIPs cleanly with marker metric instead of aborting the runner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
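
The SKIP convention for host-side prerequisites might look roughly like the sketch below: look for the binary on a PATH-style string at scenario start and, when it is missing, return `Ok` with a marker metric instead of an error. The `Metrics` alias, the `skipped` key, and both function names are hypothetical; the harness's real metric types are not shown in this PR text.

```rust
use std::collections::BTreeMap;
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical metric map: metric name to value.
pub type Metrics = BTreeMap<String, f64>;

/// Hypothetical marker key that downstream tooling could treat as "skipped".
pub const SKIPPED_MARKER: &str = "skipped";

/// Searches each directory of a PATH-style string for `binary`.
pub fn find_on_path(binary: &str, path_var: &str) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(binary))
        .find(|candidate| candidate.is_file())
}

/// Runs `measure` only when the prerequisite exists; otherwise reports
/// a skip marker as a successful result so the runner keeps going.
pub fn run_once_or_skip<F>(binary: &str, path_var: &str, measure: F) -> Result<Metrics, String>
where
    F: FnOnce(&Path) -> Result<Metrics, String>,
{
    match find_on_path(binary, path_var) {
        Some(path) => measure(&path),
        None => {
            let mut metrics = Metrics::new();
            metrics.insert(SKIPPED_MARKER.to_string(), 1.0);
            Ok(metrics)
        }
    }
}
```

A caller would typically pass the process's `PATH` value as `path_var`; taking it as a parameter keeps the check easy to exercise with a fake search path.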

Long-running / repeated-op leak detection:
- stability-churn: 50 create+start+stop cycles, per-cycle latency + host fd delta. Catches per-cycle leaks (fd / tempfile / DB row accounting)
- stability-soak: keep one box alive for BOXLITE_BENCH_SOAK_SECS (default 30 s), sample RSS/COW/fd every 2 s, report first→last deltas. Catches steady-state idle leaks churn misses
- stability-soak-load: soak with continuous in-box fio random reads. Catches under-load leaks (gvproxy goroutine pools, libkrun dirty-page buffers) that idle soak misses
- stability-exec-loop: 500 sequential execs on one box. Tolerant of partial completion: reports `exec_completed_count` and a per-iter failure-index marker. Documents the historical ~exec boxlite-ai#247 InitReady/IntermediateReady mismatch in alpine x86_64 so future regressions show as the failure boundary moving down
- stability-exec-parallel: 20 concurrent execs via tokio::spawn fan-out, batch wall + per-exec p99 under contention. Tests gRPC fairness and the guest's exec state-map lock
- stability-restart-loop: 20 stop+start cycles on the SAME box (distinct from churn's create-each-time). Re-fetches the LiteBox handle via `rt.get(id)` between cycles because `stop` invalidates the previous handle (would otherwise abort cycle 0)
- stability-snapshot-loop: 20 sequential SnapshotHandle::create calls. Headline is per-create mean/max + COW byte delta. Remove omitted: creating snapshot N moves the current overlay's parent to N, so remove(N) fails until N+1 exists; mixing the two ops would couple latencies
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
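
One way to read "tolerant of partial completion" for stability-exec-loop is sketched below: the loop stops at the first failing exec and reports how far it got rather than returning an error. The report struct, the `first_failure_index` field, and the stop-at-first-failure behavior are assumptions made for this example; only the `exec_completed_count` metric name comes from the commit message.

```rust
/// Hypothetical report for a partially tolerant exec loop.
pub struct ExecLoopReport {
    /// Number of execs that finished before the first failure.
    pub exec_completed_count: usize,
    /// Index of the first failing exec, if any (the "failure boundary").
    pub first_failure_index: Option<usize>,
    /// Latency of each completed exec, in milliseconds.
    pub latencies_ms: Vec<f64>,
}

/// Runs up to `total` execs; a failure ends the loop but not the run,
/// so the boundary shows up as a metric instead of an aborted scenario.
pub fn run_exec_loop<F>(total: usize, mut exec_once: F) -> ExecLoopReport
where
    F: FnMut(usize) -> Result<f64, String>,
{
    let mut latencies_ms = Vec::with_capacity(total);
    let mut first_failure_index = None;
    for index in 0..total {
        match exec_once(index) {
            Ok(ms) => latencies_ms.push(ms),
            Err(_) => {
                first_failure_index = Some(index);
                break;
            }
        }
    }
    ExecLoopReport {
        exec_completed_count: latencies_ms.len(),
        first_failure_index,
        latencies_ms,
    }
}
```

Comparing `exec_completed_count` against a baseline then shows a regression as the boundary moving to a lower index.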

Sweep surfaced `throughput-net-iperf3-parallel` hanging > 600 s when run after other iperf3 scenarios + with /tmp under pressure. Root cause is a rare gvproxy state-leak race that wedges the iperf3 -P 4 control channel: iperf3's own `-t 5` budget should cap the run, but on the failure path the client never exits. With `.output().await` unbounded, one hang consumes a whole sweep slot.
Switch the three iperf3 client invocations (net_iperf3, net_iperf3_parallel, net_iperf3_egress) from `.output().await` to `spawn() + timeout(TRANSFER_SECS + 25s, wait_with_output())`. On timeout: capture host iperf3 process list for diagnostics and bail with a clear error. For net_iperf3_egress (the in-box client variant) the same bound is applied to the stdout drain loop + the exec wait separately, since hangs can wedge either.
Verified E2E on this host: all three pass cleanly with the bound in place (net-iperf3 1.42 Gbps, egress 8.16 Gbps, parallel 1.45 Gbps). Manual repro confirmed `iperf3 -s -1 -p 5201 + -P 4 -J -t 5` works fine outside the scenario harness, ruling out an iperf3 server-flag incompatibility.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sweep-style single-sample mode (`--runs 1 --warmup 0`) was reporting the SAME `total_create_ms` as cold-start, defeating the whole point of the scenario. Root cause: with one call to `run_once`, `self.home` is freshly initialized and never populated by a prior iteration, so the measured cycle pays full image pull + base disk build + guest rootfs bootstrap.
Fix: track a `prewarmed` flag on `WarmStart`. The first `run_once` call drives a hidden box cycle to populate the home, then drives the measured cycle. From the second call onward `prewarmed=true` so no extra cycle. The runner's `--warmup` knob still works on top; those iterations are just additional warm cycles, harmless.
Verified E2E single-sample on this host:
    before: total_create=25,275 ms, image_prepare=24,263, rootfs=20,868
    after:  total_create=1,076 ms, image_prepare=3, rootfs=1
The pre-warm cycle still costs ~20s on the first call's wall_ms; that's expected and is the cost of producing one valid warm sample from a fresh runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
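
The `prewarmed` flag pattern from this fix can be sketched as below. `WarmScenario` and `drive_cycle` are hypothetical stand-ins: the real `WarmStart` drives an actual box cycle against a shared home, whereas this example only counts cycles to show the control flow.

```rust
/// Hypothetical stand-in for a scenario whose measurement needs a warm home.
pub struct WarmScenario {
    prewarmed: bool,
    /// Counts every cycle driven, hidden or measured (illustration only).
    pub cycles_driven: u32,
}

impl WarmScenario {
    pub fn new() -> Self {
        WarmScenario { prewarmed: false, cycles_driven: 0 }
    }

    /// Stand-in for one create+start+stop cycle. Returns true when the
    /// cycle ran against a cold (never populated) home.
    fn drive_cycle(&mut self) -> bool {
        let cold = self.cycles_driven == 0;
        self.cycles_driven += 1;
        cold
    }

    /// Returns true if the MEASURED cycle was cold, which the pre-warm
    /// is meant to prevent even for the very first call.
    pub fn run_once(&mut self) -> bool {
        if !self.prewarmed {
            // Hidden throwaway cycle: pays the cold cost, result discarded.
            self.drive_cycle();
            self.prewarmed = true;
        }
        self.drive_cycle()
    }
}
```

The first call drives two cycles and later calls drive one, so a `--runs 1 --warmup 0` sweep still yields a warm sample at the price of a longer first call.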

…t-call cold state
Sanity check on the 44-scenario sweep numbers showed that several setup-cost scenarios were reporting cold-pull cost under a warm-cost label at `--runs 1 --warmup 0`:
- `density-parallel-10`: per-box max latency 23s on first iter (cold pull contention) vs the headline-grade ~5s init-pipeline number it claims to measure
- `throughput-many-ports-setup`: `start_ms` 14,703 (image pull) vs the gvproxy port-table fanout cost it advertises
- `throughput-volumes-multi-setup`: `start_ms` 15,584 (image pull) vs virtiofs-fanout
- `latency-image-pull-cached`: `pull_cached_ms` 4,397 (cold pull on first call) vs the ms-scale warm-cache hit it's named for
Same pattern as the earlier `latency-warm-start` fix: track a `prewarmed: bool` on each scenario struct. First `run_once` does a hidden pre-warm cycle (throwaway box, no ports/volumes, or a throwaway pull) to populate the shared home; subsequent calls skip the prewarm. The runner's `--warmup` knob still works on top.
Verified E2E (single-sample):
- density max latency: 23s → 5,493 ms
- many-ports start_ms: 14,703 → 1,060 ms (now ~= warm-start floor, confirming gvproxy port-table fanout is nearly free)
- volumes start_ms: 15,584 → 1,045 ms (same conclusion for virtiofs)
- image-pull-cached: 4,397 → 0.5 ms (manifest cache short-circuit)
The remaining 40 scenarios were re-checked: their measured metrics (disk dd/fio, iperf3 bps, RSS-after-settle, exec ms, etc.) are captured inside the running box, decoupled from first-call setup cost. No pre-warm needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

44 scenario reports from a clean re-sweep at 251397a after the pre-warm fixes landed. Schema 1.0, captured on:
    CPU: Intel(R) Xeon(R) 6975P-C, 4 vCPU
    Kernel: Linux 6.17.0-1015-aws
    Mem: 7.6 GB
    Registry mirrors: docker.m.daocloud.io, docker.1ms.run, docker.io
Used as the reference baseline for `boxlite bench compare`:
    boxlite bench run latency-warm-start --runs 10 --warmup 1 \
      --out /tmp/current.json
    boxlite bench compare bench/baselines/latency-warm-start.json \
      /tmp/current.json --threshold 0.20 --on p99
Headline numbers:
    latency-warm-start total_create_ms ............ 1,012 ms
    latency-cold-start total_create_ms ........... 15,573 ms
    latency-image-pull-cached pull_cached_ms ......... 0.5 ms
    throughput-disk-write mb_per_sec ............... 232 MB/s
    throughput-disk-read mb_per_sec .............. 4,608 MB/s
    throughput-disk-fio iops_per_sec ............ 872,064 IOPS
    throughput-net-iperf3 bps ..................... 1.59 Gbps
    throughput-net-iperf3-egress bps .............. 8.93 Gbps
    throughput-net-iperf3-parallel bps ............ 1.36 Gbps
    throughput-serve-rps .......................... 9,613 rps
    throughput-tcp-cps ............................ 6,889 conn/s
    resource-idle rss ............................ 248.9 MB
    resource-density-10-idle total_rss ........... 2,492 MB
Two SKIPs by design:
    latency-cold-start-jailed: AppArmor blocks unprivileged userns
    throughput-net-udp: gvproxy TCP+UDP dual-fwd unsupported
Refresh procedure (when boxlite engine changes meaningfully):
    rm -rf /tmp/bench-sweep && mkdir -p /tmp/bench-sweep
    for s in $(boxlite bench list | awk '/^ / {print $1}'); do
      boxlite bench run "$s" --runs 1 --warmup 0 --out "/tmp/bench-sweep/${s}.json"
    done
    cp /tmp/bench-sweep/*.json bench/baselines/
Host hardware in metadata.host of each JSON; compare across hosts should be expected to differ wildly, since the baseline is reference, not absolute.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `async fn teardown(&mut self, ctx: &TeardownContext) -> Result<()>` to the `Scenario` trait (default no-op) and runner integration: the hook fires once after the iteration loop, on BOTH success and error paths, before the report writes out. Teardown errors are surfaced as warnings; they don't mask the iteration result.
Implementations for the 8 scenarios with cross-iteration persistent state:
- lifecycle.rs (latency-clone, throughput-export): rt.remove(source)
- clone_batch.rs: rt.remove(source)
- snapshot.rs: rt.remove(source), which cascades accumulated snapshots
- snapshot_loop.rs: rt.remove(source), which cascades 20 snaps per iter
- inspect_list.rs: rt.remove each of the 20 unstarted boxes
- dedup_lookup.rs: rt.remove("bench-dedup-target")
- runtime_metrics_poll.rs: stop + remove the 10 idle boxes held in Vec
- net_iperf3_egress.rs: pkill any host `iperf3 -s -D` daemons by port (the `-1` flag self-exits on clean disconnect, but errors mid-handshake can leave the daemon waiting forever)
Verified E2E:
    latency-inspect-list: DB rows = 0, box dirs on disk = 0
    resource-runtime-metrics-poll: DB rows = 0, box dirs on disk = 0
(both previously left 20 / 10 boxes around until TempDir drop, which only fires on clean process exit, not on SIGTERM or panic.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
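
The runner integration described in this commit can be illustrated with a simplified, synchronous sketch. The real trait is async and takes a `TeardownContext`; the trait shape, `RunOutcome`, and `FlakyScenario` below are hypothetical and only show the ordering: iterate, always tear down, and keep teardown errors as warnings beside the iteration result.

```rust
/// Hypothetical synchronous stand-in for the harness's async `Scenario` trait.
pub trait Scenario {
    fn run_once(&mut self) -> Result<f64, String>;

    /// Cleanup hook; scenarios without persistent state keep the no-op default.
    fn teardown(&mut self) -> Result<(), String> {
        Ok(())
    }
}

pub struct RunOutcome {
    /// Iteration result, untouched by whatever teardown does.
    pub samples: Result<Vec<f64>, String>,
    /// Teardown problems, reported without masking `samples`.
    pub warnings: Vec<String>,
}

pub fn run_scenario(scenario: &mut dyn Scenario, runs: usize) -> RunOutcome {
    // Collecting into Result stops at the first failing iteration.
    let samples: Result<Vec<f64>, String> =
        (0..runs).map(|_| scenario.run_once()).collect();

    // Teardown fires exactly once, on both the success and the error path.
    let mut warnings = Vec::new();
    if let Err(err) = scenario.teardown() {
        warnings.push(format!("teardown failed: {err}"));
    }
    RunOutcome { samples, warnings }
}

/// Example scenario that fails on a chosen iteration and whose teardown
/// also reports an error, to show both paths at once.
pub struct FlakyScenario {
    pub fail_on: Option<usize>,
    pub calls: usize,
    pub torn_down: bool,
}

impl Scenario for FlakyScenario {
    fn run_once(&mut self) -> Result<f64, String> {
        let index = self.calls;
        self.calls += 1;
        if self.fail_on == Some(index) {
            return Err(format!("iteration {index} failed"));
        }
        Ok(1.0)
    }

    fn teardown(&mut self) -> Result<(), String> {
        self.torn_down = true;
        Err("source box already removed".to_string())
    }
}
```

Keeping warnings separate from `samples` means a failed cleanup never turns a passing run into a failing one, and never hides the original iteration error.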

Three more cleanup paths on top of the per-scenario teardown hook:
1. **SIGINT/SIGTERM handler**: race the iteration loop against `tokio::signal::unix` futures via `select!`. When a sweep wrapper's `timeout 600` sends SIGTERM, the select-cancellation drops the iter_fut (which runs BoxGuard's Drop synchronously via `block_in_place + rt.remove`) AND we get to run the explicit teardown hook before exiting. Previously the bench would just die with no teardown.
2. **`catch_unwind` around the iteration loop**: a `panic!` inside a scenario (unwrap, index OOB, etc.) used to bring down the binary mid-cleanup. Now panics are caught, the panic message surfaces as a regular `Err`, and teardown still runs. Uses `AssertUnwindSafe` because after the panic we only invoke teardown and never read partially-mutated scenario state.
3. **`kill_descendants_of_self` last-ditch reaper**: walks `/proc/<pid>/task/<pid>/children` BFS and SIGKILLs any `libkrun VM` or `boxlite-shim` descendant after a failed run. On the happy path BoxGuard's Drop + boxlite engine self-cleanup already handle this and the reaper finds 0 procs to kill; this covers the edge case where SIGTERM hits during VMM spawn before BoxGuard wraps the box.
Verified E2E with mid-iteration SIGTERM on stability-churn:
    before: 7s exit, 2 libkrun + 3 shim orphans, 0 box dirs (TempDir drop ran but VMM children leaked)
    after: <1s exit, 0 orphans, 0 box dirs, clean DB
`futures::FutureExt::catch_unwind` brings in `futures` crate (already in deps). `nix::sys::signal::kill` for the SIGKILL (already in deps with `signal` feature).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
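
The second cleanup path can be sketched with the standard library's synchronous `std::panic::catch_unwind`. The commit itself uses `futures::FutureExt::catch_unwind` on an async loop; this sync version is a swapped-in simplification, and `run_guarded` and `panic_message` are hypothetical names.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};

/// Runs `iterate`, converts a panic into a regular `Err`, and always
/// calls `teardown` afterwards, whether `iterate` returned or panicked.
pub fn run_guarded<T>(
    iterate: impl FnOnce() -> Result<T, String>,
    teardown: impl FnOnce(),
) -> Result<T, String> {
    // AssertUnwindSafe: after a panic only teardown runs; state that
    // `iterate` may have left half-mutated is never read again here.
    let outcome = panic::catch_unwind(AssertUnwindSafe(iterate));
    teardown();
    match outcome {
        Ok(result) => result,
        Err(payload) => Err(format!(
            "scenario panicked: {}",
            panic_message(payload.as_ref())
        )),
    }
}

/// Extracts a readable message from the two common panic payload types.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(text) = payload.downcast_ref::<&str>() {
        (*text).to_string()
    } else if let Some(text) = payload.downcast_ref::<String>() {
        text.clone()
    } else {
        "non-string panic payload".to_string()
    }
}
```

Note that `catch_unwind` only helps when the binary is built with unwinding panics; under `panic = "abort"` the process still exits immediately, which is the kind of gap a last-ditch reaper is there to cover.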

…g-image
Adds the high-ROI surfaces from the code-coverage gap analysis:
**REST / WS path (SDK clients live here)**
- `latency-rest-cold-start`: full cold-start through axum + tower + serde + HTTP round-trips. Delta vs `latency-cold-start` quantifies the REST tax that every Python/Node/Go SDK call pays.
- `latency-ws-exec`: 100 echo execs through the REST/WebSocket exec channel. Per-exec mean/p50/p99/max ms; delta vs in-process `stability-exec-loop` ≈ tungstenite + axum framing cost.
- `throughput-rest-metrics-rps`: Prometheus-scrape-shaped hammer against `/v1/metrics` (16 workers × 5s). Caps scrape density.
**Workload-variant cold-starts**
- `latency-cold-start-no-net`: `NetworkSpec::Disabled` skips gvproxy. Delta vs `latency-cold-start` = gvproxy boot cost; useful for compute-only workloads that don't need network.
- `latency-cold-start-big-image`: `python:3.12-alpine` (~50 MB, multi-layer). Stresses layer-tarball-extraction at non-trivial scale; size-dependent stage scaling falls out of the diff vs alpine.
**Background-task overhead**
- `resource-healthcheck-overhead`: `HealthCheckOptions { interval: 500ms }`. Sample CPU%/RSS over 10s. Extrapolate × (real interval / 500ms) for production tuning.
Shared helper `common::ServeChild` (probe-port + spawn `boxlite serve` + poll-ready) extracted so the 3 REST scenarios don't re-implement the lifecycle. Forwards `--registry` from `GlobalFlags` to the child so its image pulls go through the same mirrors the parent was given (avoids docker.io rate limit during sweeps).
Verified E2E on this host:
    no-net total_create = 19,900 ms
    big-image total_create = 17,583 ms (python:3.12-alpine)
    healthcheck cpu_mean = 0.6 % at 500ms interval
    rest-cold-start = 18,759 ms (+REST tax)
    ws-exec mean_ms = 13.7 ms (vs in-proc 9.1 ms)
    rest-metrics-rps = 3,716 rps (vs /v1/config 9,613 rps)
50 scenarios total now; registry lockstep + uniqueness tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gamnaansong and others added 6 commits May 26, 2026 03:45

G4614 force-pushed the feat/bench branch from 6410e56 to c3a9950 May 26, 2026 03:57

G4614 changed the title ~~WIP: bench harness — 44 scenarios + cross-cutting fixes~~ bench harness — 44 scenarios across 5 axes May 26, 2026

gamnaansong and others added 7 commits May 26, 2026 05:10

G4614 wants to merge 13 commits into boxlite-ai:main from G4614:feat/bench
