v0.3.0
Observability-focused release (SQLite recording + experiment tracking). One breaking change to the model server contract.
Highlights
- SQLite recording + `vla-eval merge`: each eval writes step and episode rows to a single `recording-<eval_id>.sqlite` (WAL mode). Step rows are keyed by `(sid, eid, step_id)` and merged with a `json_patch` UPSERT, so N sharded writers and an external model server can field-union the same row concurrently. The orchestrator injects an `EpisodeRecorder` into every episode run; `vla-eval merge` then materializes per-episode jsonl + a per-benchmark aggregate JSON from the DB. `--no-save` runs in-memory only. Requires SQLite ≥ 3.38 (checked at import).
- Experiment tracking (wandb / trackio): `tracking.report_to` accepts a backend name, a list, or `"all"`/`"none"`. Backend settings come from each library's native env vars (`WANDB_*`, `TRACKIO_*`); the harness only injects the run identity (`eval_id`) + `resume="allow"` so the live and `vla-eval merge` paths converge on one run.
- Progress watchdog: a daemon thread `os._exit(124)`s a wedged benchmark (e.g. a native SAPIEN/Vulkan call that freezes the event loop so no await-timeout can fire) instead of hanging forever.
- jsonargparse-based serve: `run_server`/`cmd_serve` now route through jsonargparse. A yaml `args:` block maps onto the server class `__init__`; list/dict values round-trip as JSON and yaml `null` serializes as `null` (not the literal string `None`).
- ROCm GPU runtime: benchmark containers can run on AMD GPUs (`--device=/dev/kfd` + `/dev/dri`, `rocm-smi` device detection). (#70)
- `GET /health` readiness endpoint: returns `200 {"status": "ok"}` once the server's `__init__` returns. `GET /config` keeps its read/update role.
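
The field-union behaviour of the `json_patch` UPSERT can be illustrated with a minimal sketch. The table name `steps`, the `data` column, the column types, and the `upsert_step` helper are assumptions made for illustration; only the `(sid, eid, step_id)` key and the use of `json_patch` come from the notes above, and the actual schema may differ.

```python
import json
import sqlite3

# Hypothetical schema: only the (sid, eid, step_id) key is taken from the notes.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE steps (
        sid     TEXT    NOT NULL,
        eid     INTEGER NOT NULL,
        step_id INTEGER NOT NULL,
        data    TEXT    NOT NULL,
        PRIMARY KEY (sid, eid, step_id)
    )
    """
)


def upsert_step(conn, sid, eid, step_id, fields):
    """Insert a step row, or merge `fields` into the existing row's JSON."""
    conn.execute(
        """
        INSERT INTO steps (sid, eid, step_id, data)
        VALUES (?, ?, ?, json(?))
        ON CONFLICT (sid, eid, step_id)
        DO UPDATE SET data = json_patch(steps.data, excluded.data)
        """,
        (sid, eid, step_id, json.dumps(fields)),
    )
    conn.commit()


# Two writers contribute different fields to the same step row.
upsert_step(conn, "shard0", 3, 17, {"reward": 0.5, "done": False})
upsert_step(conn, "shard0", 3, 17, {"action": [0.25, -0.5]})

row = conn.execute(
    "SELECT data FROM steps WHERE sid = ? AND eid = ? AND step_id = ?",
    ("shard0", 3, 17),
).fetchone()
merged = json.loads(row[0])
```

After both calls the table holds a single row whose JSON contains the fields from both writers. Because `json_patch` follows JSON Merge Patch semantics, a writer that sends a `null` value removes that key rather than storing it.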
⚠️ Breaking changes
- Model server weight-loading contract: subclasses must now load all weights in `__init__` (including JIT warmup if first-call latency would blow the HELLO response budget). The `_load_model` getattr hook and the lazy-load re-entry guards are gone; the framework starts accepting connections as soon as `__init__` returns. All 11 bundled servers (cogact, dexbotic/cogact, groot, mme_vla, molmobot, oft, openvla, pi0, starvla, vlanext, xvla) are migrated. Custom external servers must migrate themselves.
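
For custom servers, the migration amounts to moving weight loading (and any warmup) into the constructor. The sketch below is only an illustration of that pattern: `ExampleServer`, `_build_model`, `predict`, and the observation format are hypothetical names, and the real base class and its method signatures are not shown in these notes.

```python
class ExampleServer:
    """Hypothetical stand-in for a custom model server subclass."""

    def __init__(self, checkpoint: str, warmup: bool = True):
        self.checkpoint = checkpoint
        # Load weights eagerly: there is no lazy-load hook to fall back on,
        # and connections are accepted as soon as __init__ returns.
        self.model = self._build_model(checkpoint)
        if warmup:
            # One dummy call so JIT/compile cost is paid before the first client.
            self.predict({"observation": [0.0]})

    def _build_model(self, checkpoint: str):
        # Placeholder for real weight loading (assumption: returns a callable).
        return lambda obs: [0.0 for _ in obs["observation"]]

    def predict(self, obs: dict) -> list:
        return self.model(obs)


server = ExampleServer("path/to/checkpoint")
```

With this shape, a readiness probe that succeeds once `__init__` has returned also implies the model is loaded and warmed up.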
Recording hardening
- New opt-in `docker.user`: `docker.user: "host"` runs the benchmark container as the host `uid:gid` (or `"<uid>:<gid>"` to pin) so recording files are not root-owned and a different-uid external writer can co-write. Default is unchanged (image-default user).
- The shared recording DB is `chmod 666` (best-effort, including its `-wal`/`-shm` siblings) so a different-uid writer can field-union via `json_patch`; SQLite WAL requires the `-shm` file to be writable by every writer.
- Retry the WAL-switch lock (`busy_timeout` + bounded retry) so N sharded writers opening one fresh DB don't lose the "database is locked" race.
- Warn when the recording DB sits on a network/parallel filesystem (WAL needs coherent shared memory + advisory locks, which those filesystems do not reliably provide).
- Include the task name in `DEFAULT_FILENAME_STEM` (`{name}_ep{episode_idx:04d}_{status}`).
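
The WAL-switch retry can be sketched roughly as follows. This is not the harness's implementation: the `open_wal` helper, its parameter names, and the default values are assumptions chosen for illustration. It combines SQLite's `busy_timeout` with a bounded retry loop around `PRAGMA journal_mode=WAL`.

```python
import os
import sqlite3
import tempfile
import time


def open_wal(path, busy_timeout_ms=5000, max_attempts=5, backoff_s=0.05):
    """Open `path` and switch it to WAL, retrying a bounded number of times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    conn = sqlite3.connect(path, timeout=busy_timeout_ms / 1000)
    # Let SQLite itself wait on locks before reporting "database is locked".
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    for attempt in range(1, max_attempts + 1):
        try:
            mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
            return conn, mode
        except sqlite3.OperationalError as exc:
            # Retry only lock contention, and only up to the attempt limit.
            if "locked" not in str(exc).lower() or attempt == max_attempts:
                conn.close()
                raise
            time.sleep(backoff_s * attempt)  # linear backoff between attempts


with tempfile.TemporaryDirectory() as tmp:
    conn, mode = open_wal(os.path.join(tmp, "recording-demo.sqlite"))
    conn.close()
```

The retry is bounded so that a genuinely stuck lock still surfaces as an error instead of blocking a shard indefinitely.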
Fixes
- docker build: strip ANSI escapes (`NO_COLOR=1` + `-q`) from `hatch version` output, which was breaking setuptools-scm parsing inside the image build.
- `run_sharded.sh`: fail clearly when `vla-eval` is not on PATH.
- `vla-eval merge` drives the tracker lifecycle (`on_eval_begin`/`on_benchmark_begin`) so sharded runs emit the same tracking as the live path.
Maintenance
- Leaderboard monthly content updates (2026-05, 2026-06).
- Bump `actions/checkout` 6 → 7.
Recording, tracking, jsonargparse serve, watchdog, the recording hardening, /health, and the fixes above shipped in #72; the ROCm runtime in #70.