v0.3.0

Observability-focused release (SQLite recording + experiment tracking). One breaking change to the model server contract.

Highlights

SQLite recording + vla-eval merge: each eval writes step and episode rows to a single recording-<eval_id>.sqlite (WAL mode). Step rows are keyed by (sid, eid, step_id) and merged with a json_patch UPSERT, so N sharded writers and an external model server can field-union the same row concurrently. The orchestrator injects an EpisodeRecorder into every episode run; vla-eval merge then materializes per-episode jsonl + a per-benchmark aggregate JSON from the DB. --no-save runs in-memory only. Requires SQLite ≥ 3.38 (checked at import).
Experiment tracking (wandb / trackio): tracking.report_to accepts a backend name, a list, or "all" / "none". Backend settings come from each library's native env vars (WANDB_*, TRACKIO_*); the harness only injects the run identity (eval_id) + resume="allow" so the live and vla-eval merge paths converge on one run.
Progress watchdog: a daemon thread calls os._exit(124) when a benchmark wedges (e.g. a native SAPIEN/Vulkan call that freezes the event loop so no await-timeout can fire), instead of letting it hang forever.
jsonargparse-based serve: run_server / cmd_serve now route through jsonargparse. A yaml args: block maps onto the server class __init__; list/dict values round-trip as JSON and yaml null serializes as null (not the literal string None).
ROCm GPU runtime: benchmark containers can run on AMD GPUs (--device=/dev/kfd + /dev/dri, rocm-smi device detection). (#70)
GET /health readiness endpoint: returns 200 {"status": "ok"} once the server's __init__ returns. GET /config keeps its read/update role.
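
The json_patch UPSERT from the SQLite recording item above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the harness's actual schema: the table name `steps`, the `data` column, the column types, and the `upsert_step` helper are all assumptions made for the example. Only the (sid, eid, step_id) key and the use of json_patch come from the notes.

```python
import json
import sqlite3

# Hypothetical schema: one JSON document per step, keyed by (sid, eid, step_id).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE steps (
        sid     TEXT    NOT NULL,
        eid     INTEGER NOT NULL,
        step_id INTEGER NOT NULL,
        data    TEXT    NOT NULL,
        PRIMARY KEY (sid, eid, step_id)
    )
    """
)


def upsert_step(conn, sid, eid, step_id, fields):
    """Insert a step row, or merge `fields` into the existing row's JSON.

    json_patch applies an RFC 7396 merge patch: keys from the new document
    are added or overwritten, and a null value deletes the key.
    """
    conn.execute(
        """
        INSERT INTO steps (sid, eid, step_id, data)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (sid, eid, step_id)
        DO UPDATE SET data = json_patch(steps.data, excluded.data)
        """,
        (sid, eid, step_id, json.dumps(fields)),
    )
    conn.commit()


# Two writers contribute different fields to the same step row.
upsert_step(conn, "shard0", 3, 17, {"reward": 0.5, "done": False})
upsert_step(conn, "shard0", 3, 17, {"action": [0.1, -0.2]})

row = conn.execute(
    "SELECT data FROM steps WHERE sid = ? AND eid = ? AND step_id = ?",
    ("shard0", 3, 17),
).fetchone()
merged = json.loads(row[0])
```

Because each writer only patches the keys it owns, the order in which the two upserts arrive does not change the final row, which is what lets sharded writers and an external model server share one row.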

⚠️ Breaking changes

Model server weight-loading contract: subclasses must now load all weights in __init__ (including JIT warmup if first-call latency would blow the HELLO response budget). The _load_model getattr hook and the lazy-load re-entry guards are gone; the framework starts accepting connections as soon as __init__ returns. All 11 bundled servers (cogact, dexbotic/cogact, groot, mme_vla, molmobot, oft, openvla, pi0, starvla, vlanext, xvla) are migrated. Custom external servers must migrate themselves.
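
For custom external servers, the migration amounts to moving weight loading (and any warmup) out of a lazy hook and into __init__. The sketch below shows only that shape. `MyModelServer`, `DummyPolicy`, the `checkpoint` argument, and `predict` are hypothetical names; a real server would subclass the framework's server base class, which is not shown here.

```python
class DummyPolicy:
    """Hypothetical stand-in for real model weights."""

    def __init__(self, checkpoint):
        self.checkpoint = checkpoint
        self.calls = 0

    def __call__(self, observation):
        self.calls += 1
        return [0.0] * 7  # placeholder action vector


class MyModelServer:
    """Illustrative server; a real one would subclass the framework base class."""

    def __init__(self, checkpoint="dummy-checkpoint"):
        # Load all weights eagerly: connections are accepted as soon as
        # __init__ returns, so nothing may be deferred to a lazy hook.
        self.model = DummyPolicy(checkpoint)
        # Optional warmup call so first-request latency (e.g. JIT
        # compilation) is paid here rather than on the first real request.
        self.model({"image": None})

    def predict(self, observation):
        # No lazy-load guard needed: the model is always ready here.
        return self.model(observation)


server = MyModelServer()
action = server.predict({"image": None})
```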

Recording hardening

New opt-in docker.user setting: docker.user: "host" runs the benchmark container as the host uid:gid (or "<uid>:<gid>" to pin) so recording files are not root-owned and a different-uid external writer can co-write. Default is unchanged (image-default user).
The shared recording DB is chmod 666 (best-effort, including its -wal / -shm siblings) so a different-uid writer can field-union via json_patch; SQLite WAL requires the -shm file to be writable by every writer.
Retry the WAL-switch lock (busy_timeout + bounded retry) so N sharded writers opening one fresh DB don't lose the "database is locked" race.
Warn when the recording DB sits on a network/parallel filesystem (WAL needs coherent shared memory + advisory locks those do not reliably provide).
Include the task name in DEFAULT_FILENAME_STEM ({name}_ep{episode_idx:04d}_{status}).
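
The WAL-switch retry above (busy_timeout plus a bounded retry) might look roughly like the following. This is a sketch under assumptions, not the harness's code: the `open_wal` helper, its parameters, and the linear backoff are all invented for illustration.

```python
import sqlite3
import time


def open_wal(path, attempts=5, busy_timeout_ms=5000, backoff_s=0.1):
    """Open `path` and switch it to WAL mode, retrying on lock contention.

    Hypothetical helper: several writers opening one fresh DB can race on
    the journal-mode switch, so the switch is retried a bounded number of
    times before giving up.
    """
    conn = sqlite3.connect(path)
    # Let SQLite itself wait on locks before reporting "database is locked".
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    for attempt in range(attempts):
        try:
            mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
            if mode.lower() == "wal":
                return conn
        except sqlite3.OperationalError as exc:
            # Only retry lock contention; re-raise anything else.
            if "locked" not in str(exc).lower():
                conn.close()
                raise
        time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
    conn.close()
    raise RuntimeError(f"could not switch {path} to WAL after {attempts} attempts")
```

The helper treats both an exception and an unchanged journal mode as a failed attempt, since the PRAGMA can report the old mode instead of raising when it cannot take the lock.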

Fixes

docker build: strip ANSI escapes (NO_COLOR=1 + -q) from hatch version output, which was breaking setuptools-scm parsing inside the image build.
run_sharded.sh: fail clearly when vla-eval is not on PATH.
vla-eval merge drives the tracker lifecycle (on_eval_begin / on_benchmark_begin) so sharded runs emit the same tracking as the live path.

Maintenance

Leaderboard monthly content updates (2026-05, 2026-06).
Bump actions/checkout 6 → 7.

Recording, tracking, jsonargparse serve, watchdog, the recording hardening, /health, and the fixes above shipped in #72; the ROCm runtime in #70.
