v0.4.0 — Rollout/Sandbox architecture release
What changed
BenchFlow v0.4.0 is the architecture release. It is much larger than the hosted-environment adapter alone: the release moves BenchFlow onto the Rollout/Sandbox core, removes the old Harbor-centered framing, adds composable rewards and external adapters, refreshes CLI/docs, and closes a set of dogfood regressions found while preparing the v0.4 boundary.
Compared with v0.3.3, this release changes 210 files with roughly 18k lines added and 6k removed.
Install
pip install benchflow==0.4.0
PyPI is published and verified as benchflow==0.4.0.
Core architecture
- Makes Rollout the canonical execution unit and Evaluation the canonical batch/eval accounting path. Legacy names such as Trial, TrialConfig, Job, and JobConfig remain as compatibility aliases, but docs and examples are now Rollout-first.
- Adds a unified Scene/Role/Turn type layer with per-role configuration, parallel_group, and cleaner role/scene modeling.
- Introduces the BenchFlow-native Sandbox and ImageBuilder protocols, with Docker, Daytona, and Modal implementations. Harbor is no longer a core dependency or the conceptual center of the runtime.
- Consolidates the public API and module layout around benchflow.rollout, benchflow.evaluation, benchflow.sandbox, benchflow.task, benchflow.traces, and provider/agent utility modules.
- Adds reusable task path/config/verifier helpers and trace import/generation infrastructure.
Public PRs: #274, #294. Refactor sub-PRs: #261, #262, #268.
Rewards and verifiers
- Adds a composable rewards package with Rubric, RewardFunc, RewardEvent, built-in reward functions, rubric config loading, and file-reader helpers.
- Adds first-class LLM-as-judge verifier support, including dense reward events and rubric-style judging flows.
- Fixes reward/metrics edge cases such as None rewards and non-finite ORS JSON values.
Public PRs: #274, #277, #294. Refactor sub-PR: #266.
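To illustrate the kind of edge case mentioned above, here is a minimal sketch of combining weighted reward functions while tolerating None and non-finite scores. It is not BenchFlow's implementation: `RewardFn`, `clean_reward`, `combine`, and `to_strict_json` are hypothetical names, and the weighted-mean rule is an assumption chosen for the example.

```python
import json
import math
from typing import Callable, Optional

# Hypothetical signature: a reward function maps a rollout record to a score,
# or None when it cannot score the rollout. Not BenchFlow's actual RewardFunc.
RewardFn = Callable[[dict], Optional[float]]


def clean_reward(value: Optional[float]) -> Optional[float]:
    """Return a finite float, or None for missing and non-finite values."""
    if value is None:
        return None
    value = float(value)
    return value if math.isfinite(value) else None


def combine(weighted: list[tuple[float, RewardFn]], rollout: dict) -> Optional[float]:
    """Weighted mean over the reward functions that produced a usable score."""
    total = 0.0
    weight_sum = 0.0
    for weight, fn in weighted:
        score = clean_reward(fn(rollout))
        if score is None:
            continue  # skip missing or non-finite scores so they cannot poison the mean
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else None


def to_strict_json(metrics: dict) -> str:
    """Serialize metrics as standards-compliant JSON (no NaN/Infinity tokens)."""
    cleaned = {key: clean_reward(val) for key, val in metrics.items()}
    return json.dumps(cleaned, allow_nan=False, sort_keys=True)
```

Passing `allow_nan=False` makes `json.dumps` raise on any non-finite value that slips through, rather than emitting `NaN` or `Infinity`, which are not valid JSON.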
External adapters and hosted environments
- Adds external framework adapters for Inspect AI and OpenAI Reinforcement Store / ORS-style JSON.
- Adds first-class hosted environment support for PrimeIntellect / Verifiers:
  - bench eval create --source-env ...
  - bench environment list --hub primeintellect
  - bench environment show
  - bench environment inspect
- Hosted environments keep native identity (env_uid, hub_url) and provider-owned harness/sandbox behavior instead of being treated as BenchFlow task directories or compatibility shims.
- Hosted-env runs install the versioned provider package into an isolated venv, run through vf-eval, and record logs, command, reward, tool calls, provider errors, and metadata.
Public PRs: #274, #290, #294. Refactor sub-PR: #271.
CLI and docs
- Promotes bench eval create as the first-class execution command and deprecates old bench run style flows.
- Standardizes on --sandbox terminology and removes short one-letter flags in favor of explicit long flags.
- Refreshes README, concepts, getting-started, running-benchmarks, Python API, CLI reference, task-authoring, skill-eval, and integration-test docs for the v0.4 surface.
- Removes stale Trial/Harbor migration language from the docs and examples.
- Adds the skills/citation-management eval asset and a v0.4 skill-eval report.
Public PRs: #274, #281, #282, #284, #294.
Agents, auth, and sandbox hardening
- Adds Codex subscription/access-token auth support in Daytona and native codex-acp flows:
  - auto-inherits CODEX_ACCESS_TOKEN / CODEX_API_KEY
  - maps CODEX_API_KEY to OPENAI_API_KEY for native Codex auth writing
  - keeps subscription/auth-token paths out of custom OpenAI-compatible endpoints
- Auto-loads .env for CLI execution so users do not need set -a && source .env && set +a for common provider variables.
- Improves provider/base-url forwarding for agent containers and custom endpoints.
- Fixes skill double-deploy behavior for Dockerfile-injected tasks.
- Hardens shell usage in scene/snapshot helpers with quoted paths and path traversal validation.
- Fixes Modal/Daytona/runtime edge cases found in dogfood, including Modal optional dependencies, sandbox-user normalization, remote repo helper imports, and eval-list/skill-eval summary handling.
Public PRs: #285, #286, #294, #296. Original hardening/fix PRs: #230, #242.
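As a rough illustration of what "quoted paths and path traversal validation" can look like, the sketch below resolves a user-supplied relative path under a root directory, rejects anything that escapes it, and shell-quotes every path before building a command line. The helper names `resolve_inside` and `copy_command` are hypothetical and are not taken from BenchFlow's scene/snapshot helpers.

```python
import shlex
from pathlib import Path


def resolve_inside(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting paths that escape it."""
    root = root.resolve()
    candidate = (root / relative).resolve()
    # `..` segments and absolute inputs both end up outside `root` after resolve().
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes {root}: {relative!r}")
    return candidate


def copy_command(root: Path, relative: str, dest: str) -> str:
    """Build a `cp -r` command line with every path shell-quoted."""
    source = resolve_inside(root, relative)
    # `--` stops option parsing, so a path starting with `-` is not read as a flag.
    return f"cp -r -- {shlex.quote(str(source))} {shlex.quote(dest)}"
```

Validating first and quoting second keeps the two concerns separate: validation decides whether the path is allowed at all, and `shlex.quote` ensures that spaces or shell metacharacters in an allowed path are passed through as literal text.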
Benchmarks and integrations
- Carries forward the Harvey LAB adapter/converter work and aligns benchmark runners with the v0.4 Rollout/Sandbox API.
- Updates ProgramBench integration to the new source layout and v0.4 runner conventions.
- Adds integration-suite release readiness checks and adapter evidence tooling.
- Updates conformance scripts and example scripts for the new auth, sandbox, and CLI conventions.
Public PRs: #239, #237, #274, #294, #296.
Validation
Release artifact validation from the clean v0.4.0 tag worktree:
uv build
uv run --extra dev python -m pytest tests/test_hosted_env.py tests/test_reexport.py tests/test_yaml_config.py
uv run ruff check .
uv run ty check src/
Result: build succeeded, focused release tests passed (31 passed), ruff passed, and ty passed.
GitHub Actions on main for the merge commit passed.
Hosted PrimeIntellect / Verifiers smoke passed with explicit provider sampling args:
uv run bench eval create \
  --source-env primeintellect/general-agent \
  --source-env-version 0.1.1 \
  --source-env-arg task=calendar_scheduling_t0 \
  --agent gemini \
  --model google/gemini-2.5-flash-lite \
  --source-env-sampling-arg reasoning_effort=minimal
Result: reward 1.0, tool calls 2, no Verifiers error.
A smoke run without reasoning_effort=minimal still completed the harness path but scored 0.0. Provider- and model-specific sampling options are now explicit and should be passed when needed.
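Because sampling options now have to be passed explicitly, it can help to assemble the command programmatically. The sketch below builds the same argv as the smoke command above, using only flags that appear in these notes. `build_eval_command` is a hypothetical helper, and repeating `--source-env-arg` or `--source-env-sampling-arg` for multiple values is an assumption the notes do not confirm.

```python
import shlex


def build_eval_command(
    source_env: str,
    version: str,
    agent: str,
    model: str,
    env_args: dict[str, str],
    sampling_args: dict[str, str],
) -> list[str]:
    """Assemble argv for `bench eval create` using only flags shown in these notes."""
    argv = [
        "uv", "run", "bench", "eval", "create",
        "--source-env", source_env,
        "--source-env-version", version,
    ]
    # Assumption: one flag occurrence per key=value pair.
    for key, value in env_args.items():
        argv += ["--source-env-arg", f"{key}={value}"]
    argv += ["--agent", agent, "--model", model]
    for key, value in sampling_args.items():
        argv += ["--source-env-sampling-arg", f"{key}={value}"]
    return argv


argv = build_eval_command(
    source_env="primeintellect/general-agent",
    version="0.1.1",
    agent="gemini",
    model="google/gemini-2.5-flash-lite",
    env_args={"task": "calendar_scheduling_t0"},
    sampling_args={"reasoning_effort": "minimal"},
)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` without going through a shell, so no extra quoting is needed.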
Merged PRs
- #274 — v0.4 architecture consolidation: unified types, Rollout execution path, Sandbox protocol, rewards, agent capabilities, Inspect/ORS adapters, docs cleanup. Consolidates #261, #262, #265, #266, #268, and #271.
- #285 — backports skills double-deploy and shell-injection hardening from #230 and #242.
- #294 — v0.4 refactor integration on main: module consolidation, package extras, .env loading, trace import, task helpers, docs, skill eval assets, sandbox backends.
- #296 — Codex subscription/access-token auth support in Daytona and native Codex flows.
- #290 — hosted PrimeIntellect / Verifiers environment source adapter and bench environment hosted-hub commands.
Compatibility notes
- The old Trial/Job names are still available as aliases, but new code should use Rollout/Evaluation terminology.
- --sandbox is the canonical sandbox flag.
- Short CLI flags were removed; use explicit long flags.
- Hosted environments are provider-owned harnesses. BenchFlow records and orchestrates them, but --sandbox remains for native BenchFlow task sources.
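For readers unsure what "available as aliases" means in practice for the Trial/Job names, the following sketch shows the general Python pattern with stand-in classes. The `Rollout` class and its fields here are hypothetical placeholders, not BenchFlow's real definitions. The point is only that an alias is a second name bound to the same class object.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Stand-in type for this sketch only; the fields are hypothetical."""
    task: str
    agent: str


# A compatibility alias is a second name bound to the same class object,
# so existing code that uses the old name keeps working unchanged.
Trial = Rollout

legacy = Trial(task="demo-task", agent="demo-agent")
current = Rollout(task="demo-task", agent="demo-agent")
```

Under this pattern, objects created through the old name are instances of the new type, so `isinstance` checks and equality comparisons behave the same regardless of which name was used.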