Release v0.4.0 — Rollout/Sandbox architecture release · benchflow-ai/benchflow

What changed

BenchFlow v0.4.0 is the architecture release. It is much larger than the hosted-environment adapter alone: the release moves BenchFlow onto the Rollout/Sandbox core, removes the old Harbor-centered framing, adds composable rewards and external adapters, refreshes CLI/docs, and closes a set of dogfood regressions found while preparing the v0.4 boundary.

Compared with v0.3.3, this release changes 210 files with roughly 18k lines added and 6k removed.

Install

pip install benchflow==0.4.0

The package is published on PyPI and verified as benchflow==0.4.0.

Core architecture

Makes Rollout the canonical execution unit and Evaluation the canonical batch/eval accounting path. Legacy names such as Trial, TrialConfig, Job, and JobConfig remain as compatibility aliases, but docs and examples are now Rollout-first.
Adds a unified Scene/Role/Turn type layer with per-role configuration, parallel_group, and cleaner role/scene modeling.
Introduces the BenchFlow-native Sandbox and ImageBuilder protocols, with Docker, Daytona, and Modal implementations. Harbor is no longer a core dependency or the conceptual center of the runtime.
Consolidates the public API and module layout around benchflow.rollout, benchflow.evaluation, benchflow.sandbox, benchflow.task, benchflow.traces, and provider/agent utility modules.
Adds reusable task path/config/verifier helpers and trace import/generation infrastructure.
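
As a rough illustration of the protocol-based Sandbox design described above, the sketch below uses Python's typing.Protocol. The method names (start, exec, stop), the FakeSandbox backend, and the run_in helper are all assumptions made for this example; they are not BenchFlow's actual Sandbox interface, which may look quite different.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Sandbox(Protocol):
    """Hypothetical interface; BenchFlow's real protocol may differ."""

    def start(self) -> None: ...

    def exec(self, command: str) -> tuple[int, str]: ...

    def stop(self) -> None: ...


class FakeSandbox:
    """In-memory backend used only for illustration; it runs nothing."""

    def __init__(self) -> None:
        self.running = False
        self.history: list[str] = []

    def start(self) -> None:
        self.running = True

    def exec(self, command: str) -> tuple[int, str]:
        if not self.running:
            raise RuntimeError("sandbox not started")
        self.history.append(command)
        return 0, f"ran: {command}"

    def stop(self) -> None:
        self.running = False


def run_in(sandbox: Sandbox, command: str) -> tuple[int, str]:
    """Start the sandbox, run one command, and always stop it afterwards."""
    sandbox.start()
    try:
        return sandbox.exec(command)
    finally:
        sandbox.stop()
```

Because the caller depends only on the protocol, a Docker, Daytona, or Modal backend could be swapped in without touching run_in, which is the general benefit of this style of design.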

Public PRs: #274, #294. Refactor sub-PRs: #261, #262, #268.

Rewards and verifiers

Adds a composable rewards package with Rubric, RewardFunc, RewardEvent, built-in reward functions, rubric config loading, and file-reader helpers.
Adds first-class LLM-as-judge verifier support, including dense reward events and rubric-style judging flows.
Fixes reward/metrics edge cases such as None rewards and non-finite ORS JSON values.
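
To make the idea of composable rewards and the None / non-finite edge cases more concrete, here is a minimal sketch. The Rubric and RewardFunc names are borrowed from the notes above, but the signatures, the weighting scheme, and the choice to skip unusable scores are assumptions for illustration, not the package's real behavior.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical signature: a reward function maps agent output to a score,
# or returns None when it cannot judge the output.
RewardFunc = Callable[[str], Optional[float]]


def usable(value: Optional[float]) -> bool:
    """True only for real, finite numbers (rejects None, NaN, and +/-inf)."""
    return value is not None and math.isfinite(value)


@dataclass
class Rubric:
    """Weighted mean over the reward functions that returned a usable score."""

    items: list[tuple[RewardFunc, float]]

    def score(self, output: str) -> Optional[float]:
        scored = [(fn(output), weight) for fn, weight in self.items]
        kept = [(s, w) for s, w in scored if usable(s)]
        total = sum(w for _, w in kept)
        if total == 0:
            return None  # nothing usable: report "no reward" rather than 0.0
        return sum(s * w for s, w in kept) / total


def contains_answer(output: str) -> Optional[float]:
    return 1.0 if "42" in output else 0.0


def is_concise(output: str) -> Optional[float]:
    return 1.0 if len(output) <= 20 else 0.0


def broken_judge(output: str) -> Optional[float]:
    return float("nan")  # simulates a judge that produced an unusable value


rubric = Rubric(items=[(contains_answer, 3.0), (is_concise, 1.0), (broken_judge, 2.0)])
```

Returning None when no score is usable keeps "no reward" distinguishable from a genuine 0.0 in downstream metrics.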

Public PRs: #274, #277, #294. Refactor sub-PR: #266.

External adapters and hosted environments

Adds external framework adapters for Inspect AI and OpenAI Reinforcement Store / ORS-style JSON.
Adds first-class hosted environment support for PrimeIntellect / Verifiers:
- bench eval create --source-env ...
- bench environment list --hub primeintellect
- bench environment show
- bench environment inspect
Hosted environments keep native identity (env_uid, hub_url) and provider-owned harness/sandbox behavior instead of being treated as BenchFlow task directories or compatibility shims.
Hosted-env runs install the versioned provider package into an isolated venv, run through vf-eval, and record logs, command, reward, tool calls, provider errors, and metadata.

Public PRs: #274, #290, #294. Refactor sub-PR: #271.

CLI and docs

Promotes bench eval create as the first-class execution command and deprecates the old bench run-style flows.
Standardizes on --sandbox terminology and removes short one-letter flags in favor of explicit long flags.
Refreshes README, concepts, getting-started, running-benchmarks, Python API, CLI reference, task-authoring, skill-eval, and integration-test docs for the v0.4 surface.
Removes stale Trial/Harbor migration language from the docs and examples.
Adds the skills/citation-management eval asset and a v0.4 skill-eval report.

Public PRs: #274, #281, #282, #284, #294.

Agents, auth, and sandbox hardening

Adds Codex subscription/access-token auth support in Daytona and native codex-acp flows:
- auto-inherits CODEX_ACCESS_TOKEN / CODEX_API_KEY
- maps CODEX_API_KEY to OPENAI_API_KEY for native Codex auth writing
- keeps subscription/auth-token paths out of custom OpenAI-compatible endpoints
Auto-loads .env for CLI execution so users do not need set -a && source .env && set +a for common provider variables.
Improves provider/base-url forwarding for agent containers and custom endpoints.
Fixes skill double-deploy behavior for Dockerfile-injected tasks.
Hardens shell usage in scene/snapshot helpers with quoted paths and path traversal validation.
Fixes Modal/Daytona/runtime edge cases found in dogfood, including Modal optional dependencies, sandbox-user normalization, remote repo helper imports, and eval-list/skill-eval summary handling.
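
The shell hardening item above (quoted paths plus path traversal validation) follows a common pattern that can be sketched with the standard library alone. The helper names resolve_inside and copy_command, and the cp -r command, are hypothetical and chosen for this example; they do not reflect BenchFlow's actual scene/snapshot helpers.

```python
import shlex
from pathlib import Path


def resolve_inside(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting anything that escapes it."""
    base = root.resolve()
    candidate = (base / relative).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes {base}: {relative!r}")
    return candidate


def copy_command(root: Path, relative: str, destination: str) -> str:
    """Build a `cp -r` command line with both paths shell-quoted."""
    source = resolve_inside(root, relative)
    return f"cp -r {shlex.quote(str(source))} {shlex.quote(destination)}"
```

Quoting with shlex.quote means a path containing spaces or shell metacharacters is passed as a single argument, and the containment check rejects inputs such as ../secrets or an absolute path outside the root.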

Public PRs: #285, #286, #294, #296. Original hardening/fix PRs: #230, #242.
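
For readers curious what the .env auto-loading in this section replaces, the sketch below shows one simple way to read KEY=VALUE lines and apply them to the process environment. It is an assumption-laden illustration: the function names are hypothetical, the parser handles only a small subset of .env syntax, and the policy of never overriding variables that are already set is a guess, not a documented BenchFlow behavior.

```python
import os
from typing import MutableMapping


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks, comments, and junk."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue  # not a KEY=VALUE pair
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]  # strip one layer of matching quotes
        values[key] = value
    return values


def apply_missing(
    values: dict[str, str],
    environ: MutableMapping[str, str] = os.environ,
) -> list[str]:
    """Set only the variables that are not already defined; return their names."""
    added: list[str] = []
    for key, value in values.items():
        if key not in environ:
            environ[key] = value
            added.append(key)
    return added
```

Leaving already-set variables untouched lets an explicit shell export win over the file, which is a common convention for .env loaders.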

Benchmarks and integrations

Carries forward the Harvey LAB adapter/converter work and aligns benchmark runners with the v0.4 Rollout/Sandbox API.
Updates ProgramBench integration to the new source layout and v0.4 runner conventions.
Adds integration-suite release readiness checks and adapter evidence tooling.
Updates conformance scripts and example scripts for the new auth, sandbox, and CLI conventions.

Public PRs: #239, #237, #274, #294, #296.

Validation

Release artifact validation from the clean v0.4.0 tag worktree:

uv build
uv run --extra dev python -m pytest tests/test_hosted_env.py tests/test_reexport.py tests/test_yaml_config.py
uv run ruff check .
uv run ty check src/

Result: build succeeded, focused release tests passed (31 passed), ruff passed, and ty passed.

GitHub Actions on main for the merge commit passed.

The hosted PrimeIntellect / Verifiers smoke test passed with explicit provider sampling args:

uv run bench eval create \
  --source-env primeintellect/general-agent \
  --source-env-version 0.1.1 \
  --source-env-arg task=calendar_scheduling_t0 \
  --agent gemini \
  --model google/gemini-2.5-flash-lite \
  --source-env-sampling-arg reasoning_effort=minimal

Result: reward 1.0, tool calls 2, no Verifiers error.

A smoke run without reasoning_effort=minimal still completed the harness path but scored 0.0; provider/model-specific sampling options are now explicit and should be passed when needed.
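
The command above passes provider options as repeated key=value pairs. How BenchFlow parses them is not stated here; the following is only a generic sketch of turning such pairs into a dictionary, with a hypothetical helper name.

```python
def parse_key_value_args(pairs: list[str]) -> dict[str, str]:
    """Turn ["a=1", "b=x=y"] into {"a": "1", "b": "x=y"}."""
    parsed: dict[str, str] = {}
    for pair in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        parsed[key] = value
    return parsed
```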

Merged PRs

#274 — v0.4 architecture consolidation: unified types, Rollout execution path, Sandbox protocol, rewards, agent capabilities, Inspect/ORS adapters, docs cleanup. Consolidates #261, #262, #265, #266, #268, and #271.
#285 — backports skills double-deploy and shell-injection hardening from #230 and #242.
#294 — v0.4 refactor integration on main: module consolidation, package extras, .env loading, trace import, task helpers, docs, skill eval assets, sandbox backends.
#296 — Codex subscription/access-token auth support in Daytona and native Codex flows.
#290 — hosted PrimeIntellect / Verifiers environment source adapter and bench environment hosted-hub commands.

Compatibility notes

The old Trial/Job names are still available as aliases, but new code should use Rollout/Evaluation terminology.
--sandbox is the canonical sandbox flag.
Short CLI flags were removed; use explicit long flags.
Hosted environments are provider-owned harnesses. BenchFlow records and orchestrates them, but --sandbox remains for native BenchFlow task sources.
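
The alias note at the top of this list can be pictured with plain Python name binding. The classes below are stand-ins with hypothetical fields, and pairing Job with Evaluation is an assumption made for the example; the real alias definitions live in the package and may be implemented differently.

```python
class Rollout:
    """Stand-in for the canonical execution unit; fields are hypothetical."""

    def __init__(self, task: str) -> None:
        self.task = task


class Evaluation:
    """Stand-in for the canonical batch path; fields are hypothetical."""

    def __init__(self, rollouts: list[Rollout]) -> None:
        self.rollouts = rollouts


# A compatibility alias is a second name bound to the same class object,
# so isinstance checks and subclassing keep working for legacy callers.
Trial = Rollout
Job = Evaluation  # assumed pairing, for illustration only
```

With this kind of alias, code written against the old names keeps running unchanged while new code adopts the Rollout/Evaluation names.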
