Conversation
…169)

* fix: skip model/API-key validation for oracle agent

  The oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for whatever model the CLI defaulted to (claude-haiku-4-5-20251001). This made `bench eval create -a oracle` fail without ANTHROPIC_API_KEY set, even though oracle doesn't need it.

* fix: don't assign default model to oracle agent

  Move the fix from resolve_agent_env to the CLI layer: oracle runs solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL at all. Both _run_single and _run_batch now pass model=None for oracle. Widen JobConfig.model to str | None to support this.

* fix: openhands install — use uv tool install or pip install openhands-ai

  The PyPI package 'openhands' (0.0.0) is a placeholder, not the CLI. The real install is 'uv tool install openhands' (preferred) or 'pip install openhands-ai'. Tries uv first, falls back to pip. Fixes the #169 runtime error: 'openhands: command not found'.

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
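The uv-first, pip-fallback selection described above can be sketched as a pure function; the helper name `openhands_install_cmd` is hypothetical, not the actual benchflow code:

```python
def openhands_install_cmd(uv_available: bool) -> list[str]:
    """Prefer `uv tool install openhands`; fall back to `pip install
    openhands-ai` (the real CLI package, not the PyPI placeholder)."""
    if uv_available:
        return ["uv", "tool", "install", "openhands"]
    return ["pip", "install", "openhands-ai"]
```

Keeping the decision in a pure function makes the uv/pip branch trivially testable without running either installer.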
Five fixes for issue #169 (openhands: command not found):

1. PATH: add $HOME/.local/bin to launch_cmd so the uv-installed binary is found
2. Interpreter access: chmod o+x on the /root path chain so the sandbox user can reach the uv-managed Python shebang at /root/.local/share/uv/tools/
3. ACP auth: seed ~/.openhands/agent_settings.json at install (OpenHands _is_authenticated() requires it) and overwrite it with the real LLM_MODEL/KEY at launch (workaround for OpenHands ACP not applying --override-with-envs in _create_conversation)
4. Model env: add BENCHFLOW_PROVIDER_MODEL → LLM_MODEL to env_mapping
5. CWD: remove the hardcoded cd /home/{user} from build_priv_drop_cmd — it overrode the docker -w /app workspace, causing agents to write files in the wrong directory

Also adds home_dirs=[".openhands"] so setup_sandbox_user copies the settings dir to the agent user.

Tested: bench eval create + bench run, with both sandbox_user=agent and root; gemini agent regression-verified; 45/45 registry+sandbox tests pass.
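Fix 1 amounts to prefixing the launch command with a PATH export; a minimal sketch, assuming a string-valued launch_cmd (the helper name `with_local_bin` is hypothetical):

```python
def with_local_bin(launch_cmd: str) -> str:
    # prepend $HOME/.local/bin so a uv-installed binary resolves
    # when the command runs inside the sandbox shell
    return f'export PATH="$HOME/.local/bin:$PATH" && {launch_cmd}'
```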
…enes

Multi-role scenes (coder + reviewer) now communicate via outbox files through the main bf.run(TrialConfig) path. Previously, outbox-based message passing only worked through the standalone _scene.py scheduler (used by followup-bench). Now the same convention works end-to-end:

1. Scheduler sets up /app/.outbox/ before the first turn
2. After each turn, reads outbox files written by the active role
3. Injects received messages into the next role's prompt

Also includes:
- Coder-reviewer demo script (docs/notebooks/coder-reviewer-demo.py)
- Real runnable notebook replacing config-only cells with bf.run() calls
- Multi-turn vs multi-round terminology in README and api-reference
- 7 new tests covering outbox setup, injection, cleanup, and edge cases
1. Quote file paths with shlex.quote() in _read_scene_outbox() to prevent shell command injection via crafted outbox filenames
2. chown /app/.outbox to sandbox_user so agents can actually write outbox files (was root:root 755 → agent couldn't write)
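The shlex.quote() fix works because quoting neutralizes shell metacharacters in an attacker-chosen filename. A minimal sketch (the function name is illustrative, not the real _read_scene_outbox internals):

```python
import shlex

def read_outbox_cmd(filename: str) -> str:
    # an unquoted filename like "a;rm -rf /" would terminate the cat
    # command and run the rest; shlex.quote wraps it in safe quotes
    return f"cat /app/.outbox/{shlex.quote(filename)}"
```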
…st gaps
1. Persist inter-role messages to trial_dir/scene_messages.jsonl
(was ephemeral — injected into prompts then discarded)
2. Install non-primary agents in connect_as() for heterogeneous scenes
(was broken: only primary agent was installed)
3. Honest Harbor mapping — document what 0.3 delivers vs what's a gap:
- Shipped: roles, turns, outbox messaging, message persistence
- Gap: dynamic termination, oracle access, per-round verification,
inter-round trajectory inspection
4. Add 0.3 Limitations section to api-reference
5. Two new tests: message persistence + heterogeneous agent install
All 3 patterns executed end-to-end on the regex-log task via Daytona:

- Baseline: reward=1.0, 3 tool calls
- Self-review (multi-turn): reward=1.0, 7 tool calls
- Coder-reviewer (multi-round): reward=0.0, 13 tool calls

Outbox messaging confirmed working: the reviewer wrote feedback to /app/.outbox/coder.json, and the scheduler read it and injected it into the coder's prompt. Messages persisted to scene_messages.jsonl.
…primary agents

1. connect_as() now writes credential files and uploads subscription auth for non-primary agents, matching what install_agent() does for the primary agent. Fixes heterogeneous scenes where e.g. codex-acp needs ~/.codex/auth.json.
2. connect_as() now updates self._agent_launch so disconnect()'s pkill fallback targets the correct process (not always the primary agent's binary).
3. Note: the openhands launch_cmd pkill issue (pkill -f 'export') is pre-existing in registry.py, not introduced by this PR.
Tasks requesting more storage than the Daytona tier allows fail at sandbox creation. Apply the same clamping pattern already used for cpus and memory_mb so tasks degrade gracefully. The cap is overridable via BENCHFLOW_DAYTONA_MAX_STORAGE_MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: clamp Daytona storage_mb to configurable max
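The clamp described above reduces to a min() against an env-overridable cap; a sketch assuming a hypothetical default of 10 GiB (the commit does not state the actual default):

```python
import os

_DEFAULT_MAX_STORAGE_MB = 10 * 1024  # assumed default cap, not from the commit

def clamp_storage_mb(requested_mb: int) -> int:
    # same graceful-degradation pattern as cpus/memory_mb:
    # never exceed the tier cap, but honor smaller requests as-is
    cap = int(os.environ.get("BENCHFLOW_DAYTONA_MAX_STORAGE_MB", _DEFAULT_MAX_STORAGE_MB))
    return min(requested_mb, cap)
```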
feat: wire outbox messaging into Trial._run_scene()
* Fix DinD compose exec missing project/directory/file flags

  DaytonaProcess.start() hardcoded `docker compose exec` without the `-p`, `--project-directory`, and `-f` flags needed to locate the running compose project inside the DinD sandbox. This caused exec to fail silently with "Process closed stdout (rc=None)". Extract the full compose base command from Harbor's strategy via `_compose_cmd([])` during `from_harbor_env()` and use it in `start()` so the exec subcommand includes all required project identifiers.

* fix: use shlex.join for DinD compose exec to handle paths with spaces

  Address Devin review feedback — shlex.split() + " ".join() loses quoting for tokens with spaces. Use shlex.join(), which properly quotes each token.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
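The shlex.split() + " ".join() pitfall from the second commit is easy to demonstrate: a naive join loses the quoting on any token containing spaces, while shlex.join() round-trips losslessly.

```python
import shlex

base = ["docker", "compose", "-p", "proj",
        "--project-directory", "/srv/dir with spaces"]

naive = " ".join(base)   # "/srv/dir with spaces" will split back into 3 tokens
safe = shlex.join(base)  # each token is re-quoted as needed
```

shlex.join() is available from Python 3.8 onward.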
…193)

SSH pipes break through the DinD→compose exec chain, causing "Process closed stdout (rc=None)" on all compose tasks.

The new DaytonaPtyProcess uses the Daytona SDK's WebSocket PTY API for the outer connection (keeps the pipe alive), then docker compose exec -i -T inside (clean stdio for the agent). Includes marker-based startup to drain shell output before the ACP handshake, and echo-resistant response matching in the ACP client (filter echoed requests by checking for absence of the 'method' field).

Also adds skills_dir: "auto" support in Job for per-task skill resolution after PR #720 removed COPY skills from Dockerfiles.
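The echo-resistant matching relies on a JSON-RPC invariant: requests carry a "method" field, responses do not. A sketch of that filter (function name illustrative, not the actual ACP client code):

```python
import json

def is_acp_response(line: str) -> bool:
    # a PTY echoes our own request lines back; those echoes contain
    # a "method" field, while genuine responses carry id/result/error
    try:
        msg = json.loads(line)
    except ValueError:
        return False  # shell noise, prompt fragments, partial lines
    return isinstance(msg, dict) and "method" not in msg
```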
* fix: skip model/API-key validation for oracle agent

  The oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for whatever model the CLI defaulted to (claude-haiku-4-5-20251001). This made `bench eval create -a oracle` fail without ANTHROPIC_API_KEY set, even though oracle doesn't need it.

* fix: don't assign default model to oracle agent

  Move the fix from resolve_agent_env to the CLI layer: oracle runs solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL at all. Both _run_single and _run_batch now pass model=None for oracle. Widen JobConfig.model to str | None to support this.

* fix: oracle agent — chokepoint guard, drop orphan eval CLI, helper

  PR #173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation. Three changes:

  - Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model.
  - Delete the orphan cli/eval.py and its tests — the live eval_create lives in cli/main.py and was the actual code path users hit.
  - Add an effective_model(agent, model) helper, change the JobConfig.model default to None, and replace seven `model or DEFAULT_MODEL` sites in cli/main.py and the job.py YAML loaders so oracle gets an honest model=None end-to-end (in result/summary JSON, prints, and the downstream Trial).

  A regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth — verified to fail on main with the user-facing ANTHROPIC_API_KEY error and pass after the fix.
* test: regression suite pinning oracle chokepoint + orphan removal

  Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude:

  - TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way.
  - TestEvalCreateRouting — the `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR #173 missed.
  - TestEffectiveModel — unit tests for the helper: oracle drops the model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
  - TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards compatibility preserved.
  - TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop after, to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern.

  Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".

* fix: restore cli/eval.py and test_eval_cli.py, apply oracle guard

  The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).

* docs: clarify cli/eval.py and test_eval_cli.py are not wired into live CLI

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
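From the behavior pinned by TestEffectiveModel (oracle drops the model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset), the helper is presumably close to this sketch; this is a reconstruction from the commit messages, not the actual source:

```python
DEFAULT_MODEL = "claude-haiku-4-5-20251001"  # the CLI default named in the commits

def effective_model(agent, model):
    if agent == "oracle":
        return None                # oracle runs solve.sh, never calls an LLM
    return model or DEFAULT_MODEL  # "" and None both fall back to the default
```

The `model or DEFAULT_MODEL` idiom is what makes an empty string behave as unset.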
Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were landing on top of pre-existing repo lint debt.

What changed:

1. Auto-fixes via `ruff check --fix --unsafe-fixes`:
   - 40 F401 unused-imports across src/, tests/, examples/
   - 8 I001 unsorted-imports
   - 6 UP037 quoted-annotations modernized
   - Other auto-fixable rules

2. Hand fixes:
   - src/benchflow/__init__.py: removed `Trial` from the `from harbor` re-export block (it was shadowed by `from benchflow.trial import Trial` at line 65, which is the canonical public Trial). Added `trial_config_from_yaml` to __all__.
   - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for B904 (errors raised inside except clauses).
   - src/benchflow/mcp/reviewer_server.py: same B904 fix for the fastmcp ImportError reraise.
   - tests/test_skill_eval.py: raw string for the `pytest.raises(match=...)` pattern (RUF043).
   - 3 files: replaced `×` (Unicode multiplication sign) in comments and f-strings with `x` (latin x) to clear RUF001/RUF003.

3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`:
   - `experiments/*.py` and `tests/conformance/*.py` ignore E402 — these are standalone scripts that legitimately set sys.path before importing.
   - `src/benchflow/runtime.py` ignores F821 — it uses forward references resolved by `from __future__ import annotations`; explicit TYPE_CHECKING imports would force eager loads.

No code behavior changes. 580 tests pass; the 8 pre-existing failures (env-leak between subscription auth tests, Docker compose env, judge model default mismatch) are unrelated to this PR.
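The per-file ignores from item 3 look roughly like this in pyproject.toml; the section name and rule codes come from the commit message, the exact layout is a sketch:

```toml
[tool.ruff.lint.per-file-ignores]
# standalone scripts that legitimately set sys.path before importing
"experiments/*.py" = ["E402"]
"tests/conformance/*.py" = ["E402"]
# forward references resolved by `from __future__ import annotations`
"src/benchflow/runtime.py" = ["F821"]
```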
* docs: add fix plan for connect_as() agent_env bug (#2)

* docs: expand fix plan with eng review findings and test cases

  Add two edge-case test requirements (non-overlapping key merge, None safety) from /plan-eng-review. Append review report confirming 0 issues, 0 critical gaps — ready to implement.

* fix: merge cfg.agent_env into connect_as() env resolution (#2)

  connect_as() passed only role.env to resolve_agent_env, losing all config-level env vars (e.g. BENCHFLOW_PROVIDER_BASE_URL from YAML). Merge cfg.agent_env as the base with role.env as an overlay so role-specific vars win on overlap.

* remove plan

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* rebase on upstream/0.3
* add openhands CLI
* enhance API key security
* refine tests

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* docs: use `uv tool install` instead of `pip install`

  benchflow is a CLI tool with entry points — uv tool install gives users an isolated environment (like pipx) without managing venvs manually.

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* test: cover sandbox setup timeout wiring
* docs: document sandbox setup timeout
* feat: wire sandbox setup timeout through configs

  `setup_sandbox_user()` already accepted a `timeout_sec` kwarg (default 120s) but no live call site surfaced it — the knob was unreachable for normal runs. Under heavy sandbox bootstrap (parallel containers copying large tool caches into /home/<sandbox_user>) the 120s cap was hit with no user override.

  Add `sandbox_setup_timeout: int = 120` to TrialConfig, JobConfig, and RuntimeConfig, and forward it through:
  - trial YAML (`trial_config_from_dict`)
  - job YAML (both native and Harbor-compatible loaders)
  - `SDK.run(..., sandbox_setup_timeout=...)`
  - `bench eval create --sandbox-setup-timeout`
  - `Trial.install_agent()` into both `setup_sandbox_user()` call sites (oracle + normal agent)

  The value is also recorded in the run's `config.json` snapshot to aid post-hoc diagnosis. The default stays at 120s — this change is about making the value configurable, not changing runtime behavior.

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
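The wiring pattern above is simply forwarding a config field into an existing kwarg; a minimal sketch with stand-in definitions (the real TrialConfig and setup_sandbox_user have many more fields and side effects):

```python
from dataclasses import dataclass

def setup_sandbox_user(timeout_sec: int = 120) -> int:
    # stand-in for the real helper; returns the timeout it would enforce
    return timeout_sec

@dataclass
class TrialConfig:
    # simplified: only the newly wired field
    sandbox_setup_timeout: int = 120

def install_agent(cfg: TrialConfig) -> int:
    # the fix: surface the previously unreachable kwarg from config
    return setup_sandbox_user(timeout_sec=cfg.sandbox_setup_timeout)
```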
* docs(plan): add plan to fix sandbox io problem
* test: lock sandbox setup contract (plan step 1/6: lock the new sandbox contract in tests)
* fix: stop copying root tool installs into sandbox home (plan step 2/6: narrow setup_sandbox_user() to user state only)
* refactor: derive sandbox home dirs from registry config (plan step 3/6: align registry semantics with the new contract)
* refactor: symlink skills into sandbox, enforce shared install prefixes

  Replace per-trial skill-tree copies with ln -sfn into a shared /skills (or task skills_dir) root, drop skill_paths from get_sandbox_home_dirs(), and add registry + sandbox-setup invariants that keep agent binaries on /usr/local/* rather than /root-only home paths. Updates the task-authoring and api-reference docs to describe the new lightweight sandbox contract.

* chore: remove completed sandbox plan doc

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* feat: BaseUser abstraction for progressive-disclosure trial loops

  Add User as a first-class participant in the trial loop — a Python callback that produces prompts, sees test results between rounds, and decides when to stop. This is the infrastructure Josh (GitHub/Microsoft) needs for SWE-bench Pro progressive disclosure.

  New types (user.py):
  - BaseUser with setup(instruction, solution) and run(round, instruction, round_result)
  - RoundResult dataclass with trajectory, rewards, verifier output
  - PassthroughUser (backward-compat default, single round)
  - FunctionUser (wraps a plain callback for lightweight use)

  Trial changes:
  - TrialConfig gains user, max_user_rounds, oracle_access fields
  - Trial._run_user_loop(): user.run() → connect → execute → disconnect → soft_verify() → build RoundResult → repeat until None or max rounds
  - Trial.soft_verify(): runs the Harbor verifier WITHOUT hardening so the agent stays alive between rounds. The final verify() still does full hardening.
  - Multi-role + User raises ValueError (deferred to a future phase)

  16 new tests, 0 regressions on the existing 618 tests.

* fix: address self-review — 5 bugs in user abstraction

  1. Reorder: disconnect() before soft_verify() — the agent process is already dead when soft_verify runs, so soft_verify's docstring was misleading. Now disconnect → soft_verify is the explicit flow.
  2. soft_verify() now runs CLEANUP_CMD (conftest/pth/sitecustomize purge) before the verifier. Prevents the agent from gaming intermediate test results by injecting test-patching files.
  3. FunctionUser: use inspect.isawaitable() instead of asyncio.iscoroutine() — handles asyncio.Task, Future, and any __await__ object, not just coroutines.
  4. oracle_access: cat /solution now runs as user="root" — /solution is locked (root:700) after install_agent, so the read would silently fail without root.
  5. try/finally around connect/execute/disconnect in the user loop — ensures disconnect() always runs even if execute() raises.
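The user-loop control flow described above can be reduced to a small synchronous sketch; `user_fn` and `execute_round` are stand-ins for the real FunctionUser callback and the connect → execute → disconnect → soft_verify round, not the actual Trial code:

```python
def run_user_loop(user_fn, instruction, execute_round, max_rounds=3):
    """user_fn(round, instruction, last_result) returns the next prompt,
    or None to stop early; execute_round runs one agent round and
    returns that round's result (e.g. a reward)."""
    result = None
    for round_no in range(max_rounds):
        prompt = user_fn(round_no, instruction, result)
        if prompt is None:   # user decides to stop
            break
        result = execute_round(prompt)
    return result
```

A progressive-disclosure user is then just a callback that escalates its prompt each round and returns None once the tests pass.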
* feat: add user_dogfood.py — progressive disclosure on regex-log

  Demonstrates the FunctionUser abstraction:
  - Round 0: terse 2-sentence prompt
  - Round 1: hints about edge cases on failure
  - Round 2: full instruction on continued failure
  - Stops early if tests pass

* fix: address Devin review — remove tautological tests, fix model name

  - Remove 4 tautological tests (pure dataclass reads) per CLAUDE.md convention: TestRoundResult.test_defaults, test_with_data, TestTrialConfigUser.test_user_field_defaults_to_none, test_user_field_set
  - Fix the dogfood model name: gemini-2.5-flash (not the expired preview)
  - Note: iscoroutine→isawaitable was already fixed in 51d6c61

* fix: address code review — oracle safety, unused import, soft_verify tests

  1. The oracle /solution is now moved (not deleted) before the agent runs and restored before the final verify(). Prevents breaking verifiers that need /solution to compute rewards.
  2. Remove the unused asyncio import from user.py.
  3. Add 4 soft_verify tests: timeout, crash, success, and CLEANUP_CMD execution verification. soft_verify is no longer untested.

* feat: dogfood results — progressive disclosure on regex-log via Daytona

  3-round progressive disclosure with Gemini Flash on regex-log:
  - Round 0: terse prompt (2 tool calls) → reward=0.0
  - Round 1: hint prompt (3 tool calls) → reward=0.0
  - Round 2: full instruction (3 tool calls) → reward=0.0
  - Final verify: reward=0.0

  The agent scored 0.0 on all rounds — regex-log is a hard task. But the infrastructure works end-to-end: user loop, soft_verify, fresh ACP sessions per round, user_rounds.jsonl persistence, final hardened verify. No errors.

* feat: add opencode agent to registry

  OpenCode (opencode-ai) is an open-source TypeScript coding agent with ACP support. Skills path: $HOME/.opencode/skills (updated from .opencode/skill per skillsbench #718). Closes skillsbench #718 for the benchflow side.
* fix: opencode ACP returns 0 tool calls — model format mismatch Root cause: OpenCode's ACP parseModel() splits modelId on "/" to extract providerID and modelID. When benchflow sent "gemini-3.1-pro-preview" (no slash), opencode parsed it as providerID="gemini-3.1-pro-preview" with modelID="" — an invalid config that silently returned end_turn. Fix: Add acp_model_format field to AgentConfig. When set to "provider/model" (opencode), _format_acp_model() infers the models.dev provider from the bare model name (e.g. "gemini" → "google") and sends "google/gemini-3.1-pro-preview" to set_model. Also: opencode requires_env is now empty (inferred from model at runtime, not hardcoded to ANTHROPIC_API_KEY). * feat: executed notebook — SWE-bench Pro progressive disclosure analysis OpenCode + gemini-3.1-pro-preview on qutebrowser SWE-bench Pro: Baseline (full prompt, 1 round): 40 tools, 736s, reward=0.0 Progressive (3 rounds): 185 tools, 1154s, reward=0.0 Round 0 (terse): 86 tools (81 bash + 5 edit) Round 1 (hints): 76 tools (66 bash + 10 edit) Round 2 (full): 23 tools (16 bash + 7 edit) Both scored 0.0 due to verifier infrastructure bug (rootdir=/tests instead of /app, pytest couldn't find config). Agent's fixes were likely correct — demonstrated passing tests in own environment. Key findings: - Progressive disclosure changed agent behavior (86→76→23 tools) - _reset_cache implemented only after Round 1 hint - OpenCode handled 185 tool calls without token limits - Verifier rootdir bug needs investigation * fix: replace hand-curated pytest plugin whitelist with auto-discovery The old mechanism (4 dicts + 4 functions + 1 regex) required manual code changes for every new benchmark with an undeclared pytest plugin. SWE-bench Pro tasks failed because pytest-benchmark wasn't whitelisted. New mechanism: one container-side script + one async function. At hardening time, enumerate all pytest11 entry points from root-owned system packages. 
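The provider-inference step of `_format_acp_model` can be sketched like this. The function name, prefix table, and fallback behavior are illustrative assumptions based on the commit message, not benchflow's real code:

```python
from typing import Optional

# Hypothetical bare-prefix -> models.dev provider table (illustrative)
KNOWN_PROVIDER_PREFIXES = {
    "gemini": "google",
    "claude": "anthropic",
    "gpt": "openai",
}

def format_acp_model(model: str, acp_model_format: Optional[str]) -> str:
    """Return the model ID to send to ACP set_model.

    When the agent expects "provider/model" (opencode), qualify a bare
    model name with its inferred provider; otherwise pass it through.
    """
    if acp_model_format != "provider/model" or "/" in model:
        return model  # already qualified, or the agent accepts bare names
    for prefix, provider in KNOWN_PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return f"{provider}/{model}"
    # Unknown prefix: fall back to the bare name (the real code warns here)
    return model
```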
Only root-owned dist-info directories are trusted — editable installs from agent-writable /testbed are excluded. PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 stays in place. Security preserved. task.toml pytest_plugins kept as fallback. Deleted: _PYTEST_PLUGIN_ALIASES, _PYTEST_OPTION_PLUGINS, _PYTEST_INSTALLED_PLUGINS, _PIP_INSTALL_RE, _normalize_pytest_plugin, _plugins_from_verifier_script, _declared_pytest_plugins, _pytest_plugin_flags, tomllib import. Added: _DISCOVER_PYTEST_PLUGINS_SCRIPT, _discover_pytest_plugin_flags. * fix: handle Python 3.9 importlib.metadata API in plugin discovery Python 3.9's entry_points() doesn't accept keyword arguments — returns a dict instead. Fall back to entry_points().get('pytest11', []) when the keyword style raises TypeError. * fix: simplify plugin discovery — skip ownership check The uid==0 check was failing on Python 3.9 containers where ep.dist._path doesn't exist. Simplified to just enumerate all pytest11 entry points — sandbox_user prevents agent pip installs, so all discovered plugins are image-authored. * feat: updated notebook with fixed-verifier results Both progressive + baseline rerun with working verifier (15 plugins discovered). Results with honest scoring: Progressive (3 rounds): 284 tools, 970s, reward=0.0 Round 0: 94 tools, Round 1: 92 tools, Round 2: 98 tools Baseline (1 round): 73 tools, 611s, reward=0.0 Both failed due to agent code errors (circular imports), not verifier infrastructure. Progressive used 4x more compute for same outcome on this task. * fix: preserve trusted PYTHONPATH entries during verifier hardening VERIFIER_ENV cleared PYTHONPATH="" which broke SWE-bench Pro tasks where the Dockerfile sets PYTHONPATH=/app/lib:/app for project imports. New: _trusted_verifier_pythonpath() filters PYTHONPATH using the same root-owned validation as PATH, but does NOT block the workspace — /app is already importable via CWD/pytest sys.path insertion, so clearing it only breaks imports without security benefit. 
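The discovery mechanism, including the Python 3.9 fallback described above, can be sketched as a single function. This is a simplified stand-in for the container-side script, not its actual source:

```python
from importlib.metadata import entry_points

def discover_pytest_plugin_flags():
    """Enumerate installed pytest11 entry points and emit -p flags so the
    plugins still load under PYTEST_DISABLE_PLUGIN_AUTOLOAD=1.

    Sketch of the idea only; the real script runs inside the container
    and differs in detail.
    """
    try:
        eps = entry_points(group="pytest11")      # Python 3.10+ keyword API
    except TypeError:
        eps = entry_points().get("pytest11", [])  # Python 3.9: dict-style API
    flags = []
    for ep in eps:
        flags.extend(["-p", ep.name])
    return flags
```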
/tmp, /var/tmp, /home/agent are still blocked. Re-pinned after task-env merge like PATH. * fix: address review comments on BaseUser PR - soft_verify: chmod 777 /logs/verifier so non-root verifier can write - soft_verify: restore /solution before verify, re-hide after (oracle access) - validate empty roles (!=1) and multi-scene configs in user loop - remove tautological test_setup_is_noop - remove opencode BENCHFLOW_PROVIDER_API_KEY→ANTHROPIC_API_KEY mapping (wrong for non-Anthropic models; native keys inherited via auto_inherit_env) - warn on unknown provider fallback in _format_acp_model - remove --rootdir=/tests from VERIFIER_ENV (cherry-pick from PR #187) - fix printenv PYTHONPATH crash when unset - fix stale plugin discovery docstring * feat: add SWE-bench Pro oracle validation + baseline experiment script Runs oracle (gold solution) on all 4 testable tasks to verify the --rootdir fix, then runs a single-round agent baseline for comparison with progressive disclosure. Results to CSV. * fix: address Codex review on PR #184 — oracle safety + warnings Three Codex review findings on the BaseUser abstraction: 1. oracle_access=True with user=None silently leaves /solution exposed to the agent for the entire trial. Add a logger.warning at setup time so misconfigurations surface immediately. 2. Oracle restore (mv /solution_oracle_backup /solution) was outside any finally block. If _run_user_loop() raised, /solution was never restored. Move the user/scene execution into try/finally so the restore always runs before the final verify(). 3. Oracle read used a wildcard fallback (cat /solution/* || true) that could leak unintended files (binaries, credentials). Narrow to solve.sh — the canonical SWE-bench Pro oracle path. 
Bugs Codex flagged that were FALSE POSITIVES (verified against code): - "session counter reset" — disconnect() already resets both counters - "None instruction" — _resolve_prompts returns [instruction] not [None] Tests still pass: 15 user + 58 sandbox = 73 total. * feat: per-task verifier hardening opt-outs + restore --rootdir=/app Two related changes addressing SWE-bench Pro oracle compatibility: 1) Restore --rootdir=/app in PYTEST_ADDOPTS Removing --rootdir entirely (PR #187) made pytest fall back to /dev as rootdir (from -c /dev/null), producing test node IDs like ../dev/::test_foo instead of <repo>/<path>::test_foo. The verifier expects full-path IDs and reported 0 passing tests on openlibrary even though all 18 tests passed. --rootdir=/app anchors test IDs to the canonical Harbor repo root while -c /dev/null still blocks pyproject.toml/pytest.ini discovery and --confcutdir=/tests still blocks conftest walk-up beyond /tests. 2) Per-task [verifier.hardening] opt-outs in task.toml The cleanup that deletes agent-injected conftest.py also deletes legitimate repo conftest.py files. qutebrowser ships conftest.py that sets up import order to break a real circular dependency between qutebrowser.browser.inspector and qutebrowser.misc.miscwidgets — without them, pytest collection fails on a type annotation in miscwidgets.py:419. Tasks now declare opt-outs in task.toml: [verifier.hardening] cleanup_conftests = false # qutebrowser Defaults remain secure (all True). New helpers in _sandbox.py: - HARDENING_DEFAULTS: dict of feature flags - _read_hardening_config(task_dir): parse task.toml [verifier.hardening] - _build_cleanup_cmd(hardening): build cleanup honoring opt-outs CLEANUP_CMD constant kept as backward-compat alias. Both harden_before_verify() and Trial.soft_verify() now read per-task hardening config before running cleanup. 
Validation on SWE-bench Pro oracle (Daytona): Before: 2/4 (ansible, flipt) — openlibrary failed test ID format, qutebrowser failed conftest deletion After: 5/5 (ansible, flipt, openlibrary, qutebrowser, navidrome) Tests: 80 passing (15 user + 65 sandbox including 7 new opt-out tests). * docs: add progressive-disclosure guide + SWE-bench Pro demo notebook For Josh's SWE-bench Pro use case (and Harbor #1316 parity in the no-second-LLM case): - docs/progressive-disclosure.md: dedicated guide for the BaseUser abstraction. Covers the API, oracle access, [verifier.hardening] opt-outs, and when to choose BaseUser vs multi-role Scene. - docs/use-cases.md: brief mention in §1 (Interactive User Simulation) pointing to progressive-disclosure.md for the lighter-weight callback-based pattern. - examples/swebench_pro_progressive_disclosure.ipynb: clean rewrite of the existing notebook. Shows the API, oracle 5/5, baseline 4 tasks, per-task hardening opt-out example, and a placeholder cell that auto- loads the latest progressive-disclosure run from /tmp/swebench-pro-jobs/progressive when one exists. Executes top-to- bottom against the current oracle/baseline CSV. - examples/swebench_pro_user_dogfood.py: ready-to-run script for progressive disclosure on any of the 5 working SWE-bench Pro tasks. Three-round user: terse → failing tests + half spec → full spec. - experiments/swebench-pro-results.csv: oracle + baseline results from 2026-04-24 Daytona run. qutebrowser entry is pre-fix (verified post- fix separately, noted in notebook). * docs: add progressive-disclosure.md to CLAUDE.md docs index
…gs retry (#196) * Bump DaytonaPtyProcess readline timeout 300s→900s Long-running TTS/audio tasks (e.g. pg-essay-to-audiobook) generate extended quiet periods on stdout while ffmpeg/whisper run. The 300s PTY readline timeout fires before the per-task agent timeout (900s), prematurely killing healthy runs. Align readline timeout with the standard agent timeout so the PTY only fails when the inner process is actually wedged. * Daytona SDK: retry SessionCommandLogsResponse ValidationError The Daytona server occasionally returns an empty string instead of a JSON object when fetching session command logs, which causes pydantic to raise ValidationError inside AsyncProcess.get_session_command_logs. We've reproduced this on SDK 0.168.x and 0.169.x; the surface is most visible in skillsbench tasks that ask the verifier for command output (e.g. latex-formula-extraction). Patch the SDK method at runtime with a small bounded retry. After four malformed payloads we fall back to an empty (but valid) response so callers can still inspect exit_code via get_session_command — a silent missing-logs is preferable to taking a whole trial as ERROR on an upstream marshalling glitch. Patch is applied lazily from _create_environment so we never touch the SDK on Docker-only runs. * Daytona retry: catch DaytonaError wrapping the malformed-logs ValidationError The first version of this patch only matched on pydantic ValidationError, but AsyncProcess.get_session_command_logs is decorated by intercept_errors at class-definition time — every inner exception is converted to DaytonaError before our patched bound method ever sees it. Verified against latex-formula-extraction on Daytona: the patch wrapper was being called, but the except-clause never matched, so the run still failed. 
Match on DaytonaError whose message contains 'SessionCommandLogsResponse' in addition to bare ValidationError, and drop the wrapper to 2 attempts (harbor already wraps the call in tenacity x3 — extra retries here are wasted on a deterministic malformed payload). Empty-fallback unchanged.
* fix: env-file path mismatch in DinD compose mode Devin caught a real bug introduced by PR #193 (DinD compose ACP): src/benchflow/process.py:325 sets remote_env_path = "/tmp/benchflow_env_$$.env" expecting the remote shell to expand $$ to its PID. But shlex.join() at line 329 single-quotes the --env-file argument, so docker compose receives the literal string "/tmp/benchflow_env_$$.env" while the cat heredoc that writes the file (line 339, raw f-string) does expand $$. The file is written to /tmp/benchflow_env_<pid>.env and read from /tmp/benchflow_env_$$.env — silent mismatch, env vars (incl. API keys) silently dropped in DinD compose tasks. Fix: use uuid.uuid4().hex[:16] for the unique suffix instead of relying on shell-side $$ expansion. The path is then a literal that survives quoting. Apply the same fix to the direct (non-DinD) Daytona branch even though it was working — uniformity makes the path robust against future quoting changes. Also fix a pre-existing SIM103 lint error in _daytona_patches.py that ruff caught while validating the test changes. Tests: tests/test_process.py +2 regression tests pinning that no remote command contains a literal "$$" — would catch this exact regression. 8/8 process tests pass; ruff clean. * test: reference PR #193 / #198 in regression test docstring Devin caught: CLAUDE.md mandates regression tests name the commit/PR they guard. Updated TestDaytonaProcessEnvFilePath docstring to cite PR #198 (the fix) and PR #193 / commit cdccac7 (the regression).
# Conflicts: # src/benchflow/_agent_env.py # src/benchflow/cli/eval.py # tests/test_oracle_chokepoint.py
Release: v0.3.2
Cuts dev-0.3 → main as the v0.3.2 release. Tag `v0.3.2` to be created on main after this merges; CI publishes to PyPI.

What's in 0.3.2
Features
- BaseUser abstraction for progressive-disclosure trial loops (feat: BaseUser abstraction + per-task verifier hardening opt-outs #194); guide in `docs/progressive-disclosure.md`.
- `[verifier.hardening]` opt-outs in `task.toml` (feat: BaseUser abstraction + per-task verifier hardening opt-outs #194) — tasks with legitimate `conftest.py` setups (e.g. qutebrowser) opt out of specific cleanup steps. Fixes 5/5 SWE-bench Pro oracle on hardened verifier.
- `sandbox_setup_timeout` wired through configs (feat: wire sandbox_setup_timeout through all configs #180).

Fixes
- `cfg.agent_env` reaches `connect_as()` (fix: merge cfg.agent_env into connect_as() env resolution #191) — closes "Bug: connect_as() ignores cfg.agent_env, re-resolves from empty role.env" #190; YAML-supplied provider creds reach the agent.
- Env-file path mismatch in DinD compose mode — `shlex.join()` was quoting `$$` literally so written/read paths diverged; now uses `uuid.uuid4()` for unique paths.
- `--rootdir=/app` restored (feat: BaseUser abstraction + per-task verifier hardening opt-outs #194) — anchors test node IDs to repo root; openlibrary oracle goes 0/18 → 18/18.

Chores
Validation
- `ruff check .` clean

Post-merge actions
- Tag `v0.3.2` on main: `git tag -a v0.3.2 -m "benchflow 0.3.2"; git push origin v0.3.2`
- `gh release create v0.3.2 --generate-notes` (CI publishes to PyPI)
- Bump `pyproject.toml` version to `0.3.3.dev0`
- `dev-0.3` branch retired (going forward: trunk-based, PRs target main)

Test plan
- `pip install benchflow==0.3.2` works after CI publishes