feat: wire sandbox_setup_timeout through all configs by EYH0602 · Pull Request #180 · benchflow-ai/benchflow

EYH0602 · 2026-04-22T19:16:11Z

Summary

setup_sandbox_user() already accepted a timeout_sec kwarg (default 120s), but no live call site surfaced it — the knob was unreachable for normal runs. Under heavy sandbox bootstrap (parallel containers copying large tool caches into /home/<sandbox_user>) users hit the 120s cap with no override.
Adds sandbox_setup_timeout: int = 120 to TrialConfig, JobConfig, and RuntimeConfig, and forwards it through every live config entry point: trial YAML, job YAML (native + Harbor), SDK.run(), bench eval create --sandbox-setup-timeout, and both setup_sandbox_user() call sites in Trial.install_agent() (oracle + normal agent).
Default stays at 120s — this change is about making the value configurable, not silently changing runtime behavior. The value is recorded in the run's config.json snapshot for post-hoc diagnosis.
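The forwarding described above can be sketched as follows. This is a minimal illustration, not benchflow's actual code: the `sandbox_setup_timeout` field and the `timeout_sec` kwarg come from the PR text, while the stub body of `setup_sandbox_user`, the `agent` field, and the free-standing `install_agent` function are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class TrialConfig:
    # Field name and default come from the PR; the other field is illustrative.
    agent: str = "oracle"
    sandbox_setup_timeout: int = 120


def setup_sandbox_user(*, timeout_sec: int = 120) -> int:
    """Stub standing in for the real helper; it only reports the timeout it received."""
    return timeout_sec


def install_agent(config: TrialConfig) -> int:
    # Forward the configured value instead of silently relying on the helper's default.
    return setup_sandbox_user(timeout_sec=config.sandbox_setup_timeout)


default_timeout = install_agent(TrialConfig())
raised_timeout = install_agent(TrialConfig(sandbox_setup_timeout=600))
```

With the default config the helper still sees 120 seconds, so existing runs behave as before; only callers that set the field get a different cap.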

Commits

bc7e841 feat: wire sandbox setup timeout through configs
055f605 test: cover sandbox setup timeout wiring
db9d99a docs: document sandbox setup timeout

release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign

The oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for whatever model the CLI defaulted to (claude-haiku-4-5-20251001). This made `bench eval create -a oracle` fail without ANTHROPIC_API_KEY set, even though oracle doesn't need it.

Move the fix from resolve_agent_env to the CLI layer: oracle runs solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL at all. Both _run_single and _run_batch now pass model=None for oracle. Widen JobConfig.model to str | None to support this.

…key-check fix: skip model/API-key validation for oracle agent

PR benchflow-ai#173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation. Three changes:

- Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests; the live eval_create lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model default to None, replace seven `model or DEFAULT_MODEL` sites in cli/main.py and job.py YAML loaders so oracle gets honest model=None end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth. It was verified to fail on main with the user-facing ANTHROPIC_API_KEY error and pass after the fix.
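As a rough sketch of the chokepoint guard, assuming a heavily simplified `resolve_agent_env`: the three-argument call shape and the `agent != "oracle"` check come from the commit message, but the return value, the exception type, and the "claude models need ANTHROPIC_API_KEY" rule are illustrative assumptions, not the real validation logic.

```python
from __future__ import annotations

DEFAULT_MODEL = "claude-haiku-4-5-20251001"  # CLI default named in the PR discussion


def resolve_agent_env(agent: str, model: str | None, env: dict[str, str]) -> dict[str, str]:
    """Simplified stand-in: validate API keys only for agents that call an LLM."""
    resolved = dict(env)
    # Chokepoint guard: oracle runs solution/solve.sh and never calls a model,
    # so validation is skipped even if a caller forwards one.
    if agent != "oracle" and model is not None:
        # Assumed rule for illustration: claude-* models need ANTHROPIC_API_KEY.
        if model.startswith("claude") and "ANTHROPIC_API_KEY" not in resolved:
            raise ValueError(f"ANTHROPIC_API_KEY required for model {model!r}")
    return resolved
```

Calling `resolve_agent_env("oracle", DEFAULT_MODEL, {})` returns without raising in this sketch, which mirrors the shape of the regression test described above.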

Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude:

- TestOrphanRemoval: cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting: `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR benchflow-ai#173 missed.
- TestEffectiveModel: unit tests for the helper: oracle drops model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
- TestOracleYamlLoaders: Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards-compat preserved.
- TestEvalCreateOracleCLI: end-to-end: live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop after to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern.

Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".
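The behavior pinned by TestEffectiveModel can be sketched like this. The function name, its two parameters, and the three rules (oracle drops the model, non-oracle falls back to DEFAULT_MODEL, empty string counts as unset) are taken from the commit messages; the body is a guess at the simplest implementation that satisfies them, not the actual source.

```python
from __future__ import annotations

DEFAULT_MODEL = "claude-haiku-4-5-20251001"  # CLI default named in the PR discussion


def effective_model(agent: str, model: str | None) -> str | None:
    """Return the model an agent should actually receive."""
    if agent == "oracle":
        # Oracle never calls an LLM, so it gets no model at all.
        return None
    # `or` treats both None and "" as unset and falls back to the default.
    return model or DEFAULT_MODEL
```

Centralizing the fallback in one helper is what lets the scattered `model or DEFAULT_MODEL` expressions be replaced without each call site needing its own oracle check.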

The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).
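One way such an import guard could be written is with the standard-library `ast` module. This is a sketch under assumptions: `imports_module` is a hypothetical helper, not the test that ships in the repository, and it does not resolve relative imports such as `from . import eval`.

```python
import ast


def imports_module(source: str, target: str) -> bool:
    """Return True if `source` imports `target` or one of its submodules."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            # Cover both `from pkg.mod import x` and `from pkg import mod`.
            names = [module] + [f"{module}.{alias.name}" for alias in node.names]
        else:
            continue
        if any(name == target or name.startswith(target + ".") for name in names):
            return True
    return False
```

A guard test could then read the text of cli/main.py and assert that `imports_module(text, "benchflow.cli.eval")` is false.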

…e CLI

…t-and-cleanup fix: oracle chokepoint guard + effective_model helper

`setup_sandbox_user()` already accepted a `timeout_sec` kwarg (default 120s) but no live call site surfaced it, so the knob was unreachable for normal runs. Under heavy sandbox bootstrap (parallel containers copying large tool caches into /home/<sandbox_user>) the 120s cap was hit with no user override.

Add `sandbox_setup_timeout: int = 120` to TrialConfig, JobConfig, and RuntimeConfig, and forward it through:

- trial YAML (`trial_config_from_dict`)
- job YAML (both native and Harbor-compatible loaders)
- `SDK.run(..., sandbox_setup_timeout=...)`
- `bench eval create --sandbox-setup-timeout`
- `Trial.install_agent()` into both `setup_sandbox_user()` call sites (oracle + normal agent)

The value is also recorded in the run's `config.json` snapshot to aid post-hoc diagnosis. Default stays at 120s: this change is about making the value configurable, not changing runtime behavior.
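The loader and snapshot steps in the list above might look roughly like this. `trial_config_from_dict` is named in the commit message, but its signature and body here are assumptions, as are the `agent` field and the use of `dataclasses.asdict` to produce the snapshot.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrialConfig:
    agent: str
    sandbox_setup_timeout: int = 120  # default stated in the PR


def trial_config_from_dict(raw: dict) -> TrialConfig:
    """Assumed loader shape: fall back to 120s when the key is absent."""
    return TrialConfig(
        agent=raw["agent"],
        sandbox_setup_timeout=int(raw.get("sandbox_setup_timeout", 120)),
    )


config = trial_config_from_dict({"agent": "oracle", "sandbox_setup_timeout": 300})
# Serialize the resolved config so the effective timeout can be inspected after the run.
snapshot = json.dumps(asdict(config), indent=2)
```

Writing the resolved value rather than the raw input into the snapshot means a run that used the default still records 120 explicitly.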

EYH0602 · 2026-04-22T19:17:47Z

@xdotli it looks like the CI unit tests are fixed in #177; I'll rebase once it is merged.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were landing on top of pre-existing repo lint debt.

What changed:

1. Auto-fixes via `ruff check --fix --unsafe-fixes`:
   - 40 F401 unused-imports across src/, tests/, examples/
   - 8 I001 unsorted-imports
   - 6 UP037 quoted-annotations modernized
   - Other auto-fixable rules
2. Hand fixes:
   - src/benchflow/__init__.py: removed `Trial` from the `from harbor` re-export block (it was shadowed by `from benchflow.trial import Trial` at line 65, which is the canonical public Trial). Added `trial_config_from_yaml` to __all__.
   - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for B904 (errors raised inside except clauses).
   - src/benchflow/mcp/reviewer_server.py: same B904 fix for fastmcp ImportError reraise.
   - tests/test_skill_eval.py: raw string for `pytest.raises(match=...)` pattern (RUF043).
   - 3 files: replaced `×` (Unicode multiplication sign) in comments and f-strings with `x` (latin x) to clear RUF001/RUF003.
3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`:
   - `experiments/*.py` and `tests/conformance/*.py` ignore E402: these are standalone scripts that legitimately set sys.path before importing.
   - `src/benchflow/runtime.py` ignores F821: it uses forward references resolved by `from __future__ import annotations`; explicit TYPE_CHECKING imports would force eager loads.

No code behavior changes. 580 tests pass; the 8 pre-existing failures (env-leak between subscription auth tests, Docker compose env, judge model default mismatch) are unrelated to this PR.
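For readers unfamiliar with B904, the `raise ... from e` pattern mentioned in the hand fixes looks like this. The `connect` function and its failing call are hypothetical stand-ins, not code from src/benchflow/process.py.

```python
def connect(host: str) -> None:
    """Hypothetical helper that translates a low-level OSError into ConnectionError."""
    try:
        raise OSError("socket refused")  # stand-in for a failing network call
    except OSError as e:
        # `from e` records the original exception as __cause__, which is what B904 asks for.
        raise ConnectionError(f"could not reach {host}") from e
```

Without `from e`, Python still attaches the original error as implicit context, but the traceback wording suggests a second failure happened while handling the first; chaining explicitly marks the new exception as a deliberate translation.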

# Conflicts: # src/benchflow/_agent_env.py # src/benchflow/cli/eval.py # tests/test_oracle_chokepoint.py

xdotli and others added 12 commits April 21, 2026 16:38

release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign

7dc18fc

Merge pull request benchflow-ai#173 from EYH0602/fix/oracle-skip-api-…

a099ff9

docs: clarify cli/eval.py and test_eval_cli.py are not wired into liv…

bc04c59

Merge pull request benchflow-ai#174 from EYH0602/fix/oracle-chokepoin…

144b6dc

EYH0602 marked this pull request as ready for review April 22, 2026 19:17

EYH0602 added a commit to EYH0602/benchflow that referenced this pull request Apr 23, 2026

preview: fix/docker-timeout (PR benchflow-ai#180)

170e490

xdotli changed the base branch from main to dev-0.3 April 25, 2026 07:51

devin-ai-integration Bot reviewed Apr 25, 2026

This was referenced Apr 25, 2026

merge: main → dev-0.3 (release prep for v0.3.2) #195

Merged

chore: clean up ruff lint debt across repo #197

Merged

Merge remote-tracking branch 'origin/dev-0.3' into pr-180-rebase

88d6b13

xdotli merged commit 1fccf70 into benchflow-ai:dev-0.3 Apr 25, 2026
1 check passed

xdotli mentioned this pull request Apr 25, 2026

release: benchflow 0.3.2 — BaseUser, verifier hardening, DinD compose, lint cleanup #199

Merged

4 tasks

EYH0602 deleted the fix/docker-timeout branch April 25, 2026 20:31

xdotli merged 13 commits into benchflow-ai:dev-0.3 from EYH0602:fix/docker-timeout
