fix: stop copying root tool installs into sandbox home by EYH0602 · Pull Request #181 · benchflow-ai/benchflow

EYH0602 · 2026-04-22T22:08:55Z

Summary

Closes #178.

Shared install prefixes (option 2): setup_sandbox_user() no longer recursively copies /root/.nvm, /root/.local/bin, or heavyweight tool trees into the sandbox home. Agent binaries are expected to live on shared prefixes like /usr/local/bin or /opt.
Symlink fallback (option 1): legacy images that still install tools under /root get a narrow ln -s instead of a full cp -a, keeping setup instant.
Skills linked, not copied: deploy_skills() now symlinks the shared skills tree into agent discovery paths (ln -sfn) instead of cp -r.
Registry contract tightened: get_sandbox_home_dirs() returns only config/auth dirs, not tool install dirs. New registry invariant test enforces that install_cmd targets shared prefixes.

Test plan

pytest tests/test_sandbox_setup.py — contract tests for the generated setup command
pytest tests/test_agent_setup.py — skill deployment uses symlinks, not copies
pytest tests/test_registry_invariants.py — all agents install to shared prefixes
pytest tests/test_sandbox.py — home dir derivation matches new semantics

release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign

The oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for whatever model the CLI defaulted to (claude-haiku-4-5-20251001). This made `bench eval create -a oracle` fail without ANTHROPIC_API_KEY set, even though oracle doesn't need it.

Move the fix from resolve_agent_env to the CLI layer: oracle runs solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL at all. Both _run_single and _run_batch now pass model=None for oracle. Widen JobConfig.model to str | None to support this.

…key-check fix: skip model/API-key validation for oracle agent

PR benchflow-ai#173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation. Three changes: - Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model. - Delete the orphan cli/eval.py and its tests — the live eval_create lives in cli/main.py and was the actual code path users hit. - Add effective_model(agent, model) helper, change JobConfig.model default to None, replace seven `model or DEFAULT_MODEL` sites in cli/main.py and job.py YAML loaders so oracle gets honest model=None end-to-end (in result/summary JSON, prints, and downstream Trial). Regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth — verified to fail on main with the user-facing ANTHROPIC_API_KEY error and pass after the fix.

Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude: - TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way. - TestEvalCreateRouting — `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR benchflow-ai#173 missed. - TestEffectiveModel — unit tests for the helper: oracle drops model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset. - TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards-compat preserved. - TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop after to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern. Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".

The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).

…e CLI

…t-and-cleanup fix: oracle chokepoint guard + effective_model helper

Plan step 1/6: Lock the new sandbox contract in tests

Plan step 2/6: Narrow setup_sandbox_user() to user state only

Plan step 3/6: Align registry semantics with the new contract

Replace per-trial skill-tree copies with ln -sfn into a shared /skills (or task skills_dir) root, drop skill_paths from get_sandbox_home_dirs(), and add registry + sandbox-setup invariants that keep agent binaries on /usr/local/* rather than /root-only home paths. Updates task-authoring and api-reference docs to describe the new lightweight sandbox contract.

devin-ai-integration

Devin Review found 1 potential issue.

View 6 additional findings in Devin Review.

devin-ai-integration · 2026-04-25T07:55:08Z

        else:
            logger.info("Skills already injected via Dockerfile")


🔴 deploy_skills uses wrong source path when skills are already injected via Dockerfile

When skills_dir is provided AND the Dockerfile already contains COPY _deps/skills /skills/ (the already_injected path at src/benchflow/_agent_setup.py:126), effective_skills stays set to task_skills_dir (line 117) instead of being updated to "/skills". In the old code, effective_skills = "/skills" if skills_dir else task_skills_dir unconditionally set it to "/skills" whenever skills_dir was truthy, which correctly covered this case.

If task_skills_dir is None (common when the task doesn't declare environment.skills_dir), the condition at line 139 (if effective_skills and ...) is False, so no skill linking happens at all — skills are silently not distributed to agent discovery paths despite being present at /skills in the container.

Suggested change

else:

logger.info("Skills already injected via Dockerfile")

else:

logger.info("Skills already injected via Dockerfile")

effective_skills = "/skills"

Was this helpful? React with 👍 or 👎 to provide feedback.

Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were landing on top of pre-existing repo lint debt. What changed: 1. Auto-fixes via `ruff check --fix --unsafe-fixes`: - 40 F401 unused-imports across src/, tests/, examples/ - 8 I001 unsorted-imports - 6 UP037 quoted-annotations modernized - Other auto-fixable rules 2. Hand fixes: - src/benchflow/__init__.py: removed `Trial` from the `from harbor` re-export block (it was shadowed by `from benchflow.trial import Trial` at line 65, which is the canonical public Trial). Added `trial_config_from_yaml` to __all__. - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for B904 (errors raised inside except clauses). - src/benchflow/mcp/reviewer_server.py: same B904 fix for fastmcp ImportError reraise. - tests/test_skill_eval.py: raw string for `pytest.raises(match=...)` pattern (RUF043). - 3 files: replaced `×` (Unicode multiplication sign) in comments and f-strings with `x` (latin x) to clear RUF001/RUF003. 3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`: - `experiments/*.py` and `tests/conformance/*.py` ignore E402 — these are standalone scripts that legitimately set sys.path before importing. - `src/benchflow/runtime.py` ignores F821 — uses forward references resolved by `from __future__ import annotations`; explicit TYPE_CHECKING imports would force eager loads. No code behavior changes. 580 tests pass; the 8 pre-existing failures (env-leak between subscription auth tests, Docker compose env, judge model default mismatch) are unrelated to this PR.

# Conflicts: # src/benchflow/_agent_env.py # src/benchflow/cli/eval.py # tests/test_agent_setup.py # tests/test_oracle_chokepoint.py

xdotli and others added 15 commits April 21, 2026 16:38

release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign

7dc18fc

release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign

Merge pull request benchflow-ai#173 from EYH0602/fix/oracle-skip-api-…

a099ff9

…key-check fix: skip model/API-key validation for oracle agent

docs: clarify cli/eval.py and test_eval_cli.py are not wired into liv…

bc04c59

…e CLI

Merge pull request benchflow-ai#174 from EYH0602/fix/oracle-chokepoin…

144b6dc

…t-and-cleanup fix: oracle chokepoint guard + effective_model helper

docs(plan): add plan to fix sandbox io problem

9459a23

test: lock sandbox setup contract

f18bab6

Plan step 1/6: Lock the new sandbox contract in tests

fix: stop copying root tool installs into sandbox home

0f9b320

Plan step 2/6: Narrow setup_sandbox_user() to user state only

refactor: derive sandbox home dirs from registry config

13f06b7

Plan step 3/6: Align registry semantics with the new contract

chore: remove completed sandbox plan doc

cbca495

EYH0602 added a commit to EYH0602/benchflow that referenced this pull request Apr 23, 2026

preview: fix/sandbox-copy (PR benchflow-ai#181)

f138e91

xdotli changed the base branch from main to dev-0.3 April 25, 2026 07:51

devin-ai-integration Bot reviewed Apr 25, 2026

View reviewed changes

This was referenced Apr 25, 2026

merge: main → dev-0.3 (release prep for v0.3.2) #195

Merged

chore: clean up ruff lint debt across repo #197

Merged

Merge remote-tracking branch 'origin/dev-0.3' into pr-181-rebase

0bf476a

# Conflicts: # src/benchflow/_agent_env.py # src/benchflow/cli/eval.py # tests/test_agent_setup.py # tests/test_oracle_chokepoint.py

xdotli merged commit e66537c into benchflow-ai:dev-0.3 Apr 25, 2026
1 check passed

xdotli mentioned this pull request Apr 25, 2026

release: benchflow 0.3.2 — BaseUser, verifier hardening, DinD compose, lint cleanup #199

Merged

4 tasks

EYH0602 deleted the fix/sandbox-copy branch April 25, 2026 20:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: stop copying root tool installs into sandbox home#181

fix: stop copying root tool installs into sandbox home#181
xdotli merged 16 commits intobenchflow-ai:dev-0.3from
EYH0602:fix/sandbox-copy

EYH0602 commented Apr 22, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

EYH0602 commented Apr 22, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

EYH0602 commented Apr 22, 2026 •

edited by devin-ai-integration Bot

Loading