fix: stop copying root tool installs into sandbox home#181
fix: stop copying root tool installs into sandbox home#181xdotli merged 16 commits intobenchflow-ai:dev-0.3from
Conversation
release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign
The oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for whatever model the CLI defaulted to (claude-haiku-4-5-20251001). This made `bench eval create -a oracle` fail without ANTHROPIC_API_KEY set, even though oracle doesn't need it.
Move the fix from resolve_agent_env to the CLI layer: oracle runs solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL at all. Both _run_single and _run_batch now pass model=None for oracle. Widen JobConfig.model to str | None to support this.
…key-check fix: skip model/API-key validation for oracle agent
PR benchflow-ai#173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation. Three changes: - Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model. - Delete the orphan cli/eval.py and its tests — the live eval_create lives in cli/main.py and was the actual code path users hit. - Add effective_model(agent, model) helper, change JobConfig.model default to None, replace seven `model or DEFAULT_MODEL` sites in cli/main.py and job.py YAML loaders so oracle gets honest model=None end-to-end (in result/summary JSON, prints, and downstream Trial). Regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth — verified to fail on main with the user-facing ANTHROPIC_API_KEY error and pass after the fix.
Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude: - TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way. - TestEvalCreateRouting — `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR benchflow-ai#173 missed. - TestEffectiveModel — unit tests for the helper: oracle drops model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset. - TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards-compat preserved. - TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop after to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern. Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".
The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).
…t-and-cleanup fix: oracle chokepoint guard + effective_model helper
Plan step 1/6: Lock the new sandbox contract in tests
Plan step 2/6: Narrow setup_sandbox_user() to user state only
Plan step 3/6: Align registry semantics with the new contract
Replace per-trial skill-tree copies with ln -sfn into a shared /skills (or task skills_dir) root, drop skill_paths from get_sandbox_home_dirs(), and add registry + sandbox-setup invariants that keep agent binaries on /usr/local/* rather than /root-only home paths. Updates task-authoring and api-reference docs to describe the new lightweight sandbox contract.
| else: | ||
| logger.info("Skills already injected via Dockerfile") |
There was a problem hiding this comment.
🔴 deploy_skills uses wrong source path when skills are already injected via Dockerfile
When skills_dir is provided AND the Dockerfile already contains COPY _deps/skills /skills/ (the already_injected path at src/benchflow/_agent_setup.py:126), effective_skills stays set to task_skills_dir (line 117) instead of being updated to "/skills". In the old code, effective_skills = "/skills" if skills_dir else task_skills_dir unconditionally set it to "/skills" whenever skills_dir was truthy, which correctly covered this case.
If task_skills_dir is None (common when the task doesn't declare environment.skills_dir), the condition at line 139 (if effective_skills and ...) is False, so no skill linking happens at all — skills are silently not distributed to agent discovery paths despite being present at /skills in the container.
| else: | |
| logger.info("Skills already injected via Dockerfile") | |
| else: | |
| logger.info("Skills already injected via Dockerfile") | |
| effective_skills = "/skills" |
Was this helpful? React with 👍 or 👎 to provide feedback.
Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were landing on top of pre-existing repo lint debt. What changed: 1. Auto-fixes via `ruff check --fix --unsafe-fixes`: - 40 F401 unused-imports across src/, tests/, examples/ - 8 I001 unsorted-imports - 6 UP037 quoted-annotations modernized - Other auto-fixable rules 2. Hand fixes: - src/benchflow/__init__.py: removed `Trial` from the `from harbor` re-export block (it was shadowed by `from benchflow.trial import Trial` at line 65, which is the canonical public Trial). Added `trial_config_from_yaml` to __all__. - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for B904 (errors raised inside except clauses). - src/benchflow/mcp/reviewer_server.py: same B904 fix for fastmcp ImportError reraise. - tests/test_skill_eval.py: raw string for `pytest.raises(match=...)` pattern (RUF043). - 3 files: replaced `×` (Unicode multiplication sign) in comments and f-strings with `x` (latin x) to clear RUF001/RUF003. 3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`: - `experiments/*.py` and `tests/conformance/*.py` ignore E402 — these are standalone scripts that legitimately set sys.path before importing. - `src/benchflow/runtime.py` ignores F821 — uses forward references resolved by `from __future__ import annotations`; explicit TYPE_CHECKING imports would force eager loads. No code behavior changes. 580 tests pass; the 8 pre-existing failures (env-leak between subscription auth tests, Docker compose env, judge model default mismatch) are unrelated to this PR.
# Conflicts: # src/benchflow/_agent_env.py # src/benchflow/cli/eval.py # tests/test_agent_setup.py # tests/test_oracle_chokepoint.py
Summary
Closes #178.
setup_sandbox_user()no longer recursively copies/root/.nvm,/root/.local/bin, or heavyweight tool trees into the sandbox home. Agent binaries are expected to live on shared prefixes like/usr/local/binor/opt./rootget a narrowln -sinstead of a fullcp -a, keeping setup instant.deploy_skills()now symlinks the shared skills tree into agent discovery paths (ln -sfn) instead ofcp -r.get_sandbox_home_dirs()returns only config/auth dirs, not tool install dirs. New registry invariant test enforces thatinstall_cmdtargets shared prefixes.Test plan
pytest tests/test_sandbox_setup.py— contract tests for the generated setup commandpytest tests/test_agent_setup.py— skill deployment uses symlinks, not copiespytest tests/test_registry_invariants.py— all agents install to shared prefixespytest tests/test_sandbox.py— home dir derivation matches new semantics