fix: merge cfg.agent_env into connect_as() env resolution by EYH0602 · Pull Request #191 · benchflow-ai/benchflow

EYH0602 · 2026-04-23T23:07:41Z

Summary

Implements plan/fix-agent-env.md. Fixes #2.

connect_as() passed only role.env to resolve_agent_env, discarding all config-level env vars from cfg.agent_env (e.g. BENCHFLOW_PROVIDER_BASE_URL from YAML). This merges cfg.agent_env as base with role.env as overlay, so role-specific vars win on key overlap.
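A minimal sketch of the merge described above, assuming both env mappings are plain string dicts and that either may be None. `merge_agent_env` is a hypothetical helper name used for illustration, not the actual code in trial.py.

```python
from typing import Mapping, Optional


def merge_agent_env(
    cfg_env: Optional[Mapping[str, str]],
    role_env: Optional[Mapping[str, str]],
) -> dict[str, str]:
    """Merge config-level env (base) with role-level env (overlay)."""
    # Later unpacking wins, so role_env overrides cfg_env on shared keys.
    # `or {}` keeps the merge safe when either side is None.
    return {**(cfg_env or {}), **(role_env or {})}


# Illustrative values only.
cfg_env = {"BENCHFLOW_PROVIDER_BASE_URL": "http://cfg.example", "SHARED": "from-cfg"}
role_env = {"SHARED": "from-role", "ROLE_ONLY": "1"}
merged = merge_agent_env(cfg_env, role_env)
```

With these inputs, the config-level BENCHFLOW_PROVIDER_BASE_URL survives the merge, while SHARED takes the role-level value.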

Plan Steps

Fix connect_as() env merging in trial.py:641 — one-line dict merge
Add 4 regression tests for env propagation (tests/test_connect_as_env.py)
Run test suite — all related tests pass, no regressions

Review Status

Eng review: CLEAR (PLAN) — 0 issues, 0 critical gaps (via /plan-eng-review).

Test Plan

Verify pytest tests/test_connect_as_env.py passes (4 tests)
Verify pytest tests/test_sdk_internals.py::TestResolveAgentEnv passes
Verify pytest tests/test_resolve_env_helpers.py passes
Manual: run a trial with agent_env in YAML config and confirm env vars reach the agent

The oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for whatever model the CLI defaulted to (claude-haiku-4-5-20251001). This made `bench eval create -a oracle` fail without ANTHROPIC_API_KEY set, even though oracle doesn't need it.

Move the fix from resolve_agent_env to the CLI layer: oracle runs solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL at all. Both _run_single and _run_batch now pass model=None for oracle. Widen JobConfig.model to str | None to support this.

PR benchflow-ai#173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation.

Three changes:

- Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests: the live eval_create lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model default to None, replace seven `model or DEFAULT_MODEL` sites in cli/main.py and job.py YAML loaders so oracle gets honest model=None end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth; verified to fail on main with the user-facing ANTHROPIC_API_KEY error and pass after the fix.
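The chokepoint guard can be pictured roughly as follows. This is a simplified stand-in, not the real resolve_agent_env: the function name `resolve_agent_env_sketch`, the `startswith("claude")` provider check, and the use of ValueError are all assumptions made for illustration.

```python
from typing import Mapping, Optional

DEFAULT_MODEL = "claude-haiku-4-5-20251001"  # default model named in the commit message


def resolve_agent_env_sketch(
    agent: str, model: Optional[str], env: Mapping[str, str]
) -> dict[str, str]:
    """Hypothetical stand-in: validate the API key unless the agent is oracle."""
    resolved = dict(env)
    # Oracle runs solution/solve.sh and never calls an LLM, so validation is
    # skipped even when a caller forwards a model.
    needs_key = agent != "oracle" and bool(model) and model.startswith("claude")
    if needs_key and "ANTHROPIC_API_KEY" not in resolved:
        raise ValueError(f"ANTHROPIC_API_KEY required for model {model!r}")
    return resolved


# Mirrors the regression test's call shape: oracle, default model, empty env.
oracle_env = resolve_agent_env_sketch("oracle", DEFAULT_MODEL, {})
```

Keeping the guard at the resolver means a caller that still forwards DEFAULT_MODEL for oracle does not trip key validation.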

Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude:

- TestOrphanRemoval: cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting: `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR benchflow-ai#173 missed.
- TestEffectiveModel: unit tests for the helper (oracle drops model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset).
- TestOracleYamlLoaders: Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards-compat preserved.
- TestEvalCreateOracleCLI: end-to-end, live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop afterward to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern.

Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".
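The three behaviors that TestEffectiveModel pins suggest a helper along these lines. The name and signature come from the commit message; the body is an assumption reconstructed from the described behavior, not the project's actual implementation.

```python
from typing import Optional

DEFAULT_MODEL = "claude-haiku-4-5-20251001"  # default model named in the commit message


def effective_model(agent: str, model: Optional[str]) -> Optional[str]:
    """Sketch: pick the model an agent should actually run with."""
    if agent == "oracle":
        # Oracle never calls an LLM, so it carries no model at all.
        return None
    # None and "" are both treated as unset and fall back to the default.
    return model or DEFAULT_MODEL
```

Routing every `model or DEFAULT_MODEL` site through one helper keeps the oracle special case in a single place.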

The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).
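A guard of the "main does not import eval" kind could be written by inspecting the module's source with `ast`. The sketch below is an assumption about how such a check might look; `imports_module` is a hypothetical helper, and it handles absolute imports only, so relative forms such as `from .eval import ...` would need extra handling.

```python
import ast


def imports_module(source: str, target: str) -> bool:
    """Return True if `source` imports `target` via an absolute import."""
    prefix = target + "."
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # e.g. `import benchflow.cli.eval`
            if any(a.name == target or a.name.startswith(prefix) for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # e.g. `from benchflow.cli.eval import something`
            if node.module == target or node.module.startswith(prefix):
                return True
            # e.g. `from benchflow.cli import eval`
            if any(f"{node.module}.{a.name}" == target for a in node.names):
                return True
    return False
```

A test would read the text of cli/main.py and assert that `imports_module(text, "benchflow.cli.eval")` is False.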

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

connect_as() passed only role.env to resolve_agent_env, losing all config-level env vars (e.g. BENCHFLOW_PROVIDER_BASE_URL from YAML). Merge cfg.agent_env as base with role.env overlay so role-specific vars win on overlap. Squashed from benchflow-ai#191.

Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were landing on top of pre-existing repo lint debt.

What changed:

1. Auto-fixes via `ruff check --fix --unsafe-fixes`:
   - 40 F401 unused-imports across src/, tests/, examples/
   - 8 I001 unsorted-imports
   - 6 UP037 quoted-annotations modernized
   - Other auto-fixable rules
2. Hand fixes:
   - src/benchflow/__init__.py: removed `Trial` from the `from harbor` re-export block (it was shadowed by `from benchflow.trial import Trial` at line 65, which is the canonical public Trial). Added `trial_config_from_yaml` to __all__.
   - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for B904 (errors raised inside except clauses).
   - src/benchflow/mcp/reviewer_server.py: same B904 fix for fastmcp ImportError reraise.
   - tests/test_skill_eval.py: raw string for `pytest.raises(match=...)` pattern (RUF043).
   - 3 files: replaced `×` (Unicode multiplication sign) in comments and f-strings with `x` (latin x) to clear RUF001/RUF003.
3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`:
   - `experiments/*.py` and `tests/conformance/*.py` ignore E402: these are standalone scripts that legitimately set sys.path before importing.
   - `src/benchflow/runtime.py` ignores F821: uses forward references resolved by `from __future__ import annotations`; explicit TYPE_CHECKING imports would force eager loads.

No code behavior changes. 580 tests pass; the 8 pre-existing failures (env-leak between subscription auth tests, Docker compose env, judge model default mismatch) are unrelated to this PR.
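For readers unfamiliar with B904, the pattern behind the `raise ConnectionError(...) from e` fixes looks like this. The `connect` function and host name are hypothetical and exist only to show exception chaining; they are not taken from process.py.

```python
def connect(host: str) -> None:
    """Hypothetical connector that always fails, to show exception chaining."""
    try:
        raise OSError(f"cannot reach {host}")
    except OSError as e:
        # `from e` records the original error as __cause__, which is what
        # B904 asks for when raising inside an except clause.
        raise ConnectionError(f"connection to {host} failed") from e


try:
    connect("example.invalid")
except ConnectionError as err:
    cause = err.__cause__  # the original OSError, preserved for debugging
```

Explicit chaining keeps the low-level error attached to the higher-level one, so tracebacks show both.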

xdotli and others added 12 commits April 21, 2026 16:38

release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign

7dc18fc

Merge pull request benchflow-ai#173 from EYH0602/fix/oracle-skip-api-…

a099ff9

…key-check fix: skip model/API-key validation for oracle agent

docs: clarify cli/eval.py and test_eval_cli.py are not wired into liv…

bc04c59

…e CLI

Merge pull request benchflow-ai#174 from EYH0602/fix/oracle-chokepoin…

144b6dc

…t-and-cleanup fix: oracle chokepoint guard + effective_model helper

docs: add fix plan for connect_as() agent_env bug (#2)

c231ebe

docs: expand fix plan with eng review findings and test cases

64fc23d

Add two edge-case test requirements (non-overlapping key merge, None safety) from /plan-eng-review. Append review report confirming 0 issues, 0 critical gaps — ready to implement.

fix: merge cfg.agent_env into connect_as() env resolution (#2)

4b36c74

connect_as() passed only role.env to resolve_agent_env, losing all config-level env vars (e.g. BENCHFLOW_PROVIDER_BASE_URL from YAML). Merge cfg.agent_env as base with role.env overlay so role-specific vars win on overlap.

devin-ai-integration Bot reviewed Apr 23, 2026

remove plan

4731d8c

xdotli changed the base branch from main to dev-0.3 April 25, 2026 07:51

xdotli mentioned this pull request Apr 25, 2026

merge: main → dev-0.3 (release prep for v0.3.2) #195

Merged

4 tasks

Merge remote-tracking branch 'origin/dev-0.3' into pr-191-rebase

dbcd7b0

# Conflicts: # src/benchflow/trial.py

xdotli mentioned this pull request Apr 25, 2026

chore: clean up ruff lint debt across repo #197

Merged

4 tasks

xdotli merged commit 9fd2863 into benchflow-ai:dev-0.3 Apr 25, 2026
1 check passed

xdotli mentioned this pull request Apr 25, 2026

release: benchflow 0.3.2 — BaseUser, verifier hardening, DinD compose, lint cleanup #199

Merged

4 tasks

EYH0602 deleted the fix/agent_env branch April 25, 2026 20:31

xdotli merged 14 commits into benchflow-ai:dev-0.3 from EYH0602:fix/agent_env

2 participants
