feat: BaseUser abstraction + per-task verifier hardening opt-outs#194
Merged
Add User as a first-class participant in the trial loop: a Python callback that produces prompts, sees test results between rounds, and decides when to stop. This is the infrastructure Josh (GitHub/Microsoft) needs for SWE-bench Pro progressive disclosure.

New types (user.py):
- BaseUser with setup(instruction, solution) and run(round, instruction, round_result)
- RoundResult dataclass with trajectory, rewards, verifier output
- PassthroughUser (backward-compat default, single round)
- FunctionUser (wraps a plain callback for lightweight use)

Trial changes:
- TrialConfig gains user, max_user_rounds, oracle_access fields
- Trial._run_user_loop(): user.run() → connect → execute → disconnect → soft_verify() → build RoundResult → repeat until None or max rounds
- Trial.soft_verify(): runs Harbor verifier WITHOUT hardening so agent stays alive between rounds. Final verify() still does full hardening.
- Multi-role + User raises ValueError (deferred to future phase)

16 new tests, 0 regressions on existing 618 tests.
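A minimal sketch of the round loop this commit describes, not the benchflow implementation. The class and method names come from the commit message; everything else is an assumption made for illustration: that `setup()` and `run()` are coroutines, the exact `RoundResult` fields, and the `run_user_loop` and `fake_execute_round` helpers, which stand in for `Trial._run_user_loop()` and the connect → execute → disconnect → `soft_verify()` sequence.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RoundResult:
    """Hypothetical stand-in; the real dataclass also carries a trajectory."""
    round: int
    rewards: dict = field(default_factory=dict)
    verifier_output: str = ""


class BaseUser:
    async def setup(self, instruction: str, solution: Optional[str] = None) -> None:
        """Called once before round 0; a no-op by default (assumed)."""

    async def run(self, round: int, instruction: str,
                  round_result: Optional[RoundResult]) -> Optional[str]:
        """Return the next prompt, or None to stop the loop."""
        raise NotImplementedError


class PassthroughUser(BaseUser):
    """Single round: send the full instruction once, then stop."""

    async def run(self, round, instruction, round_result):
        return instruction if round == 0 else None


async def run_user_loop(user, instruction, execute_round, max_user_rounds=3):
    """Ask the user for a prompt, run one round, feed the result back, repeat."""
    await user.setup(instruction)
    results = []
    last = None
    for round_no in range(max_user_rounds):
        prompt = await user.run(round_no, instruction, last)
        if prompt is None:  # the user decided to stop
            break
        last = await execute_round(round_no, prompt)
        results.append(last)
    return results


async def fake_execute_round(round_no, prompt):
    """Placeholder for connect -> execute -> disconnect -> soft_verify()."""
    return RoundResult(round=round_no, rewards={"reward": 0.0}, verifier_output=prompt)
```

With `PassthroughUser`, the loop runs exactly one round, which matches the backward-compatible default described above.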
1. Reorder: disconnect() before soft_verify(). The agent process is already dead when soft_verify runs, so soft_verify's docstring was misleading. Now disconnect → soft_verify is the explicit flow.
2. soft_verify() now runs CLEANUP_CMD (conftest/pth/sitecustomize purge) before the verifier. Prevents agent from gaming intermediate test results by injecting test-patching files.
3. FunctionUser: use inspect.isawaitable() instead of asyncio.iscoroutine(). Handles asyncio.Task, Future, and any __await__ object, not just coroutines.
4. oracle_access: cat /solution now runs as user="root". /solution is locked (root:700) after install_agent, so the read would silently fail without root.
5. try/finally around connect/execute/disconnect in user loop. Ensures disconnect() always runs even if execute() raises.
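Item 3 can be illustrated with a small sketch. The helper and callback names below are hypothetical and this is not the FunctionUser code; it only shows why `inspect.isawaitable()` covers more return types than `asyncio.iscoroutine()`: a callback that returns a `Future` is awaitable but is not a coroutine.

```python
import asyncio
import inspect


async def call_user_callback(fn, *args):
    """Call fn and await its result only when the result is awaitable."""
    result = fn(*args)
    if inspect.isawaitable(result):
        result = await result
    return result


def sync_cb(round_no):
    return f"sync prompt {round_no}"


async def coro_cb(round_no):
    return f"async prompt {round_no}"


def future_cb(round_no):
    # Returns a Future, for which asyncio.iscoroutine() is False.
    fut = asyncio.get_running_loop().create_future()
    fut.set_result(f"future prompt {round_no}")
    return fut


async def demo():
    return [await call_user_callback(cb, 0) for cb in (sync_cb, coro_cb, future_cb)]
```

All three callback styles resolve to a plain string, so a wrapper written this way does not need to know whether the callback is sync or async.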
Demonstrates the FunctionUser abstraction:
- Round 0: terse 2-sentence prompt
- Round 1: hints about edge cases on failure
- Round 2: full instruction on continued failure
- Stops early if tests pass
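The three-round schedule above could be written as a plain callback along these lines. This is a sketch under assumptions: the signature (round number, full instruction, last reward), the `reward >= 1.0` pass threshold, the hint wording, and the helper names are all hypothetical and not taken from the dogfood script.

```python
from typing import Optional

# Hypothetical hint text; the real example derives hints from the task.
EDGE_CASE_HINT = "Tests are still failing. Re-check the edge cases."


def terse(instruction: str) -> str:
    """Keep only the first two sentences of the instruction."""
    head = ". ".join(instruction.split(". ")[:2])
    return head if head.endswith(".") else head + "."


def progressive_prompt(round_no: int, instruction: str,
                       last_reward: Optional[float]) -> Optional[str]:
    """Return the prompt for this round, or None to stop."""
    if last_reward is not None and last_reward >= 1.0:
        return None  # tests passed: stop early
    if round_no == 0:
        return terse(instruction)
    if round_no == 1:
        return f"{EDGE_CASE_HINT}\n\n{terse(instruction)}"
    if round_no == 2:
        return instruction  # full instruction on continued failure
    return None  # no more rounds after the full disclosure
```

A function of this shape is the kind of plain callback that FunctionUser is described as wrapping.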
- Remove 4 tautological tests (pure dataclass reads) per CLAUDE.md convention: TestRoundResult.test_defaults, test_with_data, TestTrialConfigUser.test_user_field_defaults_to_none, test_user_field_set
- Fix dogfood model name: gemini-2.5-flash (not expired preview)
- Note: iscoroutine→isawaitable was already fixed in 51d6c61
…tests

1. Oracle /solution is now moved (not deleted) before agent runs and restored before final verify(). Prevents breaking verifiers that need /solution to compute rewards.
2. Remove unused asyncio import from user.py.
3. Add 4 soft_verify tests: timeout, crash, success, and CLEANUP_CMD execution verification. soft_verify is no longer untested.
3-round progressive disclosure with Gemini Flash on regex-log:

Round 0: terse prompt (2 tool calls) → reward=0.0
Round 1: hint prompt (3 tool calls) → reward=0.0
Round 2: full instruction (3 tool calls) → reward=0.0
Final verify: reward=0.0

Agent scored 0.0 on all rounds; regex-log is a hard task. But the infrastructure works end-to-end: user loop, soft_verify, fresh ACP sessions per round, user_rounds.jsonl persistence, final hardened verify. No errors.
OpenCode (opencode-ai) is an open-source TypeScript coding agent with ACP support. Skills path: $HOME/.opencode/skills (updated from .opencode/skill per skillsbench #718). Closes skillsbench #718 for the benchflow side.
Root cause: OpenCode's ACP parseModel() splits modelId on "/" to extract providerID and modelID. When benchflow sent "gemini-3.1-pro-preview" (no slash), opencode parsed it as providerID="gemini-3.1-pro-preview" with modelID="", an invalid config that silently returned end_turn.

Fix: Add acp_model_format field to AgentConfig. When set to "provider/model" (opencode), _format_acp_model() infers the models.dev provider from the bare model name (e.g. "gemini" → "google") and sends "google/gemini-3.1-pro-preview" to set_model.

Also: opencode requires_env is now empty (inferred from model at runtime, not hardcoded to ANTHROPIC_API_KEY).
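A sketch of the failure and the fix, in Python for illustration. `split_model_id` only imitates the split-on-"/" behaviour described above and is not OpenCode's TypeScript code. `format_acp_model` is a hypothetical stand-in for `_format_acp_model()`: only the "gemini" → "google" mapping comes from the commit message, while the other table entries and the "return the bare name on unknown provider" fallback are assumptions.

```python
import logging

logger = logging.getLogger(__name__)

# Only "gemini" -> "google" is stated in the commit; the rest are illustrative.
PROVIDER_BY_PREFIX = {
    "gemini": "google",
    "claude": "anthropic",
    "gpt": "openai",
}


def split_model_id(model_id: str):
    """Imitate the described parse: text before '/' is the provider, the rest the model."""
    provider_id, _, model = model_id.partition("/")
    return provider_id, model


def format_acp_model(model: str, acp_model_format=None) -> str:
    """Prefix a bare model name with an inferred provider when the agent needs it."""
    if acp_model_format != "provider/model" or "/" in model:
        return model
    for prefix, provider in PROVIDER_BY_PREFIX.items():
        if model.startswith(prefix):
            return f"{provider}/{model}"
    logger.warning("unknown provider for model %r; sending bare name", model)
    return model
```

Parsing the bare name yields an empty model part, which is the invalid configuration described above; the formatted name splits into a provider and a model as intended.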
OpenCode + gemini-3.1-pro-preview on qutebrowser SWE-bench Pro:

Baseline (full prompt, 1 round): 40 tools, 736s, reward=0.0
Progressive (3 rounds): 185 tools, 1154s, reward=0.0
  Round 0 (terse): 86 tools (81 bash + 5 edit)
  Round 1 (hints): 76 tools (66 bash + 10 edit)
  Round 2 (full): 23 tools (16 bash + 7 edit)

Both scored 0.0 due to verifier infrastructure bug (rootdir=/tests instead of /app, pytest couldn't find config). Agent's fixes were likely correct: it demonstrated passing tests in its own environment.

Key findings:
- Progressive disclosure changed agent behavior (86→76→23 tools)
- _reset_cache implemented only after Round 1 hint
- OpenCode handled 185 tool calls without token limits
- Verifier rootdir bug needs investigation
The old mechanism (4 dicts + 4 functions + 1 regex) required manual code changes for every new benchmark with an undeclared pytest plugin. SWE-bench Pro tasks failed because pytest-benchmark wasn't whitelisted.

New mechanism: one container-side script + one async function. At hardening time, enumerate all pytest11 entry points from root-owned system packages. Only root-owned dist-info directories are trusted; editable installs from agent-writable /testbed are excluded.

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 stays in place. Security preserved. task.toml pytest_plugins kept as fallback.

Deleted: _PYTEST_PLUGIN_ALIASES, _PYTEST_OPTION_PLUGINS, _PYTEST_INSTALLED_PLUGINS, _PIP_INSTALL_RE, _normalize_pytest_plugin, _plugins_from_verifier_script, _declared_pytest_plugins, _pytest_plugin_flags, tomllib import.

Added: _DISCOVER_PYTEST_PLUGINS_SCRIPT, _discover_pytest_plugin_flags.
Python 3.9's entry_points() doesn't accept keyword arguments — returns
a dict instead. Fall back to entry_points().get('pytest11', []) when
the keyword style raises TypeError.
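The fallback described here can be sketched as follows. This is not the contents of `_DISCOVER_PYTEST_PLUGINS_SCRIPT`; the function names are hypothetical, and turning each entry point's module into a `-p <module>` flag is an assumption about how discovered plugins are re-enabled while `PYTEST_DISABLE_PLUGIN_AUTOLOAD=1` is set.

```python
from importlib.metadata import entry_points


def pytest11_entry_points():
    """Return the pytest11 entry points on both old and new Python versions."""
    try:
        # Python 3.10+: entry_points() accepts selection keywords.
        return list(entry_points(group="pytest11"))
    except TypeError:
        # Python 3.9: no keyword arguments; the call returns a dict keyed by group.
        return list(entry_points().get("pytest11", []))


def plugin_flags():
    """Build '-p <module>' flags for every discovered plugin (assumed flag style)."""
    modules = sorted({ep.value.split(":")[0] for ep in pytest11_entry_points()})
    flags = []
    for module in modules:
        flags += ["-p", module]
    return flags
```

Sorting and de-duplicating the module names keeps the resulting flag list stable across runs, which makes it easier to log and compare.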
The uid==0 check was failing on Python 3.9 containers where ep.dist._path doesn't exist. Simplified to just enumerate all pytest11 entry points — sandbox_user prevents agent pip installs, so all discovered plugins are image-authored.
Both progressive + baseline rerun with working verifier (15 plugins discovered). Results with honest scoring:

Progressive (3 rounds): 284 tools, 970s, reward=0.0
  Round 0: 94 tools, Round 1: 92 tools, Round 2: 98 tools
Baseline (1 round): 73 tools, 611s, reward=0.0

Both failed due to agent code errors (circular imports), not verifier infrastructure. Progressive used 4x more compute for same outcome on this task.
VERIFIER_ENV cleared PYTHONPATH="" which broke SWE-bench Pro tasks where the Dockerfile sets PYTHONPATH=/app/lib:/app for project imports.

New: _trusted_verifier_pythonpath() filters PYTHONPATH using the same root-owned validation as PATH, but does NOT block the workspace. /app is already importable via CWD/pytest sys.path insertion, so clearing it only breaks imports without security benefit. /tmp, /var/tmp, /home/agent are still blocked. Re-pinned after task-env merge like PATH.
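One way to picture the filter is the sketch below. It is not `_trusted_verifier_pythonpath()` itself: the blocked prefixes come from the commit message, but the function signature, the decision to exempt entries under the workspace from the ownership check, and the injectable `is_root_owned` predicate are assumptions, since the commit does not spell out how ownership validation and the workspace exemption interact.

```python
import os
import posixpath

BLOCKED_PREFIXES = ("/tmp", "/var/tmp", "/home/agent")


def _is_under(path: str, prefix: str) -> bool:
    """True when path is prefix itself or lives below it."""
    return path == prefix or path.startswith(prefix + "/")


def _root_owned(path: str) -> bool:
    """Default ownership check: the entry exists and is owned by uid 0."""
    try:
        return os.stat(path).st_uid == 0
    except OSError:
        return False


def trusted_pythonpath(raw: str, workspace: str = "/app",
                       is_root_owned=_root_owned) -> str:
    """Drop PYTHONPATH entries that are relative, blocked, or untrusted."""
    kept = []
    for entry in raw.split(":"):
        if not entry:
            continue
        norm = posixpath.normpath(entry)
        if not posixpath.isabs(norm):
            continue  # relative entries depend on CWD, so skip them (assumed)
        if any(_is_under(norm, p) for p in BLOCKED_PREFIXES):
            continue
        # Assumption: workspace entries are kept without an ownership check.
        if _is_under(norm, workspace) or is_root_owned(norm):
            kept.append(norm)
    return ":".join(kept)
```

Applied to the Dockerfile value mentioned above, `/app/lib:/app` survives unchanged while anything under `/tmp`, `/var/tmp`, or `/home/agent` is removed.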
- soft_verify: chmod 777 /logs/verifier so non-root verifier can write
- soft_verify: restore /solution before verify, re-hide after (oracle access)
- validate empty roles (!=1) and multi-scene configs in user loop
- remove tautological test_setup_is_noop
- remove opencode BENCHFLOW_PROVIDER_API_KEY→ANTHROPIC_API_KEY mapping (wrong for non-Anthropic models; native keys inherited via auto_inherit_env)
- warn on unknown provider fallback in _format_acp_model
- remove --rootdir=/tests from VERIFIER_ENV (cherry-pick from PR #187)
- fix printenv PYTHONPATH crash when unset
- fix stale plugin discovery docstring
Runs oracle (gold solution) on all 4 testable tasks to verify the --rootdir fix, then runs a single-round agent baseline for comparison with progressive disclosure. Results to CSV.
Three Codex review findings on the BaseUser abstraction:

1. oracle_access=True with user=None silently leaves /solution exposed to the agent for the entire trial. Add a logger.warning at setup time so misconfigurations surface immediately.
2. Oracle restore (mv /solution_oracle_backup /solution) was outside any finally block. If _run_user_loop() raised, /solution was never restored. Move the user/scene execution into try/finally so the restore always runs before the final verify().
3. Oracle read used a wildcard fallback (cat /solution/* || true) that could leak unintended files (binaries, credentials). Narrow to solve.sh, the canonical SWE-bench Pro oracle path.

Bugs Codex flagged that were FALSE POSITIVES (verified against code):
- "session counter reset": disconnect() already resets both counters
- "None instruction": _resolve_prompts returns [instruction] not [None]

Tests still pass: 15 user + 58 sandbox = 73 total.
Two related changes addressing SWE-bench Pro oracle compatibility:

1) Restore --rootdir=/app in PYTEST_ADDOPTS

Removing --rootdir entirely (PR #187) made pytest fall back to /dev as rootdir (from -c /dev/null), producing test node IDs like ../dev/::test_foo instead of <repo>/<path>::test_foo. The verifier expects full-path IDs and reported 0 passing tests on openlibrary even though all 18 tests passed.

--rootdir=/app anchors test IDs to the canonical Harbor repo root while -c /dev/null still blocks pyproject.toml/pytest.ini discovery and --confcutdir=/tests still blocks conftest walk-up beyond /tests.

2) Per-task [verifier.hardening] opt-outs in task.toml

The cleanup that deletes agent-injected conftest.py also deletes legitimate repo conftest.py files. qutebrowser ships conftest.py that sets up import order to break a real circular dependency between qutebrowser.browser.inspector and qutebrowser.misc.miscwidgets. Without them, pytest collection fails on a type annotation in miscwidgets.py:419.

Tasks now declare opt-outs in task.toml:

    [verifier.hardening]
    cleanup_conftests = false  # qutebrowser

Defaults remain secure (all True).

New helpers in _sandbox.py:
- HARDENING_DEFAULTS: dict of feature flags
- _read_hardening_config(task_dir): parse task.toml [verifier.hardening]
- _build_cleanup_cmd(hardening): build cleanup honoring opt-outs

CLEANUP_CMD constant kept as backward-compat alias. Both harden_before_verify() and Trial.soft_verify() now read per-task hardening config before running cleanup.

Validation on SWE-bench Pro oracle (Daytona):
  Before: 2/4 (ansible, flipt); openlibrary failed test ID format, qutebrowser failed conftest deletion
  After: 5/5 (ansible, flipt, openlibrary, qutebrowser, navidrome)

Tests: 80 passing (15 user + 65 sandbox including 7 new opt-out tests).
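The opt-out mechanism in part 2 might look roughly like the sketch below. It is not the `_sandbox.py` code: only the `cleanup_conftests` flag and the `[verifier.hardening]` table are taken from the commit message. The single-entry defaults dict, the choice to ignore unknown or non-boolean keys, and the descriptive step strings (used here instead of real shell commands) are assumptions.

```python
import tempfile
from pathlib import Path

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport on older versions

# The real dict holds more flags; only this one is named in the commit.
HARDENING_DEFAULTS = {"cleanup_conftests": True}


def read_hardening_config(task_dir):
    """Merge [verifier.hardening] from task.toml over the secure defaults."""
    config = dict(HARDENING_DEFAULTS)
    toml_path = Path(task_dir) / "task.toml"
    if not toml_path.is_file():
        return config
    with toml_path.open("rb") as fh:
        data = tomllib.load(fh)
    overrides = data.get("verifier", {}).get("hardening", {})
    for key, value in overrides.items():
        # Assumption: unknown keys and non-boolean values are ignored.
        if key in config and isinstance(value, bool):
            config[key] = value
    return config


def build_cleanup_steps(hardening):
    """List the cleanup steps that remain enabled (descriptions, not commands)."""
    steps = []
    if hardening["cleanup_conftests"]:
        steps.append("purge conftest.py files")
    return steps


def demo():
    """Parse a qutebrowser-style opt-out from a temporary task directory."""
    with tempfile.TemporaryDirectory() as task_dir:
        Path(task_dir, "task.toml").write_text(
            "[verifier.hardening]\ncleanup_conftests = false\n"
        )
        return read_hardening_config(task_dir)
```

A task directory without a `task.toml`, or without the table, falls back to the defaults, so every hardening step stays on unless a task explicitly opts out.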
For Josh's SWE-bench Pro use case (and Harbor #1316 parity in the no-second-LLM case):
- docs/progressive-disclosure.md: dedicated guide for the BaseUser abstraction. Covers the API, oracle access, [verifier.hardening] opt-outs, and when to choose BaseUser vs multi-role Scene.
- docs/use-cases.md: brief mention in §1 (Interactive User Simulation) pointing to progressive-disclosure.md for the lighter-weight callback-based pattern.
- examples/swebench_pro_progressive_disclosure.ipynb: clean rewrite of the existing notebook. Shows the API, oracle 5/5, baseline 4 tasks, per-task hardening opt-out example, and a placeholder cell that auto-loads the latest progressive-disclosure run from /tmp/swebench-pro-jobs/progressive when one exists. Executes top-to-bottom against the current oracle/baseline CSV.
- examples/swebench_pro_user_dogfood.py: ready-to-run script for progressive disclosure on any of the 5 working SWE-bench Pro tasks. Three-round user: terse → failing tests + half spec → full spec.
- experiments/swebench-pro-results.csv: oracle + baseline results from 2026-04-24 Daytona run. qutebrowser entry is pre-fix (verified post-fix separately, noted in notebook).
# Conflicts:
#	src/benchflow/_acp_run.py
# Conflicts:
#	src/benchflow/__init__.py
#	src/benchflow/_acp_run.py
#	src/benchflow/agents/registry.py
This PR replaces #184 (auto-closed when its base feat/scene-outbox-messaging was merged + deleted via #179). All 19 commits ahead of main are intact, plus 3 new commits from the 2026-04-24 SWE-bench Pro validation session.

Summary
- BaseUser progressive-disclosure abstraction: Python callback that drives a multi-round trial loop. Each round: user.run() → agent executes → soft_verify() → callback sees the result and decides what to do next. Built for Josh's SWE-bench Pro use case and as benchflow's no-second-LLM parity answer to Harbor #1316.
- Per-task [verifier.hardening] opt-outs: tasks that need legitimate conftest.py setups (e.g. qutebrowser) opt out of specific cleanup steps in task.toml.
- Verifier --rootdir=/app: anchors pytest test node IDs to the canonical Harbor repo root (replaces the broken --rootdir=/tests removal in #187).

Validation (2026-04-24, Daytona)
- SWE-bench Pro oracle: 5/5 (ansible, flipt, openlibrary, navidrome, qutebrowser).
- Single-round Gemini 3.1 Pro baseline: 2/4 (ansible ✅, openlibrary ✅; flipt ❌ after 24min, qutebrowser ❌ verifier broken pre-fix).
- Oracle was 2/5 before this PR's verifier fixes: --rootdir removal broke openlibrary's test ID format, and conftest.py cleanup broke qutebrowser's import-order setup.

Files
- src/benchflow/user.py, src/benchflow/trial.py, src/benchflow/__init__.py
- src/benchflow/_sandbox.py (HARDENING_DEFAULTS, _read_hardening_config, _build_cleanup_cmd)
- tests/test_user.py (15), tests/test_sandbox_hardening.py (+7 new opt-out tests, 65 total)
- docs/progressive-disclosure.md, docs/use-cases.md
- examples/swebench_pro_user_dogfood.py, examples/swebench_pro_progressive_disclosure.ipynb, examples/user_dogfood.py
- experiments/swebench_pro_oracle_and_baseline.py, experiments/swebench-pro-results.csv

Test plan