feat: BaseUser abstraction + per-task verifier hardening opt-outs by xdotli · Pull Request #194 · benchflow-ai/benchflow

xdotli · 2026-04-25T01:12:21Z

This PR replaces #184 (auto-closed when its base feat/scene-outbox-messaging was merged + deleted via #179). All 19 commits ahead of main are intact, plus 3 new commits from the 2026-04-24 SWE-bench Pro validation session.

Summary

BaseUser progressive-disclosure abstraction: Python callback that drives a multi-round trial loop. Each round: user.run() → agent executes → soft_verify() → callback sees the result and decides what to do next. Built for Josh's SWE-bench Pro use case and as benchflow's no-second-LLM parity answer to Harbor #1316.

Per-task [verifier.hardening] opt-outs: tasks that need legitimate conftest.py setups (e.g. qutebrowser) opt out of specific cleanup steps in task.toml.

Verifier --rootdir=/app: anchors pytest test node IDs to the canonical Harbor repo root (replaces the broken --rootdir=/tests removal in #187).

Validation (2026-04-24, Daytona)

SWE-bench Pro oracle: 5/5 (ansible, flipt, openlibrary, navidrome, qutebrowser).
Single-round Gemini 3.1 Pro baseline: 2/4 (ansible ✅, openlibrary ✅; flipt ❌ after 24min, qutebrowser ❌ verifier broken pre-fix).

Oracle was 2/5 before this PR's verifier fixes — --rootdir removal broke openlibrary's test ID format, and conftest.py cleanup broke qutebrowser's import-order setup.

Files

| Area | Files |
| --- | --- |
| Core abstraction | `src/benchflow/user.py`, `src/benchflow/trial.py`, `src/benchflow/__init__.py` |
| Verifier hardening | `src/benchflow/_sandbox.py` (HARDENING_DEFAULTS, _read_hardening_config, _build_cleanup_cmd) |
| Tests | `tests/test_user.py` (15), `tests/test_sandbox_hardening.py` (+7 new opt-out tests, 65 total) |
| Docs | `docs/progressive-disclosure.md`, `docs/use-cases.md` |
| Examples | `examples/swebench_pro_user_dogfood.py`, `examples/swebench_pro_progressive_disclosure.ipynb`, `examples/user_dogfood.py` |
| Experiments | `experiments/swebench_pro_oracle_and_baseline.py`, `experiments/swebench-pro-results.csv` |

Test plan

All 80 tests pass: 15 user + 65 sandbox (incl. 7 new opt-out tests)
Oracle validates 5/5 SWE-bench Pro tasks via Daytona
Baseline runs 4 tasks via Daytona, 2 pass (gemini-3.1-pro-preview)
Progressive disclosure on flipt (running now)

Add User as a first-class participant in the trial loop: a Python callback that produces prompts, sees test results between rounds, and decides when to stop. This is the infrastructure Josh (GitHub/Microsoft) needs for SWE-bench Pro progressive disclosure.

New types (user.py):
- BaseUser with setup(instruction, solution) and run(round, instruction, round_result)
- RoundResult dataclass with trajectory, rewards, verifier output
- PassthroughUser (backward-compat default, single round)
- FunctionUser (wraps a plain callback for lightweight use)

Trial changes:
- TrialConfig gains user, max_user_rounds, oracle_access fields
- Trial._run_user_loop(): user.run() → connect → execute → disconnect → soft_verify() → build RoundResult → repeat until None or max rounds
- Trial.soft_verify(): runs Harbor verifier WITHOUT hardening so agent stays alive between rounds. Final verify() still does full hardening.
- Multi-role + User raises ValueError (deferred to future phase)

16 new tests, 0 regressions on existing 618 tests.
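The round loop described in this commit can be sketched roughly as follows. This is not the benchflow implementation: the `RoundResult` fields, the async signatures, `RetryUntilPass`, and the `run_user_loop` driver are stand-ins assumed for illustration, and the connect/disconnect steps are elided.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RoundResult:
    """Stand-in for the PR's RoundResult; field names are assumed."""
    round: int
    rewards: dict = field(default_factory=dict)
    verifier_output: str = ""


class BaseUser:
    """Stand-in base class mirroring the setup()/run() shape named above."""

    async def setup(self, instruction: str, solution: Optional[str]) -> None:
        pass

    async def run(self, round: int, instruction: str,
                  round_result: Optional[RoundResult]) -> Optional[str]:
        raise NotImplementedError


class RetryUntilPass(BaseUser):
    """Hypothetical user: re-sends the instruction until reward reaches 1.0."""

    async def run(self, round, instruction, round_result):
        if round_result is not None and round_result.rewards.get("reward") == 1.0:
            return None  # returning None ends the loop
        return instruction


async def run_user_loop(user, instruction, execute, soft_verify, max_user_rounds=3):
    """Hypothetical driver: user.run() -> execute -> soft_verify -> repeat."""
    await user.setup(instruction, None)
    history = []
    last = None
    for rnd in range(max_user_rounds):
        prompt = await user.run(rnd, instruction, last)
        if prompt is None:
            break
        await execute(prompt)          # a real trial connects, executes, disconnects
        rewards = await soft_verify()  # intermediate check between rounds
        last = RoundResult(round=rnd, rewards=rewards)
        history.append(last)
    return history


async def _demo():
    scores = iter([0.0, 1.0])  # fake verifier: fail once, then pass

    async def execute(prompt):
        pass

    async def soft_verify():
        return {"reward": next(scores)}

    return await run_user_loop(RetryUntilPass(), "fix the bug", execute, soft_verify)


history = asyncio.run(_demo())
print([r.rewards["reward"] for r in history])
```

In this sketch the loop stops at the third call to `run()` because the previous round already passed, so only two rounds are recorded.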

1. Reorder: disconnect() before soft_verify(). The agent process is already dead when soft_verify runs, so soft_verify's docstring was misleading. Now disconnect → soft_verify is the explicit flow.
2. soft_verify() now runs CLEANUP_CMD (conftest/pth/sitecustomize purge) before the verifier. Prevents agent from gaming intermediate test results by injecting test-patching files.
3. FunctionUser: use inspect.isawaitable() instead of asyncio.iscoroutine(). This handles asyncio.Task, Future, and any __await__ object, not just coroutines.
4. oracle_access: cat /solution now runs as user="root". /solution is locked (root:700) after install_agent, so the read would silently fail without root.
5. try/finally around connect/execute/disconnect in user loop. Ensures disconnect() always runs even if execute() raises.
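The difference behind the `inspect.isawaitable()` change can be seen with a small standalone sketch. `call_user_callback` is a hypothetical helper, not the FunctionUser code; it only illustrates why a callback that returns a Future needs the broader check.

```python
import asyncio
import inspect


async def call_user_callback(fn, *args):
    """Call fn and await the result only if it is awaitable (hypothetical helper)."""
    result = fn(*args)
    if inspect.isawaitable(result):
        result = await result
    return result


async def _demo():
    loop = asyncio.get_running_loop()

    def sync_cb(rnd):
        return f"sync-{rnd}"

    async def async_cb(rnd):
        return f"async-{rnd}"

    def future_cb(rnd):
        # Returns a Future, which is awaitable but is not a coroutine object.
        fut = loop.create_future()
        fut.set_result(f"future-{rnd}")
        return fut

    results = [await call_user_callback(cb, 0) for cb in (sync_cb, async_cb, future_cb)]

    probe = loop.create_future()
    probe.set_result(None)
    return results, asyncio.iscoroutine(probe), inspect.isawaitable(probe)


results, future_is_coroutine, future_is_awaitable = asyncio.run(_demo())
print(results, future_is_coroutine, future_is_awaitable)
```

With `asyncio.iscoroutine()` as the check, the Future returned by `future_cb` would be handed back un-awaited instead of its result.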

Demonstrates the FunctionUser abstraction:
- Round 0: terse 2-sentence prompt
- Round 1: hints about edge cases on failure
- Round 2: full instruction on continued failure
- Stops early if tests pass
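A callback following this three-round schedule might look like the sketch below. The `RoundResult` stand-in, the `"reward"` key, the hint wording, and the way the terse prompt is derived are all assumptions for illustration; the actual user_dogfood.py may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RoundResult:
    """Minimal stand-in; the real dataclass also carries trajectory and verifier output."""
    rewards: dict = field(default_factory=dict)


def progressive_user(round: int, instruction: str,
                     round_result: Optional[RoundResult]) -> Optional[str]:
    """Hypothetical callback: terse prompt, then hints, then the full instruction."""
    # Stop early once the previous round's tests passed.
    if round_result is not None and round_result.rewards.get("reward", 0.0) >= 1.0:
        return None
    if round == 0:
        # Terse prompt: keep only the first two sentences.
        terse = ". ".join(instruction.split(". ")[:2])
        return terse if terse.endswith(".") else terse + "."
    if round == 1:
        return "Tests are still failing. Re-check the edge cases in the task description."
    if round == 2:
        return instruction
    return None


spec = "Parse the log. Extract dates. Handle leap years. Ignore IPv6."
print(progressive_user(0, spec, None))
```

A plain function like this is the kind of callable that FunctionUser is described as wrapping.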

- Remove 4 tautological tests (pure dataclass reads) per CLAUDE.md convention: TestRoundResult.test_defaults, test_with_data, TestTrialConfigUser.test_user_field_defaults_to_none, test_user_field_set
- Fix dogfood model name: gemini-2.5-flash (not expired preview)
- Note: iscoroutine→isawaitable was already fixed in 51d6c61

…tests

1. Oracle /solution is now moved (not deleted) before agent runs and restored before final verify(). Prevents breaking verifiers that need /solution to compute rewards.
2. Remove unused asyncio import from user.py.
3. Add 4 soft_verify tests: timeout, crash, success, and CLEANUP_CMD execution verification. soft_verify is no longer untested.

3-round progressive disclosure with Gemini Flash on regex-log:

Round 0: terse prompt (2 tool calls) → reward=0.0
Round 1: hint prompt (3 tool calls) → reward=0.0
Round 2: full instruction (3 tool calls) → reward=0.0
Final verify: reward=0.0

Agent scored 0.0 on all rounds; regex-log is a hard task. But the infrastructure works end-to-end: user loop, soft_verify, fresh ACP sessions per round, user_rounds.jsonl persistence, final hardened verify. No errors.

OpenCode (opencode-ai) is an open-source TypeScript coding agent with ACP support. Skills path: $HOME/.opencode/skills (updated from .opencode/skill per skillsbench #718). Closes skillsbench #718 for the benchflow side.

Root cause: OpenCode's ACP parseModel() splits modelId on "/" to extract providerID and modelID. When benchflow sent "gemini-3.1-pro-preview" (no slash), opencode parsed it as providerID="gemini-3.1-pro-preview" with modelID="", an invalid config that silently returned end_turn.

Fix: Add acp_model_format field to AgentConfig. When set to "provider/model" (opencode), _format_acp_model() infers the models.dev provider from the bare model name (e.g. "gemini" → "google") and sends "google/gemini-3.1-pro-preview" to set_model.

Also: opencode requires_env is now empty (inferred from model at runtime, not hardcoded to ANTHROPIC_API_KEY).
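The failure mode and the fix can be illustrated with a short sketch. `parse_model` only imitates the split-on-"/" behaviour described above, and `format_acp_model` is a hypothetical simplification of `_format_acp_model()`: only the "gemini" → "google" mapping comes from the commit message, while the other prefixes and the return-unchanged fallback are assumptions.

```python
import warnings

# "gemini" -> "google" is from the commit message; the other rows are illustrative guesses.
_PROVIDER_PREFIXES = {"gemini": "google", "claude": "anthropic", "gpt": "openai"}


def parse_model(model_id: str):
    """Imitates splitting a model id on "/" into (providerID, modelID)."""
    provider, _, model = model_id.partition("/")
    return provider, model


def format_acp_model(model: str, acp_model_format: str = "") -> str:
    """Prefix a bare model name with an inferred provider when the agent requires it."""
    if acp_model_format != "provider/model" or "/" in model:
        return model
    for prefix, provider in _PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return f"{provider}/{model}"
    # Assumed fallback: warn and send the name unchanged.
    warnings.warn(f"unknown provider for model {model!r}; sending it unchanged")
    return model


print(parse_model("gemini-3.1-pro-preview"))
print(parse_model(format_acp_model("gemini-3.1-pro-preview", "provider/model")))
```

The first call reproduces the empty modelID from the bug report; the second shows the pair the agent would see after formatting.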

OpenCode + gemini-3.1-pro-preview on qutebrowser SWE-bench Pro:

Baseline (full prompt, 1 round): 40 tools, 736s, reward=0.0
Progressive (3 rounds): 185 tools, 1154s, reward=0.0
  Round 0 (terse): 86 tools (81 bash + 5 edit)
  Round 1 (hints): 76 tools (66 bash + 10 edit)
  Round 2 (full): 23 tools (16 bash + 7 edit)

Both scored 0.0 due to verifier infrastructure bug (rootdir=/tests instead of /app, pytest couldn't find config). Agent's fixes were likely correct: it demonstrated passing tests in its own environment.

Key findings:
- Progressive disclosure changed agent behavior (86→76→23 tools)
- _reset_cache implemented only after Round 1 hint
- OpenCode handled 185 tool calls without token limits
- Verifier rootdir bug needs investigation

The old mechanism (4 dicts + 4 functions + 1 regex) required manual code changes for every new benchmark with an undeclared pytest plugin. SWE-bench Pro tasks failed because pytest-benchmark wasn't whitelisted.

New mechanism: one container-side script + one async function. At hardening time, enumerate all pytest11 entry points from root-owned system packages. Only root-owned dist-info directories are trusted; editable installs from agent-writable /testbed are excluded.

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 stays in place. Security preserved. task.toml pytest_plugins kept as fallback.

Deleted: _PYTEST_PLUGIN_ALIASES, _PYTEST_OPTION_PLUGINS, _PYTEST_INSTALLED_PLUGINS, _PIP_INSTALL_RE, _normalize_pytest_plugin, _plugins_from_verifier_script, _declared_pytest_plugins, _pytest_plugin_flags, tomllib import.

Added: _DISCOVER_PYTEST_PLUGINS_SCRIPT, _discover_pytest_plugin_flags.

Python 3.9's entry_points() doesn't accept keyword arguments — returns a dict instead. Fall back to entry_points().get('pytest11', []) when the keyword style raises TypeError.
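A minimal sketch of that fallback, assuming the discovery script only needs the list of pytest11 entry points. The `plugin_flags` helper, which turns each entry point into an explicit `-p <module>` flag, is an assumption about how the discovered plugins might be passed to pytest while autoload is disabled; it is not taken from the benchflow source.

```python
from importlib.metadata import EntryPoint, entry_points


def pytest11_entry_points():
    """List pytest11 entry points on both old and new importlib.metadata APIs."""
    try:
        # Python 3.10+: selection by keyword argument.
        return list(entry_points(group="pytest11"))
    except TypeError:
        # Python 3.9: entry_points() takes no arguments and returns a dict.
        return list(entry_points().get("pytest11", []))


def plugin_flags(eps):
    """Build explicit -p flags from entry point targets (assumed flag format)."""
    modules = sorted({ep.value.split(":", 1)[0] for ep in eps})
    flags = []
    for module in modules:
        flags += ["-p", module]
    return flags


discovered = pytest11_entry_points()
example = [EntryPoint(name="benchmark", value="pytest_benchmark.plugin", group="pytest11")]
print(plugin_flags(example))
```

The `example` entry point is constructed by hand so the flag format can be inspected without pytest-benchmark installed.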

The uid==0 check was failing on Python 3.9 containers where ep.dist._path doesn't exist. Simplified to just enumerate all pytest11 entry points — sandbox_user prevents agent pip installs, so all discovered plugins are image-authored.

Both progressive + baseline rerun with working verifier (15 plugins discovered). Results with honest scoring:

Progressive (3 rounds): 284 tools, 970s, reward=0.0
  Round 0: 94 tools, Round 1: 92 tools, Round 2: 98 tools
Baseline (1 round): 73 tools, 611s, reward=0.0

Both failed due to agent code errors (circular imports), not verifier infrastructure. Progressive used 4x more compute for same outcome on this task.

VERIFIER_ENV cleared PYTHONPATH="" which broke SWE-bench Pro tasks where the Dockerfile sets PYTHONPATH=/app/lib:/app for project imports. New: _trusted_verifier_pythonpath() filters PYTHONPATH using the same root-owned validation as PATH, but does NOT block the workspace — /app is already importable via CWD/pytest sys.path insertion, so clearing it only breaks imports without security benefit. /tmp, /var/tmp, /home/agent are still blocked. Re-pinned after task-env merge like PATH.
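The filtering idea can be sketched as below. This is a guess at the shape of `_trusted_verifier_pythonpath()`, not its code: the blocked prefixes come from the commit message, while the function name, the injectable `is_trusted` predicate, and the root-ownership check are assumptions.

```python
import os
import posixpath

# Blocked locations named in the commit message.
BLOCKED_PREFIXES = ("/tmp", "/var/tmp", "/home/agent")


def _is_root_owned_dir(path: str) -> bool:
    """Assumed trust check: an existing directory owned by uid 0."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    return os.path.isdir(path) and st.st_uid == 0


def trusted_pythonpath(raw: str, is_trusted=_is_root_owned_dir) -> str:
    """Drop blocked or untrusted entries from a colon-separated PYTHONPATH."""
    kept = []
    for entry in raw.split(":"):
        if not entry:
            continue  # an unset or empty PYTHONPATH yields no entries
        norm = posixpath.normpath(entry)
        if any(norm == p or norm.startswith(p + "/") for p in BLOCKED_PREFIXES):
            continue
        if is_trusted(norm):
            kept.append(norm)
    return ":".join(kept)


# Demo with a permissive predicate so the result does not depend on the host filesystem.
filtered = trusted_pythonpath("/app/lib:/app:/tmp/evil:/home/agent/x", is_trusted=lambda p: True)
print(filtered)
```

Workspace entries such as /app/lib survive here because the workspace is deliberately not in the blocked list, matching the rationale above.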

- soft_verify: chmod 777 /logs/verifier so non-root verifier can write
- soft_verify: restore /solution before verify, re-hide after (oracle access)
- validate empty roles (!=1) and multi-scene configs in user loop
- remove tautological test_setup_is_noop
- remove opencode BENCHFLOW_PROVIDER_API_KEY→ANTHROPIC_API_KEY mapping (wrong for non-Anthropic models; native keys inherited via auto_inherit_env)
- warn on unknown provider fallback in _format_acp_model
- remove --rootdir=/tests from VERIFIER_ENV (cherry-pick from PR #187)
- fix printenv PYTHONPATH crash when unset
- fix stale plugin discovery docstring

Runs oracle (gold solution) on all 4 testable tasks to verify the --rootdir fix, then runs a single-round agent baseline for comparison with progressive disclosure. Results to CSV.

Three Codex review findings on the BaseUser abstraction:

1. oracle_access=True with user=None silently leaves /solution exposed to the agent for the entire trial. Add a logger.warning at setup time so misconfigurations surface immediately.
2. Oracle restore (mv /solution_oracle_backup /solution) was outside any finally block. If _run_user_loop() raised, /solution was never restored. Move the user/scene execution into try/finally so the restore always runs before the final verify().
3. Oracle read used a wildcard fallback (cat /solution/* || true) that could leak unintended files (binaries, credentials). Narrow to solve.sh, the canonical SWE-bench Pro oracle path.

Bugs Codex flagged that were FALSE POSITIVES (verified against code):
- "session counter reset": disconnect() already resets both counters
- "None instruction": _resolve_prompts returns [instruction] not [None]

Tests still pass: 15 user + 58 sandbox = 73 total.
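The try/finally shape from finding 2 can be sketched as follows. The restore command is quoted from the commit message; the hide command, the function name, and the `exec_as_root` / `run_rounds` callables are assumptions used to keep the example self-contained.

```python
import asyncio


async def run_with_hidden_oracle(exec_as_root, run_rounds):
    """Hide /solution while the agent runs and always restore it afterwards."""
    # Assumed hide step: the mirror image of the restore command.
    await exec_as_root("mv /solution /solution_oracle_backup")
    try:
        return await run_rounds()
    finally:
        # Runs even if the user loop raises, so the final verify() can see /solution.
        await exec_as_root("mv /solution_oracle_backup /solution")


async def _demo():
    commands = []

    async def exec_as_root(cmd):
        commands.append(cmd)  # a real sandbox would execute the command as root

    async def failing_rounds():
        raise RuntimeError("agent crashed mid-round")

    try:
        await run_with_hidden_oracle(exec_as_root, failing_rounds)
    except RuntimeError:
        pass
    return commands


commands = asyncio.run(_demo())
print(commands)
```

Even though the simulated user loop raises, the recorded commands end with the restore.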

Two related changes addressing SWE-bench Pro oracle compatibility:

1) Restore --rootdir=/app in PYTEST_ADDOPTS

Removing --rootdir entirely (PR #187) made pytest fall back to /dev as rootdir (from -c /dev/null), producing test node IDs like ../dev/::test_foo instead of <repo>/<path>::test_foo. The verifier expects full-path IDs and reported 0 passing tests on openlibrary even though all 18 tests passed.

--rootdir=/app anchors test IDs to the canonical Harbor repo root while -c /dev/null still blocks pyproject.toml/pytest.ini discovery and --confcutdir=/tests still blocks conftest walk-up beyond /tests.

2) Per-task [verifier.hardening] opt-outs in task.toml

The cleanup that deletes agent-injected conftest.py also deletes legitimate repo conftest.py files. qutebrowser ships conftest.py files that set up import order to break a real circular dependency between qutebrowser.browser.inspector and qutebrowser.misc.miscwidgets. Without them, pytest collection fails on a type annotation in miscwidgets.py:419.

Tasks now declare opt-outs in task.toml:

    [verifier.hardening]
    cleanup_conftests = false  # qutebrowser

Defaults remain secure (all True).

New helpers in _sandbox.py:
- HARDENING_DEFAULTS: dict of feature flags
- _read_hardening_config(task_dir): parse task.toml [verifier.hardening]
- _build_cleanup_cmd(hardening): build cleanup honoring opt-outs

CLEANUP_CMD constant kept as backward-compat alias. Both harden_before_verify() and Trial.soft_verify() now read per-task hardening config before running cleanup.

Validation on SWE-bench Pro oracle (Daytona):
Before: 2/4 (ansible, flipt); openlibrary failed test ID format, qutebrowser failed conftest deletion
After: 5/5 (ansible, flipt, openlibrary, qutebrowser, navidrome)

Tests: 80 passing (15 user + 65 sandbox including 7 new opt-out tests).
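How the opt-out might flow from task.toml into the cleanup command is sketched below. Only `cleanup_conftests` is a flag name taken from the commit message; `cleanup_pth_files`, `cleanup_sitecustomize`, and the shell commands are placeholders, and the helpers take TOML text rather than a task directory to stay self-contained.

```python
try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

# Only cleanup_conftests is named in the PR; the other two flags are hypothetical.
HARDENING_DEFAULTS = {
    "cleanup_conftests": True,
    "cleanup_pth_files": True,
    "cleanup_sitecustomize": True,
}

# Placeholder commands; the real cleanup steps are not shown in the PR text.
_CLEANUP_STEPS = {
    "cleanup_conftests": "find /app -name conftest.py -delete",
    "cleanup_pth_files": "find /app -name '*.pth' -delete",
    "cleanup_sitecustomize": "find /app -name sitecustomize.py -delete",
}


def read_hardening_config(task_toml_text: str) -> dict:
    """Merge [verifier.hardening] overrides onto secure defaults."""
    data = tomllib.loads(task_toml_text)
    overrides = data.get("verifier", {}).get("hardening", {})
    config = dict(HARDENING_DEFAULTS)
    for key, value in overrides.items():
        # Ignore unknown keys and non-boolean values so typos cannot disable hardening.
        if key in config and isinstance(value, bool):
            config[key] = value
    return config


def build_cleanup_cmd(hardening: dict) -> str:
    """Join the enabled cleanup steps into one shell command."""
    steps = [cmd for key, cmd in _CLEANUP_STEPS.items() if hardening.get(key, True)]
    return " ; ".join(steps) or "true"


qutebrowser_toml = "[verifier.hardening]\ncleanup_conftests = false  # qutebrowser\n"
config = read_hardening_config(qutebrowser_toml)
print(build_cleanup_cmd(config))
```

A task.toml without a `[verifier.hardening]` table gets the defaults unchanged, which keeps every cleanup step enabled.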

For Josh's SWE-bench Pro use case (and Harbor #1316 parity in the no-second-LLM case):

- docs/progressive-disclosure.md: dedicated guide for the BaseUser abstraction. Covers the API, oracle access, [verifier.hardening] opt-outs, and when to choose BaseUser vs multi-role Scene.
- docs/use-cases.md: brief mention in §1 (Interactive User Simulation) pointing to progressive-disclosure.md for the lighter-weight callback-based pattern.
- examples/swebench_pro_progressive_disclosure.ipynb: clean rewrite of the existing notebook. Shows the API, oracle 5/5, baseline 4 tasks, per-task hardening opt-out example, and a placeholder cell that auto-loads the latest progressive-disclosure run from /tmp/swebench-pro-jobs/progressive when one exists. Executes top-to-bottom against the current oracle/baseline CSV.
- examples/swebench_pro_user_dogfood.py: ready-to-run script for progressive disclosure on any of the 5 working SWE-bench Pro tasks. Three-round user: terse → failing tests + half spec → full spec.
- experiments/swebench-pro-results.csv: oracle + baseline results from 2026-04-24 Daytona run. qutebrowser entry is pre-fix (verified post-fix separately, noted in notebook).

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

xdotli added 19 commits April 22, 2026 22:23

feat: add user_dogfood.py — progressive disclosure on regex-log

2c3cc3c

feat: add opencode agent to registry

27433ad

fix: handle Python 3.9 importlib.metadata API in plugin discovery

922c4a6

feat: add SWE-bench Pro oracle validation + baseline experiment script

2f46994

devin-ai-integration Bot reviewed Apr 25, 2026

docs: add progressive-disclosure.md to CLAUDE.md docs index

06672b5

xdotli changed the base branch from main to dev-0.3 April 25, 2026 02:07

Merge remote-tracking branch 'origin/dev-0.3' into feat/user-abstraction

e106811

# Conflicts: # src/benchflow/_acp_run.py

xdotli mentioned this pull request Apr 25, 2026

merge: main → dev-0.3 (release prep for v0.3.2) #195

Merged

4 tasks

Merge remote-tracking branch 'origin/dev-0.3' into feat/user-abstraction

f1224cc

# Conflicts: # src/benchflow/__init__.py # src/benchflow/_acp_run.py # src/benchflow/agents/registry.py

xdotli merged commit 1762123 into dev-0.3 Apr 25, 2026
2 checks passed

xdotli deleted the feat/user-abstraction branch April 25, 2026 10:04

xdotli mentioned this pull request Apr 25, 2026

release: benchflow 0.3.2 — BaseUser, verifier hardening, DinD compose, lint cleanup #199

Merged

4 tasks

xdotli merged 22 commits into dev-0.3 from feat/user-abstraction
