
feat: BaseUser abstraction + per-task verifier hardening opt-outs#194

Merged
xdotli merged 22 commits into dev-0.3 from feat/user-abstraction
Apr 25, 2026

Conversation

@xdotli
Member

@xdotli xdotli commented Apr 25, 2026

This PR replaces #184 (auto-closed when its base feat/scene-outbox-messaging was merged + deleted via #179). All 19 commits ahead of main are intact, plus 3 new commits from the 2026-04-24 SWE-bench Pro validation session.

Summary

BaseUser progressive-disclosure abstraction: Python callback that drives a multi-round trial loop. Each round: user.run() → agent executes → soft_verify() → callback sees the result and decides what to do next. Built for Josh's SWE-bench Pro use case and as benchflow's no-second-LLM parity answer to Harbor #1316.
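
To make the shape of that loop concrete, here is a minimal standalone sketch. `BaseUser`, `RoundResult`, and the `run(round, instruction, round_result)` / `setup(instruction, solution)` signatures are from this PR; the stub bodies and the `HintingUser` policy below are illustrative, not the shipped implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal stand-ins so the sketch runs standalone; the real types live in
# src/benchflow/user.py and carry more fields (trajectory, verifier output).
@dataclass
class RoundResult:
    rewards: dict = field(default_factory=dict)

class BaseUser:
    def setup(self, instruction: str, solution: Optional[str]) -> None: ...
    def run(self, round: int, instruction: str,
            round_result: Optional[RoundResult]) -> Optional[str]:
        raise NotImplementedError  # return the next prompt, or None to stop

class HintingUser(BaseUser):
    # One possible progressive-disclosure policy; the prompts are invented.
    def run(self, round, instruction, round_result):
        if round_result and round_result.rewards.get("reward") == 1.0:
            return None                          # tests pass: stop early
        if round == 0:
            return "Fix the failing tests."      # terse opening prompt
        return instruction                       # escalate to the full instruction
```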

Per-task [verifier.hardening] opt-outs: tasks that need legitimate conftest.py setups (e.g. qutebrowser) opt out of specific cleanup steps in task.toml.

Verifier --rootdir=/app: anchors pytest test node IDs to the canonical Harbor repo root (replaces the broken --rootdir=/tests removal in #187).

Validation (2026-04-24, Daytona)

SWE-bench Pro oracle: 5/5 (ansible, flipt, openlibrary, navidrome, qutebrowser).
Single-round Gemini 3.1 Pro baseline: 2/4 (ansible ✅, openlibrary ✅; flipt ❌ after 24min, qutebrowser ❌ verifier broken pre-fix).

Oracle was 2/5 before this PR's verifier fixes — --rootdir removal broke openlibrary's test ID format, and conftest.py cleanup broke qutebrowser's import-order setup.

Files

| Area | Files |
| --- | --- |
| Core abstraction | src/benchflow/user.py, src/benchflow/trial.py, src/benchflow/__init__.py |
| Verifier hardening | src/benchflow/_sandbox.py (HARDENING_DEFAULTS, _read_hardening_config, _build_cleanup_cmd) |
| Tests | tests/test_user.py (15), tests/test_sandbox_hardening.py (+7 new opt-out tests, 65 total) |
| Docs | docs/progressive-disclosure.md, docs/use-cases.md |
| Examples | examples/swebench_pro_user_dogfood.py, examples/swebench_pro_progressive_disclosure.ipynb, examples/user_dogfood.py |
| Experiments | experiments/swebench_pro_oracle_and_baseline.py, experiments/swebench-pro-results.csv |

Test plan

  • All 80 tests pass: 15 user + 65 sandbox (incl. 7 new opt-out tests)
  • Oracle validates 5/5 SWE-bench Pro tasks via Daytona
  • Baseline runs 4 tasks via Daytona, 2 pass (gemini-3.1-pro-preview)
  • Progressive disclosure on flipt (running now)


xdotli added 19 commits April 22, 2026 22:23
Add User as a first-class participant in the trial loop — a Python
callback that produces prompts, sees test results between rounds, and
decides when to stop. This is the infrastructure Josh (GitHub/Microsoft)
needs for SWE-bench Pro progressive disclosure.

New types (user.py):
- BaseUser with setup(instruction, solution) and run(round, instruction, round_result)
- RoundResult dataclass with trajectory, rewards, verifier output
- PassthroughUser (backward-compat default, single round)
- FunctionUser (wraps a plain callback for lightweight use)

Trial changes:
- TrialConfig gains user, max_user_rounds, oracle_access fields
- Trial._run_user_loop(): user.run() → connect → execute → disconnect →
  soft_verify() → build RoundResult → repeat until None or max rounds
- Trial.soft_verify(): runs Harbor verifier WITHOUT hardening so agent
  stays alive between rounds. Final verify() still does full hardening.
- Multi-role + User raises ValueError (deferred to future phase)

16 new tests, 0 regressions on existing 618 tests.
1. Reorder: disconnect() before soft_verify() — agent process is
   already dead when soft_verify runs, so soft_verify's docstring
   was misleading. Now disconnect → soft_verify is the explicit flow.

2. soft_verify() now runs CLEANUP_CMD (conftest/pth/sitecustomize
   purge) before the verifier. Prevents agent from gaming intermediate
   test results by injecting test-patching files.

3. FunctionUser: use inspect.isawaitable() instead of
   asyncio.iscoroutine() — handles asyncio.Task, Future, and any
   __await__ object, not just coroutines.

4. oracle_access: cat /solution now runs as user="root" — /solution
   is locked (root:700) after install_agent, so the read would
   silently fail without root.

5. try/finally around connect/execute/disconnect in user loop —
   ensures disconnect() always runs even if execute() raises.
Demonstrates the FunctionUser abstraction:
- Round 0: terse 2-sentence prompt
- Round 1: hints about edge cases on failure
- Round 2: full instruction on continued failure
- Stops early if tests pass
- Remove 4 tautological tests (pure dataclass reads) per CLAUDE.md
  convention: TestRoundResult.test_defaults, test_with_data,
  TestTrialConfigUser.test_user_field_defaults_to_none, test_user_field_set
- Fix dogfood model name: gemini-2.5-flash (not expired preview)
- Note: iscoroutine→isawaitable was already fixed in 51d6c61
…tests

1. Oracle /solution is now moved (not deleted) before agent runs and
   restored before final verify(). Prevents breaking verifiers that
   need /solution to compute rewards.

2. Remove unused asyncio import from user.py.

3. Add 4 soft_verify tests: timeout, crash, success, and CLEANUP_CMD
   execution verification. soft_verify is no longer untested.
3-round progressive disclosure with Gemini Flash on regex-log:
  Round 0: terse prompt (2 tool calls) → reward=0.0
  Round 1: hint prompt  (3 tool calls) → reward=0.0
  Round 2: full instruction (3 tool calls) → reward=0.0
  Final verify: reward=0.0

Agent scored 0.0 on all rounds — regex-log is a hard task. But the
infrastructure works end-to-end: user loop, soft_verify, fresh ACP
sessions per round, user_rounds.jsonl persistence, final hardened
verify. No errors.
OpenCode (opencode-ai) is an open-source TypeScript coding agent with
ACP support. Skills path: $HOME/.opencode/skills (updated from
.opencode/skill per skillsbench #718).

Closes skillsbench #718 for the benchflow side.
Root cause: OpenCode's ACP parseModel() splits modelId on "/" to extract
providerID and modelID. When benchflow sent "gemini-3.1-pro-preview"
(no slash), opencode parsed it as providerID="gemini-3.1-pro-preview"
with modelID="" — an invalid config that silently returned end_turn.

Fix: Add acp_model_format field to AgentConfig. When set to
"provider/model" (opencode), _format_acp_model() infers the models.dev
provider from the bare model name (e.g. "gemini" → "google") and sends
"google/gemini-3.1-pro-preview" to set_model.

Also: opencode requires_env is now empty (inferred from model at
runtime, not hardcoded to ANTHROPIC_API_KEY).
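
The bare-model → "provider/model" mapping can be sketched as follows. Only the "gemini" → "google" pairing is stated in the commit; the other table entries and the function name are illustrative assumptions:

```python
# Illustrative prefix table — only "gemini" -> "google" is from the commit.
PROVIDER_BY_PREFIX = {"gemini": "google", "claude": "anthropic", "gpt": "openai"}

def format_acp_model(model_id: str) -> str:
    if "/" in model_id:
        return model_id                      # already provider/model
    for prefix, provider in PROVIDER_BY_PREFIX.items():
        if model_id.startswith(prefix):
            return f"{provider}/{model_id}"  # infer the models.dev provider
    return model_id                          # unknown provider: warn + pass through
```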
OpenCode + gemini-3.1-pro-preview on qutebrowser SWE-bench Pro:

Baseline (full prompt, 1 round): 40 tools, 736s, reward=0.0
Progressive (3 rounds):          185 tools, 1154s, reward=0.0
  Round 0 (terse):     86 tools (81 bash + 5 edit)
  Round 1 (hints):     76 tools (66 bash + 10 edit)
  Round 2 (full):      23 tools (16 bash + 7 edit)

Both scored 0.0 due to a verifier infrastructure bug (rootdir=/tests
instead of /app, so pytest couldn't find its config). The agent's fixes
were likely correct — it demonstrated passing tests in its own environment.

Key findings:
- Progressive disclosure changed agent behavior (86→76→23 tools)
- _reset_cache implemented only after Round 1 hint
- OpenCode handled 185 tool calls without token limits
- Verifier rootdir bug needs investigation
The old mechanism (4 dicts + 4 functions + 1 regex) required manual
code changes for every new benchmark with an undeclared pytest plugin.
SWE-bench Pro tasks failed because pytest-benchmark wasn't whitelisted.

New mechanism: one container-side script + one async function. At
hardening time, enumerate all pytest11 entry points from root-owned
system packages. Only root-owned dist-info directories are trusted —
editable installs from agent-writable /testbed are excluded.

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 stays in place. Security preserved.
task.toml pytest_plugins kept as fallback.

Deleted: _PYTEST_PLUGIN_ALIASES, _PYTEST_OPTION_PLUGINS,
_PYTEST_INSTALLED_PLUGINS, _PIP_INSTALL_RE, _normalize_pytest_plugin,
_plugins_from_verifier_script, _declared_pytest_plugins,
_pytest_plugin_flags, tomllib import.

Added: _DISCOVER_PYTEST_PLUGINS_SCRIPT, _discover_pytest_plugin_flags.
Python 3.9's entry_points() doesn't accept keyword arguments — returns
a dict instead. Fall back to entry_points().get('pytest11', []) when
the keyword style raises TypeError.
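
The fallback described above can be expressed directly with `importlib.metadata` (exact function name in benchflow may differ):

```python
from importlib import metadata

def pytest11_entry_points():
    # Python 3.10+: entry_points(group=...) returns the filtered selection.
    # Python 3.9: no keyword support — entry_points() returns a dict of lists.
    try:
        return list(metadata.entry_points(group="pytest11"))
    except TypeError:
        return list(metadata.entry_points().get("pytest11", []))
```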
The uid==0 check was failing on Python 3.9 containers where
ep.dist._path doesn't exist. Simplified to just enumerate all
pytest11 entry points — sandbox_user prevents agent pip installs,
so all discovered plugins are image-authored.
Both progressive + baseline rerun with working verifier (15 plugins
discovered). Results with honest scoring:

Progressive (3 rounds): 284 tools, 970s, reward=0.0
  Round 0: 94 tools, Round 1: 92 tools, Round 2: 98 tools
Baseline (1 round):     73 tools, 611s, reward=0.0

Both failed due to agent code errors (circular imports), not
verifier infrastructure. Progressive used 4x more compute for
same outcome on this task.
VERIFIER_ENV cleared PYTHONPATH="" which broke SWE-bench Pro tasks
where the Dockerfile sets PYTHONPATH=/app/lib:/app for project imports.

New: _trusted_verifier_pythonpath() filters PYTHONPATH using the same
root-owned validation as PATH, but does NOT block the workspace —
/app is already importable via CWD/pytest sys.path insertion, so
clearing it only breaks imports without security benefit. /tmp,
/var/tmp, /home/agent are still blocked.

Re-pinned after task-env merge like PATH.
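
A minimal sketch of that filtering, assuming the same root-ownership rule as the PATH validation (the function body is an approximation, not the shipped `_trusted_verifier_pythonpath`):

```python
import os

BLOCKED_PREFIXES = ("/tmp", "/var/tmp", "/home/agent")  # from the commit message

def trusted_verifier_pythonpath(raw: str) -> str:
    # Keep only root-owned, non-blocked PYTHONPATH entries; drop the rest.
    kept = []
    for entry in raw.split(":"):
        if not entry:
            continue
        if any(entry == p or entry.startswith(p + "/") for p in BLOCKED_PREFIXES):
            continue
        try:
            if os.stat(entry).st_uid == 0:  # same root-owned rule as PATH
                kept.append(entry)
        except OSError:
            continue                        # nonexistent path: drop it
    return ":".join(kept)
```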
- soft_verify: chmod 777 /logs/verifier so non-root verifier can write
- soft_verify: restore /solution before verify, re-hide after (oracle access)
- validate empty roles (!=1) and multi-scene configs in user loop
- remove tautological test_setup_is_noop
- remove opencode BENCHFLOW_PROVIDER_API_KEY→ANTHROPIC_API_KEY mapping
  (wrong for non-Anthropic models; native keys inherited via auto_inherit_env)
- warn on unknown provider fallback in _format_acp_model
- remove --rootdir=/tests from VERIFIER_ENV (cherry-pick from PR #187)
- fix printenv PYTHONPATH crash when unset
- fix stale plugin discovery docstring
Runs oracle (gold solution) on all 4 testable tasks to verify the
--rootdir fix, then runs a single-round agent baseline for comparison
with progressive disclosure. Results to CSV.
Three Codex review findings on the BaseUser abstraction:

1. oracle_access=True with user=None silently leaves /solution exposed to
   the agent for the entire trial. Add a logger.warning at setup time so
   misconfigurations surface immediately.

2. Oracle restore (mv /solution_oracle_backup /solution) was outside any
   finally block. If _run_user_loop() raised, /solution was never restored.
   Move the user/scene execution into try/finally so the restore always
   runs before the final verify().

3. Oracle read used a wildcard fallback (cat /solution/* || true) that
   could leak unintended files (binaries, credentials). Narrow to
   solve.sh — the canonical SWE-bench Pro oracle path.

Bugs Codex flagged that were FALSE POSITIVES (verified against code):
  - "session counter reset" — disconnect() already resets both counters
  - "None instruction" — _resolve_prompts returns [instruction] not [None]

Tests still pass: 15 user + 58 sandbox = 73 total.
Two related changes addressing SWE-bench Pro oracle compatibility:

1) Restore --rootdir=/app in PYTEST_ADDOPTS

   Removing --rootdir entirely (PR #187) made pytest fall back to /dev as
   rootdir (from -c /dev/null), producing test node IDs like ../dev/::test_foo
   instead of <repo>/<path>::test_foo. The verifier expects full-path IDs and
   reported 0 passing tests on openlibrary even though all 18 tests passed.

   --rootdir=/app anchors test IDs to the canonical Harbor repo root while
   -c /dev/null still blocks pyproject.toml/pytest.ini discovery and
   --confcutdir=/tests still blocks conftest walk-up beyond /tests.
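
Putting the three flags together, the verifier's pytest options look roughly like this (a sketch of the combined flags; the exact env assembly in benchflow may differ):

```python
# Each flag plays one role in the hardened verifier invocation:
PYTEST_ADDOPTS = " ".join([
    "--rootdir=/app",       # anchor test node IDs to the canonical repo root
    "-c", "/dev/null",      # block pyproject.toml / pytest.ini discovery
    "--confcutdir=/tests",  # block conftest.py walk-up beyond /tests
])
```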

2) Per-task [verifier.hardening] opt-outs in task.toml

   The cleanup that deletes agent-injected conftest.py also deletes
   legitimate repo conftest.py files. qutebrowser ships conftest.py that
   sets up import order to break a real circular dependency between
   qutebrowser.browser.inspector and qutebrowser.misc.miscwidgets — without
   them, pytest collection fails on a type annotation in miscwidgets.py:419.

   Tasks now declare opt-outs in task.toml:

       [verifier.hardening]
       cleanup_conftests = false  # qutebrowser

   Defaults remain secure (all True). New helpers in _sandbox.py:

   - HARDENING_DEFAULTS: dict of feature flags
   - _read_hardening_config(task_dir): parse task.toml [verifier.hardening]
   - _build_cleanup_cmd(hardening): build cleanup honoring opt-outs

   CLEANUP_CMD constant kept as backward-compat alias.

   Both harden_before_verify() and Trial.soft_verify() now read per-task
   hardening config before running cleanup.

Validation on SWE-bench Pro oracle (Daytona):

  Before: 2/4 (ansible, flipt) — openlibrary failed test ID format,
                                  qutebrowser failed conftest deletion
  After:  5/5 (ansible, flipt, openlibrary, qutebrowser, navidrome)

Tests: 80 passing (15 user + 65 sandbox including 7 new opt-out tests).
For Josh's SWE-bench Pro use case (and Harbor #1316 parity in the
no-second-LLM case):

- docs/progressive-disclosure.md: dedicated guide for the BaseUser
  abstraction. Covers the API, oracle access, [verifier.hardening]
  opt-outs, and when to choose BaseUser vs multi-role Scene.

- docs/use-cases.md: brief mention in §1 (Interactive User Simulation)
  pointing to progressive-disclosure.md for the lighter-weight
  callback-based pattern.

- examples/swebench_pro_progressive_disclosure.ipynb: clean rewrite of
  the existing notebook. Shows the API, oracle 5/5, baseline 4 tasks,
  per-task hardening opt-out example, and a placeholder cell that auto-
  loads the latest progressive-disclosure run from
  /tmp/swebench-pro-jobs/progressive when one exists. Executes top-to-
  bottom against the current oracle/baseline CSV.

- examples/swebench_pro_user_dogfood.py: ready-to-run script for
  progressive disclosure on any of the 5 working SWE-bench Pro tasks.
  Three-round user: terse → failing tests + half spec → full spec.

- experiments/swebench-pro-results.csv: oracle + baseline results from
  2026-04-24 Daytona run. qutebrowser entry is pre-fix (verified post-
  fix separately, noted in notebook).
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.


@xdotli xdotli changed the base branch from main to dev-0.3 April 25, 2026 02:07
# Conflicts:
#	src/benchflow/__init__.py
#	src/benchflow/_acp_run.py
#	src/benchflow/agents/registry.py
@xdotli xdotli merged commit 1762123 into dev-0.3 Apr 25, 2026
2 checks passed
@xdotli xdotli deleted the feat/user-abstraction branch April 25, 2026 10:04