team_harness: extract team mode as standalone harness + ablation flags #58
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter. Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.
Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access
--skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals /
messages. Cost is reported as 0.0 because Codex does not emit a
cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a
"model not found" shaped error, the adapter retries once without
``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter
returns Error immediately instead of spinning up a container that
can only fail.
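The token accounting described in the highlights can be sketched roughly as follows. This is a minimal illustration, not the adapter's parser: the ``usage`` object and its ``input_tokens`` / ``output_tokens`` keys are assumptions about the event payload, and ``sum_turn_tokens`` is a hypothetical name.

```python
import json
from typing import Dict, Iterable


def sum_turn_tokens(jsonl_lines: Iterable[str]) -> Dict[str, int]:
    """Sum token usage across every ``turn.completed`` event in a JSONL stream."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON event.
            continue
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}  # assumed field name
        for key in totals:
            totals[key] += int(usage.get(key, 0))
    return totals


stream = [
    '{"type": "turn.completed", "usage": {"input_tokens": 100, "output_tokens": 20}}',
    "not json",
    '{"type": "item.completed"}',
    '{"type": "turn.completed", "usage": {"input_tokens": 50, "output_tokens": 5}}',
]
totals = sum_turn_tokens(stream)
```

Because no cost field is available, a caller following the description above would report cost as 0.0 alongside these totals.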
Shared ``_coop`` module:
- ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh`` — pip-installs redis and drops the shell
wrappers; each adapter's setup.sh sources it.
- ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
agnostic.
- ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
``write_file_in_container`` / ``read_file_from_container``,
``rewrite_comm_url_for_container``, ``build_git_setup_command``,
``parse_sent_messages_log``, and ``normalize_patch``.
Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires. Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace). This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module. Full suite: 228
passed, 63 skipped.
End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens,
184-line patch (with the trailing-newline
fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but
0/2 tests pass — coordination failure
(agent1 fetched ``team`` but never merged,
so the stacked patches produce a Python
SyntaxError at line 144 of the modified
file). Claude on the same task scored
2/2; Codex used the tools less aggressively
on this run.
The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug. Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product. Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:
1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
atomic claim semantics (HSETNX-style — exactly one caller wins on a
race) and an audit log of every mutation. Exposed in the container
as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
/ ``coop-task-list`` shell wrappers.
2. A **lead / member role split**. The first agent is designated
``team-lead`` and gets a system-prompt block instructing them to
break the spec into tasks, assign them via ``coop-task-create
--assign``, watch progress, and integrate. Other agents are
``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
mounted at ``/workspace/shared`` in every container. Free
coordination artifact for design notes, partial diffs, interface
sketches.
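The "exactly one caller wins" claim semantics from item 1 can be illustrated without Redis. The sketch below is an in-memory stand-in, not ``TaskListClient``: a lock plays the role that the server-side atomicity of an HSETNX-style write plays in the real store, and the class and method names are hypothetical.

```python
import threading
from typing import Dict, List, Optional, Tuple


class InMemoryTaskList:
    """Toy stand-in for a Redis hash of task owners (illustration only)."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        self._lock = threading.Lock()
        self.audit_log: List[Tuple[str, str, str]] = []

    def claim(self, task_id: str, agent_id: str) -> bool:
        """Set the owner only if the task has none yet; return True for the winner."""
        with self._lock:
            if task_id in self._owners:
                return False
            self._owners[task_id] = agent_id
            self.audit_log.append(("claim", task_id, agent_id))
            return True

    def owner(self, task_id: str) -> Optional[str]:
        with self._lock:
            return self._owners.get(task_id)


tasks = InMemoryTaskList()
results: Dict[str, bool] = {}


def race(agent_id: str) -> None:
    results[agent_id] = tasks.claim("task-1", agent_id)


threads = [threading.Thread(target=race, args=(f"agent{i}",)) for i in range(1, 9)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

winners = [agent for agent, won in results.items() if won]
```

However the eight threads interleave, ``winners`` holds a single agent and the audit log holds a single claim entry.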
Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``. Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
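How such metrics could be derived from an audit log is sketched below. The event shape (``ts``, ``op``, ``task_id``, ``agent``, ``status``) is an assumption made for the example, and ``compute_metrics`` is a hypothetical helper, not the module's function; only the output keys come from the description above.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def compute_metrics(events: List[Dict[str, Any]], run_start: float) -> Dict[str, Any]:
    """Replay an audit log in timestamp order and summarise coordination."""
    claims: Counter = Counter()
    updates: Counter = Counter()
    owner: Dict[str, Optional[str]] = {}
    status: Dict[str, str] = {}
    first_claim_ts: Optional[float] = None

    for ev in sorted(events, key=lambda e: e["ts"]):
        task_id = ev["task_id"]
        owner.setdefault(task_id, None)
        status.setdefault(task_id, "open")
        if ev["op"] == "claim":
            claims[ev["agent"]] += 1
            owner[task_id] = ev["agent"]
            if first_claim_ts is None:
                first_claim_ts = ev["ts"]
        elif ev["op"] == "update":
            updates[ev["agent"]] += 1
            status[task_id] = ev.get("status", status[task_id])

    return {
        "time_to_first_claim_seconds": (
            None if first_claim_ts is None else round(first_claim_ts - run_start, 1)
        ),
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sum(1 for o in owner.values() if o is None),
    }


# Made-up sample log: two tasks created, one claimed and finished.
sample_events = [
    {"ts": 101.0, "op": "create", "task_id": "t1", "agent": "bench-runner"},
    {"ts": 101.0, "op": "create", "task_id": "t2", "agent": "bench-runner"},
    {"ts": 112.5, "op": "claim", "task_id": "t1", "agent": "agent1"},
    {"ts": 150.0, "op": "update", "task_id": "t1", "agent": "agent1", "status": "done"},
]
metrics = compute_metrics(sample_events, run_start=100.0)
```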
Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs. The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.
Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log,
run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
metrics in result, three-agent team)
- 4 misc
Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:
- 5 tasks created (2 by bench-runner, 3 by the lead splitting its
work), all reached ``done``
- time_to_first_claim_seconds=34.2
- claims_per_agent={agent1: 2, agent2: 1}
- updates_per_agent={agent1: 4, agent2: 3}
- scratchpad volume actively used (agent2 wrote its diff to
/workspace/shared/agent2.patch + a summary.md)
- **0/1 pass rate** — both ``patch.txt`` files were empty: the
members wrote diffs to the scratchpad instead of also writing
``/workspace/repo/patch.txt``, and the lead never ran the final
integration step. This is real coordination signal (the prompt
told them to write both places but they followed the scratchpad
half only) — a follow-up will tighten the prompt to make patch.txt
submission the explicit final step.
Future PRs (intentionally out of scope here so this lands at a
reviewable size):
- In-loop auto-refresh for the Python-loop adapters
- MCP long-poll tool to give CLI adapters push-ish inbox semantics
- Typed ``coop-request`` / ``coop-respond`` protocol on top of
messaging (CC's plan_approval_request shape)
- Filesystem mirror of the task list (CC-style ``ls`` artefacts)
Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…resh (#53)
Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.
1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
so agents can ``ls`` and ``cat`` tasks with their existing tools
rather than going through the ``coop-task-list`` CLI. Layout mirrors
Claude Code's team primitive: one ``<id>.json`` per task, plus
``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
trail). Triggered on every ``coop-task-list`` invocation and from the
host runner at startup. Files written via tempfile+replace so readers
never observe a partial state.
2. **Typed coop-request / coop-respond protocol**
(``_team/protocol.py``). Layered on plain Redis messaging, mirroring
CC's ``plan_approval_request`` / ``plan_approval_response`` shape.
``coop-request <peer> <kind> <body>`` returns a request_id (and
optionally blocks via ``--wait N`` for a response).
``coop-respond <request_id> <body>`` writes back; the sender's
``await_response`` uses BLPOP so it actually sleeps instead of
busy-polling. Both events flow into the shared task-log so
coordination metrics include protocol events.
3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC
server that exposes a single ``wait_for_message`` tool backed by
BLPOP on the agent's inbox. Registered automatically: Claude Code
adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server
entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is
to make "watch the inbox" a natural idle behavior for the CLI
adapters instead of a busy-loop on ``coop-recv`` returning empty:
the closest we can get to push-style delivery for opaque CLI agent
loops.
4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
``TeamPoller`` is a per-agent host-side helper that
``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries
(same hook as the existing inbox poll). The LLM sees a compact
``[Team task list] open: 1, in_progress: 2, ...`` summary prepended
to every turn so it doesn't need to remember to call
``coop-task-list``. Plumbed via ``agent.team_poller`` so the
``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The
same module also exports ``poll_team_state()`` for in-container use
(env-driven variant).
5. **Prompt fix**: the previous team-mode end-to-end had members
writing diffs to ``/workspace/shared/<id>.patch`` only and never to
``/workspace/repo/patch.txt``, scoring 0/2 despite great
coordination. Both lead and member prompts now have an explicit
``### Final submission — REQUIRED`` section that calls out
``patch.txt`` as the only file the bench evaluates and provides the
exact ``git diff > patch.txt`` command.
Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table. Previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.
Tests: 37 new unit tests
- 8 fs_mirror (atomic writes, stale cleanup, empty index)
- 9 protocol (request roundtrip, await, timeout, audit log)
- 9 mcp_server (initialize, tools/list, tools/call, timeout,
blocking, unknown-tool error, env factory)
- 8 loop_refresh (summary formatting, TeamPoller, env variant)
- 3 prompt (regression: lead+member prompts demand patch.txt)
Full suite: **311 passed**, 63 skipped.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests). All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the
same task, so the prompt fix is doing real work.
Stacks on #52.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
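The tempfile+replace write mentioned for the filesystem mirror can be sketched as below. This is a generic illustration of the pattern using only the standard library; ``write_json_atomic`` and the sample task payload are hypothetical and not taken from ``fs_mirror.py``.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so a concurrent reader sees the old file or the new one, never half."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic swap of the directory entry
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "tasks" / "t1.json"
    write_json_atomic(target, {"id": "t1", "status": "open"})
    write_json_atomic(target, {"id": "t1", "status": "done"})
    final = json.loads(target.read_text(encoding="utf-8"))
    leftovers = sorted(p.name for p in target.parent.iterdir())
```

After both writes the directory holds only ``t1.json``; no temporary file survives a successful write.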
Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode. Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.
New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds. Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.
Per-adapter wiring:
- ``mini_swe_agent_v2``: appends team_task_section to task;
propagates CB_TEAM_* through env_kwargs["env"]; adds
``--add-host=host.docker.internal:host-gateway`` + scratchpad
volume to docker run args; installs the team CLI scripts + pip
redis in the container after env spin-up. The existing
``TeamPoller`` host-side hook (already in step()) still fires.
- ``openhands_sdk``: appends team_task_section to task; folds a new
``team_env`` dict into ``coop_info`` so
``_build_credentials_dict`` propagates CB_TEAM_* into the
sandbox. Coop-task-* binary install in the OpenHands agent-server
image is a follow-up — OpenHands manages its own image build and
doesn't expose a clean post-start exec hook.
- ``swe_agent``: appends team_task_section to task. The SWE-agent
framework's sandbox + agent loop is third-party and harder to
instrument; everything beyond the prompt is a follow-up.
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
consistency between team_task_section and build_team_instruction,
every registered runner accepts the team kwargs, openhands env
keys, swe_agent signature
Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.
End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment. Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.
Compatibility matrix is now:
| Adapter | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
|---------------------|---------|-------------|--------------|------------------|----------|
| claude_code | yes | yes (full) | n/a | yes | yes |
| codex | yes | yes (full) | n/a | yes | yes |
| mini_swe_agent_v2 | yes | yes (sec.) | yes | yes | yes |
| openhands_sdk | yes | yes (sec.) | n/a | NOT YET | yes |
| swe_agent | yes | yes (sec.) | NOT YET | NOT YET | NOT YET |
Stacks on #52 (merged-up team-mode branch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required). Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost. Modal caches the layered image; subsequent
team runs are instant.
Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked. Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex. Both patches applied; f1 14/14,
f2 19/20.
Tests: 2 new (full suite: 326 passed)
- test_team_env_triggers_image_layering — verifies add_local_file
+ pip_install + run_commands fire with the right args when team
mode is active
- test_no_layering_when_team_inactive — verifies solo / coop
runs skip the image-build cost
Matrix update — openhands_sdk now reads:
Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it. Both fetched their peer's branch (2 fetches
each) and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.
The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it. This rewrite:
- Renames the section ``## Git collaboration — MERGE IS REQUIRED
BEFORE SUBMITTING`` so the imperative is in the heading itself.
- Adds an explicit "Required final sequence — run this verbatim
before exiting" block with the full fetch+merge+diff sequence,
parameterized over every partner branch.
- Explains *why* (each agent's patch.txt is evaluated against every
feature's tests; without the merge, the peer feature's symbols
are missing → ImportError).
- Frames it the same way the patch.txt step is framed (REQUIRED,
skip-at-your-loss), which the original prompt fix proved
codex responds to.
Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both
patches now contain both features' symbols. Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).
A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree. This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch. That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked. Fix is a separate concern from team-mode wiring.
Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.
Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches. If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance. Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
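A minimal sketch of the short-circuit, assuming a byte-for-byte comparison and a plain copy. The ``git apply --recount`` normalization is left out, and the function name and the nested result shape are illustrative assumptions rather than the code in ``test_merged``.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def shortcircuit_if_identical(patch1: Path, patch2: Path, merged: Path) -> Optional[dict]:
    """Copy patch1 to merged and report "identical" when both patches match exactly."""
    # shallow=False forces a content comparison, like `cmp`.
    if not filecmp.cmp(patch1, patch2, shallow=False):
        return None  # caller falls through to the real merge path
    shutil.copyfile(patch1, merged)
    return {"merge": {"status": "identical"}}


patch_text = "--- a/f.py\n+++ b/f.py\n@@ -1 +1 @@\n-old\n+new\n"
with tempfile.TemporaryDirectory() as root:
    root_path = Path(root)
    p1, p2, p3 = (root_path / name for name in ("a1.patch", "a2.patch", "a3.patch"))
    p1.write_text(patch_text)
    p2.write_text(patch_text)
    p3.write_text(patch_text + "+extra\n")
    merged = root_path / "merged.patch"
    identical_result = shortcircuit_if_identical(p1, p2, merged)
    merged_text = merged.read_text()
    diverged_result = shortcircuit_if_identical(p1, p3, root_path / "other.patch")
```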
Verified on the codex-team e2e:
- cx_team_v5 (codex agents perfectly merged to identical 243-line
patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
- cx_team_v4 (codex agents diverged on the merge): unchanged at
f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
agent2-alone via apply_status: {'agent1': 'failed', ...}
I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.
Also lands previously-uncommitted housekeeping:
- prompt.py: ruff-format-only diff on the merge-required block
from the prior commit
- test_team_wiring.py: ruff --fix removed unused MagicMock
imports
- test_gcp_backend.py / test_tasks.py: ruff --fix removed
f-string-without-placeholder and unused-json import (both
unrelated drift caught by the gate)
Tests: 1 new (full suite: 327 passed)
- ``test_test_merged_shortcircuits_on_identical_patches`` — source
inspection confirms the short-circuit branch + "identical"
merge-status string exist in test_merged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands. This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.
Files:
* ``openhands/tools/task_tracker/coop_definition.py`` — new tool
definition + executor. Same ``TaskTrackerAction`` /
``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
trip through the shared ``cb:<run_id>:`` Redis namespace that
``TaskListClient`` (host side) writes to. Tasks are auto-owned
by the calling agent; ``view`` shows peer tasks prefixed with
``[<their_agent_id>]``. Registered under both
``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
the module rebinds the latter to the Coop variant.
* ``openhands/tools/preset/default.py`` — gains a ``team_mode``
kwarg (kept for API stability + tests; the actual swap happens
server-side via the .pth/__init__ side-effect import, not by
changing the host-side tool list). The pre-PR coop prompt block is
now split into a more nuanced team-mode prompt section that
documents the TaskTracker → shared-list behavior.
* ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
layers two more bits into the Modal image at build time:
- ``add_local_file`` of ``coop_definition.py`` to
``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
install)
- ``grep ... || echo`` appending
``from . import coop_definition`` to the package's
``__init__.py`` so the registration runs at import time.
Tests: 1 new + updated image-layering assertions
- ``test_importing_coop_definition_overrides_local_registration``:
inspecting the registry's ``_MODULE_QUALNAMES`` confirms
``TaskTrackerTool.name`` resolves to ``coop_definition``'s
registration after import.
- ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
calls + 2 ``run_commands`` layers (tool-file install +
``coop-task-*`` wrappers) and that the
``from . import coop_definition`` line is in the install
commands.
Full suite: 329 passed. Ruff / format / mypy all green.
KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis. The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``. The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.
For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning. Three pieces, working together:
1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
accessed via ``unencrypted_host:unencrypted_port``). Re-uses the
existing ``connectors/redis_server.ModalRedisServer`` — it was
already written, just unused. Both the host TaskListClient and
the agent sandboxes point at the same public TCP endpoint, so
pre-seed and agent reads/writes share state. Falls back to local
Redis for the other adapters.
2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
adapter now ``add_local_file``s three pieces into the OpenHands
image at build time:
- ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
- ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
- ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
(replaces upstream; same exports + a side-effect import of
coop_definition so the Redis-backed executor overrides the
local TaskTracker registration at first import).
Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
bytecode cache so the new __init__ actually re-runs.
3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
connections after a few minutes, so the Redis client opened for the
pre-seed at startup is closed before the 9-min agent run finishes.
The client is re-opened at harvest time using the same URL.
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:
- Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
- Both agents Submitted, 9m total
- Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
- Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
- Cost: $3.33
Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.
Full suite: 329 passed. Ruff / format / mypy all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml. It just happened to be present in my local venv
because something else pulled it in transitively. GitHub CI installs
``[dev]`` only, so on a clean install pytest collection fails with
``ModuleNotFoundError: No module named 'fakeredis'`` on every
team-mode test file.
Adding the dependency explicitly fixes PR #52 (team-mode) CI; once
team-mode merges, PR #55 (team-all-adapters) will also pick it up via
the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):
1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
The old package was renamed in v0.0.13; both swe_agent files
had stale imports that failed at module load (TypeError or
ModuleNotFoundError depending on how the framework was invoked),
making every swe_agent invocation return Error before any LLM
call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
in pyproject.toml. swe_agent's vendored framework imports these
at module-load time even when the docker/S3/model paths are
dormant, so a clean ``pip install '.[swe-agent]'`` without these
would still ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply. Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring. swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.
Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body. For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core subset
in team setting. All four frameworks in the comparison now score
≥ 5/10.
- normalize_patch ate trailing blank context lines (text.strip()
consumes " \n"), breaking last-hunk line counts so git apply rejected
otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all: it
used a raw .strip() on the patch.txt read, so every msa patch ended in
a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
long-running command, so the image's default CMD exited and every exec
hit "Sandbox not found". Pass "sleep", "infinity" as the positional
command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
because shared build_environment was hardcoded to DockerEnvironment.
Added a backend kwarg and threaded config["backend"] through both
adapters.
- Team lead prompt buried the integration step at the bottom of a long
workflow list; Claude/Codex consistently exited after their own
feature without reading /workspace/shared/<agent>.patch. Rewrote with
a hard-rule opener and a 5-point pre-submission checklist. Member
prompt now opens with "stay in your lane" per the lead's PLAN.md.
Alongside the bug fixes, eval test_merged now falls back to testing
each agent's patch alone when the merged tree doesn't pass both
features. Surfaced as merge.strategy="solo-agent1" / "solo-agent2".
Credits the agent (typically the lead) who correctly integrated both
features into one working patch but had it corrupted by union-merging
with the other agent's partial implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
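The normalize_patch fix described above can be sketched as follows. This is a hypothetical re-implementation written from the description (strip only "\n" at both ends, then add exactly one trailing newline); the real function body is not shown in this PR text.

```python
def normalize_patch(text: str) -> str:
    """Drop leading/trailing newlines only, then end with exactly one newline."""
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# A tiny diff whose last hunk line is a blank context line (a single space).
raw = "--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,2 @@\n-old\n+new\n \n"

# str.strip() treats the space as whitespace too, so it removes the blank
# context line and leaves the hunk one line short of its header count.
stripped = raw.strip()

# Stripping only "\n" keeps the context line and restores the final newline.
normalized = normalize_patch("\n" + raw + "\n")
```

With these inputs `stripped` ends in `+new` with no newline, while `normalized` is byte-identical to `raw`.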
- dataset/subsets/core.json: 10-pair subset for quick agent
comparisons. Stratified by repo (largest-remainder proportional
allocation by full-dataset pair count) with a one-slot floor per
primary language (Python / Go / Rust / TS). Reproducible via
scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
frameworks on the core subset in team setting. Includes per-task
pass/fail matrix annotated with the merge strategy used, plus the
chronological narrative of the dozen reruns that surfaced each of the
bugs fixed in the previous commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
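Largest-remainder proportional allocation, the method named for the core subset, can be sketched as below. The repo names and pair counts are made up, the per-language one-slot floor and the seeded sampling within each repo are omitted, and the tie-break rule is an assumption; this is not the code in scripts/generate_core_subset.py.

```python
from typing import Dict


def largest_remainder(counts: Dict[str, int], slots: int) -> Dict[str, int]:
    """Split `slots` across strata in proportion to `counts`."""
    total = sum(counts.values())
    if total <= 0 or slots < 0:
        raise ValueError("need a positive total and a non-negative slot count")
    # Integer arithmetic keeps the floors and remainders exact.
    alloc = {name: slots * n // total for name, n in counts.items()}
    remainder = {name: slots * n % total for name, n in counts.items()}
    leftover = slots - sum(alloc.values())
    # Hand the remaining slots to the largest remainders;
    # ties go to the larger stratum (an assumed tie-break).
    order = sorted(counts, key=lambda name: (remainder[name], counts[name]), reverse=True)
    for name in order[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical pair counts per repo, allocated into a 10-pair subset.
pair_counts = {"repo_a": 47, "repo_b": 33, "repo_c": 14, "repo_d": 6}
allocation = largest_remainder(pair_counts, slots=10)
```

Here the exact quotas are 4.7, 3.3, 1.4 and 0.6, so the floors use 8 slots and the two largest remainders (`repo_a` and `repo_d`) receive the remaining two.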
Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.
Doesn't change any of the current 40 eval results: union's
merge=union attribute is tolerant enough that every task in the
dataset produces some tree (potentially broken code with
stitched-together lines); the broken-tree-tests-fail path already
triggered the solo fallback. This just closes the defensive gap for
future pathological cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:
1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative
(no further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against
both feature suites
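The three-step chain above reduces to a small decision function. The sketch below is illustrative only: the function name and the returned labels are assumptions that echo the terms used in this message, not identifiers from eval/evaluate.py.

```python
def choose_merge_strategy(patches_identical: bool, naive_merge_clean: bool) -> str:
    """Pick which tree the eval scores, following the chain described above."""
    if patches_identical:
        return "identical"   # step 1: skip the merge entirely
    if naive_merge_clean:
        return "naive"       # step 2: merged-tree tests are authoritative
    return "lead-alone"      # step 3: test the lead's patch.txt by itself


decisions = {
    (True, True): choose_merge_strategy(True, True),
    (False, True): choose_merge_strategy(False, True),
    (False, False): choose_merge_strategy(False, False),
}
```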
Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).
Effect on the core-subset horizontal comparison:
msa 6 → 6 (unchanged)
oh 5 → 4 (loses pallets_jinja/1621 — was passing via union, which
concealed that oh's lead doesn't integrate)
cc 5 → 5 (unchanged)
cx 5 → 5 (unchanged)
oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.
BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
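The three-step chain in this commit message can be sketched as a small decision function. This is a hedged illustration, not the code in `test_merged`: `evaluate_team`, `MergeOutcome`, the strategy labels, and the convention that `naive_merge` returns `None` on conflict are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MergeOutcome:
    strategy: str  # hypothetical labels: "identical", "naive", "lead-alone"
    passed: bool


def evaluate_team(
    lead_patch: str,
    member_patch: str,
    naive_merge: Callable[[str, str], Optional[str]],
    run_tests: Callable[[str], bool],
) -> MergeOutcome:
    """Apply the policy: identical, then naive merge, then lead alone."""
    # 1. Identical patches: nothing to merge, test the patch directly.
    if lead_patch == member_patch:
        return MergeOutcome("identical", run_tests(lead_patch))

    # 2. Clean naive 3-way merge: the merged tree is authoritative,
    #    so a failure here does NOT fall through to step 3.
    merged = naive_merge(lead_patch, member_patch)  # None means conflict
    if merged is not None:
        return MergeOutcome("naive", run_tests(merged))

    # 3. Conflict: only the lead's patch is tested, against both suites.
    return MergeOutcome("lead-alone", run_tests(lead_patch))
```

There is deliberately no branch for the member's patch: under the stated rationale, a team whose lead did not integrate has failed regardless of the member's branch.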
codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal.
Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
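A minimal sketch of the stdin fix, assuming the command is assembled as a shell string. The function name, its signature, and passing the prompt as a positional argument are assumptions for illustration; the flags are the ones listed in the Codex adapter description, and the real `_build_codex_command` may differ.

```python
import shlex
from typing import Optional


def build_codex_command_sketch(prompt: str, model: Optional[str] = None) -> str:
    """Build a `codex exec` shell command whose stdin is already at EOF."""
    argv = [
        "codex", "exec", "--json",
        "--sandbox", "danger-full-access",
        "--skip-git-repo-check",
    ]
    if model:
        argv += ["--model", model]
    argv.append(prompt)  # assumption: prompt is the positional argument
    # Redirecting stdin from /dev/null yields an immediate EOF, so exec
    # mode does not wait on a stdin that the sandbox keeps open.
    return shlex.join(argv) + " </dev/null"


cmd = build_codex_command_sketch("fix the failing test")
```

When the process is launched from Python rather than through a shell, passing `stdin=subprocess.DEVNULL` to `subprocess.run` has the same effect.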
swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default.
Two upstream-swerex issues had to be worked around to make the docker path actually start a container:
1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex", which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead.
Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
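The entrypoint problem in item 1 follows from how Docker combines an image's ENTRYPOINT with the command given to `docker run`. The sketch below models only that rule; the helper names are hypothetical and nothing here calls swerex or Docker.

```python
from typing import List, Sequence


def effective_argv(entrypoint: Sequence[str], cmd: Sequence[str]) -> List[str]:
    """Model Docker's rule: the process argv is ENTRYPOINT followed by CMD."""
    return list(entrypoint) + list(cmd)


def docker_run_argv(image: str, cmd: Sequence[str],
                    docker_args: Sequence[str] = ()) -> List[str]:
    """Assemble a `docker run` invocation with extra args before the image."""
    return ["docker", "run", *docker_args, image, *cmd]


startup = ["sh", "-c", "<startup>"]

# With the image's entrypoint in place, runner.sh receives "sh" as argv[1].
with_runner = effective_argv(["/usr/local/bin/runner.sh"], startup)

# With the entrypoint cleared, the startup command runs as intended.
cleared = effective_argv([], startup)

# The workaround: an empty --entrypoint value placed before the image name.
argv = docker_run_argv("task-image", startup, docker_args=["--entrypoint", ""])

# Item 2: name the package with --spec and the executable separately.
pipx_prefix = ["pipx", "run", "--spec", "swe-rex", "swerex-remote"]
```

`task-image` is a placeholder image name, and the arguments that follow `swerex-remote` are left out because the commit message elides them.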
Resolves the squash-merge conflicts from #52 landing on main. All conflicts followed the same pattern: this branch's HEAD contains #52's content plus the subsequent work on top, while main's squashed-merge commit contains only #52. Resolved each conflict by taking ours (HEAD), which preserves the cumulative state of:
- CHANGELOG: full Fixed/Changed/Added entries for team-mode bug fixes, eval policy change, core subset + benchmark doc, plus the original "team setting" bullet from #52
- _team/prompt.py: the stronger lead-prompt with the 5-point integration checklist (#52 had the older "buried integration" version)
- swe_agent/adapter.py: team-mode kwarg propagation + Docker backend dispatch + pipx --spec monkey-patch
- runner/team.py: openhands_sdk Modal-Redis tunnel branch
- everywhere else: my newer adapter changes are strict supersets of #52's
CI green locally: 329 tests passed, ruff clean, mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.
Adds TeamSession + TeamHarnessConfig:
- TeamSession bundles per-run state (run_id, namespaced Redis URL,
ordered agent list, scratchpad volume name) with the feature config
and exposes adapter-facing factories that each return None / [] / {}
when their feature is disabled, so adapter code paths collapse to one
branch:
coop_env.update(session.env_for(agent_id))
extra_run_args.extend(session.scratchpad_mount_args())
mcp_config = session.mcp_config(container_script_path=...)
- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
(task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member
role split is the always-on baseline -- without it team is just coop.
Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter. result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.
Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers. Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.
Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating. Full suite
363 passed, 63 skipped; ruff/format/mypy clean.
End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
tasks files, empty metrics dict, no volume. team_features in
result.json reflects the requested ablation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
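The facade described in this commit message can be sketched as follows. This is a reduced illustration under stated assumptions, not the package's code: the environment variable names, the `/workspace/shared` mount target, and the choice of which features count as consumers of the Redis environment are hypothetical, and only two of the factories are shown.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass(frozen=True)
class TeamHarnessConfig:
    """Five per-feature booleans; everything is on by default."""
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True

    @classmethod
    def disabled(cls) -> "TeamHarnessConfig":
        return cls(**{f.name: False for f in fields(cls)})

    @classmethod
    def with_only(cls, *names: str) -> "TeamHarnessConfig":
        known = {f.name for f in fields(cls)}
        unknown = set(names) - known
        if unknown:
            raise ValueError(f"unknown features: {sorted(unknown)}")
        return cls(**{name: name in names for name in known})


@dataclass
class TeamSession:
    run_id: str
    redis_url: str  # host-side URL
    agents: List[str]
    team_volume: str
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def env_for(self, agent_id: str) -> Dict[str, str]:
        # Assumption: task_list and mcp are the features that need Redis.
        if not (self.config.task_list or self.config.mcp):
            return {}
        url = self.redis_url
        for host in ("localhost", "127.0.0.1"):
            # Rewrite so the URL resolves from inside the container.
            url = url.replace(f"//{host}", "//host.docker.internal")
        # Hypothetical variable names.
        return {"TEAM_REDIS_URL": url, "TEAM_AGENT_ID": agent_id}

    def scratchpad_mount_args(self) -> List[str]:
        if not self.config.scratchpad or not self.team_volume:
            return []
        return ["-v", f"{self.team_volume}:/workspace/shared"]


session = TeamSession("96067cd3", "redis://localhost:6379/0",
                      ["agent1", "agent2"], "cb-team-96067cd3")
coop_env: Dict[str, str] = {}
extra_run_args: List[str] = []
coop_env.update(session.env_for("agent1"))
extra_run_args.extend(session.scratchpad_mount_args())
```

Because each factory returns an empty container when its feature is off, the last two lines work unchanged for an ablated session such as `TeamSession(..., config=TeamHarnessConfig.disabled())`.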
End-to-end eval results (was missing from the original PR body — ran
Real pass-rate delta in the expected direction: with the shared task list / scratchpad / MCP off, the lead had nowhere to fetch the member's patch from, naive merge produced a non-composing diff, and eval failed. With them on, the lead integrated and the team passed. n=1 is far too small for any conclusion about which of the three features is doing the work, but the plumbing and the ablation signal both work end-to-end. Next-PR target: the same matrix across the
Ablation matrix (core 10-pair × marginal-effect design)
Used the marginal-effect design (1 baseline + 5 one-feature-off) instead of the full 2⁵=32 interaction matrix; it answers the "what's each feature's contribution" question at ~1/5 the cost. Codex CLI on
Headline numbers
Per-task breakdown — the signal is concentrated
What this says about each feature
Caveats
Raw data
🤖 Generated with Claude Code
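The marginal-effect design mentioned in the ablation-matrix comment above (1 baseline + 5 one-feature-off) can be enumerated in a few lines. The feature names come from `TeamHarnessConfig`; the function and the configuration labels are hypothetical.

```python
from typing import Dict, List, Tuple

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def marginal_effect_configs(
    features: Tuple[str, ...] = FEATURES,
) -> List[Tuple[str, Dict[str, bool]]]:
    """Return the baseline plus one configuration per single feature off."""
    baseline = {name: True for name in features}
    configs = [("baseline", baseline)]
    for off in features:
        # Copy the baseline and switch off exactly one feature.
        configs.append((f"no_{off}", {**baseline, off: False}))
    return configs


configs = marginal_effect_configs()
full_matrix_size = 2 ** len(FEATURES)  # every on/off combination
```

Six configurations against 32 is where the "~1/5 the cost" figure comes from; the trade-off is that interactions between features are not measured.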
# Conflicts:
# CHANGELOG.md
# src/cooperbench/agents/_team/__init__.py
# src/cooperbench/agents/_team/prompt.py
# src/cooperbench/agents/mini_swe_agent_v2/adapter.py
# src/cooperbench/agents/openhands_agent_sdk/adapter.py
# src/cooperbench/agents/openhands_agent_sdk/openhands-tools/openhands/tools/task_tracker/coop_definition.py
# src/cooperbench/agents/swe_agent/adapter.py
# src/cooperbench/runner/team.py
# tests/agents/test_team_wiring.py
# tests/team_harness/test_prompt.py
Summary
Stacks on #55. Lifts the team-mode coordination primitives out of `cooperbench/agents/_team` (private, benchmark-internal) into `cooperbench/team_harness` (public, library-shaped) so the algorithm can be studied and consumed by other benchmarks, which is the long-horizon target discussed in #52's followups. Adds five per-feature ablation flags so we can measure each coordination mechanism's contribution independently.
What's new
`cooperbench.team_harness`: public package
A documented sibling of `cooperbench/agents`, not nested under it. Same modules as the old `_team/` (`task_list`, `protocol`, `mcp_server`, `prompt`, `loop_refresh`, `fs_mirror`, `metrics`, `runtime`, `coop_task`, `install_snippet.sh`) plus a facade in `__init__.py`:
- `TeamHarnessConfig`: frozen dataclass of five per-feature booleans (`task_list`, `scratchpad`, `mcp`, `auto_refresh`, `protocol`). `with_only("task_list", "mcp")` and `disabled()` helpers.
- `TeamSession`: bundles `run_id` / `redis_url` / `agents` / `team_volume` / `config`. Adapter-facing factories: `env_for`, `scratchpad_mount_args`, `mcp_config`, `prompt_for`, `prompt_section`, `loop_poller`, `task_list_client`, `harvest_metrics`. Each factory returns `None` / `[]` / `{}` when its feature is disabled so adapters write one code path.
- `COOP_TASK_SCRIPT_PATH`, `INSTALL_SNIPPET_PATH`, `MCP_SERVER_SCRIPT_PATH`, `MCP_SERVER_NAME`: adapters import these instead of computing `Path(__file__).parent.parent / "_team"`.
`TeamSession.redis_url` is the host-side URL; `env_for()` rewrites `localhost` / `127.0.0.1` → `host.docker.internal` so adapters don't have to plumb that themselves. The rewrite is duplicated from `_coop.runtime` rather than imported, because the harness is meant to be portable to other benchmarks that don't ship `_coop`.
Ablation flags
Five `--team-no-*` flags on `cooperbench run`, each gating one coordination mechanism.
The lead/member role split stays on either way. Without it team mode collapses to coop, so it's the always-on baseline.
`result.json` now records which features were enabled, so post-hoc analysis can attribute pass-rate deltas to the specific feature that was off, without cross-referencing CLI invocations.
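As a rough illustration of how negative flags could map onto the recorded `team_features` block, the sketch below uses `argparse`. It is not the wiring in `cli.py`: `--team-no-task-list`, `--team-no-scratchpad`, and `--team-no-mcp` appear in this PR, while the spellings `--team-no-auto-refresh` and `--team-no-protocol`, the parser name, and the helper functions are assumptions.

```python
import argparse
import json
from typing import Dict, List

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="team-flags-sketch")
    for name in FEATURES:
        # Flag spelling derived from the feature name (partly assumed).
        flag = "--team-no-" + name.replace("_", "-")
        parser.add_argument(flag, dest=f"no_{name}", action="store_true",
                            help=f"disable the {name} mechanism")
    return parser


def team_features_from_args(argv: List[str]) -> Dict[str, bool]:
    """A feature is enabled unless its --team-no-* flag was passed."""
    args = build_parser().parse_args(argv)
    return {name: not getattr(args, f"no_{name}") for name in FEATURES}


features = team_features_from_args(
    ["--team-no-task-list", "--team-no-scratchpad", "--team-no-mcp"]
)
# Fragment in the shape the ablated smoke run reports for team_features.
result_fragment = json.dumps({"team_features": features}, indent=2)
```

Recording the resolved booleans rather than the raw flags keeps default runs and ablated runs directly comparable in analysis scripts.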
Adapter refactor
`claude_code`, `codex`, `mini_swe_agent_v2`, `swe_agent`, and `openhands_agent_sdk` all accept a new `team_features: TeamHarnessConfig | None` kwarg and construct a local `TeamSession` instead of calling loose helpers. Each adapter's team-mode blocks (prompt assembly, env vars, scratchpad mount, MCP install, in-loop poller, CLI install) gate on `session.config.<feature>`. For example, the MCP install in `claude_code` now gates on the session's config, with no `is_team` flag spread around.
Tests
`tests/agents/_team` → `tests/team_harness` (rename, 83 existing tests still pass). Plus:
- `tests/team_harness/test_session.py` (29 new): covers `TeamHarnessConfig` defaults / `with_only` / `disabled`, `TeamSession.lead` / `role_for` / `is_active`, `env_for` (default, localhost rewrite, non-localhost passthrough, empty when all consumers off, full when only `mcp` remains), `scratchpad_mount_args` (default, off, empty volume), `mcp_config` (default, off), `task_list_client` (on, off), `loop_poller` (on, off), `harvest_metrics` (active, off, None client), `prompt_for` (lead, member), `prompt_section` (default, single-agent collapse).
- `tests/runner/test_team.py` (4 new): `team_features` recorded in `result.json` by default; same config instance propagates to every adapter call; `--team-no-task-list` skips Redis pre-seed and produces empty metrics dict while keeping the role split; `--team-no-scratchpad` clears `config["team_volume"]` to the empty string.
Full suite: 363 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end on `dottxt_ai_outlines/1371 [1,2]` with `codex --setting team --backend docker`
| Run | Flags | `result.metrics` | `cb-team-*` volume | `team_features` |
| --- | --- | --- | --- | --- |
| `smoke-team-default` | (default) | `{tasks_total: 2, tasks_done: 2, time_to_first_claim_seconds: 67.3, claims_per_agent: {agent2: 1}}` | created (`cb-team-96067cd3`) | `true` |
| `smoke-team-ablated` | `--team-no-task-list --team-no-scratchpad --team-no-mcp` | `{}` | none | `task_list`/`scratchpad`/`mcp`: `false`, `auto_refresh`/`protocol`: `true` |
Both runs Submitted 2/2 agents.
Out of scope (next PRs)
- `TeamSession` API is validated only by CooperBench itself.
- `TeamHarnessConfig` × prompt variants × refresh cadence, with eval pass-rate as the objective and the coordination metrics (`time_to_first_claim_seconds`, `unowned_at_end`, `claims_per_agent`) as cheap first-pass proxies.
Test plan
- `ruff check`, `ruff format --check`, `mypy`, `pytest tests/` (all green locally)
- `uv run cooperbench run -a codex -r dottxt_ai_outlines_task -t 1371 -f 1,2 --setting team --backend docker --no-auto-eval -n smoke-team-default`
- `--team-no-task-list --team-no-scratchpad --team-no-mcp`
- `result.json`: `team_features` matches the flags on both runs
🤖 Generated with Claude Code