
team_harness: extract team mode as standalone harness + ablation flags#58

Merged
ProKil merged 20 commits into main from team-harness-module
May 21, 2026


@ProKil ProKil commented May 19, 2026

Summary

Stacks on #55. Lifts the team-mode coordination primitives out of cooperbench/agents/_team (private, benchmark-internal) into cooperbench/team_harness (public, library-shaped) so the algorithm can be studied and consumed by other benchmarks — the long-horizon target discussed in #52's followups. Adds five per-feature ablation flags so we can measure each coordination mechanism's contribution independently.

What's new

cooperbench.team_harness — public package

A documented sibling of cooperbench/agents, not nested under it. Same modules as the old _team/ (task_list, protocol, mcp_server, prompt, loop_refresh, fs_mirror, metrics, runtime, coop_task, install_snippet.sh) plus a facade in __init__.py:

| Public symbol | What it is |
|---|---|
| TeamHarnessConfig | Frozen dataclass of five booleans (task_list, scratchpad, mcp, auto_refresh, protocol). with_only("task_list", "mcp") and disabled() helpers. |
| TeamSession | Per-run object bundling run_id / redis_url / agents / team_volume / config. Adapter-facing factories: env_for, scratchpad_mount_args, mcp_config, prompt_for, prompt_section, loop_poller, task_list_client, harvest_metrics. Each factory returns None / [] / {} when its feature is disabled so adapters write one code path. |
| Path constants | COOP_TASK_SCRIPT_PATH, INSTALL_SNIPPET_PATH, MCP_SERVER_SCRIPT_PATH, MCP_SERVER_NAME; adapters import these instead of computing Path(__file__).parent.parent / "_team". |

TeamSession.redis_url is the host-side URL; env_for() rewrites localhost / 127.0.0.1 → host.docker.internal so adapters don't have to plumb that themselves. The rewrite is duplicated from _coop.runtime rather than imported, because the harness is meant to be portable to other benchmarks that don't ship _coop.
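
A minimal sketch of that loopback rewrite, for readers who want to see its shape. The function name `rewrite_url_for_container` and the constants are illustrative assumptions, not the harness's actual code; only the localhost / 127.0.0.1 to host.docker.internal mapping comes from the description above.

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts that only resolve on the machine running the benchmark (from the PR text).
LOOPBACK_HOSTS = {"localhost", "127.0.0.1"}
# Name Docker containers can use to reach the host (from the PR text).
CONTAINER_HOST_ALIAS = "host.docker.internal"


def rewrite_url_for_container(url: str) -> str:
    """Hypothetical helper: point a host-side URL at the Docker host alias."""
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        # Non-loopback URLs (e.g. a remote Redis) pass through untouched.
        return url

    # Rebuild the netloc, keeping any port and credentials that were present.
    netloc = CONTAINER_HOST_ALIAS
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"
    return urlunsplit(parts._replace(netloc=netloc))


print(rewrite_url_for_container("redis://localhost:6379/0"))
```

Parsing the URL rather than doing a plain string replace keeps a hostname such as `localhost.example.com` from being rewritten by accident.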

Ablation flags

Five --team-no-* flags on cooperbench run, each gating one coordination mechanism:

--team-no-task-list      shared Redis task list + pre-seeding + metrics
--team-no-scratchpad     /workspace/shared Docker volume
--team-no-mcp            wait_for_message MCP registration
--team-no-auto-refresh   in-loop task-list summary injection (Python-loop adapters)
--team-no-protocol       coop-request / coop-respond / coop-pending verbs

The lead/member role split stays on either way — without it team mode collapses to coop, so it's the always-on baseline.
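
To illustrate how the flags could map onto a frozen five-boolean config, here is a rough sketch. `HarnessConfigSketch` and `config_from_flags` are hypothetical names, and the semantics of `with_only` (enable exactly the named features) and `disabled` (turn everything off) are assumptions inferred from the helper names; the real TeamHarnessConfig may differ.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class HarnessConfigSketch:
    """Hypothetical stand-in for TeamHarnessConfig: one boolean per feature."""

    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True

    @classmethod
    def disabled(cls) -> "HarnessConfigSketch":
        # Assumed meaning: every coordination feature off.
        return cls(**{f.name: False for f in fields(cls)})

    @classmethod
    def with_only(cls, *names: str) -> "HarnessConfigSketch":
        # Assumed meaning: the named features on, all others off.
        known = {f.name for f in fields(cls)}
        unknown = set(names) - known
        if unknown:
            raise ValueError(f"unknown features: {sorted(unknown)}")
        return cls(**{name: name in names for name in known})


def config_from_flags(flags: list[str]) -> HarnessConfigSketch:
    """Map --team-no-<feature> flags to a config (illustrative only)."""
    off = {f.removeprefix("--team-no-").replace("-", "_") for f in flags}
    known = {f.name for f in fields(HarnessConfigSketch)}
    if off - known:
        raise ValueError(f"unknown flags: {sorted(off - known)}")
    return HarnessConfigSketch(**{name: name not in off for name in known})


ablated = config_from_flags(
    ["--team-no-task-list", "--team-no-scratchpad", "--team-no-mcp"]
)
print(asdict(ablated))
```

Freezing the dataclass means the same instance can be handed to every adapter call without any of them mutating it.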

result.json now records which features were enabled:

"team_features": {
  "task_list": false,
  "scratchpad": false,
  "mcp": false,
  "auto_refresh": true,
  "protocol": true
}

so post-hoc analysis can attribute pass-rate deltas to the specific feature that was off, without cross-referencing CLI invocations.
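
As a sketch of what that post-hoc attribution might look like: the snippet below groups already-loaded result dicts by which features were off. Only the `team_features` key comes from this PR; the `pass_rate` score key and both function names are hypothetical placeholders for whatever the analysis actually reads.

```python
from collections import defaultdict
from statistics import mean


def disabled_features(result: dict) -> tuple[str, ...]:
    """Names of the features recorded as off in one result dict."""
    features = result.get("team_features", {})
    return tuple(sorted(name for name, on in features.items() if not on))


def mean_score_by_ablation(results: list[dict], score_key: str = "pass_rate") -> dict:
    """Average a score per ablation; `pass_rate` is an assumed key name."""
    buckets: dict[tuple[str, ...], list[float]] = defaultdict(list)
    for result in results:
        buckets[disabled_features(result)].append(result[score_key])
    return {ablation: mean(scores) for ablation, scores in buckets.items()}


all_on = {"task_list": True, "scratchpad": True, "mcp": True,
          "auto_refresh": True, "protocol": True}
no_mcp = {**all_on, "mcp": False}
runs = [
    {"team_features": all_on, "pass_rate": 1.0},
    {"team_features": no_mcp, "pass_rate": 0.5},
    {"team_features": no_mcp, "pass_rate": 1.0},
]
print(mean_score_by_ablation(runs))
```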

Adapter refactor

claude_code, codex, mini_swe_agent_v2, swe_agent, openhands_agent_sdk all accept a new team_features: TeamHarnessConfig | None kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt assembly, env vars, scratchpad mount, MCP install, in-loop poller, CLI install) gate on session.config.<feature>. For example, the MCP install in claude_code is now:

mcp_config = team_session.mcp_config(container_script_path=CONTAINER_TEAM_MCP_PATH) if team_session else None
if mcp_config is not None:
    write_file_in_container(env, CONTAINER_TEAM_MCP_PATH, TEAM_MCP_SCRIPT_PATH.read_text())
    write_file_in_container(env, f"{CONTAINER_CLAUDE_CONFIG_DIR}/.claude.json", json.dumps(mcp_config, indent=2))

— gate on the session's config, no is_team flag spread around.

Tests

tests/agents/_team → tests/team_harness (rename, 83 existing tests still pass). Plus:

  • tests/team_harness/test_session.py (29 new) — covers TeamHarnessConfig defaults / with_only / disabled, TeamSession.lead/role_for/is_active, env_for (default, localhost rewrite, non-localhost passthrough, empty when all consumers off, full when only mcp remains), scratchpad_mount_args (default, off, empty volume), mcp_config (default, off), task_list_client (on, off), loop_poller (on, off), harvest_metrics (active, off, None client), prompt_for (lead, member), prompt_section (default, single-agent collapse).

  • tests/runner/test_team.py (4 new): team_features recorded in result.json by default; same config instance propagates to every adapter call; --team-no-task-list skips Redis pre-seed and produces empty metrics dict while keeping the role split; --team-no-scratchpad clears config["team_volume"] to the empty string.

Full suite: 363 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines/1371 [1,2] with codex --setting team --backend docker

| Run | flags | task_log.json | tasks.json | result.metrics | cb-team-* volume | team_features |
|---|---|---|---|---|---|---|
| smoke-team-default | (defaults) | | | {tasks_total: 2, tasks_done: 2, time_to_first_claim_seconds: 67.3, claims_per_agent: {agent2: 1}} | created (cb-team-96067cd3) | all true |
| smoke-team-ablated | --team-no-task-list --team-no-scratchpad --team-no-mcp | | | {} | not created | task_list/scratchpad/mcp: false, auto_refresh/protocol: true |

Both runs Submitted 2/2 agents.

Out of scope (next PRs)

  • Long-horizon benchmark consumer that drives task creation dynamically rather than pre-seeding one task per feature — the actual second-consumer this PR is in service of. Without it the TeamSession API is validated only by CooperBench itself.
  • Automatic search over TeamHarnessConfig × prompt variants × refresh cadence, with eval pass-rate as the objective and the coordination metrics (time_to_first_claim_seconds, unowned_at_end, claims_per_agent) as cheap first-pass proxies.

Test plan

  • ruff check, ruff format --check, mypy, pytest tests/ (all green locally)
  • Default smoke: uv run cooperbench run -a codex -r dottxt_ai_outlines_task -t 1371 -f 1,2 --setting team --backend docker --no-auto-eval -n smoke-team-default
  • Ablation smoke: same + --team-no-task-list --team-no-scratchpad --team-no-mcp
  • Confirm result.json:team_features matches the flags on both runs

🤖 Generated with Claude Code

Ubuntu and others added 19 commits May 16, 2026 20:42
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter.  Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:

  - Invokes ``codex exec --json --sandbox danger-full-access
    --skip-git-repo-check --model <id>``.
  - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
    inside the container so the CLI authenticates without prompts.
  - Parses Codex's JSONL event stream for status / token totals /
    messages.  Cost is reported as 0.0 because Codex does not emit a
    cost field; tokens are summed across ``turn.completed`` events.
  - Model fallback: if Codex rejects ``--model gpt-5.5`` with a
    "model not found" shaped error, the adapter retries once without
    ``--model`` and lets Codex pick its default.
  - Preflight credential check: if OPENAI_API_KEY is unset the adapter
    returns Error immediately instead of spinning up a container that
    can only fail.
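
For illustration, token totals could be summed from a JSONL stream roughly as follows. The ``turn.completed`` event type is from the description above; the ``usage`` object and its ``input_tokens`` / ``output_tokens`` field names are assumptions about the event shape, and ``sum_turn_tokens`` is a hypothetical name rather than the adapter's parser.

```python
import json


def sum_turn_tokens(jsonl_text: str) -> dict[str, int]:
    """Sum token counts across turn.completed events in a JSONL stream."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Skip non-JSON noise interleaved with the event stream.
            continue
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}  # assumed field name
        for key in totals:
            totals[key] += int(usage.get(key, 0))
    return totals


stream = "\n".join([
    json.dumps({"type": "turn.completed",
                "usage": {"input_tokens": 100, "output_tokens": 20}}),
    "warning: not a JSON line",
    json.dumps({"type": "item.completed"}),
    json.dumps({"type": "turn.completed",
                "usage": {"input_tokens": 5, "output_tokens": 1}}),
])
print(sum_turn_tokens(stream))
```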

Shared ``_coop`` module:

  - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
    installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
    ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
  - ``install_snippet.sh`` — pip-installs redis and drops the shell
    wrappers; each adapter's setup.sh sources it.
  - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
    agnostic.
  - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
    ``write_file_in_container`` / ``read_file_from_container``,
    ``rewrite_comm_url_for_container``, ``build_git_setup_command``,
    ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires.  Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace).  This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.

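A minimal sketch of the normalization described above (one trailing newline, no leading whitespace). It is named ``normalize_patch_sketch`` to make clear it is an illustration under those two stated rules, not the function shipped in ``runtime.py``.

```python
def normalize_patch_sketch(patch: str) -> str:
    """Drop leading whitespace and end the patch with exactly one newline."""
    # Only trailing newlines are removed: a final context line may be a
    # single space, which a blanket .strip() would destroy.
    body = patch.lstrip().rstrip("\n")
    if not body:
        return ""
    return body + "\n"


raw = "\n  diff --git a/x.py b/x.py\n+print('hi')\n\n\n"
print(repr(normalize_patch_sketch(raw)))
print(repr(raw.strip()))  # the old behaviour: no trailing newline left
```
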
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module.  Full suite: 228
passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:

  - codex solo f1:           Submitted, 1 turn, 365k input tokens,
                             184-line patch (with the trailing-newline
                             fix it applies cleanly)
  - codex coop+git f1,f2:    both Submitted, both patches applied but
                             0/2 tests pass — coordination failure
                             (agent1 fetched ``team`` but never merged,
                             so the stacked patches produce a Python
                             SyntaxError at line 144 of the modified
                             file).  Claude on the same task scored
                             2/2; Codex used the tools less aggressively
                             on this run.

The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug.  Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product.  Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:

  1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
     backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
     atomic claim semantics (HSETNX-style — exactly one caller wins on a
     race) and an audit log of every mutation.  Exposed in the container
     as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
     / ``coop-task-list`` shell wrappers.

  2. A **lead / member role split**.  The first agent is designated
     ``team-lead`` and gets a system-prompt block instructing them to
     break the spec into tasks, assign them via ``coop-task-create
     --assign``, watch progress, and integrate.  Other agents are
     ``member`` and look for open tasks to claim.

  3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
     mounted at ``/workspace/shared`` in every container.  Free
     coordination artifact for design notes, partial diffs, interface
     sketches.
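
The "exactly one caller wins" claim in primitive 1 can be sketched with a set-if-absent write. The key layout beyond the ``cb:<run_id>:`` prefix, the ``owner`` field, and the ``claim_task`` helper are assumptions for illustration; ``InMemoryHashStore`` is a tiny stand-in that mimics the return convention of redis-py's ``hsetnx(name, key, value)`` so the example runs without a server.

```python
class InMemoryHashStore:
    """Stand-in for a Redis client: only the two hash calls used below."""

    def __init__(self) -> None:
        self._hashes: dict[str, dict[str, str]] = {}

    def hsetnx(self, name: str, key: str, value: str) -> int:
        # Same contract as Redis HSETNX: 1 if the field was set, 0 if it existed.
        fields = self._hashes.setdefault(name, {})
        if key in fields:
            return 0
        fields[key] = value
        return 1

    def hget(self, name: str, key: str):
        return self._hashes.get(name, {}).get(key)


def claim_task(store, run_id: str, task_id: str, agent_id: str) -> bool:
    """Try to become the owner of a task; True only for the first claimer."""
    task_key = f"cb:{run_id}:task:{task_id}"  # assumed key layout
    return bool(store.hsetnx(task_key, "owner", agent_id))


store = InMemoryHashStore()
print(claim_task(store, "run1", "t1", "agent1"))  # first claim wins
print(claim_task(store, "run1", "t1", "agent2"))  # later claim loses
```

Against a real server the same call is atomic because Redis executes each command to completion before starting the next, so no lock is needed on the client side.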

Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``.  Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
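
As a rough sketch of deriving such metrics from an audit log: the event fields used here (``ts``, ``op``, ``task_id``, ``agent``, ``assignee``, ``status``) are an assumed schema, not the one ``TaskListClient`` writes, and the time to first claim is measured from the first logged event as a simplifying assumption.

```python
from collections import Counter


def coordination_metrics(events: list[dict]) -> dict:
    """Compute a few coordination metrics from an (assumed) audit-log shape."""
    if not events:
        return {}
    events = sorted(events, key=lambda e: e["ts"])
    start = events[0]["ts"]
    owners: dict[str, str | None] = {}
    status: dict[str, str] = {}
    claims = []
    for event in events:
        task_id = event["task_id"]
        owners.setdefault(task_id, None)
        status.setdefault(task_id, "open")
        if event["op"] == "create" and event.get("assignee"):
            owners[task_id] = event["assignee"]
        elif event["op"] == "claim":
            owners[task_id] = event["agent"]
            claims.append(event)
        elif event["op"] == "update" and "status" in event:
            status[task_id] = event["status"]
    return {
        "tasks_total": len(status),
        "tasks_done": sum(s == "done" for s in status.values()),
        "time_to_first_claim_seconds": claims[0]["ts"] - start if claims else None,
        "claims_per_agent": dict(Counter(e["agent"] for e in claims)),
        "unowned_at_end": sum(owner is None for owner in owners.values()),
    }


log = [
    {"ts": 0.0, "op": "create", "task_id": "t1", "agent": "bench-runner"},
    {"ts": 0.0, "op": "create", "task_id": "t2", "agent": "bench-runner"},
    {"ts": 67.3, "op": "claim", "task_id": "t1", "agent": "agent2"},
    {"ts": 90.0, "op": "update", "task_id": "t1", "agent": "agent2", "status": "done"},
]
print(coordination_metrics(log))
```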

Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs.  The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``.  The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
  - 18 task_list (CRUD, atomic claim, owner-only update, audit log,
    run isolation)
  - 12 prompt (lead vs member branches, solo fallback, git interaction)
  -  3 runtime (env assembly, scratchpad mount args)
  -  4 metrics (happy path, unowned-at-end, empty log, multiple claims)
  -  5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
    metrics in result, three-agent team)
  -  4 misc

Full suite: 274 passed, 63 skipped.  Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:

  - 5 tasks created (2 by bench-runner, 3 by the lead splitting its
    work), all reached ``done``
  - time_to_first_claim_seconds=34.2
  - claims_per_agent={agent1: 2, agent2: 1}
  - updates_per_agent={agent1: 4, agent2: 3}
  - scratchpad volume actively used (agent2 wrote its diff to
    /workspace/shared/agent2.patch + a summary.md)
  - **0/2 pass rate**; both ``patch.txt`` files were empty: the
    members wrote diffs to the scratchpad instead of also writing
    ``/workspace/repo/patch.txt``, and the lead never ran the final
    integration step.  This is real coordination signal (the prompt
    told them to write both places but they followed the scratchpad
    half only) — a follow-up will tighten the prompt to make patch.txt
    submission the explicit final step.

Future PRs (intentionally out of scope here so this lands at a
reviewable size):

  - In-loop auto-refresh for the Python-loop adapters
  - MCP long-poll tool to give CLI adapters push-ish inbox semantics
  - Typed ``coop-request`` / ``coop-respond`` protocol on top of
    messaging (CC's plan_approval_request shape)
  - Filesystem mirror of the task list (CC-style ``ls`` artefacts)

Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…resh (#53)

Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
   Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
   so agents can ``ls`` and ``cat`` tasks with their existing tools
   rather than going through the ``coop-task-list`` CLI.  Layout
   mirrors Claude Code's team primitive: one ``<id>.json`` per task,
   plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
   trail).  Triggered on every ``coop-task-list`` invocation and from
   the host runner at startup.  Files written via tempfile+replace so
   readers never observe a partial state.

2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
   Layered on plain Redis messaging, mirroring CC's
   ``plan_approval_request`` / ``plan_approval_response`` shape.
   ``coop-request <peer> <kind> <body>`` returns a request_id (and
   optionally blocks via ``--wait N`` for a response).
   ``coop-respond <request_id> <body>`` writes back; the sender's
   ``await_response`` uses BLPOP so it actually sleeps instead of
   busy-polling.  Both events flow into the shared task-log so
   coordination metrics include protocol events.

3. **MCP long-poll server** (``_team/mcp_server.py``).  Stdio
   JSON-RPC server that exposes a single ``wait_for_message`` tool
   backed by BLPOP on the agent's inbox.  Registered automatically:
   Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
   the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
   The point is to make "watch the inbox" a natural idle behavior for
   the CLI adapters instead of a busy-loop on ``coop-recv`` returning
   empty — the closest we can get to push-style delivery for opaque
   CLI agent loops.

4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
   ``TeamPoller`` is a per-agent host-side helper that
   ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
   queries — same hook as the existing inbox poll.  The LLM sees a
   compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
   prepended to every turn so it doesn't need to remember to call
   ``coop-task-list``.  Plumbed via ``agent.team_poller`` so the
   ``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
   The same module also exports ``poll_team_state()`` for in-container
   use (env-driven variant).

5. **Prompt fix**: the previous team-mode end-to-end had members
   writing diffs to ``/workspace/shared/<id>.patch`` only and never to
   ``/workspace/repo/patch.txt``, scoring 0/2 despite great
   coordination.  Both lead and member prompts now have an explicit
   ``### Final submission — REQUIRED`` section that calls out
   ``patch.txt`` as the only file the bench evaluates and provides
   the exact ``git diff > patch.txt`` command.
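
The tempfile+replace pattern from item 1 can be sketched as below. ``write_json_atomically`` is a hypothetical helper, not the code in ``fs_mirror.py``; the technique relies on ``os.replace`` swapping the file in a single step when source and destination are on the same filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path, payload) -> None:
    """Write JSON so readers see either the old file or the new one."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Leave no stray temp file behind if anything failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as root:
    index = Path(root) / "tasks" / "_index.json"
    write_json_atomically(index, {"open": 1, "in_progress": 2})
    print(json.loads(index.read_text(encoding="utf-8")))
```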

Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.

Tests: 37 new unit tests
  -  8 fs_mirror     (atomic writes, stale cleanup, empty index)
  -  9 protocol      (request roundtrip, await, timeout, audit log)
  -  9 mcp_server    (initialize, tools/list, tools/call,
                      timeout, blocking, unknown-tool error,
                      env factory)
  -  8 loop_refresh  (summary formatting, TeamPoller, env variant)
  -  3 prompt        (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).  All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``.  Previous run scored 0/2 on the
same task — the prompt fix is doing real work.

Stacks on #52.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode.  Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.

New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds.  Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.

Per-adapter wiring:

  - ``mini_swe_agent_v2``: appends team_task_section to task;
    propagates CB_TEAM_* through env_kwargs["env"]; adds
    ``--add-host=host.docker.internal:host-gateway`` + scratchpad
    volume to docker run args; installs the team CLI scripts + pip
    redis in the container after env spin-up.  The existing
    ``TeamPoller`` host-side hook (already in step()) still fires.

  - ``openhands_sdk``: appends team_task_section to task; folds a new
    ``team_env`` dict into ``coop_info`` so
    ``_build_credentials_dict`` propagates CB_TEAM_* into the
    sandbox.  Coop-task-* binary install in the OpenHands agent-server
    image is a follow-up — OpenHands manages its own image build and
    doesn't expose a clean post-start exec hook.

  - ``swe_agent``: appends team_task_section to task.  The SWE-agent
    framework's sandbox + agent loop is third-party and harder to
    instrument; everything beyond the prompt is a follow-up.

Tests: 13 new
  - 3 prompt unit tests for team_task_section (lead, member, empty)
  - 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
    consistency between team_task_section and build_team_instruction,
    every registered runner accepts the team kwargs, openhands env
    keys, swe_agent signature

Full suite: 324 passed, 63 skipped.  Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment.  Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.

Compatibility matrix is now:

  | Adapter             | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
  |---------------------|---------|-------------|--------------|------------------|----------|
  | claude_code         | yes     | yes (full)  | n/a          | yes              | yes      |
  | codex               | yes     | yes (full)  | n/a          | yes              | yes      |
  | mini_swe_agent_v2   | yes     | yes (sec.)  | yes          | yes              | yes      |
  | openhands_sdk       | yes     | yes (sec.)  | n/a          | NOT YET          | yes      |
  | swe_agent           | yes     | yes (sec.)  | NOT YET      | NOT YET          | NOT YET  |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required).  Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost.  Modal caches the layered image; subsequent
team runs are instant.

Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5.  The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked.  Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex.  Both patches applied; f1 14/14,
f2 19/20.

Tests: 2 new (full suite: 326 passed)
  - test_team_env_triggers_image_layering  — verifies add_local_file
    + pip_install + run_commands fire with the right args when team
    mode is active
  - test_no_layering_when_team_inactive    — verifies solo / coop
    runs skip the image-build cost

Matrix update — openhands_sdk now reads:
  Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
  CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it.  Both fetched their peer's branch (2 each)
and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.

The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it.  This rewrite:

  - Renames the section ``## Git collaboration — MERGE IS REQUIRED
    BEFORE SUBMITTING`` so the imperative is in the heading itself.
  - Adds an explicit "Required final sequence — run this verbatim
    before exiting" block with the full fetch+merge+diff sequence,
    parameterized over every partner branch.
  - Explains *why* (each agent's patch.txt is evaluated against every
    feature's tests; without the merge, the peer feature's symbols
    are missing → ImportError).
  - Frames it the same way the patch.txt step is framed (REQUIRED,
    skip-at-your-loss), which the original prompt fix proved
    codex responds to.

Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``.  Both
patches now contain both features' symbols.  Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).

A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree.  This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch.  That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked.  Fix is a separate concern from team-mode wiring.

Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.

Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches.  If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance.  Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
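
A simplified sketch of the short-circuit, assuming a byte-for-byte comparison of the two patch files. ``combine_patches`` and the ``needs_merge`` status are hypothetical; the sketch also leaves out the ``git apply --recount`` normalization and the real merge path.

```python
import filecmp
import shutil
from pathlib import Path


def combine_patches(patch_a: Path, patch_b: Path, merged: Path) -> dict:
    """Copy patch_a to merged when both patches are byte-identical."""
    # shallow=False forces a content comparison instead of trusting stat().
    if filecmp.cmp(patch_a, patch_b, shallow=False):
        shutil.copyfile(patch_a, merged)
        return {"merge": {"status": "identical"}}
    # Placeholder: the real flow would run the branch setup and merge here.
    return {"merge": {"status": "needs_merge"}}
```

Reporting a distinct status lets downstream analysis separate runs where agents converged on the same tree from runs that went through an actual merge.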

Verified on the codex-team e2e:

  - cx_team_v5 (codex agents perfectly merged to identical 243-line
    patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
  - cx_team_v4 (codex agents diverged on the merge): unchanged at
    f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
    agent2-alone via apply_status: {'agent1': 'failed', ...}

I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.

Also lands previously-uncommitted housekeeping:
  - prompt.py: ruff-format-only diff on the merge-required block
    from the prior commit
  - test_team_wiring.py: ruff --fix removed unused MagicMock
    imports
  - test_gcp_backend.py / test_tasks.py: ruff --fix removed
    f-string-without-placeholder and unused-json import (both
    unrelated drift caught by the gate)

Tests: 1 new (full suite: 327 passed)
  - ``test_test_merged_shortcircuits_on_identical_patches`` — source
    inspection confirms the short-circuit branch + "identical"
    merge-status string exist in test_merged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands.  This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.

Files:

  * ``openhands/tools/task_tracker/coop_definition.py`` — new tool
    definition + executor.  Same ``TaskTrackerAction`` /
    ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
    trip through the shared ``cb:<run_id>:`` Redis namespace that
    ``TaskListClient`` (host side) writes to.  Tasks are auto-owned
    by the calling agent; ``view`` shows peer tasks prefixed with
    ``[<their_agent_id>]``.  Registered under both
    ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
    the module rebinds the latter to the Coop variant.

  * ``openhands/tools/preset/default.py`` — gains a ``team_mode``
    kwarg (kept for API stability + tests; the actual swap happens
    server-side via the .pth/__init__ side-effect import, not by
    changing the host-side tool list).  Pre-PR coop block split into
    a more nuanced team-mode prompt section that documents the
    TaskTracker → shared-list behavior.

  * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
    layers two more bits into the Modal image at build time:
      - ``add_local_file`` of ``coop_definition.py`` to
        ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
        install)
      - ``grep ... || echo`` appending
        ``from . import coop_definition`` to the package's
        ``__init__.py`` so the registration runs at import time.

Tests: 1 new + updated image-layering assertions
  - ``test_importing_coop_definition_overrides_local_registration``:
    inspecting the registry's ``_MODULE_QUALNAMES`` confirms
    ``TaskTrackerTool.name`` resolves to ``coop_definition``'s
    registration after import.
  - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
    calls + 2 ``run_commands`` layers (tool-file install +
    ``coop-task-*`` wrappers) and that the
    ``from . import coop_definition`` line is in the install
    commands.

Full suite: 329 passed.  Ruff / format / mypy all green.

KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis.  The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``.  The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.

For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning.  Three pieces, working together:

1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
   ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
   running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
   accessed via ``unencrypted_host:unencrypted_port``).  Re-uses the
   existing ``connectors/redis_server.ModalRedisServer`` — it was
   already written, just unused.  Both the host TaskListClient and
   the agent sandboxes point at the same public TCP endpoint, so
   pre-seed and agent reads/writes share state.  Falls back to local
   Redis for the other adapters.

2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
   adapter now ``add_local_file``s three pieces into the OpenHands
   image at build time:
     - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
     - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
     - ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
       (replaces upstream; same exports + a side-effect import of
       coop_definition so the Redis-backed executor overrides the
       local TaskTracker registration at first import).
   Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
   bytecode cache so the new __init__ actually re-runs.

3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
   connections after a few minutes, so the Redis client opened for
   the startup pre-seed is closed before the 9-min agent run
   finishes.  Re-open the client at harvest time using the same URL.

End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:

  - Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
  - Both agents Submitted, 9m total
  - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
  - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
    time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
    agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
  - Cost: $3.33

Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.

Full suite: 329 passed.  Ruff / format / mypy all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.

GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file.  Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):

1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
   in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
   The old package was renamed in v0.0.13; both swe_agent files
   had stale imports that failed at module load (TypeError or
   ModuleNotFoundError depending on how the framework was invoked),
   making every swe_agent invocation return Error before any LLM
   call.

2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
   in pyproject.toml.  swe_agent's vendored framework imports these
   at module-load time even when the docker/S3/model paths are
   dormant, so a clean ``pip install '.[swe-agent]'`` without these
   would still ImportError on first invocation.

3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply.  Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring.  swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.

Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body.  For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core
subset in team setting. All four now ≥ 5/10.

- normalize_patch ate trailing blank context lines (text.strip()
  consumes " \n"), breaking last-hunk line counts so git apply
  rejected otherwise-valid diffs. Replaced with lstrip/rstrip on
  "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all —
  raw .strip() on the patch.txt read, so every msa patch ended
  in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
  long-running command, so the image's default CMD exited and
  every exec hit "Sandbox not found". Pass "sleep", "infinity"
  as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
  because shared build_environment was hardcoded to DockerEnvironment.
  Added a backend kwarg and threaded config["backend"] through both
  adapters.
- Team lead prompt buried the integration step at the bottom of a
  long workflow list; Claude/Codex consistently exited after their
  own feature without reading /workspace/shared/<agent>.patch.
  Rewrote with a hard-rule opener and a 5-point pre-submission
  checklist. Member prompt now opens with "stay in your lane" per
  the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch
  alone when the merged tree doesn't pass both features. Surfaced
  as merge.strategy="solo-agent1" / "solo-agent2". Credits the
  agent (typically the lead) who correctly integrated both
  features into one working patch but had it corrupted by
  union-merging with the other agent's partial implementation.
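The first bullet's failure mode can be reproduced in a few lines. This is a minimal sketch, not the project's actual `normalize_patch`: the function name `normalize_patch_sketch` and the sample diff are hypothetical. It only shows why `str.strip()` damages a diff whose last hunk ends in a blank context line (a single space followed by a newline), and how stripping `"\n"` alone keeps that line.

```python
def normalize_patch_sketch(text: str) -> str:
    """Hypothetical stand-in: trim surrounding newlines only, then end with one newline."""
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# A made-up diff whose last hunk ends with a blank context line (" " + newline).
patch = (
    "--- a/f.py\n"
    "+++ b/f.py\n"
    "@@ -1,3 +1,4 @@\n"
    " def f():\n"
    "+    x = 1\n"
    "     return 0\n"
    " \n"
)

naive = patch.strip() + "\n"           # also strips the final " " context line
fixed = normalize_patch_sketch(patch)  # keeps it, so the hunk header counts still match

print(len(naive.splitlines()), len(fixed.splitlines()))
```

With the naive version the hunk body is one line shorter than the `@@ -1,3 +1,4 @@` header promises, which is the kind of mismatch that makes `git apply` reject the patch.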

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- dataset/subsets/core.json: 10-pair subset for quick agent
  comparisons. Stratified by repo (largest-remainder proportional
  allocation by full-dataset pair count) with a one-slot floor per
  primary language (Python / Go / Rust / TS). Reproducible via
  scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
  frameworks on the core subset in team setting. Includes per-task
  pass/fail matrix annotated with the merge strategy used, plus the
  chronological narrative of the dozen reruns that surfaced each of
  the bugs fixed in the previous commit.
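As an illustration of the allocation rule named in the first bullet, here is a minimal sketch of largest-remainder proportional allocation. It is not taken from `scripts/generate_core_subset.py`: the function name, the repo names, and the pair counts are all made up, and the per-language one-slot floor and the seeded sampling are deliberately left out.

```python
from math import floor


def largest_remainder(counts: dict[str, int], slots: int) -> dict[str, int]:
    """Split `slots` across keys in proportion to `counts` (largest-remainder method)."""
    total = sum(counts.values())
    quotas = {k: v * slots / total for k, v in counts.items()}
    alloc = {k: floor(q) for k, q in quotas.items()}
    leftover = slots - sum(alloc.values())
    # Remaining slots go to the largest fractional remainders; ties break by name.
    order = sorted(counts, key=lambda k: (-(quotas[k] - alloc[k]), k))
    for k in order[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical pair counts per repo, allocated into a 10-slot subset.
example = largest_remainder({"repo_a": 47, "repo_b": 35, "repo_c": 18}, 10)
print(example)
```

Here the quotas are 4.7, 3.5 and 1.8, so the floors give 4 + 3 + 1 = 8 slots and the two leftover slots go to the largest remainders (`repo_c`, then `repo_a`).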

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.

Doesn't change any of the current 40 eval results — union's merge=union
attribute is tolerant enough that every task in the dataset produces
some tree (potentially broken code with stitched-together lines); the
broken-tree-tests-fail path already triggered the solo fallback. This
just closes the defensive gap for future pathological cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:

  1. identical patches → skip-merge short-circuit
  2. naive 3-way merge clean → merged-tree tests are authoritative
                               (no further fallback)
  3. naive merge conflicts → test the lead's patch.txt alone against
                             both feature suites
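The three-step chain above can be sketched as a small decision function. This is an illustration under stated assumptions, not the code in `test_merged`: `MergeDecision`, `choose_patch`, the strategy labels, and the convention that the merge callable returns `None` on conflict are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MergeDecision:
    strategy: str       # illustrative labels: "identical", "naive", "lead-alone"
    patch_to_test: str  # the patch whose tree is run against both feature suites


def choose_patch(
    lead_patch: str,
    member_patch: str,
    naive_merge: Callable[[str, str], Optional[str]],
) -> MergeDecision:
    """Pick what to test; `naive_merge` is assumed to return None on conflict."""
    # 1. Identical patches: nothing to merge.
    if lead_patch == member_patch:
        return MergeDecision("identical", lead_patch)
    # 2. A clean naive merge is authoritative, with no further fallback.
    merged = naive_merge(lead_patch, member_patch)
    if merged is not None:
        return MergeDecision("naive", merged)
    # 3. Conflict: only the lead's patch is tested.
    return MergeDecision("lead-alone", lead_patch)


clean = choose_patch("A", "B", lambda a, b: a + b)
conflict = choose_patch("A", "B", lambda a, b: None)
print(clean.strategy, conflict.strategy)
```

Note that there is no branch that tests the member's patch alone, which mirrors the rationale given below.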

Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).

Effect on the core-subset horizontal comparison:
  msa  6 → 6  (unchanged)
  oh   5 → 4  (loses pallets_jinja/1621 — was passing via union, which
              concealed that oh's lead doesn't integrate)
  cc   5 → 5  (unchanged)
  cx   5 → 5  (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codex on Modal: `codex exec` was hanging for the full sandbox
lifetime (~2h) producing zero stream output. Root cause: codex's
exec mode prints "Reading additional input from stdin..." and
blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF
for free; Modal sandbox keeps stdin open. Fix: add `</dev/null`
to the codex invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
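The stdin behaviour is easy to reproduce outside Modal. The sketch below is a Python analogue of the `</dev/null` redirection, not the adapter's `_build_codex_command`: the child process is a stand-in that reads stdin until EOF, and `run_without_stdin` is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in child that, like the behaviour described above, reads stdin until EOF.
CHILD = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]


def run_without_stdin(cmd: list[str], timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run `cmd` with stdin bound to the null device so reads see EOF immediately."""
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # same effect as `</dev/null` in a shell command
        capture_output=True,
        text=True,
        timeout=timeout,
    )


result = run_without_stdin(CHILD)
print(result.returncode, result.stdout.strip())
```

Without the `stdin=subprocess.DEVNULL` line, a child like this inherits whatever stdin the parent has; if that stream is held open and never closed, the read never returns.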

openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the
run's config.json and refuses with a clear warning when the run
was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig
by default; the earlier docs claiming it was docker-only were
wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the
docker path actually start a container:

1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
   so swerex's `docker run ... image sh -c "<startup>"` becomes
   `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as
   the feature-patch path. Pass docker_args=["--entrypoint", ""]
   to clear the entrypoint (mirrors the existing Modal monkey-patch
   that does .entrypoint([]) on the image).

2. swerex's startup falls back to `pipx run swe-rex ...` when the
   swerex-remote binary isn't pre-installed, but pipx looks for an
   executable literally named "swe-rex" — which doesn't exist in
   the published `swe-rex` package (it provides "swerex-remote").
   Monkey-patch DockerDeployment._get_swerex_start_cmd to use
   `pipx run --spec swe-rex swerex-remote ...` instead.

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker:
1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the squash-merge conflicts from #52 landing on main.

All conflicts followed the same pattern: this branch's HEAD contains
#52's content plus the subsequent work on top, while main's
squashed-merge commit contains only #52. Resolved each conflict by
taking ours (HEAD), which preserves the cumulative state of:

- CHANGELOG: full Fixed/Changed/Added entries for team-mode bug
  fixes, eval policy change, core subset + benchmark doc, plus the
  original "team setting" bullet from #52
- _team/prompt.py: the stronger lead-prompt with the 5-point
  integration checklist (#52 had the older "buried integration"
  version)
- swe_agent/adapter.py: team-mode kwarg propagation + Docker
  backend dispatch + pipx --spec monkey-patch
- runner/team.py: openhands_sdk Modal-Redis tunnel branch
- everywhere else: my newer adapter changes are strict supersets
  of #52's

CI green locally: 329 tests passed, ruff clean, mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.

Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL,
  ordered agent list, scratchpad volume name) with the feature config
  and exposes adapter-facing factories that each return None / [] / {}
  when their feature is disabled, so adapter code paths collapse to one
  branch:

    coop_env.update(session.env_for(agent_id))
    extra_run_args.extend(session.scratchpad_mount_args())
    mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
  (task_list, scratchpad, mcp, auto_refresh, protocol).  The lead/member
  role split is the always-on baseline -- without it team is just coop.
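To make the "one branch" claim concrete, here is a minimal sketch of the config shape and of one factory that returns an empty value when its feature is off. It is not the package's code: the class name `TeamHarnessConfigSketch`, the all-on defaults, the classmethod form of the helpers, and the docker `-v` mount arguments are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TeamHarnessConfigSketch:
    # Five per-feature booleans; defaulting to True is an assumption.
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True

    @classmethod
    def disabled(cls) -> TeamHarnessConfigSketch:
        """Every feature off."""
        return cls(**{f.name: False for f in fields(cls)})

    @classmethod
    def with_only(cls, *names: str) -> TeamHarnessConfigSketch:
        """Only the named features on; unknown names are rejected."""
        known = {f.name for f in fields(cls)}
        unknown = set(names) - known
        if unknown:
            raise ValueError(f"unknown features: {sorted(unknown)}")
        return cls(**{name: name in names for name in known})


def scratchpad_mount_args(config: TeamHarnessConfigSketch, volume: str) -> list[str]:
    """Hypothetical factory: empty list when the scratchpad is disabled."""
    if not config.scratchpad:
        return []
    return ["-v", f"{volume}:/workspace/shared"]


extra_run_args: list[str] = ["--rm"]
extra_run_args.extend(scratchpad_mount_args(TeamHarnessConfigSketch.disabled(), "cb-team-demo"))
print(extra_run_args)
```

Because the disabled case yields `[]`, the caller's `extend` is a no-op and needs no `if` around it.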

Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter.  result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers.  Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.

Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating.  Full suite
363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
  volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
  tasks files, empty metrics dict, no volume.  team_features in
  result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ProKil
Member Author

ProKil commented May 19, 2026

End-to-end eval results (was missing from the original PR body — ran cooperbench eval after pushing):

| run | merge strategy | both pass | pass rate |
|---|---|---|---|
| smoke-team-default | solo-agent1 (lead integrated both features) | ✓ | 100% |
| smoke-team-ablated (--team-no-task-list --team-no-scratchpad --team-no-mcp) | naive (diff-cat) | ✗ | 0% |

Real pass-rate delta in the expected direction — with the shared task list / scratchpad / MCP off, the lead had nowhere to fetch the member's patch from, naive merge produced a non-composing diff, and eval failed. With them on, the lead integrated and the team passed.

n=1 is far too small for any conclusion about which of the three features is doing the work, but the plumbing and the ablation signal both work end-to-end. Next-PR target: the same matrix across the core 10-pair subset with one-flag-off ablations.

@ProKil
Member Author

ProKil commented May 19, 2026

Ablation matrix (core 10-pair × marginal-effect design)

Used the marginal-effect design (1 baseline + 5 one-feature-off) instead of the full 2⁵=32 interaction matrix — answers the "what's each feature's contribution" question at ~1/5 the cost. Codex CLI on gpt-5.5, --setting team --backend docker.

Headline numbers

| config | flag off | pass / 10 | Δ vs baseline |
|---|---|---|---|
| ablate-11111 (baseline, all on) | | 6 | |
| ablate-11011 | mcp | 5 | −1 |
| ablate-11101 | auto_refresh | 5 | −1 |
| ablate-11110 | protocol | 5 | −1 |
| ablate-01111 | task_list | 4 | −2 |
| ablate-10111 | scratchpad | 3 | −3 |
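The config names encode which features are on. The sketch below generates the marginal-effect set (one baseline plus five one-feature-off configs) and their labels. The bit order (task_list, scratchpad, mcp, auto_refresh, protocol) is inferred from the rows above rather than stated in the source, and the helper names are hypothetical.

```python
FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def label(enabled: dict[str, bool]) -> str:
    """Encode a config as ablate-<bits>, one bit per feature in FEATURES order."""
    return "ablate-" + "".join("1" if enabled[f] else "0" for f in FEATURES)


def marginal_configs() -> list[dict[str, bool]]:
    """Baseline with everything on, then one config per feature switched off."""
    baseline = {f: True for f in FEATURES}
    one_off = [{**baseline, f: False} for f in FEATURES]
    return [baseline, *one_off]


labels = [label(c) for c in marginal_configs()]
print(len(labels), labels)
```

This yields 6 runs per task instead of the 32 a full factorial design would need, at the cost of not measuring interactions between features.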

Per-task breakdown — the signal is concentrated

task                                  base no_tl no_sp no_mcp no_ar no_pr
dottxt_ai_outlines/1655 [1,3]          ✓    ✓    ✓    ✓    ✓    ✓
dspy/8563 [1,4]                        ✗    ✗    ✗    ✗    ✗    ✗
go_chi/27 [3,4]                        ✗    ✗    ✗    ✗    ✗    ✗
llama_index/17244 [5,6]                ✓    ✓    ✓    ✓    ✓    ✓
openai_tiktoken/0 [4,8]                ✓    ✓    ✗    ✓    ✓    ✓
pallets_click/2800 [1,4]               ✗    ✗    ✗    ✗    ✗    ✗
pallets_jinja/1559 [5,8]               ✓    ✗    ✗    ✗    ✗    ✓
pallets_jinja/1621 [6,10]              ✓    ✗    ✗    ✓    ✓    ✗
react_hook_form/153 [2,6]              ✗    ✗    ✗    ✗    ✗    ✗
typst/6554 [2,6]                       ✓    ✓    ✓    ✓    ✓    ✓
  • 3 tasks pass in every config — easy enough that any coordination level works.
  • 4 tasks fail in every config — too hard for codex regardless of coordination.
  • Only 3 tasks are sensitive to the ablation (openai_tiktoken/0, pallets_jinja/1559, pallets_jinja/1621). Effective sample size = 3.
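The three counts above, and the per-config totals, can be re-derived from the per-task matrix. The snippet below is a small cross-check with the results transcribed by hand from that matrix (1 = ✓, 0 = ✗); the variable names are illustrative.

```python
CONFIGS = ("base", "no_tl", "no_sp", "no_mcp", "no_ar", "no_pr")

# One character per config, in CONFIGS order.
RESULTS = {
    "dottxt_ai_outlines/1655 [1,3]": "111111",
    "dspy/8563 [1,4]": "000000",
    "go_chi/27 [3,4]": "000000",
    "llama_index/17244 [5,6]": "111111",
    "openai_tiktoken/0 [4,8]": "110111",
    "pallets_click/2800 [1,4]": "000000",
    "pallets_jinja/1559 [5,8]": "100001",
    "pallets_jinja/1621 [6,10]": "100110",
    "react_hook_form/153 [2,6]": "000000",
    "typst/6554 [2,6]": "111111",
}

always_pass = [t for t, row in RESULTS.items() if set(row) == {"1"}]
always_fail = [t for t, row in RESULTS.items() if set(row) == {"0"}]
sensitive = [t for t, row in RESULTS.items() if len(set(row)) == 2]

# Pass count per config (column sums of the matrix).
totals = {c: sum(row[i] == "1" for row in RESULTS.values()) for i, c in enumerate(CONFIGS)}

print(len(always_pass), len(always_fail), len(sensitive))
print(totals)
```

The column sums agree with the headline numbers: 6 for the baseline, 4 without the task list, 3 without the scratchpad, and 5 for each of the other three.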

What this says about each feature

| feature | effect | confidence | reading |
|---|---|---|---|
| scratchpad | −3/10 (−3/3 of sensitive tasks) | strongest | /workspace/shared/ is where members drop their patches and the lead picks them up. Without it, no integration possible: the lead can't read what the member did. |
| task_list | −2/10 (−2/3 of sensitive tasks) | moderate | Without coordination state, the lead doesn't know what the member's working on. Still passes when the agents independently solve their feature without needing alignment (openai_tiktoken). |
| mcp | −1/10 (within noise) | low | wait_for_message long-poll. Codex already idles cheaply between commands; the marginal effect is small. |
| protocol | −1/10 (within noise) | low | Typed coop-request / coop-respond verbs. The agents in this task set don't actually use them much; message-passing via coop-send covers most needs. |
| auto_refresh | −1/10 (within noise) | n/a | auto_refresh only fires in Python-loop adapters (mini_swe_agent_v2 etc). Codex is a CLI adapter, so this flag is effectively a no-op for this sweep; the −1 is sample noise, not a real effect. |

Caveats

  • n=3 effective is too small to distinguish mcp/protocol from noise. The scratchpad signal (3/3 of sensitive tasks) is the only one I'd call robust.
  • Codex CLI's --json stream doesn't emit cost or a model field, so result.json:total_cost shows $0 throughout. Real spend was ~$200-300 for this sweep based on token-count × public gpt-5.5 estimates (~46-50M input mostly cached + ~400K-1M output across the 6 configs).
  • For the auto_refresh measurement to mean anything, rerun on mini_swe_agent_v2 (where it actually fires).
  • For tighter CIs on mcp/protocol, expand the core subset or re-stratify to drop the always-pass / always-fail tasks.

Raw data

  • ablation_matrix.csv (in repo working dir, gitignored)
  • Per-run logs under logs/ablate-*/team/...

🤖 Generated with Claude Code

Base automatically changed from team-all-adapters to main May 21, 2026 16:52
An error occurred while trying to automatically change base from team-all-adapters to main May 21, 2026 16:52
# Conflicts:
#	CHANGELOG.md
#	src/cooperbench/agents/_team/__init__.py
#	src/cooperbench/agents/_team/prompt.py
#	src/cooperbench/agents/mini_swe_agent_v2/adapter.py
#	src/cooperbench/agents/openhands_agent_sdk/adapter.py
#	src/cooperbench/agents/openhands_agent_sdk/openhands-tools/openhands/tools/task_tracker/coop_definition.py
#	src/cooperbench/agents/swe_agent/adapter.py
#	src/cooperbench/runner/team.py
#	tests/agents/test_team_wiring.py
#	tests/team_harness/test_prompt.py
@ProKil ProKil merged commit bea5e58 into main May 21, 2026
3 checks passed
@ProKil ProKil deleted the team-harness-module branch May 21, 2026 17:37