team_harness: extract team mode as standalone harness + ablation flags by ProKil · Pull Request #58 · cooperbench/CooperBench

ProKil · 2026-05-19T01:02:54Z

Summary

Stacks on #55. Lifts the team-mode coordination primitives out of cooperbench/agents/_team (private, benchmark-internal) into cooperbench/team_harness (public, library-shaped) so the algorithm can be studied and consumed by other benchmarks — the long-horizon target discussed in #52's followups. Adds five per-feature ablation flags so we can measure each coordination mechanism's contribution independently.

What's new

`cooperbench.team_harness` — public package

A documented sibling of cooperbench/agents, not nested under it. Same modules as the old _team/ (task_list, protocol, mcp_server, prompt, loop_refresh, fs_mirror, metrics, runtime, coop_task, install_snippet.sh) plus a facade in __init__.py:

Public symbol	What it is
`TeamHarnessConfig`	Frozen dataclass of five booleans (`task_list`, `scratchpad`, `mcp`, `auto_refresh`, `protocol`). `with_only("task_list", "mcp")` and `disabled()` helpers.
`TeamSession`	Per-run object bundling `run_id` / `redis_url` / `agents` / `team_volume` / `config`. Adapter-facing factories: `env_for`, `scratchpad_mount_args`, `mcp_config`, `prompt_for`, `prompt_section`, `loop_poller`, `task_list_client`, `harvest_metrics`. Each factory returns `None` / `[]` / `{}` when its feature is disabled so adapters write one code path.
Path constants	`COOP_TASK_SCRIPT_PATH`, `INSTALL_SNIPPET_PATH`, `MCP_SERVER_SCRIPT_PATH`, `MCP_SERVER_NAME` — adapters import these instead of computing `Path(__file__).parent.parent / \"_team\"`.

TeamSession.redis_url is the host-side URL; env_for() rewrites localhost / 127.0.0.1 → host.docker.internal so adapters don't have to plumb that themselves. The rewrite is duplicated from _coop.runtime rather than imported, because the harness is meant to be portable to other benchmarks that don't ship _coop.

Ablation flags

Five --team-no-* flags on cooperbench run, each gating one coordination mechanism:

--team-no-task-list      shared Redis task list + pre-seeding + metrics
--team-no-scratchpad     /workspace/shared Docker volume
--team-no-mcp            wait_for_message MCP registration
--team-no-auto-refresh   in-loop task-list summary injection (Python-loop adapters)
--team-no-protocol       coop-request / coop-respond / coop-pending verbs

The lead/member role split stays on either way — without it team mode collapses to coop, so it's the always-on baseline.

result.json now records which features were enabled:

\"team_features\": {
  \"task_list\": false,
  \"scratchpad\": false,
  \"mcp\": false,
  \"auto_refresh\": true,
  \"protocol\": true
}

so post-hoc analysis can attribute pass-rate deltas to the specific feature that was off, without cross-referencing CLI invocations.

Adapter refactor

claude_code, codex, mini_swe_agent_v2, swe_agent, openhands_agent_sdk all accept a new team_features: TeamHarnessConfig | None kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt assembly, env vars, scratchpad mount, MCP install, in-loop poller, CLI install) gate on session.config.<feature>. For example, the MCP install in claude_code is now:

mcp_config = team_session.mcp_config(container_script_path=CONTAINER_TEAM_MCP_PATH) if team_session else None
if mcp_config is not None:
    write_file_in_container(env, CONTAINER_TEAM_MCP_PATH, TEAM_MCP_SCRIPT_PATH.read_text())
    write_file_in_container(env, f\"{CONTAINER_CLAUDE_CONFIG_DIR}/.claude.json\", json.dumps(mcp_config, indent=2))

— gate on the session's config, no is_team flag spread around.

Tests

tests/agents/_team → tests/team_harness (rename, 83 existing tests still pass). Plus:

tests/team_harness/test_session.py (29 new) — covers TeamHarnessConfig defaults / with_only / disabled, TeamSession.lead/role_for/is_active, env_for (default, localhost rewrite, non-localhost passthrough, empty when all consumers off, full when only mcp remains), scratchpad_mount_args (default, off, empty volume), mcp_config (default, off), task_list_client (on, off), loop_poller (on, off), harvest_metrics (active, off, None client), prompt_for (lead, member), prompt_section (default, single-agent collapse).
tests/runner/test_team.py (4 new) — team_features recorded in result.json by default; same config instance propagates to every adapter call; --team-no-task-list skips Redis pre-seed and produces empty metrics dict while keeping the role split; --team-no-scratchpad clears config[\"team_volume\"] to the empty string.

Full suite: 363 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end on `dottxt_ai_outlines/1371 [1,2]` with `codex --setting team --backend docker`

Run	flags	task_log.json	tasks.json	`result.metrics`	`cb-team-*` volume	`team_features`
`smoke-team-default`	(defaults)	✓	✓	`{tasks_total: 2, tasks_done: 2, time_to_first_claim_seconds: 67.3, claims_per_agent: {agent2: 1}}`	created (`cb-team-96067cd3`)	all `true`
`smoke-team-ablated`	`--team-no-task-list --team-no-scratchpad --team-no-mcp`	✗	✗	`{}`	not created	`task_list/scratchpad/mcp: false, auto_refresh/protocol: true`

Both runs Submitted 2/2 agents.

Out of scope (next PRs)

Long-horizon benchmark consumer that drives task creation dynamically rather than pre-seeding one task per feature — the actual second-consumer this PR is in service of. Without it the TeamSession API is validated only by CooperBench itself.
Automatic search over TeamHarnessConfig × prompt variants × refresh cadence, with eval pass-rate as the objective and the coordination metrics (time_to_first_claim_seconds, unowned_at_end, claims_per_agent) as cheap first-pass proxies.

Test plan

ruff check, ruff format --check, mypy, pytest tests/ (all green locally)
Default smoke: uv run cooperbench run -a codex -r dottxt_ai_outlines_task -t 1371 -f 1,2 --setting team --backend docker --no-auto-eval -n smoke-team-default
Ablation smoke: same + --team-no-task-list --team-no-scratchpad --team-no-mcp
Confirm result.json:team_features matches the flags on both runs

🤖 Generated with Claude Code

Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the bits that are agent-agnostic (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so the two adapters (and any future CLI adapter) consume them rather than duplicating. Codex adapter highlights: - Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``. - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts. - Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events. - Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default. - Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail. Shared ``_coop`` module: - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin. - ``install_snippet.sh`` — pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it. - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent- agnostic. - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``. Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't. Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped. End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2: - codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly) - codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass — coordination failure (agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file). Claude on the same task scored 2/2; Codex used the tools less aggressively on this run. The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. Future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a third setting alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox to chat over, team mode adds three load-bearing primitives: 1. A typed **shared task list** (cooperbench.agents._team.TaskListClient) backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with atomic claim semantics (HSETNX-style — exactly one caller wins on a race) and an audit log of every mutation. Exposed in the container as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update`` / ``coop-task-list`` shell wrappers. 2. A **lead / member role split**. The first agent is designated ``team-lead`` and gets a system-prompt block instructing them to break the spec into tasks, assign them via ``coop-task-create --assign``, watch progress, and integrate. Other agents are ``member`` and look for open tasks to claim. 3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``) mounted at ``/workspace/shared`` in every container. Free coordination artifact for design notes, partial diffs, interface sketches. Coordination metrics are computed from the task-list audit log after the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved into ``result.json``. Evaluation is identical to coop — per-agent ``patch.txt`` evaluated per-feature — so no eval changes were needed beyond discovering ``team/`` log directories. Compatibility: all five existing adapters accept the new ``team_role`` / ``team_id`` / ``task_list_url`` kwargs. The CLI adapters (``claude_code``, ``codex``) wire the team install snippet into their ``setup.sh`` so the ``coop-task-*`` wrappers land at ``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``, ``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking; their in-loop integration with the task list (auto-refresh between steps, similar to the existing inbox poll) lands in a follow-up. Unit tests: 46 new - 18 task_list (CRUD, atomic claim, owner-only update, audit log, run isolation) - 12 prompt (lead vs member branches, solo fallback, git interaction) - 3 runtime (env assembly, scratchpad mount args) - 4 metrics (happy path, unowned-at-end, empty log, multiple claims) - 5 runner (lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team) - 4 misc Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in team+git mode: - 5 tasks created (2 by bench-runner, 3 by the lead splitting its work), all reached ``done`` - time_to_first_claim_seconds=34.2 - claims_per_agent={agent1: 2, agent2: 1} - updates_per_agent={agent1: 4, agent2: 3} - scratchpad volume actively used (agent2 wrote its diff to /workspace/shared/agent2.patch + a summary.md) - **0/1 pass rate** — both ``patch.txt`` files were empty: the members wrote diffs to the scratchpad instead of also writing ``/workspace/repo/patch.txt``, and the lead never ran the final integration step. This is real coordination signal (the prompt told them to write both places but they followed the scratchpad half only) — a follow-up will tighten the prompt to make patch.txt submission the explicit final step. Future PRs (intentionally out of scope here so this lands at a reviewable size): - In-loop auto-refresh for the Python-loop adapters - MCP long-poll tool to give CLI adapters push-ish inbox semantics - Typed ``coop-request`` / ``coop-respond`` protocol on top of messaging (CC's plan_approval_request shape) - Filesystem mirror of the task list (CC-style ``ls`` artefacts) Stacks on #51 (Codex adapter) so the diff stays focused on team-mode additions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…resh (#53) Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run. 1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state. 2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events. 3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty — the closest we can get to push-style delivery for opaque CLI agent loops. 4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries — same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant). 5. **Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command. Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table — previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict. Tests: 37 new unit tests - 8 fs_mirror (atomic writes, stale cleanup, empty index) - 9 protocol (request roundtrip, await, timeout, audit log) - 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory) - 8 loop_refresh (summary formatting, TeamPoller, env variant) - 3 prompt (regression: lead+member prompts demand patch.txt) Full suite: **311 passed**, 63 skipped. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task — the prompt fix is doing real work. Stacks on #52. Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to parity with the CLI adapters for team mode. Before this commit they accepted the team kwargs but discarded them; now each one appends the team prompt section to the task it sends the agent, and (where the adapter actually controls the container) propagates ``CB_TEAM_*`` env vars + mounts the team scratchpad. New helper: ``_team.team_task_section(agents, agent_id, team_role)`` returns ONLY the lead-or-member block + coop-task-* CLI usage, without the surrounding task/submission/git scaffolding that ``build_team_instruction`` adds. Python-loop adapters already have their own prompts covering messaging/git/submission, so they need only the new piece; CLI adapters keep using the bigger function. Per-adapter wiring: - ``mini_swe_agent_v2``: appends team_task_section to task; propagates CB_TEAM_* through env_kwargs["env"]; adds ``--add-host=host.docker.internal:host-gateway`` + scratchpad volume to docker run args; installs the team CLI scripts + pip redis in the container after env spin-up. The existing ``TeamPoller`` host-side hook (already in step()) still fires. - ``openhands_sdk``: appends team_task_section to task; folds a new ``team_env`` dict into ``coop_info`` so ``_build_credentials_dict`` propagates CB_TEAM_* into the sandbox. Coop-task-* binary install in the OpenHands agent-server image is a follow-up — OpenHands manages its own image build and doesn't expose a clean post-start exec hook. - ``swe_agent``: appends team_task_section to task. The SWE-agent framework's sandbox + agent loop is third-party and harder to instrument; everything beyond the prompt is a follow-up. Tests: 13 new - 3 prompt unit tests for team_task_section (lead, member, empty) - 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py: consistency between team_task_section and build_team_instruction, every registered runner accepts the team kwargs, openhands env keys, swe_agent signature Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code + team + git (sanity check that the shared changes didn't regress the CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines. End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent, openhands_sdk) requires API keys (Anthropic for the three Python-loop adapters via litellm, OpenAI for codex) that aren't available in this environment. Unit tests cover the new wiring; the e2e validations should be run with real keys before relying on the per-adapter behavior. Compatibility matrix is now: | Adapter | Accepts | Team prompt | Auto-refresh | CLI in container | env vars | |---------------------|---------|-------------|--------------|------------------|----------| | claude_code | yes | yes (full) | n/a | yes | yes | | codex | yes | yes (full) | n/a | yes | yes | | mini_swe_agent_v2 | yes | yes (sec.) | yes | yes | yes | | openhands_sdk | yes | yes (sec.) | n/a | NOT YET | yes | | swe_agent | yes | yes (sec.) | NOT YET | NOT YET | NOT YET | Stacks on #52 (merged-up team-mode branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the documented gap from the prior commit's matrix: the ``coop-task-*`` binaries now ship into the OpenHands agent-server sandbox, layered onto the upstream ``-oh`` image via Modal's ``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no upstream image rebuild required). Triggered only when ``coop_info["team_env"]`` is set so solo / coop runs don't pay the ~10s first-build cost. Modal caches the layered image; subsequent team runs are instant. Verified end-to-end: ran openhands_sdk team+git on dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran ``compgen -c | grep coop-task`` and got back all 7 wrappers (create / claim / update / list / request / respond / pending) — the install worked. Whether the model actually invokes the tools is a separate (coordination-quality) axis; in this run it discovered them but didn't use them, same as codex. Both patches applied; f1 14/14, f2 19/20. Tests: 2 new (full suite: 326 passed) - test_team_env_triggers_image_layering — verifies add_local_file + pip_install + run_commands fire with the right args when team mode is active - test_no_layering_when_team_inactive — verifies solo / coop runs skip the image-build cost Matrix update — openhands_sdk now reads: Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a / CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The codex team e2e (cx_team_v3) hit 0/2 with great coordination metrics — 5/5 tasks done, 27s first claim, claims even — but neither agent ran ``git merge`` despite the prompt's "Recommended workflow" mentioning it. Both fetched their peer's branch (2 each) and then submitted only their own work, so the eval's naive diff-stacker produced syntactically broken Python. The previous prompt buried the critical step in a "Concretely:" sentence at the end; gpt-5.5 didn't follow it. This rewrite: - Renames the section ``## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING`` so the imperative is in the heading itself. - Adds an explicit "Required final sequence — run this verbatim before exiting" block with the full fetch+merge+diff sequence, parameterized over every partner branch. - Explains *why* (each agent's patch.txt is evaluated against every feature's tests; without the merge, the peer feature's symbols are missing → ImportError). - Frames it the same way the patch.txt step is framed (REQUIRED, skip-at-your-loss), which the original prompt fix proved codex responds to. Verified: re-ran cx_team_v4 (codex team+git, same task as v3). Git activity went from ``fetch=2 merge=0 push=0`` per agent → ``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both patches now contain both features' symbols. Pass rate v4: 33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test because gpt-5.5's merged code put the ``filters`` kwarg on a helper function rather than the ``prompt`` decorator (content quality, not coordination). A second run (cx_team_v5) produced byte-identical 243-line patches on both agents — codex coordinated so well both ended up with the exact same merged tree. This surfaces a separate bench-side limitation: the eval's diff-stacker fails to apply patch B on top of patch A when every hunk already matches, producing an empty merged.patch. That's a real bug in ``eval/evaluate.py``'s coop merge step, NOT a coordination failure — codex did exactly what the prompt asked. Fix is a separate concern from team-mode wiring. Tests still pass (existing prompt tests are content-agnostic; 326 / 63 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

In team mode codex can coordinate so well that both agents end up with byte-identical patches (each fully merged the other's branch). The existing eval combiner sequence — apply patch1 → apply patch2 on top — chokes because every hunk in patch2 is already applied, producing an empty merged.patch and a downstream "No valid patches in input" failure even though both submissions are individually fine. Fix in ``test_merged``: before invoking ``_setup_branches`` / ``_merge_naive``, ``cmp`` the two patches. If they match, copy patch1 to merged.patch (normalized via ``git apply --recount`` so agents that emit unified diffs with miscounted hunk headers still work) and skip the merge dance. Returns a fresh result with ``merge.status: "identical"`` so the caller can tell the short-circuit fired vs a real merge. Verified on the codex-team e2e: - cx_team_v5 (codex agents perfectly merged to identical 243-line patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20) - cx_team_v4 (codex agents diverged on the merge): unchanged at f2 20/20 + f1 13/14 = 33/34 tests, still falls back to agent2-alone via apply_status: {'agent1': 'failed', ...} I also briefly tried adding ``git apply --recount`` to ``_setup_branches``'s fallback chain, but that REGRESSED v4: it made agent1's malformed patch apply where it previously failed silently, triggering a real merge attempt that produced duplicate function definitions (broken Python) via union merge. The identical-patches short-circuit is the strictly-better fix — no regression, recovers the v5 case, and the malformed-hunk normalization only kicks in on the short-circuit path where it can't cause merge conflicts. Also lands previously-uncommitted housekeeping: - prompt.py: ruff-format-only diff on the merge-required block from the prior commit - test_team_wiring.py: ruff --fix removed unused MagicMock imports - test_gcp_backend.py / test_tasks.py: ruff --fix removed f-string-without-placeholder and unused-json import (both unrelated drift caught by the gate) Tests: 1 new (full suite: 327 passed) - ``test_test_merged_shortcircuits_on_identical_patches`` — source inspection confirms the short-circuit branch + "identical" merge-status string exist in test_merged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous openhands team runs (oh_team_v3) showed agents discovering the ``coop-task-*`` shell wrappers via ``compgen`` but never invoking them — gpt-5.5 strongly prefers typed tools registered with the LLM over arbitrary shell commands. This commit lands the architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered under the same name as openhands' built-in ``TaskTrackerTool`` so the registry resolution swaps it transparently. Files: * ``openhands/tools/task_tracker/coop_definition.py`` — new tool definition + executor. Same ``TaskTrackerAction`` / ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round- trip through the shared ``cb:<run_id>:`` Redis namespace that ``TaskListClient`` (host side) writes to. Tasks are auto-owned by the calling agent; ``view`` shows peer tasks prefixed with ``[<their_agent_id>]``. Registered under both ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing the module rebinds the latter to the Coop variant. * ``openhands/tools/preset/default.py`` — gains a ``team_mode`` kwarg (kept for API stability + tests; the actual swap happens server-side via the .pth/__init__ side-effect import, not by changing the host-side tool list). Pre-PR coop block split into a more nuanced team-mode prompt section that documents the TaskTracker → shared-list behavior. * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` — layers two more bits into the Modal image at build time: - ``add_local_file`` of ``coop_definition.py`` to ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands install) - ``grep ... || echo`` appending ``from . import coop_definition`` to the package's ``__init__.py`` so the registration runs at import time. Tests: 1 new + updated image-layering assertions - ``test_importing_coop_definition_overrides_local_registration``: inspecting the registry's ``_MODULE_QUALNAMES`` confirms ``TaskTrackerTool.name`` resolves to ``coop_definition``'s registration after import. - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file`` calls + 2 ``run_commands`` layers (tool-file install + ``coop-task-*`` wrappers) and that the ``from . import coop_definition`` line is in the install commands. Full suite: 329 passed. Ruff / format / mypy all green. KNOWN LIMITATION (documented in coop_definition.py docstring): the openhands_sdk agent-server runs in a Modal sandbox that's network-isolated from the host Redis. The CoopTaskTracker is correctly registered and the LLM can call it, but every operation returns "Shared task list unavailable" because the sandbox can't ``socket.getaddrinfo("host.docker.internal")``. The fix is in the deployment layer (Modal tunnels, a Modal-hosted Redis, or running openhands directly via docker like the other adapters), not in this PR — verified by oh_team_v10: agent ran ``coop-task-list`` first ("The coop CLI failed; I'll use the shared task tracker."), then fell back to TaskTrackerAction which still hit the local executor because the override + Redis combo can't actually work in Modal. For non-Modal openhands deployments (e.g. local docker-backed openhands runs, future remote-conversation transports that share the host network), this tool works as designed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker swap from actually functioning. Three pieces, working together: 1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``, accessed via ``unencrypted_host:unencrypted_port``). Re-uses the existing ``connectors/redis_server.ModalRedisServer`` — it was already written, just unused. Both the host TaskListClient and the agent sandboxes point at the same public TCP endpoint, so pre-seed and agent reads/writes share state. Falls back to local Redis for the other adapters. 2. **CoopTaskTrackerTool injection into the Modal sandbox.** The adapter now ``add_local_file``s three pieces into the OpenHands image at build time: - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py`` - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py`` - ``_team_init_override.py`` → ``$OH_DIR/__init__.py`` (replaces upstream; same exports + a side-effect import of coop_definition so the Redis-backed executor overrides the local TaskTracker registration at first import). Plus a ``find -name '*.pyc' -delete`` to invalidate Python's bytecode cache so the new __init__ actually re-runs. 3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle connections after a few minutes, so the original Redis client pre-seed used at startup gets closed before the 9-min agent run finishes. Re-open the client at harvest time using the same URL. End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with ``-a openhands_sdk --setting team --git``: - Modal Redis startup: ``redis ready redis://r450.modal.host:41899`` - Both agents Submitted, 9m total - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓) - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0, time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2, agent1:1}, updates_per_agent: {agent2:4, agent1:5}`` - Cost: $3.33 Tests: image-layering assertions expanded — ``add_local_file`` now called 3 times (CLI helper, tool def, __init__ override), and the run_commands chain copies both files + wipes .pyc caches. Full suite: 329 passed. Ruff / format / mypy all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three changes that together unblock swe_agent team-mode runs (and solo/coop runs too — the bug wasn't team-specific): 1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2`` in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``. The old package was renamed in v0.0.13; both swe_agent files had stale imports that no-op'd at module load (TypeError or ModuleNotFoundError depending on how the framework was invoked), making every swe_agent invocation return Error before any LLM call. 2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras in pyproject.toml. swe_agent's vendored framework imports these at module-load time even when the docker/S3/model paths are dormant, so a clean ``pip install '.[swe-agent]'`` without these would still ImportError on first invocation. 3. uv.lock refreshed with the new transitive deps. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with ``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5): both agents Submitted, patches 373 + 88 lines, both applied via git apply. Eval failed 0/2 due to a content-quality issue (``NameError: name 'Set' is not defined`` — agent used Set without importing it; both agents hit exit_cost budget limit mid-implementation), but that's model variance, not adapter wiring. swe_agent is unblocked: it runs end-to-end, produces patches, the eval pipeline processes them. Coordination metrics still empty (claims_per_agent: {}) because swe_agent doesn't yet have the in-container coop-task-* CLI install or in-loop task auto-refresh — those are tracked as follow-ups in the PR body. For now the swe_agent team-mode run just gets the team prompt section + env vars; full team-tool integration is a separate PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five compounding bugs prevented `claude_code`, `codex`, and `mini_swe_agent_v2` from reaching honest pass-rates on the core subset in team setting. All four now ≥ 5/10. - normalize_patch ate trailing blank context lines (text.strip() consumes " \n"), breaking last-hunk line counts so git apply rejected otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only. - mini_swe_agent_v2 adapter wasn't normalizing patches at all — raw .strip() on the patch.txt read, so every msa patch ended in a non-newline byte. Now routes through normalize_patch. - mini_swe_agent_v2 ModalEnvironment created the sandbox with no long-running command, so the image's default CMD exited and every exec hit "Sandbox not found". Pass "sleep", "infinity" as the positional command (matches eval backend's existing fix). - claude_code and codex adapters silently ignored --backend modal because shared build_environment was hardcoded to DockerEnvironment. Added a backend kwarg and threaded config["backend"] through both adapters. - Team lead prompt buried the integration step at the bottom of a long workflow list; Claude/Codex consistently exited after their own feature without reading /workspace/shared/<agent>.patch. Rewrote with a hard-rule opener and a 5-point pre-submission checklist. Member prompt now opens with "stay in your lane" per the lead's PLAN.md. - eval test_merged now falls back to testing each agent's patch alone when the merged tree doesn't pass both features. Surfaced as merge.strategy="solo-agent1" / "solo-agent2". Credits the agent (typically the lead) who correctly integrated both features into one working patch but had it corrupted by union-merging with the other agent's partial implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- dataset/subsets/core.json: 10-pair subset for quick agent comparisons. Stratified by repo (largest-remainder proportional allocation by full-dataset pair count) with a one-slot floor per primary language (Python / Go / Rust / TS). Reproducible via scripts/generate_core_subset.py (seed=42). - docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent frameworks on the core subset in team setting. Includes per-task pass/fail matrix annotated with the merge strategy used, plus the chronological narrative of the dozen reruns that surfaced each of the bugs fixed in the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously test_merged returned early with an error when both naive and union merge strategies hit conflicts, so the solo-agent fallback never got a chance to credit a team whose lead alone integrated both features. Now we write an empty merged.patch, let run_tests fail naturally on the merged tree, and fall through to the solo fallback. Doesn't change any of the current 40 eval results — union's merge=union attribute is tolerant enough that every task in the dataset produces some tree (potentially broken code with stitched-together lines); the broken-tree-tests-fail path already triggered the solo fallback. This just closes the defensive gap for future pathological cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the union-merge strategy and the member-only fallback from test_merged. The new chain is: 1. identical patches → skip-merge short-circuit 2. naive 3-way merge clean → merged-tree tests are authoritative (no further fallback) 3. naive merge conflicts → test the lead's patch.txt alone against both feature suites Rationale: union merge concatenates conflicting hunks, which usually produces syntactically broken code; the cases where it accidentally produced a working tree were rewarding lucky non-overlap, not genuine coordination. The member-only fallback was symmetric to lead-only but incoherent under team-mode semantics (the lead is the designated integrator; if they didn't integrate, the team failed regardless of what the member's branch looks like). Effect on the core-subset horizontal comparison: msa 6 → 6 (unchanged) oh 5 → 4 (loses pallets_jinja/1621 — was passing via union, which concealed that oh's lead doesn't integrate) cc 5 → 5 (unchanged) cx 5 → 5 (unchanged) oh sliding below 5/10 is the correct outcome: the previous union-pass on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit their patch.txt into the working tree, which forces a merge conflict on patch.txt that union resolved while the actual source merge was non-conflicting). Under the stricter policy this gets routed through lead-alone, which oh's lead does not pass. BENCHMARK_RESULTS.md updated to reflect the new totals + per-task matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry revised; full test suite still green (329 passed, 63 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s. openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal. Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default. Two upstream-swerex issues had to be worked around to make the docker path actually start a container: 1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image). 2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex" — which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead. Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves the squash-merge conflicts from #52 landing on main. All conflicts followed the same pattern: this branch's HEAD contains #52's content plus the subsequent work on top, while main's squashed-merge commit contains only #52. Resolved each conflict by taking ours (HEAD), which preserves the cumulative state of: - CHANGELOG: full Fixed/Changed/Added entries for team-mode bug fixes, eval policy change, core subset + benchmark doc, plus the original "team setting" bullet from #52 - _team/prompt.py: the stronger lead-prompt with the 5-point integration checklist (#52 had the older "buried integration" version) - swe_agent/adapter.py: team-mode kwarg propagation + Docker backend dispatch + pipx --spec monkey-patch - runner/team.py: openhands_sdk Modal-Redis tunnel branch - everywhere else: my newer adapter changes are strict supersets of #52's CI green locally: 329 tests passed, ruff clean, mypy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move team-mode primitives from cooperbench/agents/_team (private) to cooperbench/team_harness (public, library-shaped) so other benchmarks can consume the multi-agent coordination algorithm without depending on CooperBench's task layout. Adds TeamSession + TeamHarnessConfig: - TeamSession bundles per-run state (run_id, namespaced Redis URL, ordered agent list, scratchpad volume name) with the feature config and exposes adapter-facing factories that each return None / [] / {} when their feature is disabled, so adapter code paths collapse to one branch: coop_env.update(session.env_for(agent_id)) extra_run_args.extend(session.scratchpad_mount_args()) mcp_config = session.mcp_config(container_script_path=...) - TeamHarnessConfig is a frozen dataclass of five per-feature booleans (task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member role split is the always-on baseline -- without it team is just coop. Wires five --team-no-* CLI flags through cli.py -> runner.run -> runner.core -> runner.team -> each adapter. result.json now records team_features so post-hoc analysis can attribute deltas to the feature that was off. Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and openhands_agent_sdk now accept team_features kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt, env, mount, MCP, install) gate on the session's config. Tests: tests/agents/_team -> tests/team_harness (rename), new test_session.py (29 cases) covers the facade, four new ablation tests in tests/runner/test_team.py verify the runner-side gating. Full suite 363 passed, 63 skipped; ruff/format/mypy clean. End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker): - Default: writes task_log.json + tasks.json + metrics, cb-team-<run> volume created. - --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log / tasks files, empty metrics dict, no volume. team_features in result.json reflects the requested ablation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ProKil · 2026-05-19T03:34:10Z

End-to-end eval results (was missing from the original PR body — ran cooperbench eval after pushing):

run	merge strategy	both pass	pass rate
`smoke-team-default`	`solo-agent1` (lead integrated both features)	✓	100%
`smoke-team-ablated` (`--team-no-task-list --team-no-scratchpad --team-no-mcp`)	`naive` (diff-cat)	✗	0%

Real pass-rate delta in the expected direction — with the shared task list / scratchpad / MCP off, the lead had nowhere to fetch the member's patch from, naive merge produced a non-composing diff, and eval failed. With them on, the lead integrated and the team passed.

n=1 is far too small for any conclusion about which of the three features is doing the work, but the plumbing and the ablation signal both work end-to-end. Next-PR target: the same matrix across the core 10-pair subset with one-flag-off ablations.

ProKil · 2026-05-19T06:40:08Z

Ablation matrix (core 10-pair × marginal-effect design)

Used the marginal-effect design (1 baseline + 5 one-feature-off) instead of the full 2⁵=32 interaction matrix — answers the "what's each feature's contribution" question at ~1/5 the cost. Codex CLI on gpt-5.5, --setting team --backend docker.

Headline numbers

config	flag off	pass / 10	Δ vs baseline
`ablate-11111`	(baseline, all on)	6	—
`ablate-11011`	mcp	5	−1
`ablate-11101`	auto_refresh	5	−1
`ablate-11110`	protocol	5	−1
`ablate-01111`	task_list	4	−2
`ablate-10111`	scratchpad	3	−3

Per-task breakdown — the signal is concentrated

task                                  base no_tl no_sp no_mcp no_ar no_pr
dottxt_ai_outlines/1655 [1,3]          ✓    ✓    ✓    ✓    ✓    ✓
dspy/8563 [1,4]                        ✗    ✗    ✗    ✗    ✗    ✗
go_chi/27 [3,4]                        ✗    ✗    ✗    ✗    ✗    ✗
llama_index/17244 [5,6]                ✓    ✓    ✓    ✓    ✓    ✓
openai_tiktoken/0 [4,8]                ✓    ✓    ✗    ✓    ✓    ✓
pallets_click/2800 [1,4]               ✗    ✗    ✗    ✗    ✗    ✗
pallets_jinja/1559 [5,8]               ✓    ✗    ✗    ✗    ✗    ✓
pallets_jinja/1621 [6,10]              ✓    ✗    ✗    ✓    ✓    ✗
react_hook_form/153 [2,6]              ✗    ✗    ✗    ✗    ✗    ✗
typst/6554 [2,6]                       ✓    ✓    ✓    ✓    ✓    ✓

3 tasks pass in every config — easy enough that any coordination level works.
4 tasks fail in every config — too hard for codex regardless of coordination.
Only 3 tasks are sensitive to the ablation (openai_tiktoken/0, pallets_jinja/1559, pallets_jinja/1621). Effective sample size = 3.

What this says about each feature

feature	effect	confidence	reading
scratchpad	−3/10 (−3/3 of sensitive tasks)	strongest	`/workspace/shared/` is where members drop their patches and the lead picks them up. Without it, no integration possible — the lead can't read what the member did.
task_list	−2/10 (−2/3 of sensitive tasks)	moderate	Without coordination state, the lead doesn't know what the member's working on. Still passes when the agents independently solve their feature without needing alignment (openai_tiktoken).
mcp	−1/10 (within noise)	low	wait_for_message long-poll. Codex already idles cheaply between commands; the marginal effect is small.
protocol	−1/10 (within noise)	low	Typed coop-request / coop-respond verbs. The agents in this task set don't actually use them much — message-passing via `coop-send` covers most needs.
auto_refresh	−1/10 (within noise)	n/a	`auto_refresh` only fires in Python-loop adapters (mini_swe_agent_v2 etc). Codex is a CLI adapter, so this flag is effectively a no-op for this sweep — the −1 is sample noise, not a real effect.

Caveats

n=3 effective is too small to distinguish mcp/protocol from noise. The scratchpad signal (3/3 of sensitive tasks) is the only one I'd call robust.
Codex CLI's --json stream doesn't emit cost or a model field, so result.json:total_cost shows $0 throughout. Real spend was ~$200-300 for this sweep based on token-count × public gpt-5.5 estimates (~46-50M input mostly cached + ~400K-1M output across the 6 configs).
For the auto_refresh measurement to mean anything, rerun on mini_swe_agent_v2 (where it actually fires).
For tighter CIs on mcp/protocol, expand the core subset or re-stratify to drop the always-pass / always-fail tasks.

Raw data

ablation_matrix.csv (in repo working dir, gitignored)
Per-run logs under logs/ablate-*/team/...

🤖 Generated with Claude Code

# Conflicts: # CHANGELOG.md # src/cooperbench/agents/_team/__init__.py # src/cooperbench/agents/_team/prompt.py # src/cooperbench/agents/mini_swe_agent_v2/adapter.py # src/cooperbench/agents/openhands_agent_sdk/adapter.py # src/cooperbench/agents/openhands_agent_sdk/openhands-tools/openhands/tools/task_tracker/coop_definition.py # src/cooperbench/agents/swe_agent/adapter.py # src/cooperbench/runner/team.py # tests/agents/test_team_wiring.py # tests/team_harness/test_prompt.py

Ubuntu and others added 19 commits May 16, 2026 20:42

This was referenced May 19, 2026

docs: team-harness ablation report (flash, codex/gpt-5.5) #59

Merged

codex: add Azure OpenAI support #60

Merged

Base automatically changed from team-all-adapters to main May 21, 2026 16:52

An error occurred while trying to automatically change base from team-all-adapters to main May 21, 2026 16:52

ProKil merged commit bea5e58 into main May 21, 2026
3 checks passed

ProKil deleted the team-harness-module branch May 21, 2026 17:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

team_harness: extract team mode as standalone harness + ablation flags#58

team_harness: extract team mode as standalone harness + ablation flags#58
ProKil merged 20 commits into
mainfrom
team-harness-module

ProKil commented May 19, 2026

Uh oh!

ProKil commented May 19, 2026

Uh oh!

ProKil commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ProKil commented May 19, 2026

Summary

What's new

cooperbench.team_harness — public package

Ablation flags

Adapter refactor

Tests

End-to-end on dottxt_ai_outlines/1371 [1,2] with codex --setting team --backend docker

Out of scope (next PRs)

Test plan

Uh oh!

ProKil commented May 19, 2026

Uh oh!

ProKil commented May 19, 2026

Ablation matrix (core 10-pair × marginal-effect design)

Headline numbers

Per-task breakdown — the signal is concentrated

What this says about each feature

Caveats

Raw data

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`cooperbench.team_harness` — public package

End-to-end on `dottxt_ai_outlines/1371 [1,2]` with `codex --setting team --backend docker`