v2 adapter hardening, shared git server, submission-based patch#48
Merged
Conversation
mini_swe_agent_v2 already loads dotenv from a global config dir (~/.config/mini-swe-agent/.env), but cooperbench itself never loaded the project-local .env, so OPENAI_API_KEY etc. only made it through when the user manually exported them. Calling dotenv.load_dotenv() at the top of cli.py auto-loads ./.env from cwd before any env-var-dependent imports run, matching how projects with python-dotenv conventionally pick up local config.
…bmit via patch.txt
Adapter:
- accept **kwargs so unknown caller-side args don't crash run()
- wire up the agent_config CLI flag that was previously listed in the
signature but never read; load YAML and deep-merge config: block over
the defaults
- sanitize content=None on tool-calling assistant turns before returning
AgentResult.messages (CooperBench's downstream coop runner does
'"send_message" in content' which TypeErrors on None)
- drop the dead SEND_MESSAGE_TOOL import (only BASH_TOOL is registered;
send_message is intercepted from inside the bash command string)
- drop _get_patch() and the base_commit capture; the patch now comes
straight from result['submission'] (no working-tree extraction
fallback — if the agent didn't submit, there is no patch)
Config:
- delete config/mini.yaml (was a single file with {% if agent_id %}
branches handling both solo and coop) and split into
config/solo.yaml and config/coop.yaml; adapter picks based on
is_coop = len(agents) > 1
- fix a leak in the solo branch where the CRITICAL REQUIREMENTS
section still mentioned 'send_message to your colleague' even when
the agent has no colleague
- replace the bare 'echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT' submit
step with the upstream mini-swe-agent SWE-bench three-step flow:
curate via 'git diff -- file1 file2 > patch.txt', verify via cat,
submit via 'echo COMPLETE... && cat patch.txt'. The submission
field then carries the patch verbatim.
… repos Old design spun up a fresh debian-slim container per coop pair, ran 'apt-get install git' inside it, slept 3 seconds, and returned. Two problems: - the 3-second sleep was far too short for the apt install (~30-60s on a cold image), so agents' initial 'git push' raced the daemon start and got 'Connection refused' - the per-run container lived on its own bridge network cooperbench-git-<run_id>, but DockerEnvironmentConfig had no 'network' field, so the kwarg the adapter passed got silently dropped by Pydantic and agent containers ended up on the default bridge — no route to the git server's IP Replaced with a Redis-style shared singleton: - one image cooperbench-git-server:local (built lazily on first use from a 4-line Dockerfile) - one container cooperbench-git running 'git daemon --base-path=/git --listen=0.0.0.0' as PID 1, with a docker volume for /git - one shared bridge network 'cooperbench' that all agent containers join - per-run isolation via path namespacing: each coop pair gets /git/<run_id>/repo.git, served at git://cooperbench-git:9418/<run_id>/repo.git DockerGitServer.create() now just ensures the singleton infra is up (idempotent, ~140ms after first call) and exec's a quick 'mkdir + git init --bare' inside the running daemon. cleanup() removes only the per-run path and leaves the singleton alive. DockerEnvironmentConfig also gets a typed 'network' field so the --network flag actually reaches docker run.
DefaultAgent.run() calls self.save(self.config.output_path) in its
finally clause every step, and save() calls serialize(). Once
compaction has fired, serialize() was unconditionally calling
_close_current_segment('solver') — which appends a snapshot of the
current live messages as a new segment AND resets the buffer (which
the next query() then repopulates).
Net effect: each step after the first compaction added another
post-compaction solver segment to _segments, each one near-superset
of the previous. In a real run we observed segment counts like
[86, 85, 8, 10, 12, 14, 15] where the last 5 should have been a
single segment.
Fix: serialize() builds a snapshot list locally without mutating
self._segments. The current open buffer is appended as a transient
'solver' segment in the snapshot. Multiple calls to serialize() now
produce identical output and leave state unchanged.
When an agent submits a malformed patch (e.g. a 'new file mode' diff against an existing file), 'git apply' rejects it but the eval would silently commit an empty branch, then report the subsequent merge as 'clean' because there was nothing to disagree with. An agent self-sabotage (the canonical case is an agent running 'rm -rf .git' mid-run) would look like a passing eval. _setup_branches now emits explicit per-agent markers (PATCH<N>_APPLIED / _SKIPPED / _FAILED) and returns an apply_status dict. test_merged threads it into the eval result and overrides merge.status to 'missing_input' when any patch failed. While in there, _run_tests now exposes exit_code in its result, and the per-feature dict gains feature_id / exit_code / tests_passed / tests_failed alongside the existing passed + test_output (which is still a 50KB blob). Consumers can now reason about results without grepping raw pytest output.
…truction
The previous Submission section was ~40 lines of three-step procedure
plus a CRITICAL block. Small models (Qwen 9B, observed in coop+git
runs) tended to follow the recipe but still hit footguns we hadn't
forbidden — the canonical failure was an agent running 'rm -rf .git'
mid-task, then 'git init' to 'fix' it, then producing a malformed
'new file mode' diff that the eval silently dropped.
Trim the recipe to the bare flow:
git diff -- path/to/file > patch.txt
cat patch.txt
echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt
Then a tight CRITICAL block forbidding 'rm -rf .git', 'git init',
'git rm -rf .', and 'git reset --hard' inside /workspace/repo —
which corrupt the local .git/ directory regardless of whether the
team remote is enabled.
In a follow-up empirical run, agent6 (the agent that previously did
rm -rf .git) issued zero destructive git commands and produced a
clean modify-existing patch that passed all 100 of feature 6's tests.
CHANGELOG: extend the v0.0.12 entry to cover both this and the
preceding eval-observability commit.
Earlier simplification collapsed the four file-category exclusions (reproduction scripts, helper tools, build/config files, binaries) into a single inline parenthetical, which small models tend to under-weight. Restore the bulleted list — bullets are easier to parse and harder to skim past — without re-adding the Step 1/2/3 scaffolding or the env-var notation that the simplification was meant to remove.
Single fenced bash block invited models to chain the three commands with '&&' or run them as one heredoc. That breaks the design — the env's COMPLETE sentinel detection only fires when 'echo COMPLETE...' is the first line of bash output, and chaining makes the diff happen on the same line as cat, which the env then doesn't capture as a submission. Split into three separate fenced bash blocks (write / verify / submit), with explicit 'SEPARATE bash tool call' instruction.
The previous wording opened with 'Edit files in place. Don't commit.' — which contradicts the existing Shared Git Remote section (telling agents they have a branch to coordinate on) and over-prescribes the workflow. Reframe: patch.txt is the artifact we evaluate, the agent writes whatever unified diff they want to submit to that file, however fits the workflow they used. The 'git diff -- file > patch.txt' recipe stays as 'one common way', not the only way. Agents are free to commit, fetch, merge, or do whatever else — the contract is only what ends up in patch.txt.
…yaml too Mirrors the trim already applied to coop.yaml — the sentence was unnecessary scaffolding now that the three steps are in their own fenced blocks.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stabilizes the
mini_swe_agent_v2adapter and removes a pile of latentbugs that were silently degrading coop+git runs. Four commits, each
self-contained:
cli: auto-load ./.env—cooperbenchitself never calledload_dotenv(), so project-localOPENAI_API_KEYetc. only workedwhen users manually exported them. One-liner at the top of
cli.py.mini_swe_agent_v2: harden adapter, split mini.yaml into solo/coop, submit via patch.txt— the bulk of the change:**kwargs; wires up the previously-deadagent_configflag; sanitizescontent=Nonefrom tool-callingturns (CooperBench's
_extract_conversationdoes\"send_message\" in contentand TypeErrors on None); drops thedead
SEND_MESSAGE_TOOLimport._get_patch()/ working-tree extraction; the patch nowcomes verbatim from
result['submission']. No fallback — if theagent didn't submit, there's no patch. This is the upstream
mini-swe-agent SWE-bench design.
mini.yaml(single file with{% if agent_id %}branches)into
solo.yaml+coop.yaml; fixes a leak in the solo branchthat mentioned a non-existent colleague.
git diff -- file1 file2 > patch.txt,cat patch.txt, thenecho COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt—mirroring upstream's three-step submission flow.
mini_swe_agent_v2: shared singleton git server, path-prefixed per-run repos— replaces the per-runapt install git + sleep 3container with a Redis-style singleton. One
cooperbench-gitcontainer running
git daemonon a sharedcooperbenchbridgenetwork; per-run isolation via path namespacing
(
/git/<run_id>/repo.git). Auto-created on first use, idempotentthereafter. Also adds a typed
networkfield toDockerEnvironmentConfigso the--networkkwarg actually reachesdocker run(previously Pydantic silently dropped it).mini_swe_agent_v2: serialize() no longer mutates _segments—DefaultAgent.run()callssave()in its finally clause everystep, and
save → serialize → _close_current_segmentwas appendinga fresh solver segment per save. Result:
_full_traj.jsonballooned with overlapping post-compaction snapshots. Fix:
serialize()snapshots locally without mutating state.Verification
All four CI checks pass locally on the branch:
uv run --extra dev ruff check src/cooperbench/— cleanuv run --extra dev ruff format --check src/cooperbench/— cleanuv run --extra dev mypy src/cooperbench/— cleanuv run --extra dev pytest tests/ -v— 155 passed, 63 skippedEnd-to-end coop+git run against a Modal-served Qwen3.5-9B endpoint with
the new pipeline (
cooperbench run -n test -r dottxt_ai_outlines_task -t 1655 -f 6,7 -m openai/Qwen/Qwen3.5-9B -a mini_swe_agent_v2 --git):both agents reached
Submittedstatus; the resultingagent6.patchand
agent7.patchcontain only the modified source file (no scratchtest scripts), proving the curated-submission flow.
Test plan
ruff check,ruff format --check)