feat(coder): Phase 11 — CLI unification + end-to-end integration tests #829
Merges every phase's CLI bits into a single src/gaia/coder/cli.py so one argparse tree owns the whole subcommand surface. Phase 5's trust / promote / demote / ask / note / critical / inbox verbs stay intact; every subcommand whose real handler exists elsewhere is now wired:

- daemon / wait / stop + ARTIFACT_FILENAMES (Phase 2 stub daemon, required by gaia.eval.runners.coder_cli and the GAIA-Internal-20 suite)
- feedback (Phase 6, writes a real row to feedback.db via gaia.coder.stores.feedback) + self-fix process (runs one FeedbackLoopDriver.process_pending_feedback iteration; no-ops on empty queue)
- dev-mode enable / disable / status (Phase 7, over gaia.coder.dev_mode)
- debug repro|bisect|hypothesise|probe|localise|propose|postmortem (Phase 8 scaffold; DebugSubLoop is Python-only until the Phase 11 production swap)
- rag status / refresh / rebuild (Phase 10, over gaia.coder.rag_freshness with a noop provider/runner until a real RAG backend is bound)

`ask` handles both Phase-5 (EM inbox question) and Phase-2 (daemon task body) contracts by dispatching on `--sandbox`: Phase 2 uses `-` as the body to read from stdin, Phase 5 joins `nargs='+'` positional args.

Remaining stubs (status / audit / spend / egress / introspect / skill / doctor) print "not yet implemented" until their phases land. The Phase 5 stub-list parametrization in tests/coder/test_cli_trust.py is updated to reflect the new reality.
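The `--sandbox` dispatch described for `ask` can be sketched roughly as follows. This is a minimal illustration, not the code in src/gaia/coder/cli.py: the helper names `build_parser` and `resolve_ask`, the contract labels, and the assumption that `--sandbox` takes a value are all hypothetical.

```python
import argparse
import io
import sys


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="gaia-coder")
    sub = parser.add_subparsers(dest="command", required=True)
    ask = sub.add_parser("ask")
    # Assumption: --sandbox takes a value; only its presence matters here.
    ask.add_argument("--sandbox", default=None)
    # One or more positional words; argparse treats a lone "-" as positional.
    ask.add_argument("body", nargs="+")
    return parser


def resolve_ask(args: argparse.Namespace, stdin=None):
    """Return a (contract, text) pair for one `ask` invocation."""
    stdin = stdin if stdin is not None else sys.stdin
    if args.sandbox is not None:
        # Phase-2 style: "-" means "read the task body from stdin".
        if args.body == ["-"]:
            return "daemon-task", stdin.read()
        return "daemon-task", " ".join(args.body)
    # Phase-5 style: join the positional words into one question.
    return "em-inbox", " ".join(args.body)


if __name__ == "__main__":
    demo = build_parser().parse_args(["ask", "--sandbox", "sb", "-"])
    print(resolve_ask(demo, stdin=io.StringIO("fix the failing test\n")))
```

Keeping the branch inside one resolver means neither contract needs a second subcommand name, which matches the stated goal of a single argparse tree.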
…tion

The two `@pytest.mark.skip(reason="Phase 6 CLI handlers deferred…")` gates on tests/coder/test_self_fix/test_cli.py are lifted now that `gaia-coder feedback` + `gaia-coder self-fix process` are wired. Likewise tests/eval/conftest.py's `collect_ignore_glob` that silenced test_coder_cli_runner.py and test_gaia_internal_20_suite.py is dropped: the daemon / ask / wait / stop / ARTIFACT_FILENAMES surface they exercise is now part of the unified CLI.

All 386 tests under tests/coder/ and tests/eval/ pass with zero skips.
… self-heal)
Adds tests/coder/test_integration_e2e.py with two hermetic scenarios that
exercise the whole coder as a system. Every external boundary — LLM
calls, `gh` CLI shell-outs, the ReAct audit hook — is injected via mock
callables so no real tokens or network traffic are needed.
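As a rough illustration of the injection pattern described above, the sketch below replaces three external seams with plain callables. The `Boundaries` bundle, `open_draft_pr`, and `make_fakes` are hypothetical names invented for this example; the real test's fixtures may be shaped differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Boundaries:
    """Hypothetical bundle of the external seams a test wants to replace."""
    llm: Callable[[str], str]
    gh: Callable[[List[str]], str]
    audit: Callable[[str], None]


def open_draft_pr(title: str, b: Boundaries) -> str:
    """Toy pipeline step: it only talks to the injected seams."""
    summary = b.llm(f"Summarise: {title}")
    b.audit("publish-pr")
    return b.gh(["pr", "create", "--draft", "--title", title, "--body", summary])


def make_fakes():
    """Build canned callables plus the lists that record what they saw."""
    gh_calls: List[List[str]] = []
    audit_rows: List[str] = []

    def fake_gh(argv: List[str]) -> str:
        gh_calls.append(argv)
        return "https://example.invalid/pr/1"

    fakes = Boundaries(
        llm=lambda prompt: "canned summary",
        gh=fake_gh,
        audit=audit_rows.append,
    )
    return fakes, gh_calls, audit_rows
```

Because the step never imports a client or spawns a process itself, a test can assert on the recorded calls without tokens or network access.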
`test_full_flow_feedback_to_fix` runs the §7.3 / §7.4 happy path:
  1. `gaia-coder trust --bootstrap --em-handle e2e-em …`
  2. `gaia-coder feedback "<body>" --severity high --on <url> …`
     writes a real pending row to feedback.db via the store module.
  3. FeedbackLoopDriver.process_pending_feedback drives triage → plan →
     fix → regression-test → review-gate → publish-PR with canned
     triage / gh / review stubs.
  4. Assertions cover every externally-visible contract: state machine
     transitions (pending → triaged → in-fix → fix-pr-open), branch
     `auto/gaia-coder/fb-e2e-1` exists, regression test file on disk,
     review gate overall=pass note, audit log rows for every stage.
  5. A real git merge + verify_on_merge (against the checked-in merged
     SHA) then transitions the row to `verified` and writes
     failure_patterns + review_patterns rows into memory.db.
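The state-machine assertion in step 4 can be pictured with a small transition table. This is only a sketch: the table below is hypothetical and covers just the happy-path states named above plus `verified`; the real store may allow other states and transitions.

```python
# Hypothetical transition table; only the happy-path states named in the
# commit message are included.
ALLOWED = {
    "pending": {"triaged"},
    "triaged": {"in-fix"},
    "in-fix": {"fix-pr-open"},
    "fix-pr-open": {"verified"},
    "verified": set(),
}


def check_path(states):
    """Raise ValueError on the first transition the table does not allow."""
    for current, nxt in zip(states, states[1:]):
        if nxt not in ALLOWED.get(current, set()):
            raise ValueError(f"illegal transition: {current} -> {nxt}")
    return True
```

A test that records each state a row passes through can hand the list to `check_path` and fail loudly if a stage was skipped.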
`test_dev_mode_self_heal_e2e` runs the §7.5 sub-loop:
  1. A hermetic tmp git repo configured with an amd/gaia origin satisfies
     the §7.1 hard precondition, so dev_mode.enable_session succeeds.
  2. self_heal.classify_failure with a canned self-code response yields
     a high-confidence self-bug classification.
  3. pause_current_task snapshots the task to paused-tasks/<id>.json.
  4. restart_self(kind="prompt-only") hot-reloads without exiting.
  5. resume_task(delete_snapshot=True) round-trips the snapshot and
     verifies the file is consumed.
  6. The audit log contains rows for both dev_mode.enable_session and
     self_heal.restart_self.
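The pause / resume round-trip in steps 3 and 5 amounts to writing a JSON snapshot and consuming it again. The functions below are hypothetical stand-ins written for illustration, not the real `pause_current_task` / `resume_task`; only the `paused-tasks/<id>.json` layout and the `delete_snapshot` flag come from the description above.

```python
import json
import tempfile
from pathlib import Path


def pause_sketch(root: Path, task: dict) -> Path:
    """Write the task to paused-tasks/<id>.json and return the snapshot path."""
    snapshot = root / "paused-tasks" / f"{task['id']}.json"
    snapshot.parent.mkdir(parents=True, exist_ok=True)
    snapshot.write_text(json.dumps(task, sort_keys=True), encoding="utf-8")
    return snapshot


def resume_sketch(root: Path, task_id: str, delete_snapshot: bool = False) -> dict:
    """Load the snapshot back; optionally consume the file."""
    snapshot = root / "paused-tasks" / f"{task_id}.json"
    task = json.loads(snapshot.read_text(encoding="utf-8"))
    if delete_snapshot:
        snapshot.unlink()
    return task


def round_trip_demo():
    """Return (restored task, whether the snapshot file still exists)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        snapshot = pause_sketch(root, {"id": "t-1", "step": 3})
        restored = resume_sketch(root, "t-1", delete_snapshot=True)
        return restored, snapshot.exists()
```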
Also folds in black auto-formatting fixes on src/gaia/coder/cli.py from
the first lint pass after the unification commit landed.
All 388 tests under tests/coder/ and tests/eval/ continue to pass.
## Summary

This PR converges the

## Issues Found

### 🟡 Important

1. This test is one of the two that just had its
2. Dead os.environ["PATH"] = os.environ.get("PATH", "")
   This is a no-op (reads
   Either way, the current line contributes nothing and is misleading. Also note: this mutates
3. Daemon PID-file lifecycle has a TOCTOU / PID-reuse window (

### 🟢 Minor

4. Docstring: "every call passes an explicit user.email/name + gpgsign override". Body: plain
5. Argparse namespace leaks private-underscore attrs (
   Stashing the parent parser on
6. def runner(args, cwd=None, check=True):
   The stub never honours
7. Strengths

## Verdict

Approve with suggestions: no blocking issues. Issues 1 and 2 are small but worth fixing in a follow-up commit (the first is a latent test-correctness bug; the second is dead code that reads as though it's doing something). Issue 3 is a Phase-2-stub concern that can wait for the production swap. Everything else is minor polish.
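To make issue 2 concrete, the sketch below contrasts the flagged self-assignment with an actual prepend. `prepend_to_path` is a hypothetical helper that works on a copy of the environment, so nothing leaks into the calling process.

```python
import os


def prepend_to_path(env: dict, directory: str) -> dict:
    """Return a copy of `env` with `directory` placed first on PATH."""
    updated = dict(env)
    current = env.get("PATH", "")
    updated["PATH"] = directory + os.pathsep + current if current else directory
    return updated


def self_assign(env: dict) -> dict:
    """The flagged pattern: re-assigning PATH to itself changes nothing."""
    updated = dict(env)
    updated["PATH"] = updated.get("PATH", "")
    return updated
```

Inside a pytest test, `monkeypatch.setenv("PATH", str(bin_dir), prepend=os.pathsep)` should achieve the same prepend and restore the original value at teardown; `bin_dir` is a placeholder name here.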
…view pass) (#834)

## Summary

Final cleanup pass to complete the `coder` branch for EM testing. Five Important + three Minor findings across the Phase 5/6/11 auto-reviews. All 395 tests pass.

## Changes

- `test_self_fix/test_cli.py`: `pytest.raises` so a silent-pass regression in argparse can't pass the test. [#825, #829]
- `test_integration_e2e.py`: real `PATH` prepend via `monkeypatch.setenv` instead of a no-op assignment that leaked env. [#829]
- `test_fixes_827_828.py`: drop unused `Path` import. [#832]
- `loop_driver.py`: narrow broad `except Exception` around `review_gate` and `notify_em` to `(RuntimeError, CalledProcessError, OSError)`. Programming errors now surface per CLAUDE.md fail-loudly. [#825]
- `loop_driver.py` + `verifier.py`: `_append_notes` / `_append_note` raise `ValueError` on corrupted or wrong-type `notes_json` instead of silently replacing with `[]`. [#825]

## Test plan

- [x] `pytest tests/coder/ tests/eval/`: 395/395 pass
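The fail-loudly change to `notes_json` handling might look roughly like the sketch below. `append_note_sketch` is a hypothetical stand-in, not the real `_append_note`; in particular, treating `None` or an empty string as "no notes yet" is an assumption.

```python
import json


def append_note_sketch(notes_json, note: str) -> str:
    """Append `note` to a JSON-encoded list, refusing to paper over bad data."""
    if notes_json is None or notes_json == "":
        notes = []  # Assumption: an empty column means "no notes yet".
    else:
        try:
            notes = json.loads(notes_json)
        except json.JSONDecodeError as exc:
            # Corrupted payload: surface it instead of resetting to [].
            raise ValueError(f"notes_json is not valid JSON: {exc}") from exc
        if not isinstance(notes, list):
            # Valid JSON of the wrong shape is also an error.
            raise ValueError(
                f"notes_json must decode to a list, got {type(notes).__name__}"
            )
    notes.append(note)
    return json.dumps(notes)
```

Raising here means a corrupted row stops the loop at the point of damage rather than silently discarding earlier notes.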
## Summary

Unifies the Phase 5 `gaia-coder` CLI with every later phase's CLI bits under one argparse tree, then lands two hermetic end-to-end tests that prove the whole coder works as a system, from EM feedback submission through draft PR opening to merged-state verification. Phases 1–10 landed their modules but each phase's CLI glue arrived through a separate branch; rebasing them together produced conflicts that left `feedback`, `daemon`, `dev-mode`, `debug`, and `rag` subcommands unwired. This PR is the convergence point.

## Threads

- `src/gaia/coder/cli.py` wires real handlers for every subcommand whose module has landed. `ask` dispatches on `--sandbox` to serve both the Phase-5 EM-inbox contract and the Phase-2 eval-harness daemon shim, so neither gets special-cased elsewhere. `ARTIFACT_FILENAMES` is re-exported so `gaia.eval.runners.coder_cli` keeps compiling.
- Lifts the `@pytest.mark.skip` gates on `tests/coder/test_self_fix/test_cli.py` and drops the `collect_ignore_glob` on `tests/eval/conftest.py`, so the Phase 6 feedback CLI and the Phase 2 eval-harness runner + GAIA-Internal-20 suite tests all run on every CI pass.
- Adds `tests/coder/test_integration_e2e.py` with two scenarios: the full feedback → triage → plan → fix → PR → merge → verified pipeline (23 assertions across state machine, git branch, audit log, memory writes), and the dev-mode self-heal sub-loop (§7.5: classify → pause → restart → resume round-trip). Every LLM / `gh` / subprocess boundary is mocked, so no real API tokens are required.

## Subcommands in `gaia-coder --help` (22 total)

Wired to real handlers (15): `trust`, `promote`, `demote`, `ask`, `note`, `critical`, `inbox`, `daemon`, `wait`, `stop`, `feedback`, `self-fix` (+ `process`), `dev-mode` (+ `enable` / `disable` / `status`), `debug` (+ 7 scaffolded state verbs), `rag` (+ `status` / `refresh` / `rebuild`).

Deliberately deferred stubs (7): `status`, `audit`, `spend`, `egress`, `introspect`, `skill`, `doctor`.

## Previously-skipped tests now passing (26)

- `tests/coder/test_self_fix/test_cli.py::test_cli_feedback_enqueues`
- `tests/coder/test_self_fix/test_cli.py::test_cli_self_fix_subcommand_without_action_prints_help`
- `tests/eval/test_coder_cli_runner.py`
- `tests/eval/test_gaia_internal_20_suite.py`
- `test_cli_feedback_rejects_invalid_severity` that was always active

## Test plan

- `pytest tests/coder/ tests/eval/`: 388 passed, 0 skipped
- `gaia-coder --help` lists 22 subcommands covering Phases 2/4/5/6/7/8/10
- `gaia-coder trust` still prints the §4.2 summary (Phase 5 regression)
- `gaia-coder feedback "test" --severity low` writes a real row to `feedback.db` and returns JSON
- `gaia-coder self-fix process` on an empty queue returns `final_state=no-pending` and exits 0
- `gaia-coder rag status` renders the §6.9 contract (empty provider → watchdog fires as warn)
- `python -m flake8` on touched files: zero errors (pre-existing lint debt in unrelated files is out of scope)
- `tmp_path` with mocked LLM / `gh` / audit boundaries

## Do-not-merge

This is a draft pending upstream review. Rebase `coder` → `main` lives in a separate PR.
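For readers unfamiliar with the stub pattern, the seven deferred subcommands could be registered along these lines. This is a sketch under stated assumptions: the helper names, the exact message format, and the zero exit code are hypothetical, and the wired subcommands are omitted.

```python
import argparse

# The seven deferred subcommands named in the PR description.
STUBS = ("status", "audit", "spend", "egress", "introspect", "skill", "doctor")


def _not_implemented(args: argparse.Namespace):
    # Assumption: the real stubs' exit code is not stated; 0 is used here.
    return 0, f"{args.command}: not yet implemented"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="gaia-coder")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in STUBS:
        # Every stub shares one handler, so swapping in a real one later
        # is a single set_defaults change.
        sub.add_parser(name).set_defaults(handler=_not_implemented)
    return parser


def run(argv):
    """Parse argv and return the handler's (exit_code, message) pair."""
    args = build_parser().parse_args(list(argv))
    return args.handler(args)
```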