feat(coder): Phase 11 — CLI unification + end-to-end integration tests by kovtcharov · Pull Request #829 · amd/gaia

kovtcharov · 2026-04-20T10:49:49Z

Summary

Unifies the Phase 5 gaia-coder CLI with every later phase's CLI bits under one argparse tree, then lands two hermetic end-to-end tests that prove the whole coder works as a system — from EM feedback submission through draft PR opening to merged-state verification. Phases 1–10 landed their modules but each phase's CLI glue arrived through a separate branch; rebasing them together produced conflicts that left feedback, daemon, dev-mode, debug, and rag subcommands unwired. This PR is the convergence point.

Threads

CLI unification (13d023e) — one argparse tree in src/gaia/coder/cli.py wires real handlers for every subcommand whose module has landed. ask dispatches on --sandbox to serve both the Phase-5 EM-inbox contract and the Phase-2 eval-harness daemon shim, so neither gets special-cased elsewhere. ARTIFACT_FILENAMES is re-exported so gaia.eval.runners.coder_cli keeps compiling.
Re-enabled tests (1aa4d41) — lifts the two @pytest.mark.skip gates on tests/coder/test_self_fix/test_cli.py and drops the collect_ignore_glob on tests/eval/conftest.py, so the Phase 6 feedback CLI and the Phase 2 eval-harness runner + GAIA-Internal-20 suite tests all run on every CI pass.
End-to-end integration tests (937e3e0) — new tests/coder/test_integration_e2e.py with two scenarios: the full feedback → triage → plan → fix → PR → merge → verified pipeline (23 assertions across state machine, git branch, audit log, memory writes), and the dev-mode self-heal sub-loop (§7.5: classify → pause → restart → resume round-trip). Every LLM / gh / subprocess boundary is mocked — no real API tokens required.

Subcommands in `gaia-coder --help` (22 total)

Wired to real handlers (15): trust, promote, demote, ask, note, critical, inbox, daemon, wait, stop, feedback, self-fix (+ process), dev-mode (+ enable / disable / status), debug (+ 7 scaffolded state verbs), rag (+ status / refresh / rebuild).

Deliberately deferred stubs (7): status, audit, spend, egress, introspect, skill, doctor.

Previously-skipped tests now passing (26)

tests/coder/test_self_fix/test_cli.py::test_cli_feedback_enqueues
tests/coder/test_self_fix/test_cli.py::test_cli_self_fix_subcommand_without_action_prints_help
All 7 tests in tests/eval/test_coder_cli_runner.py
All 16 tests in tests/eval/test_gaia_internal_20_suite.py
Plus 1 new test_cli_feedback_rejects_invalid_severity that was always active

Test plan

pytest tests/coder/ tests/eval/ — 388 passed, 0 skipped
gaia-coder --help lists 22 subcommands covering Phases 2/4/5/6/7/8/10
gaia-coder trust still prints the §4.2 summary (Phase 5 regression)
gaia-coder feedback "test" --severity low writes a real row to feedback.db and returns JSON
gaia-coder self-fix process on an empty queue returns final_state=no-pending and exits 0
gaia-coder rag status renders the §6.9 contract (empty provider → watchdog fires as warn)
python -m flake8 on touched files — zero errors (pre-existing lint debt in unrelated files is out of scope)
Both new e2e tests run hermetically under tmp_path with mocked LLM / gh / audit boundaries

Do-not-merge

This is a draft pending upstream review. Rebase coder → main lives in a separate PR.

Merges every phase's CLI bits into a single src/gaia/coder/cli.py so one argparse tree owns the whole subcommand surface. Phase 5's trust / promote / demote / ask / note / critical / inbox verbs stay intact; every subcommand whose real handler exists elsewhere is now wired: - daemon / wait / stop + ARTIFACT_FILENAMES (Phase 2 stub daemon, required by gaia.eval.runners.coder_cli and the GAIA-Internal-20 suite) - feedback (Phase 6, writes a real row to feedback.db via gaia.coder.stores.feedback) + self-fix process (runs one FeedbackLoopDriver.process_pending_feedback iteration; no-ops on empty queue) - dev-mode enable / disable / status (Phase 7, over gaia.coder.dev_mode) - debug repro|bisect|hypothesise|probe|localise|propose|postmortem (Phase 8 scaffold — DebugSubLoop is Python-only until the Phase 11 production swap) - rag status / refresh / rebuild (Phase 10, over gaia.coder.rag_freshness with a noop provider/runner until a real RAG backend is bound) `ask` handles both Phase-5 (EM inbox question) and Phase-2 (daemon task body) contracts by dispatching on `--sandbox` — Phase 2 uses `-` as the body to read from stdin, Phase 5 joins `nargs='+'` positional args. Remaining stubs (status / audit / spend / egress / introspect / skill / doctor) print "not yet implemented" until their phases land. The Phase 5 stub-list parametrization in tests/coder/test_cli_trust.py is updated to reflect the new reality.

…tion The two `@pytest.mark.skip(reason="Phase 6 CLI handlers deferred…")` gates on tests/coder/test_self_fix/test_cli.py are lifted now that `gaia-coder feedback` + `gaia-coder self-fix process` are wired. Likewise tests/eval/conftest.py's `collect_ignore_glob` that silenced test_coder_cli_runner.py and test_gaia_internal_20_suite.py is dropped — the daemon / ask / wait / stop / ARTIFACT_FILENAMES surface they exercise is now part of the unified CLI. All 386 tests under tests/coder/ and tests/eval/ pass with zero skips.

… self-heal) Adds tests/coder/test_integration_e2e.py with two hermetic scenarios that exercise the whole coder as a system. Every external boundary — LLM calls, `gh` CLI shell-outs, the ReAct audit hook — is injected via mock callables so no real tokens or network traffic are needed. `test_full_flow_feedback_to_fix` runs the §7.3 / §7.4 happy path: 1. `gaia-coder trust --bootstrap --em-handle e2e-em …` 2. `gaia-coder feedback "<body>" --severity high --on <url> …` writes a real pending row to feedback.db via the store module. 3. FeedbackLoopDriver.process_pending_feedback drives triage → plan → fix → regression-test → review-gate → publish-PR with canned triage / gh / review stubs. 4. Assertions cover every externally-visible contract: state machine transitions (pending → triaged → in-fix → fix-pr-open), branch `auto/gaia-coder/fb-e2e-1` exists, regression test file on disk, review gate overall=pass note, audit log rows for every stage. 5. A real git merge + verify_on_merge (against the checked-in merged SHA) then transitions the row to `verified` and writes failure_patterns + review_patterns rows into memory.db. `test_dev_mode_self_heal_e2e` runs the §7.5 sub-loop: 1. A hermetic tmp git repo configured with an amd/gaia origin satisfies the §7.1 hard precondition, so dev_mode.enable_session succeeds. 2. self_heal.classify_failure with a canned self-code response yields a high-confidence self-bug classification. 3. pause_current_task snapshots the task to paused-tasks/<id>.json. 4. restart_self(kind="prompt-only") hot-reloads without exiting. 5. resume_task(delete_snapshot=True) round-trips the snapshot and verifies the file is consumed. 6. The audit log contains rows for both dev_mode.enable_session and self_heal.restart_self. Also folds in black auto-formatting fixes on src/gaia/coder/cli.py from the first lint pass after the unification commit landed. All 388 tests under tests/coder/ and tests/eval/ continue to pass.

github-actions · 2026-04-20T11:01:03Z

Summary

This PR converges the gaia-coder CLI surface from Phases 1-10 into a single argparse tree and lands two hermetic end-to-end integration tests. The work is well-scoped and the docstrings are genuinely excellent — every non-obvious handler explains why it exists and what contract it upholds. My main concerns are small: a no-op os.environ line in the new e2e test, a latent assertion bug in a re-enabled test, and some defensive hygiene around the Phase-2 daemon's PID-file lifecycle.

Issues Found

🟡 Important

1. test_cli_feedback_rejects_invalid_severity passes silently when argparse does not raise (tests/coder/test_self_fix/test_cli.py:55-68)

This test is one of the two that just had its @pytest.mark.skip lifted, so it's now load-bearing. But the try/except SystemExit pattern means that if argparse ever stops raising on the invalid enum (accepting nonsense), the test passes with zero assertions — the opposite of what's intended. Switch to pytest.raises:

def test_cli_feedback_rejects_invalid_severity(capsys) -> None:
    """argparse enforces the severity enum."""
    import pytest

    with pytest.raises(SystemExit) as exc_info:
        coder_cli.main(
            [
                "feedback",
                "body",
                "--severity",
                "nonsense",
            ]
        )
    assert exc_info.value.code != 0

2. Dead os.environ["PATH"] assignment in e2e test (tests/coder/test_integration_e2e.py:381)

os.environ["PATH"] = os.environ.get("PATH", "")

This is a no-op (reads PATH and writes the same value back). The surrounding comment says "set it explicitly on PATH too" for Windows CI, but the code doesn't actually prepend sys.executable's directory. Either:

Delete the line + comment if PATH already works (the CI evidence in the PR says it does), or
Actually add the interpreter dir:

    # Windows CI occasionally resolves `sys.executable` to a path that the
    # subprocess loses, so we ensure its directory is on PATH.
    import sys as _sys
    py_dir = str(Path(_sys.executable).parent)
    if py_dir not in os.environ.get("PATH", "").split(os.pathsep):
        os.environ["PATH"] = py_dir + os.pathsep + os.environ.get("PATH", "")

Either way, the current line contributes nothing and is misleading. Also note: this mutates os.environ without monkeypatch — the change leaks to other tests in the same session. Prefer the monkeypatch.setenv fixture used elsewhere in the file (gaia_coder_home already demonstrates the pattern).

3. Daemon PID-file lifecycle has a TOCTOU / PID-reuse window (src/gaia/coder/cli.py:402-415, :495-499)

_handle_daemon uses os.kill(int(existing_pid), 0) to check if the PID is live, and _handle_stop reads the PID and sends SIGTERM. If the recorded PID belongs to an unrelated process after reuse (crash + reboot + new process claims the same PID), stop will SIGTERM that unrelated process. For a Phase-2 stub this is low-probability in practice, but it's worth guarding by writing a start marker (e.g. boot-id or process-start-time) alongside the PID and validating it before signalling. Not a blocker for this PR, but flagging for the production swap.

🟢 Minor

4. _git helper docstring contradicts its body (tests/coder/test_integration_e2e.py:67-76)

Docstring: "every call passes an explicit user.email/name + gpgsign override". Body: plain subprocess.run(["git", *args], ...) — no overrides are passed per-call; they're set once via git config when the fixture creates the repo. Tighten the docstring so a future reader isn't misled:

def _git(cwd: Path, *args: str) -> subprocess.CompletedProcess:
    """Run ``git <args>`` under ``cwd``.

    Hermetic: the fixture that creates the repo configures
    ``user.email`` / ``user.name`` / ``commit.gpgsign`` via
    ``git config`` so this helper does not depend on the developer's
    global git config.
    """

5. Argparse namespace leaks private-underscore attrs (src/gaia/coder/cli.py:1157, :1192, :1211, :1408)

Stashing the parent parser on args via set_defaults(_self_fix_parser=self_fix_parser, ...) works, but _-prefixed attributes on argparse.Namespace are unusual enough to surprise a reader. An alternative is subparsers.choices[subcommand].print_help() from the handler. Purely cosmetic — not a blocker — but worth a follow-up if the pattern grows further.

6. _canned_gh_runner signature accepts but ignores check kwarg (tests/coder/test_integration_e2e.py:122)

def runner(args, cwd=None, check=True):

The stub never honours check=True (it always returns success), so a bug where real code drops check=True wouldn't surface in these tests. Minor — acceptable for now because every callsite does want success — but worth noting if this mock gets reused for negative paths.

7. _noop_status_provider could drop the type: ignore (src/gaia/coder/cli.py:843)

def _noop_status_provider() -> dict:
    """Return an empty per-corpus status map.

    Used when no live RAG backend is bound to the CLI. The resulting
    :class:`RagStatusReport` reports every corpus as having zero
    documents and an unknown ``last_indexed_at`` — the watchdog then
    fires at ``severity='warn'`` because the contract expects at least
    one successful reindex within the configured window.
    """
    return {}

Strengths

Docstring discipline. The module-level inventory in cli.py (the "Subcommand inventory" block with wired handlers in bold) is exactly the right level of detail — it lets a reader understand the Phase-11 convergence in 60 seconds without reading the 1400-line file. Same for every handler: the dual-mode _handle_ask and the Phase-2 stub artifact writer both explain why the stub is clearly marked stub: true (so a later real-daemon run can't be mistaken for a stub run).
The ARTIFACT_FILENAMES module-level constant with the "kept as a module-level constant so runner + daemon + scorer all agree on the list" comment is a great small architectural choice that prevents list-drift between three callers.
The e2e tests genuinely exercise contracts, not implementations. Assertions live on files-on-disk, DB rows, audit rows, and state-machine transitions — exactly the right granularity for integration coverage. The mocks are narrow (LLM client, gh runner, review-gate runner) and every boundary is explicit.
Re-enabling the Phase-2 and Phase-6 tests (the conftest collect_ignore_glob removal + the two @pytest.mark.skip lifts) turns skipped tests back into active coverage — 26 previously-skipped tests now run every CI pass. Good hygiene.

Verdict

Approve with suggestions — no blocking issues. Issues 1 and 2 are small but worth fixing in a follow-up commit (the first is a latent test-correctness bug; the second is dead code that reads as though it's doing something). Issue 3 is a Phase-2-stub concern that can wait for the production swap. Everything else is minor polish.

…view pass) (#834) ## Summary Final cleanup pass to complete the `coder` branch for EM testing. Five Important + three Minor findings across the Phase 5/6/11 auto-reviews. All 395 tests pass. ## Changes - `test_self_fix/test_cli.py` — `pytest.raises` so a silent-pass regression in argparse can't pass the test. [#825, #829] - `test_integration_e2e.py` — real `PATH` prepend via `monkeypatch.setenv` instead of a no-op assignment that leaked env. [#829] - `test_fixes_827_828.py` — drop unused `Path` import. [#832] - `loop_driver.py` — narrow broad `except Exception` around `review_gate` and `notify_em` to `(RuntimeError, CalledProcessError, OSError)`. Programming errors now surface per CLAUDE.md fail-loudly. [#825] - `loop_driver.py` + `verifier.py` — `_append_notes` / `_append_note` raise `ValueError` on corrupted or wrong-type `notes_json` instead of silently replacing with `[]`. [#825] ## Test plan - [x] `pytest tests/coder/ tests/eval/` — 395/395 pass

kovtcharov added 3 commits April 20, 2026 03:39

github-actions Bot added the tests Test changes label Apr 20, 2026

kovtcharov marked this pull request as ready for review April 20, 2026 10:57

kovtcharov requested a review from kovtcharov-amd as a code owner April 20, 2026 10:57

kovtcharov merged commit c228c93 into coder Apr 20, 2026
11 checks passed

kovtcharov deleted the feature/gaia-coder-p11 branch April 20, 2026 10:57

kovtcharov mentioned this pull request Apr 20, 2026

fix(coder): address #825 + #829 + #832 auto-review findings (final review pass) #834

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(coder): Phase 11 — CLI unification + end-to-end integration tests#829

feat(coder): Phase 11 — CLI unification + end-to-end integration tests#829
kovtcharov merged 3 commits into
coderfrom
feature/gaia-coder-p11

kovtcharov commented Apr 20, 2026

Uh oh!

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kovtcharov commented Apr 20, 2026

Summary

Threads

Subcommands in gaia-coder --help (22 total)

Previously-skipped tests now passing (26)

Test plan

Do-not-merge

Uh oh!

Uh oh!

github-actions Bot commented Apr 20, 2026

Summary

Issues Found

🟡 Important

🟢 Minor

Strengths

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Subcommands in `gaia-coder --help` (22 total)