Skip to content

feat(coder): Phase 11 — CLI unification + end-to-end integration tests#829

Merged
kovtcharov merged 3 commits into
coderfrom
feature/gaia-coder-p11
Apr 20, 2026
Merged

feat(coder): Phase 11 — CLI unification + end-to-end integration tests#829
kovtcharov merged 3 commits into
coderfrom
feature/gaia-coder-p11

Conversation

@kovtcharov
Copy link
Copy Markdown
Collaborator

Summary

Unifies the Phase 5 gaia-coder CLI with every later phase's CLI bits under one argparse tree, then lands two hermetic end-to-end tests that prove the whole coder works as a system — from EM feedback submission through draft PR opening to merged-state verification. Phases 1–10 landed their modules but each phase's CLI glue arrived through a separate branch; rebasing them together produced conflicts that left feedback, daemon, dev-mode, debug, and rag subcommands unwired. This PR is the convergence point.

Threads

  • CLI unification (13d023e) — one argparse tree in src/gaia/coder/cli.py wires real handlers for every subcommand whose module has landed. ask dispatches on --sandbox to serve both the Phase-5 EM-inbox contract and the Phase-2 eval-harness daemon shim, so neither gets special-cased elsewhere. ARTIFACT_FILENAMES is re-exported so gaia.eval.runners.coder_cli keeps compiling.
  • Re-enabled tests (1aa4d41) — lifts the two @pytest.mark.skip gates on tests/coder/test_self_fix/test_cli.py and drops the collect_ignore_glob on tests/eval/conftest.py, so the Phase 6 feedback CLI and the Phase 2 eval-harness runner + GAIA-Internal-20 suite tests all run on every CI pass.
  • End-to-end integration tests (937e3e0) — new tests/coder/test_integration_e2e.py with two scenarios: the full feedback → triage → plan → fix → PR → merge → verified pipeline (23 assertions across state machine, git branch, audit log, memory writes), and the dev-mode self-heal sub-loop (§7.5: classify → pause → restart → resume round-trip). Every LLM / gh / subprocess boundary is mocked — no real API tokens required.

Subcommands in gaia-coder --help (22 total)

Wired to real handlers (15): trust, promote, demote, ask, note, critical, inbox, daemon, wait, stop, feedback, self-fix (+ process), dev-mode (+ enable / disable / status), debug (+ 7 scaffolded state verbs), rag (+ status / refresh / rebuild).

Deliberately deferred stubs (7): status, audit, spend, egress, introspect, skill, doctor.

Previously-skipped tests now passing (26)

  • tests/coder/test_self_fix/test_cli.py::test_cli_feedback_enqueues
  • tests/coder/test_self_fix/test_cli.py::test_cli_self_fix_subcommand_without_action_prints_help
  • All 7 tests in tests/eval/test_coder_cli_runner.py
  • All 16 tests in tests/eval/test_gaia_internal_20_suite.py
  • Plus 1 new test_cli_feedback_rejects_invalid_severity that was always active

Test plan

  • pytest tests/coder/ tests/eval/388 passed, 0 skipped
  • gaia-coder --help lists 22 subcommands covering Phases 2/4/5/6/7/8/10
  • gaia-coder trust still prints the §4.2 summary (Phase 5 regression)
  • gaia-coder feedback "test" --severity low writes a real row to feedback.db and returns JSON
  • gaia-coder self-fix process on an empty queue returns final_state=no-pending and exits 0
  • gaia-coder rag status renders the §6.9 contract (empty provider → watchdog fires as warn)
  • python -m flake8 on touched files — zero errors (pre-existing lint debt in unrelated files is out of scope)
  • Both new e2e tests run hermetically under tmp_path with mocked LLM / gh / audit boundaries

Do-not-merge

This is a draft pending upstream review. Rebase codermain lives in a separate PR.

Merges every phase's CLI bits into a single src/gaia/coder/cli.py so one
argparse tree owns the whole subcommand surface. Phase 5's trust / promote
/ demote / ask / note / critical / inbox verbs stay intact; every
subcommand whose real handler exists elsewhere is now wired:

- daemon / wait / stop + ARTIFACT_FILENAMES (Phase 2 stub daemon,
  required by gaia.eval.runners.coder_cli and the GAIA-Internal-20 suite)
- feedback (Phase 6, writes a real row to feedback.db via
  gaia.coder.stores.feedback) + self-fix process (runs one
  FeedbackLoopDriver.process_pending_feedback iteration; no-ops on
  empty queue)
- dev-mode enable / disable / status (Phase 7, over gaia.coder.dev_mode)
- debug repro|bisect|hypothesise|probe|localise|propose|postmortem
  (Phase 8 scaffold — DebugSubLoop is Python-only until the Phase 11
  production swap)
- rag status / refresh / rebuild (Phase 10, over
  gaia.coder.rag_freshness with a noop provider/runner until a real
  RAG backend is bound)

`ask` handles both Phase-5 (EM inbox question) and Phase-2 (daemon task
body) contracts by dispatching on `--sandbox` — Phase 2 uses `-` as the
body to read from stdin, Phase 5 joins `nargs='+'` positional args.
Remaining stubs (status / audit / spend / egress / introspect / skill /
doctor) print "not yet implemented" until their phases land.

The Phase 5 stub-list parametrization in tests/coder/test_cli_trust.py
is updated to reflect the new reality.
…tion

The two `@pytest.mark.skip(reason="Phase 6 CLI handlers deferred…")` gates
on tests/coder/test_self_fix/test_cli.py are lifted now that
`gaia-coder feedback` + `gaia-coder self-fix process` are wired. Likewise
tests/eval/conftest.py's `collect_ignore_glob` that silenced
test_coder_cli_runner.py and test_gaia_internal_20_suite.py is dropped —
the daemon / ask / wait / stop / ARTIFACT_FILENAMES surface they exercise
is now part of the unified CLI.

All 386 tests under tests/coder/ and tests/eval/ pass with zero skips.
… self-heal)

Adds tests/coder/test_integration_e2e.py with two hermetic scenarios that
exercise the whole coder as a system. Every external boundary — LLM
calls, `gh` CLI shell-outs, the ReAct audit hook — is injected via mock
callables so no real tokens or network traffic are needed.

`test_full_flow_feedback_to_fix` runs the §7.3 / §7.4 happy path:

  1. `gaia-coder trust --bootstrap --em-handle e2e-em …`
  2. `gaia-coder feedback "<body>" --severity high --on <url> …`
     writes a real pending row to feedback.db via the store module.
  3. FeedbackLoopDriver.process_pending_feedback drives triage → plan →
     fix → regression-test → review-gate → publish-PR with canned
     triage / gh / review stubs.
  4. Assertions cover every externally-visible contract: state machine
     transitions (pending → triaged → in-fix → fix-pr-open), branch
     `auto/gaia-coder/fb-e2e-1` exists, regression test file on disk,
     review gate overall=pass note, audit log rows for every stage.
  5. A real git merge + verify_on_merge (against the checked-in merged
     SHA) then transitions the row to `verified` and writes
     failure_patterns + review_patterns rows into memory.db.

`test_dev_mode_self_heal_e2e` runs the §7.5 sub-loop:

  1. A hermetic tmp git repo configured with an amd/gaia origin satisfies
     the §7.1 hard precondition, so dev_mode.enable_session succeeds.
  2. self_heal.classify_failure with a canned self-code response yields
     a high-confidence self-bug classification.
  3. pause_current_task snapshots the task to paused-tasks/<id>.json.
  4. restart_self(kind="prompt-only") hot-reloads without exiting.
  5. resume_task(delete_snapshot=True) round-trips the snapshot and
     verifies the file is consumed.
  6. The audit log contains rows for both dev_mode.enable_session and
     self_heal.restart_self.

Also folds in black auto-formatting fixes on src/gaia/coder/cli.py from
the first lint pass after the unification commit landed.

All 388 tests under tests/coder/ and tests/eval/ continue to pass.
@github-actions github-actions Bot added the tests Test changes label Apr 20, 2026
@kovtcharov kovtcharov marked this pull request as ready for review April 20, 2026 10:57
@kovtcharov kovtcharov merged commit c228c93 into coder Apr 20, 2026
11 checks passed
@kovtcharov kovtcharov deleted the feature/gaia-coder-p11 branch April 20, 2026 10:57
@github-actions
Copy link
Copy Markdown
Contributor

Summary

This PR converges the gaia-coder CLI surface from Phases 1-10 into a single argparse tree and lands two hermetic end-to-end integration tests. The work is well-scoped and the docstrings are genuinely excellent — every non-obvious handler explains why it exists and what contract it upholds. My main concerns are small: a no-op os.environ line in the new e2e test, a latent assertion bug in a re-enabled test, and some defensive hygiene around the Phase-2 daemon's PID-file lifecycle.

Issues Found

🟡 Important

1. test_cli_feedback_rejects_invalid_severity passes silently when argparse does not raise (tests/coder/test_self_fix/test_cli.py:55-68)

This test is one of the two that just had its @pytest.mark.skip lifted, so it's now load-bearing. But the try/except SystemExit pattern means that if argparse ever stops raising on the invalid enum (accepting nonsense), the test passes with zero assertions — the opposite of what's intended. Switch to pytest.raises:

def test_cli_feedback_rejects_invalid_severity(capsys) -> None:
    """argparse enforces the severity enum."""
    import pytest

    with pytest.raises(SystemExit) as exc_info:
        coder_cli.main(
            [
                "feedback",
                "body",
                "--severity",
                "nonsense",
            ]
        )
    assert exc_info.value.code != 0

2. Dead os.environ["PATH"] assignment in e2e test (tests/coder/test_integration_e2e.py:381)

os.environ["PATH"] = os.environ.get("PATH", "")

This is a no-op (reads PATH and writes the same value back). The surrounding comment says "set it explicitly on PATH too" for Windows CI, but the code doesn't actually prepend sys.executable's directory. Either:

  • Delete the line + comment if PATH already works (the CI evidence in the PR says it does), or
  • Actually add the interpreter dir:
    # Windows CI occasionally resolves `sys.executable` to a path that the
    # subprocess loses, so we ensure its directory is on PATH.
    import sys as _sys
    py_dir = str(Path(_sys.executable).parent)
    if py_dir not in os.environ.get("PATH", "").split(os.pathsep):
        os.environ["PATH"] = py_dir + os.pathsep + os.environ.get("PATH", "")

Either way, the current line contributes nothing and is misleading. Also note: this mutates os.environ without monkeypatch — the change leaks to other tests in the same session. Prefer the monkeypatch.setenv fixture used elsewhere in the file (gaia_coder_home already demonstrates the pattern).

3. Daemon PID-file lifecycle has a TOCTOU / PID-reuse window (src/gaia/coder/cli.py:402-415, :495-499)

_handle_daemon uses os.kill(int(existing_pid), 0) to check if the PID is live, and _handle_stop reads the PID and sends SIGTERM. If the recorded PID belongs to an unrelated process after reuse (crash + reboot + new process claims the same PID), stop will SIGTERM that unrelated process. For a Phase-2 stub this is low-probability in practice, but it's worth guarding by writing a start marker (e.g. boot-id or process-start-time) alongside the PID and validating it before signalling. Not a blocker for this PR, but flagging for the production swap.

🟢 Minor

4. _git helper docstring contradicts its body (tests/coder/test_integration_e2e.py:67-76)

Docstring: "every call passes an explicit user.email/name + gpgsign override". Body: plain subprocess.run(["git", *args], ...) — no overrides are passed per-call; they're set once via git config when the fixture creates the repo. Tighten the docstring so a future reader isn't misled:

def _git(cwd: Path, *args: str) -> subprocess.CompletedProcess:
    """Run ``git <args>`` under ``cwd``.

    Hermetic: the fixture that creates the repo configures
    ``user.email`` / ``user.name`` / ``commit.gpgsign`` via
    ``git config`` so this helper does not depend on the developer's
    global git config.
    """

5. Argparse namespace leaks private-underscore attrs (src/gaia/coder/cli.py:1157, :1192, :1211, :1408)

Stashing the parent parser on args via set_defaults(_self_fix_parser=self_fix_parser, ...) works, but _-prefixed attributes on argparse.Namespace are unusual enough to surprise a reader. An alternative is subparsers.choices[subcommand].print_help() from the handler. Purely cosmetic — not a blocker — but worth a follow-up if the pattern grows further.

6. _canned_gh_runner signature accepts but ignores check kwarg (tests/coder/test_integration_e2e.py:122)

def runner(args, cwd=None, check=True):

The stub never honours check=True (it always returns success), so a bug where real code drops check=True wouldn't surface in these tests. Minor — acceptable for now because every callsite does want success — but worth noting if this mock gets reused for negative paths.

7. _noop_status_provider could drop the type: ignore (src/gaia/coder/cli.py:843)

def _noop_status_provider() -> dict:
    """Return an empty per-corpus status map.

    Used when no live RAG backend is bound to the CLI. The resulting
    :class:`RagStatusReport` reports every corpus as having zero
    documents and an unknown ``last_indexed_at`` — the watchdog then
    fires at ``severity='warn'`` because the contract expects at least
    one successful reindex within the configured window.
    """
    return {}

Strengths

  • Docstring discipline. The module-level inventory in cli.py (the "Subcommand inventory" block with wired handlers in bold) is exactly the right level of detail — it lets a reader understand the Phase-11 convergence in 60 seconds without reading the 1400-line file. Same for every handler: the dual-mode _handle_ask and the Phase-2 stub artifact writer both explain why the stub is clearly marked stub: true (so a later real-daemon run can't be mistaken for a stub run).
  • The ARTIFACT_FILENAMES module-level constant with the "kept as a module-level constant so runner + daemon + scorer all agree on the list" comment is a great small architectural choice that prevents list-drift between three callers.
  • The e2e tests genuinely exercise contracts, not implementations. Assertions live on files-on-disk, DB rows, audit rows, and state-machine transitions — exactly the right granularity for integration coverage. The mocks are narrow (LLM client, gh runner, review-gate runner) and every boundary is explicit.
  • Re-enabling the Phase-2 and Phase-6 tests (the conftest collect_ignore_glob removal + the two @pytest.mark.skip lifts) turns skipped tests back into active coverage — 26 previously-skipped tests now run every CI pass. Good hygiene.

Verdict

Approve with suggestions — no blocking issues. Issues 1 and 2 are small but worth fixing in a follow-up commit (the first is a latent test-correctness bug; the second is dead code that reads as though it's doing something). Issue 3 is a Phase-2-stub concern that can wait for the production swap. Everything else is minor polish.

kovtcharov added a commit that referenced this pull request Apr 20, 2026
…view pass) (#834)

## Summary

Final cleanup pass to complete the `coder` branch for EM testing. Five
Important + three Minor findings across the Phase 5/6/11 auto-reviews.
All 395 tests pass.

## Changes

- `test_self_fix/test_cli.py` — `pytest.raises` so a silent-pass
regression in argparse can't pass the test. [#825, #829]
- `test_integration_e2e.py` — real `PATH` prepend via
`monkeypatch.setenv` instead of a no-op assignment that leaked env.
[#829]
- `test_fixes_827_828.py` — drop unused `Path` import. [#832]
- `loop_driver.py` — narrow broad `except Exception` around
`review_gate` and `notify_em` to `(RuntimeError, CalledProcessError,
OSError)`. Programming errors now surface per CLAUDE.md fail-loudly.
[#825]
- `loop_driver.py` + `verifier.py` — `_append_notes` / `_append_note`
raise `ValueError` on corrupted or wrong-type `notes_json` instead of
silently replacing with `[]`. [#825]

## Test plan

- [x] `pytest tests/coder/ tests/eval/` — 395/395 pass
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant