feat(coder): self-correction loop (§7.3-§7.9 + continuous critique) by kovtcharov · Pull Request #825 · amd/gaia

kovtcharov · 2026-04-20T09:45:14Z

Summary

Phase 6 implements gaia-coder's self-correction loop — the core value proposition. When the EM gives feedback, the agent now triages it into one of eight fix classes, localises the cause, drafts a regression-tested plan, applies a fix on an auto/gaia-coder/<feedback_id> branch, and opens a draft PR targeting coder. Wires §7.2 continuous critique, §7.3 feedback intake, §7.4 loop steps 1-10, and §7.9 verification. Loosely coupled to Phase 4 (review gate) and Phase 5 (trust inbox) so they can land in any order.

Threads

Prompt templates (§15.8 P1/P2/P3). prompts/triage.md, prompts/critique.md, and prompts/plan_review.md — the three canonical text bodies the loop consumes. Stored as plain Markdown so she can edit them via prompt-class self-fix PRs.
Triage (§7.4 step 1-2). classify_fix_class runs P1 on Opus 4.7; < 60 confidence is rewritten to out-of-scope so the loop never commits to a guess. localise is deterministic grep — no LLM.
Planner (§5.1 Stage 3 / §7.4 step 3). draft_plan + is_large_job + request_em_approval. Large jobs (> 200 LoC, or architectural / state-machine, or cross-mixin) post the P3 message to the EM inbox and wait for ✅.
Fixer (§7.4 steps 4-5). generate_fix creates the self-fix branch and applies edits; write_regression_test emits a pytest file + marker flag so it genuinely fails on coder and passes on the fix branch (no clever mocks); verify_test_differential raises on the pass-on-both or fail-on-both failure modes.
Publisher (§7.4 steps 7-8). open_self_fix_pr refuses to open without a regression test (§7.4 step 5 hard rule). PR body cites feedback_id and quotes the EM's wording verbatim — Pass 7 (feedback-binding) depends on both.
Verifier (§7.4 step 10). verify_on_merge re-runs the regression test on the merged SHA, transitions the feedback row to verified, and writes failure_patterns + review_patterns memory records so the same symptom is recognised next time.
Continuous critique (§7.2). Cheap one-shot Opus call after every state-changing tool; findings < 60 confidence are dropped; ≥ 80 surface inline for pre-transition action.
Loop driver. FeedbackLoopDriver.process_pending_feedback() orchestrates the whole thing with full pending → triaged → in-fix → fix-pr-open → verified | rejected → closed transitions written to feedback.notes_json.
SelfFixToolsMixin. Registers 10 tools (well above the §15.2 ≥ 7 contract) so the loop is callable from the agent's tool registry. Phase-7 tools (classify_failure, pause_current_task, restart_self, edit_self_file) are intentionally out of scope.
CLI. gaia-coder feedback "<body>" --severity high --on <url> enqueues, gaia-coder self-fix process runs one iteration.
Tests. 47 new tests covering every §7.4 step, every §7.2 filtering rule, and the §15.2 mixin contract. Mocks Anthropic and gh at the callable boundary; uses a tmp git repo with a coder branch for real branch creation and pytest differential runs.

Out of scope (explicit non-goals per the Phase 6 task)

EventBridge ingestion of @gaia-coder feedback: comments (Phase 10 repo binding).
Auto-merge (§7.6) — blocked on Phase 4 review-gate confidence score.
Dev-mode self-edit (§7.5) and restart_self — Phase 7.
ReAct loop self-edit (§7.8) — Phase 7.

Dependencies

feat(coder): File/CLI/Search mixins per §15.2 #818 (mixins): Imports edit_file semantics inline rather than instantiating FileToolsMixin, to keep the module usable without the mixin wired.
feat(coder): package skeleton + loop.py with DEFAULT_LOOP #819 (scaffold): Package skeleton + GAIA.md + prompts/ dir.
feat(coder): SQLite stores per §15.1 DDL #820 (stores): Uses gaia.coder.stores.feedback and gaia.coder.stores.memory directly.
Phase 4 review gate (sibling): Loose-coupled — review_gate_runner is optional; absent, we publish a "(review gate not available)" placeholder in the PR body.
Phase 5 trust (sibling): Loose-coupled — request_em_approval uses gaia.coder.trust.inbox.enqueue if importable, else defers.

Test plan

pytest tests/coder/test_self_fix/ -xvs — 47 tests pass.
pytest tests/coder/ -q — full coder suite (120 tests) pass.
python util/lint.py --all — no new critical errors (pre-existing pylint warnings in cli.py, discovery.py, etc. are not touched).
python -c "from gaia.coder.self_fix import SelfFixToolsMixin; m = SelfFixToolsMixin(); assert len(m.register_self_fix_tools()) >= 7" — smoke.
gaia-coder feedback "..." --severity high --on https://github.com/amd/gaia/pull/9999 --id fb-test --db-path /tmp/fb.db — writes a row; follow with gaia-coder self-fix process --db-path /tmp/fb.db --repo-root <worktree> --skip-differential-verify --skip-fix-apply for the end-to-end path (PR creation mocked).

Merge plan

Draft PR, do not merge. Rebase onto coder after Phase 4 and Phase 5 land so the review gate and inbox wiring become real imports.

…view) Materialise the three prompt templates the self-correction loop consumes: * triage.md — per-feedback fix-class classifier (§15.8 P1). * critique.md — continuous-critique hook (§15.8 P2). * plan_review.md — deterministically-templated EM inbox message for large-job approval asks (§15.8 P3, not an LLM prompt). All three live under src/gaia/coder/prompts/ and are loaded by the self-fix modules via simple double-brace string substitution. The file format is friendly for humans to edit so she can self-edit her own prompts via prompt-class self-fix PRs (§6.4, §15.8).

…eps 1-2) classify_fix_class runs the P1 prompt on Opus 4.7 (temperature=0) and returns a FixClassResult with the eight-label fix-class (prompt | doc | test | tool | policy | architectural | state-machine | out-of-scope), root-cause hypothesis, candidate files, and a confidence score. Conservatism gate: when confidence < 60 the classifier's fix_class is rewritten to out-of-scope and `escalated_low_confidence` is set to True — so the loop never commits to a guess (§7.2 low-confidence rule). The LLM call site is abstracted via a TriageClient protocol so tests can inject a mock without depending on anthropic. Fail-loudly (CLAUDE.md) on JSON parse errors, unknown fix_class values, or out-of-range confidence. localise is deterministic — no LLM. It grep-scans the triage-proposed candidate_files for either an explicit `path:start-end` range or for keywords the driver extracts from the feedback body. Missing files log at INFO and are skipped; the classifier was guessing after all.

…approval Implements §5.1 Stage 3 planning for the self-correction loop: * draft_plan() synthesises a Plan dataclass from triage + localisation output — root cause, proposed change, regression-test sketch, LoC estimate, alternatives considered, cost envelope. * is_large_job() triggers the EM-review gate when total LoC > 200, when fix_class is architectural or state-machine, or when the plan spans multiple mixin directories under src/gaia/coder/. * request_em_approval() renders the P3 message and writes it to the EM inbox. Loose-coupled to Phase 5's gaia.coder.trust.inbox: if the module is importable we use it; otherwise we log WARN and return a deferred ApprovalRequest so the caller can retry. MAX_PLAN_REFINEMENT_ROUNDS is exported so the driver can enforce the three-round cap from §5.1 Stage 3's plan_refine row.

…rential Implements §7.4 steps 4-5 end-to-end: * generate_fix() creates auto/gaia-coder/<feedback_id> off the base ref (default 'coder' — §5.6, never main), then walks the caller-supplied EditHunks via an _edit_file_impl free function that mirrors the exact semantics of FileToolsMixin.edit_file from PR #818 (unique-match, opt-in replace_all, fail-loudly on missing). The branch is idempotent: re-runs check out the existing branch rather than error. * write_regression_test() emits a pytest file plus a marker flag so the generated test genuinely fails on the base branch (marker missing) and passes on the fix branch (marker present) — the §7.4 step 5 contract without needing any cleverness. Refuses to run unless the working tree is on a SELF_FIX_BRANCH_PREFIX branch. * verify_test_differential() runs pytest on both refs and raises RuntimeError if the fail-then-pass contract is violated (pass on both → regression not exercised; fail on both → fix does not fix). sys.executable is used for subprocess pytest invocations so hosts without a `python` symlink on PATH still work.

…steps 7-8) Opens a **draft** self-fix PR against the coder integration branch via `gh pr create` (shells out — GitHubToolsMixin is Phase-10 work). Hard rules enforced at the module boundary: * regression_test_path must be non-empty; §7.4 step 5 forbids opening a self-fix PR without a regression test. Raises ValueError. * plan.feedback_id must match the feedback_id argument; a mismatched binding would defeat Pass 7 (feedback-binding) before it even runs. * base branch defaults to 'coder' — never main (§5.6). compose_pr_body() cites feedback_id explicitly and block-quotes the EM's feedback verbatim (§15.8 P7 requirements #3). The review-pass table is templated from an optional ReviewGateResult; when the Phase-4 gate isn't wired yet we emit a "(review gate not available)" stub. notify_em() routes to `gh pr comment` or `gh issue comment` based on the feedback context_url; missing/unrecognised URLs log a WARN and return a deferred marker so the loop driver can retry.

…es (§7.4 step 10) Triggered by the PR-merged-where-author=self EventBridge event (the wiring itself is Phase-10 work). Re-runs the regression test on the merged SHA; on green it (a) transitions the feedback row to 'verified', (b) writes a failure_patterns memory row so similar symptoms are recognised next time, (c) writes a review_patterns row so Pass 6 can consult prior review outcomes for the same fix-class. Fail-loudly: if the regression test is red on the merged SHA we raise RuntimeError rather than silently close the feedback — a green-merge / red-regression gap is a correctness alarm that must page the EM. Reuses the SQLite stores from Phase 2 (gaia.coder.stores.feedback and gaia.coder.stores.memory) directly — that task landed before this one.

…critique Single-turn Opus 4.7 call, runs after every state-changing tool call (edit / write_file / run_cli_command) and returns at most one actionable finding. Filtering rules from §7.2: * findings with confidence < 60 are dropped entirely (low-confidence critiques are noise), * findings in 60-79 are kept for self_review batch consideration, * findings ≥ 80 land in CritiqueResult.high_confidence_findings for inline action before the next state transition. Cost framing: ~15 critique calls per 50-turn task at ~$0.02 each = ~$0.30 per task added, well under the §6.6 ceiling. Alternative is Pass-3 discovery of 200 lines that need to be thrown away.

Tests for every §7.4 step, every §7.2 filtering rule, and the §15.2 tool-registration contract: * test_triage.py — 13 tests (8 parametrised fix-class cases, low-conf escalation, unknown-class rejection, localise range/keywords/missing). * test_planner.py — 9 tests (threshold, fix-class forced-large, cross-mixin detection, draft from hits, refinement cap, EM-inbox deferral path + injected writer path). * test_fixer.py — 6 tests (fix lands on auto/gaia-coder/<fid> not coder, empty-edits rejected, write_regression_test refuses coder branch, fail-then-pass contract both directions). * test_publisher.py — 6 tests (regression-test-required ValueError, PR body cites fid and quotes EM wording, --draft --base coder argv, mismatched ids rejected, notify_em context-aware routing). * test_verifier.py — 2 tests (writes failure_patterns + review_patterns memory rows on green; raises RuntimeError on red merged-SHA run). * test_continuous_critique.py — 4 tests (< 60 suppressed, ≥ 80 kept, empty response legal, 60/80 threshold constants). * test_loop_driver.py — 3 tests (end-to-end seeded feedback → fix-pr-open, out-of-scope → rejected, no-pending). * test_mixin.py — 1 test (SelfFixToolsMixin registers ≥ 7 tools). * test_cli.py — 3 tests (feedback enqueues, invalid severity rejected, self-fix without action prints help). Uses a tmp_git_repo fixture with pre-seeded coder branch + buggy sample file, and feedback_db_path / memory_db_path fixtures that open real SQLite stores under tmp_path. Anthropic / gh CLI are mocked at the callable boundary — no network.

FeedbackLoopDriver.process_pending_feedback() pulls one pending row from feedback.db and walks it through the §7.3 state machine: pending → triaged → in-fix → fix-pr-open → verified | rejected → closed Each step is a thin wrapper around its dedicated module (triage / planner / fixer / publisher / verifier) so tests can mock one stage without monkey-patching internals. All transitions are appended to feedback.notes_json so the full history is recoverable from the row. Loose coupling (per the Phase 6 task spec): * Phase 4 (review gate) — if review_gate_runner is supplied, its ReviewGateResult is attached to the PR body; absent, we log a WARN and open the PR with a placeholder passes table. * Phase 5 (trust inbox) — request_em_approval() uses the trust module if importable, otherwise the large-job ASK is deferred. SelfFixToolsMixin registers **10** loop-level tools on an agent, which is comfortably above the smoke-test ≥ 7 bar from the task spec: triage_feedback, localise_feedback, draft_fix_plan, is_plan_large_job, apply_self_fix, write_self_fix_regression_test, verify_differential, publish_self_fix_pr, record_self_fix_pr, critique_turn_output. Phase-7 tools (classify_failure, pause_current_task, resume_task, restart_self, edit_self_file, bump_loop_version) are explicitly NOT registered here — they land with DevModeToolsMixin alongside §7.5. The self_fix package's __init__.py replaces the Phase-1 stub and re-exports the public surface used by the CLI and downstream agents.

github-actions · 2026-04-20T09:55:19Z

Summary

This PR lands a thorough, well-tested Phase 6 self-correction loop for gaia-coder: triage → plan → fix → regression-test → publish → verify, wired together by FeedbackLoopDriver and surfaced as a SelfFixToolsMixin with 10 registered tools. Module boundaries are clean, LLM call sites are abstracted behind Protocols for easy mocking, and 47 tests cover every §7.4 step plus the §7.2 critique filters. The single most important thing: several except Exception sites in loop_driver.py tension against the "No silent fallbacks — Fail loudly" rule in CLAUDE.md; the comments explain the intent (loose coupling to sibling phases), but the exception scopes can be tightened without losing that.

Issues Found

🟡 Important

Broad except Exception in loop driver — tighten scope (src/gaia/coder/self_fix/loop_driver.py:1330-1337, 1382-1386)
Two call sites swallow any exception from the review gate runner and notify_em. Even with the "loose coupling" rationale, this catches programming errors (AttributeError, TypeError) alongside the expected RuntimeError/subprocess.CalledProcessError. CLAUDE.md specifically calls this out under "No Silent Fallbacks". Narrowing the catch preserves loose coupling while letting genuine bugs fail loudly.

        # --- Step 6: review gate (loose-coupled to Phase 4) ----------------
        review_gate_result: Optional[ReviewGateResult] = None
        if self._review_gate_runner is not None:
            try:
                review_gate_result = self._review_gate_runner(
                    diff=drive.diff,
                    plan=plan,
                    feedback_body=row.body,
                )
                drive.notes.append(f"review gate overall={review_gate_result.overall}")
            except (RuntimeError, subprocess.CalledProcessError) as exc:
                # Loose coupling: a Phase-4 runtime problem must not block a
                # Phase-6 drive. Programming errors still surface.
                logger.warning(
                    "review gate runner failed for feedback %s: %s",
                    row.id,
                    exc,
                )

Same pattern for the notify_em block at line 1382.

_append_notes silently discards corrupted JSON (src/gaia/coder/self_fix/loop_driver.py:1068-1077, also verifier.py:3327-3336)
If notes_json is malformed (or somehow a non-list JSON value), the helper silently replaces it with [] — losing the prior audit trail. Given notes_json is the canonical state history per the module docstring ("full history is recoverable from the row alone"), silent replacement is exactly the kind of masked regression CLAUDE.md warns against. Either raise, or log-and-preserve-raw.

def _append_notes(existing: str, entry: Mapping[str, Any]) -> str:
    """Append a state-transition note to the feedback row's JSON array."""
    if not existing:
        parsed: list = []
    else:
        try:
            parsed = json.loads(existing)
        except json.JSONDecodeError as exc:
            raise ValueError(
                f"feedback notes_json is corrupted and cannot be extended: {exc}"
            ) from exc
        if not isinstance(parsed, list):
            raise ValueError(
                f"feedback notes_json must be a JSON array, got {type(parsed).__name__}"
            )
    parsed.append(dict(entry))
    return json.dumps(parsed)

CLI test uses bare try/except SystemExit — test can silently pass (tests/coder/test_self_fix/test_cli.py:3714-3728)
If argparse doesn't raise SystemExit for an invalid severity (regression in argparse wiring), the test passes with no assertion. pytest.raises makes the intent explicit.

def test_cli_feedback_rejects_invalid_severity(capsys) -> None:
    """argparse enforces the severity enum."""
    with pytest.raises(SystemExit) as excinfo:
        coder_cli.main(
            [
                "feedback",
                "body",
                "--severity",
                "nonsense",
            ]
        )
    assert excinfo.value.code != 0

_default_edit_hunks fallback can raise on non-unique snippets (src/gaia/coder/self_fix/loop_driver.py:1415-1445)
The fallback uses first.snippet (a single grepped line) as old_string for _edit_file_impl, which requires unique match. A common case — e.g. a blank-ish comment line or any repeated snippet — will raise ValueError: old_string not unique. This is documented as a test-only fallback, but worth either (a) using replace_all=False + line-range-anchored replacement, or (b) adding replace_all=True for the fallback path so end-to-end tests don't become flaky on seed-file edits.

🟢 Minor

_edit_file_impl duplicates FileToolsMixin logic (src/gaia/coder/self_fix/fixer.py:622-645)
The module docstring explains this is intentional to avoid mixin instantiation, but any future semantic change to edit_file in PR #818's FileToolsMixin creates drift. A follow-up could add a pytest guarding the two implementations with shared fixtures, or lift the implementation into a plain function in the mixin module that both import.

Unused parameters carry pylint: disable (fixer.py:655, 788, 2655, loop_driver.py:1419)
Several functions accept parameters the PR doesn't use yet (fix_class, changed_files). The docstrings explain they're kept for Phase 7 — that's reasonable, but consider prefixing with _ (e.g. _fix_class) to make "intentionally unused" visible without the lint disable.

context_url=None templated as literal "null" string (src/gaia/coder/self_fix/triage.py:2941)
The triage prompt receives "null" (the string) when context_url is None, not a JSON null or empty. Harmless for an LLM, but slightly misleading when eyeballing rendered prompts — consider "(none)" or an empty string.

Unused import in loop_driver test (tests/coder/test_self_fix/test_loop_driver.py:4083)
from typing import Callable is imported but never referenced in the test file (the _triage_client / _gh_runner_factory functions don't annotate returns).

Strengths

Excellent module decomposition. Each §7.4 stage is one module, one concern, with a dataclass result type — trivial to unit-test and swap. Protocol-based LLM abstraction (TriageClient, CritiqueClient) is the right pattern.
Hard-rule enforcement where it matters. open_self_fix_pr refuses to publish without a regression test (publisher.py:2650-2655), and cross-checks plan.feedback_id against the feedback_id argument — both guards have tests (test_regression_test_is_required, test_open_self_fix_pr_rejects_mismatched_ids). Subprocess calls use list-form args throughout, no shell injection surface.
Fail-loudly where it counts. Post-merge verifier raises RuntimeError on a red regression test (verifier.py:3395-3404), and verify_test_differential raises on both fail-on-both and pass-on-both — preventing the exact "regression test doesn't actually exercise the bug" failure mode this loop is built to catch.
Well-crafted test fixtures. tmp_git_repo gives tests a real git repo with a seeded coder branch, enabling genuine fail-then-pass differential verification without mocks. The _default_test_body marker-flag pattern is clever and honest about the fact that it's not a real regression repro.

Verdict

Approve with suggestions. No blocking issues; the PR is already merged and represents a solid Phase-6 landing. The 🟡 items above (broad exception handlers, silent _append_notes recovery, CLI test pattern) are worth picking up as follow-ups — each is a small, bounded change that tightens adherence to the CLAUDE.md fail-loudly discipline without disturbing the architecture.

…view pass) (#834) ## Summary Final cleanup pass to complete the `coder` branch for EM testing. Five Important + three Minor findings across the Phase 5/6/11 auto-reviews. All 395 tests pass. ## Changes - `test_self_fix/test_cli.py` — `pytest.raises` so a silent-pass regression in argparse can't pass the test. [#825, #829] - `test_integration_e2e.py` — real `PATH` prepend via `monkeypatch.setenv` instead of a no-op assignment that leaked env. [#829] - `test_fixes_827_828.py` — drop unused `Path` import. [#832] - `loop_driver.py` — narrow broad `except Exception` around `review_gate` and `notify_em` to `(RuntimeError, CalledProcessError, OSError)`. Programming errors now surface per CLAUDE.md fail-loudly. [#825] - `loop_driver.py` + `verifier.py` — `_append_notes` / `_append_note` raise `ValueError` on corrupted or wrong-type `notes_json` instead of silently replacing with `[]`. [#825] ## Test plan - [x] `pytest tests/coder/ tests/eval/` — 395/395 pass

github-actions Bot added the tests Test changes label Apr 20, 2026

kovtcharov force-pushed the feature/gaia-coder-self-fix branch from 193ed2b to 2b7791e Compare April 20, 2026 09:49

kovtcharov marked this pull request as ready for review April 20, 2026 09:49

kovtcharov requested a review from kovtcharov-amd as a code owner April 20, 2026 09:49

kovtcharov changed the base branch from coder to main April 20, 2026 09:49

kovtcharov changed the base branch from main to coder April 20, 2026 09:49

kovtcharov added 11 commits April 20, 2026 02:50

fix(coder): self_fix package exports (post-cherry-pick)

03c642d

test(coder): skip Phase 6 CLI tests pending follow-up CLI-glue PR

d0d1e3b

kovtcharov force-pushed the feature/gaia-coder-self-fix branch from 2b7791e to d0d1e3b Compare April 20, 2026 09:52

kovtcharov merged commit 96754af into coder Apr 20, 2026
6 checks passed

kovtcharov deleted the feature/gaia-coder-self-fix branch April 20, 2026 09:52

kovtcharov mentioned this pull request Apr 20, 2026

fix(coder): address #825 + #829 + #832 auto-review findings (final review pass) #834

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(coder): self-correction loop (§7.3-§7.9 + continuous critique)#825

feat(coder): self-correction loop (§7.3-§7.9 + continuous critique)#825
kovtcharov merged 11 commits into
coderfrom
feature/gaia-coder-self-fix

kovtcharov commented Apr 20, 2026

Uh oh!

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kovtcharov commented Apr 20, 2026

Summary

Threads

Out of scope (explicit non-goals per the Phase 6 task)

Dependencies

Test plan

Merge plan

Uh oh!

Uh oh!

github-actions Bot commented Apr 20, 2026

Summary

Issues Found

🟡 Important

🟢 Minor

Strengths

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant