Skip to content

feat(coder): self-correction loop (§7.3-§7.9 + continuous critique)#825

Merged
kovtcharov merged 11 commits into
coderfrom
feature/gaia-coder-self-fix
Apr 20, 2026
Merged

feat(coder): self-correction loop (§7.3-§7.9 + continuous critique)#825
kovtcharov merged 11 commits into
coderfrom
feature/gaia-coder-self-fix

Conversation

@kovtcharov
Copy link
Copy Markdown
Collaborator

Summary

Phase 6 implements gaia-coder's self-correction loop — the core value proposition. When the EM gives feedback, the agent now triages it into one of eight fix classes, localises the cause, drafts a regression-tested plan, applies a fix on an auto/gaia-coder/<feedback_id> branch, and opens a draft PR targeting coder. Wires §7.2 continuous critique, §7.3 feedback intake, §7.4 loop steps 1-10, and §7.9 verification. Loosely coupled to Phase 4 (review gate) and Phase 5 (trust inbox) so they can land in any order.

Threads

  • Prompt templates (§15.8 P1/P2/P3). prompts/triage.md, prompts/critique.md, and prompts/plan_review.md — the three canonical text bodies the loop consumes. Stored as plain Markdown so she can edit them via prompt-class self-fix PRs.
  • Triage (§7.4 step 1-2). classify_fix_class runs P1 on Opus 4.7; < 60 confidence is rewritten to out-of-scope so the loop never commits to a guess. localise is deterministic grep — no LLM.
  • Planner (§5.1 Stage 3 / §7.4 step 3). draft_plan + is_large_job + request_em_approval. Large jobs (> 200 LoC, or architectural / state-machine, or cross-mixin) post the P3 message to the EM inbox and wait for ✅.
  • Fixer (§7.4 steps 4-5). generate_fix creates the self-fix branch and applies edits; write_regression_test emits a pytest file + marker flag so it genuinely fails on coder and passes on the fix branch (no clever mocks); verify_test_differential raises on the pass-on-both or fail-on-both failure modes.
  • Publisher (§7.4 steps 7-8). open_self_fix_pr refuses to open without a regression test (§7.4 step 5 hard rule). PR body cites feedback_id and quotes the EM's wording verbatim — Pass 7 (feedback-binding) depends on both.
  • Verifier (§7.4 step 10). verify_on_merge re-runs the regression test on the merged SHA, transitions the feedback row to verified, and writes failure_patterns + review_patterns memory records so the same symptom is recognised next time.
  • Continuous critique (§7.2). Cheap one-shot Opus call after every state-changing tool; findings < 60 confidence are dropped; ≥ 80 surface inline for pre-transition action.
  • Loop driver. FeedbackLoopDriver.process_pending_feedback() orchestrates the whole thing with full pending → triaged → in-fix → fix-pr-open → verified | rejected → closed transitions written to feedback.notes_json.
  • SelfFixToolsMixin. Registers 10 tools (well above the §15.2 ≥ 7 contract) so the loop is callable from the agent's tool registry. Phase-7 tools (classify_failure, pause_current_task, restart_self, edit_self_file) are intentionally out of scope.
  • CLI. gaia-coder feedback "<body>" --severity high --on <url> enqueues, gaia-coder self-fix process runs one iteration.
  • Tests. 47 new tests covering every §7.4 step, every §7.2 filtering rule, and the §15.2 mixin contract. Mocks Anthropic and gh at the callable boundary; uses a tmp git repo with a coder branch for real branch creation and pytest differential runs.

Out of scope (explicit non-goals per the Phase 6 task)

  • EventBridge ingestion of @gaia-coder feedback: comments (Phase 10 repo binding).
  • Auto-merge (§7.6) — blocked on Phase 4 review-gate confidence score.
  • Dev-mode self-edit (§7.5) and restart_self — Phase 7.
  • ReAct loop self-edit (§7.8) — Phase 7.

Dependencies

Test plan

  • pytest tests/coder/test_self_fix/ -xvs — 47 tests pass.
  • pytest tests/coder/ -q — full coder suite (120 tests) pass.
  • python util/lint.py --all — no new critical errors (pre-existing pylint warnings in cli.py, discovery.py, etc. are not touched).
  • python -c "from gaia.coder.self_fix import SelfFixToolsMixin; m = SelfFixToolsMixin(); assert len(m.register_self_fix_tools()) >= 7" — smoke.
  • gaia-coder feedback "..." --severity high --on https://github.com/amd/gaia/pull/9999 --id fb-test --db-path /tmp/fb.db — writes a row; follow with gaia-coder self-fix process --db-path /tmp/fb.db --repo-root <worktree> --skip-differential-verify --skip-fix-apply for the end-to-end path (PR creation mocked).

Merge plan

  • Draft PR, do not merge. Rebase onto coder after Phase 4 and Phase 5 land so the review gate and inbox wiring become real imports.

@github-actions github-actions Bot added the tests Test changes label Apr 20, 2026
@kovtcharov kovtcharov force-pushed the feature/gaia-coder-self-fix branch from 193ed2b to 2b7791e Compare April 20, 2026 09:49
@kovtcharov kovtcharov marked this pull request as ready for review April 20, 2026 09:49
@kovtcharov kovtcharov changed the base branch from coder to main April 20, 2026 09:49
@kovtcharov kovtcharov changed the base branch from main to coder April 20, 2026 09:49
…view)

Materialise the three prompt templates the self-correction loop consumes:

* triage.md — per-feedback fix-class classifier (§15.8 P1).
* critique.md — continuous-critique hook (§15.8 P2).
* plan_review.md — deterministically-templated EM inbox message for
  large-job approval asks (§15.8 P3, not an LLM prompt).

All three live under src/gaia/coder/prompts/ and are loaded by the
self-fix modules via simple double-brace string substitution. The file
format is friendly for humans to edit so she can self-edit her own
prompts via prompt-class self-fix PRs (§6.4, §15.8).
…eps 1-2)

classify_fix_class runs the P1 prompt on Opus 4.7 (temperature=0) and
returns a FixClassResult with the eight-label fix-class (prompt | doc |
test | tool | policy | architectural | state-machine | out-of-scope),
root-cause hypothesis, candidate files, and a confidence score.

Conservatism gate: when confidence < 60 the classifier's fix_class is
rewritten to out-of-scope and `escalated_low_confidence` is set to
True — so the loop never commits to a guess (§7.2 low-confidence rule).

The LLM call site is abstracted via a TriageClient protocol so tests can
inject a mock without depending on anthropic. Fail-loudly (CLAUDE.md) on
JSON parse errors, unknown fix_class values, or out-of-range confidence.

localise is deterministic — no LLM. It grep-scans the triage-proposed
candidate_files for either an explicit `path:start-end` range or for
keywords the driver extracts from the feedback body. Missing files log
at INFO and are skipped; the classifier was guessing after all.
…approval

Implements §5.1 Stage 3 planning for the self-correction loop:

* draft_plan() synthesises a Plan dataclass from triage + localisation
  output — root cause, proposed change, regression-test sketch, LoC
  estimate, alternatives considered, cost envelope.
* is_large_job() triggers the EM-review gate when total LoC > 200, when
  fix_class is architectural or state-machine, or when the plan spans
  multiple mixin directories under src/gaia/coder/.
* request_em_approval() renders the P3 message and writes it to the EM
  inbox. Loose-coupled to Phase 5's gaia.coder.trust.inbox: if the module
  is importable we use it; otherwise we log WARN and return a deferred
  ApprovalRequest so the caller can retry.

MAX_PLAN_REFINEMENT_ROUNDS is exported so the driver can enforce the
three-round cap from §5.1 Stage 3's plan_refine row.
…rential

Implements §7.4 steps 4-5 end-to-end:

* generate_fix() creates auto/gaia-coder/<feedback_id> off the base ref
  (default 'coder' — §5.6, never main), then walks the caller-supplied
  EditHunks via an _edit_file_impl free function that mirrors the exact
  semantics of FileToolsMixin.edit_file from PR #818 (unique-match,
  opt-in replace_all, fail-loudly on missing). The branch is idempotent:
  re-runs check out the existing branch rather than error.
* write_regression_test() emits a pytest file plus a marker flag so the
  generated test genuinely fails on the base branch (marker missing)
  and passes on the fix branch (marker present) — the §7.4 step 5
  contract without needing any cleverness. Refuses to run unless the
  working tree is on a SELF_FIX_BRANCH_PREFIX branch.
* verify_test_differential() runs pytest on both refs and raises
  RuntimeError if the fail-then-pass contract is violated (pass on
  both → regression not exercised; fail on both → fix does not fix).

sys.executable is used for subprocess pytest invocations so hosts
without a `python` symlink on PATH still work.
…steps 7-8)

Opens a **draft** self-fix PR against the coder integration branch via
`gh pr create` (shells out — GitHubToolsMixin is Phase-10 work). Hard
rules enforced at the module boundary:

* regression_test_path must be non-empty; §7.4 step 5 forbids opening a
  self-fix PR without a regression test. Raises ValueError.
* plan.feedback_id must match the feedback_id argument; a mismatched
  binding would defeat Pass 7 (feedback-binding) before it even runs.
* base branch defaults to 'coder' — never main (§5.6).

compose_pr_body() cites feedback_id explicitly and block-quotes the
EM's feedback verbatim (§15.8 P7 requirements #3). The review-pass
table is templated from an optional ReviewGateResult; when the Phase-4
gate isn't wired yet we emit a "(review gate not available)" stub.

notify_em() routes to `gh pr comment` or `gh issue comment` based on
the feedback context_url; missing/unrecognised URLs log a WARN and
return a deferred marker so the loop driver can retry.
…es (§7.4 step 10)

Triggered by the PR-merged-where-author=self EventBridge event (the
wiring itself is Phase-10 work). Re-runs the regression test on the
merged SHA; on green it (a) transitions the feedback row to 'verified',
(b) writes a failure_patterns memory row so similar symptoms are
recognised next time, (c) writes a review_patterns row so Pass 6 can
consult prior review outcomes for the same fix-class.

Fail-loudly: if the regression test is red on the merged SHA we raise
RuntimeError rather than silently close the feedback — a green-merge /
red-regression gap is a correctness alarm that must page the EM.

Reuses the SQLite stores from Phase 2 (gaia.coder.stores.feedback and
gaia.coder.stores.memory) directly — that task landed before this one.
…critique

Single-turn Opus 4.7 call, runs after every state-changing tool call
(edit / write_file / run_cli_command) and returns at most one
actionable finding. Filtering rules from §7.2:

* findings with confidence < 60 are dropped entirely (low-confidence
  critiques are noise),
* findings in 60-79 are kept for self_review batch consideration,
* findings ≥ 80 land in CritiqueResult.high_confidence_findings for
  inline action before the next state transition.

Cost framing: ~15 critique calls per 50-turn task at ~$0.02 each =
~$0.30 per task added, well under the §6.6 ceiling. Alternative is
Pass-3 discovery of 200 lines that need to be thrown away.
Tests for every §7.4 step, every §7.2 filtering rule, and the §15.2
tool-registration contract:

* test_triage.py — 13 tests (8 parametrised fix-class cases, low-conf
  escalation, unknown-class rejection, localise range/keywords/missing).
* test_planner.py — 9 tests (threshold, fix-class forced-large,
  cross-mixin detection, draft from hits, refinement cap, EM-inbox
  deferral path + injected writer path).
* test_fixer.py — 6 tests (fix lands on auto/gaia-coder/<fid> not
  coder, empty-edits rejected, write_regression_test refuses coder
  branch, fail-then-pass contract both directions).
* test_publisher.py — 6 tests (regression-test-required ValueError,
  PR body cites fid and quotes EM wording, --draft --base coder argv,
  mismatched ids rejected, notify_em context-aware routing).
* test_verifier.py — 2 tests (writes failure_patterns + review_patterns
  memory rows on green; raises RuntimeError on red merged-SHA run).
* test_continuous_critique.py — 4 tests (< 60 suppressed, ≥ 80 kept,
  empty response legal, 60/80 threshold constants).
* test_loop_driver.py — 3 tests (end-to-end seeded feedback →
  fix-pr-open, out-of-scope → rejected, no-pending).
* test_mixin.py — 1 test (SelfFixToolsMixin registers ≥ 7 tools).
* test_cli.py — 3 tests (feedback enqueues, invalid severity rejected,
  self-fix without action prints help).

Uses a tmp_git_repo fixture with pre-seeded coder branch + buggy
sample file, and feedback_db_path / memory_db_path fixtures that open
real SQLite stores under tmp_path. Anthropic / gh CLI are mocked at
the callable boundary — no network.
FeedbackLoopDriver.process_pending_feedback() pulls one pending row
from feedback.db and walks it through the §7.3 state machine:

    pending → triaged → in-fix → fix-pr-open → verified | rejected → closed

Each step is a thin wrapper around its dedicated module (triage /
planner / fixer / publisher / verifier) so tests can mock one stage
without monkey-patching internals. All transitions are appended to
feedback.notes_json so the full history is recoverable from the row.

Loose coupling (per the Phase 6 task spec):
* Phase 4 (review gate) — if review_gate_runner is supplied, its
  ReviewGateResult is attached to the PR body; absent, we log a WARN
  and open the PR with a placeholder passes table.
* Phase 5 (trust inbox) — request_em_approval() uses the trust module
  if importable, otherwise the large-job ASK is deferred.

SelfFixToolsMixin registers **10** loop-level tools on an agent, which
is comfortably above the smoke-test ≥ 7 bar from the task spec:
triage_feedback, localise_feedback, draft_fix_plan, is_plan_large_job,
apply_self_fix, write_self_fix_regression_test, verify_differential,
publish_self_fix_pr, record_self_fix_pr, critique_turn_output.

Phase-7 tools (classify_failure, pause_current_task, resume_task,
restart_self, edit_self_file, bump_loop_version) are explicitly NOT
registered here — they land with DevModeToolsMixin alongside §7.5.

The self_fix package's __init__.py replaces the Phase-1 stub and
re-exports the public surface used by the CLI and downstream agents.
@kovtcharov kovtcharov force-pushed the feature/gaia-coder-self-fix branch from 2b7791e to d0d1e3b Compare April 20, 2026 09:52
@kovtcharov kovtcharov merged commit 96754af into coder Apr 20, 2026
6 checks passed
@kovtcharov kovtcharov deleted the feature/gaia-coder-self-fix branch April 20, 2026 09:52
@github-actions
Copy link
Copy Markdown
Contributor

Summary

This PR lands a thorough, well-tested Phase 6 self-correction loop for gaia-coder: triage → plan → fix → regression-test → publish → verify, wired together by FeedbackLoopDriver and surfaced as a SelfFixToolsMixin with 10 registered tools. Module boundaries are clean, LLM call sites are abstracted behind Protocols for easy mocking, and 47 tests cover every §7.4 step plus the §7.2 critique filters. The single most important thing: several except Exception sites in loop_driver.py tension against the "No silent fallbacks — Fail loudly" rule in CLAUDE.md; the comments explain the intent (loose coupling to sibling phases), but the exception scopes can be tightened without losing that.

Issues Found

🟡 Important

Broad except Exception in loop driver — tighten scope (src/gaia/coder/self_fix/loop_driver.py:1330-1337, 1382-1386)
Two call sites swallow any exception from the review gate runner and notify_em. Even with the "loose coupling" rationale, this catches programming errors (AttributeError, TypeError) alongside the expected RuntimeError/subprocess.CalledProcessError. CLAUDE.md specifically calls this out under "No Silent Fallbacks". Narrowing the catch preserves loose coupling while letting genuine bugs fail loudly.

        # --- Step 6: review gate (loose-coupled to Phase 4) ----------------
        review_gate_result: Optional[ReviewGateResult] = None
        if self._review_gate_runner is not None:
            try:
                review_gate_result = self._review_gate_runner(
                    diff=drive.diff,
                    plan=plan,
                    feedback_body=row.body,
                )
                drive.notes.append(f"review gate overall={review_gate_result.overall}")
            except (RuntimeError, subprocess.CalledProcessError) as exc:
                # Loose coupling: a Phase-4 runtime problem must not block a
                # Phase-6 drive. Programming errors still surface.
                logger.warning(
                    "review gate runner failed for feedback %s: %s",
                    row.id,
                    exc,
                )

Same pattern for the notify_em block at line 1382.

_append_notes silently discards corrupted JSON (src/gaia/coder/self_fix/loop_driver.py:1068-1077, also verifier.py:3327-3336)
If notes_json is malformed (or somehow a non-list JSON value), the helper silently replaces it with [] — losing the prior audit trail. Given notes_json is the canonical state history per the module docstring ("full history is recoverable from the row alone"), silent replacement is exactly the kind of masked regression CLAUDE.md warns against. Either raise, or log-and-preserve-raw.

def _append_notes(existing: str, entry: Mapping[str, Any]) -> str:
    """Append a state-transition note to the feedback row's JSON array."""
    if not existing:
        parsed: list = []
    else:
        try:
            parsed = json.loads(existing)
        except json.JSONDecodeError as exc:
            raise ValueError(
                f"feedback notes_json is corrupted and cannot be extended: {exc}"
            ) from exc
        if not isinstance(parsed, list):
            raise ValueError(
                f"feedback notes_json must be a JSON array, got {type(parsed).__name__}"
            )
    parsed.append(dict(entry))
    return json.dumps(parsed)

CLI test uses bare try/except SystemExit — test can silently pass (tests/coder/test_self_fix/test_cli.py:3714-3728)
If argparse doesn't raise SystemExit for an invalid severity (regression in argparse wiring), the test passes with no assertion. pytest.raises makes the intent explicit.

def test_cli_feedback_rejects_invalid_severity(capsys) -> None:
    """argparse enforces the severity enum."""
    with pytest.raises(SystemExit) as excinfo:
        coder_cli.main(
            [
                "feedback",
                "body",
                "--severity",
                "nonsense",
            ]
        )
    assert excinfo.value.code != 0

_default_edit_hunks fallback can raise on non-unique snippets (src/gaia/coder/self_fix/loop_driver.py:1415-1445)
The fallback uses first.snippet (a single grepped line) as old_string for _edit_file_impl, which requires unique match. A common case — e.g. a blank-ish comment line or any repeated snippet — will raise ValueError: old_string not unique. This is documented as a test-only fallback, but worth either (a) using replace_all=False + line-range-anchored replacement, or (b) adding replace_all=True for the fallback path so end-to-end tests don't become flaky on seed-file edits.

🟢 Minor

_edit_file_impl duplicates FileToolsMixin logic (src/gaia/coder/self_fix/fixer.py:622-645)
The module docstring explains this is intentional to avoid mixin instantiation, but any future semantic change to edit_file in PR #818's FileToolsMixin creates drift. A follow-up could add a pytest guarding the two implementations with shared fixtures, or lift the implementation into a plain function in the mixin module that both import.

Unused parameters carry pylint: disable (fixer.py:655, 788, 2655, loop_driver.py:1419)
Several functions accept parameters the PR doesn't use yet (fix_class, changed_files). The docstrings explain they're kept for Phase 7 — that's reasonable, but consider prefixing with _ (e.g. _fix_class) to make "intentionally unused" visible without the lint disable.

context_url=None templated as literal "null" string (src/gaia/coder/self_fix/triage.py:2941)
The triage prompt receives "null" (the string) when context_url is None, not a JSON null or empty. Harmless for an LLM, but slightly misleading when eyeballing rendered prompts — consider "(none)" or an empty string.

Unused import in loop_driver test (tests/coder/test_self_fix/test_loop_driver.py:4083)
from typing import Callable is imported but never referenced in the test file (the _triage_client / _gh_runner_factory functions don't annotate returns).

Strengths

  • Excellent module decomposition. Each §7.4 stage is one module, one concern, with a dataclass result type — trivial to unit-test and swap. Protocol-based LLM abstraction (TriageClient, CritiqueClient) is the right pattern.
  • Hard-rule enforcement where it matters. open_self_fix_pr refuses to publish without a regression test (publisher.py:2650-2655), and cross-checks plan.feedback_id against the feedback_id argument — both guards have tests (test_regression_test_is_required, test_open_self_fix_pr_rejects_mismatched_ids). Subprocess calls use list-form args throughout, no shell injection surface.
  • Fail-loudly where it counts. Post-merge verifier raises RuntimeError on a red regression test (verifier.py:3395-3404), and verify_test_differential raises on both fail-on-both and pass-on-both — preventing the exact "regression test doesn't actually exercise the bug" failure mode this loop is built to catch.
  • Well-crafted test fixtures. tmp_git_repo gives tests a real git repo with a seeded coder branch, enabling genuine fail-then-pass differential verification without mocks. The _default_test_body marker-flag pattern is clever and honest about the fact that it's not a real regression repro.

Verdict

Approve with suggestions. No blocking issues; the PR is already merged and represents a solid Phase-6 landing. The 🟡 items above (broad exception handlers, silent _append_notes recovery, CLI test pattern) are worth picking up as follow-ups — each is a small, bounded change that tightens adherence to the CLAUDE.md fail-loudly discipline without disturbing the architecture.

kovtcharov added a commit that referenced this pull request Apr 20, 2026
…view pass) (#834)

## Summary

Final cleanup pass to complete the `coder` branch for EM testing. Five
Important + three Minor findings across the Phase 5/6/11 auto-reviews.
All 395 tests pass.

## Changes

- `test_self_fix/test_cli.py` — `pytest.raises` so a silent-pass
regression in argparse can't pass the test. [#825, #829]
- `test_integration_e2e.py` — real `PATH` prepend via
`monkeypatch.setenv` instead of a no-op assignment that leaked env.
[#829]
- `test_fixes_827_828.py` — drop unused `Path` import. [#832]
- `loop_driver.py` — narrow broad `except Exception` around
`review_gate` and `notify_em` to `(RuntimeError, CalledProcessError,
OSError)`. Programming errors now surface per CLAUDE.md fail-loudly.
[#825]
- `loop_driver.py` + `verifier.py` — `_append_notes` / `_append_note`
raise `ValueError` on corrupted or wrong-type `notes_json` instead of
silently replacing with `[]`. [#825]

## Test plan

- [x] `pytest tests/coder/ tests/eval/` — 395/395 pass
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant