feat(coder): self-correction loop (§7.3-§7.9 + continuous critique)#825
Conversation
193ed2b to
2b7791e
Compare
…view) Materialise the three prompt templates the self-correction loop consumes: * triage.md — per-feedback fix-class classifier (§15.8 P1). * critique.md — continuous-critique hook (§15.8 P2). * plan_review.md — deterministically-templated EM inbox message for large-job approval asks (§15.8 P3, not an LLM prompt). All three live under src/gaia/coder/prompts/ and are loaded by the self-fix modules via simple double-brace string substitution. The file format is friendly for humans to edit so she can self-edit her own prompts via prompt-class self-fix PRs (§6.4, §15.8).
…eps 1-2) classify_fix_class runs the P1 prompt on Opus 4.7 (temperature=0) and returns a FixClassResult with the eight-label fix-class (prompt | doc | test | tool | policy | architectural | state-machine | out-of-scope), root-cause hypothesis, candidate files, and a confidence score. Conservatism gate: when confidence < 60 the classifier's fix_class is rewritten to out-of-scope and `escalated_low_confidence` is set to True — so the loop never commits to a guess (§7.2 low-confidence rule). The LLM call site is abstracted via a TriageClient protocol so tests can inject a mock without depending on anthropic. Fail-loudly (CLAUDE.md) on JSON parse errors, unknown fix_class values, or out-of-range confidence. localise is deterministic — no LLM. It grep-scans the triage-proposed candidate_files for either an explicit `path:start-end` range or for keywords the driver extracts from the feedback body. Missing files log at INFO and are skipped; the classifier was guessing after all.
…approval Implements §5.1 Stage 3 planning for the self-correction loop: * draft_plan() synthesises a Plan dataclass from triage + localisation output — root cause, proposed change, regression-test sketch, LoC estimate, alternatives considered, cost envelope. * is_large_job() triggers the EM-review gate when total LoC > 200, when fix_class is architectural or state-machine, or when the plan spans multiple mixin directories under src/gaia/coder/. * request_em_approval() renders the P3 message and writes it to the EM inbox. Loose-coupled to Phase 5's gaia.coder.trust.inbox: if the module is importable we use it; otherwise we log WARN and return a deferred ApprovalRequest so the caller can retry. MAX_PLAN_REFINEMENT_ROUNDS is exported so the driver can enforce the three-round cap from §5.1 Stage 3's plan_refine row.
…rential Implements §7.4 steps 4-5 end-to-end: * generate_fix() creates auto/gaia-coder/<feedback_id> off the base ref (default 'coder' — §5.6, never main), then walks the caller-supplied EditHunks via an _edit_file_impl free function that mirrors the exact semantics of FileToolsMixin.edit_file from PR #818 (unique-match, opt-in replace_all, fail-loudly on missing). The branch is idempotent: re-runs check out the existing branch rather than error. * write_regression_test() emits a pytest file plus a marker flag so the generated test genuinely fails on the base branch (marker missing) and passes on the fix branch (marker present) — the §7.4 step 5 contract without needing any cleverness. Refuses to run unless the working tree is on a SELF_FIX_BRANCH_PREFIX branch. * verify_test_differential() runs pytest on both refs and raises RuntimeError if the fail-then-pass contract is violated (pass on both → regression not exercised; fail on both → fix does not fix). sys.executable is used for subprocess pytest invocations so hosts without a `python` symlink on PATH still work.
…steps 7-8) Opens a **draft** self-fix PR against the coder integration branch via `gh pr create` (shells out — GitHubToolsMixin is Phase-10 work). Hard rules enforced at the module boundary: * regression_test_path must be non-empty; §7.4 step 5 forbids opening a self-fix PR without a regression test. Raises ValueError. * plan.feedback_id must match the feedback_id argument; a mismatched binding would defeat Pass 7 (feedback-binding) before it even runs. * base branch defaults to 'coder' — never main (§5.6). compose_pr_body() cites feedback_id explicitly and block-quotes the EM's feedback verbatim (§15.8 P7 requirements #3). The review-pass table is templated from an optional ReviewGateResult; when the Phase-4 gate isn't wired yet we emit a "(review gate not available)" stub. notify_em() routes to `gh pr comment` or `gh issue comment` based on the feedback context_url; missing/unrecognised URLs log a WARN and return a deferred marker so the loop driver can retry.
…es (§7.4 step 10) Triggered by the PR-merged-where-author=self EventBridge event (the wiring itself is Phase-10 work). Re-runs the regression test on the merged SHA; on green it (a) transitions the feedback row to 'verified', (b) writes a failure_patterns memory row so similar symptoms are recognised next time, (c) writes a review_patterns row so Pass 6 can consult prior review outcomes for the same fix-class. Fail-loudly: if the regression test is red on the merged SHA we raise RuntimeError rather than silently close the feedback — a green-merge / red-regression gap is a correctness alarm that must page the EM. Reuses the SQLite stores from Phase 2 (gaia.coder.stores.feedback and gaia.coder.stores.memory) directly — that task landed before this one.
…critique Single-turn Opus 4.7 call, runs after every state-changing tool call (edit / write_file / run_cli_command) and returns at most one actionable finding. Filtering rules from §7.2: * findings with confidence < 60 are dropped entirely (low-confidence critiques are noise), * findings in 60-79 are kept for self_review batch consideration, * findings ≥ 80 land in CritiqueResult.high_confidence_findings for inline action before the next state transition. Cost framing: ~15 critique calls per 50-turn task at ~$0.02 each = ~$0.30 per task added, well under the §6.6 ceiling. Alternative is Pass-3 discovery of 200 lines that need to be thrown away.
Tests for every §7.4 step, every §7.2 filtering rule, and the §15.2 tool-registration contract: * test_triage.py — 13 tests (8 parametrised fix-class cases, low-conf escalation, unknown-class rejection, localise range/keywords/missing). * test_planner.py — 9 tests (threshold, fix-class forced-large, cross-mixin detection, draft from hits, refinement cap, EM-inbox deferral path + injected writer path). * test_fixer.py — 6 tests (fix lands on auto/gaia-coder/<fid> not coder, empty-edits rejected, write_regression_test refuses coder branch, fail-then-pass contract both directions). * test_publisher.py — 6 tests (regression-test-required ValueError, PR body cites fid and quotes EM wording, --draft --base coder argv, mismatched ids rejected, notify_em context-aware routing). * test_verifier.py — 2 tests (writes failure_patterns + review_patterns memory rows on green; raises RuntimeError on red merged-SHA run). * test_continuous_critique.py — 4 tests (< 60 suppressed, ≥ 80 kept, empty response legal, 60/80 threshold constants). * test_loop_driver.py — 3 tests (end-to-end seeded feedback → fix-pr-open, out-of-scope → rejected, no-pending). * test_mixin.py — 1 test (SelfFixToolsMixin registers ≥ 7 tools). * test_cli.py — 3 tests (feedback enqueues, invalid severity rejected, self-fix without action prints help). Uses a tmp_git_repo fixture with pre-seeded coder branch + buggy sample file, and feedback_db_path / memory_db_path fixtures that open real SQLite stores under tmp_path. Anthropic / gh CLI are mocked at the callable boundary — no network.
FeedbackLoopDriver.process_pending_feedback() pulls one pending row
from feedback.db and walks it through the §7.3 state machine:
pending → triaged → in-fix → fix-pr-open → verified | rejected → closed
Each step is a thin wrapper around its dedicated module (triage /
planner / fixer / publisher / verifier) so tests can mock one stage
without monkey-patching internals. All transitions are appended to
feedback.notes_json so the full history is recoverable from the row.
Loose coupling (per the Phase 6 task spec):
* Phase 4 (review gate) — if review_gate_runner is supplied, its
ReviewGateResult is attached to the PR body; absent, we log a WARN
and open the PR with a placeholder passes table.
* Phase 5 (trust inbox) — request_em_approval() uses the trust module
if importable, otherwise the large-job ASK is deferred.
SelfFixToolsMixin registers **10** loop-level tools on an agent, which
is comfortably above the smoke-test ≥ 7 bar from the task spec:
triage_feedback, localise_feedback, draft_fix_plan, is_plan_large_job,
apply_self_fix, write_self_fix_regression_test, verify_differential,
publish_self_fix_pr, record_self_fix_pr, critique_turn_output.
Phase-7 tools (classify_failure, pause_current_task, resume_task,
restart_self, edit_self_file, bump_loop_version) are explicitly NOT
registered here — they land with DevModeToolsMixin alongside §7.5.
The self_fix package's __init__.py replaces the Phase-1 stub and
re-exports the public surface used by the CLI and downstream agents.
2b7791e to
d0d1e3b
Compare
SummaryThis PR lands a thorough, well-tested Phase 6 self-correction loop for Issues Found🟡 ImportantBroad Same pattern for the
CLI test uses bare
🟢 Minor
Unused parameters carry
Unused import in loop_driver test ( Strengths
VerdictApprove with suggestions. No blocking issues; the PR is already merged and represents a solid Phase-6 landing. The 🟡 items above (broad exception handlers, silent |
…view pass) (#834) ## Summary Final cleanup pass to complete the `coder` branch for EM testing. Five Important + three Minor findings across the Phase 5/6/11 auto-reviews. All 395 tests pass. ## Changes - `test_self_fix/test_cli.py` — `pytest.raises` so a silent-pass regression in argparse can't pass the test. [#825, #829] - `test_integration_e2e.py` — real `PATH` prepend via `monkeypatch.setenv` instead of a no-op assignment that leaked env. [#829] - `test_fixes_827_828.py` — drop unused `Path` import. [#832] - `loop_driver.py` — narrow broad `except Exception` around `review_gate` and `notify_em` to `(RuntimeError, CalledProcessError, OSError)`. Programming errors now surface per CLAUDE.md fail-loudly. [#825] - `loop_driver.py` + `verifier.py` — `_append_notes` / `_append_note` raise `ValueError` on corrupted or wrong-type `notes_json` instead of silently replacing with `[]`. [#825] ## Test plan - [x] `pytest tests/coder/ tests/eval/` — 395/395 pass
Summary
Phase 6 implements gaia-coder's self-correction loop — the core value proposition. When the EM gives feedback, the agent now triages it into one of eight fix classes, localises the cause, drafts a regression-tested plan, applies a fix on an
auto/gaia-coder/<feedback_id>branch, and opens a draft PR targetingcoder. Wires §7.2 continuous critique, §7.3 feedback intake, §7.4 loop steps 1-10, and §7.9 verification. Loosely coupled to Phase 4 (review gate) and Phase 5 (trust inbox) so they can land in any order.Threads
prompts/triage.md,prompts/critique.md, andprompts/plan_review.md— the three canonical text bodies the loop consumes. Stored as plain Markdown so she can edit them via prompt-class self-fix PRs.classify_fix_classruns P1 on Opus 4.7;< 60confidence is rewritten toout-of-scopeso the loop never commits to a guess.localiseis deterministic grep — no LLM.draft_plan+is_large_job+request_em_approval. Large jobs (> 200 LoC, orarchitectural/state-machine, or cross-mixin) post the P3 message to the EM inbox and wait for ✅.generate_fixcreates the self-fix branch and applies edits;write_regression_testemits a pytest file + marker flag so it genuinely fails oncoderand passes on the fix branch (no clever mocks);verify_test_differentialraises on the pass-on-both or fail-on-both failure modes.open_self_fix_prrefuses to open without a regression test (§7.4 step 5 hard rule). PR body citesfeedback_idand quotes the EM's wording verbatim — Pass 7 (feedback-binding) depends on both.verify_on_mergere-runs the regression test on the merged SHA, transitions the feedback row toverified, and writesfailure_patterns+review_patternsmemory records so the same symptom is recognised next time.< 60confidence are dropped;≥ 80surface inline for pre-transition action.FeedbackLoopDriver.process_pending_feedback()orchestrates the whole thing with fullpending → triaged → in-fix → fix-pr-open → verified | rejected → closedtransitions written tofeedback.notes_json.classify_failure,pause_current_task,restart_self,edit_self_file) are intentionally out of scope.gaia-coder feedback "<body>" --severity high --on <url>enqueues,gaia-coder self-fix processruns one iteration.ghat the callable boundary; uses a tmp git repo with acoderbranch for real branch creation and pytest differential runs.Out of scope (explicit non-goals per the Phase 6 task)
@gaia-coder feedback:comments (Phase 10 repo binding).restart_self— Phase 7.Dependencies
edit_filesemantics inline rather than instantiatingFileToolsMixin, to keep the module usable without the mixin wired.gaia.coder.stores.feedbackandgaia.coder.stores.memorydirectly.review_gate_runneris optional; absent, we publish a "(review gate not available)" placeholder in the PR body.request_em_approvalusesgaia.coder.trust.inbox.enqueueif importable, else defers.Test plan
pytest tests/coder/test_self_fix/ -xvs— 47 tests pass.pytest tests/coder/ -q— full coder suite (120 tests) pass.python util/lint.py --all— no new critical errors (pre-existing pylint warnings incli.py,discovery.py, etc. are not touched).python -c "from gaia.coder.self_fix import SelfFixToolsMixin; m = SelfFixToolsMixin(); assert len(m.register_self_fix_tools()) >= 7"— smoke.gaia-coder feedback "..." --severity high --on https://github.com/amd/gaia/pull/9999 --id fb-test --db-path /tmp/fb.db— writes a row; follow withgaia-coder self-fix process --db-path /tmp/fb.db --repo-root <worktree> --skip-differential-verify --skip-fix-applyfor the end-to-end path (PR creation mocked).Merge plan
coderafter Phase 4 and Phase 5 land so the review gate and inbox wiring become real imports.