feat(coder): Phases 7+8 — dev-mode + self-heal + debug sub-loop by kovtcharov · Pull Request #828 · amd/gaia

kovtcharov · 2026-04-20T10:19:18Z

Summary

Lands Phases 7 and 8 of docs/plans/coder-agent.mdx as a single PR to
minimise public PR count per the project's PR-minimisation directive.
The agent now has: (a) a real dev-mode gate backed by the §7.1 hard
precondition + em.toml + session.json triad, (b) self-heal primitives
for pausing, resuming, and restarting a task when it hits a bug in its
own source (§7.5), and (c) the eight-tool debug sub-loop from §5.9 with
the four discipline rules enforced at the state-machine layer rather
than only in prompts. Public API only — wiring into the main ReAct
loop is a Phase 11 concern.

Threads

Dev-mode gate (§7.1) — src/gaia/coder/dev_mode.py — detection
(editable install + matching origin), session.json + em.toml toggles
with audit-log side effects, composite is_enabled. Fail-loudly on
corrupt session files so an undetected bad state can't silently
enable self-edit.
Self-heal sub-loop (§7.5) — src/gaia/coder/self_fix/self_heal.py
— classify_failure (P8 prompt, conservative escalation), pause/resume
via the existing paused_tasks store, and restart_self with a
module-level rate-limit window (_RESTART_COUNT_WINDOW = 1h,
_RESTART_MAX_IN_WINDOW = 3) so a broken self-edit can't fork-bomb
the supervisor.
P8 prompt — src/gaia/coder/prompts/classify_failure.md —
the §15.8 template the classifier renders; same double-brace
placeholder convention as triage.md.
Debug tool mixin (§5.9) — src/gaia/coder/tools/debug.py —
eight @tool-decorated closures (repro_attempt, git_bisect,
add_instrumented_trace, run_with_tracing, diff_behavior,
query_failure_patterns, flake_check, minimize_repro) with all
subprocess calls funneled through a single injectable _run so
tests never spawn real processes.
Debug sub-loop state machine — src/gaia/coder/debug_loop.py —
the seven-state FSM. Discipline enforcement is the load-bearing
piece: propose_fix refuses without repro AND without ≥ 80%
top-hypothesis confidence (override requires shotgun=True AND
em_elevation_granted=True); hypothesise enforces three
hypotheses minimum; postmortem writes both feedback.db notes
and a failure_patterns memory row.

Test plan

pytest tests/coder/test_dev_mode.py — 15 tests
pytest tests/coder/test_self_heal.py — 15 tests
pytest tests/coder/test_debug_tools.py — 19 tests
pytest tests/coder/test_debug_loop.py — 18 tests
Full coder suite: pytest tests/coder/ — 265 passed, 2 pre-existing skipped
python util/lint.py --all — clean on all new files (black, isort, flake8 with project config, pylint); remaining critical errors are in pre-existing files unrelated to this PR

Smoke imports:

from gaia.coder.dev_mode import detect_dev_mode
from gaia.coder.self_fix.self_heal import classify_failure, pause_current_task, resume_task, restart_self
from gaia.coder.tools.debug import DebugToolsMixin
from gaia.coder.debug_loop import DebugSubLoop

Out of scope

Wiring into src/gaia/coder/loop.py / src/gaia/coder/base.py / the
gaia-coder CLI is not in this PR — Phase 11 (production swap) is
the right place for that, and doing it here would have pulled in the
scaffold branch's churn. The Python APIs are complete and tested in
isolation; the integration surface is the follow-up.

Spec references: §7.1 (dev mode), §7.5 (self-detected bug), §5.9
(debug sub-loop), §15.8 P8 (classify_failure prompt), §6.8.1
(failure_patterns memory topic).

Introduces `gaia.coder.dev_mode` — the hard precondition check (editable install + matching origin) plus the soft switches (session.json + em.toml). `detect_dev_mode` returns a structured `DevModeStatus` so the CLI can explain *why* dev mode is off. `enable_session`/`enable_permanent` refuse when the precondition fails and append an audit row on every flip; `is_enabled` composes the three signals for a single boolean check. Fail-loudly per CLAUDE.md: corrupt session.json raises instead of silently treating it as "off". Wiring into the main ReAct loop is a Phase 11 concern.

15 tests covering detect_dev_mode (no repo / matching origin / mismatched origin), enable_session / disable_session roundtrip, audit-row side effects, enable_permanent persistence, is_enabled composition, and the corrupt-session-file fail-loudly path. Fixtures build a fresh tmp git repo with a fake origin so the precondition check is exercised without touching the real installation.

The P8 prompt template the self-heal loop uses to classify a mid-task failure as user-task / self-code / external. Conservative — "when uncertain, classify external" — so a wrong self-code diagnosis never burns self-edit churn. Double-brace placeholders match the rest of src/gaia/coder/prompts/ for human editability.

… (§7.5) `gaia.coder.self_fix.self_heal` provides the four §7.5 primitives: - classify_failure — LLM-driven P8 triage, force-escalates low-confidence non-external classifications to external so we never self-fix on a guess. - pause_current_task / resume_task — thin wrappers over stores.paused_tasks that capture cwd + tool-call history + partial outputs + original prompt. - restart_self — hot-reload for prompt-only/doc-only via importlib.reload, exit code 42 for code changes. Fork-bomb guard enforces _RESTART_MAX_IN_WINDOW (3) restarts per _RESTART_COUNT_WINDOW (1h); breach raises RestartStormError so the daemon halts and pages the EM (§7.5 last bullet). All subprocess/LLM/exit calls are dependency-injected so tests never hit the network, the filesystem outside tmp_path, or sys.exit for real.

classify_failure paths (happy, low-confidence escalate, external-stays, bad JSON, unknown kind, prompt substitution), pause/resume roundtrip + corrupt-snapshot fail-loudly + delete-on-resume, restart_self rate-limit window (cold exit 42, hot-reload no exit, fourth-attempt RestartStormError, stale timestamps purged, unknown kind raises). Anthropic calls are always mocked; exit_fn is an injection point so the rate-limit window can be exercised without terminating pytest.

Adds `gaia.coder.tools.debug` with the full debug-sub-loop primitive set: - repro_attempt — run N times, score similarity against the expected failure signature; "3 of 3" before we call it reproduced. - git_bisect — parse the canonical "<sha> is the first bad commit" line. - add_instrumented_trace — insert a `logger.debug` on a scratch branch (never a bare `print` — Pass 1 would reject it). - run_with_tracing — PYTHONFAULTHANDLER / PYTHONDEVMODE / NODE_OPTIONS injection plus `python -X dev` prefix for python commands. - diff_behavior — unified diff of harness output between good/bad refs. - query_failure_patterns — memory recall against the failure_patterns topic; token-overlap similarity until FAISS lands in Phase 10. - flake_check — pytest N times, flag test when 0 < rate ≤ 10%. - minimize_repro — binary-search an input down to a minimal reproducer. All subprocess calls use a private _run helper so tests monkeypatch a single entry point. The mixin's register_debug_tools() returns the eight names so the smoke test can assert the contract without touching the tool registry's internals.

One-or-more tests per §5.9 tool (repro-all-pass / signature-mismatch / zero-attempts-raises, bisect-parses-sha / bisect-returns-none-on-failure, instrumented-trace-inserts-debug / missing-file-raises, tracing-sets-faulthandler / tracing-X-dev-prefix, diff-returns-unified, query-ranks-by-similarity / empty-store, flake 3p-2f / all-pass / all-fail, minimize-halves / refuses-non-reproducer / empty-input) plus a mixin smoke test asserting exactly the eight §5.9 names are registered. All subprocess calls are stubbed via monkeypatch so the suite runs in under a second.

Seven-state sub-loop (reproduce → bisect → hypothesise → probe → localise_bug → propose_fix → postmortem) with the four §5.9 discipline rules enforced at the state-machine layer, not just in prompts: 1. propose_fix raises unless `context.reproduced` is True. 2. propose_fix raises unless top-hypothesis confidence ≥ 80%; the shotgun override requires BOTH an explicit flag AND `em_elevation_granted=True`. 3. hypothesise enforces `len(hypotheses) >= 3`. 4. postmortem writes one failure_patterns memory row (per §6.8.1) plus appends a structured entry to the feedback.db notes_json. The loop is purely the control flow — tool implementations live in `gaia.coder.tools.debug`. All collaborators (repro_fn, bisect_fn, probe_fn) are dependency-injected so the loop composes cleanly with the main ReAct machine in Phase 11.

Covers each transition plus every discipline rule: - reproduce: happy path / not-reproduced raises with the exact EM question / missing fn raises. - bisect: advances + skip-mode. - hypothesise: < 3 raises; 3 accepts. - probe: confidence tip-over → LOCALISE_BUG; inconclusive → HYPOTHESISE; round cap → DebugDisciplineError. - propose_fix: rule-1 (no repro) / rule-2 (low confidence) / shotgun requires EM elevation / happy path. - postmortem: writes feedback notes_json + memory failure_patterns row, and require_postmortem_or_raise gates the sub-loop exit. All collaborators injected as callables so no real subprocess or LLM.

github-actions · 2026-04-20T10:26:53Z

Summary

Strong Phase-7+8 landing — dev-mode gate, self-heal primitives, and the §5.9 debug sub-loop. Discipline rules are properly enforced at the state-machine layer (not just prompts), fail-loudly is honored throughout (corrupt session JSON, unknown classify kinds, restart storms), and collaborators are injectable so the 67-test suite is hermetic. The single most important thing to know: add_instrumented_trace inserts logger.debug(...) into arbitrary files without ensuring logger is imported — the tool as written will NameError the target program at runtime.

Issues Found

🟡 Important

add_instrumented_trace injects logger.debug(...) with no guarantee logger exists in scope (src/gaia/coder/tools/debug.py:388)

The probe writes logger.debug("...") into any file at file:line, but never verifies the file imports/binds logger. The unit test at tests/coder/test_debug_tools.py:3108 exercises a file whose contents are literally "def foo():\n return 1\n" — no logger import — and the test passes because it only asserts the string is written, not that the code is executable. When the debug sub-loop actually runs the instrumented branch, every target without a pre-existing module-level logger = logging.getLogger(__name__) will raise NameError, killing the probe and corrupting the signal probe() is trying to produce.

Either (a) detect-or-inject a logger binding when missing, (b) emit a fully-qualified fallback like import logging as _glog; _glog.getLogger(__name__).debug(...), or (c) document this as a precondition and raise loudly if the target file lacks a logger symbol. Option (b) is the cheapest hermetic fix:

    inserted = (
        f"{indent}__import__('logging').getLogger(__name__)"
        f".debug({json.dumps(message)})  # gaia-coder probe\n"
    )
    new_lines = lines[: line - 1] + [inserted] + lines[line - 1 :]
    target.write_text("".join(new_lines), encoding="utf-8")

Also add a test that imports the mutated module to confirm it loads — the current test passes even on a broken probe.

diff_behavior cleanup leaves HEAD detached at the wrong ref (src/gaia/coder/tools/debug.py, _run_on / finally block near line 2321)

The flow is: stash → git switch --detach good_ref → run harness → git switch --detach bad_ref → run harness → finally: git switch -. After two detached switches, switch - returns to the previous detached state (good_ref), not the caller's original branch. On a long-running coder daemon this silently leaves the repo on a detached ref — the next tool that assumes HEAD == main will be working off stale code.

Capture the original ref at entry and restore it explicitly:

    repo_root = Path(cwd) if cwd else Path.cwd()
    argv = _shell_split(harness_script)

    original_head = _run(
        ["git", "symbolic-ref", "--quiet", "--short", "HEAD"], cwd=repo_root
    ).stdout.strip() or _run(
        ["git", "rev-parse", "HEAD"], cwd=repo_root
    ).stdout.strip()

    # Stash first to keep the working tree clean.
    stash = _run(

and replace the git switch - in the finally with git switch --detach {original_head} (or git switch {original_head} when it was a branch). A regression test that asserts HEAD is restored after diff_behavior would catch future regressions.

🟢 Minor

Docstring says "clamped"; code raises (src/gaia/coder/self_fix/self_heal.py:1399)

classify_failure's return contract says "confidence is clamped to [0, 100]" but _parse_classify_response raises ValueError for out-of-range values. Fail-loudly is the right behavior — just correct the docstring:

        :class:`FailureClassification` — kind is always one of
        :data:`FAILURE_KINDS`; confidence must be in [0, 100] or the
        response is rejected. Below the floor the kind is force-rewritten
        to ``external`` and ``escalated_low_confidence`` is set.

Postmortem memory row pins confidence at ≥80 even for shotgun fixes (src/gaia/coder/debug_loop.py:521-524)

confidence=max(
    self._top_hypothesis().confidence if self._top_hypothesis() else 80,
    80,
),

A shotgun fix (top hypothesis confidence 40, elevation granted) will still be recorded in failure_patterns with confidence=80, which biases future memory lookups toward a low-signal pattern. Either (a) record the actual top-hypothesis confidence, or (b) stamp the row differently when ctx.shotgun is True so retrieval can discount it. At minimum, add a WHY: comment explaining the 80 floor.

minimize_repro halving fails on midpoint-straddling reproducers (src/gaia/coder/tools/debug.py:2492-2504)

The loop drops exactly one half at each step; if the essential trigger straddles the midpoint, neither half reproduces and the function returns the unchanged input after one iteration. §5.9's "2000-char → 20-char" guarantee does not hold for that shape. Acceptable for an MVP, but worth documenting in the docstring so callers know when to prefer an external delta-debugging tool, or upgrade to a real ddmin later.

minimize_repro._default_reproducer duplicates _run logic (src/gaia/coder/tools/debug.py:2470-2481)

It open-codes subprocess.run(... input=candidate ...) instead of going through _run, so the module has two subprocess call sites when the tests only mock one. Either extend _run to take an input= kwarg, or call _run and attach stdin downstream — keeping a single funnel makes the module easier to audit and reuse.

propose_fix happy-path record_note omits shotgun state (src/gaia/coder/debug_loop.py:433-436)

The final note line records summary=... architectural=... but not shotgun=self.context.shotgun, while the shotgun flag is important for the audit trail. The earlier "shotgun override ACTIVE" note covers it, but including shotgun in the final note keeps the breadcrumb self-contained.

Strengths

Discipline-at-the-state-machine is the right call. §5.9's four rules are encoded as method-entry guards on DebugSubLoop — a caller that bypasses the prompts still cannot land a fix without repro + ≥80% confidence + three hypotheses + postmortem. This is a meaningfully stronger invariant than prompt-level discipline.
Fail-loudly consistency. Corrupt session.json, unknown classify kinds, missing paused-task fields, and the >3 restarts/hour cap all raise typed exceptions with actionable messages naming §7.x / §5.9 — no silent fallbacks.
Injection-based test design. repro_fn, bisect_fn, probe_fn, memory_conn_factory, exit_fn, git_runner, runner — every subprocess and LLM edge is mockable, which is why 67 tests run hermetically with no Anthropic or shell dependency.
Restart rate-limit modeling. Module-level rolling window with in-place purge avoids unbounded growth on a long-running daemon, and the bound composes with the supervisor's own cap rather than trusting one side.

Verdict

Request changes — the two 🟡 issues (add_instrumented_trace producing a broken target and diff_behavior leaving HEAD detached) are real runtime bugs that will bite as soon as Phase 11 wires this in. Both are surgical fixes and the test suite is solid enough that a follow-up patch can land quickly. Minor issues are fine to roll in or defer.

…ime) (#832) ## Summary Six fixes flagged by the auto-review bot: one Critical (security), five Important (two on #827, two on #828, one on both). All 395 tests pass on `coder` with the fixes. ## Changes **Critical (security):** - `oss_reuse.py` `import_with_attribution` — path traversal on LLM-controlled `dest_path`. Now resolves + `relative_to(root)`-checks; raises `AttributionError` on escape. **Important:** - `oss_reuse.py` `_validate_license_filter` — unknown SPDX ids silently dropped; now raises per CLAUDE.md fail-loudly. - `tools/github.py` `gh_pr_merge` — hardcoded `--admin`; now gated behind `admin_override=False` default. - `repo_binding.py` webhook round-trip — only did positive check; added wrong-signature + wrong-payload discrimination. - `tools/debug.py` `add_instrumented_trace` — emitted `logger.debug(...)` requiring pre-bound `logger`; now inlines `__import__('logging')` lookup. - `tools/debug.py` `diff_behavior` — `git switch -` after two detached switches returns to wrong ref; now captures + explicitly restores original HEAD. ## Test plan - [x] `pytest tests/coder/ tests/eval/` — 395 pass - [x] 7 new regression tests in `test_fixes_827_828.py` cover each fix - [x] `test_add_instrumented_trace_*` now asserts the mutated module actually imports (previously asserted only the string was written)

kovtcharov added 9 commits April 20, 2026 03:17

github-actions Bot added the tests Test changes label Apr 20, 2026

kovtcharov marked this pull request as ready for review April 20, 2026 10:23

kovtcharov requested a review from kovtcharov-amd as a code owner April 20, 2026 10:23

kovtcharov merged commit 6029b0e into coder Apr 20, 2026
11 checks passed

kovtcharov deleted the feature/gaia-coder-p7p8 branch April 20, 2026 10:23

kovtcharov mentioned this pull request Apr 20, 2026

fix(coder): address #827 + #828 auto-review findings (security + runtime) #832

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(coder): Phases 7+8 — dev-mode + self-heal + debug sub-loop#828

feat(coder): Phases 7+8 — dev-mode + self-heal + debug sub-loop#828
kovtcharov merged 9 commits into
coderfrom
feature/gaia-coder-p7p8

kovtcharov commented Apr 20, 2026

Uh oh!

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kovtcharov commented Apr 20, 2026

Summary

Threads

Test plan

Out of scope

Uh oh!

Uh oh!

github-actions Bot commented Apr 20, 2026

Summary

Issues Found

🟡 Important

🟢 Minor

Strengths

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant