Skip to content

feat(coder): Phases 7+8 — dev-mode + self-heal + debug sub-loop#828

Merged
kovtcharov merged 9 commits into
coderfrom
feature/gaia-coder-p7p8
Apr 20, 2026
Merged

feat(coder): Phases 7+8 — dev-mode + self-heal + debug sub-loop#828
kovtcharov merged 9 commits into
coderfrom
feature/gaia-coder-p7p8

Conversation

@kovtcharov
Copy link
Copy Markdown
Collaborator

Summary

Lands Phases 7 and 8 of docs/plans/coder-agent.mdx as a single PR to
minimise public PR count per the project's PR-minimisation directive.
The agent now has: (a) a real dev-mode gate backed by the §7.1 hard
precondition + em.toml + session.json triad, (b) self-heal primitives
for pausing, resuming, and restarting a task when it hits a bug in its
own source (§7.5), and (c) the eight-tool debug sub-loop from §5.9 with
the four discipline rules enforced at the state-machine layer rather
than only in prompts. Public API only — wiring into the main ReAct
loop is a Phase 11 concern.

Threads

  • Dev-mode gate (§7.1) — src/gaia/coder/dev_mode.py — detection
    (editable install + matching origin), session.json + em.toml toggles
    with audit-log side effects, composite is_enabled. Fail-loudly on
    corrupt session files so an undetected bad state can't silently
    enable self-edit.
  • Self-heal sub-loop (§7.5) — src/gaia/coder/self_fix/self_heal.py
    — classify_failure (P8 prompt, conservative escalation), pause/resume
    via the existing paused_tasks store, and restart_self with a
    module-level rate-limit window (_RESTART_COUNT_WINDOW = 1h,
    _RESTART_MAX_IN_WINDOW = 3) so a broken self-edit can't fork-bomb
    the supervisor.
  • P8 prompt — src/gaia/coder/prompts/classify_failure.md
    the §15.8 template the classifier renders; same double-brace
    placeholder convention as triage.md.
  • Debug tool mixin (§5.9) — src/gaia/coder/tools/debug.py
    eight @tool-decorated closures (repro_attempt, git_bisect,
    add_instrumented_trace, run_with_tracing, diff_behavior,
    query_failure_patterns, flake_check, minimize_repro) with all
    subprocess calls funneled through a single injectable _run so
    tests never spawn real processes.
  • Debug sub-loop state machine — src/gaia/coder/debug_loop.py
    the seven-state FSM. Discipline enforcement is the load-bearing
    piece: propose_fix refuses without repro AND without ≥ 80%
    top-hypothesis confidence (override requires shotgun=True AND
    em_elevation_granted=True); hypothesise enforces three
    hypotheses minimum; postmortem writes both feedback.db notes
    and a failure_patterns memory row.

Test plan

  • pytest tests/coder/test_dev_mode.py — 15 tests
  • pytest tests/coder/test_self_heal.py — 15 tests
  • pytest tests/coder/test_debug_tools.py — 19 tests
  • pytest tests/coder/test_debug_loop.py — 18 tests
  • Full coder suite: pytest tests/coder/ — 265 passed, 2 pre-existing skipped
  • python util/lint.py --all — clean on all new files (black, isort, flake8 with project config, pylint); remaining critical errors are in pre-existing files unrelated to this PR
  • Smoke imports:
    from gaia.coder.dev_mode import detect_dev_mode
    from gaia.coder.self_fix.self_heal import classify_failure, pause_current_task, resume_task, restart_self
    from gaia.coder.tools.debug import DebugToolsMixin
    from gaia.coder.debug_loop import DebugSubLoop
    

Out of scope

Wiring into src/gaia/coder/loop.py / src/gaia/coder/base.py / the
gaia-coder CLI is not in this PR — Phase 11 (production swap) is
the right place for that, and doing it here would have pulled in the
scaffold branch's churn. The Python APIs are complete and tested in
isolation; the integration surface is the follow-up.

Spec references: §7.1 (dev mode), §7.5 (self-detected bug), §5.9
(debug sub-loop), §15.8 P8 (classify_failure prompt), §6.8.1
(failure_patterns memory topic).

Introduces `gaia.coder.dev_mode` — the hard precondition check (editable
install + matching origin) plus the soft switches (session.json +
em.toml). `detect_dev_mode` returns a structured `DevModeStatus` so the
CLI can explain *why* dev mode is off. `enable_session`/`enable_permanent`
refuse when the precondition fails and append an audit row on every flip;
`is_enabled` composes the three signals for a single boolean check.

Fail-loudly per CLAUDE.md: corrupt session.json raises instead of
silently treating it as "off". Wiring into the main ReAct loop is a
Phase 11 concern.
15 tests covering detect_dev_mode (no repo / matching origin /
mismatched origin), enable_session / disable_session roundtrip,
audit-row side effects, enable_permanent persistence, is_enabled
composition, and the corrupt-session-file fail-loudly path. Fixtures
build a fresh tmp git repo with a fake origin so the precondition check
is exercised without touching the real installation.
The P8 prompt template the self-heal loop uses to classify a mid-task
failure as user-task / self-code / external. Conservative — "when
uncertain, classify external" — so a wrong self-code diagnosis never
burns self-edit churn. Double-brace placeholders match the rest of
src/gaia/coder/prompts/ for human editability.
… (§7.5)

`gaia.coder.self_fix.self_heal` provides the four §7.5 primitives:

- classify_failure — LLM-driven P8 triage, force-escalates low-confidence
  non-external classifications to external so we never self-fix on a
  guess.
- pause_current_task / resume_task — thin wrappers over stores.paused_tasks
  that capture cwd + tool-call history + partial outputs + original prompt.
- restart_self — hot-reload for prompt-only/doc-only via importlib.reload,
  exit code 42 for code changes. Fork-bomb guard enforces _RESTART_MAX_IN_WINDOW
  (3) restarts per _RESTART_COUNT_WINDOW (1h); breach raises RestartStormError
  so the daemon halts and pages the EM (§7.5 last bullet).

All subprocess/LLM/exit calls are dependency-injected so tests never hit
the network, the filesystem outside tmp_path, or sys.exit for real.
classify_failure paths (happy, low-confidence escalate, external-stays,
bad JSON, unknown kind, prompt substitution), pause/resume roundtrip
+ corrupt-snapshot fail-loudly + delete-on-resume, restart_self
rate-limit window (cold exit 42, hot-reload no exit, fourth-attempt
RestartStormError, stale timestamps purged, unknown kind raises).

Anthropic calls are always mocked; exit_fn is an injection point so the
rate-limit window can be exercised without terminating pytest.
Adds `gaia.coder.tools.debug` with the full debug-sub-loop primitive set:

- repro_attempt — run N times, score similarity against the expected
  failure signature; "3 of 3" before we call it reproduced.
- git_bisect — parse the canonical "<sha> is the first bad commit" line.
- add_instrumented_trace — insert a `logger.debug` on a scratch branch
  (never a bare `print` — Pass 1 would reject it).
- run_with_tracing — PYTHONFAULTHANDLER / PYTHONDEVMODE / NODE_OPTIONS
  injection plus `python -X dev` prefix for python commands.
- diff_behavior — unified diff of harness output between good/bad refs.
- query_failure_patterns — memory recall against the failure_patterns
  topic; token-overlap similarity until FAISS lands in Phase 10.
- flake_check — pytest N times, flag test when 0 < rate ≤ 10%.
- minimize_repro — binary-search an input down to a minimal reproducer.

All subprocess calls use a private _run helper so tests monkeypatch a
single entry point. The mixin's register_debug_tools() returns the
eight names so the smoke test can assert the contract without touching
the tool registry's internals.
One-or-more tests per §5.9 tool (repro-all-pass / signature-mismatch /
zero-attempts-raises, bisect-parses-sha / bisect-returns-none-on-failure,
instrumented-trace-inserts-debug / missing-file-raises,
tracing-sets-faulthandler / tracing-X-dev-prefix, diff-returns-unified,
query-ranks-by-similarity / empty-store, flake 3p-2f / all-pass /
all-fail, minimize-halves / refuses-non-reproducer / empty-input) plus
a mixin smoke test asserting exactly the eight §5.9 names are
registered.

All subprocess calls are stubbed via monkeypatch so the suite runs in
under a second.
Seven-state sub-loop (reproduce → bisect → hypothesise → probe →
localise_bug → propose_fix → postmortem) with the four §5.9 discipline
rules enforced at the state-machine layer, not just in prompts:

1. propose_fix raises unless `context.reproduced` is True.
2. propose_fix raises unless top-hypothesis confidence ≥ 80%; the
   shotgun override requires BOTH an explicit flag AND
   `em_elevation_granted=True`.
3. hypothesise enforces `len(hypotheses) >= 3`.
4. postmortem writes one failure_patterns memory row (per §6.8.1)
   plus appends a structured entry to the feedback.db notes_json.

The loop is purely the control flow — tool implementations live in
`gaia.coder.tools.debug`. All collaborators (repro_fn, bisect_fn,
probe_fn) are dependency-injected so the loop composes cleanly with
the main ReAct machine in Phase 11.
Covers each transition plus every discipline rule:

- reproduce: happy path / not-reproduced raises with the exact EM
  question / missing fn raises.
- bisect: advances + skip-mode.
- hypothesise: < 3 raises; 3 accepts.
- probe: confidence tip-over → LOCALISE_BUG; inconclusive → HYPOTHESISE;
  round cap → DebugDisciplineError.
- propose_fix: rule-1 (no repro) / rule-2 (low confidence) / shotgun
  requires EM elevation / happy path.
- postmortem: writes feedback notes_json + memory failure_patterns row,
  and require_postmortem_or_raise gates the sub-loop exit.

All collaborators injected as callables so no real subprocess or LLM.
@github-actions github-actions Bot added the tests Test changes label Apr 20, 2026
@kovtcharov kovtcharov marked this pull request as ready for review April 20, 2026 10:23
@kovtcharov kovtcharov merged commit 6029b0e into coder Apr 20, 2026
11 checks passed
@kovtcharov kovtcharov deleted the feature/gaia-coder-p7p8 branch April 20, 2026 10:23
@github-actions
Copy link
Copy Markdown
Contributor

Summary

Strong Phase-7+8 landing — dev-mode gate, self-heal primitives, and the §5.9 debug sub-loop. Discipline rules are properly enforced at the state-machine layer (not just prompts), fail-loudly is honored throughout (corrupt session JSON, unknown classify kinds, restart storms), and collaborators are injectable so the 67-test suite is hermetic. The single most important thing to know: add_instrumented_trace inserts logger.debug(...) into arbitrary files without ensuring logger is imported — the tool as written will NameError the target program at runtime.

Issues Found

🟡 Important

add_instrumented_trace injects logger.debug(...) with no guarantee logger exists in scope (src/gaia/coder/tools/debug.py:388)

The probe writes logger.debug("...") into any file at file:line, but never verifies the file imports/binds logger. The unit test at tests/coder/test_debug_tools.py:3108 exercises a file whose contents are literally "def foo():\n return 1\n" — no logger import — and the test passes because it only asserts the string is written, not that the code is executable. When the debug sub-loop actually runs the instrumented branch, every target without a pre-existing module-level logger = logging.getLogger(__name__) will raise NameError, killing the probe and corrupting the signal probe() is trying to produce.

Either (a) detect-or-inject a logger binding when missing, (b) emit a fully-qualified fallback like import logging as _glog; _glog.getLogger(__name__).debug(...), or (c) document this as a precondition and raise loudly if the target file lacks a logger symbol. Option (b) is the cheapest hermetic fix:

    inserted = (
        f"{indent}__import__('logging').getLogger(__name__)"
        f".debug({json.dumps(message)})  # gaia-coder probe\n"
    )
    new_lines = lines[: line - 1] + [inserted] + lines[line - 1 :]
    target.write_text("".join(new_lines), encoding="utf-8")

Also add a test that imports the mutated module to confirm it loads — the current test passes even on a broken probe.

diff_behavior cleanup leaves HEAD detached at the wrong ref (src/gaia/coder/tools/debug.py, _run_on / finally block near line 2321)

The flow is: stash → git switch --detach good_ref → run harness → git switch --detach bad_ref → run harness → finally: git switch -. After two detached switches, switch - returns to the previous detached state (good_ref), not the caller's original branch. On a long-running coder daemon this silently leaves the repo on a detached ref — the next tool that assumes HEAD == main will be working off stale code.

Capture the original ref at entry and restore it explicitly:

    repo_root = Path(cwd) if cwd else Path.cwd()
    argv = _shell_split(harness_script)

    original_head = _run(
        ["git", "symbolic-ref", "--quiet", "--short", "HEAD"], cwd=repo_root
    ).stdout.strip() or _run(
        ["git", "rev-parse", "HEAD"], cwd=repo_root
    ).stdout.strip()

    # Stash first to keep the working tree clean.
    stash = _run(

and replace the git switch - in the finally with git switch --detach {original_head} (or git switch {original_head} when it was a branch). A regression test that asserts HEAD is restored after diff_behavior would catch future regressions.

🟢 Minor

Docstring says "clamped"; code raises (src/gaia/coder/self_fix/self_heal.py:1399)

classify_failure's return contract says "confidence is clamped to [0, 100]" but _parse_classify_response raises ValueError for out-of-range values. Fail-loudly is the right behavior — just correct the docstring:

        :class:`FailureClassification` — kind is always one of
        :data:`FAILURE_KINDS`; confidence must be in [0, 100] or the
        response is rejected. Below the floor the kind is force-rewritten
        to ``external`` and ``escalated_low_confidence`` is set.

Postmortem memory row pins confidence at ≥80 even for shotgun fixes (src/gaia/coder/debug_loop.py:521-524)

confidence=max(
    self._top_hypothesis().confidence if self._top_hypothesis() else 80,
    80,
),

A shotgun fix (top hypothesis confidence 40, elevation granted) will still be recorded in failure_patterns with confidence=80, which biases future memory lookups toward a low-signal pattern. Either (a) record the actual top-hypothesis confidence, or (b) stamp the row differently when ctx.shotgun is True so retrieval can discount it. At minimum, add a WHY: comment explaining the 80 floor.

minimize_repro halving fails on midpoint-straddling reproducers (src/gaia/coder/tools/debug.py:2492-2504)

The loop drops exactly one half at each step; if the essential trigger straddles the midpoint, neither half reproduces and the function returns the unchanged input after one iteration. §5.9's "2000-char → 20-char" guarantee does not hold for that shape. Acceptable for an MVP, but worth documenting in the docstring so callers know when to prefer an external delta-debugging tool, or upgrade to a real ddmin later.

minimize_repro._default_reproducer duplicates _run logic (src/gaia/coder/tools/debug.py:2470-2481)

It open-codes subprocess.run(... input=candidate ...) instead of going through _run, so the module has two subprocess call sites when the tests only mock one. Either extend _run to take an input= kwarg, or call _run and attach stdin downstream — keeping a single funnel makes the module easier to audit and reuse.

propose_fix happy-path record_note omits shotgun state (src/gaia/coder/debug_loop.py:433-436)

The final note line records summary=... architectural=... but not shotgun=self.context.shotgun, while the shotgun flag is important for the audit trail. The earlier "shotgun override ACTIVE" note covers it, but including shotgun in the final note keeps the breadcrumb self-contained.

Strengths

  • Discipline-at-the-state-machine is the right call. §5.9's four rules are encoded as method-entry guards on DebugSubLoop — a caller that bypasses the prompts still cannot land a fix without repro + ≥80% confidence + three hypotheses + postmortem. This is a meaningfully stronger invariant than prompt-level discipline.
  • Fail-loudly consistency. Corrupt session.json, unknown classify kinds, missing paused-task fields, and the >3 restarts/hour cap all raise typed exceptions with actionable messages naming §7.x / §5.9 — no silent fallbacks.
  • Injection-based test design. repro_fn, bisect_fn, probe_fn, memory_conn_factory, exit_fn, git_runner, runner — every subprocess and LLM edge is mockable, which is why 67 tests run hermetically with no Anthropic or shell dependency.
  • Restart rate-limit modeling. Module-level rolling window with in-place purge avoids unbounded growth on a long-running daemon, and the bound composes with the supervisor's own cap rather than trusting one side.

Verdict

Request changes — the two 🟡 issues (add_instrumented_trace producing a broken target and diff_behavior leaving HEAD detached) are real runtime bugs that will bite as soon as Phase 11 wires this in. Both are surgical fixes and the test suite is solid enough that a follow-up patch can land quickly. Minor issues are fine to roll in or defer.

kovtcharov added a commit that referenced this pull request Apr 20, 2026
…ime) (#832)

## Summary

Six fixes flagged by the auto-review bot: one Critical (security), five
Important (two on #827, two on #828, one on both). All 395 tests pass on
`coder` with the fixes.

## Changes

**Critical (security):**
- `oss_reuse.py` `import_with_attribution` — path traversal on
LLM-controlled `dest_path`. Now resolves + `relative_to(root)`-checks;
raises `AttributionError` on escape.

**Important:**
- `oss_reuse.py` `_validate_license_filter` — unknown SPDX ids silently
dropped; now raises per CLAUDE.md fail-loudly.
- `tools/github.py` `gh_pr_merge` — hardcoded `--admin`; now gated
behind `admin_override=False` default.
- `repo_binding.py` webhook round-trip — only did positive check; added
wrong-signature + wrong-payload discrimination.
- `tools/debug.py` `add_instrumented_trace` — emitted
`logger.debug(...)` requiring pre-bound `logger`; now inlines
`__import__('logging')` lookup.
- `tools/debug.py` `diff_behavior` — `git switch -` after two detached
switches returns to wrong ref; now captures + explicitly restores
original HEAD.

## Test plan

- [x] `pytest tests/coder/ tests/eval/` — 395 pass
- [x] 7 new regression tests in `test_fixes_827_828.py` cover each fix
- [x] `test_add_instrumented_trace_*` now asserts the mutated module
actually imports (previously asserted only the string was written)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant