feat(coder): Phases 7+8 - dev-mode + self-heal + debug sub-loop #828
Introduces `gaia.coder.dev_mode` — the hard precondition check (editable install + matching origin) plus the soft switches (session.json + em.toml). `detect_dev_mode` returns a structured `DevModeStatus` so the CLI can explain *why* dev mode is off. `enable_session`/`enable_permanent` refuse when the precondition fails and append an audit row on every flip; `is_enabled` composes the three signals for a single boolean check. Fail-loudly per CLAUDE.md: corrupt session.json raises instead of silently treating it as "off". Wiring into the main ReAct loop is a Phase 11 concern.
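To make the composition concrete, here is a minimal sketch of how a structured status and the two soft switches could combine into a single boolean. Only the names `DevModeStatus` and `is_enabled`, and the fail-loudly behaviour on a corrupt session.json, come from the commit message. The field names, the `enabled` key, the `permanent_flag` parameter (standing in for the em.toml switch), and the "precondition AND (session OR permanent)" rule are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DevModeStatus:
    # Hypothetical fields; the real dataclass may carry different ones.
    editable_install: bool
    origin_matches: bool

    @property
    def precondition_ok(self) -> bool:
        return self.editable_install and self.origin_matches

    @property
    def reason(self) -> str:
        """Human-readable explanation the CLI could print when dev mode is off."""
        if not self.editable_install:
            return "not an editable install"
        if not self.origin_matches:
            return "git origin does not match the expected repository"
        return "ok"


def read_session_flag(session_path: Path) -> bool:
    """Missing file means off; a corrupt file raises instead of meaning off."""
    if not session_path.exists():
        return False
    try:
        data = json.loads(session_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"corrupt session file: {session_path}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"corrupt session file: {session_path}")
    # The "enabled" key is an assumed schema, not taken from the PR.
    return bool(data.get("enabled", False))


def is_enabled(status: DevModeStatus, session_path: Path, permanent_flag: bool) -> bool:
    # Read the session file first so a corrupt file is always reported,
    # even when the permanent switch alone would decide the answer.
    session_flag = read_session_flag(session_path)
    if not status.precondition_ok:
        return False
    return permanent_flag or session_flag
```

Reading the session file before checking anything else is a deliberate choice in this sketch: short-circuiting on the permanent flag would let a corrupt file go unnoticed.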
15 tests covering detect_dev_mode (no repo / matching origin / mismatched origin), enable_session / disable_session roundtrip, audit-row side effects, enable_permanent persistence, is_enabled composition, and the corrupt-session-file fail-loudly path. Fixtures build a fresh tmp git repo with a fake origin so the precondition check is exercised without touching the real installation.
The P8 prompt template the self-heal loop uses to classify a mid-task failure as user-task / self-code / external. Conservative — "when uncertain, classify external" — so a wrong self-code diagnosis never burns self-edit churn. Double-brace placeholders match the rest of src/gaia/coder/prompts/ for human editability.
… (§7.5)

`gaia.coder.self_fix.self_heal` provides the four §7.5 primitives:

- classify_failure: LLM-driven P8 triage; force-escalates low-confidence non-external classifications to external so we never self-fix on a guess.
- pause_current_task / resume_task: thin wrappers over stores.paused_tasks that capture cwd + tool-call history + partial outputs + original prompt.
- restart_self: hot-reload for prompt-only/doc-only changes via importlib.reload, exit code 42 for code changes. Fork-bomb guard enforces _RESTART_MAX_IN_WINDOW (3) restarts per _RESTART_COUNT_WINDOW (1h); breach raises RestartStormError so the daemon halts and pages the EM (§7.5 last bullet).

All subprocess/LLM/exit calls are dependency-injected so tests never hit the network, the filesystem outside tmp_path, or sys.exit for real.
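The fork-bomb guard described above can be sketched as a sliding-window counter. This is an illustration under stated assumptions, not the module's code: the `RestartLimiter` class, the `"hot"`/`"cold"` kind strings, the injectable `clock`, and the choice to count hot reloads against the limit are all hypothetical. The constants, the exit code 42, and `RestartStormError` are taken from the commit message.

```python
import time
from collections import deque
from typing import Callable, Deque

_RESTART_COUNT_WINDOW = 3600.0  # seconds (1h, per the commit message)
_RESTART_MAX_IN_WINDOW = 3


class RestartStormError(RuntimeError):
    """Raised when too many restarts land inside the window."""


class RestartLimiter:
    """Hypothetical helper; the real module keeps this state at module level."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._stamps: Deque[float] = deque()

    def record_restart(self) -> None:
        now = self._clock()
        # Purge timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= _RESTART_COUNT_WINDOW:
            self._stamps.popleft()
        if len(self._stamps) >= _RESTART_MAX_IN_WINDOW:
            raise RestartStormError(
                f"{len(self._stamps)} restarts in the last "
                f"{_RESTART_COUNT_WINDOW:.0f}s; refusing to restart again"
            )
        self._stamps.append(now)


def restart_self(
    kind: str,
    limiter: RestartLimiter,
    exit_fn: Callable[[int], None],
    reload_fn: Callable[[], None],
) -> None:
    if kind not in ("hot", "cold"):
        raise ValueError(f"unknown restart kind: {kind!r}")
    limiter.record_restart()  # assumption: both kinds count against the limit
    if kind == "hot":
        reload_fn()  # e.g. a wrapper around importlib.reload
    else:
        exit_fn(42)  # the supervisor is expected to relaunch on this code
```

Injecting `exit_fn` and `clock` is what lets a test drive the fourth-attempt and stale-timestamp paths without terminating the test process or sleeping for an hour. The sketch keeps timestamps in memory only; how the real module preserves them across an exit-42 restart is not covered by the description.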
classify_failure paths (happy, low-confidence escalate, external-stays, bad JSON, unknown kind, prompt substitution), pause/resume roundtrip + corrupt-snapshot fail-loudly + delete-on-resume, restart_self rate-limit window (cold exit 42, hot-reload no exit, fourth-attempt RestartStormError, stale timestamps purged, unknown kind raises). Anthropic calls are always mocked; exit_fn is an injection point so the rate-limit window can be exercised without terminating pytest.
Adds `gaia.coder.tools.debug` with the full debug-sub-loop primitive set:

- repro_attempt: run N times, score similarity against the expected failure signature; "3 of 3" before we call it reproduced.
- git_bisect: parse the canonical "<sha> is the first bad commit" line.
- add_instrumented_trace: insert a `logger.debug` on a scratch branch (never a bare `print`; Pass 1 would reject it).
- run_with_tracing: PYTHONFAULTHANDLER / PYTHONDEVMODE / NODE_OPTIONS injection plus `python -X dev` prefix for python commands.
- diff_behavior: unified diff of harness output between good/bad refs.
- query_failure_patterns: memory recall against the failure_patterns topic; token-overlap similarity until FAISS lands in Phase 10.
- flake_check: pytest N times, flag test when 0 < rate ≤ 10%.
- minimize_repro: binary-search an input down to a minimal reproducer.

All subprocess calls use a private _run helper so tests monkeypatch a single entry point. The mixin's register_debug_tools() returns the eight names so the smoke test can assert the contract without touching the tool registry's internals.
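Two of the primitives above are small enough to sketch directly. The function names `parse_first_bad_commit` and `is_flaky` are hypothetical, and reading "rate" as the failure rate across runs is an assumption; the commit message does not say which rate is meant. The "<sha> is the first bad commit" line is the summary `git bisect` prints when it finishes.

```python
import re
from typing import Optional, Sequence

# Matches the summary line git prints when a bisect completes.
# 7 to 64 hex chars covers abbreviated, SHA-1, and SHA-256 object names.
_FIRST_BAD = re.compile(
    r"^([0-9a-f]{7,64}) is the first bad commit\s*$", re.MULTILINE
)


def parse_first_bad_commit(bisect_output: str) -> Optional[str]:
    """Return the culprit sha, or None if the bisect did not converge."""
    match = _FIRST_BAD.search(bisect_output)
    return match.group(1) if match else None


def is_flaky(outcomes: Sequence[bool], max_rate: float = 0.10) -> bool:
    """Flag a test whose failure rate is above zero but at most max_rate.

    outcomes holds one entry per run, True meaning the run passed.
    Treating "rate" as the failure rate is an assumption of this sketch.
    """
    if not outcomes:
        raise ValueError("need at least one run to compute a rate")
    failures = sum(1 for passed in outcomes if not passed)
    rate = failures / len(outcomes)
    return 0.0 < rate <= max_rate
```

Under this reading, a test that fails in every run, or in a large share of runs, is not "flaky" but simply failing, which is why the upper bound exists.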
One-or-more tests per §5.9 tool (repro-all-pass / signature-mismatch / zero-attempts-raises, bisect-parses-sha / bisect-returns-none-on-failure, instrumented-trace-inserts-debug / missing-file-raises, tracing-sets-faulthandler / tracing-X-dev-prefix, diff-returns-unified, query-ranks-by-similarity / empty-store, flake 3p-2f / all-pass / all-fail, minimize-halves / refuses-non-reproducer / empty-input) plus a mixin smoke test asserting exactly the eight §5.9 names are registered. All subprocess calls are stubbed via monkeypatch so the suite runs in under a second.
Seven-state sub-loop (reproduce → bisect → hypothesise → probe → localise_bug → propose_fix → postmortem) with the four §5.9 discipline rules enforced at the state-machine layer, not just in prompts:

1. propose_fix raises unless `context.reproduced` is True.
2. propose_fix raises unless top-hypothesis confidence ≥ 80%; the shotgun override requires BOTH an explicit flag AND `em_elevation_granted=True`.
3. hypothesise enforces `len(hypotheses) >= 3`.
4. postmortem writes one failure_patterns memory row (per §6.8.1) plus appends a structured entry to the feedback.db notes_json.

The loop is purely the control flow; tool implementations live in `gaia.coder.tools.debug`. All collaborators (repro_fn, bisect_fn, probe_fn) are dependency-injected so the loop composes cleanly with the main ReAct machine in Phase 11.
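Rules 1 to 3 can be sketched as plain guard functions. The dataclasses and function names below are hypothetical stand-ins for whatever `debug_loop.py` actually defines, and using `DebugDisciplineError` for every violation is an assumption; `context.reproduced`, `em_elevation_granted`, the shotgun flag, and the 80% and three-hypothesis thresholds come from the commit message.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MIN_HYPOTHESES = 3
MIN_CONFIDENCE = 80  # percent


class DebugDisciplineError(RuntimeError):
    """Raised when a transition would violate a discipline rule."""


@dataclass
class Hypothesis:
    description: str
    confidence: int  # percent, 0 to 100


@dataclass
class DebugContext:
    reproduced: bool = False
    hypotheses: List[Hypothesis] = field(default_factory=list)
    em_elevation_granted: bool = False


def top_hypothesis(ctx: DebugContext) -> Optional[Hypothesis]:
    return max(ctx.hypotheses, key=lambda h: h.confidence, default=None)


def check_hypothesise(ctx: DebugContext) -> None:
    # Rule 3: at least three competing hypotheses.
    if len(ctx.hypotheses) < MIN_HYPOTHESES:
        raise DebugDisciplineError(
            f"need >= {MIN_HYPOTHESES} hypotheses, got {len(ctx.hypotheses)}"
        )


def check_propose_fix(ctx: DebugContext, shotgun: bool = False) -> None:
    # Rule 1: never propose a fix for a failure that was not reproduced.
    if not ctx.reproduced:
        raise DebugDisciplineError("cannot propose a fix without a reproduction")
    # Rule 2: the top hypothesis must clear the confidence bar ...
    top = top_hypothesis(ctx)
    if top is not None and top.confidence >= MIN_CONFIDENCE:
        return
    # ... unless the shotgun override is both requested and elevated.
    if shotgun and ctx.em_elevation_granted:
        return
    raise DebugDisciplineError(
        f"top-hypothesis confidence below {MIN_CONFIDENCE}% and no elevated shotgun override"
    )
```

Putting the checks in code rather than in a prompt means a model that ignores its instructions still cannot reach `propose_fix` with an unreproduced failure.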
Covers each transition plus every discipline rule:

- reproduce: happy path / not-reproduced raises with the exact EM question / missing fn raises.
- bisect: advances + skip-mode.
- hypothesise: < 3 raises; 3 accepts.
- probe: confidence tip-over → LOCALISE_BUG; inconclusive → HYPOTHESISE; round cap → DebugDisciplineError.
- propose_fix: rule-1 (no repro) / rule-2 (low confidence) / shotgun requires EM elevation / happy path.
- postmortem: writes feedback notes_json + memory failure_patterns row, and require_postmortem_or_raise gates the sub-loop exit.

All collaborators injected as callables so no real subprocess or LLM.
Summary

Strong Phase-7+8 landing: dev-mode gate, self-heal primitives, and the §5.9 debug sub-loop. Discipline rules are properly enforced at the state-machine layer (not just prompts), fail-loudly is honored throughout (corrupt session JSON, unknown classify kinds, restart storms), and collaborators are injectable so the 67-test suite is hermetic. The single most important thing to know:

Issues Found

🟡 Important
The probe writes
Either (a) detect-or-inject a
Also add a test that imports the mutated module to confirm it loads; the current test passes even on a broken probe.
The flow is: stash →
Capture the original ref at entry and restore it explicitly: and replace the

🟢 Minor

Docstring says "clamped"; code raises (
Postmortem memory row pins confidence at ≥80 even for shotgun fixes (

    confidence=max(
        self._top_hypothesis().confidence if self._top_hypothesis() else 80,
        80,
    ),

A shotgun fix (top hypothesis confidence 40, elevation granted) will still be recorded in
The loop drops exactly one half at each step; if the essential trigger straddles the midpoint, neither half reproduces and the function returns the unchanged input after one iteration. §5.9's "2000-char → 20-char" guarantee does not hold for that shape. Acceptable for an MVP, but worth documenting in the docstring so callers know when to prefer an external delta-debugging tool, or upgrade to a real ddmin later.
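The limitation is easy to reproduce with a small model of the halving loop. The following is a reconstruction from the description in this review, not the code in `tools/debug.py`; the function name and the `reproduces` callback are hypothetical.

```python
from typing import Callable


def minimize_by_halving(data: str, reproduces: Callable[[str], bool]) -> str:
    """Keep whichever half still reproduces; stop when neither half does."""
    if not reproduces(data):
        raise ValueError("input does not reproduce the failure")
    current = data
    while len(current) > 1:
        mid = len(current) // 2
        left, right = current[:mid], current[mid:]
        if reproduces(left):
            current = left
        elif reproduces(right):
            current = right
        else:
            # Neither half reproduces on its own: the trigger straddles
            # the midpoint (or needs both halves), so the loop gives up.
            break
    return current


def has_bug(text: str) -> bool:
    """Toy failure predicate: the input fails whenever it contains "BUG"."""
    return "BUG" in text


# Trigger at the start: every split leaves "BUG" intact in the left half.
aligned = "BUG" + "x" * 61
# Trigger across the midpoint of a 64-char input: the first split cuts it.
straddling = "x" * 31 + "BUG" + "x" * 30

shrunk_aligned = minimize_by_halving(aligned, has_bug)
shrunk_straddling = minimize_by_halving(straddling, has_bug)
```

With these inputs `shrunk_aligned` is the 4-character string `"BUGx"`, while `shrunk_straddling` comes back at its original 64 characters because the very first split separates `"B"` from `"UG"`. A ddmin-style minimizer avoids this by also trying complements of finer-grained chunks.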
It open-codes
The final note line records

Strengths

Verdict

Request changes: the two 🟡 issues ( |
…ime) (#832)

## Summary

Six fixes flagged by the auto-review bot: one Critical (security), five Important (two on #827, two on #828, one on both). All 395 tests pass on `coder` with the fixes.

## Changes

**Critical (security):**

- `oss_reuse.py` `import_with_attribution`: path traversal on LLM-controlled `dest_path`. Now resolves + `relative_to(root)`-checks; raises `AttributionError` on escape.

**Important:**

- `oss_reuse.py` `_validate_license_filter`: unknown SPDX ids silently dropped; now raises per CLAUDE.md fail-loudly.
- `tools/github.py` `gh_pr_merge`: hardcoded `--admin`; now gated behind `admin_override=False` default.
- `repo_binding.py` webhook round-trip: only did positive check; added wrong-signature + wrong-payload discrimination.
- `tools/debug.py` `add_instrumented_trace`: emitted `logger.debug(...)` requiring pre-bound `logger`; now inlines `__import__('logging')` lookup.
- `tools/debug.py` `diff_behavior`: `git switch -` after two detached switches returns to wrong ref; now captures + explicitly restores original HEAD.

## Test plan

- [x] `pytest tests/coder/ tests/eval/`: 395 pass
- [x] 7 new regression tests in `test_fixes_827_828.py` cover each fix
- [x] `test_add_instrumented_trace_*` now asserts the mutated module actually imports (previously asserted only the string was written)
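The resolve-then-`relative_to` check named in the Critical fix follows a common pattern, sketched below. The helper name `resolve_inside_root` is hypothetical and the exception class is redefined locally so the snippet stands alone; only the technique and the `AttributionError` name come from the commit message.

```python
from pathlib import Path


class AttributionError(Exception):
    """Local stand-in for the exception named in the commit message."""


def resolve_inside_root(root: Path, dest_path: str) -> Path:
    """Resolve dest_path under root, rejecting anything that escapes it."""
    root_resolved = root.resolve()
    # Joining with an absolute dest_path discards root entirely, and ".."
    # segments can climb out of it; resolve() normalises both cases.
    candidate = (root_resolved / dest_path).resolve()
    try:
        candidate.relative_to(root_resolved)
    except ValueError as exc:
        raise AttributionError(
            f"destination escapes repository root: {dest_path!r}"
        ) from exc
    return candidate
```

Resolving before comparing matters: a purely textual prefix check would accept `vendor/../../escape.py`, and would not follow symlinks that point outside the root.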
Summary
Lands Phases 7 and 8 of `docs/plans/coder-agent.mdx` as a single PR to
minimise public PR count per the project's PR-minimisation directive.
The agent now has: (a) a real dev-mode gate backed by the §7.1 hard
precondition + em.toml + session.json triad, (b) self-heal primitives
for pausing, resuming, and restarting a task when it hits a bug in its
own source (§7.5), and (c) the eight-tool debug sub-loop from §5.9 with
the four discipline rules enforced at the state-machine layer rather
than only in prompts. Public API only — wiring into the main ReAct
loop is a Phase 11 concern.
Threads
- `src/gaia/coder/dev_mode.py`: detection (editable install + matching origin), session.json + em.toml toggles with audit-log side effects, composite `is_enabled`. Fail-loudly on corrupt session files so an undetected bad state can't silently enable self-edit.
- `src/gaia/coder/self_fix/self_heal.py`: classify_failure (P8 prompt, conservative escalation), pause/resume via the existing paused_tasks store, and restart_self with a module-level rate-limit window (`_RESTART_COUNT_WINDOW` = 1h, `_RESTART_MAX_IN_WINDOW` = 3) so a broken self-edit can't fork-bomb the supervisor.
- `src/gaia/coder/prompts/classify_failure.md`: the §15.8 template the classifier renders; same double-brace placeholder convention as `triage.md`.
- `src/gaia/coder/tools/debug.py`: eight @tool-decorated closures (repro_attempt, git_bisect, add_instrumented_trace, run_with_tracing, diff_behavior, query_failure_patterns, flake_check, minimize_repro) with all subprocess calls funneled through a single injectable `_run` so tests never spawn real processes.
- `src/gaia/coder/debug_loop.py`: the seven-state FSM. Discipline enforcement is the load-bearing piece: `propose_fix` refuses unless the failure is reproduced and top-hypothesis confidence is ≥ 80% (overriding the confidence bar requires `shotgun=True` AND `em_elevation_granted=True`); `hypothesise` enforces a minimum of three hypotheses; `postmortem` writes both `feedback.db` notes and a `failure_patterns` memory row.

Test plan
- `pytest tests/coder/test_dev_mode.py`: 15 tests
- `pytest tests/coder/test_self_heal.py`: 15 tests
- `pytest tests/coder/test_debug_tools.py`: 19 tests
- `pytest tests/coder/test_debug_loop.py`: 18 tests
- `pytest tests/coder/`: 265 passed, 2 pre-existing skipped
- `python util/lint.py --all`: clean on all new files (black, isort, flake8 with project config, pylint); remaining critical errors are in pre-existing files unrelated to this PR

Out of scope
Wiring into `src/gaia/coder/loop.py` / `src/gaia/coder/base.py` / the
`gaia-coder` CLI is not in this PR; Phase 11 (production swap) is
the right place for that, and doing it here would have pulled in the
scaffold branch's churn. The Python APIs are complete and tested in
isolation; the integration surface is the follow-up.
Spec references: §7.1 (dev mode), §7.5 (self-detected bug), §5.9
(debug sub-loop), §15.8 P8 (classify_failure prompt), §6.8.1
(failure_patterns memory topic).