feat(coder): package skeleton + loop.py with DEFAULT_LOOP #819
Phase 1 scaffolding per docs/plans/coder-agent.mdx §3.1. Creates the package skeleton every downstream task imports from: subpackages for prompts, stores, tools, review, introspect, self_fix, tests, and skills; the three living documents (GAIA.md, ARCHITECTURE.md, PROJECT_MAP.md) as short placeholders; the skills catalog.toml seed; and the gaia-coder console script backed by an argparse-based CLI whose 17 subcommands (§3.1) are all Phase 1 stubs that print "not yet implemented" and exit 0. setup.py registers the new packages and the gaia-coder entry point; the existing gaia-code entry (legacy Next.js scaffolder) is untouched per §1. State machine (loop.py) and base class (base.py) land in follow-up commits on this branch; sibling branches feature/gaia-coder-stores and feature/gaia-coder-mixins-1 will fill in stores/ and tools/ after this lands.
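The stub CLI described above can be sketched roughly as follows. This is a minimal illustration and not the contents of `gaia.coder.cli`: the helper names `_stub`, `build_parser`, and `main` are assumptions, while the subcommand list and the "not yet implemented" message come from the PR description.

```python
import argparse

# The 17 subcommand names listed in the PR description.
SUBCOMMANDS = [
    "daemon", "status", "ask", "note", "critical", "inbox", "feedback",
    "promote", "demote", "trust", "audit", "spend", "egress",
    "introspect", "skill", "doctor", "rag",
]


def _stub(name):
    """Build a handler that reports the verb as unimplemented and exits 0."""
    def handler(args):
        print(f"{name}: not yet implemented")
        return 0
    return handler


def build_parser():
    """Assemble an argparse parser with one stub sub-parser per verb."""
    parser = argparse.ArgumentParser(prog="gaia-coder")
    sub = parser.add_subparsers(dest="command")
    for name in SUBCOMMANDS:
        sub_parser = sub.add_parser(name, help=f"{name} (Phase 1 stub)")
        sub_parser.set_defaults(handler=_stub(name))
    return parser


def main(argv=None):
    """Dispatch to the selected stub; print help when no verb is given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not hasattr(args, "handler"):
        parser.print_help()
        return 0
    return args.handler(args)
```

Keeping every verb behind `set_defaults(handler=...)` means a later phase can swap a stub for a real handler without touching the dispatch code.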
Adds the canonical ReAct control flow for gaia-coder per docs/plans/coder-agent.mdx §5.1 and §15.3:
- src/gaia/coder/loop.py defines the immutable Transition / State / Loop dataclasses and ships DEFAULT_LOOP: 20 states grouped into the seven stages (Intake, Understand, Design, Build, Verify, Publish, Land). The self_review state emits the updated three-way transition (publish | debug | edit based on failure_is_complex) from §15.3 so shallow failures go straight back to edit while complex failures enter the dedicated debug sub-loop.
- introspect_state_machine() renders the loop as a JSON snapshot plus a stateDiagram-v2 Mermaid string; the Phase 3 IntrospectionToolsMixin (§7.7) will wrap this as the public tool.
- src/gaia/coder/base.py defines CoderAgent: her own base class that does NOT inherit from gaia.agents.base.Agent, since product-agent assumptions (request-scoped, single LLM round-trip) do not match her daemon-scoped lifecycle. Composition from the GAIA base (@tool, AgentConsole, PathValidator) is still allowed at use-site.

The top-level package re-exports CoderAgent, DEFAULT_LOOP, Loop, State, and Transition so sibling branches have a stable public surface to import from.
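To make the shape of the immutable loop concrete, here is a minimal sketch of frozen dataclasses with a three-way `self_review` transition and a Mermaid renderer. The field names (`target`, `condition`, `stage`, `transitions`), the `render_mermaid` helper, the stage assigned to `self_review`, and the one-state `TOY_LOOP` are all assumptions for illustration; the real `loop.py` may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transition:
    target: str          # name of the next state
    condition: str = ""  # human-readable guard (hypothetical field)


@dataclass(frozen=True)
class State:
    name: str
    stage: str
    transitions: Tuple[Transition, ...] = ()


@dataclass(frozen=True)
class Loop:
    states: Tuple[State, ...]

    def get(self, name: str) -> State:
        """Look up a state by name, raising KeyError if it is missing."""
        for state in self.states:
            if state.name == name:
                return state
        raise KeyError(name)


def render_mermaid(loop: Loop) -> str:
    """Render every transition as one stateDiagram-v2 edge."""
    lines = ["stateDiagram-v2"]
    for state in loop.states:
        for t in state.transitions:
            label = f": {t.condition}" if t.condition else ""
            lines.append(f"    {state.name} --> {t.target}{label}")
    return "\n".join(lines)


# Toy one-state loop; the stage name for self_review is assumed here.
SELF_REVIEW = State(
    name="self_review",
    stage="Verify",
    transitions=(
        Transition("publish", "review passed"),
        Transition("debug", "failure_is_complex"),
        Transition("edit", "not failure_is_complex"),
    ),
)
TOY_LOOP = Loop(states=(SELF_REVIEW,))
```

Because the dataclasses are frozen and the collections are tuples, an accidental in-place mutation raises instead of silently changing the control flow.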
Five tests pin the public surface sibling branches rely on:
- gaia-coder --help runs and exits 0 with a subcommand list.
- `from gaia.coder import CoderAgent` and `from gaia.coder.loop import DEFAULT_LOOP` both succeed.
- DEFAULT_LOOP has exactly 20 states (§15.3).
- The self_review state has exactly three transitions targeting publish / debug / edit. This is the updated three-way transition from §15.3 that a typo or edit to loop.py would regress silently.
- introspect_state_machine() returns a Mermaid render; the test accepts either the `graph TD` or `stateDiagram` dialect since §7.7 does not pin the choice.

CLI invocation uses `python -m gaia.coder.cli` so the test runs whether or not the gaia-coder console script has been installed; either route exercises the same invariant.
Three review-followups from the #818/#819/#820 merge:
- src/gaia/coder/__init__.py: restore the CoderAgent/DEFAULT_LOOP/Loop/State/Transition re-exports that #819 added and the #818 rebase accidentally dropped. test_package_imports in test_skeleton.py relies on these.
- src/gaia/coder/tools/cli.py: replace the bare except Exception: pass in the stream-reader teardown with a targeted OSError catch + logger.debug. Per CLAUDE.md's fail-loudly rule, silent swallows hide reader-thread bugs; a stream close failing under pipe tear-down is real and worth a debug line.
- src/gaia/coder/tools/search.py: stop reaching into the private _TOOL_REGISTRY dict from grep(). Use get_tool_metadata(); the public accessor added in base/tools.py:113 exists precisely for this.

Tests: all 73 (5 skeleton + 38 mixins + 30 stores) pass on coder HEAD with the fixes.
#823)
## Summary
Three review-followups from the #818 / #819 / #820 merge, flagged by the auto-review bot. All tests (73/73 on `coder`) pass with the fixes.
## What this changes
- **Critical** (`src/gaia/coder/__init__.py`): restore the `CoderAgent` / `DEFAULT_LOOP` / `Loop` / `State` / `Transition` re-exports that #819 added and #818's rebase accidentally dropped. Without this, `tests/coder/test_skeleton.py::test_package_imports` fails on `coder` HEAD.
- **Important** (`src/gaia/coder/tools/cli.py:135`): replace bare `except Exception: pass` in the stream-reader teardown with a targeted `OSError` catch + `logger.debug`. CLAUDE.md's *No Silent Fallbacks* rule explicitly forbids this pattern.
- **Important** (`src/gaia/coder/tools/search.py:59`): stop reaching into the private `_TOOL_REGISTRY` from `grep()`. Use `get_tool_metadata()`; the public accessor at `src/gaia/agents/base/tools.py:113` exists for this.
## Test plan
- [x] `pytest tests/coder/ -x`: 73/73 pass (5 skeleton + 38 mixins + 30 stores)
- [x] `python -c "from gaia.coder import CoderAgent, DEFAULT_LOOP, Loop, State, Transition"`: all imports resolve
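The teardown change in `tools/cli.py` follows a common pattern that can be sketched as below. The function name `close_stream_quietly` and the logger name are hypothetical; only the idea (catch `OSError` narrowly, log at debug level, let everything else propagate) comes from the description above.

```python
import logging

logger = logging.getLogger("gaia.coder.sketch")  # hypothetical logger name


def close_stream_quietly(stream) -> bool:
    """Close a stream during teardown.

    An OSError (for example a pipe that is already gone) is expected here,
    so it is logged at debug level rather than swallowed silently. Any other
    exception is a real bug and is allowed to propagate.
    """
    try:
        stream.close()
    except OSError as exc:
        logger.debug("stream close failed during teardown: %s", exc)
        return False
    return True
```

Compared with `except Exception: pass`, a programming error in the reader thread (say, an `AttributeError`) now surfaces instead of disappearing.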
Addresses the auto-review findings on #819 and #820:

**Land the spec.** docs/plans/coder-agent.mdx was being referenced from 19 places in the scaffold (every module docstring, every living doc, every CLI body) but the file itself was untracked, so every citation pointed at vapour. Commit it now so Phase 2+ branches inherit real citations. Also lands docs/superpowers/specs/2026-04-19-gaia-code-agent-analysis.md as the retained historical context per the plan's §14.

**SQL identifier hardening** (src/gaia/coder/stores/_common.py). The CRUD primitives string-interpolated table names, column names, and ORDER BY clauses. SQLite has no built-in parameterisation for identifiers, so whitelist matching is the only safe path. Adds _safe_ident() + _SAFE_IDENT_RE and applies it at every interpolation site, plus multi-clause ORDER BY validation (split by comma, each clause validated for optional ASC/DESC direction). Auto-review flagged this as a latent injection surface; treating it as a latent vuln, not hypothetical.

**Narrow the conftest catch** (tests/coder/conftest.py). The bare `except Exception` during stub fallback was masking import-time bugs in stores (AttributeError, SyntaxError). Now catches only ImportError/ModuleNotFoundError and logs the underlying reason so real failures surface in CI rather than silently stubbing over. Scaffold is now landed so step-1 (real import) always succeeds; this is a safety net, not a hot path.

**Register gaia-coder in CLAUDE.md + docs.json**. The new console_scripts entry was missing from the Console Script Entry Points table (§Project Structure) and from the "Standalone binaries" section. Also adds plans/coder-agent to docs.json under the Agents group so Mintlify renders it.

All 73 coder tests pass (5 skeleton + 38 mixins + 30 stores).
## Summary
Addresses the auto-review findings on #819 + #820 and lands the plan that every scaffold citation was pointing at.
## Changes
- **Land `docs/plans/coder-agent.mdx`** (3,584 lines). 19 scaffold citations were pointing at a file that didn't exist in tracked state. Also lands `docs/superpowers/specs/2026-04-19-gaia-code-agent-analysis.md` as the retained historical context.
- **SQL identifier hardening** (`src/gaia/coder/stores/_common.py`). `_safe_ident()` + `_SAFE_IDENT_RE` validation at every interpolation site (table, column, ORDER BY). Multi-clause ORDER BY supported.
- **Narrow conftest catch** (`tests/coder/conftest.py`). Only catches `ImportError` / `ModuleNotFoundError`; logs the reason. Real import-time bugs in stores now surface.
- **Register `gaia-coder`** in `CLAUDE.md` Console Script Entry Points + Standalone binaries, and add `plans/coder-agent` to `docs/docs.json` under Agents.
## Test plan
- [x] `pytest tests/coder/`: 73/73 pass (5 skeleton + 38 mixins + 30 stores)
- [x] `docs/plans/coder-agent.mdx` resolves for every scaffold docstring citation
- [x] SQL injection guard: `_safe_ident("x; DROP TABLE")` raises `ValueError`
## Summary
Phase 4 of the `gaia-coder` plan: the seven-pass self-review gate that runs before she ever calls `gh_pr_create`. Implements the full §8 table, covering deterministic checks (lint, tests, security, prose) plus LLM-driven passes (architectural, persona, adversarial, feedback-binding), and surfaces an aggregated verdict with the confidence score the §7.6 auto-merge path reads.

This is the "deep-review discipline" the trust contract is built on. Every PR she opens must clear this gate; self-fix PRs additionally require Pass 7 (feedback-binding), which confirms the diff actually addresses the EM's wording and that the regression test fails on `coder` and passes on the branch.
## What this PR adds
- **Seven review passes** (`pass_1_static` through `pass_7_feedback_binding`): each a focused module that returns a common `PassResult` envelope, so the gate never has to special-case any of them.
- **`gate.run_all_passes`**: orchestrator with cost-aware short-circuiting. A hard-fail on the deterministic cheap gates (1, 2, 4) skips the expensive Opus calls, because there is no point paying for adversarial review on a branch that fails lint.
- **`ReviewToolsMixin`**: exposes every pass as a `@tool` (`review_diff_self_static`, …) plus the one-shot `review_diff_gate`, so the agent, the EM's TUI, and the evaluation harness all share one code path.
- **Canonical prompt files** for Passes 3, 5, 6, 7 under `src/gaia/coder/prompts/`, materialised verbatim from §15.8. Self-fix PRs that touch a prompt are now a single-file `git diff`, the whole point of the §6.4 whitelist.
- **22 unit tests** that cover every pass (one pass + one fail per module), the gate's short-circuit behaviour, the self-fix gating of Pass 7, the confidence-score contract, and a prompt-files-exist guard so the canonical templates can't silently drift.
## Why the LLM seam matters
Every Opus call flows through `gaia.coder.review._llm.call_opus`. Tests patch that one name (cheap, stable, vendor-agnostic) rather than reaching into the Anthropic SDK. When we ship the Claude Agent SDK integration for Passes 3 and 6 (the `architecture-reviewer` / `code-reviewer` subagents in §15.8), the switch happens at a single module boundary without churning every test.
## Scope constraint
Changes are confined to `src/gaia/coder/review/`, `src/gaia/coder/prompts/`, and `tests/coder/test_review.py`. No other files are touched. Pass 7's differential-pytest worktree machinery is wired but opt-out via `skip_differential_pytest=True` for unit tests, because Phase 4 intentionally does not ship a running daemon; that's Phase 5.
## Dependencies
- Stacks on top of #818 (mixins; uses `gaia.coder.tools.cli._check_denylist` for subprocess safety) and #819 (scaffold). Both appear in this PR until they merge to `coder`; the diff will clean itself up afterwards.
- Does not depend on #820 (stores).
## Citations
- [`docs/plans/coder-agent.mdx`](../blob/feature/gaia-coder-review/docs/plans/coder-agent.mdx) §8: the seven-pass table
- §15.8: canonical prompt templates for Passes 3/5/6/7
- §7.6: confidence-score gate on auto-merge
- §7.4: self-correction loop that Pass 7 bookends
## Test plan
- [x] `pytest tests/coder/test_review.py -xvs`: 22/22 pass
- [x] `pytest tests/coder/`: 65/65 pass (sibling tests untouched)
- [x] `python util/lint.py --black --isort`: clean on in-scope files
- [x] Smoke: `from gaia.coder.review import ReviewToolsMixin; m = ReviewToolsMixin(); m.register_review_tools()` registers exactly 8 tools
- [ ] Integration run against a real branch with Anthropic creds present (Phase 5 territory, deferred to the daemon wiring task)
## Do not merge
Draft: this stacks on two unmerged sibling PRs and the Phase 4 runtime wiring lives in Phase 5.
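The cost-aware short-circuit in the gate can be illustrated with a small sketch. Everything here beyond the names `PassResult` and `run_all_passes` and the pass numbering is assumed: the envelope fields, running the cheap gates first, treating any cheap-gate failure as a hard-fail, and passing each pass in as a zero-argument callable are simplifications for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

CHEAP_GATES = (1, 2, 4)    # deterministic passes, per the PR description
LLM_PASSES = (3, 5, 6, 7)  # passes that have canonical prompt files


@dataclass(frozen=True)
class PassResult:
    pass_id: int
    passed: bool
    skipped: bool = False  # hypothetical field
    detail: str = ""       # hypothetical field


def run_all_passes(
    passes: Dict[int, Callable[[], PassResult]],
    is_self_fix: bool = False,
) -> List[PassResult]:
    """Run cheap gates first; skip the LLM passes if any cheap gate fails."""
    results = [passes[pid]() for pid in CHEAP_GATES]
    cheap_failed = any(not r.passed for r in results)
    for pid in LLM_PASSES:
        if pid == 7 and not is_self_fix:
            # Pass 7 (feedback-binding) only applies to self-fix PRs.
            results.append(PassResult(pid, True, skipped=True,
                                      detail="not a self-fix PR"))
        elif cheap_failed:
            results.append(PassResult(pid, False, skipped=True,
                                      detail="skipped: cheap gate failed"))
        else:
            results.append(passes[pid]())
    return sorted(results, key=lambda r: r.pass_id)
```

Because each pass is injected as a callable, a unit test can count invocations and confirm that no LLM-backed pass runs after a lint failure.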
## Summary
Lands Phase 5 of `docs/plans/coder-agent.mdx` — the EM-facing surface of
`gaia-coder`. Before this PR the CLI verbs were stubs; after it, the EM
can bootstrap the agent, read her trust contract, promote/demote her,
and queue messages. Every LLM call now injects her identity triplet
(`GAIA.md` + `ARCHITECTURE.md` + `PROJECT_MAP.md`) as a cacheable
prefix.
## Threads
- **`trust.py`** — `EMConfig` / `RepoBinding` Pydantic models, the 0-5
`CapabilityTier` ladder, TOML round-trip, and `promote` / `demote`
functions with audit-log writes. Promotion refuses mismatched EM
signatures; demotion is immediate. *Why it matters:* §4.2 makes
promotion explicit, so a quiet accept would let any caller escalate
tier. A fail-loudly `TrustError` surfaces exactly what to fix.
- **`inbox.py`** — thin CRUD over `em_inbox.db` with the §4.5 5-second
non-LLM auto-ack, channel-agnostic dispatch callable, and escalation
into `feedback.db` with severity translation. *Why it matters:*
§4.5 says the ack is non-negotiable in latency; keeping it template-
only (no model call) makes the SLA automatic.
- **`intent.py`** — LLM-driven conversational intent classifier for
§15.4 + §15.8 P9, temperature 0 Opus 4.7, mockable via an injected
`llm` callable. Low confidence (< 70) and unknown intents coerce to
`free_form`. Handler functions cover every §15.4 intent. *Why it
matters:* regex matchers would miss paraphrases ("let me give you
self-edit for now"); LLM routing keeps the grammar maintainable.
- **`prompt_composer.py`** — builds Anthropic-format message blocks
with `cache_control={"type":"ephemeral"}` on the identity triplet
+ per-skill blocks, per §3.2 / §4.6 / §6.5. *Why it matters:* §3.1
mandates prompt caching; this is the single place that decides what
gets cached.
- **`cli.py`** — replaces seven stub handlers (`trust`, `promote`,
`demote`, `ask`, `note`, `critical`, `inbox`) with real ones. Config
dir honours `$GAIA_CODER_HOME` so tests never touch real user state.
Stubs remain for the Phase 6+ verbs (`daemon`, `status`, `feedback`,
`doctor`, etc.).
- **`prompts/intent_classifier.md` + `prompts/standup.md`** — §15.8 P9
and P10 prompt templates landed verbatim for future `prompt`-class
self-fix PRs.
- **Tests (58 new)** — each module has a dedicated test file; CLI tests
run as subprocess to exercise the full argparse + env-var path.
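As an illustration of the prompt-composer thread above, the sketch below builds Anthropic-format text blocks with `cache_control` set on the identity triplet and on each skill block. The function name `compose_system_blocks`, its parameters, and the way the three documents are joined are assumptions; only the block shape and the `{"type": "ephemeral"}` marker come from the description.

```python
from typing import Dict, List

# The identity triplet named in the summary.
IDENTITY_FILES = ("GAIA.md", "ARCHITECTURE.md", "PROJECT_MAP.md")


def compose_system_blocks(
    identity_docs: Dict[str, str],
    skill_blocks: List[str],
) -> List[dict]:
    """Build system blocks with the identity triplet as a cacheable prefix.

    identity_docs maps each file name to its text (hypothetical input shape).
    """
    identity_text = "\n\n".join(
        f"# {name}\n{identity_docs[name]}" for name in IDENTITY_FILES
    )
    blocks = [{
        "type": "text",
        "text": identity_text,
        "cache_control": {"type": "ephemeral"},
    }]
    for skill in skill_blocks:
        blocks.append({
            "type": "text",
            "text": skill,
            "cache_control": {"type": "ephemeral"},
        })
    return blocks
```

The API limits how many cache breakpoints a single request may carry, so a real composer would likely need to cap or merge the per-skill blocks; this sketch does not attempt that.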
## Dependencies
This PR depends on sibling branches that have not yet merged to `coder`:
- **#819 scaffold** — imports `gaia.coder.{__init__,base,loop}` and the
`GAIA.md` / `ARCHITECTURE.md` / `PROJECT_MAP.md` placeholders.
- **#820 stores** — hard dep on `gaia.coder.stores.{em_inbox, feedback,
audit}`.
- **#818 mixins** — optional import of `gaia.coder.tools.cli` for
subprocess helpers (not used in this PR's code paths).
Rebase onto `coder` once those land.
## Test plan
- [ ] `pytest
tests/coder/test_{trust,intent,inbox,prompt_composer,cli_trust}.py -xvs`
— all 58 tests pass
- [ ] `gaia-coder trust` on a fresh `$GAIA_CODER_HOME` prints the §4.1
bootstrap question and exits 0
- [ ] `gaia-coder trust --bootstrap --em-handle <you> --em-channel <ch>`
then `gaia-coder trust` renders the §4.2 template verbatim
- [ ] `gaia-coder promote --to-tier 2 --reason "..." --em-signature
<you>` updates `em.toml` and writes an audit row
- [ ] `gaia-coder promote ... --em-signature wrong-user` exits 1 with
the mismatch message on stderr
- [ ] `gaia-coder ask "enable self-edit"` prints the auto-ack template
and enqueues a pending row in `em_inbox.db`
- [ ] `gaia-coder inbox` lists the pending row
…825)
## Summary
Phase 6 implements **gaia-coder's self-correction loop**, the core value proposition. When the EM gives feedback, the agent now triages it into one of eight fix classes, localises the cause, drafts a regression-tested plan, applies a fix on an `auto/gaia-coder/<feedback_id>` branch, and opens a draft PR targeting `coder`. Wires §7.2 continuous critique, §7.3 feedback intake, §7.4 loop steps 1-10, and §7.9 verification. Loosely coupled to Phase 4 (review gate) and Phase 5 (trust inbox) so they can land in any order.
## Threads
- **Prompt templates (§15.8 P1/P2/P3).** `prompts/triage.md`, `prompts/critique.md`, and `prompts/plan_review.md`: the three canonical text bodies the loop consumes. Stored as plain Markdown so she can edit them via prompt-class self-fix PRs.
- **Triage (§7.4 step 1-2).** `classify_fix_class` runs P1 on Opus 4.7; `< 60` confidence is rewritten to `out-of-scope` so the loop never commits to a guess. `localise` is deterministic grep, with no LLM.
- **Planner (§5.1 Stage 3 / §7.4 step 3).** `draft_plan` + `is_large_job` + `request_em_approval`. Large jobs (> 200 LoC, or `architectural` / `state-machine`, or cross-mixin) post the P3 message to the EM inbox and wait for ✅.
- **Fixer (§7.4 steps 4-5).** `generate_fix` creates the self-fix branch and applies edits; `write_regression_test` emits a pytest file + marker flag so it *genuinely* fails on `coder` and passes on the fix branch (no clever mocks); `verify_test_differential` raises on the pass-on-both or fail-on-both failure modes.
- **Publisher (§7.4 steps 7-8).** `open_self_fix_pr` refuses to open without a regression test (§7.4 step 5 hard rule). PR body cites `feedback_id` and quotes the EM's wording verbatim; Pass 7 (feedback-binding) depends on both.
- **Verifier (§7.4 step 10).** `verify_on_merge` re-runs the regression test on the merged SHA, transitions the feedback row to `verified`, and writes `failure_patterns` + `review_patterns` memory records so the same symptom is recognised next time.
- **Continuous critique (§7.2).** Cheap one-shot Opus call after every state-changing tool; findings `< 60` confidence are dropped; `≥ 80` surface inline for pre-transition action.
- **Loop driver.** `FeedbackLoopDriver.process_pending_feedback()` orchestrates the whole thing with full `pending → triaged → in-fix → fix-pr-open → verified | rejected → closed` transitions written to `feedback.notes_json`.
- **SelfFixToolsMixin.** Registers **10 tools** (well above the §15.2 ≥ 7 contract) so the loop is callable from the agent's tool registry. Phase-7 tools (`classify_failure`, `pause_current_task`, `restart_self`, `edit_self_file`) are intentionally out of scope.
- **CLI.** `gaia-coder feedback "<body>" --severity high --on <url>` enqueues, `gaia-coder self-fix process` runs one iteration.
- **Tests.** 47 new tests covering every §7.4 step, every §7.2 filtering rule, and the §15.2 mixin contract. Mocks Anthropic and `gh` at the callable boundary; uses a tmp git repo with a `coder` branch for real branch creation and pytest differential runs.
## Out of scope (explicit non-goals per the Phase 6 task)
- EventBridge ingestion of `@gaia-coder feedback:` comments (Phase 10 repo binding).
- Auto-merge (§7.6): blocked on Phase 4 review-gate confidence score.
- Dev-mode self-edit (§7.5) and `restart_self`: Phase 7.
- ReAct loop self-edit (§7.8): Phase 7.
## Dependencies
- **#818 (mixins):** Imports `edit_file` semantics inline rather than instantiating `FileToolsMixin`, to keep the module usable without the mixin wired.
- **#819 (scaffold):** Package skeleton + GAIA.md + prompts/ dir.
- **#820 (stores):** Uses `gaia.coder.stores.feedback` and `gaia.coder.stores.memory` directly.
- **Phase 4 review gate (sibling):** Loose-coupled. `review_gate_runner` is optional; when absent, we publish a "(review gate not available)" placeholder in the PR body.
- **Phase 5 trust (sibling):** Loose-coupled. `request_em_approval` uses `gaia.coder.trust.inbox.enqueue` if importable, else defers.
## Test plan
- [ ] `pytest tests/coder/test_self_fix/ -xvs`: 47 tests pass.
- [ ] `pytest tests/coder/ -q`: full coder suite (120 tests) passes.
- [ ] `python util/lint.py --all`: no new critical errors (pre-existing pylint warnings in `cli.py`, `discovery.py`, etc. are not touched).
- [ ] `python -c "from gaia.coder.self_fix import SelfFixToolsMixin; m = SelfFixToolsMixin(); assert len(m.register_self_fix_tools()) >= 7"`: smoke.
- [ ] `gaia-coder feedback "..." --severity high --on https://github.com/amd/gaia/pull/9999 --id fb-test --db-path /tmp/fb.db`: writes a row; follow with `gaia-coder self-fix process --db-path /tmp/fb.db --repo-root <worktree> --skip-differential-verify --skip-fix-apply` for the end-to-end path (PR creation mocked).
## Merge plan
- Draft PR, **do not merge**. Rebase onto `coder` after Phase 4 and Phase 5 land so the review gate and inbox wiring become real imports.
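The differential check that the Fixer thread relies on reduces to a small decision table, sketched below. The real `verify_test_differential` presumably runs pytest on both branches; this simplified version takes the two outcomes as booleans, and `DifferentialError` plus the third (inverted) case are assumptions added for illustration.

```python
class DifferentialError(RuntimeError):
    """Raised when a regression test does not discriminate the fix."""


def verify_test_differential(passes_on_base: bool, passes_on_fix: bool) -> None:
    """Require the test to fail on the base branch and pass on the fix branch."""
    if passes_on_base and passes_on_fix:
        raise DifferentialError(
            "regression test passes on both branches; it does not reproduce the bug"
        )
    if not passes_on_base and not passes_on_fix:
        raise DifferentialError(
            "regression test fails on both branches; the fix does not resolve it"
        )
    if passes_on_base and not passes_on_fix:
        # Not named in the PR text; included here as an assumed extra guard.
        raise DifferentialError(
            "regression test passes on base but fails on the fix branch"
        )
```

Only the combination "fails on `coder`, passes on the fix branch" returns without raising, which is the property Pass 7 later re-checks.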
## Summary
Phase 1 scaffolding for `gaia-coder` per `docs/plans/coder-agent.mdx`. Creates the package structure every follow-up task imports from: no runtime behaviour, no LLM calls, no network, no SQLite. Just the load-bearing shapes: her own base class, the editable 20-state ReAct loop, the CLI surface, and the three living documents the spec anchors on.

This is the trunk the two sibling branches (`feature/gaia-coder-stores`, `feature/gaia-coder-mixins-1`) rebase onto once it lands.
## Threads
- `src/gaia/coder/` with sub-packages for `prompts/`, `stores/`, `tools/`, `review/`, `introspect/`, `self_fix/`, `tests/`, `skills/`. Every `__init__.py` is a stub pointing at the phase that fills it in. Matters because downstream tasks need stable import paths now without racing on who creates which directory.
- `gaia-coder` console script wired through `setup.py`. All 17 subcommands from the spec (`daemon`, `status`, `ask`, `note`, `critical`, `inbox`, `feedback`, `promote`, `demote`, `trust`, `audit`, `spend`, `egress`, `introspect`, `skill`, `doctor`, `rag`) print `"<name>: not yet implemented"` and exit 0. Matters so docs, tests, and muscle memory can start using the real surface immediately. Uses argparse (matching existing `gaia` CLI) rather than click, which is not a project dependency.
- `src/gaia/coder/loop.py` with immutable `Transition`, `State`, `Loop` dataclasses and `DEFAULT_LOOP` containing all 20 states grouped into the seven stages. The `self_review` state carries the updated three-way transition (`publish` | `debug` | `edit` based on `failure_is_complex`) from §15.3: shallow failures go straight back to edit, complex ones enter the dedicated debug sub-loop. Matters because a typo here silently regresses her review discipline.
- `CoderAgent` that does NOT inherit from `gaia.agents.base.Agent`. Product-agent assumptions (request-scoped, single LLM round-trip) do not match her daemon-scoped lifecycle with durable queues and editable control flow. Composition from the GAIA base is still allowed at use-site (`@tool`, `AgentConsole`, `PathValidator`). Matters so sibling tasks can `from gaia.coder import CoderAgent` today.
- `GAIA.md` (identity: principles, persona, Karpathy working-style rules), `ARCHITECTURE.md` (how she is composed), `PROJECT_MAP.md` (how the project she's building is composed), plus `skills/catalog.toml` seed. All short Phase-1 placeholders; real content lands in Phase 3/5. Matters because these docs are referenced from every system prompt, so they need to exist on disk before the runner does.
- `introspect_state_machine()` in `loop.py` returns a JSON snapshot + `stateDiagram-v2` Mermaid render of the loop. The Phase 3 `IntrospectionToolsMixin` wraps this as the public `@tool`; keeping the implementation next to the loop it describes avoids duplication.
## Deviations from spec
- `click` is not in `setup.py` or any extras. Switched to `argparse` to match the existing `src/gaia/cli.py` convention and avoid adding a dependency. The subcommand surface is unchanged.
- Splitting `loop.py` across two commits (dataclasses in b, `DEFAULT_LOOP` + introspection helper in c) required temporarily deleting and re-adding content, which felt worse than a clean "state machine + base class" commit. Commits are: (1) scaffold + CLI, (2) loop.py + base.py + `__init__.py` re-exports, (3) tests.
## Test plan
- `pytest tests/coder/test_skeleton.py -xvs`: 5/5 pass, including the guard against regressing `self_review`'s three-way transition.
- `python util/lint.py --black --isort --flake8` on the new files: all clean (project flake8 uses `--max-line-length=88` with E501 ignored per `util/lint.py`).
- `uv pip install -e .`: succeeds; `gaia-coder --help` prints the 17-subcommand list; stub subcommands exit 0 (`gaia-coder introspect`, `gaia-coder status`, etc.).
- `git diff feature/coding-agent...HEAD --stat`: changes only under `src/gaia/coder/`, `tests/coder/`, and `setup.py`. No edits to `src/gaia/agents/` or other existing modules.
- Existing `gaia-code` entry (legacy Next.js scaffolder) untouched per §1.