feat(coder): SQLite stores per §15.1 DDL by kovtcharov · Pull Request #820 · amd/gaia

kovtcharov · 2026-04-20T08:24:27Z

Summary

Implements the ten persistent stores called out in §15.1 of docs/plans/coder-agent.mdx — the durable-state layer every Phase 1 / Phase 3 gaia-coder feature depends on (task queue, EM inbox, feedback queue, spend ledger, audit log, CI history, memory, paused-task snapshots, self-edit JSONL, learnings plain-text log). No business logic — just DDL, Pydantic row models, typed CRUD, and tests.

Seven SQLite stores — em_inbox.db, tasks.db, feedback.db, spend.db, audit.log.db, ci_history.db, memory.db. Each ships the §15.1 DDL verbatim (grep-verifiable), a create_tables() / open_store() pair with canonical PRAGMAs (WAL, foreign_keys=ON, busy_timeout=5000), and insert_row / get_row / update_row / list_rows helpers keyed off a per-store Pydantic v2 model.
Three append-only stores — paused_tasks (JSON-per-file), self_edits_log (JSONL), learnings_log (plain text with a strict single-line grammar). All three use Pydantic for schema validation on the write side and fail loudly on read errors.
stores/__init__.py::create_all_stores(root) — idempotent convenience that materialises every store under one directory and returns a {store_name: Path} map.
30 tests in tests/coder/test_stores.py — create / round-trip / CHECK-constraint for every SQLite store, round-trip / schema-rejection for the append-only logs, smoke test for create_all_stores.
tests/coder/conftest.py — defensive sys.modules stubs for gaia.coder.base / gaia.coder.loop so these tests stay isolated from sibling scaffold churn; becomes inert once the sibling branch stabilises.

FAISS index wiring for the memory store is intentionally out of scope — that's Phase 10.

Dependencies

Branched off feature/gaia-coder-scaffold (sibling task, already merged into this branch via rebase). The three scaffold commits (daefa1d, a160148, da051cd) at the tip are required for from gaia.coder import CoderAgent to resolve, but nothing in this PR's 12 commits depends on the scaffold beyond defensive test stubs.

Commits

One commit per store module (ten) plus one for the infra (_common.py + test conftest) and one for the package __init__ with create_all_stores:

feat(coder): stores — SQLite connection helpers + test harness
feat(coder): em_inbox store + CRUD + tests
feat(coder): tasks store + CRUD + tests
feat(coder): feedback store + CRUD + tests
feat(coder): spend store + CRUD + tests
feat(coder): audit store + CRUD + tests
feat(coder): ci_history store + CRUD + tests
feat(coder): memory store + CRUD + tests (SQL side only)
feat(coder): paused_tasks snapshot store + tests
feat(coder): self_edits_log JSONL append-only store + tests
feat(coder): learnings_log plain-text append-only store + tests
feat(coder): stores/__init__ with create_all_stores + smoke test

Test plan

python -m pytest tests/coder/test_stores.py -xvs — 30 passed locally.
python util/lint.py --black --isort — clean.

Grep-verify that every §15.1 DDL string appears verbatim under src/gaia/coder/stores/:

for t in em_inbox tasks feedback spend audit ci_history memory; do
  grep -l "CREATE TABLE $t" src/gaia/coder/stores/*.py || echo "MISSING: $t"
done

Do not merge

Draft — blocked on the sibling feature/gaia-coder-scaffold PR landing first so the git history stays clean once both land on feature/coding-agent.

Shared infrastructure for the per-store modules landing in subsequent commits: - src/gaia/coder/stores/_common.py exposes open_connection() with the canonical PRAGMAs (WAL, foreign_keys=ON, busy_timeout=5000) plus low-level insert / fetch / update helpers every per-store module wraps with typed signatures. The primitives are deliberately untyped at row-level — each store attaches its own Pydantic model. - tests/coder/conftest.py pre-populates sys.modules with a stub for gaia.coder.base so the stores tests stay isolated from in-flight sibling scaffold work on feature/gaia-coder-scaffold.

Phase 1 §15.1 verbatim DDL for em_inbox.db — the EM input queue. One row per inbound message from the bound EM (cli | tui | gh-comment | email | daily-standup-reply). CHECK constraints on channel, severity, and state. Indices on state and received_at. Exposes DDL, create_tables(), open_store() with canonical PRAGMAs, and the EmInboxRow pydantic model plus insert_row / get_row / update_row / list_rows helpers. tests cover table creation, round-trip, and CHECK rejection on invalid channel, severity, and state values.

§15.1 verbatim DDL for tasks.db — the agent's task queue. One row per engineering task through its full lifecycle: pending → running → waiting → blocked → paused → done → abandoned. Seven-valued CHECK on state; indices on state and on (priority DESC, created_at) so the worker can claim the highest-priority pending task efficiently. TaskRow pydantic model, full CRUD, and tests for creation, round-trip (including priority-DESC ordering), and CHECK rejection of an invalid state.

§15.1 verbatim DDL for feedback.db — the graduated-from-em_inbox feedback queue. Two CHECK enums: severity (low | med | high | critical) and fix_class covering the seven agent-fix classes plus out-of-scope (eight total: prompt, doc, test, tool, policy, architectural, state-machine, out-of-scope). Seven-valued state lifecycle: pending → triaged → in-fix → fix-pr-open → verified → rejected → closed. FeedbackRow pydantic model + full CRUD. Tests exercise all eight fix_class values round-trip and confirm CHECK rejection for invalid severity and invalid fix_class.

§15.1 verbatim DDL for spend.db — per-call cost ledger. Captures the four Anthropic token dimensions separately (input, cache_read, cache_create, output) plus the resolved USD amount, the call_site tag (triage | plan | pass_6_adversarial | continuous_critique | …), and the optional task_id FK. Indices on occurred_at and task_id power the weekly spend summary (§6.6). SpendRow pydantic model + full CRUD. Tests cover creation, a cache-heavy round-trip, and NOT-NULL enforcement for the required usd and input_tokens columns.

§15.1 verbatim DDL for audit.log.db — the append-only tool-call audit. One row per tool invocation: tool name, JSON args, JSON result, duration, exception-class-name on failure, and the loop_version / stage / state the call ran under. Integer AUTOINCREMENT primary key; three indices on occurred_at, task_id, and tool_name support §7.7 introspection and §8 review passes. AuditRow pydantic model. insert_row() ignores any caller-supplied id and returns the DB-assigned lastrowid so the audit log is append-only by construction. Tests exercise creation, full round-trip, NOT-NULL enforcement, and strictly-increasing auto-ids.

§15.1 verbatim DDL for ci_history.db — workflow-duration cache for §6.2 GitHub Actions monitoring. Compound primary key (workflow_name, branch, run_id) lets the agent estimate expected runtime for a workflow on a given branch without re-querying the API. CiHistoryRow pydantic model. get_row / update_row take the three PK components explicitly rather than a single id so callers can't confuse which column is which. Tests cover creation, compound-key round-trip, and duplicate-PK rejection via IntegrityError.

§15.1 verbatim DDL for memory.db — the relational side of the hybrid SQLite + FAISS memory store (§6.8). Eight-valued topic CHECK (review_patterns | failure_patterns | flaky_tests | em_preferences | adr_decisions | tool_usage_heuristics | task_outcomes | mutation_seeds) plus confidence BETWEEN 0 AND 100. The embedding_key column stores a FAISS id that Phase 10 will populate — FAISS integration is explicitly out of scope here. MemoryRow pydantic model + full CRUD. Tests cover creation, full round-trip with recall_count increment, and CHECK rejection for both an unknown topic value and an out-of-range confidence.

§7.5 / §15.1 JSON-per-file snapshots under paused-tasks/. Not SQLite: a paused task is read whole, written whole, never queried, and needs to round-trip arbitrary tool-call histories without schema migrations. write_snapshot() uses a .tmp-then-rename for atomic replacement. read_snapshot() raises FileNotFoundError rather than returning None (per CLAUDE.md "no silent fallbacks"). task_ids are validated against path-traversal characters. list_snapshots() returns every task_id present under the root. Tests cover round-trip, path-traversal rejection, and the loud-error contract for missing snapshots.

§6.4 / §15.1 append-only JSONL log of every self-edit PR — one JSON object per line. Schema per §15.1 verbatim: ts, pr, fix_class, files[], before_sha, after_sha, review_passes{static, functional, arch, security, prose, adversarial, feedback_binding}, confidence, em_review, auto_merged, feedback_id. SelfEditRecord + nested ReviewPasses pydantic models validate both writes and reads. append() writes one line; iter_records() / read_all() replay them. append_dict() validates raw dicts against the schema so callers can feed telemetry payloads directly. Tests round-trip two records (confirming append semantics) and confirm schema rejection for a malformed payload.

§4.6 / §15.1 append-only plain-text log of candidate GAIA.md promotions. Canonical single-line format: <ISO-8601 UTC> [task=<task_id>] Observed: "<observation>" → <candidate> Promotion candidate is constrained to 'GAIA.md', 'dismissed', or 'skill:<name>'. Observations must be single-line so grep/awk stay useful. LearningEntry pydantic model with two field validators enforces both rules at construction time. format_entry() and parse_line() keep the on-disk text reversible with the model. append() writes one line; iter_entries() / read_all() replay them, failing loudly on malformed lines. Tests round-trip two entries (GAIA.md + skill:<name> candidates), confirm the on-disk format is human-readable, and reject both an unknown promotion candidate and a multi-line observation.

Stores package entry point. Re-exports all ten sibling store modules for `from gaia.coder.stores import em_inbox, tasks, ...` ergonomics and adds the convenience helper `create_all_stores(root)`. create_all_stores() materialises every store under a single directory (the canonical layout is ~/.gaia/coder/) and returns a {store_name: Path} map. For SQLite stores it opens the connection with the canonical PRAGMAs, applies the DDL, and closes. For the three non-SQL stores it creates the target directory (paused-tasks/) or ensures the parent of the log file exists. The function is idempotent so callers can run it on every boot. Smoke test: test_create_all_stores() asserts every expected store name is present, every SQLite file exists with its table materialised, paused-tasks/ is a directory, and the helper returns identical paths on a second invocation.

github-actions · 2026-04-20T09:26:29Z

Summary

Implements the ten persistent stores called out in §15.1 of docs/plans/coder-agent.mdx as pure data-layer infrastructure: seven SQLite stores (each with verbatim DDL, Pydantic row models, and typed CRUD on top of a shared _common.py helper) plus three append-only stores (JSON-per-file, JSONL, plain-text). The shape is uniform across the seven SQLite modules, tests exercise every CHECK constraint, and create_all_stores() is a nice self-contained entry point with a smoke test. Code quality is high; my main concern is a latent SQL-injection surface in the shared helper and a handful of smaller issues that are easier to fix now than after more callers wire up.

Issues Found

🟡 SQL injection surface: `filter_` keys and `order_by` are string-interpolated (src/gaia/coder/stores/_common.py:51, 247)

_compile_where builds f"{key} IS NULL" / f"{key} = ?" directly from dict keys, and fetch_all interpolates order_by as f" ORDER BY {order_by}". Values go through ? parameters (good), but column names and ORDER BY do not. The in-repo callers today are all store modules passing constants, so it's safe right now — but every store re-exports list_rows(..., filter: Mapping[str, Any] | None) on its public surface with no documentation saying the keys must be trusted. Future callers (EM handles from em_inbox, context URLs from feedback, tool filters forwarded from CLI flags) could easily plumb untrusted input into filter={...} and quietly turn it into injection.

Two reasonable hardenings — either would be fine:

_IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifier(name: str) -> None:
    if not _IDENT_RE.match(name):
        raise ValueError(f"invalid SQL identifier: {name!r}")


def _compile_where(filter_: Mapping[str, Any] | None) -> tuple[str, list[Any]]:
    """Build a ``WHERE`` clause and parameter list from an equality filter.

    Empty/``None`` filter returns ``("", [])`` so callers can concatenate
    unconditionally. Values are passed as positional parameters — no string
    interpolation — so this is safe against injection. Column names are
    validated against ``_IDENT_RE`` so a tainted ``filter_`` key cannot
    inject SQL.
    """
    if not filter_:
        return "", []
    parts = []
    params: list[Any] = []
    for key, value in filter_.items():
        _check_identifier(key)
        if value is None:
            parts.append(f"{key} IS NULL")
        else:
            parts.append(f"{key} = ?")
            params.append(value)
    return " WHERE " + " AND ".join(parts), params

Apply the same _check_identifier to every column in update's patch and to fetch_all's order_by (split on commas first, then validate each token / direction). Cheap, and turns a latent landmine into a loud ValueError.

🟡 Test conftest swallows any import error (tests/coder/conftest.py:26)

def _try_real_import() -> bool:
    try:
        import gaia.coder  # noqa: F401
    except Exception:
        return False
    return True

This is exactly the silent-fallback pattern CLAUDE.md prohibits: a broken gaia.coder.stores module (e.g., a typo in em_inbox.py) will raise, _try_real_import returns False, stubs get installed, and the tests run against those stubs — masking the regression. Now that the sibling scaffold has landed (per your PR description), the rationale for the fallback is gone. Prefer one of:

Delete the stub machinery entirely; let import gaia.coder fail loudly.
Or narrow the exception to the specific missing-name ImportError you observed during the overlap window, and add a visible warnings.warn(...) when stubs are installed so a real failure is never silent.

def _try_real_import() -> bool:
    """Attempt the real import; return True on success.

    Narrow catch: we only want to paper over the sibling-branch window where
    ``gaia.coder.base`` / ``gaia.coder.loop`` are referenced before they
    exist. Any other import failure is a real regression and must surface.
    """
    try:
        import gaia.coder  # noqa: F401
    except ImportError as e:
        msg = str(e)
        if "gaia.coder.base" in msg or "gaia.coder.loop" in msg:
            return False
        raise
    return True

🟢 `learnings_log` canonical format doesn't escape embedded `"` (src/gaia/coder/stores/learnings_log.py:68)

format_entry writes f'"{entry.observation}"' unchanged, and _LINE_RE parses with "(?P<observation>.*)". Greedy matching usually recovers, but an observation ending in a literal " (or containing " → exactly) is a round-trip footgun that no test covers. Either reject " in the observation (same spirit as the newline rule) or escape it:

    @field_validator("observation")
    @classmethod
    def _single_line_no_quote(cls, v: str) -> str:
        if "\n" in v or "\r" in v:
            raise ValueError("observation must be a single line")
        if '"' in v:
            raise ValueError("observation must not contain double quotes")
        return v

Plus a small test covering 'he said "hi"' to lock the behavior.

🟢 `self_edits_log` concurrency comment overstates safety (src/gaia/coder/stores/self_edits_log.py:43)

concurrent appends on POSIX are safe for records shorter than PIPE_BUF (4096 bytes on Linux/macOS).

Real self-edit records easily exceed 4KB once files: list[str] and the seven review-pass verdicts are filled in. O_APPEND write atomicity applies to regular files with write(2), but Python's buffered file objects make this non-trivial. I'd either (a) add fh.flush() + os.fsync() and rely on O_APPEND-per-write with a single write() of the full line, or (b) soften the comment to say concurrent use is not supported and callers must serialize. Either is fine; the current docstring promises something it doesn't deliver.

🟢 `memory.py` DDL comment is misleading (src/gaia/coder/stores/memory.py:42)

-- One table per topic for simple cases; in production these are views over a single `memory` table
-- keyed by topic for FAISS co-location.

The DDL ships exactly one table called memory — there are no per-topic tables and no views. The comment reads like it was copy-pasted from a design doc. Trim to the one-line "single memory table keyed by topic; one FAISS index per topic in Phase 10" that matches the actual schema.

🟢 `init.py` mixes `Dict` and `dict` (src/gaia/coder/stores/init.py:41, 74, 91)

from typing import Dict plus Dict[str, Path] in the return annotation, alongside PEP-585 dict[str, str] in _SQLITE_LAYOUT. Pick one — every other new module in this PR already uses lowercase dict:

from __future__ import annotations

from pathlib import Path

…then change -> Dict[str, Path] → -> dict[str, Path] and drop the Dict import.

Strengths

Uniform shape across seven SQLite modules. Every store exposes DDL, create_tables, open_store, insert_row, get_row, update_row, list_rows with an identical signature. A reader who understands em_inbox.py understands tasks.py, feedback.py, spend.py, etc. for free — exactly the right move for a data layer that future phases will grow into.
Test coverage is thoughtful, not just wide. The CHECK-constraint tests (test_em_inbox_check_constraints, test_feedback_check_constraints, test_memory_check_constraints) prove the DDL's CHECK clauses actually reject invalid enums — catching the common bug where a constraint is declared but never validated. The AUTOINCREMENT-id strict-monotonicity test in test_audit_check_constraints is a nice touch.
Path-traversal defence in paused_tasks._validate_task_id + atomic .tmp-then-rename in write_snapshot are both exactly right for a directory-of-files store — and there's a test for the traversal rejection.

Verdict

Approve with suggestions. The PR is already merged; nothing here is a blocker. The SQL-identifier hardening in _common.py and the narrowed-catch in tests/coder/conftest.py are the two items worth a quick follow-up — both are small, both close real latent failure modes rather than hypothetical ones. The others (learnings_log quote escaping, self_edits_log docstring, memory DDL comment, Dict/dict mix) are low-cost cleanups to fold into the next coder-stores commit.

Three review-followups from the #818/#819/#820 merge: - src/gaia/coder/__init__.py: restore the CoderAgent/DEFAULT_LOOP/Loop/State/Transition re-exports that #819 added and the #818 rebase accidentally dropped. test_package_imports in test_skeleton.py relies on these. - src/gaia/coder/tools/cli.py: replace the bare except Exception: pass in the stream-reader teardown with a targeted OSError catch + logger.debug. Per CLAUDE.md's fail-loudly rule, silent swallows hide reader-thread bugs; a stream close failing under pipe tear-down is real and worth a debug line. - src/gaia/coder/tools/search.py: stop reaching into the private _TOOL_REGISTRY dict from grep(). Use get_tool_metadata() — the public accessor added in base/tools.py:113 exists precisely for this. Tests: all 73 (5 skeleton + 38 mixins + 30 stores) pass on coder HEAD with the fixes.

#823) ## Summary Three review-followups from the #818 / #819 / #820 merge, flagged by the auto-review bot. All tests (73/73 on `coder`) pass with the fixes. ## What this changes - **Critical** — `src/gaia/coder/__init__.py`: restore the `CoderAgent` / `DEFAULT_LOOP` / `Loop` / `State` / `Transition` re-exports that #819 added and #818's rebase accidentally dropped. Without this, `tests/coder/test_skeleton.py::test_package_imports` fails on `coder` HEAD. - **Important** — `src/gaia/coder/tools/cli.py:135`: replace bare `except Exception: pass` in the stream-reader teardown with a targeted `OSError` catch + `logger.debug`. CLAUDE.md's *No Silent Fallbacks* rule explicitly forbids this pattern. - **Important** — `src/gaia/coder/tools/search.py:59`: stop reaching into the private `_TOOL_REGISTRY` from `grep()`. Use `get_tool_metadata()` — the public accessor at `src/gaia/agents/base/tools.py:113` exists for this. ## Test plan - [x] `pytest tests/coder/ -x` — 73/73 pass (5 skeleton + 38 mixins + 30 stores) - [x] `python -c "from gaia.coder import CoderAgent, DEFAULT_LOOP, Loop, State, Transition"` — all imports resolve

Addresses the auto-review findings on #819 and #820: **Land the spec.** docs/plans/coder-agent.mdx was being referenced from 19 places in the scaffold (every module docstring, every living doc, every CLI body) but the file itself was untracked — every citation pointed at vapour. Commit it now so Phase 2+ branches inherit real citations. Also lands docs/superpowers/specs/2026-04-19-gaia-code-agent-analysis.md as the retained historical context per the plan's §14. **SQL identifier hardening** (src/gaia/coder/stores/_common.py). The CRUD primitives string-interpolated table names, column names, and ORDER BY clauses. SQLite has no built-in parameterisation for identifiers, so whitelist matching is the only safe path. Adds _safe_ident() + _SAFE_IDENT_RE and applies it at every interpolation site, plus multi-clause ORDER BY validation (split by comma, each clause validated for optional ASC/DESC direction). Auto-review flagged this as a latent injection surface; treating it as a latent vuln, not hypothetical. **Narrow the conftest catch** (tests/coder/conftest.py). The bare `except Exception` during stub fallback was masking import-time bugs in stores (AttributeError, SyntaxError). Now catches only ImportError/ModuleNotFoundError and logs the underlying reason so real failures surface in CI rather than silently stubbing over. Scaffold is now landed so step-1 (real import) always succeeds — this is a safety net, not a hot path. **Register gaia-coder in CLAUDE.md + docs.json**. The new console_scripts entry was missing from the Console Script Entry Points table (§Project Structure) and from the "Standalone binaries" section. Also adds plans/coder-agent to docs.json under the Agents group so Mintlify renders it. All 73 coder tests pass (5 skeleton + 38 mixins + 30 stores).

## Summary Addresses the auto-review findings on #819 + #820 and lands the plan that every scaffold citation was pointing at. ## Changes - **Land `docs/plans/coder-agent.mdx`** (3,584 lines). 19 scaffold citations were pointing at a file that didn't exist in tracked state. Also lands `docs/superpowers/specs/2026-04-19-gaia-code-agent-analysis.md` as the retained historical context. - **SQL identifier hardening** (`src/gaia/coder/stores/_common.py`). `_safe_ident()` + `_SAFE_IDENT_RE` validation at every interpolation site (table, column, ORDER BY). Multi-clause ORDER BY supported. - **Narrow conftest catch** (`tests/coder/conftest.py`). Only catches `ImportError` / `ModuleNotFoundError`; logs the reason. Real import-time bugs in stores now surface. - **Register `gaia-coder`** in `CLAUDE.md` Console Script Entry Points + Standalone binaries, and add `plans/coder-agent` to `docs/docs.json` under Agents. ## Test plan - [x] `pytest tests/coder/` — 73/73 pass (5 skeleton + 38 mixins + 30 stores) - [x] `docs/plans/coder-agent.mdx` resolves for every scaffold docstring citation - [x] SQL injection guard: `_safe_ident("x; DROP TABLE")` raises `ValueError`

## Summary Phase 4 of the `gaia-coder` plan: the seven-pass self-review gate that runs before she ever calls `gh_pr_create`. Implements the full §8 table — deterministic checks (lint, tests, security, prose) plus LLM-driven passes (architectural, persona, adversarial, feedback-binding) — and surfaces an aggregated verdict with the confidence score the §7.6 auto-merge path reads. This is the "deep-review discipline" the trust contract is built on. Every PR she opens must clear this gate; self-fix PRs additionally require Pass 7 (feedback-binding), which confirms the diff actually addresses the EM's wording and that the regression test fails on `coder` and passes on the branch. ## What this PR adds - **Seven review passes** (`pass_1_static` through `pass_7_feedback_binding`) — each a focused module that returns a common `PassResult` envelope, so the gate never has to special-case any of them. - **`gate.run_all_passes`** — orchestrator with cost-aware short-circuiting: a hard-fail on the deterministic cheap gates (1, 2, 4) skips the expensive Opus calls, because there is no point paying for adversarial review on a branch that fails lint. - **`ReviewToolsMixin`** — exposes every pass as a `@tool` (`review_diff_self_static`, …) plus the one-shot `review_diff_gate`, so the agent, the EM's TUI, and the evaluation harness all share one code path. - **Canonical prompt files** for Passes 3, 5, 6, 7 under `src/gaia/coder/prompts/`, materialised verbatim from §15.8 — self-fix PRs that touch a prompt are now a single-file `git diff`, the whole point of the §6.4 whitelist. - **22 unit tests** that cover every pass (one pass + one fail per module), the gate's short-circuit behaviour, the self-fix gating of Pass 7, the confidence-score contract, and a prompt-files-exist guard so the canonical templates can't silently drift. ## Why the LLM seam matters Every Opus call flows through `gaia.coder.review._llm.call_opus`. Tests patch that one name — cheap, stable, vendor-agnostic — rather than reaching into the Anthropic SDK. When we ship the Claude Agent SDK integration for Passes 3 and 6 (the `architecture-reviewer` / `code-reviewer` subagents in §15.8), the switch happens at a single module boundary without churning every test. ## Scope constraint Changes are confined to `src/gaia/coder/review/`, `src/gaia/coder/prompts/`, and `tests/coder/test_review.py`. No other files are touched. Pass 7's differential-pytest worktree machinery is wired but opt-out via `skip_differential_pytest=True` for unit tests, because Phase 4 intentionally does not ship a running daemon — that's Phase 5. ## Dependencies - Stacks on top of #818 (mixins; uses `gaia.coder.tools.cli._check_denylist` for subprocess safety) and #819 (scaffold). Both appear in this PR until they merge to `coder` — the diff will clean itself up afterwards. - Does not depend on #820 (stores). ## Citations - [`docs/plans/coder-agent.mdx`](../blob/feature/gaia-coder-review/docs/plans/coder-agent.mdx) §8 — the seven-pass table - §15.8 — canonical prompt templates for Passes 3/5/6/7 - §7.6 — confidence-score gate on auto-merge - §7.4 — self-correction loop that Pass 7 bookends ## Test plan - [x] `pytest tests/coder/test_review.py -xvs` — 22/22 pass - [x] `pytest tests/coder/` — 65/65 pass (sibling tests untouched) - [x] `python util/lint.py --black --isort` — clean on in-scope files - [x] Smoke: `from gaia.coder.review import ReviewToolsMixin; m = ReviewToolsMixin(); m.register_review_tools()` registers exactly 8 tools - [ ] Integration run against a real branch with Anthropic creds present (Phase 5 territory — deferred to the daemon wiring task) ## Do not merge Draft: this stacks on two unmerged sibling PRs and the Phase 4 runtime wiring lives in Phase 5.

## Summary Lands Phase 5 of `docs/plans/coder-agent.mdx` — the EM-facing surface of `gaia-coder`. Before this PR the CLI verbs were stubs; after it, the EM can bootstrap the agent, read her trust contract, promote/demote her, and queue messages. Every LLM call now injects her identity triplet (`GAIA.md` + `ARCHITECTURE.md` + `PROJECT_MAP.md`) as a cacheable prefix. ## Threads - **`trust.py`** — `EMConfig` / `RepoBinding` Pydantic models, the 0-5 `CapabilityTier` ladder, TOML round-trip, and `promote` / `demote` functions with audit-log writes. Promotion refuses mismatched EM signatures; demotion is immediate. *Why it matters:* §4.2 makes promotion explicit, so a quiet accept would let any caller escalate tier. Fail-loudly TrustError surfaces exactly what to fix. - **`inbox.py`** — thin CRUD over `em_inbox.db` with the §4.5 5-second non-LLM auto-ack, channel-agnostic dispatch callable, and escalation into `feedback.db` with severity translation. *Why it matters:* §4.5 says the ack is non-negotiable in latency; keeping it template- only (no model call) makes the SLA automatic. - **`intent.py`** — LLM-driven conversational intent classifier for §15.4 + §15.8 P9, temperature 0 Opus 4.7, mockable via an injected `llm` callable. Low confidence (< 70) and unknown intents coerce to `free_form`. Handler functions cover every §15.4 intent. *Why it matters:* regex matchers would miss paraphrases ("let me give you self-edit for now"); LLM routing keeps the grammar maintainable. - **`prompt_composer.py`** — builds Anthropic-format message blocks with `cache_control={"type":"ephemeral"}` on the identity triplet + per-skill blocks, per §3.2 / §4.6 / §6.5. *Why it matters:* §3.1 mandates prompt caching; this is the single place that decides what gets cached. - **`cli.py`** — replaces seven stub handlers (`trust`, `promote`, `demote`, `ask`, `note`, `critical`, `inbox`) with real ones. Config dir honours `$GAIA_CODER_HOME` so tests never touch real user state. Stubs remain for the Phase 6+ verbs (`daemon`, `status`, `feedback`, `doctor`, etc.). - **`prompts/intent_classifier.md` + `prompts/standup.md`** — §15.8 P9 and P10 prompt templates landed verbatim for future `prompt`-class self-fix PRs. - **Tests (58 new)** — each module has a dedicated test file; CLI tests run as subprocess to exercise the full argparse + env-var path. ## Dependencies This PR depends on sibling branches that have not yet merged to `coder`: - **#819 scaffold** — imports `gaia.coder.{__init__,base,loop}` and the `GAIA.md` / `ARCHITECTURE.md` / `PROJECT_MAP.md` placeholders. - **#820 stores** — hard dep on `gaia.coder.stores.{em_inbox, feedback, audit}`. - **#818 mixins** — optional import of `gaia.coder.tools.cli` for subprocess helpers (not used in this PR's code paths). Rebase onto `coder` once those land. ## Test plan - [ ] `pytest tests/coder/test_{trust,intent,inbox,prompt_composer,cli_trust}.py -xvs` — all 58 tests pass - [ ] `gaia-coder trust` on a fresh `$GAIA_CODER_HOME` prints the §4.1 bootstrap question and exits 0 - [ ] `gaia-coder trust --bootstrap --em-handle <you> --em-channel <ch>` then `gaia-coder trust` renders the §4.2 template verbatim - [ ] `gaia-coder promote --to-tier 2 --reason "..." --em-signature <you>` updates `em.toml` and writes an audit row - [ ] `gaia-coder promote ... --em-signature wrong-user` exits 1 with the mismatch message on stderr - [ ] `gaia-coder ask "enable self-edit"` prints the auto-ack template and enqueues a pending row in `em_inbox.db` - [ ] `gaia-coder inbox` lists the pending row

…825) ## Summary Phase 6 implements **gaia-coder's self-correction loop** — the core value proposition. When the EM gives feedback, the agent now triages it into one of eight fix classes, localises the cause, drafts a regression-tested plan, applies a fix on an `auto/gaia-coder/<feedback_id>` branch, and opens a draft PR targeting `coder`. Wires §7.2 continuous critique, §7.3 feedback intake, §7.4 loop steps 1-10, and §7.9 verification. Loosely coupled to Phase 4 (review gate) and Phase 5 (trust inbox) so they can land in any order. ## Threads - **Prompt templates (§15.8 P1/P2/P3).** `prompts/triage.md`, `prompts/critique.md`, and `prompts/plan_review.md` — the three canonical text bodies the loop consumes. Stored as plain Markdown so she can edit them via prompt-class self-fix PRs. - **Triage (§7.4 step 1-2).** `classify_fix_class` runs P1 on Opus 4.7; `< 60` confidence is rewritten to `out-of-scope` so the loop never commits to a guess. `localise` is deterministic grep — no LLM. - **Planner (§5.1 Stage 3 / §7.4 step 3).** `draft_plan` + `is_large_job` + `request_em_approval`. Large jobs (> 200 LoC, or `architectural` / `state-machine`, or cross-mixin) post the P3 message to the EM inbox and wait for ✅. - **Fixer (§7.4 steps 4-5).** `generate_fix` creates the self-fix branch and applies edits; `write_regression_test` emits a pytest file + marker flag so it *genuinely* fails on `coder` and passes on the fix branch (no clever mocks); `verify_test_differential` raises on the pass-on-both or fail-on-both failure modes. - **Publisher (§7.4 steps 7-8).** `open_self_fix_pr` refuses to open without a regression test (§7.4 step 5 hard rule). PR body cites `feedback_id` and quotes the EM's wording verbatim — Pass 7 (feedback-binding) depends on both. - **Verifier (§7.4 step 10).** `verify_on_merge` re-runs the regression test on the merged SHA, transitions the feedback row to `verified`, and writes `failure_patterns` + `review_patterns` memory records so the same symptom is recognised next time. - **Continuous critique (§7.2).** Cheap one-shot Opus call after every state-changing tool; findings `< 60` confidence are dropped; `≥ 80` surface inline for pre-transition action. - **Loop driver.** `FeedbackLoopDriver.process_pending_feedback()` orchestrates the whole thing with full `pending → triaged → in-fix → fix-pr-open → verified | rejected → closed` transitions written to `feedback.notes_json`. - **SelfFixToolsMixin.** Registers **10 tools** (well above the §15.2 ≥ 7 contract) so the loop is callable from the agent's tool registry. Phase-7 tools (`classify_failure`, `pause_current_task`, `restart_self`, `edit_self_file`) are intentionally out of scope. - **CLI.** `gaia-coder feedback "<body>" --severity high --on <url>` enqueues, `gaia-coder self-fix process` runs one iteration. - **Tests.** 47 new tests covering every §7.4 step, every §7.2 filtering rule, and the §15.2 mixin contract. Mocks Anthropic and `gh` at the callable boundary; uses a tmp git repo with a `coder` branch for real branch creation and pytest differential runs. ## Out of scope (explicit non-goals per the Phase 6 task) - EventBridge ingestion of `@gaia-coder feedback:` comments (Phase 10 repo binding). - Auto-merge (§7.6) — blocked on Phase 4 review-gate confidence score. - Dev-mode self-edit (§7.5) and `restart_self` — Phase 7. - ReAct loop self-edit (§7.8) — Phase 7. ## Dependencies - **#818 (mixins):** Imports `edit_file` semantics inline rather than instantiating `FileToolsMixin`, to keep the module usable without the mixin wired. - **#819 (scaffold):** Package skeleton + GAIA.md + prompts/ dir. - **#820 (stores):** Uses `gaia.coder.stores.feedback` and `gaia.coder.stores.memory` directly. - **Phase 4 review gate (sibling):** Loose-coupled — `review_gate_runner` is optional; absent, we publish a "(review gate not available)" placeholder in the PR body. - **Phase 5 trust (sibling):** Loose-coupled — `request_em_approval` uses `gaia.coder.trust.inbox.enqueue` if importable, else defers. ## Test plan - [ ] `pytest tests/coder/test_self_fix/ -xvs` — 47 tests pass. - [ ] `pytest tests/coder/ -q` — full coder suite (120 tests) pass. - [ ] `python util/lint.py --all` — no new critical errors (pre-existing pylint warnings in `cli.py`, `discovery.py`, etc. are not touched). - [ ] `python -c "from gaia.coder.self_fix import SelfFixToolsMixin; m = SelfFixToolsMixin(); assert len(m.register_self_fix_tools()) >= 7"` — smoke. - [ ] `gaia-coder feedback "..." --severity high --on https://github.com/amd/gaia/pull/9999 --id fb-test --db-path /tmp/fb.db` — writes a row; follow with `gaia-coder self-fix process --db-path /tmp/fb.db --repo-root <worktree> --skip-differential-verify --skip-fix-apply` for the end-to-end path (PR creation mocked). ## Merge plan - Draft PR, **do not merge**. Rebase onto `coder` after Phase 4 and Phase 5 land so the review gate and inbox wiring become real imports.

github-actions Bot added dependencies Dependency updates tests Test changes labels Apr 20, 2026

kovtcharov changed the base branch from feature/coding-agent to main April 20, 2026 08:25

kovtcharov changed the base branch from main to coder April 20, 2026 08:25

kovtcharov marked this pull request as ready for review April 20, 2026 09:20

kovtcharov requested a review from kovtcharov-amd as a code owner April 20, 2026 09:20

kovtcharov added 12 commits April 20, 2026 02:22

kovtcharov force-pushed the feature/gaia-coder-stores branch from e5ac03b to c468427 Compare April 20, 2026 09:22

kovtcharov merged commit 2b720c6 into coder Apr 20, 2026
5 checks passed

kovtcharov deleted the feature/gaia-coder-stores branch April 20, 2026 09:22

This was referenced Apr 20, 2026

feat(coder): multi-pass self-review gate (Passes 1-7) #821

Merged

feat(coder): trust contract + EM CLI + prompt composer #822

Merged

kovtcharov mentioned this pull request Apr 20, 2026

fix(coder): restore package exports + CLAUDE.md fail-loudly compliance #823

Merged

2 tasks

kovtcharov mentioned this pull request Apr 20, 2026

fix(coder): review followups + land the plan doc #824

Merged

3 tasks

kovtcharov mentioned this pull request Apr 20, 2026

feat(coder): self-correction loop (§7.3-§7.9 + continuous critique) #825

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(coder): SQLite stores per §15.1 DDL#820

feat(coder): SQLite stores per §15.1 DDL#820
kovtcharov merged 12 commits into
coderfrom
feature/gaia-coder-stores

kovtcharov commented Apr 20, 2026

Uh oh!

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kovtcharov commented Apr 20, 2026

Summary

Dependencies

Commits

Test plan

Do not merge

Uh oh!

Uh oh!

github-actions Bot commented Apr 20, 2026

Summary

Issues Found

🟡 SQL injection surface: filter_ keys and order_by are string-interpolated (src/gaia/coder/stores/_common.py:51, 247)

🟡 Test conftest swallows any import error (tests/coder/conftest.py:26)

🟢 learnings_log canonical format doesn't escape embedded " (src/gaia/coder/stores/learnings_log.py:68)

🟢 self_edits_log concurrency comment overstates safety (src/gaia/coder/stores/self_edits_log.py:43)

🟢 memory.py DDL comment is misleading (src/gaia/coder/stores/memory.py:42)

🟢 __init__.py mixes Dict and dict (src/gaia/coder/stores/init.py:41, 74, 91)

Strengths

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

🟡 SQL injection surface: `filter_` keys and `order_by` are string-interpolated (src/gaia/coder/stores/_common.py:51, 247)

🟢 `learnings_log` canonical format doesn't escape embedded `"` (src/gaia/coder/stores/learnings_log.py:68)

🟢 `self_edits_log` concurrency comment overstates safety (src/gaia/coder/stores/self_edits_log.py:43)

🟢 `memory.py` DDL comment is misleading (src/gaia/coder/stores/memory.py:42)

🟢 `init.py` mixes `Dict` and `dict` (src/gaia/coder/stores/init.py:41, 74, 91)