Skip to content

feat(coder): SQLite stores per §15.1 DDL#820

Merged
kovtcharov merged 12 commits into
coderfrom
feature/gaia-coder-stores
Apr 20, 2026
Merged

feat(coder): SQLite stores per §15.1 DDL#820
kovtcharov merged 12 commits into
coderfrom
feature/gaia-coder-stores

Conversation

@kovtcharov
Copy link
Copy Markdown
Collaborator

Summary

Implements the ten persistent stores called out in §15.1 of docs/plans/coder-agent.mdx — the durable-state layer every Phase 1 / Phase 3 gaia-coder feature depends on (task queue, EM inbox, feedback queue, spend ledger, audit log, CI history, memory, paused-task snapshots, self-edit JSONL, learnings plain-text log). No business logic — just DDL, Pydantic row models, typed CRUD, and tests.

  • Seven SQLite storesem_inbox.db, tasks.db, feedback.db, spend.db, audit.log.db, ci_history.db, memory.db. Each ships the §15.1 DDL verbatim (grep-verifiable), a create_tables() / open_store() pair with canonical PRAGMAs (WAL, foreign_keys=ON, busy_timeout=5000), and insert_row / get_row / update_row / list_rows helpers keyed off a per-store Pydantic v2 model.
  • Three append-only storespaused_tasks (JSON-per-file), self_edits_log (JSONL), learnings_log (plain text with a strict single-line grammar). All three use Pydantic for schema validation on the write side and fail loudly on read errors.
  • stores/__init__.py::create_all_stores(root) — idempotent convenience that materialises every store under one directory and returns a {store_name: Path} map.
  • 30 tests in tests/coder/test_stores.py — create / round-trip / CHECK-constraint for every SQLite store, round-trip / schema-rejection for the append-only logs, smoke test for create_all_stores.
  • tests/coder/conftest.py — defensive sys.modules stubs for gaia.coder.base / gaia.coder.loop so these tests stay isolated from sibling scaffold churn; becomes inert once the sibling branch stabilises.

FAISS index wiring for the memory store is intentionally out of scope — that's Phase 10.

Dependencies

Branched off feature/gaia-coder-scaffold (sibling task, already merged into this branch via rebase). The three scaffold commits (daefa1d, a160148, da051cd) at the tip are required for from gaia.coder import CoderAgent to resolve, but nothing in this PR's 12 commits depends on the scaffold beyond defensive test stubs.

Commits

One commit per store module (ten) plus one for the infra (_common.py + test conftest) and one for the package __init__ with create_all_stores:

  1. feat(coder): stores — SQLite connection helpers + test harness
  2. feat(coder): em_inbox store + CRUD + tests
  3. feat(coder): tasks store + CRUD + tests
  4. feat(coder): feedback store + CRUD + tests
  5. feat(coder): spend store + CRUD + tests
  6. feat(coder): audit store + CRUD + tests
  7. feat(coder): ci_history store + CRUD + tests
  8. feat(coder): memory store + CRUD + tests (SQL side only)
  9. feat(coder): paused_tasks snapshot store + tests
  10. feat(coder): self_edits_log JSONL append-only store + tests
  11. feat(coder): learnings_log plain-text append-only store + tests
  12. feat(coder): stores/__init__ with create_all_stores + smoke test

Test plan

  • python -m pytest tests/coder/test_stores.py -xvs — 30 passed locally.
  • python util/lint.py --black --isort — clean.
  • Grep-verify that every §15.1 DDL string appears verbatim under src/gaia/coder/stores/:
    for t in em_inbox tasks feedback spend audit ci_history memory; do
      grep -l "CREATE TABLE $t" src/gaia/coder/stores/*.py || echo "MISSING: $t"
    done

Do not merge

Draft — blocked on the sibling feature/gaia-coder-scaffold PR landing first so the git history stays clean once both land on feature/coding-agent.

@github-actions github-actions Bot added dependencies Dependency updates tests Test changes labels Apr 20, 2026
@kovtcharov kovtcharov changed the base branch from feature/coding-agent to main April 20, 2026 08:25
@kovtcharov kovtcharov changed the base branch from main to coder April 20, 2026 08:25
@kovtcharov kovtcharov marked this pull request as ready for review April 20, 2026 09:20
Shared infrastructure for the per-store modules landing in subsequent
commits:

- src/gaia/coder/stores/_common.py exposes open_connection() with the
  canonical PRAGMAs (WAL, foreign_keys=ON, busy_timeout=5000) plus
  low-level insert / fetch / update helpers every per-store module
  wraps with typed signatures. The primitives are deliberately untyped
  at row-level — each store attaches its own Pydantic model.

- tests/coder/conftest.py pre-populates sys.modules with a stub for
  gaia.coder.base so the stores tests stay isolated from in-flight
  sibling scaffold work on feature/gaia-coder-scaffold.
Phase 1 §15.1 verbatim DDL for em_inbox.db — the EM input queue. One
row per inbound message from the bound EM (cli | tui | gh-comment |
email | daily-standup-reply). CHECK constraints on channel, severity,
and state. Indices on state and received_at.

Exposes DDL, create_tables(), open_store() with canonical PRAGMAs, and
the EmInboxRow pydantic model plus insert_row / get_row / update_row /
list_rows helpers. tests cover table creation, round-trip, and CHECK
rejection on invalid channel, severity, and state values.
§15.1 verbatim DDL for tasks.db — the agent's task queue. One row per
engineering task through its full lifecycle: pending → running →
waiting → blocked → paused → done → abandoned. Seven-valued CHECK on
state; indices on state and on (priority DESC, created_at) so the
worker can claim the highest-priority pending task efficiently.

TaskRow pydantic model, full CRUD, and tests for creation, round-trip
(including priority-DESC ordering), and CHECK rejection of an invalid
state.
§15.1 verbatim DDL for feedback.db — the graduated-from-em_inbox
feedback queue. Two CHECK enums: severity (low | med | high |
critical) and fix_class covering the seven agent-fix classes plus
out-of-scope (eight total: prompt, doc, test, tool, policy,
architectural, state-machine, out-of-scope). Seven-valued state
lifecycle: pending → triaged → in-fix → fix-pr-open → verified →
rejected → closed.

FeedbackRow pydantic model + full CRUD. Tests exercise all eight
fix_class values round-trip and confirm CHECK rejection for invalid
severity and invalid fix_class.
§15.1 verbatim DDL for spend.db — per-call cost ledger. Captures the
four Anthropic token dimensions separately (input, cache_read,
cache_create, output) plus the resolved USD amount, the call_site tag
(triage | plan | pass_6_adversarial | continuous_critique | …), and
the optional task_id FK. Indices on occurred_at and task_id power the
weekly spend summary (§6.6).

SpendRow pydantic model + full CRUD. Tests cover creation, a
cache-heavy round-trip, and NOT-NULL enforcement for the required
usd and input_tokens columns.
§15.1 verbatim DDL for audit.log.db — the append-only tool-call audit.
One row per tool invocation: tool name, JSON args, JSON result,
duration, exception-class-name on failure, and the loop_version /
stage / state the call ran under. Integer AUTOINCREMENT primary key;
three indices on occurred_at, task_id, and tool_name support §7.7
introspection and §8 review passes.

AuditRow pydantic model. insert_row() ignores any caller-supplied id
and returns the DB-assigned lastrowid so the audit log is append-only
by construction. Tests exercise creation, full round-trip, NOT-NULL
enforcement, and strictly-increasing auto-ids.
§15.1 verbatim DDL for ci_history.db — workflow-duration cache for
§6.2 GitHub Actions monitoring. Compound primary key
(workflow_name, branch, run_id) lets the agent estimate expected
runtime for a workflow on a given branch without re-querying the API.

CiHistoryRow pydantic model. get_row / update_row take the three PK
components explicitly rather than a single id so callers can't confuse
which column is which. Tests cover creation, compound-key round-trip,
and duplicate-PK rejection via IntegrityError.
§15.1 verbatim DDL for memory.db — the relational side of the hybrid
SQLite + FAISS memory store (§6.8). Eight-valued topic CHECK
(review_patterns | failure_patterns | flaky_tests | em_preferences |
adr_decisions | tool_usage_heuristics | task_outcomes |
mutation_seeds) plus confidence BETWEEN 0 AND 100. The embedding_key
column stores a FAISS id that Phase 10 will populate — FAISS
integration is explicitly out of scope here.

MemoryRow pydantic model + full CRUD. Tests cover creation, full
round-trip with recall_count increment, and CHECK rejection for both
an unknown topic value and an out-of-range confidence.
§7.5 / §15.1 JSON-per-file snapshots under paused-tasks/. Not SQLite:
a paused task is read whole, written whole, never queried, and needs
to round-trip arbitrary tool-call histories without schema migrations.

write_snapshot() uses a .tmp-then-rename for atomic replacement.
read_snapshot() raises FileNotFoundError rather than returning None
(per CLAUDE.md "no silent fallbacks"). task_ids are validated against
path-traversal characters. list_snapshots() returns every task_id
present under the root.

Tests cover round-trip, path-traversal rejection, and the loud-error
contract for missing snapshots.
§6.4 / §15.1 append-only JSONL log of every self-edit PR — one JSON
object per line. Schema per §15.1 verbatim: ts, pr, fix_class,
files[], before_sha, after_sha, review_passes{static, functional,
arch, security, prose, adversarial, feedback_binding}, confidence,
em_review, auto_merged, feedback_id.

SelfEditRecord + nested ReviewPasses pydantic models validate both
writes and reads. append() writes one line; iter_records() / read_all()
replay them. append_dict() validates raw dicts against the schema so
callers can feed telemetry payloads directly.

Tests round-trip two records (confirming append semantics) and
confirm schema rejection for a malformed payload.
§4.6 / §15.1 append-only plain-text log of candidate GAIA.md
promotions. Canonical single-line format:

  <ISO-8601 UTC> [task=<task_id>] Observed: "<observation>" → <candidate>

Promotion candidate is constrained to 'GAIA.md', 'dismissed', or
'skill:<name>'. Observations must be single-line so grep/awk stay
useful.

LearningEntry pydantic model with two field validators enforces both
rules at construction time. format_entry() and parse_line() keep the
on-disk text reversible with the model. append() writes one line;
iter_entries() / read_all() replay them, failing loudly on malformed
lines.

Tests round-trip two entries (GAIA.md + skill:<name> candidates),
confirm the on-disk format is human-readable, and reject both an
unknown promotion candidate and a multi-line observation.
Stores package entry point. Re-exports all ten sibling store modules
for `from gaia.coder.stores import em_inbox, tasks, ...` ergonomics
and adds the convenience helper `create_all_stores(root)`.

create_all_stores() materialises every store under a single directory
(the canonical layout is ~/.gaia/coder/) and returns a
{store_name: Path} map. For SQLite stores it opens the connection
with the canonical PRAGMAs, applies the DDL, and closes. For the
three non-SQL stores it creates the target directory
(paused-tasks/) or ensures the parent of the log file exists. The
function is idempotent so callers can run it on every boot.

Smoke test: test_create_all_stores() asserts every expected store
name is present, every SQLite file exists with its table
materialised, paused-tasks/ is a directory, and the helper returns
identical paths on a second invocation.
@kovtcharov kovtcharov force-pushed the feature/gaia-coder-stores branch from e5ac03b to c468427 Compare April 20, 2026 09:22
@kovtcharov kovtcharov merged commit 2b720c6 into coder Apr 20, 2026
5 checks passed
@kovtcharov kovtcharov deleted the feature/gaia-coder-stores branch April 20, 2026 09:22
@github-actions
Copy link
Copy Markdown
Contributor

Summary

Implements the ten persistent stores called out in §15.1 of docs/plans/coder-agent.mdx as pure data-layer infrastructure: seven SQLite stores (each with verbatim DDL, Pydantic row models, and typed CRUD on top of a shared _common.py helper) plus three append-only stores (JSON-per-file, JSONL, plain-text). The shape is uniform across the seven SQLite modules, tests exercise every CHECK constraint, and create_all_stores() is a nice self-contained entry point with a smoke test. Code quality is high; my main concern is a latent SQL-injection surface in the shared helper and a handful of smaller issues that are easier to fix now than after more callers wire up.

Issues Found

🟡 SQL injection surface: filter_ keys and order_by are string-interpolated (src/gaia/coder/stores/_common.py:51, 247)

_compile_where builds f"{key} IS NULL" / f"{key} = ?" directly from dict keys, and fetch_all interpolates order_by as f" ORDER BY {order_by}". Values go through ? parameters (good), but column names and ORDER BY do not. The in-repo callers today are all store modules passing constants, so it's safe right now — but every store re-exports list_rows(..., filter: Mapping[str, Any] | None) on its public surface with no documentation saying the keys must be trusted. Future callers (EM handles from em_inbox, context URLs from feedback, tool filters forwarded from CLI flags) could easily plumb untrusted input into filter={...} and quietly turn it into injection.

Two reasonable hardenings — either would be fine:

_IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifier(name: str) -> None:
    if not _IDENT_RE.match(name):
        raise ValueError(f"invalid SQL identifier: {name!r}")


def _compile_where(filter_: Mapping[str, Any] | None) -> tuple[str, list[Any]]:
    """Build a ``WHERE`` clause and parameter list from an equality filter.

    Empty/``None`` filter returns ``("", [])`` so callers can concatenate
    unconditionally. Values are passed as positional parameters — no string
    interpolation — so this is safe against injection. Column names are
    validated against ``_IDENT_RE`` so a tainted ``filter_`` key cannot
    inject SQL.
    """
    if not filter_:
        return "", []
    parts = []
    params: list[Any] = []
    for key, value in filter_.items():
        _check_identifier(key)
        if value is None:
            parts.append(f"{key} IS NULL")
        else:
            parts.append(f"{key} = ?")
            params.append(value)
    return " WHERE " + " AND ".join(parts), params

Apply the same _check_identifier to every column in update's patch and to fetch_all's order_by (split on commas first, then validate each token / direction). Cheap, and turns a latent landmine into a loud ValueError.

🟡 Test conftest swallows any import error (tests/coder/conftest.py:26)

def _try_real_import() -> bool:
    try:
        import gaia.coder  # noqa: F401
    except Exception:
        return False
    return True

This is exactly the silent-fallback pattern CLAUDE.md prohibits: a broken gaia.coder.stores module (e.g., a typo in em_inbox.py) will raise, _try_real_import returns False, stubs get installed, and the tests run against those stubs — masking the regression. Now that the sibling scaffold has landed (per your PR description), the rationale for the fallback is gone. Prefer one of:

  • Delete the stub machinery entirely; let import gaia.coder fail loudly.
  • Or narrow the exception to the specific missing-name ImportError you observed during the overlap window, and add a visible warnings.warn(...) when stubs are installed so a real failure is never silent.
def _try_real_import() -> bool:
    """Attempt the real import; return True on success.

    Narrow catch: we only want to paper over the sibling-branch window where
    ``gaia.coder.base`` / ``gaia.coder.loop`` are referenced before they
    exist. Any other import failure is a real regression and must surface.
    """
    try:
        import gaia.coder  # noqa: F401
    except ImportError as e:
        msg = str(e)
        if "gaia.coder.base" in msg or "gaia.coder.loop" in msg:
            return False
        raise
    return True

🟢 learnings_log canonical format doesn't escape embedded " (src/gaia/coder/stores/learnings_log.py:68)

format_entry writes f'"{entry.observation}"' unchanged, and _LINE_RE parses with "(?P<observation>.*)". Greedy matching usually recovers, but an observation ending in a literal " (or containing " → exactly) is a round-trip footgun that no test covers. Either reject " in the observation (same spirit as the newline rule) or escape it:

    @field_validator("observation")
    @classmethod
    def _single_line_no_quote(cls, v: str) -> str:
        if "\n" in v or "\r" in v:
            raise ValueError("observation must be a single line")
        if '"' in v:
            raise ValueError("observation must not contain double quotes")
        return v

Plus a small test covering 'he said "hi"' to lock the behavior.

🟢 self_edits_log concurrency comment overstates safety (src/gaia/coder/stores/self_edits_log.py:43)

concurrent appends on POSIX are safe for records shorter than PIPE_BUF (4096 bytes on Linux/macOS).

Real self-edit records easily exceed 4KB once files: list[str] and the seven review-pass verdicts are filled in. O_APPEND write atomicity applies to regular files with write(2), but Python's buffered file objects make this non-trivial. I'd either (a) add fh.flush() + os.fsync() and rely on O_APPEND-per-write with a single write() of the full line, or (b) soften the comment to say concurrent use is not supported and callers must serialize. Either is fine; the current docstring promises something it doesn't deliver.

🟢 memory.py DDL comment is misleading (src/gaia/coder/stores/memory.py:42)

-- One table per topic for simple cases; in production these are views over a single `memory` table
-- keyed by topic for FAISS co-location.

The DDL ships exactly one table called memory — there are no per-topic tables and no views. The comment reads like it was copy-pasted from a design doc. Trim to the one-line "single memory table keyed by topic; one FAISS index per topic in Phase 10" that matches the actual schema.

🟢 __init__.py mixes Dict and dict (src/gaia/coder/stores/init.py:41, 74, 91)

from typing import Dict plus Dict[str, Path] in the return annotation, alongside PEP-585 dict[str, str] in _SQLITE_LAYOUT. Pick one — every other new module in this PR already uses lowercase dict:

from __future__ import annotations

from pathlib import Path

…then change -> Dict[str, Path]-> dict[str, Path] and drop the Dict import.

Strengths

  • Uniform shape across seven SQLite modules. Every store exposes DDL, create_tables, open_store, insert_row, get_row, update_row, list_rows with an identical signature. A reader who understands em_inbox.py understands tasks.py, feedback.py, spend.py, etc. for free — exactly the right move for a data layer that future phases will grow into.
  • Test coverage is thoughtful, not just wide. The CHECK-constraint tests (test_em_inbox_check_constraints, test_feedback_check_constraints, test_memory_check_constraints) prove the DDL's CHECK clauses actually reject invalid enums — catching the common bug where a constraint is declared but never validated. The AUTOINCREMENT-id strict-monotonicity test in test_audit_check_constraints is a nice touch.
  • Path-traversal defence in paused_tasks._validate_task_id + atomic .tmp-then-rename in write_snapshot are both exactly right for a directory-of-files store — and there's a test for the traversal rejection.

Verdict

Approve with suggestions. The PR is already merged; nothing here is a blocker. The SQL-identifier hardening in _common.py and the narrowed-catch in tests/coder/conftest.py are the two items worth a quick follow-up — both are small, both close real latent failure modes rather than hypothetical ones. The others (learnings_log quote escaping, self_edits_log docstring, memory DDL comment, Dict/dict mix) are low-cost cleanups to fold into the next coder-stores commit.

kovtcharov added a commit that referenced this pull request Apr 20, 2026
Three review-followups from the #818/#819/#820 merge:

- src/gaia/coder/__init__.py: restore the CoderAgent/DEFAULT_LOOP/Loop/State/Transition
  re-exports that #819 added and the #818 rebase accidentally dropped. test_package_imports
  in test_skeleton.py relies on these.

- src/gaia/coder/tools/cli.py: replace the bare except Exception: pass in the stream-reader
  teardown with a targeted OSError catch + logger.debug. Per CLAUDE.md's fail-loudly rule,
  silent swallows hide reader-thread bugs; a stream close failing under pipe tear-down is
  real and worth a debug line.

- src/gaia/coder/tools/search.py: stop reaching into the private _TOOL_REGISTRY dict from
  grep(). Use get_tool_metadata() — the public accessor added in base/tools.py:113 exists
  precisely for this.

Tests: all 73 (5 skeleton + 38 mixins + 30 stores) pass on coder HEAD with the fixes.
kovtcharov added a commit that referenced this pull request Apr 20, 2026
#823)

## Summary

Three review-followups from the #818 / #819 / #820 merge, flagged by the
auto-review bot. All tests (73/73 on `coder`) pass with the fixes.

## What this changes

- **Critical** — `src/gaia/coder/__init__.py`: restore the `CoderAgent`
/ `DEFAULT_LOOP` / `Loop` / `State` / `Transition` re-exports that #819
added and #818's rebase accidentally dropped. Without this,
`tests/coder/test_skeleton.py::test_package_imports` fails on `coder`
HEAD.

- **Important** — `src/gaia/coder/tools/cli.py:135`: replace bare
`except Exception: pass` in the stream-reader teardown with a targeted
`OSError` catch + `logger.debug`. CLAUDE.md's *No Silent Fallbacks* rule
explicitly forbids this pattern.

- **Important** — `src/gaia/coder/tools/search.py:59`: stop reaching
into the private `_TOOL_REGISTRY` from `grep()`. Use
`get_tool_metadata()` — the public accessor at
`src/gaia/agents/base/tools.py:113` exists for this.

## Test plan

- [x] `pytest tests/coder/ -x` — 73/73 pass (5 skeleton + 38 mixins + 30
stores)
- [x] `python -c "from gaia.coder import CoderAgent, DEFAULT_LOOP, Loop,
State, Transition"` — all imports resolve
kovtcharov added a commit that referenced this pull request Apr 20, 2026
Addresses the auto-review findings on #819 and #820:

**Land the spec.** docs/plans/coder-agent.mdx was being referenced from 19 places in
the scaffold (every module docstring, every living doc, every CLI body) but the file
itself was untracked — every citation pointed at vapour. Commit it now so Phase 2+
branches inherit real citations. Also lands docs/superpowers/specs/2026-04-19-gaia-code-agent-analysis.md
as the retained historical context per the plan's §14.

**SQL identifier hardening** (src/gaia/coder/stores/_common.py). The CRUD primitives
string-interpolated table names, column names, and ORDER BY clauses. SQLite has no
built-in parameterisation for identifiers, so whitelist matching is the only safe
path. Adds _safe_ident() + _SAFE_IDENT_RE and applies it at every interpolation site,
plus multi-clause ORDER BY validation (split by comma, each clause validated for
optional ASC/DESC direction). Auto-review flagged this as a latent injection surface;
treating it as a latent vuln, not hypothetical.

**Narrow the conftest catch** (tests/coder/conftest.py). The bare `except Exception`
during stub fallback was masking import-time bugs in stores (AttributeError,
SyntaxError). Now catches only ImportError/ModuleNotFoundError and logs the
underlying reason so real failures surface in CI rather than silently stubbing over.
Scaffold is now landed so step-1 (real import) always succeeds — this is a safety net,
not a hot path.

**Register gaia-coder in CLAUDE.md + docs.json**. The new console_scripts entry
was missing from the Console Script Entry Points table (§Project Structure) and from
the "Standalone binaries" section. Also adds plans/coder-agent to docs.json under
the Agents group so Mintlify renders it.

All 73 coder tests pass (5 skeleton + 38 mixins + 30 stores).
kovtcharov added a commit that referenced this pull request Apr 20, 2026
## Summary

Addresses the auto-review findings on #819 + #820 and lands the plan
that every scaffold citation was pointing at.

## Changes

- **Land `docs/plans/coder-agent.mdx`** (3,584 lines). 19 scaffold
citations were pointing at a file that didn't exist in tracked state.
Also lands
`docs/superpowers/specs/2026-04-19-gaia-code-agent-analysis.md` as the
retained historical context.
- **SQL identifier hardening** (`src/gaia/coder/stores/_common.py`).
`_safe_ident()` + `_SAFE_IDENT_RE` validation at every interpolation
site (table, column, ORDER BY). Multi-clause ORDER BY supported.
- **Narrow conftest catch** (`tests/coder/conftest.py`). Only catches
`ImportError` / `ModuleNotFoundError`; logs the reason. Real import-time
bugs in stores now surface.
- **Register `gaia-coder`** in `CLAUDE.md` Console Script Entry Points +
Standalone binaries, and add `plans/coder-agent` to `docs/docs.json`
under Agents.

## Test plan

- [x] `pytest tests/coder/` — 73/73 pass (5 skeleton + 38 mixins + 30
stores)
- [x] `docs/plans/coder-agent.mdx` resolves for every scaffold docstring
citation
- [x] SQL injection guard: `_safe_ident("x; DROP TABLE")` raises
`ValueError`
kovtcharov added a commit that referenced this pull request Apr 20, 2026
## Summary

Phase 4 of the `gaia-coder` plan: the seven-pass self-review gate that
runs before she ever calls `gh_pr_create`. Implements the full §8 table
— deterministic checks (lint, tests, security, prose) plus LLM-driven
passes (architectural, persona, adversarial, feedback-binding) — and
surfaces an aggregated verdict with the confidence score the §7.6
auto-merge path reads.

This is the "deep-review discipline" the trust contract is built on.
Every PR she opens must clear this gate; self-fix PRs additionally
require Pass 7 (feedback-binding), which confirms the diff actually
addresses the EM's wording and that the regression test fails on `coder`
and passes on the branch.

## What this PR adds

- **Seven review passes** (`pass_1_static` through
`pass_7_feedback_binding`) — each a focused module that returns a common
`PassResult` envelope, so the gate never has to special-case any of
them.
- **`gate.run_all_passes`** — orchestrator with cost-aware
short-circuiting: a hard-fail on the deterministic cheap gates (1, 2, 4)
skips the expensive Opus calls, because there is no point paying for
adversarial review on a branch that fails lint.
- **`ReviewToolsMixin`** — exposes every pass as a `@tool`
(`review_diff_self_static`, …) plus the one-shot `review_diff_gate`, so
the agent, the EM's TUI, and the evaluation harness all share one code
path.
- **Canonical prompt files** for Passes 3, 5, 6, 7 under
`src/gaia/coder/prompts/`, materialised verbatim from §15.8 — self-fix
PRs that touch a prompt are now a single-file `git diff`, the whole
point of the §6.4 whitelist.
- **22 unit tests** that cover every pass (one pass + one fail per
module), the gate's short-circuit behaviour, the self-fix gating of Pass
7, the confidence-score contract, and a prompt-files-exist guard so the
canonical templates can't silently drift.

## Why the LLM seam matters

Every Opus call flows through `gaia.coder.review._llm.call_opus`. Tests
patch that one name — cheap, stable, vendor-agnostic — rather than
reaching into the Anthropic SDK. When we ship the Claude Agent SDK
integration for Passes 3 and 6 (the `architecture-reviewer` /
`code-reviewer` subagents in §15.8), the switch happens at a single
module boundary without churning every test.

## Scope constraint

Changes are confined to `src/gaia/coder/review/`,
`src/gaia/coder/prompts/`, and `tests/coder/test_review.py`. No other
files are touched. Pass 7's differential-pytest worktree machinery is
wired but opt-out via `skip_differential_pytest=True` for unit tests,
because Phase 4 intentionally does not ship a running daemon — that's
Phase 5.

## Dependencies

- Stacks on top of #818 (mixins; uses
`gaia.coder.tools.cli._check_denylist` for subprocess safety) and #819
(scaffold). Both appear in this PR until they merge to `coder` — the
diff will clean itself up afterwards.
- Does not depend on #820 (stores).

## Citations

-
[`docs/plans/coder-agent.mdx`](../blob/feature/gaia-coder-review/docs/plans/coder-agent.mdx)
§8 — the seven-pass table
- §15.8 — canonical prompt templates for Passes 3/5/6/7
- §7.6 — confidence-score gate on auto-merge
- §7.4 — self-correction loop that Pass 7 bookends

## Test plan

- [x] `pytest tests/coder/test_review.py -xvs` — 22/22 pass
- [x] `pytest tests/coder/` — 65/65 pass (sibling tests untouched)
- [x] `python util/lint.py --black --isort` — clean on in-scope files
- [x] Smoke: `from gaia.coder.review import ReviewToolsMixin; m =
ReviewToolsMixin(); m.register_review_tools()` registers exactly 8 tools
- [ ] Integration run against a real branch with Anthropic creds present
(Phase 5 territory — deferred to the daemon wiring task)

## Do not merge

Draft: this stacks on two unmerged sibling PRs and the Phase 4 runtime
wiring lives in Phase 5.
kovtcharov added a commit that referenced this pull request Apr 20, 2026
## Summary

Lands Phase 5 of `docs/plans/coder-agent.mdx` — the EM-facing surface of
`gaia-coder`. Before this PR the CLI verbs were stubs; after it, the EM
can bootstrap the agent, read her trust contract, promote/demote her,
and queue messages. Every LLM call now injects her identity triplet
(`GAIA.md` + `ARCHITECTURE.md` + `PROJECT_MAP.md`) as a cacheable
prefix.

## Threads

- **`trust.py`** — `EMConfig` / `RepoBinding` Pydantic models, the 0-5
  `CapabilityTier` ladder, TOML round-trip, and `promote` / `demote`
  functions with audit-log writes. Promotion refuses mismatched EM
  signatures; demotion is immediate. *Why it matters:* §4.2 makes
  promotion explicit, so a quiet accept would let any caller escalate
  tier. Fail-loudly TrustError surfaces exactly what to fix.
- **`inbox.py`** — thin CRUD over `em_inbox.db` with the §4.5 5-second
  non-LLM auto-ack, channel-agnostic dispatch callable, and escalation
  into `feedback.db` with severity translation. *Why it matters:*
  §4.5 says the ack is non-negotiable in latency; keeping it template-
  only (no model call) makes the SLA automatic.
- **`intent.py`** — LLM-driven conversational intent classifier for
  §15.4 + §15.8 P9, temperature 0 Opus 4.7, mockable via an injected
  `llm` callable. Low confidence (< 70) and unknown intents coerce to
  `free_form`. Handler functions cover every §15.4 intent. *Why it
  matters:* regex matchers would miss paraphrases ("let me give you
  self-edit for now"); LLM routing keeps the grammar maintainable.
- **`prompt_composer.py`** — builds Anthropic-format message blocks
  with `cache_control={"type":"ephemeral"}` on the identity triplet
  + per-skill blocks, per §3.2 / §4.6 / §6.5. *Why it matters:* §3.1
  mandates prompt caching; this is the single place that decides what
  gets cached.
- **`cli.py`** — replaces seven stub handlers (`trust`, `promote`,
  `demote`, `ask`, `note`, `critical`, `inbox`) with real ones. Config
  dir honours `$GAIA_CODER_HOME` so tests never touch real user state.
  Stubs remain for the Phase 6+ verbs (`daemon`, `status`, `feedback`,
  `doctor`, etc.).
- **`prompts/intent_classifier.md` + `prompts/standup.md`** — §15.8 P9
  and P10 prompt templates landed verbatim for future `prompt`-class
  self-fix PRs.
- **Tests (58 new)** — each module has a dedicated test file; CLI tests
  run as subprocess to exercise the full argparse + env-var path.

## Dependencies

This PR depends on sibling branches that have not yet merged to `coder`:

- **#819 scaffold** — imports `gaia.coder.{__init__,base,loop}` and the
  `GAIA.md` / `ARCHITECTURE.md` / `PROJECT_MAP.md` placeholders.
- **#820 stores** — hard dep on `gaia.coder.stores.{em_inbox, feedback,
  audit}`.
- **#818 mixins** — optional import of `gaia.coder.tools.cli` for
  subprocess helpers (not used in this PR's code paths).

Rebase onto `coder` once those land.

## Test plan

- [ ] `pytest
tests/coder/test_{trust,intent,inbox,prompt_composer,cli_trust}.py -xvs`
— all 58 tests pass
- [ ] `gaia-coder trust` on a fresh `$GAIA_CODER_HOME` prints the §4.1
bootstrap question and exits 0
- [ ] `gaia-coder trust --bootstrap --em-handle <you> --em-channel <ch>`
then `gaia-coder trust` renders the §4.2 template verbatim
- [ ] `gaia-coder promote --to-tier 2 --reason "..." --em-signature
<you>` updates `em.toml` and writes an audit row
- [ ] `gaia-coder promote ... --em-signature wrong-user` exits 1 with
the mismatch message on stderr
- [ ] `gaia-coder ask "enable self-edit"` prints the auto-ack template
and enqueues a pending row in `em_inbox.db`
- [ ] `gaia-coder inbox` lists the pending row
kovtcharov added a commit that referenced this pull request Apr 20, 2026
…825)

## Summary

Phase 6 implements **gaia-coder's self-correction loop** — the core
value proposition. When the EM gives feedback, the agent now triages it
into one of eight fix classes, localises the cause, drafts a
regression-tested plan, applies a fix on an
`auto/gaia-coder/<feedback_id>` branch, and opens a draft PR targeting
`coder`. Wires §7.2 continuous critique, §7.3 feedback intake, §7.4 loop
steps 1-10, and §7.9 verification. Loosely coupled to Phase 4 (review
gate) and Phase 5 (trust inbox) so they can land in any order.

## Threads

- **Prompt templates (§15.8 P1/P2/P3).** `prompts/triage.md`,
`prompts/critique.md`, and `prompts/plan_review.md` — the three
canonical text bodies the loop consumes. Stored as plain Markdown so she
can edit them via prompt-class self-fix PRs.
- **Triage (§7.4 step 1-2).** `classify_fix_class` runs P1 on Opus 4.7;
`< 60` confidence is rewritten to `out-of-scope` so the loop never
commits to a guess. `localise` is deterministic grep — no LLM.
- **Planner (§5.1 Stage 3 / §7.4 step 3).** `draft_plan` +
`is_large_job` + `request_em_approval`. Large jobs (> 200 LoC, or
`architectural` / `state-machine`, or cross-mixin) post the P3 message
to the EM inbox and wait for ✅.
- **Fixer (§7.4 steps 4-5).** `generate_fix` creates the self-fix branch
and applies edits; `write_regression_test` emits a pytest file + marker
flag so it *genuinely* fails on `coder` and passes on the fix branch (no
clever mocks); `verify_test_differential` raises on the pass-on-both or
fail-on-both failure modes.
- **Publisher (§7.4 steps 7-8).** `open_self_fix_pr` refuses to open
without a regression test (§7.4 step 5 hard rule). PR body cites
`feedback_id` and quotes the EM's wording verbatim — Pass 7
(feedback-binding) depends on both.
- **Verifier (§7.4 step 10).** `verify_on_merge` re-runs the regression
test on the merged SHA, transitions the feedback row to `verified`, and
writes `failure_patterns` + `review_patterns` memory records so the same
symptom is recognised next time.
- **Continuous critique (§7.2).** Cheap one-shot Opus call after every
state-changing tool; findings `< 60` confidence are dropped; `≥ 80`
surface inline for pre-transition action.
- **Loop driver.** `FeedbackLoopDriver.process_pending_feedback()`
orchestrates the whole thing with full `pending → triaged → in-fix →
fix-pr-open → verified | rejected → closed` transitions written to
`feedback.notes_json`.
- **SelfFixToolsMixin.** Registers **10 tools** (well above the §15.2 ≥
7 contract) so the loop is callable from the agent's tool registry.
Phase-7 tools (`classify_failure`, `pause_current_task`, `restart_self`,
`edit_self_file`) are intentionally out of scope.
- **CLI.** `gaia-coder feedback "<body>" --severity high --on <url>`
enqueues, `gaia-coder self-fix process` runs one iteration.
- **Tests.** 47 new tests covering every §7.4 step, every §7.2 filtering
rule, and the §15.2 mixin contract. Mocks Anthropic and `gh` at the
callable boundary; uses a tmp git repo with a `coder` branch for real
branch creation and pytest differential runs.

## Out of scope (explicit non-goals per the Phase 6 task)

- EventBridge ingestion of `@gaia-coder feedback:` comments (Phase 10
repo binding).
- Auto-merge (§7.6) — blocked on Phase 4 review-gate confidence score.
- Dev-mode self-edit (§7.5) and `restart_self` — Phase 7.
- ReAct loop self-edit (§7.8) — Phase 7.

## Dependencies

- **#818 (mixins):** Imports `edit_file` semantics inline rather than
instantiating `FileToolsMixin`, to keep the module usable without the
mixin wired.
- **#819 (scaffold):** Package skeleton + GAIA.md + prompts/ dir.
- **#820 (stores):** Uses `gaia.coder.stores.feedback` and
`gaia.coder.stores.memory` directly.
- **Phase 4 review gate (sibling):** Loose-coupled —
`review_gate_runner` is optional; absent, we publish a "(review gate not
available)" placeholder in the PR body.
- **Phase 5 trust (sibling):** Loose-coupled — `request_em_approval`
uses `gaia.coder.trust.inbox.enqueue` if importable, else defers.

## Test plan

- [ ] `pytest tests/coder/test_self_fix/ -xvs` — 47 tests pass.
- [ ] `pytest tests/coder/ -q` — full coder suite (120 tests) pass.
- [ ] `python util/lint.py --all` — no new critical errors (pre-existing
pylint warnings in `cli.py`, `discovery.py`, etc. are not touched).
- [ ] `python -c "from gaia.coder.self_fix import SelfFixToolsMixin; m =
SelfFixToolsMixin(); assert len(m.register_self_fix_tools()) >= 7"` —
smoke.
- [ ] `gaia-coder feedback "..." --severity high --on
https://github.com/amd/gaia/pull/9999 --id fb-test --db-path /tmp/fb.db`
— writes a row; follow with `gaia-coder self-fix process --db-path
/tmp/fb.db --repo-root <worktree> --skip-differential-verify
--skip-fix-apply` for the end-to-end path (PR creation mocked).

## Merge plan

- Draft PR, **do not merge**. Rebase onto `coder` after Phase 4 and
Phase 5 land so the review gate and inbox wiring become real imports.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Dependency updates tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant