feat(coder): File/CLI/Search mixins per §15.2 #818
Implements §15.2 of the gaia-coder plan (docs/plans/coder-agent.mdx): read_file, write_file, edit_file, search_code, glob, generate_diff. Pure-Python (re + pathlib + difflib) so behaviour is deterministic on every platform — no dependency on ripgrep or external binaries. read_file and edit_file raise loudly on missing paths / non-unique matches per the fail-loudly principle rather than returning error envelopes. 18 tests under tests/coder/test_mixins.py cover happy path, failure modes, and registry wiring.
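The unique-match rule for `edit_file` can be illustrated with a minimal sketch. This is not the code from the PR: the signature, the exception types, and the messages are assumptions chosen for illustration; only the behaviour (raise on a missing path, raise on a non-unique match unless `replace_all` is set) comes from the description above.

```python
from pathlib import Path


def edit_file(path: str, old: str, new: str, replace_all: bool = False) -> int:
    """Replace `old` with `new` in a file and return the replacement count.

    Hypothetical signature; the real mixin method may differ.
    """
    target = Path(path)
    if not target.is_file():
        # Fail loudly: raise instead of returning an error envelope.
        raise FileNotFoundError(f"edit_file: no such file: {path}")
    if not old:
        raise ValueError("edit_file: the search string must not be empty")

    text = target.read_text(encoding="utf-8")
    count = text.count(old)
    if count == 0:
        raise ValueError(f"edit_file: {old!r} not found in {path}")
    if count > 1 and not replace_all:
        # A non-unique match is ambiguous, so refuse rather than guess.
        raise ValueError(
            f"edit_file: {old!r} matches {count} times in {path}; "
            "pass replace_all=True or use a more specific string"
        )

    target.write_text(text.replace(old, new), encoding="utf-8")
    return count
```

Raising on ambiguity keeps the caller in control: an agent that sees the error can retry with more surrounding context instead of silently editing the wrong occurrence.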
Implements §15.2 CLI tools: run_cli_command, stop_process, list_processes, get_process_logs, plus the static-denylist layer of the §6.8 guardrail stack (rm -rf /, sudo, chmod 777, curl | bash, git push origin main). Background processes land in a module-level _PROCESS_REGISTRY with reader threads draining stdout/stderr so foreground callers can tail logs without deadlocking pipe buffers. ShellDeniedError is defined locally for now — §15.8 will fold it into a shared exception taxonomy. 12 new tests covering happy path, non-zero exit, denylist, background lifecycle, log capture, and registry wiring (30 total).
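A static denylist of this kind could look roughly like the sketch below. The regular expressions are guesses that cover the five examples named above; the actual patterns in `cli.py` are not shown in this PR text, and `check_denylist` is a hypothetical name for the check.

```python
import re


class ShellDeniedError(RuntimeError):
    """Raised when a command matches the static denylist."""


# Hypothetical patterns, one per example in the description above.
_DENYLIST = [
    (re.compile(r"\brm\s+-(?:rf|fr)\s+/(?:\s|$)"), "rm -rf /"),
    (re.compile(r"\bsudo\b"), "sudo"),
    (re.compile(r"\bchmod\s+777\b"), "chmod 777"),
    (re.compile(r"\bcurl\b[^|]*\|\s*(?:ba)?sh\b"), "curl | bash"),
    (re.compile(r"\bgit\s+push\s+origin\s+main\b"), "git push origin main"),
]


def check_denylist(command: str) -> None:
    """Raise ShellDeniedError if `command` matches any denied pattern."""
    for pattern, label in _DENYLIST:
        if pattern.search(command):
            raise ShellDeniedError(
                f"denied by static denylist ({label}): {command!r}"
            )
```

A pattern list like this is only a first layer: it catches the obvious spellings and is easy to audit, which is presumably why the description calls it one layer of a larger guardrail stack.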
Implements §15.2 Search tools: grep (delegates to search_code), find_symbol (Python-only via ast walk for FunctionDef/ClassDef/Import), list_files. SearchToolsMixin inherits from FileToolsMixin so agents that opt into search automatically get file tools — in practice you always want both. find_symbol logs a WARN and returns [] on repos with no .py files rather than silently skipping (§2 principle 3). 8 new tests covering happy path, kind-filter, unsupported-language WARN, recursive list, directory exclusion, and registry wiring (38 total).
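A Python-only symbol lookup based on an `ast` walk might be sketched as follows. The return shape (a list of dicts) and the `kind` labels are assumptions, and including `ast.ImportFrom` alongside `ast.Import` is a guess about what "Import" covers in the description.

```python
import ast
import logging
from pathlib import Path
from typing import List, Optional

logger = logging.getLogger(__name__)


def find_symbol(root: str, name: str, kind: Optional[str] = None) -> List[dict]:
    """Find definitions or imports of `name` in .py files under `root`."""
    py_files = sorted(Path(root).rglob("*.py"))
    if not py_files:
        # Warn instead of silently skipping an unsupported-language repo.
        logger.warning("find_symbol: no .py files under %s (Python only)", root)
        return []

    hits = []
    for path in py_files:
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                found_kind, names = "function", [node.name]
            elif isinstance(node, ast.ClassDef):
                found_kind, names = "class", [node.name]
            elif isinstance(node, (ast.Import, ast.ImportFrom)):
                # Assumption: match on the bound name (alias if present).
                found_kind = "import"
                names = [alias.asname or alias.name for alias in node.names]
            else:
                continue
            if name in names and kind in (None, found_kind):
                hits.append(
                    {"path": str(path), "line": node.lineno, "kind": found_kind}
                )
    return hits
```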
4a03aef to 67eb7a5
Code Review — #818
Three review-followups from the #818/#819/#820 merge:
- src/gaia/coder/__init__.py: restore the CoderAgent/DEFAULT_LOOP/Loop/State/Transition re-exports that #819 added and the #818 rebase accidentally dropped. test_package_imports in test_skeleton.py relies on these.
- src/gaia/coder/tools/cli.py: replace the bare except Exception: pass in the stream-reader teardown with a targeted OSError catch + logger.debug. Per CLAUDE.md's fail-loudly rule, silent swallows hide reader-thread bugs; a stream close failing under pipe tear-down is real and worth a debug line.
- src/gaia/coder/tools/search.py: stop reaching into the private _TOOL_REGISTRY dict from grep(). Use get_tool_metadata() — the public accessor added in base/tools.py:113 exists precisely for this.
Tests: all 73 (5 skeleton + 38 mixins + 30 stores) pass on coder HEAD with the fixes.
#823)

## Summary

Three review-followups from the #818 / #819 / #820 merge, flagged by the auto-review bot. All tests (73/73 on `coder`) pass with the fixes.

## What this changes

- **Critical** — `src/gaia/coder/__init__.py`: restore the `CoderAgent` / `DEFAULT_LOOP` / `Loop` / `State` / `Transition` re-exports that #819 added and #818's rebase accidentally dropped. Without this, `tests/coder/test_skeleton.py::test_package_imports` fails on `coder` HEAD.
- **Important** — `src/gaia/coder/tools/cli.py:135`: replace bare `except Exception: pass` in the stream-reader teardown with a targeted `OSError` catch + `logger.debug`. CLAUDE.md's *No Silent Fallbacks* rule explicitly forbids this pattern.
- **Important** — `src/gaia/coder/tools/search.py:59`: stop reaching into the private `_TOOL_REGISTRY` from `grep()`. Use `get_tool_metadata()` — the public accessor at `src/gaia/agents/base/tools.py:113` exists for this.

## Test plan

- [x] `pytest tests/coder/ -x` — 73/73 pass (5 skeleton + 38 mixins + 30 stores)
- [x] `python -c "from gaia.coder import CoderAgent, DEFAULT_LOOP, Loop, State, Transition"` — all imports resolve
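The stream-reader teardown change can be pictured with a small sketch. The function name `drain_stream`, its arguments, and the log message are hypothetical; only the pattern (a targeted `OSError` catch with a debug log instead of a bare `except Exception: pass`) comes from the description above.

```python
import logging

logger = logging.getLogger(__name__)


def drain_stream(stream, sink: list) -> None:
    """Copy text lines from `stream` into `sink` until EOF, then close it."""
    try:
        # readline() returns "" at EOF for text-mode streams.
        for line in iter(stream.readline, ""):
            sink.append(line)
    finally:
        try:
            stream.close()
        except OSError as exc:
            # Targeted catch: a close failing during pipe teardown is
            # expected, but it still leaves a trace in the debug log.
            logger.debug("stream close failed during teardown: %s", exc)
```

Anything other than `OSError` still propagates, so a genuine bug in the reader thread is not hidden by the teardown path.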
…rential Implements §7.4 steps 4-5 end-to-end:
* generate_fix() creates auto/gaia-coder/<feedback_id> off the base ref (default 'coder' — §5.6, never main), then walks the caller-supplied EditHunks via an _edit_file_impl free function that mirrors the exact semantics of FileToolsMixin.edit_file from PR #818 (unique-match, opt-in replace_all, fail-loudly on missing). The branch is idempotent: re-runs check out the existing branch rather than error.
* write_regression_test() emits a pytest file plus a marker flag so the generated test genuinely fails on the base branch (marker missing) and passes on the fix branch (marker present) — the §7.4 step 5 contract without needing any cleverness. Refuses to run unless the working tree is on a SELF_FIX_BRANCH_PREFIX branch.
* verify_test_differential() runs pytest on both refs and raises RuntimeError if the fail-then-pass contract is violated (pass on both → regression not exercised; fail on both → fix does not fix).
sys.executable is used for subprocess pytest invocations so hosts without a `python` symlink on PATH still work.
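The fail-then-pass contract can be sketched as a pure decision function plus a thin pytest runner. Both helper names (`run_pytest`, `check_differential`) are hypothetical, and the third branch (pass on base, fail on fix) is an assumption; the commit message only names the first two failure modes.

```python
import subprocess
import sys


def run_pytest(test_path: str, cwd: str) -> bool:
    """Run pytest on one test file in `cwd`; True means exit code 0."""
    # sys.executable avoids depending on a `python` symlink on PATH.
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", test_path, "-q"],
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def check_differential(passes_on_base: bool, passes_on_fix: bool) -> None:
    """Raise RuntimeError unless the test fails on base and passes on fix."""
    if passes_on_base and passes_on_fix:
        raise RuntimeError("pass on both refs: regression not exercised")
    if not passes_on_base and not passes_on_fix:
        raise RuntimeError("fail on both refs: fix does not fix")
    if passes_on_base and not passes_on_fix:
        # Assumed extra guard, not stated in the commit message.
        raise RuntimeError("pass on base, fail on fix: fix breaks the test")
```

Keeping the decision separate from the subprocess call means the contract itself can be unit-tested without a git checkout of either ref.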
## Summary

Phase 4 of the `gaia-coder` plan: the seven-pass self-review gate that runs before she ever calls `gh_pr_create`. Implements the full §8 table — deterministic checks (lint, tests, security, prose) plus LLM-driven passes (architectural, persona, adversarial, feedback-binding) — and surfaces an aggregated verdict with the confidence score the §7.6 auto-merge path reads.

This is the "deep-review discipline" the trust contract is built on. Every PR she opens must clear this gate; self-fix PRs additionally require Pass 7 (feedback-binding), which confirms the diff actually addresses the EM's wording and that the regression test fails on `coder` and passes on the branch.

## What this PR adds

- **Seven review passes** (`pass_1_static` through `pass_7_feedback_binding`) — each a focused module that returns a common `PassResult` envelope, so the gate never has to special-case any of them.
- **`gate.run_all_passes`** — orchestrator with cost-aware short-circuiting: a hard-fail on the deterministic cheap gates (1, 2, 4) skips the expensive Opus calls, because there is no point paying for adversarial review on a branch that fails lint.
- **`ReviewToolsMixin`** — exposes every pass as a `@tool` (`review_diff_self_static`, …) plus the one-shot `review_diff_gate`, so the agent, the EM's TUI, and the evaluation harness all share one code path.
- **Canonical prompt files** for Passes 3, 5, 6, 7 under `src/gaia/coder/prompts/`, materialised verbatim from §15.8 — self-fix PRs that touch a prompt are now a single-file `git diff`, the whole point of the §6.4 whitelist.
- **22 unit tests** that cover every pass (one pass + one fail per module), the gate's short-circuit behaviour, the self-fix gating of Pass 7, the confidence-score contract, and a prompt-files-exist guard so the canonical templates can't silently drift.

## Why the LLM seam matters

Every Opus call flows through `gaia.coder.review._llm.call_opus`. Tests patch that one name — cheap, stable, vendor-agnostic — rather than reaching into the Anthropic SDK. When we ship the Claude Agent SDK integration for Passes 3 and 6 (the `architecture-reviewer` / `code-reviewer` subagents in §15.8), the switch happens at a single module boundary without churning every test.

## Scope constraint

Changes are confined to `src/gaia/coder/review/`, `src/gaia/coder/prompts/`, and `tests/coder/test_review.py`. No other files are touched. Pass 7's differential-pytest worktree machinery is wired but opt-out via `skip_differential_pytest=True` for unit tests, because Phase 4 intentionally does not ship a running daemon — that's Phase 5.

## Dependencies

- Stacks on top of #818 (mixins; uses `gaia.coder.tools.cli._check_denylist` for subprocess safety) and #819 (scaffold). Both appear in this PR until they merge to `coder` — the diff will clean itself up afterwards.
- Does not depend on #820 (stores).

## Citations

- [`docs/plans/coder-agent.mdx`](../blob/feature/gaia-coder-review/docs/plans/coder-agent.mdx) §8 — the seven-pass table
- §15.8 — canonical prompt templates for Passes 3/5/6/7
- §7.6 — confidence-score gate on auto-merge
- §7.4 — self-correction loop that Pass 7 bookends

## Test plan

- [x] `pytest tests/coder/test_review.py -xvs` — 22/22 pass
- [x] `pytest tests/coder/` — 65/65 pass (sibling tests untouched)
- [x] `python util/lint.py --black --isort` — clean on in-scope files
- [x] Smoke: `from gaia.coder.review import ReviewToolsMixin; m = ReviewToolsMixin(); m.register_review_tools()` registers exactly 8 tools
- [ ] Integration run against a real branch with Anthropic creds present (Phase 5 territory — deferred to the daemon wiring task)

## Do not merge

Draft: this stacks on two unmerged sibling PRs and the Phase 4 runtime wiring lives in Phase 5.
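The cost-aware short-circuit in `gate.run_all_passes` might work along the lines of this sketch. The `PassResult` fields, the callable-per-pass interface, and the choice to run the cheap gates first and mark skipped passes as not passed are all assumptions; the description only states that a hard fail on passes 1, 2, or 4 skips the expensive Opus calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PassResult:
    """Hypothetical common envelope returned by every review pass."""

    pass_id: int
    passed: bool
    skipped: bool = False
    detail: str = ""


CHEAP_GATES = (1, 2, 4)  # deterministic passes named in the description


def run_all_passes(passes: Dict[int, Callable[[], PassResult]]) -> List[PassResult]:
    """Run cheap gates first; skip the remaining passes if any cheap gate fails."""
    results: List[PassResult] = []
    cheap_failed = False

    for pass_id in sorted(passes):
        if pass_id in CHEAP_GATES:
            result = passes[pass_id]()
            results.append(result)
            cheap_failed = cheap_failed or not result.passed

    for pass_id in sorted(passes):
        if pass_id in CHEAP_GATES:
            continue
        if cheap_failed:
            # No LLM call is made for a branch that already failed a cheap gate.
            results.append(
                PassResult(pass_id, passed=False, skipped=True,
                           detail="skipped: a cheap gate failed")
            )
        else:
            results.append(passes[pass_id]())

    return sorted(results, key=lambda r: r.pass_id)
```

Because every pass returns the same envelope, the orchestrator never needs to know which passes are LLM-driven beyond the membership test against `CHEAP_GATES`.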
## Summary
Lands Phase 5 of `docs/plans/coder-agent.mdx` — the EM-facing surface of
`gaia-coder`. Before this PR the CLI verbs were stubs; after it, the EM
can bootstrap the agent, read her trust contract, promote/demote her,
and queue messages. Every LLM call now injects her identity triplet
(`GAIA.md` + `ARCHITECTURE.md` + `PROJECT_MAP.md`) as a cacheable
prefix.
## Threads
- **`trust.py`** — `EMConfig` / `RepoBinding` Pydantic models, the 0-5
`CapabilityTier` ladder, TOML round-trip, and `promote` / `demote`
functions with audit-log writes. Promotion refuses mismatched EM
signatures; demotion is immediate. *Why it matters:* §4.2 makes
promotion explicit, so a quiet accept would let any caller escalate
tier. Fail-loudly TrustError surfaces exactly what to fix.
- **`inbox.py`** — thin CRUD over `em_inbox.db` with the §4.5 5-second
non-LLM auto-ack, channel-agnostic dispatch callable, and escalation
into `feedback.db` with severity translation. *Why it matters:*
§4.5 says the ack is non-negotiable in latency; keeping it template-
only (no model call) makes the SLA automatic.
- **`intent.py`** — LLM-driven conversational intent classifier for
§15.4 + §15.8 P9, temperature 0 Opus 4.7, mockable via an injected
`llm` callable. Low confidence (< 70) and unknown intents coerce to
`free_form`. Handler functions cover every §15.4 intent. *Why it
matters:* regex matchers would miss paraphrases ("let me give you
self-edit for now"); LLM routing keeps the grammar maintainable.
- **`prompt_composer.py`** — builds Anthropic-format message blocks
with `cache_control={"type":"ephemeral"}` on the identity triplet
+ per-skill blocks, per §3.2 / §4.6 / §6.5. *Why it matters:* §3.1
mandates prompt caching; this is the single place that decides what
gets cached.
- **`cli.py`** — replaces seven stub handlers (`trust`, `promote`,
`demote`, `ask`, `note`, `critical`, `inbox`) with real ones. Config
dir honours `$GAIA_CODER_HOME` so tests never touch real user state.
Stubs remain for the Phase 6+ verbs (`daemon`, `status`, `feedback`,
`doctor`, etc.).
- **`prompts/intent_classifier.md` + `prompts/standup.md`** — §15.8 P9
and P10 prompt templates landed verbatim for future `prompt`-class
self-fix PRs.
- **Tests (58 new)** — each module has a dedicated test file; CLI tests
run as subprocess to exercise the full argparse + env-var path.
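As a rough illustration of the `prompt_composer.py` thread, the sketch below builds Anthropic-format text blocks with `cache_control` set on the cacheable prefix. The function name and argument shapes are hypothetical. Because the Anthropic API limits how many cache breakpoints a request may carry, this sketch merges the identity triplet into one block and the per-skill text into another; the actual composer may split them differently.

```python
from typing import Dict, List

EPHEMERAL = {"type": "ephemeral"}


def compose_system_blocks(
    identity_docs: Dict[str, str], skill_blocks: List[str]
) -> List[dict]:
    """Build system blocks with a cacheable identity prefix (sketch only)."""
    identity_text = "\n\n".join(
        f"# {name}\n{body}" for name, body in identity_docs.items()
    )
    blocks = [
        # One breakpoint after the whole identity triplet.
        {"type": "text", "text": identity_text, "cache_control": dict(EPHEMERAL)}
    ]
    if skill_blocks:
        # A second breakpoint after the per-skill material.
        blocks.append(
            {
                "type": "text",
                "text": "\n\n".join(skill_blocks),
                "cache_control": dict(EPHEMERAL),
            }
        )
    return blocks
```

The returned list is shaped for the `system` parameter of a Messages API call; anything placed after these blocks (the task-specific turn) stays outside the cached prefix.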
## Dependencies
This PR depends on sibling branches that have not yet merged to `coder`:
- **#819 scaffold** — imports `gaia.coder.{__init__,base,loop}` and the
`GAIA.md` / `ARCHITECTURE.md` / `PROJECT_MAP.md` placeholders.
- **#820 stores** — hard dep on `gaia.coder.stores.{em_inbox, feedback,
audit}`.
- **#818 mixins** — optional import of `gaia.coder.tools.cli` for
subprocess helpers (not used in this PR's code paths).
Rebase onto `coder` once those land.
## Test plan
- [ ] `pytest
tests/coder/test_{trust,intent,inbox,prompt_composer,cli_trust}.py -xvs`
— all 58 tests pass
- [ ] `gaia-coder trust` on a fresh `$GAIA_CODER_HOME` prints the §4.1
bootstrap question and exits 0
- [ ] `gaia-coder trust --bootstrap --em-handle <you> --em-channel <ch>`
then `gaia-coder trust` renders the §4.2 template verbatim
- [ ] `gaia-coder promote --to-tier 2 --reason "..." --em-signature
<you>` updates `em.toml` and writes an audit row
- [ ] `gaia-coder promote ... --em-signature wrong-user` exits 1 with
the mismatch message on stderr
- [ ] `gaia-coder ask "enable self-edit"` prints the auto-ack template
and enqueues a pending row in `em_inbox.db`
- [ ] `gaia-coder inbox` lists the pending row
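The signature check that the `promote` test-plan items exercise can be sketched in memory. This is a simplified stand-in: the PR describes `EMConfig` as a Pydantic model persisted to `em.toml` with audit rows in a store, whereas the dataclass, field names, and in-memory audit list here are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


class TrustError(Exception):
    """Raised when a trust operation is refused."""


@dataclass
class EMConfig:
    """Hypothetical in-memory stand-in for the Pydantic model in trust.py."""

    em_handle: str
    tier: int = 0
    audit_log: List[dict] = field(default_factory=list)


def promote(config: EMConfig, to_tier: int, reason: str, em_signature: str) -> None:
    """Raise the capability tier, refusing mismatched EM signatures."""
    if em_signature != config.em_handle:
        raise TrustError(
            f"signature {em_signature!r} does not match EM {config.em_handle!r}"
        )
    if not 0 <= to_tier <= 5:
        raise TrustError(f"tier must be in 0..5, got {to_tier}")
    if to_tier <= config.tier:
        raise TrustError(f"tier {to_tier} is not above current tier {config.tier}")
    # Record the change before applying it, so every promotion is audited.
    config.audit_log.append(
        {"action": "promote", "from": config.tier, "to": to_tier, "reason": reason}
    )
    config.tier = to_tier
```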
…825)

## Summary

Phase 6 implements **gaia-coder's self-correction loop** — the core value proposition. When the EM gives feedback, the agent now triages it into one of eight fix classes, localises the cause, drafts a regression-tested plan, applies a fix on an `auto/gaia-coder/<feedback_id>` branch, and opens a draft PR targeting `coder`.

Wires §7.2 continuous critique, §7.3 feedback intake, §7.4 loop steps 1-10, and §7.9 verification. Loosely coupled to Phase 4 (review gate) and Phase 5 (trust inbox) so they can land in any order.

## Threads

- **Prompt templates (§15.8 P1/P2/P3).** `prompts/triage.md`, `prompts/critique.md`, and `prompts/plan_review.md` — the three canonical text bodies the loop consumes. Stored as plain Markdown so she can edit them via prompt-class self-fix PRs.
- **Triage (§7.4 step 1-2).** `classify_fix_class` runs P1 on Opus 4.7; `< 60` confidence is rewritten to `out-of-scope` so the loop never commits to a guess. `localise` is deterministic grep — no LLM.
- **Planner (§5.1 Stage 3 / §7.4 step 3).** `draft_plan` + `is_large_job` + `request_em_approval`. Large jobs (> 200 LoC, or `architectural` / `state-machine`, or cross-mixin) post the P3 message to the EM inbox and wait for ✅.
- **Fixer (§7.4 steps 4-5).** `generate_fix` creates the self-fix branch and applies edits; `write_regression_test` emits a pytest file + marker flag so it *genuinely* fails on `coder` and passes on the fix branch (no clever mocks); `verify_test_differential` raises on the pass-on-both or fail-on-both failure modes.
- **Publisher (§7.4 steps 7-8).** `open_self_fix_pr` refuses to open without a regression test (§7.4 step 5 hard rule). PR body cites `feedback_id` and quotes the EM's wording verbatim — Pass 7 (feedback-binding) depends on both.
- **Verifier (§7.4 step 10).** `verify_on_merge` re-runs the regression test on the merged SHA, transitions the feedback row to `verified`, and writes `failure_patterns` + `review_patterns` memory records so the same symptom is recognised next time.
- **Continuous critique (§7.2).** Cheap one-shot Opus call after every state-changing tool; findings `< 60` confidence are dropped; `≥ 80` surface inline for pre-transition action.
- **Loop driver.** `FeedbackLoopDriver.process_pending_feedback()` orchestrates the whole thing with full `pending → triaged → in-fix → fix-pr-open → verified | rejected → closed` transitions written to `feedback.notes_json`.
- **SelfFixToolsMixin.** Registers **10 tools** (well above the §15.2 ≥ 7 contract) so the loop is callable from the agent's tool registry. Phase-7 tools (`classify_failure`, `pause_current_task`, `restart_self`, `edit_self_file`) are intentionally out of scope.
- **CLI.** `gaia-coder feedback "<body>" --severity high --on <url>` enqueues, `gaia-coder self-fix process` runs one iteration.
- **Tests.** 47 new tests covering every §7.4 step, every §7.2 filtering rule, and the §15.2 mixin contract. Mocks Anthropic and `gh` at the callable boundary; uses a tmp git repo with a `coder` branch for real branch creation and pytest differential runs.

## Out of scope (explicit non-goals per the Phase 6 task)

- EventBridge ingestion of `@gaia-coder feedback:` comments (Phase 10 repo binding).
- Auto-merge (§7.6) — blocked on Phase 4 review-gate confidence score.
- Dev-mode self-edit (§7.5) and `restart_self` — Phase 7.
- ReAct loop self-edit (§7.8) — Phase 7.

## Dependencies

- **#818 (mixins):** Imports `edit_file` semantics inline rather than instantiating `FileToolsMixin`, to keep the module usable without the mixin wired.
- **#819 (scaffold):** Package skeleton + GAIA.md + prompts/ dir.
- **#820 (stores):** Uses `gaia.coder.stores.feedback` and `gaia.coder.stores.memory` directly.
- **Phase 4 review gate (sibling):** Loose-coupled — `review_gate_runner` is optional; absent, we publish a "(review gate not available)" placeholder in the PR body.
- **Phase 5 trust (sibling):** Loose-coupled — `request_em_approval` uses `gaia.coder.trust.inbox.enqueue` if importable, else defers.

## Test plan

- [ ] `pytest tests/coder/test_self_fix/ -xvs` — 47 tests pass.
- [ ] `pytest tests/coder/ -q` — full coder suite (120 tests) pass.
- [ ] `python util/lint.py --all` — no new critical errors (pre-existing pylint warnings in `cli.py`, `discovery.py`, etc. are not touched).
- [ ] `python -c "from gaia.coder.self_fix import SelfFixToolsMixin; m = SelfFixToolsMixin(); assert len(m.register_self_fix_tools()) >= 7"` — smoke.
- [ ] `gaia-coder feedback "..." --severity high --on https://github.com/amd/gaia/pull/9999 --id fb-test --db-path /tmp/fb.db` — writes a row; follow with `gaia-coder self-fix process --db-path /tmp/fb.db --repo-root <worktree> --skip-differential-verify --skip-fix-apply` for the end-to-end path (PR creation mocked).

## Merge plan

- Draft PR, **do not merge**. Rebase onto `coder` after Phase 4 and Phase 5 land so the review gate and inbox wiring become real imports.
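The numeric thresholds named in the Triage, Planner, and Continuous critique threads can be sketched as three small pure functions. The function signatures and the finding dict shape are hypothetical, and the handling of findings with confidence from 60 to 79 (kept but not surfaced inline) is an assumption, since the description only specifies the `< 60` and `≥ 80` cases.

```python
from typing import List, Tuple

LARGE_FIX_CLASSES = {"architectural", "state-machine"}


def coerce_fix_class(fix_class: str, confidence: int) -> str:
    """Rewrite low-confidence triage results to `out-of-scope`."""
    return "out-of-scope" if confidence < 60 else fix_class


def is_large_job(loc_changed: int, fix_class: str, mixins_touched: int) -> bool:
    """Large: over 200 LoC, an architectural/state-machine fix, or cross-mixin."""
    return (
        loc_changed > 200
        or fix_class in LARGE_FIX_CLASSES
        or mixins_touched > 1
    )


def filter_critique(findings: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Drop findings under 60 confidence; return (kept, inline)."""
    kept = [f for f in findings if f["confidence"] >= 60]
    # Findings at 80 or above surface inline for pre-transition action.
    inline = [f for f in kept if f["confidence"] >= 80]
    return kept, inline
```

A large job in this sense would then go to the EM inbox for approval instead of proceeding directly to the fixer.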
Summary
Three foundational no-network mixins for `gaia-coder` (per §15.2 of `docs/plans/coder-agent.mdx`). Unblocks scaffold and downstream mixin work: the scaffold task wires these into `CoderAgent.init`; GitHub/Web/OSSReuse mixins build on top.
Sibling streams: `feature/gaia-coder-scaffold` (agent class + loop), `feature/gaia-coder-stores` (§15.1 SQLite stores).
What's in the box
Test plan
Not in this PR
Status: draft — do not merge until the scaffold PR lands and wiring is verified end-to-end.