Skip to content

feat(coder): File/CLI/Search mixins per §15.2#818

Merged
kovtcharov merged 3 commits into
coderfrom
feature/gaia-coder-mixins-1
Apr 20, 2026
Merged

feat(coder): File/CLI/Search mixins per §15.2#818
kovtcharov merged 3 commits into
coderfrom
feature/gaia-coder-mixins-1

Conversation

@kovtcharov
Copy link
Copy Markdown
Collaborator

Summary

Three foundational no-network mixins for `gaia-coder` (per §15.2 of `docs/plans/coder-agent.mdx`). Unblocks scaffold and downstream mixin work: the scaffold task wires these into `CoderAgent.init`; GitHub/Web/OSSReuse mixins build on top.

Sibling streams: `feature/gaia-coder-scaffold` (agent class + loop), `feature/gaia-coder-stores` (§15.1 SQLite stores).

What's in the box

  • `FileToolsMixin` — `read_file`, `write_file`, `edit_file`, `search_code`, `glob`, `generate_diff`. Pure-Python (`re` + `pathlib` + `difflib`) so behaviour is deterministic on every platform. `read_file`/`edit_file` raise loudly on missing paths and non-unique matches per §2 principle 3.
  • `CLIToolsMixin` — `run_cli_command`, `stop_process`, `list_processes`, `get_process_logs`, plus the static-denylist layer of the §6.8 guardrail stack (`rm -rf /`, `sudo`, `chmod 777`, `curl | bash`, `git push origin main`) and a local `ShellDeniedError` (§15.8 will fold it into a shared taxonomy). Background processes land in a module-level registry with reader threads draining stdio so tails don't deadlock on pipe buffers.
  • `SearchToolsMixin` — `grep` (delegates to `search_code`), `find_symbol` (Python-only via `ast`), `list_files`. Non-Python `find_symbol` calls log WARN and return `[]` rather than silently skipping.

Test plan

  • `pytest tests/coder/test_mixins.py` — 38/38 passing (happy + failure + registry per tool)
  • `flake8 --max-line-length=88 --select=E9,F63,F7,F82,F401,F811 src/gaia/coder tests/coder` — clean
  • `black --check` / `isort --check-only` — clean
  • Integration smoke: `from gaia.coder.tools import *; FileToolsMixin().register_file_tools()` — 13 tools land in `_TOOL_REGISTRY`
  • `git diff --stat` — changes confined to `src/gaia/coder/` and `tests/coder/`

Not in this PR

  • `CoderAgent` class / `loop.py` — Stream A (`feature/gaia-coder-scaffold`)
  • SQLite stores — Stream C (`feature/gaia-coder-stores`)
  • `GitHubToolsMixin`, `WebToolsMixin`, `OSSReuseMixin` — follow-up tasks (require network)
  • Full exception taxonomy for denylist — §15.8

Status: draft — do not merge until the scaffold PR lands and wiring is verified end-to-end.

@github-actions github-actions Bot added the tests Test changes label Apr 20, 2026
@kovtcharov kovtcharov changed the base branch from feature/coding-agent to coder April 20, 2026 08:17
@kovtcharov kovtcharov changed the base branch from coder to main April 20, 2026 08:25
@kovtcharov kovtcharov changed the base branch from main to coder April 20, 2026 08:25
@kovtcharov kovtcharov marked this pull request as ready for review April 20, 2026 09:20
Implements §15.2 of the gaia-coder plan (docs/plans/coder-agent.mdx):
read_file, write_file, edit_file, search_code, glob, generate_diff.

Pure-Python (re + pathlib + difflib) so behaviour is deterministic on
every platform — no dependency on ripgrep or external binaries. read_file
and edit_file raise loudly on missing paths / non-unique matches per the
fail-loudly principle rather than returning error envelopes.

18 tests under tests/coder/test_mixins.py cover happy path, failure
modes, and registry wiring.
Implements §15.2 CLI tools: run_cli_command, stop_process, list_processes,
get_process_logs, plus the static-denylist layer of the §6.8 guardrail
stack (rm -rf /, sudo, chmod 777, curl | bash, git push origin main).

Background processes land in a module-level _PROCESS_REGISTRY with reader
threads draining stdout/stderr so foreground callers can tail logs without
deadlocking pipe buffers. ShellDeniedError is defined locally for now —
§15.8 will fold it into a shared exception taxonomy.

12 new tests covering happy path, non-zero exit, denylist, background
lifecycle, log capture, and registry wiring (30 total).
Implements §15.2 Search tools: grep (delegates to search_code), find_symbol
(Python-only via ast walk for FunctionDef/ClassDef/Import), list_files.

SearchToolsMixin inherits from FileToolsMixin so agents that opt into
search automatically get file tools — in practice you always want both.
find_symbol logs a WARN and returns [] on repos with no .py files rather
than silently skipping (§2 principle 3).

8 new tests covering happy path, kind-filter, unsupported-language WARN,
recursive list, directory exclusion, and registry wiring (38 total).
@kovtcharov kovtcharov force-pushed the feature/gaia-coder-mixins-1 branch from 4a03aef to 67eb7a5 Compare April 20, 2026 09:21
@kovtcharov kovtcharov merged commit bf18bbe into coder Apr 20, 2026
2 checks passed
@kovtcharov kovtcharov deleted the feature/gaia-coder-mixins-1 branch April 20, 2026 09:21
@github-actions
Copy link
Copy Markdown
Contributor

Code Review — #818 feat(coder): File/CLI/Search mixins per §15.2

Summary

Clean, well-tested foundation for the three no-network coder mixins. Stdlib-only implementation is portable and deterministic, the happy/failure/registration triad per tool is exemplary, and the 514-line test file is mostly readable. However, the post-rebase state of src/gaia/coder/__init__.py appears to drop the public re-exports added in #819 (CoderAgent, DEFAULT_LOOP, …), which breaks tests/coder/test_skeleton.py::test_package_imports. A handful of smaller issues around private-API coupling, a CLAUDE.md-forbidden silent except, and filesystem-walk hygiene are worth cleaning up in a follow-up.

Since the PR is already merged, flag these for a fast follow-up rather than a revert.

Issues Found

🔴 Critical

Package-level re-exports from #819 got dropped in the rebase (src/gaia/coder/__init__.py)

The diff replaces the __init__.py that #819 landed (which exposed CoderAgent, DEFAULT_LOOP, Loop, State, Transition) with a comment-only module. tests/coder/test_skeleton.py:50 still asserts those imports:

def test_package_imports() -> None:
    """Public re-exports must resolve from the top-level package."""
    from gaia.coder import CoderAgent  # noqa: F401  (import is the assertion)

Verified — after this PR's merge, gaia.coder.__init__ exports nothing. base.py:105 also has from gaia.coder import CoderAgent in its docstring example, so the docs are also wrong now.

Suggested fix — restore the exports while keeping the terser docstring from this PR:

# Copyright(C) 2026 Advanced Micro Devices, Inc. All rights reserved.
# SPDX-License-Identifier: MIT
"""gaia-coder — dev-tooling coding agent (see docs/plans/coder-agent.mdx).

This package is intentionally separate from ``src/gaia/agents/`` — the coder
is infrastructure for building GAIA, not a GAIA product agent.
"""

from gaia.coder.base import CoderAgent
from gaia.coder.loop import DEFAULT_LOOP, Loop, State, Transition

__all__ = [
    "CoderAgent",
    "DEFAULT_LOOP",
    "Loop",
    "State",
    "Transition",
]

🟡 Important

Silent except Exception: pass violates CLAUDE.md "No Silent Fallbacks" (src/gaia/coder/tools/cli.py:208)

finally:
    try:
        stream.close()
    except Exception:  # pylint: disable=broad-except
        pass

CLAUDE.md § No Silent Fallbacks — Fail Loudly explicitly calls this out and warns "The codebase has pre-existing except Exception: pass blocks… They are tech debt, not precedent". A stream-close failure here is real (e.g., OSError on a torn-down pipe) and swallowing it hides reader-thread bugs. Use a specific exception or re-raise with context:

        finally:
            try:
                stream.close()
            except OSError as e:
                logger.debug("reader thread: stream close failed (%s)", e)

Reaching into a private registry instead of the public accessor (src/gaia/coder/tools/search.py:672)

from gaia.agents.base.tools import _TOOL_REGISTRY
search_code = _TOOL_REGISTRY["search_code"]["function"]

The base module already ships a public helper at src/gaia/agents/base/tools.py:116:

def get_tool(tool_name: str) -> ...:
    """This is the public accessor for ``_TOOL_REGISTRY``.  Consumers outside
    [...] should use this function instead of importing ``_TOOL_REGISTRY``
    directly.
    """
    return _TOOL_REGISTRY.get(tool_name)

Use it:

            # Fetch search_code from the registry rather than re-importing so
            # tests that monkey-patch it still work.
            from gaia.agents.base.tools import get_tool

            entry = get_tool("search_code")
            if entry is None:
                raise RuntimeError(
                    "grep requires search_code to be registered; "
                    "call register_file_tools() first"
                )
            search_code = entry["function"]

Denylist is trivially bypassable AND prone to false positives (src/gaia/coder/tools/cli.py:146-164)

_SHELL_DENYLIST: tuple = ("rm -rf /", "sudo", "chmod 777", "curl | bash", "git push origin main")

Literal-substring matching against a joined argv:

  • sudo matches benign commands like sudoku, pseudo-compiler, --help-sudo.
  • git push origin main does not match ["git", "push", "-f", "origin", "main"] (different argv order / extra flags) — the most important case to block slips through.
  • curl | bash cannot match anything because subprocess never sees a pipe when shell=False.

The PR description calls out §15.8 will harden this, and the in-file comment is honest. But the v1 denylist gives the illusion of a guardrail without the substance — consider either (a) matching on argv[0] + flag intersection, or (b) making the denylist empty in v1 and landing §15.8 before any agent wires this in. Discuss with maintainer — happy to defer if §15.8 is near.

Recursive tree walks don't prune heavy directories (src/gaia/coder/tools/file.py:597-607, src/gaia/coder/tools/search.py:692)

search_code / find_symbol / list_files all walk .git/, .venv/, node_modules/, __pycache__/. On this repo alone .git/ has ~35k objects and node_modules/ in src/gaia/apps/webui/ has ~300 MB. A single find_symbol("foo") call will read every one of those files before returning. This is a perf regression and a latency footgun for the LLM loop.

Cheapest fix is a default skip-set in _iter_text_files:

_DEFAULT_SKIP_DIRS = frozenset({
    ".git", ".hg", ".svn", ".venv", "venv", "__pycache__",
    "node_modules", ".mypy_cache", ".pytest_cache", ".ruff_cache",
    "dist", "build", ".next",
})


def _iter_text_files(base: Path, glob_pattern: Optional[str]):
    """Yield files under ``base`` that match ``glob_pattern`` (or everything)."""
    if base.is_file():
        if glob_pattern is None or fnmatch.fnmatch(base.name, glob_pattern):
            yield base
        return
    for root, dirs, files in os.walk(base):
        dirs[:] = [d for d in dirs if d not in _DEFAULT_SKIP_DIRS]
        for name in files:
            if glob_pattern is not None and not fnmatch.fnmatch(name, glob_pattern):
                continue
            yield Path(root) / name

Apply the same _DEFAULT_SKIP_DIRS filter inside find_symbol's rglob loop (search.py:692) — or better, route find_symbol through _iter_text_files with glob="*.py".

🟢 Minor

Dead background processes are never GC'd from the registry (src/gaia/coder/tools/cli.py:341-357)

After a background command exits naturally (not via stop_process), its entry stays in _PROCESS_REGISTRY forever — list_processes keeps returning it with running=False. A long-running agent accumulates zombie-like entries. Cheap fix: prune entries where proc.poll() is not None inside list_processes.

SearchToolsMixin.register_search_tools() silently double-registers file tools (src/gaia/coder/tools/search.py:661)

SearchToolsMixin inherits FileToolsMixin and calls self.register_file_tools() inside register_search_tools(). A class that composes both mixins and calls both registration methods (a natural pattern given the KNOWN_TOOLS convention in src/gaia/agents/registry.py) will re-register every file tool. Base @tool decorator overwrites silently, so the bug is invisible until someone monkey-patches. Either document that callers must pick one, or add an idempotency guard.

Timing-dependent test (tests/coder/test_mixins.py:1177)

time.sleep(0.8)
logs = get_process_logs(pid, tail_lines=10)
assert "line" in logs

0.8s is fine on a dev laptop; on a loaded CI runner it will occasionally miss. Prefer polling with a short timeout:

        deadline = time.monotonic() + 5.0
        logs = ""
        while time.monotonic() < deadline:
            logs = get_process_logs(pid, tail_lines=10)
            if "line" in logs:
                break
            time.sleep(0.05)
        assert "line" in logs

search_code reads full file before decode-fail (src/gaia/coder/tools/file.py:546-548)

file_path.read_text(encoding="utf-8") loads the whole blob into memory, then UnicodeDecodeError fires on the first bad byte. A 500 MB vendored binary still gets fully read. A cheap size gate (skip files >1 MB for regex search) would harden this and pairs naturally with the skip-dir fix above.

PR description says draft — do not merge until the scaffold PR lands and wiring is verified end-to-end

It was merged anyway. Not a code issue — but worth flagging to @kovtcharov-amd that merge gates didn't honor the stated intent, and that "wiring verified end-to-end" is precisely the step that would have caught the Critical issue above.

Strengths

  • Test coverage is the right shape. Happy / failure / registration triad per tool, with tmp_path isolation and a registry_snapshot fixture that prevents test cross-contamination. test_edit_file_non_unique_raises, test_run_cli_command_denylist_sudo, and test_find_symbol_unsupported_language show care about edge cases.
  • Pipe-buffer deadlock avoidance is correct. _spawn_reader draining stdout/stderr on dedicated daemon threads is the standard fix for long-running background processes that produce stdio; easy to get wrong and this PR gets it right.
  • edit_file semantics mirror Claude Code Edit. Exact-match + uniqueness enforcement + opt-in replace_all is exactly what the calling LLM expects, and the ValueError messages name the problem in a way an agent can recover from (per CLAUDE.md actionable-errors rule).

Verdict

Approve with follow-up required — the PR itself is good work, but the rebase dropped the __init__.py exports from #819, which breaks tests/coder/test_skeleton.py. That needs a fix-forward PR within a day, along with the private-registry coupling in search.py and the except Exception: pass in cli.py:208. The rest (denylist hardening, skip-dirs, registry GC) can ride §15.8 or a dedicated polish pass.

kovtcharov added a commit that referenced this pull request Apr 20, 2026
Three review-followups from the #818/#819/#820 merge:

- src/gaia/coder/__init__.py: restore the CoderAgent/DEFAULT_LOOP/Loop/State/Transition
  re-exports that #819 added and the #818 rebase accidentally dropped. test_package_imports
  in test_skeleton.py relies on these.

- src/gaia/coder/tools/cli.py: replace the bare except Exception: pass in the stream-reader
  teardown with a targeted OSError catch + logger.debug. Per CLAUDE.md's fail-loudly rule,
  silent swallows hide reader-thread bugs; a stream close failing under pipe tear-down is
  real and worth a debug line.

- src/gaia/coder/tools/search.py: stop reaching into the private _TOOL_REGISTRY dict from
  grep(). Use get_tool_metadata() — the public accessor added in base/tools.py:113 exists
  precisely for this.

Tests: all 73 (5 skeleton + 38 mixins + 30 stores) pass on coder HEAD with the fixes.
kovtcharov added a commit that referenced this pull request Apr 20, 2026
#823)

## Summary

Three review-followups from the #818 / #819 / #820 merge, flagged by the
auto-review bot. All tests (73/73 on `coder`) pass with the fixes.

## What this changes

- **Critical** — `src/gaia/coder/__init__.py`: restore the `CoderAgent`
/ `DEFAULT_LOOP` / `Loop` / `State` / `Transition` re-exports that #819
added and #818's rebase accidentally dropped. Without this,
`tests/coder/test_skeleton.py::test_package_imports` fails on `coder`
HEAD.

- **Important** — `src/gaia/coder/tools/cli.py:135`: replace bare
`except Exception: pass` in the stream-reader teardown with a targeted
`OSError` catch + `logger.debug`. CLAUDE.md's *No Silent Fallbacks* rule
explicitly forbids this pattern.

- **Important** — `src/gaia/coder/tools/search.py:59`: stop reaching
into the private `_TOOL_REGISTRY` from `grep()`. Use
`get_tool_metadata()` — the public accessor at
`src/gaia/agents/base/tools.py:113` exists for this.

## Test plan

- [x] `pytest tests/coder/ -x` — 73/73 pass (5 skeleton + 38 mixins + 30
stores)
- [x] `python -c "from gaia.coder import CoderAgent, DEFAULT_LOOP, Loop,
State, Transition"` — all imports resolve
kovtcharov added a commit that referenced this pull request Apr 20, 2026
…rential

Implements §7.4 steps 4-5 end-to-end:

* generate_fix() creates auto/gaia-coder/<feedback_id> off the base ref
  (default 'coder' — §5.6, never main), then walks the caller-supplied
  EditHunks via an _edit_file_impl free function that mirrors the exact
  semantics of FileToolsMixin.edit_file from PR #818 (unique-match,
  opt-in replace_all, fail-loudly on missing). The branch is idempotent:
  re-runs check out the existing branch rather than error.
* write_regression_test() emits a pytest file plus a marker flag so the
  generated test genuinely fails on the base branch (marker missing)
  and passes on the fix branch (marker present) — the §7.4 step 5
  contract without needing any cleverness. Refuses to run unless the
  working tree is on a SELF_FIX_BRANCH_PREFIX branch.
* verify_test_differential() runs pytest on both refs and raises
  RuntimeError if the fail-then-pass contract is violated (pass on
  both → regression not exercised; fail on both → fix does not fix).

sys.executable is used for subprocess pytest invocations so hosts
without a `python` symlink on PATH still work.
kovtcharov added a commit that referenced this pull request Apr 20, 2026
## Summary

Phase 4 of the `gaia-coder` plan: the seven-pass self-review gate that
runs before she ever calls `gh_pr_create`. Implements the full §8 table
— deterministic checks (lint, tests, security, prose) plus LLM-driven
passes (architectural, persona, adversarial, feedback-binding) — and
surfaces an aggregated verdict with the confidence score the §7.6
auto-merge path reads.

This is the "deep-review discipline" the trust contract is built on.
Every PR she opens must clear this gate; self-fix PRs additionally
require Pass 7 (feedback-binding), which confirms the diff actually
addresses the EM's wording and that the regression test fails on `coder`
and passes on the branch.

## What this PR adds

- **Seven review passes** (`pass_1_static` through
`pass_7_feedback_binding`) — each a focused module that returns a common
`PassResult` envelope, so the gate never has to special-case any of
them.
- **`gate.run_all_passes`** — orchestrator with cost-aware
short-circuiting: a hard-fail on the deterministic cheap gates (1, 2, 4)
skips the expensive Opus calls, because there is no point paying for
adversarial review on a branch that fails lint.
- **`ReviewToolsMixin`** — exposes every pass as a `@tool`
(`review_diff_self_static`, …) plus the one-shot `review_diff_gate`, so
the agent, the EM's TUI, and the evaluation harness all share one code
path.
- **Canonical prompt files** for Passes 3, 5, 6, 7 under
`src/gaia/coder/prompts/`, materialised verbatim from §15.8 — self-fix
PRs that touch a prompt are now a single-file `git diff`, the whole
point of the §6.4 whitelist.
- **22 unit tests** that cover every pass (one pass + one fail per
module), the gate's short-circuit behaviour, the self-fix gating of Pass
7, the confidence-score contract, and a prompt-files-exist guard so the
canonical templates can't silently drift.

## Why the LLM seam matters

Every Opus call flows through `gaia.coder.review._llm.call_opus`. Tests
patch that one name — cheap, stable, vendor-agnostic — rather than
reaching into the Anthropic SDK. When we ship the Claude Agent SDK
integration for Passes 3 and 6 (the `architecture-reviewer` /
`code-reviewer` subagents in §15.8), the switch happens at a single
module boundary without churning every test.

## Scope constraint

Changes are confined to `src/gaia/coder/review/`,
`src/gaia/coder/prompts/`, and `tests/coder/test_review.py`. No other
files are touched. Pass 7's differential-pytest worktree machinery is
wired but opt-out via `skip_differential_pytest=True` for unit tests,
because Phase 4 intentionally does not ship a running daemon — that's
Phase 5.

## Dependencies

- Stacks on top of #818 (mixins; uses
`gaia.coder.tools.cli._check_denylist` for subprocess safety) and #819
(scaffold). Both appear in this PR until they merge to `coder` — the
diff will clean itself up afterwards.
- Does not depend on #820 (stores).

## Citations

-
[`docs/plans/coder-agent.mdx`](../blob/feature/gaia-coder-review/docs/plans/coder-agent.mdx)
§8 — the seven-pass table
- §15.8 — canonical prompt templates for Passes 3/5/6/7
- §7.6 — confidence-score gate on auto-merge
- §7.4 — self-correction loop that Pass 7 bookends

## Test plan

- [x] `pytest tests/coder/test_review.py -xvs` — 22/22 pass
- [x] `pytest tests/coder/` — 65/65 pass (sibling tests untouched)
- [x] `python util/lint.py --black --isort` — clean on in-scope files
- [x] Smoke: `from gaia.coder.review import ReviewToolsMixin; m =
ReviewToolsMixin(); m.register_review_tools()` registers exactly 8 tools
- [ ] Integration run against a real branch with Anthropic creds present
(Phase 5 territory — deferred to the daemon wiring task)

## Do not merge

Draft: this stacks on two unmerged sibling PRs and the Phase 4 runtime
wiring lives in Phase 5.
kovtcharov added a commit that referenced this pull request Apr 20, 2026
## Summary

Lands Phase 5 of `docs/plans/coder-agent.mdx` — the EM-facing surface of
`gaia-coder`. Before this PR the CLI verbs were stubs; after it, the EM
can bootstrap the agent, read her trust contract, promote/demote her,
and queue messages. Every LLM call now injects her identity triplet
(`GAIA.md` + `ARCHITECTURE.md` + `PROJECT_MAP.md`) as a cacheable
prefix.

## Threads

- **`trust.py`** — `EMConfig` / `RepoBinding` Pydantic models, the 0-5
  `CapabilityTier` ladder, TOML round-trip, and `promote` / `demote`
  functions with audit-log writes. Promotion refuses mismatched EM
  signatures; demotion is immediate. *Why it matters:* §4.2 makes
  promotion explicit, so a quiet accept would let any caller escalate
  tier. Fail-loudly TrustError surfaces exactly what to fix.
- **`inbox.py`** — thin CRUD over `em_inbox.db` with the §4.5 5-second
  non-LLM auto-ack, channel-agnostic dispatch callable, and escalation
  into `feedback.db` with severity translation. *Why it matters:*
  §4.5 says the ack is non-negotiable in latency; keeping it template-
  only (no model call) makes the SLA automatic.
- **`intent.py`** — LLM-driven conversational intent classifier for
  §15.4 + §15.8 P9, temperature 0 Opus 4.7, mockable via an injected
  `llm` callable. Low confidence (< 70) and unknown intents coerce to
  `free_form`. Handler functions cover every §15.4 intent. *Why it
  matters:* regex matchers would miss paraphrases ("let me give you
  self-edit for now"); LLM routing keeps the grammar maintainable.
- **`prompt_composer.py`** — builds Anthropic-format message blocks
  with `cache_control={"type":"ephemeral"}` on the identity triplet
  + per-skill blocks, per §3.2 / §4.6 / §6.5. *Why it matters:* §3.1
  mandates prompt caching; this is the single place that decides what
  gets cached.
- **`cli.py`** — replaces seven stub handlers (`trust`, `promote`,
  `demote`, `ask`, `note`, `critical`, `inbox`) with real ones. Config
  dir honours `$GAIA_CODER_HOME` so tests never touch real user state.
  Stubs remain for the Phase 6+ verbs (`daemon`, `status`, `feedback`,
  `doctor`, etc.).
- **`prompts/intent_classifier.md` + `prompts/standup.md`** — §15.8 P9
  and P10 prompt templates landed verbatim for future `prompt`-class
  self-fix PRs.
- **Tests (58 new)** — each module has a dedicated test file; CLI tests
  run as subprocess to exercise the full argparse + env-var path.

## Dependencies

This PR depends on sibling branches that have not yet merged to `coder`:

- **#819 scaffold** — imports `gaia.coder.{__init__,base,loop}` and the
  `GAIA.md` / `ARCHITECTURE.md` / `PROJECT_MAP.md` placeholders.
- **#820 stores** — hard dep on `gaia.coder.stores.{em_inbox, feedback,
  audit}`.
- **#818 mixins** — optional import of `gaia.coder.tools.cli` for
  subprocess helpers (not used in this PR's code paths).

Rebase onto `coder` once those land.

## Test plan

- [ ] `pytest
tests/coder/test_{trust,intent,inbox,prompt_composer,cli_trust}.py -xvs`
— all 58 tests pass
- [ ] `gaia-coder trust` on a fresh `$GAIA_CODER_HOME` prints the §4.1
bootstrap question and exits 0
- [ ] `gaia-coder trust --bootstrap --em-handle <you> --em-channel <ch>`
then `gaia-coder trust` renders the §4.2 template verbatim
- [ ] `gaia-coder promote --to-tier 2 --reason "..." --em-signature
<you>` updates `em.toml` and writes an audit row
- [ ] `gaia-coder promote ... --em-signature wrong-user` exits 1 with
the mismatch message on stderr
- [ ] `gaia-coder ask "enable self-edit"` prints the auto-ack template
and enqueues a pending row in `em_inbox.db`
- [ ] `gaia-coder inbox` lists the pending row
kovtcharov added a commit that referenced this pull request Apr 20, 2026
…rential

Implements §7.4 steps 4-5 end-to-end:

* generate_fix() creates auto/gaia-coder/<feedback_id> off the base ref
  (default 'coder' — §5.6, never main), then walks the caller-supplied
  EditHunks via an _edit_file_impl free function that mirrors the exact
  semantics of FileToolsMixin.edit_file from PR #818 (unique-match,
  opt-in replace_all, fail-loudly on missing). The branch is idempotent:
  re-runs check out the existing branch rather than error.
* write_regression_test() emits a pytest file plus a marker flag so the
  generated test genuinely fails on the base branch (marker missing)
  and passes on the fix branch (marker present) — the §7.4 step 5
  contract without needing any cleverness. Refuses to run unless the
  working tree is on a SELF_FIX_BRANCH_PREFIX branch.
* verify_test_differential() runs pytest on both refs and raises
  RuntimeError if the fail-then-pass contract is violated (pass on
  both → regression not exercised; fail on both → fix does not fix).

sys.executable is used for subprocess pytest invocations so hosts
without a `python` symlink on PATH still work.
kovtcharov added a commit that referenced this pull request Apr 20, 2026
…rential

Implements §7.4 steps 4-5 end-to-end:

* generate_fix() creates auto/gaia-coder/<feedback_id> off the base ref
  (default 'coder' — §5.6, never main), then walks the caller-supplied
  EditHunks via an _edit_file_impl free function that mirrors the exact
  semantics of FileToolsMixin.edit_file from PR #818 (unique-match,
  opt-in replace_all, fail-loudly on missing). The branch is idempotent:
  re-runs check out the existing branch rather than error.
* write_regression_test() emits a pytest file plus a marker flag so the
  generated test genuinely fails on the base branch (marker missing)
  and passes on the fix branch (marker present) — the §7.4 step 5
  contract without needing any cleverness. Refuses to run unless the
  working tree is on a SELF_FIX_BRANCH_PREFIX branch.
* verify_test_differential() runs pytest on both refs and raises
  RuntimeError if the fail-then-pass contract is violated (pass on
  both → regression not exercised; fail on both → fix does not fix).

sys.executable is used for subprocess pytest invocations so hosts
without a `python` symlink on PATH still work.
kovtcharov added a commit that referenced this pull request Apr 20, 2026
…825)

## Summary

Phase 6 implements **gaia-coder's self-correction loop** — the core
value proposition. When the EM gives feedback, the agent now triages it
into one of eight fix classes, localises the cause, drafts a
regression-tested plan, applies a fix on an
`auto/gaia-coder/<feedback_id>` branch, and opens a draft PR targeting
`coder`. Wires §7.2 continuous critique, §7.3 feedback intake, §7.4 loop
steps 1-10, and §7.9 verification. Loosely coupled to Phase 4 (review
gate) and Phase 5 (trust inbox) so they can land in any order.

## Threads

- **Prompt templates (§15.8 P1/P2/P3).** `prompts/triage.md`,
`prompts/critique.md`, and `prompts/plan_review.md` — the three
canonical text bodies the loop consumes. Stored as plain Markdown so she
can edit them via prompt-class self-fix PRs.
- **Triage (§7.4 step 1-2).** `classify_fix_class` runs P1 on Opus 4.7;
`< 60` confidence is rewritten to `out-of-scope` so the loop never
commits to a guess. `localise` is deterministic grep — no LLM.
- **Planner (§5.1 Stage 3 / §7.4 step 3).** `draft_plan` +
`is_large_job` + `request_em_approval`. Large jobs (> 200 LoC, or
`architectural` / `state-machine`, or cross-mixin) post the P3 message
to the EM inbox and wait for ✅.
- **Fixer (§7.4 steps 4-5).** `generate_fix` creates the self-fix branch
and applies edits; `write_regression_test` emits a pytest file + marker
flag so it *genuinely* fails on `coder` and passes on the fix branch (no
clever mocks); `verify_test_differential` raises on the pass-on-both or
fail-on-both failure modes.
- **Publisher (§7.4 steps 7-8).** `open_self_fix_pr` refuses to open
without a regression test (§7.4 step 5 hard rule). PR body cites
`feedback_id` and quotes the EM's wording verbatim — Pass 7
(feedback-binding) depends on both.
- **Verifier (§7.4 step 10).** `verify_on_merge` re-runs the regression
test on the merged SHA, transitions the feedback row to `verified`, and
writes `failure_patterns` + `review_patterns` memory records so the same
symptom is recognised next time.
- **Continuous critique (§7.2).** Cheap one-shot Opus call after every
state-changing tool; findings `< 60` confidence are dropped; `≥ 80`
surface inline for pre-transition action.
- **Loop driver.** `FeedbackLoopDriver.process_pending_feedback()`
orchestrates the whole thing with full `pending → triaged → in-fix →
fix-pr-open → verified | rejected → closed` transitions written to
`feedback.notes_json`.
- **SelfFixToolsMixin.** Registers **10 tools** (well above the §15.2 ≥
7 contract) so the loop is callable from the agent's tool registry.
Phase-7 tools (`classify_failure`, `pause_current_task`, `restart_self`,
`edit_self_file`) are intentionally out of scope.
- **CLI.** `gaia-coder feedback "<body>" --severity high --on <url>`
enqueues, `gaia-coder self-fix process` runs one iteration.
- **Tests.** 47 new tests covering every §7.4 step, every §7.2 filtering
rule, and the §15.2 mixin contract. Mocks Anthropic and `gh` at the
callable boundary; uses a tmp git repo with a `coder` branch for real
branch creation and pytest differential runs.

## Out of scope (explicit non-goals per the Phase 6 task)

- EventBridge ingestion of `@gaia-coder feedback:` comments (Phase 10
repo binding).
- Auto-merge (§7.6) — blocked on Phase 4 review-gate confidence score.
- Dev-mode self-edit (§7.5) and `restart_self` — Phase 7.
- ReAct loop self-edit (§7.8) — Phase 7.

## Dependencies

- **#818 (mixins):** Imports `edit_file` semantics inline rather than
instantiating `FileToolsMixin`, to keep the module usable without the
mixin wired.
- **#819 (scaffold):** Package skeleton + GAIA.md + prompts/ dir.
- **#820 (stores):** Uses `gaia.coder.stores.feedback` and
`gaia.coder.stores.memory` directly.
- **Phase 4 review gate (sibling):** Loose-coupled —
`review_gate_runner` is optional; absent, we publish a "(review gate not
available)" placeholder in the PR body.
- **Phase 5 trust (sibling):** Loose-coupled — `request_em_approval`
uses `gaia.coder.trust.inbox.enqueue` if importable, else defers.

## Test plan

- [ ] `pytest tests/coder/test_self_fix/ -xvs` — 47 tests pass.
- [ ] `pytest tests/coder/ -q` — full coder suite (120 tests) pass.
- [ ] `python util/lint.py --all` — no new critical errors (pre-existing
pylint warnings in `cli.py`, `discovery.py`, etc. are not touched).
- [ ] `python -c "from gaia.coder.self_fix import SelfFixToolsMixin; m =
SelfFixToolsMixin(); assert len(m.register_self_fix_tools()) >= 7"` —
smoke.
- [ ] `gaia-coder feedback "..." --severity high --on
https://github.com/amd/gaia/pull/9999 --id fb-test --db-path /tmp/fb.db`
— writes a row; follow with `gaia-coder self-fix process --db-path
/tmp/fb.db --repo-root <worktree> --skip-differential-verify
--skip-fix-apply` for the end-to-end path (PR creation mocked).

## Merge plan

- Draft PR, **do not merge**. Rebase onto `coder` after Phase 4 and
Phase 5 land so the review gate and inbox wiring become real imports.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant