Skip to content

feat(coder): Phases 9+10 — OSS reuse + codebase research + repo binding + RAG freshness#827

Merged
kovtcharov merged 6 commits into
coderfrom
feature/gaia-coder-p9p10
Apr 20, 2026
Merged

feat(coder): Phases 9+10 — OSS reuse + codebase research + repo binding + RAG freshness#827
kovtcharov merged 6 commits into
coderfrom
feature/gaia-coder-p9p10

Conversation

@kovtcharov
Copy link
Copy Markdown
Collaborator

Summary

Bundles Phases 9 and 10 of gaia-coder into one draft PR to keep the public PR count low. Lands five new modules + the registry entries that make them discoverable. Everything is additive — no existing coder code is touched, no CLI wiring yet (that's the unification follow-up).

  • OSS reuse + license gate (§5.3, §5.4) — OSSReuseMixin hard-fails GPL/AGPL/LGPL/SSPL/BUSL and vendors compatible sources with a # Adapted from <repo> @ <sha> — <license> header plus a THIRD_PARTY_NOTICES.md append. Prevents the "quietly vendored GPL into MIT repo" regression class at the tool layer, not just in review.
  • Codebase-research subagent (§5.10) — research(source, question, ...) runs a short-lived observational subagent with a hard wall-clock + $ + tool-call budget. Keeps external-repo investigation out of the primary RAG (cache pollution) and away from write_file / gh / memory writes (safety).
  • Repo binding + doctor gate (§5.11, §15.6) — Pydantic schema for repo_binding.toml; doctor() aggregates four bootstrap invariants (App install, PEM decrypts, webhook signature round-trip, coder branch exists). Agent is blocked until doctor().green is True — the §15.6 hard bootstrap gate.
  • RAG freshness contract (§6.9) — StaleIndexError raised when a query's youngest-indexed doc is older than 12h; reindex_watchdog fires critical-severity at 36h; check_citation_valid is the Pass 3 gate that rejects "cites a file that was indexed but has been deleted/renamed."
  • GitHubToolsMixin (§15.2) — the 11 gh_* wrappers. Foundational for OSS reuse (which composes _run_gh) and for the whole §5.5 event-trigger layer that lands later.
  • KNOWN_TOOLS registrationgithub + oss_reuse added so YAML-manifest agents can opt in via tools: [github, oss_reuse]. Schema regenerated.

Test plan

  • pytest tests/coder/test_github_tools.py — 21 passing
  • pytest tests/coder/test_oss_reuse.py — 18 passing (incl. explicit torvalds/linux → GPL-rejected and pallets/click → BSD-3-accepted acceptance checks)
  • pytest tests/coder/test_repo_binding.py — 17 passing
  • pytest tests/coder/test_rag_freshness.py — 22 passing
  • pytest tests/coder/test_codebase_research.py — 15 passing
  • pytest tests/coder/ full coder suite — 291 passed, 2 skipped (nothing regressed)
  • python util/lint.py --all — no errors on new files
  • Scope check: git diff --stat origin/coder..HEAD is limited to the five new modules, their tests, the two-line KNOWN_TOOLS addition, and the auto-regenerated schema

No real gh / network / keyring calls in tests — every external boundary is mocked (_run_gh, _gh_api, _fetch_raw, cloner, keyring_getter, gh_runner).

Do not merge — draft for review.

@github-actions github-actions Bot added agents tests Test changes labels Apr 20, 2026
11 `gh_*` tools wrapping the `gh` CLI via subprocess. Tests mock the
`_run_gh` boundary to avoid real API calls. Every non-zero `gh` exit
raises `GitHubCLIError` with the stderr attached — fail-loudly per
CLAUDE.md rule. Defaults `--repo` to $GITHUB_REPO so the LLM never
repeats owner/name and a stray call cannot target a different repo.

21 tests, all passing.
Four tools (gh_search_code, gh_search_repos, vet_license,
import_with_attribution) implementing the license-aware-reuse contract
from §5.4:

* Permissive allowlist: MIT, BSD-2/3-Clause, Apache-2.0, ISC, Unlicense, 0BSD.
  GPL/AGPL/LGPL/SSPL/BUSL hard-fail with LicenseIncompatibleError.
* import_with_attribution enforces four guarantees at the tool layer:
  compatible license, SHA (not branch) pin, per-file
  '# Adapted from <repo> @ <sha> — <license>' header, and an
  append-only THIRD_PARTY_NOTICES.md entry at repo root.
* All network I/O funnels through _gh_api / _fetch_raw; tests mock both.

18 tests, all passing.
* RepoBinding Pydantic model mirroring the §15.6 TOML layout. Missing or
  malformed fields raise RepoBindingError — no silent defaults.
* doctor() aggregates four bootstrap checks (App install, PEM decrypts,
  webhook signature round-trip, coder branch exists) into a
  DoctorResult. The agent refuses to act until result.green is True.
* verify_webhook_signature for §15.5 HMAC-SHA256 header validation.
* agents_md_entry renders the canonical §5.11 discoverability block for
  AGENTS.md / .github/copilot-instructions.md.

17 tests, all passing. keyring_getter + gh_runner are injected so tests
never touch real credentials or the network.
* FreshnessContract.default() declares the five §6.9 corpora
  (source_tree, pr_descriptions, issues, adrs_plans, claude_agents_md)
  with a 12h per-corpus staleness threshold and a 36h watchdog.
* ensure_fresh_or_raise raises StaleIndexError — §6.9 'fail loudly,
  never silently degrade' rule. The error message points at the
  exact CLI command that fixes it.
* reindex_watchdog surfaces a 'critical'-severity EM-inbox message if
  any corpus is past 36h (or never indexed at all).
* check_citation_valid wraps 'git cat-file -e <ref>:<path>' and
  raises CitationStaleError on a missing path — the Pass 3
  architectural check for 'cites a file that was indexed but has
  been deleted/renamed'.
* rag_status / rag_refresh / rag_rebuild expose the Python API
  behind the CLI commands (CLI wiring lives in the unification
  follow-up).

22 tests, all passing. Provider/runner callables injected so tests
never touch real RAG backends or git.
research(source, question, ...) dispatches a short-lived subagent into a
scratch workspace under ~/.gaia/coder/research/<session>/ by default.
The subagent answers the question and returns a StructuredAnalysis
matching the §5.10 schema verbatim.

Hard budget enforcement via ResearchBudget: wall-clock (default 10m),
dollar (default $2), and tool-call (default 40) ceilings. Tripping
any ceiling raises BudgetExceededError with the partial analysis
attached so the caller can still log what was produced.

The LLM engine is injected via the ResearchEngine protocol; this
module never imports an Anthropic / Claude SDK — the caller owns the
model binding. Tests inject a canned engine + stub cloner so no
network, no real clone.

15 tests, all passing. Scratch workspaces are cleaned up on exit
unless keep=True, and cleanup still runs on budget-exception paths
via a try/finally block.
Adds the two new coder mixins to KNOWN_TOOLS so YAML-manifest agents
can opt in via `tools: [github, oss_reuse]` (per §15.2 / §5.2).
Regenerates schemas/agent-manifest.schema.json to match.

Also folds in black+isort formatting pass on the five new modules and
their tests, plus cleans up unused imports flake8 flagged. Net 80+
line reduction vs. the initial drop. Tests unchanged — all 93 new
tests still pass, full coder suite still at 291 passed / 2 skipped.
@kovtcharov kovtcharov force-pushed the feature/gaia-coder-p9p10 branch from 543c9d0 to 8b99a6b Compare April 20, 2026 10:23
@kovtcharov kovtcharov marked this pull request as ready for review April 20, 2026 10:24
@kovtcharov kovtcharov merged commit ea273b9 into coder Apr 20, 2026
10 checks passed
@kovtcharov kovtcharov deleted the feature/gaia-coder-p9p10 branch April 20, 2026 10:24
@github-actions
Copy link
Copy Markdown
Contributor

Summary

This PR lands Phases 9+10 of gaia-coder: five well-scoped modules (OSS reuse + license gate, codebase-research subagent, repo binding + doctor, RAG freshness contract, GitHubToolsMixin) with 93 new tests and ~1750 lines of test code. Architecture is clean — every external boundary (subprocess, network, RAG backend, keyring) is funneled through a single injected callable, which is why the test suite can achieve this coverage without any real gh / network / keyring calls. All changes are additive on the coder branch (not main), so blast radius is bounded.

Most important finding: import_with_attribution has a path-traversal issue that becomes meaningful once the tool is exposed to an LLM via @tool. Flagged below; worth a follow-up commit before this mixin is wired into an agent for live use.


Issues Found

🔴 Critical

🔒 SECURITY CONCERN: Path traversal in import_with_attribution (src/gaia/coder/oss_reuse.py:412-414)

cc @kovtcharov-amd

The dest_path parameter is LLM-controlled (exposed via @tool) and joined into repo_root without traversal protection:

root = Path(repo_root) if repo_root else Path.cwd()
dest = root / dest_path
dest.parent.mkdir(parents=True, exist_ok=True)
...
dest.write_text(header + body, encoding="utf-8")

Verified locally:

  • Path("/tmp/repo") / "/etc/passwd"PosixPath("/etc/passwd") (absolute path override)
  • Path("/tmp/repo") / "../../etc/passwd" → not resolved against root

Under prompt injection or a hostile tool-call arg, the coder agent could write (and chmod/chown via file creation) anywhere its process has access. The mixin is currently gated behind OSSReuseMixin registration, so the exposure is limited today — but KNOWN_TOOLS now advertises it to YAML-manifest agents, so the guard needs to live in the tool, not the caller.

Suggested fix pattern — resolve and confirm containment before writing:

            root = Path(repo_root) if repo_root else Path.cwd()
            root_resolved = root.resolve()
            candidate = (root / dest_path).resolve()
            try:
                candidate.relative_to(root_resolved)
            except ValueError as e:
                raise AttributionError(
                    f"dest_path {dest_path!r} escapes repo_root {root_resolved}"
                    " — absolute paths and `..` traversal are forbidden"
                ) from e
            dest = candidate
            dest.parent.mkdir(parents=True, exist_ok=True)

Same shape applies to _append_notices_entry — the NOTICES.md path itself is fixed there, so that one's fine, but any future per-dest-path writes need the same guard.


🟡 Important

🟡 _check_webhook_signature_round_trip is effectively a no-op (src/gaia/coder/repo_binding.py:1466-1503)

The check signs a payload with secret and then verifies that same signature against the same secret. This can only fail if hmac/hashlib are broken — the check contributes nothing beyond what _check_private_key_decrypts and the empty-secret guard already provide. To actually validate "the secret we have is the one GitHub will send," the check needs either:

  1. A known-good signature fixture (signed offline from the GitHub App admin console), compared against verify_webhook_signature(secret, known_body, known_sig); or
  2. A round-trip through the GitHub App's test-delivery API (POST /app/hook/deliveries/{id}/attempts) verified against the live webhook receiver.

Not strictly blocking because doctor still catches the two other failure modes (missing slot, empty value), but the "signature round-trip" name suggests stronger guarantees than the code actually provides. Worth either strengthening or renaming to _check_webhook_secret_present.

🟡 _validate_license_filter silently drops non-permissive entries (src/gaia/coder/oss_reuse.py:450-461)

# Silently drop entries that aren't permissive — we never widen.
return requested & PERMISSIVE_LICENSES

A caller passing license_filter=["MIT", "MPL-2.0"] gets {"MIT"} back with no warning that MPL-2.0 was discarded — directly at odds with CLAUDE.md's "No Silent Fallbacks" rule, especially for licensing where silent narrowing can cause confusing "why is this hit filtered out" behavior. At minimum, log a warning; ideally raise when any unrecognized SPDX id is passed (the blocked-denylist check already raises, which is the right model).

    requested = frozenset(license_filter)
    bad = requested & BLOCKED_LICENSES
    if bad:
        raise LicenseIncompatibleError("<filter>", ",".join(sorted(bad)))
    unknown = requested - PERMISSIVE_LICENSES - BLOCKED_LICENSES
    if unknown:
        logger.warning(
            "license filter dropped unrecognized SPDX ids %s "
            "(neither permissive-allowlisted nor blocked)",
            sorted(unknown),
        )
    return requested & PERMISSIVE_LICENSES

🟡 gh_pr_merge hardcodes --admin (src/gaia/coder/tools/github.py:2448)

argv = ["pr", "merge", str(number), flag_map[method], "--admin"]

--admin bypasses branch protections (required reviews, required status checks, CODEOWNERS). The docstring notes this is sensitive per §4.3 and "callers must elevate", but elevation is enforced one layer up — which means anyone who reaches this tool with a merge call gets a protection-bypassing merge whether they asked for it or not. Given §5.7 rules out merging to main structurally, the scope is bounded, but a stray call targeting the wrong repo (via repo= override) on a protected branch would silently bypass protections. Prefer making --admin opt-in:

        @tool
        def gh_pr_merge(
            number: int,
            method: Literal["merge", "squash", "rebase"] = "squash",
            admin: bool = False,
            repo: Optional[str] = None,
        ) -> MergeResult:

…and only append "--admin" when admin=True. Let the trust-tier check decide when to pass it.

🟡 Fragile stdout parsing in URL extractors (src/gaia/coder/tools/github.py:2349, 2398, 2488, 2504, 2616)

Pattern repeats in five tools:

url = _run_gh(argv).strip().splitlines()[-1]

gh sometimes prints deprecation warnings, gist invitations, or ✓ Created pull request… preamble on stdout before the URL. Taking the last line is correct today but silently misparses on any future gh upgrade that adds a trailing hint. Safer: filter to the first line matching https://:

            raw = _run_gh(argv).strip()
            urls = [ln for ln in raw.splitlines() if ln.startswith("https://")]
            if not urls:
                raise GitHubCLIError(argv, 0, f"no URL in gh output: {raw!r}")
            url = urls[-1]

Apply the same fix in gh_pr_comment (2398), gh_issue_create (2488), gh_issue_comment (2504), gh_release_create (2616). One example ```suggestion above; author to replicate.


🟢 Minor

🟢 _is_remote_url matches https:// prefix anywhere (src/gaia/coder/subagents/codebase_research.py:1881)

Not a bug, but "https://" as a filename prefix (unusual but legal on POSIX) would be misclassified as a URL. Cheap fix: also check that urlparse(source).netloc is non-empty.

🟢 _looks_like_sha accepts short hex-only strings (src/gaia/coder/oss_reuse.py:541-546)

"abcdef0" (7 chars) passes, which matches git's default short-SHA width, but so would any hex-only branch name. In practice nobody names branches with 7+ hex chars, but tightening to >= 40 for full SHAs (with a force_short: bool = False escape hatch) makes the contract more explicit.

🟢 Docs / CLI wiring deferred

PR body explicitly defers gaia-coder rag {status|refresh|rebuild} CLI wiring and the docs/ updates to a unification follow-up. CLAUDE.md requires docs with every feature, but given this lands on coder (not main) and the deferral is stated upfront, it's reasonable. Make sure the follow-up tracks the docs/reference/cli.mdx + docs/docs.json + per-module SDK page updates together.


Strengths

  • Testability by construction. Every external boundary is an injected callable (keyring_getter, gh_runner, cloner, runner, provider, engine). The test suite stubs exactly one function per test instead of patching subprocess.run or urllib globally, which is the right seam. 93 new tests, covering happy-path + failure-path + registration check for every public tool.
  • Fail-loudly exceptions with actionable messages. StaleIndexError points at gaia-coder rag refresh. GitHubCLIMissingError points at cli.github.com. LicenseIncompatibleError names the repo and the SPDX id. RepoBindingError points at gaia-coder bind. Follows CLAUDE.md's "name three things — what failed, what to do, where to look" to a T.
  • try: … finally: shutil.rmtree(…) in research(). Scratch-workspace cleanup runs on budget-exception paths too (codebase_research.py:2029-2046), and there's an explicit test for that (test_research_cleans_up_even_on_budget_exception). Often missed; good catch.
  • Doctor aggregates failures instead of short-circuiting. test_doctor_aggregates_all_failures_not_short_circuit pins the behavior that the EM sees every broken invariant in one pass — matches the "one pass over all failures" ergonomic that §15.6 specifies.

Verdict

Approve with follow-up required. PR is already merged to coder (not main), so this review is advisory. The architecture, testing, and fail-loudly discipline are all in good shape. The path-traversal concern in import_with_attribution should be addressed before the mixin is attached to any live agent — it's the difference between "tool written defensively" and "tool handed to an LLM." The remaining 🟡 items are worth addressing in the follow-up PR that wires the CLI.

kovtcharov added a commit that referenced this pull request Apr 20, 2026
…ime) (#832)

## Summary

Six fixes flagged by the auto-review bot: one Critical (security), five
Important (two on #827, two on #828, one on both). All 395 tests pass on
`coder` with the fixes.

## Changes

**Critical (security):**
- `oss_reuse.py` `import_with_attribution` — path traversal on
LLM-controlled `dest_path`. Now resolves + `relative_to(root)`-checks;
raises `AttributionError` on escape.

**Important:**
- `oss_reuse.py` `_validate_license_filter` — unknown SPDX ids silently
dropped; now raises per CLAUDE.md fail-loudly.
- `tools/github.py` `gh_pr_merge` — hardcoded `--admin`; now gated
behind `admin_override=False` default.
- `repo_binding.py` webhook round-trip — only did positive check; added
wrong-signature + wrong-payload discrimination.
- `tools/debug.py` `add_instrumented_trace` — emitted
`logger.debug(...)` requiring pre-bound `logger`; now inlines
`__import__('logging')` lookup.
- `tools/debug.py` `diff_behavior` — `git switch -` after two detached
switches returns to wrong ref; now captures + explicitly restores
original HEAD.

## Test plan

- [x] `pytest tests/coder/ tests/eval/` — 395 pass
- [x] 7 new regression tests in `test_fixes_827_828.py` cover each fix
- [x] `test_add_instrumented_trace_*` now asserts the mutated module
actually imports (previously asserted only the string was written)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant