feat(parser): Python / Rust / Go structural extractors (#883 phase 2) by gfargo · Pull Request #928 · gfargo/coco

gfargo · 2026-05-13T14:38:22Z

Summary

- Three new regex-based extractors (Python `.py`/`.pyi`, Rust `.rs`, Go `.go`) built on a new shared `structuralDiff.ts` scaffold. The TS extractor is refactored onto the same scaffold; existing tests continue to pass.
- The `service.fastPath.languageAware.languages` allowlist expands from `'ts' | 'js'` to `'ts' | 'js' | 'py' | 'rs' | 'go'` (schema regenerated).
- Adding the next language is now two files (the per-language module + tests) plus a switch case in the dispatcher.
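To make the "two files plus a switch case" claim concrete, here is a minimal sketch of what a detect/dispatch pair could look like. Only the names `detectStructuralLanguageId` and `dispatchStructuralSummary` and the language IDs come from this PR; the signatures, the stub summarizers, the extension mapping for `ts`/`js`, and the routing of `js` through the TS summarizer are assumptions made for illustration.

```typescript
// Hypothetical sketch: signatures and bodies are illustrative, not the PR's code.
type StructuralLanguageId = 'ts' | 'js' | 'py' | 'rs' | 'go';

type Summarizer = (diff: string) => string | undefined;

// Stand-ins for the per-language modules; the real ones parse the diff.
const stub = (label: string): Summarizer => (diff) =>
  diff.length > 0 ? `${label}: structural summary` : undefined;

const summarizeTs = stub('ts');
const summarizePython = stub('py');
const summarizeRust = stub('rs');
const summarizeGo = stub('go');

function detectStructuralLanguageId(filePath: string): StructuralLanguageId | undefined {
  const dot = filePath.lastIndexOf('.');
  if (dot < 0) return undefined; // no extension, no structural language
  switch (filePath.slice(dot + 1).toLowerCase()) {
    case 'ts': return 'ts'; // assumed mapping
    case 'js': return 'js'; // assumed mapping
    case 'py':
    case 'pyi': return 'py';
    case 'rs': return 'rs';
    case 'go': return 'go';
    default: return undefined;
  }
}

function dispatchStructuralSummary(
  filePath: string,
  diff: string,
  allowlist: readonly StructuralLanguageId[],
): string | undefined {
  const id = detectStructuralLanguageId(filePath);
  // Unknown or non-allowlisted languages fall through (undefined) to the caller.
  if (id === undefined || !allowlist.includes(id)) return undefined;
  switch (id) {
    case 'ts':
    case 'js': return summarizeTs(diff); // assumption: JS reuses the TS summarizer
    case 'py': return summarizePython(diff);
    case 'rs': return summarizeRust(diff);
    case 'go': return summarizeGo(diff);
  }
}
```

Under this shape, a new language would add one member to the union, one extension case, and one dispatch case; the exhaustive `switch` lets the compiler flag a missing case.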

Phase 2 of #883. Tree-sitter integration and quality-eval scaffolding stay in the backlog — both need design discussion before they ship.

Test plan

- `npx tsc --noEmit` → 0 errors
- `npx jest` → 1568/1568 pass (49 new)
- `npx eslint` on touched files → clean
- `npm run build:schema` regenerated; `py` / `rs` / `go` appear in the schema enum.

Refs #883.

Extends the language-aware fast path to three more languages, mirroring the TS extractor from phase 1 (#927). All four languages now share a single rendering / bucketing scaffold, so adding the next language is two files plus a switch case.

Implementation:

- New `structuralDiff.ts`: shared scaffolding. Walks a unified diff, calls a per-language `parseLine` callback, buckets symbols into added / removed / signature-change, and formats the templated summary. Knows how to render method / impl / module / trait kinds in addition to the function / class / type / etc. set the TS phase needed.
- `tsStructuralDiff.ts`: refactored onto the shared scaffold. Preserved as the export for existing callers; existing tests continue to pass.
- `pythonStructuralDiff.ts`: recognizes module-level `def` / `async def` / `class` / PEP 695 `type` aliases / ALL_CAPS const assignments. The exported flag tracks the underscore-prefix convention (leading `_` → not exported). Decorators are skipped; the following def carries the structural signal.
- `rustStructuralDiff.ts`: recognizes pub-qualified fn / struct / enum / trait / impl (both `impl T` and `impl Trait for T`) / type aliases / pub const / pub static / mod declarations. Accepts up to 4 spaces of leading indent so rustfmt'd impl-block-method declarations get caught. `pub(crate)` / `pub(super)` / `pub(in path)` all count as exported.
- `goStructuralDiff.ts`: recognizes top-level func (incl. receivers, rendered as `Receiver.method`), type X struct / interface / aliases, single-line var / const. The exported flag tracks Go's capital-first-letter convention.
- `summarizeLargeFiles.ts`: gains a `detectStructuralLanguageId` + `dispatchStructuralSummary` pair so the runtime hot path stays flat.

The `languageAware.languages` allowlist expands from `'ts' | 'js'` to `'ts' | 'js' | 'py' | 'rs' | 'go'`; the type is mirrored through `LLMService.fastPath.languageAware`, `lib/types.ts`, the parser-options factory, and the regenerated JSON schema.
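The scaffold/callback split described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the contents of `structuralDiff.ts`: the `ParsedSymbol` shape, the `bucketStructuralDiff` name, the keying by kind and name, and the toy Python `parseLine` (which handles only `def` / `async def` / `class`) are all hypothetical.

```typescript
// Hypothetical sketch of a shared bucketing scaffold with a per-language callback.
type SymbolKind = 'function' | 'class';

interface ParsedSymbol {
  kind: SymbolKind;
  name: string;
  exported: boolean;
  signature: string;
}

type ParseLine = (line: string) => ParsedSymbol | undefined;

interface StructuralBuckets {
  added: ParsedSymbol[];
  removed: ParsedSymbol[];
  signatureChanged: ParsedSymbol[];
}

function bucketStructuralDiff(diff: string, parseLine: ParseLine): StructuralBuckets {
  const added = new Map<string, ParsedSymbol>();
  const removed = new Map<string, ParsedSymbol>();
  let inHunk = false;
  for (const raw of diff.split('\n')) {
    // File headers (---/+++) precede the first hunk, so ignore lines until "@@".
    if (raw.startsWith('diff --git')) { inHunk = false; continue; }
    if (raw.startsWith('@@')) { inHunk = true; continue; }
    if (!inHunk) continue;
    const sign = raw[0];
    if (sign !== '+' && sign !== '-') continue; // context line
    const symbol = parseLine(raw.slice(1));
    if (!symbol) continue;
    (sign === '+' ? added : removed).set(`${symbol.kind}:${symbol.name}`, symbol);
  }
  const buckets: StructuralBuckets = { added: [], removed: [], signatureChanged: [] };
  added.forEach((symbol, key) => {
    const before = removed.get(key);
    if (!before) buckets.added.push(symbol);
    else if (before.signature !== symbol.signature) buckets.signatureChanged.push(symbol);
    // Same signature on both sides: the declaration only moved, so it is ignored.
  });
  removed.forEach((symbol, key) => {
    if (!added.has(key)) buckets.removed.push(symbol);
  });
  return buckets;
}

// Toy Python callback. Anchoring at column 0 acts as the indent gate:
// nested defs and methods do not match.
const parsePythonLine: ParseLine = (line) => {
  const match =
    /^(?:async\s+)?(def)\s+([A-Za-z_]\w*)\s*\(/.exec(line) ??
    /^(class)\s+([A-Za-z_]\w*)/.exec(line);
  if (!match) return undefined;
  const name = match[2];
  return {
    kind: match[1] === 'def' ? 'function' : 'class',
    name,
    exported: !name.startsWith('_'), // underscore-prefix convention
    signature: line.trim(),
  };
};
```

A body-only edit produces no parsed symbols, so all three buckets stay empty; a caller can treat that as the signal to fall through to the LLM.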
Architectural note: still regex-first. Tree-sitter integration (scopes, receiver types, signature deltas) and the quality-eval scaffolding listed under #883 stay in the backlog because both need design discussion (parser packaging strategy, golden-set sourcing) before they ship.

Tests (49 new):

- structuralDiff shared rendering exercised via each language's summarizer.
- pythonStructuralDiff: def / async def / class / type alias / ALL_CAPS const, underscore-prefixed = not exported, decorator skip, indent gate, body-only-edit fallthrough, signature change.
- rustStructuralDiff: pub fn / pub async fn / pub const fn, visibility modifiers, struct / enum / trait, impl (plain + Trait-for-Type), type alias, ALL_CAPS const + static, mod declarations, indent gate.
- goStructuralDiff: top-level func, method receivers rendered as `Receiver.method`, struct / interface / type alias, single-line var / const, capitalization-based exported flag, indent gate.

Validation:

- npx tsc --noEmit → 0 errors
- npx jest → 1568/1568 pass (49 new)
- npx eslint on touched files → clean
- npm run build:schema regenerated; 'py' / 'rs' / 'go' now appear in the languageAware.languages enum.

Refs #883.
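As an illustration of the Go behaviors the tests above describe (receiver rendering, capitalization-based export flag, indent gate), here is a small regex-first sketch. The `GoSymbol` shape, the `parseGoLine` name, and the exact patterns are assumptions, not the code in `goStructuralDiff.ts`. Two simplifications are deliberate: the export check is ASCII-only, and a method counts as exported based on the method name alone.

```typescript
// Hypothetical sketch of a Go line parser; patterns are illustrative only.
interface GoSymbol {
  kind: 'function' | 'method' | 'struct' | 'interface' | 'type' | 'var' | 'const';
  name: string;
  exported: boolean;
}

// ASCII-only approximation of Go's "first letter is upper case" rule.
const isExported = (ident: string): boolean => /^[A-Z]/.test(ident);

function parseGoLine(line: string): GoSymbol | undefined {
  // All patterns anchor at column 0, so indented declarations are rejected.
  const fn =
    /^func\s+(?:\(\s*(?:\w+\s+)?\*?(\w+)(?:\[[^\]]*\])?\s*\)\s*)?(\w+)\s*[\[(]/.exec(line);
  if (fn) {
    const receiver = fn[1];
    const name = fn[2];
    return receiver
      ? { kind: 'method', name: `${receiver}.${name}`, exported: isExported(name) }
      : { kind: 'function', name, exported: isExported(name) };
  }

  // `type X struct`, `type X interface`, `type X = Y`, `type X Y`.
  // Grouped `type (` blocks do not match because `(` is not a word character.
  const ty = /^type\s+(\w+)(?:\[[^\]]*\])?\s+(?:=\s*)?(\w+)/.exec(line);
  if (ty) {
    const kind = ty[2] === 'struct' || ty[2] === 'interface' ? ty[2] : 'type';
    return { kind, name: ty[1], exported: isExported(ty[1]) };
  }

  // Single-line var / const only; `var (` and `const (` blocks are skipped.
  const decl = /^(var|const)\s+(\w+)\b/.exec(line);
  if (decl) {
    return { kind: decl[1] as 'var' | 'const', name: decl[2], exported: isExported(decl[2]) };
  }
  return undefined;
}
```

For example, `func (s *Server) Start(ctx context.Context) error {` parses to a `method` named `Server.Start` with `exported: true`, while the same line with a leading tab returns `undefined`.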

A/B harness for the language-aware fast path. Runs the parser pipeline twice against a fixed input set, once with `fastPath.languageAware.enabled: false` (LLM baseline) and once with it on, and reports LLM calls saved, fast-path hit count, and token deltas per file.

Why now: phase 1 + 2 of #883 (#927, #928) shipped regex extractors behind an opt-in flag. The unit tests verify the parser; nothing yet verifies the *resulting commit-message pipeline behavior* when the fast path fires. Without a mechanical eval we can't flip the flag on by default with any confidence, and we can't compare tree-sitter vs. regex when #933 lands.

The user's insight: the scenario library (#908) is already a deterministic git-state factory. Reusing it as the golden-set provider means every scenario's commits become an eval input, fresh and byte-identical each run, with zero new infrastructure for "where do the test commits come from".

Implementation:

- `src/lib/parsers/default/__evals__/structuralExtractEval.ts`: the harness. Mocks the LLM via dependency injection (fake `chain.invoke` passed through `summarizeLargeFiles`'s existing options), counts the calls, and classifies each file's outcome (`unchanged` / `trivial` / `markdown` / `languageAware` / `llm`) by inspecting the rewritten diff. Returns a typed `EvalReport` with per-run totals + pairwise deltas. Public API: `runStructuralExtractEval(diffs, configs)` and `renderEvalReportMarkdown(report, title)`.
- `src/lib/parsers/default/__evals__/scenarioInputs.ts`: the golden-set adapter. `buildScenarioFixtures(scenarioName)` spins up the scenario, walks its commit log, and calls `git show --numstat` + per-file `git show` to build `FileDiff[]` per commit. Returns the temp repo handle so the caller can clean up.
- `src/lib/parsers/default/__evals__/fixtures.ts`: hand-crafted modification diffs that target the language-aware path specifically. Scenarios mostly trigger the lossless trivial-shape path (pure additions) so they don't exercise the fast path; the fixtures fill that gap. One fixture per language so a regression in any single extractor surfaces in its own outcome row.
- `bin/structuralExtractEval.ts`: CLI driver. Writes per-input JSON + Markdown to `.bench/structural-extract-eval/<timestamp>/` and prints an aggregate summary table. Flags: `--scenario NAME`, `--fixtures-only`, `--no-fixtures`, `--languages ts,js`, `--out DIR`.
- `package.json`: new `eval:structural-extract` script.
- `.gitignore`: `.bench/structural-extract-eval/` excluded (output is local; regression-detection baseline is a follow-up).
- `src/lib/testUtils/README.md`: section describing the eval as a consumer of the scenarios + the extraction-boundary reaffirmation.
- `__evals__/README.md`: documents harness layout, input sources, CLI usage, and what's intentionally not here yet (committed baseline / regression check, real-LLM live mode, per-language pivot).

Sample run (full scenarios + fixtures): 64 input files, 9 LLM calls in baseline → 3 with languageAware on. 6 calls saved, 6 fast-path hits. All 5 language fixtures (TS×2, Python, Rust, Go) correctly trigger the fast path; body-only TS and markdown-only fixtures correctly fall through to the LLM.

Out of scope (each tracked separately):

- Committed baseline + CI regression check.
- Real-LLM live-mode harness (for "is the resulting message better" comparison, not just "did the LLM get called").
- Per-language breakdown pivot in the report.

Tests (9 new):

- runStructuralExtractEval: single-config no-deltas, baseline vs fast-paths-on with TS + markdown, languages allowlist honored, empty-config rejection.
- renderEvalReportMarkdown: includes / omits delta table based on run count.
- buildScenarioFixtures: rejects unknown scenarios, produces a fixture per commit with non-empty diffs + valid shas, output is byte-identical across runs (determinism check).
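The golden-set adapter is described as calling `git show --numstat` to enumerate each commit's files. How `scenarioInputs.ts` parses that output is not shown here, so the following is only a sketch of one way to do it. The `NumstatEntry` shape and `parseNumstat` name are hypothetical; the tab-separated `added<TAB>deleted<TAB>path` layout, with `-` in both count columns for binary files, is git's documented numstat format. Rename entries (`old => new`) are left unhandled.

```typescript
// Hypothetical helper: turns `git show --numstat` output into per-file entries.
interface NumstatEntry {
  path: string;
  added: number | null;   // null for binary files, which git reports as "-"
  deleted: number | null;
}

function parseNumstat(output: string): NumstatEntry[] {
  const entries: NumstatEntry[] = [];
  for (const line of output.split('\n')) {
    // Commit header lines (if present) do not match this shape and are skipped.
    const match = /^(\d+|-)\t(\d+|-)\t(.+)$/.exec(line);
    if (!match) continue;
    entries.push({
      path: match[3],
      added: match[1] === '-' ? null : Number(match[1]),
      deleted: match[2] === '-' ? null : Number(match[2]),
    });
  }
  return entries;
}
```

Each resulting path could then be passed to a per-file `git show` to collect the diff text for one `FileDiff`.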
Validation:

- npx tsc --noEmit → 0 errors
- npx jest → 1597/1597 pass (9 new)
- npx eslint on touched code → clean
- Manual: `npm run eval:structural-extract` produces meaningful per-input reports + aggregate summary.

Closes the harness scaffolding portion of #934.
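The core A/B idea (inject a counting fake in place of the LLM, run each config over the same inputs, diff the totals against the first run) can be sketched in a few lines. Everything below is a toy under stated assumptions: `runAbEval`, `toyPipeline`, and the type shapes are hypothetical, the fast-path predicate is a stand-in for the real extractors, and the fake is synchronous for brevity whereas a real `chain.invoke` would be asynchronous. It is not the signature of `runStructuralExtractEval`.

```typescript
// Hypothetical A/B sketch: a counting fake LLM plus pairwise deltas vs. the first run.
type Outcome = 'languageAware' | 'llm';

interface FileDiff { file: string; diff: string }
interface RunConfig { label: string; languageAwareEnabled: boolean }
interface RunTotals { label: string; llmCalls: number; fastPathHits: number }
interface RunDelta { label: string; llmCallsSaved: number }

type Invoke = (prompt: string) => string;

// Toy pipeline: the fast path fires when the flag is on and a changed line
// starts with a declaration keyword; otherwise the (fake) LLM is called.
function toyPipeline(input: FileDiff, config: RunConfig, invoke: Invoke): Outcome {
  const structural = /^[+-](?:export |def |pub |func )/m.test(input.diff);
  if (config.languageAwareEnabled && structural) return 'languageAware';
  invoke(input.diff);
  return 'llm';
}

function runAbEval(
  diffs: FileDiff[],
  configs: RunConfig[],
): { runs: RunTotals[]; deltas: RunDelta[] } {
  if (configs.length === 0) throw new Error('at least one config is required');
  const runs = configs.map((config): RunTotals => {
    let llmCalls = 0;
    let fastPathHits = 0;
    // Dependency-injected fake: counts calls instead of contacting a model.
    const fakeInvoke: Invoke = () => { llmCalls += 1; return 'stub summary'; };
    for (const input of diffs) {
      if (toyPipeline(input, config, fakeInvoke) === 'languageAware') fastPathHits += 1;
    }
    return { label: config.label, llmCalls, fastPathHits };
  });
  // Deltas are relative to the first config, treated as the baseline.
  const baseline = runs[0];
  const deltas = runs.slice(1).map((run) => ({
    label: run.label,
    llmCallsSaved: baseline.llmCalls - run.llmCalls,
  }));
  return { runs, deltas };
}
```

Because the fake is created per run, call counts cannot leak between configs, and a single-config run yields an empty `deltas` array.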

gfargo merged commit 9237207 into main May 13, 2026
1 check was pending

gfargo deleted the feat/structural-extractors-phase2 branch May 13, 2026 14:38

This was referenced May 13, 2026

Language-aware diff summaries for richer commit messages #883

Closed

Tree-sitter-backed structural diff extractors #933

Closed

Structural-extract quality eval scaffolding #934

Closed

gfargo merged 1 commit into main from feat/structural-extractors-phase2
