feat(parser): Python / Rust / Go structural extractors (#883 phase 2) #928

Merged
gfargo merged 1 commit into main from feat/structural-extractors-phase2 on May 13, 2026
@gfargo gfargo commented May 13, 2026

Summary

  • Three new regex-based extractors — Python (`.py`/`.pyi`), Rust (`.rs`), Go (`.go`) — built on a new shared `structuralDiff.ts` scaffold. The TS extractor is refactored onto the same scaffold; existing tests continue to pass.
  • `service.fastPath.languageAware.languages` allowlist expands from `'ts' | 'js'` to `'ts' | 'js' | 'py' | 'rs' | 'go'` (schema regenerated).
  • Adding the next language is now two files (the per-language module + tests) plus a switch case in the dispatcher.

Phase 2 of #883. Tree-sitter integration and quality-eval scaffolding stay in the backlog — both need design discussion before they ship.

Test plan

  • `npx tsc --noEmit` → 0 errors
  • `npx jest` → 1568/1568 pass (49 new)
  • `npx eslint` on touched files → clean
  • `npm run build:schema` regenerated; `py` / `rs` / `go` appear in the schema enum.

Refs #883.

Extends the language-aware fast path to three more languages,
mirroring the TS extractor from phase 1 (#927). All four
languages now share a single rendering / bucketing scaffold so
adding the next language is two files plus a switch case.
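
The "two files plus a switch case" shape might look roughly like the sketch below. The names `detectStructuralLanguageId` and `dispatchStructuralSummary` appear in this PR, but their signatures, the extension map, and the stub summarizers here are assumptions made for illustration, not the merged code.

```typescript
// Hypothetical sketch: signatures, the extension map, and the stubs are assumptions.
type StructuralLanguageId = 'ts' | 'js' | 'py' | 'rs' | 'go';

// Assumed mapping from file extension to language id.
const EXTENSION_TO_LANGUAGE: Record<string, StructuralLanguageId> = {
  ts: 'ts', js: 'js', py: 'py', pyi: 'py', rs: 'rs', go: 'go',
};

// Returns a language id only when the extension is known AND allowlisted.
function detectStructuralLanguageId(
  filePath: string,
  allowlist: readonly StructuralLanguageId[],
): StructuralLanguageId | undefined {
  const dot = filePath.lastIndexOf('.');
  if (dot < 0) return undefined;
  const id = EXTENSION_TO_LANGUAGE[filePath.slice(dot + 1).toLowerCase()];
  return id !== undefined && allowlist.includes(id) ? id : undefined;
}

type Summarizer = (diff: string) => string | undefined;

// Stand-in for a per-language module; a real one would parse the diff.
const stubSummarizer = (label: string): Summarizer => (diff) =>
  diff.trim() === '' ? undefined : `[${label}] structural summary`;

// Supporting another language means one more case here plus its module and tests.
function dispatchStructuralSummary(id: StructuralLanguageId, diff: string): string | undefined {
  switch (id) {
    case 'ts':
    case 'js':
      return stubSummarizer('ts')(diff);
    case 'py':
      return stubSummarizer('py')(diff);
    case 'rs':
      return stubSummarizer('rs')(diff);
    case 'go':
      return stubSummarizer('go')(diff);
  }
}
```

Keeping detection and dispatch as two small functions means the caller can bail out early for files that are unknown or not allowlisted, before any per-language work happens.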

Implementation:

- New `structuralDiff.ts` — shared scaffolding: walks a unified
  diff, calls a per-language `parseLine` callback, buckets symbols
  into added / removed / signature-change, formats the templated
  summary. Knows how to render method / impl / module / trait
  kinds in addition to the function / class / type / etc. set the
  TS phase needed.
- `tsStructuralDiff.ts` refactored onto the shared scaffold.
  Its existing export is preserved for current callers, and the
  existing tests continue to pass.
- `pythonStructuralDiff.ts` — recognizes module-level `def` /
  `async def` / `class` / PEP 695 `type` aliases / ALL_CAPS const
  assignments. Exported flag tracks underscore-prefix convention
  (leading `_` → not exported). Decorators are skipped — the
  following def carries the structural signal.
- `rustStructuralDiff.ts` — recognizes pub-qualified fn / struct /
  enum / trait / impl (both `impl T` and `impl Trait for T`) /
  type aliases / pub const / pub static / mod declarations.
  Accepts up to 4 spaces of leading indent so rustfmt'd
  impl-block-method declarations get caught. `pub(crate)` /
  `pub(super)` / `pub(in path)` all count as exported.
- `goStructuralDiff.ts` — recognizes top-level func (incl.
  receivers, rendered as `Receiver.method`), type X struct /
  interface / aliases, single-line var / const. Exported tracks
  Go's capital-first-letter convention.
- `summarizeLargeFiles.ts` — gains a `detectStructuralLanguageId`
  + `dispatchStructuralSummary` pair so the runtime hot path stays
  flat. The `languageAware.languages` allowlist expands from
  `'ts' | 'js'` to `'ts' | 'js' | 'py' | 'rs' | 'go'`; the type is
  mirrored through `LLMService.fastPath.languageAware`,
  `lib/types.ts`, the parser-options factory, and the regenerated
  JSON schema.
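
As a rough illustration of the scaffold/callback split described in the list above, the sketch below walks a unified diff, hands each changed line to a per-language `parseLine`, and buckets the results. Only the `parseLine` name comes from this PR. The types, the bucketing key, and the Python regexes are assumptions, and the PEP 695 `type` alias rule is left out for brevity.

```typescript
// Hypothetical sketch: only `parseLine` is named in the PR; every type,
// field, and regex below is an assumption for illustration.
interface ParsedSymbol {
  kind: 'function' | 'class' | 'const';
  name: string;
  exported: boolean;
  signature: string;
}

type ParseLine = (line: string) => ParsedSymbol | undefined;

interface Buckets {
  added: ParsedSymbol[];
  removed: ParsedSymbol[];
  signatureChanged: ParsedSymbol[];
}

function bucketSymbols(unifiedDiff: string, parseLine: ParseLine): Buckets {
  const plus = new Map<string, ParsedSymbol>();
  const minus = new Map<string, ParsedSymbol>();
  for (const raw of unifiedDiff.split('\n')) {
    // Simplification: treat any '+++' / '---' line as a file header.
    if (raw.startsWith('+++') || raw.startsWith('---')) continue;
    const side = raw.startsWith('+') ? plus : raw.startsWith('-') ? minus : undefined;
    if (side === undefined) continue; // context lines and hunk headers
    const symbol = parseLine(raw.slice(1));
    if (symbol !== undefined) side.set(`${symbol.kind}:${symbol.name}`, symbol);
  }
  const buckets: Buckets = { added: [], removed: [], signatureChanged: [] };
  plus.forEach((symbol, key) => {
    const before = minus.get(key);
    if (before === undefined) buckets.added.push(symbol);
    else if (before.signature !== symbol.signature) buckets.signatureChanged.push(symbol);
  });
  minus.forEach((symbol, key) => {
    if (!plus.has(key)) buckets.removed.push(symbol);
  });
  return buckets;
}

// Module-level only: the patterns anchor at column 0, so indented lines and
// decorator lines produce no symbol.
const parsePythonLine: ParseLine = (line) => {
  const isExported = (name: string): boolean => !name.startsWith('_');
  const def = /^(?:async\s+)?def\s+([A-Za-z_]\w*)\s*(\(.*)$/.exec(line);
  if (def) {
    return { kind: 'function', name: def[1], exported: isExported(def[1]), signature: def[2].trim() };
  }
  const cls = /^class\s+([A-Za-z_]\w*)/.exec(line);
  if (cls) {
    return { kind: 'class', name: cls[1], exported: isExported(cls[1]), signature: line.trim() };
  }
  const konst = /^([A-Z][A-Z0-9_]*)\s*(?::[^=]+)?=(?!=)/.exec(line);
  if (konst) {
    return { kind: 'const', name: konst[1], exported: true, signature: line.trim() };
  }
  return undefined;
};
```

A symbol that appears on both the `-` and `+` side with the same signature lands in no bucket. That is one plausible reading of the body-only-edit fallthrough mentioned in the tests.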

Architectural note: still regex-first. Tree-sitter integration
(scopes, receiver types, signature deltas) and the
quality-eval scaffolding listed under #883 stay in the backlog
because both need design discussion (parser packaging strategy,
golden-set sourcing) before they ship.

Tests (49 new):
- structuralDiff shared rendering exercised via each language's
  summarizer.
- pythonStructuralDiff: def / async def / class / type alias /
  ALL_CAPS const, underscore-prefixed = not exported, decorator
  skip, indent gate, body-only-edit fallthrough, signature change.
- rustStructuralDiff: pub fn / pub async fn / pub const fn,
  visibility modifiers, struct / enum / trait, impl (plain +
  Trait-for-Type), type alias, ALL_CAPS const + static, mod
  declarations, indent gate.
- goStructuralDiff: top-level func, method receivers rendered
  as `Receiver.method`, struct / interface / type alias,
  single-line var / const, capitalization-based exported flag,
  indent gate.
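
For a concrete picture of the Go cases in the list above (receiver rendering, capitalization-based export, indent gate), here is a minimal line parser. The regexes and the `GoSymbol` shape are assumptions, not the patterns in `goStructuralDiff.ts`. Basing the exported flag on the method name rather than the receiver type is also an assumption.

```typescript
// Hypothetical sketch: regexes and the GoSymbol shape are assumptions.
interface GoSymbol {
  kind: 'function' | 'method' | 'type';
  name: string;
  exported: boolean;
}

// ASCII-only approximation of Go's rule that exported identifiers start
// with an upper-case letter.
const startsUpper = (name: string): boolean => /^[A-Z]/.test(name);

function parseGoLine(line: string): GoSymbol | undefined {
  // Indent gate: anything indented is treated as nested, not top-level.
  if (/^\s/.test(line)) return undefined;

  // `func (s *Server) Start(...)` yields receiver "Server" and method "Start".
  const method = /^func\s+\(\s*(?:\w+\s+)?\*?(\w+)[^)]*\)\s*(\w+)\s*[(\[]/.exec(line);
  if (method) {
    return { kind: 'method', name: `${method[1]}.${method[2]}`, exported: startsUpper(method[2]) };
  }

  // Plain or generic top-level function: `func Name(` or `func Name[`.
  const func = /^func\s+(\w+)\s*[(\[]/.exec(line);
  if (func) {
    return { kind: 'function', name: func[1], exported: startsUpper(func[1]) };
  }

  // `type X struct {`, `type X interface {`, `type X = Y`, `type X Y`.
  const typeDecl = /^type\s+(\w+)\s/.exec(line);
  if (typeDecl) {
    return { kind: 'type', name: typeDecl[1], exported: startsUpper(typeDecl[1]) };
  }
  return undefined;
}
```

Trying the method pattern before the plain-function pattern matters: only the first one allows a parenthesised receiver between `func` and the name.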

Validation:
- npx tsc --noEmit → 0 errors
- npx jest → 1568/1568 pass (49 new)
- npx eslint on touched files → clean
- npm run build:schema regenerated — 'py' / 'rs' / 'go' now
  appear in the languageAware.languages enum.

Refs #883.
@gfargo gfargo merged commit 9237207 into main May 13, 2026
1 check was pending
@gfargo gfargo deleted the feat/structural-extractors-phase2 branch May 13, 2026 14:38
gfargo added a commit that referenced this pull request May 13, 2026
A/B harness for the language-aware fast path. Runs the parser
pipeline twice against a fixed input set — once with
`fastPath.languageAware.enabled: false` (LLM baseline), once with
it on — and reports LLM calls saved, fast-path hit count, and
token deltas per file.

Why now: phase 1 + 2 of #883 (#927, #928) shipped regex extractors
behind an opt-in flag. The unit tests verify the parser; nothing
yet verifies the *resulting commit-message pipeline behavior* when
the fast path fires. Without a mechanical eval we can't flip the
flag on by default with any confidence, and we can't compare
tree-sitter vs. regex when #933 lands.

The user's insight: the scenario library (#908) is already a
deterministic git-state factory. Reusing it as the golden-set
provider means every scenario's commits become an eval input,
fresh and byte-identical each run, with zero new infrastructure
for "where do the test commits come from".

Implementation:

- `src/lib/parsers/default/__evals__/structuralExtractEval.ts` —
  the harness. Mocks the LLM via dependency injection (fake
  `chain.invoke` passed through `summarizeLargeFiles`'s existing
  options), counts the calls, classifies each file's outcome
  (`unchanged` / `trivial` / `markdown` / `languageAware` / `llm`)
  by inspecting the rewritten diff. Returns a typed `EvalReport`
  with per-run totals + pairwise deltas. Public API:
  `runStructuralExtractEval(diffs, configs)` and
  `renderEvalReportMarkdown(report, title)`.
- `src/lib/parsers/default/__evals__/scenarioInputs.ts` — the
  golden-set adapter. `buildScenarioFixtures(scenarioName)` spins
  up the scenario, walks its commit log, calls `git show --numstat`
  + per-file `git show` to build `FileDiff[]` per commit. Returns
  the temp repo handle so the caller can clean up.
- `src/lib/parsers/default/__evals__/fixtures.ts` — hand-crafted
  modification diffs that target the language-aware path
  specifically. Scenarios mostly trigger the lossless
  trivial-shape path (pure additions), so they don't exercise the
  fast path; the fixtures fill that gap. One fixture per language so a
  regression in any single extractor surfaces in its own outcome
  row.
- `bin/structuralExtractEval.ts` — CLI driver. Writes per-input
  JSON + Markdown to `.bench/structural-extract-eval/<timestamp>/`
  and prints an aggregate summary table. Flags: `--scenario NAME`,
  `--fixtures-only`, `--no-fixtures`, `--languages ts,js`,
  `--out DIR`.
- `package.json`: new `eval:structural-extract` script.
- `.gitignore`: `.bench/structural-extract-eval/` excluded (output
  is local; regression-detection baseline is a follow-up).
- `src/lib/testUtils/README.md`: new section describing the eval
  as a consumer of the scenarios and reaffirming the extraction
  boundary.
- `__evals__/README.md`: documents harness layout, input sources,
  CLI usage, and what's intentionally not here yet (committed
  baseline / regression check, real-LLM live mode, per-language
  pivot).
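
The A/B idea behind the harness (inject a fake LLM, count its calls per config, diff the totals against the first run) can be sketched as follows. This is a simplified stand-in, not the harness itself: `fakePipeline`, the `structural` flag on inputs, and the report shape are all hypothetical. The real code routes its fake `chain.invoke` through `summarizeLargeFiles` and classifies outcomes by inspecting the rewritten diff.

```typescript
// Hypothetical sketch: `fakePipeline`, the config shape, and the report shape
// are assumptions made for illustration.
type Invoke = (prompt: string) => Promise<string>;

interface EvalInput { path: string; structural: boolean }
interface EvalConfig { label: string; languageAwareEnabled: boolean }
interface RunTotals { label: string; llmCalls: number; fastPathHits: number }
interface EvalReport { runs: RunTotals[]; llmCallsSaved: number[] }

// Stand-in pipeline: a file takes the fast path when the flag is on and the
// input is marked structural; otherwise it falls through to the injected LLM.
async function fakePipeline(inputs: EvalInput[], config: EvalConfig, invoke: Invoke): Promise<number> {
  let fastPathHits = 0;
  for (const input of inputs) {
    if (config.languageAwareEnabled && input.structural) fastPathHits += 1;
    else await invoke(`Summarize the diff for ${input.path}`);
  }
  return fastPathHits;
}

async function runEval(inputs: EvalInput[], configs: EvalConfig[]): Promise<EvalReport> {
  if (configs.length === 0) throw new Error('at least one config is required');
  const runs: RunTotals[] = [];
  for (const config of configs) {
    // A fresh counting fake per run, so totals never leak between configs.
    let llmCalls = 0;
    const invoke: Invoke = async () => {
      llmCalls += 1;
      return 'mock summary';
    };
    const fastPathHits = await fakePipeline(inputs, config, invoke);
    runs.push({ label: config.label, llmCalls, fastPathHits });
  }
  // The first config is the baseline; every later run is compared against it.
  const baseline = runs[0];
  const llmCallsSaved = runs.slice(1).map((run) => baseline.llmCalls - run.llmCalls);
  return { runs, llmCallsSaved };
}
```

Because the fake never touches a network, two runs over the same inputs are directly comparable, which is what makes a "calls saved" delta meaningful.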

Sample run (full scenarios + fixtures): 64 input files, 9 LLM
calls in baseline → 3 with languageAware on. 6 calls saved,
6 fast-path hits. All 5 language fixtures (TS×2, Python, Rust, Go)
correctly trigger the fast path; body-only TS and markdown-only
fixtures correctly fall through to the LLM.

Out of scope (each tracked separately):
- Committed baseline + CI regression check.
- Real-LLM live-mode harness (for "is the resulting message
  better" comparison, not just "did the LLM get called").
- Per-language breakdown pivot in the report.

Tests (9 new):
- runStructuralExtractEval: single-config no-deltas, baseline vs
  fast-paths-on with TS + markdown, languages allowlist honored,
  empty-config rejection.
- renderEvalReportMarkdown: includes / omits delta table based on
  run count.
- buildScenarioFixtures: rejects unknown scenarios, produces a
  fixture per commit with non-empty diffs + valid shas, output is
  byte-identical across runs (determinism check).

Validation:
- npx tsc --noEmit → 0 errors
- npx jest → 1597/1597 pass (9 new)
- npx eslint on touched code → clean
- Manual: `npm run eval:structural-extract` produces meaningful
  per-input reports + aggregate summary.

Closes the harness scaffolding portion of #934.