feat(parser): structural-extract eval harness (#934) #937
Merged
A/B harness for the language-aware fast path. Runs the parser pipeline twice against a fixed input set (once with `fastPath.languageAware.enabled: false` as the LLM baseline, once with it on) and reports LLM calls saved, fast-path hit count, and token deltas per file.

Why now: phase 1 + 2 of #883 (#927, #928) shipped regex extractors behind an opt-in flag. The unit tests verify the parser; nothing yet verifies the *resulting commit-message pipeline behavior* when the fast path fires. Without a mechanical eval we can't flip the flag on by default with any confidence, and we can't compare tree-sitter vs. regex when #933 lands.

The user's insight: the scenario library (#908) is already a deterministic git-state factory. Reusing it as the golden-set provider means every scenario's commits become an eval input, fresh and byte-identical each run, with zero new infrastructure for "where do the test commits come from".

Implementation:

- `src/lib/parsers/default/__evals__/structuralExtractEval.ts`: the harness. Mocks the LLM via dependency injection (fake `chain.invoke` passed through `summarizeLargeFiles`'s existing options), counts the calls, and classifies each file's outcome (`unchanged` / `trivial` / `markdown` / `languageAware` / `llm`) by inspecting the rewritten diff. Returns a typed `EvalReport` with per-run totals + pairwise deltas. Public API: `runStructuralExtractEval(diffs, configs)` and `renderEvalReportMarkdown(report, title)`.
- `src/lib/parsers/default/__evals__/scenarioInputs.ts`: the golden-set adapter. `buildScenarioFixtures(scenarioName)` spins up the scenario, walks its commit log, and calls `git show --numstat` + per-file `git show` to build `FileDiff[]` per commit. Returns the temp repo handle so the caller can clean up.
- `src/lib/parsers/default/__evals__/fixtures.ts`: hand-crafted modification diffs that target the language-aware path specifically. Scenarios mostly trigger the lossless trivial-shape path (pure additions) so they don't exercise the fast path; the fixtures fill that gap. One fixture per language so a regression in any single extractor surfaces in its own outcome row.
- `bin/structuralExtractEval.ts`: CLI driver. Writes per-input JSON + Markdown to `.bench/structural-extract-eval/<timestamp>/` and prints an aggregate summary table. Flags: `--scenario NAME`, `--fixtures-only`, `--no-fixtures`, `--languages ts,js`, `--out DIR`.
- `package.json`: new `eval:structural-extract` script.
- `.gitignore`: `.bench/structural-extract-eval/` excluded (output is local; regression-detection baseline is a follow-up).
- `src/lib/testUtils/README.md`: section describing the eval as a consumer of the scenarios + the extraction-boundary reaffirmation.
- `__evals__/README.md`: documents harness layout, input sources, CLI usage, and what's intentionally not here yet (committed baseline / regression check, real-LLM live mode, per-language pivot).

Sample run (full scenarios + fixtures): 64 input files, 9 LLM calls in baseline → 3 with languageAware on. 6 calls saved, 6 fast-path hits. All 5 language fixtures (TS×2, Python, Rust, Go) correctly trigger the fast path; body-only TS and markdown-only fixtures correctly fall through to the LLM.

Out of scope (each tracked separately):

- Committed baseline + CI regression check.
- Real-LLM live-mode harness (for "is the resulting message better" comparison, not just "did the LLM get called").
- Per-language breakdown pivot in the report.

Tests (9 new):

- `runStructuralExtractEval`: single-config no-deltas, baseline vs fast-paths-on with TS + markdown, languages allowlist honored, empty-config rejection.
- `renderEvalReportMarkdown`: includes / omits delta table based on run count.
- `buildScenarioFixtures`: rejects unknown scenarios, produces a fixture per commit with non-empty diffs + valid shas, output is byte-identical across runs (determinism check).

Validation:

- `npx tsc --noEmit` → 0 errors
- `npx jest` → 1597/1597 pass (9 new)
- `npx eslint` on touched code → clean
- Manual: `npm run eval:structural-extract` produces meaningful per-input reports + aggregate summary.

Closes the harness scaffolding portion of #934.
Summary
A/B harness for the language-aware fast path. Runs the parser pipeline twice against a fixed input set — once with `fastPath.languageAware.enabled: false` (LLM baseline), once with it on — and reports LLM calls saved, fast-path hit count, and token deltas per file.
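The A/B shape described above can be sketched roughly as follows. This is a minimal illustration, not the harness in this PR: `summarizeFiles`, `runEval`, `callsSaved`, and all the types are hypothetical stand-ins, and the "fast path" here is faked with a file-extension check, whereas the real pipeline decides eligibility from the diff content.

```typescript
// Hypothetical shapes for illustration; the real harness's types may differ.
type FileDiff = { path: string; diff: string };
type Invoke = (prompt: string) => Promise<string>;
type RunConfig = { name: string; languageAwareEnabled: boolean };
type RunTotals = { name: string; llmCalls: number; fastPathHits: number };

// Stand-in for the parser pipeline (assumption): files the pretend fast path
// can handle skip the LLM; everything else goes through the injected `invoke`.
async function summarizeFiles(
  diffs: FileDiff[],
  config: RunConfig,
  invoke: Invoke,
): Promise<number> {
  let fastPathHits = 0;
  for (const file of diffs) {
    const eligible =
      config.languageAwareEnabled && /\.(ts|py|rs|go)$/.test(file.path);
    if (eligible) {
      fastPathHits += 1; // handled structurally, no LLM call
    } else {
      await invoke(`Summarize:\n${file.diff}`);
    }
  }
  return fastPathHits;
}

// Run every config against the same inputs, counting calls through a fake LLM.
async function runEval(
  diffs: FileDiff[],
  configs: RunConfig[],
): Promise<RunTotals[]> {
  if (configs.length === 0) throw new Error("at least one config is required");
  const runs: RunTotals[] = [];
  for (const config of configs) {
    let llmCalls = 0;
    const fakeInvoke: Invoke = async () => {
      llmCalls += 1; // the counter is the whole point of the mock
      return "stub summary";
    };
    const fastPathHits = await summarizeFiles(diffs, config, fakeInvoke);
    runs.push({ name: config.name, llmCalls, fastPathHits });
  }
  return runs;
}

// Pairwise delta of each later run against the first run (the baseline).
function callsSaved(runs: RunTotals[]): number[] {
  const [baseline, ...rest] = runs;
  if (baseline === undefined) return [];
  return rest.map((run) => baseline.llmCalls - run.llmCalls);
}
```

Injecting the fake keeps the eval deterministic and free to run: no network, no model variance, and the only observable is whether the pipeline asked for the LLM at all.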
The scenario library from #908 is reused as the golden-set provider — every scenario's commits become deterministic eval inputs without inventing new infrastructure for "where do the test commits come from".
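The scenario adapter in this PR builds its per-commit file lists from `git show --numstat`. As a rough sketch of that step, the parser below turns numstat output into entries. `parseNumstat` and `NumstatEntry` are hypothetical names for illustration and are not taken from `scenarioInputs.ts`; rename entries (`old => new`) are left unhandled.

```typescript
// One numstat row is "<added>\t<deleted>\t<path>".
// Binary files report "-" for both counts.
type NumstatEntry = {
  path: string;
  added: number | null; // null for binary files
  deleted: number | null;
};

function parseNumstat(output: string): NumstatEntry[] {
  const entries: NumstatEntry[] = [];
  for (const line of output.split("\n")) {
    const match = /^(\d+|-)\t(\d+|-)\t(.+)$/.exec(line);
    // Commit header and indented message lines do not match and are skipped.
    if (!match) continue;
    const [, added, deleted, path] = match;
    entries.push({
      path,
      added: added === "-" ? null : Number(added),
      deleted: deleted === "-" ? null : Number(deleted),
    });
  }
  return entries;
}
```

Because the scenarios produce the same git state on every run, the parsed entries (and the diffs fetched per path afterwards) should be byte-identical across runs, which is what the determinism test checks.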
Key pieces
Sample output
Full run (scenarios + fixtures): 64 input files, 9 baseline LLM calls → 3 with languageAware on. 6 saved, 6 fast-path hits. All 5 language fixtures (TS×2, Python, Rust, Go) correctly trigger the fast path; markdown / body-only fall through to the LLM as expected.
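To make the report shape concrete, here is a small sketch of a Markdown renderer that emits a delta table only when there is more than one run, matching the behavior the tests describe for `renderEvalReportMarkdown`. `renderReport` and `RunSummary` are hypothetical; the real report has more columns (per-file outcomes, token deltas), and the baseline's fast-path hit count of 0 in the usage example is an assumption.

```typescript
type RunSummary = { name: string; llmCalls: number; fastPathHits: number };

function renderReport(runs: RunSummary[], title: string): string {
  const lines = [
    `# ${title}`,
    "",
    "| Run | LLM calls | Fast-path hits |",
    "| --- | ---: | ---: |",
  ];
  for (const run of runs) {
    lines.push(`| ${run.name} | ${run.llmCalls} | ${run.fastPathHits} |`);
  }
  // Deltas only make sense with a baseline plus at least one other run.
  const [baseline, ...others] = runs;
  if (baseline !== undefined && others.length > 0) {
    lines.push("", `| vs. ${baseline.name} | LLM calls saved |`, "| --- | ---: |");
    for (const run of others) {
      lines.push(`| ${run.name} | ${baseline.llmCalls - run.llmCalls} |`);
    }
  }
  return lines.join("\n");
}

// Usage with the totals from the sample run above.
const sampleReport = renderReport(
  [
    { name: "baseline", llmCalls: 9, fastPathHits: 0 },
    { name: "languageAware", llmCalls: 3, fastPathHits: 6 },
  ],
  "Structural-extract eval",
);
```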
Test plan
Out of scope (each tracked separately)
Closes the harness scaffolding portion of #934.