feat(parser): structural-extract eval harness (#934) by gfargo · Pull Request #937 · gfargo/coco

gfargo · 2026-05-13T20:51:57Z

Summary

A/B harness for the language-aware fast path. Runs the parser pipeline twice against a fixed input set — once with `fastPath.languageAware.enabled: false` (LLM baseline), once with it on — and reports LLM calls saved, fast-path hit count, and token deltas per file.

The scenario library from #908 is reused as the golden-set provider — every scenario's commits become deterministic eval inputs without inventing new infrastructure for "where do the test commits come from".
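The determinism claim above can be checked mechanically by building the same scenario twice and comparing the serialized fixtures. The following is a minimal sketch, not the PR's actual test: the `FileDiff` and `CommitFixture` shapes, `serializeFixtures`, `firstDifference`, and the `buildOnce` stand-in are all hypothetical names introduced for illustration.

```typescript
// Hypothetical shapes for illustration; the real types in coco may differ.
type FileDiff = { file: string; diff: string }
type CommitFixture = { sha: string; diffs: FileDiff[] }

// Serialize with an explicit field order so object key order cannot affect the comparison.
function serializeFixtures(fixtures: CommitFixture[]): string {
  return JSON.stringify(
    fixtures.map((f) => ({
      sha: f.sha,
      diffs: f.diffs.map((d) => ({ file: d.file, diff: d.diff })),
    })),
  )
}

// Returns the index of the first differing character, or -1 if the strings match exactly.
function firstDifference(a: string, b: string): number {
  const limit = Math.min(a.length, b.length)
  for (let i = 0; i < limit; i++) {
    if (a[i] !== b[i]) return i
  }
  return a.length === b.length ? -1 : limit
}

// Stand-in for building the same scenario twice; a real check would call the scenario adapter.
const buildOnce = (): CommitFixture[] => [
  { sha: 'a'.repeat(40), diffs: [{ file: 'src/a.ts', diff: '+export const a = 1\n' }] },
  { sha: 'b'.repeat(40), diffs: [{ file: 'README.md', diff: '+# Title\n' }] },
]

const drift = firstDifference(serializeFixtures(buildOnce()), serializeFixtures(buildOnce()))
console.log(drift === -1 ? 'identical across runs' : `first difference at index ${drift}`)
```

Reporting the first differing index, rather than a bare boolean, makes it easier to spot which field (a sha, a timestamp embedded in a diff) introduced the drift.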

Key pieces

- `runStructuralExtractEval(diffs, configs)`: typed harness with pairwise deltas vs. the first config.
- `buildScenarioFixtures(scenarioName)`: scenario → `FileDiff[]` adapter that walks each scenario's commit log.
- Hand-crafted `fixtures.ts`: modification diffs targeting the language-aware path specifically (scenarios mostly trigger the lossless trivial-shape skip).
- `bin/structuralExtractEval.ts` + `npm run eval:structural-extract`: CLI that writes per-input JSON/Markdown reports and prints an aggregate summary.
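To make the "pairwise deltas vs. the first config" idea concrete, here is a toy sketch of an A/B run with an injected fake LLM that only counts calls. It is not the harness from this PR: `runEval`, `summarize`, `EvalConfig`, and the rule that `.ts` files take the fast path are assumptions made up for the example, and the real `runStructuralExtractEval` signature and report shape may differ.

```typescript
// Hypothetical shapes for illustration only; coco's real types may differ.
type FileDiff = { file: string; diff: string }
type EvalConfig = { label: string; languageAwareEnabled: boolean }
type RunTotals = { label: string; llmCalls: number; fastPathHits: number }
type Delta = { label: string; llmCallsSaved: number; fastPathHitsGained: number }
type Llm = (prompt: string) => Promise<string>

// Toy stand-in for the parser pipeline: with the flag on, .ts files skip the LLM;
// every other file is sent to the injected LLM.
async function summarize(diffs: FileDiff[], config: EvalConfig, llm: Llm): Promise<number> {
  let fastPathHits = 0
  for (const d of diffs) {
    if (config.languageAwareEnabled && d.file.endsWith('.ts')) {
      fastPathHits++
    } else {
      await llm(d.diff)
    }
  }
  return fastPathHits
}

async function runEval(diffs: FileDiff[], configs: EvalConfig[]) {
  const runs: RunTotals[] = []
  for (const config of configs) {
    let llmCalls = 0
    // The fake LLM never leaves the process; it only counts invocations.
    const fakeLlm: Llm = async () => {
      llmCalls++
      return 'stub summary'
    }
    const fastPathHits = await summarize(diffs, config, fakeLlm)
    runs.push({ label: config.label, llmCalls, fastPathHits })
  }
  const [baseline, ...rest] = runs
  if (!baseline) throw new Error('at least one config is required')
  // Every later run is compared against the first config, which acts as the baseline.
  const deltas: Delta[] = rest.map((r) => ({
    label: r.label,
    llmCallsSaved: baseline.llmCalls - r.llmCalls,
    fastPathHitsGained: r.fastPathHits - baseline.fastPathHits,
  }))
  return { runs, deltas }
}

const sampleDiffs: FileDiff[] = [
  { file: 'src/a.ts', diff: '-old\n+new\n' },
  { file: 'docs/guide.md', diff: '+text\n' },
  { file: 'src/b.ts', diff: '-x\n+y\n' },
]

runEval(sampleDiffs, [
  { label: 'baseline', languageAwareEnabled: false },
  { label: 'languageAware', languageAwareEnabled: true },
]).then((report) => console.log(report.deltas))
```

Because the LLM is passed in rather than imported, the same pipeline code runs in both configurations and the call count is the only thing that varies.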

Sample output

Full run (scenarios + fixtures): 64 input files, 9 baseline LLM calls → 3 with languageAware on. 6 saved, 6 fast-path hits. All 5 language fixtures (TS×2, Python, Rust, Go) correctly trigger the fast path; markdown / body-only fall through to the LLM as expected.

Test plan

- `npx tsc --noEmit` → 0 errors
- `npx jest` → 1597/1597 pass (9 new)
- `npx eslint` on touched code → clean
- Manual: `npm run eval:structural-extract` produces meaningful per-input reports + aggregate summary

Out of scope (each tracked separately)

- Committed baseline + CI regression check
- Real-LLM live-mode harness
- Per-language breakdown pivot in the report

Closes the harness scaffolding portion of #934.

A/B harness for the language-aware fast path. Runs the parser pipeline twice against a fixed input set (once with `fastPath.languageAware.enabled: false` as the LLM baseline, once with it on) and reports LLM calls saved, fast-path hit count, and token deltas per file.

Why now: phase 1 + 2 of #883 (#927, #928) shipped regex extractors behind an opt-in flag. The unit tests verify the parser; nothing yet verifies the *resulting commit-message pipeline behavior* when the fast path fires. Without a mechanical eval we can't flip the flag on by default with any confidence, and we can't compare tree-sitter vs. regex when #933 lands.

The user's insight: the scenario library (#908) is already a deterministic git-state factory. Reusing it as the golden-set provider means every scenario's commits become an eval input, fresh and byte-identical each run, with zero new infrastructure for "where do the test commits come from".

Implementation:

- `src/lib/parsers/default/__evals__/structuralExtractEval.ts`: the harness. Mocks the LLM via dependency injection (fake `chain.invoke` passed through `summarizeLargeFiles`'s existing options), counts the calls, classifies each file's outcome (`unchanged` / `trivial` / `markdown` / `languageAware` / `llm`) by inspecting the rewritten diff. Returns a typed `EvalReport` with per-run totals + pairwise deltas. Public API: `runStructuralExtractEval(diffs, configs)` and `renderEvalReportMarkdown(report, title)`.
- `src/lib/parsers/default/__evals__/scenarioInputs.ts`: the golden-set adapter. `buildScenarioFixtures(scenarioName)` spins up the scenario, walks its commit log, calls `git show --numstat` + per-file `git show` to build `FileDiff[]` per commit. Returns the temp repo handle so the caller can clean up.
- `src/lib/parsers/default/__evals__/fixtures.ts`: hand-crafted modification diffs that target the language-aware path specifically. Scenarios mostly trigger the lossless trivial-shape path (pure additions) so they don't exercise the fast path; the fixtures fill that gap. One fixture per language so a regression in any single extractor surfaces in its own outcome row.
- `bin/structuralExtractEval.ts`: CLI driver. Writes per-input JSON + Markdown to `.bench/structural-extract-eval/<timestamp>/` and prints an aggregate summary table. Flags: `--scenario NAME`, `--fixtures-only`, `--no-fixtures`, `--languages ts,js`, `--out DIR`.
- `package.json`: new `eval:structural-extract` script.
- `.gitignore`: `.bench/structural-extract-eval/` excluded (output is local; regression-detection baseline is a follow-up).
- `src/lib/testUtils/README.md`: section describing the eval as a consumer of the scenarios + the extraction-boundary reaffirmation.
- `__evals__/README.md`: documents harness layout, input sources, CLI usage, and what's intentionally not here yet (committed baseline / regression check, real-LLM live mode, per-language pivot).

Sample run (full scenarios + fixtures): 64 input files, 9 LLM calls in baseline → 3 with languageAware on. 6 calls saved, 6 fast-path hits. All 5 language fixtures (TS×2, Python, Rust, Go) correctly trigger the fast path; body-only TS and markdown-only fixtures correctly fall through to the LLM.

Out of scope (each tracked separately):

- Committed baseline + CI regression check.
- Real-LLM live-mode harness (for "is the resulting message better" comparison, not just "did the LLM get called").
- Per-language breakdown pivot in the report.

Tests (9 new):

- runStructuralExtractEval: single-config no-deltas, baseline vs fast-paths-on with TS + markdown, languages allowlist honored, empty-config rejection.
- renderEvalReportMarkdown: includes / omits delta table based on run count.
- buildScenarioFixtures: rejects unknown scenarios, produces a fixture per commit with non-empty diffs + valid shas, output is byte-identical across runs (determinism check).

Validation:

- `npx tsc --noEmit` → 0 errors
- `npx jest` → 1597/1597 pass (9 new)
- `npx eslint` on touched code → clean
- Manual: `npm run eval:structural-extract` produces meaningful per-input reports + aggregate summary.

Closes the harness scaffolding portion of #934.
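The commit message mentions that the scenario adapter calls `git show --numstat` to enumerate each commit's files. As a rough illustration of that step only, the sketch below parses numstat text into per-file entries, assuming git's standard tab-separated `added<TAB>deleted<TAB>path` format with `-` for binary files. `NumstatEntry` and `parseNumstat` are hypothetical names; how coco actually parses this output is not shown in the PR, and rename notation (`old => new`) is left unhandled here.

```typescript
// Hypothetical entry shape; null counts mark binary files, which numstat reports as "-".
type NumstatEntry = { file: string; added: number | null; deleted: number | null }

// Parses the text produced by `git show --numstat`, skipping any non-numstat lines
// (commit header, message, blank lines).
function parseNumstat(output: string): NumstatEntry[] {
  const entries: NumstatEntry[] = []
  for (const line of output.split('\n')) {
    const match = /^(\d+|-)\t(\d+|-)\t(.+)$/.exec(line)
    if (!match) continue
    // The groups are always present on a match; the fallbacks only satisfy strict index checks.
    const added = match[1] ?? '-'
    const deleted = match[2] ?? '-'
    const file = match[3] ?? ''
    entries.push({
      file,
      added: added === '-' ? null : Number(added),
      deleted: deleted === '-' ? null : Number(deleted),
    })
  }
  return entries
}

const sample = '12\t3\tsrc/parser.ts\n-\t-\tassets/logo.png\n0\t7\tREADME.md\n'
console.log(parseNumstat(sample))
```

An adapter along these lines would then run a per-file `git show` for each parsed path to collect the diff text itself.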

gfargo merged commit a65f8cf into main May 13, 2026
6 of 7 checks passed

gfargo deleted the feat/structural-extract-eval branch May 13, 2026 20:52

gfargo mentioned this pull request May 13, 2026

Structural-extract quality eval scaffolding #934

Closed
