feat(parser): structural-extract eval harness (#934) #937

Merged: gfargo merged 1 commit into `main` from `feat/structural-extract-eval` on May 13, 2026

@gfargo gfargo commented May 13, 2026

Summary

A/B harness for the language-aware fast path. Runs the parser pipeline twice against a fixed input set — once with `fastPath.languageAware.enabled: false` (LLM baseline), once with it on — and reports LLM calls saved, fast-path hit count, and token deltas per file.

The scenario library from #908 is reused as the golden-set provider — every scenario's commits become deterministic eval inputs without inventing new infrastructure for "where do the test commits come from".

Key pieces

  • `runStructuralExtractEval(diffs, configs)` — typed harness with pairwise deltas vs. the first config.
  • `buildScenarioFixtures(scenarioName)` — scenario → `FileDiff[]` adapter that walks each scenario's commit log.
  • Hand-crafted `fixtures.ts` — modification diffs targeting the language-aware path specifically (scenarios mostly trigger the lossless trivial-shape skip).
  • `bin/structuralExtractEval.ts` + `npm run eval:structural-extract` — CLI that writes per-input JSON/Markdown reports and prints an aggregate summary.

Sample output

Full run (scenarios + fixtures): 64 input files, 9 baseline LLM calls → 3 with languageAware on. 6 saved, 6 fast-path hits. All 5 language fixtures (TS×2, Python, Rust, Go) correctly trigger the fast path; markdown / body-only fall through to the LLM as expected.
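
The arithmetic behind these numbers can be sketched as a pairwise delta against the first config. This is a minimal illustration, not the harness's actual code: `RunTotals`, `RunDelta`, and `computeDeltas` are hypothetical names, and the baseline's fast-path hit count of 0 is an assumption (the flag is off in that run).

```typescript
// Hypothetical shapes; the real EvalReport types in this PR may differ.
interface RunTotals {
  label: string;
  llmCalls: number;
  fastPathHits: number;
}

interface RunDelta {
  label: string;
  llmCallsSaved: number;
  fastPathHitsGained: number;
}

// Compare every run after the first against the first (the baseline).
function computeDeltas(runs: RunTotals[]): RunDelta[] {
  if (runs.length === 0) {
    throw new Error("at least one run is required");
  }
  const [baseline, ...rest] = runs;
  return rest.map((run) => ({
    label: `${run.label} vs ${baseline.label}`,
    llmCallsSaved: baseline.llmCalls - run.llmCalls,
    fastPathHitsGained: run.fastPathHits - baseline.fastPathHits,
  }));
}

// Totals from the sample run above; baseline fastPathHits = 0 is assumed.
const deltas = computeDeltas([
  { label: "baseline", llmCalls: 9, fastPathHits: 0 },
  { label: "languageAware", llmCalls: 3, fastPathHits: 6 },
]);
console.log(deltas);
```

A single-config run yields an empty delta list under this sketch, which matches the "single-config no-deltas" behaviour the tests describe.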

Test plan

  • `npx tsc --noEmit` → 0 errors
  • `npx jest` → 1597/1597 pass (9 new)
  • `npx eslint` on touched code → clean
  • Manual: `npm run eval:structural-extract` produces meaningful per-input reports + aggregate summary

Out of scope (each tracked separately)

  • Committed baseline + CI regression check
  • Real-LLM live-mode harness
  • Per-language breakdown pivot in the report

Closes the harness scaffolding portion of #934.

Why now: phase 1 + 2 of #883 (#927, #928) shipped regex extractors
behind an opt-in flag. The unit tests verify the parser; nothing
yet verifies the *resulting commit-message pipeline behavior* when
the fast path fires. Without a mechanical eval we can't flip the
flag on by default with any confidence, and we can't compare
tree-sitter vs. regex when #933 lands.

The user's insight: the scenario library (#908) is already a
deterministic git-state factory. Reusing it as the golden-set
provider means every scenario's commits become an eval input,
fresh and byte-identical each run, with zero new infrastructure
for "where do the test commits come from".

Implementation:

- `src/lib/parsers/default/__evals__/structuralExtractEval.ts` —
  the harness. Mocks the LLM via dependency injection (fake
  `chain.invoke` passed through `summarizeLargeFiles`'s existing
  options), counts the calls, classifies each file's outcome
  (`unchanged` / `trivial` / `markdown` / `languageAware` / `llm`)
  by inspecting the rewritten diff. Returns a typed `EvalReport`
  with per-run totals + pairwise deltas. Public API:
  `runStructuralExtractEval(diffs, configs)` and
  `renderEvalReportMarkdown(report, title)`.
- `src/lib/parsers/default/__evals__/scenarioInputs.ts` — the
  golden-set adapter. `buildScenarioFixtures(scenarioName)` spins
  up the scenario, walks its commit log, calls `git show --numstat`
  + per-file `git show` to build `FileDiff[]` per commit. Returns
  the temp repo handle so the caller can clean up.
- `src/lib/parsers/default/__evals__/fixtures.ts` — hand-crafted
  modification diffs that target the language-aware path
  specifically. Scenarios mostly trigger the lossless
  trivial-shape path (pure additions) so they don't exercise the fast
  path; the fixtures fill that gap. One fixture per language so a
  regression in any single extractor surfaces in its own outcome
  row.
- `bin/structuralExtractEval.ts` — CLI driver. Writes per-input
  JSON + Markdown to `.bench/structural-extract-eval/<timestamp>/`
  and prints an aggregate summary table. Flags: `--scenario NAME`,
  `--fixtures-only`, `--no-fixtures`, `--languages ts,js`,
  `--out DIR`.
- `package.json`: new `eval:structural-extract` script.
- `.gitignore`: `.bench/structural-extract-eval/` excluded (output
  is local; regression-detection baseline is a follow-up).
- `src/lib/testUtils/README.md`: section describing the eval as a
  consumer of the scenarios + the extraction-boundary
  reaffirmation.
- `__evals__/README.md`: documents harness layout, input sources,
  CLI usage, and what's intentionally not here yet (committed
  baseline / regression check, real-LLM live mode, per-language
  pivot).
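
The call-counting approach described for the harness can be sketched roughly as follows. Everything here is a hypothetical stand-in: `Chain`, `FileDiff`, `summarize`, `LLM_MARKER`, and the length threshold are assumptions made for illustration, and the real `summarizeLargeFiles` options and outcome classifier may look different. The classifier is reduced to two of the five outcomes.

```typescript
// Hypothetical stand-ins; the real chain and FileDiff types may differ.
interface Chain {
  invoke(input: { diff: string }): Promise<string>;
}

interface FileDiff {
  path: string;
  diff: string;
}

const LLM_MARKER = "[llm summary]";

// Build a fake chain that records how many times it was invoked.
function createCountingChain(): { chain: Chain; calls: () => number } {
  let count = 0;
  const chain: Chain = {
    async invoke() {
      count += 1;
      return LLM_MARKER;
    },
  };
  return { chain, calls: () => count };
}

// Toy pipeline: short diffs pass through, long ones go to the injected chain.
async function summarize(
  files: FileDiff[],
  options: { chain: Chain; maxChars: number },
): Promise<FileDiff[]> {
  const out: FileDiff[] = [];
  for (const file of files) {
    if (file.diff.length <= options.maxChars) {
      out.push(file);
    } else {
      out.push({ ...file, diff: await options.chain.invoke({ diff: file.diff }) });
    }
  }
  return out;
}

// Classify an outcome by inspecting the rewritten diff (two outcomes only).
function classifyOutcome(original: string, rewritten: string): "unchanged" | "llm" {
  return rewritten === original ? "unchanged" : "llm";
}

async function runDemo(): Promise<number> {
  const { chain, calls } = createCountingChain();
  const files: FileDiff[] = [
    { path: "a.ts", diff: "short" },
    { path: "b.ts", diff: "x".repeat(200) },
  ];
  const rewritten = await summarize(files, { chain, maxChars: 50 });
  rewritten.forEach((f, i) => console.log(f.path, classifyOutcome(files[i].diff, f.diff)));
  return calls();
}

runDemo().then((calls) => console.log("LLM calls:", calls));
```

Injecting the chain keeps the eval offline and deterministic: the only thing measured is whether the LLM would have been called, not what it would have said.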

Sample run (full scenarios + fixtures): 64 input files, 9 LLM
calls in baseline → 3 with languageAware on. 6 calls saved,
6 fast-path hits. All 5 language fixtures (TS×2, Python, Rust, Go)
correctly trigger the fast path; body-only TS and markdown-only
fixtures correctly fall through to the LLM.

Out of scope (each tracked separately):
- Committed baseline + CI regression check.
- Real-LLM live-mode harness (for "is the resulting message
  better" comparison, not just "did the LLM get called").
- Per-language breakdown pivot in the report.

Tests (9 new):
- runStructuralExtractEval: single-config no-deltas, baseline vs
  fast-paths-on with TS + markdown, languages allowlist honored,
  empty-config rejection.
- renderEvalReportMarkdown: includes / omits delta table based on
  run count.
- buildScenarioFixtures: rejects unknown scenarios, produces a
  fixture per commit with non-empty diffs + valid shas, output is
  byte-identical across runs (determinism check).
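
One way to express a byte-identical check like the one above is to digest the fixture set and compare digests across two builds. This is only a sketch under assumptions: `FileDiff` is a simplified hypothetical shape, `digestFixtures` is not a function from this PR, and the actual test may compare the output directly instead of hashing.

```typescript
import { createHash } from "node:crypto";

// Simplified, hypothetical shape; the real FileDiff may carry more fields.
interface FileDiff {
  path: string;
  diff: string;
}

// Hash paths and diffs in order, so reordering or any byte change alters the digest.
function digestFixtures(fixtures: FileDiff[]): string {
  const hash = createHash("sha256");
  for (const fixture of fixtures) {
    hash.update(fixture.path);
    hash.update("\0"); // separator so "ab" + "c" differs from "a" + "bc"
    hash.update(fixture.diff);
    hash.update("\0");
  }
  return hash.digest("hex");
}

// Stand-in for two independent builds of the same scenario.
const buildOnce = (): FileDiff[] => [
  { path: "src/index.ts", diff: "+export const x = 1;\n" },
  { path: "README.md", diff: "+# Title\n" },
];

const first = digestFixtures(buildOnce());
const second = digestFixtures(buildOnce());
console.log(first === second ? "deterministic" : "non-deterministic");
```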

Validation:
- npx tsc --noEmit → 0 errors
- npx jest → 1597/1597 pass (9 new)
- npx eslint on touched code → clean
- Manual: `npm run eval:structural-extract` produces meaningful
  per-input reports + aggregate summary.
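
For narrower manual runs, the CLI flags listed under Implementation could be parsed along these lines. This sketch uses Node's built-in `parseArgs` from `node:util`; the `EvalCliOptions` shape and `parseEvalArgs` are hypothetical, and the real `bin/structuralExtractEval.ts` may parse its arguments differently.

```typescript
import { parseArgs } from "node:util";

// Hypothetical option bag; field names are illustrative only.
interface EvalCliOptions {
  scenario?: string;
  fixturesOnly: boolean;
  noFixtures: boolean;
  languages?: string[];
  outDir?: string;
}

// Parse the flags named in the commit message into a typed options object.
function parseEvalArgs(argv: string[]): EvalCliOptions {
  const { values } = parseArgs({
    args: argv,
    options: {
      scenario: { type: "string" },
      "fixtures-only": { type: "boolean" },
      "no-fixtures": { type: "boolean" },
      languages: { type: "string" },
      out: { type: "string" },
    },
  });
  return {
    scenario: values.scenario,
    fixturesOnly: values["fixtures-only"] ?? false,
    noFixtures: values["no-fixtures"] ?? false,
    // "--languages ts,js" becomes ["ts", "js"]
    languages: values.languages
      ?.split(",")
      .map((s) => s.trim())
      .filter((s) => s.length > 0),
    outDir: values.out,
  };
}

const parsed = parseEvalArgs(["--fixtures-only", "--languages", "ts,js"]);
console.log(parsed);
```

`parseArgs` is strict by default, so an unrecognised flag throws instead of being silently ignored.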

Closes the harness scaffolding portion of #934.

gfargo merged commit a65f8cf into main on May 13, 2026
6 of 7 checks passed
gfargo deleted the feat/structural-extract-eval branch May 13, 2026 20:52