fix(bench): findRunSeries/groupRunSeries respect matcher.scale #158

Merged: blove merged 1 commit into main from bench-findrunseries-scale-filter on Jun 5, 2026
@blove blove commented Jun 5, 2026

Summary

findRunSeries and groupRunSeries in scripts/bench-matrix.mjs filtered by scenarioId + scriptName (+ adapterId) but silently ignored matcher.scale — even though evaluateH16/H17/H18 pass scale: "hypothesis". A runset that mixed scales for the same scenario+script would aggregate runs across scales into a single verdict.

This is latent today (each bench-matrix invocation runs a single --scale, so runs only ever carry one scale), but it becomes a real correctness footgun the moment a multi-scale runset is fed to the evaluators. The scale matcher field also reads as if it filters when it doesn't.

Fix

Filter by scale when the matcher provides it — conditionally, since H1 / H6–H8 and the comparator matchers intentionally omit scale:

(matcher.scale === undefined || run.scale === matcher.scale)
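To make the conditional concrete, here is a minimal sketch of how such a filter could sit inside a run-series matcher. It is not the code from scripts/bench-matrix.mjs: the run shape, the `matchesRun` and `findRunSeriesSketch` names, the `"s1"` scenario id, and the handling of `adapterId` as an optional field are all assumptions made for illustration.

```javascript
// Hypothetical sketch: the run shape and helper names are assumed from the
// PR description, not copied from scripts/bench-matrix.mjs.
function matchesRun(run, matcher) {
  return (
    run.scenarioId === matcher.scenarioId &&
    run.scriptName === matcher.scriptName &&
    // Assumption: adapterId is optional, like scale. An omitted field matches every run.
    (matcher.adapterId === undefined || run.adapterId === matcher.adapterId) &&
    // The condition from this PR: only filter by scale when the matcher provides one.
    (matcher.scale === undefined || run.scale === matcher.scale)
  );
}

function findRunSeriesSketch(runs, matcher) {
  return runs.filter((run) => matchesRun(run, matcher));
}

// Two runs for the same scenario+script at different scales ("s1" is a made-up id).
const runs = [
  { scenarioId: "s1", scriptName: "select-range-extend", scale: "hypothesis", latencyMs: 10 },
  { scenarioId: "s1", scriptName: "select-range-extend", scale: "dev", latencyMs: 120 },
];

// A matcher that passes scale only sees the matching run.
const scoped = findRunSeriesSketch(runs, {
  scenarioId: "s1",
  scriptName: "select-range-extend",
  scale: "hypothesis",
});

// A matcher that omits scale still sees every run, as before the fix.
const unscoped = findRunSeriesSketch(runs, {
  scenarioId: "s1",
  scriptName: "select-range-extend",
});

console.log(scoped.length, unscoped.length); // 1 2
```

Treating an omitted `scale` as "match anything" keeps the matchers that intentionally leave it out behaving exactly as they did before.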

Test

New regression test: two select-range-extend runs for the same scenario+script — a good hypothesis-scale run (10 ms) and a bad dev-scale run (120 ms). H16 must stay satisfied and report sampleCount === 1. Before the fix, the dev run aggregated in, dragging latency over the 16 ms budget and flipping the verdict to failing.
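The before/after behavior described above can be illustrated with a stand-in evaluator. This is only a sketch under stated assumptions: the real evaluateH16 is not shown in this PR, so `evaluateSketch`, its `respectScale` toggle, the `"s1"` scenario id, and the use of a mean as the aggregate are hypothetical. Only the 10 ms and 120 ms latencies and the 16 ms budget come from the description.

```javascript
const BUDGET_MS = 16; // latency budget quoted in the PR description

// Hypothetical stand-in for an evaluator such as evaluateH16.
// Assumption: latency is aggregated as a mean; the real aggregate may differ.
function evaluateSketch(runs, matcher, { respectScale }) {
  const series = runs.filter(
    (run) =>
      run.scenarioId === matcher.scenarioId &&
      run.scriptName === matcher.scriptName &&
      // respectScale = false mimics the pre-fix behavior (scale ignored).
      (!respectScale || matcher.scale === undefined || run.scale === matcher.scale),
  );
  const meanMs = series.reduce((sum, run) => sum + run.latencyMs, 0) / series.length;
  return { satisfied: meanMs <= BUDGET_MS, sampleCount: series.length, meanMs };
}

// One good hypothesis-scale run and one bad dev-scale run ("s1" is a made-up id).
const mixedRuns = [
  { scenarioId: "s1", scriptName: "select-range-extend", scale: "hypothesis", latencyMs: 10 },
  { scenarioId: "s1", scriptName: "select-range-extend", scale: "dev", latencyMs: 120 },
];
const h16Matcher = { scenarioId: "s1", scriptName: "select-range-extend", scale: "hypothesis" };

const before = evaluateSketch(mixedRuns, h16Matcher, { respectScale: false });
const after = evaluateSketch(mixedRuns, h16Matcher, { respectScale: true });

console.log(before); // { satisfied: false, sampleCount: 2, meanMs: 65 }
console.log(after); // { satisfied: true, sampleCount: 1, meanMs: 10 }
```

Asserting on the sample count as well as the verdict is what makes the guard robust: a verdict alone could stay green by coincidence if the polluting run happened to be fast.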

Gates

  • node --test scripts/__tests__/bench-matrix.test.mjs — 81 passed (incl. new guard)
  • prettier --check clean

🤖 Generated with Claude Code

The run-series matchers filtered by scenarioId + scriptName (+ adapterId) but
silently ignored matcher.scale, even though H16/H17/H18 pass scale: "hypothesis".
A runset mixing scales for the same scenario+script would aggregate runs across
scales into one verdict. Latent today (each bench-matrix invocation is
single-scale) but a real correctness footgun.

Filter by scale when the matcher provides it (conditional, since H1/H6-H8 and the
comparator matchers intentionally omit it). Adds a regression test: a bad
dev-scale select-range-extend run no longer pollutes the hypothesis-scale H16
verdict.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vercel Bot commented Jun 5, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| pretable | Ready | Preview, Comment | Jun 5, 2026 11:56pm |

@blove blove enabled auto-merge (squash) June 5, 2026 23:55
@blove blove merged commit 6ebdd39 into main Jun 5, 2026
13 checks passed
@blove blove deleted the bench-findrunseries-scale-filter branch June 5, 2026 23:57

github-actions Bot commented Jun 5, 2026

Vercel preview ready

Preview: https://pretable-oh0cfy24r-cacheplane.vercel.app
Commit: d35696cdd3a070c2e462af60425c74c4e3d22c37

Updated automatically by the deploy-preview job.
