
fix(bench): correct H1 status and gate parity check on minimum repeats#125

Merged
blove merged 1 commit into main from b2-followup-1-corrections on May 9, 2026

@blove blove commented May 9, 2026

Summary

PR #124 ran a high-repeat (n=20) re-measurement of S2/hypothesis/scroll for pretable + mui and found that the B2 H1 failing verdict was a low-sample artifact. The memo at docs/research/2026-05-09-pretable-vs-mui-scroll-perf.md recommended raising the repeat protocol but left the downstream cleanup undone. This PR closes that loop.

What changed

  • status/milestones/2026-05-09-b2-h1-high-repeat-correction.json (new) — overlays the original B2 evidence with the n=20 result and correctedH1.status = "satisfied". The original B2 milestone is left intact for historical reference.
  • apps/website/app/bench/page.tsx — loads both the n=3 milestone and the n=20 correction. verdictFor now respects a parityAdapters set so adapters with parity verdicts get parity at n=20 (full quality pass) instead of being crowned. Prose rewritten: parity is the headline; the original snapshot is described as a low-sample artifact.
  • scripts/bench-matrix.mjs evaluateH1 — adds a minimum-repeat gate. When the pretable / best-full-grid frame-p95 ratio is in the tight zone (0.9 ≤ r ≤ 1.2) and either adapter has < 10 repeats, returns insufficient with guidance to re-run at --repeats=10+. Outside the tight zone the existing path still fires.
  • scripts/__tests__/bench-matrix.test.mjs — new test for the gated insufficient case (using the actual B2 ratio of 1.115); existing failing test rewritten to use a clearly out-of-zone ratio (1.6) so the failing path stays exercised.
  • docs/research/repo-memory.md — appends a 2026-05-09 entry overturning the H1 flip narrative.
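
The minimum-repeat gate described for evaluateH1 can be sketched roughly as follows. This is an illustrative reconstruction from the description above, not the code in scripts/bench-matrix.mjs: the function name `gateOnMinRepeats`, the argument shape, and the returned fields are assumptions. Only the 0.9 to 1.2 tight zone, the 10-repeat minimum, the `insufficient` status, and the `--repeats=10+` guidance come from this PR.

```javascript
// Hypothetical sketch of the min-repeat gate; names and shapes are assumed.
const TIGHT_ZONE = { min: 0.9, max: 1.2 }; // tight zone from the PR description
const MIN_REPEATS = 10; // minimum repeats from the PR description

function gateOnMinRepeats({ pretableP95, bestFullGridP95, pretableRepeats, comparatorRepeats }) {
  const ratio = pretableP95 / bestFullGridP95;
  const inTightZone = ratio >= TIGHT_ZONE.min && ratio <= TIGHT_ZONE.max;
  // "either adapter has < 10 repeats"
  const underSampled = pretableRepeats < MIN_REPEATS || comparatorRepeats < MIN_REPEATS;

  if (inTightZone && underSampled) {
    return {
      status: "insufficient",
      ratio,
      guidance: `re-run at --repeats=${MIN_REPEATS}+`,
    };
  }
  // Gate does not apply; the caller continues with its existing verdict path.
  return null;
}
```

With made-up inputs that reproduce the B2 ratio of 1.115 (for example 11.15 ms against 10 ms) at 3 repeats, the sketch returns `insufficient`. A ratio of 1.6 at 3 repeats, or 1.115 at 20 repeats, falls through to the existing path.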

What's NOT changed

  • AG Grid (16.7 ms p95, 1 blank gap, 2 px row-height drift) and TanStack (16.7 ms p95, 1 blank gap) status from the B2 n=3 runset is not corrected. Both are >50% above pretable, well outside the noise zone. They remain ~1.7× pretable's scroll_frame_p95_ms with quality gaps pretable does not have.
  • No public-package source changes. Affects only apps/website, scripts/bench-matrix.mjs, status milestones, and docs.

Test plan

  • `pnpm -w typecheck` passes
  • `pnpm -w test` passes (190 tests)
  • `node --test scripts/__tests__/bench-matrix.test.mjs` 68/68 pass (added 1 new, modified 1 existing)
  • `pnpm -w lint` 0 errors
  • `pnpm format` clean

🤖 Generated with Claude Code

PR #124 (perf-diag rerun at n=20) showed the B2 H1 "failing" verdict was
a low-sample artifact: pretable 9.07 ms ± 0.20 vs MUI 9.14 ms ± 0.19,
mean diff −0.065 ms inside the 2σ noise floor of 0.40 ms. The original
n=3 ratio of 1.115 was sample noise, not a real regression.
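
The noise-floor comparison can be checked with a few lines of arithmetic. The sketch below assumes the 0.40 ms floor is twice the larger of the two reported standard deviations (2 × 0.20 ms). The commit message does not say how the floor was derived, and `withinNoiseFloor` is a hypothetical helper, not something from the repo.

```javascript
// Hypothetical helper: is a mean difference inside a 2-sigma noise floor?
// Assumption: the floor is 2 x the larger of the two standard deviations.
function withinNoiseFloor(meanDiffMs, sigmaA, sigmaB) {
  const noiseFloorMs = 2 * Math.max(sigmaA, sigmaB);
  return { noiseFloorMs, within: Math.abs(meanDiffMs) <= noiseFloorMs };
}

// Figures quoted above: pretable 9.07 ms ± 0.20, MUI 9.14 ms ± 0.19, diff −0.065 ms.
const check = withinNoiseFloor(-0.065, 0.2, 0.19);
console.log(check.within, check.noiseFloorMs.toFixed(2));
```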

Four targeted corrections:

- Add status/milestones/2026-05-09-b2-h1-high-repeat-correction.json
  overlaying the original B2 evidence with the n=20 result and
  correctedH1.status = "satisfied". Original B2 milestone left intact.
- Rewrite the apps/website/app/bench/page.tsx prose to a parity framing
  at high repeats. verdictFor now respects a parityAdapters set so the
  table doesn't crown a "fastest" off n=3 noise; H1 status reflects the
  corrected verdict.
- Add a min-repeat gate to scripts/bench-matrix.mjs evaluateH1: when
  the pretable / best-full-grid frame-p95 ratio is in the tight zone
  (0.9 ≤ r ≤ 1.2) AND either adapter has < 10 repeats, return
  insufficient with guidance to re-run at --repeats=10+. Outside the
  tight zone the existing behavior is unchanged. New test covers the
  insufficient case; existing failing test rewritten to use a clearly
  out-of-zone ratio (1.6) so the failing path stays exercised.
- Append a 2026-05-09 entry to docs/research/repo-memory.md overturning
  the H1 flip narrative, with the new evaluator gate documented.
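
The `parityAdapters` behavior described in the second correction might look roughly like this. It is a sketch under stated assumptions: the real `verdictFor` signature, the row shape, and the verdict labels in page.tsx are not shown in this PR, so everything here except the two names `verdictFor` and `parityAdapters` is hypothetical.

```javascript
// Hypothetical sketch; the real verdictFor in page.tsx may differ.
// Adapters in parityAdapters are reported as "parity" rather than ranked.
function verdictFor(adapter, rows, parityAdapters = new Set()) {
  if (parityAdapters.has(adapter)) return "parity";
  const best = Math.min(...rows.map((row) => row.p95Ms));
  const own = rows.find((row) => row.adapter === adapter);
  return own.p95Ms === best ? "fastest" : "slower";
}

// Scroll p95 figures quoted in this thread (ms).
const rows = [
  { adapter: "pretable", p95Ms: 9.07 },
  { adapter: "mui", p95Ms: 9.14 },
  { adapter: "ag-grid", p95Ms: 16.7 },
  { adapter: "tanstack", p95Ms: 16.7 },
];
const parityAdapters = new Set(["pretable", "mui"]);
console.log(verdictFor("pretable", rows, parityAdapters));
```

Without the set, the sketch would crown pretable off a 0.07 ms gap, which is the outcome this correction avoids.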

AG Grid (16.7 ms p95, 1 blank gap, 2 px row-height drift) and TanStack
(16.7 ms p95, 1 blank gap) status from the B2 n=3 runset is not
corrected here — both are >50% above pretable, well outside the noise
zone. They remain ~1.7× pretable's scroll_frame_p95_ms with quality
gaps that pretable does not have.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vercel Bot commented May 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| pretable | Ready | Preview, Comment | May 9, 2026 3:08am |

@blove blove enabled auto-merge (squash) May 9, 2026 03:07
@blove blove merged commit e3811cf into main May 9, 2026
13 checks passed
@blove blove deleted the b2-followup-1-corrections branch May 9, 2026 03:09

github-actions Bot commented May 9, 2026

Vercel preview ready

Preview: https://pretable-215yyjsfq-cacheplane.vercel.app
Commit: 73348c492bde6221f067ffd51b097a7f61d8231c

Updated automatically by the deploy-preview job.

blove added a commit that referenced this pull request May 9, 2026
…ound B2 evidence (#126)

The B2 corrections PR (#125) confirmed pretable / MUI parity at n=20 and
overturned the H1 flip narrative. Three homepage components still
referenced the old gridalpha-stub "4× faster" claim with stub-era
numbers. This PR brings them in line with the real B2 runset.

ComparisonTable.tsx:
- Drop the "4× faster scroll" header badge.
- Replace the gridalpha / gridbeta / gridgammaX columns with real
  ag-grid / tanstack / mui columns; rename Row interface fields.
- Replace scroll-row data with real B2 numbers (pretable 9.07, MUI 9.14,
  AG Grid 16.7, TanStack 16.7) and add row-height-fidelity, blank-gap,
  anchor-shift rows that surface the quality wedge.
- Drop streaming rows (S5/updates) until follow-up #6 lands real-
  comparator S5 evidence; replace with headless-engine + streaming-
  pipeline rows that distinguish pretable's surface honestly.
- Update trail-marker labels to fact-checkable characterizations:
  AG Grid "Slower scroll; row-height drift", TanStack "Headless; you
  wire selection and nav", MUI X "Parity at scroll p95; full-grid
  feature surface".
- Rewrite the section subhead to a parity framing.

ReceiptsBand.tsx:
- Drop the "4×" hero stat.
- Replace stats with the quality wedge: 0 blank gaps (accent), 9 ms
  frame p95, ≤1 px row-height fidelity, 25k/s max sustained update
  rate. The 25k/s figure is pretable's own from the May-1 streaming
  runset; comparative S5 evidence is still pending.

FeatureGrid.tsx:
- Replace "16ms p95 ... 4× faster than Grid Alpha Community" with a
  parity + quality-wedge description that names real comparators.

Test updates:
- ReceiptsBand.test.tsx asserts the new "0" + "9ms" hero stats.
- ComparisonTable.test.tsx asserts the new fact-checkable trail-marker
  labels (regex-matched so prose tweaks don't break the tests).

No source/package changes outside apps/website. All 190 website tests
pass; 68/68 bench-matrix tests pass; pnpm -w lint / typecheck / format
clean.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
blove added a commit that referenced this pull request May 9, 2026
* docs(specs): B2 follow-up #3 — autosize end-to-end wiring design

End-to-end autosize harness wiring (pretable + ag-grid + mui; tanstack
unsupported), with H22 comparator-parity hypothesis evaluator reusing
the min-repeat gate from PR #125, and a full B2 matrix re-run with
autosize included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(plans): B2 follow-up #3 — autosize wiring implementation plan

Six-task plan for wiring autosize through the bench harness end-to-end,
adding evaluateH22 with the min-repeat gate, and re-running the B2
matrix with autosize included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(bench-runner): accept autosize script through harness pipeline

Adds "autosize" to the bench-runner supportedScripts allowlist (gated
to S2 and to pretable | ag-grid | mui — tanstack remains unsupported
per the B2 spec), to the apps/bench query-state parser, and to the
BenchScriptName Extract narrow in bench-types.
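
As a rough illustration of the gating described above, an allowlist check could look like the following. The helper name `isAutosizeSupported` and its argument shape are assumptions. The source only says that autosize is limited to S2 and to pretable, ag-grid, and mui.

```javascript
// Hypothetical allowlist check; names and structure are assumed.
const AUTOSIZE_ADAPTERS = new Set(["pretable", "ag-grid", "mui"]);

function isAutosizeSupported({ scenario, adapter }) {
  // autosize is gated to scenario S2 and to the three adapters above
  return scenario === "S2" && AUTOSIZE_ADAPTERS.has(adapter);
}

console.log(isAutosizeSupported({ scenario: "S2", adapter: "tanstack" }));
```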

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(bench): measureBenchAutosizeRun helper for autosize script

Adds a single-event autosize latency helper that awaits the adapter's
autosize callback and one rAF, reporting interaction_latency_ms as
"call-to-paint" timing. Mirrors the shape of measureBenchKeySequenceRun.

Also unblocks the now-accepted "autosize" script in the query-state
parser by retargeting the existing fallback-to-defaults test to an
unrelated bogus value.
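
The "call-to-paint" timing described for the helper can be sketched as below. This is not the actual measureBenchAutosizeRun: the function name `measureAutosizeLatency` and the injectable `clock` parameter are assumptions, added so the sketch also runs outside a browser. In a browser the defaults fall back to `performance.now()` and `requestAnimationFrame`.

```javascript
// Hypothetical sketch of a single-event autosize latency measurement.
// Awaits the adapter's autosize callback, then one animation frame.
async function measureAutosizeLatency(autosize, clock = {}) {
  const now = clock.now ?? (() => performance.now());
  const nextFrame =
    clock.nextFrame ?? (() => new Promise((resolve) => requestAnimationFrame(resolve)));

  const start = now();
  await autosize(); // adapter-provided autosize entry point
  await nextFrame(); // wait one rAF so the resize has a chance to paint
  return { interaction_latency_ms: now() - start };
}
```

Injecting the clock keeps the measurement logic testable with a fake timer, at the cost of one extra parameter the real helper may not have.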

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(bench): wire onAutosizeReady on pretable/ag-grid/mui adapters

Pretable, AG Grid, and MUI adapters now publish their autosize entry
point through a new onAutosizeReady callback. bench-app.tsx captures it
in autosizeApiRef and dispatches measureBenchAutosizeRun on the autosize
script, mirroring the updateApiRef + measureBenchUpdatesRun chain.

Replaces AG Grid's pre-emptive onGridReady autosize branch (which only
ran at mount) with a callback so autosize fires on bench-script
dispatch. MUI now exposes apiRef via useGridApiRef so the harness can
call apiRef.current.autosizeColumns({ includeOutliers: true }) — async
on v7+. TanStack accepts the prop for harness uniformity but the
bench-runner returns "unsupported" before the adapter ever mounts.
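
Stripped of React, the callback-and-ref pattern described above amounts to something like the sketch below. `createAutosizeHarness`, `runScript`, and the returned status strings are hypothetical. Only `onAutosizeReady` and `autosizeApiRef` are names taken from this commit, and the real wiring lives in bench-app.tsx.

```javascript
// Hypothetical, framework-free sketch of the onAutosizeReady wiring.
function createAutosizeHarness() {
  const autosizeApiRef = { current: null };
  return {
    // Adapters call this once their autosize entry point is available.
    onAutosizeReady(autosize) {
      autosizeApiRef.current = autosize;
    },
    // The harness fires autosize on script dispatch, not at mount.
    async runScript(script) {
      if (script !== "autosize") return "skipped";
      if (autosizeApiRef.current === null) return "not-ready";
      await autosizeApiRef.current();
      return "ran";
    },
  };
}
```

An MUI adapter would pass something like `() => apiRef.current.autosizeColumns({ includeOutliers: true })` to `onAutosizeReady`, per the commit message.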

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(bench-matrix): evaluateH22 autosize comparator-parity hypothesis

Adds H22 ("pretable autosize is within a single 60Hz frame and within
10% of the best ag-grid/mui comparator on S2"). Reuses the H1
comparator-parity pattern: 16 ms single-frame floor, 10% parity band,
≥10 repeats per side before resolving a tight-zone (0.9–1.2) ratio.

Hoists COMPARATOR_PARITY_MIN_REPEATS to module scope so H1 and H22
share a single source of truth.
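
Putting the three rules together, a comparator-parity evaluator might look like the sketch below. It is an interpretation, not the evaluateH22 source. It assumes "within 10%" means a pretable/comparator ratio of at most 1.10, and that the 16 ms floor is inclusive. The function name, argument shape, and the two threshold constant names other than `COMPARATOR_PARITY_MIN_REPEATS` are hypothetical.

```javascript
// Hypothetical sketch of the comparator-parity pattern shared by H1 and H22.
const COMPARATOR_PARITY_MIN_REPEATS = 10; // name from the commit message
const SINGLE_FRAME_FLOOR_MS = 16; // assumed constant name
const PARITY_BAND = 1.1; // assumed reading of "within 10%"

function evaluateComparatorParity({ pretableMs, bestComparatorMs, pretableRepeats, comparatorRepeats }) {
  const ratio = pretableMs / bestComparatorMs;
  const inTightZone = ratio >= 0.9 && ratio <= 1.2;
  const minRepeats = Math.min(pretableRepeats, comparatorRepeats);

  // Tight-zone ratios are not resolved until both sides have enough repeats.
  if (inTightZone && minRepeats < COMPARATOR_PARITY_MIN_REPEATS) {
    return { status: "insufficient", ratio };
  }
  const satisfied = pretableMs <= SINGLE_FRAME_FLOOR_MS && ratio <= PARITY_BAND;
  return { status: satisfied ? "satisfied" : "failing", ratio };
}
```

Fed the figures reported in the matrix re-run commit below (5.3 ms against 11 ms at repeats=3), the ratio comes to about 0.482. That is outside the tight zone, so the gate does not apply and the sketch returns `satisfied`.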

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(bench): B2 matrix re-run with autosize; H22 evaluated

S2/hypothesis/Chromium, all 13 scripts including autosize, repeats=3,
~5 min wall-clock. H22 satisfied: pretable autosize 5.3 ms vs MUI 11 ms
(ratio 0.482, outside the tight zone — gate does not apply).

H1 also flipped from failing → satisfied vs the 2026-05-08 milestone
(parity at n=3 with mui this run; matches the n=20 correction documented
in the previous repo-memory entry).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(format): prettier formatting for B2 follow-up #3

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>