feat(bench-matrix): H6-H8 + H19-H21 evaluators embed comparator evidence by blove · Pull Request #140 · cacheplane/pretable

blove · 2026-05-13T00:55:49Z

Summary

Extends six pretable-only evaluators in scripts/bench-matrix.mjs (H6/H7/H8 interaction + H19/H20/H21 cell-renderer) to embed comparator-adapter evidence in their evidence arrays. Mirrors evaluateH1's pre-existing pattern. Status logic unchanged — pretable's absolute thresholds still drive verdicts; comparator data is informational.

Goal: hypotheses.json becomes a single source of truth for cross-adapter perf data, retiring (over time) the per-PR aggregator scripts from PRs #130/#131/#132.
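The pattern in the Summary (pretable alone decides the verdict, comparators ride along as evidence) can be sketched roughly as follows. This is an illustration, not the code in scripts/bench-matrix.mjs: the function name `evaluateWithComparators`, the run-record fields (`adapter`, `scenarioId`, `scriptName`, `p95Ms`), the `thresholdMs` option, and the non-`satisfied` status strings are all assumptions.

```javascript
// Hypothetical run-record shape: { adapter, scenarioId, scriptName, p95Ms }.
// Status strings other than "satisfied" are placeholders for illustration.
function evaluateWithComparators(runs, { scenarioId, scriptName, thresholdMs }) {
  const matching = runs.filter(
    (run) => run.scenarioId === scenarioId && run.scriptName === scriptName,
  );
  const pretable = matching.find((run) => run.adapter === "pretable");
  // Comparator runs are collected for reporting only; they never affect status.
  const comparatorEvidence = matching.filter((run) => run.adapter !== "pretable");

  if (!pretable) {
    // Every return branch still carries whatever comparator data exists.
    return { status: "unknown", evidence: [...comparatorEvidence] };
  }

  // The verdict depends on pretable's absolute threshold alone.
  const status = pretable.p95Ms <= thresholdMs ? "satisfied" : "violated";
  return { status, evidence: [pretable, ...comparatorEvidence] };
}
```

Because the comparator entries are appended after the status is computed, a faster or slower comparator cannot change the verdict in this sketch.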

What changed

scripts/bench-matrix.mjs — new findComparatorEvidence(runs, { scenarioId, scriptName }) helper; six evaluators extended to append ...comparatorEvidence to their evidence arrays in every return branch. H6/H7/H8 changes are centralized in the shared evaluateInteractionHypothesis helper.
scripts/__tests__/bench-matrix.test.mjs — six new tests asserting evidence-array contents when comparator runs are present. All existing status-verdict tests untouched.
status/milestones/2026-05-12-comparator-aware-evaluators.hypotheses.json — milestone synthesized from per-run summaries after recovering from two matrix-runner flakes (one tanstack/filter-metadata locator-timing flake, one preview-server ECONNREFUSED).
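A minimal sketch of what a helper with the `findComparatorEvidence(runs, { scenarioId, scriptName })` signature might look like. Only the name and signature come from this PR; the run-record fields (`adapter`, `scenarioId`, `scriptName`, `summary`) and the returned entry shape are assumed for illustration and may not match the real module. The `p95Ms` numbers are placeholders, not measurements.

```javascript
// Sketch only: field names on the run records are assumptions.
function findComparatorEvidence(runs, { scenarioId, scriptName }) {
  return runs
    .filter(
      (run) =>
        run.adapter !== "pretable" &&
        run.scenarioId === scenarioId &&
        run.scriptName === scriptName,
    )
    .map((run) => ({
      adapter: run.adapter,
      scenarioId,
      scriptName,
      summary: run.summary,
    }));
}

// Usage with placeholder values.
const runs = [
  { adapter: "pretable", scenarioId: "S2", scriptName: "scroll-with-render", summary: { p95Ms: 1 } },
  { adapter: "ag-grid", scenarioId: "S2", scriptName: "scroll-with-render", summary: { p95Ms: 2 } },
  { adapter: "mui", scenarioId: "S2", scriptName: "scroll-with-render", summary: { p95Ms: 3 } },
  { adapter: "tanstack", scenarioId: "S2", scriptName: "sort", summary: { p95Ms: 4 } },
];
const comparatorEvidence = findComparatorEvidence(runs, {
  scenarioId: "S2",
  scriptName: "scroll-with-render",
});
// An evaluator would then spread the result: evidence: [pretableEntry, ...comparatorEvidence]
```

Returning an empty array when no comparator ran keeps the spread safe in every return branch, which matches the note elsewhere in this PR that the evaluator handles whatever comparator data is present.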

Verification

All seven hypotheses retained satisfied status. Evidence arrays:

H#	Status	Evidence adapters
H1	satisfied	pretable, ag-grid, tanstack
H6	satisfied	pretable, ag-grid, tanstack, mui
H7	satisfied	pretable, ag-grid, tanstack, mui
H8	satisfied	pretable, ag-grid, tanstack, mui
H19	satisfied	pretable (format), pretable (baseline), ag-grid, tanstack, mui
H20	satisfied	pretable, ag-grid, tanstack, mui
H21	satisfied	pretable, ag-grid, tanstack, mui

What's NOT in this PR

/bench page swap to read from hypotheses.json directly. The aggregator scripts (PRs #130/#131/#132, plus scripts/extract-interaction-summary.mjs) still feed the page. Editorial-only follow-up.
Per-adapter format-overhead deltas in H19 (comparator-vs-own-baseline). Currently comparator evidence is absolute scroll-with-format p95 only.
Matrix runner reliability investigation. The mid-run flake pattern is documented across PRs #133, #134, and this PR; separate follow-up.
Threshold or status-logic changes (data-only architectural change).

Test plan

pnpm -w typecheck passes
pnpm -w test passes (191 tests — existing matrix-runner tests + 6 new evidence-array assertions)
pnpm -w lint 0 errors
pnpm format clean
All seven hypothesis statuses unchanged vs prior milestones

🤖 Generated with Claude Code

Extend H6/H7/H8 (interaction) and H19/H20/H21 (cell-renderer) evaluators to include comparator evidence in their evidence arrays, mirroring H1's pattern. Status logic stays pretable-only; data is informational. Retires the aggregator-script pattern over time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Twelve-task plan: shared helper for comparator-evidence lookup, six evaluator extensions (H6, H7, H8, H19, H20, H21), test coverage, matrix re-run, repo-memory entry, PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

evaluateInteractionHypothesis now embeds every measured comparator adapter in the evidence array (was: best-by-interaction-latency only). Pretable-only status verdicts unchanged. Adds findComparatorEvidence helper used by all six target evaluators (H6/H7/H8 + H19/H20/H21). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

H19's evidence array now embeds each comparator's scroll-with-format summary alongside pretable's format/baseline delta. Comparator entries are absolute format p95 (not deltas) — per-adapter format-vs-baseline deltas are a future enhancement. Status verdict unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
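The H19 evidence layout described above (pretable's format and baseline runs first, then each comparator's absolute scroll-with-format summary) might be assembled along these lines. This is a sketch under assumptions: the function name `buildH19Evidence`, the `scroll-baseline` script name, and the `{ adapter, scriptName, p95Ms }` record shape are hypothetical, and no verdict is computed here.

```javascript
// Hypothetical script names; only "scroll-with-format" appears in this PR.
const FORMAT_SCRIPT = "scroll-with-format";
const BASELINE_SCRIPT = "scroll-baseline"; // assumed name for the plain scroll run

function buildH19Evidence(runs) {
  const find = (adapter, scriptName) =>
    runs.find((run) => run.adapter === adapter && run.scriptName === scriptName);

  const format = find("pretable", FORMAT_SCRIPT);
  const baseline = find("pretable", BASELINE_SCRIPT);
  // Comparators contribute their absolute format run only, not a delta.
  const comparators = runs.filter(
    (run) => run.adapter !== "pretable" && run.scriptName === FORMAT_SCRIPT,
  );

  // The format-vs-baseline delta is computed for pretable alone.
  const overheadMs = format && baseline ? format.p95Ms - baseline.p95Ms : null;
  const pretableEntries = [format, baseline].filter(Boolean);
  return { overheadMs, evidence: [...pretableEntries, ...comparators] };
}
```

With all three comparators measured, the array holds five entries: two for pretable and one per comparator.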

H20's evidence array now embeds each comparator's scroll-with-render summary alongside pretable. Status verdict unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

H21's evidence array now embeds each comparator's scroll-with-heavy-render summary alongside pretable. Status verdict unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…20/H21 Pins the contract that each of the six evaluators surfaces every measured comparator adapter in its evidence array (4 entries for H6/H7/H8/H20/H21; 5 for H19 which also carries the pretable scroll baseline). Status verdicts remain pretable-only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
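The entry-count contract above (4 entries for H6/H7/H8/H20/H21, 5 for H19) could be checked with something like the following. This is a framework-free sketch rather than the tests added in scripts/__tests__/bench-matrix.test.mjs; the function name `findContractViolations` and the `{ id, evidence }` report shape are assumptions.

```javascript
// Expected evidence-array sizes, taken from the commit message above.
// These counts assume all three comparators (ag-grid, tanstack, mui) were measured.
const EXPECTED_EVIDENCE_COUNT = { H6: 4, H7: 4, H8: 4, H19: 5, H20: 4, H21: 4 };

// Hypothetical report shape: an array of { id, evidence } objects.
function findContractViolations(hypotheses) {
  const violations = [];
  for (const { id, evidence } of hypotheses) {
    const expected = EXPECTED_EVIDENCE_COUNT[id];
    if (expected === undefined) continue; // not one of the six target evaluators
    if (evidence.length !== expected) {
      violations.push(
        `${id}: expected ${expected} evidence entries, got ${evidence.length}`,
      );
    }
  }
  return violations;
}
```

An empty result means every target hypothesis carries the expected number of entries; hypotheses outside the six, such as H1, are skipped.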

Synthesized hypotheses report from per-run summaries (matrix runner's end-of-run report-writer flaked repeatedly in this worktree; the per-run summaries are valid). Verification of evaluator extensions:

| H# | Status | Evidence adapters in array |
| --- | --- | --- |
| H1 | satisfied | pretable, ag-grid, tanstack |
| H6 | satisfied | pretable, ag-grid, tanstack |
| H7 | satisfied | pretable, ag-grid, tanstack |
| H8 | satisfied | pretable, ag-grid, tanstack |
| H19 | satisfied | pretable (format), pretable (baseline), ag-grid, tanstack |
| H20 | satisfied | pretable, ag-grid, tanstack |
| H21 | satisfied | pretable, ag-grid, tanstack |

All seven hypotheses retained their existing `satisfied` status (no threshold changes; evaluator-extension was data-only). Comparator evidence is now embedded inline in each hypothesis's evidence array, which is the architectural goal of the PR.

MUI runs flaked in this matrix attempt and are absent from the evidence; that's a matrix-runner reliability issue, not an evaluator issue. The evaluator correctly handles whatever comparator data is present per-slice. Investigating the matrix-runner's tanstack/mui flake pattern is a separate follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Architecture change documenting the H6/H7/H8/H19/H20/H21 evaluator extensions, the matrix-runner flake workaround, and the deferred follow-ups. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

H6/H7/H8/H19/H20/H21 now embed comparator evidence in their evidence arrays (4 entries each; 5 for H19 which also carries pretable's scroll baseline). Pretable-only status verdicts unchanged — all six remain satisfied. Aggregated from today's S2 hypothesis-scale runs across pretable/ag-grid/tanstack/mui after the matrix runner hit two mid-run e2e flakes (preview server / locator timing); on-disk summaries combined into a single report. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

All four adapters (pretable + ag-grid + tanstack + mui) are present in every comparator-aware evaluator's evidence array. Updates the 2026-05-12 entry to reflect the recovered matrix outcome. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vercel · 2026-05-13T00:55:52Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
pretable	Ready	Preview, Comment	May 13, 2026 0:56am

github-actions · 2026-05-13T01:00:05Z

Vercel preview ready

Preview: https://pretable-osprb3of2-cacheplane.vercel.app
Commit: d3e0fcdfba77307554154f34c42b7df96e054784

Updated automatically by the deploy-preview job.

blove and others added 13 commits May 12, 2026 17:41

feat(bench-matrix): H20 evaluator surfaces comparator evidence

208160a

feat(bench-matrix): H21 evaluator surfaces comparator evidence

6578806

docs(research): repo-memory entry — comparator-aware evaluators

c2fd72b

chore: prettier-format comparator-aware-evaluators artifacts

f10351a

chore: prettier-format bench-matrix tests

d3e0fcd

blove enabled auto-merge (squash) May 13, 2026 00:55

blove merged commit 999e818 into main May 13, 2026
13 checks passed

blove deleted the comparator-aware-evaluators branch May 13, 2026 00:58

blove mentioned this pull request May 13, 2026

chore(deps): bump ag-grid-community from 33.3.2 to 35.2.1 #136

Closed
