feat(bench-matrix): H6-H8 + H19-H21 evaluators embed comparator evidence #140

Merged: blove merged 13 commits into main from comparator-aware-evaluators on May 13, 2026

@blove blove commented May 13, 2026

Summary

Extends six pretable-only evaluators in scripts/bench-matrix.mjs (H6/H7/H8 interaction + H19/H20/H21 cell-renderer) to embed comparator-adapter evidence in their evidence arrays. Mirrors evaluateH1's pre-existing pattern. Status logic unchanged — pretable's absolute thresholds still drive verdicts; comparator data is informational.

Goal: hypotheses.json becomes a single source of truth for cross-adapter perf data, retiring (over time) the per-PR aggregator scripts from PRs #130/#131/#132.

What changed

  • scripts/bench-matrix.mjs — new findComparatorEvidence(runs, { scenarioId, scriptName }) helper; six evaluators extended to append ...comparatorEvidence to their evidence arrays in every return branch. H6/H7/H8 changes are centralized in the shared evaluateInteractionHypothesis helper.
  • scripts/__tests__/bench-matrix.test.mjs — six new tests asserting evidence-array contents when comparator runs are present. All existing status-verdict tests untouched.
  • status/milestones/2026-05-12-comparator-aware-evaluators.hypotheses.json — milestone synthesized from per-run summaries after recovering from two matrix-runner flakes (one tanstack/filter-metadata locator-timing flake, one preview-server ECONNREFUSED).
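The helper-plus-spread pattern described in the first bullet can be sketched roughly as follows. Only the `findComparatorEvidence(runs, { scenarioId, scriptName })` signature comes from this PR; the run shape (`adapter`, `scenarioId`, `scriptName`, `summary.p95Ms`), the `evaluateSketch` function, the threshold parameter, and the non-`satisfied` status strings are assumptions made for illustration and may not match the real code in scripts/bench-matrix.mjs.

```javascript
// Assumed run shape: { adapter, scenarioId, scriptName, summary: { p95Ms } }.
// The actual shape used by scripts/bench-matrix.mjs may differ.
function findComparatorEvidence(runs, { scenarioId, scriptName }) {
  return runs
    .filter(
      (run) =>
        run.adapter !== "pretable" &&
        run.scenarioId === scenarioId &&
        run.scriptName === scriptName,
    )
    .map((run) => ({
      adapter: run.adapter,
      scenarioId,
      scriptName,
      summary: run.summary,
    }));
}

// Hypothetical evaluator: the verdict depends on pretable alone, and every
// return branch spreads the comparator evidence into its evidence array.
function evaluateSketch(runs, { scenarioId, scriptName, thresholdMs }) {
  const comparatorEvidence = findComparatorEvidence(runs, {
    scenarioId,
    scriptName,
  });
  const pretableRun = runs.find(
    (run) =>
      run.adapter === "pretable" &&
      run.scenarioId === scenarioId &&
      run.scriptName === scriptName,
  );

  if (!pretableRun) {
    // No pretable data: no verdict, but comparator data is still reported.
    return { status: "unknown", evidence: [...comparatorEvidence] };
  }

  const pretableEvidence = {
    adapter: "pretable",
    scenarioId,
    scriptName,
    summary: pretableRun.summary,
  };
  const status =
    pretableRun.summary.p95Ms <= thresholdMs ? "satisfied" : "not-satisfied";
  return { status, evidence: [pretableEvidence, ...comparatorEvidence] };
}
```

Because the comparator entries are only ever appended to `evidence`, adding or removing comparator runs cannot change `status` in this sketch, which is the property the PR relies on.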

Verification

All seven hypotheses retained satisfied status. Evidence arrays:

| H#  | Status    | Evidence adapters                                          |
| --- | --------- | ---------------------------------------------------------- |
| H1  | satisfied | pretable, ag-grid, tanstack                                |
| H6  | satisfied | pretable, ag-grid, tanstack, mui                           |
| H7  | satisfied | pretable, ag-grid, tanstack, mui                           |
| H8  | satisfied | pretable, ag-grid, tanstack, mui                           |
| H19 | satisfied | pretable (format), pretable (baseline), ag-grid, tanstack, mui |
| H20 | satisfied | pretable, ag-grid, tanstack, mui                           |
| H21 | satisfied | pretable, ag-grid, tanstack, mui                           |

What's NOT in this PR

Test plan

  • pnpm -w typecheck passes
  • pnpm -w test passes (191 tests — existing matrix-runner tests + 6 new evidence-array assertions)
  • pnpm -w lint reports 0 errors
  • pnpm format reports no changes
  • All seven hypothesis statuses unchanged vs prior milestones
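The two properties the test plan pins down (every measured adapter appears in the evidence array, and the verdict is unaffected by comparator data) can be illustrated with a framework-free check. This is a sketch under stated assumptions: `evaluateStub`, the `p95Ms` field, the `not-satisfied` status string, the 16 ms threshold, and all latency numbers are made up for illustration and are not taken from the real test file.

```javascript
// Hypothetical stand-in evaluator: status from pretable only,
// evidence from every adapter that has a run.
function evaluateStub(runs, thresholdMs) {
  const pretable = runs.find((run) => run.adapter === "pretable");
  const comparators = runs.filter((run) => run.adapter !== "pretable");
  const status =
    pretable && pretable.p95Ms <= thresholdMs ? "satisfied" : "not-satisfied";
  return {
    status,
    evidence: [...(pretable ? [pretable] : []), ...comparators],
  };
}

// Minimal assertion helper so the sketch needs no test framework.
function check(condition, message) {
  if (!condition) throw new Error(message);
}

// Illustrative numbers only; tanstack is deliberately faster than pretable
// to show that comparator results do not influence the verdict.
const pretableOnly = [{ adapter: "pretable", p95Ms: 12 }];
const withComparators = [
  ...pretableOnly,
  { adapter: "ag-grid", p95Ms: 30 },
  { adapter: "tanstack", p95Ms: 8 },
  { adapter: "mui", p95Ms: 25 },
];

const before = evaluateStub(pretableOnly, 16);
const after = evaluateStub(withComparators, 16);

check(after.evidence.length === 4, "expected one entry per measured adapter");
check(
  after.evidence.map((entry) => entry.adapter).join(",") ===
    "pretable,ag-grid,tanstack,mui",
  "expected pretable first, then comparators in run order",
);
check(after.status === before.status, "comparators must not change status");
```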

🤖 Generated with Claude Code

blove and others added 13 commits May 12, 2026 17:41
Extend H6/H7/H8 (interaction) and H19/H20/H21 (cell-renderer)
evaluators to include comparator evidence in their evidence arrays,
mirroring H1's pattern. Status logic stays pretable-only; data is
informational. Retires the aggregator-script pattern over time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Twelve-task plan: shared helper for comparator-evidence lookup, six
evaluator extensions (H6, H7, H8, H19, H20, H21), test coverage,
matrix re-run, repo-memory entry, PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
evaluateInteractionHypothesis now embeds every measured comparator
adapter in the evidence array (was: best-by-interaction-latency only).
Pretable-only status verdicts unchanged. Adds findComparatorEvidence
helper used by all six target evaluators (H6/H7/H8 + H19/H20/H21).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
H19's evidence array now embeds each comparator's scroll-with-format
summary alongside pretable's format/baseline delta. Comparator entries
are absolute format p95 (not deltas) — per-adapter format-vs-baseline
deltas are a future enhancement. Status verdict unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
H20's evidence array now embeds each comparator's scroll-with-render
summary alongside pretable. Status verdict unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
H21's evidence array now embeds each comparator's
scroll-with-heavy-render summary alongside pretable. Status verdict
unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…20/H21

Pins the contract that each of the six evaluators surfaces every
measured comparator adapter in its evidence array (4 entries for
H6/H7/H8/H20/H21; 5 for H19 which also carries the pretable scroll
baseline). Status verdicts remain pretable-only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Synthesized hypotheses report from per-run summaries (matrix runner's
end-of-run report-writer flaked repeatedly in this worktree; the
per-run summaries are valid).

Verification of evaluator extensions:

| H#  | Status    | Evidence adapters in array               |
| --- | --------- | ---------------------------------------- |
| H1  | satisfied | pretable, ag-grid, tanstack              |
| H6  | satisfied | pretable, ag-grid, tanstack              |
| H7  | satisfied | pretable, ag-grid, tanstack              |
| H8  | satisfied | pretable, ag-grid, tanstack              |
| H19 | satisfied | pretable (format), pretable (baseline),  |
|     |           |   ag-grid, tanstack                      |
| H20 | satisfied | pretable, ag-grid, tanstack              |
| H21 | satisfied | pretable, ag-grid, tanstack              |

All seven hypotheses retained their existing `satisfied` status (no
threshold changes; evaluator-extension was data-only). Comparator
evidence now embedded inline in each hypothesis's evidence array —
the architectural goal of the PR.

MUI runs flaked in this matrix attempt and are absent from the
evidence; that's a matrix-runner reliability issue, not an evaluator
issue. The evaluator correctly handles whatever comparator data is
present per-slice. Investigating the matrix-runner's tanstack/mui
flake pattern is a separate follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Architecture change documenting the H6/H7/H8/H19/H20/H21 evaluator
extensions, the matrix-runner flake workaround, and the deferred
follow-ups.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
H6/H7/H8/H19/H20/H21 now embed comparator evidence in their evidence
arrays (4 entries each; 5 for H19 which also carries pretable's scroll
baseline). Pretable-only status verdicts unchanged — all six remain
satisfied.

Aggregated from today's S2 hypothesis-scale runs across
pretable/ag-grid/tanstack/mui after the matrix runner hit two
mid-run e2e flakes (preview server / locator timing); on-disk
summaries combined into a single report.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All four adapters (pretable + ag-grid + tanstack + mui) are present in
every comparator-aware evaluator's evidence array. Updates the
2026-05-12 entry to reflect the recovered matrix outcome.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@blove blove enabled auto-merge (squash) May 13, 2026 00:55

vercel Bot commented May 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project  | Deployment | Actions          | Updated (UTC)       |
| -------- | ---------- | ---------------- | ------------------- |
| pretable | Ready      | Preview, Comment | May 13, 2026 0:56am |

@blove blove merged commit 999e818 into main May 13, 2026
13 checks passed
@blove blove deleted the comparator-aware-evaluators branch May 13, 2026 00:58
@github-actions

Vercel preview ready

Preview: https://pretable-osprb3of2-cacheplane.vercel.app
Commit: d3e0fcdfba77307554154f34c42b7df96e054784

Updated automatically by the deploy-preview job.
