GocciaScript bytecode benchmarks in PRs #55
Conversation
- CI workflow now runs benchmarks in both interpreted and bytecode modes
- On main pushes, both baselines are cached separately
- PR workflow restores both baselines, runs both modes, and posts a single comparison comment with separate Interpreted and Bytecode sections
- Refactored PR comparison script into a reusable generateSection function
- Updated docs/benchmarks.md to reflect dual-mode benchmark CI

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>
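For context, a minimal sketch of how a per-mode baseline could be cached on main pushes — the step name, file path, and action version are assumptions, not the actual workflow contents:

    # hypothetical CI step: save a separate baseline for each benchmark mode
    - name: Save benchmark baseline (${{ matrix.mode }})
      if: github.ref == 'refs/heads/main'
      uses: actions/cache/save@v4
      with:
        path: benchmark-${{ matrix.mode }}-results.json
        key: benchmark-${{ matrix.mode }}-baseline-${{ github.sha }}

The PR workflow can then restore each mode's baseline by key prefix (benchmark-interpreted-baseline-*, benchmark-bytecode-baseline-*), so interpreted and bytecode results never overwrite each other.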
📝 Walkthrough

Adds parallel benchmark support for two modes (interpreted, bytecode) across CI and PR workflows: matrixed benchmark runs, per-mode artifacts and baselines, mode-aware caching/restoration, adjusted benchmark commands with mode flags and env vars, and consolidated per-mode comparison/reporting in PR comments.
Sequence Diagram(s)

sequenceDiagram
participant Dev as Developer
participant CI as CI Matrix Runners
participant Art as Artifact Store / Cache
participant PR as PR Comparison Workflow
participant Main as main branch (baseline)
Dev->>CI: push / open PR
Note over CI: Matrix spawns jobs for\nmode=interpreted and mode=bytecode
CI->>Art: Upload `benchmark-${matrix.mode}-results.json`
CI->>Art: Save mode-specific cache/baseline (on main pushes)
Dev->>PR: PR triggered
PR->>Art: Download per-mode results
PR->>Art: Restore per-mode baseline from `main` (if exists)
PR->>PR: Build per-file per-mode maps\ncompute deltas & aggregate stats
PR->>Dev: Post PR comment with per-mode summaries and per-file sections
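A sketch of the per-mode artifact handoff in the diagram above; the artifact name follows the benchmark-${matrix.mode}-results.json convention shown there, while the step layout and action versions are assumptions:

    # matrixed benchmark job: publish this mode's results
    - name: Upload benchmark results
      uses: actions/upload-artifact@v4
      with:
        name: benchmark-${{ matrix.mode }}-results
        path: benchmark-${{ matrix.mode }}-results.json

    # comparison job: collect results from both modes into one workspace
    - name: Download benchmark results
      uses: actions/download-artifact@v4
      with:
        pattern: benchmark-*-results
        merge-multiple: true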
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
- CI and PR workflows now run interpreted and bytecode benchmarks in
parallel via matrix strategy (mode: [interpreted, bytecode])
- Reduced calibration settings: CALIBRATION_MS=50, ROUNDS=3 (from 100/5)
- PR workflow split into benchmark (matrix) + benchmark-comment jobs;
results passed via artifacts, baselines restored in the comment job
- CI cache keys now use benchmark-{mode}-baseline-* naming scheme
- Updated docs/benchmarks.md to reflect parallel execution and settings
Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>
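A rough sketch of what the matrixed benchmark job described above could look like; the job/step names, the benchmark command, and its flags are assumptions based on this description, not the real pr.yml:

    benchmark:
      strategy:
        fail-fast: false          # keep the other mode running if one leg fails
        matrix:
          mode: [interpreted, bytecode]
      runs-on: ubuntu-latest
      env:
        CALIBRATION_MS: 50        # reduced from 100
        ROUNDS: 3                 # reduced from 5
      steps:
        # checkout/build steps omitted
        - name: Run benchmarks (${{ matrix.mode }})
          # hypothetical command and flag names
          run: ./run-benchmarks --mode ${{ matrix.mode }} --out benchmark-${{ matrix.mode }}-results.json

The separate benchmark-comment job then downloads both modes' result artifacts, restores the two baselines, and posts the single comparison comment.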
Benchmark Results

254 benchmarks (no baseline)

arraybuffer.js — 14 benchmarks
arrays.js — 19 benchmarks
async-await.js — 6 benchmarks
classes.js — 31 benchmarks
closures.js — 11 benchmarks
collections.js — 12 benchmarks
destructuring.js — 22 benchmarks
fibonacci.js — 8 benchmarks
for-of.js — 7 benchmarks
iterators.js — 20 benchmarks
json.js — 20 benchmarks
jsx.jsx — 21 benchmarks
numbers.js — 11 benchmarks
objects.js — 7 benchmarks
promises.js — 12 benchmarks
strings.js — 11 benchmarks
typed-arrays.js — 22 benchmarks
Measured on ubuntu-latest x64. Changes within ±7% are considered insignificant.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/ci.yml (1)
120-137: ⚠️ Potential issue | 🟠 Major

Add `fail-fast: false` to the benchmark job strategy.

The matrix now covers 12 job combinations (3 OS × 2 modes + arm64 variants), but GitHub Actions defaults to fail-fast behavior. If any non-baseline leg fails (e.g., windows-latest x64 interpreted), the remaining jobs are cancelled before the ubuntu-latest x64 legs complete and save the interpreted/bytecode baselines at lines 164–176. This leaves future PRs comparing against stale cached results.
Suggested change
  benchmark:
    needs: build
    strategy:
+     fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        arch: [x64]
        mode: [interpreted, bytecode]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci.yml around lines 120 - 137, The benchmark job's matrix uses strategy.matrix and should disable fail-fast to avoid cancelling remaining legs (which prevents baseline artifacts at the interpreted/bytecode ubuntu-latest x64 legs from being saved); update the CI workflow by adding fail-fast: false under the same strategy block that contains matrix (the strategy used for the benchmark job in .github/workflows/ci.yml) so all matrix combinations run to completion even if some fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/pr.yml:
- Around line 191-243: Benchmarks with errors are being filtered out (see
benches = file.benchmarks.filter(b => !b.error)), so failing or missing PR
baselines vanish from the report; stop filtering errors, emit explicit rows for
errored benchmarks using the existing key format (`${bench.suite} >
${bench.name}`) labeled e.g. "failed" and increment the appropriate counters and
totalBenchmarks, and when hasBase is true do a second pass over baseline keys in
baseMap that weren't seen to emit "missing" rows and increment a missing counter
(don’t push these into allChanges or compute percent change); update the table
row generation logic (the blocks that build fileRows and the counts
fileNew/fileRegressed/fileImproved/fileUnchanged) to include these "failed" and
"missing" cases so they appear in the summary and tables.
---
Outside diff comments:
In @.github/workflows/ci.yml:
- Around line 120-137: The benchmark job's matrix uses strategy.matrix and
should disable fail-fast to avoid cancelling remaining legs (which prevents
baseline artifacts at the interpreted/bytecode ubuntu-latest x64 legs from being
saved); update the CI workflow by adding fail-fast: false under the same
strategy block that contains matrix (the strategy used for the benchmark job in
.github/workflows/ci.yml) so all matrix combinations run to completion even if
some fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1335a591-a57a-4ca9-a324-db38f4a68cec
📒 Files selected for processing (3)
- .github/workflows/ci.yml
- .github/workflows/pr.yml
- docs/benchmarks.md
for (const file of current.files) {
  const fileName = file.file;
  const baseMap = baselineByFile[fileName] || {};
  const hasBase = baseline && Object.keys(baseMap).length > 0;

  let fileImproved = 0;
  let fileRegressed = 0;
  let fileNew = 0;
  let fileUnchanged = 0;
  let fileRows = '';
  const fileChanges = [];

  const benches = file.benchmarks.filter(b => !b.error);
  totalBenchmarks += benches.length;

  if (hasBase) {
    fileRows += '| Benchmark | Base (ops/sec) | PR (ops/sec) | Change |\n';
    fileRows += '|-----------|---------------|-------------|--------|\n';

    for (const bench of benches) {
      const key = `${bench.suite} > ${bench.name}`;
      const ops = bench.opsPerSec;
      const baseOps = baseMap[key];

      if (baseOps && baseOps > 0) {
        const change = ((ops - baseOps) / baseOps) * 100;
        fileChanges.push(change);
        allChanges.push(change);
        let indicator;
        if (change > THRESHOLD) {
          indicator = `🟢 +${change.toFixed(1)}%`;
          fileImproved++;
        } else if (change < -THRESHOLD) {
          indicator = `🔴 ${change.toFixed(1)}%`;
          fileRegressed++;
        } else {
          indicator = `${change >= 0 ? '+' : ''}${change.toFixed(1)}%`;
          fileUnchanged++;
        }
        fileRows += `| ${bench.name} | ${fmtOps(baseOps)} | ${fmtOps(ops)} | ${indicator} |\n`;
      } else {
        fileRows += `| ${bench.name} | — | ${fmtOps(ops)} | 🆕 new |\n`;
        fileNew++;
      }
    }
  } else {
    fileRows += '| Benchmark | ops/sec |\n';
    fileRows += '|-----------|--------|\n';
    for (const bench of benches) {
      fileRows += `| ${bench.name} | ${fmtOps(bench.opsPerSec)} |\n`;
      fileNew++;
    }
  }
Report failed or missing benchmarks instead of dropping them.
The reporter already emits per-benchmark error fields, but Line 203 filters them out and Lines 191-243 only walk the PR result set. A benchmark that starts failing or stops being emitted therefore vanishes from the comment instead of surfacing as a regression, so the section can look healthy while the benchmark suite is actually broken. Please emit explicit failed/missing rows and count them in the summary; using the existing suite > name key as the row label will also keep duplicate benchmark names unambiguous.
🧪 Suggested handling
- const benches = file.benchmarks.filter(b => !b.error);
+ const benches = file.benchmarks;
+ const seen = new Set();
totalBenchmarks += benches.length;
if (hasBase) {
fileRows += '| Benchmark | Base (ops/sec) | PR (ops/sec) | Change |\n';
fileRows += '|-----------|---------------|-------------|--------|\n';
for (const bench of benches) {
const key = `${bench.suite} > ${bench.name}`;
+ seen.add(key);
+
+ if (bench.error) {
+ fileRows += `| ${key} | ${baseMap[key] ? fmtOps(baseMap[key]) : '—'} | error | 🔴 failed |\n`;
+ fileRegressed++;
+ continue;
+ }
const ops = bench.opsPerSec;
const baseOps = baseMap[key];

Then do a second pass over unseen baseMap entries after the loop so baseline-only benchmarks are rendered as missing instead of disappearing silently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/pr.yml around lines 191 - 243, Benchmarks with errors are
being filtered out (see benches = file.benchmarks.filter(b => !b.error)), so
failing or missing PR baselines vanish from the report; stop filtering errors,
emit explicit rows for errored benchmarks using the existing key format
(`${bench.suite} > ${bench.name}`) labeled e.g. "failed" and increment the
appropriate counters and totalBenchmarks, and when hasBase is true do a second
pass over baseline keys in baseMap that weren't seen to emit "missing" rows and
increment a missing counter (don’t push these into allChanges or compute percent
change); update the table row generation logic (the blocks that build fileRows
and the counts fileNew/fileRegressed/fileImproved/fileUnchanged) to include
these "failed" and "missing" cases so they appear in the summary and tables.
PR comment now shows interpreted and bytecode results side by side in one table per benchmark file, instead of two separate sections. Each row displays both modes' ops/sec and their change from baseline.

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>
♻️ Duplicate comments (1)
.github/workflows/pr.yml (1)
181-197: ⚠️ Potential issue | 🟠 Major

Don't drop failed or missing benchmarks from the PR report.
`buildMap()` and `benches.filter(b => !b.error)` remove errored entries entirely, and the outer loop only walks `interpData.files`. That means a benchmark/file that fails, disappears, or is only emitted in bytecode never shows up in the table or totals, so the PR comment can look healthy while coverage regressed. Please build rows from the union of current+baseline keys for both modes and render explicit `failed`/`missing` states; using the existing `${suite} > ${name}` key as the row label also avoids ambiguous duplicate names.

Also applies to: 203-257
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/pr.yml around lines 181 - 197, buildMap currently filters out errored benchmarks and only iterates interpData.files, which drops failed/missing entries and files only present in bytecode data; change buildMap to collect entries from the union of files and benchmark keys across both baseline and current datasets (e.g., interpBaseline, interpData for interp mode and bytecodeBaseline, bytecodeData for bytecode mode), do not discard items with b.error but instead record an explicit status/value (e.g., "failed" or "missing") for each `${b.suite} > ${b.name}` key, and ensure the outer loop iterates the union of file keys so every file/benchmark appears in the final maps used for rendering and totals (refer to buildMap, interpBaseline, bytecodeBaseline, bytecodeData, interpData).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/pr.yml:
- Around line 181-197: buildMap currently filters out errored benchmarks and
only iterates interpData.files, which drops failed/missing entries and files
only present in bytecode data; change buildMap to collect entries from the union
of files and benchmark keys across both baseline and current datasets (e.g.,
interpBaseline, interpData for interp mode and bytecodeBaseline, bytecodeData
for bytecode mode), do not discard items with b.error but instead record an
explicit status/value (e.g., "failed" or "missing") for each `${b.suite} >
${b.name}` key, and ensure the outer loop iterates the union of file keys so
every file/benchmark appears in the final maps used for rendering and totals
(refer to buildMap, interpBaseline, bytecodeBaseline, bytecodeData, interpData).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 71750764-70c9-4c2a-9c19-bd4d21237ef1
📒 Files selected for processing (2)
- .github/workflows/pr.yml
- docs/benchmarks.md
Each Interpreted/Bytecode cell now displays 'base → PR' ops/sec (e.g. '287 → 277') when a baseline exists, making it easy to see both the absolute values and the relative change at a glance.

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>
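For illustration only, a row in such a unified table might read roughly like this; the benchmark name, column headers, and the Bytecode numbers are placeholders, and 287 → 277 is the example from the commit message (about -3.5%, inside the ±7% insignificance band):

| Benchmark | Interpreted | Bytecode |
|-----------|-------------|----------|
| strings > concat | 287 → 277 (-3.5%) | 1,020 → 1,050 (+2.9%) |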
♻️ Duplicate comments (1)
.github/workflows/pr.yml (1)
185-191: ⚠️ Potential issue | 🟠 Major

Don't let failed or missing benchmarks disappear from the unified report.
Line 191 drops errored entries from `buildMap`, Line 214 filters interpreted errors out again, and the loop starting on Line 207 only walks interpreted current results. That makes interpreted failures, bytecode failures, and baseline-only / bytecode-only benchmarks vanish from the PR comment instead of surfacing as `failed` or `missing`, so the table can still look healthy while the suite is broken. Build the row set from the union of interpreted current, bytecode current, and both baseline key sets, and render `${suite} > ${name}` as the row label so duplicates stay unambiguous.

🛠️ Suggested direction
+ const interpCurrentMap = buildMap(interpData);
  const interpBaseMap = buildMap(interpBaseline);
  const bytecodeBaseMap = buildMap(bytecodeBaseline);
  const bytecodeCurrentMap = buildMap(bytecodeData);

  function buildMap(data) {
    const map = {};
    if (!data) return map;
    for (const file of data.files) {
      const m = {};
      for (const b of file.benchmarks) {
-       if (!b.error) m[`${b.suite} > ${b.name}`] = b.opsPerSec;
+       m[`${b.suite} > ${b.name}`] = {
+         opsPerSec: b.opsPerSec,
+         error: b.error ?? null,
+       };
      }
      map[file.file] = m;
    }
    return map;
  }

- for (const file of interpData.files) {
-   const benches = file.benchmarks.filter(b => !b.error);
+ for (const fileName of new Set([
+   ...Object.keys(interpCurrentMap),
+   ...Object.keys(bytecodeCurrentMap),
+   ...Object.keys(interpBaseMap),
+   ...Object.keys(bytecodeBaseMap),
+ ])) {
+   const keys = new Set([
+     ...Object.keys(interpCurrentMap[fileName] ?? {}),
+     ...Object.keys(bytecodeCurrentMap[fileName] ?? {}),
+     ...Object.keys(interpBaseMap[fileName] ?? {}),
+     ...Object.keys(bytecodeBaseMap[fileName] ?? {}),
+   ]);
+
+   for (const key of keys) {
+     // use `${key}` as the row label
+     // emit explicit `failed` / `missing` cells per mode
+     // count those cases in the file and overall summaries
+   }
  }

Also applies to: 207-257
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/pr.yml around lines 185 - 191, The buildMap function currently drops benchmarks where b.error is truthy, which causes errored/missing benchmarks to vanish; update the data collection so the final row set is the union of interpreted current, bytecode current, and both baseline key sets (collect keys from data.files[].benchmarks for interpreted and bytecode runs and from the baseline maps) and use the explicit row label format `${suite} > ${name}` when building the union to keep entries unique; ensure errored entries are represented (e.g., marked as "failed" or "missing" in the downstream rendering) instead of being omitted by removing the unconditional `if (!b.error)` filter inside buildMap and by including entries with errors when assembling the rows in the loop that currently iterates only interpreted current results.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/pr.yml:
- Around line 185-191: The buildMap function currently drops benchmarks where
b.error is truthy, which causes errored/missing benchmarks to vanish; update the
data collection so the final row set is the union of interpreted current,
bytecode current, and both baseline key sets (collect keys from
data.files[].benchmarks for interpreted and bytecode runs and from the baseline
maps) and use the explicit row label format `${suite} > ${name}` when building
the union to keep entries unique; ensure errored entries are represented (e.g.,
marked as "failed" or "missing" in the downstream rendering) instead of being
omitted by removing the unconditional `if (!b.error)` filter inside buildMap and
by including entries with errors when assembling the rows in the loop that
currently iterates only interpreted current results.
Enable GocciaScript bytecode benchmarks in CI, including baseline caching and PR comparisons, mirroring the existing interpreted benchmarks.
Summary by CodeRabbit
New Features
Documentation