
GocciaScript bytecode benchmarks in PRs #55

Merged

frostney merged 4 commits into `main` from `cursor/gocciascript-bytecode-benchmarks-5d37` on Mar 9, 2026

Conversation

@frostney frostney commented Mar 8, 2026

Enable GocciaScript bytecode benchmarks in CI, including baseline caching and PR comparisons, mirroring the existing interpreted benchmarks.



Summary by CodeRabbit

  • New Features

    • Benchmarks now run in two modes (interpreted and bytecode) with parallel per-mode runs, per-mode baselines, and per-file per-mode comparisons.
    • PR comments present combined per-file sections and aggregated per-mode summaries with clear indicators for improvements, regressions, unchanged, and new benchmarks.
  • Documentation

    • Updated benchmarking docs to describe multi-mode CI, parallel execution, per-mode baselines, per-file tables, and revised PR comparison format.

- CI workflow now runs benchmarks in both interpreted and bytecode modes
- On main pushes, both baselines are cached separately
- PR workflow restores both baselines, runs both modes, and posts a
  single comparison comment with separate Interpreted and Bytecode sections
- Refactored PR comparison script into a reusable generateSection function
- Updated docs/benchmarks.md to reflect dual-mode benchmark CI
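
The matrix and mode-specific caching described in the bullets above could be sketched roughly as follows. This is a minimal sketch, not the actual workflow: the step names, file paths, cache action versions, and the exact `--mode=bytecode` flag wiring are assumptions based on the description.

```yaml
benchmark:
  needs: build
  strategy:
    matrix:
      os: [ubuntu-latest, macos-latest, windows-latest]
      arch: [x64]
      mode: [interpreted, bytecode]
  runs-on: ${{ matrix.os }}
  steps:
    - name: Run benchmarks (reduced calibration for CI)
      env:
        GOCCIA_BENCH_CALIBRATION_MS: "50"
        GOCCIA_BENCH_ROUNDS: "3"
      # hypothetical invocation: only the bytecode leg passes --mode=bytecode
      run: >
        ./goccia bench
        ${{ matrix.mode == 'bytecode' && '--mode=bytecode' || '' }}
        --json benchmark-${{ matrix.mode }}-results.json
    - name: Save per-mode baseline (main pushes only)
      if: github.ref == 'refs/heads/main' && matrix.os == 'ubuntu-latest'
      uses: actions/cache/save@v4
      with:
        path: benchmark-${{ matrix.mode }}-results.json
        key: benchmark-${{ matrix.mode }}-baseline-${{ github.sha }}
```

The PR workflow would then restore both `benchmark-interpreted-baseline-*` and `benchmark-bytecode-baseline-*` keys before building the comparison comment.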

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>

cursor Bot commented Mar 8, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

@frostney frostney marked this pull request as ready for review March 8, 2026 04:47

coderabbitai Bot commented Mar 8, 2026

📝 Walkthrough

Adds parallel benchmark support for two modes (interpreted, bytecode) across CI and PR workflows: matrixed benchmark runs, per-mode artifacts and baselines, mode-aware caching/restoration, adjusted benchmark commands with mode flags and env vars, and consolidated per-mode comparison/reporting in PR comments. (40 words)

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI workflow<br>`.github/workflows/ci.yml` | Adds `matrix.mode` entries including interpreted and bytecode (with arm64 rows); per-job matrix steps conditionally include `--mode=bytecode`; adds `GOCCIA_BENCH_CALIBRATION_MS` and `GOCCIA_BENCH_ROUNDS`; produces mode-specific artifacts like `benchmark-${matrix.mode}-results.json` and mode-specific cache keys. |
| PR workflow / Comparison logic<br>`.github/workflows/pr.yml` | Reworks the PR job into a per-mode flow: downloads per-mode results, restores per-mode baselines, checks baseline existence per mode, computes per-file per-mode deltas, aggregates per-mode stats (improved/regressed/unchanged/new), and renders a unified PR comment with both Interpreted and Bytecode sections. |
| Documentation<br>`docs/benchmarks.md` | Updates docs to explain parallel multi-mode CI runs, reduced per-mode calibration for CI, caching/storing per-mode JSON baselines on main, and the new PR comparison layout showing per-file rows with both modes and updated indicator/threshold semantics. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant Dev as Developer
  participant CI as CI Matrix Runners
  participant Art as Artifact Store / Cache
  participant PR as PR Comparison Workflow
  participant Main as main branch (baseline)

  Dev->>CI: push / open PR
  Note over CI: Matrix spawns jobs for\nmode=interpreted and mode=bytecode
  CI->>Art: Upload benchmark-${matrix.mode}-results.json
  CI->>Art: Save mode-specific cache/baseline (on main pushes)
  Dev->>PR: PR triggered
  PR->>Art: Download per-mode results
  PR->>Art: Restore per-mode baseline from main (if exists)
  PR->>PR: Build per-file per-mode maps\ncompute deltas & aggregate stats
  PR->>Dev: Post PR comment with per-mode summaries and per-file sections
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 I hopped through matrices, two modes to test,
Interpreted hums and bytecode dressed,
Baselines tucked on main with care,
Results compared with rabbit flair,
A tiny cheer — the benchmarks progress!

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'GocciaScript bytecode benchmarks in PRs' clearly describes the main objective: adding bytecode benchmark support to PR comparisons, which aligns with the primary changes across CI/PR workflows and documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@frostney frostney changed the title GocciaScript bytecode benchmarks GocciaScript bytecode benchmarks in PRs Mar 8, 2026
- CI and PR workflows now run interpreted and bytecode benchmarks in
  parallel via matrix strategy (mode: [interpreted, bytecode])
- Reduced calibration settings: CALIBRATION_MS=50, ROUNDS=3 (from 100/5)
- PR workflow split into benchmark (matrix) + benchmark-comment jobs;
  results passed via artifacts, baselines restored in the comment job
- CI cache keys now use benchmark-{mode}-baseline-* naming scheme
- Updated docs/benchmarks.md to reflect parallel execution and settings

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>

github-actions Bot commented Mar 8, 2026

Benchmark Results

254 benchmarks (no baseline)

**arraybuffer.js — 14 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| create ArrayBuffer(0) | 431,022 | 149,449 |
| create ArrayBuffer(64) | 424,869 | 145,288 |
| create ArrayBuffer(1024) | 327,842 | 132,818 |
| create ArrayBuffer(8192) | 144,944 | 83,426 |
| slice full buffer (64 bytes) | 522,386 | 374,863 |
| slice half buffer (512 of 1024 bytes) | 445,853 | 333,104 |
| slice with negative indices | 457,321 | 370,124 |
| slice empty range | 517,905 | 381,760 |
| byteLength access | 1,539,382 | 1,088,837 |
| Symbol.toStringTag access | 1,166,915 | 518,361 |
| ArrayBuffer.isView | 753,242 | 482,465 |
| clone ArrayBuffer(64) | 399,454 | 324,636 |
| clone ArrayBuffer(1024) | 309,656 | 254,607 |
| clone ArrayBuffer inside object | 273,566 | 150,544 |

**arrays.js — 19 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| Array.from length 100 | 13,772 | 13,121 |
| Array.from 10 elements | 236,151 | 163,172 |
| Array.of 10 elements | 317,247 | 222,737 |
| spread into new array | 337,733 | 572,533 |
| map over 50 elements | 26,852 | 22,579 |
| filter over 50 elements | 23,060 | 20,853 |
| reduce sum 50 elements | 26,790 | 18,961 |
| forEach over 50 elements | 22,453 | 23,281 |
| find in 50 elements | 34,362 | 27,834 |
| sort 20 elements | 12,072 | 3,376 |
| flat nested array | 122,370 | 277,493 |
| flatMap | 75,050 | 187,583 |
| map inside map (5x5) | 22,384 | 68,499 |
| filter inside map (5x10) | 16,454 | 13,540 |
| reduce inside map (5x10) | 19,444 | 13,899 |
| forEach inside forEach (5x10) | 17,100 | 15,153 |
| find inside some (10x10) | 13,751 | 10,838 |
| map+filter chain nested (5x20) | 5,414 | 4,444 |
| reduce flatten (10x5) | 40,620 | 4,891 |

**async-await.js — 6 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| single await | 338,777 | 248,223 |
| multiple awaits | 154,618 | 101,802 |
| await non-Promise value | 741,138 | 697,855 |
| await with try/catch | 335,793 | 230,817 |
| await Promise.all | 48,927 | 41,177 |
| nested async function call | 175,020 | 178,374 |

**classes.js — 31 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| simple class new | 115,498 | 353,873 |
| class with defaults | 92,367 | 244,493 |
| 50 instances via Array.from | 5,597 | 6,361 |
| instance method call | 59,693 | 161,239 |
| static method call | 91,747 | 381,304 |
| single-level inheritance | 46,550 | 156,533 |
| two-level inheritance | 39,819 | 127,759 |
| private field access | 59,981 | 174,969 |
| private methods | 64,165 | 218,435 |
| getter/setter access | 67,768 | 174,639 |
| class decorator (identity) | 83,338 | 53,414 |
| class decorator (wrapping) | 48,194 | 0 |
| identity method decorator | 60,290 | 43,771 |
| wrapping method decorator | 49,481 | 40,098 |
| stacked method decorators (x3) | 34,446 | 28,167 |
| identity field decorator | 68,431 | 46,246 |
| field initializer decorator | 57,880 | 41,070 |
| getter decorator (identity) | 59,888 | 41,187 |
| setter decorator (identity) | 53,705 | 35,486 |
| static method decorator | 64,618 | 62,194 |
| static field decorator | 74,644 | 65,546 |
| private method decorator | 51,356 | 37,556 |
| private field decorator | 55,645 | 38,630 |
| plain auto-accessor (no decorator) | 94,619 | 53,631 |
| auto-accessor with decorator | 54,351 | 37,099 |
| decorator writing metadata | 47,040 | 40,617 |
| static getter read | 111,312 | 398,743 |
| static getter/setter pair | 85,526 | 211,564 |
| inherited static getter | 59,285 | 269,598 |
| inherited static setter | 66,780 | 212,628 |
| inherited static getter with this binding | 57,478 | 156,273 |

**closures.js — 11 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| closure over single variable | 140,251 | 620,317 |
| closure over multiple variables | 131,443 | 401,604 |
| nested closures | 135,364 | 578,235 |
| function as argument | 104,868 | 503,470 |
| function returning function | 130,043 | 620,478 |
| compose two functions | 74,338 | 364,720 |
| fn.call | 166,307 | 134,391 |
| fn.apply | 125,645 | 92,102 |
| fn.bind | 156,167 | 147,469 |
| recursive sum to 50 | 13,127 | 38,626 |
| recursive tree traversal | 21,234 | 65,008 |

**collections.js — 12 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| add 50 elements | 7,665 | 5,781 |
| has lookup (50 elements) | 93,283 | 90,128 |
| delete elements | 49,550 | 38,532 |
| forEach iteration | 17,608 | 15,582 |
| spread to array | 31,243 | 36,588 |
| deduplicate array | 46,095 | 32,761 |
| set 50 entries | 5,898 | 25,771 |
| get lookup (50 entries) | 94,475 | 102,785 |
| has check | 141,392 | 159,134 |
| delete entries | 49,967 | 36,305 |
| forEach iteration | 17,581 | 15,908 |
| keys/values/entries | 8,393 | 7,613 |

**destructuring.js — 22 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| simple array destructuring | 431,879 | 619,333 |
| with rest element | 298,136 | 463,910 |
| with defaults | 463,227 | 649,528 |
| skip elements | 467,144 | 716,860 |
| nested array destructuring | 204,035 | 357,134 |
| swap variables | 600,807 | 1,016,176 |
| simple object destructuring | 321,684 | 534,029 |
| with defaults | 398,867 | 336,366 |
| with renaming | 333,508 | 596,334 |
| nested object destructuring | 148,009 | 262,370 |
| rest properties | 196,325 | 231,656 |
| object parameter | 101,945 | 192,334 |
| array parameter | 127,602 | 326,161 |
| mixed destructuring in map | 39,338 | 34,202 |
| forEach with array destructuring | 76,992 | 135,496 |
| map with array destructuring | 76,719 | 164,487 |
| filter with array destructuring | 80,818 | 191,457 |
| reduce with array destructuring | 80,109 | 179,342 |
| map with object destructuring | 87,850 | 74,118 |
| map with nested destructuring | 73,979 | 60,043 |
| map with rest in destructuring | 44,428 | 46,533 |
| map with defaults in destructuring | 67,372 | 39,323 |

**fibonacci.js — 8 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| recursive fib(15) | 355 | 1,129 |
| recursive fib(20) | 28 | 102 |
| recursive fib(15) typed | 352 | 1,434 |
| recursive fib(20) typed | 28 | 130 |
| iterative fib(20) via reduce | 12,782 | 17,269 |
| iterator fib(20) | 10,614 | 10,958 |
| iterator fib(20) via Iterator.from + take | 9,600 | 11,026 |
| iterator fib(20) last value via reduce | 8,779 | 7,751 |

**for-of.js — 7 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| for...of with 10-element array | 52,161 | 63,083 |
| for...of with 100-element array | 5,651 | 6,668 |
| for...of with string (10 chars) | 39,896 | 58,816 |
| for...of with Set (10 elements) | 56,389 | 56,686 |
| for...of with Map entries (10 entries) | 35,333 | 11,944 |
| for...of with destructuring | 47,419 | 52,263 |
| for-await-of with sync array | 52,574 | 49,002 |

**iterators.js — 20 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| Iterator.from({next}).toArray() — 20 elements | 13,068 | 12,181 |
| Iterator.from({next}).toArray() — 50 elements | 5,339 | 5,263 |
| spread pre-wrapped iterator — 20 elements | 12,771 | 13,155 |
| Iterator.from({next}).forEach — 50 elements | 4,392 | 4,097 |
| Iterator.from({next}).reduce — 50 elements | 4,297 | 4,042 |
| wrap array iterator | 63,184 | 55,080 |
| wrap plain {next()} object | 8,344 | 8,825 |
| map + toArray (50 elements) | 3,723 | 3,994 |
| filter + toArray (50 elements) | 3,741 | 3,746 |
| take(10) + toArray (50 element source) | 20,578 | 17,776 |
| drop(40) + toArray (50 element source) | 5,486 | 5,700 |
| chained map + filter + take (100 element source) | 6,624 | 6,140 |
| some + every (50 elements) | 2,513 | 2,466 |
| find (50 elements) | 5,465 | 5,689 |
| array.values().map().filter().toArray() | 4,725 | 4,926 |
| array.values().take(5).toArray() | 73,238 | 62,728 |
| array.values().drop(45).toArray() | 20,779 | 23,674 |
| map.entries() chained helpers | 6,228 | 2,429 |
| set.values() chained helpers | 10,406 | 10,298 |
| string iterator map + toArray | 9,543 | 9,671 |

**json.js — 20 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| parse simple object | 175,713 | 127,148 |
| parse nested object | 121,535 | 81,964 |
| parse array of objects | 64,872 | 51,563 |
| parse large flat object | 52,267 | 47,640 |
| parse mixed types | 79,659 | 66,116 |
| stringify simple object | 160,194 | 135,345 |
| stringify nested object | 87,398 | 71,920 |
| stringify array of objects | 39,478 | 35,571 |
| stringify mixed types | 74,886 | 64,879 |
| reviver doubles numbers | 49,071 | 45,917 |
| reviver filters properties | 39,490 | 46,796 |
| reviver on nested object | 51,737 | 52,680 |
| reviver on array | 31,015 | 29,662 |
| replacer function doubles numbers | 49,723 | 50,104 |
| replacer function excludes properties | 63,444 | 58,268 |
| array replacer (allowlist) | 116,705 | 98,473 |
| stringify with 2-space indent | 84,257 | 71,491 |
| stringify with tab indent | 87,908 | 70,195 |
| parse then stringify | 50,578 | 45,851 |
| stringify then parse | 30,630 | 27,906 |

**jsx.jsx — 21 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| simple element | 225,489 | 622,357 |
| self-closing element | 230,121 | 621,945 |
| element with string attribute | 176,942 | 428,118 |
| element with multiple attributes | 152,232 | 374,208 |
| element with expression attribute | 168,484 | 460,758 |
| text child | 218,228 | 642,854 |
| expression child | 223,674 | 611,000 |
| mixed text and expression | 212,739 | 521,276 |
| nested elements (3 levels) | 83,379 | 227,756 |
| sibling children | 61,154 | 184,461 |
| component element | 160,776 | 444,752 |
| component with children | 94,827 | 274,526 |
| dotted component | 135,016 | 329,709 |
| empty fragment | 216,567 | 664,087 |
| fragment with children | 61,275 | 185,366 |
| spread attributes | 118,053 | 109,735 |
| spread with overrides | 106,383 | 81,225 |
| shorthand props | 172,317 | 427,327 |
| nav bar structure | 28,542 | 82,727 |
| card component tree | 34,516 | 89,495 |
| 10 list items via Array.from | 13,872 | 23,479 |

**numbers.js — 11 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| integer arithmetic | 558,502 | 1,480,463 |
| floating point arithmetic | 592,641 | 1,581,775 |
| number coercion | 192,212 | 135,491 |
| toFixed | 114,558 | 215,976 |
| toString | 180,238 | 678,785 |
| valueOf | 256,068 | 938,900 |
| toPrecision | 164,108 | 394,815 |
| Number.isNaN | 304,612 | 175,834 |
| Number.isFinite | 296,795 | 169,531 |
| Number.isInteger | 298,328 | 186,000 |
| Number.parseInt and parseFloat | 257,717 | 155,491 |

**objects.js — 7 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| create simple object | 500,926 | 846,474 |
| create nested object | 217,662 | 382,988 |
| create 50 objects via Array.from | 10,049 | 8,312 |
| property read | 686,702 | 662,639 |
| Object.keys | 323,508 | 210,356 |
| Object.entries | 101,841 | 65,253 |
| spread operator | 197,430 | 230,790 |

**promises.js — 12 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| Promise.resolve(value) | 527,444 | 360,628 |
| new Promise(resolve => resolve(value)) | 194,978 | 161,646 |
| Promise.reject(reason) | 562,180 | 338,848 |
| resolve + then (1 handler) | 192,102 | 126,538 |
| resolve + then chain (3 deep) | 73,730 | 57,628 |
| resolve + then chain (10 deep) | 22,896 | 19,176 |
| reject + catch + then | 97,533 | 76,989 |
| resolve + finally + then | 85,662 | 71,125 |
| Promise.all (5 resolved) | 29,837 | 27,580 |
| Promise.race (5 resolved) | 34,522 | 29,248 |
| Promise.allSettled (5 mixed) | 28,120 | 23,441 |
| Promise.any (5 mixed) | 30,892 | 27,040 |

**strings.js — 11 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| string concatenation | 412,849 | 407,321 |
| template literal | 452,185 | 728,587 |
| string repeat | 443,303 | 1,000,724 |
| split and join | 138,859 | 245,534 |
| indexOf and includes | 190,787 | 550,852 |
| toUpperCase and toLowerCase | 282,006 | 686,614 |
| slice and substring | 169,888 | 641,275 |
| trim operations | 213,411 | 718,456 |
| replace and replaceAll | 236,252 | 620,985 |
| startsWith and endsWith | 139,717 | 476,544 |
| padStart and padEnd | 207,169 | 551,439 |

**typed-arrays.js — 22 benchmarks**

| Benchmark | Interpreted | Bytecode |
|---|---:|---:|
| new Int32Array(0) | 371,798 | 130,409 |
| new Int32Array(100) | 321,383 | 124,302 |
| new Int32Array(1000) | 169,844 | 74,471 |
| new Float64Array(100) | 295,244 | 113,726 |
| Int32Array.from([...]) | 156,018 | 52,327 |
| Int32Array.of(1, 2, 3, 4, 5) | 317,539 | 258,144 |
| sequential write 100 elements | 3,885 | 7,032 |
| sequential read 100 elements | 4,242 | 5,915 |
| Float64Array write 100 elements | 3,792 | 6,754 |
| fill(42) | 45,281 | 43,752 |
| slice() | 224,734 | 193,749 |
| map(x => x * 2) | 9,424 | 7,613 |
| filter(x => x > 50) | 9,531 | 7,728 |
| reduce (sum) | 9,048 | 6,860 |
| sort() | 177,218 | 147,392 |
| indexOf() | 465,453 | 348,610 |
| reverse() | 348,016 | 275,779 |
| create view over existing buffer | 449,105 | 140,819 |
| subarray() | 520,610 | 376,530 |
| set() from array | 624,598 | 263,628 |
| for-of loop | 5,917 | 8,174 |
| spread into array | 18,175 | 19,606 |

Measured on ubuntu-latest x64. Changes within ±7% are considered insignificant.
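
The ±7% rule can be expressed as a small classifier. This is a hedged sketch: the `THRESHOLD` value and indicator strings mirror the comparison script quoted later in the review thread, but the `classify` function name and shape are made up for illustration.

```javascript
const THRESHOLD = 7; // percent; changes within ±7% are treated as noise

// Classify a PR result against its baseline and format the indicator cell.
function classify(baseOps, prOps) {
  if (!baseOps || baseOps <= 0) return { status: 'new', indicator: '🆕 new' };
  const change = ((prOps - baseOps) / baseOps) * 100;
  if (change > THRESHOLD) return { status: 'improved', indicator: `🟢 +${change.toFixed(1)}%` };
  if (change < -THRESHOLD) return { status: 'regressed', indicator: `🔴 ${change.toFixed(1)}%` };
  return { status: 'unchanged', indicator: `${change >= 0 ? '+' : ''}${change.toFixed(1)}%` };
}

console.log(classify(431022, 149449).status); // regressed
console.log(classify(337733, 572533).status); // improved
console.log(classify(100000, 103000).status); // unchanged
```

With no baseline restored (the "254 benchmarks (no baseline)" case above), every row falls into the `new` branch and only raw ops/sec is shown.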


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/ci.yml (1)

120-137: ⚠️ Potential issue | 🟠 Major

Add fail-fast: false to the benchmark job strategy.

The matrix now covers 12 job combinations (3 OS × 2 modes + arm64 variants), but GitHub Actions defaults to fail-fast behavior. If any non-baseline leg fails (e.g., windows-latest x64 interpreted), the remaining jobs are cancelled before the ubuntu-latest x64 legs complete and save the interpreted/bytecode baselines at lines 164–176. This leaves future PRs comparing against stale cached results.

Suggested change:

```diff
   benchmark:
     needs: build
     strategy:
+      fail-fast: false
       matrix:
         os: [ubuntu-latest, macos-latest, windows-latest]
         arch: [x64]
         mode: [interpreted, bytecode]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 120 - 137, The benchmark job's matrix
uses strategy.matrix and should disable fail-fast to avoid cancelling remaining
legs (which prevents baseline artifacts at the interpreted/bytecode
ubuntu-latest x64 legs from being saved); update the CI workflow by adding
fail-fast: false under the same strategy block that contains matrix (the
strategy used for the benchmark job in .github/workflows/ci.yml) so all matrix
combinations run to completion even if some fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1335a591-a57a-4ca9-a324-db38f4a68cec

📥 Commits

Reviewing files that changed from the base of the PR and between b6ad483 and 047d4fc.

📒 Files selected for processing (3)
  • .github/workflows/ci.yml
  • .github/workflows/pr.yml
  • docs/benchmarks.md

Comment thread .github/workflows/pr.yml Outdated
Comment on lines 191 to 243
```js
for (const file of current.files) {
  const fileName = file.file;
  const baseMap = baselineByFile[fileName] || {};
  const hasBase = baseline && Object.keys(baseMap).length > 0;

  let fileImproved = 0;
  let fileRegressed = 0;
  let fileNew = 0;
  let fileUnchanged = 0;
  let fileRows = '';
  const fileChanges = [];

  const benches = file.benchmarks.filter(b => !b.error);
  totalBenchmarks += benches.length;

  if (hasBase) {
    fileRows += '| Benchmark | Base (ops/sec) | PR (ops/sec) | Change |\n';
    fileRows += '|-----------|---------------|-------------|--------|\n';

    for (const bench of benches) {
      const key = `${bench.suite} > ${bench.name}`;
      const ops = bench.opsPerSec;
      const baseOps = baseMap[key];

      if (baseOps && baseOps > 0) {
        const change = ((ops - baseOps) / baseOps) * 100;
        fileChanges.push(change);
        allChanges.push(change);
        let indicator;
        if (change > THRESHOLD) {
          indicator = `🟢 +${change.toFixed(1)}%`;
          fileImproved++;
        } else if (change < -THRESHOLD) {
          indicator = `🔴 ${change.toFixed(1)}%`;
          fileRegressed++;
        } else {
          indicator = `${change >= 0 ? '+' : ''}${change.toFixed(1)}%`;
          fileUnchanged++;
        }
        fileRows += `| ${bench.name} | ${fmtOps(baseOps)} | ${fmtOps(ops)} | ${indicator} |\n`;
      } else {
        fileRows += `| ${bench.name} | — | ${fmtOps(ops)} | 🆕 new |\n`;
        fileNew++;
      }
    }
  } else {
    fileRows += '| Benchmark | ops/sec |\n';
    fileRows += '|-----------|--------|\n';
    for (const bench of benches) {
      fileRows += `| ${bench.name} | ${fmtOps(bench.opsPerSec)} |\n`;
      fileNew++;
    }
  }
```

⚠️ Potential issue | 🟠 Major

Report failed or missing benchmarks instead of dropping them.

The reporter already emits per-benchmark error fields, but Line 203 filters them out and Lines 191-243 only walk the PR result set. A benchmark that starts failing or stops being emitted therefore vanishes from the comment instead of surfacing as a regression, so the section can look healthy while the benchmark suite is actually broken. Please emit explicit failed/missing rows and count them in the summary; using the existing suite > name key as the row label will also keep duplicate benchmark names unambiguous.

🧪 Suggested handling:

```diff
-                const benches = file.benchmarks.filter(b => !b.error);
+                const benches = file.benchmarks;
+                const seen = new Set();
                 totalBenchmarks += benches.length;

                 if (hasBase) {
                   fileRows += '| Benchmark | Base (ops/sec) | PR (ops/sec) | Change |\n';
                   fileRows += '|-----------|---------------|-------------|--------|\n';

                   for (const bench of benches) {
                     const key = `${bench.suite} > ${bench.name}`;
+                    seen.add(key);
+
+                    if (bench.error) {
+                      fileRows += `| ${key} | ${baseMap[key] ? fmtOps(baseMap[key]) : '—'} | error | 🔴 failed |\n`;
+                      fileRegressed++;
+                      continue;
+                    }

                     const ops = bench.opsPerSec;
                     const baseOps = baseMap[key];
```

Then do a second pass over unseen baseMap entries after the loop so baseline-only benchmarks are rendered as missing instead of disappearing silently.
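
The second pass could look roughly like this. A self-contained sketch under the same assumptions as the suggestion above: `baseMap`, `seen`, `fileRows`, and `fmtOps` stand in for the script's variables, the `fileMissing` counter is new, and the sample data here is invented.

```javascript
// Stand-ins for the script's state (hypothetical sample data):
const fmtOps = n => Math.round(n).toLocaleString('en-US');
const baseMap = {
  'arrays > sort 20 elements': 12072,
  'arrays > flat nested array': 122370,
};
const seen = new Set(['arrays > sort 20 elements']); // keys emitted by the main loop

// Second pass: any baseline key never seen in the PR results is "missing".
let fileRows = '';
let fileMissing = 0;
for (const [key, baseOps] of Object.entries(baseMap)) {
  if (seen.has(key)) continue;
  fileRows += `| ${key} | ${fmtOps(baseOps)} | — | ⚠️ missing |\n`;
  fileMissing++; // tracked separately; no percent change is computed for these
}
console.log(fileMissing); // 1
```

Missing rows deliberately stay out of `allChanges` so they cannot skew the aggregate percent-change stats.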

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr.yml around lines 191 - 243, Benchmarks with errors are
being filtered out (see benches = file.benchmarks.filter(b => !b.error)), so
failing or missing PR baselines vanish from the report; stop filtering errors,
emit explicit rows for errored benchmarks using the existing key format
(`${bench.suite} > ${bench.name}`) labeled e.g. "failed" and increment the
appropriate counters and totalBenchmarks, and when hasBase is true do a second
pass over baseline keys in baseMap that weren't seen to emit "missing" rows and
increment a missing counter (don’t push these into allChanges or compute percent
change); update the table row generation logic (the blocks that build fileRows
and the counts fileNew/fileRegressed/fileImproved/fileUnchanged) to include
these "failed" and "missing" cases so they appear in the summary and tables.

PR comment now shows interpreted and bytecode results side by side in
one table per benchmark file, instead of two separate sections. Each
row displays both modes' ops/sec and their change from baseline.

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
.github/workflows/pr.yml (1)

181-197: ⚠️ Potential issue | 🟠 Major

Don't drop failed or missing benchmarks from the PR report.

buildMap() and benches.filter(b => !b.error) remove errored entries entirely, and the outer loop only walks interpData.files. That means a benchmark/file that fails, disappears, or is only emitted in bytecode never shows up in the table or totals, so the PR comment can look healthy while coverage regressed. Please build rows from the union of current+baseline keys for both modes and render explicit failed/missing states; using the existing ${suite} > ${name} key as the row label also avoids ambiguous duplicate names.

Also applies to: 203-257

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr.yml around lines 181 - 197, buildMap currently filters
out errored benchmarks and only iterates interpData.files, which drops
failed/missing entries and files only present in bytecode data; change buildMap
to collect entries from the union of files and benchmark keys across both
baseline and current datasets (e.g., interpBaseline, interpData for interp mode
and bytecodeBaseline, bytecodeData for bytecode mode), do not discard items with
b.error but instead record an explicit status/value (e.g., "failed" or
"missing") for each `${b.suite} > ${b.name}` key, and ensure the outer loop
iterates the union of file keys so every file/benchmark appears in the final
maps used for rendering and totals (refer to buildMap, interpBaseline,
bytecodeBaseline, bytecodeData, interpData).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 71750764-70c9-4c2a-9c19-bd4d21237ef1

📥 Commits

Reviewing files that changed from the base of the PR and between 047d4fc and b09c6c3.

📒 Files selected for processing (2)
  • .github/workflows/pr.yml
  • docs/benchmarks.md

Each Interpreted/Bytecode cell now displays 'base → PR' ops/sec
(e.g. '287 → 277') when a baseline exists, making it easy to see
both the absolute values and the relative change at a glance.

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
.github/workflows/pr.yml (1)

185-191: ⚠️ Potential issue | 🟠 Major

Don't let failed or missing benchmarks disappear from the unified report.

Line 191 drops errored entries from buildMap, Line 214 filters interpreted errors out again, and the loop starting on Line 207 only walks interpreted current results. That makes interpreted failures, bytecode failures, and baseline-only / bytecode-only benchmarks vanish from the PR comment instead of surfacing as failed or missing, so the table can still look healthy while the suite is broken. Build the row set from the union of interpreted current, bytecode current, and both baseline key sets, and render ${suite} > ${name} as the row label so duplicates stay unambiguous.

🛠️ Suggested direction:

```diff
+ const interpCurrentMap = buildMap(interpData);
  const interpBaseMap = buildMap(interpBaseline);
  const bytecodeBaseMap = buildMap(bytecodeBaseline);
  const bytecodeCurrentMap = buildMap(bytecodeData);

  function buildMap(data) {
    const map = {};
    if (!data) return map;
    for (const file of data.files) {
      const m = {};
      for (const b of file.benchmarks) {
-       if (!b.error) m[`${b.suite} > ${b.name}`] = b.opsPerSec;
+       m[`${b.suite} > ${b.name}`] = {
+         opsPerSec: b.opsPerSec,
+         error: b.error ?? null,
+       };
      }
      map[file.file] = m;
    }
    return map;
  }

- for (const file of interpData.files) {
-   const benches = file.benchmarks.filter(b => !b.error);
+ for (const fileName of new Set([
+   ...Object.keys(interpCurrentMap),
+   ...Object.keys(bytecodeCurrentMap),
+   ...Object.keys(interpBaseMap),
+   ...Object.keys(bytecodeBaseMap),
+ ])) {
+   const keys = new Set([
+     ...Object.keys(interpCurrentMap[fileName] ?? {}),
+     ...Object.keys(bytecodeCurrentMap[fileName] ?? {}),
+     ...Object.keys(interpBaseMap[fileName] ?? {}),
+     ...Object.keys(bytecodeBaseMap[fileName] ?? {}),
+   ]);
+
+   for (const key of keys) {
+     // use `${key}` as the row label
+     // emit explicit `failed` / `missing` cells per mode
+     // count those cases in the file and overall summaries
+   }
  }
```

Also applies to: 207-257

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr.yml around lines 185 - 191, The buildMap function
currently drops benchmarks where b.error is truthy, which causes errored/missing
benchmarks to vanish; update the data collection so the final row set is the
union of interpreted current, bytecode current, and both baseline key sets
(collect keys from data.files[].benchmarks for interpreted and bytecode runs and
from the baseline maps) and use the explicit row label format `${suite} >
${name}` when building the union to keep entries unique; ensure errored entries
are represented (e.g., marked as "failed" or "missing" in the downstream
rendering) instead of being omitted by removing the unconditional `if
(!b.error)` filter inside buildMap and by including entries with errors when
assembling the rows in the loop that currently iterates only interpreted current
results.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 22d8ad15-69fd-405d-a247-2933d0e698bf

📥 Commits

Reviewing files that changed from the base of the PR and between b09c6c3 and f7dacd3.

📒 Files selected for processing (1)
  • .github/workflows/pr.yml

@frostney frostney merged commit 1755a05 into main Mar 9, 2026
6 checks passed
@frostney frostney deleted the cursor/gocciascript-bytecode-benchmarks-5d37 branch March 9, 2026 17:05
This was referenced Mar 10, 2026
@frostney frostney added the internal Refactoring, CI, tooling, cleanup label Apr 9, 2026