GocciaScript bytecode benchmarks in PRs #55
Conversation
- CI workflow now runs benchmarks in both interpreted and bytecode modes
- On main pushes, both baselines are cached separately
- PR workflow restores both baselines, runs both modes, and posts a single comparison comment with separate Interpreted and Bytecode sections
- Refactored PR comparison script into a reusable generateSection function
- Updated docs/benchmarks.md to reflect dual-mode benchmark CI

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>
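For context, a minimal sketch of how a per-mode baseline could be cached on main pushes — the step name, file path, and action version are assumptions, not the actual workflow contents:

    # hypothetical CI step: save a separate baseline for each benchmark mode
    - name: Save benchmark baseline (${{ matrix.mode }})
      if: github.ref == 'refs/heads/main'
      uses: actions/cache/save@v4
      with:
        path: benchmark-${{ matrix.mode }}-results.json
        key: benchmark-${{ matrix.mode }}-baseline-${{ github.sha }}

The PR workflow can then restore each mode's baseline by key prefix (benchmark-interpreted-baseline-*, benchmark-bytecode-baseline-*), so interpreted and bytecode results never overwrite each other.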
📝 Walkthrough

Adds parallel benchmark support for two modes (interpreted, bytecode) across CI and PR workflows: matrixed benchmark runs, per-mode artifacts and baselines, mode-aware caching/restoration, adjusted benchmark commands with mode flags and env vars, and consolidated per-mode comparison/reporting in PR comments.
Sequence Diagram(s)

sequenceDiagram
participant Dev as Developer
participant CI as CI Matrix Runners
participant Art as Artifact Store / Cache
participant PR as PR Comparison Workflow
participant Main as main branch (baseline)
Dev->>CI: push / open PR
Note over CI: Matrix spawns jobs for\nmode=interpreted and mode=bytecode
CI->>Art: Upload `benchmark-${matrix.mode}-results.json`
CI->>Art: Save mode-specific cache/baseline (on main pushes)
Dev->>PR: PR triggered
PR->>Art: Download per-mode results
PR->>Art: Restore per-mode baseline from `main` (if exists)
PR->>PR: Build per-file per-mode maps\ncompute deltas & aggregate stats
PR->>Dev: Post PR comment with per-mode summaries and per-file sections
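A sketch of the per-mode artifact handoff in the diagram above; the artifact name follows the benchmark-${matrix.mode}-results.json convention shown there, while the step layout and action versions are assumptions:

    # matrixed benchmark job: publish this mode's results
    - name: Upload benchmark results
      uses: actions/upload-artifact@v4
      with:
        name: benchmark-${{ matrix.mode }}-results
        path: benchmark-${{ matrix.mode }}-results.json

    # comparison job: collect results from both modes into one workspace
    - name: Download benchmark results
      uses: actions/download-artifact@v4
      with:
        pattern: benchmark-*-results
        merge-multiple: true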
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
- CI and PR workflows now run interpreted and bytecode benchmarks in
parallel via matrix strategy (mode: [interpreted, bytecode])
- Reduced calibration settings: CALIBRATION_MS=50, ROUNDS=3 (from 100/5)
- PR workflow split into benchmark (matrix) + benchmark-comment jobs;
results passed via artifacts, baselines restored in the comment job
- CI cache keys now use benchmark-{mode}-baseline-* naming scheme
- Updated docs/benchmarks.md to reflect parallel execution and settings
Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>
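A rough sketch of what the matrixed benchmark job described above could look like; the job/step names, the benchmark command, and its flags are assumptions based on this description, not the real pr.yml:

    benchmark:
      strategy:
        fail-fast: false          # keep the other mode running if one leg fails
        matrix:
          mode: [interpreted, bytecode]
      runs-on: ubuntu-latest
      env:
        CALIBRATION_MS: 50        # reduced from 100
        ROUNDS: 3                 # reduced from 5
      steps:
        # checkout/build steps omitted
        - name: Run benchmarks (${{ matrix.mode }})
          # hypothetical command and flag names
          run: ./run-benchmarks --mode ${{ matrix.mode }} --out benchmark-${{ matrix.mode }}-results.json

The separate benchmark-comment job then downloads both modes' result artifacts, restores the two baselines, and posts the single comparison comment.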
Benchmark Results

254 benchmarks (no baseline)

arraybuffer.js — 14 benchmarks
arrays.js — 19 benchmarks
async-await.js — 6 benchmarks
classes.js — 31 benchmarks
closures.js — 11 benchmarks
collections.js — 12 benchmarks
destructuring.js — 22 benchmarks
fibonacci.js — 8 benchmarks
for-of.js — 7 benchmarks
iterators.js — 20 benchmarks
json.js — 20 benchmarks
jsx.jsx — 21 benchmarks
numbers.js — 11 benchmarks
objects.js — 7 benchmarks
promises.js — 12 benchmarks
strings.js — 11 benchmarks
typed-arrays.js — 22 benchmarks
Measured on ubuntu-latest x64. Changes within ±7% are considered insignificant.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/ci.yml (1)
120-137: ⚠️ Potential issue | 🟠 Major

Add `fail-fast: false` to the benchmark job strategy.

The matrix now covers 12 job combinations (3 OS × 2 modes + arm64 variants), but GitHub Actions defaults to fail-fast behavior. If any non-baseline leg fails (e.g., windows-latest x64 interpreted), the remaining jobs are cancelled before the ubuntu-latest x64 legs complete and save the interpreted/bytecode baselines at lines 164–176. This leaves future PRs comparing against stale cached results.
Suggested change
  benchmark:
    needs: build
    strategy:
+     fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        arch: [x64]
        mode: [interpreted, bytecode]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci.yml around lines 120 - 137, The benchmark job's matrix uses strategy.matrix and should disable fail-fast to avoid cancelling remaining legs (which prevents baseline artifacts at the interpreted/bytecode ubuntu-latest x64 legs from being saved); update the CI workflow by adding fail-fast: false under the same strategy block that contains matrix (the strategy used for the benchmark job in .github/workflows/ci.yml) so all matrix combinations run to completion even if some fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/pr.yml:
- Around line 191-243: Benchmarks with errors are being filtered out (see
benches = file.benchmarks.filter(b => !b.error)), so failing or missing PR
baselines vanish from the report; stop filtering errors, emit explicit rows for
errored benchmarks using the existing key format (`${bench.suite} >
${bench.name}`) labeled e.g. "failed" and increment the appropriate counters and
totalBenchmarks, and when hasBase is true do a second pass over baseline keys in
baseMap that weren't seen to emit "missing" rows and increment a missing counter
(don’t push these into allChanges or compute percent change); update the table
row generation logic (the blocks that build fileRows and the counts
fileNew/fileRegressed/fileImproved/fileUnchanged) to include these "failed" and
"missing" cases so they appear in the summary and tables.
---
Outside diff comments:
In @.github/workflows/ci.yml:
- Around line 120-137: The benchmark job's matrix uses strategy.matrix and
should disable fail-fast to avoid cancelling remaining legs (which prevents
baseline artifacts at the interpreted/bytecode ubuntu-latest x64 legs from being
saved); update the CI workflow by adding fail-fast: false under the same
strategy block that contains matrix (the strategy used for the benchmark job in
.github/workflows/ci.yml) so all matrix combinations run to completion even if
some fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1335a591-a57a-4ca9-a324-db38f4a68cec
📒 Files selected for processing (3)
- .github/workflows/ci.yml
- .github/workflows/pr.yml
- docs/benchmarks.md
for (const file of current.files) {
  const fileName = file.file;
  const baseMap = baselineByFile[fileName] || {};
  const hasBase = baseline && Object.keys(baseMap).length > 0;

  let fileImproved = 0;
  let fileRegressed = 0;
  let fileNew = 0;
  let fileUnchanged = 0;
  let fileRows = '';
  const fileChanges = [];

  const benches = file.benchmarks.filter(b => !b.error);
  totalBenchmarks += benches.length;

  if (hasBase) {
    fileRows += '| Benchmark | Base (ops/sec) | PR (ops/sec) | Change |\n';
    fileRows += '|-----------|---------------|-------------|--------|\n';

    for (const bench of benches) {
      const key = `${bench.suite} > ${bench.name}`;
      const ops = bench.opsPerSec;
      const baseOps = baseMap[key];

      if (baseOps && baseOps > 0) {
        const change = ((ops - baseOps) / baseOps) * 100;
        fileChanges.push(change);
        allChanges.push(change);
        let indicator;
        if (change > THRESHOLD) {
          indicator = `🟢 +${change.toFixed(1)}%`;
          fileImproved++;
        } else if (change < -THRESHOLD) {
          indicator = `🔴 ${change.toFixed(1)}%`;
          fileRegressed++;
        } else {
          indicator = `${change >= 0 ? '+' : ''}${change.toFixed(1)}%`;
          fileUnchanged++;
        }
        fileRows += `| ${bench.name} | ${fmtOps(baseOps)} | ${fmtOps(ops)} | ${indicator} |\n`;
      } else {
        fileRows += `| ${bench.name} | — | ${fmtOps(ops)} | 🆕 new |\n`;
        fileNew++;
      }
    }
  } else {
    fileRows += '| Benchmark | ops/sec |\n';
    fileRows += '|-----------|--------|\n';
    for (const bench of benches) {
      fileRows += `| ${bench.name} | ${fmtOps(bench.opsPerSec)} |\n`;
      fileNew++;
    }
  }
Report failed or missing benchmarks instead of dropping them.
The reporter already emits per-benchmark error fields, but Line 203 filters them out and Lines 191-243 only walk the PR result set. A benchmark that starts failing or stops being emitted therefore vanishes from the comment instead of surfacing as a regression, so the section can look healthy while the benchmark suite is actually broken. Please emit explicit failed/missing rows and count them in the summary; using the existing suite > name key as the row label will also keep duplicate benchmark names unambiguous.
🧪 Suggested handling
- const benches = file.benchmarks.filter(b => !b.error);
+ const benches = file.benchmarks;
+ const seen = new Set();
totalBenchmarks += benches.length;
if (hasBase) {
fileRows += '| Benchmark | Base (ops/sec) | PR (ops/sec) | Change |\n';
fileRows += '|-----------|---------------|-------------|--------|\n';
for (const bench of benches) {
const key = `${bench.suite} > ${bench.name}`;
+ seen.add(key);
+
+ if (bench.error) {
+ fileRows += `| ${key} | ${baseMap[key] ? fmtOps(baseMap[key]) : '—'} | error | 🔴 failed |\n`;
+ fileRegressed++;
+ continue;
+ }
const ops = bench.opsPerSec;
const baseOps = baseMap[key];

Then do a second pass over unseen baseMap entries after the loop so baseline-only benchmarks are rendered as missing instead of disappearing silently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/pr.yml around lines 191 - 243, Benchmarks with errors are
being filtered out (see benches = file.benchmarks.filter(b => !b.error)), so
failing or missing PR baselines vanish from the report; stop filtering errors,
emit explicit rows for errored benchmarks using the existing key format
(`${bench.suite} > ${bench.name}`) labeled e.g. "failed" and increment the
appropriate counters and totalBenchmarks, and when hasBase is true do a second
pass over baseline keys in baseMap that weren't seen to emit "missing" rows and
increment a missing counter (don’t push these into allChanges or compute percent
change); update the table row generation logic (the blocks that build fileRows
and the counts fileNew/fileRegressed/fileImproved/fileUnchanged) to include
these "failed" and "missing" cases so they appear in the summary and tables.
PR comment now shows interpreted and bytecode results side by side in one table per benchmark file, instead of two separate sections. Each row displays both modes' ops/sec and their change from baseline.

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>
♻️ Duplicate comments (1)
.github/workflows/pr.yml (1)
181-197: ⚠️ Potential issue | 🟠 Major

Don't drop failed or missing benchmarks from the PR report.
`buildMap()` and `benches.filter(b => !b.error)` remove errored entries entirely, and the outer loop only walks `interpData.files`. That means a benchmark/file that fails, disappears, or is only emitted in bytecode never shows up in the table or totals, so the PR comment can look healthy while coverage regressed. Please build rows from the union of current+baseline keys for both modes and render explicit `failed`/`missing` states; using the existing `${suite} > ${name}` key as the row label also avoids ambiguous duplicate names.

Also applies to: 203-257
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/pr.yml around lines 181 - 197, buildMap currently filters out errored benchmarks and only iterates interpData.files, which drops failed/missing entries and files only present in bytecode data; change buildMap to collect entries from the union of files and benchmark keys across both baseline and current datasets (e.g., interpBaseline, interpData for interp mode and bytecodeBaseline, bytecodeData for bytecode mode), do not discard items with b.error but instead record an explicit status/value (e.g., "failed" or "missing") for each `${b.suite} > ${b.name}` key, and ensure the outer loop iterates the union of file keys so every file/benchmark appears in the final maps used for rendering and totals (refer to buildMap, interpBaseline, bytecodeBaseline, bytecodeData, interpData).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/pr.yml:
- Around line 181-197: buildMap currently filters out errored benchmarks and
only iterates interpData.files, which drops failed/missing entries and files
only present in bytecode data; change buildMap to collect entries from the union
of files and benchmark keys across both baseline and current datasets (e.g.,
interpBaseline, interpData for interp mode and bytecodeBaseline, bytecodeData
for bytecode mode), do not discard items with b.error but instead record an
explicit status/value (e.g., "failed" or "missing") for each `${b.suite} >
${b.name}` key, and ensure the outer loop iterates the union of file keys so
every file/benchmark appears in the final maps used for rendering and totals
(refer to buildMap, interpBaseline, bytecodeBaseline, bytecodeData, interpData).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 71750764-70c9-4c2a-9c19-bd4d21237ef1
📒 Files selected for processing (2)
- .github/workflows/pr.yml
- docs/benchmarks.md
Each Interpreted/Bytecode cell now displays 'base → PR' ops/sec (e.g. '287 → 277') when a baseline exists, making it easy to see both the absolute values and the relative change at a glance.

Co-authored-by: Johannes Stein <frostney@users.noreply.github.com>
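For illustration only, a row in such a unified table might read roughly like this; the benchmark name, column headers, and the Bytecode numbers are placeholders, and 287 → 277 is the example from the commit message (about -3.5%, inside the ±7% insignificance band):

| Benchmark | Interpreted | Bytecode |
|-----------|-------------|----------|
| strings > concat | 287 → 277 (-3.5%) | 1,020 → 1,050 (+2.9%) |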
♻️ Duplicate comments (1)
.github/workflows/pr.yml (1)
185-191: ⚠️ Potential issue | 🟠 Major

Don't let failed or missing benchmarks disappear from the unified report.
Line 191 drops errored entries from `buildMap`, Line 214 filters interpreted errors out again, and the loop starting on Line 207 only walks interpreted current results. That makes interpreted failures, bytecode failures, and baseline-only / bytecode-only benchmarks vanish from the PR comment instead of surfacing as `failed` or `missing`, so the table can still look healthy while the suite is broken. Build the row set from the union of interpreted current, bytecode current, and both baseline key sets, and render `${suite} > ${name}` as the row label so duplicates stay unambiguous.

🛠️ Suggested direction
+ const interpCurrentMap = buildMap(interpData);
  const interpBaseMap = buildMap(interpBaseline);
  const bytecodeBaseMap = buildMap(bytecodeBaseline);
  const bytecodeCurrentMap = buildMap(bytecodeData);

  function buildMap(data) {
    const map = {};
    if (!data) return map;
    for (const file of data.files) {
      const m = {};
      for (const b of file.benchmarks) {
-       if (!b.error) m[`${b.suite} > ${b.name}`] = b.opsPerSec;
+       m[`${b.suite} > ${b.name}`] = {
+         opsPerSec: b.opsPerSec,
+         error: b.error ?? null,
+       };
      }
      map[file.file] = m;
    }
    return map;
  }

- for (const file of interpData.files) {
-   const benches = file.benchmarks.filter(b => !b.error);
+ for (const fileName of new Set([
+   ...Object.keys(interpCurrentMap),
+   ...Object.keys(bytecodeCurrentMap),
+   ...Object.keys(interpBaseMap),
+   ...Object.keys(bytecodeBaseMap),
+ ])) {
+   const keys = new Set([
+     ...Object.keys(interpCurrentMap[fileName] ?? {}),
+     ...Object.keys(bytecodeCurrentMap[fileName] ?? {}),
+     ...Object.keys(interpBaseMap[fileName] ?? {}),
+     ...Object.keys(bytecodeBaseMap[fileName] ?? {}),
+   ]);
+
+   for (const key of keys) {
+     // use `${key}` as the row label
+     // emit explicit `failed` / `missing` cells per mode
+     // count those cases in the file and overall summaries
+   }
  }

Also applies to: 207-257
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/pr.yml around lines 185 - 191, The buildMap function currently drops benchmarks where b.error is truthy, which causes errored/missing benchmarks to vanish; update the data collection so the final row set is the union of interpreted current, bytecode current, and both baseline key sets (collect keys from data.files[].benchmarks for interpreted and bytecode runs and from the baseline maps) and use the explicit row label format `${suite} > ${name}` when building the union to keep entries unique; ensure errored entries are represented (e.g., marked as "failed" or "missing" in the downstream rendering) instead of being omitted by removing the unconditional `if (!b.error)` filter inside buildMap and by including entries with errors when assembling the rows in the loop that currently iterates only interpreted current results.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/pr.yml:
- Around line 185-191: The buildMap function currently drops benchmarks where
b.error is truthy, which causes errored/missing benchmarks to vanish; update the
data collection so the final row set is the union of interpreted current,
bytecode current, and both baseline key sets (collect keys from
data.files[].benchmarks for interpreted and bytecode runs and from the baseline
maps) and use the explicit row label format `${suite} > ${name}` when building
the union to keep entries unique; ensure errored entries are represented (e.g.,
marked as "failed" or "missing" in the downstream rendering) instead of being
omitted by removing the unconditional `if (!b.error)` filter inside buildMap and
by including entries with errors when assembling the rows in the loop that
currently iterates only interpreted current results.
Enable GocciaScript bytecode benchmarks in CI, including baseline caching and PR comparisons, mirroring the existing interpreted benchmarks.
Summary by CodeRabbit
New Features
Documentation