bench(c-g): make history.yaml multi-arch by chaploud · Pull Request #86 · clojurewasm/zwasm

chaploud · 2026-04-29T11:04:37Z

Summary

Closes the schema half of Plan C-g — bench/history.yaml now stores entries for every target triple side by side, so cross-platform regressions can surface in the same trend graph.
The benchmark CI job stays Ubuntu-only for the moment; flipping it to a 3-OS matrix is sequenced behind cleanroom baseline collection on Ubuntu and Windows.

Schema change

Each row gains an explicit arch: field. All 125 pre-existing rows are tagged arch: aarch64-darwin because they were recorded on shota's M4 Pro; the top-level env: block also gains arch: aarch64-darwin as the implicit default for any hand-edited row that omits the field.

bench/record.sh:

new --arch=<triple> flag, auto-detected from uname -s -m when not passed;
emits arch: in the YAML fragment it appends to history.yaml;
duplicate-id check now scopes by (id, arch) — two different triples can both record an entry against the same merge SHA;
--overwrite similarly scoped — never wipes a sibling row recorded on a different host.
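
The auto-detection and the `(id, arch)` scoping described above can be sketched roughly as follows. This is a minimal illustration, not the code in bench/record.sh: the function names `detect_arch` and `has_entry`, the `uname`-to-triple mapping table, and the assumed row layout (`- id: <sha>` followed by an indented `arch: <triple>` line, unquoted) are all hypothetical.

```shell
# Hypothetical helper: turn `uname -s -m` output into a target triple.
# The mapping below is an assumption, not copied from bench/record.sh.
detect_arch() {
  local sys_machine os machine
  sys_machine="${1:-$(uname -s -m)}"   # e.g. "Darwin arm64" or "Linux x86_64"
  os="${sys_machine%% *}"              # first word: kernel name
  machine="${sys_machine##* }"         # last word: machine hardware name
  case "$machine" in
    arm64|aarch64) machine="aarch64" ;;
    x86_64|amd64)  machine="x86_64" ;;
  esac
  case "$os" in
    Darwin) os="darwin" ;;
    Linux)  os="linux" ;;
    MINGW*|MSYS*|CYGWIN*) os="windows" ;;
  esac
  printf '%s-%s\n' "$machine" "$os"
}

# Hypothetical duplicate check: succeeds only when a row with the same id
# AND the same arch already exists. Assumes each row starts with
# "- id: <sha>" and carries an unquoted "arch: <triple>" line below it.
has_entry() {
  local file="$1" id="$2" arch="$3"
  awk -v id="$id" -v arch="$arch" '
    $1 == "-" && $2 == "id:" { cur = $3; next }            # remember current row id
    $1 == "arch:" && cur == id && $2 == arch { found = 1 } # same id and same triple
    END { exit (found ? 0 : 1) }
  ' "$file"
}
```

Under these assumptions, `has_entry bench/history.yaml e5766ee "$(detect_arch)"` would only report a duplicate for the current host's triple, so a row recorded for the same merge SHA on a different host is left alone.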

scripts/record-merge-bench.sh:

dropped the Darwin-only early exit; the wrapper passes the auto-detected --arch=... along with --id and --reason to bench/record.sh.

.claude/CLAUDE.md Merge-Gate item 10 reworded so the local bench record is no longer Mac-only, and downstream readers are reminded to compare entries within a single arch: series rather than across triples.

Why now

Foundation for W47 (tgo_strops_cached regression bisect). The variance dominating the +24% framing in #84 was diagnosed using only the Mac series; cross-arch data on the same merge SHAs will let us triage whether the slowdown is ARM64 JIT codegen, cross-platform JIT, or interpreter dispatch.

Test plan

CI green across test (windows-latest) / test-nix (ubuntu-latest) / test-nix (macos-latest)
CI benchmark job still runs under the new schema (yq query semantics match)
Ubuntu baseline collection via OrbStack as a follow-up post-merge bench commit

Follow-up (not in this PR)

Cleanroom Ubuntu baseline (OrbStack my-ubuntu-amd64)
Cleanroom Windows baseline (windowsmini SSH; needs hyperfine pin in install-tools.ps1)
Drop runs-on: ubuntu-latest on the benchmark CI job and matrixify across Mac/Ubuntu/Windows

Closes the schema half of Plan C-g.

`bench/history.yaml` previously hardcoded Mac aarch64-darwin in its top-level `env:` block; the `record-merge-bench.sh` Darwin-only early-exit reinforced the same assumption. The result was that the per-merge trend graph could only ever cover one platform, even though `bench/ci_compare.sh` already runs Ubuntu-vs-Ubuntu inside CI and the user wanted to see whether Windows shows per-benchmark slowdowns vs Mac/Linux.

Schema change: every entry now carries an `arch:` field naming the target triple (`aarch64-darwin` / `x86_64-linux` / `x86_64-windows` / etc.). All 125 pre-existing rows are tagged `arch: aarch64-darwin` in this commit because they were recorded on the user's M4 Pro. The top-level `env:` block also gains `arch: aarch64-darwin` as the implicit default for any hand-edited row that omits the field.

`bench/record.sh`:
- new `--arch=<triple>` flag, auto-detected from `uname -s -m` when not passed.
- emits `arch:` in the YAML fragment it appends to history.yaml.
- duplicate-id check now scopes by `(id, arch)` so two different triples can both record an entry against the same merge SHA.
- `--overwrite` similarly scoped: it never wipes a sibling row recorded on a different host.

`scripts/record-merge-bench.sh`:
- dropped the Darwin-only early exit. The wrapper now passes the auto-detected `--arch=...` along with `--id` and `--reason` to `bench/record.sh`.
- usage text updated to mention multi-arch.

`.claude/CLAUDE.md` Merge-Gate item 10 reworded: the local bench record is no longer Mac-only, and downstream readers are reminded to compare entries within a single `arch:` series rather than across triples.

The remaining piece of Plan C-g (flipping the `benchmark` CI job from Ubuntu-only to a 3-OS matrix) is sequenced behind a cleanroom baseline collection on Ubuntu and Windows, which is why this commit only ships the schema/script foundation.

Two rows for merge SHA e5766ee, the first time the per-merge bench record contains more than one target triple side by side:
- aarch64-darwin: native M4 Pro (5 runs + 3 warmup, hyperfine 1.19.0).
- x86_64-linux: OrbStack `my-ubuntu-amd64` VM running on the same M4 Pro, so the values are Rosetta-translated and should not be compared 1:1 against numbers from a native Linux x86_64 runner. Treat this as the schema-shakedown baseline; a native x86_64 cleanroom row can replace or supplement it later.

The Windows x86_64-windows row is intentionally absent for now: the windowsmini SSH host does not yet have hyperfine on PATH. Adding it via versions.lock + install-tools.ps1 is the next follow-up before the `benchmark` CI job can be flipped to a 3-OS matrix.

W53 and the C-g schema foundation shipped earlier this evening (PRs #85, #86). Update the three handover docs so the next session opens against the new open-work shortlist:
- W47 (tgo_strops_cached, ~15 % slowdown, σ ≈ 18 %; needs harness stabilisation before bisect).
- C-g step 5: Windows hyperfine pin + native x86_64-linux baseline + flip the `benchmark` CI job to a 3-OS matrix.
- C-g step 3 followup: replace the OrbStack Rosetta x86_64-linux baseline (committed in `ac4851d`) with one from a native Linux host once a runner exists.

memo.md `## Current Task` now describes the W53 + C-g shipment; the W47 bisect plan is restated as the next priority.

checklist.md W49 entry rewritten to reflect that the schema half of C-g landed; the matrix flip + native baselines are the remaining items.

roadmap.md table picks up a `C-g multi-arch bench schema = Done` row.

Closes the matrix-flip half of Plan C-g.

The benchmark job was Ubuntu-only because hyperfine had to be installed manually via a DEB and there was no Windows path. After the schema work in #86 made `bench/history.yaml` multi-arch, the only thing standing between us and per-OS regression checks on PR was the toolchain provisioning gap on Windows.

`scripts/windows/install-tools.ps1`:
- new `-OnlyTool hyperfine` arm; pinned via versions.lock HYPERFINE_VERSION. The release zip extracts to a single version-stamped subdir holding `hyperfine.exe`, so Resolve-SingleSubdir flattens it and the executable lands directly in the install dir (same layout as zig / wasm-tools / wasmtime).
- realworldKeys gains `hyperfine = HYPERFINE_VERSION` so a missing pin fails loudly when the Windows installer is asked for it.
- ValidateSet + Update-UserPath + final Verify banner all gain a hyperfine entry.

`.github/versions.lock`:
- HYPERFINE_VERSION 1.18.0 → 1.20.0 to match the nixpkgs version on aarch64-darwin / x86_64-linux (the existing nix devshell already shipped 1.20.0; the Linux DEB step in the previous benchmark job was the only consumer of the older 1.18.0 pin).

`.github/workflows/ci.yml`:
- benchmark job becomes `os: [ubuntu-latest, macos-latest, windows-latest]`. Linux/macOS provision via nix devshell (same pattern as test-nix); Windows uses `install-tools.ps1 -OnlyTool zig` + `-OnlyTool hyperfine`, skipping Go / TinyGo / Rust / WASI SDK that the realworld test job pulls but the benchmarks don't need.
- Existing Linux DEB + setup-zig install steps deleted; the intra-runner regression check (`ci_compare.sh --base=origin/main --threshold=20 --runs=3 --warmup=1 --skip-build`) and the push-to-main `--record-only` step are unchanged in spirit, just wrapped in a per-runner `if RUNNER_OS = Windows` selector that picks `nix develop --command` vs plain bash.
- `needs: [test-nix, test]` so the bench fan-out only runs after all three platform test jobs have already gated the PR.

Cross-runner comparison is still meaningless (the hardware deltas dwarf any codegen-level signal), and the docstring above the job makes that explicit. The durable per-arch absolute-time baselines remain in `bench/history.yaml` (recorded locally per CLAUDE.md Merge Gate item 10).
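
The per-runner selector mentioned in the `ci.yml` notes could look roughly like the sketch below. It is an illustration only, not the step from the workflow: the function names `bench_shell_prefix` and `run_bench_step` are hypothetical. `RUNNER_OS` is the variable GitHub Actions sets on each runner, with the values `Linux`, `macOS`, or `Windows`.

```shell
# Hypothetical selector: print the command prefix a benchmark step would use.
# Windows runners have zig and hyperfine on PATH via install-tools.ps1, so
# plain bash is enough; Linux/macOS runners go through the nix devshell.
bench_shell_prefix() {
  if [ "${RUNNER_OS:-}" = "Windows" ]; then
    printf 'bash\n'
  else
    printf 'nix develop --command bash\n'
  fi
}

# Hypothetical wrapper: run a script (plus its arguments) under that prefix.
run_bench_step() {
  # Word splitting of the prefix is intentional here.
  # shellcheck disable=SC2046
  $(bench_shell_prefix) "$@"
}
```

With this shape, a step such as `run_bench_step bench/ci_compare.sh --base=origin/main --threshold=20 --runs=3 --warmup=1 --skip-build` would be written once and behave the same on all three runners, differing only in how the toolchain is brought into scope.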

chaploud merged commit e5766ee into main Apr 29, 2026
8 checks passed

chaploud deleted the develop/cg-3platform-bench-baseline branch April 29, 2026 11:24

chaploud mentioned this pull request Apr 29, 2026

docs: refocus handover after W53 + C-g shipped #87

Merged

1 task

chaploud mentioned this pull request Apr 29, 2026

ci(c-g step 5): flip benchmark job to a 3-OS matrix #88

Merged

3 tasks

chaploud mentioned this pull request Apr 29, 2026

ci: add bench-baseline workflow_dispatch (C-g step 3 followup) #89

Merged

3 tasks

chaploud merged 1 commit into main from develop/cg-3platform-bench-baseline
