Merged
Closes the schema half of Plan C-g. `bench/history.yaml` previously
hardcoded Mac aarch64-darwin in its top-level `env:` block; the
`record-merge-bench.sh` Darwin-only early-exit reinforced the same
assumption. The result was that the per-merge trend graph could
only ever cover one platform, even though `bench/ci_compare.sh`
already runs Ubuntu-vs-Ubuntu inside CI and the user wanted to
see whether Windows shows per-benchmark slowdowns vs Mac/Linux.
Schema change: every entry now carries an `arch:` field naming the
target triple (`aarch64-darwin` / `x86_64-linux` /
`x86_64-windows` / etc.). All 125 pre-existing rows are tagged
`arch: aarch64-darwin` in this commit because they were recorded
on the user's M4 Pro. The top-level `env:` block also gains
`arch: aarch64-darwin` as the implicit default for any
hand-edited row that omits the field.
`bench/record.sh`:
- new `--arch=<triple>` flag, auto-detected from `uname -s -m`
when not passed.
- emits `arch:` in the YAML fragment it appends to history.yaml.
- duplicate-id check now scopes by `(id, arch)` so two
different triples can both record an entry against the same
merge SHA.
- `--overwrite` similarly scoped — never wipes a sibling row
recorded on a different host.
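
The auto-detection and the `(id, arch)` scoping described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `bench/record.sh`: the function names `detect_arch` and `has_entry` are hypothetical, and the duplicate check assumes the existing rows have already been flattened to `id<TAB>arch` pairs (the real script may query the YAML directly).

```shell
#!/usr/bin/env bash
# Hypothetical helper: map `uname -s` / `uname -m` output to a target triple.
# Arguments are optional overrides so the mapping can be exercised off-host.
detect_arch() {
  local os="${1:-$(uname -s)}" machine="${2:-$(uname -m)}"
  local cpu sys
  case "$machine" in
    arm64|aarch64) cpu=aarch64 ;;
    x86_64|amd64)  cpu=x86_64 ;;
    *)             cpu="$machine" ;;
  esac
  case "$os" in
    Darwin)               sys=darwin ;;
    Linux)                sys=linux ;;
    MINGW*|MSYS*|CYGWIN*) sys=windows ;;
    *)                    sys="$(printf '%s' "$os" | tr '[:upper:]' '[:lower:]')" ;;
  esac
  printf '%s-%s\n' "$cpu" "$sys"
}

# Hypothetical duplicate check scoped by (id, arch).
# Reads "id<TAB>arch" pairs on stdin; succeeds only if the exact pair exists,
# so a row for the same id on a different triple does not count as a duplicate.
has_entry() {
  local id="$1" arch="$2"
  awk -F '\t' -v id="$id" -v arch="$arch" \
    '$1 == id && $2 == arch { found = 1 } END { exit (found ? 0 : 1) }'
}
```

Under these assumptions, an `--overwrite` path would replace only the row for which `has_entry` succeeds and leave sibling rows for other triples untouched.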
`scripts/record-merge-bench.sh`:
- dropped the Darwin-only early exit. The wrapper now passes the
auto-detected `--arch=...` along with `--id` and `--reason` to
`bench/record.sh`.
- usage text updated to mention multi-arch.
`.claude/CLAUDE.md` Merge-Gate item 10 reworded: the local bench
record is no longer Mac-only, and downstream readers are reminded
to compare entries within a single `arch:` series rather than
across triples.
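
As a rough illustration of reading a single `arch:` series, the sketch below filters rows by triple and falls back to the top-level default when a row omits the field. It assumes the history has been exported to a hypothetical `id<TAB>arch<TAB>mean_ms` layout; the function name `series_for_arch` and that flat layout are not part of the repository.

```shell
# Hypothetical filter: print "id<TAB>mean_ms" for one arch series.
# stdin: "id<TAB>arch<TAB>mean_ms"; an empty arch column means the row
# omitted the field and inherits the default (aarch64-darwin per env:).
series_for_arch() {
  local want="$1" default_arch="${2:-aarch64-darwin}"
  awk -F '\t' -v want="$want" -v def="$default_arch" '
    {
      arch = ($2 == "") ? def : $2   # apply the implicit default
      if (arch == want) print $1 "\t" $3
    }'
}
```

Comparing the output of two separate calls (one per triple) keeps each trend line within one host class, which is the point of the reworded Merge-Gate item.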
The remaining piece of Plan C-g — flipping the `benchmark` CI job
from Ubuntu-only to a 3-OS matrix — is sequenced behind a
cleanroom baseline collection on Ubuntu and Windows, which is why
this commit only ships the schema/script foundation.
chaploud added a commit that referenced this pull request on Apr 29, 2026
Two rows for merge SHA e5766ee, the first time the per-merge bench record contains more than one target triple side by side:

- aarch64-darwin: native M4 Pro (5 runs + 3 warmup, hyperfine 1.19.0).
- x86_64-linux: OrbStack `my-ubuntu-amd64` VM running on the same M4 Pro, so the values are Rosetta-translated and should not be compared 1:1 against numbers from a native Linux x86_64 runner. Treat this as the schema-shakedown baseline; a native x86_64 cleanroom row can replace or supplement it later.

The x86_64-windows row is intentionally absent for now: the windowsmini SSH host does not yet have hyperfine on PATH. Adding it via versions.lock + install-tools.ps1 is the next follow-up before the `benchmark` CI job can be flipped to a 3-OS matrix.
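
For reference, the "5 runs + 3 warmup" setting corresponds to hyperfine's `--runs` and `--warmup` flags. The wrapper below is only a sketch: the function name `run_bench` and the `HYPERFINE` override variable are hypothetical, and the benchmarked command is whatever the caller passes.

```shell
# Hypothetical wrapper around hyperfine with the run counts quoted above.
# HYPERFINE can point at another binary (an assumption, for dry runs).
run_bench() {
  local out_json="$1"; shift
  # --warmup, --runs and --export-json are standard hyperfine options.
  "${HYPERFINE:-hyperfine}" --warmup 3 --runs 5 --export-json "$out_json" "$@"
}
```

Setting `HYPERFINE=echo` prints the argument list instead of running anything, which is a cheap check on a host such as windowsmini where hyperfine is not yet on PATH.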
chaploud added a commit that referenced this pull request on Apr 29, 2026
W53 and the C-g schema foundation shipped earlier this evening (PRs #85, #86). Update the three handover docs so the next session opens against the new open-work shortlist:

- W47 (tgo_strops_cached, ~15 % slowdown, σ ≈ 18 %; needs harness stabilisation before bisect).
- C-g step 5: Windows hyperfine pin + native x86_64-linux baseline + flip the `benchmark` CI job to a 3-OS matrix.
- C-g step 3 followup: replace the OrbStack Rosetta x86_64-linux baseline (committed in `ac4851d`) with one from a native Linux host once a runner exists.

memo.md `## Current Task` now describes the W53 + C-g shipment; the W47 bisect plan is restated as the next priority.

checklist.md W49 entry rewritten to reflect that the schema half of C-g landed; the matrix flip + native baselines are the remaining items.

roadmap.md table picks up a `C-g multi-arch bench schema = Done` row.
chaploud added a commit that referenced this pull request on Apr 29, 2026
Closes the matrix-flip half of Plan C-g. The benchmark job was Ubuntu-only because hyperfine had to be installed manually via a DEB and there was no Windows path. After the schema work in #86 made `bench/history.yaml` multi-arch, the only thing standing between us and per-OS regression checks on PRs was the toolchain provisioning gap on Windows.

`scripts/windows/install-tools.ps1`:
- new `-OnlyTool hyperfine` arm; pinned via versions.lock HYPERFINE_VERSION. The release zip extracts to a single version-stamped subdir holding `hyperfine.exe`, so Resolve-SingleSubdir flattens it and the executable lands directly in the install dir (same layout as zig / wasm-tools / wasmtime).
- realworldKeys gains `hyperfine = HYPERFINE_VERSION` so a missing pin fails loudly when the Windows installer is asked for it.
- ValidateSet + Update-UserPath + final Verify banner all gain a hyperfine entry.

`.github/versions.lock`:
- HYPERFINE_VERSION 1.18.0 → 1.20.0 to match the nixpkgs version on aarch64-darwin / x86_64-linux (the existing nix devshell already shipped 1.20.0; the Linux DEB step in the previous benchmark job was the only consumer of the older 1.18.0 pin).

`.github/workflows/ci.yml`:
- benchmark job becomes `os: [ubuntu-latest, macos-latest, windows-latest]`. Linux/macOS provision via nix devshell (same pattern as test-nix); Windows uses `install-tools.ps1 -OnlyTool zig` + `-OnlyTool hyperfine`, skipping Go / TinyGo / Rust / WASI SDK that the realworld test job pulls but the benchmarks don't need.
- Existing Linux DEB + setup-zig install steps deleted; the intra-runner regression check (`ci_compare.sh --base=origin/main --threshold=20 --runs=3 --warmup=1 --skip-build`) and the push-to-main `--record-only` step are unchanged in spirit, just wrapped in a per-runner `if RUNNER_OS = Windows` selector that picks `nix develop --command` vs plain bash.
- `needs: [test-nix, test]` so the bench fan-out only runs after all three platform test jobs have already gated the PR.

Cross-runner comparison is still meaningless (the hardware deltas dwarf any codegen-level signal), and the docstring above the job makes that explicit. The durable per-arch absolute-time baselines remain in `bench/history.yaml` (recorded locally per CLAUDE.md Merge Gate item 10).
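
The per-runner selector mentioned in the `ci.yml` notes could look something like the sketch below. It is an assumption about shape rather than a copy of the workflow step: the function name `bench_command` is hypothetical and it only prints the command line it would run. `RUNNER_OS` is the standard GitHub Actions variable, and the `ci_compare.sh` flags are the ones quoted in the commit message.

```shell
# Hypothetical selector: print the regression-check command for a runner.
# Windows runs plain bash (tools come from install-tools.ps1); Linux and
# macOS go through the nix devshell.
bench_command() {
  local runner_os="${1:-${RUNNER_OS:-Linux}}"
  local check='bench/ci_compare.sh --base=origin/main --threshold=20 --runs=3 --warmup=1 --skip-build'
  if [ "$runner_os" = "Windows" ]; then
    printf 'bash %s\n' "$check"
  else
    printf 'nix develop --command bash %s\n' "$check"
  fi
}
```

Because each runner compares its own build of `origin/main` against its own build of the PR head, the check stays intra-runner, consistent with the note that cross-runner deltas are not meaningful.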
Summary

- `bench/history.yaml` now stores entries for every target triple side by side, so cross-platform regressions can surface in the same trend graph.
- `benchmark` CI job stays Ubuntu-only for the moment; flipping it to a 3-OS matrix is sequenced behind cleanroom baseline collection on Ubuntu and Windows.

Schema change

Each row gains an explicit `arch:` field. All 125 pre-existing rows are tagged `arch: aarch64-darwin` because they were recorded on shota's M4 Pro; the top-level `env:` block also gains `arch: aarch64-darwin` as the implicit default for any hand-edited row that omits the field.

`bench/record.sh`:
- new `--arch=<triple>` flag, auto-detected from `uname -s -m` when not passed;
- emits `arch:` in the YAML fragment it appends to history.yaml;
- duplicate-id check now scopes by `(id, arch)`, so two different triples can both record an entry against the same merge SHA;
- `--overwrite` similarly scoped: it never wipes a sibling row recorded on a different host.

`scripts/record-merge-bench.sh`:
- passes the auto-detected `--arch=...` along with `--id` and `--reason` to `bench/record.sh`.

`.claude/CLAUDE.md` Merge-Gate item 10 reworded so the local bench record is no longer Mac-only, and downstream readers are reminded to compare entries within a single `arch:` series rather than across triples.

Why now

Foundation for W47 (`tgo_strops_cached` regression bisect). The variance dominating the +24% framing in #84 was diagnosed using only the Mac series; cross-arch data on the same merge SHAs will let us triage whether the slowdown is ARM64 JIT codegen, cross-platform JIT, or interpreter dispatch.

Test plan

- `test (windows-latest)` / `test-nix (ubuntu-latest)` / `test-nix (macos-latest)`
- `benchmark` job still runs under the new schema (yq query semantics match)

Follow-up (not in this PR)

- […] `my-ubuntu-amd64`)
- […] `windowsmini` SSH; needs hyperfine pin in `install-tools.ps1`)
- […] `runs-on: ubuntu-latest` on the `benchmark` CI job and matrixify across Mac/Ubuntu/Windows