Skip to content

bench(c-g): make history.yaml multi-arch#86

Merged
chaploud merged 1 commit intomainfrom
develop/cg-3platform-bench-baseline
Apr 29, 2026
Merged

bench(c-g): make history.yaml multi-arch#86
chaploud merged 1 commit intomainfrom
develop/cg-3platform-bench-baseline

Conversation

@chaploud
Copy link
Copy Markdown
Contributor

Summary

  • Closes the schema half of Plan C-g — bench/history.yaml now stores entries for every target triple side by side, so cross-platform regressions can surface in the same trend graph.
  • The benchmark CI job stays Ubuntu-only for the moment; flipping it to a 3-OS matrix is sequenced behind cleanroom baseline collection on Ubuntu and Windows.

Schema change

Each row gains an explicit arch: field. All 125 pre-existing rows are tagged arch: aarch64-darwin because they were recorded on shota's M4 Pro; the top-level env: block also gains arch: aarch64-darwin as the implicit default for any hand-edited row that omits the field.

bench/record.sh:

  • new --arch=<triple> flag, auto-detected from uname -s -m when not passed;
  • emits arch: in the YAML fragment it appends to history.yaml;
  • duplicate-id check now scopes by (id, arch) — two different triples can both record an entry against the same merge SHA;
  • --overwrite similarly scoped — never wipes a sibling row recorded on a different host.

scripts/record-merge-bench.sh:

  • dropped the Darwin-only early exit; the wrapper passes the auto-detected --arch=... along with --id and --reason to bench/record.sh.

.claude/CLAUDE.md Merge-Gate item 10 reworded so the local bench record is no longer Mac-only, and downstream readers are reminded to compare entries within a single arch: series rather than across triples.

Why now

Foundation for W47 (tgo_strops_cached regression bisect). The variance dominating the +24% framing in #84 was diagnosed using only the Mac series; cross-arch data on the same merge SHAs will let us triage whether the slowdown is ARM64 JIT codegen, cross-platform JIT, or interpreter dispatch.

Test plan

  • CI green across test (windows-latest) / test-nix (ubuntu-latest) / test-nix (macos-latest)
  • CI benchmark job still runs under the new schema (yq query semantics match)
  • Ubuntu baseline collection via OrbStack as a follow-up post-merge bench commit

Follow-up (not in this PR)

  1. Cleanroom Ubuntu baseline (OrbStack my-ubuntu-amd64)
  2. Cleanroom Windows baseline (windowsmini SSH; needs hyperfine pin in install-tools.ps1)
  3. Drop runs-on: ubuntu-latest on the benchmark CI job and matrixify across Mac/Ubuntu/Windows

Closes the schema half of Plan C-g. `bench/history.yaml` previously
hardcoded Mac aarch64-darwin in its top-level `env:` block; the
`record-merge-bench.sh` Darwin-only early-exit reinforced the same
assumption. The result was that the per-merge trend graph could
only ever cover one platform, even though `bench/ci_compare.sh`
already runs Ubuntu-vs-Ubuntu inside CI and the user wanted to
see whether Windows shows per-benchmark slowdowns vs Mac/Linux.

Schema change: every entry now carries an `arch:` field naming the
target triple (`aarch64-darwin` / `x86_64-linux` /
`x86_64-windows` / etc.). All 125 pre-existing rows are tagged
`arch: aarch64-darwin` in this commit because they were recorded
on the user's M4 Pro. The top-level `env:` block also gains
`arch: aarch64-darwin` as the implicit default for any
hand-edited row that omits the field.

`bench/record.sh`:
  - new `--arch=<triple>` flag, auto-detected from `uname -s -m`
    when not passed.
  - emits `arch:` in the YAML fragment it appends to history.yaml.
  - duplicate-id check now scopes by `(id, arch)` so two
    different triples can both record an entry against the same
    merge SHA.
  - `--overwrite` similarly scoped — never wipes a sibling row
    recorded on a different host.

`scripts/record-merge-bench.sh`:
  - dropped the Darwin-only early exit. The wrapper now passes the
    auto-detected `--arch=...` along with `--id` and `--reason` to
    `bench/record.sh`.
  - usage text updated to mention multi-arch.

`.claude/CLAUDE.md` Merge-Gate item 10 reworded: the local bench
record is no longer Mac-only, and downstream readers are reminded
to compare entries within a single `arch:` series rather than
across triples.

The remaining piece of Plan C-g — flipping the `benchmark` CI job
from Ubuntu-only to a 3-OS matrix — is sequenced behind a
cleanroom baseline collection on Ubuntu and Windows, which is why
this commit only ships the schema/script foundation.
@chaploud chaploud merged commit e5766ee into main Apr 29, 2026
8 checks passed
chaploud added a commit that referenced this pull request Apr 29, 2026
Two rows for merge SHA e5766ee — first time the per-merge bench
record contains more than one target triple side by side:

- aarch64-darwin: native M4 Pro (5 runs + 3 warmup, hyperfine 1.19.0).
- x86_64-linux  : OrbStack `my-ubuntu-amd64` VM running on the
  same M4 Pro, so the values are Rosetta-translated and should
  not be compared 1:1 against numbers from a native Linux x86_64
  runner. Treat this as the schema-shakedown baseline; a native
  x86_64 cleanroom row can replace or supplement it later.

The Windows x86_64-windows row is intentionally absent for now —
the windowsmini SSH host does not yet have hyperfine on PATH;
adding it via versions.lock + install-tools.ps1 is the next
follow-up before the `benchmark` CI job can be flipped to a
3-OS matrix.
@chaploud chaploud deleted the develop/cg-3platform-bench-baseline branch April 29, 2026 11:24
chaploud added a commit that referenced this pull request Apr 29, 2026
W53 and the C-g schema foundation shipped earlier this evening
(PRs #85, #86). Update the three handover docs so the next session
opens against the new open-work shortlist:

- W47 (tgo_strops_cached, ~15 % slowdown, σ ≈ 18 % — needs harness
  stabilisation before bisect).
- C-g step 5: Windows hyperfine pin + native x86_64-linux baseline
  + flip the `benchmark` CI job to a 3-OS matrix.
- C-g step 3 followup: replace the OrbStack Rosetta x86_64-linux
  baseline (committed in `ac4851d`) with one from a native Linux
  host once a runner exists.

memo.md `## Current Task` now describes the W53 + C-g shipment;
the W47 bisect plan is restated as the next priority.
checklist.md W49 entry rewritten to reflect that the schema half
of C-g landed; the matrix flip + native baselines are the
remaining items.
roadmap.md table picks up a `C-g multi-arch bench schema = Done`
row.
chaploud added a commit that referenced this pull request Apr 29, 2026
Closes the matrix-flip half of Plan C-g. The benchmark job was
Ubuntu-only because hyperfine had to be installed manually via a
DEB and there was no Windows path. After the schema work in #86
made `bench/history.yaml` multi-arch, the only thing standing
between us and per-OS regression checks on PR was the toolchain
provisioning gap on Windows.

`scripts/windows/install-tools.ps1`:
- new `-OnlyTool hyperfine` arm; pinned via versions.lock
  HYPERFINE_VERSION. The release zip extracts to a single
  version-stamped subdir holding `hyperfine.exe`, so
  Resolve-SingleSubdir flattens it and the executable lands
  directly in the install dir (same layout as zig / wasm-tools /
  wasmtime).
- realworldKeys gains `hyperfine = HYPERFINE_VERSION` so a missing
  pin fails loudly when the Windows installer is asked for it.
- ValidateSet + Update-UserPath + final Verify banner all gain a
  hyperfine entry.

`.github/versions.lock`:
- HYPERFINE_VERSION 1.18.0 → 1.20.0 to match the nixpkgs version
  on aarch64-darwin / x86_64-linux (the existing nix devshell
  already shipped 1.20.0; the Linux DEB step in the previous
  benchmark job was the only consumer of the older 1.18.0 pin).

`.github/workflows/ci.yml`:
- benchmark job becomes `os: [ubuntu-latest, macos-latest,
  windows-latest]`. Linux/macOS provision via nix devshell (same
  pattern as test-nix); Windows uses
  `install-tools.ps1 -OnlyTool zig` + `-OnlyTool hyperfine`,
  skipping Go / TinyGo / Rust / WASI SDK that the realworld test
  job pulls but the benchmarks don't need.
- Existing Linux DEB + setup-zig install steps deleted; the
  intra-runner regression check (`ci_compare.sh --base=origin/main
  --threshold=20 --runs=3 --warmup=1 --skip-build`) and the
  push-to-main `--record-only` step are unchanged in spirit, just
  wrapped in a per-runner `if RUNNER_OS = Windows` selector that
  picks `nix develop --command` vs plain bash.
- `needs: [test-nix, test]` so the bench fan-out only runs after
  all three platform test jobs have already gated the PR.

Cross-runner comparison is still meaningless (the hardware deltas
dwarf any codegen-level signal), and the docstring above the job
makes that explicit. The durable per-arch absolute-time baselines
remain in `bench/history.yaml` (recorded locally per CLAUDE.md
Merge Gate item 10).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant