docs(w54): investigation — gap is post-fold loop overhead, not missing fold by chaploud · Pull Request #90 · clojurewasm/zwasm

chaploud · 2026-04-29T13:20:30Z

Summary

The original W54 framing assumed zwasm was missing the constant-divisor fold that cranelift uses (multiply-high reciprocal). That assumption was wrong: both ARM64 (src/jit.zig:3582-3666 tryEmitDivByConstU32) and x86_64 (src/x86.zig) already implement the Hacker's Delight 10-9 magic-multiply, and dumping the JIT for bench/wasm/tgo_string_ops.wasm confirms three identical MOVZ + MOVK + UMULL + LSR sequences for the three i32.div_u 10 sites — zero UDIV instructions.

This PR captures that diagnosis so the next implementation pass can start clean from a verified hypothesis.
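As a sanity check on the fold described above, the following minimal sketch reproduces the arithmetic behind the MOVZ + MOVK + UMULL + LSR sequence for a divisor of 10. The magic value 0xCCCCCCCD comes from the commit message; the total shift of 35 is the standard unsigned 32-bit reciprocal shift for 10 and is an assumption about what zwasm actually emits. The function and variable names are illustrative only.

```python
# Assumption: divisor 10, magic 0xCCCCCCCD, shift 35. Names are illustrative,
# not taken from zwasm.
MAGIC_DIV10 = 0xCCCCCCCD
SHIFT_DIV10 = 35
U32_MAX = 0xFFFFFFFF


def div10_magic(n: int) -> int:
    """Compute n // 10 for an unsigned 32-bit n without a divide."""
    assert 0 <= n <= U32_MAX
    product = n * MAGIC_DIV10       # UMULL: full 64-bit product of two u32 values
    return product >> SHIFT_DIV10   # LSR: keep the bits above position 35


def check(samples) -> bool:
    """Compare the multiply-and-shift result against plain integer division."""
    return all(div10_magic(n) == n // 10 for n in samples)


# Boundary values plus a dense low range.
samples = [0, 1, 9, 10, 11, 99, 100, 0x7FFFFFFF, 0x80000000, U32_MAX - 1, U32_MAX]
samples.extend(range(100_000))
print(check(samples))
```

The check covers boundary values and a dense low range rather than all 2^32 inputs; an exhaustive loop is feasible but slow in Python.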

What the 2.1× gap actually is

1. Magic-constant load is re-emitted every iteration (~6 ARM64 instructions/iter for three div sites in tgo_strops). Cranelift's SSA + GVN hoist; zwasm's single-pass cannot without a preheader pass.
2. TinyGo's local.set → mov rd = rs1 stays in the JIT'd code where cranelift collapses it into a register rename via SSA.
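The "~6 instructions per iteration" figure follows from the shape of the constant load: MOVZ writes one 16-bit chunk and zeroes the rest of the register, and MOVK then inserts a second 16-bit chunk while keeping the other bits, so a 32-bit magic with two non-zero halves needs two instructions per div site. A small sketch of that accounting, with hypothetical helper names:

```python
def movz_movk_halves(imm32: int) -> tuple[int, int]:
    """Split a 32-bit immediate into the two 16-bit chunks MOVZ and MOVK carry."""
    low16 = imm32 & 0xFFFF
    high16 = (imm32 >> 16) & 0xFFFF
    return low16, high16


def rebuild(low16: int, high16: int) -> int:
    """Model the register contents after MOVZ (low half) then MOVK (high half)."""
    reg = low16                              # MOVZ wd, #low16
    reg = (reg & 0xFFFF) | (high16 << 16)    # MOVK wd, #high16, lsl #16
    return reg


def magic_load_cost_per_iter(div_sites: int, instrs_per_load: int = 2) -> int:
    """Instructions spent per iteration re-materialising the magic constant."""
    return div_sites * instrs_per_load


low, high = movz_movk_halves(0xCCCCCCCD)
print(hex(low), hex(high))            # 0xcccd 0xcccc
print(magic_load_cost_per_iter(3))    # 6
```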

Single-pass-compatible levers

1. Loop-preheader magic hoist: extend emitLoopPreHeader (today SIMD-only, src/jit.zig:4604) to scan for OP_CONST32 K → OP_DIV_U, allocate a callee-saved register (the prologue already spills x19–x28 unconditionally; functions with ≤13 vregs have x21 free without prologue surgery), and pre-load the magic. tryEmitDivByConstU32 short-circuits when the magic is already live. Saves ~6 instructions per iteration on tgo_strops.
2. OP_CONST32 reuse across loop back-edges: saves the 1-instr const itself but not the magic that hangs off it. Skip unless (1) lands.
3. OP_MOV coalescing in linear-scan regalloc: substantial surgery; deserves its own W## entry.

Out of scope (would break single-pass): SSA + dataflow, global regalloc, auto unroll / vectorise.
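To make the first lever concrete, here is a toy model of a preheader hoist in a single-pass emitter. Everything in it is hypothetical: the tuple-based ops, the register names, the pseudo-assembly strings, and the helper names stand in for zwasm's real IR and are not its API. The sketch only shows the mechanism: scan the loop body once for a constant that feeds `div_u`, load the magic in the preheader if a register is free, and let each div site skip the load when the magic is already live.

```python
# Toy model only: op tuples, registers, and helpers are hypothetical.
MAGICS = {10: (0xCCCCCCCD, 35)}  # divisor -> (magic, shift); only 10 is modelled
SCRATCH = "x16"                  # assumed scratch register for in-loop loads


def load_magic(reg, magic):
    """Two-instruction constant load (pseudo-assembly)."""
    return [f"movz {reg}, #{magic & 0xFFFF:#x}",
            f"movk {reg}, #{magic >> 16:#x}, lsl #16"]


def find_hoistable_divisors(body):
    """Return divisors K where a const32 K feeds a later div_u as the divisor."""
    const_of, found = {}, []
    for op in body:
        if op[0] == "const32":
            const_of[op[1]] = op[2]
        elif op[0] == "div_u":
            k = const_of.get(op[3])
            if k in MAGICS and k not in found:
                found.append(k)
            const_of.pop(op[1], None)   # destination no longer holds a constant
        else:
            const_of.pop(op[1], None)   # any other write invalidates the vreg
    return found


def emit_loop(body, free_regs):
    """Return (preheader, loop_body) pseudo-instruction lists for one loop."""
    preheader, out, live_magic = [], [], {}
    regs = list(free_regs)
    for k in find_hoistable_divisors(body):
        if not regs:
            break                       # no free register: keep in-loop loads
        reg = regs.pop(0)
        preheader += load_magic(reg, MAGICS[k][0])
        live_magic[k] = reg
    const_of = {}
    for op in body:
        if op[0] == "const32":
            const_of[op[1]] = op[2]
            out.append(f"mov {op[1]}, #{op[2]}")
        elif op[0] == "div_u":
            _, rd, rs1, rs2 = op
            k = const_of.get(rs2)
            if k in live_magic:         # short-circuit: magic already live
                out += [f"umull {rd}, {rs1}, {live_magic[k]}",
                        f"lsr {rd}, {rd}, #{MAGICS[k][1]}"]
            elif k in MAGICS:           # fold without hoist: reload every time
                out += load_magic(SCRATCH, MAGICS[k][0])
                out += [f"umull {rd}, {rs1}, {SCRATCH}",
                        f"lsr {rd}, {rd}, #{MAGICS[k][1]}"]
            else:
                out.append(f"udiv {rd}, {rs1}, {rs2}")
            const_of.pop(rd, None)
        else:
            out.append(" ".join(str(x) for x in op))
            const_of.pop(op[1], None)
    return preheader, out


# Three div-by-10 sites, mirroring the count reported for tgo_strops.
BODY = [("const32", "v1", 10), ("div_u", "v2", "v0", "v1"),
        ("const32", "v3", 10), ("div_u", "v4", "v2", "v3"),
        ("const32", "v5", 10), ("div_u", "v6", "v4", "v5")]

_, baseline = emit_loop(BODY, [])
pre, hoisted = emit_loop(BODY, ["x21"])
print(len(baseline) - len(hoisted))  # 6 fewer instructions per iteration
```

In this model the const load itself still appears in the loop body, which matches the note that const reuse across back-edges is a separate, smaller lever.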

Why no implementation in this PR

The hoist is well-bounded but invasive enough that it warrants a focused PR rather than a tail-end commit on a long autonomous run. Edge cases that need design-level attention before the first commit:

Multiple different magic constants in the same loop body (which one to hoist? all? only the most-used?).
Functions with reg_count ≥ 14 (the obvious x21 slot is occupied; need to either skip the optimisation or pick a caller-saved register and verify the loop has no calls).
Nested loops (innermost wins? all levels?).
x86_64 has a different register layout and a different cost ratio for magic loads — equivalent optimisation will need its own audit.

Test plan

Doc-only PR; CI exists to verify no other gate trips.
memo / checklist / roadmap rewordings cross-checked against .dev/w54-investigation.md.

The original W54 framing assumed zwasm did not constant-fold `i32.div_u K`, leaving cranelift's multiply-high optimisation unmatched. That is wrong: the fold is already implemented for both ARM64 (`src/jit.zig:3582-3666`) and x86_64 (`src/x86.zig` `tryEmitDivByConstU32`). Dumping the JIT for `bench/wasm/tgo_string_ops.wasm` confirms three identical MOVZ+MOVK+UMULL+LSR sequences for the three `i32.div_u 10` sites, with zero `UDIV` instructions emitted.

The actual gap lives in two places:

1. The 2-instruction magic-constant load (MOVZ + MOVK for 0xCCCCCCCD) is re-emitted inside the loop body on every iteration; cranelift hoists it via SSA + GVN so only UMULL+LSR stay in the hot path. With three div sites in `tgo_strops` that costs ~6 ARM64 instructions per loop iteration.
2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift's SSA collapses those into register renames whereas zwasm's linear-scan regalloc spills them to LDR/STR pairs against `regs[]`.

Single-pass-compatible levers, ranked by leverage:

- **Loop-preheader magic hoist.** Extend `emitLoopPreHeader` (currently SIMD-only) to scan for `OP_CONST32 K` whose `rd` feeds `OP_DIV_U` / `OP_REM_U` later in the loop body, allocate a callee-saved register, and pre-load the magic. `tryEmitDivByConstU32` short-circuits when the magic is already live. Saves ~6 instructions per iteration on `tgo_strops`. Risk: medium. It needs to coexist with the existing physical-register layout (`vregToPhys` saturates x20-x26 + x9-x15 fast for high reg_count functions like func#24 with 13 vregs, where no callee-saved register is free without reserving one in the prologue).
- **`OP_CONST32` reuse across loop back-edges.** Today `known_consts` is wiped at every header. Skip unless the preheader hoist lands first: it saves the 1-instr const itself but not the 2-instr magic that hangs off it.
- **`OP_MOV` coalescing in linear-scan regalloc.** Substantial surgery; warrants its own W## entry, not in scope for tonight.

Next step: open a focused PR that experiments with the preheader hoist on a minimal JIT regression suite first, and abort if `bench/run_bench.sh --quick` shows a regression elsewhere. Re-record `bench/runtime_comparison.yaml` at 5 runs + 3 warmup before claiming a number; the existing values are single-sample.

Captures the diagnosis tonight so the implementation pass can start clean from a verified hypothesis rather than redo the analysis.


chaploud force-pushed the develop/w54-investigation branch from afae94a to 4c26046 Compare April 29, 2026 13:22

chaploud merged commit 30890b6 into main Apr 29, 2026
10 checks passed

chaploud deleted the develop/w54-investigation branch April 29, 2026 13:46

chaploud added a commit that referenced this pull request Apr 29, 2026

Record benchmark for docs(w54): investigation — 2.1× wasmtime gap is …

9a14a90

…post-fold loop overhead (#90)
