docs(w54): investigation — gap is post-fold loop overhead, not missing fold #90

Merged: chaploud merged 1 commit into main from develop/w54-investigation on Apr 29, 2026

@chaploud
Contributor

Summary

The original W54 framing assumed zwasm was missing the constant-divisor fold that cranelift uses (multiply-high reciprocal). That assumption was wrong: both ARM64 (src/jit.zig:3582-3666 tryEmitDivByConstU32) and x86_64 (src/x86.zig) already implement the Hacker's Delight 10-9 magic-multiply, and dumping the JIT for bench/wasm/tgo_string_ops.wasm confirms three identical MOVZ + MOVK + UMULL + LSR sequences for the three i32.div_u 10 sites — zero UDIV instructions.
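To make the fold concrete, here is a minimal sketch of the multiply-high reciprocal for `i32.div_u 10`, modelled in Python rather than taken from `src/jit.zig`. It uses 0xCCCCCCCD, the magic constant quoted in the commit message; the shift of 35 is the standard pairing for that constant and is an assumption here, not something read from the zwasm source.

```python
import random

MAGIC_DIV10 = 0xCCCCCCCD  # ceil(2**35 / 10)
SHIFT = 35                # assumed total right shift applied to the 64-bit product


def div_u10(n: int) -> int:
    """Unsigned 32-bit n // 10 using a multiply and a shift, no divide."""
    assert 0 <= n <= 0xFFFFFFFF
    product = n * MAGIC_DIV10   # models UMULL: 32 x 32 -> 64-bit product
    return product >> SHIFT     # models LSR #35


# Boundary values plus a reproducible random sample of 32-bit inputs.
samples = [0, 1, 9, 10, 11, 99, 100, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
rng = random.Random(54)
samples += [rng.getrandbits(32) for _ in range(10_000)]

mismatches = [n for n in samples if div_u10(n) != n // 10]
print("mismatches:", len(mismatches))
```

The approximation is exact for every 32-bit input because 10 * 0xCCCCCCCD = 2^35 + 2, so the error term stays below 1/40 and never pushes the quotient across an integer boundary.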

This PR captures that diagnosis so the next implementation pass can start clean from a verified hypothesis.

What the 2.1× gap actually is

  1. The magic-constant load is re-emitted on every iteration (~6 ARM64 instructions/iter for the three div sites in tgo_strops). Cranelift's SSA + GVN hoists the load out of the loop; zwasm's single-pass JIT cannot do so without a preheader pass.
  2. TinyGo emits a mov rd = rs1 for each local.set, and that move stays in the JIT'd code, whereas cranelift collapses it into a register rename via SSA.

Single-pass-compatible levers

  1. Loop-preheader magic hoist — extend emitLoopPreHeader (today SIMD-only, src/jit.zig:4604) to scan for OP_CONST32 K → OP_DIV_U, allocate a callee-saved register (the prologue already spills x19–x28 unconditionally; functions with ≤13 vregs have x21 free without prologue surgery), pre-load the magic. tryEmitDivByConstU32 short-circuits when the magic is already live. Saves ~6 instructions per iteration on tgo_strops.
  2. OP_CONST32 reuse across loop back-edges — saves the 1-instr const itself but not the magic that hangs off it. Skip unless (1) lands.
  3. OP_MOV coalescing in linear-scan regalloc — substantial surgery; deserves its own W## entry.
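The scan in lever 1 can be sketched as follows. This is an illustration only: the `Instr` tuple and `hoistable_divisors` are hypothetical and do not mirror zwasm's real IR or `emitLoopPreHeader`; only the opcode names come from the text above. The sketch does not filter out divisors that a magic multiply would not be used for (zero, powers of two), and it ignores control flow inside the loop body.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical register-IR instruction; not zwasm's actual layout."""
    op: str        # opcode name, e.g. "OP_CONST32" or "OP_DIV_U"
    rd: int        # destination vreg
    rs1: int = 0   # first source vreg (dividend for div/rem)
    rs2: int = 0   # second source vreg (divisor for div/rem)
    imm: int = 0   # immediate, used by OP_CONST32


def hoistable_divisors(loop_body):
    """Return the distinct constants K that feed OP_DIV_U / OP_REM_U as divisor."""
    const_in = {}  # vreg -> constant currently known to live in it
    found = []
    for ins in loop_body:
        if ins.op == "OP_CONST32":
            const_in[ins.rd] = ins.imm
            continue
        if ins.op in ("OP_DIV_U", "OP_REM_U") and ins.rs2 in const_in:
            k = const_in[ins.rs2]
            if k not in found:
                found.append(k)
        # Any non-const write to a vreg invalidates what we knew about it.
        const_in.pop(ins.rd, None)
    return found


body = [
    Instr("OP_CONST32", rd=5, imm=10),
    Instr("OP_DIV_U", rd=6, rs1=1, rs2=5),
    Instr("OP_CONST32", rd=5, imm=10),
    Instr("OP_REM_U", rd=7, rs1=1, rs2=5),
    Instr("OP_MOV", rd=5, rs1=2),           # overwrites the constant
    Instr("OP_DIV_U", rd=8, rs1=1, rs2=5),  # divisor is no longer a known const
    Instr("OP_CONST32", rd=9, imm=3),
    Instr("OP_ADD", rd=10, rs1=9, rs2=1),   # const 3 never feeds a div
]
print(hoistable_divisors(body))
```

Returning the distinct constants rather than the individual sites keeps the register question visible: each entry in the result would need its own preloaded register.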

Out of scope (would break single-pass): SSA + dataflow, global regalloc, auto unroll / vectorise.

Why no implementation in this PR

The hoist is well-bounded but invasive enough that it warrants a focused PR rather than a tail-end commit on a long autonomous run. Edge cases that need design-level attention before the first commit:

  • Multiple different magic constants in the same loop body (which one to hoist? all? only the most-used?).
  • Functions with reg_count ≥ 14 (the obvious x21 slot is occupied; need to either skip the optimisation or pick a caller-saved register and verify the loop has no calls).
  • Nested loops (innermost wins? all levels?).
  • x86_64 has a different register layout and a different cost ratio for magic loads — equivalent optimisation will need its own audit.

Test plan

  • Doc-only PR; CI runs only to confirm that no other gate trips.
  • memo / checklist / roadmap rewordings cross-checked against .dev/w54-investigation.md.

The original W54 framing assumed zwasm did not constant-fold
`i32.div_u K`, leaving cranelift's multiply-high optimisation
unmatched. That is wrong: the fold is already implemented for
both ARM64 (`src/jit.zig:3582-3666`) and x86_64
(`src/x86.zig` `tryEmitDivByConstU32`). Dumping the JIT for
`bench/wasm/tgo_string_ops.wasm` confirms three identical
MOVZ+MOVK+UMULL+LSR sequences for the three `i32.div_u 10`
sites — zero `UDIV` instructions emitted.

The actual gap lives in two places:

1. The 2-instruction magic-constant load (MOVZ + MOVK for
   0xCCCCCCCD) is re-emitted inside the loop body on every
   iteration; cranelift hoists it via SSA + GVN so only
   UMULL+LSR stay in the hot path. With three div sites in
   `tgo_strops` that costs ~6 ARM64 instructions per loop
   iteration.
2. TinyGo emits a `mov rd = rs1` per `local.set`; cranelift's
   SSA collapses those into register renames whereas zwasm's
   linear-scan regalloc spills them to LDR/STR pairs against
   `regs[]`.
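The arithmetic behind item 1 can be checked with a small model. The sketch below shows how a MOVZ + MOVK pair builds 0xCCCCCCCD from two 16-bit halves, then counts loop-body instructions with and without the hoist. The helper names and the low-half-first ordering are assumptions for illustration; the instruction counts (2 to load, 2 to use, 3 sites) are the ones stated above.

```python
MAGIC = 0xCCCCCCCD


def movz(imm16: int, shift: int = 0) -> int:
    """Model MOVZ: place a 16-bit immediate at `shift`, zero all other bits."""
    return (imm16 & 0xFFFF) << shift


def movk(reg: int, imm16: int, shift: int) -> int:
    """Model MOVK: replace one 16-bit field of `reg`, keep the other bits."""
    mask = 0xFFFF << shift
    return (reg & ~mask) | ((imm16 & 0xFFFF) << shift)


lo16 = MAGIC & 0xFFFF          # 0xCCCD
hi16 = (MAGIC >> 16) & 0xFFFF  # 0xCCCC
w = movz(lo16)                 # MOVZ w, #0xCCCD
w = movk(w, hi16, 16)          # MOVK w, #0xCCCC, lsl #16

LOAD_INSTRS = 2  # MOVZ + MOVK
USE_INSTRS = 2   # UMULL + LSR
DIV_SITES = 3    # i32.div_u 10 sites in the loop


def loop_body_instrs(hoisted: bool) -> int:
    """Division-related instructions executed per loop iteration."""
    per_site = USE_INSTRS + (0 if hoisted else LOAD_INSTRS)
    return DIV_SITES * per_site


saved_per_iter = loop_body_instrs(False) - loop_body_instrs(True)
print(hex(w), loop_body_instrs(False), loop_body_instrs(True), saved_per_iter)
```

Since all three sites divide by the same constant, a hoisted version would need only one MOVZ + MOVK pair in the preheader, paid once per loop entry rather than once per iteration.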

Single-pass-compatible levers, ranked by leverage:

- **Loop-preheader magic hoist.** Extend `emitLoopPreHeader`
  (currently SIMD-only) to scan for `OP_CONST32 K` whose `rd`
  feeds `OP_DIV_U` / `OP_REM_U` later in the loop body, allocate
  a callee-saved register, and pre-load the magic. `tryEmitDivByConstU32`
  short-circuits when the magic is already live. Saves ~6
  instructions per iteration on `tgo_strops`. Risk: medium —
  needs to coexist with the existing physical-register layout
  (`vregToPhys` saturates x20-x26 + x9-x15 fast for high
  reg_count functions like func#24 with 13 vregs, where no
  callee-saved register is free without reserving one in the
  prologue).
- **`OP_CONST32` reuse across loop back-edges.** Today
  `known_consts` is wiped at every header. Skip unless the
  preheader hoist lands first — saves the 1-instr const itself
  but not the 2-instr magic that hangs off it.
- **`OP_MOV` coalescing in linear-scan regalloc.** Substantial
  surgery; warrants its own W## entry, not in scope for tonight.

Next step: open a focused PR that experiments with the
preheader hoist on a minimal JIT regression suite first, and
abort if `bench/run_bench.sh --quick` shows a regression
elsewhere. Re-record `bench/runtime_comparison.yaml` at 5 runs
+ 3 warmup before claiming a number — the existing values are
single-sample.
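The "5 runs + 3 warmup" protocol can be sketched generically as below. This is not `bench/run_bench.sh` and it makes no assumption about the schema of `bench/runtime_comparison.yaml`; `measure` and `summarize` are hypothetical helpers that only illustrate discarding warmup runs and reporting a median instead of a single sample.

```python
import statistics
import time


def measure(fn, runs=5, warmup=3):
    """Call fn `warmup` times untimed, then return `runs` timed samples in seconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Reduce samples to the figures worth recording."""
    return {
        "runs": len(samples),
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload; a real harness would invoke the benchmark binary here.
    stats = summarize(measure(lambda: sum(range(100_000))))
    print(stats)
```

Recording the minimum and maximum alongside the median makes it easier to tell a real regression from run-to-run noise.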

Captures the diagnosis tonight so the implementation pass can
start clean from a verified hypothesis rather than redo the
analysis.
@chaploud force-pushed the develop/w54-investigation branch from afae94a to 4c26046 (April 29, 2026 13:22)
@chaploud merged commit 30890b6 into main (Apr 29, 2026); 10 checks passed
@chaploud deleted the develop/w54-investigation branch (April 29, 2026 13:46)
chaploud added a commit that referenced this pull request (Apr 29, 2026)