rt: add 128 / (<=32-bit divisor) fast path to udivmod128 #61

Merged
heifner merged 1 commit into master from perf/udivmod128-small-divisor-fastpath on May 19, 2026

heifner commented May 19, 2026

Change Description

udivmod128 (backing __udivti3/__umodti3/__divti3/__modti3) fell back to a 128-iteration shift/subtract loop whenever the dividend or divisor exceeded 64 bits. The overwhelming majority of contract 128-bit divisions have a small divisor (asset/token math, fixed-point scaling, percentages), so this adds a fast path for 128-bit dividend ÷ (≤ 32-bit divisor). The fast path is schoolbook base-2³² long division over the dividend's four 32-bit digits: 4 native i64.div_u operations instead of 128 loop iterations.

Routing is now: 64 / 64 (1 division) → 128 / (≤32-bit) (4 divisions) → 128-iteration loop (unchanged).
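The routing order can be sketched as a small classifier over the high and low 64-bit halves of the operands. This is only an illustration: the enum, the function name `pick_path`, and the split into `hi`/`lo` halves are assumptions, since the PR description does not show the actual dispatch code. A zero divisor is excluded by a precondition here because its real handling is not described.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the three routes named in the description. */
typedef enum { PATH_64_BY_64, PATH_128_BY_SMALL, PATH_SLOW_LOOP } div_path;

/* Pick a route from the 64-bit halves of dividend u and divisor v.
 * Sketch precondition (an assumption): the divisor is nonzero. */
div_path pick_path(uint64_t u_hi, uint64_t u_lo, uint64_t v_hi, uint64_t v_lo) {
    assert(v_hi != 0 || v_lo != 0);
    (void)u_lo; /* the low half of the dividend never affects routing */

    /* Both operands fit in 64 bits: one native division. */
    if (u_hi == 0 && v_hi == 0) return PATH_64_BY_64;

    /* Wide dividend, divisor at most 2^32 - 1: four native divisions. */
    if (v_hi == 0 && v_lo <= UINT32_MAX) return PATH_128_BY_SMALL;

    /* Everything else: the unchanged 128-iteration loop. */
    return PATH_SLOW_LOOP;
}
```

Checking the 64/64 case first keeps the cheapest route for the cheapest inputs; a divisor of exactly 2³² with a wide dividend lands on the slow loop.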

Why it's safe

The running remainder satisfies r < v ≤ 2³²−1 after every r %= v, which gives the two invariants the path relies on:

  • (r << 32) | digit < 2⁶⁴ → fits a native uint64_t, so the divide is i64.div_u and does not recurse back into __udivti3 (the hard constraint in this file — librt is the contract's own compiler-rt provider, no host fallback to break a cycle).
  • (r << 32) | digit < v · 2³² → each per-digit quotient is < 2³², so the (hi << 32) | lo packing is exact.
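Both bounds follow from one chain of inequalities. The symbol d for the current 32-bit digit is introduced here for brevity and is not used in the PR. Because r < 2³², the shift r << 32 loses no bits and leaves the low 32 bits zero, so (r << 32) | d equals r · 2³² + d. The initial remainder r = 0 also satisfies r < v for any nonzero v.

```latex
\begin{aligned}
r \le v - 1,\quad d \le 2^{32} - 1
  &\;\Longrightarrow\; r \cdot 2^{32} + d \;\le\; (v - 1)\,2^{32} + 2^{32} - 1 \;=\; v \cdot 2^{32} - 1 \\
  &\;\Longrightarrow\; r \cdot 2^{32} + d \;<\; v \cdot 2^{32} \;\le\; (2^{32} - 1)\,2^{32} \;<\; 2^{64} \\
  &\;\Longrightarrow\; \left\lfloor \frac{r \cdot 2^{32} + d}{v} \right\rfloor \;<\; 2^{32}
\end{aligned}
```

The second line is the first invariant (the intermediate value fits in 64 bits) and the third line is the second invariant (each per-digit quotient fits in 32 bits).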

Only uint64_t divide/mod/shift/or are used — the same operations the existing 64/64 fast path already uses. No __int128 multi-word ops.
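A minimal sketch of the digit loop described above, using only `uint64_t` divide, modulo, shift, and or. The `u128` struct, the function name `udivmod128_by_u32`, and its signature are assumptions made for illustration; the actual code in librt is not shown in this description and may be laid out differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value as two 64-bit halves (no __int128 used). */
typedef struct { uint64_t hi, lo; } u128;

/* Divide u by v, where 1 <= v <= 2^32 - 1.
 * Returns the quotient and stores the remainder in *rem. */
u128 udivmod128_by_u32(u128 u, uint64_t v, uint64_t *rem) {
    assert(v != 0 && v <= UINT32_MAX);

    /* The dividend's four base-2^32 digits, most significant first. */
    const uint64_t digits[4] = {
        u.hi >> 32, u.hi & UINT32_MAX,
        u.lo >> 32, u.lo & UINT32_MAX
    };

    uint64_t q[4];
    uint64_t r = 0; /* running remainder, always < v */
    for (int i = 0; i < 4; ++i) {
        /* r < 2^32, so this fits in 64 bits and the divide stays native. */
        uint64_t cur = (r << 32) | digits[i];
        q[i] = cur / v; /* < 2^32 because cur < v * 2^32 */
        r = cur % v;
    }

    /* Each q[i] < 2^32, so packing pairs with shift/or is exact. */
    u128 quot = { (q[0] << 32) | q[1], (q[2] << 32) | q[3] };
    *rem = r;
    return quot;
}
```

For example, dividing the all-ones 128-bit value by 7 with this sketch gives quotient halves `0x2492492492492492` and `0x4924924924924924` with remainder 3.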

Verification

  • In-repo suite (compiler_builtins_tests): all pass, including two new tests; no regression in the existing 18.
  • Standalone fuzz vs. native unsigned __int128: 16M random + boundary inputs, 0 failures, and bit-identical to the slow loop with the fast path disabled. This is a pure optimization — no determinism change on the consensus path.

New tests udivti3_small_divisor_fastpath / umodti3_small_divisor_fastpath pin the digit-carry chain (MAX/7), the 0xFFFFFFFF upper boundary, v=1 identity, a near-2³² divisor, and the v = 2³² just-over case that must fall through to the slow loop.
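The kind of check described above can be approximated with a small differential harness that compares a fast-path sketch against a 128-iteration shift/subtract reference. This is not the repository's test code: every name below (`u128`, `fast_divmod`, `slow_divmod`, `paths_agree`, `run_checks`) is hypothetical, the near-2³² divisor `0xFFFFFFFB` is an arbitrarily chosen value, and the reference loop is restricted to divisors of at most 2⁶³ so that its remainder fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128; /* hypothetical 128-bit pair */

/* Fast path under test: base-2^32 long division, 1 <= v <= 2^32 - 1. */
u128 fast_divmod(u128 u, uint64_t v, uint64_t *rem) {
    assert(v != 0 && v <= UINT32_MAX);
    const uint64_t d[4] = { u.hi >> 32, u.hi & UINT32_MAX,
                            u.lo >> 32, u.lo & UINT32_MAX };
    uint64_t q[4], r = 0;
    for (int i = 0; i < 4; ++i) {
        uint64_t cur = (r << 32) | d[i];
        q[i] = cur / v;
        r = cur % v;
    }
    *rem = r;
    return (u128){ (q[0] << 32) | q[1], (q[2] << 32) | q[3] };
}

/* Reference: one quotient bit per iteration, 128 iterations.
 * Limited to v <= 2^63 so (r << 1) | bit cannot overflow. */
u128 slow_divmod(u128 u, uint64_t v, uint64_t *rem) {
    assert(v != 0 && v <= (UINT64_C(1) << 63));
    u128 q = { 0, 0 };
    uint64_t r = 0;
    for (int i = 127; i >= 0; --i) {
        uint64_t bit = (i >= 64) ? (u.hi >> (i - 64)) & 1 : (u.lo >> i) & 1;
        r = (r << 1) | bit;
        if (r >= v) {
            r -= v;
            if (i >= 64) q.hi |= UINT64_C(1) << (i - 64);
            else         q.lo |= UINT64_C(1) << i;
        }
    }
    *rem = r;
    return q;
}

/* 1 if both paths give the same quotient and remainder. */
int paths_agree(u128 u, uint64_t v) {
    uint64_t rf, rs;
    u128 qf = fast_divmod(u, v, &rf);
    u128 qs = slow_divmod(u, v, &rs);
    return qf.hi == qs.hi && qf.lo == qs.lo && rf == rs;
}

/* xorshift64: a small deterministic generator, good enough for a sketch. */
uint64_t next_rand(uint64_t *s) {
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}

/* Boundary divisors on the all-ones dividend, then n_random random cases. */
int run_checks(unsigned n_random) {
    const u128 max = { UINT64_MAX, UINT64_MAX };
    const uint64_t edge[4] = { 1, 7, 0xFFFFFFFBu, 0xFFFFFFFFu };
    for (int i = 0; i < 4; ++i)
        if (!paths_agree(max, edge[i])) return 0;

    uint64_t s = UINT64_C(0x9E3779B97F4A7C15);
    for (unsigned i = 0; i < n_random; ++i) {
        u128 u;
        u.hi = next_rand(&s);
        u.lo = next_rand(&s);
        uint64_t v = (next_rand(&s) & UINT32_MAX) | 1; /* nonzero, <= 2^32 - 1 */
        if (!paths_agree(u, v)) return 0;
    }
    return 1;
}
```

A divisor of exactly 2³² is deliberately absent from the `edge` list: it violates the fast path's precondition and belongs to the slow loop alone, which is the fall-through the new tests pin.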

udivmod128 fell back to a 128-iteration shift/subtract loop whenever
the dividend or divisor exceeded 64 bits. Most contract 128-bit
divisions have a small divisor (asset math, fixed-point scaling), so
add a schoolbook base-2^32 long division: the 128-bit dividend is
processed as four 32-bit digits -- 4 native i64.div_u instead of 128
loop iterations.

The running remainder satisfies r < v <= 2^32-1 after each step, so
(r << 32) | digit < 2^64 (fits a native uint64, no recursion back
into __udivti3) and < v * 2^32 (each per-digit quotient < 2^32, so
the (hi<<32)|lo packing is exact). Routing is now
64/64 -> 128/(<=32-bit) -> 128-iteration loop.

Verified bit-identical to the slow loop over 16M random + boundary
inputs, so this is a pure optimization with no determinism change.
Adds udivti3/umodti3_small_divisor_fastpath covering the digit-carry
chain, the 0xFFFFFFFF boundary, and the v==2^32 slow-loop fall-through.
heifner requested a review from brianjohnson5972 May 19, 2026 14:07
heifner merged commit 9267e8e into master May 19, 2026
4 checks passed
heifner deleted the perf/udivmod128-small-divisor-fastpath branch May 19, 2026 16:44