
Remove an explicitly-set-aside scratch register per class. #51

Merged
merged 2 commits into main from scratch-the-scratch-registers on May 23, 2022

Conversation

@cfallin (Member) commented May 21, 2022

Currently, regalloc2 sets aside one register per class, unconditionally,
to make move resolution possible. To solve the "parallel moves problem",
we sometimes need to conjure a cyclic permutation of data among
registers or stack slots (this can result, for example, from blockparam
flow that swaps two values on a loop backedge). This set-aside scratch
register is used when a cycle exists.
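To make the cycle case concrete, here is a minimal sketch (hypothetical names, not regalloc2's actual API) of sequentializing a cyclic permutation using one scratch location:

```rust
// Toy model: `vals` holds the contents of registers/slots; `cycle` lists
// the locations of a move cycle, where location cycle[i] must receive the
// value of cycle[i + 1] (and the last receives the first's value).
// A scratch location breaks the cycle: save one value, shift, restore.
fn break_cycle(vals: &mut [i32], cycle: &[usize], scratch: &mut i32) {
    *scratch = vals[cycle[0]];
    for w in 0..cycle.len() - 1 {
        vals[cycle[w]] = vals[cycle[w + 1]];
    }
    let last = cycle[cycle.len() - 1];
    vals[last] = *scratch;
}

fn main() {
    // Parallel semantics: r0 <- r1 and r1 <- r0 (a swap on a backedge).
    let mut regs = vec![10, 20];
    let mut scratch = 0;
    break_cycle(&mut regs, &[0, 1], &mut scratch);
    assert_eq!(regs, vec![20, 10]);
}
```

Without some free location to hold the saved value, no sequential ordering of the moves alone can implement the swap, which is why a scratch is needed at all.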

regalloc2 also uses the scratch register when needed to break down a
stack-to-stack move (which could happen due to blockparam moves on edges
when source and destination are both spilled) into a stack-to-reg move
followed by reg-to-stack, because most machines have loads and stores
but not memory-to-memory moves.
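A minimal sketch of that decomposition (hypothetical names, not the actual implementation): the stack-to-stack move becomes a load into a scratch register followed by a store:

```rust
// Toy model: a stack-to-stack move on a load/store machine must go
// through a register, since there is no memory-to-memory move.
fn stack_to_stack(stack: &mut [i32], src: usize, dst: usize, scratch_reg: &mut i32) {
    *scratch_reg = stack[src]; // load:  scratch <- stack[src]
    stack[dst] = *scratch_reg; // store: stack[dst] <- scratch
}

fn main() {
    let mut stack = vec![7, 0];
    let mut reg = 0;
    stack_to_stack(&mut stack, 0, 1, &mut reg);
    assert_eq!(stack, vec![7, 7]);
}
```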

A set-aside register is certainly the simplest solution, but it is not
optimal: it means we have one fewer register available to the program,
which can be costly, especially on machines with fewer registers (e.g.,
16 GPRs/XMMs on x86-64) and especially when the embedder sets aside
some registers for other purposes too. Every register we can reclaim
buys nontrivial performance in large function bodies!

This PR removes this restriction and allows regalloc2 to use all
available physical registers. It then solves the two problems above,
cyclic moves and stack-to-stack moves, with a two-stage approach:

  • First, it finds a location to use to resolve cycles, if any exist. If
    a register is unallocated at the location of the move, we can use it.
    Often we get lucky and this is the case. Otherwise, we allocate a
    stackslot to use as the temp. This is perfectly fine at this stage,
    even if it means that we have more stack-to-stack moves.

  • Then, it resolves stack-to-stack moves into stack-to-reg /
    reg-to-stack. There are two subcases here. If there is another
    available free physical register, we opportunistically use it for this
    decomposition. If not, we fall back to our last-ditch option: we pick
    a victim register of the appropriate class, we allocate another
    temporary stackslot, we spill the victim to that slot just for this
    move, we do the move in the above way (stack-to-reg / reg-to-stack)
    with the victim, then we reload the victim. So one move (original
    stack-to-stack) becomes four moves, but no state is clobbered.
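The last-ditch fallback above can be sketched as follows (a toy model with hypothetical names, not the actual implementation): one stack-to-stack move becomes four moves, and the victim register's value survives:

```rust
// Toy machine state: a register file and a stack frame of spill slots.
struct Machine {
    regs: Vec<i32>,
    stack: Vec<i32>,
}

// No free register exists, so spill a victim to a temporary slot, use it
// to carry the stack-to-stack move, then reload it. Four moves total.
fn stack_to_stack_with_victim(
    m: &mut Machine,
    src: usize,
    dst: usize,
    victim: usize,
    temp_slot: usize,
) {
    m.stack[temp_slot] = m.regs[victim]; // 1. spill victim
    m.regs[victim] = m.stack[src];       // 2. load source slot
    m.stack[dst] = m.regs[victim];       // 3. store to destination slot
    m.regs[victim] = m.stack[temp_slot]; // 4. reload victim
}

fn main() {
    let mut m = Machine { regs: vec![42], stack: vec![7, 0, 0] };
    stack_to_stack_with_victim(&mut m, 0, 1, 0, 2);
    assert_eq!(m.stack[1], 7); // the move happened
    assert_eq!(m.regs[0], 42); // the victim was not clobbered
}
```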

This PR extends the `moves` fuzz target to exercise this functionality
as well, randomly choosing for some spare registers to exist or not, and
randomly generating {stack,reg}-to-{stack,reg} moves in the initial
parallel-move input set. The target does a simple symbolic simulation of
the sequential move sequence and ensures that the final state is
equivalent to the parallel-move semantics.
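That symbolic check can be sketched roughly like this (an illustrative model, not the actual fuzz target): every location starts holding a unique symbol, the sequential moves are executed, and each parallel destination must end up with the symbol its source held initially:

```rust
// Moves are (dst, src) pairs over `nlocs` abstract locations.
// Returns true if the sequential sequence implements the parallel set.
fn check(parallel: &[(usize, usize)], sequential: &[(usize, usize)], nlocs: usize) -> bool {
    // Each location starts holding its own index as a unique symbol.
    let mut state: Vec<usize> = (0..nlocs).collect();
    let init = state.clone();
    for &(dst, src) in sequential {
        state[dst] = state[src];
    }
    // Parallel semantics: each dst must hold the symbol its src held initially.
    parallel.iter().all(|&(dst, src)| state[dst] == init[src])
}

fn main() {
    // A swap of locations 0 and 1, sequentialized via scratch location 2.
    let parallel = [(0, 1), (1, 0)];
    let sequential = [(2, 0), (0, 1), (1, 2)];
    assert!(check(&parallel, &sequential, 3));
}
```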

I fuzzed both the `moves` target (focusing on the new logic) and the
`ion_checker` target (exercising the whole register allocator); both
seem clean (~150M cases on the former, ~1M on the latter).

Fixes #45.

@cfallin cfallin requested a review from fitzgen May 21, 2022 03:14
@cfallin (Member, Author) commented May 21, 2022

Benchmark results with Cranelift: blake3-scalar gets 5-8% faster, runtimes otherwise unaffected.

execution :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm

  Δ = 26130.32 ± 5035.33 (confidence = 99%)

  new.so is 1.05x to 1.08x faster than old.so!
  old.so is 0.93x to 0.95x faster than new.so!

  [381216 389967.40 419786] new.so
  [407474 416097.72 445626] old.so

execution :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm

  Δ = 6874.52 ± 1325.00 (confidence = 99%)

  new.so is 1.05x to 1.08x faster than old.so!
  old.so is 0.93x to 0.95x faster than new.so!

  [100317 102615.16 110455] new.so
  [107217 109489.68 117256] old.so

@fitzgen (Member) left a comment

Thanks for the extensive comments, very helpful when reviewing!

Review threads on src/moves.rs (resolved)
@cfallin cfallin merged commit 869c21e into bytecodealliance:main May 23, 2022
@cfallin cfallin deleted the scratch-the-scratch-registers branch May 23, 2022 17:48
cfallin added a commit to cfallin/regalloc2 that referenced this pull request May 23, 2022
The change in bytecodealliance#51 was API-visible (removal of the "scratch register"
field in the `Env`) so this is a semver bump.
@cfallin cfallin mentioned this pull request May 23, 2022
cfallin added a commit that referenced this pull request May 23, 2022
The change in #51 was API-visible (removal of the "scratch register"
field in the `Env`) so this is a semver bump.
cfallin added a commit to cfallin/wasmtime that referenced this pull request May 23, 2022
RA2 recently removed the need for a dedicated scratch register for
cyclic moves (bytecodealliance/regalloc2#51). This has moderate positive
performance impact on function bodies that were register-constrained, as
it means that one more register is available. In Sightglass, I measured
+5-8% on `blake3-scalar`, at least among current benchmarks.
cfallin added a commit to bytecodealliance/wasmtime that referenced this pull request May 23, 2022
…4182)

RA2 recently removed the need for a dedicated scratch register for
cyclic moves (bytecodealliance/regalloc2#51). This has moderate positive
performance impact on function bodies that were register-constrained, as
it means that one more register is available. In Sightglass, I measured
+5-8% on `blake3-scalar`, at least among current benchmarks.
Amanieu added a commit to Amanieu/regalloc2 that referenced this pull request Mar 4, 2023
Dedicated scratch registers used for resolving move cycles were removed
in bytecodealliance#51 and replaced with an algorithm to automatically allocate a
scratch register as needed.

However, in many cases a client will already have a non-allocatable
scratch register available for things like extended jumps (see bytecodealliance#91). It
makes more sense to reuse this register for regalloc than to
potentially spill an existing register.
cfallin pushed a commit that referenced this pull request Mar 4, 2023
* Re-introduce optional dedicated scratch registers

Dedicated scratch registers used for resolving move cycles were removed
in #51 and replaced with an algorithm to automatically allocate a
scratch register as needed.

However, in many cases a client will already have a non-allocatable
scratch register available for things like extended jumps (see #91). It
makes more sense to reuse this register for regalloc than to
potentially spill an existing register.

* Clarify comment
Successfully merging this pull request may close these issues.

Remove the need for dedicated, non-allocatable scratch registers per class