Remove an explicitly-set-aside scratch register per class. #51
Merged: cfallin merged 2 commits into bytecodealliance:main from cfallin:scratch-the-scratch-registers on May 23, 2022
Conversation
Currently, regalloc2 sets aside one register per class, unconditionally, to make move resolution possible. To solve the "parallel moves problem", we sometimes need to conjure a cyclic permutation of data among registers or stack slots (this can result, for example, from blockparam flow that swaps two values on a loop backedge). This set-aside scratch register is used when a cycle exists.

regalloc2 also uses the scratch register when needed to break down a stack-to-stack move (which could happen due to blockparam moves on edges when source and destination are both spilled) into a stack-to-reg move followed by a reg-to-stack move, because most machines have loads and stores but not memory-to-memory moves.

A set-aside register is certainly the simplest solution, but it is not optimal: it means that we have one fewer register available for use by the program, and this can be costly, especially on machines with fewer registers (e.g., 16 GPRs/XMMs on x86-64) and especially when some registers may be set aside by our embedder for other purposes too. Every register we can reclaim is worth nontrivial performance in large function bodies!

This PR removes this restriction and allows regalloc2 to use all available physical registers. It then solves the two problems above, cyclic moves and stack-to-stack moves, with a two-stage approach:

- First, it finds a location to use to resolve cycles, if any exist. If a register is unallocated at the location of the move, we can use it; often we get lucky and this is the case. Otherwise, we allocate a stackslot to use as the temp. This is perfectly fine at this stage, even if it means that we have more stack-to-stack moves.
- Then, it resolves stack-to-stack moves into stack-to-reg / reg-to-stack. There are two subcases here. If there is *another* available free physical register, we opportunistically use it for this decomposition. If not, we fall back to our last-ditch option: we pick a victim register of the appropriate class, allocate another temporary stackslot, spill the victim to that slot just for this move, do the move in the above way (stack-to-reg / reg-to-stack) using the victim, then reload the victim. So one move (the original stack-to-stack) becomes four moves, but no state is clobbered.

This PR extends the `moves` fuzz target to exercise this functionality as well, randomly choosing whether some spare registers exist and randomly generating {stack,reg}-to-{stack,reg} moves in the initial parallel-move input set. The target does a simple symbolic simulation of the sequential move sequence and ensures that the final state is equivalent to the parallel-move semantics.

I fuzzed both the `moves` target, focusing on the new logic, and the `ion_checker` target, checking the whole register allocator; both seem clean (~150M cases on the former, ~1M cases on the latter).
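The cycle-breaking step described above can be sketched as follows. This is a simplified illustration, not regalloc2's actual code: `Loc` and `resolve` are hypothetical names, `Temp` stands in for whatever location the allocator found (a free register if one was available, else a stackslot), and destinations are assumed unique.

```rust
// Minimal sketch (hypothetical types, not regalloc2's API) of
// sequentializing a set of parallel moves, reaching for a temp
// location only when a cycle forces it.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Loc {
    Reg(u8),
    Stack(u32),
    Temp,
}

/// Resolve parallel moves `(dst, src)` into a sequential order,
/// breaking cycles through `Loc::Temp`.
fn resolve(mut moves: Vec<(Loc, Loc)>) -> Vec<(Loc, Loc)> {
    let mut out = Vec::new();
    while !moves.is_empty() {
        // Emit any move whose destination is not still needed as the
        // source of a pending move: executing it clobbers nothing.
        if let Some(i) = (0..moves.len())
            .find(|&i| !moves.iter().any(|&(_, src)| src == moves[i].0))
        {
            out.push(moves.remove(i));
            continue;
        }
        // Every remaining destination is still read by some pending
        // move, so we are in a cycle. Break it: save the first move's
        // destination to Temp and redirect its readers there.
        let dst = moves[0].0;
        out.push((Loc::Temp, dst));
        for m in moves.iter_mut() {
            if m.1 == dst {
                m.1 = Loc::Temp;
            }
        }
    }
    out
}
```

For the swap `{r0 <- r1, r1 <- r0}` this yields `temp <- r0; r0 <- r1; r1 <- temp`, matching the cyclic-permutation case described above.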
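The second stage, stack-to-stack decomposition with its victim-spill fallback, might look like this sketch (again hypothetical names and types; regalloc2's real implementation differs):

```rust
// Minimal sketch (hypothetical API) of decomposing a stack-to-stack
// move. With a free register, two moves suffice; otherwise we spill a
// victim register around the transfer, turning one move into four.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Loc {
    Reg(u8),
    Stack(u32),
}

/// Lower `dst <- src` where both are stack slots. `free_reg` is a
/// register known to be unused at this program point, if any; `victim`
/// and `spill_slot` are used only on the fallback path.
fn lower_stack_to_stack(
    dst: Loc,
    src: Loc,
    free_reg: Option<Loc>,
    victim: Loc,
    spill_slot: Loc,
) -> Vec<(Loc, Loc)> {
    match free_reg {
        // Opportunistic case: route through the free register.
        Some(r) => vec![(r, src), (dst, r)],
        // Last-ditch case: save the victim, use it, restore it.
        None => vec![
            (spill_slot, victim), // spill victim to a fresh stackslot
            (victim, src),        // stack-to-reg load
            (dst, victim),        // reg-to-stack store
            (victim, spill_slot), // reload victim; no state clobbered
        ],
    }
}
```

Note that the four-move fallback leaves the victim register's value intact, which is why the PR can guarantee no state is clobbered even with zero free registers.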
Benchmark results with Cranelift:
fitzgen approved these changes on May 23, 2022:
Thanks for the extensive comments, very helpful when reviewing!
cfallin added a commit to cfallin/regalloc2 that referenced this pull request on May 23, 2022:
The change in bytecodealliance#51 was API-visible (removal of the "scratch register" field in the `Env`) so this is a semver bump.
cfallin added a commit that referenced this pull request on May 23, 2022:
The change in #51 was API-visible (removal of the "scratch register" field in the `Env`) so this is a semver bump.
cfallin added a commit to cfallin/wasmtime that referenced this pull request on May 23, 2022:
RA2 recently removed the need for a dedicated scratch register for cyclic moves (bytecodealliance/regalloc2#51). This has moderate positive performance impact on function bodies that were register-constrained, as it means that one more register is available. In Sightglass, I measured +5-8% on `blake3-scalar`, at least among current benchmarks.
cfallin added a commit to bytecodealliance/wasmtime that referenced this pull request on May 23, 2022:
…4182) RA2 recently removed the need for a dedicated scratch register for cyclic moves (bytecodealliance/regalloc2#51). This has moderate positive performance impact on function bodies that were register-constrained, as it means that one more register is available. In Sightglass, I measured +5-8% on `blake3-scalar`, at least among current benchmarks.
Amanieu added a commit to Amanieu/regalloc2 that referenced this pull request on Mar 4, 2023:
Dedicated scratch registers used for resolving move cycles were removed in bytecodealliance#51 and replaced with an algorithm that automatically allocates a scratch register as needed. However, in many cases a client will already have a non-allocatable scratch register available for things like extended jumps (see bytecodealliance#91). It makes sense to re-use this register for regalloc rather than potentially spilling an existing register.
cfallin pushed a commit that referenced this pull request on Mar 4, 2023:
* Re-introduce optional dedicated scratch registers: dedicated scratch registers used for resolving move cycles were removed in #51 and replaced with an algorithm that automatically allocates a scratch register as needed. However, in many cases a client will already have a non-allocatable scratch register available for things like extended jumps (see #91). It makes sense to re-use this register for regalloc rather than potentially spilling an existing register.
* Clarify comment
Fixes #45.