[RFC] ISLE: Migrate call and return instructions #3785
Rebased now that both prerequisite PRs were merged. This is probably still not quite ready to merge, but I'd certainly appreciate any comments!
Thanks for this and sorry for the delay in looking at it!
Echoing thoughts from today's Cranelift biweekly, I think that it makes sense to merge `ABICallerIsle` into `ABISig` proper -- the only additional bit is the `copy_to_arg_order` which, while technically a kind of ABI detail rather than signature, I think is reasonable enough to attach to the signature. Then we can provide bindings into ISLE to access the `ABISig` directly.
My only other comment is with regard to the `(lower (call ...))` toplevel rule itself: it's currently written to emit the bits of the call sequence directly; it might be worth exploring an abstraction similar to the `ProducesFlags`/`ConsumesFlags` one where we encapsulate instructions, then pass them as one to a combinator that ensures they are emitted in the right order. (In this case the danger is clobbering of specific registers, rather than flags.) So something like `CallSetup` / `CallInvocation`. Though, now that I say that, I realize that the variable-length argument list implies gathering a `Vec` or similar of argument setup instructions, so perhaps we can think about this and build it incrementally later.
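As a rough illustration of the combinator idea in the comment above, here is a minimal Rust sketch. It is not Cranelift code: `Inst`, `with_call_setup`, and the shapes of `CallSetup` and `CallInvocation` are all hypothetical, chosen only to show how gathering the setup instructions in a `Vec` and emitting them together with the call fixes their relative order.

```rust
/// Hypothetical stand-in for a machine instruction (not Cranelift's `MInst`).
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    /// Move an outgoing argument into its ABI-assigned location.
    MoveToArgReg { arg: usize },
    /// The call itself.
    Call { callee: String },
}

/// Argument-setup instructions, gathered but not yet emitted (assumed shape).
struct CallSetup(Vec<Inst>);

/// The call instruction, also not yet emitted (assumed shape).
struct CallInvocation(Inst);

/// Emit the setup instructions first and the call last, as one unit.
/// Because both pieces are consumed here, no other lowering code can
/// place an instruction between them and clobber an argument register.
fn with_call_setup(setup: CallSetup, call: CallInvocation, out: &mut Vec<Inst>) {
    out.extend(setup.0);
    out.push(call.0);
}
```

A caller would build a `CallSetup` from one `MoveToArgReg` per argument, wrap the call in a `CallInvocation`, and hand both to `with_call_setup`; the call then always ends up as the last instruction of the sequence.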
Given all that, I think I'm happy to land something close to this once the `ABISig` refactor happens!
Hi @cfallin, this doesn't implement the However, this shows the regalloc issue I was mentioning earlier today. You'll notice the patch changes the
@uweigand I spent a few hours digging into exactly why the extra move occurs here, to see if there is anything actionable. To summarize, here is what's going on:
So how could we address this?
So there's a plausible path forward (last option above), but it's not an easy fix. In general, there is going to be "noise" like this with any shift in complex heuristics; I do apologize, and wish we could always avoid regressions like this, but it's also an impossible request (not saying you are making it, but stating this out loud since it's the topic at hand) to always avoid regressions. In general the best we can do is measure and hill-climb toward better performance in aggregate. In any case, I'll file an issue over in regalloc2 to note the above idea, and at some point I might have a slice of time (it would probably take ~a few days to a week) to get to it.
…ts without falling back to spill bundle.

Currently, we unconditionally trim the ends of liveranges around a split when we do a split, including splits due to conflicts in a liverange/bundle's requirements (e.g., a liverange with both a register and a stack use). These trimmed ends, if they exist, go to the spill bundle, and the spill bundle may receive a register during second-chance allocation or otherwise will receive a stack slot.

This was previously measured to reduce contention significantly, because it reduces the sizes of liveranges that participate in the first-chance competition for allocations. When a split has to occur, we might as well relegate the "connecting pieces" to a process that comes later, with a hint to try to get the right register if possible but no hard connection to either end.

However, in the case of a split arising from a reg-to-stack / stack-to-reg conflict, as happens when references are used or def'd as registers and then cross safepoints, this extra step in the connectivity (normal LR with register use, then spill bundle, then normal LR with stack use) can lead to extra moves. Additionally, when one of the LRs has a stack constraint, contention is far less important; so it doesn't hurt to skip the trimming step. In fact, it's likely much better to put the "connecting piece" together with the stack side of the conflict.

Ideally we would handle this with the same move-cost logic we use for conflicts detected during backtracking, but the requirements-related splitting happens separately and that logic would need to be generalized further. For now, this is sufficient to eliminate redundant moves as seen in e.g. bytecodealliance/wasmtime#3785.
OK, so given all the above, I got nerd-sniped for the rest of the afternoon and did bytecodealliance/regalloc2#49. It seems to address the issue above (in fact the code is a little shorter than with regalloc.rs).
This pulls in bytecodealliance/regalloc2#49, which slightly improves codegen in some cases where a safepoint (for reference-typed values) occurs in the same liverange as a register-constrained use. For example, in bytecodealliance#3785, an extra move instruction appeared and a callee-save register was used (necessitating a more expensive prologue) because of suboptimal splitting heuristics, which this PR fixes. The updated RA2 heuristics appear to have no measured downsides in existing benchmarks and improve the manually-observed codegen issue.
Hi @cfallin, thanks for the detailed analysis and the quick fix! I can confirm that with regalloc2 0.1.3 the code generated for the reftypes.clif test case is improved, both without this PR and with this PR applied. So this regalloc change looks like a clear improvement to me. However, I'm still wondering why we are seeing any regalloc differences from this PR - both with and without your regalloc2 change applied. Without the regalloc2 change, we're seeing the regression shown in the PR. But even with the regalloc2 change, we are seeing a change in generated code (not a regression, but still a change):
This seems strange given that the code regalloc sees before and after this PR is nearly identical.
After this PR:
The only difference is that before the PR, insts 5 and 8 use the same vreg (v130), while after the PR, they use two different vregs (v130 and v148), which are marked as aliases. If aliased vregs are indeed treated identically by regalloc, why does this change still appear to make a difference?
Ah, so I think what is happening is that the aliasing rewrites toward the new vreg, not the old one: This means that the order of vregs seen by regalloc2 is not quite the same, so allocation decisions may be made in a slightly different order. This aligns with what is seen in your diff above, I think: the instruction sequence is exactly the same, but the specific register numbers are different in a few cases.
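To make the ordering effect concrete, here is a small Rust sketch. It is a toy model, not regalloc2's or Cranelift's data structures: the alias map, the extra vreg `v135`, and the assumption that vregs are visited in index order are all hypothetical, and the alias map is assumed to be acyclic. It only shows that the same value ends up under a different vreg number depending on which direction the alias is resolved, which changes where it sits relative to other vregs.

```rust
use std::collections::HashMap;

/// Follow alias links until reaching a vreg that is not itself aliased.
/// Assumes the alias map contains no cycles.
fn resolve(aliases: &HashMap<u32, u32>, mut v: u32) -> u32 {
    while let Some(&next) = aliases.get(&v) {
        v = next;
    }
    v
}

/// Rewrite all operands through the alias map, then list the distinct
/// vregs in index order (the visiting order assumed for this sketch).
fn vregs_in_index_order(operands: &[u32], aliases: &HashMap<u32, u32>) -> Vec<u32> {
    let mut vregs: Vec<u32> = operands.iter().map(|&v| resolve(aliases, v)).collect();
    vregs.sort_unstable();
    vregs.dedup();
    vregs
}
```

With operands `[130, 135, 148]`, resolving `v130` to `v148` (toward the new vreg) leaves the shared value after `v135`, while resolving `v148` to `v130` (toward the old vreg) leaves it before `v135`. The instructions are identical in both cases; only the numbering, and hence the visiting order in this model, differs.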
I see, that does indeed look like it explains the difference. I think with the regalloc2 patch we should be all good then. Thanks again!
* Upgrade to regalloc2 0.1.3. This pulls in bytecodealliance/regalloc2#49, which slightly improves codegen in some cases where a safepoint (for reference-typed values) occurs in the same liverange as a register-constrained use. For example, in #3785, an extra move instruction appeared and a callee-save register was used (necessitating a more expensive prologue) because of suboptimal splitting heuristics, which this PR fixes. The updated RA2 heuristics appear to have no measured downsides in existing benchmarks and improve the manually-observed codegen issue.
* Update filetests where regalloc2 improvement altered behavior with reftypes.
Hi @cfallin, this latest version now has the As far as I am concerned, this should now be ready to be merged, so I'd appreciate another review - any comments welcome!
This adds infrastructure to allow implementing call and return instructions in ISLE, and migrates the s390x back-end.

To implement ABI details, this patch creates public accessors for `ABISig` and makes them accessible in ISLE. All actual code generation is then done in ISLE rules, following the information provided by that signature.

[ Note that the s390x back end never requires multiple slots for a single argument - the infrastructure to handle this should already be present, however. ]

To implement loops in ISLE rules, this patch uses regular tail recursion, employing a `Range` data structure holding a range of integers to be looped over.
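The loop-by-tail-recursion pattern described here can be sketched in plain Rust. This is an analogy only, not the patch's ISLE code or Cranelift's actual types: `Range`, `unwrap_first`, `Sig`, `arg_reg`, and `copy_args` are hypothetical names, and the register numbers are made up for illustration.

```rust
/// Hypothetical half-open integer range `[start, end)`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Range {
    start: usize,
    end: usize,
}

impl Range {
    fn new(start: usize, end: usize) -> Self {
        Range { start, end }
    }

    /// Split off the first index, or return `None` if the range is empty.
    fn unwrap_first(self) -> Option<(usize, Range)> {
        if self.start >= self.end {
            None
        } else {
            Some((self.start, Range::new(self.start + 1, self.end)))
        }
    }
}

/// Hypothetical signature with public accessors: for each argument,
/// the number of the register it is passed in.
struct Sig {
    arg_regs: Vec<u8>,
}

impl Sig {
    fn num_args(&self) -> usize {
        self.arg_regs.len()
    }

    fn arg_reg(&self, i: usize) -> u8 {
        self.arg_regs[i]
    }
}

/// One "rule" per case: stop on an empty range; otherwise handle the
/// first index and recurse on the rest of the range in tail position.
fn copy_args(sig: &Sig, range: Range, out: &mut Vec<String>) {
    match range.unwrap_first() {
        None => {}
        Some((i, rest)) => {
            out.push(format!("mov r{}, arg{}", sig.arg_reg(i), i));
            copy_args(sig, rest, out)
        }
    }
}
```

Calling `copy_args(&sig, Range::new(0, sig.num_args()), &mut out)` visits each argument index once, in order. The two `match` arms play the role of two rules: one matching an empty range and one matching a non-empty range that recurses on the remainder.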
Updated to resolve merge conflicts due to the
With all the updates, this looks great, @uweigand! Thanks a bunch for the patience and especially for the rebases as this was pending. I think this should be good to merge now.
This adds infrastructure to allow implementing call and return
instructions in ISLE, and migrates the s390x back-end.
Not intended to be committed as-is, this will not even compile
as it depends on the following pre-requisite patches:
#3783
#3784
Note that the s390x back end never requires multiple slots for
a single argument - the infrastructure to handle this should
already be present, however.
This uses `ABICallerIsle` instead of the existing `ABICaller`.
The new type is used solely to collect information about how
to pass arguments and return values - all the actual code
generation is done in ISLE rules. (Note that `ABICallerIsle`
ended up as just a thin wrapper around `ABISig` with public
accessors - maybe the two should be merged?)
To implement loops in ISLE rules, this patch uses regular tail
recursion, employing a `Range` data structure holding a range
of integers to be looped over.
@cfallin @fitzgen - this is my current state of the call/ret patch - FYI and discussion welcome!