Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] ISLE: Migrate call and return instructions #3785

Merged
merged 1 commit into from
Jun 29, 2022

Conversation

uweigand
Copy link
Member

@uweigand uweigand commented Feb 9, 2022

This adds infrastructure to allow implementing call and return
instructions in ISLE, and migrates the s390x back-end.

Not intended to be committed as-is, this will not even compile
as it depends on the following pre-requisite patches:
#3783
#3784

Note that the s390x back end never requires multiple slots for
a single argument - the infrastructure to handle this should
already be present, however.

This uses ABICallerIsle instead of the existing ABICaller.
The new type is used solely to collect information about how
to pass arguments and return values - all the actual code
generation is done in ISLE rules. (Note that ABICallerIsle
ended up as just a thin wrapper around ABISig with public
accessors - maybe the two should be merged?)

To implement loops in ISLE rule, this patch uses regular tail
recursion, employing a Range data structure holding a range
of integers to be looped over.

@cfallin @fitzgen - this is my current state of the call/ret patch - FYI and discussion welcome!

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen isle Related to the ISLE domain-specific language labels Feb 9, 2022
@github-actions
Copy link

github-actions bot commented Feb 9, 2022

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@uweigand
Copy link
Member Author

Rebased now that both prerequisite PRs were merged. This is probably still not quite ready to merge, but I'd certainly appreciate any comments!

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this and sorry for the delay in looking at it!

Echoing thoughts from today's Cranelift biweekly, I think that it makes sense to merge ABICallerIsle into ABISig proper -- the only additional bit is the copy_to_arg_order which, while technically a kind of ABI detail rather than signature, I think is reasonable enough to attach to the signature. Then we can provide bindings into ISLE to access the ABISig directly.

My only other comment is with regard to the (lower (call ...)) toplevel rule itself: it's currently written to emit the bits of the call sequence directly; it might be worth exploring an abstraction similar to the ProducesFlags/ConsumesFlags one where we encapsulate instructions, then pass them as one to a combinator that ensures they are emitted in the right order. (In this case the danger is clobbering of specific registers, rather than flags.) So something like CallSetup / CallInvocation. Though, now that I say that, I realize that the variable-length argument list implies gathering a Vec or similar of argument setup instructions, so perhaps we can think about this and build it incrementally later.

Given all that, I think I'm happy to land something close to this once the ABISig refactor happens!

@uweigand
Copy link
Member Author

uweigand commented May 16, 2022

Hi @cfallin, this doesn't implement the ABISig refactor yet, it's just a rebase to current mainline.

However, this shows the regalloc issue I was mentioning earlier today. You'll notice the patch changes the reftypes.clif file, which shows the regalloc regression I've been seeing. Note the one additional move in (new) line 80, and the extra use of call-saved register %r10.

@cfallin
Copy link
Member

cfallin commented May 16, 2022

@uweigand I spent a few hours digging into exactly why the extra move occurs here, to see if there is anything actionable. To summarize, here is what's going on:

  • There is a liverange from the definition of the ref (initially a move-from-preg to pick up the arg value in r3) until its return at the bottom of the function.
  • When building this liverange, we add uses that constrain to preg r3, and later that constrain to stack at the safepoint.
  • We recognize that this is a conflict, so we split.
  • The split nominally happens just before the conflict -- inst4 at the safepoint -- so we get a new LR and bundle starting at inst4 until return at inst12, and the original LR at inst1..inst4.
  • But the "spill bundle" mechanism that tries to have a common fallback location between LRs-with-actual-uses chops off the bit of the first bundle after the def, so we get a bundle at inst1..inst1 with preg constraint r3, a backing spill bundle, and a bundle at inst4..inst12 with constraint to stack.
  • The "second chance" logic tries to assign a register to every spill bundle before giving up and giving it a stack slot. In this case the spill bundle gets r10, a callee-saved register, because of the call at inst3. So we have the value in r3 initially, then r10 across the call, then on the stack for the explicit stack safepoint.

So how could we address this?

  • We could avoid chopping off the trailing bits of LRs when splitting; but this turns out to be a really useful thing in most cases, because it shrinks the post-split pieces to just around their uses and significantly reduces contention. I measured high-single-digits perf improvements from this when I was initially tuning things.
  • We could somehow hint that the spill bundle can be a stack location. But this requires looking "down" the split tree from the spillbundle at root to all its pieces, and seeing that they have explicit stack constraints. And it also requires some flow-specific knowledge: that we're flowing into a place with a stack constraint, rather than out of one. And probably would need some weighting mechanism.
  • We could let stack-constrained uses live directly on the spillbundle, and then constrain the spillbundle to just a stack slot (no second-chance register choice). This is probably the best option; it doesn't touch the heuristics at all for non-reftype cases. But it would require some careful thought, because currently spillbundles don't carry uses at all.

So there's a plausible path forward (last option above), but it's not an easy fix.

In general, there is going to be "noise" like this with any shift in complex heuristics; I do apologize, and wish we could always avoid regressions like this, but it's also an impossible request (not saying you are making it, but stating this out loud since it's the topic at hand) to always avoid regressions. In general the best we can do is measure and hill-climb toward better performance in aggregate. In any case, I'll file an issue over in regalloc2 to note the above idea, and at some point I might have a slice of time (it would probably take ~a few days to a week) to get to it.

cfallin added a commit to cfallin/regalloc2 that referenced this pull request May 17, 2022
…ts without falling back to spill bundle.

Currently, we unconditionally trim the ends of liveranges around a split
when we do a split, including splits due to conflicts in a
liverange/bundle's requirements (e.g., a liverange with both a register
and a stack use). These trimmed ends, if they exist, go to the spill
bundle, and the spill bundle may receive a register during second-chance
allocation or otherwise will receive a stack slot.

This was previously measured to reduce contention significantly, because
it reduces the sizes of liveranges that participate in the first-chance
competition for allocations. When a split has to occur, we might as well
relegate the "connecting pieces" to a process that comes later, with a
hint to try to get the right register if possible but no hard connection
to either end.

However, in the case of a split arising from a reg-to-stack /
stack-to-reg conflict, as happens when references are used or def'd as
registers and then cross safepoints, this extra step in the connectivity
(normal LR with register use, then spill bundle, then normal LR with
stack use) can lead to extra moves. Additionally, when one of the LRs
has a stack constraint, contention is far less important; so it doesn't
hurt to skip the trimming step. In fact, it's likely much better to put
the "connecting piece" together with the stack side of the conflict.

Ideally we would handle this with the same move-cost logic we use for
conflicts detected during backtracking, but the requirements-related
splitting happens separately and that logic would need to be generalized
further. For now, this is sufficient to eliminate redundant moves as
seen in e.g. bytecodealliance/wasmtime#3785.
@cfallin
Copy link
Member

cfallin commented May 17, 2022

OK, so given all the above, I got nerd-sniped for the rest of the afternoon and did bytecodealliance/regalloc2#49. It seems to address the issue above (in fact the code is a little shorter than with regalloc.rs).

cfallin added a commit to bytecodealliance/regalloc2 that referenced this pull request May 17, 2022
…ts without falling back to spill bundle. (#49)

Currently, we unconditionally trim the ends of liveranges around a split
when we do a split, including splits due to conflicts in a
liverange/bundle's requirements (e.g., a liverange with both a register
and a stack use). These trimmed ends, if they exist, go to the spill
bundle, and the spill bundle may receive a register during second-chance
allocation or otherwise will receive a stack slot.

This was previously measured to reduce contention significantly, because
it reduces the sizes of liveranges that participate in the first-chance
competition for allocations. When a split has to occur, we might as well
relegate the "connecting pieces" to a process that comes later, with a
hint to try to get the right register if possible but no hard connection
to either end.

However, in the case of a split arising from a reg-to-stack /
stack-to-reg conflict, as happens when references are used or def'd as
registers and then cross safepoints, this extra step in the connectivity
(normal LR with register use, then spill bundle, then normal LR with
stack use) can lead to extra moves. Additionally, when one of the LRs
has a stack constraint, contention is far less important; so it doesn't
hurt to skip the trimming step. In fact, it's likely much better to put
the "connecting piece" together with the stack side of the conflict.

Ideally we would handle this with the same move-cost logic we use for
conflicts detected during backtracking, but the requirements-related
splitting happens separately and that logic would need to be generalized
further. For now, this is sufficient to eliminate redundant moves as
seen in e.g. bytecodealliance/wasmtime#3785.
cfallin added a commit to cfallin/wasmtime that referenced this pull request May 17, 2022
This pulls in bytecodealliance/regalloc2#49, which slightly improves
codegen in soem cases where a safepoint (for reference-typed values)
occurs in the same liverange as a register-constraineed use. For
example, in bytecodealliance#3785, an extra move instruction
appeared and a callee-save register was used (necessitating a more
expensive prologue) because of suboptimal splitting heuristics, which
this PR fixes. The updated RA2 heuristics appear to have no measured
downsides in existing benchmarks and improve the manually-observed
codegen issue.
cfallin added a commit to cfallin/wasmtime that referenced this pull request May 17, 2022
This pulls in bytecodealliance/regalloc2#49, which slightly improves
codegen in some cases where a safepoint (for reference-typed values)
occurs in the same liverange as a register-constraineed use. For
example, in bytecodealliance#3785, an extra move instruction
appeared and a callee-save register was used (necessitating a more
expensive prologue) because of suboptimal splitting heuristics, which
this PR fixes. The updated RA2 heuristics appear to have no measured
downsides in existing benchmarks and improve the manually-observed
codegen issue.
cfallin added a commit to cfallin/wasmtime that referenced this pull request May 17, 2022
This pulls in bytecodealliance/regalloc2#49, which slightly improves
codegen in some cases where a safepoint (for reference-typed values)
occurs in the same liverange as a register-constrained use. For
example, in bytecodealliance#3785, an extra move instruction
appeared and a callee-save register was used (necessitating a more
expensive prologue) because of suboptimal splitting heuristics, which
this PR fixes. The updated RA2 heuristics appear to have no measured
downsides in existing benchmarks and improve the manually-observed
codegen issue.
@uweigand
Copy link
Member Author

Hi @cfallin, thanks for the detailed analysis and the quick fix! I can confirm that with regalloc2 0.1.3 the code generated for the reftypes.clif test case is improved, both without this PR and with this PR applied. So this regalloc change looks like a clear improvement to me.

However, I'm still wondering why we are seeing any regalloc differences from this PR - both with and without your regalloc2 change applied. Without the regalloc2 change, we're seeing the regression shown in the PR. But even with the regalloc2 change, we are seeing a change in generated code (not a regression, but still a change):

7,8c7,8
<         "  bras %r1, 12 ; data %f + 0 ; lg %r5, 0(%r1)",
<         "  basr %r14, %r5",
---
>         "  bras %r1, 12 ; data %f + 0 ; lg %r4, 0(%r1)",
>         "  basr %r14, %r4",
12,13c12,13
<         "  llcr %r5, %r2",
<         "  chi %r5, 0",
---
>         "  llcr %r3, %r2",
>         "  chi %r3, 0",
28,29c28,29
<         "  la %r4, 160(%r15)",
<         "  lg %r4, 0(%r4)",
---
>         "  la %r5, 160(%r15)",
>         "  lg %r4, 0(%r5)",

This seems strange given that the code regalloc sees before and after this PR is nearly identical.
Before this PR:

  Entry block: 0
  v131 := v146
  v134 := v140
  v135 := v139
Block 0:
    (original IR block: block0)
    (successor: Block 1)
    (successor: Block 3)
    (instruction range: 0 .. 11)
  Inst 0: lgr %v128, %r2
  Inst 1: lgr %v129, %r3
  Inst 2: lgr %r2, %v128
  Inst 3: bras %r1, 12 ; data %f + 0 ; lg %v147, 0(%r1)
  Inst 4: basr %r14, %v147
  Inst 5: lr %v130, %r2
  Inst 6: la %v146, 0(%r15)
  Inst 7: stg %v128, 0(%v131)
  Inst 8: llcr %v145, %v130
  Inst 9: chi %v145, 0
  Inst 10: jgnlh label1 ; jg label3

After this PR:

  Entry block: 0
  v130 := v148
  v131 := v146
  v134 := v140
  v135 := v139
Block 0:
    (original IR block: block0)
    (successor: Block 1)
    (successor: Block 3)
    (instruction range: 0 .. 11)
  Inst 0: lgr %v128, %r2
  Inst 1: lgr %v129, %r3
  Inst 2: lgr %r2, %v128
  Inst 3: bras %r1, 12 ; data %f + 0 ; lg %v147, 0(%r1)
  Inst 4: basr %r14, %v147
  Inst 5: lr %v148, %r2
  Inst 6: la %v146, 0(%r15)
  Inst 7: stg %v128, 0(%v131)
  Inst 8: llcr %v145, %v130
  Inst 9: chi %v145, 0
  Inst 10: jgnlh label1 ; jg label3

The only difference is that before the PR, insts 5 and 8 use the same vreg (v130), while after the PR, they use two different vregs (v130 and v148), which are marked as aliases.

If aliased vregs are indeed treated identically by regalloc, why does this change still appear to make a difference?

@cfallin
Copy link
Member

cfallin commented May 17, 2022

Ah, so I think what is happening is that the aliasing rewrites toward the new vreg, not the old one: v130 is made an alias of v148, so all instances of v130 become v148 as seen by regalloc.

This means that the order of vregs seen by regalloc2 is not quite the same, so allocation decisions may be made in a slightly different order.

This aligns with what is seen in your diff above, I think: the instruction sequence is exactly the same, but the specific register numbers are different in a few cases.

@uweigand
Copy link
Member Author

This means that the order of vregs seen by regalloc2 is not quite the same, so allocation decisions may be made in a slightly different order.

I see, that does indeed look like it explains the difference. I think with the regalloc2 patch we should be all good then. Thanks again!

cfallin added a commit that referenced this pull request May 18, 2022
* Upgrade to regalloc2 0.1.3.

This pulls in bytecodealliance/regalloc2#49, which slightly improves
codegen in some cases where a safepoint (for reference-typed values)
occurs in the same liverange as a register-constrained use. For
example, in #3785, an extra move instruction
appeared and a callee-save register was used (necessitating a more
expensive prologue) because of suboptimal splitting heuristics, which
this PR fixes. The updated RA2 heuristics appear to have no measured
downsides in existing benchmarks and improve the manually-observed
codegen issue.

* Update filetests where regalloc2 improvement altered behavior with reftypes.
@uweigand
Copy link
Member Author

Hi @cfallin, this latest version now has the ABISig refactoring and a bunch of other general interface cleanups.

As far as I am concerned, this should now be ready to be merged, so I'd appreciate another review - any comments welcome!

This adds infrastructure to allow implementing call and return
instructions in ISLE, and migrates the s390x back-end.

To implement ABI details, this patch creates public accessors
for `ABISig` and makes them accessible in ISLE.  All actual
code generation is then done in ISLE rules, following the
information provided by that signature.

[ Note that the s390x back end never requires multiple slots for
a single argument - the infrastructure to handle this should
already be present, however. ]

To implement loops in ISLE rules, this patch uses regular tail
recursion, employing a `Range` data structure holding a range
of integers to be looped over.
@uweigand
Copy link
Member Author

Updated to resolve merge conflicts due to the uses_defs_clobbers change.

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all the updates, this looks great, @uweigand ! Thanks a bunch for the patience and especially for the rebases as this was pending. I think this should be good to merge now.

@cfallin cfallin merged commit 7a9479f into bytecodealliance:main Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants