Rework of MachInst isel, branch fixups and lowering, and block ordering. #1718
Subscribe to Label Action
cc @bnjbvr
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:isel", "cranelift:area:x64"
This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are lowered into VCode blocks. The new mechanism in `BlockLoweringOrder` computes RPO over the CFG, but with a twist: it merges edge blocks into heads or tails of original CLIF blocks wherever possible, and it does this without ever actually materializing the full nodes-plus-edges graph first. The backend driver lowers blocks in final order so there's no need to reshuffle later.

- A new `MachBuffer` that replaces the `MachSection`. This is a special version of a code-sink that is far more than a humble `Vec<u8>`. In particular, it keeps a record of label definitions and label uses, with a machine-pluggable `LabelUse` trait that defines various types of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites *inline in the emission pass*, without any separate traversals over the code to use fallthroughs, swap taken/not-taken arms, etc. It tracks branches at the tail of the buffer and can (i) remove blocks that are just unconditional branches (by redirecting the label), (ii) understand a conditional/unconditional pair and swap the conditional polarity when it's helpful; and (iii) remove branches that branch to the fallthrough PC.

  The `MachBuffer` also implements branch-island support. On architectures like AArch64, this is needed to allow conditional branches within plausibly-attainable ranges (+/- 1MB on AArch64 specifically). It also does this inline while streaming through the emission, without any sort of fixpoint algorithm or later moving of code, by simply tracking outstanding references and "deadlines" and emitting an island just-in-time when we're in danger of going out of range.

- A rework of the instruction selector driver. This is largely following the same algorithm as before, but is cleaned up significantly, in particular in the API: the machine backend can ask for an input arg and get any of three forms (constant, register, producing instruction), indicating it needs the register or can merge the constant or producing instruction as appropriate. This new driver takes special care to emit constants right at use-sites (and at phi inputs), minimizing their live-ranges, and also special-cases the "pinned register" to avoid superfluous moves.

Overall, on `bz2.wasm`, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

All numbers are averages of two runs on an Ampere eMAG.
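The "deadline" idea behind the branch-island support described above can be sketched in a few lines. This is a simplified model, not the code in this patch: `IslandTracker`, `VENEER_SIZE`, and every method name are hypothetical, and the model assumes that a veneer can always reach its final target, so all pending uses are dropped once an island is emitted.

```rust
/// Size in bytes of one veneer (a longer-range jump placed in an island).
/// The value is an assumption chosen for illustration.
const VENEER_SIZE: u32 = 4;

/// Hypothetical, simplified model of streaming island placement.
#[derive(Debug)]
struct IslandTracker {
    /// Current emission offset in bytes.
    cur_offset: u32,
    /// Number of forward label uses whose targets are not yet defined.
    pending_uses: u32,
    /// Smallest offset beyond which some pending use would be out of range.
    deadline: u32,
    /// Offsets at which islands were emitted (kept only for inspection).
    islands: Vec<u32>,
}

impl IslandTracker {
    fn new() -> Self {
        IslandTracker { cur_offset: 0, pending_uses: 0, deadline: u32::MAX, islands: Vec::new() }
    }

    /// Record a forward reference made at the current offset that can reach
    /// at most `max_pos_range` bytes ahead.
    fn use_label(&mut self, max_pos_range: u32) {
        let limit = self.cur_offset.saturating_add(max_pos_range);
        self.deadline = self.deadline.min(limit);
        self.pending_uses += 1;
    }

    /// Would emitting `upcoming` more bytes leave no room for an island
    /// (one veneer per pending use) before the deadline?
    fn island_needed(&self, upcoming: u32) -> bool {
        let island_size = self.pending_uses * VENEER_SIZE;
        self.cur_offset
            .saturating_add(upcoming)
            .saturating_add(island_size)
            > self.deadline
    }

    /// Emit one veneer per pending use and reset the deadline.
    fn emit_island(&mut self) {
        self.islands.push(self.cur_offset);
        self.cur_offset += self.pending_uses * VENEER_SIZE;
        self.pending_uses = 0;
        self.deadline = u32::MAX;
    }

    /// Emit an instruction of `size` bytes, placing an island first if needed.
    fn emit_inst(&mut self, size: u32) {
        if self.island_needed(size) {
            self.emit_island();
        }
        self.cur_offset += size;
    }
}
```

The check runs once per emitted instruction and only compares the current offset against a single running minimum, which is why no fixpoint iteration or later code motion is needed in this model.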
Updated: rebased onto latest master; brought x64 backend up-to-date; updated all aarch64 filetests; and added some more tests for MachBuffer. Should be ready for review now.
(Note: so far I haven't commented on the second, smaller diff, containing the x64 changes.)
Mostly it looks good. I'm not claiming to understand every detail, but I assume that the fact that it works and gives a substantial perf hop-up means it's pretty much ok.
There are a few comments in-line, but I have some larger semantics-level and maintainability questions here:
- > The MachBuffer also implements branch-island support. [..]

  How is this tested, considering it is low probability path stuff? Given that long range jumps are only for > 1 MB on arm64, I feel like there's a bunch of paths here with very low probability of being taken. Which is a verification hazard. Is that the case? If so, how is this stuff being tested?

  (Later) I see tests at the bottom of `machinst/buffer.rs`. Are these adequate?

- (Islands-and-Deadlines algorithm further comment): This is clearly a complex bit of machinery. I didn't see any single block comment explaining what the whole algorithm is. Can you add one? Not every detail, but at least some top level description.

- (revised isel driver): again, this is now more complex than it was. Question: is lookback past block boundaries still allowed? And how does this interact with the colouring machinery? My impression is that lookbacks past block boundaries are still allowed, however the colours always change across block boundaries. Hence this restricts lookbacks across boundaries to pure value trees. So it's correct. But if that analysis is correct, that's a non-obvious interaction that it would be good to document.

- (revised isel driver more): regarding lookbacks and colouring, I would like to see at least some implementation of `movzx`/`movsx` applied to loads, preferably in this patch, or at least very soon in a followup. This is partly to improve code quality but primarily to demonstrate that the colouring infrastructure is sound (viz, to test it to some extent).

- Is it correct to understand that the new `blockorder.rs` maintains the old invariant that there are no dead blocks in the output? Or, at least, that the invariant is maintained by whatever means, that there are no dead blocks in the input to regalloc?

- I see `debug!` calls in potentially hottish places, eg `MachBuffer::put1()`. Are we sure those become zero cost in release builds?
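For the last question above, one way a per-byte logging call can be made free is to gate it on a compile-time constant, so the branch and the argument evaluation are never executed and the optimizer can drop them. The sketch below only illustrates that idea: `HOT_PATH_LOGGING`, `hot_path_debug!`, and `ByteSink` are hypothetical names, and how the patch itself resolved the question is not shown in this thread. (The `log` crate offers a comparable mechanism through its `release_max_level_*` Cargo features.)

```rust
/// Compile-time switch; `false` models a build with hot-path logging disabled.
const HOT_PATH_LOGGING: bool = false;

/// Hypothetical logging macro: the arguments are only evaluated when the
/// constant above is `true`.
macro_rules! hot_path_debug {
    ($($arg:tt)*) => {
        if HOT_PATH_LOGGING {
            eprintln!($($arg)*);
        }
    };
}

/// Hypothetical stand-in for a code buffer with a per-byte logging call.
struct ByteSink {
    data: Vec<u8>,
}

impl ByteSink {
    /// Append one byte; the logging call sits on the hot path.
    fn put1(&mut self, byte: u8) {
        hot_path_debug!("put1: offset {} byte {:#04x}", self.data.len(), byte);
        self.data.push(byte);
    }
}
```

With the constant set to `false`, the formatting arguments are never evaluated at run time, which is the property worth checking for any logging macro used in a loop that runs once per emitted byte.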
Looks OK. Thanks for fixing this up. A couple of correctness queries, that's all.
Thanks a bunch for the review! Hopefully I've addressed everything. A few responses to the questions below:
The tests at the end of the file are all we have for now, but I had a thought today: another way to exercise this might be to artificially turn down the allowable branch range to force "real" use of islands/veneers in plausibly-large functions, e.g. maybe 1KB or so. Then we could do the usual correctness tests with our benchmarks. I'll have to think a bit about how to automate this; it ties into the idea Dan suggested before of an "evil mode", where we change some target config option(s) to make pessimistic choices everywhere (randomize block order, lower every phi as explicit moves, clobber every callee-save intentionally, etc.) and then fuzz the heck out of things. For now, perhaps I can do this manually and run the spec testsuite?
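A minimal sketch of the "turn down the allowable branch range" idea, assuming a configuration knob that caps every branch kind's forward range: `BranchKind`, `RangeConfig`, and `test_cap` are hypothetical names, not part of this patch. The architectural ranges follow from AArch64's 19-bit and 26-bit branch immediates, which count 4-byte words, and only forward distances are modelled.

```rust
/// Hypothetical branch kinds, modelled on AArch64 conditional (19-bit) and
/// unconditional (26-bit) branch immediates.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BranchKind {
    Cond19,
    Uncond26,
}

impl BranchKind {
    /// Largest forward displacement in bytes that the encoding can express.
    fn arch_max_pos_range(self) -> u32 {
        match self {
            BranchKind::Cond19 => (1 << 20) - 4,   // just under +1 MB
            BranchKind::Uncond26 => (1 << 27) - 4, // just under +128 MB
        }
    }
}

/// Hypothetical test knob: optionally cap every branch range.
struct RangeConfig {
    test_cap: Option<u32>,
}

impl RangeConfig {
    /// Effective forward range: the architectural one, or the cap if smaller.
    fn max_pos_range(&self, kind: BranchKind) -> u32 {
        let arch = kind.arch_max_pos_range();
        self.test_cap.map_or(arch, |cap| arch.min(cap))
    }

    /// True if a forward branch from `use_offset` to `target_offset` cannot
    /// be encoded directly under this configuration.
    fn needs_veneer(&self, kind: BranchKind, use_offset: u32, target_offset: u32) -> bool {
        target_offset.saturating_sub(use_offset) > self.max_pos_range(kind)
    }
}
```

With `test_cap: Some(1024)`, a branch spanning 4 KB already requires a veneer, so ordinary-sized functions would exercise the island path that otherwise only triggers past 1 MB.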
Yup, added a big block comment -- thanks; more docs are always better!
Yes, that's right; the predicate for allowing a lookback (which in practice means giving the instruction reference in the result of
Sure; perhaps separately from this patch, as that's part of the x64 backend work (and I'm trying to touch it as little as possible in this, to avoid stepping on other ongoing work)?
Yes, exactly: the output of
Fixed, thanks!
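The lookback-and-colouring interaction discussed in this exchange can be modelled roughly as follows. This sketch encodes only the reviewer's description (colours change at every side effect and at every block boundary, so cross-block lookback is limited to pure values); the actual predicate in the patch is not visible in this thread, and `InstInfo`, `assign_colors`, and `can_merge` are hypothetical names.

```rust
/// Hypothetical side-effect "colour".
type Color = u32;

/// Per-instruction colouring result in this simplified model.
struct InstInfo {
    has_side_effect: bool,
    /// Colour in effect just before the instruction.
    color_before: Color,
    /// Colour in effect just after the instruction.
    color_after: Color,
}

/// Assign colours to a function given, for each block, a flag per
/// instruction saying whether it has a side effect. The colour changes at
/// every block boundary and after every side-effecting instruction.
fn assign_colors(blocks: &[Vec<bool>]) -> Vec<Vec<InstInfo>> {
    let mut color: Color = 0;
    let mut out = Vec::new();
    for block in blocks {
        color += 1; // entering a block always changes the colour
        let mut infos = Vec::new();
        for &has_side_effect in block {
            let color_before = color;
            if has_side_effect {
                color += 1;
            }
            infos.push(InstInfo { has_side_effect, color_before, color_after: color });
        }
        out.push(infos);
    }
    out
}

/// May `consumer` absorb `producer`? Pure producers always can; a
/// side-effecting producer only if no colour change lies between the two.
fn can_merge(producer: &InstInfo, consumer: &InstInfo) -> bool {
    !producer.has_side_effect || producer.color_after == consumer.color_before
}
```

In this model a load can be merged into an immediately following user in the same block, but not into a user after an intervening store or in a later block, while a pure producer can be merged from anywhere.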
Thanks for the fixes (extra comments), really. LGTM.
One other comment (not to block landing):
FTR, I am still of the opinion that lookback outside the block, even for just pure values, will turn out to be a net loss in the end, in the sense that the extra register pressure will cause much more of a loss than any minor improvements in insn selection that might result. That said, I don't have any Actual Evidence to substantiate my claim, at least currently.
tl;dr: new new-isel; better block-ordering, handling branches in one pass. 24% faster compile+run on bz2 (28% fewer instructions); 10% faster compile (10% fewer instructions).