Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework of MachInst isel, branch fixups and lowering, and block ordering. #1718

Merged
merged 3 commits into from May 19, 2020

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented May 16, 2020

tl;dr: new new-isel; better block-ordering, handling branches in one pass. 24% faster compile+run on bz2 (28% fewer instructions); 10% faster compile (10% fewer instructions).

This patch includes:

  • A complete rework of the way that CLIF blocks and edge blocks are
    lowered into VCode blocks. The new mechanism in BlockLoweringOrder
    computes RPO over the CFG, but with a twist: it merges edge blocks intto
    heads or tails of original CLIF blocks wherever possible, and it does
    this without ever actually materializing the full nodes-plus-edges
    graph first. The backend driver lowers blocks in final order so
    there's no need to reshuffle later.

  • A new MachBuffer that replaces the MachSection. This is a special
    version of a code-sink that is far more than a humble Vec<u8>. In
    particular, it keeps a record of label definitions and label uses,
    with a machine-pluggable LabelUse trait that defines various types
    of fixups (basically internal relocations).

    Importantly, it implements some simple peephole-style branch rewrites
    inline in the emission pass, without any separate traversals over
    the code to use fallthroughs, swap taken/not-taken arms, etc. It
    tracks branches at the tail of the buffer and can (i) remove blocks
    that are just unconditional branches (by redirecting the label), (ii)
    understand a conditional/unconditional pair and swap the conditional
    polarity when it's helpful; and (iii) remove branches that branch to
    the fallthrough PC.

    The MachBuffer also implements branch-island support. On
    architectures like AArch64, this is needed to allow conditional
    branches within plausibly-attainable ranges (+/- 1MB on AArch64
    specifically). It also does this inline while streaming through the
    emission, without any sort of fixpoint algorithm or later moving of
    code, by simply tracking outstanding references and "deadlines" and
    emitting an island just-in-time when we're in danger of going out of
    range.

  • A rework of the instruction selector driver. This is largely following
    the same algorithm as before, but is cleaned up significantly, in
    particular in the API: the machine backend can ask for an input arg
    and get any of three forms (constant, register, producing
    instruction), indicating it needs the register or can merge the
    constant or producing instruction as appropriate. This new driver
    takes special care to emit constants right at use-sites (and at phi
    inputs), minimizing their live-ranges, and also special-cases the
    "pinned register" to avoid superfluous moves.

Overall, on bz2.wasm, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.

@cfallin cfallin added cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:aarch64 Issues related to AArch64 backend. labels May 16, 2020
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen labels May 16, 2020
@github-actions
Copy link

Subscribe to Label Action

cc @bnjbvr

This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:isel", "cranelift:area:x64"

Thus the following users have been cc'd because of the following labels:

  • bnjbvr: cranelift

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@cfallin cfallin force-pushed the machinst-codebuffer branch 3 times, most recently from f3c99ac to 140e245 Compare May 17, 2020 05:03
This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
  computes RPO over the CFG, but with a twist: it merges edge blocks intto
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.

- A new `MachBuffer` that replaces the `MachSection`. This is a special
  version of a code-sink that is far more than a humble `Vec<u8>`. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable `LabelUse` trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  *inline in the emission pass*, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label), (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful; and (iii) remove branches that branch to
  the fallthrough PC.

  The `MachBuffer` also implements branch-island support. On
  architectures like AArch64, this is needed to allow conditional
  branches within plausibly-attainable ranges (+/- 1MB on AArch64
  specifically). It also does this inline while streaming through the
  emission, without any sort of fixpoint algorithm or later moving of
  code, by simply tracking outstanding references and "deadlines" and
  emitting an island just-in-time when we're in danger of going out of
  range.

- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on `bz2.wasm`, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.
@cfallin cfallin force-pushed the machinst-codebuffer branch 2 times, most recently from 91af6c5 to 2dc2ffc Compare May 18, 2020 01:18
@cfallin
Copy link
Member Author

cfallin commented May 18, 2020

Updated: rebased onto latest master; brought x64 backend up-to-date; updated all aarch64 filetests; and added some more tests for MachBuffer. Should be ready for review now.

@bnjbvr bnjbvr removed their request for review May 18, 2020 08:20
Copy link
Collaborator

@julian-seward1 julian-seward1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note; so far I haven't commented on the second, smaller diff, containing x64 changes).

Mostly it looks good. I'm not claiming to understand every detail, but I assume that the fact that it works and gives a substantial perf hop-up means it's pretty much ok.

There are a few comments in-line, but I have some larger semantics-level and maintainability questions here:

  • The MachBuffer also implements branch-island support. [..]

    • How is this tested, considering it is low probability path stuff? Given that long range jumps are only for > 1 MB on arm64, I feel like there's a bunch of paths here with very low probability of being taken. Which is a verification hazard. Is that the case? If so, how is this stuff being tested?

      (Later) I see tests at the bottom of machinst/buffer.rs. Are these adequate?

  • (Islands-and-Deadlines algorithm further comment): This is clearly a complex bit of machinery. I didn't see any single block comment explaining what the whole algorithm is. Can you add one? Not every detail, but at least some top level description.

  • (revised isel driver): again, this is now more complex than it was. Question: is lookback past block boundaries still allowed? And how does this interact with the colouring machinery? My impression is that lookbacks past block boundaries are still allowed, however the colours always change across block boundaries. Hence this restricts lookbacks across boundaries to pure value trees. So it's correct. But if that analysis is correct, that's a non-obvious interaction that it would be good to document.

  • (revised isel driver more): regarding lookbacks and colouring, I would like to see at least some implementation of movzx/movsx applied to loads, preferably in this patch, or at least very soon in a followup. This is partly to improve code quality but primarily to demonstrate that the colouring infrastructure is sound (viz, to test it to some extent).

  • Is it correct to understand that the new blockorder.rs maintains the old invariant that there are no dead blocks in the output? Or, at least, that the invariant is maintained by whatever means, that there are no dead blocks in the input to regalloc?

  • I see debug! calls in potentially hottish places, eg MachBuffer::put1(). Are we sure those become zero cost in release builds?

cranelift/codegen/src/isa/aarch64/inst/mod.rs Show resolved Hide resolved
cranelift/codegen/src/isa/aarch64/inst/mod.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/aarch64/inst/mod.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/aarch64/inst/mod.rs Show resolved Hide resolved
cranelift/codegen/src/isa/aarch64/inst/mod.rs Show resolved Hide resolved
cranelift/codegen/src/machinst/lower.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/machinst/lower.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/machinst/lower.rs Show resolved Hide resolved
cranelift/codegen/src/machinst/lower.rs Show resolved Hide resolved
cranelift/codegen/src/machinst/lower.rs Outdated Show resolved Hide resolved
@bjorn3
Copy link
Contributor

bjorn3 commented May 18, 2020

I see debug! calls in potentially hottish places, eg MachBuffer::put1(). Are we sure those become zero cost in release builds?

trace! is probably more appropriate.

Copy link
Collaborator

@julian-seward1 julian-seward1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK. Thanks for fixing this up. A couple of correctness queries, that's all.

cranelift/codegen/src/isa/x64/inst/args.rs Show resolved Hide resolved
cranelift/codegen/src/isa/x64/inst/mod.rs Show resolved Hide resolved
cranelift/codegen/src/isa/x64/lower.rs Show resolved Hide resolved
@cfallin cfallin force-pushed the machinst-codebuffer branch 3 times, most recently from 3cf1eef to 07209fd Compare May 18, 2020 23:07
@cfallin
Copy link
Member Author

cfallin commented May 18, 2020

Thanks a bunch for the review! Hopefully I've addressed everything. A few responses to the questions below:

There are a few comments in-line, but I have some larger semantics-level and maintainability questions here:

  • The MachBuffer also implements branch-island support. [..]

    • How is this tested, considering it is low probability path stuff? Given that long range jumps are only for > 1 MB on arm64, I feel like there's a bunch of paths here with very low probability of being taken. Which is a verification hazard. Is that the case? If so, how is this stuff being tested?
      (Later) I see tests at the bottom of machinst/buffer.rs. Are these adequate?

The tests at the end of the file are all we have for now, but I had a thought today: another way to exercise this might be to artificially turn down the allowable branch range to force "real" use of islands/veneers in plausibly-large functions, e.g. maybe 1KB or so. Then we could do the usual correctness tests with our benchmarks. I'll have to think a bit about how to automate this; it ties into the idea Dan suggested before of an "evil mode", where we change some target config option(s) to make pessimistic choices everywhere (randomize block order, lower every phi as explicit moves, clobber every callee-save intentionally, etc.) and then fuzz the heck out of things. For now, perhaps I can do this manually and run the spec testsuite?

  • (Islands-and-Deadlines algorithm further comment): This is clearly a complex bit of machinery. I didn't see any single block comment explaining what the whole algorithm is. Can you add one? Not every detail, but at least some top level description.

Yup, added a big block comment -- thanks; more docs are always better!

  • (revised isel driver): again, this is now more complex than it was. Question: is lookback past block boundaries still allowed? And how does this interact with the colouring machinery? My impression is that lookbacks past block boundaries are still allowed, however the colours always change across block boundaries. Hence this restricts lookbacks across boundaries to pure value trees. So it's correct. But if that analysis is correct, that's a non-obvious interaction that it would be good to document.

Yes, that's right; the predicate for allowing a lookback (which in practice means giving the instruction reference in the result of get_input()) is "same color, or producing op is pure". This is in the doc comment for get_input(), but perhaps I can write more high-level docs as well. I hope to write some sort of "backend writer's guide" that communicates these principles and invariants that one needs to know, eventually...

  • (revised isel driver more): regarding lookbacks and colouring, I would like to see at least some implementation of movzx/movsx applied to loads, preferably in this patch, or at least very soon in a followup. This is partly to improve code quality but primarily to demonstrate that the colouring infrastructure is sound (viz, to test it to some extent).

Sure; perhaps separately from this patch, as that's part of the x64 backend work (and I'm trying to touch it as little as possible in this, to avoid stepping on other ongoing work)?

  • Is it correct to understand that the new blockorder.rs maintains the old invariant that there are no dead blocks in the output? Or, at least, that the invariant is maintained by whatever means, that there are no dead blocks in the input to regalloc?

Yes, exactly: the output of BlockLoweringOrder::new() will never have dead blocks (even if the input CLIF does; the RPO naturally ensures this). Caught a few bugs with the regalloc's liveness checker so I'm somewhat confident this is the case now!

  • I see debug! calls in potentially hottish places, eg MachBuffer::put1(). Are we sure those become zero cost in release builds?

Fixed, thanks!

Copy link
Collaborator

@julian-seward1 julian-seward1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes (extra comments), really. LGTM.

@julian-seward1
Copy link
Collaborator

One other comment (not to block landing):

(revised isel driver): again, this is now more complex than it was. Question: is lookback past block boundaries still allowed? [..]

Yes, that's right; [..]

FTR, I am still of the opinion that lookback outside the block, even for just pure values, will turn out to be a net loss in the end, in the sense that the extra register pressure will cause much more of a loss than any minor improvements in insn selection that might result. That said, I don't have any Actual Evidence to substantiate my claim, at least currently.

@cfallin cfallin merged commit d8d6fbe into bytecodealliance:master May 19, 2020
@cfallin cfallin deleted the machinst-codebuffer branch May 19, 2020 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants