
Introduce peepmatic: a peephole optimizations DSL and peephole optimizer compiler #1647

Merged
22 commits, merged May 14, 2020

Conversation

fitzgen
Member

@fitzgen fitzgen commented May 2, 2020

This PR introduces peepmatic, a peephole optimizations DSL and peephole optimizer compiler.

Developers write a set of optimizations in the DSL, and then peepmatic compiles the set of optimizations into an efficient peephole optimizer:

DSL ----peepmatic----> Peephole Optimizer

The generated peephole optimizer has all of its optimizations' left-hand sides collapsed into a compact transducer automaton that makes matching candidate instruction sequences fast.

The DSL's optimizations may be written by hand or discovered mechanically with a superoptimizer like Souper. Eventually, peepmatic should have a verifier that ensures that the DSL's optimizations are sound, similar to what Alive does for LLVM optimizations.
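For a flavor of the DSL, optimizations are written as left-hand/right-hand side pairs in an S-expression syntax. A sketch (the exact surface syntax lives in cranelift/peepmatic/examples/preopt.peepmatic; treat these lines as illustrative):

```
(=> (iadd $x 0) $x)          ;; adding zero is a no-op
(=> (imul $x 2) (ishl $x 1)) ;; strength-reduce multiply-by-two to a shift
```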

Learn More

Current Status

  • I've ported most of simple_preopt.rs to peepmatic's DSL

  • All tests are passing

  • I've been doing lots and lots of fuzzing

Next Steps

This work is not complete, but I think it is at a good point to merge into Cranelift and then evolve in-tree.

The next steps after landing this PR are:

  • Port the rest of simple_preopt.rs over to peepmatic

  • Port postopt.rs over to peepmatic

  • Optimize the runtime that interprets the generated peephole-optimization transducers and applies them

  • Extend peepmatic to work with the new backend's MachInst and vcode

For even further future directions, see the discussion in the slides, linked above.

@fitzgen fitzgen requested a review from sunfishcode May 2, 2020 00:23
@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label May 2, 2020

@bjorn3
Contributor

bjorn3 commented May 2, 2020

Cool! By the way, does this fix the bug where preopt forgets to sign-extend imm when optimizing v1 = iconst.i8 imm; v2 = icmp sgt v0, v1 into v2 = icmp_imm sgt v0, imm?

Edit: It doesn't. Left a comment at the place where it should sign-extend.
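For context, the fix amounts to sign-extending the narrow immediate before it is widened into the icmp_imm form. A minimal sketch of the arithmetic (a hypothetical helper, not the actual Cranelift code):

```rust
/// Sign-extend the low `bits` bits of `imm`, so that e.g. an `i8`
/// immediate of 0xFF is interpreted as -1 rather than 255.
fn sign_extend(imm: i64, bits: u32) -> i64 {
    let shift = 64 - bits;
    // The arithmetic right shift replicates the sign bit back down.
    (imm << shift) >> shift
}

fn main() {
    assert_eq!(sign_extend(0xFF, 8), -1);
    assert_eq!(sign_extend(0x7F, 8), 127);
    assert_eq!(sign_extend(0x8000, 16), -32768);
    println!("sign extension ok");
}
```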

@bjorn3
Contributor

bjorn3 commented May 2, 2020

Can you add the commit messages that introduce a new crate as top-level doc comments in the respective crates?

@bjorn3
Contributor

bjorn3 commented May 2, 2020

This is very well documented and structured code!

@Techcable

Super cool! 😀 How does this compare to LuaJIT's FOLD optimization and their perfect-hash system? I know their trace compiler has a simpler IR than Cranelift, but what is the motivation for using an FST over a perfect hash map? Does it enable more complex matching?

@bnjbvr
Member

bnjbvr commented May 11, 2020

This is exciting! A few high-level questions that I think would be important to answer before merging:

  • does this replace the existing simple_preopt, or is this something that ought to be disabled until it has feature parity with the current simple_preopt? I would advocate not enabling this by default if this isn't at feature parity with the existing system, since the current system was actually useful.
  • if this is at feature parity and we plan to enable it by default, can you provide performance data, please? (Before/after comparisons of wall-clock compile and run times, plus the total number of executed instructions via Valgrind/perf.) If this is a slowdown on either of these measures, I would strongly advocate not enabling it by default and keeping the existing system in the meanwhile.
  • (Less important, mostly for my personal curiosity but this doesn't have to block anything) do we have any ideas of what the (Rust) compile time difference would be, with this new system? (Auto-generated code tends to create large functions which are quite slow to compile.)

@froydnj
Collaborator

froydnj commented May 11, 2020

* (Less important, mostly for my personal curiosity but this doesn't have to block anything) do we have any ideas of what the (Rust) compile time difference would be, with this new system? (Auto-generated code tends to create large functions which are quite slow to compile.)

Something that would be nice to have sorted (pun intended) prior to merge is whether the auto-generated code is the same over multiple compilations of the crate, so sccache works correctly.

@fitzgen
Member Author

fitzgen commented May 11, 2020

@Techcable

How does this compare to LuaJIT's FOLD optimization and their perfect-hash system?

I'm not really familiar with LuaJIT's FOLD optimizations, but reading through that comment, it seems a little less general (can only match three operations at most?). The idea of combining three opcode checks into a single check via perfect hashing is something we could investigate and add as a new MatchOp, perhaps.


@bnjbvr, as you know, we talked a bit about this at the Cranelift meeting today, but for posterity I'll repeat the answers in a comment here.

does this replace the existing simple_preopt, or is this something that ought to be disabled until it has feature parity with the current simple_preopt? I would advocate not enabling this by default if this isn't at feature parity with the existing system, since the current system was actually useful.

Yes, this is feature-gated behind the "enable-peepmatic" cargo feature right now, and the feature is not enabled by default.

can you provide performance data, please?

Performance doesn't quite match the hand-coded peephole optimizer yet. This is one reason why it makes sense to land this off-by-default. This is unsurprising, since I haven't spent time on perf and optimization yet, other than the big picture design.

Graphs of wall time, instructions retired, cache misses, and branch misses

The following measurements are for running wasmtime markdown.wasm '# Hello, World!', where markdown.wasm internally uses pulldown-cmark. This is a 272 KiB Wasm file.

[Graphs: wall time, instructions retired, branch misses, cache misses]

I have many ideas for perf improvements, but I'd like to land this PR first, and then start investigating perf in follow ups. Since peepmatic is not enabled by default, this shouldn't be risky.

do we have any ideas of what the (Rust) compile time difference would be, with this new system?

The vast majority of peepmatic code is not necessary to compile unless you're changing the set of peephole optimizations. This is the motivation for the split between the peepmatic crate (the compiler, only run at build time when the "rebuild-peephole-optimizers" feature is enabled) and the peepmatic-runtime crate (just the things needed to use a peepmatic-generated peephole optimizer).

Timings of Cranelift's compile time

Without Peepmatic

fitzgen@erdos :: (master) :: ~/wasmtime/cranelift/codegen
    $ cargo clean; time cargo build --quiet

real    0m24.207s
user    1m13.714s
sys     0m4.391s

fitzgen@erdos :: (master) :: ~/wasmtime/cranelift/codegen
    $ echo "// comment" >> src/lib.rs

fitzgen@erdos :: (master *) :: ~/wasmtime/cranelift/codegen
    $ time cargo build --quiet

real    0m2.424s
user    0m1.962s
sys     0m0.559s

With Peepmatic (Not Rebuilding Peephole Optimizers)

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ cargo clean; time cargo build --quiet --features enable-peepmatic

real    0m31.580s
user    1m44.893s
sys     0m6.192s

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ echo "// comment" >> src/lib.rs

fitzgen@erdos :: (integrate-peepmatic *) :: ~/wasmtime/cranelift/codegen
    $ time cargo build --quiet --features enable-peepmatic

real    0m2.491s
user    0m1.988s
sys     0m0.604s

With Peepmatic (With Rebuilding Peephole Optimizers)

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ cargo clean; time cargo build --quiet --features 'enable-peepmatic rebuild-peephole-optimizers'

real    3m35.014s
user    20m46.827s
sys     1m40.616s

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ echo "// comment" >> src/lib.rs

fitzgen@erdos :: (integrate-peepmatic *) :: ~/wasmtime/cranelift/codegen
    $ time cargo build --quiet --features 'enable-peepmatic rebuild-peephole-optimizers'

real    0m2.649s
user    0m2.187s
sys     0m0.563s

Incremental builds are unaffected.

Clean builds without rebuilding the peephole optimizers take a little bit longer (24 -> 31 seconds).

Clean builds with rebuilding the peephole optimizers take ~3.5 minutes. This is mainly due to building and statically linking Z3. We could instead dynamically link the system Z3 to avoid much of this overhead, but that has other problems, namely old Z3s that are missing some exported symbols (e.g. Ubuntu's packaged Z3).


@froydnj

whether the auto-generated code is the same over multiple compilations of the crate, so sccache works correctly.

(There is currently no generated Rust code, only a generated automaton that is then interpreted. This may change in the future. Sorry to nitpick.)

Yes, builds are deterministic, producing the same automaton bit-for-bit given the same DSL input. CI is checking this, and one of the fuzz targets is also checking this.

@fitzgen
Member Author

fitzgen commented May 11, 2020

Oh, also, there was a question at the Cranelift meeting about how many optimizations we can expect to get out of Souper.

@jubitaneja harvested candidate left-hand sides from rustfmt compiled to Wasm with LLVM optimizations and then ran them through Souper. Souper successfully synthesized 836 optimizations, 221 of which reduce the whole LHS to a constant.

I think we can expect to see roughly similar results, with a couple caveats:

  • First, she was harvesting LHS candidates from the Wasm, not the clif that the Wasm gets translated into. On the one hand, it isn't clear how many of these synthesized optimizations are subsumed by our existing preopt pass. On the other, these candidates are harvested after LLVM optimizations, and I'm pretty sure LLVM's optimizations largely subsume our preopt pass's, so maybe these are new/unique/missing optimizations?

  • Second, choosing a corpus of benchmark Wasms to harvest LHSes from is tricky, but I am pretty sure we will have more than a single file in the corpus from more than just a single toolchain. So I'd suspect that we would synthesize even more optimizations than this.

@bjorn3
Contributor

bjorn3 commented May 12, 2020

It would be nice to also harvest candidate left-hand sides from cg_clif-generated clif IR. Maybe add a way for users to provide their own set of peephole optimizations to Cranelift?

@fitzgen
Member Author

fitzgen commented May 12, 2020

It would be nice to also harvest candidate left-hand sides from cg_clif-generated clif IR. Maybe add a way for users to provide their own set of peephole optimizations to Cranelift?

Yep, this is definitely something we could do in the future.

@fitzgen
Member Author

fitzgen commented May 12, 2020

Finally got windows CI green, so now all CI is green!

Member

@sunfishcode sunfishcode left a comment


As we've discussed offline, this looks good, and thanks for putting peepmatic behind a feature flag for now. I just have one question, and there's a minor merge conflict to resolve.

fitzgen added 11 commits May 14, 2020 07:50
The `peepmatic-automata` crate builds and queries finite-state transducer
automata.

A transducer is a type of automaton that has not only an input that it
accepts or rejects, but also an output. While a regular automaton checks
whether an input string is in the set that it accepts, a transducer maps
input strings to values. A regular automaton is sort of a compressed,
immutable set, and a transducer is sort of a compressed, immutable key-value
dictionary. A [trie] compresses a set of strings, or a map from strings to
values, by sharing prefixes of the input strings. Automata and transducers
can compress even better: they can share both prefixes and suffixes. [*Index
1,600,000,000 Keys with Automata and Rust* by Andrew Gallant (aka
burntsushi)][burntsushi-blog-post] is a top-notch introduction.

If you're looking for a general-purpose transducers crate in Rust, you're
probably looking for [the `fst` crate][fst-crate]. While this implementation
is fully generic and has no dependencies, its feature set is specific to
`peepmatic`'s needs:

* We need to associate extra data with each state: the match operation to
  evaluate next.

* We can't provide the full input string up front, so this crate must
  support incremental lookups. This is because the peephole optimizer is
  computing the input string incrementally and dynamically: it looks at the
  current state's match operation, evaluates it, and then uses the result as
  the next character of the input string.

* We also support incremental insertion and output when building the
  transducer. This is necessary because we don't want to emit output values
  that bind a match on an optimization's left-hand side's pattern (for
  example) until after we've succeeded in matching it, which might not
  happen until we've reached the n^th state.

* We need to support generic output values. The `fst` crate only supports
  `u64` outputs, while we need to build up an optimization's right-hand side
  instructions.

This implementation is based on [*Direct Construction of Minimal Acyclic
Subsequential Transducers* by Mihov and Maurel][paper]. That means that keys
must be inserted in lexicographic order during construction.

[trie]: https://en.wikipedia.org/wiki/Trie
[burntsushi-blog-post]: https://blog.burntsushi.net/transducers/#ordered-maps
[fst-crate]: https://crates.io/crates/fst
[paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3698&rep=rep1&type=pdf
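The incremental-lookup requirement above can be pictured as a table-driven walk: each state names the match operation to evaluate next, and that operation's result selects the outgoing transition. A toy sketch of the idea (hypothetical types, not the `peepmatic-automata` API):

```rust
use std::collections::BTreeMap;

/// One automaton state: the match operation to evaluate next, transitions
/// keyed by that operation's result, and an output once matching succeeds.
struct State {
    match_op: &'static str,
    transitions: BTreeMap<u64, usize>,
    output: Option<&'static str>,
}

/// Drive the automaton incrementally: evaluate the current state's match
/// operation, use its result as the next "character" of the input string,
/// and repeat until we reach an output or fall off the automaton.
fn query(states: &[State], mut eval: impl FnMut(&str) -> u64) -> Option<&'static str> {
    let mut current = 0;
    loop {
        let state = &states[current];
        if let Some(out) = state.output {
            return Some(out);
        }
        current = *state.transitions.get(&eval(state.match_op))?;
    }
}

fn main() {
    // Toy two-step match: check the opcode, then check the operand.
    let states = vec![
        State { match_op: "opcode-is-iadd", transitions: [(1, 1)].into(), output: None },
        State { match_op: "operand-is-zero", transitions: [(1, 2)].into(), output: None },
        State { match_op: "", transitions: BTreeMap::new(), output: Some("replace with $x") },
    ];
    // Pretend both match operations evaluate to "true" (1).
    assert_eq!(query(&states, |_op| 1), Some("replace with $x"));
    println!("matched");
}
```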
This crate provides the derive macros used by `peepmatic`, notably AST-related
derives that enumerate child AST nodes, and operator-related derives that
provide helpers for type checking.
The `peepmatic-runtime` crate contains everything required to use a
`peepmatic`-generated peephole optimizer.

Why a separate crate? In short: build times and code size.

If you are just using a peephole optimizer, you shouldn't need the machinery
to construct it from scratch from the DSL (and the code size and compilation
time that implies), let alone to build it at all. You should just deserialize
an already-built peephole optimizer and then use it.

That's all this crate contains.
Peepmatic is a DSL for peephole optimizations and compiler for generating
peephole optimizers from them. The user writes a set of optimizations in the
DSL, and then `peepmatic` compiles the set of optimizations into an efficient
peephole optimizer:

```
DSL ----peepmatic----> Peephole Optimizer
```

The generated peephole optimizer has all of its optimizations' left-hand sides
collapsed into a compact automaton that makes matching candidate instruction
sequences fast.

The DSL's optimizations may be written by hand or discovered mechanically with a
superoptimizer like [Souper][]. Eventually, `peepmatic` should have a verifier
that ensures that the DSL's optimizations are sound, similar to what [Alive][]
does for LLVM optimizations.

[Souper]: https://github.com/google/souper
[Alive]: https://github.com/AliveToolkit/alive2
This crate provides testing utilities for `peepmatic`, and a test-only
instruction set we can use to check that various optimizations do or don't
apply.
This crate contains oracles, generators, and fuzz targets for use with fuzzing
engines (e.g. libFuzzer). It doesn't contain the actual
`libfuzzer_sys::fuzz_target!` definitions (those are in the `peepmatic-fuzz`
crate), but those definitions are one-liners that call out to functions
defined in this crate.
This ports all of the identity, no-op, simplification, and canonicalization
related optimizations over from being hand-coded to the `peepmatic` DSL. This
does not handle the branch-to-branch optimizations or most of the
divide-by-constant optimizations.
fitzgen added 11 commits May 14, 2020 07:52
These ids end up in the automaton, so making them smaller should give us better
data cache locality and also smaller serialized sizes.
A boxed slice is only two words, while a vec is three words. This should cut
down on the memory size of our automata and improve cache usage.
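That two-words-versus-three claim is easy to confirm with `std::mem::size_of` (a quick illustrative check, not code from this PR):

```rust
use std::mem::size_of;

fn main() {
    let word = size_of::<usize>();
    // A boxed slice is a fat pointer: data pointer + length.
    assert_eq!(size_of::<Box<[u32]>>(), 2 * word);
    // A Vec additionally tracks its capacity.
    assert_eq!(size_of::<Vec<u32>>(), 3 * word);
    println!("Box<[u32]>: 2 words, Vec<u32>: 3 words");
}
```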
… point

After replacing an instruction with an alias to an earlier value, trying to
further optimize that value is unnecessary, since we've already processed it;
it was also triggering an assertion.
Rather than outright replacing parts of our existing peephole optimization
passes, this makes peepmatic an optional cargo feature. This allows us to take
a conservative approach to enabling peepmatic everywhere, while still getting
it in-tree, where it is easier to collaborate on improving it quickly.
Beyond just ensuring that they can still be built, ensure that rebuilding them
doesn't result in a different built artifact.
This also updates `wat` in the lockfile so that the SIMD spec tests are passing
again.
This fixes Windows builds.