
Introduce peepmatic: a peephole optimizations DSL and peephole optimizer compiler #1647

Merged
22 commits, merged May 14, 2020

Conversation

fitzgen
Member

@fitzgen fitzgen commented May 2, 2020

This PR introduces peepmatic, a peephole optimizations DSL and peephole optimizer compiler.

Developers write a set of optimizations in the DSL, and then peepmatic compiles the set of optimizations into an efficient peephole optimizer:

DSL ----peepmatic----> Peephole Optimizer

The generated peephole optimizer has all of its optimizations' left-hand sides collapsed into a compact transducer automaton that makes matching candidate instruction sequences fast.

The DSL's optimizations may be written by hand or discovered mechanically with a superoptimizer like Souper. Eventually, peepmatic should have a verifier that ensures that the DSL's optimizations are sound, similar to what Alive does for LLVM optimizations.
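For a flavor of the DSL, optimizations are written as left-hand/right-hand side pairs in an S-expression syntax. A sketch (the exact surface syntax lives in cranelift/peepmatic/examples/preopt.peepmatic; treat these lines as illustrative):

```
(=> (iadd $x 0) $x)          ;; adding zero is a no-op
(=> (imul $x 2) (ishl $x 1)) ;; strength-reduce multiply-by-two to a shift
```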

Learn More

Current Status

  • I've ported most of simple_preopt.rs to peepmatic's DSL

  • All tests are passing

  • I've been doing lots and lots of fuzzing

Next Steps

This work is not complete, but I think it is at a good point to merge into Cranelift and then evolve in-tree.

The next steps after landing this PR are:

  • Port the rest of simple_preopt.rs over to peepmatic

  • Port postopt.rs over to peepmatic

  • Optimize the runtime that interprets the generated peephole-optimization transducers and applies them

  • Extend peepmatic to work with the new backend's MachInst and vcode

For even further future directions, see the discussion in the slides, linked above.

@fitzgen fitzgen requested a review from sunfishcode May 2, 2020 00:23
@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label May 2, 2020

@bjorn3
Contributor

bjorn3 commented May 2, 2020

Cool! By the way, does this fix the bug where preopt forgets to sign-extend imm when optimizing v1 = iconst.i8 imm; v2 = icmp sgt v0, v1 into v2 = icmp_imm sgt v0, imm?

Edit: It doesn't. Left a comment at the place where it should sign-extend.
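For context, the fix amounts to sign-extending the narrow immediate before it is widened into the icmp_imm form. A minimal sketch of the arithmetic (a hypothetical helper, not the actual Cranelift code):

```rust
/// Sign-extend the low `bits` bits of `imm`, so that e.g. an `i8`
/// immediate of 0xFF is interpreted as -1 rather than 255.
fn sign_extend(imm: i64, bits: u32) -> i64 {
    let shift = 64 - bits;
    // The arithmetic right shift replicates the sign bit back down.
    (imm << shift) >> shift
}

fn main() {
    assert_eq!(sign_extend(0xFF, 8), -1);
    assert_eq!(sign_extend(0x7F, 8), 127);
    assert_eq!(sign_extend(0x8000, 16), -32768);
    println!("sign extension ok");
}
```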

@bjorn3
Contributor

bjorn3 commented May 2, 2020

Can you add the commit messages that introduce a new crate as top-level doc comments in the respective crates?

@bjorn3
Contributor

bjorn3 commented May 2, 2020

This is very well documented and structured code!

@Techcable

Super cool! 😀 How does this compare to LuaJIT's FOLD optimization and their perfect-hash system? I know their trace compiler has a simpler IR than Cranelift, but what is the motivation for using an FST over a perfect hash map? Does it enable more complex matching?

@bnjbvr
Member

bnjbvr commented May 11, 2020

This is exciting! A few high-level questions that I think would be important to answer before merging:

  • does this replace the existing simple_preopt, or is this something that ought to be disabled until it has feature parity with the current simple_preopt? I would advocate not enabling this by default if this isn't at feature parity with the existing system, since the current system was actually useful.
  • if this is at feature parity and we plan to enable it by default, can you provide performance data, please? (Before/after comparisons of wall-clock compile and run times, plus the total number of executed instructions via Valgrind/perf.) If this is a slowdown on either of these measures, I would strongly advocate not enabling it by default and keeping the existing system in the meanwhile.
  • (Less important, mostly for my personal curiosity but this doesn't have to block anything) do we have any ideas of what the (Rust) compile time difference would be, with this new system? (Auto-generated code tends to create large functions which are quite slow to compile.)

@froydnj
Collaborator

froydnj commented May 11, 2020

* (Less important, mostly for my personal curiosity but this doesn't have to block anything) do we have any ideas of what the (Rust) compile time difference would be, with this new system? (Auto-generated code tends to create large functions which are quite slow to compile.)

Something that would be nice to have sorted (pun intended) prior to merge is whether the auto-generated code is the same over multiple compilations of the crate, so sccache works correctly.

@fitzgen
Member Author

fitzgen commented May 11, 2020

@Techcable

How does this compare to LuaJIT's FOLD optimization and their perfect-hash system?

I'm not really familiar with LuaJIT's FOLD optimizations, but reading through that comment, it seems a little less general (can only match three operations at most?). The idea of combining three opcode checks into a single check via perfect hashing is something we could investigate and add as a new MatchOp, perhaps.


@bnjbvr, as you know, we talked a bit about this at the Cranelift meeting today, but for posterity I'll repeat the answers in a comment here.

does this replace the existing simple_preopt, or is this something that ought to be disabled until it has feature parity with the current simple_preopt? I would advocate not enabling this by default if this isn't at feature parity with the existing system, since the current system was actually useful.

Yes, this is feature-gated behind the "enable-peepmatic" cargo feature right now, and the feature is not enabled by default.

can you provide performance data, please?

Performance doesn't quite match the hand-coded peephole optimizer yet. This is one reason why it makes sense to land this off-by-default. This is unsurprising, since I haven't spent time on perf and optimization yet, other than the big picture design.

Graphs of wall time, instructions retired, cache misses, and branch misses

The following measurements are for running wasmtime markdown.wasm '# Hello, World!', where markdown.wasm internally uses pulldown-cmark. This is a 272 KiB Wasm file.

[Graphs: wall time, instructions retired, branch misses, cache misses]

I have many ideas for perf improvements, but I'd like to land this PR first, and then start investigating perf in follow ups. Since peepmatic is not enabled by default, this shouldn't be risky.

do we have any ideas of what the (Rust) compile time difference would be, with this new system?

The vast majority of peepmatic code is not necessary to compile unless you're changing the set of peephole optimizations. This is the motivation for the split between the peepmatic crate (the compiler, only run at build time when the "rebuild-peephole-optimizers" feature is enabled) and the peepmatic-runtime crate (just the things needed to use a peepmatic-generated peephole optimizer).

Timings of Cranelift's compile time

Without Peepmatic

fitzgen@erdos :: (master) :: ~/wasmtime/cranelift/codegen
    $ cargo clean; time cargo build --quiet

real    0m24.207s
user    1m13.714s
sys     0m4.391s

fitzgen@erdos :: (master) :: ~/wasmtime/cranelift/codegen
    $ echo "// comment" >> src/lib.rs

fitzgen@erdos :: (master *) :: ~/wasmtime/cranelift/codegen
    $ time cargo build --quiet

real    0m2.424s
user    0m1.962s
sys     0m0.559s

With Peepmatic (Not Rebuilding Peephole Optimizers)

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ cargo clean; time cargo build --quiet --features enable-peepmatic

real    0m31.580s
user    1m44.893s
sys     0m6.192s

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ echo "// comment" >> src/lib.rs

fitzgen@erdos :: (integrate-peepmatic *) :: ~/wasmtime/cranelift/codegen
    $ time cargo build --quiet --features enable-peepmatic

real    0m2.491s
user    0m1.988s
sys     0m0.604s

With Peepmatic (With Rebuilding Peephole Optimizers)

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ cargo clean; time cargo build --quiet --features 'enable-peepmatic rebuild-peephole-optimizers'

real    3m35.014s
user    20m46.827s
sys     1m40.616s

fitzgen@erdos :: (integrate-peepmatic) :: ~/wasmtime/cranelift/codegen
    $ echo "// comment" >> src/lib.rs

fitzgen@erdos :: (integrate-peepmatic *) :: ~/wasmtime/cranelift/codegen
    $ time cargo build --quiet --features 'enable-peepmatic rebuild-peephole-optimizers'

real    0m2.649s
user    0m2.187s
sys     0m0.563s

Incremental builds are unaffected.

Clean builds without rebuilding the peephole optimizers take a little bit longer (24 -> 31 seconds).

Clean builds with rebuilding the peephole optimizers take ~3.5 minutes. This is mainly due to building and statically linking Z3. We could instead dynamically link the system Z3 to avoid much of this overhead, but that has other problems, namely old Z3s that are missing some exported symbols (e.g. Ubuntu's packaged Z3).


@froydnj

whether the auto-generated code is the same over multiple compilations of the crate, so sccache works correctly.

(There is currently no generated Rust code, only a generated automaton that is then interpreted. This may change in the future. Sorry to nitpick.)

Yes, builds are deterministic, producing the same automaton bit-for-bit given the same DSL input. CI is checking this, and one of the fuzz targets is also checking this.

@fitzgen
Member Author

fitzgen commented May 11, 2020

Oh, also, there was a question at the Cranelift meeting about how many optimizations we can expect to get out of Souper.

@jubitaneja harvested candidate left-hand sides from rustfmt compiled to Wasm with LLVM optimizations and then ran them through Souper. Souper successfully synthesized 836 optimizations, 221 of which reduce the whole LHS to a constant.

I think we can expect to see roughly similar results, with a couple caveats:

  • First, she was harvesting LHS candidates from the Wasm, not the clif that the Wasm gets translated into. On the one hand, it isn't clear how many of these synthesized optimizations are subsumed by our existing preopt pass. On the other, these candidates are harvested after LLVM optimizations, and I'm pretty sure LLVM's optimizations largely subsume our preopt pass's, so maybe these are new/unique/missing optimizations?

  • Second, choosing a corpus of benchmark Wasms to harvest LHSes from is tricky, but I am pretty sure we will have more than a single file in the corpus from more than just a single toolchain. So I'd suspect that we would synthesize even more optimizations than this.

@bjorn3
Contributor

bjorn3 commented May 12, 2020

It would be nice to also harvest candidate left-hand sides from cg_clif-generated clif IR. Maybe add a way for users to provide their own set of peephole optimizations to Cranelift?

@fitzgen
Member Author

fitzgen commented May 12, 2020

It would be nice to also harvest candidate left-hand sides from cg_clif-generated clif IR. Maybe add a way for users to provide their own set of peephole optimizations to Cranelift?

Yep, this is definitely something we could do in the future.

@fitzgen
Member Author

fitzgen commented May 12, 2020

Finally got windows CI green, so now all CI is green!

Member

@sunfishcode sunfishcode left a comment


As we've discussed offline, this looks good, and thanks for putting peepmatic behind a feature flag for now. I just have one question, and there's a minor merge conflict to resolve.

fitzgen added 11 commits May 14, 2020 07:50
The `peepmatic-automata` crate builds and queries finite-state transducer
automata.

A transducer is a type of automaton that has not only an input that it
accepts or rejects, but also an output. While a regular automaton checks
whether an input string is in the set that it accepts, a transducer maps
input strings to values. A regular automaton is sort of a compressed,
immutable set, and a transducer is sort of a compressed, immutable key-value
dictionary. A [trie] compresses a set of strings, or a map from strings to
values, by sharing prefixes of the input strings. Automata and transducers
can compress even better: they can share both prefixes and suffixes. [*Index
1,600,000,000 Keys with Automata and Rust* by Andrew Gallant (aka
burntsushi)][burntsushi-blog-post] is a top-notch introduction.

If you're looking for a general-purpose transducers crate in Rust, you're
probably looking for [the `fst` crate][fst-crate]. While this implementation
is fully generic and has no dependencies, its feature set is specific to
`peepmatic`'s needs:

* We need to associate extra data with each state: the match operation to
  evaluate next.

* We can't provide the full input string up front, so this crate must
  support incremental lookups. This is because the peephole optimizer is
  computing the input string incrementally and dynamically: it looks at the
  current state's match operation, evaluates it, and then uses the result as
  the next character of the input string.

* We also support incremental insertion and output when building the
  transducer. This is necessary because we don't want to emit output values
  that bind a match on an optimization's left-hand side's pattern (for
  example) until after we've succeeded in matching it, which might not
  happen until we've reached the n^th state.

* We need to support generic output values. The `fst` crate only supports
  `u64` outputs, while we need to build up an optimization's right-hand side
  instructions.

This implementation is based on [*Direct Construction of Minimal Acyclic
Subsequential Transducers* by Mihov and Maurel][paper]. That means that keys
must be inserted in lexicographic order during construction.

[trie]: https://en.wikipedia.org/wiki/Trie
[burntsushi-blog-post]: https://blog.burntsushi.net/transducers/#ordered-maps
[fst-crate]: https://crates.io/crates/fst
[paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3698&rep=rep1&type=pdf
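The incremental-lookup requirement above can be pictured as a table-driven walk: each state names the match operation to evaluate next, and that operation's result selects the outgoing transition. A toy sketch of the idea (hypothetical types, not the `peepmatic-automata` API):

```rust
use std::collections::BTreeMap;

/// One automaton state: the match operation to evaluate next, transitions
/// keyed by that operation's result, and an output once matching succeeds.
struct State {
    match_op: &'static str,
    transitions: BTreeMap<u64, usize>,
    output: Option<&'static str>,
}

/// Drive the automaton incrementally: evaluate the current state's match
/// operation, use its result as the next "character" of the input string,
/// and repeat until we reach an output or fall off the automaton.
fn query(states: &[State], mut eval: impl FnMut(&str) -> u64) -> Option<&'static str> {
    let mut current = 0;
    loop {
        let state = &states[current];
        if let Some(out) = state.output {
            return Some(out);
        }
        current = *state.transitions.get(&eval(state.match_op))?;
    }
}

fn main() {
    // Toy two-step match: check the opcode, then check the operand.
    let states = vec![
        State { match_op: "opcode-is-iadd", transitions: [(1, 1)].into(), output: None },
        State { match_op: "operand-is-zero", transitions: [(1, 2)].into(), output: None },
        State { match_op: "", transitions: BTreeMap::new(), output: Some("replace with $x") },
    ];
    // Pretend both match operations evaluate to "true" (1).
    assert_eq!(query(&states, |_op| 1), Some("replace with $x"));
    println!("matched");
}
```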
This crate provides the derive macros used by `peepmatic`, notably AST-related
derives that enumerate child AST nodes, and operator-related derives that
provide helpers for type checking.
The `peepmatic-runtime` crate contains everything required to use a
`peepmatic`-generated peephole optimizer.

Why a separate crate? In short: build times and code size.

If you are just using a peephole optimizer, you shouldn't need the machinery
to construct it from scratch from the DSL (and the code size and compilation
time that implies), let alone to build it at all. You should just deserialize
an already-built peephole optimizer and then use it.

That's all this crate contains.
Peepmatic is a DSL for peephole optimizations and compiler for generating
peephole optimizers from them. The user writes a set of optimizations in the
DSL, and then `peepmatic` compiles the set of optimizations into an efficient
peephole optimizer:

```
DSL ----peepmatic----> Peephole Optimizer
```

The generated peephole optimizer has all of its optimizations' left-hand sides
collapsed into a compact automaton that makes matching candidate instruction
sequences fast.

The DSL's optimizations may be written by hand or discovered mechanically with a
superoptimizer like [Souper][]. Eventually, `peepmatic` should have a verifier
that ensures that the DSL's optimizations are sound, similar to what [Alive][]
does for LLVM optimizations.

[Souper]: https://github.com/google/souper
[Alive]: https://github.com/AliveToolkit/alive2
This crate provides testing utilities for `peepmatic`, and a test-only
instruction set we can use to check that various optimizations do or don't
apply.
This crate contains oracles, generators, and fuzz targets for use with fuzzing
engines (e.g. libFuzzer). It doesn't contain the actual
`libfuzzer_sys::fuzz_target!` definitions (those are in the `peepmatic-fuzz`
crate), but those definitions are one-liners that call out to functions
defined in this crate.
This ports all of the identity, no-op, simplification, and canonicalization
related optimizations over from being hand-coded to the `peepmatic` DSL. This
does not handle the branch-to-branch optimizations or most of the
divide-by-constant optimizations.
fitzgen added 11 commits May 14, 2020 07:52
These ids end up in the automaton, so making them smaller should give us better
data cache locality and also smaller serialized sizes.
A boxed slice is only two words, while a vec is three words. This should cut
down on the memory size of our automata and improve cache usage.
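That two-words-versus-three claim is easy to confirm with `std::mem::size_of` (a quick illustrative check, not code from this PR):

```rust
use std::mem::size_of;

fn main() {
    let word = size_of::<usize>();
    // A boxed slice is a fat pointer: data pointer + length.
    assert_eq!(size_of::<Box<[u32]>>(), 2 * word);
    // A Vec additionally tracks its capacity.
    assert_eq!(size_of::<Vec<u32>>(), 3 * word);
    println!("Box<[u32]>: 2 words, Vec<u32>: 3 words");
}
```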
… point

After replacing an instruction with an alias to an earlier value, trying to
further optimize that value is unnecessary, since we've already processed it;
it was also triggering an assertion.
Rather than outright replacing parts of our existing peephole optimization
passes, this makes peepmatic an optional cargo feature. This allows us to take
a conservative approach to enabling peepmatic everywhere, while still getting
it in-tree, where it is easier to collaborate on improving it quickly.
Beyond just ensuring that they can still be built, ensure that rebuilding them
doesn't result in a different built artifact.
This also updates `wat` in the lockfile so that the SIMD spec tests are passing
again.
This fixes Windows builds.