Introduce peepmatic: a peephole optimizations DSL and peephole optimizer compiler #1647
Subscribe to Label Action
cc @bnjbvr
This issue or pull request has been labeled: "cranelift"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
Cool! By the way does this fix the bug where preopt forgets to sign extend Edit: It doesn't. Left a comment at the place it should sign extend. |
Can you add the commit messages introducing a new crate to a top level doc comment in the respective crate? |
This is very well documented and structured code! |
Force-pushed from 162a13f to dba38ea.
Super cool! 😀 How does this compare to LuaJIT's FOLD optimization and their perfect-hash system? I know their trace compiler has a simpler IR than Cranelift, but what is the motivation for using an FST over a perfect hash map? Does it enable more complex matching?
This is exciting! A few high-level questions that I think would be important to answer before merging:
|
Something that would be nice to have sorted (pun intended) prior to merge is whether the auto-generated code is the same over multiple compilations of the crate, so |
I'm not really familiar with LuaJIT's FOLD optimizations, but reading through that comment, it seems a little less general (can only match three operations at most?). The idea of combining three opcode checks into a single check via perfect hashing is something we could investigate and add as a new @bnjbvr, as you know, we talked a bit about this at the Cranelift meeting today, but for posterity I'll put them in a comment again.
Yes, this is feature-gated behind the
Performance doesn't quite match the hand-coded peephole optimizer yet. This is one reason why it makes sense to land this off-by-default. This is unsurprising, since I haven't spent time on perf and optimization yet, other than the big picture design.

Graphs of wall time, instructions retired, cache misses, and branch misses

The following examples are for running

[Figures: Wall Time, Instructions Retired, Branch Misses, Cache Misses]

I have many ideas for perf improvements, but I'd like to land this PR first, and then start investigating perf in follow ups. Since peepmatic is not enabled by default, this shouldn't be risky.
The vast majority of peepmatic code is not necessary to compile unless you're changing the set of peephole optimizations. This is the motivation for the split between the

Timings of Cranelift's compile time

[Timing figures: Without Peepmatic; With Peepmatic (Not Rebuilding Peephole Optimizers); With Peepmatic (With Rebuilding Peephole Optimizers)]
Incremental builds are unaffected. Clean builds without rebuilding the peephole optimizers take a little bit longer (24 -> 31 seconds). Clean builds with rebuilding the peephole optimizers take ~3.5 minutes. This is mainly due to building and statically linking Z3. We could instead dynamically link against the system Z3 to avoid much of this overhead, but this has other problems, namely old Z3s that are missing some exported symbols (e.g. Ubuntu's packaged Z3).
(There is currently no generated Rust code, only a generated automaton that is then interpreted. This may change in the future. Sorry to nitpick.) Yes, builds are deterministic, producing the same automaton bit-for-bit given the same DSL input. CI is checking this, and one of the fuzz targets is also checking this. |
Oh, also, there was a question at the Cranelift meeting about how many optimizations we can expect to get out of Souper. @jubitaneja harvested candidate left-hand sides from I think we can expect to see roughly similar results, with a couple caveats:
|
It would be nice to also harvest candidate left-hand sides from cg_clif-generated CLIF IR. Maybe add a way for users to provide their own set of peephole optimizations to Cranelift?
Yep, this is definitely something we could do in the future. |
Finally got windows CI green, so now all CI is green! |
As we've been discussing this offline, this looks good, and thanks for putting peepmatic behind a feature test for now. I just have one question, and there's a minor merge conflict to resolve.
The `peepmatic-automata` crate builds and queries finite-state transducer automata.

A transducer is a type of automaton that has not only an input that it accepts or rejects, but also an output. While a regular automaton checks whether an input string is in the set that the automaton accepts, a transducer maps the input strings to values. A regular automaton is sort of a compressed, immutable set, and a transducer is sort of a compressed, immutable key-value dictionary. A [trie] compresses a set of strings or a map from a string to a value by sharing prefixes of the input string. Automata and transducers can compress even better: they can share both prefixes and suffixes. [*Index 1,600,000,000 Keys with Automata and Rust* by Andrew Gallant (aka burntsushi)][burntsushi-blog-post] is a top-notch introduction.

If you're looking for a general-purpose transducers crate in Rust you're probably looking for [the `fst` crate][fst-crate]. While this implementation is fully generic and has no dependencies, its feature set is specific to `peepmatic`'s needs:

* We need to associate extra data with each state: the match operation to evaluate next.

* We can't provide the full input string up front, so this crate must support incremental lookups. This is because the peephole optimizer is computing the input string incrementally and dynamically: it looks at the current state's match operation, evaluates it, and then uses the result as the next character of the input string.

* We also support incremental insertion and output when building the transducer. This is necessary because we don't want to emit output values that bind a match on an optimization's left-hand side's pattern (for example) until after we've succeeded in matching it, which might not happen until we've reached the n^th state.

* We need to support generic output values. The `fst` crate only supports `u64` outputs, while we need to build up an optimization's right-hand side instructions.
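To make the incremental-lookup requirement concrete, here is a minimal sketch of a toy transducer whose states carry a "next match operation" and whose lookups are fed one input character at a time. Everything in it (`MatchOp`, `State`, `Transducer`, `Query`, `drive`, the opcode numbering, and the output strings) is hypothetical and chosen for illustration; it is not the `peepmatic-automata` API, and it performs no minimization or suffix sharing.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-state data: which match operation to evaluate next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MatchOp {
    Opcode,
    IsConstZero,
    Done,
}

/// One state of the toy transducer: extra data plus outgoing transitions.
struct State {
    op: MatchOp,
    /// input character -> (next state index, output emitted on that edge)
    edges: BTreeMap<u32, (usize, Vec<&'static str>)>,
}

struct Transducer {
    states: Vec<State>,
}

/// An in-progress lookup that is fed one input character at a time.
struct Query<'a> {
    automaton: &'a Transducer,
    current: usize,
}

impl<'a> Query<'a> {
    fn new(automaton: &'a Transducer) -> Self {
        Query { automaton, current: 0 }
    }

    /// The match operation the caller should evaluate next.
    fn next_op(&self) -> MatchOp {
        self.automaton.states[self.current].op
    }

    /// Feed the result of the last match operation. Returns the output on
    /// the taken edge, or `None` if no transition exists for this input.
    fn advance(&mut self, input: u32) -> Option<&'a [&'static str]> {
        let automaton: &'a Transducer = self.automaton;
        let (next, out) = automaton.states[self.current].edges.get(&input)?;
        self.current = *next;
        Some(out.as_slice())
    }
}

/// Encodes a single made-up rule, `iadd x, 0 => x`; opcode 1 stands for `iadd`.
fn example() -> Transducer {
    let mut s0 = State { op: MatchOp::Opcode, edges: BTreeMap::new() };
    s0.edges.insert(1, (1, vec![]));
    let mut s1 = State { op: MatchOp::IsConstZero, edges: BTreeMap::new() };
    s1.edges.insert(1, (2, vec!["replace with x"]));
    let s2 = State { op: MatchOp::Done, edges: BTreeMap::new() };
    Transducer { states: vec![s0, s1, s2] }
}

/// Compute the input string on the fly: evaluate the current state's match
/// operation, then use the result as the next input character.
fn drive(t: &Transducer, eval: impl Fn(MatchOp) -> u32) -> Option<Vec<&'static str>> {
    let mut q = Query::new(t);
    let mut output = Vec::new();
    while q.next_op() != MatchOp::Done {
        let input = eval(q.next_op());
        output.extend_from_slice(q.advance(input)?);
    }
    Some(output)
}

fn main() {
    // Pretend every match operation evaluates to 1 for the instruction at hand.
    println!("{:?}", drive(&example(), |_| 1));
}
```

The caller never hands over a complete key: `drive` asks the current state which operation to evaluate, and only then learns the next input character. Note that the output appears only on the last edge, after the whole left-hand side has matched.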
This implementation is based on [*Direct Construction of Minimal Acyclic Subsequential Transducers* by Mihov and Maurel][paper]. That means that keys must be inserted in lexicographic order during construction.

[trie]: https://en.wikipedia.org/wiki/Trie
[burntsushi-blog-post]: https://blog.burntsushi.net/transducers/#ordered-maps
[fst-crate]: https://crates.io/crates/fst
[paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3698&rep=rep1&type=pdf
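The ordering requirement can be illustrated with a small sketch. The `Builder` below is hypothetical and only enforces the precondition that each key is strictly greater than the previous one; it does not implement the minimization from the paper, and it is not the crate's actual builder type.

```rust
/// Toy builder that only enforces the lexicographic-order precondition.
struct Builder {
    last_key: Option<Vec<u8>>,
    entries: Vec<(Vec<u8>, u32)>,
}

impl Builder {
    fn new() -> Self {
        Builder { last_key: None, entries: Vec::new() }
    }

    /// Insert a key, rejecting it unless it sorts strictly after the
    /// previously inserted key.
    fn insert(&mut self, key: &[u8], value: u32) -> Result<(), String> {
        if let Some(last) = &self.last_key {
            if key <= last.as_slice() {
                return Err(format!(
                    "key {:?} is not greater than previous key {:?}",
                    key, last
                ));
            }
        }
        self.last_key = Some(key.to_vec());
        self.entries.push((key.to_vec(), value));
        Ok(())
    }
}

/// Callers holding unsorted keys can sort them before construction.
fn build_sorted(mut pairs: Vec<(Vec<u8>, u32)>) -> Result<Builder, String> {
    pairs.sort();
    let mut builder = Builder::new();
    for (key, value) in &pairs {
        builder.insert(key, *value)?;
    }
    Ok(builder)
}

fn main() {
    let built = build_sorted(vec![(b"b".to_vec(), 2), (b"a".to_vec(), 1)]);
    println!("inserted {} keys", built.unwrap().entries.len());
}
```

Rejecting out-of-order keys up front keeps a misuse from silently producing a wrong automaton; whether the real crate reports this as an error or a panic is not stated here.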
This crate provides the derive macros used by `peepmatic`, notably AST-related derives that enumerate child AST nodes, and operator-related derives that provide helpers for type checking.
The `peepmatic-runtime` crate contains everything required to use a `peepmatic`-generated peephole optimizer. In short: build times and code size. If you are just using a peephole optimizer, you shouldn't need the functions to construct it from scratch from the DSL (and the implied code size and compilation time), let alone even build it at all. You should just deserialize an already-built peephole optimizer, and then use it. That's all that is contained here in this crate.
Peepmatic is a DSL for peephole optimizations and a compiler for generating peephole optimizers from them. The user writes a set of optimizations in the DSL, and then `peepmatic` compiles the set of optimizations into an efficient peephole optimizer:

```
DSL ----peepmatic----> Peephole Optimizer
```

The generated peephole optimizer has all of its optimizations' left-hand sides collapsed into a compact automaton that makes matching candidate instruction sequences fast.

The DSL's optimizations may be written by hand or discovered mechanically with a superoptimizer like [Souper][]. Eventually, `peepmatic` should have a verifier that ensures that the DSL's optimizations are sound, similar to what [Alive][] does for LLVM optimizations.

[Souper]: https://github.com/google/souper
[Alive]: https://github.com/AliveToolkit/alive2
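For readers unfamiliar with peephole optimization, the following sketch shows the kind of rewrite such an optimizer performs, written by hand over a toy expression type. The `Expr` type and `peephole` function are assumptions made for illustration; this is neither the `peepmatic` DSL nor Cranelift's IR, and a generated optimizer would match left-hand sides through the automaton rather than nested `match` arms.

```rust
/// Toy expression IR, invented for this example.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Expr {
    Var(&'static str),
    Const(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Apply a few identity and no-op rewrites bottom-up:
/// `x + 0 => x`, `x * 1 => x`, and `x * 0 => 0`.
fn peephole(expr: Expr) -> Expr {
    use Expr::*;
    match expr {
        Add(a, b) => match (peephole(*a), peephole(*b)) {
            (x, Const(0)) | (Const(0), x) => x,
            (x, y) => Add(Box::new(x), Box::new(y)),
        },
        Mul(a, b) => match (peephole(*a), peephole(*b)) {
            (_, Const(0)) | (Const(0), _) => Const(0),
            (x, Const(1)) | (Const(1), x) => x,
            (x, y) => Mul(Box::new(x), Box::new(y)),
        },
        other => other,
    }
}

fn main() {
    use Expr::*;
    // (x + 0) * 1
    let e = Mul(
        Box::new(Add(Box::new(Var("x")), Box::new(Const(0)))),
        Box::new(Const(1)),
    );
    println!("{:?}", peephole(e));
}
```

Each `match` arm plays the role of one optimization: the pattern is the left-hand side and the arm body is the right-hand side.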
This crate provides testing utilities for `peepmatic`, and a test-only instruction set we can use to check that various optimizations do or don't apply.
This crate contains oracles, generators, and fuzz targets for use with fuzzing engines (e.g. libFuzzer). This doesn't contain the actual `libfuzzer_sys::fuzz_target!` definitions (those are in the `peepmatic-fuzz` crate), but those definitions are one-liners calling out to functions defined in this crate.
This ports all of the identity, no-op, simplification, and canonicalization related optimizations over from being hand-coded to the `peepmatic` DSL. This does not handle the branch-to-branch optimizations or most of the divide-by-constant optimizations.
…ly used elsewhere
These ids end up in the automaton, so making them smaller should give us better data cache locality and also smaller serialized sizes.
A boxed slice is only two words, while a vec is three words. This should cut down on the memory size of our automata and improve cache usage.
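The size difference can be checked directly. This is a small sketch using only the standard library; the `words` helper is introduced here for illustration, and the three-word size of `Vec` reflects the current standard library layout (pointer, capacity, length) rather than a documented guarantee.

```rust
use std::mem::size_of;

/// Size of `T` measured in machine words (hypothetical helper).
fn words<T>() -> usize {
    size_of::<T>() / size_of::<usize>()
}

fn main() {
    println!("Vec<u32>:   {} words", words::<Vec<u32>>());
    println!("Box<[u32]>: {} words", words::<Box<[u32]>>());

    // Converting drops the capacity field; the elements are unchanged.
    let v: Vec<u32> = vec![1, 2, 3];
    let b: Box<[u32]> = v.into_boxed_slice();
    println!("boxed slice still holds {} elements", b.len());
}
```

A boxed slice gives up the ability to grow in place, which fits data that is built once and then only read.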
… point After replacing an instruction with an alias to an earlier value, trying to further optimize that value is unnecessary, since we've already processed it, and also was triggering an assertion.
Rather than outright replacing parts of our existing peephole optimizations passes, this makes peepmatic an optional cargo feature that can be enabled. This allows us to take a conservative approach with enabling peepmatic everywhere, while also allowing us to get it in-tree and make it easier to collaborate on improving it quickly.
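As a rough sketch of what gating on an optional cargo feature looks like in Rust, the example below selects between two implementations at compile time. The feature name `hypothetical-peepmatic` and the function `preopt_name` are placeholders invented for this example; the actual feature name and call sites in Cranelift are not given in this message.

```rust
// In Cargo.toml, an optional feature would be declared roughly like this
// (placeholder name, shown as a comment to keep the example self-contained):
//
//   [features]
//   hypothetical-peepmatic = []

/// Compiled only when the feature is enabled.
#[cfg(feature = "hypothetical-peepmatic")]
fn preopt_name() -> &'static str {
    "peepmatic"
}

/// Default path: the existing hand-coded pass stays in use.
#[cfg(not(feature = "hypothetical-peepmatic"))]
fn preopt_name() -> &'static str {
    "hand-coded"
}

fn main() {
    println!("peephole optimizer in use: {}", preopt_name());
}
```

Built without the feature, only the second definition exists, so the default behavior is unchanged unless a user opts in.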
Beyond just ensuring that they can still be built, ensure that rebuilding them doesn't result in a different built artifact.
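A check of this kind can be sketched as "rebuild, then compare bytes against the checked-in artifact". The `build` function below is a toy stand-in, not the real `peepmatic` build step, and the rule strings are placeholders rather than guaranteed-valid DSL syntax. It uses a `BTreeMap` so that iteration order, and therefore the output, is deterministic.

```rust
use std::collections::BTreeMap;

/// Toy "compiler": serializes rules in sorted order so that the same input
/// always produces the same bytes.
fn build(rules: &[(&str, &str)]) -> Vec<u8> {
    let sorted: BTreeMap<&str, &str> = rules.iter().copied().collect();
    let mut bytes = Vec::new();
    for (lhs, rhs) in &sorted {
        bytes.extend_from_slice(lhs.as_bytes());
        bytes.push(0);
        bytes.extend_from_slice(rhs.as_bytes());
        bytes.push(0);
    }
    bytes
}

/// Returns true if rebuilding yields exactly the previously built artifact.
fn rebuild_is_identical(rules: &[(&str, &str)], checked_in: &[u8]) -> bool {
    build(rules) == checked_in
}

fn main() {
    let rules = [("(iadd $x 0)", "$x"), ("(imul $x 1)", "$x")];
    let checked_in = build(&rules);
    println!("identical: {}", rebuild_is_identical(&rules, &checked_in));
}
```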
This also updates `wat` in the lockfile so that the SIMD spec tests are passing again.
This fixes Windows builds.
Force-pushed from c178080 to c093dee.
This PR introduces `peepmatic`, a peephole optimizations DSL and peephole optimizer compiler.

Developers write a set of optimizations in the DSL, and then `peepmatic` compiles the set of optimizations into an efficient peephole optimizer.

The generated peephole optimizer has all of its optimizations' left-hand sides collapsed into a compact transducer automaton that makes matching candidate instruction sequences fast.

The DSL's optimizations may be written by hand or discovered mechanically with a superoptimizer like Souper. Eventually, `peepmatic` should have a verifier that ensures that the DSL's optimizations are sound, similar to what Alive does for LLVM optimizations.

Learn More

* `cranelift/peepmatic/README.md` has an overview of the DSL and implementation
* Here is a slide deck I am presenting at the 2020-05-04 Cranelift meeting
Current Status

* I've ported most of `simple_preopt.rs` to `peepmatic`'s DSL
* All tests are passing
* I've been doing lots and lots of fuzzing
Next Steps

This work is not complete, but I think it is at a good point to merge into Cranelift and then evolve in-tree.

The next steps after landing this PR are:

* Port the rest of `simple_preopt.rs` over to `peepmatic`
* Port `postopt.rs` over to `peepmatic`
* Optimize the runtime that interprets the generated peephole optimizations transducers and applies them
* Extend `peepmatic` to work with the new backend's `MachInst` and vcode

For even further future directions, see the discussion in the slides, linked above.