
Cranelift: Using E-Graphs for Verified, Cooperating Middle-End Optimizations #27

Merged: 4 commits into bytecodealliance:main from cranelift-egraphs, Sep 7, 2022

Conversation

@cfallin (Member) commented Jun 30, 2022

Rendered

This RFC proposes to add a middle-end optimization framework to Cranelift based on e-graphs. A middle-end optimization framework improves the target-independent IR (intermediate representation), CLIF, before Cranelift translates it to machine instructions. Cranelift already has some basic optimizations in this category: global value numbering (GVN), loop-invariant code motion (LICM), constant folding, and most recently, a very basic form of alias analysis and redundant-load elimination. However, the current situation falls short in several ways.

The RFC first describes how the current situation suffers from three problems: how to order and interleave these optimizations as we build more of them (the phase-ordering problem), how to ensure they are correct when they are all delicate hand-written code (the verification problem), and how to build up a large body of known simplifications without tedious handwritten code. It then describes how an e-graph-based framework could address all three of these problems, and presents some novel contributions, developed during initial experimentation, in how to encode control flow in e-graphs (necessary in order to use e-graphs to optimize whole function bodies). Finally, it discusses how we can use our existing rewrite-rule DSL, ISLE, to describe rewrites on the e-graph.

@cfallin (Member, Author) commented Jun 30, 2022

Current prototype is bytecodealliance/wasmtime#4249; it implements the e-graph roundtripping but not (yet) the rewriting described in this RFC.

[Review thread on the RFC text: "…that when we generate code from the optimized function body within the e-graph, we retain equivalent control flow. There have been numerous *graph-based IRs* of other forms that *do*…"]

A reviewer commented:

I've not had a chance to read this paper yet [1], but do you know if PEGs are another graph-based IR in this category?

[1] https://rosstate.org/publications/eqsat/

cfallin (Member, Author) replied:

Thanks for the link -- I wasn't aware of PEGs previously, but they look to be in the same family as the other approaches! The theta-nodes for loops (see e.g. Fig. 2 in the POPL 2009 paper) indicate to me that these are region-based, like VSDGs, so the same discussion points apply (vs. the retained-CFG-skeleton approach here). I see also in the description that they compute each node at most once, and as a result do partial-redundancy elimination, whereas we take an approach trying to preserve our existing GVN semantics and only compute a value when we know it will be used, which is generally safer from an optimization-always-improves-perf point of view.

Another reviewer commented:

I haven't checked whether this applies to Program Expression Graphs specifically, but similar dataflow-based IRs like the Value Dependence Graph don't preserve non-termination. A loop that doesn't terminate might not define any values used later and would then be removed, "because CFG values that are not demanded do not appear." Also, the "Equality Saturation" paper you cited has a whole side thing (section 6.1) about handling side-effects like heap access, because that's another challenge for purely dataflow IRs.

This is what distinguishes the Value State Dependence Graph and the newer Regionalized VSDG. Treating state and termination dependencies as additional pseudo-dataflow edges means any topological sort of the graph is a valid serialization.

I assume Chris' approach preserves termination and side effects (at the cost of missing some optimization opportunities that might be safe), but I haven't looked at it carefully enough yet to understand how it does that. From what I've heard so far, I suspect it's different from anything I've seen in the literature.

cfallin (Member, Author) replied:

Yes, the approach proposed in this RFC preserves both termination and side-effects in a fairly simple way: by just keeping that part of the CFG around. More precisely, it maintains a "skeleton": for each basic block, a list of the eclass IDs of its side-effecting ops, in original order. This is then used during scoped elaboration to schedule the pure ops back into the spaces between them. One nice property of this is that we stay close enough to the original CFG that we could do the roundtrip in a way that never has to move instruction data (memory traffic is expensive!) except when a node is truly duplicated: the skeleton's instructions are stored in the original ir::Function, and the original copy of each e-node remains in the ir::Function as well.
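A minimal sketch of that shape (illustrative names, not the prototype's actual types):

```rust
// Illustrative sketch of the CFG "skeleton": only side-effecting ops are
// pinned to blocks; pure ops live solely in the e-graph and are
// re-scheduled between skeleton entries during scoped elaboration.
struct Id(u32); // an eclass ID

struct BlockSkeleton {
    // Side-effecting ops (calls, stores, traps, ...) in original program order.
    side_effecting: Vec<Id>,
}

struct FunctionSkeleton {
    // One entry per basic block, in layout order.
    blocks: Vec<BlockSkeleton>,
}
```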

@fitzgen (Member) commented Jul 1, 2022

For example, the bytecodealliance/wasmtime#1647 tooling included tools to "harvest" potential patterns to rewrite from real programs, and then use the superoptimization engine Souper to find rewrite rules offline. We could revisit such tooling and use it to build a significantly more effective body of rules semi-automatically.

FWIW, the harvester still exists and is independent of Peepmatic, it just extracts Souper IR from CLIF.

The only tool we're missing is a Souper-IR-to-ISLE(CLIF) translator, to bring Souper's synthesized optimizations back into ISLE/this new framework.

@philzook58 commented:

This is incredibly cool work and I am very interested in following it! I have also appreciated your very thorough blog posts in the past. I'm seeing whether it would be useful for people to have a place for e-graph discussion, so I made a nascent (and so far fairly quiet) Zulip: https://egraphs.zulipchat.com/join/q7bjayghbdw66xsnzgwc2hjb/

@fitzgen (Member) commented Jul 1, 2022

Super happy with how simple and obvious-after-the-fact your proposed e-graph destruction algorithm is (cycle breaking + scoped elaboration).

@fitzgen (Member) commented Jul 1, 2022

We will thus need a system of "optimization fuel" that limits rewrite steps overall and possibly per-rule or per-eclass. It is not yet certain what policies will work best; this is a question that we will need to experiment with. We simply wish to acknowledge now that it will be a problem that we will need to solve.

Note that egg already has a few knobs we can experiment with before trying to create our own custom thing here. In addition to perform-at-most-N-rewrite-iterations, you can also give it optional limits on the number of e-classes and e-nodes, and halt equality saturation once those are reached. I think the latter will be important for limiting the e-graph's memory overhead.
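For illustration, a sketch of those knobs using egg's Runner API (egg 0.x; the rules and expression are assumed to be defined elsewhere):

```rust
use egg::{Rewrite, Runner, SymbolLang};
use std::time::Duration;

// Build a runner with explicit resource limits; saturation halts when
// any limit is hit, and the runner records the corresponding StopReason.
fn saturate(
    expr: &egg::RecExpr<SymbolLang>,
    rules: &[Rewrite<SymbolLang, ()>],
) -> Runner<SymbolLang, ()> {
    Runner::default()
        .with_iter_limit(10)                        // at most N rewrite iterations
        .with_node_limit(10_000)                    // cap e-graph size (memory)
        .with_time_limit(Duration::from_millis(50)) // wall-clock budget
        .with_expr(expr)
        .run(rules)
}
```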

@mwillsey commented Jul 1, 2022

Hey all, egg author here. Very, very exciting stuff ahead! I will dive into the RFC soon and try to grok the details. I'm happy to help out as well!

@fitzgen (Member) commented Jul 1, 2022

How will e-graph-based optimization affect compile time?

Have you run spidermonkey.wasm through sightglass with and without your prototype that just does round tripping through an e-graph? That could give us a tiny taste of what compile time overheads might look like (also I imagine you'd disable the old GVN when it is subsumed by the e-graph round trip). I'd also expect that, in theory, execution time of the generated code would be unchanged, but that would also be a good thing to empirically check.

@cfallin (Member, Author) commented Jul 1, 2022

How will e-graph-based optimization affect compile time?

Have you run spidermonkey.wasm through sightglass with and without your prototype that just does round tripping through an e-graph? That could give us a tiny taste of what compile time overheads might look like (also I imagine you'd disable the old GVN when it is subsumed by the e-graph round trip). I'd also expect that, in theory, execution time of the generated code would be unchanged, but that would also be a good thing to empirically check.

Nope, not yet, but that's next on my plate! Before doing measurements I want to do some basic optimizations to the scheme that I know how to do -- mainly, reuse existing CLIF insts rather than repopulating -- which should dramatically lower overhead. If with that we are "at parity" or "not much worse" than LICM (assuming LICM tweaks here) + GVN today, then I'd be very happy, but I guess how much overhead to accept there (in exchange for having an optimization framework) is another good discussion point.

@fitzgen (Member) commented Jul 1, 2022

I feel like this RFC is a little different from many of our other RFCs. Usually
we know we can implement the thing, and the RFC process is more about API design
and determining whether to implement the thing as described or in some
alternative way. Merging most RFCs is essentially committing to "yes we will do
this". This RFC, on the other hand, has relatively large open questions around
compilation overhead and whether we will actually solve phase ordering the way
we believe we should in theory. I think merging this RFC doesn't necessarily
mean "yes we will do this" because we still aren't 100% sure we can do
this. We think we can, but we need to actually build more to prove it out. So
I think merging this RFC should mean something more like "we will start landing
(cfg'd off) code in main and when we (hopefully) reach X, Y, and Z
acceptance criteria, we will stop hiding the code behind cfgs and remove the
old, non-egraph code paths."

And I would love this RFC to define and document our ahead-of-time consensus on
what exactly those acceptance criteria are so that we don't have moving goal
posts later on down the line when we are deciding whether to actually, fully,
100% commit to the e-graphs mid-end framework or not.


Just to spark discussion, I'll throw some stuff out there:

  • What kind of compile time overhead is acceptable for when we first turn this
    on (with the expectation that we will have potential compile time wins down
    the line as documented in the "Overall Cranelift Structure and Migration Path"
    section)? 5% overhead? 10%?

  • What kind of memory overhead during compilation is acceptable? Unlike compile
    time, I don't think we have any obvious possibilities for improving this in
    the future other than limiting e-nodes/e-classes. (Please correct me if I'm
    wrong!) 50% additional max RSS? 20%? 10%?

  • Finally, what should the acceptance criteria be for generated code quality?
    How about: none of our Real World benchmarks in sightglass (SpiderMonkey,
    markdown, bz2, meshoptimizer, and maybe intgemm-simd but I'm less familiar
    with that one) get statistically significantly worse execution times and at
    least one of them gets statistically significantly better execution
    time. Additionally, we verify by hand that the alias analysis and GVN phase
    ordering example you mention in the RFC actually gets resolved in practice by
    our e-graphs framework.

What do other Cranelift stakeholders think our acceptance criteria should be?

Does compile time/space and code quality cover everything we need to keep an eye
on as far as acceptance quality goes or am I missing something else?

@cfallin (Member, Author) commented Jul 1, 2022

These are great questions, thanks @fitzgen -- I should've included more details around what I think the actual decisions should be (now and later)!

I actually am somewhat optimistic that we're close (a-few-weeks-to-a-month-ish?) to a point where we can evaluate the prototype plus basic rewriting rules that have functional parity to today's GVN, LICM, and simple_preopt, and make a decision with this RFC (merge or not) that means "we will actually do this". And to be clear I do want to get that far in the prototype before I personally advocate for merging; I want to ensure there are no unknown-unknowns in the rewrite step. I'd personally favor acceptance criteria that are something like:

  1. Replicates GVN, LICM, and simple_preopt (and clear path for alias analysis);
  2. Is clearly an improvement in ease of contributing further optimizations;
  3. Meets some compile-time threshold.

How we balance items 2 and 3 above is open for discussion, but I hope we will have minimal overhead when we have only today's functionality. That would mean that additional compile-time cost becomes a factor only when actually adding interesting additional rules/optimizations.

On the flipside, I don't know if it necessarily makes sense to have a "something must speed up on initial merge" criterion, though I'm open to that if a bunch of other folks disagree! The reason is that I think once we have (i) a framework and (ii) a replica of today's rewrites, the future upside is clear enough that the work to come up with those additional optimizations can happen as a second step. But, I can see how we might want to require such a speedup if the compile-time overhead is also non-negligible.

Anyway all this is open to a bunch of discussion and, on that note, I'll put an item on the agenda for our next biweekly :-)

@cfallin (Member, Author) commented Jul 6, 2022

So I did some quick-and-dirty compile-time measurements. On SpiderMonkey.wasm, the overhead of roundtripping through the egraph adds about 23% to compile time (when opts are enabled). All current handwritten opts account for about 10% of compile time.

That's not terrible (not a doubling, for example) but it's not great either, so I dug into which parts were slow. It turns out that the extraction and elaboration take very little time (less than 10% of the egraph addition); most of the time is in the initial egraph build.

I suspect that we'll need to do some careful data-structure engineering to get things within acceptable overheads, but fortunately it looks like there's some relatively low-hanging fruit. Right now in egg the EClass struct keeps two Vecs, one for nodes and one for parents; the Id-to-EClass map is a HashMap while I would expect the Id space to be somewhat dense (maybe this is to clear out eclasses that have been merged?); and the Node interface forces us to keep args internally in separate lists (I use SmallVec currently for this). All of this could be refactored to use central Vecs in a pseudo-arena way (like Cranelift does for CLIF args/results currently), at the cost of passing around the "backing Vec".
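Schematically, the contrast looks like this (illustrative types; Cranelift's actual EntityList packs its index and length differently than this sketch):

```rust
struct NodeId(u32);
struct Id(u32); // eclass ID

// egg-style layout: two separate heap allocations per eclass.
struct EClassNested {
    nodes: Vec<NodeId>,
    parents: Vec<(NodeId, Id)>,
}

// Pseudo-arena layout: each list is an (offset, len) slice into one
// central pool, so growth touches a single allocation and the data
// stays dense -- at the cost of passing the backing pool around.
struct ListPool<T> {
    storage: Vec<T>,
}

#[derive(Clone, Copy)]
struct ListRef {
    offset: u32,
    len: u32,
}

struct EClassPooled {
    nodes: ListRef,   // resolved against a ListPool<NodeId>
    parents: ListRef, // resolved against a ListPool<(NodeId, Id)>
}
```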

I'll work on that for a bit, then return to hacking on the ISLE integration; once I prove out both of these, I'm happy to make a bunch more measurements!

@mwillsey commented Jul 6, 2022

@cfallin Awesome! At first glance, 23% isn't so bad, and I think you're right: there's a lot that could be done to get that down. Some quick points:

  • While the Id space is initially dense, you actually hope that it ends up sparse because many e-classes are getting merged.
  • I think you are right, we could get some perf back by using more arenas in EClass.

Also, the way the e-graph is being used in the prototype is basically a worst-case scenario for performance; for every Func, you build an e-graph, do e-graph stuff, extract and throw away. One big way we've improved performance in other egg applications is by batching many optimization queries at once. That would obviously require re-architecting the way egg and clif interact, which is probably something to do down the road.

Overall, I think this means that there are some serious "high-level" ways to improve performance as well as the lower-level ways you mentioned.

@cfallin (Member, Author) commented Jul 6, 2022

Thanks @mwillsey! By the way I'm planning to continue hacking on my fork of egg but I plan to upstream everything in due time, once plans are stable...

Re: this point:

Also, the way the e-graph is being used in the prototype is basically a worst-case scenario for performance; for every Func, you build an e-graph, do e-graph stuff, extract and throw away. One big way we've improved performance in other egg applications is by batching many optimization queries at once. That would obviously require re-architecting the way egg and clif interact, which is probably something to do down the road.

I want to dig a bit deeper: it seems you're saying that one way of tolerating overheads is to amortize the construction and extraction over more use of the egraph. I guess that's true in a "fraction of time spent in construction+extraction" sense, but I'm thinking in an absolute sense here: the time spent building the egraph and extracting from it, with no work between the two, is the bare minimum in additional overhead, no? At least as long as we have a separate egraph per function, and given that the overall problem we are solving is "compile this CLIF to machine code"? Or it's possible I'm missing some other way of batching or amortization that you mean to suggest...

@cfallin (Member, Author) commented Jul 6, 2022

Oh also a quick calibration point re:

At first glance, 23% isn't so bad

In the Cranelift context this would actually be a huge deal: the whole regalloc rewrite (0.5 engineer-years) bought us about a 20% compile-time improvement, and I regularly fight to not give up 0.5%. This is an open question above, but I'll venture a ballpark number now just to calibrate (or at least give my own opinion): maybe 5% would be OK, but 20% would require a large increase in generated-code quality to justify, I think.

@mwillsey commented Jul 6, 2022

I'm planning to continue hacking on my fork of egg

Makes perfect sense. I'm happy to help hack on the fork if that's useful.

as long as we have a separate egraph per function

As long as the e-graph optimization workflow is per-function, you can't really amortize. The batching I'm talking about is across functions: there's nothing stopping you from adding 1000 functions to the same e-graph (hopefully there's some sharing in there!) and optimizing them simultaneously. I suspect that even just adding 1000 functions to one e-graph is faster than adding 1000 functions to 1000 e-graphs, because there is likely some sharing, but I could be wrong here.

Obviously one e-graph per function is cleaner for how the optimization workflow is right now. It just comes with the unfortunate overhead of basically copying your whole program twice: once into the e-graph and once out. Even outside of batching, another way of improving performance would be to eliminate one or both of these copies by increasing the part of the pipeline where you use e-graphs, ideally to some other "translation" boundary where you'd have to lower to another IR anyway. This is clearly a big task, though.

@cfallin (Member, Author) commented Jul 6, 2022

That's a really interesting idea; we can certainly play with it at some point!

We currently have a hard boundary between each function compilation, in order to allow for parallel compilation, but we'll need to break that boundary (optionally) eventually for inlining as well. And, actually, now that these two ideas are next to each other: putting the inlining and inlined function in the same egraph, linking up the dataflow, and then elaborating out of that is a really nice clean way of doing inlining. So, you've spawned some thoughts here, thanks!

One potentially complicating factor for deduplication efficiency between function bodies is that the "roots" (terminal nodes) will differ: both f and g may contain complex_expr(x, y, z), but where x, y, and z are different block params, side-effecting ops (calls, loads), or whatever, in the two functions. I wonder if there is some sort of canonicalization that could help (in the extreme case this feels like a graph-isomorphism problem, though). In the egraph literature I mostly see expressions over variable terminals x, y, etc., so this doesn't seem to be as big a deal there. Any thoughts?

@cfallin (Member, Author) commented Jul 8, 2022

A brief update: I hacked up a custom egraph implementation designed to be more memory-efficient and allocation-light, and got the overhead down to 9%.

The key differences: it never clones Nodes; it uses indexmap to provide both node-to-ID dedup and ID-to-node access, and uses those IDs in eclasses' lists of nodes (naming nodes by ID is itself a departure from egg); it uses Cranelift's EntityList to store lists of node IDs and eclass IDs in pooled storage; and it uses bumpalo (an arena allocator) to allocate the argument lists of eclass IDs. All of this implements two principles: (i) O(1) memory allocations -- grow a few large central containers rather than nesting containers per node or eclass (entity-component-system style); and (ii) avoid copies, storing each node only once. There's a little more we could do in theory with a custom hashmap that doesn't require Hash+Eq (namely, one that lets us provide context for those operations; that would allow us to provide the backing pool separately and store only a u32 index, rather than fat-pointer slices for args and types, in every Node).
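A hypothetical sketch of those shapes (illustrative names, not the actual cranelift-egraph types):

```rust
use indexmap::IndexSet;

// A node is interned exactly once; its index in the IndexSet serves as
// its ID, giving node -> ID dedup and ID -> node lookup in one structure.
#[derive(Hash, PartialEq, Eq)]
struct Node {
    op: u16,
    args: Vec<u32>, // in the real design these live in arena storage
}

struct EGraph {
    nodes: IndexSet<Node>,
    // Per-eclass node lists as (offset, len) slices into a shared pool
    // (the role Cranelift's EntityList plays in the prototype).
    eclasses: Vec<(u32, u32)>,
    id_pool: Vec<u32>,
}
```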

I haven't built the merging or rule-application bits yet but the roundtripping (without any rule applications) actually never merges eclasses so this is enough to evaluate the overhead of this part of the pipeline.

Given this, and given that GVN+LICM add around 10% to compile time, I'm actually cautiously optimistic that we can slip into the hole left by removing these stages and be almost neutral when no rewrite rules are applied. That's the ideal case in my mind. Or at least have small enough overhead that this roundtrip is always on when opts are on. Later we can remove half the roundtrip by generating egraph nodes straight from the frontend/builder API, as noted in the last section of the RFC (and as you note as well @mwillsey) for further savings, but for now anchoring both ends of the transform in CLIF (which can be printed, manipulated, used for testing, etc) has substantial pragmatic benefits I think.

(I'm not sure if the path forward is to actually use cranelift-egraph or egg with a bunch of hacks; as a general principle I don't like to go all NIH and hack up my own versions of things; but the data structures and tradeoffs are different enough that this may be the right way, I'm not sure. Right now I just want to pathfind wrt "is this performance profile possible".)

After dropping this tidbit I'm now promptly going to disappear for a week of PTO (...sorry!) but wanted to get that data point out first. More later!

@bjorn3 (Contributor) commented Jul 8, 2022

There's a little more we could do in theory with a custom hashmap that doesn't require Hash+Eq (namely, one that lets us provide context for those operations; that would allow us to provide the backing pool separately and store only a u32 index, rather than fat-pointer slices for args and types, in every Node).

hashbrown's RawTable might help with that.

@cfallin (Member, Author) commented Jul 18, 2022

There's a little more we could do in theory with a custom hashmap that doesn't require Hash+Eq (namely, one that lets us provide context for those operations; that would allow us to provide the backing pool separately and store only a u32 index, rather than fat-pointer slices for args and types, in every Node).

hashbrown's RawTable might help with that.

Indeed! I pushed a new implementation that does just that over here. (That could probably use a lot of cleanup, and is a bit lifetime-heavy, but it proves out what I wanted it to!) Thanks for the pointer.
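For illustration, the pattern looks roughly like this (a sketch assuming hashbrown's raw API, behind the `raw` feature in pre-0.15 versions; newer hashbrown exposes a similar HashTable type):

```rust
use hashbrown::raw::RawTable;
use std::hash::{Hash, Hasher};

// Illustrative node: an opcode plus argument indices, stored in a pool.
struct Node {
    op: u16,
    args: [u32; 2],
}

fn hash_node(pool: &[Node], ix: u32) -> u64 {
    let n = &pool[ix as usize];
    let mut h = std::collections::hash_map::DefaultHasher::new();
    (n.op, n.args).hash(&mut h);
    h.finish()
}

// The table stores bare u32 indices; hashing and equality take the pool
// as context, so the key type needs no standalone Hash/Eq impls.
fn intern(table: &mut RawTable<u32>, pool: &[Node], ix: u32) -> u32 {
    let hash = hash_node(pool, ix);
    let eq = |&other: &u32| {
        let (a, b) = (&pool[other as usize], &pool[ix as usize]);
        a.op == b.op && a.args == b.args
    };
    if let Some(&existing) = table.get(hash, eq) {
        return existing; // deduplicated: reuse the existing node's index
    }
    table.insert(hash, ix, |&other| hash_node(pool, other));
    ix
}
```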

@RDambrosio016 commented Aug 1, 2022

This RFC is very interesting to me; I plan to use e-graphs for a large part of the optimizer for a math-heavy language I'm writing. I considered using ISLE as well for writing patterns, and I think it would work pretty well. However, I ran into a couple of open questions I could not resolve:

  • Egg by default uses exponential backoff for rules to avoid dominating rewrite time with rules like associativity, but if we just use a single gigantic pattern-matching trie for the rules, how would we disable rules? This is probably doable by keeping some arrays counting rule "hits" and checking those arrays before matching/falling through, but the matching and memory overhead of this needs to be explored. Indirect jumps through a lookup table could probably also work (indirect-threaded-interpreter style). Would love to hear if @mwillsey has any ideas for this.
  • Is ISLE capable of easily expressing rules common to multiple types? For example, x * 1 => x can be applied to floats, ints, and uints of any size. I presume ISLE has a way to handle this, but I haven't explored it much. (See the sketch below.)
  • Is it possible to store e-class data separately from the actual language? I.e., the Language enum would be 64/128 bits and contain only the data required for matching -- for example, Add([Id; 2]) -- potentially letting rules match much faster by exploiting vectorized pattern matching.
  • Is egg/ISLE capable of tracking optimization invariants through Analysis in a performant manner? This would allow much deeper optimization by letting the optimizer track invariants -- for example, eliminating normalize for vectors when a value is known to already be normalized.

But overall this approach seems promising, and it would be awesome to share rulesets/ideas with cranelift.
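On the second question, a sketch of what a type-generic rule might look like in ISLE, assuming a `simplify` entry point over eclass values like the prototype's mid-end rules (names schematic, not the actual rule set): binding the same type variable on both operands lets one rule apply at every width it matches.

```lisp
;; Binding `ty` on both the multiply and the constant means the rule
;; fires for any integer type where both sides agree.
(rule (simplify (imul ty x (iconst ty 1)))
      x)
```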

@mwillsey commented Aug 1, 2022

Hi @RDambrosio016,

egg doesn't use trie-based pattern matching, so you'd essentially have to re-implement e-matching to support this. I believe Z3 does do something like this (Efficient E-matching for SMT Solvers, section 4). We prototyped it at some point, and found it was slower because the rules we were using didn't significantly "overlap". In a high-performance setting, what I would try before tries is just running rules using the SimpleScheduler, but with a fixed, small number of iterations.

As far as using Analysis for tracking invariants, sure! Just be careful, an analysis is per e-class, so it depends on your notion of equivalence.
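A sketch of that suggestion (egg 0.x API; the expression and rules are assumed to be defined elsewhere):

```rust
use egg::{Rewrite, Runner, SimpleScheduler, SymbolLang};

// Run every rule on every iteration (no exponential backoff), but cap
// the number of iterations so total matching work stays bounded.
fn run_fixed(
    expr: &egg::RecExpr<SymbolLang>,
    rules: &[Rewrite<SymbolLang, ()>],
) -> Runner<SymbolLang, ()> {
    Runner::default()
        .with_scheduler(SimpleScheduler)
        .with_iter_limit(3)
        .with_expr(expr)
        .run(rules)
}
```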

@RDambrosio016 commented:

Interesting. I did a small experiment generating rules with my own DSL and, while it was decently fast, it definitely seems like egg spends a lot of its time on rules that could never match when the ruleset is made somewhat larger by duplicating rules across numerous numeric types. In this particular ruleset I have ~61 unique rules (including bidirectionality), which after duplication across numeric types become 240 total rules, and this is how long egg takes on each step:

Number of rules: 240
Time from runner creation: 2.6857ms

Runner report
=============
  Stop reason: Saturated
  Iterations: 7
  Egraph size: 148 nodes, 23 classes, 177 memo
  Rebuilds: 4
  Total time: 0.0026048
    Search:  (0.82) 0.0021298000000000003
    Apply:   (0.17) 0.0004338
    Rebuild: (0.02) 0.0000397

2.0 - pow(x * 1.0 * -1.0, 2.0) + (y / z + 3.0) ----> fma(x, neg(x), fma(y, recip(z), 5))
Cost: 26.5 -> 12

This is a release build with 1 codegen unit and LTO on, on an i9-10850K. Search time making up the bulk of the total is not unexpected, but it seemed a little high to me considering how small the expression is. So I re-tried it without duplicating rules, and got these results:

Number of rules: 50
Time from runner creation: 1.0181ms

Runner report
=============
  Stop reason: Saturated
  Iterations: 7
  Egraph size: 148 nodes, 23 classes, 177 memo
  Rebuilds: 4
  Total time: 0.0009838
    Search:  (0.66) 0.0006475000000000001
    Apply:   (0.30) 0.0002992
    Rebuild: (0.04) 0.000035999999999999994

2.0 - pow(x * 1.0 * -1.0, 2.0) + (y / z + 3.0) ----> fma(x, neg(x), fma(y, recip(z), 5))
Cost: 26.5 -> 12

So it definitely seems like hierarchical matching of some sort would be required to avoid spending a ton of time searching the graph. It's odd that your experiment in trie matching ended up being slower -- was it using generated Rust code like ISLE creates, or was it matched dynamically at runtime?

@RDambrosio016 commented Aug 1, 2022

Absolutely 👍 -- I didn't mean to steer this into more minute details; I was just wondering whether you had considered/discussed this problem in the past, since I looked at the RFC and it did not seem to mention it. Anyway, I'm excited about this RFC, and interested in maybe contributing to implementing it/making it better.

@cfallin (Member, Author) commented Aug 1, 2022

For sure -- performance is my main concern right now, and I'm working to make the whole thing competitive with the handwritten optimizations, so you'll hear more about that in a bit!

@RDambrosio016 commented:

Beating hand-written optimizations is definitely an achievable goal, considering the e-graph approach has a much easier time knowing when to stop optimizing than a standard pass-based optimizer does. It will also be an interesting problem to figure out a good cost metric that maximizes generated-code quality. Having a complete e-graph to choose from is great, but selecting the ideal version out of it will definitely prove to be a challenge.

@cfallin (Member, Author) commented Aug 3, 2022

As an update here, on the running prototype/experimental branch here, I've gotten overhead of Cranelift-with-egraphs (including elaboration that subsumes GVN and LICM, and rewrite rules using ISLE that encompass basic const propagation and algebraic simplifications) vs Cranelift-with-handwritten-opts down to ~1% slower on parallel compile time (and actually slightly faster in terms of total CPU time across cores):

cfallin@min:~/work/wasmtime % hyperfine --show-output --  "target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm"
Benchmark 1: target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.693 s ±  0.098 s    [User: 10.747 s, System: 0.234 s]
  Range (min … max):    1.621 s …  1.951 s    10 runs

cfallin@min:~/work/wasmtime % hyperfine --show-output --  "target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm"
Benchmark 1: target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.671 s ±  0.035 s    [User: 10.888 s, System: 0.236 s]
  Range (min … max):    1.606 s …  1.707 s    10 runs

(That's measuring on an M1 laptop with SpiderMonkey.wasm, a large Wasm module testcase.)

This is using a new acyclic-egraph approach that I'm calling the "aegraph" (or ægraph, for those who love digraphs). I got conventional egraph rewriting working last week, and found that compilation time inflated significantly once rewrites occur -- something like 5x. I then had a series of realizations that the way we're using egraphs here is more limited than a conventional full-equality-saturation approach, and that there are simplifications we could make.

Basically an ægraph contains eclasses as nodes where each node contains an enode, an enode and a parent-eclass pointer, or two parent-eclass pointers. The union operation joins two eclasses together only by adding nodes (or as an optimization, replacing an enode node with an enode-and-parent node). An eclass node can refer only to earlier eclasses. We maintain the acyclicity invariant when adding enodes and unioning eclasses. Rule application occurs immediately when a new node is added. In a sense, eclasses are split into "levels" and a given level of an eclass refers only to the levels of argument eclasses that existed when it was created. It's thus a persistent immutable data structure. An eclass is represented by the id of its latest version (highest id), whereas the traditional union-find approach tends to push ids downward to the lowest id.
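Schematically, the node kinds might look like this (an illustrative sketch, not the prototype's actual definitions):

```rust
// An eclass is named by the Id of its latest version; every Id points
// only to earlier entries, which is the acyclicity invariant.
#[derive(Clone, Copy)]
struct Id(u32);

enum EClassEntry {
    // A plain enode: operator plus argument eclass Ids.
    Node { op: u16, args: Vec<Id> },
    // An enode plus a pointer to an earlier version of the same eclass.
    NodeAndParent { op: u16, args: Vec<Id>, parent: Id },
    // A union of two earlier versions of the eclass.
    Union { a: Id, b: Id },
}
```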

This data structure has some really compelling properties:

  • Because eclass nodes are immutable once created, we only need to apply rules to a node once. This "single pass" rule application is at the heart of its speed.
  • We don't re-canonicalize and re-intern enodes, so we don't need parent pointers and we don't need a rebuild step.
  • It plays really nicely with the scoped elaboration, in that we can keep a parallel scoped map with the "best node" (extraction information) and memoize; extraction doesn't need cycle removal any more.

The tradeoff is that it's not at all a full equality saturation (see above re: no recanonicalization). But I think that's OK for our use-case: we mainly (i) introduce new expressions then (ii) apply rules that introduce alternate forms of those expressions, and union them together. In this case the ægraph is pretty ideal: it's like a streaming journal of nodes and we add a new node from the build, then add a few more nodes that are equivalent to it and union them in-place, then keep going, referring to the just-built expression by the latest id. So the ægraph is sort of a "multi-version IR" where every value can have multiple representations, combined with GVN.

(The case it misses that a traditional egraph with parent pointers, recanonicalization, and the rebuild step would get is where we introduce some node A that gets a new eclass, a rewrite rule adds a node B and unions it with A, and B happens to dedup to an already-existing eclass. The new ID for A will point to a union node that joins A and B's existing eclass, but other users of B will not see A because it's "newer". This seems to be of limited use for our specific rules because our rewrite rules tend to go from more expensive/complex expressions to simpler ones, so A is likely not better than B. That's not always the case and I'm sure there are cases where a full egraph does do better. But the multiple-version-representation and ability of rules to match on all versions seems like the key thing that we want from the data structure.)

Anyway, I'll write this up more properly but that's where I'm at now: a prototype that works and is fast. I'll add a bit more to it (integration with alias analysis! and other rules I haven't added yet) and then start to look at generated code quality too.

@RDambrosio016 commented:

Very interesting! I'm curious to see whether that specific limitation will end up limiting any optimization potential in the future. It may be worth running generated-code benchmarks both with and without rebuilding, although it's definitely possible that associativity rules will hide that limitation. Nevertheless, the performance on something as large as the SpiderMonkey Wasm file is impressive! Did you happen to measure how much memory the e-graphs/program overall took?

I'm excited to try out this approach. I was thinking about perhaps working on algebraic identities for mul/div/rem, mostly based on this paper. Would you accept a PR to your experimental branch to add such rules?

@cfallin (Member, Author) commented Aug 3, 2022

Sure; more opts are definitely welcome in a PR, thanks!

I haven't done any more detailed measurements; I'm polishing things wrt integrations now and once I have it in a state where I'm happy, I'll do a full comparison and set of measurements (with Sightglass) on compile- and runtime performance.

@cfallin (Member, Author) commented Aug 10, 2022

Hi all -- I have some updates. It's kind of a mixed bag -- tl;dr, code quality matches existing Cranelift (all optimizations have been reproduced), compilation speed is ~1-5% slower. Theoretical benefits of everything-in-one-fixpoint (GVN and alias analysis and LICM and algebraic simplifications all cooperating) work in filetests, but don't seem to do much to Wasm-based benchmarks in Sightglass.

So the current status of the WIP branch is:

  • "acyclic egraph" (ægraph) implementation is working;
  • producing and lowering out of this aegraph works fine;
  • ISLE-based rewrite rules operate on this aegraph, and cost-based extraction picks the best option for each node after rewrites are done;
    • rewrites in the aegraph apply immediately (because of acyclicity, there's no need to revisit a node);
    • ISLE itself has been extended to support "multiplicity" in extractors and constructors, so multiple rules can match and matchers on eclass IDs see all available enodes for each eclass;
  • The aegraph construction and "scoped elaboration" subsume GVN (global value numbering, a form of common-subexpression elimination) by means of the deduplication map;
  • The aegraph elaboration automatically does LICM (loop-invariant code motion) implicitly, by way of loop-nest awareness when traversing the domtree;
  • The alias analysis feeds into the rewrite rules and there are rules for redundant-load elimination and store-to-load forwarding;
  • Rewrite rules have been written for a bunch of constant-propagation and algebraic-simplification cases, and some other slightly more ad-hoc ones (e.g. reassociating loads to try to make LICM on partially-loop-invariant sums work)
  • A simple form of rematerialization has been implemented, where adds-with-one-immediate, and immediates themselves, are regenerated in the block where they are used rather than maximally hoisted by GVN, in order to reduce register pressure (a sketch follows below).
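A sketch of that remat policy (illustrative names, not the prototype's actual code):

```rust
// Constants and add-with-immediate are cheap to recompute, so elaboration
// re-emits them in each block that uses them rather than keeping a single
// hoisted copy live across a long range (which costs a register).
enum Op {
    Iconst(i64),
    IaddImm(i64),
    Other,
}

fn remat_in_use_block(op: &Op) -> bool {
    matches!(op, Op::Iconst(_) | Op::IaddImm(_))
}
```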

All this works -- and IMHO the data structures and algorithms are pretty beautiful actually! -- but it's not any faster. It is nicer to update with more rewrite rules -- see cprop.isle and algebraic.isle -- and if it were completely at parity performance-wise, this would be an interesting benefit still IMHO. But there is still a small compile-time penalty, around 5% on average, with everything above incorporated.

I have some more thoughts in my running TODO list but most of the remaining ones are for compile speed, not code quality (runtime speed).

Given all that, I think I'm running up against a wall that is mostly built from two other limits:

  • There is actually not a ton of remaining "low-hanging fruit" when running Wasm modules produced by an optimizing compiler. (Any strength reduction or other clever "instcombine"-type optimization I would expect to be done by LLVM before producing the .wasm.)
  • Lowering, once it occurs, disallows any further optimizations, because we can't edit VCode currently.

The first is just a fact of our benchmarking conditions (but we're oriented this way because we care very much about Wasm performance -- we can't just choose to focus on poorly-optimized CLIF instead and say "see, look, opts work!"). But the second is where a lot of my thoughts are going lately. In other words, from first principles, if we start with highly-optimized Wasm bytecode, where do we expect inefficiencies and redundancies to creep in? Probably one of three areas:

  • The semantic gap between Wasm bytecode and CLIF, and how it is bridged (heap bounds checks, table accesses, and the like);
  • The semantic gap between CLIF and the machine, and how it is bridged (our lowering rules in our backends);
  • regalloc quality.

My forehead is already dented from banging against item 3 (regalloc quality) for a long time, and item 1 (Wasm to CLIF) is largely addressed already by designing CLIF semantics to be close to those of Wasm; but item 2 (lowering) presents unharvested opportunities IMHO.

In looking at disassemblies of "hot blocks" when profiling runtime I see a lot of not-great code, and it comes in several varieties: redundant flags computations (cmp / setnp / cmp / sete for float comparisons is a common one); redundant trap conditions that cprop could optimize away, or GVN could combine between multiple ops; address-mode junk produced by our "match up the operand tree and then regenerate an add tree for the parts we can't incorporate" logic in both x64 and aarch64; a bunch of "spill x+1, x+3, x+10, x+15 to separate spillslots for use later" (hence my attempt at remat above); and others. A lot of this comes from the inability to optimize code once we lower a CLIF op into multiple MachInsts, or one pseudo-MachInst.

(Those who have been around a while might be screaming: legalization and post-opt! Yes, you're right. I think we threw out some of the value of those in exchange for the speed and simplicity of the single-pass lowering, and the real value that we do get from a two-level IR. But we lost something too.)

So: I'm taking a little time to think about what might make sense as a way to optimize VCode. I don't think this means throwing out the aegraph and ISLE-based mid-end framework; I think instead there may be a way to write some mid-end legalizations that do a lot of expansion that we currently do in lowering, and then let lowering be simpler; or in the extreme, even have a Node::MachInst { ... } in the aegraph, though that has potential efficiency issues. More thought needed.

Anyway, I'd love to hear more thoughts on this, and we can talk about it more in the next Cranelift meeting, but there's my report on what I've learned in the last few weeks :-)

@cfallin (Member, Author) commented Aug 10, 2022

... and in a typical show of my luck, I discover just after writing the above that my remat was doing something wrong (LICM was overriding its decision). I now have the result:

  • on spidermonkey.wasm I see an 11.4% speedup, with 5% extra compile time (the latter I can optimize a bit more I think);
  • on bz2.wasm I see a 7.3% speedup, also with 5% extra compile time.

More measurements later but IMHO if I can get the compile time overhead a bit lower this should be compelling on its own (then the post-lowering optimization ideas above become the next step).

@cfallin (Member, Author) commented Aug 11, 2022

I did some optimization on compiler runtime tonight -- both low-level things like inline pragmas, refactoring some data structures to allow holding borrows rather than lookups, using dense rather than sparse maps in places, preallocating with correct sizes, and similar, and high-level things like algorithmic improvements to memoize more, trim the extraction search space (running upper bound and early exit/pruning), "subsume" all nodes in an eclass with one "final" version (for e.g. cprop), and the like.

In the end I got SpiderMonkey.wasm down to a ~1% compile-time regression (relative to baseline Cranelift), and bz2.wasm to a... 15% compile-time speedup (yeah, I know; it's wild; I was focusing entirely on SpiderMonkey and then just checked bz2 now, and would have fallen off my chair if gravity were slightly stronger):

[cfallin@xap]~/work/wasmtime% hyperfine  -L egraphs false,true 'target/release/wasmtime compile --cranelift-set use_egraphs={egraphs} ../wasm-tests/spidermonkey.wasm'
Benchmark 1: target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.106 s ±  0.051 s    [User: 15.034 s, System: 0.505 s]
  Range (min … max):    1.043 s …  1.201 s    10 runs

Benchmark 2: target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm                                                                                                
  Time (mean ± σ):      1.115 s ±  0.033 s    [User: 15.915 s, System: 0.510 s]
  Range (min … max):    1.077 s …  1.162 s    10 runs

Summary
  'target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm' ran
    1.01 ± 0.05 times faster than 'target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm'
[cfallin@xap]~/work/wasmtime% hyperfine -r 200  -L egraphs false,true 'target/release/wasmtime compile --cranelift-set use_egraphs={egraphs} ../wasm-tests/bz2.wasm'
Benchmark 1: target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/bz2.wasm
  Time (mean ± σ):      82.1 ms ±  10.8 ms    [User: 203.2 ms, System: 58.4 ms]
  Range (min … max):    71.4 ms … 106.5 ms    200 runs

Benchmark 2: target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/bz2.wasm
  Time (mean ± σ):      71.4 ms ±  10.2 ms    [User: 191.0 ms, System: 58.7 ms]
  Range (min … max):    59.6 ms …  96.3 ms    200 runs

Summary
  'target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/bz2.wasm' ran
    1.15 ± 0.22 times faster than 'target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/bz2.wasm'

And for completeness, yesterday's runtime numbers were on my M1 laptop, while today's work is on a Linux/x86-64 desktop (Ryzen 3900X); on this system, SpiderMonkey.wasm sees a 13% speedup, and bz2.wasm a 3% speedup:

[cfallin@xap]~/work/wasmtime% target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/spidermonkey.wasm -o base.cwasm
[cfallin@xap]~/work/wasmtime% target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/spidermonkey.wasm -o egraphs.cwasm
[cfallin@xap]~/work/wasmtime% hyperfine -L file base,egraphs 'target/release/wasmtime run --allow-precompiled --dir=. {file}.cwasm ./fib.js'
Benchmark 1: target/release/wasmtime run --allow-precompiled --dir=. base.cwasm ./fib.js
  Time (mean ± σ):      1.345 s ±  0.023 s    [User: 1.338 s, System: 0.005 s]
  Range (min … max):    1.317 s …  1.384 s    10 runs

Benchmark 2: target/release/wasmtime run --allow-precompiled --dir=. egraphs.cwasm ./fib.js
  Time (mean ± σ):      1.195 s ±  0.010 s    [User: 1.189 s, System: 0.003 s]
  Range (min … max):    1.181 s …  1.205 s    10 runs

Summary
  'target/release/wasmtime run --allow-precompiled --dir=. egraphs.cwasm ./fib.js' ran
    1.13 ± 0.02 times faster than 'target/release/wasmtime run --allow-precompiled --dir=. base.cwasm ./fib.js'
[cfallin@xap]~/work/wasmtime% target/release/wasmtime compile --cranelift-set use_egraphs=false ../wasm-tests/bz2.wasm -o base.cwasm
[cfallin@xap]~/work/wasmtime% target/release/wasmtime compile --cranelift-set use_egraphs=true ../wasm-tests/bz2.wasm -o egraphs.cwasm
[cfallin@xap]~/work/wasmtime% hyperfine -L file base,egraphs 'target/release/wasmtime run --allow-precompiled {file}.cwasm'
Benchmark 1: target/release/wasmtime run --allow-precompiled base.cwasm
  Time (mean ± σ):     428.3 ms ±   9.7 ms    [User: 425.3 ms, System: 1.7 ms]
  Range (min … max):   416.6 ms … 441.6 ms    10 runs

Benchmark 2: target/release/wasmtime run --allow-precompiled egraphs.cwasm
  Time (mean ± σ):     414.5 ms ±  12.8 ms    [User: 411.5 ms, System: 1.9 ms]
  Range (min … max):   403.1 ms … 440.7 ms    10 runs

Summary
  'target/release/wasmtime run --allow-precompiled egraphs.cwasm' ran
    1.03 ± 0.04 times faster than 'target/release/wasmtime run --allow-precompiled base.cwasm'

I think we're at the point where we can discuss this in the next Cranelift meeting, then hopefully, if consensus exists, I'll start cleaning this up and preparing it for merging (in small, more easily-digestible pieces!).

@mwillsey commented:
@cfallin Amazing work! These are killer results! I think specializing to the ægraph (great name) is definitely the right decision for this use case; the trade-offs seem to be worth it. Although, it does make me wonder what it would take to make "normal" equality saturation usable in this setting. Exciting future work!

@oflatt commented Aug 11, 2022

This is so cool!
I'd be really interested to see how much better the results are with different tweaks. Does full egraph-style matching do much good? I'm thinking of the case where an enode creates a new match in a parent eclass.

@cfallin (Member, Author) commented Aug 11, 2022

Thanks @mwillsey! I'm definitely curious what other design points may be possible here. I suspect that the benefits of more strongly normalizing approaches (full recanonicalize/rebuild as in a traditional egraph or egg) will become more apparent the more interesting rules we add. Right now I have rules I wrote with my traditional-compiler-author brain, mostly constant propagation and the like, where the right-hand side is unambiguously better. There are a few cases of reassociation where I let the extractor do the right thing, though; and scoped elaboration can make things interesting when another expression forces evaluation of some otherwise-expensive node, shorting its cost to 0 for subsequent uses and altering the best choice (this is why extraction is done lazily and memoized). In those cases I suspect having multiple versions around actually helps.

There are definitely some experiments I'd like to run, if I were to put on my old grad-student hat (it has collected some dust!) -- among them, putting together a full rebuild step as @oflatt suggests, and maybe also a "rewrites always subsume" mode (ie no multi-version capability) with the rules. My day-job means I need to focus on getting this "makes compiler faster and better" magic into production shape first, of course, but I do want to experiment some as time allows :-)

@remysucre commented:

Hello! Yet another egg contributor here. @mwillsey and I had a great chat with Nico Bruno, who was a main engineer of SQL Server. It turns out one main difference between equality saturation and Cascades-style DB query optimization is that the latter maintains the acyclicity of the e-graph, just like the aegraph. Acyclicity enables many powerful optimizations like branch-and-bound, and usually runs very fast. I highly recommend checking out some of the literature there, as well as some modern implementations like Cockroach and Orca.

@cfallin (Member, Author) commented Aug 11, 2022

@remysucre thanks for the pointers! (As with many other topics in systems work, the database folks got there first... why am I not surprised?!). I'm happy to hear there are similar approaches in use and I'll read up more.

@cfallin (Member, Author) commented Aug 15, 2022

Hi all -- I've updated the RFC to refer to ægraphs and include the numbers above. A full writeup of how ægraphs work is IMHO out-of-scope here but I will include a bunch of detail somewhere in cranelift/codegen/docs/ if this merges. The important decision points here, IMHO, are:

  • Should we have a mid-end?
  • Should it operate based on rewriting rules?
  • Is an implementation based on the prototype acceptable, given its numbers (with polish in a bunch of production details before merging)?

I'd like to kick off a round of discussion and (hopefully!) consensus-building on that. Happy to talk more about this in the next Cranelift meeting but let's see what folks think here first. Thanks!

@cfallin (Member, Author) commented Aug 17, 2022

Thanks @avanhatt! Updated based on feedback.

@cfallin (Member, Author) commented Aug 23, 2022

Hi all -- after discussing this briefly yesterday it sounds like no one has any major objections; so given that, and the positive results we've found, I would like to make a

Motion to finalize with a disposition to merge

I've updated the list below to contain all stakeholders who have committed to Cranelift within the past three months, or participated in RFC discussions in the same timeframe.

Stakeholders sign-off

  • Arm
  • Cornell
  • Embark Studios
  • Fastly
  • IBM
  • Intel
  • Unaffiliated

@cfallin (Member, Author) commented Aug 24, 2022

Given approvals from multiple stakeholder groups, we now enter the 10-day final comment period.

If no objections are registered by 2022-09-03, we will merge this RFC. Thanks all for the discussion!

@uweigand (Member) left a comment:

LGTM - nice results!

@elliottt (Member) left a comment:

🎉

@cfallin (Member, Author) commented Sep 7, 2022

The FCP has now (more than) elapsed; with no objections, I will now merge this RFC! Thanks all for the spirited discussion and ideas. Hopefully soon now we can get a proper implementation into Cranelift; my next task is to clean up the prototype and start to submit pieces for review.

@cfallin cfallin merged commit 70484a7 into bytecodealliance:main Sep 7, 2022
@cfallin cfallin deleted the cranelift-egraphs branch September 7, 2022 15:58