GC Optimization Guidebook

Alon Zakai edited this page Oct 18, 2024 · 35 revisions

As mentioned in the optimizer cookbook (which you may also find interesting), Binaryen's primary optimization pipeline has historically been tuned on LLVM output. That is, wasm-opt -O3 gives you a useful set of optimizations to run on something clang/rustc/etc. emitted at -O3. Wasm GC languages may benefit from a different set of optimizations, primarily because such languages generally do not use LLVM (e.g., Java, Kotlin, Dart), so the "shape" of the Wasm they emit is different, and different Binaryen optimizations may work better.

In addition to the fact that LLVM is often not used, Wasm GC is fundamentally different from Wasm MVP in the sense that it is much closer to a true IR for optimization. Wasm GC is an IR with explicit allocations, explicit function pointers, etc., which means Binaryen can do a lot more optimizations. So while Binaryen was limited in how it could optimize Wasm MVP content, it can be used as a general-purpose optimizer for Wasm GC.

In more practical terms, wasm-opt -O3 on LLVM output does some useful refining of LLVM's optimizations for Wasm MVP, but on Wasm GC we can do a lot more, and Binaryen can be used as the primary optimizer component in a toolchain. But to fully benefit from Binaryen's optimizations in that situation, we need to do more than just pass -O3. That is the topic of the rest of this page.

(You may also be interested in the page on emitting optimizable WasmGC code from your compiler.)

Multiple Optimization Passes

As mentioned above, wasm-opt -O3 is enough for LLVM output, because LLVM emits code in an already quite optimized form. For other code you may want to run the entire Binaryen optimization pipeline more than once, which you can do like this:

wasm-opt -O3 -O3

Each -O3 not only sets the opt level to 3 but also asks the tool to run the full optimization pipeline, and so it can be specified more than once (this is different from the UI for gcc and clang). The same holds for -Os etc.

If your compiler has not done many general-purpose optimizations before Binaryen runs on it, you may want several rounds of optimization. For example, J2Wasm uses 6 or so, at the time of writing this doc.
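If you end up tuning the number of rounds, a small shell helper can build the repeated flags for you. This is only a sketch: repeat_pipeline is a hypothetical helper name, not something Binaryen provides.

```shell
# Print LEVEL repeated ROUNDS times, separated by single spaces.
# repeat_pipeline is a hypothetical helper, not part of Binaryen.
repeat_pipeline() {
  level="$1"
  rounds="$2"
  flags=""
  i=0
  while [ "$i" -lt "$rounds" ]; do
    # Add a separating space only when flags is already non-empty.
    flags="${flags:+$flags }$level"
    i=$((i + 1))
  done
  printf '%s\n' "$flags"
}
```

It could then be used as wasm-opt -all $(repeat_pipeline -O3 6) input.wasm -o output.wasm, where the file names are placeholders and -all (enable all Wasm features) is an assumption you may need to adapt to how your toolchain declares features.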

GUFA

The Grand Unified Flow Analysis is an optimization that scans the entire program and makes inferences about what content can appear where. This can help MVP content, but it really shines on GC because it infers a lot about types. In particular, it can find when a location must contain a constant, which can lead to devirtualization, a crucial optimization.

GUFA is a heavyweight optimization and not run by default. You run it manually with --gufa. When to run it, and how many times, is worth experimenting with, but you can try things like this:

  • -O3 --gufa -O3: One run of the main optimization pipeline, then GUFA, then another run of the pipeline to take advantage of GUFA's findings.
  • -O3 --gufa -O3 -O3
  • -O3 --gufa -O3 --gufa -O3

etc. You can also try -Os instead of -O3 etc.

It can be useful to run --metrics in the middle, to see the impact of a pass. For example, wasm-opt --metrics -O3 --metrics will dump metrics once, then optimize, then dump metrics again. The second dump will contain a diff against the first, so you can see how the count of each type of instruction changed.
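To see the effect of every step in a longer pipeline, you can interleave --metrics between all of the passes. The helper below is a sketch that does that mechanically; with_metrics is a hypothetical name, not a Binaryen feature.

```shell
# Print the given passes with --metrics before the first one and after
# each one, so every step is followed by a metrics dump.
# with_metrics is a hypothetical helper, not part of Binaryen.
with_metrics() {
  out="--metrics"
  for pass in "$@"; do
    out="$out $pass --metrics"
  done
  printf '%s\n' "$out"
}
```

For instance, with_metrics -O3 --gufa -O3 prints --metrics -O3 --metrics --gufa --metrics -O3 --metrics, which can be passed to wasm-opt together with --closed-world (required by GUFA) and your input and output files.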

You can also try --gufa-cast-all, which runs GUFA and adds casts for all the inferences it can make. That is, it will add casts to more refined types that appear in the IR, even if it doesn't see an immediate benefit to that. Later passes can hopefully take advantage of those casts, apply the benefits, and remove casts that end up unhelpful. In practice, whether --gufa or --gufa-cast-all is better will depend on the particular codebase, since adding more casts "speculatively" will sometimes work out and sometimes not.

GUFA requires --closed-world to run.
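Since the best GUFA ordering depends on the codebase, one way to experiment is to run each candidate pipeline on the same input and compare output sizes. The following is a rough sketch under several assumptions: the function names and the candidate-N.wasm output names are made up for illustration, wasm-opt is assumed to be on your PATH, and -all (enable all Wasm features) may need to be replaced by whatever feature flags your toolchain needs.

```shell
# The candidate pipelines from the list above, one per line.
gufa_candidates() {
  cat <<'EOF'
-O3 --gufa -O3
-O3 --gufa -O3 -O3
-O3 --gufa -O3 --gufa -O3
EOF
}

# Run every candidate on INPUT and print the resulting size in bytes.
# Hypothetical helper: output file names and -all are assumptions.
compare_gufa_candidates() {
  input="$1"
  n=0
  while IFS= read -r flags; do
    n=$((n + 1))
    out="candidate-$n.wasm"
    # $flags is intentionally unquoted so it splits into separate arguments.
    # --closed-world is included because GUFA requires it.
    wasm-opt -all --closed-world $flags "$input" -o "$out" < /dev/null || return 1
    printf '%s bytes: %s\n' "$(wc -c < "$out" | tr -d ' ')" "$flags"
  done <<EOF
$(gufa_candidates)
EOF
}
```

Calling compare_gufa_candidates input.wasm (a placeholder file name) would then print one line per candidate. Binary size is only one signal; you will likely also want to measure runtime speed on your own benchmarks.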

Closed World

By default we assume that any type that escapes to the outside may be inspected and interacted with. For example, if a struct escapes then the outside may read a field, or if a function reference escapes then the outside may call it. That means we cannot alter that field or that function, say by refining the type of the field or removing a parameter, etc. If you do not have such interactions with the outside then you can tell Binaryen to assume a closed world:

wasm-opt --closed-world

In a closed world we run several more important passes, and some other passes become more effective, so this is quite important to do, if you can.

Note that you can still let references escape to the outside. For example, the outside might hold onto a reference (caching it on some other object, say) and pass it back in. The only thing that is disallowed is for the outside to interact with the contents of the reference.

  • The type of such escaping references matters: If it is anyref then we do not limit optimizations, but if it is a specific GC type then we consider that specific type public and avoid modifying it. For that reason it is best to only use basic types on the boundary.

Note that the meaning of "closed world" is a little subtle, since the module can still have imports and exports. For more details see the documentation on the closedWorld flag in pass.h.

TypeSSA, TypeMerging

The Binaryen optimizer has many passes that do type-based inference. For example, if a type's field is only ever written a single value, then we can infer the result in all reads of that field in the entire program (which is the case for things like vtables in J2Wasm, for example). Such type-based optimization gets more powerful the more refined the type information is: it is better to use different types as much as possible rather than reusing the same type in many contexts. Binaryen has two optimizations that can help here: TypeSSA which "splits" types, defining a new type at each struct.new basically, and TypeMerging which "coalesces" types, finding different types that do not actually need to be different and then folding them together.

The general idea is that you want to split types as much as possible, then run optimizations, and then merge them at the end. The merge is useful because unnecessarily split types add size to the binary, but it is important to do it at the end because after the merge the optimizer can do fewer things.

For example, you can try this:

wasm-opt --type-ssa -O3 -O3 --type-merging -O3

That is, split types, then optimize twice, then merge, then optimize again. (The last optimization pass can sometimes do a little more work after merging: once types are merged, functions that now have identical types can sometimes be merged too.)

Alternatively, if your toolchain already emits very refined types (already using a new type in every location that makes sense) then you can omit the --type-ssa pass.

As with earlier suggestions, it is a good idea to experiment with various options in terms of the order and number of optimization cycles that you run.

  • TypeSSA helps GUFA, as it provides more types for GUFA to reason about, so you may want to run GUFA between TypeSSA and TypeMerging.
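Putting the pieces of this section together, a pipeline with GUFA between the split and the merge could be assembled as below. This is a sketch of one possible ordering, not a recommended default: type_pipeline is a hypothetical helper name, and the number of -O3 rounds is an arbitrary starting point for experimentation.

```shell
# Print a split / optimize / GUFA / merge pipeline.
# Pass "split" to start with --type-ssa, or anything else to skip it
# (for toolchains that already emit very refined types).
# type_pipeline is a hypothetical helper, not part of Binaryen.
type_pipeline() {
  mode="$1"
  # --closed-world is included because GUFA requires it.
  flags="--closed-world"
  if [ "$mode" = "split" ]; then
    flags="$flags --type-ssa"
  fi
  flags="$flags -O3 --gufa -O3 --type-merging -O3"
  printf '%s\n' "$flags"
}
```

For example, type_pipeline split prints --closed-world --type-ssa -O3 --gufa -O3 --type-merging -O3, while type_pipeline nosplit omits --type-ssa.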

Type Finalizing

You can make Binaryen recompute the final state of types (that is, whether the type is final or not - a final type cannot have subtypes) using --type-unfinalizing --type-finalizing. The first of those makes all (private) types non-final ("open"), and the second makes all (private) types final when we can, that is, when they have no subtypes.

Sometimes it can also be useful to do optimization work in between, i.e.

--type-unfinalizing -O3 --type-finalizing

Whether that helps or not will depend on your codebase: opening (unfinalizing) types can make some types identical that were not before, which can have downsides.

Regardless of that, it is always useful to do --type-finalizing at the very end of your optimization pipeline, as that will mark as many things final as possible, which can improve runtime speed. (wasm-opt does not finalize itself, because it doesn't know where the end of your pipeline is - you might intend to do more work later. So you must manually apply this pass yourself.)

One particular place where unfinalizing can help is before TypeSSA. TypeSSA will not create subtypes of final types (since that is not allowed), so first making as many types non-final as possible allows more work to be done.
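The advice in this section (unfinalize first so that TypeSSA can do more, and always finalize at the very end) can be captured by wrapping whatever pipeline you have settled on. The helper below is a minimal sketch; with_type_finalizing is a hypothetical name, not something Binaryen provides.

```shell
# Wrap a pipeline so that it starts with --type-unfinalizing and ends
# with --type-finalizing.
# with_type_finalizing is a hypothetical helper, not part of Binaryen.
with_type_finalizing() {
  if [ "$#" -eq 0 ]; then
    echo "usage: with_type_finalizing PASS..." >&2
    return 1
  fi
  # "$*" joins all arguments with single spaces (default IFS).
  printf '%s\n' "--type-unfinalizing $* --type-finalizing"
}
```

For example, with_type_finalizing --type-ssa -O3 -O3 --type-merging -O3 prints --type-unfinalizing --type-ssa -O3 -O3 --type-merging -O3 --type-finalizing. Whether the leading --type-unfinalizing helps depends on your codebase, as noted above, so it is worth comparing against a pipeline that only adds the trailing --type-finalizing.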

TODO --cfp-reftest

TODO call.without.effects