Consistently optimize small added constants into load/store offsets #1924
I wonder whether this pass would make sense in scenarios where the low unused memory region is significantly smaller. In AssemblyScript it's just 8 bytes, for example, essentially only leaving some space for null / asm.js reinterpretations.
Yeah, we may want to make the size of the region customizable; 1024 is just one possibility. For AssemblyScript in particular, though, perhaps you don't need the pass: do you already use load/store offsets in all possible places, or are there places Binaryen could optimize for you?
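A minimal sketch of why the size of that region matters, assuming wasm32 semantics: an explicit `i32.add` wraps modulo 2^32, while a load/store adds its immediate offset to the base address without wrapping. The two forms therefore disagree only when `base + c` wraps, and in that case the wrapped address is below `c`. If `c` is smaller than a region the program promises never to touch, the unfolded access would have landed in that region anyway. All function names here are hypothetical, not Binaryen's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Address computed by an explicit i32.add: wraps modulo 2^32
   (unsigned overflow is well defined in C). */
uint32_t wrapped_address(uint32_t base, uint32_t c) {
    return base + c;
}

/* Address computed when c is folded into the load/store offset:
   wasm adds base and offset without wrapping, so widen to 64 bits. */
uint64_t folded_address(uint32_t base, uint32_t c) {
    return (uint64_t)base + c;
}

/* The two forms disagree exactly when base + c wraps. The wrapped result
   is then base + c - 2^32, which is below c because base < 2^32. */
bool folding_changes_address(uint32_t base, uint32_t c) {
    return (uint64_t)wrapped_address(base, c) != folded_address(base, c);
}

/* Hypothetical rule: only fold c when it is smaller than the unused
   low-memory region, so any wrapped address would fall inside memory the
   program never legitimately accesses. */
bool fold_is_safe(uint32_t c, uint32_t low_memory_size) {
    return c < low_memory_size;
}
```

With a 1024-byte region a constant such as 16 passes `fold_is_safe`; with an 8-byte region like the AssemblyScript case, only constants below 8 would.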
@kripken We've been waiting for this for a long time =)
It's using immediate offsets where straightforward, for example when accessing class instance fields, but there are some cases where it isn't that easy, for example indexed array accesses. These are actual operator-overloaded functions, so constant offset components become mutable locals, in turn losing "precomputability". I can also imagine that, after other optimizations, even our "proper" loads could be condensed further.
Interesting. Is there ever a case where it is not valid to use a load/store offset in AssemblyScript? In C, the problem is that a program may assume that pointer arithmetic can overflow and wrap around, and that this is valid (!). But another language may just say that's undefined behavior, and that it is always safe to do this.
In the case of AssemblyScript's standard library that's always undefined behaviour, yeah. A user who's using …
Are you sure? That seems like it would be undefined behavior to me. |
Section 6.5.6, paragraph 8 of the C17 spec says: …
So I think this is indeed undefined behavior. However, we might still want to support code that depends on it?
Yeah, this may well be undefined behavior, but we've seen real-world code that depends on it, weirdly. I'm not sure it's undefined by that paragraph, though. What I think happens is this: if you do the add as an unsigned integer and overflow it, then you get 1000 as expected. If instead the optimizer did … Is that undefined behavior in C? Is it not valid to do arbitrary intermediate math like that after converting a pointer to an integer, and before converting it back?
Still curious about that UB C question, but merging this PR. |
Section 6.3.2.3, paragraphs 5 and 6 say: …
Unfortunately I can't find anything that the "except as previously specified" phrase might refer to, but it is promising that the result can explicitly be a trap representation.
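A small sketch of the pattern being asked about: convert a pointer to an integer, do wrapping unsigned math on it, and convert back. The unsigned arithmetic itself is well defined in C; it is the final integer-to-pointer conversion that the standard leaves implementation-defined. On flat-address-space targets such as wasm32 it typically round-trips as in the test below, but that is an assumption about the implementation, not a guarantee. `offset_via_integer` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer -> integer -> arbitrary (possibly wrapping) unsigned math ->
   pointer. The additions and subtractions wrap modulo 2^N, which is well
   defined for unsigned types; the cast back to a pointer is
   implementation-defined (C17 6.3.2.3). */
char *offset_via_integer(char *p, uintptr_t add, uintptr_t sub) {
    uintptr_t u = (uintptr_t)p;
    u += add; /* may wrap past UINTPTR_MAX */
    u -= sub; /* ... and wrap back */
    return (char *)u;
}
```

Adding `UINTPTR_MAX` and then subtracting `UINTPTR_MAX - 3` wraps twice but nets out to `+3`, which is exactly the kind of intermediate overflow that folding a constant into a non-wrapping load offset would not reproduce.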
After WebAssembly/binaryen#1924, Binaryen has a `--low-memory-unused` flag that we should use (and the `--post-emscripten` pass no longer has any part relevant to `GLOBAL_BASE`; that's all in the new flag). Better optimization of offsets thanks to that PR improves code size and enables a little more inlining, etc., improving the metadce stats.
See #1919 - we did not do this consistently before.
This adds a `lowMemoryUnused` option to `PassOptions`. It can be passed on the command line with `--low-memory-unused`. If enabled, we run the new `optimize-added-constants` pass, which does the real work here, replacing older code in `post-emscripten`.
Aside from running at the proper time (unlike the old pass, see #1919), this also has a `-propagate` mode, which can do stuff like this: …
That is, it can propagate such offsets to the loads/stores. This pattern is common in big interpreter loops, where the pointers are offsets into a big struct of state.
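A minimal sketch of the folding, including a `-propagate`-style step, on a deliberately tiny hypothetical IR (not Binaryen's data structures). It assumes some earlier analysis has recorded, for each local in SSA form, the value its single dominating set assigns, and which locals hold values that cannot change between definition and use. Keeping the total offset below the assumed unused low-memory size is one way to state the safety condition; Binaryen's exact rule may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOCALS 8

/* A deliberately tiny, hypothetical expression IR. */
typedef enum { EXPR_CONST, EXPR_LOCAL_GET, EXPR_ADD } ExprKind;

typedef struct Expr {
    ExprKind kind;
    uint32_t value;            /* EXPR_CONST: the constant */
    uint32_t local;            /* EXPR_LOCAL_GET: the local index */
    struct Expr *left, *right; /* EXPR_ADD: the operands */
} Expr;

typedef struct {
    Expr *ptr;       /* address operand */
    uint32_t offset; /* immediate offset */
} Load;

/* Assumed output of an earlier SSA analysis (hypothetical). */
typedef struct {
    Expr *ssa_value[MAX_LOCALS]; /* value of the single dominating set, or NULL */
    int stable[MAX_LOCALS];      /* nonzero if the value never changes after definition */
} LocalInfo;

/* If e is `x + C` (constant on either side), return x and store C in *c. */
static Expr *match_add_const(Expr *e, uint32_t *c) {
    if (e->kind != EXPR_ADD) return NULL;
    if (e->right->kind == EXPR_CONST) { *c = e->right->value; return e->left; }
    if (e->left->kind == EXPR_CONST) { *c = e->left->value; return e->right; }
    return NULL;
}

/* Can this expression be moved from a local.set down to a later load? */
static int is_stable(const Expr *e, const LocalInfo *info) {
    if (e->kind == EXPR_CONST) return 1;
    return e->kind == EXPR_LOCAL_GET && e->local < MAX_LOCALS &&
           info->stable[e->local];
}

/* Fold added constants into load->offset while the total stays below
   low_memory_size. With propagate, also look through an SSA local whose
   single set is `x + C`, as long as x is stable. A real IR would copy x
   rather than share the node. */
void optimize_added_constants(Load *load, const LocalInfo *info,
                              uint32_t low_memory_size, int propagate) {
    for (;;) {
        uint32_t c = 0;
        Expr *rest = match_add_const(load->ptr, &c);
        if (!rest && propagate && info && load->ptr->kind == EXPR_LOCAL_GET &&
            load->ptr->local < MAX_LOCALS && info->ssa_value[load->ptr->local]) {
            rest = match_add_const(info->ssa_value[load->ptr->local], &c);
            if (rest && !is_stable(rest, info)) rest = NULL;
        }
        if (!rest) return;
        uint64_t total = (uint64_t)load->offset + c;
        if (total >= low_memory_size) return; /* would leave the safe range */
        load->offset = (uint32_t)total;
        load->ptr = rest;
    }
}
```

In this sketch, a load whose pointer is a local holding `base + 16` ends up as a load from `base` with offset 16, which mirrors the interpreter-loop pattern the PR describes.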
The pass does this propagation by using a new feature of LocalGraph, which can identify which locals are in SSA form. Binaryen IR is not SSA (intentionally, since it's a later IR), but if a local has only a single set for all of its gets, that local is effectively in SSA form and can be optimized. The tricky thing is that all locals are initialized to zero, so there are at least two sets: the implicit zero initialization and the real one. But if we verify that the real set dominates all the gets, then the zero initialization cannot reach them, and we are safe.
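A minimal sketch of the SSA test described above, assuming a reaching-definitions analysis has already computed, for every get of one local, which sets can reach it, with a hypothetical `IMPLICIT_ZERO` marker standing for the zero initialization. If the single real set is the only thing reaching every get, the zero initialization reaches none of them. This is a simplified model, not LocalGraph's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker for the implicit zero value every local starts with.
   Real sets of the local are numbered 0, 1, 2, ... */
#define IMPLICIT_ZERO (-1)
#define MAX_REACHING 4

typedef struct {
    int reaching[MAX_REACHING]; /* ids of the sets that can reach this get */
    size_t count;               /* how many entries of reaching[] are used */
} Get;

/* A local counts as SSA when it has exactly one real set and that set is
   the only definition reaching each get, i.e. the implicit zero
   initialization reaches no get. */
bool local_is_ssa(size_t num_sets, const Get *gets, size_t num_gets) {
    if (num_sets != 1) return false;
    for (size_t i = 0; i < num_gets; i++) {
        if (gets[i].count != 1) return false;
        if (gets[i].reaching[0] == IMPLICIT_ZERO) return false;
    }
    return true;
}
```

A get reached by both the real set and the zero initialization (for example, one that sits on a path that skips the set) makes the local non-SSA, which is exactly the case the dominance check rules out.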
This PR also makes `safe-heap` aware of `lowMemoryUnused`. If it is set, we check not just for an access of `0`, but for the whole range `0-1023`.
This makes zlib 5% faster, with either the wasm backend or asm2wasm. It also makes it 0.5% smaller. It also helps sqlite (1.5% faster) and lua (1% faster).
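A rough sketch of the `safe-heap` difference, written as a hypothetical stand-alone predicate rather than Binaryen's actual instrumentation: without `lowMemoryUnused` only an access at address 0 is flagged, and with it any access starting in `0-1023` is. The end-of-memory check and the `memory_size` parameter are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOW_MEMORY_BOUND 1024u /* size of the region assumed unused */

/* Hypothetical SAFE_HEAP-style check for an access of `bytes` bytes at
   ptr + offset (computed without wrapping, as wasm does). */
bool access_is_invalid(uint32_t ptr, uint32_t offset, uint32_t bytes,
                       uint64_t memory_size, bool low_memory_unused) {
    uint64_t addr = (uint64_t)ptr + offset;
    /* With lowMemoryUnused, any start address in [0, 1024) is an error;
       otherwise only a null access is. */
    if (low_memory_unused ? addr < LOW_MEMORY_BOUND : addr == 0) return true;
    /* Illustrative assumption: also reject accesses past the end of memory. */
    return addr + bytes > memory_size;
}
```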