Precompute: Rewrite logic for handling children to carefully decide which to keep #7863
Conversation
```cpp
    return flow;
  }
  heapValuesMap[curr] = flow.getSingleValue().getGCData();
  heapValues.map[curr] =
```
Can use `insert` with a placeholder null data in place of the original `heapValues.map.find` to avoid the second lookup here.
Hmm, I see what you mean, but this code feels complex enough to me that I'd rather not add that? That is, I feel the current code is easier to read, even if slightly less efficient.
I don't think it would be more complicated! It even gives you a nice `inserted` bool to check rather than comparing against `heapValues.map.end()`.
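As a standalone illustration of the two patterns under discussion (toy names, not Binaryen's actual `heapValues.map`), the difference looks like this in C++:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Pattern 1: find, then write on a miss (two hash lookups when the key is new).
const char* getOrComputeTwoLookups(std::unordered_map<int, const char*>& map,
                                   int key) {
  auto it = map.find(key); // first lookup
  if (it != map.end()) {
    return it->second;
  }
  map[key] = "computed"; // second lookup on a miss
  return map[key];
}

// Pattern 2: insert a placeholder up front; the returned `inserted` bool says
// whether we still need to compute (a single hash lookup overall).
const char* getOrComputeOneLookup(std::unordered_map<int, const char*>& map,
                                  int key) {
  auto [it, inserted] = map.insert({key, nullptr});
  if (inserted) {
    it->second = "computed"; // fill in the placeholder
  }
  return it->second;
}
```

`std::unordered_map::insert` returns a `std::pair<iterator, bool>`, which is what provides the `inserted` bool mentioned above.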
Difference of opinion I guess 😄 To me, starting with "insert into the map, just a placeholder we'll fill in later" feels odd, when mentally the situation at the start is "ok, let's see if this is already in the map".
Can I interest you in porting these tests to lit while we're touching them?
I can do it as a followup to this PR.
Co-authored-by: Thomas Lively <tlively123@gmail.com>
LGTM, although I still think it would be nice to use `insert()`. We try to use it where possible elsewhere.
Fuzzed for 100K iterations overnight, looks good. Landing. Let's talk offline about the insert issue - I'm not strongly opposed but I would like to understand better why you favor it.
The main changes here are: rather than sometimes using the cache and sometimes not, handle children in a single manner, while considering the effects carefully and deciding which children to keep.
But really, this is a rewrite of that core logic from scratch in a cleaner and less hackish way, while fixing issues with the dual cache and even the earlier single cache that it fixed: having two caches opened us up to bugs, because it turns out we might actually cache an object in the propagate phase, and use it in the main phase, and each phase used a different cache. Now both phases do the same thing, so there is no risk.
This also avoids state in the `PrecomputingExpressionRunner`, a source of bugs with all previous caches. I realized that the solution here is simple: note when there are effects, and if so, just compute them. This is fine because the quadratic case happens in global objects, which have no effects anyhow (and even inside functions such effects are rare). And, after computing the effects, we use the single cached heap location, keeping identity stable (a key fix here).
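A minimal sketch of the identity-stability point above (toy types, hypothetical names, far simpler than Binaryen's real `GCData` or cache): the heap data for a given expression is allocated once, and every later query returns that same object, so GC identity comparisons stay consistent across evaluations.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Toy stand-in for GC heap data.
struct GCData {
  int field;
};

struct HeapValueCache {
  // Keyed by expression address. The data for an expression is allocated
  // exactly once; every later query returns that same object, so identity
  // (pointer equality) is stable across evaluations.
  std::unordered_map<const void*, std::shared_ptr<GCData>> map;

  std::shared_ptr<GCData> get(const void* expr) {
    auto& entry = map[expr];
    if (!entry) {
      entry = std::make_shared<GCData>(GCData{42}); // compute only once
    }
    return entry; // same object on every later call
  }
};
```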
Then, the main `visitExpression` is straightforward: compute in the most general manner, NOT trying to replace the entire expression (which would require it to have no side effects) but allowing effects, and looking at the children afterwards to see which are actually needed. This is necessary to avoid a regression in this PR, but it actually ends up as a progression, since we can handle more cases, like `(ref.eq (tee) (get))`. Before, the tee would stop us, and propagation doesn't handle this if the value isn't written to a local, but now we can just compute it and keep the tee around.
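The child-handling step above can be sketched with a toy expression type (hypothetical, not Binaryen's real IR): when a parent evaluates to a constant, keep exactly the children that have side effects, append the constant, and always drop the parent and the effect-free children.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy expression node (hypothetical; Binaryen's real IR differs).
struct Expr {
  std::string name;
  bool hasEffects = false;
};

// When a parent expression evaluates to `constant`, keep only the children
// with side effects, then the constant. The parent node itself is always
// removed, which is also what keeps any code size increase bounded.
std::vector<Expr> replaceWithConstant(const std::vector<Expr>& children,
                                      const Expr& constant) {
  std::vector<Expr> out;
  for (const auto& child : children) {
    if (child.hasEffects) {
      out.push_back(child); // e.g. a local.tee whose write we must preserve
    }
  }
  out.push_back(constant); // the precomputed result
  return out;
}
```

For `(ref.eq (tee) (get))`, the tee (which writes a local) survives alongside the constant result, while the effect-free get disappears together with the ref.eq itself.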
Also, I figured out how to avoid the monotonically increasing code size problem, which e.g. GUFA has: you see expression A, figure out it evaluates to constant C, but A has effects you must keep, so you emit `(A, C)`. That lets the constant get optimized, but if you run twice you can add C twice, unless you carefully look at the parent, which is annoying. Here that is avoided: while we may add such a constant, regressing size, we still make progress because we remove the main expression itself. We may keep some children, but never the parent, so the increase is bounded.
This improves not just GC code:
Emscripten size diff
```diff
diff --git a/test/code_size/test_codesize_hello_O2.json b/test/code_size/test_codesize_hello_O2.json
index 2ddadfd5c7..352c0831ec 100644
--- a/test/code_size/test_codesize_hello_O2.json
+++ b/test/code_size/test_codesize_hello_O2.json
@@ -1,17 +1,17 @@
 {
   "a.out.js": 4369,
   "a.out.js.gz": 2146,
-  "a.out.nodebug.wasm": 1979,
-  "a.out.nodebug.wasm.gz": 1157,
-  "total": 6348,
-  "total_gz": 3303,
+  "a.out.nodebug.wasm": 1927,
+  "a.out.nodebug.wasm.gz": 1138,
+  "total": 6296,
+  "total_gz": 3284,
   "sent": [
     "fd_write"
   ],
   "imports": [
     "wasi_snapshot_preview1.fd_write"
   ],
   "exports": [
     "__indirect_function_table",
     "__wasm_call_ctors",
     "_emscripten_stack_alloc",
diff --git a/test/code_size/test_codesize_hello_O3.json b/test/code_size/test_codesize_hello_O3.json
index 1e497bc59f..892da4c3d5 100644
--- a/test/code_size/test_codesize_hello_O3.json
+++ b/test/code_size/test_codesize_hello_O3.json
@@ -1,17 +1,17 @@
 {
   "a.out.js": 4311,
   "a.out.js.gz": 2104,
-  "a.out.nodebug.wasm": 1733,
-  "a.out.nodebug.wasm.gz": 980,
-  "total": 6044,
-  "total_gz": 3084,
+  "a.out.nodebug.wasm": 1681,
+  "a.out.nodebug.wasm.gz": 960,
+  "total": 5992,
+  "total_gz": 3064,
   "sent": [
     "a (fd_write)"
   ],
   "imports": [
     "a (fd_write)"
   ],
   "exports": [
     "b (memory)",
     "c (__wasm_call_ctors)",
     "d (main)"
diff --git a/test/code_size/test_codesize_hello_Os.json b/test/code_size/test_codesize_hello_Os.json
index 128a92afd1..0a660aeb20 100644
--- a/test/code_size/test_codesize_hello_Os.json
+++ b/test/code_size/test_codesize_hello_Os.json
@@ -1,17 +1,17 @@
 {
   "a.out.js": 4311,
   "a.out.js.gz": 2104,
-  "a.out.nodebug.wasm": 1723,
-  "a.out.nodebug.wasm.gz": 985,
-  "total": 6034,
-  "total_gz": 3089,
+  "a.out.nodebug.wasm": 1671,
+  "a.out.nodebug.wasm.gz": 964,
+  "total": 5982,
+  "total_gz": 3068,
   "sent": [
     "a (fd_write)"
   ],
   "imports": [
     "a (fd_write)"
   ],
   "exports": [
     "b (memory)",
     "c (__wasm_call_ctors)",
     "d (main)"
diff --git a/test/code_size/test_codesize_hello_Oz.json b/test/code_size/test_codesize_hello_Oz.json
index d38e027f7a..593e603ff0 100644
--- a/test/code_size/test_codesize_hello_Oz.json
+++ b/test/code_size/test_codesize_hello_Oz.json
@@ -1,17 +1,17 @@
 {
   "a.out.js": 3930,
   "a.out.js.gz": 1905,
-  "a.out.nodebug.wasm": 1257,
-  "a.out.nodebug.wasm.gz": 763,
-  "total": 5187,
-  "total_gz": 2668,
+  "a.out.nodebug.wasm": 1205,
+  "a.out.nodebug.wasm.gz": 740,
+  "total": 5135,
+  "total_gz": 2645,
   "sent": [
     "a (fd_write)"
   ],
   "imports": [
     "a (fd_write)"
   ],
   "exports": [
     "b (memory)",
     "c (__wasm_call_ctors)",
     "d (main)"
```
There are also some minor theoretical regressions, as a few tests show, but those are things other passes handle better (like `(return (return ..))`), so they only happen when running the pass by itself (production code using the full pipeline should only get better).
This is also a slight improvement in compile times.