Transition to regalloc2 #3942

Closed
5 tasks done
cfallin opened this issue Mar 17, 2022 · 3 comments · Fixed by #3989
Labels

  • cranelift:area:regalloc: Issues related to register allocation.
  • cranelift: Issues related to the Cranelift code generator

Comments

Member

cfallin commented Mar 17, 2022

This issue is meant to track the status of migrating Cranelift to use regalloc2, our new register allocator. We started this work a while ago, and as detailed in our 2022 roadmap, we plan to finish the migration this year.

The major tasks remaining are:

  • Develop regalloc2 as a standalone project
  • Get a second reviewer to triple-check the symbolic checker we've been using to fuzz regalloc2, since regalloc correctness is essential for correctness/security of all layers above it (@fitzgen is currently looking this over)
  • Integrate any remaining regalloc2 tweaks/improvements (I have a few queued up, mostly API-related; this is my running branch as I bring it up with Cranelift.)
  • Release regalloc2 crate on crates.io (done)
  • Merge support for regalloc2 into Cranelift

The last task has been under development for the past 2.5 weeks or so. I'll make my private branch public shortly, after a bit of cleanup. Its current status is that it is fully functional (passes tests, runs benchmarks) on x86-64. There is work to do to move the other two backends over (aarch64, s390x) and I will do this before we merge. (I might not be able to do this before Mon Mar 28; I'm out-of-office and offline all of next week unfortunately, but wanted to get these results out first!)

The nature of the changes to Cranelift is such that we do have to do the transition atomically and remove regalloc.rs support at the same time; the whole MachInst infrastructure is basically built up around the regalloc abstractions, so swapping it out has a large effect. Fortunately, though, I think there is not too much of a downside (aside from the usual code-churn risk, which we mitigate with ongoing fuzzing and careful review) -- performance numbers look good.

Here is a current snapshot of some benchmark results:

Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   25% faster                  28% faster
blake3-simd     no diff                     no diff
meshoptimizer   19% faster                  17% faster
pulldown-cmark  17% faster                  no diff
bz2             15% faster                  no diff
SpiderMonkey,   21% faster                  2% faster
  fib(30)
clang.wasm      42% faster                  N/A

with full details here:

Benchmark methodology and raw output
As percentage improvement over baseline (old):

Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   25% faster                  28% faster
blake3-simd     no diff                     no diff
meshoptimizer   19% faster                  17% faster
pulldown-cmark  17% faster                  no diff
bz2             15% faster                  no diff
SpiderMonkey,   21% faster                  2% faster
  fib(30)
clang.wasm      42% faster                  N/A

As ratios (percent improvement above = 100% * (1 - 1/speedup_ratio))

Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   1.34x faster                1.38x faster
blake3-simd     no diff                     no diff
meshoptimizer   1.24x faster                1.21x faster
pulldown-cmark  1.21x faster                no diff
bz2             1.18x faster                no diff
SpiderMonkey,   1.26x faster                1.02x faster
  fib(30)
clang.wasm      1.71x faster                N/A
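
As a quick cross-check of the conversion formula quoted above, here is a minimal Python sketch that turns the compilation speedup ratios into percentages. The function name `percent_improvement` is made up for illustration; the ratios are copied from the table.

```python
def percent_improvement(speedup_ratio: float) -> float:
    """Percent reduction in wall-clock time for a given speedup ratio.

    Implements: 100% * (1 - 1/speedup_ratio)
    """
    return 100.0 * (1.0 - 1.0 / speedup_ratio)


# Compilation speedup ratios copied from the ratios table.
compile_speedups = {
    "blake3-scalar": 1.34,
    "meshoptimizer": 1.24,
    "pulldown-cmark": 1.21,
    "bz2": 1.18,
    "SpiderMonkey": 1.26,
    "clang.wasm": 1.71,
}

for name, ratio in compile_speedups.items():
    pct = percent_improvement(ratio)
    print(f"{name:<16}{ratio:.2f}x -> {pct:.0f}% faster")
```

Rounded to whole percents, this gives 25, 19, 17, 15, 21 and 42, which agrees with the percentage table.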

Methodology:

  • Sightglass with --processes 2 --iterations-per-process 5.
  • Last two benchmarks (SpiderMonkey fib(30), clang.wasm) run by hand with the wasmtime command line:
    • rm -r ~/.cache/wasmtime
    • invoke wasmtime run once to ensure the module is compiled
    • measure runtime 5x, take best of five
    • measure compile time with wasmtime compile 5x, take best of five
    • clang.wasm doesn't have a test harness, so it is compile-only
  • Testing on 12-core / 24-thread Ryzen 3900X, Linux/x86-64
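
The by-hand steps above could be scripted roughly as follows. This is only a sketch of the procedure as described, not the script that produced the numbers: `best_of` and `measure` are hypothetical helpers, and it assumes a `wasmtime` binary on `PATH` and a module that needs no extra arguments.

```python
import shutil
import subprocess
import time
from pathlib import Path


def best_of(cmd, runs=5):
    """Run `cmd` `runs` times; return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return min(timings)


def measure(module, wasmtime="wasmtime"):
    """Hypothetical helper mirroring the manual steps listed above."""
    # Start from a cold cache (path taken from the methodology).
    shutil.rmtree(Path.home() / ".cache" / "wasmtime", ignore_errors=True)
    # Run once so the module is compiled before execution is timed.
    subprocess.run([wasmtime, "run", module], check=True)
    run_seconds = best_of([wasmtime, "run", module])
    compile_seconds = best_of([wasmtime, "compile", module])
    return compile_seconds, run_seconds
```

For a compile-only module such as clang.wasm, only the `best_of([wasmtime, "compile", module])` call applies. The SpiderMonkey run also needs a directory capability and a JS file passed in, which this sketch does not show.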

Comparing baseline of Wasmtime fdf063df98ad3839b0e0b78ea55b53b1a296abb0 (from
Mar 16) against my internal regalloc2 branch
9b89942cf62d262ee9ac3e7eab525ea8544a458b (from Mar 17) which last synced with
Wasmtime at eb1b71e31c035ff4250c5013ca0268deb931aa7c (from Feb 24).

Raw output of Sightglass below (instantiation excluded, not interesting).


compilation :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm

Δ = 121531866.00 ± 51042761.18 (confidence = 99%)

new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so!

[478052996 501410277.40 591983000] new.so
[604955098 622942143.40 709527450] old.so

compilation :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm

Δ = 31981472.00 ± 13432120.92 (confidence = 99%)

new.so is 1.14x to 1.34x faster than old.so!
old.so is 0.72x to 0.89x faster than new.so!

[125802142 131948268.40 155782325] new.so
[159196645 163929740.40 186715328] old.so

execution :: nanoseconds :: benchmarks-next/blake3-scalar/benchmark.wasm

Δ = 36931.50 ± 3272.72 (confidence = 99%)

new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!

[105358 106660.00 110728] new.so
[140608 143591.50 149787] old.so

execution :: cycles :: benchmarks-next/blake3-scalar/benchmark.wasm

Δ = 140341.60 ± 12437.21 (confidence = 99%)

new.so is 1.32x to 1.38x faster than old.so!
old.so is 0.72x to 0.77x faster than new.so!

[400368 405315.60 420774] new.so
[534318 545657.20 569202] old.so


compilation :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm

No difference in performance.

[112727304 139448014.80 189082604] new.so
[123143218 156732493.40 233512432] old.so

compilation :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm

No difference in performance.

[29664800 36696541.20 49758219] new.so
[32405712 41244760.40 61449541] old.so

execution :: cycles :: benchmarks-next/blake3-simd/benchmark.wasm

No difference in performance.

[400672 739521.80 1042226] new.so
[498142 828791.40 1160786] old.so

execution :: nanoseconds :: benchmarks-next/blake3-simd/benchmark.wasm

No difference in performance.

[105439 194609.20 274267] new.so
[131088 218099.20 305464] old.so


compilation :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm

Δ = 483775336.20 ± 24646158.96 (confidence = 99%)

new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!

[2090515508 2113482784.00 2150210240] new.so
[2554359582 2597258120.20 2630111328] old.so

compilation :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm

Δ = 127275628.40 ± 6480546.57 (confidence = 99%)

new.so is 1.22x to 1.24x faster than old.so!
old.so is 0.80x to 0.82x faster than new.so!

[550127669 556172437.60 565836581] new.so
[672188482 683448066.00 692063546] old.so

execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm

Δ = 3386913742.00 ± 454568778.61 (confidence = 99%)

new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!

[17786842514 17978520795.40 18352029814] new.so
[20863697992 21365434537.40 22139271504] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm

Δ = 891020039.40 ± 119694835.02 (confidence = 99%)

new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!

[4680694128 4731128047.40 4829411387] new.so
[5489883512 5622148086.80 5826025212] old.so


compilation :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm

Δ = 213252595.20 ± 29303757.92 (confidence = 99%)

new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!

[1120180378 1148350389.80 1203069094] new.so
[1340768136 1361602985.00 1397014596] old.so

compilation :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm

Δ = 56118120.00 ± 7711578.76 (confidence = 99%)

new.so is 1.16x to 1.21x faster than old.so!
old.so is 0.82x to 0.86x faster than new.so!

[294780634 302193792.40 316593182] new.so
[352828441 358311912.40 367631343] old.so

execution :: cycles :: benchmarks-next/pulldown-cmark/benchmark.wasm

No difference in performance.

[8257780 8443755.80 8560944] new.so
[8455570 9495162.60 17648568] old.so

execution :: nanoseconds :: benchmarks-next/pulldown-cmark/benchmark.wasm

No difference in performance.

[2173072 2222013.50 2252853] new.so
[2225116 2498693.60 4644290] old.so


compilation :: cycles :: benchmarks-next/bz2/benchmark.wasm

Δ = 58684068.80 ± 36909440.37 (confidence = 99%)

new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!

[498967588 545831464.20 586460840] new.so
[540660276 604515533.00 635005118] old.so

compilation :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm

Δ = 15436153.00 ± 9714229.01 (confidence = 99%)

new.so is 1.04x to 1.18x faster than old.so!
old.so is 0.84x to 0.96x faster than new.so!

[131305387 143637939.40 154329874] new.so
[142264400 159074092.40 167089438] old.so

execution :: nanoseconds :: benchmarks-next/bz2/benchmark.wasm

No difference in performance.

[25932760 35978222.50 53794238] new.so
[28960083 29737468.90 35137211] old.so

execution :: cycles :: benchmarks-next/bz2/benchmark.wasm

No difference in performance.

[98545894 136719075.20 204420658] new.so
[110059628 113008690.20 133522880] old.so
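
The "1.14x to 1.34x faster" ranges in the raw output above can be reproduced from the printed Δ, its confidence interval, and the mean (the middle number in each bracketed triple). The sketch below is inferred from the printed figures, not taken from the Sightglass source, and `speedup_range` is a made-up name. Note that the ratios table earlier quotes the upper end of each range.

```python
def speedup_range(delta, ci, new_mean, old_mean):
    """Reconstruct the two ratio ranges from a delta and its confidence interval.

    Assumption: this matches the printed numbers, but it is not taken
    from Sightglass itself.
    """
    low, high = delta - ci, delta + ci
    # How much faster new.so is, relative to new.so's own mean.
    new_faster = (1 + low / new_mean, 1 + high / new_mean)
    # The same gap seen from old.so's side; values below 1 mean "slower".
    old_faster = (1 - high / old_mean, 1 - low / old_mean)
    return new_faster, old_faster


# compilation :: cycles :: blake3-scalar, numbers copied from the output above.
new_faster, old_faster = speedup_range(
    delta=121531866.00,
    ci=51042761.18,
    new_mean=501410277.40,
    old_mean=622942143.40,
)
print("new.so is %.2fx to %.2fx faster than old.so" % new_faster)
# -> new.so is 1.14x to 1.34x faster than old.so
print("old.so is %.2fx to %.2fx faster than new.so" % old_faster)
# -> old.so is 0.72x to 0.89x faster than new.so
```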

Collaborator

abrown commented Mar 17, 2022

(Can we add spidermonkey.wasm and clang.wasm to Sightglass?)

Member Author

cfallin commented Mar 17, 2022

> (Can we add spidermonkey.wasm and clang.wasm to Sightglass?)

We could perhaps, yeah, with some hackery (building a toplevel harness mostly). In the SpiderMonkey case we need to add a WASI directory capability and feed in a JS file, and in the clang case we need a way to tell the infra that it's compile-only (I don't know how to run it). For now it's not too bad to run by hand though :-)

Member Author

cfallin commented Mar 17, 2022

A little more benchmarking -- taking most of the modules from #911 and compiling with baseline and regalloc2:

Wasm (SHA256 of module) from #911         baseline compile (s)  regalloc2 compile (s)
0ddff0dac47311846e831cb25df5ec5fcb7c59a4  1.201                 0.262
256e0360aa2774d6ad1bb5589030b7a944a81c5d  0.680                 0.671
28276a409e576044bea8cdc46068426484bf7b06  0.035                 0.039
2e746b5b07c0a022415d6c1527815af44daae33e  0.006                 0.006
4286371e64c07f853a5d4de482d658f3c7f2c711  0.137                 0.365
6ccd889e8a97b9adb2697f9f60477e511ad50be4  0.721                 0.329
9850b3172ddb705be8caa06599cb92ead3cd251c  0.509                 0.645
bdb6099c0073360613f17cc9a7d2380d50f8eb9e  2.725                 0.061
bf8490f3bd1f3350a0d4a83670bb1d3d017cf8ef  0.074                 0.283
cb46921624763cf50eb826585d224bb3975a4234  0.693                 0.035
d31a6a6de65a08096dc855a17f49499114826a3e  0.057                 0.284
d51589b35a521c29420fc140b292383f2ca5fd70  3.180                 0.617
dfafaa30ecd41ab9bece126eec8129b42925a4dd  1.367                 1.011

In most cases (7 of the 13 modules, including all of the slowest-compiling ones) things got faster, sometimes significantly so (3.18s -> 0.62s, 1.2s -> 0.26s, 2.7s -> 0.061s (!)). This tracks with the bottlenecks I saw in earlier profiling and with the effort in regalloc2 to avoid quadratic explosions and nonlinear behavior in general as far as possible. Some of the smaller modules see increases (0.137s -> 0.365s, 0.057s -> 0.284s); I haven't conclusively resolved what's going on in those, but it wouldn't surprise me if this comes from splitting heuristics being a little more aggressive. In any case, nothing immediately jumps out in the profile.
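
To make the per-module picture in the table above easier to scan, here is a small Python sketch that tallies it. The timings are copied from the table, keyed by the first 8 hex digits of each module hash; the variable names are illustrative only.

```python
# (baseline seconds, regalloc2 seconds), copied from the table above.
timings = {
    "0ddff0da": (1.201, 0.262),
    "256e0360": (0.680, 0.671),
    "28276a40": (0.035, 0.039),
    "2e746b5b": (0.006, 0.006),
    "4286371e": (0.137, 0.365),
    "6ccd889e": (0.721, 0.329),
    "9850b317": (0.509, 0.645),
    "bdb6099c": (2.725, 0.061),
    "bf8490f3": (0.074, 0.283),
    "cb469216": (0.693, 0.035),
    "d31a6a6d": (0.057, 0.284),
    "d51589b3": (3.180, 0.617),
    "dfafaa30": (1.367, 1.011),
}

faster = [m for m, (base, new) in timings.items() if new < base]
slower = [m for m, (base, new) in timings.items() if new > base]
total_base = sum(base for base, _ in timings.values())
total_new = sum(new for _, new in timings.values())

# Per-module ratio: above 1 means regalloc2 compiled the module faster.
for module, (base, new) in timings.items():
    print(f"{module}  {base:6.3f}s -> {new:6.3f}s  ({base / new:.2f}x)")

print(f"faster: {len(faster)}, slower: {len(slower)}")
print(f"total: {total_base:.3f}s -> {total_new:.3f}s")
```

The slower modules all have baseline compile times of about 0.5s or less, so the summed compile time across the set still drops.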

alexcrichton added the cranelift and cranelift:area:regalloc labels Mar 23, 2022
cfallin added a commit that referenced this issue Apr 14, 2022
This PR switches Cranelift over to the new register allocator, regalloc2.

See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.

Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:

```
Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   25% faster                  28% faster
blake3-simd     no diff                     no diff
meshoptimizer   19% faster                  17% faster
pulldown-cmark  17% faster                  no diff
bz2             15% faster                  no diff
SpiderMonkey,   21% faster                  2% faster
  fib(30)
clang.wasm      42% faster                  N/A
```