Cranelift wasm module compilation seems slower #6798

Closed
TimonPost opened this issue Aug 3, 2023 · 14 comments · Fixed by #6804
@TimonPost
Contributor

We have noticed a recent regression in module compilation times on the MBP M1/M2. Sometimes it takes over 100 seconds when it used to take less than 15.

[image: profiler screenshot showing time spent in cranelift]

Several non-scientific numbers have been reported on various platforms. At first, three individual Mac users started reporting extreme numbers, often exceeding 90 seconds, while Windows and Linux users report an average of around 50 seconds, with some reporting more normal numbers below 10 seconds. We have attempted to isolate the problem by manually clearing all caches and measuring how long it takes to recompile from scratch.

Upon further profiling, it appears that the performance bottleneck is related to the "cranelift" library (see image).

Any tips on how we can help find out the root cause here are appreciated. Are there perhaps benchmarks to run to see if regressions happened?

Some non-scientific data:

47s, 11th Gen Intel® Core™ i7-11800H, linux
15s, linux, i9-12900H (22s without cache)
6s, i9-13900k, windows (14s without cache)
8.2s amd 5950x, linux
45s 9th gen intel i7, windows
56s AMD Ryzen 9 6900HS, windows (36s without cache) 
90s Mac Pro M1/M2, macos

This was perhaps introduced in v10.0.0 or v10.0.1; we have not upgraded to 11 yet, and since we fixed another module compilation issue in 9.0.4 we had not experienced any problems until quite recently.

@TimonPost
Contributor Author

TimonPost commented Aug 3, 2023

Going to gather some more data and reopen if I have more.

@cfallin
Member

cfallin commented Aug 3, 2023

Upon further profiling, it appears that the performance bottleneck is related to the "cranelift" library (see image).

Indeed, Cranelift is the whole compiler, so it makes sense that "slow compilation" involves Cranelift!

Thanks very much for the report; we definitely weren't aware of this. If you're able to bisect to a particular commit that would be really useful; otherwise, if you have an example wasm module that gets much slower to compile, that would be the next best thing.

@jameysharp
Contributor

In addition to Chris' comments, here are a couple other thoughts:

The profile shows a third of time is spent in emitting islands, which makes me think this module has some very large functions in it. That could explain some of the difference between targets: the threshold where a function is large enough to need islands is different on aarch64 than on x86.

That said, I'm suspicious about the fact that different x86 systems have wildly different compile times too. It might be useful to run wasmtime settings on each x86 test system and see if there are differences in the "Settings inferred for the current host:" section. We could be hitting slow code paths that are only enabled when certain instruction-set features are present, or perhaps only when they're absent.

To normalize timing measurements you can do all your tests on the same computer so you aren't comparing the performance of different CPUs. You can use cargo build --features all-arch to enable Wasmtime support for all architectures, or use the release artifacts from GitHub which are already built with that turned on. Then you can use wasmtime compile --target x86_64 or aarch64. You may need to pass the various settings printed by wasmtime settings to the --cranelift-enable flag in order to get a fair comparison. I think there's also some magic when running on macOS but I can't find the details, so it might be easiest to run all the tests from your Mac Pro test system and cross-compile for x86.

@TimonPost
Contributor Author

TimonPost commented Aug 4, 2023

To the above points: I wrote a small program to compile a 115 MB .wasm file for both x86_64 and aarch64, and it's considerably slower on aarch64.

Use wasmtime to compile different targets

I built a script to test compiling our module and ran it three times on Windows 11 with an AMD Ryzen 9 5900X 12-core:

Settings inferred (`wasmtime settings`):
  has_avx
  has_avx2
  has_bmi1
  has_bmi2
  has_fma
  has_lzcnt
  has_popcnt
  has_sse3
  has_sse41
  has_sse42
  has_ssse3

wasmtime 11.0.1
x86_64: 13.9465882s, 11.2770526s, 11.0808334s
aarch64: 157.5406149s, 144.1816011s, 144.1495896s

wasmtime 9.0.4:
x86_64: 11.2984875s
aarch64: 147.1249103s

wasmtime 9.0
x86_64: 11.0617372s
aarch64: 159.7219714s

There doesn't seem to be much of a difference between wasmtime 9 and 11 🤔

Another dev ran it twice on Windows with an Intel Core i7 9750H:

Settings inferred (`wasmtime settings`):
  has_avx
  has_avx2
  has_bmi1
  has_bmi2
  has_fma
  has_lzcnt
  has_popcnt
  has_sse3
  has_sse41
  has_sse42
  has_ssse3

x86_64: 23.5362889s, 24.3254295s
aarch64: 211.1731269s, 207.9991321s

MacBook Pro M1 (from the 2 devs initially reporting the problem)

Settings inferred for the current host:
  has_lse
  has_pauth
  sign_return_address
  sign_return_address_with_bkey

use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::process::{Command, Stdio};
use std::time::Instant;

fn main() {
    let mut times: HashMap<String, std::time::Duration, RandomState> = HashMap::default();

    for target in &["x86_64", "aarch64"] {
        let start = Instant::now();
        let mut command = Command::new("wasmtime");
        command
            .arg("compile")
            .arg(format!("--target={target}"))
            .arg(r"path\xxxxx.wasm")
            // Capture stdout so it can be printed below.
            .stdout(Stdio::piped());

        let child = command.spawn().unwrap();
        let output = child.wait_with_output().unwrap();
        times.insert(format!("sandbox-{target}"), start.elapsed());

        println!(
            "[{:?}]: {target} ({})",
            start.elapsed(),
            String::from_utf8(output.stdout).unwrap()
        );
    }

    println!("{times:?}");
}

I also ran some queries on how often a module takes over 50 seconds to load (not compile) in production environments. Over the course of the 2 months after my previous patch #6548, where the problem was almost gone, it came back. It is a bit harder to visualize in relation to the total amount of module loads to get a percentile, but it shows something is going on.

[image: chart of modules taking over 50 seconds to load]

And also a query on the median and max time per day it took for a module to load (not only compiled).

[image: chart of median and max module load time per day]

I can unfortunately not share the wasm module.

@alexcrichton
Member

Given this program:

use std::fmt::{self, Write};

const F: usize = 2000; // number of functions
const N: usize = 1000; // number of calls per function

fn main() -> fmt::Result {
    let mut s = String::new();
    writeln!(s, "(module")?;
    for _ in 0..F {
        writeln!(s, "(func")?;
        for _ in 0..N {
            writeln!(s, "call $f")?;
        }
        writeln!(s, ")")?;
    }
    writeln!(s, "(func $f)")?;
    writeln!(s, ")")?;
    println!("{s}");
    Ok(())
}

which generates a wasm via:

rustc -O foo.rs && ./foo | wasm-tools parse -o foo.wasm

I then can reproduce this issue I believe with:

$ time wasmtime compile foo.wasm --cranelift-set wasmtime_linkopt_padding_between_functions=$((1<<16)) --target x86_64
wasmtime compile foo.wasm --cranelift-set  --target x86_64  5.43s user 0.31s system 736% cpu 0.780 total
$ time wasmtime compile foo.wasm --cranelift-set wasmtime_linkopt_padding_between_functions=$((1<<16)) --target aarch64
wasmtime compile foo.wasm --cranelift-set  --target aarch64  15.14s user 2.60s system 180% cpu 9.809 total

I believe the issue here is that there's quadratic behavior in the way that fixups are handled in the MachBuffer right now. Specifically, when an island is emitted, any fixups which don't need to be present in the island are re-inserted back into the fixup_records list. This means that, both in what I suspect is happening in the OP and in my test case, O(N) list entries are processed but O(N) are pushed back onto the list to get processed later. That yields O(N) passes that each process O(N) items, leading to quadratic behavior.

I stressed this above by setting the wasmtime_linkopt_padding_between_functions option to force space between functions to simulate big functions. Each function has a lot of calls to the final function in the module which means that there is always a lot of pending fixup records. After ~1300 functions are emitted each function has an island after it for prior functions and the next 1k functions each process ~1M fixup records each time.
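The quadratic pattern described above can be sketched in isolation (a hypothetical simulation, not Cranelift's actual data structures): each island scan walks every pending fixup but re-queues almost all of them, so total work grows with the square of the number of islands.

```rust
// Hypothetical simulation of the re-insertion pattern described above,
// not the real MachBuffer: each island emission scans all pending
// fixups, resolves only a few, and pushes the rest back for the next scan.
fn simulate(islands: usize, fixups_per_island: usize, resolved_per_island: usize) -> u64 {
    let mut pending: Vec<u32> = Vec::new();
    let mut work: u64 = 0; // total fixup records touched
    for _ in 0..islands {
        pending.extend(std::iter::repeat(0u32).take(fixups_per_island));
        let mut requeued = Vec::new();
        for (i, fixup) in pending.drain(..).enumerate() {
            work += 1; // every pending fixup is processed at this island...
            if i >= resolved_per_island {
                requeued.push(fixup); // ...but almost all are re-inserted
            }
        }
        pending = requeued;
    }
    work
}

fn main() {
    let small = simulate(100, 10, 1);
    let large = simulate(1000, 10, 1);
    // 10x the islands costs roughly 100x the work: quadratic growth.
    assert!(large > 50 * small);
    println!("work for 100 islands: {small}, for 1000 islands: {large}");
}
```

A cheaper queue (e.g. a VecDeque) reduces the constant factor of the re-insertion, but the number of records touched stays quadratic.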

One "easy" fix is to change this line to a VecDeque instead of a SmallVec. That doesn't actually fix the quadratic behavior, though; it just removes a large constant factor. For me that brought the runtime to ~5s.

I was otherwise testing out locally an entirely different strategy where the fixup_records are stored as a BinaryHeap and only those necessary to be emitted into the island were emitted, leaving everything else in place. This is not a viable solution as-is because I had to disable the MachBuffer branching optimizations, but it did bring the compile time down to 2s from the original ~10s for aarch64. Profiling shows the 2 seconds is almost entirely spent in b-tree operations as part of regalloc, which I'm interpreting as meaning that this is indeed the issue and would bring aarch64 and x86_64 back in parity with each other.


Do others have a better idea about how to handle the fixup_records list? The problem with the BinaryHeap approach is that the branch optimizations want to delete last-added records and additionally mutate records arbitrarily within the heap, both of which aren't operations that BinaryHeap supports.

Alternatively, one perhaps more radical idea would be to remove the branch optimizations entirely. My naive understanding of them is that they're better suited for restructuring the CFG (e.g. basic-block style mid-end optimizations) rather than emission-time optimizations.

One unfortunate part here is that the branch optimizations that get in the way of the BinaryHeap approach aren't used at all when assembling functions into one large image. Only intra-function assembly uses branch optimization, which is less of an issue here because functions are typically not large enough to hit this quadratic behavior. I'm not sure this can be extrapolated into something useful, though, and inevitably a big function will hit the quadratic behavior too (in addition to the big module in the OP).

@cfallin
Member

cfallin commented Aug 4, 2023

@alexcrichton I'm taking a look at this now. My most basic unanswered question is actually: what changed recently? The original report above is that this is a regression between v9 and v10; the core MachBuffer algorithm has not changed in three years (around June 2020).

Alternatively, one perhaps more radical idea would be to remove the branch optimizations entirely. My naive understanding of them is that they're better suited for restructuring the CFG (e.g. basic-block style mid-end optimizations) rather than emission-time optimizations.

Do you mean avoiding use of the MachBuffer for the inter-function linking? (My alternative reading of this suggestion -- not doing peephole opts for each function body emission -- seems unlikely, as they're really important for perf: something like 25% when I introduced them, at least.)

That's possible for sure; I guess the question would then be whether we switch back to full Abs8 relocations for all calls (and reintroduce support for relocs into Wasmtime as code won't be fully PIC anymore) or get island handling some other way. Fundamentally it seems to me that:

  • We do need islands (because of above);
  • Islands require some sort of label-resolution;
  • Label resolution needs some way of handling both forward and reverse refs;
  • Forward refs (labels not yet defined) are the core of the issue here, causing the re-insertion (here).

It seems to me that the most elegant approach may be to kick some label-fixup records off to an "at max range, don't reconsider until final fixup" list, then not consider them in emit_island again. We'd need a notion in LabelUse of "maximum range"; for aarch64 that's one step or two (19-bit -> 26-bit -> full Abs8, or 26-bit -> Abs8), and in Wasmtime in practice just one step for conditionals and no escalation at all for jumps. emit_island clears the most-recent-branch list, so we don't have to worry about invariants wrt its connection to fixup_records. Thoughts?
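A sketch of that proposal (hypothetical types and escalation rules, not the real MachBuffer/LabelUse API): fixups escalate through veneers until they reach their widest label kind, after which they are parked on a list that island emission never rescans.

```rust
// Hedged sketch of the "at max range, don't reconsider" idea above.
// `Fixup` and the escalation steps are hypothetical stand-ins for real
// MachBuffer fixup records and LabelUse kinds; in reality a fixup only
// escalates when it is about to go out of range.
struct Fixup {
    kind_bits: u32, // width of the current label-use kind
}

const MAX_BITS: u32 = 32; // assumed widest PC-relative kind

fn emit_island(active: &mut Vec<Fixup>, parked: &mut Vec<Fixup>) {
    let mut still_active = Vec::new();
    for mut fixup in active.drain(..) {
        // Escalate through a veneer to the next wider kind
        // (e.g. 19-bit -> 26-bit -> full-width).
        fixup.kind_bits = match fixup.kind_bits {
            19 => 26,
            _ => MAX_BITS,
        };
        if fixup.kind_bits >= MAX_BITS {
            parked.push(fixup); // at max range: never rescanned by islands
        } else {
            still_active.push(fixup);
        }
    }
    *active = still_active;
}

fn main() {
    let mut active = vec![Fixup { kind_bits: 19 }, Fixup { kind_bits: 26 }];
    let mut parked = Vec::new();
    emit_island(&mut active, &mut parked);
    assert_eq!((active.len(), parked.len()), (1, 1));
    emit_island(&mut active, &mut parked);
    // Each fixup is rescanned at most twice, so the quadratic
    // "process everything at every island" pattern disappears.
    assert_eq!((active.len(), parked.len()), (0, 2));
}
```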

I can try to prototype this today (but I have a pretty meeting-heavy day sadly so if you want to get to it first, please feel free!).

@cfallin
Member

cfallin commented Aug 4, 2023

Actually it turns out that a simple implementation of the above is a Caltrain-ride-sized chunk of work: #6804. It still doesn't resolve the issue because there is actually a 32-bit PCrel label kind that we want to be able to use, so the Branch26's stick around and participate in the quadratic dance. I think we want some sort of tiered deadline approach; continuing to think!

@cfallin
Member

cfallin commented Aug 4, 2023

The latest commit in #6804 now solves the issue, I think. I'll do more thorough testing and cleanup later in the day when I have a chance. One interesting consequence of the new change (every forward-ref that crosses an island gets a veneer) is that I think we no longer need a notion of worst-case size; but I'll verify that. It does also pessimize the case where a bunch of 26-bit forward refs cross a 19-bit-caused island in a very large function body, but "calls jump through an island in the middle of a function" is not technically wrong, so... (I'll measure with some benchmarks to be sure!)

@TimonPost
Contributor Author

TimonPost commented Aug 4, 2023

Thank you for confirming and swiftly addressing the optimization approach. I have a question that hasn't been answered yet: Why is there such a significant difference in performance between the aarch64 and x86_64 architectures?

Also, to @cfallin's question, I'd like to express a bit of uncertainty regarding whether this could be related to a version regression. I cannot confirm that it is (also see the benchmarks in #6798 (comment)). However, various people on our team have noticed that recently the aarch64 performance has been noticeably slower compared to before.

Our code base is growing fast so it is not odd to experience slower compile times, however, I wonder if there are perhaps certain things we do that are very heavy on wasm-generated-code (.wasm), and why there is such a big mismatch between targets?

@cfallin
Member

cfallin commented Aug 4, 2023

Why is there such a significant difference in performance between the aarch64 and x86_64 architectures?

This is a consequence of the kinds of label references (relocations, kind of) that the two architectures have, and more broadly a difference between RISC-y and CISC-y architectures. AArch64 has constant-size (32-bit) instructions, so its branches have either a 19-bit or 26-bit offset field (the former for conditionals, which need to encode more information) -- but code can be larger than that. Likewise, references to the "constant pool" have a 19-bit offset. To make larger code work, we create "veneers" in an "island" embedded in the code, where a branch first jumps to a longer-range branch (the veneer). The island also contains constants whose references are about to go out of range. This allows us to do single-pass emission, rather than going back and re-emitting with a longer-form branch (which may be a sequence of multiple insts and thus shift other offsets, causing a cascading fixup process). On the other hand, x86-64 has variable-length instructions and so can encode a 32-bit offset almost everywhere; we unconditionally use 32-bit offsets for conditional and unconditional jumps, and data references, so we never need an island of veneers/constants.
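The offset-field arithmetic can be made concrete with a small range check (a sketch using the architectural field widths, not Cranelift code):

```rust
// Sketch: whether a byte offset fits in an AArch64 branch's immediate
// field. Conditional branches (B.cond) have a 19-bit field and
// unconditional branches (B/BL) a 26-bit field; both are scaled by 4
// since instructions are 4 bytes, giving ±1 MiB and ±128 MiB of reach.
fn fits(offset_bytes: i64, field_bits: u32) -> bool {
    // The field encodes offset/4 as a signed `field_bits`-bit integer.
    let scaled = offset_bytes / 4;
    let max = (1i64 << (field_bits - 1)) - 1;
    let min = -(1i64 << (field_bits - 1));
    offset_bytes % 4 == 0 && scaled >= min && scaled <= max
}

fn main() {
    let mib: i64 = 1 << 20;
    // A 2 MiB jump is out of range for a 19-bit conditional branch...
    assert!(!fits(2 * mib, 19));
    // ...but within the 26-bit unconditional branch's ±128 MiB reach.
    assert!(fits(2 * mib, 26));
    // A 200 MiB jump exceeds even 26 bits: it needs a veneer/island.
    assert!(!fits(200 * mib, 26));
    // x86-64's unscaled rel32 (±2 GiB) covers it without any veneer.
    assert!(200 * mib < (1i64 << 31));
}
```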

recently the aarch64 performance has been noticeably slower compared to before.

If you have an example module we can benchmark and a bisection range ("faster with vX, slower since vY") that'd be very helpful!

@jameysharp
Contributor

If I understand Chris and Alex correctly, the relevant difference between x86_64 and aarch64 is that x86 branch instructions take 32-bit signed offsets, while aarch64 unconditional branches take effectively 28-bit signed offsets.

Your 115MB wasm module is almost 2^27 bytes, so if the compiled aarch64 binary is bigger than the wasm input, then a branch from one end of it to the other is just past the threshold. If you had a 2GB wasm module I'd expect to start seeing similar problems on x86, assuming you didn't run into other bugs first.

I would guess that things have gotten slower not because of Wasmtime changes but because as your wasm modules get bigger, the number of branches that overflow the signed offset limit increases.

Oddly, I think this means that topo-sorting the functions in the wasm module would tend to avoid hitting this behavior in Wasmtime, and also let Wasmtime generate slightly better code. I'm not sure that's worth pursuing though.
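For the concrete numbers behind the comments above (a back-of-the-envelope check, not measured data): a 26-bit immediate scaled by 4 is effectively a 28-bit signed byte offset, i.e. ±2^27 bytes of reach.

```rust
fn main() {
    // AArch64 unconditional branch: 26-bit immediate, scaled by 4.
    let reach: i64 = 1 << 27; // ±128 MiB of signed byte offset
    let module_bytes: i64 = 115_000_000; // the ~115 MB wasm input
    // The wasm input itself is still within reach...
    assert!(module_bytes < reach);
    // ...but compiled code somewhat larger than the input is not, so
    // an end-to-end branch is just past the threshold.
    assert!(2 * module_bytes > reach);
    // x86-64's rel32 is an unscaled 32-bit byte offset: ±2 GiB.
    assert!(module_bytes < (1i64 << 31));
}
```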

@alexcrichton
Member

what changed recently?

I think the answer here is @TimonPost's module got bigger. If I take my example module and compile it with Wasmtime 4.0.0 (to pick a random version) which is circa Dec 2022 I get x86_64 being 0.9s and aarch64 being 16s (different computer than my original measurements).

In that sense I don't think that this is a new issue, I think this is a scaling issue that @TimonPost's module has run into now.

Do you mean avoiding use of the MachBuffer for the inter-function linking?

Oh sorry, no, definitely not -- that's too load-bearing to replace with something else! What I was saying is that for the inter-function use case the branch optimizations are not necessary (as there are no branches). I'll also clarify that I'm not saying the branch optimizations aren't worth it; I'm saying that, if it works out perf-wise, it might be better to perform these optimizations at the mid-end instead of just before emission (e.g. basic-block jump threading and other basic-block optimizations at the CLIF layer). Note though that I only think this is a possibly-good idea if it retains all of the benefits that the current branch optimization brings; if something must be done at the MachBuffer layer and can't be done at the CLIF layer then it's good to keep.

The latest commit in #6804 now solves the issue

Nice! I'll also see your train ride and raise you a plane ride :)

I'll take a look probably early next week 👍 (unless someone beats me to it)

@cfallin
Member

cfallin commented Aug 5, 2023

I'm saying that, if it works out perf-wise, it might be better to perform these optimizations at the mid-end instead of just before emission (e.g. basic-block jump threading and other basic-block optimizations at the CLIF layer). Note though that I only think this is a possibly-good idea if it retains all of the benefits that the current branch optimization brings; if something must be done at the MachBuffer layer and can't be done at the CLIF layer then it's good to keep.

Ah, I see. I think it's probably best to keep MachBuffer's optimizations where they are: for historical context, when it replaced a separate pass, it resulted in 10% better compile time as well as 24% better perf overall. The reason is that the peephole pass is designed to be as unintrusive as possible: no editing of data structures, just "chomping" of branches where we see they're unnecessary. The two-target branch form is also deeply embedded in the design of the VCode and the register allocator; it's a much more natural representation of a CFG than the "one-sided" form (branch or fallthrough) that true machine branches have.
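The two-target form and the emission-time "chomping" can be illustrated with a toy lowering (hypothetical types, not VCode's actual representation): a conditional branch carries both targets, and the unconditional half is simply not emitted when the not-taken target is the fallthrough block.

```rust
// Toy illustration of two-target branches vs. one-sided machine
// branches; `MachInst` and `lower_two_target` are hypothetical.
#[derive(Debug, PartialEq)]
enum MachInst {
    CondBr { taken: u32 }, // falls through when not taken
    Jmp { dest: u32 },
}

// Lower a two-target branch, given the label of the block that will be
// laid out immediately after it.
fn lower_two_target(taken: u32, not_taken: u32, fallthrough: u32) -> Vec<MachInst> {
    let mut out = vec![MachInst::CondBr { taken }];
    if not_taken != fallthrough {
        // The not-taken edge doesn't fall through: emit an explicit jump.
        out.push(MachInst::Jmp { dest: not_taken });
    }
    out
}

fn main() {
    // Not-taken target is the fallthrough block: the jump is "chomped".
    assert_eq!(lower_two_target(2, 1, 1), vec![MachInst::CondBr { taken: 2 }]);
    // Otherwise the explicit jump must stay.
    assert_eq!(
        lower_two_target(2, 3, 1),
        vec![MachInst::CondBr { taken: 2 }, MachInst::Jmp { dest: 3 }]
    );
}
```

No data structures are edited in this scheme: the decision is made purely from layout order at emission time, which matches the "unintrusive chomping" described above.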

@TimonPost
Contributor Author

TimonPost commented Aug 5, 2023

I can confirm our .wasm file has been growing by an average of 8 MB/month over the last year, which might play a role in why this has become more noticeable over time.
