
Large link time regression compared to fastcomp #14200

Open
juj opened this issue May 16, 2021 · 14 comments
Comments

@juj
Collaborator

juj commented May 16, 2021

After updating our compiler from 1.38.11 to 2.0.19, we are seeing a very large regression in wasm-ld link times in a large project, with link times taking up to 15 minutes(!), or about twice as slow as fastcomp.

Taking a look at this regression in Visual Studio 2019 performance profiler, there are two functions that consume ~65.4% of the total link time, llvm::Function::hasAddressTaken and hasChangeableCC:

[screenshot: profiler hotspot summary]

Looking inside these two functions, they both have a hotspot in the same location: an inlined isa<BlockAddress>() check (a dynamic_cast<>-style type test):

[screenshots: disassembly views of the isa<BlockAddress>() hotspots]

To reproduce this issue, send me an email or a message on Discord; the test case is unfortunately a bit larger than what can be directly attached here. Unzip the files to the Emscripten root directory, and then run

C:\emsdk\emscripten\main>embuilder build MINIMAL
C:\emsdk\emscripten\main>wasm-ld -o 1.wasm 1.a 2.a 3.a 4.a 5.a 6.a 7.a 8.a 9.a 10.a 11.a 12.a 13.a 14.a 15.a 16.a 17.a 18.a 19.a 20.a 21.a 22.a 23.a 24.a 25.a 26.a 27.a 28.a 29.a 30.a 31.a 32.a 33.a 34.a 35.a 36.a 37.a 38.a 39.a 40.a 41.a 42.a 43.a 44.a 45.a 46.a 47.a 48.a 49.a 50.a 51.a 52.a 53.a 54.a 55.a 56.a 57.a 58.a 59.a 60.a 61.a 62.a 63.a -Lcache/sysroot/lib/wasm32-emscripten cache/sysroot/lib/wasm32-emscripten/libgl-webgl2-full_es3.a cache/sysroot/lib/wasm32-emscripten/libal.a cache/sysroot/lib/wasm32-emscripten/libhtml5.a cache/sysroot/lib/wasm32-emscripten/libc.a cache/sysroot/lib/wasm32-emscripten/libcompiler_rt.a cache/sysroot/lib/wasm32-emscripten/libc++.a cache/sysroot/lib/wasm32-emscripten/libc++abi.a cache/sysroot/lib/wasm32-emscripten/libdlmalloc.a cache/sysroot/lib/wasm32-emscripten/libc_rt_wasm.a cache/sysroot/lib/wasm32-emscripten/libsockets.a -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-cxx-exceptions -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr --allow-undefined --export main --export emscripten_stack_get_end --export emscripten_stack_get_free --export emscripten_stack_init --export __cxa_demangle --export stackSave --export stackRestore --export stackAlloc --export __wasm_call_ctors --export fflush --export __errno_location --export malloc --export free --export __cxa_is_pointer_type --export __cxa_can_catch --export setThrew --export ntohs --export strlen --export htonl --export htons --export memset --export _get_tzname --export _get_daylight --export _get_timezone --export memalign --export emscripten_main_thread_process_queued_calls --export emscripten_webgl_make_context_current --export emscripten_webgl_get_current_context --export-if-defined=__start_em_asm --export-if-defined=__stop_em_asm --export-table -z stack-size=5242880 --initial-memory=33554432 --no-entry --max-memory=2147483648 --global-base=1024
@kripken
Member

kripken commented May 16, 2021

It's possible this is an issue with the new LLVM pass manager. Does it happen with -flegacy-pass-manager at compile time (and -Wl,--lto-legacy-pass-manager at link if doing LTO)?

@kripken
Member

kripken commented May 16, 2021

Ah, looks like this is with LTO, from the stack trace. But running the testcase, I see no difference with the new pass manager; it's 9 minutes either way for me, so that's not the issue.

Also adding -O0 to wasm-ld (using emcc -Wl,-O0) does not change it much, which suggests the issue isn't LTO optimizations, but the compiling of bitcode to wasm - @sbc100 is that reasoning right?

Overall I think 2x slower LTO than fastcomp is about normal - the LLVM wasm backend does a lot more work than fastcomp did in LTO. But the new backend also has wasm object files for much faster linking than fastcomp - have you tried that?
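The faster path mentioned here is a plain non-LTO build, where each object is already a wasm object file and wasm-ld has no compilation step to do. A minimal sketch (file names are hypothetical; emcc from the emsdk is assumed to be on PATH):

```shell
emcc -c -O2 file.c -o file.o   # without -flto this emits a wasm object file
emcc file.o -o app.js          # wasm-ld only links; no LTO compile step
```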

I don't know enough about LLVM internals to know if that stack trace's result, ~65.4% of the total link time in llvm::Function::hasAddressTaken and hasChangeableCC, is expected or whether it's a possible thing to optimize; maybe @dschuff has thoughts there.

@juj
Collaborator Author

juj commented May 17, 2021

Thanks for the super fast turnaround!

The intent of doing LTO for release builds is to optimize code performance. Our understanding is that -flto-enabled builds should still give better performance compared to non-LTO builds, similar to how --llvm-lto builds trumped non-LTO builds in the old days.

It is unfortunate if LTO builds are much slower. For iteration builds we build without LTO. Regressing release build times would be unfortunate, but what is even more awkward is that this is blowing up our CI testing times, since we do need to test the release configuration there as well. Dropping LTO from release testing would be generally undesirable.

Btw, how does LLVM handle linking in a mixed form where some of the inputs are bitcode and some of the inputs are wasm object files? Are there any traps there? (Generally I don't think we need to mix, but in this test case I now notice that we only partially enabled -flto in the build commands, leading to some .a files containing bitcode and others containing wasm object files - though wasm-ld did not seem to mind?)

@sbc100
Collaborator

sbc100 commented May 17, 2021

Firstly, is this a regression from a previous version of wasm-ld? Or is it just a regression from fastcomp -> upstream in general? I'm guessing it's the latter, and that wasm-ld itself didn't regress?

In answer to your question, wasm-ld will do LTO on any inputs built with -flto (i.e. bitcode files), and it's fine to mix LTO and non-LTO object files and libraries. The basic steps are:

  1. Find all bitcode inputs.
  2. Compile them into a single lto.o object file.
  3. Link lto.o with all the other non-LTO object files.

This is why -flto itself is not a linker flag: the process is driven purely by the contents of the input objects. If you give the linker bitcode, it has to do LTO; if you give it non-bitcode objects, LTO is not possible.
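The mixed linking described above can be sketched as follows (file names are hypothetical; emcc from the emsdk is assumed to be on PATH):

```shell
emcc -c -flto hot.c -o hot.o    # bitcode object: will go through LTO
emcc -c cold.c -o cold.o        # wasm object: linked as-is
# At link time, wasm-ld first compiles all bitcode inputs into a single
# lto.o, then links lto.o together with the regular wasm objects.
emcc hot.o cold.o -o app.js
```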

@sbc100
Collaborator

sbc100 commented May 17, 2021

As Alon said, we do expect LTO link speed to regress somewhat with the transition away from fastcomp. The expectation is basically:

  1. Much faster non-LTO builds (since fastcomp essentially had no non-LTO mode at all).
  2. Slower LTO, since the LLVM compiler does more work than the fastcomp compiler did.

Also, the LTO slowdown is not a linker regression but a compiler one: the time goes to the LLVM compiler, not the linker itself. Perhaps this is stating the obvious, but I didn't want wasm-ld to be blamed here :)

LTO builds are known to be very slow in the native world (the Chrome LTO build takes many hours and tens of gigabytes of memory, for example; it's so big it can't be done on a normal developer workstation and would probably bring down the whole machine!).

There are some methods we can employ for speeding up LTO builds:

  1. As you suggest, do less LTO, by passing some native objects and some bitcode objects to the linker.
  2. Compile input object files with -O0. This opt level is stored in the bitcode file itself, causing the compiler (at link time) to do less work.
  3. Pass -Wl,-lto-O0 to the linker. This is different from the compiler opt level and I believe controls things like the amount of IPO done during LTO. Honestly I'm not sure exactly what this flag does, but IIUC it makes the LTO process more like compiling each bitcode file individually. It will still take longer than non-LTO linking, but not as long as -Wl,-lto-O2 (the default). (It's used to set the OptLevel of the lto::Config object: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/LTO/Config.h#L55.)
  4. Use ThinLTO (-flto=thin). This causes the linker to compile inputs in groups/partitions rather than all together, which can dramatically reduce the amount of work done at link time. Sadly I have done very little testing of this option with wasm-ld and I'm not even sure it is working.

Depending on what you want to achieve, one might choose different options. I believe (4), ThinLTO, is the path with the most bang for the buck, although sadly it needs more testing in the wasm world.
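In emcc terms, options (2) through (4) above might look roughly like this. This is a sketch using the flag spellings from this thread; as noted, ThinLTO with wasm-ld is not well tested:

```shell
# (2) Compile bitcode at -O0 so the link-time compile does less work:
emcc -c -flto -O0 file.c -o file.o

# (3) Keep full LTO inputs but dial down link-time optimization:
emcc -flto file.o -o app.js -Wl,-lto-O0

# (4) ThinLTO: compile and optimize in partitions rather than all at once:
emcc -c -flto=thin file.c -o file.o
emcc -flto=thin file.o -o app.js
```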

@sbc100
Collaborator

sbc100 commented May 17, 2021

A note on using LTO in general: from the reports I've seen, it's not always a win in the WebAssembly world. Since it allows a lot of cross-module optimizations, and in particular inlining, it is often a code size regression over non-LTO. IIRC we have had reports of success, but also reports of little or no runtime gains combined with code size regressions. So make sure you measure to see what kind of benefit you are getting from using LTO at all.

@kripken
Member

kripken commented May 17, 2021

I did some profiling myself, and I can confirm what @juj saw with a very large amount of time spent in just two methods (hasAddressTaken and hasChangeableCC). I see something similar with hasAddressTaken and optimizeGlobalsInModule being very very hot (over 20%). So there may be a speedup opportunity here, even if as discussed we do not expect LTO to match fastcomp's link times.

@juj
Collaborator Author

juj commented May 17, 2021

Thanks for the details!

Firstly, is this a regression from a previous version of wasm-ld? Or is it just a regression from fastcomp -> upstream in general? I'm guessing it's the latter, and that wasm-ld itself didn't regress?

A regression from fastcomp.

I did some profiling myself, and I can confirm what @juj saw with a very large amount of time spent in just two methods (hasAddressTaken and hasChangeableCC). I see something similar with hasAddressTaken and optimizeGlobalsInModule being very very hot (over 20%). So there may be a speedup opportunity here, even if as discussed we do not expect LTO to match fastcomp's link times.

When I left the build running on my computer in a debug build, with a breakpoint set on the taken branch of the hot isa<BlockAddress>() check, the debugger never hit it throughout the run. That is, the result seems to always be false, and (for this specific workload) one could comment out those two if() checks without affecting the result of the build. Though I did not really verify that. Still, something seems off here.

Looks like we'll have to re-evaluate whether LTO will be giving any benefits, and do some more detailed testing for numbers.

@sbc100
Collaborator

sbc100 commented May 17, 2021

I did some profiling myself, and I can confirm what @juj saw with a very large amount of time spent in just two methods (hasAddressTaken and hasChangeableCC). I see something similar with hasAddressTaken and optimizeGlobalsInModule being very very hot (over 20%). So there may be a speedup opportunity here, even if as discussed we do not expect LTO to match fastcomp's link times.

Just to confirm, any such optimizations would be normal compiler optimizations, right? Nothing linker-specific here? I.e. it would speed up normal compilation of a single source file too? It just happens that LTO is more noticeable because it's not as parallelizable?

For actually making LTO more parallel, we should probably also look into the ThinLTO route recommended above and ensure it works well.

@sbc100
Collaborator

sbc100 commented May 17, 2021

Another way of putting it: switching to the llvm backend regressed compile times, but because of parallel build systems that was maybe not as noticeable as the LTO link-time regression? In fact, I would expect the regression in compile time to be even worse, because the upstream backend actually compiles the code and doesn't stop at bitcode by default.

@kripken
Member

kripken commented May 17, 2021

@sbc100

Just to confirm, any such optimizations would be normal compiler optimizations right? Nothing linker specific here?

Correct, yes.

@juj

Based on your last comment, I wonder if you are hitting a case where LTO-building LLVM itself might help. That is, maybe inlining isa<BlockAddress>() would be a big win. We do have LTO builds of the toolchain now; you can try emsdk install 2.0.19-lto (or 2.0.20).

@sbc100
Collaborator

sbc100 commented May 17, 2021

Interesting idea! LTO all the things for faster LTO'ing of things.

@kripken
Member

kripken commented May 17, 2021

Sadly the LTO build does not speed things up for me.

Interestingly, emcc -Wl,-lto-O0, which @sbc100 told me is the way to actually disable LTO opts (and not just -Wl,-O0), does help by a huge factor: link time goes down to 1 minute (from around 9). So the issue is in the LLVM optimizer specifically.

Anyhow, -Wl,-lto-O0 is likely not that useful, as it would emit something similar to a build with wasm object files, but wasm object files would link much faster.

@sbc100
Collaborator

sbc100 commented Jun 11, 2021

So it looks like this is just the cost of LLVM LTO? @juj, are you OK with disabling LTO when you want a quick build?
