
[Question] upstream compiled binaries are 30% to 50% slower than fastcomp ones #9817

Closed

MarcelKlammer opened this issue Nov 10, 2019 · 19 comments

@MarcelKlammer (Contributor) commented Nov 10, 2019

With upstream now being the default, I compared both builds again in terms of the performance of our Face Tracking library. Unfortunately, the upstream results are about 50% (Chrome) to 30% (Firefox) slower than the fastcomp-compiled binaries.

The only difference in options was, for fastcomp:

-s "BINARYEN_TRAP_MODE='clamp'"

and for upstream:

-mnontrapping-fptoint
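
(For concreteness, a hypothetical pair of invocations differing only in those flags; the file names and the remaining options are assumptions, not the actual build script:)

# fastcomp build
emcc FaceTracking.cpp -O3 -s "BINARYEN_TRAP_MODE='clamp'" -o facetracking.js
# upstream build
emcc FaceTracking.cpp -O3 -mnontrapping-fptoint -o facetracking.js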

Several questions:

Is this a known issue?
Is there a way for me to help find out why this issue exists and help find a way to solve it?
How long will fastcomp remain an option, now that upstream is the default?

Thanks for your work.

Cheers
Marcel

@kripken (Member) commented Nov 10, 2019

Thanks for filing - this isn't a known issue.

First thing I'd try is to build with --profiling and profile in the browser, comparing the two builds. Perhaps the upstream one has some functions that take significant time that don't appear in the other one?

Another thing is that in upstream, codegen happens during the compile step rather than at link time. This can matter if you use different -O flags at those two stages. See https://emscripten.org/docs/compiling/WebAssembly.html#backends
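
(A minimal sketch of keeping the -O level consistent across the two stages, with --profiling for readable function names in browser profiles; the file names are hypothetical:)

# compile and link as separate steps, passing the same -O level to both
emcc -O2 -c FaceTracking.cpp -o FaceTracking.o
emcc -O2 --profiling FaceTracking.o -o facetracking.html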

Fastcomp doesn't have a specific date yet for how long it will stay around. It will depend on issues like this, first we need to be sure upstream is a good replacement.

@VirtualTim (Contributor) commented Nov 11, 2019

One thing you can try: if you profile in Firefox, enable "show gecko platform data". This will log the time spent on trap handling.
If BINARYEN_TRAP_MODE is responsible for the performance difference, I would expect the biggest self cost to be "debug trap handling (in wasm)".

Edit: the Firefox profiler doesn't show threads. You may have to use console.profile().

@ryszardgrzesica commented Nov 12, 2019

Sorry for jumping in from the void...

Test:
Attached is a simple test; it just does random number generation and fpclassify calls in 4 threads.

Compilation command:
../../emcc -v --profiling-funcs ../fpclassify.cpp -O2 -s WASM=1 -s DISABLE_EXCEPTION_CATCHING=0 -s TOTAL_MEMORY=1024MB -s WASM_MEM_MAX=1024MB -s USE_PTHREADS=1 -s PTHREAD_POOL_SIZE=5 -o fpcl.html 2>build_log_errout.txt

Toolchain:
I used Emscripten 1.39.2 with binaryen master; the only thing I changed was ~/.emscripten, pointing it at the different backends (you can check it in the build log files).
Also take a look at the wasm2wat_demangled.txt files: when you compare how the fptest() function looks in the two cases, the reason for the overhead should be clear (invokes vs. a direct function call?).

Execution time:
Similar results are obtained when executing once in the main thread only.
I also made several checks with different flags, but the overhead pattern stays.

Chrome:

  • Fastcomp backend ~120 ms
  • LLVM newest backend ~2100 ms

Firefox:

@dschuff (Member) commented Nov 12, 2019

What I discovered about the fpclassify test is that fastcomp forces LTO (link-time optimization), and fpclassify gets inlined from libc into its caller. This is an especially large win when exceptions are enabled, since the call otherwise requires a round-trip to JS. The upstream backend defaults to doing the wasm codegen at compile time (rather than link time); this is why it links so much faster. But the drawback is that it doesn't do cross-module inlining. If you use -s WASM_OBJECT_FILES=0 with the upstream backend, you'll get fpclassify inlined again, and the performance difference goes away. The small fpclassify test gets even faster if you add --llvm-lto 1.

For some reason using -flto (the flag used for LTO on other platforms) doesn't have the same behavior; we should probably fix that.
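
(A minimal sketch of the two workarounds just described, using the fpclassify test from above; the file name is an assumption:)

# emit LLVM bitcode instead of wasm object files, restoring cross-module inlining at link time
emcc -O2 -s WASM_OBJECT_FILES=0 fpclassify.cpp -o fpcl.html
# additionally run LLVM's LTO passes for more speed on this small test
emcc -O2 -s WASM_OBJECT_FILES=0 --llvm-lto 1 fpclassify.cpp -o fpcl.html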

@kripken (Member) commented Nov 13, 2019

Looking at this a bit, this can show the issue with fpclassify:

#include <cmath>
 
int main() {
  volatile float f;
  return (int)std::fpclassify(f);
}

In fastcomp that turns into a simple call. In upstream, we end up in LLVM IR with an invoke and a landing pad and all the exception overhead. Not sure why yet...

@aheejin (Member) commented Nov 13, 2019

> Looking at this a bit, this can show the issue with fpclassify:
>
> #include <cmath>
>
> int main() {
>   volatile float f;
>   return (int)std::fpclassify(f);
> }
>
> In fastcomp that turns into a simple call. In upstream, we end up in LLVM IR with an invoke and a landing pad and all the exception overhead. Not sure why yet...

This does not seem to generate a landingpad in either the LLVM backend or fastcomp...? The code does not have throw() or noexcept, so I wonder why it should.

@kripken (Member) commented Nov 13, 2019

Sorry, I didn't write down the flags I used: -O2 -s DISABLE_EXCEPTION_CATCHING=0. Do you not see exception code generated with those?

@aheejin (Member) commented Nov 13, 2019

I used -s DISABLE_EXCEPTION_CATCHING=0 and don't see any landingpads in either the LLVM backend or fastcomp. The code does not have throw() or noexcept. Is this the right test case you were talking about?

@kripken (Member) commented Nov 13, 2019

Oh, did you write it to a C or a C++ file? Testing on my machine at home, I see the problem with .cpp files but not .c, where I guess the frontend works differently.

I also see that removing the _NOEXCEPTs from system/include/libcxx/math.h prevents the problem. That's true even without using a C++ include, as long as it's in a .cpp file:

// build with e.g. -O2 -s DISABLE_EXCEPTION_CATCHING=0 -g
#include <math.h>
 
int main() {
  volatile float f;
  return (int)fpclassify(f);
}

@aheejin (Member) commented Nov 13, 2019

Is -O2 required? I tested with a .cpp file too and didn't see the landing pad (I also checked the bitcode file). I don't see why it should appear. For example,

int main() {
  throw 3;
}

does not generate a landing pad (in x86, in fastcomp, and in the wasm backend).

@kripken (Member) commented Nov 13, 2019

Very strange that our results are different. I guess we'll figure this out in the office tomorrow. Meanwhile here is a full log of what I see:

$ cat a.cpp
#include <math.h>
 
int main() {
  volatile float f;
  return (int)fpclassify(f);
}
$ emcc a.cpp -c -emit-llvm -o a.o -O2 -s DISABLE_EXCEPTION_CATCHING=0
$ llvm-dis a.o -o /dev/stdout
; ModuleID = 'a.o'
source_filename = "a.cpp"
target datalayout = "e-m:e-p:32:32-i64:64-n32:64-S128"
target triple = "wasm32-unknown-emscripten"

; Function Attrs: norecurse nounwind
define hidden i32 @main() local_unnamed_addr #0 personality i32 (...)* @__gxx_personality_v0 {
entry:
  %f = alloca float, align 4
  %f.0.f.0..sroa_cast = bitcast float* %f to i8*
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %f.0.f.0..sroa_cast)
  %f.0.f.0. = load volatile float, float* %f, align 4, !tbaa !2
  %call.i1.i = invoke i32 @__fpclassifyf(float %f.0.f.0.)
          to label %_Z10fpclassifyIfENSt3__29enable_ifIXsr3std17is_floating_pointIT_EE5valueEiE4typeES2_.exit unwind label %lpad.i

lpad.i:                                           ; preds = %entry
  %0 = landingpad { i8*, i32 }
          filter [0 x i8*] zeroinitializer
  %1 = extractvalue { i8*, i32 } %0, 0
  tail call void @__cxa_call_unexpected(i8* %1) #3
  unreachable

_Z10fpclassifyIfENSt3__29enable_ifIXsr3std17is_floating_pointIT_EE5valueEiE4typeES2_.exit: ; preds = %entry
  call void @llvm.lifetime.end.p0i8(i64 4, i8* nonnull %f.0.f.0..sroa_cast)
  ret i32 %call.i1.i
}

; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #1

; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #1

declare i32 @__gxx_personality_v0(...)

declare void @__cxa_call_unexpected(i8*) local_unnamed_addr

declare i32 @__fpclassifyf(float) local_unnamed_addr #2

attributes #0 = { norecurse nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { argmemonly nounwind willreturn }
attributes #2 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #3 = { noreturn nounwind }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 10.0.0 (/b/s/w/ir/cache/git/chromium.googlesource.com-external-github.com-llvm-llvm--project 901cc4a4bc5257d100338e6c318b96b022d554ee)"}
!2 = !{!3, !3, i64 0}
!3 = !{!"float", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C++ TBAA"}

Looks like the landingpad is there in order to then call __cxa_call_unexpected.

@VirtualTim (Contributor) commented Nov 13, 2019

> For some reason using -flto (the flag used for LTO on other platforms) doesn't have the same behavior; we should probably fix that.

@dschuff what's the current behaviour? No effect in either upstream or fastcomp? Previously (in fastcomp) I was linking with --llvm-lto 3. In the future, will we require both flags to enable LTO? I assume we will also need to compile with that flag?

@MarcelKlammer (Contributor, Author) commented Nov 13, 2019

I was able to reduce the test case to almost no code.

It's actually related to std::vector element access and -Oz vs. -O2.

#include <emscripten.h>

#include <iostream>
#include <vector>
#include <chrono>

int main() {

    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();

    std::vector<unsigned long> v(5000);

    for(int i = 0; i < 240000; i++) {

        for(int k = 0; k < v.size(); k++) { 
            
            v[k] = i + k;
        }
    }

    std::chrono::duration<long double, std::milli> time_span = std::chrono::high_resolution_clock::now() - t1;

    std::cout << "time: " << time_span.count() << "ms (keep vector " << v[4999]<< ")" << std::endl;
}

Fastcomp:

-Oz --llvm-opts 3 --llvm-lto 3

Firefox: 1680ms
Safari: 1350ms
Chrome: 1950ms (no dev console open), 515ms (with dev console open)

-Oz (omitting --llvm-opts 3 --llvm-lto 3)

Firefox: 1820ms
Safari: 1330ms
Chrome: 1870ms (no dev console open), 515ms (with dev console open)

-O2 --llvm-opts 3 --llvm-lto 3 (same when omitting --llvm-opts 3 --llvm-lto 3)

Firefox: 2090ms
Safari: 1400ms
Chrome: 2020ms (no dev console open), 520ms (with dev console open)

Chrome's console behavior is absolutely strange in that regard, but consistent on my machine.

Upstream:

-Oz --llvm-opts 3 --llvm-lto 3 (same when omitting --llvm-opts 3 --llvm-lto 3)

Firefox: 5030ms
Safari: 3150ms
Chrome: 4180ms (no matter if dev console is open)

-O2 --llvm-opts 3 --llvm-lto 3 (same when omitting --llvm-opts 3 --llvm-lto 3)

Firefox: 2080ms
Safari: 1400ms
Chrome: 1850ms (no dev console open), 515ms (with dev console open)

@MarcelKlammer (Contributor, Author) commented Nov 13, 2019

And here is the export script, just to be clear:

CPP="${DIR_PROJECT}/Test.cpp"

#EMCC="/Users/marcelklammer/Documents/workspaces/github/emsdk/fastcomp/emscripten/emcc"
#
#"${EMCC}" -v --clear-cache
#
#"${EMCC}" "${CPP}" -o "test_fc_o2_opts.html" -std=c++17 -v -O2 --llvm-opts 3 --llvm-lto 3
#"${EMCC}" "${CPP}" -o "test_fc_oz_opts.html" -std=c++17 -v -Oz --llvm-opts 3 --llvm-lto 3
#
#"${EMCC}" "${CPP}" -o "test_fc_o2_noopts.html" -std=c++17 -v -O2
#"${EMCC}" "${CPP}" -o "test_fc_oz_noopts.html" -std=c++17 -v -Oz

EMCC="/Users/marcelklammer/Documents/workspaces/github/emsdk/upstream/emscripten/emcc"

"${EMCC}" -v --clear-cache

"${EMCC}" "${CPP}" -o "test_us_o2_opts.html" -std=c++17 -v -O2 --llvm-opts 3 --llvm-lto 3
"${EMCC}" "${CPP}" -o "test_us_oz_opts.html" -std=c++17 -v -Oz --llvm-opts 3 --llvm-lto 3

"${EMCC}" "${CPP}" -o "test_us_o2_noopts.html" -std=c++17 -v -O2
"${EMCC}" "${CPP}" -o "test_us_oz_noopts.html" -std=c++17 -v -Oz

@aheejin (Member) commented Nov 13, 2019

@kripken

  • -O2 is indeed required to see your code. The reason main has a landing pad is that, in the original -O0 form of the code, main calls std::enable_if<std::is_floating_point<_A1>::value, int>::type fpclassify, which calls __libcpp_fpclassify, and __libcpp_fpclassify is marked _NOEXCEPT. So the landing pad is generated within __libcpp_fpclassify, but with -O2 it is inlined into std::enable_if<std::is_floating_point<_A1>::value, int>::type fpclassify, which is in turn inlined into main. This is why main ends up with a landing pad. Interestingly, with fastcomp, even -O0 inlines all of these into main. I don't know why (and I haven't investigated).

  • The reason the landing pad is deleted in fastcomp is that it runs GlobalOptLegacyPass, which does IPO. This figures out that __libcpp_fpclassify is not going to throw after all, and deletes the landing pad that was created for the case in which it throws. This pass is an LLVM IR pass, and the wasm backend cannot run it, because we link object files and so don't see the whole module at the LLVM IR stage. There should be some similar IPO pass for LTO, but I haven't looked for it yet. @sbc100 Is there an LTO pass that does this kind of IPO?

@kripken (Member) commented Nov 13, 2019

Thanks @aheejin! Looks like that explains it. And I can confirm that building the benchmark with -flto fixes the regression.

A long time ago we speculated that there might be things fastcomp can do that upstream can't, even when fastcomp doesn't have --llvm-lto 1 enabled. That is, it can do more even without running full LTO, through better DCE or something else, thanks to having LLVM IR information as opposed to wasm object file information. This may be the first significant example of that!

If there isn't a way to optimize this in the non-LTO case in the wasm backend (which seems to be the case), then maybe all we can do is improve our docs to mention this issue, point to LTO as the fix, and wait on native wasm exceptions?
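
(To make the LTO fix concrete, a minimal sketch, assuming the fpclassify testcase from above is saved as a.cpp:)

# compile and link with LTO so whole-program IPO can drop the landing pad
emcc -O2 -flto -s DISABLE_EXCEPTION_CATCHING=0 a.cpp -o a.out.js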

@MarcelKlammer Thanks for the reduced testcase! Looks like the other testcase here (the exceptions issue) is completely unrelated, sorry about that. Looking into yours, the key issue seems to be Binaryen optimizations:

$ ./emcc a.cpp -Os
$ d8 a.out.js
time: 4886.12ms (keep vector 244998)
$ ./emcc a.cpp -O2
$ d8 a.out.js
time: 2301.29ms (keep vector 244998)
$ ./emcc a.cpp -O3
$ d8 a.out.js
time: 663.18ms (keep vector 244998)
$ ./emcc a.cpp -Os
$ ~/Dev/binaryen/bin/wasm-opt a.out.wasm --metrics -O3 --metrics -o a.out.wasm
total
 [events]       : 0       
 [exports]      : 9       
 [funcs]        : 780     
 [globals]      : 1       
 [imports]      : 19      
 [memory-data]  : 10548   
 [table-data]   : 353     
 [total]        : 54490   
 [vars]         : 1112    
 binary         : 8441    
 block          : 2470    
 break          : 2209    
 call           : 3134    
 call_indirect  : 95      
 const          : 8716    
 drop           : 330     
 global.get     : 268     
 global.set     : 549     
 host           : 1       
 if             : 1063    
 load           : 2967    
 local.get      : 15801   
 local.set      : 4412    
 loop           : 345     
 nop            : 2       
 return         : 172     
 select         : 278     
 store          : 2272    
 switch         : 24      
 unary          : 896     
 unreachable    : 45      
total
 [events]       : 0       
 [exports]      : 9       
 [funcs]        : 656         -124
 [globals]      : 1       
 [imports]      : 19      
 [memory-data]  : 10548   
 [table-data]   : 353     
 [total]        : 60985      +6495
 [vars]         : 1151         +39
 binary         : 9487       +1046
 block          : 2796        +326
 break          : 2535        +326
 call           : 1975       -1159
 call_indirect  : 351         +256
 const          : 9984       +1268
 drop           : 354          +24
 global.get     : 283          +15
 global.set     : 579          +30
 host           : 1       
 if             : 1510        +447
 load           : 4831       +1864
 local.get      : 16987      +1186
 local.set      : 4798        +386
 loop           : 343           -2
 nop            : 2       
 return         : 166           -6
 select         : 312          +34
 store          : 2737        +465
 switch         : 24      
 unary          : 886          -10
 unreachable    : 44            -1
$ d8 a.out.js
time: 1037.69ms (keep vector 244998)
$ 

Note how -Os is slow, -O2 is better, and -O3 is best. But building with -Os (which should be slow) and then running wasm-opt manually with -O3 makes it almost as fast as -O3. The metrics diff shows that it inlines a lot (removing 15% of the functions!), so that's likely the issue. It also makes sense, as the Binaryen inliner is pretty cautious at -Os, which we should improve: WebAssembly/binaryen#2144

kripken added a commit that referenced this issue Nov 13, 2019
@kripken (Member) commented Nov 13, 2019

@VirtualTim the docs for that should be here, and should be up to date: https://emscripten.org/docs/optimizing/Optimizing-Code.html?highlight=optimizing#lto

kripken added a commit that referenced this issue Nov 13, 2019
@kripken (Member) commented Nov 14, 2019

I looked more into @MarcelKlammer's testcase. Varying the inlining parameters between the -O3 and -Os settings does affect both performance and code size; that is, the build gets both faster and bigger. So there isn't an obvious thing to fix in the inliner in this case: it is doing the right thing for -Os and -Oz, which is to focus on code size and not increase it significantly in return for speed.

@MarcelKlammer (Contributor, Author) commented Nov 14, 2019

I never questioned -Oz in fastcomp; it has worked for me since BRFv3. Now BRFv5 is significantly faster using -O3, of course (duh). So thanks for the switch to upstream, which made me look into it. I'm out of here; feel free to use this thread for the other topic. Thanks for your time and efforts.

@sbc100 sbc100 closed this Nov 14, 2019
kripken added a commit to WebAssembly/binaryen that referenced this issue Nov 15, 2019
kripken added a commit to WebAssembly/binaryen that referenced this issue Nov 19, 2019
When we see invoke_ calls in emscripten-generated code, we know
they call into JS just to do a try-catch for exceptions. If the target being
called cannot throw, which we check in a whole-program manner, then
we can simply skip the invoke.

I confirmed that this fixes the regression in emscripten-core/emscripten#9817 (comment)
(that is, with this optimization, upstream is around as fast as fastcomp).

When we have native wasm exception handling, this can be
extended to optimize that as well.