
Android NDK Clang produces 23% slower binaries than GCC #495

Closed

bruenor41 opened this issue Aug 29, 2017 · 91 comments

@bruenor41


Switching from Android GCC to Clang produces slower binaries.

Description

I am working on an application that emulates an old operating system from the '90s. I build the C++ binaries with Android NDK r10e and GCC 4.8, but because Google is dropping support for GCC, I want to switch to Clang. After updating the NDK to r15 and building successfully, I ran benchmarks (in r15 the default compiler is Clang instead of GCC, as it was in NDK r10e). The result is that the emulated system runs 23% slower; even without the benchmark it is very noticeable when playing more CPU-demanding games.

I made no changes to Android.mk and I use -O3 for optimization. I only updated the NDK. Simply put, I only switch between toolchains with:
NDK_TOOLCHAIN_VERSION=4.9 or without it

In NDK r15 it is still possible to switch to GCC 4.9. I did that and got the lost performance back. Even more, GCC 4.9 seems to optimize slightly better than GCC 4.8.

From what I read on the web, I expected Clang to produce binaries that are faster or at least the same. And I didn't find any special magic Clang flags that I would have to enable in Android.mk.

As a test I went back to NDK r10e and switched to Clang 3.8. The compiled application is again around 20%+ slower, so it does not seem to be specific to NDK r15; it has been there for a long time.

Environment Details

In my Application.mk I use this setup:

APP_ABI := armeabi armeabi-v7a x86
APP_OPTIM := release
APP_STL := stlport_static
APP_PLATFORM := android-8
LOCAL_LDLIBS += -lz

APP_PLATFORM is set to 8, but I see in the log that the minimum was raised to 14 automatically.

Android.mk
APP_CPPFLAGS = -O3
LOCAL_ARM_MODE := arm
LOCAL_CFLAGS += -DHAVE_NEON=1
LOCAL_ARM_NEON := true

I tested on Windows 10 and Ubuntu 16.04; the results are the same.
Tested on devices with Android 6.0 and 7.0.

@stephenhines
Collaborator

This isn't really that actionable without a repro case. 20%+ slower seems outside of the bounds of anything we have ever seen (in 7+ years of working on this), but it's hard to do more without further information.

@bruenor41
Author

bruenor41 commented Aug 29, 2017

I understand. I can create an application that you can test; it will take me 1-2 days. But I would prefer not to make the sources public. I can provide them via email to someone from the team. Or is it enough to have APKs compiled with GCC and with Clang?

@kneth

kneth commented Aug 29, 2017

Isn't -O3 a bit adventurous (#110 (comment))?

@bruenor41
Author

Unfortunately I tried all of these flags, but none of them brings the performance back:
https://stackoverflow.com/questions/15548023/clang-optimization-levels

@enh
Contributor

enh commented Aug 29, 2017

have you tried adding -Bsymbolic? that's one of the main differences between the default clang flags and the default [Android] GCC flags.
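(Illustrative sketch only, not from the thread: with ndk-build, one way to try this is to pass the option through the module's linker flags; -Wl,-Bsymbolic is the spelling the compiler driver forwards to the linker unchanged. Whether LOCAL_LDFLAGS or APP_LDFLAGS is the right place depends on the project layout.)

# Android.mk (hypothetical module) -- forward -Bsymbolic to the linker
LOCAL_LDFLAGS += -Wl,-Bsymbolic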

@bruenor41
Author

Nope, today is a public holiday in our country, but I will try it first thing tomorrow morning. Thank you very much for the idea.

@bruenor41
Author

I created a benchmark application. Please check readme.txt in the root.

@stephenhines
Collaborator

A couple of quick suggestions. I am not sure whether APP_CPPFLAGS (where you set -O3) gets cleared by $(CLEAR_VARS). I will also note that there are several .c source files in this example, which will not get the -O3, since that would require setting APP_CFLAGS instead (which applies to both .c and .cpp files). Instead, these files will be compiled at the Clang default (-O0), which is bound to have performance problems. Since I don't really work on the NDK build side of things, this is my first experience with any of these special make variables, so I am really guessing that not all the flags you expect are being passed to your actual built libraries. LOCAL_CFLAGS is much more familiar to me, and those do get cleared by $(CLEAR_VARS).
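(A minimal Application.mk sketch of the distinction described here, assuming the goal is project-wide optimization: APP_CFLAGS covers both .c and .cpp sources, while APP_CPPFLAGS only adds flags for C++.)

# Application.mk (sketch)
APP_CFLAGS := -O3      # applies to .c and .cpp files in every module
# APP_CPPFLAGS would only add extra flags for C++ sources on top of APP_CFLAGS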

@bruenor41
Author

bruenor41 commented Aug 31, 2017

Stephen, thank you very much for the suggestions! I replaced APP_CPPFLAGS with LOCAL_CFLAGS. It was definitely a good idea; it seems the binaries made by both compilers profit from this change and perform slightly better. Every performance gain is important to me. Unfortunately, the app compiled by GCC still performs much better :(

@stephenhines
Collaborator

You need to update every place after $(CLEAR_VARS) to add a LOCAL_CFLAGS with your optimization level in order for it to affect that compilation. Otherwise, I am pretty convinced at this point that you are getting -O0 compilation for many of the source files with Clang.
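(A hedged sketch of that pattern; the module and file names below are made up. The point is that include $(CLEAR_VARS) resets every LOCAL_* variable, so each module must set LOCAL_CFLAGS again.)

# Android.mk (sketch)
include $(CLEAR_VARS)
LOCAL_MODULE    := corelib        # hypothetical module
LOCAL_SRC_FILES := core.c util.cpp
LOCAL_CFLAGS    += -O3            # applies to both .c and .cpp in this module
include $(BUILD_STATIC_LIBRARY)

include $(CLEAR_VARS)
LOCAL_MODULE    := frontend       # hypothetical module
LOCAL_SRC_FILES := frontend.cpp
LOCAL_CFLAGS    += -O3            # must be repeated after every $(CLEAR_VARS)
include $(BUILD_SHARED_LIBRARY)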

@bruenor41
Author

Here is how I updated them

@DanAlbert
Member

Could you just upload that as a github project or at least a gist so we don't have to keep redownloading it?

@DanAlbert
Member

APP_CFLAGS (or any other APP_* variables) does nothing in Android.mk. Application-wide flags need to be set in Application.mk.

@bruenor41
Author

bruenor41 commented Aug 31, 2017

here

@DanAlbert
Member

Which ABI is the one you're having issues with?

@bruenor41
Author

bruenor41 commented Aug 31, 2017

armeabi-v7a. I don't own an x86-based device; I could test it in the emulator with HAXM enabled, but I haven't. I keep the old armeabi only as a possible fallback; it is not used much.

@bruenor41
Author

Does anyone have an idea what the problem might be? I would like to switch to Clang as soon as possible.

@DanAlbert DanAlbert self-assigned this Oct 3, 2017
@DanAlbert DanAlbert added this to the r17 milestone Oct 3, 2017
@DanAlbert
Member

I'll do what I can to minimize a test case for the compiler folks.

@bruenor41
Author

Thank you very much; do not hesitate to contact me if you need anything.

@DanAlbert
Member

How do I build your project?

@DanAlbert
Member

Never mind, I reopened it and Studio stopped complaining.

@DanAlbert
Member

Confirmed a 12% perf regression for clang vs gcc on Pixel 2.

Can you use simpleperf to try to reduce this? I didn't realize the "benchmark" was a full game... If you can get us something more concrete to work with (and something that builds out of the box) then we can take a look, but this is beyond the scope of what we'll be able to look at any time soon, I think.

@DanAlbert DanAlbert modified the milestones: r17, unplanned Oct 12, 2017
@bruenor41
Author

bruenor41 commented Oct 13, 2017 via email

@bruenor41
Author

bruenor41 commented Oct 13, 2017 via email

@ehem

ehem commented Nov 1, 2017

Might this be related to #21?

@stephenhines
Collaborator

This is probably not related to #21. As I mentioned above, this project is a combination of many libraries, so it is very likely that there are other places where optimization flags are not being set or configured properly. I found one by quick inspection, and that already improved things. Diagnosing the rest of this will require someone to spend a lot more time with simpleperf and the entire build. I will note that this is not a high priority thing for my team to do, as our platform benchmarks don't show similar regressions (in performance or code size).

@arturbac

arturbac commented Aug 23, 2018

Two things should be changed in this project:

  • remove armeabi and add arm64-v8a (if the project works on 64-bit; there are twice as many registers in 64-bit mode)
  • switch from the mk build system to CMake

@bruenor41
Author

Yes, armeabi is obsolete. The arm64 dynamic recompiler is finished and will be added. I have not tested it with Clang and GCC yet; maybe it will be better. However, what I am reporting here is the 32-bit case.

Could there be differences in the resulting binaries when using CMake?

@arturbac

arturbac commented Aug 23, 2018

  1. arm64-v8a on Android must be built against at least Android 5.x (API 21); there were no older 64-bit Android versions.
    armeabi-v7a with NDK 17 assumes API 14 (Android 4.x).
  2. As I mentioned, in a 64-bit build the CPU has twice as many registers available, which is significant for optimization and the generated code.
  3. CMake: mk is a deprecated build system, and android.toolchain.cmake is upgraded continuously; it is the piece that decides which default compilation flags are used and which libs are linked. I don't know whether there is a difference between mk and android.toolchain.cmake if both are still maintained, but staying on the mk build system doesn't make sense anyway.
    However, both systems, configured manually, can generate the same compilation options.
    And those options passed to the compiler affect the generated code the most.

@arturbac

BTW, in 64-bit mode you also get Advanced SIMD instructions, but more importantly, integer division is done in hardware by default, whereas on armv7a idiv is done by a software implementation by default, because not all armv7a CPUs have the idiva/idivt extensions, which are not mandatory for armv7.
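(Illustrative only, not from the thread: if every device the app ships to is known to implement the optional ARMv7 idiv extensions, a per-ABI flag like the hypothetical snippet below lets the compiler emit hardware sdiv/udiv in the 32-bit build. Whether that assumption holds depends entirely on the supported hardware.)

# Android.mk (sketch) -- only safe if all target devices implement idiva/idivt
ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
    LOCAL_CFLAGS += -mcpu=cortex-a15    # Cortex-A15-class cores include hardware idiv
endif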

@bruenor41
Author

Thank you. That's the reason the arm64 dynamic recompiler was made. However, I must still compile for 32-bit devices...

@arturbac

  1. I looked at the perf results from Clang.

The hot spot is in a very inefficient method:

86 27.000 ms 27.000 ms bool PageHandler::writeb_checked(PhysPt addr,Bitu val) {
87 2298.000 ms 173.000 ms writeb(addr,val); return false;
88     }

Every single write is done by:

  • calling a function
  • which dereferences the address
  • stores the result
  • returns

This one should at least be moved to a header (see the sketch below this comment), or the project should be built with thin LTO:

    bool PageHandler::writeb_checked(PhysPt addr,Bitu val) {
        writeb(addr,val); return false;
    }

  2. GCC profiling has totally different hot-spot functions.

  • It looks like some #defines in the code are enabled/disabled when the compiler is GCC; during Clang compilation the code seems to be using a checked version of the implementation which probably should be enabled only for debugging purposes. The "checked" tag in the function name and its inefficient implementation in the .cpp built by Clang suggest that too.

When I dug into what writeb does, I saw a blx instruction in the assembly, a conditional branch during the write. In the source there is:

void PageHandler::writeb(PhysPt addr,Bitu /*val*/) {
    E_Exit("No byte handler for write to %x",addr);
}

So it looks like with Clang, E_Exit does something CPU-hungry (logging or whatever) which is disabled in the GCC release build.
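(A minimal C++ sketch of the "move it to the header" suggestion above. The PhysPt/Bitu typedefs and the class layout are assumptions rather than the project's real declarations; if writeb_checked is actually virtual in the project, thin LTO is the more realistic route, since a plain header move would not remove the virtual dispatch.)

#include <cstdint>

typedef uint32_t  PhysPt;   // assumed width
typedef uintptr_t Bitu;     // assumed width

class PageHandler {
public:
    virtual void writeb(PhysPt addr, Bitu val);
    // Defined inline in the header so every call site can inline the wrapper
    // instead of paying a call/return for each emulated byte write:
    bool writeb_checked(PhysPt addr, Bitu val) {
        writeb(addr, val);
        return false;
    }
};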

@bruenor41
Author

Hmm, yes, I know there are some GCC-specific flags in the code. I played with them, but without much success. If I remember correctly, I removed the line with E_Exit, but again without luck. I'll try it again; it has been a while. Many thanks for your help.

@enh
Contributor

enh commented Aug 23, 2018

One note: ndk-build is not deprecated. It's every bit as supported as cmake is.

For the most part, the two should behave the same, and where they don't you should feel free to report bugs. Work in NDK r19 to move logic out of the build systems and into the clang driver should also help here.

@DanAlbert
Member

@bruenor41: sorry, I'd missed this update:

@DanAlbert I added

LOCAL_LDFLAGS := -Bsymbolic

to the armv7.mk file and to Android.mk for magiclib, for all libs. I hope the syntax is correct, but no luck. I tested NDK 17 and here is the result:

NDK 17b
clang 2195
gcc 2020

GCC is still at the same value; Clang seems to be a bit faster. Or, if the syntax is not correct, could you fix both mk files and post them?

Thank you

I wonder if the flags aren't being added in the right order. You could add -v to your LOCAL_LDFLAGS (APP_LDFLAGS would be even better). That would also help you figure out what changed for GCC between the two releases. -v will show you exactly how the compiler internals and the linker were invoked. Diff the two and see what changed.

@bruenor41
Author

bruenor41 commented Oct 17, 2018

Hi everyone. After a long time, I was able to compile this benchmark for arm64-v8a, so I could compare its performance against armeabi-v7a. I used NDK 15 and compared Clang and GCC binaries. And I was surprised! While the arm7 binaries made by Clang show the big performance degradation against GCC, the arm8 binaries give almost the same result for both compilers. Performance is slightly better with the GCC binaries, by around 3%.

The arm7 vs. arm8 comparison took so long because the benchmark uses a dynamic recompiler, which had not been written for arm8 before; now it has.

Edit: tested NDK 18b. Clang binaries give the same results for arm7 and arm8 as with NDK 15, no regression, no improvement.

@DanAlbert
Member

Reading back through this thread, it sounds like the differences discussed here are a result of 1. differences in inlining behavior and 2. intentional differences in behavior between clang and gcc within the project. The first is a trade-off where Clang has made a different decision than GCC, and the latter needs to be fixed in the project itself, not Clang.

The arm7 vs. arm8 comparison took so long because the benchmark uses a dynamic recompiler, which had not been written for arm8 before; now it has.

If I'm understanding this correctly, it sounds very plausible that the project's recompiler has just been optimized thoroughly for 32-bit but not for 64-bit.

Given that, I don't think there's any action to be taken here.

@bruenor41
Author

bruenor41 commented Oct 19, 2018

No, Dan. Previously I compared 32-bit binaries made by Clang against 32-bit binaries made by GCC. GCC performs much better in execution speed. I removed all GCC-specific flags from the project, but it did not help; GCC still optimizes it better.

Now I have compared 64-bit binaries made by GCC against 64-bit binaries made by Clang, and performance is roughly the same. Why? In this case I reverted the sources back to the version with the GCC flags, so they were included during the test.

@DanAlbert
Member

I've spent the last few hours just trying to get your project to build. I finally succeeded, but it crashes when I start it.

I stand by what I said above. The fact that you do see the regression with GCC in newer NDKs makes me very confident that this is not a Clang issue. If you really want to dig in to what changed between r16's and r17's GCC, add

APP_CFLAGS := -v
APP_LDFLAGS := -v

to your Application.mk files and rebuild. gcc will show the flags that it is using to invoke the compiler, the assembler, the linker, etc. Look for changes in the flags used between your old GCC build and a new one. Odds are that will have your answer.

This is what I was trying to do while trying to build your project, but apparently one successful build is all I get. After cleaning the project, the Java code doesn't even compile.

@bruenor41
Author

bruenor41 commented Oct 19, 2018

Thank you very much for your digging, I really appreciate it. My problem is not that GCC has changed between NDK 16 and NDK 17 and is getting even worse. My problem is that I need to switch to Clang, and the project compiled by Clang has produced much slower binaries from the beginning; in my case the beginning means NDK r10e, I did not test older versions.

Edit: As I said, only the 32-bit version is the problem.
Edit2: I use NDK 15.

I am not sure why you are having problems compiling it, but I can look into this in the next hour. I will be very glad for any clue.

Edit:
I updated the GitHub project to the latest Android Studio. You should not have problems now. You only need to update the paths as described in readme.txt.

@DanAlbert
Member

Updating studio and Gradle was the first thing I had to do, but I also had to make changes to get things working with externalNativeBuild. Without that it's very difficult to get useful information from Studio. It's something you'll want to do for your project regardless.

@bruenor41
Author

bruenor41 commented Oct 20, 2018

Ah, I see now, sorry. I was releasing a new version and did not realize that my app still uses NDK r10e. The same goes for the GitHub version of the benchmark; I updated to a newer NDK only in my local repository. I will fix that.

Edit: fixed. Tested with NDK 15c and 18b. With NDK 15c you can compare the differences between GCC and Clang. Binaries made by Clang in NDK 18b are slightly faster than those made by Clang in NDK 15c, but still behind binaries made by GCC in NDK 15c and lower.

@bruenor41
Author

Hi Dan, I updated the project; you shouldn't have problems now. I will be thankful for any kind of help. I have spent a lot of time on debugging and on switching between Clang and GCC flags or removing them. I do not know what to focus on now, except debugging the assembly, which is beyond my knowledge...

@DanAlbert
Member

Thanks. It won't be taking priority over other work, but I'll keep it in my list of Friday projects.

@bruenor41
Author

Yes, I understand, no hurry. Many thanks

@ZaccurLi

Is there any conclusion on this topic?
We faced the same issue when we switched all of our C++ libraries from GCC to Clang.
A performance decrease also occurred.
OpenCV and our own source code, all compiled with NDK 17, were running much slower.
Can anyone give an update if they have succeeded in resolving this issue? Thanks.

@noyanc

noyanc commented Sep 1, 2019

Hi, we also face the same issue after switching from GCC to Clang.

The old framerate of the game was 55 FPS. After moving to Clang, the framerate dropped to 22 FPS on the same device. Over 50% performance loss! (We build against ndk-r19c using -Ofast3 and all other additional optimization flags.)

Since Clang and 64-bit compilation are mandatory, this issue has become urgent for my team!

It would be great to hear of any progress.

@bruenor41
Author

bruenor41 commented Sep 1, 2019 via email

@joeshow79

joeshow79 commented Sep 2, 2019

Hey, we found that the performance downgrade seems to be introduced by OpenMP.
Take the following test sample as an example.
main.cc:

int main() {
    int sum;
    int N = 100000000;
#pragma omp parallel for reduction(+: sum) schedule(dynamic,10)
    for(int i=0; i<N; i++){
        sum += i;
    }
}

Makefile:

all: test_gcc test_clang test_gcc_omp test_clang_omp

test_gcc_omp: main.cc
	g++ -fopenmp -o $@ $^

test_clang_omp: main.cc
	clang++ -fopenmp -o $@ $^

test_gcc: main.cc
	g++ -o $@ $^

test_clang: main.cc
	clang++ -o $@ $^

test script:

echo "test gcc version:"
time ./test_gcc

echo "---------------------------------"

echo "test clang version:"
time ./test_clang

echo "---------------------------------"

echo "test gcc omp version:"
time ./test_gcc_omp

echo "---------------------------------"

echo "test clang omp version:"
time ./test_clang_omp

The results look like this across several runs of the test (tested in a Linux Docker container, without the NDK):
bash test.sh
test gcc version:

real 0m0.236s
user 0m0.236s
sys 0m0.000s

test clang version:

real 0m0.233s
user 0m0.232s
sys 0m0.001s

test gcc omp version:

real 0m0.243s
user 0m0.469s
sys 0m0.015s

test clang omp version:

real 0m0.331s
user 0m0.651s
sys 0m0.008s

It shows that Clang is even a little bit faster than the GCC counterpart if OpenMP is disabled. However, performance degrades drastically when OpenMP is enabled. A similar result was observed for the test on Android (with the NDK), based on the flame chart produced by another test. I cannot paste a screenshot of the flame chart here for more detail.

So could anybody dig further to see what Clang's OpenMP does behind the scenes?

@tpetri

tpetri commented Mar 8, 2020

Hi,

I have been following this thread for many years now in the hope of a solution; we have the exact same experience as many here. Code compiled for Android with Clang and OpenMP results in really bad performance. On other platforms we do not see these issues.

Has anyone had success in finding a solution/reason/workaround? Would it be possible for the NDK devs to maybe have another look?

@MoNTE48

MoNTE48 commented Mar 8, 2020

Android developers just don't care at all. Forget it; it is closed because it will never be fixed. Buy new devices. Old ones don't bring money.

@arturbac

arturbac commented Mar 8, 2020

First check the source code of the bottleneck that shows a significant difference, then check whether that code has GCC-specific macro definitions that cause it to generate extraordinarily optimized code, for example for SIMD.
Compare the performance of optimized code only, with at least -O2.
I can say that code generated by Clang is in 99% of cases at least as good as code generated by GCC, and there are many situations where Clang generates much better code than GCC for aarch64!
Use Compiler Explorer on parts of the code and compare the code generated by Clang and GCC.
It will be very hard for you to prove that Clang generates even slightly slower code.

@arturbac

arturbac commented Mar 8, 2020

int main() {
    int sum;
    int N = 100000000;
#pragma omp parallel for reduction(+: sum) schedule(dynamic,10)
    for(int i=0; i<N; i++){
        sum += i;
    }
}

This test doesn't make any sense, because:

  1. You are compiling without any specified optimizations.
  2. With an optimized build (e.g. -Ofast3), Clang will optimize out the entire loop, since the result of the computation is unused:

    main: // @main
        mov w0, wzr
        ret

Here is an exact comparison that uses the result of the computation and enables optimization:
https://godbolt.org/z/ioPEzB
PS: Initialize your variables; adding to an uninitialized variable is UB in the generated code.
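(For illustration only: a minimal corrected variant of the micro-benchmark above, in the spirit of this point; it is not the exact code behind the Godbolt link. sum is initialized and the result is printed, so the compiler cannot discard the loop.)

#include <cstdio>

int main() {
    long long sum = 0;             // initialized: no UB
    const int N = 100000000;
#pragma omp parallel for reduction(+: sum) schedule(dynamic,10)
    for (int i = 0; i < N; i++) {
        sum += i;
    }
    std::printf("%lld\n", sum);    // use the result so the loop cannot be optimized out
    return 0;
}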

@joeshow79

Is OpenMP used in the code? We figured out that there are some issues with it.

We finally found that it has something to do with the OpenMP configuration. We made the following changes, and it seems the Clang- and GCC-compiled binaries are finally at a comparable performance level:

KMP_BLOCKTIME=0
OMP_WAIT_POLICY=PASSIVE

@tpetri

tpetri commented Mar 9, 2020

Thank you @joeshow79 for the hint!

This actually improves the situation a lot! No more strange thread blocking. Just as a remark, on certain phones (e.g. the Honor View 20) there seems to be an OpenMP runtime that listens to the GNU environment variables; I had to set GOMP_SPINCOUNT to 0 to get the same result.

What we did: in Application onCreate() before loading any native libs:

Os.setenv("OMP_WAIT_POLICY", "passive", true);
Os.setenv("KMP_BLOCKTIME", "0", true);
Os.setenv("GOMP_SPINCOUNT", "0", true);

Something else I noticed when giving the Clang/OpenMP issue another try: on certain phones this was fixed after some Android update; the manufacturers were informed about it or noticed it themselves. Anyway, this is the solution for us, thanks again.
