
[Perf] Math functions are significantly slower on Ubuntu #9373

Open
tannergooding opened this Issue Feb 6, 2017 · 20 comments


tannergooding commented Feb 6, 2017

The general performance of the System.Math and System.MathF functions on Ubuntu is pretty bad compared to Windows.

I don't have exact numbers for MacOS right now, but the last time I ran numbers they were consistent with the Windows performance (#4847 (comment) -- note that those numbers are just for the double-precision functions).

Perf Numbers

All performance tests are implemented as follows:

  • 100,000 iterations are executed
  • The times of all iterations are aggregated to compute the Total Time
  • The times of all iterations are averaged to compute the Average Time
  • A single iteration executes some simple operation, using the function under test, 5000 times

The execution time below is the Total Time for all 100,000 iterations, measured in seconds.

The improvement below is how much faster the Ubuntu implementation is than the Windows implementation, computed as (Windows - Ubuntu) / Windows * 100; negative values mean Ubuntu is slower.
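
For reference, a minimal standalone sketch of that measurement shape (illustrative only; the real harness is the checked-in test code linked later in this thread, and `SinDoubleBench` is just a made-up name):

```csharp
using System;
using System.Diagnostics;

public static class SinDoubleBench
{
    private const int Iterations = 100000;     // outer iterations
    private const int OpsPerIteration = 5000;  // calls to the function under test per iteration

    public static void Main()
    {
        double sink = 0.0;
        var stopwatch = Stopwatch.StartNew();

        for (int i = 0; i < Iterations; i++)
        {
            // A single iteration: some simple operation using the function under test, 5000 times.
            for (int j = 0; j < OpsPerIteration; j++)
            {
                sink += Math.Sin(j);
            }
        }

        stopwatch.Stop();

        Console.WriteLine($"Total Time:   {stopwatch.Elapsed.TotalSeconds}s");
        Console.WriteLine($"Average Time: {stopwatch.Elapsed.TotalSeconds / Iterations}s");
        Console.WriteLine($"Sink:         {sink}"); // keep the loop's work observable
    }
}
```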

Hardware: Azure Standard D3 v2 (4 cores, 14 GB Memory) - Same as Jenkins

| Function | Improvement | Execution Time - Windows | Execution Time - Ubuntu |
| --- | --- | --- | --- |
| absdouble | 13.890679% | 0.6268844s | 0.5398059s |
| abssingle | 4.31980625% | 0.5741739s | 0.5493707s |
| acosdouble | -4.67198461% | 7.7831699s | 8.1467984s |
| acossingle | 20.5938899% | 5.9848033s | 4.7522995s |
| asindouble | 25.3777524% | 10.4698488s | 7.8128365s |
| asinsingle | 40.3112132% | 6.2957351s | 3.7578479s |
| atandouble | -44.7552681% | 6.6574728s | 9.6370426s |
| atansingle | -10.3842566% | 5.0162551s | 5.5371559s |
| atan2double | 16.1878259% | 15.3990765s | 12.9063008s |
| atan2single | 13.8083532% | 10.7394211s | 9.2564839s |
| ceilingdouble | -26.5576214% | 1.4910876s | 1.887085s |
| ceilingsingle | -21.4092256% | 1.3302228s | 1.6150132s |
| cosdouble | -119.725433% | 5.5959633s | 12.2957546s |
| cossingle | 13.955364% | 4.5950439s | 3.9537888s |
| coshdouble | -5.46886513% | 9.9931702s | 10.5396832s |
| coshsingle | 18.137377% | 8.5860905s | 7.0287989s |
| expdouble | -68.7967436% | 5.1415544s | 8.6787764s |
| expsingle | -15.2262683% | 3.7621641s | 4.3350013s |
| floordouble | -23.3676423% | 1.4269253s | 1.7603641s |
| floorsingle | -11.0591731% | 1.4640751s | 1.6259897s |
| logdouble | -215.202268% | 4.5492266s | 14.3392654s |
| logsingle | -47.2025199% | 3.7204357s | 5.4765751s |
| log10double | -219.553632% | 5.0886356s | 16.2609199s |
| log10single | -103.115881% | 4.0351799s | 8.1960912s |
| powdouble | -49.8224443% | 26.5690144s | 39.8063468s |
| powsingle | -330.357796% | 11.5863701s | 49.862847s |
| rounddouble | 7.0386177% | 3.3553449s | 3.119175s |
| roundsingle | 1.13396115% | 3.2015118s | 3.1652079s |
| sindouble | -70.4327108% | 4.5421357s | 7.741285s |
| sinsingle | 9.77591546% | 4.1445295s | 3.7393638s |
| sinhdouble | -17.2300933% | 10.204286s | 11.962494s |
| sinhsingle | -23.8942492% | 8.9245106s | 11.0569554s |
| sqrtdouble | -57.2807184% | 2.5168265s | 3.9584828s |
| sqrtsingle | -73.445483% | 1.591266s | 2.759979s |
| tandouble | -141.578221% | 5.6910206s | 13.7482663s |
| tansingle | -74.7429528% | 4.2112797s | 7.3589145s |
| tanhdouble | -162.741805% | 4.83917s | 12.7145226s |
| tanhsingle | -93.3864207% | 5.4909087s | 10.6186718s |

tannergooding commented Feb 6, 2017

FYI. @AndyAyersMS, since you were the last person I discussed this with (almost a year ago).


janvorli commented Feb 6, 2017

@tannergooding I've noticed that you were testing the -ffast-math option. Did it make any difference?
Also, could you share your benchmark app with me? I would like to do some profiling in the PAL.


tannergooding commented Feb 6, 2017

@janvorli, it did not. I attempted to use -ffast-math globally and in the specific PAL layer as well to no effect. I had initially thought this was due to clang not supporting some of the arguments -ffast-math normally sets in GCC (such as -fexcess-precision=fast), but that does not appear to be the case. The libm library seems to be the actual bottleneck here (although I am still unsure why). Using a different libm implementation (such as the version provided by AMD) actually works great and brings it on par with Windows.

As for the benchmark app, the code is actually already checked in (at least for the double-precision functions): https://github.com/dotnet/coreclr/tree/master/tests/src/JIT/Performance/CodeQuality/Math/Functions

I have the single-precision versions available in PR form here: #9354


janvorli commented Feb 6, 2017

@tannergooding thanks for the details. Looking e.g. at the disassembly of the log10 double code on Windows x64 and Linux x64, I can see that the Windows one is implemented in asm, but the Linux one in C. So I guess that carefully optimized hand-written asm is just better than what the C compiler can generate.


tannergooding commented Feb 7, 2017

@janvorli, I think it would be great if we could improve the Linux performance somehow. The current implementation is dreadfully slow and makes it difficult to reliably use System.Math or System.MathF in certain scenarios (such as cross-platform graphics).

It would be even better if we could standardize the various platforms on a single implementation of libm. This would help minimize differences across platforms and improve reliability.


fanoI commented Feb 8, 2017

Would it be possible to have a managed implementation of these methods, maybe using Vector as a base?
Sin and Cos could be written using SSE intrinsics in C: http://gruntthepeon.free.fr/ssemath/sse_mathfun.h
This would make porting to other platforms easier.
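
For illustration only, a naive managed sketch in this spirit (this is not the linked sse_mathfun code: there is no range reduction and it uses plain Taylor coefficients, so it is only reasonable for small |x|):

```csharp
using System.Numerics;

internal static class VectorTrig
{
    // Element-wise Taylor-series approximation of sin(x) over a Vector<float>.
    // A real implementation (sse_mathfun, libm) would do range reduction and
    // use minimax coefficients for full-range accuracy.
    public static Vector<float> SinSmall(Vector<float> x)
    {
        Vector<float> x2 = x * x;
        Vector<float> x3 = x2 * x;
        Vector<float> x5 = x3 * x2;
        Vector<float> x7 = x5 * x2;

        return x
            - x3 / new Vector<float>(6f)      // x^3 / 3!
            + x5 / new Vector<float>(120f)    // x^5 / 5!
            - x7 / new Vector<float>(5040f);  // x^7 / 7!
    }
}
```

Whether something like this can actually compete with the hand-written assembly in the platform libm is exactly the question discussed below.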


tannergooding commented Feb 8, 2017

@fanoI, I don't think that is going to be nearly as performant. The libm implementation on Windows and Mac is written using hand-crafted assembly, to ensure that every bit is optimized to its full potential.


fanoI commented Feb 9, 2017

Doesn't the C# Vector type use compiler intrinsics written in assembler?
It should be equally fast, I think... Linux is probably using the x87 trigonometric functions, which are slow and give wrong results too...


mikedn commented Feb 9, 2017

> Doesn't the C# Vector type use compiler intrinsics written in assembler?

Well, Vector<T> operations are translated to SSE instructions but not all SSE instructions are exposed via Vector<T>. If you happen to need such an instruction then you're out of luck...

And even if you can generate SSE instructions it doesn't mean you can match the performance of hand written assembly code. You may run into perf issues due to less than ideal register allocation and lack of instruction scheduling.

That said, maybe it's worth a try if someone has enough time to spend on this. Perhaps the perf issues aren't significant.

Ultimately I think that the worst issue of this approach is that currently there's no support for SIMD on ARM.


tannergooding commented Feb 9, 2017

I will say that I have attempted porting the handwritten-assembly implementation to C++. The performance decrease was significant enough that it wasn't worthwhile for Windows or Mac (it would likely be even less performant in C#), but it would still be worthwhile for Linux.

The AMD libm implementation is somewhere between the Windows and Mac numbers on all three platforms, but it is not open source and so licensing would likely be an issue (also, it would not work on ARM).


fanoI commented Feb 11, 2017

In your opinion, is the version I linked, which uses SSE intrinsics, worth trying to port to C#?
http://gruntthepeon.free.fr/ssemath/

I'm particularly interested in having as much of the .NET runtime as possible written in C#; being a Cosmos [1] developer, "native methods" mean more work for us!
In any case, having more of .NET in C# would make porting to other platforms easier, right?

[1] https://github.com/CosmosOS/Cosmos


tannergooding commented Feb 11, 2017

The issue here isn't writing the implementation in C# (that itself isn't the hard part); it is that the resulting assembly code produced by the JIT is really suboptimal. This is even the case in C/C++, where we have intrinsics (with a near 1-to-1 mapping to the actual raw assembly) and (in many cases) better optimizations for this type of thing. That is, the C/C++ compiler output shows enough of a decrease in perf to make it worthwhile implementing these in assembly language there as well.

This is also why, in the runtime, these functions are implemented as FCALLs rather than as QCALLs. The overhead in the QCALL implementation is itself significant enough to make the use of an FCALL to the CRT implementation more worthwhile.

As for Cosmos, I am fairly certain there are always going to be a set of functions in the runtime (the FCALLs and QCALLs) for which any AOT compiler will need to be aware that there is no managed implementation. For these functions, the AOT compiler needs to output the appropriate backing call (in this case to the CRT implementation of those functions). If the AOT compiler in question is itself entirely written in managed code and has no access to the CRT implementation, then it would likely need to have an implementation cached internally that is emitted when it encounters said call 😄
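
For context, the managed side of such an FCALL is just an extern method marked as an internal call; a simplified sketch of the pattern (`MathNative` is a made-up name, not the actual corelib declaration) looks like:

```csharp
using System.Runtime.CompilerServices;

internal static class MathNative
{
    // The runtime binds this InternalCall (FCALL) directly to a native helper,
    // which forwards to the platform CRT/libm implementation of cos().
    [MethodImpl(MethodImplOptions.InternalCall)]
    internal static extern double Cos(double d);
}
```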


fanoI commented Feb 13, 2017

Yes, Cosmos has its own AOT compiler called IL2CPU, and when these types of functions are encountered:

  1. P/Invoke
  2. FCALL
  3. QCALL

compilation stops with the error "native method encountered". Since Cosmos is the operating system itself, we have no libc to call, and since C code is not permitted in Cosmos, we have only two possibilities: re-write the native method in C# or, in the worst case, in assembler using our language X#. We call this process "plugging a method".

This is, for example, our implementation of Math.Cos (I know it is naive and probably slow, but for now we are not interested in performance in Cosmos):
https://github.com/CosmosOS/Cosmos/blob/master/source/Cosmos.System.Plugs/System/MathImpl.cs#L143

In the future we will see whether to re-implement them using the Vector class or whether we need to write a different version in assembler for each CPU type we support.


gkhanna79 commented Mar 5, 2017


tannergooding commented Jul 31, 2017

This could be resolved with support for dotnet/designs#13


DemiMarie commented Aug 25, 2017

What about using openlibm?

Openlibm was designed for scientific computing, which must be both fast and accurate. It is also portable. CoreCLR could simply bundle it.
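
For a rough comparison, a hedged sketch of calling openlibm from managed code via P/Invoke (this assumes an openlibm shared library is on the loader path and exports the standard C names; a P/Invoke also carries more call overhead than the FCALL path discussed earlier, so this is only useful for quick experiments):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class OpenLibm
{
    // Assumes libopenlibm.so (Linux) / openlibm.dll (Windows) can be located by the loader.
    [DllImport("openlibm", EntryPoint = "cos")]
    internal static extern double Cos(double x);

    [DllImport("openlibm", EntryPoint = "pow")]
    internal static extern double Pow(double x, double y);
}

public static class OpenLibmDemo
{
    public static void Main()
    {
        // Sanity check against the built-in implementation before any benchmarking.
        Console.WriteLine($"openlibm pow(2, 10) = {OpenLibm.Pow(2.0, 10.0)}");
        Console.WriteLine($"Math.Pow(2, 10)     = {Math.Pow(2.0, 10.0)}");
    }
}
```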


tannergooding commented Aug 25, 2017

@DemiMarie, last time I checked (several months ago now), openlibm wasn't nearly on par with the perf of the Windows or MacOS implementations. Nor was it on par with AMD's or Intel's libm implementations.


tannergooding commented Jul 31, 2018

Copying the comment I added here: #19203 (comment)

Looks like glibc recently (07 AUG 2017) made a few changes: https://sourceware.org/git/?p=glibc.git;a=commit;h=57a72fa3502673754d14707da02c7c44e83b8d20

Namely, they still use the IBM Accurate Mathematical Library as their root source code; however, they now have some new logic which additionally compiles that code with the -mfma and -mavx2 flags, which provides some automatic transformations/optimizations (it looks like they do a cached CPUID check at runtime and jump to the appropriate code).

Additionally, it looks like, since the calling conventions match up, they generally end up calling libm-2.27.so~__pow directly, rather than having an intermediate call through COMDouble::Pow.
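
As an aside, the "cached capability check, then jump to the right variant" pattern glibc uses can be sketched in C# with a static delegate chosen once at startup (here `Vector.IsHardwareAccelerated` is only a stand-in for a real CPUID feature check such as FMA/AVX2, and both variants are placeholders):

```csharp
using System;
using System.Numerics;

internal static class DispatchedMath
{
    // The capability check runs once; the chosen implementation is cached in a delegate.
    private static readonly Func<double, double, double> s_pow =
        Vector.IsHardwareAccelerated ? (Func<double, double, double>)PowFast : PowPortable;

    public static double Pow(double x, double y) => s_pow(x, y);

    // Placeholder variants; a real dispatcher would point these at separately
    // compiled/optimized code paths.
    private static double PowFast(double x, double y) => Math.Pow(x, y);
    private static double PowPortable(double x, double y) => Math.Pow(x, y);
}
```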


tannergooding commented Jul 31, 2018

I will benchmark Ubuntu again tomorrow and update the numbers above.


tannergooding commented Jul 31, 2018

All performance tests are implemented as follows:

  • 100,000 iterations are executed
  • The times of all iterations are aggregated to compute the Total Time
  • The times of all iterations are averaged to compute the Average Time
  • A single iteration executes some simple operation, using the function under test, 5000 times

The execution time below is the Total Time for all 100,000 iterations, measured in seconds.

The improvement below is how much faster the Ubuntu implementation is than the Windows implementation, computed as (Windows - Ubuntu) / Windows * 100; negative values mean Ubuntu is slower.

Hardware: Azure Standard D4s v3 (4 cores, 16 GB Memory)

| Function | Improvement (%) | Execution Time - Windows (s) | Execution Time - Ubuntu (s) |
| --- | --- | --- | --- |
| absdouble | 3.546181638 | 0.5606763 | 0.5407937 |
| abssingle | 8.480824498 | 0.5905947 | 0.5405074 |
| acosdouble | 23.23729332 | 8.8021891 | 6.7567986 |
| acossingle | 36.07494481 | 6.8471822 | 4.377065 |
| asindouble | 32.04241391 | 9.5848378 | 6.5136244 |
| asinsingle | 48.45349467 | 7.5545682 | 3.8941159 |
| atandouble | -27.47165151 | 6.9858301 | 8.904953 |
| atansingle | 15.17043945 | 5.8293196 | 4.9449862 |
| atan2double | 25.37952497 | 15.9935921 | 11.9344944 |
| atan2single | 12.95986626 | 12.0436945 | 10.4828478 |
| ceilingdouble | 8.114504593 | 0.8137071 | 0.7476788 |
| ceilingsingle | 1.43315031 | 0.7909917 | 0.7796556 |
| cosdouble | -27.04654657 | 6.9580288 | 8.8399353 |
| cossingle | 45.21672442 | 6.0522668 | 3.31563 |
| coshdouble | 22.86077168 | 11.1648917 | 8.6125113 |
| coshsingle | 41.8002351 | 10.5539471 | 6.1423724 |
| expdouble | -15.65069716 | 7.5727847 | 8.7579783 |
| expsingle | 31.85693379 | 5.0553795 | 3.4448906 |
| floordouble | 6.572571637 | 0.783357 | 0.7318703 |
| floorsingle | 7.733552501 | 0.8161773 | 0.7530578 |
| logdouble | -71.7799955 | 5.7604477 | 9.8952968 |
| logsingle | 39.33867461 | 5.1587676 | 3.1293768 |
| log10double | -115.8565365 | 5.986779 | 12.9228538 |
| log10single | -11.42509124 | 4.8094233 | 5.3589043 |
| powdouble | -12.82530235 | 25.8802195 | 29.1994359 |
| powsingle | 37.09252843 | 9.8087886 | 6.1704609 |
| rounddouble | 7.914597043 | 0.8141779 | 0.749739 |
| roundsingle | 7.193275175 | 0.7884934 | 0.7317749 |
| sindouble | -31.31642635 | 5.5328101 | 7.2654885 |
| sinsingle | 33.30900612 | 5.0377057 | 3.359696 |
| sinhdouble | 7.783566656 | 10.7251256 | 9.8903283 |
| sinhsingle | -8.468141754 | 9.1540461 | 9.9292237 |
| sqrtdouble | 3.148131859 | 1.4699289 | 1.4236536 |
| sqrtsingle | -7.038505206 | 0.772441 | 0.8268093 |
| tandouble | -84.81924424 | 6.703659 | 12.3896519 |
| tansingle | -43.26011744 | 4.9282832 | 7.0602643 |
| tanhdouble | -78.9103324 | 5.3309266 | 9.5375785 |
| tanhsingle | -56.22609881 | 6.0341919 | 9.4269826 |