Performance issues for Math.Min, Math.Max for double/float #33057
cc: @tannergooding |
Marked as up-for-grabs as it should be a trivial fix. |
We should also add tests covering these cases. Our existing math tests are here: https://github.com/dotnet/performance/tree/master/src/benchmarks/micro/runtime/Math/Functions |
As a thought: for compliance with IEEE 754:2019 we have
I'll open a PR over there. |
The asm isn't quite as bad when you look at the x64 version. The main bloat comes from the |
This isn't a valid transformation because |
I had previously implemented a branchless version of this code which uses intrinsics, but the results weren't consistent on all platforms: dotnet/coreclr#22965 (this was also a version which didn't propagate NaN, so it would need to be tweaked slightly). We might also be able to gain some perf by ensuring the non Zero/non NaN case is the "predicted" path if the former isn't viable. |
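For reference, the general shape of such a branchless approach can be sketched with the hardware intrinsics API. This is an assumption-laden sketch, not the actual code from dotnet/coreclr#22965, and `BranchlessMin` is a hypothetical name:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Sketch only: x86 MINSD returns the *second* operand when either input is
// NaN, so this version does NOT propagate NaN per IEEE 754:2019 and would
// need tweaking, as noted above.
static double BranchlessMin(double a, double b)
{
    if (Sse2.IsSupported)
    {
        return Sse2.MinScalar(
            Vector128.CreateScalarUnsafe(a),
            Vector128.CreateScalarUnsafe(b)).ToScalar();
    }
    return Math.Min(a, b); // software fallback on non-x86 platforms
}

Console.WriteLine(BranchlessMin(2.0, 3.0)); // 2
```

This is also why the results differed across platforms: the fallback and the intrinsic path disagree on NaN and signed-zero inputs.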
I tried the code with current master (from a checked build). C# code (just copied together):

using System.Runtime.CompilerServices;
namespace ConsoleApp4
{
class Program
{
static int Main(string[] args)
{
double a = 3d;
double b = 4d;
double min = Do(a, b);
return min == a ? 0 : 1;
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static double Do(double a, double b)
{
return Min(a, b);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static double Min(double val1, double val2)
{
// This matches the IEEE 754:2019 `minimum` function
//
// It propagates NaN inputs back to the caller and
// otherwise returns the lesser of the inputs. It
// treats -0 as smaller than +0 as per the specification.
if ((val1 < val2) || IsNaN(val1))
{
return val1;
}
if (val1 == val2)
{
return IsNegative(val1) ? val1 : val2;
}
return val2;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe bool IsNaN(double d)
{
// A NaN will never equal itself so this is an
// easy and efficient way to check for NaN.
#pragma warning disable CS1718
return d != d;
#pragma warning restore CS1718
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe bool IsNegative(double d)
{
return DoubleToInt64Bits(d) < 0;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe long DoubleToInt64Bits(double value)
{
return *((long*)&value);
}
}
}

It is quite "jumpy".

; Assembly listing for method ConsoleApp4.Program:Do(double,double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 9, 5 ) double -> mm0
; V01 arg1 [V01,T01] ( 6, 4.25) double -> mm1
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
; V03 tmp1 [V03,T02] ( 5, 3 ) double -> mm0 "Inline return value spill temp"
; V04 tmp2 [V04,T03] ( 2, 1 ) double -> [rbp-0x08] do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 16
G_M65265_IG01:
55 push rbp
4883EC10 sub rsp, 16
C5F877 vzeroupper
488D6C2410 lea rbp, [rsp+10H]
;; bbWeight=1 PerfScore 2.75
G_M65265_IG02:
C5F92EC8 vucomisd xmm1, xmm0
770A ja SHORT G_M65265_IG04
;; bbWeight=1 PerfScore 2.00
G_M65265_IG03:
C5F92EC0 vucomisd xmm0, xmm0
7A03 jp SHORT G_M65265_IG04
7403 je SHORT G_M65265_IG05
;; bbWeight=0.25 PerfScore 0.75
G_M65265_IG04:
EB24 jmp SHORT G_M65265_IG09
;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG05:
C5F92EC1 vucomisd xmm0, xmm1
7A19 jp SHORT G_M65265_IG08
7517 jne SHORT G_M65265_IG08
C5FB1145F8 vmovsd qword ptr [rbp-08H], xmm0
48837DF800 cmp qword ptr [rbp-08H], 0
7C09 jl SHORT G_M65265_IG07
;; bbWeight=0.25 PerfScore 1.38
G_M65265_IG06:
C5F828C1 vmovaps xmm0, xmm1
EB08 jmp SHORT G_M65265_IG09
;; bbWeight=0.50 PerfScore 1.12
G_M65265_IG07:
EB05 jmp SHORT G_M65265_IG09
;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG08:
C5F828C1 vmovaps xmm0, xmm1
;; bbWeight=0.50 PerfScore 0.12
G_M65265_IG09:
488D6500 lea rsp, [rbp]
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 67, prolog size 13, PerfScore 19.43, (MethodHash=0702010e) for method ConsoleApp4.Program:Do(double,double):double
; ============================================================

The stack work for … I'll try something to improve this... |
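To make the IEEE 754:2019 `minimum` behavior concrete, here is a small self-contained sketch of the same algorithm using the public double.IsNaN/double.IsNegative helpers (illustrative only, not the BCL source):

```csharp
using System;

// Same shape as the Min above: NaN propagates back to the caller, and -0.0
// counts as smaller than +0.0 when the operands compare equal.
static double Minimum(double val1, double val2)
{
    if ((val1 < val2) || double.IsNaN(val1))
        return val1;
    if (val1 == val2)
        return double.IsNegative(val1) ? val1 : val2;
    return val2;
}

Console.WriteLine(Minimum(3.0, 4.0));                      // 3
Console.WriteLine(double.IsNaN(Minimum(double.NaN, 1.0))); // True: NaN propagates
Console.WriteLine(double.IsNegative(Minimum(-0.0, 0.0)));  // True: -0 is the minimum
```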
An attempt at better codegen for … During the work I noticed that the same test-cases in … the double.NaN case (or I just missed other unit tests for Math.Min).
With the latest solution the codegen looks pretty good.

codegen for current master:

; Assembly listing for method ConsoleApp4.Program:Do(double,double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 9, 5 ) double -> mm0
; V01 arg1 [V01,T01] ( 6, 4.25) double -> mm1
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
; V03 tmp1 [V03,T02] ( 5, 3 ) double -> mm0 "Inline return value spill temp"
; V04 tmp2 [V04,T03] ( 2, 1 ) double -> [rbp-0x08] do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 16
G_M65265_IG01:
55 push rbp
4883EC10 sub rsp, 16
C5F877 vzeroupper
488D6C2410 lea rbp, [rsp+10H]
;; bbWeight=1 PerfScore 2.75
G_M65265_IG02:
C5F92EC8 vucomisd xmm1, xmm0
770A ja SHORT G_M65265_IG04
;; bbWeight=1 PerfScore 2.00
G_M65265_IG03:
C5F92EC0 vucomisd xmm0, xmm0
7A03 jp SHORT G_M65265_IG04
7403 je SHORT G_M65265_IG05
;; bbWeight=0.25 PerfScore 0.75
G_M65265_IG04:
EB24 jmp SHORT G_M65265_IG09
;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG05:
C5F92EC1 vucomisd xmm0, xmm1
7A19 jp SHORT G_M65265_IG08
7517 jne SHORT G_M65265_IG08
C5FB1145F8 vmovsd qword ptr [rbp-08H], xmm0
48837DF800 cmp qword ptr [rbp-08H], 0
7C09 jl SHORT G_M65265_IG07
;; bbWeight=0.25 PerfScore 1.38
G_M65265_IG06:
C5F828C1 vmovaps xmm0, xmm1
EB08 jmp SHORT G_M65265_IG09
;; bbWeight=0.50 PerfScore 1.12
G_M65265_IG07:
EB05 jmp SHORT G_M65265_IG09
;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG08:
C5F828C1 vmovaps xmm0, xmm1
;; bbWeight=0.50 PerfScore 0.12
G_M65265_IG09:
488D6500 lea rsp, [rbp]
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 67, prolog size 13, PerfScore 19.43, (MethodHash=0702010e) for method ConsoleApp4.Program:Do(double,double):double
; ============================================================

codegen for 1af43447e2da78426c31030f6a5eb9e6c930876c:

; Assembly listing for method ConsoleApp4.Program:Do(double,double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T03] ( 3, 3 ) double -> mm0
; V01 arg1 [V01,T02] ( 5, 3.50) double -> mm1
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
; V03 tmp1 [V03,T01] ( 8, 8.50) double -> mm0 "Inlining Arg"
; V04 tmp2 [V04,T00] ( 2, 0.50) long -> rax "Inline return value spill temp"
;* V05 tmp3 [V05 ] ( 0, 0 ) double -> zero-ref ld-addr-op "Inlining Arg"
;
; Lcl frame size = 0
G_M65265_IG01:
55 push rbp
C5F877 vzeroupper
488BEC mov rbp, rsp
;; bbWeight=1 PerfScore 2.25
G_M65265_IG02:
C5F92EC8 vucomisd xmm1, xmm0
7725 ja SHORT G_M65265_IG05
;; bbWeight=1 PerfScore 2.00
G_M65265_IG03:
C5F92EC0 vucomisd xmm0, xmm0
7A1E jp SHORT G_M65265_IG05
C5F92EC1 vucomisd xmm0, xmm1
7A13 jp SHORT G_M65265_IG04
7511 jne SHORT G_M65265_IG04
C5F828D0 vmovaps xmm2, xmm0
C4E1F97ED0 vmovd rax, xmm2
4885C0 test rax, rax
7C08 jl SHORT G_M65265_IG05
;; bbWeight=0.25 PerfScore 1.88
G_M65265_IG04:
C5F828C1 vmovaps xmm0, xmm1
;; bbWeight=0.25 PerfScore 0.06
G_M65265_IG05:
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 1.50
; Total bytes of code 47, prolog size 7, PerfScore 12.89, (MethodHash=0702010e) for method ConsoleApp4.Program:Do(double,double):double
; ============================================================

What I don't like with this solution are the gotos, and it will be pretty fragile if the JIT's codegen improves, as ideally the JIT should lay out the method that way itself. With regard to branch prediction I also tried to split … Any thoughts on this? |
We have
Unfortunately this version doesn't handle
There are a few other places this is done in the framework for similar reasons.
In order to correctly implement the function, we have to pick the lesser of x or y, treating … It would likely be good to get implementations for all three cases done and then run benchmarks on a few machines. |
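The edge cases described in this thread can be captured as a small checklist that any candidate implementation has to pass (a sketch; `CheckMin` is a hypothetical helper, not from any PR):

```csharp
using System;

static void CheckMin(Func<double, double, double> min)
{
    // NaN on either side must propagate.
    if (!double.IsNaN(min(double.NaN, 1.0))) throw new Exception("NaN lhs");
    if (!double.IsNaN(min(1.0, double.NaN))) throw new Exception("NaN rhs");
    // -0.0 must be treated as smaller than +0.0, regardless of argument order.
    if (!double.IsNegative(min(-0.0, +0.0))) throw new Exception("-0/+0");
    if (!double.IsNegative(min(+0.0, -0.0))) throw new Exception("+0/-0");
    // Ordinary ordering.
    if (min(2.0, 3.0) != 2.0) throw new Exception("ordered");
    if (min(3.0, 2.0) != 2.0) throw new Exception("reversed");
}

CheckMin(Math.Min); // the IEEE 754:2019-compliant implementation passes all of these
Console.WriteLine("all edge cases pass");
```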
With regards to the hybrid approach, I mean that |
@tannergooding side question: should the workaround for BitConverter double <-> long be handled separate? (Cf. #12733 (comment)). |
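As context for the question: the IsNegative helper above relies on reinterpreting the double's bits as a long, whose sign-bit behavior can be seen directly through the BCL API (a minimal sketch):

```csharp
using System;

double value = -0.0;

// The sign bit of the double becomes the sign bit of the long, so a negative
// long means a negative (or negative-zero) double:
long bits = BitConverter.DoubleToInt64Bits(value);
Console.WriteLine(bits < 0);                 // True for -0.0
Console.WriteLine(double.IsNegative(value)); // True: equivalent check via the BCL
```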
In gfoidl@b2fe76d I added more tests for the edge cases. Tested variants are here:
- default (current master)
- variant from #33057 (comment)
- a version based on dotnet/coreclr#22965

All three pass the tests (on Intel x64). Benchmark results (win-x64) and generated asm (linux-x64) are in this folder. The raytracer -- as mentioned in #12159 -- is hacked together and gives the following numbers on my machine (Intel x64):
- Default: 48021 ms
- InlinedOptimized: 38675 ms
- Vectorized: 47637 ms

(Quite interesting that the branchless vectorized version doesn't perform better (on my machine) -- I didn't investigate why... maybe later.) Unfortunately I don't have access to other CPU platforms (AMD, ARM) to run tests on. Are there any other real-world benchmarks we can run for min/max? So based on these numbers, I would go with the InlinedOptimized variant. |
Thank you for this change. I'm sure it will help.
I see this class has an IsNaN implementation; I'm not sure why its implementation differs from double.IsNaN. The only benefit I see is inlining. But to me the behavior of Math.IsNaN(double.NaN) is unclear.
|
It's just copied over from |
Yes, it should be a separate PR with its own jit-diff and potentially perf data. |
Thanks @gfoidl, I'm going to run some numbers on my AMD and Intel boxes in a bit and will share back as well. |
Variants tested and disassembly: https://gist.github.com/tannergooding/d81a6cd7530ec1cdc27e08530922f02a Notably, there are a few places where codegen could be improved; but inlining and reordering things to assume inputs aren't the same or NaN provides the easiest win. Ryzen 3900X
Ryzen 1800X
i7-7700
|
That's why we need a … Some dotnet/perf benchmarks are up to 3x faster with Fast-Math mode on Mono-LLVM (e.g. the Burgers ones). |
@tannergooding I updated my benchmarks with your variants -> results.
All variants use the optimized BitConverter. From these numbers (for …). For the raytracer I get the following numbers (…):
For AMD it seems that … Can you run your benchmarks with …? |
I see the following. Part of the difference in the numbers is due to the inputs given; as such I've run the benchmark with 4 different input variations and listed them below. From the numbers, you can see that …

All inputs: val1 < val2

Ryzen 3900X
i7-7700
1/2 Inputs: val1 < val2; 1/2 Inputs: val2 < val1

Ryzen 3900X
i7-7700
1/3 Inputs: val1 < val2; 1/3 Inputs: val2 < val1; 1/3 Inputs: NaN

Ryzen 3900X
i7-7700
1/4 Inputs: val1 < val2; 1/4 Inputs: val2 < val1; 1/4 Inputs: NaN; 1/4 Inputs: val1 == val2

Ryzen 3900X
i7-7700
|
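The input-variation methodology above can be sketched as follows (assumed shape, not the actual harness; a Stopwatch loop rather than the BenchmarkDotNet setup behind the tables):

```csharp
using System;
using System.Diagnostics;

// Build inputs with a given fraction of NaNs so the branch predictor sees
// the chosen distribution, then time Math.Min over the whole array.
static double[] MakeInputs(int n, double nanFraction)
{
    var rng = new Random(42); // fixed seed for repeatability
    var data = new double[n];
    for (int i = 0; i < n; i++)
        data[i] = rng.NextDouble() < nanFraction ? double.NaN : rng.NextDouble();
    return data;
}

var a = MakeInputs(1_000_000, 0.0);
var b = MakeInputs(1_000_000, 0.25); // e.g. 1/4 NaN inputs, as in the last table

var sw = Stopwatch.StartNew();
double sink = 0;
for (int i = 0; i < a.Length; i++)
    sink += Math.Min(a[i], b[i]); // 'sink' keeps the loop from being optimized away
sw.Stop();
Console.WriteLine($"{sw.ElapsedMilliseconds} ms");
```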
Looking at https://www.agner.org/optimize/instruction_tables.pdf, Skylake has a 4-cycle latency while Ryzen has only a 1-cycle latency (for …).
It might have to do with internal mechanics of how the branch predictor works. It could also be a side effect of the test or some other factor.
We need something that isn't … |
Issue Title
Using Math.Min in a loop is almost 4 times slower than before.
General
Porting our libraries to .NET Core 3.1, we noticed performance issues with Math.Min. Min is included in a loop with millions of iterations. The loop execution was almost 4 times slower than before.
I looked at the source code and realized that the AggressiveInlining option (MethodImplAttribute) was missing on Math.Min and Math.Max for floating-point arguments.
Of course the function has changed to be compliant with the new IEEE standard, but I guess it still can be inlined.
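The reported pattern looks roughly like this (an illustrative sketch, not the reporter's actual code): without inlining, every iteration pays a call into Math.Min.

```csharp
using System;

// A hot loop computing a running minimum over millions of values; if
// Math.Min is not inlined, the per-iteration call overhead dominates.
var data = new double[5_000_000];
var rng = new Random(123);
for (int i = 0; i < data.Length; i++)
    data[i] = rng.NextDouble();

double minVal = double.PositiveInfinity;
for (int i = 0; i < data.Length; i++)
    minVal = Math.Min(minVal, data[i]);

Console.WriteLine(minVal >= 0.0 && minVal < 1.0); // True
```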