Performance issues for Math.Min, Math.Max for double/float #33057

Closed · polijp opened this issue Mar 2, 2020 · 24 comments · Fixed by #33851
Labels
area-System.Numerics · help wanted [up-for-grabs] · tenet-performance

Comments

@polijp commented Mar 2, 2020

Issue Title

Using Math.Min in a loop is almost 4 times slower than before.

General

While porting our libraries to .NET Core 3.1, we noticed performance issues with Math.Min. Min is called in a loop with millions of iterations. The loop execution was almost 4 times slower than before.

I looked at the source code and realized that the AggressiveInlining option (MethodImplAttribute) was missing on Math.Min and Math.Max for floating-point arguments.
Of course, the function has changed to be compliant with the new IEEE standard, but I guess it can still be inlined.
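
A minimal repro of the kind of loop described above (a sketch; the array contents and iteration counts are illustrative, not our actual library code):

using System;
using System.Diagnostics;

class Repro
{
    static void Main()
    {
        var rng = new Random(1);
        double[] data = new double[1_000];
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = rng.NextDouble();
        }

        var sw = Stopwatch.StartNew();
        double min = double.MaxValue;
        for (int i = 0; i < 10_000_000; i++)
        {
            // On .NET Core 3.1 this call is not inlined for double arguments,
            // so the call overhead dominates the loop.
            min = Math.Min(min, data[i % data.Length]);
        }
        sw.Stop();
        Console.WriteLine($"{sw.ElapsedMilliseconds} ms (min = {min})");
    }
}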

@scalablecory transferred this issue from dotnet/core Mar 2, 2020
@Dotnet-GitSync-Bot added the untriaged label Mar 2, 2020
@jkotas added the area-System.Runtime and tenet-performance labels Mar 2, 2020
@stephentoub (Member)

cc: @tannergooding

@tannergooding added the area-System.Numerics and help wanted [up-for-grabs] labels and removed the area-System.Runtime and untriaged labels Mar 2, 2020
@tannergooding (Member)

Marked as up-for-grabs as it should be a trivial fix.

@tannergooding (Member) commented Mar 2, 2020

We should also add tests covering Min/Max to dotnet/performance. @adamsitnik might be able to help with this side of things.

Our existing math tests are here: https://github.com/dotnet/performance/tree/master/src/benchmarks/micro/runtime/Math/Functions
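
For illustration, a microbenchmark of roughly this shape could be added there (a minimal sketch; the class name and input setup are hypothetical, not the actual dotnet/performance code):

using System;
using BenchmarkDotNet.Attributes;

public class MathMinMaxBenchmarks
{
    private double[] _a, _b;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _a = new double[1024];
        _b = new double[1024];
        for (int i = 0; i < _a.Length; i++)
        {
            _a[i] = rng.NextDouble();
            _b[i] = rng.NextDouble();
        }
    }

    [Benchmark]
    public double MinDouble()
    {
        // Sum the results so the calls aren't dead-code eliminated.
        double sum = 0;
        for (int i = 0; i < _a.Length; i++)
            sum += Math.Min(_a[i], _b[i]);
        return sum;
    }

    [Benchmark]
    public double MaxDouble()
    {
        double sum = 0;
        for (int i = 0; i < _a.Length; i++)
            sum += Math.Max(_a[i], _b[i]);
        return sum;
    }
}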

@gfoidl (Member) commented Mar 3, 2020

Math.Min(double, double) produces quite a lot of asm (see methods M1 and M2). Do we really want to force the inline?


A thought:
If we know from the use case that neither the first nor the second argument will be NaN, there could be a "fast" version that emits minsd (on x86) for full speed (ideally method M0 would emit minsd).
The JIT can be tricked into this by using hardware intrinsics, but ideally it would do this by itself.
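
A sketch of that trick, assuming SSE2 and deliberately relaxed semantics (no NaN inputs, -0.0 vs +0.0 ignored); the type and method names are illustrative:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FastMath
{
    // Non-IEEE "fast" min: only valid when neither input is NaN and the sign
    // of zero doesn't matter; emits a single minsd on x64.
    public static double MinFast(double x, double y)
    {
        if (Sse2.IsSupported)
        {
            return Sse2.MinScalar(Vector128.CreateScalarUnsafe(x),
                                  Vector128.CreateScalarUnsafe(y)).ToScalar();
        }
        return x < y ? x : y; // software fallback with the same relaxed semantics
    }
}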

For compliance with IEEE 754:2019 we have Math.Min.


add tests covering Min/Max to dotnet/performance

I'll open a PR over there.

@tannergooding (Member)

The asm isn't quite as bad when you look at the x64 version.

The main bloat comes from the IsNaN check, which was cleaned up back in October and should be much smaller in .NET 5 (which SharpLab doesn't support right now).

@tannergooding (Member)

If we know from the use case that neither the first nor the second argument will be NaN, there could be a "fast" version that emits minsd (on x86) for full speed (ideally method M0 would emit minsd).

This isn't a valid transformation because minsd doesn't correctly handle +0.0 vs -0.0: when both inputs are zero it always returns the second operand, whereas IEEE 754 requires Min(-0.0, +0.0) to return -0.0.
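
For example, a small check of the difference (a sketch; uses the same System.Runtime.Intrinsics namespaces as above):

// minsd returns its second operand when both inputs are zero, so the sign is lost:
double viaMinsd = Sse2.MinScalar(Vector128.CreateScalarUnsafe(-0.0),
                                 Vector128.CreateScalarUnsafe(+0.0)).ToScalar();
double viaMathMin = Math.Min(-0.0, +0.0);

Console.WriteLine(double.IsNegative(viaMinsd));   // False: minsd returned +0.0
Console.WriteLine(double.IsNegative(viaMathMin)); // True:  IEEE 754 `minimum` yields -0.0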

@tannergooding (Member)

I had previously implemented a branchless version of this code which uses intrinsics, but the results weren't consistent on all platforms: dotnet/coreclr#22965 (this was also a version which didn't propagate NaN, so it would need to be tweaked slightly).

We might also be able to gain some perf by ensuring the non-zero/non-NaN case is the "predicted" path, if the former isn't viable.
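
One possible branchless shape is sketched below (this is only an illustration, not the code from dotnet/coreclr#22965). It returns a NaN whenever either input is NaN, though not necessarily with the original payload, and it gets the -0.0 vs +0.0 case right by OR-ing the two minsd results (minsd returns its second operand when the inputs are both zero or when either is NaN):

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class BranchlessMath
{
    public static double MinBranchless(double x, double y)
    {
        if (Sse2.IsSupported)
        {
            Vector128<double> a = Vector128.CreateScalarUnsafe(x);
            Vector128<double> b = Vector128.CreateScalarUnsafe(y);

            // Evaluating minsd both ways and OR-ing the bit patterns yields -0.0
            // for a (+0.0, -0.0) pair and a NaN when either input is NaN; for all
            // other inputs both calls produce the same bits, so the OR is a no-op.
            return Sse2.Or(Sse2.MinScalar(a, b), Sse2.MinScalar(b, a)).ToScalar();
        }
        return Math.Min(x, y); // scalar fallback
    }
}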

@gfoidl (Member) commented Mar 4, 2020

I tried the code with current master (from a checked build).

C# code (just copied together)
using System.Runtime.CompilerServices;

namespace ConsoleApp4
{
    class Program
    {
        static int Main(string[] args)
        {
            double a = 3d;
            double b = 4d;
            double min = Do(a, b);

            return min == a ? 0 : 1;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static double Do(double a, double b)
        {
            return Min(a, b);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static double Min(double val1, double val2)
        {
            // This matches the IEEE 754:2019 `minimum` function
            //
            // It propagates NaN inputs back to the caller and
            // otherwise returns the lesser of the inputs. It
            // treats +0 as greater than -0 as per the specification.

            if ((val1 < val2) || IsNaN(val1))
            {
                return val1;
            }

            if (val1 == val2)
            {
                return IsNegative(val1) ? val1 : val2;
            }

            return val2;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe bool IsNaN(double d)
        {
            // A NaN will never equal itself so this is an
            // easy and efficient way to check for NaN.

#pragma warning disable CS1718
            return d != d;
#pragma warning restore CS1718
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe bool IsNegative(double d)
        {
            return DoubleToInt64Bits(d) < 0;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static unsafe long DoubleToInt64Bits(double value)
        {
            return *((long*)&value);
        }
    }
}

It is quite "jumpy".

; Assembly listing for method ConsoleApp4.Program:Do(double,double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  9,  5   )  double  ->  mm0
;  V01 arg1         [V01,T01] (  6,  4.25)  double  ->  mm1
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T02] (  5,  3   )  double  ->  mm0         "Inline return value spill temp"
;  V04 tmp2         [V04,T03] (  2,  1   )  double  ->  [rbp-0x08]   do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 16

G_M65265_IG01:
       55                   push     rbp
       4883EC10             sub      rsp, 16
       C5F877               vzeroupper
       488D6C2410           lea      rbp, [rsp+10H]
                        ;; bbWeight=1    PerfScore 2.75
G_M65265_IG02:
       C5F92EC8             vucomisd xmm1, xmm0
       770A                 ja       SHORT G_M65265_IG04
                        ;; bbWeight=1    PerfScore 2.00
G_M65265_IG03:
       C5F92EC0             vucomisd xmm0, xmm0
       7A03                 jp       SHORT G_M65265_IG04
       7403                 je       SHORT G_M65265_IG05
                        ;; bbWeight=0.25 PerfScore 0.75
G_M65265_IG04:
       EB24                 jmp      SHORT G_M65265_IG09
                        ;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG05:
       C5F92EC1             vucomisd xmm0, xmm1
       7A19                 jp       SHORT G_M65265_IG08
       7517                 jne      SHORT G_M65265_IG08
       C5FB1145F8           vmovsd   qword ptr [rbp-08H], xmm0
       48837DF800           cmp      qword ptr [rbp-08H], 0
       7C09                 jl       SHORT G_M65265_IG07
                        ;; bbWeight=0.25 PerfScore 1.38
G_M65265_IG06:
       C5F828C1             vmovaps  xmm0, xmm1
       EB08                 jmp      SHORT G_M65265_IG09
                        ;; bbWeight=0.50 PerfScore 1.12
G_M65265_IG07:
       EB05                 jmp      SHORT G_M65265_IG09
                        ;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG08:
       C5F828C1             vmovaps  xmm0, xmm1
                        ;; bbWeight=0.50 PerfScore 0.12
G_M65265_IG09:
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret
                        ;; bbWeight=1    PerfScore 2.00

; Total bytes of code 67, prolog size 13, PerfScore 19.43, (MethodHash=0702010e) for method ConsoleApp4.Program:Do(double,double):double
; ============================================================

The stack work for DoubleToInt64Bits isn't needed (it could be a movq); this is already tracked by #12733 (comment).
A workaround is easy.
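
One easy workaround looks roughly like this (a sketch assuming SSE2, not necessarily what the linked branch does): route the bits through Vector128 so the JIT can emit a register-to-register movq instead of the store/load pair.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static long DoubleToInt64Bits(double value)
{
    if (Sse2.X64.IsSupported)
    {
        // movq rax, xmm0 instead of spilling the double to the stack
        return Sse2.X64.ConvertToInt64(Vector128.CreateScalarUnsafe(value).AsInt64());
    }
    return BitConverter.DoubleToInt64Bits(value);
}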

I'll try something to improve this...

@gfoidl (Member) commented Mar 4, 2020

An attempt at better codegen for Math.Min(double, double) is in master...gfoidl:math-max-min-double

During the work I noticed that some test cases in

public static void Min_Double_NotNetFramework(double x, double y, double expectedResult)

are missing for the double.NaN case (or I just missed other unit tests for Math.Min).

With the latest solution codegen looks pretty good.

codegen for current master
; Assembly listing for method ConsoleApp4.Program:Do(double,double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  9,  5   )  double  ->  mm0        
;  V01 arg1         [V01,T01] (  6,  4.25)  double  ->  mm1        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T02] (  5,  3   )  double  ->  mm0         "Inline return value spill temp"
;  V04 tmp2         [V04,T03] (  2,  1   )  double  ->  [rbp-0x08]   do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 16

G_M65265_IG01:
       55                   push     rbp
       4883EC10             sub      rsp, 16
       C5F877               vzeroupper 
       488D6C2410           lea      rbp, [rsp+10H]
						;; bbWeight=1    PerfScore 2.75
G_M65265_IG02:
       C5F92EC8             vucomisd xmm1, xmm0
       770A                 ja       SHORT G_M65265_IG04
						;; bbWeight=1    PerfScore 2.00
G_M65265_IG03:
       C5F92EC0             vucomisd xmm0, xmm0
       7A03                 jp       SHORT G_M65265_IG04
       7403                 je       SHORT G_M65265_IG05
						;; bbWeight=0.25 PerfScore 0.75
G_M65265_IG04:
       EB24                 jmp      SHORT G_M65265_IG09
						;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG05:
       C5F92EC1             vucomisd xmm0, xmm1
       7A19                 jp       SHORT G_M65265_IG08
       7517                 jne      SHORT G_M65265_IG08
       C5FB1145F8           vmovsd   qword ptr [rbp-08H], xmm0
       48837DF800           cmp      qword ptr [rbp-08H], 0
       7C09                 jl       SHORT G_M65265_IG07
						;; bbWeight=0.25 PerfScore 1.38
G_M65265_IG06:
       C5F828C1             vmovaps  xmm0, xmm1
       EB08                 jmp      SHORT G_M65265_IG09
						;; bbWeight=0.50 PerfScore 1.12
G_M65265_IG07:
       EB05                 jmp      SHORT G_M65265_IG09
						;; bbWeight=0.50 PerfScore 1.00
G_M65265_IG08:
       C5F828C1             vmovaps  xmm0, xmm1
						;; bbWeight=0.50 PerfScore 0.12
G_M65265_IG09:
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret      
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 67, prolog size 13, PerfScore 19.43, (MethodHash=0702010e) for method ConsoleApp4.Program:Do(double,double):double
; ============================================================
codegen for 1af43447e2da78426c31030f6a5eb9e6c930876c
; Assembly listing for method ConsoleApp4.Program:Do(double,double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T03] (  3,  3   )  double  ->  mm0        
;  V01 arg1         [V01,T02] (  5,  3.50)  double  ->  mm1        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T01] (  8,  8.50)  double  ->  mm0         "Inlining Arg"
;  V04 tmp2         [V04,T00] (  2,  0.50)    long  ->  rax         "Inline return value spill temp"
;* V05 tmp3         [V05    ] (  0,  0   )  double  ->  zero-ref    ld-addr-op "Inlining Arg"
;
; Lcl frame size = 0

G_M65265_IG01:
       55                   push     rbp
       C5F877               vzeroupper 
       488BEC               mov      rbp, rsp
						;; bbWeight=1    PerfScore 2.25
G_M65265_IG02:
       C5F92EC8             vucomisd xmm1, xmm0
       7725                 ja       SHORT G_M65265_IG05
						;; bbWeight=1    PerfScore 2.00
G_M65265_IG03:
       C5F92EC0             vucomisd xmm0, xmm0
       7A1E                 jp       SHORT G_M65265_IG05
       C5F92EC1             vucomisd xmm0, xmm1
       7A13                 jp       SHORT G_M65265_IG04
       7511                 jne      SHORT G_M65265_IG04
       C5F828D0             vmovaps  xmm2, xmm0
       C4E1F97ED0           vmovd    rax, xmm2
       4885C0               test     rax, rax
       7C08                 jl       SHORT G_M65265_IG05
						;; bbWeight=0.25 PerfScore 1.88
G_M65265_IG04:
       C5F828C1             vmovaps  xmm0, xmm1
						;; bbWeight=0.25 PerfScore 0.06
G_M65265_IG05:
       5D                   pop      rbp
       C3                   ret      
						;; bbWeight=1    PerfScore 1.50

; Total bytes of code 47, prolog size 7, PerfScore 12.89, (MethodHash=0702010e) for method ConsoleApp4.Program:Do(double,double):double
; ============================================================

What I don't like about this solution are the gotos, and it will be pretty fragile if the JIT's codegen improves, since ideally the JIT should lay out the method that way by itself.

With regard to branch prediction, I also tried splitting Math.Min(double, double) into a fast path and a cold path -- see 5bfaf80 (please ignore the redundant check in the slow method).
This should bring an improvement iff val1 < val2, but not in the general case.
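
The shape of that split is roughly the following (a hypothetical sketch of the idea, not the code from 5bfaf80): keep the common strictly-less-than comparison small and inlineable, and push the NaN / signed-zero handling into a non-inlined helper.

using System;
using System.Runtime.CompilerServices;

static class SplitMath
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Min(double val1, double val2)
    {
        // Fast path: the common case stays tiny after inlining.
        if (val1 < val2)
        {
            return val1;
        }
        return MinSlow(val1, val2);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static double MinSlow(double val1, double val2)
    {
        // Cold path: NaN propagation and the -0.0 vs +0.0 tie-break.
        if (double.IsNaN(val1))
        {
            return val1;
        }
        if (val1 == val2)
        {
            return double.IsNegative(val1) ? val1 : val2;
        }
        return val2; // covers val2 < val1 and val2 being NaN
    }
}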

Any thoughts on this?
Maybe we should revisit dotnet/coreclr#22965 and add support for NaN propagation?

@tannergooding (Member)

are missing for double.NaN-case (or I did just miss other unit tests for Math.Min).

We have "non-NaN, NaN returns NaN" covered by (+Inf, NaN). We are missing "NaN, non-NaN returns NaN". However, having cases that also cover non-infinity values is beneficial anyway. So thanks for catching this!
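
Illustrative rows for the missing coverage (a sketch only; the test class and method names are hypothetical and the exact data added to the repo may differ):

using System;
using Xunit;

public static class MathMinTests
{
    [Theory]
    [InlineData(double.NaN, 1.0, double.NaN)]        // NaN, non-NaN returns NaN
    [InlineData(1.0, double.NaN, double.NaN)]        // non-NaN (finite), NaN returns NaN
    [InlineData(double.NaN, double.NaN, double.NaN)] // NaN, NaN returns NaN
    public static void Min_Double_PropagatesNaN(double x, double y, double expectedResult)
    {
        Assert.Equal(expectedResult, Math.Min(x, y));
    }
}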

An attempt for better codegen for Math.Min(double, double) is in master...gfoidl:math-max-min-double

Unfortunately this version doesn't handle NaN correctly and will return 1.0 for (NaN, 1.0) when it should return NaN.

What I don't like about this solution are the gotos, and it will be pretty fragile if the JIT's codegen improves, since ideally the JIT should lay out the method that way by itself.

There are a few other places this is done in the framework for similar reasons.

Maybe we should revisit dotnet/coreclr#22965 and support for NaN-propagation?

In order to correctly implement the function, we have to pick the lesser of x or y, treating -0 as smaller than +0 and propagating NaN. That means we will either have branch-heavy code, or use branchless logic, or possibly a slightly hybrid approach.

It would likely be good to get implementations for all three cases done and then run benchmarks on a few machines.

@tannergooding (Member)

With regard to the hybrid approach, I mean that minsd is safe to use if we know that neither input is NaN and the inputs aren't both 0; or we might be able to make the branchless logic smaller if we assume NaN or both inputs being 0 is an edge case.
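
A sketch of that hybrid idea (hypothetical; assumes SSE2 and guards just the edge cases with one combined check):

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HybridMath
{
    public static double Min(double x, double y)
    {
        // minsd alone is IEEE-correct unless an input is NaN or both inputs are
        // zero (where the sign of zero matters), so guard only those cases.
        if (Sse2.IsSupported &&
            !double.IsNaN(x) && !double.IsNaN(y) &&
            (x != 0.0 || y != 0.0))
        {
            return Sse2.MinScalar(Vector128.CreateScalarUnsafe(x),
                                  Vector128.CreateScalarUnsafe(y)).ToScalar();
        }
        return Math.Min(x, y); // IEEE 754:2019 `minimum` fallback
    }
}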


@gfoidl (Member) commented Mar 9, 2020

@tannergooding side question: should the workaround for BitConverter double <-> long be handled separately? (Cf. #12733 (comment).)
I would do so, to keep the changes focused...

@gfoidl (Member) commented Mar 9, 2020

In gfoidl@b2fe76d I added more tests for the edge cases.

Tested variants are here:

All three pass the tests (on Intel x64).
Benchmark results (win-x64) and generated asm (linux-x64) are in this folder.

The raytracer -- as mentioned in #12159 -- is hacked together and gives the following numbers on my machine (Intel x64):

Default:            48021 ms
InlinedOptimized:   38675 ms
Vectorized:         47637 ms

(Quite interesting that the branchless vectorized version doesn't perform better (on my machine) -- I didn't investigate why...maybe later)

Unfortunately I don't have access to other CPU platforms (AMD, ARM) to run tests on.
Are there any other real-world benchmarks we can run for min/max?

So based on these numbers, I would go with the InlinedOptimized variant.

@polijp (Author) commented Mar 10, 2020 via email

@gfoidl (Member) commented Mar 10, 2020

this class has an IsNaN implementation

It's just copied over from double.IsNaN, along with other methods, for easier tweaking of the code.

@tannergooding (Member)

should the workaround for BitConverter double <-> long be handled separate? (Cf. #12733 (comment)).

Yes, it should be a separate PR with its own jit-diff and potentially perf data.

@tannergooding (Member)

So based on these numbers, I would go with InlinedOptimized variant.

Thanks @gfoidl, I'm going to run some numbers on my AMD and Intel boxes in a bit and will share back as well.

@tannergooding (Member)

Variants tested and disassembly: https://gist.github.com/tannergooding/d81a6cd7530ec1cdc27e08530922f02a

Notably, there are a few places where codegen could be improved; but inlining and reordering things to assume inputs aren't the same or NaN provides the easiest win.

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     9.383 us    0.0170 us   0.0159 us
MathMinInlined              4.694 us    0.0133 us   0.0118 us
MinReorder                  4.471 us    0.0066 us   0.0061 us
MinReorderWithMinScalar     3.527 us    0.0080 us   0.0071 us
MinVectorized               4.715 us    0.0046 us   0.0041 us

Ryzen 1800X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 7 1800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     16.147 us   0.1826 us   0.1708 us
MathMinInlined              5.484 us    0.0148 us   0.0131 us
MinReorder                  4.902 us    0.0114 us   0.0107 us
MinReorderWithMinScalar     4.120 us    0.0228 us   0.0214 us
MinVectorized               5.166 us    0.0168 us   0.0149 us

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev      Median
MathMin                     12.560 us   0.2606 us   0.5664 us   12.244 us
MathMinInlined              7.369 us    0.0230 us   0.0192 us   7.371 us
MinReorder                  6.063 us    0.0257 us   0.0240 us   6.057 us
MinReorderWithMinScalar     6.043 us    0.0235 us   0.0220 us   6.037 us
MinVectorized               7.833 us    0.0788 us   0.0699 us   7.810 us

@EgorBo (Member) commented Mar 11, 2020

That's why we need a Fast Math mode :-) so we can just use a single instruction (e.g. minsd) and not care about NaNs, etc. 🙂

Some dotnet/perf benchmarks are up to 3x faster with Fast-Math mode on Mono-LLVM (e.g. the Burgers ones).

@gfoidl (Member) commented Mar 11, 2020

@tannergooding I updated my benchmarks with your variants -> results.

All variants use optimized BitConverter.

From these numbers (for double), InlinedOptimized is still the fastest variant (for my benchmarks).

For the raytracer I get the following numbers (float):

InlinedOptimized:                   35525 ms
DefaultReorderedVectorized:         40959 ms
DefaultReorderedVectorizedHotCold:  41835 ms

For AMD it seems that MinScalar has a much greater positive effect than on Intel.

Can you run your benchmarks with the InlinedOptimized variant as well, please?

@tannergooding (Member)

I see the following. Part of the difference in numbers is due to the inputs given; as such, I've run the benchmark with 4 different input variations and listed them below.

From the numbers, you can see that MinVectorized is fairly stable regardless of what inputs are given, while the other algorithms really start shining when the branch predictor can start working but may perform worse when it can't.

All inputs: val1 < val2

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     10.856 us   0.0392 us   0.0367 us
MathMinInlined              4.716 us    0.0020 us   0.0016 us
MinReorder                  4.745 us    0.0139 us   0.0124 us
MinReorderWithMinScalar     3.564 us    0.0103 us   0.0096 us
MinVectorized               4.742 us    0.0069 us   0.0062 us
MinGfoidl                   4.627 us    0.0145 us   0.0135 us

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     12.110 us   0.0280 us   0.0260 us
MathMinInlined              5.051 us    0.0834 us   0.0780 us
MinReorder                  4.847 us    0.0063 us   0.0056 us
MinReorderWithMinScalar     6.034 us    0.0096 us   0.0090 us
MinVectorized               7.793 us    0.0545 us   0.0455 us
MinGfoidl                   6.063 us    0.0232 us   0.0217 us

1/2 Inputs: val1 < val2; 1/2 Inputs: val2 < val1

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     8.869 us    0.0481 us   0.0375 us
MathMinInlined              4.163 us    0.0349 us   0.0326 us
MinReorder                  4.749 us    0.0229 us   0.0214 us
MinReorderWithMinScalar     3.791 us    0.0541 us   0.0506 us
MinVectorized               4.651 us    0.0203 us   0.0189 us
MinGfoidl                   4.235 us    0.0059 us   0.0052 us

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     11.010 us   0.0890 us   0.0830 us
MathMinInlined              5.897 us    0.0199 us   0.0186 us
MinReorder                  5.743 us    0.0066 us   0.0062 us
MinReorderWithMinScalar     5.991 us    0.0190 us   0.0168 us
MinVectorized               7.716 us    0.0223 us   0.0198 us
MinGfoidl                   5.397 us    0.0162 us   0.0151 us

1/3 Inputs: val1 < val2; 1/3 Inputs: val2 < val1; 1/3 Inputs: NaN

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     10.553 us   0.2106 us   0.4488 us
MathMinInlined              3.846 us    0.0094 us   0.0088 us
MinReorder                  4.099 us    0.0113 us   0.0101 us
MinReorderWithMinScalar     3.750 us    0.0164 us   0.0145 us
MinVectorized               4.416 us    0.0198 us   0.0185 us
MinGfoidl                   3.812 us    0.0138 us   0.0129 us

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     11.850 us   0.0470 us   0.0420 us
MathMinInlined              5.050 us    0.0108 us   0.0101 us
MinReorder                  5.136 us    0.0100 us   0.0088 us
MinReorderWithMinScalar     5.565 us    0.0094 us   0.0088 us
MinVectorized               6.881 us    0.0080 us   0.0075 us
MinGfoidl                   5.221 us    0.0082 us   0.0073 us

1/4 Inputs: val1 < val2; 1/4 Inputs: val2 < val1; 1/4 Inputs: NaN; 1/4 Inputs: val1 == val2

Ryzen 3900X

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     11.400 us   0.1980 us   0.1850 us
MathMinInlined              3.839 us    0.0144 us   0.0135 us
MinReorder                  3.789 us    0.0153 us   0.0143 us
MinReorderWithMinScalar     3.764 us    0.0245 us   0.0229 us
MinVectorized               4.327 us    0.0082 us   0.0072 us
MinGfoidl                   3.766 us    0.0151 us   0.0142 us

i7-7700

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20160.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.15501, CoreFX 5.0.20.15501), X64 RyuJIT
Method                      Mean        Error       StdDev
MathMin                     11.740 us   0.0400 us   0.0350 us
MathMinInlined              5.046 us    0.0203 us   0.0180 us
MinReorder                  5.309 us    0.0120 us   0.0107 us
MinReorderWithMinScalar     5.416 us    0.0099 us   0.0093 us
MinVectorized               6.663 us    0.0067 us   0.0056 us
MinGfoidl                   4.967 us    0.0141 us   0.0132 us

@gfoidl (Member) commented Mar 13, 2020

I poured the numbers into a chart (without the MathMin case, as it's the worst):
[chart: math-min]

For Ryzen MinReorderWithMinScalar performs best and is quite stable on the inputs.
For Intel MinReorder performs best.

So which one should we pick?
Do you know why minsd seems to be quite a bit slower on Intel, or how this can be worked around?

other algorithms really start shining when the branch predictor can start working but may perform worse when it can't

This is what I would expect, but the figure looks somewhat different in places. The more "random" the inputs, the less time it takes. This feels a bit strange to me.

@tannergooding (Member)

Do you know why minsd seems to be quite slower on Intel or how this can be circumvented?

Looking at https://www.agner.org/optimize/instruction_tables.pdf, Skylake has a 4 cycle latency while Ryzen only has a 1 cycle latency (for minss/minsd and maxss/maxsd).

This is what I would expect, but the figure looks partially somewhat different. The more "random" the inputs, the less time it takes. This feels a bit strange to me.

It might have to do with internal mechanics of how the branch predictor works. It could also be a side effect of the test or some other factor.

Math.Min and Math.Max are really small functions, so we can't accurately test them using just a benchmark that calls them. This is because they take less time than the resolution of the high-frequency timer on the system (which is sub-microsecond, but generally no more accurate than 100 ns). Instead, we have to scale the number of operations done during the test, which can skew results.
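
In BenchmarkDotNet terms, that scaling is typically done with a loop plus OperationsPerInvoke (a sketch; the method name and the _values field are illustrative and assumed to be prepared in a [GlobalSetup] method):

[Benchmark(OperationsPerInvoke = 1_000)]
public double MinScaled()
{
    // One invocation performs 1,000 Math.Min calls so the measured time is well
    // above the timer resolution; the per-operation time is derived by division.
    double result = double.MaxValue;
    for (int i = 0; i < 1_000; i++)
        result = Math.Min(result, _values[i]);
    return result;
}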

In either case we need a software fallback that isn't MinReorderWithMinScalar or MinVectorized, so we should just pick the one that is best overall, on average. We can then optimize other cases as needed.
I imagine all of these algorithms (minus MinVectorized) may behave differently depending on the surrounding code, due to the number of branches/return statements they have.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020