
Optimise 'Math.CopySign' and 'MathF.CopySign' #26506

Open

john-h-k wants to merge 10 commits into master from john-h-k:copysign-opt
Conversation

@john-h-k (Contributor) commented Sep 4, 2019

(now in the correct repo)

Optimise 'Math.CopySign' and 'MathF.CopySign'. Both of these methods can be improved over their current implementations. The new implementation uses SSE intrinsics, which are faster, and also provides a faster intrinsic-free fallback.

The SSE pathway doesn't spill to the stack, unlike the others, and is branch free. The fallback is also branch free but does spill to the stack (the current implementation spills and contains a branch). These are faster in every scenario, except that the non-SSE fallback for double precision is marginally (10%) slower on x64 in the Same scenario - however, Windows (and x64) requires SSE2, so it is unlikely this code path is ever going to run on x86. I haven't profiled on ARM as I don't have access to an ARM system.
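
For illustration, here is a minimal sketch of the shape of the change described above - an SSE path plus a branch-free bit-twiddling fallback. The exact masks, names, and guard conditions are assumptions for this sketch, not necessarily the code in this PR:

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    internal static class CopySignSketch
    {
        public static double CopySign(double x, double y)
        {
            if (Sse2.IsSupported)
            {
                // The sign bit of a double is the top bit of its 64-bit pattern.
                Vector128<float> signMask =
                    Vector128.CreateScalarUnsafe(unchecked((long)0x8000000000000000)).AsSingle();
                Vector128<float> xv = Vector128.CreateScalarUnsafe(x).AsSingle();
                Vector128<float> yv = Vector128.CreateScalarUnsafe(y).AsSingle();

                // (y & signMask) | (x & ~signMask) -> ANDPS / ANDNPS / ORPS, branch free.
                return Sse.Or(Sse.And(signMask, yv), Sse.AndNot(signMask, xv)).AsDouble().ToScalar();
            }

            // Intrinsic-free fallback: also branch free, but round-trips through integer bits.
            long bits = (BitConverter.DoubleToInt64Bits(x) & 0x7FFFFFFFFFFFFFFF)
                      | (BitConverter.DoubleToInt64Bits(y) & unchecked((long)0x8000000000000000));
            return BitConverter.Int64BitsToDouble(bits);
        }
    }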

Benchmark sources here

Scenarios:
Same - every sign is the same (e.g. x == 1f, y == 2f)
Different - every sign is different (e.g. x == 1f, y == -2f)
Alternating - alternates between same and different
Random - randomly same or different

Single precision (MathF):

| Method | Scenario | Mean | Error | StdDev |
|----------------|-------------|----------:|----------:|----------:|
| Standard | Random | 181.14 us | 1.0300 us | 0.8601 us |
| John | Random | 47.49 us | 0.1613 us | 0.1347 us |
| John_Intrinsic | Random | 39.79 us | 0.4956 us | 0.4636 us |
| Standard | Same | 43.59 us | 0.2158 us | 0.2019 us |
| John | Same | 49.11 us | 0.6777 us | 0.6339 us |
| John_Intrinsic | Same | 39.74 us | 0.2084 us | 0.1949 us |
| Standard | Different | 56.04 us | 0.6402 us | 0.5988 us |
| John | Different | 49.58 us | 0.9119 us | 0.8529 us |
| John_Intrinsic | Different | 39.80 us | 0.2419 us | 0.1889 us |
| Standard | Alternating | 48.89 us | 0.1422 us | 0.1330 us |
| John | Alternating | 47.48 us | 0.0766 us | 0.0717 us |
| John_Intrinsic | Alternating | 39.05 us | 0.0289 us | 0.0226 us |

Double precision (Math):

| Method | Scenario | Mean | Error | StdDev |
|----------------|-------------|----------:|----------:|----------:|
| Standard | Random | 176.77 us | 0.3537 us | 0.2954 us |
| John | Random | 53.74 us | 0.1543 us | 0.1443 us |
| John_Intrinsic | Random | 41.22 us | 0.1321 us | 0.1236 us |
| Standard | Same | 48.64 us | 0.1322 us | 0.1237 us |
| John | Same | 53.85 us | 0.1499 us | 0.1402 us |
| John_Intrinsic | Same | 41.44 us | 0.2931 us | 0.2742 us |
| Standard | Different | 59.91 us | 0.0642 us | 0.0569 us |
| John | Different | 53.72 us | 0.1779 us | 0.1664 us |
| John_Intrinsic | Different | 41.09 us | 0.0346 us | 0.0289 us |
| Standard | Alternating | 54.06 us | 0.1293 us | 0.1210 us |
| John | Alternating | 53.84 us | 0.1772 us | 0.1658 us |
| John_Intrinsic | Alternating | 41.11 us | 0.0681 us | 0.0604 us |

@john-h-k (Contributor, Author) commented Sep 4, 2019

@danmosemsft:

Well @john-h-k perhaps you might consider adding a benchmark (such as some form of the one you're using here) to https://github.com/dotnet/performance as well? There's excellent documentation there: it's easy to use. That would protect your change.

Looking at that right now - will make sure to do so.

@mikedn (Contributor) commented Sep 4, 2019

The SSE pathway doesn't spill to the stack, unlike the others

The spill that happens on the non-SSE path is something that can be fixed in the JIT. I'm not sure whether, once that is done, the SSE path would still be enough faster to warrant the extra complexity, as the ANDPS/ORPS instructions have lower throughput than their integer counterparts.

For benchmarking purposes it would be useful to test the non-SSE version with SingleToInt64Bits replaced by SSE's MOVD intrinsic.
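
If I understand that suggestion correctly, such a variant might look roughly like the sketch below, using Sse2.ConvertToInt32 (which maps to MOVD) to move the float bits into an integer register. This is an interpretation for benchmarking purposes, not code from the PR:

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    internal static class MovdFallbackSketch
    {
        // Same bit manipulation as the managed fallback, but the float -> int bit moves
        // go through MOVD (Sse2.ConvertToInt32) rather than a BitConverter round trip.
        public static float CopySign(float x, float y)
        {
            int xbits = Sse2.ConvertToInt32(Vector128.CreateScalarUnsafe(x).AsInt32());
            int ybits = Sse2.ConvertToInt32(Vector128.CreateScalarUnsafe(y).AsInt32());
            int bits = (xbits & 0x7FFFFFFF) | (ybits & unchecked((int)0x80000000));
            return BitConverter.Int32BitsToSingle(bits);
        }
    }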

@mikedn (Contributor) commented Sep 4, 2019

For benchmarking purposes it would be useful to test the non-SSE version with SingleToInt64Bits replaced by SSE's MOVD intrinsic.

Though with ANDNPS I suspect that there's no way that the non-SSE version can be faster, even with MOVD. The SSE code with ANDNPS is:

       48B80000000000000080 mov      rax, 0x8000000000000000
       C4E1F96ED0           vmovd    xmm2, rax
       C5E854C9             vandps   xmm1, xmm2, xmm1
       C5E855C0             vandnps  xmm0, xmm2, xmm0
       C5F856C1             vorps    xmm0, xmm0, xmm1

which is probably as good as it gets.

@EgorBo (Contributor) commented Sep 4, 2019

Just for reference, here is what Mono generates for:

 [MethodImpl(MethodImplOptions.NoInlining)]
 public static double Test(double x, double y) => Math.CopySign(x, y);

asm:

0000000000000000	movabsq	$0x7fb20121f3e0, %rax
000000000000000a	vandps	(%rax), %xmm1, %xmm1
000000000000000e	movabsq	$0x7fb20121f3f0, %rax
0000000000000018	vandps	(%rax), %xmm0, %xmm0
000000000000001c	vorps	%xmm1, %xmm0, %xmm0
0000000000000020	retq

(llvm.copysign.f64)

@john-h-k john-h-k force-pushed the john-h-k:copysign-opt branch from 859b8d9 to 7b74bec Sep 4, 2019

// flip the sign bit of x and return the new value;
// otherwise, just return x
// Remove the sign from x, and remove everything but the sign from y
// Creating from a 'const long' is better than from the correct 'const float/double'

@tannergooding (Member) commented Sep 4, 2019

What was the performance difference between using Vector128.CreateScalarUnsafe(signMask).AsSingle(), which generates:

mov      rax, 0x8000000000000000
vmovd    xmm2, rax

and using Vector128.CreateScalarUnsafe(-0.0), which generates:

vmovsd xmm0, dword [rip+0x15]

The latter has the load from memory, but smaller codegen, which I believe could be impactful to tight loops.
MSVC, GCC, and Clang all look to prefer the latter scenario...

@john-h-k (Contributor, Author) commented Sep 4, 2019

I tried both, and in the naive-ish simple creation benchmark the first appeared to be marginally faster, so I chose it.

@john-h-k (Contributor, Author) commented Sep 4, 2019

Also, now that ANDNPS is used, there is only one load rather than two.

@tannergooding (Member) commented Sep 4, 2019

the JIT doesn't seem to inline this method in its current state

This is probably because the IL for all of the code paths combined is large enough that it gets skipped by the initial inlining heuristics.

What we've done for some of the other code is mark the method as AggressiveInlining and keep any code-paths we know are small enough inline (i.e. generally the intrinsic paths). The software fallback (or other paths which might not be "small enough") are put in their own local method so the JIT can independently determine whether or not to inline using only that method (another explanation of this is here: https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/Vector128.cs,11).

For example, https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/Vector128.cs,312.
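
A rough sketch of that pattern applied here, with illustrative names and bodies rather than the PR's actual code:

    using System;
    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    public static class InliningPatternSketch
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static float CopySign(float x, float y)
        {
            if (Sse.IsSupported)
            {
                // The small intrinsic path stays inside the aggressively-inlined method.
                Vector128<float> signMask = Vector128.CreateScalarUnsafe(-0.0f);
                return Sse.Or(Sse.And(signMask, Vector128.CreateScalarUnsafe(y)),
                              Sse.AndNot(signMask, Vector128.CreateScalarUnsafe(x))).ToScalar();
            }

            return SoftwareFallback(x, y);

            // The larger bit-twiddling path lives in its own local method, so the JIT
            // applies its inlining heuristics to it independently.
            static float SoftwareFallback(float x, float y)
            {
                int bits = (BitConverter.SingleToInt32Bits(x) & 0x7FFFFFFF)
                         | (BitConverter.SingleToInt32Bits(y) & unchecked((int)0x80000000));
                return BitConverter.Int32BitsToSingle(bits);
            }
        }
    }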

Is it worth applying it to the method given the discussion around its use in tight loops?

I would imagine so; the codegen is very small (5 instructions in the intrinsic case and nearly as small in the software fallback case), so the cost of the additional indirection might be greater than the cost of just running the code.

@john-h-k (Contributor, Author) commented Sep 4, 2019

So in this situation, because both pathways are likely profitable to inline and the fallback is also small, just mark the whole method as AggressiveInlining rather than splitting the fallback out into a separate method?

@tannergooding (Member) commented Sep 4, 2019

I think it would be good to let the JIT make its own decision about the software fallback...

The code generated around DoubleToInt64Bits is still non-ideal (I imagine the ARM codegen is similar, but I'll need to check):

    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: vmovsd [rsp+0x10], xmm1
    L000d: mov rax, [rsp+0x10]
    L0012: vmovsd [rsp+0x8], xmm2
    L0018: mov rdx, [rsp+0x8]
    L001d: mov rcx, 0x8000000000000000
    L0027: and rdx, rcx
    L002a: mov rcx, 0x7fffffffffffffff
    L0034: and rax, rcx
    L0037: or rax, rdx
    L003a: mov [rsp], rax
    L003e: vmovsd xmm0, qword [rsp]
    L0043: add rsp, 0x18
    L0047: ret

It would also be good to try and get rid of the second constant load here; having the following would be much better:

     mov rcx, 0x8000000000000000
     and rdx, rcx
-    mov rcx, 0x7fffffffffffffff  
+    not rcx
     and rax, rcx

It looks like the C# compiler optimizes it into two distinct constants when signMask is declared as a constant, and the JIT does the same when it isn't.
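
For reference, a sketch of the fallback shape under discussion, written with a single signMask constant so the second mask is expressed as ~signMask (an illustration, not the PR's exact code):

    using System;

    internal static class FallbackConstantSketch
    {
        private const long SignMask = unchecked((long)0x8000000000000000);

        public static double CopySign(double x, double y)
        {
            long xbits = BitConverter.DoubleToInt64Bits(x);
            long ybits = BitConverter.DoubleToInt64Bits(y);

            // ~SignMask is 0x7FFFFFFFFFFFFFFF; whether the JIT emits a `not` of the
            // already-loaded constant or a second 64-bit constant load is exactly the
            // codegen question raised above.
            return BitConverter.Int64BitsToDouble((xbits & ~SignMask) | (ybits & SignMask));
        }
    }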

@john-h-k (Contributor, Author) commented Sep 4, 2019

👍 for the separate fallback methods. Currently trying a few things, but I'm failing to trigger the JIT to generate the second style, unfortunately.

@tannergooding (Member) commented Sep 4, 2019

It might be something the JIT doesn't support today, in which case logging a bug should be sufficient.

@tannergooding (Member) commented Sep 4, 2019

you might consider adding a benchmark

That would be greatly appreciated and would make validating this change on other boxes simpler 😄. In general, the benchmarks are being checked into the dotnet/performance repo. -- @adamsitnik, do we have any guidance on how to run dotnet/performance benchmarks against local dotnet/coreclr builds?

Overall the changes LGTM, just want to test this on a few more boxes (and probably an ARM machine) before committing it.

@john-h-k (Contributor, Author) commented Sep 4, 2019

Testing on SharpLab, the JIT doesn't seem to inline this method in its current state, but it will with AggressiveInlining applied. Is it worth applying it to the method given the discussion around its use in tight loops?

which I believe could be impactful to tight loops.

john-h-k added 3 commits Sep 4, 2019
Add using namespaces
Not sure where these vanished to, eek
Remove trailing whitespace
didn't realise this when I wrapped the line :/
@mikedn (Contributor) commented Sep 4, 2019

The SSE double version is problematic on x86:

       C5FB10442444 vmovsd   xmm0, qword ptr [esp+44H]
       C5F911442420 vmovupd  xmmword ptr [esp+20H], xmm0
       C5FB104C243C vmovsd   xmm1, qword ptr [esp+3CH]
       C5F9114C2410 vmovupd  xmmword ptr [esp+10H], xmm1
       6800000080   push     0x80000000
       6A00         push     0
       8D4C2408     lea      ecx, bword ptr [esp+08H]
       FF1514C08E05 call     [System.Runtime.Intrinsics.Vector128:CreateScalarUnsafe(long):struct]
       C5F9100424   vmovupd  xmm0, xmmword ptr [esp]
       C5F855442420 vandnps  xmm0, xmm0, xmmword ptr [esp+20H]
       C5F9104C2410 vmovupd  xmm1, xmmword ptr [esp+10H]
       C5F0540C24   vandps   xmm1, xmm1, xmmword ptr [esp]
       C5F856C1     vorps    xmm0, xmm0, xmm1
       C5FB11442430 vmovsd   qword ptr [esp+30H], xmm0
       DD442430     fld      qword ptr [esp+30H]

It probably should only be enabled on x64.

@tannergooding (Member) commented Sep 4, 2019

It probably should only be enabled on x64.

Sounds good to me.

It's worth noting that some of that looks like codegen issues we might want to track... There isn't really any reason why this couldn't be:

vmovq  xmm2, qword ptr [address of ulong constant]  ; This encoding of movq is supported even on 32-bit
vandps xmm1, xmm2, xmm1
vandnps xmm0, xmm2, xmm0
vorps  xmm0, xmm0, xmm1

@john-h-k john-h-k force-pushed the john-h-k:copysign-opt branch from 5609a0d to 2c75f1b Sep 4, 2019

@adamsitnik (Member) commented Sep 4, 2019

do we have any guidance on how to run dotnet/performance benchmarks against local dotnet/coreclr builds?

Yes, all you need to do is pass the path to the CoreRun executable via the --corerun parameter:

https://github.com/dotnet/performance/tree/master/src/benchmarks/micro#private-runtime-builds
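
For example, an invocation along these lines (the paths, target framework, and filter here are placeholders to adjust for your local build layout, not an exact command from the docs):

    dotnet run -c Release -f netcoreapp3.0 -- --filter "*CopySign*" --corerun <path-to-Core_Root>\CoreRun.exe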

@john-h-k please let me know if something is not clear or does not work as expected
