Conversation
|
Random improves because Abs doesn't inline anymore; will fix |
Post ; Assembly listing for method Math:Abs(int):int
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 7, 5.50) int -> rcx
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 40
G_M42159_IG01:
sub rsp, 40
nop
G_M42159_IG02:
test ecx, ecx
jge SHORT G_M42159_IG03
neg ecx
test ecx, ecx
jl SHORT G_M42159_IG05
G_M42159_IG03:
mov eax, ecx
G_M42159_IG04:
add rsp, 40
ret
G_M42159_IG05:
call Math:ThrowAbsOverflow()
int3
; Total bytes of code 28, prolog size 5 for method Math:Abs(int):int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
src/mscorlib/src/System/Decimal.cs
Outdated
this.flags = flags; | ||
return; | ||
} | ||
if (!((flags & ~(SignMask | ScaleMask)) == 0 && (flags & ScaleMask) <= (28 << 16))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was; but is also completely unnecessary as the function doesn't return anything - will revert
Total bytes of diff: -94 (0.00% of base)
diff is an improvement.
Total byte diff includes -225 bytes from reconciling methods
Base had 5 unique methods, 445 unique bytes
Diff had 3 unique methods, 220 unique bytes
Top file improvements by size (bytes):
-94 : System.Private.CoreLib.dasm (0.00% of base)
1 total files with size differences (1 improved, 0 regressed), 0 unchanged.
Top method regessions by size (bytes):
100 : System.Private.CoreLib.dasm - Decimal:<DecDivMod1E9>g__D32DivMod1E9|161_0(int,byref):int (0/1 methods)
90 : System.Private.CoreLib.dasm - Decimal:Remainder(struct,struct):struct
60 : System.Private.CoreLib.dasm - Decimal:ThrowInvalidDecimalBytes():this (0/1 methods)
60 : System.Private.CoreLib.dasm - Math:ThrowAbsOverflow() (0/1 methods)
31 : System.Private.CoreLib.dasm - Decimal:Ceiling(struct):struct
31 : System.Private.CoreLib.dasm - Math:Ceiling(struct):struct
14 : System.Private.CoreLib.dasm - Math:Abs(short):short
14 : System.Private.CoreLib.dasm - Math:Abs(byte):byte
14 : System.Private.CoreLib.dasm - Random:.ctor(int):this
10 : System.Private.CoreLib.dasm - Decimal:op_UnaryNegation(struct):struct
10 : System.Private.CoreLib.dasm - Math:Abs(struct):struct
8 : System.Private.CoreLib.dasm - Math:Abs(long):long
6 : System.Private.CoreLib.dasm - Decimal:ToDecimal(ref):struct
6 : System.Private.CoreLib.dasm - Math:Abs(int):int
5 : System.Private.CoreLib.dasm - Decimal:Abs(struct):struct
5 : System.Private.CoreLib.dasm - Decimal:Negate(struct):struct
Top method improvements by size (bytes):
-100 : System.Private.CoreLib.dasm - Decimal:<DecDivMod1E9>g__D32DivMod1E9|160_0(int,byref):int (1/0 methods)
-91 : System.Private.CoreLib.dasm - Math:AbsHelper(long):long (1/0 methods)
-87 : System.Private.CoreLib.dasm - Math:AbsHelper(short):short (1/0 methods)
-85 : System.Private.CoreLib.dasm - Math:AbsHelper(byte):byte (1/0 methods)
-82 : System.Private.CoreLib.dasm - Math:AbsHelper(int):int (1/0 methods)
-62 : System.Private.CoreLib.dasm - Decimal:.ctor(int,int,int,int):this
-51 : System.Private.CoreLib.dasm - Decimal:SetBits(ref):this
23 total methods with size differences (7 improved, 16 regressed), 16746 unchanged.
|
src/mscorlib/src/System/Decimal.cs
Outdated
} | ||
throw new ArgumentException(SR.Arg_DecBitCtor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did this exception go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put the decimal optimization into separate PR. They are pretty unrelated to the Abs change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to improve the Math.Abs(decimal)
; Assembly listing for method Math:Abs(struct):struct
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 RetBuf [V00,T01] ( 7, 7 ) byref -> rcx
; V01 arg0 [V01,T00] ( 7, 12 ) byref -> rdx
;* V02 loc0 [V02 ] ( 0, 0 ) struct (16) zero-ref
;* V03 tmp1 [V03 ] ( 0, 0 ) struct (16) zero-ref
; V04 tmp2 [V04,T02] ( 4, 7 ) int -> rax
; V05 tmp3 [V05,T11] ( 3, 0 ) ref -> rsi class-hnd exact
;* V06 tmp4 [V06 ] ( 0, 0 ) int -> zero-ref
;* V07 tmp5 [V07 ] ( 0, 0 ) int -> zero-ref
;* V08 tmp6 [V08 ] ( 0, 0 ) int -> zero-ref
; V09 tmp7 [V09,T03] ( 2, 2 ) int -> rax V02.flags(offs=0x00) P-INDEP
; V10 tmp8 [V10,T04] ( 2, 2 ) int -> r8 V02.hi(offs=0x04) P-INDEP
; V11 tmp9 [V11,T05] ( 2, 2 ) int -> r9 V02.lo(offs=0x08) P-INDEP
; V12 tmp10 [V12,T06] ( 2, 2 ) int -> rdx V02.mid(offs=0x0c) P-INDEP
; V13 tmp11 [V13,T07] ( 2, 2 ) int -> rax V03.flags(offs=0x00) P-INDEP
; V14 tmp12 [V14,T08] ( 2, 2 ) int -> r8 V03.hi(offs=0x04) P-INDEP
; V15 tmp13 [V15,T09] ( 2, 2 ) int -> r9 V03.lo(offs=0x08) P-INDEP
; V16 tmp14 [V16,T10] ( 2, 2 ) int -> rdx V03.mid(offs=0x0c) P-INDEP
; V17 tmp15 [V17,T12] ( 2, 0 ) ref -> rcx
; V18 tmp16 [V18,T13] ( 2, 0 ) ref -> rdx
; V19 OutArgs [V19 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 32
G_M42161_IG01:
push rsi
sub rsp, 32
G_M42161_IG02:
mov eax, dword ptr [rdx]
mov r8d, dword ptr [rdx+4]
mov r9d, dword ptr [rdx+8]
mov edx, dword ptr [rdx+12]
and eax, 0xD1FFAB1E
test eax, 0xD1FFAB1E
jne G_M42161_IG04
mov r10d, eax
and r10d, 0xD1FFAB1E
cmp r10d, 0xD1FFAB1E
jg G_M42161_IG04
mov dword ptr [rcx], eax
mov dword ptr [rcx+4], r8d
mov dword ptr [rcx+8], r9d
mov dword ptr [rcx+12], edx
mov rax, rcx
G_M42161_IG03:
add rsp, 32
pop rsi
ret
************** Beginning of cold code **************
G_M42161_IG04:
lea rcx, [(reloc)]
call CORINFO_HELP_NEWSFAST
mov rsi, rax
mov ecx, 0x2648
call CORINFO_HELP_STRCNS_CURRENT_MODULE
mov rcx, rax
xor rdx, rdx
call SR:GetResourceString(ref,ref):ref
mov rdx, rax
mov rcx, rsi
call ArgumentException:.ctor(ref):this
mov rcx, rsi
call CORINFO_HELP_THROW
int3
; Total bytes of code 134, prolog size 5 for method Math:Abs(struct):struct
; ============================================================
But seems to have made it worse by making the variables do-not-enreg[X] addr-exposed
; Assembly listing for method Math:Abs(struct):struct
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 RetBuf [V00,T01] ( 7, 7 ) byref -> rcx
; V01 arg0 [V01,T00] ( 7, 12 ) byref -> rdx
;* V02 loc0 [V02 ] ( 0, 0 ) struct (16) zero-ref
; V03 tmp1 [V03 ] ( 13, 24 ) struct (16) [rsp+0x28] do-not-enreg[XS] addr-exposed
; V04 tmp2 [V04,T02] ( 4, 7 ) int -> rax
;* V05 tmp3 [V05 ] ( 0, 0 ) int -> zero-ref
;* V06 tmp4 [V06 ] ( 0, 0 ) int -> zero-ref
;* V07 tmp5 [V07 ] ( 0, 0 ) int -> zero-ref
; V08 tmp6 [V08,T03] ( 2, 2 ) int -> rax V02.flags(offs=0x00) P-INDEP
; V09 tmp7 [V09,T04] ( 2, 2 ) int -> r8 V02.hi(offs=0x04) P-INDEP
; V10 tmp8 [V10,T05] ( 2, 2 ) int -> r9 V02.lo(offs=0x08) P-INDEP
; V11 tmp9 [V11,T06] ( 2, 2 ) int -> rdx V02.mid(offs=0x0c) P-INDEP
; V12 tmp10 [V12 ] ( 4, 3 ) int -> [rsp+0x28] do-not-enreg[X] addr-exposed V03.flags(offs=0x00) P-DEP
; V13 tmp11 [V13 ] ( 4, 3 ) int -> [rsp+0x2C] do-not-enreg[X] addr-exposed V03.hi(offs=0x04) P-DEP
; V14 tmp12 [V14 ] ( 4, 3 ) int -> [rsp+0x30] do-not-enreg[X] addr-exposed V03.lo(offs=0x08) P-DEP
; V15 tmp13 [V15 ] ( 4, 3 ) int -> [rsp+0x34] do-not-enreg[X] addr-exposed V03.mid(offs=0x0c) P-DEP
; V16 OutArgs [V16 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 56
G_M42161_IG01:
sub rsp, 56
nop
G_M42161_IG02:
mov eax, dword ptr [rdx]
mov r8d, dword ptr [rdx+4]
mov r9d, dword ptr [rdx+8]
mov edx, dword ptr [rdx+12]
xor r10d, r10d
mov dword ptr [rsp+28H], r10d
mov dword ptr [rsp+2CH], r10d
mov dword ptr [rsp+30H], r10d
mov dword ptr [rsp+34H], r10d
and eax, 0xD1FFAB1E
test eax, 0xD1FFAB1E
jne G_M42161_IG04
mov r10d, eax
and r10d, 0xD1FFAB1E
cmp r10d, 0xD1FFAB1E
jg G_M42161_IG04
mov dword ptr [rsp+30H], r9d
mov dword ptr [rsp+34H], edx
mov dword ptr [rsp+2CH], r8d
mov dword ptr [rsp+28H], eax
mov eax, dword ptr [rsp+28H]
mov dword ptr [rcx], eax
mov eax, dword ptr [rsp+2CH]
mov dword ptr [rcx+4], eax
mov eax, dword ptr [rsp+30H]
mov dword ptr [rcx+8], eax
mov eax, dword ptr [rsp+34H]
mov dword ptr [rcx+12], eax
mov rax, rcx
G_M42161_IG03:
add rsp, 56
ret
************** Beginning of cold code **************
G_M42161_IG04:
lea rcx, bword ptr [rsp+28H]
call Decimal:ThrowInvalidDecimalBytes():this
int3
; Total bytes of code 144, prolog size 5 for method Math:Abs(struct):struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really sure why, but will revert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect the decimal improvement will grow once it is fine tuned with all the feedback. E.g.:
- I think that Decimal Abs can assume that the in-comming decimal is valid. It is what all other Decimal ops do. I do not think it needs to throw at all.
- There seemed to be a bug introduced in SetBits by your change - it was not throwing on invalid flags anymore.
- SetBits can be folded into constructor that takes byte[]. It does not look like an improvement for it to be separate.
More smaller PRs move faster.
Fortunately this version of G_M55888_IG03:
894DF0 mov gword ptr [ebp-10H], ecx
8D5CF108 lea ebx, bword ptr [ecx+8*esi+8]
8B0B mov ecx, dword ptr [ebx]
8B5B04 mov ebx, dword ptr [ebx+4]
; begin abs
85DB test ebx, ebx
7D0B jge SHORT G_M55888_IG04
F7D9 neg ecx
83D300 adc ebx, 0
F7DB neg ebx
85DB test ebx, ebx
7C12 jl SHORT G_M55888_IG06
G_M55888_IG04:
; end abs
03C1 add eax, ecx
13D3 adc edx, ebx
46 inc esi
3BFE cmp edi, esi
8B4DF0 mov ecx, gword ptr [ebp-10H]
7FD9 jg SHORT G_M55888_IG03 Unfortunately it won't be easy to get rid of the useless |
Some perf numbers from @aobatact 's benchmark extended with the new abs version:
Don't ask me how come |
@mikedn and @benaadams Thank you! |
Hi all, public static int Abs(int value)
{
if (value < 0)
{
value = -value;
if (value < 0)
{
ThrowAbsOverflow();
}
}
return value;
} For me a negative value smaller than zero have to be positive. |
@LeroyD The problem is that numbers are represented in twos complement. You have the problem that there is one more negative number than positive. You can represent -2147483648 = 0b1000...000, but not 2147483648. Inverting this number gives itself (to invert in twos complement you invert all bits and add 1, negative numbers start with 1). |
Ok. Understood. if (value == int.MinValue)
{ ... } Thanks for your return. |
It's worth noting that this specific implementation works in C# because the specification says:
Do not use this implementation in other languages without first checking the specific language behavior. For example in C++ this code exhibits undefined behavior and some C++ compilers may very well discard the second |
@mikedn if something takes 0ns with BenchmarkDotNet it means that given method was either empty or returned a const. In that case, I suspect that JIT was smart enough to recognize that the field does not change and it's calculating the same thing all the time and decided to simply return a const. @AndreyAkinshin I think that we should print some kind of a warning in that case ;) |
Nope, there is no way that the JIT can't do that. Fields are set outside the method so there's no way for the JIT to know what value is there. The generated code is: G_M12548_IG01:
4883EC28 sub rsp, 40
G_M12548_IG02:
8B4108 mov eax, dword ptr [rcx+8]
85C0 test eax, eax
7D06 jge SHORT G_M12548_IG03
F7D8 neg eax
85C0 test eax, eax
7C08 jl SHORT G_M12548_IG05
G_M12548_IG03:
89410C mov dword ptr [rcx+12], eax
G_M12548_IG04:
4883C428 add rsp, 40
C3 ret
G_M12548_IG05:
E812EEFFFF call AbsHelper:AbsThrowHelper()
CC int3 Most likely the code that we attempt to measure is just too fast. Measuring some sort of loop that uses |
BenchmarkDotNet should handle such cases. =) Otherwise, it's a bug which should be fixed. |
You can reproduce this with something as simple as [Benchmark]
public void Empty() => y = -x; // x and y are int fields
Attempting to measure the immeasurable is hardly a bug, more like a case of trying to hard. |
Might need to bump up the executions and use A for loop would likely dominate, so maybe direct repetition? [Benchmark(OperationsPerInvoke = 10)]
public static int Negate()
{
y = -x;
x = -y;
y = -x;
x = -y;
y = -x;
x = -y;
y = -x;
x = -y;
y = -x;
x = -y;
} |
Well, it'a tricky example. for (int i = 0; i < N / 16; i++)
{
Method();
Method();
Method();
Method();
Method();
Method();
Method();
Method();
Method();
Method();
Method();
Method();
Method();
Method();
Method();
Method();
} If this code have the equal time for Next, the most important fact in such method is not the body of the method, but how we call this method. The trick by @benaadams with OperationsPerInvoke is valid, but we should understand that we get different values for different implementations (it depends on the dependency instruction graph between fields inside the method). |
Actually I tried a for loop and the results were pretty good. Too lazy to run it again now to post the numbers but AFAIR the new implementation was something like 2x faster than the old one for negative numbers while being identical with the old one for positive numbers. And a for loop scenario is likely better as it's more likely to showcase real world performance. Nobody cares about the performance of a lone
That's definitely a case of trying too hard. Measuring the accurate time taken but very small pieces of code is practically impossible to get right with the tools you have at your disposal. You'd need to measure the number of cycles such sequences of code require and then, if you really want, you can extrapolate the execution time from the number cycles. Attempting to measure the time and compensate with all kinds of adjustments and statistics is pretty much useless, in the end you have no way to prove that the absolute result is somehow meaningful. In such cases it's good enough if the results of a couple of benchmarks are comparable. |
* Improve perf for Math.Abs * Inline Math.Abs Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
* Improve perf for Math.Abs * Inline Math.Abs Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
* Improve perf for Math.Abs * Inline Math.Abs Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
* Improve perf for Math.Abs * Inline Math.Abs Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
@@ -63,6 +99,12 @@ public static decimal Abs(decimal value) | |||
return decimal.Abs(value); | |||
} | |||
|
|||
[StackTraceHidden] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do, prevent inlining?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stops it showing up in exception stack trace #14652 so it looks like the throw was in-place rather than off a throw helper function
* Improve perf for Math.Abs * Inline Math.Abs Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
* Improve perf for Math.Abs * Inline Math.Abs Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
* Improve perf for Math.Abs * Inline Math.Abs Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Resolves https://github.com/dotnet/corefx/issues/26253