Slowdown of code when doing subtraction of ints. #10852
Comments
The subtraction is a red herring. The main difference is that in the first version the value is loop invariant (and constant) and in the second case it is not. In the first case you're simply writing a constant on every iteration and in the second case you pay the cost of subtraction and byte swapping on every iteration. If the native code is faster it may be that it is using the |
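A minimal sketch of the two loop shapes being contrasted, for readers without the repo at hand. The method names `NoMinus` and `Minus` appear in the issue text, but the bodies, the class name `LoopShapes`, the `iterations` parameter and the big-endian write are assumptions made for illustration, not the benchmark's actual code.

```csharp
using System;
using System.Buffers.Binary;

public static class LoopShapes
{
    // Assumed constant; the issue mentions 1 << 30 as the value in use.
    private const int BaseInt = 1 << 30;

    // The stored value is loop invariant and constant, so the JIT may
    // reduce each iteration to storing a precomputed value.
    public static void NoMinus(Span<byte> buffer, int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            BinaryPrimitives.WriteInt32BigEndian(buffer, BaseInt);
        }
    }

    // The stored value depends on i, so each iteration has to subtract
    // and byte-swap before the store.
    public static void Minus(Span<byte> buffer, int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            BinaryPrimitives.WriteInt32BigEndian(buffer, BaseInt - i);
        }
    }
}
```

Both methods end up with different bytes in the buffer, which is one more sign that they do different work per iteration.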
@mikedn I'm no C/C++ developer, CMake outputs this.
|
Hmm, I'm not sure what that msgpack thing in the C/C++ code is. But the disassembly output seems to indicate that your benchmark fell victim to dead code optimizations. You do not measure what you think you measure. Basically what's left in the C/C++ code is

I've tried some C/C++ code that is roughly equivalent to the C# code you are using. Compiling with GCC shows that the assembly code is not very different from what you get from .NET: https://godbolt.org/g/yFF8cp I had to change it to store the value at

So I'm not quite sure what the problem is. The 2 C# versions are fundamentally different: one stores a constant value and the other has to calculate the value to store on every iteration, so it will always be slower. Using the |
One thing I see is that RyuJIT produces the suboptimal:

    mov r8d, eax
    neg r8d
    add r8d, 40000000h

When you would expect it to produce something like:

    mov r8d, 40000000h
    sub r8d, eax

I think it's coming from here:

I noticed this when working on dotnet/coreclr#18837. At the time I assumed it was to simplify the optimizations and that it would be transformed back later, but doesn't seem like it. Maybe there's something else I'm missing. |
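The two instruction sequences compute the same 32-bit result. A small sketch that mirrors each form in C#, with hypothetical names (`SubtractionForms`, `NegThenAdd`, `DirectSub`); it only illustrates the arithmetic identity and says nothing about which instructions the JIT emits for either method.

```csharp
using System;

public static class SubtractionForms
{
    // Mirrors "mov; neg; add": negate i, then add the constant.
    public static int NegThenAdd(int c, int i) => unchecked(-i + c);

    // Mirrors "mov; sub": subtract i from the constant directly.
    public static int DirectSub(int c, int i) => unchecked(c - i);
}
```

Because 32-bit arithmetic wraps, the two forms agree for every `i`, including `int.MinValue`; the difference is only in instruction count.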
Yeah, that's a bit weird and unfortunately the comment doesn't state why it does that. It can save a register if |
And the disassembly contains the "typical" copy propagation/register allocation mess up:

    00007ff7`cdf2240b 448b4a08 mov r9d,dword ptr [rdx+8]
    00007ff7`cdf2240f 498bd0 mov rdx,r8
    00007ff7`cdf22412 458bc1 mov r8d,r9d
    00007ff7`cdf22415 458bc8 mov r9d,r8d ; make up your mind!
|
I stumbled on this peculiar issue while writing the release notes for my msgpack library. The full benchmark code is here. Results (only the relevant ones are left):

BenchmarkDotNet=v0.11.0, OS=debian 9
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=2.1.302
[Host] : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
DefaultJob : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
To rule out dead code elimination I introduced a benchmark with subtraction, and the full-stack benchmark still shows the same behavior. I would understand it if adding the subtraction added the same cost in all cases. But in the Pointer benchmark it adds 30 ns per call, and in the full one (MsgPackSpecArray) 200 ns. The code differs only in that subtraction, so the subtraction is the only possible cause. Maybe I made a mistake when creating the reproduction repo. |
Also, @EgorBo suggested that if we change |
It can't add the same cost because it's not the same code. The pointer code just writes the value. The msgpack code seems to pass the value through 3 ifs before finally writing it: https://github.com/progaudi/msgpack.spec/blob/master/src/msgpack.spec/MsgPackSpec.UInt32.cs#L76-L83 Those ifs will probably go away if the value is a constant, so you're sort of comparing apples and oranges. In real-world code I'd guess that in many cases you won't pass constants to msgpack, so that scenario is not very relevant. |
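The linked lines are not reproduced in this thread, so the following is a hypothetical sketch of a range-dispatching writer, assuming it picks among the MessagePack unsigned integer formats (positive fixint, `0xcc`, `0xcd`, `0xce`). The class name `UIntWriterSketch` and the method shape are illustrative and are not the library's code.

```csharp
using System;
using System.Buffers.Binary;

public static class UIntWriterSketch
{
    // Hypothetical writer: picks the shortest MessagePack encoding for
    // value and returns the number of bytes written.
    public static int WriteUInt32(Span<byte> buffer, uint value)
    {
        if (value <= 0x7F) // positive fixint: the value itself
        {
            buffer[0] = (byte)value;
            return 1;
        }

        if (value <= byte.MaxValue) // uint 8: 0xcc + 1 byte
        {
            buffer[0] = 0xCC;
            buffer[1] = (byte)value;
            return 2;
        }

        if (value <= ushort.MaxValue) // uint 16: 0xcd + 2 bytes, big-endian
        {
            buffer[0] = 0xCD;
            BinaryPrimitives.WriteUInt16BigEndian(buffer.Slice(1), (ushort)value);
            return 3;
        }

        // uint 32: 0xce + 4 bytes, big-endian
        buffer[0] = 0xCE;
        BinaryPrimitives.WriteUInt32BigEndian(buffer.Slice(1), value);
        return 5;
    }
}
```

When such a method is inlined with a compile-time constant argument, the three comparisons can in principle be folded away, leaving only the final stores; with an argument like `baseInt - i` they have to be evaluated on every call.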
So, if I understood correctly, we can safely assume that the JIT eliminates the ifs at run time, and that is why the fast benchmark is fast. To prove that I created another benchmark. In that benchmark I designed parameter of method

BenchmarkDotNet=v0.11.0, OS=debian 9
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=2.1.302
[Host] : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
DefaultJob : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
Here

Thank you, @mikedn. Can be closed, if you don't need this issue. |
Is there anything here we should note for follow-up? Can't quite parse the above. |
@AndyAyersMS no, as I said, can be closed. |
Reproduction repo
Problem
Code with writing `baseInt-i` to span is slower by 50% than with `baseInt`, where `baseInt` is `Int32` const, `i` is `Int32` counter.
Expected
Almost no effect, like in native code. Code excerpt and benchmark results below.
Benchmark results
Disassembly diff between `NoMinus` and `Minus`.
Difference can't be explained by `field access` since `baseInt` is const, but I added several benchmarks to check that.
If we change `int` to `uint`, 50% difference for .net code stays, almost zero difference for native code - too.
Problem persists even if we use unsafe code for .net, so it's not specific to `BinaryPrimitives` class, but it's of much smaller magnitude.
If we choose smaller ints (like 200, not 1<<30), it will not change anything.