Slowdown of code when doing subtraction of ints. #19355

Open
aensidhe opened this Issue Aug 8, 2018 · 10 comments

@aensidhe

aensidhe commented Aug 8, 2018

Reproduction repo

Problem

Code that writes baseInt - i to a span is about 50% slower than code that writes baseInt, where baseInt is an Int32 constant and i is the Int32 loop counter.

Expected

Almost no effect, as in native code. Code excerpt and benchmark results are below.

[Benchmark(Baseline = true)]
public int NoMinus()
{
    var i = 0;
    for (; i < length; i++)
        BinaryPrimitives.WriteInt32BigEndian(_buffer, baseInt);
    return i;
}

[Benchmark]
public int Minus()
{
    var i = 0;
    for (; i < length; i++)
        BinaryPrimitives.WriteInt32BigEndian(_buffer, baseInt - i);
    return i;
}

Benchmark results

BenchmarkDotNet=v0.11.0, OS=debian 9
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=2.1.302
  [Host]     : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT

Method           Mean       Error     StdDev      Q3         Scaled  ScaledSD  Allocated
NoMinus          133.19 ns  2.705 ns   3.1150 ns  134.64 ns    1.00      0.00        0 B
Minus            225.59 ns  4.521 ns  11.7502 ns  232.00 ns    1.69      0.10        0 B
MinusLocalVar    202.04 ns  3.792 ns   3.7244 ns  204.48 ns    1.52      0.04        0 B
MinusLocalConst  213.54 ns  4.193 ns   4.4860 ns  217.70 ns    1.60      0.05        0 B
Pointer           91.14 ns  1.819 ns   1.8681 ns   92.42 ns    0.68      0.02        0 B
PointerMinus     104.95 ns  2.026 ns   1.8954 ns  106.36 ns    0.79      0.02        0 B
CArray            93.81 ns  1.068 ns   0.9468 ns   94.43 ns    0.70      0.02        0 B
CArrayMinus       96.51 ns  1.977 ns   2.1155 ns   97.44 ns    0.72      0.02        0 B
CppArray          86.01 ns  1.713 ns   1.9730 ns   86.91 ns    0.65      0.02        0 B
CppArrayMinus     92.25 ns  1.817 ns   1.7849 ns   93.28 ns    0.69      0.02        0 B

Disassembly diff between NoMinus and Minus.

The difference can't be explained by field access, since baseInt is a const, but I added several benchmarks to verify that.

If we change int to uint, the 50% difference for the .NET code stays, and the near-zero difference for the native code stays too.

The problem persists even if we use unsafe code in .NET, so it's not specific to the BinaryPrimitives class, though there it is of much smaller magnitude.

Choosing smaller ints (like 200 instead of 1<<30) does not change anything.
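For reference, the native counterparts (the CArray/CArrayMinus benchmarks in the results above) presumably boil down to something like the following C sketch. The function names and the big-endian helper are my reconstruction, not the code from the reproduction repo:

```c
#include <stdint.h>

/* Write a 32-bit value in big-endian byte order, as
   BinaryPrimitives.WriteInt32BigEndian does on a little-endian machine. */
static void write_be32(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
}

/* Analogue of NoMinus: the stored value is loop-invariant. */
int no_minus(uint8_t *buf, int length, int32_t base_int)
{
    int i = 0;
    for (; i < length; i++)
        write_be32(buf, (uint32_t)base_int);
    return i;
}

/* Analogue of Minus: the stored value is recomputed every iteration. */
int minus(uint8_t *buf, int length, int32_t base_int)
{
    int i = 0;
    for (; i < length; i++)
        write_be32(buf, (uint32_t)(base_int - i));
    return i;
}
```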

@mikedn


Contributor

mikedn commented Aug 8, 2018

The subtraction is a red herring. The main difference is that in the first version the value is loop invariant (and constant) and in the second case it is not. In the first case you're simply writing a constant on every iteration and in the second case you pay the cost of subtraction and byte swapping on every iteration.

If the native code is faster it may be that it is using the bswap instruction instead of the "manual" byte swapping implemented by WriteInt32BigEndian. Can you check what assembly code the native compiler produced? Which native compiler?
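The "manual" byte swap in question can be sketched in C as below (my sketch, not the CoreFX source). Modern GCC and Clang recognize this shift-and-or pattern and compile it down to a single bswap instruction; whether RyuJIT did so at the time is exactly the question raised here:

```c
#include <stdint.h>

/* Byte-swap a 32-bit value: the shift-based pattern that a big-endian
   write performs on a little-endian machine. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}
```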

@aensidhe


aensidhe commented Aug 8, 2018

@mikedn I'm no C/C++ developer; CMake outputs this:

-- The C compiler identification is GNU 6.3.0
-- The CXX compiler identification is GNU 6.3.0

Disassembly gist

@mikedn


Contributor

mikedn commented Aug 8, 2018

Hmm, I'm not sure what that msgpack thing in the C/C++ code is. But the disassembly output seems to indicate that your benchmark fell victim to dead-code optimization: you do not measure what you think you measure. Basically, all that's left in the C/C++ code is malloc/free.

I've tried some C/C++ code that is roughly equivalent to the C# code you are using. Compiling with GCC shows that the assembly code is not very different from what you get from .NET: https://godbolt.org/g/yFF8cp

I had to change it to store the value at buffer[i] rather than buffer[0]. Storing to the same location makes the C/C++ compilers eliminate the loop. That's something that the .NET JIT doesn't do. That can be considered as a plus for C/C++ but then this kind of loop isn't exactly "normal" outside benchmarks.

So I'm not quite sure what the problem is. The two C# versions are fundamentally different: one stores a constant value, while the other has to calculate the value to store on every iteration, so it will always be slower. Using the bswap instruction will probably make it faster, but it won't be as fast as the first version. Improving the JIT's dead code elimination could make this benchmark's performance match the native one, but it wouldn't be very relevant for real-world code, where such dead code is not exactly common.
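The godbolt experiment described above presumably looks roughly like this (my reconstruction, not the exact code behind the link). Storing to buffer[i] keeps every iteration observable, so neither GCC nor the JIT can delete the loop; storing to buffer[0] lets GCC collapse the whole loop into a single store. __builtin_bswap32 is the GCC/Clang builtin:

```c
#include <stdint.h>

/* Store a freshly computed, byte-swapped value on every iteration.
   Writing to buffer[i] (not buffer[0]) prevents the compiler from
   treating all but the last store as dead. */
void fill_be(uint32_t *buffer, int length, uint32_t base)
{
    for (int i = 0; i < length; i++)
        buffer[i] = __builtin_bswap32(base - (uint32_t)i);
}
```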

@jakobbotsch


Collaborator

jakobbotsch commented Aug 8, 2018

One thing I see is that RyuJIT produces the suboptimal:

mov r8d, eax
neg r8d
add r8d, 40000000h

When you would expect it to produce something like:

mov r8d, 40000000h
sub r8d, eax

I think it's coming from here:

/* Check for "cns1 - op2" , we change it to "(cns1 + (-op2))" */

I noticed this when working on #18837. At the time I assumed it was done to simplify the optimizations and that it would be transformed back later, but it doesn't seem like it is. Maybe there's something else I'm missing.
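The two instruction sequences are interchangeable because, in two's-complement wraparound arithmetic, cns1 - op2 equals cns1 + (-op2) for every input; the morphed form just spends an extra instruction here. A small C sketch of the identity:

```c
#include <stdint.h>

/* Direct form: what you'd expect RyuJIT to emit (mov r8d, cns; sub r8d, eax). */
static uint32_t sub_direct(uint32_t c, uint32_t x)
{
    return c - x;
}

/* Morphed form: "cns1 + (-op2)", matching the neg/add sequence above. */
static uint32_t sub_morphed(uint32_t c, uint32_t x)
{
    return c + (0u - x);
}
```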

@mikedn


Contributor

mikedn commented Aug 8, 2018

Yeah, that's a bit weird, and unfortunately the comment doesn't state why it does that. It can save a register if op2 is a "last use": then you'd get just neg eax and add eax, const. It can also be useful as a form of expression canonicalization that enables other expression optimizations, but it's not like the JIT does a lot of expression optimizations.

@mikedn


Contributor

mikedn commented Aug 8, 2018

And the disassembly contains the "typical" copy propagation/register allocation mess up:

00007ff7`cdf2240b 448b4a08        mov     r9d,dword ptr [rdx+8]
00007ff7`cdf2240f 498bd0          mov     rdx,r8
00007ff7`cdf22412 458bc1          mov     r8d,r9d
00007ff7`cdf22415 458bc8          mov     r9d,r8d ; make up your mind!
@aensidhe


aensidhe commented Aug 8, 2018

I stumbled on this peculiar issue while writing release notes for my msgpack library. The full code of the benchmark is here.

Results (only the relevant rows kept):

BenchmarkDotNet=v0.11.0, OS=debian 9
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=2.1.302
  [Host]     : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT

Method                 Mean       Error      StdDev      Q3         Scaled  ScaledSD  Allocated
MsgPackSpecArray       186.64 ns  3.1066 ns   2.9059 ns  188.49 ns    1.80      0.04        0 B
MsgPackSpecArrayMinus  401.06 ns  8.0266 ns  10.1511 ns  408.03 ns    3.87      0.11        0 B
Pointer                115.33 ns  2.0789 ns   1.9446 ns  116.63 ns    1.11      0.02        0 B
PointerMinus           145.38 ns  1.9718 ns   1.7480 ns  146.41 ns    1.40      0.02        0 B
CArray                 103.78 ns  1.5577 ns   1.4571 ns  104.53 ns    1.00      0.00        0 B
CArrayMinus            105.01 ns  2.1248 ns   2.1820 ns  105.99 ns    1.01      0.02        0 B
CppArray                99.25 ns  2.1722 ns   3.9170 ns  100.41 ns    0.96      0.04        0 B
CppArrayMinus          105.89 ns  2.0795 ns   1.7365 ns  107.73 ns    1.02      0.02        0 B

To rule out dead-code elimination, I introduced a benchmark with the subtraction, and the full-stack benchmark still shows the same behavior. I would understand it if adding the subtraction added the same cost in all cases, but in the Pointer benchmark it adds 30 ns per call, while in the full one (MsgPackSpecArray) it adds 200 ns. The code differs only in that subtraction, so the subtraction must be the cause.

Maybe I made a mistake when creating the reproduction repo.

@aensidhe


aensidhe commented Aug 8, 2018

Also, @EgorBo suggested that if we change const int baseInt to an ordinary field, the fast benchmark becomes as slow as the slow one. And that is true.

@mikedn


Contributor

mikedn commented Aug 8, 2018

I would understand it if adding the subtraction added the same cost in all cases. But in the Pointer benchmark it adds 30 ns per call, while in the full one (MsgPackSpecArray) it adds 200 ns. The code differs only in that subtraction, so the subtraction must be the cause.

It can't add the same cost because it's not the same code. The pointer code just writes the value. The msgpack code seems to pass the value through three IFs before finally writing it: https://github.com/progaudi/msgpack.spec/blob/master/src/msgpack.spec/MsgPackSpec.UInt32.cs#L76-L83

Those IFs will probably go away if the value is a constant so you're sort of comparing apples and oranges. In real world code I'd guess that in many cases you won't pass constants to msgpack so that scenario is not very relevant.
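For illustration, the value-dependent branching in a msgpack-style WriteUInt32 looks roughly like this. This is a sketch of the msgpack format's uint family (positive fixint, uint8, uint16, uint32 markers), not the actual msgpack.spec source; with a compile-time-constant value, all but one branch can fold away:

```c
#include <stdint.h>

/* Encode v in the smallest msgpack uint representation.
   Markers: <= 0x7F is a positive fixint; 0xCC = uint8; 0xCD = uint16;
   0xCE = uint32.  Multi-byte payloads are big-endian.
   Returns the number of bytes written (buf must hold at least 5). */
static int write_uint32(uint8_t *buf, uint32_t v)
{
    if (v <= 0x7F) {                         /* positive fixint: 1 byte */
        buf[0] = (uint8_t)v;
        return 1;
    }
    if (v <= 0xFF) {                         /* uint8: marker + 1 byte  */
        buf[0] = 0xCC;
        buf[1] = (uint8_t)v;
        return 2;
    }
    if (v <= 0xFFFF) {                       /* uint16: marker + 2 bytes */
        buf[0] = 0xCD;
        buf[1] = (uint8_t)(v >> 8);
        buf[2] = (uint8_t)v;
        return 3;
    }
    buf[0] = 0xCE;                           /* uint32: marker + 4 bytes */
    buf[1] = (uint8_t)(v >> 24);
    buf[2] = (uint8_t)(v >> 16);
    buf[3] = (uint8_t)(v >> 8);
    buf[4] = (uint8_t)v;
    return 5;
}
```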

@aensidhe


aensidhe commented Aug 8, 2018

So, if I understood correctly, we can safely assume that the JIT eliminates the IFs at run time, and that is why the fast benchmark is fast. To prove that, I created another benchmark in which the parameter of WriteUInt32 is chosen so that we pass through all the branches.

BenchmarkDotNet=v0.11.0, OS=debian 9
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=2.1.302
  [Host]     : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT

Method     Mean      Error     StdDev    Q3        Scaled  ScaledSD  Allocated
Span       378.8 ns  6.778 ns  6.340 ns  385.2 ns    2.01      0.04        0 B
SpanConst  198.7 ns  1.255 ns  1.174 ns  199.5 ns    1.06      0.02        0 B
Fixed      235.5 ns  4.583 ns  4.501 ns  237.6 ns    1.25      0.03        0 B
C          188.1 ns  2.714 ns  2.538 ns  190.3 ns    1.00      0.00        0 B
Cpp        189.5 ns  2.896 ns  2.709 ns  191.5 ns    1.01      0.02        0 B

Here SpanConst is a benchmark with a constant, to illustrate the code elimination when only a single value is used. As we can see, once that elimination no longer applies, the native code is slower too.

Thank you, @mikedn. This can be closed if you don't need the issue.
