Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT Performance regression between 1.1 and 2.0 and between 2.0 and 2.1 #18339

Open
saucecontrol opened this issue Jun 7, 2018 · 22 comments

Comments

Projects
None yet
8 participants
@saucecontrol
Copy link
Member

commented Jun 7, 2018

I've been trying to come up with a simple repro for a perf regression I've been fighting, and I think I have it trimmed down as much as possible while still showing the real-world impact. What I have here is a stripped-down version of my Blake2b hashing algorithm implementation. The main work is driven through a single CPU-intensive mixing function, and I've set up benchmarks with 3 variants of that function to show the problem.

  1. is a no-op version that simply runs all the outer loop code as a baseline for determining time spent in the inner loop.
  2. is the version I'd like to use, with the mixing math in a separate method, marked with [MethodImpl(MethodImplOptions.AggressiveInlining)]
  3. is the version I ended up using in my real implementation, which is fully-inlined manually because this one gives the most consistent results with the older runtimes.

The function is huge when inlined (JitDump was 67MB), but it really takes running all the calculations to show the full perf effect. It looks like it's probably a register allocation issue (the algorithm uses 16 local accumulator variables), but I haven't been able to tell exactly where it's going wrong.

I'm seeing a consistent perf regression between .NET Core 1.1 and 2.0 and then some additional regression between 2.0 and 2.1 (x86 particularly) with this function. I've run the benchmarks against all three runtimes, x86 and x64, on two machines to show what I mean.

First up is a Skylake i7 Desktop

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.431 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
Frequency=3914064 Hz, Resolution=255.4889 ns, Timer=TSC
.NET Core SDK=2.1.300
  [Host]        : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
  netcoreapp1.1 : .NET Core 1.1.8 (CoreCLR 4.6.26328.01, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
Method Job Platform IsBaseline Mean Error StdDev Scaled
MixNoop netcoreapp1.1 X64 True 541.1 us 1.2367 us 1.1568 us 1.00
MixNoop netcoreapp2.0 X64 Default 536.7 us 0.4613 us 0.3852 us 0.99
MixNoop netcoreapp2.1 X64 Default 578.8 us 0.4531 us 0.3783 us 1.07
MixNoop netcoreapp1.1 X86 True 1,119.9 us 9.6466 us 9.0234 us 1.00
MixNoop netcoreapp2.0 X86 Default 1,192.8 us 2.6294 us 2.0529 us 1.07
MixNoop netcoreapp2.1 X86 Default 1,191.3 us 0.6622 us 0.5170 us 1.06
MixPreferred netcoreapp1.1 X64 True 12,459.8 us 101.8205 us 90.2612 us 1.00
MixPreferred netcoreapp2.0 X64 Default 12,678.6 us 5.6177 us 4.6910 us 1.02
MixPreferred netcoreapp2.1 X64 Default 12,746.3 us 6.1352 us 4.7900 us 1.02
MixPreferred netcoreapp1.1 X86 True 175,779.4 us 74.0106 us 61.8022 us 1.00
MixPreferred netcoreapp2.0 X86 Default 106,492.8 us 35.6985 us 27.8710 us 0.61
MixPreferred netcoreapp2.1 X86 Default 136,210.1 us 59.2411 us 55.4141 us 0.77
MixManualInline netcoreapp1.1 X64 True 12,349.0 us 24.9308 us 20.8184 us 1.00
MixManualInline netcoreapp2.0 X64 Default 12,792.0 us 80.9773 us 75.7462 us 1.04
MixManualInline netcoreapp2.1 X64 Default 12,741.5 us 6.3150 us 5.2733 us 1.03
MixManualInline netcoreapp1.1 X86 True 81,966.3 us 27.9119 us 21.7917 us 1.00
MixManualInline netcoreapp2.0 X86 Default 106,330.1 us 33.9632 us 31.7692 us 1.30
MixManualInline netcoreapp2.1 X86 Default 135,755.2 us 67.5010 us 63.1405 us 1.66

The MixManualInline numbers show the regression quite clearly. There's a 3-4% slowdown between 1.1 and 2.0 on x64 that holds up in 2.1 On x86, it gets significatntly worse with each newer version. The MixPreferred job with the AgressiveInlining version of the mixing function does very poorly on 1.1 x86 but then shows basically the same results as the manually-inlined version everywhere else.

And here's the same benchmark run on an older Sandy Bridge Xeon server

BenchmarkDotNet=v0.10.14, OS=Windows 8.1 (6.3.9600.0)
Intel Xeon CPU E31230 3.20GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=3117919 Hz, Resolution=320.7267 ns, Timer=TSC
.NET Core SDK=2.1.300
  [Host]        : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 32bit RyuJIT
  netcoreapp1.1 : .NET Core 1.1.8 (CoreCLR 4.6.26328.01, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
Method Job Platform IsBaseline Mean Error StdDev Median Scaled ScaledSD
MixNoop netcoreapp1.1 X64 True 1.326 ms 0.1071 ms 0.3158 ms 1.331 ms 1.00 0.00
MixNoop netcoreapp2.0 X64 Default 1.288 ms 0.1003 ms 0.2957 ms 1.285 ms 1.03 0.36
MixNoop netcoreapp2.1 X64 Default 1.352 ms 0.0928 ms 0.2735 ms 1.358 ms 1.08 0.36
MixNoop netcoreapp1.1 X86 True 2.738 ms 0.2200 ms 0.6486 ms 2.701 ms 1.00 0.00
MixNoop netcoreapp2.0 X86 Default 2.673 ms 0.2034 ms 0.5998 ms 2.826 ms 1.04 0.36
MixNoop netcoreapp2.1 X86 Default 2.776 ms 0.2096 ms 0.6181 ms 2.740 ms 1.08 0.37
MixPreferred netcoreapp1.1 X64 True 31.926 ms 2.2443 ms 6.6172 ms 32.740 ms 1.00 0.00
MixPreferred netcoreapp2.0 X64 Default 31.898 ms 2.1440 ms 6.3216 ms 31.735 ms 1.05 0.33
MixPreferred netcoreapp2.1 X64 Default 27.419 ms 2.4213 ms 7.1393 ms 25.538 ms 0.90 0.32
MixPreferred netcoreapp1.1 X86 True 341.759 ms 13.1190 ms 38.4759 ms 331.234 ms 1.00 0.00
MixPreferred netcoreapp2.0 X86 Default 251.373 ms 13.1463 ms 38.3484 ms 239.499 ms 0.74 0.14
MixPreferred netcoreapp2.1 X86 Default 291.564 ms 10.3316 ms 29.6434 ms 285.206 ms 0.86 0.13
MixManualInline netcoreapp1.1 X64 True 33.165 ms 2.2903 ms 6.7529 ms 33.426 ms 1.00 0.00
MixManualInline netcoreapp2.0 X64 Default 33.819 ms 2.4410 ms 7.1973 ms 34.340 ms 1.07 0.34
MixManualInline netcoreapp2.1 X64 Default 20.601 ms 0.1490 ms 0.1394 ms 20.606 ms 0.65 0.15
MixManualInline netcoreapp1.1 X86 True 163.629 ms 7.6584 ms 22.3398 ms 160.224 ms 1.00 0.00
MixManualInline netcoreapp2.0 X86 Default 249.708 ms 11.3452 ms 32.7336 ms 240.169 ms 1.55 0.28
MixManualInline netcoreapp2.1 X86 Default 285.367 ms 9.0883 ms 25.6337 ms 278.039 ms 1.77 0.27

Here there are a few more outliers, but it's the same general story.

Full code for the benchmarks is here if anyone wants to try on different hardware.

@janvorli

This comment has been minimized.

Copy link
Member

commented Jun 7, 2018

@AndyAyersMS

This comment has been minimized.

Copy link
Member

commented Jun 12, 2018

I'll take a quick look. Given that it is an x86 regression my guess it might be similar to what we see over in #16610.

@saucecontrol

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2018

Thanks, @AndyAyersMS. I did some further optimization of that inner loop code in my main project that altered results slightly. Don't know if it will help track down the regression. Basically, I added more local variables to hold the data block being hashed. Of course those end up getting spilled, but having them and the accumulator variables in the same stack memory area speeds things up slightly. I also re-ordered some of the operations to reduce register dependencies.

Anyway, the updated version of the function now performs the same on 1.1 and 2.0 on the x64 runtimes but has a consistent 4% slowdown in 2.1. x86 shows the same major slowdown between 1.1 and 2.0 and again between 2.0 and 2.1.

Method Job Platform IsBaseline Mean Error StdDev Scaled Allocated
Blake2bFast netcoreapp1.1 X64 True 10.69 ms 0.0147 ms 0.0123 ms 1.00 0 B
Blake2bFast netcoreapp2.0 X64 Default 10.68 ms 0.0610 ms 0.0571 ms 1.00 0 B
Blake2bFast netcoreapp2.1 X64 Default 11.14 ms 0.0383 ms 0.0359 ms 1.04 0 B
Blake2bFast netcoreapp1.1 X86 True 69.34 ms 0.4273 ms 0.3997 ms 1.00 0 B
Blake2bFast netcoreapp2.0 X86 Default 108.07 ms 0.6709 ms 0.6275 ms 1.56 0 B
Blake2bFast netcoreapp2.1 X86 Default 129.12 ms 0.7999 ms 0.7483 ms 1.86 0 B
@AndyAyersMS

This comment has been minimized.

Copy link
Member

commented Jun 12, 2018

One would hope x86 is ~2x slower than x64 as primary data type is long/ulong. But no...

Looking at x86, 2.0 vs 2.1 there is a large size regression. Likely RA.

2.0
; Total bytes of code 49767, prolog size 31 for method Blake2bContext:mixManualInline(int):this

2.1
; Total bytes of code 60608, prolog size 31 for method Blake2bContext:mixManualInline(int):this

Suffice to say this is something of a torture test for an x86 allocator, leading to long stretches of stack shuffling:

       89B574FCFFFF mov      dword ptr [ebp-38CH], esi
       8BB578FCFFFF mov      esi, dword ptr [ebp-388H]
       81F66BBD41FB xor      esi, 0xFB41BD6B
       89B578FCFFFF mov      dword ptr [ebp-388H], esi
       8BB574FCFFFF mov      esi, dword ptr [ebp-38CH]
       81F6ABD9831F xor      esi, 0x1F83D9AB
       89B574FCFFFF mov      dword ptr [ebp-38CH], esi
       8BB578FCFFFF mov      esi, dword ptr [ebp-388H]
       89B594FEFFFF mov      dword ptr [ebp-16CH], esi
       8BB574FCFFFF mov      esi, dword ptr [ebp-38CH]
       89B590FEFFFF mov      dword ptr [ebp-170H], esi
       8BF1         mov      esi, ecx
       03B5F4FEFFFF add      esi, dword ptr [ebp-10CH]
       89B574FCFFFF mov      dword ptr [ebp-38CH], esi
       8BF2         mov      esi, edx
       13B5F0FEFFFF adc      esi, dword ptr [ebp-110H]
       89B578FCFFFF mov      dword ptr [ebp-388H], esi
       8B75F0       mov      esi, dword ptr [ebp-10H]
       8B8574FCFFFF mov      eax, dword ptr [ebp-38CH]
       0306         add      eax, dword ptr [esi]
       898574FCFFFF mov      dword ptr [ebp-38CH], eax
       8B8578FCFFFF mov      eax, dword ptr [ebp-388H]
       134604       adc      eax, dword ptr [esi+4]
       8BB574FCFFFF mov      esi, dword ptr [ebp-38CH]

The methods are too big to make sense of as is (especially given the semi-arbitrary IG breaks) so we should cut them down to a representative sample. Probably just one round of mixing would be plenty to surface the challenges the jit must overcome.

Also need to look into what kind of code jit32 produced in 1.1.

@AndyAyersMS

This comment has been minimized.

Copy link
Member

commented Jun 12, 2018

@saucecontrol Unrolled hash computations are often difficult cases for the allocator because everything ends up highly interdependent. But clearly we have a lot of room for improvement on x86 both on a release/release basis and by comparison to where we are on x64.

Speaking of comparisons, do you have a native (C/C++) implementation that you use as a reference point? It also would be interesting to see how far off our x86/x64 performance is from pure native code.

@saucecontrol

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2018

Sure do. There's a project called Blake2.Bench in that same repo that has comparisons with the reference C version called via PInvoke. My C# version started out with about the same performance on x64, but after optimization, it's quite a bit faster. Performance on x86 has never been as good, but 1.1 did admirably.

Here's another run of that last bench with the native results inlcuded

Method Job Platform IsBaseline Mean Error StdDev Scaled Allocated
Blake2bFast netcoreapp1.1 X64 True 10.73 ms 0.0682 ms 0.0638 ms 1.00 0 B
Blake2bFast netcoreapp2.0 X64 Default 10.79 ms 0.0524 ms 0.0490 ms 1.01 0 B
Blake2bFast netcoreapp2.1 X64 Default 11.19 ms 0.0609 ms 0.0569 ms 1.04 0 B
Blake2bRefNative netcoreapp2.1 X64 Default 16.12 ms 0.0334 ms 0.0279 ms 1.50 0 B
Blake2bFast netcoreapp1.1 X86 True 69.10 ms 0.2009 ms 0.1781 ms 1.00 0 B
Blake2bFast netcoreapp2.0 X86 Default 108.28 ms 0.6423 ms 0.5694 ms 1.57 0 B
Blake2bFast netcoreapp2.1 X86 Default 128.76 ms 0.3564 ms 0.3159 ms 1.86 0 B
Blake2bRefNative netcoreapp2.1 X86 Default 50.00 ms 0.2553 ms 0.2132 ms 0.72 0 B

Since register allocation is such an issue, I reckon it's expected x86 would be more than 2x slower since it has less than half the registers to work with. Combine that with the ulong calculations, and 4x slower is probably about right for a target.

@saucecontrol

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2018

I don't know if this is useful or not, but I also have an implementation of the Blake2s algorithm in the main project in that repo. It's the exact same thing but operates on uint words instead of ulong and does 10 mixing rounds instead of 12.

Method Job Platform IsBaseline Mean Error StdDev Scaled Allocated
Blake2sFast netcoreapp1.1 X64 True 17.06 ms 0.2094 ms 0.1959 ms 1.00 0 B
Blake2sFast netcoreapp2.0 X64 Default 17.34 ms 0.1110 ms 0.1038 ms 1.02 0 B
Blake2sFast netcoreapp2.1 X64 Default 17.23 ms 0.0149 ms 0.0107 ms 1.01 0 B
Blake2sRefNative netcoreapp2.1 X64 Default 25.89 ms 0.0216 ms 0.0168 ms 1.52 0 B
Blake2sFast netcoreapp1.1 X86 True 67.97 ms 0.5795 ms 0.5420 ms 1.00 0 B
Blake2sFast netcoreapp2.0 X86 Default 26.92 ms 0.0720 ms 0.0638 ms 0.40 0 B
Blake2sFast netcoreapp2.1 X86 Default 30.73 ms 0.2364 ms 0.2211 ms 0.45 0 B
Blake2sRefNative netcoreapp2.1 X86 Default 32.10 ms 0.0524 ms 0.0464 ms 0.47 0 B

Interesting that RyuJIT-32 does so much better between 1.1 and 2.0 here, but there's a definite regression between 2.0 and 2.1.

@AndyAyersMS

This comment has been minimized.

Copy link
Member

commented Jun 12, 2018

Core 1.1 used the older "jit32" jit for x86, and Core 2.0 and later uses RyuJit.

Generally speaking this switch to RyuJit resulted in improved performance for x86. But not always.

I think the perf issues in the initial long version of your code that show up in 2.0 and 2.1 are related to how RyuJit handles the 64 bit long computations on x86. The main challenge here is that these need to be decomposed into two 32 bit computations and various complications arise depending on when and how this is done.

Long decomposition doubles the number of things that deserve registers (vs same computation on x64) while at the same time the allocator is trying to cope with the fact that on x86 there are only ~6 allocatable registers, while on x64 there are ~13. Twice as much demand and half as much supply.

I don't have anything specific to call out just yet though, or a good explanation of why it started off poorly and has gotten worse.

@saucecontrol

This comment has been minimized.

Copy link
Member Author

commented Jun 13, 2018

Core 1.1 used the older "jit32" jit for x86, and Core 2.0 and later uses RyuJit.

Ohhhhhhh… I feel like I knew that at one point and it fell out of my brain. Doesn't help that BDN will only run Job.RyuJitX86 with their netcore toolchains 😉. That does at least explain some of it.

@saucecontrol

This comment has been minimized.

Copy link
Member Author

commented Jun 29, 2018

I found another case that might be related and is much more simple to follow.

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929685 Hz, Resolution=341.3336 ns, Timer=TSC
.NET Core SDK=2.1.301
  [Host]        : .NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 32bit RyuJIT
  net46         : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.3110.0
  netcoreapp1.1 : .NET Core 1.1.8 (CoreCLR 4.6.26328.01, CoreFX 4.6.24705.01), 32bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 32bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 32bit RyuJIT
Method Job Runtime IsBaseline Mean Error StdDev Scaled
FixedPoint net46 Clr Default 17.10 us 0.0742 us 0.0694 us 1.00
FixedPoint netcoreapp1.1 Core True 17.15 us 0.0985 us 0.0921 us 1.00
FixedPoint netcoreapp2.0 Core Default 18.37 us 0.0978 us 0.0915 us 1.07
FixedPoint netcoreapp2.1 Core Default 18.56 us 0.1381 us 0.1292 us 1.08

This one shows a consistent 6-8% regression between jit32 and RyuJIT on this machine and more like 10-20% on another I tested.

class GreyConverter
{
    unsafe static void ToGreyFixedPoint(byte* prgb, byte* pgrey, int cb)
    {
        const ushort scale = 1 << 15;
        const ushort round = 1 << 14;
        const ushort rw = (ushort)(0.299f * scale + 0.5f);
        const ushort gw = (ushort)(0.587f * scale + 0.5f);
        const ushort bw = (ushort)(0.114f * scale + 0.5f);

        byte* end = prgb + cb;
        while (prgb < end)
        {
            int val = prgb[0] * rw + prgb[1] * gw + prgb[2] * bw;
            *pgrey = (byte)((val + round) >> 15);

            prgb += 3;
            pgrey++;
        }
    }

    byte[] Grey;
    byte[] Rgb;

    public GreyConverter()
    {
        Grey = new byte[16384];
        Rgb = new byte[Grey.Length * 3];

        new Random(42).NextBytes(Rgb);
    }

    [Benchmark]
    unsafe public void FixedPoint()
    {
        fixed (byte* prgb = &Rgb[0], pgrey = &Grey[0])
            ToGreyFixedPoint(prgb, pgrey, Rgb.Length);
    }
}

Here's the BDN assembly dump from the legacy JIT:

081f9690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
081f9697 8b7d08          mov     edi,dword ptr [ebp+8]
081f969a 03f9            add     edi,ecx
081f969c 3bcf            cmp     ecx,edi
081f969e 7334            jae     081f96d4
081f96a0 0fb611          movzx   edx,byte ptr [ecx]
081f96a3 69d246260000    imul    edx,edx,2646h
081f96a9 0fb64101        movzx   eax,byte ptr [ecx+1]
081f96ad 69c0234b0000    imul    eax,eax,4B23h
081f96b3 03d0            add     edx,eax
081f96b5 0fb64102        movzx   eax,byte ptr [ecx+2]
081f96b9 69c0980e0000    imul    eax,eax,0E98h
081f96bf 03d0            add     edx,eax
081f96c1 81c200400000    add     edx,4000h
081f96c7 c1fa0f          sar     edx,0Fh
081f96ca 8816            mov     byte ptr [esi],dl
081f96cc 83c103          add     ecx,3
081f96cf 46              inc     esi
081f96d0 3bcf            cmp     ecx,edi
081f96d2 72cc            jb      081f96a0
081f96d4 5e              pop     esi

And RyuJIT:

05649198 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
0564919e 8bc1            mov     eax,ecx
056491a0 034508          add     eax,dword ptr [ebp+8]
056491a3 3bc8            cmp     ecx,eax
056491a5 7336            jae     056491dd
056491a7 0fb619          movzx   ebx,byte ptr [ecx]
056491aa 69f346260000    imul    esi,ebx,2646h
056491b0 0fb65901        movzx   ebx,byte ptr [ecx+1]
056491b4 69fb234b0000    imul    edi,ebx,4B23h
056491ba 03f7            add     esi,edi
056491bc 0fb65902        movzx   ebx,byte ptr [ecx+2]
056491c0 69fb980e0000    imul    edi,ebx,0E98h
056491c6 03f7            add     esi,edi
056491c8 81c600400000    add     esi,4000h
056491ce 8bde            mov     ebx,esi
056491d0 c1fb0f          sar     ebx,0Fh
056491d3 881a            mov     byte ptr [edx],bl
056491d5 83c103          add     ecx,3
056491d8 42              inc     edx
056491d9 3bc8            cmp     ecx,eax
056491db 72ca            jb      056491a7
056491dd 5b              pop     ebx

The two differences that jump out at me:

  1. RyuJIT uses an extra register and has that extra mov ebx, esi
  2. jit32 sets up the first two movzx imul combinations so they can run in parallel whereas RyuJIT uses ebx for both.
@mikedn

This comment has been minimized.

Copy link
Contributor

commented Jun 29, 2018

jit32 sets up the first two movzx imul combinations so they can run in parallel whereas RyuJIT uses ebx for both.

The fact that the same register is used for both shouldn't make any differences, CPUs handle such cases by register renaming.

RyuJIT uses an extra register and has that extra mov ebx, esi

That may be. In general, register to register moves should have very low cost, in many cases being handled by register renaming and thus generating no actual μops. Still, they can degrade performance, if only because of the increased code size.

@saucecontrol

This comment has been minimized.

Copy link
Member Author

commented Jun 29, 2018

Hmmm... I didn't think about the register renaming, but if the CPU is doing it in both cases, I'm surprised a 2 byte code size increase would have such a large perf impact. I'll add more cases if I see anything different come up.

@AndyAyersMS

This comment has been minimized.

Copy link
Member

commented Jun 29, 2018

Perf difference might be caused by loop top alignment, note the backedge target addresses:

;;; legacy jit
081f96d2 72cc            jb      081f96a0

;;; ryjit
056491db 72ca            jb      056491a7

There is a perf benefit to having frequent branch targets be addresses == 0 mod 16 (or these days mod 32).

Not sure if legacy jit gets lucky here or would pad above the loops with a slightly different example.

@saucecontrol

This comment has been minimized.

Copy link
Member Author

commented Jun 30, 2018

Looks like the alignment was purely coincidental. I added an extra increment before the loop to offset each of them by 1. The RyuJIT version still came out 8% slower with its loop top 16-byte aligned

Method Job IsBaseline Mean Error StdDev Scaled
FixedPoint netcoreapp1.1 True 17.37 us 0.0726 us 0.0679 us 1.00
FixedPoint netcoreapp2.1 Default 18.78 us 0.1559 us 0.1458 us 1.08

jit32 asm

074e9690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
074e9697 41              inc     ecx
074e9698 8b7d08          mov     edi,dword ptr [ebp+8]
074e969b 03f9            add     edi,ecx
074e969d 3bcf            cmp     ecx,edi
074e969f 7334            jae     074e96d5
074e96a1 0fb611          movzx   edx,byte ptr [ecx]
074e96a4 69d246260000    imul    edx,edx,2646h
074e96aa 0fb64101        movzx   eax,byte ptr [ecx+1]
074e96ae 69c0234b0000    imul    eax,eax,4B23h
074e96b4 03d0            add     edx,eax
074e96b6 0fb64102        movzx   eax,byte ptr [ecx+2]
074e96ba 69c0980e0000    imul    eax,eax,0E98h
074e96c0 03d0            add     edx,eax
074e96c2 81c200400000    add     edx,4000h
074e96c8 c1fa0f          sar     edx,0Fh
074e96cb 8816            mov     byte ptr [esi],dl
074e96cd 83c103          add     ecx,3
074e96d0 46              inc     esi
074e96d1 3bcf            cmp     ecx,edi
074e96d3 72cc            jb      074e96a1
074e96d5 5e              pop     esi

and RyuJIT

084e91a0 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
084e91a6 41              inc     ecx
084e91a7 8bc1            mov     eax,ecx
084e91a9 034508          add     eax,dword ptr [ebp+8]
084e91ac 3bc8            cmp     ecx,eax
084e91ae 7336            jae     084e91e6
084e91b0 0fb619          movzx   ebx,byte ptr [ecx]
084e91b3 69f346260000    imul    esi,ebx,2646h
084e91b9 0fb65901        movzx   ebx,byte ptr [ecx+1]
084e91bd 69fb234b0000    imul    edi,ebx,4B23h
084e91c3 03f7            add     esi,edi
084e91c5 0fb65902        movzx   ebx,byte ptr [ecx+2]
084e91c9 69fb980e0000    imul    edi,ebx,0E98h
084e91cf 03f7            add     esi,edi
084e91d1 81c600400000    add     esi,4000h
084e91d7 8bde            mov     ebx,esi
084e91d9 c1fb0f          sar     ebx,0Fh
084e91dc 881a            mov     byte ptr [edx],bl
084e91de 83c103          add     ecx,3
084e91e1 42              inc     edx
084e91e2 3bc8            cmp     ecx,eax
084e91e4 72ca            jb      084e91b0
084e91e6 5b              pop     ebx
@erozenfeld

This comment has been minimized.

Copy link
Member

commented Jul 5, 2018

In newer processors crossing a 32-byte boundary in a loop makes a difference because of µop cache. It's definitely the case for loops that are 32 bytes long or smaller. Here the loops are larger than 32 bytes but they cross 32-byte boundary once in jit32 case and twice in RyuJit case. That may affect perf. Can you try aligning the loops at 32 bytes in both cases?

See 9.3 in http://www.agner.org/optimize/microarchitecture.pdf for details about µop cache and 32-bytes alignment.

@saucecontrol

This comment has been minimized.

Copy link
Member Author

commented Jul 5, 2018

I managed to get both loop tops 32-byte aligned, but I'm not seeing any changes to the perf numbers. And the perf delta holds up even if I force the jit32 output to be misaligned.

Method Job IsBaseline Mean Error StdDev Scaled
FixedPoint netcoreapp1.1 True 17.36 us 0.1503 us 0.1406 us 1.00
FixedPoint netcoreapp2.1 Default 18.64 us 0.1099 us 0.0974 us 1.07

jit32

07a59690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
07a59697 8b7d08          mov     edi,dword ptr [ebp+8]
07a5969a 03f9            add     edi,ecx
07a5969c 3bcf            cmp     ecx,edi
07a5969e 7334            jae     07a596d4
07a596a0 0fb611          movzx   edx,byte ptr [ecx]
07a596a3 69d246260000    imul    edx,edx,2646h
07a596a9 0fb64101        movzx   eax,byte ptr [ecx+1]
07a596ad 69c0234b0000    imul    eax,eax,4B23h
07a596b3 03d0            add     edx,eax
07a596b5 0fb64102        movzx   eax,byte ptr [ecx+2]
07a596b9 69c0980e0000    imul    eax,eax,0E98h
07a596bf 03d0            add     edx,eax
07a596c1 81c200400000    add     edx,4000h
07a596c7 c1fa0f          sar     edx,0Fh
07a596ca 8816            mov     byte ptr [esi],dl
07a596cc 83c103          add     ecx,3
07a596cf 46              inc     esi
07a596d0 3bcf            cmp     ecx,edi
07a596d2 72cc            jb      07a596a0
07a596d4 5e              pop     esi

RyuJIT

07a191a0 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
07a191a9 81c100010000    add     ecx,100h
07a191af 81c200010000    add     edx,100h
07a191b5 83c010          add     eax,10h
07a191b8 8d440110        lea     eax,[ecx+eax+10h]
07a191bc 3bc8            cmp     ecx,eax
07a191be 7336            jae     07a191f6
07a191c0 0fb619          movzx   ebx,byte ptr [ecx]
07a191c3 69f346260000    imul    esi,ebx,2646h
07a191c9 0fb65901        movzx   ebx,byte ptr [ecx+1]
07a191cd 69fb234b0000    imul    edi,ebx,4B23h
07a191d3 03f7            add     esi,edi
07a191d5 0fb65902        movzx   ebx,byte ptr [ecx+2]
07a191d9 69fb980e0000    imul    edi,ebx,0E98h
07a191df 03f7            add     esi,edi
07a191e1 81c600400000    add     esi,4000h
07a191e7 8bde            mov     ebx,esi
07a191e9 c1fb0f          sar     ebx,0Fh
07a191ec 881a            mov     byte ptr [edx],bl
07a191ee 83c103          add     ecx,3
07a191f1 42              inc     edx
07a191f2 3bc8            cmp     ecx,eax
07a191f4 72ca            jb      07a191c0
07a191f6 5b              pop     ebx
@sgf

This comment has been minimized.

Copy link

commented Oct 4, 2018

json-benchmark (java&.net)
Java‘s DSL-JSON is so fast ...

The benchmark almost include the fastest .net json library,like Jil,NetJson....
There was hardly any .NET JSON library can be higher.

when could be CLR can be faster more than JVM?

@shahid-pk

This comment has been minimized.

Copy link
Contributor

commented Oct 5, 2018

@sgf that benchmark is not using .net core though. It might still be slower. But just pointing out.

@saucecontrol

This comment has been minimized.

Copy link
Member Author

commented Feb 6, 2019

I saw there were some recent LSRA changes and decided to revisit this to see if there was any improvement. I managed to distill my original repro down to a much more reasonable size, with only half a round of hash mixing and with just enough math to touch all the variables. Current code is here and the current numbers look like this:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-010184
  [Host]        : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  netcoreapp1.1 : .NET Core 1.1.10 (CoreCLR 4.6.26906.01, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  netcoreapp3.0 : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT
Method Job Platform Mean Error StdDev Scaled ScaledSD
SimplifiedHash netcoreapp1.1 X64 1.054 ms 0.0135 ms 0.0126 ms 1.00 0.00
SimplifiedHash netcoreapp2.0 X64 1.070 ms 0.0157 ms 0.0147 ms 1.02 0.02
SimplifiedHash netcoreapp2.1 X64 1.070 ms 0.0170 ms 0.0159 ms 1.02 0.02
SimplifiedHash netcoreapp3.0 X64 1.068 ms 0.0137 ms 0.0128 ms 1.01 0.02
SimplifiedHash netcoreapp1.1 X86 3.005 ms 0.0224 ms 0.0199 ms 1.00 0.00
SimplifiedHash netcoreapp2.0 X86 5.881 ms 0.0203 ms 0.0170 ms 1.96 0.01
SimplifiedHash netcoreapp2.1 X86 6.705 ms 0.0364 ms 0.0322 ms 2.23 0.02
SimplifiedHash netcoreapp3.0 X86 6.739 ms 0.0519 ms 0.0485 ms 2.24 0.02

The x64 regression between 1.1 and 2.0 is barely visible, but there is a code size increase to back it up. The x86 regression is quite clear both in the numbers and in the asm. I went ahead and captured JitDisasm and JitDump output for the simplified mixing method on every runtime version (3.0 is from current master) to make this easier to review. Dumps are here: jitdumps.zip

@AndyAyersMS

This comment has been minimized.

Copy link
Member

commented Feb 28, 2019

I get similar numbers:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.18323
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview4-010539
  [Host]        : .NET Core 3.0.0-preview4-27427-5 (CoreCLR 4.6.27427.71, CoreFX 4.7.19.12613), 64bit RyuJIT
  netcoreapp1.1 : .NET Core 1.1.2 (CoreCLR 4.6.25211.01, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  netcoreapp3.0 : .NET Core 3.0.0-preview4-27427-5 (CoreCLR 4.6.27427.71, CoreFX 4.7.19.12613), 64bit RyuJIT
Method Job Platform IsBaseline Mean Error StdDev Scaled ScaledSD
SimplifiedHash netcoreapp1.1 X64 True 1.259 ms 0.0142 ms 0.0111 ms 1.00 0.00
SimplifiedHash netcoreapp2.0 X64 Default 1.342 ms 0.0126 ms 0.0099 ms 1.07 0.01
SimplifiedHash netcoreapp2.1 X64 Default 1.316 ms 0.0229 ms 0.0214 ms 1.05 0.02
SimplifiedHash netcoreapp3.0 X64 Default 1.351 ms 0.0141 ms 0.0125 ms 1.07 0.01
SimplifiedHash netcoreapp1.1 X86 True 4.063 ms 0.0655 ms 0.0613 ms 1.00 0.00
SimplifiedHash netcoreapp2.0 X86 Default 8.563 ms 0.0485 ms 0.0379 ms 2.11 0.03
SimplifiedHash netcoreapp2.1 X86 Default 8.597 ms 0.0266 ms 0.0236 ms 2.12 0.03
SimplifiedHash netcoreapp3.0 X86 Default 8.534 ms 0.1033 ms 0.0915 ms 2.10 0.04

For the latest RyuJit on x86, we've gotten ourselves into a situation where the only available register for long stretches of code is esi. Not surprisingly, this leads to a lot of shuffling...

       lea      esi, [ecx+176]
       mov      dword ptr [ebp-D8H], esi
       mov      esi, dword ptr [esi]
       mov      dword ptr [ebp-ECH], esi
       mov      esi, dword ptr [ebp-D8H]

G_M44156_IG04:
       mov      esi, dword ptr [esi+4]
       mov      dword ptr [ebp-F0H], esi
       mov      esi, dword ptr [ebp-ECH]
       mov      dword ptr [ebp-78H], esi
       mov      esi, dword ptr [ebp-F0H]
       mov      dword ptr [ebp-7CH], esi
       lea      esi, [ecx+184]
       mov      dword ptr [ebp-DCH], esi
       mov      esi, dword ptr [esi]
       mov      dword ptr [ebp-F0H], esi
       mov      esi, dword ptr [ebp-DCH]
       mov      esi, dword ptr [esi+4]
       mov      dword ptr [ebp-ECH], esi
       mov      esi, dword ptr [ebp-F0H]
       mov      dword ptr [ebp-80H], esi
       mov      esi, dword ptr [ebp-ECH]
       mov      dword ptr [ebp-84H], esi
       lea      esi, [ecx+192]
       mov      dword ptr [ebp-E0H], esi
       mov      esi, dword ptr [esi]
       mov      dword ptr [ebp-ECH], esi
       mov      esi, dword ptr [ebp-E0H]
       mov      esi, dword ptr [esi+4]
       mov      dword ptr [ebp-F0H], esi
       mov      esi, dword ptr [ebp-ECH]
       xor      esi, 0xD1FFAB1E
       mov      dword ptr [ebp-ECH], esi
       mov      esi, dword ptr [ebp-F0H]
       xor      esi, 0xD1FFAB1E

I think part of the issue here is that when we decompose a long GT_IND, we save the address to a temp, even if that address is something we could trivially recompute. So we see

       lea      esi, [ecx+176]                  // address of low part of some long 
       mov      dword ptr [ebp-D8H], esi        // spill the address
       mov      esi, dword ptr [esi]            // load low part
       mov      dword ptr [ebp-ECH], esi        // spill the low part
       mov      esi, dword ptr [ebp-D8H]        // reload the address

G_M44156_IG04:
       mov      esi, dword ptr [esi+4]          // bump address to point at high part
       mov      dword ptr [ebp-F0H], esi        // load the high part

when we could just as easily have the following, and save a bit of register pressure:

       mov      esi, dword ptr [ecx + 176]      // load low part
       mov      dword ptr [ebp-ECH], esi        // spill the low part

G_M44156_IG04:
       mov      esi, dword ptr [ecx + 180]      // bump to point at high part
       mov      dword ptr [ebp-F0H], esi        // load the high part

The other thing I saw was that when a long load is dead but might cause an exception, we don't simplify to something like a null check like jit32 does. Again this saves needing a register.

Let me look into long decomposition and see how hard it would be to teach it not to create a temp for simple address forms.

@AndyAyersMS

This comment has been minimized.

Copy link
Member

commented Feb 28, 2019

I have a prototype here OptimizeDecomposeInd... it helps somewhat but we're still nowhere close to jit32.

PMI Diffs for D:\repos\Blake2Fast\tests\JitRegression\bin\Release\netcoreapp3.0\publish\JitRegression.dll for x86 default jit
Summary:
(Lower is better)
Total bytes of diff: -181 (-2.09% of base)
    diff is an improvement.
Top file improvements by size (bytes):
        -181 : JitRegression.dasm (-2.09% of base)
1 total files with size differences (1 improved, 0 regressed), 0 unchanged.
Top method improvements by size (bytes):
        -168 (-5.65% of base) : JitRegression.dasm - Blake2bContext:mixSimplified(int,int)
          -6 (-1.19% of base) : JitRegression.dasm - Blake2bContext:Update(ref):this
          -3 (-4.11% of base) : JitRegression.dasm - Blake2bContext:addLength(int):this
          -2 (-0.45% of base) : JitRegression.dasm - Blake2bContext:Init(int,ref):this
          -2 (-0.70% of base) : JitRegression.dasm - Blake2bContext:Finish():ref:this
Top method improvements by size (percentage):
        -168 (-5.65% of base) : JitRegression.dasm - Blake2bContext:mixSimplified(int,int)
          -3 (-4.11% of base) : JitRegression.dasm - Blake2bContext:addLength(int):this
          -6 (-1.19% of base) : JitRegression.dasm - Blake2bContext:Update(ref):this
          -2 (-0.70% of base) : JitRegression.dasm - Blake2bContext:Finish():ref:this
          -2 (-0.45% of base) : JitRegression.dasm - Blake2bContext:Init(int,ref):this
5 total methods with size differences (5 improved, 0 regressed), 36 unchanged.

Can do similar for stores but since those are all at the end I don't expect that to have a large impact in this case.

Also reasonably broad impact across the frameworks....

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x86 default jit
Summary:
(Lower is better)
Total bytes of diff: -13859 (-0.05% of base)
    diff is an improvement.
Top file regressions by size (bytes):
          20 : System.Net.WebSockets.dasm (0.06% of base)
Top file improvements by size (bytes):
       -1621 : System.Private.CoreLib.dasm (-0.04% of base)
       -1526 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.07% of base)
       -1146 : CommandLine.dasm (-0.41% of base)
        -826 : Microsoft.CodeAnalysis.dasm (-0.07% of base)
        -791 : System.Memory.dasm (-0.38% of base)
83 total files with size differences (82 improved, 1 regressed), 46 unchanged.
Top method regressions by size (bytes):
         381 (11.88% of base) : System.Private.CoreLib.dasm - DecCalc:VarDecDiv(byref,byref)
         137 ( 6.24% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Analysis.TraceGarbageCollector:Calculate():this
          89 ( 4.55% of base) : System.Private.CoreLib.dasm - Grisu3:TryDigitGenShortest(byref,byref,byref,struct,byref,byref):bool
          64 ( 5.74% of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.Compilation:ConstructModuleSerializationProperties(ref,ref,struct):ref:this
          46 (12.23% of base) : System.Linq.Parallel.dasm - LongMinMaxAggregationOperatorEnumerator`1[Double][System.Double]:MoveNextCore(byref):bool:this
Top method improvements by size (bytes):
        -183 (-4.11% of base) : System.Private.Uri.dasm - System.Uri:CheckAuthorityHelper(int,ushort,ushort,byref,byref,ref,byref):ushort:this
        -137 (-2.20% of base) : System.Reflection.Metadata.dasm - System.Reflection.PortableExecutable.PEHeader:.ctor(byref):this
         -65 (-7.58% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Int64,Int64][System.Int64,System.Int64]:ToString():ref:this
         -47 (-1.88% of base) : System.Private.Xml.dasm - System.Xml.Serialization.XmlSerializationWriterCodeGen:WriteEnumMethod(ref):this
         -43 (-17.77% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Int64,Int64][System.Int64,System.Int64]:Equals(ref):bool:this
Top method regressions by size (percentage):
          13 (24.07% of base) : System.Linq.Expressions.dasm - <>c__3`2[Int32,Int64][System.Int32,System.Int64]:<RunOnEmptyStack>b__3_0(ref):ref:this
          46 (12.23% of base) : System.Linq.Parallel.dasm - LongMinMaxAggregationOperatorEnumerator`1[Double][System.Double]:MoveNextCore(byref):bool:this
         381 (11.88% of base) : System.Private.CoreLib.dasm - DecCalc:VarDecDiv(byref,byref)
          27 ( 9.68% of base) : System.Linq.Parallel.dasm - System.Linq.Parallel.RepeatEnumerable`1[Int64][System.Int64]:GetPartitions(int):ref:this
           6 ( 7.50% of base) : System.Collections.Immutable.dasm - System.Linq.ImmutableArrayExtensions:First(struct):long
Top method improvements by size (percentage):
         -11 (-37.93% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Etlx.TraceModuleFile:get_ImageEnd():long:this
         -15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[__Canon,Int64][System.__Canon,System.Int64]:get_index():long:this
         -15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Int32,Int64][System.Int32,System.Int64]:get_index():long:this
         -15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Double,Int64][System.Double,System.Int64]:get_index():long:this
         -15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:get_index():long:this
4642 total methods with size differences (4418 improved, 224 regressed), 221321 unchanged.

@AndyAyersMS AndyAyersMS modified the milestones: 3.0, Future Feb 28, 2019

@AndyAyersMS

This comment has been minimized.

Copy link
Member

commented Feb 28, 2019

Will need to spend more time looking into this to figure out what else must happen to get x86 up to par (and perhaps arm32 too).

Moving this to future as I don't see any quick resolution that would fit into 3.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.