
RyuJIT: Poor code quality for tight generic loop with many inlineable calls (factor x8 slower than non-generic few calls loop) #5252

Closed
nietras opened this issue Mar 6, 2016 · 18 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@nietras
Contributor

nietras commented Mar 6, 2016

A common operation in image processing is to merge separate color channels i.e. blue, green and red to a Bgr image (or 2D array).

Using the proposed Unsafe class, detailed in https://github.com/dotnet/corefx/issues/5474 but via the https://github.com/DotNetCross/Memory.Unsafe implementation, a completely generic loop was implemented for operations that transform three different inputs to a single output. See below.

public static TFunc Transform<T1, T2, T3, TResult, TFunc>(
    ArrayView2D<T1> viewA, ArrayView2D<T2> viewB, ArrayView2D<T3> viewC, 
    TFunc func, ArrayView2D<TResult> viewResult)
    where TFunc : IFunc<T1, T2, T3, TResult>
{
    // Checks omitted for brevity

    var rows = sizeA.Rows;
    var cols = sizeA.Cols;

    var rowStrideA = viewA.RowStrideInBytes;
    var rowStrideB = viewB.RowStrideInBytes;
    var rowStrideC = viewC.RowStrideInBytes;
    var rowStrideResult = viewResult.RowStrideInBytes;

    var rowPtrA = viewA.BytePtr;
    var rowPtrB = viewB.BytePtr;
    var rowPtrC = viewC.BytePtr;
    var rowPtrResult = viewResult.BytePtr;

    var rowPtrEndA = rowPtrA + rowStrideA * rows;
    for (; rowPtrA != rowPtrEndA; 
        rowPtrA += rowStrideA, rowPtrB += rowStrideB, rowPtrC += rowStrideC, 
        rowPtrResult += rowStrideResult)
    {
        var colPtrA = rowPtrA;
        var colPtrB = rowPtrB;
        var colPtrC = rowPtrC;
        var colPtrResult = rowPtrResult;

        var colPtrEndA = colPtrA + cols * SizeOf<T1>();
        for (; colPtrA != colPtrEndA; 
                colPtrA += SizeOf<T1>(), colPtrB += SizeOf<T2>(), colPtrC += SizeOf<T3>(), 
                colPtrResult += SizeOf<TResult>())
        {
            Write(colPtrResult, func.Invoke(
                Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC)));
        }
    }
    return func;
}

ArrayView2D<T> is just a generic value type containing a pointer, 2D size and row stride in bytes. Using this, the merge is implemented with a value type func (the IFunc interface it implements is omitted here since it is simple):

struct MergeBgrFunc : IFunc<byte, byte, byte, Bgr>
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public Bgr Invoke(byte b, byte g, byte r)
    {
        return Bgr.ByBGR(r, g, r); // NOTE: mistake, first argument should be b; acknowledged and corrected later in the thread
    }
}

and the Bgr value type defined as something like:

public struct Bgr : IEquatable<Bgr>
{
    public readonly byte Blue;
    public readonly byte Green;
    public readonly byte Red;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private Bgr(byte blue, byte green, byte red)
    {
        this.Blue = blue;
        this.Green = green;
        this.Red = red;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Bgr ByBGR(byte blue, byte green, byte red)
    { return new Bgr(blue, green, red); }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Bgr ByRGB(byte red, byte green, byte blue)
    { return new Bgr(blue, green, red); }

    public byte B { get { return this.Blue; } }
    public byte G { get { return this.Green; } }
    public byte R { get { return this.Red; } }
}
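For reference, the IFunc interface omitted above would presumably be shaped something like the following; this is a sketch, not the actual definition from the project:

```csharp
// Hypothetical sketch of the IFunc contract assumed by Transform:
// a functor taking three inputs and producing one result.
// Implementing it on a struct lets the JIT specialize Transform
// per TFunc and inline Invoke.
public interface IFunc<T1, T2, T3, TResult>
{
    TResult Invoke(T1 a, T2 b, T3 c);
}
```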

The MergeBgr functionality can be implemented as a simple call such as:

Transform(blue, green, red, new MergeBgrFunc(), bgr);

Testing this in Visual Studio 2015 Update 1, 64-bit, on an Intel processor (that is, with RyuJIT, clrjit.dll), it quickly became evident that this was ~8x slower than a specific loop such as:

public static void MergeBgr(
    ArrayView2D<byte> b, ArrayView2D<byte> g, ArrayView2D<byte> r, 
    ArrayView2D<Bgr> bgr)
{
    // Checks omitted for brevity

    var rows = size.Rows;
    var cols = size.Cols;

    var rowStrideB = b.RowStrideInBytes;
    var rowStrideG = g.RowStrideInBytes;
    var rowStrideR = r.RowStrideInBytes;
    var rowStrideBgr = bgr.RowStrideInBytes;

    var rowPtrB = b.BytePtr;
    var rowPtrG = g.BytePtr;
    var rowPtrR = r.BytePtr;
    var rowPtrBgr = bgr.BytePtr;

    var rowPtrEndB = rowPtrB + rowStrideB * rows;
    for (; rowPtrB != rowPtrEndB;
        rowPtrB += rowStrideB, rowPtrG += rowStrideG, rowPtrR += rowStrideR, 
        rowPtrBgr += rowStrideBgr)
    {
        var colPtrB = rowPtrB;
        var colPtrG = rowPtrG;
        var colPtrR = rowPtrR;
        var colPtrBgr = (Bgr*)rowPtrBgr;

        var colPtrEndB = colPtrB + cols;
        for (; colPtrB != colPtrEndB; ++colPtrB, ++colPtrG, ++colPtrR, ++colPtrBgr)
        {
            *colPtrBgr = Bgr.ByBGR(*colPtrB, *colPtrG, *colPtrR);
        }
    }
}

Or, to give more specific numbers: for a 2D array size of, say, 1280 x 1024, the two methods run in about:

Method Average loop time
Transform MergeBgr 10.352 ms
Specific MergeBgr 1.264 ms

which gives a factor 8.2x slower for the Transform version. Inspecting the generated assembly shows why.

Transform MergeBgr x64 assembly for inner most loop

00007FFF50BCA7B3  cmp         r11,r14  
00007FFF50BCA7B6  je          00007FFF50BCA810  
00007FFF50BCA7B8  movzx       eax,byte ptr [r11]  
00007FFF50BCA7BC  movzx       eax,byte ptr [r15]  
00007FFF50BCA7C0  movzx       ecx,byte ptr [r12]  
00007FFF50BCA7C5  and         eax,0FFh  
00007FFF50BCA7CA  lea         r10,[rsp+38h]  
00007FFF50BCA7CF  mov         byte ptr [r10],cl  
00007FFF50BCA7D2  mov         byte ptr [r10+1],al  
00007FFF50BCA7D6  mov         byte ptr [r10+2],cl  
00007FFF50BCA7DA  mov         ax,word ptr [rsp+38h]  
00007FFF50BCA7DF  mov         word ptr [rsp+30h],ax  
00007FFF50BCA7E4  mov         al,byte ptr [rsp+3Ah]  
00007FFF50BCA7E8  mov         byte ptr [rsp+32h],al  
00007FFF50BCA7EC  mov         ax,word ptr [rsp+30h]  
00007FFF50BCA7F1  mov         word ptr [r13],ax  
00007FFF50BCA7F6  mov         al,byte ptr [rsp+32h]  
00007FFF50BCA7FA  mov         byte ptr [r13+2],al  
00007FFF50BCA7FE  inc         r11  
00007FFF50BCA801  inc         r15  
00007FFF50BCA804  inc         r12  
00007FFF50BCA807  add         r13,3  
00007FFF50BCA80B  cmp         r11,r14  

As can be seen this has a LOT of movs in it. The JIT does do a good job inlining all the calls, but the resulting assembly is not particularly well optimized. Too many moves.

Specific MergeBgr x64 assembly for inner most loop

070CB200  mov         eax,edi  
070CB202  cmp         eax,dword ptr [ebp-3Ch]  
070CB205  je          070CB22F  
070CB207  movzx       ecx,byte ptr [edi]  
070CB20A  mov         eax,dword ptr [ebp-34h]  
070CB20D  movzx       ebx,byte ptr [eax]  
070CB210  mov         eax,dword ptr [ebp-38h]  
070CB213  movzx       esi,byte ptr [eax]  
070CB216  mov         byte ptr [edx],cl  
070CB218  mov         byte ptr [edx+1],bl  
070CB21B  mov         eax,esi  
070CB21D  mov         byte ptr [edx+2],al  
070CB220  inc         edi  
070CB221  inc         dword ptr [ebp-34h]  
070CB224  inc         dword ptr [ebp-38h]  
070CB227  add         edx,3  

The specific loop on the other hand is much better. Probably as good as it gets.

Hoping the JIT can be improved for this scenario as this would make generic numerical processing much more efficient in .NET. I have other examples where the JIT seems to choke when it is generating code where there are a lot of calls that are inlined.

An unmanaged generic type constraint plus generic pointers would make it possible to write the loop without the Unsafe class, if that would help. Perhaps it is the IFunc pattern that is the problem, though I have seen it work beautifully for simpler loops.

I know there is work being done on refactoring inlining, such as https://github.com/dotnet/coreclr/issues/3371 and https://github.com/dotnet/coreclr/issues/3482, but I am not sure it pertains to this issue, as the problem occurs after inlining has been done.

category:cq
theme:structs
skill-level:expert
cost:extra-large

@mikedn
Contributor

mikedn commented Mar 6, 2016

I think the main problem is the lack of optimizations around value types and not inlining. You have calls with value type parameters and/or value type returns. Values passed as parameters need to be copied and after inlining those copies may end up being redundant. The JIT isn't able to eliminate such copies and you end up with code like the one you posted above.

This may simply be a duplicate of #4308 but the code is complex enough that other issues may exist as well. It would be nice to post a full example somewhere that can be compiled and run with coreclr.

@nietras
Contributor Author

nietras commented Mar 6, 2016

To be clear, I wasn't saying that inlining was the problem at all; inlining works great, but indeed something seems wrong with the code generation around it. I guess the main problem is value types (which we use a lot in our "domain" of image processing) then. I will try to make a complete example if that is of interest and can help.

Any way of circumventing the copies? Can't do ref returns? Ref inputs? But that would mean that we perhaps need to make types mutable, we try to use immutable value types.

@mikedn
Contributor

mikedn commented Mar 6, 2016

Any way of circumventing the copies? Can't do ref returns? Ref inputs? But that would mean that we perhaps need to make types mutable, we try to use immutable value types.

Yes, you can try to use ref parameters and perhaps replace returns with out parameters. I'm not sure why that would require mutable types. I think that all you need is to change the second parameter of Write to use ref. And I suppose Write only reads from that parameter so the fact that it is immutable should not matter.

@nietras
Contributor Author

nietras commented Mar 7, 2016

why that would require mutable types

It does not require it. I was just wondering whether out would actually solve the issue. Anyway, as I understand there are two things that could help here.

  • Changing the MergeBgrFunc to use an out parameter instead of a return value; tests have to show whether this actually avoids the extra copies.
  • Changing the Unsafe.Write to take a ref parameter instead.

That is, for example, ending up with:

struct MergeBgrFunc : IOutFunc<byte, byte, byte, Bgr> // Need new interface
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Invoke(byte b, byte g, byte r, out Bgr bgr)
    {
        bgr = Bgr.ByBGR(r, g, r);  // Mistake here first `r` should be `b` results are with this mistake
    }
}

and add a Write taking a ref to the Unsafe class:

public static class Unsafe
{
    public static unsafe void Write<T>(void* p, ref T value)
    {
         // IL?
    }
}

Which means the inner loop becomes something like:

        var colPtrEndA = colPtrA + cols * SizeOf<T1>();
        for (; colPtrA != colPtrEndA; 
                colPtrA += SizeOf<T1>(), colPtrB += SizeOf<T2>(), colPtrC += SizeOf<T3>(), 
                colPtrResult += SizeOf<TResult>())
        {
            TResult res;
            func.Invoke(Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC), out res);
            Write(colPtrResult, ref res);
        }

which definitely is less elegant than return values.
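As for the "// IL?" placeholder above, purely as an illustration (an assumption about how such a method could be written, not the actual Memory.Unsafe source), a cpobj-based body might look like:

```
// Hypothetical IL sketch for Unsafe.Write<T>(void* p, ref T value):
// copy the value type from the ref source directly to the destination
// pointer, with no intermediate local.
.method public hidebysig static void Write<T>(void* p, !!T& 'value') cil managed aggressiveinlining
{
    ldarg.0    // destination address (void* p)
    ldarg.1    // source address (ref value)
    cpobj !!T  // copy the struct from source to destination
    ret
}
```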

@nietras
Contributor Author

nietras commented Mar 7, 2016

Should I rename the issue to something like RyuJIT: Excessive value type copies due to value type returns/parameters despite inlined calls (factor x8 slower than no calls loop)? Seems more correct.

And yes, this does in fact seem related to https://github.com/dotnet/coreclr/issues/1133.

@nietras
Contributor Author

nietras commented Mar 7, 2016

Changing to just use out parameter:

                TResult res;
                func.Invoke(Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC), out res);
                Write(colPtrResult, res);

didn't change anything in the generated assembly:

00007FFF51D3BE55  cmp         rbp,r13  
00007FFF51D3BE58  je          00007FFF51D3BEB1  
00007FFF51D3BE5A  movzx       eax,byte ptr [rbp]  
00007FFF51D3BE5E  movzx       eax,byte ptr [r14]  
00007FFF51D3BE62  movzx       r11d,byte ptr [r15]  
00007FFF51D3BE66  and         eax,0FFh  
00007FFF51D3BE6B  lea         rbx,[rsp+38h]  
00007FFF51D3BE70  mov         byte ptr [rbx],r11b  
00007FFF51D3BE73  mov         byte ptr [rbx+1],al  
00007FFF51D3BE76  mov         byte ptr [rbx+2],r11b  
00007FFF51D3BE7A  mov         ax,word ptr [rsp+38h]  
00007FFF51D3BE7F  mov         word ptr [rsp+30h],ax  
00007FFF51D3BE84  mov         al,byte ptr [rsp+3Ah]  
00007FFF51D3BE88  mov         byte ptr [rsp+32h],al  
00007FFF51D3BE8C  mov         ax,word ptr [rsp+30h]  
00007FFF51D3BE91  mov         word ptr [r12],ax  
00007FFF51D3BE96  mov         al,byte ptr [rsp+32h]  
00007FFF51D3BE9A  mov         byte ptr [r12+2],al  
00007FFF51D3BE9F  inc         rbp  
00007FFF51D3BEA2  inc         r14  
00007FFF51D3BEA5  inc         r15  
00007FFF51D3BEA8  add         r12,3  

Nor did the speed change.

@RussKeldorph
Contributor

@CarolEidt FYI.

@nietras
Contributor Author

nietras commented Mar 8, 2016

Related to this I've done some micro-benchmarks of the Unsafe Read, Write and Write-using-ref-parameter methods. The benchmark is part of https://github.com/DotNetCross/Memory.Unsafe version 0.2.2.0.

The results are:

BenchmarkDotNet=v0.9.2.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz, ProcessorCount=12
Frequency=3222656 ticks, Resolution=310.3031 ns
HostCLR=MS.NET 4.0.30319.42000, Arch=32-bit RELEASE

Type=BasicReadWriteBenchmarkBgr  Mode=Throughput  
Method Platform Jit Median StdDev
ReadFromStack X64 LegacyJit 2.9152 ns 0.0215 ns
ReadFromStack X64 RyuJit 1.1682 ns 0.0780 ns
ReadFromStack X86 LegacyJit 4.9551 ns 0.0218 ns
WriteRefToStack X64 LegacyJit 5.3458 ns 0.0549 ns
WriteRefToStack X64 RyuJit 9.6123 ns 0.1189 ns
WriteRefToStack X86 LegacyJit 6.6984 ns 0.0360 ns
WriteToStack X64 LegacyJit 5.8234 ns 0.0606 ns
WriteToStack X64 RyuJit 9.9327 ns 0.0927 ns
WriteToStack X86 LegacyJit 10.4968 ns 0.0504 ns

The code for these tests is (note the stackallocs, which were just a quick way of testing this by writing to some memory allocation; this might have unintended consequences for the results):

    public struct Bgr { public byte B; public byte G; public byte R; }

    [Config("jobs=AllJits")]
    public class BasicReadWriteBenchmark<T>
        where T : struct
    {

        // NOTE: This includes cost of stack alloc
        [Benchmark]
        public void ReadFromStack()
        {
            unsafe
            {
                var stackPtr = stackalloc byte[Unsafe.SizeOf<T>()];
                var value = Unsafe.Read<T>(stackPtr);
            }
        }

        // NOTE: This includes cost of stack alloc
        [Benchmark]
        public void WriteToStack()
        {
            unsafe
            {
                var stackPtr = stackalloc byte[Unsafe.SizeOf<T>()];
                T value = default(T);
                Unsafe.Write<T>(stackPtr, value);
            }
        }

        // NOTE: This includes cost of stack alloc
        [Benchmark]
        public void WriteRefToStack()
        {
            unsafe
            {
                var stackPtr = stackalloc byte[Unsafe.SizeOf<T>()];
                T value = default(T);
                Unsafe.Write<T>(stackPtr, ref value);
            }
        }
    }

    public class BasicReadWriteBenchmarkBgr : BasicReadWriteBenchmark<Bgr> { }

I ran this with BenchmarkDotNet 0.9.2 using:

    BenchmarkRunner.Run<BasicReadWriteBenchmarkBgr>();

Below is the assembly code as viewed when running the code in VS2015 Update 1, just as an example, so this should be RyuJIT assembly. I didn't have time to produce the asm for all JITs. One can basically see that the ref version in this case does fewer movs.

ReadFromStack

00007FFF50BA4B80  push        rbp  
00007FFF50BA4B81  sub         rsp,20h  
00007FFF50BA4B85  mov         rbp,rsp  
00007FFF50BA4B88  mov         qword ptr [rbp+18h],rsp  
00007FFF50BA4B8C  mov         rax,0DAABE8AE499Bh  
00007FFF50BA4B96  mov         qword ptr [rbp+8],rax  
00007FFF50BA4B9A  push        0  
00007FFF50BA4B9C  push        0  
00007FFF50BA4B9E  lea         rax,[rsp]  
00007FFF50BA4BA2  mov         qword ptr [rbp+18h],rsp  
00007FFF50BA4BA6  mov         dx,word ptr [rax]  
00007FFF50BA4BA9  mov         word ptr [rbp+10h],dx  
00007FFF50BA4BAD  mov         dl,byte ptr [rax+2]  
00007FFF50BA4BB0  mov         byte ptr [rbp+12h],dl  
00007FFF50BA4BB3  mov         rcx,0DAABE8AE499Bh  
00007FFF50BA4BBD  cmp         qword ptr [rbp+8],rcx  
00007FFF50BA4BC1  je          00007FFF50BA4BC8  
00007FFF50BA4BC3  call        00007FFFB06430B4  
00007FFF50BA4BC8  nop  
00007FFF50BA4BC9  lea         rsp,[rbp+20h]  
00007FFF50BA4BCD  pop         rbp  
00007FFF50BA4BCE  ret  

WriteToStack

00007FFF50BA4BF0  push        rbp  
00007FFF50BA4BF1  push        rdi  
00007FFF50BA4BF2  push        rsi  
00007FFF50BA4BF3  sub         rsp,20h  
00007FFF50BA4BF7  mov         rbp,rsp  
00007FFF50BA4BFA  mov         rsi,rcx  
00007FFF50BA4BFD  lea         rdi,[rbp]  
00007FFF50BA4C01  mov         ecx,8  
00007FFF50BA4C06  xor         eax,eax  
00007FFF50BA4C08  rep stos    dword ptr [rdi]  
00007FFF50BA4C0A  mov         rcx,rsi  
00007FFF50BA4C0D  mov         qword ptr [rbp+18h],rsp  
00007FFF50BA4C11  mov         rax,0DAABE8AE499Bh  
00007FFF50BA4C1B  mov         qword ptr [rbp],rax  
00007FFF50BA4C1F  push        0  
00007FFF50BA4C21  push        0  
00007FFF50BA4C23  lea         rax,[rsp]  
00007FFF50BA4C27  mov         qword ptr [rbp+18h],rsp  
00007FFF50BA4C2B  xor         edx,edx  
00007FFF50BA4C2D  lea         rcx,[rbp+10h]  
00007FFF50BA4C31  mov         word ptr [rcx],dx  
00007FFF50BA4C34  mov         byte ptr [rcx+2],dl  
00007FFF50BA4C37  mov         dx,word ptr [rbp+10h]  
00007FFF50BA4C3B  mov         word ptr [rbp+8],dx  
00007FFF50BA4C3F  mov         dl,byte ptr [rbp+12h]  
00007FFF50BA4C42  mov         byte ptr [rbp+0Ah],dl  
00007FFF50BA4C45  mov         dx,word ptr [rbp+8]  
00007FFF50BA4C49  mov         word ptr [rax],dx  
00007FFF50BA4C4C  mov         dl,byte ptr [rbp+0Ah]  
00007FFF50BA4C4F  mov         byte ptr [rax+2],dl  
00007FFF50BA4C52  mov         rcx,0DAABE8AE499Bh  
00007FFF50BA4C5C  cmp         qword ptr [rbp],rcx  
00007FFF50BA4C60  je          00007FFF50BA4C67  
00007FFF50BA4C62  call        00007FFFB06430B4  
00007FFF50BA4C67  nop  
00007FFF50BA4C68  lea         rsp,[rbp+20h]  
00007FFF50BA4C6C  pop         rsi  
00007FFF50BA4C6D  pop         rdi  
00007FFF50BA4C6E  pop         rbp  
00007FFF50BA4C6F  ret  

WriteRefToStack

00007FFF50BA4C90  push        rbp  
00007FFF50BA4C91  push        rdi  
00007FFF50BA4C92  push        rsi  
00007FFF50BA4C93  sub         rsp,20h  
00007FFF50BA4C97  mov         rbp,rsp  
00007FFF50BA4C9A  mov         rsi,rcx  
00007FFF50BA4C9D  lea         rdi,[rbp+8]  
00007FFF50BA4CA1  mov         ecx,6  
00007FFF50BA4CA6  xor         eax,eax  
00007FFF50BA4CA8  rep stos    dword ptr [rdi]  
00007FFF50BA4CAA  mov         rcx,rsi  
00007FFF50BA4CAD  mov         qword ptr [rbp+18h],rsp  
00007FFF50BA4CB1  mov         rax,0DAABE8AE499Bh  
00007FFF50BA4CBB  mov         qword ptr [rbp+8],rax  
00007FFF50BA4CBF  push        0  
00007FFF50BA4CC1  push        0  
00007FFF50BA4CC3  lea         rax,[rsp]  
00007FFF50BA4CC7  mov         qword ptr [rbp+18h],rsp  
00007FFF50BA4CCB  xor         edx,edx  
00007FFF50BA4CCD  lea         rcx,[rbp+10h]  
00007FFF50BA4CD1  mov         word ptr [rcx],dx  
00007FFF50BA4CD4  mov         byte ptr [rcx+2],dl  
00007FFF50BA4CD7  mov         dx,word ptr [rbp+10h]  
00007FFF50BA4CDB  mov         word ptr [rax],dx  
00007FFF50BA4CDE  mov         dl,byte ptr [rbp+12h]  
00007FFF50BA4CE1  mov         byte ptr [rax+2],dl  
00007FFF50BA4CE4  mov         rcx,0DAABE8AE499Bh  
00007FFF50BA4CEE  cmp         qword ptr [rbp+8],rcx  
00007FFF50BA4CF2  je          00007FFF50BA4CF9  
00007FFF50BA4CF4  call        00007FFFB06430B4  
00007FFF50BA4CF9  nop  
00007FFF50BA4CFA  lea         rsp,[rbp+20h]  
00007FFF50BA4CFE  pop         rsi  
00007FFF50BA4CFF  pop         rdi  
00007FFF50BA4D00  pop         rbp  
00007FFF50BA4D01  ret  

@nietras
Contributor Author

nietras commented Mar 8, 2016

Secondly, I have made a version using both out and ref so the inner loop for the transform merge Bgr becomes:

TResult res;
func.Invoke(Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC), out res);
Write(colPtrResult, ref res);

The perf does improve to (for RyuJIT again and with the other times included):

Method Average loop time
Transform MergeBgr 10.352 ms
Specific MergeBgr 1.264 ms
Transform Out+Ref MergeBgr 7.445 ms

Transform Out+Ref MergeBgr x64 assembly for inner most loop

The assembly is indeed better but perf is still far behind.

00007FFF50BFBEF5  cmp         rbp,r13  
00007FFF50BFBEF8  je          00007FFF50BFBF3F  
00007FFF50BFBEFA  movzx       eax,byte ptr [rbp]  
00007FFF50BFBEFE  movzx       eax,byte ptr [r14]  
00007FFF50BFBF02  movzx       r11d,byte ptr [r15]  
00007FFF50BFBF06  and         eax,0FFh  
00007FFF50BFBF0B  lea         rbx,[rsp+38h]  
00007FFF50BFBF10  mov         byte ptr [rbx],r11b  
00007FFF50BFBF13  mov         byte ptr [rbx+1],al  
00007FFF50BFBF16  mov         byte ptr [rbx+2],r11b  
00007FFF50BFBF1A  mov         ax,word ptr [rsp+38h]  
00007FFF50BFBF1F  mov         word ptr [r12],ax  
00007FFF50BFBF24  mov         al,byte ptr [rsp+3Ah]  
00007FFF50BFBF28  mov         byte ptr [r12+2],al  
00007FFF50BFBF2D  inc         rbp  
00007FFF50BFBF30  inc         r14  
00007FFF50BFBF33  inc         r15  
00007FFF50BFBF36  add         r12,3  

Transform Ref only MergeBgr x64

I also then tried it without the out parameter so the loop instead becomes:

var res = func.Invoke(Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC));
Write(colPtrResult, ref res);

and the assembly (the same as for Out+Ref):

00007FFF50BDBEE5  cmp         rbp,r13  
00007FFF50BDBEE8  je          00007FFF50BDBF2F  
00007FFF50BDBEEA  movzx       eax,byte ptr [rbp]  
00007FFF50BDBEEE  movzx       eax,byte ptr [r14]  
00007FFF50BDBEF2  movzx       r11d,byte ptr [r15]  
00007FFF50BDBEF6  and         eax,0FFh  
00007FFF50BDBEFB  lea         rbx,[rsp+38h]  
00007FFF50BDBF00  mov         byte ptr [rbx],r11b  
00007FFF50BDBF03  mov         byte ptr [rbx+1],al  
00007FFF50BDBF06  mov         byte ptr [rbx+2],r11b  
00007FFF50BDBF0A  mov         ax,word ptr [rsp+38h]  
00007FFF50BDBF0F  mov         word ptr [r12],ax  
00007FFF50BDBF14  mov         al,byte ptr [rsp+3Ah]  
00007FFF50BDBF18  mov         byte ptr [r12+2],al  
00007FFF50BDBF1D  inc         rbp  
00007FFF50BDBF20  inc         r14  
00007FFF50BDBF23  inc         r15  
00007FFF50BDBF26  add         r12,3  

The perf is then the same.

@nietras
Contributor Author

nietras commented Mar 8, 2016

Sorry, I made a mistake in the MergeBgrFunc, so the actual assembly e.g. for the Ref version is:

00007FFF50BBBEFA  cmp         rbp,r13  
00007FFF50BBBEFD  je          00007FFF50BBBF4C  
00007FFF50BBBEFF  movzx       eax,byte ptr [rbp]  
00007FFF50BBBF03  movzx       r11d,byte ptr [r14]  
00007FFF50BBBF07  movzx       ebx,byte ptr [r15]  
00007FFF50BBBF0B  and         eax,0FFh  
00007FFF50BBBF10  and         r11d,0FFh  
00007FFF50BBBF17  lea         r9,[rsp+38h]  
00007FFF50BBBF1C  mov         byte ptr [r9],al  
00007FFF50BBBF1F  mov         byte ptr [r9+1],r11b  
00007FFF50BBBF23  mov         byte ptr [r9+2],bl  
00007FFF50BBBF27  mov         ax,word ptr [rsp+38h]  
00007FFF50BBBF2C  mov         word ptr [r12],ax  
00007FFF50BBBF31  mov         al,byte ptr [rsp+3Ah]  
00007FFF50BBBF35  mov         byte ptr [r12+2],al  
00007FFF50BBBF3A  inc         rbp  
00007FFF50BBBF3D  inc         r14  
00007FFF50BBBF40  inc         r15  
00007FFF50BBBF43  add         r12,3 

With this fixed the perf was 8.632 ms, so still pretty bad compared to the specific loop.

This with the MergeBgrFunc being:

struct MergeBgrFunc : IOutFunc<byte, byte, byte, Bgr> // Need new interface
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Invoke(byte b, byte g, byte r, out Bgr bgr)
    {
        bgr = Bgr.ByBGR(b, g, r); // Fixed mistake so now correctly has b first
    }
}

@mikedn
Contributor

mikedn commented Mar 8, 2016

I haven't had time to look at all this more closely but even with ref and out you still have a copy in ByBGR. Now that's a case against immutability. Directly setting the fields of Invoke's bgr parameter would probably avoid the issue.

@nietras
Contributor Author

nietras commented Mar 8, 2016

@mikedn

look at all this more closely

Yes, it's a lot :)

a case against immutability

An argument against immutability being the problem is that the specific loop still uses the immutable type with no perf problem, e.g.:

*colPtrBgr = Bgr.ByBGR(*colPtrB, *colPtrG, *colPtrR);

so that can't be the problem.

If I understand correctly you are suggesting changing the func to something like:

struct MergeBgrFunc : IRefFunc<byte, byte, byte, Bgr> // Need new interface
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Invoke(byte b, byte g, byte r, ref Bgr bgr)
    {
        bgr.B = b;
        bgr.G = g;
        bgr.R = r;
    }
}

The problem is, as far as I can tell, that this is simply not possible in a generic way. It requires ref returns, so we could reinterpret the memory location via something like ref T AsRef<T>(void* ptr).
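To make the idea concrete, here is a sketch of the inner loop assuming such a hypothetical ref-returning AsRef<T>(void*) and the IRefFunc interface above (this requires ref returns in the language; System.Runtime.CompilerServices.Unsafe later shipped an AsRef API of this shape):

```csharp
// Sketch only: assumes a hypothetical ref-returning AsRef<T>(void*).
// The functor assigns the fields of the output in place, so no struct
// copy through a local is needed.
for (; colPtrA != colPtrEndA;
        colPtrA += SizeOf<T1>(), colPtrB += SizeOf<T2>(), colPtrC += SizeOf<T3>(),
        colPtrResult += SizeOf<TResult>())
{
    func.Invoke(Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC),
        ref AsRef<TResult>(colPtrResult));
}
```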

So does this not mean the problem is in the way inlining for value types occurs, that is, in the code resulting from inlined calls with value types?

@CarolEidt
Contributor

Like @mikedn I have not had time to take a look at this closely, but from what I can glean with a quick look, I believe that this case is highlighting the limitations that the JIT currently has with regard to the handling of struct (value) types. Copies are made to/from arguments and returns, and then are not eliminated during inlining or downstream. This is something I hope to be able to improve very soon.

@nietras
Contributor Author

nietras commented Mar 8, 2016

@CarolEidt ok, yes, it certainly looks like there are some rather serious issues (from a perf perspective) with value types. Would an example project help you with this work?

Or perhaps even some tests (via a pull request) added directly to the coreclr project if such tests are applicable? Have no idea how you test/structure this work in coreclr.

As you can probably tell, this work is quite important for the things we are trying to do in .NET with generic image processing. Basically, we are just trying to replicate what can be done in C++ with templates, like in the STL, as I'm sure you understand, but with a much better developer story and easy distribution etc.

@CarolEidt
Contributor

@nietras Yes, an example project would be helpful - especially one that can be used for benchmarking. The current benchmarks we use for perf analysis are in tests/src/JIT/Performance/CodeQuality; that would be a great place to add new benchmarks. They just need to report their perf result. @chuck-mitchell is currently working on the analysis, but in the past (i.e. desktop perf analysis) the analysis tools were pretty flexible about how it was reported.

@CarolEidt
Contributor

@nietras - I am no longer seeing the excess copies in the small ReadFromStack and WriteToStack methods. Can you check to see whether you're still seeing the excess copies in your larger examples? Also, if you can point to a small-ish test case that exhibits the issue that would be very helpful.

CarolEidt referenced this issue in CarolEidt/coreclr Jan 28, 2019
- Enable CSE of struct values when handle is available (and add code to get the handle of HW SIMD types)
- Don't require block nodes for SIMD assignments
- Don't set `GTF_GLOB_REF` on `GT_OBJ` if it is local
- Set `lvRegStruct` on promoted SIMD fields
- Add tests for #19910 (fixed with this PR) and #3539 & #19910 (fixed with dotnet#21314)
- Additional cleanup
CarolEidt referenced this issue in CarolEidt/coreclr Mar 19, 2019
CarolEidt referenced this issue in CarolEidt/coreclr Mar 21, 2019
mikedn referenced this issue in mikedn/coreclr Mar 23, 2019
CarolEidt referenced this issue in dotnet/coreclr Mar 28, 2019
* [WIP] Struct & SIMD improvements

- Enable CSE of struct values when handle is available (and add code to get the handle of HW SIMD types)
- Don't require block nodes for SIMD assignments
- Don't set `GTF_GLOB_REF` on `GT_OBJ` if it is local
- Set `lvRegStruct` on promoted SIMD fields
- Add tests for #19910 (fixed with this PR) and #3539 & #19438 (fixed with #21314)
- Additional cleanup

Fix #19910
@CarolEidt
Contributor

@nietras - I'm going to close this issue as having been fixed by dotnet/coreclr#22255 - it addressed the issues that I captured in the coreclr/tests/src/JIT/Regressions/JitBlue/GitHub_3539 test case.
Please feel free to reopen with more info, open a new issue, or add comments to the remaining open issues with struct code generation.

@nietras
Contributor Author

nietras commented Jun 21, 2019

@CarolEidt sorry I missed your comments here, but this sounds great! Thank you, I will let you know if my testing still shows issues.

The code in the top comment (reduced to one dimension) would be the best bet for a test case: merging three bytes into a Bgr via a generic struct functor.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 2, 2021