Skip to content

RyuJIT: Poor code quality for tight generic loop with many inlineable calls (factor x8 slower than non-generic few calls loop) #5252

@nietras

Description

@nietras

A common operation in image processing is to merge separate color channels i.e. blue, green and red to a Bgr image (or 2D array).

Using the proposed Unsafe class, detailed in https://github.com/dotnet/corefx/issues/5474 but via the https://github.com/DotNetCross/Memory.Unsafe implementation, a completely generic loop for operations that can Transform a single input to three different outputs was implemented for such operations. See below.

public static TFunc Transform<T1, T2, T3, TResult, TFunc>(
    ArrayView2D<T1> viewA, ArrayView2D<T2> viewB, ArrayView2D<T3> viewC, 
    TFunc func, ArrayView2D<TResult> viewResult)
    where TFunc : IFunc<T1, T2, T3, TResult>
{
    // Checks omitted for brevity

    var rows = sizeA.Rows;
    var cols = sizeA.Cols;

    var rowStrideA = viewA.RowStrideInBytes;
    var rowStrideB = viewB.RowStrideInBytes;
    var rowStrideC = viewC.RowStrideInBytes;
    var rowStrideResult = viewResult.RowStrideInBytes;

    var rowPtrA = viewA.BytePtr;
    var rowPtrB = viewB.BytePtr;
    var rowPtrC = viewC.BytePtr;
    var rowPtrResult = viewResult.BytePtr;

    var rowPtrEndA = rowPtrA + rowStrideA * rows;
    for (; rowPtrA != rowPtrEndA; 
        rowPtrA += rowStrideA, rowPtrB += rowStrideB, rowPtrC += rowStrideC, 
        rowPtrResult += rowStrideResult)
    {
        var colPtrA = rowPtrA;
        var colPtrB = rowPtrB;
        var colPtrC = rowPtrC;
        var colPtrResult = rowPtrResult;

        var colPtrEndA = colPtrA + cols * SizeOf<T1>();
        for (; colPtrA != colPtrEndA; 
                colPtrA += SizeOf<T1>(), colPtrB += SizeOf<T2>(), colPtrC += SizeOf<T3>(), 
                colPtrResult += SizeOf<TResult>())
        {
            Write(colPtrResult, func.Invoke(
                Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC));
        }
    }
    return func;
}

ArrayView2D<T> is just a generic value type containing a pointer, 2D size and row stride in bytes. Using this to implement the merge with a value type func (implementing IFunc interface omitted here since it is simple):

struct MergeBgrFunc : IFunc<byte, byte, byte, Bgr>
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public Bgr Invoke(byte b, byte g, byte r)
    {
        return Bgr.ByBGR(r, g, r);
    }
}

and the Bgr value type defined as something like:

public struct Bgr : IEquatable<Bgr>
{
    public readonly byte Blue;
    public readonly byte Green;
    public readonly byte Red;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private Bgr(byte blue, byte green, byte red)
    {
        this.Blue = blue;
        this.Green = green;
        this.Red = red;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Bgr ByBGR(byte blue, byte green, byte red)
    { return new Bgr(blue, green, red); }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Bgr ByRGB(byte red, byte green, byte blue)
    { return new Bgr(blue, green, red); }

    public byte B { get { return this.Blue; } }
    public byte G { get { return this.Green; } }
    public byte R { get { return this.Red; } }
}

The MergeBgr functionality can be implemented as a simple call such as:

Transform(blue, green, red, new MergeBgrFunc(), bgr);

Testing this in Visual Studio 2015 Update 1 64-bit on an Intel processor that is with RyuJIT (clrjit.dll) it quickly became evident that this was ~8x slower than a specific loop such as:

public static void MergeBgr(
    ArrayView2D<byte> b, ArrayView2D<byte> g, ArrayView2D<byte> r, 
    ArrayView2D<Bgr> bgr)
{
    // Checks omitted for brevity

    var rows = size.Rows;
    var cols = size.Cols;

    var rowStrideB = b.RowStrideInBytes;
    var rowStrideG = g.RowStrideInBytes;
    var rowStrideR = r.RowStrideInBytes;
    var rowStrideBgr = bgr.RowStrideInBytes;

    var rowPtrB = b.BytePtr;
    var rowPtrG = g.BytePtr;
    var rowPtrR = r.BytePtr;
    var rowPtrBgr = bgr.BytePtr;

    var rowPtrEndB = rowPtrB + rowStrideB * rows;
    for (; rowPtrB != rowPtrEndB;
        rowPtrB += rowStrideB, rowPtrG += rowStrideG, rowPtrR += rowStrideR, 
        rowPtrBgr += rowStrideBgr)
    {
        var colPtrB = rowPtrB;
        var colPtrG = rowPtrG;
        var colPtrR = rowPtrR;
        var colPtrBgr = (Bgr*)rowPtrBgr;

        var colPtrEndB = colPtrB + cols;
        for (; colPtrB != colPtrEndB; ++colPtrB, ++colPtrG, ++colPtrR, ++colPtrBgr)
        {
            *colPtrBgr = Bgr.ByBGR(*colPtrB, *colPtrG, *colPtrR);
        }
    }
}

or to give more specific numbers. For a 2D array size of say 1280 x 1024 the two different methods would run in about:

Method Average loop time
Tranform MergeBgr 10.352 ms
Specific MergeBgr 1.264 ms

which gives a factor 8.2x slower for the Transform version. Inspecting the generated assembly shows why.

Transform MergeBgr x64 assembly for inner most loop

00007FFF50BCA7B3  cmp         r11,r14  
00007FFF50BCA7B6  je          00007FFF50BCA810  
00007FFF50BCA7B8  movzx       eax,byte ptr [r11]  
00007FFF50BCA7BC  movzx       eax,byte ptr [r15]  
00007FFF50BCA7C0  movzx       ecx,byte ptr [r12]  
00007FFF50BCA7C5  and         eax,0FFh  
00007FFF50BCA7CA  lea         r10,[rsp+38h]  
00007FFF50BCA7CF  mov         byte ptr [r10],cl  
00007FFF50BCA7D2  mov         byte ptr [r10+1],al  
00007FFF50BCA7D6  mov         byte ptr [r10+2],cl  
00007FFF50BCA7DA  mov         ax,word ptr [rsp+38h]  
00007FFF50BCA7DF  mov         word ptr [rsp+30h],ax  
00007FFF50BCA7E4  mov         al,byte ptr [rsp+3Ah]  
00007FFF50BCA7E8  mov         byte ptr [rsp+32h],al  
00007FFF50BCA7EC  mov         ax,word ptr [rsp+30h]  
00007FFF50BCA7F1  mov         word ptr [r13],ax  
00007FFF50BCA7F6  mov         al,byte ptr [rsp+32h]  
00007FFF50BCA7FA  mov         byte ptr [r13+2],al  
00007FFF50BCA7FE  inc         r11  
00007FFF50BCA801  inc         r15  
00007FFF50BCA804  inc         r12  
00007FFF50BCA807  add         r13,3  
00007FFF50BCA80B  cmp         r11,r14  

As can be seen this has a LOT of mov in it. The JIT does do a good job inlining all the calls, but the resulting assembly is not particular well optimized. Too many moves.

Specific MergeBgr x64 assembly for inner most loop

070CB200  mov         eax,edi  
070CB202  cmp         eax,dword ptr [ebp-3Ch]  
070CB205  je          070CB22F  
070CB207  movzx       ecx,byte ptr [edi]  
070CB20A  mov         eax,dword ptr [ebp-34h]  
070CB20D  movzx       ebx,byte ptr [eax]  
070CB210  mov         eax,dword ptr [ebp-38h]  
070CB213  movzx       esi,byte ptr [eax]  
070CB216  mov         byte ptr [edx],cl  
070CB218  mov         byte ptr [edx+1],bl  
070CB21B  mov         eax,esi  
070CB21D  mov         byte ptr [edx+2],al  
070CB220  inc         edi  
070CB221  inc         dword ptr [ebp-34h]  
070CB224  inc         dword ptr [ebp-38h]  
070CB227  add         edx,3  

The specific loop on the other hand is much better. Probably as good as it gets.

Hoping the JIT can be improved for this scenario as this would make generic numerical processing much more efficient in .NET. I have other examples where the JIT seems to choke when it is generating code where there are a lot of calls that are inlined.

Unmanaged generic type constraint + generic pointers would make it possible to write the loop without the Unsafe class, if that would help. Perhaps it is the IFunc pattern that is a problem, I have seen it work beautifully though for simpler loops.

I know there is work being done on refactoring inlining such as https://github.com/dotnet/coreclr/issues/3371 , https://github.com/dotnet/coreclr/issues/3482 not sure this pertains to this, as it is in fact after inlining has been done.

category:cq
theme:structs
skill-level:expert
cost:extra-large

Metadata

Metadata

Assignees

No one assigned

    Labels

    area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMIenhancementProduct code improvement that does NOT require public API changes/additionsoptimizationtenet-performancePerformance related issue

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions