A common operation in image processing is to merge separate color channels i.e. blue, green and red to a Bgr image (or 2D array).
Using the proposed Unsafe class, detailed in https://github.com/dotnet/corefx/issues/5474 but via the https://github.com/DotNetCross/Memory.Unsafe implementation, a completely generic loop for operations that can Transform a single input to three different outputs was implemented for such operations. See below.
public static TFunc Transform<T1, T2, T3, TResult, TFunc>(
ArrayView2D<T1> viewA, ArrayView2D<T2> viewB, ArrayView2D<T3> viewC,
TFunc func, ArrayView2D<TResult> viewResult)
where TFunc : IFunc<T1, T2, T3, TResult>
{
// Checks omitted for brevity
var rows = sizeA.Rows;
var cols = sizeA.Cols;
var rowStrideA = viewA.RowStrideInBytes;
var rowStrideB = viewB.RowStrideInBytes;
var rowStrideC = viewC.RowStrideInBytes;
var rowStrideResult = viewResult.RowStrideInBytes;
var rowPtrA = viewA.BytePtr;
var rowPtrB = viewB.BytePtr;
var rowPtrC = viewC.BytePtr;
var rowPtrResult = viewResult.BytePtr;
var rowPtrEndA = rowPtrA + rowStrideA * rows;
for (; rowPtrA != rowPtrEndA;
rowPtrA += rowStrideA, rowPtrB += rowStrideB, rowPtrC += rowStrideC,
rowPtrResult += rowStrideResult)
{
var colPtrA = rowPtrA;
var colPtrB = rowPtrB;
var colPtrC = rowPtrC;
var colPtrResult = rowPtrResult;
var colPtrEndA = colPtrA + cols * SizeOf<T1>();
for (; colPtrA != colPtrEndA;
colPtrA += SizeOf<T1>(), colPtrB += SizeOf<T2>(), colPtrC += SizeOf<T3>(),
colPtrResult += SizeOf<TResult>())
{
Write(colPtrResult, func.Invoke(
Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC));
}
}
return func;
}
ArrayView2D<T> is just a generic value type containing a pointer, 2D size and row stride in bytes. Using this to implement the merge with a value type func (implementing IFunc interface omitted here since it is simple):
struct MergeBgrFunc : IFunc<byte, byte, byte, Bgr>
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public Bgr Invoke(byte b, byte g, byte r)
{
return Bgr.ByBGR(r, g, r);
}
}
and the Bgr value type defined as something like:
public struct Bgr : IEquatable<Bgr>
{
public readonly byte Blue;
public readonly byte Green;
public readonly byte Red;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private Bgr(byte blue, byte green, byte red)
{
this.Blue = blue;
this.Green = green;
this.Red = red;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Bgr ByBGR(byte blue, byte green, byte red)
{ return new Bgr(blue, green, red); }
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Bgr ByRGB(byte red, byte green, byte blue)
{ return new Bgr(blue, green, red); }
public byte B { get { return this.Blue; } }
public byte G { get { return this.Green; } }
public byte R { get { return this.Red; } }
}
The MergeBgr functionality can be implemented as a simple call such as:
Transform(blue, green, red, new MergeBgrFunc(), bgr);
Testing this in Visual Studio 2015 Update 1 64-bit on an Intel processor that is with RyuJIT (clrjit.dll) it quickly became evident that this was ~8x slower than a specific loop such as:
public static void MergeBgr(
ArrayView2D<byte> b, ArrayView2D<byte> g, ArrayView2D<byte> r,
ArrayView2D<Bgr> bgr)
{
// Checks omitted for brevity
var rows = size.Rows;
var cols = size.Cols;
var rowStrideB = b.RowStrideInBytes;
var rowStrideG = g.RowStrideInBytes;
var rowStrideR = r.RowStrideInBytes;
var rowStrideBgr = bgr.RowStrideInBytes;
var rowPtrB = b.BytePtr;
var rowPtrG = g.BytePtr;
var rowPtrR = r.BytePtr;
var rowPtrBgr = bgr.BytePtr;
var rowPtrEndB = rowPtrB + rowStrideB * rows;
for (; rowPtrB != rowPtrEndB;
rowPtrB += rowStrideB, rowPtrG += rowStrideG, rowPtrR += rowStrideR,
rowPtrBgr += rowStrideBgr)
{
var colPtrB = rowPtrB;
var colPtrG = rowPtrG;
var colPtrR = rowPtrR;
var colPtrBgr = (Bgr*)rowPtrBgr;
var colPtrEndB = colPtrB + cols;
for (; colPtrB != colPtrEndB; ++colPtrB, ++colPtrG, ++colPtrR, ++colPtrBgr)
{
*colPtrBgr = Bgr.ByBGR(*colPtrB, *colPtrG, *colPtrR);
}
}
}
or to give more specific numbers. For a 2D array size of say 1280 x 1024 the two different methods would run in about:
| Method |
Average loop time |
| Tranform MergeBgr |
10.352 ms |
| Specific MergeBgr |
1.264 ms |
which gives a factor 8.2x slower for the Transform version. Inspecting the generated assembly shows why.
Transform MergeBgr x64 assembly for inner most loop
00007FFF50BCA7B3 cmp r11,r14
00007FFF50BCA7B6 je 00007FFF50BCA810
00007FFF50BCA7B8 movzx eax,byte ptr [r11]
00007FFF50BCA7BC movzx eax,byte ptr [r15]
00007FFF50BCA7C0 movzx ecx,byte ptr [r12]
00007FFF50BCA7C5 and eax,0FFh
00007FFF50BCA7CA lea r10,[rsp+38h]
00007FFF50BCA7CF mov byte ptr [r10],cl
00007FFF50BCA7D2 mov byte ptr [r10+1],al
00007FFF50BCA7D6 mov byte ptr [r10+2],cl
00007FFF50BCA7DA mov ax,word ptr [rsp+38h]
00007FFF50BCA7DF mov word ptr [rsp+30h],ax
00007FFF50BCA7E4 mov al,byte ptr [rsp+3Ah]
00007FFF50BCA7E8 mov byte ptr [rsp+32h],al
00007FFF50BCA7EC mov ax,word ptr [rsp+30h]
00007FFF50BCA7F1 mov word ptr [r13],ax
00007FFF50BCA7F6 mov al,byte ptr [rsp+32h]
00007FFF50BCA7FA mov byte ptr [r13+2],al
00007FFF50BCA7FE inc r11
00007FFF50BCA801 inc r15
00007FFF50BCA804 inc r12
00007FFF50BCA807 add r13,3
00007FFF50BCA80B cmp r11,r14
As can be seen this has a LOT of mov in it. The JIT does do a good job inlining all the calls, but the resulting assembly is not particular well optimized. Too many moves.
Specific MergeBgr x64 assembly for inner most loop
070CB200 mov eax,edi
070CB202 cmp eax,dword ptr [ebp-3Ch]
070CB205 je 070CB22F
070CB207 movzx ecx,byte ptr [edi]
070CB20A mov eax,dword ptr [ebp-34h]
070CB20D movzx ebx,byte ptr [eax]
070CB210 mov eax,dword ptr [ebp-38h]
070CB213 movzx esi,byte ptr [eax]
070CB216 mov byte ptr [edx],cl
070CB218 mov byte ptr [edx+1],bl
070CB21B mov eax,esi
070CB21D mov byte ptr [edx+2],al
070CB220 inc edi
070CB221 inc dword ptr [ebp-34h]
070CB224 inc dword ptr [ebp-38h]
070CB227 add edx,3
The specific loop on the other hand is much better. Probably as good as it gets.
Hoping the JIT can be improved for this scenario as this would make generic numerical processing much more efficient in .NET. I have other examples where the JIT seems to choke when it is generating code where there are a lot of calls that are inlined.
Unmanaged generic type constraint + generic pointers would make it possible to write the loop without the Unsafe class, if that would help. Perhaps it is the IFunc pattern that is a problem, I have seen it work beautifully though for simpler loops.
I know there is work being done on refactoring inlining such as https://github.com/dotnet/coreclr/issues/3371 , https://github.com/dotnet/coreclr/issues/3482 not sure this pertains to this, as it is in fact after inlining has been done.
category:cq
theme:structs
skill-level:expert
cost:extra-large
A common operation in image processing is to merge separate color channels i.e. blue, green and red to a
Bgrimage (or 2D array).Using the proposed
Unsafeclass, detailed in https://github.com/dotnet/corefx/issues/5474 but via the https://github.com/DotNetCross/Memory.Unsafe implementation, a completely generic loop for operations that canTransforma single input to three different outputs was implemented for such operations. See below.ArrayView2D<T>is just a generic value type containing a pointer, 2D size and row stride in bytes. Using this to implement the merge with a value type func (implementingIFuncinterface omitted here since it is simple):and the
Bgrvalue type defined as something like:The
MergeBgrfunctionality can be implemented as a simple call such as:Testing this in Visual Studio 2015 Update 1 64-bit on an Intel processor that is with RyuJIT (
clrjit.dll) it quickly became evident that this was ~8x slower than a specific loop such as:or to give more specific numbers. For a 2D array size of say 1280 x 1024 the two different methods would run in about:
which gives a factor 8.2x slower for the
Transformversion. Inspecting the generated assembly shows why.Transform MergeBgr x64 assembly for inner most loop
As can be seen this has a LOT of
movin it. The JIT does do a good job inlining all the calls, but the resulting assembly is not particular well optimized. Too many moves.Specific MergeBgr x64 assembly for inner most loop
The specific loop on the other hand is much better. Probably as good as it gets.
Hoping the JIT can be improved for this scenario as this would make generic numerical processing much more efficient in .NET. I have other examples where the JIT seems to choke when it is generating code where there are a lot of calls that are inlined.
Unmanaged generic type constraint + generic pointers would make it possible to write the loop without the
Unsafeclass, if that would help. Perhaps it is theIFuncpattern that is a problem, I have seen it work beautifully though for simpler loops.I know there is work being done on refactoring inlining such as https://github.com/dotnet/coreclr/issues/3371 , https://github.com/dotnet/coreclr/issues/3482 not sure this pertains to this, as it is in fact after inlining has been done.
category:cq
theme:structs
skill-level:expert
cost:extra-large