-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RyuJIT: Poor code quality for tight generic loop with many inlineable calls (factor x8 slower than non-generic few calls loop) #5252
Comments
I think the main problem is the lack of optimizations around value types and not inlining. You have calls with value type parameters and/or value type returns. Values passed as parameters need to be copied and after inlining those copies may end up being redundant. The JIT isn't able to eliminate such copies and you end up with code like the one you posted above. This may simply be a duplicate of #4308 but the code is complex enough that other issues may exist as well. It would be nice to post somewhere a full example that can be compiled and run with coreclr. |
To be clear. I wasn't saying that inlining was the problem at all, inlining works great, but indeed that something was wrong with the code generation around it. Guess the main problem is value types (which we use a lot in our "domain" of image processing) then. I will try to make a complete example if that is of interest and can help. Any way of circumventing the copies? Can't do ref returns? Ref inputs? But that would mean that we perhaps need to make types mutable, we try to use immutable value types. |
Yes, you can try to use |
It does not require it. I was just wondering whether
That is for example end up with: struct MergeBgrFunc : IOutFunc<byte, byte, byte, Bgr> // Need new interface
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Invoke(byte b, byte g, byte r, out Bgr bgr)
{
bgr = Bgr.ByBGR(r, g, r); // Mistake here first `r` should be `b` results are with this mistake
}
} and add a public static class Unsafe
{
public static unsafe void Write<T>(void* p, ref T value)
{
// IL?
}
} White means the inner loop becomes something like: var colPtrEndA = colPtrA + cols * SizeOf<T1>();
for (; colPtrA != colPtrEndA;
colPtrA += SizeOf<T1>(), colPtrB += SizeOf<T2>(), colPtrC += SizeOf<T3>(),
colPtrResult += SizeOf<TResult>())
{
TResult res;
func.Invoke(Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC), out res);
Write(colPtrResult, ref res);
} which definitely is less elegant than return values. |
Should I rename the issue to something like And yes this does infact seem related to https://github.com/dotnet/coreclr/issues/1133. |
Changing to just use TResult res;
func.Invoke(Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC), out res);
Write(colPtrResult, res); didn't change anything in the generated assembly: 00007FFF51D3BE55 cmp rbp,r13
00007FFF51D3BE58 je 00007FFF51D3BEB1
00007FFF51D3BE5A movzx eax,byte ptr [rbp]
00007FFF51D3BE5E movzx eax,byte ptr [r14]
00007FFF51D3BE62 movzx r11d,byte ptr [r15]
00007FFF51D3BE66 and eax,0FFh
00007FFF51D3BE6B lea rbx,[rsp+38h]
00007FFF51D3BE70 mov byte ptr [rbx],r11b
00007FFF51D3BE73 mov byte ptr [rbx+1],al
00007FFF51D3BE76 mov byte ptr [rbx+2],r11b
00007FFF51D3BE7A mov ax,word ptr [rsp+38h]
00007FFF51D3BE7F mov word ptr [rsp+30h],ax
00007FFF51D3BE84 mov al,byte ptr [rsp+3Ah]
00007FFF51D3BE88 mov byte ptr [rsp+32h],al
00007FFF51D3BE8C mov ax,word ptr [rsp+30h]
00007FFF51D3BE91 mov word ptr [r12],ax
00007FFF51D3BE96 mov al,byte ptr [rsp+32h]
00007FFF51D3BE9A mov byte ptr [r12+2],al
00007FFF51D3BE9F inc rbp
00007FFF51D3BEA2 inc r14
00007FFF51D3BEA5 inc r15
00007FFF51D3BEA8 add r12,3 Not the speed either. |
@CarolEidt FYI. |
Related to this I've done some micro-benchmarks of the The results are: BenchmarkDotNet=v0.9.2.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz, ProcessorCount=12
Frequency=3222656 ticks, Resolution=310.3031 ns
HostCLR=MS.NET 4.0.30319.42000, Arch=32-bit RELEASE
Type=BasicReadWriteBenchmarkBgr Mode=Throughput
The code for these tests are (note the stackallocs, which was just a quick way of testing this for writing to some memory allocation, this might have unintended consequences for the results): public struct Bgr { public byte B; public byte G; public byte R; }
[Config("jobs=AllJits")]
public class BasicReadWriteBenchmark<T>
where T : struct
{
// NOTE: This includes cost of stack alloc
[Benchmark]
public void ReadFromStack()
{
unsafe
{
var stackPtr = stackalloc byte[Unsafe.SizeOf<T>()];
var value = Unsafe.Read<T>(stackPtr);
}
}
// NOTE: This includes cost of stack alloc
[Benchmark]
public void WriteToStack()
{
unsafe
{
var stackPtr = stackalloc byte[Unsafe.SizeOf<T>()];
T value = default(T);
Unsafe.Write<T>(stackPtr, value);
}
}
// NOTE: This includes cost of stack alloc
[Benchmark]
public void WriteRefToStack()
{
unsafe
{
var stackPtr = stackalloc byte[Unsafe.SizeOf<T>()];
T value = default(T);
Unsafe.Write<T>(stackPtr, ref value);
}
}
}
public class BasicReadWriteBenchmarkBgr : BasicReadWriteBenchmark<Bgr> { } I ran this with BenchmarkRunner.Run<BasicReadWriteBenchmarkBgr>(); Below are the assembly code as viewed from running the code in VS2015 Update 1 just as an example so this should be ReadFromStack00007FFF50BA4B80 push rbp
00007FFF50BA4B81 sub rsp,20h
00007FFF50BA4B85 mov rbp,rsp
00007FFF50BA4B88 mov qword ptr [rbp+18h],rsp
00007FFF50BA4B8C mov rax,0DAABE8AE499Bh
00007FFF50BA4B96 mov qword ptr [rbp+8],rax
00007FFF50BA4B9A push 0
00007FFF50BA4B9C push 0
00007FFF50BA4B9E lea rax,[rsp]
00007FFF50BA4BA2 mov qword ptr [rbp+18h],rsp
00007FFF50BA4BA6 mov dx,word ptr [rax]
00007FFF50BA4BA9 mov word ptr [rbp+10h],dx
00007FFF50BA4BAD mov dl,byte ptr [rax+2]
00007FFF50BA4BB0 mov byte ptr [rbp+12h],dl
00007FFF50BA4BB3 mov rcx,0DAABE8AE499Bh
00007FFF50BA4BBD cmp qword ptr [rbp+8],rcx
00007FFF50BA4BC1 je 00007FFF50BA4BC8
00007FFF50BA4BC3 call 00007FFFB06430B4
00007FFF50BA4BC8 nop
00007FFF50BA4BC9 lea rsp,[rbp+20h]
00007FFF50BA4BCD pop rbp
00007FFF50BA4BCE ret WriteToStack00007FFF50BA4BF0 push rbp
00007FFF50BA4BF1 push rdi
00007FFF50BA4BF2 push rsi
00007FFF50BA4BF3 sub rsp,20h
00007FFF50BA4BF7 mov rbp,rsp
00007FFF50BA4BFA mov rsi,rcx
00007FFF50BA4BFD lea rdi,[rbp]
00007FFF50BA4C01 mov ecx,8
00007FFF50BA4C06 xor eax,eax
00007FFF50BA4C08 rep stos dword ptr [rdi]
00007FFF50BA4C0A mov rcx,rsi
00007FFF50BA4C0D mov qword ptr [rbp+18h],rsp
00007FFF50BA4C11 mov rax,0DAABE8AE499Bh
00007FFF50BA4C1B mov qword ptr [rbp],rax
00007FFF50BA4C1F push 0
00007FFF50BA4C21 push 0
00007FFF50BA4C23 lea rax,[rsp]
00007FFF50BA4C27 mov qword ptr [rbp+18h],rsp
00007FFF50BA4C2B xor edx,edx
00007FFF50BA4C2D lea rcx,[rbp+10h]
00007FFF50BA4C31 mov word ptr [rcx],dx
00007FFF50BA4C34 mov byte ptr [rcx+2],dl
00007FFF50BA4C37 mov dx,word ptr [rbp+10h]
00007FFF50BA4C3B mov word ptr [rbp+8],dx
00007FFF50BA4C3F mov dl,byte ptr [rbp+12h]
00007FFF50BA4C42 mov byte ptr [rbp+0Ah],dl
00007FFF50BA4C45 mov dx,word ptr [rbp+8]
00007FFF50BA4C49 mov word ptr [rax],dx
00007FFF50BA4C4C mov dl,byte ptr [rbp+0Ah]
00007FFF50BA4C4F mov byte ptr [rax+2],dl
00007FFF50BA4C52 mov rcx,0DAABE8AE499Bh
00007FFF50BA4C5C cmp qword ptr [rbp],rcx
00007FFF50BA4C60 je 00007FFF50BA4C67
00007FFF50BA4C62 call 00007FFFB06430B4
00007FFF50BA4C67 nop
00007FFF50BA4C68 lea rsp,[rbp+20h]
00007FFF50BA4C6C pop rsi
00007FFF50BA4C6D pop rdi
00007FFF50BA4C6E pop rbp
00007FFF50BA4C6F ret WriteRefToStack00007FFF50BA4C90 push rbp
00007FFF50BA4C91 push rdi
00007FFF50BA4C92 push rsi
00007FFF50BA4C93 sub rsp,20h
00007FFF50BA4C97 mov rbp,rsp
00007FFF50BA4C9A mov rsi,rcx
00007FFF50BA4C9D lea rdi,[rbp+8]
00007FFF50BA4CA1 mov ecx,6
00007FFF50BA4CA6 xor eax,eax
00007FFF50BA4CA8 rep stos dword ptr [rdi]
00007FFF50BA4CAA mov rcx,rsi
00007FFF50BA4CAD mov qword ptr [rbp+18h],rsp
00007FFF50BA4CB1 mov rax,0DAABE8AE499Bh
00007FFF50BA4CBB mov qword ptr [rbp+8],rax
00007FFF50BA4CBF push 0
00007FFF50BA4CC1 push 0
00007FFF50BA4CC3 lea rax,[rsp]
00007FFF50BA4CC7 mov qword ptr [rbp+18h],rsp
00007FFF50BA4CCB xor edx,edx
00007FFF50BA4CCD lea rcx,[rbp+10h]
00007FFF50BA4CD1 mov word ptr [rcx],dx
00007FFF50BA4CD4 mov byte ptr [rcx+2],dl
00007FFF50BA4CD7 mov dx,word ptr [rbp+10h]
00007FFF50BA4CDB mov word ptr [rax],dx
00007FFF50BA4CDE mov dl,byte ptr [rbp+12h]
00007FFF50BA4CE1 mov byte ptr [rax+2],dl
00007FFF50BA4CE4 mov rcx,0DAABE8AE499Bh
00007FFF50BA4CEE cmp qword ptr [rbp+8],rcx
00007FFF50BA4CF2 je 00007FFF50BA4CF9
00007FFF50BA4CF4 call 00007FFFB06430B4
00007FFF50BA4CF9 nop
00007FFF50BA4CFA lea rsp,[rbp+20h]
00007FFF50BA4CFE pop rsi
00007FFF50BA4CFF pop rdi
00007FFF50BA4D00 pop rbp
00007FFF50BA4D01 ret |
Secondly, I have made a version using both TResult res;
func.Invoke(Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC), out res);
Write(colPtrResult, ref res); The perf does improve to (for RyuJIT again and with the other times included):
Transform Out+Ref MergeBgr x64 assembly for inner most loopThe assembly is indeed better but perf is still far behind. 00007FFF50BFBEF5 cmp rbp,r13
00007FFF50BFBEF8 je 00007FFF50BFBF3F
00007FFF50BFBEFA movzx eax,byte ptr [rbp]
00007FFF50BFBEFE movzx eax,byte ptr [r14]
00007FFF50BFBF02 movzx r11d,byte ptr [r15]
00007FFF50BFBF06 and eax,0FFh
00007FFF50BFBF0B lea rbx,[rsp+38h]
00007FFF50BFBF10 mov byte ptr [rbx],r11b
00007FFF50BFBF13 mov byte ptr [rbx+1],al
00007FFF50BFBF16 mov byte ptr [rbx+2],r11b
00007FFF50BFBF1A mov ax,word ptr [rsp+38h]
00007FFF50BFBF1F mov word ptr [r12],ax
00007FFF50BFBF24 mov al,byte ptr [rsp+3Ah]
00007FFF50BFBF28 mov byte ptr [r12+2],al
00007FFF50BFBF2D inc rbp
00007FFF50BFBF30 inc r14
00007FFF50BFBF33 inc r15
00007FFF50BFBF36 add r12,3 Transform Ref only MergeBgr x64I also then tried it without the var res = func.Invoke(Read<T1>(colPtrA), Read<T2>(colPtrB), Read<T3>(colPtrC));
Write(colPtrResult, ref res); and the assembly (the same as for 00007FFF50BDBEE5 cmp rbp,r13
00007FFF50BDBEE8 je 00007FFF50BDBF2F
00007FFF50BDBEEA movzx eax,byte ptr [rbp]
00007FFF50BDBEEE movzx eax,byte ptr [r14]
00007FFF50BDBEF2 movzx r11d,byte ptr [r15]
00007FFF50BDBEF6 and eax,0FFh
00007FFF50BDBEFB lea rbx,[rsp+38h]
00007FFF50BDBF00 mov byte ptr [rbx],r11b
00007FFF50BDBF03 mov byte ptr [rbx+1],al
00007FFF50BDBF06 mov byte ptr [rbx+2],r11b
00007FFF50BDBF0A mov ax,word ptr [rsp+38h]
00007FFF50BDBF0F mov word ptr [r12],ax
00007FFF50BDBF14 mov al,byte ptr [rsp+3Ah]
00007FFF50BDBF18 mov byte ptr [r12+2],al
00007FFF50BDBF1D inc rbp
00007FFF50BDBF20 inc r14
00007FFF50BDBF23 inc r15
00007FFF50BDBF26 add r12,3 The perf then being the same. |
Sorry I made a mistake in the 00007FFF50BBBEFA cmp rbp,r13
00007FFF50BBBEFD je 00007FFF50BBBF4C
00007FFF50BBBEFF movzx eax,byte ptr [rbp]
00007FFF50BBBF03 movzx r11d,byte ptr [r14]
00007FFF50BBBF07 movzx ebx,byte ptr [r15]
00007FFF50BBBF0B and eax,0FFh
00007FFF50BBBF10 and r11d,0FFh
00007FFF50BBBF17 lea r9,[rsp+38h]
00007FFF50BBBF1C mov byte ptr [r9],al
00007FFF50BBBF1F mov byte ptr [r9+1],r11b
00007FFF50BBBF23 mov byte ptr [r9+2],bl
00007FFF50BBBF27 mov ax,word ptr [rsp+38h]
00007FFF50BBBF2C mov word ptr [r12],ax
00007FFF50BBBF31 mov al,byte ptr [rsp+3Ah]
00007FFF50BBBF35 mov byte ptr [r12+2],al
00007FFF50BBBF3A inc rbp
00007FFF50BBBF3D inc r14
00007FFF50BBBF40 inc r15
00007FFF50BBBF43 add r12,3 This just meant the perf was This with the struct MergeBgrFunc : IOutFunc<byte, byte, byte, Bgr> // Need new interface
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Invoke(byte b, byte g, byte r, out Bgr bgr)
{
bgr = Bgr.ByBGR(b, g, r); // Fixed mistake so now correctly has b first
}
} |
I haven't had time to look at all this more closely but even with |
Yes, it's a lot :)
An argument against that immutability should be a problem is that the specific loop still uses the immutable type with no perf problem e.g.: *colPtrBgr = Bgr.ByBGR(*colPtrB, *colPtrG, *colPtrR); so that can't be the problem. If I understand correctly you are suggesting changing the func to something like: struct MergeBgrFunc : IRefFunc<byte, byte, byte, Bgr> // Need new interface
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Invoke(byte b, byte g, byte r, ref Bgr bgr)
{
bgr.B = b;
bgr.G = g;
bgr.R = r;
}
} The problem is as far as I can tell this is simply not possible in a generic way. This requires So does this not mean this is problem in the way inlining for value types occurs, that is the code resulting from inlined calls with value types? Or? |
Like @mikedn I have not had time to take a look at this closely, but from what I can glean with a quick look, I believe that this case is highlighting the limitations that the JIT currently has with regard to the handling of struct (value) types. Copies are made to/from arguments and returns, and then are not eliminated during inlining or downstream. This is something I hope to be able to improve very soon. |
@CarolEidt ok yes it certainly look like there are some rather serious issues (from a perf perspective) with value types. Would an example project help you with this work? Or perhaps even some tests (via a pull request) added directly to the As you can probably tell, this work is quite important for the things we are trying to do in .NET with generic image processing. Basically, we are just trying to replicate what can done in C++ with templates like in the |
@nietras Yes, an example project would be helpful - especially one that can be used for benchmarking. The current benchmarks we use for perf analysis are in tests/src/JIT/Performance/CodeQuality; that would be a great place to add new benchmarks. They just need to report their perf result. @chuck-mitchell is currently working on the analysis, but in the past (i.e. desktop perf analysis) the analysis tools were pretty flexible about how it was reported. |
@nietras - I am no longer seeing the excess copies in the small |
- Enable CSE of struct values when handle is available (and add code to get the handle of HW SIMD types) - Don't require block nodes for SIMD assignments - Don't set `GTF_GLOB_REF` on `GT_OBJ` if it is local - Set `lvRegStruct` on promoted SIMD fields - Add tests for #19910 (fixed with this PR) and #3539 & #19910 (fixed with dotnet#21314) - Additional cleanup
- Enable CSE of struct values when handle is available (and add code to get the handle of HW SIMD types) - Don't require block nodes for SIMD assignments - Don't set `GTF_GLOB_REF` on `GT_OBJ` if it is local - Set `lvRegStruct` on promoted SIMD fields - Add tests for #19910 (fixed with this PR) and #3539 & #19438 (fixed with dotnet#21314) - Additional cleanup Fix #19910
- Enable CSE of struct values when handle is available (and add code to get the handle of HW SIMD types) - Don't require block nodes for SIMD assignments - Don't set `GTF_GLOB_REF` on `GT_OBJ` if it is local - Set `lvRegStruct` on promoted SIMD fields - Add tests for #19910 (fixed with this PR) and #3539 & #19438 (fixed with dotnet#21314) - Additional cleanup Fix #19910
- Enable CSE of struct values when handle is available (and add code to get the handle of HW SIMD types) - Don't require block nodes for SIMD assignments - Don't set `GTF_GLOB_REF` on `GT_OBJ` if it is local - Set `lvRegStruct` on promoted SIMD fields - Add tests for #19910 (fixed with this PR) and #3539 & #19438 (fixed with dotnet#21314) - Additional cleanup Fix #19910
* [WIP] Struct & SIMD improvements - Enable CSE of struct values when handle is available (and add code to get the handle of HW SIMD types) - Don't require block nodes for SIMD assignments - Don't set `GTF_GLOB_REF` on `GT_OBJ` if it is local - Set `lvRegStruct` on promoted SIMD fields - Add tests for #19910 (fixed with this PR) and #3539 & #19438 (fixed with #21314) - Additional cleanup Fix #19910
@nietras - I'm going to close this issue as having been fixed by dotnet/coreclr#22255 - it addressed the issues that I captured in the coreclr/tests/src/JIT/Regressions/JitBlue/GitHub_3539 test case. |
@CarolEidt sorry I missed your comments here, but this sounds great! Thank you, I will let you know if my testing still shows issues. The code in the top comment (reduced to one dimension) would be the best bet for test case. Merging three bytes into bgr. Via a generic type struct functor. |
A common operation in image processing is to merge separate color channels i.e. blue, green and red to a
Bgr
image (or 2D array).Using the proposed
Unsafe
class, detailed in https://github.com/dotnet/corefx/issues/5474 but via the https://github.com/DotNetCross/Memory.Unsafe implementation, a completely generic loop for operations that canTransform
a single input to three different outputs was implemented for such operations. See below.ArrayView2D<T>
is just a generic value type containing a pointer, 2D size and row stride in bytes. Using this to implement the merge with a value type func (implementingIFunc
interface omitted here since it is simple):and the
Bgr
value type defined as something like:The
MergeBgr
functionality can be implemented as a simple call such as:Testing this in Visual Studio 2015 Update 1 64-bit on an Intel processor that is with RyuJIT (
clrjit.dll
) it quickly became evident that this was ~8x slower than a specific loop such as:or to give more specific numbers. For a 2D array size of say 1280 x 1024 the two different methods would run in about:
which gives a factor 8.2x slower for the
Transform
version. Inspecting the generated assembly shows why.Transform MergeBgr x64 assembly for inner most loop
As can be seen this has a LOT of
mov
in it. The JIT does do a good job inlining all the calls, but the resulting assembly is not particular well optimized. Too many moves.Specific MergeBgr x64 assembly for inner most loop
The specific loop on the other hand is much better. Probably as good as it gets.
Hoping the JIT can be improved for this scenario as this would make generic numerical processing much more efficient in .NET. I have other examples where the JIT seems to choke when it is generating code where there are a lot of calls that are inlined.
Unmanaged generic type constraint + generic pointers would make it possible to write the loop without the
Unsafe
class, if that would help. Perhaps it is theIFunc
pattern that is a problem, I have seen it work beautifully though for simpler loops.I know there is work being done on refactoring inlining such as https://github.com/dotnet/coreclr/issues/3371 , https://github.com/dotnet/coreclr/issues/3482 not sure this pertains to this, as it is in fact after inlining has been done.
category:cq
theme:structs
skill-level:expert
cost:extra-large
The text was updated successfully, but these errors were encountered: