JIT: ldflda + add not properly being optimized in certain cases #12436
Comments
Negative values are not commonly seen in address computations. This may not be that hard to fix, though we might need to be wary of reassociation issues (as in dotnet/coreclr#23792). Will put in 3.0 pending further investigation. |
Looking at a specific example:
using System;
using System.Buffers;
using System.Runtime.CompilerServices;
class X
{
static int[] a;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Z(Memory<int> m)
{
Span<int> s = m.Span;
return s[4] + 56;
}
public static int Main()
{
a = new int[] { 0, 1, 2, 3, 44, 5 };
return Z(new Memory<int>(a));
}
}
Codegen for G_M62456_IG02:
xor rdi, rdi
xor ebx, ebx
mov rcx, gword ptr [rsi] // object field of mem
test rcx, rcx
je SHORT G_M62456_IG06
lea rdx, bword ptr [rcx+8]
mov rdx, qword ptr [rdx-8]
cmp dword ptr [rdx], 0 // check for has component size
jge SHORT G_M62456_IG03 |
Probably not so easy to fix. Because of the way
with these inlines:
At that point we're stuck, since:
During inlining, we might be able to do limited forward sub of a complex arg tree, if the callee is a static method with just one arg, and that arg has just one use, and that use is the first instruction in the callee. That would eliminate tmp10 in the above, as
We might also be able to clean this up somewhere else, but that would be a new thing. Since the sequence above always ends up with a three-level chain of dependent reads (read object, read object's method table, read method table's flag bits) I'm not convinced one extra add makes a lot of difference in overall perf.
Let me think about it a bit more before I move this out of 3.0. Maybe @dotnet/jit-contrib has ideas? |
The simple forward substitution experiment I have generates:
G_M33495_IG02:
33FF xor rdi, rdi
33DB xor ebx, ebx
488B0E mov rcx, gword ptr [rsi]
4885C9 test rcx, rcx
7447 je SHORT G_M33495_IG06
488B11 mov rdx, qword ptr [rcx]
833A00 cmp dword ptr [rdx], 0
7D09 jge SHORT G_M33495_IG03
The main problem with what I have now is that it merges arbitrary trees into bigger trees. IMO that increases the chance of a stack overflow in the JIT. The solution would probably be to be more careful and not merge entire trees but just move "interesting" nodes across the trees. Or get rid of |
I'm not sure what the exact issue is in this case but I find the importer handling of calls generally suspect. Here's a simple example showing this:
[MethodImpl(MethodImplOptions.NoInlining)]
static int A() => 42;
[MethodImpl(MethodImplOptions.NoInlining)]
static int B() => 42;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test() => A() + B();
The importer generates:
The 2 calls ended up in separate trees and a temp variable was introduced. It's not clear why the JIT does that. Normal tree traversal order should preserve the original order without having to create 2 trees. You'd get the order reversed only if
Such tree splitting might be necessary for nested calls:
[MethodImpl(MethodImplOptions.NoInlining)]
static int C(int a, int b) => a + b;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test() => C(A(), B());
but even this is questionable, at least on platforms with fixed out args, where the call argument ordering is more flexible compared to x86. |
It probably would be easy to teach ValueNumbering to do a few arithmetic identities like:
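As an illustration only (an assumption about the kind of identity that applies to this issue, not necessarily the ones the commenter had in mind), the `ldflda` + `add` pattern reduces to reassociating constants, `(x + c1) + c2 => x + (c1 + c2)`, followed by `x + 0 => x`. The sketch below uses a hypothetical toy expression tree (`Node`, `Local`, `Const`, `Add`, `Simplifier`); it is not the JIT's value numbering code and ignores the reassociation hazards mentioned earlier in the thread.

```csharp
using System;

// Hypothetical toy IR: a local, an integer constant, or an addition.
abstract record Node;
sealed record Local(string Name) : Node;
sealed record Const(long Value) : Node;
sealed record Add(Node Left, Node Right) : Node;

static class Simplifier
{
    // Applies a few arithmetic identities bottom-up.
    public static Node Simplify(Node n)
    {
        if (n is not Add add) return n;

        Node l = Simplify(add.Left);
        Node r = Simplify(add.Right);

        // c1 + c2 => (c1 + c2) folded into one constant
        if (l is Const a && r is Const b)
            return new Const(a.Value + b.Value);

        // (x + c1) + c2 => x + (c1 + c2)
        if (l is Add { Right: Const c1 } inner && r is Const c2)
            return Simplify(new Add(inner.Left, new Const(c1.Value + c2.Value)));

        // x + 0 => x
        if (r is Const { Value: 0 })
            return l;

        return new Add(l, r);
    }
}
```

With these rules, a tree shaped like `(obj + 8) + (-8)`, which is roughly what taking a field address at offset `IntPtr.Size` and then subtracting `IntPtr.Size` looks like on a 64-bit target, simplifies to plain `obj`.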
|
I changed importer spill behavior in dotnet/coreclr#7923, and it was a correctness fix for a nested case. There is also some benefit to behaving the same for inline candidates (where we always spill) and non-inline candidates (where we almost always spill), as inline candidacy is somewhat arbitrary (for example, you wouldn't expect codegen to change by adding a noinline attribute to a method that was not getting inlined anyways). |
Fixing this seems out of scope for 3.0, so will move to future. |
6.0 codegen seems better here: via sharplab. and jitdump (7.0):
G_M50085_IG02: ;; offset=0017H
33FF xor edi, edi
33DB xor ebx, ebx
488B2E mov rbp, gword ptr [rsi]
4885ED test rbp, rbp
0F84B8000000 je G_M50085_IG08
;; bbWeight=1 PerfScore 3.75
G_M50085_IG03: ;; offset=0027H
488B5500 mov rdx, qword ptr [rbp]
F70200000080 test dword ptr [rdx], 0xFFFFFFFF80000000
Not sure when this improved. |
More context at dotnet/coreclr#23783, specifically the comment at dotnet/coreclr#23783 (comment).
We have code in Memory<T>.Span that attempts to read an IntPtr field at offset 0 from an object. The way this is done is by getting a ref to the field at offset IntPtr.Size, then subtracting IntPtr.Size and dereferencing.
Today it results in this codegen:
It should ideally result in this codegen:
This optimization would also obviate the need for the new intrinsic proposed as part of dotnet/coreclr#23783.
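The access pattern described above can be sketched in user code roughly as follows. This is a minimal sketch, not the actual `Memory<T>.Span` source: `RawData`, `HeaderRead` and `ReadMethodTablePointer` are hypothetical names, and the code assumes the CoreCLR object layout, where an object reference points at the method table pointer and the first instance field sits `IntPtr.Size` bytes after it.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical helper type: reinterpreting any object as RawData gives a ref
// to the byte where its first field would start (offset IntPtr.Size).
sealed class RawData
{
    public byte Data;
}

static class HeaderRead
{
    // Reads the pointer-sized value at offset 0 of the object.
    // Assumption: CoreCLR layout, where that slot holds the method table pointer.
    public static IntPtr ReadMethodTablePointer(object obj)
    {
        // ldflda: ref to the field at offset IntPtr.Size
        ref byte firstField = ref Unsafe.As<RawData>(obj).Data;

        // Reinterpret that location as an IntPtr slot.
        ref IntPtr slot = ref Unsafe.As<byte, IntPtr>(ref firstField);

        // add: step back one IntPtr (that is, subtract IntPtr.Size), then dereference.
        return Unsafe.Add(ref slot, -1);
    }
}
```

The `+ IntPtr.Size` from the field address and the `- IntPtr.Size` from `Unsafe.Add(..., -1)` cancel out, which is why the ideal codegen is a single load from the object pointer rather than a `lea` followed by a load at a negative displacement.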
category:cq
theme:basic-cq
skill-level:expert
cost:large