-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable promotion of structs passed in multiple registers with matching fields #37924
Comments
Allow HFAs and other structs with matching fields and registers. Contributes to dotnet#37924
* Allow enregistering more structs args Allow HFAs and other structs with matching fields and registers. Contributes to #37924
Although #39326 addressed some cases, on Arm64 additional changes are needed for non-HFA multireg args. |
Hello, I've stumbled upon an issue that I think is related to this issue (also maybe #13417?), which in my case makes it so that the codegen produced by the JIT when using a struct enumerator is extremely slow, due to tons of moves back and forth from the struct instance, instead of just enregistering the various fields that are present there. Commenting here to avoid opening a possible duplicate issue, let me know if I should instead create one and I'll be happy to move this comment there 🙂 I'm working on a public ref struct Enumerator // Defined within Span2D<T>
{
private readonly Span<T> span; // target ref and height of the 2D area
private readonly int width;
private readonly int pitch; // distance between consecutive rows (0 if rows are contiguous)
private int x; // initially set to -1 in constructor (omitted here)
private int y;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool MoveNext()
{
int x = this.x + 1;
if (x < this.width)
{
this.x = x;
return true;
}
this.x = 0;
return this.y++ < (this.span.Length - 1);
}
public readonly ref T Current
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get
{
ref T r0 = ref MemoryMarshal.GetReference(this.span);
nint index = ((nint)(uint)this.y * (nint)(uint)(this.width + this.pitch)) + (nint)(uint)this.x;
return ref Unsafe.Add(ref r0, index);
}
}
} I noticed using this type in a private static int SumForeachTest(Span2D<int>.Enumerator span)
{
int sum = 0;
while (span.MoveNext())
{
sum += span.Current;
}
return sum;
} And got this result (added some comments next to the assembly as I was trying to figure out what was going on). ; Assembly listing for method Span2DBenchmark.Benchmark:SumForeachTest(Enumerator[Int32]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 14, 96 ) byref -> rcx ld-addr-op
; V01 loc0 [V01,T04] ( 4, 6 ) int -> rax
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
; V03 tmp1 [V03,T06] ( 2, 4 ) long -> r8 "Inline stloc first use temp"
;* V04 tmp2 [V04 ] ( 0, 0 ) byref -> zero-ref "impAppendStmt"
;* V05 tmp3 [V05 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inlining Arg"
; V06 tmp4 [V06,T02] ( 2, 8 ) bool -> rdx "Inline return value spill temp"
; V07 tmp5 [V07,T01] ( 3, 20 ) int -> rdx "Inline stloc first use temp"
; V08 tmp6 [V08,T03] ( 3, 12 ) int -> rdx "Inline stloc first use temp"
; V09 tmp7 [V09,T05] ( 2, 4 ) byref -> rdx V05._pointer(offs=0x00) P-INDEP "field V05._pointer (fldOffset=0x0)"
;* V10 tmp8 [V10 ] ( 0, 0 ) int -> zero-ref V05._length(offs=0x08) P-INDEP "field V05._length (fldOffset=0x8)"
;
; Lcl frame size = 0
G_M35757_IG01:
;; bbWeight=1 PerfScore 0.00
START:
xor eax, eax ; eax = sum = 0
jmp SHORT MOVE_NEXT_COLUMN
;; bbWeight=1 PerfScore 2.25
INDEXING:
mov rdx, bword ptr [rcx] ; rdx = e.ByRef<T>
mov r8d, dword ptr [rcx+28] ; r8d = e.y
mov r9d, dword ptr [rcx+16] ; r9d = e.width
add r9d, dword ptr [rcx+20] ; r9d += e.pitch
mov r9d, r9d ; (nint)(uint)r9d
imul r8, r9 ; r8 = y * (e.height + e.pitch)
mov r9d, dword ptr [rcx+24] ; r9d = e.x
add r8, r9 ; r8 += r9
add eax, dword ptr [rdx+4*r8] ; eax += *(e.ByRef<T> + r8)
;; bbWeight=2 PerfScore 29.00
MOVE_NEXT_COLUMN:
mov edx, dword ptr [rcx+24] ; edx = e.x
inc edx ; x++
cmp edx, dword ptr [rcx+16] ; if (x >= e.width)
jge SHORT MOVE_NEXT_ROW
;; bbWeight=8 PerfScore 42.00
MOVE_NEXT_COLUMN_OK:
mov dword ptr [rcx+24], edx ; e.x = x
jmp SHORT INDEXING
;; bbWeight=4 PerfScore 12.00
MOVE_NEXT_ROW:
xor edx, edx ; edx = 0
mov dword ptr [rcx+24], edx ; e.x = edx
mov edx, dword ptr [rcx+28] ; edx = e.y
lea r8d, [rdx+1] ; r8d = edx + 1
mov dword ptr [rcx+28], r8d ; e.y = r8d
mov r8d, dword ptr [rcx+8] ; r8d = e.height
dec r8d ; r8d--
cmp edx, r8d ; edx < r8d ?
setl dl
movzx rdx, dl ; rdx = edx < r8d
test edx, edx ; if (rdx)
jne SHORT INDEXING
;; bbWeight=4 PerfScore 39.00
RETURN:
ret
;; bbWeight=1 PerfScore 1.00
; Total bytes of code 89, prolog size 0, PerfScore 134.15, (MethodHash=327a7452) for method Span2DBenchmark.Benchmark:SumForeachTest(Enumerator[Int32]):int I've noticed that the resulting codegen is both not enregistering the various fields but instead accessing them every single time from the input struct (so it's just reading from the stack a whole lot of times, which explains the huge slowdown), and also I guess for the same reason it's also writing back values to the input span every time it modifies them. I would've expected the JIT to be able to just decompose the struct into the available registers and stop using the stack entirely, and just use registers for the entirety of the method above. I saw that there's some work related to stack promotion planned for .NET 6, but reading the document that seemed to be related more to small structs that could fit into one or two registers instead of much larger structs (such as this one)? |
You are running into two different limits in the current stack promotion logic:
The first one is easy to experiment with. Last time I looked, it seemed like if we increased the field count limit, we would also need to re-examine the heuristics, as overall there was quite a bit of code size growth to little benefit. Might be worth retrying this and seeing if anything's changed. The second one seemingly requires more substantial changes. I've seen this pattern of including @CarolEidt is looking over struct related issues and can comment further on likely .Net 6 work. |
Hey @AndyAyersMS, thanks for the info! public ref struct Enumerator
{
private readonly ref T reference;
private readonly int height;
private readonly int width;
private readonly int pitch;
private int x;
private int y;
// Enumeration code...
} So in theory if (1) that number of fields for struct promotion was raised to at least 6 and (2) Just my two cents here, but I think that with custom struct enumerators becoming more and more widespread, it would be worth it to lift some of these limitations to ensure that consumers relying on them to get the nice |
Fixed with #43870 |
@CarolEidt I just tried again (now that #43870 is merged) with the snippet I previously posted in my comment above (this one), but the codegen is the same in this case, I imagine due to those other limitations Andy mentioned. Since both this and the other linked issue (#13417) are now closed, I was wondering if it could be useful for me to either just copy that comment in another open and related issue, or just to open a new issue for that specifically? I mean, so that it'd be easier to track whenever that particular thing is fixed, as right now it'd be hard to see since this issue is closed already and that comment was not specifically related to it. Of course, as mentioned above, assuming .NET 6 will get EDIT: as a test, to exclude that As in, this type (both in the public unsafe ref struct Enumerator<T>
where T : unmanaged
{
private readonly T* ptr;
private readonly int height;
private readonly int width;
private readonly int pitch;
private int x;
private int y;
} |
With #36862 we have support for enregistering structs that are returned in multiple registers, but structs that are passed in multiple registers are still sometimes disabled from being enregistered, even if the number of fields matches the number of registers. This includes at least some HFAs. In particular,
Compiler::StructPromotionHelper::ShouldPromoteStructVar()
will only return true for a parameter if there are 1 or 2 fields, and will return false on x64/ux for some cases of 2 fields where the offset of the second field is not 8 bytes.category:cq
theme:structs
skill-level:expert
cost:medium
The text was updated successfully, but these errors were encountered: