Slightly improve struct zeroing & copying #83488
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details

Closes #83277

Avoid handling the remainder data after the SIMD loop with multiple scalar operations; cover it with a single overlapping SIMD store instead, e.g.:

struct MyStruct
{
    public fixed byte Data[30];
}

MyStruct Test()
{
    return new MyStruct();
}

Codegen diff (base vs PR) for Test():

; Assembly listing for method StackallocTests:Test():StackallocTests+MyStruct:this
G_M46073_IG01:
sub rsp, 56
vzeroupper
mov rax, 0xD1FFAB1E
mov qword ptr [rsp+30H], rax
G_M46073_IG02:
xor eax, eax
vxorps xmm0, xmm0
vmovdqu xmmword ptr [rdx], xmm0
- mov qword ptr [rdx+10H], rax
- mov qword ptr [rdx+16H], rax
+ vmovdqu xmmword ptr [rdx+0EH], xmm0
mov rax, rdx
mov rcx, 0xD1FFAB1E
cmp qword ptr [rsp+30H], rcx
je SHORT G_M46073_IG03
call CORINFO_HELP_FAIL_FAST
G_M46073_IG03:
nop
G_M46073_IG04:
add rsp, 56
ret
-; Total bytes of code 71
+; Total bytes of code 68
Diff example from SPMI:

vmovdqu ymm0, ymmword ptr[r8+FCH]
vmovdqu ymmword ptr[rdx+FCH], ymm0
- vmovdqu xmm0, xmmword ptr [r8+11CH]
- vmovdqu xmmword ptr [rdx+11CH], xmm0
- mov rax, qword ptr [r8+12CH]
- mov qword ptr [rdx+12CH], rax
- mov rax, qword ptr [r8+133H]
- mov qword ptr [rdx+133H], rax
- ;; size=62 bbWeight=1 PerfScore 19.00
+ vmovdqu ymm0, ymmword ptr[r8+11BH]
+ vmovdqu ymmword ptr[rdx+11BH], ymm0
+ ;; size=34 bbWeight=1 PerfScore 14.00
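For illustration, the overlapping trick maps roughly onto the following C# sketch (hypothetical helper, not code from this PR or the runtime): instead of finishing a 30-byte block with scalar stores, the final 16-byte store is shifted back so it ends exactly at the last byte.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static unsafe class OverlapSketch
{
    // Zero a 30-byte block with two 16-byte stores; the second store starts at
    // offset 14 so it ends at byte 30, overlapping the first store by 2 bytes.
    public static void Zero30(byte* dst)
    {
        Vector128<byte> zero = Vector128<byte>.Zero;   // cheap all-zero vector (xorps-style)
        Unsafe.WriteUnaligned(dst, zero);              // bytes [0, 16)
        Unsafe.WriteUnaligned(dst + 14, zero);         // bytes [14, 30); the overlap is harmless
    }
}
```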
@kunalspathak @TIHan @dotnet/jit-contrib PTAL
@@ -3042,11 +3042,13 @@ void CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* node)
                               ? YMM_REGSIZE_BYTES
                               : XMM_REGSIZE_BYTES;

        bool zeroing = false;
We should be able to do this for any integral const that "fits in byte" or is the same bit pattern repeated across all bytes. So this should also work for -1, for example, which initializes all bytes to 0xFF.

Even if we don't handle that in this PR, a TODO would be good, and so would naming this something slightly different to indicate it applies to any case where the bytes are idempotent.
Right, but this path is almost never hit for a non-zero value, so I didn't complicate the logic with an "is the input idempotent" check. I can leave a comment.
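As an aside for readers, a check along the lines Tanner describes could look roughly like this (a hypothetical sketch, not part of this PR):

```csharp
static class FillPatternSketch
{
    // Returns true when every byte of the 64-bit fill constant is identical
    // (0x00...00 for 0, 0xFF...FF for -1, etc.), so the value can be handled
    // the same way as the zero case.
    public static bool HasRepeatedBytePattern(long fillValue)
    {
        ulong v = (ulong)fillValue;
        ulong broadcast = (v & 0xFF) * 0x0101010101010101UL; // replicate the low byte into all 8 bytes
        return v == broadcast;
    }
}
```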
        if (src->gtSkipReloadOrCopy()->IsIntegralConst(0))
        {
            // If the source is constant 0 then always use xorps, it's faster
            // than copying the constant from a GPR to a XMM register.
            emit->emitIns_R_R(INS_xorps, EA_ATTR(regSize), srcXmmReg, srcXmmReg);
            zeroing = true;
        }
        else
        {
Just noting since I happened to look in the else here... We're initializing from a srcIntReg, and we probably would have been better off doing the "can use simd copy" check in lowering instead and opting to lower to a gtNewSimdCreateBroadcastNode.

This will do the most efficient thing of creating a GenTreeVecCon (which includes efficiently getting a zero or allbitsset node rather than pulling from memory), or it will do the "most efficient" broadcast of the value into the xmm register (movd followed by vpbroadcastd or pshufd, typically).

What we're doing right now "works", but it's quite a bit more expensive than it needs to be and won't see improvements from other SIMD opts we enable over time.
I'd leave that up-for-grabs since it requires a broader refactoring to do the whole thing in Lower. This PR only slightly improves codegen at a stage where it's too late to work with SIMD gen trees.
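For context, a rough managed analogue of the approach described above (illustrative only; the actual change would live in the JIT's lowering, not in C#):

```csharp
using System.Runtime.Intrinsics;

static class BroadcastSketch
{
    // Illustrative sketch: pick the fill pattern the way a broadcast-based lowering
    // could -- cheap zero / all-bits-set vectors for the common constants, otherwise
    // a broadcast of the byte into all lanes (roughly the movd + vpbroadcastd/pshufd
    // shape mentioned above).
    public static Vector128<byte> MakeFillPattern(byte value)
    {
        if (value == 0x00) return Vector128<byte>.Zero;        // typically an xorps-style zero
        if (value == 0xFF) return Vector128<byte>.AllBitsSet;  // typically a pcmpeq-style all-ones
        return Vector128.Create(value);                        // broadcast the byte to all 16 lanes
    }
}
```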
Remember not to reintroduce the bug fixed by #53116 when making this optimization. (I don't know if this code path is at all related, just wanted to call it out so you can double-check!)
|
        // Size is too large for YMM moves, try stepping down to XMM size to finish SIMD copies.
        if (regSize == YMM_REGSIZE_BYTES)

        while (size >= regSize)
I really like how this is simpler than the original for loop.
        {
            assert(regSize >= XMM_REGSIZE_BYTES);

        if (isPow2(size) && (size <= REGSIZE_BYTES))
Up to you, but since this section is almost the same as https://github.com/dotnet/runtime/pull/83488/files#diff-63dc452244e1b3fea66bfdc746d83c26f866ef153966fd708585df5428e49093R3112, I'm wondering if we should make a single common function out of it.
Ideally, just like Tanner suggested, this all should be done with GT_HWINTRINSIC so we can get the best codegen, so I'm leaving that up-for-grabs.
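To summarize the overall shape of the unrolled copy being reviewed here, a hedged C# sketch (hypothetical names, assumes the block is at least 16 bytes; the JIT emits this pattern directly rather than running managed code like this):

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static unsafe class UnrolledCopySketch
{
    public static void Copy(byte* dst, byte* src, nuint size)
    {
        Debug.Assert(size >= (nuint)Vector128<byte>.Count); // sketch only handles blocks of 16+ bytes

        nuint offset = 0;

        // YMM-sized (32-byte) chunks first.
        while (size - offset >= (nuint)Vector256<byte>.Count)
        {
            Unsafe.WriteUnaligned(dst + offset, Unsafe.ReadUnaligned<Vector256<byte>>(src + offset));
            offset += (nuint)Vector256<byte>.Count;
        }

        // Remaining size is too small for YMM moves: step down to XMM (16-byte) chunks.
        while (size - offset >= (nuint)Vector128<byte>.Count)
        {
            Unsafe.WriteUnaligned(dst + offset, Unsafe.ReadUnaligned<Vector128<byte>>(src + offset));
            offset += (nuint)Vector128<byte>.Count;
        }

        // Any leftover bytes are covered by one last XMM copy that ends exactly at
        // the final byte, overlapping data that was already copied.
        if (offset < size)
        {
            nuint tail = size - (nuint)Vector128<byte>.Count;
            Unsafe.WriteUnaligned(dst + tail, Unsafe.ReadUnaligned<Vector128<byte>>(src + tail));
        }
    }
}
```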
Closes #83277
Currently, when we need to perform an unrolled memset/memcpy (BLK), we do a loop of SIMD stores/loads and then handle the remainder using scalar loads/stores. This PR slightly improves that logic by using wide loads/stores that overlap with previously processed data.
Codegen diff (base vs PR) for InitMemory() and CopyMemory():
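The diff itself is not reproduced here, but methods along these lines would exercise the two paths (hypothetical shapes, not necessarily the exact methods measured in the PR):

```csharp
// Hypothetical examples of methods that hit the unrolled init and copy paths.
unsafe struct Data
{
    public fixed byte Bytes[30];
}

static class MemoryBenchSketch
{
    // Zero-initializing the struct goes through the unrolled InitBlk path.
    public static Data InitMemory() => new Data();

    // Returning a struct by value goes through the unrolled CopyBlk path.
    public static Data CopyMemory(in Data source) => source;
}
```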