
Slightly improve struct zeroing & copying #83488

Merged 6 commits, Mar 17, 2023
Conversation

EgorBo (Member) commented Mar 16, 2023

Closes #83277

Currently, when we need to perform an unrolled memset/memcpy (BLK), we emit a loop of SIMD stores and then handle the remainder using scalar loads/stores. This PR slightly improves that logic by covering the remainder with wide loads/stores that overlap previously processed data.

unsafe struct MyStruct // fixed-size buffers require an unsafe context
{
    public fixed byte Data[30]; // and other sizes
}

// unrolled memset
MyStruct InitMemory() => new MyStruct();

// unrolled memcpy
MyStruct CopyMemory(MyStruct val) => val;

Codegen diff (base vs PR) for InitMemory() and CopyMemory():

; Assembly listing for method InitMemory()
G_M46073_IG01:	
       sub      rsp, 56	
       vzeroupper 	
       mov      rax, 0xD1FFAB1E	
       mov      qword ptr [rsp+30H], rax	
G_M46073_IG02:	
       xor      eax, eax	
       vxorps   xmm0, xmm0	
       vmovdqu  xmmword ptr [rdx], xmm0	
-      mov      qword ptr [rdx+10H], rax	
-      mov      qword ptr [rdx+16H], rax	
+      vmovdqu  xmmword ptr [rdx+0EH], xmm0
       mov      rax, rdx	
       mov      rcx, 0xD1FFAB1E	
       cmp      qword ptr [rsp+30H], rcx	
       je       SHORT G_M46073_IG03	
       call     CORINFO_HELP_FAIL_FAST	
G_M46073_IG03:	
       nop      	
G_M46073_IG04:	
       add      rsp, 56	
       ret      	
-; Total bytes of code 71
+; Total bytes of code 68


; Assembly listing for method CopyMemory()
G_M3723_IG01:
       sub      rsp, 56
       vzeroupper 
       mov      rax, 0xD1FFAB1E
       mov      qword ptr [rsp+30H], rax
G_M3723_IG02:
       vmovdqu  xmm0, xmmword ptr [rdx]
       vmovdqu  xmmword ptr [rcx], xmm0
-      mov      rax, qword ptr [rdx+10H]
-      mov      qword ptr [rcx+10H], rax
-      mov      rax, qword ptr [rdx+16H]
-      mov      qword ptr [rcx+16H], rax
+      vmovdqu  xmm0, xmmword ptr [rdx+0EH]
+      vmovdqu  xmmword ptr [rcx+0EH], xmm0
       mov      rax, rcx
       mov      rcx, 0xD1FFAB1E
       cmp      qword ptr [rsp+30H], rcx
       je       SHORT G_M3723_IG03
       call     CORINFO_HELP_FAIL_FAST
G_M3723_IG03:
       nop      
G_M3723_IG04:
       add      rsp, 56
       ret      
-; Total bytes of code 77
+; Total bytes of code 71
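
To make the overlap trick concrete outside the JIT: for a 30-byte block, two 16-byte SIMD operations whose ranges overlap by 2 bytes cover the same memory the old XMM-plus-scalar sequence did. A minimal C# sketch of the idea (hypothetical helper names; an illustration, not the JIT's implementation):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class OverlapSketch
{
    // Zero a 30-byte block: 16 zero bytes at [0, 16), then 16 more at
    // [14, 30). The stores overlap on [14, 16), which is harmless because
    // both write the same value. Note 14 == 0EH, matching the disasm above.
    public static void Zero30(byte* dst)
    {
        Vector128<byte> zero = Vector128<byte>.Zero;
        Sse2.Store(dst, zero);       // bytes [0, 16)
        Sse2.Store(dst + 14, zero);  // bytes [14, 30)
    }

    // Copy a 30-byte block the same way. Both loads are performed before
    // either store, so the re-copied overlap bytes are identical.
    public static void Copy30(byte* dst, byte* src)
    {
        Vector128<byte> lo = Sse2.LoadVector128(src);       // bytes [0, 16)
        Vector128<byte> hi = Sse2.LoadVector128(src + 14);  // bytes [14, 30)
        Sse2.Store(dst, lo);
        Sse2.Store(dst + 14, hi);
    }
}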

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Mar 16, 2023
ghost assigned EgorBo Mar 16, 2023
ghost commented Mar 16, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

EgorBo changed the title from "Slightly improve struct zeroing (InitBlk unroll)" to "Slightly improve struct zeroing & copying" Mar 16, 2023
EgorBo marked this pull request as ready for review March 16, 2023 10:33
EgorBo (Member Author) commented Mar 16, 2023

Diff example from SPMI:

        vmovdqu  ymm0, ymmword ptr[r8+FCH]
        vmovdqu  ymmword ptr[rdx+FCH], ymm0
-       vmovdqu  xmm0, xmmword ptr [r8+11CH]
-       vmovdqu  xmmword ptr [rdx+11CH], xmm0
-       mov      rax, qword ptr [r8+12CH]
-       mov      qword ptr [rdx+12CH], rax
-       mov      rax, qword ptr [r8+133H]
-       mov      qword ptr [rdx+133H], rax
-						;; size=62 bbWeight=1 PerfScore 19.00
+       vmovdqu  ymm0, ymmword ptr[r8+11BH]
+       vmovdqu  ymmword ptr[rdx+11BH], ymm0
+						;; size=34 bbWeight=1 PerfScore 14.00

EgorBo (Member Author) commented Mar 16, 2023

Diffs

@kunalspathak @TIHan @dotnet/jit-contrib PTAL

@@ -3042,11 +3042,13 @@ void CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* node)
                              ? YMM_REGSIZE_BYTES
                              : XMM_REGSIZE_BYTES;

    bool zeroing = false;
Review comment (Member):
We should be able to do this for any integral const that "fits in a byte" or is the same bit pattern repeated across all bytes. So this should also work for -1, for example, which initializes all bytes to 0xFF.

Even if we don't handle that in this PR, a TODO and a slightly different name indicating that it applies to any case where the bytes are idempotent would be good.
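
For reference, such a check is cheap; a hedged sketch (the naming is mine, not anything in the JIT):

static bool IsUniformBytePattern(ulong value)
{
    // The overlap trick is value-safe whenever every byte of the fill
    // constant is identical, since re-writing overlapped bytes stores the
    // same data. True for 0, for -1 (all bytes 0xFF), for 0x7B7B...7B, etc.
    ulong broadcast = 0x0101010101010101UL * (value & 0xFF); // low byte in all 8 positions
    return value == broadcast;
}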

Reply (EgorBo, Member Author):

Right, but this path is almost never hit for a non-zero value, so I didn't complicate the logic with an "is input idempotent" check. I can leave a comment.

if (src->gtSkipReloadOrCopy()->IsIntegralConst(0))
{
    // If the source is constant 0 then always use xorps, it's faster
    // than copying the constant from a GPR to a XMM register.
    emit->emitIns_R_R(INS_xorps, EA_ATTR(regSize), srcXmmReg, srcXmmReg);
    zeroing = true;
}
else
{
Review comment (tannergooding, Member, Mar 16, 2023):

Just noting, since I happened to look at the else here: we're initializing from a srcIntReg, and we probably would have been better off doing the "can use simd copy" check in lowering instead and lowering to a gtNewSimdCreateBroadcastNode.

That will either create a GenTreeVecCon (which includes efficiently getting a zero or allbitsset node rather than pulling from memory), or do the "most efficient" broadcast of the value into the xmm register (movd followed by vpbroadcastd or pshufd, typically).

What we're doing right now "works", but it's quite a bit more expensive than it needs to be and won't see improvements from other SIMD opts we enable over time.
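
In managed-code terms, the broadcast being described is what Vector128.Create does with a scalar; an illustrative sketch (not the lowering itself):

using System.Runtime.Intrinsics;

// Broadcast a 32-bit fill value to all four lanes. On x64 with AVX2 this
// typically compiles to the movd + vpbroadcastd (or pshufd) sequence
// mentioned above, instead of bouncing the value through memory.
static Vector128<int> BroadcastFill(int value) => Vector128.Create(value);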

Reply (EgorBo, Member Author, Mar 16, 2023):

I'd leave that up-for-grabs, since it requires broader refactoring to do the whole thing in Lower. This PR only slightly improves codegen at a stage where it's too late to work with SIMD gen trees.

GrabYourPitchforks (Member) commented:
Remember not to reintroduce the bug fixed by #53116 when making this optimization. (I don't know if this code path is at all related, just wanted to call it out so you can double-check!)

EgorBo (Member Author) commented Mar 16, 2023

> Remember not to reintroduce the bug fixed by #53116 when making this optimization. (I don't know if this code path is at all related, just wanted to call it out so you can double-check!)

Yep, this path is never hit when GC handles are involved; see #83297 for example 🙂
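
As an illustration (mine, not from the PR): the unrolled path applies to plain-byte structs like MyStruct above; a struct that holds a GC reference is handled differently, since its reference slot has to stay visible to the GC.

// Hypothetical example: this struct contains a GC-tracked reference, so the
// JIT cannot treat it as an opaque byte block, and the overlapping SIMD
// unroll described in this PR does not apply to it.
struct WithReference
{
    public object Obj;  // GC-tracked slot
    public long A, B;   // plain data
}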


// Size is too large for YMM moves, try stepping down to XMM size to finish SIMD copies.
if (regSize == YMM_REGSIZE_BYTES)
while (size >= regSize)
Review comment (Contributor):
I really like how this is simpler than the original for loop.

{
    assert(regSize >= XMM_REGSIZE_BYTES);

    if (isPow2(size) && (size <= REGSIZE_BYTES))
Review comment (Contributor):

Up to you, but since this section is almost the same as https://github.com/dotnet/runtime/pull/83488/files#diff-63dc452244e1b3fea66bfdc746d83c26f866ef153966fd708585df5428e49093R3112, I'm wondering if we should make a single common function out of it.

Reply (EgorBo, Member Author):

Ideally, just like Tanner suggested, this should all be done with GT_HWINTRINSIC so we can get the best codegen, so I'm leaving that up-for-grabs.
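
For context on the quoted isPow2 condition: a remainder of exactly 1, 2, 4, or 8 bytes fits one scalar move, so only other tail sizes need the overlapping SIMD store. A sketch of that predicate (assumed shape, not the JIT's code):

static bool TailFitsSingleScalarStore(int size)
{
    const int RegSizeBytes = 8; // x64 GPR width, mirroring REGSIZE_BYTES
    // Power-of-two tails up to a GPR (1, 2, 4, 8 bytes) take one scalar
    // store; anything else gets one more SIMD store shifted back to
    // overlap already-written bytes.
    return size != 0 && (size & (size - 1)) == 0 && size <= RegSizeBytes;
}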

TIHan (Contributor) left a comment:

LGTM 👍

Labels: area-CodeGen-coreclr
Linked issue: Suboptimal codegen for memset/memcpy unrolling (#83277)