Slightly improve struct zeroing & copying #83488
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details

Closes #83277

Avoid handling the remainder data after the SIMD loop with multiple scalar operations; cover it with a single overlapping SIMD store instead, e.g.:

struct MyStruct
{
    public fixed byte Data[30];
}

MyStruct Test()
{
    return new MyStruct();
}

Codegen diff (base vs PR) for Test():

; Assembly listing for method StackallocTests:Test():StackallocTests+MyStruct:this
G_M46073_IG01:
sub rsp, 56
vzeroupper
mov rax, 0xD1FFAB1E
mov qword ptr [rsp+30H], rax
G_M46073_IG02:
xor eax, eax
vxorps xmm0, xmm0
vmovdqu xmmword ptr [rdx], xmm0
- mov qword ptr [rdx+10H], rax
- mov qword ptr [rdx+16H], rax
+ vmovdqu xmmword ptr [rdx+0EH], xmm0
mov rax, rdx
mov rcx, 0xD1FFAB1E
cmp qword ptr [rsp+30H], rcx
je SHORT G_M46073_IG03
call CORINFO_HELP_FAIL_FAST
G_M46073_IG03:
nop
G_M46073_IG04:
add rsp, 56
ret
-; Total bytes of code 71
+; Total bytes of code 68
Diff example from SPMI:

vmovdqu ymm0, ymmword ptr[r8+FCH]
vmovdqu ymmword ptr[rdx+FCH], ymm0
- vmovdqu xmm0, xmmword ptr [r8+11CH]
- vmovdqu xmmword ptr [rdx+11CH], xmm0
- mov rax, qword ptr [r8+12CH]
- mov qword ptr [rdx+12CH], rax
- mov rax, qword ptr [r8+133H]
- mov qword ptr [rdx+133H], rax
- ;; size=62 bbWeight=1 PerfScore 19.00
+ vmovdqu ymm0, ymmword ptr[r8+11BH]
+ vmovdqu ymmword ptr[rdx+11BH], ymm0
+ ;; size=34 bbWeight=1 PerfScore 14.00
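For illustration, the overlapping trick maps roughly onto the following C# sketch (hypothetical helper, not code from this PR or the runtime): instead of finishing a 30-byte block with scalar stores, the final 16-byte store is shifted back so it ends exactly at the last byte.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static unsafe class OverlapSketch
{
    // Zero a 30-byte block with two 16-byte stores; the second store starts at
    // offset 14 so it ends at byte 30, overlapping the first store by 2 bytes.
    public static void Zero30(byte* dst)
    {
        Vector128<byte> zero = Vector128<byte>.Zero;   // cheap all-zero vector (xorps-style)
        Unsafe.WriteUnaligned(dst, zero);              // bytes [0, 16)
        Unsafe.WriteUnaligned(dst + 14, zero);         // bytes [14, 30); the overlap is harmless
    }
}
```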
@kunalspathak @TIHan @dotnet/jit-contrib PTAL
@@ -3042,11 +3042,13 @@ void CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* node)
                               ? YMM_REGSIZE_BYTES
                               : XMM_REGSIZE_BYTES;

        bool zeroing = false;
We should be able to do this for any integral const that "fits in byte" or is the same bit pattern repeated across all bytes. So this should also work for -1, for example, which initializes all bytes to 0xFF.

Even if we don't handle that in this PR, a TODO would be good, and so would naming this something slightly different to indicate it applies to any case where the bytes are idempotent.
Right, but this path is almost never hit for a non-zero value, so I didn't complicate the logic with an "is the input idempotent" check. I can leave a comment.
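As an aside for readers, a check along the lines Tanner describes could look roughly like this (a hypothetical sketch, not part of this PR):

```csharp
static class FillPatternSketch
{
    // Returns true when every byte of the 64-bit fill constant is identical
    // (0x00...00 for 0, 0xFF...FF for -1, etc.), so the value can be handled
    // the same way as the zero case.
    public static bool HasRepeatedBytePattern(long fillValue)
    {
        ulong v = (ulong)fillValue;
        ulong broadcast = (v & 0xFF) * 0x0101010101010101UL; // replicate the low byte into all 8 bytes
        return v == broadcast;
    }
}
```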
        if (src->gtSkipReloadOrCopy()->IsIntegralConst(0))
        {
            // If the source is constant 0 then always use xorps, it's faster
            // than copying the constant from a GPR to a XMM register.
            emit->emitIns_R_R(INS_xorps, EA_ATTR(regSize), srcXmmReg, srcXmmReg);
            zeroing = true;
        }
        else
        {
Just noting since I happened to look in the else here... We're initializing from a srcIntReg, and we probably would have been better off doing the "can use simd copy" check in lowering instead and opting to lower to a gtNewSimdCreateBroadcastNode.

This will do the most efficient thing of creating a GenTreeVecCon (which includes efficiently getting a zero or allbitsset node rather than pulling from memory), or it will do the "most efficient" broadcast of the value into the xmm register (movd followed by vpbroadcastd or pshufd, typically).

What we're doing right now "works", but it's quite a bit more expensive than it needs to be and won't see improvements from other SIMD opts we enable over time.
I'd leave that up-for-grabs since it requires a broader refactoring to do the whole thing in Lower. This PR only slightly improves codegen at a stage where it's too late to work with SIMD gen trees.
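For context, a rough managed analogue of the approach described above (illustrative only; the actual change would live in the JIT's lowering, not in C#):

```csharp
using System.Runtime.Intrinsics;

static class BroadcastSketch
{
    // Illustrative sketch: pick the fill pattern the way a broadcast-based lowering
    // could -- cheap zero / all-bits-set vectors for the common constants, otherwise
    // a broadcast of the byte into all lanes (roughly the movd + vpbroadcastd/pshufd
    // shape mentioned above).
    public static Vector128<byte> MakeFillPattern(byte value)
    {
        if (value == 0x00) return Vector128<byte>.Zero;        // typically an xorps-style zero
        if (value == 0xFF) return Vector128<byte>.AllBitsSet;  // typically a pcmpeq-style all-ones
        return Vector128.Create(value);                        // broadcast the byte to all 16 lanes
    }
}
```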
Remember not to reintroduce the bug fixed by #53116 when making this optimization. (I don't know if this code path is at all related, just wanted to call it out so you can double-check!)
|
        // Size is too large for YMM moves, try stepping down to XMM size to finish SIMD copies.
        if (regSize == YMM_REGSIZE_BYTES)

        while (size >= regSize)
I really like how this is simpler than the original for loop.
        {
            assert(regSize >= XMM_REGSIZE_BYTES);

        if (isPow2(size) && (size <= REGSIZE_BYTES))
Up to you, but since this section is almost the same as https://github.com/dotnet/runtime/pull/83488/files#diff-63dc452244e1b3fea66bfdc746d83c26f866ef153966fd708585df5428e49093R3112, I'm wondering if we should make a single common function out of it.
Ideally, just like Tanner suggested, this all should be done with GT_HWINTRINSIC so we can get the best codegen, so I'm leaving that up-for-grabs.
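To summarize the overall shape of the unrolled copy being reviewed here, a hedged C# sketch (hypothetical names, assumes the block is at least 16 bytes; the JIT emits this pattern directly rather than running managed code like this):

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static unsafe class UnrolledCopySketch
{
    public static void Copy(byte* dst, byte* src, nuint size)
    {
        Debug.Assert(size >= (nuint)Vector128<byte>.Count); // sketch only handles blocks of 16+ bytes

        nuint offset = 0;

        // YMM-sized (32-byte) chunks first.
        while (size - offset >= (nuint)Vector256<byte>.Count)
        {
            Unsafe.WriteUnaligned(dst + offset, Unsafe.ReadUnaligned<Vector256<byte>>(src + offset));
            offset += (nuint)Vector256<byte>.Count;
        }

        // Remaining size is too small for YMM moves: step down to XMM (16-byte) chunks.
        while (size - offset >= (nuint)Vector128<byte>.Count)
        {
            Unsafe.WriteUnaligned(dst + offset, Unsafe.ReadUnaligned<Vector128<byte>>(src + offset));
            offset += (nuint)Vector128<byte>.Count;
        }

        // Any leftover bytes are covered by one last XMM copy that ends exactly at
        // the final byte, overlapping data that was already copied.
        if (offset < size)
        {
            nuint tail = size - (nuint)Vector128<byte>.Count;
            Unsafe.WriteUnaligned(dst + tail, Unsafe.ReadUnaligned<Vector128<byte>>(src + tail));
        }
    }
}
```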
Closes #83277
Currently, when we need to perform an unrolled memset/memcpy (BLK), we do a loop of SIMD stores/loads and then handle the remainder using scalar loads/stores. This PR slightly improves that logic by using wide loads/stores that overlap with previously processed data.
Codegen diff (base vs PR) for InitMemory() and CopyMemory():
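The diff itself is not reproduced here, but methods along these lines would exercise the two paths (hypothetical shapes, not necessarily the exact methods measured in the PR):

```csharp
// Hypothetical examples of methods that hit the unrolled init and copy paths.
unsafe struct Data
{
    public fixed byte Bytes[30];
}

static class MemoryBenchSketch
{
    // Zero-initializing the struct goes through the unrolled InitBlk path.
    public static Data InitMemory() => new Data();

    // Returning a struct by value goes through the unrolled CopyBlk path.
    public static Data CopyMemory(in Data source) => source;
}
```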