Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize stackalloc zeroing via BLK #83255

Merged
merged 19 commits into from
Apr 4, 2023
Merged

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Mar 10, 2023

Closes #63500

Let's insert GT_BLK after GT_HEAPLCL to rely on the former to perform zeroing.

Codegen example:

void Test()
{
    var p = stackalloc byte[250];
    Consume(p);
}

Main:

; Method Program:Test():this
G_M000_IG01:                
       55                   push     rbp
       4883EC30             sub      rsp, 48
       488D6C2420           lea      rbp, [rsp+20H]
       48B8218F332784F40000 mov      rax, 0xF48427338F21 ;; GS-cookie
       48894508             mov      qword ptr [rbp+08H], rax
G_M000_IG02:                
       4883C420             add      rsp, 32
       B910000000           mov      ecx, 16
G_M000_IG03:                
       6A00                 push     0
       6A00                 push     0
       48FFC9               dec      rcx
       75F7                 jne      SHORT G_M000_IG03  ;; slow loop (zeroing 16 bytes at once)
       4883EC20             sub      rsp, 32
       488D4C2420           lea      rcx, [rsp+20H]
       FF1577AA2800         call     [Program:Consume(ulong)]
       48B9218F332784F40000 mov      rcx, 0xF48427338F21
       48394D08             cmp      qword ptr [rbp+08H], rcx
       7405                 je       SHORT G_M000_IG04
       E852F6BB5F           call     CORINFO_HELP_FAIL_FAST
G_M000_IG04:                
       90                   nop      
G_M000_IG05:                
       488D6510             lea      rsp, [rbp+10H]
       5D                   pop      rbp
       C3                   ret      
; Total bytes of code: 85

PR:

; Method Progr:Test():this
G_M435_IG01:              
       55                   push     rbp
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper 
       488D6C2420           lea      rbp, [rsp+20H]
       48B878563412F0DEBC9A mov      rax, 0x9ABCDEF012345678
       48894508             mov      qword ptr [rbp+08H], rax
G_M435_IG02:              
       852424               test     dword ptr [rsp], esp
       4881EC00010000       sub      rsp, 256
       488D542420           lea      rdx, [rsp+20H]
       C5FC57C0             vxorps   ymm0, ymm0
       C5FE7F02             vmovdqu  ymmword ptr[rdx], ymm0
       C5FE7F4220           vmovdqu  ymmword ptr[rdx+20H], ymm0
       C5FE7F4240           vmovdqu  ymmword ptr[rdx+40H], ymm0
       C5FE7F4260           vmovdqu  ymmword ptr[rdx+60H], ymm0
       C5FE7F8280000000     vmovdqu  ymmword ptr[rdx+80H], ymm0
       C5FE7F82A0000000     vmovdqu  ymmword ptr[rdx+A0H], ymm0
       C5FE7F82C0000000     vmovdqu  ymmword ptr[rdx+C0H], ymm0
       C5FE7F82E0000000     vmovdqu  ymmword ptr[rdx+E0H], ymm0
       FF15F9926000         call     [Progr:Consume(ulong):this]
       48B978563412F0DEBC9A mov      rcx, 0x9ABCDEF012345678
       48394D08             cmp      qword ptr [rbp+08H], rcx
       7405                 je       SHORT G_M435_IG03
       E85472505F           call     CORINFO_HELP_FAIL_FAST
G_M435_IG03:              
       90                   nop      
G_M435_IG04:              
       488D6510             lea      rsp, [rbp+10H]
       5D                   pop      rbp
       C3                   ret      
; Total bytes of code: 131

For large constants, this PR switches to call memset while current Main's impl will still be doing that loop of double-push.

Benchmark

BenchmarkSwitcher.FromAssembly(typeof(StackallocTests).Assembly).Run(args);

[CsvExporter]
public unsafe class StackallocBenchmarks
{
    [Benchmark] public void Stackalloc8() { byte* ptr = stackalloc byte[8]; Consume(ptr); }
    [Benchmark] public void Stackalloc16() { byte* ptr = stackalloc byte[16]; Consume(ptr); }
    [Benchmark] public void Stackalloc20() { byte* ptr = stackalloc byte[20]; Consume(ptr); }
    [Benchmark] public void Stackalloc32() { byte* ptr = stackalloc byte[32]; Consume(ptr); }
    [Benchmark] public void Stackalloc36() { byte* ptr = stackalloc byte[36]; Consume(ptr); }
    [Benchmark] public void Stackalloc40() { byte* ptr = stackalloc byte[40]; Consume(ptr); }
    [Benchmark] public void Stackalloc50() { byte* ptr = stackalloc byte[50]; Consume(ptr); }
    [Benchmark] public void Stackalloc64() { byte* ptr = stackalloc byte[64]; Consume(ptr); }
    [Benchmark] public void Stackalloc65() { byte* ptr = stackalloc byte[65]; Consume(ptr); }
    [Benchmark] public void Stackalloc80() { byte* ptr = stackalloc byte[80]; Consume(ptr); }
    [Benchmark] public void Stackalloc100() { byte* ptr = stackalloc byte[100]; Consume(ptr); }
    [Benchmark] public void Stackalloc110() { byte* ptr = stackalloc byte[110]; Consume(ptr); }
    [Benchmark] public void Stackalloc128() { byte* ptr = stackalloc byte[128]; Consume(ptr); }
    [Benchmark] public void Stackalloc129() { byte* ptr = stackalloc byte[129]; Consume(ptr); }
    [Benchmark] public void Stackalloc150() { byte* ptr = stackalloc byte[150]; Consume(ptr); }
    [Benchmark] public void Stackalloc180() { byte* ptr = stackalloc byte[180]; Consume(ptr); }
    [Benchmark] public void Stackalloc220() { byte* ptr = stackalloc byte[220]; Consume(ptr); }
    [Benchmark] public void Stackalloc256() { byte* ptr = stackalloc byte[256]; Consume(ptr); }
    [Benchmark] public void Stackalloc257() { byte* ptr = stackalloc byte[257]; Consume(ptr); }
    [Benchmark] public void Stackalloc300() { byte* ptr = stackalloc byte[300]; Consume(ptr); }
    [Benchmark] public void Stackalloc400() { byte* ptr = stackalloc byte[400]; Consume(ptr); }
    [Benchmark] public void Stackalloc500() { byte* ptr = stackalloc byte[500]; Consume(ptr); }
    [Benchmark] public void Stackalloc1024() { byte* ptr = stackalloc byte[1024]; Consume(ptr); }
    [Benchmark] public void Stackalloc4096() { byte* ptr = stackalloc byte[4096]; Consume(ptr); }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(byte* ptr) {}
}

Core i7 8700K

image

Ryzen 7950X

image

NOTE: 32 bytes and lower are handled separately so there are no differences for them.

@ghost ghost assigned EgorBo Mar 10, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 10, 2023
@ghost
Copy link

ghost commented Mar 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Let's see if this works - I just insert GT_BLK (basically, Unsafe.InitMemoryUnaligned) after CEE_LOCALLOC in importer to rely on that for zeroing. GT_BLK has its own logic to unroll/emit MEMSET.

   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[40]; Consume(ptr); }
   
   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[50]; Consume(ptr); }

   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[64]; Consume(ptr); }
   
   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[100]; Consume(ptr); }

   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[128]; Consume(ptr); }

   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[150]; Consume(ptr); }
   
   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[256]; Consume(ptr); }
   
   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[512]; Consume(ptr); }

   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[1024]; Consume(ptr); }
   
   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[4096]; Consume(ptr); }
   
   [Benchmark]
   public void Stackalloc50() { byte* ptr = stackalloc byte[8192]; Consume(ptr); }

   [MethodImpl(MethodImplOptions.NoInlining)]
   static void Consume(byte* ptr)
   {
   }
|         Method |                 Toolchain |       Mean | Ratio |
|--------------- |-------------------------- |-----------:|------:|
|   Stackalloc40 |    \Core_Root\corerun.exe |   1.493 ns |  1.00 |
|   Stackalloc40 | \Core_Root_PR\corerun.exe |   1.495 ns |  1.00 |
|                |                           |            |       |
|   Stackalloc50 |    \Core_Root\corerun.exe |   2.362 ns |  1.00 |
|   Stackalloc50 | \Core_Root_PR\corerun.exe |   1.321 ns |  0.56 |
|                |                           |            |       |
|   Stackalloc64 |    \Core_Root\corerun.exe |   2.354 ns |  1.00 |
|   Stackalloc64 | \Core_Root_PR\corerun.exe |   1.326 ns |  0.56 |
|                |                           |            |       |
|  Stackalloc100 |    \Core_Root\corerun.exe |   3.020 ns |  1.00 |
|  Stackalloc100 | \Core_Root_PR\corerun.exe |   1.316 ns |  0.44 |
|                |                           |            |       |
|  Stackalloc128 |    \Core_Root\corerun.exe |   3.230 ns |  1.00 |
|  Stackalloc128 | \Core_Root_PR\corerun.exe |   1.508 ns |  0.47 |
|                |                           |            |       |
|  Stackalloc150 |    \Core_Root\corerun.exe |   3.792 ns |  1.00 |
|  Stackalloc150 | \Core_Root_PR\corerun.exe |   4.536 ns |  1.20 |
|                |                           |            |       |
|  Stackalloc256 |    \Core_Root\corerun.exe |   6.295 ns |  1.00 |
|  Stackalloc256 | \Core_Root_PR\corerun.exe |   4.534 ns |  0.72 |
|                |                           |            |       |
|  Stackalloc512 |    \Core_Root\corerun.exe |  13.617 ns |  1.00 |
|  Stackalloc512 | \Core_Root_PR\corerun.exe |   4.760 ns |  0.35 |
|                |                           |            |       |
| Stackalloc1024 |    \Core_Root\corerun.exe |  26.741 ns |  1.00 |
| Stackalloc1024 | \Core_Root_PR\corerun.exe |   6.725 ns |  0.25 |
|                |                           |            |       |
| Stackalloc4096 |    \Core_Root\corerun.exe | 109.259 ns |  1.00 |
| Stackalloc4096 | \Core_Root_PR\corerun.exe |  30.131 ns |  0.28 |
|                |                           |            |       |
| Stackalloc8192 |    \Core_Root\corerun.exe | 217.935 ns |  1.00 |
| Stackalloc8192 | \Core_Root_PR\corerun.exe |  57.974 ns |  0.27 |
Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@stephentoub
Copy link
Member

after 128 PR switches to MEMSET

For comparison, what does the graph look like if you don't do that?

@EgorBo
Copy link
Member Author

EgorBo commented Mar 10, 2023

after 128 PR switches to MEMSET

For comparison, what does the graph look like if you don't do that?

It looks like we might want to revise our heuristics, e.g. here what Clang/LLVM does:
For a generic CPU with AVX it unrolls zeroing up to 256 bytes, e.g.: https://godbolt.org/z/59Mdodc7T (NOTE that I'm using -Os that stands for "Optimize but keep binary size sane")

For Zen4 (AMD 7xxx) it unrolls up to 512 bytes (AVX512): https://godbolt.org/z/PxvoE4P9r

For a generic CPU without AVX it unrolls up to 128 bytes: https://godbolt.org/z/b4vd13PMz

Our threshold is hard-coded to 128. (and 256 for ARM64)

@EgorBo
Copy link
Member Author

EgorBo commented Mar 10, 2023

NOTE: afair, some (most?) libs in BCL use skiplocalsinit globally so they won't benefit from this change. But to properly test it I need to remove that flag.

@stephentoub
Copy link
Member

stephentoub commented Mar 10, 2023

some (most?) libs in BCL use skiplocalsinit globally

Yes, everything in the shared framework:

<SkipLocalsInit Condition="'$(SkipLocalsInit)' == '' and '$(MSBuildProjectExtension)' == '.csproj' and '$(IsNETCoreAppSrc)' == 'true' and '$(TargetFrameworkIdentifier)' == '.NETCoreApp'">true</SkipLocalsInit>

@EgorBo
Copy link
Member Author

EgorBo commented Mar 10, 2023

cc @anthonycanino (in case if you're interested adjusting the BLK unroll heuristic for avx-512)

@EgorBo EgorBo marked this pull request as ready for review March 14, 2023 09:57
@EgorBo
Copy link
Member Author

EgorBo commented Mar 14, 2023

after 128 PR switches to MEMSET

For comparison, what does the graph look like if you don't do that?

Updated. Fixed via #83274

@EgorBo EgorBo marked this pull request as draft March 14, 2023 13:02
@benaadams
Copy link
Member

benaadams commented Mar 17, 2023

Is stackalloc different to locals? As .NET 7 will partially unroll and loop the local zeroing (though currently only uses xmm)

       vxorps   xmm4, xmm4
       mov      rax, -0x2340
       vmovdqa  xmmword ptr [rbp+rax-60H], xmm4
       vmovdqa  xmmword ptr [rbp+rax-50H], xmm4
       vmovdqa  xmmword ptr [rbp+rax-40H], xmm4
       add      rax, 48
       jne      SHORT  -5 instr

Rather than this weird thing

G_M000_IG03:                
       push     0
       push     0
       dec      rcx
       jne      SHORT G_M000_IG03  ;; slow loop (zeroing 16 bytes at once)

@benaadams
Copy link
Member

i.e. should stackalloc be part of locals?

@EgorBo
Copy link
Member Author

EgorBo commented Mar 17, 2023

i.e. should stackalloc be part of locals?

Good question. We do that if stackalloc is smaller than 32 bytes. The problem with that that we'll have to zero it all the time no matter if we need it or not (since it's in the prologue). E.g. let me raise that limit to 128 bytes and check codegen for this:

void Test(bool cond)
{
    if (cond)
    {
        // rarely taken condition
        var p = stackalloc byte[128];
        Consume(p);
    }
    else
    {
        Console.WriteLine();
    }
}

Codegen:

; Method Program:Test(bool):this
G_M34929_IG01:              ;; offset=0000H
       4881ECA8000000       sub      rsp, 168
       C5D857E4             vxorps   xmm4, xmm4
       C5F97F642420         vmovdqa  xmmword ptr [rsp+20H], xmm4
       C5F97F642430         vmovdqa  xmmword ptr [rsp+30H], xmm4
       48B8A0FFFFFFFFFFFFFF mov      rax, -96
       C5F97FA404A0000000   vmovdqa  xmmword ptr [rsp+rax+A0H], xmm4
       C5F97FA404B0000000   vmovdqa  xmmword ptr [rsp+rax+B0H], xmm4
       C5F97FA404C0000000   vmovdqa  xmmword ptr [rsp+rax+C0H], xmm4
       4883C030             add      rax, 48
       75DF                 jne      SHORT  -5 instr
       48B878563412F0DEBC9A mov      rax, 0x9ABCDEF012345678
       48898424A0000000     mov      qword ptr [rsp+A0H], rax
						;; size=84 bbWeight=1 PerfScore 13.33

G_M34929_IG02:              ;; offset=0054H
       84D2                 test     dl, dl
       742D                 je       SHORT G_M34929_IG06
						;; size=4 bbWeight=1 PerfScore 1.25

G_M34929_IG03:              ;; offset=0058H
       488D4C2420           lea      rcx, [rsp+20H]
       FF15150A7100         call     [Program:Consume(ulong)]
       48B978563412F0DEBC9A mov      rcx, 0x9ABCDEF012345678
       48398C24A0000000     cmp      qword ptr [rsp+A0H], rcx
       7405                 je       SHORT G_M34929_IG04
       E824CA4B5F           call     CORINFO_HELP_FAIL_FAST
						;; size=36 bbWeight=0.50 PerfScore 3.88

G_M34929_IG04:              ;; offset=007CH
       90                   nop      
						;; size=1 bbWeight=0.50 PerfScore 0.12

G_M34929_IG05:              ;; offset=007DH
       4881C4A8000000       add      rsp, 168
       C3                   ret      
						;; size=8 bbWeight=0.50 PerfScore 0.62

G_M34929_IG06:              ;; offset=0085H
       FF152DA69000         call     [System.Console:WriteLine()]
       48B978563412F0DEBC9A mov      rcx, 0x9ABCDEF012345678
       48398C24A0000000     cmp      qword ptr [rsp+A0H], rcx
       7405                 je       SHORT G_M34929_IG07
       E8FCC94B5F           call     CORINFO_HELP_FAIL_FAST
						;; size=31 bbWeight=0.50 PerfScore 3.62

G_M34929_IG07:              ;; offset=00A4H
       90                   nop      
						;; size=1 bbWeight=0.50 PerfScore 0.12

G_M34929_IG08:              ;; offset=00A5H
       4881C4A8000000       add      rsp, 168
       C3                   ret      
						;; size=8 bbWeight=0.50 PerfScore 0.62
; Total bytes of code: 173

Also, here we don't do stack probing.
Probably, I should decrease that threshold from 32 to some single-instruction sized

@EgorBo EgorBo marked this pull request as ready for review March 18, 2023 00:12
@EgorBo
Copy link
Member Author

EgorBo commented Mar 31, 2023

@jakobbotsch @BruceForstall @dotnet/jit-contrib PTAL

I inject a BLK node in Lower for all stackalloc nodes (GT_LCLHEAP) with uses. Will enable arm64 separately, its default impl is good but it doesn't switch to memset call for large buffers so still will benefit from BLK too.

src/coreclr/jit/importer.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lower.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lower.cpp Outdated Show resolved Hide resolved
EgorBo and others added 5 commits March 31, 2023 19:41
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
@EgorBo EgorBo merged commit e13f0dc into dotnet:main Apr 4, 2023
@EgorBo EgorBo deleted the stackalloc-zero-simd branch April 4, 2023 23:36
@ghost ghost locked as resolved and limited conversation to collaborators May 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT should vectorize zeroing constant stackalloc-s
6 participants