Optimize stackalloc zeroing via BLK #83255
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details

Let's see if this works - I just insert GT_BLK (basically, Unsafe.InitBlockUnaligned) after CEE_LOCALLOC in the importer and rely on it for zeroing. GT_BLK has its own logic to unroll/emit MEMSET.

[Benchmark]
public void Stackalloc40() { byte* ptr = stackalloc byte[40]; Consume(ptr); }
[Benchmark]
public void Stackalloc50() { byte* ptr = stackalloc byte[50]; Consume(ptr); }
[Benchmark]
public void Stackalloc64() { byte* ptr = stackalloc byte[64]; Consume(ptr); }
[Benchmark]
public void Stackalloc100() { byte* ptr = stackalloc byte[100]; Consume(ptr); }
[Benchmark]
public void Stackalloc128() { byte* ptr = stackalloc byte[128]; Consume(ptr); }
[Benchmark]
public void Stackalloc150() { byte* ptr = stackalloc byte[150]; Consume(ptr); }
[Benchmark]
public void Stackalloc256() { byte* ptr = stackalloc byte[256]; Consume(ptr); }
[Benchmark]
public void Stackalloc512() { byte* ptr = stackalloc byte[512]; Consume(ptr); }
[Benchmark]
public void Stackalloc1024() { byte* ptr = stackalloc byte[1024]; Consume(ptr); }
[Benchmark]
public void Stackalloc4096() { byte* ptr = stackalloc byte[4096]; Consume(ptr); }
[Benchmark]
public void Stackalloc8192() { byte* ptr = stackalloc byte[8192]; Consume(ptr); }
[MethodImpl(MethodImplOptions.NoInlining)]
static void Consume(byte* ptr)
{
}
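To make the idea above concrete, here is a hedged sketch of what the inserted GT_BLK conceptually does: a zero-fill block init over the localloc'd memory, roughly equivalent to Unsafe.InitBlockUnaligned with a zero value. This is an illustration of the semantics, not the JIT's actual code; the class and method names are invented for the example.

```csharp
// Sketch (assumption: illustrative only). Conceptually, zero-initializing a
// stackalloc'd block behaves like Unsafe.InitBlockUnaligned(ptr, 0, size).
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

class BlkZeroSketch
{
    static void Main()
    {
        Span<byte> block = stackalloc byte[64];
        block.Fill(0xFF); // pretend the stack memory held garbage

        // GT_BLK-style block init: fill all 64 bytes with zero.
        Unsafe.InitBlockUnaligned(ref MemoryMarshal.GetReference(block), 0, 64);

        bool allZero = true;
        foreach (byte b in block)
            if (b != 0) allZero = false;
        Console.WriteLine(allZero); // prints True
    }
}
```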
For comparison, what does the graph look like if you don't do that?
It looks like we might want to revise our heuristics. E.g., here is what Clang/LLVM does: for Zen 4 (AMD 7xxx) it unrolls up to 512 bytes (AVX-512): https://godbolt.org/z/PxvoE4P9r; for a generic CPU without AVX it unrolls up to 128 bytes: https://godbolt.org/z/b4vd13PMz. Our threshold is hard-coded to 128 (and 256 for ARM64).
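To illustrate what "unrolling up to the threshold" means: below the limit the JIT emits a straight-line run of vector stores instead of a loop or a memset call. The sketch below mirrors that shape in C# with eight explicit 16-byte stores for a 128-byte block; it is an illustration only (names invented), and actual vector width and thresholds vary by target CPU.

```csharp
// Sketch (assumption: illustrative, not JIT code). Fully-unrolled zeroing of
// 128 bytes via 16-byte vector stores, like the vmovdqa runs in this thread.
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

class UnrollSketch
{
    static void Zero128(ref byte dst)
    {
        Vector128<byte> zero = Vector128<byte>.Zero;
        // No loop below the threshold: straight-line stores.
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 0), zero);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 16), zero);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 32), zero);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 48), zero);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 64), zero);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 80), zero);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 96), zero);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dst, 112), zero);
    }

    static void Main()
    {
        byte[] buf = new byte[128];
        Array.Fill(buf, (byte)0xFF);
        Zero128(ref buf[0]);
        Console.WriteLine(Array.TrueForAll(buf, b => b == 0)); // prints True
    }
}
```

Above the threshold, emitting a memset call is usually the better trade: the unrolled form costs code size linearly with the block size.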
NOTE: AFAIR, some (most?) libs in the BCL use
Yes, everything in the shared framework: runtime/src/libraries/Directory.Build.targets, line 209 (at a923c64).
cc @anthonycanino (in case you're interested in adjusting the BLK unroll heuristic for AVX-512)
Updated. Fixed via #83274 |
(force-pushed from 8c1d241 to 167ce5a)
(force-pushed from a4a84ab to 588b964)
Is stackalloc different from locals? .NET 7 will partially unroll and loop the local zeroing (though currently it only uses 16-byte stores):

vxorps xmm4, xmm4
mov rax, -0x2340
vmovdqa xmmword ptr [rbp+rax-60H], xmm4
vmovdqa xmmword ptr [rbp+rax-50H], xmm4
vmovdqa xmmword ptr [rbp+rax-40H], xmm4
add rax, 48
jne SHORT -5 instr

Rather than this weird thing:

G_M000_IG03:
push 0
push 0
dec rcx
jne SHORT G_M000_IG03 ;; slow loop (zeroing 16 bytes at once)
i.e. should stackalloc be part of locals?
Good question. We do that, e.g. for:

void Test(bool cond)
{
if (cond)
{
// rarely taken condition
var p = stackalloc byte[128];
Consume(p);
}
else
{
Console.WriteLine();
}
}

Codegen:

; Method Program:Test(bool):this
G_M34929_IG01: ;; offset=0000H
4881ECA8000000 sub rsp, 168
C5D857E4 vxorps xmm4, xmm4
C5F97F642420 vmovdqa xmmword ptr [rsp+20H], xmm4
C5F97F642430 vmovdqa xmmword ptr [rsp+30H], xmm4
48B8A0FFFFFFFFFFFFFF mov rax, -96
C5F97FA404A0000000 vmovdqa xmmword ptr [rsp+rax+A0H], xmm4
C5F97FA404B0000000 vmovdqa xmmword ptr [rsp+rax+B0H], xmm4
C5F97FA404C0000000 vmovdqa xmmword ptr [rsp+rax+C0H], xmm4
4883C030 add rax, 48
75DF jne SHORT -5 instr
48B878563412F0DEBC9A mov rax, 0x9ABCDEF012345678
48898424A0000000 mov qword ptr [rsp+A0H], rax
;; size=84 bbWeight=1 PerfScore 13.33
G_M34929_IG02: ;; offset=0054H
84D2 test dl, dl
742D je SHORT G_M34929_IG06
;; size=4 bbWeight=1 PerfScore 1.25
G_M34929_IG03: ;; offset=0058H
488D4C2420 lea rcx, [rsp+20H]
FF15150A7100 call [Program:Consume(ulong)]
48B978563412F0DEBC9A mov rcx, 0x9ABCDEF012345678
48398C24A0000000 cmp qword ptr [rsp+A0H], rcx
7405 je SHORT G_M34929_IG04
E824CA4B5F call CORINFO_HELP_FAIL_FAST
;; size=36 bbWeight=0.50 PerfScore 3.88
G_M34929_IG04: ;; offset=007CH
90 nop
;; size=1 bbWeight=0.50 PerfScore 0.12
G_M34929_IG05: ;; offset=007DH
4881C4A8000000 add rsp, 168
C3 ret
;; size=8 bbWeight=0.50 PerfScore 0.62
G_M34929_IG06: ;; offset=0085H
FF152DA69000 call [System.Console:WriteLine()]
48B978563412F0DEBC9A mov rcx, 0x9ABCDEF012345678
48398C24A0000000 cmp qword ptr [rsp+A0H], rcx
7405 je SHORT G_M34929_IG07
E8FCC94B5F call CORINFO_HELP_FAIL_FAST
;; size=31 bbWeight=0.50 PerfScore 3.62
G_M34929_IG07: ;; offset=00A4H
90 nop
;; size=1 bbWeight=0.50 PerfScore 0.12
G_M34929_IG08: ;; offset=00A5H
4881C4A8000000 add rsp, 168
C3 ret
;; size=8 bbWeight=0.50 PerfScore 0.62
; Total bytes of code: 173

Also, here we don't do stack probing.
@jakobbotsch @BruceForstall @dotnet/jit-contrib PTAL. I inject a BLK node in Lower for all stackalloc nodes (
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
Closes #63500
Let's insert GT_BLK after GT_HEAPLCL to rely on the former to perform zeroing.
Codegen example:
Main:
PR:
For large constants, this PR switches to
call memset
while current Main's impl will still be doing that double-push loop.

Benchmark
Core i7 8700K
Ryzen 7950X
NOTE: 32 bytes and lower are handled separately, so there are no differences for them.