Unify unroll limits in a single entry point #83274

Merged: EgorBo merged 8 commits into dotnet:main from unify-unroll-limits on Mar 13, 2023

Conversation

EgorBo (Member) commented Mar 10, 2023

The current limits were a bit odd, e.g. a hard limit of 128 bytes on x64 regardless of whether AVX is supported (AVX halves the number of instructions needed).

Also, the new limits overall match what native compilers do for memset/memcpy unrolling under -Os (size-aware): https://godbolt.org/z/dW1qqaP9a

Closes #82529
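
To check the resulting codegen locally, here is a hedged probe sketch (the class, method name, and size below are illustrative and not part of this PR; it assumes a runtime where DOTNET_JitDisasm is honored):

// Run with the environment variable DOTNET_JitDisasm=ClearBuffer set to dump the
// codegen for this method and see whether the constant-size InitBlockUnaligned is
// unrolled into SIMD stores or lowered to a memset helper call.
using System.Runtime.CompilerServices;

static class UnrollProbe // hypothetical name, for illustration only
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void ClearBuffer(ref byte dst)
        => Unsafe.InitBlockUnaligned(ref dst, 0, 128); // constant length: an unrolling candidate
}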

ghost assigned EgorBo Mar 10, 2023
dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Mar 10, 2023
ghost commented Mar 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: EgorBo
Assignees: EgorBo
Labels: area-CodeGen-coreclr

Milestone: -

jakobbotsch (Member):

Fixes #82529?

EgorBo (Member, Author) commented Mar 10, 2023

> Fixes #82529?

Ah, didn't see this one. Yeah, it does. It zeroes the S2 struct via:

       xor      eax, eax
       vxorps   ymm0, ymm0
       vmovdqu  ymmword ptr[rdx], ymm0
       vmovdqu  ymmword ptr[rdx+20H], ymm0
       vmovdqu  ymmword ptr[rdx+40H], ymm0
       vmovdqu  ymmword ptr[rdx+60H], ymm0
       vmovdqu  ymmword ptr[rdx+80H], ymm0
       mov      qword ptr [rdx+A0H], rax

but only when AVX is available, or on arm64.
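
For context, a minimal sketch of the shape of code that produces the sequence above (the struct name and layout are only inferred from the ~168 bytes of stores; the real S2 is defined in #82529):

// Hypothetical stand-in for S2: 21 x 8 = 168 bytes of non-GC fields, matching the
// five 32-byte vmovdqu stores plus the trailing 8-byte GPR store shown above.
struct S2Sketch
{
    public long F00, F01, F02, F03, F04, F05, F06,
                F07, F08, F09, F10, F11, F12, F13,
                F14, F15, F16, F17, F18, F19, F20;
}

static S2Sketch ZeroIt() => default; // zeroing the return buffer is what gets unrolled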

@EgorBo EgorBo marked this pull request as ready for review March 10, 2023 22:09
EgorBo (Member, Author) commented Mar 11, 2023

Benchmarks:

Memset

using System.Runtime.CompilerServices; // Unsafe
using BenchmarkDotNet.Attributes;      // [Benchmark]

public unsafe class MemsetBenchmarks
{
    private static readonly byte[] Data1 = new byte[1024];

    [Benchmark] public void Memset8() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 8);
    [Benchmark] public void Memset10() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 10);
    [Benchmark] public void Memset14() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 14);
    [Benchmark] public void Memset16() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 16);
    [Benchmark] public void Memset17() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 17);
    [Benchmark] public void Memset20() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 20);
    [Benchmark] public void Memset32() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 32);
    [Benchmark] public void Memset33() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 33);
    [Benchmark] public void Memset40() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 40);
    [Benchmark] public void Memset50() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 50);
    [Benchmark] public void Memset64() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 64);
    [Benchmark] public void Memset65() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 65);
    [Benchmark] public void Memset80() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 80);
    [Benchmark] public void Memset90() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 90);
    [Benchmark] public void Memset110() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 110);
    [Benchmark] public void Memset128() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 128);
    [Benchmark] public void Memset129() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 129);
    [Benchmark] public void Memset200() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 200);
    [Benchmark] public void Memset256() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 256);
    [Benchmark] public void Memset257() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 257);
    [Benchmark] public void Memset300() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 300);
    [Benchmark] public void Memset400() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 400);
    [Benchmark] public void Memset512() => Unsafe.InitBlockUnaligned(ref Data1[0], 0, 512);
}

(screenshot: Memset benchmark results)

Memcpy

public unsafe class MemcpyBenchmarks
{
    private static readonly byte[] Data1 = new byte[1024];
    private static readonly byte[] Data2 = new byte[1024];

    [Benchmark] public void Memcpy8() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 8);
    [Benchmark] public void Memcpy10() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 10);
    [Benchmark] public void Memcpy14() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 14);
    [Benchmark] public void Memcpy16() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 16);
    [Benchmark] public void Memcpy17() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 17);
    [Benchmark] public void Memcpy20() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 20);
    [Benchmark] public void Memcpy32() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 32);
    [Benchmark] public void Memcpy33() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 33);
    [Benchmark] public void Memcpy40() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 40);
    [Benchmark] public void Memcpy50() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 50);
    [Benchmark] public void Memcpy64() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 64);
    [Benchmark] public void Memcpy65() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 65);
    [Benchmark] public void Memcpy80() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 80);
    [Benchmark] public void Memcpy90() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 90);
    [Benchmark] public void Memcpy110() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 110);
    [Benchmark] public void Memcpy128() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 128);
    [Benchmark] public void Memcpy129() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 129);
    [Benchmark] public void Memcpy200() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 200);
    [Benchmark] public void Memcpy256() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 256);
    [Benchmark] public void Memcpy257() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 257);
    [Benchmark] public void Memcpy300() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 300);
    [Benchmark] public void Memcpy400() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 400);
    [Benchmark] public void Memcpy512() => Unsafe.CopyBlockUnaligned(ref Data1[0], ref Data2[0], 512);
}
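
For completeness, a minimal entry point to run both classes above (assumes the BenchmarkDotNet package is referenced; the Program shape is illustrative):

using BenchmarkDotNet.Running;

public class Program
{
    // Runs either benchmark class (or both), depending on the command-line filter.
    public static void Main(string[] args)
        => BenchmarkSwitcher
            .FromTypes(new[] { typeof(MemsetBenchmarks), typeof(MemcpyBenchmarks) })
            .Run(args);
}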

(screenshot: Memcpy benchmark results)

Verified on Core i7-8700K and Core i9-9980HK; planning to test on a Ryzen 7950X.

@EgorBo EgorBo marked this pull request as draft March 11, 2023 10:38
Review thread on src/coreclr/jit/targetx86.h (outdated, resolved)
EgorBo (Member, Author) commented Mar 11, 2023

@dotnet/jit-contrib PTAL
This PR unifies the various unrolling strategies behind a single entry point and fixes some oddities, e.g. on x64 we had a hard limit of 128 bytes for memset regardless of whether AVX is available or only GPRs can be used (the case with GC fields).

Visible things this PR fixes:

  • ARM32 used to have memset=32b and memcpy=64b; apparently whoever set those mixed them up, so I made it memset=64b, memcpy=32b (large negative size diffs).
  • Changed memset to 256b (with AVX available) and memcpy to 128b on AMD64 (they used to be 128b/64b); see the benchmarks above. This matches clang/LLVM behavior under -Os (size) for a generic CPU, and it is also needed for faster stackalloc zeroing (see the sketch after this list): https://user-images.githubusercontent.com/523221/224337267-efa1e0c9-5684-4c53-ab52-6154106d8d80.png
  • ARM64 had an odd limit when src/dst do not point to the stack.
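
A hedged illustration of the stackalloc point above (the method is made up for this example; with the default locals-init behavior the JIT has to zero the buffer, and a constant 256-byte clear now fits under the new AMD64 memset limit when AVX is available):

static unsafe void UseScratch()
{
    // With .locals init (the C# default), this 256-byte buffer is zeroed on entry;
    // under the new limit that zeroing is expected to be unrolled into SIMD stores
    // instead of a memset helper call.
    byte* scratch = stackalloc byte[256];
    scratch[0] = 1; // ... use the buffer ...
}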

Diffs are not too big outside of the coreclr_tests collection: around +2k-3k bytes for libraries.pmi: diffs

To improve some of them I filed:

A typical size regression looks like this:

        mov      qword ptr [rbp-C8H], rdx
        mov      rdx, bword ptr [rbp+18H]
        ; byrRegs +[rdx]
-       lea      rcx, bword ptr [rbp-B8H]
-       ; byrRegs +[rcx]
-       mov      r8d, 80
-       call     CORINFO_HELP_MEMCPY
-       ; byrRegs -[rcx rdx]
+       vmovdqu  ymm0, ymmword ptr[rdx]
+       vmovdqu  ymmword ptr[rbp-B8H], ymm0
+       vmovdqu  ymm0, ymmword ptr[rdx+20H]
+       vmovdqu  ymmword ptr[rbp-98H], ymm0
+       vmovdqu  xmm0, xmmword ptr [rdx+40H]
+       vmovdqu  xmmword ptr [rbp-78H], xmm0
        mov      rdx, qword ptr [rbp-C0H]
+       ; byrRegs -[rdx]
        mov      r8, qword ptr [rbp-C8H]
        lea      r9, [rbp-B8H]
        lea      rcx, [rbp-40H]

which is 2x faster on all machines I tested. There are also several cases where unrolling produces more compact code than a memset/memcpy helper call, presumably due to register spills around the call.

Diffs are mostly negative for ARM64, e.g.:

(screenshot: ARM64 size diffs)

@EgorBo EgorBo marked this pull request as ready for review March 11, 2023 21:20
EgorBo (Member, Author) commented Mar 13, 2023

@tannergooding @dotnet/jit-contrib PTAL

@EgorBo EgorBo merged commit c861106 into dotnet:main Mar 13, 2023
@EgorBo EgorBo deleted the unify-unroll-limits branch March 13, 2023 17:14
EgorBo (Member, Author) commented Mar 14, 2023

(screenshot)

Improved parsing of doubles

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Projects: none yet
Development: successfully merging this pull request may close these issues:
  • Investigate adjusting heuristics for unrolled block copies/initialization
4 participants