
Optimize "X == Y" to "(X ^ Y) == 0 for SIMD #93818

Closed
wants to merge 3 commits

Conversation

EgorBo
Member

@EgorBo EgorBo commented Oct 21, 2023

Matches native compilers now https://godbolt.org/z/afaE18saG

Quick example:

bool Test(Vector256<byte> v1, Vector256<byte> v2) => v1 == v2;
; Method Test
       vzeroupper 
       vmovups  ymm0, ymmword ptr [rdx]
-      vpcmpeqb k1, ymm0, ymmword ptr [r8]
-      kortestd k1, k1
-      setb     al
+      vpxor    ymm0, ymm0, ymmword ptr [r8]
+      vptest   ymm0, ymm0
+      sete     al
       movzx    rax, al
       vzeroupper 
       ret      
-; Total bytes of code: 28
+; Total bytes of code: 27
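In source terms, the rewrite relies on the identity that XOR of equal inputs is all zeros. A minimal hand-written sketch of the equivalent check (illustrative only, not the JIT's actual expansion):

```csharp
using System.Runtime.Intrinsics;

static bool EqualsViaXor(Vector256<byte> v1, Vector256<byte> v2)
{
    // v1 == v2  <=>  (v1 ^ v2) == 0, which lets the backend emit
    // vpxor + vptest + sete instead of vpcmpeqb + kortest + setb.
    return (v1 ^ v2) == Vector256<byte>.Zero;
}
```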

@ghost ghost assigned EgorBo Oct 21, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 21, 2023
@ghost

ghost commented Oct 21, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details
  1. Remove the ternarylog opt, as it's now handled by [JIT] Fold some bitwise operations to vpternlog #91227, so there's no need to keep it here
  2. Improve codegen for cases where length == simdSize, e.g.:
static bool Foo(string s) => 
    s == "12345678123456781234567812345678";
; Assembly listing for method Program:Foo(System.String):ubyte (FullOpts)
       vzeroupper 
       test     rcx, rcx
       je       SHORT G_M28968_IG05
       cmp      dword ptr [rcx+0x08], 32
       jne      SHORT G_M28968_IG05
       vmovups  zmm0, zmmword ptr [rcx+0x0C]
-      vpxorq   zmm0, zmm0, zmmword ptr [reloc @RWD00]
-      vptestmq k1, zmm0, zmm0
+      vpcmpeqq k1, zmm0, zmmword ptr [reloc @RWD00]
       kortestb k1, k1
-      sete     al
+      setb     al
       movzx    rax, al
       jmp      SHORT G_M28968_IG06
G_M28968_IG05:
       xor      eax, eax
G_M28968_IG06:
       vzeroupper 
       ret
-; Total bytes of code 58
+; Total bytes of code 52
Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -
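For context, the length == simdSize case above corresponds roughly to the following hand-written shape (an illustrative sketch using MemoryMarshal/Unsafe; the JIT expands the comparison inline rather than calling a helper like this):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static bool EqualsConst32(string s)
{
    const string cns = "12345678123456781234567812345678";
    if (s is null || s.Length != 32)
        return false;

    // 32 UTF-16 chars == 64 bytes == one 512-bit load on each side.
    ref ushort sData   = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(s.AsSpan()));
    ref ushort cnsData = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(cns.AsSpan()));
    return Vector512.LoadUnsafe(ref sData) == Vector512.LoadUnsafe(ref cnsData);
}
```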

@EgorBo EgorBo closed this Oct 21, 2023
@EgorBo EgorBo reopened this Oct 21, 2023
@EgorBo EgorBo force-pushed the improve-importervectorization branch from d6bc8af to ea7c181 on October 21, 2023 23:29
@EgorBo EgorBo changed the title from Code clean up in importervectorization.cpp to Optimize "X == Y" to "(X ^ Y) == 0" for SIMD on Oct 21, 2023
EgorBo and others added 2 commits October 22, 2023 16:46
Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
@tannergooding
Member

tannergooding commented Oct 22, 2023

Quick example:

I'm not sure this is "better". vptest is more limited when EVEX is enabled and our register allocator doesn't have the ability to handle that level of nuance today.

For example, vptest can't be used for anything that requires the EVEX encoding. This limits LSRA to only using xmm0-xmm15 (cutting out xmm16-xmm31). Likewise, it prevents the use of features like embedded broadcast, and it can't be used with floating-point or with comparison kinds other than strict equality. I would expect that native compilers take this into account and will sometimes opt to use kortest instead when register pressure is high or when the overall instruction sequence could benefit from an EVEX-only feature.

Given that, I'd probably lean towards it being better to just keep things "as is" here and only do the xor, ptest trick for VEX where it is clearly faster. -- If we had a way to do better instruction selection based on some of these heuristics, I think it might be a different story.


Also notably, here are the timings for the instructions involved here

| Instruction | Intel | AMD |
| --- | --- | --- |
| kortest | 1 uop, 2 latency | 1 uop, 6 latency |
| vpcmp (VEX) | 1 uop, 1 latency | 1 uop, 1 latency |
| vpcmp (EVEX) | 1 uop, 3 latency | 1 uop, 3 latency |
| vptest (XMM) | 2 uops, 4 latency | 2 uops, 7 latency |
| vptest (YMM) | 2 uops, 6 latency | 2 uops, 9 latency |
| vpxor | 1 uop, 1 latency | 1 uop, 1 latency |

Which shows that, at least theoretically, vpcmp + kortest is the better sequence in many, if not most, cases: it has either the same latency with one fewer uop, or lower latency and fewer uops. The one exception is XMM on AMD, where it is 1 cycle higher latency but one fewer uop.
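For concreteness, the VEX-only shape discussed here can be spelled with explicit intrinsics; the EVEX vpcmp-into-mask + kortest form has no direct public-intrinsic spelling, so only the xor + ptest path is sketched below (illustrative, assuming AVX2 is present):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static bool EqualsVexPtest(Vector256<byte> v1, Vector256<byte> v2)
{
    // vpxor: all-zero iff the inputs are equal
    Vector256<byte> diff = Avx2.Xor(v1, v2);
    // vptest sets ZF when (diff & diff) == 0; TestZ surfaces that flag
    return Avx.TestZ(diff, diff);
}
```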

@EgorBo
Member Author

EgorBo commented Oct 22, 2023

I'm not sure this is "better"

Once you switched vector equality to use kortest even for XMM/YMM, it regressed my benchmarks on AMD; we don't have AVX512 hardware, so it was never reflected in our perflab infra as regressions. As you can see from my godbolt link, native compilers don't use it either.

Also notably, here are the timings for the instructions involved here

I think you forgot to include RThroughput here.

Also, it's smaller in terms of code size and hence potentially better.

And finally, uiCA stats on these for Tiger Lake (pretty much the same for other microarchitectures):

[uiCA screenshots comparing the two instruction sequences]

@EgorBo
Member Author

EgorBo commented Oct 22, 2023

Benchmark:

static Benchmarks()
{
    arr1 = (byte*)NativeMemory.AlignedAlloc(8 * 1024, 64);
    arr2 = (byte*)NativeMemory.AlignedAlloc(8 * 1024, 64);
}

static byte* arr1;
static byte* arr2;

[Benchmark]
public bool VectorEquals()
{
    ref byte a = ref Unsafe.AsRef<byte>(arr1);
    ref byte b = ref Unsafe.AsRef<byte>(arr2);

    for (nuint i = 0; i < 1024; i+=16)
    {
        if (Vector128.LoadUnsafe(ref a, i) != Vector128.LoadUnsafe(ref b, i))
            return false;
    }
    return true;
}
| Method | Job | Mean | Error | StdDev | Ratio |
| --- | --- | --- | --- | --- | --- |
| VectorEquals | Main | 20.26 ns | 1.239 ns | 0.068 ns | 1.00 |
| VectorEquals | PR | 15.24 ns | 0.697 ns | 0.038 ns | 0.75 |

Ryzen 7950X; I'll find some modern Intel to check there, but I suspect the same results. Either way, the difference is quite noticeable.

@tannergooding
Member

Ryzen 7950X, will find some modern intel to check there but I suspect the same results. But the difference is quite noticeable.

Can you also check for YMM?

Likewise, what happens in register-heavy code where we end up needing to pick XMM16-XMM31 to avoid spilling?
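For reference, a YMM variant of the benchmark above would just widen the stride to 32 bytes (an untested sketch reusing the same arr1/arr2 setup as the Vector128 version):

```csharp
[Benchmark]
public bool VectorEqualsYmm()
{
    ref byte a = ref Unsafe.AsRef<byte>(arr1);
    ref byte b = ref Unsafe.AsRef<byte>(arr2);

    for (nuint i = 0; i < 1024; i += 32)
    {
        // 32-byte stride: one Vector256 compare per iteration
        if (Vector256.LoadUnsafe(ref a, i) != Vector256.LoadUnsafe(ref b, i))
            return false;
    }
    return true;
}
```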

@EgorBo
Member Author

EgorBo commented Oct 22, 2023

Ah, diffs are empty if I disable it for non-AVX512 hardware (presumably, most of our SPMI collections either don't have AVX512 or it's ignored because of throttling issues), so I'm going to close this for now. We need better coverage for AVX512 in SPMI.

@EgorBo EgorBo closed this Oct 22, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Nov 21, 2023