
Optimize "X == Y" to "(X ^ Y) == 0 for SIMD #93818

Closed
wants to merge 3 commits

Conversation

EgorBo
Member

@EgorBo EgorBo commented Oct 21, 2023

Matches native compilers now https://godbolt.org/z/afaE18saG

Quick example:

bool Test(Vector256<byte> v1, Vector256<byte> v2) => v1 == v2;
; Method Test
       vzeroupper 
       vmovups  ymm0, ymmword ptr [rdx]
-      vpcmpeqb k1, ymm0, ymmword ptr [r8]
-      kortestd k1, k1
-      setb     al
+      vpxor    ymm0, ymm0, ymmword ptr [r8]
+      vptest   ymm0, ymm0
+      sete     al
       movzx    rax, al
       vzeroupper 
       ret      
-; Total bytes of code: 28
+; Total bytes of code: 27
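In source terms, the rewrite relies on the identity that XOR of equal inputs is all zeros. A minimal hand-written sketch of the equivalent check (illustrative only, not the JIT's actual expansion):

```csharp
using System.Runtime.Intrinsics;

static bool EqualsViaXor(Vector256<byte> v1, Vector256<byte> v2)
{
    // v1 == v2  <=>  (v1 ^ v2) == 0, which lets the backend emit
    // vpxor + vptest + sete instead of vpcmpeqb + kortest + setb.
    return (v1 ^ v2) == Vector256<byte>.Zero;
}
```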

@ghost ghost assigned EgorBo Oct 21, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 21, 2023
@ghost

ghost commented Oct 21, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details
  1. Remove the ternarylog opt, as it's now handled by [JIT] Fold some bitwise operations to vpternlog #91227, so there's no need to keep it here
  2. Improve codegen for cases where length == simdSize, e.g.:
static bool Foo(string s) => 
    s == "12345678123456781234567812345678";
; Assembly listing for method Program:Foo(System.String):ubyte (FullOpts)
       vzeroupper 
       test     rcx, rcx
       je       SHORT G_M28968_IG05
       cmp      dword ptr [rcx+0x08], 32
       jne      SHORT G_M28968_IG05
       vmovups  zmm0, zmmword ptr [rcx+0x0C]
-      vpxorq   zmm0, zmm0, zmmword ptr [reloc @RWD00]
-      vptestmq k1, zmm0, zmm0
+      vpcmpeqq k1, zmm0, zmmword ptr [reloc @RWD00]
       kortestb k1, k1
-      sete     al
+      setb     al
       movzx    rax, al
       jmp      SHORT G_M28968_IG06
G_M28968_IG05:
       xor      eax, eax
G_M28968_IG06:
       vzeroupper 
       ret
-; Total bytes of code 58
+; Total bytes of code 52
Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -
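For context, the length == simdSize case above corresponds roughly to the following hand-written shape (an illustrative sketch using MemoryMarshal/Unsafe; the JIT expands the comparison inline rather than calling a helper like this):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static bool EqualsConst32(string s)
{
    const string cns = "12345678123456781234567812345678";
    if (s is null || s.Length != 32)
        return false;

    // 32 UTF-16 chars == 64 bytes == one 512-bit load on each side.
    ref ushort sData   = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(s.AsSpan()));
    ref ushort cnsData = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(cns.AsSpan()));
    return Vector512.LoadUnsafe(ref sData) == Vector512.LoadUnsafe(ref cnsData);
}
```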

@EgorBo EgorBo closed this Oct 21, 2023
@EgorBo EgorBo reopened this Oct 21, 2023
@EgorBo EgorBo force-pushed the improve-importervectorization branch from d6bc8af to ea7c181 on October 21, 2023 23:29
@EgorBo EgorBo changed the title from Code clean up in importervectorization.cpp to Optimize "X == Y" to "(X ^ Y) == 0" for SIMD on Oct 21, 2023
EgorBo and others added 2 commits October 22, 2023 16:46
Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
@tannergooding
Member

tannergooding commented Oct 22, 2023

Quick example:

I'm not sure this is "better". vptest is more limited when EVEX is enabled and our register allocator doesn't have the ability to handle that level of nuance today.

For example, vptest can't be used for anything that requires the EVEX encoding. This limits LSRA to only using xmm0-xmm15 (cutting out xmm16-xmm31). Likewise, it prevents the use of features like embedded broadcast, and it can't be used with floating-point or with comparison kinds other than strict equality. I would expect that native compilers take this into account and will sometimes opt to use kortest instead when register pressure is high or when the overall instruction sequence could benefit from an EVEX-only feature.

Given that, I'd probably lean towards it being better to just keep things "as is" here and only do the xor, ptest trick for VEX where it is clearly faster. -- If we had a way to do better instruction selection based on some of these heuristics, I think it might be a different story.


Also notably, here are the timings for the instructions involved here

| Instruction | Intel | AMD |
| --- | --- | --- |
| kortest | 1 uop, 2 latency | 1 uop, 6 latency |
| vpcmp (VEX) | 1 uop, 1 latency | 1 uop, 1 latency |
| vpcmp (EVEX) | 1 uop, 3 latency | 1 uop, 3 latency |
| vptest (XMM) | 2 uops, 4 latency | 2 uops, 7 latency |
| vptest (YMM) | 2 uops, 6 latency | 2 uops, 9 latency |
| vpxor | 1 uop, 1 latency | 1 uop, 1 latency |

Which shows that, at least theoretically, vpcmp + kortest is the better sequence in many, if not most, cases: it has either the same latency with one fewer uop, or lower latency and fewer uops. The one exception is XMM on AMD, where it is 1 cycle higher latency but one fewer uop.
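For concreteness, the VEX-only shape discussed here can be spelled with explicit intrinsics; the EVEX vpcmp-into-mask + kortest form has no direct public-intrinsic spelling, so only the xor + ptest path is sketched below (illustrative, assuming AVX2 is present):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static bool EqualsVexPtest(Vector256<byte> v1, Vector256<byte> v2)
{
    // vpxor: all-zero iff the inputs are equal
    Vector256<byte> diff = Avx2.Xor(v1, v2);
    // vptest sets ZF when (diff & diff) == 0; TestZ surfaces that flag
    return Avx.TestZ(diff, diff);
}
```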

@EgorBo
Member Author

EgorBo commented Oct 22, 2023

I'm not sure this is "better"

Once you switched vector equality to use kortest even for XMM/YMM, it regressed my benchmarks on AMD; we don't have AVX512 hardware, so it was never reflected in our perflab infra as regressions. As you can see from my godbolt link, native compilers don't use it either.

Also notably, here are the timings for the instructions involved here

I think you forgot to include RThroughput here.

Also, it's smaller in terms of code size and hence potentially better.

And finally, uiCA stats on these for Tiger Lake (pretty much the same for other microarchitectures):

[uiCA screenshots comparing the two instruction sequences]

@EgorBo
Member Author

EgorBo commented Oct 22, 2023

Benchmark:

static Benchmarks()
{
    arr1 = (byte*)NativeMemory.AlignedAlloc(8 * 1024, 64);
    arr2 = (byte*)NativeMemory.AlignedAlloc(8 * 1024, 64);
}

static byte* arr1;
static byte* arr2;

[Benchmark]
public bool VectorEquals()
{
    ref byte a = ref Unsafe.AsRef<byte>(arr1);
    ref byte b = ref Unsafe.AsRef<byte>(arr2);

    for (nuint i = 0; i < 1024; i+=16)
    {
        if (Vector128.LoadUnsafe(ref a, i) != Vector128.LoadUnsafe(ref b, i))
            return false;
    }
    return true;
}
| Method | Job | Mean | Error | StdDev | Ratio |
| --- | --- | --- | --- | --- | --- |
| VectorEquals | Main | 20.26 ns | 1.239 ns | 0.068 ns | 1.00 |
| VectorEquals | PR | 15.24 ns | 0.697 ns | 0.038 ns | 0.75 |

Ryzen 7950X; I'll find some modern Intel to check there, but I suspect the same results. Either way, the difference is quite noticeable.

@tannergooding
Member

Ryzen 7950X, will find some modern intel to check there but I suspect the same results. But the difference is quite noticeable.

Can you also check for YMM?

Likewise, what happens in register-heavy code where we end up needing to pick XMM16-XMM31 to avoid spilling?
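For reference, a YMM variant of the benchmark above would just widen the stride to 32 bytes (an untested sketch reusing the same arr1/arr2 setup as the Vector128 version):

```csharp
[Benchmark]
public bool VectorEqualsYmm()
{
    ref byte a = ref Unsafe.AsRef<byte>(arr1);
    ref byte b = ref Unsafe.AsRef<byte>(arr2);

    for (nuint i = 0; i < 1024; i += 32)
    {
        // 32-byte stride: one Vector256 compare per iteration
        if (Vector256.LoadUnsafe(ref a, i) != Vector256.LoadUnsafe(ref b, i))
            return false;
    }
    return true;
}
```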

@EgorBo
Member Author

EgorBo commented Oct 22, 2023

Ah, diffs are empty if I disable it for non-AVX512 hardware (presumably, most of our SPMI collections either don't have AVX512 or it's ignored because of throttling issues), so I'm going to close this for now. We need better coverage for AVX512 in SPMI.

@EgorBo EgorBo closed this Oct 22, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Nov 21, 2023