Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmov not being used in nested hand-unrolled loop #109940

Closed
kg opened this issue Nov 19, 2024 · 3 comments
Closed

cmov not being used in nested hand-unrolled loop #109940

kg opened this issue Nov 19, 2024 · 3 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue

Comments

@kg
Copy link
Member

kg commented Nov 19, 2024

Description

I'm attempting to unroll a fixed-size scalar search loop so that it will generate a chain of cmovs. It's possible to get this to happen in C compiled by clang and msvc, but not C# on .NET 9. Instead, the best-case is a chain of movzx; cmp; jne; mov opcode clusters, i.e.

       movzx    r11, byte  ptr [rcx]
       cmp      r11d, r8d
       jne      SHORT G_M000_IG08
 
G_M000_IG07:                ;; offset=0x00C0
       xor      edx, edx
 
G_M000_IG08:                ;; offset=0x00C2
       movzx    r11, byte  ptr [rcx+0x01]
       cmp      r11d, r8d
       jne      SHORT G_M000_IG10
 
G_M000_IG09:                ;; offset=0x00CC
       mov      edx, 1

(For reference, this is the same method as #108092 but with the vectorization mostly disabled.)

Configuration

.NET 9
Windows 11 Pro build 22631
x64 (Ryzen 7950X)
Disasmo spew:

; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 10 single block inlinees; 22 inlinees without PGO data

Regression?

No, this didn't work on 8 either.

Reproduction code

Here's the simplest/easiest-to-fiddle-with version of the repro I can come up with that's 'real'. the C version in simdhash is similar and generates a chain of cmovs the last time I checked. The idea here is to fully pipeline the fixed-width (14 elements) search loop. There's no guarantee it would be faster (it was in C), but I can't actually measure it to see whether it's faster because I can't get it to generate the cmovs :)

Vector128<byte> bucket = ...;
byte needle = ...;
var haystack = (byte*)Unsafe.AsPointer(ref bucket);
var result = 32;
Lane(haystack, needle, ref result, 13);
Lane(haystack, needle, ref result, 12);
Lane(haystack, needle, ref result, 11);
Lane(haystack, needle, ref result, 10);
Lane(haystack, needle, ref result, 9);
Lane(haystack, needle, ref result, 8);
Lane(haystack, needle, ref result, 7);
Lane(haystack, needle, ref result, 6);
Lane(haystack, needle, ref result, 5);
Lane(haystack, needle, ref result, 4);
Lane(haystack, needle, ref result, 3);
Lane(haystack, needle, ref result, 2);
Lane(haystack, needle, ref result, 1);
Lane(haystack, needle, ref result, 0);
return result;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static void Lane (byte* haystack, byte needle, ref int result, int index) {
    if (haystack[index] == needle)
        result = index;
}

Additional notes

I've tried maybe a dozen different formulations/structures for this code and they all generate similar or worse code, none of them generate cmovs. I've verified that cmovs are generated in other parts of my code - this method even has one further up:

       movzx    r8, al
       mov      ecx, 255
       test     r8d, r8d
       cmovne   ecx, r8d

I've also tried reducing the number of Lane calls to see if that simplification would light up the cmov intrinsic but it didn't. I've also tried manually inlining Lane to no effect.

@kg kg added the tenet-performance Performance related issue label Nov 19, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 19, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Nov 19, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch
Copy link
Member

jakobbotsch commented Nov 19, 2024

Do you have a self-contained example? For

public static unsafe int Foo(Vector128<byte> bucket, byte needle)
{
    var haystack = (byte*)Unsafe.AsPointer(ref bucket);
    var result = 32;
    Lane(haystack, needle, ref result, 13);
    Lane(haystack, needle, ref result, 12);
    Lane(haystack, needle, ref result, 11);
    Lane(haystack, needle, ref result, 10);
    Lane(haystack, needle, ref result, 9);
    Lane(haystack, needle, ref result, 8);
    Lane(haystack, needle, ref result, 7);
    Lane(haystack, needle, ref result, 6);
    Lane(haystack, needle, ref result, 5);
    Lane(haystack, needle, ref result, 4);
    Lane(haystack, needle, ref result, 3);
    Lane(haystack, needle, ref result, 2);
    Lane(haystack, needle, ref result, 1);
    Lane(haystack, needle, ref result, 0);
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void Lane(byte* haystack, byte needle, ref int result, int index)
{
    if (haystack[index] == needle)
        result = index;
}

I get

; Method Program:Foo(System.Runtime.Intrinsics.Vector128`1[ubyte],ubyte):int (FullOpts)
G_M56088_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M56088_IG02:  ;; offset=0x0000
       mov      rax, rcx
       mov      ecx, 32
       movzx    r8, byte  ptr [rax+0x0D]
       movzx    rdx, dl
       mov      r10d, 13
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x0C]
       mov      r10d, 12
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x0B]
       mov      r10d, 11
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x0A]
       mov      r10d, 10
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x09]
       mov      r10d, 9
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x08]
       mov      r10d, 8
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x07]
       mov      r10d, 7
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x06]
       mov      r10d, 6
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x05]
       mov      r10d, 5
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x04]
       mov      r10d, 4
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x03]
       mov      r10d, 3
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x02]
       mov      r10d, 2
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x01]
       mov      r10d, 1
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    rax, byte  ptr [rax]
       xor      r8d, r8d
       cmp      eax, edx
       cmove    ecx, r8d
       mov      eax, ecx
						;; size=259 bbWeight=1 PerfScore 39.50

G_M56088_IG03:  ;; offset=0x0103
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 260

on main.

Note that if your code is inside a loop then it is expected that we do not emit conditional selects, see #96380 (comment).

@kg
Copy link
Member Author

kg commented Nov 19, 2024

It's due to a loop, thank you. We can close this if that limitation has a tracking issue.

@kg kg changed the title cmov not being used in hand-unrolled loop cmov not being used in nested hand-unrolled loop Nov 19, 2024
@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

3 participants