cmov not being used in nested hand-unrolled loop #109940

kg · 2024-11-19T00:55:28Z

Description

I'm attempting to unroll a fixed-size scalar search loop so that it will generate a chain of cmovs. It's possible to get this to happen in C compiled by clang and msvc, but not C# on .NET 9. Instead, the best-case is a chain of movzx; cmp; jne; mov opcode clusters, i.e.

       movzx    r11, byte  ptr [rcx]
       cmp      r11d, r8d
       jne      SHORT G_M000_IG08
 
G_M000_IG07:                ;; offset=0x00C0
       xor      edx, edx
 
G_M000_IG08:                ;; offset=0x00C2
       movzx    r11, byte  ptr [rcx+0x01]
       cmp      r11d, r8d
       jne      SHORT G_M000_IG10
 
G_M000_IG09:                ;; offset=0x00CC
       mov      edx, 1

(For reference, this is the same method as #108092 but with the vectorization mostly disabled.)

Configuration

.NET 9
Windows 11 Pro build 22631
x64 (Ryzen 7950X)
Disasmo spew:

; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 10 single block inlinees; 22 inlinees without PGO data

Regression?

No, this didn't work on 8 either.

Reproduction code

Here's the simplest/easiest-to-fiddle-with version of the repro I can come up with that's 'real'. the C version in simdhash is similar and generates a chain of cmovs the last time I checked. The idea here is to fully pipeline the fixed-width (14 elements) search loop. There's no guarantee it would be faster (it was in C), but I can't actually measure it to see whether it's faster because I can't get it to generate the cmovs :)

Vector128<byte> bucket = ...;
byte needle = ...;
var haystack = (byte*)Unsafe.AsPointer(ref bucket);
var result = 32;
Lane(haystack, needle, ref result, 13);
Lane(haystack, needle, ref result, 12);
Lane(haystack, needle, ref result, 11);
Lane(haystack, needle, ref result, 10);
Lane(haystack, needle, ref result, 9);
Lane(haystack, needle, ref result, 8);
Lane(haystack, needle, ref result, 7);
Lane(haystack, needle, ref result, 6);
Lane(haystack, needle, ref result, 5);
Lane(haystack, needle, ref result, 4);
Lane(haystack, needle, ref result, 3);
Lane(haystack, needle, ref result, 2);
Lane(haystack, needle, ref result, 1);
Lane(haystack, needle, ref result, 0);
return result;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static void Lane (byte* haystack, byte needle, ref int result, int index) {
    if (haystack[index] == needle)
        result = index;
}

Additional notes

I've tried maybe a dozen different formulations/structures for this code and they all generate similar or worse code, none of them generate cmovs. I've verified that cmovs are generated in other parts of my code - this method even has one further up:

       movzx    r8, al
       mov      ecx, 255
       test     r8d, r8d
       cmovne   ecx, r8d

I've also tried reducing the number of Lane calls to see if that simplification would light up the cmov intrinsic but it didn't. I've also tried manually inlining Lane to no effect.

The text was updated successfully, but these errors were encountered:

dotnet-policy-service · 2024-11-19T00:56:03Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

jakobbotsch · 2024-11-19T09:14:41Z

Do you have a self-contained example? For

public static unsafe int Foo(Vector128<byte> bucket, byte needle)
{
    var haystack = (byte*)Unsafe.AsPointer(ref bucket);
    var result = 32;
    Lane(haystack, needle, ref result, 13);
    Lane(haystack, needle, ref result, 12);
    Lane(haystack, needle, ref result, 11);
    Lane(haystack, needle, ref result, 10);
    Lane(haystack, needle, ref result, 9);
    Lane(haystack, needle, ref result, 8);
    Lane(haystack, needle, ref result, 7);
    Lane(haystack, needle, ref result, 6);
    Lane(haystack, needle, ref result, 5);
    Lane(haystack, needle, ref result, 4);
    Lane(haystack, needle, ref result, 3);
    Lane(haystack, needle, ref result, 2);
    Lane(haystack, needle, ref result, 1);
    Lane(haystack, needle, ref result, 0);
    return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void Lane(byte* haystack, byte needle, ref int result, int index)
{
    if (haystack[index] == needle)
        result = index;
}

I get

; Method Program:Foo(System.Runtime.Intrinsics.Vector128`1[ubyte],ubyte):int (FullOpts)
G_M56088_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M56088_IG02:  ;; offset=0x0000
       mov      rax, rcx
       mov      ecx, 32
       movzx    r8, byte  ptr [rax+0x0D]
       movzx    rdx, dl
       mov      r10d, 13
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x0C]
       mov      r10d, 12
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x0B]
       mov      r10d, 11
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x0A]
       mov      r10d, 10
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x09]
       mov      r10d, 9
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x08]
       mov      r10d, 8
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x07]
       mov      r10d, 7
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x06]
       mov      r10d, 6
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x05]
       mov      r10d, 5
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x04]
       mov      r10d, 4
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x03]
       mov      r10d, 3
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x02]
       mov      r10d, 2
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    r8, byte  ptr [rax+0x01]
       mov      r10d, 1
       cmp      r8d, edx
       cmove    ecx, r10d
       movzx    rax, byte  ptr [rax]
       xor      r8d, r8d
       cmp      eax, edx
       cmove    ecx, r8d
       mov      eax, ecx
						;; size=259 bbWeight=1 PerfScore 39.50

G_M56088_IG03:  ;; offset=0x0103
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 260

on main.

Note that if your code is inside a loop then it is expected that we do not emit conditional selects, see #96380 (comment).

kg · 2024-11-19T18:33:40Z

It's due to a loop, thank you. We can close this if that limitation has a tracking issue.

kg added the tenet-performance Performance related issue label Nov 19, 2024

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 19, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Nov 19, 2024

kg changed the title ~~cmov not being used in hand-unrolled loop~~ cmov not being used in nested hand-unrolled loop Nov 19, 2024

JulieLeeMSFT closed this as completed Nov 21, 2024

dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Nov 21, 2024

github-actions bot locked and limited conversation to collaborators Dec 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmov not being used in nested hand-unrolled loop #109940

cmov not being used in nested hand-unrolled loop #109940

kg commented Nov 19, 2024

dotnet-policy-service bot commented Nov 19, 2024

jakobbotsch commented Nov 19, 2024 •

edited

Loading

kg commented Nov 19, 2024

cmov not being used in nested hand-unrolled loop #109940

cmov not being used in nested hand-unrolled loop #109940

Comments

kg commented Nov 19, 2024

Description

Configuration

Regression?

Reproduction code

Additional notes

dotnet-policy-service bot commented Nov 19, 2024

jakobbotsch commented Nov 19, 2024 • edited Loading

kg commented Nov 19, 2024

jakobbotsch commented Nov 19, 2024 •

edited

Loading