-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmov not being used in nested hand-unrolled loop #109940
Labels
area-CodeGen-coreclr
CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
tenet-performance
Performance related issue
Comments
dotnet-issue-labeler
bot
added
the
area-CodeGen-coreclr
CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
label
Nov 19, 2024
dotnet-policy-service
bot
added
the
untriaged
New issue has not been triaged by the area owner
label
Nov 19, 2024
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
Do you have a self-contained example? For public static unsafe int Foo(Vector128<byte> bucket, byte needle)
{
var haystack = (byte*)Unsafe.AsPointer(ref bucket);
var result = 32;
Lane(haystack, needle, ref result, 13);
Lane(haystack, needle, ref result, 12);
Lane(haystack, needle, ref result, 11);
Lane(haystack, needle, ref result, 10);
Lane(haystack, needle, ref result, 9);
Lane(haystack, needle, ref result, 8);
Lane(haystack, needle, ref result, 7);
Lane(haystack, needle, ref result, 6);
Lane(haystack, needle, ref result, 5);
Lane(haystack, needle, ref result, 4);
Lane(haystack, needle, ref result, 3);
Lane(haystack, needle, ref result, 2);
Lane(haystack, needle, ref result, 1);
Lane(haystack, needle, ref result, 0);
return result;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void Lane(byte* haystack, byte needle, ref int result, int index)
{
if (haystack[index] == needle)
result = index;
} I get ; Method Program:Foo(System.Runtime.Intrinsics.Vector128`1[ubyte],ubyte):int (FullOpts)
G_M56088_IG01: ;; offset=0x0000
;; size=0 bbWeight=1 PerfScore 0.00
G_M56088_IG02: ;; offset=0x0000
mov rax, rcx
mov ecx, 32
movzx r8, byte ptr [rax+0x0D]
movzx rdx, dl
mov r10d, 13
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x0C]
mov r10d, 12
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x0B]
mov r10d, 11
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x0A]
mov r10d, 10
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x09]
mov r10d, 9
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x08]
mov r10d, 8
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x07]
mov r10d, 7
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x06]
mov r10d, 6
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x05]
mov r10d, 5
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x04]
mov r10d, 4
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x03]
mov r10d, 3
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x02]
mov r10d, 2
cmp r8d, edx
cmove ecx, r10d
movzx r8, byte ptr [rax+0x01]
mov r10d, 1
cmp r8d, edx
cmove ecx, r10d
movzx rax, byte ptr [rax]
xor r8d, r8d
cmp eax, edx
cmove ecx, r8d
mov eax, ecx
;; size=259 bbWeight=1 PerfScore 39.50
G_M56088_IG03: ;; offset=0x0103
ret
;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 260 on main. Note that if your code is inside a loop then it is expected that we do not emit conditional selects, see #96380 (comment). |
It's due to a loop, thank you. We can close this if that limitation has a tracking issue. |
kg
changed the title
cmov not being used in hand-unrolled loop
cmov not being used in nested hand-unrolled loop
Nov 19, 2024
dotnet-policy-service
bot
removed
the
untriaged
New issue has not been triaged by the area owner
label
Nov 21, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area-CodeGen-coreclr
CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
tenet-performance
Performance related issue
Description
I'm attempting to unroll a fixed-size scalar search loop so that it will generate a chain of cmovs. It's possible to get this to happen in C compiled by clang and msvc, but not C# on .NET 9. Instead, the best-case is a chain of
movzx; cmp; jne; mov
opcode clusters, i.e.(For reference, this is the same method as #108092 but with the vectorization mostly disabled.)
Configuration
.NET 9
Windows 11 Pro build 22631
x64 (Ryzen 7950X)
Disasmo spew:
Regression?
No, this didn't work on 8 either.
Reproduction code
Here's the simplest/easiest-to-fiddle-with version of the repro I can come up with that's 'real'. the C version in simdhash is similar and generates a chain of cmovs the last time I checked. The idea here is to fully pipeline the fixed-width (14 elements) search loop. There's no guarantee it would be faster (it was in C), but I can't actually measure it to see whether it's faster because I can't get it to generate the cmovs :)
Additional notes
I've tried maybe a dozen different formulations/structures for this code and they all generate similar or worse code, none of them generate cmovs. I've verified that cmovs are generated in other parts of my code - this method even has one further up:
I've also tried reducing the number of Lane calls to see if that simplification would light up the cmov intrinsic but it didn't. I've also tried manually inlining Lane to no effect.
The text was updated successfully, but these errors were encountered: