Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} #21073
@dotnet-bot test this please
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
Thanks for the work.
@tannergooding Does this PR look good to you?
LGTM
Do we have perf numbers for this?
@dotnet-bot test this please
Retest due to #21431
Perf numbers? (Just to make sure that there is nothing unexpected going on.)
7c3e116 to 050349f
Rebased
Double checking
Not sure, isn't as exciting as I was hoping for... checking asm
  Method | Categories | Position | Mean |
---------------------- |---------------------- |--------- |-----------:|
- ByteArray_IndexOf | ByteArray_IndexOf | 1 | 5.782 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 1 | 5.242 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 7 | 7.144 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 7 | 6.368 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 8 | 8.811 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 8 | 7.255 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 15 | 7.557 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 15 | 7.560 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 16 | 7.827 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 16 | 7.861 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 31 | 8.140 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 31 | 8.141 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 32 | 7.891 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 32 | 10.784 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 63 | 9.035 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 63 | 9.036 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 64 | 8.595 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 64 | 8.924 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 127 | 10.216 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 127 | 10.794 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 128 | 10.310 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 128 | 10.422 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 255 | 13.235 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 255 | 13.318 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 256 | 13.012 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 256 | 12.594 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 1023 | 29.954 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 1023 | 28.726 ns |
- ByteArray_IndexOf | ByteArray_IndexOf | 1024 | 29.424 ns |
+ ByteArray_IndexOf | ByteArray_IndexOf | 1024 | 27.564 ns |
| | | |
- CharArray_IndexOf | CharArray_IndexOf | 1 | 6.457 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 1 | 5.779 ns |
- CharArray_IndexOf | CharArray_IndexOf | 7 | 7.908 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 7 | 7.068 ns |
- CharArray_IndexOf | CharArray_IndexOf | 8 | 7.434 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 8 | 7.177 ns |
- CharArray_IndexOf | CharArray_IndexOf | 15 | 9.781 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 15 | 9.792 ns |
- CharArray_IndexOf | CharArray_IndexOf | 16 | 10.119 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 16 | 10.081 ns |
- CharArray_IndexOf | CharArray_IndexOf | 31 | 10.910 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 31 | 10.912 ns |
- CharArray_IndexOf | CharArray_IndexOf | 32 | 11.244 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 32 | 11.197 ns |
- CharArray_IndexOf | CharArray_IndexOf | 63 | 12.581 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 63 | 13.347 ns |
- CharArray_IndexOf | CharArray_IndexOf | 64 | 12.861 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 64 | 13.611 ns |
- CharArray_IndexOf | CharArray_IndexOf | 127 | 15.945 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 127 | 16.594 ns |
- CharArray_IndexOf | CharArray_IndexOf | 128 | 16.805 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 128 | 17.091 ns |
- CharArray_IndexOf | CharArray_IndexOf | 255 | 24.315 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 255 | 24.138 ns |
- CharArray_IndexOf | CharArray_IndexOf | 256 | 24.029 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 256 | 22.998 ns |
- CharArray_IndexOf | CharArray_IndexOf | 1023 | 75.582 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 1023 | 69.836 ns |
- CharArray_IndexOf | CharArray_IndexOf | 1024 | 75.272 ns |
+ CharArray_IndexOf | CharArray_IndexOf | 1024 | 69.949 ns |
| | | |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 1 | 83.692 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 1 | 74.153 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 7 | 77.963 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 7 | 71.059 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 8 | 70.367 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 8 | 70.079 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 15 | 68.637 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 15 | 68.673 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 16 | 77.761 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 16 | 77.630 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 31 | 73.634 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 31 | 73.625 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 32 | 76.019 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 32 | 76.100 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 63 | 73.442 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 63 | 78.184 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 64 | 75.912 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 64 | 79.362 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 127 | 71.361 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 127 | 75.242 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 128 | 79.741 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 128 | 79.585 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 255 | 72.470 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 255 | 72.968 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 256 | 75.462 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 256 | 72.413 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 1023 | 39.471 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 1023 | 38.190 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf | 1024 | 41.565 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf | 1024 | 38.972 ns |
| | | |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 1 | 134.611 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 1 | 125.288 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 7 | 141.464 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 7 | 129.250 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 8 | 124.869 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 8 | 122.678 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 15 | 128.994 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 15 | 126.963 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 16 | 127.706 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 16 | 127.286 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 31 | 126.493 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 31 | 126.232 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 32 | 126.623 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 32 | 126.695 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 63 | 124.672 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 63 | 129.997 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 64 | 124.969 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 64 | 129.491 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 127 | 127.052 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 127 | 126.525 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 128 | 129.054 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 128 | 125.883 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 255 | 121.892 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 255 | 121.677 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 256 | 121.790 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 256 | 115.150 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 1023 | 79.442 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 1023 | 74.192 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf | 1024 | 78.782 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf | 1024 | 74.625 ns |
i.e. was expecting more on LastIndexOf:
return 7 - (int)(Lzcnt.X64.LeadingZeroCount(match) >> 3);
vs coreclr/src/System.Private.CoreLib/shared/System/SpanHelpers.Byte.cs, lines 1122 to 1128 in d4cab6e
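As a rough illustration of the arithmetic in that expression: LZCNT counts leading zero bits, shifting right by 3 turns bits into whole bytes, and subtracting from 7 gives the index of the highest matching byte. The sketch below is not the PR's code; the class name and the `Main` driver are hypothetical, and the input masks are chosen on the assumption that exactly one byte matched.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

public static class LzcntIndexSketch
{
    // The expression quoted in the comment above, wrapped in a method.
    public static int LocateLastFoundByte(ulong match)
        => 7 - (int)(Lzcnt.X64.LeadingZeroCount(match) >> 3);

    public static void Main()
    {
        if (!Lzcnt.X64.IsSupported)
        {
            Console.WriteLine("64-bit LZCNT is not available on this machine.");
            return;
        }

        for (int i = 0; i < 8; i++)
        {
            // Hypothetical mask: only byte i matched (0xFF), every other byte is 0x00.
            ulong match = 0xFFUL << (i * 8);
            ulong zeros = Lzcnt.X64.LeadingZeroCount(match); // 56 - 8 * i leading zero bits
            Console.WriteLine($"byte {i}: lzcnt = {zeros}, index = {LocateLastFoundByte(match)}");
        }
    }
}
```

For each single-byte mask the computed index equals `i`, without the loop the non-intrinsic path needs.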
Still looks like an improvement on average.
{
match = match << 8;
index--;
return 7 - (int)(Lzcnt.X64.LeadingZeroCount(match) >> 3);
This yields a different result than the else case, e.g. LocateLastFoundByte((ulong)10) is 0 when Lzcnt.X64.IsSupported, while the !Lzcnt.X64.IsSupported (else) case returns -1. Is it known/intentional?
Aside from this, the result of Lzcnt.X64.LeadingZeroCount((ulong)10) is 60 (base 10), shouldn't it be 58 (base 10)?
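A minimal sketch of the divergence described in this comment. Two assumptions: the shift loop is reconstructed from the `match = match << 8; index--;` fragment in the review snippet rather than copied from the repository, and `BitOperations.LeadingZeroCount` stands in for `Lzcnt.X64.LeadingZeroCount` so the example runs on any hardware. The class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class DivergenceSketch
{
    // Leading-zero-count form; BitOperations is a portable stand-in for Lzcnt.X64.
    public static int ViaLeadingZeroCount(ulong match)
        => 7 - (BitOperations.LeadingZeroCount(match) >> 3);

    // Assumed shape of the else case: shift left one byte at a time
    // until the top bit is set or the value becomes zero.
    public static int ViaShiftLoop(ulong match)
    {
        int index = 7;
        while ((long)match > 0)
        {
            match <<= 8;
            index--;
        }
        return index;
    }

    public static void Main()
    {
        // 10 is 0b1010 and occupies 4 bits, so a 64-bit value has 64 - 4 = 60 leading zeros.
        Console.WriteLine(BitOperations.LeadingZeroCount(10UL)); // 60
        Console.WriteLine(ViaLeadingZeroCount(10UL));            // 0
        Console.WriteLine(ViaShiftLoop(10UL));                   // -1
    }
}
```

The loop only stops early when a byte's top bit reaches bit 63. For 10 that never happens, so it runs until the value is shifted out to zero and returns -1.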
10 isn't a value that can be passed to the method. The bytes in match can only be 0x00 or 0xff.
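To illustrate why such masks contain only 0x00 and 0xff bytes, here is a hedged sketch that assumes the mask comes from a vectorized equality comparison reinterpreted as 64-bit lanes. The helper names `FirstLaneMask` and `IsByteMask` are hypothetical and not part of the PR.

```csharp
using System;
using System.Numerics;

public static class MatchMaskSketch
{
    // Compares one vector's worth of data against `value` and returns the first
    // 64-bit lane of the result. `data` must hold at least Vector<byte>.Count bytes.
    public static ulong FirstLaneMask(byte[] data, byte value)
    {
        Vector<byte> matches = Vector.Equals(new Vector<byte>(data), new Vector<byte>(value));
        return Vector.AsVectorUInt64(matches)[0];
    }

    // True when every byte of `match` is either 0x00 or 0xFF.
    public static bool IsByteMask(ulong match)
    {
        for (int i = 0; i < 8; i++)
        {
            byte b = (byte)(match >> (i * 8));
            if (b != 0x00 && b != 0xFF) return false;
        }
        return true;
    }

    public static void Main()
    {
        byte[] data = new byte[Vector<byte>.Count];
        data[2] = 42;
        data[5] = 42;

        ulong match = FirstLaneMask(data, 42);
        Console.WriteLine($"0x{match:X16}");  // 0x0000FF0000FF0000 on a little-endian machine
        Console.WriteLine(IsByteMask(match)); // True
        Console.WriteLine(IsByteMask(10UL));  // False: 0x0A is neither 0x00 nor 0xFF
    }
}
```

Each element of an equality comparison is all ones or all zeros, so an input such as 10 falls outside what the method is designed to receive.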
A fitting conclusion to the Saga of Ben's Magic Number; the days of CPU intrinsics in C# are upon us 😃
*Further improved by #22118
Used by
With #21076 (merged) is additionally used by
With #21116 (merged) is additionally used by
From #20738 (comment)
Mostly whitespace changes https://github.com/dotnet/coreclr/pull/21073/files?w=1
/cc @fiigii @tannergooding @CarolEidt