Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} #21073

Merged
merged 4 commits into from Dec 24, 2018

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Nov 17, 2018

A fitting conclusion to the Saga of Ben's Magic Number; the days of CPU intrinsics in C# are upon us 馃槂

*Further improved by #22118

Used by

string.IndexOf(char value)
string.IndexOf(char value, int startIndex)
string.IndexOf(char value, int startIndex, int count)

string.IndexOfAny(char[] anyOf) // anyOf.lengths 2,3,4,5
string.IndexOfAny(char[] anyOf, int startIndex) // anyOf.lengths 2,3,4,5
string.IndexOfAny(char[] anyOf, int startIndex, int count) // anyOf.lengths 2,3,4,5

// where T : byte, char
IndexOf<T>(this Span<T> span, T value)
IndexOf<T>(this Span<T> span, ReadOnlySpan<T> value)
IndexOf<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value)
IndexOfAny<T>(this Span<T> span, T value0, T value1)
IndexOfAny<T>(this Span<T> span, T value0, T value1, T value2)
IndexOfAny<T>(this Span<T> span, ReadOnlySpan<T> values)
IndexOfAny<T>(this ReadOnlySpan<T> span, T value)
IndexOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1)
IndexOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1, T value2)
IndexOfAny<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> values)

string.LastIndexOf(char value)
string.LastIndexOf(char value, int startIndex)
string.LastIndexOf(char value, int startIndex, int count)

//  where T : byte, char
LastIndexOf<T>(this Span<T> span, T value)
LastIndexOf<T>(this Span<T> span, ReadOnlySpan<T> value)
LastIndexOf<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value)
LastIndexOfAny<T>(this Span<T> span, T value0, T value1)
LastIndexOfAny<T>(this Span<T> span, T value0, T value1, T value2)
LastIndexOfAny<T>(this Span<T> span, ReadOnlySpan<T> values)
LastIndexOfAny<T>(this ReadOnlySpan<T> span, T value)
LastIndexOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1)
LastIndexOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1, T value2)
LastIndexOfAny<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> values)

With #21076 (merged) is additionally used by

// where comparisonType == StringComparison.Ordinal 
string.IndexOf(string value, StringComparison comparisonType)
string.IndexOf(string value, int startIndex, StringComparison comparisonType)
string.IndexOf(string value, int startIndex, int count, StringComparison comparisonType)

// where T : char
IndexOf<T>(this Span<T> span, ReadOnlySpan<T> value)
IndexOf<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value)

// where comparisonType == StringComparison.Ordinal
IndexOf(this ReadOnlySpan<char> span, ReadOnlySpan<char> value, StringComparison comparisonType)

// where options = CompareOptions.Ordinal 
CompareInfo.IndexOf(string source, string value, CompareOptions options)
CompareInfo.IndexOf(string source, string value, int startIndex, CompareOptions options)
CompareInfo.IndexOf(string source, string value, int startIndex, int count, CompareOptions options)

With #21116 (merged) is additionally used by

// where T : byte, char
Array.IndexOf<T>(T[] array, T value)
Array.IndexOf<T>(T[] array, T value, int startIndex)
Array.IndexOf<T>(T[] array, T value, int startIndex, int count)
Array.LastIndexOf<T>(T[] array, T value)
Array.LastIndexOf<T>(T[] array, T value, int startIndex)
Array.LastIndexOf<T>(T[] array, T value, int startIndex, int count)

From #20738 (comment)

Mostly whitespace changes https://github.com/dotnet/coreclr/pull/21073/files?w=1

/cc @fiigii @tannergooding @CarolEidt

@benaadams benaadams changed the title Use Bmi1.TrailingZeroCount for LocateFirstFoundXxx Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} Nov 18, 2018
@benaadams benaadams changed the title Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} [WIP] Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} Nov 18, 2018
@fiigii
Copy link

fiigii commented Dec 3, 2018

@dotnet-bot test this please

@benaadams benaadams changed the title [WIP] Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} Dec 3, 2018
@fiigii
Copy link

fiigii commented Dec 3, 2018

@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx

Copy link

@fiigii fiigii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work.

@fiigii
Copy link

fiigii commented Dec 7, 2018

@tannergooding Does this PR look good to you?

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tannergooding
Copy link
Member

Do we have perf numbers for this?

@fiigii
Copy link

fiigii commented Dec 7, 2018

@dotnet-bot test this please

@fiigii
Copy link

fiigii commented Dec 7, 2018

Retest due to #21431

@jkotas
Copy link
Member

jkotas commented Dec 23, 2018

Perf numbers? (Just to make sure that there is nothing unexpected going on.)

@benaadams
Copy link
Member Author

Rebased

Perf numbers?

Double checking

@benaadams
Copy link
Member Author

Not sure, isn't as exciting as I was hoping for... checking asm

                 Method |            Categories | Position |       Mean |
 ---------------------- |---------------------- |--------- |-----------:|
-     ByteArray_IndexOf |     ByteArray_IndexOf |        1 |   5.782 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |        1 |   5.242 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |        7 |   7.144 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |        7 |   6.368 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |        8 |   8.811 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |        8 |   7.255 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       15 |   7.557 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       15 |   7.560 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       16 |   7.827 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       16 |   7.861 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       31 |   8.140 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       31 |   8.141 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       32 |   7.891 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       32 |  10.784 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       63 |   9.035 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       63 |   9.036 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       64 |   8.595 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       64 |   8.924 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |      127 |  10.216 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |      127 |  10.794 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |      128 |  10.310 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |      128 |  10.422 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |      255 |  13.235 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |      255 |  13.318 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |      256 |  13.012 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |      256 |  12.594 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |     1023 |  29.954 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |     1023 |  28.726 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |     1024 |  29.424 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |     1024 |  27.564 ns |
                        |                       |          |            |
-     CharArray_IndexOf |     CharArray_IndexOf |        1 |   6.457 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |        1 |   5.779 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |        7 |   7.908 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |        7 |   7.068 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |        8 |   7.434 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |        8 |   7.177 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       15 |   9.781 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       15 |   9.792 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       16 |  10.119 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       16 |  10.081 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       31 |  10.910 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       31 |  10.912 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       32 |  11.244 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       32 |  11.197 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       63 |  12.581 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       63 |  13.347 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       64 |  12.861 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       64 |  13.611 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |      127 |  15.945 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |      127 |  16.594 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |      128 |  16.805 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |      128 |  17.091 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |      255 |  24.315 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |      255 |  24.138 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |      256 |  24.029 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |      256 |  22.998 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |     1023 |  75.582 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |     1023 |  69.836 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |     1024 |  75.272 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |     1024 |  69.949 ns |
                        |                       |          |            |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |        1 |  83.692 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |        1 |  74.153 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |        7 |  77.963 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |        7 |  71.059 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |        8 |  70.367 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |        8 |  70.079 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       15 |  68.637 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       15 |  68.673 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       16 |  77.761 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       16 |  77.630 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       31 |  73.634 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       31 |  73.625 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       32 |  76.019 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       32 |  76.100 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       63 |  73.442 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       63 |  78.184 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       64 |  75.912 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       64 |  79.362 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |      127 |  71.361 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |      127 |  75.242 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |      128 |  79.741 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |      128 |  79.585 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |      255 |  72.470 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |      255 |  72.968 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |      256 |  75.462 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |      256 |  72.413 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |     1023 |  39.471 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |     1023 |  38.190 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |     1024 |  41.565 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |     1024 |  38.972 ns |
                        |                       |          |            |
- CharArray_LastIndexOf | CharArray_LastIndexOf |        1 | 134.611 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |        1 | 125.288 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |        7 | 141.464 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |        7 | 129.250 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |        8 | 124.869 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |        8 | 122.678 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       15 | 128.994 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       15 | 126.963 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       16 | 127.706 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       16 | 127.286 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       31 | 126.493 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       31 | 126.232 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       32 | 126.623 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       32 | 126.695 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       63 | 124.672 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       63 | 129.997 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       64 | 124.969 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       64 | 129.491 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |      127 | 127.052 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |      127 | 126.525 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |      128 | 129.054 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |      128 | 125.883 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |      255 | 121.892 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |      255 | 121.677 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |      256 | 121.790 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |      256 | 115.150 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |     1023 |  79.442 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |     1023 |  74.192 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |     1024 |  78.782 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |     1024 |  74.625 ns |

@benaadams
Copy link
Member Author

i.e. was expecting more on LastIndexOf

return 7 - (int)(Lzcnt.X64.LeadingZeroCount(match) >> 3);

vs

int index = 7;
while ((long)match > 0)
{
match = match << 8;
index--;
}
return index;

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks like an improvement on average.

@jkotas jkotas merged commit 0742765 into dotnet:master Dec 24, 2018
{
match = match << 8;
index--;
return 7 - (int)(Lzcnt.X64.LeadingZeroCount(match) >> 3);
Copy link
Member

@am11 am11 Dec 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This yields a different result than the else case, e.g. LocateLastFoundByte((ulong)10) // when Lzcnt.X64.IsSupported is 0 and !Lzcnt.X64.IsSupported (else case) returns -1. Is it known/intentional?
Aside from this, the result of Lzcnt.X64.LeadingZeroCount((ulong)10) is 6010, shouldn't it be 5810?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 isn't a value that can be passed to the method. The bytes in match can only be 0x00 or 0xff

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
5 participants