New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} #21073

Merged
merged 4 commits into from Dec 24, 2018

Conversation

Projects
None yet
5 participants
@benaadams
Copy link
Collaborator

benaadams commented Nov 17, 2018

A fitting conclusion to the Saga of Ben's Magic Number; the days of CPU intrinsics in C# are upon us 馃槂

*Further improved by #22118

Used by

string.IndexOf(char value)
string.IndexOf(char value, int startIndex)
string.IndexOf(char value, int startIndex, int count)

string.IndexOfAny(char[] anyOf) // anyOf.lengths 2,3,4,5
string.IndexOfAny(char[] anyOf, int startIndex) // anyOf.lengths 2,3,4,5
string.IndexOfAny(char[] anyOf, int startIndex, int count) // anyOf.lengths 2,3,4,5

// where T : byte, char
IndexOf<T>(this Span<T> span, T value)
IndexOf<T>(this Span<T> span, ReadOnlySpan<T> value)
IndexOf<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value)
IndexOfAny<T>(this Span<T> span, T value0, T value1)
IndexOfAny<T>(this Span<T> span, T value0, T value1, T value2)
IndexOfAny<T>(this Span<T> span, ReadOnlySpan<T> values)
IndexOfAny<T>(this ReadOnlySpan<T> span, T value)
IndexOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1)
IndexOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1, T value2)
IndexOfAny<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> values)

string.LastIndexOf(char value)
string.LastIndexOf(char value, int startIndex)
string.LastIndexOf(char value, int startIndex, int count)

//  where T : byte, char
LastIndexOf<T>(this Span<T> span, T value)
LastIndexOf<T>(this Span<T> span, ReadOnlySpan<T> value)
LastIndexOf<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value)
LastIndexOfAny<T>(this Span<T> span, T value0, T value1)
LastIndexOfAny<T>(this Span<T> span, T value0, T value1, T value2)
LastIndexOfAny<T>(this Span<T> span, ReadOnlySpan<T> values)
LastIndexOfAny<T>(this ReadOnlySpan<T> span, T value)
LastIndexOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1)
LastIndexOfAny<T>(this ReadOnlySpan<T> span, T value0, T value1, T value2)
LastIndexOfAny<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> values)

With #21076 (merged) is additionally used by

// where comparisonType == StringComparison.Ordinal 
string.IndexOf(string value, StringComparison comparisonType)
string.IndexOf(string value, int startIndex, StringComparison comparisonType)
string.IndexOf(string value, int startIndex, int count, StringComparison comparisonType)

// where T : char
IndexOf<T>(this Span<T> span, ReadOnlySpan<T> value)
IndexOf<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value)

// where comparisonType == StringComparison.Ordinal
IndexOf(this ReadOnlySpan<char> span, ReadOnlySpan<char> value, StringComparison comparisonType)

// where options = CompareOptions.Ordinal 
CompareInfo.IndexOf(string source, string value, CompareOptions options)
CompareInfo.IndexOf(string source, string value, int startIndex, CompareOptions options)
CompareInfo.IndexOf(string source, string value, int startIndex, int count, CompareOptions options)

With #21116 (merged) is additionally used by

// where T : byte, char
Array.IndexOf<T>(T[] array, T value)
Array.IndexOf<T>(T[] array, T value, int startIndex)
Array.IndexOf<T>(T[] array, T value, int startIndex, int count)
Array.LastIndexOf<T>(T[] array, T value)
Array.LastIndexOf<T>(T[] array, T value, int startIndex)
Array.LastIndexOf<T>(T[] array, T value, int startIndex, int count)

From #20738 (comment)

Mostly whitespace changes https://github.com/dotnet/coreclr/pull/21073/files?w=1

/cc @fiigii @tannergooding @CarolEidt

@benaadams benaadams changed the title Use Bmi1.TrailingZeroCount for LocateFirstFoundXxx Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} Nov 18, 2018

@benaadams benaadams changed the title Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} [WIP] Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} Nov 18, 2018

@fiigii

This comment has been minimized.

Copy link
Collaborator

fiigii commented Dec 3, 2018

@dotnet-bot test this please

@benaadams benaadams changed the title [WIP] Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} Use TZCNT and LZCNT for Locate{First|Last}Found{Byte|Char} Dec 3, 2018

@fiigii

This comment has been minimized.

Copy link
Collaborator

fiigii commented Dec 3, 2018

@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx

@fiigii

fiigii approved these changes Dec 4, 2018

Copy link
Collaborator

fiigii left a comment

Thanks for the work.

@fiigii

This comment has been minimized.

Copy link
Collaborator

fiigii commented Dec 7, 2018

@tannergooding Does this PR look good to you?

@tannergooding
Copy link
Member

tannergooding left a comment

LGTM

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Dec 7, 2018

Do we have perf numbers for this?

@fiigii

This comment has been minimized.

Copy link
Collaborator

fiigii commented Dec 7, 2018

@dotnet-bot test this please

@fiigii

This comment has been minimized.

Copy link
Collaborator

fiigii commented Dec 7, 2018

Retest due to #21431

@jkotas

This comment has been minimized.

Copy link
Member

jkotas commented Dec 23, 2018

Perf numbers? (Just to make sure that there is nothing unexpected going on.)

@benaadams benaadams force-pushed the benaadams:Bmi1.TrailingZeroCount branch from 7c3e116 to 050349f Dec 23, 2018

@benaadams

This comment has been minimized.

Copy link
Collaborator Author

benaadams commented Dec 23, 2018

Rebased

Perf numbers?

Double checking

@benaadams

This comment has been minimized.

Copy link
Collaborator Author

benaadams commented Dec 23, 2018

Not sure, isn't as exciting as I was hoping for... checking asm

                 Method |            Categories | Position |       Mean |
 ---------------------- |---------------------- |--------- |-----------:|
-     ByteArray_IndexOf |     ByteArray_IndexOf |        1 |   5.782 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |        1 |   5.242 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |        7 |   7.144 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |        7 |   6.368 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |        8 |   8.811 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |        8 |   7.255 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       15 |   7.557 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       15 |   7.560 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       16 |   7.827 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       16 |   7.861 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       31 |   8.140 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       31 |   8.141 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       32 |   7.891 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       32 |  10.784 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       63 |   9.035 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       63 |   9.036 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |       64 |   8.595 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |       64 |   8.924 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |      127 |  10.216 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |      127 |  10.794 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |      128 |  10.310 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |      128 |  10.422 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |      255 |  13.235 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |      255 |  13.318 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |      256 |  13.012 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |      256 |  12.594 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |     1023 |  29.954 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |     1023 |  28.726 ns |
-     ByteArray_IndexOf |     ByteArray_IndexOf |     1024 |  29.424 ns |
+     ByteArray_IndexOf |     ByteArray_IndexOf |     1024 |  27.564 ns |
                        |                       |          |            |
-     CharArray_IndexOf |     CharArray_IndexOf |        1 |   6.457 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |        1 |   5.779 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |        7 |   7.908 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |        7 |   7.068 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |        8 |   7.434 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |        8 |   7.177 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       15 |   9.781 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       15 |   9.792 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       16 |  10.119 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       16 |  10.081 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       31 |  10.910 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       31 |  10.912 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       32 |  11.244 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       32 |  11.197 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       63 |  12.581 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       63 |  13.347 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |       64 |  12.861 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |       64 |  13.611 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |      127 |  15.945 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |      127 |  16.594 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |      128 |  16.805 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |      128 |  17.091 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |      255 |  24.315 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |      255 |  24.138 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |      256 |  24.029 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |      256 |  22.998 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |     1023 |  75.582 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |     1023 |  69.836 ns |
-     CharArray_IndexOf |     CharArray_IndexOf |     1024 |  75.272 ns |
+     CharArray_IndexOf |     CharArray_IndexOf |     1024 |  69.949 ns |
                        |                       |          |            |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |        1 |  83.692 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |        1 |  74.153 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |        7 |  77.963 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |        7 |  71.059 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |        8 |  70.367 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |        8 |  70.079 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       15 |  68.637 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       15 |  68.673 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       16 |  77.761 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       16 |  77.630 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       31 |  73.634 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       31 |  73.625 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       32 |  76.019 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       32 |  76.100 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       63 |  73.442 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       63 |  78.184 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |       64 |  75.912 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |       64 |  79.362 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |      127 |  71.361 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |      127 |  75.242 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |      128 |  79.741 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |      128 |  79.585 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |      255 |  72.470 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |      255 |  72.968 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |      256 |  75.462 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |      256 |  72.413 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |     1023 |  39.471 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |     1023 |  38.190 ns |
- ByteArray_LastIndexOf | ByteArray_LastIndexOf |     1024 |  41.565 ns |
+ ByteArray_LastIndexOf | ByteArray_LastIndexOf |     1024 |  38.972 ns |
                        |                       |          |            |
- CharArray_LastIndexOf | CharArray_LastIndexOf |        1 | 134.611 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |        1 | 125.288 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |        7 | 141.464 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |        7 | 129.250 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |        8 | 124.869 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |        8 | 122.678 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       15 | 128.994 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       15 | 126.963 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       16 | 127.706 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       16 | 127.286 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       31 | 126.493 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       31 | 126.232 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       32 | 126.623 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       32 | 126.695 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       63 | 124.672 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       63 | 129.997 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |       64 | 124.969 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |       64 | 129.491 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |      127 | 127.052 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |      127 | 126.525 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |      128 | 129.054 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |      128 | 125.883 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |      255 | 121.892 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |      255 | 121.677 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |      256 | 121.790 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |      256 | 115.150 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |     1023 |  79.442 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |     1023 |  74.192 ns |
- CharArray_LastIndexOf | CharArray_LastIndexOf |     1024 |  78.782 ns |
+ CharArray_LastIndexOf | CharArray_LastIndexOf |     1024 |  74.625 ns |
@benaadams

This comment has been minimized.

Copy link
Collaborator Author

benaadams commented Dec 23, 2018

i.e. was expecting more on LastIndexOf

return 7 - (int)(Lzcnt.X64.LeadingZeroCount(match) >> 3);

vs

int index = 7;
while ((long)match > 0)
{
match = match << 8;
index--;
}
return index;

@jkotas

jkotas approved these changes Dec 24, 2018

Copy link
Member

jkotas left a comment

Still looks like an improvement on average.

@jkotas jkotas merged commit 0742765 into dotnet:master Dec 24, 2018

30 checks passed

CentOS7.1 x64 Checked Innerloop Build and Test Build finished.
Details
CentOS7.1 x64 Debug Innerloop Build Build finished.
Details
Linux-musl x64 Debug Build Build finished.
Details
OSX10.12 x64 Checked Innerloop Build and Test Build finished.
Details
Tizen armel Cross Checked Innerloop Build and Test Build finished.
Details
Ubuntu arm Cross Checked Innerloop Build and Test Build finished.
Details
Ubuntu arm Cross Checked crossgen_comparison Build and Test Build finished.
Details
Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test Build finished.
Details
Ubuntu arm Cross Release crossgen_comparison Build and Test Build finished.
Details
Ubuntu x64 Checked CoreFX Tests Build finished.
Details
Ubuntu x64 Checked Innerloop Build and Test Build finished.
Details
Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0) Build finished.
Details
Ubuntu x64 Formatting Build finished.
Details
Ubuntu16.04 arm64 Cross Checked Innerloop Build and Test Build finished.
Details
Ubuntu16.04 arm64 Cross Checked no_tiered_compilation_innerloop Build and Test Build finished.
Details
Windows_NT arm Cross Debug Innerloop Build Build finished.
Details
Windows_NT arm64 Cross Debug Innerloop Build Build finished.
Details
Windows_NT x64 Checked CoreFX Tests Build finished.
Details
Windows_NT x64 Checked Innerloop Build and Test Build finished.
Details
Windows_NT x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0) Build finished.
Details
Windows_NT x64 Formatting Build finished.
Details
Windows_NT x64 Release CoreFX Tests Build finished.
Details
Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details
Windows_NT x64 min_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details
Windows_NT x86 Checked Innerloop Build and Test Build finished.
Details
Windows_NT x86 Checked Innerloop Build and Test (Jit - TieredCompilation=0) Build finished.
Details
Windows_NT x86 Release Innerloop Build and Test Build finished.
Details
Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details
Windows_NT x86 min_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details
license/cla All CLA requirements met.
Details
{
match = match << 8;
index--;
return 7 - (int)(Lzcnt.X64.LeadingZeroCount(match) >> 3);

This comment has been minimized.

@am11

am11 Dec 24, 2018

Contributor

This yields a different result than the else case, e.g. LocateLastFoundByte((ulong)10) // when Lzcnt.X64.IsSupported is 0 and !Lzcnt.X64.IsSupported (else case) returns -1. Is it known/intentional?
Aside from this, the result of Lzcnt.X64.LeadingZeroCount((ulong)10) is 6010, shouldn't it be 5810?

This comment has been minimized.

@benaadams

benaadams Dec 24, 2018

Author Collaborator

10 isn't a value that can be passed to the method. The bytes in match can only be 0x00 or 0xff

@benaadams benaadams deleted the benaadams:Bmi1.TrailingZeroCount branch Dec 24, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment