Vectorize String.IndexOf(char) and String.LastIndexOf(char) #16392

eerhardt · 2018-02-14T19:58:28Z

Vectorize String.IndexOf(char) using the same algorithm as SpanHelpers.IndexOf(byte).

I also plan on doing the same for String.LastIndexOf, which is why I marked this as WIP. I wanted to get early feedback on the approach to ensure continuing this work is worth the effort.

Perf results

I ran the following tests: Test Code

Machine 1 (windows desktop):

BenchmarkDotNet=v0.10.12.20180214-develop, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.248)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=3328125 Hz, Resolution=300.4695 ns, Timer=TSC

Without changes

Method	Mean	Error	StdDev
IndexOfShort_Match	12.11 ns	0.0951 ns	0.0843 ns
IndexOfShort_Miss	16.84 ns	0.2507 ns	0.2345 ns
IndexOfMedium_Match	28.28 ns	0.4762 ns	0.4454 ns
IndexOfMedium_Miss	36.80 ns	0.7361 ns	0.6885 ns
IndexOfLong_Match	101.22 ns	0.7716 ns	0.6443 ns
IndexOfLong_Miss	160.88 ns	2.4450 ns	2.2870 ns

With changes

Method	Mean	Error	StdDev
IndexOfShort_Match	13.95 ns	0.3054 ns	0.3635 ns
IndexOfShort_Miss	16.95 ns	0.3310 ns	0.3096 ns
IndexOfMedium_Match	27.02 ns	0.3313 ns	0.3099 ns
IndexOfMedium_Miss	33.84 ns	0.4833 ns	0.4284 ns
IndexOfLong_Match	51.26 ns	0.3604 ns	0.3372 ns
IndexOfLong_Miss	67.85 ns	1.2266 ns	1.1473 ns

Machine 2 (MacBook Air ~2012):

BenchmarkDotNet=v0.10.12, OS=macOS 10.13.2 (17C88) [Darwin 17.3.0]
Intel Core i7-4650U CPU 1.70GHz (Haswell), 1 CPU, 4 logical cores and 2 physical cores
.NET Core SDK=2.1.300-preview2-008171

Without changes

Method	Mean	Error	StdDev
IndexOfShort_Match	20.23 ns	0.6210 ns	1.772 ns
IndexOfShort_Miss	25.25 ns	0.5338 ns	1.054 ns
IndexOfMedium_Match	36.47 ns	0.7497 ns	1.122 ns
IndexOfMedium_Miss	55.95 ns	1.9238 ns	5.489 ns
IndexOfLong_Match	137.01 ns	1.6394 ns	1.453 ns
IndexOfLong_Miss	217.41 ns	2.4305 ns	2.273 ns

With changes

Method	Mean	Error	StdDev	Median
IndexOfShort_Match	24.32 ns	0.6296 ns	1.8466 ns	24.11 ns
IndexOfShort_Miss	28.72 ns	1.0847 ns	3.1811 ns	28.39 ns
IndexOfMedium_Match	45.40 ns	4.3182 ns	12.7323 ns	40.48 ns
IndexOfMedium_Miss	43.97 ns	0.8260 ns	0.7726 ns	43.72 ns
IndexOfLong_Match	64.66 ns	0.8402 ns	0.7859 ns	64.64 ns
IndexOfLong_Miss	87.54 ns	1.7383 ns	1.7851 ns	87.19 ns

Machine 3 (Lenovo Carbon X1):

BenchmarkDotNet=v0.10.12, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.125)
Intel Core i7-3667U CPU 2.00GHz (Ivy Bridge), 1 CPU, 4 logical cores and 2 physical cores
Frequency=2435863 Hz, Resolution=410.5321 ns, Timer=TSC
.NET Core SDK=2.1.300-preview2-008171

Without changes

Method	Mean	Error	StdDev
IndexOfShort_Match	19.66 ns	0.4376 ns	0.6550 ns
IndexOfShort_Miss	26.04 ns	0.4768 ns	0.4226 ns
IndexOfMedium_Match	48.00 ns	0.9828 ns	1.6420 ns
IndexOfMedium_Miss	78.24 ns	1.5603 ns	1.9733 ns
IndexOfLong_Match	205.15 ns	0.9204 ns	0.8610 ns
IndexOfLong_Miss	353.07 ns	1.2634 ns	1.1200 ns

With changes

Method	Mean	Error	StdDev
IndexOfShort_Match	22.20 ns	0.5248 ns	0.8014 ns
IndexOfShort_Miss	29.87 ns	0.7094 ns	2.0238 ns
IndexOfMedium_Match	42.36 ns	0.8687 ns	0.8532 ns
IndexOfMedium_Miss	56.24 ns	1.2761 ns	1.7889 ns
IndexOfLong_Match	119.03 ns	2.4004 ns	4.5086 ns
IndexOfLong_Miss	189.77 ns	4.3292 ns	12.6285 ns

Machine 4 (Tanner's Q6600):

BenchmarkDotNet=v0.10.12, OS=ubuntu 17.10
Intel Core2 Quad CPU Q6600 2.40GHz, 1 CPU, 4 logical cores and 4 physical cores
.NET Core SDK=2.1.300-preview2-008173

Without changes

Method	Mean	Error	StdDev
IndexOfShort_Match	34.62 ns	0.0114 ns	0.0107 ns
IndexOfShort_Miss	39.32 ns	0.0010 ns	0.0008 ns
IndexOfMedium_Match	76.70 ns	0.0017 ns	0.0016 ns
IndexOfMedium_Miss	102.51 ns	0.0052 ns	0.0046 ns
IndexOfLong_Match	298.30 ns	0.0090 ns	0.0075 ns
IndexOfLong_Miss	461.88 ns	0.0238 ns	0.0222 ns

With changes

Method	Mean	Error	StdDev
IndexOfShort_Match	35.74 ns	0.0054 ns	0.0048 ns
IndexOfShort_Miss	46.14 ns	0.0023 ns	0.0021 ns
IndexOfMedium_Match	89.52 ns	0.0122 ns	0.0108 ns
IndexOfMedium_Miss	118.06 ns	0.0052 ns	0.0043 ns
IndexOfLong_Match	323.21 ns	1.0655 ns	0.9445 ns
IndexOfLong_Miss	395.20 ns	0.0050 ns	0.0045 ns

As you can see, for short strings, there is a little degradation, especially when the match is towards the beginning of the string. But for longer strings, where the match is towards the end or doesn't match at all, the gains are substantial.

/cc @benaadams

tannergooding · 2018-02-14T20:03:59Z

It would be good to get additional perf numbers on hardware without VEX support and potentially on hardware where an unaligned read/write is not as fast as an aligned read/write.

jkotas · 2018-02-14T20:52:14Z

As you can see, for short strings, there is a little degradation

It would be useful to get some data about the distribution of inputs for IndexOf (e.g. dotnet build or ASP.NET MusicStore are easy enough) to prove that this degradation pays for itself.

jkotas · 2018-02-14T20:54:02Z

src/mscorlib/shared/System/String.Searching.cs


-                while (count >= 4)
+                int nLength = count;
+                bool useVectorization = false;


Is there a good reason why this is using this bool variable? https://github.com/dotnet/corefx/blob/master/src/System.Memory/src/System/SpanHelpers.byte.cs#L96 does not use it. The bool variable does not look like an improvement.

My reasoning was so it wouldn't have to do the index calculation and checks when we aren't using the vectorized approach (ex. when the string/count is too short).

https://github.com/dotnet/coreclr/pull/16392/files#diff-f7e6389a6519754f04d20215ff42efcdR126

The SpanHelpers approach uses a ref byte searchSpace and keeps track of the index. Here I kept the same char* pointer approach. So to get the index, I either need to calculate it, or also keep track of the index as we go.

Doing this saved ~1ns in the IndexOfShort_Miss case on my machine.

I can remove it, if you don't think it is necessary.

There are two long-lived variables: pStartCh and useVectorization. They will burn two registers or two stack spill slots. I think it may be more efficient if you replace them with just one pEndCh.

Do you need both pCh and index?

ahsonkhan · 2018-02-15T01:08:56Z

We will soon have the span extension methods, like IndexOf (and LastIndexOf) in coreclr (https://github.com/dotnet/corefx/issues/25182).

Does it make sense to keep the vectorization implementation in one place only and have the String APIs call the Span ones or is there a significant cost here to merit duplicating the implementation?

Granted, the span indexof only special-cases T=byte.

cc @atsushikan

ahsonkhan · 2018-02-15T01:12:07Z

src/mscorlib/shared/System/String.Searching.cs

@@ -79,28 +82,77 @@ public unsafe int IndexOf(char value, int startIndex, int count)



nit: reduce the comparisons in the bounds checks:

if (startIndex < 0 || startIndex > Length) => if ((uint)startIndex > (uint)Length)

if (count < 0 || count > Length - startIndex) => if ((uint)count > (uint)(Length - startIndex))
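The single-comparison form relies on how a negative int maps to uint. Below is a minimal sketch of that equivalence; the class and method names are hypothetical and not part of the PR, and it assumes the length argument is never negative (true for String.Length, and for Length - startIndex once startIndex has been validated).

```csharp
using System;

public static class BoundsCheckSketch
{
    // Two-comparison form, as in the code before the change.
    public static bool IsOutOfRangeTwoChecks(int index, int length)
        => index < 0 || index > length;

    // Single-comparison form. A negative int reinterpreted as uint is at
    // least 2147483648, which exceeds any non-negative int length, so one
    // unsigned comparison covers both conditions. The unchecked keyword
    // keeps the cast from throwing if the project is compiled with
    // overflow checking enabled.
    public static bool IsOutOfRangeOneCheck(int index, int length)
        => unchecked((uint)index > (uint)length);
}
```

Both methods should return the same answer for every index as long as length is non-negative, so the exception behavior of the caller does not change.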

eerhardt · 2018-02-15T17:44:42Z

It would be good to get additional perf numbers on hardware without VEX support and potentially on hardware where an unaligned read/write is not as fast as an aligned read/write.

@tannergooding - I don't think I have access to the hardware you are asking for. I updated the OP with 2 more machines (the worst machines in my house). If you had machines I could remote into, please let me know.

fiigii · 2018-02-15T17:57:53Z

don't think I have access to the hardware you are asking for

You can just set COMPlus_EnableAVX=0 to disable VEX encoding on any machine, which works with release build.

tannergooding · 2018-02-15T18:32:14Z

@eerhardt, I sent you an email indicating how you can connect to the Q6600 machine for testing a machine without fast unaligned read/write support.

For testing the non-VEX encoding, you can just set COMPlus_EnableAVX=0 (on the machines you've already tested above) as @fiigii indicated.
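One way to check whether such a setting changed anything is to print the vector width the runtime reports before running the benchmarks. This is only a sketch: the class name is hypothetical, and the expectation that Vector<ushort>.Count drops from 16 to 8 when AVX is disabled on an AVX2 machine is an assumption about typical x64 behavior, not something stated in this thread.

```csharp
using System;
using System.Numerics;

public static class VectorInfoSketch
{
    // Reports whether Vector<T> is hardware accelerated and how many
    // 16-bit elements fit in one Vector<ushort> on this runtime.
    public static string Describe()
        => $"IsHardwareAccelerated={Vector.IsHardwareAccelerated}, " +
           $"Vector<ushort>.Count={Vector<ushort>.Count}";
}
```

Calling Console.WriteLine(VectorInfoSketch.Describe()) with and without COMPlus_EnableAVX=0 set would show whether the process picked up the environment variable.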

danmoseley · 2018-02-15T18:57:23Z

Check @ViktorHofer is not on the Q6600 box before you use it... we need to buy another $20 box..

danmoseley · 2018-02-15T19:04:08Z

cc @ViktorHofer

eerhardt · 2018-02-16T17:42:44Z

So I've spent some time data gathering.

I ran the perf tests on the Q6600 machine and added the results to the original post. The numbers don't look as good as on modern hardware.
I set COMPlus_EnableAVX=0 on "Machine 1 (windows desktop)" above and ran the perf tests both with and without my changes and got the same results. So I must be doing something wrong.
I traced string.IndexOf(char) while doing dotnet build of the large web project - OrchardCore, and for a dotnet new console project. Here are the numbers I found:

OrchardCore (had to kill it after 200k results)


Total Calls	223283
Average length	163.5 characters
Number of misses	191866
% Missed	86%
Average Result if found	502.8

dotnet new console


Total Calls	26947
Average length	228 characters
Number of misses	24214
% Missed	89%
Average Result if found	1273

This leads me to believe this change is valuable, since:

The average string lengths are over 100 chars.
Most of the time, the character is not found in the string.
When the character is found, on average it is not found near the beginning.

I tried getting E2E perf numbers of msbuild with this change, but the results are so inconsistent (probably due to I/O), that it is impossible to get hard numbers.

RawIndexOfTrace.zip

eerhardt · 2018-02-16T20:53:23Z

Does it make sense to keep the vectorization implementation in one place only and have the String APIs call the Span ones or is there a significant cost here to merit duplicating the implementation?

The one main place that might cause issues is the XorPowerOfTwoToHighChar vs XorPowerOfTwoToHighByte, and some other places in LocateFirstFoundChar. If this PR is deemed acceptable to move forward, I think we can refactor to one implementation in the future, if it makes sense.

eerhardt · 2018-02-21T22:53:16Z

@jkotas - thoughts on whether this approach should be pursued or not?

jkotas · 2018-02-21T23:00:35Z

This is a performance improvement. Performance improvements should be pursued strictly based on data. If the data say that this is an improvement on average (which it sounds like they do), then it is worth pursuing.

Clean up IndexOf vectorization.

eerhardt · 2018-02-28T00:14:10Z

This PR should be ready for real review now. I've removed WIP.

eerhardt · 2018-03-01T20:40:24Z

Any feedback on this? I'd like to get this in today.

ahsonkhan

Minor nits (ignore if there are no other changes, since CI is already green). Otherwise, LGTM.

ahsonkhan · 2018-03-01T20:42:45Z

src/mscorlib/shared/System/String.Searching.cs

+                {
+                    unchecked
+                    {
+                        const int elementsPerByte = sizeof(ushort) / sizeof(byte);


nit: PascalCasing for constants

ahsonkhan · 2018-03-01T20:43:48Z

src/mscorlib/shared/System/String.Searching.cs

+                {
+                    count = (int)((pEndCh - pCh) & ~(Vector<ushort>.Count - 1));
+                    // Get comparison Vector
+                    Vector<ushort> vComparison = new Vector<ushort>(value);


nit: use var
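The masking expression in the excerpt above rounds the remaining length down to a multiple of Vector<ushort>.Count, so the vector loop only touches whole vectors. Here is a simplified, safe-code sketch of that idea. It is not the PR's implementation: the PR works on char* pointers and extracts the match position from the comparison result, whereas this sketch uses spans and falls back to a scalar scan of the matching chunk. The class name is hypothetical.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class VectorizedIndexOfSketch
{
    public static int IndexOf(ReadOnlySpan<char> text, char value)
    {
        // Reinterpret the chars as ushorts so they can be loaded into Vector<ushort>.
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(text);
        int i = 0;

        if (Vector.IsHardwareAccelerated && data.Length >= Vector<ushort>.Count)
        {
            // Round down to a multiple of the vector width; Count is a power
            // of two, so masking with ~(Count - 1) clears the remainder bits.
            int vectorizedLength = data.Length & ~(Vector<ushort>.Count - 1);

            // Every lane of the comparison vector holds the search value.
            var comparison = new Vector<ushort>(value);

            for (; i < vectorizedLength; i += Vector<ushort>.Count)
            {
                var chunk = new Vector<ushort>(data.Slice(i, Vector<ushort>.Count));
                if (Vector.EqualsAny(chunk, comparison))
                {
                    break; // A lane matched; the scalar loop below finds which one.
                }
            }
        }

        // Scalar loop: handles the matching chunk, the tail, and short inputs.
        for (; i < data.Length; i++)
        {
            if (data[i] == value)
            {
                return i;
            }
        }
        return -1;
    }
}
```

For example, VectorizedIndexOfSketch.IndexOf("abcdefgh", 'h') should agree with "abcdefgh".IndexOf('h').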

ahsonkhan · 2018-03-01T20:45:13Z

src/mscorlib/shared/System/String.Searching.cs

+        private static int LocateLastFoundChar(ulong match)
+        {
+            // Find the most significant char that has its highest bit set
+            int index = 3;


Why does index start at 3 here? Can you add a comment please?
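A likely reading is that a ulong holds four 16-bit chars, so lane 3 is the most significant one and the search for the last match starts there. The sketch below illustrates that reading; the loop body is an assumption, since the diff excerpt above shows only the first lines of the method, and the class and method names are hypothetical. It also assumes the input is non-zero and that each matching lane has its highest bit set.

```csharp
using System;

public static class LocateLastFoundCharSketch
{
    // Returns the index (0..3) of the most significant 16-bit lane whose
    // highest bit is set. Precondition: match != 0.
    public static int Locate(ulong match)
    {
        // A ulong packs four chars; lane 3 occupies the top 16 bits.
        int index = 3;

        // While the top bit is clear (the value is positive as a signed long),
        // shift the next lower lane into the top position and step down.
        while ((long)match > 0)
        {
            match <<= 16;
            index--;
        }
        return index;
    }
}
```

With this layout, a match only in the lowest lane takes three shifts to reach the top bit, which is why the counter has to begin at 3 rather than 0.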

ahsonkhan · 2018-03-01T20:46:09Z

src/mscorlib/shared/System/String.Searching.cs

 using System.Runtime.InteropServices;
+using Internal.Runtime.CompilerServices;


nit: add space between System.* and Internal.* using directives

ahsonkhan · 2018-03-01T20:46:54Z

@dotnet-bot test Windows_NT arm Cross Checked corefx_baseline Build and Test
@dotnet-bot test Windows_NT x64_arm64_altjit Checked corefx_baseline
@dotnet-bot test Ubuntu x64 Checked corefx_baseline
@dotnet-bot test Windows_NT x64 Checked corefx_baseline
@dotnet-bot test Windows_NT x86_arm_altjit Checked corefx_baseline
@dotnet-bot test Windows_NT x86 Checked corefx_baseline

ahsonkhan · 2018-03-01T23:49:37Z

@eerhardt, the CI legs were interrupted. Can you run the string tests locally (for instance https://github.com/dotnet/corefx/blob/master/src/System.Runtime/tests/System/StringTests.cs#L1314)?

Build coreclr (from your branch)
Build corefx, pointing to your coreclr build - https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/developer-guide.md#testing-with-private-coreclr-bits
Run the System.Runtime tests (which contain the string.indexof tests).

As long as they pass, I think we should merge this change in.

The CI did manage to run some:

14:25:46   Finished:    System.Runtime.Tests
14:25:46   Discovered:  System.IO.Compression.Performance.Tests
14:25:47   Starting:    System.Management.Tests
14:25:47   
14:25:47   === TEST EXECUTION SUMMARY ===
14:25:47      System.Runtime.Tests  Total: 15140, Errors: 0, Failed: 0, Skipped: 14, Time: 284.250s

eerhardt · 2018-03-01T23:53:34Z

I've run the System.Runtime tests locally and they all passed. I wanted to beef those tests up a little bit in light of this new change -

add some longer strings (the longest looked like it was around 27 chars),
make the "found" character iterate along different indexes, something like:

"abcdefgh".IndexOf('h')
"abcdefgh".IndexOf('g')
"abcdefgh".IndexOf('f')

But that needs to go into corefx in a separate PR.
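A rough sketch of what such a test could look like follows. It is not the test that was later added to corefx; the class name, the filler character 'a', and the target character 'z' are all illustrative choices. It walks the target character across every position of strings up to a given length, which exercises both the vectorized body and the scalar tail.

```csharp
using System;

public static class IndexOfPositionTests
{
    // Returns the number of mismatches found; 0 means every case agreed.
    public static int Run(int maxLength)
    {
        int failures = 0;
        for (int length = 1; length <= maxLength; length++)
        {
            for (int pos = 0; pos < length; pos++)
            {
                // Build a string of 'a' characters with a single 'z' at pos.
                char[] chars = new string('a', length).ToCharArray();
                chars[pos] = 'z';
                string s = new string(chars);

                if (s.IndexOf('z') != pos) failures++;
                if (s.LastIndexOf('z') != pos) failures++;
            }

            // Miss case: the character does not occur at all.
            if (new string('a', length).IndexOf('z') != -1) failures++;
        }
        return failures;
    }
}
```

Running it with a maximum length well above the 27 characters mentioned above (for example, IndexOfPositionTests.Run(100)) covers strings that span several vectors.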

ahsonkhan · 2018-03-01T23:55:04Z

I've run the System.Runtime tests locally and they all passed.

LGTM. @tarekgh, any comments before we merge this in?

tarekgh · 2018-03-02T00:01:38Z

I'll take a look

tarekgh · 2018-03-02T00:07:28Z

src/mscorlib/shared/System/String.Searching.cs

        public unsafe int IndexOf(char value, int startIndex, int count)
        {
-            if (startIndex < 0 || startIndex > Length)
+            if ((uint)startIndex > (uint)Length)


if ((uint)startIndex > (uint)Length)

Wouldn't this cause a problem, or throw a different exception, if startIndex is passed as a negative value?

Nope.

If startIndex were negative, casting it to uint would make it larger than Int32.MaxValue, so it would automatically be larger than Length. So the check is already there (just optimized).

We have such checks everywhere.
See #16658 (comment)

And #16392 (comment)

If someone calls this API inside an unchecked block, it could produce undesired results, I guess.

For example,
Console.WriteLine(unchecked((int)-4294967294));
prints
2

this is not true if using unchecked blocks.
for example
Console.WriteLine(unchecked((int)-4294967294));
is printing
2

never mind, the value would be just 2. I think this will work I guess.

could you add some comment mentioning the limits and how this is safe?

could you add some comment mentioning the limits and how this is safe?

Are you suggesting adding a comment everywhere we do this? We have several Span/Memory APIs that all do this.

at least adding the comment on the newly introduced code.

There are hundreds of places in coreclr/corefx that use this idiom. IMHO, I do not think that the comment is needed.

tarekgh · 2018-03-02T00:38:22Z

In general LGTM.

The only worry now is the code complexity became high but I hope we'll not touch this code again for any reason.

jods4 · 2018-08-20T17:59:38Z

This is cool but it confuses me a little bit:

Did you make IndexOf(char, int, int) an ordinal search?
The docs say it was(?) culture-aware.

Is this an intended change, did you announce it? Am I missing something?

@tarekgh

I hope we'll not touch this code again for any reason.

Famous last words 😆

eerhardt · 2018-08-20T18:33:17Z

Did you make IndexOf(char, int, int) an ordinal search?

string.IndexOf(char, int, int) has always been an ordinal search.

The docs say it was(?) culture-aware.

Can you point to the docs that you are referring to?

jods4 · 2018-08-20T18:42:57Z

@eerhardt You are right, my bad! There's a big difference when the needle is a char (ordinal) or string (cultural).

I checked some docs to verify my intuition before posting here and you have to give them a very careful read.

For instance on Best practices for using string on .NET:
https://docs.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings?view=netframework-4.7.2#common-string-comparison-methods-in-net

String.IndexOf and String.LastIndexOf

Default interpretation: StringComparison.CurrentCulture.

But then if you continue reading, the "default interpretation" gets split:

There is a lack of consistency in how the default overloads of these methods perform comparisons. All String.IndexOf and String.LastIndexOf methods that include a Char parameter perform an ordinal comparison, but the default String.IndexOf and String.LastIndexOf methods that include a String parameter perform a culture-sensitive comparison.

So all good, sorry!

Vectorize String.IndexOf(char) using the same algorithm as SpanHelpers

5b20052

IndexOf(byte).

eerhardt requested review from stephentoub, ahsonkhan and jkotas February 14, 2018 19:58

jkotas reviewed Feb 14, 2018

ahsonkhan reviewed Feb 15, 2018

danmoseley requested a review from JosephTremoulet February 15, 2018 18:56

Respond to feedback.

6c6f85e

Vectorize String.LastIndexOf

ace060a

Clean up IndexOf vectorization.

eerhardt changed the title ~~WIP: Vectorize String.IndexOf(char)~~ Vectorize String.IndexOf(char) and String.LastIndexOf(char) Feb 28, 2018

ahsonkhan approved these changes Mar 1, 2018

tarekgh reviewed Mar 2, 2018

ahsonkhan assigned eerhardt Mar 2, 2018

ahsonkhan merged commit c4a4391 into dotnet:master Mar 2, 2018
