This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Vectorize String.IndexOf(char) and String.LastIndexOf(char) #16392

Merged on Mar 2, 2018 (3 commits)


@eerhardt eerhardt commented Feb 14, 2018

Vectorize String.IndexOf(char) using the same algorithm as SpanHelpers.IndexOf(byte).

I also plan on doing the same for String.LastIndexOf, which is why I marked this as WIP. I wanted to get early feedback on the approach to ensure continuing this work is worth the effort.
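
For readers unfamiliar with the technique, the following is a minimal sketch of a vectorized character search. It is not the code from this PR: the PR works on `char*` pointers and locates the match inside a vector with bit tricks, while this sketch uses spans and falls back to a scalar scan of the matching block. The class name `VectorizedIndexOfSketch` is hypothetical.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class VectorizedIndexOfSketch
{
    // Illustrative only: compares Vector<ushort>.Count chars per iteration,
    // then lets a scalar loop pinpoint the match (or handle the tail).
    public static int IndexOf(string text, char value)
    {
        // Reinterpret the UTF-16 chars as ushort so they fit Vector<ushort>.
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(text.AsSpan());
        int width = Vector<ushort>.Count;
        int i = 0;

        if (Vector.IsHardwareAccelerated && data.Length >= width)
        {
            // Every lane of the comparison vector holds the target char.
            var target = new Vector<ushort>(value);
            for (; i <= data.Length - width; i += width)
            {
                var block = new Vector<ushort>(data.Slice(i, width));
                if (Vector.EqualsAny(block, target))
                {
                    break; // the match is inside this block; the scalar loop finds it
                }
            }
        }

        // Scalar loop: scans the matching block, or the tail that did not fill a vector.
        for (; i < data.Length; i++)
        {
            if (data[i] == value)
            {
                return i;
            }
        }
        return -1;
    }
}
```

The fixed setup cost (building the comparison vector, checking lengths) is one plausible reason short inputs see little or no gain, while long inputs amortize it over many blocks.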

Perf results

I ran the following tests (Test Code):

Machine 1 (windows desktop):

BenchmarkDotNet=v0.10.12.20180214-develop, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.248)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=3328125 Hz, Resolution=300.4695 ns, Timer=TSC

Without changes

| Method | Mean | Error | StdDev |
|---|---|---|---|
| IndexOfShort_Match | 12.11 ns | 0.0951 ns | 0.0843 ns |
| IndexOfShort_Miss | 16.84 ns | 0.2507 ns | 0.2345 ns |
| IndexOfMedium_Match | 28.28 ns | 0.4762 ns | 0.4454 ns |
| IndexOfMedium_Miss | 36.80 ns | 0.7361 ns | 0.6885 ns |
| IndexOfLong_Match | 101.22 ns | 0.7716 ns | 0.6443 ns |
| IndexOfLong_Miss | 160.88 ns | 2.4450 ns | 2.2870 ns |

With changes

| Method | Mean | Error | StdDev |
|---|---|---|---|
| IndexOfShort_Match | 13.95 ns | 0.3054 ns | 0.3635 ns |
| IndexOfShort_Miss | 16.95 ns | 0.3310 ns | 0.3096 ns |
| IndexOfMedium_Match | 27.02 ns | 0.3313 ns | 0.3099 ns |
| IndexOfMedium_Miss | 33.84 ns | 0.4833 ns | 0.4284 ns |
| IndexOfLong_Match | 51.26 ns | 0.3604 ns | 0.3372 ns |
| IndexOfLong_Miss | 67.85 ns | 1.2266 ns | 1.1473 ns |

Machine 2 (MacBook Air ~2012):

BenchmarkDotNet=v0.10.12, OS=macOS 10.13.2 (17C88) [Darwin 17.3.0]
Intel Core i7-4650U CPU 1.70GHz (Haswell), 1 CPU, 4 logical cores and 2 physical cores
.NET Core SDK=2.1.300-preview2-008171

Without changes

| Method | Mean | Error | StdDev |
|---|---|---|---|
| IndexOfShort_Match | 20.23 ns | 0.6210 ns | 1.772 ns |
| IndexOfShort_Miss | 25.25 ns | 0.5338 ns | 1.054 ns |
| IndexOfMedium_Match | 36.47 ns | 0.7497 ns | 1.122 ns |
| IndexOfMedium_Miss | 55.95 ns | 1.9238 ns | 5.489 ns |
| IndexOfLong_Match | 137.01 ns | 1.6394 ns | 1.453 ns |
| IndexOfLong_Miss | 217.41 ns | 2.4305 ns | 2.273 ns |

With changes

| Method | Mean | Error | StdDev | Median |
|---|---|---|---|---|
| IndexOfShort_Match | 24.32 ns | 0.6296 ns | 1.8466 ns | 24.11 ns |
| IndexOfShort_Miss | 28.72 ns | 1.0847 ns | 3.1811 ns | 28.39 ns |
| IndexOfMedium_Match | 45.40 ns | 4.3182 ns | 12.7323 ns | 40.48 ns |
| IndexOfMedium_Miss | 43.97 ns | 0.8260 ns | 0.7726 ns | 43.72 ns |
| IndexOfLong_Match | 64.66 ns | 0.8402 ns | 0.7859 ns | 64.64 ns |
| IndexOfLong_Miss | 87.54 ns | 1.7383 ns | 1.7851 ns | 87.19 ns |

Machine 3 (Lenovo Carbon X1):

BenchmarkDotNet=v0.10.12, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.125)
Intel Core i7-3667U CPU 2.00GHz (Ivy Bridge), 1 CPU, 4 logical cores and 2 physical cores
Frequency=2435863 Hz, Resolution=410.5321 ns, Timer=TSC
.NET Core SDK=2.1.300-preview2-008171

Without changes

| Method | Mean | Error | StdDev |
|---|---|---|---|
| IndexOfShort_Match | 19.66 ns | 0.4376 ns | 0.6550 ns |
| IndexOfShort_Miss | 26.04 ns | 0.4768 ns | 0.4226 ns |
| IndexOfMedium_Match | 48.00 ns | 0.9828 ns | 1.6420 ns |
| IndexOfMedium_Miss | 78.24 ns | 1.5603 ns | 1.9733 ns |
| IndexOfLong_Match | 205.15 ns | 0.9204 ns | 0.8610 ns |
| IndexOfLong_Miss | 353.07 ns | 1.2634 ns | 1.1200 ns |

With changes

| Method | Mean | Error | StdDev |
|---|---|---|---|
| IndexOfShort_Match | 22.20 ns | 0.5248 ns | 0.8014 ns |
| IndexOfShort_Miss | 29.87 ns | 0.7094 ns | 2.0238 ns |
| IndexOfMedium_Match | 42.36 ns | 0.8687 ns | 0.8532 ns |
| IndexOfMedium_Miss | 56.24 ns | 1.2761 ns | 1.7889 ns |
| IndexOfLong_Match | 119.03 ns | 2.4004 ns | 4.5086 ns |
| IndexOfLong_Miss | 189.77 ns | 4.3292 ns | 12.6285 ns |

Machine 4 (Tanner's Q6600):

BenchmarkDotNet=v0.10.12, OS=ubuntu 17.10
Intel Core2 Quad CPU Q6600 2.40GHz, 1 CPU, 4 logical cores and 4 physical cores
.NET Core SDK=2.1.300-preview2-008173

Without changes

| Method | Mean | Error | StdDev |
|---|---|---|---|
| IndexOfShort_Match | 34.62 ns | 0.0114 ns | 0.0107 ns |
| IndexOfShort_Miss | 39.32 ns | 0.0010 ns | 0.0008 ns |
| IndexOfMedium_Match | 76.70 ns | 0.0017 ns | 0.0016 ns |
| IndexOfMedium_Miss | 102.51 ns | 0.0052 ns | 0.0046 ns |
| IndexOfLong_Match | 298.30 ns | 0.0090 ns | 0.0075 ns |
| IndexOfLong_Miss | 461.88 ns | 0.0238 ns | 0.0222 ns |

With changes

| Method | Mean | Error | StdDev |
|---|---|---|---|
| IndexOfShort_Match | 35.74 ns | 0.0054 ns | 0.0048 ns |
| IndexOfShort_Miss | 46.14 ns | 0.0023 ns | 0.0021 ns |
| IndexOfMedium_Match | 89.52 ns | 0.0122 ns | 0.0108 ns |
| IndexOfMedium_Miss | 118.06 ns | 0.0052 ns | 0.0043 ns |
| IndexOfLong_Match | 323.21 ns | 1.0655 ns | 0.9445 ns |
| IndexOfLong_Miss | 395.20 ns | 0.0050 ns | 0.0045 ns |

As you can see, for short strings, there is a little degradation, especially when the match is towards the beginning of the string. But for longer strings, where the match is towards the end or doesn't match at all, the gains are substantial.

/cc @benaadams

@tannergooding
Member

tannergooding commented Feb 14, 2018

It would be good to get additional perf numbers on hardware without VEX support and potentially on hardware where an unaligned read/write is not as fast as an aligned read/write.

@jkotas
Member

jkotas commented Feb 14, 2018

As you can see, for short strings, there is a little degradation

It would be useful to get some data about the distribution of inputs for IndexOf (e.g. dotnet build or ASP.NET MusicStore are easy enough) to prove that this degradation pays for itself.


while (count >= 4)
int nLength = count;
bool useVectorization = false;
Member

@jkotas jkotas Feb 14, 2018

Is there a good reason why this is using this bool variable? https://github.com/dotnet/corefx/blob/master/src/System.Memory/src/System/SpanHelpers.byte.cs#L96 does not use it. The bool variable does not look like an improvement.

Member Author

@eerhardt eerhardt Feb 14, 2018

My reasoning was so it wouldn't have to do the index calculation and checks when we aren't using the vectorized approach (ex. when the string/count is too short).

https://github.com/dotnet/coreclr/pull/16392/files#diff-f7e6389a6519754f04d20215ff42efcdR126

The SpanHelpers approach uses a ref byte searchSpace and keeps track of the index. Here I kept the same char* pointer approach. So to get the index, I either need to calculate it, or also keep track of the index as we go.

Doing this saved ~1ns in the IndexOfShort_Miss case on my machine.

I can remove it, if you don't think it is necessary.

Member

There are two long-lived variables: pStartCh and useVectorization. They will burn two registers or two stack spill slots. I think it may be more efficient if you replace them with just one pEndCh.

Member

Do you need both pCh and index?

@ahsonkhan
Member

ahsonkhan commented Feb 15, 2018

We will soon have the span extension methods, like IndexOf (and LastIndexOf) in coreclr (https://github.com/dotnet/corefx/issues/25182).

Does it make sense to keep the vectorization implementation in one place only and have the String APIs call the Span ones or is there a significant cost here to merit duplicating the implementation?

Granted, the span indexof only special-cases T=byte.

cc @atsushikan

@@ -79,28 +82,77 @@ public unsafe int IndexOf(char value, int startIndex, int count)

Member

nit: reduce the comparisons in the bounds checks:

if (startIndex < 0 || startIndex > Length) => if ((uint)startIndex > (uint)Length)

if (count < 0 || count > Length - startIndex) => if ((uint)count > (uint)(Length - startIndex))
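
As a rough illustration of why the single unsigned comparison is equivalent, here is a minimal sketch. The class and method names are hypothetical, and it assumes `length` is never negative, which holds for `String.Length`.

```csharp
public static class BoundsCheckSketch
{
    // Original form: two signed comparisons.
    public static bool IsOutOfRangeTwoComparisons(int startIndex, int length)
        => startIndex < 0 || startIndex > length;

    // Suggested form: one unsigned comparison. A negative startIndex
    // reinterpreted as uint is at least 2147483648, which exceeds any
    // non-negative int length, so the "< 0" case is covered implicitly.
    public static bool IsOutOfRangeSingleComparison(int startIndex, int length)
        => unchecked((uint)startIndex > (uint)length);
}
```

Both methods should agree for every `startIndex` as long as `length >= 0`; the second simply gives the JIT one comparison and one branch instead of two.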

@eerhardt
Member Author

It would be good to get additional perf numbers on hardware without VEX support and potentially on hardware where an unaligned read/write is not as fast as an aligned read/write.

@tannergooding - I don't think I have access to the hardware you are asking for. I updated the OP with 2 more machines (the worst machines in my house). If you had machines I could remote into, please let me know.

@fiigii

fiigii commented Feb 15, 2018

don't think I have access to the hardware you are asking for

You can just set COMPlus_EnableAVX=0 to disable VEX encoding on any machine, which works with release build.

@tannergooding
Member

tannergooding commented Feb 15, 2018

@eerhardt, I sent you an email indicating how you can connect to the Q6600 machine for testing a machine without fast unaligned read/write support.

For testing the non-VEX encoding, you can just set COMPlus_EnableAVX=0 (on the machines you've already tested above) as @fiigii indicated.

@danmoseley
Member

Check @ViktorHofer is not on the Q6600 box before you use it... we need to buy another $20 box..

@danmoseley
Member

cc @ViktorHofer

@eerhardt
Member Author

So I've spent some time data gathering.

  1. I ran the perf tests on the Q6600 machine and added the results to the original post. The numbers don't look as good as on modern hardware.
  2. I set COMPlus_EnableAVX=0 on "Machine 1 (windows desktop)" above and ran the perf tests both with and without my changes and got the same results. So I must be doing something wrong.
  3. I traced string.IndexOf(char) while doing dotnet build of the large web project - OrchardCore, and for a dotnet new console project. Here are the numbers I found:

OrchardCore (had to kill it after 200k results)

Total Calls 223283
Average length 163.5 characters
Number of misses 191866
% Missed 86%
Average Result if found 502.8

dotnet new console

Total Calls 26947
Average length 228 characters
Number of misses 24214
% Missed 89%
Average Result if found 1273

This leads me to believe this change is valuable, since:

  1. The average string lengths are over 100 chars.
  2. Most of the time, the character is not found in the string.
  3. When the character is found, on average it is not found near the beginning.
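
The miss percentages quoted in the two traces can be re-derived from the raw call counts. A minimal sketch, with a hypothetical helper name:

```csharp
public static class TraceStats
{
    // Percentage of IndexOf calls that did not find the character.
    public static double MissPercent(int totalCalls, int misses)
        => 100.0 * misses / totalCalls;
}
```

`TraceStats.MissPercent(223283, 191866)` is about 85.9 for the OrchardCore trace, and `TraceStats.MissPercent(26947, 24214)` is about 89.9 for `dotnet new console`, consistent with the roughly 86% and 89% figures reported above.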

I tried getting E2E perf numbers of msbuild with this change, but the results are so inconsistent (probably due to I/O), that it is impossible to get hard numbers.

RawIndexOfTrace.zip

@eerhardt
Member Author

Does it make sense to keep the vectorization implementation in one place only and have the String APIs call the Span ones or is there a significant cost here to merit duplicating the implementation?

The one main place that might cause issues is the XorPowerOfTwoToHighChar vs XorPowerOfTwoToHighByte, and some other places in LocateFirstFoundChar. If this PR is deemed acceptable to move forward, I think we can refactor to one implementation in the future, if it makes sense.

@eerhardt
Member Author

@jkotas - thoughts on whether this approach should be pursued or not?

@jkotas
Member

jkotas commented Feb 21, 2018

This is a performance improvement. Performance improvements should be pursued strictly based on data. If the data say that this is an improvement on average (which it sounds like they do), then it is worth pursuing.

Clean up IndexOf vectorization.
@eerhardt eerhardt changed the title from "WIP: Vectorize String.IndexOf(char)" to "Vectorize String.IndexOf(char) and String.LastIndexOf(char)" Feb 28, 2018
@eerhardt
Member Author

This PR should be ready for real review now. I've removed WIP.

@eerhardt
Member Author

eerhardt commented Mar 1, 2018

Any feedback on this? I'd like to get this in today.

Member

@ahsonkhan ahsonkhan left a comment

Minor nits (ignore if there are no other changes, since CI is green already). Otherwise, LGTM.

{
unchecked
{
const int elementsPerByte = sizeof(ushort) / sizeof(byte);
Member

nit: PascalCasing for constants

{
count = (int)((pEndCh - pCh) & ~(Vector<ushort>.Count - 1));
// Get comparison Vector
Vector<ushort> vComparison = new Vector<ushort>(value);
Member

nit: use var
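
The masking expression in the snippet under review rounds the remaining element count down to a multiple of the vector width, so the vector loop only ever reads full vectors. A minimal sketch of that arithmetic, assuming the width is a power of two (the helper name is hypothetical):

```csharp
public static class VectorCountSketch
{
    // Clears the low bits so the result is the largest multiple of
    // `width` that is <= `remaining`. Only valid when width is a power of two.
    public static int RoundDownToVectorWidth(int remaining, int width)
        => remaining & ~(width - 1);
}
```

For example, with a width of 8 elements, 37 remaining chars round down to 32; the last 5 chars are left for the scalar loop.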

private static int LocateLastFoundChar(ulong match)
{
// Find the most significant char that has its highest bit set
int index = 3;
Member

Why does index start at 3 here? Can you add a comment please?
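
One way to read the starting value: a `ulong` holds four 16-bit chars, so the lanes are numbered 0 through 3, and a search for the last match starts at the most significant lane. The following is a sketch of that idea, not the method from the PR; the name `LastSetLane` and the explicit handling of a zero input are assumptions.

```csharp
public static class LastLaneSketch
{
    // `match` is assumed to have the high bit of a 16-bit lane set for
    // every matching char. Returns the index (0..3) of the most
    // significant matching lane, or -1 when no lane matched.
    public static int LastSetLane(ulong match)
    {
        int index = 3; // lane 3 is the most significant of the four chars
        while (index >= 0 && (match & 0x8000_0000_0000_0000UL) == 0)
        {
            match <<= 16; // move the next lower lane into the top position
            index--;
        }
        return index;
    }
}
```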

using System.Runtime.InteropServices;
using Internal.Runtime.CompilerServices;
Member

nit: add space between System.* and Internal.* using directives

@ahsonkhan
Member

@dotnet-bot test Windows_NT arm Cross Checked corefx_baseline Build and Test
@dotnet-bot test Windows_NT x64_arm64_altjit Checked corefx_baseline
@dotnet-bot test Ubuntu x64 Checked corefx_baseline
@dotnet-bot test Windows_NT x64 Checked corefx_baseline
@dotnet-bot test Windows_NT x86_arm_altjit Checked corefx_baseline
@dotnet-bot test Windows_NT x86 Checked corefx_baseline

@ahsonkhan
Member

ahsonkhan commented Mar 1, 2018

@eerhardt, the CI legs were interrupted. Can you run the string tests locally (for instance https://github.com/dotnet/corefx/blob/master/src/System.Runtime/tests/System/StringTests.cs#L1314)?

  1. Build coreclr (from your branch)
  2. Build corefx, pointing to your coreclr build - https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/developer-guide.md#testing-with-private-coreclr-bits
  3. Run the System.Runtime tests (which contain the string.indexof tests).

As long as they pass, I think we should merge this change in.

The CI did manage to run some:

14:25:46   Finished:    System.Runtime.Tests
14:25:46   Discovered:  System.IO.Compression.Performance.Tests
14:25:47   Starting:    System.Management.Tests
14:25:47   
14:25:47   === TEST EXECUTION SUMMARY ===
14:25:47      System.Runtime.Tests  Total: 15140, Errors: 0, Failed: 0, Skipped: 14, Time: 284.250s

@eerhardt
Member Author

eerhardt commented Mar 1, 2018

I've run the System.Runtime tests locally and they all passed. I wanted to beef those tests up a little bit in light of this new change -

  1. add some longer strings (the longest looked like it was around 27 chars),
  2. make the "found" character iterate along different indexes, something like:

"abcdefgh".IndexOf('h')
"abcdefgh".IndexOf('g')
"abcdefgh".IndexOf('f')

But that needs to go into corefx in a separate PR.
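
A test along the lines proposed above might look like the following sketch. It is not the corefx test that was eventually written; the class name and the choice of filler and target characters are assumptions. It places the target at every index of a string long enough to exercise both the vector loop and the scalar tail.

```csharp
public static class IndexOfSweep
{
    // Returns how many checks disagreed with the expected index.
    // Zero means IndexOf and LastIndexOf were correct at every position.
    public static int CountMismatches(int length)
    {
        int mismatches = 0;
        for (int i = 0; i < length; i++)
        {
            // A string of 'a' with a single 'X' at position i.
            char[] chars = new string('a', length).ToCharArray();
            chars[i] = 'X';
            string s = new string(chars);

            if (s.IndexOf('X') != i) mismatches++;
            if (s.LastIndexOf('X') != i) mismatches++;
        }

        // The miss case: no 'X' anywhere in the string.
        if (new string('a', length).IndexOf('X') != -1) mismatches++;
        return mismatches;
    }
}
```

Calling `IndexOfSweep.CountMismatches(200)` covers lengths well beyond the roughly 27-char strings mentioned above.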

@ahsonkhan
Member

I've run the System.Runtime tests locally and they all passed.

LGTM. @tarekgh, any comments before we merge this in?

@tarekgh
Member

tarekgh commented Mar 2, 2018

I'll take a look

public unsafe int IndexOf(char value, int startIndex, int count)
{
if (startIndex < 0 || startIndex > Length)
if ((uint)startIndex > (uint)Length)
Member

if ((uint)startIndex > (uint)Length)

Wouldn't this cause a problem, or throw a different exception, if startIndex is passed as a negative value?

Member

@ahsonkhan ahsonkhan Mar 2, 2018

Nope.

If start was negative, casting it to uint would make it larger than Int32.MaxValue and would automatically be larger than segment.Count. So the check is already there (just optimized).

We have such checks everywhere.
See #16658 (comment)

And #16392 (comment)

Member

If someone calls this API inside an unchecked block, it would produce undesired results, I guess:

for example
Console.WriteLine(unchecked((int)-4294967294));
is printing
2

Member

this is not true if using unchecked blocks.
for example
Console.WriteLine(unchecked((int)-4294967294));
is printing
2

Member

never mind, the value would be just 2. I think this will work I guess

Member

never mind, the value would be just 2. I think this will work I guess.

could you add some comment mentioning the limits and how this is safe?
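
To make the conclusion of this thread concrete: the wrap-around in the `unchecked` example happens at the call site, before `IndexOf` runs, so the method only ever sees an ordinary `int`. A minimal sketch (the class and method names are hypothetical):

```csharp
public static class UncheckedWrapSketch
{
    // The long constant -4294967294 wraps to 2 when narrowed to int
    // in an unchecked context (-4294967294 + 2^32 = 2).
    public static int Wrapped() => unchecked((int)-4294967294L);

    // IndexOf receives startIndex = 2, a valid index for "abc",
    // so the unsigned bounds check passes as it would for a literal 2.
    public static int Search() => "abc".IndexOf('c', Wrapped(), 1);
}
```

Here `Wrapped()` returns 2 and `Search()` returns 2. A genuinely negative `int` argument would still be rejected by the unsigned comparison.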

Member

could you add some comment mentioning the limits and how this is safe?

Are you suggesting adding a comment everywhere we do this? We have several Span/Memory APIs that all do this.

Member

at least adding the comment on the newly introduced code.

Member

There are hundreds of places in coreclr/corefx that use this idiom. IMHO, I do not think that the comment is needed.

@tarekgh
Member

tarekgh commented Mar 2, 2018

In general LGTM.

The only worry now is the code complexity became high but I hope we'll not touch this code again for any reason.

@ahsonkhan ahsonkhan merged commit c4a4391 into dotnet:master Mar 2, 2018
@jods4

jods4 commented Aug 20, 2018

This is cool but it confuses me a little bit:

Did you make IndexOf(char, int, int) an ordinal search?
The docs say it was(?) culture-aware.

Is this an intended change, did you announce it? Am I missing something?

@tarekgh

I hope we'll not touch this code again for any reason.

Famous last words 😆

@eerhardt
Member Author

Did you make IndexOf(char, int, int) an ordinal search?

string.IndexOf(char, int, int) has always been an ordinal search.

The docs say it was(?) culture-aware.

Can you point to the docs that you are referring to?

@jods4

jods4 commented Aug 20, 2018

@eerhardt You are right, my bad! There's a big difference when the needle is a char (ordinal) or string (cultural).

I checked some docs to verify my intuition before posting here and you have to give them a very careful read.

For instance on Best practices for using string on .NET:
https://docs.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings?view=netframework-4.7.2#common-string-comparison-methods-in-net

String.IndexOf and String.LastIndexOf

Default interpretation: StringComparison.CurrentCulture.

But then if you continue reading, the "default interpretation" gets split:

There is a lack of consistency in how the default overloads of these methods perform comparisons. All String.IndexOf and String.LastIndexOf methods that include a Char parameter perform an ordinal comparison, but the default String.IndexOf and String.LastIndexOf methods that include a String parameter perform a culture-sensitive comparison.
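
The difference described in that passage can be seen with a character that linguistic comparison treats as ignorable, such as the soft hyphen (U+00AD). This is a sketch with hypothetical names; the result of the culture-sensitive call depends on the runtime's globalization data, so no value is claimed for it here.

```csharp
using System;

public static class OrdinalVsCultureSketch
{
    public const string Text = "co\u00ADop";

    // Char overload: always ordinal, finds the soft hyphen at index 2.
    public static int CharSearch() => Text.IndexOf('\u00AD');

    // String overload with an explicit ordinal comparison: also index 2.
    public static int OrdinalStringSearch()
        => Text.IndexOf("\u00AD", StringComparison.Ordinal);

    // String overload with a culture-sensitive comparison: may differ,
    // because the soft hyphen can be ignored in linguistic comparisons.
    public static int CultureStringSearch()
        => Text.IndexOf("\u00AD", StringComparison.CurrentCulture);
}
```

Passing an explicit `StringComparison` to the string overloads avoids relying on the differing defaults.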

So all good, sorry!
