Double IndexOf throughput for chars #78861

Merged
merged 4 commits into from Jan 3, 2023

MihaZupan
Member

@MihaZupan MihaZupan commented Nov 26, 2022

When searching through strings, it's very common to have single-byte values (think ASCII).
As long as the value falls within an appropriate range ([1, 254] on X86 or [0, 254] on ARM), we can speed up the search by packing two input vectors together before comparing the value.

The IndexOfAnyAsciiSearcher implementation I added in #78093 is already using this trick, but it applies to regular IndexOf as well.

In this PR, I added implementations that do such packing for Contains(char), IndexOf(char), IndexOfAny(char, char), IndexOfAny(char, char, char), and IndexOfAnyInRange(char, char), roughly doubling the throughput for long inputs.

Do we want to do the same for the Last- variants as well?
I don't think specialized IndexOf(4/5 values) would be useful. For ASCII values, using IndexOfAnyValues is already very close in throughput (and things like Regex will use that).
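
The packing trick described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `PackedIndexOfSketch` and its `IndexOf` method are hypothetical, it only covers the single-value case on SSE2 hardware, and it assumes the caller has already checked that the value is in [1, 254]. Two vectors of 8 chars each are narrowed into one vector of 16 bytes, so a single comparison covers 16 characters.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch; names are illustrative and not taken from the runtime.
public static class PackedIndexOfSketch
{
    // Assumes value is in [1, 254], i.e. unambiguous after the saturating narrow.
    public static int IndexOf(ReadOnlySpan<char> span, char value)
    {
        if (value < 1 || value > 254)
            throw new ArgumentOutOfRangeException(nameof(value));

        int i = 0;
        if (Sse2.IsSupported)
        {
            // Reinterpret the chars as 16-bit integers for the pack instruction.
            ReadOnlySpan<short> source = MemoryMarshal.Cast<char, short>(span);
            Vector128<byte> target = Vector128.Create((byte)value);

            // Each iteration consumes 16 chars (two vectors of 8 chars).
            for (; i <= source.Length - 16; i += 16)
            {
                Vector128<short> lo = Vector128.Create(source.Slice(i, 8));
                Vector128<short> hi = Vector128.Create(source.Slice(i + 8, 8));

                // Narrow 16 shorts into 16 bytes with saturation (packuswb).
                Vector128<byte> packed = Sse2.PackUnsignedSaturate(lo, hi);

                // One bit per byte lane; the lowest set bit is the first match.
                int mask = Sse2.MoveMask(Sse2.CompareEqual(packed, target));
                if (mask != 0)
                    return i + BitOperations.TrailingZeroCount(mask);
            }
        }

        // Scalar tail, which also serves as the full fallback without SSE2.
        for (; i < span.Length; i++)
        {
            if (span[i] == value)
                return i;
        }
        return -1;
    }
}
```

For example, `PackedIndexOfSketch.IndexOf("hello world", 'w')` returns 6. A character such as U+4E61, whose low byte equals `'a'`, does not produce a false match because it saturates to 255 instead of being truncated.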

Benchmark numbers
Method Toolchain Length Mean Ratio
IndexOf main 1 1.973 ns 1.00
IndexOf pr 1 1.751 ns 0.89
IndexOfAny2Values main 1 2.757 ns 1.00
IndexOfAny2Values pr 1 2.593 ns 0.94
IndexOfAnyInRange main 1 2.092 ns 1.00
IndexOfAnyInRange pr 1 1.818 ns 0.87
IndexOf main 7 3.622 ns 1.00
IndexOf pr 7 3.741 ns 1.03
IndexOfAny2Values main 7 5.889 ns 1.00
IndexOfAny2Values pr 7 5.804 ns 0.99
IndexOfAnyInRange main 7 5.102 ns 1.00
IndexOfAnyInRange pr 7 3.638 ns 0.71
IndexOf main 8 2.604 ns 1.00
IndexOf pr 8 2.295 ns 0.88
IndexOfAny2Values main 8 2.588 ns 1.00
IndexOfAny2Values pr 8 2.790 ns 1.08
IndexOfAnyInRange main 8 2.929 ns 1.00
IndexOfAnyInRange pr 8 2.369 ns 0.81
IndexOf main 9 2.855 ns 1.00
IndexOf pr 9 2.279 ns 0.80
IndexOfAny2Values main 9 2.849 ns 1.00
IndexOfAny2Values pr 9 2.774 ns 0.97
IndexOfAnyInRange main 9 2.914 ns 1.00
IndexOfAnyInRange pr 9 2.373 ns 0.81
IndexOf main 15 2.836 ns 1.00
IndexOf pr 15 2.315 ns 0.82
IndexOfAny2Values main 15 2.788 ns 1.00
IndexOfAny2Values pr 15 2.799 ns 1.00
IndexOfAnyInRange main 15 2.928 ns 1.00
IndexOfAnyInRange pr 15 2.361 ns 0.81
IndexOf main 16 2.402 ns 1.00
IndexOf pr 16 2.273 ns 0.95
IndexOfAny2Values main 16 2.830 ns 1.00
IndexOfAny2Values pr 16 2.791 ns 0.99
IndexOfAnyInRange main 16 2.871 ns 1.00
IndexOfAnyInRange pr 16 2.378 ns 0.83
IndexOf main 17 2.740 ns 1.00
IndexOf pr 17 2.574 ns 0.94
IndexOfAny2Values main 17 3.376 ns 1.00
IndexOfAny2Values pr 17 2.967 ns 0.88
IndexOfAnyInRange main 17 3.040 ns 1.00
IndexOfAnyInRange pr 17 2.312 ns 0.76
IndexOf main 32 2.826 ns 1.00
IndexOf pr 32 2.558 ns 0.91
IndexOfAny2Values main 32 3.293 ns 1.00
IndexOfAny2Values pr 32 2.971 ns 0.90
IndexOfAnyInRange main 32 2.903 ns 1.00
IndexOfAnyInRange pr 32 2.312 ns 0.80
IndexOf main 1000 39.015 ns 1.00
IndexOf pr 1000 23.991 ns 0.61
IndexOfAny2Values main 1000 45.199 ns 1.00
IndexOfAny2Values pr 1000 25.563 ns 0.57
IndexOfAnyInRange main 1000 55.958 ns 1.00
IndexOfAnyInRange pr 1000 23.020 ns 0.41
IndexOf main 100000 4,875.359 ns 1.00
IndexOf pr 100000 2,190.711 ns 0.45
IndexOfAny2Values main 100000 5,709.625 ns 1.00
IndexOfAny2Values pr 100000 2,980.139 ns 0.52
IndexOfAnyInRange main 100000 5,133.861 ns 1.00
IndexOfAnyInRange pr 100000 2,987.320 ns 0.58

Method Toolchain Length Mean Error Ratio
IndexOfIgnoreCase main 1 5.807 ns 0.0755 ns 1.00
IndexOfIgnoreCase pr 1 5.766 ns 0.0167 ns 0.99
IndexOfIgnoreCase main 32 8.565 ns 0.0465 ns 1.00
IndexOfIgnoreCase pr 32 8.471 ns 0.0267 ns 0.99
IndexOfIgnoreCase main 1000 51.199 ns 0.1238 ns 1.00
IndexOfIgnoreCase pr 1000 34.857 ns 0.3394 ns 0.68
IndexOfIgnoreCase main 100000 5,734.450 ns 22.3808 ns 1.00
IndexOfIgnoreCase pr 100000 2,979.708 ns 12.2477 ns 0.52

This is generally a slight regression (up to about 1.2x in the numbers below) if a match is found at the start.

If the first character matches
Method Toolchain Length Mean Error Ratio
IndexOf main 1 1.745 ns 0.0005 ns 1.00
IndexOf pr 1 1.985 ns 0.0026 ns 1.14
IndexOfAny2Values main 1 2.245 ns 0.0015 ns 1.00
IndexOfAny2Values pr 1 2.182 ns 0.0021 ns 0.97
IndexOfAnyInRange main 1 2.079 ns 0.0010 ns 1.00
IndexOfAnyInRange pr 1 1.720 ns 0.0005 ns 0.83
IndexOf main 7 1.526 ns 0.0005 ns 1.00
IndexOf pr 7 1.510 ns 0.0006 ns 0.99
IndexOfAny2Values main 7 1.743 ns 0.0009 ns 1.00
IndexOfAny2Values pr 7 1.695 ns 0.0023 ns 0.97
IndexOfAnyInRange main 7 2.100 ns 0.0010 ns 1.00
IndexOfAnyInRange pr 7 1.751 ns 0.0043 ns 0.83
IndexOf main 8 2.540 ns 0.0291 ns 1.00
IndexOf pr 8 2.978 ns 0.0365 ns 1.18
IndexOfAny2Values main 8 3.021 ns 0.0344 ns 1.00
IndexOfAny2Values pr 8 3.319 ns 0.0072 ns 1.10
IndexOfAnyInRange main 8 2.943 ns 0.0375 ns 1.00
IndexOfAnyInRange pr 8 2.901 ns 0.0335 ns 0.99
IndexOf main 16 2.934 ns 0.0021 ns 1.00
IndexOf pr 16 2.791 ns 0.0023 ns 0.95
IndexOfAny2Values main 16 2.881 ns 0.0028 ns 1.00
IndexOfAny2Values pr 16 3.246 ns 0.0037 ns 1.13
IndexOfAnyInRange main 16 2.657 ns 0.0029 ns 1.00
IndexOfAnyInRange pr 16 2.725 ns 0.0023 ns 1.03
IndexOf main 17 2.946 ns 0.0016 ns 1.00
IndexOf pr 17 3.307 ns 0.0018 ns 1.12
IndexOfAny2Values main 17 2.877 ns 0.0016 ns 1.00
IndexOfAny2Values pr 17 3.496 ns 0.0064 ns 1.22
IndexOfAnyInRange main 17 2.654 ns 0.0017 ns 1.00
IndexOfAnyInRange pr 17 3.057 ns 0.0018 ns 1.15
IndexOf main 32 2.949 ns 0.0027 ns 1.00
IndexOf pr 32 3.360 ns 0.0136 ns 1.14
IndexOfAny2Values main 32 2.914 ns 0.0057 ns 1.00
IndexOfAny2Values pr 32 3.502 ns 0.0073 ns 1.20
IndexOfAnyInRange main 32 2.658 ns 0.0021 ns 1.00
IndexOfAnyInRange pr 32 3.042 ns 0.0012 ns 1.14
IndexOf main 33 2.968 ns 0.0033 ns 1.00
IndexOf pr 33 2.745 ns 0.0059 ns 0.92
IndexOfAny2Values main 33 2.900 ns 0.0042 ns 1.00
IndexOfAny2Values pr 33 3.260 ns 0.0079 ns 1.12
IndexOfAnyInRange main 33 2.684 ns 0.0036 ns 1.00
IndexOfAnyInRange pr 33 2.871 ns 0.0106 ns 1.07
IndexOf main 1000 3.068 ns 0.0239 ns 1.00
IndexOf pr 1000 2.903 ns 0.0343 ns 0.95
IndexOfAny2Values main 1000 3.021 ns 0.0258 ns 1.00
IndexOfAny2Values pr 1000 3.382 ns 0.0335 ns 1.12
IndexOfAnyInRange main 1000 2.784 ns 0.0259 ns 1.00
IndexOfAnyInRange pr 1000 2.981 ns 0.0312 ns 1.07

@MihaZupan MihaZupan added area-System.Memory tenet-performance Performance related issue labels Nov 26, 2022
@MihaZupan MihaZupan added this to the 8.0.0 milestone Nov 26, 2022
@MihaZupan MihaZupan self-assigned this Nov 26, 2022
@MihaZupan
Member Author

cc: @EgorBo @stephentoub

@MihaZupan
Member Author

Any thoughts on this approach @dotnet/area-system-memory?

@dakersnar
Contributor

@MihaZupan Sorry for the delay, I'll take a look at this early next week.

@dakersnar
Contributor

@MihaZupan I'm new to this area and I think I'm missing some context to properly review this.

Can you give me a high-level summary of the intentions behind the changes in each file?

@MihaZupan
Member Author

Sure. The main idea behind this change is the observation that the char values we commonly search for are ASCII, in which case half of their UTF-16 representation will always be 0. When doing vectorized searches, this means we're mostly ignoring half of the input and half of the result of each comparison.
If we instead pack the input (narrow with saturation) before the comparison, we can process twice as many characters in each loop iteration. Such an optimization is only possible for values that aren't ambiguous after saturation ([0, 254]).

  • The core change is the introduction of the new SpanHelpers.Char.Packed.cs file that contains the PackedIndexOf workhorse implementation which mimics the existing IndexOf helpers, but uses the approach of packing the input. The file contains:
    • An internal CanUsePackedIndexOf helper method that determines whether the algorithm can be used for a given value.
    • The search methods themselves - PackedIndexOf, PackedIndexOfAny, etc.
  • The changes in SpanHelpers.T.cs are hooking into the existing IndexOf codepaths to delegate to the PackedIndexOf implementation if it's supported for the given value. If not, they fall back to the existing ("NonPacked") implementation.
  • Changes to Globalization/Ordinal.cs and String.Searching.cs are updating the callers where we know the value to be ASCII to take advantage of the new packed implementation directly, without incurring the cost of checking whether the value is ASCII again.
  • The CanUsePackedIndexOf helper I mentioned is intended to be "free" if the value is constant (common case). If the value isn't constant, the span.IndexOf path now incurs an additional check before calling into the appropriate implementation. Updates to files in /System/IndexOfAnyValues/* are avoiding this overhead for IndexOfAnyValues<char> implementations. They're not really the interesting part of the change.
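
The hook-in structure described in the list above might look roughly like the sketch below. Every name in it (`IndexOfDispatchSketch`, `PackedIsSupported`, `IsPackable`, and the two worker methods) is hypothetical, and the workers are scalar placeholders so that the example runs anywhere. It assumes the packed path is used only on x86 with SSE2 and only for values in [1, 254]. The point is the shape of the check: a small, inlinable range test that a JIT can in principle fold away when the value is a constant, followed by a single call into one worker.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch of the dispatch shape; not the runtime's actual code.
public static class IndexOfDispatchSketch
{
    // Assumption: the packed path is only enabled where SSE2 is available.
    public static bool PackedIsSupported => Sse2.IsSupported;

    // True for values in [1, 254]. Subtracting 1 maps 0 to uint.MaxValue,
    // so a single unsigned comparison rejects both 0 and anything above 254.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool IsPackable(char value) => unchecked((uint)(value - 1)) < 254u;

    public static int IndexOf(ReadOnlySpan<char> span, char value) =>
        PackedIsSupported && IsPackable(value)
            ? PackedIndexOf(span, value)
            : NonPackedIndexOf(span, value);

    // Placeholder: a real packed worker would narrow two input vectors
    // into one before comparing. Here it is a plain scalar loop.
    private static int PackedIndexOf(ReadOnlySpan<char> span, char value) =>
        ScalarIndexOf(span, value);

    // Placeholder for the existing ("NonPacked") worker.
    private static int NonPackedIndexOf(ReadOnlySpan<char> span, char value) =>
        ScalarIndexOf(span, value);

    private static int ScalarIndexOf(ReadOnlySpan<char> span, char value)
    {
        for (int i = 0; i < span.Length; i++)
        {
            if (span[i] == value)
                return i;
        }
        return -1;
    }
}
```

Callers that already know the value is ASCII could call the packed worker directly and skip the range test, which is the motivation given above for the changes in Ordinal.cs and String.Searching.cs.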

@dakersnar
Contributor

If we instead pack the input (narrow with saturation) before the comparison, we can process twice as many characters in each loop iteration. Such optimization is only possible for values that aren't ambiguous after saturation ([0, 254]).

To confirm, this is because any value that needs the full char to be represented will saturate to 255 when narrowed to a byte, correct?

@MihaZupan
Member Author

To confirm, this is because any value that needs the full char to be represented will saturate to 255 when narrowed to a byte, correct?

That's right.
In reality, the ranges are [0, 254] for ARM and [1, 254] for X86 because X86 only has signed pack instructions, and values will saturate to both 0 and 255.
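
The difference between the two ranges can be modeled per character in scalar code. The sketch below is only an illustration with hypothetical names (`SaturationSketch`, `NarrowSignedInput`, `NarrowUnsignedInput`). The first method models a pack that reads its 16-bit input as signed, as the x86 instruction does, and the second models an unsigned saturating narrow, which is assumed here to be the ARM behavior.

```csharp
using System;

// Hypothetical scalar model of the two saturating narrows.
public static class SaturationSketch
{
    // Input is interpreted as a SIGNED 16-bit value and clamped to [0, 255].
    // Chars at or above 0x8000 look negative and therefore collapse to 0.
    public static byte NarrowSignedInput(char c)
    {
        short s = unchecked((short)c);
        return (byte)Math.Clamp((int)s, 0, 255);
    }

    // Input is interpreted as UNSIGNED, so everything above 255 collapses to 255.
    public static byte NarrowUnsignedInput(char c) => (byte)Math.Min((int)c, 255);
}
```

With the signed-input model, both 0 and 255 can be produced by characters other than U+0000 and U+00FF, so only [1, 254] is safe to search for. With the unsigned-input model only 255 is ambiguous, which leaves [0, 254].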

Contributor

@dakersnar dakersnar left a comment

LGTM. Left a few clarifying questions.

For testing, I assume all these paths already had sufficient coverage, right?

@@ -558,7 +558,7 @@ public static ReadOnlyMemory<char> AsMemory(this string? text, Range range)
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         public static unsafe int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>?
         {
-            if (SpanHelpers.CanVectorizeAndBenefit<T>(span.Length))
+            if (RuntimeHelpers.IsBitwiseEquatable<T>())
Contributor

Can you explain this update?

Member Author

@MihaZupan MihaZupan Dec 16, 2022

@adamsitnik I see you added this in #73768, can you please clarify what the intent was?
As far as I can tell, this is just adding a redundant length check given that all the SpanHelpers implementations we're calling into also do the length check and have code for handling short inputs. Am I missing something (I didn't see any discussion about this on your PR)?

Member

@MihaZupan, @adamsitnik, was this ever answered?

Member Author

I don't believe so. Is the changed version causing issues somewhere?

Member

No, but it sounded like there might be extra work happening unnecessarily.

Member Author

I think that was the case before this change (we would inline a length check and 2 calls to the worker methods), now it should just be a call to the 1 worker method.

Member

I see, I misunderstood the comment

@MihaZupan
Member Author

For testing, I assume all these paths already had sufficient coverage, right?

Yes. I'll double-check that we don't accidentally end up losing substantial coverage of the existing (NonPacked) code paths if we're mostly testing with ASCII values.

In this POC PR, I added implementations that do such packing for IndexOf(char), IndexOfAny(char, char), and IndexOfAnyInRange(char, char)
If we're happy with the direction, I can add implementations for Contains(char), IndexOfAny(char, char, char), and their Last- counterparts as well if we feel they are useful.

I'll add the Contains(char) and IndexOfAny(char, char, char) to this PR (it's just more of the same idea).
Not sure how much we care about the Last- variants.

@MihaZupan
Member Author

MihaZupan commented Dec 16, 2022

Added Contains(char) and IndexOfAny(char, char, char) now.
I checked and we still have full test coverage of the existing and new methods.

Updated perf numbers
Method Toolchain Length Mean Error Ratio
Contains main 1 2.228 ns 0.0040 ns 1.00
Contains pr 1 2.216 ns 0.0127 ns 0.99
IndexOf main 1 2.065 ns 0.0041 ns 1.00
IndexOf pr 1 1.753 ns 0.0043 ns 0.85
IndexOfAny2Values main 1 2.902 ns 0.0073 ns 1.00
IndexOfAny2Values pr 1 2.537 ns 0.0030 ns 0.87
IndexOfAny3Values main 1 2.975 ns 0.0015 ns 1.00
IndexOfAny3Values pr 1 3.094 ns 0.0041 ns 1.04
IndexOfAnyInRange main 1 2.196 ns 0.0027 ns 1.00
IndexOfAnyInRange pr 1 1.996 ns 0.0039 ns 0.91
Contains main 7 3.435 ns 0.0033 ns 1.00
Contains pr 7 3.465 ns 0.0020 ns 1.01
IndexOf main 7 3.726 ns 0.0034 ns 1.00
IndexOf pr 7 3.755 ns 0.0017 ns 1.01
IndexOfAny2Values main 7 5.915 ns 0.0023 ns 1.00
IndexOfAny2Values pr 7 5.960 ns 0.0092 ns 1.01
IndexOfAny3Values main 7 8.116 ns 0.0047 ns 1.00
IndexOfAny3Values pr 7 7.975 ns 0.0036 ns 0.98
IndexOfAnyInRange main 7 5.145 ns 0.0038 ns 1.00
IndexOfAnyInRange pr 7 3.678 ns 0.0031 ns 0.71
Contains main 8 2.089 ns 0.0067 ns 1.00
Contains pr 8 2.293 ns 0.0019 ns 1.10
IndexOf main 8 2.693 ns 0.0040 ns 1.00
IndexOf pr 8 2.312 ns 0.0019 ns 0.86
IndexOfAny2Values main 8 2.956 ns 0.0012 ns 1.00
IndexOfAny2Values pr 8 2.886 ns 0.0042 ns 0.98
IndexOfAny3Values main 8 3.239 ns 0.0022 ns 1.00
IndexOfAny3Values pr 8 3.436 ns 0.0022 ns 1.06
IndexOfAnyInRange main 8 2.887 ns 0.0021 ns 1.00
IndexOfAnyInRange pr 8 2.379 ns 0.0115 ns 0.82
Contains main 9 2.193 ns 0.0058 ns 1.00
Contains pr 9 2.296 ns 0.0015 ns 1.05
IndexOf main 9 2.914 ns 0.0038 ns 1.00
IndexOf pr 9 2.301 ns 0.0011 ns 0.79
IndexOfAny2Values main 9 3.438 ns 0.0038 ns 1.00
IndexOfAny2Values pr 9 2.882 ns 0.0045 ns 0.84
IndexOfAny3Values main 9 3.727 ns 0.0032 ns 1.00
IndexOfAny3Values pr 9 3.431 ns 0.0014 ns 0.92
IndexOfAnyInRange main 9 2.891 ns 0.0019 ns 1.00
IndexOfAnyInRange pr 9 2.395 ns 0.0112 ns 0.83
Contains main 15 2.194 ns 0.0055 ns 1.00
Contains pr 15 2.286 ns 0.0016 ns 1.04
IndexOf main 15 2.907 ns 0.0042 ns 1.00
IndexOf pr 15 2.313 ns 0.0037 ns 0.80
IndexOfAny2Values main 15 3.450 ns 0.0025 ns 1.00
IndexOfAny2Values pr 15 2.895 ns 0.0065 ns 0.84
IndexOfAny3Values main 15 3.723 ns 0.0015 ns 1.00
IndexOfAny3Values pr 15 3.428 ns 0.0021 ns 0.92
IndexOfAnyInRange main 15 2.886 ns 0.0024 ns 1.00
IndexOfAnyInRange pr 15 2.334 ns 0.0054 ns 0.81
Contains main 16 2.194 ns 0.0063 ns 1.00
Contains pr 16 2.298 ns 0.0020 ns 1.05
IndexOf main 16 2.497 ns 0.0103 ns 1.00
IndexOf pr 16 2.336 ns 0.0058 ns 0.94
IndexOfAny2Values main 16 3.010 ns 0.0021 ns 1.00
IndexOfAny2Values pr 16 2.862 ns 0.0016 ns 0.95
IndexOfAny3Values main 16 3.333 ns 0.0024 ns 1.00
IndexOfAny3Values pr 16 3.428 ns 0.0024 ns 1.03
IndexOfAnyInRange main 16 2.863 ns 0.0035 ns 1.00
IndexOfAnyInRange pr 16 2.376 ns 0.0111 ns 0.83
Contains main 17 2.391 ns 0.0018 ns 1.00
Contains pr 17 2.422 ns 0.0024 ns 1.01
IndexOf main 17 2.688 ns 0.0030 ns 1.00
IndexOf pr 17 2.564 ns 0.0032 ns 0.95
IndexOfAny2Values main 17 3.439 ns 0.0019 ns 1.00
IndexOfAny2Values pr 17 2.943 ns 0.0023 ns 0.86
IndexOfAny3Values main 17 3.840 ns 0.0071 ns 1.00
IndexOfAny3Values pr 17 3.521 ns 0.0017 ns 0.92
IndexOfAnyInRange main 17 2.875 ns 0.0037 ns 1.00
IndexOfAnyInRange pr 17 2.547 ns 0.0028 ns 0.89
Contains main 32 2.834 ns 0.0238 ns 1.00
Contains pr 32 2.419 ns 0.0011 ns 0.85
IndexOf main 32 2.890 ns 0.0035 ns 1.00
IndexOf pr 32 2.575 ns 0.0040 ns 0.89
IndexOfAny2Values main 32 3.435 ns 0.0020 ns 1.00
IndexOfAny2Values pr 32 2.943 ns 0.0023 ns 0.86
IndexOfAny3Values main 32 4.146 ns 0.0029 ns 1.00
IndexOfAny3Values pr 32 3.488 ns 0.0021 ns 0.84
IndexOfAnyInRange main 32 2.865 ns 0.0046 ns 1.00
IndexOfAnyInRange pr 32 2.548 ns 0.0027 ns 0.89
Contains main 33 2.903 ns 0.0102 ns 1.00
Contains pr 33 2.813 ns 0.0022 ns 0.97
IndexOf main 33 3.103 ns 0.0035 ns 1.00
IndexOf pr 33 2.747 ns 0.0030 ns 0.89
IndexOfAny2Values main 33 3.871 ns 0.0022 ns 1.00
IndexOfAny2Values pr 33 3.283 ns 0.0029 ns 0.85
IndexOfAny3Values main 33 4.738 ns 0.0033 ns 1.00
IndexOfAny3Values pr 33 4.349 ns 0.0020 ns 0.92
IndexOfAnyInRange main 33 3.777 ns 0.0169 ns 1.00
IndexOfAnyInRange pr 33 3.018 ns 0.0029 ns 0.80
Contains main 1000 32.797 ns 0.0229 ns 1.00
Contains pr 1000 17.534 ns 0.0275 ns 0.53
IndexOf main 1000 39.024 ns 0.0396 ns 1.00
IndexOf pr 1000 23.966 ns 0.0109 ns 0.61
IndexOfAny2Values main 1000 44.677 ns 0.0203 ns 1.00
IndexOfAny2Values pr 1000 23.087 ns 0.2666 ns 0.52
IndexOfAny3Values main 1000 57.412 ns 0.1221 ns 1.00
IndexOfAny3Values pr 1000 31.902 ns 0.2688 ns 0.56
IndexOfAnyInRange main 1000 56.802 ns 0.0273 ns 1.00
IndexOfAnyInRange pr 1000 26.263 ns 0.1187 ns 0.46
Contains main 100000 3,686.442 ns 3.9490 ns 1.00
Contains pr 100000 1,923.226 ns 3.0266 ns 0.52
IndexOf main 100000 3,827.157 ns 1.4482 ns 1.00
IndexOf pr 100000 1,918.985 ns 1.6545 ns 0.50
IndexOfAny2Values main 100000 4,436.271 ns 2.3896 ns 1.00
IndexOfAny2Values pr 100000 2,559.311 ns 2.1281 ns 0.58
IndexOfAny3Values main 100000 5,416.439 ns 2.5865 ns 1.00
IndexOfAny3Values pr 100000 3,144.892 ns 2.2510 ns 0.58
IndexOfAnyInRange main 100000 5,415.066 ns 2.0259 ns 1.00
IndexOfAnyInRange pr 100000 2,570.033 ns 2.2194 ns 0.47

Contributor

@dakersnar dakersnar left a comment

LGTM

@MihaZupan
Member Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@MihaZupan
Member Author

It appears that the packing is noticeably more expensive on ARM in comparison.
While the approach can improve throughput somewhat, the regression for cases where matches are close to the start seems unacceptable. I will update the logic to only apply when running on X86.

ARM64 benchmarks (no match)
Method Length Mean Error
IndexOf 8 3.173 ns 0.0074 ns
IndexOfAny2Values 8 4.455 ns 0.0013 ns
IndexOfAny3Values 8 3.814 ns 0.0048 ns
PackedIndexOf 8 3.343 ns 0.0237 ns
PackedIndexOfAny2Values 8 3.808 ns 0.0104 ns
PackedIndexOfAny3Values 8 4.283 ns 0.0029 ns
IndexOf 9 3.741 ns 0.0144 ns
IndexOfAny2Values 9 5.672 ns 0.0098 ns
IndexOfAny3Values 9 5.095 ns 0.0245 ns
PackedIndexOf 9 3.622 ns 0.0072 ns
PackedIndexOfAny2Values 9 4.226 ns 0.0087 ns
PackedIndexOfAny3Values 9 4.871 ns 0.0148 ns
IndexOf 16 4.141 ns 0.0007 ns
IndexOfAny2Values 16 5.607 ns 0.0002 ns
IndexOfAny3Values 16 5.344 ns 0.0071 ns
PackedIndexOf 16 3.366 ns 0.0263 ns
PackedIndexOfAny2Values 16 3.810 ns 0.0142 ns
PackedIndexOfAny3Values 16 4.285 ns 0.0016 ns
IndexOf 32 5.474 ns 0.0002 ns
IndexOfAny2Values 32 6.940 ns 0.0044 ns
IndexOfAny3Values 32 7.912 ns 0.0124 ns
PackedIndexOf 32 5.002 ns 0.0205 ns
PackedIndexOfAny2Values 32 5.233 ns 0.0050 ns
PackedIndexOfAny3Values 32 6.005 ns 0.0073 ns
IndexOf 128 15.234 ns 0.0050 ns
IndexOfAny2Values 128 18.834 ns 0.0183 ns
IndexOfAny3Values 128 26.844 ns 0.1520 ns
PackedIndexOf 128 14.835 ns 0.0172 ns
PackedIndexOfAny2Values 128 17.159 ns 0.2145 ns
PackedIndexOfAny3Values 128 18.359 ns 0.0738 ns
IndexOf 512 55.640 ns 0.0081 ns
IndexOfAny2Values 512 74.750 ns 0.0188 ns
IndexOfAny3Values 512 93.405 ns 0.0337 ns
PackedIndexOf 512 47.795 ns 0.0702 ns
PackedIndexOfAny2Values 512 65.908 ns 0.2881 ns
PackedIndexOfAny3Values 512 79.024 ns 0.0501 ns

ARM64 benchmarks (the first character matches)
Method Length Mean Error
IndexOf 8 3.952 ns 0.0183 ns
IndexOfAny2Values 8 4.110 ns 0.0237 ns
IndexOfAny3Values 8 4.545 ns 0.0016 ns
PackedIndexOf 8 5.972 ns 0.0140 ns
PackedIndexOfAny2Values 8 6.750 ns 0.0031 ns
PackedIndexOfAny3Values 8 7.356 ns 0.0046 ns
IndexOf 9 3.638 ns 0.0167 ns
IndexOfAny2Values 9 4.476 ns 0.0256 ns
IndexOfAny3Values 9 4.549 ns 0.0020 ns
PackedIndexOf 9 6.668 ns 0.0079 ns
PackedIndexOfAny2Values 9 7.457 ns 0.0016 ns
PackedIndexOfAny3Values 9 8.082 ns 0.0031 ns
IndexOf 16 3.613 ns 0.0304 ns
IndexOfAny2Values 16 4.134 ns 0.0233 ns
IndexOfAny3Values 16 4.546 ns 0.0018 ns
PackedIndexOf 16 5.981 ns 0.0103 ns
PackedIndexOfAny2Values 16 6.758 ns 0.0055 ns
PackedIndexOfAny3Values 16 7.364 ns 0.0026 ns
IndexOf 32 3.646 ns 0.0420 ns
IndexOfAny2Values 32 4.110 ns 0.0236 ns
IndexOfAny3Values 32 4.546 ns 0.0018 ns
PackedIndexOf 32 5.233 ns 0.0221 ns
PackedIndexOfAny2Values 32 5.988 ns 0.0044 ns
PackedIndexOfAny3Values 32 6.673 ns 0.0025 ns
IndexOf 128 3.638 ns 0.0275 ns
IndexOfAny2Values 128 4.110 ns 0.0284 ns
IndexOfAny3Values 128 5.053 ns 0.0077 ns
PackedIndexOf 128 5.215 ns 0.0140 ns
PackedIndexOfAny2Values 128 5.981 ns 0.0080 ns
PackedIndexOfAny3Values 128 6.652 ns 0.0018 ns

@MihaZupan
Member Author

@tannergooding any concerns about using this sort of approach only on X86, given that it's not profitable on ARM?

@build-analysis build-analysis bot mentioned this pull request Dec 30, 2022
@MihaZupan
Member Author

All failures are known according to build-analysis

@MihaZupan MihaZupan merged commit ac2ffdf into dotnet:main Jan 3, 2023
radekdoulik added a commit to radekdoulik/runtime that referenced this pull request Jan 5, 2023
This should avoid the size regression on WebAssembly and possibly other
platforms without Sse2.

The regression is a side effect of dotnet#78861
which uses `PackedSpanHelpers.CanUsePackedIndexOf (!!T)` and TShouldUsePacked.Value
to guard the usage of PackedSpanHelpers.

Because these involve generics, illinker is unable to link
the PackedSpanHelpers type away and that pulls other parts in, like
System.Runtime.Intrinsics.X86.* types. See https://gist.github.com/radekdoulik/c0b52247d472f69bcf983ade78a924ea
for a more complete list.

This change gets us back 9,216 bytes in the case of the app used to repro
the regression.

    ...
      -             Type System.PackedSpanHelpers
      -             Type System.Runtime.Intrinsics.X86.X86Base
      -             Type System.Runtime.Intrinsics.X86.Sse
      -             Type System.Runtime.Intrinsics.X86.Sse2
    Summary:
      -       9,216 File size -0.76% (of 1,215,488)
      -       2,744 Metadata size -0.43% (of 636,264)
      -           4 Types count
radekdoulik added a commit that referenced this pull request Jan 9, 2023
* Use PackedIndexOfIsSupported checks in more places

* Update src/libraries/System.Private.CoreLib/src/System/IndexOfAnyValues/IndexOfAnyValues.cs

Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>

* Update src/libraries/System.Private.CoreLib/src/System/IndexOfAnyValues/IndexOfAnyValues.cs

Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>

* Feedback

Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
@dotnet dotnet locked as resolved and limited conversation to collaborators Feb 9, 2023