Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase #1816

stephentoub · 2015-10-20T22:55:28Z

Our current implementation of IndexOfOrdinal for strings on Unix uses Substring to get the piece of the source string we care about; this results in an unnecessary allocation / string copy. When using OrdinalIgnoreCase, we also convert both the source and search strings to upper-case using ToUpperInvariant, resulting in more allocations. And our LastIndexOfOrdinal implementation delegates to IndexOfOrdinal repeatedly, incurring such allocations potentially multiple times.

This change reimplements Ordinal searching in managed code to not use Substring, and it implements OrdinalIgnoreCase searching via new functions exposed in the native globalization shim, so as to use ICU without having to make managed/native transitions for each character.

With the changes, {Last}IndexOf with Ordinal/OrdinalIgnoreCase are now allocateion-free (as you'd expect), and throughput when startIndex/count and/or OrdinalIgnoreCase are used is increased significantly, on my machine anywhere from 20% to 3x, depending on the inputs.

cc: @ellismg, @tarekgh

ellismg · 2015-10-20T23:12:00Z

src/corefx/System.Globalization.Native/collation.cpp

This will do the wrong thing for characters outside the BMP (and there are some scripts outside the BMP which have casing data). I think that you will need to walk the string using the U16_NEXT macros so we can operate on full code units.

If the overhead that adds is unacceptable, You could consider trying to have a fast path where you unroll the loop and use bit masking to check for surrogates and then fallback to a slower walk via codeunits if you hit some.

Ok, I'll take another pass at it. Could you share some example strings/substrings that should fail with what I've currently got here but should pass when handling full code units correctly?

The deseret letters are outside the BMP and have casing data. You could use:

'DESERET CAPITAL LETTER LONG I' (U+10400) and 'DESERET SMALL LETTER LONG I' (U+10428) as a case pair.

I just looked at the windows behavior and it confuses me a bit.

I would expect that the following would return 0:

"\U00010428".IndexOf("\U00010400", StringComparison.OrdinalIgnoreCase);

But, on Windows 10 (on Full .NET) this returns -1. The "equivalent" code of:

"\U00010428".ToUpperInvariant().IndexOf("\U00010400".ToUpperInvariant(), StringComparison.Ordinal);

Returns "0" as expected.

Per MSDN:

OrdinalIgnoreCase property treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language.

You can generate similar cases for string equality where OridinalIgnoreCase does not behave the same way vs Ordinal with a ToUpperInvariant beforehand.

It's possible that the windows routines don't do the right thing with characters outside the BMP. I'm unsure if we should be copying that behavior, however. It would be interesting to know what @tarekgh thinks.

Invariant culture casing (and the ordinal casing) in Windows doesn't support Surrogate casing. I think Windows do that to avoid changing Invariant behavior. but I don't see any reason why we don't support surrogate casing for ordinal. we may diverge from the desktop here but we can support that in desktop though some quirk. I think using en-US casing for surrogate will do the trick. I didn't try it myself but I expect it will just work

Thanks, guys. I pushed another commit that updates the solution to use U16_NEXT / U16_FWD_1. The LastIndexOfOrdinalIgnoreCase could likely be improved further, as right now we walk forwards rather than backwards, but I wasn't sure enough about the various ICU macros to do it well going backwards; we can leave that for a future change.

bbowyersmyth · 2015-10-21T03:40:45Z

Out of interest, why does CultureInfo have it's own implementation of ordinal string compares?

Our current implementation of IndexOfOrdinal for strings on Unix uses Substring to get the piece of the source string we care about; this results in an unnecessary allocation / string copy. When using OrdinalIgnoreCase, we also convert both the source and search strings to upper-case using ToUpperInvariant, resulting in more allocations. And our LastIndexOfOrdinal implementation delegates to IndexOfOrdinal repeatedly, incurring such allocations potentially multiple times. This change reimplements Ordinal searching in managed code to not use Substring, and it implements OrdinalIgnoreCase searching via new functions exposed in the native globalization shim, so as to use ICU without having to make managed/native transitions for each character. With the changes, {Last}IndexOf with Ordinal/OrdinalIgnoreCase are now allocateion-free (as you'd expect), and throughput when startIndex/count and/or OrdinalIgnoreCase are used is increased significantly, on my machine anywhere from 20% to 3x, depending on the inputs.

stephentoub · 2015-10-22T20:09:44Z

@ellismg, when you get a chance, mind taking another look at this?

ellismg · 2015-10-22T20:43:00Z

LGTM. @bbowyersmyth I don't understand your question.

stephentoub · 2015-10-22T20:47:48Z

Thanks, Matt.

Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase

…f_perf Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase Commit migrated from dotnet/coreclr@527cda0

dnfclas added the cla-already-signed label Oct 20, 2015

dotnet-bot added the 2 - In Progress label Oct 20, 2015

stephentoub mentioned this pull request Oct 20, 2015

Add more string.IndexOf/LastIndexOf tests dotnet/corefx#4000

Merged

ellismg reviewed Oct 20, 2015
View reviewed changes

stephentoub force-pushed the string_indexof_perf branch from 28388d3 to fc8a2d2 Compare October 21, 2015 02:35

stephentoub added 2 commits October 22, 2015 14:31

Update IndexOfOrdinalIgnoreCase to use full code units

91f0f8b

stephentoub force-pushed the string_indexof_perf branch from fc8a2d2 to 91f0f8b Compare October 22, 2015 18:32

stephentoub added a commit that referenced this pull request Oct 22, 2015

Merge pull request #1816 from stephentoub/string_indexof_perf

527cda0

Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase

stephentoub merged commit 527cda0 into dotnet:master Oct 22, 2015

dotnet-bot removed the 2 - In Progress label Oct 22, 2015

stephentoub deleted the string_indexof_perf branch October 22, 2015 20:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase #1816

Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase #1816

Uh oh!

stephentoub commented Oct 20, 2015

Uh oh!

ellismg Oct 20, 2015

Uh oh!

stephentoub Oct 20, 2015

Uh oh!

ellismg Oct 20, 2015

Uh oh!

tarekgh Oct 21, 2015

Uh oh!

stephentoub Oct 21, 2015

Uh oh!

bbowyersmyth commented Oct 21, 2015

Uh oh!

stephentoub commented Oct 22, 2015

Uh oh!

ellismg commented Oct 22, 2015

Uh oh!

stephentoub commented Oct 22, 2015

Uh oh!

Uh oh!

Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase #1816

Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase #1816

Uh oh!

Conversation

stephentoub commented Oct 20, 2015

Uh oh!

ellismg Oct 20, 2015

Choose a reason for hiding this comment

Uh oh!

stephentoub Oct 20, 2015

Choose a reason for hiding this comment

Uh oh!

ellismg Oct 20, 2015

Choose a reason for hiding this comment

Uh oh!

tarekgh Oct 21, 2015

Choose a reason for hiding this comment

Uh oh!

stephentoub Oct 21, 2015

Choose a reason for hiding this comment

Uh oh!

bbowyersmyth commented Oct 21, 2015

Uh oh!

stephentoub commented Oct 22, 2015

Uh oh!

ellismg commented Oct 22, 2015

Uh oh!

stephentoub commented Oct 22, 2015

Uh oh!

Uh oh!