Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

stephentoub
Copy link
Member

Our current implementation of IndexOfOrdinal for strings on Unix uses Substring to get the piece of the source string we care about; this results in an unnecessary allocation / string copy. When using OrdinalIgnoreCase, we also convert both the source and search strings to upper-case using ToUpperInvariant, resulting in more allocations. And our LastIndexOfOrdinal implementation delegates to IndexOfOrdinal repeatedly, incurring such allocations potentially multiple times.

This change reimplements Ordinal searching in managed code to not use Substring, and it implements OrdinalIgnoreCase searching via new functions exposed in the native globalization shim, so as to use ICU without having to make managed/native transitions for each character.

With the changes, {Last}IndexOf with Ordinal/OrdinalIgnoreCase are now allocateion-free (as you'd expect), and throughput when startIndex/count and/or OrdinalIgnoreCase are used is increased significantly, on my machine anywhere from 20% to 3x, depending on the inputs.

cc: @ellismg, @tarekgh

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will do the wrong thing for characters outside the BMP (and there are some scripts outside the BMP which have casing data). I think that you will need to walk the string using the U16_NEXT macros so we can operate on full code units.

If the overhead that adds is unacceptable, You could consider trying to have a fast path where you unroll the loop and use bit masking to check for surrogates and then fallback to a slower walk via codeunits if you hit some.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll take another pass at it. Could you share some example strings/substrings that should fail with what I've currently got here but should pass when handling full code units correctly?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deseret letters are outside the BMP and have casing data. You could use:

'DESERET CAPITAL LETTER LONG I' (U+10400) and 'DESERET SMALL LETTER LONG I' (U+10428) as a case pair.

I just looked at the windows behavior and it confuses me a bit.

I would expect that the following would return 0:

"\U00010428".IndexOf("\U00010400", StringComparison.OrdinalIgnoreCase);

But, on Windows 10 (on Full .NET) this returns -1. The "equivalent" code of:

"\U00010428".ToUpperInvariant().IndexOf("\U00010400".ToUpperInvariant(), StringComparison.Ordinal);

Returns "0" as expected.

Per MSDN:

OrdinalIgnoreCase property treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language.

You can generate similar cases for string equality where OridinalIgnoreCase does not behave the same way vs Ordinal with a ToUpperInvariant beforehand.

It's possible that the windows routines don't do the right thing with characters outside the BMP. I'm unsure if we should be copying that behavior, however. It would be interesting to know what @tarekgh thinks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invariant culture casing (and the ordinal casing) in Windows doesn't support Surrogate casing. I think Windows do that to avoid changing Invariant behavior. but I don't see any reason why we don't support surrogate casing for ordinal. we may diverge from the desktop here but we can support that in desktop though some quirk. I think using en-US casing for surrogate will do the trick. I didn't try it myself but I expect it will just work

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, guys. I pushed another commit that updates the solution to use U16_NEXT / U16_FWD_1. The LastIndexOfOrdinalIgnoreCase could likely be improved further, as right now we walk forwards rather than backwards, but I wasn't sure enough about the various ICU macros to do it well going backwards; we can leave that for a future change.

@bbowyersmyth
Copy link

Out of interest, why does CultureInfo have it's own implementation of ordinal string compares?

Our current implementation of IndexOfOrdinal for strings on Unix uses Substring to get the piece of the source string we care about; this results in an unnecessary allocation / string copy.  When using OrdinalIgnoreCase, we also convert both the source and search strings to upper-case using ToUpperInvariant, resulting in more allocations.  And our LastIndexOfOrdinal implementation delegates to IndexOfOrdinal repeatedly, incurring such allocations potentially multiple times.

This change reimplements Ordinal searching in managed code to not use Substring, and it implements OrdinalIgnoreCase searching via new functions exposed in the native globalization shim, so as to use ICU without having to make managed/native transitions for each character.

With the changes, {Last}IndexOf with Ordinal/OrdinalIgnoreCase are now allocateion-free (as you'd expect), and throughput when startIndex/count and/or OrdinalIgnoreCase are used is increased significantly, on my machine anywhere from 20% to 3x, depending on the inputs.
@stephentoub
Copy link
Member Author

@ellismg, when you get a chance, mind taking another look at this?

@ellismg
Copy link

ellismg commented Oct 22, 2015

LGTM. @bbowyersmyth I don't understand your question.

@stephentoub
Copy link
Member Author

Thanks, Matt.

stephentoub added a commit that referenced this pull request Oct 22, 2015
Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase
@stephentoub stephentoub merged commit 527cda0 into dotnet:master Oct 22, 2015
@stephentoub stephentoub deleted the string_indexof_perf branch October 22, 2015 20:47
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…f_perf

Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase

Commit migrated from dotnet/coreclr@527cda0
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants