-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase #1816
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will do the wrong thing for characters outside the BMP (and there are some scripts outside the BMP which have casing data). I think that you will need to walk the string using the U16_NEXT macros so we can operate on full code units.
If the overhead that adds is unacceptable, You could consider trying to have a fast path where you unroll the loop and use bit masking to check for surrogates and then fallback to a slower walk via codeunits if you hit some.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll take another pass at it. Could you share some example strings/substrings that should fail with what I've currently got here but should pass when handling full code units correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deseret letters are outside the BMP and have casing data. You could use:
'DESERET CAPITAL LETTER LONG I' (U+10400) and 'DESERET SMALL LETTER LONG I' (U+10428) as a case pair.
I just looked at the windows behavior and it confuses me a bit.
I would expect that the following would return 0:
"\U00010428".IndexOf("\U00010400", StringComparison.OrdinalIgnoreCase);
But, on Windows 10 (on Full .NET) this returns -1. The "equivalent" code of:
"\U00010428".ToUpperInvariant().IndexOf("\U00010400".ToUpperInvariant(), StringComparison.Ordinal);
Returns "0" as expected.
Per MSDN:
OrdinalIgnoreCase property treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language.
You can generate similar cases for string equality where OridinalIgnoreCase does not behave the same way vs Ordinal with a ToUpperInvariant beforehand.
It's possible that the windows routines don't do the right thing with characters outside the BMP. I'm unsure if we should be copying that behavior, however. It would be interesting to know what @tarekgh thinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invariant culture casing (and the ordinal casing) in Windows doesn't support Surrogate casing. I think Windows do that to avoid changing Invariant behavior. but I don't see any reason why we don't support surrogate casing for ordinal. we may diverge from the desktop here but we can support that in desktop though some quirk. I think using en-US casing for surrogate will do the trick. I didn't try it myself but I expect it will just work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, guys. I pushed another commit that updates the solution to use U16_NEXT / U16_FWD_1. The LastIndexOfOrdinalIgnoreCase could likely be improved further, as right now we walk forwards rather than backwards, but I wasn't sure enough about the various ICU macros to do it well going backwards; we can leave that for a future change.
28388d3
to
fc8a2d2
Compare
Out of interest, why does CultureInfo have it's own implementation of ordinal string compares? |
Our current implementation of IndexOfOrdinal for strings on Unix uses Substring to get the piece of the source string we care about; this results in an unnecessary allocation / string copy. When using OrdinalIgnoreCase, we also convert both the source and search strings to upper-case using ToUpperInvariant, resulting in more allocations. And our LastIndexOfOrdinal implementation delegates to IndexOfOrdinal repeatedly, incurring such allocations potentially multiple times. This change reimplements Ordinal searching in managed code to not use Substring, and it implements OrdinalIgnoreCase searching via new functions exposed in the native globalization shim, so as to use ICU without having to make managed/native transitions for each character. With the changes, {Last}IndexOf with Ordinal/OrdinalIgnoreCase are now allocateion-free (as you'd expect), and throughput when startIndex/count and/or OrdinalIgnoreCase are used is increased significantly, on my machine anywhere from 20% to 3x, depending on the inputs.
fc8a2d2
to
91f0f8b
Compare
@ellismg, when you get a chance, mind taking another look at this? |
LGTM. @bbowyersmyth I don't understand your question. |
Thanks, Matt. |
Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase
…f_perf Improve string.{Last}IndexOf perf on Unix for Ordinal/OrdinalIgnoreCase Commit migrated from dotnet/coreclr@527cda0
Our current implementation of IndexOfOrdinal for strings on Unix uses Substring to get the piece of the source string we care about; this results in an unnecessary allocation / string copy. When using OrdinalIgnoreCase, we also convert both the source and search strings to upper-case using ToUpperInvariant, resulting in more allocations. And our LastIndexOfOrdinal implementation delegates to IndexOfOrdinal repeatedly, incurring such allocations potentially multiple times.
This change reimplements Ordinal searching in managed code to not use Substring, and it implements OrdinalIgnoreCase searching via new functions exposed in the native globalization shim, so as to use ICU without having to make managed/native transitions for each character.
With the changes, {Last}IndexOf with Ordinal/OrdinalIgnoreCase are now allocateion-free (as you'd expect), and throughput when startIndex/count and/or OrdinalIgnoreCase are used is increased significantly, on my machine anywhere from 20% to 3x, depending on the inputs.
cc: @ellismg, @tarekgh