Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU comparison routines should use case folding, not case mapping #27540

Closed

Conversation

GrabYourPitchforks
Copy link
Member

Fixes #26961.

Adjusts OrdinalIgnoreCase string comparison routines (on ICU only) to use case folding instead of case mapping.

Open question: What should we do about the special-casing logic for U+0131 and U+0049? This PR aside, there are tons of differences in the case mapping tables between ICU and NLS, and the pair that is checked for here is only one such difference. I don't believe it's feasible for us to carry the delta between ICU and NLS within our runtime, which means we should probably scrap the below check and say "we're following the same behavior ICU has."

if (one == 0x0131 || two == 0x0131)
{
// On Windows with InvariantCulture, the LATIN SMALL LETTER DOTLESS I (U+0131)
// capitalizes to itself, whereas with ICU it capitalizes to LATIN CAPITAL LETTER I (U+0049).
// We special case it to match the Windows invariant behavior.
return FALSE;
}

@jkotas
Copy link
Member

jkotas commented Feb 1, 2020

I don't believe it's feasible for us to carry the delta between ICU and NLS within our runtime, which means we should probably scrap the below check and say "we're following the same behavior ICU has."

+1

@GrabYourPitchforks
Copy link
Member Author

@tarekgh @stephentoub Do you have any thoughts on the open question raised at #27540 (comment)? I'm trying to gauge whether it would be best to remove the special-case and rely solely on ICU's built-in mapping tables.

@tarekgh
Copy link
Member

tarekgh commented Feb 5, 2020

Do you have any thoughts on the open question raised at #27540 (comment)? I'm trying to gauge whether it would be best to remove the special-case and rely solely on ICU's built-in mapping tables.

In general, we are moving towards ICU even on Windows. I prefer removing this special case to serve in our future direction. We already have discrepancies between Windows and ICU anyway which I don't think this case will be a big deal.

@GrabYourPitchforks
Copy link
Member Author

It turns out the special case for the Turkish I isn't a problem after all since with this change we're using simple case folding instead of upper case mapping.

I folds to i
i folds to i
İ folds to İ
ı folds to ı

So this means that under an OrdinalIgnoreCase comparer, the code point İ (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE) will never match any of { I, i, ı }. And the code point ı (U+0131 LATIN SMALL LETTER DOTLESS I) will never match any of { I, i, İ }.

This is the desired behavior anyway, so we can remove the special case.

@GrabYourPitchforks GrabYourPitchforks added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Feb 5, 2020
@GrabYourPitchforks
Copy link
Member Author

I've marked this NO MERGE for now because it also impacts APIs like string.GetHashCodeOrdinalIgnoreCase. Will create some unit tests to validate that string and CompareInfo are in sync here.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Feb 5, 2020

There is also potentially security impact for this change. For example, the following mappings are valid under a case folding mechanism:

  • ſ (U+017F LATIN SMALL LETTER LONG S) folds to s
  • (U+212A KELVIN SIGN) folds to k

(The above data is from https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt.)

This means, for instance, that after this change the expression string.Equals("Microſoft", "Microsoft", StringComparison.OrdinalIgnoreCase) will evaluate to true when running under ICU. Normally this wouldn't be an issue, but we do recommend that developers use StringComparison.OrdinalIgnoreCase for security-related code, and this change in behavior might be surprising.

One option is to re-introduce a special case which says "if one of the characters to check is ASCII and the other character is non-ASCII, they're never equal, regardless of what ICU case folding says." But if we go down this path we're going to have our own semantics start to creep on top of ICU's own semantics, and I don't know just where that road leads.

Edit: Jeremy kindly confirmed offline for me that the expression string.Equals("Microſoft", "Microsoft", StringComparison.OrdinalIgnoreCase) already evaluates to true on Linux (ICU) and false on Windows (NLS) today, even without the changes in this PR.

@bartonjs
Copy link
Member

bartonjs commented Feb 5, 2020

It feels strange to me that something would map/fold/whatever across any two of { ASCII, non-ASCII BMP, !BMP }; because I'd personally feel like code of the form

private static readonly HashSet<string> s_knownIds = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "fantasy",
};

...

if (s_knownIds.Contains(input))
{
    Process(input);
}

has effectively asserted that Process(string) never needs to touch non-ASCII data. (Or, if I had some \u stuff in there, that there aren't surrogate pairs to deal with). Using the already-present form would solve that; but that's case-normalizing instead of input-checking...

@GrabYourPitchforks
Copy link
Member Author

Based on the earlier feedback here and at #32247 I've updated the logic as follows. These changes are only when running under non-Windows platforms.

  • ToUpperInvariant / ToLowerInvariant no longer special-case any characters. So calling ToLowerInvariant('İ') will result in the character 'i'. This uses the data straight from ICU with no adjustment.
  • Comparing two char values under an ordinal case-insensitive comparer (e.g., string.Equals(a, b, StringComparison.OrdinalIgnoreCase)) will now return not equal if one of the chars is ASCII and another char is non-ASCII, even if the two chars have the same simple case folding under ICU. For example, string.Equals("İ", "i", StringComparison.OrdinalIgnoreCase) will return false.

This implies that string.Equals(a, b, StringComparison.OrdinalIgnoreCase) and a.ToUpperInvariant() == b.ToUpperInvariant() do not have the same semantic meaning going forward.

Another notable consequence of this change is how strings are sorted under a case-insensitive comparer. Previously, since ASCII strings were normalized to uppercase on all platforms, the strings "A" and "a" would sort before the string "_" under StringComparer.OrdinalIgnoreCase. The reason for this is that 'A' and 'a' are both normalized to 'A' (U+0041), which is numerically before '_' (U+005F).

With this PR:

  • On Windows, the strings "A" and "a" will still sort before the string "_" under StringComparer.OrdinalIgnoreCase.
  • On non-Windows platforms, the strings "A" and "a" will sort after the string "_" under StringComparer.OrdinalIgnoreCase. This is because 'A' and 'a' are now both normalized to 'a' (U+0061), which is numerically after '_' (U+005F).

This behavior is a platform-specific behavior and is not affected by the value of the GlobalizationMode.Invariant flag.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Feb 15, 2020

There's still an open question as to whether we should block non-ASCII -> ASCII conversion during a ToUpperInvariant call. For example, with this PR as it stands today, the following behaviors hold (under ICU):

string.Equals("administrator", "adminiſtrator", StringComparison.OrdinalIgnoreCase) // FALSE
"administrator".ToUpperInvariant() == "adminiſtrator".ToUpperInvariant() // TRUE

If needed we could block this sort of conversion entirely without carrying the ICU / NLS delta.

See also:

@tarekgh
Copy link
Member

tarekgh commented Feb 16, 2020

There's still an open question as to whether we should block non-ASCII -> ASCII conversion during a ToUpperInvariant call

I wouldn't recommend blocking that for invariant. Invariant operations still cultural operations and doesn't make sense to block this behavior there.

@GrabYourPitchforks
Copy link
Member Author

I'm no longer actively working on this, so closing the PR. Moved everything to GrabYourPitchforks#8 so that I don't lose track of it.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Globalization NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) Security
Projects
None yet
4 participants