ICU comparison routines should use case folding, not case mapping #27540

GrabYourPitchforks · 2020-01-31T23:15:03Z

Adjusts OrdinalIgnoreCase string comparison routines (on ICU only) to use case folding instead of case mapping.

Open question: What should we do about the special-casing logic for U+0131 and U+0049? This PR aside, there are tons of differences in the case mapping tables between ICU and NLS, and the pair that is checked for here is only one such difference. I don't believe it's feasible for us to carry the delta between ICU and NLS within our runtime, which means we should probably scrap the below check and say "we're following the same behavior ICU has."

runtime/src/libraries/Native/Unix/System.Globalization.Native/pal_collation.c

Lines 537 to 543 in 7aff91c

    
    if (one == 0x0131 || two == 0x0131)
    {
        // On Windows with InvariantCulture, the LATIN SMALL LETTER DOTLESS I (U+0131)
        // capitalizes to itself, whereas with ICU it capitalizes to LATIN CAPITAL LETTER I (U+0049).
        // We special case it to match the Windows invariant behavior.
        return FALSE;
    }

jkotas · 2020-02-01T01:52:15Z

> I don't believe it's feasible for us to carry the delta between ICU and NLS within our runtime, which means we should probably scrap the below check and say "we're following the same behavior ICU has."

+1

GrabYourPitchforks · 2020-02-05T01:43:49Z

@tarekgh @stephentoub Do you have any thoughts on the open question raised at #27540 (comment)? I'm trying to gauge whether it would be best to remove the special-case and rely solely on ICU's built-in mapping tables.

tarekgh · 2020-02-05T16:12:51Z

> Do you have any thoughts on the open question raised at #27540 (comment)? I'm trying to gauge whether it would be best to remove the special-case and rely solely on ICU's built-in mapping tables.

In general, we are moving towards ICU even on Windows. I prefer removing this special case, since that serves our future direction. We already have discrepancies between Windows and ICU anyway, so I don't think this case will be a big deal.

GrabYourPitchforks · 2020-02-05T18:46:07Z

It turns out the special case for the Turkish I isn't a problem after all since with this change we're using simple case folding instead of upper case mapping.

I folds to i
i folds to i
İ folds to İ
ı folds to ı

So this means that under an OrdinalIgnoreCase comparer, the code point İ (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE) will never match any of { I, i, ı }. And the code point ı (U+0131 LATIN SMALL LETTER DOTLESS I) will never match any of { I, i, İ }.

This is the desired behavior anyway, so we can remove the special case.
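As a rough illustration of the folding behavior listed above, here is a minimal C sketch. The names `fold_simple_i` and `equals_ordinal_ignore_case` are hypothetical and are not taken from pal_collation.c; the function is a hand-written stand-in for ICU's simple case folding that only covers ASCII letters, on the assumption (consistent with the list above) that U+0130 and U+0131 fold to themselves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for simple case folding, limited to ASCII letters.
   U+0130 and U+0131 are assumed to have no simple folding, so they (like
   every other non-ASCII value here) fold to themselves. */
static uint16_t fold_simple_i(uint16_t c)
{
    if (c >= 'A' && c <= 'Z')
    {
        return (uint16_t)(c + 0x20); /* e.g. 'I' (U+0049) -> 'i' (U+0069) */
    }
    return c;
}

/* Two UTF-16 code units match if their folded forms are identical. */
static bool equals_ordinal_ignore_case(uint16_t one, uint16_t two)
{
    return fold_simple_i(one) == fold_simple_i(two);
}
```

With this sketch, 'I' and 'i' compare equal, while U+0130 and U+0131 each compare equal only to themselves, so no explicit U+0131 check is needed.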

GrabYourPitchforks · 2020-02-05T22:27:20Z

I've marked this NO MERGE for now because it also impacts APIs like string.GetHashCodeOrdinalIgnoreCase. Will create some unit tests to validate that string and CompareInfo are in sync here.

GrabYourPitchforks · 2020-02-05T23:07:36Z

There is also a potential security impact from this change. For example, the following mappings are valid under a case folding mechanism:

ſ (U+017F LATIN SMALL LETTER LONG S) folds to s
K (U+212A KELVIN SIGN) folds to k

(The above data is from https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt.)

This means, for instance, that after this change the expression string.Equals("Microſoft", "Microsoft", StringComparison.OrdinalIgnoreCase) will evaluate to true when running under ICU. Normally this wouldn't be an issue, but we do recommend that developers use StringComparison.OrdinalIgnoreCase for security-related code, and this change in behavior might be surprising.

One option is to re-introduce a special case which says "if one of the characters to check is ASCII and the other character is non-ASCII, they're never equal, regardless of what ICU case folding says." But if we go down this path we're going to have our own semantics start to creep on top of ICU's own semantics, and I don't know just where that road leads.
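To make that option concrete, here is a minimal C sketch of such a guard. Everything in it is hypothetical: `fold_simple_sk` is a hand-written table covering only ASCII letters and the two CaseFolding.txt entries quoted above, and `equals_folded_guarded` is just one way the "ASCII never equals non-ASCII" rule could be layered on top of a folding comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical folding table: ASCII letters plus the two examples above. */
static uint16_t fold_simple_sk(uint16_t c)
{
    if (c >= 'A' && c <= 'Z')
    {
        return (uint16_t)(c + 0x20);
    }
    if (c == 0x017F)
    {
        return 's'; /* LATIN SMALL LETTER LONG S */
    }
    if (c == 0x212A)
    {
        return 'k'; /* KELVIN SIGN */
    }
    return c;
}

static bool is_ascii(uint16_t c)
{
    return c < 0x80;
}

/* Pure case folding comparison: U+017F matches 's' and 'S'. */
static bool equals_folded(uint16_t one, uint16_t two)
{
    return fold_simple_sk(one) == fold_simple_sk(two);
}

/* Same comparison with the guard: an ASCII code unit never matches a
   non-ASCII one, regardless of what the folding table says. */
static bool equals_folded_guarded(uint16_t one, uint16_t two)
{
    if (is_ascii(one) != is_ascii(two))
    {
        return false;
    }
    return fold_simple_sk(one) == fold_simple_sk(two);
}
```

The guard costs one extra comparison per code unit pair, but it is exactly the kind of runtime-specific rule on top of ICU's semantics that the paragraph above is wary of.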

Edit: Jeremy kindly confirmed offline for me that the expression string.Equals("Microſoft", "Microsoft", StringComparison.OrdinalIgnoreCase) already evaluates to true on Linux (ICU) and false on Windows (NLS) today, even without the changes in this PR.

bartonjs · 2020-02-05T23:37:34Z

It feels strange to me that something would map/fold/whatever across any two of { ASCII, non-ASCII BMP, !BMP }; because I'd personally feel like code of the form

private static readonly HashSet<string> s_knownIds = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "fantasy",
};

...

if (s_knownIds.Contains(input))
{
    Process(input);
}

has effectively asserted that Process(string) never needs to touch non-ASCII data. (Or, if I had some \u stuff in there, that there aren't surrogate pairs to deal with). Using the already-present form would solve that; but that's case-normalizing instead of input-checking...

GrabYourPitchforks · 2020-02-15T02:50:58Z

Based on the earlier feedback here and at #32247, I've updated the logic as follows. These changes apply only when running on non-Windows platforms.

ToUpperInvariant / ToLowerInvariant no longer special-case any characters. So calling ToLowerInvariant('İ') will result in the character 'i'. This uses the data straight from ICU with no adjustment.
Comparing two char values under an ordinal case-insensitive comparer (e.g., string.Equals(a, b, StringComparison.OrdinalIgnoreCase)) will now return not equal if one of the chars is ASCII and the other is non-ASCII, even if the two chars have the same simple case folding under ICU. For example, string.Equals("İ", "i", StringComparison.OrdinalIgnoreCase) will return false.

This implies that string.Equals(a, b, StringComparison.OrdinalIgnoreCase) and a.ToUpperInvariant() == b.ToUpperInvariant() do not have the same semantic meaning going forward.

Another notable consequence of this change is how strings are sorted under a case-insensitive comparer. Previously, since ASCII strings were normalized to uppercase on all platforms, the strings "A" and "a" would sort before the string "_" under StringComparer.OrdinalIgnoreCase. The reason for this is that 'A' and 'a' are both normalized to 'A' (U+0041), which is numerically before '_' (U+005F).

With this PR:

On Windows, the strings "A" and "a" will still sort before the string "_" under StringComparer.OrdinalIgnoreCase.
On non-Windows platforms, the strings "A" and "a" will sort after the string "_" under StringComparer.OrdinalIgnoreCase. This is because 'A' and 'a' are now both normalized to 'a' (U+0061), which is numerically after '_' (U+005F).

This behavior is platform-specific and is not affected by the value of the GlobalizationMode.Invariant flag.
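The sort-order difference can be reproduced with a small C sketch that handles ASCII only. The helper names are hypothetical; `compare_upper` models normalizing to uppercase before comparing, and `compare_folded` models normalizing to lowercase (the ASCII result of case folding).

```c
/* ASCII-only normalization helpers (hypothetical names). */
static int to_upper_ascii(int c)
{
    return (c >= 'a' && c <= 'z') ? c - 0x20 : c;
}

static int to_lower_ascii(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + 0x20 : c;
}

/* Negative if a sorts before b, positive if after, zero if equal. */
static int compare_upper(char a, char b)
{
    return to_upper_ascii(a) - to_upper_ascii(b);
}

static int compare_folded(char a, char b)
{
    return to_lower_ascii(a) - to_lower_ascii(b);
}
```

Here `compare_upper('a', '_')` is negative because 'A' (U+0041) is less than '_' (U+005F), while `compare_folded('A', '_')` is positive because 'a' (U+0061) is greater than '_'. The same flip affects '[', '\', ']', '^', and '`', which also sit between the two ASCII letter ranges.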

GrabYourPitchforks · 2020-02-15T02:55:32Z

There's still an open question as to whether we should block non-ASCII -> ASCII conversion during a ToUpperInvariant call. For example, with this PR as it stands today, the following behaviors hold (under ICU):

string.Equals("administrator", "adminiſtrator", StringComparison.OrdinalIgnoreCase) // FALSE
"administrator".ToUpperInvariant() == "adminiſtrator".ToUpperInvariant() // TRUE

If needed we could block this sort of conversion entirely without carrying the ICU / NLS delta.

See also:

tarekgh · 2020-02-16T22:50:34Z

> There's still an open question as to whether we should block non-ASCII -> ASCII conversion during a ToUpperInvariant call

I wouldn't recommend blocking that for invariant. Invariant operations are still cultural operations, and it doesn't make sense to block this behavior there.

GrabYourPitchforks · 2020-07-06T20:32:13Z

I'm no longer actively working on this, so closing the PR. Moved everything to GrabYourPitchforks#8 so that I don't lose track of it.

GrabYourPitchforks added the area-System.Globalization label Jan 31, 2020

GrabYourPitchforks added this to the 5.0 milestone Jan 31, 2020

GrabYourPitchforks requested review from ViktorHofer and tarekgh January 31, 2020 23:15

tarekgh approved these changes Jan 31, 2020

jkotas mentioned this pull request Feb 1, 2020

Crashes caused by "Improve call counting mechanism" change #29934

Closed

GrabYourPitchforks added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Feb 5, 2020

GrabYourPitchforks added the Security label Feb 5, 2020

jaredpar mentioned this pull request Feb 5, 2020

System.Security.Cryptography.OpenSsl.Tests failing on CI runs #2176

Closed

This was referenced Feb 10, 2020

Clean up usage of string.IndexOf / ToUpper / ToLower / Trim throughout the framework #31968

Merged

Non-ASCII chars shouldn't compare equal to ASCII chars under OrdinalIgnoreCase comparison #32247

Closed

GrabYourPitchforks added 2 commits February 14, 2020 17:36

Build native case folding layer

6826ad8

Hook StringComparer.OrdinalIgnoreCase through new system

16fde36

GrabYourPitchforks force-pushed the icu_casefold branch from 503c720 to 16fde36 Compare February 15, 2020 02:36

GrabYourPitchforks linked an issue Feb 15, 2020 that may be closed by this pull request

Non-ASCII chars shouldn't compare equal to ASCII chars under OrdinalIgnoreCase comparison #32247

Closed

GrabYourPitchforks mentioned this pull request Jul 6, 2020

ICU comparison routines should use case folding, not case mapping GrabYourPitchforks/runtime#8

Closed

GrabYourPitchforks closed this Jul 6, 2020

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
