Non-ASCII chars shouldn't compare equal to ASCII chars under OrdinalIgnoreCase comparison #32247

GrabYourPitchforks · 2020-02-13T20:13:11Z

Under ICU, there are some non-ASCII code points that become ASCII code points after a simple case mapping transformation.

'K' (U+212A KELVIN SIGN) ~= 'k' (U+006B LATIN SMALL LETTER K) [simple lowercase mapping]
'ſ' (U+017F LATIN SMALL LETTER LONG S) ~= 'S' (U+0053 LATIN CAPITAL LETTER S) [simple uppercase mapping]

Since it's common for applications to use StringComparison.OrdinalIgnoreCase when comparing things like usernames, this could be a pit of failure for those applications, as it could lead to the following behavior at runtime.

string.Equals("administrator", "adminiſtrator", StringComparison.OrdinalIgnoreCase) // <-- FALSE on Windows, TRUE on Linux

A fairly straightforward fix would be to prevent non-ASCII chars and ASCII chars from comparing equal under an OrdinalIgnoreCase comparison. It would mean that OrdinalIgnoreCase is no longer a direct wrapper around ICU's case mapping / case folding APIs, but it would bring the behavior more in line with what developers have come to expect over .NET's history.

With this proposal, ToUpperInvariant and ToLowerInvariant would be a direct wrapper around ICU's underlying simple case mapping APIs, and it wouldn't special-case any characters.

Related: #27540

The following APIs would be affected:

string.Equals, string.Compare, string.GetHashCode, and any other APIs which might accept StringComparison.OrdinalIgnoreCase as a parameter
StringComparer.OrdinalIgnoreCase.Equals and StringComparer.OrdinalIgnoreCase.GetHashCode
TextInfo.Compare and similar APIs which might accept CompareOptions.OrdinalIgnoreCase

/cc @tarekgh


tarekgh · 2020-02-13T20:39:14Z

prevent non-ASCII chars from becoming ASCII chars

I would recommend doing that for ordinal casing but not for invariant casing. Invariant casing is still treated as cultural casing, not ordinal, so it makes sense for it to do the right casing according to the Unicode Standard: most people use it in scenarios that are not specifically about comparing two strings securely. Ordinal comparison should be used for those cases instead.

GrabYourPitchforks · 2020-02-13T20:47:21Z

Just to clarify, you're suggesting:

string.Equals("administrator", "adminiſtrator", StringComparison.OrdinalIgnoreCase) // <-- with this change, will return FALSE on all platforms
"administrator".ToUpperInvariant() == "adminiſtrator".ToUpperInvariant() // <-- will continue to return FALSE on Windows, TRUE on Linux

This means that string.Equals(a, b, StringComparison.OrdinalIgnoreCase) will no longer be equivalent to string.Equals(a.ToUpperInvariant(), b.ToUpperInvariant(), StringComparison.Ordinal), which is a behavioral change from previous framework releases.
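A minimal sketch of how one might probe that equivalence for a given pair of strings. The class and method names are hypothetical, and the outcome for non-ASCII input depends on the platform and runtime version, so no expected output is claimed for those cases.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the framework.
public static class EquivalenceCheck
{
    // The comparison as applications usually write it.
    public static bool ViaComparison(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    // The "uppercase both sides, then compare ordinally" formulation.
    public static bool ViaUpperInvariant(string a, string b) =>
        string.Equals(a.ToUpperInvariant(), b.ToUpperInvariant(), StringComparison.Ordinal);

    // True when both formulations give the same answer for this pair.
    public static bool Agree(string a, string b) =>
        ViaComparison(a, b) == ViaUpperInvariant(a, b);
}
```

Calling EquivalenceCheck.Agree("administrator", "adminiſtrator") on different platforms is one way to see whether the two formulations have diverged for the long-s case.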

I'm ok with your suggestion here if you think we can swing the breaking change. :)

tarekgh · 2020-02-13T20:55:16Z

@GrabYourPitchforks yes you have articulated it accurately.

GrabYourPitchforks · 2020-02-13T21:20:42Z

@tarekgh I've updated the proposal text to match your suggestions. Thanks!

Do you suppose we'd need similar APIs char.EqualsOrdinalIgnoreCase or Rune.EqualsOrdinalIgnoreCase to accompany the string ordinal ignore case APIs? I can imagine some applications might be interested in doing the comparison on a character-by-character basis.

tarekgh · 2020-02-13T22:17:27Z

Char already has ToUpper/ToLower which should match the ordinal casing. Also, it has ToUpperInvariant/ToLowerInvariant for invariant case. I think the current APIs are enough for doing any needed operation.

GrabYourPitchforks · 2020-02-13T22:39:43Z

Char already has ToUpper/ToLower which should match the ordinal casing.

There's no such API, unfortunately. :( char.ToUpper implicitly uses the current culture. We don't have an "ordinal" version per se; we've just relied on ToUpperInvariant for this.

After this change, on Linux:

char.ToUpperInvariant('ſ') // <-- returns 'S'
"ſ".ToUpperInvariant() // <-- returns "S"
string.Equals("ſ", "S", StringComparison.OrdinalIgnoreCase) // <-- returns FALSE

tarekgh · 2020-02-13T22:45:35Z

Right, but I expect:

string.Equals("ſ", "S", StringComparison.InvariantCultureIgnoreCase) // <-- returns true

Is that right?

GrabYourPitchforks · 2020-02-13T22:50:55Z

I guess my question is that in theory it should be possible to write the following code:

public static bool AreStringsEqualOrdinalCase(string a, string b)
{
    if (a.Length != b.Length) { return false; }
    for (int i = 0; i < a.Length; i++)
    {
        char charA = a[i];
        char charB = b[i];
        if (!CharsAreEqualOrdinalIgnoreCase(charA, charB)) { return false; }
    }
    return true;
}

And that it should behave identically to string.Equals(a, b, StringComparison.OrdinalIgnoreCase), handwaving away that we should be checking for surrogates, etc.

What is the CharsAreEqualOrdinalIgnoreCase method in the example above? Since we don't expose such a method, a developer won't be able to write the sample method above reliably.
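For illustration, here is one way such a method could look under the rule proposed in this issue (an ASCII char never equals a non-ASCII char). This is only a sketch: the class name is hypothetical, it falls back on char.ToUpperInvariant for everything else, it ignores surrogate pairs, and it is not claimed to match what the runtime does internally.

```csharp
using System;

// Hypothetical sketch; not an existing framework API.
public static class OrdinalIgnoreCaseSketch
{
    public static bool CharsAreEqualOrdinalIgnoreCase(char a, char b)
    {
        if (a == b)
        {
            return true;
        }

        bool aIsAscii = a <= 0x7F;
        bool bIsAscii = b <= 0x7F;

        // Proposed rule: never let an ASCII char match a non-ASCII char,
        // regardless of what the underlying case mapping says.
        if (aIsAscii != bIsAscii)
        {
            return false;
        }

        // Assumption: simple invariant uppercasing is good enough for the rest.
        return char.ToUpperInvariant(a) == char.ToUpperInvariant(b);
    }
}
```

With this guard, 'ſ' (U+017F) versus 'S' and 'K' (U+212A) versus 'k' are rejected before any case mapping is consulted, so the answer does not depend on the platform's casing tables.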

tarekgh · 2020-02-13T22:58:48Z

CharsAreEqualOrdinalIgnoreCase should be equivalent to Char.ToUpper(ch1) == Char.ToUpper(ch2). We want to ensure that char.ToUpper behaves like the ordinal casing we are going to have.

GrabYourPitchforks · 2020-02-13T23:01:08Z

char.ToUpper is defined to be culture-aware, though.

runtime/src/libraries/System.Private.CoreLib/src/System/Char.cs

Lines 350 to 353 in 0e0e852

    public static char ToUpper(char c)
    {
        return CultureInfo.CurrentCulture.TextInfo.ToUpper(c);
    }

We have no "ordinal" equivalent API. The closest we have is ToUpperInvariant.

tarekgh · 2020-02-13T23:07:03Z

Ah, got it. I didn't know char calls into the current culture. That is sad, as we already have methods on TextInfo that do the cultural casing.

Thanks for the clarification. I agree with the proposal for char.EqualsOrdinalIgnoreCase and Rune.EqualsOrdinalIgnoreCase. The challenge now is how we clarify to users which one to use :-)

tarekgh · 2020-03-12T00:43:22Z

@GrabYourPitchforks I marked this issue with the api-suggestion tag. Do you think we need to finish this in 5.0, or can it be moved to the future?

GrabYourPitchforks · 2020-03-12T00:52:57Z

There's no API being proposed here. I think we should still tackle this in the 5.0 timeframe. It would dovetail nicely with the ICU work you're already doing.

tarekgh · 2020-03-12T01:26:02Z

@GrabYourPitchforks This comment, #32247 (comment), is kind of suggesting new APIs too. I removed the api-suggestion label anyway.

GrabYourPitchforks · 2020-03-13T17:32:38Z

@tarekgh Ah, got it. I'll open a separate API proposal for those.

tarekgh · 2020-03-13T17:39:46Z

@GrabYourPitchforks if we are going to expose new APIs, shouldn't we break the current behavior? and we ask devs to use the new APIs for correct scenarios?

GrabYourPitchforks · 2020-03-16T20:47:56Z

@tarekgh, not quite sure I follow. You're suggesting that both the current APIs and the new APIs get the new behavior?

tarekgh · 2020-03-16T20:57:55Z

I am trying to say: don't change the current APIs' behavior, and put the new behavior in the newly exposed APIs.

GrabYourPitchforks · 2020-03-16T21:12:02Z

So the current APIs would continue to have the behavior described below?

Console.WriteLine(string.Equals("administrator", "adminiſtrator", StringComparison.OrdinalIgnoreCase)); // prints True (on Linux)

tarekgh · 2020-03-16T21:31:57Z

Right. I think this can be acceptable, and the newly proposed APIs can provide more control over the comparison as desired.

GrabYourPitchforks · 2020-03-16T21:52:21Z

I guess one of the things bugging me is that Unicode defines 4 ways to perform case-insensitive text comparison, and OrdinalIgnoreCase just kind of does its own thing that doesn't match any of them. 😊 See specifically https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf, Sec. 3.13, definitions D144 - D147.

Our OrdinalIgnoreCase attempts to detect certain cases (see the comment in the opening description for #27540), but it doesn't detect these special cases reliably. So I'd like to see us either fix the special-casing (which was the proposal for this issue) or to drop the special-casing entirely.

If we wanted to expose the 4 case-insensitive text comparison algorithms as their own separate APIs I can be sold on that idea.

Edit - For reference, the four kinds of case-insensitive matching defined by Unicode are:

Caseless matching (D144). This is the closest to OrdinalIgnoreCase behavior, but since it uses full case folding it also treats "ß" and "ss" as equivalent. It's implemented via u_strCaseCompare.
Canonical caseless matching (D145).
Compatibility caseless matching (D146).
Identifier caseless matching (D147).

tarekgh · 2020-03-17T01:13:22Z

I agree, then, with fixing them as you described in #32247 (comment).

GrabYourPitchforks · 2020-03-17T19:51:14Z

@tarekgh One more quick question on ToUpperInvariant / ToLowerInvariant: should they allow for producing a result string whose length differs from the input?

On Windows, the behavior of these APIs is that they iterate through the string char-by-char, changing the case of each char in isolation. Each input char maps exactly to one output char (or to itself, if no case conversion exists). So the result string has the exact same length as the input string.

Today's behavior for ToUpperInvariant / ToLowerInvariant when under ICU is to do something similar. We manually iterate through the strings char-by-char, as shown below. So the result string will still have the same length as the input string.

runtime/src/libraries/Native/Unix/System.Globalization.Native/pal_casing.c

Lines 40 to 49 in 75036ff

    if (bToUpper)
    {
        while (srcIdx < cwSrcLength)
        {
            U16_NEXT(lpSrc, srcIdx, cwSrcLength, srcCodepoint);
            dstCodepoint = u_toupper(srcCodepoint);
            U16_APPEND(lpDst, dstIdx, cwDstLength, dstCodepoint, isError);
            assert(isError == FALSE && srcIdx == dstIdx);
        }
    }

ICU's normal case changing APIs (such as u_strToUpper) do not make this same guarantee. For example, the string "ß" (1 char) will uppercase-convert to "SS" (2 chars) when run through ICU's normal uppercase conversion routine, even when specifying a null locale. This matches the recommended behavior in the Unicode Specification, Ch. 3, Sec. 3.13, Rule R1.

I think at the moment quite a bit of code relies on ToUpperInvariant / ToLowerInvariant not changing the length of the resulting string. So maybe a different API would be needed for these "pure ICU" cases so that we don't break the world. But wanted to get your opinion on this nonetheless.
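As a quick way to see the length-preserving behavior described above, here is a small sketch. The helper name is hypothetical; it only checks that invariant case conversion leaves the UTF-16 length unchanged for a given input.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class CaseLengthCheck
{
    // True when both invariant case conversions keep the UTF-16 code unit count.
    public static bool PreservesLength(string s) =>
        s.ToUpperInvariant().Length == s.Length &&
        s.ToLowerInvariant().Length == s.Length;
}
```

Because the current implementations convert each char in isolation, CaseLengthCheck.PreservesLength("ß") holds today; it would stop holding if ToUpperInvariant switched to a full case mapping that turns "ß" into "SS".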

tarekgh · 2020-03-17T19:57:42Z

I would expose new APIs to support a different length, for compat reasons. I have seen usage before where consumers of the current APIs always assumed the output has the same length as the input.

jkotas · 2020-03-17T20:31:29Z

I think at the moment quite a bit of code relies on ToUpperInvariant / ToLowerInvariant not changing the length of the resulting string

Do you have examples? We should look at how bad is this.

It is not pretty to have old broken and new correct versions of the same APIs. It has negative value in the long run. We had the same problem with floating point formatting changes, and we chose to fix the existing APIs.

GrabYourPitchforks · 2020-03-17T21:34:22Z

Do you have examples? We should look at how bad is this.

Within the runtime and libraries:

runtime/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.Globalization.cs

Lines 317 to 319 in 1124c1a

    // Assuming that changing case does not affect length
    if (destination.Length < source.Length)
        return -1;

runtime/src/libraries/System.Private.CoreLib/src/System/Marvin.OrdinalIgnoreCase.cs

Lines 81 to 82 in d5be855

    int charsWritten = new ReadOnlySpan<char>(ref data, count).ToUpperInvariant(scratch);
    Debug.Assert(charsWritten == count); // invariant case conversion should involve simple folding; preserve code unit count

runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/TextInfo.cs

Line 397 in d5be855

    string result = string.FastAllocateString(source.Length); // changing case uses simple folding: doesn't change UTF-16 code unit count

https://github.com/dotnet/machinelearning/blob/cdb1e4b38308d9256cbde9e740a14b3bc7d64c2f/src/Microsoft.ML.Transforms/Expression/BuiltinFunctions.cs#L716-L724

Within application code, I assume the biggest pit of failure would be people who call MemoryExtensions.To{Upper|Lower}Invariant. But I have no easy way of auditing those call sites.
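To make the span-based concern concrete, here is a sketch of a typical call pattern. The wrapper name is hypothetical; MemoryExtensions.ToUpperInvariant is the real API, and it returns -1 when the destination is too small. Callers that size the destination to source.Length are relying on the length-preserving assumption.

```csharp
using System;

// Hypothetical wrapper for illustration only.
public static class SpanCasingSketch
{
    // Uppercases 'source' into a freshly allocated buffer of the given size and
    // returns whatever MemoryExtensions.ToUpperInvariant reports.
    public static int UpperInto(string source, int destinationLength)
    {
        Span<char> destination = new char[destinationLength];
        return source.AsSpan().ToUpperInvariant(destination);
    }
}
```

A caller that passes destinationLength == source.Length and treats -1 as impossible would be the kind of call site that breaks if case conversion were ever allowed to grow the string.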

jkotas · 2020-03-17T22:49:39Z

MemoryExtensions.To{Upper|Lower}Invariant - these are new APIs, they won't be used that much.

tarekgh · 2020-08-20T01:22:07Z

This is fixed by the PR #40910.

GrabYourPitchforks added area-System.Globalization breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. labels Feb 13, 2020

GrabYourPitchforks added this to the 5.0 milestone Feb 13, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Feb 13, 2020

GrabYourPitchforks changed the title ~~Non-ASCII chars shouldn't become ASCII chars after ToUpperInvariant / ToLowerInvariant~~ Non-ASCII chars shouldn't compare equal to ASCII chars under OrdinalIgnoreCase comparison Feb 13, 2020

GrabYourPitchforks mentioned this issue Feb 13, 2020

API proposal: compare chars or Runes for equality under an OrdinalIgnoreCase comparer #32268

Open

GrabYourPitchforks linked a pull request Feb 15, 2020 that will close this issue

ICU comparison routines should use case folding, not case mapping #27540

Closed

GrabYourPitchforks mentioned this issue Feb 15, 2020

ICU comparison routines should use case folding, not case mapping #27540

Closed

tarekgh added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Mar 12, 2020

tarekgh added enhancement Product code improvement that does NOT require public API changes/additions and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Mar 12, 2020

tarekgh closed this as completed Aug 20, 2020

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
