Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-ASCII chars shouldn't compare equal to ASCII chars under OrdinalIgnoreCase comparison #32247

Closed
GrabYourPitchforks opened this issue Feb 13, 2020 · 28 comments
Labels
area-System.Globalization breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Feb 13, 2020

Under ICU, there are some non-ASCII code points that become ASCII code points after a simple case mapping transformation.

'K' (U+212A KELVIN SIGN) ~= 'k' (U+006B LATIN SMALL LETTER K) [simple lowercase mapping]
'ſ' (U+017F LATIN SMALL LETTER LONG S) ~= 'S' (U+0053 LATIN CAPITAL LETTER S) [simple uppercase mapping]

Since it's common for applications to use StringComparison.OrdinalIgnoreCase when comparing things like usernames, this could be a pit of failure for those applications, as it could lead to the following behavior at runtime.

string.Equals("administrator", "adminiſtrator", StringComparison.OrdinalIgnoreCase) // <-- FALSE on Windows, TRUE on Linux

A fairly straightfoward fix would be to prevent non-ASCII chars and ASCII characters from being equal under an OrdinalIgnoreCase comparison. It would mean that OrdinalIgnoreCase is no longer a direct wrapper around ICU's case mapping / case folding APIs, but it would bring the behavior more in line with what developers have come to expect over .NET's history.

With this proposal, ToUpperInvariant and ToLowerInvariant would be a direct wrapper around ICU's underlying simple case mapping APIs, and it wouldn't special-case any characters.

Related: #27540

The following APIs would be affected:

  • string.Equals, string.Compare, string.GetHashCode, and any other APIs which might accept StringComparison.OrdinalIgnoreCase as a parameter
  • StringComparer.OrdinalIgnoreCase.Equals and StringComparer.OrdinalIgnoreCase.GetHashCode
  • TextInfo.Compare and similar APIs which might accept CompareOptions.OrdinalIgnoreCase

/cc @tarekgh

@GrabYourPitchforks GrabYourPitchforks added area-System.Globalization breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. labels Feb 13, 2020
@GrabYourPitchforks GrabYourPitchforks added this to the 5.0 milestone Feb 13, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Feb 13, 2020
@tarekgh
Copy link
Member

tarekgh commented Feb 13, 2020

prevent non-ASCII chars from becoming ASCII chars

I would recommend doing that in the case of the ordinal casing but not for invariant. The reason here is Invariant casing still treated as a cultural casing and not ordinal and it makes sense to do the right casing according to Unicode standard as most of the people using it in scenarios that not specific to compare 2 strings securely. instead, the ordinal comparison should be used at that time.

@GrabYourPitchforks
Copy link
Member Author

Just to clarify, you're suggesting:

string.Equals("administrator", "adminiſtrator", StringComparison.OrdinalIgnoreCase) // <-- with this change, will return FALSE on all platforms
"administrator".ToUpperInvariant() == "adminiſtrator".ToUpperInvariant() // <-- will continue to return FALSE on Windows, TRUE on Linux

This means that string.Equals(a, b, StringComparison.OrdinalIgnoreCase) will no longer be equivalent to string.Equals(a.ToUpperInvariant(), b.ToUpperInvariant(), StringComparison.Ordinal), which is a behavioral change from previous framework releases.

I'm ok with your suggestion here if you think we can swing the breaking change. :)

@tarekgh
Copy link
Member

tarekgh commented Feb 13, 2020

@GrabYourPitchforks yes you have articulated it accurately.

@GrabYourPitchforks GrabYourPitchforks changed the title Non-ASCII chars shouldn't become ASCII chars after ToUpperInvariant / ToLowerInvariant Non-ASCII chars shouldn't compare equal to ASCII chars under OrdinalIgnoreCase comparison Feb 13, 2020
@GrabYourPitchforks
Copy link
Member Author

@tarekgh I've updated the proposal text to match your suggestions. Thanks!

Do you suppose we'd need similar APIs char.EqualsOrdinalIgnoreCase or Rune.EqualsOrdinalIgnoreCase to accompany the string ordinal ignore case APIs? I can imagine some applications might be interested in doing the comparison on a character-by-character basis.

@tarekgh
Copy link
Member

tarekgh commented Feb 13, 2020

Char already has ToUpper/ToLower which should match the ordinal casing. Also, it has ToUpperInvariant/ToLowerInvariant for invariant case. I think the current APIs are enough for doing any needed operation.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Feb 13, 2020

Char already has ToUpper/ToLower which should match the ordinal casing.

There's no such API, unfortunately. :( char.ToUpper implicitly uses the current culture. We don't have an "ordinal" version per se; we've just relied on ToUpperInvariant for this.

After this change, on Linux:

char.ToUpperInvariant('ſ') // <-- returns 'S'
"ſ".ToUpperInvariant() // <-- returns "S"
string.Equals("ſ", "S", StringComparison.OrdinalIgnoreCase) // <-- returns FALSE

@tarekgh
Copy link
Member

tarekgh commented Feb 13, 2020

right. but I expect:

string.Equals("ſ", "S", StringComparison.InvariantCultureIgnoreCase) // <-- returns true

Is that right?

@GrabYourPitchforks
Copy link
Member Author

I guess my question is that in theory it should be possible to write the following code:

public static bool AreStringsEqualOrdinalCase(string a, string b)
{
    if (a.Length != b.Length) { return false; }
    for (int i = 0; i < a.Length; i++)
    {
        char charA = a[i];
        char charB = b[i];
        if (!CharsAreEqualOrdinalIgnoreCase(charA, charB)) { return false; }
    }
    return true;
}

And that it should behave identically to string.Equals(a, b, StringComparison.OrdinalIgnoreCase), handwaving away that we should be checking for surrogates, etc.

What is the CharsAreEqualOrdinalIgnoreCase method in the example above? Since we don't expose such a method, a developer won't be able to write the sample method above reliably.

@tarekgh
Copy link
Member

tarekgh commented Feb 13, 2020

CharsAreEqualOrdinalIgnoreCase should be equivalent to Char.Toupper(ch1) == Char.Toupper(ch2). We want to ensure char.ToUpper is behaving as ordinal casing we are going to have.

@GrabYourPitchforks
Copy link
Member Author

char.ToUpper is defined to be culture-aware, though.

public static char ToUpper(char c)
{
return CultureInfo.CurrentCulture.TextInfo.ToUpper(c);
}

We have no "ordinal" equivalent API. The closest we have is ToUpperInvariant.

@tarekgh
Copy link
Member

tarekgh commented Feb 13, 2020

ah, got it. I didn't know char is calling the current culture. That is sad as we already have methods on TextInfo do the cultural casing.

Thanks, for the clarification. I agree with the proposal char.EqualsOrdinalIgnoreCase and Rune.EqualsOrdinalIgnoreCase. The challenge now is how we clarify to the users which one to use :-)

@tarekgh
Copy link
Member

tarekgh commented Mar 12, 2020

@GrabYourPitchforks I marked this issue with the tag api-suggestion. do you think we need to finish this in 5.0? or can it be moved to the future?

@GrabYourPitchforks
Copy link
Member Author

There's no API being proposed here. I think we should still tackle this in the 5.0 timeframe. It would dovetail nicely with the ICU work you're already doing.

@tarekgh
Copy link
Member

tarekgh commented Mar 12, 2020

@GrabYourPitchforks this comment #32247 (comment) kind of suggesting a new APIs too. I removed the API suggestion label anyway.

@tarekgh tarekgh added enhancement Product code improvement that does NOT require public API changes/additions and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Mar 12, 2020
@GrabYourPitchforks
Copy link
Member Author

@tarekgh Ah, got it. I'll open a separate API proposal for those.

@tarekgh
Copy link
Member

tarekgh commented Mar 13, 2020

@GrabYourPitchforks if we are going to expose new APIs, shouldn't we break the current behavior? and we ask devs to use the new APIs for correct scenarios?

@GrabYourPitchforks
Copy link
Member Author

@tarekgh, not quite sure I follow. You're suggesting that both the current APIs and the new APIs get the new behavior?

@tarekgh
Copy link
Member

tarekgh commented Mar 16, 2020

I am trying to say, don't change the current APIs behavior and have the new behavior in the new exposed APIs.

@GrabYourPitchforks
Copy link
Member Author

So the current APIs would continue to have the behavior described below?

Console.WriteLine(string.Equals("administrator", "adminiſtrator", StringComparison.OrdinalIgnoreCase)); // prints True (on Linux)

@tarekgh
Copy link
Member

tarekgh commented Mar 16, 2020

Right. I think this can be acceptable. and the new proposed APIs can provide more control over the comparison as desired.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 16, 2020

I guess one of the things bugging me is that Unicode defines 4 ways to perform case-insensitive text comparison, and OrdinalIgnoreCase just kind of does its own thing that doesn't match any of them. 😊 See specifically https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf, Sec. 3.13, definitions D144 - D147.

Our OrdinalIgnoreCase attempts to detect certain cases (see the comment in the opening description for #27540), but it doesn't detect these special cases reliably. So I'd like to see us either fix the special-casing (which was the proposal for this issue) or to drop the special-casing entirely.

If we wanted to expose the 4 case-insensitive text comparison algorithms as their own separate APIs I can be sold on that idea.

Edit - For reference, the four kinds of case-insensitive matching defined by Unicode are:

  • Caseless matching (D144). This is the closest to OrdinalIgnoreCase behavior, but since it uses full case folding it also treats "ß" and "ss" as equivalent. It's implemented via u_strCaseCompare.
  • Canonical caseless matching (D145).
  • Compatibility caseless matching (D146).
  • Identifier caseless matching (D147).

@tarekgh
Copy link
Member

tarekgh commented Mar 17, 2020

I agree then fixing them as you described #32247 (comment).

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 17, 2020

@tarekgh One more quick question on ToUpperInvariant / ToLowerInvariant: should they allow for producing a result string whose length differs from the input?

On Windows, the behavior of these APIs is that they iterate through the string char-by-char, changing the case of each char in isolation. Each input char maps exactly to one output char (or to itself, if no case conversion exists). So the result string has the exact same length as the input string.

Today's behavior for ToUpperInvariant / ToLowerInvariant when under ICU is to do something similar. We manually iterate through the strings char-by-char, as shown below. So the result string will still have the same length as the input string.

if (bToUpper)
{
while (srcIdx < cwSrcLength)
{
U16_NEXT(lpSrc, srcIdx, cwSrcLength, srcCodepoint);
dstCodepoint = u_toupper(srcCodepoint);
U16_APPEND(lpDst, dstIdx, cwDstLength, dstCodepoint, isError);
assert(isError == FALSE && srcIdx == dstIdx);
}
}

ICU's normal case changing APIs (such as u_strToUpper) do not make this same guarantee. For example, the string "ß" (1 char) will uppercase-convert to "SS" (2 chars) when run through ICU's normal uppercase conversion routine, even when specifying a null locale. This matches the recommended behavior in the Unicode Specification, Ch. 3, Sec. 3.13, Rule R1.

I think at the moment quite a bit of code relies on ToUpperInvariant / ToLowerInvariant not changing the length of the resulting string. So maybe a different API would be needed for these "pure ICU" cases so that we don't break the world. But wanted to get your opinion on this nonetheless.

@tarekgh
Copy link
Member

tarekgh commented Mar 17, 2020

I would expose new APIs to support different length for compact reason. I saw a usage before that consumer of the current APIs always assumed the same input length.

@jkotas
Copy link
Member

jkotas commented Mar 17, 2020

I think at the moment quite a bit of code relies on ToUpperInvariant / ToLowerInvariant not changing the length of the resulting string

Do you have examples? We should look at how bad is this.

It is not pretty to have old broken and new correct versions of the same APIs. It has negative value in the long run. We had the same problem with floating point formatting changes, and we choose to fix the existing APIs.

@GrabYourPitchforks
Copy link
Member Author

Do you have examples? We should look at how bad is this.

Within the runtime and libraries:

// Assuming that changing case does not affect length
if (destination.Length < source.Length)
return -1;

int charsWritten = new ReadOnlySpan<char>(ref data, count).ToUpperInvariant(scratch);
Debug.Assert(charsWritten == count); // invariant case conversion should involve simple folding; preserve code unit count

string result = string.FastAllocateString(source.Length); // changing case uses simple folding: doesn't change UTF-16 code unit count

https://github.com/dotnet/machinelearning/blob/cdb1e4b38308d9256cbde9e740a14b3bc7d64c2f/src/Microsoft.ML.Transforms/Expression/BuiltinFunctions.cs#L716-L724

Within application code, I assume the biggest pit of failure would be people who call MemoryExtensions.To{Upper|Lower}Invariant. But I have no easy way of auditing those call sites.

@jkotas
Copy link
Member

jkotas commented Mar 17, 2020

MemoryExtensions.To{Upper|Lower}Invariant - these are new APIs, they won't be used that much.

@tarekgh
Copy link
Member

tarekgh commented Aug 20, 2020

This is fixed by the PR #40910.

@tarekgh tarekgh closed this as completed Aug 20, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Globalization breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants