Need APIs for unicode case mapping and case folding #20674

tarekgh · 2017-03-17T15:26:33Z

@gafter commented on Mon May 11 2015

The Unicode function for determining the category of Unicode characters (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(char)) appears to reflect a fairly recent version of the Unicode standard. However, the functions for performing Unicode case mapping (culture-insensitive uppercase and lowercase) appear to reflect Unicode version 1.0. This mismatch was a severe impediment for implementing spec-compliant case folding in the VB compiler.

There is also no support whatsoever for performing case folding according to the Unicode specification (simplified or otherwise). APIs in the platform to do this would help with any language that, like VB, wants to depend on the Unicode specification for the meaning of case-insensitive identifiers.

@jkotas commented on Mon May 11 2015

cc @ellismg

@ellismg commented on Mon May 11 2015

This is something I'd like to do. I think the open issue is how we support this on Windows (since I think that we can get Case Folding functionality from ICU).

In general, we've tried to move away from the framework itself shipping globalization data and used the OS provided APIs instead. Since Windows doesn't currently expose a way to do case folding, we'd have to figure out what to do.

@karelz commented on Fri Mar 17 2017

@tarekgh do we need a new API for this? If yes, we should move it to CoreFX. If not, let's remove "api addition" label.

The text was updated successfully, but these errors were encountered:

iSazonov · 2018-10-31T05:47:00Z

Could we start designing the simple case folding API?

tarekgh · 2018-10-31T16:16:23Z

@iSazonov wouldn't the supported string normalization and casing APIs are not enough in your case?

gafter · 2018-10-31T21:09:15Z

@tarekgh What APIs are you suggesting approximate case folding?

tarekgh · 2018-10-31T21:36:07Z

We already have a casing APIs on TextInfo and String class. I understand we use the underlying OS for that and this is by design as we need to avoid carrying any globalization data as possible. what exactly the scenario that is not working for you?

gafter · 2018-10-31T21:49:20Z

The scenario is Case Folding as specified in the Unicode specification. See also

http://unicode.org/faq/casemap_charprop.html
http://www.unicode.org/versions/Unicode11.0.0/ch05.pdf#G21180 starting on page 240
ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

tarekgh · 2018-10-31T22:00:32Z

I know that, I am asking about what is your scenario you need the case folding and case mapping is not enough?

gafter · 2018-10-31T23:15:51Z

The scenario is case folding for the purposes of comparing identifiers in a case insensitive programming language according to the rules described in the Unicode specification. Case mapping is not enough; that is explained in some detail in the Unicode specification. There is no case folding API today.

tarekgh · 2018-11-01T02:25:28Z

Thanks @gafter

Considering we didn't get high demand on this functionality and we'll need to carry some Unicode data to be able to support it, I would suggest this functionality can be implemented in some NuGet package and doesn't have to be in the core for now. If we see there is more demand on that we can consider having it in core. we can use corefxlab if needed for hosting the code if needed.

iSazonov · 2018-11-01T03:24:11Z

Please see https://github.com/dotnet/corefx/issues/33047 and PowerShell/PowerShell#8120

In PowerShell/PowerShell#8120 I provide a plan. In short we hope to get performance benefits in PowerShell, RegEx and string comparisons in general.
In PowerShell we could win in engine because it use string comparisons everywhere. Also we could speed up Select-String cmdlet get parity with Unix grep and fastest Rust ripgrep utility. Ripgrep uses simple case folding. It show that we could get great win in CoreFX RegEx too.

On the week I prepared alfa code for simple case folding and while the results are encouraging.

GrabYourPitchforks · 2019-09-27T19:41:51Z

See also https://github.com/dotnet/corefx/issues/41333, which is a proposal to make string / char / Rune use these mappings by default. It doesn't involve any new API addition.

GrabYourPitchforks · 2019-09-27T19:46:18Z

If we did want to provide APIs for this, they'd probably look like the following:

public namespace System.Globalization
{
    public static class CharUnicodeInfo
    {
        // performs "simple" culture-unaware mapping to uppercase using CharUnicodeInfo backing data
        public static char ToUpperInvariant(char ch);
        public static int ToUpperInvariant(int codePoint);
        public static Rune ToUpperInvariant(Rune rune);

        // performs "simple" culture-unaware mapping to lowercase using CharUnicodeInfo backing data
        public static char ToLowerInvariant(char ch);
        public static int ToLowerInvariant(int codePoint);
        public static Rune ToLowerInvariant(Rune rune);

        // performs "simple" culture-unaware mapping to titlecase using CharUnicodeInfo backing data
        public static char ToTitleInvariant(char ch);
        public static int ToTitleInvariant(int codePoint);
        public static Rune ToTitleInvariant(Rune rune);

        // performs "simple" case-fold mapping using CharUnicodeInfo backing data
        public static char ToCaseFold(char ch);
        public static int ToCaseFold(int codePoint);
        public static Rune ToCaseFold(Rune rune);
    }
}

The behavior of all of these APIs is that they'd use the data that's already present in the code-behind for CharUnicodeInfo (generated from UnicodeData.txt and CaseFolding.txt). APIs which take int will throw if the input is outside the range [ U+0000..U+10FFFF ] (UTF-16 surrogate code points are allowed and will map to themselves). APIs which take char and Rune do not throw.

tarekgh · 2019-09-27T21:48:42Z

@GrabYourPitchforks would it make sense to move these functionality to char and string classes (and support string case folding too)?

GrabYourPitchforks · 2019-09-27T23:52:23Z

@tarekgh If we did https://github.com/dotnet/corefx/issues/41333, I think that would make sense. Otherwise we could end up with a mixture of ICU and NLS behaviors hanging off of string and char, and IMO it would be difficult to reason about.

tarekgh · 2019-09-28T00:08:35Z

I believe in the near future we'll use ICU across all platforms which will help not having this mixture issue.

gafter · 2019-09-30T23:20:45Z

The behavior of all of these APIs is that they'd use the data that's already present in the code-behind for CharUnicodeInfo (generated from UnicodeData.txt and CaseFolding.txt).

I don't understand. Unicode has no concept of an invariant mapping (i.e. one that that fails to follow the evolving Unicode standard).

GrabYourPitchforks · 2019-10-01T01:37:13Z

I don't understand. Unicode has no concept of an invariant mapping (i.e. one that that fails to follow the evolving Unicode standard).

I'm using the term "invariant" to mean "simple" case mapping in a culture-agnostic fashion. This mapping is subject to change depending on which version of the Unicode Standard is pulled in by System.Private.CoreLib.dll. Check out my comment at https://github.com/dotnet/corefx/issues/41333#issuecomment-535274401 for more details on the proposed behavior of these APIs.

GrabYourPitchforks · 2019-10-04T23:34:09Z

@gafter Follow-up question: Would it be acceptable for .NET Core to carry these tables as an internal implementation (e.g., we say ".NET 5's case folding APIs always use the Unicode 12.1 tables"), or do you require the ability to swap out the version of ICU which is used under the covers?

gafter · 2019-10-05T22:36:00Z

@GrabYourPitchforks I think that would work very well for us.

GrabYourPitchforks · 2020-02-12T18:40:44Z

@tarekgh As a straw man, here's what it might look like to spread these APIs out on the classes where they naturally belong:

namespace System
{
    public struct char
    {
        public static char ToCaseFold(char c);
    }

    public static class MemoryExtensions
    {
        public static int ToCaseFold(this ReadOnlySpan<char> source, Span<char> destination);
    }

    public class string
    {
        public string ToCaseFold();
    }
}

namespace System.Globalization
{
    public static class CharUnicodeInfo
    {
        public static int ToCaseFold(int codePoint);
    }

    public class TextInfo
    {
        public static char ToCaseFold(char c);
        public static string ToCaseFold(string value);
    }
}

namespace System.Text
{
    public struct Rune
    {
        public static Rune ToCaseFold(Rune value);
    }
}

If we were to follow these patterns, then we'd probably end up with this behavior:

If you call the CharUnicodeInfo APIs, we use our carried data (currently Unicode 12.1, but likely Unicode 13.0 by the time the implementation ships).
If you call any other API, the following logic occurs:
- If GlobalizationMode.Invariant is set, we fold only ASCII and leave everything else as-is, regardless of OS. Otherwise:
- If we're on a version of Win10 where OS natively carries the ICU data, or if we're on a non-Windows OS, we p/invoke into ICU to perform case folding on our behalf.
- If we're on a version of Windows where the OS doesn't natively carry the ICU data, we fall back to the CharUnicodeInfo tables.

This follows the general concept of CharUnicodeInfo being a fully self-contained data set, while string, char, and Rune all fall back to TextInfo, which interfaces with our unmanaged dependencies.

Personally I think it's a little confusing to have multiple APIs that all share the same name but do slightly different things, but it does follow established convention.

Thoughts?

tarekgh · 2020-02-12T19:38:05Z

Thanks, @GrabYourPitchforks for your proposal. I talked with you offline, I just want to mention what is the idea I like to have. Moving the CaseFolding Library from corefxlab to runtime libraries repo but not including it as part of the shipping SDK and instead, we ship it as a separate independent NuGet package so anyone wants to get the case-folding feature will just need to depend on this package. The reason is we are going to carry the case folding Unicode tables which will increase the footprint size of the product if we ship it inside the SDK. So only apps/libraries need to use this feature will need to carry it.

Here are some comments if we go with this plan:

Properly we'll need to rename the package to something more generic so we can accommodate more features as needed. I would prefer something Microsoft.Extensions.Globalization.
Your proposed APIs look good we'll have to convert the proposal to be extension methods instead.
We shouldn't have any method on TextInfo for case-folding.
We need to visit the current APIs in the package to look if we need to keep it or polish it.

Thanks for your thoughts on this issue.

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 5.0 milestone Jan 31, 2020

iSazonov mentioned this issue Jan 31, 2020

Why we use CompareInfo.Invariant for Ordinal? #27738

Closed

maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020

GrabYourPitchforks mentioned this issue May 8, 2020

Regex isn't factoring target culture into lowercasing of ranges #36147

Closed

tarekgh removed the untriaged New issue has not been triaged by the area owner label Jun 17, 2020

tarekgh modified the milestones: 5.0.0, Future Jun 17, 2020

GrabYourPitchforks mentioned this issue Aug 18, 2020

Breaking change proposal: OrdinalIgnoreCase string comparison, ToUpperInvariant, and ToLowerInvariant to use ICU on all platforms #30960

Open

GrabYourPitchforks mentioned this issue Oct 23, 2020

Improve invariant globalization mode to be more complete #43774

Closed

2 tasks

maryamariyan added this to Future in ML, Extensions, Globalization, etc, POD. Feb 3, 2021

tarekgh mentioned this issue Jun 16, 2021

string.ToUpperInvariant("μ") gives different results on different platforms #54278

Closed

DmitryMak mentioned this issue May 15, 2024

[API Proposal]: Introduce new string NormalizationForm for NFKC_Casefold #102269

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Need APIs for unicode case mapping and case folding #20674

Need APIs for unicode case mapping and case folding #20674

tarekgh commented Mar 17, 2017

iSazonov commented Oct 31, 2018

tarekgh commented Oct 31, 2018

gafter commented Oct 31, 2018

tarekgh commented Oct 31, 2018

gafter commented Oct 31, 2018 •

edited

tarekgh commented Oct 31, 2018

gafter commented Oct 31, 2018

tarekgh commented Nov 1, 2018

iSazonov commented Nov 1, 2018

GrabYourPitchforks commented Sep 27, 2019

GrabYourPitchforks commented Sep 27, 2019 •

edited

tarekgh commented Sep 27, 2019

GrabYourPitchforks commented Sep 27, 2019

tarekgh commented Sep 28, 2019

gafter commented Sep 30, 2019

GrabYourPitchforks commented Oct 1, 2019

GrabYourPitchforks commented Oct 4, 2019

gafter commented Oct 5, 2019

GrabYourPitchforks commented Feb 12, 2020

tarekgh commented Feb 12, 2020

Need APIs for unicode case mapping and case folding #20674

Need APIs for unicode case mapping and case folding #20674

Comments

tarekgh commented Mar 17, 2017

iSazonov commented Oct 31, 2018

tarekgh commented Oct 31, 2018

gafter commented Oct 31, 2018

tarekgh commented Oct 31, 2018

gafter commented Oct 31, 2018 • edited

tarekgh commented Oct 31, 2018

gafter commented Oct 31, 2018

tarekgh commented Nov 1, 2018

iSazonov commented Nov 1, 2018

GrabYourPitchforks commented Sep 27, 2019

GrabYourPitchforks commented Sep 27, 2019 • edited

tarekgh commented Sep 27, 2019

GrabYourPitchforks commented Sep 27, 2019

tarekgh commented Sep 28, 2019

gafter commented Sep 30, 2019

GrabYourPitchforks commented Oct 1, 2019

GrabYourPitchforks commented Oct 4, 2019

gafter commented Oct 5, 2019

GrabYourPitchforks commented Feb 12, 2020

tarekgh commented Feb 12, 2020

gafter commented Oct 31, 2018 •

edited

GrabYourPitchforks commented Sep 27, 2019 •

edited