Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need APIs for unicode case mapping and case folding #20674

Open
tarekgh opened this issue Mar 17, 2017 · 20 comments
Open

Need APIs for unicode case mapping and case folding #20674

tarekgh opened this issue Mar 17, 2017 · 20 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Globalization
Milestone

Comments

@tarekgh
Copy link
Member

tarekgh commented Mar 17, 2017

@gafter commented on Mon May 11 2015

The Unicode function for determining the category of Unicode characters (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(char)) appears to reflect a fairly recent version of the Unicode standard. However, the functions for performing Unicode case mapping (culture-insensitive uppercase and lowercase) appear to reflect Unicode version 1.0. This mismatch was a severe impediment for implementing spec-compliant case folding in the VB compiler.

There is also no support whatsoever for performing case folding according to the Unicode specification (simplified or otherwise). APIs in the platform to do this would help with any language that, like VB, wants to depend on the Unicode specification for the meaning of case-insensitive identifiers.


@jkotas commented on Mon May 11 2015

cc @ellismg


@ellismg commented on Mon May 11 2015

This is something I'd like to do. I think the open issue is how we support this on Windows (since I think that we can get Case Folding functionality from ICU).

In general, we've tried to move away from the framework itself shipping globalization data and used the OS provided APIs instead. Since Windows doesn't currently expose a way to do case folding, we'd have to figure out what to do.


@karelz commented on Fri Mar 17 2017

@tarekgh do we need a new API for this? If yes, we should move it to CoreFX. If not, let's remove "api addition" label.

@iSazonov
Copy link
Contributor

Could we start designing the simple case folding API?

@tarekgh
Copy link
Member Author

tarekgh commented Oct 31, 2018

@iSazonov wouldn't the supported string normalization and casing APIs are not enough in your case?

@gafter
Copy link
Member

gafter commented Oct 31, 2018

@tarekgh What APIs are you suggesting approximate case folding?

@tarekgh
Copy link
Member Author

tarekgh commented Oct 31, 2018

We already have a casing APIs on TextInfo and String class. I understand we use the underlying OS for that and this is by design as we need to avoid carrying any globalization data as possible. what exactly the scenario that is not working for you?

@gafter
Copy link
Member

gafter commented Oct 31, 2018

The scenario is Case Folding as specified in the Unicode specification. See also

@tarekgh
Copy link
Member Author

tarekgh commented Oct 31, 2018

I know that, I am asking about what is your scenario you need the case folding and case mapping is not enough?

@gafter
Copy link
Member

gafter commented Oct 31, 2018

The scenario is case folding for the purposes of comparing identifiers in a case insensitive programming language according to the rules described in the Unicode specification. Case mapping is not enough; that is explained in some detail in the Unicode specification. There is no case folding API today.

@tarekgh
Copy link
Member Author

tarekgh commented Nov 1, 2018

Thanks @gafter

Considering we didn't get high demand on this functionality and we'll need to carry some Unicode data to be able to support it, I would suggest this functionality can be implemented in some NuGet package and doesn't have to be in the core for now. If we see there is more demand on that we can consider having it in core. we can use corefxlab if needed for hosting the code if needed.

@iSazonov
Copy link
Contributor

iSazonov commented Nov 1, 2018

Please see https://github.com/dotnet/corefx/issues/33047 and PowerShell/PowerShell#8120

In PowerShell/PowerShell#8120 I provide a plan. In short we hope to get performance benefits in PowerShell, RegEx and string comparisons in general.
In PowerShell we could win in engine because it use string comparisons everywhere. Also we could speed up Select-String cmdlet get parity with Unix grep and fastest Rust ripgrep utility. Ripgrep uses simple case folding. It show that we could get great win in CoreFX RegEx too.

On the week I prepared alfa code for simple case folding and while the results are encouraging.

@GrabYourPitchforks
Copy link
Member

See also https://github.com/dotnet/corefx/issues/41333, which is a proposal to make string / char / Rune use these mappings by default. It doesn't involve any new API addition.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Sep 27, 2019

If we did want to provide APIs for this, they'd probably look like the following:

public namespace System.Globalization
{
    public static class CharUnicodeInfo
    {
        // performs "simple" culture-unaware mapping to uppercase using CharUnicodeInfo backing data
        public static char ToUpperInvariant(char ch);
        public static int ToUpperInvariant(int codePoint);
        public static Rune ToUpperInvariant(Rune rune);

        // performs "simple" culture-unaware mapping to lowercase using CharUnicodeInfo backing data
        public static char ToLowerInvariant(char ch);
        public static int ToLowerInvariant(int codePoint);
        public static Rune ToLowerInvariant(Rune rune);

        // performs "simple" culture-unaware mapping to titlecase using CharUnicodeInfo backing data
        public static char ToTitleInvariant(char ch);
        public static int ToTitleInvariant(int codePoint);
        public static Rune ToTitleInvariant(Rune rune);

        // performs "simple" case-fold mapping using CharUnicodeInfo backing data
        public static char ToCaseFold(char ch);
        public static int ToCaseFold(int codePoint);
        public static Rune ToCaseFold(Rune rune);
    }
}

The behavior of all of these APIs is that they'd use the data that's already present in the code-behind for CharUnicodeInfo (generated from UnicodeData.txt and CaseFolding.txt). APIs which take int will throw if the input is outside the range [ U+0000..U+10FFFF ] (UTF-16 surrogate code points are allowed and will map to themselves). APIs which take char and Rune do not throw.

@tarekgh
Copy link
Member Author

tarekgh commented Sep 27, 2019

@GrabYourPitchforks would it make sense to move these functionality to char and string classes (and support string case folding too)?

@GrabYourPitchforks
Copy link
Member

@tarekgh If we did https://github.com/dotnet/corefx/issues/41333, I think that would make sense. Otherwise we could end up with a mixture of ICU and NLS behaviors hanging off of string and char, and IMO it would be difficult to reason about.

@tarekgh
Copy link
Member Author

tarekgh commented Sep 28, 2019

I believe in the near future we'll use ICU across all platforms which will help not having this mixture issue.

@gafter
Copy link
Member

gafter commented Sep 30, 2019

The behavior of all of these APIs is that they'd use the data that's already present in the code-behind for CharUnicodeInfo (generated from UnicodeData.txt and CaseFolding.txt).

I don't understand. Unicode has no concept of an invariant mapping (i.e. one that that fails to follow the evolving Unicode standard).

@GrabYourPitchforks
Copy link
Member

I don't understand. Unicode has no concept of an invariant mapping (i.e. one that that fails to follow the evolving Unicode standard).

I'm using the term "invariant" to mean "simple" case mapping in a culture-agnostic fashion. This mapping is subject to change depending on which version of the Unicode Standard is pulled in by System.Private.CoreLib.dll. Check out my comment at https://github.com/dotnet/corefx/issues/41333#issuecomment-535274401 for more details on the proposed behavior of these APIs.

@GrabYourPitchforks
Copy link
Member

@gafter Follow-up question: Would it be acceptable for .NET Core to carry these tables as an internal implementation (e.g., we say ".NET 5's case folding APIs always use the Unicode 12.1 tables"), or do you require the ability to swap out the version of ICU which is used under the covers?

@gafter
Copy link
Member

gafter commented Oct 5, 2019

@GrabYourPitchforks I think that would work very well for us.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@GrabYourPitchforks
Copy link
Member

@tarekgh As a straw man, here's what it might look like to spread these APIs out on the classes where they naturally belong:

namespace System
{
    public struct char
    {
        public static char ToCaseFold(char c);
    }

    public static class MemoryExtensions
    {
        public static int ToCaseFold(this ReadOnlySpan<char> source, Span<char> destination);
    }

    public class string
    {
        public string ToCaseFold();
    }
}

namespace System.Globalization
{
    public static class CharUnicodeInfo
    {
        public static int ToCaseFold(int codePoint);
    }

    public class TextInfo
    {
        public static char ToCaseFold(char c);
        public static string ToCaseFold(string value);
    }
}

namespace System.Text
{
    public struct Rune
    {
        public static Rune ToCaseFold(Rune value);
    }
}

If we were to follow these patterns, then we'd probably end up with this behavior:

  • If you call the CharUnicodeInfo APIs, we use our carried data (currently Unicode 12.1, but likely Unicode 13.0 by the time the implementation ships).
  • If you call any other API, the following logic occurs:
    • If GlobalizationMode.Invariant is set, we fold only ASCII and leave everything else as-is, regardless of OS. Otherwise:
    • If we're on a version of Win10 where OS natively carries the ICU data, or if we're on a non-Windows OS, we p/invoke into ICU to perform case folding on our behalf.
    • If we're on a version of Windows where the OS doesn't natively carry the ICU data, we fall back to the CharUnicodeInfo tables.

This follows the general concept of CharUnicodeInfo being a fully self-contained data set, while string, char, and Rune all fall back to TextInfo, which interfaces with our unmanaged dependencies.

Personally I think it's a little confusing to have multiple APIs that all share the same name but do slightly different things, but it does follow established convention.

Thoughts?

@tarekgh
Copy link
Member Author

tarekgh commented Feb 12, 2020

Thanks, @GrabYourPitchforks for your proposal. I talked with you offline, I just want to mention what is the idea I like to have. Moving the CaseFolding Library from corefxlab to runtime libraries repo but not including it as part of the shipping SDK and instead, we ship it as a separate independent NuGet package so anyone wants to get the case-folding feature will just need to depend on this package. The reason is we are going to carry the case folding Unicode tables which will increase the footprint size of the product if we ship it inside the SDK. So only apps/libraries need to use this feature will need to carry it.

Here are some comments if we go with this plan:

  • Properly we'll need to rename the package to something more generic so we can accommodate more features as needed. I would prefer something Microsoft.Extensions.Globalization.
  • Your proposed APIs look good we'll have to convert the proposal to be extension methods instead.
  • We shouldn't have any method on TextInfo for case-folding.
  • We need to visit the current APIs in the package to look if we need to keep it or polish it.

Thanks for your thoughts on this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Globalization
Projects
None yet
Development

No branches or pull requests

6 participants