.NET 5 breaking change: StringInfo and TextElementEnumerator classes are now UAX29-compliant #16702

GrabYourPitchforks opened this issue Jan 15, 2020 · 0 comments

.NET 5's System.Globalization.StringInfo and System.Globalization.TextElementEnumerator classes now follow UAX29 guidelines for extended grapheme enumeration

Unicode has the concept of a "grapheme", which roughly approximates what the user perceives as a single display character. In UAX#29, this is properly formalized as a concept called an "extended grapheme cluster".

Consider the string containing the Thai character kam ("กำ"). This string actually consists of two chars:

  • ' ก ' (= '\u0e01') THAI CHARACTER KO KAI, followed by
  • ' ำ ' (= '\u0e33') THAI CHARACTER SARA AM

When displayed to the user, the operating system combines them to form the single display character (or grapheme) kam, or กำ.

Grapheme clusters also matter when displaying emoji. Consider the string containing the emoji Woman Shrugging: Medium Skin Tone ("🤷🏽‍♀️"). This string actually consists of seven chars: [ '\ud83e', '\udd37', '\ud83c', '\udffd', '\u200d', '\u2640', '\ufe0f' ]. But the operating system combines all seven of them and displays them as a single unit.
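
The char counts above can be checked directly, because a .NET string is a sequence of UTF-16 code units. The following is a minimal sketch that builds both strings from the escape sequences listed above. The names CharBreakdown and DescribeChars are illustrative only and do not come from any .NET API.

```csharp
using System;
using System.Linq;

public static class CharBreakdown
{
    // Hypothetical helper: formats every UTF-16 code unit of s as U+XXXX.
    public static string DescribeChars(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    public static void Main()
    {
        string kam = "\u0e01\u0e33";
        string shrug = "\ud83e\udd37\ud83c\udffd\u200d\u2640\ufe0f";

        // string.Length counts chars (UTF-16 code units), not graphemes.
        Console.WriteLine($"{kam.Length}: {DescribeChars(kam)}");
        // prints "2: U+0E01 U+0E33"
        Console.WriteLine($"{shrug.Length}: {DescribeChars(shrug)}");
        // prints "7: U+D83E U+DD37 U+D83C U+DFFD U+200D U+2640 U+FE0F"
    }
}
```

These lengths are the same on every .NET version; only the grapheme segmentation described below differs between versions.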

In .NET, the System.Globalization.StringInfo and System.Globalization.TextElementEnumerator classes allow developers to inspect a string and to get information about the graphemes it contains. Prior to this change, the StringInfo and TextElementEnumerator types contained legacy logic that didn't properly handle all grapheme clusters. With this change, these two types process grapheme clusters according to the latest version of the Unicode Standard.

The .NET documentation sometimes uses the term "text element" when referring to a grapheme.

Version introduced

Introduced in .NET 5. The original tracking issue is https://github.com/dotnet/corefx/issues/41324.

Old behavior

In .NET Framework (all versions) and .NET Core 3.x and earlier, the StringInfo and TextElementEnumerator types implemented custom logic that handled certain combining classes but did not fully comply with the Unicode Standard.

using System;
using System.Globalization;

class Program
{
    static void Main(string[] args)
    {
        PrintGraphemes("กำ");
        PrintGraphemes("🤷🏽‍♀️");
    }

    // Prints each text element (grapheme) that the enumerator returns for str.
    static void PrintGraphemes(string str)
    {
        Console.WriteLine($"Printing graphemes of \"{str}\"...");
        int i = 0;

        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
        while (enumerator.MoveNext())
        {
            Console.WriteLine($"Grapheme {++i}: \"{enumerator.Current}\"");
        }

        Console.WriteLine($"({i} grapheme(s) total.)");
        Console.WriteLine();
    }
}

Output:

Printing graphemes of "กำ"...
Grapheme 1: "ก"
Grapheme 2: "ำ"
(2 grapheme(s) total.)

Printing graphemes of "🤷🏽‍♀️"...
Grapheme 1: "🤷"
Grapheme 2: "🏽"
Grapheme 3: "‍"
Grapheme 4: "♀️"
(4 grapheme(s) total.)

In the case of the single Thai character kam, the StringInfo and TextElementEnumerator classes incorrectly split the character back into its constituent components instead of keeping them together.

In the case of the emoji character, the StringInfo and TextElementEnumerator classes incorrectly split the emoji into four clusters: person shrugging, skin tone modifier, an invisible combiner (the zero-width joiner, '\u200d'), and gender modifier. All of these should have been kept together as a single grapheme.

New behavior

Starting with .NET 5, the StringInfo and TextElementEnumerator classes follow the Unicode Standard exactly. In particular, they now return extended grapheme clusters. Their logic implements the standard as defined by Unicode Standard Annex #29, rev. 35, sec. 3.

Re-running the sample program above now gives the correct output:

Printing graphemes of "กำ"...
Grapheme 1: "กำ"
(1 grapheme(s) total.)

Printing graphemes of "🤷🏽‍♀️"...
Grapheme 1: "🤷🏽‍♀️"
(1 grapheme(s) total.)
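
The same counts are available without an explicit enumeration loop through the StringInfo.LengthInTextElements property. The following is a small sketch; the class name GraphemeCount and the helper CountTextElements are illustrative, and the text element counts it prints depend on the runtime the program runs on.

```csharp
using System;
using System.Globalization;

public static class GraphemeCount
{
    // Hypothetical helper: number of text elements StringInfo reports for s.
    public static int CountTextElements(string s) =>
        new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        string kam = "\u0e01\u0e33";
        string shrug = "\ud83e\udd37\ud83c\udffd\u200d\u2640\ufe0f";

        // The char counts (2 and 7) never change. The text element counts are
        // runtime-dependent: going by the outputs shown in this issue, 2 and 4
        // before .NET 5, and 1 and 1 on .NET 5 or later.
        Console.WriteLine($"{kam.Length} chars, {CountTextElements(kam)} text element(s)");
        Console.WriteLine($"{shrug.Length} chars, {CountTextElements(shrug)} text element(s)");
    }
}
```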

Visual Basic's StrReverse function is also updated to follow the standardized logic.
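
To illustrate why grapheme boundaries matter for string reversal, here is a rough sketch of reversing a string one text element at a time. This is not the StrReverse implementation; ReverseByTextElements is a hypothetical helper built on the TextElementEnumerator API discussed above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementReverser
{
    // Hypothetical helper (not the StrReverse implementation): reverses the
    // order of text elements while keeping the chars inside each one intact.
    public static string ReverseByTextElements(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }

        elements.Reverse();
        return string.Concat(elements);
    }

    public static void Main()
    {
        // "a", then the Thai grapheme kam (two chars), then "b".
        string input = "a\u0e01\u0e33b";
        string reversed = ReverseByTextElements(input);

        // On .NET 5 or later, kam stays together, so this should print True.
        // On earlier runtimes its two chars are treated as separate elements
        // and end up swapped.
        Console.WriteLine(reversed == "b\u0e01\u0e33a");
    }
}
```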

Reason for change

This change is part of a wider set of Unicode and UTF-8 improvements being made to .NET Core. It also provides a proper "extended grapheme cluster" enumeration API to complement the "Unicode scalar value" enumeration APIs which were introduced in .NET Core 3.0 with the System.Text.Rune type.
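
The difference between the two enumeration levels can be seen by running both over the same string. The following sketch assumes .NET Core 3.0 or later (for System.Text.Rune and string.EnumerateRunes); the class name RuneVersusGrapheme and the helper CountRunes are illustrative.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RuneVersusGrapheme
{
    // Hypothetical helper: counts the Unicode scalar values (runes) in s.
    public static int CountRunes(string s)
    {
        int count = 0;
        foreach (Rune rune in s.EnumerateRunes())
        {
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string shrug = "\ud83e\udd37\ud83c\udffd\u200d\u2640\ufe0f";

        // Scalar values: each surrogate pair collapses into a single rune.
        foreach (Rune rune in shrug.EnumerateRunes())
        {
            Console.WriteLine($"U+{rune.Value:X4}");
        }

        // Seven chars form five runes on any runtime that has Rune.
        Console.WriteLine($"{shrug.Length} chars, {CountRunes(shrug)} runes");

        // The text element count depends on the runtime; this issue reports
        // 1 on .NET 5 or later and 4 on earlier versions.
        Console.WriteLine($"{new StringInfo(shrug).LengthInTextElements} text element(s)");
    }
}
```

The rune count sits between the char count and the grapheme count: runes remove the UTF-16 surrogate encoding detail, while text elements additionally group scalar values that display as one unit.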

Recommended action

No action is needed on the part of developers. Developers should find that their applications now behave in a more standards-compliant fashion in a wider variety of globalization-related scenarios.

Category

  • Globalization

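Affected APIs

  • System.Globalization.StringInfo
  • System.Globalization.TextElementEnumerator
  • Microsoft.VisualBasic.Strings.StrReverse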

Issue metadata

  • Issue type: breaking-change