.NET 5 breaking change: StringInfo and TextElementEnumerator classes are now UAX29-compliant #16702
.NET 5's System.Globalization.StringInfo and System.Globalization.TextElementEnumerator classes now follow UAX29 guidelines for extended grapheme enumeration
Unicode has the concept of a "grapheme", which roughly approximates what the user perceives as a single display character. In UAX#29, this is properly formalized as a concept called an "extended grapheme cluster".
Consider the string containing the Thai character kam ("กำ"). This string actually consists of two chars: 'ก' (= '\u0e01') THAI CHARACTER KO KAI, followed by 'ำ' (= '\u0e33') THAI CHARACTER SARA AM.
When displayed to the user, the operating system combines them to form the single display character (or grapheme) kam, or กำ.
This is also of use when displaying emoji. Consider the string containing the emoji Woman Shrugging: Medium Skin Tone ("🤷🏽‍♀️"). This string actually consists of seven chars: [ '\ud83e', '\udd37', '\ud83c', '\udffd', '\u200d', '\u2640', '\ufe0f' ]. But the operating system will combine all seven of these together and display them as a single unit.
In .NET, the System.Globalization.StringInfo and System.Globalization.TextElementEnumerator classes allow developers to inspect a string and to get information about the graphemes it contains. Prior to this change, the StringInfo and TextElementEnumerator types contained legacy logic that didn't properly handle all grapheme clusters. With this change, these two types process grapheme clusters according to the latest version of the Unicode Standard.
The .NET documentation sometimes uses the term "text element" when referring to a grapheme.
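To make the char counts above concrete, the following C# sketch prints each UTF-16 code unit of the two example strings. The class name CharDump and the helper DescribeChars are hypothetical names chosen for illustration. The strings are written with \u escapes so the result does not depend on source-file or console encoding.

```csharp
using System;
using System.Linq;

public static class CharDump
{
    // Formats each UTF-16 code unit (char) of the string as a \uXXXX escape.
    public static string DescribeChars(string s) =>
        string.Join(" ", s.Select(c => $"\\u{(int)c:x4}"));

    public static void Main()
    {
        string kam = "\u0e01\u0e33";                                    // Thai kam
        string shrug = "\ud83e\udd37\ud83c\udffd\u200d\u2640\ufe0f";    // Woman Shrugging: Medium Skin Tone

        // string.Length counts chars (UTF-16 code units), not graphemes.
        Console.WriteLine($"{kam.Length}: {DescribeChars(kam)}");
        // prints "2: \u0e01 \u0e33"
        Console.WriteLine($"{shrug.Length}: {DescribeChars(shrug)}");
        // prints "7: \ud83e \udd37 \ud83c \udffd \u200d \u2640 \ufe0f"
    }
}
```

Neither count matches what a user would call one character, which is the gap that StringInfo and TextElementEnumerator are meant to close.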
Version introduced
Introduced in .NET 5. The original tracking issue is https://github.com/dotnet/corefx/issues/41324.
Old behavior
In .NET Framework (all versions) and .NET Core 3.x and earlier, the StringInfo and TextElementEnumerator types implemented custom logic that handled certain combining classes but did not fully comply with the Unicode Standard.
In the case of the single Thai character kam, the StringInfo and TextElementEnumerator classes incorrectly split the character back into its constituent components instead of keeping them together.
In the case of the emoji character, the StringInfo and TextElementEnumerator classes incorrectly split the emoji character into four clusters: person shrugging, skin tone modifier, gender modifier, and an invisible combiner. All of these should have been kept together as a single grapheme.
New behavior
Starting with .NET 5, the StringInfo and TextElementEnumerator classes follow the Unicode Standard exactly. In particular, they now return extended grapheme clusters. Their logic implements the standard as defined by Unicode Standard Annex #29, rev. 35, sec. 3.
For the two example strings discussed earlier, the classes now give the correct result: the Thai character kam is reported as a single grapheme, and so is the emoji.
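As an illustration, a minimal sketch of such an enumeration might look like the following. It is not the original sample from this issue. The class name GraphemeDemo and the helper GetTextElements are hypothetical, while StringInfo.GetTextElementEnumerator and TextElementEnumerator.GetTextElement are the existing .NET APIs. The program prints counts rather than the strings themselves so that the result does not depend on console encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class GraphemeDemo
{
    // Collects the text elements (graphemes) reported by TextElementEnumerator.
    public static List<string> GetTextElements(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        string[] samples =
        {
            "\u0e01\u0e33",                                  // Thai kam
            "\ud83e\udd37\ud83c\udffd\u200d\u2640\ufe0f",    // Woman Shrugging: Medium Skin Tone
        };

        foreach (string s in samples)
        {
            List<string> elements = GetTextElements(s);
            // Per the behavior described here, .NET 5 and later should report
            // 1 text element for each sample; earlier runtimes report more.
            Console.WriteLine($"{s.Length} chars, {elements.Count} text element(s)");
        }
    }
}
```

Running the same program on .NET Core 3.x or .NET Framework is one way to observe the difference between the old and new behavior, since the code itself does not change.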
Visual Basic's StrReverse function is also updated to follow the standardized logic.
Reason for change
This change is part of a wider set of Unicode and UTF-8 improvements being made to .NET Core. It also provides a proper "extended grapheme cluster" enumeration API to complement the "Unicode scalar value" enumeration APIs which were introduced in .NET Core 3.0 with the System.Text.Rune type.
Recommended action
No action is needed on the part of developers. Developers should find that their applications now behave in a more standards-compliant fashion in a wider variety of globalization-related scenarios.
Category
Affected APIs
System.Globalization.StringInfo
System.Globalization.TextElementEnumerator
Microsoft.VisualBasic.Strings.StrReverse