[Suggestion] Introduce more easy-to-use ways to work with grapheme clusters #77200

Gnbrkm41 · 2022-10-19T06:24:04Z

(Not marked as an API proposal because I think it definitely needs some more designing before putting one up)

When handling Unicode strings, you quickly realise that strings can be split into many different units. For example, there are the raw code units of the encoding used (char in case of UTF-16 / byte in case of UTF-8 / int in case of UTF-32), there are codepoints (System.Rune), and then there are grapheme clusters ("text elements"). There are many cases where all of those mean the same thing (e.g. normalised strings containing only BMP characters, encoded in UTF-16), but not always.
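To make the three units concrete, here is a minimal sketch that counts each of them for the same string. The sample string is my own, and the text element count assumes .NET 5 or later, where StringInfo follows the Unicode grapheme cluster rules for emoji modifiers.

```csharp
using System;
using System.Globalization;
using System.Linq;

// 'e' + U+0301 COMBINING ACUTE ACCENT, followed by U+1F449 U+1F3FB (the 👉🏻 emoji).
string sample = "e\u0301\U0001F449\U0001F3FB";

int utf16CodeUnits = sample.Length;                              // number of chars
int codePoints = sample.EnumerateRunes().Count();                // number of System.Rune values
int textElements = new StringInfo(sample).LengthInTextElements;  // number of grapheme clusters

Console.WriteLine($"{utf16CodeUnits} chars, {codePoints} runes, {textElements} text elements");
```

On .NET 5 or later this should report 6 chars, 4 runes and 2 text elements.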

Widespread usage of emojis (among other things) in modern applications means developers may run into situations where they need to work with individual grapheme clusters, especially when handling arbitrary user-supplied strings for display in a UI.

For example, one may want to truncate long user-supplied strings for display in a UI, and let users decide whether they want to see the full text by clicking a "Read more" button. A naive implementation of this would just take a substring of the full string... however, this proves to be problematic in certain cases, for example those involving emojis, as shown in the example below (not an endorsement or anything, just something I ran into a few days ago):

In this example (which is the YouTube application on Android), the 👉🏻 emoji, which consists of U+1F449 White Right Pointing Backhand Index and U+1F3FB Emoji Modifier Fitzpatrick Type-1-2, each encoded as a surrogate pair, ends up truncated, which results in U+FFFD Replacement Character being displayed instead.
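A minimal sketch of how this kind of breakage can happen with char-based truncation. The title string and the limit of 4 are made up for illustration; I don't know how the application in the screenshot actually truncates.

```csharp
using System;

// "Hi " followed by U+1F449 U+1F3FB (👉🏻), which takes four UTF-16 code units.
string title = "Hi \U0001F449\U0001F3FB";

// Naive truncation counts chars, so a limit of 4 cuts U+1F449 in half.
string truncated = title.Substring(0, 4);

// The last char is now a lone high surrogate, which cannot be rendered on its own.
bool endsWithLoneSurrogate = char.IsHighSurrogate(truncated[^1]);
Console.WriteLine(endsWithLoneSurrogate); // True
```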

While the above example can be solved somewhat by using EnumerateRunes, grapheme clusters are still not considered, so you may run into other cases such as 🧑🏽‍👦🏻‍🧒 which, without proper handling of grapheme clusters, may end up truncated inappropriately. There are also examples of multiple codepoints combining into a single grapheme cluster other than emojis, such as Hangul written as individual jamo, alphabets with combining marks, Indic consonant clusters etc. (although I'm not exactly sure how commonly those are actually used).
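As a rough illustration of that limitation, the sketch below truncates by codepoints instead of chars. The string and the limit are again my own. The result no longer contains a broken surrogate pair, but the skin tone modifier is silently dropped.

```csharp
using System;
using System.Linq;

// "Hi " followed by U+1F449 U+1F3FB (👉🏻).
string title = "Hi \U0001F449\U0001F3FB";

// Keep the first four codepoints: 'H', 'i', ' ' and U+1F449.
string byRunes = string.Concat(title.EnumerateRunes().Take(4).Select(r => r.ToString()));

// Well-formed UTF-16, but U+1F3FB is gone, so the emoji changes appearance.
bool modifierLost = byRunes == "Hi \U0001F449";
Console.WriteLine(modifierLost); // True
```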

Another issue on this topic outlines some more examples where better handling of grapheme clusters would be required: #31642

The above demonstrates how, in most user-facing situations, the grapheme cluster is the most logical unit to operate on, as it is meant to be displayed to / understood by humans as a single 'character' / 'glyph'. .NET has decent support for handling the raw encoded representation and individual codepoints; however, I feel that it currently does not expose enough APIs to manipulate grapheme clusters effectively.

Currently, it appears that the only way to enumerate over the grapheme clusters is by using methods associated with StringInfo, such as StringInfo.GetTextElementEnumerator and StringInfo.GetNextTextElementLength.
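For reference, this is roughly what enumerating with GetTextElementEnumerator looks like today. The sample string is my own, and the cluster count assumes .NET 5 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// 'e' + U+0301, then U+1F449 U+1F3FB (👉🏻), then '!'.
string sample = "e\u0301\U0001F449\U0001F3FB!";

var clusters = new List<string>();
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(sample);
while (enumerator.MoveNext())
{
    // GetTextElement() returns each cluster as a newly allocated string.
    clusters.Add(enumerator.GetTextElement());
}

Console.WriteLine(clusters.Count);
```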

However, given that those methods have existed since .NET Framework 1.1, they are less than ideal.

GetTextElementEnumerator
- Is a non-generic IEnumerator rather than an IEnumerable, so you can't use foreach with it, use LINQ, etc.
  - (It may be possible to implement IEnumerable on the enumerator itself, however: Proposal: Better API for StringInfo.GetTextElementEnumerator #19423 )
- Has an ugly object Current property (which is just GetTextElement, but typed as object)
- The only way to access an individual cluster is to use GetTextElement, which returns a new string. This means a new string may be allocated for every cluster, in the worst case one per char (if the original string is made up of only non-combining, non-surrogate characters).
GetNextTextElementLength
- You are able to obtain the lengths of the individual grapheme clusters, so you could use this to write your own non-allocating grapheme cluster enumerator; however, this involves extra steps and is less intuitive to use
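The "extra steps" look something like the sketch below. GetClusterRanges is a hypothetical helper of mine, not a .NET API. It uses the GetNextTextElementLength(ReadOnlySpan<char>) overload, so it assumes .NET 5 or later. It allocates one list of ranges, but no string per cluster.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not a .NET API): records where each grapheme cluster
// starts and ends without creating a string for it.
static List<Range> GetClusterRanges(ReadOnlySpan<char> input)
{
    var result = new List<Range>();
    int offset = 0;
    while (offset < input.Length)
    {
        // Length, in chars, of the cluster that starts at 'offset'.
        int clusterLength = StringInfo.GetNextTextElementLength(input.Slice(offset));
        result.Add(offset..(offset + clusterLength));
        offset += clusterLength;
    }
    return result;
}

// 'e' + U+0301, then U+1F449 U+1F3FB (👉🏻), then '!'.
string sample = "e\u0301\U0001F449\U0001F3FB!";
List<Range> clusterRanges = GetClusterRanges(sample);

foreach (Range range in clusterRanges)
{
    // Slicing the span gives access to the cluster without copying it.
    ReadOnlySpan<char> cluster = sample.AsSpan()[range];
    Console.WriteLine($"{range}: {cluster.Length} chars");
}
```

Returning ranges instead of strings keeps the caller free to slice either the string or a span, at the cost of a less convenient call site than a plain foreach over clusters.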

And generally, those two methods for working with grapheme clusters are (somewhat) hidden away in the System.Globalization namespace, when IMO they really belong in the String class.

Dart has a characters package that allows you to access the sequence of grapheme clusters like str.characters, and, as an extreme example, Swift makes individual grapheme clusters (named Character) the default unit of manipulation. I believe that .NET should make handling of grapheme clusters more accessible & easier.

Rust: https://crates.io/crates/unicode-segmentation

Initially, I was thinking of essentially providing EnumerateRunes but instead of runes we use grapheme clusters, but I'm not sure if that would provide you enough ability to effectively and easily handle grapheme clusters.

Two approaches I can think of would be as follows:

Essentially come up with a new string-like / string wrapper type that treats a single grapheme cluster as the default unit of manipulation (basically, what Swift did with its strings)
- Do we introduce a new type that represents individual grapheme clusters?
Come up with multiple helper methods that operate on strings / ReadOnlySpan<char>s, treating their contents as sequences of grapheme clusters
- Basically providing EnumerateRunes-like methods on string, as members or as extension methods

Here's a list of operations on strings, roughly taken from the string documentation.

Substring
- Substring that takes grapheme clusters into consideration
Comparison
- Comparison between normalised characters and non-normalised characters? Although it could be argued that invariant culture comparison is enough
Enumeration
- Sequential enumeration using enumerators - should be non-allocating if we use ReadOnlySpan, but is limited in capabilities
- Returning a list of indices / Ranges - might be allocating but not as much
- Returning actual strings
Indexing
- Not exactly sure on this. It would not quite be a random access unless we pre-calculate where the boundaries are...
IsNormalized
Normalize
Insert
IsNullOrEmpty / IsNullOrWhitespace
Join
Searching
- Contains - might be enough with invariant culture comparison
- IndexOf / LastIndexOf (Any) - Should we return the char index or the index of the grapheme cluster?
- StartsWith / EndsWith
Splitting
PadLeft/Right
Remove
Replace
ReplaceLineEndings
ToUpper / ToLower
Trim (Start/End)
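To give a feel for what the Substring and Indexing items involve with today's APIs, here is a rough sketch. ClusterAt is a hypothetical helper of mine, not a .NET API. The boundaries assume .NET 5 or later, and the boundary table has to be rebuilt whenever the string changes.

```csharp
using System;
using System.Globalization;

// "Hi " followed by U+1F449 U+1F3FB (👉🏻), then '!'.
string sample = "Hi \U0001F449\U0001F3FB!";

// Substring: keep the first four grapheme clusters ("Hi " plus the whole emoji).
string firstFour = new StringInfo(sample).SubstringByTextElements(0, 4);

// Indexing: precompute the char index where each cluster starts,
// so that later lookups by cluster index do not rescan the string.
int[] clusterStarts = StringInfo.ParseCombiningCharacters(sample);

// Hypothetical helper (not a .NET API): returns the n-th grapheme cluster as a span.
static ReadOnlySpan<char> ClusterAt(string source, int[] boundaries, int n)
{
    int begin = boundaries[n];
    int end = n + 1 < boundaries.Length ? boundaries[n + 1] : source.Length;
    return source.AsSpan(begin, end - begin);
}

int emojiChars = ClusterAt(sample, clusterStarts, 3).Length;
Console.WriteLine($"{firstFour.Length} chars kept, emoji cluster is {emojiChars} chars long");
```

This also shows the open question in the Searching item: the emoji is grapheme cluster 3 but starts at char index 3 and ends at char index 7, so the two kinds of index diverge as soon as a multi-char cluster appears.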

Something I haven't considered is UTF-8 encoded ReadOnlySpan<byte> strings and ReadOnlySpan<char>s. If we are going to provide those grapheme cluster APIs on string, it might be worthwhile to provide similar APIs for those span-based strings as well.

Any opinions are welcome.

Also, one open question - How should we refer to those "grapheme clusters" in API names and such?

TextElement
- Existing APIs already call it this
- However, aren't there some other types that are also named TextElement in some UI frameworks?
- "text element" can really mean anything IMO, ranging from individual characters to words / sentences etc.
GraphemeCluster
- The most correct wording, as it appears on the Unicode documents
- Kind of a long name?
Character
- Some other language libraries refer to it as Character
- but we have char which is way different from a grapheme cluster

I have mixed feelings about going with either TextElement or GraphemeCluster, but I'm personally fine with going with TextElement.

dotnet-issue-labeler · 2022-10-19T06:24:10Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2022-10-19T09:27:54Z

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Issue Details