[Suggestion] Introduce more easy-to-use ways to work with grapheme clusters #77200

Open
Gnbrkm41 opened this issue Oct 19, 2022 · 4 comments

@Gnbrkm41
Contributor

(Not marked as an API proposal because I think it definitely needs some more designing before putting one up)

When handling Unicode strings, you quickly realise that strings can be split into several different units. There are the code units of the encoding used (char for UTF-16 / byte for UTF-8 / int for UTF-32), there are code points (System.Rune), and then there are grapheme clusters ("text elements"). In a lot of cases all of these mean the same thing (e.g. normalised strings containing only BMP characters, encoded in UTF-16), but not always.

The widespread usage of emojis (among other things) in modern applications means developers may run into situations where they need to work with individual grapheme clusters, especially when handling arbitrary user-supplied strings for display in UI.

For example, one may want to truncate long user-supplied strings for display in the UI, and let users decide if they want to see the full text by clicking a "Read more" button. A naive implementation would just take a substring of the full string... however, this proves to be problematic in certain cases, for example those involving emojis, as shown in the example below (not an endorsement or anything, just something I ran into a few days ago):

[Two screenshots: Screenshot_20221018-234858_YouTube_1, Screenshot_20221018-234901_YouTube_1]

In this example (which is the YouTube application on Android), the 👉🏻 emoji, which consists of U+1F449 White Right Pointing Backhand Index and U+1F3FB Emoji Modifier Fitzpatrick Type-1-2, each encoded as a surrogate pair, ends up truncated in the middle, which results in U+FFFD Replacement Character being displayed instead.
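A minimal sketch of how this kind of truncation can go wrong in C#. The sample text and the cut-off position are made up for illustration; only string.Substring and char.IsHighSurrogate are used.

```csharp
using System;

// U+1F449 followed by U+1F3FB; each code point is a surrogate pair in UTF-16.
string text = "Tap \U0001F449\U0001F3FB";

// "Tap " is 4 chars and the emoji sequence is 4 more (two surrogate pairs).
Console.WriteLine(text.Length); // 8

// A naive cut after 5 chars keeps only the high surrogate of U+1F449.
string truncated = text.Substring(0, 5);
bool endsWithLoneSurrogate = char.IsHighSurrogate(truncated[^1]);
Console.WriteLine(endsWithLoneSurrogate); // True
```

A lone high surrogate is not valid UTF-16 on its own, which is why renderers typically fall back to U+FFFD.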

While the above example can be solved to some extent by using EnumerateRunes, grapheme clusters are still not taken into account, so you may run into other cases such as 🧑🏽‍👦🏻‍🧒, which may end up truncated inappropriately without proper handling of grapheme clusters. There are also examples of multiple code points combining into a single grapheme cluster other than emojis, such as Hangul written as separate jamo, letters with combining marks, Indic consonant clusters etc. (although I'm not exactly sure how commonly those are actually used).
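To make the three units concrete, here is a small sketch that counts chars, runes and text elements for a letter followed by a combining mark. The input string is an illustrative choice; a combining-mark example is used because its result does not depend on which Unicode version the runtime implements, whereas counts for emoji sequences can differ between .NET versions.

```csharp
using System;
using System.Globalization;
using System.Text;

// "e" followed by U+0301 COMBINING ACUTE ACCENT, displayed as a single "é".
string s = "e\u0301";

int charCount = s.Length; // UTF-16 code units

int runeCount = 0; // code points
foreach (Rune rune in s.EnumerateRunes())
{
    runeCount++;
}

// Grapheme clusters ("text elements" in .NET terminology).
int textElementCount = new StringInfo(s).LengthInTextElements;

Console.WriteLine($"{charCount} chars, {runeCount} runes, {textElementCount} text element(s)");
// 2 chars, 2 runes, 1 text element(s)
```

Enumerating runes protects against splitting a surrogate pair, but it would still separate the accent from its base letter.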

Another issue on this topic outlines some more examples where better handling of grapheme clusters would be required: #31642

The above demonstrates how, in most user-facing situations, the grapheme cluster is the most logical unit to operate on, as it is meant to be displayed to / understood by humans as a single 'character' / 'glyph'. .NET has decent support for handling the raw code units and individual code points, but I feel that it currently does not expose enough APIs to manipulate grapheme clusters effectively.

Currently, it appears that the only way to enumerate over the grapheme clusters is by using methods associated with StringInfo, such as StringInfo.GetTextElementEnumerator and StringInfo.GetNextTextElementLength.

However, given that those methods have existed since .NET Framework 1.1, they are less than ideal.

  • GetTextElementEnumerator
    • Returns a non-generic IEnumerator (not an IEnumerable), so you can't use foreach with it / use LINQ etc.
    • Has an ugly object Current property (which is just GetTextElement but in object)
    • The only way to access the individual cluster is to use GetTextElement; however, this returns a new string, meaning that a new string could be allocated, in the worst case, for every single char (if the original string is made up of only non-combining, non-surrogate characters).
  • GetNextTextElementLength
    • You are able to obtain the lengths of the individual grapheme clusters, so using this you could write your own non-allocating grapheme cluster enumerator; however, it involves extra steps & is less intuitive to use
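For comparison, a rough sketch of both existing options side by side. It assumes a runtime where the StringInfo.GetNextTextElementLength(ReadOnlySpan<char>) overload is available; the sample string and the counter variables are only there for illustration.

```csharp
using System;
using System.Globalization;

string s = "e\u0301x"; // "e" + U+0301 COMBINING ACUTE ACCENT, then "x"

// Existing enumerator: manual MoveNext loop, one new string per GetTextElement call.
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
int allocatedCount = 0;
while (enumerator.MoveNext())
{
    string element = enumerator.GetTextElement();
    Console.WriteLine($"index {enumerator.ElementIndex}, length {element.Length}");
    allocatedCount++;
}

// Span-based alternative: walk the cluster boundaries and slice, creating no strings.
ReadOnlySpan<char> rest = s;
int offset = 0;
int sliceCount = 0;
while (!rest.IsEmpty)
{
    int length = StringInfo.GetNextTextElementLength(rest);
    ReadOnlySpan<char> cluster = rest.Slice(0, length); // a view into s, not a copy
    Console.WriteLine($"offset {offset}, length {cluster.Length}");
    offset += length;
    rest = rest.Slice(length);
    sliceCount++;
}
```

The second loop works, but the offset bookkeeping is exactly the kind of extra step a built-in enumerator could hide.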

And generally, those two methods for working with grapheme clusters are (somewhat) hidden away in the System.Globalization namespace, when IMO they really belong in the String class.

Dart has a characters package that allows you to access the sequence of grapheme clusters like str.characters, and as an extreme example, Swift makes individual grapheme clusters (named Character) the default unit of manipulation. I believe that .NET should make handling of grapheme clusters more accessible & easier.

Rust: https://crates.io/crates/unicode-segmentation

Initially, I was thinking of essentially providing EnumerateRunes but instead of runes we use grapheme clusters, but I'm not sure if that would provide you enough ability to effectively and easily handle grapheme clusters.

Two approaches I can think of are as follows:

  • Essentially come up with a new string-like / string wrapper type that treats a single grapheme cluster as the default unit of manipulation (basically, what Swift did with their strings)
    • Do we introduce a new type that represents individual grapheme clusters?
  • Come up with multiple helper methods that operate on strings / ReadOnlySpan<char>s, treating their contents as sequences of grapheme clusters
    • Basically providing EnumerateRunes-like methods on string as members / as extension methods

Here's a list of operations on strings, roughly taken from the string documentation.

  • Substring
    • Substring that takes grapheme clusters into consideration
  • Comparison
    • Comparison between normalised characters and non-normalised characters? Although, could be argued that invariant culture comparison is enough
  • Enumeration
    • Sequential enumeration using enumerators - should be non-allocating if we use ReadOnlySpan, however it is limited in capabilities
    • Returning a list of indices / Ranges - might be allocating but not as much
    • Returning actual strings
  • Indexing
    • Not exactly sure on this. It would not quite be random access unless we pre-calculate where the boundaries are...
  • IsNormalized
  • Normalize
  • Insert
  • IsNullOrEmpty / IsNullOrWhiteSpace
  • Join
  • Searching
    • Contains - might be enough with invariant culture comparison
    • IndexOf / LastIndexOf (Any) - Should we return char index or index of the grapheme cluster?
    • StartsWith / EndsWith
  • Splitting
  • PadLeft/Right
  • Remove
  • Replace
  • ReplaceLineEndings
  • ToUpper / ToLower
  • Trim (Start/End)

Something I haven't considered is the UTF-8 encoded ReadOnlySpan<byte> strings and ReadOnlySpan<char>s. If we are going to provide those grapheme cluster APIs to string, it might be worthwhile to provide similar APIs for those span-based strings as well.

Any opinions are welcome.

Also, one open question - How should we refer to those "grapheme clusters" in API names and such?

  • TextElement
    • Existing APIs call it as such
    • However, aren't there some other types that are also named TextElement in some UI frameworks?
    • "text element" can really mean anything IMO, ranging from individual characters to words / sentences etc.
  • GraphemeCluster
    • The most correct wording, as it appears in the Unicode documents
    • Kind of a long name?
  • Character
    • Some other language libraries refer to it as Character
    • but we have char which is way different from a grapheme cluster

I have mixed feelings about choosing between TextElement and GraphemeCluster, but I'm personally fine with going with TextElement.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 19, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost

ghost commented Oct 19, 2022

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.


Author: Gnbrkm41
Assignees: -
Labels:

area-System.Globalization, untriaged

Milestone: -

@tarekgh tarekgh added this to the Future milestone Oct 19, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Oct 19, 2022
@tarekgh
Member

tarekgh commented Oct 19, 2022

CC @GrabYourPitchforks

@Neme12

Neme12 commented Aug 23, 2023

Related issue: #91003
