[Suggestion] Introduce more easy-to-use ways to work with grapheme clusters #77200
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/area-system-globalization
Related issue: #91003
(Not marked as an API proposal because I think it definitely needs some more designing before putting one up)
When handling Unicode strings, you quickly realise that strings can be split into many different units. For example, there are the raw, minimum units of the encoding used (`char` in case of UTF-16 / `byte` in case of UTF-8 / `int` in case of UTF-32), there are codepoints (`System.Text.Rune`), and then there are grapheme clusters ("text elements"). There are many cases where all of those mean the same thing (e.g. normalised strings containing only BMP characters, encoded in UTF-16), but not always.

Widespread usage of emojis (among other things) in modern applications means developers may run into situations where they need to work with individual grapheme clusters, especially when handling arbitrary user-supplied strings for display in UI.
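To make the difference between these units concrete, here is a small sketch that counts the same string in each of them. The class and method names (`StringUnits`, `Count`) are made up for illustration; the calls it relies on (`string.EnumerateRunes`, `Encoding.UTF8.GetByteCount`, `StringInfo.LengthInTextElements`) are existing .NET APIs. It assumes .NET 5 or later, where `StringInfo` follows the Unicode extended grapheme cluster rules.

```csharp
using System.Globalization;
using System.Text;

public static class StringUnits
{
    // Hypothetical helper: counts one string in four different units.
    public static (int Utf16CodeUnits, int Utf8Bytes, int Runes, int TextElements) Count(string s)
    {
        // Count Unicode scalar values (codepoints) one by one.
        int runes = 0;
        foreach (Rune rune in s.EnumerateRunes())
        {
            runes++;
        }

        return (
            s.Length,                                // UTF-16 code units (char)
            Encoding.UTF8.GetByteCount(s),           // UTF-8 code units (byte)
            runes,                                   // codepoints (Rune)
            new StringInfo(s).LengthInTextElements); // grapheme clusters
    }
}
```

Under that assumption, `StringUnits.Count("👉🏻")` should report 4 UTF-16 code units, 8 UTF-8 bytes, 2 runes and 1 text element, while a plain ASCII string reports the same number for all four.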
For example, one may want to truncate long user-supplied strings for display in the UI, and let users decide whether they want to see the full text by clicking a "Read more" button. A naive implementation of this would just take a substring of the full string. However, this proves to be problematic in certain cases, for example those involving emojis, as shown in the example below (not an endorsement or anything, just something I ran into a few days ago):

In this example (which is the YouTube application on Android), the 👉🏻 emoji, which consists of U+1F449 White Right Pointing Backhand Index and U+1F3FB Emoji Modifier Fitzpatrick Type-1-2 encoded as surrogate pairs, ends up truncated, which results in U+FFFD Replacement Character being displayed instead.

While the above example can be solved somewhat by using `EnumerateRunes`, grapheme clusters are still not considered, so you may run into other cases such as 🧑🏽👦🏻🧒 which, without proper handling of grapheme clusters, may end up truncated inappropriately. There also exist examples of multiple codepoints combining into a single grapheme cluster other than just emojis, such as Hangul in parts, alphabets with combining marks, Indic consonant clusters etc. (although I'm not exactly sure how commonly those are actually used).

Another issue on this topic outlines some more examples where better handling of grapheme clusters would be required: #31642
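As a rough illustration of the truncation scenario, the sketch below cuts a string after a given number of grapheme clusters instead of a given number of `char`s. `GraphemeTruncation` and `TruncateByTextElements` are hypothetical names, not existing APIs; the sketch assumes a runtime that provides the `StringInfo.GetNextTextElementLength(ReadOnlySpan<char>)` overload and the .NET 5+ grapheme cluster behaviour.

```csharp
using System;
using System.Globalization;

public static class GraphemeTruncation
{
    // Hypothetical helper: returns the longest prefix of `text` that holds
    // at most `maxClusters` grapheme clusters, never cutting inside one.
    public static string TruncateByTextElements(string text, int maxClusters)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        if (maxClusters < 0) throw new ArgumentOutOfRangeException(nameof(maxClusters));

        int index = 0; // position in UTF-16 code units
        int count = 0; // grapheme clusters consumed so far
        while (index < text.Length && count < maxClusters)
        {
            // Length (in chars) of the cluster that starts at `index`.
            index += StringInfo.GetNextTextElementLength(text.AsSpan(index));
            count++;
        }

        return text.Substring(0, index);
    }
}
```

With the input `"Hi 👉🏻!"`, `text.Substring(0, 4)` ends in a lone high surrogate, whereas `TruncateByTextElements(text, 4)` keeps the whole emoji. Note that the result is limited by cluster count, so its `Length` in `char`s can exceed `maxClusters`.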
The above demonstrates how, in most user-facing situations, the grapheme cluster is the most logical unit to operate on, as it is meant to be displayed to / understood by humans as a single 'character' / 'glyph'. .NET has decent support for handling the raw bytes representation and individual codepoints; however, I feel that it currently does not expose enough APIs to manipulate grapheme clusters effectively.
Currently, it appears that the only way to enumerate over the grapheme clusters is by using methods associated with `StringInfo`, such as `StringInfo.GetTextElementEnumerator` and `StringInfo.GetNextTextElementLength`.

However, given how those methods existed from .NET Framework 1.1, those two methods are less than ideal.

- `GetTextElementEnumerator`
  - `IEnumerator`, so you can't use `foreach` with it / use LINQ etc
  - `object Current` property (which is just `GetTextElement` but in `object`)
  - `GetTextElement`, however this returns a new `string`, meaning that using this could result in a new `string` being allocated, in the worst case for every single `char` (if the original string is made out of only non-combining, non-surrogate characters.)
- `GetNextTextElementLength`
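For context, enumerating text elements with the existing API looks roughly like the sketch below. `TextElementSketch.EnumerateTextElements` is a hypothetical wrapper written for this illustration; it only exists to make the result usable with `foreach` and LINQ, and it still allocates one `string` per grapheme cluster, which is the cost described above.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class TextElementSketch
{
    // Hypothetical wrapper around the existing StringInfo API.
    public static IEnumerable<string> EnumerateTextElements(string text)
    {
        // TextElementEnumerator only implements the non-generic IEnumerator,
        // so it has to be driven manually with MoveNext().
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // GetTextElement() returns a newly allocated string for each cluster.
            yield return enumerator.GetTextElement();
        }
    }
}
```

A call such as `string.Join("|", TextElementSketch.EnumerateTextElements(text))` then works, at the price of those per-cluster allocations.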
And generally, those two methods for working with grapheme clusters are (somewhat) hidden away in the System.Globalization namespace, when IMO they really belong in the String class.
Dart has the `characters` package that allows you to access the sequence of grapheme clusters like `str.characters`, and as an extreme example Swift makes individual grapheme clusters (named `Character`) the default unit of manipulation. I believe that .NET should make handling of grapheme clusters more accessible & easier.

Rust: https://crates.io/crates/unicode-segmentation
Initially, I was thinking of essentially providing `EnumerateRunes`, but with grapheme clusters instead of runes. However, I'm not sure if that would give you enough ability to handle grapheme clusters effectively and easily.

Two approaches I can think of are as follows:
- `string`-like / `string` wrapper type that considers the default unit of manipulation a single grapheme cluster by default (basically, what Swift did with their strings)
- `string` / `ReadOnlySpan<char>`s, treating those strings / ROSs as grapheme clusters
  - `EnumerateRunes`-like methods on string as members / as extension methods

Here's a list of operations on strings, roughly taken from the string documents.
- `ReadOnlySpan` however is limited in capabilities
- `Range`s - might be allocating but not as much
- `string`s
- `char` index or index of the grapheme cluster?

Something I haven't considered is the UTF-8 encoded `ReadOnlySpan<byte>` strings and `ReadOnlySpan<char>`s. If we are going to provide those grapheme cluster APIs to `string`, it might be worthwhile to provide similar APIs for those span-based strings as well.

Any opinions are welcome.
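To give the `EnumerateRunes`-like idea a concrete shape, here is one possible sketch of an allocation-free enumerator that yields each grapheme cluster as a `ReadOnlySpan<char>` slice. Everything named here (`GraphemeExtensions`, `EnumerateTextElementSpans`, `TextElementSpanEnumerator`) is hypothetical and not an existing or proposed .NET API; only `StringInfo.GetNextTextElementLength(ReadOnlySpan<char>)` is a real call, and the sketch assumes a runtime that provides it.

```csharp
using System;
using System.Globalization;

public static class GraphemeExtensions
{
    // Hypothetical extension methods, modeled loosely on string.EnumerateRunes().
    public static TextElementSpanEnumerator EnumerateTextElementSpans(this string text)
        => new TextElementSpanEnumerator(text.AsSpan());

    public static TextElementSpanEnumerator EnumerateTextElementSpans(this ReadOnlySpan<char> text)
        => new TextElementSpanEnumerator(text);

    // A ref struct so it can hold spans; usable with foreach via the
    // GetEnumerator/MoveNext/Current pattern, but not with LINQ.
    public ref struct TextElementSpanEnumerator
    {
        private ReadOnlySpan<char> _remaining;
        private ReadOnlySpan<char> _current;

        public TextElementSpanEnumerator(ReadOnlySpan<char> text)
        {
            _remaining = text;
            _current = default;
        }

        // The current grapheme cluster, as a slice of the original text.
        public ReadOnlySpan<char> Current => _current;

        public TextElementSpanEnumerator GetEnumerator() => this;

        public bool MoveNext()
        {
            if (_remaining.IsEmpty)
            {
                return false;
            }

            // Slice off the next cluster without allocating a string.
            int length = StringInfo.GetNextTextElementLength(_remaining);
            _current = _remaining.Slice(0, length);
            _remaining = _remaining.Slice(length);
            return true;
        }
    }
}
```

Usage would look like `foreach (ReadOnlySpan<char> cluster in text.EnumerateTextElementSpans()) { ... }`. The trade-off matches the one noted above for `ReadOnlySpan`: no per-cluster allocation, but the spans cannot be stored in collections or passed through LINQ.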
Also, one open question - How should we refer to those "grapheme clusters" in API names and such?
- `TextElement`
  - `TextElement` in some UI frameworks?
- `GraphemeCluster`
- `Character`
  - `Character`
  - `char` which is way different from a grapheme cluster

I have mixed feelings about going with either `TextElement` or `GraphemeCluster`, but I'm personally fine with going with `TextElement`.