Proposal: Better API for StringInfo.GetTextElementEnumerator #19423

svick · 2016-11-19T14:49:16Z

To get text elements from a string, you can currently use System.Globalization.StringInfo.GetTextElementEnumerator:

public class StringInfo
{
    public static TextElementEnumerator GetTextElementEnumerator(string str);
    public static TextElementEnumerator GetTextElementEnumerator(string str, int index);
}

public class TextElementEnumerator : IEnumerator
{
    public bool MoveNext();
    public object Current { get; }

    public string GetTextElement();

    public int ElementIndex { get; }

    public void Reset();
}

Notice that TextElementEnumerator is a (non-generic) enumerator, not enumerable, so it can't be used in a foreach or in LINQ or pretty much any other collection-related operation. To use it, you write code like:

var textElementEnumerator = StringInfo.GetTextElementEnumerator(s);

while (textElementEnumerator.MoveNext())
{
    string textElement = textElementEnumerator.GetTextElement();
    // process textElement here
}

This makes the API unfamiliar and inconvenient to use. I propose that a new API based on IEnumerable<T> should be added.

Proposed API

class StringInfo
{
    public static IEnumerable<string> GetTextElements(string str);
    public static IEnumerable<string> GetTextElements(string str, int index);
}

The new methods work like the old methods, except that they return generic enumerable instead of enumerator.

Usage

The new API could be used just like any other IEnumerable<T>, e.g.:

foreach (var textElement in StringInfo.GetTextElementEnumerator(s))
{
    // process textElement here
}

Open questions

Returning IEnumerable<string> means information about ElementIndex is lost. Is that information useful enough to use something like IEnumerable<(string textElement, int index)> instead? (Possibly using a custom struct instead of a tuple.)
Each text element string is a substring of the input string. Would it be worth waiting for spans and use IEnumerable<ReadOnlyMemory<char>> instead?
Returning IEnumerable<string> requires allocating that IEnumerable<string>. Would it be worth to return struct enumerable with struct enumerator instead, which would avoid the allocation when used in foreach?

The text was updated successfully, but these errors were encountered:

tarekgh · 2016-11-21T16:20:12Z

I think having TextElementEnumerator impelment the IEnumerable interface should be enough to have it work with the foreach(...)

svick · 2016-11-23T01:12:13Z

@tarekgh That's an interesting option. I don't like that it would mean that the method name GetTextElementEnumerator and the type name TextElementEnumerator would become inaccurate, but it's nice that it does not increase API surface by much.

stephentoub · 2018-11-23T14:00:55Z

With all of the new Rune support being added, including enumerators, is this still needed?
cc: @GrabYourPitchforks

svick · 2018-11-23T14:25:09Z

@stephentoub Rune represents Unicode scalars, which are not the same thing as text elements. For example the string "a\u0301" (where the second character is U+0301 COMBINING ACUTE ACCENT) is a single text element (and GetTextElementEnumerator does return it that way), but it's two Unicode scalars.

So, unless the Rune API also has some explicit support for enumerating text elements that I missed, I don't think it changes anything here.

stephentoub · 2018-11-23T15:59:21Z

So, unless the Rune API also has some explicit support for enumerating text elements that I missed

I thought I heard Levi talking about doing that, which is why I asked, but maybe I misunderstood.

svick · 2018-11-24T12:12:10Z

I found this comment by @GrabYourPitchforks from June: dotnet/corefxlab#2350 (comment). Assuming it's still current and that I understand it correctly, I think it means that Utf8String/UnicodeScalar/Rune won't have API for enumerating text elements.

tarekgh · 2018-11-24T21:40:13Z

I think @svick is right and we should keep this proposal for now.

stephentoub · 2018-11-24T22:42:51Z

Ok.

GrabYourPitchforks · 2018-11-26T19:10:17Z

If we do this, it should probably be married with an overall logic update to TextElementEnumerator. The type's current logic is based on a fairly old version of the Unicode Standard, and it would be great to modernize it.

stephentoub · 2018-11-27T11:43:35Z

If we do this, it should probably be married with an overall logic update to TextElementEnumerator. The type's current logic is based on a fairly old version of the Unicode Standard, and it would be great to modernize it.

How breaking would it be if we updated to the latest standard regardless for 3.0?

tarekgh · 2018-11-27T17:16:02Z

How breaking would it be if we updated to the latest standard regardless for 3.0?

We don't consider updating the Unicode data as a breaking changes. it is like any other globalization data which can change at any time.

tarekgh · 2018-11-27T17:18:20Z

by the way, StringInfo is already using CharUnicodeInfo which should be using the Unicode data we have updated to the latest release of Unicode standard.

GrabYourPitchforks · 2019-09-25T07:21:47Z

Quick update: I found a way to smuggle the TR-29 grapheme break data (see https://www.unicode.org/reports/tr29/) in the existing CharUnicodeInfo class without significantly expanding the size of the RVA static data. Will experiment with this a bit more and submit a PR over the next few days.

GrabYourPitchforks · 2019-09-27T04:36:01Z

@ahsonkhan You had ideas for APIs which essentially took an input and returned an IEnumerable<Range>. Do you think something like that might be applicable more generally to string APIs?

APIs like this don't really need to be allocation-free like the ROS<char> APIs require - when considering string usability is of higher concern than is performance. One could imagine returning IEnumerable<string>, but there is significant value in knowing where this substring occurred in the original source string, so Range springs to mind.

But your thoughts on this would be appreciated since you have recent experience with this type of API surface. :)

Serentty · 2020-06-16T05:20:58Z

I feel like something like this should be a method directly on System.String, similar to the new EnumerateRunes() method. This is not something that most languages have yet done, but I think Swift set a good example by making extended grapheme cluster iteration so easily accessible (Swift actually made it the default way to iterate over a string, which is obviously not feasible for C#). I think that just like EnumerateRunes(), having it be directly on the type would allow for much higher visibility and awareness, and make it easier to reach for this when it is appropriate, instead of falling back on iterating code points or (even worse) UTF-16 code units when what you really mean is user-perceived characters.

Neme12 · 2023-08-31T20:54:54Z

Similar issue: #91003

Neme12 · 2023-08-31T21:20:09Z

You had ideas for APIs which essentially took an input and returned an IEnumerable<Range>. Do you think something like that might be applicable more generally to string APIs?

I really don't think Range should ever be returned. Range can have indexes that are from end, and the consumer of such an API would therefore need to always bind the span to the specific length in order to use it correctly. This is not what Range was intended for, it should really only be used as an input, never an output. A tuple with an index and a length would be better in that scenario.

Neme12 · 2023-08-31T21:22:06Z

I feel like something like this should be a method directly on System.String, similar to the new EnumerateRunes() method. This is not something that most languages have yet done, but I think Swift set a good example by making extended grapheme cluster iteration so easily accessible (Swift actually made it the default way to iterate over a string, which is obviously not feasible for C#). I think that just like EnumerateRunes(), having it be directly on the type would allow for much higher visibility and awareness, and make it easier to reach for this when it is appropriate, instead of falling back on iterating code points or (even worse) UTF-16 code units when what you really mean is user-perceived characters.

@Serentty This is exactly what I argue for in #91003 😄 You might want to look at that issue and voice your opinion as well.

Neme12 · 2023-08-31T21:30:36Z

Since the question of graphemes came up, I'll mention that we've been punting on the idea of having graphemes as a first-class citizen in the framework. (By "grapheme", I mean interpreting the 2-scalar sequence [ U+1F474 Older Man Emoji, U+1F3FF Fitzpatrick Type 6 Skin Modifier ] as the single grapheme "Older man with Fitzpatrick type 6 skin".) The reason we had been punting on this is that it tends to be more of a UI / text editor concern - not a general framework concern

@GrabYourPitchforks I strongly disagree with that. Most of the time you're working with strings, you want to work with whole characters, e.g. you always want to treat an "a" with an acute accent as one letter (or "character") and you never want to split a string on that. This how the majority of developers already think char works today (even though they're wrong), and they wouldn't expect a single readable character to be broken up into multiple chars.

Neme12 · 2023-08-31T21:35:01Z

For example, even in a simple scenario like converting between pascal case/snake case etc (so nothing to do with UI), you want to work with whole graphemes of the initial letters and transform those, not with code units or scalars, because an "a" with an acute accent is still a single letter.

Neme12 · 2023-08-31T21:38:09Z

Or when you want to limit the length of a string and show a validation error to the user when they exceeded the max length, you also want to work with graphemes, because a user would be very surprised if they type in "áá" and the error message says "You have exceeded the maximum length of 3 characters / Your text contains 4 characters.". A user has no idea what code units or scalars are and thinks of individual readable characters, i.e. graphemes.

It doesn't get much simpler than this when it comes to string functionality, and even in this case you want to work with graphemes.

EDIT: While this is technically UI, it definitely feels like something that should be a first-class citizen in the framework.

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 5.0 milestone Jan 31, 2020

GrabYourPitchforks mentioned this issue Feb 3, 2020

Type and supporting API for working with graphemes #31642

Closed

maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020

tarekgh removed the untriaged New issue has not been triaged by the area owner label Jun 16, 2020

tarekgh modified the milestones: 5.0.0, Future Jun 16, 2020

Gnbrkm41 mentioned this issue Oct 19, 2022

[Suggestion] Introduce more easy-to-use ways to work with grapheme clusters #77200

Open

Neme12 mentioned this issue Aug 23, 2023

[API Proposal]: Add modern, span-based alternative to StringInfo.GetTextElementEnumerator #91003

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Better API for StringInfo.GetTextElementEnumerator #19423

Proposal: Better API for StringInfo.GetTextElementEnumerator #19423

svick commented Nov 19, 2016

tarekgh commented Nov 21, 2016

svick commented Nov 23, 2016

stephentoub commented Nov 23, 2018

svick commented Nov 23, 2018 •

edited

Loading

stephentoub commented Nov 23, 2018

svick commented Nov 24, 2018

tarekgh commented Nov 24, 2018

stephentoub commented Nov 24, 2018

GrabYourPitchforks commented Nov 26, 2018

stephentoub commented Nov 27, 2018

tarekgh commented Nov 27, 2018

tarekgh commented Nov 27, 2018

GrabYourPitchforks commented Sep 25, 2019

GrabYourPitchforks commented Sep 27, 2019

Serentty commented Jun 16, 2020

Neme12 commented Aug 31, 2023

Neme12 commented Aug 31, 2023 •

edited

Loading

Neme12 commented Aug 31, 2023

Neme12 commented Aug 31, 2023 •

edited

Loading

Neme12 commented Aug 31, 2023 •

edited

Loading

Neme12 commented Aug 31, 2023 •

edited

Loading

Proposal: Better API for StringInfo.GetTextElementEnumerator #19423

Proposal: Better API for StringInfo.GetTextElementEnumerator #19423

Comments

svick commented Nov 19, 2016

Proposed API

Usage

Open questions

tarekgh commented Nov 21, 2016

svick commented Nov 23, 2016

stephentoub commented Nov 23, 2018

svick commented Nov 23, 2018 • edited Loading

stephentoub commented Nov 23, 2018

svick commented Nov 24, 2018

tarekgh commented Nov 24, 2018

stephentoub commented Nov 24, 2018

GrabYourPitchforks commented Nov 26, 2018

stephentoub commented Nov 27, 2018

tarekgh commented Nov 27, 2018

tarekgh commented Nov 27, 2018

GrabYourPitchforks commented Sep 25, 2019

GrabYourPitchforks commented Sep 27, 2019

Serentty commented Jun 16, 2020

Neme12 commented Aug 31, 2023

Neme12 commented Aug 31, 2023 • edited Loading

Neme12 commented Aug 31, 2023

Neme12 commented Aug 31, 2023 • edited Loading

Neme12 commented Aug 31, 2023 • edited Loading

Neme12 commented Aug 31, 2023 • edited Loading

svick commented Nov 23, 2018 •

edited

Loading

Neme12 commented Aug 31, 2023 •

edited

Loading

Neme12 commented Aug 31, 2023 •

edited

Loading

Neme12 commented Aug 31, 2023 •

edited

Loading

Neme12 commented Aug 31, 2023 •

edited

Loading