Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Better API for StringInfo.GetTextElementEnumerator #19423

Open
svick opened this issue Nov 19, 2016 · 21 comments
Open

Proposal: Better API for StringInfo.GetTextElementEnumerator #19423

svick opened this issue Nov 19, 2016 · 21 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Globalization
Milestone

Comments

@svick
Copy link
Contributor

svick commented Nov 19, 2016

To get text elements from a string, you can currently use System.Globalization.StringInfo.GetTextElementEnumerator:

public class StringInfo
{
    public static TextElementEnumerator GetTextElementEnumerator(string str);
    public static TextElementEnumerator GetTextElementEnumerator(string str, int index);
}

public class TextElementEnumerator : IEnumerator
{
    public bool MoveNext();
    public object Current { get; }

    public string GetTextElement();

    public int ElementIndex { get; }

    public void Reset();
}

Notice that TextElementEnumerator is a (non-generic) enumerator, not enumerable, so it can't be used in a foreach or in LINQ or pretty much any other collection-related operation. To use it, you write code like:

var textElementEnumerator = StringInfo.GetTextElementEnumerator(s);

while (textElementEnumerator.MoveNext())
{
    string textElement = textElementEnumerator.GetTextElement();
    // process textElement here
}

This makes the API unfamiliar and inconvenient to use. I propose that a new API based on IEnumerable<T> should be added.

Proposed API

class StringInfo
{
    public static IEnumerable<string> GetTextElements(string str);
    public static IEnumerable<string> GetTextElements(string str, int index);
}

The new methods work like the old methods, except that they return generic enumerable instead of enumerator.

Usage

The new API could be used just like any other IEnumerable<T>, e.g.:

foreach (var textElement in StringInfo.GetTextElementEnumerator(s))
{
    // process textElement here
}

Open questions

  • Returning IEnumerable<string> means information about ElementIndex is lost. Is that information useful enough to use something like IEnumerable<(string textElement, int index)> instead? (Possibly using a custom struct instead of a tuple.)
  • Each text element string is a substring of the input string. Would it be worth waiting for spans and use IEnumerable<ReadOnlyMemory<char>> instead?
  • Returning IEnumerable<string> requires allocating that IEnumerable<string>. Would it be worth to return struct enumerable with struct enumerator instead, which would avoid the allocation when used in foreach?
@tarekgh
Copy link
Member

tarekgh commented Nov 21, 2016

I think having TextElementEnumerator impelment the IEnumerable interface should be enough to have it work with the foreach(...)

@svick
Copy link
Contributor Author

svick commented Nov 23, 2016

@tarekgh That's an interesting option. I don't like that it would mean that the method name GetTextElementEnumerator and the type name TextElementEnumerator would become inaccurate, but it's nice that it does not increase API surface by much.

@stephentoub
Copy link
Member

With all of the new Rune support being added, including enumerators, is this still needed?
cc: @GrabYourPitchforks

@svick
Copy link
Contributor Author

svick commented Nov 23, 2018

@stephentoub Rune represents Unicode scalars, which are not the same thing as text elements. For example the string "a\u0301" (where the second character is U+0301 COMBINING ACUTE ACCENT) is a single text element (and GetTextElementEnumerator does return it that way), but it's two Unicode scalars.

So, unless the Rune API also has some explicit support for enumerating text elements that I missed, I don't think it changes anything here.

@stephentoub
Copy link
Member

So, unless the Rune API also has some explicit support for enumerating text elements that I missed

I thought I heard Levi talking about doing that, which is why I asked, but maybe I misunderstood.

@svick
Copy link
Contributor Author

svick commented Nov 24, 2018

I found this comment by @GrabYourPitchforks from June: dotnet/corefxlab#2350 (comment). Assuming it's still current and that I understand it correctly, I think it means that Utf8String/UnicodeScalar/Rune won't have API for enumerating text elements.

@tarekgh
Copy link
Member

tarekgh commented Nov 24, 2018

I think @svick is right and we should keep this proposal for now.

@stephentoub
Copy link
Member

Ok.

@GrabYourPitchforks
Copy link
Member

If we do this, it should probably be married with an overall logic update to TextElementEnumerator. The type's current logic is based on a fairly old version of the Unicode Standard, and it would be great to modernize it.

@stephentoub
Copy link
Member

If we do this, it should probably be married with an overall logic update to TextElementEnumerator. The type's current logic is based on a fairly old version of the Unicode Standard, and it would be great to modernize it.

How breaking would it be if we updated to the latest standard regardless for 3.0?

@tarekgh
Copy link
Member

tarekgh commented Nov 27, 2018

How breaking would it be if we updated to the latest standard regardless for 3.0?

We don't consider updating the Unicode data as a breaking changes. it is like any other globalization data which can change at any time.

@tarekgh
Copy link
Member

tarekgh commented Nov 27, 2018

by the way, StringInfo is already using CharUnicodeInfo which should be using the Unicode data we have updated to the latest release of Unicode standard.

@GrabYourPitchforks
Copy link
Member

Quick update: I found a way to smuggle the TR-29 grapheme break data (see https://www.unicode.org/reports/tr29/) in the existing CharUnicodeInfo class without significantly expanding the size of the RVA static data. Will experiment with this a bit more and submit a PR over the next few days.

@GrabYourPitchforks
Copy link
Member

@ahsonkhan You had ideas for APIs which essentially took an input and returned an IEnumerable<Range>. Do you think something like that might be applicable more generally to string APIs?

APIs like this don't really need to be allocation-free like the ROS<char> APIs require - when considering string usability is of higher concern than is performance. One could imagine returning IEnumerable<string>, but there is significant value in knowing where this substring occurred in the original source string, so Range springs to mind.

But your thoughts on this would be appreciated since you have recent experience with this type of API surface. :)

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@Serentty
Copy link
Contributor

I feel like something like this should be a method directly on System.String, similar to the new EnumerateRunes() method. This is not something that most languages have yet done, but I think Swift set a good example by making extended grapheme cluster iteration so easily accessible (Swift actually made it the default way to iterate over a string, which is obviously not feasible for C#). I think that just like EnumerateRunes(), having it be directly on the type would allow for much higher visibility and awareness, and make it easier to reach for this when it is appropriate, instead of falling back on iterating code points or (even worse) UTF-16 code units when what you really mean is user-perceived characters.

@Neme12
Copy link

Neme12 commented Aug 31, 2023

Similar issue: #91003

@Neme12
Copy link

Neme12 commented Aug 31, 2023

You had ideas for APIs which essentially took an input and returned an IEnumerable<Range>. Do you think something like that might be applicable more generally to string APIs?

I really don't think Range should ever be returned. Range can have indexes that are from end, and the consumer of such an API would therefore need to always bind the span to the specific length in order to use it correctly. This is not what Range was intended for, it should really only be used as an input, never an output. A tuple with an index and a length would be better in that scenario.

@Neme12
Copy link

Neme12 commented Aug 31, 2023

I feel like something like this should be a method directly on System.String, similar to the new EnumerateRunes() method. This is not something that most languages have yet done, but I think Swift set a good example by making extended grapheme cluster iteration so easily accessible (Swift actually made it the default way to iterate over a string, which is obviously not feasible for C#). I think that just like EnumerateRunes(), having it be directly on the type would allow for much higher visibility and awareness, and make it easier to reach for this when it is appropriate, instead of falling back on iterating code points or (even worse) UTF-16 code units when what you really mean is user-perceived characters.

@Serentty This is exactly what I argue for in #91003 😄 You might want to look at that issue and voice your opinion as well.

@Neme12
Copy link

Neme12 commented Aug 31, 2023

Since the question of graphemes came up, I'll mention that we've been punting on the idea of having graphemes as a first-class citizen in the framework. (By "grapheme", I mean interpreting the 2-scalar sequence [ U+1F474 Older Man Emoji, U+1F3FF Fitzpatrick Type 6 Skin Modifier ] as the single grapheme "Older man with Fitzpatrick type 6 skin".) The reason we had been punting on this is that it tends to be more of a UI / text editor concern - not a general framework concern

@GrabYourPitchforks I strongly disagree with that. Most of the time you're working with strings, you want to work with whole characters, e.g. you always want to treat an "a" with an acute accent as one letter (or "character") and you never want to split a string on that. This how the majority of developers already think char works today (even though they're wrong), and they wouldn't expect a single readable character to be broken up into multiple chars.

@Neme12
Copy link

Neme12 commented Aug 31, 2023

For example, even in a simple scenario like converting between pascal case/snake case etc (so nothing to do with UI), you want to work with whole graphemes of the initial letters and transform those, not with code units or scalars, because an "a" with an acute accent is still a single letter.

@Neme12
Copy link

Neme12 commented Aug 31, 2023

Or when you want to limit the length of a string and show a validation error to the user when they exceeded the max length, you also want to work with graphemes, because a user would be very surprised if they type in "áá" and the error message says "You have exceeded the maximum length of 3 characters / Your text contains 4 characters.". A user has no idea what code units or scalars are and thinks of individual readable characters, i.e. graphemes.

It doesn't get much simpler than this when it comes to string functionality, and even in this case you want to work with graphemes.

EDIT: While this is technically UI, it definitely feels like something that should be a first-class citizen in the framework.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Globalization
Projects
None yet
Development

No branches or pull requests

8 participants