Proposal: Better API for StringInfo.GetTextElementEnumerator #19423
Comments
I think having TextElementEnumerator implement the IEnumerable interface should be enough to have it work with the foreach(...)
@tarekgh That's an interesting option. I don't like that it would mean that the method name
With all of the new Rune support being added, including enumerators, is this still needed?
@stephentoub Rune represents Unicode scalars, which are not the same thing as text elements. For example the string So, unless the Rune API also has some explicit support for enumerating text elements that I missed, I don't think it changes anything here.
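The difference between code units, scalars, and text elements can be sketched with a small example. The helper class `TextUnitCounts` and the sample strings below are illustrative assumptions, not something taken from this thread; `EnumerateRunes` needs .NET Core 3.0 or later.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper, for illustration only.
public static class TextUnitCounts
{
    // UTF-16 code units: what string.Length reports.
    public static int CodeUnits(string s) => s.Length;

    // Unicode scalar values: one per Rune yielded by EnumerateRunes().
    public static int Scalars(string s) => s.EnumerateRunes().Count();

    // Text elements, as segmented by StringInfo.
    public static int TextElements(string s) => new StringInfo(s).LengthInTextElements;
}
```

For "a" followed by a combining acute accent (`"a\u0301"`), this reports 2 code units, 2 scalars, and 1 text element, so enumerating runes alone would still split the accented letter.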
I thought I heard Levi talking about doing that, which is why I asked, but maybe I misunderstood.
I found this comment by @GrabYourPitchforks from June: dotnet/corefxlab#2350 (comment). Assuming it's still current and that I understand it correctly, I think it means that
I think @svick is right and we should keep this proposal for now.
Ok.
If we do this, it should probably be married with an overall logic update to
How breaking would it be if we updated to the latest standard regardless for 3.0?
We don't consider updating the Unicode data a breaking change. It is like any other globalization data, which can change at any time.
By the way, StringInfo is already using CharUnicodeInfo, which should be using the Unicode data we have updated to the latest release of the Unicode standard.
Quick update: I found a way to smuggle the TR-29 grapheme break data (see https://www.unicode.org/reports/tr29/) in the existing
@ahsonkhan You had ideas for APIs which essentially took an input and returned an APIs like this don't really need to be allocation-free like the But your thoughts on this would be appreciated since you have recent experience with this type of API surface. :)
I feel like something like this should be a method directly on
Similar issue: #91003
I really don't think
@Serentty This is exactly what I argue for in #91003 😄 You might want to look at that issue and voice your opinion as well.
@GrabYourPitchforks I strongly disagree with that. Most of the time you're working with strings, you want to work with whole characters, e.g. you always want to treat an "a" with an acute accent as one letter (or "character") and you never want to split a string on that. This is how the majority of developers already think
For example, even in a simple scenario like converting between Pascal case and snake case (so nothing to do with UI), you want to work with whole graphemes of the initial letters and transform those, not with code units or scalars, because an "a" with an acute accent is still a single letter.
Or when you want to limit the length of a string and show a validation error to the user when they exceed the max length, you also want to work with graphemes, because a user would be very surprised if they typed in "áá" and the error message said "You have exceeded the maximum length of 3 characters / Your text contains 4 characters.". A user has no idea what code units or scalars are and thinks of individual readable characters, i.e. graphemes. It doesn't get much simpler than this when it comes to string functionality, and even in this case you want to work with graphemes. EDIT: While this is technically UI, it definitely feels like something that should be a first-class citizen in the framework.
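The two scenarios above can be sketched with the existing `StringInfo` API. The class `GraphemeAwareText` and its method names are hypothetical, and the sample input assumes the accents are stored as separate combining marks (U+0301), which is what makes `string.Length` report 4.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class GraphemeAwareText
{
    // Enforces a maximum length in text elements rather than UTF-16 code units.
    public static bool IsWithinMaxLength(string input, int maxTextElements)
        => new StringInfo(input).LengthInTextElements <= maxTextElements;

    // Uppercases the first text element as a unit, keeping any combining marks attached.
    public static string UppercaseFirstTextElement(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;
        string first = StringInfo.GetNextTextElement(input);
        return first.ToUpperInvariant() + input.Substring(first.Length);
    }
}
```

With `"a\u0301a\u0301"`, `Length` is 4 but `IsWithinMaxLength(input, 3)` returns true, because the string contains only two text elements.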
To get text elements from a `string`, you can currently use `System.Globalization.StringInfo.GetTextElementEnumerator`:
Notice that `TextElementEnumerator` is a (non-generic) enumerator, not an enumerable, so it can't be used in a `foreach`, in LINQ, or in pretty much any other collection-related operation. To use it, you write code like:
This makes the API unfamiliar and inconvenient to use. I propose that a new API based on `IEnumerable<T>` should be added.
Proposed API
The new methods work like the old methods, except that they return a generic enumerable instead of an enumerator.
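As a rough sketch of the idea, and not the signatures proposed here, an enumerable could be layered over the existing enumerator with an iterator method. The extension method name `EnumerateTextElements` is a hypothetical choice for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical extension class; the name is an assumption, not an existing .NET API.
public static class TextElementExtensions
{
    public static IEnumerable<string> EnumerateTextElements(this string text)
    {
        // Validate eagerly so the exception is not deferred to the first MoveNext().
        if (text is null) throw new ArgumentNullException(nameof(text));
        return Iterate(text);
    }

    private static IEnumerable<string> Iterate(string text)
    {
        // This is the manual loop the existing non-generic enumerator requires.
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            yield return enumerator.GetTextElement();
        }
    }
}
```

With a wrapper like this, `foreach (string element in text.EnumerateTextElements())` and LINQ calls such as `text.EnumerateTextElements().Count()` work as they do for any other `IEnumerable<T>`.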
Usage
The new API could be used just like any other `IEnumerable<T>`, e.g.:
Open questions
- `IEnumerable<string>` means information about `ElementIndex` is lost. Is that information useful enough to use something like `IEnumerable<(string textElement, int index)>` instead? (Possibly using a custom `struct` instead of a tuple.)
- Each returned `string` is a substring of the input `string`. Would it be worth waiting for spans and using `IEnumerable<ReadOnlyMemory<char>>` instead?
- `IEnumerable<string>` requires allocating that `IEnumerable<string>`. Would it be worthwhile to return a struct enumerable with a struct enumerator instead, which would avoid the allocation when used in `foreach`?
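To make the trade-offs in these questions concrete, here is a minimal sketch of a struct enumerable that yields each element as a `ReadOnlyMemory<char>` slice together with its index. The types `TextElementEnumerable` and `Enumerator` are hypothetical, and the sketch assumes .NET 5 or later for `StringInfo.GetNextTextElementLength(string, int)`.

```csharp
using System;
using System.Globalization;

// Hypothetical types sketching the open questions; none of these names exist in .NET.
public readonly struct TextElementEnumerable
{
    private readonly string _text;

    public TextElementEnumerable(string text)
        => _text = text ?? throw new ArgumentNullException(nameof(text));

    // foreach binds to this method by pattern, so no enumerator object is allocated.
    public Enumerator GetEnumerator() => new Enumerator(_text);

    public struct Enumerator
    {
        private readonly string _text;
        private int _index;  // start of the current element, in UTF-16 code units
        private int _length; // length of the current element, in UTF-16 code units

        internal Enumerator(string text)
        {
            _text = text;
            _index = 0;
            _length = 0;
        }

        // A slice of the input (no substring allocation) plus its starting index.
        public (ReadOnlyMemory<char> TextElement, int Index) Current
            => (_text.AsMemory(_index, _length), _index);

        public bool MoveNext()
        {
            _index += _length;
            if (_index >= _text.Length)
            {
                _length = 0;
                return false;
            }
            _length = StringInfo.GetNextTextElementLength(_text, _index);
            return true;
        }
    }
}
```

A caller would write `foreach (var (element, index) in new TextElementEnumerable(text))`. The cost of this design is that the type does not implement `IEnumerable<T>`, so LINQ operators cannot be applied to it directly.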