API proposal: Rune.DecodeFirstRune and friends #28504

Closed
GrabYourPitchforks opened this issue Jan 24, 2019 · 7 comments
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime

@GrabYourPitchforks
Member

These APIs are useful for reading the first and last Rune from a sequence. Unlike the existing Rune.TryGetRuneAt API which returns a Boolean "success" / "failure", and unlike the existing String.EnumerateRunes API which silently replaces invalid sequences with U+FFFD, this API instead returns to the caller information about the UTF-* subsequence itself.

These APIs can be used by higher-level functions like transcoders, escapers, and other text manipulation routines.

API proposal

namespace System.Text
{
   public readonly struct Rune
   {
      // Proposed NEW methods on existing Rune type

      public static OperationStatus DecodeFirstRune(ReadOnlySpan<char> span, out Rune result, out int sequenceLength);
      public static OperationStatus DecodeFirstRune(ReadOnlySpan<Utf8Char> span, out Rune result, out int sequenceLength);

      public static OperationStatus DecodeLastRune(ReadOnlySpan<char> span, out Rune result, out int sequenceLength);
      public static OperationStatus DecodeLastRune(ReadOnlySpan<Utf8Char> span, out Rune result, out int sequenceLength);
   }
}

Below are some sample input buffers and what the DecodeFirstRune function would return.

UTF-16: [ 0078 0079 007A ]

The first scalar value in this buffer is encoded by the char [ 0078 ]. This unambiguously corresponds to the scalar value U+0078. The method returns OperationStatus.Done, outs U+0078 via the result parameter, and outs 1 via the sequenceLength parameter. (The sequence was 1 code unit in length.)

UTF-8: [ 78 79 7A ]

Same as above. Returns OperationStatus.Done, outs U+0078 via result, and outs 1 via sequenceLength.

UTF-16: [ D800 ]

The char [ D800 ] is a standalone high surrogate. It's not by itself well-formed, but it's still possible to make this buffer valid by following it up with a low surrogate char. This returns OperationStatus.NeedMoreData, outs default(Rune) via result, and outs 1 via sequenceLength (because we inspected 1 char).

UTF-8: [ F0 BF ]

Same as above. This sequence is not by itself well-formed, but it's still possible to make this buffer valid by following it up with two more continuation bytes (0xF0 starts a 4-byte sequence). This returns OperationStatus.NeedMoreData, outs default(Rune) via result, and outs 2 via sequenceLength (because we inspected 2 bytes).

UTF-16: [ D800 0020 ]

The first char [ D800 ] is a standalone high surrogate, and we know that the next char isn't a low surrogate. This allows us to determine unambiguously that the buffer contains ill-formed data. This returns OperationStatus.InvalidData, outs default(Rune) via result, and outs 1 via sequenceLength (because we inspected 1 char).

UTF-8: [ F4 80 80 F5 78 79 7A ]

The 0xF5 byte improperly terminates what we expected to be the start of a 4-byte UTF-8 subsequence. No amount of additional data will ever make this sequence valid. This returns OperationStatus.InvalidData, outs default(Rune) via result, and outs 3 via sequenceLength (per Unicode recommendation re: calculating maximal invalid subsequence length).
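The sample buffers above can be tried against the API that eventually shipped. This is a minimal sketch, assuming .NET Core 3.0 or later, where the methods are named Rune.DecodeFromUtf16 and Rune.DecodeFromUtf8 rather than the DecodeFirstRune name proposed here. The DecodeSamples class and its helpers are hypothetical names used only for illustration. One difference from the proposal text: the shipped methods out Rune.ReplacementChar (U+FFFD), not default(Rune), when the status is not Done.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class DecodeSamples
{
    // Formats one decode outcome as "<status> U+<scalar> <elements consumed>".
    private static string Format(OperationStatus status, Rune rune, int consumed)
        => $"{status} U+{rune.Value:X4} {consumed}";

    // Decodes the first scalar value from a UTF-16 buffer.
    public static string Utf16(params char[] units)
    {
        OperationStatus status = Rune.DecodeFromUtf16(units, out Rune rune, out int consumed);
        return Format(status, rune, consumed);
    }

    // Decodes the first scalar value from a UTF-8 buffer.
    public static string Utf8(params byte[] units)
    {
        OperationStatus status = Rune.DecodeFromUtf8(units, out Rune rune, out int consumed);
        return Format(status, rune, consumed);
    }
}

// Usage, mirroring the sample buffers:
//   DecodeSamples.Utf16('\u0078', '\u0079', '\u007A')              -> "Done U+0078 1"
//   DecodeSamples.Utf16('\uD800')                                  -> "NeedMoreData U+FFFD 1"
//   DecodeSamples.Utf16('\uD800', '\u0020')                        -> "InvalidData U+FFFD 1"
//   DecodeSamples.Utf8(0xF0, 0xBF)                                 -> "NeedMoreData U+FFFD 2"
//   DecodeSamples.Utf8(0xF4, 0x80, 0x80, 0xF5, 0x78, 0x79, 0x7A)   -> "InvalidData U+FFFD 3"
```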

Further discussion

Transcoding / escaping / etc. routines often need to know the difference between incomplete and invalid data. If these routines are themselves implemented using the OperationStatus pattern, they need to know whether to return OperationStatus.NeedMoreData or OperationStatus.InvalidData to their own callers.

Even though there are public Rune.Utf8SequenceLength and Rune.Utf16SequenceLength instance properties, we still need an out parameter that answers "how long was the subsequence that we inspected to come up with our determination?" Think of this parameter as akin to "regardless of whether the return value was good / incomplete / invalid, slice the original input buffer by this many elements before you call me next."
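As a rough illustration of the "slice by this many elements" pattern, here is a sketch of a transcoder that itself follows the OperationStatus pattern. It assumes the shipped Rune.DecodeFromUtf8 method (.NET Core 3.0 or later); the ReplacingTranscoder class and its Transcode method are hypothetical and not part of this proposal. Invalid subsequences are replaced with U+FFFD and skipped using the reported length, while an incomplete tail is left unconsumed and reported to the caller as NeedMoreData.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class ReplacingTranscoder
{
    // Hypothetical helper: appends the decoded text to 'output' and reports how
    // many input bytes were consumed. Returns NeedMoreData if the buffer ends
    // in the middle of a sequence that more input could still complete.
    public static OperationStatus Transcode(
        ReadOnlySpan<byte> utf8, StringBuilder output, out int bytesConsumed)
    {
        bytesConsumed = 0;
        while (!utf8.IsEmpty)
        {
            OperationStatus status = Rune.DecodeFromUtf8(utf8, out Rune rune, out int length);
            if (status == OperationStatus.NeedMoreData)
            {
                // Incomplete tail: leave it for the caller to retry with more data.
                return OperationStatus.NeedMoreData;
            }

            // Done: emit the scalar. InvalidData: emit U+FFFD in its place.
            output.Append((status == OperationStatus.Done ? rune : Rune.ReplacementChar).ToString());

            // In both cases, advance past exactly the subsequence that was inspected.
            utf8 = utf8.Slice(length);
            bytesConsumed += length;
        }
        return OperationStatus.Done;
    }
}
```

A caller that receives NeedMoreData would typically keep the bytes from bytesConsumed onward, append the next chunk, and call again.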

To see an example of this in action (in the UTF-8 JSON escaper), see https://github.com/GrabYourPitchforks/jsonescape/blob/1218cf6fa8988352474281c23da7bd9de087ec2f/JsonEscape/Escaper/Utf8JavaScriptEncoder.cs#L66.

@GrabYourPitchforks GrabYourPitchforks self-assigned this Jan 24, 2019
@GSPP

GSPP commented Jan 28, 2019

If a caller wants to process an entire string the overhead of creating many new spans and invoking this method many times might be significant. Maybe there should be an API that enumerates all this data. Maybe this could take the form of a ref struct that behaves like an enumerator.
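One way such an enumerator could look is sketched below. This is only an illustration of the idea, not an API from this proposal: Utf16DecodeEnumerator is a hypothetical type, and it is built on the shipped Rune.DecodeFromUtf16 method (.NET Core 3.0 or later). Each step reports the status, the decoded value, and the number of chars inspected.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical ref struct that walks a UTF-16 buffer one subsequence at a time.
public ref struct Utf16DecodeEnumerator
{
    private ReadOnlySpan<char> _remaining;

    public Utf16DecodeEnumerator(ReadOnlySpan<char> source)
    {
        _remaining = source;
        Current = default;
    }

    // Outcome of the most recent MoveNext call.
    public (OperationStatus Status, Rune Value, int Length) Current { get; private set; }

    // Enables pattern-based foreach.
    public Utf16DecodeEnumerator GetEnumerator() => this;

    public bool MoveNext()
    {
        if (_remaining.IsEmpty)
        {
            return false;
        }

        OperationStatus status = Rune.DecodeFromUtf16(_remaining, out Rune value, out int length);
        Current = (status, value, length);

        // For a non-empty buffer the reported length is at least 1, so this always advances.
        _remaining = _remaining.Slice(length);
        return true;
    }
}
```

With this shape, a caller can write foreach (var step in new Utf16DecodeEnumerator(text.AsSpan())) and inspect step.Status for each subsequence.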

@stephentoub
Member

stephentoub commented Jan 28, 2019

@GSPP

GSPP commented Jan 28, 2019

@stephentoub yes, exactly.

@GrabYourPitchforks
Member Author

@terrajobst
Member

terrajobst commented Feb 5, 2019

Video

Looks good.

  • We'll have to make sure that the out parameter name charsConsumed still makes sense with the final name of Utf8Char, in case that name doesn't contain char. Alternatively, we could use something like consumedCount, which implies that it's the number of elements regardless of the type.
public static OperationStatus Decode(ReadOnlySpan<char> source, out Rune result, out int charsConsumed);
public static OperationStatus Decode(ReadOnlySpan<Utf8Char> source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeFromEnd(ReadOnlySpan<char> source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeFromEnd(ReadOnlySpan<Utf8Char> source, out Rune result, out int charsConsumed);

@KrzysztofCwalina, do you have an opinion on the name of the out parameter?

@GrabYourPitchforks
Member Author

@terrajobst Given that there's some pushback against introducing Utf8Char right now, can we do this for the UTF-8 overloads so that we can get something in?

public static OperationStatus DecodeUtf8(ReadOnlySpan<byte> source, out Rune result, out int bytesConsumed);
public static OperationStatus DecodeUtf8FromEnd(ReadOnlySpan<byte> source, out Rune result, out int bytesConsumed);

@terrajobst
Member

We decided to include the UTF16/UTF8 suffix for all of them:

public static OperationStatus DecodeUtf16(ReadOnlySpan<char> utf16Source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeUtf8(ReadOnlySpan<byte> utf8Source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeUtf16FromEnd(ReadOnlySpan<char> utf16Source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeUtf8FromEnd(ReadOnlySpan<byte> utf8Source, out Rune result, out int charsConsumed);
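For the FromEnd variants, a small usage sketch follows. It assumes the names that shipped in .NET Core 3.0, where the counterpart of DecodeUtf16FromEnd is Rune.DecodeLastFromUtf16; the LastRune class and RemoveLastRune method are hypothetical names for illustration. The consumed count tells the caller how many chars to trim so that a surrogate pair is never split.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class LastRune
{
    // Hypothetical helper: returns 'text' without its final scalar value.
    // A trailing surrogate pair is removed as a unit (2 chars).
    public static string RemoveLastRune(string text)
    {
        // For an empty input the method reports 0 chars consumed,
        // so the empty string is returned unchanged.
        Rune.DecodeLastFromUtf16(text, out _, out int charsConsumed);
        return text.Substring(0, text.Length - charsConsumed);
    }
}
```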

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 3.0 milestone Feb 1, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Dec 14, 2020
5 participants