API proposal: Rune.DecodeFirstRune and friends #28504

GrabYourPitchforks · 2019-01-24T22:49:50Z

These APIs are useful for reading the first and last Rune from a sequence. Unlike the existing Rune.TryGetRuneAt API which returns a Boolean "success" / "failure", and unlike the existing String.EnumerateRunes API which silently replaces invalid sequences with U+FFFD, this API instead returns to the caller information about the UTF-* subsequence itself.

These APIs can be used by higher-level functions like transcoders, escapers, and other text manipulation routines.
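To make the limitation of the Boolean-returning API concrete, here is a small sketch using the existing Rune.TryGetRuneAt. The two input strings are made up for illustration. One ends in a lone high surrogate (possibly just truncated), the other has a high surrogate followed by a space (definitely ill-formed), yet the caller sees the same answer for both.

```csharp
using System;
using System.Text;

// Illustrative inputs, not taken from the proposal.
string truncated = "\uD800";    // lone high surrogate at the end of the input
string illFormed = "\uD800 ";   // high surrogate followed by a non-surrogate

bool okTruncated = Rune.TryGetRuneAt(truncated, 0, out Rune firstRune);
bool okIllFormed = Rune.TryGetRuneAt(illFormed, 0, out Rune secondRune);

// Both calls fail, and nothing tells the caller which input could still be completed.
Console.WriteLine($"{okTruncated} {okIllFormed}"); // prints "False False"
```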

API proposal

namespace System.Text
{
   public readonly struct Rune
   {
      // Proposed NEW methods on existing Rune type

      public static OperationStatus DecodeFirstRune(ReadOnlySpan<char> span, out Rune result, out int sequenceLength);
      public static OperationStatus DecodeFirstRune(ReadOnlySpan<Utf8Char> span, out Rune result, out int sequenceLength);

      public static OperationStatus DecodeLastRune(ReadOnlySpan<char> span, out Rune result, out int sequenceLength);
      public static OperationStatus DecodeLastRune(ReadOnlySpan<Utf8Char> span, out Rune result, out int sequenceLength);
   }
}

Below are some sample input buffers and what the DecodeFirstRune function would return.

UTF-16: [ 0078 0079 007A ].

The first scalar value in this buffer is encoded by the char [ 0078 ]. This unambiguously corresponds to the scalar value U+0078. The method returns OperationStatus.Done, outs U+0078 via the result parameter, and outs 1 via the sequenceLength parameter. (The sequence was 1 code unit in length.)

UTF-8: [ 78 79 7A ]

Same as above. Returns OperationStatus.Done, outs U+0078 via result, and outs 1 via sequenceLength.

UTF-16: [ D800 ].

The char [ D800 ] is a standalone high surrogate. It's not by itself well-formed, but it's still possible to make this buffer valid by following it up with a low surrogate char. This returns OperationStatus.NeedMoreData, outs default(Rune) via result, and outs 1 via sequenceLength (because we inspected 1 char).

UTF-8: [ F0 BF ]

Same as above. This sequence is not by itself well-formed, but it's still possible to make this buffer valid by following it up with two more continuation bytes (a sequence that starts with F0 is 4 bytes long). This returns OperationStatus.NeedMoreData, outs default(Rune) via result, and outs 2 via sequenceLength (because we inspected 2 bytes).

UTF-16: [ D800 0020 ]

The first char [ D800 ] is a standalone high surrogate, and we know that the next char isn't a low surrogate. This allows us to determine unambiguously that the buffer contains ill-formed data. This returns OperationStatus.InvalidData, outs default(Rune) via result, and outs 1 via sequenceLength (because we inspected 1 char).

UTF-8: [ F4 80 80 F5 78 79 7A ]

The 0xF5 byte improperly terminates what we expected to be the start of a 4-byte UTF-8 subsequence. No amount of additional data will ever make this sequence valid. This returns OperationStatus.InvalidData, outs default(Rune) via result, and outs 3 via sequenceLength (per Unicode recommendation re: calculating maximal invalid subsequence length).
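The sample buffers above can be run through the decoder roughly as follows. This is a sketch that assumes the method names found in current System.Text.Rune (Rune.DecodeFromUtf16 and Rune.DecodeFromUtf8), which differ from the names proposed in this issue. The helpers DecodeFirstUtf16 and DecodeFirstUtf8 are hypothetical wrappers added only to keep the calls short. Note also that the current implementation, as far as I know, outs Rune.ReplacementChar rather than default(Rune) on failure, so the sketch does not inspect result in those cases.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical wrappers: return only the status and the number of code units inspected.
static (OperationStatus Status, int Consumed) DecodeFirstUtf16(ReadOnlySpan<char> source)
{
    OperationStatus status = Rune.DecodeFromUtf16(source, out _, out int consumed);
    return (status, consumed);
}

static (OperationStatus Status, int Consumed) DecodeFirstUtf8(ReadOnlySpan<byte> source)
{
    OperationStatus status = Rune.DecodeFromUtf8(source, out _, out int consumed);
    return (status, consumed);
}

Console.WriteLine(DecodeFirstUtf16(new[] { '\u0078', '\u0079', '\u007A' }));                  // (Done, 1)
Console.WriteLine(DecodeFirstUtf8(new byte[] { 0x78, 0x79, 0x7A }));                          // (Done, 1)
Console.WriteLine(DecodeFirstUtf16(new[] { '\uD800' }));                                      // (NeedMoreData, 1)
Console.WriteLine(DecodeFirstUtf8(new byte[] { 0xF0, 0xBF }));                                // (NeedMoreData, 2)
Console.WriteLine(DecodeFirstUtf16(new[] { '\uD800', '\u0020' }));                            // (InvalidData, 1)
Console.WriteLine(DecodeFirstUtf8(new byte[] { 0xF4, 0x80, 0x80, 0xF5, 0x78, 0x79, 0x7A }));  // (InvalidData, 3)
```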

Further discussion

Transcoding / escaping / etc. routines often need to know the difference between incomplete and invalid data. If these routines are themselves implemented using the OperationStatus pattern, they need to know whether to return OperationStatus.NeedMoreData or OperationStatus.InvalidData to their own callers.
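As a rough illustration of a routine that is itself written in the OperationStatus pattern, the sketch below counts scalar values in a UTF-8 chunk and passes the decoder's status through to its own caller. CountRunes is a hypothetical helper invented for this example, and the call to Rune.DecodeFromUtf8 assumes the name used by current System.Text.Rune rather than the name proposed here.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper: counts well-formed scalar values at the start of a UTF-8 chunk.
// bytesConsumed covers only the well-formed prefix, so the caller knows where to resume.
static OperationStatus CountRunes(ReadOnlySpan<byte> utf8, out int bytesConsumed, out int runeCount)
{
    bytesConsumed = 0;
    runeCount = 0;
    while (!utf8.IsEmpty)
    {
        OperationStatus status = Rune.DecodeFromUtf8(utf8, out _, out int length);
        if (status != OperationStatus.Done)
        {
            // NeedMoreData: the chunk ended mid-sequence; the caller may append bytes and retry.
            // InvalidData: no further input can repair the bytes at this position.
            return status;
        }
        utf8 = utf8.Slice(length);
        bytesConsumed += length;
        runeCount++;
    }
    return OperationStatus.Done;
}

// 'A' followed by the first two bytes of a 3-byte sequence: incomplete.
OperationStatus partial = CountRunes(new byte[] { 0x41, 0xE2, 0x82 }, out int partialBytes, out int partialRunes);
// 'A' followed by a lead byte and then another 'A': ill-formed.
OperationStatus broken = CountRunes(new byte[] { 0x41, 0xE2, 0x41 }, out int brokenBytes, out int brokenRunes);

Console.WriteLine($"{partial} {partialBytes} {partialRunes}"); // prints "NeedMoreData 1 1"
Console.WriteLine($"{broken} {brokenBytes} {brokenRunes}");    // prints "InvalidData 1 1"
```

With only a Boolean from the decoder, CountRunes could not choose between the two return values.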

Even though there are public Rune.Utf8SequenceLength and Rune.Utf16SequenceLength instance properties, we still need an out parameter that answers "how long was the subsequence that we inspected to come up with our determination?" Think of this parameter as akin to "regardless of whether the return value was good / incomplete / invalid, slice the original input buffer by this many elements before you call me next."
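The "slice by this many elements" contract can be sketched as a simple loop. DecodeLossy is a hypothetical helper written for this example: it treats the buffer as complete, substitutes U+FFFD for every ill-formed or truncated subsequence, and always advances by the consumed count. It assumes the Rune.DecodeFromUtf8 name from current System.Text.Rune.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper: decodes a complete UTF-8 buffer, replacing bad subsequences with U+FFFD.
static string DecodeLossy(ReadOnlySpan<byte> utf8)
{
    var builder = new StringBuilder();
    while (!utf8.IsEmpty)
    {
        OperationStatus status = Rune.DecodeFromUtf8(utf8, out Rune rune, out int consumed);
        // Append the decoded scalar on success, U+FFFD otherwise.
        builder.Append(status == OperationStatus.Done ? rune.ToString() : "\uFFFD");
        // Regardless of the status, advance by the number of bytes that were inspected.
        utf8 = utf8.Slice(consumed);
    }
    return builder.ToString();
}

string lossy = DecodeLossy(new byte[] { 0xF4, 0x80, 0x80, 0xF5, 0x78, 0x79, 0x7A });
// [ F4 80 80 ] and [ F5 ] each become one U+FFFD, followed by "xyz".
Console.WriteLine(lossy.Length); // prints 5
```

Because the consumed count is at least 1 for any non-empty input, the loop always makes progress.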

To see an example of this in action (in the UTF-8 JSON escaper), see https://github.com/GrabYourPitchforks/jsonescape/blob/1218cf6fa8988352474281c23da7bd9de087ec2f/JsonEscape/Escaper/Utf8JavaScriptEncoder.cs#L66.


GSPP · 2019-01-28T10:53:46Z

If a caller wants to process an entire string the overhead of creating many new spans and invoking this method many times might be significant. Maybe there should be an API that enumerates all this data. Maybe this could take the form of a ref struct that behaves like an enumerator.

stephentoub · 2019-01-28T13:36:19Z

> Maybe there should be an API that enumerates all this data. Maybe this could take the form of a ref struct that behaves like an enumerator.

You mean like https://github.com/dotnet/coreclr/blob/57fd77e6f8f7f2c37cc5c3b36df3ea4f302e143b/src/System.Private.CoreLib/shared/System/String.cs#L546 and https://github.com/dotnet/coreclr/blob/b526affff2190b05f3896933ea3dd3f6b02879dc/src/System.Private.CoreLib/shared/System/MemoryExtensions.cs#L1041 ?
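For reference, a short sketch of the existing enumerators (String.EnumerateRunes and the span-based EnumerateRunes extension). The input string is made up for illustration. Both enumerators yield U+FFFD for the lone surrogate and do not report whether it was incomplete or invalid.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

string input = "x\uD800y";   // lone high surrogate between two letters
var values = new List<int>();

foreach (Rune rune in input.EnumerateRunes())
{
    values.Add(rune.Value);
    Console.WriteLine($"U+{rune.Value:X4}"); // prints U+0078, U+FFFD, U+0079 on separate lines
}

// The span-based enumerator applies the same replacement behavior.
int spanCount = 0;
foreach (Rune rune in input.AsSpan().EnumerateRunes())
{
    spanCount++;
}
```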

GSPP · 2019-01-28T15:42:54Z

@stephentoub yes, exactly.

GrabYourPitchforks · 2019-01-28T19:05:32Z

@GSPP See https://github.com/dotnet/corefx/issues/34826 for the proposal. It's the follow-up to the comments given at https://github.com/dotnet/apireviews/blob/master/2018/System.Utf8String/Session2.md.

terrajobst · 2019-02-05T19:56:46Z

Video

Looks good.

We'll have to make sure that the out parameter charsConsumed makes sense with the final name of Utf8Char, in case that name doesn't have char in it. Alternatively, we could use something like consumedCount, which implies a number of elements regardless of the type.

public static OperationStatus Decode(ReadOnlySpan<char> source, out Rune result, out int charsConsumed);
public static OperationStatus Decode(ReadOnlySpan<Utf8Char> source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeFromEnd(ReadOnlySpan<char> source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeFromEnd(ReadOnlySpan<Utf8Char> source, out Rune result, out int charsConsumed);

@KrzysztofCwalina, do you have an opinion on the name of the out parameter?

GrabYourPitchforks · 2019-02-19T21:02:05Z

@terrajobst Given that there's some pushback against introducing Utf8Char right now, can we do this for the UTF-8 overloads so that we can get something in?

public static OperationStatus DecodeUtf8(ReadOnlySpan<byte> source, out Rune result, out int bytesConsumed);
public static OperationStatus DecodeUtf8FromEnd(ReadOnlySpan<byte> source, out Rune result, out int bytesConsumed);

terrajobst · 2019-02-22T18:30:45Z

We decided to include the UTF16/UTF8 suffix for all of them:

public static OperationStatus DecodeUtf16(ReadOnlySpan<char> utf16Source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeUtf8(ReadOnlySpan<byte> utf8Source, out Rune result, out int bytesConsumed);
public static OperationStatus DecodeUtf16FromEnd(ReadOnlySpan<char> utf16Source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeUtf8FromEnd(ReadOnlySpan<byte> utf8Source, out Rune result, out int bytesConsumed);
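A possible use of the from-the-end overloads is trimming the last scalar value off a buffer. The sketch below assumes the name Rune.DecodeLastFromUtf8 from current System.Text.Rune, which is not the name agreed on in this thread. TrimLastRune is a hypothetical helper written for this example.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper: returns the buffer without its last scalar value (or ill-formed tail).
static ReadOnlySpan<byte> TrimLastRune(ReadOnlySpan<byte> utf8, out OperationStatus status, out Rune last)
{
    status = Rune.DecodeLastFromUtf8(utf8, out last, out int bytesConsumed);
    // bytesConsumed counts bytes from the end of the buffer.
    return utf8.Slice(0, utf8.Length - bytesConsumed);
}

byte[] data = { 0x78, 0xE2, 0x82, 0xAC };   // "x" followed by U+20AC
int remaining = TrimLastRune(data, out OperationStatus trimStatus, out Rune lastRune).Length;

Console.WriteLine($"{trimStatus} U+{lastRune.Value:X4} {remaining}"); // prints "Done U+20AC 1"
```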

GrabYourPitchforks self-assigned this Jan 24, 2019

GrabYourPitchforks closed this as completed in dotnet/corefx#35469 Mar 8, 2019

msftgits transferred this issue from dotnet/corefx Feb 1, 2020

msftgits added this to the 3.0 milestone Feb 1, 2020

dotnet locked as resolved and limited conversation to collaborators Dec 14, 2020
