API proposal: Rune.DecodeFirstRune and friends #28504
Comments
If a caller wants to process an entire string the overhead of creating many new spans and invoking this method many times might be significant. Maybe there should be an API that enumerates all this data. Maybe this could take the form of a |
You mean like https://github.com/dotnet/coreclr/blob/57fd77e6f8f7f2c37cc5c3b36df3ea4f302e143b/src/System.Private.CoreLib/shared/System/String.cs#L546 and https://github.com/dotnet/coreclr/blob/b526affff2190b05f3896933ea3dd3f6b02879dc/src/System.Private.CoreLib/shared/System/MemoryExtensions.cs#L1041 ? |
@stephentoub yes, exactly. |
@GSPP See https://github.com/dotnet/corefx/issues/34826 for the proposal. It's the follow-up to the comments given at https://github.com/dotnet/apireviews/blob/master/2018/System.Utf8String/Session2.md. |
Looks good.
public static OperationStatus Decode(ReadOnlySpan<char> source, out Rune result, out int charsConsumed);
public static OperationStatus Decode(ReadOnlySpan<Utf8Char> source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeFromEnd(ReadOnlySpan<char> source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeFromEnd(ReadOnlySpan<Utf8Char> source, out Rune result, out int charsConsumed);
@KrzysztofCwalina, do you have an opinion on the name of the |
@terrajobst Given that there's some pushback against introducing
public static OperationStatus DecodeUtf8(ReadOnlySpan<byte> source, out Rune result, out int bytesConsumed);
public static OperationStatus DecodeUtf8FromEnd(ReadOnlySpan<byte> source, out Rune result, out int bytesConsumed); |
We decided to include the UTF16/UTF8 suffix for all of them:
public static OperationStatus DecodeUtf16(ReadOnlySpan<char> utf16Source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeUtf8(ReadOnlySpan<byte> utf8Source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeUtf16FromEnd(ReadOnlySpan<char> utf16Source, out Rune result, out int charsConsumed);
public static OperationStatus DecodeUtf8FromEnd(ReadOnlySpan<byte> utf8Source, out Rune result, out int charsConsumed); |
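For readers who want to try the agreed shape, here is a minimal sketch of decoding the first and last scalar value of a UTF-8 buffer. It assumes the method names that later shipped on System.Text.Rune in .NET Core 3.0 (Rune.DecodeFromUtf8 and Rune.DecodeLastFromUtf8), which differ from the DecodeUtf8 / DecodeUtf8FromEnd names discussed in this thread. The sample buffer is made up for illustration, and the code is written as C# 9+ top-level statements.

```csharp
using System;
using System.Buffers;
using System.Text;

// Illustrative input: UTF-8 bytes for "x" (78) followed by U+20AC (E2 82 AC).
ReadOnlySpan<byte> utf8 = new byte[] { 0x78, 0xE2, 0x82, 0xAC };

// Decode the first scalar value in the buffer.
// Assumption: shipped name DecodeFromUtf8, not the DecodeUtf8 name from this thread.
OperationStatus firstStatus = Rune.DecodeFromUtf8(utf8, out Rune first, out int firstConsumed);

// Decode the last scalar value in the buffer.
// Assumption: shipped name DecodeLastFromUtf8, not DecodeUtf8FromEnd.
OperationStatus lastStatus = Rune.DecodeLastFromUtf8(utf8, out Rune last, out int lastConsumed);

Console.WriteLine($"{firstStatus}: U+{first.Value:X4}, {firstConsumed} byte(s)"); // Done: U+0078, 1 byte(s)
Console.WriteLine($"{lastStatus}: U+{last.Value:X4}, {lastConsumed} byte(s)");    // Done: U+20AC, 3 byte(s)
```

Both calls report how many code units made up the decoded sequence, so a caller can slice from either end of the buffer before calling again.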
These APIs are useful for reading the first and last Rune from a sequence. Unlike the existing Rune.TryGetRuneAt API, which returns a Boolean "success" / "failure", and unlike the existing String.EnumerateRunes API, which silently replaces invalid sequences with U+FFFD, this API instead returns to the caller information about the UTF-* subsequence itself.

These APIs can be used by higher-level functions like transcoders, escapers, and other text manipulation routines.
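The difference between the three approaches can be sketched on a single ill-formed input. This is only an illustration: it assumes the shipped name Rune.DecodeFromUtf16 (the thread uses other names), and the input string is made up for the example.

```csharp
using System;
using System.Buffers;
using System.Text;

// Illustrative input: a lone high surrogate followed by a space (ill-formed UTF-16).
string text = "\uD800 ";

// Rune.TryGetRuneAt reports only success or failure.
bool ok = Rune.TryGetRuneAt(text, 0, out _);

// String.EnumerateRunes substitutes U+FFFD for the ill-formed char.
int firstEnumerated = -1;
foreach (Rune r in text.EnumerateRunes())
{
    firstEnumerated = r.Value;
    break; // only the first rune matters for this comparison
}

// The decode API reports why decoding failed and how much input it inspected.
// Assumption: shipped name DecodeFromUtf16.
OperationStatus status = Rune.DecodeFromUtf16(text.AsSpan(), out _, out int charsConsumed);

Console.WriteLine(ok);                            // False
Console.WriteLine($"U+{firstEnumerated:X4}");     // U+FFFD
Console.WriteLine($"{status}, {charsConsumed}");  // InvalidData, 1
```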
API proposal
Below are some sample input buffers and what the DecodeFirstRune function would return.

UTF-16: [ 0078 0079 007A ]
The first scalar value in this buffer is encoded by the char [ 0078 ]. This unambiguously corresponds to the scalar value U+0078. The method returns OperationStatus.Done, outs U+0078 via the result parameter, and outs 1 via the sequenceLength parameter. (The sequence was 1 code unit in length.)

UTF-8: [ 78 79 7A ]
Same as above. Returns OperationStatus.Done, outs U+0078 via result, and outs 1 via sequenceLength.

UTF-16: [ D800 ]
The char [ D800 ] is a standalone high surrogate. It's not by itself well-formed, but it's still possible to make this buffer valid by following it up with a low surrogate char. This returns OperationStatus.NeedMoreData, outs default(Rune) via result, and outs 1 via sequenceLength (because we inspected 1 char).

UTF-8: [ F0 BF ]
Same as above. This sequence is not by itself well-formed, but it's still possible to make this buffer valid by following it up with a continuation byte. This returns OperationStatus.NeedMoreData, outs default(Rune) via result, and outs 2 via sequenceLength (because we inspected 2 bytes).

UTF-16: [ D800 0020 ]
The first char [ D800 ] is a standalone high surrogate, and we know that the next char isn't a low surrogate. This allows us to determine unambiguously that the buffer contains ill-formed data. This returns OperationStatus.InvalidData, outs default(Rune) via result, and outs 1 via sequenceLength (because we inspected 1 char).

UTF-8: [ F4 80 80 F5 78 79 7A ]
The 0xF5 byte improperly terminates what we expected to be the start of a 4-byte UTF-8 subsequence. No amount of additional data will ever make this sequence valid. This returns OperationStatus.InvalidData, outs default(Rune) via result, and outs 3 via sequenceLength (per Unicode recommendation re: calculating maximal invalid subsequence length).

Further discussion
Transcoding / escaping / etc. routines often need to know the difference between incomplete and invalid data. If these routines are themselves implemented using the OperationStatus pattern, they need to know whether to return OperationStatus.NeedMoreData or OperationStatus.InvalidData to their own callers.

Even though there are public Rune.Utf8SequenceLength and Rune.Utf16SequenceLength instance properties, we still need an out parameter that answers "how long was the subsequence that we inspected to come up with our determination?" Think of this parameter as akin to "regardless of whether the return value was good / incomplete / invalid, slice the original input buffer by this many elements before you call me next."

To see an example of this in action (in the UTF-8 JSON escaper), see https://github.com/GrabYourPitchforks/jsonescape/blob/1218cf6fa8988352474281c23da7bd9de087ec2f/JsonEscape/Escaper/Utf8JavaScriptEncoder.cs#L66.
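The "slice by this many elements" pattern might look roughly like the sketch below. It is not the linked escaper's code. WalkUtf8 is a hypothetical helper written for this example, and it assumes the shipped name Rune.DecodeFromUtf8 rather than the names used in this thread. The helper skips over invalid subsequences, and forwards NeedMoreData to its own caller along with the number of bytes it fully processed.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper: walks a UTF-8 buffer one subsequence at a time.
static OperationStatus WalkUtf8(ReadOnlySpan<byte> utf8, out int bytesProcessed,
                                out int wellFormed, out int illFormed)
{
    bytesProcessed = 0;
    wellFormed = 0;
    illFormed = 0;

    while (!utf8.IsEmpty)
    {
        // Assumption: shipped name DecodeFromUtf8.
        OperationStatus status = Rune.DecodeFromUtf8(utf8, out _, out int length);

        if (status == OperationStatus.NeedMoreData)
        {
            // Incomplete tail: the caller can append data and resume at bytesProcessed.
            return status;
        }

        if (status == OperationStatus.InvalidData) illFormed++;
        else wellFormed++;

        // Good or invalid, advance by the inspected length before decoding again.
        utf8 = utf8.Slice(length);
        bytesProcessed += length;
    }

    return OperationStatus.Done;
}

// The last UTF-8 sample buffer from the proposal.
OperationStatus result = WalkUtf8(new byte[] { 0xF4, 0x80, 0x80, 0xF5, 0x78, 0x79, 0x7A },
                                  out int total, out int good, out int bad);
Console.WriteLine($"{result}: {total} bytes, {good} well-formed, {bad} ill-formed");
// Done: 7 bytes, 3 well-formed, 2 ill-formed

// A buffer that ends in the incomplete [ F0 BF ] sample.
OperationStatus partial = WalkUtf8(new byte[] { 0x78, 0xF0, 0xBF },
                                   out int partialTotal, out int partialGood, out int partialBad);
Console.WriteLine($"{partial}: {partialTotal} bytes, {partialGood} well-formed, {partialBad} ill-formed");
// NeedMoreData: 1 bytes, 1 well-formed, 0 ill-formed
```

In the first buffer the invalid run [ F4 80 80 ] is skipped as one unit of length 3, and the stray 0xF5 is then reported as a second invalid subsequence of length 1, which is why two ill-formed runs are counted.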