API Proposal: Add Encoding/Decoding APIs for new System.Buffer types #30882

davidfowl · 2019-09-18T05:00:48Z

Today our encoding APIs support Span<byte>/Span<char>/char[]/byte[] and the Encoder/Decoder API, I'd like to propose some higher level APIs that support ReadOnlySequence<byte/char> and IBufferWriter<byte/char>:

namespace System.Text
{
    public static class EncodingExtensions
    {
        public static int GetBytes(this Encoding encoding, ReadOnlySpan<char> chars, IBufferWriter<byte> writer);
        public static long GetBytes(this Encoding encoding, in ReadOnlySequence<char> chars, IBufferWriter<byte> writer);
        public static int GetBytes(this Encoding encoding, in ReadOnlySequence<char> chars, Span<byte> bytes);
        public static byte[] GetBytes(this Encoding encoding, in ReadOnlySequence<char> bytes);

        public static int GetChars(this Encoding encoding, ReadOnlySpan<byte> bytes, IBufferWriter<char> writer);
        public static long GetChars(this Encoding encoding, in ReadOnlySequence<byte> bytes, IBufferWriter<char> writer);
        public static int GetChars(this Encoding encoding, in ReadOnlySequence<byte> bytes, Span<char> chars);
        public static string GetString(this Encoding encoding, in ReadOnlySequence<byte> bytes);
    }
}

These extensions would likely live in the same assembly as these abstractions since encoding couldn't directly take a dependency on those types (unless we push them lower in the stack).

cc @GrabYourPitchforks

The text was updated successfully, but these errors were encountered:

gfoidl · 2019-09-18T10:21:59Z

for Encode the ReadOnlySequence<char> should be passed by in.
other similar APIs follow the schema source, target, out consumed, out written, this should be aligned to that

public static class EncodingExtensions
{
    public static void Encode(this Encoding encoding, ReadOnlySpan<char> chars, IBufferWriter<byte> writer);

-   public static void Encode(this Encoding encoding, ReadOnlySequence<char> chars, IBufferWriter<byte> writer);
+   public static void Encode(this Encoding encoding, in ReadOnlySequence<char> chars, IBufferWriter<byte> writer);

-   public static OperationStatus Decode(this Encoding encoding, in ReadOnlySequence<byte> bytes, Span<char> chars, out int charsWritten, out SequencePosition bytesConsumedPosition);
+   public static OperationStatus Decode(this Encoding encoding, in ReadOnlySequence<byte> bytes, Span<char> chars, out SequencePosition bytesConsumedPosition, out int charsWritten);

    public static OperationStatus Decode(this Encoding encoding, in ReadOnlySequence<byte> bytes, IBufferWriter<char> writer, out SequencePosition bytesConsumedPosition);
}

davidfowl · 2019-09-18T14:22:03Z

Updated

GrabYourPitchforks · 2019-09-18T16:03:27Z

Keep in mind each API call would incur a single tiny allocation for the Encoder or Decoder object. This should be fine since it'll be short-lived, but wanted to make you aware of it. Additionally, the return value in all cases would be void instead of OperationStatus, as the API would always run to completion (or fail if given invalid data and configured to throw). So you don't need to out the position. But otherwise LGTM.

davidfowl · 2019-09-18T16:16:27Z

Keep in mind each API call would incur a single tiny allocation for the Encoder or Decoder object. This should be fine since it'll be short-lived, but wanted to make you aware of it.

There might be a fast path for the single segment case that avoids the allocation but good reminder (I actually implemented these before filing the issue). Do you think it makes sense to expose a static API that takes an encoder/decoder? The fact that those can be reset would make it possible to re-use them as well.

Additionally, the return value in all cases would be void instead of OperationStatus, as the API would always run to completion (or fail if given invalid data and configured to throw). So you don't need to out the position.

What if there's incomplete data? I got 2 bytes of a 3 byte UTF8 scalar/codepoint and the next read will have all 3 of the bytes? I'd prefer if these operations not throw but fail in a way that lets the caller provide more data.

GrabYourPitchforks · 2019-09-19T01:08:41Z

Do you think it makes sense to expose a static API that takes an encoder/decoder? The fact that those can be reset would make it possible to re-use them as well.

Unlikely, as they wouldn't play well with the other APIs on Encoder or Decoder. The best course of action would probably be to special-case the single-segment case, as you had suggested.

What if there's incomplete data? I got 2 bytes of a 3 byte UTF8 scalar/codepoint and the next read will have all 3 of the bytes? I'd prefer if these operations not throw but fail in a way that lets the caller provide more data.

What's an example of a scenario where this would occur? My understanding of ReadOnlySequence<T> is that any given sequence is supposed to represent self-contained data, even though that data could be divided arbitrarily among numerous segments. Are there scenarios where you instead have a sequence of ReadOnlySequence<T>, where the contained data only makes sense when interpreted as part of the greater "sequence of sequences" whole?

The Encoding type does not provide any such "go as far as you can and tell me how far you got" facility. The closest APIs are on Encoder / Decoder, but those types maintain state, so they're ill-suited to the OperationStatus pattern. If you really, really wanted to take Encoder or Decoder as an API (and I'd not recommend this just due to how wonky those types are), the API proposal would instead look like the below.

// These APIs could return 'long' instead of 'void' if you wanted to know how many elements
// were written to the destination. Additionally, the contract would be that they'd consume the
// input sequence in its entirety, hence no "elements consumed" output parameter. Remember
// to pass "flush = true" on the final call if you're calling this in a loop.

public static void GetBytes(this Encoder encoder, ReadOnlySequence<char> chars, IBufferWriter<byte> bytes, bool flush);
public static void GetChars(this Decoder decoder, ReadOnlySequence<byte> bytes, IBufferWriter<char> chars, bool flush);

GrabYourPitchforks · 2019-09-19T01:25:42Z

For some context behind my "sequence of sequences" comment above: generally speaking text APIs only ever operate on entirely self-contained text. For example, we don't have an OperationStatus-based ToUpper API because it would require extraordinary contortion all the way up and down the stack, including changes to either libicu (on non-Windows) or the Windows operating system itself. For this same reason, most text-based extension methods on ReadOnlySpan<char> (Contains, IndexOf, etc.) should only ever be called if the ROS<char> is a single contiguous buffer that represents the entirety of the text to be inspected. This is especially crucial when culture-aware APIs are called.

Transcoding has always been one of very few special cases where it is possible to operate on data in a chunked manner. But due to the fact that you can't really operate on chunked data other than shuttle it around - you need to reconstitute into a single contiguous buffer the entirety of the data before you can perform textual operations on it - I want to make sure that we're not encouraging devs to treat ROS<char> "segments" of text equivalently to ROS<char> "complete" texts.

davidfowl · 2019-09-19T02:40:13Z

Unlikely, as they wouldn't play well with the other APIs on Encoder or Decoder. The best course of action would probably be to special-case the single-segment case, as you had suggested.

It doesn’t need to play well but it can’t throw an exception. The thing reading the stream shouldn’t need to try catch, it just needs to know how much data was consumed.

What's an example of a scenario where this would occur? My understanding of ReadOnlySequence is that any given sequence is supposed to represent self-contained data, even though that data could be divided arbitrarily among numerous segments. Are there scenarios where you instead have a sequence of ReadOnlySequence, where the contained data only makes sense when interpreted as part of the greater "sequence of sequences" whole?

Using a ReadOnlySequence<byte> that comes from a PipeReader. It’s arbitrarily sized data that comes from a network or file.

This actually came from me trying to implement a TextReader on top of pipe. All I have is an encoding and a PipeReader (let me look at what StreamReader does because it has the same problem)

The Encoding type does not provide any such "go as far as you can and tell me how far you got" facility. The closest APIs are on Encoder / Decoder, but those types maintain state, so they're ill-suited to the OperationStatus pattern. If you really, really wanted to take Encoder or Decoder as an API (and I'd not recommend this just due to how wonky those types are), the API proposal would instead look like the below.

That’s not bad. I’ll add that to the proposal and leave these ones to mean fully formed sequence. He problem is it’s a pit of failure when using it with pipes.

GrabYourPitchforks · 2019-09-19T05:14:07Z

Sounds like we're circling around taking this Encoder or this Decoder as a parameter then? Since those are stateful types they won't throw exceptions when they see partial data; they'll just consume it and wait for more input. (That's also why my proposed modifications to your APIs don't have an out parameter: it'd be guaranteed to always point to the end of the sequence.)

If you wanted a "writes to Span" overload as well, we could add one. Such an API should probably be an overload to the Convert method that is already exists on Encoder / Decoder.

void Convert(..., ReadOnlySequence<T> src, Span<U> dest, bool flush, out SequencePosition consumed, out int written, out bool completed);

Note that this follows the already-established API shape for Convert and doesn't use OperationStatus. (Unfortunately there are reliability issues trying to turn this pattern back into an OperationStatus, so I'm not optimistic we could expose that particular return type.)

GrabYourPitchforks · 2019-09-19T05:40:05Z

FWIW I wish the Encoding types would play better with OperationStatus. When we refactored the base Encoding type last release we internally rewrote much of it in terms of that type.

Check this out:

https://github.com/dotnet/coreclr/blob/d8d6d8a53217806a0f3866e3156e10e1d1b06084/src/System.Private.CoreLib/shared/System/Text/Encoding.Internal.cs#L71-L85

In theory, if we were to completely redo the Encoding class from scratch, we could make these two methods the only abstract methods on the entire type, and every single other method's default virtual implementation would behave correctly because they'd all be written in terms of these two methods.

Here's the logic for UTF8Encoding:

https://github.com/dotnet/coreclr/blob/d8d6d8a53217806a0f3866e3156e10e1d1b06084/src/System.Private.CoreLib/shared/System/Text/UTF8Encoding.cs#L760-L771

Imagine that the entirety of the UTF8Encoding class could be implemented via those 11 lines of code, plus overriding the BOM property if needed. It might be slow, but it would be correct per-spec. It would also naturally allow for the OperationStatus-based APIs you're looking for because they'd be an integral part of the base type rather than an implementation detail of the internal subclassed types.

davidfowl · 2019-09-20T03:22:24Z

OK so the StreamReader uses a Decoder internally which makes sense because it doesn't know when it's seen an entire payload (which makes sense how else could it work). But we may need the encoding as well (to do things like GetMaxByte/GetMaxCharCount)

davidfowl · 2019-09-22T23:15:51Z

OK I've gone back and forth on this and I think the original proposal is fine. It'll throw if we don't have a fully formed payload. I'll leave the Encoder/Decoder based overloads to another proposal.

GrabYourPitchforks · 2019-09-25T06:42:46Z

So what's the proposal then? The current issue description still includes OperationStatus-based APIs, which for the reasons I mentioned above we can't expose reliably.

davidfowl · 2019-09-25T07:56:07Z

@GrabYourPitchforks updated

GrabYourPitchforks · 2019-09-25T17:16:42Z

The behavior of the APIs that write to Span would be that they'll throw if the destination is too small. If this is acceptable then mark this as ready for review. :)
(FYI the methods which write to a IBufferWriter should probably return long, not int.)

davidfowl · 2019-09-25T17:28:15Z

The behavior of the APIs that write to Span would be that they'll throw if the destination is too small. If this is acceptable then mark this as ready for review. :)

That's OK. Isn't it the case today? If you want a Span big enough you need to call GetByteCount/GetCharCount right?

(FYI the methods which write to a IBufferWriter should probably return long, not int.)

Fixed, well sorta. You can't write more than an Int32 worth of data so, it'll throw if the ReadOnlySequence represents > more bytes than an int.

I also added GetString and GetBytes overloads that have the same behavior.

ahsonkhan · 2019-09-25T21:06:31Z

These extensions would likely live in the same assembly as these abstractions since encoding couldn't directly take a dependency on those types (unless we push them lower in the stack).

It's not clear to me which assembly these APIs live in. Are you suggesting they go in System.Buffers.dll because Encoding is in corelib but IBufferWriter<T> and ROSequence<T> are not low enough? If so, should these go in System.Memory.dll instead?

davidfowl · 2019-09-25T21:13:55Z

They have to go where System.Memory lives.

terrajobst · 2019-10-01T17:25:25Z

Video

The signatures makes sense

Should also add

public static long GetByteCount(this Encoding encoding, in ReadOnlySequence<char> chars);
public static long GetCharCount(this Encoding encoding, in ReadOnlySequence<byte> bytes);

Making this an extension seems slightly sad; is this the point where we should consider movign ReadOnlySequence<T>, IBfferWriter<T> into corlib? Can anybody think of scenarios where we'd like those types down?
@GrabYourPitchforks filed dotnet/corefx#41474 to make these instance methods should we move the types down

jkotas · 2019-10-01T19:44:41Z

System.Text.Encoding is an abstraction, moreover not a high-performance zero-allocation one (at least today).

Would it be better to flip these APIs to be UTF8 and IBufferWriter-centric? Something like:

 // Hardcoded to use UTF8 encoding and avoids overhead of System.Text.Encoding abstractions
void WriteBytes(this IBufferWriter<byte> writer, ReadOnlySpan<char> chars);
void WriteBytes(this IBufferWriter<byte> writer, ReadOnlySpan<char> chars, Encoding encoding);
...

GrabYourPitchforks · 2019-10-02T01:02:58Z

I think it would absolutely make sense to have UTF-8 specific APIs for this. But there are also scenarios where we'd want support for non-UTF-8 encodings, such as in an HTTP web request where the client requests that the response be in an alternative encoding.

The implementation of GetBytes / GetChars which output to a Span<T> or an IBufferWriter<T> would be non-allocating for the case where the input ReadOnlySequence<T> is a single chunk; it would allocate a EncoderNLS / DecoderNLS instance for the case where the input sequence consists of multiple chunks. This instance should be very short-lived, so I don't think there's a risk of much cruft ending up in the GC heap.

jkotas · 2019-10-02T04:19:18Z

absolutely make sense to have UTF-8 specific APIs for this. But there are also scenarios where we'd want support for non-UTF-8 encoding

Agree, that's why I have two overloads in my sketch above.

GrabYourPitchforks · 2019-10-02T04:38:52Z

Ah, sorry, misinterpreted your suggestion. @davidfowl, would a UTF8 specific helper be beneficial for your scenarios? We could even make that one OperationStatus based if you want.

davidfowl · 2019-10-02T15:05:49Z

Ah, sorry, misinterpreted your suggestion. @davidfowl, would a UTF8 specific helper be beneficial for your scenarios? We could even make that one OperationStatus based if you want.

In some scenarios as a fast path. Today we're still using the encoding types all over ASP.NET Core (and in the broader .NET library ecosystem as a whole). So maybe eventually but as for right now I'm more concerned with adding overloads for IBW and ReadOnlySequence all over the stack.

GrabYourPitchforks · 2019-10-05T06:13:32Z

After further thought I don't think GetCharCount / GetByteCount can be implemented efficiently. I'd recommend not adding those unless we see demand for them.

davidfowl closed this as completed Sep 19, 2019

davidfowl reopened this Sep 19, 2019

GrabYourPitchforks self-assigned this Oct 3, 2019

GrabYourPitchforks closed this as completed in dotnet/corefx#41810 Oct 22, 2019

msftgits transferred this issue from dotnet/corefx Feb 1, 2020

msftgits added this to the 5.0 milestone Feb 1, 2020

davidfowl mentioned this issue Feb 1, 2020

API Proposal: Add Encoder/Decoder for new System.Buffer types #30957

Closed

ghost locked as resolved and limited conversation to collaborators Dec 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API Proposal: Add Encoding/Decoding APIs for new System.Buffer types #30882

API Proposal: Add Encoding/Decoding APIs for new System.Buffer types #30882

davidfowl commented Sep 18, 2019

gfoidl commented Sep 18, 2019

davidfowl commented Sep 18, 2019

GrabYourPitchforks commented Sep 18, 2019 •

edited

Loading

davidfowl commented Sep 18, 2019

GrabYourPitchforks commented Sep 19, 2019 •

edited

Loading

GrabYourPitchforks commented Sep 19, 2019

davidfowl commented Sep 19, 2019 •

edited

Loading

GrabYourPitchforks commented Sep 19, 2019

GrabYourPitchforks commented Sep 19, 2019

davidfowl commented Sep 20, 2019

davidfowl commented Sep 22, 2019 •

edited

Loading

GrabYourPitchforks commented Sep 25, 2019

davidfowl commented Sep 25, 2019

GrabYourPitchforks commented Sep 25, 2019

davidfowl commented Sep 25, 2019

ahsonkhan commented Sep 25, 2019

davidfowl commented Sep 25, 2019

terrajobst commented Oct 1, 2019 •

edited

Loading

jkotas commented Oct 1, 2019 •

edited

Loading

GrabYourPitchforks commented Oct 2, 2019

jkotas commented Oct 2, 2019

GrabYourPitchforks commented Oct 2, 2019

davidfowl commented Oct 2, 2019

GrabYourPitchforks commented Oct 5, 2019

API Proposal: Add Encoding/Decoding APIs for new System.Buffer types #30882

API Proposal: Add Encoding/Decoding APIs for new System.Buffer types #30882

Comments

davidfowl commented Sep 18, 2019

gfoidl commented Sep 18, 2019

davidfowl commented Sep 18, 2019

GrabYourPitchforks commented Sep 18, 2019 • edited Loading

davidfowl commented Sep 18, 2019

GrabYourPitchforks commented Sep 19, 2019 • edited Loading

GrabYourPitchforks commented Sep 19, 2019

davidfowl commented Sep 19, 2019 • edited Loading

GrabYourPitchforks commented Sep 19, 2019

GrabYourPitchforks commented Sep 19, 2019

davidfowl commented Sep 20, 2019

davidfowl commented Sep 22, 2019 • edited Loading

GrabYourPitchforks commented Sep 25, 2019

davidfowl commented Sep 25, 2019

GrabYourPitchforks commented Sep 25, 2019

davidfowl commented Sep 25, 2019

ahsonkhan commented Sep 25, 2019

davidfowl commented Sep 25, 2019

terrajobst commented Oct 1, 2019 • edited Loading

jkotas commented Oct 1, 2019 • edited Loading

GrabYourPitchforks commented Oct 2, 2019

jkotas commented Oct 2, 2019

GrabYourPitchforks commented Oct 2, 2019

davidfowl commented Oct 2, 2019

GrabYourPitchforks commented Oct 5, 2019

GrabYourPitchforks commented Sep 18, 2019 •

edited

Loading

GrabYourPitchforks commented Sep 19, 2019 •

edited

Loading

davidfowl commented Sep 19, 2019 •

edited

Loading

davidfowl commented Sep 22, 2019 •

edited

Loading

terrajobst commented Oct 1, 2019 •

edited

Loading

jkotas commented Oct 1, 2019 •

edited

Loading