Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Proposal: Add Encoding/Decoding APIs for new System.Buffer types #30882

Closed
davidfowl opened this issue Sep 18, 2019 · 24 comments · Fixed by dotnet/corefx#41810
Closed

API Proposal: Add Encoding/Decoding APIs for new System.Buffer types #30882

davidfowl opened this issue Sep 18, 2019 · 24 comments · Fixed by dotnet/corefx#41810
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Buffers
Milestone

Comments

@davidfowl
Copy link
Member

Today our encoding APIs support Span<byte>/Span<char>/char[]/byte[] and the Encoder/Decoder API, I'd like to propose some higher level APIs that support ReadOnlySequence<byte/char> and IBufferWriter<byte/char>:

namespace System.Text
{
    public static class EncodingExtensions
    {
        public static int GetBytes(this Encoding encoding, ReadOnlySpan<char> chars, IBufferWriter<byte> writer);
        public static long GetBytes(this Encoding encoding, in ReadOnlySequence<char> chars, IBufferWriter<byte> writer);
        public static int GetBytes(this Encoding encoding, in ReadOnlySequence<char> chars, Span<byte> bytes);
        public static byte[] GetBytes(this Encoding encoding, in ReadOnlySequence<char> bytes);

        public static int GetChars(this Encoding encoding, ReadOnlySpan<byte> bytes, IBufferWriter<char> writer);
        public static long GetChars(this Encoding encoding, in ReadOnlySequence<byte> bytes, IBufferWriter<char> writer);
        public static int GetChars(this Encoding encoding, in ReadOnlySequence<byte> bytes, Span<char> chars);
        public static string GetString(this Encoding encoding, in ReadOnlySequence<byte> bytes);
    }
}

These extensions would likely live in the same assembly as these abstractions since encoding couldn't directly take a dependency on those types (unless we push them lower in the stack).

cc @GrabYourPitchforks

@gfoidl
Copy link
Member

gfoidl commented Sep 18, 2019

  • for Encode the ReadOnlySequence<char> should be passed by in.
  • other similar APIs follow the schema source, target, out consumed, out written, this should be aligned to that
public static class EncodingExtensions
{
    public static void Encode(this Encoding encoding, ReadOnlySpan<char> chars, IBufferWriter<byte> writer);

-   public static void Encode(this Encoding encoding, ReadOnlySequence<char> chars, IBufferWriter<byte> writer);
+   public static void Encode(this Encoding encoding, in ReadOnlySequence<char> chars, IBufferWriter<byte> writer);

-   public static OperationStatus Decode(this Encoding encoding, in ReadOnlySequence<byte> bytes, Span<char> chars, out int charsWritten, out SequencePosition bytesConsumedPosition);
+   public static OperationStatus Decode(this Encoding encoding, in ReadOnlySequence<byte> bytes, Span<char> chars, out SequencePosition bytesConsumedPosition, out int charsWritten);

    public static OperationStatus Decode(this Encoding encoding, in ReadOnlySequence<byte> bytes, IBufferWriter<char> writer, out SequencePosition bytesConsumedPosition);
}

@davidfowl
Copy link
Member Author

Updated

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Sep 18, 2019

Keep in mind each API call would incur a single tiny allocation for the Encoder or Decoder object. This should be fine since it'll be short-lived, but wanted to make you aware of it. Additionally, the return value in all cases would be void instead of OperationStatus, as the API would always run to completion (or fail if given invalid data and configured to throw). So you don't need to out the position. But otherwise LGTM.

@davidfowl
Copy link
Member Author

Keep in mind each API call would incur a single tiny allocation for the Encoder or Decoder object. This should be fine since it'll be short-lived, but wanted to make you aware of it.

There might be a fast path for the single segment case that avoids the allocation but good reminder (I actually implemented these before filing the issue). Do you think it makes sense to expose a static API that takes an encoder/decoder? The fact that those can be reset would make it possible to re-use them as well.

Additionally, the return value in all cases would be void instead of OperationStatus, as the API would always run to completion (or fail if given invalid data and configured to throw). So you don't need to out the position.

What if there's incomplete data? I got 2 bytes of a 3 byte UTF8 scalar/codepoint and the next read will have all 3 of the bytes? I'd prefer if these operations not throw but fail in a way that lets the caller provide more data.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Sep 19, 2019

Do you think it makes sense to expose a static API that takes an encoder/decoder? The fact that those can be reset would make it possible to re-use them as well.

Unlikely, as they wouldn't play well with the other APIs on Encoder or Decoder. The best course of action would probably be to special-case the single-segment case, as you had suggested.

What if there's incomplete data? I got 2 bytes of a 3 byte UTF8 scalar/codepoint and the next read will have all 3 of the bytes? I'd prefer if these operations not throw but fail in a way that lets the caller provide more data.

What's an example of a scenario where this would occur? My understanding of ReadOnlySequence<T> is that any given sequence is supposed to represent self-contained data, even though that data could be divided arbitrarily among numerous segments. Are there scenarios where you instead have a sequence of ReadOnlySequence<T>, where the contained data only makes sense when interpreted as part of the greater "sequence of sequences" whole?

The Encoding type does not provide any such "go as far as you can and tell me how far you got" facility. The closest APIs are on Encoder / Decoder, but those types maintain state, so they're ill-suited to the OperationStatus pattern. If you really, really wanted to take Encoder or Decoder as an API (and I'd not recommend this just due to how wonky those types are), the API proposal would instead look like the below.

// These APIs could return 'long' instead of 'void' if you wanted to know how many elements
// were written to the destination. Additionally, the contract would be that they'd consume the
// input sequence in its entirety, hence no "elements consumed" output parameter. Remember
// to pass "flush = true" on the final call if you're calling this in a loop.

public static void GetBytes(this Encoder encoder, ReadOnlySequence<char> chars, IBufferWriter<byte> bytes, bool flush);
public static void GetChars(this Decoder decoder, ReadOnlySequence<byte> bytes, IBufferWriter<char> chars, bool flush);

@GrabYourPitchforks
Copy link
Member

For some context behind my "sequence of sequences" comment above: generally speaking text APIs only ever operate on entirely self-contained text. For example, we don't have an OperationStatus-based ToUpper API because it would require extraordinary contortion all the way up and down the stack, including changes to either libicu (on non-Windows) or the Windows operating system itself. For this same reason, most text-based extension methods on ReadOnlySpan<char> (Contains, IndexOf, etc.) should only ever be called if the ROS<char> is a single contiguous buffer that represents the entirety of the text to be inspected. This is especially crucial when culture-aware APIs are called.

Transcoding has always been one of very few special cases where it is possible to operate on data in a chunked manner. But due to the fact that you can't really operate on chunked data other than shuttle it around - you need to reconstitute into a single contiguous buffer the entirety of the data before you can perform textual operations on it - I want to make sure that we're not encouraging devs to treat ROS<char> "segments" of text equivalently to ROS<char> "complete" texts.

@davidfowl
Copy link
Member Author

davidfowl commented Sep 19, 2019

Unlikely, as they wouldn't play well with the other APIs on Encoder or Decoder. The best course of action would probably be to special-case the single-segment case, as you had suggested.

It doesn’t need to play well but it can’t throw an exception. The thing reading the stream shouldn’t need to try catch, it just needs to know how much data was consumed.

What's an example of a scenario where this would occur? My understanding of ReadOnlySequence is that any given sequence is supposed to represent self-contained data, even though that data could be divided arbitrarily among numerous segments. Are there scenarios where you instead have a sequence of ReadOnlySequence, where the contained data only makes sense when interpreted as part of the greater "sequence of sequences" whole?

Using a ReadOnlySequence<byte> that comes from a PipeReader. It’s arbitrarily sized data that comes from a network or file.

This actually came from me trying to implement a TextReader on top of pipe. All I have is an encoding and a PipeReader (let me look at what StreamReader does because it has the same problem)

The Encoding type does not provide any such "go as far as you can and tell me how far you got" facility. The closest APIs are on Encoder / Decoder, but those types maintain state, so they're ill-suited to the OperationStatus pattern. If you really, really wanted to take Encoder or Decoder as an API (and I'd not recommend this just due to how wonky those types are), the API proposal would instead look like the below.

That’s not bad. I’ll add that to the proposal and leave these ones to mean fully formed sequence. He problem is it’s a pit of failure when using it with pipes.

@davidfowl davidfowl reopened this Sep 19, 2019
@GrabYourPitchforks
Copy link
Member

Sounds like we're circling around taking this Encoder or this Decoder as a parameter then? Since those are stateful types they won't throw exceptions when they see partial data; they'll just consume it and wait for more input. (That's also why my proposed modifications to your APIs don't have an out parameter: it'd be guaranteed to always point to the end of the sequence.)

If you wanted a "writes to Span" overload as well, we could add one. Such an API should probably be an overload to the Convert method that is already exists on Encoder / Decoder.

void Convert(..., ReadOnlySequence<T> src, Span<U> dest, bool flush, out SequencePosition consumed, out int written, out bool completed);

Note that this follows the already-established API shape for Convert and doesn't use OperationStatus. (Unfortunately there are reliability issues trying to turn this pattern back into an OperationStatus, so I'm not optimistic we could expose that particular return type.)

@GrabYourPitchforks
Copy link
Member

FWIW I wish the Encoding types would play better with OperationStatus. When we refactored the base Encoding type last release we internally rewrote much of it in terms of that type.

Check this out:

https://github.com/dotnet/coreclr/blob/d8d6d8a53217806a0f3866e3156e10e1d1b06084/src/System.Private.CoreLib/shared/System/Text/Encoding.Internal.cs#L71-L85

In theory, if we were to completely redo the Encoding class from scratch, we could make these two methods the only abstract methods on the entire type, and every single other method's default virtual implementation would behave correctly because they'd all be written in terms of these two methods.

Here's the logic for UTF8Encoding:

https://github.com/dotnet/coreclr/blob/d8d6d8a53217806a0f3866e3156e10e1d1b06084/src/System.Private.CoreLib/shared/System/Text/UTF8Encoding.cs#L760-L771

Imagine that the entirety of the UTF8Encoding class could be implemented via those 11 lines of code, plus overriding the BOM property if needed. It might be slow, but it would be correct per-spec. It would also naturally allow for the OperationStatus-based APIs you're looking for because they'd be an integral part of the base type rather than an implementation detail of the internal subclassed types.

@davidfowl
Copy link
Member Author

OK so the StreamReader uses a Decoder internally which makes sense because it doesn't know when it's seen an entire payload (which makes sense how else could it work). But we may need the encoding as well (to do things like GetMaxByte/GetMaxCharCount)

@davidfowl
Copy link
Member Author

davidfowl commented Sep 22, 2019

OK I've gone back and forth on this and I think the original proposal is fine. It'll throw if we don't have a fully formed payload. I'll leave the Encoder/Decoder based overloads to another proposal.

@GrabYourPitchforks
Copy link
Member

So what's the proposal then? The current issue description still includes OperationStatus-based APIs, which for the reasons I mentioned above we can't expose reliably.

@davidfowl
Copy link
Member Author

@GrabYourPitchforks updated

@GrabYourPitchforks
Copy link
Member

The behavior of the APIs that write to Span would be that they'll throw if the destination is too small. If this is acceptable then mark this as ready for review. :)
(FYI the methods which write to a IBufferWriter should probably return long, not int.)

@davidfowl
Copy link
Member Author

The behavior of the APIs that write to Span would be that they'll throw if the destination is too small. If this is acceptable then mark this as ready for review. :)

That's OK. Isn't it the case today? If you want a Span big enough you need to call GetByteCount/GetCharCount right?

(FYI the methods which write to a IBufferWriter should probably return long, not int.)

Fixed, well sorta. You can't write more than an Int32 worth of data so, it'll throw if the ReadOnlySequence represents > more bytes than an int.

I also added GetString and GetBytes overloads that have the same behavior.

@ahsonkhan
Copy link
Member

These extensions would likely live in the same assembly as these abstractions since encoding couldn't directly take a dependency on those types (unless we push them lower in the stack).

It's not clear to me which assembly these APIs live in. Are you suggesting they go in System.Buffers.dll because Encoding is in corelib but IBufferWriter<T> and ROSequence<T> are not low enough? If so, should these go in System.Memory.dll instead?

@davidfowl
Copy link
Member Author

They have to go where System.Memory lives.

@terrajobst
Copy link
Member

terrajobst commented Oct 1, 2019

Video

  • The signatures makes sense
  • Should also add
    public static long GetByteCount(this Encoding encoding, in ReadOnlySequence<char> chars);
    public static long GetCharCount(this Encoding encoding, in ReadOnlySequence<byte> bytes);
  • Making this an extension seems slightly sad; is this the point where we should consider movign ReadOnlySequence<T>, IBfferWriter<T> into corlib? Can anybody think of scenarios where we'd like those types down?
  • @GrabYourPitchforks filed dotnet/corefx#41474 to make these instance methods should we move the types down

@jkotas
Copy link
Member

jkotas commented Oct 1, 2019

System.Text.Encoding is an abstraction, moreover not a high-performance zero-allocation one (at least today).

Would it be better to flip these APIs to be UTF8 and IBufferWriter-centric? Something like:

 // Hardcoded to use UTF8 encoding and avoids overhead of System.Text.Encoding abstractions
void WriteBytes(this IBufferWriter<byte> writer, ReadOnlySpan<char> chars);
void WriteBytes(this IBufferWriter<byte> writer, ReadOnlySpan<char> chars, Encoding encoding);
...

@GrabYourPitchforks
Copy link
Member

I think it would absolutely make sense to have UTF-8 specific APIs for this. But there are also scenarios where we'd want support for non-UTF-8 encodings, such as in an HTTP web request where the client requests that the response be in an alternative encoding.

The implementation of GetBytes / GetChars which output to a Span<T> or an IBufferWriter<T> would be non-allocating for the case where the input ReadOnlySequence<T> is a single chunk; it would allocate a EncoderNLS / DecoderNLS instance for the case where the input sequence consists of multiple chunks. This instance should be very short-lived, so I don't think there's a risk of much cruft ending up in the GC heap.

@jkotas
Copy link
Member

jkotas commented Oct 2, 2019

absolutely make sense to have UTF-8 specific APIs for this. But there are also scenarios where we'd want support for non-UTF-8 encoding

Agree, that's why I have two overloads in my sketch above.

@GrabYourPitchforks
Copy link
Member

Ah, sorry, misinterpreted your suggestion. @davidfowl, would a UTF8 specific helper be beneficial for your scenarios? We could even make that one OperationStatus based if you want.

@davidfowl
Copy link
Member Author

Ah, sorry, misinterpreted your suggestion. @davidfowl, would a UTF8 specific helper be beneficial for your scenarios? We could even make that one OperationStatus based if you want.

In some scenarios as a fast path. Today we're still using the encoding types all over ASP.NET Core (and in the broader .NET library ecosystem as a whole). So maybe eventually but as for right now I'm more concerned with adding overloads for IBW and ReadOnlySequence all over the stack.

@GrabYourPitchforks GrabYourPitchforks self-assigned this Oct 3, 2019
@GrabYourPitchforks
Copy link
Member

After further thought I don't think GetCharCount / GetByteCount can be implemented efficiently. I'd recommend not adding those unless we see demand for them.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Buffers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants