Proposal: SequenceReader<T>.ReadToEnd / AdvanceToEnd #29360

hexawyz · 2019-04-24T21:29:05Z

Rationale

Today, one can read a sequence up to any delimiter and advance to/after the delimiter by using SequenceReader<T>.TryReadTo or SequenceReader<T>.TryReadToAny.

This is quite convenient when working with bounded and well-delimited data, but can be cumbersome when processing sequences of an unbounded data stream. (e.g. data coming from a PipeReader in an uncontrolled matter)

The problem this proposal seeks to address is: Read data as far as possible until any delimiter is reached.

It applies in these cases

The data may or may not contain delimiters
The amount of data between two delimiters is random and can be very small (0) or very large (virtually ∞)
The data will/may never be complete, and must be processed as soon as available (e.g. redirected stdout from a long running application)

It is possible today to have something like that:

if (!reader.TryReadToAny(out ReadOnlySequence<byte> readSequence, delimiters, false))
{
    readSequence = reader.Sequence.Slice(reader.Position, buffer.End);
    reader.Advance(reader.Remaining);
}

Or something like that:

if (!reader.TryReadToAny(out ReadOnlySequence<byte> readSequence, delimiters, false))
{
    readSequence = reader.Sequence.Slice(reader.Position, buffer.End);
    // "Skip" to the end.
    reader = new SequenceReader<byte>(reader.Sequence.Slice(buffer.End, buffer.End));
}

However, both these options come with their own problems:

The first option relies on SequenceReader<T>.Advance, which will enumerate every remaining segment once. This is pointless and wasteful if the desired result is simply to jump to the end, and also redundant because it has already been done by the call to SequenceReader<T>.TryReadToAny.
The second option simply creates a new reader. Though not perfect, this is probably more efficient, but not very straightforward.
Both options require manual interaction with the original ReadOnlySequence<byte>, while that may not have been required in the first place.

In this case, an API such as SequenceReader<T>.ReadToAnyOrEnd would work, but it would still feel a bit clumsy to use, compared to other methods.
Plus, half of the desired feature can already be achieved from another method, so it would be better to provide only the other half if possible.

Hence this proposal about adding the ability to skip to the end of a sequence.

Proposed API

public ref partial struct SequenceReader<T> where T : unmanaged, IEquatable<T>
{
    // …
    public void AdvanceToEnd() { /* … */ }
    public ReadOnlySequence<T> ReadToEnd() { /* … */ }
    // …
}

Method AdvanceToEnd would simply update the internal state of the reader to the end of the sequence. Hopefully, no complex operation will be required here.

Method ReadToEnd would be pretty similar, but return the remaining part of the sequence before advancing to the end.

Example

Read the maximum amount of data until a delimiter:

if (!sequenceReader.TryReadToAny(out ReadOnlySequence<byte> readSequence, delimiters, false))
{
    readSequence = sequenceReader.ReadToEnd();
}

ProcessPartialData(readSequence);

pipeReader.AdvanceTo(sequenceReader.Position);

Skip to delimiter(s) or skip the sequence entirely:

if (!sequenceReader.TryAdvanceToAny(delimiters, false))
{
    sequenceReader.AdvanceToEnd();
}

pipeReader.AdvanceTo(sequenceReader.Position);

Questions

Should it be bool TryReadToEnd(out ReadOnlySequence<T> sequence) instead of ReadOnlySequence<T> ReadToEnd() ? (Returns false if the end has already been reached)

The text was updated successfully, but these errors were encountered:

ahsonkhan · 2019-04-26T00:41:58Z

cc @JeremyKuhne to triage/give first pass.

JeremyKuhne · 2019-04-26T18:26:51Z

This is presuming that you want a SequenceReader<T> positioned at the end, not a Position from the end of the SequenceReader<T>. If all you want is to quickly get the end position, that is already available with seqenceReader.Sequence.End.

I can see the argument where you want to pass the reader on to something else that will read more if there is anything left to read. In that case we can provide an optimized AdvanceToEnd() that will set all of the state appropriately (Consumed, CurrentSpan, etc.). That is still more expensive than just getting the end position so we'd want to document that you should use sequenceReader.Sequence.End if all you're looking for is the end position.

I'm a little more hesitant on ReadToEnd() as it is just a convenience over:

var remaining = sequenceReader.Sequence.Slice(sequenceReader.Position);
sequenceReader.AdvanceToEnd();

It's also is a bit off an odd fit with the rest of the API. We don't have anything else quite the same. TryReadTo* are the only APIs that return sequences, but they all are about delimiters. They'll return empty sequences, which a TryReadToEnd() would never do.

hexawyz · 2019-04-27T17:49:37Z

This is presuming that you want a SequenceReader<T> positioned at the end

I can see the argument where you want to pass the reader on to something else that will read more if there is anything left to read.

Yeah, that's mostly the idea. The way I understand it, SequenceReader<T>.Position should "always" indicate which part of the sequence has been consumed so that operations can be chained gracefully. (e.g. finish your method with pipeReader(reader.Position, reader.Sequence.End))

I might be wrong there, but I expect it to happen quite often that one wishes to consume the whole remaining sequence if no delimiter were found.

I'm a little more hesitant on ReadToEnd() as it is just a convenience over:

My proposal was actually more about ReadToEnd than AdvanceToEnd. IMO it's more likely that one wishes to actually get the remaining sequence at the same time as skipping to the end. It just didn't really make sense to me to propose ReadToEnd without its counterpart AdvanceToEnd.

In fact, a quick search through the code in AspNetCore lead me to this:
https://github.com/aspnet/AspNetCore/blob/fc9e48877c75562af1d3879a57694f55d5685397/src/Http/WebUtilities/src/FormPipeReader.cs#L255

More context

I'm basically trying to implement a parser atop SequenceReader, and SequenceReader.Position. Let's assume this is for parsing a stream of XML text, because it's easy to reason with. The parsing algorithm for a given ReadOnlySequence should map to the following regex: /^(?:([^&]*(?=&|$))(&[^;]+;)?)+/

I can split text in chunks, but entities need to be decoded as a whole. So, the logic would be as follows:

Read as much text as possible (until either end or the character & is reached)
If the text not empty, advance consumed position to read position, and output the text
If there is any data left to read, try to decode an XML entity
If the entity was decoded successfully, advance consumed position to read position, and output the decoded entity
If the end has not been reached, go to step 1

The fact that SequenceReader exposes delimiter-related methods seemed like a good indication that it would be a good fit for implementing a stream parser, but it still seems to lack some features that are required for this use case. Here, I'm trying to address the "read as much as possible" case, but there are other potentially helpful methods still missing for parsing data. (e.g. TryReadToNotAny/AdvanceToNotAny, but also features such as #28846)

JeremyKuhne · 2019-04-29T18:29:10Z

My proposal was actually more about ReadToEnd than AdvanceToEnd.

That was my understanding as well. In designing the reader it was driven primarily by performance. AdvanceToEnd isn't something you can do efficiently without new API, so it is a very easy sell for me. ReadToEnd you can easily compose externally (say, via an extension method). That isn't a reason to punt on the concept unconditionally, but it is a reason to take our time given the somewhat awkward semantic fit with the existing APIs.

I'll take this to review, the others may allay my concerns or come up with good terminology for the ReadToEnd().

but there are other potentially helpful methods still missing for parsing data

I'm happy to see additional proposals come through. :)

danmoseley · 2019-05-30T19:19:34Z

@JeremyKuhne it seems to me this is not required to ship 3.0. Moving to future. If I am wrong please let me know.

stevejgordon · 2019-09-04T11:51:39Z

In building a small sample for using SequenceReader I stumbled into a similar requirement.

See https://github.com/stevejgordon/SequenceReaderSample

In my case, I'm parsing a sequence of comma-separated values. Once I've found all items using the delimiter, the remainder of the stream is assumed to be the final item (no delimiter after it). My sample grabs the final data using TryCopyTo which works but is something I'd see having to do often enough that a method to ReadToEnd would be handy. I'd like to get the final data as a Span directly.

@JeremyKuhne - I'd welcome any feedback you have on the sample in general as I plan to post a blog about what it does when using SequenceReader.

JeremyKuhne · 2019-09-04T18:30:05Z

@stevejgordon Thanks for the extra info.

In your example, the final item doesn't need a reader- it can simply be Console.WriteLine(Encoding.UTF8.GetString(sequence.ToArray())); (or sequence.CopyTo into another buffer).

I'd also consider changing ReadItem to ReadItems and pass in bool completed so that you can keep all of that logic together. In that case, since you already have a reader you can do something like var finalSequence = sequence.Slice(reader.Position); when TryReadTo fails and you are completed.

stevejgordon · 2019-09-05T07:18:41Z

Thanks very much, @JeremyKuhne. I've updated the sample per your feedback. I appreciate you getting back to me so quickly. Once the post is done I'll ping the link over to you.

stevejgordon · 2019-09-05T11:25:10Z

Published an introduction to using SequenceReader based on my sample to https://www.stevejgordon.co.uk/an-introduction-to-sequencereader

JeremyKuhne · 2019-09-05T18:32:24Z

@stevejgordon nice! One note- the SequenceReader is documented: https://docs.microsoft.com/en-us/dotnet/api/system.buffers.sequencereader-1?view=netcore-3.0 Doesn't have a nice overview like you have though.

davidfowl · 2019-09-06T05:56:49Z

@JeremyKuhne I started working on the doc, I plan to start a open a PR soon.

stevejgordon · 2019-09-06T05:59:56Z

@JeremyKuhne To be honest, I did scan through that API documentation and should probably include a link to that somewhere in the post! I was referring to more user-facing, here's how you use it documentation which it seems @davidfowl has covered! :-)

ahsonkhan · 2019-09-09T21:59:40Z

Based on offline discussion with @JeremyKuhne, since we already have the property UnreadSpan that returns a slice of the CurrentSpan, we should change ReadToEnd() to be a property called UnreadSequence, which does the equivalent of the following:

ReadOnlySequence<byte> unreadSequence = reader.Sequence.Slice(reader.Position);

public ref partial struct SequenceReader<T> where T : unmanaged, IEquatable<T>
{
    // …
    public void AdvanceToEnd() { /* … */ }
    public ReadOnlySequence<T> UnreadSequence { get; }
    // …
}

JeremyKuhne · 2019-09-10T18:40:01Z

Video

This was approved in dotnet/corefx#40962 as:

namespace System.Buffers
{
    public ref struct SequenceReader<T>
    {
        // Optimized API to position the reader at the end of the sequence (much faster than what users can write)
        public void AdvanceToEnd();

        // Pairs with existing Span<T> UnreadSpan;
        public readonly ReadOnlySequence<T> UnreadSequence { get; }
    }
}

Here are the rough implementations:

        /// <summary>
        /// The unread portion of the <see cref="Sequence"/>.
        /// </summary>
        public readonly ReadOnlySequence<T> UnreadSequence
        {
            get => Sequence.Slice(Position);
        }

        public void AdvanceToEnd()
        {
            if (_moreData)
            {
                Consumed = Length;
                CurrentSpan = default;
                CurrentSpanIndex = 0;
                _currentPosition = Sequence.End;
                _moreData = false;
            }
        }

Marking as up-for-grabs to fully implement/test.

JeremyKuhne self-assigned this May 16, 2019

ahsonkhan assigned MarcoRossignoli Dec 5, 2019

ahsonkhan closed this as completed in #388 Dec 7, 2019

msftgits transferred this issue from dotnet/corefx Feb 1, 2020

msftgits added this to the 5.0 milestone Feb 1, 2020

JeremyKuhne mentioned this issue Feb 1, 2020

Review System.Buffers.SequenceReader<T> proposals #30807

Closed

ghost locked as resolved and limited conversation to collaborators Dec 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: SequenceReader<T>.ReadToEnd / AdvanceToEnd #29360

Proposal: SequenceReader<T>.ReadToEnd / AdvanceToEnd #29360

hexawyz commented Apr 24, 2019

ahsonkhan commented Apr 26, 2019

JeremyKuhne commented Apr 26, 2019

hexawyz commented Apr 27, 2019

JeremyKuhne commented Apr 29, 2019

danmoseley commented May 30, 2019

stevejgordon commented Sep 4, 2019

JeremyKuhne commented Sep 4, 2019

stevejgordon commented Sep 5, 2019

stevejgordon commented Sep 5, 2019

JeremyKuhne commented Sep 5, 2019

davidfowl commented Sep 6, 2019

stevejgordon commented Sep 6, 2019

ahsonkhan commented Sep 9, 2019

JeremyKuhne commented Sep 10, 2019 •

edited by terrajobst

Proposal: SequenceReader<T>.ReadToEnd / AdvanceToEnd #29360

Proposal: SequenceReader<T>.ReadToEnd / AdvanceToEnd #29360

Comments

hexawyz commented Apr 24, 2019

Rationale

Proposed API

Example

Questions

ahsonkhan commented Apr 26, 2019

JeremyKuhne commented Apr 26, 2019

hexawyz commented Apr 27, 2019

More context

JeremyKuhne commented Apr 29, 2019

danmoseley commented May 30, 2019

stevejgordon commented Sep 4, 2019

JeremyKuhne commented Sep 4, 2019

stevejgordon commented Sep 5, 2019

stevejgordon commented Sep 5, 2019

JeremyKuhne commented Sep 5, 2019

davidfowl commented Sep 6, 2019

stevejgordon commented Sep 6, 2019

ahsonkhan commented Sep 9, 2019

JeremyKuhne commented Sep 10, 2019 • edited by terrajobst

JeremyKuhne commented Sep 10, 2019 •

edited by terrajobst