Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: SequenceReader<T>.ReadToEnd / AdvanceToEnd #29360

Closed
hexawyz opened this issue Apr 24, 2019 · 14 comments
Closed

Proposal: SequenceReader<T>.ReadToEnd / AdvanceToEnd #29360

hexawyz opened this issue Apr 24, 2019 · 14 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Buffers
Milestone

Comments

@hexawyz
Copy link

hexawyz commented Apr 24, 2019

Rationale

Today, one can read a sequence up to any delimiter and advance to/after the delimiter by using SequenceReader<T>.TryReadTo or SequenceReader<T>.TryReadToAny.

This is quite convenient when working with bounded and well-delimited data, but can be cumbersome when processing sequences of an unbounded data stream. (e.g. data coming from a PipeReader in an uncontrolled matter)

The problem this proposal seeks to address is: Read data as far as possible until any delimiter is reached.

It applies in these cases

  • The data may or may not contain delimiters
  • The amount of data between two delimiters is random and can be very small (0) or very large (virtually ∞)
  • The data will/may never be complete, and must be processed as soon as available (e.g. redirected stdout from a long running application)

It is possible today to have something like that:

if (!reader.TryReadToAny(out ReadOnlySequence<byte> readSequence, delimiters, false))
{
    readSequence = reader.Sequence.Slice(reader.Position, buffer.End);
    reader.Advance(reader.Remaining);
}

Or something like that:

if (!reader.TryReadToAny(out ReadOnlySequence<byte> readSequence, delimiters, false))
{
    readSequence = reader.Sequence.Slice(reader.Position, buffer.End);
    // "Skip" to the end.
    reader = new SequenceReader<byte>(reader.Sequence.Slice(buffer.End, buffer.End));
}

However, both these options come with their own problems:

  • The first option relies on SequenceReader<T>.Advance, which will enumerate every remaining segment once. This is pointless and wasteful if the desired result is simply to jump to the end, and also redundant because it has already been done by the call to SequenceReader<T>.TryReadToAny.
  • The second option simply creates a new reader. Though not perfect, this is probably more efficient, but not very straightforward.
  • Both options require manual interaction with the original ReadOnlySequence<byte>, while that may not have been required in the first place.

In this case, an API such as SequenceReader<T>.ReadToAnyOrEnd would work, but it would still feel a bit clumsy to use, compared to other methods.
Plus, half of the desired feature can already be achieved from another method, so it would be better to provide only the other half if possible.

Hence this proposal about adding the ability to skip to the end of a sequence.

Proposed API

public ref partial struct SequenceReader<T> where T : unmanaged, IEquatable<T>
{
    // …
    public void AdvanceToEnd() { /* … */ }
    public ReadOnlySequence<T> ReadToEnd() { /* … */ }
    // …
}

Method AdvanceToEnd would simply update the internal state of the reader to the end of the sequence. Hopefully, no complex operation will be required here.

Method ReadToEnd would be pretty similar, but return the remaining part of the sequence before advancing to the end.

Example

Read the maximum amount of data until a delimiter:

if (!sequenceReader.TryReadToAny(out ReadOnlySequence<byte> readSequence, delimiters, false))
{
    readSequence = sequenceReader.ReadToEnd();
}

ProcessPartialData(readSequence);

pipeReader.AdvanceTo(sequenceReader.Position);

Skip to delimiter(s) or skip the sequence entirely:

if (!sequenceReader.TryAdvanceToAny(delimiters, false))
{
    sequenceReader.AdvanceToEnd();
}

pipeReader.AdvanceTo(sequenceReader.Position);

Questions

Should it be bool TryReadToEnd(out ReadOnlySequence<T> sequence) instead of ReadOnlySequence<T> ReadToEnd() ? (Returns false if the end has already been reached)

@ahsonkhan
Copy link
Member

cc @JeremyKuhne to triage/give first pass.

@JeremyKuhne
Copy link
Member

This is presuming that you want a SequenceReader<T> positioned at the end, not a Position from the end of the SequenceReader<T>. If all you want is to quickly get the end position, that is already available with seqenceReader.Sequence.End.

I can see the argument where you want to pass the reader on to something else that will read more if there is anything left to read. In that case we can provide an optimized AdvanceToEnd() that will set all of the state appropriately (Consumed, CurrentSpan, etc.). That is still more expensive than just getting the end position so we'd want to document that you should use sequenceReader.Sequence.End if all you're looking for is the end position.

I'm a little more hesitant on ReadToEnd() as it is just a convenience over:

var remaining = sequenceReader.Sequence.Slice(sequenceReader.Position);
sequenceReader.AdvanceToEnd();

It's also is a bit off an odd fit with the rest of the API. We don't have anything else quite the same. TryReadTo* are the only APIs that return sequences, but they all are about delimiters. They'll return empty sequences, which a TryReadToEnd() would never do.

@hexawyz
Copy link
Author

hexawyz commented Apr 27, 2019

This is presuming that you want a SequenceReader<T> positioned at the end

I can see the argument where you want to pass the reader on to something else that will read more if there is anything left to read.

Yeah, that's mostly the idea. The way I understand it, SequenceReader<T>.Position should "always" indicate which part of the sequence has been consumed so that operations can be chained gracefully. (e.g. finish your method with pipeReader(reader.Position, reader.Sequence.End))

I might be wrong there, but I expect it to happen quite often that one wishes to consume the whole remaining sequence if no delimiter were found.

I'm a little more hesitant on ReadToEnd() as it is just a convenience over:

My proposal was actually more about ReadToEnd than AdvanceToEnd. IMO it's more likely that one wishes to actually get the remaining sequence at the same time as skipping to the end. It just didn't really make sense to me to propose ReadToEnd without its counterpart AdvanceToEnd.

In fact, a quick search through the code in AspNetCore lead me to this:
https://github.com/aspnet/AspNetCore/blob/fc9e48877c75562af1d3879a57694f55d5685397/src/Http/WebUtilities/src/FormPipeReader.cs#L255

More context

I'm basically trying to implement a parser atop SequenceReader, and SequenceReader.Position. Let's assume this is for parsing a stream of XML text, because it's easy to reason with. The parsing algorithm for a given ReadOnlySequence should map to the following regex: /^(?:([^&]*(?=&|$))(&[^;]+;)?)+/

I can split text in chunks, but entities need to be decoded as a whole. So, the logic would be as follows:

  1. Read as much text as possible (until either end or the character & is reached)
  2. If the text not empty, advance consumed position to read position, and output the text
  3. If there is any data left to read, try to decode an XML entity
  4. If the entity was decoded successfully, advance consumed position to read position, and output the decoded entity
  5. If the end has not been reached, go to step 1

The fact that SequenceReader exposes delimiter-related methods seemed like a good indication that it would be a good fit for implementing a stream parser, but it still seems to lack some features that are required for this use case. Here, I'm trying to address the "read as much as possible" case, but there are other potentially helpful methods still missing for parsing data. (e.g. TryReadToNotAny/AdvanceToNotAny, but also features such as #28846)

@JeremyKuhne
Copy link
Member

My proposal was actually more about ReadToEnd than AdvanceToEnd.

That was my understanding as well. In designing the reader it was driven primarily by performance. AdvanceToEnd isn't something you can do efficiently without new API, so it is a very easy sell for me. ReadToEnd you can easily compose externally (say, via an extension method). That isn't a reason to punt on the concept unconditionally, but it is a reason to take our time given the somewhat awkward semantic fit with the existing APIs.

I'll take this to review, the others may allay my concerns or come up with good terminology for the ReadToEnd().

but there are other potentially helpful methods still missing for parsing data

I'm happy to see additional proposals come through. :)

@JeremyKuhne JeremyKuhne self-assigned this May 16, 2019
@danmoseley
Copy link
Member

@JeremyKuhne it seems to me this is not required to ship 3.0. Moving to future. If I am wrong please let me know.

@stevejgordon
Copy link
Contributor

In building a small sample for using SequenceReader I stumbled into a similar requirement.

See https://github.com/stevejgordon/SequenceReaderSample

In my case, I'm parsing a sequence of comma-separated values. Once I've found all items using the delimiter, the remainder of the stream is assumed to be the final item (no delimiter after it). My sample grabs the final data using TryCopyTo which works but is something I'd see having to do often enough that a method to ReadToEnd would be handy. I'd like to get the final data as a Span directly.

@JeremyKuhne - I'd welcome any feedback you have on the sample in general as I plan to post a blog about what it does when using SequenceReader.

@JeremyKuhne
Copy link
Member

@stevejgordon Thanks for the extra info.

In your example, the final item doesn't need a reader- it can simply be Console.WriteLine(Encoding.UTF8.GetString(sequence.ToArray())); (or sequence.CopyTo into another buffer).

I'd also consider changing ReadItem to ReadItems and pass in bool completed so that you can keep all of that logic together. In that case, since you already have a reader you can do something like var finalSequence = sequence.Slice(reader.Position); when TryReadTo fails and you are completed.

@stevejgordon
Copy link
Contributor

Thanks very much, @JeremyKuhne. I've updated the sample per your feedback. I appreciate you getting back to me so quickly. Once the post is done I'll ping the link over to you.

@stevejgordon
Copy link
Contributor

Published an introduction to using SequenceReader based on my sample to https://www.stevejgordon.co.uk/an-introduction-to-sequencereader

@JeremyKuhne
Copy link
Member

@stevejgordon nice! One note- the SequenceReader is documented: https://docs.microsoft.com/en-us/dotnet/api/system.buffers.sequencereader-1?view=netcore-3.0 Doesn't have a nice overview like you have though.

@davidfowl
Copy link
Member

@JeremyKuhne I started working on the doc, I plan to start a open a PR soon.

@stevejgordon
Copy link
Contributor

@JeremyKuhne To be honest, I did scan through that API documentation and should probably include a link to that somewhere in the post! I was referring to more user-facing, here's how you use it documentation which it seems @davidfowl has covered! :-)

@ahsonkhan
Copy link
Member

Based on offline discussion with @JeremyKuhne, since we already have the property UnreadSpan that returns a slice of the CurrentSpan, we should change ReadToEnd() to be a property called UnreadSequence, which does the equivalent of the following:

ReadOnlySequence<byte> unreadSequence = reader.Sequence.Slice(reader.Position);
public ref partial struct SequenceReader<T> where T : unmanaged, IEquatable<T>
{
    // …
    public void AdvanceToEnd() { /* … */ }
    public ReadOnlySequence<T> UnreadSequence { get; }
    // …
}

@JeremyKuhne
Copy link
Member

JeremyKuhne commented Sep 10, 2019

Video

This was approved in dotnet/corefx#40962 as:

namespace System.Buffers
{
    public ref struct SequenceReader<T>
    {
        // Optimized API to position the reader at the end of the sequence (much faster than what users can write)
        public void AdvanceToEnd();

        // Pairs with existing Span<T> UnreadSpan;
        public readonly ReadOnlySequence<T> UnreadSequence { get; }
    }
}

Here are the rough implementations:

        /// <summary>
        /// The unread portion of the <see cref="Sequence"/>.
        /// </summary>
        public readonly ReadOnlySequence<T> UnreadSequence
        {
            get => Sequence.Slice(Position);
        }
        public void AdvanceToEnd()
        {
            if (_moreData)
            {
                Consumed = Length;
                CurrentSpan = default;
                CurrentSpanIndex = 0;
                _currentPosition = Sequence.End;
                _moreData = false;
            }
        }

Marking as up-for-grabs to fully implement/test.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Buffers
Projects
None yet
Development

No branches or pull requests

8 participants