
Resumable Parsing #48

Closed

TehPers opened this issue Sep 9, 2019 · 12 comments

TehPers commented Sep 9, 2019

I would love to be able to resume parsing once I finish parsing something. For example, it would be nice to do something like this:

IEnumerable<Token> ParseStream(TextReader stream) {
    var result = this.Parser.Parse(stream);
    while (result.Success) {
        yield return result.Value;
        result = result.ParseAgain(); // This would probably be implemented differently
    }
}

This would allow us to parse from an input stream as needed, without needing to store all the output tokens in memory. For example, we could process a 2GB+ file and write the output into another file without using very much memory.

Sprache has an equivalent to this, but does not support parsing from a stream:

IEnumerable<Token> ParseStream(string input) {
    var result = this.Parser.TryParse(input);
    while (result.WasSuccessful) {
        yield return result.Value;
        result = this.Parser(result.Remainder);
    }
}

One possible approach is to keep a reference to the ParseState<T> inside the object that Parse returns.

benjamin-hodgson (Owner) commented Sep 9, 2019

Does this work?

IEnumerable<Token> ParseStream(TextReader stream)
{
    var result = this.Parser.Parse(stream);
    while (result.Success)
    {
        yield return result.Value;
        result = this.Parser.Parse(stream);
    }
}

After the call to parser.Parse the stream is left at the position the parser got up to. (nb, mixing imperative code with lazy enumerables is kind of risky.)
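That risk can be made concrete without Pidgin at all: iterator methods defer all work until enumeration, so pairing them with using blocks is easy to get wrong. A minimal self-contained sketch (my own illustration, not using the library):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LazyPitfall
{
    // An iterator like ParseStream above: nothing runs until the result is enumerated.
    static IEnumerable<string> ReadLines(TextReader reader)
    {
        string? line;
        while ((line = reader.ReadLine()) != null)
            yield return line;
    }

    static void Main()
    {
        IEnumerable<string> lines;
        using (var reader = new StringReader("a\nb"))
        {
            lines = ReadLines(reader); // no reading happens here
        }
        // The reader is already disposed by the time enumeration starts,
        // so the first MoveNext throws ObjectDisposedException.
        foreach (var line in lines)
            Console.WriteLine(line);
    }
}
```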

TehPers (Author) commented Sep 9, 2019

Yep, that works. Thank you!

TehPers closed this as completed Sep 9, 2019
TehPers (Author) commented Sep 9, 2019

Sorry for reopening this. I'm still new to this library and I'm trying to use it for a class assignment.

This outputs 6:

private static void Main(string[] args)
{
    var parser = Parser.Digit;
    var input = "121528";

    using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(input)))
    using (var reader = new StreamReader(stream))
    {
        parser.Parse(reader);
        Console.WriteLine(stream.Position);
    }
}

Shouldn't this output 1? I think this is where my issue was coming up.

Edit: for reference, I'm reading input from a file for the assignment.
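For what it's worth, the read-ahead can be reproduced without the parser in the loop at all; StreamReader itself pulls a whole chunk from the underlying stream on the first read (a sketch using the same input as above):

```csharp
using System;
using System.IO;
using System.Text;

// The read-ahead is StreamReader's, not the parser's: the reader fills
// its internal buffer from the underlying stream in one chunk.
using var stream = new MemoryStream(Encoding.UTF8.GetBytes("121528"));
using var reader = new StreamReader(stream);
reader.Read();                      // consume a single character, '1'
Console.WriteLine(stream.Position); // prints 6: the reader buffered the whole stream
```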

TehPers reopened this Sep 9, 2019
benjamin-hodgson (Owner) commented Sep 17, 2019

Ah, I think you're right. Because data is pulled from the stream in chunks, the parser potentially (usually) reads ahead of where it actually ends up. I guess the fix would be to rewind the stream after parsing (after checking CanSeek) to the position the parser actually consumed up to.

benjamin-hodgson (Owner) commented Sep 17, 2019

Would it be preferable to return the total count of tokens consumed (and possibly even the SourcePos?) as a property of the Result? Then you can rewind the stream yourself.

Edit: On second thoughts you can already pull the SourcePos using CurrentPos so that's probably not necessary.
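If the count were exposed, the caller-side rewind might look roughly like this (a sketch only: result.ConsumedCount is a hypothetical property, not part of the library, and the arithmetic only holds for a seekable stream with a one-byte-per-char encoding such as ASCII):

```csharp
// Hypothetical sketch; ConsumedCount does not exist in the library and
// stands in for "number of input tokens the parse actually consumed".
var startPosition = stream.Position;
var result = parser.Parse(new StreamReader(stream));
if (stream.CanSeek)
{
    // Valid only when one char == one byte (e.g. ASCII input).
    stream.Seek(startPosition + result.ConsumedCount, SeekOrigin.Begin);
}
```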

TehPers (Author) commented Sep 18, 2019

I think returning an IEnumerable<T> that can be used to continue reading tokens would be helpful. It could maintain a buffer of tokens that have been read from the input source but not processed and return those first. For example:

IEnumerable<TToken> GetUnconsumedTokens(TToken[] buffer, int startIndex) {
    for (var i = startIndex; i < buffer.Length; i++) {
        // You might want to remove the item at some point for memory purposes too; this is just an example
        yield return buffer[i];
    }

    // Read from the input source here; this depends on the input type, of course
}

This could be a method on ITokenStream<T> that is invoked and returned as part of the result, maybe? It could also be a method on ParseState<T> that reads input by calling ITokenStream<T>.ReadInto to load tokens one at a time (since they'd only get buffered again later anyway). I'm a little nervous about relying solely on the number of tokens consumed, because knowing how many tokens were consumed doesn't mean those tokens can be recovered. Input streams don't have to support rewinding, and re-enumerating an input IEnumerable<T> just to skip some number of tokens could be an expensive operation.

benjamin-hodgson (Owner) commented
Ah yeah I quite like that idea. "Here are the tokens which I pulled from the input but didn't actually eat", ie, the remainder of the ParseState's buffer.

beho commented Sep 23, 2019

> Would it be preferable to return the total count of tokens consumed (and possibly even the SourcePos?) as a property of the Result? Then you can rewind the stream yourself.
>
> Edit: On second thoughts you can already pull the SourcePos using CurrentPos so that's probably not necessary.

It might be more convenient to have SourcePos as part of the Result. Otherwise you have to append the CurrentPos parser to every parser, which sort of mixes multiple concerns together. Or is there any other way?
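The per-parser wiring would look something like this (a sketch: `parser` and `Token` stand in for the user's own definitions, and I'm assuming the CurrentPos parser and a Then overload with a result selector):

```csharp
// Hedged sketch: sequence the user's parser with CurrentPos so every
// result carries the position the parser reached. `parser` and `Token`
// are placeholders for the user's own definitions.
Parser<char, (Token Value, SourcePos Pos)> withPos =
    parser.Then(Parser<char>.CurrentPos, (value, pos) => (value, pos));
```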

TehPers (Author) commented Sep 25, 2019

I've been messing with this a bit, and I agree that SourcePos should be part of the Result, although not for rewinding. It would be helpful to be able to pass it back into Parse later to tell the parser where parsing began. For example, say you finish parsing at line 2, column 18: you probably want to tell the parser to continue from there, so that subsequent SourcePos values are accurate and you get accurate error messages (if you rely on SourcePos for those).

Something that comes to mind is that the SourcePos could be tracked in ParseState and updated as parsing occurs, although bookmarks would then also need to keep track of the SourcePos. This is just an idea; you could instead call ComputeSourcePos() at the end, but I think that relies on all of the parsed tokens still being in the buffer? Nevermind, the parser keeps track of the SourcePos where the buffer starts, so it wouldn't really matter that much.

benjamin-hodgson added this to the v3 milestone Jun 30, 2020
benjamin-hodgson (Owner) commented

I've been kicking around the idea of replacing SourcePos with a similar but slightly different concept of a SourcePosDelta, representing the change in source location since the beginning of parsing.

When you come to (e.g.) render an error message, you can concretise the SourcePosDelta by adding it to a specific SourcePos, which would be (1, 1) if you started parsing at the beginning of the file, or (1, 1) + previousDelta if you'd already consumed previousDelta-worth of input.
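A rough sketch of the arithmetic (illustrative stand-in types, not the library's actual definitions): a delta that stays on the same line advances the column, while a delta that spans lines resets it.

```csharp
// Illustrative stand-ins for the proposed types, not Pidgin's definitions.
readonly record struct SourcePos(int Line, int Col);
readonly record struct SourcePosDelta(int Lines, int Cols);

static class PosMath
{
    // Concretise a delta against a known starting position (1-based columns).
    public static SourcePos Add(SourcePos pos, SourcePosDelta delta)
        => delta.Lines == 0
            ? new SourcePos(pos.Line, pos.Col + delta.Cols)          // same line: advance the column
            : new SourcePos(pos.Line + delta.Lines, delta.Cols + 1); // new line: the column restarts
}
```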

Thoughts?

benjamin-hodgson (Owner) commented

Code here: 4141044

benjamin-hodgson pushed a commit that referenced this issue Jul 24, 2020
benjamin-hodgson (Owner) commented Jul 24, 2020

Regarding leftover tokens, I'm thinking it makes sense to make it part of the protocol between the parser and the TokenStream. I've added an OnParserEnd(ReadOnlySpan<TToken>) method to ITokenStream, to which the parser passes any unconsumed tokens. Code here (on the v3 branch): 851dc1c
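As a sketch of what that protocol enables (my own illustration; the real ITokenStream on the v3 branch may differ), here is a stream that stashes the unconsumed tail and serves it back before fresh input:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative sketch, not Pidgin's actual ITokenStream: a char stream
// that stashes tokens the parser pulled but didn't consume, and serves
// them back at the start of the next parse.
class ResumableCharStreamSketch
{
    private readonly TextReader _underlying;
    private readonly Queue<char> _leftovers = new Queue<char>();

    public ResumableCharStreamSketch(TextReader underlying) => _underlying = underlying;

    // Called by the parser when it finishes, with the unconsumed tail of its buffer.
    public void OnParserEnd(ReadOnlySpan<char> unconsumed)
    {
        foreach (var c in unconsumed)
            _leftovers.Enqueue(c);
    }

    // Fill the buffer: leftovers first, then fresh input. Returns chars written.
    public int ReadInto(Span<char> buffer)
    {
        var written = 0;
        while (written < buffer.Length && _leftovers.Count > 0)
            buffer[written++] = _leftovers.Dequeue();
        if (written < buffer.Length)
        {
            var tmp = new char[buffer.Length - written];
            var read = _underlying.Read(tmp, 0, tmp.Length);
            tmp.AsSpan(0, read).CopyTo(buffer.Slice(written));
            written += read;
        }
        return written;
    }
}
```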

benjamin-hodgson pushed a commit that referenced this issue Jul 24, 2020
benjamin-hodgson pushed a commit that referenced this issue Apr 26, 2021
benjamin-hodgson pushed a commit that referenced this issue May 2, 2021
* drop support for everything below netstandard21

* add MaybeNullWhen and fix other nullability warns

* make Real into a property. Fixes #78

* change ITokenStream to use span

* replace PosCalculator stuff with IConfiguration

* configurable array pool

* remove unneeded refs

* fast, vectorised version of ComputeSourcePos in the happy path

* didn't need this after all

* add test for cached _lastSourcePos

* remove comment

* fix method name

* fix bug in ComputeSourcePos

* remove vectorised version, it was not working properly

* Change ExpectedCollector to a class, use an object pool, change TryParse's argument to an ICollection

* ObjectPool is marginally faster when you use the base class instead of the interface

* Revert "ObjectPool is marginally faster when you use the base class instead of the interface"

This reverts commit 323afbb.

Revert "Change ExpectedCollector to a class, use an object pool, change TryParse's argument to an ICollection"

This reverts commit 64d23fd.

* replace ExpectedCollector with PooledList. also implement IList and IDisposable, and add tests

* PooledList rent from ArrayPool lazily

* fix test

* change Tokens to an ImmutableArray

* Replace SourcePos with SourcePosDelta. This allows users to work with positions within partially-already-consumed files

* add ResumableTokenStream. fixes #48

* add test

* rename OnParserEnd

* drop support for netstandard2.1. 5.0 all the way baybayy

* Fix warnings

* publish token streams

* Publish custom parser APIs

* fix test

* package updates

* add test for RepeatString

* unpublish InternalError. for now we'll let parsers set error info but not read it. let's see how that goes

* unpublish Dispose

* xmldoc

* New version of SkipWhitespaces which performs much better on long sequences of space characters (quite common in source code)

* oops, was meant to pin this
3 participants