
Resumable Parsing #48

Closed

TehPers opened this issue Sep 9, 2019 · 12 comments

TehPers commented Sep 9, 2019

I would love to be able to resume parsing once I finish parsing something. For example, it would be nice to do something like this:

IEnumerable<Token> ParseStream(TextReader stream) {
    var result = this.Parser.Parse(stream);
    while (result.Success) {
        yield return result.Value;
        result = result.ParseAgain(); // This would probably be implemented differently
    }
}

This would allow us to parse from an input stream as needed, without needing to store all the output tokens in memory. For example, we could process a 2GB+ file and write the output into another file without using very much memory.

Sprache has an equivalent to this, but does not support parsing from a stream:

IEnumerable<Token> ParseStream(string input) {
    var result = this.Parser.TryParse(input);
    while (result.WasSuccessful) {
        yield return result.Value;
        result = this.Parser(result.Remainder);
    }
}

One possible approach is to keep a reference to the ParseState<T> inside the object that Parse returns.

benjamin-hodgson (Owner) commented Sep 9, 2019

Does this work?

IEnumerable<Token> ParseStream(TextReader stream)
{
    var result = this.Parser.Parse(stream);
    while (result.Success)
    {
        yield return result.Value;
        result = this.Parser.Parse(stream);
    }
}

After the call to parser.Parse the stream is left at the position the parser got up to. (nb, mixing imperative code with lazy enumerables is kind of risky.)
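That risk can be made concrete without Pidgin at all: iterator methods defer all work until enumeration, so pairing them with using blocks is easy to get wrong. A minimal self-contained sketch (my own illustration, not using the library):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LazyPitfall
{
    // An iterator like ParseStream above: nothing runs until the result is enumerated.
    static IEnumerable<string> ReadLines(TextReader reader)
    {
        string? line;
        while ((line = reader.ReadLine()) != null)
            yield return line;
    }

    static void Main()
    {
        IEnumerable<string> lines;
        using (var reader = new StringReader("a\nb"))
        {
            lines = ReadLines(reader); // no reading happens here
        }
        // The reader is already disposed by the time enumeration starts,
        // so the first MoveNext throws ObjectDisposedException.
        foreach (var line in lines)
            Console.WriteLine(line);
    }
}
```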

TehPers (Author) commented Sep 9, 2019

Yep, that works. Thank you!

TehPers closed this as completed Sep 9, 2019
TehPers (Author) commented Sep 9, 2019

Sorry for reopening this. I'm still new to this library and I'm trying to use it for a class assignment.

This outputs 6:

private static void Main(string[] args)
{
    var parser = Parser.Digit;
    var input = "121528";

    using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(input)))
    using (var reader = new StreamReader(stream))
    {
        parser.Parse(reader);
        Console.WriteLine(stream.Position);
    }
}

Shouldn't this output 1? I think this is where my issue was coming up.

Edit: for reference, I'm reading input from a file for the assignment.
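For what it's worth, the read-ahead can be reproduced without the parser in the loop at all; StreamReader itself pulls a whole chunk from the underlying stream on the first read (a sketch using the same input as above):

```csharp
using System;
using System.IO;
using System.Text;

// The read-ahead is StreamReader's, not the parser's: the reader fills
// its internal buffer from the underlying stream in one chunk.
using var stream = new MemoryStream(Encoding.UTF8.GetBytes("121528"));
using var reader = new StreamReader(stream);
reader.Read();                      // consume a single character, '1'
Console.WriteLine(stream.Position); // prints 6: the reader buffered the whole stream
```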

TehPers reopened this Sep 9, 2019
benjamin-hodgson (Owner) commented Sep 17, 2019

Ah, I think you're right. Because data is pulled from the stream in chunks, the parser potentially (usually) reads ahead of where it actually ends up. I guess the fix would be to rewind the stream after parsing (after checking CanSeek) to the position the parser actually consumed up to.

benjamin-hodgson (Owner) commented Sep 17, 2019

Would it be preferable to return the total count of tokens consumed (and possibly even the SourcePos?) as a property of the Result? Then you can rewind the stream yourself.

Edit: On second thoughts you can already pull the SourcePos using CurrentPos so that's probably not necessary.
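If the count were exposed, the caller-side rewind might look roughly like this (a sketch only: result.ConsumedCount is a hypothetical property, not part of the library, and the arithmetic only holds for a seekable stream with a one-byte-per-char encoding such as ASCII):

```csharp
// Hypothetical sketch; ConsumedCount does not exist in the library and
// stands in for "number of input tokens the parse actually consumed".
var startPosition = stream.Position;
var result = parser.Parse(new StreamReader(stream));
if (stream.CanSeek)
{
    // Valid only when one char == one byte (e.g. ASCII input).
    stream.Seek(startPosition + result.ConsumedCount, SeekOrigin.Begin);
}
```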

TehPers (Author) commented Sep 18, 2019

I think returning an IEnumerable<T> that can be used to continue reading tokens would be helpful. It could maintain a buffer of tokens that have been read from the input source but not processed and return those first. For example:

IEnumerable<TToken> GetUnconsumedTokens(TToken[] buffer, int startIndex) {
    for (var i = startIndex; i < buffer.Length; i++) {
        // You might want to remove the item at some point for memory purposes too; this is just an example
        yield return buffer[i];
    }

    // Read from the input source here; this depends on the input type, of course
}

This could be a method on ITokenStream<T> that is invoked and returned as part of the result, maybe? It could also be a method on ParseState<T> that reads input by calling ITokenStream<T>.ReadInto to load tokens one at a time (since they'd only get buffered again later anyway). I'm a little nervous about relying solely on the number of tokens consumed, because knowing how many tokens were consumed doesn't mean those tokens can be recovered. Input streams don't have to support rewinding, and re-enumerating an input IEnumerable<T> just to skip some number of tokens could be an expensive operation.

benjamin-hodgson (Owner) commented
Ah yeah I quite like that idea. "Here are the tokens which I pulled from the input but didn't actually eat", ie, the remainder of the ParseState's buffer.

beho commented Sep 23, 2019

> Would it be preferable to return the total count of tokens consumed (and possibly even the SourcePos?) as a property of the Result? Then you can rewind the stream yourself.
>
> Edit: On second thoughts you can already pull the SourcePos using CurrentPos so that's probably not necessary.

It might be more convenient to have SourcePos as part of the Result. Otherwise you have to append the CurrentPos parser to every parser, which sort of mixes multiple concerns together. Or is there any other way?
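The per-parser wiring would look something like this (a sketch: `parser` and `Token` stand in for the user's own definitions, and I'm assuming the CurrentPos parser and a Then overload with a result selector):

```csharp
// Hedged sketch: sequence the user's parser with CurrentPos so every
// result carries the position the parser reached. `parser` and `Token`
// are placeholders for the user's own definitions.
Parser<char, (Token Value, SourcePos Pos)> withPos =
    parser.Then(Parser<char>.CurrentPos, (value, pos) => (value, pos));
```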

TehPers (Author) commented Sep 25, 2019

I've been messing with this a bit, and I agree that SourcePos should be part of the Result, although not for rewinding. It would be helpful to be able to pass it back into Parse later to tell the parser where parsing began. For example, say you finish parsing at line 2, column 18: you probably want to tell the parser to continue from there, so that subsequent SourcePos values are accurate and you get accurate error messages (if you rely on SourcePos for those).

Something that comes to mind is that the SourcePos could be tracked in ParseState and updated as parsing occurs, although bookmarks would then also need to keep track of the SourcePos. This is just an idea; you could instead call ComputeSourcePos() at the end, but I think that relies on all of the parsed tokens still being in the buffer? Nevermind, the parser keeps track of the SourcePos where the buffer starts, so it wouldn't really matter that much.

benjamin-hodgson added this to the v3 milestone Jun 30, 2020
benjamin-hodgson (Owner) commented

I've been kicking around the idea of replacing SourcePos with a similar but slightly different concept of a SourcePosDelta, representing the change in source location since the beginning of parsing.

When you come to (e.g.) render an error message, you can concretise the SourcePosDelta by adding it to a specific SourcePos, which would be (1, 1) if you started parsing at the beginning of the file, or (1, 1) + previousDelta if you'd already consumed previousDelta-worth of input.
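A rough sketch of the arithmetic (illustrative stand-in types, not the library's actual definitions): a delta that stays on the same line advances the column, while a delta that spans lines resets it.

```csharp
// Illustrative stand-ins for the proposed types, not Pidgin's definitions.
readonly record struct SourcePos(int Line, int Col);
readonly record struct SourcePosDelta(int Lines, int Cols);

static class PosMath
{
    // Concretise a delta against a known starting position (1-based columns).
    public static SourcePos Add(SourcePos pos, SourcePosDelta delta)
        => delta.Lines == 0
            ? new SourcePos(pos.Line, pos.Col + delta.Cols)          // same line: advance the column
            : new SourcePos(pos.Line + delta.Lines, delta.Cols + 1); // new line: the column restarts
}
```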

Thoughts?

benjamin-hodgson (Owner) commented

Code here: 4141044

benjamin-hodgson pushed a commit that referenced this issue Jul 24, 2020
benjamin-hodgson (Owner) commented Jul 24, 2020

Regarding leftover tokens, I'm thinking it makes sense to make it part of the protocol between the parser and the TokenStream. I've added an OnParserEnd(ReadOnlySpan<TToken>) method to ITokenStream, to which the parser passes any unconsumed tokens. Code here (on the v3 branch): 851dc1c
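As a sketch of what that protocol enables (my own illustration; the real ITokenStream on the v3 branch may differ), here is a stream that stashes the unconsumed tail and serves it back before fresh input:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative sketch, not Pidgin's actual ITokenStream: a char stream
// that stashes tokens the parser pulled but didn't consume, and serves
// them back at the start of the next parse.
class ResumableCharStreamSketch
{
    private readonly TextReader _underlying;
    private readonly Queue<char> _leftovers = new Queue<char>();

    public ResumableCharStreamSketch(TextReader underlying) => _underlying = underlying;

    // Called by the parser when it finishes, with the unconsumed tail of its buffer.
    public void OnParserEnd(ReadOnlySpan<char> unconsumed)
    {
        foreach (var c in unconsumed)
            _leftovers.Enqueue(c);
    }

    // Fill the buffer: leftovers first, then fresh input. Returns chars written.
    public int ReadInto(Span<char> buffer)
    {
        var written = 0;
        while (written < buffer.Length && _leftovers.Count > 0)
            buffer[written++] = _leftovers.Dequeue();
        if (written < buffer.Length)
        {
            var tmp = new char[buffer.Length - written];
            var read = _underlying.Read(tmp, 0, tmp.Length);
            tmp.AsSpan(0, read).CopyTo(buffer.Slice(written));
            written += read;
        }
        return written;
    }
}
```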

benjamin-hodgson pushed a commit that referenced this issue Jul 24, 2020
benjamin-hodgson pushed a commit that referenced this issue Apr 26, 2021
benjamin-hodgson pushed a commit that referenced this issue May 2, 2021
* drop support for everything below netstandard21

* add MaybeNullWhen and fix other nullability warns

* make Real into a property. Fixes #78

* change ITokenStream to use span

* replace PosCalculator stuff with IConfiguration

* configurable array pool

* remove unneeded refs

* fast, vectorised version of ComputeSourcePos in the happy path

* didn't need this after all

* add test for cached _lastSourcePos

* remove comment

* fix method name

* fix bug in ComputeSourcePos

* remove vectorised version, it was not working properly

* Change ExpectedCollector to a class, use an object pool, change TryParse's argument to an ICollection

* ObjectPool is marginally faster when you use the base class instead of the interface

* Revert "ObjectPool is marginally faster when you use the base class instead of the interface"

This reverts commit 323afbb.

Revert "Change ExpectedCollector to a class, use an object pool, change TryParse's argument to an ICollection"

This reverts commit 64d23fd.

* replace ExpectedCollector with PooledList. also implement IList and IDisposable, and add tests

* PooledList rent from ArrayPool lazily

* fix test

* change Tokens to an ImmutableArray

* Replace SourcePos with SourcePosDelta. This allows users to work with positions within partially-already-consumed files

* add ResumableTokenStream. fixes #48

* add test

* rename OnParserEnd

* drop support for netstandard2.1. 5.0 all the way baybayy

* Fix warnings

* publish token streams

* Publish custom parser APIs

* fix test

* package updates

* add test for RepeatString

* unpublish InternalError. for now we'll let parsers set error info but not read it. let's see how that goes

* unpublish Dispose

* xmldoc

* New version of SkipWhitespaces which performs much better on long sequences of space characters (quite common in source code)

* oops, was meant to pin this
3 participants