Add incremental parsing support #2527

Open
wants to merge 3 commits into
base: master

@dberlin dberlin commented Apr 7, 2019

This commit adds incremental parsing support to ANTLR4.
I have only updated the Java target and the out-of-tree TypeScript target (see tunnelvisionlabs/antlr4ts#414), but it should be very easy for someone who understands the relevant language to update each of the other targets. The changes are deliberately minimal.

The Java version here is actually a backport of the TypeScript version, and took O(2 hours).
(As an aside, I have not written Java in a few years, so I fully expect there are things that could be done better.) The comments were originally written for the TypeScript version; I will go through and clean them up.

A detailed description of how it works is here (which also lists the outstanding issues), but it's a very straightforward implementation: it detects which rules could have been affected by token changes. Rule contexts that can't have been affected by a set of token changes are reused, and those rules are not re-run. To account for possibly infinite lookahead/lookbehind, we keep track of how far ahead/behind the parser looked last time on each rule, and use that range as the bounds within which to detect changes.
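To make the reuse test concrete, here is a minimal sketch of the idea, not the code from this PR. The class name `ReuseCheckSketch`, the method `canReuse`, and its parameters are hypothetical; it assumes changed tokens are tracked as a sorted set of old token indices and that each rule context remembers the lowest and highest token index the parser examined while matching it.

```java
import java.util.TreeSet;

/** Hypothetical sketch: decide whether a previously built rule context can be reused. */
final class ReuseCheckSketch {
    /**
     * @param changedTokens old-stream indices of tokens that were added, removed, or modified
     * @param minLooked     lowest token index examined while matching the rule last time
     * @param maxLooked     highest token index examined last time (including lookahead)
     * @return true if no changed token falls inside [minLooked, maxLooked]
     */
    static boolean canReuse(TreeSet<Integer> changedTokens, int minLooked, int maxLooked) {
        // ceiling() returns the smallest changed index >= minLooked, or null if there is none.
        Integer firstChange = changedTokens.ceiling(minLooked);
        return firstChange == null || firstChange > maxLooked;
    }

    public static void main(String[] args) {
        TreeSet<Integer> changed = new TreeSet<>();
        changed.add(42);
        System.out.println(canReuse(changed, 0, 10));   // no change in range, context is reusable
        System.out.println(canReuse(changed, 40, 45));  // token 42 changed, rule must be re-run
    }
}
```

The lookup is O(log n) in the number of changed tokens, which is why a sorted structure is a natural fit here.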

The tests currently test on a simple grammar and the JavaLR grammar (which exercises the left recursion removal support).

The only class I've added that requires anything even mildly interesting of the runtime is the IncrementalParserData class.
Most of the work there is related to changing the start/end tokens of rule contexts to realign them with the token stream changes. If you only care about the text of the parse tree, and not the position info etc., this is obviously unnecessary. I have not made this an option.

To track changed tokens and stream adjustments, the Java version of IncrementalParserData uses TreeMap/TreeSet. The TypeScript version uses arrays of ranges and binary search (see https://github.com/dberlin/antlr4ts/blob/incremental/src/IncrementalParserData.ts).
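As a rough illustration of why a TreeMap suits the realignment work, here is a sketch under stated assumptions rather than the actual IncrementalParserData code. The class `TokenOffsetSketch` and its methods are hypothetical; it assumes each edit is recorded as "from this old token index onward, indices shift by delta", with edits recorded in ascending index order.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: translate old token indices to new ones after edits. */
final class TokenOffsetSketch {
    // Key: old token index where a shift begins.
    // Value: cumulative shift applied to every old index at or after that key.
    private final TreeMap<Integer, Integer> cumulativeShift = new TreeMap<>();

    /**
     * Records that tokens at or after oldIndex moved by delta positions
     * (positive for insertions, negative for deletions).
     * Assumption: calls are made in ascending oldIndex order.
     */
    void recordShift(int oldIndex, int delta) {
        Map.Entry<Integer, Integer> prev = cumulativeShift.floorEntry(oldIndex);
        int base = (prev == null) ? 0 : prev.getValue();
        cumulativeShift.put(oldIndex, base + delta);
    }

    /** Returns where a token that sat at oldIndex now sits in the new stream. */
    int toNewIndex(int oldIndex) {
        // floorEntry() finds the nearest recorded shift at or before oldIndex.
        Map.Entry<Integer, Integer> e = cumulativeShift.floorEntry(oldIndex);
        return oldIndex + (e == null ? 0 : e.getValue());
    }
}
```

The same lookup could be done with a sorted array of ranges and a binary search, which is the approach the TypeScript version is described as taking.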

I am happy to encapsulate this into a data structure in the runtime if anyone thinks it is worth it.

As for why do this at all: Yes, ANTLR is actually pretty fast.
My use case is a bit weird - large GCode files, which are often 20+ megabytes. As such, a single parse takes 6-10 seconds (for a 20 meg file).
Users often make small edits to various pieces.
(It's part of a vscode extension).

Lexing GCode is also completely trivial to do in a contextless fashion.
The incremental parser brings the reparse time down to <50ms.

I may get around to adding incremental lexing. As I'm sure Terence knows, this is "trickier".

I have the beginnings of support (elsewhere) based on some papers, but it is incestuous (the parser tells the lexer what tokens could be valid at a given change point and the lexer tries those rules). There are ways that don't do this, but some require being able to store/rewind/replay the transition state at each token, etc.

public abstract class IncrementalParser extends Parser implements ParseTreeListener {
    // Current parser epoch. Incremented every time a new incremental parser is
    // created.
    private static int _PARSER_EPOCH = 0;

This should be an AtomicInteger to avoid race conditions when multiple threads instantiate this class.
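For reference, a minimal sketch of what that suggestion could look like. The class `EpochCounterSketch` and its members are hypothetical stand-ins, not the PR's IncrementalParser, and how the real class uses its epoch is not shown here.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a class that stamps each instance with a global epoch. */
class EpochCounterSketch {
    // Shared across all instances; AtomicInteger makes the increment a single
    // atomic operation, so concurrent constructors never observe the same value.
    private static final AtomicInteger PARSER_EPOCH = new AtomicInteger(0);

    // Epoch assigned to this particular instance at construction time.
    private final int epoch;

    EpochCounterSketch() {
        this.epoch = PARSER_EPOCH.incrementAndGet();
    }

    int getEpoch() {
        return epoch;
    }
}
```

With a plain `static int`, `_PARSER_EPOCH++` is a read-modify-write sequence, so two threads constructing parsers at the same time could end up with the same epoch.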
