
Implement dynamic disambiguation of left-recursive rules during closure #1398

Closed
sharwell opened this issue Nov 23, 2016 · 1 comment


sharwell commented Nov 23, 2016

Background

Since the early days of ANTLR 4, we've known that long lookahead inside of left-recursive (LR) rules poses "interesting challenges" for performance. The earliest approaches to this problem attempted to inline the evaluation of special "precedence predicates" during prediction, rather than treating them as standard predicates that are only evaluated after an SLL conflict is reached. Unfortunately, after attempting to implement this I realized a major flaw in the design: LR rules are fundamentally ambiguous, and rely on ANTLR's "first alternative wins" decision-making strategy to choose the expected alternative.

Eventually we realized that much of the performance problem associated with LR rules could be solved by disambiguating the ATN prior to making a decision. This is accomplished by eliminating known ambiguous configurations before running the adaptivePredict algorithm.

⚠️ One extremely valuable mechanism in ANTLR 4 is the ability to detect and report ambiguous parse trees. It's critically important that the disambiguation algorithm only eliminate alternatives which are ambiguous as a direct result of writing an LR rule in the grammar.

The initial implementation appeared in #401, with notable bugs identified in #509 and #679 (now fixed).

Remaining problem

While the above work dramatically improved performance in real-world scenarios where LR rules were used, it failed to meet the performance expectations of users working with grammars and inputs that require "looking through" an LR rule to make a prediction. As a simple example, consider the following grammar fragment for a made-up language:

statement
  : 'if' expression
  | 'if' expression 'else' expression
  ;

expression
  : ID
  | expression ('*' | '/') expression
  | expression ('+' | '-') expression
  ;

In this example, the expression following an if keyword in the language must be fully consumed by the adaptivePredict algorithm only to determine whether or not it is followed by an else keyword. I refer to this situation as "looking through expression to make a decision in statement". The performance impact on grammars with this characteristic is tremendous: the adaptivePredict algorithm will explore all possible paths through expression during the prediction! This is clearly not what the user intended when expression was written as an LR rule.
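To get a feel for the scale of the problem, here is a back-of-the-envelope sketch (not ANTLR's actual search): an ambiguous binary-operator expression over n operands has as many distinct parse trees as the (n-1)th Catalan number, which is roughly the space of groupings available for adaptivePredict to explore when looking through expression.

```python
from math import comb

def parse_tree_count(operands: int) -> int:
    """Number of distinct binary parse trees over `operands` operands
    joined by binary operators: the (operands - 1)th Catalan number."""
    n = operands - 1
    return comb(2 * n, n) // (n + 1)

# Growth is exponential: 10 operands already admit 4862 distinct trees.
assert parse_tree_count(3) == 2
assert parse_tree_count(10) == 4862
```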

For reference, here is the ATN for statement:

And here is the ATN for expression:

Several issues have been filed which appear to be manifestations of this issue in some form:

Dynamic disambiguation

Early work on generalizing the applyPrecedenceFilter algorithm to "look through" situations focused on ways to apply the precedence filter to decisions traversed deep inside the adaptivePredict algorithm. The implementation plan looked something like this:

  1. Augment the PredictionContext class to keep track of precedence levels when a precedence rule is invoked
  2. During closure operations, identify precedence decisions using the same algorithm we already use to create precedence DFAs
  3. When a precedence decision is reached, suspend the closure operation and perform a closure on just the specific configuration which reached a precedence decision
  4. Using the information provided by the prediction context and limited closure from the precedence decision, run the applyPrecedenceFilter algorithm
  5. Add the resulting filtered set of configurations back to the result of the primary closure operation being performed

However, the subtleties of this approach kept me from ever getting it implemented (even as a prototype).

The real breakthrough came today when I took a much closer look at the syntactic ambiguity produced by LR rules. Remember that aside from computing the DFA start state, adaptivePredict asks only one thing: which alternative provides some viable path to consume more input symbols than any other alternative? For a true ambiguity, some input admits more than one such alternative, whether we are in SLL or LL prediction mode. More importantly, adaptivePredict doesn't care at all how that longest sequence is reached. For example, consider the following input with the grammar listed above:

if a * b + c else d

Clearly the intended final parse tree for this expression would look like the following:

(statement 'if' (expression (expression (expression 'a') '*' (expression 'b')) '+' (expression 'c')) 'else' (expression 'd'))

However, when adaptivePredict is attempting to choose the first or second alternative of the statement rule, the precedence within the expression is totally irrelevant. A parse looking like the following would lead to exactly the same final result as the desired expression:

(statement 'if' (expression (expression 'a') '*' (expression (expression 'b') '+' (expression 'c'))) 'else' (expression 'd'))

This property can be generalized by the following:

During a closure operation inside adaptivePredict, the elimination of any configuration which is ambiguous with another configuration that predicts the same alternative will not affect the final decision made by adaptivePredict.

Leveraging this property, we can redefine the implementation plan to eliminate the complementary set of configurations to those eliminated by applyPrecedenceFilter. This turns out to be a wonderful idea because it is much simpler from an implementation standpoint. The StarLoopEntryState that forms a precedence decision has two outgoing edges:

  1. The first edge stays inside the LR rule, and matches some terminal which follows a left-recursive invocation of the LR rule.
  2. The second edge leads to the stop state of the LR rule, and steps out into the enclosing context.

If any return edge from the rule stop state leads to the right edge of a left-recursive invocation of the rule, then anything reachable by following the first edge is also reachable by following the second edge. Knowing this, we can implement dynamic disambiguation by intentionally omitting the target of the first edge from the closure operation. With the edge eliminated, one specific class of ambiguities, namely those created by the use of LR rules in a grammar, is reduced to an unambiguous single alternative during adaptivePredict. The fact that the path taken through the grammar during adaptivePredict doesn't match the path taken through the grammar to produce the final parse tree does not change the overall outcome.
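The mechanism might be sketched roughly like this. The State class, flag, and state names are hypothetical simplifications, not the ANTLR runtime API, and the sketch assumes the safety condition above (every return edge leads back through a left-recursive invocation) has already been verified.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    name: str
    edges: list = field(default_factory=list)   # epsilon-edge targets
    is_precedence_loop_entry: bool = False

def closure(state, result, drop_loop_entry=True):
    """Collect epsilon-reachable state names; at a precedence-decision
    loop entry, optionally skip edges[0], the loop-entry branch."""
    if state.name in result:
        return
    result.add(state.name)
    edges = state.edges
    if drop_loop_entry and state.is_precedence_loop_entry:
        edges = edges[1:]           # keep only the loop-bypass path
    for target in edges:
        closure(target, result, drop_loop_entry)

stop = State("ruleStop")
loop_body = State("loopBody")
entry = State("loopEntry", edges=[loop_body, stop],
              is_precedence_loop_entry=True)

reached = set()
closure(entry, reached)             # optimized: loop body never entered
full = set()
closure(entry, full, drop_loop_entry=False)
assert "loopBody" not in reached and "ruleStop" in reached
assert "loopBody" in full
```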

sharwell added a commit to sharwell/antlr4 that referenced this issue Nov 23, 2016

parrt commented Nov 23, 2016

Adding my own interpretation after Sam and I simplified it in our minds. Consider this slightly simpler grammar:

grammar T;
stat: expr ';'
    | expr '.'
    ;

expr: expr '*' expr
    | expr '+' expr
    | ID
    ;

The (annotated) ATNs are:

where expr is rewritten to be the following rule:

expr[int prec]
    :   ID
        ( {3>=prec}? '*' expr[4]
        | {2>=prec}? '+' expr[3]
        )*
    ;
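The behavior of the rewritten rule can be mirrored by a small hand-written precedence-climbing parser. This is an illustrative sketch (token list in, nested tuples out), not generated ANTLR code; the {n >= prec}? predicates become ordinary comparisons guarding the loop.

```python
def parse_expr(tokens, pos=0, prec=0):
    """Mirrors expr[int prec]: match ID, then loop while the next
    operator's precedence predicate ({n >= prec}?) succeeds."""
    left = tokens[pos]               # the "primary" part: ID
    pos += 1
    while pos < len(tokens):
        op = tokens[pos]
        if op == '*' and 3 >= prec:      # {3 >= prec}? '*' expr[4]
            right, pos = parse_expr(tokens, pos + 1, 4)
        elif op == '+' and 2 >= prec:    # {2 >= prec}? '+' expr[3]
            right, pos = parse_expr(tokens, pos + 1, 3)
        else:
            break
        left = (left, op, right)
    return left, pos

tree, _ = parse_expr(['a', '+', 'b', '*', 'c'])
assert tree == ('a', '+', ('b', '*', 'c'))   # '*' binds tighter than '+'
```

Note how the recursive calls pass a higher precedence than the predicate requires (expr[4] after '*', expr[3] after '+'), which is what makes both operators left-associative.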

The closure of a StarLoopEntryState (state 23) in a rewritten LR rule includes a config that pursues the loop entry (orange) path and a config that pursues the (green) loop bypass (rule exit) path. The stack context is the same for both configs added for the entry and exit paths.

The optimization is to avoid adding the loop entry config when the exit path can only lead back to the same StarLoopEntryState after popping context at the rule end state (traversing only epsilon edges, so we're still in closure, in this same rule).

In the ATN above, if the stack(s) for the current config are only the (implicit) green paths from rule end state 3 to block end state 22 (i.e., no red paths) then we can drop the config for the orange loop entry path.

@sharwell just pointed out there are more nodes we have to detect. It looks like we need to detect any state that can reach the loop entry on epsilon without exiting the rule. We don't have to look at FOLLOW links, just ensure that all stack tops for the config refer to key states in the LR rule.

We have to check for the right edges of prefix operator alternatives and the right edges of ternary operators, but not the middle expr reference.

Another version of the prefix op is the '(' type ')' expr case.

What should the closure function do differently?

To verify we are in the right situation, we must first check that closure is at a StarLoopEntryState generated during LR removal. Then we check that each stack top of the context is a return state from one of these cases:

  1. 'not' expr. The return state points at the loop entry state.
  2. expr op expr. The return state is the block end of the internal block of (...)*.
  3. 'between' expr 'and' expr. The return state of the 2nd expr reference; that state points at the block end of the internal block of (...)*.
  4. '(' type ')' expr. The return state points at the block end, which points at the loop entry state.

If one of these cases holds for every stack top, then closure does not add a config to the current config set for edge[0], the loop entry branch.

Conditions fail if any context for the current config is:

  • empty (we'd fall out of expr to do a global FOLLOW, which could even be to some weird spot in expr), or
  • lies outside of expr, or
  • lies within expr but at a state other than the BlockEndState generated during LR removal
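Putting the stack-top test together as a sketch: the state categories are modeled as plain strings here for illustration, whereas the real check would inspect ATN state objects (loop entry states, the BlockEndState generated during LR removal).

```python
# Hypothetical state categories covering the four cases above: cases 1
# and 4 ultimately point at the loop entry state; cases 2 and 3 are the
# block end of the internal (...)* block.
LOOP_ENTRY = "loopEntry"
BLOCK_END = "blockEnd"

def can_drop_loop_entry_edge(stack_tops):
    """True only if every stack top is a return state that leads back to
    this rule's loop entry decision (no empty or foreign contexts)."""
    if not stack_tops:
        return False    # empty context: a global FOLLOW is possible
    return all(top in (LOOP_ENTRY, BLOCK_END) for top in stack_tops)

assert can_drop_loop_entry_edge([BLOCK_END, LOOP_ENTRY])
assert not can_drop_loop_entry_edge([])               # empty context
assert not can_drop_loop_entry_edge(["outsideExpr"])  # foreign state
```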

Do we need to evaluate predicates ever in closure for this case?

No. Predicates, including precedence predicates, are only evaluated when computing a DFA start state, i.e., only before the lookahead (but not the parser) consumes a token.

There are no epsilon edges allowed in LR rule alt blocks or in the "primary" part (ID here). If closure is in StarLoopEntryState 23, any lookahead operation will have consumed a token, as there are no epsilon paths that lead to 23. Therefore we do not have to evaluate predicates if we are in the generated StarLoopEntryState of an LR rule. Note that when making a prediction starting at that decision point, decision d=2, compute-start-state performs closure starting at edges[0] and edges[1] emanating from state 23. That means it is not performing closure on 23 itself during compute-start-state.

How do we know this always gives same prediction answer?

Without predicates, the loop entry and exit paths are ambiguous upon remaining input +b (in, say, a+b). Either path leads to a valid parse. Closure can lead to consuming + immediately, or to falling out of this call to expr back into expr and looping back to 23 to match +b. In this special case, we choose the more efficient path, which is to take the bypass path.

The lookahead language has not changed because closure chooses one path over the other. Both paths lead to consuming the same remaining input during a lookahead operation. If the next token is an operator, lookahead will enter the choice block with operators. If it is not, lookahead will exit expr. Same as if closure had chosen to enter the choice block immediately.

Closure is examining one config (some loopentrystate, some alt, context) which means it is considering exactly one alt. Closure always copies the same alt to any derived configs.

How do we know this optimization doesn't mess up precedence in our parse trees?

Looking through expr from the left edge of stat only has to confirm that an input, say a+b+c;, begins with some valid interpretation of an expression. The precedence actually doesn't matter when making a decision in stat by seeing through expr. It is only when parsing rule expr that we must use the precedence to get the right interpretation and, hence, parse tree.
