Poor performance of expression parsing #192

agacek · 2013-03-21T13:10:59Z

Using Antlr 4.0 I've run into a performance problem with parsing a simple expression language. Here is the grammar:

grammar Expr;

program: expr;

expr: ID
    | 'not' expr
    | expr 'and' expr
    | expr 'or' expr
    ;

ID: [a-zA-Z_][a-zA-Z_0-9]*;
WS: [ \t\n\r\f]+ -> skip;
ERROR: .;

And here is the sample file I'm trying to parse:

not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
    X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and     X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and     X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and     X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and     X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and     X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and     X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and     X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and     X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and     X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and     X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and     X12

And just in case, here is how I am calling everything:

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class Main {
  public static void main(String args[]) throws Exception {
    CharStream stream = new ANTLRFileStream("test.expr");
    ExprLexer lexer = new ExprLexer(stream);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    ExprParser parser = new ExprParser(tokens);

    System.out.println("Start parsing");
    long start = System.currentTimeMillis();

    parser.program();

    long stop = System.currentTimeMillis();
    System.out.println("End parsing: " + (stop - start)/1000.0 + "s");
  }
}

On my machine it takes just over 14 minutes to parse the given expression.

The text was updated successfully, but these errors were encountered:

sharwell · 2013-03-21T13:33:40Z

Nearly all of the time spent parsing the expression is due to the full context prediction portion of ANTLR 4. In the current reference release, this code does not use a dynamic DFA to improve parsing performance. However, the 4.0-opt release uses a highly optimized implementation of this code allowing the expression above to parse in a few milliseconds.

The suggestions below assume you are using the reference release:

You can save a great deal of time on correct inputs by using a two-stage parsing strategy.

Attempt to parse the input using BailErrorStrategy and PredictionMode.SLL. If no exception is thrown, you know the answer is correct.
If a ParseCancellationException is thrown, retry the parse using the default settings (DefaultErrorStrategy and PredictionMode.LL).

Unlike using PredictionMode.SLL alone, the two-stage parsing strategy is guaranteed to return the same result as if you chose to use the full-strength PredictionMode.LL, but tends to do so in time closer to the SLL option.

agacek · 2013-03-21T14:23:57Z

Thanks for the quick followup Sam. The two-stage parsing strategy seems to work fine for me. Is the 4.0-opt release going to be part of the next public Antlr release? And if so, would you recommend going back to a 1-stage parse or not?

sharwell · 2013-03-21T14:29:53Z

I always use the two-stage strategy. In the optimized builds speed isn't so much a problem but it still tends to reduce the overall memory requirements. There are no plans right now for the optimized build to replace the reference version (or ship alongside it), but bits and pieces of it are being incorporated into the reference version over time.

parrt · 2013-03-21T16:38:05Z

that is described in the book as well.
Ter

On Thu, Mar 21, 2013 at 6:33 AM, Sam Harwell notifications@github.comwrote:

Nearly all of the time spent parsing the expression is due to the full
context prediction portion of ANTLR 4. In the current reference release,
this code does not use a dynamic DFA to improve parsing performance.
However, the 4.0-opt release uses a highly optimized implementation of this
code allowing the expression above to parse in a few milliseconds.

The suggestions below assume you are using the official release:

You can save a great deal of time on correct inputs by using a
two-stage parsing strategy.

Attempt to parse the input using BailErrorStrategy and
PredictionMode.SLL. If no exception is thrown, you know the answer is
correct.

If a ParseCancellationException is thrown, retry the parse using the
default settings (DefaultErrorStrategy and PredictionMode.LL).

Unlike using PredictionMode.SLL alone, the two-stage parsing strategy is
guaranteed to return the same result as if you chose to use the
full-strength PredictionMode.LL, but tends to do so in time closer to the
SLL option.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/192#issuecomment-15238595
.

Dictation in use. Please excuse homophones, malapropisms, and nonsense.

ryanlin86 · 2013-12-18T04:38:53Z

I'm having the same problem, I'm using C# instead.
I tried the above options, but didn't work for me, still slow, is there any C# performance-opt version available?

sharwell · 2013-12-18T05:02:45Z

@ryanlin86 The C# target is based on the optimized fork of the project. Hopefully as part of the 4.2 release I can write an article describing the steps available to track down performance problems along with some of the solutions that are available.

parrt · 2013-12-18T17:33:34Z

So you are confirming that two-stage is still very slow for that grammar and input? Have you tried the Java target? it's likely the same but that's the only target I can try this on.

ryanlin86 · 2013-12-20T01:04:39Z

I haven’t tried java target( which I’m not familiar with java)

Here is my grammar file:

grammar Grammar;

@Header {
using System;
using System.Text;
using System.Globalization;
using System.Collections.Generic;
using SBE.AppFrame.Core.Expressions.AST;
}

@parser::members
{
protected const int EOF = Eof;
private static NumberFormatInfo numberFormatInfo = new NumberFormatInfo{NumberDecimalSeparator ="."};
}

@lexer::members
{
protected const int EOF = Eof;
protected const int HIDDEN = Hidden;
}

/*

Parser Rules
*/

compileUnit returns [LogicalExpression value]
: statement EOF {$value=$statement.value;}
| expression EOF {$value=$expression.value;}
| statementList EOF{$value=$statementList.value;}
;

statementList returns[LogicalExpression value]
@init{
List list=new List();
}
: statement{list.Add($statement.value);} (next=statement{list.Add($next.value);})+{$value=new StatementSequence(list);}
;

statement returns [LogicalExpression value]
: block {$value=$block.value;}
| 'if' parExpression ifTrue=statement {$value=new IfExpression($parExpression.value,$ifTrue.value);} ('else' ifFalse=statement {$value=new IfExpression($parExpression.value,$ifTrue.value,$ifFalse.value);})?
| 'for' '(' forInit? ';' forCondition=expression? ';' forUpdate? ')' body=statement {$value=new ForExpression($forInit.value,$forCondition.value,$forUpdate.value,$body.value);}
| 'foreach' '(' 'var' ID 'in' expression ')' body=statement {$value=new ForeachExpression($ID.text, $expression.value, $body.value);}
| 'while' parExpression body=statement {$value=new WhileExpression($parExpression.value,$body.value);}
| 'return' expression? ';' {$value=new ReturnExpression();}
| 'break' ';' {$value=new BreakExpression();}
| 'continue' ';' {$value=new ContinueExpression();}
| statementExpression ';'{$value=$statementExpression.value;}
| ';' {$value=new EmptyStatementExpression();}
;

block returns [LogicalExpression value]
@init{
List list=new List();
}
: '{' (blockStatement{list.Add($blockStatement.value);})* '}' {$value=new BlockExpression(list);}
;

blockStatement returns [LogicalExpression value]
: statement {$value=$statement.value;}
;

forInit returns [LogicalExpression[] value]
: expressionList {$value=$expressionList.value;}
;

parExpression returns [LogicalExpression value]
: '(' expression ')' {$value=$expression.value;}
;

forUpdate returns [LogicalExpression[] value]
: expressionList {$value=$expressionList.value;}
;

statementExpression returns [LogicalExpression value]
: expression {$value=$expression.value;}
;

expressionList returns [LogicalExpression[] value]
@init {
List expressions = new List();
}
: first=expression{expressions.Add($first.value);} (',' follow=expression{expressions.Add($follow.value);})*{$value=expressions.ToArray();}
;

expression returns [LogicalExpression value]
: primary {$value=$primary.value;}
| LEFT=ID ':' member=ID {$value=new MemberExpression(new Identifier($LEFT.text),$member.text);}
| left=expression '.' ID '()'{$value=new MethodCallExpression($left.value,$ID.text);}
| left=expression '.' ID '(' parameters=expressionList? ')'{$value=new MethodCallExpression($left.value,$ID.text,$parameters.value);}
| left=expression '.' ID {$value=new MemberExpression($left.value,$ID.text);}
| ID '(' parameters=expressionList? ')' {$value=new MethodCallExpression(null,$ID.text,$parameters.value);}
| left=expression type=('++' | '--') {$value=new UnaryExpression($type.text=="++"?UnaryExpressionType.RightDoublePlus:UnaryExpressionType.RightDoubleMinus,$left.value);}
| ('+'|'-'|'++'|'--') right=expression {$value=new UnaryExpression($type.text=="++"?UnaryExpressionType.LeftDoublePlus:UnaryExpressionType.LeftDoubleMinus,$right.value);}
| ('~'|'!') right=expression {$value=new UnaryExpression(UnaryExpressionType.Not,$right.value);}
| left=expression type=(''|'/'|'%') right=expression {$value=new BinaryExpression(InternalHelper.GetBinaryExpressionType($type.text),$left.value,$right.value);}
| left=expression type=('+'|'-') right=expression {$value=new BinaryExpression(InternalHelper.GetBinaryExpressionType($type.text),$left.value,$right.value);}
| left=expression type=('<=' | '>=' | '>' | '<') right=expression {$value=new BinaryExpression(InternalHelper.GetBinaryExpressionType($type.text),$left.value,$right.value);}
| left=expression type=('==' | '!=') right=expression {$value=new BinaryExpression(InternalHelper.GetBinaryExpressionType($type.text),$left.value,$right.value);}
| left=expression '&' right=expression {$value=new BinaryExpression(BinaryExpressionType.And,$left.value,$right.value);}
| left=expression '^' right=expression {$value=new BinaryExpression(BinaryExpressionType.And,$left.value,$right.value);}
| left=expression '|' right=expression {$value=new BinaryExpression(BinaryExpressionType.And,$left.value,$right.value);}
| left=expression ('&&'|'and') right=expression {$value=new BinaryExpression(BinaryExpressionType.And,$left.value,$right.value);}
| left=expression ('||'|'or') right=expression {$value=new BinaryExpression(BinaryExpressionType.Or,$left.value,$right.value);}
| left=expression '?' middle=expression ':' right=expression{$value=new TernaryExpression(TernaryExpressionType.Condition,$left.value,$middle.value,$right.value);}
| left=expression '=' right=expression {$value=new BinaryExpression(BinaryExpressionType.Assign,$left.value,$right.value);}
| left=expression '+=' right=expression {$value=new BinaryExpression(BinaryExpressionType.PlusAssign,$left.value,$right.value);}
| left=expression '-=' right=expression {$value=new BinaryExpression(BinaryExpressionType.MinusAssign,$left.value,$right.value);}
| left=expression '=' right=expression {$value=new BinaryExpression(BinaryExpressionType.MultiplyAssign,$left.value,$right.value);}
| left=expression '/=' right=expression {$value=new BinaryExpression(BinaryExpressionType.DivideAssign,$left.value,$right.value);}
| left=expression '%=' right=expression {$value=new BinaryExpression(BinaryExpressionType.ModuloAssign,$left.value,$right.value);}
| 'var' ID ('=' assignValue=expression){$value=new VariableDeclareExpression(null,$ID.text,$assignValue.value);}
| 'new '+ID'()'{$value=new NewExpression($ID.text);}
| left=expression '..' right=expression{$value=new RangeExpression($left.value,$right.value);}
| left=expression 'between' range {$value=new BinaryExpression(BinaryExpressionType.Between,$left.value,$range.value);}
| left=expression 'in' array {$value=new BinaryExpression(BinaryExpressionType.In,$left.value,$array.value);}
| ID'=>'body=expression {$value=new LambdaExpression($ID.text,$body.value);}
;

range returns [LogicalExpression value]
: left=expression '..' right=expression{$value=new RangeExpression($left.value,$right.value);}
;

array returns [LogicalExpression value]
: '['expressionList']'{$value=new ArrayExpression($expressionList.value);}
;
primary returns [LogicalExpression value]
: '(' expression ')' {$value=new UnaryExpression(UnaryExpressionType.Quote,$expression.value);}
| literal {$value=$literal.value;}
| '$this' {$value=new Identifier("$this");}
| 'this' {$value=new Identifier("this");}
| qualified_identifier{$value=$qualified_identifier.value;}
| array{$value=$array.value;}
;

qualified_identifier returns [LogicalExpression value]
: 'Contexts!'?ID{$value=new Identifier($qualified_identifier.text);}
;

literal returns [ValueExpression value]
: INTEGER { try { $value = new ValueExpression(int.Parse($INTEGER.text)); } catch(System.OverflowException) { $value = new ValueExpression(long.Parse($INTEGER.text)); } }
| FLOAT { $value = new ValueExpression(double.Parse($FLOAT.text, NumberStyles.Float, numberFormatInfo)); }
| STRING { $value = new ValueExpression(InternalHelper.ExtractString($STRING.text)); }
| DATETIME { $value = new ValueExpression(DateTime.Parse($DATETIME.text.Substring(1, $DATETIME.text.Length-2))); }
| TRUE { $value = new ValueExpression(true); }
| FALSE { $value = new ValueExpression(false); }
| NULL { $value = new ValueExpression(null); }
;

/*

Lexer Rules
*/

NULL
: 'null'
;

TRUE
: 'true'|'True'
;

FALSE
: 'false'|'False'
;

INTEGER
: DIGIT+
;

FLOAT
: DIGIT* '.' DIGIT+ E?
| DIGIT+ E
;

STRING
: ''' ( EscapeSequence | (options {greedy=false;} : ~('\u0000'..'\u001f' | '' | ''' ) ) )* '''
;

DATETIME
: '#' (options {greedy=false;} : ~('#')*) '#'
;

E : ('E'|'e') ('+'|'-')? DIGIT+
;

ID
: IDStart IDPart*
;

fragment IDStart
: LETTER
| IDUNICODE
;

fragment IDPart
: IDStart
| DIGIT
;

fragment IDUNICODE
: '\u00C0'..'\uFFFE'
;

fragment LETTER
: 'a'..'z'
| 'A'..'Z'
| '_'
;

fragment DIGIT
: '0'..'9'
;

fragment EscapeSequence
: ''
(
'n'
| 'r'
| 't'
| '''
| ''
| UnicodeEscape
)
;

fragment HexDigit
: ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment UnicodeEscape
: 'u' HexDigit HexDigit HexDigit HexDigit
;

LINE_COMMENT
: ('--'|'//') ~( '\r' | '\n' )* {Skip();}
;

/* Ignore white spaces */
WS : (' '|'\r'|'\t'|'\u000C'|'\n') {Skip();}
;

…………………………………..…………………………………..…………………………………..…………………………………..

And here is my input for test performance:

1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1

and

1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1

It compiled like forever

If I changed this to

(1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1)

and

(1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1)

then it compiled in milliseconds.

Sent from Windows Mail

From: Terence Parr
Sent: ‎Thursday‎, ‎December‎ ‎19‎, ‎2013 ‎1‎:‎33‎ ‎AM
To: antlr/antlr4
Cc: ryan lin

So you are confirming that two-stage is still very slow for that grammar and input? Have you tried the Java target? it's likely the same but that's the only target I can try this on.

—
Reply to this email directly or view it on GitHub.

parrt · 2013-12-20T01:14:54Z

wait. so both the small one you originally posted and this one don't work? If the small exhibits the problem why post the big one? just trying to figure out which grammar to focus on.

sharwell · 2014-01-17T14:53:39Z

This is fixed by #400. It now takes a fraction of a second to parse expressions 10x the size of the one in your original example.

Add regression test for expression performance (closes #192)

See: antlr/antlr4#192 (comment) Co-authored-by: Evgeniy <minimajack@gmail.com>

…rser ### What changes were proposed in this pull request? This PR follows the antlr/antlr4#192 (comment) to correct the current implementation of the **two-stage parsing strategy** in `AbstractSqlParser`. ### Why are the changes needed? This should be a long-standing issue, before [SPARK-38385](https://issues.apache.org/jira/browse/SPARK-38385), Spark uses `DefaultErrorStrategy`, and after [SPARK-38385](https://issues.apache.org/jira/browse/SPARK-38385) Spark uses class `SparkParserErrorStrategy() extends DefaultErrorStrategy`. It is not a correct implementation of the "two-stage parsing strategy" As mentioned in antlr/antlr4#192 (comment) > You can save a great deal of time on correct inputs by using a two-stage parsing strategy. > > 1. Attempt to parse the input using BailErrorStrategy and PredictionMode.SLL. > If no exception is thrown, you know the answer is correct. > 2. If a ParseCancellationException is thrown, retry the parse using the default > settings (DefaultErrorStrategy and PredictionMode.LL). ### Does this PR introduce _any_ user-facing change? Yes, the Spark SQL parser becomes more powerful, SQL like `SELECT 1 UNION SELECT 2` parse succeeded after this change. ### How was this patch tested? New UT is added. Closes #40835 from pan3793/SPARK-42552. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

ghost assigned sharwell Mar 21, 2013

sharwell closed this as completed in 14107cc Jan 17, 2014

sharwell added a commit that referenced this issue Jan 17, 2014

Merge pull request #427 from sharwell/fix-192

10f45ab

Add regression test for expression performance (closes #192)

sharwell mentioned this issue Aug 29, 2014

Precedence predicate DFA pruning algorithm may have a bug #679

Closed

qwerppoo mentioned this issue Sep 15, 2015

Another poor performance about expression parsing #994

Closed

daniellansun mentioned this issue Apr 14, 2016

Improve the performance of the new parser jespersm/groovy#37

Closed

daniellansun mentioned this issue Oct 5, 2016

Improve the performance of the new parser daniellansun/groovy-parser#6

Closed

squizz617 mentioned this issue Nov 7, 2016

Parser optimization squizz617/vuddy#7

Closed

gmwxio mentioned this issue Feb 23, 2017

Go micro performance. lexer atn config using murmur hash 32. #1689

Closed

KvanTTT mentioned this issue Sep 10, 2018

Performance optimization for javascript runtime #2360

Closed

MinimaJack mentioned this issue Dec 23, 2019

Change grammar 1c-syntax/bsl-parser#52

Open

nixel2007 added a commit to 1c-syntax/bsl-parser that referenced this issue Dec 27, 2019

Использование SSL prediction mode, если возможно

4ebe138

See: antlr/antlr4#192 (comment) Co-authored-by: Evgeniy <minimajack@gmail.com>

This was referenced Apr 17, 2023

[SPARK-42552][SQL] Fix select without parentheses can't be parsed apache/spark#40823

Closed

[SPARK-42552][SQL] Correct the two-stage parsing strategy of antlr parser apache/spark#40835

Closed

pan3793 mentioned this issue Apr 19, 2023

Correct the two-stage parsing strategy of antlr parser trinodb/trino#17119

Closed

pan3793 mentioned this issue May 30, 2023

Spark 3.4: Correct the two-stage parsing strategy of antlr parser apache/iceberg#7734

Open

cfmcgrady mentioned this issue May 30, 2023

Correct the two-stage parsing strategy of antlr parser delta-io/delta#1800

Open

packy92 mentioned this issue Feb 19, 2024

[Enhancement] use SLL as the default prediction mode StarRocks/starrocks#41148

Merged

23 tasks

mergify bot mentioned this issue Mar 18, 2024

[Enhancement] use SLL as the default prediction mode (backport #41148) StarRocks/starrocks#42758

Merged

41 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poor performance of expression parsing #192

Poor performance of expression parsing #192

agacek commented Mar 21, 2013

sharwell commented Mar 21, 2013

agacek commented Mar 21, 2013

sharwell commented Mar 21, 2013

parrt commented Mar 21, 2013

ryanlin86 commented Dec 18, 2013

sharwell commented Dec 18, 2013

parrt commented Dec 18, 2013

ryanlin86 commented Dec 20, 2013

parrt commented Dec 20, 2013

sharwell commented Jan 17, 2014

Poor performance of expression parsing #192

Poor performance of expression parsing #192

Comments

agacek commented Mar 21, 2013

sharwell commented Mar 21, 2013

agacek commented Mar 21, 2013

sharwell commented Mar 21, 2013

parrt commented Mar 21, 2013

ryanlin86 commented Dec 18, 2013

sharwell commented Dec 18, 2013

parrt commented Dec 18, 2013

ryanlin86 commented Dec 20, 2013

parrt commented Dec 20, 2013

sharwell commented Jan 17, 2014