Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance of expression parsing #192

Closed
agacek opened this issue Mar 21, 2013 · 10 comments
Closed

Poor performance of expression parsing #192

agacek opened this issue Mar 21, 2013 · 10 comments

Comments

@agacek
Copy link

agacek commented Mar 21, 2013

Using Antlr 4.0 I've run into a performance problem with parsing a simple expression language. Here is the grammar:

grammar Expr;

program: expr;

expr: ID
    | 'not' expr
    | expr 'and' expr
    | expr 'or' expr
    ;

ID: [a-zA-Z_][a-zA-Z_0-9]*;
WS: [ \t\n\r\f]+ -> skip;
ERROR: .;

And here is the sample file I'm trying to parse:

not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
    X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and     X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and     X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and     X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and     X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and     X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and     X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and     X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and     X9 and not X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and     X10 and not X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and     X11 and not X12 or
not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and     X12

And just in case, here is how I am calling everything:

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class Main {
  public static void main(String args[]) throws Exception {
    CharStream stream = new ANTLRFileStream("test.expr");
    ExprLexer lexer = new ExprLexer(stream);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    ExprParser parser = new ExprParser(tokens);

    System.out.println("Start parsing");
    long start = System.currentTimeMillis();

    parser.program();

    long stop = System.currentTimeMillis();
    System.out.println("End parsing: " + (stop - start)/1000.0 + "s");
  }
}

On my machine it takes just over 14 minutes to parse the given expression.

@ghost ghost assigned sharwell Mar 21, 2013
@sharwell
Copy link
Member

Nearly all of the time spent parsing the expression is due to the full context prediction portion of ANTLR 4. In the current reference release, this code does not use a dynamic DFA to improve parsing performance. However, the 4.0-opt release uses a highly optimized implementation of this code allowing the expression above to parse in a few milliseconds.

The suggestions below assume you are using the reference release:

You can save a great deal of time on correct inputs by using a two-stage parsing strategy.

  • Attempt to parse the input using BailErrorStrategy and PredictionMode.SLL. If no exception is thrown, you know the answer is correct.
  • If a ParseCancellationException is thrown, retry the parse using the default settings (DefaultErrorStrategy and PredictionMode.LL).

Unlike using PredictionMode.SLL alone, the two-stage parsing strategy is guaranteed to return the same result as if you chose to use the full-strength PredictionMode.LL, but tends to do so in time closer to the SLL option.

@agacek
Copy link
Author

agacek commented Mar 21, 2013

Thanks for the quick followup Sam. The two-stage parsing strategy seems to work fine for me. Is the 4.0-opt release going to be part of the next public Antlr release? And if so, would you recommend going back to a 1-stage parse or not?

@sharwell
Copy link
Member

I always use the two-stage strategy. In the optimized builds speed isn't so much a problem but it still tends to reduce the overall memory requirements. There are no plans right now for the optimized build to replace the reference version (or ship alongside it), but bits and pieces of it are being incorporated into the reference version over time.

@parrt
Copy link
Member

parrt commented Mar 21, 2013

that is described in the book as well.
Ter

On Thu, Mar 21, 2013 at 6:33 AM, Sam Harwell notifications@github.comwrote:

Nearly all of the time spent parsing the expression is due to the full
context prediction portion of ANTLR 4. In the current reference release,
this code does not use a dynamic DFA to improve parsing performance.
However, the 4.0-opt release uses a highly optimized implementation of this
code allowing the expression above to parse in a few milliseconds.

The suggestions below assume you are using the official release:

You can save a great deal of time on correct inputs by using a
two-stage parsing strategy.

  • Attempt to parse the input using BailErrorStrategy and
    PredictionMode.SLL. If no exception is thrown, you know the answer is
    correct.
  • If a ParseCancellationException is thrown, retry the parse using the
    default settings (DefaultErrorStrategy and PredictionMode.LL).

Unlike using PredictionMode.SLL alone, the two-stage parsing strategy is
guaranteed to return the same result as if you chose to use the
full-strength PredictionMode.LL, but tends to do so in time closer to the
SLL option.


Reply to this email directly or view it on GitHubhttps://github.com//issues/192#issuecomment-15238595
.

Dictation in use. Please excuse homophones, malapropisms, and nonsense.

@ryanlin86
Copy link

I'm having the same problem, I'm using C# instead.
I tried the above options, but didn't work for me, still slow, is there any C# performance-opt version available?

@sharwell
Copy link
Member

@ryanlin86 The C# target is based on the optimized fork of the project. Hopefully as part of the 4.2 release I can write an article describing the steps available to track down performance problems along with some of the solutions that are available.

@parrt
Copy link
Member

parrt commented Dec 18, 2013

So you are confirming that two-stage is still very slow for that grammar and input? Have you tried the Java target? it's likely the same but that's the only target I can try this on.

@ryanlin86
Copy link

I haven’t tried java target( which I’m not familiar with java)

Here is my grammar file:

grammar Grammar;

@Header {
using System;
using System.Text;
using System.Globalization;
using System.Collections.Generic;
using SBE.AppFrame.Core.Expressions.AST;
}

@parser::members
{
protected const int EOF = Eof;
private static NumberFormatInfo numberFormatInfo = new NumberFormatInfo{NumberDecimalSeparator ="."};
}

@lexer::members
{
protected const int EOF = Eof;
protected const int HIDDEN = Hidden;
}

/*

  • Parser Rules
    */

compileUnit returns [LogicalExpression value]
: statement EOF {$value=$statement.value;}
| expression EOF {$value=$expression.value;}
| statementList EOF{$value=$statementList.value;}
;

statementList returns[LogicalExpression value]
@init{
List list=new List();
}
: statement{list.Add($statement.value);} (next=statement{list.Add($next.value);})+{$value=new StatementSequence(list);}
;

statement returns [LogicalExpression value]
: block {$value=$block.value;}
| 'if' parExpression ifTrue=statement {$value=new IfExpression($parExpression.value,$ifTrue.value);} ('else' ifFalse=statement {$value=new IfExpression($parExpression.value,$ifTrue.value,$ifFalse.value);})?
| 'for' '(' forInit? ';' forCondition=expression? ';' forUpdate? ')' body=statement {$value=new ForExpression($forInit.value,$forCondition.value,$forUpdate.value,$body.value);}
| 'foreach' '(' 'var' ID 'in' expression ')' body=statement {$value=new ForeachExpression($ID.text, $expression.value, $body.value);}
| 'while' parExpression body=statement {$value=new WhileExpression($parExpression.value,$body.value);}
| 'return' expression? ';' {$value=new ReturnExpression();}
| 'break' ';' {$value=new BreakExpression();}
| 'continue' ';' {$value=new ContinueExpression();}
| statementExpression ';'{$value=$statementExpression.value;}
| ';' {$value=new EmptyStatementExpression();}
;

block returns [LogicalExpression value]
@init{
List list=new List();
}
: '{' (blockStatement{list.Add($blockStatement.value);})* '}' {$value=new BlockExpression(list);}
;

blockStatement returns [LogicalExpression value]
: statement {$value=$statement.value;}
;

forInit returns [LogicalExpression[] value]
: expressionList {$value=$expressionList.value;}
;

parExpression returns [LogicalExpression value]
: '(' expression ')' {$value=$expression.value;}
;

forUpdate returns [LogicalExpression[] value]
: expressionList {$value=$expressionList.value;}
;

statementExpression returns [LogicalExpression value]
: expression {$value=$expression.value;}
;

expressionList returns [LogicalExpression[] value]
@init {
List expressions = new List();
}
: first=expression{expressions.Add($first.value);} (',' follow=expression{expressions.Add($follow.value);})*{$value=expressions.ToArray();}
;

expression returns [LogicalExpression value]
: primary {$value=$primary.value;}
| LEFT=ID ':' member=ID {$value=new MemberExpression(new Identifier($LEFT.text),$member.text);}
| left=expression '.' ID '()'{$value=new MethodCallExpression($left.value,$ID.text);}
| left=expression '.' ID '(' parameters=expressionList? ')'{$value=new MethodCallExpression($left.value,$ID.text,$parameters.value);}
| left=expression '.' ID {$value=new MemberExpression($left.value,$ID.text);}
| ID '(' parameters=expressionList? ')' {$value=new MethodCallExpression(null,$ID.text,$parameters.value);}
| left=expression type=('++' | '--') {$value=new UnaryExpression($type.text=="++"?UnaryExpressionType.RightDoublePlus:UnaryExpressionType.RightDoubleMinus,$left.value);}
| ('+'|'-'|'++'|'--') right=expression {$value=new UnaryExpression($type.text=="++"?UnaryExpressionType.LeftDoublePlus:UnaryExpressionType.LeftDoubleMinus,$right.value);}
| ('~'|'!') right=expression {$value=new UnaryExpression(UnaryExpressionType.Not,$right.value);}
| left=expression type=(''|'/'|'%') right=expression {$value=new BinaryExpression(InternalHelper.GetBinaryExpressionType($type.text),$left.value,$right.value);}
| left=expression type=('+'|'-') right=expression {$value=new BinaryExpression(InternalHelper.GetBinaryExpressionType($type.text),$left.value,$right.value);}
| left=expression type=('<=' | '>=' | '>' | '<') right=expression {$value=new BinaryExpression(InternalHelper.GetBinaryExpressionType($type.text),$left.value,$right.value);}
| left=expression type=('==' | '!=') right=expression {$value=new BinaryExpression(InternalHelper.GetBinaryExpressionType($type.text),$left.value,$right.value);}
| left=expression '&' right=expression {$value=new BinaryExpression(BinaryExpressionType.And,$left.value,$right.value);}
| left=expression '^' right=expression {$value=new BinaryExpression(BinaryExpressionType.And,$left.value,$right.value);}
| left=expression '|' right=expression {$value=new BinaryExpression(BinaryExpressionType.And,$left.value,$right.value);}
| left=expression ('&&'|'and') right=expression {$value=new BinaryExpression(BinaryExpressionType.And,$left.value,$right.value);}
| left=expression ('||'|'or') right=expression {$value=new BinaryExpression(BinaryExpressionType.Or,$left.value,$right.value);}
| left=expression '?' middle=expression ':' right=expression{$value=new TernaryExpression(TernaryExpressionType.Condition,$left.value,$middle.value,$right.value);}
| left=expression '=' right=expression {$value=new BinaryExpression(BinaryExpressionType.Assign,$left.value,$right.value);}
| left=expression '+=' right=expression {$value=new BinaryExpression(BinaryExpressionType.PlusAssign,$left.value,$right.value);}
| left=expression '-=' right=expression {$value=new BinaryExpression(BinaryExpressionType.MinusAssign,$left.value,$right.value);}
| left=expression '
=' right=expression {$value=new BinaryExpression(BinaryExpressionType.MultiplyAssign,$left.value,$right.value);}
| left=expression '/=' right=expression {$value=new BinaryExpression(BinaryExpressionType.DivideAssign,$left.value,$right.value);}
| left=expression '%=' right=expression {$value=new BinaryExpression(BinaryExpressionType.ModuloAssign,$left.value,$right.value);}
| 'var' ID ('=' assignValue=expression){$value=new VariableDeclareExpression(null,$ID.text,$assignValue.value);}
| 'new '+ID'()'{$value=new NewExpression($ID.text);}
| left=expression '..' right=expression{$value=new RangeExpression($left.value,$right.value);}
| left=expression 'between' range {$value=new BinaryExpression(BinaryExpressionType.Between,$left.value,$range.value);}
| left=expression 'in' array {$value=new BinaryExpression(BinaryExpressionType.In,$left.value,$array.value);}
| ID'=>'body=expression {$value=new LambdaExpression($ID.text,$body.value);}
;

range returns [LogicalExpression value]
: left=expression '..' right=expression{$value=new RangeExpression($left.value,$right.value);}
;

array returns [LogicalExpression value]
: '['expressionList']'{$value=new ArrayExpression($expressionList.value);}
;
primary returns [LogicalExpression value]
: '(' expression ')' {$value=new UnaryExpression(UnaryExpressionType.Quote,$expression.value);}
| literal {$value=$literal.value;}
| '$this' {$value=new Identifier("$this");}
| 'this' {$value=new Identifier("this");}
| qualified_identifier{$value=$qualified_identifier.value;}
| array{$value=$array.value;}
;

qualified_identifier returns [LogicalExpression value]
: 'Contexts!'?ID{$value=new Identifier($qualified_identifier.text);}
;

literal returns [ValueExpression value]
: INTEGER { try { $value = new ValueExpression(int.Parse($INTEGER.text)); } catch(System.OverflowException) { $value = new ValueExpression(long.Parse($INTEGER.text)); } }
| FLOAT { $value = new ValueExpression(double.Parse($FLOAT.text, NumberStyles.Float, numberFormatInfo)); }
| STRING { $value = new ValueExpression(InternalHelper.ExtractString($STRING.text)); }
| DATETIME { $value = new ValueExpression(DateTime.Parse($DATETIME.text.Substring(1, $DATETIME.text.Length-2))); }
| TRUE { $value = new ValueExpression(true); }
| FALSE { $value = new ValueExpression(false); }
| NULL { $value = new ValueExpression(null); }
;

/*

  • Lexer Rules
    */

NULL
: 'null'
;

TRUE
: 'true'|'True'
;

FALSE
: 'false'|'False'
;

INTEGER
: DIGIT+
;

FLOAT
: DIGIT* '.' DIGIT+ E?
| DIGIT+ E
;

STRING
: ''' ( EscapeSequence | (options {greedy=false;} : ~('\u0000'..'\u001f' | '' | ''' ) ) )* '''
;

DATETIME
: '#' (options {greedy=false;} : ~('#')*) '#'
;

E : ('E'|'e') ('+'|'-')? DIGIT+
;

ID
: IDStart IDPart*
;

fragment IDStart
: LETTER
| IDUNICODE
;

fragment IDPart
: IDStart
| DIGIT
;

fragment IDUNICODE
: '\u00C0'..'\uFFFE'
;

fragment LETTER
: 'a'..'z'
| 'A'..'Z'
| '_'
;

fragment DIGIT
: '0'..'9'
;

fragment EscapeSequence
: ''
(
'n'
| 'r'
| 't'
| '''
| ''
| UnicodeEscape
)
;

fragment HexDigit
: ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment UnicodeEscape
: 'u' HexDigit HexDigit HexDigit HexDigit
;

LINE_COMMENT
: ('--'|'//') ~( '\r' | '\n' )* {Skip();}
;

/* Ignore white spaces */
WS : (' '|'\r'|'\t'|'\u000C'|'\n') {Skip();}
;

…………………………………..…………………………………..…………………………………..…………………………………..

And here is my input for test performance:

1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1

and

1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1

It compiled like forever

If I changed this to

(1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1)

and

(1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1 and 1==1)

then it compiled in milliseconds.

Sent from Windows Mail

From: Terence Parr
Sent: ‎Thursday‎, ‎December‎ ‎19‎, ‎2013 ‎1‎:‎33‎ ‎AM
To: antlr/antlr4
Cc: ryan lin

So you are confirming that two-stage is still very slow for that grammar and input? Have you tried the Java target? it's likely the same but that's the only target I can try this on.


Reply to this email directly or view it on GitHub.

@parrt
Copy link
Member

parrt commented Dec 20, 2013

wait. so both the small one you originally posted and this one don't work? If the small exhibits the problem why post the big one? just trying to figure out which grammar to focus on.

@sharwell
Copy link
Member

This is fixed by #400. It now takes a fraction of a second to parse expressions 10x the size of the one in your original example.

sharwell added a commit that referenced this issue Jan 17, 2014
Add regression test for expression performance (closes #192)
nixel2007 added a commit to 1c-syntax/bsl-parser that referenced this issue Dec 27, 2019
cloud-fan pushed a commit to apache/spark that referenced this issue Apr 19, 2023
…rser

### What changes were proposed in this pull request?

This PR follows the antlr/antlr4#192 (comment) to correct the current implementation of the **two-stage parsing strategy** in `AbstractSqlParser`.

### Why are the changes needed?

This should be a long-standing issue, before [SPARK-38385](https://issues.apache.org/jira/browse/SPARK-38385), Spark uses `DefaultErrorStrategy`, and after [SPARK-38385](https://issues.apache.org/jira/browse/SPARK-38385) Spark uses class `SparkParserErrorStrategy() extends DefaultErrorStrategy`. It is not a correct implementation of the "two-stage parsing strategy"

As mentioned in antlr/antlr4#192 (comment)

> You can save a great deal of time on correct inputs by using a two-stage parsing strategy.
>
> 1. Attempt to parse the input using BailErrorStrategy and PredictionMode.SLL.
>    If no exception is thrown, you know the answer is correct.
> 2. If a ParseCancellationException is thrown, retry the parse using the default
>    settings (DefaultErrorStrategy and PredictionMode.LL).

### Does this PR introduce _any_ user-facing change?

Yes, the Spark SQL parser becomes more powerful, SQL like `SELECT 1 UNION SELECT 2` parse succeeded after this change.

### How was this patch tested?

New UT is added.

Closes #40835 from pan3793/SPARK-42552.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants