[SPARK-13657] [SQL] Support parsing very long AND/OR expressions #11501

davies · 2016-03-03T23:39:15Z

What changes were proposed in this pull request?

To avoid a StackOverflowError when parsing an expression with hundreds of ORs, we should use a loop instead of recursive functions to flatten the tree into a list. This PR also builds a balanced tree to reduce the depth of the generated And/Or expression, which avoids a StackOverflowError in the analyzer/optimizer.
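As an illustration of the balanced-tree idea described above, here is a minimal sketch. It uses a hypothetical `Expr` type with `Leaf` and `Or` cases rather than Catalyst's real expression classes, and the helper names (`balanced`, `depth`, `leaves`) are assumptions made for this example, not the code in this PR.

```scala
// Hypothetical minimal expression type; these are NOT Spark's Catalyst classes.
sealed trait Expr
case class Leaf(name: String) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

object BalancedOr {
  // Combine the operands by splitting the sequence in half at each level,
  // so the resulting tree has depth of about log2(n) instead of n.
  def balanced(operands: IndexedSeq[Expr]): Expr = {
    require(operands.nonEmpty, "need at least one operand")
    if (operands.length == 1) {
      operands.head
    } else {
      val (l, r) = operands.splitAt(operands.length / 2)
      Or(balanced(l), balanced(r))
    }
  }

  // Number of nodes on the longest root-to-leaf path.
  def depth(e: Expr): Int = e match {
    case Leaf(_) => 1
    case Or(l, r) => 1 + math.max(depth(l), depth(r))
  }

  // Leaf names from left to right, to check that operand order is preserved.
  def leaves(e: Expr): List[String] = e match {
    case Leaf(n) => List(n)
    case Or(l, r) => leaves(l) ++ leaves(r)
  }
}
```

With this sketch, 1000 operands produce a tree of depth 11, whereas a left-deep chain of the same operands would have depth 1000, so later recursive passes over the tree stay shallow.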

How was this patch tested?

Added new unit tests. Manually tested with TPCDS Q3 with hundreds of predicates in it [1]. These predicates help to reduce the number of partitions, and the query time went from 60 seconds to 8 seconds.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

davies · 2016-03-03T23:39:33Z

cc @hvanhovell @nongli

SparkQA · 2016-03-04T01:28:48Z

Test build #52417 has finished for PR 11501 at commit d3e1546.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-04T21:40:43Z

Test build #52482 has finished for PR 11501 at commit 653cb82.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nongli · 2016-03-04T23:15:54Z

Why does this reduce the number of partitions?

nongli · 2016-03-04T23:19:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/CatalystQl.scala

+      case _ => false
+    }) {}
+    collected += rest
+    collected.reverse


Why do you reverse it?

davies · Mar 4, 2016

We want to keep the same order as in the SQL string. A OR B OR C becomes (OR (OR A B) C), so collected will be (C, B, A), and we should return (A, B, C).
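The ordering argument above can be sketched as follows. This is a minimal, hypothetical example: `Node`, `Pred`, `OrNode`, and `FlattenOr.flatten` are names invented here and do not come from CatalystQl.scala, and the loop uses an explicit flag rather than doing the match inside the `while` condition as the PR does.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical minimal tree type; NOT the parser's real AST classes.
sealed trait Node
case class Pred(name: String) extends Node
case class OrNode(left: Node, right: Node) extends Node

object FlattenOr {
  // Walk down the left spine with a loop instead of recursion,
  // so a chain of many ORs cannot overflow the call stack.
  def flatten(root: Node): Seq[Node] = {
    val collected = ArrayBuffer.empty[Node]
    var rest = root
    var descending = true
    while (descending) {
      rest match {
        case OrNode(l, r) =>
          collected += r      // right operands are met first: C, then B
          rest = l
        case _ =>
          descending = false  // reached the leftmost operand
      }
    }
    collected += rest         // the leftmost operand (A) is collected last
    collected.reverse.toSeq   // reverse to restore the order of the SQL string
  }
}
```

For `OrNode(OrNode(Pred("A"), Pred("B")), Pred("C"))` the buffer holds C, B, A before the reverse, and the function returns A, B, C.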

rxin · 2016-03-07T08:19:25Z

Can you update the title from "parser" -> "support parsing"?

hvanhovell · 2016-03-07T18:24:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/CatalystQl.scala

+        rest = l
+        true
+      case _ => false
+    }) {}


davies · Mar 7, 2016

Yeah, we already do the dirty work in the condition, to avoid another match-case in the body of the while loop.

rxin · Mar 7, 2016

Maybe adding an explicit "// do nothing" comment would help.

davies · 2016-03-07T20:30:46Z

@nongli Before this PR, the query would fail to parse if you specified this many predicates on partition columns. To run it, you had to remove those predicates, and then the number of partitions scanned went much higher.

SparkQA · 2016-03-07T22:20:00Z

Test build #52593 has finished for PR 11501 at commit c187554.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-08T07:10:31Z

Test build #2616 has finished for PR 11501 at commit c187554.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-03-08T07:12:37Z

Oops there is a conflict.

Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/CatalystQl.scala

SparkQA · 2016-03-08T08:54:30Z

Test build #52640 has finished for PR 11501 at commit 5765b09.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-08T08:58:31Z

Test build #52641 has finished for PR 11501 at commit ea41707.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-08T08:58:41Z

Test build #2617 has finished for PR 11501 at commit ea41707.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-03-08T18:00:01Z

@rxin @nongli Is this ready to go?

rxin · 2016-03-08T18:15:47Z

LGTM - but can you add a comment in the code itself explaining why we need the reverse?

davies · 2016-03-08T18:23:27Z

Added comment, merging this into master.

## What changes were proposed in this pull request?

To avoid a StackOverflowError when parsing an expression with hundreds of ORs, we should use a loop instead of recursive functions to flatten the tree into a list. This PR also builds a balanced tree to reduce the depth of the generated And/Or expression, which avoids a StackOverflowError in the analyzer/optimizer.

## How was this patch tested?

Added new unit tests. Manually tested with TPCDS Q3 with hundreds of predicates in it [1]. These predicates help to reduce the number of partitions, and the query time went from 60 seconds to 8 seconds.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

Author: Davies Liu <davies@databricks.com>

Closes apache#11501 from davies/long_or.

parser very long AND/OR expressions

d3e1546

Davies Liu added 2 commits March 4, 2016 11:47

Merge branch 'master' of github.com:apache/spark into long_or

1062c27

fix comment

653cb82

nongli reviewed Mar 4, 2016

davies changed the title ~~[SPARK-13657] [SQL] parser very long AND/OR expressions~~ [SPARK-13657] [SQL] Support parsing very long AND/OR expressions Mar 7, 2016

hvanhovell mentioned this pull request Mar 7, 2016

[SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 #11557

Closed

4 tasks

hvanhovell reviewed Mar 7, 2016

address comments

c187554

Merge branch 'master' of github.com:apache/spark into long_or

ea41707

Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/CatalystQl.scala

davies force-pushed the long_or branch from 5765b09 to ea41707 on March 8, 2016 07:23

add commment

e1349be

asfgit closed this in 78d3b60 Mar 8, 2016
