Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13657] [SQL] Support parsing very long AND/OR expressions #11501

Closed
wants to merge 6 commits into from

Conversation

davies
Copy link
Contributor

@davies davies commented Mar 3, 2016

What changes were proposed in this pull request?

In order to avoid StackOverflow when parse a expression with hundreds of ORs, we should use loop instead of recursive functions to flatten the tree as list. This PR also build a balanced tree to reduce the depth of generated And/Or expression, to avoid StackOverflow in analyzer/optimizer.

How was this patch tested?

Add new unit tests. Manually tested with TPCDS Q3 with hundreds predicates in it [1]. These predicates help to reduce the number of partitions, then the query time went from 60 seconds to 8 seconds.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

@davies
Copy link
Contributor Author

davies commented Mar 3, 2016

cc @hvanhovell @nongli

@SparkQA
Copy link

SparkQA commented Mar 4, 2016

Test build #52417 has finished for PR 11501 at commit d3e1546.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 4, 2016

Test build #52482 has finished for PR 11501 at commit 653cb82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Copy link
Contributor

nongli commented Mar 4, 2016

Why does this reduce the number of partitions?

case _ => false
}) {}
collected += rest
collected.reverse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you reverse it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to keep the same order as SQL string,

A OR B OR C will become (OR (OR A B) C), collected will be (C, B, A), we should return (A, B, C)

@rxin
Copy link
Contributor

rxin commented Mar 7, 2016

Can you update the title from "parser" -> "support parsing"?

@davies davies changed the title [SPARK-13657] [SQL] parser very long AND/OR expressions [SPARK-13657] [SQL] Support parsing very long AND/OR expressions Mar 7, 2016
rest = l
true
case _ => false
}) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we already do the dirty work in the condition, to avoid another match-case in the body of while.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add an explicit "// do nothing" comment would help

@davies
Copy link
Contributor Author

davies commented Mar 7, 2016

@nongli Before this PR, the query will fail to parse if you specify so many predicates for partition columns. In order to run it, you have to remove those predicates, then the number of partitions will go much higher.

@SparkQA
Copy link

SparkQA commented Mar 7, 2016

Test build #52593 has finished for PR 11501 at commit c187554.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 8, 2016

Test build #2616 has finished for PR 11501 at commit c187554.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Mar 8, 2016

Oops there is a conflict.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/CatalystQl.scala
@SparkQA
Copy link

SparkQA commented Mar 8, 2016

Test build #52640 has finished for PR 11501 at commit 5765b09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 8, 2016

Test build #52641 has finished for PR 11501 at commit ea41707.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 8, 2016

Test build #2617 has finished for PR 11501 at commit ea41707.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor Author

davies commented Mar 8, 2016

@rxin @nongli Is this ready to go?

@rxin
Copy link
Contributor

rxin commented Mar 8, 2016

LGTM - but can you add comment explaining why we need the reverse in the code itself?

@davies
Copy link
Contributor Author

davies commented Mar 8, 2016

Added comment, merging this into master.

@asfgit asfgit closed this in 78d3b60 Mar 8, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
## What changes were proposed in this pull request?

In order to avoid StackOverflow when parse a expression with hundreds of ORs, we should use loop instead of recursive functions to flatten the tree as list. This PR also build a balanced tree to reduce the depth of generated And/Or expression, to avoid StackOverflow in analyzer/optimizer.

## How was this patch tested?

Add new unit tests. Manually tested with TPCDS Q3 with hundreds predicates in it [1]. These predicates help to reduce the number of partitions, then the query time went from 60 seconds to 8 seconds.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

Author: Davies Liu <davies@databricks.com>

Closes apache#11501 from davies/long_or.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants