[SPARK-13657] [SQL] Support parsing very long AND/OR expressions #11501
Test build #52417 has finished for PR 11501 at commit
Test build #52482 has finished for PR 11501 at commit
Why does this reduce the number of partitions?
case _ => false
}) {}
collected += rest
collected.reverse
Why do you reverse it?
We want to keep the same order as the SQL string.
A OR B OR C
is parsed as (OR (OR A B) C), so the loop collects (C, B, A); we reverse it to return (A, B, C).
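The ordering argument above can be illustrated with a minimal, self-contained sketch. It does not use Spark's parser classes: `Node`, `Leaf`, `Or`, and `flattenOr` are hypothetical names introduced only for this example, and the loop shape is modeled on the snippet under review rather than copied from the patch.

```scala
import scala.collection.mutable.ArrayBuffer

object FlattenSketch {
  // Hypothetical miniature AST, standing in for the parser's real node type.
  sealed trait Node
  final case class Leaf(name: String) extends Node
  final case class Or(left: Node, right: Node) extends Node

  // Flatten a left-deep OR tree with a loop instead of recursion,
  // so that a very long chain cannot overflow the stack.
  def flattenOr(root: Node): List[Node] = {
    val collected = ArrayBuffer.empty[Node]
    var rest = root
    while (rest match {
      case Or(l, r) =>
        collected += r // right operands are seen last-to-first
        rest = l       // keep descending along the left spine
        true
      case _ => false
    }) {
      // do nothing: the condition already did the work
    }
    collected += rest  // the leftmost operand is reached last
    collected.reverse.toList // restore the order of the SQL string
  }
}
```

For `(OR (OR A B) C)` the buffer holds C, B, A before the final `reverse`, which is why the reversal is needed to return A, B, C.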
Can you update the title from "parser" to "support parsing"?
rest = l
true
case _ => false
}) {}
{}?
Yeah, the condition already does the dirty work, which avoids another match-case in the body of the while loop.
Maybe adding an explicit "// do nothing" comment would help.
@nongli Before this PR, the query failed to parse if you specified that many predicates on partition columns. To run it, you had to remove those predicates, and the number of partitions then went much higher.
Test build #52593 has finished for PR 11501 at commit
Test build #2616 has finished for PR 11501 at commit
Oops, there is a conflict.
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/CatalystQl.scala
Test build #52640 has finished for PR 11501 at commit
Test build #52641 has finished for PR 11501 at commit
Test build #2617 has finished for PR 11501 at commit
LGTM - but can you add a comment explaining why we need the reverse in the code itself?
Added a comment; merging this into master.
## What changes were proposed in this pull request?

To avoid a StackOverflow when parsing an expression with hundreds of ORs, we should use a loop instead of recursive functions to flatten the tree into a list. This PR also builds a balanced tree to reduce the depth of the generated And/Or expression, which avoids a StackOverflow in the analyzer/optimizer.

## How was this patch tested?

Added new unit tests. Manually tested with TPCDS Q3 with hundreds of predicates in it [1]. These predicates help to reduce the number of partitions, and the query time went from 60 seconds to 8 seconds.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

Author: Davies Liu <davies@databricks.com>

Closes apache#11501 from davies/long_or.
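The balanced-tree idea from the description can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: `Expr`, `Pred`, `Or`, `balanced`, and `depth` are hypothetical names, and splitting the operand list at its midpoint is one straightforward way to get logarithmic depth while preserving left-to-right operand order.

```scala
object BalancedSketch {
  // Hypothetical miniature expression type, standing in for Catalyst expressions.
  sealed trait Expr
  final case class Pred(name: String) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Combine the operands into an OR tree by splitting the range in the middle.
  // The helper recurses only about log2(n) levels deep, so it is safe for long lists.
  def balanced(preds: IndexedSeq[Expr]): Expr = {
    require(preds.nonEmpty, "need at least one operand")
    def build(lo: Int, hi: Int): Expr =
      if (lo == hi) preds(lo)
      else {
        val mid = lo + (hi - lo) / 2
        Or(build(lo, mid), build(mid + 1, hi))
      }
    build(0, preds.length - 1)
  }

  // Height of the tree: the quantity that drives recursion depth in later passes.
  def depth(e: Expr): Int = e match {
    case Or(l, r) => 1 + math.max(depth(l), depth(r))
    case _        => 0
  }
}
```

With 1000 operands, a left-deep chain has depth 999, while the tree built by this sketch has depth 10, which is the property that keeps recursive tree traversals from exhausting the stack.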