-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4226][SQL]Add subquery (not) in/exists support #9055
Conversation
cc @marmbrus @yhuai @ravipesala Since the anti join is another type of |
Test build #43506 has finished for PR 9055 at commit
|
Test build #43508 has finished for PR 9055 at commit
|
Test build #43528 has finished for PR 9055 at commit
|
Seems the failure is not related. |
retest this please |
what's the difference with #4812? |
ok, does this support multi exists and in in where clause? |
No, we don't support that in this PR, but should be very easy to support once this PR merged. I can plan the work if you feel that's very critical to your customers. |
Test build #43552 has finished for PR 9055 at commit
|
cc @rxin as well, this is required by many of our customers, and most of the code change is about the unit test, should not be hard to follow. |
ab22171
to
7511f47
Compare
Test build #43782 has finished for PR 9055 at commit
|
Seems not related. |
retest this please |
1 similar comment
retest this please |
Test build #43826 has finished for PR 9055 at commit
|
protected def nodeToExpr(node: Node): Expression = node match { | ||
val EXISTS = "(?i)EXISTS".r | ||
|
||
protected def nodeToExpr(node: Node, context: Context): Expression = node match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to pass in context
? We added context
to the argument list of nodeToPlan
to support creating view. We are not expecting a subqeury expr is for creating a view, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use the context
in this PR, however, the def nodeToPlan(..)
need the context
, as in this implementation, I actually add 2 extra expressions, they take the LogcialPlan
as parameters, which mean the function nodeToExpr
will call nodeToPlan()
and pass the context
down. Otherwise I have to pass the null
to nodeToPlan()
, which probably even more confusing and error-prone.
* Exist subquery expression, only used in filter only | ||
*/ | ||
case class Exists(subquery: LogicalPlan, positive: Boolean) | ||
extends LeafExpression with SubQueryExpression { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format
Two general comments. First, we need to add document to explain how we rewrite a plan when (1) there is a uncorrelated subquery and (2) there is a correlated subquery. Second, for those rewriting rules, I am thinking if we can have more concise ones. For uncorrelated subqueries, the subquery itself should be a resolved logical plan, right? For correlated subqueries, we only need to extract those conditions referring columns in the outer query block, right? Do we really need to matching those different specific patterns? Can we have some general logics? Actually, does this pr try to support uncorrelated in/not in/exists/not exists subqueries? |
Thank you @yhuai for reviewing this. First, I'll agree with you to make a general logic to partially resolve the correlated condition within the subquery, but it's probably not that easy, particularly we need to give more concise error message to the end user, so my suggestion is to leave it for the future improvement, probably we will have better idea to simplify that by having enough feature supported with the follow up PRs (See my TODO in the description), as currently, the limit patterns actually works for most of cases. Second, I totally agree with the Join Type comments, LeftSemiJoin <-> LeftSemi <-> LeftAnti, the motivation I am trying to make a parent class for LeftSemi / LeftAnti is for reducing the code change in Still, I hope we can merge this PR in 1.6 release, as it's almost 1 years passed since the previous PRs created in #3249 & #4812. And I will keep updating the code once we have the general agreement for the implementation. |
BTW: IN / NOT IN definitely supports the uncorrelated, but EXISTS/NOT EXISTS are not in this case, the same behavior as Hive does. |
Test build #44064 has finished for PR 9055 at commit
|
Hi @yhuai , |
@jameszhouyi |
Thank you @gatorsmile for your suggestion. |
@jameszhouyi |
Yeah, sorry. It is too late for a patch this large. |
So what next ? |
[Moved to Spark dev mailing list as: Expression/LogicalPlan dichotomy in Spark SQL Catalyst] |
I had a offline discussion with @chenghao-intel. We will split this PR to smaller PRs. The first work will be on the backend operators. Then, we will add parser and analyzer rule. |
@chenghao-intel How about we close this PR for now? |
ok, closing it now |
Found a related HIVE JIRA to support the left anti join: https://issues.apache.org/jira/browse/HIVE-12519 However, their proposed solution has a hole. Anyway, if we can support the anti join at the run time, it is much efficient. |
### What changes were proposed in this pull request? This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms: - `[NOT] EXISTS(subquery)` - `[NOT] IN (subquery)` This PR is (loosely) based on the work of davies (#10706) and chenghao-intel (#9055). They should be credited for the work they did. ### How was this patch tested? Modified parsing unit tests. Added tests to `org.apache.spark.sql.SQLQuerySuite` cc rxin, davies & chenghao-intel Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12306 from hvanhovell/SPARK-4226.
Some of the key concepts:
e.g. We reference the "a.value", which is the attribute in parent query, in the subquery.
Basic Logic for the Transformation
Conceptional demo with logical plan , we support the cases like below:
There are also some limitations:
e.g.(bad example)
e.g.(bad example)
e.g.(bad example)
e.g.(bad example)
e.g.(bad example)
e.g.(bad example)
TODOs (In the future improvement)
a. More pretty message to user why we failed in analysis.
b. Support multiple IN / EXISTS clause in the predicates.
c. Implicit reference expression substitution to the parent query
d. More general correlated condition support, particularly for the nested ones in the subquery.
e. SQL Parser supports (More SQL standard supports)
f. ...