Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4226][SQL]Add subquery (not) in/exists support #9055

Closed
wants to merge 5 commits into from

Conversation

chenghao-intel
Copy link
Contributor

Some of the key concepts:

  • Correlated: References the attributes of the parent query within subquery, we call that Correlated.
    e.g. We reference the "a.value", which is the attribute in parent query, in the subquery.
SELECT a.value FROM src a
WHERE a.key in (
  SELECT b.key FROM src1 b
  WHERE a.value > b.value)
  • Uncorrelated: Do not have any attribute reference to its parent query in the subquery.
SELECT a.value FROM src a WHERE a.key IN (SELECT key FROM src WHERE key > 100);

Basic Logic for the Transformation

   EXISTS / IN => LEFT SEMI JOIN
   NOT EXISTS / NOT IN => LEFT ANTI JOIN

Conceptional demo with logical plan , we support the cases like below:

  • e.g. EXISTS / NOT EXISTS
 SELECT value FROM src a WHERE (NOT) EXISTS (SELECT 1 FROM src1 b WHERE a.key < b.key)
     ==>
 SELECT a.value FROM src a LEFT (ANTI) SEMI JOIN src1 b WHERE a.key < b.key
  • e.g. IN / NOT IN
 SELECT value FROM src a WHERE key (NOT) IN (SELECT key FROM src1 b WHERE a.value < b.value)
    ==>
 SELECT value FROM src a LEFT (ANTI) SEMI JOIN src1 b ON a.key = b.key AND a.value < b.value
  • e.g. IN / NOT IN with other conjunctions
 SELECT value FROM src a
 WHERE key (NOT) IN (
   SELECT key FROM src1 b WHERE a.value < b.value
 ) AND a.key > 10
    ==>
 SELECT value
   (FROM src a WHERE a.key > 10)
 LEFT (ANTI) SEMI JOIN src1 b ON a.key = b.key AND a.value < b.value

There are also some limitations:

  • IN/NOT IN subqueries may only select a single column.
    e.g.(bad example)
 SELECT value FROM src a WHERE EXISTS (SELECT key, value FROM src1 WHERE key > 10)
  • EXISTS/NOT EXISTS must have one or more correlated predicates.
    e.g.(bad example)
 SELECT value FROM src a WHERE EXISTS (SELECT 1 FROM src1 b WHERE b.key > 10)
  • References to the parent query is only supported in the WHERE clause of the subquery.
    e.g.(bad example)
 SELECT value FROM src a WHERE key IN (SELECT a.key + b.key FROM src1 b)
  • Only a single subquery can support in IN/EXISTS predicate.
    e.g.(bad example)
 SELECT value FROM src WHERE key IN (SELECT xx1 FROM xxx1) AND key in (SELECT xx2 
FROM xxx2)
  • Disjunction is not supported in the top level.
    e.g.(bad example)
 SELECT value FROM src WHERE key > 10 OR key IN (SELECT xx1 FROM xxx1)
  • Implicit reference expression substitution to the parent query is not supported.
    e.g.(bad example)
 SELECT min(key) FROM src a HAVING EXISTS (SELECT 1 FROM src1 b WHERE b.key = min(a.key))

TODOs (In the future improvement)

a. More pretty message to user why we failed in analysis.
b. Support multiple IN / EXISTS clause in the predicates.
c. Implicit reference expression substitution to the parent query
d. More general correlated condition support, particularly for the nested ones in the subquery.
e. SQL Parser supports (More SQL standard supports)
f. ...

@chenghao-intel
Copy link
Contributor Author

cc @marmbrus @yhuai @ravipesala
This implementation inspired by #3249, by using the SubQueryExpression. and also the follow up with #4812.

Since the anti join is another type of SEMI JOIN, I added it back here for performance concern in transform the "NOT EXISTS / NOT IN" subquery.

@SparkQA
Copy link

SparkQA commented Oct 10, 2015

Test build #43506 has finished for PR 9055 at commit e3aa255.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait SubQueryExpression extends Unevaluable
    • case class Exists(subquery: LogicalPlan, positive: Boolean)
    • case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

@SparkQA
Copy link

SparkQA commented Oct 10, 2015

Test build #43508 has finished for PR 9055 at commit b382bc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait SubQueryExpression extends Unevaluable
    • case class Exists(subquery: LogicalPlan, positive: Boolean)
    • case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

@SparkQA
Copy link

SparkQA commented Oct 10, 2015

Test build #43528 has finished for PR 9055 at commit ab22171.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait SubQueryExpression extends Unevaluable
    • case class Exists(subquery: LogicalPlan, positive: Boolean)
    • case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

@chenghao-intel
Copy link
Contributor Author

Seems the failure is not related.
retest this please

@chenghao-intel
Copy link
Contributor Author

retest this please

@scwf
Copy link
Contributor

scwf commented Oct 12, 2015

what's the difference with #4812?

@chenghao-intel
Copy link
Contributor Author

This is much simpler than #4812, by using the SubQueryExpression, particularly in processing the case
key IN (subquery) AND other_condition case. #4812 doesn't support the AND other_condition.

@scwf
Copy link
Contributor

scwf commented Oct 12, 2015

ok, does this support multi exists and in in where clause?

@chenghao-intel
Copy link
Contributor Author

No, we don't support that in this PR, but should be very easy to support once this PR merged. I can plan the work if you feel that's very critical to your customers.

@SparkQA
Copy link

SparkQA commented Oct 12, 2015

Test build #43552 has finished for PR 9055 at commit ab22171.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait SubQueryExpression extends Unevaluable
    • case class Exists(subquery: LogicalPlan, positive: Boolean)
    • case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

@chenghao-intel
Copy link
Contributor Author

cc @rxin as well, this is required by many of our customers, and most of the code change is about the unit test, should not be hard to follow.

@SparkQA
Copy link

SparkQA commented Oct 15, 2015

Test build #43782 has finished for PR 9055 at commit 7511f47.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait SubQueryExpression extends Unevaluable
    • case class Exists(subquery: LogicalPlan, positive: Boolean)
    • case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

@chenghao-intel
Copy link
Contributor Author

Seems not related.

@chenghao-intel
Copy link
Contributor Author

retest this please

1 similar comment
@chenghao-intel
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Oct 16, 2015

Test build #43826 has finished for PR 9055 at commit 7511f47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait SubQueryExpression extends Unevaluable
    • case class Exists(subquery: LogicalPlan, positive: Boolean)
    • case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

protected def nodeToExpr(node: Node): Expression = node match {
val EXISTS = "(?i)EXISTS".r

protected def nodeToExpr(node: Node, context: Context): Expression = node match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pass in context? We added context to the argument list of nodeToPlan to support creating view. We are not expecting a subqeury expr is for creating a view, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use the context in this PR, however, the def nodeToPlan(..) need the context, as in this implementation, I actually add 2 extra expressions, they take the LogcialPlan as parameters, which mean the function nodeToExpr will call nodeToPlan() and pass the context down. Otherwise I have to pass the null to nodeToPlan(), which probably even more confusing and error-prone.

* Exist subquery expression, only used in filter only
*/
case class Exists(subquery: LogicalPlan, positive: Boolean)
extends LeafExpression with SubQueryExpression {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format

@yhuai
Copy link
Contributor

yhuai commented Oct 21, 2015

Two general comments. First, we need to add document to explain how we rewrite a plan when (1) there is a uncorrelated subquery and (2) there is a correlated subquery. Second, for those rewriting rules, I am thinking if we can have more concise ones. For uncorrelated subqueries, the subquery itself should be a resolved logical plan, right? For correlated subqueries, we only need to extract those conditions referring columns in the outer query block, right? Do we really need to matching those different specific patterns? Can we have some general logics?

Actually, does this pr try to support uncorrelated in/not in/exists/not exists subqueries?

@chenghao-intel
Copy link
Contributor Author

Thank you @yhuai for reviewing this.
I've added some more docs for this PR, hopefully make more sense.

First, I'll agree with you to make a general logic to partially resolve the correlated condition within the subquery, but it's probably not that easy, particularly we need to give more concise error message to the end user, so my suggestion is to leave it for the future improvement, probably we will have better idea to simplify that by having enough feature supported with the follow up PRs (See my TODO in the description), as currently, the limit patterns actually works for most of cases.

Second, I totally agree with the Join Type comments, LeftSemiJoin <-> LeftSemi <-> LeftAnti, the motivation I am trying to make a parent class for LeftSemi / LeftAnti is for reducing the code change in Optimizer and SparkStrategies, maybe I should rename it to LeftSemiOrAntiJoin as the parent class. As well as the Operators' name, since we no longer the LeftSemiXXX, but also supports the LeftAntixxx.

Still, I hope we can merge this PR in 1.6 release, as it's almost 1 years passed since the previous PRs created in #3249 & #4812. And I will keep updating the code once we have the general agreement for the implementation.

@chenghao-intel
Copy link
Contributor Author

BTW: IN / NOT IN definitely supports the uncorrelated, but EXISTS/NOT EXISTS are not in this case, the same behavior as Hive does.

@SparkQA
Copy link

SparkQA commented Oct 21, 2015

Test build #44064 has finished for PR 9055 at commit cb69166.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * trait SubQueryExpression extends Unevaluable\n * case class Exists(subquery: LogicalPlan, positive: Boolean)\n * case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)\n

@jameszhouyi
Copy link

Hi @yhuai ,
This missing feature("IN" sub query) in Spark SQL blocked our real-world case. Could you please help to review this PR ? Strongly hopefully this PR feature can be merged in Spark 1.6.0 ( I saw the Hive implementation supported such feature ). Thanks in advanced !

@gatorsmile
Copy link
Member

@jameszhouyi
We hit the same issue. Now, we bypass it by using joins.

@jameszhouyi
Copy link

Thank you @gatorsmile for your suggestion.
I think this feature("IN" sub query) is necessary for Spark SQL engine as SQL-on-Hadoop.

@gatorsmile
Copy link
Member

@jameszhouyi
Agree. This is an important feature for any SQL engine. We are also waiting for this feature. So far, using joins is an alternative to bypass it.

@chenghao-intel
Copy link
Contributor Author

Unfortunately, we probably will miss this in Spark 1.6, as it's almost code freeze for 1.6. @rxin @yhuai

@marmbrus
Copy link
Contributor

marmbrus commented Nov 5, 2015

Yeah, sorry. It is too late for a patch this large.

@maver1ck
Copy link
Contributor

So what next ?

@roland-mendix
Copy link

[Moved to Spark dev mailing list as: Expression/LogicalPlan dichotomy in Spark SQL Catalyst]

@yhuai
Copy link
Contributor

yhuai commented Dec 31, 2015

I had a offline discussion with @chenghao-intel. We will split this PR to smaller PRs. The first work will be on the backend operators. Then, we will add parser and analyzer rule.

@yhuai
Copy link
Contributor

yhuai commented Dec 31, 2015

@chenghao-intel How about we close this PR for now?

@chenghao-intel
Copy link
Contributor Author

ok, closing it now

@gatorsmile
Copy link
Member

Found a related HIVE JIRA to support the left anti join: https://issues.apache.org/jira/browse/HIVE-12519

However, their proposed solution has a hole. Anyway, if we can support the anti join at the run time, it is much efficient.

asfgit pushed a commit that referenced this pull request Apr 19, 2016
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:

- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`

This PR is (loosely) based on the work of davies (#10706) and chenghao-intel (#9055). They should be credited for the work they did.

### How was this patch tested?
Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`

cc rxin, davies & chenghao-intel

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12306 from hvanhovell/SPARK-4226.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
9 participants