[SPARK-4226][SQL]Add subquery (not) in/exists support #9055

chenghao-intel · 2015-10-10T01:11:31Z

Some of the key concepts:

Correlated: The subquery references attributes of the parent query.
e.g. The subquery below references "a.value", which is an attribute of the parent query.

SELECT a.value FROM src a
WHERE a.key in (
  SELECT b.key FROM src1 b
  WHERE a.value > b.value)

Uncorrelated: The subquery does not reference any attribute of the parent query.

SELECT a.value FROM src a WHERE a.key IN (SELECT key FROM src WHERE key > 100);

Basic Logic for the Transformation

   EXISTS / IN => LEFT SEMI JOIN
   NOT EXISTS / NOT IN => LEFT ANTI JOIN
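
The mapping above relies on the semantics of the two join types. As a minimal sketch (plain Python, not Spark code; the function names and sample rows are made up for illustration), a left semi join keeps a left row once if any right row matches, and a left anti join keeps it only if none does. The sketch assumes there are no NULL values, since SQL's NOT IN behaves differently from a plain anti join when the subquery returns NULLs.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]
JoinCondition = Callable[[Row, Row], bool]


def left_semi_join(left: List[Row], right: List[Row], cond: JoinCondition) -> List[Row]:
    """Keep a left row (once) when at least one right row satisfies cond."""
    return [l for l in left if any(cond(l, r) for r in right)]


def left_anti_join(left: List[Row], right: List[Row], cond: JoinCondition) -> List[Row]:
    """Keep a left row only when no right row satisfies cond."""
    return [l for l in left if not any(cond(l, r) for r in right)]


# Illustrative data only (hypothetical); no NULLs are involved.
src = [{"key": 1, "value": 10}, {"key": 2, "value": 20}, {"key": 3, "value": 30}]
src1 = [{"key": 2, "value": 5}, {"key": 2, "value": 7}, {"key": 4, "value": 1}]


def same_key(a: Row, b: Row) -> bool:
    return a["key"] == b["key"]


# Corresponds to: key IN (SELECT key FROM src1).
# Key 2 matches two right rows but is emitted only once.
semi_keys = [row["key"] for row in left_semi_join(src, src1, same_key)]

# Corresponds to: key NOT IN (SELECT key FROM src1), assuming no NULLs.
anti_keys = [row["key"] for row in left_anti_join(src, src1, same_key)]

print(semi_keys, anti_keys)
```

Unlike an inner join, neither function duplicates a left row when several right rows match, which is why IN / EXISTS can be expressed this way without a DISTINCT.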

Conceptual demo with logical plans. We support cases like the ones below:

e.g. EXISTS / NOT EXISTS

 SELECT value FROM src a WHERE (NOT) EXISTS (SELECT 1 FROM src1 b WHERE a.key < b.key)
     ==>
 SELECT a.value FROM src a LEFT (ANTI) SEMI JOIN src1 b ON a.key < b.key

e.g. IN / NOT IN

 SELECT value FROM src a WHERE key (NOT) IN (SELECT key FROM src1 b WHERE a.value < b.value)
    ==>
 SELECT value FROM src a LEFT (ANTI) SEMI JOIN src1 b ON a.key = b.key AND a.value < b.value

e.g. IN / NOT IN with other conjunctions

 SELECT value FROM src a
 WHERE key (NOT) IN (
   SELECT key FROM src1 b WHERE a.value < b.value
 ) AND a.key > 10
    ==>
 SELECT value
 FROM (SELECT * FROM src WHERE key > 10) a
 LEFT (ANTI) SEMI JOIN src1 b ON a.key = b.key AND a.value < b.value
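
To make the last case concrete, here is a rough Python sketch of such a rewrite on a toy plan representation. It is not the rule from this PR: the classes `Relation`, `Filter` and `Join`, the string-based conditions and the `rewrite` function are hypothetical, and `InSubquery` only loosely mirrors the `InSubquery(child, subquery, positive)` shape reported by the test builds. The idea shown is that the non-subquery conjuncts stay in a filter on the left side, while the IN predicate and the correlated condition become the join condition.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class InSubquery:
    # Hypothetical simplification: the subquery is reduced to its relation,
    # its single output column, and its (possibly correlated) WHERE text.
    child: str
    subquery: Relation
    sub_column: str
    sub_condition: str
    positive: bool


@dataclass(frozen=True)
class Filter:
    conjuncts: Tuple[Union[str, InSubquery], ...]
    child: "Plan"


@dataclass(frozen=True)
class Join:
    left: "Plan"
    right: Relation
    join_type: str
    condition: str


Plan = Union[Relation, Filter, Join]


def rewrite(plan: Filter) -> Join:
    """Turn Filter(IN-subquery AND others) into a semi/anti join over Filter(others)."""
    subqueries = [c for c in plan.conjuncts if isinstance(c, InSubquery)]
    others = tuple(c for c in plan.conjuncts if not isinstance(c, InSubquery))
    if len(subqueries) != 1:
        raise ValueError("exactly one IN subquery predicate is expected")
    sub = subqueries[0]
    # Remaining conjuncts are evaluated below the join, on the left side only.
    left = Filter(others, plan.child) if others else plan.child
    condition = f"{sub.child} = {sub.sub_column}"
    if sub.sub_condition:
        condition += f" AND {sub.sub_condition}"
    join_type = "LeftSemi" if sub.positive else "LeftAnti"
    return Join(left, sub.subquery, join_type, condition)


query = Filter(
    conjuncts=(
        InSubquery("a.key", Relation("src1 b"), "b.key", "a.value < b.value", positive=True),
        "a.key > 10",
    ),
    child=Relation("src a"),
)
plan = rewrite(query)
print(plan)
```

Passing `positive=False` yields the `LeftAnti` variant with the same condition.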

There are also some limitations:

IN/NOT IN subqueries may only select a single column.
e.g.(bad example)

 SELECT value FROM src a WHERE key IN (SELECT key, value FROM src1 WHERE key > 10)

EXISTS/NOT EXISTS must have one or more correlated predicates.
e.g.(bad example)

 SELECT value FROM src a WHERE EXISTS (SELECT 1 FROM src1 b WHERE b.key > 10)

References to the parent query are only supported in the WHERE clause of the subquery.
e.g.(bad example)

 SELECT value FROM src a WHERE key IN (SELECT a.key + b.key FROM src1 b)

Only a single IN/EXISTS subquery predicate is supported.
e.g.(bad example)

 SELECT value FROM src WHERE key IN (SELECT xx1 FROM xxx1) AND key IN (SELECT xx2 FROM xxx2)

Disjunction is not supported at the top level.
e.g.(bad example)

 SELECT value FROM src WHERE key > 10 OR key IN (SELECT xx1 FROM xxx1)

Implicit reference expression substitution to the parent query is not supported.
e.g.(bad example)

 SELECT min(key) FROM src a HAVING EXISTS (SELECT 1 FROM src1 b WHERE b.key = min(a.key))
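
Two of the limitations above (a single subquery predicate, and no disjunction at the top level) can be pictured as a check over the WHERE condition. The following Python sketch is only an illustration under my own assumptions: the expression classes and `check_where` are hypothetical and are not the analysis code in this PR, and "top-level disjunction" is read here as an OR among the top-level conjuncts that contains a subquery predicate.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Pred:
    """An ordinary predicate such as 'key > 10' (hypothetical node)."""
    text: str


@dataclass(frozen=True)
class SubqueryPred:
    """Stands in for an IN / EXISTS subquery predicate (hypothetical node)."""
    text: str


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[Pred, SubqueryPred, And, Or]


def contains_subquery(expr: Expr) -> bool:
    if isinstance(expr, SubqueryPred):
        return True
    if isinstance(expr, (And, Or)):
        return contains_subquery(expr.left) or contains_subquery(expr.right)
    return False


def split_conjuncts(expr: Expr) -> List[Expr]:
    """Flatten nested ANDs into a list of top-level conjuncts."""
    if isinstance(expr, And):
        return split_conjuncts(expr.left) + split_conjuncts(expr.right)
    return [expr]


def check_where(expr: Expr) -> None:
    """Raise ValueError for the two unsupported shapes described above."""
    conjuncts = split_conjuncts(expr)
    for conjunct in conjuncts:
        if isinstance(conjunct, Or) and contains_subquery(conjunct):
            raise ValueError("subquery predicate under a disjunction is not supported")
    if sum(isinstance(c, SubqueryPred) for c in conjuncts) > 1:
        raise ValueError("only a single IN/EXISTS subquery predicate is supported")


def is_supported(expr: Expr) -> bool:
    try:
        check_where(expr)
        return True
    except ValueError:
        return False


in_xxx1 = SubqueryPred("key IN (SELECT xx1 FROM xxx1)")
in_xxx2 = SubqueryPred("key IN (SELECT xx2 FROM xxx2)")
key_gt_10 = Pred("key > 10")

print(is_supported(And(in_xxx1, key_gt_10)))  # conjunction with one subquery
print(is_supported(Or(key_gt_10, in_xxx1)))   # the disjunction bad example
print(is_supported(And(in_xxx1, in_xxx2)))    # the two-subquery bad example
```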

TODOs (future improvements)

a. Friendlier messages telling the user why analysis failed.
b. Support multiple IN / EXISTS clauses in the predicates.
c. Implicit reference expression substitution to the parent query.
d. More general support for correlated conditions, particularly nested ones in the subquery.
e. SQL parser support (closer to the SQL standard).
f. ...

chenghao-intel · 2015-10-10T01:16:05Z

cc @marmbrus @yhuai @ravipesala
This implementation is inspired by #3249 and its follow-up #4812, and is built on SubQueryExpression.

Since the anti join is another type of semi join, I added it back here for performance reasons when transforming "NOT EXISTS / NOT IN" subqueries.

SparkQA · 2015-10-10T01:33:27Z

Test build #43506 has finished for PR 9055 at commit e3aa255.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait SubQueryExpression extends Unevaluable
- case class Exists(subquery: LogicalPlan, positive: Boolean)
- case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

SparkQA · 2015-10-10T03:52:04Z

Test build #43508 has finished for PR 9055 at commit b382bc9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait SubQueryExpression extends Unevaluable
- case class Exists(subquery: LogicalPlan, positive: Boolean)
- case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

SparkQA · 2015-10-10T11:00:01Z

Test build #43528 has finished for PR 9055 at commit ab22171.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait SubQueryExpression extends Unevaluable
- case class Exists(subquery: LogicalPlan, positive: Boolean)
- case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

chenghao-intel · 2015-10-12T00:16:58Z

Seems the failure is not related.
retest this please

chenghao-intel · 2015-10-12T00:35:10Z

retest this please

scwf · 2015-10-12T01:44:42Z

what's the difference with #4812?

chenghao-intel · 2015-10-12T01:52:00Z

This is much simpler than #4812 because it uses SubQueryExpression, particularly when processing the
key IN (subquery) AND other_condition case. #4812 doesn't support the AND other_condition part.

scwf · 2015-10-12T02:03:35Z

OK, does this support multiple EXISTS and IN predicates in the WHERE clause?

chenghao-intel · 2015-10-12T02:08:49Z

No, we don't support that in this PR, but it should be very easy to support once this PR is merged. I can plan the work if you feel it's critical to your customers.

SparkQA · 2015-10-12T02:54:35Z

Test build #43552 has finished for PR 9055 at commit ab22171.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait SubQueryExpression extends Unevaluable
- case class Exists(subquery: LogicalPlan, positive: Boolean)
- case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

chenghao-intel · 2015-10-13T00:39:35Z

cc @rxin as well. This is required by many of our customers, and most of the code change is unit tests, so it should not be hard to follow.

SparkQA · 2015-10-15T09:11:39Z

Test build #43782 has finished for PR 9055 at commit 7511f47.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait SubQueryExpression extends Unevaluable
- case class Exists(subquery: LogicalPlan, positive: Boolean)
- case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

chenghao-intel · 2015-10-16T00:31:59Z

Seems not related.

chenghao-intel · 2015-10-16T00:32:05Z

retest this please

chenghao-intel · 2015-10-16T01:25:15Z

retest this please

SparkQA · 2015-10-16T03:46:20Z

Test build #43826 has finished for PR 9055 at commit 7511f47.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait SubQueryExpression extends Unevaluable
- case class Exists(subquery: LogicalPlan, positive: Boolean)
- case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

yhuai · 2015-10-21T04:11:39Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala

-  protected def nodeToExpr(node: Node): Expression = node match {
+  val EXISTS = "(?i)EXISTS".r
+
+  protected def nodeToExpr(node: Node, context: Context): Expression = node match {


Do we need to pass in context? We added context to the argument list of nodeToPlan to support creating views. We are not expecting a subquery expression to be used for creating a view, right?

We don't use the context in this PR. However, nodeToPlan(..) needs the context: this implementation adds 2 extra expressions that take a LogicalPlan as a parameter, which means nodeToExpr will call nodeToPlan() and pass the context down. Otherwise I would have to pass null to nodeToPlan(), which is probably even more confusing and error-prone.

yhuai · 2015-10-21T04:12:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

+ * Exist subquery expression, only used in filter only
+ */
+case class Exists(subquery: LogicalPlan, positive: Boolean)
+  extends LeafExpression with SubQueryExpression {


yhuai · 2015-10-21T04:20:49Z

Two general comments. First, we need to add documentation explaining how we rewrite a plan when (1) there is an uncorrelated subquery and (2) there is a correlated subquery. Second, I am wondering whether those rewriting rules can be more concise. For uncorrelated subqueries, the subquery itself should be a resolved logical plan, right? For correlated subqueries, we only need to extract the conditions that refer to columns in the outer query block, right? Do we really need to match those different specific patterns? Can we have some general logic?

Actually, does this PR try to support uncorrelated in/not in/exists/not exists subqueries?
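
The "extract the conditions that refer to the outer query block" idea can be sketched roughly as follows. This is a hedged illustration in Python over condition strings, not Catalyst code: `referenced_aliases` and `split_correlated` are hypothetical helpers, and the regex-based alias detection assumes every column reference is qualified (`a.value`), which a real analyzer would not rely on.

```python
import re
from typing import List, Set, Tuple


def referenced_aliases(conjunct: str) -> Set[str]:
    """Collect table aliases from qualified references such as `a.value`."""
    return set(re.findall(r"\b([A-Za-z_]\w*)\.[A-Za-z_]\w*", conjunct))


def split_correlated(
    conjuncts: List[str], outer_aliases: Set[str]
) -> Tuple[List[str], List[str]]:
    """Separate conjuncts that mention the outer query from purely local ones."""
    correlated = [c for c in conjuncts if referenced_aliases(c) & outer_aliases]
    local = [c for c in conjuncts if not (referenced_aliases(c) & outer_aliases)]
    return correlated, local


# WHERE clause of a subquery over `src1 b`, nested in an outer query over `src a`.
subquery_conjuncts = ["a.value > b.value", "b.key > 100"]
correlated, local = split_correlated(subquery_conjuncts, outer_aliases={"a"})
print(correlated, local)
```

Under this split, the correlated conjuncts would move into the join condition, while the local ones stay inside the subquery plan.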

chenghao-intel · 2015-10-21T14:08:26Z

Thank you @yhuai for reviewing this.
I've added some more docs for this PR; hopefully it makes more sense now.

First, I agree that we should have general logic to partially resolve the correlated conditions within the subquery, but that's probably not easy, particularly since we need to give concise error messages to the end user. My suggestion is to leave it for a future improvement. We will probably have a better idea of how to simplify it once the follow-up PRs add more features (see the TODOs in the description), and the current limited patterns actually work for most cases.

Second, I totally agree with the join type comments (LeftSemiJoin <-> LeftSemi <-> LeftAnti). My motivation for making a parent class for LeftSemi / LeftAnti is to reduce the code change in Optimizer and SparkStrategies; maybe I should rename the parent class to LeftSemiOrAntiJoin. The same goes for the operators' names, since they no longer handle only LeftSemiXXX but also LeftAntiXXX.

Still, I hope we can merge this PR in the 1.6 release, as almost a year has passed since the previous PRs #3249 & #4812 were created. I will keep updating the code once we have general agreement on the implementation.
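
The parent-class idea for LeftSemi / LeftAnti mentioned above might look roughly like this. It is a Python sketch of the design only, not the Scala code in the PR: apart from the names `LeftSemi`, `LeftAnti` and `LeftSemiOrAntiJoin` taken from the comment, everything here (`keep_matched`, `execute_join`) is hypothetical. The point is that a strategy can match the parent type once instead of handling each join type separately.

```python
from typing import Callable, List


class JoinType:
    """Base class for all join types."""


class Inner(JoinType):
    pass


class LeftSemiOrAntiJoin(JoinType):
    """Shared parent: both subclasses output only columns of the left side."""
    keep_matched: bool


class LeftSemi(LeftSemiOrAntiJoin):
    keep_matched = True


class LeftAnti(LeftSemiOrAntiJoin):
    keep_matched = False


def execute_join(
    join_type: JoinType,
    left: List[int],
    right: List[int],
    cond: Callable[[int, int], bool],
) -> List[int]:
    # One branch covers both semi and anti joins; only the keep/drop flag differs.
    if isinstance(join_type, LeftSemiOrAntiJoin):
        return [
            l for l in left
            if any(cond(l, r) for r in right) == join_type.keep_matched
        ]
    raise NotImplementedError(f"{type(join_type).__name__} is outside this sketch")


def equal(a: int, b: int) -> bool:
    return a == b


semi_result = execute_join(LeftSemi(), [1, 2, 3], [2, 4], equal)
anti_result = execute_join(LeftAnti(), [1, 2, 3], [2, 4], equal)
print(semi_result, anti_result)
```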

chenghao-intel · 2015-10-21T14:12:00Z

BTW: IN / NOT IN definitely support uncorrelated subqueries, but EXISTS / NOT EXISTS do not, which is the same behavior as Hive.

SparkQA · 2015-10-21T15:53:55Z

Test build #44064 has finished for PR 9055 at commit cb69166.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait SubQueryExpression extends Unevaluable
- case class Exists(subquery: LogicalPlan, positive: Boolean)
- case class InSubquery(child: Expression, subquery: LogicalPlan, positive: Boolean)

jameszhouyi · 2015-11-02T07:20:44Z

Hi @yhuai ,
The missing "IN" subquery feature in Spark SQL is blocking our real-world use case. Could you please help review this PR? We strongly hope this feature can be merged in Spark 1.6.0 (I saw that the Hive implementation supports it). Thanks in advance!

gatorsmile · 2015-11-04T20:44:02Z

@jameszhouyi
We hit the same issue. Now, we bypass it by using joins.
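
The comment does not say which joins were used, so the following is only one plausible form of such a workaround. It uses SQLite through Python's standard `sqlite3` module as a stand-in engine (not Spark), with made-up tables, to check that an uncorrelated IN can be written as a join against the distinct keys and NOT IN as a left outer join that keeps unmatched rows. It assumes the key columns contain no NULLs, where NOT IN and the outer-join form would diverge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src (key INTEGER, value TEXT);
    CREATE TABLE src1 (key INTEGER, value TEXT);
    INSERT INTO src VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO src1 VALUES (2, 'x'), (2, 'y'), (4, 'z');
    """
)

# Original form: IN subquery.
in_rows = conn.execute(
    "SELECT value FROM src WHERE key IN (SELECT key FROM src1) ORDER BY value"
).fetchall()

# Workaround: inner join against the distinct keys, so duplicates in src1
# do not multiply rows of src.
join_rows = conn.execute(
    "SELECT a.value FROM src a "
    "JOIN (SELECT DISTINCT key FROM src1) b ON a.key = b.key "
    "ORDER BY a.value"
).fetchall()

# Original form: NOT IN subquery.
not_in_rows = conn.execute(
    "SELECT value FROM src WHERE key NOT IN (SELECT key FROM src1) ORDER BY value"
).fetchall()

# Workaround: left outer join, keeping only rows without a match.
outer_rows = conn.execute(
    "SELECT a.value FROM src a "
    "LEFT JOIN src1 b ON a.key = b.key "
    "WHERE b.key IS NULL ORDER BY a.value"
).fetchall()

print(in_rows, join_rows, not_in_rows, outer_rows)
conn.close()
```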

jameszhouyi · 2015-11-05T00:49:39Z

Thank you @gatorsmile for your suggestion.
I think this feature("IN" sub query) is necessary for Spark SQL engine as SQL-on-Hadoop.

gatorsmile · 2015-11-05T01:03:07Z

@jameszhouyi
Agree. This is an important feature for any SQL engine. We are also waiting for this feature. So far, using joins is an alternative to bypass it.

chenghao-intel · 2015-11-05T01:11:00Z

Unfortunately, we probably will miss this in Spark 1.6, as it's almost code freeze for 1.6. @rxin @yhuai

marmbrus · 2015-11-05T19:05:21Z

Yeah, sorry. It is too late for a patch this large.

maver1ck · 2015-12-15T22:03:35Z

So what's next?

roland-mendix · 2015-12-18T09:26:02Z

[Moved to Spark dev mailing list as: Expression/LogicalPlan dichotomy in Spark SQL Catalyst]

yhuai · 2015-12-31T00:56:25Z

I had an offline discussion with @chenghao-intel. We will split this PR into smaller PRs. The first piece of work will be on the backend operators. Then, we will add the parser and analyzer rules.

yhuai · 2015-12-31T00:56:34Z

@chenghao-intel How about we close this PR for now?

chenghao-intel · 2015-12-31T01:25:54Z

ok, closing it now

gatorsmile · 2015-12-31T08:43:55Z

Found a related HIVE JIRA to support the left anti join: https://issues.apache.org/jira/browse/HIVE-12519

However, their proposed solution has a hole. Anyway, if we can support the anti join at run time, it will be much more efficient.

### What changes were proposed in this pull request?

This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:

- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`

This PR is (loosely) based on the work of davies (#10706) and chenghao-intel (#9055). They should be credited for the work they did.

### How was this patch tested?

Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`

cc rxin, davies & chenghao-intel

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12306 from hvanhovell/SPARK-4226.

add in/exists subquery support

7511f47

chenghao-intel force-pushed the anti_join branch from ab22171 to 7511f47 Compare October 15, 2015 08:44

yhuai reviewed Oct 21, 2015

chenghao-intel added 4 commits October 21, 2015 20:49

update the code as comments

97bc69b

code style

1b6f858

fix bug in the unit test

4c161e6

fix bug in unit test

cb69166

chenghao-intel closed this Dec 31, 2015

davies mentioned this pull request Jan 13, 2016

[SPARK-12543] [SPARK-4226] [SQL] Subquery in expression #10706

Closed

hvanhovell mentioned this pull request Apr 11, 2016

[SPARK-4226][SQL] Support IN/EXISTS Subqueries #12306

Closed
