[SPARK-36114][SQL] Support subqueries with correlated non-equality predicates #38135

allisonwang-db · 2022-10-06T18:20:12Z

What changes were proposed in this pull request?

This PR supports correlated non-equality predicates in subqueries. It leverages the DecorrelateInnerQuery framework to decorrelate subqueries with non-equality predicates. DecorrelateInnerQuery inserts domain joins in the query plan and the rule RewriteCorrelatedScalarSubquery rewrites the domain joins into actual joins with the outer query.

Note that correlated non-equality predicates can lead to query plans with non-equality join conditions, which may be planned as a broadcast nested loop join or a cartesian product.
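To make the domain-join idea concrete, here is a minimal sketch in plain Scala 2.13 collections (not Spark code) for the query `SELECT (SELECT min(c2) FROM t2 WHERE t1.c1 > t2.c1) FROM t1`. The object name, the `T2Row` case class, and both function names are made up for illustration. The three steps are a simplified reading of the description above, not the actual DecorrelateInnerQuery implementation.

```scala
// Toy model of: SELECT (SELECT min(c2) FROM t2 WHERE t1.c1 > t2.c1) FROM t1
// Plain Scala collections stand in for tables; all names here are hypothetical.
object DomainJoinSketch {
  final case class T2Row(c1: Int, c2: Int)

  // Correlated evaluation: run the subquery once per outer row of t1.
  def correlated(t1: Seq[Int], t2: Seq[T2Row]): Seq[Option[Int]] =
    t1.map { outer =>
      t2.filter(r => outer > r.c1).map(_.c2).minOption
    }

  // Decorrelated evaluation in three steps.
  def decorrelated(t1: Seq[Int], t2: Seq[T2Row]): Seq[Option[Int]] = {
    // 1. Domain: the distinct values of the outer reference t1.c1.
    val domain = t1.distinct

    // 2. Domain join on the non-equality condition, then aggregate per domain value.
    val joined = for {
      d <- domain
      r <- t2 if d > r.c1
    } yield (d, r.c2)
    val minByDomain: Map[Int, Int] = joined.groupMapReduce(_._1)(_._2)(_ min _)

    // 3. Left outer join back to t1; outer rows without a match yield None (SQL NULL).
    t1.map(minByDomain.get)
  }
}
```

Step 2 is a join with no equality key, which is why such plans may end up as a nested loop join or a cartesian product.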

Why are the changes needed?

To improve subquery support in Spark.

Does this PR introduce any user-facing change?

Yes. Before this PR, Spark did not allow correlated non-equality predicates in subqueries.
For example:

SELECT (SELECT min(c2) FROM t2 WHERE t1.c1 > t2.c1) FROM t1

Before this PR, this query threw an exception: Correlated column is not allowed in a non-equality predicate

After this PR, this query can run successfully.

How was this patch tested?

Unit tests and SQL query tests.

allisonwang-db · 2022-10-11T20:36:41Z

cc @cloud-fan @dtenedor

amaliujia · 2022-10-14T00:49:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

+      // and lateral subqueries.
+      val allowNonEqualityPredicates =
+        SQLConf.get.decorrelateInnerQueryEnabled && (isScalar || isLateral)
+      if (!allowNonEqualityPredicates && predicates.nonEmpty) {


Sorry, I am missing some context:

Now that non-equality predicates are supported, what gaps remain? I assume all predicates are supported now?

Oh, you have an example below, which makes sense:

-- Correlated equality predicates that are not supported after SPARK-35080
SELECT c, (
  SELECT count(*)
  FROM (VALUES ('ab'), ('abc'), ('bc')) t2(c)
  WHERE t1.c = substring(t2.c, 1, 1)
) FROM (VALUES ('a'), ('b')) t1(c);

amaliujia · 2022-10-14T00:50:21Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala

@@ -881,11 +881,10 @@ class AnalysisErrorSuite extends AnalysisTest {
      ($"a" + $"c" === $"b", "(a#x + outer(c#x)) = b#x"),
      (And($"a" === $"c", Cast($"d", IntegerType) === $"c"), "CAST(d#x AS INT) = outer(c#x)"))
    conditions.foreach { case (cond, msg) =>
-      val plan = Project(
-        ScalarSubquery(
+      val plan = Filter(


Nit: it looks like this Project -> Filter change is not necessary.

You are right. Project can host IN/EXISTS now.

amaliujia · 2022-10-14T00:57:00Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+          |FROM (SELECT CAST(c1 AS SHORT) a FROM t1)
+          |""".stripMargin)
+      checkAnswer(df, Row(5) :: Row(null) :: Nil)
+      checkNumJoins(df.queryExecution.optimizedPlan, 2)


I am missing context here: why do we need to check the number of joins only for this case? I did a code search, and it seems that other test cases in this suite do not check the number of joins.

I am verifying that the optimized plan has 1 left outer join and 1 domain (inner) join.

Can we check the number of joins for the safe cast case as well?
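As a rough illustration of what a join-count check verifies, the sketch below walks a toy plan tree and collects the join types it finds. Everything here is hypothetical: the `Plan`, `Relation`, `Aggregate`, and `Join` classes are stand-ins rather than Catalyst classes, the plan shape is an assumption, and this is not the suite's `checkNumJoins` helper, whose implementation is not shown in this thread.

```scala
// Toy plan tree; these classes are illustrative stand-ins, not Catalyst nodes.
object JoinCountSketch {
  sealed trait Plan { def children: Seq[Plan] }

  final case class Relation(name: String) extends Plan {
    val children: Seq[Plan] = Nil
  }
  final case class Aggregate(child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
  }
  final case class Join(left: Plan, right: Plan, joinType: String) extends Plan {
    val children: Seq[Plan] = Seq(left, right)
  }

  // Pre-order traversal that records the type of every join node.
  def joinTypes(plan: Plan): Seq[String] = {
    val here = plan match {
      case j: Join => Seq(j.joinType)
      case _       => Nil
    }
    here ++ plan.children.flatMap(joinTypes)
  }

  // Assumed shape: the outer query left-outer-joined to an aggregate over a domain (inner) join.
  val examplePlan: Plan =
    Join(
      Relation("t1"),
      Aggregate(Join(Relation("domain"), Relation("t2"), "Inner")),
      "LeftOuter")
}
```

Checking `joinTypes(examplePlan).size == 2` corresponds to the "1 left outer join and 1 domain (inner) join" expectation described above.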

cloud-fan · 2022-10-19T14:20:25Z

is this only for scalar subquery?

allisonwang-db · 2022-10-20T00:38:09Z

is this only for scalar subquery?

Scalar and lateral subqueries. IN and EXISTS subqueries are not supported because they are not using the new decorrelation framework.
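The gating described here can be sketched as a small standalone function. Only the boolean condition mirrors the `allowNonEqualityPredicates` expression in the CheckAnalysis diff above. The `SubqueryKind` type and the function signature are hypothetical, since the real code reads the flag from `SQLConf` and uses `isScalar` / `isLateral` booleans.

```scala
// Hypothetical standalone model of the gate; not the actual CheckAnalysis code.
object NonEqualityGate {
  sealed trait SubqueryKind
  case object Scalar extends SubqueryKind
  case object Lateral extends SubqueryKind
  case object In extends SubqueryKind
  case object Exists extends SubqueryKind

  // Non-equality correlated predicates are allowed only when the new
  // decorrelation framework is enabled and the subquery is scalar or lateral.
  def allowNonEqualityPredicates(
      decorrelateInnerQueryEnabled: Boolean,
      kind: SubqueryKind): Boolean =
    decorrelateInnerQueryEnabled && (kind == Scalar || kind == Lateral)
}
```

Under this model, IN and EXISTS subqueries are rejected regardless of the flag, matching the reply above.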

amaliujia

LGTM

cloud-fan · 2022-10-24T03:10:40Z

thanks, merging to master!

Closes apache#38135 from allisonwang-db/spark-36114-non-equality-pred.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

github-actions bot added the SQL label Oct 7, 2022

amaliujia reviewed Oct 14, 2022

allisonwang-db force-pushed the spark-36114-non-equality-pred branch from 66ebb37 to c2bad52 on October 19, 2022 18:54

amaliujia reviewed Oct 20, 2022

cloud-fan reviewed Oct 20, 2022

cloud-fan approved these changes Oct 20, 2022

allisonwang-db added 5 commits October 21, 2022 15:46

3bbc043 allow non-equality predicates
4ac7b4f retrigger build
9cbfb2b fix tests
c719647 address comments
a3e94a4 address comments

allisonwang-db force-pushed the spark-36114-non-equality-pred branch from 6080efb to a3e94a4 on October 21, 2022 22:52

cloud-fan approved these changes Oct 24, 2022

cloud-fan closed this in 4d33ee0 Oct 24, 2022
