[SPARK-45009][SQL] Decorrelate predicate subqueries in join condition #42725

andylam-db · 2023-08-29T19:12:04Z

What changes were proposed in this pull request?

This PR pulls up correlated subquery predicates in join conditions and rewrites them into ExistenceJoins when they are not pushed down into the join inputs.
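
To make the rewrite concrete, here is a minimal sketch in plain Python (not Spark code) of the semantics an existence join provides: every row of the outer side is kept exactly once and gains a boolean flag recording whether any subquery row matches. The join condition can then refer to that flag in place of the EXISTS predicate. The function names, the `exists` column name, and the sample rows are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def existence_join(left: List[Row], sub: List[Row],
                   matches: Callable[[Row, Row], bool],
                   flag: str = "exists") -> List[Row]:
    """Keep every left row once and add a boolean column `flag` that is
    True when at least one subquery row satisfies `matches`."""
    return [{**l, flag: any(matches(l, s) for s in sub)} for l in left]


def inner_join(left: List[Row], right: List[Row],
               cond: Callable[[Row, Row], bool]) -> List[Row]:
    """Plain nested-loop inner join over lists of dict rows."""
    return [{**l, **r} for l in left for r in right if cond(l, r)]


# Hypothetical query:
#   x JOIN y ON x.a = y.b AND EXISTS (SELECT 1 FROM z WHERE z.c = x.a)
x = [{"a": 1}, {"a": 2}, {"a": 3}]
y = [{"b": 1}, {"b": 2}, {"b": 3}]
z = [{"c": 1}, {"c": 3}, {"c": 3}]

# Step 1: the subquery only references x, so attach the flag to x.
x_flagged = existence_join(x, z, lambda l, s: s["c"] == l["a"])
# Step 2: the join condition uses the flag instead of the subquery.
result = inner_join(x_flagged, y,
                    lambda l, r: l["a"] == r["b"] and l["exists"])
```

Note that the duplicate `c = 3` rows in `z` do not duplicate the `a = 3` row, which is the property that distinguishes an existence join from an ordinary join against the subquery.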

Why are the changes needed?

This change allows correlated IN and EXISTS subqueries in join conditions. Such queries are valid SQL but were not previously supported by Spark SQL.

Does this PR introduce any user-facing change?

Yes, previously unsupported queries become supported.

How was this patch tested?

Added SQL tests for IN and EXISTS in join conditions, and cross-checked correctness against PostgreSQL (except for ANTI joins, which PostgreSQL does not support).

Permutations of the tests:

EXISTS / NOT EXISTS / IN / NOT IN
Subquery references left child / right child
Join type: inner / left outer
Transitive predicates to try invoking filter inference
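
The permutations above form a cross product. As a rough illustration only (the PR's tests are SQL tests, and the labels below are hypothetical names for the listed dimensions), the matrix could be enumerated like this. Treating the transitive-predicate variant as an on/off switch is an assumption.

```python
from itertools import product

predicates = ["EXISTS", "NOT EXISTS", "IN", "NOT IN"]
referenced_child = ["left", "right"]
join_types = ["inner", "left outer"]
# Assumption: model the transitive-predicate variant as a boolean switch.
transitive = [False, True]

test_matrix = list(product(predicates, referenced_child, join_types, transitive))

for pred, child, join, trans in test_matrix:
    suffix = " + transitive predicate" if trans else ""
    print(f"{pred} subquery referencing {child} child, {join} join{suffix}")
```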

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

jchen5

Let's also add test cases for other join types: right, full, semi, anti

Looks pretty good overall! CC @agubichev

sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala

andylam-db · 2023-09-06T16:41:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -55,7 +55,8 @@ abstract class Optimizer(catalogManager: CatalogManager)
    Set(
      "PartitionPruning",
      "RewriteSubquery",
-      "Extract Python UDFs")
+      "Extract Python UDFs",
+      "Infer Filters")


"Infer Filters" is not inherently idempotent. We need to exclude it from the idempotency check so that tests will pass.
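
For readers unfamiliar with the check being discussed, the following is a toy model in plain Python, not Spark's rule executor. It shows how a rule that does one round of inference per run can fail an "applying it twice changes nothing" check, and how an exclusion set by batch name sidesteps that. The plan representation, `infer_one_step`, and `run_once_batch` are hypothetical.

```python
from typing import Callable, FrozenSet, Set, Tuple

# A plan is modeled as a set of equality predicates such as ("a", "b") for a = b.
Plan = FrozenSet[Tuple[str, str]]


def infer_one_step(plan: Plan) -> Plan:
    """Toy stand-in for filter inference: one round of transitive closure.
    A single round may expose new pairs, so a second run can change the plan."""
    new = {(a, d) for (a, b) in plan for (c, d) in plan if b == c and a != d}
    return frozenset(plan | new)


def run_once_batch(name: str, rule: Callable[[Plan], Plan], plan: Plan,
                   excluded: Set[str]) -> Plan:
    """Run a batch once; unless excluded, verify that re-running it is a no-op."""
    result = rule(plan)
    if name not in excluded and rule(result) != result:
        raise AssertionError(f"Once batch '{name}' is not idempotent")
    return result


plan: Plan = frozenset({("a", "b"), ("b", "c"), ("c", "d")})

# With the batch excluded, the single-pass result is accepted as is.
accepted = run_once_batch("Infer Filters", infer_one_step, plan,
                          excluded={"Infer Filters"})
```

Calling `run_once_batch` with an empty `excluded` set raises on this input, because the second pass derives `("a", "d")` from pairs that only exist after the first pass.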

andylam-db · 2023-09-06T16:46:21Z

Looking for reviews! @cloud-fan @sigmod

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

andylam-db · 2023-10-04T23:40:24Z

Pinging for reviews! @allisonwang-db @cloud-fan

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

andylam-db · 2023-10-13T18:05:31Z

@cloud-fan I think the build is failing because of unrelated failing tests -- timeouts for KafkaSourceStressSuite and OracleIntegrationSuite.
Could you take a look and see if we can merge this?

1:

[info] KafkaSourceStressSuite:
[info] - stress test with multiple topics and partitions *** FAILED *** (1 minute, 4 seconds)
[info]   Timed out waiting for stream: The code passed to failAfter did not complete within 30 seconds.

2:

[info] OracleIntegrationSuite:
[info] org.apache.spark.sql.jdbc.v2.OracleIntegrationSuite *** ABORTED *** (8 minutes, 6 seconds)
[info]   The code passed to eventually never returned normally. Attempted 417 times over 7.011344010933333 minutes. Last failure message: ORA-12514: Cannot connect to database. Service freepdb1 is not registered with the listener at host 10.1.0.85 port 40439. (CONNECTION_ID=7qWwov5HRIGzNr+ypAdPFQ==). (DockerJDBCIntegrationSuite.scala:166)

cloud-fan · 2023-10-16T03:39:36Z

yea they are unrelated, thanks, merging to master!

…tion of predicate subqueries in join condition which reference both join child

### What changes were proposed in this pull request?

This is a follow up PR for #42725, which decorrelates predicate subqueries in join conditions. I forgot to add the error class definition for the case where the subquery references both join children, and test cases for it.

### Why are the changes needed?

To show a clear error message when the condition is hit.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added SQL test and golden files.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46708 from andylam-db/follow-up-decorrelate-subqueries-in-join-cond.

Authored-by: Andy Lam <andy.lam@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
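
The follow-up concerns subqueries whose outer references span both join children. A minimal sketch of that classification, in plain Python rather than Catalyst code, might look like the following. The exception type, the function name, and the string labels are hypothetical, and how Spark handles a subquery that references neither child is not described here, so the sketch only reports that case.

```python
from typing import Set


class BothJoinChildrenReferenced(Exception):
    """Hypothetical error type for illustration; not Spark's error class."""


def subquery_side(outer_refs: Set[str], left_cols: Set[str],
                  right_cols: Set[str]) -> str:
    """Report which join input a predicate subquery refers to,
    based on the outer columns it references."""
    uses_left = bool(outer_refs & left_cols)
    uses_right = bool(outer_refs & right_cols)
    if uses_left and uses_right:
        raise BothJoinChildrenReferenced(
            f"subquery references both join children: {sorted(outer_refs)}")
    if uses_left:
        return "left"
    if uses_right:
        return "right"
    return "none"


# Hypothetical join of x(a) with y(b).
left_cols, right_cols = {"x.a"}, {"y.b"}
print(subquery_side({"x.a"}, left_cols, right_cols))
print(subquery_side({"y.b"}, left_cols, right_cols))
```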

Squashed commits

48b3b19

github-actions bot added the SQL label Aug 29, 2023

Change doc description

6a5bff2

andylam-db changed the title ~~Decorrelate predicate subqueries in join condition~~ [SPARK-45009] Decorrelate predicate subqueries in join condition Aug 29, 2023

andylam-db changed the title ~~[SPARK-45009] Decorrelate predicate subqueries in join condition~~ [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition Aug 29, 2023

andylam-db marked this pull request as draft August 29, 2023 20:45

jchen5 reviewed Aug 29, 2023

andylam-db added 5 commits August 29, 2023 14:26

Simplify predicate conjunction

3fe020d

.

cc3f4a4

Create new error class

63df16c

Change error name and signature

05204e2

Remove "new" keyword

8ba4ffd

jchen5 reviewed Aug 30, 2023

andylam-db added 2 commits September 1, 2023 13:48

Generate golden files

5d0652f

Add semi/anti/full outer/right outer join cases

4e13013

agubichev approved these changes Sep 1, 2023

andylam-db added 3 commits September 1, 2023 16:15

Remove order by invalid columns in left semi/anti join tests

6189d84

Reduce test size

8d57724

Exclude "infer filters" from idempotency check

8ea929a

andylam-db marked this pull request as ready for review September 5, 2023 20:40

andylam-db requested review from jchen5 and agubichev September 5, 2023 20:41

agubichev approved these changes Sep 5, 2023

andylam-db added 2 commits September 5, 2023 17:30

Major refactor to allow no rewrite of uncorrelated IN

7ea9b51

Revert unwanted change

13eacd7

andylam-db commented Sep 6, 2023

jchen5 reviewed Sep 6, 2023

jchen5 approved these changes Sep 6, 2023

Add conf doc comment

62482c3

allisonwang-db approved these changes Oct 5, 2023

cloud-fan reviewed Oct 11, 2023

cloud-fan reviewed Oct 11, 2023

jchen5 reviewed Oct 11, 2023

andylam-db added 4 commits October 11, 2023 12:58

Merge branch 'master' into correlated-subquery-in-join-cond

eb3c339

Flip legacy flag to true

e8b5d12

Add fix for if subqueries dont reference join left/right, prevent endless loop

7a3ade5

Make changes with in-null-semantics tests

cf63d52

cloud-fan approved these changes Oct 12, 2023

Retrigger tests

3bc6338

cloud-fan closed this in 4fd2d68 Oct 16, 2023

andylam-db mentioned this pull request May 22, 2024

[SPARK-45009][SQL][FOLLOW UP] Add error class and tests for decorrelation of predicate subqueries in join condition which reference both join child #46708

Closed
