[SPARK-56677][SQL] Propagate filter conditions through `Join` nodes in `PlanMerger` by peter-toth · Pull Request #55628 · apache/spark

peter-toth · 2026-04-30T16:16:49Z

What changes were proposed in this pull request?

PlanMerger now supports filter propagation through Join nodes when merging similar subplans. Previously, when two subplans contained identical Join nodes but differed only in a filter applied to one of the join's children, they could not be merged.

This PR adds the ability to propagate such filter conditions through a Join and into the parent Aggregate's FILTER clause. A new filterSafeForJoin helper checks that the filter originates from the non-nullable (preserved) side of the join: the left side of LeftOuter/LeftSemi/LeftAnti, the right side of RightOuter, or either side of Inner/Cross. FullOuter joins are not eligible.
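
The eligibility rule above can be modelled in a few lines. This is only an illustrative sketch: the name `filter_safe_for_join`, the string encoding of join types and sides, and the conservative handling of unknown join types are assumptions made for the example, not the signature or behaviour of the actual Scala helper.

```python
# Hypothetical model of the rule described in the PR text; not the Scala helper.
# For each join type, the set of sides whose rows are preserved (never NULL-padded)
# and whose attributes appear in the join output.
SAFE_SIDES = {
    "Inner": {"left", "right"},
    "Cross": {"left", "right"},
    "LeftOuter": {"left"},
    "LeftSemi": {"left"},
    "LeftAnti": {"left"},
    "RightOuter": {"right"},
    "FullOuter": set(),  # both sides can be NULL-padded, so never eligible
}


def filter_safe_for_join(join_type: str, filter_side: str) -> bool:
    """Return True if a filter coming from `filter_side` may pass through the join."""
    if filter_side not in ("left", "right"):
        raise ValueError(f"unknown side: {filter_side!r}")
    # Assumption: unknown join types are treated as not eligible.
    return filter_side in SAFE_SIDES.get(join_type, set())
```

For example, `filter_safe_for_join("LeftOuter", "left")` is accepted while `filter_safe_for_join("LeftOuter", "right")` and any `FullOuter` case are rejected.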

The feature is gated by a new SQL config spark.sql.optimizer.mergeSubplans.filterPropagation.throughJoin.enabled (default: false).

Why are the changes needed?

Without this change, scalar subqueries that differ only in a filter on one side of an identical join cannot be merged, resulting in redundant scans and compute. For example:

SELECT
  (SELECT sum(key) FROM t1 JOIN t2 ON t1.id = t2.id),
  (SELECT sum(key) FROM t1 JOIN t2 ON t1.id = t2.id WHERE t2.b > 1)

Without the merge, each subquery scans and joins t1 and t2 in full even though both share the same base join. After this change, a single merged scan is used and the second subquery's result is derived from it via an aggregate FILTER clause.
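
To make the equivalence concrete, here is a small pure-Python model of the inner-join case. It is a sketch under stated assumptions: the table contents are made up, `key` is assumed to be a column of t1 and `b` a column of t2, and the boolean column `f` stands in for the filter attribute that the merged plan carries through the join.

```python
# Toy data (invented for illustration): key is assumed to live in t1, b in t2.
t1 = [{"id": 1, "key": 10}, {"id": 2, "key": 20}, {"id": 3, "key": 30}]
t2 = [{"id": 1, "b": 0}, {"id": 2, "b": 5}, {"id": 2, "b": 7}]


def inner_join(left, right):
    """INNER JOIN on id."""
    return [{**l, **r} for l in left for r in right if l["id"] == r["id"]]


# Unmerged plan: two independent scans and joins.
q1 = sum(row["key"] for row in inner_join(t1, t2))
q2 = sum(row["key"] for row in inner_join(t1, [r for r in t2 if r["b"] > 1]))

# Merged plan: one join; t2 carries the flag f = (b > 1) instead of being filtered.
joined = inner_join(t1, [{**r, "f": r["b"] > 1} for r in t2])
m1 = sum(row["key"] for row in joined)               # sum(key)
m2 = sum(row["key"] for row in joined if row["f"])   # sum(key) FILTER (WHERE f)

print((q1, q2), (m1, m2))  # prints (50, 40) (50, 40)
```

Because an inner join never NULL-pads either side, every joined row has a definite value of `f`, so filtering at the aggregate gives the same result as filtering before the join.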

Does this PR introduce any user-facing change?

Yes. When spark.sql.optimizer.mergeSubplans.filterPropagation.throughJoin.enabled is set to true, the optimizer may merge scalar subqueries that were previously kept separate, reducing the number of scan and join operations.

How was this patch tested?

Added unit tests in MergeSubplansSuite:

Merge with filter on left inner join child
Merge with filter on right inner join child
No merge when both join children have independent filters
Merge with filter on the preserved side of a LeftSemi join
No merge when filter is on the non-output side of a LeftSemi join
No merge when filter is on the nullable side of an outer join
No merge when the feature is disabled via config

Added integration test in PlanMergeSuite verifying correctness (checkAnswer) and plan shape (SubqueryExec/ReusedSubqueryExec counts) for both the enabled and disabled config cases, with and without AQE.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

peter-toth · 2026-04-30T16:24:26Z

I measured the following improvements with the affected queries:

[info] TPCDS:                                                                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------------------------------------
[info] q77                                                                                       384            449          77         70.4          14.2       1.0X
[info] q77 - symmetric filter propagation and filter propagation through join enabled            325            342          13         83.2          12.0       1.2X

[info] TPCDS:                                                                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------------------------------------
[info] q88                                                                                     10145          10593         635          1.4         732.5       1.0X
[info] q88 - symmetric filter propagation and filter propagation through join enabled           1358           1393          48         10.2          98.1       7.5X

[info] TPCDS:                                                                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------------------------------------
[info] q90                                                                                      2496           2544          68          1.5         676.3       1.0X
[info] q90 - symmetric filter propagation and filter propagation through join enabled           1153           1171          26          3.2         312.4       2.2X

peter-toth · 2026-04-30T16:25:24Z

cc @LuciferYang, @dongjoon-hyun, @yaooqinn, @cloud-fan

dongjoon-hyun · 2026-04-30T19:08:57Z

      .createWithDefault(false)

+  val MERGE_SUBPLANS_FILTER_PROPAGATION_THROUGH_JOIN_ENABLED =
+    buildConf("spark.sql.optimizer.mergeSubplans.filterPropagationThroughJoin.enabled")


It seems that we had better follow the config namespace rule.

I will rebase this PR after #55633, but I already renamed the new config to be in the .filterPropagation namespace.

dongjoon-hyun · 2026-04-30T19:31:02Z

+        "merging, allowing subplans that differ only in their filter conditions and share a " +
+        "common join to be merged into a single scan. A filter attribute is only propagated " +
+        "through a join when it originates from the non-nullable (preserved) side: the left side " +
+        "of LeftOuter/LeftSemi/LeftAnti, the right side of RightOuter, or either  side of " +


nit. "either  side" (double space) -> "either side"

Fixed in 4d87515.

dongjoon-hyun · 2026-04-30T19:34:49Z

+ * produces a boolean attribute that flows through the join output to the enclosing
+ * [[Aggregate]]. Propagation is skipped when both the left and right children simultaneously
+ * produce filter attributes, as combining them would require an additional AND alias above
+ * the join (not yet supported).


According to the filterSafeForJoin, shall we mention NULL-pad cases a little more?

I rewrote this part entirely to elaborate on the problem with NULL-padded sides: 4d87515.
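
A toy counterexample may help show why the nullable side is excluded. The sketch below models a left outer join whose right child differs by a filter; the data, the column names and the flag `f` are assumptions made for illustration, and Python's `None` stands in for SQL NULL.

```python
# Toy data (invented for illustration).
t1 = [{"id": 1, "key": 10}, {"id": 2, "key": 20}, {"id": 3, "key": 30}]
t2 = [{"id": 1, "b": 0}, {"id": 2, "b": 5}]


def left_outer_join(left, right, right_cols):
    """LEFT OUTER JOIN on id; unmatched left rows get None for every right column."""
    out = []
    for l in left:
        matches = [r for r in right if r["id"] == l["id"]]
        if matches:
            out.extend({**r, **l} for r in matches)
        else:
            out.append({**dict.fromkeys(right_cols), **l})
    return out


# Original subquery: the filter b > 1 is applied to t2 BEFORE the outer join,
# so every t1 row survives (NULL-padded if needed) and contributes to the sum.
expected = sum(
    row["key"]
    for row in left_outer_join(t1, [r for r in t2 if r["b"] > 1], ["b"])
)

# Unsafe merge: keep all t2 rows, carry f = (b > 1), filter at the aggregate.
joined = left_outer_join(t1, [{**r, "f": r["b"] > 1} for r in t2], ["b", "f"])
merged = sum(row["key"] for row in joined if row["f"])  # FILTER (WHERE f)

print(expected, merged)  # prints 60 20
```

In this data the row with id 3 is NULL-padded, so `f` is `None` and the aggregate filter drops it, and the row with id 1 has `f = False` although the original subquery would have kept it as a NULL-padded row. Both rows should contribute to the sum, which is why the merged result is wrong.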

dongjoon-hyun · 2026-04-30T19:36:26Z


+  val MERGE_SUBPLANS_FILTER_PROPAGATION_THROUGH_JOIN_ENABLED =
+    buildConf(
+      "spark.sql.optimizer.mergeSubplans.filterPropagation.filterPropagationThroughJoin.enabled")


Shall we rename filterPropagationThroughJoin -> throughJoin? Option (2) is shorter and better in general.

(1) spark.sql.optimizer.mergeSubplans.filterPropagation.filterPropagationThroughJoin.enabled

(2) spark.sql.optimizer.mergeSubplans.filterPropagation.throughJoin.enabled

Good idea, renamed in 4d87515.

dongjoon-hyun · 2026-04-30T19:38:17Z

+                       // rows are NULL-padded so f=NULL, causing FILTER (WHERE f) to incorrectly
+                       // exclude rows that should contribute to the aggregate. Right-side
+                       // attributes are also absent from semi/anti join output.
+                       (leftNPFilter.isEmpty  && leftCPFilter.isEmpty  ||


"isEmpty  &&" (double space) -> "isEmpty &&" because the Apache Spark community doesn't use vertical alignment.

Fixed in 4d87515.

dongjoon-hyun · 2026-04-30T19:39:52Z

+
+      comparePlans(Optimize.execute(originalQuery.analyze), originalQuery.analyze)
+    }
+  }


Shall we add Cross (positive) and FullOuter (negative) test coverage?

Added in 4d87515.

…PlanMerger

### What changes were proposed in this pull request?

`PlanMerger` now supports filter propagation through `Join` nodes when merging similar subplans. Previously, when two subplans contained identical `Join` nodes but differed only in a filter applied to one of the join's children, they could not be merged.

This PR adds the ability to propagate such filter conditions through a `Join` and into the parent `Aggregate`'s `FILTER` clause. A new `filterSafeForJoin` helper checks that the filter originates from the non-nullable (preserved) side of the join: the left side of `LeftOuter`/`LeftSemi`/`LeftAnti`, the right side of `RightOuter`, or either side of `Inner`/`Cross`. `FullOuter` joins are not eligible.

The feature is gated by a new SQL config: `spark.sql.optimizer.mergeSubplans.filterPropagationThroughJoin.enabled` (default: `true`).

### Why are the changes needed?

Without this change, scalar subqueries that differ only in a filter on one side of an identical join cannot be merged, resulting in redundant scans and compute. For example:

SELECT
  (SELECT sum(key) FROM t1 JOIN t2 ON t1.id = t2.id),
  (SELECT sum(key) FROM t1 JOIN t2 ON t1.id = t2.id WHERE t2.b > 1)

Both subqueries scan `t1` and `t2` in full even though they share the same base join. After this change a single merged scan is used and the second subquery's result is derived from it via an aggregate `FILTER` clause.

### Does this PR introduce _any_ user-facing change?

Yes. The optimizer may now merge scalar subqueries that were previously kept separate, reducing the number of scan and join operations. The new config `spark.sql.optimizer.mergeSubplans.filterPropagationThroughJoin.enabled` (default `true`) can be used to opt out.

### How was this patch tested?

Added unit tests in `MergeSubplansSuite`:

- Merge with filter on left inner join child
- Merge with filter on right inner join child
- No merge when both join children have independent filters
- Merge with filter on the preserved side of a `LeftSemi` join
- No merge when filter is on the non-output side of a `LeftSemi` join
- No merge when filter is on the nullable side of an outer join
- No merge when the feature is disabled via config

Added integration test in `PlanMergeSuite` verifying correctness (`checkAnswer`) and plan shape (`SubqueryExec`/`ReusedSubqueryExec` counts) for both the enabled and disabled config cases, with and without AQE.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

dongjoon-hyun

+1, LGTM. Thank you, @peter-toth.

dongjoon-hyun · 2026-05-01T22:38:38Z

Merged to master for Apache Spark 4.2.0.

peter-toth marked this pull request as draft April 30, 2026 18:55

dongjoon-hyun reviewed Apr 30, 2026

peter-toth force-pushed the SPARK-56677-filter-propagation-through-join branch from 2087a88 to bd5f812 Compare April 30, 2026 19:18

peter-toth marked this pull request as ready for review April 30, 2026 19:19

dongjoon-hyun reviewed Apr 30, 2026

peter-toth added 2 commits May 1, 2026 14:13

address review findings

4d87515

peter-toth force-pushed the SPARK-56677-filter-propagation-through-join branch from bd5f812 to 4d87515 Compare May 1, 2026 13:25

dongjoon-hyun approved these changes May 1, 2026

dongjoon-hyun closed this in 8457567 May 1, 2026
