[SPARK-56677][SQL] Propagate filter conditions through Join nodes in PlanMerger #55628

Closed
peter-toth wants to merge 2 commits into apache:master from peter-toth:SPARK-56677-filter-propagation-through-join

Contributor

@peter-toth peter-toth commented Apr 30, 2026

What changes were proposed in this pull request?

PlanMerger now supports filter propagation through Join nodes when merging similar subplans. Previously, when two subplans contained identical Join nodes but differed only in a filter applied to one of the join's children, they could not be merged.

This PR adds the ability to propagate such filter conditions through a Join and into the parent Aggregate's FILTER clause. A new filterSafeForJoin helper checks that the filter originates from the non-nullable (preserved) side of the join: the left side of LeftOuter/LeftSemi/LeftAnti, the right side of RightOuter, or either side of Inner/Cross. FullOuter joins are not eligible.
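
The eligibility rule in the paragraph above can be modeled as a small lookup table. This is only an illustrative Python sketch: the real `filterSafeForJoin` helper is Scala code inside Spark, its signature is not shown in this PR description, and the names `PRESERVED_SIDES` and `filter_safe_for_join` are hypothetical.

```python
# Hypothetical model of the rule described above; not Spark's actual code.
# Maps each join type to the sides whose rows are never NULL-padded and
# whose attributes remain available in the join output.
PRESERVED_SIDES = {
    "Inner": {"left", "right"},
    "Cross": {"left", "right"},
    "LeftOuter": {"left"},
    "LeftSemi": {"left"},
    "LeftAnti": {"left"},
    "RightOuter": {"right"},
    "FullOuter": set(),  # either side may be NULL-padded, so never eligible
}


def filter_safe_for_join(join_type: str, side: str) -> bool:
    """Return True if a filter coming from `side` may be propagated through the join."""
    return side in PRESERVED_SIDES.get(join_type, set())


print(filter_safe_for_join("LeftOuter", "left"))   # preserved side
print(filter_safe_for_join("LeftOuter", "right"))  # nullable side
```

Unknown join types fall through to "not safe" in this sketch, which mirrors the conservative stance the description takes for `FullOuter`.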

The feature is gated by a new SQL config spark.sql.optimizer.mergeSubplans.filterPropagation.throughJoin.enabled (default: false).

Why are the changes needed?

Without this change, scalar subqueries that differ only in a filter on one side of an identical join cannot be merged, resulting in redundant scans and compute. For example:

  SELECT
    (SELECT sum(key) FROM t1 JOIN t2 ON t1.id = t2.id),
    (SELECT sum(key) FROM t1 JOIN t2 ON t1.id = t2.id WHERE t2.b > 1)

Both subqueries scan t1 and t2 in full even though they share the same base join. After this change a single merged scan is used and the second subquery's result is derived from it via an aggregate FILTER clause.
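
To make the equivalence concrete, here is a toy Python simulation of the example, not Spark code. The table contents, and the assumption that `key` is a column of `t1`, are invented for illustration. It joins once, carries a boolean flag `f` for `t2.b > 1` through the inner join, and lets the second aggregate consume the flag the way an aggregate `FILTER (WHERE f)` clause would.

```python
# Toy data; contents are assumptions made purely for illustration.
t1 = [{"id": 1, "key": 10}, {"id": 2, "key": 20}, {"id": 3, "key": 30}]
t2 = [{"id": 1, "b": 0}, {"id": 2, "b": 5}, {"id": 3, "b": 7}]


def inner_join(left, right):
    """Naive nested-loop inner join on the id column."""
    return [{**l, **r} for l in left for r in right if l["id"] == r["id"]]


# Unmerged plan: each subquery runs its own join.
sum_all = sum(row["key"] for row in inner_join(t1, t2))
sum_filtered = sum(row["key"] for row in inner_join(t1, t2) if row["b"] > 1)

# Merged plan: tag t2 rows with the flag f, join once, and apply the flag
# inside the second aggregate instead of filtering before the join.
t2_flagged = [{**r, "f": r["b"] > 1} for r in t2]
joined_once = inner_join(t1, t2_flagged)
merged_all = sum(row["key"] for row in joined_once)
merged_filtered = sum(row["key"] for row in joined_once if row["f"])

print((sum_all, sum_filtered) == (merged_all, merged_filtered))  # True
```

Because the join is an inner join, every output row carries a non-NULL flag, so filtering inside the aggregate gives the same result as filtering before it.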

Does this PR introduce any user-facing change?

Yes. When spark.sql.optimizer.mergeSubplans.filterPropagation.throughJoin.enabled is set to true, the optimizer may merge scalar subqueries that were previously kept separate, reducing the number of scan and join operations.

How was this patch tested?

Added unit tests in MergeSubplansSuite:

  • Merge with filter on left inner join child
  • Merge with filter on right inner join child
  • No merge when both join children have independent filters
  • Merge with filter on the preserved side of a LeftSemi join
  • No merge when filter is on the non-output side of a LeftSemi join
  • No merge when filter is on the nullable side of an outer join
  • No merge when the feature is disabled via config

Added an integration test in PlanMergeSuite verifying correctness (checkAnswer) and plan shape (SubqueryExec/ReusedSubqueryExec counts) for both the enabled and disabled config cases, with and without AQE.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

@peter-toth
Contributor Author

I measured the following improvements with the affected queries:

[info] TPCDS:                                                                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------------------------------------
[info] q77                                                                                       384            449          77         70.4          14.2       1.0X
[info] q77 - symmetric filter propagation and filter propagation through join enabled            325            342          13         83.2          12.0       1.2X

[info] TPCDS:                                                                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------------------------------------
[info] q88                                                                                     10145          10593         635          1.4         732.5       1.0X
[info] q88 - symmetric filter propagation and filter propagation through join enabled           1358           1393          48         10.2          98.1       7.5X

[info] TPCDS:                                                                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------------------------------------
[info] q90                                                                                      2496           2544          68          1.5         676.3       1.0X
[info] q90 - symmetric filter propagation and filter propagation through join enabled           1153           1171          26          3.2         312.4       2.2X
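
As a quick arithmetic check, the Relative column above appears consistent with the ratio of best times (baseline divided by optimized). The short Python sketch below recomputes it from the Best Time(ms) values in the tables. Treating Relative as that ratio is an assumption about how the benchmark harness derives it.

```python
# Best Time(ms) values copied from the benchmark output above:
# (baseline, with both filter propagation features enabled).
best_ms = {
    "q77": (384, 325),
    "q88": (10145, 1358),
    "q90": (2496, 1153),
}

# Speedup = baseline best time / optimized best time, rounded to one decimal.
speedups = {q: round(base / opt, 1) for q, (base, opt) in best_ms.items()}
print(speedups)  # {'q77': 1.2, 'q88': 7.5, 'q90': 2.2}
```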

@peter-toth peter-toth marked this pull request as draft April 30, 2026 18:55
.createWithDefault(false)

val MERGE_SUBPLANS_FILTER_PROPAGATION_THROUGH_JOIN_ENABLED =
buildConf("spark.sql.optimizer.mergeSubplans.filterPropagationThroughJoin.enabled")
Member

@dongjoon-hyun dongjoon-hyun Apr 30, 2026

It seems that we had better follow the config namespace rule.

Contributor Author

I will rebase this PR after #55633, but I already renamed the new config to be in the .filterPropagation namespace.

@peter-toth peter-toth force-pushed the SPARK-56677-filter-propagation-through-join branch from 2087a88 to bd5f812 Compare April 30, 2026 19:18
@peter-toth peter-toth marked this pull request as ready for review April 30, 2026 19:19
"merging, allowing subplans that differ only in their filter conditions and share a " +
"common join to be merged into a single scan. A filter attribute is only propagated " +
"through a join when it originates from the non-nullable (preserved) side: the left side " +
"of LeftOuter/LeftSemi/LeftAnti, the right side of RightOuter, or either side of " +
Member

nit. Remove the extra space: "either  side" -> "either side".

Contributor Author

Fixed in 4d87515.

* produces a boolean attribute that flows through the join output to the enclosing
* [[Aggregate]]. Propagation is skipped when both the left and right children simultaneously
* produce filter attributes, as combining them would require an additional AND alias above
* the join (not yet supported).
Member

Given filterSafeForJoin, shall we describe the NULL-padding cases in a bit more detail?

Contributor Author

I rewrote this part entirely to elaborate on the problem with NULL-padded sides: 4d87515.
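
The NULL-padding problem discussed in this thread can be illustrated with a toy Python simulation, which is not Spark code. The data and the helper `left_outer_join` are hypothetical. The filter `b > 1` sits on the right (nullable) side of a left outer join. In the original plan the filter runs before the join, so unmatched left rows are NULL-padded and still contribute to the sum. If the filter were instead turned into a flag `f` and applied as `FILTER (WHERE f)` above the join, rows with `f` false or NULL would be dropped from the aggregate.

```python
t1 = [{"id": 1, "key": 10}, {"id": 2, "key": 20}, {"id": 3, "key": 30}]
t2 = [{"id": 1, "b": 0}, {"id": 2, "b": 5}]  # id 3 has no match in t2


def left_outer_join(left, right, right_cols):
    """Naive left outer join on id; unmatched left rows get None for right_cols."""
    out = []
    for l in left:
        matches = [r for r in right if r["id"] == l["id"]]
        if matches:
            out.extend({**r, **l} for r in matches)
        else:
            out.append({**{c: None for c in right_cols}, **l})
    return out


# Original plan: filter t2 first, then join. Every t1 row survives the join.
filtered_t2 = [r for r in t2 if r["b"] > 1]
original_sum = sum(row["key"] for row in left_outer_join(t1, filtered_t2, ["b"]))

# Unsafe "merged" plan: keep all t2 rows, carry the flag f through the join,
# and keep only rows where f is true, as FILTER (WHERE f) would.
t2_flagged = [{**r, "f": r["b"] > 1} for r in t2]
joined = left_outer_join(t1, t2_flagged, ["b", "f"])
merged_sum = sum(row["key"] for row in joined if row["f"] is True)

print(original_sum, merged_sum)  # 60 20
```

The two sums differ, which is why propagation is only allowed from the preserved side of an outer join.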


val MERGE_SUBPLANS_FILTER_PROPAGATION_THROUGH_JOIN_ENABLED =
buildConf(
"spark.sql.optimizer.mergeSubplans.filterPropagation.filterPropagationThroughJoin.enabled")
Member

Shall we rename filterPropagationThroughJoin -> throughJoin? Option (2) is shorter and better in general.

  1. spark.sql.optimizer.mergeSubplans.filterPropagation.filterPropagationThroughJoin.enabled
  2. spark.sql.optimizer.mergeSubplans.filterPropagation.throughJoin.enabled

Contributor Author

Good idea, renamed in 4d87515.

// rows are NULL-padded so f=NULL, causing FILTER (WHERE f) to incorrectly
// exclude rows that should contribute to the aggregate. Right-side
// attributes are also absent from semi/anti join output.
(leftNPFilter.isEmpty && leftCPFilter.isEmpty ||
Member

Remove the extra spaces before &&: "isEmpty  &&" -> "isEmpty &&", because the Apache Spark community doesn't use vertical alignment.

Contributor Author

Fixed in 4d87515.


comparePlans(Optimize.execute(originalQuery.analyze), originalQuery.analyze)
}
}
Member

Shall we add Cross (positive) and FullOuter (negative) test coverage?

Contributor Author

Added in 4d87515.

peter-toth added 2 commits May 1, 2026 14:13
…PlanMerger

### What changes were proposed in this pull request?

`PlanMerger` now supports filter propagation through `Join` nodes when merging
similar subplans. Previously, when two subplans contained identical `Join` nodes
but differed only in a filter applied to one of the join's children, they could
not be merged.

This PR adds the ability to propagate such filter conditions through a `Join`
and into the parent `Aggregate`'s `FILTER` clause. A new `filterSafeForJoin`
helper checks that the filter originates from the non-nullable (preserved) side
of the join: the left side of `LeftOuter`/`LeftSemi`/`LeftAnti`, the right side
of `RightOuter`, or either side of `Inner`/`Cross`. `FullOuter` joins are not
eligible.

The feature is gated by a new SQL config:
`spark.sql.optimizer.mergeSubplans.filterPropagationThroughJoin.enabled`
(default: `true`).

### Why are the changes needed?

Without this change, scalar subqueries that differ only in a filter on one side
of an identical join cannot be merged, resulting in redundant scans and compute.
For example:

  SELECT
    (SELECT sum(key) FROM t1 JOIN t2 ON t1.id = t2.id),
    (SELECT sum(key) FROM t1 JOIN t2 ON t1.id = t2.id WHERE t2.b > 1)

Both subqueries scan `t1` and `t2` in full even though they share the same base
join. After this change a single merged scan is used and the second subquery's
result is derived from it via an aggregate `FILTER` clause.

### Does this PR introduce _any_ user-facing change?

Yes. The optimizer may now merge scalar subqueries that were previously kept
separate, reducing the number of scan and join operations. The new config
`spark.sql.optimizer.mergeSubplans.filterPropagationThroughJoin.enabled`
(default `true`) can be used to opt out.

### How was this patch tested?

Added unit tests in `MergeSubplansSuite`:
- Merge with filter on left inner join child
- Merge with filter on right inner join child
- No merge when both join children have independent filters
- Merge with filter on the preserved side of a `LeftSemi` join
- No merge when filter is on the non-output side of a `LeftSemi` join
- No merge when filter is on the nullable side of an outer join
- No merge when the feature is disabled via config

Added integration test in `PlanMergeSuite` verifying correctness (`checkAnswer`)
and plan shape (`SubqueryExec`/`ReusedSubqueryExec` counts) for both the enabled
and disabled config cases, with and without AQE.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6
@peter-toth peter-toth force-pushed the SPARK-56677-filter-propagation-through-join branch from bd5f812 to 4d87515 Compare May 1, 2026 13:25
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @peter-toth.

@dongjoon-hyun
Member

Merged to master for Apache Spark 4.2.0.

2 participants