
[SPARK-40193][SQL] Merge subquery plans with different filters #37630

Open · wants to merge 7 commits into master
Conversation

peter-toth
Contributor

@peter-toth peter-toth commented Aug 23, 2022

What changes were proposed in this pull request?

After #32298 we were able to merge scalar subquery plans. This PR is a follow-up that improves the merging logic to combine Filter nodes with different conditions when those conditions can be merged in an ancestor Aggregate node.

Consider the following query with 2 subqueries:

SELECT
  (SELECT avg(a) FROM t WHERE c = 1),
  (SELECT sum(b) FROM t WHERE c = 2)

where the subqueries can be merged to:

SELECT
  avg(a) FILTER (WHERE c = 1),
  sum(b) FILTER (WHERE c = 2)
FROM t
WHERE c = 1 OR c = 2

After this PR the 2 subqueries are merged to this optimized form:

== Optimized Logical Plan ==
Project [scalar-subquery#260 [].avg(a) AS scalarsubquery()#277, scalar-subquery#261 [].sum(b) AS scalarsubquery()#278L]
:  :- Project [named_struct(avg(a), avg(a)#268, sum(b), sum(b)#271L) AS mergedValue#286]
:  :  +- Aggregate [avg(a#264) FILTER (WHERE propagatedFilter#285) AS avg(a)#268, sum(b#265) FILTER (WHERE propagatedFilter#284) AS sum(b)#271L]
:  :     +- Project [a#264, b#265, (isnotnull(c#266) AND (c#266 = 2)) AS propagatedFilter#284, (isnotnull(c#266) AND (c#266 = 1)) AS propagatedFilter#285]
:  :        +- Filter ((isnotnull(c#266) AND (c#266 = 1)) OR (isnotnull(c#266) AND (c#266 = 2)))
:  :           +- Relation spark_catalog.default.t[a#264,b#265,c#266] parquet
:  +- Project [named_struct(avg(a), avg(a)#268, sum(b), sum(b)#271L) AS mergedValue#286]
:     +- Aggregate [avg(a#264) FILTER (WHERE propagatedFilter#285) AS avg(a)#268, sum(b#265) FILTER (WHERE propagatedFilter#284) AS sum(b)#271L]
:        +- Project [a#264, b#265, (isnotnull(c#266) AND (c#266 = 2)) AS propagatedFilter#284, (isnotnull(c#266) AND (c#266 = 1)) AS propagatedFilter#285]
:           +- Filter ((isnotnull(c#266) AND (c#266 = 1)) OR (isnotnull(c#266) AND (c#266 = 2)))
:              +- Relation spark_catalog.default.t[a#264,b#265,c#266] parquet
+- OneRowRelation

and physical form:

== Physical Plan ==
*(1) Project [Subquery scalar-subquery#260, [id=#148].avg(a) AS scalarsubquery()#277, ReusedSubquery Subquery scalar-subquery#260, [id=#148].sum(b) AS scalarsubquery()#278L]
:  :- Subquery scalar-subquery#260, [id=#148]
:  :  +- *(2) Project [named_struct(avg(a), avg(a)#268, sum(b), sum(b)#271L) AS mergedValue#286]
:  :     +- *(2) HashAggregate(keys=[], functions=[avg(a#264), sum(b#265)], output=[avg(a)#268, sum(b)#271L])
:  :        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=143]
:  :           +- *(1) HashAggregate(keys=[], functions=[partial_avg(a#264) FILTER (WHERE propagatedFilter#285), partial_sum(b#265) FILTER (WHERE propagatedFilter#284)], output=[sum#288, count#289L, sum#290L])
:  :              +- *(1) Project [a#264, b#265, (isnotnull(c#266) AND (c#266 = 2)) AS propagatedFilter#284, (isnotnull(c#266) AND (c#266 = 1)) AS propagatedFilter#285]
:  :                 +- *(1) Filter ((isnotnull(c#266) AND (c#266 = 1)) OR (isnotnull(c#266) AND (c#266 = 2)))
:  :                    +- *(1) ColumnarToRow
:  :                       +- FileScan parquet spark_catalog.default.t[a#264,b#265,c#266] Batched: true, DataFilters: [((isnotnull(c#266) AND (c#266 = 1)) OR (isnotnull(c#266) AND (c#266 = 2)))], Format: Parquet, Location: ..., PartitionFilters: [], PushedFilters: [Or(And(IsNotNull(c),EqualTo(c,1)),And(IsNotNull(c),EqualTo(c,2)))], ReadSchema: struct<a:int,b:int,c:int>
:  +- ReusedSubquery Subquery scalar-subquery#260, [id=#148]
+- *(1) Scan OneRowRelation[]
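
The rewrite shown above can be sketched as a plain function. This is a simplified, illustrative model only — `SubqueryAgg` and `mergeSubqueries` are hypothetical names, not Spark classes; the real rule works on logical plan nodes, not SQL strings:

```scala
// Simplified model of the filter-propagation merge: each single-aggregate
// subquery keeps its own condition as a per-aggregate FILTER clause, while
// the shared scan keeps the OR of all conditions.
case class SubqueryAgg(aggregate: String, condition: String)

def mergeSubqueries(relation: String, subqueries: Seq[SubqueryAgg]): String = {
  // The merged scan must return every row that any subquery needs.
  val scanFilter = subqueries.map(s => s"(${s.condition})").mkString(" OR ")
  // Each aggregate only consumes the rows its original subquery filtered for.
  val aggs = subqueries
    .map(s => s"${s.aggregate} FILTER (WHERE ${s.condition})")
    .mkString(",\n  ")
  s"SELECT\n  $aggs\nFROM $relation\nWHERE $scanFilter"
}

val merged = mergeSubqueries("t", Seq(
  SubqueryAgg("avg(a)", "c = 1"),
  SubqueryAgg("sum(b)", "c = 2")))
println(merged)
```

Running this prints the merged query from the example above (modulo extra parentheses around the conditions).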

The PR introduces 2 configs:

  • spark.sql.planMerge.filterPropagation.enabled to disable filter merge and
  • spark.sql.planMerge.filterPropagation.maxCost to control how complex plans are allowed to be merged.
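
If the PR gets merged, these could presumably be set like any other SQL conf. A sketch only — it assumes a running SparkSession named `spark`, the keys are the ones proposed in this PR (not available in any released Spark version), and the maxCost value is an arbitrary example:

```scala
// Toggle filter-propagation merging:
spark.conf.set("spark.sql.planMerge.filterPropagation.enabled", "false")

// Cap how costly a merged plan may be relative to the original plans
// (example value; the PR defines the actual default):
spark.conf.set("spark.sql.planMerge.filterPropagation.maxCost", "100.0")
```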

Why are the changes needed?

Performance improvement.

[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q9 - Merge different filters off                   9526           9634          97          0.0   244257993.6       1.0X
[info] q9 - Merge different filters on                    3798           3881         133          0.0    97381735.1       2.5X

The performance improvement in case of q9 comes from merging 15 subqueries into 1 subquery (#32298 was able to merge 15 subqueries into 5).

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing and new UTs.

@peter-toth
Contributor Author

cc @cloud-fan, @sigmod

@@ -78,6 +77,9 @@ class SparkOptimizer(
PushPredicateThroughNonJoin,
RemoveNoopOperators) :+
Batch("User Provided Optimizers", fixedPoint, experimentalMethods.extraOptimizations: _*) :+
Batch("Merge Scalar Subqueries", Once,
Contributor Author

I've moved the MergeScalarSubqueries rule to the end of the optimization phase, just before ReplaceCTERefWithRepartition. This is needed because the rule needs to peek into the physical plans.

@@ -0,0 +1,627 @@
/*
Contributor Author

I moved this file from catalyst to sql in the first commit: f53cddd, but unfortunately Git doesn't recognize the move as there are too many additions to the file: 269c75b

private def checkIdenticalPlans(
    newPlan: LogicalPlan,
    cachedPlan: LogicalPlan): Option[AttributeMap[Attribute]] = {
  if (newPlan.canonicalized == cachedPlan.canonicalized) {


[doubt] Does this work with V2 sources as well, considering earlyScanPushDownRules changes the scan and hence the canonicalization of the scalar subqueries?

Contributor Author

No, this doesn't work with DSv2 sources (nor did the original #32298).

I'm planning to add DSv2 support in another follow-up PR, probably by introducing a SupportsMerge interface that Scans can implement to merge with another Scan.

Contributor Author

I opened #37711 to add support for DSv2 sources (only Parquet first).

@beliefer
Contributor

beliefer commented Nov 7, 2022

@peter-toth Could you fix these conflicts? I want to test this PR. Thank you!

@peter-toth
Contributor Author

@peter-toth Could you fix these conflicts? I want to test this PR. Thank you!

I've updated the PR with the latest master.

@peter-toth
Contributor Author

@beliefer, I made a mistake previously with merging master (with SPARK-40618 changes) into this PR so I had to force-push to 56c287f. Please check the latest version.

@beliefer
Contributor

We tested this PR and the results are:
[image: benchmark results]

cc @sigmod too.

@beliefer
Contributor

@peter-toth Could you fix the conflicts again?

@peter-toth
Contributor Author

@peter-toth Could you fix the conflicts again?

Sure, done.

@LuciferYang
Contributor

LuciferYang commented Apr 18, 2023

I tested this PR using 10TB TPC-DS; the latency of q9 has been reduced by 83.39% in my production environment.

       Master           SPARK-40193      Percentage
q9     88895.32683 ms   14766.8049 ms    83.39%

also cc @wangyum FYI

@peter-toth
Contributor Author

peter-toth commented Apr 24, 2023

I extracted the first commit of this PR, which just moves MergeScalarSubqueries from spark-catalyst to spark-sql, into #40932 to make the actual change of this PR clearer once that PR has been merged.

@peter-toth peter-toth changed the title [SPARK-40193][SQL] Merge subquery plans with different filters [WIP][SPARK-40193][SQL] Merge subquery plans with different filters Apr 24, 2023
@peter-toth peter-toth force-pushed the SPARK-40193-merge-filters branch 3 times, most recently from ebbe9d6 to 02e3a68 Compare August 2, 2023 11:54
@peter-toth peter-toth changed the title [WIP][SPARK-40193][SQL] Merge subquery plans with different filters [SPARK-40193][SQL] Merge subquery plans with different filters Aug 22, 2023
@@ -62,7 +62,7 @@ class SparkOptimizer(
RewriteDistinctAggregates) :+
Batch("Pushdown Filters from PartitionPruning", fixedPoint,
PushDownPredicates) :+
Batch("Cleanup filters that cannot be pushed down", Once,
Batch("Cleanup filters that cannot be pushed down", FixedPoint(1),
Contributor Author

This is because BooleanSimplification is not idempotent: a Once batch is expected to be idempotent, while FixedPoint(1) also runs the batch a single time but without that expectation.

@peter-toth
Contributor Author

peter-toth commented Aug 22, 2023

I've updated this PR; the latest version contains the changes discussed in the threads of #42223.

cc @beliefer, @cloud-fan

@benjamin-j-c

Hey, is this part of generalized subquery fusion? https://www.usenix.org/conference/osdi20/presentation/sarthi

@peter-toth
Contributor Author

Hey, is this part of generalized subquery fusion? https://www.usenix.org/conference/osdi20/presentation/sarthi

No, this PR is not based on the above paper, but our goals seem to be similar.
This PR merges scalar subquery plans only, but unfortunately it got stuck due to lack of reviews. But if it ever gets accepted, I would like to take the approach further and do full common subplan elimination/merge...

@unigof

unigof commented Dec 15, 2023

@peter-toth So exciting to see that you're still updating this PR!!

Is this PR based on Spark 3.5? Does it support DataSource V2?
Could you help merge this PR to master?

        val mergeCost = if (filterPropagationSupported) Some(0d) else None

        (cachedPlan, outputMap, None, None, mergeCost)
      }.orElse(
        (newPlan, cachedPlan) match {

Hi, I remember there was a case matching FileSourceScanPlan in your old version to check whether two FileSourceScanPlans can be merged, like in the attached screenshot. Why is it not needed now?
[image: screenshot]

Contributor Author

This is a logical plan optimization rule, and in the previous version of this PR I was trying to peek into the physical plan by moving this rule to the end of the optimization phase and generating the physical plan of the scans + the adjacent projects/filters above them.
I did this to see if any of those projects/filters gets pushed down to the physical scan (as column pruning or pushed partition or data filters). I prevented merging if the 2 physical scans differed (actually there was this PLAN_MERGE_IGNORE_PUSHED_DATA_FILTERS config to still allow merging if only pushed data filters differed) to avoid cases that could cause performance degradation due to merging non-overlapping scans.
The problems with this approach were:

  • The code was pretty complicated.
  • Most physical scans (e.g. Parquet/ORC) allow pushing down data filters, so the default of PLAN_MERGE_IGNORE_PUSHED_DATA_FILTERS was true. But actually even a data filter difference could cause non-overlapping scans in some physical scans.
  • This approach didn't work well with DSv2, as DSv2 physical scans can't be compared (they don't have comparable partition and data filters). To solve this I suggested a new SupportsMerge interface that DSv2 scans could implement to decide if merging makes sense. This was in a separate PR: [SPARK-40259][SQL] Support Parquet DSv2 in subquery plan merge #37711, and I implemented the interface for DSv2 Parquet only.

The new version of this PR dropped the physical plan comparison as mentioned here: #37630 (comment) and decides about merging based on costs. If the sum of the cost differences between the original plans and the merged plan is lower than PLAN_MERGE_FILTER_PROPAGATION_MAX_COST then merging is enabled. The cost function might need some refinement: https://github.com/apache/spark/pull/37630/files#diff-5096416449daefcb91637508ae3e98a11c8ac66cae5b146b0937370115c1cbb1R734-R742 to support more expressions, but it already works for TPCDS q9.
This new cost-based approach might also need some follow-up changes to make it work with DSv2, but it definitely doesn't require huge changes in the DSv2 scans (like SupportsMerge did previously).
This PR targets Spark 4.0 as new features are not backported to already released versions, but it could work with Spark 3.5 too.
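
The cost-based decision described here can be sketched roughly as follows. This is a toy model under stated assumptions — `Plan`, `planCost`, and the exact threshold semantics are illustrative, not the PR's actual implementation:

```scala
// Toy cost model: merging is allowed when the summed cost increase of the
// merged plan over each original plan stays within the configured maximum
// (PLAN_MERGE_FILTER_PROPAGATION_MAX_COST in the PR).
case class Plan(numFilterConditions: Int, numAggregates: Int)

// Hypothetical per-operator weights for illustration only.
def planCost(p: Plan): Double =
  p.numFilterConditions * 1.0 + p.numAggregates * 2.0

def shouldMerge(originals: Seq[Plan], merged: Plan, maxCost: Double): Boolean = {
  // Each original subquery now pays the cost of the merged plan instead of
  // its own; sum up the differences and compare to the threshold.
  val extraCost = originals.map(o => planCost(merged) - planCost(o)).sum
  extraCost <= maxCost
}

// Two single-aggregate, single-filter subqueries vs. a merged plan that
// evaluates both conditions and both aggregates:
val originals = Seq(Plan(1, 1), Plan(1, 1))
val mergedPlan = Plan(2, 2)
println(shouldMerge(originals, mergedPlan, maxCost = 10.0))
```

With these weights each original costs 3 and the merged plan costs 6, so the summed extra cost is 6 and merging is allowed for any threshold of at least 6.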

unigof commented Dec 15, 2023

Could you add DSv2 support (especially Parquet) to this PR?
I can test its performance in our production env, thank you very much.

Contributor Author

Actually, I realized that DSv2 support is still not simple to do with this new cost-based approach. Also, I don't want to include that feature in this PR as this PR is already complicated enough.
But I rebased #37711 on top of this PR at: https://github.com/peter-toth/spark/commits/SPARK-40259-support-parquet-dsv2-in-plan-merge/ so you can test it there.

@peter-toth
Contributor Author

@cloud-fan, @beliefer do you think we can move forward with this PR?

@@ -21,7 +21,7 @@ import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression}
Contributor

Why change this line?

Contributor Author

I don't know what to do with this PR. There doesn't seem to be much interest in this improvement from the community, but I'm happy to fix this if we can move forward somehow...

My plan was:

  1. to allow filter merging for subqueries in this PR,
  2. and then extract the merging logic to be able to apply it on other areas of the plan,
  3. and then apply it on other areas like [SPARK-43025][SQL] Eliminate Union if filters have the same child plan #40661.

Contributor

I like this optimization, and I have already migrated it into our company's internal branch. Use cases similar to the TPCDS q9 scenario stand to benefit.

Contributor

Yes. I merged this PR into our private repository half a year ago. I also want to promote this PR.

Contributor Author

Fixed in 563cef9.
