[SPARK-40193][SQL] Merge subquery plans with different filters #37630
Conversation
cc @cloud-fan, @sigmod
@@ -78,6 +77,9 @@ class SparkOptimizer(
      PushPredicateThroughNonJoin,
      RemoveNoopOperators) :+
    Batch("User Provided Optimizers", fixedPoint, experimentalMethods.extraOptimizations: _*) :+
    Batch("Merge Scalar Subqueries", Once,
I've moved the MergeScalarSubqueries rule to the end of the optimization phase, just before ReplaceCTERefWithRepartition. This is needed because we need to peek into the physical plans.
core/src/main/scala/org/apache/spark/util/collection/BitSet.scala
@@ -0,0 +1,627 @@
/*
private def checkIdenticalPlans(
    newPlan: LogicalPlan,
    cachedPlan: LogicalPlan): Option[AttributeMap[Attribute]] = {
  if (newPlan.canonicalized == cachedPlan.canonicalized) {
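The canonicalized comparison above can be illustrated with a toy, self-contained sketch. The Plan/Scan/Filter classes below are hypothetical stand-ins, not Spark's real LogicalPlan API:

```scala
// Toy illustration of canonicalized-plan equality. Canonicalization
// normalizes cosmetic differences (here: identifier capitalization) so
// that semantically identical plans compare equal.
sealed trait Plan {
  def canonicalized: Plan
}
case class Scan(table: String) extends Plan {
  def canonicalized: Plan = Scan(table.toLowerCase)
}
case class Filter(predicates: Set[String], child: Plan) extends Plan {
  def canonicalized: Plan =
    Filter(predicates.map(_.toLowerCase), child.canonicalized)
}

def checkIdenticalPlans(newPlan: Plan, cachedPlan: Plan): Boolean =
  newPlan.canonicalized == cachedPlan.canonicalized

val a = Filter(Set("X > 1"), Scan("T1"))
val b = Filter(Set("x > 1"), Scan("t1"))
println(checkIdenticalPlans(a, b)) // true: the plans differ only cosmetically
```

In Spark the real check additionally returns an attribute mapping so the cached plan's output can be re-used; the sketch only shows the equality test.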
[doubt] Does this work with V2 sources as well, considering earlyScanPushDownRules makes changes to the scan, hence changing the canonicalization of the scalar subqueries?
No, this doesn't work with DSv2 sources (nor did the original #32298).
I'm planning to add DSv2 support in another follow-up PR, probably by introducing a SupportsMerge interface that Scans can implement to merge with another Scan.
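A rough sketch of what such an interface could look like. This is hypothetical (the trait and method names below are illustrative guesses, not the actual proposal in #37711):

```scala
// Hypothetical sketch of the proposed SupportsMerge idea: a Scan that
// knows how to merge itself with another Scan, returning None when the
// two scans cannot be combined. Not Spark's actual DSv2 API.
trait Scan {
  def readColumns: Set[String]
}

trait SupportsMerge { self: Scan =>
  // Returns the merged scan if `other` reads the same source, so the
  // union of their columns can be read in one pass; None otherwise.
  def mergeWith(other: Scan): Option[Scan]
}

case class ParquetScan(path: String, readColumns: Set[String])
    extends Scan with SupportsMerge {
  def mergeWith(other: Scan): Option[Scan] = other match {
    case ParquetScan(p, cols) if p == path =>
      Some(ParquetScan(path, readColumns ++ cols))
    case _ => None
  }
}
```

The point of the design is that only the scan implementation itself can decide whether merging with another scan is safe and profitable.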
I opened #37711 to add support for DSv2 sources (only Parquet first).
@peter-toth Could you fix these conflicts? I want to test this PR. Thank you!
I've updated the PR with the latest
Force-pushed from d6fb69c to 56c287f
We tested this PR and the results are: cc @sigmod too.
@peter-toth Could you fix the conflicts again?
Sure, done.
Force-pushed from 83c59ab to 1375c79
Tested this PR using 10TB TPC-DS; the latency of q9 has been reduced by 83.39% in my production environment.
also cc @wangyum FYI
Force-pushed from 1375c79 to 558d908
I extracted the first commit of this PR, which just moves
Force-pushed from ebbe9d6 to 02e3a68
Force-pushed from 02e3a68 to ce24661
@@ -62,7 +62,7 @@ class SparkOptimizer(
      RewriteDistinctAggregates) :+
    Batch("Pushdown Filters from PartitionPruning", fixedPoint,
      PushDownPredicates) :+
-   Batch("Cleanup filters that cannot be pushed down", Once,
+   Batch("Cleanup filters that cannot be pushed down", FixedPoint(1),
This is because BooleanSimplification is not idempotent.
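Spark's rule executor expects a batch marked Once to be idempotent; FixedPoint(1) also runs a single pass but without that expectation. A toy, self-contained demonstration of a non-idempotent single-pass rewrite (a hypothetical expression tree, not Spark's rule):

```scala
// A rewrite that pushes Not through Or one level per application
// (De Morgan). A single pass is not idempotent: applying it again can
// still change the tree, so such a rule cannot live in a batch that
// asserts idempotence after one run.
sealed trait Expr
case class Leaf(name: String) extends Expr
case class Not(e: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr
case class And(l: Expr, r: Expr) extends Expr

// One top-down pass; deliberately does not recurse into the Nots it creates.
def pushNotOnce(e: Expr): Expr = e match {
  case Not(Or(l, r)) => And(Not(l), Not(r))
  case Not(x)        => Not(pushNotOnce(x))
  case Or(l, r)      => Or(pushNotOnce(l), pushNotOnce(r))
  case And(l, r)     => And(pushNotOnce(l), pushNotOnce(r))
  case leaf          => leaf
}

val e     = Not(Or(Or(Leaf("a"), Leaf("b")), Leaf("c")))
val once  = pushNotOnce(e)
val twice = pushNotOnce(once)
println(once != twice) // true: one application is not a fixed point
```

BooleanSimplification's non-idempotence has a different cause, but the consequence is the same: the batch strategy must not assume one run reaches a fixed point.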
I've updated this PR; the latest version contains the discussed changes from the threads of #42223:
cc @beliefer, @cloud-fan
Force-pushed from ce24661 to 5c7c0c5
Force-pushed from 50e4f3b to f2d7896
Force-pushed from f2d7896 to 9b72dc4
Hey, is this part of generalized subquery fusion? https://www.usenix.org/conference/osdi20/presentation/sarthi
No, this PR is not based on the above paper, but our goals seem to be similar.
@peter-toth So exciting to see that you're still updating this PR!! Is this PR based on Spark 3.5? And does it support DataSource V2?
val mergeCost = if (filterPropagationSupported) Some(0d) else None

(cachedPlan, outputMap, None, None, mergeCost)
}.orElse(
  (newPlan, cachedPlan) match {
This is a logical plan optimization rule. In the previous version of this PR I was trying to peek into the physical plan by moving this rule to the end of the optimization phase and generating the physical plan of the scans plus the adjacent projects/filters above them.
I did this to see if any of those projects/filters gets pushed down to the physical scan (as column pruning or pushed partition or data filters). I prevented merging if the 2 physical scans differed (actually there was this PLAN_MERGE_IGNORE_PUSHED_DATA_FILTERS config to still allow merging if only pushed data filters differed) to avoid cases that could cause performance degradation due to merging non-overlapping scans.
The problems with this approach were:
- The code was pretty complicated.
- As most of the physical scans (e.g. Parquet/ORC) allow pushing down data filters, the default of PLAN_MERGE_IGNORE_PUSHED_DATA_FILTERS was true. But actually even a data filter difference could cause non-overlapping scans in some physical scans.
- This approach didn't work well with DSv2, as DSv2 physical scans can't be compared (they don't have comparable partition and data filters). To solve this I suggested a new SupportsMerge interface that DSv2 scans could implement to decide if merging makes sense. This was in a separate PR: [SPARK-40259][SQL] Support Parquet DSv2 in subquery plan merge #37711, and I implemented the interface for DSv2 Parquet only.
The new version of this PR dropped the physical plan comparison as mentioned here: #37630 (comment) and decides about merging based on costs. If the sum of the cost differences between the original plans and the merged plan is lower than PLAN_MERGE_FILTER_PROPAGATION_MAX_COST then merging is enabled. The cost function might need some refinement: https://github.com/apache/spark/pull/37630/files#diff-5096416449daefcb91637508ae3e98a11c8ac66cae5b146b0937370115c1cbb1R734-R742 to support more expressions, but it already works for TPCDS q9.
This cost-based new approach might also need some follow-up changes to make it work with DSv2, but definitely no huge changes from the DSv2 scans (like the SupportsMerge previously) are required.
This PR targets Spark 4.0, as new features are not backported to already released versions, but it could work with Spark 3.5 too.
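The cost-based decision described above can be sketched in a few lines. The numbers, names, and the flat cost model here are toy assumptions; the PR's real cost function is linked above:

```scala
// Toy sketch of the cost-based merge decision: merging is allowed only
// when the extra work the merged plan adds over the originals stays
// under a configured budget (standing in for
// spark.sql.planMerge.filterPropagation.maxCost).
case class PlanCost(value: Double)

def shouldMerge(
    originalCosts: Seq[PlanCost],
    mergedCost: PlanCost,
    maxCost: Double): Boolean = {
  // Sum of the cost differences between the merged plan and each
  // original plan it would replace.
  val extraCost = originalCosts.map(c => mergedCost.value - c.value).sum
  extraCost <= maxCost
}

// Two cheap overlapping scans merged into one slightly costlier scan:
println(shouldMerge(Seq(PlanCost(10), PlanCost(10)), PlanCost(12), maxCost = 100)) // true
// Merging two non-overlapping scans would roughly double the work:
println(shouldMerge(Seq(PlanCost(10), PlanCost(10)), PlanCost(95), maxCost = 100)) // false
```

This captures why the approach no longer needs to compare physical scans: a merge that would combine non-overlapping scans simply fails the budget check.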
Could you add DSv2 support (especially Parquet) for this PR?
I can test its performance in our production env, thank you very much!
Actually, I realized that DSv2 support is still not simple to do with this cost-based new PR. Also, I don't want to include that feature in this PR, as this PR is already complicated enough.
But I rebased #37711 on top of this PR at: https://github.com/peter-toth/spark/commits/SPARK-40259-support-parquet-dsv2-in-plan-merge/ so you can test it there.
@cloud-fan, @beliefer do you think we can move forward with this PR?
…expressions, no need to restrict based on aggregate functions
@@ -21,7 +21,7 @@ import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression}
Why change this line?
I don't know what to do with this PR. There doesn't seem to be much interest in this improvement from the community, but I'm happy to fix this if we can move forward somehow...
My plan was:
- to allow filter merging for subqueries in this PR,
- and then extract the merging logic to be able to apply it on other areas of the plan,
- and then apply it on other areas like [SPARK-43025][SQL] Eliminate Union if filters have the same child plan #40661.
I like this optimization, and it has already been migrated into our company's internal branch by me. Use cases similar to the TPCDS q9 scenario stand to benefit.
Yes. I merged this PR into our private repository half a year ago. I also want to promote this PR.
Fixed in 563cef9.
What changes were proposed in this pull request?
After #32298 we were able to merge scalar subquery plans. This PR is a follow-up improvement to the merging logic to be able to combine Filter nodes with different conditions if those conditions can be merged in an ancestor Aggregate node.
Consider the following query with 2 subqueries:
where the subqueries can be merged to:
After this PR the 2 subqueries are merged to this optimized form:
and physical form:
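The query and plan listings from the original description are not reproduced here. As a stand-in, this self-contained toy sketch (hypothetical data and filter bounds, not the PR's actual plans) shows the core idea: two aggregates over the same table with different filters can be answered from a single scan over the OR of the filters, with each original filter propagated into its own aggregate:

```scala
// Two "subqueries" over the same data with different filters, e.g.:
//   SELECT sum(x) FROM t WHERE x BETWEEN 1 AND 3
//   SELECT sum(x) FROM t WHERE x BETWEEN 2 AND 5
// (hypothetical table t and filter bounds)
val t = Seq(1, 2, 3, 4, 5, 6)
val f1 = (x: Int) => x >= 1 && x <= 3
val f2 = (x: Int) => x >= 2 && x <= 5

// Unmerged: each subquery scans t separately.
val separate = (t.filter(f1).sum, t.filter(f2).sum)

// Merged: one scan with f1 OR f2; each filter is propagated into its
// aggregate as a condition (like sum(CASE WHEN f1 THEN x ELSE 0 END)).
val scannedOnce = t.filter(x => f1(x) || f2(x))
val merged = (
  scannedOnce.map(x => if (f1(x)) x else 0).sum,
  scannedOnce.map(x => if (f2(x)) x else 0).sum)

println(separate == merged) // true: same results from a single scan
```

This is the shape of the TPCDS q9 win: many subqueries differing only in their filter conditions collapse into one scan with conditional aggregates.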
The PR introduces 2 configs: spark.sql.planMerge.filterPropagation.enabled to disable filter merge, and spark.sql.planMerge.filterPropagation.maxCost to control how complex plans are allowed to be merged.
Why are the changes needed?
Performance improvement.
The performance improvement in case of q9 comes from merging 15 subqueries into 1 subquery (#32298 was able to merge 15 subqueries into 5).
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing and new UTs.