[SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning #38464
Conversation
Thank you for pinging me, @wangyum.
@@ -65,7 +70,7 @@ case class PlanAdaptiveDynamicPruningFilters(
        DynamicPruningExpression(InSubqueryExec(value, broadcastValues, exprId))
      } else if (onlyInBroadcast) {
        DynamicPruningExpression(Literal.TrueLiteral)
-     } else {
+     } else if (canBroadcastBySize(buildPlan, conf)) {
This can be overestimated. The final plan has an Aggregate, which may dramatically reduce the data size.
      } else {
        val childPlan = adaptivePlan.executedPlan
        val reusedShuffleExchange = collectFirst(rootPlan) {
          case s: ShuffleExchangeExec if s.child.sameResult(childPlan) => s
This is another tricky part: is reusing the shuffle always better than starting a new query with column pruning?
I agree with using bloom filters, as the size estimation can be wrong and the build side can be too large.
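The appeal of a bloom filter here is that it bounds the pruning structure's size regardless of how wrong the size estimate is, at the cost of admitting false positives (unnecessary partitions kept, never needed partitions dropped). A minimal standalone sketch of the idea, assuming nothing about Spark's own implementation (Spark ships a real one in `org.apache.spark.util.sketch`); the class and parameter names below are illustrative only:

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical, simplified bloom filter: a fixed-size bit set plus k hash probes.
// False positives are possible; false negatives are not, which is exactly the
// property partition pruning needs (a partition is only skipped when the filter
// is certain no matching build-side key exists).
class SimpleBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  // Derive k probe positions from two base hashes (double hashing).
  private def positions(item: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(item, 0)
    val h2 = MurmurHash3.stringHash(item, h1)
    (0 until numHashes).map(i => (((h1 + i * h2) % numBits) + numBits) % numBits)
  }

  def put(item: String): Unit = positions(item).foreach(p => bits.set(p))

  // May return true for items never inserted, but never false for inserted ones.
  def mightContain(item: String): Boolean = positions(item).forall(p => bits.get(p))
}

object PruningDemo {
  def main(args: Array[String]): Unit = {
    val bf = new SimpleBloomFilter(1 << 16, 5)
    // Imagine these are join-key partition values collected from the build side.
    Seq("2023-01-01", "2023-01-02").foreach(bf.put)
    assert(bf.mightContain("2023-01-01")) // inserted keys are always found
    assert(bf.mightContain("2023-01-02"))
  }
}
```

A partition whose value returns `false` from `mightContain` can be safely skipped, which is why a fixed-size filter is a safer fallback than a broadcast whose size was estimated incorrectly.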
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?

This PR enhances DPP to use bloom filters if `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` is disabled, the build plan cannot be broadcast by size, and an existing shuffle exchange can be reused.

Why are the changes needed?

To avoid job failures if `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` is disabled.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.
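For context, the failure scenario above is only reachable when the configuration named in the description is turned off. A hedged sketch of how that would be done; the config key is taken from this PR, while the session setup around it is illustrative:

```scala
// Illustrative only: `spark` is assumed to be an existing SparkSession.
// Disabling reuseBroadcastOnly allows DPP filters that are not backed by an
// existing broadcast, which is the code path this PR hardens.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "false")
```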