-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP][SPARK-32268][SQL] Bloom Filter Join #29065
Conversation
Test build #125627 has finished for PR 29065 at commit
|
Hi @wangyum , This is a nice PR to me. But some issues in my mind should be thrown here. I didn't do more perf between MinMax and Bloom, but in my personal sense, these may effect the perf of different cases.
That is just a rough idea, but the key point is to make DynamicFilter(or name it RuntimeFilter) more general(that means both of MinMaxFilter, BloomFilter and DPPFilter are DynamicFilter), so that it will be easy to get extended. I have seen another proposal about RuntimeFilter(MinMax) before, so making things easy to be extended should be important as well as the perf result. em, maybe it's hard to make it more extendable. Feel free to point my incorrect understanding out, thx. |
@jovany-wang Thank you very much for your suggestion. I appreciate the time and effort you have spent to share your insightful comments, which will be seriously considered. |
Test build #126700 has finished for PR 29065 at commit
|
Test build #126705 has finished for PR 29065 at commit
|
Retest this please. |
with DynamicPruning | ||
with Unevaluable { | ||
with DynamicPruning | ||
with Unevaluable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original indentation is correct.
exprId: ExprId = NamedExpression.newExprId) | ||
extends SubqueryExpression(buildQuery, Seq(pruningKey), exprId) | ||
with DynamicPruning | ||
with Unevaluable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see https://github.com/databricks/scala-style-guide/blob/master/README.md#indent and adjust the indentation.
Hi, @wangyum . The doc and PR looks reasonable. Is there a plan for further update because there is |
Test build #127281 has finished for PR 29065 at commit
|
Test build #128383 has finished for PR 29065 at commit
|
What's the current status of this PR? Waiting for reviews? |
btw, IMHO |
} | ||
} | ||
|
||
case class DynamicShufflePruningSubquery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks DynamicPartitionPruningSubquery
and DynamicShufflePruningSubquery
are almost the same, so we need this new predicate? Could we add a value to represent a pruning type in a class field of DynamicPruningSubquery
like this?
case class DynamicPruningSubquery(
pruningKey: Expression,
buildQuery: LogicalPlan,
buildKeys: Seq[Expression],
broadcastKeyIndex: Int,
onlyInBroadcast: Boolean,
exprId: ExprId,
pruningType: PruningType) <---- This?
val hasBenefit = pruningHasBenefit(r, partScan, l, left) | ||
newRight = insertPartitionPredicate(r, newRight, l, left, leftKeys, hasBenefit) | ||
// shuffle pruning | ||
case None if conf.dynamicShufflePruningEnabled && canPruneRight(joinType) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new feature is enabled only if both dynamicPartitionPruningEnabled
and dynamicShufflePruningEnabled
are true?
DynamicPruningExpression(InSubqueryExec(value, broadcastValues, exprId)) | ||
val broadcastValues = SubqueryBroadcastExec(name, broadcastKeyIndex, buildKeys, exchange) | ||
if (preferBloomFilter(buildKeys(broadcastKeyIndex), buildPlan)) { | ||
DynamicPruningExpression(BloomFilterSubqueryExec(value, broadcastValues, exprId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this PR propose two things: 1. improving the existing part pruning by bloom filters and 2. implementing a new dynamic pruning strategy (shuffle pruning)?
s"Bloom filter only supports atomic types, but got ${colType.catalogString}.") | ||
|
||
val updater: (BloomFilter, InternalRow) => Unit = | ||
(filter, row) => BloomFilterUtils.putValue(filter, row.get(0, colType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change can cause perf. regression because the pattern matching of colType
happens every time updater
called.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Reduce the shuffle data can significantly improve the query performance and increase Spark cluster stability. There is a DPP-like way to filter out shuffle data. The main difference is that the bloom filter is used to filter the data(A simple Bloom filter benchmark). This PR add support this feature. The design document could be found here.
Why are the changes needed?
Improve query performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Evaluate dynamic Bloom Filter runtime-filtering by TPCDS.