[WIP][SPARK-32268][SQL] Bloom Filter Join #29065

wangyum · 2020-07-10T16:07:27Z

What changes were proposed in this pull request?

Reducing the amount of shuffle data can significantly improve query performance and increase Spark cluster stability. There is a DPP-like way to filter out shuffle data; the main difference is that a Bloom filter is used to filter the data (see a simple Bloom filter benchmark). This PR adds support for this feature. The design document can be found here.
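As a rough illustration of the idea described above, here is a minimal, dependency-free sketch: a Bloom filter is built from the join keys of the small side and used to drop large-side rows before they would be shuffled. `SimpleBloomFilter`, `BloomFilterJoinSketch`, and `pruneBeforeShuffle` are hypothetical names made up for this example; this is not the code in the PR, and it does not use Spark's own Bloom filter classes.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical stand-in Bloom filter, for illustration only.
final class SimpleBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  // Derive numHashes bit positions from two base hashes (double hashing).
  private def positions(key: Long): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(key.toString, 17)
    val h2 = MurmurHash3.stringHash(key.toString, 31)
    (0 until numHashes).map(i => java.lang.Math.floorMod(h1 + i * h2, numBits))
  }

  def put(key: Long): Unit = positions(key).foreach(i => bits.set(i))

  // May return true for a key that was never added (false positive),
  // but never returns false for a key that was added.
  def mightContain(key: Long): Boolean = positions(key).forall(i => bits.get(i))
}

object BloomFilterJoinSketch {
  // Build the filter from the small side's join keys, then keep only the
  // large-side rows whose key might match. Rows dropped here would not
  // need to be shuffled for the join.
  def pruneBeforeShuffle(
      smallSideKeys: Seq[Long],
      largeSide: Seq[(Long, String)]): Seq[(Long, String)] = {
    val filter = new SimpleBloomFilter(numBits = 1 << 16, numHashes = 3)
    smallSideKeys.foreach(k => filter.put(k))
    largeSide.filter { case (key, _) => filter.mightContain(key) }
  }
}
```

Because a Bloom filter can report false positives but not false negatives, the pruned output may still contain some non-matching rows; the join itself remains responsible for correctness.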

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Evaluated dynamic Bloom filter runtime filtering with TPC-DS.

Query	Shuffle stage data size (original)	Shuffle stage data size (pruned)	Shuffle stage records (original)	Shuffle stage records (pruned)	Time with shuffle pruning disabled (s)	Time with shuffle pruning enabled (s)
q13	3.2 GiB	93.1 MiB	86,409,332	1,480,662	13	14
q16	158.2 GiB	270.7 MiB	7,136,969,739	10,295,246	84	36
q24a	355.6 GiB	39.7 GiB	13,428,037,922	1,504,810,137	660	432
q24b	355.6 GiB	39.7 GiB	13,428,037,922	1,504,810,137	660	492
q65	37.2 GiB	37.2 GiB	2,627,543,089	2,627,543,089	45	45
q72	40.9 GiB	1543.3 MiB	1,418,327,817	47,248,271	276	38
q80	8.3 GiB	8.2 GiB	295,853,928	292,353,065	37	36
q85	12.8 GiB	1508.2 MiB	329,635,219	37,337,231	16	12
q93	26.7 GiB	435.9 MiB	1,389,592,792	20,744,184	270	258
q94	87.6 GiB	1012.0 MiB	3,598,433,079	34,538,210	58	27
q95	640.5 GiB	344.6 GiB	872,596,314	793,654,526	414	402
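To make the table easier to read, the following small sketch computes the fraction of shuffle records removed and the wall-clock speedup for a row. `PruningStats` and its method names are hypothetical helpers introduced here, not part of the PR.

```scala
object PruningStats {
  // Fraction of shuffle records removed by pruning (0.0 means nothing was pruned).
  def recordReduction(originalRecords: Long, prunedRecords: Long): Double =
    1.0 - prunedRecords.toDouble / originalRecords.toDouble

  // Time with pruning disabled divided by time with pruning enabled.
  // Values below 1.0 mean the query got slower.
  def speedup(disabledSeconds: Double, enabledSeconds: Double): Double =
    disabledSeconds / enabledSeconds
}
```

For q72, `recordReduction(1418327817L, 47248271L)` is roughly 0.967 and `speedup(276, 38)` is roughly 7.3. For q65 nothing is pruned and the time is unchanged, and for q13 the speedup is slightly below 1 (13 s versus 14 s).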

SparkQA · 2020-07-10T20:22:13Z

Test build #125627 has finished for PR 29065 at commit a47485b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class RuntimeBloomFilterPruningSubquery(
case class BuildBloomFilter(
case class InBloomFilter(bloomFilterExp: Expression, value: Expression)
case class BloomFilterSubqueryExec(

jovany-wang · 2020-07-22T05:49:51Z

Hi @wangyum, this looks like a nice PR to me, but I'd like to raise a few issues.

I haven't done a detailed perf comparison between MinMax and Bloom, but my sense is that they may affect performance differently in different cases.
So how about making these things more general? For example:

                          DynamicFilter
                                |
               Is the filtering key partitioned?
                        /                  \
                      Y                     N
                     /                       \
              DPP filter         Choose a best filter for it. (from MinMax, Bloom or other filters such as index filter, etc)
                                       Note: Not all of the filters can be pushed to scan.

That is just a rough idea, but the key point is to make DynamicFilter (or call it RuntimeFilter) more general (meaning MinMaxFilter, BloomFilter and DPPFilter are all DynamicFilters), so that it is easy to extend.

I have seen another proposal for a RuntimeFilter (MinMax) before, so making this easy to extend should be as important as the perf result. Hmm, though maybe it's hard to make it more extensible.

Feel free to point out anything I have misunderstood, thx.
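The decision tree in this comment could be modeled roughly as follows. This is only a sketch of the suggestion: every name here (`DynamicFilter`, `Candidate`, `choose`) is hypothetical, and the `estimatedCost` input is an assumption, since the comment does not say how the "best" filter would be chosen.

```scala
object DynamicFilterSketch {
  // Hypothetical common abstraction over the filter kinds named in the comment.
  sealed trait DynamicFilter
  case object DppFilter extends DynamicFilter
  case object MinMaxFilter extends DynamicFilter
  case object BloomFilter extends DynamicFilter

  // Assumed input: some cost estimate per candidate filter (lower is better).
  final case class Candidate(filter: DynamicFilter, estimatedCost: Double)

  // Partitioned filtering keys use DPP; otherwise pick the cheapest candidate,
  // or no filter at all when there are no candidates.
  def choose(isPartitionKey: Boolean, candidates: Seq[Candidate]): Option[DynamicFilter] =
    if (isPartitionKey) Some(DppFilter)
    else if (candidates.isEmpty) None
    else Some(candidates.minBy(_.estimatedCost).filter)
}
```

With a sealed trait, adding another filter kind (for example an index filter) would mean adding one case object and one candidate, which is the extensibility the comment asks for.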

wangyum · 2020-07-28T06:38:30Z

@jovany-wang Thank you very much for your suggestion. I appreciate the time and effort you have spent to share your insightful comments, which will be seriously considered.

SparkQA · 2020-07-28T06:50:58Z

Test build #126700 has finished for PR 29065 at commit da0a420.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-28T11:54:58Z

Test build #126705 has finished for PR 29065 at commit 94bfb36.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-08-10T13:41:41Z

Retest this please.

dongjoon-hyun · 2020-08-10T13:50:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DynamicPruning.scala

-  with DynamicPruning
-  with Unevaluable {
+    with DynamicPruning
+    with Unevaluable {


The original indentation is correct.

https://github.com/databricks/scala-style-guide/blob/master/README.md#indent

dongjoon-hyun · 2020-08-10T13:51:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DynamicPruning.scala

+    exprId: ExprId = NamedExpression.newExprId)
+  extends SubqueryExpression(buildQuery, Seq(pruningKey), exprId)
+    with DynamicPruning
+    with Unevaluable {


Please see https://github.com/databricks/scala-style-guide/blob/master/README.md#indent and adjust the indentation.
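For reference, here is a small compilable sketch of the layout the linked style guide describes, as I read it: constructor parameters indented four spaces relative to the class keyword, with `extends` and each `with` indented two spaces. The traits and class below are stand-ins invented for this example, not the real Spark types.

```scala
object IndentSketch {
  // Stand-in types so the example compiles without Spark.
  trait DynamicPruningLike
  trait UnevaluableLike
  abstract class SubqueryLike(val name: String)

  case class ExampleSubquery(
      pruningKey: String,
      exprId: Long = 0L)
    extends SubqueryLike(pruningKey)
    with DynamicPruningLike
    with UnevaluableLike {

    def describe: String = s"$name#$exprId"
  }
}
```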

dongjoon-hyun · 2020-08-10T13:59:13Z

Hi, @wangyum. The doc and PR look reasonable. Is there a plan for further updates, since the title still has [WIP]?

SparkQA · 2020-08-10T18:41:25Z

Test build #127281 has finished for PR 29065 at commit 94bfb36.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-08T07:05:02Z

Test build #128383 has finished for PR 29065 at commit 1a8cc9b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-10-05T12:40:22Z

What's the current status of this PR? Waiting for reviews?

maropu · 2020-10-05T12:42:49Z

Reduce the shuffle data can significantly improve the query performance

btw, IMHO the name ShufflePruning looks a bit misleading. At first I thought this PR aimed to remove shuffle exchanges by using runtime filters.

maropu · 2020-10-05T12:48:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/DynamicPruning.scala

+  }
+}
+
+case class DynamicShufflePruningSubquery(


DynamicPartitionPruningSubquery and DynamicShufflePruningSubquery look almost the same, so do we need this new predicate? Could we add a class field to DynamicPruningSubquery that represents the pruning type, like this?

case class DynamicPruningSubquery(
    pruningKey: Expression,
    buildQuery: LogicalPlan,
    buildKeys: Seq[Expression],
    broadcastKeyIndex: Int,
    onlyInBroadcast: Boolean,
    exprId: ExprId,
    pruningType: PruningType) <---- This?
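A self-contained sketch of what such a `pruningType` field might look like is given below. The comment does not define `PruningType`, so the sealed trait, its two cases, and the simplified `DynamicPruningSubquerySketch` class (which uses plain types in place of Catalyst's `Expression` and `LogicalPlan`) are all assumptions for illustration.

```scala
object PruningTypeSketch {
  // Hypothetical enumeration of pruning kinds.
  sealed trait PruningType
  case object PartitionPruning extends PruningType
  case object ShufflePruning extends PruningType

  // Simplified stand-in for the expression; only a few fields are kept.
  final case class DynamicPruningSubquerySketch(
      pruningKey: String,
      broadcastKeyIndex: Int,
      onlyInBroadcast: Boolean,
      pruningType: PruningType)

  // A single class lets callers branch on the pruning kind in one place.
  def describe(s: DynamicPruningSubquerySketch): String = s.pruningType match {
    case PartitionPruning => s"prune partitions on ${s.pruningKey}"
    case ShufflePruning => s"prune shuffle input on ${s.pruningKey}"
  }
}
```

Since the trait is sealed, the compiler can warn about any match that forgets one of the pruning kinds.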

maropu · 2020-10-05T12:49:43Z

sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala

+                val hasBenefit = pruningHasBenefit(r, partScan, l, left)
+                newRight = insertPartitionPredicate(r, newRight, l, left, leftKeys, hasBenefit)
+              // shuffle pruning
+              case None if conf.dynamicShufflePruningEnabled && canPruneRight(joinType) &&


Is this new feature enabled only if both dynamicPartitionPruningEnabled and dynamicShufflePruningEnabled are true?

maropu · 2020-10-05T12:52:44Z

...src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PlanDynamicPruningFilters.scala

-          DynamicPruningExpression(InSubqueryExec(value, broadcastValues, exprId))
+          val broadcastValues = SubqueryBroadcastExec(name, broadcastKeyIndex, buildKeys, exchange)
+          if (preferBloomFilter(buildKeys(broadcastKeyIndex), buildPlan)) {
+            DynamicPruningExpression(BloomFilterSubqueryExec(value, broadcastValues, exprId))


Does this PR propose two things: 1. improving the existing partition pruning by using bloom filters, and 2. implementing a new dynamic pruning strategy (shuffle pruning)?

maropu · 2020-10-05T12:57:26Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

+      s"Bloom filter only supports atomic types, but got ${colType.catalogString}.")
+
+    val updater: (BloomFilter, InternalRow) => Unit =
+      (filter, row) => BloomFilterUtils.putValue(filter, row.get(0, colType))


I think this change can cause a perf regression because the pattern match on colType happens every time updater is called.
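The concern can be illustrated with a dependency-free sketch that contrasts dispatching on the column type inside the per-row closure with dispatching once up front and returning a specialized closure. `ColType`, `RecordingFilter`, and both updater functions are hypothetical stand-ins; they do not reproduce `BloomFilterUtils.putValue` or Spark's `DataType`, and whether the difference is measurable in practice would need benchmarking.

```scala
import scala.collection.mutable.ArrayBuffer

object UpdaterSketch {
  // Stand-in for a column data type.
  sealed trait ColType
  case object LongCol extends ColType
  case object StringCol extends ColType

  // Stand-in for a Bloom filter: it just records what was inserted.
  final class RecordingFilter {
    val longs: ArrayBuffer[Long] = ArrayBuffer.empty[Long]
    val strings: ArrayBuffer[String] = ArrayBuffer.empty[String]
  }

  // Per-call dispatch: the match on colType runs for every value.
  def perCallUpdater(colType: ColType): (RecordingFilter, Any) => Unit =
    (filter, value) => colType match {
      case LongCol => filter.longs += value.asInstanceOf[Long]
      case StringCol => filter.strings += value.asInstanceOf[String]
    }

  // Hoisted dispatch: the match runs once and yields a specialized closure.
  def hoistedUpdater(colType: ColType): (RecordingFilter, Any) => Unit =
    colType match {
      case LongCol => (filter, value) => filter.longs += value.asInstanceOf[Long]
      case StringCol => (filter, value) => filter.strings += value.asInstanceOf[String]
    }
}
```

Both updaters produce the same result; the second simply moves the type dispatch out of the per-row path.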

github-actions · 2021-01-14T01:30:37Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

wangyum · 2021-08-15T11:01:22Z

Some real cases from our cluster.
Case 1:

Before this PR	After this PR

Case 2:

Before this PR	After this PR

probot-autolabeler bot added the SQL label Jul 10, 2020

wangyum closed this Jul 28, 2020

wangyum deleted the SPARK-32268 branch July 28, 2020 07:06

wangyum restored the SPARK-32268 branch July 28, 2020 07:06

wangyum reopened this Jul 28, 2020

dongjoon-hyun reviewed Aug 10, 2020

wangyum added 4 commits September 6, 2020 19:57

Init

367dc68

Update

014b008

Rebase

725404f

Reduce bloomFilterThreshold to 100000L to improve q85

1a8cc9b

maropu reviewed Oct 5, 2020

github-actions bot added the Stale label Jan 14, 2021

github-actions bot closed this Jan 15, 2021
