[SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning #38464

Closed
wants to merge 4 commits

Conversation

wangyum (Member) commented Nov 1, 2022

What changes were proposed in this pull request?

This PR enhances DPP to use a bloom filter when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is disabled, the build plan cannot be broadcast by size, and an existing shuffle exchange can be reused.
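For reference, the flag named above is an existing Spark SQL configuration; a minimal spark-shell snippet (not part of the PR) that puts a session on this code path:

```scala
// In spark-shell, `spark` is the active SparkSession.
// Default is true, i.e. DPP only reuses broadcast exchange results; disabling it
// allows DPP to build a pruning subquery from a non-broadcast build plan.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "false")
```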

Why are the changes needed?

Avoid job failures when spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly is disabled. For example:

```sql
select catalog_sales.* from catalog_sales join catalog_returns
where cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 40000;
```

```
20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 494 tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
```
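A conceptual sketch of why a bloom filter avoids this failure (this is not the PR's code; it uses Spark's public org.apache.spark.util.sketch.BloomFilter API, and the sizing numbers are illustrative): instead of collecting every distinct join-key value to the driver, the build side is condensed into a fixed-size probabilistic filter that the pruning side only probes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

val spark = SparkSession.builder().getOrCreate()

// Build side: the filtered table from the failing query above.
val buildSide = spark.table("catalog_returns").where("cr_returned_time_sk < 40000")

// Condense the join keys into a compact bloom filter instead of collecting all
// distinct values to the driver. The 1M expected items and 3% false-positive
// rate are illustrative, not values chosen by the PR.
val bf: BloomFilter = buildSide.stat.bloomFilter("cr_order_number", 1000000L, 0.03)

// A key that "might" be contained must still be scanned; a key that is
// definitely absent can be pruned. False positives only cost extra scanning,
// they never drop correct results.
val shouldScan: Boolean = bf.mightContainLong(2452600L)
```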

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

wangyum (Member, Author) commented Nov 1, 2022

github-actions bot added the SQL label on Nov 1, 2022

dongjoon-hyun (Member) commented:

Thank you for pinging me, @wangyum.

```diff
@@ -65,7 +70,7 @@ case class PlanAdaptiveDynamicPruningFilters(
         DynamicPruningExpression(InSubqueryExec(value, broadcastValues, exprId))
       } else if (onlyInBroadcast) {
         DynamicPruningExpression(Literal.TrueLiteral)
-      } else {
+      } else if (canBroadcastBySize(buildPlan, conf)) {
```

A Contributor commented on this change:
This can be overestimated: the final plan has an Aggregate, which may dramatically reduce the data size.

```scala
      } else {
        val childPlan = adaptivePlan.executedPlan
        val reusedShuffleExchange = collectFirst(rootPlan) {
          case s: ShuffleExchangeExec if s.child.sameResult(childPlan) => s
```

A Contributor commented on this change:
This is another tricky part: is reusing shuffle always better than starting a new query with column pruning?

cloud-fan (Contributor) commented:

I agree with using bloom filters, as the size estimation can be wrong and the build size can be so large that InSubquery can't work. However, this PR contains another optimization: it forces shuffle reuse when building the subquery that builds the bloom filter. Can we do that later, with more discussion? It is a general optimization that can apply in other places as well: InSubquery DPP and bloom filter join.
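For context on the "bloom filter join" mentioned here: Spark 3.3 added an optimizer-injected runtime bloom filter for joins behind its own flag. A hedged example of enabling it (the configuration name is quoted from memory of the Spark 3.3+ docs; verify against your version):

```scala
// Injects a bloom filter built from one join side to filter the other side at
// runtime; introduced in Spark 3.3 and disabled by default in that release.
spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "true")
```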

github-actions bot commented:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label on Feb 25, 2023
github-actions bot closed this on Feb 26, 2023