
[SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable #29473

Closed
wants to merge 11 commits

Conversation

imback82
Contributor

@imback82 imback82 commented Aug 19, 2020

What changes were proposed in this pull request?

#28123 and #29079 introduced coalescing bucketed tables for sort merge join / shuffled hash join.

This PR proposes to introduce repartitioning bucketed tables to increase parallelism, at the cost of reading some source data more than once. It is applied if all of the following conditions are met (a small eligibility sketch follows the list):

  • Join is sort merge join or shuffled hash join.
  • Join keys match the output partitioning expressions on their respective sides.
  • The larger bucket count is divisible by the smaller bucket count.
  • spark.sql.sources.bucketing.readStrategyInJoin is set to repartition.
  • The ratio of the number of buckets should be less than the value set in spark.sql.sources.bucketing.readStrategyInJoin.maxBucketRatio.
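
For illustration, a minimal sketch of the bucket-count part of these conditions (the helper name and signature are hypothetical, not code from this PR; the actual rule also checks the join type, join keys, and the configured read strategy):

// Hypothetical check mirroring the divisibility and ratio conditions listed above.
def bucketCountsEligible(leftBuckets: Int, rightBuckets: Int, maxBucketRatio: Int): Boolean = {
  val large = math.max(leftBuckets, rightBuckets)
  val small = math.min(leftBuckets, rightBuckets)
  // The larger bucket count must be divisible by the smaller one, and the
  // ratio must be less than the configured maximum.
  large % small == 0 && large / small < maxBucketRatio
}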

Why are the changes needed?

Coalescing buckets is useful, but depending on the workload, repartitioning can also help by increasing parallelism.

Does this PR introduce any user-facing change?

Yes. If the bucket repartitioning conditions explained above are met, a full shuffle can be eliminated (also note that you will see SelectedBucketsCount: 4 out of 4 (Repartitioned to 8) in the physical plan):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
spark.conf.set("spark.sql.sources.bucketing.readStrategyInJoin", "repartition")
val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k")
val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", "k")
df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val joined = t1.join(t2, t1("i") === t2("i"))
joined.explain

== Physical Plan ==
*(3) SortMergeJoin [i#38], [i#44], Inner
:- *(1) Sort [i#38 ASC NULLS FIRST], false, 0
:  +- *(1) Filter isnotnull(i#38)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t1[i#38,j#39,k#40] Batched: true, DataFilters: [isnotnull(i#38)], Format: Parquet, Location: InMemoryFileIndex[], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 8 out of 8
+- *(2) Sort [i#44 ASC NULLS FIRST], false, 0
   +- *(2) Filter isnotnull(i#44)
      +- FileScan parquet default.t2[i#44,j#45,k#46] Batched: false, DataFilters: [isnotnull(i#44)], Format: Parquet, Location: InMemoryFileIndex[], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 4 out of 4 (Repartitioned to 8)

How was this patch tested?

Added new tests.

@imback82 imback82 marked this pull request as draft August 19, 2020 01:15
@imback82 imback82 changed the title [WIP][SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable Aug 19, 2020
@SparkQA

SparkQA commented Aug 19, 2020

Test build #127609 has finished for PR 29473 at commit 21882ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21
Contributor

c21 commented Aug 22, 2020

@imback82 - thanks for working on this.

I see this is still marked as a draft; please change it once it's ready for review. Thanks.

@maropu
Member

maropu commented Aug 24, 2020

Thanks for the work, @imback82. Just a question: could we set a simpler rule to select which strategy (repartitioning or coalescing) to use when reading buckets? I think it is a bit annoying to set repartitionBucketsInJoin.enabled or coalesceBucketsInJoin.enabled to true on a per-dataset basis. What factor makes the difference between the two strategies, e.g., bucket size? For example, if one has very small buckets (an extreme case, admittedly), repartitioning to a larger number of buckets might not improve performance.

if (conf.coalesceBucketsInJoinEnabled && conf.repartitionBucketsInJoinEnabled) {
  throw new AnalysisException("Both 'spark.sql.bucketing.coalesceBucketsInJoin.enabled' and " +
    "'spark.sql.bucketing.repartitionBucketsInJoin.enabled' cannot be set to true at the " +
    "same time")
Member

Could we use Enumeration and checkValues instead? I think this check should be done in SQLConf.

Member

For example;

  object BucketReadStrategyMode extends Enumeration {
    val COALESCING, REPARTITIONING, AUTOMATIC, OFF  = Value
  }

  val BUCKET_READ_STRATEGY_MODE =
    buildConf("...")
      .version("3.1.0")
      .stringConf
      .transform(_.toUpperCase(Locale.ROOT))
      .checkValues(BucketReadStrategyMode.values.map(_.toString))
      .createWithDefault(BucketReadStrategyMode.OFF.toString)
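
A consumer of this entry would then parse the stored string back into the enum; for example (a sketch, assuming a SQLConf getter for the entry above):

  // Sketch: turn the stored config string back into the enum value.
  val readStrategy = BucketReadStrategyMode.withName(conf.getConf(BUCKET_READ_STRATEGY_MODE))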

Contributor Author

Thanks for the suggestion! I think the new config makes more sense. I renamed a few of the configs; let me know if they don't make sense.

Btw, do you think I can introduce AUTOMATIC as a follow-up, since this PR is already sizable? Let me know if you want to see it in this PR. Thanks.

// `RepartitioningBucketRDD` converts columnar batches to rows to calculate the bucket id for
// each row; thus, columnar execution is not supported when `RepartitioningBucketRDD` is used,
// to avoid converting batches to rows and back to batches.
relation.fileFormat.supportBatch(relation.sparkSession, schema) && !isRepartitioningBuckets
Member

I'm not sure how much performance gain this columnar execution provides, but is the proposed idea to give up that gain and use bucket repartitioning instead?

Contributor Author
@imback82 imback82 Aug 27, 2020

Note that the datasource will still be read as batches in this case (if whole stage codegen is enabled).

I see that physical plans operate on rows, so batches are converted to rows via ColumnarToRow anyway. So, I think perf impact would be minimal here; the difference could be the code-gen conversion from columnar to row vs. iterating batch.rowIterator() in BucketRepartitioningRDD.
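
For context, a sketch of what iterating a batch's rows directly looks like (assuming Spark's ColumnarBatch API; this is illustrative, not code from the PR):

import scala.collection.JavaConverters._
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Pull rows out of a columnar batch the way a row-based repartitioning RDD would,
// instead of relying on the code-generated ColumnarToRow conversion.
def rowsOf(batch: ColumnarBatch): Iterator[InternalRow] = batch.rowIterator().asScala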

relation.fileFormat.supportBatch(relation.sparkSession, schema) && !isRepartitioningBuckets
}

@transient private lazy val isRepartitioningBuckets: Boolean = {
Member

nit: we don't need : Boolean ?

Contributor Author

I followed the same style from override lazy val supportsColumnar: Boolean, etc. Is this still not needed?

@imback82
Contributor Author

Just a question: could we set a simpler rule to select which strategy (repartitioning or coalescing) to use when reading buckets? I think it is a bit annoying to set repartitionBucketsInJoin.enabled or coalesceBucketsInJoin.enabled to true on a per-dataset basis.

Good point. One use case for repartition over coalesce is when there are enough cores available in the cluster, so coalescing would needlessly reduce parallelism. @c21, did you observe any patterns or heuristics on your workloads where repartition is preferred?

For example, if one has very small buckets (an extreme case, admittedly), repartitioning to a larger number of buckets might not improve performance.

This is still guarded by spark.sql.bucketing.coalesceOrRepartitionBucketsInJoin.maxBucketRatio, so this scenario is a little less concerning?

@maropu
Member

maropu commented Aug 28, 2020

btw, still Draft?

@maropu
Member

maropu commented Aug 28, 2020

Also, I think it's better to include performance numbers for this proposed idea in the PR description above.

@imback82
Contributor Author

btw, still Draft?

I still need to add a few more tests.

Also, I think it's better to include performance numbers for this proposed idea in the PR description above.

Yes, will update.

@SparkQA

SparkQA commented Aug 29, 2020

Test build #128003 has finished for PR 29473 at commit 5665bc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82 imback82 marked this pull request as ready for review August 29, 2020 03:25
@SparkQA

SparkQA commented Aug 29, 2020

Test build #128011 has finished for PR 29473 at commit e2374ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"equal to this value for bucket coalescing to be applied. This configuration only " +
s"has an effect when '${COALESCE_BUCKETS_IN_JOIN_ENABLED.key}' is set to true.")
val BUCKET_READ_STRATEGY_IN_JOIN =
buildConf("spark.sql.bucketing.bucketReadStrategyInJoin")
Contributor

nit: shall we have a name consistent with the existing "spark.sql.sources.bucketing" configs, e.g. "spark.sql.sources.bucketing.readStrategyInJoin"? No big deal, but "bucketing.bucket..." seems a little verbose. Pointing this out because users may depend on this config for bucketing optimization and raise questions to developers about it.

Contributor Author

Makes sense. I changed this to spark.sql.sources.bucketing.readStrategyInJoin.

.createWithDefault(BucketReadStrategyInJoin.OFF.toString)

val BUCKET_READ_STRATEGY_IN_JOIN_MAX_BUCKET_RATIO =
buildConf("spark.sql.bucketing.bucketReadStrategyInJoin.maxBucketRatio")
Contributor

nit: same as above, might be just "spark.sql.sources.bucketing.readStrategyInJoinMaxBucketRatio"?

Contributor Author

I changed this to spark.sql.sources.bucketing.readStrategyInJoin.maxBucketRatio, but I don't have a strong opinion; let me know if spark.sql.sources.bucketing.readStrategyInJoinMaxBucketRatio is better.

@@ -314,7 +324,7 @@ case class FileSourceScanExec(
val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 1)

// TODO SPARK-24528 Sort order is currently ignored if buckets are coalesced.
if (singleFilePartitions && optionalNumCoalescedBuckets.isEmpty) {
if (singleFilePartitions && (optionalNewNumBuckets.isEmpty || isRepartitioningBuckets)) {
Contributor

we don't need `|| isRepartitioningBuckets`, right?

Contributor Author

Repartitioning can still maintain the sort order, whereas coalescing cannot, so this check is needed.

// There are now more files to be read.
val filesNum = filePartitions.map(_.files.size.toLong).sum
val filesSize = filePartitions.map(_.files.map(_.length).sum).sum
driverMetrics("numFiles") = filesNum
Contributor

Per setFilesNumAndSizeMetric, should we set staticFilesNum here, or numFiles?

Contributor Author

I think staticFilesNum is used only for dynamic partition pruning:

/** SQL metrics generated only for scans using dynamic partition pruning. */
private lazy val staticMetrics = if (partitionFilters.filter(isDynamicPruningFilter).nonEmpty) {
Map("staticFilesNum" -> SQLMetrics.createMetric(sparkContext, "static number of files read"),
"staticFilesSize" -> SQLMetrics.createSizeMetric(sparkContext, "static size of files read"))

@c21
Contributor

c21 commented Aug 29, 2020

did you observe any patterns or heuristics on your workloads where repartition is preferred?

From our side, honestly, we don't currently have any automation for deciding between coalesce and repartition. We provide configs similar to the ones here so that users can control coalesce vs. repartition themselves.

I think a rule of thumb can be that we don't want to:
  • coalesce: if the coalesced table is too big and the number of coalesced buckets is too small, then each task has too much data and will take longer.
  • repartition: if the repartitioned table is too big and the number of repartitioned buckets is too large, then too much duplicated data is read, incurring much more CPU/IO cost (possibly worse than just shuffling this table).
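
Until such automation exists, the choice stays per workload via the configs proposed in this PR; for example (the values below are illustrative, and the names are the ones proposed in this PR, so they may still change):

// Prefer repartitioning when the cluster has spare cores and the bucket-count ratio is small;
// switch to "coalesce" (or leave the default "off") otherwise.
spark.conf.set("spark.sql.sources.bucketing.readStrategyInJoin", "repartition")
spark.conf.set("spark.sql.sources.bucketing.readStrategyInJoin.maxBucketRatio", "4")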

@SparkQA

SparkQA commented Aug 29, 2020

Test build #128012 has finished for PR 29473 at commit 7481e36.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128028 has finished for PR 29473 at commit 366c9c3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128027 has finished for PR 29473 at commit 2c4925b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

github-actions bot commented Dec 9, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 9, 2020
@github-actions github-actions bot closed this Dec 10, 2020