[SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable #29473
Conversation
Test build #127609 has finished for PR 29473 at commit
@imback82 - thanks for working on this. I see this is still marked as a draft; please change it once it's ready for review. Thanks.
Thanks for the work, @imback82. Just a question: can't we set a simpler rule to select which strategy (repartitioning or coalescing) to use when reading buckets? I think it is a bit annoying to have to set a separate boolean config to true for each strategy.
if (conf.coalesceBucketsInJoinEnabled && conf.repartitionBucketsInJoinEnabled) {
  throw new AnalysisException("Both 'spark.sql.bucketing.coalesceBucketsInJoin.enabled' and " +
    "'spark.sql.bucketing.repartitionBucketsInJoin.enabled' cannot be set to true at the " +
    "same time")
}
Could we use Enumeration and checkValues instead? I think this check should be done in SQLConf.
For example:
object BucketReadStrategyMode extends Enumeration {
  val COALESCING, REPARTITIONING, AUTOMATIC, OFF = Value
}

val BUCKET_READ_STRATEGY_MODE =
  buildConf("...")
    .version("3.1.0")
    .stringConf
    .transform(_.toUpperCase(Locale.ROOT))
    .checkValues(BucketReadStrategyMode.values.map(_.toString))
    .createWithDefault(BucketReadStrategyMode.OFF.toString)
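Outside of Spark's SQLConf machinery, the suggested pattern boils down to an Enumeration of allowed modes plus a checkValues-style guard. The sketch below is a self-contained illustration of that pattern (parseMode is a hypothetical helper, not a Spark API):

```scala
import java.util.Locale

// Enumeration of allowed read-strategy modes, as suggested above.
object BucketReadStrategyMode extends Enumeration {
  val COALESCING, REPARTITIONING, AUTOMATIC, OFF = Value
}

// checkValues-style validation: normalize case, then reject any value
// outside the enumeration before converting it to the enum type.
def parseMode(raw: String): BucketReadStrategyMode.Value = {
  val normalized = raw.toUpperCase(Locale.ROOT)
  val allowed = BucketReadStrategyMode.values.map(_.toString)
  require(allowed.contains(normalized),
    s"Value '$raw' must be one of ${allowed.mkString(", ")}")
  BucketReadStrategyMode.withName(normalized)
}
```

This mirrors how `.transform(_.toUpperCase(Locale.ROOT))` plus `.checkValues(...)` cooperate in the config builder: one boolean per strategy becomes a single validated mode.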
Thanks for the suggestion! I think the new config makes more sense. I renamed a few configs; let me know if they don't make sense.
Btw, do you think I can introduce AUTOMATIC as a follow-up, since this PR is sizable? Let me know if you want to see it in this PR. Thanks.
// `BucketRepartitioningRDD` converts columnar batches to rows in order to calculate the
// bucket id for each row. Columnar reads are therefore disabled when bucket repartitioning
// is used, to avoid converting batches to rows and then back to batches.
relation.fileFormat.supportBatch(relation.sparkSession, schema) && !isRepartitioningBuckets
I'm not sure how much performance gain this columnar execution brings, though; is the proposed idea to give up those gains and use bucket repartitioning instead?
Note that the datasource will still be read as batches in this case (if whole-stage codegen is enabled). I see that these physical plans operate on rows, so batches are converted to rows via ColumnarToRow anyway. So I think the perf impact would be minimal here; the difference would be the code-gen conversion from columnar to row vs. iterating batch.rowIterator() in BucketRepartitioningRDD.
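The per-row filtering being discussed can be sketched outside Spark. The toy below is only illustrative of the idea, not Spark's actual BucketRepartitioningRDD: Spark hashes the bucket columns with Murmur3, whereas this sketch stands in a plain hashCode, and the helper names are hypothetical.

```scala
// Each task reads one source bucket, computes a bucket id for every row
// from the key's hash, and keeps only the rows belonging to its own
// output bucket. Non-negative modulo mimics Spark's pmod behavior.
def bucketId(key: Any, numBuckets: Int): Int = {
  val h = key.hashCode()
  ((h % numBuckets) + numBuckets) % numBuckets
}

// Filter a bucket's rows down to those that belong to `myBucket`.
def rowsForBucket(rows: Iterator[(Int, String)], myBucket: Int, numBuckets: Int)
    : Iterator[(Int, String)] =
  rows.filter { case (key, _) => bucketId(key, numBuckets) == myBucket }
```

Note that every output bucket re-reads the whole source bucket and discards most rows, which is exactly the "reading duplicate source data" cost the PR description mentions.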
  relation.fileFormat.supportBatch(relation.sparkSession, schema) && !isRepartitioningBuckets
}

@transient private lazy val isRepartitioningBuckets: Boolean = {
nit: we don't need : Boolean?
I followed the same style as override lazy val supportsColumnar: Boolean, etc. Is this still not needed?
Good point. One use case for repartition over coalesce is when there are enough cores available in the cluster, so parallelism need not be reduced by coalescing. @c21, did you observe any patterns or heuristics in your workloads where repartition is preferred?
This is still guarded by
btw, still
Also, I think it's better to describe performance numbers for this proposed idea in the PR description above.
I still need to add a few more tests.
Yes, will update.
Test build #128003 has finished for PR 29473 at commit
Test build #128011 has finished for PR 29473 at commit
    "equal to this value for bucket coalescing to be applied. This configuration only " +
    s"has an effect when '${COALESCE_BUCKETS_IN_JOIN_ENABLED.key}' is set to true.")

val BUCKET_READ_STRATEGY_IN_JOIN =
  buildConf("spark.sql.bucketing.bucketReadStrategyInJoin")
nit: shall we have a name also consistent with the existing config "spark.sql.sources.bucketing", e.g. "spark.sql.sources.bucketing.readStrategyInJoin"? No big deal, but "bucketing.bucket..." seems a little verbose. Pointing this out because users might depend on this config for bucketing optimization and raise questions to developers about it.
Makes sense. I changed this to spark.sql.sources.bucketing.readStrategyInJoin.
    .createWithDefault(BucketReadStrategyInJoin.OFF.toString)

val BUCKET_READ_STRATEGY_IN_JOIN_MAX_BUCKET_RATIO =
  buildConf("spark.sql.bucketing.bucketReadStrategyInJoin.maxBucketRatio")
nit: same as above, might be just "spark.sql.sources.bucketing.readStrategyInJoinMaxBucketRatio"?
I changed this to spark.sql.sources.bucketing.readStrategyInJoin.maxBucketRatio, but I don't have a strong opinion on this if spark.sql.sources.bucketing.readStrategyInJoinMaxBucketRatio is better.
@@ -314,7 +324,7 @@ case class FileSourceScanExec(
    val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 1)

    // TODO SPARK-24528 Sort order is currently ignored if buckets are coalesced.
-   if (singleFilePartitions && optionalNumCoalescedBuckets.isEmpty) {
+   if (singleFilePartitions && (optionalNewNumBuckets.isEmpty || isRepartitioningBuckets)) {
we don't need || isRepartitioningBuckets, right?
Repartition can still maintain the sort order whereas coalescing cannot, thus this check is needed.
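The order-preservation argument above can be illustrated with a toy example (purely illustrative values, not Spark code): repartitioning only filters rows out of an already-sorted bucket file, and filtering preserves order, whereas coalescing concatenates several sorted files, whose concatenation is generally no longer sorted (hence the SPARK-24528 TODO).

```scala
// Two sorted bucket files of the same table.
val bucket0 = Seq(1, 4, 8, 12)
val bucket1 = Seq(2, 3, 9, 20)

// Repartitioning: each output bucket keeps a subset of one sorted file's
// rows (here, an arbitrary predicate stands in for the bucket-id check).
// A pure filter over a sorted sequence stays sorted.
val repartitioned = bucket0.filter(_ % 2 == 0)

// Coalescing: several sorted files are concatenated into one partition;
// the result is interleaved and would need an extra sort.
val coalesced = bucket0 ++ bucket1
```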
// There are now more files to be read.
val filesNum = filePartitions.map(_.files.size.toLong).sum
val filesSize = filePartitions.map(_.files.map(_.length).sum).sum
driverMetrics("numFiles") = filesNum
Per setFilesNumAndSizeMetric, should we set staticFilesNum here or numFiles?
I think staticFilesNum is used only for dynamic partition pruning:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala, lines 421 to 424 in cfe012a:
/** SQL metrics generated only for scans using dynamic partition pruning. */
private lazy val staticMetrics = if (partitionFilters.filter(isDynamicPruningFilter).nonEmpty) {
  Map("staticFilesNum" -> SQLMetrics.createMetric(sparkContext, "static number of files read"),
    "staticFilesSize" -> SQLMetrics.createSizeMetric(sparkContext, "static size of files read"))
From our side, honestly, we don't currently have any automation for deciding coalesce vs. repartition. We provide configs similar to the ones here so that users can control coalesce vs. repartition themselves. I think a rule of thumb can be that we don't want to
Test build #128012 has finished for PR 29473 at commit
Test build #128028 has finished for PR 29473 at commit
Test build #128027 has finished for PR 29473 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
#28123 and #29079 introduced coalescing bucketed tables for sort merge join / shuffled hash join.
This PR proposes to introduce repartitioning bucketed tables to increase parallelism at the cost of reading duplicate source data. It is applied if the following conditions are met:
- spark.sql.sources.bucketing.readStrategyInJoin is set to repartition.
- The ratio between the larger and the smaller number of buckets is less than or equal to spark.sql.sources.bucketing.readStrategyInJoin.maxBucketRatio.
Why are the changes needed?
Coalescing buckets is useful but repartitioning can also help due to the increased parallelism depending on the workloads.
Does this PR introduce any user-facing change?
Yes. If the bucket repartitioning conditions explained above are met, a full shuffle can be eliminated (also note that you will see SelectedBucketsCount: 4 out of 4 (Repartitioned to 8) in the physical plan).
How was this patch tested?
Added new tests.
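The two conditions listed in the description imply a simple bucket-count decision. The sketch below is a hypothetical reading of that rule, not Spark's exact implementation; the function name and the integer-ratio check are assumptions for illustration.

```scala
// Given the bucket counts of both join sides and the configured
// maxBucketRatio, decide whether to repartition the smaller side up to
// the larger bucket count (Some(n)) or to fall back to a shuffle (None).
def repartitionedNumBuckets(left: Int, right: Int, maxBucketRatio: Int): Option[Int] = {
  val (small, large) = (math.min(left, right), math.max(left, right))
  if (small == large) None                       // buckets already aligned
  else if (large / small > maxBucketRatio) None  // ratio too large; keep the shuffle
  else Some(large)                               // read the smaller side with `large` buckets
}
```

For example, with 4 and 8 buckets and the default-style ratio limit, the 4-bucket side would be repartitioned to 8, matching the "4 out of 4 (Repartitioned to 8)" plan string quoted above.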