[SPARK-32932][SQL] Do not use local shuffle reader at final stage on write command #29797
Conversation
Force-pushed from fc831f6 to 93775e8.
Test build #128851 has finished for PR 29797 at commit
Test build #128858 has finished for PR 29797 at commit
Yea, this is an issue, thanks for reporting! BTW, why is it OK when partition coalescing is enabled?

@cloud-fan It seems we should be stricter about applying the local shuffle reader, to make sure we can satisfy the expected parallelism.
noUserSpecifiedNumPartition = conf.coalesceShufflePartitionsEnabled &&
  r.optNumPartitions.isEmpty) :: Nil
Two questions so far.
- Seems to me RepartitionByExpression here is not always for the dynamic partition overwrite case. With this change, all RepartitionByExpression nodes are affected by the coalesceShufflePartitionsEnabled config.
- So users of dynamic partition overwrite need to set conf.coalesceShufflePartitionsEnabled to false to avoid that? What if the user needs to coalesce shuffle partitions in the query, but also needs dynamic partition overwrite?
@viirya Thanks for the good questions.

This solution is not ideal until we find a way to avoid applying the local shuffle reader when the partitioning doesn't match that of the dynamic partition. It's not ideal either that any AQE config is global. With coalesceShufflePartitionsEnabled and localShuffleReaderEnabled, the output partitioning of a shuffle is uncertain, which arguably contradicts the purpose of RepartitionByExpression.

As for "If the user needs to coalesce shuffle partitions in the query, but also needs dynamic partition overwrite?": another option is to introduce a new config specific to repartition, e.g. spark.sql.adaptive.repartition.canChangeNumPartitions
I think the root cause is: we give the local shuffle reader an expected parallelism, but it's not always satisfied. For example, too many mappers can lead to over-parallelism and many small files. We should add a check and only apply the local shuffle reader if the expected parallelism can be satisfied.
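The check described above could look something like this sketch (all names and the threshold are assumed for illustration, not Spark's actual implementation):

```scala
// Hypothetical sketch: decide whether a local shuffle reader keeps the
// effective parallelism close to what the shuffle promised.
def parallelismSatisfied(
    numMappers: Int,
    expectedPartitions: Int,
    maxFactor: Double = 2.0): Boolean = {
  // With a local reader, each mapper's output is read locally, so the
  // effective parallelism (and the number of output files for a write)
  // scales with the mapper count rather than the reducer count.
  numMappers <= (expectedPartitions * maxFactor).toInt
}
```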
Could you please elaborate on this? How will it solve the small-files issue caused by the removed repartition when writing to a dynamic partition or bucket table?
See
If we skip the local shuffle reader, we fix the small-files issue.

Are you talking about a different issue? IIUC the repartition (shuffle) is still there, and the only problem is the local shuffle reader.
I mean the physical shuffle doesn't happen, so the bound on how many files each shuffle task generates is lost.

If we add a check, I'm not sure whether the local shuffle reader will ever be applied in practice. In our use cases, the target bucket tables usually have more than 1000 buckets.
OK, I get your point. The file write node works better with a certain partitioning, and the local shuffle reader breaks it. Then how about we update

That will also disable the local shuffle reader for broadcast join. How about adding a variable

We can just disable the local shuffle reader in the final stage, if the root node is a data writing node. What do you think?
Force-pushed from 93775e8 to 7e0d766.
@cloud-fan I redid the PR as you suggested. Please help review.
Test build #129110 has finished for PR 29797 at commit
…DataWritingCommand
Force-pushed from 7e0d766 to 84134b0.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129111 has finished for PR 29797 at commit
@@ -102,6 +103,14 @@ case class AdaptiveSparkPlanExec(
    OptimizeLocalShuffleReader(conf)
  )

  @transient private val finalStageOptimizerRules: Seq[Rule[SparkPlan]] =
    context.qe.sparkPlan match {
      case _: DataWritingCommandExec =>
We need to match all writing commands, including DS v1, v2 and file source. Maybe we can create a tagging trait like UserDefinedExpression, to tag these writing commands.
It seems DSv2 is not ready for write as per https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L988-L994.
Meanwhile, will it be too big a change for those interfaces to extend the tagging trait?
File source v2 is not ready yet, but it doesn't mean DS v2 is not ready for writing. Please follow InsertAdaptiveSparkPlan.applyInternal
Sure, is there a UT for DS v2 write? I find only V1 is used for write no matter what format I specify.
See DataSourceV2Suite; we have a test v2 source that supports writing.
I've added a match for V2TableWriteExec and a UT. Not sure I've got it right, as the v2 API is quite new to me. Please help review. Thanks.
Kubernetes integration test starting
Kubernetes integration test status success
@@ -102,6 +104,16 @@ case class AdaptiveSparkPlanExec(
    OptimizeLocalShuffleReader(conf)
  )

  @transient private val finalStageOptimizerRules: Seq[Rule[SparkPlan]] =
It's only called once, so it can be a def.
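Applied to the snippet above, the suggestion would look roughly like this (the non-writing fallback branch is assumed for illustration):

```scala
// Sketch: a plain `def` instead of a `@transient val`, since the rule
// sequence is only evaluated once when the final stage is optimized.
private def finalStageOptimizerRules: Seq[Rule[SparkPlan]] =
  context.qe.sparkPlan match {
    case _: DataWritingCommandExec =>
      // Keep the writer's expected partitioning: drop the local-reader rule.
      queryStageOptimizerRules.filterNot(_.isInstanceOf[OptimizeLocalShuffleReader])
    case _ =>
      queryStageOptimizerRules
  }
```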
// Test DataSource v2
withTempPath { f =>
  val path = f.getCanonicalPath
  val format = classOf[V2Source].getName
Can we just use NoopDataSource?
Using NoopDataSource will fail:
java.lang.IllegalArgumentException: requirement failed: The provided partitioning does not match of the table.
- provided: j
- table:
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.DataFrameWriter.checkPartitioningMatchesV2Table(DataFrameWriter.scala:772)
Partitioned or not doesn't matter; we can test df.write.format(...).save()
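A minimal version of that suggested test could be sketched like this (assuming a SparkSession `spark` with AQE enabled and `spark.implicits._` imported; the noop source simply discards all rows):

```scala
// Write through the built-in "noop" data source; no path and no table
// partitioning is needed, which avoids the partitioning-mismatch error.
spark.range(100)
  .selectExpr("id % 10 AS j", "id")
  .repartition($"j")          // the repartition the final stage should preserve
  .write
  .format("noop")
  .mode("overwrite")          // the noop source requires an explicit save mode
  .save()
```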
Test build #129691 has finished for PR 29797 at commit
case _: DataWritingCommandExec | _: V2TableWriteExec =>
  // SPARK-32932: Local shuffle reader could break partitioning that works best
  // for the following writing command
  queryStageOptimizerRules.filterNot(_.isInstanceOf[OptimizeLocalShuffleReader])
nit: indentation is one space off.
// Test DataSource v2
withTempPath { f =>
  val path = f.getCanonicalPath
path is not needed for the noop source. We can just call save().
And we can remove withTempPath.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129767 has finished for PR 29797 at commit
Thanks, merging to master!
@manuzhang This change seems incomplete. The problem here is not about the writer, but rather about the repartition Exchange. We should avoid letting the LSR rule work on such shuffles across the board, not just in the writer case, so that the rule only modifies the child node of an Exchange, which is its purpose in the first place.
@maryannxue This could miss a LSR optimization on

Can you highlight the problem? I do see
@cloud-fan Sorry, it's the probe side,
I assume in this case the repartition node has been optimized out. Then this fix still won't cover the repartition-only case without a writer, right?
We have some fundamental problems hanging around in our optimizations. In general, when we optimize a shuffle over another shuffle that is semantically equivalent (almost equivalent in this case), we should get rid of the child shuffle instead of the parent one, yet that's a little hard to implement right now.
@maryannxue Thanks for pointing out the fundamental problems; I'm not familiar with the design behind this. Starting out, I was trying to solve the most urgent issue in our use cases: too many output files caused by LSR, which broke the downstream jobs in the pipeline. Meanwhile, disabling LSR entirely degrades the performance of BHJ converted from SMJ. Hence this PR, with the help of @cloud-fan.
I'm reading the classdoc of
The reason for condition 1 is that it will never introduce a shuffle. This is true, but it may change the final output partitioning, which can be bad for cases like the write command. I like the idea from @maryannxue, which is more general: 1) move the LSR rule into postStageCreationRules; and 2) make the LSR rule match an Exchange first (so condition 1 becomes: the shuffle is a direct child of an Exchange). By doing this we can skip the LSR rule in the last stage, as the last stage's root node is not an Exchange.

I'm not very sure why the current approach can add LSR to the BHJ probe side. This seems like an accident to me, as it's not mentioned in the classdoc.
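The two-part idea above might be sketched as follows (the helper names `supportsLocalReader` and `createLocalReader` are assumed for illustration, not Spark's actual API):

```scala
// Sketch: match an Exchange parent first, so a shuffle that is the root of
// the last stage (which has no Exchange above it) is never turned into a
// local shuffle reader, preserving the final output partitioning.
plan transformDown {
  case e: Exchange =>
    e.mapChildren {
      case s: ShuffleQueryStageExec if supportsLocalReader(s) =>
        createLocalReader(s)   // assumed helper wrapping the stage
      case other => other
    }
}
```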
In LSR, we have a sanity check to make sure the output partitioning requirement is not broken after this rule. For example, suppose there's a parent join sitting above the current BHJ whose probe side we are trying to optimize to LSR, and that parent join has to take advantage of the output partitioning of the current BHJ. When applying the LSR rule, the check would fail, and we would back out of it.
@maryannxue Thanks for the thorough explanation. Let me rework this.
Thank you, @manuzhang! I have a local fix already, and I'll submit a PR shortly. I assume the coalescing rule has a similar issue: a repartition shuffle with a specific number is optimized out because of another shuffle introduced by a join or aggregate, and later that shuffle gets coalesced, which means the specified repartition number is ignored.
@maryannxue Great to know. I've been testing a draft fix, which doesn't look good without optimizing the probe side of BHJ. I can help test your PR. Coalescing won't be applied to a repartition if the user specifies a repartition number.
It does apply when the specified number happens to be the default partition number, right?
What changes were proposed in this pull request?
Do not use the local shuffle reader at the final stage if the root node is a write command.
Why are the changes needed?
Users usually repartition by the partition column for dynamic partition overwrite. AQE could break this by removing the physical shuffle with a local shuffle reader, which can lead to a large number of output files, even exceeding the file system limit.
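For context, the usage pattern this protects is roughly the following (the path and column names are illustrative):

```scala
// Dynamic partition overwrite with an explicit repartition by the partition
// column, so each task writes to only a few partitions and few files.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.repartition($"dt")
  .write
  .mode("overwrite")
  .partitionBy("dt")
  .parquet("/tmp/events")   // illustrative output path
```

Without this fix, AQE's local shuffle reader could replace the repartition shuffle, so every map task writes files into every partition it touches.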
Does this PR introduce any user-facing change?
Yes.
How was this patch tested?
Add test.