[SPARK-47518][CORE] Skip transfer the last spilled shuffle data #45661
base: master
Conversation
According to the PR description, are you suggesting the Linux file move is slow, @wankunde? Could you elaborate on your experience with numbers?
there will still be a heavy data transfer.
java.nio.file.Files.move(Files.java:1395)
Test: moving a file to another disk that has high iowait is very slow, while moving the file within the same disk is quite fast.
Thanks. Could you put that into the PR description, @wankunde.
cc @cloud-fan , @mridulm , @tgravescs , @viirya too
Optional<File> finalDataFileDir;
if (shuffleExecutorComponents instanceof LocalDiskShuffleExecutorComponents) {
  File dataFile =
      new IndexShuffleBlockResolver(sparkConf, blockManager).getDataFile(shuffleId, mapId);
Is this only used to invoke getParentFile?
Yes
@@ -226,6 +226,24 @@ private[spark] class DiskBlockManager(
    (blockId, getFile(blockId))
  }

  /** Produces a unique block id and File suitable for storing shuffled intermediate results
   * in the input directory. */
nit.
/**
* Produces a unique block id and File suitable for storing shuffled intermediate results
* in the input directory.
*/
Fixed
var tmpFile = new File(fileDir, blockId.name)
while (tmpFile.exists()) {
  blockId = TempShuffleBlockId(UUID.randomUUID())
  tmpFile = new File(fileDir, blockId.name)
If fileDir is invalid, we are in an infinite loop, aren't we? It seems that we need a safeguard to avoid the infinite loop, @wankunde. Also, please add a unit test case for an invalid fileDir (maybe, one that does not exist).
Added a UT. If fileDir is invalid, tmpFile will not exist, so we exit this loop. A new block id is generated only when the fileDir is valid and the first TempShuffleBlockId file was already created by some other task.
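The retry loop discussed here can be sketched with an explicit attempt bound, which would also rule out any infinite loop regardless of the state of fileDir. This is a hypothetical Java sketch (class and method names are illustrative, not Spark's API); it only assumes that temp shuffle block names are "temp_shuffle_" plus a random UUID, as in TempShuffleBlockId:

```java
import java.io.File;
import java.util.UUID;

public class TempBlockSketch {
    // Hypothetical sketch of the name-collision retry loop with a bounded
    // attempt count, so a pathological fileDir can never spin forever.
    static File createUniqueTempFile(File fileDir, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            // TempShuffleBlockId's name is "temp_shuffle_" + a random UUID.
            File candidate = new File(fileDir, "temp_shuffle_" + UUID.randomUUID());
            if (!candidate.exists()) {
                return candidate; // unused name found, stop retrying
            }
        }
        throw new IllegalStateException(
            "Could not find a free temp shuffle file name in " + fileDir);
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        File f = createUniqueTempFile(dir, 10);
        System.out.println(f.getName().startsWith("temp_shuffle_"));
    }
}
```

A UUID collision is astronomically unlikely, so in practice the bound only matters when the directory is in a broken state.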
@@ -219,7 +221,15 @@ void closeAndWriteOutput() throws IOException {
    updatePeakMemoryUsed();
    serBuffer = null;
    serOutputStream = null;
    final SpillInfo[] spills = sorter.closeAndGetSpills();
    Optional<File> finalDataFileDir;
    if (shuffleExecutorComponents instanceof LocalDiskShuffleExecutorComponents) {
Hmm, it looks a bit hacky to handle local disk shuffle specially here.
Sorry, I'm not familiar with the block storage in KubernetesLocalDiskShuffleExecutorComponents, so I only handle LocalDiskShuffleExecutorComponents here. Or should I add a new method getDataFile() in the ShuffleExecutorComponents trait?
UT
Retest this please
To @wankunde, to be clear, the two comments below are critical to this PR. Please answer them so they can be resolved.
My replies were not submitted, sorry about that.
I'm a bit worried about this change, as it changes the assumption of always having two phases during shuffle: the first phase only writes to temp files, and the second phase "commits" them to the final destination in a short time. The shuffle and task scheduling process is quite convoluted in Spark and I can't be 100% sure that this is a safe change.
Hi @cloud-fan, the shuffle in this PR still needs two phases during shuffle:
@wankunde what kind of disk did you test? Have you tried SSD?
This test was on HDD disks; copying the 2.6 G file on NVMe disks is 10x faster than on HDD.
Before this PR, will Spark write any temp files to the final shuffle file directory?
In most cases, Spark manages its internal files as block files. If block files have the same hash code, they will be in the same directory.
To confirm: is the goal to write the last spill file to the same directory as the final shuffle file, so that the following move becomes a same-disk rename?
A move between disks requires an actual data copy, while a mv within the same disk is simply a metadata operation. I understand the motivation for the proposed fix here, but I agree with @cloud-fan - I am not in favor of complicating Spark code for handling extremely degenerate cases like this.
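The same-disk versus cross-disk distinction can be seen directly with java.nio.file.Files.move, the frame that appears in the stack trace above: an ATOMIC_MOVE succeeds as a metadata-only rename on the same filesystem, and throws AtomicMoveNotSupportedException across filesystems, forcing a copy-plus-delete fallback. A minimal, hypothetical sketch:

```java
import java.io.IOException;
import java.nio.file.*;

public class MoveSketch {
    // Move src to dst, preferring a pure rename; fall back to copy+delete
    // when the rename cannot be done atomically (e.g. across filesystems).
    static void moveFile(Path src, Path dst) throws IOException {
        try {
            // Same filesystem: a metadata-only rename.
            Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Crossing filesystems: degrades to a full byte copy plus delete,
            // which is what makes cross-disk "moves" expensive.
            Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mv-demo");
        Path src = Files.createFile(dir.resolve("spill.data"));
        Path dst = dir.resolve("final.data");
        moveFile(src, dst);
        System.out.println(Files.exists(dst) && !Files.exists(src));
    }
}
```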
Yes
I think a more general approach is,
Spark will rename the TempShuffleBlock file to the final data file when there is only one TempShuffleBlock file (spark/core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java, lines 277 to 290 in 0bbef20).
If there are multiple TempShuffleBlock files, Spark will always read all the spilled shuffle data and write it into the final shuffle data file.
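The two code paths described here (rename a single spill, merge several) can be sketched as follows. This is a deliberately simplified, hypothetical illustration - the real UnsafeShuffleWriter merge also deals with compression, encryption, and per-partition lengths:

```java
import java.io.*;
import java.nio.file.*;

public class MergeSketch {
    // Hypothetical simplification of the merge step: a single spill file is
    // moved into place, multiple spills are concatenated into the output.
    static void mergeSpills(File[] spills, File out) throws IOException {
        if (spills.length == 1) {
            // Cheap if source and destination are on the same disk.
            Files.move(spills[0].toPath(), out.toPath(),
                StandardCopyOption.REPLACE_EXISTING);
        } else {
            // Full read of every spill plus a full write of the output.
            try (OutputStream merged = new FileOutputStream(out)) {
                for (File spill : spills) {
                    Files.copy(spill.toPath(), merged);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("merge-demo").toFile();
        File a = new File(dir, "a");
        File b = new File(dir, "b");
        Files.write(a.toPath(), "foo".getBytes());
        Files.write(b.toPath(), "bar".getBytes());
        File out = new File(dir, "out");
        mergeSpills(new File[]{a, b}, out);
        System.out.println(new String(Files.readAllBytes(out.toPath())));
    }
}
```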
I think renaming and reading/writing should be faster if the files are co-located on the same disk?
At these disk speeds, everything other than a rename will be slower if they are on the same disk.
@@ -198,7 +201,8 @@ private void writeSortedFile(boolean isFinalFile) {
    // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
    // createTempShuffleBlock here; see SPARK-3426 for more details.
    final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
        blockManager.diskBlockManager().createTempShuffleBlock();
one idea: if isFinalFile is true, then we call a special version of createTempShuffleBlock that takes the shuffle & map id, and returns a file path under the same directory as the final shuffle file. Then we don't need to change other places?
Do you mean changing the parameters of createTempShuffleBlockInDir from finalDataFileDir to Tuple2<ShuffleId, MapId>?
That would assume the final output has to go to blockResolver.getDataFile(shuffleId, mapId), right @cloud-fan? Currently at this layer we do not make that assumption...
I was initially toying with the idea of passing mapId and shuffleId as constructor params and doing something similar, when I realized this would make assumptions that the code currently does not make - and that is why the base directory is being passed around.
(And then of course I thought we could solve it in LocalDiskSingleSpillMapOutputWriter here, but was completely wrong :-( ).
This is only going to marginally alleviate the issues though; the slow disks are the root cause (RC) here, and their impact will manifest in a lot of other ways as well. (We built push based shuffle when we started seeing very high disk issues of this sort - though unfortunately it is only supported on YARN right now.)
Thanks @mridulm |
Actually, thinking more, what I suggested will not work - since the local directory is chosen based on the hash of the block id - which is different, sigh. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still very hesitant about this change... it is trying to work around overloaded disks for a specific corner case observed.
You will have pervasive slowdowns and sporadic failures when this sort of disk characteristics are exhibited, and there is only so much we can do to make Spark robust in this context, unfortunately - without looking at other options to mitigate the underlying RC.
finalDataFileDir.map(blockManager.diskBlockManager()::createTempShuffleBlockInDir)
    .orElseGet(blockManager.diskBlockManager()::createTempShuffleBlock);
Suggested change:
finalDataFileDir.filter(v -> spills.isEmpty()).map(blockManager.diskBlockManager()::createTempShuffleBlockInDir)
    .orElseGet(blockManager.diskBlockManager()::createTempShuffleBlock);
We need this only when there is a single output file, else we want to spread it.
Committed this change, thanks.
What changes were proposed in this pull request?
If there is only one spill data file, Spark transfers that spill file to the final data file. But if that spill file and the final data file are on different disks, there will still be a heavy data transfer.
We can determine the disk where the final shuffle data file is located, and spill the final temp shuffle block to that disk to avoid unnecessary data copying.
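The proposed mechanism can be sketched as a tiny helper: when the final data file's directory is known, create the last temp shuffle file there so the later rename stays on one disk. The class and method names below are hypothetical illustrations, not the PR's actual API:

```java
import java.io.File;
import java.util.Optional;
import java.util.UUID;

public class ColocateSketch {
    // Hypothetical helper mirroring the PR's idea: prefer the final data
    // file's directory for the last temp shuffle file, so the subsequent
    // rename is a same-disk metadata operation rather than a data copy.
    static File createTempShuffleFile(Optional<File> finalDataFileDir, File fallbackDir) {
        File dir = finalDataFileDir.orElse(fallbackDir);
        return new File(dir, "temp_shuffle_" + UUID.randomUUID());
    }

    public static void main(String[] args) {
        File finalDir = new File(System.getProperty("java.io.tmpdir"));
        File tmp = createTempShuffleFile(Optional.of(finalDir), new File("/other/disk"));
        // The temp file lands in the same directory as the final data file.
        System.out.println(finalDir.getPath().equals(tmp.getParent()));
    }
}
```

When finalDataFileDir is empty (e.g. a non-local-disk shuffle backend), the fallback keeps today's behavior of spreading temp blocks across the configured local dirs.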
Why are the changes needed?
Optimize Spark shuffle performance.
Test: mv a 2.6 G file between HDD disks on a node which has high iowait. Moving the file to another disk with high iowait is very slow, while moving it within the same disk is quite fast.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing UTs
Was this patch authored or co-authored using generative AI tooling?
No