[SPARK-47518][CORE] Skip transfer the last spilled shuffle data #45661
```diff
@@ -52,12 +52,14 @@
 import org.apache.spark.shuffle.ShuffleWriteMetricsReporter;
 import org.apache.spark.serializer.SerializationStream;
 import org.apache.spark.serializer.SerializerInstance;
+import org.apache.spark.shuffle.IndexShuffleBlockResolver;
 import org.apache.spark.shuffle.ShuffleWriter;
 import org.apache.spark.shuffle.api.ShuffleExecutorComponents;
 import org.apache.spark.shuffle.api.ShuffleMapOutputWriter;
 import org.apache.spark.shuffle.api.ShufflePartitionWriter;
 import org.apache.spark.shuffle.api.SingleSpillShuffleMapOutputWriter;
 import org.apache.spark.shuffle.api.WritableByteChannelWrapper;
+import org.apache.spark.shuffle.sort.io.LocalDiskShuffleExecutorComponents;
 import org.apache.spark.storage.BlockManager;
 import org.apache.spark.storage.TimeTrackingOutputStream;
 import org.apache.spark.unsafe.Platform;
```
```diff
@@ -219,7 +221,15 @@ void closeAndWriteOutput() throws IOException {
     updatePeakMemoryUsed();
     serBuffer = null;
     serOutputStream = null;
-    final SpillInfo[] spills = sorter.closeAndGetSpills();
+    Optional<File> finalDataFileDir;
+    if (shuffleExecutorComponents instanceof LocalDiskShuffleExecutorComponents) {
+      File dataFile =
+        new IndexShuffleBlockResolver(sparkConf, blockManager).getDataFile(shuffleId, mapId);
+      finalDataFileDir = Optional.of(dataFile.getParentFile());
+    } else {
+      finalDataFileDir = Optional.empty();
+    }
+    final SpillInfo[] spills = sorter.closeAndGetSpills(finalDataFileDir);
     try {
       partitionLengths = mergeSpills(spills);
     } finally {
```

Review comments on this hunk:

On the `instanceof LocalDiskShuffleExecutorComponents` check:

**Reviewer:** Hmm, it looks a bit hacky to handle local disk shuffle specially here.

**Author:** Sorry, I'm not familiar with the block storage in KubernetesLocalDiskShuffleExecutorComponents, so I only handle LocalDiskShuffleExecutorComponents here. Or should I add a new method

On the `IndexShuffleBlockResolver(...).getDataFile(shuffleId, mapId)` call:

**Reviewer:** Is this only used to invoke

**Author:** Yes
**Reviewer:** One idea: if `isFinalFile` is true, then we call a special version of `createTempShuffleBlock` that takes shuffle & map id, and returns a file path under the same directory as the final shuffle file. Then we don't need to change other places?

**Author:** Do you mean change the parameter of `createTempShuffleBlockInDir` from `finalDataFileDir` to `Tuple2<ShuffleId, MapId>`?

**Second reviewer:** That would assume the final output has to go to `blockResolver.getDataFile(shuffleId, mapId)`, right @cloud-fan? Currently at this layer we do not make that assumption. I was initially toying with the idea of passing `mapId` and `shuffleId` as constructor params and doing something similar, when I realized this would make assumptions that the code currently does not make, which is why the base directory is being passed around. (And then of course I thought we could solve it in `LocalDiskSingleSpillMapOutputWriter`, but was completely wrong :-( )
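To make the trade-off in this thread concrete, here is a rough sketch of the first reviewer's suggestion: an overload that derives the temp spill location from `(shuffleId, mapId)` instead of threading a base directory through the call sites. All names here (`TempSpillFiles`, `resolveFinalDataFile`, the file naming scheme) are illustrative stand-ins, not Spark API; as the second reviewer points out, this shape hard-codes the assumption that the final output lands at `blockResolver.getDataFile(shuffleId, mapId)`:

```java
import java.io.File;
import java.util.UUID;

public class TempSpillFiles {
    // Stand-in for the block resolver's root directory.
    private final File shuffleRoot;

    public TempSpillFiles(File shuffleRoot) {
        this.shuffleRoot = shuffleRoot;
    }

    // Stand-in for blockResolver.getDataFile(shuffleId, mapId); the real
    // resolver hashes files across multiple local dirs.
    public File resolveFinalDataFile(int shuffleId, long mapId) {
        return new File(shuffleRoot, "shuffle_" + shuffleId + "_" + mapId + ".data");
    }

    // The proposed overload: a unique temp file in the same directory as the
    // final data file, so the last spill can later be renamed, not copied.
    public File createTempShuffleBlock(int shuffleId, long mapId) {
        File finalFile = resolveFinalDataFile(shuffleId, mapId);
        return new File(finalFile.getParentFile(),
            finalFile.getName() + "." + UUID.randomUUID() + ".tmp");
    }
}
```

The PR instead passes `finalDataFileDir` explicitly, keeping the spill layer agnostic about where the final file ultimately resolves to.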