[SPARK-37793][CORE][SHUFFLE] Fallback to fetch original blocks when noLocalMergedBlockDataError #35076
Conversation
Can one of the admins verify this patch?
+CC @otterc
I don't think that
No particular query, but it is easy to reproduce when running all 1T TPC-DS queries.
@otterc I agree with you; I also suspect there are some bugs or concurrency issues in the code, and I added some assertions, but unfortunately nothing was found. Besides those 2 issues, I also hit shuffle data corruption issues frequently.
Both hardware (disk) issues and network issues can cause shuffle data corruption, and due to the lack of a checksum mechanism in push-based shuffle, there is a chance that we pass corrupt data along. So I think, aside from code bugs, there is still a chance of reading corrupt metadata from disk/network, even if the probability is lower than for shuffle data because metadata is usually smaller, and when that happens, falling back to fetch the original blocks should be safe. With this patch and #34934, data corruption is the only critical issue (I mean one that can fail the job) in our dozen rounds of 1T TPC-DS tests, and I think adding the checksum should solve that issue.
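As a rough illustration of the kind of checksum guard being suggested here (a minimal sketch; the helper name and the source of the expected checksum are hypothetical, not Spark's actual API):

```scala
import java.nio.file.{Files, Paths}
import java.util.zip.CRC32

// Hypothetical guard: verify a merged shuffle file against a checksum
// recorded at write time before serving it; on a mismatch the reader
// would fall back to fetching the original (un-merged) blocks.
def mergedFileLooksValid(path: String, expectedChecksum: Long): Boolean = {
  val crc = new CRC32()
  // Reads the whole file into memory -- fine for a sketch, not for
  // production-sized merged shuffle files.
  crc.update(Files.readAllBytes(Paths.get(path)))
  crc.getValue == expectedChecksum
}
```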
Update: After f6128a6, I ran 3 rounds of 1T TPC-DS using the master branch's code,
@pan3793 I will try to reproduce this issue and debug. It will take me a couple of days. Thanks for debugging and updating with your findings. I will let you know what I find.
Hi @otterc, I got more information on this issue. I added an assertion and a debug log in `finalizeShuffleMerge`:
```java
public MergeStatuses finalizeShuffleMerge(FinalizeShuffleMerge msg) {
...
for (AppShufflePartitionInfo partition: shuffleMergePartitions.values()) {
synchronized (partition) {
try {
// This can throw IOException, which will mark this shuffle partition as not merged.
partition.finalizePartition();
bitmaps.add(partition.mapTracker);
reduceIds.add(partition.reduceId);
sizes.add(partition.getLastChunkOffset());
} catch (IOException ioe) {
logger.warn("Exception while finalizing shuffle partition {}_{} {} {}", msg.appId,
msg.appAttemptId, msg.shuffleId, partition.reduceId, ioe);
} finally {
partition.closeAllFilesAndDeleteIfNeeded(false);
}
}
+ assert partition.dataFile.length() == partition.lastChunkOffset;
+ assert partition.indexFile.file.length() == partition.indexFile.getPos();
+ assert partition.metaFile.file.length() == partition.metaFile.getPos();
+ logger.info("shuffle partition {}_{} {} {}, chunk_size={}, meta_length={}, data_length={}",
+ msg.appId, msg.appAttemptId, msg.shuffleId, partition.reduceId,
+ partition.indexFile.getPos() / 8 - 1,
+ partition.metaFile.getPos(),
+ partition.lastChunkOffset);
}
mergeStatuses = new MergeStatuses(msg.shuffleId, msg.shuffleMergeId,
bitmaps.toArray(new RoaringBitmap[0]), Ints.toArray(reduceIds),
Longs.toArray(sizes));
}
...
}
```
I also added an assertion and a debug log in `getMergedBlockData`:
```scala
override def getMergedBlockData(
blockId: ShuffleMergedBlockId,
dirs: Option[Array[String]]): Seq[ManagedBuffer] = {
val indexFile =
getMergedBlockIndexFile(conf.getAppId, blockId.shuffleId, blockId.shuffleMergeId,
blockId.reduceId, dirs)
val dataFile = getMergedBlockDataFile(conf.getAppId, blockId.shuffleId,
blockId.shuffleMergeId, blockId.reduceId, dirs)
val metaFile = getMergedBlockMetaFile(conf.getAppId, blockId.shuffleId,
blockId.shuffleMergeId, blockId.reduceId, dirs)
// Load all the indexes in order to identify all chunks in the specified merged shuffle file.
val size = indexFile.length.toInt
val offsets = Utils.tryWithResource {
new DataInputStream(Files.newInputStream(indexFile.toPath))
} { dis =>
val buffer = ByteBuffer.allocate(size)
dis.readFully(buffer.array)
buffer.asLongBuffer
}
// Number of chunks is number of indexes - 1
val numChunks = size / 8 - 1
+ if (numChunks == 0) {
+ val indexBackupPath = java.nio.file.Paths.get(s"/tmp/${indexFile.toPath.getFileName}")
+ val dataBackupPath = java.nio.file.Paths.get(s"/tmp/${dataFile.toPath.getFileName}")
+ val metaBackupPath = java.nio.file.Paths.get(s"/tmp/${metaFile.toPath.getFileName}")
+ logError(s"$blockId chunk_size is 0, " +
+ s"index_file is $indexFile, backup to $indexBackupPath" +
+ s"data_file is $dataFile, backup to $dataBackupPath" +
+ s"meta_file is $metaFile, backup to $metaBackupPath")
+ Files.copy(indexFile.toPath, indexBackupPath)
+ Files.copy(dataFile.toPath, dataBackupPath)
+ Files.copy(metaFile.toPath, metaBackupPath)
+ assert(false)
}
for (index <- 0 until numChunks) yield {
new FileSegmentManagedBuffer(transportConf, dataFile,
offsets.get(index),
offsets.get(index + 1) - offsets.get(index))
}
}
```
Then I ran TPC-DS several rounds and reproduced the exception. The assertion failed on the reduce task side.
ESS logs
Reduce task backup merged shuffle files
So the ESS and the reduce task were running on the same machine, and the ESS closed the 'data', 'index', and 'meta' files and reported their sizes as
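As an aside on the chunk-count arithmetic in the `getMergedBlockData` snippet above: the merged index file is a flat sequence of 8-byte long offsets, and N+1 offsets describe N chunks, hence `numChunks = size / 8 - 1`. A self-contained illustration with toy values (no additional Spark internals assumed):

```scala
import java.nio.ByteBuffer

object IndexArithmetic {
  def main(args: Array[String]): Unit = {
    // Three 8-byte long offsets (0, 100, 250) describe two chunks:
    // [0, 100) and [100, 250). An index file holding only a single
    // offset would yield numChunks == 0 -- the failing case above.
    val offsets = Array(0L, 100L, 250L)
    val buf = ByteBuffer.allocate(offsets.length * 8)
    offsets.foreach(o => buf.putLong(o))
    val size = buf.array.length   // index file length in bytes: 24
    val numChunks = size / 8 - 1  // number of chunks: 2
    assert(numChunks == 2)
  }
}
```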
One possibility I was thinking of was if we are calling … Can you test with this change?
+CC @otterc
Sorry, I don't get your point @mridulm. The ESS log indicates that
After reading and debugging the push-based shuffle code, I'm not sure I understand it correctly and have some questions; I would appreciate any feedback @mridulm @otterc
I had written up my comment before your log messages @pan3793. The discrepancy between the data file size and the data length seems to indicate some other issue here, though I can't make much out from the details provided, unfortunately; the file sizes in particular are an extremely confusing observation.
To respond to your queries @pan3793 (please feel free to elaborate, @otterc):
Yes. In addition, any exception during write etc. would trigger a failure and reset back to the previous 'good' state.
The
The truncate is actually a best-effort attempt to clean up excess disk space usage.
Yes, unless there are some other interleaving modifications to that file (or some OS/fs/driver bugs, but I am discounting them for the time being!).
@mridulm thank you for your response!
Oops, sorry, I mean
Then I think if
Then there is a chance that the actual file size is greater than what the metadata records, e.g. because of a hardware issue the disk becomes read-only while doing the truncate (in this case will it throw an IOException?). But since we can always trust the content before the 'good' position, can the case where the file length is greater than the 'good' position still be treated as good merged data?
We do depend on truncate to ensure that the data in the merged file is consistent with the metadata. If there is some partial shuffle block that has been written to the merged file but has not been committed yet (metadata not updated), then truncate will remove that data. If there is a failure during truncation then … The …
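To make that concrete, here is a minimal sketch of the finalize-time truncate described above, assuming a writer that tracks the last committed 'good' position (the names are illustrative, not the actual `AppShufflePartitionInfo` fields):

```scala
import java.io.RandomAccessFile

// Illustrative finalize step: bytes past the last committed position
// belong to an uncommitted (partial) block, so truncating to that
// position keeps the data file consistent with the metadata.
def truncateToCommitted(path: String, committedPos: Long): Unit = {
  val raf = new RandomAccessFile(path, "rw")
  try {
    // truncate() can throw IOException (e.g. on a read-only disk);
    // even then, everything before committedPos remains trustworthy.
    raf.getChannel.truncate(committedPos)
  } finally {
    raf.close()
  }
}
```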
@otterc Thanks for the reply. I misunderstood the code at first; if any
@pan3793 I have been trying to reproduce this issue by running different TPC-H queries but wasn't able to reproduce it. I am running the code in the master branch. Is it possible for you to share the entire application log and the shuffle service logs with me? Some other things that would be useful for me to reproduce this problem:
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
When push-based shuffle is enabled, there is a chance that a task hangs at `org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:756)`. After some investigation, I found that the last `FetchResult` put into `results` is a `PushMergedLocalMetaFetchResult`, and there is a chance that `bufs` is empty, which causes no `SuccessFetchResult` to be added to `results`; the thread hangs if no other `FetchResult` is put into `results`.
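To illustrate the hang, here is a stripped-down model (simplified types standing in for the `ShuffleBlockFetcherIterator` internals, not the actual implementation): the consumer blocks on `results.take()`, so a producer path that enqueues nothing when `bufs` is empty leaves it waiting forever; the guard sketches the fallback this PR proposes.

```scala
import java.util.concurrent.LinkedBlockingQueue

object FetchHangModel {
  sealed trait FetchResult
  final case class SuccessFetchResult(buf: Array[Byte]) extends FetchResult
  case object FallbackFetchResult extends FetchResult

  val results = new LinkedBlockingQueue[FetchResult]()

  // Producer side, simplified: with an empty `bufs` the original code
  // enqueues nothing, so the consumer's take() below never returns.
  def onLocalMetaFetch(bufs: Seq[Array[Byte]]): Unit = {
    if (bufs.isEmpty) {
      // Sketch of the fix: enqueue a result telling the consumer to
      // fall back to fetching the original (un-merged) blocks.
      results.put(FallbackFetchResult)
    } else {
      bufs.foreach(b => results.put(SuccessFetchResult(b)))
    }
  }

  def main(args: Array[String]): Unit = {
    onLocalMetaFetch(Seq.empty)
    // Without the guard above, take() would block here forever.
    assert(results.take() == FallbackFetchResult)
  }
}
```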
Why are the changes needed?
Fall back to fetching the original blocks on noLocalMergedBlockDataError to avoid the task hang.
Does this PR introduce any user-facing change?
Bug fix, to make push-based shuffle more stable.
How was this patch tested?
Passed 1T TPC-DS tests.