[SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files #19458
Conversation
- getAllFiles().map(f => BlockId(f.getName))
+ // SPARK-22227: the Try guards against temporary files written
+ // during shuffle which do not correspond to valid block IDs.
+ getAllFiles().flatMap(f => Try(BlockId(f.getName)).toOption)
This has the effect of swallowing a number of possible exceptions. I think at least you'd want to unpack this and log the error. But is there a more explicit way of excluding temp files? it seems like getAllFiles shouldn't return those either?
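To make the swallowing concern concrete, here is a small standalone sketch (the `parseBlockId` stub is illustrative only, standing in for `BlockId.apply`, which throws on unrecognized names) showing how `Try(...).toOption` drops both the temp file and the reason parsing failed:

```scala
import scala.util.Try

// Stub for illustration only: the real BlockId.apply throws
// IllegalStateException on names it does not recognize.
def parseBlockId(name: String): String =
  if (name.startsWith("rdd_")) name
  else throw new IllegalStateException(s"Unrecognized BlockId: $name")

val fileNames = Seq("rdd_0_0", "temp_shuffle_4a2c9e", "rdd_1_3")

// Try(...).toOption turns any failure into None, so flatMap silently
// skips the temp file, along with any other unexpected exception.
val blocks = fileNames.flatMap(n => Try(parseBlockId(n)).toOption)
// blocks == Seq("rdd_0_0", "rdd_1_3")
```

Nothing distinguishes an expected temp file from a genuinely corrupt name here, which is the fragility being discussed.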
This has the effect of swallowing a number of possible exceptions.
It does. Right now the only possible exception coming from BlockId.apply is IllegalStateException, but I agree that using Try makes the code fragile. My intention was to make an easy-to-read proof of concept and then adjust the patch according to the feedback.
What do you think about adding a "safe" parsing method to BlockId? The method would return an Option instead of throwing and would only be used in DiskBlockManager for now. BlockId.apply can then be expressed as parse(s).getOrElse { throw ... }.
Alternatively, we can expose Block.isValid, which would combine the individual regexes into one.
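A minimal standalone sketch of the "safe parsing" shape being proposed (a single `RDD` pattern stands in for the full set of block-id regexes; the names `SafeBlockId` and `parse` are illustrative, not the final API):

```scala
import scala.util.matching.Regex

// Simplified block id, for illustration.
case class RDDBlockId(rddId: Int, splitIndex: Int) {
  def name: String = s"rdd_${rddId}_$splitIndex"
}

object SafeBlockId {
  private val RDD: Regex = "rdd_([0-9]+)_([0-9]+)".r

  // "Safe" entry point: None for anything that is not a known block id,
  // e.g. temporary shuffle files.
  def parse(name: String): Option[RDDBlockId] = name match {
    case RDD(rddId, split) => Some(RDDBlockId(rddId.toInt, split.toInt))
    case _                 => None
  }

  // The throwing entry point expressed on top of parse, as suggested.
  def apply(name: String): RDDBlockId =
    parse(name).getOrElse {
      throw new IllegalStateException(s"Unrecognized BlockId: $name")
    }
}
```

With this shape, DiskBlockManager can call `parse` and skip `None` results, while existing callers of `apply` keep the throwing behavior.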
But is there a more explicit way of excluding temp files? it seems like getAllFiles shouldn't return those either?
I think this is tricky to do without leaking some abstractions, e.g. the ".UUID" naming scheme in Utils.tempFileWith and the Temp*BlockId naming scheme.
LGTM, cc @cloud-fan
Instead of filtering out temp blocks, why not add parsing rules for them?
Because this fixes the issue for 2 out of 3 possible temp files. The unhandled case is produced by
I agree that it makes sense to keep those in sync, therefore I prefer to introduce
Possibly, but in any case
Yes, I agree that in any case it should not throw an exception. But in this PR you filtered out temp shuffle/local blocks; do you think those blocks are valid or not? Are they blocks? So I'd rather not filter out those blocks, and instead add two parsing rules for them. For any other illegal files (which cannot be parsed), catch and log the exception.
I would say that temporary shuffle blocks are also blocks. As for
@@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolean

  /** List all the blocks currently stored on disk by the disk manager. */
  def getAllBlocks(): Seq[BlockId] = {
-   getAllFiles().map(f => BlockId(f.getName))
+   getAllFiles().flatMap { f =>
+     val blockId = BlockId.guess(f.getName)
It looks not so necessary to define a new method guess just for use here. I think we can still use apply here and catch/log the exception. In other words, we can simply change apply() and use it here; defining a new guess method is not really necessary.
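A sketch of that alternative, with the surrounding DiskBlockManager context stubbed out (the `parseBlockId` stub and `allBlocks` helper are illustrative; only the catch-and-recover pattern is the point):

```scala
// Stub parser: stands in for BlockId.apply, which throws on unknown names.
def parseBlockId(name: String): String =
  if (name.startsWith("rdd_")) name
  else throw new IllegalStateException(s"Unrecognized BlockId: $name")

// Keep the throwing apply and recover at the call site, optionally
// logging the failure, instead of introducing a new `guess` method.
def allBlocks(fileNames: Seq[String]): Seq[String] =
  fileNames.flatMap { name =>
    try Some(parseBlockId(name))
    catch {
      case e: IllegalStateException =>
        Console.err.println(s"Skipping unparseable file $name: ${e.getMessage}")
        None
    }
  }
```

The trade-off versus an Option-returning method is that exception handling stays local to this one caller rather than becoming part of the BlockId API.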
+1 for using try-catch here.
Will do. Should I log the exception even if the file has been produced by Utils.tempFileWith?
I think we don't need to log here.
Do you mean we don't need to log at all?
ping @superbobry
@@ -100,6 +100,8 @@ private[spark] case class TestBlockId(id: String) extends BlockId {
  override def name: String = "test_" + id
}

+class UnrecognizedBlockId(name: String) extends Exception
nit: extend SparkException
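For illustration, the nit amounts to something like the following (`SparkException` is stubbed here so the snippet is self-contained; in Spark it lives in `org.apache.spark`):

```scala
// Stub of Spark's base exception class, for a self-contained example.
class SparkException(message: String) extends Exception(message)

// The suggested change: extend SparkException rather than bare Exception,
// so the error carries a message and fits Spark's exception hierarchy.
class UnrecognizedBlockId(name: String)
    extends SparkException(s"Failed to parse $name into a block ID")
```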
case _: UnrecognizedBlockId =>
  // This does not handle the special case of a temporary file
  // created by [[SortShuffleWriter]].
  log.warn(s"Encountered an unexpected file in a managed directory: $f")
My question here: is this really something we need to warn users about? What could they do about it? Here we are deriving block ids from file names, so it is legitimate to encounter a temporary file at this point.
I am not convinced we should log either. I've added this line because it was suggested by @jerryshao.
retest this please
Test build #83014 has finished for PR 19458 at commit
…files

Prior to this commit getAllBlocks implicitly assumed that the directories managed by the DiskBlockManager contain only the files corresponding to valid block IDs. In reality this assumption was violated during shuffle, which produces temporary files in the same directory as the resulting blocks. As a result, calls to getAllBlocks during shuffle were unreliable. The fix could be made more efficient, but this is probably good enough.

This commit also introduces a safe alternative to BlockId.apply which returns an Option instead of throwing an exception.
retest this please
There's a UT failure (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83014/testReport/junit/org.apache.spark.storage/BlockIdSuite/test_bad_deserialization/). @superbobry please fix this failure.
Test build #83034 has finished for PR 19458 at commit
retest this please.
Test build #83043 has finished for PR 19458 at commit
retest this please
Test build #83048 has finished for PR 19458 at commit
…files

Prior to this commit getAllBlocks implicitly assumed that the directories managed by the DiskBlockManager contain only the files corresponding to valid block IDs. In reality, this assumption was violated during shuffle, which produces temporary files in the same directory as the resulting blocks. As a result, calls to getAllBlocks during shuffle were unreliable. The fix could be made more efficient, but this is probably good enough.

Tested by `DiskBlockManagerSuite`.

Author: Sergei Lebedev <s.lebedev@criteo.com>

Closes #19458 from superbobry/block-id-option.

(cherry picked from commit b377ef1)

Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thanks, merging to master/2.2
What changes were proposed in this pull request?
Prior to this commit getAllBlocks implicitly assumed that the directories
managed by the DiskBlockManager contain only the files corresponding to
valid block IDs. In reality, this assumption was violated during shuffle,
which produces temporary files in the same directory as the resulting
blocks. As a result, calls to getAllBlocks during shuffle were unreliable.
The fix could be made more efficient, but this is probably good enough.
How was this patch tested?
DiskBlockManagerSuite