
[SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files #19458

Closed
superbobry wants to merge 7 commits into master from superbobry:block-id-option

Conversation

@superbobry (Contributor) commented Oct 9, 2017

What changes were proposed in this pull request?

Prior to this commit, `getAllBlocks` implicitly assumed that the directories
managed by the `DiskBlockManager` contain only files corresponding to
valid block IDs. In reality, this assumption was violated during shuffle,
which produces temporary files in the same directories as the resulting
blocks. As a result, calls to `getAllBlocks` during shuffle were unreliable.

The fix could be made more efficient, but this is probably good enough.
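
To illustrate the failure mode, here is a toy stand-in for `BlockId.apply` (which, per the review discussion below, threw `IllegalStateException` on unrecognized names); the single pattern and the file names are made up for the example:

```scala
// Toy stand-in for BlockId.apply: accepts one made-up pattern, throws otherwise.
val RddBlock = "rdd_(\\d+)_(\\d+)".r

def blockId(name: String): String = name match {
  case RddBlock(_, _) => name
  case _ => throw new IllegalStateException(s"Unrecognized BlockId: $name")
}

// During shuffle, the managed directories mix real block files with temp
// files, so the old getAllBlocks, which parsed every file name, could throw:
val fileNames = Seq("rdd_0_0", "temp_shuffle_6a3f")
// fileNames.map(blockId)  // throws IllegalStateException on the temp file
```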

How was this patch tested?

`DiskBlockManagerSuite`

```diff
-getAllFiles().map(f => BlockId(f.getName))
+// SPARK-22227: the Try guards against temporary files written
+// during shuffle which do not correspond to valid block IDs.
+getAllFiles().flatMap(f => Try(BlockId(f.getName)).toOption)
```
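
A minimal illustration of what the `Try`-based version does (the `parse` helper below is a made-up stand-in for `BlockId.apply`): any name the parser rejects is silently dropped, temp file or not, and the exception itself is discarded, which is what the review comment below objects to.

```scala
import scala.util.Try

// Made-up stand-in for BlockId.apply.
def parse(name: String): String =
  if (name.matches("rdd_\\d+_\\d+")) name
  else throw new IllegalStateException(s"Unrecognized BlockId: $name")

// flatMap + Try(...).toOption keeps only the names that parse cleanly.
Seq("rdd_0_0", "temp_shuffle_6a3f").flatMap(n => Try(parse(n)).toOption)
// => List(rdd_0_0); the failure reason is lost along the way.
```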
Member:

This has the effect of swallowing a number of possible exceptions. I think at least you'd want to unpack this and log the error. But is there a more explicit way of excluding temp files? It seems like `getAllFiles` shouldn't return those either?

@superbobry (Contributor, Author) Oct 9, 2017:

> This has the effect of swallowing a number of possible exceptions.

It does. Right now the only possible exception coming from `BlockId.apply` is `IllegalStateException`, but I agree that using `Try` makes the code fragile. My intention was to make an easy-to-read proof of concept and then adjust the patch according to the feedback.

What do you think about adding a "safe" parsing method to `BlockId`? The method would return an `Option` instead of throwing and would only be used in `DiskBlockManager` for now. `BlockId.apply` could then be expressed as `parse(s).getOrElse { throw ... }`.

Alternatively, we could expose `Block.isValid`, which would combine the individual regexes into one.

> But is there a more explicit way of excluding temp files? It seems like `getAllFiles` shouldn't return those either?

I think this is tricky to do without leaking some abstractions, e.g. the ".UUID" naming scheme of `Utils.tempFileWith` and the `Temp*BlockId` naming scheme.
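
For concreteness, a sketch of the proposed shape; the object and method names are illustrative (only one block-ID pattern is shown, whereas the real `BlockId` has one regex per block type), and the merged patch ultimately took a different route, as the later diffs show:

```scala
// Hypothetical Option-returning parser, as proposed above.
object SafeBlockId {
  private val RddBlock = "rdd_(\\d+)_(\\d+)".r

  def parse(name: String): Option[String] = name match {
    case RddBlock(_, _) => Some(name)
    case _              => None
  }

  // apply expressed in terms of parse, per the suggestion above.
  def apply(name: String): String = parse(name).getOrElse {
    throw new IllegalStateException(s"Unrecognized BlockId: $name")
  }
}
```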

@jiangxb1987 (Contributor) left a comment:

LGTM, cc @cloud-fan

@jerryshao (Contributor) commented Oct 12, 2017

Instead of filtering out temp blocks, why not add parsing rules for `TempLocalBlockId` and `TempShuffleBlockId`? That could also solve the problem. Since `DiskBlockManager#getAllFiles` doesn't filter out temp shuffle/local files, is it better to keep the same behavior for `DiskBlockManager#getAllBlocks`? Also, it would be better for upstream code to decide whether to filter out temp files/blocks.

@superbobry (Contributor, Author) commented Oct 12, 2017

> Instead of filtering out temp blocks, why not add parsing rules for TempLocalBlockId and TempShuffleBlockId?

Because that only fixes the issue for 2 out of the 3 possible kinds of temp files. The unhandled case is produced by `Utils.tempFileWith`.
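
For reference, a sketch of the three temp-file naming schemes in play (the prefixes follow the `Temp*BlockId` definitions; the third follows the ".UUID" scheme of `Utils.tempFileWith` mentioned earlier; the concrete base name is invented):

```scala
import java.util.UUID

// The two temp block-ID schemes that new parsing rules could cover:
val tempLocal   = s"temp_local_${UUID.randomUUID()}"    // TempLocalBlockId
val tempShuffle = s"temp_shuffle_${UUID.randomUUID()}"  // TempShuffleBlockId

// The third kind, from Utils.tempFileWith(file): the target file's own name
// plus a "." + UUID suffix, which no fixed block-ID parsing rule anticipates.
val tempWith = s"shuffle_0_1_0.data.${UUID.randomUUID()}"
```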

> Since DiskBlockManager#getAllFiles doesn't filter out temp shuffle/local files, is it better to keep the same behavior for DiskBlockManager#getAllBlocks?

I agree that it makes sense to keep those in sync; that is why I would prefer to introduce `Block.isValid` and use it in `getAllFiles`.

> Also, it would be better for upstream code to decide whether to filter out temp files/blocks.

Possibly, but in any case getAllBlocks should not throw, since temp blocks are an implementation detail.

@jerryshao (Contributor):

Yes, I agree that in any case it should not throw an exception. But in this PR you filtered out temp shuffle/local blocks: do you think those blocks are valid or not? Are they blocks at all?

So I'd prefer not to filter out those blocks and instead add two parsing rules for them. For any other illegal files (ones that cannot be parsed), catch and log the exception.

@superbobry (Contributor, Author) commented Oct 12, 2017

I would say that temporary shuffle blocks are also blocks.

As for `DiskBlockManager.getAllFiles`, the documentation states that it returns all files in the directories managed by `DiskBlockManager`. Filtering anything would violate this contract. WDYT?

```diff
@@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea

 /** List all the blocks currently stored on disk by the disk manager. */
 def getAllBlocks(): Seq[BlockId] = {
-  getAllFiles().map(f => BlockId(f.getName))
+  getAllFiles().flatMap { f =>
+    val blockId = BlockId.guess(f.getName)
```
Contributor:

It doesn't look necessary to define a new method `guess` only for use here. I think we can still use `apply` here and catch/log the exception. In other words, we can simply change `apply()` and use it here; defining a new `guess` method isn't necessary.

Contributor:

+1 for using try-catch here.

@superbobry (Contributor, Author):

Will do. Should I log the exception even if the file has been produced by `Utils.tempFileWith`?

Contributor:

I think we don't need to log here.

@superbobry (Contributor, Author):

Do you mean we don't need to log at all?

@jiangxb1987 (Contributor):

ping @superbobry

```diff
@@ -100,6 +100,8 @@ private[spark] case class TestBlockId(id: String) extends BlockId {
   override def name: String = "test_" + id
 }
 
+class UnrecognizedBlockId(name: String) extends Exception
```
Contributor:

nit: extend SparkException
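
That is, something along these lines (a sketch; the exact message text is an assumption):

```scala
import org.apache.spark.SparkException

// Sketch of the suggested change: extend SparkException instead of Exception.
class UnrecognizedBlockId(name: String)
  extends SparkException(s"Failed to parse $name into a block ID")
```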

```scala
case _: UnrecognizedBlockId =>
  // This does not handle a special case of a temporary file
  // created by [[SortShuffleWriter]].
  log.warn(s"Encountered an unexpected file in a managed directory: $f")
```
Contributor:

My question here: is this really something we need to warn users about? What should they do about it? Here we are deriving block IDs from file names, so it's legitimate to encounter a temporary file at this point.

@superbobry (Contributor, Author):

I am not convinced we should log either. I've added this line because it was suggested by @jerryshao.
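
Putting the thread together, the approach under discussion looks roughly like this; a self-contained sketch with made-up stand-ins for the DiskBlockManager pieces, not the verbatim merged code:

```scala
import java.io.File
import org.apache.spark.SparkException

class UnrecognizedBlockId(name: String)
  extends SparkException(s"Failed to parse $name into a block ID")

// Made-up stand-in for BlockId.apply, now throwing the dedicated exception.
def parseBlockId(name: String): String =
  if (name.matches("rdd_\\d+_\\d+")) name
  else throw new UnrecognizedBlockId(name)

// Shape of getAllBlocks after the fix: skip anything that is not a
// recognizable block ID (temp_shuffle_*, temp_local_*, *.UUID files, ...).
def getAllBlocks(files: Seq[File]): Seq[String] =
  files.flatMap { f =>
    try Some(parseBlockId(f.getName))
    catch {
      case _: UnrecognizedBlockId =>
        // Temp files legitimately sit alongside block files during shuffle;
        // whether to also log a warning here was left open above.
        None
    }
  }
```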

@cloud-fan (Contributor):

retest this please

@SparkQA commented Oct 24, 2017

Test build #83014 has finished for PR 19458 at commit febdd7e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UnrecognizedBlockId(name: String)

Sergei Lebedev added 6 commits October 24, 2017 19:25
…files

Prior to this commit getAllBlocks implicitly assumed that the directories
managed by the DiskBlockManager contain only the files corresponding to
valid block IDs. In reality this assumption was violated during shuffle,
which produces temporary files in the same directory as the resulting
blocks. As a result, calls to getAllBlocks during shuffle were unreliable.

The fix could be made more efficient, but this is probably good enough.
This commit also introduces a safe alternative to BlockId.apply which
returns an Option instead of throwing an exception.
@cloud-fan (Contributor):

retest this please


@SparkQA commented Oct 25, 2017

Test build #83034 has finished for PR 19458 at commit ff9a6ae.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao (Contributor):

retest this please.

@SparkQA commented Oct 25, 2017

Test build #83043 has finished for PR 19458 at commit 2b61347.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please

@SparkQA commented Oct 25, 2017

Test build #83048 has finished for PR 19458 at commit 2b61347.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in b377ef1 on Oct 25, 2017
asfgit pushed a commit that referenced this pull request Oct 25, 2017
…files

Prior to this commit getAllBlocks implicitly assumed that the directories
managed by the DiskBlockManager contain only the files corresponding to
valid block IDs. In reality, this assumption was violated during shuffle,
which produces temporary files in the same directory as the resulting
blocks. As a result, calls to getAllBlocks during shuffle were unreliable.

The fix could be made more efficient, but this is probably good enough.

`DiskBlockManagerSuite`

Author: Sergei Lebedev <s.lebedev@criteo.com>

Closes #19458 from superbobry/block-id-option.

(cherry picked from commit b377ef1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan (Contributor):

thanks, merging to master/2.2

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
@Willymontaz deleted the block-id-option branch April 2, 2019 15:06