[SPARK-15736][CORE] Gracefully handle loss of DiskStore files #13473
Conversation
/cc @andrewor14
LGTM. Have you double-checked that these are all the places where we do this?
@andrewor14, yeah, I think this is the complete set of locations in Spark 2.x.
Test build #59880 has finished for PR 13473 at commit
Test build #59884 has finished for PR 13473 at commit
If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail, and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates or invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be retried repeatedly, leading to repeated task failures and eventually a total job failure. To fix this problem, the executor with the missing file needs to mark the corresponding block as missing so that it stops advertising itself as a cache location for that block. This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13473 from JoshRosen/handle-missing-cache-files.

(cherry picked from commit 229f902)

Signed-off-by: Andrew Or <andrew@databricks.com>
Merging into master and 2.0.
[SPARK-15736][CORE] Gracefully handle loss of DiskStore files

This is a branch-1.6 backport of #13473.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13479 from JoshRosen/handle-missing-cache-files-branch-1.6.

(cherry picked from commit 4259a28)
```scala
private def handleLocalReadFailure(blockId: BlockId): Nothing = {
  releaseLock(blockId)
  // Remove the missing block so that its unavailability is reported to the driver
  removeBlock(blockId)
  throw new SparkException(s"Block $blockId was not found even though it's read-locked")
}
```
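For context, here is a hedged sketch of the kind of local read path this helper serves. The exact shape below is an assumption based on the PR description, not a verbatim excerpt from `BlockManager.getLocalValues`, and `readFromDisk` is a hypothetical helper:

```scala
// Block metadata says this executor holds the block, so one of the stores
// must be able to serve it. If neither can (e.g. the DiskStore file was
// deleted out from under us), clean up and fail this read rather than
// letting retries loop forever on the same executor.
val iter: Iterator[Any] =
  if (level.useMemory && memoryStore.contains(blockId)) {
    memoryStore.getValues(blockId).get
  } else if (level.useDisk && diskStore.contains(blockId)) {
    readFromDisk(blockId) // hypothetical helper: read and deserialize disk bytes
  } else {
    handleLocalReadFailure(blockId) // returns Nothing: releases lock, removes block, throws
  }
```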
Should this be called before the `releaseLock()` call?
No, I don't think so: internally, `removeBlock` acquires a write lock on the block, so if we called it before the `releaseLock` call then we'd be calling it while holding a read lock, which would cause us to deadlock.
Looking at `BlockInfoManager#lockForWriting()`, I think you're right.
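For readers outside this thread, here is a minimal standalone sketch of why taking a write lock while the same thread already holds a read lock self-deadlocks. It uses plain `java.util.concurrent` locks for illustration; Spark's `BlockInfoManager` implements its own locking, so this is an analogy, not Spark code:

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

object LockUpgradeDeadlock {
  def main(args: Array[String]): Unit = {
    val lock = new ReentrantReadWriteLock()

    // Analogous to the read lock held while a task reads a cached block.
    lock.readLock().lock()
    println("Read lock held; attempting to take the write lock...")

    // ReentrantReadWriteLock does not support upgrading a read lock to a
    // write lock: this call blocks forever, because the write lock cannot
    // be granted while any read lock (including our own) is outstanding.
    lock.writeLock().lock()

    println("Never reached")
  }
}
```

This is why the patch releases the read lock first and lets `removeBlock` acquire the write lock afterwards.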
If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail, and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates or invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be retried repeatedly, leading to repeated task failures and eventually a total job failure.

In order to fix this problem, the executor with the missing file needs to mark the corresponding block as missing so that it stops advertising itself as a cache location for that block.

This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).
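A hedged sketch of what such an end-to-end regression test can look like. The structure below is illustrative, assuming a `LocalSparkContext`-style suite where `sc` is stopped between tests; it is not a verbatim copy of the `FailureSuite` test:

```scala
test("SPARK-15736: recompute cached partitions whose DiskStore files are lost") {
  // local[1,2]: one core, two allowed attempts per task, so each task may
  // fail once and still succeed on its retry.
  sc = new SparkContext("local[1,2]", "test")
  val rdd = sc.parallelize(1 to 10, 2).persist(StorageLevel.DISK_ONLY)
  assert(rdd.count() === 10)

  // Delete the files backing the cached blocks to simulate DiskStore loss.
  SparkEnv.get.blockManager.diskBlockManager.getAllFiles().foreach(_.delete())

  // The first read of each partition fails, the executor removes the stale
  // block metadata (so it stops advertising itself as a cache location),
  // and the retried task recomputes the partition from the lineage.
  assert(rdd.count() === 10)
}
```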