[SPARK-15736][CORE] Gracefully handle loss of DiskStore files #13473
Conversation
/cc @andrewor14
LGTM. Have you double-checked that these are all the places where we do this?
@andrewor14, yeah, I think this is the complete set of locations in Spark 2.x.
Test build #59880 has finished for PR 13473 at commit
Test build #59884 has finished for PR 13473 at commit
If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail, and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates or invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be retried repeatedly, leading to repeated task failures and eventually a total job failure. To fix this problem, the executor with the missing file needs to mark the corresponding block as missing so that it stops advertising itself as a cache location for that block. This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13473 from JoshRosen/handle-missing-cache-files.

(cherry picked from commit 229f902)

Signed-off-by: Andrew Or <andrew@databricks.com>
Merging into master and 2.0.
[SPARK-15736][CORE] Gracefully handle loss of DiskStore files

This is a branch-1.6 backport of #13473.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13479 from JoshRosen/handle-missing-cache-files-branch-1.6.

(cherry picked from commit 4259a28)
```scala
private def handleLocalReadFailure(blockId: BlockId): Nothing = {
  releaseLock(blockId)
  // Remove the missing block so that its unavailability is reported to the driver
  removeBlock(blockId)
  throw new SparkException(s"Block $blockId was not found even though it's read-locked")
}
```
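For context, here is a hedged sketch of the kind of local read path this helper serves. The exact shape below is an assumption based on the PR description, not a verbatim excerpt from `BlockManager.getLocalValues`, and `readFromDisk` is a hypothetical helper:

```scala
// Block metadata says this executor holds the block, so one of the stores
// must be able to serve it. If neither can (e.g. the DiskStore file was
// deleted out from under us), clean up and fail this read rather than
// letting retries loop forever on the same executor.
val iter: Iterator[Any] =
  if (level.useMemory && memoryStore.contains(blockId)) {
    memoryStore.getValues(blockId).get
  } else if (level.useDisk && diskStore.contains(blockId)) {
    readFromDisk(blockId) // hypothetical helper: read and deserialize disk bytes
  } else {
    handleLocalReadFailure(blockId) // returns Nothing: releases lock, removes block, throws
  }
```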
Should this be called before the `releaseLock()` call?
No, I don't think so: internally, `removeBlock` acquires a write lock on the block, so if we called it before the `releaseLock` call then we'd be calling it while holding a read lock, which would cause us to deadlock.
Looking at `BlockInfoManager#lockForWriting()`, I think you're right.
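For readers outside this thread, here is a minimal standalone sketch of why taking a write lock while the same thread already holds a read lock self-deadlocks. It uses plain `java.util.concurrent` locks for illustration; Spark's `BlockInfoManager` implements its own locking, so this is an analogy, not Spark code:

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

object LockUpgradeDeadlock {
  def main(args: Array[String]): Unit = {
    val lock = new ReentrantReadWriteLock()

    // Analogous to the read lock held while a task reads a cached block.
    lock.readLock().lock()
    println("Read lock held; attempting to take the write lock...")

    // ReentrantReadWriteLock does not support upgrading a read lock to a
    // write lock: this call blocks forever, because the write lock cannot
    // be granted while any read lock (including our own) is outstanding.
    lock.writeLock().lock()

    println("Never reached")
  }
}
```

This is why the patch releases the read lock first and lets `removeBlock` acquire the write lock afterwards.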
If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail, and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates or invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be retried repeatedly, leading to repeated task failures and eventually a total job failure.

In order to fix this problem, the executor with the missing file needs to mark the corresponding block as missing so that it stops advertising itself as a cache location for that block.

This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).
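A hedged sketch of what such an end-to-end regression test can look like. The structure below is illustrative, assuming a `LocalSparkContext`-style suite where `sc` is stopped between tests; it is not a verbatim copy of the `FailureSuite` test:

```scala
test("SPARK-15736: recompute cached partitions whose DiskStore files are lost") {
  // local[1,2]: one core, two allowed attempts per task, so each task may
  // fail once and still succeed on its retry.
  sc = new SparkContext("local[1,2]", "test")
  val rdd = sc.parallelize(1 to 10, 2).persist(StorageLevel.DISK_ONLY)
  assert(rdd.count() === 10)

  // Delete the files backing the cached blocks to simulate DiskStore loss.
  SparkEnv.get.blockManager.diskBlockManager.getAllFiles().foreach(_.delete())

  // The first read of each partition fails, the executor removes the stale
  // block metadata (so it stops advertising itself as a cache location),
  // and the retried task recomputes the partition from the lineage.
  assert(rdd.count() === 10)
}
```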