
[WIP][SPARK-29998][CORE] Retry getFile() until all folder failed then exit #26643

Closed

Conversation

AngersZhuuuu
Contributor

What changes were proposed in this pull request?

If one of a NodeManager's disks is broken, then when a task begins to run and reads the jobConf via broadcast, the executor's BlockManager fails to create a local file and throws an IOException:

19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due to Job aborted due to stage failure: Task 21 in stage 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 (TID 34968, hostname, executor 104): java.io.IOException: Failed to create local dir in /disk11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b.
        at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
        at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129)
        at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
        at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
        at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:228)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Since in TaskSetManager.handleFailedTask(), this kind of failure is retried on the same executor until the failure count exceeds the maximum allowed task failures, the stage fails and then the whole job fails.

In this PR, I want to make getFile() try all local folders, and only exit the executor once every folder has failed, roughly as sketched below.
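
For illustration, here is a minimal Scala sketch of the idea, assuming a DiskBlockManager-style layout (localDirs, subDirsPerLocalDir); the helper name retryGetFile and the candidate-set bookkeeping are hypothetical and only approximate the actual patch:

    import java.io.{File, IOException}
    import scala.collection.mutable

    // Hypothetical sketch: try every configured local dir before giving up.
    // localDirs / subDirsPerLocalDir mirror DiskBlockManager's fields; the
    // candidate-set bookkeeping only approximates the actual diff.
    def retryGetFile(filename: String,
                     localDirs: Array[File],
                     subDirsPerLocalDir: Int): File = {
      require(localDirs.nonEmpty, "no local dirs configured")
      val hash = math.abs(filename.hashCode)   // stand-in for Utils.nonNegativeHash
      val candidates = mutable.Set(localDirs.indices: _*)
      var lastError: IOException = null
      var result: File = null
      while (result == null && candidates.nonEmpty) {
        // Prefer the hashed dir; fall back to any remaining (hopefully healthy) dir.
        val hashedDirId = hash % localDirs.length
        val dirId = if (candidates.contains(hashedDirId)) hashedDirId else candidates.head
        val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
        val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
        try {
          if (!subDir.exists() && !subDir.mkdirs()) {
            throw new IOException(s"Failed to create local dir in $subDir.")
          }
          result = new File(subDir, filename)
        } catch {
          case e: IOException =>
            lastError = e
            candidates -= dirId   // this local dir looks broken, drop it and retry
        }
      }
      if (result == null) {
        // Every local dir failed: rethrow so the caller can decide to exit the executor.
        throw lastError
      }
      result
    }

Whether it is safe to let a block land in a different directory than its hash implies, especially for shuffle files served by the external shuffle service, is exactly what the review comments below discuss.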

Why are the changes needed?

This problem makes the whole job fail; we can fix it by retrying on the other local directories.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

WIP

@AngersZhuuuu
Contributor Author

@cloud-fan @dongjoon-hyun @HyukjinKwon @srowen
To fix this problem, WDYT?
Hoping for advice.

@AmplabJenkins

Can one of the admins verify this patch?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29998][CORE]Retry getFile() until all folder failed then exit [SPARK-29998][CORE] Retry getFile() until all folder failed then exit Nov 24, 2019
@dongjoon-hyun
Member

Hi, @AngersZhuuuu .
This is possible, but I'm not sure it's a good idea to live with the node.
We already handle executor failure in the upper layers, don't we?

@AngersZhuuuu
Contributor Author

Hi, @AngersZhuuuu .
This is possible, but I'm not sure it's a good idea to live with the node.
We already handle executor failure in the upper layers, don't we?

I know we handle executor failure, but when this situation happens the executor itself won't fail.
With getFile()'s current logic, the task will always retry against the same localDir and subDir on this executor,
since the hash value is the same, so every task retry fails and finally the job fails.
And this kind of problem won't trigger a stage retry.
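
For context, the existing lookup is roughly the following (simplified from DiskBlockManager.getFile in Spark 2.x; fields are passed as parameters here to keep the sketch self-contained). Because the mapping from filename to (dirId, subDirId) is a pure hash, every retry of the task computes the same path and hits the same broken disk:

    import java.io.{File, IOException}

    // Simplified from org.apache.spark.storage.DiskBlockManager.getFile (Spark 2.x).
    def getFile(filename: String,
                localDirs: Array[File],
                subDirs: Array[Array[File]],
                subDirsPerLocalDir: Int): File = {
      val hash = math.abs(filename.hashCode)   // Spark uses Utils.nonNegativeHash
      val dirId = hash % localDirs.length
      val subDirId = (hash / localDirs.length) % subDirsPerLocalDir

      // Create the sub-directory if it doesn't already exist.
      val subDir = subDirs(dirId).synchronized {
        val old = subDirs(dirId)(subDirId)
        if (old != null) {
          old                                  // reuse the cached sub-directory
        } else {
          val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
          if (!newDir.exists() && !newDir.mkdir()) {
            // A broken disk fails here, and every retry fails here again.
            throw new IOException(s"Failed to create local dir in $newDir.")
          }
          subDirs(dirId)(subDirId) = newDir
          newDir
        }
      }
      new File(subDir, filename)
    }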

@dongjoon-hyun
Member

Can we have a reproducible test case for your claim?

@AngersZhuuuu
Contributor Author

Can we have a reproducible test case for your claim?

Yea, I will try to reproduce this in a UT.

@dongjoon-hyun
Member

Sorry, but this is not okay to trigger Jenkins yet.

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-29998][CORE] Retry getFile() until all folder failed then exit [WIP][SPARK-29998][CORE] Retry getFile() until all folder failed then exit Nov 24, 2019
@AngersZhuuuu
Contributor Author

Sorry, but this is not okay to trigger Jenkins yet.

No need to trigger it yet. I made this PR to show more clearly where the problem is, and I'll then work based on it. I've added WIP to the title. Thanks for your rigorous work.

@AngersZhuuuu
Contributor Author

Can we have a reproducible test case for your claim?

It's hard to reproduce since it happens at job start, before HadoopRDD.compute(), and then shows an error message like the one above; I can't reproduce it in a UT.
But when testing, I can hit many kinds of errors caused by DiskBlockManager.getFile() that can destroy a stage or job.

case e: IOException =>
logError(s"Failed to create local dir in $newDir.", e)
count = count + 1
localDirIndex.remove(hashIndex)
Member

Methinks this change corrupts the shuffle writer and reader's dependency

Contributor Author

Methinks this change corrupts the shuffle writer and reader's dependency

The retry only happens for a new sub-directory, so it won't destroy the original dependency.
If a blockId already has a corresponding folder, the original one is returned.
Only a newly arriving blockId retries when mkdir fails, and the File that is finally returned is put into subDirs:

subDirs(dirId)(subDirId) = newDir

When a request for this block comes again, the old directory is returned, since subDirs(dirId)(subDirId) != null

@yaooqinn
Member

yaooqinn commented Nov 24, 2019

The dirId seems unstable; I guess it probably goes wrong for the external shuffle service

Contributor Author

The dirId seems unstable; I guess it probably goes wrong for the external shuffle service

Got your point. If the disk problem gets fixed later, the dirId may not be the same

Contributor Author

@yaooqinn I added a method to control this problem, but it doesn't look very elegant. As @srowen mentioned, we can use the blacklist to prevent this bad case, but I don't think it can handle every executor disk problem

@srowen
Member

srowen commented Nov 24, 2019

Does the executor not eventually blacklist in this case or am I missing the idea here?

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Nov 24, 2019

Does the executor not eventually blacklist in this case or am I missing the idea here?

Blacklisting can be useful, but sometimes the stage fails right as it starts running tasks.
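
For reference, the blacklisting being discussed is controlled by settings like the following (Spark 2.x configuration keys; the values are only illustrative):

    import org.apache.spark.SparkConf

    // Illustrative only: enable task/executor blacklisting (Spark 2.x config keys).
    val conf = new SparkConf()
      .set("spark.blacklist.enabled", "true")
      .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")
      .set("spark.blacklist.application.maxFailedTasksPerExecutor", "2")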

@cloud-fan
Contributor

If the shuffle final path is in a broken disk, we have the same problem, right?

Currently we have a deterministic mapping from a filename to its path. The benefit is that the path calculation is stateless and cheap to do; we can even do it in different JVMs. But I'm not sure whether we leverage this property in Spark. cc @vanzin @squito
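
A toy illustration of that property, assuming a hypothetical resolvePath helper that mirrors the hash in DiskBlockManager (and in the external shuffle service's block resolver): two independent processes can compute the same path from nothing but the directory list and the filename.

    import java.io.File

    object DeterministicPathDemo extends App {
      // Hypothetical helper mirroring the filename -> (dirId, subDirId) hash.
      def resolvePath(filename: String, localDirs: Array[String], subDirsPerLocalDir: Int): File = {
        val hash = math.abs(filename.hashCode)
        val dirId = hash % localDirs.length
        val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
        new File(new File(localDirs(dirId), "%02x".format(subDirId)), filename)
      }

      val localDirs = Array("/disk1/blockmgr-xxx", "/disk2/blockmgr-xxx")
      // The executor and an external shuffle server resolve the same file independently:
      val fromExecutor      = resolvePath("shuffle_0_21_0.data", localDirs, 64)
      val fromShuffleServer = resolvePath("shuffle_0_21_0.data", localDirs, 64)
      assert(fromExecutor == fromShuffleServer)
      // If getFile() silently moved a block to a different dir on retry, readers that
      // recompute the path this way would no longer find the file.
    }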

@AngersZhuuuu
Contributor Author

If the shuffle final path is in a broken disk, we have the same problem, right?

Currently we have a deterministic mapping from a filename to its path. The benefit is that the path calculation is stateless and cheap to do; we can even do it in different JVMs. But I'm not sure whether we leverage this property in Spark. cc @vanzin @squito

Yes, maybe adding the blacklist is enough for this problem. I will check how the ExternalShuffleService uses the deterministic mapping from filename to path.

@srowen srowen closed this Nov 30, 2019