[WIP][SPARK-29998][CORE] Retry getFile() until all folders fail, then exit #26643
Conversation
@cloud-fan @dongjoon-hyun @HyukjinKwon @srowen
Can one of the admins verify this patch?
Hi, @AngersZhuuuu.
I know we will handle the executor, but when this situation happens, the executor won't fail.
Can we have a reproducible test case for your claim?
Yeah, I will try to reproduce this in a UT.
Sorry, but this is not ready to trigger Jenkins yet.
No need to trigger yet. I made this PR to show more clearly where the problem is, and then to work based on it.
This is hard to reproduce since it happens when the job starts, before HadoopRDD.compute(). It then shows an error message like the one I have shown. I can't reproduce it in a UT.
case e: IOException =>
  logError(s"Failed to create local dir in $newDir.", e)
  count = count + 1
  localDirIndex.remove(hashIndex)
Methinks this change corrupts the shuffle writer and reader's dependency.
> Methinks this change corrupts the shuffle writer and reader's dependency.
The retry only happens for a new sub-dir, so it won't destroy the original dependency. If a blockId already has a corresponding folder, the original one is returned. For a newly arriving blockId, we retry when mkdir fails, and if a File is finally returned, it is put into subDirs: subDirs(dirId)(subDirId) = newDir. When a request for this block comes again, the old File is returned, since subDirs(dirId)(subDirId) != null.
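The flow described above can be sketched as follows. This is a simplified, hypothetical model of the proposed change, not the actual patch: the class name RetryingDirManager is made up, and the real diff also tracks localDirIndex and a count, which are collapsed here into an attempts counter.

```scala
import java.io.{File, IOException}

// Hypothetical sketch: cached sub-dirs are always reused, so an existing
// blockId keeps its original mapping; only a brand-new sub-dir is retried
// on another local dir when mkdirs() fails.
class RetryingDirManager(localDirs: Array[File], subDirsPerLocalDir: Int) {
  // Non-null entries are returned as-is on later lookups.
  private val subDirs =
    Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))

  def getFile(filename: String): File = {
    val hash = filename.hashCode & Int.MaxValue // non-negative hash
    var dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    var attempts = 0
    var dir: File = null
    while (dir == null) {
      val cached = subDirs(dirId)(subDirId)
      if (cached != null) {
        dir = cached // old mapping returned untouched
      } else {
        val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
        try {
          if (!newDir.exists() && !newDir.mkdirs()) {
            throw new IOException(s"Failed to create local dir in $newDir.")
          }
          subDirs(dirId)(subDirId) = newDir
          dir = newDir
        } catch {
          case e: IOException =>
            attempts += 1
            if (attempts >= localDirs.length) {
              // all local folders failed: give up
              // (the PR proposes exiting the executor here)
              throw new IOException("All local dirs are broken", e)
            }
            dirId = (dirId + 1) % localDirs.length // retry the next local dir
        }
      }
    }
    new File(dir, filename)
  }
}
```

With healthy disks this behaves like a plain cached lookup: asking for the same filename twice yields the same path.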
The dirId seems unstable; I guess for external shuffle it probably goes wrong.
> The dirId seems unstable; I guess for external shuffle it probably goes wrong.
Got your point. If the disk problem is later fixed, the dirId may not be the same.
Does the executor not eventually blacklist in this case, or am I missing the idea here?
Blacklisting can be useful, but sometimes the stage fails before it starts running tasks.
If the shuffle final path is on a broken disk, we have the same problem, right? Currently we have a deterministic mapping from filename to its path. The benefit is: the path calculation is stateless and it's cheap to do. We can even do it in different JVMs. But not sure if we leverage this property in Spark. cc @vanzin @squito
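The stateless mapping described here can be illustrated with a sketch. This is a simplification of the kind of hashing DiskBlockManager.getFile does, not the actual Spark code, and the helper name dirIndices is made up.

```scala
// Sketch of a deterministic, stateless filename -> location mapping:
// a pure function of the filename and the dir counts, so any JVM
// (e.g. an external shuffle service) computes the same path without
// sharing state. A retry that remaps dirId based on which disks failed
// at lookup time would break this property.
def dirIndices(filename: String,
               numLocalDirs: Int,
               subDirsPerLocalDir: Int): (Int, Int) = {
  val hash = filename.hashCode & Int.MaxValue // non-negative hash
  val dirId = hash % numLocalDirs
  val subDirId = (hash / numLocalDirs) % subDirsPerLocalDir
  (dirId, subDirId)
}
```

Because the result depends only on the arguments, an executor and an external shuffle service independently agree on where a block lives; once retries make the mapping depend on runtime disk failures, that agreement is lost.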
Yes, maybe adding blacklisting is enough for this problem. I will check how ExternalShuffleService uses it.
What changes were proposed in this pull request?
If one NodeManager's disk is broken, then when a task begins to run, it gets the jobConf via broadcast, the executor's BlockManager fails to create a file, and an IOException is thrown.
In TaskSetManager.handleFailedTask(), for this kind of failure reason, the task is retried on this executor until failedTime > maxTaskFailTime. Then the stage fails, and the whole job fails.
In this PR, I want to make it try all local folders; if all folders are broken, then exit the executor.
Why are the changes needed?
This problem makes the job fail; we can fix it by retrying.
Does this PR introduce any user-facing change?
NO
How was this patch tested?
WIP