[SPARK-30297][CORE] Fix executor lost in network causing app to hang #26938
seayoun wants to merge 1 commit into apache:master
Conversation
|
@dongjoon-hyun @lababidi |
|
@seayoun, please fill up the PR description. What are you trying to fix? Also, please review https://spark.apache.org/contributing.html |
|
ok to test |
|
Test build #115527 has finished for PR 26938 at commit
|
|
@seayoun, please also fill up the PR description. |
done, please see it again, thanks! |
|
@seayoun can you fill the PR description according to the template? You will see the template when you open a PR. |
I don't get it. Even if a task is scheduled to the dead executor (before we mark the executor as dead), we will reschedule the task when we mark the executor as dead later. What's the problem? |
The task can be rescheduled before we mark the executor as dead, because the SchedulerBackend sends ReviveOffers at a fixed rate. So the task has a chance to be rescheduled on this executor; after the launch-task message is sent to it, nothing ever comes back, because the executor was lost in the network and will not reply with a TCP RST or anything else.
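A minimal, self-contained sketch of the timing window described above; every name here (ReviveWindowSketch, aliveExecutors, pendingToRemove, reviveOffers) is invented for illustration and is not Spark's actual code. It only shows how a fixed-rate revive loop keeps offering an executor that has not yet been removed or marked pending-to-remove:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable

// Toy backend state: the executor set and the pending-to-remove set.
object ReviveWindowSketch {
  val aliveExecutors = mutable.Set("exec-1", "exec-2")
  val pendingToRemove = mutable.Set.empty[String]

  // Offers are built from every executor that is not excluded, so a
  // network-dead executor that was never removed still receives tasks.
  def reviveOffers(): Unit = {
    val offers = aliveExecutors.diff(pendingToRemove)
    offers.foreach(e => println(s"launching task on $e"))
  }

  def main(args: Array[String]): Unit = {
    val revive = Executors.newSingleThreadScheduledExecutor()
    val reviveTask = new Runnable { def run(): Unit = reviveOffers() }
    // Periodic revive, analogous to a fixed revive interval.
    revive.scheduleAtFixedRate(reviveTask, 0L, 100L, TimeUnit.MILLISECONDS)

    // Heartbeat expiry is noticed here, but exec-1 is neither removed nor
    // marked pending-to-remove yet, so the revive loop can still pick it.
    Thread.sleep(250)
    println("heartbeat expired for exec-1 (not yet removed)")

    Thread.sleep(250)
    revive.shutdown()
  }
}
```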
So the task status is running, right? When we eventually mark the executor as dead, won't we kill the task and reschedule it to somewhere else? |
The kill logic is: the driver sends a KillExecutors request to the AM, but the AM will not send RemoveExecutor back to the driver, because the AM handles the killed executors in its normal processCompletedContainers logic.
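Roughly, the flow described in that comment could look like the sketch below; the types and methods (KillExecutors, RemoveExecutor, handleKillExecutors, processCompletedContainers) are simplified stand-ins, not the real YARN ApplicationMaster code. The point is only that a driver-initiated kill is treated as an expected container completion, so no RemoveExecutor ever goes back to the driver:

```scala
import scala.collection.mutable

sealed trait DriverMessage
case class KillExecutors(ids: Seq[String]) extends DriverMessage
case class RemoveExecutor(id: String, reason: String) extends DriverMessage

object AmKillFlowSketch {
  // Containers whose release was requested by the driver; their completion
  // is expected and therefore not reported back.
  private val releasedByDriver = mutable.Set.empty[String]

  def handleKillExecutors(msg: KillExecutors): Unit =
    msg.ids.foreach(id => releasedByDriver += id)

  // Later, completed containers come back from the resource manager.
  def processCompletedContainers(completed: Seq[String]): Seq[DriverMessage] =
    completed.flatMap { id =>
      if (releasedByDriver.remove(id)) {
        // Expected completion: nothing is sent to the driver, which is why
        // the driver never receives a RemoveExecutor for this executor.
        None
      } else {
        Some(RemoveExecutor(id, "container exited unexpectedly"))
      }
    }

  def main(args: Array[String]): Unit = {
    handleKillExecutors(KillExecutors(Seq("container-1")))
    println(processCompletedContainers(Seq("container-1"))) // List() -> driver hears nothing
  }
}
```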
|
@squito @wangshuo128 @cloud-fan please see this patch again. |
// Mark executor pending to remove if executor heartbeat expired
// to avoid reschedule task on this executor again
if (!backend.executorsPendingToRemove.contains(executorId)) {
  backend.executorsPendingToRemove(executorId) = false
I think this does not resolve the issue completely. When you look into sc.killAndReplaceExecutor below, you can see that we already try to mark it as "pending to remove" there. It happens in a separate thread, but I don't see a big difference with respect to this issue. WDYT?
sc.killAndReplaceExecutor already tries to mark it as "pending to remove", that is right, but the task may already have been rescheduled on this executor again by that time; the executor must be removed from the ExecutorBackend first to avoid that.
For example, CoarseGrainedSchedulerBackend will call disableExecutor if the driver loses its connection to the executor; disableExecutor marks the executor as dead, and only then are the tasks on the disconnected executor rescheduled.
The task can be rescheduled on this executor before it is marked as "pending to remove".
Adding the executor to executorsPendingToRemove before rescheduling avoids rescheduling the tasks onto the bad executor.
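To make the ordering argument concrete, here is a small sketch with hypothetical names (PendingToRemoveOrderingSketch, makeOffers, rescheduleTasks are not the actual Spark methods); it contrasts reschedule-then-mark with mark-then-reschedule, mirroring the executorsPendingToRemove update from the diff above:

```scala
import scala.collection.mutable

object PendingToRemoveOrderingSketch {
  private val executors = mutable.Set("exec-1", "exec-2")
  private val executorsPendingToRemove = mutable.Map.empty[String, Boolean]

  // Offers skip executors that are pending removal.
  private def makeOffers(): Seq[String] =
    executors.toSeq.filterNot(executorsPendingToRemove.contains)

  private def rescheduleTasks(): Unit =
    println(s"rescheduling onto: ${makeOffers().mkString(", ")}")

  // Problematic order: reschedule first, mark pending-to-remove later.
  def onHeartbeatExpiredBuggy(executorId: String): Unit = {
    rescheduleTasks()                            // exec-1 is still offered
    executorsPendingToRemove(executorId) = false // too late
  }

  // Order argued for above: mark pending-to-remove first, then reschedule.
  def onHeartbeatExpiredFixed(executorId: String): Unit = {
    if (!executorsPendingToRemove.contains(executorId)) {
      executorsPendingToRemove(executorId) = false
    }
    rescheduleTasks()                            // exec-1 is excluded
  }

  def main(args: Array[String]): Unit = {
    onHeartbeatExpiredBuggy("exec-1")
    executorsPendingToRemove.clear()
    onHeartbeatExpiredFixed("exec-1")
  }
}
```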
Hi, @seayoun. On second thought, I think this fix can really avoid the app hang, though it cannot clean up the dead records in CoarseGrainedSchedulerBackend.
As you may have realized, the same issue was reported in SPARK-27348, and its author (@xuanyuanking) assigned it to me recently. It's really coincidental. I really appreciate that you like my proposal; actually, the main idea is still based on the original author's contribution.
Also, thanks for review.
|
A better way to fix this: #26980 |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Background
If an executor is lost in the network and never sends an RST or close packet back to the driver, the driver cannot detect the loss through the network connection being dropped; it can only detect the dead executor through heartbeat expiration.
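For illustration only, a minimal sketch (made-up names, not Spark's actual HeartbeatReceiver) of the timestamp-based expiry check that is the driver's only remaining detection path when no RST or close packet ever arrives:

```scala
import scala.collection.mutable

object HeartbeatExpirySketch {
  private val heartbeatTimeoutMs = 120000L
  // executorId -> timestamp of the last heartbeat received by the driver
  private val lastSeenMs = mutable.Map.empty[String, Long]

  def recordHeartbeat(executorId: String, now: Long): Unit =
    lastSeenMs(executorId) = now

  // Called on a timer; returns executors whose heartbeats have expired.
  def expireDeadExecutors(now: Long): Seq[String] = {
    val expired = lastSeenMs.collect {
      case (id, seen) if now - seen > heartbeatTimeoutMs => id
    }.toSeq
    expired.foreach(lastSeenMs.remove)
    expired
  }

  def main(args: Array[String]): Unit = {
    recordHeartbeat("exec-1", now = 0L)
    recordHeartbeat("exec-2", now = 100000L)
    // At t = 130s only exec-1 has gone more than 120s without a heartbeat.
    println(expireDeadExecutors(now = 130000L))   // List(exec-1)
  }
}
```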
Problems
The heartbeat expiration processing flow is as follows:
The tasks on the dead executor have a chance to be rescheduled onto this same dead executor again if the rescheduling happens before the executor has been removed from the executorBackend. The driver then sends launch-task messages to this executor again; the executor never responds, and the driver cannot notice this through heartbeats because the executor is lost in the network. As a result, the tasks rescheduled onto this lost executor can never finish, and the app hangs forever.
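As a tiny illustration of why the job can never complete (invented names, not Spark's task-state machinery): a task launched on the network-dead executor never reports a status update, so it stays RUNNING forever and the stage never finishes:

```scala
object HungTaskSketch {
  sealed trait TaskState
  case object Running extends TaskState
  case object Finished extends TaskState

  def main(args: Array[String]): Unit = {
    // taskId -> (executorId, state); exec-1 is the executor lost in the network.
    val tasks = Map(
      1L -> ("exec-1", Running: TaskState),   // no status update will ever arrive
      2L -> ("exec-2", Finished: TaskState))
    // This stays false forever, so the app hangs waiting on the stage.
    val stageDone = tasks.values.forall(_._2 == Finished)
    println(s"stage finished: $stageDone")    // stage finished: false
  }
}
```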
This patch fixes the problem by removing the executor before rescheduling.
Why are the changes needed?
Without this fix, the app can hang forever.
Does this PR introduce any user-facing change?
NO
How was this patch tested?