[SPARK-30297][CORE] Fix executor lost in net cause app hung upp #26938

Closed

seayoun wants to merge 1 commit into apache:master from seayoun:fix_executor_heartbeat_hung

Conversation

seayoun commented Dec 18, 2019

What changes were proposed in this pull request?

Background
If an executor is lost in the network and does not send RST or close packets to the driver, the driver cannot detect the loss through a connection disconnection; it can only detect that the executor is dead through heartbeat expiration.

Problem
The heartbeat expiration processing flow is as follows:

  1. The executor's heartbeat expires, as described above.
  2. HeartbeatReceiver calls the scheduler's executorLost to reschedule the tasks that were running on this executor.
  3. HeartbeatReceiver kills the executor.

If a task is rescheduled before the executor has been removed from the ExecutorBackend, it has a chance of being rescheduled onto the same dead executor: the driver sends it a launch-task message, the executor never responds, and the driver cannot detect this through heartbeats because the executor is already lost in the network. Those tasks can never finish, so the app hangs forever.
This patch fixes the problem by removing the executor (marking it pending to remove) before rescheduling.
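
To make the race concrete, here is a minimal, self-contained Scala sketch of the ordering issue. This is not Spark code; all names (ToyBackend, reviveOffers, and so on) are made up for illustration. The point is only that an expired executor which has not yet been marked pending-to-remove is still a valid offer target when tasks are re-offered.

import scala.collection.mutable

object HeartbeatRaceSketch {
  final case class Executor(id: String, alive: Boolean)

  // Hypothetical stand-in for the scheduler backend's view of the cluster.
  class ToyBackend {
    val executors = mutable.Map[String, Executor]()
    val executorsPendingToRemove = mutable.Set[String]()

    // Offers come from every registered executor that is not marked for removal.
    def reviveOffers(): Seq[String] =
      executors.keys.filterNot(executorsPendingToRemove.contains).toSeq
  }

  def main(args: Array[String]): Unit = {
    val backend = new ToyBackend
    backend.executors("exec-1") = Executor("exec-1", alive = false) // lost in the network
    backend.executors("exec-2") = Executor("exec-2", alive = true)

    // Buggy ordering: tasks are re-offered while exec-1 is still offerable.
    println(s"offers before marking: ${backend.reviveOffers().sorted}")

    // Fixed ordering: mark the expired executor pending-to-remove first, then
    // re-offer; exec-1 can no longer receive the rescheduled tasks.
    backend.executorsPendingToRemove += "exec-1"
    println(s"offers after marking:  ${backend.reviveOffers().sorted}")
  }
}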

Why are the changes needed?

Without this fix, the app can hang forever when an executor is lost in the network.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

seayoun (Author) commented Dec 18, 2019

@dongjoon-hyun @lababidi
@rxin @yanboliang
Please take a look at this patch, thanks!

HyukjinKwon (Member):

@seayoun, please fill up the PR description. What are you trying to fix? Also, please review https://spark.apache.org/contributing.html

dongjoon-hyun changed the title from "fix executor lost in net cause app hung upp" to "[SPARK-30297][CORE] Fix executor lost in net cause app hung upp" on Dec 18, 2019
dongjoon-hyun (Member):

ok to test

SparkQA commented Dec 18, 2019

Test build #115527 has finished for PR 26938 at commit 036dcc2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member):

@seayoun, please also fill up the PR description.

seayoun (Author) commented Dec 19, 2019

> @seayoun, please also fill up the PR description.

Done, please see it again, thanks!

cloud-fan (Contributor):

@seayoun can you fill the PR description according to the template? You will see the template when you open a PR.

cloud-fan (Contributor):

> The tasks on the dead executor have a chance of being rescheduled on this dead executor again

I don't get it. Even if a task is scheduled to the dead executor (before we mark the executor as dead), we will reschedule the task when we mark the executor as dead later. What's the problem?

seayoun closed this Dec 19, 2019
seayoun reopened this Dec 19, 2019
seayoun (Author) commented Dec 19, 2019

> The tasks on the dead executor have a chance of being rescheduled on this dead executor again
>
> I don't get it. Even if a task is scheduled to the dead executor (before we mark the executor as dead), we will reschedule the task when we mark the executor as dead later. What's the problem?

The task is rescheduled before we mark the executor as dead, because the SchedulerBackend sends ReviveOffers at a fixed rate, so the task has a chance to be rescheduled on this same executor. After the launch-task message is sent to this executor there is no response at all, because the executor is lost in the network; it will not even reply with an RST.
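
For reference, here is a standalone sketch of that fixed-rate revive loop. This is not Spark's actual CoarseGrainedSchedulerBackend, just an illustration using a plain scheduled executor; the interval value is only analogous to spark.scheduler.revive.interval.

import java.util.concurrent.{Executors, TimeUnit}

object ReviveLoopSketch {
  def main(args: Array[String]): Unit = {
    val reviveIntervalMs = 1000L // analogous to spark.scheduler.revive.interval
    val reviveThread = Executors.newSingleThreadScheduledExecutor()

    reviveThread.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        // In Spark this step sends ReviveOffers, which makes resource offers
        // from all registered executors, including one that silently dropped
        // off the network but has not yet been removed from the backend.
        println("ReviveOffers: making offers from all registered executors")
      }
    }, 0L, reviveIntervalMs, TimeUnit.MILLISECONDS)

    Thread.sleep(3000L)
    reviveThread.shutdown()
  }
}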

cloud-fan (Contributor):

> the task has a chance to be rescheduled on this same executor. After the launch-task message is sent to this executor there is no response at all ...

So the task status is running, right? When we eventually mark the executor as dead, won't we kill the task and reschedule it somewhere else?

seayoun (Author) commented Dec 19, 2019

> the task has a chance to be rescheduled on this same executor. After the launch-task message is sent to this executor there is no response at all ...
>
> So the task status is running, right? When we eventually mark the executor as dead, won't we kill the task and reschedule it somewhere else?

The kill logic is: the driver sends a KillExecutors request to the AM, but the AM does not send RemoveExecutor back to the driver, because it handles the KillExecutors request as an internal container release. When the AM later runs processCompletedContainers, it considers the executor alreadyReleased, so the driver never asks for the loss reason and the AM never notifies the driver.
A task can only be rescheduled when its executor is marked as lost or when the task itself fails; in this situation the executor is never detected as lost and the task never fails.

The AM's processCompletedContainers logic:

private[yarn] def processCompletedContainers(completedContainers: Seq[ContainerStatus]): Unit = {
    for (completedContainer <- completedContainers) {
      val containerId = completedContainer.getContainerId
      val alreadyReleased = releasedContainers.remove(containerId)
      val hostOpt = allocatedContainerToHostMap.get(containerId)
      val onHostStr = hostOpt.map(host => s" on host: $host").getOrElse("")
      val exitReason = if (!alreadyReleased) {
        ......
      } else {
        // If we have already released this container, then it must mean
        // that the driver has explicitly requested it to be killed
        ExecutorExited(completedContainer.getExitStatus, exitCausedByApp = false,
          s"Container $containerId exited from explicit termination request.")
      }
      ......
      containerIdToExecutorId.remove(containerId).foreach { eid =>
        executorIdToContainer.remove(eid)
        pendingLossReasonRequests.remove(eid) match {
          case Some(pendingRequests) =>
            // Notify application of executor loss reasons so it can decide whether it should abort
            pendingRequests.foreach(_.reply(exitReason))

          case None =>
            // We cannot find executor for pending reasons. This is because completed container
            // is processed before querying pending result. We should store it for later query.
            // This is usually happened when explicitly killing a container, the result will be
            // returned in one AM-RM communication. So query RPC will be later than this completed
            // container process.
            releasedExecutorLossReasons.put(eid, exitReason)
        }
        if (!alreadyReleased) {
          // The executor could have gone away (like no route to host, node failure, etc)
          // Notify backend about the failure of the executor
          numUnexpectedContainerRelease += 1
          driverRef.send(RemoveExecutor(eid, exitReason))
        }
      }
    }
  }

seayoun (Author) commented Dec 20, 2019

@squito @wangshuo128 @cloud-fan please take a look at this patch again.

// Mark the executor as pending to remove when its heartbeat has expired,
// to avoid rescheduling tasks on this executor again
if (!backend.executorsPendingToRemove.contains(executorId)) {
  backend.executorsPendingToRemove(executorId) = false
}
Ngone51 (Member) commented Dec 20, 2019

I think this does not resolve the issue completely. If you look into sc.killAndReplaceExecutor below, you can see that we already try to mark the executor as "pending to remove" there. Although that happens in a separate thread, I don't see a big difference with respect to this issue. WDYT?

seayoun (Author) commented Dec 21, 2019

It is true that sc.killAndReplaceExecutor already tries to mark the executor as "pending to remove", but by that time the task may already have been rescheduled onto this executor again; the executor must be removed from the ExecutorBackend before rescheduling to avoid this.
For example, when the driver loses the connection to an executor, CoarseGrainedSchedulerBackend calls disableExecutor, which marks the executor as dead first and only then reschedules the tasks that were on the disconnected executor.
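
A rough sketch of that ordering (hypothetical names, not the real CoarseGrainedSchedulerBackend code): the executor is flagged as unusable before its tasks are handed back to the scheduler, so subsequent offers skip it.

import scala.collection.mutable

object DisableExecutorSketch {
  val disabledExecutors = mutable.Set[String]()
  val pendingTasks = mutable.Queue[String]()

  def disableExecutor(executorId: String): Unit = {
    // Step 1: flag the executor so that future offers skip it.
    disabledExecutors += executorId
    // Step 2: only then hand its tasks back for rescheduling.
    rescheduleTasksOf(executorId)
  }

  def rescheduleTasksOf(executorId: String): Unit =
    pendingTasks.enqueue(s"task-from-$executorId")

  def makeOffers(registered: Seq[String]): Seq[String] =
    registered.filterNot(disabledExecutors.contains)

  def main(args: Array[String]): Unit = {
    disableExecutor("exec-1")
    // exec-1 is never offered again, so the rescheduled task cannot land on it.
    println(makeOffers(Seq("exec-1", "exec-2")))
  }
}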

seayoun (Author):

The task can be rescheduled onto this executor before the executor is marked as "pending to remove".

seayoun (Author):

Marking the executor as pending to remove before rescheduling prevents the tasks from being rescheduled onto the bad executor.

Member:

Hi, @seayoun. On second thought, I think this fix can indeed avoid the app hang, though it cannot clean up the dead executor's records in CoarseGrainedSchedulerBackend.

As you may have realized, the same issue was reported in SPARK-27348, and its author (@xuanyuanking) recently assigned it to me, which is quite a coincidence. I really appreciate that you like my proposal; actually, the main idea is still based on the original author's contribution.

Also, thanks for the review.

seayoun (Author) commented Dec 23, 2019

A better way to fix this: #26980

github-actions bot commented Apr 2, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
