[SPARK-30297][CORE] Fix executor lost in net cause app hung upp #26938

Closed

seayoun wants to merge 1 commit into apache:master from seayoun:fix_executor_heartbeat_hung

Conversation

seayoun commented Dec 18, 2019

What changes were proposed in this pull request?

Background
If an executor is lost in the network and does not send RST or close packets to the driver, the driver cannot detect the loss through a connection disconnection; it can only detect that the executor is dead through heartbeat expiration.

Problem
The heartbeat expiration processing flow is as follows:

  1. The executor's heartbeat expires, as described above.
  2. HeartbeatReceiver calls the scheduler's executorLost to reschedule the tasks that were running on this executor.
  3. HeartbeatReceiver kills the executor.

If a task is rescheduled before the executor has been removed from the ExecutorBackend, it has a chance of being rescheduled onto the same dead executor: the driver sends it a launch-task message, the executor never responds, and the driver cannot detect this through heartbeats because the executor is already lost in the network. Those tasks can never finish, so the app hangs forever.
This patch fixes the problem by removing the executor (marking it pending to remove) before rescheduling.
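
To make the race concrete, here is a minimal, self-contained Scala sketch of the ordering issue. This is not Spark code; all names (ToyBackend, reviveOffers, and so on) are made up for illustration. The point is only that an expired executor which has not yet been marked pending-to-remove is still a valid offer target when tasks are re-offered.

import scala.collection.mutable

object HeartbeatRaceSketch {
  final case class Executor(id: String, alive: Boolean)

  // Hypothetical stand-in for the scheduler backend's view of the cluster.
  class ToyBackend {
    val executors = mutable.Map[String, Executor]()
    val executorsPendingToRemove = mutable.Set[String]()

    // Offers come from every registered executor that is not marked for removal.
    def reviveOffers(): Seq[String] =
      executors.keys.filterNot(executorsPendingToRemove.contains).toSeq
  }

  def main(args: Array[String]): Unit = {
    val backend = new ToyBackend
    backend.executors("exec-1") = Executor("exec-1", alive = false) // lost in the network
    backend.executors("exec-2") = Executor("exec-2", alive = true)

    // Buggy ordering: tasks are re-offered while exec-1 is still offerable.
    println(s"offers before marking: ${backend.reviveOffers().sorted}")

    // Fixed ordering: mark the expired executor pending-to-remove first, then
    // re-offer; exec-1 can no longer receive the rescheduled tasks.
    backend.executorsPendingToRemove += "exec-1"
    println(s"offers after marking:  ${backend.reviveOffers().sorted}")
  }
}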

Why are the changes needed?

Without this fix, the app can hang forever when an executor is lost in the network.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

seayoun (Author) commented Dec 18, 2019

@dongjoon-hyun @lababidi
@rxin @yanboliang
Please take a look at this patch, thanks!

HyukjinKwon (Member):

@seayoun, please fill up the PR description. What are you trying to fix? Also, please review https://spark.apache.org/contributing.html

dongjoon-hyun changed the title from "fix executor lost in net cause app hung upp" to "[SPARK-30297][CORE] Fix executor lost in net cause app hung upp" on Dec 18, 2019
dongjoon-hyun (Member):

ok to test

SparkQA commented Dec 18, 2019

Test build #115527 has finished for PR 26938 at commit 036dcc2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member):

@seayoun, please also fill up the PR description.

seayoun (Author) commented Dec 19, 2019

> @seayoun, please also fill up the PR description.

Done, please see it again, thanks!

cloud-fan (Contributor):

@seayoun can you fill the PR description according to the template? You will see the template when you open a PR.

cloud-fan (Contributor):

> The tasks on the dead executor have a chance of being rescheduled on this dead executor again

I don't get it. Even if a task is scheduled to the dead executor (before we mark the executor as dead), we will reschedule the task when we mark the executor as dead later. What's the problem?

seayoun closed this Dec 19, 2019
seayoun reopened this Dec 19, 2019
seayoun (Author) commented Dec 19, 2019

> The tasks on the dead executor have a chance of being rescheduled on this dead executor again
>
> I don't get it. Even if a task is scheduled to the dead executor (before we mark the executor as dead), we will reschedule the task when we mark the executor as dead later. What's the problem?

The task is rescheduled before we mark the executor as dead, because the SchedulerBackend sends ReviveOffers at a fixed rate, so the task has a chance to be rescheduled on this same executor. After the launch-task message is sent to this executor there is no response at all, because the executor is lost in the network; it will not even reply with an RST.
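
For reference, here is a standalone sketch of that fixed-rate revive loop. This is not Spark's actual CoarseGrainedSchedulerBackend, just an illustration using a plain scheduled executor; the interval value is only analogous to spark.scheduler.revive.interval.

import java.util.concurrent.{Executors, TimeUnit}

object ReviveLoopSketch {
  def main(args: Array[String]): Unit = {
    val reviveIntervalMs = 1000L // analogous to spark.scheduler.revive.interval
    val reviveThread = Executors.newSingleThreadScheduledExecutor()

    reviveThread.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        // In Spark this step sends ReviveOffers, which makes resource offers
        // from all registered executors, including one that silently dropped
        // off the network but has not yet been removed from the backend.
        println("ReviveOffers: making offers from all registered executors")
      }
    }, 0L, reviveIntervalMs, TimeUnit.MILLISECONDS)

    Thread.sleep(3000L)
    reviveThread.shutdown()
  }
}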

cloud-fan (Contributor):

> the task has a chance to be rescheduled on this same executor. After the launch-task message is sent to this executor there is no response at all ...

So the task status is running, right? When we eventually mark the executor as dead, won't we kill the task and reschedule it somewhere else?

seayoun (Author) commented Dec 19, 2019

> the task has a chance to be rescheduled on this same executor. After the launch-task message is sent to this executor there is no response at all ...
>
> So the task status is running, right? When we eventually mark the executor as dead, won't we kill the task and reschedule it somewhere else?

The kill logic is: the driver sends a KillExecutors request to the AM, but the AM does not send RemoveExecutor back to the driver, because it handles the KillExecutors request as an internal container release. When the AM later runs processCompletedContainers, it considers the executor alreadyReleased, so the driver never asks for the loss reason and the AM never notifies the driver.
A task can only be rescheduled when its executor is marked as lost or when the task itself fails; in this situation the executor is never detected as lost and the task never fails.

The AM's processCompletedContainers logic:

private[yarn] def processCompletedContainers(completedContainers: Seq[ContainerStatus]): Unit = {
    for (completedContainer <- completedContainers) {
      val containerId = completedContainer.getContainerId
      val alreadyReleased = releasedContainers.remove(containerId)
      val hostOpt = allocatedContainerToHostMap.get(containerId)
      val onHostStr = hostOpt.map(host => s" on host: $host").getOrElse("")
      val exitReason = if (!alreadyReleased) {
        ......
      } else {
        // If we have already released this container, then it must mean
        // that the driver has explicitly requested it to be killed
        ExecutorExited(completedContainer.getExitStatus, exitCausedByApp = false,
          s"Container $containerId exited from explicit termination request.")
      }
      ......
      containerIdToExecutorId.remove(containerId).foreach { eid =>
        executorIdToContainer.remove(eid)
        pendingLossReasonRequests.remove(eid) match {
          case Some(pendingRequests) =>
            // Notify application of executor loss reasons so it can decide whether it should abort
            pendingRequests.foreach(_.reply(exitReason))

          case None =>
            // We cannot find executor for pending reasons. This is because completed container
            // is processed before querying pending result. We should store it for later query.
            // This is usually happened when explicitly killing a container, the result will be
            // returned in one AM-RM communication. So query RPC will be later than this completed
            // container process.
            releasedExecutorLossReasons.put(eid, exitReason)
        }
        if (!alreadyReleased) {
          // The executor could have gone away (like no route to host, node failure, etc)
          // Notify backend about the failure of the executor
          numUnexpectedContainerRelease += 1
          driverRef.send(RemoveExecutor(eid, exitReason))
        }
      }
    }
  }

seayoun (Author) commented Dec 20, 2019

@squito @wangshuo128 @cloud-fan please take a look at this patch again.

// Mark the executor as pending to remove when its heartbeat has expired,
// to avoid rescheduling tasks on this executor again
if (!backend.executorsPendingToRemove.contains(executorId)) {
  backend.executorsPendingToRemove(executorId) = false
}
Ngone51 (Member) commented Dec 20, 2019

I think this does not resolve the issue completely. If you look into sc.killAndReplaceExecutor below, you can see that we already try to mark the executor as "pending to remove" there. Although that happens in a separate thread, I don't see a big difference with respect to this issue. WDYT?

seayoun (Author) commented Dec 21, 2019

It is true that sc.killAndReplaceExecutor already tries to mark the executor as "pending to remove", but by that time the task may already have been rescheduled onto this executor again; the executor must be removed from the ExecutorBackend before rescheduling to avoid this.
For example, when the driver loses the connection to an executor, CoarseGrainedSchedulerBackend calls disableExecutor, which marks the executor as dead first and only then reschedules the tasks that were on the disconnected executor.
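
A rough sketch of that ordering (hypothetical names, not the real CoarseGrainedSchedulerBackend code): the executor is flagged as unusable before its tasks are handed back to the scheduler, so subsequent offers skip it.

import scala.collection.mutable

object DisableExecutorSketch {
  val disabledExecutors = mutable.Set[String]()
  val pendingTasks = mutable.Queue[String]()

  def disableExecutor(executorId: String): Unit = {
    // Step 1: flag the executor so that future offers skip it.
    disabledExecutors += executorId
    // Step 2: only then hand its tasks back for rescheduling.
    rescheduleTasksOf(executorId)
  }

  def rescheduleTasksOf(executorId: String): Unit =
    pendingTasks.enqueue(s"task-from-$executorId")

  def makeOffers(registered: Seq[String]): Seq[String] =
    registered.filterNot(disabledExecutors.contains)

  def main(args: Array[String]): Unit = {
    disableExecutor("exec-1")
    // exec-1 is never offered again, so the rescheduled task cannot land on it.
    println(makeOffers(Seq("exec-1", "exec-2")))
  }
}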

seayoun (Author):

The task can be rescheduled onto this executor before the executor is marked as "pending to remove".

seayoun (Author):

Marking the executor as pending to remove before rescheduling prevents the tasks from being rescheduled onto the bad executor.

Member:

Hi, @seayoun. On second thought, I think this fix can indeed avoid the app hang, though it cannot clean up the dead executor's records in CoarseGrainedSchedulerBackend.

As you may have realized, the same issue was reported in SPARK-27348, and its author (@xuanyuanking) recently assigned it to me, which is quite a coincidence. I really appreciate that you like my proposal; actually, the main idea is still based on the original author's contribution.

Also, thanks for the review.

seayoun (Author) commented Dec 23, 2019

A better way to fix this: #26980

github-actions bot commented Apr 2, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
