[SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl #29579
Conversation
Test build #128019 has finished for PR 29579 at commit
Test build #128024 has finished for PR 29579 at commit
Test build #128025 has finished for PR 29579 at commit
@agrawaldevesh @cloud-fan @holdenk Please take a look, thanks!
There is a semantic change here:
Earlier, the shuffle status was removed both when the downstream triggered a fetch failure AND when the executor was lost (heartbeat failure), whichever came first.
However, with this PR, it seems you are removing the "clear shuffle on fetch failure" part. It seems that you will wait for the heartbeat failure to occur and the host to be lost, even if the downstream has signaled a fetch failure. Can you confirm whether this understanding is right?
The memory used by the cache is trivially small, and the code simplification is also not a whole lot, so it seems that I am missing the bigger motivation for this change.
@@ -188,7 +188,7 @@ private[deploy] object DeployMessages {
   }

   case class ExecutorUpdated(id: Int, state: ExecutorState, message: Option[String],
-    exitStatus: Option[Int], workerLost: Boolean)
+    exitStatus: Option[Int], hostOpt: Option[String])
I think we need a better name than hostOpt. How about just "hostname"? The type already conveys that this is optional.
How about hostLost?
We can also leave the name as workerLost and just make it an Option[String], in the spirit of minimal code change?
After a second thought, I changed it to workerHost. We need the keyword worker because it's specific to the Standalone Worker, and Host gives the direct meaning of the value. And workerLost sounds more appropriate for a Boolean type. WDYT?
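For illustration, a minimal sketch of the shape finally agreed upon. The ExecutorState and ExecutorUpdated definitions below are simplified stand-ins for the real Spark DeployMessages types, not the actual implementation:

```scala
// Simplified stand-ins for Spark's deploy-message types (hypothetical demo code).
sealed trait ExecutorState
case object DECOMMISSIONED extends ExecutorState

// workerHost is Some(host) when the Standalone Worker itself is decommissioned
// or lost, and None when only the executor is affected. The Option type carries
// the "lost or not" signal that the old workerLost: Boolean used to carry.
case class ExecutorUpdated(
    id: Int,
    state: ExecutorState,
    message: Option[String],
    exitStatus: Option[Int],
    workerHost: Option[String])

object ExecutorUpdatedDemo extends App {
  val workerGone = ExecutorUpdated(1, DECOMMISSIONED,
    Some("worker decommissioned"), None, Some("host-1"))
  val execOnly = ExecutorUpdated(2, DECOMMISSIONED, None, None, None)
  assert(workerGone.workerHost.isDefined) // host known: worker-level cleanup possible
  assert(execOnly.workerHost.isEmpty)     // executor-only decommission
}
```

The single Option[String] field replaces both the Boolean flag and the separate host lookup.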
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (outdated; resolved)
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (outdated; resolved)
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (resolved)
Thank you for the quick response @agrawaldevesh.
I think this PR doesn't change the semantics. We still clear the shuffle status on fetch failure; the only change for fetch failure in DAGScheduler is:
- .exists(_.isHostDecommissioned)
+ .exists(_.hostOpt.isDefined)
If the fetch failure comes before the executor loss, DAGScheduler will still ask TaskSchedulerImpl for the decommission state and unregister the shuffle status then, while if the executor loss comes first, the fetch failure becomes a no-op for shuffle status unregistration. I think the only difference is that, before this PR, if the executor-lost event came first, we could only unregister the shuffle map status on that executor, even if we knew the host was also decommissioned. But now we can unregister the host's shuffle status because we pass in the host info directly.
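The ordering argument above can be sketched with a toy tracker. This is illustrative only: ShuffleStatusTracker and its methods are hypothetical stand-ins for MapOutputTracker-style bookkeeping, not Spark's API.

```scala
// Toy model of shuffle-status bookkeeping keyed by executor, with host info.
class ShuffleStatusTracker {
  private var byExecutor = Map("exec-1" -> "host-1", "exec-2" -> "host-1")

  // Executor-lost path when only the executor is gone.
  def unregisterExecutor(execId: String): Unit = byExecutor -= execId

  // With the decommissioned host passed along in the loss event, we can drop
  // every executor on that host at once, without consulting any cached state.
  def unregisterHost(host: String): Unit =
    byExecutor = byExecutor.filter { case (_, h) => h != host }

  def registered: Set[String] = byExecutor.keySet
}

object OrderingDemo extends App {
  val hostLost = new ShuffleStatusTracker
  hostLost.unregisterHost("host-1") // executor-lost event carried the host
  assert(hostLost.registered.isEmpty)

  val execLost = new ShuffleStatusTracker
  execLost.unregisterExecutor("exec-1") // host unknown: executor-level only
  assert(execLost.registered == Set("exec-2"))
}
```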
Thanks for the explanations! I will get back to you in like 2-3 days after playing with it locally. (I am on PTO tomorrow.)
Test build #128064 has finished for PR 29579 at commit
- I think we need a better name than hostOpt. Consider keeping the name as workerLost.
- Consider making ExecutorDecommission immutable.
I couldn't reproduce the failing GH test in ExecutorAllocationManagerSuite, but interestingly a different one fails for me on upstream itself. It might be a good idea to rebase and get a green GH run.
@@ -909,9 +909,9 @@ private[deploy] class Master(
   exec.application.driver.send(ExecutorUpdated(
     exec.id, ExecutorState.DECOMMISSIONED,
     Some("worker decommissioned"), None,
-    // workerLost is being set to true here to let the driver know that the host (aka. worker)
+    // worker host is being set here to let the driver know that the host (aka. worker)
nit: can you reword the comment to be more accurate now :-)
Updated a little bit.
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (outdated; resolved)
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (outdated; resolved)
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (resolved)
@@ -1989,15 +1989,15 @@ private[spark] class DAGScheduler(
   */
  private[scheduler] def handleExecutorLost(
      execId: String,
-     workerLost: Boolean): Unit = {
+     hostOpt: Option[String]): Unit = {
Can you change this method's comment as well if you decide to go with hostOpt instead of workerLost (perhaps you ought to consider my comment on making workerLost itself an Option[String])? The comment still refers to "standalone worker".
I changed it to workerHost, so I guess we can keep the comment?
   */
-private [spark] object ExecutorDecommission extends ExecutorLossReason("Executor decommission.")
+private [spark] case class ExecutorDecommission(var hostOpt: Option[String] = None)
I am not a fan of this change of making hostOpt a var instead of a val. I think you only need this for line 932 in TaskSchedulerImpl, and I am sure you would be able to accommodate that use case in a different way.
The reason I don't like it is that other ExecutorLossReasons are "messages" (for example ExecutorProcessLost), and these messages tend to be immutable. I think it's a bit hacky to have ExecutorDecommission masquerading as a message but then make it mutable.
ExecutorDecommission, too, is a message that the TaskSchedulerImpl enqueues into the event loop of the DAGScheduler.
TBH, I don't like this way myself either. I tried another way to get rid of the problem here, but it requires storing the redundant workerHost info in CoarseGrainedSchedulerBackend.
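One immutable alternative, sketched here with simplified stand-in types (not the actual Spark classes): keep the field a val and derive an enriched message with copy once the host becomes known, instead of mutating in place.

```scala
// Hypothetical stand-ins for ExecutorLossReason / ExecutorDecommission.
sealed abstract class ExecutorLossReason(val message: String)

// workerHost stays a val, so the message remains an immutable value.
case class ExecutorDecommission(workerHost: Option[String] = None)
  extends ExecutorLossReason("Executor decommission.")

object ImmutableDemo extends App {
  val initial = ExecutorDecommission()
  // Produce a new message rather than mutating the old one.
  val enriched = initial.copy(workerHost = Some("host-1"))
  assert(initial.workerHost.isEmpty)
  assert(enriched.workerHost.contains("host-1"))
}
```

The cost is that the caller must hold the enriched copy, which is the state-keeping trade-off discussed above.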
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (resolved)
@cloud-fan @holdenk Could you also take a look?
Test build #128251 has finished for PR 29579 at commit
LGTM... just a few last comments, and I will accept right afterwards.
- // Executors which are being decommissioned
- protected val executorsPendingDecommission = new HashSet[String]
+ // Executors which are being decommissioned. Maps from executorId to
+ // workerHost(it's defined when the worker is also decommissioned)
super nit: space after workerHost.
I think workerHost is already an Option and thus already matches the value type of the executorsPendingDecommission map, so we can perhaps drop the parenthetical clause entirely?
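For illustration, a sketch of what the HashSet-to-HashMap change shown in the diff above could look like. Names follow the review discussion; this is not the actual Spark code:

```scala
import scala.collection.mutable

object PendingDecommissionDemo extends App {
  // Executors which are being decommissioned, keyed by executorId; the value is
  // the worker host, defined when the worker itself is also decommissioned.
  val executorsPendingDecommission = new mutable.HashMap[String, Option[String]]

  executorsPendingDecommission("exec-1") = Some("host-1") // worker decommissioned too
  executorsPendingDecommission("exec-2") = None           // executor-only decommission

  // remove yields Option[Option[String]]: the outer Option says whether the
  // executor was pending decommission at all; the inner one carries the host.
  val removed: Option[Option[String]] = executorsPendingDecommission.remove("exec-1")
  assert(removed == Some(Some("host-1")))
  assert(executorsPendingDecommission.remove("exec-3").isEmpty)
}
```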
  if (killedByDriver) {
    ExecutorKilled
  } else if (decommissioned.isDefined) {
    ExecutorDecommission(decommissioned.get)
  } else {
    reason
  }
:-) I can read !.
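An equivalent pattern-match formulation of the selection above, sketched with hypothetical stand-in types for the loss reasons:

```scala
// Hypothetical stand-ins for the loss-reason types in the snippet above.
sealed trait ExecutorLossReason
case object ExecutorKilled extends ExecutorLossReason
case class ExecutorDecommission(workerHost: Option[String]) extends ExecutorLossReason
case class ExecutorProcessLost(msg: String) extends ExecutorLossReason

object LossReasonDemo extends App {
  // Same precedence as the if/else chain: driver kill wins, then decommission,
  // then the original reason.
  def pickReason(
      killedByDriver: Boolean,
      decommissioned: Option[Option[String]],
      reason: ExecutorLossReason): ExecutorLossReason =
    (killedByDriver, decommissioned) match {
      case (true, _)           => ExecutorKilled
      case (false, Some(host)) => ExecutorDecommission(host)
      case _                   => reason
    }

  assert(pickReason(true, None, ExecutorProcessLost("x")) == ExecutorKilled)
  assert(pickReason(false, Some(Some("host-1")), ExecutorProcessLost("x")) ==
    ExecutorDecommission(Some("host-1")))
}
```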
@@ -394,10 +395,15 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   addressToExecutorId -= executorInfo.executorAddress
   executorDataMap -= executorId
   executorsPendingLossReason -= executorId
   val killedByDriver = executorsPendingToRemove.remove(executorId).getOrElse(false)
+  val decommissioned = executorsPendingDecommission.remove(executorId)
Rename decommissioned to workerHostOpt and perhaps give it an explicit type: Option[Option[String]]. It's no longer a simple Boolean.
Renaming to workerHostOpt makes sense to me, but I don't feel strongly about adding the explicit type. It also breaks the line-length limit; I'd like to keep the line unbroken when it's not necessary to break it.
core/src/main/scala/org/apache/spark/scheduler/cluster/StandaloneSchedulerBackend.scala (outdated; resolved)
@@ -70,7 +71,7 @@ case class ExecutorProcessLost(
  * This is used by the task scheduler to remove state associated with the executor, but
  * not yet fail any tasks that were running in the executor before the executor is "fully" lost.
  *
- * @param hostOpt it will be set by [[TaskSchedulerImpl]] when the host is decommissioned too
+ * @param workerHost it's defined when the worker is decommissioned too
nit: "it's" -> "it is"
Also, should we explicitly bring out the word "host" here? "It is defined when the worker host is decommissioned too."
I think "worker" should be enough.
@@ -175,15 +175,15 @@ private[spark] class StandaloneAppClient(
       cores))
     listener.executorAdded(fullId, workerId, hostPort, cores, memory)

-  case ExecutorUpdated(id, state, message, exitStatus, workerLost) =>
+  case ExecutorUpdated(id, state, message, exitStatus, workerHost) =>
Personally, I would still be okay with workerLost being an Option[String] instead of a Boolean. Obviously, had it been called "workerIsLost", we would have had to rename it. But I am also fine with the new name workerHost; I don't particularly think that the name workerLost must connote a Boolean.
This ExecutorUpdated message is a case in point where the "lost" part is meaningful, because it refers to the worker that is lost, as opposed to some random worker host.
But no strong feelings on this, and I am happy with the choice of workerHost.
core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala (outdated; resolved)
Test build #128297 has finished for PR 29579 at commit
retest this please.
Test build #128305 has finished for PR 29579 at commit
LGTM! Thanks for simplifying and thinning out the logic. I think the changes are more direct and easier to read.
I confirm that there are no semantic changes introduced.
Cc: @holdenk, @cloud-fan, please do review this PR.
@cloud-fan @holdenk Could you take a look?
Test build #128356 has finished for PR 29579 at commit
thanks, merging to master!
thanks all!
What changes were proposed in this pull request?
The motivation of this PR is to avoid caching the removed decommissioned executors in TaskSchedulerImpl. The cache was introduced in #29422. It holds the isHostDecommissioned info for a while, so that if the task's FetchFailure event comes after the executor-loss event, DAGScheduler can still get isHostDecommissioned from the cache and unregister the host's shuffle map status when the host is decommissioned too.
This PR tries to achieve the same goal without the cache. Instead of saving workerLost in ExecutorUpdated / ExecutorDecommissionInfo / ExecutorDecommissionState, we save hostOpt directly. When the host is decommissioned or lost too, hostOpt holds the specific host address; otherwise it is None, indicating that only the executor is decommissioned or lost.
Now that we have the host info, we can also unregister the host's shuffle map status when executorLost is triggered for a decommissioned executor.
Besides, this PR also includes a few cleanups around the touched code.
Why are the changes needed?
It helps to unregister the shuffle map status earlier for both the decommission and the normal executor-lost cases.
It also saves memory in TaskSchedulerImpl and simplifies the code a little bit.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
This PR only refactors the code. The original behaviour should be covered by DecommissionWorkerSuite.