[SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling #29367

Conversation

@holdenk (Contributor) commented Aug 6, 2020

What changes were proposed in this pull request?

If graceful decommissioning is enabled, Spark's dynamic scaling uses this instead of directly killing executors.

Why are the changes needed?

When scaling down Spark we should avoid triggering recomputes as much as possible.

Does this PR introduce any user-facing change?

Hopefully users' jobs run faster, or at the same speed. It also enables experimental shuffle-service-free dynamic scaling when graceful decommissioning is enabled (using the same code path as shuffle-tracking dynamic scaling).
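
For context, a minimal configuration sketch of how a user might opt into this combination. The config keys below are taken from Spark 3.1-era documentation and are illustrative assumptions for this summary, not something added by this PR:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: enable graceful decommissioning together with dynamic
// allocation in shuffle-tracking mode (no external shuffle service).
val spark = SparkSession.builder()
  .appName("decommission-with-dynamic-allocation")
  .config("spark.decommission.enabled", "true")                      // graceful executor decommissioning
  .config("spark.storage.decommission.enabled", "true")              // migrate blocks off decommissioning executors
  .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .config("spark.storage.decommission.rddBlocks.enabled", "true")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") // shuffle tracking instead of a shuffle service
  .getOrCreate()
```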

How was this patch tested?

For now I've extended the ExecutorAllocationManagerSuite for both core & streaming.

@holdenk (Contributor Author) commented Aug 6, 2020

This is a rebase of #28818 now that its prerequisites have been merged.

@holdenk (Contributor Author) commented Aug 6, 2020

cc @attilapiros & @agrawaldevesh

@holdenk (Contributor Author) commented Aug 6, 2020

cc @cloud-fan who asked about the progress on a related PR in case he is interested.

@agrawaldevesh (Contributor) commented Aug 6, 2020

@holdenk, I am a bit confused by the commit message of the only commit in this PR: "Shutdown executor once we are done decommissioning". Isn't this the recently merged PR #29211 (it went to master, right)?

Can you make sure that this commit is appropriately rebased on master with a commit message like "Use graceful decommissioning as part of dynamic scaling"?

@HyukjinKwon (Member):

cc @tgravescs, @mridulm, @squito, @Ngone51, @jiangxb1987 as well FYI

@SparkQA commented Aug 6, 2020

Test build #127114 has finished for PR 29367 at commit 427b26c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 6, 2020

Test build #127113 has finished for PR 29367 at commit 3839d31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor Author) commented Aug 6, 2020

> @holdenk, I am a bit confused by the commit message of the only commit in this PR: "Shutdown executor once we are done decommissioning". Isn't this the recently merged PR #29211 (it went to master, right)?
>
> Can you make sure that this commit is appropriately rebased on master with a commit message like "Use graceful decommissioning as part of dynamic scaling"?

Ah yeah, if you click expand you can see it's all squashed down into one commit and the full commit text covers everything. When it gets merged, the commit message is picked from the title anyway, but I'll rename the title line of the commit.

@holdenk force-pushed the SPARK-31198-use-graceful-decommissioning-as-part-of-dynamic-scaling branch from 427b26c to 38a413e on August 6, 2020 03:47
@agrawaldevesh (Contributor):

> Ah yeah, if you click expand you can see it's all squashed down into one commit and the full commit text covers everything. When it gets merged, the commit message is picked from the title anyway, but I'll rename the title line of the commit.

It would really help the review if you could please force-push the rebased version with the commits properly separated/pruned.

Is this PR ready enough for review that you can do that? Thanks!

@agrawaldevesh (Contributor) left a comment

I am still confused about whether this PR is properly rebased onto the master branch or not.

As of commit 375d348, #29211 has been pushed to master.

I am not sure if I ended up re-reviewing some of the already pushed code or how much of this code is new.

@SparkQA commented Aug 6, 2020

Test build #127122 has finished for PR 29367 at commit 38a413e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member) commented Aug 6, 2020

@holdenk Would you mind adding more description of the basic idea in this PR for integrating decommissioning and dynamic allocation?

@holdenk (Contributor Author) commented Aug 6, 2020

I think the javadoc failure in GHA is unrelated; I'll rebase this in a bit (I can't reproduce it locally, though).

@holdenk force-pushed the SPARK-31198-use-graceful-decommissioning-as-part-of-dynamic-scaling branch from 780b00b to 3fa3313 on August 6, 2020 21:31
@SparkQA commented Aug 6, 2020

Test build #127154 has finished for PR 29367 at commit 780b00b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 7, 2020

Test build #127156 has finished for PR 29367 at commit 3fa3313.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor Author) commented Aug 7, 2020

I'm taking the next few days off (Friday-Sunday), I'll take another poke at this on Monday :)

@agrawaldevesh (Contributor) left a comment

Taking one of the last looks after checking it out locally.

@@ -298,6 +323,7 @@ private[spark] class ExecutorMonitor(
     //
     // This means that an executor may be marked as having shuffle data, and thus prevented
     // from being removed, even though the data may not be used.
+    // TODO: Only track used files (SPARK-31974)
Contributor:

Is this comment change intended?

Contributor (Author):

Yes. Since we're eventually going to want to use intelligent metrics to decide which executors to scale down, I'd like us to only track shuffle files that are actually being used, not speculative ones. It doesn't need to be addressed right now, which is why it's a TODO.

@agrawaldevesh (Contributor) left a comment

Hi Holden,

I went through the code again. I feel that this is not yet in a state to be merged because of the issues marked "[blocker]" inline.

I am also happy to sync offline to discuss them further.

@@ -114,7 +114,8 @@ private[spark] class ExecutorMonitor(

     var newNextTimeout = Long.MaxValue
     timedOutExecs = executors.asScala
-      .filter { case (_, exec) => !exec.pendingRemoval && !exec.hasActiveShuffle }
+      .filter { case (_, exec) =>
+        !exec.pendingRemoval && !exec.hasActiveShuffle && !exec.decommissioning}
Contributor:

I went through all of the usages of the executor.pendingRemoval and executor.decommissioning flags: they are treated identically right now. That is, for all practical purposes, an executor being decommissioned is treated the same as an executor pending removal.

Do you have a use case in mind for distinguishing between these two states? If you don't need to distinguish them, the change would become simpler if you treated a decommissioned executor as pending removal.

I cannot see where this distinction is relevant in this PR, so perhaps you have a future use case in mind for it?

Contributor:

^^^ @holdenk ... any thoughts/follow-up on this?

Contributor (Author):

Eventually I'd like us to have better logging and metrics around decommissioning and understanding its impact versus blacklisting, although to be fair that isn't in the short term.
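
To make the flag semantics above concrete, here is a small self-contained sketch of the idle-timeout filter with pendingRemoval and decommissioning kept as separate flags (so they could later be reported separately). The Tracker shape and object are illustrative stand-ins, not Spark's actual ExecutorMonitor:

```scala
object TimedOutFilterSketch {
  // Illustrative stand-in for per-executor tracking state.
  final case class Tracker(
      executorId: String,
      timeoutAt: Long,
      hasActiveShuffle: Boolean,
      pendingRemoval: Boolean,
      decommissioning: Boolean)

  // Mirrors the predicate in the diff: an executor only counts as idle-timed-out
  // when it is past its timeout and is in none of the "going away" states.
  def timedOutExecutors(executors: Seq[Tracker], now: Long): Seq[String] = {
    executors
      .filter { exec =>
        exec.timeoutAt <= now &&
          !exec.pendingRemoval && !exec.hasActiveShuffle && !exec.decommissioning
      }
      .map(_.executorId)
  }

  def main(args: Array[String]): Unit = {
    val execs = Seq(
      Tracker("1", timeoutAt = 0L, hasActiveShuffle = false, pendingRemoval = false, decommissioning = false),
      Tracker("2", timeoutAt = 0L, hasActiveShuffle = false, pendingRemoval = false, decommissioning = true))
    // Prints List(1): executor 2 is excluded because it is already decommissioning.
    println(timedOutExecutors(execs, now = 100L))
  }
}
```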

@agrawaldevesh (Contributor) left a comment

Thanks for the refactoring of that helper method.

My other inline comments are mainly just redrawing your attention to some of the other comments I made over the weekend. No rush if you were already planning to address them in a bit!

(As an aside, do you have the ability to mark these older resolved comments as resolved? I no longer see the resolve-comment button, even on my own comments.)

@holdenk (Contributor Author) commented Aug 10, 2020

Sorry, I'm dealing with some other things, so I only had the cycles for a partial response to the comments. I'll try and get back to them tonight or tomorrow.

@holdenk (Contributor Author) commented Aug 10, 2020

(I'll also try and go through and resolve the old comments tonight).

@SparkQA commented Aug 10, 2020

Test build #127292 has finished for PR 29367 at commit 01d4137.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk force-pushed the SPARK-31198-use-graceful-decommissioning-as-part-of-dynamic-scaling branch from 4a8ba4d to a099152 on August 11, 2020 04:42
@SparkQA commented Aug 11, 2020

Test build #127306 has finished for PR 29367 at commit a099152.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

holdenk and others added 4 commits August 11, 2020 11:39

  • Because the mock always says there is an RDD we may replicate more than once, and now that there are independent threads
  • Make Spark's dynamic allocation use decommissioning
  • Track the decommissioning executors in the core dynamic scheduler so we don't scale down too low; update the streaming ExecutorAllocationManager to also delegate to decommission
  • Fix up executor add for resource profile
  • Fix our exiting and cleanup thread for better debugging next time. Clean up the locks we use in decommissioning and clarify some more bits.
  • Verify executors decommissioned, then killed by the external cluster manager, are re-launched
  • Verify some additional calls are not occurring in the executor allocation manager suite.
  • Don't close the watcher until the end of the test
  • Use decommissionExecutors and set adjustTargetNumExecutors to false so that we can match the pattern for killExecutor/killExecutors
  • bump numparts up to 6
  • Revert "bump numparts up to 6" (this reverts commit daf96dd)
  • Small comment & visibility cleanup
  • CR feedback/cleanup
@holdenk force-pushed the SPARK-31198-use-graceful-decommissioning-as-part-of-dynamic-scaling branch from 4d8b6cd to cc76ff5 on August 11, 2020 18:40
@SparkQA commented Aug 11, 2020

Test build #127342 has finished for PR 29367 at commit 4d8b6cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 11, 2020

Test build #127343 has finished for PR 29367 at commit cc76ff5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@agrawaldevesh (Contributor) left a comment

Looks much better; one blocker remaining if I am understanding the code properly.

I would also like to know more about the DecommissionWorkerSuite failure, please.

Thanks!

@@ -242,8 +242,10 @@ class DecommissionWorkerSuite
       assert(jobResult === 2)
     }
     // 6 tasks: 2 from first stage, 2 rerun again from first stage, 2nd stage attempt 1 and 2.
     val tasksSeen = listener.getTasksFinished()
Contributor:

Would you happen to recall the GitHub Actions error you got that led to this change? I would like to dig further, because I invoke the listener using TestUtils.withListener(sc, listener), which waits for the listener to drain and also removes the listener.

So I don't think wrapping this in an eventually should actually be doing anything: the listener has already been removed. Perhaps I ought to bring back the "waiting for job done" inside of getTasksFinished, or as a separate call.

I would like to understand further just so that I can learn about some of the gotchas with this listener stuff.
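
(For readers unfamiliar with the pattern under discussion: the change wrapped the check in a ScalaTest eventually retry, roughly like the self-contained sketch below. The names and values are placeholders, and as the thread concludes, the change was ultimately backed out in favor of a fix on master.)

```scala
import org.scalatest.concurrent.Eventually.{eventually, interval, timeout}
import org.scalatest.time.SpanSugar._

object EventuallySketch {
  // Retry the assertion until it passes or 10 seconds elapse, polling every 100 ms,
  // instead of asserting once and failing on a slow or busy machine.
  def waitForSixTasks(getTasksFinished: () => Set[String]): Unit = {
    eventually(timeout(10.seconds), interval(100.millis)) {
      assert(getTasksFinished().size == 6)
    }
  }
}
```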

Contributor (Author):

Yeah, good point, it's probably not the listener. It's only showing up for me in GHA though: https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579

Contributor:

Let's undo this change then. I am rerunning this PR locally to debug. Thanks for sharing the GHA link; it helps.

Contributor:
I think I know the race. I will file another PR for this either against this PR or against master. The race is simply that we need to wait for the decommissioning to have happened before triggering the fetch failure. On a busy machine, the listener can be delayed.

Contributor (Author):
Sounds good, I'll back this change out.

Contributor:
This will take me a while to fix. I will make this fix against the master branch.

Apparently #29211 broke some of my state keeping that I was relying on in #29032 :-P. Let me think through how to fix this for real. So I think the test failure is real and it is worrisome that it isn't failing as frequently as it should.

Stay tuned for a PR to fix this but in the meanwhile please back out this test change. Thanks for surfacing this issue.

Contributor:
Okay so @HyukjinKwon also reported a test failure: #29014 (comment) and that is encouraging. I will work on a fix for this ASAP.

@SparkQA commented Aug 11, 2020

Test build #127345 has finished for PR 29367 at commit 6a69126.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    }

    // If we don't want to replace the executors we are decommissioning
    if (adjustTargetNumExecutors) {
@agrawaldevesh (Contributor) commented Aug 12, 2020

Should there be a check for executorsToDecommission.nonEmpty? Otherwise, we will request executors again with no change in the adjustExecutors helper method. That could again lead to some unnecessary strain on the driver.

Not a big deal because this is one time, since doDecommission isn't called again and again.

Contributor (Author):
Let me put that logic inside adjustExecutors :)
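
A minimal, self-contained sketch of the guard being discussed: only make the (potentially expensive) total-executor adjustment when there is actually something to decommission. Names such as adjustExecutors and doRequestTotalExecutors follow the conversation, but the bodies here are illustrative stand-ins, not Spark's actual scheduler-backend code:

```scala
object AdjustExecutorsSketch {
  @volatile private var requestedTotal: Int = 10

  // Stand-in for the cluster-manager round trip (doRequestTotalExecutors in the discussion).
  private def doRequestTotalExecutors(total: Int): Boolean = {
    println(s"requesting $total total executors from the cluster manager")
    true
  }

  // Stand-in for the adjustExecutors helper: adjust the requested total and notify
  // the cluster manager only when the list of executors to decommission is non-empty.
  def adjustExecutors(executorIds: Seq[String]): Seq[String] = {
    if (executorIds.nonEmpty) {
      requestedTotal = math.max(0, requestedTotal - executorIds.size)
      doRequestTotalExecutors(requestedTotal)
      executorIds
    } else {
      Seq.empty // nothing to do, so skip the cluster-manager request entirely
    }
  }

  def main(args: Array[String]): Unit = {
    println(adjustExecutors(Seq("1", "2"))) // one request sent, for 8 executors
    println(adjustExecutors(Seq.empty))     // no request sent
  }
}
```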

@holdenk (Contributor Author) commented Aug 12, 2020

Just to be clear, what's the outstanding blocker in your opinion?

@agrawaldevesh (Contributor) left a comment

PR looks good to me and I have no blockers. The two things that we arrived at were:

  • Please back out the DecommissionWorkerSuite test change.
  • Consider not doing doRequestTotalExecutors in adjustExecutors if the input arg list is empty.

This is going to be great and finally brings decommissioning closer to prime time.

@holdenk (Contributor Author) commented Aug 12, 2020

Gotcha, I've got those two changes in now and I'll see how it goes in Jenkins/GHA :) Just an FYI to other folks: since there are no outstanding blockers, I intend to merge this once CI completes. If anyone needs more time to review, please leave a comment and I'll hold off on merging.

@SparkQA commented Aug 12, 2020

Test build #127389 has finished for PR 29367 at commit e970cb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in 548ac7c on Aug 13, 2020
@dongjoon-hyun (Member):

Thank you, @holdenk and @agrawaldevesh.
