[SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission #29817
Conversation
@holdenk, seems like the test failure wasn't caused by this PR. Does your -1 here still stand? cc @dongjoon-hyun as well per #29751. |
Kubernetes integration test starting |
Force-pushed from 5ca0fe8 to 15f6085
The -1 is only on the PR that I mentioned it on (the refactoring with many sub classes) which still stands. I would like to review this PR more though since I think we probably need better test coverage of this change than it originally had. Sound good? Thanks for checking in about that :) |
Got it about the -1, but how about we push this as is and work on the tests as follow-ups? It's a bit odd that we reverted it for a reason this PR didn't cause, and now ask for more things before merging it back in. |
Kubernetes integration test status failure |
I think we should not commit this with the K8s test being broken. It’s in the same chunk of code and changes the logging string (although another PR also changed that string first too?). I do not believe this PR was appropriately tested when first merged, given it changed decommissioning messages and did not run the decommission tests. For clarity: if you want to fix the tests in a separate PR that’s ok with me, but I would prefer not to commit this without passing integration testing. |
Test build #128934 has finished for PR 29817 at commit
Test build #128935 has finished for PR 29817 at commit
We don't. That's also why I added |
@holdenk, why don't you take a look at the test failure, since it blocks all changes to decommission in K8s and you were mainly involved in the development there? |
Can we find out which commit caused the test failure in the first place? We should either revert that commit, or fix it soon, as the test failure blocks others. Since this PR is resubmitted (although the revert is not necessary now given the test failure was already there), I think it's a good chance for @holdenk to take a closer look before re-merging. And I agree with @holdenk that we can't merge a PR when the related test is already broken. We should fix that first. @holdenk can you give some hints about it? I took a look at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33556/ , but I don't even see how the test failed. The output is very different from normal Spark tests. |
Thanks for reverting this and resubmitting it. I know you believe the original PR didn't cause the test failure, but that's only half true: this PR just broke the test some more.
That being said, I still have concerns that this PR is not sufficiently tested. Can you add some more tests for the new flows you've introduced?
core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
  SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-   "disabling worker decommission feature.")(decommissionSelf)
+   "disabling worker decommission feature.") {
+     self.send(WorkerSigPWRReceived)
I think this might mean we return from handling the signal right away rather than waiting for decommissionSelf to finish. Is this an intentional change?
Also, this will no longer report a decommissioning failure via the signal return value, so it may block pod deletion or other cleanup tasks longer than needed.
This is the Worker; I guess you care more about the executor? In the Worker, decommissionSelf always returns true. In the executor, there's a chance it returns false to fail decommissionSelf, but that seems to happen rarely. If you insist on returning the value, I think we can use askSync instead.
Can you look into what difference this behavior might cause at the system level and then tell me whether that's a desired change? I'm OK with us making changes here; I just want us to be intentional and to know whether we need to test the change, and it seems like this change was incidental.
The return value of the signal handling decides whether we should forward the signal to the other handlers. If true, no other handlers will handle the PWR signal except ourselves. If false, we will handle it (for decommission) and other handlers will handle it too. Do you expect other handlers to continue handling SIGPWR when the system isn't really experiencing a power failure?
I do. I think if the signal is unhandled then the process will be killed immediately. If we think of decommissioning/graceful shutdown, I believe that behavior is desirable, since if we can't shut down gracefully the least we can do is exit quickly.
Updated, but please note that I only updated the executor's case, since the Worker's case always returned true before.
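For readers following the thread, here is a minimal sketch of the executor-side registration this exchange converges on, assuming the askSync approach suggested above; ExecutorSigPWRReceived comes from the PR's diff, but the handler body and the log message are illustrative rather than the merged code.

```scala
// Sketch only (assumed shapes, not the merged code): register the SIGPWR handler
// so that its Boolean result comes from a blocking askSync to the executor's own
// endpoint. Returning false would let other handlers still see the signal, per
// the discussion above.
SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
  "disabling executor decommission feature.") {
  // Blocks until the endpoint replies whether decommissioning was accepted.
  self.askSync[Boolean](ExecutorSigPWRReceived)
}
```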
core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
if (decommissioned) {
  val msg = "Asked to launch a task while decommissioned."
  logError(msg)
  driver match {
    case Some(endpoint) =>
      logInfo("Sending DecommissionExecutor to driver.")
      endpoint.send(DecommissionExecutor(executorId, ExecutorDecommissionInfo(msg)))
    case _ =>
      logError("No registered driver to send Decommission to.")
  }
}
I don't think we should just take this out. Async sends could fail; if we receive a request indicating the master hasn't received our notification, that means we should resend the message.
First, we use askSync to send the decommission notice to the driver whenever it's needed (see ExecutorSigPWRReceived). Second, even if the driver receives the decommission notice successfully, there could still be a LaunchTask request due to the asynchrony between LaunchTask and the decommission notice. Third, this part also uses an async send, so we still cannot ensure the decommission notice is received by the driver successfully.
Right, so we should resend the notice then right?
No. Sorry if I didn't explain it clearly.
We already send the decommission notice to the driver when decommissioned = true, using askSync. If askSync still fails, I wouldn't expect another send to succeed.
Yea, I don't see the point of resending the notice to the driver, especially in this race condition. If we want to make sure the driver is notified, we should design a mechanism for it, instead of doing it here ad hoc.
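To make the outcome of this thread concrete, here is a hedged sketch of the guard once the resend is dropped; it mirrors the removed block quoted above, and the exact wording in the merged code may differ.

```scala
// Sketch of the simplified guard (assumed wording): the executor only logs,
// because the decommission notice was already delivered to the driver via
// askSync when SIGPWR was handled, so there is nothing left to resend here.
if (decommissioned) {
  logError("Asked to launch a task while decommissioned.")
}
```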
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
    context.reply(decommissionExecutor(executorId, decommissionInfo,
      adjustTargetNumExecutors = false))
  case ExecutorDecommissioning(executorId) =>
    logWarning(s"Received executor $executorId decommissioned message")
This might be where you broke the test suite last time, so double-check it.
  }

- def decommissionBlockManager(): Unit = synchronized {
+ def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)
Why?
?
Why did you make this change?
If I understand your question correctly: we didn't really change decommissionBlockManager. The original decommissionBlockManager has been renamed to decommissionSelf to avoid the naming collision.
Makes sense, although maybe introducing a new name instead of changing the use of a previous function name would be easier to verify.
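For context, here is a hedged sketch of the rename described above; the first line matches the quoted diff, while the decommissionSelf signature and body are placeholders rather than the PR's actual code.

```scala
// Sketch only: decommissionBlockManager now just forwards the request to the
// block manager's own RPC endpoint, while the old synchronized body lives on
// under the new name decommissionSelf (placeholder body shown here).
def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)

private[storage] def decommissionSelf(): Unit = synchronized {
  // formerly the body of decommissionBlockManager
}
```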
I really like the idea of simplifying the RPC message flow; thanks for taking this on @Ngone51, and I'm sorry the code here is so brittle to these types of changes (the K8s integration tests are kind of limited). |
@holdenk, you're kidding, right? There was only one test failure in the K8s tests that was not caused by this PR, and that test was just fixed by you. How come this PR broke more tests? Can you be more explicit about that? Which tests were broken further, and how did you test? |
I’m not joking. The test has multiple conditions. Another PR broke one of the conditions; this PR, in its original form, broke another one. It was reported in the same test because the K8s integration tests focus on the integration, so they cover multiple pieces. If you want, I’m happy to do a video call (and I can be flexible on my timezone if that’s a constraint) and screen share so we can discuss the details. |
If it helps I called out the part of the PR I believe most likely responsible for that during my code review already. |
Except for breaking the log (like you just fixed), what other conditions does this PR break? |
There's only one new flow, which is from Master to Worker. I can update the existing test by verifying the Worker's decommission status... What other concerns do you have? Could you elaborate so I can improve the PR accordingly? |
I’m mostly concerned with the change around how the storage decommissioning is being done now; I’d like to see some tests showing that the flow from the master to the worker results in storage decommissioning. |
Also, there are multiple log conditions; this PR broke one of them, and another PR had broken another one. |
@Ngone51, shall we add a test #29817 (comment) and fix the test as requested? Seems like it's otherwise good to go. |
Force-pushed from 15f6085 to 23bfdf5
Kubernetes integration test starting |
Force-pushed from e625051 to 3c1e033
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test status success |
Test build #130159 has finished for PR 29817 at commit
Test build #130163 has finished for PR 29817 at commit
Merged to master. |
thanks all!! |
  .set(config.DECOMMISSION_ENABLED, true)
  .set(config.STORAGE_DECOMMISSION_ENABLED, isEnabled)
sc = new SparkContext(conf)
TestUtils.waitUntilExecutorsUp(sc, 2, 6000)
I got a test failure in my PR #31131 (I believe the PR is not related to the test):
[info] BlockManagerDecommissionIntegrationSuite:
[info] - SPARK-32850: BlockManager decommission should respect the configuration (enabled=false) *** FAILED *** (6 seconds, 165 milliseconds)
[info] java.util.concurrent.TimeoutException: Can't find 2 executors before 6000 milliseconds elapsed
[info] at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:374)
[info] at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$new$2(BlockManagerDecommissionIntegrationSuite.scala:52)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
What changes were proposed in this pull request?
This PR cleans up the RPC message flow among the multiple decommission use cases. It includes the following changes:

- Keep the Worker's decommission status consistent between the case where decommission starts from the Worker and the case where it starts from the MasterWebUI: send DecommissionWorker from the Master to the Worker in the latter case.
- Change from two-way communication to one-way communication when notifying decommission between driver and executor (see the sketch after this list): it's obviously unnecessary for the executor to acknowledge the decommission status to the driver, since the decommission request comes from the driver. And the same holds in reverse.
- Only send one message instead of two (DecommissionSelf / DecommissionBlockManager) when decommissioning the executor: the executor and the BlockManager are in the same JVM.
- Clean up the code around here.
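As a hedged illustration of the two-way vs. one-way point in the second bullet (endpoint and variable names here are assumptions for illustration, not the PR's exact code):

```scala
// Illustration only (assumed names): notifying an executor of decommission.
// Before: two-way, the driver waits for the executor's acknowledgement.
val acknowledged: Boolean =
  executorEndpoint.askSync[Boolean](DecommissionExecutor(executorId, decomInfo))

// After: one-way, a fire-and-forget send is enough because the request
// originates from the driver itself, so no acknowledgement is needed.
executorEndpoint.send(DecommissionExecutor(executorId, decomInfo))
```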
Why are the changes needed?
Before: (diagram of the RPC message flow before this change)
After: (diagram of the RPC message flow after this change: rpc-flow-after)
(Note the diagrams only counts those RPC calls that needed to go through the network. Local RPC calls are not counted here.)
After this change, we removed six of the original RPC calls and added one RPC call to keep the Worker's decommission status consistent. The RPC flow also becomes clearer.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Updated existing tests.