Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Sep 21, 2020

What changes were proposed in this pull request?

This PR cleans up the RPC message flow across the multiple decommission use cases. It includes the following changes:

  • Keep the Worker's decommission status consistent between the case where decommission starts from the Worker and the case where it starts from the MasterWebUI: in the latter case, the Master now sends DecommissionWorker to the Worker.

  • Change the decommission notification between the driver and the executor from two-way to one-way communication: the executor doesn't need to acknowledge a decommission request that came from the driver, and the same applies in the reverse direction (see the sketch after this list).

  • Send only one message instead of two (DecommissionSelf/DecommissionBlockManager) when decommissioning the executor, since the executor and the BlockManager live in the same JVM.

  • Clean up the surrounding code.
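
As a rough illustration of the one-way, fire-and-forget pattern referred to above (a minimal sketch using stub types rather than Spark's actual RPC classes; the message names match those reported by the test builds later in this thread):

```scala
// Stub RPC types standing in for Spark's internal abstractions.
trait DeployMessage
trait RpcEndpoint { def receive(msg: Any): Unit }
class RpcEndpointRef(endpoint: RpcEndpoint) {
  // Fire-and-forget: the caller does not wait for any acknowledgement.
  def send(msg: Any): Unit = endpoint.receive(msg)
}

// One-way decommission notifications (names as listed in the build output below).
case class DecommissionWorkers(ids: Seq[String]) extends DeployMessage
case class WorkerDecommissioning(id: String, workerRef: RpcEndpointRef) extends DeployMessage

object OneWayFlowSketch extends App {
  val worker = new RpcEndpointRef(new RpcEndpoint {
    def receive(msg: Any): Unit = msg match {
      case DecommissionWorkers(ids) => println(s"Worker: decommissioning $ids")
      case other                    => println(s"Worker: ignoring $other")
    }
  })
  // Master -> Worker: a single send, with no reply round trip back to the Master.
  worker.send(DecommissionWorkers(Seq("worker-1")))
}
```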

Why are the changes needed?

Before:

[diagram: decommission RPC message flow before this change]

After:
[diagram: decommission RPC message flow after this change]

(Note: the diagrams only count RPC calls that need to go through the network. Local RPC calls are not counted.)

After this change, we removed 6 of the original RPC calls and added one RPC call to keep the Worker's decommission status consistent. The RPC flow also becomes clearer.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated existing tests.

@Ngone51
Member Author

Ngone51 commented Sep 21, 2020

Bringing this back since it isn't the original commit that broke the K8S test: PR #29751, which was merged before this one (#29722), had already failed the K8S tests.

@HyukjinKwon
Member

HyukjinKwon commented Sep 21, 2020

@holdenk, it seems like the test failure wasn't caused by this PR. Does your -1 here still stand?

cc @dongjoon-hyun as well per #29751.

@SparkQA

SparkQA commented Sep 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33556/

@Ngone51 Ngone51 force-pushed the simplify-decommission-rpc branch from 5ca0fe8 to 15f6085 on September 21, 2020 13:54
@holdenk
Contributor

holdenk commented Sep 21, 2020

The -1 is only on the PR I mentioned it on (the refactoring with many subclasses), and it still stands. I would like to review this PR more, though, since I think we probably need better test coverage of this change than it originally had. Sound good? Thanks for checking in about that :)

@HyukjinKwon
Member

Got it about the -1, but what about pushing this as is and working on the tests as follow-ups? It's a bit odd that we reverted it for a reason this PR didn't cause, and then ask for more before merging it back in.

@SparkQA

SparkQA commented Sep 21, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33556/

@holdenk
Contributor

holdenk commented Sep 21, 2020

I think we should not commit this with the K8s test being broken. It's in the same chunk of code and changes the logging string (although another PR also changed that string first, too?). I do not believe this PR was appropriately tested when first merged, given that it changed decommissioning messages and did not run the decommission tests.

For clarity: if you want to fix the tests in a separate PR that’s ok with me, but I would prefer not to commit this without passing integration testing.

@SparkQA

SparkQA commented Sep 21, 2020

Test build #128934 has finished for PR 29817 at commit 5ca0fe8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 21, 2020

Test build #128935 has finished for PR 29817 at commit 15f6085.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DecommissionWorkers(ids: Seq[String]) extends DeployMessage
  • case class WorkerDecommissioning(id: String, workerRef: RpcEndpointRef) extends DeployMessage
  • case class ExecutorDecommissioning(executorId: String) extends CoarseGrainedClusterMessage

@Ngone51
Member Author

Ngone51 commented Sep 22, 2020

I think we should not commit this with the K8s test being broken.

We don't. That's also why I added the [K8S] tag to the PR title. Feel free to leave comments; I can address them in follow-ups.

@HyukjinKwon
Member

HyukjinKwon commented Sep 22, 2020

@holdenk, why don't you take a look at the test failure, since it blocks all decommission changes for k8s and you were mainly involved in the development there?

@cloud-fan
Contributor

Can we find out which commit caused the test failure in the first place? We should either revert that commit, or fix it soon, as the test failure blocks others.

Since this PR has been resubmitted (although the revert turned out to be unnecessary, given the test failure was already there), I think it's a good chance for @holdenk to take a closer look before re-merging. I agree with @holdenk that we can't merge a PR when the related test is already broken; we should fix that first. @holdenk, can you give some hints about it? I took a look at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33556/ , but I don't even see how the test failed. The output is very different from normal Spark tests.

Contributor

@holdenk holdenk left a comment

Thanks for reverting this and resubmitting it. I know you believe the original PR didn't cause the test failure, but that's only half true: this PR just broke the test some more.

That being said, I still have concerns that this PR is not sufficiently tested. Can you add some more tests for the new flows you've introduced?

     SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-      "disabling worker decommission feature.")(decommissionSelf)
+      "disabling worker decommission feature.") {
+      self.send(WorkerSigPWRReceived)
Contributor

I think this might mean we return from handling the signal right away rather than waiting for decommissionSelf to be finished. Is this an intentional change?

Also, this will no longer report a decommissioning failure via the signal handler's return value, so it may block pod deletion or other cleanup tasks longer than needed.

Member Author

This is the Worker; I guess you care more about the executor? In the Worker, decommissionSelf always returns true. In the executor, there's a chance it returns false and decommissionSelf fails, but that seems to rarely happen. If you insist on returning the value, I think we can use askSync instead.

Contributor

Can you look into what this behavior difference might cause at the system level and then tell me whether that's a desired change? I'm OK with us making changes here; I just want us to be intentional, and to know whether we need to test the change, since it seems like this change was incidental.

Member Author

The return value of the signal handler decides whether we should forward the signal to the other handlers. If it returns true, no handler other than ours will handle the PWR signal. If it returns false, we handle it (for decommission) and the other handlers handle it too. Do you expect other handlers to continue handling SIGPWR when the system isn't really experiencing a power failure?

Contributor

I do. I think if the signal is unhandled, the process will be killed immediately. If we think of decommissioning as a graceful shutdown, I believe that behavior is desirable: if we can't shut down gracefully, the least we can do is exit quickly.
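
To make the behavior being discussed concrete, here's a minimal sketch of the chaining semantics (not Spark's actual SignalUtils code): each registered action returns a Boolean, and only if no action claims the signal does it escalate to the previous/default handler, which for an otherwise-unhandled signal would typically terminate the process.

```scala
object SignalChainSketch {
  // Registered actions; each returns true if it fully handled the signal.
  private var actions: List[() => Boolean] = Nil

  def register(action: () => Boolean): Unit = actions ::= action

  // Invoked when the signal arrives: escalate only if nobody claimed it.
  def onSignal(escalate: () => Unit): Unit = {
    val handled = actions.map(_.apply()).contains(true)
    if (!handled) escalate()
  }

  def main(args: Array[String]): Unit = {
    register(() => { println("decommissionSelf() succeeded"); true }) // claims the signal
    onSignal(() => println("falling back to the default handler (process would exit)"))
  }
}
```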

Member Author

Updated, but please note that I only updated the executor's case, since the Worker's case always returned true before.

Comment on lines -169 to -179
if (decommissioned) {
  val msg = "Asked to launch a task while decommissioned."
  logError(msg)
  driver match {
    case Some(endpoint) =>
      logInfo("Sending DecommissionExecutor to driver.")
      endpoint.send(DecommissionExecutor(executorId, ExecutorDecommissionInfo(msg)))
    case _ =>
      logError("No registered driver to send Decommission to.")
  }
}
Contributor

I don't think we should just take this out. Async sends could fail; if we receive a request that indicates the master hasn't received our notification, that's a sign we should resend the message.

Member Author

First, we use askSync to send the decommission notice to the driver whenever it's needed (see ExecutorSigPWRReceived). Second, even if the driver receives the decommission notice successfully, there could still be a LaunchTask request due to the race between LaunchTask and the decommission notice. Third, this part also uses an async send, so we still can't ensure the decommission notice is received by the driver successfully.
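
For reference, a minimal sketch of the send vs. askSync distinction being discussed (stub types, not Spark's RpcEndpointRef API): send is fire-and-forget, while askSync blocks until the receiver has processed the message and replied, so the sender knows the notice was delivered.

```scala
class EndpointRefStub {
  // Stand-in for the remote endpoint processing a message and producing a reply.
  private def remoteReceive(msg: Any): Any = { println(s"driver handled: $msg"); true }

  // Fire-and-forget: any reply (or failure) is invisible to the caller.
  def send(msg: Any): Unit = { remoteReceive(msg); () }

  // Blocks until the receiver replies, so the caller knows the message was processed.
  def askSync[T](msg: Any): T = remoteReceive(msg).asInstanceOf[T]
}

object DecommissionNoticeSketch extends App {
  val driver = new EndpointRefStub
  driver.send("ExecutorDecommissioning(exec-1)")                         // no confirmation
  val acked = driver.askSync[Boolean]("ExecutorDecommissioning(exec-1)") // confirmed reply
  println(s"driver acknowledged decommission notice: $acked")
}
```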

Contributor

Right, so we should resend the notice then right?

Member Author

No. Sorry if I didn't explain it clearly.

We already send the decommission notice to the driver when decommissioned = true, using askSync. If askSync still fails, I wouldn't expect another send to succeed.

Contributor

Yea, I don't see the point of resending the notice to the driver, especially in this race condition. If we want to make sure the driver is notified, we should design a mechanism for it instead of doing it ad hoc here.

        context.reply(decommissionExecutor(executorId, decommissionInfo,
          adjustTargetNumExecutors = false))
      case ExecutorDecommissioning(executorId) =>
        logWarning(s"Received executor $executorId decommissioned message")
Contributor

This might be where you broke the test suite last time, so double-check it.

  }

- def decommissionBlockManager(): Unit = synchronized {
+ def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)
Contributor

Why?

Member Author

?

Contributor

Why did you make this change?

Member Author

If I understand your question correctly:

We didn't really change the decommissionBlockManager. The original decommissionBlockManager has been renamed to decommissionSelf to avoid the naming collision.
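
Roughly, the shape after the rename looks like this (a sketch with stub types based on the diff above, not the exact BlockManager code):

```scala
case object DecommissionBlockManager

class StorageEndpointStub {
  // Handles messages sent to the BlockManager's storage endpoint.
  def ask(msg: Any): Unit = msg match {
    case DecommissionBlockManager => decommissionSelf()
    case _                        => ()
  }
  // Formerly named decommissionBlockManager; renamed to avoid the naming collision.
  private def decommissionSelf(): Unit = println("starting block migration")
}

class BlockManagerSketch(storageEndpoint: StorageEndpointStub) {
  // The public entry point now just forwards a message to the storage endpoint.
  def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)
}

object DelegationSketch extends App {
  new BlockManagerSketch(new StorageEndpointStub).decommissionBlockManager()
}
```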

Contributor

Makes sense, although introducing a new name instead of reusing a previous function name might have been easier to verify.

@holdenk
Contributor

holdenk commented Sep 23, 2020

I really like the idea of simplifying the RPC message flow. Thanks for taking this on, @Ngone51, and I'm sorry the code here is so brittle to these types of changes (the K8s integration tests are kind of limited).

@HyukjinKwon
Member

but that's only half true: this PR just broke the test some more.

@holdenk, you're kidding, right? There was only one test failure in the K8S tests that was not caused by this PR, and that test was just fixed by you. How come this PR broke more tests? Can you be more explicit about that? Which tests were broken further, and how did you test?

@holdenk
Contributor

holdenk commented Sep 24, 2020

I'm not joking. The test has multiple conditions. Another PR broke one of the conditions. This PR, in its original form, broke another one of the conditions. It was reported in the same test because K8s integration tests focus on the integration, so they cover multiple pieces. If you want, I'm happy to do a video call (and I can be flexible on my timezone if that's a constraint) and screen share so we can discuss the details.

@holdenk
Contributor

holdenk commented Sep 24, 2020

If it helps, I already called out the part of the PR I believe is most likely responsible for that in my code review.

@Ngone51
Member Author

Ngone51 commented Sep 24, 2020

This PR, in its original form, broke another one of the conditions.

Except for breaking the log message (which you just fixed), what other conditions does this PR break?

@Ngone51
Member Author

Ngone51 commented Sep 24, 2020

That being said, I still have concerns that this PR is not sufficiently tested. Can you add some more tests for the new flows you've introduced?

There's only one new flow, which is from Master to Worker. I can update the existing test by verifying the Worker's decommission status... What other concerns do you have? Could you elaborate, so I can improve the PR accordingly?

@holdenk
Contributor

holdenk commented Sep 24, 2020

I'm mostly concerned with the change around how the storage decommissioning is being done now. I'd like to see some tests showing that the flow from the master to the worker results in storage decommissioning.

@holdenk
Contributor

holdenk commented Sep 24, 2020

Also, there are multiple log conditions; this PR broke one of them, and another PR had broken another one.

@HyukjinKwon
Member

@Ngone51, shall we add a test (#29817 (comment)) and fix the test as requested? It seems otherwise good to go.

@Ngone51 Ngone51 force-pushed the simplify-decommission-rpc branch from 15f6085 to 23bfdf5 on September 28, 2020 08:15
@Ngone51
Member Author

Ngone51 commented Sep 28, 2020

@holdenk Thanks for the review! I've addressed most of your comments, and tests are updated and added in fa04b49 and 9d0f36d.

@SparkQA

SparkQA commented Sep 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33787/

@Ngone51 Ngone51 force-pushed the simplify-decommission-rpc branch from e625051 to 3c1e033 on October 22, 2020 13:56
@SparkQA

SparkQA commented Oct 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34766/

@SparkQA

SparkQA commented Oct 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34770/

@SparkQA

SparkQA commented Oct 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34766/

@SparkQA

SparkQA commented Oct 22, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34770/

@SparkQA

SparkQA commented Oct 22, 2020

Test build #130159 has finished for PR 29817 at commit e625051.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2020

Test build #130163 has finished for PR 29817 at commit 3c1e033.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@Ngone51
Member Author

Ngone51 commented Oct 23, 2020

thanks all!!

      .set(config.DECOMMISSION_ENABLED, true)
      .set(config.STORAGE_DECOMMISSION_ENABLED, isEnabled)
    sc = new SparkContext(conf)
    TestUtils.waitUntilExecutorsUp(sc, 2, 6000)
Member

I got a test failure in my PR #31131 (the PR is not related to the test, I believe):

[info] BlockManagerDecommissionIntegrationSuite:
[info] - SPARK-32850: BlockManager decommission should respect the configuration (enabled=false) *** FAILED *** (6 seconds, 165 milliseconds)
[info]   java.util.concurrent.TimeoutException: Can't find 2 executors before 6000 milliseconds elapsed
[info]   at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:374)
[info]   at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$new$2(BlockManagerDecommissionIntegrationSuite.scala:52)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
