[SPARK-4955] With executor dynamic scaling enabled, executors should be added or killed in yarn-cluster mode. #3962

lianhuiwang · 2015-01-09T02:28:31Z

With executor dynamic scaling enabled, executors should be added or killed in yarn-cluster mode. To support this, in yarn-cluster mode the ApplicationMaster starts an AMActor that adds or kills executors, and the YarnSchedulerActor in YarnSchedulerBackend sends messages to the AM's AMActor.
@andrewor14 @ChengXiangLi @tdas

SparkQA · 2015-01-09T02:32:36Z

Test build #25290 has started for PR 3962 at commit 6dfeeec.

This patch merges cleanly.

SparkQA · 2015-01-09T02:34:07Z

Test build #25290 has finished for PR 3962 at commit 6dfeeec.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- protected class YarnSchedulerActor(isDriver: Boolean) extends Actor

AmplabJenkins · 2015-01-09T02:34:08Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25290/
Test FAILed.

SparkQA · 2015-01-09T02:52:35Z

Test build #25291 has started for PR 3962 at commit 2164ea8.

This patch merges cleanly.

SparkQA · 2015-01-09T03:51:19Z

Test build #25291 has finished for PR 3962 at commit 2164ea8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- protected class YarnSchedulerActor(isDriver: Boolean) extends Actor

AmplabJenkins · 2015-01-09T03:51:23Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25291/
Test FAILed.

SparkQA · 2015-01-09T06:12:40Z

Test build #25299 has started for PR 3962 at commit 7d33791.

This patch merges cleanly.

SparkQA · 2015-01-09T06:22:34Z

Test build #25303 has started for PR 3962 at commit 1b029a4.

This patch merges cleanly.

SparkQA · 2015-01-09T07:21:33Z

Test build #25303 has finished for PR 3962 at commit 1b029a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- protected class YarnSchedulerActor(isDriver: Boolean) extends Actor

AmplabJenkins · 2015-01-09T07:21:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25303/
Test FAILed.

SparkQA · 2015-01-09T07:30:02Z

Test build #25299 has finished for PR 3962 at commit 7d33791.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- protected class YarnSchedulerActor(isDriver: Boolean) extends Actor

AmplabJenkins · 2015-01-09T07:30:05Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25299/
Test PASSed.

andrewor14 · 2015-01-13T18:17:29Z

core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala

Why not listen for these in cluster mode?

Because in cluster mode, YarnSchedulerActor and the driver use the same actorSystem. If YarnSchedulerActor subscribes to and listens on the driver's actor, messages from the driver's actor that should go to executors will be sent to YarnSchedulerActor instead. That would be a serious error, so YarnSchedulerActor cannot listen on the driver's actor in cluster mode.

andrewor14 · 2015-01-13T18:24:57Z

Hi @lianhuiwang thanks for fixing this. It's pretty clear why it's not working in cluster mode; the actor that acts as a server for allocation requests is simply not started in this mode... this seems like a critical omission on my part.

The approach is pretty straightforward in this PR. However, is there a reason why we need to subscribe to disassociated events in one deploy mode but not the other?

lianhuiwang · 2015-01-14T01:38:43Z

@andrewor14 I think I have answered your question in my reply to the inline comment. If you have any other questions, please tell me and I will update the PR to address your comments. Thanks.

SparkQA · 2015-01-14T07:22:37Z

Test build #25520 has started for PR 3962 at commit bbc4d5a.

This patch merges cleanly.

SparkQA · 2015-01-14T07:27:30Z

Test build #25522 has started for PR 3962 at commit 7a7767a.

This patch merges cleanly.

SparkQA · 2015-01-14T07:28:27Z

Test build #25520 has finished for PR 3962 at commit bbc4d5a.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- protected class YarnSchedulerActor(isClusterMode: Boolean) extends Actor

AmplabJenkins · 2015-01-14T07:28:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25520/
Test FAILed.

SparkQA · 2015-01-14T07:33:41Z

Test build #25522 has finished for PR 3962 at commit 7a7767a.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- protected class YarnSchedulerActor(isClusterMode: Boolean) extends Actor

SparkQA · 2015-01-26T04:21:23Z

Test build #26071 has finished for PR 3962 at commit 08ba473.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- protected class YarnSchedulerActor(isClusterMode: Boolean) extends Actor

AmplabJenkins · 2015-01-26T04:21:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26071/
Test FAILed.

SparkQA · 2015-01-26T07:07:45Z

Test build #26082 has started for PR 3962 at commit 9318fc1.

This patch merges cleanly.

SparkQA · 2015-01-26T07:12:44Z

Test build #26083 has started for PR 3962 at commit 45da3b0.

This patch merges cleanly.

SparkQA · 2015-01-26T07:17:42Z

Test build #26084 has started for PR 3962 at commit 12426af.

This patch merges cleanly.

lianhuiwang · 2015-01-26T07:21:40Z

@andrewor14 I have looked at it in depth. YarnSchedulerActor works well in both yarn-cluster and yarn-client mode, and I have tested it in both modes. Now we only change a small amount of AM code. Can you review this PR again? Thanks.

SparkQA · 2015-01-26T08:05:07Z

Test build #26083 has finished for PR 3962 at commit 45da3b0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-26T08:05:11Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26083/
Test FAILed.

SparkQA · 2015-01-26T08:07:30Z

Test build #26084 has finished for PR 3962 at commit 12426af.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-26T08:07:35Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26084/
Test FAILed.

SparkQA · 2015-01-26T08:11:10Z

Test build #26082 has finished for PR 3962 at commit 9318fc1.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- protected class YarnSchedulerActor extends Actor

AmplabJenkins · 2015-01-26T08:11:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26082/
Test PASSed.

andrewor14 · 2015-01-26T18:08:40Z

yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

To be fair, YARN is not a deploy mode. I would just update this to say:

An actor that communicates with the driver's scheduler backend.

andrewor14 · 2015-01-26T18:23:52Z

@lianhuiwang Great that the latest changes are now much simpler and minimal. However, I still don't fully agree with one point:

so if AMActor subscribe to disassociated event and finish with FinalApplicationStatus.SUCCEEDED, that's incorrect to do so

Why is it incorrect? Although the AM and the driver belong to the same process, the driver still runs in its own thread. In cluster mode, even if we don't finish with SUCCEEDED on driver disassociation (as in your current patch), the AM still eventually finishes with a 0 exit code anyway, since it joins on the driver thread. Listening for the disassociated event is just one way we detect whether the driver has exited, and the behavior should be the same across deploy modes.

By the way, I'm not saying that this patch is incorrect in its current state. I just don't find the isDriver check in AMActor necessary because the ultimate behavior should be the same without it.

lianhuiwang · 2015-01-27T01:32:08Z

@andrewor14 If we don't finish with SUCCEEDED on driver disassociation, the AM should be able to finish with a non-zero code. For example, if the driver's main class throws an exception and exits, I think the AMActor will currently receive the disassociated event and finish with SUCCEEDED. If the AMActor does not listen for the disassociated event, the userThread in the AM can catch the exception from the driver's main class and finish with FAILED, not SUCCEEDED.
So I think it is unnecessary for the AMActor to listen for the driver's disassociated event in yarn-cluster mode, because the userThread in the AM can monitor any exception from the driver's main class, whereas the AMActor does not know about driver exceptions. Does that make sense, or do you have a different opinion?
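
To make the user-thread argument concrete, here is a minimal sketch, not the actual ApplicationMaster code. `UserThreadSketch` and `runAndJoin` are hypothetical names, and the exit codes 0 and 1 are assumptions chosen for illustration. It only shows that the thread running the user class sees the exception directly, so the code that joins on it can report a failure.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical illustration only; names and exit codes are assumptions,
// not Spark's ApplicationMaster implementation.
object UserThreadSketch {
  /** Runs `userMain` on its own thread, joins on it, and returns an exit code. */
  def runAndJoin(userMain: () => Unit): Int = {
    val exitCode = new AtomicInteger(0) // 0 means the user class returned normally
    val userThread = new Thread(new Runnable {
      override def run(): Unit =
        try userMain()
        catch {
          // The thread that runs the user class sees the exception directly.
          case _: Exception => exitCode.set(1)
        }
    })
    userThread.start()
    userThread.join() // the caller blocks until the user class finishes
    exitCode.get()
  }
}
```

In this sketch, `UserThreadSketch.runAndJoin(() => throw new RuntimeException("boom"))` returns the assumed failure code, while a user class that returns normally yields 0. An actor that only observes a disassociation has no such information.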

SparkQA · 2015-01-27T01:42:47Z

Test build #26134 has started for PR 3962 at commit 48d9ebb.

This patch merges cleanly.

SparkQA · 2015-01-27T02:53:09Z

Test build #26134 has finished for PR 3962 at commit 48d9ebb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-27T02:53:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26134/
Test PASSed.

andrewor14 · 2015-01-28T20:32:50Z

So I think it is unnecessary for the AMActor to listen for the driver's disassociated event in yarn-cluster mode, because the userThread in the AM can monitor any exception from the driver's main class, whereas the AMActor does not know about driver exceptions. Does that make sense?

Yes I understand. Note however we already catch exceptions in the user thread and finish with failed exit code here. Since finish can only be called once, when an exception occurs we already exit with a non-zero code so the disassociation event shouldn't do anything. However, this assumes that the disassociation event comes after the exception is caught, which may not necessarily be true. For this reason I think it's safer to leave the check in there as you suggest, to guard against any unforeseen race conditions that might occur. We definitely need some comment to explain why the behavior is different in cluster mode though.
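
The race described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not Spark's code: `FinishOnce`, `FinalStatus`, and `DisassociationPolicy` are hypothetical names, and the real `finish` does more than record a status. The sketch only illustrates why a once-only `finish` makes event ordering matter, and how skipping the disassociation handler in cluster mode removes that dependency.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-ins for illustration; not Spark's actual types.
sealed trait FinalStatus
case object Succeeded extends FinalStatus
case object Failed extends FinalStatus

final class FinishOnce {
  private val status = new AtomicReference[Option[FinalStatus]](None)

  /** Records the status only on the first call; returns true if this call won. */
  def finish(s: FinalStatus): Boolean = status.compareAndSet(None, Some(s))

  /** The status recorded by the first successful call, if any. */
  def result: Option[FinalStatus] = status.get()
}

object DisassociationPolicy {
  /** In cluster mode the event is ignored, so the user thread decides the outcome. */
  def onDriverDisassociated(isClusterMode: Boolean, am: FinishOnce): Unit =
    if (!isClusterMode) {
      am.finish(Succeeded)
      ()
    }
}
```

If the disassociation event is handled before the user thread reports its exception, the unguarded path records `Succeeded` and the later `finish(Failed)` is dropped. With the cluster-mode guard, the handler does nothing and the user thread's `Failed` is the status that sticks.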

andrewor14 · 2015-01-28T20:36:39Z

LGTM actually. I'm going to merge this into master after adding the comment I talked about earlier. Thanks for explaining your reasoning @lianhuiwang.

lianhuiwang · 2015-01-29T01:35:45Z

@andrewor14 Thanks for your help.

in yarn-cluster mode,executor number can be added or killed.

6dfeeec

fix a min bug

2164ea8

in AM create a new actorSystem

7d33791

fix bug

1b029a4

andrewor14 reviewed Jan 13, 2015

lianhuiwang added 2 commits January 14, 2015 15:19

address andrewor14's comments

bbc4d5a

fix address andrewor14's comments

7a7767a

lianhuiwang added 2 commits January 26, 2015 15:03

update with andrewor14's comments

9318fc1

remove unrelated code

45da3b0

refactor am's code

12426af

andrewor14 reviewed Jan 26, 2015

update with andrewor14's comments

48d9ebb

asfgit closed this in 81f8f34 Jan 28, 2015
