Skip to content

Conversation

@lianhuiwang
Copy link
Contributor

With executor dynamic scaling enabled, executor number shoude be added or killed in yarn-cluster mode.so in yarn-cluster mode, ApplicationMaster start a AMActor that add or kill a executor. then YarnSchedulerActor in YarnSchedulerBackend send message to am's AMActor.
@andrewor14 @ChengXiangLi @tdas

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25290 has started for PR 3962 at commit 6dfeeec.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25290 has finished for PR 3962 at commit 6dfeeec.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class YarnSchedulerActor(isDriver: Boolean) extends Actor

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25290/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25291 has started for PR 3962 at commit 2164ea8.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25291 has finished for PR 3962 at commit 2164ea8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class YarnSchedulerActor(isDriver: Boolean) extends Actor

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25291/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25299 has started for PR 3962 at commit 7d33791.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25303 has started for PR 3962 at commit 1b029a4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25303 has finished for PR 3962 at commit 1b029a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class YarnSchedulerActor(isDriver: Boolean) extends Actor

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25303/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25299 has finished for PR 3962 at commit 7d33791.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class YarnSchedulerActor(isDriver: Boolean) extends Actor

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25299/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not listen for these in cluster mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because in cluster mode, YarnSchedulerActor and driver use same actorSystem, if YarnSchedulerActor subscribe and listen driver's actor, messages from driver's actor that should send to executor will send to YarnSchedulerActor .so there is a big wrong and YarnSchedulerActor cannot listen driver's actor in cluster mode.

@andrewor14
Copy link
Contributor

Hi @lianhuiwang thanks for fixing this. It's pretty clear why it's not working in cluster mode; the actor that acts as a server for allocation requests is simply not started in this mode... this seems like a critical omission on my part.

The approach is pretty straightforward in this PR. However, is there a reason why we need to subscribe to disassociated events in one deploy mode but not the other?

@lianhuiwang
Copy link
Contributor Author

@andrewor14 in comment's reply, I think I should answer your questions in my reply. if you have any question, please tell me and i will update for your comments. thanks.

@SparkQA
Copy link

SparkQA commented Jan 14, 2015

Test build #25520 has started for PR 3962 at commit bbc4d5a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 14, 2015

Test build #25522 has started for PR 3962 at commit 7a7767a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 14, 2015

Test build #25520 has finished for PR 3962 at commit bbc4d5a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class YarnSchedulerActor(isClusterMode: Boolean) extends Actor

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25520/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 14, 2015

Test build #25522 has finished for PR 3962 at commit 7a7767a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class YarnSchedulerActor(isClusterMode: Boolean) extends Actor

@SparkQA
Copy link

SparkQA commented Jan 26, 2015

Test build #26071 has finished for PR 3962 at commit 08ba473.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class YarnSchedulerActor(isClusterMode: Boolean) extends Actor

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26071/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 26, 2015

Test build #26082 has started for PR 3962 at commit 9318fc1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 26, 2015

Test build #26083 has started for PR 3962 at commit 45da3b0.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 26, 2015

Test build #26084 has started for PR 3962 at commit 12426af.

  • This patch merges cleanly.

@lianhuiwang
Copy link
Contributor Author

@andrewor14 I have looked at it in depth. YarnSchedulerActor can work very well in both yarn cluster and yarn client mode and i have tested in these two mode. Now we just change small code of AM. Can you review this PR again? thanks.

@SparkQA
Copy link

SparkQA commented Jan 26, 2015

Test build #26083 has finished for PR 3962 at commit 45da3b0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26083/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 26, 2015

Test build #26084 has finished for PR 3962 at commit 12426af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26084/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 26, 2015

Test build #26082 has finished for PR 3962 at commit 9318fc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class YarnSchedulerActor extends Actor

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26082/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be fair, YARN is not a deploy mode. I would just update this to say

An actor that communicates with the driver's scheduler backend.

@andrewor14
Copy link
Contributor

@lianhuiwang Great that the latest changes are now much simpler and minimal. However, I still don't fully agree with one point:

so if AMActor subscribe to disassociated event and finish with FinalApplicationStatus.SUCCEEDED, that's incorrect to do so

Why is it incorrect? Although the AM and the driver belong to the same process, the driver still runs in its own thread. In cluster mode, even if we don't finish with SUCCEEDED on driver disassociation (as in your current patch), the AM still eventually finish with 0 exit code anyway since it joins on the driver thread. Listening on the disassociated event is just one way we detect whether the driver has exited, and the behavior should be the same across deploy modes.

By the way, I'm not saying that this patch is incorrect in its current state. I just don't find the isDriver check in AMActor necessary because the ultimate behavior should be the same without it.

@lianhuiwang
Copy link
Contributor Author

@andrewor14 if we don't finish with SUCCEEDED on driver disassociation, AM should finish with non-zero. example: if driver's main class throw some exception and exit, I think now AMActor will listen on the disassociated event and finish with SUCCEEDED.if AMActor donot listen on the disassociated event, userThread in AM can catch these exception from driver's main class and finish with FAILED, not SUCCEEDED.
so i think AMActor is unnecessary to listening to driver's disassociated event in yarn cluster mode because userThread in AM can monitor any exceptions of driver's main class. but AMActor donot know any exceptions of driver. Do you understand what i said? or Do you have different opinions?

@SparkQA
Copy link

SparkQA commented Jan 27, 2015

Test build #26134 has started for PR 3962 at commit 48d9ebb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 27, 2015

Test build #26134 has finished for PR 3962 at commit 48d9ebb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26134/
Test PASSed.

@andrewor14
Copy link
Contributor

so i think AMActor is unnecessary to listening to driver's disassociated event in yarn cluster mode because userThread in AM can monitor any exceptions of driver's main class. but AMActor donot know any exceptions of driver. Do you understand what i said?

Yes I understand. Note however we already catch exceptions in the user thread and finish with failed exit code here. Since finish can only be called once, when an exception occurs we already exit with a non-zero code so the disassociation event shouldn't do anything. However, this assumes that the disassociation event comes after the exception is caught, which may not necessarily be true. For this reason I think it's safer to leave the check in there as you suggest, to guard against any unforeseen race conditions that might occur. We definitely need some comment to explain why the behavior is different in cluster mode though.

@andrewor14
Copy link
Contributor

LGTM actually. I'm going to merge this into master after adding the comment I talked about earlier. Thanks for explaining your reasoning @lianhuiwang.

@asfgit asfgit closed this in 81f8f34 Jan 28, 2015
@lianhuiwang
Copy link
Contributor Author

@andrewor14 thanks for you help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants