-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-4955]With executor dynamic scaling enabled,executor shoude be added or killed in yarn-cluster mode. #3962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #25290 has started for PR 3962 at commit
|
|
Test build #25290 has finished for PR 3962 at commit
|
|
Test FAILed. |
|
Test build #25291 has started for PR 3962 at commit
|
|
Test build #25291 has finished for PR 3962 at commit
|
|
Test FAILed. |
|
Test build #25299 has started for PR 3962 at commit
|
|
Test build #25303 has started for PR 3962 at commit
|
|
Test build #25303 has finished for PR 3962 at commit
|
|
Test FAILed. |
|
Test build #25299 has finished for PR 3962 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not listen for these in cluster mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because in cluster mode, YarnSchedulerActor and driver use same actorSystem, if YarnSchedulerActor subscribe and listen driver's actor, messages from driver's actor that should send to executor will send to YarnSchedulerActor .so there is a big wrong and YarnSchedulerActor cannot listen driver's actor in cluster mode.
|
Hi @lianhuiwang thanks for fixing this. It's pretty clear why it's not working in cluster mode; the actor that acts as a server for allocation requests is simply not started in this mode... this seems like a critical omission on my part. The approach is pretty straightforward in this PR. However, is there a reason why we need to subscribe to disassociated events in one deploy mode but not the other? |
|
@andrewor14 in comment's reply, I think I should answer your questions in my reply. if you have any question, please tell me and i will update for your comments. thanks. |
|
Test build #25520 has started for PR 3962 at commit
|
|
Test build #25522 has started for PR 3962 at commit
|
|
Test build #25520 has finished for PR 3962 at commit
|
|
Test FAILed. |
|
Test build #25522 has finished for PR 3962 at commit
|
|
Test build #26071 has finished for PR 3962 at commit
|
|
Test FAILed. |
|
Test build #26082 has started for PR 3962 at commit
|
|
Test build #26083 has started for PR 3962 at commit
|
|
Test build #26084 has started for PR 3962 at commit
|
|
@andrewor14 I have looked at it in depth. YarnSchedulerActor can work very well in both yarn cluster and yarn client mode and i have tested in these two mode. Now we just change small code of AM. Can you review this PR again? thanks. |
|
Test build #26083 has finished for PR 3962 at commit
|
|
Test FAILed. |
|
Test build #26084 has finished for PR 3962 at commit
|
|
Test FAILed. |
|
Test build #26082 has finished for PR 3962 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be fair, YARN is not a deploy mode. I would just update this to say
An actor that communicates with the driver's scheduler backend.
|
@lianhuiwang Great that the latest changes are now much simpler and minimal. However, I still don't fully agree with one point:
Why is it incorrect? Although the AM and the driver belong to the same process, the driver still runs in its own thread. In cluster mode, even if we don't finish with By the way, I'm not saying that this patch is incorrect in its current state. I just don't find the |
|
@andrewor14 if we don't finish with SUCCEEDED on driver disassociation, AM should finish with non-zero. example: if driver's main class throw some exception and exit, I think now AMActor will listen on the disassociated event and finish with SUCCEEDED.if AMActor donot listen on the disassociated event, userThread in AM can catch these exception from driver's main class and finish with FAILED, not SUCCEEDED. |
|
Test build #26134 has started for PR 3962 at commit
|
|
Test build #26134 has finished for PR 3962 at commit
|
|
Test PASSed. |
Yes I understand. Note however we already catch exceptions in the user thread and |
|
LGTM actually. I'm going to merge this into master after adding the comment I talked about earlier. Thanks for explaining your reasoning @lianhuiwang. |
|
@andrewor14 thanks for you help. |
With executor dynamic scaling enabled, executor number shoude be added or killed in yarn-cluster mode.so in yarn-cluster mode, ApplicationMaster start a AMActor that add or kill a executor. then YarnSchedulerActor in YarnSchedulerBackend send message to am's AMActor.
@andrewor14 @ChengXiangLi @tdas