[SPARK-6443][Spark Submit]Could not submit app in standalone cluster mode when HA is enabled #5116
Test build #28942 has finished for PR 5116 at commit
|
retest this please |
Test build #29138 has finished for PR 5116 at commit
|
Force-pushed from da4ad2e to fa1fa80
Test build #29147 has finished for PR 5116 at commit
|
Test build #29146 has finished for PR 5116 at commit
|
Test build #29156 has finished for PR 5116 at commit
|
Though it passed all the tests, it still doesn't work well in some cases. |
Test build #29205 has finished for PR 5116 at commit
|
The unit tests failed because of a Hive compile error, which will be fixed in #5198. |
Jenkins, test this please. |
Test build #29222 timed out for PR 5116 at commit |
Jenkins, test this please. |
Test build #29229 has finished for PR 5116 at commit
|
Jenkins, test this please. |
Test build #29235 has finished for PR 5116 at commit
|
@andrewor14 Could you please take a look? |
if (success) {
  activeMasterActor = context.actorSelection(sender.path)
  pollAndReportStatus(driverId.get)
} else if (!message.contains("Can only")) {
Can you explain what "Can only" is?
Ah, I see, it's from this message: `Can only accept driver submissions in ALIVE state`.
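The check discussed above keys off the master's refusal message. As a rough sketch of the idea (not the PR's actual code): `SubmitResponse`, `Action`, and `handle` are hypothetical names introduced here, and what the real `else if` branch does is not visible in the quoted snippet, so reporting a failure there is an assumption.

```scala
object SubmitResponseHandling {
  // Hypothetical, simplified stand-in for a master's reply to a driver submission.
  final case class SubmitResponse(success: Boolean, driverId: Option[String], message: String)

  sealed trait Action
  // The active master accepted the driver; start polling its status.
  final case class PollStatus(driverId: String) extends Action
  // A master that is not ALIVE (e.g. a standby) refused; keep waiting for the others.
  case object IgnoreStandby extends Action
  // A master reported a genuine failure (assumed handling, see lead-in).
  final case class Fail(reason: String) extends Action

  // Prefix of the refusal message quoted in the review comment.
  val NotAlivePrefix = "Can only"

  def handle(resp: SubmitResponse): Action =
    if (resp.success) {
      resp.driverId match {
        case Some(id) => PollStatus(id)
        case None     => Fail("master reported success without a driver id")
      }
    } else if (!resp.message.contains(NotAlivePrefix)) {
      Fail(resp.message)
    } else {
      IgnoreStandby
    }
}
```

Matching on a message prefix is brittle if the wording ever changes; a dedicated status field in the response would be a sturdier design, at the cost of changing the message protocol.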
Jenkins, test this please. |
Test build #30416 has finished for PR 5116 at commit
|
Test build #30719 has finished for PR 5116 at commit
|
Test build #30724 has finished for PR 5116 at commit
|
@andrewor14 I've made the changes and rebased. If everything is OK, I'd like to merge it ASAP. Please take a look. |
ping @andrewor14 |
Test build #31047 has started for PR 5116 at commit |
Looks like Jenkins has done his work but not posted the result here. |
retest this please |
If you look into these methods you'll notice that the only place where it actually makes a connection to the master is in |
By the way the rest of the changes look fine to me. |
Test build #31161 has finished for PR 5116 at commit
|
@andrewor14 No. In
|
I see, that seems to be the case, then I would do it in |
Sorry, but I don't see the benefit of pushing it down, since that would still leave the same duplicated code. I think we should catch |
Test build #31227 has finished for PR 5116 at commit
|
Test build #31259 has finished for PR 5116 at commit
|
Ok, no problem. The slightly nice thing is that it hides the details from the higher level and only handles them in relatively low-level methods (e.g. post), but not a big deal. I will merge this first and we can always do any cleanups later if needed. I don't want to make you keep rebasing. |
But first, retest this please |
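To make the trade-off in this exchange concrete, here is a minimal sketch of the "handle it in low-level methods" option. All names (`UnreachableMasterException`, `post`, `submit`) are hypothetical and only loosely modeled on the discussion; the elided method names in the comments above are not reconstructed, and this is not claimed to be what was merged.

```scala
import java.net.ConnectException

object ConnectHandling {
  // Hypothetical exception signalling "this master cannot be reached".
  final class UnreachableMasterException(master: String, cause: Throwable)
    extends Exception(s"Unable to connect to master $master", cause)

  // Low-level helper: performs the raw call and translates connection failures,
  // so callers never see java.net.ConnectException directly.
  def post[A](master: String)(send: String => A): A =
    try send(master)
    catch { case e: ConnectException => throw new UnreachableMasterException(master, e) }

  // Higher level: tries each master in order and returns the first successful
  // result, or None when every master was unreachable.
  def submit[A](masters: Seq[String])(send: String => A): Option[A] =
    masters.iterator.map { m =>
      try Some(post(m)(send))
      catch { case _: UnreachableMasterException => None }
    }.collectFirst { case Some(result) => result }
}
```

Because the iterator is lazy, masters after the first reachable one are never contacted. Catching the exception higher up instead, as the author prefers, would keep `post` simpler but repeat the same `catch` clause at each call site.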
Test build #31303 has finished for PR 5116 at commit
|
@andrewor14 Thanks for all comments. :) |
Merging into master. |
…r mode when HA is enabled

**3/26 update:**
* Akka-based: Use an array of `ActorSelection` to represent multiple master. Add an `activeMasterActor` for query status of driver. And will add lost masters( including the standby one) to `lostMasters`. When size of `lostMasters` equals or greater than # of all masters, we should give an error that all masters are not avalible.
* Rest-based: When all masters are not available(throw an exception), we use akka gateway to submit apps.

I have tested simply on standalone HA cluster(with two masters alive and one alive/one dead), it worked.
There might remains some issues on style or message print, but we can check the solution then fix them together.

/cc srowen andrewor14

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes apache#5116 from WangTaoTheTonic/SPARK-6443 and squashes the following commits:

2a28aab [WangTaoTheTonic] based the newest change apache#5144
76fd411 [WangTaoTheTonic] rebase
f4f972b [WangTaoTheTonic] rebase...again
a41de0b [WangTaoTheTonic] rebase
220cb3c [WangTaoTheTonic] move connect exception inside
35119a0 [WangTaoTheTonic] style and compile issues
9d636be [WangTaoTheTonic] per Andrew's comments
979760c [WangTaoTheTonic] rebase
e4f4ece [WangTaoTheTonic] fix failed test
5d23958 [WangTaoTheTonic] refact some duplicated code, style and comments
7a881b3 [WangTaoTheTonic] when one of masters is gone, we still can submit
2b011c9 [WangTaoTheTonic] fix broken tests
60d97a4 [WangTaoTheTonic] rebase
fa1fa80 [WangTaoTheTonic] submit app to HA cluster in standalone cluster mode
3/26 update:
* Akka-based: Use an array of `ActorSelection` to represent multiple masters. Add an `activeMasterActor` for querying the status of the driver. Lost masters (including the standby one) are added to `lostMasters`. When the size of `lostMasters` is equal to or greater than the number of masters, we should report an error that all masters are unavailable.
* Rest-based: When all masters are unavailable (an exception is thrown), we use the Akka gateway to submit apps.

I have tested this briefly on a standalone HA cluster (with two masters alive, and with one alive and one dead), and it worked.
There might remain some issues with style or message printing, but we can check the solution and then fix them together.
/cc @srowen @andrewor14
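The `lostMasters` bookkeeping described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `MasterTracker`, `markLost`, and `allLost` are hypothetical names, masters are identified by plain URL strings, and a set is used so that losing the same master twice is not double-counted.

```scala
import scala.collection.mutable

object LostMasterTracking {
  // Hypothetical tracker for masters that have become unreachable.
  final class MasterTracker(masters: Seq[String]) {
    private val lostMasters = mutable.HashSet.empty[String]

    // Records a lost master (active or standby) and returns true once
    // every configured master has been lost.
    def markLost(master: String): Boolean = {
      lostMasters += master
      allLost
    }

    // Mirrors the rule in the description: an error is due when the number
    // of lost masters is equal to or greater than the number of masters.
    def allLost: Boolean = lostMasters.size >= masters.size
  }
}
```

A caller would report the "all masters unavailable" error when `markLost` returns true. On the REST path, that same condition is the point at which the description falls back to the Akka gateway.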