
[SPARK-6443][Spark Submit]Could not submit app in standalone cluster mode when HA is enabled #5116

Closed
wants to merge 14 commits

Conversation

WangTaoTheTonic
Contributor

3/26 update:

  • Akka-based:
    Use an array of ActorSelection to represent the multiple masters. Add an activeMasterActor for querying the driver's status, and add lost masters (including standby ones) to lostMasters.
    When the size of lostMasters is greater than or equal to the number of masters, report an error that all masters are unavailable.
  • Rest-based:
    When no master is available (an exception is thrown), we fall back to the Akka gateway to submit apps.

I have done simple testing on a standalone HA cluster (with both masters alive, and with one alive/one dead), and it worked.

There may remain some issues with style or message printing, but we can review the approach first and then fix them together.
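As a rough sketch of the Akka-based failover described above, assuming hypothetical helper names (`submitToMaster` and the exception type are illustrative stand-ins, not Spark's actual API):

```scala
import scala.collection.mutable

// Illustrative stand-in for a failed connection to one master.
case class UnreachableMasterException(url: String) extends Exception(url)

// Try each master in turn; track unreachable ones in lostMasters and give up
// only once every master (active and standby alike) has been lost.
def submitWithFailover(masters: Seq[String],
                       submitToMaster: String => Boolean): Boolean = {
  val lostMasters = mutable.Set[String]()
  masters.foreach { m =>
    try {
      if (submitToMaster(m)) return true // the active master accepted the driver
      // a reachable standby master rejects submissions; keep trying the others
    } catch {
      case _: UnreachableMasterException => lostMasters += m
    }
    if (lostMasters.size >= masters.size) {
      sys.error("All masters are unavailable.")
    }
  }
  false
}
```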

/cc @srowen @andrewor14

@SparkQA

SparkQA commented Mar 21, 2015

Test build #28942 has finished for PR 5116 at commit da4ad2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SubmitDriverResponse(
    • case class KillDriverResponse(

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Mar 25, 2015

Test build #29138 has finished for PR 5116 at commit da4ad2e.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class SubmitDriverResponse(
    • case class KillDriverResponse(

@SparkQA

SparkQA commented Mar 25, 2015

Test build #29147 has finished for PR 5116 at commit 60d97a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2015

Test build #29146 has finished for PR 5116 at commit fa1fa80.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2015

Test build #29156 has finished for PR 5116 at commit 2b011c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic
Contributor Author

Though it passed all the tests, it still doesn't work well in some cases.
I'm still working on it.

@SparkQA

SparkQA commented Mar 26, 2015

Test build #29205 has finished for PR 5116 at commit 7a881b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic changed the title [SPARK-6443][WIP]Could not submit app in standalone cluster mode when HA is enabled [SPARK-6443][Spark Submit]Could not submit app in standalone cluster mode when HA is enabled Mar 26, 2015
@WangTaoTheTonic
Contributor Author

The UT failed because of a Hive compile error, which will be fixed in #5198.

@WangTaoTheTonic
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Mar 26, 2015

Test build #29222 timed out for PR 5116 at commit 7a881b3 after a configured wait of 120m.

@WangTaoTheTonic
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Mar 26, 2015

Test build #29229 has finished for PR 5116 at commit 7a881b3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Mar 26, 2015

Test build #29235 has finished for PR 5116 at commit 7a881b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangTaoTheTonic
Contributor Author

@andrewor14 Could you please take a look?

if (success) {
  activeMasterActor = context.actorSelection(sender.path)
  pollAndReportStatus(driverId.get)
} else if (!message.contains("Can only")) {
Contributor

can you explain what "Can only" is?

Contributor

Ah, I see, it's from this message: Can only accept driver submissions in ALIVE state.
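For context, a minimal illustration of how that message distinguishes a reachable standby master from a real failure (the helper name is hypothetical; the string is the one quoted above):

```scala
// A standby master responds with this message rather than failing outright,
// so the client should keep trying other masters instead of reporting an error.
def isStandbyResponse(message: String): Boolean =
  message.contains("Can only") // i.e. "Can only accept driver submissions in ALIVE state."
```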

@WangTaoTheTonic
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Apr 16, 2015

Test build #30416 has finished for PR 5116 at commit 220cb3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30719 has finished for PR 5116 at commit a41de0b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30724 has finished for PR 5116 at commit f4f972b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@WangTaoTheTonic
Contributor Author

@andrewor14 I've made the changes and rebased. If everything is OK I'd like to get it merged ASAP. Please take a look.

@WangTaoTheTonic
Contributor Author

ping @andrewor14

@SparkQA

SparkQA commented Apr 27, 2015

Test build #31047 has started for PR 5116 at commit f4f972b.

@WangTaoTheTonic
Contributor Author

Looks like Jenkins has done its work but hasn't posted the result here.

@andrewor14
Contributor

retest this please

@andrewor14
Contributor

I think catching SubmitRestConnectionException in createSubmission/requestSubmissionStatus/killSubmission is more reasonable than in readResponse, as we use requestSubmissionStatus to read the response from the server and return it as a validated response, and we submit requests to multiple masters in createSubmission/requestSubmissionStatus/killSubmission, so we should handle the response there as well.

If you look into these methods you'll notice that the only place that actually makes a connection to the master is post or postJson, and both of those methods call readResponse. The current scope for catching these exceptions is too wide (I believe we will never throw this exception outside of readResponse), and the cost we pay is duplicated code.

@andrewor14
Contributor

By the way the rest of the changes look fine to me.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31161 has finished for PR 5116 at commit f4f972b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@WangTaoTheTonic
Contributor Author

I believe we will never throw this exception outside of readResponse

@andrewor14 No. In postJson, a ConnectException will be thrown when trying to get the output stream of a connection while the remote server is down. I handle it here:

try {
  val out = new DataOutputStream(conn.getOutputStream)
  Utils.tryWithSafeFinally {
    out.write(json.getBytes(Charsets.UTF_8))
  } {
    out.close()
  }
} catch {
  case e: ConnectException =>
    throw new SubmitRestConnectionException("Connect Exception when connect to server", e)
}

@andrewor14
Contributor

I see, that does seem to be the case. Then I would do it in get, post, and postJson instead of the higher-level methods.

@WangTaoTheTonic
Contributor Author

Sorry, but I don't see the benefit of pushing it down, since that still leaves the same duplicated code.

I think we should catch SubmitRestConnectionException when sending a request to one master, rather than when posting a request to some URL. Code-wise they are the same in this case, but the meaning is different.
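A minimal sketch of the scope being argued for here, with hypothetical stand-ins for Spark's types and helpers: the connection failure is caught once per master in the high-level submission loop rather than in each low-level get/post/postJson helper.

```scala
// Illustrative stand-in for Spark's SubmitRestConnectionException.
class SubmitRestConnectionException(msg: String, cause: Throwable)
  extends Exception(msg, cause)

// Submit to each master in turn; a connection failure to one master just moves
// us to the next, and the exception is only rethrown once every master failed.
def createSubmission(masters: Seq[String],
                     postToMaster: String => String): Option[String] = {
  var lost = 0
  masters.foreach { m =>
    try {
      return Some(postToMaster(m)) // throws if this master is down
    } catch {
      case e: SubmitRestConnectionException =>
        lost += 1
        if (lost >= masters.size) throw e // all masters unreachable
    }
  }
  None
}
```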

@SparkQA

SparkQA commented Apr 29, 2015

Test build #31227 has finished for PR 5116 at commit 76fd411.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 29, 2015

Test build #31259 has finished for PR 5116 at commit 2a28aab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@andrewor14
Contributor

Ok, no problem. The slightly nicer thing is that it hides the details from the higher level and only handles them in relatively low-level methods (e.g. post), but it's not a big deal. I will merge this first and we can always do any cleanups later if needed. I don't want to make you keep rebasing.

@andrewor14
Contributor

But first, retest this please

@SparkQA

SparkQA commented Apr 29, 2015

Test build #31303 has finished for PR 5116 at commit 2a28aab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@WangTaoTheTonic
Contributor Author

@andrewor14 Thanks for all the comments. :)

@andrewor14
Contributor

Merging into master.

@asfgit closed this in b4b43df May 2, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
…r mode when HA is enabled

**3/26 update:**
* Akka-based:
  Use an array of `ActorSelection` to represent multiple master. Add an `activeMasterActor` for query status of driver. And will add lost masters( including the standby one) to `lostMasters`.
  When size of `lostMasters` equals or greater than # of all masters, we should give an error that all masters are not avalible.

* Rest-based:
  When all masters are not available(throw an exception), we use akka gateway to submit apps.

I have tested simply on standalone HA cluster(with two masters alive and one alive/one dead), it worked.

There might remains some issues on style or message print, but we can check the solution then fix them together.

/cc srowen andrewor14

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes apache#5116 from WangTaoTheTonic/SPARK-6443 and squashes the following commits:

2a28aab [WangTaoTheTonic] based the newest change apache#5144
76fd411 [WangTaoTheTonic] rebase
f4f972b [WangTaoTheTonic] rebase...again
a41de0b [WangTaoTheTonic] rebase
220cb3c [WangTaoTheTonic] move connect exception inside
35119a0 [WangTaoTheTonic] style and compile issues
9d636be [WangTaoTheTonic] per Andrew's comments
979760c [WangTaoTheTonic] rebase
e4f4ece [WangTaoTheTonic] fix failed test
5d23958 [WangTaoTheTonic] refact some duplicated code, style and comments
7a881b3 [WangTaoTheTonic] when one of masters is gone, we still can submit
2b011c9 [WangTaoTheTonic] fix broken tests
60d97a4 [WangTaoTheTonic] rebase
fa1fa80 [WangTaoTheTonic] submit app to HA cluster in standalone cluster mode
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015