Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7736] [core] [yarn] Make pyspark fail YARN app on failure. #7751

Closed
wants to merge 3 commits into from

Conversation

@vanzin
Copy link
Contributor

commented Jul 29, 2015

The YARN backend doesn't like when user code calls System.exit,
since it cannot know the exit status and thus cannot set an
appropriate final status for the application.

So, for pyspark, avoid that call and instead throw an exception with
the exit code. SparkSubmit handles that exception and exits with
the given exit code, while YARN uses the exit code as the failure
code for the Spark app.

The YARN backend doesn't like when user code calls `System.exit`,
since it cannot know the exit status and thus cannot set an
appropriate final status for the application.

So, for pyspark, avoid that call and instead throw an exception with
the exit code. SparkSubmit handles that exception and exits with
the given exit code, while YARN uses the exit code as the failure
code for the Spark app.
@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Jul 29, 2015

Tested on real cluster. Not sure if it's worth it to add another test in YarnClusterSuite, since those are generally slow, but it would be easy to do so.

@SparkQA

This comment has been minimized.

Copy link

commented Jul 29, 2015

Test build #38854 timed out for PR 7751 at commit 2bb2a8a after a configured wait of 175m.

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Jul 29, 2015

Jenkins retest this please.

@andrewor14

This comment has been minimized.

Copy link
Contributor

commented Jul 29, 2015

can you make this [Core] [YARN] so our PR dashboard displays this properly?

@vanzin vanzin changed the title [SPARK-9416] [core,yarn] Make pyspark fail YARN app on failure. [SPARK-9416] [core] [yarn] Make pyspark fail YARN app on failure. Jul 29, 2015
@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Jul 29, 2015

Just ran into an issue where py4j threads are not daemon threads, so the YARN app is not exiting. Taking a look...

py4j uses non-daemon threads internally, so if it's not
explicitly stopped, it will prevent the process from
exiting now that System.exit() is not being used.
@SparkQA

This comment has been minimized.

Copy link

commented Jul 30, 2015

Test build #146 timed out for PR 7751 at commit 2bb2a8a after a configured wait of 175m.

@SparkQA

This comment has been minimized.

Copy link

commented Jul 30, 2015

Test build #38885 timed out for PR 7751 at commit 2bb2a8a after a configured wait of 175m.

@SparkQA

This comment has been minimized.

Copy link

commented Jul 30, 2015

Test build #38911 has finished for PR 7751 at commit 8df6b09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@srowen

This comment has been minimized.

Copy link
Member

commented Jul 30, 2015

Seems pretty reasonable to me.

@vanzin vanzin changed the title [SPARK-9416] [core] [yarn] Make pyspark fail YARN app on failure. [SPARK-7736] [core] [yarn] Make pyspark fail YARN app on failure. Aug 3, 2015
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@SparkQA

This comment has been minimized.

Copy link

commented Aug 5, 2015

Test build #39774 has finished for PR 7751 at commit d6667da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 5, 2015

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 5, 2015

Test build #39886 has finished for PR 7751 at commit d6667da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA

This comment has been minimized.

Copy link

commented Aug 5, 2015

Test build #235 has finished for PR 7751 at commit d6667da.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BoundPortsResponse(rpcEndpointPort: Int, webUIPort: Int, restPort: Option[Int])
    • ParamDesc[Array[Double]]("thresholds", "Thresholds in multi-class classification" +
    • final val thresholds: DoubleArrayParam = new DoubleArrayParam(this, "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.", (t: Array[Double]) => t.forall(_ >= 0))
@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 11, 2015

retest this please

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 11, 2015

I'm gonna merge this later today if there are no more comments.

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 11, 2015

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 12, 2015

Test build #40540 timed out for PR 7751 at commit d6667da after a configured wait of 175m.

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 12, 2015

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 12, 2015

Test build #40576 has finished for PR 7751 at commit d6667da.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@andrewor14

This comment has been minimized.

Copy link
Contributor

commented Aug 12, 2015

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 12, 2015

Test build #1494 has finished for PR 7751 at commit d6667da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 12, 2015

This has passed tests already, all latest failures are due to flakiness...

retest this please

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 12, 2015

retest this please

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 12, 2015

jenkins, retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 13, 2015

Test build #40689 has finished for PR 7751 at commit d6667da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@andrewor14

This comment has been minimized.

Copy link
Contributor

commented Aug 13, 2015

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 13, 2015

Test build #40714 timed out for PR 7751 at commit d6667da after a configured wait of 175m.

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 13, 2015

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 13, 2015

Test build #40797 timed out for PR 7751 at commit d6667da after a configured wait of 175m.

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2015

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 14, 2015

Test build #40836 timed out for PR 7751 at commit d6667da after a configured wait of 175m.

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2015

retest this please

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2015

wtf

@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2015

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 14, 2015

Test build #1605 has finished for PR 7751 at commit d6667da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@vanzin

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2015

Streaming tests are really flaky... anyway, all interesting tests passed, so unless I hear otherwise, I'll merge this Monday morning.

@asfgit asfgit closed this in f68d024 Aug 17, 2015
@vanzin vanzin deleted the vanzin:SPARK-9416 branch Aug 17, 2015
asfgit pushed a commit that referenced this pull request Sep 9, 2015
The YARN backend doesn't like when user code calls `System.exit`,
since it cannot know the exit status and thus cannot set an
appropriate final status for the application.

So, for pyspark, avoid that call and instead throw an exception with
the exit code. SparkSubmit handles that exception and exits with
the given exit code, while YARN uses the exit code as the failure
code for the Spark app.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7751 from vanzin/SPARK-9416.

(cherry picked from commit f68d024)
ashangit pushed a commit to ashangit/spark that referenced this pull request Oct 19, 2016
The YARN backend doesn't like when user code calls `System.exit`,
since it cannot know the exit status and thus cannot set an
appropriate final status for the application.

So, for pyspark, avoid that call and instead throw an exception with
the exit code. SparkSubmit handles that exception and exits with
the given exit code, while YARN uses the exit code as the failure
code for the Spark app.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#7751 from vanzin/SPARK-9416.

(cherry picked from commit f68d024)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.