[SPARK-12265][Mesos] Spark calls System.exit inside driver instead of throwing exception #10729
Conversation
Can one of the admins verify this patch?
Description in PR please? It makes it easier to review and becomes the commit message.
Updated the commit message & description.
@nraychaudhuri I think this is a step in a good direction, but now if there is an error, how does the process know to exit? It seems like it just continues. The semantics have changed, unless I'm missing something.
@srowen good point, it hangs. But as far as I can see, this is due to a count-down latch that's not released in case of error.
@dragos do you see it hanging? I introduced the markErr method to take care of the countdown latch. It is called from the error handler.
Yes, I see it hanging. I ran this on a Mesos cluster with no roles defined.
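The hang described above can be reproduced with plain JVM primitives. The sketch below is hypothetical (not the Spark code): a waiter blocks on a CountDownLatch, the worker's error path throws before the latch is released, and the fix is to count down on the error path too, which is the role the PR's markErr method plays.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchHangDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch registered = new CountDownLatch(1);

        Thread driver = new Thread(() -> {
            try {
                // Simulate the error callback throwing before the latch
                // would normally be released.
                throw new RuntimeException("Mesos error: no roles defined");
            } catch (RuntimeException e) {
                // Fix: release the latch on the error path too, so the
                // thread blocked in await() is not stuck forever.
                registered.countDown();
                throw e; // propagate so the failure is still visible
            }
        });
        driver.setDaemon(true);
        driver.start();

        // Without the countDown() in the catch block, this await would
        // block indefinitely (here bounded by a timeout for the demo).
        boolean released = registered.await(5, TimeUnit.SECONDS);
        System.out.println("latch released: " + released);
    }
}
```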
@@ -376,6 +376,7 @@ private[spark] class MesosSchedulerBackend(
    inClassLoader() {
      logError("Mesos error: " + message)
      scheduler.error(message)
The problem is this call: it throws an exception, so markErr is never called.
@dragos good catch. I guess I only tested in coarse-grained mode. I will push a fix. Thanks
@@ -109,20 +109,17 @@ private[mesos] trait MesosSchedulerUtils extends Logging {
    new Thread(Utils.getFormattedClassName(this) + "-mesos-driver") {
      setDaemon(true)
      setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
Why do we want this vs. simply handling exceptions in run? That's how it was before, at least.
I wanted to cleanly separate the core logic from the exception-handling code.
I don't think it's clearer, in that you'd have to understand what this handler does. A try-catch seems more recognizable.
This pattern is used in other places in Spark, though. But if it helps I can change it back to try-catch.
I only see it in one place, FsHistoryProvider, where it looks like it's mostly there to support testing.
Hmm... I thought I saw it in other places too. I noticed setDefaultUncaughtExceptionHandler used as well. Anyway, I will change this to try-catch to keep things simple.
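The two styles being debated can be contrasted with plain JVM threading APIs. This is an illustrative sketch, not the Spark code: style 1 moves error handling into an UncaughtExceptionHandler attached to the thread, while style 2 keeps a try-catch inside the run body, which the review argues is more immediately recognizable.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ErrorHandlingStyles {
    public static void main(String[] args) throws InterruptedException {
        AtomicReference<Throwable> seen = new AtomicReference<>();

        // Style 1: an UncaughtExceptionHandler keeps run() free of error
        // handling, but the reader must know the handler exists.
        Thread t1 = new Thread(() -> { throw new IllegalStateException("boom"); });
        t1.setUncaughtExceptionHandler((thread, err) -> seen.set(err));
        t1.start();
        t1.join();
        System.out.println("handler saw: " + seen.get().getMessage());

        // Style 2: a try-catch inside run() is local and self-explanatory.
        Thread t2 = new Thread(() -> {
            try {
                throw new IllegalStateException("boom again");
            } catch (IllegalStateException e) {
                System.out.println("caught in run(): " + e.getMessage());
            }
        });
        t2.start();
        t2.join();
    }
}
```

Both mechanisms observe the same failure; the difference is purely where the reader has to look to find the handling code.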
@nraychaudhuri indeed it doesn't hang, but it doesn't stop the Spark Shell either. Since there's no Mesos driver, it won't get any resources and no job can progress.
@nraychaudhuri will you have time to work on this further, or should someone else pick it up?
@dragos Thanks for looking into it. I will not have time to work on this. Please take it up.
Ok, will do!
@nraychaudhuri can you please close this PR?
… throwing exception This takes over #10729 and makes sure that `spark-shell` fails with a proper error message. There is a slight behavioral change: before this change `spark-shell` would exit, while now the REPL is still there, but `sc` and `sqlContext` are not defined and the error is visible to the user. Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com> Author: Iulian Dragos <jaguarul@gmail.com> Closes #10921 from dragos/pr/10729.
Spark calls System.exit instead of throwing an exception, which makes it harder to test. This fixes it by throwing an exception and logging the message.
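The core idea of the fix can be sketched in a few lines; this is a hypothetical illustration, not the actual Spark change. Calling System.exit on error kills the whole JVM, including any test runner hosting the code, whereas throwing lets the caller (or a unit test) observe and handle the failure.

```java
public class FailFastStyles {
    // Before (hypothetical): untestable, because System.exit would
    // terminate the entire JVM, test runner included.
    static void onErrorWithExit(String message) {
        System.err.println("Mesos error: " + message);
        // System.exit(1);  // would kill the JVM right here
    }

    // After: throw instead, so the caller decides what to do and a unit
    // test can simply assert that the exception is raised.
    static void onErrorWithThrow(String message) {
        System.err.println("Mesos error: " + message);
        throw new IllegalStateException("Error received from Mesos: " + message);
    }

    public static void main(String[] args) {
        try {
            onErrorWithThrow("no role defined");
        } catch (IllegalStateException e) {
            // A test can observe the failure instead of being killed by it.
            System.out.println("test observed: " + e.getMessage());
        }
    }
}
```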