[SPARK-12265][Mesos] Spark calls System.exit inside driver instead of throwing exception #10729

Closed
wants to merge 3 commits

Conversation

@nraychaudhuri (Contributor)

Spark calls System.exit instead of throwing an exception, which makes it harder to test. This fixes it by throwing an exception and logging a message.
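
To make the proposed change concrete, here is a hedged Scala sketch (the trait and method names are hypothetical, not the actual patched class; the real changes are in the diffs below):

```scala
import org.apache.spark.SparkException

// Hypothetical sketch of the behavior change described above,
// not the actual patched Spark class.
trait SchedulerErrorSketch {
  def logError(msg: String): Unit

  // Before: a scheduler error killed the whole JVM, which is hard to test.
  def errorBefore(message: String): Unit = {
    logError("Mesos error: " + message)
    System.exit(1)
  }

  // After: log the error and raise an exception that callers and tests can observe.
  def errorAfter(message: String): Unit = {
    logError("Mesos error: " + message)
    throw new SparkException("Exiting due to error from cluster scheduler: " + message)
  }
}
```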

@AmplabJenkins

Can one of the admins verify this patch?

@JoshRosen (Contributor)

Description in the PR, please? It makes the change easier to review and becomes the commit message.

@nraychaudhuri changed the title from "[SPARK-12265][Mesos] Remove System.exit" to "[SPARK-12265][Mesos] Spark calls System.exit inside driver instead of throwing exception" on Jan 13, 2016
@nraychaudhuri (Contributor, Author)

Updated the commit message & description

@srowen (Member) commented Jan 13, 2016

@nraychaudhuri I think this is a step in a good direction, but now if there is an error, how does the process know to exit? It seems like it just continues. The semantics have changed, unless I'm missing something.

@dragos (Contributor) commented Jan 13, 2016

@srowen good point, it hangs. But as far as I can see, this is due to a count-down latch that's not released in case of error:

"main" #1 prio=5 os_prio=31 tid=0x00007f8e53003000 nid=0x1703 waiting on condition [0x0000700000215000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000007b1b875a8> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
    at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils$class.startScheduler(MesosSchedulerUtils.scala:127)
    - locked <0x00000007b1b6c860> (a org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend)
    at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.startScheduler(CoarseMesosSchedulerBackend.scala:49)
    at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.start(CoarseMesosSchedulerBackend.scala:140)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:513)
    at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1022)
    at $line3.$read$$iwC$$iwC.<init>(<console>:15)
    at $line3.$read$$iwC.<init>(<console>:25)
    at $line3.$read.<init>(<console>:27)
    at $line3.$read$.<init>(<console>:31)
    at $line3.$read$.<clinit>(<console>)
    at $line3.$eval$.<init>(<console>:7)
    at $line3.$eval$.<clinit>(<console>)
    at $line3.$eval.$print(<console>)
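
For readers unfamiliar with this code path: startScheduler blocks on a registration latch, and the hang happens when no code path ever counts it down. A simplified sketch of the pattern (a hypothetical class, not the actual MesosSchedulerUtils code; markRegistered/markErr mirror the helpers discussed below):

```scala
import java.util.concurrent.CountDownLatch

// Simplified sketch of the latch pattern behind the stack trace above;
// not the actual MesosSchedulerUtils implementation.
class RegistrationGate {
  private val registerLatch = new CountDownLatch(1)

  def startScheduler(runDriver: () => Unit): Unit = {
    val driverThread = new Thread("mesos-driver") {
      override def run(): Unit = runDriver()
    }
    driverThread.setDaemon(true)
    driverThread.start()
    // Blocks forever if no code path ever counts the latch down.
    registerLatch.await()
  }

  // Called when the Mesos driver registers successfully.
  def markRegistered(): Unit = registerLatch.countDown()

  // Must be reached on every error path, or await() above hangs.
  def markErr(): Unit = registerLatch.countDown()
}
```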

@nraychaudhuri (Contributor, Author)

@dragos do you see it hanging? I introduced a markErr method to take care of the countdown latch; it is called from the error handler.

@dragos (Contributor) commented Jan 13, 2016

Yes, I see it hanging. I ran this on a Mesos cluster with no roles defined.

bin/spark-shell --conf spark.mesos.role=mu --master mesos://192.168.99.100:5050

@@ -376,6 +376,7 @@ private[spark] class MesosSchedulerBackend(
     inClassLoader() {
       logError("Mesos error: " + message)
       scheduler.error(message)
@dragos (Contributor) commented on this diff:

The problem is this call: it throws an exception, so markErr is never called.
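
In other words, the latch has to be released even when the error propagation itself throws. A sketch of one possible shape of the fix, assuming this PR's markErr helper (the abstract members stand in for the real collaborators):

```scala
// Sketch of one possible fix (assumes this PR's markErr helper):
// release the latch even when the scheduler error call itself throws.
trait ErrorCallbackSketch {
  def logError(msg: String): Unit
  def markErr(): Unit
  def schedulerError(message: String): Unit // stands in for scheduler.error, may throw

  def onMesosError(message: String): Unit = {
    logError("Mesos error: " + message)
    try schedulerError(message)
    finally markErr() // always unblock startScheduler's await()
  }
}
```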

@nraychaudhuri (Contributor, Author)

@dragos good catch. I guess I only tested in coarse-grained mode. I will push a fix. Thx

@nraychaudhuri (Contributor, Author)

@dragos @srowen Took care of the latch countdown. The process should not hang anymore.

@@ -109,20 +109,17 @@ private[mesos] trait MesosSchedulerUtils extends Logging {

    new Thread(Utils.getFormattedClassName(this) + "-mesos-driver") {
      setDaemon(true)
      setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
@srowen (Member) commented on this diff:

Why do we want this vs. simply handling exceptions in run? That's how it was before, at least.

@nraychaudhuri (Contributor, Author) replied:

I wanted to cleanly separate the core logic from the exception-handling code.

@srowen (Member) replied:

I don't think it's clearer, in that you'd have to understand what this handler does. A try-catch seems more recognizable.

@nraychaudhuri (Contributor, Author) replied:

This pattern is used in other places in Spark too, but if it helps I can change it back to try-catch.

@srowen (Member) replied:

I only see it in one place, FsHistoryProvider, where it looks like it's mostly there to support testing.

@nraychaudhuri (Contributor, Author) replied:

Hmm... I thought I saw it in other places too. I noticed setDefaultUncaughtExceptionHandler used as well. Anyway, I will change this to try-catch to keep things simple.
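
For reference, the two alternatives discussed in this thread look roughly like this (a simplified sketch with placeholder runDriver/markErr stubs, not the actual patch):

```scala
object HandlerVsTryCatch {
  def runDriver(): Unit = () // placeholder for the blocking Mesos driver loop
  def markErr(): Unit = ()   // placeholder for releasing the registration latch

  // Option 1: an UncaughtExceptionHandler, keeping error handling
  // out of the thread's core logic.
  def withHandler(): Thread = {
    val t = new Thread("mesos-driver") {
      override def run(): Unit = runDriver()
    }
    t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(thread: Thread, e: Throwable): Unit = markErr()
    })
    t
  }

  // Option 2: a plain try-catch inside run(), as suggested in review.
  def withTryCatch(): Thread = new Thread("mesos-driver") {
    override def run(): Unit =
      try runDriver()
      catch { case e: Exception => markErr() }
  }
}
```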

@dragos (Contributor) commented Jan 20, 2016

@nraychaudhuri indeed it doesn't hang anymore, but it doesn't stop the Spark Shell either. Since there's no Mesos driver, it won't get any resources and no job can progress.

$ bin/spark-shell --master mesos://192.168.99.100:5050 --conf spark.mesos.role=aaa
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
16/01/20 15:47:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I0120 15:47:08.975942 65032192 sched.cpp:164] Version: 0.25.0
I0120 15:47:08.977934 59678720 sched.cpp:262] New master detected at master@192.168.99.100:5050
I0120 15:47:08.978126 59678720 sched.cpp:272] No credentials provided. Attempting to register without authentication
I0120 15:47:08.981176 61825024 sched.cpp:1024] Got error 'Role 'aaa' is not present in the master's --roles'
I0120 15:47:08.981202 61825024 sched.cpp:1805] Asked to abort the driver
E0120 15:47:08.981292 63971328 socket.hpp:174] Shutdown failed on fd=112: Socket is not connected [57]
16/01/20 15:47:08 ERROR CoarseMesosSchedulerBackend: Mesos error: Role 'aaa' is not present in the master's --roles
Exception in thread "Thread-14" org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Role 'aaa' is not present in the master's --roles
    at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:438)
    at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.error(CoarseMesosSchedulerBackend.scala:364)
I0120 15:47:08.982903 61825024 sched.cpp:1805] Asked to abort the driver
I0120 15:47:08.982928 61825024 sched.cpp:1070] Aborting framework ''
16/01/20 15:47:08 WARN CoarseMesosSchedulerBackend: Application ID is not initialized yet.
16/01/20 15:47:08 ERROR CoarseMesosSchedulerBackend: Error starting driver, DRIVER_ABORTED
16/01/20 15:47:09 INFO SparkILoop: Created spark context..
Spark context available as sc (master = mesos://192.168.99.100:5050, app id = spark-application-1453301228917).
16/01/20 15:47:09 INFO SparkILoop: Created sql context..
SQL context available as sqlContext.

scala> 

@dragos (Contributor) commented Jan 25, 2016

@nraychaudhuri will you have time to work on this further, or should someone else pick it up?

@nraychaudhuri (Contributor, Author)

@dragos Thanks for looking into it. I will not have time to work on this. Please take it up.

@dragos (Contributor) commented Jan 25, 2016

Ok, will do!

@dragos (Contributor) commented Jan 25, 2016

@nraychaudhuri can you please close this PR?

asfgit pushed a commit that referenced this pull request Feb 1, 2016
… throwing exception

This takes over #10729 and makes sure that `spark-shell` fails with a proper error message. There is a slight behavioral change: before this change `spark-shell` would exit, while now the REPL is still there, but `sc` and `sqlContext` are not defined and the error is visible to the user.

Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com>
Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10921 from dragos/pr/10729.