[SPARK-12265][Mesos] Spark calls System.exit inside driver instead of throwing exception #10729

Closed
wants to merge 3 commits

Conversation

@nraychaudhuri (Contributor)

Spark calls System.exit instead of throwing an exception, which makes it harder to test. This fixes it by throwing an exception and logging a message.
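
To make the proposed change concrete, here is a hedged Scala sketch (the trait and method names are hypothetical, not the actual patched class; the real changes are in the diffs below):

```scala
import org.apache.spark.SparkException

// Hypothetical sketch of the behavior change described above,
// not the actual patched Spark class.
trait SchedulerErrorSketch {
  def logError(msg: String): Unit

  // Before: a scheduler error killed the whole JVM, which is hard to test.
  def errorBefore(message: String): Unit = {
    logError("Mesos error: " + message)
    System.exit(1)
  }

  // After: log the error and raise an exception that callers and tests can observe.
  def errorAfter(message: String): Unit = {
    logError("Mesos error: " + message)
    throw new SparkException("Exiting due to error from cluster scheduler: " + message)
  }
}
```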

@AmplabJenkins

Can one of the admins verify this patch?

@JoshRosen (Contributor)

Description in the PR, please? It makes the change easier to review and becomes the commit message.

@nraychaudhuri changed the title from "[SPARK-12265][Mesos] Remove System.exit" to "[SPARK-12265][Mesos] Spark calls System.exit inside driver instead of throwing exception" on Jan 13, 2016
@nraychaudhuri (Contributor, Author)

Updated the commit message & description

@srowen (Member) commented Jan 13, 2016

@nraychaudhuri I think this is a step in a good direction, but now if there is an error, how does the process know to exit? It seems like it just continues. The semantics have changed, unless I'm missing something.

@dragos (Contributor) commented Jan 13, 2016

@srowen good point, it hangs. But as far as I can see, this is due to a count-down latch that's not released in case of error:

"main" #1 prio=5 os_prio=31 tid=0x00007f8e53003000 nid=0x1703 waiting on condition [0x0000700000215000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000007b1b875a8> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
    at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils$class.startScheduler(MesosSchedulerUtils.scala:127)
    - locked <0x00000007b1b6c860> (a org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend)
    at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.startScheduler(CoarseMesosSchedulerBackend.scala:49)
    at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.start(CoarseMesosSchedulerBackend.scala:140)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:513)
    at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1022)
    at $line3.$read$$iwC$$iwC.<init>(<console>:15)
    at $line3.$read$$iwC.<init>(<console>:25)
    at $line3.$read.<init>(<console>:27)
    at $line3.$read$.<init>(<console>:31)
    at $line3.$read$.<clinit>(<console>)
    at $line3.$eval$.<init>(<console>:7)
    at $line3.$eval$.<clinit>(<console>)
    at $line3.$eval.$print(<console>)
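
For readers unfamiliar with this code path: startScheduler blocks on a registration latch, and the hang happens when no code path ever counts it down. A simplified sketch of the pattern (a hypothetical class, not the actual MesosSchedulerUtils code; markRegistered/markErr mirror the helpers discussed below):

```scala
import java.util.concurrent.CountDownLatch

// Simplified sketch of the latch pattern behind the stack trace above;
// not the actual MesosSchedulerUtils implementation.
class RegistrationGate {
  private val registerLatch = new CountDownLatch(1)

  def startScheduler(runDriver: () => Unit): Unit = {
    val driverThread = new Thread("mesos-driver") {
      override def run(): Unit = runDriver()
    }
    driverThread.setDaemon(true)
    driverThread.start()
    // Blocks forever if no code path ever counts the latch down.
    registerLatch.await()
  }

  // Called when the Mesos driver registers successfully.
  def markRegistered(): Unit = registerLatch.countDown()

  // Must be reached on every error path, or await() above hangs.
  def markErr(): Unit = registerLatch.countDown()
}
```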

@nraychaudhuri (Contributor, Author)

@dragos do you see it hanging? I introduced a markErr method to take care of the countdown latch; it is called from the error handler.

@dragos (Contributor) commented Jan 13, 2016

Yes, I see it hanging. I ran this on a Mesos cluster with no roles defined.

bin/spark-shell --conf spark.mesos.role=mu --master mesos://192.168.99.100:5050

@@ -376,6 +376,7 @@ private[spark] class MesosSchedulerBackend(
     inClassLoader() {
       logError("Mesos error: " + message)
       scheduler.error(message)
@dragos (Contributor) commented on this diff:

The problem is this call: it throws an exception, so markErr is never called.
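
In other words, the latch has to be released even when the error propagation itself throws. A sketch of one possible shape of the fix, assuming this PR's markErr helper (the abstract members stand in for the real collaborators):

```scala
// Sketch of one possible fix (assumes this PR's markErr helper):
// release the latch even when the scheduler error call itself throws.
trait ErrorCallbackSketch {
  def logError(msg: String): Unit
  def markErr(): Unit
  def schedulerError(message: String): Unit // stands in for scheduler.error, may throw

  def onMesosError(message: String): Unit = {
    logError("Mesos error: " + message)
    try schedulerError(message)
    finally markErr() // always unblock startScheduler's await()
  }
}
```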

@nraychaudhuri (Contributor, Author)

@dragos good catch. I guess I only tested in coarse-grained mode. I will push a fix. Thx

@nraychaudhuri (Contributor, Author)

@dragos @srowen Took care of the latch countdown. The process should not hang anymore.

@@ -109,20 +109,17 @@ private[mesos] trait MesosSchedulerUtils extends Logging {

    new Thread(Utils.getFormattedClassName(this) + "-mesos-driver") {
      setDaemon(true)
      setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
@srowen (Member) commented on this diff:

Why do we want this vs. simply handling exceptions in run? That's how it was before, at least.

@nraychaudhuri (Contributor, Author) replied:

I wanted to cleanly separate the core logic from the exception-handling code.

@srowen (Member) replied:

I don't think it's clearer, in that you'd have to understand what this handler does. A try-catch seems more recognizable.

@nraychaudhuri (Contributor, Author) replied:

This pattern is used in other places in Spark too, but if it helps I can change it back to try-catch.

@srowen (Member) replied:

I only see it in one place, FsHistoryProvider, where it looks like it's mostly there to support testing.

@nraychaudhuri (Contributor, Author) replied:

Hmm... I thought I saw it in other places too. I noticed setDefaultUncaughtExceptionHandler used as well. Anyway, I will change this to try-catch to keep things simple.
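
For reference, the two alternatives discussed in this thread look roughly like this (a simplified sketch with placeholder runDriver/markErr stubs, not the actual patch):

```scala
object HandlerVsTryCatch {
  def runDriver(): Unit = () // placeholder for the blocking Mesos driver loop
  def markErr(): Unit = ()   // placeholder for releasing the registration latch

  // Option 1: an UncaughtExceptionHandler, keeping error handling
  // out of the thread's core logic.
  def withHandler(): Thread = {
    val t = new Thread("mesos-driver") {
      override def run(): Unit = runDriver()
    }
    t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(thread: Thread, e: Throwable): Unit = markErr()
    })
    t
  }

  // Option 2: a plain try-catch inside run(), as suggested in review.
  def withTryCatch(): Thread = new Thread("mesos-driver") {
    override def run(): Unit =
      try runDriver()
      catch { case e: Exception => markErr() }
  }
}
```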

@dragos (Contributor) commented Jan 20, 2016

@nraychaudhuri indeed it doesn't hang anymore, but it doesn't stop the Spark Shell either. Since there's no Mesos driver, it won't get any resources and no job can progress.

$ bin/spark-shell --master mesos://192.168.99.100:5050 --conf spark.mesos.role=aaa
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
16/01/20 15:47:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I0120 15:47:08.975942 65032192 sched.cpp:164] Version: 0.25.0
I0120 15:47:08.977934 59678720 sched.cpp:262] New master detected at master@192.168.99.100:5050
I0120 15:47:08.978126 59678720 sched.cpp:272] No credentials provided. Attempting to register without authentication
I0120 15:47:08.981176 61825024 sched.cpp:1024] Got error 'Role 'aaa' is not present in the master's --roles'
I0120 15:47:08.981202 61825024 sched.cpp:1805] Asked to abort the driver
E0120 15:47:08.981292 63971328 socket.hpp:174] Shutdown failed on fd=112: Socket is not connected [57]
16/01/20 15:47:08 ERROR CoarseMesosSchedulerBackend: Mesos error: Role 'aaa' is not present in the master's --roles
Exception in thread "Thread-14" org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Role 'aaa' is not present in the master's --roles
    at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:438)
    at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.error(CoarseMesosSchedulerBackend.scala:364)
I0120 15:47:08.982903 61825024 sched.cpp:1805] Asked to abort the driver
I0120 15:47:08.982928 61825024 sched.cpp:1070] Aborting framework ''
16/01/20 15:47:08 WARN CoarseMesosSchedulerBackend: Application ID is not initialized yet.
16/01/20 15:47:08 ERROR CoarseMesosSchedulerBackend: Error starting driver, DRIVER_ABORTED
16/01/20 15:47:09 INFO SparkILoop: Created spark context..
Spark context available as sc (master = mesos://192.168.99.100:5050, app id = spark-application-1453301228917).
16/01/20 15:47:09 INFO SparkILoop: Created sql context..
SQL context available as sqlContext.

scala> 

@dragos (Contributor) commented Jan 25, 2016

@nraychaudhuri will you have time to work on this further, or should someone else pick it up?

@nraychaudhuri (Contributor, Author)

@dragos Thanks for looking into it. I will not have time to work on this. Please take it up.

@dragos (Contributor) commented Jan 25, 2016

Ok, will do!

@dragos (Contributor) commented Jan 25, 2016

@nraychaudhuri can you please close this PR?

asfgit pushed a commit that referenced this pull request Feb 1, 2016
… throwing exception

This takes over #10729 and makes sure that `spark-shell` fails with a proper error message. There is a slight behavioral change: before this change `spark-shell` would exit, while now the REPL is still there, but `sc` and `sqlContext` are not defined and the error is visible to the user.

Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com>
Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10921 from dragos/pr/10729.