[SPARK-12330] [MESOS] Fix mesos coarse mode cleanup #10319

Closed
wants to merge 7 commits into apache:master from drcrallen:SPARK-12330

Conversation

drcrallen
Contributor

In the current implementation, the Mesos coarse scheduler does not wait for the Mesos tasks to complete before ending the driver. This causes a race where the task has to finish cleaning up before the Mesos driver terminates it with a SIGINT (and a SIGKILL after 3 seconds if the SIGINT doesn't work).

This PR makes the Mesos coarse scheduler wait for the Mesos tasks to finish, with a timeout defined by `spark.mesos.coarse.shutdown.ms`.

This PR also fixes a regression caused by [SPARK-10987], whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0).

With this patch the Mesos coarse scheduler terminates properly, the executors clean up, and the tasks are reported as FINISHED in the Mesos console (as opposed to KILLED in < 1.6 or FAILED in 1.6 and later).
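Roughly, the scheduler-side wait looks like the following sketch (illustrative only, not the merged diff; the names slaveIdsWithExecutors, shutdownTimeoutMS, and mesosDriver follow the review discussion further down):

override def stop(): Unit = {
  // Let the parent backend tell the executors to shut down first.
  super.stop()
  // Poll until every executor has reported a terminal status update (which
  // drains slaveIdsWithExecutors) or the configured timeout expires.
  val deadlineNs = System.nanoTime() + shutdownTimeoutMS * 1000000L
  while (slaveIdsWithExecutors.nonEmpty && System.nanoTime() < deadlineNs) {
    Thread.sleep(100)
  }
  // Only now stop the Mesos driver, so finished executors report FINISHED
  // instead of being killed mid-cleanup.
  mesosDriver.stop()
}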

@andrewor14
Contributor

ok to test

@andrewor14
Contributor

@drcrallen mind changing the [CORE] in the title to [MESOS]?

@@ -45,6 +46,7 @@ private[spark] class CoarseGrainedExecutorBackend(
env: SparkEnv)
extends ThreadSafeRpcEndpoint with ExecutorBackend with Logging {

val stopping = new AtomicBoolean(false)
Contributor

this can be private

Contributor Author

Fixed
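
For context, the new stopping flag is what addresses the status-1 regression described in the PR summary: the backend records that a shutdown was commanded, and the disconnection handler consults it so a driver disconnect that races with a normal shutdown does not exit with status 1. A rough sketch of that pattern (illustrative; the actual diff is not reproduced here):

private[this] val stopping = new AtomicBoolean(false)

override def receive: PartialFunction[Any, Unit] = {
  case StopExecutor =>
    stopping.set(true)
    // ... existing shutdown handling (other messages elided) ...
}

override def onDisconnected(remoteAddress: RpcAddress): Unit = {
  if (stopping.get) {
    // Expected during a commanded shutdown; exit normally instead of with status 1.
    logInfo(s"Driver from $remoteAddress disconnected during shutdown")
  } else {
    logError(s"Driver $remoteAddress disassociated! Shutting down.")
    System.exit(1)
  }
}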

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47761 has finished for PR 10319 at commit 5e1b13d.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -60,6 +63,11 @@ private[spark] class CoarseMesosSchedulerBackend(
// Maximum number of cores to acquire (TODO: we'll need more flexible controls here)
val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt

val shutdownTimeoutMS = conf.getInt("spark.mesos.coarse.shutdown.ms", 10000)
Contributor

private val

Contributor Author

Fixed

@tnachen
Contributor

tnachen commented Dec 16, 2015

So IIUC, stop is only invoked when an exception occurred or the shutdown hook is invoked; in both cases it's a task that didn't really finish and the user/system wants it to be killed. So I'm not sure that ending with TASK_KILLED in these cases is a bad idea.
The main reason I'm proposing not to change this is that I don't think sleeping in the stop method with a while loop is a great idea, as it can clearly block other things going on in the system.

drcrallen changed the title from [SPARK-12330] [CORE] Fix mesos coarse mode cleanup to [SPARK-12330] [MESOS] Fix mesos coarse mode cleanup on Dec 18, 2015
@drcrallen
Contributor Author

@tnachen For some reason the shutdown hooks are not finishing properly if the executor receives a SIGTERM during shutdown. See the logs in https://issues.apache.org/jira/browse/SPARK-12330, where `INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown` is followed by `ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM`.

Fixing that may be a better root-cause fix, but I haven't looked into the Hadoop shutdown hooks enough to know where the issue could be.

@drcrallen
Contributor Author

To clarify: that should solve the block cleanup issue. It will not solve the executors reporting an incorrect status.

If the executors are to exit "cleanly" (i.e. FINISHED instead of KILLED or FAILED), there needs to be a wait of some kind between Spark sending a shutdown and shutting down the Mesos driver.

@drcrallen
Contributor Author

I may have found the root cause of the failure to clean up blocks. Testing.

@drcrallen
Contributor Author

Nope: SparkEnv.stop() does not block on multiple calls to make sure stop has completed at least once. But even when ensuring that happens, the shutdown process still does not end cleanly.

@drcrallen
Contributor Author

After more thorough testing, it seems that there is still a race for getting a FINISHED vs KILLED final status with this patch.

Here's an example log where the task was reported as KILLED:

STDERR

15/12/18 19:01:34 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
15/12/18 19:01:34 INFO MemoryStore: MemoryStore cleared
15/12/18 19:01:34 INFO BlockManager: BlockManager stopped
15/12/18 19:01:34 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/12/18 19:01:34 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/12/18 19:01:34 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/12/18 19:01:34 INFO ShutdownHookManager: Shutdown hook called
15/12/18 19:01:34 INFO ShutdownHookManager: Deleting directory /mesos/slaves/355857ee-1069-480d-baa8-e89427bd5840-S13/frameworks/355857ee-1069-480d-baa8-e89427bd5840-0079/executors/3/runs/b1cb5f55-43f3-4d1d-8ada-18fd29319868/tmp/java/spark-76c38206-57df-46aa-bc2a-4ad143d65959
I1218 19:01:44.844650 94385 exec.cpp:381] Executor asked to shutdown

STDOUT

2015-12-18T19:01:29.228+0000: Total time for which application threads were stopped: 0.0000720 seconds, Stopping threads took: 0.0000280 seconds
2015-12-18T19:01:34.418+0000: Total time for which application threads were stopped: 0.0202110 seconds, Stopping threads took: 0.0170400 seconds
2015-12-18T19:01:34.895+0000: Total time for which application threads were stopped: 0.0003310 seconds, Stopping threads took: 0.0002190 seconds
2015-12-18T19:01:34.896+0000: Total time for which application threads were stopped: 0.0001710 seconds, Stopping threads took: 0.0000940 seconds
2015-12-18T19:01:34.907+0000: Total time for which application threads were stopped: 0.0002100 seconds, Stopping threads took: 0.0001420 seconds
2015-12-18T19:01:34.907+0000: Total time for which application threads were stopped: 0.0002470 seconds, Stopping threads took: 0.0001740 seconds
2015-12-18T19:01:34.945+0000: Total time for which application threads were stopped: 0.0004670 seconds, Stopping threads took: 0.0000420 seconds
2015-12-18T19:01:34.961+0000: Total time for which application threads were stopped: 0.0001110 seconds, Stopping threads took: 0.0000720 seconds
Heap
 PSYoungGen      total 39748608K, used 18046635K [0x00007f7749000000, 0x00007f8149000000, 0x00007f8149000000)
  eden space 37492224K, 46% used [0x00007f7749000000,0x00007f7b7afe63e8,0x00007f8039580000)
  from space 2256384K, 19% used [0x00007f80bf480000,0x00007f80dac44b78,0x00007f8149000000)
  to   space 2194432K, 0% used [0x00007f8039580000,0x00007f8039580000,0x00007f80bf480000)
 ParOldGen       total 83886080K, used 70680448K [0x00007f6349000000, 0x00007f7749000000, 0x00007f7749000000)
  object space 83886080K, 84% used [0x00007f6349000000,0x00007f7422fe0010,0x00007f7749000000)
 PSPermGen       total 106496K, used 70405K [0x00007f6339000000, 0x00007f633f800000, 0x00007f6349000000)
  object space 106496K, 66% used [0x00007f6339000000,0x00007f633d4c16e8,0x00007f633f800000)
Shutting down
Sending SIGTERM to process tree at pid 94404
Killing the following process trees:
[ 
-+- 94404 sh -c  "/mnt/tmp/spark/bin/spark-class" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@REDACTED/user/CoarseGrainedScheduler --executor-id 355857ee-1069-480d-baa8-e89427bd5840-S13 --hostname REDACTED --cores 13 --app-id 355857ee-1069-480d-baa8-e89427bd5840-0079 
 \--- 94405 ()
]
Command terminated with signal Terminated (pid: 94404)

@drcrallen
Contributor Author

You can see in the log above that the final heap information was printed, and only THEN was a SIGTERM processed.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #48013 has finished for PR 10319 at commit 97de2f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@drcrallen
Contributor Author

The block manager is cleaning up as expected with this patch.

@drcrallen
Contributor Author

@tnachen Do you have any suggestions on ways to wait for executors to report as being cleaned up before calling mesosDriver.stop()?

@drcrallen
Contributor Author

@tnachen Ping again regarding question in #10319 (comment)

@drcrallen
Contributor Author

@tnachen I also haven't done a good job of making clear earlier in this PR that the block manager does not properly clean up without this patch. See https://issues.apache.org/jira/browse/SPARK-12330 for more info.

@andrewor14
Contributor

also cc @dragos

// Wait for finish
val stopwatch = new Stopwatch()
stopwatch.start()
// Eventually consistent slaveIdsWithExecutors
Contributor

Not sure this comment says much. Can you perhaps add a comment that slaveIdsWithExecutors will be empty once we receive task termination status updates from all tasks? (One possible wording is sketched below.)
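
One possible wording, applied to the snippet above (illustrative only; the comment the author ultimately added appears in a later diff):

// Wait for the executors to finish. statusUpdate() removes entries from
// slaveIdsWithExecutors as terminal task status updates arrive, so an empty
// map means every executor has reported done (or the timeout below expires).
val stopwatch = new Stopwatch()
stopwatch.start()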

@tnachen
Contributor

tnachen commented Jan 14, 2016

@drcrallen Regarding waiting suggestions: the best way from the scheduler side is to wait until all tasks have terminated when you want to shut down. If this stop is at the very end of the shutdown call, another way is to use a shutdownLock with a wait on a timeout here, and then call notify from the TaskStatus update handler once the map is empty (as sketched below). If the wait exits before the map is empty, we know the timeout hit first. I think we just want to make sure nothing in the scheduler calls stop by itself, so we don't deadlock.
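
A sketch of that lock-and-notify variant, for illustration (the shutdownLock object is hypothetical and not part of this PR):

private val shutdownLock = new Object

override def stop(): Unit = {
  super.stop()
  shutdownLock.synchronized {
    val deadline = System.currentTimeMillis() + shutdownTimeoutMS
    // Loop to guard against spurious wakeups; exit once the executor map
    // drains or the timeout elapses.
    while (slaveIdsWithExecutors.nonEmpty && System.currentTimeMillis() < deadline) {
      shutdownLock.wait(math.max(1L, deadline - System.currentTimeMillis()))
    }
  }
  mesosDriver.stop()
}

// In the TaskStatus update handler, after removing a finished executor:
// shutdownLock.synchronized {
//   if (slaveIdsWithExecutors.isEmpty) shutdownLock.notifyAll()
// }

The trade-off, as discussed below, is that understanding stop() then also requires reading the status-update callback.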

@dragos
Contributor

dragos commented Jan 15, 2016

The approach makes sense to me, I will give it a go and report back

@dragos
Contributor

dragos commented Jan 15, 2016

Things work as advertised in this PR! 👍

@dragos
Contributor

dragos commented Jan 15, 2016

Regarding the question on waiting strategies, I don't think polling is that bad. It's localized and simple to reason about, and I suppose stop should be synchronous. @tnachen's suggestion is fine, but unless I'm missing something it has no significant advantage, while introducing a potential deadlock and more complicated reasoning (notify is called from a callback, so looking at stop is no longer enough to understand what's going on).

@dragos
Contributor

dragos commented Jan 15, 2016

One more thing: please add the new setting to the Mesos docs.

@tnachen
Contributor

tnachen commented Jan 15, 2016

I think that in either case just looking at stop isn't enough, since we rely on the callback to empty the executors map for us to exit the loop before the timeout; either way, I thought the explicit approach just avoids busy-waiting for no reason. I'm fine with a while loop here, but I might increase the sleep time: since we are waiting on messages to be sent back from the executors to the driver, I would think checking every 100ms is probably enough.

@dragos
Contributor

dragos commented Jan 15, 2016

You're right about the callback for emptying the map, and also about the sleep interval.

@dragos
Contributor

dragos commented Jan 18, 2016

Let's try to move this forward. Looks like there are still a few things to do:

  • rebase on master
  • increase the sleep interval
  • document the new setting

@drcrallen
Contributor Author

Thanks for the feedback. I'll get to the fixes here very shortly.

@@ -60,6 +62,12 @@ private[spark] class CoarseMesosSchedulerBackend(
// Maximum number of cores to acquire (TODO: we'll need more flexible controls here)
val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt

private[this] val shutdownTimeoutMS = conf.getInt("spark.mesos.coarse.shutdown.ms", 10000)
Contributor

this needs to be

conf.getTimeAsMillis("spark.mesos.coarse.shutdownTimeout", "10s")

we have a consistent time string format for accepting similar configs.
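
For reference, the version that ended up in the merged diff (shown further down) uses SparkConf's getTimeAsMs helper; a minimal illustration of the time-string pattern:

import org.apache.spark.SparkConf

// Illustrative only: SparkConf time strings accept values such as "500ms",
// "10s", or "2min". The key name mirrors the merged diff below; it was later
// renamed in a follow-up hotfix.
val conf = new SparkConf()
val shutdownTimeoutMS: Long = conf.getTimeAsMs("spark.mesos.coarse.shutdown.ms", "10s")  // 10000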

@andrewor14
Contributor

@drcrallen This looks great, thanks for fixing the issue. I left a few minor comments. Once you fix them I'll merge this.

@drcrallen
Contributor Author

@andrewor14 Addressed comments except for #10319 (comment)

@drcrallen
Contributor Author

The failure was spurious (couldn't fetch from git). Needs a retest.

@drcrallen
Contributor Author

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50593 has finished for PR 10319 at commit 75887c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50616 has finished for PR 10319 at commit 75887c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50667 has finished for PR 10319 at commit 75887c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dragos
Contributor

dragos commented Feb 4, 2016

@drcrallen probably my comment and @andrewor14's reply were buried by the GitHub interface. The consensus is to remove the docs about the new setting. Sorry for the trouble.

@drcrallen
Contributor Author

@dragos removed

@andrewor14
Contributor

LGTM, merged into master. The last commit didn't change any code, so this can't fail tests.

asfgit closed this in 2eaeafe on Feb 4, 2016
@@ -60,6 +62,12 @@ private[spark] class CoarseMesosSchedulerBackend(
// Maximum number of cores to acquire (TODO: we'll need more flexible controls here)
val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt

private[this] val shutdownTimeoutMS = conf.getTimeAsMs("spark.mesos.coarse.shutdown.ms", "10s")
Contributor

Can you please submit a new PR to add this to the docs?

Contributor Author

Contributor
ah, nvm

Contributor

By the way, I just realized this config is not properly named. Unfortunately I did not catch this during code review, and I just pushed a hotfix to master to correct it: c756bda

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50755 has finished for PR 10319 at commit 1181a05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
// Wait for executors to report done, or else mesosDriver.stop() will forcefully kill them.
// See SPARK-12330
val stopwatch = new Stopwatch()
Contributor

This Stopwatch constructor was deprecated in newer versions of Guava (google/guava@fd0cbc2). In order to work around this issue, I'd like to remove this use of Stopwatch since we don't use it anywhere else and it doesn't seem to be buying us a whole lot in the way that it's used here.
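
A minimal sketch of a Stopwatch-free version of this wait, measuring elapsed time with System.nanoTime directly and reusing the names from the surrounding method (illustrative; the actual cleanup in master may differ):

import java.util.concurrent.TimeUnit

val startNanos = System.nanoTime()
def elapsedMs: Long = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos)

// Same polling loop as before, without the deprecated Guava constructor.
while (slaveIdsWithExecutors.nonEmpty && elapsedMs < shutdownTimeoutMS) {
  Thread.sleep(100)
}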

Contributor Author

Sounds good. Sorry, didn't see this note until now, and it looks like this was already fixed in master.

sttts pushed a commit to mesosphere/spark that referenced this pull request Mar 15, 2016
Author: Charles Allen <charles@allen-net.com>

Closes apache#10319 from drcrallen/SPARK-12330.
mgummelt pushed a commit to mesosphere/spark that referenced this pull request Aug 2, 2016
mgummelt pushed a commit to mesosphere/spark that referenced this pull request Mar 7, 2017