[SPARK-12330] [MESOS] Fix mesos coarse mode cleanup #10319

Closed
wants to merge 7 commits into apache:master from drcrallen:SPARK-12330

Conversation

drcrallen
Contributor

In the current implementation, the Mesos coarse scheduler does not wait for the Mesos tasks to complete before ending the driver. This causes a race where the task has to finish cleaning up before the Mesos driver terminates it with a SIGINT (and a SIGKILL after 3 seconds if the SIGINT doesn't work).

This PR makes the Mesos coarse scheduler wait for the Mesos tasks to finish, with a timeout defined by `spark.mesos.coarse.shutdown.ms`.

This PR also fixes a regression caused by [SPARK-10987], whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0).

With this patch the Mesos coarse scheduler terminates properly, the executors clean up, and the tasks are reported as FINISHED in the Mesos console (as opposed to KILLED in < 1.6 or FAILED in 1.6 and later).
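Roughly, the scheduler-side wait looks like the following sketch (illustrative only, not the merged diff; the names slaveIdsWithExecutors, shutdownTimeoutMS, and mesosDriver follow the review discussion further down):

override def stop(): Unit = {
  // Let the parent backend tell the executors to shut down first.
  super.stop()
  // Poll until every executor has reported a terminal status update (which
  // drains slaveIdsWithExecutors) or the configured timeout expires.
  val deadlineNs = System.nanoTime() + shutdownTimeoutMS * 1000000L
  while (slaveIdsWithExecutors.nonEmpty && System.nanoTime() < deadlineNs) {
    Thread.sleep(100)
  }
  // Only now stop the Mesos driver, so finished executors report FINISHED
  // instead of being killed mid-cleanup.
  mesosDriver.stop()
}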

@andrewor14
Contributor

ok to test

@andrewor14
Contributor

@drcrallen mind changing the [CORE] in the title to [MESOS]?

@@ -45,6 +46,7 @@ private[spark] class CoarseGrainedExecutorBackend(
env: SparkEnv)
extends ThreadSafeRpcEndpoint with ExecutorBackend with Logging {

val stopping = new AtomicBoolean(false)
Contributor

this can be private

Contributor Author

Fixed
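
For context, the new stopping flag is what addresses the status-1 regression described in the PR summary: the backend records that a shutdown was commanded, and the disconnection handler consults it so a driver disconnect that races with a normal shutdown does not exit with status 1. A rough sketch of that pattern (illustrative; the actual diff is not reproduced here):

private[this] val stopping = new AtomicBoolean(false)

override def receive: PartialFunction[Any, Unit] = {
  case StopExecutor =>
    stopping.set(true)
    // ... existing shutdown handling (other messages elided) ...
}

override def onDisconnected(remoteAddress: RpcAddress): Unit = {
  if (stopping.get) {
    // Expected during a commanded shutdown; exit normally instead of with status 1.
    logInfo(s"Driver from $remoteAddress disconnected during shutdown")
  } else {
    logError(s"Driver $remoteAddress disassociated! Shutting down.")
    System.exit(1)
  }
}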

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47761 has finished for PR 10319 at commit 5e1b13d.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -60,6 +63,11 @@ private[spark] class CoarseMesosSchedulerBackend(
// Maximum number of cores to acquire (TODO: we'll need more flexible controls here)
val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt

val shutdownTimeoutMS = conf.getInt("spark.mesos.coarse.shutdown.ms", 10000)
Contributor

private val

Contributor Author

Fixed

@tnachen
Contributor

tnachen commented Dec 16, 2015

So IIUC, stop is only invoked when an exception occurred or the shutdown hook is invoked; in both cases it's a task that didn't really finish and the user/system wants it to be killed. So I'm not sure that ending with TASK_KILLED in these cases is a bad idea.
The main reason I'm proposing not to change this is that I don't think sleeping in the stop method with a while loop is a great idea, as it can clearly block other things going on in the system.

drcrallen changed the title from [SPARK-12330] [CORE] Fix mesos coarse mode cleanup to [SPARK-12330] [MESOS] Fix mesos coarse mode cleanup on Dec 18, 2015
@drcrallen
Contributor Author

@tnachen For some reason the shutdown hooks are not finishing properly if the executor receives a SIGTERM during shutdown. See the logs in https://issues.apache.org/jira/browse/SPARK-12330, where `INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown` is followed by `ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM`.

Fixing that may be a better root-cause fix, but I haven't looked into the Hadoop shutdown hooks enough to know where the issue could be.

@drcrallen
Contributor Author

To clarify: that should solve the block cleanup issue. It will not solve the executors reporting an incorrect status.

If the executors are to exit "cleanly" (i.e. FINISHED instead of KILLED or FAILED), there needs to be a wait of some kind between Spark sending a shutdown and shutting down the Mesos driver.

@drcrallen
Contributor Author

I may have found the root cause of the failure to clean up blocks. Testing.

@drcrallen
Contributor Author

Nope: SparkEnv.stop() does not block on multiple calls to make sure stop has completed at least once. But even when ensuring that happens, the shutdown process still does not end cleanly.

@drcrallen
Contributor Author

After more thorough testing, it seems that there is still a race for getting a FINISHED vs KILLED final status with this patch.

Here's an example log where the task was reported as KILLED:

STDERR

15/12/18 19:01:34 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
15/12/18 19:01:34 INFO MemoryStore: MemoryStore cleared
15/12/18 19:01:34 INFO BlockManager: BlockManager stopped
15/12/18 19:01:34 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/12/18 19:01:34 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/12/18 19:01:34 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/12/18 19:01:34 INFO ShutdownHookManager: Shutdown hook called
15/12/18 19:01:34 INFO ShutdownHookManager: Deleting directory /mesos/slaves/355857ee-1069-480d-baa8-e89427bd5840-S13/frameworks/355857ee-1069-480d-baa8-e89427bd5840-0079/executors/3/runs/b1cb5f55-43f3-4d1d-8ada-18fd29319868/tmp/java/spark-76c38206-57df-46aa-bc2a-4ad143d65959
I1218 19:01:44.844650 94385 exec.cpp:381] Executor asked to shutdown

STDOUT

2015-12-18T19:01:29.228+0000: Total time for which application threads were stopped: 0.0000720 seconds, Stopping threads took: 0.0000280 seconds
2015-12-18T19:01:34.418+0000: Total time for which application threads were stopped: 0.0202110 seconds, Stopping threads took: 0.0170400 seconds
2015-12-18T19:01:34.895+0000: Total time for which application threads were stopped: 0.0003310 seconds, Stopping threads took: 0.0002190 seconds
2015-12-18T19:01:34.896+0000: Total time for which application threads were stopped: 0.0001710 seconds, Stopping threads took: 0.0000940 seconds
2015-12-18T19:01:34.907+0000: Total time for which application threads were stopped: 0.0002100 seconds, Stopping threads took: 0.0001420 seconds
2015-12-18T19:01:34.907+0000: Total time for which application threads were stopped: 0.0002470 seconds, Stopping threads took: 0.0001740 seconds
2015-12-18T19:01:34.945+0000: Total time for which application threads were stopped: 0.0004670 seconds, Stopping threads took: 0.0000420 seconds
2015-12-18T19:01:34.961+0000: Total time for which application threads were stopped: 0.0001110 seconds, Stopping threads took: 0.0000720 seconds
Heap
 PSYoungGen      total 39748608K, used 18046635K [0x00007f7749000000, 0x00007f8149000000, 0x00007f8149000000)
  eden space 37492224K, 46% used [0x00007f7749000000,0x00007f7b7afe63e8,0x00007f8039580000)
  from space 2256384K, 19% used [0x00007f80bf480000,0x00007f80dac44b78,0x00007f8149000000)
  to   space 2194432K, 0% used [0x00007f8039580000,0x00007f8039580000,0x00007f80bf480000)
 ParOldGen       total 83886080K, used 70680448K [0x00007f6349000000, 0x00007f7749000000, 0x00007f7749000000)
  object space 83886080K, 84% used [0x00007f6349000000,0x00007f7422fe0010,0x00007f7749000000)
 PSPermGen       total 106496K, used 70405K [0x00007f6339000000, 0x00007f633f800000, 0x00007f6349000000)
  object space 106496K, 66% used [0x00007f6339000000,0x00007f633d4c16e8,0x00007f633f800000)
Shutting down
Sending SIGTERM to process tree at pid 94404
Killing the following process trees:
[ 
-+- 94404 sh -c  "/mnt/tmp/spark/bin/spark-class" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@REDACTED/user/CoarseGrainedScheduler --executor-id 355857ee-1069-480d-baa8-e89427bd5840-S13 --hostname REDACTED --cores 13 --app-id 355857ee-1069-480d-baa8-e89427bd5840-0079 
 \--- 94405 ()
]
Command terminated with signal Terminated (pid: 94404)

@drcrallen
Contributor Author

You can see in the log above that the final heap information was printed, and only THEN was a SIGTERM processed.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #48013 has finished for PR 10319 at commit 97de2f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@drcrallen
Contributor Author

The block manager is cleaning up as expected with this patch.

@drcrallen
Contributor Author

@tnachen Do you have any suggestions on ways to wait for executors to report as being cleaned up before calling mesosDriver.stop()?

@drcrallen
Contributor Author

@tnachen Ping again regarding question in #10319 (comment)

@drcrallen
Contributor Author

@tnachen I also haven't done a good job of making clear earlier in this PR that the block manager does not properly clean up without this patch. See https://issues.apache.org/jira/browse/SPARK-12330 for more info.

@andrewor14
Contributor

also cc @dragos

// Wait for finish
val stopwatch = new Stopwatch()
stopwatch.start()
// Eventually consistent slaveIdsWithExecutors
Contributor

Not sure this comment says much. Can you perhaps add a comment that slaveIdsWithExecutors will be empty once we receive task termination status updates from all tasks? (One possible wording is sketched below.)
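
One possible wording, applied to the snippet above (illustrative only; the comment the author ultimately added appears in a later diff):

// Wait for the executors to finish. statusUpdate() removes entries from
// slaveIdsWithExecutors as terminal task status updates arrive, so an empty
// map means every executor has reported done (or the timeout below expires).
val stopwatch = new Stopwatch()
stopwatch.start()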

@tnachen
Contributor

tnachen commented Jan 14, 2016

@drcrallen Regarding waiting suggestions: the best way from the scheduler side is to wait until all tasks have terminated when you want to shut down. If this stop is at the very end of the shutdown call, another way is to use a shutdownLock with a wait on a timeout here, and then call notify from the TaskStatus update handler once the map is empty (as sketched below). If the wait exits before the map is empty, we know the timeout hit first. I think we just want to make sure nothing in the scheduler calls stop by itself, so we don't deadlock.
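
A sketch of that lock-and-notify variant, for illustration (the shutdownLock object is hypothetical and not part of this PR):

private val shutdownLock = new Object

override def stop(): Unit = {
  super.stop()
  shutdownLock.synchronized {
    val deadline = System.currentTimeMillis() + shutdownTimeoutMS
    // Loop to guard against spurious wakeups; exit once the executor map
    // drains or the timeout elapses.
    while (slaveIdsWithExecutors.nonEmpty && System.currentTimeMillis() < deadline) {
      shutdownLock.wait(math.max(1L, deadline - System.currentTimeMillis()))
    }
  }
  mesosDriver.stop()
}

// In the TaskStatus update handler, after removing a finished executor:
// shutdownLock.synchronized {
//   if (slaveIdsWithExecutors.isEmpty) shutdownLock.notifyAll()
// }

The trade-off, as discussed below, is that understanding stop() then also requires reading the status-update callback.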

@dragos
Contributor

dragos commented Jan 15, 2016

The approach makes sense to me, I will give it a go and report back

@dragos
Contributor

dragos commented Jan 15, 2016

Things work as advertised in this PR! 👍

@dragos
Contributor

dragos commented Jan 15, 2016

Regarding the question on waiting strategies, I don't think polling is that bad. It's localized and simple to reason about, and I suppose stop should be synchronous. @tnachen's suggestion is fine, but unless I'm missing something it has no significant advantage, while introducing a potential deadlock and more complicated reasoning (notify is called from a callback, so looking at stop is no longer enough to understand what's going on).

@dragos
Contributor

dragos commented Jan 15, 2016

One more thing: please add the new setting to the Mesos docs.

@tnachen
Contributor

tnachen commented Jan 15, 2016

I think that in either case just looking at stop isn't enough, since we rely on the callback to empty the executors map for us to exit the loop before the timeout; either way, I thought the explicit approach just avoids busy-waiting for no reason. I'm fine with a while loop here, but I might increase the sleep time: since we are waiting on messages to be sent back from the executors to the driver, I would think checking every 100ms is probably enough.

@dragos
Contributor

dragos commented Jan 15, 2016

You're right about the callback for emptying the map, and also about the sleep interval.

@dragos
Contributor

dragos commented Jan 18, 2016

Let's try to move this forward. Looks like there are still a few things to do:

  • rebase on master
  • increase the sleep interval
  • document the new setting

@drcrallen
Contributor Author

Thanks for the feedback. I'll get to the fixes here very shortly.

@@ -60,6 +62,12 @@ private[spark] class CoarseMesosSchedulerBackend(
// Maximum number of cores to acquire (TODO: we'll need more flexible controls here)
val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt

private[this] val shutdownTimeoutMS = conf.getInt("spark.mesos.coarse.shutdown.ms", 10000)
Contributor

this needs to be

conf.getTimeAsMillis("spark.mesos.coarse.shutdownTimeout", "10s")

we have a consistent time string format for accepting similar configs.
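
For reference, the version that ended up in the merged diff (shown further down) uses SparkConf's getTimeAsMs helper; a minimal illustration of the time-string pattern:

import org.apache.spark.SparkConf

// Illustrative only: SparkConf time strings accept values such as "500ms",
// "10s", or "2min". The key name mirrors the merged diff below; it was later
// renamed in a follow-up hotfix.
val conf = new SparkConf()
val shutdownTimeoutMS: Long = conf.getTimeAsMs("spark.mesos.coarse.shutdown.ms", "10s")  // 10000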

@andrewor14
Contributor

@drcrallen This looks great, thanks for fixing the issue. I left a few minor comments. Once you fix them I'll merge this.

@drcrallen
Contributor Author

@andrewor14 Addressed comments except for #10319 (comment)

@drcrallen
Contributor Author

The failure was spurious (couldn't fetch from git). Needs a retest.

@drcrallen
Contributor Author

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50593 has finished for PR 10319 at commit 75887c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50616 has finished for PR 10319 at commit 75887c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50667 has finished for PR 10319 at commit 75887c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dragos
Contributor

dragos commented Feb 4, 2016

@drcrallen probably my comment and @andrewor14's reply were buried by the GitHub interface. The consensus is to remove the docs about the new setting. Sorry for the trouble.

@drcrallen
Contributor Author

@dragos removed

@andrewor14
Contributor

LGTM, merged into master. The last commit didn't change any code, so this can't fail tests.

asfgit closed this in 2eaeafe on Feb 4, 2016
@@ -60,6 +62,12 @@ private[spark] class CoarseMesosSchedulerBackend(
// Maximum number of cores to acquire (TODO: we'll need more flexible controls here)
val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt

private[this] val shutdownTimeoutMS = conf.getTimeAsMs("spark.mesos.coarse.shutdown.ms", "10s")
Contributor

Can you please submit a new PR to add this to the docs?

Contributor Author

Contributor
ah, nvm

Contributor

By the way, I just realized this config is not properly named. Unfortunately I did not catch this during code review, and I just pushed a hotfix to master to correct it: c756bda

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50755 has finished for PR 10319 at commit 1181a05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
// Wait for executors to report done, or else mesosDriver.stop() will forcefully kill them.
// See SPARK-12330
val stopwatch = new Stopwatch()
Contributor

This Stopwatch constructor was deprecated in newer versions of Guava (google/guava@fd0cbc2). In order to work around this issue, I'd like to remove this use of Stopwatch since we don't use it anywhere else and it doesn't seem to be buying us a whole lot in the way that it's used here.
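
A minimal sketch of a Stopwatch-free version of this wait, measuring elapsed time with System.nanoTime directly and reusing the names from the surrounding method (illustrative; the actual cleanup in master may differ):

import java.util.concurrent.TimeUnit

val startNanos = System.nanoTime()
def elapsedMs: Long = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos)

// Same polling loop as before, without the deprecated Guava constructor.
while (slaveIdsWithExecutors.nonEmpty && elapsedMs < shutdownTimeoutMS) {
  Thread.sleep(100)
}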

Contributor Author

Sounds good. Sorry, didn't see this note until now, and it looks like this was already fixed in master.

sttts pushed a commit to mesosphere/spark that referenced this pull request Mar 15, 2016
Author: Charles Allen <charles@allen-net.com>

Closes apache#10319 from drcrallen/SPARK-12330.
mgummelt pushed a commit to mesosphere/spark that referenced this pull request Aug 2, 2016
mgummelt pushed a commit to mesosphere/spark that referenced this pull request Mar 7, 2017