[SPARK-12330] [MESOS] Fix mesos coarse mode cleanup #10319
Conversation
ok to test
@drcrallen mind changing the
@@ -45,6 +46,7 @@ private[spark] class CoarseGrainedExecutorBackend(
    env: SparkEnv)
  extends ThreadSafeRpcEndpoint with ExecutorBackend with Logging {

  val stopping = new AtomicBoolean(false)
this can be private
Fixed
Test build #47761 has finished for PR 10319 at commit
@@ -60,6 +63,11 @@ private[spark] class CoarseMesosSchedulerBackend(
  // Maximum number of cores to acquire (TODO: we'll need more flexible controls here)
  val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt

  val shutdownTimeoutMS = conf.getInt("spark.mesos.coarse.shutdown.ms", 10000)
private val
Fixed
So IIUC stop is only invoked when an exception occurred or the shutdown hook is invoked; in both cases it's a task that didn't really finish and the user/system wants it to be killed. So I'm not sure that ending with TASK_KILLED in these cases is a bad idea.
@tnachen For some reason the shutdown hooks are not finishing properly if a SIGTERM arrives during shutdown. See logs in https://issues.apache.org/jira/browse/SPARK-12330. Fixing that may be a better root-cause fix, but I haven't looked into the Hadoop shutdown hooks enough to know where the issue could be with that.
To clarify... That should solve the block cleanup issue. It will not solve the executor reporting an incorrect status. If the executors are to exit "cleanly" (aka FINISHED instead of KILLED or FAILED) there needs to be a wait of some kind between Spark sending a shutdown and shutting down the mesos driver.
I may have found the root cause of the failure to clean up blocks. Testing
Nope, SparkEnv.stop() does not block on multiple calls to make sure stop has completed at least once. But even when ensuring that happens, the shutdown process still does not end cleanly.
After more thorough testing, it seems that there is still a race here. Here's an example log where the task was reported as killed:
STDERR
STDOUT
You can see in the above log entry where the terminal heap information was printed... THEN a SIGTERM was processed.
Test build #48013 has finished for PR 10319 at commit
The block manager is cleaning up as expected with this patch.
@tnachen Do you have any suggestions on ways to wait for executors to report as being cleaned up before calling
@tnachen Ping again regarding the question in #10319 (comment)
@tnachen I also haven't done a good job of making it clear earlier in this PR that the block manager does not properly clean up without this patch. See https://issues.apache.org/jira/browse/SPARK-12330 for more info.
also cc @dragos
// Wait for finish
val stopwatch = new Stopwatch()
stopwatch.start()
// Eventually consistent slaveIdsWithExecutors |
Not sure if this comment says much; can you perhaps add a comment noting that slaveIdsWithExecutors will be empty once we receive task termination status updates from all tasks?
@drcrallen About waiting suggestions: the best way from the scheduler side is to wait until all tasks are terminated when you want to shut down. I'm thinking that if this stop is at the very end of the shutdown call, another way is to use a shutdownLock with a wait on a timeout here, and then just call notify on the TaskStatus update when the map is empty. If the wait exits before the map is empty, then we know the timeout hit first. I think we just want to make sure nothing in the scheduler is calling stop by itself, so we don't deadlock.
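For illustration, here is a minimal Scala sketch of that lock-based wait/notify idea. It is not code from this PR: `slaveIdsWithExecutors` and `shutdownTimeoutMS` mirror names used in this review, while `shutdownLock`, `awaitExecutorShutdown`, and `onExecutorTerminated` are assumed names for the two sides of the handshake.

```scala
import scala.collection.mutable

object ShutdownWaitSketch {
  private val shutdownLock = new Object()
  // In the real scheduler this would be the map/set of live executors.
  private val slaveIdsWithExecutors = mutable.Set("slave-1", "slave-2")
  private val shutdownTimeoutMS = 10000L

  // Called from the scheduler's stop(): block until the status-update path has
  // emptied the executor set, or until the timeout expires.
  def awaitExecutorShutdown(): Unit = shutdownLock.synchronized {
    val deadline = System.currentTimeMillis() + shutdownTimeoutMS
    var remaining = deadline - System.currentTimeMillis()
    while (slaveIdsWithExecutors.nonEmpty && remaining > 0) {
      shutdownLock.wait(remaining) // releases the lock while waiting
      remaining = deadline - System.currentTimeMillis()
    }
  }

  // Called when a terminal TaskStatus arrives for an executor.
  def onExecutorTerminated(slaveId: String): Unit = shutdownLock.synchronized {
    slaveIdsWithExecutors -= slaveId
    if (slaveIdsWithExecutors.isEmpty) {
      shutdownLock.notifyAll() // wake the waiting stop() call
    }
  }
}
```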
The approach makes sense to me. I will give it a go and report back.
Things work as advertised in this PR! 👍
Regarding the question on waiting strategies, I don't think polling is that bad. It's localized and simple to reason about, and I suppose
One more thing: please add the new setting to the Mesos docs.
I think that in either case just looking at stop isn't enough, since we are relying on the callback to empty the executors map for us to exit the loop before the timeout; either way, I thought it's just more explicit and we don't have to busy-wait for no reason. I'm fine with a while loop here; I might increase the sleep time, since we are waiting on messages to be sent back from the executors to the driver, and I would think checking every 100ms is probably enough.
You're right about the callback for emptying the map, and also about the sleep interval.
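A hedged sketch of that polling variant, for reference (the diff in this PR ends up doing something similar, using a Guava Stopwatch for the elapsed-time check). The names `slaveIdsWithExecutors` and `shutdownTimeoutMS` follow this review; the standalone object and the plain `System.nanoTime()` timing are assumptions.

```scala
import java.util.concurrent.TimeUnit

import scala.collection.mutable

object PollingShutdownSketch {
  // In the real scheduler, the statusUpdate() callback removes entries from this set
  // as terminal task statuses arrive from the executors.
  val slaveIdsWithExecutors = mutable.Set("slave-1")
  val shutdownTimeoutMS = 10000L

  def waitForExecutorsToTerminate(): Unit = {
    val startNs = System.nanoTime()
    def elapsedMs: Long = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs)
    // Poll until the callback has emptied the set or the timeout passes.
    while (slaveIdsWithExecutors.nonEmpty && elapsedMs < shutdownTimeoutMS) {
      Thread.sleep(100) // checking every 100ms is enough; we're waiting on executor messages
    }
  }
}
```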
Let's try to move this forward. Looks like there are still a few things to do:
Thanks for the feedback. I'll get to the fixes here very shortly.
@@ -60,6 +62,12 @@ private[spark] class CoarseMesosSchedulerBackend(
  // Maximum number of cores to acquire (TODO: we'll need more flexible controls here)
  val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt

  private[this] val shutdownTimeoutMS = conf.getInt("spark.mesos.coarse.shutdown.ms", 10000)
this needs to be
conf.getTimeAsMillis("spark.mesos.coarse.shutdownTimeout", "10s")
we have a consistent time string format for accepting similar configs.
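As a hedged illustration of that time-string format (runnable in spark-shell): suffixed values like "10s", "500ms", or "1min" are accepted. The key name here follows the reviewer's suggestion; `getTimeAsMs` is the SparkConf accessor that appears later in this PR, and whether the key stays `shutdownTimeout` or `shutdown.ms` is settled further down the thread.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("spark.mesos.coarse.shutdownTimeout", "10s") // also accepts e.g. "500ms", "1min"
val timeoutMs: Long = conf.getTimeAsMs("spark.mesos.coarse.shutdownTimeout", "10s") // 10000
```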
@drcrallen This looks great, thanks for fixing the issue. I left a few minor comments. Once you fix them I'll merge this.
@andrewor14 Addressed comments except for #10319 (comment)
The failure was spurious (couldn't fetch from git). Needs retest.
Test build #50593 has finished for PR 10319 at commit
retest this please
Test build #50616 has finished for PR 10319 at commit
retest this please
Test build #50667 has finished for PR 10319 at commit
@drcrallen Probably my comment and @andrewor14's reply were buried by the GitHub interface. The consensus is to remove the docs about the new setting. Sorry for the trouble.
@dragos removed
LGTM, merged into master. The last commit didn't change any code, so this can't fail tests.
@@ -60,6 +62,12 @@ private[spark] class CoarseMesosSchedulerBackend(
  // Maximum number of cores to acquire (TODO: we'll need more flexible controls here)
  val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt

  private[this] val shutdownTimeoutMS = conf.getTimeAsMs("spark.mesos.coarse.shutdown.ms", "10s")
Can you please submit a new PR to add this to the docs?
ah, nvm
By the way, I just realized this config is not properly named. Unfortunately I did not catch this during code review, and I just pushed a hot fix in master to correct this: c756bda
Test build #50755 has finished for PR 10319 at commit
}
// Wait for executors to report done, or else mesosDriver.stop() will forcefully kill them.
// See SPARK-12330
val stopwatch = new Stopwatch()
This Stopwatch constructor was deprecated in newer versions of Guava (google/guava@fd0cbc2). In order to work around this issue, I'd like to remove this use of Stopwatch, since we don't use it anywhere else and it doesn't seem to be buying us a whole lot in the way that it's used here.
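One possible Stopwatch-free shape for this wait, as an assumption only (the actual fix on master may look different): plain `System.nanoTime()` gives the same elapsed-time check without the deprecated Guava constructor.

```scala
import java.util.concurrent.TimeUnit

object ElapsedTimeSketch {
  // Elapsed milliseconds since a recorded System.nanoTime() value.
  def elapsedMsSince(startNs: Long): Long =
    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs)

  def main(args: Array[String]): Unit = {
    val startNs = System.nanoTime()
    Thread.sleep(50)
    println(s"waited ${elapsedMsSince(startNs)} ms") // e.g. "waited 50 ms"
  }
}
```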
Sounds good. Sorry, didn't see this note until now, and it looks like this was already fixed in master.
In the current implementation the mesos coarse scheduler does not wait for the mesos tasks to complete before ending the driver. This causes a race where the task has to finish cleaning up before the mesos driver terminates it with a SIGINT (and SIGKILL after 3 seconds if the SIGINT doesn't work). This PR causes the mesos coarse scheduler to wait for the mesos tasks to finish (with a timeout defined by `spark.mesos.coarse.shutdown.ms`).

This PR also fixes a regression caused by [SPARK-10987] whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0).

With this patch the mesos coarse scheduler terminates properly, the executors clean up, and the tasks are reported as `FINISHED` in the Mesos console (as opposed to `KILLED` in < 1.6 or `FAILED` in 1.6 and later).

Author: Charles Allen <charles@allen-net.com>

Closes apache#10319 from drcrallen/SPARK-12330.
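As a hedged usage sketch (not part of the PR itself): this is how an application might raise the shutdown timeout described above. The key name matches this PR's text; a later hot fix on master renamed the config, so check the docs for the Spark version you run. The master URL and app name are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoarseShutdownExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("mesos://zk://zk-host:2181/mesos")      // placeholder Mesos master
      .setAppName("coarse-shutdown-example")
      .set("spark.mesos.coarse.shutdown.ms", "30s")      // allow up to 30s for executors to finish
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 10).sum())             // trivial job
    } finally {
      sc.stop() // with this patch, the coarse scheduler waits for tasks to report terminal status
    }
  }
}
```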