[SPARK-12583][MESOS] BACKPORT to 1.6.x - Mesos shuffle service: Don't delete shuffle file… #13279

corruptmemory · 2016-05-24T18:32:33Z

What changes were proposed in this pull request?

This is a backport of #11272 to the 1.6.x version line.

How was this patch tested?

This PR was tested the same way as the original PR: manual testing with a local mesos cluster.

…s before application has stopped

The Mesos shuffle service has been completely unusable since Spark 1.6.0. The problem seems to stem from the move from Akka to Netty in the networking layer. Until then, a connection from the driver to each shuffle service served as the signal by which the shuffle service determined whether the driver was still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) because it is idle. The shuffle service interprets this as a signal that the driver has stopped, even though the driver is still alive. Thus, shuffle files are deleted before the application has stopped.

spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159
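
The timeout fallback described above can be illustrated with a minimal sketch. It is not Spark's actual configuration code: the class `TimeoutLookup`, the plain `Map` standing in for the Spark configuration, and the millisecond-only values are assumptions made for illustration. The 120000 ms default is likewise an assumption, chosen to match the two-minute failure reported in SPARK-12583.

```java
import java.util.Map;

/** Hypothetical helper, not part of Spark: resolves the idle-connection timeout. */
class TimeoutLookup {
    // Assumed default, matching the roughly two-minute failure in SPARK-12583.
    static final long ASSUMED_DEFAULT_MS = 120_000L;

    /**
     * Returns spark.shuffle.io.connectionTimeout if set, otherwise
     * spark.network.timeout, otherwise the assumed default.
     * Values are taken as plain milliseconds in this sketch.
     */
    static long connectionTimeoutMs(Map<String, String> conf) {
        String value = conf.get("spark.shuffle.io.connectionTimeout");
        if (value == null) {
            value = conf.get("spark.network.timeout");
        }
        return value == null ? ASSUMED_DEFAULT_MS : Long.parseLong(value);
    }
}
```

With neither key set, an idle driver connection would be closed after the default, which is when the shuffle service wrongly concludes that the driver has stopped.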

This is a follow-up to #11207.

This PR adds a heartbeat signal from the Driver (in MesosExternalShuffleClient) to all registered external mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and cleans an application's shuffle files if this is the case.
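
The bookkeeping behind such a heartbeat check might look roughly like the sketch below. It is not the code from this PR: `HeartbeatTracker` and its methods are hypothetical names, and time is passed in explicitly so the logic can be exercised without real clocks or threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: records the last heartbeat per application and expires silent ones. */
class HeartbeatTracker {
    private static final class AppState {
        final long timeoutMs;
        volatile long lastHeartbeatMs;

        AppState(long timeoutMs, long nowMs) {
            this.timeoutMs = timeoutMs;
            this.lastHeartbeatMs = nowMs;
        }
    }

    private final Map<String, AppState> apps = new ConcurrentHashMap<>();

    /** Called when a driver registers; registration counts as the first heartbeat. */
    void register(String appId, long timeoutMs, long nowMs) {
        apps.put(appId, new AppState(timeoutMs, nowMs));
    }

    /** Called for each heartbeat; returns false if the application is unknown. */
    boolean heartbeat(String appId, long nowMs) {
        AppState state = apps.get(appId);
        if (state == null) {
            return false;
        }
        state.lastHeartbeatMs = nowMs;
        return true;
    }

    /** Removes and returns every application whose last heartbeat is older than its timeout. */
    List<String> removeTimedOut(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, AppState> entry : apps.entrySet()) {
            AppState state = entry.getValue();
            boolean timedOut = nowMs - state.lastHeartbeatMs > state.timeoutMs;
            // remove(key, value) only succeeds if the entry was not re-registered meanwhile.
            if (timedOut && apps.remove(entry.getKey(), state)) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

In this sketch, a periodic cleaner thread would call `removeTimedOut(System.currentTimeMillis())` and delete the shuffle files of each returned application. An idle but live driver keeps sending heartbeats, so a closed connection alone no longer triggers cleanup.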

This patch has been tested on a small mesos test cluster using the spark-shell. Log output from mesos shuffle service:

16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs

Note: there are 2 executors running on this slave.

Author: Bertrand Bossy <bertrand.bossy@teralytics.net>

Closes #11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.

Initial backport of #11272

* No new test failures introduced.
* Provisional backport complete

AmplabJenkins · 2016-05-24T18:37:17Z

Can one of the admins verify this patch?

rxin · 2016-05-24T18:51:36Z

hmm thanks for submitting this, but we rarely backport patches of this size due to the risk of regression in maintenance releases.

andrewor14 · 2016-05-24T18:57:02Z

also why is this patch so much bigger than #11272, which is only 200 lines?

corruptmemory · 2016-05-24T19:12:24Z

@rxin understood. I have a user that is relying on the backport and would like to see what can be done to get it upstream.

@andrewor14 #11272 was done against master at the time. This was done against 1.6 HEAD. It is possible some additional stuff may have gotten in, but certainly no patch bulking up was intended.

rxin · 2016-05-24T19:43:44Z

Can your user use a custom build you provide? This has already gone into master/branch-2.0 so it is part of the upstream codebase.

corruptmemory · 2016-05-26T15:23:23Z

Understood that this fix is upstream -- that's where I got the patch that started this.

According to the original PR, this fix makes something work that was supposed to work but didn't. Assuming the patch I started with was focused solely on fixing the non-functional shuffle service, this PR should inherit that property. I haven't knowingly extended the scope of the intended fix. My user is not going to deploy 2.0 anytime soon (they are still in the process of rolling out 1.6), so there is some value in knowing that further bug fixes for 1.6 are made in the context of a working shuffle service.

rxin · 2016-05-26T17:25:28Z

Sorry @corruptmemory. The scope of the patch is too scary for a maintenance release. It changes the scheduler beyond Mesos.

The best way going forward is probably for you to provide a custom build for your customers, if you believe the risk of regression for this patch is not high.

corruptmemory changed the title ~~[SPARK-12583][MESOS] Mesos shuffle service: Don't delete shuffle file…~~ [SPARK-12583][MESOS] BACKPORT to 1.6.x - Mesos shuffle service: Don't delete shuffle file… May 24, 2016

corruptmemory closed this May 26, 2016
