[SPARK-12583][MESOS] BACKPORT to 1.6.x - Mesos shuffle service: Don't delete shuffle file… #13279

Closed

corruptmemory

What changes were proposed in this pull request?

This is a backport of #11272 to the 1.6.x version line.

How was this patch tested?

This PR was tested the same way as the original PR: manual testing with a local Mesos cluster.

…s before application has stopped

Mesos shuffle service is completely unusable since Spark 1.6.0. The problem seems to have been introduced by the move from Akka to Netty in the networking layer. Until then, a connection from the driver to each shuffle service served as the signal by which the shuffle service determined whether the driver was still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) because it is idle. The shuffle service interprets the closed connection as a signal that the driver has stopped, even though the driver is still alive. As a result, shuffle files are deleted before the application has stopped.
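
To make the timeout fallback concrete, here is a minimal Java sketch of how the effective idle timeout could be resolved from the two settings named above. The class and method names are hypothetical and this is not Spark's actual configuration code. The 120-second default is an assumption for the sketch (it matches the "2mins" in the JIRA title), and the parser only handles plain seconds.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of Spark.
class ShuffleTimeoutConf {
    // Assumed default for this sketch: 120 seconds.
    static final String DEFAULT_NETWORK_TIMEOUT = "120s";

    // Prefer spark.shuffle.io.connectionTimeout; fall back to
    // spark.network.timeout, then to the assumed default.
    static long connectionTimeoutMs(Map<String, String> conf) {
        String fallback =
            conf.getOrDefault("spark.network.timeout", DEFAULT_NETWORK_TIMEOUT);
        String value =
            conf.getOrDefault("spark.shuffle.io.connectionTimeout", fallback);
        return parseSecondsToMs(value);
    }

    // Simplified parser: accepts "120s" or "120", both read as seconds.
    // Other unit suffixes (such as "ms" or "min") are not handled here.
    static long parseSecondsToMs(String value) {
        String v = value.trim();
        if (v.endsWith("s")) {
            v = v.substring(0, v.length() - 1);
        }
        return Long.parseLong(v) * 1000L;
    }
}
```

With neither key set, the sketch resolves to 120000 ms, so an otherwise idle driver connection would be dropped after about two minutes.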

spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159

This is a follow-up to #11207.

This PR adds a heartbeat signal from the driver (in MesosExternalShuffleClient) to all registered external Mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and, if so, cleans up that application's shuffle files.
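
The heartbeat bookkeeping described above can be sketched roughly as follows. This is a simplified, hypothetical illustration in Java, not the code from the patch: the class `DriverHeartbeatTracker` and its methods are invented for the example, and timestamps are passed in explicitly so the logic can be exercised without real clocks or threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the shuffle service's per-application tracking.
class DriverHeartbeatTracker {
    // Per-application state: allowed silence and last heartbeat time, in ms.
    private static final class AppState {
        final long timeoutMs;
        volatile long lastHeartbeatMs;

        AppState(long timeoutMs, long nowMs) {
            this.timeoutMs = timeoutMs;
            this.lastHeartbeatMs = nowMs;
        }
    }

    private final Map<String, AppState> apps = new ConcurrentHashMap<>();

    // Called when a driver registers with the shuffle service.
    void register(String appId, long timeoutMs, long nowMs) {
        apps.put(appId, new AppState(timeoutMs, nowMs));
    }

    // Called for each heartbeat; unknown applications are ignored.
    void heartbeat(String appId, long nowMs) {
        AppState state = apps.get(appId);
        if (state != null) {
            state.lastHeartbeatMs = nowMs;
        }
    }

    // Called periodically by a cleaner thread. Returns the IDs of
    // applications whose driver has been silent for longer than its
    // timeout, and stops tracking them.
    List<String> removeExpired(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, AppState> entry : apps.entrySet()) {
            AppState state = entry.getValue();
            if (nowMs - state.lastHeartbeatMs > state.timeoutMs
                    && apps.remove(entry.getKey(), state)) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

In a real service, something like a scheduled executor would presumably call `removeExpired(System.currentTimeMillis())` at a fixed interval and delete the shuffle files for each returned application ID. The key difference from the pre-patch behavior is that liveness depends on explicit heartbeats rather than on the connection staying open.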

This patch has been tested on a small Mesos test cluster using the spark-shell. Log output from the Mesos shuffle service:

16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs

Note: there are 2 executors running on this slave.

Author: Bertrand Bossy bertrand.bossy@teralytics.net

Closes #11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.

Initial backport of #11272

  • No new test failures introduced.
  • Provisional backport complete

@corruptmemory corruptmemory changed the title [SPARK-12583][MESOS] Mesos shuffle service: Don't delete shuffle file… [SPARK-12583][MESOS] BACKPORT to 1.6.x - Mesos shuffle service: Don't delete shuffle file… May 24, 2016
@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Contributor

rxin commented May 24, 2016

Hmm, thanks for submitting this, but we rarely backport patches of this size because of the risk of regression in maintenance releases.

@andrewor14
Contributor

Also, why is this patch so much bigger than #11272, which is only 200 lines?

@corruptmemory
Author

@rxin understood. I have a user that is relying on the backport and would like to see what can be done to get it upstream.

@andrewor14 #11272 was done against master at the time. This was done against 1.6 HEAD. It is possible some additional changes got in, but certainly no bulking up of the patch was intended.

@rxin
Contributor

rxin commented May 24, 2016

Can your user use a custom build you provide? This has already gone into master/branch-2.0 so it is part of the upstream codebase.

@corruptmemory
Author

Understood that this fix is upstream -- that's where I got the patch that started this.

According to the original PR, this fix makes something work that was supposed to work but didn't. Assuming the patch I started with addressed only the non-functional shuffle service, this PR should inherit that property. I haven't knowingly extended the scope of the intended fix. My user is not going to be deploying 2.0 anytime soon (they are still in the process of rolling out 1.6), so there is some value in knowing that any further bug fixes for 1.6 are made in the context of a working shuffle service.

@rxin
Contributor

rxin commented May 26, 2016

Sorry @corruptmemory. The scope of the patch is too scary for a maintenance release. It changes the scheduler beyond Mesos.

The best way going forward is probably for you to provide a custom build for your customers, if you believe the risk of regression for this patch is not high.
