[SPARK-12583][MESOS] BACKPORT to 1.6.x - Mesos shuffle service: Don't delete shuffle file… #13279
…s before application has stopped

Mesos shuffle service is completely unusable since Spark 1.6.0. The problem seems to occur since the move from akka to netty in the networking layer. Until now, a connection from the driver to each shuffle service was used as a signal for the shuffle service to determine whether the driver is still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) due to it being idle. The shuffle service interprets this as a signal that the driver has stopped, despite the driver still being alive. Thus, shuffle files are deleted before the application has stopped.

spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159

This is a follow up on apache#11207.

This PR adds a heartbeat signal from the Driver (in MesosExternalShuffleClient) to all registered external mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and cleans an application's shuffle files if this is the case.

This patch has been tested on a small mesos test cluster using the spark-shell. Log output from mesos shuffle service:

```
16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs
```

Note: there are 2 executors running on this slave.

Author: Bertrand Bossy <bertrand.bossy@teralytics.net>

Closes apache#11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.

Initial backport of apache#11272

* No new test failures introduced.
* Provisional backport complete
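The heartbeat scheme described in the commit message can be illustrated with a small, self-contained sketch. This is not the code from the patch: `HeartbeatTracker`, its `Cleanup` callback, and the `register`/`heartbeat`/`expire` methods are hypothetical names chosen for illustration, and time is passed in explicitly so the behaviour is easy to follow. The idea is that the shuffle service records when each application last sent a heartbeat and removes an application's shuffle files only once its heartbeat timeout has elapsed, rather than when an idle connection is closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: remembers the last heartbeat per application and expires silent ones. */
class HeartbeatTracker {

  /** Hypothetical callback standing in for the shuffle-file cleanup of one application. */
  interface Cleanup {
    void applicationRemoved(String appId);
  }

  /** Per-application state: the registered timeout and the time of the last heartbeat. */
  private static final class AppState {
    final long timeoutMs;
    volatile long lastHeartbeatMs;

    AppState(long timeoutMs, long nowMs) {
      this.timeoutMs = timeoutMs;
      this.lastHeartbeatMs = nowMs;
    }
  }

  private final Map<String, AppState> apps = new ConcurrentHashMap<>();
  private final Cleanup cleanup;

  HeartbeatTracker(Cleanup cleanup) {
    this.cleanup = cleanup;
  }

  /** Called when a driver registers; registration counts as the first heartbeat. */
  void register(String appId, long timeoutMs, long nowMs) {
    apps.put(appId, new AppState(timeoutMs, nowMs));
  }

  /** Called for every heartbeat; heartbeats from unknown applications are ignored. */
  void heartbeat(String appId, long nowMs) {
    AppState state = apps.get(appId);
    if (state != null) {
      state.lastHeartbeatMs = nowMs;
    }
  }

  /** Removes every application whose last heartbeat is older than its timeout. */
  List<String> expire(long nowMs) {
    List<String> expired = new ArrayList<>();
    for (Map.Entry<String, AppState> entry : apps.entrySet()) {
      AppState state = entry.getValue();
      boolean timedOut = nowMs - state.lastHeartbeatMs > state.timeoutMs;
      // remove(key, value) ensures the cleanup runs at most once per registration.
      if (timedOut && apps.remove(entry.getKey(), state)) {
        cleanup.applicationRemoved(entry.getKey());
        expired.add(entry.getKey());
      }
    }
    return expired;
  }
}
```

In a running service, `expire` would presumably be invoked periodically from a background thread (for example via a `ScheduledExecutorService`) with the current time. That would explain why the log above shows the application being cleaned up somewhat later than the 120000 ms timeout after registration.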
Can one of the admins verify this patch?
hmm thanks for submitting this, but we rarely backport patches of this size due to the risk of regression in maintenance releases.
also why is this patch so much bigger than #11272, which is only 200 lines?
@rxin understood. I have a user that is relying on the backport and would like to see what can be done to get it upstream. @andrewor14 #11272 was done against master at the time. This was done against 1.6 HEAD. It is possible some additional stuff may have gotten in, but certainly no patch bulking up was intended.
Can your user use a custom build you provide? This has already gone into master/branch-2.0 so it is part of the upstream codebase.
Understood that this fix is upstream -- that's where I got the patch that started this. This fix is (according to the original PR) something that made something work that was supposed to work but didn't. Assuming that the patch I started with was focused only on fixing the non-functional shuffle service, then this current PR should inherit that property. I haven't knowingly extended the scope of the intended fix. At the moment my user is not going to be deploying 2.0 anytime soon (they are still in the process of rolling out 1.6), so there is some value in knowing that any further attention to fixing bugs for 1.6 takes into account fixes in the context of a working shuffle service.
Sorry @corruptmemory. The scope of the patch is too scary for a maintenance release. It changes the scheduler beyond Mesos. The best way going forward is probably for you to provide a custom build for your customers, if you believe the risk of regression for this patch is not high.
What changes were proposed in this pull request?
This is a backport of #11272 to the 1.6.x version line.
How was this patch tested?
This PR was tested the same way as the original PR: manual testing with a local mesos cluster.