[SPARK-12583][Mesos] Mesos shuffle service: Don't delete shuffle files before application has stopped #11207

bbossy · 2016-02-15T17:21:50Z

Mesos shuffle service is completely unusable since Spark 1.6.0. The problem seems to stem from the move from akka to netty in the networking layer. Before 1.6.0, a connection from the driver to each shuffle service served as the signal by which the shuffle service determined whether the driver was still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) because it is idle. The shuffle service interprets this as a signal that the driver has stopped, even though the driver is still alive. As a result, shuffle files are deleted before the application has stopped.

Context and analysis:

spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159

Proposed fix:

Instead of relying on a connection being disconnected, or a heartbeat signal lost, the mesos shuffle service periodically checks with the mesos master whether the framework (spark application) is still running:

Every spark.storage.blockManagerSlaveTimeoutMs / 4, the mesos shuffle service retrieves the leading master's /master/state.json. It checks that the reply came from the actual leading master and updates a "last seen" timestamp in its internal state (spark applications on mesos register with the external shuffle service using their framework id). Then it deletes the temporary files of all previously registered frameworks that have not been reported as running within the past spark.storage.blockManagerSlaveTimeoutMs.

It requires mesos-resolve to be on the PATH where the shuffle service is running. This is used to find the leading master in a mesos-HA setup (through zookeeper).

Further, spark.master needs to be set when the service is started.
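
For illustration only, the bookkeeping described above could look roughly like the sketch below. `FrameworkLivenessTracker` and its method names are hypothetical and not part of Spark. Fetching /master/state.json and locating the leading master via mesos-resolve are left out; the caller is assumed to pass in the set of framework ids that the master reported as running.

```scala
import scala.collection.mutable

// Hypothetical helper (not Spark code): remembers when each registered
// framework was last reported as running by the leading Mesos master.
final class FrameworkLivenessTracker(timeoutMs: Long) {
  private val lastSeenMs = mutable.Map.empty[String, Long]

  /** An application registers with the shuffle service using its framework id. */
  def register(frameworkId: String, nowMs: Long): Unit =
    lastSeenMs(frameworkId) = nowMs

  /** After a successful poll, refresh the timestamp of every registered framework still running. */
  def markRunning(runningIds: Set[String], nowMs: Long): Unit =
    runningIds.foreach { id =>
      if (lastSeenMs.contains(id)) lastSeenMs(id) = nowMs
    }

  /** Remove and return the frameworks not seen as running for longer than the timeout. */
  def expire(nowMs: Long): Set[String] = {
    val expired = lastSeenMs.collect {
      case (id, seen) if nowMs - seen > timeoutMs => id
    }.toSet
    lastSeenMs --= expired
    expired
  }
}
```

The ids returned by `expire` would be the frameworks whose temporary files get deleted. Under the proposal, `markRunning` and `expire` would be called once per poll, i.e. every quarter of the timeout.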

Delete shuffle files once a framework is no longer running

AmplabJenkins · 2016-02-15T17:22:12Z

Can one of the admins verify this patch?

bbossy · 2016-02-15T17:27:44Z

core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala

+    new MesosExternalShuffleBlockHandler(
+      conf,
+      this.conf.get("spark.master"),
+      this.conf.getTimeAsMs("spark.storage.blockManagerSlaveTimeoutMs", "120s"))


Should I use a different config key here?

JoshRosen · 2016-02-16T04:25:41Z

No comment on the contents of this PR (since I haven't looked at it yet), but would you mind changing the PR title to something more descriptive? As it stands now, "Fix Mesos shuffle service" is a lot less descriptive than, say, "Delete shuffle files after Mesos shuffle service exits" or something similar.

Could you also edit the description to include a concise one-sentence description of the user-facing bug / symptom that this fixes? Right now this describes a lot of mechanism, but I feel like the description is a bit thin on context for newcomers who are trying to understand what this patch is doing.

bbossy · 2016-02-16T07:47:56Z

@JoshRosen changed to a more descriptive title and added a more detailed problem description.

bbossy · 2016-02-16T08:08:23Z

@dragos Take a look, please

dragos · 2016-02-16T10:42:54Z

@bbossy thanks for picking this up!

I have a problem with the bandwidth this design implies. For instance, my state.json is 200KB (a cluster of 1 master and 2 nodes, with virtually no frameworks running). This is most likely close to the minimum size. Multiply this by the number of nodes (let's say we want to scale to 10,000 nodes).

200KB * 10,000 nodes = 2GB

2GB of traffic for each check seems pretty bad.
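
As a rough back-of-the-envelope check of the estimate above, the sketch below multiplies it out and also derives a sustained rate. It assumes decimal units (1 KB = 1000 bytes), the 120s default for spark.storage.blockManagerSlaveTimeoutMs shown in the diff, and the proposed poll interval of a quarter of that timeout. `StatePollingCost` is a hypothetical name used only for this calculation.

```scala
// Hypothetical calculation, not Spark code.
object StatePollingCost {
  val stateSizeBytes: Long = 200L * 1000      // 200 KB state.json, decimal units assumed
  val nodes: Long = 10000L                    // one shuffle service per node
  val timeoutMs: Long = 120000L               // "120s" default from the diff
  val pollIntervalMs: Long = timeoutMs / 4    // polling interval from the proposal

  // Traffic generated by one round of checks across the cluster.
  val bytesPerRound: Long = stateSizeBytes * nodes

  // Average load on the leading master if every node polls once per interval.
  val bytesPerSecond: Double = bytesPerRound.toDouble / (pollIntervalMs / 1000.0)
}
```

With these assumptions, one round comes to 2,000,000,000 bytes (2 GB), and a 30 s poll interval turns that into roughly 67 MB/s served by the leading master.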

I think it'd be better (and probably simpler) to increase the timeout for this particular connection, or send a heartbeat message (whatever akka remote did before).

@tnachen @mgummelt what do you think?

bbossy · 2016-02-16T10:48:21Z

@dragos You're right. I'll have a look at your proposals.

tnachen · 2016-02-16T10:50:45Z

I also agree with @dragos, and I think we should keep the same semantics by having a heartbeat instead.

…s before application has stopped

## Problem description:

Mesos shuffle service is completely unusable since Spark 1.6.0. The problem seems to occur since the move from akka to netty in the networking layer. Until now, a connection from the driver to each shuffle service was used as a signal for the shuffle service to determine whether the driver is still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) due to it being idle. The shuffle service interprets this as a signal that the driver has stopped, despite the driver still being alive. Thus, shuffle files are deleted before the application has stopped.

### Context and analysis:

spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159

This is a follow up on #11207.

## What changes were proposed in this pull request?

This PR adds a heartbeat signal from the Driver (in MesosExternalShuffleClient) to all registered external mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and cleans an application's shuffle files if this is the case.

## How was this patch tested?

This patch has been tested on a small mesos test cluster using the spark-shell. Log output from mesos shuffle service:

```
16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs
```

Note: there are 2 executors running on this slave.

Author: Bertrand Bossy <bertrand.bossy@teralytics.net>

Closes #11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.
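
The heartbeat mechanism summarized in the commit message above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code merged in #11272: `HeartbeatMonitor`, `AppState` and all method names are hypothetical, time is passed in explicitly so the logic can be exercised without threads, and the per-application timeout mirrors the "heartbeat timeout 120000 ms" value in the registration log line.

```scala
import scala.collection.mutable

// Hypothetical types (not Spark's MesosExternalShuffleBlockHandler).
final case class AppState(heartbeatTimeoutMs: Long, lastHeartbeatMs: Long)

final class HeartbeatMonitor(cleanup: String => Unit) {
  private val apps = mutable.Map.empty[String, AppState]

  /** The driver registers an application together with its heartbeat timeout. */
  def register(appId: String, heartbeatTimeoutMs: Long, nowMs: Long): Unit = synchronized {
    apps(appId) = AppState(heartbeatTimeoutMs, nowMs)
  }

  /** A heartbeat refreshes the timestamp; heartbeats from unknown apps are ignored. */
  def heartbeat(appId: String, nowMs: Long): Unit = synchronized {
    apps.get(appId).foreach(s => apps(appId) = s.copy(lastHeartbeatMs = nowMs))
  }

  /** Periodic check: drop every app whose last heartbeat is too old and clean up after it. */
  def sweep(nowMs: Long): Unit = synchronized {
    val timedOut = apps.collect {
      case (id, s) if nowMs - s.lastHeartbeatMs > s.heartbeatTimeoutMs => id
    }.toList
    timedOut.foreach { id =>
      apps -= id
      cleanup(id) // e.g. remove the application's shuffle files
    }
  }
}
```

In a running service, `sweep` would presumably be invoked from a background thread on a fixed schedule, with `cleanup` wired to whatever removes the application's local directories.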

…s before application has stopped

Initial backport of apache#11272

* No new test failures introduced.
* Provisional backport complete

IgorBerman · 2018-01-31T10:19:31Z

Hi @bbossy,
just to make sure: this PR wasn't merged, in favor of #11272?

bbossy · 2018-01-31T12:25:38Z

@IgorBerman correct, #11272 was a follow-up on this and got merged. I still experience issues with the shuffle service on Mesos, but I haven't been able to pinpoint the cause.

IgorBerman · 2018-01-31T14:24:23Z

@bbossy thanks. I'm experiencing them too, despite the merged PR. I think I have some idea of why it fails. Here is my setup:

v2.2.0
Dynamic allocation is on
Min executors is 2 (for example)
I've set two loggers, MesosCoarseGrainedSchedulerBackend and MesosExternalShuffleClient, to debug level
My workload is empty-high-empty

So it behaves as follows:

In the beginning, the driver registers with the external shuffle service (for the 2 initial executors, which are tasks in mesos). In the log I can see "Mesos task 1 is now TASK_RUNNING" and then "Connecting to shuffle service on slave..." twice, once for each of the 2 initial executors.
After the load increases, I see that dynamic allocation works and new executors are created (in the mesos console I see new tasks). However, the driver doesn't get a notification about the status update, so it doesn't register with the new external shuffle service (i.e. there is no additional "Mesos task x is now TASK_RUNNING").

WDYT?

IgorBerman · 2018-02-23T07:16:09Z

@bbossy so my workaround: I've disabled cleanup in the external shuffle service and remove shuffle files with a cron job that finds files that were not accessed in the last X hours.
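
A minimal sketch of the "not accessed in the last X hours" check from the workaround above, written in Scala for consistency with the rest of this thread. `StaleShuffleFiles` is a hypothetical helper, not the actual cron job; it only lists candidates and deletes nothing. It also assumes the filesystem records access times, which may not hold on mounts configured with noatime.

```scala
import java.nio.file.{Files, Path}
import java.nio.file.attribute.BasicFileAttributes

// Hypothetical helper (not part of Spark or of the cron job mentioned above).
object StaleShuffleFiles {
  /** Regular files under `root` whose last access time is more than `maxAgeMs` before `nowMs`. */
  def find(root: Path, maxAgeMs: Long, nowMs: Long): List[Path] = {
    val stream = Files.walk(root)
    try {
      val it = stream.iterator()
      val stale = List.newBuilder[Path]
      while (it.hasNext) {
        val p = it.next()
        val attrs = Files.readAttributes(p, classOf[BasicFileAttributes])
        // Directories are skipped; only files with an old access time are reported.
        if (attrs.isRegularFile && nowMs - attrs.lastAccessTime().toMillis > maxAgeMs) stale += p
      }
      stale.result()
    } finally stream.close()
  }
}
```

A caller would pass the shuffle service's local directory as `root` and X hours converted to milliseconds as `maxAgeMs`, then delete the returned paths.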

SPARK-12583: Fix mesos shuffle service

e387def

Delete shuffle files once a framework is no longer running

bbossy changed the title ~~[SPARK-12583 Fix mesos shuffle service~~ [SPARK-12583][Mesos] Fix mesos shuffle service Feb 15, 2016

bbossy reviewed Feb 15, 2016

bbossy changed the title ~~[SPARK-12583][Mesos] Fix mesos shuffle service~~ [SPARK-12583][Mesos] Mesos shuffle service: Don't delete shuffle files before application has stopped Feb 16, 2016

bbossy closed this Feb 16, 2016

bbossy mentioned this pull request Feb 19, 2016

[SPARK-12583][Mesos] Mesos shuffle service: Don't delete shuffle files before application has stopped #11272

Closed

bbossy deleted the SPARK-12583-mesos-shuffle-service-fix branch April 4, 2016 09:09

corruptmemory mentioned this pull request May 24, 2016

[SPARK-12583][MESOS] BACKPORT to 1.6.x - Mesos shuffle service: Don't delete shuffle file… #13279

Closed
