This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet on Spark Roadmap #2268

Closed
3 of 5 tasks
yzhliu opened this issue May 28, 2016 · 16 comments

Comments

@yzhliu
Member

yzhliu commented May 28, 2016

#2256 makes MXNet on Spark possible. It works on a stable Spark cluster, but problems arise when it is brought into a more complex environment, e.g., executors may fail and retry, or multiple tasks may be configured to run in one executor.

Related to issue #1637 @tqchen

Here's a roadmap of the issues that may prevent MXNet on Spark from being used in a production environment.

  • KVStore workers (and scheduler/servers) failover. I found that, when a worker fails and restarts, it is not able to connect to the scheduler again. Is this expected behavior for ps-lite? @mli
  • Timeout for the PS scheduler and servers. The PS scheduler and servers are started in a spawned process, but when something goes wrong, e.g., the workers, servers, or scheduler crash, they have no chance to stop themselves. I think the simplest way to solve this problem is to set a timeout T: if the scheduler/server does not receive any message within T seconds, it stops itself.
  • Multiple ps-lite threads in one process. Currently ps-lite is a singleton, but Spark clusters can be configured to run multiple tasks in one executor (i.e., one process).
  • Make device resources transparent to the application. Instead of specifying which GPU to use, users only need to say how many GPUs they want. This is important for clusters that use YARN for resource management.
  • Upload and distribute the MXNet core library to all the worker nodes automatically.
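
The timeout idea in the second bullet could be sketched roughly as a watchdog timer that is reset on every incoming message. This is a hypothetical illustration, not ps-lite's actual API; the class name and callbacks are assumptions:

```python
import threading

class Watchdog:
    """Stop a PS scheduler/server that has received no message for T seconds.

    Hypothetical sketch of the timeout idea above, not ps-lite's actual API.
    """

    def __init__(self, timeout_sec, on_timeout):
        self.timeout_sec = timeout_sec
        self.on_timeout = on_timeout  # e.g., a callback that shuts the process down
        self._timer = None
        self._lock = threading.Lock()

    def _arm(self):
        self._timer = threading.Timer(self.timeout_sec, self.on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def start(self):
        with self._lock:
            self._arm()

    def heartbeat(self):
        """Call whenever a message is received; resets the countdown."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._arm()

    def stop(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
```

The message-handling loop would call `heartbeat()` on every received message; if the loop goes silent for `timeout_sec`, `on_timeout` fires and the process can exit.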
@tqchen
Member

tqchen commented May 28, 2016

As a side note: as far as I know, most GPU-related distributed frameworks rely on a more reliable environment than common data-processing frameworks do.

Due to the complicated nature of learning and the relatively small size of the model, a checkpoint-reloading strategy is usually used instead of complicated fault-tolerance strategies.

@yzhliu
Member Author

yzhliu commented May 29, 2016

Yes, I agree. But since I/O failure/restart is quite common in Spark, I think KVStore workers are required to be able to reconnect. For the servers & scheduler, maybe we should find a way to fail the whole application when they crash.

@tqchen
Member

tqchen commented May 29, 2016

Ideally we would not fail the application, but keep the worker slot occupied, and just make sure the surviving workers reconnect and load checkpoints.
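
The reconnect-and-reload strategy could look roughly like this. All names here are hypothetical and the checkpoint format is simplified (real MXNet has its own `save_checkpoint`/`load_checkpoint` routines); this only illustrates resuming from the newest checkpoint after a restart:

```python
import os
import pickle

def latest_checkpoint(ckpt_dir):
    """Return (epoch, path) of the newest checkpoint, or (0, None) if none exist."""
    ckpts = [f for f in os.listdir(ckpt_dir) if f.startswith("model-")]
    if not ckpts:
        return 0, None
    newest = max(ckpts, key=lambda f: int(f.split("-")[1].split(".")[0]))
    return int(newest.split("-")[1].split(".")[0]), os.path.join(ckpt_dir, newest)

def run_worker(ckpt_dir, num_epochs, train_one_epoch, params=None):
    """Resume from the latest checkpoint after a restart, then keep training."""
    start_epoch, path = latest_checkpoint(ckpt_dir)
    if path is not None:
        with open(path, "rb") as f:
            params = pickle.load(f)  # reload model state after a restart
    for epoch in range(start_epoch, num_epochs):
        params = train_one_epoch(params, epoch)
        # checkpoint after every epoch so a restarted worker loses little work
        with open(os.path.join(ckpt_dir, "model-%d.pkl" % (epoch + 1)), "wb") as f:
            pickle.dump(params, f)
    return params
```

A restarted worker calling `run_worker` again simply picks up at the epoch recorded by the newest checkpoint instead of starting over.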

@mli
Member

mli commented May 31, 2016

Hi Yizhi,

Thanks for the feedback. We have several options for fault tolerance. We tried the most expensive solution before, namely the continuous chain replication described in the OSDI paper. It, however, may affect performance a lot, and it is hard to make it run smoothly.

The current ps-lite has little fault tolerance. The reason is that I want to understand more about the environment first. So my questions are:

  1. Is it possible to set the highest job priority, so that the chance of being preempted within several iterations is small? Then we can checkpoint the model every iteration.
  2. Is it possible to make the servers stable even though a worker may be preempted at any time? Then we can do continuous fault tolerance for the workers.

We are also considering running MXNet as a service in a cloud such as AWS. We hope to have a single solution that works in both situations.


@yzhliu
Member Author

yzhliu commented May 31, 2016

Hi Mu,

I think your idea makes sense. We do have a chance to checkpoint in most use-cases. For the 2nd point, the only thing I can see causing a server crash is the physical node itself going down. In such a situation, which is rare, we can simply fail the application. But as you said, we do need continuous fault tolerance for the workers.

@mli
Member

mli commented Jun 2, 2016

@Javelinjs for workers, it is doable.

  1. If using async-SGD, then it's OK to lose any worker at any time.
  2. If using sync-SGD, it's a little bit tricky. We may relax the constraint from "wait for all workers' data" to "wait for 95% of workers' data". The only problem is that we then need to change batch_size dynamically according to how many workers are alive: https://github.com/dmlc/mxnet/blob/master/python/mxnet/model.py#L764
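
The dynamic batch_size adjustment in point 2 could be sketched like this. The helper name and the exact rescaling policy are assumptions; it only illustrates waiting for a fraction of workers and splitting the global batch among the survivors:

```python
def effective_batch_size(global_batch_size, num_workers, alive_workers,
                         min_alive_frac=0.95):
    """Split a global batch across whichever workers are still alive.

    Hypothetical sketch of the sync-SGD relaxation discussed above: proceed
    once min_alive_frac of the workers are present, and rescale each
    survivor's share so the global batch size stays roughly constant.
    """
    if alive_workers < min_alive_frac * num_workers:
        raise RuntimeError("too few workers alive to proceed")
    # each surviving worker picks up a proportionally larger slice
    return global_batch_size // alive_workers
```

With 100 workers and a global batch of 1000, losing 5 workers would bump each survivor's slice from 10 to about 10.5 samples, keeping the effective gradient roughly comparable across iterations.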

@yzhliu
Member Author

yzhliu commented Jun 3, 2016

Spark will restart failed tasks automatically. In that situation, lost workers need to reconnect to the scheduler and servers. Is this now supported in async mode?

@mli
Member

mli commented Jun 3, 2016

It was there before; I can add it back.


@yzhliu
Member Author

yzhliu commented Jun 3, 2016

Great, please do. Currently MXNet on Spark uses async mode.

@liyuance
Contributor

@Javelinjs For the 2nd point, we can consider registering a "kill child process" action in a ShutdownHook; then the PS scheduler and server processes will exit immediately when the Spark driver or executor stops.
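
The same idea expressed in Python terms (the actual proposal is a JVM `ShutdownHook` via `Runtime.getRuntime().addShutdownHook`; the function name here is hypothetical) might look like:

```python
import atexit
import subprocess

def spawn_ps_process(args):
    """Start a PS scheduler/server as a child process and register a cleanup
    hook so it is killed when the parent (driver/executor) exits.

    Python sketch of the ShutdownHook idea above, using atexit as the
    shutdown-hook analog.
    """
    child = subprocess.Popen(args)

    def _kill_child():
        if child.poll() is None:  # child still running
            child.terminate()
            try:
                child.wait(timeout=5)
            except subprocess.TimeoutExpired:
                child.kill()  # escalate if it ignores SIGTERM

    atexit.register(_kill_child)
    return child, _kill_child
```

This way an orphaned scheduler/server cannot linger after the executor that spawned it goes away, which complements the message-timeout idea from the roadmap.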

@butterluo

Any update on the progress?

@velconia

Any update on the progress?

@igorcosta

igorcosta commented Dec 7, 2017

Does this still happen with the latest version, 2.1.2? I've tried the KVStore workers and they seem to work on this latest version.
There's a fix that frees some block space when the workers have to restart: [SPARK-22083]

@lanking520
Member

I have updated a CWiki page here: https://cwiki.apache.org/confluence/display/MXNET/Scala+Project+Status
Let's keep all the TODOs in one place and pick up from there. @yzhliu @nswamy let's close this for now.

@nswamy nswamy closed this as completed Jul 31, 2018
@idibidiart

At this time, is there a possibility of MXNet on Spark similar to TensorFlowOnSpark from Yahoo?

@nswamy
Member

nswamy commented Aug 23, 2018

@idibidiart I am personally very interested in (and will probably work on) getting MXNet on Spark for training. Toward that effort, the Spark community is working on introducing a barrier-mode scheduling that will help run deep-learning frameworks: https://jira.apache.org/jira/browse/SPARK-24374. Reach out to me on ASF Slack (#mxnet channel) if you are interested in collaborating on this.
