This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet on Spark Roadmap #2268

Closed
3 of 5 tasks
yzhliu opened this issue May 28, 2016 · 16 comments

Comments

@yzhliu
Member

yzhliu commented May 28, 2016

#2256 makes MXNet on Spark possible. It works on a stable Spark cluster, but problems arise when it is brought into a more complex environment, e.g., executors may fail and retry, or multiple tasks may be configured to run in one executor.

Related to issue #1637 @tqchen

Here's a roadmap of the issues that may prevent MXNet on Spark from being used in a production environment.

  • KVStore workers (and scheduler/servers) failover. I found that, when a worker fails and restarts, it is not able to connect to the scheduler again. Is this expected behavior for ps-lite? @mli
  • Timeout for the PS scheduler and servers. The PS scheduler and servers are started in a spawned process, but when something goes wrong, e.g., the workers, servers, or scheduler crash, they have no chance to stop themselves. I think the simplest way to solve this problem is to set a timeout T: if the scheduler/server does not receive any message within T seconds, it stops itself.
  • Multiple ps-lite threads in one process. Currently ps-lite is a singleton, but Spark clusters can be configured to run multiple tasks in one executor (i.e., one process).
  • Make device resources transparent to the application. Instead of specifying which GPU to use, users only need to say how many GPUs they want. This is important for clusters that use YARN for resource management.
  • Upload and distribute the MXNet core library to all the worker nodes automatically.
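
The timeout idea in the second bullet could be sketched roughly as a watchdog timer that is reset on every incoming message. This is a hypothetical illustration, not ps-lite's actual API; the class name and callbacks are assumptions:

```python
import threading

class Watchdog:
    """Stop a PS scheduler/server that has received no message for T seconds.

    Hypothetical sketch of the timeout idea above, not ps-lite's actual API.
    """

    def __init__(self, timeout_sec, on_timeout):
        self.timeout_sec = timeout_sec
        self.on_timeout = on_timeout  # e.g., a callback that shuts the process down
        self._timer = None
        self._lock = threading.Lock()

    def _arm(self):
        self._timer = threading.Timer(self.timeout_sec, self.on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def start(self):
        with self._lock:
            self._arm()

    def heartbeat(self):
        """Call whenever a message is received; resets the countdown."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._arm()

    def stop(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
```

The message-handling loop would call `heartbeat()` on every received message; if the loop goes silent for `timeout_sec`, `on_timeout` fires and the process can exit.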
@tqchen
Member

tqchen commented May 28, 2016

As a side note: as far as I know, most GPU-related distributed frameworks rely on a more reliable environment than common data-processing frameworks do.

Due to the complicated nature of learning and the relatively small size of the model, a checkpoint-reloading strategy is usually used instead of complicated fault-tolerance strategies.

@yzhliu
Member Author

yzhliu commented May 29, 2016

Yes, I agree. But since I/O failure/restart is quite common in Spark, I think KVStore workers are required to be able to reconnect. For the servers & scheduler, maybe we should find a way to fail the whole application when they crash.

@tqchen
Member

tqchen commented May 29, 2016

Ideally we would not fail the application, but keep the worker slot occupied, and just make sure the surviving workers reconnect and load checkpoints.
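
The reconnect-and-reload strategy could look roughly like this. All names here are hypothetical and the checkpoint format is simplified (real MXNet has its own `save_checkpoint`/`load_checkpoint` routines); this only illustrates resuming from the newest checkpoint after a restart:

```python
import os
import pickle

def latest_checkpoint(ckpt_dir):
    """Return (epoch, path) of the newest checkpoint, or (0, None) if none exist."""
    ckpts = [f for f in os.listdir(ckpt_dir) if f.startswith("model-")]
    if not ckpts:
        return 0, None
    newest = max(ckpts, key=lambda f: int(f.split("-")[1].split(".")[0]))
    return int(newest.split("-")[1].split(".")[0]), os.path.join(ckpt_dir, newest)

def run_worker(ckpt_dir, num_epochs, train_one_epoch, params=None):
    """Resume from the latest checkpoint after a restart, then keep training."""
    start_epoch, path = latest_checkpoint(ckpt_dir)
    if path is not None:
        with open(path, "rb") as f:
            params = pickle.load(f)  # reload model state after a restart
    for epoch in range(start_epoch, num_epochs):
        params = train_one_epoch(params, epoch)
        # checkpoint after every epoch so a restarted worker loses little work
        with open(os.path.join(ckpt_dir, "model-%d.pkl" % (epoch + 1)), "wb") as f:
            pickle.dump(params, f)
    return params
```

A restarted worker calling `run_worker` again simply picks up at the epoch recorded by the newest checkpoint instead of starting over.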

@mli
Member

mli commented May 31, 2016

Hi Yizhi,

Thanks for the feedback. We have several options for fault tolerance. We tried the most expensive solution before, namely the continuous chain replication described in the OSDI paper. It, however, may affect performance a lot, and it is hard to make it run smoothly.

The current ps-lite has little fault tolerance. The reason is that I want to understand more about the environment first. So my questions are:

  1. Is it possible to set the highest job priority, so that the chance of being preempted within several iterations is small? Then we can checkpoint the model every iteration.
  2. Is it possible to make the servers stable even though a worker may be preempted at any time? Then we can do continuous fault tolerance for the workers.

We are also considering running MXNet as a service in a cloud such as AWS. We hope to have a single solution that works in both situations.


@yzhliu
Member Author

yzhliu commented May 31, 2016

Hi Mu,

I think your idea makes sense. We do have a chance to checkpoint in most use-cases. For the 2nd point, the only thing I can see causing a server crash is the physical node itself going down. In such a situation, which is rare, we can simply fail the application. But as you said, we do need continuous fault tolerance for the workers.

@mli
Member

mli commented Jun 2, 2016

@Javelinjs for workers, it is doable.

  1. If using async-SGD, then it's OK to lose any worker at any time.
  2. If using sync-SGD, it's a little bit tricky. We may relax the constraint from "wait for all workers' data" to "wait for 95% of workers' data". The only problem is that we then need to change batch_size dynamically according to how many workers are alive: https://github.com/dmlc/mxnet/blob/master/python/mxnet/model.py#L764
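
The dynamic batch_size adjustment in point 2 could be sketched like this. The helper name and the exact rescaling policy are assumptions; it only illustrates waiting for a fraction of workers and splitting the global batch among the survivors:

```python
def effective_batch_size(global_batch_size, num_workers, alive_workers,
                         min_alive_frac=0.95):
    """Split a global batch across whichever workers are still alive.

    Hypothetical sketch of the sync-SGD relaxation discussed above: proceed
    once min_alive_frac of the workers are present, and rescale each
    survivor's share so the global batch size stays roughly constant.
    """
    if alive_workers < min_alive_frac * num_workers:
        raise RuntimeError("too few workers alive to proceed")
    # each surviving worker picks up a proportionally larger slice
    return global_batch_size // alive_workers
```

With 100 workers and a global batch of 1000, losing 5 workers would bump each survivor's slice from 10 to about 10.5 samples, keeping the effective gradient roughly comparable across iterations.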

@yzhliu
Member Author

yzhliu commented Jun 3, 2016

Spark will restart failed tasks automatically. In that situation, lost workers need to reconnect to the scheduler and servers. Is this now supported in async mode?

@mli
Member

mli commented Jun 3, 2016

It was there before; I can add it back.


@yzhliu
Member Author

yzhliu commented Jun 3, 2016

Great, please do. Currently MXNet on Spark uses async mode.

@liyuance
Contributor

@Javelinjs For the 2nd point, we can consider registering a "kill child process" action in a ShutdownHook; then the PS scheduler and server processes will exit immediately when the Spark driver or executor stops.
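
The same idea expressed in Python terms (the actual proposal is a JVM `ShutdownHook` via `Runtime.getRuntime().addShutdownHook`; the function name here is hypothetical) might look like:

```python
import atexit
import subprocess

def spawn_ps_process(args):
    """Start a PS scheduler/server as a child process and register a cleanup
    hook so it is killed when the parent (driver/executor) exits.

    Python sketch of the ShutdownHook idea above, using atexit as the
    shutdown-hook analog.
    """
    child = subprocess.Popen(args)

    def _kill_child():
        if child.poll() is None:  # child still running
            child.terminate()
            try:
                child.wait(timeout=5)
            except subprocess.TimeoutExpired:
                child.kill()  # escalate if it ignores SIGTERM

    atexit.register(_kill_child)
    return child, _kill_child
```

This way an orphaned scheduler/server cannot linger after the executor that spawned it goes away, which complements the message-timeout idea from the roadmap.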

@butterluo

Any update on the progress?

@velconia

Any update on the progress?

@igorcosta

igorcosta commented Dec 7, 2017

Does this still happen with the latest version, 2.1.2? I've tried the KVStore workers and they seem to work on this latest version.
There's a fix that frees some block space when the workers have to restart: [SPARK-22083]

@lanking520
Member

I have updated a CWiki page here: https://cwiki.apache.org/confluence/display/MXNET/Scala+Project+Status
Let's keep all the TODOs in one place and pick up from there. @yzhliu @nswamy let's close this for now.

@nswamy nswamy closed this as completed Jul 31, 2018
@idibidiart

At this time, is there a possibility of MXNet on Spark similar to TensorFlowOnSpark from Yahoo?

@nswamy
Member

nswamy commented Aug 23, 2018

@idibidiart I am personally very interested in (and will probably work on) getting MXNet on Spark for training. Toward that effort, the Spark community is working on introducing a barrier-mode scheduling that will help run deep-learning frameworks: https://jira.apache.org/jira/browse/SPARK-24374. Reach out to me on ASF Slack (#mxnet channel) if you are interested in collaborating on this.
