MXNet on Spark Roadmap #2268
Comments
As a side note: as far as I know, most GPU-related distributed frameworks rely on a more reliable environment than common data processing frameworks. Due to the complicated nature of learning and the relatively small size of the model, a checkpoint-reloading strategy is usually used instead of complicated fault-tolerance strategies. |
Yes, I agree. But since IO failure/restart is quite common in Spark, I think it is required that KVStore workers be able to reconnect. For the servers and scheduler, maybe we should find a way to fail the whole application when they crash. |
The ideal way is not to fail the application, but to keep occupying the working cell and just make sure the working ones reconnect and load checkpoints. |
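The checkpoint-reloading strategy discussed above can be sketched roughly as follows. This is a minimal, framework-agnostic illustration in plain Python, not MXNet's or ps-lite's actual mechanism: `save_checkpoint`, `load_checkpoint`, `train`, and the `fail_at` knob are hypothetical names introduced only for this example, and the "model" is just a list of numbers.

```python
import json
import os
import tempfile


def save_checkpoint(path, epoch, params):
    """Write the training state atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"epoch": epoch, "params": params}, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return (next_epoch, params), or (0, None) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        state = json.load(f)
    return state["epoch"] + 1, state["params"]


def train(path, num_epochs, step, init_params, fail_at=None):
    """Run `step` once per epoch, resuming from the last checkpoint if present.

    `fail_at` is a hypothetical knob used only to simulate a worker crash.
    """
    start_epoch, params = load_checkpoint(path)
    if params is None:
        params = init_params
    for epoch in range(start_epoch, num_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated worker failure")
        params = step(params)
        save_checkpoint(path, epoch, params)
    return params
```

A restarted worker simply calls `train` again with the same `path`: it picks up at the epoch after the last saved one instead of starting over, which is the behaviour the comments above ask for when a failed task is retried.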
hi yizhi, thanks for the feedback. we have several options for fault tolerance. the current ps-lite has little fault tolerance. the reason is that i want
we are also considering running mxnet as a service in a cloud such as aws. we
|
Hi Mu, I think your idea makes sense. We do have the chance to checkpoint in most use cases. For the 2nd point, the only thing I can see that would cause a server crash is the physical node itself going down. In such a situation, which is rare, we can simply fail the application. But as you said, we do need continuous fault tolerance for |
@Javelinjs for workers, it is doable.
|
Spark will restart failed tasks automatically. In such a situation, lost workers need to reconnect to the scheduler and servers. Is this now supported in async mode? |
it was here before. i can add it back.
|
Great, please. Currently mxnet on spark is using async mode. |
@Javelinjs For the 2nd point, we can consider registering a "kill child process" action in a ShutdownHook, so that the PS scheduler and server processes exit immediately when the Spark driver or executor stops. |
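The "kill child process on shutdown" idea can be sketched as below. This is only an illustration of the pattern in Python, using `atexit` as a stand-in for a JVM shutdown hook (on the JVM the corresponding call would be `Runtime.getRuntime().addShutdownHook`). The helper names `start_sleeper`, `kill_children`, and `register_cleanup` are hypothetical, and the sleeping child merely stands in for a PS scheduler or server process.

```python
import atexit
import subprocess
import sys


def start_sleeper(seconds):
    """Hypothetical stand-in for a long-running PS scheduler/server process."""
    code = f"import time; time.sleep({seconds})"
    return subprocess.Popen([sys.executable, "-c", code])


def kill_children(children, timeout=5.0):
    """Terminate every child still running; escalate to kill() if one hangs."""
    for child in children:
        if child.poll() is None:  # child has not exited yet
            child.terminate()
            try:
                child.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                child.kill()
                child.wait()


def register_cleanup(children):
    """Arrange for kill_children(children) to run when the parent exits normally."""
    atexit.register(kill_children, children)
```

A parent would call `register_cleanup(children)` once, right after launching its children. Note that, like a JVM shutdown hook, an `atexit` handler does not run if the parent itself is killed with SIGKILL, so this covers orderly stops rather than every failure mode.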
Any update on progress? |
Does this happen even with the latest version, 2.1.2? I've tried the KVStore workers, and they seem to work on this latest version. |
I have updated with a CWiki page here: https://cwiki.apache.org/confluence/display/MXNET/Scala+Project+Status |
At this time, is there a possibility of MXNet on Spark similar to TensorFlowOnSpark from Yahoo? |
@idibidiart I am personally very interested in (and will probably work on) getting MXNet on Spark for training. As part of that effort, the Spark community is working to introduce barrier-mode scheduling that will help run deep learning frameworks: https://jira.apache.org/jira/browse/SPARK-24374. Reach out to me on ASF Slack (#mxnet channel) if you are interested in collaborating on this. |
#2256 makes MXNet on Spark possible. It works on a stable Spark cluster, but problems arise when it is brought to a complex environment where, e.g., executors may fail and retry, or multiple tasks may be configured to run in one executor.
Related to issue #1637 @tqchen
Here's a roadmap for all the issues that may prevent using MXNet on Spark in a production environment.