PS node recovery support by yzhliu · Pull Request #2671 · apache/mxnet

yzhliu · 2016-07-11T15:38:42Z

When a worker node dies and restarts, it can rejoin the cluster. The PR "Support recovery nodes" (dmlc/ps-lite#59) needs to be merged first.
Timeout for the PS scheduler and servers for Spark. The PS scheduler and servers are started in a spawned process, so when something goes wrong (e.g., a worker, a server, or the scheduler crashes), they have no chance to stop themselves. I think the simplest way to solve this is to set a timeout T: if the scheduler or a server does not receive any message for T seconds, it stops itself.
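
The timeout idea above can be sketched roughly as follows. This is only an illustration in Python, not the code in this PR or in ps-lite: the class name `IdleWatchdog`, its methods, and the injectable `clock` parameter are all hypothetical, and the real scheduler/server loop is assumed to call `on_message` for each received message and poll `should_stop` periodically.

```python
import time


class IdleWatchdog:
    """Hypothetical helper: reports when no message arrived for `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout          # T, in seconds
        self._clock = clock             # injectable so the logic can be tested
        self._last_seen = clock()       # start counting from creation time

    def on_message(self):
        # Any incoming message resets the idle timer.
        self._last_seen = self._clock()

    def should_stop(self):
        # True once the node has been idle for at least `timeout` seconds.
        return self._clock() - self._last_seen >= self.timeout
```

A monotonic clock is used here so that wall-clock adjustments cannot trigger or delay the shutdown; whether the actual implementation does this is not stated in the thread.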

Related to #2268

piiswrong · 2016-07-11T18:20:13Z

mli · 2016-07-12T20:53:30Z

include/mxnet/kvstore.h

+   *        will be presumed as 'dead'
+   *
+   * Always return 0 when type == "local"
+   */


Please consider changing it to get_num_dead_node or get_num_timeout_node. dead_num doesn't sound like English.
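
For readers unfamiliar with the behavior documented in the comment above, here is a rough sketch of what "number of dead nodes" could mean. The excerpt does not show how a node is judged dead, so the heartbeat timestamps, the function name `num_dead_nodes`, and its parameters are assumptions made for illustration, not the kvstore API. The only detail taken from the diff is that a "local" store always reports 0.

```python
def num_dead_nodes(last_heartbeat, now, timeout, kv_type="dist_sync"):
    """Count nodes whose last heartbeat is older than `timeout` seconds.

    `last_heartbeat` maps a node name to the time it was last heard from
    (an assumed bookkeeping structure, not taken from the PR).
    """
    if kv_type == "local":
        # A local store has no remote peers, so it always reports 0.
        return 0
    return sum(1 for seen in last_heartbeat.values() if now - seen > timeout)
```

For example, with heartbeats at 100, 40 and 95 seconds, `now=110` and `timeout=60`, only the node last seen at 40 counts as dead.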

yzhliu · 2016-07-31T15:39:30Z

@mli Updated; it now also supports exiting without a barrier.

mli · 2016-07-31T20:33:43Z

LGTM, thanks. Please rebase, then we can merge it.

mli reviewed Jul 12, 2016

yzhliu force-pushed the dist-tol branch from 31103e3 to 6681d9f on July 24, 2016 09:28

yzhliu added 5 commits July 31, 2016 15:34

get number of dead node from kvstore

3f8395a

[scala] support kvstore exit when others go out

bc37d72

support worker node recovery

7673215

allow ps worker to exist without barrier

70e2e75

code lint

8fa1a35

yzhliu force-pushed the dist-tol branch from 6681d9f to 8fa1a35 on July 31, 2016 07:55

tqchen merged commit c49ef1a into apache:master Aug 2, 2016

tqchen merged 5 commits into apache:master from yzhliu:dist-tol
