This repository was archived by the owner on Nov 17, 2023. It is now read-only.

PS node recovery support #2671

Merged
tqchen merged 5 commits into apache:master from yzhliu:dist-tol
Aug 2, 2016

@yzhliu
Member

yzhliu commented Jul 11, 2016

  • When a worker node dies and restarts, it can rejoin the cluster. The ps-lite PR "Support recovery nodes" (dmlc/ps-lite#59) needs to be merged first.
  • Timeout for the PS scheduler and servers on Spark. The PS scheduler and servers are started in spawned processes, but when something goes wrong (e.g., a worker, a server, or the scheduler crashes), they have no chance to stop themselves. I think the simplest fix is to set a timeout T: if the scheduler or a server does not receive any message for T seconds, it stops itself.

Related to #2268
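
For readers following the timeout idea in the second bullet, here is a minimal sketch of an idle watchdog in Python. It is not the code in this PR: the class name `IdleWatchdog`, its methods, and the injectable `clock` parameter are all hypothetical, chosen only to illustrate "stop yourself if no message arrives for T seconds".

```python
import threading
import time


class IdleWatchdog:
    """Hypothetical helper: fires `on_timeout` once if no message is seen for `timeout_sec`."""

    def __init__(self, timeout_sec, on_timeout, clock=time.monotonic):
        self._timeout = timeout_sec
        self._on_timeout = on_timeout
        self._clock = clock              # injectable so the logic can be tested without sleeping
        self._last_seen = clock()
        self._lock = threading.Lock()
        self._stopped = threading.Event()

    def touch(self):
        """Record that a message was just received."""
        with self._lock:
            self._last_seen = self._clock()

    def expired(self):
        """True once at least `timeout_sec` has passed since the last message."""
        with self._lock:
            return self._clock() - self._last_seen >= self._timeout

    def run(self, poll_sec=0.05):
        """Poll until the idle timeout expires or stop() is called."""
        while not self._stopped.wait(poll_sec):
            if self.expired():
                self._on_timeout()
                return

    def stop(self):
        """Cancel the watchdog during a normal shutdown."""
        self._stopped.set()
```

In a spawned scheduler or server process, `run` would typically sit in a daemon thread, the message handler would call `touch()` on every received message, and `on_timeout` would trigger the process's own shutdown path.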

@piiswrong
Contributor

@mli

* will be presumed as 'dead'
*
* Always return 0 when type == "local"
*/
Contributor

Please consider changing this to get_num_dead_node or get_num_timeout_node; dead_num doesn't sound like English.
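
To make the semantics in the quoted doc comment concrete, here is a rough Python sketch of a dead-node count. It assumes, since the comment is truncated here, that a node is presumed dead when its last heartbeat is older than a timeout. The function name `count_dead_nodes` and all of its parameters are hypothetical and are not the API under review.

```python
def count_dead_nodes(last_heartbeat, now, timeout_sec, kv_type="dist"):
    """Count nodes whose last heartbeat is more than `timeout_sec` old.

    `last_heartbeat` maps a node id to the time its last heartbeat was received.
    A "local" store has no remote nodes, so the count is always 0 in that case.
    """
    if kv_type == "local":
        return 0
    return sum(1 for seen in last_heartbeat.values() if now - seen > timeout_sec)
```

For example, with heartbeats last seen at times 0, 50, and 95, a current time of 100, and a 60-second timeout, only the first node counts as dead.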

@yzhliu
Member Author

yzhliu commented Jul 31, 2016

@mli Updated; it now also supports non-barrier exit.

@mli
Contributor

mli commented Jul 31, 2016

LGTM, thanks. Please rebase, then we can merge it.

@tqchen tqchen merged commit c49ef1a into apache:master Aug 2, 2016

4 participants