Where is the model saved during distributed training? #36

Closed
lixusign opened this issue Feb 19, 2019 · 16 comments

Comments

@lixusign

Where is the model saved during distributed training?

@lixusign
Author

Hi, is anyone there?

@lixusign
Author

Does the pip-installed Euler not support distributed training?

@yangsiran
Member

yangsiran commented Feb 20, 2019

@lixusign

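  1. The command-line arguments include a --model_dir option. It defaults to the ckpt folder under the current local directory (on worker 0), and the model is saved there. It is effectively the TensorFlow checkpoint path;
  2. Yes, distributed training is currently supported.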
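
As a rough illustration of what "the TensorFlow checkpoint path" means in practice, the sketch below looks inside a checkpoint directory and reports the most recent checkpoint prefix. It relies only on the plain-text `checkpoint` state file that TensorFlow writes next to its checkpoints. The helper name `latest_checkpoint_prefix` is hypothetical and is not part of Euler or TensorFlow.

```python
import os
import re


def latest_checkpoint_prefix(model_dir):
    """Return the newest checkpoint prefix recorded in model_dir, or None.

    Hypothetical helper: it parses the text-format `checkpoint` state file
    that TensorFlow keeps in a checkpoint directory.
    """
    state_file = os.path.join(model_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None  # nothing has been saved here yet
    with open(state_file) as f:
        for line in f:
            # The newest checkpoint is listed under `model_checkpoint_path`.
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"\s*$', line)
            if match:
                path = match.group(1)
                # Relative entries are resolved against model_dir.
                if os.path.isabs(path):
                    return path
                return os.path.join(model_dir, path)
    return None
```

Running `latest_checkpoint_prefix("ckpt")` from the working directory on worker 0 would then show whether anything has been written to the default location. Within TensorFlow itself, `tf.train.latest_checkpoint(model_dir)` performs the same lookup.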

@lixusign
Author

Hi, if the pip-installed Euler does not support HDFS, how can it support distributed training?

@yangsiran
Member

yangsiran commented Feb 21, 2019

@lixusign The 0.1.0 package on PyPI does support HDFS. For newer versions, you need to build from source with the HDFS option enabled.

@chengenbao
Contributor

chengenbao commented Feb 21, 2019 via email

@lixusign
Author

Thanks a lot, everyone. I'll try the 0.1.0 release from PyPI first. Building from source is a hassle, and it requires specific versions of many system dependencies.

@lixusign
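Author

Hello, below is the log from a distributed training run of ppi-graphSage. What does it mean when it hangs at this point? I'm using Euler 0.1.0 installed via pip + HDFS 2.9.2 + ZK + TensorFlow 1.12, currently with 2 workers + 1 ps.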

I0218 11:22:33.431810 19041 graph_builder.cc:84] Thread 98, job size: 0

I0218 11:22:33.432725 19041 graph_builder.cc:84] Thread 99, job size: 0

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

19/02/18 11:22:33 WARN hdfs.DFSClient: zero

I0218 11:22:33.970232 19075 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_train.id

I0218 11:22:34.135093 19070 graph_builder.cc:59] Load Done: hdfs://xxx :9000/user/euler/ppi/ppi-walks.txt

I0218 11:22:34.300680 19069 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-id_map.json

I0218 11:22:34.424460 19067 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-class_map.json

I0218 11:22:34.482018 19074 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_test.id

I0218 11:22:34.498394 19076 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_val.id

I0218 11:22:36.113641 19073 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_meta.json

I0218 11:22:37.140648 19068 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-feats.npy

I0218 11:22:37.287214 19066 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-G.json

I0218 11:22:37.952857 19072 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_data.json

I0218 11:22:38.380856 19071 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_data.dat

I0218 11:22:38.395088 19041 graph_builder.cc:102] Done: build node sampler

I0218 11:22:38.395136 19041 graph_builder.cc:112] Graph build finish

I0218 11:22:38.395155 19041 graph_service.cc:179] service init finish

I0218 11:22:38.396541 19041 graph_service.cc:131] bound port: xxx:32804

W0218 11:22:38.448446 19041 graph.h:148] global sampler is not ok

I0218 11:22:38.451355 19041 graph_service.cc:146] service start

I0218 11:22:38.463107 19045 zk_server_monitor.cc:238] Online node: 0#ip:32804.

I0218 11:22:38.463485 18999 remote_graph.cc:106] Retrieve meta info success, shard number: 2

I0218 11:22:38.463508 18999 remote_graph.cc:119] Retrieve meta info success, partition number: 1

I0218 11:22:38.463533 18999 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: node_sum_weight, Meta Info: 44906.000000,6514.000000,5524.000000

I0218 11:22:38.463547 18999 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: edge_sum_weight, Meta Info:

@chengenbao
Contributor

chengenbao commented Feb 21, 2019 via email

@lixusign
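Author

Hello, here is my situation: I have only one shard and 2 workers, so only one worker will load a shard. Does that not work?
Also, about saving the trained model as mentioned above: with the setting --model_dir:"/model", will it be saved to that directory on worker 0?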

@chengenbao
Contributor

chengenbao commented Feb 21, 2019 via email

@lixusign
Author

Thanks a lot. One more small question: after training finishes, the ps service does not exit.

@chengenbao
Contributor

chengenbao commented Feb 22, 2019 via email

@lixusign
Author

OK, thanks a lot. I'll look into it first. This issue can be closed.

@chengenbao
Contributor

chengenbao commented Feb 22, 2019 via email

@lixusign
Author

ok
