Where is the model saved in distributed training? #36
Comments
hihi, are you there?
Does the pip-installed euler not support distributed training?
hihi, if the pip-installed euler can't use HDFS, how can I do distributed training?
@lixusign The 0.1.0 release on PyPI does support HDFS. Newer versions need to be built from source with the HDFS option enabled.
Distributed training does not necessarily require HDFS; NFS works as well.
Many thanks! I'll try the 0.1.0 PyPI install first; building from source is a hassle, and the system has lots of dependencies that each need specific versions.
Hello, below is the log from distributed training of ppi-graphSage. What does it mean when it gets stuck here? I'm using the pip-installed euler 0.1.0 + hdfs 2.9.2 + zk + tensorflow 1.12, currently with 2 workers + 1 ps.
I0218 11:22:33.431810 19041 graph_builder.cc:84] Thread 98, job size: 0
I0218 11:22:33.432725 19041 graph_builder.cc:84] Thread 99, job size: 0
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
I0218 11:22:33.970232 19075 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_train.id
I0218 11:22:34.135093 19070 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-walks.txt
I0218 11:22:34.300680 19069 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-id_map.json
I0218 11:22:34.424460 19067 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-class_map.json
I0218 11:22:34.482018 19074 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_test.id
I0218 11:22:34.498394 19076 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_val.id
I0218 11:22:36.113641 19073 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_meta.json
I0218 11:22:37.140648 19068 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-feats.npy
I0218 11:22:37.287214 19066 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-G.json
I0218 11:22:37.952857 19072 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_data.json
I0218 11:22:38.380856 19071 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_data.dat
I0218 11:22:38.395088 19041 graph_builder.cc:102] Done: build node sampler
I0218 11:22:38.395136 19041 graph_builder.cc:112] Graph build finish
I0218 11:22:38.395155 19041 graph_service.cc:179] service init finish
I0218 11:22:38.396541 19041 graph_service.cc:131] bound port: xxx:32804
W0218 11:22:38.448446 19041 graph.h:148] global sampler is not ok
I0218 11:22:38.451355 19041 graph_service.cc:146] service start
I0218 11:22:38.463107 19045 zk_server_monitor.cc:238] Online node: 0#ip:32804.
I0218 11:22:38.463485 18999 remote_graph.cc:106] Retrieve meta info success, shard number: 2
I0218 11:22:38.463508 18999 remote_graph.cc:119] Retrieve meta info success, partition number: 1
I0218 11:22:38.463533 18999 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: node_sum_weight, Meta Info: 44906.000000,6514.000000,5524.000000
I0218 11:22:38.463547 18999 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: edge_sum_weight, Meta Info:
From the log it looks like one shard's euler server never started.
Hi, here's my situation: I only have one shard, and with 2 workers only one of them would load that shard. Is that not allowed?
"I0218 11:22:38.463485 18999 remote_graph.cc:106] Retrieve meta info success, shard number: 2"
This line shows there are actually two shards, so your euler server launch command is probably wrong:
def run_distributed(flags_obj, run):
  cluster = tf.train.ClusterSpec({
      'ps': flags_obj.ps_hosts,
      'worker': flags_obj.worker_hosts
  })
  server = tf.train.Server(
      cluster, job_name=flags_obj.job_name, task_index=flags_obj.task_index)

  if flags_obj.job_name == 'ps':
    server.join()
  elif flags_obj.job_name == 'worker':
    if not euler_ops.initialize_shared_graph(
        directory=flags_obj.data_dir,
        zk_addr=flags_obj.euler_zk_addr,
        zk_path=flags_obj.euler_zk_path,
        shard_idx=flags_obj.task_index,
        shard_num=len(flags_obj.worker_hosts),
        global_sampler_type='node'):
      raise RuntimeError('Failed to initialize graph.')
According to this code, there are as many shards as workers, so most likely your task_index is not set correctly.
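To illustrate, a launch along these lines would give each worker a distinct task_index. This is only a sketch: the script name run_ppi.py and the host names are placeholders, while the flag names follow the run_distributed() code above.

```shell
# Hypothetical launch commands; every process sees the same cluster spec,
# and each worker gets a unique task_index in 0..num_workers-1.
python run_ppi.py --job_name=ps     --task_index=0 \
    --ps_hosts=host0:2222 --worker_hosts=host1:2223,host2:2224 &
python run_ppi.py --job_name=worker --task_index=0 \
    --ps_hosts=host0:2222 --worker_hosts=host1:2223,host2:2224 &
python run_ppi.py --job_name=worker --task_index=1 \
    --ps_hosts=host0:2222 --worker_hosts=host1:2223,host2:2224 &
```

With two workers, shard_num is 2 and the two task_index values 0 and 1 select which shard each worker loads; if both workers were started with task_index=0, only shard 0 would ever come online.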
Also, about the model-saving question above: the config is --model_dir "/model", so the model will just be saved to that directory on worker0?
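For reference, in TF 1.x parameter-server training it is the chief worker (conventionally task_index 0) that writes checkpoints, so a local path like /model ends up on worker 0's filesystem; a shared path (NFS/HDFS) is needed if other machines should see the files. A minimal sketch, assuming flags_obj, server, and train_op are defined as in the code above:

```python
import tensorflow as tf

# Only the chief writes checkpoints and summaries; other workers
# connect to the same cluster but do not save.
is_chief = (flags_obj.task_index == 0)
with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=is_chief,
        checkpoint_dir=flags_obj.model_dir) as sess:
    while not sess.should_stop():
        sess.run(train_op)  # training op defined elsewhere in the model
```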
Thanks a lot. One more small question: after training finishes, the ps server never exits.
After training finishes you currently have to kill the ps manually, because the ps just calls server.join(), which blocks forever. If you find that inelegant, you can use a dequeue op to let the ps exit at the right moment; see https://stackoverflow.com/questions/39810356/shut-down-server-in-tensorflow
Thanks.
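The dequeue trick from the linked StackOverflow answer can be sketched as follows (TF 1.x API; the function and queue names here are illustrative, not part of euler):

```python
import tensorflow as tf

def create_done_queue(ps_index, num_workers):
    # One shared queue pinned to each ps task; workers enqueue a token
    # when they finish training.
    with tf.device('/job:ps/task:%d' % ps_index):
        return tf.FIFOQueue(num_workers, tf.int32,
                            shared_name='done_queue%d' % ps_index)

def run_ps(server, ps_index, num_workers):
    # Instead of server.join(), block until every worker has checked in,
    # then return so the ps process can exit cleanly.
    queue = create_done_queue(ps_index, num_workers)
    with tf.Session(server.target) as sess:
        for _ in range(num_workers):
            sess.run(queue.dequeue())

def signal_done(sess, num_ps, num_workers):
    # Called by each worker after training: notify every ps task.
    for i in range(num_ps):
        sess.run(create_done_queue(i, num_workers).enqueue(1))
```

The shared_name is what lets the ps and worker sessions refer to the same queue resource across processes.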
OK, many thanks, I'll look into it. This issue can be closed.
If you feel everything is resolved, you can close it; it seems a GitHub issue can only be closed by the person who opened it.
ok |