Where is the model saved in distributed training? #36
Comments
hihi, are you there?
Does the pip-installed euler not support distributed training?
hihi, if the pip-installed euler can't use HDFS, how can I do distributed training?
@lixusign The 0.1.0 release on PyPI does support HDFS. Newer versions need to be built from source with the HDFS option enabled.
Distributed training does not necessarily require HDFS; NFS works as well.
Many thanks! I'll try the 0.1.0 PyPI install first; building from source is a hassle, and the system has lots of dependencies that each need specific versions.
Hello, below is the log from distributed training of ppi-graphSage. What does it mean when it gets stuck here? I'm using the pip-installed euler 0.1.0 + hdfs 2.9.2 + zk + tensorflow 1.12, currently with 2 workers + 1 ps.
I0218 11:22:33.431810 19041 graph_builder.cc:84] Thread 98, job size: 0
I0218 11:22:33.432725 19041 graph_builder.cc:84] Thread 99, job size: 0
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
19/02/18 11:22:33 WARN hdfs.DFSClient: zero
I0218 11:22:33.970232 19075 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_train.id
I0218 11:22:34.135093 19070 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-walks.txt
I0218 11:22:34.300680 19069 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-id_map.json
I0218 11:22:34.424460 19067 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-class_map.json
I0218 11:22:34.482018 19074 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_test.id
I0218 11:22:34.498394 19076 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_val.id
I0218 11:22:36.113641 19073 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_meta.json
I0218 11:22:37.140648 19068 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-feats.npy
I0218 11:22:37.287214 19066 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi-G.json
I0218 11:22:37.952857 19072 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_data.json
I0218 11:22:38.380856 19071 graph_builder.cc:59] Load Done: hdfs://xxx:9000/user/euler/ppi/ppi_data.dat
I0218 11:22:38.395088 19041 graph_builder.cc:102] Done: build node sampler
I0218 11:22:38.395136 19041 graph_builder.cc:112] Graph build finish
I0218 11:22:38.395155 19041 graph_service.cc:179] service init finish
I0218 11:22:38.396541 19041 graph_service.cc:131] bound port: xxx:32804
W0218 11:22:38.448446 19041 graph.h:148] global sampler is not ok
I0218 11:22:38.451355 19041 graph_service.cc:146] service start
I0218 11:22:38.463107 19045 zk_server_monitor.cc:238] Online node: 0#ip:32804.
I0218 11:22:38.463485 18999 remote_graph.cc:106] Retrieve meta info success, shard number: 2
I0218 11:22:38.463508 18999 remote_graph.cc:119] Retrieve meta info success, partition number: 1
I0218 11:22:38.463533 18999 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: node_sum_weight, Meta Info: 44906.000000,6514.000000,5524.000000
I0218 11:22:38.463547 18999 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: edge_sum_weight, Meta Info:
From the log it looks like one shard's euler server never started.
Hi, here's my situation: I only have one shard, and with 2 workers only one of them would load that shard. Is that not allowed?
"I0218 11:22:38.463485 18999 remote_graph.cc:106] Retrieve meta info success, shard number: 2"
This line shows there are actually two shards, so your euler server launch command is probably wrong:
def run_distributed(flags_obj, run):
  cluster = tf.train.ClusterSpec({
      'ps': flags_obj.ps_hosts,
      'worker': flags_obj.worker_hosts
  })
  server = tf.train.Server(
      cluster, job_name=flags_obj.job_name, task_index=flags_obj.task_index)

  if flags_obj.job_name == 'ps':
    server.join()
  elif flags_obj.job_name == 'worker':
    if not euler_ops.initialize_shared_graph(
        directory=flags_obj.data_dir,
        zk_addr=flags_obj.euler_zk_addr,
        zk_path=flags_obj.euler_zk_path,
        shard_idx=flags_obj.task_index,
        shard_num=len(flags_obj.worker_hosts),
        global_sampler_type='node'):
      raise RuntimeError('Failed to initialize graph.')
According to this code, there are as many shards as workers, so most likely your task_index is not set correctly.
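To illustrate, a launch along these lines would give each worker a distinct task_index. This is only a sketch: the script name run_ppi.py and the host names are placeholders, while the flag names follow the run_distributed() code above.

```shell
# Hypothetical launch commands; every process sees the same cluster spec,
# and each worker gets a unique task_index in 0..num_workers-1.
python run_ppi.py --job_name=ps     --task_index=0 \
    --ps_hosts=host0:2222 --worker_hosts=host1:2223,host2:2224 &
python run_ppi.py --job_name=worker --task_index=0 \
    --ps_hosts=host0:2222 --worker_hosts=host1:2223,host2:2224 &
python run_ppi.py --job_name=worker --task_index=1 \
    --ps_hosts=host0:2222 --worker_hosts=host1:2223,host2:2224 &
```

With two workers, shard_num is 2 and the two task_index values 0 and 1 select which shard each worker loads; if both workers were started with task_index=0, only shard 0 would ever come online.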
Also, about the model-saving question above: the config is --model_dir "/model", so the model will just be saved to that directory on worker0?
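For reference, in TF 1.x parameter-server training it is the chief worker (conventionally task_index 0) that writes checkpoints, so a local path like /model ends up on worker 0's filesystem; a shared path (NFS/HDFS) is needed if other machines should see the files. A minimal sketch, assuming flags_obj, server, and train_op are defined as in the code above:

```python
import tensorflow as tf

# Only the chief writes checkpoints and summaries; other workers
# connect to the same cluster but do not save.
is_chief = (flags_obj.task_index == 0)
with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=is_chief,
        checkpoint_dir=flags_obj.model_dir) as sess:
    while not sess.should_stop():
        sess.run(train_op)  # training op defined elsewhere in the model
```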
Thanks a lot. One more small question: after training finishes, the ps server never exits.
After training finishes you currently have to kill the ps manually, because the ps just calls server.join(), which blocks forever. If you find that inelegant, you can use a dequeue op to let the ps exit at the right moment; see https://stackoverflow.com/questions/39810356/shut-down-server-in-tensorflow
Thanks.
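The dequeue trick from the linked StackOverflow answer can be sketched as follows (TF 1.x API; the function and queue names here are illustrative, not part of euler):

```python
import tensorflow as tf

def create_done_queue(ps_index, num_workers):
    # One shared queue pinned to each ps task; workers enqueue a token
    # when they finish training.
    with tf.device('/job:ps/task:%d' % ps_index):
        return tf.FIFOQueue(num_workers, tf.int32,
                            shared_name='done_queue%d' % ps_index)

def run_ps(server, ps_index, num_workers):
    # Instead of server.join(), block until every worker has checked in,
    # then return so the ps process can exit cleanly.
    queue = create_done_queue(ps_index, num_workers)
    with tf.Session(server.target) as sess:
        for _ in range(num_workers):
            sess.run(queue.dequeue())

def signal_done(sess, num_ps, num_workers):
    # Called by each worker after training: notify every ps task.
    for i in range(num_ps):
        sess.run(create_done_queue(i, num_workers).enqueue(1))
```

The shared_name is what lets the ps and worker sessions refer to the same queue resource across processes.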
OK, many thanks, I'll look into it. This issue can be closed.
If you feel everything is resolved, you can close it; it seems a GitHub issue can only be closed by the person who opened it.
ok |