This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
I cloned the code from this repo, built it with USE_DIST_KVSTORE=1 in an ubuntu:16.04 container, started multiple containers from that build, and then ran the commands below:
$ vim hosts
# the two containers can connect to each other by ssh with no auth
172.17.0.1
172.17.0.2
# start training
$ python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet --kv-store dist_sync
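Before launching, it may be worth confirming that every address in `hosts` actually accepts TCP connections on the ssh port. The sketch below is only a preflight check, not part of MXNet or `launch.py`. The helper names `parse_hosts`, `can_connect`, and `check_hosts` are made up for illustration, port 22 is assumed to be the ssh port, and skipping `#` lines is this helper's own convention (it is an assumption that comment lines should be ignored).

```python
import socket


def parse_hosts(text):
    """Return the non-empty lines of a hosts file, skipping '#' comment lines.

    Skipping comments is an assumption of this helper, not documented
    behaviour of launch.py.
    """
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


def can_connect(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True


def check_hosts(path, port=22):
    """Return a list of (host, reachable) pairs for every host in the file."""
    with open(path) as f:
        hosts = parse_hosts(f.read())
    return [(host, can_connect(host, port)) for host in hosts]
```

Calling `check_hosts("hosts")` from the directory that holds the hosts file returns one pair per address. A `False` entry would suggest a network or addressing problem rather than a training-script problem. A `True` entry only shows that the port is open, not that key-based login works.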
I get the error below:
[03:58:38] src/./zmq_van.h:123: there is no socket to node 11
terminate called after throwing an instance of 'dmlc::Error'
what(): [03:58:38] src/van.cc:132: Check failed: (send_bytes) != (-1)
Stack trace returned 7 entries:
[bt] (0) /root/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x55) [0x7fc2f9929295]
[bt] (1) /root/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fc2f9929d98]
[bt] (2) /root/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Send(ps::Message const&)+0x245) [0x7fc2fc420895]
[bt] (3) /root/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Receiving()+0x2cdb) [0x7fc2fc42501b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fc32aeabc80]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fc33229a6ba]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fc331fd041d]
Aborted (core dumped)
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/root/incubator-mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 366, in <lambda>
target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 134
How can I fix it?
Thanks a lot.