This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

terminate called after throwing an instance of 'dmlc::Error' #9489

Closed
ilibx opened this issue Jan 19, 2018 · 3 comments


ilibx commented Jan 19, 2018

I cloned the code from this repo, built it with USE_DIST_KVSTORE=1 in an ubuntu:16.04 container, started multiple containers, and then ran the commands below:

$ vim hosts 
# the two containers can connect to each other over ssh without a password
172.17.0.1
172.17.0.2

# start training
$ python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet --kv-store dist_sync

I get the error below:

[03:58:38] src/./zmq_van.h:123: there is no socket to node 11
terminate called after throwing an instance of 'dmlc::Error'
  what():  [03:58:38] src/van.cc:132: Check failed: (send_bytes) != (-1) 

Stack trace returned 7 entries:
[bt] (0) /root/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x55) [0x7fc2f9929295]
[bt] (1) /root/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fc2f9929d98]
[bt] (2) /root/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Send(ps::Message const&)+0x245) [0x7fc2fc420895]
[bt] (3) /root/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(ps::Van::Receiving()+0x2cdb) [0x7fc2fc42501b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fc32aeabc80]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fc33229a6ba]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fc331fd041d]


Aborted (core dumped)
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/root/incubator-mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 366, in <lambda>
    target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 134

How can I fix it?

Thanks a lot.

@aodhan-domhnaill

From this,

Exit code 134 means your program was aborted (received SIGABRT), perhaps as a result of a failed assertion.
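To make the arithmetic behind that quote explicit, here is a minimal sketch, assuming the usual shell convention that an exit status above 128 means "128 plus the signal number". The helper name `describe_exit_status` is hypothetical and is not part of MXNet or dmlc-core. It is written for Python 3, while the traceback above comes from Python 2.7.

```python
import signal


def describe_exit_status(status):
    """Explain a shell-style exit status (hypothetical helper).

    Assumes the common shell convention: a status above 128 means the
    process was terminated by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            name = "unknown signal"
        return "killed by signal {} ({})".format(signum, name)
    return "exited with status {}".format(status)


# 134 is the status reported by CalledProcessError in the traceback above.
print(describe_exit_status(134))
```

On Linux, 134 - 128 = 6, which is SIGABRT. That matches the `terminate called` and `Aborted (core dumped)` lines in the log.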

Just as a sanity check, can you ssh into each of the hosts without a password?
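One way to script that check is sketched below, assuming an OpenSSH client and a hosts file in the format shown earlier (one address per line, `#` for comments). The function names are hypothetical. `BatchMode=yes` is a standard OpenSSH option that makes `ssh` fail instead of prompting for a password, so a non-zero return code suggests that key-based login is not working for that host.

```python
import subprocess


def read_hosts(text):
    """Parse hosts-file content: one host per line, '#' starts a comment."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line)
    return hosts


def ssh_check_command(host):
    """Build an ssh command that fails rather than asking for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def check_hosts(hosts):
    """Return {host: True/False} for whether a non-interactive login worked."""
    results = {}
    for host in hosts:
        try:
            proc = subprocess.run(
                ssh_check_command(host),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            results[host] = proc.returncode == 0
        except OSError:
            # The ssh client is not installed or could not be started.
            results[host] = False
    return results


# Show the commands for the two addresses from the hosts file in this issue.
for host in read_hosts("172.17.0.1\n172.17.0.2\n"):
    print(" ".join(ssh_check_command(host)))
```

To actually run the check, call `check_hosts(read_hosts(open("hosts").read()))` from each container. Every host should map to `True` before you start `launch.py`.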

And what happens if you run python train_mnist.py --network lenet --kv-store local?


ilibx commented Jan 20, 2018

@aidan-plenert-macdonald
Thanks for your response. I restarted all the containers, and then the run succeeded. I do not know why.

Before that, I had already run it successfully locally.

ilibx closed this as completed Jan 20, 2018

Davidrjx commented Feb 6, 2018

I think you should run the containers with nvidia-docker. At least that is what I do, and I run MXNet with GPU.
