linear local mode error: JUST_A_UNKNOWN_NODE is disconnected #75
Description
The command is:
repo/dmlc-core/tracker/dmlc-submit --cluster local --env DMLC_CPU_VCORES=1 --env DMLC_MEMORY_MB=512 --num-workers 2 --num-servers 1 --worker-cores 1 --server-cores 1 learn/linear/build/linear.dmlc learn/linear/guide/demo.conf
The client output shows:
Connected 1 servers and 2 workers
Training: iter = 0
sec ttl #ex inc #ex |w|_0 logloss accuracy AUC
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/alalei/wormhole_new2/repo/dmlc-core/tracker/dmlc_tracker/local.py", line 45, in exec_cmd
    raise RuntimeError('Get nonzero return code=%d' % ret)
RuntimeError: Get nonzero return code=-11
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/alalei/wormhole_new2/repo/dmlc-core/tracker/dmlc_tracker/local.py", line 45, in exec_cmd
    raise RuntimeError('Get nonzero return code=%d' % ret)
RuntimeError: Get nonzero return code=-11
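A note on the return code: when a child process is launched through Python's `subprocess` module (assumed here to be what the tracker does), a negative return code `-N` means the child was terminated by signal `N`. On Linux, signal 11 is SIGSEGV, so `-11` most likely indicates that two of the launched processes crashed with a segmentation fault. The sketch below decodes such codes; `describe_return_code` is a hypothetical helper, not part of dmlc_tracker, and it assumes Python 3.5+ for `signal.Signals`.

```python
import signal


def describe_return_code(ret):
    """Describe a subprocess return code (hypothetical helper, not in dmlc_tracker)."""
    if ret >= 0:
        # Zero or positive: the process exited on its own with this status.
        return "exited with status %d" % ret
    # Negative: subprocess reports -N when the child was killed by signal N.
    signum = -ret
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The signal number is not known on this platform.
        return "killed by signal %d" % signum
    return "killed by signal %d (%s)" % (signum, name)


if __name__ == "__main__":
    print(describe_return_code(-11))
```

If the result is SIGSEGV, running the binary under a debugger or enabling core dumps is usually the next step for finding where it crashed.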
/tmp/linear.dmlc.H.log.INFO.20170918-133532.8524 shows:
I0918 13:35:32.054481 8524 van.cc:30] I'm [role: SCHEDULER id: "H" hostname: "10.2.177.240" port: 9092]
I0918 13:35:32.057025 8524 manager.cc:34] Staring system. Logging into /tmp/linear.dmlc.log.*
I0918 13:35:32.068665 8557 workload_pool.h:168] assign W_10.2.177.240_52648 job learn/data/agaricus.txt.train 0 / 10. 1 #jobs on processing.
I0918 13:35:32.068797 8557 workload_pool.h:168] assign W_10.2.177.240_43323 job learn/data/agaricus.txt.train 1 / 10. 2 #jobs on processing.
I0918 13:35:32.125037 8556 manager.cc:275] JUST_A_UNKNOWN_NODE is disconnected
/tmp/linear.dmlc.W_10.2.177.240_43323.log.INFO.20170918-133532.8525 shows:
I0918 13:35:32.048825 8525 van.cc:30] I'm [role: WORKER id: "W_10.2.177.240_43323" hostname: "10.2.177.240" port: 43323]
I0918 13:35:32.069018 8551 minibatch_solver.h:291] iter = 0, training, learn/data/agaricus.txt.train 1 / 10, minibatch = 1000, concurrency = 2, shuffle ratio = 10000, negative sampling =
/tmp/linear.dmlc.S_10.2.177.240_40067.log.INFO.20170918-133532.8529 shows:
I0918 13:35:32.052947 8529 van.cc:30] I'm [role: SERVER id: "S_10.2.177.240_40067" hostname: "10.2.177.240" port: 40067]
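Since the scheduler, worker, and server each write their own log file, it can help to interleave the entries by timestamp to see what happened just before the disconnect. The sketch below assumes the standard glog line layout (severity letter, `mmdd`, time with microseconds, thread id, `file:line]`, message); `parse_glog_line` and `merge_by_time` are hypothetical helper names, and the input is a plain dict of log lines rather than anything read from `/tmp`.

```python
import re

# Assumed glog layout: Lmmdd hh:mm:ss.uuuuuu threadid file:line] message
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<source>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def parse_glog_line(text):
    """Return a dict of fields for one glog line, or None if it does not match."""
    match = GLOG_LINE.match(text)
    return match.groupdict() if match else None


def merge_by_time(lines_by_node):
    """Merge log lines from several nodes into one list sorted by timestamp.

    lines_by_node maps a node label (e.g. "H") to a list of raw log lines.
    Returns (stamp, node, message) tuples; unparseable lines are skipped.
    """
    entries = []
    for node, lines in lines_by_node.items():
        for text in lines:
            rec = parse_glog_line(text)
            if rec is None:
                continue
            stamp = "%s%s %s" % (rec["month"], rec["day"], rec["time"])
            entries.append((stamp, node, rec["message"]))
    return sorted(entries)


if __name__ == "__main__":
    sample = {
        "H": ["I0918 13:35:32.125037 8556 manager.cc:275] JUST_A_UNKNOWN_NODE is disconnected"],
        "W": ["I0918 13:35:32.069018 8551 minibatch_solver.h:291] iter = 0, training"],
    }
    for stamp, node, message in merge_by_time(sample):
        print(stamp, node, message)
```

The timestamps carry no year, so this ordering is only meaningful for logs from the same run.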