This repository was archived by the owner on May 24, 2018. It is now read-only.

linear local mode error: JUST_A_UNKNOWN_NODE is disconnected #75

@alaleiwang

The command is:
repo/dmlc-core/tracker/dmlc-submit --cluster local --env DMLC_CPU_VCORES=1 --env DMLC_MEMORY_MB=512 --num-workers 2 --num-servers 1 --worker-cores 1 --server-cores 1 learn/linear/build/linear.dmlc learn/linear/guide/demo.conf

The client shows this error:
Connected 1 servers and 2 workers
Training: iter = 0
sec ttl #ex inc #ex |w|_0 logloss accuracy AUC
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/alalei/wormhole_new2/repo/dmlc-core/tracker/dmlc_tracker/local.py", line 45, in exec_cmd
    raise RuntimeError('Get nonzero return code=%d' % ret)
RuntimeError: Get nonzero return code=-11

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/alalei/wormhole_new2/repo/dmlc-core/tracker/dmlc_tracker/local.py", line 45, in exec_cmd
    raise RuntimeError('Get nonzero return code=%d' % ret)
RuntimeError: Get nonzero return code=-11
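
A note on reading `return code=-11`: assuming the tracker obtains this value from Python's `subprocess` module, a negative return code `-N` on POSIX means the child process was terminated by signal `N`. The sketch below decodes such a value. The helper name `describe_returncode` is hypothetical and not part of dmlc-core.

```python
import signal


def describe_returncode(ret):
    """Describe a subprocess-style return code (hypothetical helper)."""
    if ret == 0:
        return "exited normally"
    if ret < 0:
        signum = -ret
        # Find the symbolic name among the SIG* constants of the signal
        # module, skipping handler constants such as SIG_DFL and SIG_IGN.
        names = [n for n in dir(signal)
                 if n.startswith("SIG") and not n.startswith("SIG_")
                 and getattr(signal, n) == signum]
        name = names[0] if names else "unknown signal"
        return "killed by signal %d (%s)" % (signum, name)
    return "exited with status %d" % ret


print(describe_returncode(-11))
```

On Linux this prints `killed by signal 11 (SIGSEGV)`, which would mean both worker processes crashed with a segmentation fault rather than exiting on their own.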

/tmp/linear.dmlc.H.log.INFO.20170918-133532.8524 shows:
I0918 13:35:32.054481 8524 van.cc:30] I'm [role: SCHEDULER id: "H" hostname: "10.2.177.240" port: 9092]
I0918 13:35:32.057025 8524 manager.cc:34] Staring system. Logging into /tmp/linear.dmlc.log.*
I0918 13:35:32.068665 8557 workload_pool.h:168] assign W_10.2.177.240_52648 job learn/data/agaricus.txt.train 0 / 10. 1 #jobs on processing.
I0918 13:35:32.068797 8557 workload_pool.h:168] assign W_10.2.177.240_43323 job learn/data/agaricus.txt.train 1 / 10. 2 #jobs on processing.
I0918 13:35:32.125037 8556 manager.cc:275] JUST_A_UNKNOWN_NODE is disconnected

/tmp/linear.dmlc.W_10.2.177.240_43323.log.INFO.20170918-133532.8525 shows:
I0918 13:35:32.048825 8525 van.cc:30] I'm [role: WORKER id: "W_10.2.177.240_43323" hostname: "10.2.177.240" port: 43323]
I0918 13:35:32.069018 8551 minibatch_solver.h:291] iter = 0, training, learn/data/agaricus.txt.train 1 / 10, minibatch = 1000, concurrency = 2, shuffle ratio = 10000, negative sampling =

/tmp/linear.dmlc.S_10.2.177.240_40067.log.INFO.20170918-133532.8529 shows:
I0918 13:35:32.052947 8529 van.cc:30] I'm [role: SERVER id: "S_10.2.177.240_40067" hostname: "10.2.177.240" port: 40067]
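
To see the order of events across the three log files, the lines can be merged by timestamp. The sketch below assumes the usual glog line layout (severity letter, month and day, time with microseconds, a numeric id, `file:line]`, message). The names `GLOG_LINE` and `merge_logs` are hypothetical, and the sample input is copied from the logs above.

```python
import re

# Assumed glog layout: Lmmdd hh:mm:ss.uuuuuu id file:line] message
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<source>[^\]]+)\] (?P<message>.*)$"
)


def merge_logs(logs):
    """Merge {node label: [log lines]} into one list sorted by timestamp."""
    events = []
    for node, lines in logs.items():
        for line in lines:
            m = GLOG_LINE.match(line)
            if m:  # skip lines that are not in the assumed glog layout
                # Zero-padded fields sort correctly as plain strings.
                key = (m.group("month"), m.group("day"), m.group("time"))
                events.append((key, node, m.group("source"), m.group("message")))
    return sorted(events)


logs = {
    "scheduler": [
        'I0918 13:35:32.054481 8524 van.cc:30] I\'m [role: SCHEDULER id: "H" hostname: "10.2.177.240" port: 9092]',
        "I0918 13:35:32.125037 8556 manager.cc:275] JUST_A_UNKNOWN_NODE is disconnected",
    ],
    "worker": [
        'I0918 13:35:32.048825 8525 van.cc:30] I\'m [role: WORKER id: "W_10.2.177.240_43323" hostname: "10.2.177.240" port: 43323]',
    ],
    "server": [
        'I0918 13:35:32.052947 8529 van.cc:30] I\'m [role: SERVER id: "S_10.2.177.240_40067" hostname: "10.2.177.240" port: 40067]',
    ],
}

for (month, day, time), node, source, message in merge_logs(logs):
    print(time, node, source, message)
```

With the lines above, the disconnect message is the last event, logged by the scheduler about 70 ms after it started.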
