Running into an issue with `distributed` version 1.21.1 using `dask-drmaa` version 0.1.0 where workers fail to register. Some other issues crop up in the process, like incorrect information being reported about the number of workers available or other resources (e.g. cores, memory, etc.). A log from one of the workers that failed and a full environment listing are included below. Downgrading to `distributed` version 1.21.0 resolves all of these issues.

Failed worker log:

distributed.nanny - INFO - Start Nanny at: 'tcp://10.36.106.27:37614'
distributed.worker - INFO - Start worker at: tcp://10.36.106.27:35646
distributed.worker - INFO - Listening to: tcp://10.36.106.27:35646
distributed.worker - INFO - nanny at: 10.36.106.27:37614
distributed.worker - INFO - bokeh at: 10.36.106.27:37208
distributed.worker - INFO - Waiting to connect to: tcp://10.36.106.22:36860
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.44 GB
distributed.worker - INFO - Local Directory: /groups/dudman/home/kirkhamj/dask-worker-space/worker-dqo75wxc
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process. Restarting
distributed.worker - INFO - Start worker at: tcp://10.36.106.27:33421
distributed.worker - INFO - Listening to: tcp://10.36.106.27:33421
distributed.worker - INFO - nanny at: 10.36.106.27:37614
distributed.worker - INFO - bokeh at: 10.36.106.27:46435
distributed.worker - INFO - Waiting to connect to: tcp://10.36.106.22:36860
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.44 GB
distributed.worker - INFO - Local Directory: /groups/dudman/home/kirkhamj/dask-worker-space/worker-79121wo6
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process. Restarting
distributed.worker - INFO - Start worker at: tcp://10.36.106.27:37974
distributed.worker - INFO - Listening to: tcp://10.36.106.27:37974
distributed.worker - INFO - nanny at: 10.36.106.27:37614
distributed.worker - INFO - bokeh at: 10.36.106.27:33165
distributed.worker - INFO - Waiting to connect to: tcp://10.36.106.22:36860
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.44 GB
distributed.worker - INFO - Local Directory: /groups/dudman/home/kirkhamj/dask-worker-space/worker-k0jhzicl
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 522, in run
    yield worker._start(*worker_start_args)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/worker.py", line 372, in _start
    yield self._register_with_scheduler()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/worker.py", line 295, in _register_with_scheduler
    (response,))
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 30451989.6', 'time': 1519764566.2687201}
distributed.diskutils - ERROR - Failed to remove '/groups/dudman/home/kirkhamj/dask-worker-space/worker-k0jhzicl' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/groups/dudman/home/kirkhamj/dask-worker-space/worker-k0jhzicl'
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x2b447e967d30> after timeout
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
    future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 458, in _wait_until_running
    raise msg
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 30451989.6', 'time': 1519764566.2687201}
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO - Start worker at: tcp://10.36.106.27:41120
distributed.worker - INFO - Listening to: tcp://10.36.106.27:41120
distributed.worker - INFO - nanny at: 10.36.106.27:37614
distributed.worker - INFO - bokeh at: 10.36.106.27:8789
distributed.worker - INFO - Waiting to connect to: tcp://10.36.106.22:36860
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.44 GB
distributed.worker - INFO - Local Directory: /groups/dudman/home/kirkhamj/dask-worker-space/worker-vju2tks3
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 522, in run
    yield worker._start(*worker_start_args)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/worker.py", line 372, in _start
    yield self._register_with_scheduler()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/worker.py", line 295, in _register_with_scheduler
    (response,))
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 30451989.6', 'time': 1519764568.099204}
distributed.diskutils - ERROR - Failed to remove '/groups/dudman/home/kirkhamj/dask-worker-space/worker-vju2tks3' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/groups/dudman/home/kirkhamj/dask-worker-space/worker-vju2tks3'
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x2b447e9676a0> after timeout
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
    future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 458, in _wait_until_running
    raise msg
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 30451989.6', 'time': 1519764568.099204}
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/opt/conda3/bin/dask-worker", line 6, in <module>
    sys.exit(distributed.cli.dask_worker.go())
  File "/opt/conda3/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 248, in go
    main()
  File "/opt/conda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/opt/conda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 239, in main
    loop.run_sync(run)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 232, in run
    yield [n._start(addr) for n in nannies]
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 828, in callback
    result_list.append(f.result())
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    assert self.worker_address
AssertionError
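For context, the cluster was brought up through `dask-drmaa`. A minimal sketch of that setup is below; it assumes default `DRMAACluster` settings, since the exact submission options are not captured in the log above:

```python
# Hypothetical reproduction sketch, assuming default DRMAACluster settings;
# the actual submission options used for the job above are not in the log.
from dask_drmaa import DRMAACluster
from dask.distributed import Client

cluster = DRMAACluster()    # submits worker jobs through the DRMAA API
cluster.start_workers(2)    # with distributed 1.21.1, these fail to register
client = Client(cluster)
```

With `distributed` pinned to 1.21.0 in the same environment, the workers register normally.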
Environment: