Workers fail to register #1792

@jakirkham

I'm running into an issue with distributed 1.21.1 and dask-drmaa 0.1.0 where workers fail to register with the scheduler. Other problems appear alongside it, such as incorrect reports of the number of available workers and of other resources (e.g. cores, memory). A log from one of the failed workers and a full environment listing are included below. Downgrading to distributed 1.21.0 resolves all of these issues.

Failed worker log:
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.36.106.27:37614'
distributed.worker - INFO -       Start worker at:   tcp://10.36.106.27:35646
distributed.worker - INFO -          Listening to:   tcp://10.36.106.27:35646
distributed.worker - INFO -              nanny at:         10.36.106.27:37614
distributed.worker - INFO -              bokeh at:         10.36.106.27:37208
distributed.worker - INFO - Waiting to connect to:   tcp://10.36.106.22:36860
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.44 GB
distributed.worker - INFO -       Local Directory: /groups/dudman/home/kirkhamj/dask-worker-space/worker-dqo75wxc
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at:   tcp://10.36.106.27:33421
distributed.worker - INFO -          Listening to:   tcp://10.36.106.27:33421
distributed.worker - INFO -              nanny at:         10.36.106.27:37614
distributed.worker - INFO -              bokeh at:         10.36.106.27:46435
distributed.worker - INFO - Waiting to connect to:   tcp://10.36.106.22:36860
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.44 GB
distributed.worker - INFO -       Local Directory: /groups/dudman/home/kirkhamj/dask-worker-space/worker-79121wo6
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at:   tcp://10.36.106.27:37974
distributed.worker - INFO -          Listening to:   tcp://10.36.106.27:37974
distributed.worker - INFO -              nanny at:         10.36.106.27:37614
distributed.worker - INFO -              bokeh at:         10.36.106.27:33165
distributed.worker - INFO - Waiting to connect to:   tcp://10.36.106.22:36860
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.44 GB
distributed.worker - INFO -       Local Directory: /groups/dudman/home/kirkhamj/dask-worker-space/worker-k0jhzicl
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 522, in run
    yield worker._start(*worker_start_args)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/worker.py", line 372, in _start
    yield self._register_with_scheduler()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/worker.py", line 295, in _register_with_scheduler
    (response,))
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 30451989.6', 'time': 1519764566.2687201}
distributed.diskutils - ERROR - Failed to remove '/groups/dudman/home/kirkhamj/dask-worker-space/worker-k0jhzicl' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/groups/dudman/home/kirkhamj/dask-worker-space/worker-k0jhzicl'
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x2b447e967d30> after timeout
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
    future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 458, in _wait_until_running
    raise msg
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 30451989.6', 'time': 1519764566.2687201}
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO -       Start worker at:   tcp://10.36.106.27:41120
distributed.worker - INFO -          Listening to:   tcp://10.36.106.27:41120
distributed.worker - INFO -              nanny at:         10.36.106.27:37614
distributed.worker - INFO -              bokeh at:          10.36.106.27:8789
distributed.worker - INFO - Waiting to connect to:   tcp://10.36.106.22:36860
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.44 GB
distributed.worker - INFO -       Local Directory: /groups/dudman/home/kirkhamj/dask-worker-space/worker-vju2tks3
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 522, in run
    yield worker._start(*worker_start_args)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/worker.py", line 372, in _start
    yield self._register_with_scheduler()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/worker.py", line 295, in _register_with_scheduler
    (response,))
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 30451989.6', 'time': 1519764568.099204}
distributed.diskutils - ERROR - Failed to remove '/groups/dudman/home/kirkhamj/dask-worker-space/worker-vju2tks3' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/groups/dudman/home/kirkhamj/dask-worker-space/worker-vju2tks3'
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x2b447e9676a0> after timeout
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
    future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 458, in _wait_until_running
    raise msg
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 30451989.6', 'time': 1519764568.099204}
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/opt/conda3/bin/dask-worker", line 6, in <module>
    sys.exit(distributed.cli.dask_worker.go())
  File "/opt/conda3/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 248, in go
    main()
  File "/opt/conda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/opt/conda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 239, in main
    loop.run_sync(run)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 232, in run
    yield [n._start(addr) for n in nannies]
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 828, in callback
    result_list.append(f.result())
  File "/opt/conda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    assert self.worker_address
AssertionError
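
For anyone triaging similar logs, here is a minimal sketch of how the scheduler's error response could be pulled out of a worker log like the one above. It is not part of distributed; the helper name `parse_register_errors` and the regular expression are my own assumptions, based only on the shape of the `ValueError` lines in this log.

```python
import ast
import re

# Assumed pattern, derived from the "Unexpected response from register"
# lines in the log above; it is not an official distributed log format.
REGISTER_ERROR = re.compile(r"Unexpected response from register: (\{.*\})")


def parse_register_errors(log_text):
    """Return the error response dicts found in a worker log (hypothetical helper)."""
    responses = []
    for line in log_text.splitlines():
        match = REGISTER_ERROR.search(line)
        if match:
            # The response is printed as a Python dict literal, so
            # ast.literal_eval can read it back without executing code.
            responses.append(ast.literal_eval(match.group(1)))
    return responses


# One line copied from the log above.
sample = (
    "ValueError: Unexpected response from register: "
    "{'status': 'error', 'message': 'name taken, 30451989.6', "
    "'time': 1519764566.2687201}"
)
errors = parse_register_errors(sample)
print(errors[0]["message"])
```

In this log every extracted response carries the same message, `name taken, 30451989.6`, so the repeated restarts all fail on the same worker name.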

Environment:
name: root
channels:
- nanshe
- conda-forge
- defaults
dependencies:
- anaconda-client=1.6.5=py_0
- asciitree=0.3.3=py36_1
- asn1crypto=0.22.0=py36_0
- backports=1.0=py36_1
- backports.functools_lru_cache=1.5=py36_0
- beautifulsoup4=4.6.0=py36_0
- blas=1.1=openblas
- bleach=2.0.0=py_1
- bokeh=0.12.13=py36_0
- boost=1.66.0=py36_1
- boost-cpp=1.66.0=1
- bottleneck=1.2.1=py36_1
- bzip2=1.0.6=1
- ca-certificates=2018.1.18=0
- cairo=1.14.10=0
- certifi=2018.1.18=py36_0
- cffi=1.11.2=py36_0
- chardet=3.0.4=py36_0
- click=6.7=py_1
- cloudpickle=0.5.2=py_0
- clyent=1.2.2=py36_0
- conda=4.3.34=py36_0
- conda-build=3.4.2=py36_0
- conda-env=2.6.0=0
- conda-verify=2.0.0=py36_0
- cryptography=2.1.4=py36_0
- curl=7.55.1=0
- cycler=0.10.0=py36_0
- dask=0.17.1=py_0
- dask-core=0.17.1=py_0
- dask-drmaa=0.1.0=py_0
- dask-imread=0.1.1=py36_0
- dask-ndfilters=0.1.2=py36_0
- dask-ndfourier=0.1.2=py36_0
- dbus=1.10.22=0
- decorator=4.1.2=py36_0
- distributed=1.21.1=py36_0
- drmaa=0.7.7=py36_0
- entrypoints=0.2.3=py36_1
- expat=2.2.5=0
- fasteners=0.14.1=py36_2
- fftw=3.3.7=0
- filelock=2.0.6=py36_0
- fontconfig=2.12.6=0
- freetype=2.8.1=0
- future=0.16.0=py36_0
- gettext=0.19.8.1=0
- git=2.14.2=3
- glib=2.55.0=0
- glob2=0.5=py36_0
- gmp=6.1.2=0
- graphite2=1.3.10=0
- graphviz=2.38.0=7
- gst-plugins-base=1.8.0=0
- gstreamer=1.8.0=1
- h5py=2.7.1=py36_2
- harfbuzz=1.7.1=0
- hdf5=1.10.1=2
- heapdict=1.0.0=py36_0
- html5lib=1.0.1=py_0
- icu=58.2=0
- idna=2.6=py36_1
- imageio=2.2.0=py36_0
- imgroi=0.0.2=py36_0
- ipykernel=4.8.2=py36_0
- ipyparallel=6.1.1=py36_1
- ipython=6.2.1=py36_1
- ipython_genutils=0.2.0=py36_0
- ipywidgets=7.1.2=py36_0
- jedi=0.11.1=py36_0
- jinja2=2.10=py36_0
- jpeg=9b=2
- jsonschema=2.6.0=py36_1
- jupyter_client=5.2.2=py36_0
- jupyter_contrib_core=0.3.3=py36_1
- jupyter_contrib_nbextensions=0.4.0=py36_0
- jupyter_core=4.4.0=py_0
- jupyter_highlight_selected_word=0.1.0=py36_0
- jupyter_latex_envs=1.4.0=py36_1
- jupyter_nbextensions_configurator=0.4.0=py36_0
- kenjutsu=0.5.1=py36_0
- krb5=1.14.2=0
- libffi=3.2.1=3
- libiconv=1.15=0
- libpng=1.6.34=0
- libsodium=1.0.15=1
- libssh2=1.8.0=2
- libtiff=4.0.9=0
- libtool=2.4.6=0
- libxcb=1.12=1
- libxml2=2.9.7=0
- libxslt=1.1.32=0
- locket=0.2.0=py36_1
- lxml=4.1.1=py36_0
- mahotas=1.4.4=py36_0
- markupsafe=1.0=py36_0
- matplotlib=2.1.2=py36_0
- metawrap=0.0.2=py36_0
- mistune=0.8.3=py_0
- monotonic=1.3=py36_0
- mplview=0.0.5=py_0
- msgpack-python=0.5.1=py36_0
- nbconvert=5.3.1=py_1
- nbformat=4.4.0=py36_0
- ncurses=5.9=10
- networkx=2.1=py36_0
- notebook=5.4.0=py36_0
- npctypes=0.0.4=py36_0
- numcodecs=0.5.3=py36_0
- numpy=1.14.1=py36_blas_openblas_200
- olefile=0.45.1=py36_0
- openblas=0.2.20=7
- openssl=1.0.2n=0
- packaging=16.8=py36_0
- pandas=0.22.0=py36_0
- pandoc=2.1.1=0
- pandocfilters=1.4.1=py36_0
- pango=1.40.14=0
- parso=0.1.1=py_0
- partd=0.3.8=py36_0
- patchelf=0.9=2
- pcre=8.39=0
- pexpect=4.4.0=py36_0
- pickleshare=0.7.4=py36_0
- pillow=5.0.0=py36_0
- pims=0.4.1=py_1
- pip=9.0.1=py36_1
- pixman=0.34.0=1
- pkginfo=1.4.1=py36_0
- prompt_toolkit=1.0.15=py36_0
- psutil=5.4.0=py36_0
- ptyprocess=0.5.2=py36_0
- pycosat=0.6.3=py36_0
- pycparser=2.18=py36_0
- pycrypto=2.6.1=py36_1
- pygments=2.2.0=py36_0
- pyopenssl=17.4.0=py36_0
- pyparsing=2.2.0=py36_0
- pyqt=5.6.0=py36_4
- pysocks=1.6.8=py36_1
- python=3.6.4=0
- python-dateutil=2.6.1=py36_0
- python-graphviz=0.8=py36_0
- pytz=2018.3=py_0
- pywavelets=0.5.2=py36_1
- pyyaml=3.12=py36_1
- pyzmq=17.0.0=py36_3
- qt=5.6.2=7
- rank_filter=0.4.15=py36_0
- readline=7.0=0
- requests=2.18.4=py36_1
- ruamel_yaml=0.11.14=py36_0
- scandir=1.7=py36_0
- scikit-image=0.13.1=py36_0
- scikit-learn=0.19.1=py36_blas_openblas_201
- scipy=1.0.0=py36_blas_openblas_201
- send2trash=1.5.0=py_0
- setuptools=38.5.1=py36_0
- simplegeneric=0.8.1=py36_0
- sip=4.18=py36_1
- six=1.11.0=py36_1
- slicerator=0.9.8=py_1
- sortedcontainers=1.5.7=py36_0
- sqlite=3.20.1=2
- tblib=1.3.2=py36_0
- terminado=0.8.1=py36_0
- testpath=0.3.1=py36_0
- tifffile=0.14.0=py36_1
- tini=0.16.1=0
- tk=8.6.7=0
- toolz=0.8.2=py_2
- tornado=4.5.3=py36_0
- traitlets=4.3.2=py36_0
- urllib3=1.22=py36_0
- vigra=1.11.1=py36_6
- wcwidth=0.1.7=py36_0
- webcolors=1.7=py_1
- webencodings=0.5=py36_0
- wheel=0.30.0=py36_2
- widgetsnbextension=3.1.4=py36_0
- xnumpy=0.1.2=py36_0
- xorg-libxau=1.0.8=3
- xorg-libxdmcp=1.1.2=3
- xz=5.2.3=0
- yail=0.0.2=py36_0
- yaml=0.1.6=0
- zeromq=4.2.3=2
- zict=0.1.3=py_0
- zlib=1.2.11=0
- apr=1.6.3=he40df45_0
- libgcc-ng=7.2.0=h7cc24e2_2
- libgfortran=3.0.0=1
- serf=1.3.9=hb3b5fc1_0
- svn=1.9.7=h30a3245_0
- nanshe=0.1.0a62=py36_0
- zarr=2.2.0rc3=py_0
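
To compare environments quickly, the listing above can be checked for the version combination reported here. This is only a sketch: `parse_conda_spec` is a hypothetical helper, and it assumes every dependency line has the `- name=version=build` form used in this listing.

```python
def parse_conda_spec(line):
    """Split a '- name=version=build' line into its parts (hypothetical helper)."""
    spec = line.strip().lstrip("-").strip()
    name, version, build = spec.split("=")
    return name, version, build


# Two lines copied from the environment listing above.
listing = [
    "- dask-drmaa=0.1.0=py_0",
    "- distributed=1.21.1=py36_0",
]
versions = {}
for entry in listing:
    name, version, _build = parse_conda_spec(entry)
    versions[name] = version

# 1.21.1 is the only version this report identifies as failing;
# 1.21.0 is reported as working.
affected = versions.get("distributed") == "1.21.1"
print(versions, affected)
```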
