Retiring workers begets AssertionError in replicate #1930

@jakirkham

During one step of our analysis, we reliably see an AssertionError raised in replicate from distributed. This happens on a cluster run with dask-drmaa, so the problem may well lie there, though the traceback doesn't make that clear. We're not sure exactly how we end up here, so any advice about what might be happening and/or what this assertion is for would be very helpful.

Traceback:
distributed.utils - ERROR - 
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
    yield
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
    n=1, delete=False)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
    assert count > 0
AssertionError
distributed.utils - ERROR - 
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
    yield
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2746, in retire_workers
    close_workers=close_workers)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
    n=1, delete=False)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
    assert count > 0
AssertionError
distributed.utils - ERROR - 
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
    yield
  File "/opt/conda3/lib/python3.6/site-packages/dask_drmaa/adaptive.py", line 107, in _retire_workers
    close_workers=True)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2746, in retire_workers
    close_workers=close_workers)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
    n=1, delete=False)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
    assert count > 0
AssertionError
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2b9bfc660400>, <Future finished exception=AssertionError()>)
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/tornado/ioloop.py", line 759, in _run_callback
    ret = callback()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/ioloop.py", line 780, in _discard_future_result
    future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/deploy/adaptive.py", line 306, in _adapt
    workers = yield self._retire_workers(workers=to_close)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/dask_drmaa/adaptive.py", line 107, in _retire_workers
    close_workers=True)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2746, in retire_workers
    close_workers=close_workers)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
    n=1, delete=False)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
    assert count > 0
AssertionError
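
Since the chained tracebacks above are mostly tornado coroutine plumbing, a small helper can make the call chain easier to read. This is only a sketch using the standard library; `condense`, `FRAME_RE`, and `NOISE_SUFFIXES` are names made up here and are not part of distributed or dask-drmaa, and which files count as "noise" is an assumption. The sample text is the first traceback from the log above.

```python
import re

# Matches the frame lines Python prints in a traceback, e.g.
#   File "/path/to/scheduler.py", line 2583, in replicate
FRAME_RE = re.compile(
    r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)'
)

# Frames from these files are treated as event-loop/logging plumbing.
# This list is an assumption for the sketch; adjust it as needed.
NOISE_SUFFIXES = (
    "tornado/gen.py",
    "tornado/ioloop.py",
    "tornado/stack_context.py",
    "distributed/utils.py",
)


def condense(traceback_text):
    """Return (file, line number, function) for each non-plumbing frame."""
    frames = []
    for match in FRAME_RE.finditer(traceback_text):
        path = match.group("path")
        if path.endswith(NOISE_SUFFIXES):
            continue
        # Keep only the part of the path below site-packages.
        short = path.split("site-packages/")[-1]
        frames.append((short, int(match.group("line")), match.group("func")))
    return frames


sample = '''Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
    yield
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
    n=1, delete=False)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
    assert count > 0
AssertionError
'''

for frame in condense(sample):
    print(frame)
```

Run on the last (tornado.application) traceback, the same filter leaves `_adapt` in distributed/deploy/adaptive.py, `_retire_workers` in dask_drmaa/adaptive.py, two `retire_workers` frames in distributed/scheduler.py, and finally `replicate`, which is where the assertion fails.
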
Environment:
name: base
channels:
  - nanshe
  - conda-forge
  - defaults
dependencies:
  - nanshe=0.1.0a62=py36_0
  - anaconda-client=1.6.14=py_0
  - asciitree=0.3.3=py36_1
  - asn1crypto=0.24.0=py36_0
  - backcall=0.1.0=py_0
  - beautifulsoup4=4.6.0=py36_0
  - blas=1.1=openblas
  - bleach=2.1.3=py_0
  - bokeh=0.12.15=py36_0
  - boost=1.66.0=py36_1
  - boost-cpp=1.66.0=1
  - bottleneck=1.2.1=py36_1
  - bzip2=1.0.6=1
  - ca-certificates=2018.4.16=0
  - cairo=1.14.10=0
  - certifi=2018.4.16=py36_0
  - cffi=1.11.5=py36_0
  - chardet=3.0.4=py36_0
  - click=6.7=py_1
  - cloudpickle=0.5.2=py_0
  - clyent=1.2.2=py36_0
  - conda=4.5.1=py36_0
  - conda-build=3.8.1=py36_0
  - conda-env=2.6.0=0
  - conda-verify=2.0.0=py36_0
  - contextlib2=0.5.5=py36_1
  - cryptography=2.2.1=py36_0
  - curl=7.59.0=1
  - cycler=0.10.0=py36_0
  - cytoolz=0.9.0.1=py36_0
  - dask=0.17.2=py_0
  - dask-core=0.17.2=py_0
  - dask-drmaa=0.2.0=py_0
  - dask-imread=0.1.1=py36_0
  - dask-ndfilters=0.1.2=py36_0
  - dask-ndfourier=0.1.2=py36_0
  - dask-ndmeasure=0.1.1=py36_0
  - dask-ndmorph=0.1.1=py36_0
  - dbus=1.10.22=0
  - decorator=4.3.0=py_0
  - distributed=1.21.6=py36_0
  - drmaa=0.7.8=py_0
  - entrypoints=0.2.3=py36_1
  - expat=2.2.5=0
  - fasteners=0.14.1=py36_2
  - fftw=3.3.7=0
  - filelock=3.0.4=py36_0
  - fontconfig=2.12.6=0
  - freetype=2.8.1=0
  - future=0.16.0=py36_0
  - gettext=0.19.8.1=0
  - git=2.14.2=3
  - glib=2.55.0=0
  - glob2=0.5=py36_0
  - gmp=6.1.2=0
  - graphite2=1.3.11=0
  - graphviz=2.38.0=7
  - gst-plugins-base=1.8.0=0
  - gstreamer=1.8.0=1
  - h5py=2.7.1=py36_2
  - harfbuzz=1.7.6=0
  - hdf5=1.10.1=2
  - heapdict=1.0.0=py36_0
  - html5lib=1.0.1=py_0
  - icu=58.2=0
  - idna=2.6=py36_1
  - imageio=2.3.0=py36_0
  - imgroi=0.0.2=py36_0
  - ipykernel=4.8.2=py36_0
  - ipyparallel=6.1.1=py36_1
  - ipython=6.3.1=py36_0
  - ipython_genutils=0.2.0=py36_0
  - ipywidgets=7.2.1=py36_1
  - jedi=0.12.0=py36_0
  - jinja2=2.10=py36_0
  - jpeg=9b=2
  - jsonschema=2.6.0=py36_1
  - jupyter_client=5.2.3=py36_0
  - jupyter_contrib_core=0.3.3=py36_1
  - jupyter_contrib_nbextensions=0.5.0=py36_0
  - jupyter_core=4.4.0=py_0
  - jupyter_highlight_selected_word=0.2.0=py36_0
  - jupyter_latex_envs=1.4.4=py36_0
  - jupyter_nbextensions_configurator=0.4.0=py36_0
  - kenjutsu=0.5.1=py36_0
  - kiwisolver=1.0.1=py36_1
  - krb5=1.14.6=0
  - libedit=3.1.20170329=0
  - libffi=3.2.1=3
  - libiconv=1.15=0
  - libpng=1.6.34=0
  - libsodium=1.0.16=0
  - libssh2=1.8.0=2
  - libtiff=4.0.9=0
  - libtool=2.4.6=0
  - libxcb=1.13=0
  - libxml2=2.9.8=0
  - libxslt=1.1.32=0
  - locket=0.2.0=py36_1
  - lxml=4.2.1=py36_0
  - mahotas=1.4.4=py36_0
  - markupsafe=1.0=py36_0
  - matplotlib=2.2.2=py36_1
  - metawrap=0.0.2=py36_0
  - mistune=0.8.3=py_0
  - monotonic=1.4=py36_0
  - mplview=0.0.5=py_0
  - msgpack-python=0.5.6=py36_0
  - nbconvert=5.3.1=py_1
  - nbformat=4.4.0=py36_0
  - ncurses=5.9=10
  - networkx=2.1=py36_0
  - notebook=5.4.1=py36_0
  - npctypes=0.0.4=py36_0
  - numcodecs=0.5.5=py36_0
  - numpy=1.14.2=py36_blas_openblas_200
  - olefile=0.45.1=py36_0
  - openblas=0.2.20=7
  - openssl=1.0.2o=0
  - packaging=17.1=py_0
  - pandas=0.22.0=py36_0
  - pandoc=2.1.3=0
  - pandocfilters=1.4.1=py36_0
  - pango=1.40.14=0
  - parso=0.2.0=py_0
  - partd=0.3.8=py36_0
  - patchelf=0.9=2
  - pcre=8.41=1
  - pexpect=4.5.0=py36_0
  - pickleshare=0.7.4=py36_0
  - pillow=5.1.0=py36_0
  - pims=0.4.1=py_1
  - pip=9.0.3=py36_0
  - pixman=0.34.0=1
  - pkginfo=1.4.2=py36_0
  - prompt_toolkit=1.0.15=py36_0
  - psutil=5.4.5=py36_0
  - ptyprocess=0.5.2=py36_0
  - pycosat=0.6.3=py36_0
  - pycparser=2.18=py36_0
  - pycrypto=2.6.1=py36_1
  - pygments=2.2.0=py36_0
  - pyopenssl=17.5.0=py36_1
  - pyparsing=2.2.0=py36_0
  - pyqt=5.6.0=py36_5
  - pysocks=1.6.8=py36_1
  - python=3.6.5=1
  - python-dateutil=2.7.2=py_0
  - python-graphviz=0.8.2=py36_0
  - pytz=2018.4=py_0
  - pywavelets=0.5.2=py36_1
  - pyyaml=3.12=py36_1
  - pyzmq=17.0.0=py36_4
  - qt=5.6.2=7
  - rank_filter=0.4.15=py36_0
  - readline=7.0=0
  - requests=2.18.4=py36_1
  - ruamel_yaml=0.15.35=py36_0
  - scandir=1.7=py36_0
  - scikit-image=0.13.1=py36_0
  - scikit-learn=0.19.1=py36_blas_openblas_201
  - scipy=1.0.1=py36_blas_openblas_200
  - send2trash=1.5.0=py_0
  - setuptools=39.0.1=py36_0
  - simplegeneric=0.8.1=py36_0
  - sip=4.18=py36_1
  - six=1.11.0=py36_1
  - slicerator=0.9.8=py_1
  - sortedcontainers=1.5.9=py36_0
  - sqlite=3.20.1=2
  - tblib=1.3.2=py36_0
  - terminado=0.8.1=py36_0
  - testpath=0.3.1=py36_0
  - tifffile=0.14.0=py36_1
  - tini=0.17.0=0
  - tk=8.6.7=0
  - toolz=0.9.0=py_0
  - tornado=5.0.2=py36_0
  - traitlets=4.3.2=py36_0
  - urllib3=1.22=py36_0
  - vigra=1.11.1=py36_6
  - wcwidth=0.1.7=py36_0
  - webcolors=1.8.1=py_0
  - webencodings=0.5=py36_0
  - wheel=0.31.0=py36_0
  - widgetsnbextension=3.2.1=py36_0
  - xnumpy=0.1.2=py36_0
  - xorg-libxau=1.0.8=3
  - xorg-libxdmcp=1.1.2=3
  - xz=5.2.3=0
  - yail=0.0.2=py36_0
  - yaml=0.1.7=0
  - zarr=2.2.0=py_1
  - zeromq=4.2.5=1
  - zict=0.1.3=py_0
  - zlib=1.2.11=0
  - apr=1.6.3=he40df45_0
  - libgcc=7.2.0=h69d50b8_2
  - libgcc-ng=7.2.0=hdf63c60_3
  - libgfortran=3.0.0=1
  - libstdcxx-ng=7.2.0=hdf63c60_3
  - serf=1.3.9=hb3b5fc1_0
  - svn=1.9.7=h30a3245_0
  - pip:
    - pyfftw==0.10.4
prefix: /opt/conda3

Labels: stability (Issue or feature related to cluster stability, e.g. deadlock)