Retiring workers begets AssertionError in replicate #1930

Closed
jakirkham opened this issue Apr 23, 2018 · 4 comments

Labels
stability Issue or feature related to cluster stability (e.g. deadlock)

Comments

@jakirkham
Member

During one step in our analysis, we are reliably seeing an AssertionError in replicate from distributed. This is happening on a cluster using dask-drmaa, so the problem may well originate there (though that's not clear from the traceback). We're not sure exactly how we end up here, so advice about what might be happening and/or what this assertion checks would be very helpful.
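For reference, a minimal, hypothetical sketch of the code path in question, using a LocalCluster instead of our dask-drmaa setup (the worker count and dummy tasks are made up for illustration):

```python
# Minimal, hypothetical sketch (LocalCluster instead of our dask-drmaa setup).
# Scheduler.retire_workers() replicates the retiring worker's unique keys
# elsewhere via replicate(..., n=1, delete=False), which is the call that
# raises the AssertionError in the traceback below.
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    futures = client.map(lambda x: x + 1, range(100))
    client.gather(futures)

    # Retire one worker; the scheduler moves its data before closing it.
    worker = next(iter(client.scheduler_info()["workers"]))
    client.retire_workers(workers=[worker])
```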

Traceback:
distributed.utils - ERROR - 
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
    yield
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
    n=1, delete=False)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
    assert count > 0
AssertionError
distributed.utils - ERROR - 
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
    yield
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2746, in retire_workers
    close_workers=close_workers)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
    n=1, delete=False)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
    assert count > 0
AssertionError
distributed.utils - ERROR - 
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
    yield
  File "/opt/conda3/lib/python3.6/site-packages/dask_drmaa/adaptive.py", line 107, in _retire_workers
    close_workers=True)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2746, in retire_workers
    close_workers=close_workers)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
    n=1, delete=False)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
    assert count > 0
AssertionError
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2b9bfc660400>, <Future finished exception=AssertionError()>)
Traceback (most recent call last):
  File "/opt/conda3/lib/python3.6/site-packages/tornado/ioloop.py", line 759, in _run_callback
    ret = callback()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/ioloop.py", line 780, in _discard_future_result
    future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/deploy/adaptive.py", line 306, in _adapt
    workers = yield self._retire_workers(workers=to_close)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/dask_drmaa/adaptive.py", line 107, in _retire_workers
    close_workers=True)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2746, in retire_workers
    close_workers=close_workers)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
    n=1, delete=False)
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
    assert count > 0
AssertionError
Environment:
name: base
channels:
  - nanshe
  - conda-forge
  - defaults
dependencies:
  - nanshe=0.1.0a62=py36_0
  - anaconda-client=1.6.14=py_0
  - asciitree=0.3.3=py36_1
  - asn1crypto=0.24.0=py36_0
  - backcall=0.1.0=py_0
  - beautifulsoup4=4.6.0=py36_0
  - blas=1.1=openblas
  - bleach=2.1.3=py_0
  - bokeh=0.12.15=py36_0
  - boost=1.66.0=py36_1
  - boost-cpp=1.66.0=1
  - bottleneck=1.2.1=py36_1
  - bzip2=1.0.6=1
  - ca-certificates=2018.4.16=0
  - cairo=1.14.10=0
  - certifi=2018.4.16=py36_0
  - cffi=1.11.5=py36_0
  - chardet=3.0.4=py36_0
  - click=6.7=py_1
  - cloudpickle=0.5.2=py_0
  - clyent=1.2.2=py36_0
  - conda=4.5.1=py36_0
  - conda-build=3.8.1=py36_0
  - conda-env=2.6.0=0
  - conda-verify=2.0.0=py36_0
  - contextlib2=0.5.5=py36_1
  - cryptography=2.2.1=py36_0
  - curl=7.59.0=1
  - cycler=0.10.0=py36_0
  - cytoolz=0.9.0.1=py36_0
  - dask=0.17.2=py_0
  - dask-core=0.17.2=py_0
  - dask-drmaa=0.2.0=py_0
  - dask-imread=0.1.1=py36_0
  - dask-ndfilters=0.1.2=py36_0
  - dask-ndfourier=0.1.2=py36_0
  - dask-ndmeasure=0.1.1=py36_0
  - dask-ndmorph=0.1.1=py36_0
  - dbus=1.10.22=0
  - decorator=4.3.0=py_0
  - distributed=1.21.6=py36_0
  - drmaa=0.7.8=py_0
  - entrypoints=0.2.3=py36_1
  - expat=2.2.5=0
  - fasteners=0.14.1=py36_2
  - fftw=3.3.7=0
  - filelock=3.0.4=py36_0
  - fontconfig=2.12.6=0
  - freetype=2.8.1=0
  - future=0.16.0=py36_0
  - gettext=0.19.8.1=0
  - git=2.14.2=3
  - glib=2.55.0=0
  - glob2=0.5=py36_0
  - gmp=6.1.2=0
  - graphite2=1.3.11=0
  - graphviz=2.38.0=7
  - gst-plugins-base=1.8.0=0
  - gstreamer=1.8.0=1
  - h5py=2.7.1=py36_2
  - harfbuzz=1.7.6=0
  - hdf5=1.10.1=2
  - heapdict=1.0.0=py36_0
  - html5lib=1.0.1=py_0
  - icu=58.2=0
  - idna=2.6=py36_1
  - imageio=2.3.0=py36_0
  - imgroi=0.0.2=py36_0
  - ipykernel=4.8.2=py36_0
  - ipyparallel=6.1.1=py36_1
  - ipython=6.3.1=py36_0
  - ipython_genutils=0.2.0=py36_0
  - ipywidgets=7.2.1=py36_1
  - jedi=0.12.0=py36_0
  - jinja2=2.10=py36_0
  - jpeg=9b=2
  - jsonschema=2.6.0=py36_1
  - jupyter_client=5.2.3=py36_0
  - jupyter_contrib_core=0.3.3=py36_1
  - jupyter_contrib_nbextensions=0.5.0=py36_0
  - jupyter_core=4.4.0=py_0
  - jupyter_highlight_selected_word=0.2.0=py36_0
  - jupyter_latex_envs=1.4.4=py36_0
  - jupyter_nbextensions_configurator=0.4.0=py36_0
  - kenjutsu=0.5.1=py36_0
  - kiwisolver=1.0.1=py36_1
  - krb5=1.14.6=0
  - libedit=3.1.20170329=0
  - libffi=3.2.1=3
  - libiconv=1.15=0
  - libpng=1.6.34=0
  - libsodium=1.0.16=0
  - libssh2=1.8.0=2
  - libtiff=4.0.9=0
  - libtool=2.4.6=0
  - libxcb=1.13=0
  - libxml2=2.9.8=0
  - libxslt=1.1.32=0
  - locket=0.2.0=py36_1
  - lxml=4.2.1=py36_0
  - mahotas=1.4.4=py36_0
  - markupsafe=1.0=py36_0
  - matplotlib=2.2.2=py36_1
  - metawrap=0.0.2=py36_0
  - mistune=0.8.3=py_0
  - monotonic=1.4=py36_0
  - mplview=0.0.5=py_0
  - msgpack-python=0.5.6=py36_0
  - nbconvert=5.3.1=py_1
  - nbformat=4.4.0=py36_0
  - ncurses=5.9=10
  - networkx=2.1=py36_0
  - notebook=5.4.1=py36_0
  - npctypes=0.0.4=py36_0
  - numcodecs=0.5.5=py36_0
  - numpy=1.14.2=py36_blas_openblas_200
  - olefile=0.45.1=py36_0
  - openblas=0.2.20=7
  - openssl=1.0.2o=0
  - packaging=17.1=py_0
  - pandas=0.22.0=py36_0
  - pandoc=2.1.3=0
  - pandocfilters=1.4.1=py36_0
  - pango=1.40.14=0
  - parso=0.2.0=py_0
  - partd=0.3.8=py36_0
  - patchelf=0.9=2
  - pcre=8.41=1
  - pexpect=4.5.0=py36_0
  - pickleshare=0.7.4=py36_0
  - pillow=5.1.0=py36_0
  - pims=0.4.1=py_1
  - pip=9.0.3=py36_0
  - pixman=0.34.0=1
  - pkginfo=1.4.2=py36_0
  - prompt_toolkit=1.0.15=py36_0
  - psutil=5.4.5=py36_0
  - ptyprocess=0.5.2=py36_0
  - pycosat=0.6.3=py36_0
  - pycparser=2.18=py36_0
  - pycrypto=2.6.1=py36_1
  - pygments=2.2.0=py36_0
  - pyopenssl=17.5.0=py36_1
  - pyparsing=2.2.0=py36_0
  - pyqt=5.6.0=py36_5
  - pysocks=1.6.8=py36_1
  - python=3.6.5=1
  - python-dateutil=2.7.2=py_0
  - python-graphviz=0.8.2=py36_0
  - pytz=2018.4=py_0
  - pywavelets=0.5.2=py36_1
  - pyyaml=3.12=py36_1
  - pyzmq=17.0.0=py36_4
  - qt=5.6.2=7
  - rank_filter=0.4.15=py36_0
  - readline=7.0=0
  - requests=2.18.4=py36_1
  - ruamel_yaml=0.15.35=py36_0
  - scandir=1.7=py36_0
  - scikit-image=0.13.1=py36_0
  - scikit-learn=0.19.1=py36_blas_openblas_201
  - scipy=1.0.1=py36_blas_openblas_200
  - send2trash=1.5.0=py_0
  - setuptools=39.0.1=py36_0
  - simplegeneric=0.8.1=py36_0
  - sip=4.18=py36_1
  - six=1.11.0=py36_1
  - slicerator=0.9.8=py_1
  - sortedcontainers=1.5.9=py36_0
  - sqlite=3.20.1=2
  - tblib=1.3.2=py36_0
  - terminado=0.8.1=py36_0
  - testpath=0.3.1=py36_0
  - tifffile=0.14.0=py36_1
  - tini=0.17.0=0
  - tk=8.6.7=0
  - toolz=0.9.0=py_0
  - tornado=5.0.2=py36_0
  - traitlets=4.3.2=py36_0
  - urllib3=1.22=py36_0
  - vigra=1.11.1=py36_6
  - wcwidth=0.1.7=py36_0
  - webcolors=1.8.1=py_0
  - webencodings=0.5=py36_0
  - wheel=0.31.0=py36_0
  - widgetsnbextension=3.2.1=py36_0
  - xnumpy=0.1.2=py36_0
  - xorg-libxau=1.0.8=3
  - xorg-libxdmcp=1.1.2=3
  - xz=5.2.3=0
  - yail=0.0.2=py36_0
  - yaml=0.1.7=0
  - zarr=2.2.0=py_1
  - zeromq=4.2.5=1
  - zict=0.1.3=py_0
  - zlib=1.2.11=0
  - apr=1.6.3=he40df45_0
  - libgcc=7.2.0=h69d50b8_2
  - libgcc-ng=7.2.0=hdf63c60_3
  - libgfortran=3.0.0=1
  - libstdcxx-ng=7.2.0=hdf63c60_3
  - serf=1.3.9=hb3b5fc1_0
  - svn=1.9.7=h30a3245_0
  - pip:
    - pyfftw==0.10.4
prefix: /opt/conda3
@mrocklin
Member

mrocklin commented Apr 23, 2018 via email

@jakirkham
Member Author

Thanks for the follow-up, Matt. So the moral of the story is that this is not so much a bug as a feature request for Adaptive clusters generally. Good to know.

FWIW, even when this warning appears, the computation does seem to complete (possibly by redoing some work). The warning is a little unsettling, though.

@mrocklin
Member

mrocklin commented Apr 23, 2018 via email

@crusaderky
Collaborator

retire_workers() has been completely reimplemented on top of the Active Memory Manager (AMM) in 2022.2.0.
The underlying replicate() is slated to be reimplemented on top of AMM as well (#6578).
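For anyone landing here on a recent release, a hedged sketch of enabling the AMM's periodic loop via configuration (the config key reflects current distributed defaults; the cluster setup below is illustrative only):

```python
# Hedged sketch, assuming a recent dask/distributed release (>= 2022.2.0).
# The config key below enables the Active Memory Manager's periodic loop;
# as noted above, retire_workers() is now implemented on top of the AMM.
import dask
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    with dask.config.set({"distributed.scheduler.active-memory-manager.start": True}):
        cluster = LocalCluster(n_workers=2)
        client = Client(cluster)

        futures = client.map(lambda x: x ** 2, range(10))
        client.gather(futures)

        # Worker retirement now goes through the AMM-based code path
        # instead of the old replicate()-based one.
        worker = next(iter(client.scheduler_info()["workers"]))
        client.retire_workers(workers=[worker])
```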

@crusaderky crusaderky added the stability Issue or feature related to cluster stability (e.g. deadlock) label Jun 15, 2022