Retiring workers begets AssertionError in replicate #1930
Note that replicate and rebalance are not transactional and not resilient.
They are only guaranteed to work in a quiet cluster. I would not be
surprised to see them misbehave when used with a system like Adaptive that
removes and adds workers on the fly. The long-term solution here is to
build a proper memory manager that can act in a more resilient manner.
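The "quiet cluster" caveat can be illustrated with a small sketch. This is a hypothetical model, not the actual `distributed` scheduler code: `replicate` decides how many extra copies of a key it needs based on the worker set it sees, and a check like `assert count > 0` can trip if workers are retired between planning and execution.

```python
# Hypothetical sketch of the race behind `assert count > 0` in replicate.
# The function below is an illustration, not the real scheduler logic:
# the number of new copies to create is the desired replica count,
# capped by how many workers exist, minus the copies already present.

def replicas_to_add(target_copies, current_copies, n_workers):
    """How many new copies replicate would still need to create."""
    return min(target_copies, n_workers) - current_copies

# Quiet cluster: 2 copies exist, 3 are wanted, 5 workers available.
assert replicas_to_add(3, 2, 5) == 1  # count > 0, replicate proceeds

# Racy cluster: Adaptive retired workers concurrently, so only the
# 2 workers that already hold a copy remain by the time replicate runs.
count = replicas_to_add(3, 2, 2)
print(count)  # 0 -> a `count > 0` assertion would now fail
```

This is why retiring workers mid-`replicate` can surface as an `AssertionError` rather than a graceful no-op: the plan was made against a worker set that no longer exists.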
…On Mon, Apr 23, 2018 at 12:52 PM, jakirkham ***@***.***> wrote:
During one step in our analysis, we are reliably seeing an AssertionError
in replicate from distributed. This is happening on the cluster using
dask-drmaa. So the problem may very well be there (though that's not very
clear from the traceback). Not sure exactly how we are ending up here. So
some advice about what might be happening and/or what this assertion is for
would be very helpful.
Traceback:
distributed.utils - ERROR -
Traceback (most recent call last):
File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
yield
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
n=1, delete=False)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
yielded = self.gen.send(value)
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
assert count > 0
AssertionError
distributed.utils - ERROR -
Traceback (most recent call last):
File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
yield
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2746, in retire_workers
close_workers=close_workers)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
n=1, delete=False)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
yielded = self.gen.send(value)
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
assert count > 0
AssertionError
distributed.utils - ERROR -
Traceback (most recent call last):
File "/opt/conda3/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
yield
File "/opt/conda3/lib/python3.6/site-packages/dask_drmaa/adaptive.py", line 107, in _retire_workers
close_workers=True)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2746, in retire_workers
close_workers=close_workers)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
n=1, delete=False)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
yielded = self.gen.send(value)
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
assert count > 0
AssertionError
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2b9bfc660400>, <Future finished exception=AssertionError()>)
Traceback (most recent call last):
File "/opt/conda3/lib/python3.6/site-packages/tornado/ioloop.py", line 759, in _run_callback
ret = callback()
File "/opt/conda3/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
return fn(*args, **kwargs)
File "/opt/conda3/lib/python3.6/site-packages/tornado/ioloop.py", line 780, in _discard_future_result
future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda3/lib/python3.6/site-packages/distributed/deploy/adaptive.py", line 306, in _adapt
workers = yield self._retire_workers(workers=to_close)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda3/lib/python3.6/site-packages/dask_drmaa/adaptive.py", line 107, in _retire_workers
close_workers=True)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2746, in retire_workers
close_workers=close_workers)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2764, in retire_workers
n=1, delete=False)
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
value = future.result()
File "/opt/conda3/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
yielded = self.gen.send(value)
File "/opt/conda3/lib/python3.6/site-packages/distributed/scheduler.py", line 2583, in replicate
assert count > 0
AssertionError
Environment:
name: base
channels:
- nanshe
- conda-forge
- defaults
dependencies:
- nanshe=0.1.0a62=py36_0
- anaconda-client=1.6.14=py_0
- asciitree=0.3.3=py36_1
- asn1crypto=0.24.0=py36_0
- backcall=0.1.0=py_0
- beautifulsoup4=4.6.0=py36_0
- blas=1.1=openblas
- bleach=2.1.3=py_0
- bokeh=0.12.15=py36_0
- boost=1.66.0=py36_1
- boost-cpp=1.66.0=1
- bottleneck=1.2.1=py36_1
- bzip2=1.0.6=1
- ca-certificates=2018.4.16=0
- cairo=1.14.10=0
- certifi=2018.4.16=py36_0
- cffi=1.11.5=py36_0
- chardet=3.0.4=py36_0
- click=6.7=py_1
- cloudpickle=0.5.2=py_0
- clyent=1.2.2=py36_0
- conda=4.5.1=py36_0
- conda-build=3.8.1=py36_0
- conda-env=2.6.0=0
- conda-verify=2.0.0=py36_0
- contextlib2=0.5.5=py36_1
- cryptography=2.2.1=py36_0
- curl=7.59.0=1
- cycler=0.10.0=py36_0
- cytoolz=0.9.0.1=py36_0
- dask=0.17.2=py_0
- dask-core=0.17.2=py_0
- dask-drmaa=0.2.0=py_0
- dask-imread=0.1.1=py36_0
- dask-ndfilters=0.1.2=py36_0
- dask-ndfourier=0.1.2=py36_0
- dask-ndmeasure=0.1.1=py36_0
- dask-ndmorph=0.1.1=py36_0
- dbus=1.10.22=0
- decorator=4.3.0=py_0
- distributed=1.21.6=py36_0
- drmaa=0.7.8=py_0
- entrypoints=0.2.3=py36_1
- expat=2.2.5=0
- fasteners=0.14.1=py36_2
- fftw=3.3.7=0
- filelock=3.0.4=py36_0
- fontconfig=2.12.6=0
- freetype=2.8.1=0
- future=0.16.0=py36_0
- gettext=0.19.8.1=0
- git=2.14.2=3
- glib=2.55.0=0
- glob2=0.5=py36_0
- gmp=6.1.2=0
- graphite2=1.3.11=0
- graphviz=2.38.0=7
- gst-plugins-base=1.8.0=0
- gstreamer=1.8.0=1
- h5py=2.7.1=py36_2
- harfbuzz=1.7.6=0
- hdf5=1.10.1=2
- heapdict=1.0.0=py36_0
- html5lib=1.0.1=py_0
- icu=58.2=0
- idna=2.6=py36_1
- imageio=2.3.0=py36_0
- imgroi=0.0.2=py36_0
- ipykernel=4.8.2=py36_0
- ipyparallel=6.1.1=py36_1
- ipython=6.3.1=py36_0
- ipython_genutils=0.2.0=py36_0
- ipywidgets=7.2.1=py36_1
- jedi=0.12.0=py36_0
- jinja2=2.10=py36_0
- jpeg=9b=2
- jsonschema=2.6.0=py36_1
- jupyter_client=5.2.3=py36_0
- jupyter_contrib_core=0.3.3=py36_1
- jupyter_contrib_nbextensions=0.5.0=py36_0
- jupyter_core=4.4.0=py_0
- jupyter_highlight_selected_word=0.2.0=py36_0
- jupyter_latex_envs=1.4.4=py36_0
- jupyter_nbextensions_configurator=0.4.0=py36_0
- kenjutsu=0.5.1=py36_0
- kiwisolver=1.0.1=py36_1
- krb5=1.14.6=0
- libedit=3.1.20170329=0
- libffi=3.2.1=3
- libiconv=1.15=0
- libpng=1.6.34=0
- libsodium=1.0.16=0
- libssh2=1.8.0=2
- libtiff=4.0.9=0
- libtool=2.4.6=0
- libxcb=1.13=0
- libxml2=2.9.8=0
- libxslt=1.1.32=0
- locket=0.2.0=py36_1
- lxml=4.2.1=py36_0
- mahotas=1.4.4=py36_0
- markupsafe=1.0=py36_0
- matplotlib=2.2.2=py36_1
- metawrap=0.0.2=py36_0
- mistune=0.8.3=py_0
- monotonic=1.4=py36_0
- mplview=0.0.5=py_0
- msgpack-python=0.5.6=py36_0
- nbconvert=5.3.1=py_1
- nbformat=4.4.0=py36_0
- ncurses=5.9=10
- networkx=2.1=py36_0
- notebook=5.4.1=py36_0
- npctypes=0.0.4=py36_0
- numcodecs=0.5.5=py36_0
- numpy=1.14.2=py36_blas_openblas_200
- olefile=0.45.1=py36_0
- openblas=0.2.20=7
- openssl=1.0.2o=0
- packaging=17.1=py_0
- pandas=0.22.0=py36_0
- pandoc=2.1.3=0
- pandocfilters=1.4.1=py36_0
- pango=1.40.14=0
- parso=0.2.0=py_0
- partd=0.3.8=py36_0
- patchelf=0.9=2
- pcre=8.41=1
- pexpect=4.5.0=py36_0
- pickleshare=0.7.4=py36_0
- pillow=5.1.0=py36_0
- pims=0.4.1=py_1
- pip=9.0.3=py36_0
- pixman=0.34.0=1
- pkginfo=1.4.2=py36_0
- prompt_toolkit=1.0.15=py36_0
- psutil=5.4.5=py36_0
- ptyprocess=0.5.2=py36_0
- pycosat=0.6.3=py36_0
- pycparser=2.18=py36_0
- pycrypto=2.6.1=py36_1
- pygments=2.2.0=py36_0
- pyopenssl=17.5.0=py36_1
- pyparsing=2.2.0=py36_0
- pyqt=5.6.0=py36_5
- pysocks=1.6.8=py36_1
- python=3.6.5=1
- python-dateutil=2.7.2=py_0
- python-graphviz=0.8.2=py36_0
- pytz=2018.4=py_0
- pywavelets=0.5.2=py36_1
- pyyaml=3.12=py36_1
- pyzmq=17.0.0=py36_4
- qt=5.6.2=7
- rank_filter=0.4.15=py36_0
- readline=7.0=0
- requests=2.18.4=py36_1
- ruamel_yaml=0.15.35=py36_0
- scandir=1.7=py36_0
- scikit-image=0.13.1=py36_0
- scikit-learn=0.19.1=py36_blas_openblas_201
- scipy=1.0.1=py36_blas_openblas_200
- send2trash=1.5.0=py_0
- setuptools=39.0.1=py36_0
- simplegeneric=0.8.1=py36_0
- sip=4.18=py36_1
- six=1.11.0=py36_1
- slicerator=0.9.8=py_1
- sortedcontainers=1.5.9=py36_0
- sqlite=3.20.1=2
- tblib=1.3.2=py36_0
- terminado=0.8.1=py36_0
- testpath=0.3.1=py36_0
- tifffile=0.14.0=py36_1
- tini=0.17.0=0
- tk=8.6.7=0
- toolz=0.9.0=py_0
- tornado=5.0.2=py36_0
- traitlets=4.3.2=py36_0
- urllib3=1.22=py36_0
- vigra=1.11.1=py36_6
- wcwidth=0.1.7=py36_0
- webcolors=1.8.1=py_0
- webencodings=0.5=py36_0
- wheel=0.31.0=py36_0
- widgetsnbextension=3.2.1=py36_0
- xnumpy=0.1.2=py36_0
- xorg-libxau=1.0.8=3
- xorg-libxdmcp=1.1.2=3
- xz=5.2.3=0
- yail=0.0.2=py36_0
- yaml=0.1.7=0
- zarr=2.2.0=py_1
- zeromq=4.2.5=1
- zict=0.1.3=py_0
- zlib=1.2.11=0
- apr=1.6.3=he40df45_0
- libgcc=7.2.0=h69d50b8_2
- libgcc-ng=7.2.0=hdf63c60_3
- libgfortran=3.0.0=1
- libstdcxx-ng=7.2.0=hdf63c60_3
- serf=1.3.9=hb3b5fc1_0
- svn=1.9.7=h30a3245_0
- pip:
- pyfftw==0.10.4
prefix: /opt/conda3
Thanks for the follow-up, Matt. So the moral of the story is that this is not so much a bug as a feature request for Adaptive clusters generally. Good to know.
FWIW, even when seeing this warning, it does seem to manage to complete the work (maybe by repeating it). The warning is a little unsettling, though.
To be clear, I'd suggest that Adaptive is fine, and that this is a feature
request for replicate/rebalance. It would be a challenging, though
interesting, effort to rethink dask's memory management to include control
over cleaning up extra data and ensuring that data is sufficiently replicated,
if anyone is interested.
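Until replicate/rebalance are made resilient, one pragmatic stopgap is to retry them when the assertion fires, on the theory that the cluster may have quieted down by the next attempt. The helper below is a hypothetical sketch (the `flaky` function is a stand-in for a call such as `client.replicate`, not part of any real API); it does not make the operations transactional, it only retries them.

```python
import time

def retry_on_assertion(fn, attempts=3, delay=1.0):
    """Call fn(), retrying on AssertionError.

    A hypothetical stopgap for flaky replicate/rebalance calls in a
    busy cluster: retries give concurrently retiring workers time to
    settle, but provide no transactional guarantee.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except AssertionError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the original failure
            time.sleep(delay)

# Stand-in for a replicate call that fails once, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise AssertionError("count > 0")
    return "replicated"

result = retry_on_assertion(flaky, delay=0.0)
print(result)  # replicated
```

In practice one would wrap the real `client.replicate(...)` call in a closure and pass it to the helper; the retry only papers over the race rather than fixing it, which is why the comment above frames the proper fix as a resilient memory manager.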
…On Mon, Apr 23, 2018 at 1:53 PM, jakirkham ***@***.***> wrote:
Thanks for the follow-up, Matt. So the moral of the story is this is not
so much a bug as a feature request for Adaptive clusters generally. Good
to know.
FWIW even when seeing this warning, it does seem to manage to complete the
work (maybe by repeating it). The warning is a little unsettling though.