Add exception details to the exception to help debug (#2549) #2610
Conversation
Force-pushed from 709e836 to af45860
Force-pushed from af45860 to 242485d
@@ -885,7 +885,7 @@ def clean_exception(exception, traceback, **kwargs):
     --------
     error_message: create and serialize errors into message
     """
-    if isinstance(exception, bytes):
+    if isinstance(exception, bytes) or isinstance(exception, bytearray):
I'm curious, why this change?
I couldn't nail down exactly where it comes from, but it seems that when an exception carries a lot of information, it's sent in two frames. When it is small enough, it is represented as bytes; otherwise I believe the network code concatenates those frames into a bytearray. Because of that, the exception was returned without being unpickled.
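To illustrate the situation described above, here is a hedged sketch (the function name and payloads are illustrative, not the actual distributed code path): a pickled exception may arrive as a bytearray once frames are concatenated, so a bytes-only check would leave it undeserialized.

```python
import pickle

def deserialize_exception(exception):
    # Sketch of the check under discussion: the network layer may hand us
    # a bytearray (concatenated frames) instead of bytes, so test for both.
    if isinstance(exception, (bytes, bytearray)):
        exception = pickle.loads(bytes(exception))
    return exception

err = ValueError("boom")
small = pickle.dumps(err)        # a small payload arrives as plain bytes
large = bytearray(small)         # concatenated frames arrive as a bytearray

restored = deserialize_exception(large)
```

With the `(bytes, bytearray)` tuple check, both payload shapes come back as a real exception instance instead of raw pickled data.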
distributed/scheduler.py
Outdated
        worker_host=ws.host,
        worker_local_directory=ws.local_directory,
        worker_metrics=ws.metrics,
        worker_pid=ws.pid))
So, this is a little odd, because a `KilledWorker` error isn't specifically associated with one worker, but rather with a task that has killed a set of workers. I guess, though, that it can be helpful to include information about one representative of that set.

I wonder if, rather than including each of these keywords individually (which locks us into a particularly rigid structure going forward), we might just include the `ws` object itself. That way, every time one of these attributes changes in the future, you're not on the hook to fix things here as well.
This is what I did at first, but it seems to fail to pickle on Python 2.7 (https://travis-ci.org/dask/distributed/jobs/518810169#L2679). I would prefer having `ws` itself.

I did think about "which locks us into a particularly rigid structure going forward", and my way around it was to pass them as kwargs on the exception. The downside is that after an upgrade, if someone uses one of the variables that was removed, it will throw an exception. I'm not sure what is ideal here.
> I would prefer having ws itself.

I recommend that we figure out why `ws` isn't serializable and possibly resolve that, if you have time.
I'm unable to reproduce the CI failure that you have. Is there a good starting point I could look into for running the Python 2.7 tests on Linux with Conda?
I'm not sure I entirely understand. I would clone the repository, which you've done, make a conda environment with Python 2.7, install things, and then run `py.test distributed`:

https://distributed.dask.org/en/latest/develop.html

From what you've said before, it sounds like you've already gotten things to fail on 2.7, so maybe I'm not understanding your question correctly.
It fails in your continuous integration. I've been trying really hard to reproduce the environment properly here and have had some very mixed results on the jankiest setup you can imagine. I was able to get the breakpoint to work only when another breakpoint somewhere else in the code was causing a delay long enough to trigger the race condition, letting me capture the state in the scheduler. Now it has stopped working without any explanation while I was doing a divide and conquer on the worker state fields. The short list left to test is `ws.metrics`, `ws.services`, and `ws.processing`. Perhaps I'll push until my divide and conquer is done, but it's very painful.

The procedure in the docs also doesn't work as is. For example, tornado gets installed at version 6 even though it does not support Python 2.7, and the pytest-timeout package is missing, among other things. Are you able to run the environment easily?
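One way to speed up that divide and conquer is to try pickling each field individually rather than bisecting by hand. This is a rough debugging sketch, assuming the object keeps its fields as plain instance attributes (`find_unpicklable_attrs` is a hypothetical helper, not part of distributed):

```python
import pickle

def find_unpicklable_attrs(obj):
    # Try pickling each instance attribute on its own to localize
    # which field makes the whole object unserializable.
    bad = []
    for name in sorted(vars(obj)):
        try:
            pickle.dumps(getattr(obj, name))
        except Exception:
            bad.append(name)
    return bad
```

Running this against a live `WorkerState` inside a breakpoint would point directly at the offending field instead of requiring repeated CI runs.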
> I've been trying really hard to reproduce the environment properly here and had some very mixed results on the jankiest setup you can imagine.

Heh, sorry to hear that you've been having a frustrating time. Let's see if we can get past that.

First, please note that we have some intermittent failures (they've been quite annoying lately), so if the error looks entirely unrelated to what you're working on, then please don't stress about it.

> It fails in your continuous integration

Do you have a link to the failure?

> Now it stopped working without any explanation while I was doing a divide and conquer on worker state fields

In your situation, I would use a debugger like pdb and look at the state that failed.

> The procedure in the doc also doesn't work as is. For example tornado gets installed at version 6 even though it does not support python 2.7

I'm quite surprised that you were able to install tornado 6 in a Python 2 environment. That either sounds like a terrible bug in tornado's packaging (please report upstream if so), or possibly that you're building an environment in an odd way.
distributed/scheduler.py
Outdated
    def __init__(self, *args, **kwargs):
        super(KilledWorker, self).__init__(*args)
        self.args = args
        self.kwargs = kwargs
If we do as above and include just a single keyword, `last_failed_worker` or something, then it would be good to include that explicitly, rather than catch everything.
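A minimal sketch of that explicit-keyword approach (the attribute name `last_worker` and the class body are illustrative, not the final implementation): storing one named field keeps the exception self-describing and picklable, without a catch-all `**kwargs`.

```python
import pickle

class KilledWorker(Exception):
    # Illustrative sketch: carry the failing task plus one representative
    # worker explicitly, instead of capturing arbitrary keyword arguments.
    def __init__(self, task, last_worker=None):
        # Passing both values to Exception.__init__ populates self.args,
        # which is what the default exception pickling round-trips through.
        super(KilledWorker, self).__init__(task, last_worker)
        self.task = task
        self.last_worker = last_worker

exc = KilledWorker("inc-123", last_worker="tcp://10.0.0.5:8786")
roundtripped = pickle.loads(pickle.dumps(exc))
```

Because the constructor's signature matches `self.args`, the exception survives a pickle round trip with both attributes intact, sidestepping the Python 2.7 serialization issue discussed above.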
Force-pushed from 242485d to fc50a34
Force-pushed from fc50a34 to 2a99fa1
@mrocklin Thanks a lot! Having the -1 did the trick for me. What did you stumble upon that prompted the clean() method on the WorkerState?
I wanted to remove references to TaskState objects, which link to each other to refer to the entire graph. Otherwise you would end up downloading way more state than you wanted.
This is in. Thanks @plbertrand !
* upstream/master:
  Add Type Attribute to TaskState (dask#2657)
  Add waiting task count to progress title bar (dask#2663)
  DOC: Clean up reference to cluster object (dask#2664)
  Allow scheduler to politely close workers as part of shutdown (dask#2651)
  Check direct_to_workers before using get_worker in Client (dask#2656)
  Fixed comment regarding keeping existing level if less verbose (dask#2655)
  Add idle timeout to scheduler (dask#2652)
  Avoid deprecation warnings (dask#2653)
  Use an LRU cache for deserialized functions (dask#2623)
  Rename Worker._close to Worker.close (dask#2650)
  Add Comm closed bookkeeping (dask#2648)
  Explain LocalCluster behavior in Client docstring (dask#2647)
  Add last worker into KilledWorker exception to help debug (dask#2610)
  Set working worker class for dask-ssh (dask#2646)
  Add as_completed methods to docs (dask#2642)
  Add timeout to Client._reconnect (dask#2639)
  Limit test_spill_by_default memory, reenable it (dask#2633)
  Use proper address in worker -> nanny comms (dask#2640)