
Add exception details to the exception to help debug (#2549) #2610

Merged
merged 2 commits into dask:master on May 1, 2019

Conversation

plbertrand
Contributor

No description provided.

@plbertrand changed the title from "Add exception details to the exception to help debug. Fixes #2549" to "Add exception details to the exception to help debug (#2549)" on Apr 12, 2019
@@ -885,7 +885,7 @@ def clean_exception(exception, traceback, **kwargs):
     --------
     error_message: create and serialize errors into message
     """
-    if isinstance(exception, bytes):
+    if isinstance(exception, bytes) or isinstance(exception, bytearray):
Member

I'm curious, why this change?

Contributor Author

I couldn't nail down exactly where it comes from, but it seems that when an exception carries a lot of information it is passed in two frames. When it is small enough it is represented as bytes; otherwise I believe the network code concatenates those bytes into a bytearray. Because of that, the exception was returned without being unpickled.
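As a minimal sketch of what the extended check allows (the helper name here is made up and this is not the actual clean_exception body): a check limited to bytes silently skips the unpickling step when the payload arrives as a bytearray, so both types need to be handled:

import pickle

def maybe_unpickle(exception):
    # Small payloads arrive as bytes; when the comm layer stitches several
    # frames together, the result can be a bytearray instead.
    if isinstance(exception, (bytes, bytearray)):
        return pickle.loads(exception)  # accepts any bytes-like object on Python 3
    return exception

payload = bytearray(pickle.dumps(ValueError("boom")))
assert isinstance(maybe_unpickle(payload), ValueError)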

worker_host=ws.host,
worker_local_directory=ws.local_directory,
worker_metrics=ws.metrics,
worker_pid=ws.pid))
Member

So, this is a little odd, because a KilledWorker error isn't specifically associated with one worker, but rather with a task that has killed a set of workers. I guess, though, that it can be helpful to include information about one representative of that set.

I wonder if, rather than including each of these keywords individually (which locks us into a particularly rigid structure going forward), we might just include the ws object itself. That way, every time one of these attributes changes in the future, you're not on the hook to fix things here as well.

Contributor Author

This is what I did at first, but it seems to fail to pickle on Python 2.7 (https://travis-ci.org/dask/distributed/jobs/518810169#L2679). I would prefer having ws itself.

I did think about "which locks us into a particularly rigid structure going forward", and my way around it was to pass the values as kwargs on the exception. The downside is that after an upgrade, if someone uses one of the variables that has been removed, it will throw an exception. I'm not sure what is ideal here.

Member

> I would prefer having ws itself.

I recommend that we figure out why ws isn't serializable and possibly resolve that, if you have time.

Contributor Author

I'm unable to reproduce the CI failure that you see. Is there a good starting point I could look at to run the Python 2.7 tests on Linux with Conda?

Member

I'm not sure I entirely understand. I would clone the repository (which you've done), make a conda environment with Python 2.7, install things, and then run py.test distributed.

https://distributed.dask.org/en/latest/develop.html

From what you've said before it sounds like you've already gotten things to fail on 2.7, so maybe I'm not understanding your question correctly.

Contributor Author

It fails in your continuous integration. I've been trying really hard to reproduce the environment properly here and had some very mixed results on the jankiest setup you can imagine. I was able to get the breakpoint to work only when another breakpoint somewhere else in the code caused a delay long enough to trigger the race condition and let me inspect the state in the scheduler. Now it stopped working without any explanation while I was doing a divide and conquer on worker state fields. The short list left to test is ws.metrics, ws.services and ws.processing. Perhaps I'll keep pushing until the divide and conquer is done, but it's very painful. The procedure in the doc also doesn't work as is: for example, tornado gets installed at version 6 even though it does not support Python 2.7, and the pytest-timeout package is missing, among other things. Are you able to run the environment easily?

Member

> I've been trying really hard to reproduce the environment properly here and had some very mixed results on the jankiest setup you can imagine.

Heh, sorry to hear that you've been having a frustrating time. Let's see if we can get past that.

First, please note that we have some intermittent failures (they've been quite annoying lately) so if the error looks entirely unrelated to what you're working on, then please don't stress about it.

> It fails in your continuous integration.

Do you have a link to the failure?

> Now it stopped working without any explanation while I was doing a divide and conquer on worker state fields.

In your situation I would use a debugger like pdb and look at the state that failed.
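For example, a rough sketch of that approach (purely illustrative; it assumes you have a WorkerState instance ws in hand): pickle each public attribute separately and drop into the debugger on the one that fails:

import pdb
import pickle

def find_unpicklable_fields(ws):
    # Try each public, non-callable attribute on its own so the offending
    # field is obvious, then hand control to the debugger for inspection.
    for name in dir(ws):
        if name.startswith("_"):
            continue
        value = getattr(ws, name)
        if callable(value):
            continue
        try:
            pickle.dumps(value)
        except Exception as e:
            print("failed to pickle %r: %s" % (name, e))
            pdb.set_trace()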

> The procedure in the doc also doesn't work as is: for example, tornado gets installed at version 6 even though it does not support Python 2.7.

I'm quite surprised that you were able to install tornado 6 in a Python 2 environment. That either sounds like a terrible bug in tornado's packaging (please report it upstream if so), or possibly that you're building the environment in an odd way.

def __init__(self, *args, **kwargs):
    super(KilledWorker, self).__init__(*args)
    self.args = args
    self.kwargs = kwargs
Member

If we do as above and include just a single keyword, last_failed_worker or something, then it would be good to include it explicitly rather than catch everything.
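A minimal sketch of that suggestion (the keyword name last_failed_worker comes from the comment above; the signature is illustrative, not the code merged in this PR):

class KilledWorker(Exception):
    def __init__(self, task, last_failed_worker=None):
        super(KilledWorker, self).__init__(task, last_failed_worker)
        self.task = task
        # One explicit attribute instead of a catch-all **kwargs keeps the
        # exception's interface small and easy to keep stable.
        self.last_failed_worker = last_failed_worker

Callers could then read exc.last_failed_worker.host or exc.last_failed_worker.pid directly, and adding or removing WorkerState attributes would not change the exception's signature.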

@plbertrand
Contributor Author

@mrocklin Thanks a lot! Having the -1 did the trick for me. What did you stumble upon that prompted the clean() method on the WorkerState?

@mrocklin
Member

I wanted to remove references to TaskState objects, which link to each other to refer to the entire graph. Otherwise you would end up downloading way more state than you wanted.
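Roughly, the idea is something like this (a sketch of the concept, not the actual WorkerState.clean() implementation; the attribute names follow the discussion above and may not match the real WorkerState exactly):

import copy

def clean(ws):
    # Return a copy of the WorkerState that is safe to ship inside an
    # exception: drop the references to TaskState objects, which would
    # otherwise drag the whole task graph along when pickled.
    ws = copy.copy(ws)
    ws.processing = {}
    ws.has_what = set()
    return ws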

@mrocklin merged commit 7b470c4 into dask:master on May 1, 2019
@mrocklin
Member

mrocklin commented May 1, 2019

This is in. Thanks @plbertrand!

muammar added a commit to muammar/distributed that referenced this pull request May 8, 2019
* upstream/master:
  Add Type Attribute to TaskState (dask#2657)
  Add waiting task count to progress title bar (dask#2663)
  DOC: Clean up reference to cluster object (dask#2664)
  Allow scheduler to politely close workers as part of shutdown (dask#2651)
  Check direct_to_workers before using get_worker in Client (dask#2656)
  Fixed comment regarding keeping existing level if less verbose (dask#2655)
  Add idle timeout to scheduler (dask#2652)
  Avoid deprecation warnings (dask#2653)
  Use an LRU cache for deserialized functions (dask#2623)
  Rename Worker._close to Worker.close (dask#2650)
  Add Comm closed bookkeeping (dask#2648)
  Explain LocalCluster behavior in Client docstring (dask#2647)
  Add last worker into KilledWorker exception to help debug  (dask#2610)
  Set working worker class for dask-ssh (dask#2646)
  Add as_completed methods to docs (dask#2642)
  Add timeout to Client._reconnect (dask#2639)
  Limit test_spill_by_default memory, reenable it (dask#2633)
  Use proper address in worker -> nanny comms (dask#2640)