Add exception details to the exception to help debug (#2549) #2610
Conversation
Force-pushed from 709e836 to af45860
Force-pushed from af45860 to 242485d
@@ -885,7 +885,7 @@ def clean_exception(exception, traceback, **kwargs):
     --------
     error_message: create and serialize errors into message
     """
-    if isinstance(exception, bytes):
+    if isinstance(exception, bytes) or isinstance(exception, bytearray):
I'm curious, why this change?
I couldn't nail down exactly where it comes from, but it seems that when an exception carries a lot of information, it's sent in two frames. When it is small enough, it is represented as bytes; otherwise I believe the network code concatenates those frames into a bytearray. Because of that, the exception was returned without being unpickled.
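To illustrate the situation described above, here is a hedged sketch (the function name and payloads are illustrative, not the actual distributed code path): a pickled exception may arrive as a bytearray once frames are concatenated, so a bytes-only check would leave it undeserialized.

```python
import pickle

def deserialize_exception(exception):
    # Sketch of the check under discussion: the network layer may hand us
    # a bytearray (concatenated frames) instead of bytes, so test for both.
    if isinstance(exception, (bytes, bytearray)):
        exception = pickle.loads(bytes(exception))
    return exception

err = ValueError("boom")
small = pickle.dumps(err)        # a small payload arrives as plain bytes
large = bytearray(small)         # concatenated frames arrive as a bytearray

restored = deserialize_exception(large)
```

With the `(bytes, bytearray)` tuple check, both payload shapes come back as a real exception instance instead of raw pickled data.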
distributed/scheduler.py
Outdated
        worker_host=ws.host,
        worker_local_directory=ws.local_directory,
        worker_metrics=ws.metrics,
        worker_pid=ws.pid))
So, this is a little odd, because a `KilledWorker` error isn't specifically associated with one worker, but rather with a task that has killed a set of workers. I guess, though, that it can be helpful to include information about one representative of that set.

I wonder if, rather than including each of these keywords individually (which locks us into a particularly rigid structure going forward), we might just include the `ws` object itself. That way, every time one of these attributes changes in the future, you're not on the hook to fix things here as well.
This is what I did at first, but it seems to fail to pickle on Python 2.7 (https://travis-ci.org/dask/distributed/jobs/518810169#L2679). I would prefer having `ws` itself.

I did think about "which locks us into a particularly rigid structure going forward", and my way around it was to pass them as kwargs on the exception. The downside is that after an upgrade, if someone uses one of the variables that was removed, it will throw an exception. I'm not sure what is ideal here.
> I would prefer having ws itself.

I recommend that we figure out why `ws` isn't serializable and possibly resolve that, if you have time.
I'm unable to reproduce the CI failure that you have. Is there a good starting point I could look into for running the Python 2.7 tests on Linux with Conda?
I'm not sure I entirely understand. I would clone the repository, which you've done, make a conda environment with Python 2.7, install things, and then run `py.test distributed`:

https://distributed.dask.org/en/latest/develop.html

From what you've said before, it sounds like you've already gotten things to fail on 2.7, so maybe I'm not understanding your question correctly.
It fails in your continuous integration. I've been trying really hard to reproduce the environment properly here and have had some very mixed results on the jankiest setup you can imagine. I was able to get the breakpoint to work only when another breakpoint somewhere else in the code was causing a delay long enough to trigger the race condition, letting me capture the state in the scheduler. Now it has stopped working without any explanation while I was doing a divide and conquer on the worker state fields. The short list left to test is `ws.metrics`, `ws.services`, and `ws.processing`. Perhaps I'll push until my divide and conquer is done, but it's very painful.

The procedure in the docs also doesn't work as is. For example, tornado gets installed at version 6 even though it does not support Python 2.7, and the pytest-timeout package is missing, among other things. Are you able to run the environment easily?
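One way to speed up that divide and conquer is to try pickling each field individually rather than bisecting by hand. This is a rough debugging sketch, assuming the object keeps its fields as plain instance attributes (`find_unpicklable_attrs` is a hypothetical helper, not part of distributed):

```python
import pickle

def find_unpicklable_attrs(obj):
    # Try pickling each instance attribute on its own to localize
    # which field makes the whole object unserializable.
    bad = []
    for name in sorted(vars(obj)):
        try:
            pickle.dumps(getattr(obj, name))
        except Exception:
            bad.append(name)
    return bad
```

Running this against a live `WorkerState` inside a breakpoint would point directly at the offending field instead of requiring repeated CI runs.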
> I've been trying really hard to reproduce the environment properly here and had some very mixed results on the jankiest setup you can imagine.

Heh, sorry to hear that you've been having a frustrating time. Let's see if we can get past that.

First, please note that we have some intermittent failures (they've been quite annoying lately), so if the error looks entirely unrelated to what you're working on, then please don't stress about it.

> It fails in your continuous integration

Do you have a link to the failure?

> Now it stopped working without any explanation while I was doing a divide and conquer on worker state fields

In your situation, I would use a debugger like pdb and look at the state that failed.

> The procedure in the doc also doesn't work as is. For example tornado gets installed at version 6 even though it does not support python 2.7

I'm quite surprised that you were able to install tornado 6 in a Python 2 environment. That either sounds like a terrible bug in tornado's packaging (please report upstream if so), or possibly that you're building an environment in an odd way.
distributed/scheduler.py
Outdated
    def __init__(self, *args, **kwargs):
        super(KilledWorker, self).__init__(*args)
        self.args = args
        self.kwargs = kwargs
If we do as above and include just a single keyword, `last_failed_worker` or something, then it would be good to include that explicitly, rather than catch everything.
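A minimal sketch of that explicit-keyword approach (the attribute name `last_worker` and the class body are illustrative, not the final implementation): storing one named field keeps the exception self-describing and picklable, without a catch-all `**kwargs`.

```python
import pickle

class KilledWorker(Exception):
    # Illustrative sketch: carry the failing task plus one representative
    # worker explicitly, instead of capturing arbitrary keyword arguments.
    def __init__(self, task, last_worker=None):
        # Passing both values to Exception.__init__ populates self.args,
        # which is what the default exception pickling round-trips through.
        super(KilledWorker, self).__init__(task, last_worker)
        self.task = task
        self.last_worker = last_worker

exc = KilledWorker("inc-123", last_worker="tcp://10.0.0.5:8786")
roundtripped = pickle.loads(pickle.dumps(exc))
```

Because the constructor's signature matches `self.args`, the exception survives a pickle round trip with both attributes intact, sidestepping the Python 2.7 serialization issue discussed above.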
Force-pushed from 242485d to fc50a34
Force-pushed from fc50a34 to 2a99fa1
@mrocklin Thanks a lot! Having the -1 did the trick for me. What did you stumble upon that prompted the clean() method on the WorkerState?
I wanted to remove references to TaskState objects, which link to each other to refer to the entire graph. Otherwise you would end up downloading way more state than you wanted.
This is in. Thanks @plbertrand !
* upstream/master:
  Add Type Attribute to TaskState (dask#2657)
  Add waiting task count to progress title bar (dask#2663)
  DOC: Clean up reference to cluster object (dask#2664)
  Allow scheduler to politely close workers as part of shutdown (dask#2651)
  Check direct_to_workers before using get_worker in Client (dask#2656)
  Fixed comment regarding keeping existing level if less verbose (dask#2655)
  Add idle timeout to scheduler (dask#2652)
  Avoid deprecation warnings (dask#2653)
  Use an LRU cache for deserialized functions (dask#2623)
  Rename Worker._close to Worker.close (dask#2650)
  Add Comm closed bookkeeping (dask#2648)
  Explain LocalCluster behavior in Client docstring (dask#2647)
  Add last worker into KilledWorker exception to help debug (dask#2610)
  Set working worker class for dask-ssh (dask#2646)
  Add as_completed methods to docs (dask#2642)
  Add timeout to Client._reconnect (dask#2639)
  Limit test_spill_by_default memory, reenable it (dask#2633)
  Use proper address in worker -> nanny comms (dask#2640)