
Add resilient example #75

Merged (5 commits, Jun 13, 2019)

Conversation

@willirath (Contributor) commented on Jun 5, 2019


See #74

ToDos (so far):

- Merge cells where it makes sense
- Increase memory limit
- Use Dask's config API to increase resilience
- Simplify worker-killing logic

@mrocklin (Member) commented on Jun 5, 2019

Some small notes:

  1. Let's merge cells together when we can, for example the first three that import, make a cluster, and make a client. The user doesn't gain anything by pressing Shift-Enter three times rather than once.

  2. The memory limit is small. If they just import pandas then we'll probably be close to the limit, which will raise lots of annoying error messages.

  3. cluster.scheduler.allowed_failures = int(1e32) is internal API. Maybe use dask.config.set({'distributed.scheduler.allowed-failures': 1e32}) before creating the cluster instead? (See the sketch after this list.)

  4. The function _get_worker_pids seems to be a one-liner that is called once. Let's just call the code directly instead of hiding it in a function:

    pids = [w.pid for w in cluster.scheduler.workers.values()]
    pids
  5. Same with _get_preemptible_worker_pids. Also, let's change the filter into a list comprehension, which seems to be more broadly understood.

  6. It's taking me a while to understand maybe_kill_n_perc_of_workers_and_wait, which means that a new reader probably doesn't have much of a chance. The call chain is too many levels deep for me to trace through what is happening directly.

    I wonder if we might be able to do something with fewer steps of indirection like the following.

    import os
    import random
    import signal
    from time import sleep

    def kill_a_worker():
        # pick any current worker and terminate it
        worker = random.choice(list(cluster.scheduler.workers.values()))
        os.kill(worker.pid, signal.SIGTERM)

    summed = client.compute(summed)

    while not summed.done():
        kill_a_worker()
        sleep(...)

    This is subjective (code is always way simpler to the author than the reader) but I think that I could put this in front of someone and they would have a better chance of tracking things through in a 10-20s attention span.
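
For reference, a minimal sketch of what points 1 and 3 might look like when combined; the worker count, memory limit, and threshold value here are illustrative, not taken from the notebook:

    import dask
    from dask.distributed import Client, LocalCluster

    # Raise the allowed-failures threshold via the config API *before*
    # creating the cluster (100 is an illustrative value).
    dask.config.set({"distributed.scheduler.allowed-failures": 100})

    # Imports, cluster, and client merged into a single cell, as suggested in point 1.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="1GB")
    client = Client(cluster)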

@willirath (Contributor, author):

Thanks @mrocklin for these remarks. I have addressed them as far as possible.

There are a few things, however, that I couldn't simplify as much as I'd have liked:

  • While somewhat more streamlined now, the logic for finding a worker to kill while always leaving some workers running is this (a driving loop is sketched after this list):

        import os
        import random

        _all_current_workers = [w.pid for w in cluster.scheduler.workers.values()]
        non_preemptible_workers = _all_current_workers[:2]

        def kill_a_worker():
            preemptible_workers = [
                w.pid for w in cluster.scheduler.workers.values()
                if w.pid not in non_preemptible_workers]
            if preemptible_workers:  # guard: random.choice([]) would raise
                os.kill(random.choice(preemptible_workers), 15)
  • The memory limits are at 400e6 bytes times 4 workers now, but I'm not sure Binder will be fine if we really use all of this at the same time. (They limit to 1 or 2 GB depending on load, IIRC.)
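
For completeness, a hedged sketch of how kill_a_worker might be driven; summed stands in for the notebook's computation and the pause is an arbitrary placeholder:

    from time import sleep

    # Hypothetical driving loop: `summed` stands in for the notebook's computation.
    summed = client.compute(summed)

    while not summed.done():
        kill_a_worker()
        sleep(5)  # illustrative pause between kills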

@willirath (Contributor, author):

Add example dealing with Exceptions in user code? Relevant docs: http://distributed.dask.org/en/latest/resilience.html#user-code-failures

I think this deserves a separate example (could add more than one).
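
A minimal sketch of what such a user-code-failure example could show, assuming a plain local client; the flaky function is made up for illustration:

    from dask.distributed import Client

    client = Client()  # assumes a running local cluster

    def flaky(x):
        # deliberately fail inside user code
        if x == 3:
            raise ValueError("user code failed on purpose")
        return x + 1

    futures = client.map(flaky, range(5))

    # Exceptions raised in user code are re-raised locally when results are
    # gathered; the workers themselves stay healthy.
    try:
        client.gather(futures)
    except ValueError as err:
        print("caught:", err)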

TomAugspurger added a commit to TomAugspurger/distributed that referenced this pull request on Jun 7, 2019: "This allows for setting the config after importing distributed" (xref dask/dask-examples#75 (comment)).
@TomAugspurger (Member) left a comment:

This generally looks good to me. I would remove the comment about needing to set the config before importing distributed, since that may change.

@willirath (Contributor, author):

I've removed the comment. Thanks for having a look, @TomAugspurger.

@guillaumeeb (Member) left a comment:

This is really interesting from a dask-jobqueue perspective!

I've made some small comments.

More generally, I executed the whole notebook before understanding that the proposed solution to the problem was implemented in the first executable cell! Would it be possible to demonstrate the problem first and then fix it, to make the notebook more educational?

What about an example with as_completed? But this may not be the point here, and may be better suited to the dask-jobqueue documentation. Speaking of which, the solution proposed in this notebook should be in the dask-jobqueue docs, or even the distributed docs.
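
For context, a rough sketch of the as_completed pattern that comment refers to; the squaring workload is purely illustrative:

    from dask.distributed import Client, as_completed

    client = Client()  # assumes an existing cluster
    futures = client.map(lambda x: x ** 2, range(10))

    # Handle each result as soon as its future finishes, in completion order.
    for future in as_completed(futures):
        print(future.result())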

resilience.ipynb (outdated cell under discussion):

    ## Increase resilience

    Whenever a worker shuts down, the scheduler will increment the suspiciousness counter of _all_ tasks that were assigned (not necessarily computing) to the worker in question. Whenever the suspiciousness of a task exceeds a certain threshold (3 by default), the task will be considered broken. We want to compute many tasks on only a few workers, with workers shutting down randomly, so we expect the suspiciousness of all tasks to grow rapidly. Let's effectively disable the threshold by setting it to a large number:
Member:

Interesting! Isn't there another way? Could we add a per-task counter in Dask somehow?

Member:

If there are multiple tasks executing at once when a worker is killed (say, from a segfault), it may be hard to know exactly which task caused the segfault.

Member:

Hm, I agree, but shouldn't it count as one global failure? And we should be allowed 3 of these failures. As far as I understand, currently if there are more than 3 tasks failing due to this one task error, the computation is halted?

@willirath (author):

Ah, now I understand where the misunderstanding is: this is a per-task counter, but all tasks belonging to a given worker when that worker dies will be marked as suspicious. With a large number of tasks per worker, these counters still grow fast. There's no way for the scheduler to know which task was active at the time of failure, so all are marked.

(And it's also not always clear that the task currently computing is the problem. Think of a task that needs more memory or disk than the cluster can cope with. This task might not directly lead to failure but just let reasonably sized tasks trigger OOM kills further down the road.)
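
A hedged way to observe these per-task counters on a LocalCluster; this relies on internal scheduler state, so attribute names may change between versions:

    # Internal scheduler state (attribute names may change between versions):
    # each TaskState carries a `suspicious` counter that is incremented whenever
    # a worker holding that task dies.
    suspicious_counts = {
        key: ts.suspicious for key, ts in cluster.scheduler.tasks.items()
    }
    print(max(suspicious_counts.values(), default=0))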

Member:

So you confirm there is one counter for each task, and these counters take a +1 failure for all the tasks that were on a given worker when it dies? That means that, by default, we should be robust to 3 workers going down in one computation. Is this the case?

mrocklin pushed a commit to dask/distributed that referenced this pull request on Jun 8, 2019: "This allows for setting the config after importing distributed" (xref dask/dask-examples#75 (comment)).
@willirath (Contributor, author):

@guillaumeeb, part of the reason for creating this example was to encourage others to explore Dask's resilience. There's a lot of potential!

I agree that this should be expanded on in a jobqueue or Kubernetes context, but I'd argue for keeping this notebook short and simple.

@guillaumeeb (Member):

And thanks for that @willirath!

And yes, let's just keep this notebook simple and elaborate more in a dask-jobqueue context. That's a good example for a dedicated notebook gallery.

@mrocklin (Member) left a comment:

Two small suggested changes.

Also, when I run this on Binder I find that we make progress only slightly faster than we kill workers, which is a little frustrating. I wonder if we might reduce the total execution time and also increase the interval between killing workers. I think we still get the point across that "yes, workers can die" without having them die ten times during the computation. Thoughts?

@willirath (Contributor, author):

I've added links to Dask Kubernetes and Dask Jobqueue. And after making the example computation smaller and faster and letting fewer workers die, I've reduced the allowed_failures threshold to a still comfortable 100.

@TomAugspurger (Member):

Restarted the failed build. LMK if you see that it fails / succeeds before I do.

@TomAugspurger (Member):

All green. Thanks @willirath!

@TomAugspurger merged commit f6485d2 into dask:master on Jun 13, 2019.
@willirath (Contributor, author):

Let's revisit this as soon as distributed v2 is out. With dask/distributed#2761 and later modification of allowed_failures this will be easier to understand.
