
Don't make False add-keys report to scheduler #2421

Merged: 3 commits merged into dask:master on Jul 15, 2019

Conversation

tjb900
Contributor

@tjb900 tjb900 commented Dec 18, 2018

Possible fix for #2420, for now putting here mostly to trigger CI.

Cleans up (imho) a couple of things in the worker:

  • responsibility for maintaining the in_flight_tasks / in_flight_workers maps is entirely contained within the transition_dep_* functions
  • better handles the case where dependencies are released (due to e.g. stealing) after they are marked as in-flight but before or during the transfer. In particular, it removes an add-keys notification to the scheduler for keys that we might then throw in the trash because they are no longer needed (rough sketch below).
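
A rough sketch of what the second point amounts to in the flight-to-memory transition; the names here (transition_dep_flight_memory, dep_state, batched_stream) are simplified assumptions about the worker internals of the time, not the actual diff:

def transition_dep_flight_memory(self, dep, value=None):
    # whatever happens next, the dependency is no longer in flight
    self.in_flight_tasks.pop(dep, None)
    if dep in self.dep_state:
        # still wanted: store it and report add-keys to the scheduler
        self.dep_state[dep] = "memory"
        self.data[dep] = value
        self.batched_stream.send({"op": "add-keys", "keys": [dep]})
    # otherwise the dependency was released while in flight: discard it and
    # send no add-keys report for data that is immediately thrown away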

@mrocklin
Member

Thank you for submitting this @tjb900

It looks like there are a few smaller changes here. In general we're pretty conservative with changes to the scheduling logic, so I encourage you to isolate changes into separate PRs where possible, and to demonstrate with tests that they improve concrete situations. Often constructing a minimal test that shows off failing behavior can be difficult, but it is quite useful to ensure that future changes don't subtly break things again.

@tjb900
Contributor Author

tjb900 commented Dec 19, 2018

Will do @mrocklin - thanks for the feedback. Will keep this PR for the removal of the add-keys only.

@mrocklin
Member

I think that we still want to inform the scheduler that we have this data in the common case that we decide to store it.

@tjb900
Contributor Author

tjb900 commented Dec 20, 2018

I'm pretty sure that's done inside transition_dep_flight_memory?

@tjb900
Contributor Author

tjb900 commented Dec 20, 2018

(still planning to come up with a test, too - so far I have just removed all of the other changes)

Member

@mrocklin mrocklin left a comment

Thanks for following up and adding the test @tjb900 . I've added a few minor comments in-line.

One larger issue is time. We have a few thousand tests in the test suite and like to run them frequently. Ideally we would find a way to significantly reduce the time of this test (most take under a few hundred milliseconds) while still making it sensitive to the issue at hand. Thoughts?

    return 2

y = c.submit(bogus_task, x, workers=b.address)
yield c._cancel(y)
Member

You can safely drop the underscore here. The public functions will operate asynchronously in an asynchronous context.
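
For example, the cancellation in the snippet above could simply become (a sketch using the same c and y):

yield c.cancel(y)  # the public method behaves asynchronously inside the coroutine-based test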

Member

I'm slightly concerned that y may never start. Should we wait until b is executing something?

while not b.executing:
    yield gen.sleep(0.01)

Contributor Author

Actually, I think it's important that y never starts - we need to cause x to be transferred from a to b, but then have y be cancelled while that transfer is occurring - so that x is never actually placed in memory on b. This timing is hard to arrange in the test (and I suspect it's now even harder with #2428).

yield gen.sleep(5)
try:
    yield wait(y)
except:
Member

I recommend except Exception: otherwise this catches KeyboardInterrupts

Contributor Author

Yeah, I agree this except is way too broad actually - it's really for the CancelledError. Will fix.
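
For reference, the narrower handler might look roughly like this (a sketch; CancelledError is the concurrent.futures exception, and whether wait raises it in this situation is an assumption about the behaviour under test):

from concurrent.futures import CancelledError

try:
    yield wait(y)
except CancelledError:
    pass  # y was cancelled on purpose; any other exception should fail the test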

@tjb900
Contributor Author

tjb900 commented Jan 4, 2019

Thanks @mrocklin. Understood re the timing. The long waits are arbitrary and should really be replaced with something better.

The three cases I need to trigger on are:

  • gather_dep on b has started to fetch the data. This is the cue to cancel y (but note that we need to arrange for the cancellation notice from the scheduler to reach b before the data from a does).
  • gather_dep on b has finished fetching the data and enqueued any resulting communications to the scheduler.
  • those batched communications from b to the scheduler have completed

For the first two I don't have any great ideas, for now. Checking b.log seems very hacky. Ideally we could put test hooks into the RPCs so the test could control when they are allowed to return (is this something that would be welcome?).

For the third, maybe I can call the batched comm callback directly from the test rather than waiting for the periodic callback.

@tjb900
Contributor Author

tjb900 commented Jan 4, 2019

@mrocklin tbh, I'm a little concerned that this test is going to end up so targeted at the current implementation of the worker that it is highly unlikely to catch a future regression. On top of this, because it depends on the internal details of the worker rather than some kind of semi-stable API, it comes with a maintenance burden of updating the test to match any logic changes inside the worker.

Is there instead a longer running "test everything" test which we could augment to catch this situation? For instance, by verifying after a large amount of work that the scheduler and workers' ideas of what they all have are consistent with each other?

@mrocklin
Member

mrocklin commented Jan 4, 2019

For gather_dep it might work to look at Worker.in_flight_workers. This will be populated while the dependency is possibly in flight, although there is definitely a period when it's in in_flight_workers before the transfer has started.
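
For illustration, that check could look roughly like this inside the test coroutine (a sketch; it assumes a holds x, b is the receiving worker, and that in_flight_workers is keyed by peer address):

while a.address not in b.in_flight_workers:
    yield gen.sleep(0.01)
yield c.cancel(y)  # cancel once the transfer from a to b is (probably) under way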

but note that we need to arrange for the cancellation notice from s to reach b before the data from a does

As a warning, you should count on random 3ish second delays from time to time. A combination of slow containers on travis-ci and some intense GC can play havoc with any test that is depending on something happening within a finite window.

Another approach?

I've tended to avoid intricately orchestrated tests for this reason, and instead try to find statistical tests. Is there something we could do quickly a hundred times that would have some measurable negative effect?

In particular, I wonder if we might draw inspiration from your original concern:

We are working in an adaptive-scaling environment, and are hitting a race condition where sometimes workers don't have what the scheduler thinks they have - and thus frees data that it shouldn't.

Is there a lightweight system we could put together that would simulate this? Maybe we start and stop workers very quickly and submit and cancel tasks very quickly at the same time? Then we verify, at the end (or throughout), that everything is as it should be. My guess is that you could construct a system that would test what you've done here, probably test a bunch of other things at the same time, and have it finish in less time. A test like this might also survive future refactorings and keep us honest for longer.
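
For what it's worth, a very rough sketch of such a stress test against the usual s/a/b/c gen_cluster fixtures; the loop counts, the use of inc from distributed.utils_test, and the final has_what comparison are all assumptions about how the check could be put together, not an existing test:

import random
from tornado import gen
from distributed import wait
from distributed.utils_test import inc

for _ in range(100):
    futures = c.map(inc, range(20), pure=False)
    if random.random() < 0.5:
        yield c.cancel(futures[::2])               # cancel some of the work mid-flight
    yield wait([f for f in futures if not f.cancelled()])

yield gen.sleep(0.1)                               # give release messages a moment to propagate
# every key the scheduler believes a worker holds should really be there
for w in [a, b]:
    assert set(s.has_what[w.address]) == set(w.data)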

Thoughts?

@mrocklin
Member

mrocklin commented Jan 4, 2019

Whoops, I wrote my answer while you were writing yours. Sounds like we came to the same conclusion independently :)

@mrocklin
Member

mrocklin commented Jan 4, 2019

Is there instead a longer running "test everything" test which we could augment to catch this situation? For instance, by verifying after a large amount of work that the scheduler and workers' ideas of what they all have are consistent with each other?

There are a variety of such tests in test_stress.py, but obviously none of them have caught this :)

I think that a new stress test that focused on highly adaptive clusters would be useful generally if you're interested in cooking something up (no obligation though).

@mrocklin
Member

Checking in here @tjb900. Any thoughts?

@tjb900
Contributor Author

tjb900 commented Feb 1, 2019

Definitely keen to work on a test that exercises the adaptive functionality, but I'm snowed under at the moment, so it's on hold for now, sorry.

@jrbourbeau
Member

Checking in here @tjb900, just wondering what the status is on this PR. Do you plan to pick things up again?

@tjb900
Contributor Author

tjb900 commented Jun 24, 2019

Hi @jrbourbeau - sorry I've left this languishing for so long. Realistically, I don't think there's a chance I'll get to it, sorry. Happy to close.

@jrbourbeau
Member

I removed test_worker_keeps_data_gh2420 and merged the current master branch into this branch, hope that's okay @tjb900.

The proposed changes seem reasonable to me (I discussed this issue with @mrocklin yesterday). @mrocklin, when you get a moment, would you mind taking another look at this PR?

@mrocklin mrocklin changed the title from "WIP: don't make False add-keys report to scheduler" to "Don't make False add-keys report to scheduler" on Jul 15, 2019
@mrocklin mrocklin merged commit d493498 into dask:master Jul 15, 2019
muammar added a commit to muammar/distributed that referenced this pull request Jul 18, 2019
* upstream/master: (33 commits)
  SpecCluster: move init logic into start (dask#2850)
  Dont reuse closed worker in get_worker (dask#2841)
  Add alternative SSHCluster implementation (dask#2827)
  Extend prometheus metrics endpoint (dask#2792) (dask#2833)
  Include type name in SpecCluster repr (dask#2834)
  Don't make False add-keys report to scheduler (dask#2421)
  Add Nanny to worker docs (dask#2826)
  Respect security configuration in LocalCluster (dask#2822)
  bump version to 2.1.0
  Fix typo that prevented error message (dask#2825)
  Remove dask-mpi (dask#2824)
  Updates to use update_graph in task journey docs (dask#2821)
  Fix Client repr with memory_info=None (dask#2816)
  Fix case where key, rather than TaskState, could end up in ts.waiting_on (dask#2819)
  Use Keyword-only arguments (dask#2814)
  Relax check for worker references in cluster context manager (dask#2813)
  Add HTTPS support for the dashboard (dask#2812)
  CLN: Use dask.utils.format_bytes (dask#2810)
  bump version to 2.0.1
  Add python_requires entry to setup.py (dask#2807)
  ...