Explicitly track dependencies in worker #804
This adds state to explicitly track the status of dependencies in workers. Previously we tracked only tasks, and not the dependencies of tasks. This led to some ambiguous situations that were difficult to track down. We are now more explicit and, I think, more robust as a result.
However, as with any significant change to the scheduling logic we should probably expect a tiny bit of havoc in the near future.
Builds on #798
Explanation of what went wrong
Workers track two kinds of keys: keys corresponding to tasks that the worker has been asked to compute, and keys corresponding to data that the worker has to gather from peers. Previously we only modeled the state of tasks that we were asked to compute; keys associated with gathered data were handled implicitly.
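As a rough illustration (names and states here are a hypothetical sketch, not the actual implementation), tracking both kinds of keys explicitly might look like this, with a separate state dictionary for tasks and for dependencies, and an explicit check for the case where a key shows up on both sides:

```python
class WorkerState:
    """Hypothetical sketch: track tasks and dependencies as separate,
    explicit state machines rather than handling gathered data implicitly."""

    def __init__(self):
        self.task_state = {}  # key -> state for tasks we must compute
        self.dep_state = {}   # key -> state for dependencies we must gather
        self.data = {}        # key -> value for keys held in memory

    def add_task(self, key, deps):
        # If we already gathered this key as a dependency, say so explicitly
        # instead of leaving the overlap ambiguous.
        self.task_state[key] = "memory" if key in self.data else "waiting"
        for dep in deps:
            if dep in self.data:
                self.dep_state[dep] = "memory"
            else:
                self.dep_state.setdefault(dep, "waiting")

    def transition_dep_to_memory(self, key, value):
        # A gathered dependency arrives: record it in both the dependency
        # state and, if relevant, the task state.
        self.dep_state[key] = "memory"
        self.data[key] = value
        if self.task_state.get(key) == "waiting":
            self.task_state[key] = "memory"
```

The point of the sketch is the explicit cross-check in `add_task` and `transition_dep_to_memory`: a key that is both a task and a dependency is resolved deliberately rather than by accident.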
This was fine, except when the two systems happened to interact, such as when we gathered a key that we were also supposed to compute, or computed a key that we were supposed to gather. In those cases the behavior was less well defined. This was (or should have been) a rare occurrence, and so wasn't much of an issue on its own.
However, another issue arose due to work stealing. A worker is sometimes given the names of peers that hold a piece of data it wants. If the worker had computed that result itself and was then told that another worker had stolen it, it would remove the data from itself without the scheduler's knowledge (strictly, it did inform the scheduler, but a race condition meant the update could arrive too late). This happened relatively frequently with stolen data, and so workers were often asking peers for data that those peers didn't have. This, again, was fine in isolation because Dask knew how to retry in the face of this error, and so things mostly worked ok.
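The shape of the race can be sketched in a few lines (class and method names here are hypothetical stand-ins, not the actual scheduler API): if the worker deletes the data before the scheduler's view is updated, the scheduler can still direct peers to a worker that no longer holds the key. Ordering the operations so the scheduler acknowledges first closes that window:

```python
import asyncio

class FakeScheduler:
    """Stand-in for the scheduler's view of which workers hold which keys."""

    def __init__(self):
        self.who_has = {}  # key -> set of workers believed to hold it

    async def remove_replica(self, worker, key):
        self.who_has.get(key, set()).discard(worker)

class Worker:
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.data = {}

    async def release_stolen_key(self, key):
        # Inform the scheduler and wait for the acknowledgement *before*
        # deleting the data, so no peer is directed here after the data
        # is gone.  Deleting first opens the race window described above.
        await self.scheduler.remove_replica(worker=self, key=key)
        self.data.pop(key, None)
```

This is only a sketch of the ordering argument; the real protocol involves more message types and more states than shown here.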
Except that when there was a lot of churn in the data dependencies because data wasn't where you expected it (problem 2), and the data dependencies were poorly modeled in the worker (problem 1), bad results bled into the task state on the workers, causing havoc. These problems had been around for a while and were raising errors, but those errors were usually masked by Dask's resilience. We've now resolved both classes of issues and also cleaned up the system that was doing the cleaning up.
This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.