Explicitly track dependencies in worker #804

Merged
mrocklin merged 26 commits into dask:master from mrocklin:worker-dep-state Jan 11, 2017

Conversation

@mrocklin
Member

commented Jan 10, 2017

This adds state to explicitly track the status of dependencies in workers. Previously we tracked only tasks, and not the dependencies of tasks. This led to some ambiguous situations that were difficult to track down. We are now more explicit and, I think, more robust as a result.

However, as with any significant change to the scheduling logic, we should probably expect a tiny bit of havoc in the near future.

Builds on #798

Explanation of what went wrong

Workers track two kinds of keys: keys corresponding to tasks that the worker has been asked to compute, and keys corresponding to data that the worker has to gather. Previously we only modeled the state of tasks that we were asked to compute. The state of keys associated with gathered data was handled implicitly.

This was fine, except when the two systems happened to interact, for example if we gathered a key that we were supposed to compute, or computed a key that we were supposed to gather. In those cases the behavior was less well defined. This was a rare occurrence (or should have been) and so wasn't much of an issue on its own.
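
To make the distinction concrete, here is a minimal sketch of what tracking both kinds of keys explicitly could look like. It is not the code in this PR: `ToyWorker`, `task_state`, `dep_state`, and the state names are illustrative assumptions. The point is only that a key which is both a task and a dependency gets a well-defined state in both tables, however its value arrives.

```python
class ToyWorker:
    """Toy model (hypothetical, not the real Worker class)."""

    def __init__(self):
        self.task_state = {}  # key -> "waiting" | "memory" (tasks we must compute)
        self.dep_state = {}   # key -> "waiting" | "flight" | "memory" (data we must gather)
        self.data = {}        # key -> value currently held in memory

    def add_task(self, key, dependencies=()):
        # Register the task itself.
        self.task_state[key] = "memory" if key in self.data else "waiting"
        # Register each dependency explicitly instead of leaving it implicit.
        for dep in dependencies:
            initial = "memory" if dep in self.data else "waiting"
            self.dep_state.setdefault(dep, initial)

    def start_gather(self, dep):
        # Mark a dependency as being fetched from a peer.
        if self.dep_state.get(dep) == "waiting":
            self.dep_state[dep] = "flight"

    def store(self, key, value):
        # Called when a value arrives, whether computed locally or gathered.
        # Both tables are updated, so the two views cannot disagree.
        self.data[key] = value
        if key in self.task_state:
            self.task_state[key] = "memory"
        if key in self.dep_state:
            self.dep_state[key] = "memory"


worker = ToyWorker()
worker.add_task("y", dependencies=["x"])  # "x" is a dependency of "y"
worker.add_task("x")                      # ... and also a task in its own right
worker.start_gather("x")
worker.store("x", 1)                      # gathered a key we were supposed to compute
```

In this toy version the "gathered a key that we were supposed to compute" case ends with `"x"` marked as in memory in both tables, rather than depending on which code path ran first.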

However, another issue arose due to work stealing. If a worker computed a result and was then told that another worker had stolen it, it would remove the data from itself without informing the scheduler (actually, it did inform the scheduler, but a race condition occurred). As a result, when a worker was given the names of peers that held a piece of data that it wanted, those peers often no longer had it. This was occurring relatively frequently on stolen data, and so workers were often asking peers for data that the peers didn't have. This, again, was fine because Dask knew how to try again in the face of this error, and so again things mostly worked ok.
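
The "try again" behavior can be sketched roughly as follows. This is a simplified, synchronous illustration under stated assumptions, not Dask's actual gather logic: `MissingData`, `fetch_from_peer`, and `gather_with_retry` are hypothetical names, and peers are modeled as plain dictionaries rather than network connections.

```python
class MissingData(Exception):
    """Hypothetical error: a peer no longer holds the requested key."""


def fetch_from_peer(peer_data, key):
    # Stand-in for a network request to a peer worker.
    try:
        return peer_data[key]
    except KeyError:
        raise MissingData(key)


def gather_with_retry(key, who_has, peers):
    """Try each peer believed to hold `key`, skipping stale entries.

    who_has: peer names the scheduler believes hold the key (may be stale)
    peers:   mapping of peer name -> that peer's in-memory data
    Returns (value, stale_peers). Raises MissingData if no peer has the key,
    at which point a real worker would ask the scheduler for fresh locations.
    """
    stale = []
    for name in who_has:
        try:
            value = fetch_from_peer(peers[name], key)
        except MissingData:
            stale.append(name)  # remember which locations were wrong
            continue
        return value, stale
    raise MissingData(key)


# Peer "a" released "x" after it was stolen; peer "b" still has it.
peers = {"a": {}, "b": {"x": 42}}
value, stale = gather_with_retry("x", ["a", "b"], peers)
```

Retrying hides the stale location from the caller, which is why the underlying race could go unnoticed for a long time.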

Except that when there was a lot of churn in the data dependencies because data wasn't where we expected (problem 2), and the data dependencies were poorly modeled in the worker (problem 1), some bad results bled into the task state on the workers, causing havoc. These problems had been around for a while and were raising errors, but the errors were usually handled by Dask's resilience. We've now resolved both classes of issues and also cleaned up the system that was cleaning up.

This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.

@mrocklin mrocklin force-pushed the mrocklin:worker-dep-state branch from a1ec093 to ecb49af Jan 10, 2017

@mrocklin mrocklin changed the title WIP - Explicitly track dependencies in worker Explicitly track dependencies in worker Jan 10, 2017

@mrocklin mrocklin force-pushed the mrocklin:worker-dep-state branch from 6b67a42 to 79279ad Jan 10, 2017

@mrocklin mrocklin force-pushed the mrocklin:worker-dep-state branch from 79279ad to 27aaf8e Jan 11, 2017

@mrocklin mrocklin force-pushed the mrocklin:worker-dep-state branch from 27aaf8e to 5e381c3 Jan 11, 2017

mrocklin added some commits Jan 11, 2017

@mrocklin mrocklin merged commit a36da41 into dask:master Jan 11, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@mrocklin mrocklin deleted the mrocklin:worker-dep-state branch Jan 11, 2017
