New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explicitly track dependencies in worker #804
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Also put in client= keywords as appropriate
mrocklin
force-pushed
the
worker-dep-state
branch
from
January 10, 2017 16:24
a1ec093
to
ecb49af
Compare
mrocklin
changed the title
WIP - Explicitly track dependencies in worker
Explicitly track dependencies in worker
Jan 10, 2017
mrocklin
force-pushed
the
worker-dep-state
branch
from
January 10, 2017 22:54
6b67a42
to
79279ad
Compare
mrocklin
force-pushed
the
worker-dep-state
branch
from
January 11, 2017 12:42
79279ad
to
27aaf8e
Compare
This reverts commit b9c3f51.
mrocklin
force-pushed
the
worker-dep-state
branch
from
January 11, 2017 12:47
27aaf8e
to
5e381c3
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This adds state to explicitly track the status of dependencies in workers. Previously we tracked only tasks, and not the dependencies of tasks. This led to some ambiguous situations that were difficult to track down. We are now more explicit and, I think, more robust as a result.
However, as with any significant change to the scheduling logic we should probably expect a tiny bit of havoc in the near future.
Builds on #798
Explanation of what went wrong
Workers track two kinds of keys. Keys corresponding to tasks that the worker is being asked to compute and keys corresponding to data that the worker has to gather. Previously we only modeled the state of tasks that we were asked to compute, not keys associated to gathered data, which was handled implicitly.
This was fine, except when the two systems happened to interact, such as would happen if we gathered a key that we were supposed to compute or computed a key that we were supposed to gather. In this case things were less well defined. This was a rare occurrence (or should have been) and so wasn't much of an issue on its own.
However, another issue arose due to work stealing. When a worker was given the names of peers that held a piece of data that it wanted it was often the case that, if the worker computed the result, and was then told that another worker had stolen the data it would remove the data from itself without informing the scheduler (actually, it did inform the scheduler but a race condition occurred). This actually was occurring relatively frequently on stolen data and so workers were often asking peers for data that they didn't have. This, again, was fine because Dask knew how to try again in the face of this error, and so again things mostly worked ok.
Except that when you had a lot of churn in the data dependencies because data wasn't where you expected (problem 2) and when the data dependencies are poorly modeled in the worker (problem 1) then there was some bleeding of bad results into the task state on the workers, causing havoc. These problems had been around for a while and were raising errors but they were usually being handled through Dask's resilience. We've now resolved both classes of issues and also cleaned up the system that was cleaning up.
This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.