Stress disconnect #798
Closed
1e5e25c to 1a619a4

mrocklin added a commit that referenced this pull request on Jan 11, 2017:
This adds state to explicitly track the status of dependencies in workers. Previously we tracked only tasks, not the dependencies of tasks, which led to some ambiguous situations that were difficult to track down. We are now more explicit and, I think, more robust as a result. However, as with any significant change to the scheduling logic, we should probably expect a tiny bit of havoc in the near future. Builds on #798

Explanation of what went wrong
------------------------------

Workers track two kinds of keys: keys corresponding to tasks that the worker has been asked to compute, and keys corresponding to data that the worker has to gather. Previously we modeled only the state of tasks we were asked to compute; keys associated with gathered data were handled implicitly. This was fine, except when the two systems happened to interact, such as when we gathered a key that we were supposed to compute, or computed a key that we were supposed to gather. In those cases behavior was less well defined. This was a rare occurrence (or should have been) and so wasn't much of an issue on its own.

However, another issue arose due to work stealing. When a worker computed a result and was then told that another worker had stolen that data, it would remove the data from itself without informing the scheduler (actually, it did inform the scheduler, but a race condition occurred). This happened relatively frequently with stolen data, so workers were often asking peers for data that they didn't have. This, again, was fine on its own because Dask knew how to try again in the face of this error, and so things mostly worked OK.
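To make the two kinds of key tracking concrete, here is a minimal sketch of a worker that models task state and dependency state separately. All names here are hypothetical illustrations, not the actual `distributed` worker implementation:

```python
# Hypothetical sketch: a worker that explicitly tracks both the tasks it
# must compute and the dependencies it must gather, so that a key playing
# both roles stays unambiguous.

class WorkerState:
    def __init__(self):
        self.task_state = {}  # keys we were asked to compute: key -> state
        self.dep_state = {}   # keys we must gather from peers: key -> state
        self.data = {}        # key -> value actually held in memory

    def add_task(self, key):
        self.task_state[key] = "waiting"

    def add_dependency(self, key):
        # A key can be a dependency even if it is also a local task;
        # modeling both explicitly avoids the ambiguity described above.
        self.dep_state.setdefault(key, "waiting")

    def task_finished(self, key, value):
        self.task_state[key] = "memory"
        self.data[key] = value
        # If the same key was also tracked as a dependency, mark it
        # satisfied rather than silently forgetting about it.
        if key in self.dep_state:
            self.dep_state[key] = "memory"

    def dependency_arrived(self, key, value):
        self.dep_state[key] = "memory"
        self.data[key] = value
        # We no longer need to compute a key we already hold.
        if key in self.task_state:
            self.task_state[key] = "memory"
```

With only `task_state` present (the old situation), the `dependency_arrived` and `task_finished` cross-checks have nowhere to record what happened, which is where the ambiguity crept in.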
Except that when there was a lot of churn in the data dependencies because data wasn't where we expected it to be (problem 2), and when the data dependencies were poorly modeled in the worker (problem 1), bad results bled into the task state on the workers, causing havoc. These problems had been around for a while and were raising errors, but those errors were usually handled by Dask's resilience. We've now resolved both classes of issues and also cleaned up the system that was doing the cleaning up. This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.
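The "try again" resilience that masked these errors can be sketched roughly as follows. This is a hypothetical illustration of the retry pattern, not the actual gather code in `distributed`:

```python
# Hypothetical sketch: gather a key from peers that supposedly hold it,
# and when every listed peer turns out not to have it (e.g. the data was
# released after stealing without the scheduler hearing about it), ask
# for a fresh list of holders and retry.

import random

def gather_with_retry(key, who_has, fetch, refresh_who_has, max_tries=3):
    for attempt in range(max_tries):
        # Try listed peers in random order to spread load.
        for peer in random.sample(who_has, len(who_has)):
            value = fetch(peer, key)  # returns None if the peer lost the data
            if value is not None:
                return value
        # No listed peer had the data: refresh the holder list and retry.
        who_has = refresh_who_has(key)
    raise KeyError(f"Could not gather {key!r} after {max_tries} tries")
```

This kind of loop keeps computations moving even when the scheduler's view of who holds what is stale, which is exactly why the underlying race went unnoticed for so long.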