Explicitly track dependencies in worker #804
This adds state to explicitly track the status of dependencies in workers. Previously we tracked only tasks, and not the dependencies of tasks. This led to some ambiguous situations that were difficult to track down. We are now more explicit and, I think, more robust as a result.
However, as with any significant change to the scheduling logic we should probably expect a tiny bit of havoc in the near future.
Builds on #798
Explanation of what went wrong
Workers track two kinds of keys: keys corresponding to tasks that the worker has been asked to compute, and keys corresponding to data that the worker has to gather from peers. Previously we only modeled the state of tasks that we were asked to compute; keys associated with gathered data were handled implicitly.
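As a rough illustration (names and states here are a hypothetical sketch, not the actual implementation), tracking both kinds of keys explicitly might look like this, with a separate state dictionary for tasks and for dependencies, and an explicit check for the case where a key shows up on both sides:

```python
class WorkerState:
    """Hypothetical sketch: track tasks and dependencies as separate,
    explicit state machines rather than handling gathered data implicitly."""

    def __init__(self):
        self.task_state = {}  # key -> state for tasks we must compute
        self.dep_state = {}   # key -> state for dependencies we must gather
        self.data = {}        # key -> value for keys held in memory

    def add_task(self, key, deps):
        # If we already gathered this key as a dependency, say so explicitly
        # instead of leaving the overlap ambiguous.
        self.task_state[key] = "memory" if key in self.data else "waiting"
        for dep in deps:
            if dep in self.data:
                self.dep_state[dep] = "memory"
            else:
                self.dep_state.setdefault(dep, "waiting")

    def transition_dep_to_memory(self, key, value):
        # A gathered dependency arrives: record it in both the dependency
        # state and, if relevant, the task state.
        self.dep_state[key] = "memory"
        self.data[key] = value
        if self.task_state.get(key) == "waiting":
            self.task_state[key] = "memory"
```

The point of the sketch is the explicit cross-check in `add_task` and `transition_dep_to_memory`: a key that is both a task and a dependency is resolved deliberately rather than by accident.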
This was fine, except when the two systems happened to interact, such as when we gathered a key that we were also supposed to compute, or computed a key that we were supposed to gather. In those cases the behavior was less well defined. This was (or should have been) a rare occurrence, and so wasn't much of an issue on its own.
However, another issue arose due to work stealing. A worker is sometimes given the names of peers that hold a piece of data it wants. If the worker had computed that result itself and was then told that another worker had stolen it, it would remove the data from itself without the scheduler's knowledge (strictly, it did inform the scheduler, but a race condition meant the update could arrive too late). This happened relatively frequently with stolen data, and so workers were often asking peers for data that those peers didn't have. This, again, was fine in isolation because Dask knew how to retry in the face of this error, and so things mostly worked ok.
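The shape of the race can be sketched in a few lines (class and method names here are hypothetical stand-ins, not the actual scheduler API): if the worker deletes the data before the scheduler's view is updated, the scheduler can still direct peers to a worker that no longer holds the key. Ordering the operations so the scheduler acknowledges first closes that window:

```python
import asyncio

class FakeScheduler:
    """Stand-in for the scheduler's view of which workers hold which keys."""

    def __init__(self):
        self.who_has = {}  # key -> set of workers believed to hold it

    async def remove_replica(self, worker, key):
        self.who_has.get(key, set()).discard(worker)

class Worker:
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.data = {}

    async def release_stolen_key(self, key):
        # Inform the scheduler and wait for the acknowledgement *before*
        # deleting the data, so no peer is directed here after the data
        # is gone.  Deleting first opens the race window described above.
        await self.scheduler.remove_replica(worker=self, key=key)
        self.data.pop(key, None)
```

This is only a sketch of the ordering argument; the real protocol involves more message types and more states than shown here.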
Except that when there was a lot of churn in the data dependencies because data wasn't where you expected it (problem 2), and the data dependencies were poorly modeled in the worker (problem 1), bad results bled into the task state on the workers, causing havoc. These problems had been around for a while and were raising errors, but those errors were usually masked by Dask's resilience. We've now resolved both classes of issues and also cleaned up the system that was doing the cleaning up.
This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.