
Ensure worker reconnect registers existing tasks properly #5103

Merged
fjetter merged 4 commits into dask:main from worker_reconnect_deadlock on Jul 28, 2021

Conversation

@fjetter (Member) commented Jul 21, 2021

Yet another deadlock and some inconsistent behaviour.

If a worker disconnects while it is computing, e.g. because its connection broke, it is allowed to reconnect, but the behaviour is currently ill defined and there are multiple issues:

  • Tasks currently in the worker's memory are never registered on the scheduler; they are neither forgotten nor ever used.
  • If the original computation finishes before the rescheduled one, the result is simply ignored. We should either let the worker keep that task and transition it to memory on the scheduler, or instruct the worker to forget it again. Currently we do neither.
  • Lastly, since the reconnect submits all tasks it knows via nbytes but not via types, this causes an exception on the worker side which results in another deadlock, see "stuck tasks that never complete when running on many SLURM nodes" #5078: the scheduler is stuck with a partial registration and the worker waits for the heartbeat response.

I went for keeping the result once it is computed, regardless of which worker was supposed to compute it. IMO this is sane behaviour for most computations. I see problems with non-pure tasks, but I don't think we treat that case properly for any kind of rescheduling anyway, do we?
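To illustrate the direction taken, here is a minimal sketch of the "keep the result once computed" idea. The names `SchedulerState`, `TaskRecord` and `on_task_finished` are hypothetical and stand in for the real logic in distributed/scheduler.py:

```python
# A minimal sketch, for illustration only; not distributed's actual code.
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    key: str
    state: str = "waiting"        # waiting / processing / memory
    who_has: set = field(default_factory=set)
    nbytes: int = 0
    typename: str = ""


class SchedulerState:
    def __init__(self):
        self.tasks = {}

    def on_task_finished(self, worker, key, nbytes, typename):
        # Accept the completion even if the task was meanwhile rescheduled
        # to another worker; previously such a report was silently dropped.
        ts = self.tasks.setdefault(key, TaskRecord(key))
        ts.state = "memory"
        ts.who_has.add(worker)
        # Register nbytes *and* the type name together; sending one without
        # the other is what caused the partial registration described above.
        ts.nbytes = nbytes
        ts.typename = typename
```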

@fjetter (Member, Author) commented Jul 23, 2021

This builds on #5034

@fjetter (Member, Author) commented Jul 23, 2021

Test failures on py3.7 appear to be unrelated.

The ubuntu / win-py3.8 failure is a timeout of the newly introduced test test_worker_reconnects_mid_compute; will have a look.

Ran the test locally about 2k times without encountering the failure. Ran it on Windows as well without reproducing it. I assume this is a fluke. I re-triggered everything, and this time a different test fails.
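For reference, one way to repeat a single test many times locally is the pytest-repeat plugin; the snippet below assumes pytest-repeat is installed and that the new test lives in distributed/tests/test_worker.py:

```python
# Repeat one test many times to hunt for a flaky failure.
# Assumes pytest-repeat is installed (pip install pytest-repeat).
import pytest

pytest.main([
    "distributed/tests/test_worker.py::test_worker_reconnects_mid_compute",
    "--count=2000",  # provided by pytest-repeat
    "-x",            # stop at the first failure
])
```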

@fjetter (Member, Author) commented Jul 23, 2021

cc @jrbourbeau @mrocklin

gjoseph92 added a commit to gjoseph92/coiled-parameter-sweep-profiling that referenced this pull request Jul 27, 2021
@fjetter (Member, Author) commented Jul 27, 2021

Test failures appear to be unrelated. Will give it another day and will merge tomorrow if there are no further objections.

@jakirkham I reverted the changes you mentioned. I'd appreciate a short comment on whether you are fine with this.

@jakirkham (Member) commented Jul 28, 2021

Yep the new changes seem fine. Marked the thread resolved. Thank you :)

@fjetter fjetter merged commit 85c95be into dask:main Jul 28, 2021
@fjetter fjetter deleted the worker_reconnect_deadlock branch July 28, 2021 10:59
madsbk added a commit to madsbk/distributed that referenced this pull request Aug 16, 2021
madsbk added a commit to madsbk/distributed that referenced this pull request Aug 16, 2021
madsbk added a commit to madsbk/distributed that referenced this pull request Aug 16, 2021