
Ensure worker reconnect registers existing tasks properly #5103

Merged
fjetter merged 4 commits into dask:main from worker_reconnect_deadlock on Jul 28, 2021

Conversation

@fjetter (Member) commented Jul 21, 2021

Yet another deadlock and some inconsistent behaviour.

If a worker disconnects while it is computing, e.g. because its connection broke, it is allowed to reconnect, but the behaviour is currently ill defined and there are multiple issues:

  • Tasks currently in the worker's memory are never registered on the scheduler; they are neither forgotten nor ever used.
  • If the original computation finishes before the rescheduled one, the result is simply ignored. We should either let the worker keep that task and transition it to memory on the scheduler, or instruct the worker to forget it again. Currently we do neither.
  • Lastly, since the reconnect submits all tasks it knows via nbytes but not via types, this causes an exception on the worker side which results in another deadlock, see "stuck tasks that never complete when running on many SLURM nodes" #5078: the scheduler is stuck with a partial registration and the worker waits for the heartbeat response.

I went for keeping the result once it is computed, regardless of which worker was supposed to compute it. IMO this is sane behaviour for most computations. I see problems with non-pure tasks, but I don't think we treat that case properly for any kind of rescheduling anyway, do we?
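To illustrate the direction taken, here is a minimal sketch of the "keep the result once computed" idea. The names `SchedulerState`, `TaskRecord` and `on_task_finished` are hypothetical and stand in for the real logic in distributed/scheduler.py:

```python
# A minimal sketch, for illustration only; not distributed's actual code.
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    key: str
    state: str = "waiting"        # waiting / processing / memory
    who_has: set = field(default_factory=set)
    nbytes: int = 0
    typename: str = ""


class SchedulerState:
    def __init__(self):
        self.tasks = {}

    def on_task_finished(self, worker, key, nbytes, typename):
        # Accept the completion even if the task was meanwhile rescheduled
        # to another worker; previously such a report was silently dropped.
        ts = self.tasks.setdefault(key, TaskRecord(key))
        ts.state = "memory"
        ts.who_has.add(worker)
        # Register nbytes *and* the type name together; sending one without
        # the other is what caused the partial registration described above.
        ts.nbytes = nbytes
        ts.typename = typename
```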

@fjetter (Member, Author) commented Jul 23, 2021

This builds on #5034

@fjetter (Member, Author) commented Jul 23, 2021

Test failures on py3.7 appear to be unrelated.

The ubuntu / win-py3.8 failure is a timeout of the newly introduced test test_worker_reconnects_mid_compute; will have a look.

Ran the test locally about 2k times without encountering the failure. Ran it on Windows as well without reproducing it. I assume this is a fluke. I re-triggered everything, and this time a different test fails.
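For reference, one way to repeat a single test many times locally is the pytest-repeat plugin; the snippet below assumes pytest-repeat is installed and that the new test lives in distributed/tests/test_worker.py:

```python
# Repeat one test many times to hunt for a flaky failure.
# Assumes pytest-repeat is installed (pip install pytest-repeat).
import pytest

pytest.main([
    "distributed/tests/test_worker.py::test_worker_reconnects_mid_compute",
    "--count=2000",  # provided by pytest-repeat
    "-x",            # stop at the first failure
])
```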

@fjetter (Member, Author) commented Jul 23, 2021

cc @jrbourbeau @mrocklin

gjoseph92 added a commit to gjoseph92/coiled-parameter-sweep-profiling that referenced this pull request Jul 27, 2021
@fjetter (Member, Author) commented Jul 27, 2021

Test failures appear to be unrelated. Will give it another day and will merge tomorrow if there are no further objections.

@jakirkham I reverted the changes you mentioned. I'd appreciate a short comment on whether you are fine with this.

@jakirkham (Member) commented Jul 28, 2021

Yep the new changes seem fine. Marked the thread resolved. Thank you :)

@fjetter fjetter merged commit 85c95be into dask:main Jul 28, 2021
@fjetter fjetter deleted the worker_reconnect_deadlock branch July 28, 2021 10:59
madsbk added a commit to madsbk/distributed that referenced this pull request Aug 16, 2021
madsbk added a commit to madsbk/distributed that referenced this pull request Aug 16, 2021
madsbk added a commit to madsbk/distributed that referenced this pull request Aug 16, 2021