Ensure worker reconnect registers existing tasks properly #5103
This builds on #5034
Ran the test locally about 2k times without encountering the failure. Ran it on Windows as well and could not reproduce it there either. I assume this is a funny coincidence. I re-triggered everything and this time a different test fails.
8393578 to 3af453d
Test failures appear to be unrelated. I will give it another day and merge tomorrow if there are no further objections. @jakirkham I reverted the changes you mentioned. I'd appreciate a short comment if you are fine with this.
Yep, the new changes seem fine. Marked the thread resolved. Thank you :)
Yet another deadlock and some inconsistent behaviour.
If a worker disconnects while it is computing, e.g. connection broke because of , it is allowed to reconnect, but the behaviour is currently a bit ill-defined and there are multiple issues:
nbytes
but not via types
This causes an exception on the worker side, which actually results in another deadlock; see "stuck tasks that never complete when running on many SLURM nodes" #5078. The scheduler is stuck with a partial registration and the worker waits for the heartbeat response.

I went for the "keep both tasks" approach once computed, regardless of which worker is supposed to compute it. IMO this is sane behaviour for most computations. I see problems with non-pure tasks, but I do think we are not treating this case properly for any kind of rescheduling, are we?
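To make the "keep both tasks" idea concrete, here is a minimal toy sketch. It is not the code from this PR and does not use the distributed API: `ToyScheduler`, `register_worker`, and the `who_has` / `nbytes` / `types` dictionaries are hypothetical names chosen for illustration. It assumes a reconnecting worker reports a size and a type name for every key it holds, and it validates the whole payload before touching any state so that a bad report cannot leave a partial registration behind (that validation step is an assumption of this sketch, not something the description above states).

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for a scheduler; not the distributed API."""

    # task key -> names of the workers that hold the result in memory
    who_has: dict = field(default_factory=dict)
    # task key -> size in bytes, as reported by a worker
    nbytes: dict = field(default_factory=dict)
    # task key -> type name, as reported by a worker
    types: dict = field(default_factory=dict)

    def register_worker(self, worker, in_memory):
        """Register a (re)connecting worker together with the keys it holds.

        `in_memory` maps task key -> (nbytes, type name).
        """
        # First pass: validate everything, so a bad payload changes nothing.
        for key, (size, type_name) in in_memory.items():
            if size < 0 or not type_name:
                raise ValueError(f"incomplete metadata for {key!r}")
        # Second pass: keep every replica, even if the task was meanwhile
        # rescheduled to (and computed by) another worker.
        for key, (size, type_name) in in_memory.items():
            self.who_has.setdefault(key, set()).add(worker)
            self.nbytes[key] = size
            self.types[key] = type_name


scheduler = ToyScheduler()
# Worker "a" dropped its connection while computing "x", so "x" was
# rescheduled to worker "b", which finished it and reported the result.
scheduler.register_worker("b", {"x": (8, "int")})
# Worker "a" finished "x" as well and now reconnects with it in memory.
scheduler.register_worker("a", {"x": (8, "int")})
```

In this toy model both workers end up listed as holders of `x`. As noted above, this is only reasonable for pure tasks, where both copies are interchangeable.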
black distributed / flake8 distributed / isort distributed