Remove worker reconnection logic #6350

Closed
gjoseph92 opened this issue May 16, 2022 · 1 comment · Fixed by #6361

Comments

@gjoseph92
Collaborator

It's proving difficult to merge a fix for #5480. #6329, #6272, and #6341 are causing tests to fail in unexpected and hard-to-debug ways. Because reconnection is quite broken as things stand right now (and is causing issues for users, see #6228), how about we just remove it instead of fixing it?

If a worker's connection to the scheduler is broken, have it shut down immediately. This would at least put us in a consistent state, instead of deadlocking.
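To make the proposal concrete, here is a minimal sketch of "close on disconnect" using only `asyncio`. It is not the actual `distributed` worker code: `SketchWorker`, its `heartbeat` callable, and `demo` are hypothetical names invented for illustration, and the broken connection is simulated by raising an `OSError` subclass.

```python
import asyncio


class SketchWorker:
    """Hypothetical stand-in for a worker; not distributed's Worker class."""

    def __init__(self, heartbeat):
        # `heartbeat` is an assumed coroutine function that raises OSError
        # (or a subclass) when the link to the scheduler is broken.
        self.heartbeat = heartbeat
        self.status = "running"

    async def close(self):
        # A real worker would release its resources here.
        self.status = "closed"

    async def run(self, interval=0.0):
        while self.status == "running":
            try:
                await self.heartbeat()
            except OSError:
                # Connection lost: shut down immediately, no reconnect attempt.
                await self.close()
                return
            await asyncio.sleep(interval)


def demo():
    calls = {"n": 0}

    async def flaky_heartbeat():
        # Simulate a scheduler connection that breaks on the third heartbeat.
        calls["n"] += 1
        if calls["n"] == 3:
            raise ConnectionResetError("scheduler connection lost")

    worker = SketchWorker(flaky_heartbeat)
    asyncio.run(worker.run())
    return worker.status, calls["n"]
```

The point of the sketch is that there is exactly one exit path after a broken connection, so the worker cannot end up half-connected with stale state.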

Questions:

  1. If there's a nanny, should the worker restart?

cc @fjetter @crusaderky @mrocklin

@fjetter
Member

fjetter commented May 17, 2022

I'm in favour of a clean shutdown. I suggest logging an informative message, e.g. one advising users to increase distributed.comm.timeouts.tcp if this issue occurs frequently.
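As a rough illustration of such a message, the sketch below builds and logs a hint with the standard `logging` module. The logger name, the function names, and the exact wording are assumptions made for this example; only the config key `distributed.comm.timeouts.tcp` comes from the comment above.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("sketch.worker")

# Config key named in the comment above.
TCP_TIMEOUT_KEY = "distributed.comm.timeouts.tcp"


def disconnect_message(scheduler_address):
    """Build the hint shown when the scheduler connection is lost."""
    return (
        f"Connection to scheduler {scheduler_address} was lost; closing worker. "
        f"If this happens frequently, consider increasing {TCP_TIMEOUT_KEY}."
    )


def log_disconnect(scheduler_address):
    """Log the hint at error level and return it for inspection."""
    msg = disconnect_message(scheduler_address)
    logger.error(msg)
    return msg
```

Calling `log_disconnect("tcp://127.0.0.1:8786")` (an example address) right before closing would leave users with an actionable pointer instead of a silent exit.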

mrocklin pushed a commit that referenced this issue May 20, 2022
When a worker disconnects from the scheduler, close it immediately instead of trying to reconnect.

Also prohibit workers from joining if they have data in memory, as an alternative to #6341.

Closes #6350
sanderegg added a commit to sanderegg/osparc-simcore that referenced this issue Jun 24, 2022