Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't block event loop in Worker.close waiting for executor to shut down #6239

Closed
gjoseph92 opened this issue Apr 28, 2022 · 2 comments · Fixed by #6091
Closed

Don't block event loop in Worker.close waiting for executor to shut down #6239

gjoseph92 opened this issue Apr 28, 2022 · 2 comments · Fixed by #6091
Labels
bug Something is broken tests Unit tests and/or continuous integration

Comments

@gjoseph92
Copy link
Collaborator

if isinstance(executor, ThreadPoolExecutor):
executor._work_queue.queue.clear()
executor.shutdown(wait=executor_wait, timeout=timeout)
else:
executor.shutdown(wait=executor_wait)

shutdown is a blocking call that generally calls join on the thread/process. If the executor takes a long time to shut down (say there's a function running that's itself blocked on something), this can block the worker event loop for 30s, or whatever the timeout is.

This is particularly important for our test suite, because the worker event loop is actually the only event loop.

I discovered this trying to run a test something like

event = distributed.Event()
f = client.submit(event.wait, workers=[a.address])
t = asyncio.create_task(a.close())
await asyncio.sleep(1)
await event.set()  # this hangs!
# because the whole event loop is blocked waiting for the ThreadPoolExecutor to shut down,
# which is waiting for the event to be set... which is waiting for the event loop to be free

This probably won't affect real-life clusters that much, but would be good to clean up.

cc @graingert

@gjoseph92 gjoseph92 added bug Something is broken tests Unit tests and/or continuous integration labels Apr 28, 2022
@mrocklin
Copy link
Member

mrocklin commented Apr 28, 2022 via email

@gjoseph92
Copy link
Collaborator Author

@mrocklin thanks, #6091 would fix this.

mrocklin pushed a commit that referenced this issue Apr 29, 2022
…6091)

This reinstates #5883
which was reverted in #5961 / #5932

I could confirm the flakyness of `test_missing_data_errant_worker` after this change and am reasonably certain this is caused by #5910 which causes a closing worker to be restarted such that, even after `Worker.close` is done, the worker still appears to be partially up. 

The only reason I can see why this change promotes this behaviour is that if we no longer block the event loop while the threadpool is closing, this opens a much larger window for incoming requests to come in and being processed while close is running.

Closes #6239
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is broken tests Unit tests and/or continuous integration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants