-
-
Notifications
You must be signed in to change notification settings - Fork 718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't block event loop in Worker.close
waiting for executor to shut down
#6239
Comments
gjoseph92
added
bug
Something is broken
tests
Unit tests and/or continuous integration
labels
Apr 28, 2022
See also #6091 maybe
…On Thu, Apr 28, 2022 at 3:03 PM Gabe Joseph ***@***.***> wrote:
https://github.com/dask/distributed/blob/1cbee7f18717f060c65b2ac44445745135c8dd46/distributed/worker.py#L1517-L1521
shutdown is a blocking call that generally calls join on the
thread/process. If the executor takes a long time to shut down (say there's
a function running that's itself blocked on something), this can block the
worker event loop for 30s, or whatever the timeout is.
This is particularly important for our test suite, because the worker
event loop is actually the *only* event loop.
I discovered this trying to run a test something like
event = distributed.Event()f = client.submit(event.wait, workers=[a.address])t = asyncio.create_task(a.close())await asyncio.sleep(1)await event.set() # this hangs!# because the whole event loop is blocked waiting for the ThreadPoolExecutor to shut down,# which is waiting for the event to be set... which is waiting for the event loop to be free
This *probably* won't affect real-life clusters that much, but would be
good to clean up.
cc @graingert <https://github.com/graingert>
—
Reply to this email directly, view it on GitHub
<#6239>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTAGMPXPA65JR2XKCLTVHLVIBANCNFSM5UTTGKJA>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
mrocklin
pushed a commit
that referenced
this issue
Apr 29, 2022
…6091) This reinstates #5883 which was reverted in #5961 / #5932 I could confirm the flakyness of `test_missing_data_errant_worker` after this change and am reasonably certain this is caused by #5910 which causes a closing worker to be restarted such that, even after `Worker.close` is done, the worker still appears to be partially up. The only reason I can see why this change promotes this behaviour is that if we no longer block the event loop while the threadpool is closing, this opens a much larger window for incoming requests to come in and being processed while close is running. Closes #6239
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
distributed/distributed/worker.py
Lines 1517 to 1521 in 1cbee7f
shutdown
is a blocking call that generally callsjoin
on the thread/process. If the executor takes a long time to shut down (say there's a function running that's itself blocked on something), this can block the worker event loop for 30s, or whatever the timeout is.This is particularly important for our test suite, because the worker event loop is actually the only event loop.
I discovered this trying to run a test something like
This probably won't affect real-life clusters that much, but would be good to clean up.
cc @graingert
The text was updated successfully, but these errors were encountered: