gather_dep should handle CancelledError #8013
Conversation
crusaderky commented on Jul 18, 2023
- Closes #8006 (Regression: frequent deadlocks when gather_dep fails to contact peer)
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
14 files ±0   14 suites ±0   6h 58m 13s ⏱️ +8m 54s
For more details on these failures, see this check.
Results for commit 2b90156. ± Comparison against base commit b7e5f8f.
♻️ This comment has been updated with latest results.
# Note: CancelledError and asyncio.TimeoutError are rare conditions
# that can be raised by the network stack.
# See https://github.com/dask/distributed/issues/8006
except (OSError, asyncio.CancelledError, asyncio.TimeoutError):
Adding this makes sense, but I'm wondering if the catch-all below shouldn't have prevented a deadlock in the first place. If not, might we still face a deadlock in the event of an exception being caught by the catch-all (e.g., an error in deserialization)?
The catch-all below, which normally catches deserialization errors, is broken to begin with and will lead to a deadlocked cluster too. Fixing it is complicated because it would require implementing a memory->error transition on the scheduler.
More generally, I feel we should treat anything that is a networking failure, which can and will happen in a production cluster, differently from (de)serialization errors, internal errors, etc., which in theory should be weeded out during development.
Thanks for clarifying!
The catch-all does not cover CancelledError, since that is a BaseException (since py3.8): https://docs.python.org/3/library/asyncio-exceptions.html#asyncio.CancelledError
This change was motivated by too many people accidentally catching this exception, which then caused hard-to-debug problems; see https://bugs.python.org/issue32528 for the discussion.
I think that catching CancelledError here is the correct way forward because gather_dep is not just a coroutine: we always schedule it as a dedicated asyncio.Task. That task terminates immediately in this exception handler, and we're technically not even awaiting it anywhere (it is a backend task in BaseWorker), so locking up due to not re-raising is impossible.
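To illustrate the point, here is a minimal, self-contained sketch (not code from this PR; the coroutine and task name are made up) showing that a catch-all except Exception never sees the cancellation, while the tuple added in this diff does:

import asyncio

# Since Python 3.8, CancelledError derives from BaseException, not Exception:
print(issubclass(asyncio.CancelledError, Exception))      # False
print(issubclass(asyncio.CancelledError, BaseException))  # True

async def fetch() -> None:
    try:
        await asyncio.sleep(10)  # stands in for a blocking network read
    except (OSError, asyncio.CancelledError, asyncio.TimeoutError):
        # The handler added in this PR: cancellation is treated like any
        # other networking failure and the task winds down cleanly.
        print("network-failure handler ran")
    except Exception:
        # The pre-existing catch-all never runs on cancellation, because
        # CancelledError is not an Exception subclass.
        print("catch-all ran")

async def main() -> None:
    task = asyncio.create_task(fetch(), name="gather_dep-sketch")
    await asyncio.sleep(0.1)  # let fetch() start and block
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)

asyncio.run(main())

Because the handler swallows CancelledError instead of re-raising, the task finishes normally after cleanup, which is exactly the behavior discussed above.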
Thanks, @crusaderky!
There was an offline conversation between @crusaderky and me about this in the context of https://github.com/dask/distributed/pull/7997/files#r1268104944, hence I left a review with more details about my concerns (see the comment above).
await b.in_get_data.wait()
tasks = {
    task for task in asyncio.all_tasks() if "gather_dep" in task.get_name()
}
assert tasks
# There should be only one task, but cope with finding more just in case a
# previous test didn't properly clean up
for task in tasks:
    task.cancel()
What I do not like about this PR is this test. It is just hooking into all running tasks and artificially cancelling them. The only place where this actually happens is during BaseWorker.close, but that is clearly not what's concerning us, is it?
How else do we end up in this state?
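For reference, here is a standalone sketch of the pattern the test relies on — locating a task by name via asyncio.all_tasks() and cancelling it from the outside. The coroutine and task name are illustrative, not taken from distributed:

import asyncio

async def gather_dep_stub() -> None:
    # Stand-in for Worker.gather_dep; just blocks until cancelled.
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        print("gather_dep task was cancelled")

async def main() -> None:
    asyncio.create_task(gather_dep_stub(), name="gather_dep-12345")
    await asyncio.sleep(0)  # let the task start

    # The pattern used by the test: find tasks by name and cancel them.
    tasks = {
        task for task in asyncio.all_tasks() if "gather_dep" in task.get_name()
    }
    assert tasks
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())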