
gather_dep should handle CancelledError #8013

Merged
merged 3 commits into dask:main from gather_dep_hangs on Jul 20, 2023

Conversation

crusaderky
Collaborator

@github-actions
Contributor

github-actions bot commented Jul 18, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

14 files ±0    14 suites ±0    6h 58m 13s ⏱️ +8m 54s
 3 720 tests  +4:    3 608 passed  +2     108 skipped ±0    4 failed +2
24 920 runs  +28:   23 738 passed +26   1 178 skipped ±0    4 failed +2

For more details on these failures, see this check.

Results for commit 2b90156. ± Comparison against base commit b7e5f8f.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky self-assigned this Jul 19, 2023
@crusaderky crusaderky marked this pull request as ready for review July 19, 2023 11:05
@crusaderky crusaderky requested a review from fjetter as a code owner July 19, 2023 11:05
@crusaderky crusaderky mentioned this pull request Jul 19, 2023
@hendrikmakait hendrikmakait self-requested a review July 19, 2023 14:28
# Note: CancelledError and asyncio.TimeoutError are rare conditions
# that can be raised by the network stack.
# See https://github.com/dask/distributed/issues/8006
except (OSError, asyncio.CancelledError, asyncio.TimeoutError):
Member

Adding this makes sense, but I'm wondering if the catch-all below shouldn't have prevented a deadlock in the first place. If not, might we still face a deadlock in the event of an exception being caught by the catch-all (e.g., an error in deserialization)?

Collaborator Author

The catch-all below, which normally catches deserialization errors, is broken to begin with and will lead to a deadlocked cluster too. Fixing it is complicated because it would require implementing a memory->error transition on the scheduler.

Collaborator Author

More generally, I feel that we should treat networking failures, which can and will happen in a production cluster, separately from (de)serialization errors, internal errors, etc., which in theory should be weeded out during development. A rough sketch of that distinction follows below.
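
As a rough illustration of that distinction (hypothetical names; this is not the actual gather_dep code): network-stack failures are translated into a recoverable outcome, while anything else is left to propagate as a bug to be fixed during development.

import asyncio

async def fetch_with_policy(fetch):
    # Hypothetical helper, only to illustrate the policy argued for above.
    try:
        return "ok", await fetch()
    except (OSError, asyncio.CancelledError, asyncio.TimeoutError):
        # Networking failures can and will happen on a production cluster:
        # treat them as recoverable and let the caller retry elsewhere.
        return "retry", None
    # Anything else (deserialization bugs, internal errors) propagates, so
    # it surfaces loudly during development instead of being retried.

async def flaky():
    raise OSError("connection reset by peer")

print(asyncio.run(fetch_with_policy(flaky)))  # -> ('retry', None)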

Member

Thanks for clarifying!

Member

The catch-all does not cover CancelledError since that is a BaseException (since py3.8): https://docs.python.org/3/library/asyncio-exceptions.html#asyncio.CancelledError

That change was motivated by too many people accidentally catching this exception, which caused hard-to-debug problems; see https://bugs.python.org/issue32528 for the conversation.

I think that catching CancelledError here is the correct way forward because gather_dep is not just a coroutine; we always schedule it as a dedicated asyncio.Task. That task terminates immediately in this exception handler, and we're technically not even awaiting it anywhere (it runs as a background task in BaseWorker), so locking up due to not re-raising is impossible.
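
To make the exception-hierarchy point concrete, here is a minimal, runnable sketch (hypothetical names; not distributed code) showing that a plain except Exception does not intercept asyncio.CancelledError on Python 3.8+, and that a fire-and-forget task which swallows the cancellation simply finishes, with nothing deadlocking because nothing awaits it:

import asyncio

async def fetch_forever() -> str:
    # Hypothetical stand-in for the network call inside gather_dep.
    try:
        await asyncio.sleep(3600)
        return "data"
    except Exception:
        # Never reached on cancellation: since Python 3.8,
        # asyncio.CancelledError derives from BaseException, not Exception.
        return "caught by the catch-all"
    except asyncio.CancelledError:
        # Reached when the task is cancelled; returning here swallows the
        # cancellation, like the explicit handler added in this PR.
        return "handled explicitly"

async def main() -> None:
    # Fire-and-forget: nothing awaits the task directly, mirroring how
    # gather_dep is scheduled as a dedicated asyncio.Task.
    task = asyncio.create_task(fetch_forever(), name="gather_dep-sketch")
    await asyncio.sleep(0)  # let the task start and suspend
    task.cancel()
    await asyncio.sleep(0)  # let the cancellation be delivered
    print(task.result())  # -> "handled explicitly"

asyncio.run(main())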

Member

@hendrikmakait hendrikmakait left a comment

Thanks, @crusaderky!

@crusaderky crusaderky merged commit 19c9f4b into dask:main Jul 20, 2023
15 of 27 checks passed
@crusaderky crusaderky deleted the gather_dep_hangs branch July 20, 2023 15:31
Member

@fjetter fjetter left a comment

There was an offline conversation between @crusaderky and myself about this in the context of https://github.com/dask/distributed/pull/7997/files#r1268104944, hence I left a review with more details about my concerns.

Comment on lines +3367 to +3375
await b.in_get_data.wait()
tasks = {
    task for task in asyncio.all_tasks() if "gather_dep" in task.get_name()
}
assert tasks
# There should be only one task but cope with finding more just in case a
# previous test didn't properly clean up
for task in tasks:
    task.cancel()
Member

What I do not like about this PR is this test: it hooks into all running tasks and artificially cancels them. The only place where this actually happens is during BaseWorker.close, but that is clearly not what's concerning us, is it?

How else do we end up in this state?


Successfully merging this pull request may close these issues.

Regression: frequent deadlocks when gather_dep fails to contact peer