gather_dep should handle CancelledError #8013
Conversation
crusaderky commented on Jul 18, 2023
- Closes #8006 (Regression: frequent deadlocks when gather_dep fails to contact peer)
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
14 files ±0   14 suites ±0   6h 58m 13s ⏱️ +8m 54s
For more details on these failures, see this check.
Results for commit 2b90156. ± Comparison against base commit b7e5f8f.
♻️ This comment has been updated with latest results.
# Note: CancelledError and asyncio.TimeoutError are rare conditions
# that can be raised by the network stack.
# See https://github.com/dask/distributed/issues/8006
except (OSError, asyncio.CancelledError, asyncio.TimeoutError):
Adding this makes sense, but I'm wondering if the catch-all below shouldn't have prevented a deadlock in the first place. If not, might we still face a deadlock in the event of an exception being caught by the catch-all (e.g., an error in deserialization)?
The catch-all below, which normally catches deserialization errors, is broken to begin with and will lead to a deadlocked cluster too. Fixing it is complicated because it would require implementing a memory->error transition on the scheduler.
More generally, I feel we should treat anything that is a networking failure, which can and will happen in a production cluster, differently from (de)serialization errors, internal errors, etc., which in theory should be weeded out during development.
Thanks for clarifying!
The catch-all does not cover CancelledError, since that is a BaseException (since py3.8): https://docs.python.org/3/library/asyncio-exceptions.html#asyncio.CancelledError
This change was motivated by too many people accidentally catching this exception, which then caused hard-to-debug problems; see https://bugs.python.org/issue32528 for the discussion.
I think that catching CancelledError here is the correct way forward because gather_dep is not just a coroutine: we always schedule it as a dedicated asyncio.Task. That task terminates immediately in this exception handler, and we're technically not even awaiting it anywhere (it is a backend task in BaseWorker), so locking up due to not re-raising is impossible.
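To illustrate the point, here is a minimal, self-contained sketch (not code from this PR; the coroutine and task name are made up) showing that a catch-all except Exception never sees the cancellation, while the tuple added in this diff does:

import asyncio

# Since Python 3.8, CancelledError derives from BaseException, not Exception:
print(issubclass(asyncio.CancelledError, Exception))      # False
print(issubclass(asyncio.CancelledError, BaseException))  # True

async def fetch() -> None:
    try:
        await asyncio.sleep(10)  # stands in for a blocking network read
    except (OSError, asyncio.CancelledError, asyncio.TimeoutError):
        # The handler added in this PR: cancellation is treated like any
        # other networking failure and the task winds down cleanly.
        print("network-failure handler ran")
    except Exception:
        # The pre-existing catch-all never runs on cancellation, because
        # CancelledError is not an Exception subclass.
        print("catch-all ran")

async def main() -> None:
    task = asyncio.create_task(fetch(), name="gather_dep-sketch")
    await asyncio.sleep(0.1)  # let fetch() start and block
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)

asyncio.run(main())

Because the handler swallows CancelledError instead of re-raising, the task finishes normally after cleanup, which is exactly the behavior discussed above.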
Thanks, @crusaderky!
There was an offline conversation between @crusaderky and me about this in the context of https://github.com/dask/distributed/pull/7997/files#r1268104944, hence I left a review with more details about my concerns (see the comment above).
await b.in_get_data.wait()
tasks = {
    task for task in asyncio.all_tasks() if "gather_dep" in task.get_name()
}
assert tasks
# There should be only one task, but cope with finding more just in case a
# previous test didn't properly clean up
for task in tasks:
    task.cancel()
What I do not like about this PR is this test. It is just hooking into all running tasks and artificially cancelling them. The only place where this actually happens is during BaseWorker.close, but that is clearly not what's concerning us, is it?
How else do we end up in this state?
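For reference, here is a standalone sketch of the pattern the test relies on — locating a task by name via asyncio.all_tasks() and cancelling it from the outside. The coroutine and task name are illustrative, not taken from distributed:

import asyncio

async def gather_dep_stub() -> None:
    # Stand-in for Worker.gather_dep; just blocks until cancelled.
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        print("gather_dep task was cancelled")

async def main() -> None:
    asyncio.create_task(gather_dep_stub(), name="gather_dep-12345")
    await asyncio.sleep(0)  # let the task start

    # The pattern used by the test: find tasks by name and cancel them.
    tasks = {
        task for task in asyncio.all_tasks() if "gather_dep" in task.get_name()
    }
    assert tasks
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())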