Handle `CancelledError` properly in `ConnectionPool` #8110

fjetter · 2023-08-15T16:14:09Z

This reverts #8013 and changes the way how we're handling cancellations in the connection pool.

I haven't figured out quite how we end up in this situation but somehow the connect_attempt task is raising a CancelledError even though the task itself is catching CancelledErrors and is supposed to reraise as CommClosed. For whatever reason this is not happening and is therefore raising the CancelledError

If the task is cancelled before the task is actually scheduled, awaiting it can still cause a CancelledError to be raised even if the body handles this and reraises as CommClosed. I don't know why this issue hasn't affected us before.

Catching the CancelledError in gather_dep is causing other tests to flake. Explicitly, I found that distributed.tests.test_scheduler.test_tell_workers_when_peers_have_left is affected by this change in gather_dep. If #8013 is reverted, that test is rock solid.

Generally, these two tests have been sensitive to changes around this cancellation

distributed/tests/test_worker_metrics.py::test_gather_dep_network_error
distributed/tests/test_scheduler.py::test_tell_workers_when_peers_have_left

and with this PR, both tests are rock solid on my machine.

This may collide / supersede / be superseded by #8109
Related #8108

fjetter · 2023-08-15T16:20:07Z

distributed/tests/test_worker.py

-@gen_cluster(client=True, nthreads=[("", 1)])
-async def test_gather_dep_cancelled_error(c, s, a):
-    """Something somewhere in the networking stack raises CancelledError while
-    gather_dep is running
-
-    See Also
-    --------
-    test_get_data_cancelled_error
-    https://github.com/dask/distributed/issues/8006
-    """
-    async with BlockedGetData(s.address) as b:
-        x = c.submit(inc, 1, key="x", workers=[b.address])
-        y = c.submit(inc, x, key="y", workers=[a.address])
-        await b.in_get_data.wait()
-        tasks = {
-            task for task in asyncio.all_tasks() if "gather_dep" in task.get_name()
-        }
-        assert tasks
-        # There should be only one task but cope with finding more just in case a
-        # previous test didn't properly clean up
-        for task in tasks:
-            task.cancel()
-
-        b.block_get_data.set()
-        assert await y == 3
-
-    assert_story(
-        a.state.story("x"),
-        [
-            ("x", "fetch", "flight", "flight", {}),
-            ("x", "flight", "missing", "missing", {}),
-            ("x", "missing", "fetch", "fetch", {}),
-            ("x", "fetch", "flight", "flight", {}),
-            ("x", "flight", "memory", "memory", {"y": "ready"}),
-        ],
-    )


This test is too artificial. It trigger a condition that is only possible during worker shutdown. Nothing is ever cancelling the actual gather_dep task itself unless we're shutting down the worker.

The other PR is introducing also test_get_data_cancelled_error which is, I believe, a much more represenative test (which also works on this PR).

The test is provoking a (still wrong and artifical) CancelledError on write. This is wrong and will never happen, I believe. However, due to our low level handshakes, this CancelledError cancels the actual connect and is therefore kind of representative of the real issue

fjetter · 2023-08-15T16:24:55Z

distributed/core.py

+        try:
+            return connect_attempt.result()
+        except asyncio.CancelledError:
+            reason = self._reasons.pop(connect_attempt, "ConnectionPool closing.")
+            raise CommClosedError(reason)


@graingert I'm all ears if you know how to better test this

I got test_remove_cancels_connect_before_task_running which appears to produce this condition 🎉 (not a perfect test but it's something)

github-actions · 2023-08-15T17:44:47Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      21 files ±      0       21 suites ±0 10h 57m 32s ⏱️ -38s
  3 767 tests -       2   3 655 ✔️ +      2   107 💤 -   2 5 ❌ - 2
36 427 runs +1 248 34 620 ✔️ +1 208 1 802 💤 +42 5 ❌ - 2

For more details on these failures, see this check.

Results for commit 6a2a838. ± Comparison against base commit 8aa04a8.

This pull request removes 3 and adds 1 tests. Note that renamed tests count towards both.

distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---nanny]
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---no-nanny]
distributed.tests.test_worker ‑ test_gather_dep_cancelled_error

distributed.tests.test_core ‑ test_remove_cancels_connect_before_task_running

distributed/tests/test_core.py

fjetter added 2 commits August 15, 2023 17:20

Handle CancelledError in ConnectionPool

9631b88

fixes

b5b259c

fjetter requested review from graingert and crusaderky August 15, 2023 16:14

fjetter changed the title ~~Handle cancelled error~~ Handle CancelledError properly in ConnectionPool Aug 15, 2023

fjetter mentioned this pull request Aug 15, 2023

Use task.uncancel when suppressing cancellation in ConnectionPool #8108

Closed

fjetter commented Aug 15, 2023

View reviewed changes

fjetter added 2 commits August 15, 2023 18:25

Remove dead reference

a29829d

Add test

6a2a838

crusaderky approved these changes Aug 15, 2023

View reviewed changes

distributed/tests/test_core.py Show resolved Hide resolved

graingert approved these changes Aug 16, 2023

View reviewed changes

crusaderky merged commit 3082d27 into dask:main Aug 16, 2023
23 of 28 checks passed

fjetter deleted the handle_cancelled_error branch August 16, 2023 10:14

fjetter mentioned this pull request Sep 13, 2023

Reduce memory footprint of P2P shuffling #8157

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle `CancelledError` properly in `ConnectionPool` #8110

Handle `CancelledError` properly in `ConnectionPool` #8110

fjetter commented Aug 15, 2023 •

edited

fjetter Aug 15, 2023

fjetter Aug 15, 2023

fjetter Aug 15, 2023

fjetter Aug 15, 2023

github-actions bot commented Aug 15, 2023

Handle CancelledError properly in ConnectionPool #8110

Handle CancelledError properly in ConnectionPool #8110

Conversation

fjetter commented Aug 15, 2023 • edited

fjetter Aug 15, 2023

Choose a reason for hiding this comment

fjetter Aug 15, 2023

Choose a reason for hiding this comment

fjetter Aug 15, 2023

Choose a reason for hiding this comment

fjetter Aug 15, 2023

Choose a reason for hiding this comment

github-actions bot commented Aug 15, 2023

Unit Test Results

Handle `CancelledError` properly in `ConnectionPool` #8110

Handle `CancelledError` properly in `ConnectionPool` #8110

fjetter commented Aug 15, 2023 •

edited