
Better logging for worker removal #8517

Merged — 4 commits merged into dask:main on Feb 28, 2024

Conversation

@crusaderky (Collaborator) commented on Feb 21, 2024:

Improve all regular logger output, as well as Scheduler.events, around worker shutdown:

  • Print a WARNING when tasks are going to be recomputed due to worker failure.
  • Print an ERROR when scattered data is lost due to worker failure.
  • Graceful worker retirement now warns when it fails to retire a worker because no other worker can accept its unique data.
  • Scheduler.events now gives a more complete picture of graceful worker retirement as well as worker failure.
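The severity policy described above can be sketched with a small stand-alone helper (the helper name and messages here are hypothetical; the real logic lives inside Scheduler.remove_worker):

```python
import logging

logger = logging.getLogger("distributed.scheduler")


def log_worker_removal(recompute_keys, lost_keys):
    # Hypothetical helper illustrating the two severity levels described above.
    if recompute_keys:
        # Computed results were lost, but the task graph can recompute them -> WARNING
        logger.warning(
            "Removing worker caused the cluster to lose %d computed task(s), "
            "which will be recomputed elsewhere: %s",
            len(recompute_keys),
            sorted(recompute_keys),
        )
    if lost_keys:
        # Scattered data has no task graph behind it, so it is gone for good -> ERROR
        logger.error(
            "Removing worker caused the cluster to lose scattered data, "
            "which cannot be recomputed: %s",
            sorted(lost_keys),
        )
```

The distinction matters for operators: a WARNING means transient extra work, while an ERROR means data the client scattered by hand is unrecoverable.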

@jrbourbeau best effort for 2024.2.1

@crusaderky force-pushed the log_remove_worker branch 2 times, most recently from 2bdcbc8 to 50def75 on February 21, 2024 at 20:31.
github-actions bot (Contributor) commented on Feb 21, 2024:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files (+4), 27 suites (+4) — runtime 9h 58m 0s ⏱️ (+2h 18m 2s)
 3 997 tests (+1):     3 885 ✅ (+1),      110 💤 (±0),   2 ❌ (±0)
50 267 runs (+9 037): 47 967 ✅ (+8 647), 2 295 💤 (+389), 5 ❌ (+1)

For more details on these failures, see this check.

Results for commit 7db2c8b. ± Comparison against base commit cbf939c.

This pull request removes 1 test and adds 2. Note that renamed tests count towards both.

Removed: distributed.tests.test_scheduler ‑ test_gather_failing_cnn_recover
Added:   distributed.tests.test_scheduler ‑ test_gather_failing_can_recover
Added:   distributed.tests.test_worker ‑ test_log_remove_worker


await f

self.sync(_)

crusaderky (Collaborator, Author) commented:

This whole block is redundant with the same logic inside Client._close.
This changes the close() and __exit__() behaviour of a synchronous Client that started its own LocalCluster (e.g. client = Client()): the client now closes itself first, releasing all of its tasks on the scheduler, and only afterwards closes the cluster, instead of the other way around.

Note that asynchronous clients already behave this way.

Without this, as the workers close, the scheduler would observe the sudden loss of all tasks the client had in who_wants.
See test: test_quiet_client_close
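The ordering change can be sketched with stand-in classes (these are not the real distributed.Client or LocalCluster; only the sequence of events is the point):

```python
events = []


class FakeCluster:
    # Stand-in for a LocalCluster that the client started itself.
    def close(self):
        events.append("cluster closed")


class FakeClient:
    def __init__(self):
        self.cluster = FakeCluster()

    def close(self):
        # New behaviour: close the client first, releasing all of its
        # tasks on the scheduler (clearing who_wants)...
        events.append("client closed")
        # ...and only then tear down the cluster, so the scheduler never
        # observes workers vanishing while tasks are still wanted.
        self.cluster.close()


FakeClient().close()
```

With the old ordering reversed, the scheduler would see every worker disappear while the client's keys were still wanted, producing the noisy "sudden loss of all tasks" described above.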

@crusaderky crusaderky marked this pull request as ready for review February 21, 2024 23:18
@@ -7222,7 +7260,7 @@ async def _track_retire_worker(
         close: bool,
         remove: bool,
         stimulus_id: str,
-    ) -> tuple[str | None, dict]:
+    ) -> tuple[str, Literal["OK", "no-recipients"], dict]:
crusaderky (Collaborator, Author) commented:

Could have used True/False, but this is IMHO more readable.
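The readability argument can be seen in a minimal stand-alone sketch mirroring the new signature (the function body and parameters here are hypothetical, not the actual implementation):

```python
from __future__ import annotations

from typing import Literal

# A named string status documents itself at every call site,
# where a bare True/False would not.
RetireStatus = Literal["OK", "no-recipients"]


def track_retire_worker(addr: str, n_recipients: int) -> tuple[str, RetireStatus, dict]:
    if n_recipients == 0:
        # No other worker can accept this worker's unique data.
        return addr, "no-recipients", {}
    return addr, "OK", {"recipients": n_recipients}


addr, status, info = track_retire_worker("tcp://10.0.0.1:12345", 0)
```

A type checker will also flag any misspelled status string, which a plain bool comparison cannot.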

@hendrikmakait (Member) left a comment:

Thanks, @crusaderky! Consider the nits non-blocking.

"action": "remove-worker",
"processing-tasks": processing_keys,
"lost-computed-tasks": recompute_keys,
"lost-scattered-tasks": lost_keys,
hendrikmakait (Member) commented:

Should we also log the erred tasks here?

crusaderky (Collaborator, Author) replied:

You don't lose the exception of erred tasks when you lose the workers where they ran.
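For illustration, a consumer of such a "remove-worker" event might look like this. The field names follow the payload shown above; the (timestamp, payload) event-log shape and the key values are assumptions:

```python
# An assumed Scheduler.events-style log: a list of (timestamp, payload) pairs.
events = [
    (
        1709160000.0,
        {
            "action": "remove-worker",
            "processing-tasks": {"slowinc-12"},
            "lost-computed-tasks": {"slowinc-3"},   # will be recomputed elsewhere
            "lost-scattered-tasks": {"scatter-0"},  # unrecoverable
        },
    ),
]

# Collect every worker-removal event, then gather the data lost for good.
removals = [msg for _, msg in events if msg.get("action") == "remove-worker"]
lost_forever = set().union(*(m["lost-scattered-tasks"] for m in removals))
```

Splitting recomputable losses from scattered-data losses lets a post-mortem tool distinguish "the cluster did extra work" from "the cluster lost data".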

@@ -5196,7 +5196,7 @@ def test_quiet_client_close(loop):
         threads_per_worker=4,
     ) as c:
         futures = c.map(slowinc, range(1000), delay=0.01)
-        sleep(0.200)  # stop part-way
+        sleep(0.2)  # stop part-way
hendrikmakait (Member) commented:

I know this is only cosmetic, but is there a better stop condition here? E.g., n tasks already being in memory? Consider this non-blocking.
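A generic way to replace a fixed sleep with a deterministic stop condition is a small polling helper (a sketch only; the actual fix adopted in this PR may differ):

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate() until it returns true or the timeout expires.

    Waiting for e.g. "n tasks are already in memory" this way is
    deterministic, unlike sleep(0.2), which races against scheduler progress.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Usage sketch against hypothetical futures:
#     wait_until(lambda: sum(f.status == "finished" for f in futures) >= 10)
```

The helper returns False instead of raising on timeout, so a test can choose between asserting on the result or failing with its own message.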

crusaderky (Collaborator, Author) replied:

Done

@hendrikmakait (Member) commented:

It looks like the rather verbose test_log_remove_worker fails consistently.

@crusaderky crusaderky merged commit 1602d74 into dask:main Feb 28, 2024
28 of 34 checks passed
@crusaderky crusaderky deleted the log_remove_worker branch February 28, 2024 23:18