
Better logging for worker removal #8517

Merged — 4 commits merged into dask:main on Feb 28, 2024

Conversation

@crusaderky (Collaborator) commented on Feb 21, 2024:

Improve all regular logger output, as well as Scheduler.events, around worker shutdown:

  • Print a WARNING when tasks are going to be recomputed due to worker failure.
  • Print an ERROR when scattered data is lost due to worker failure.
  • Graceful worker retirement now warns when it fails to retire a worker because no other worker can accept its unique data.
  • Scheduler.events now gives a more complete picture of graceful worker retirement as well as worker failure.
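The severity policy described above can be sketched with a small stand-alone helper (the helper name and messages here are hypothetical; the real logic lives inside Scheduler.remove_worker):

```python
import logging

logger = logging.getLogger("distributed.scheduler")


def log_worker_removal(recompute_keys, lost_keys):
    # Hypothetical helper illustrating the two severity levels described above.
    if recompute_keys:
        # Computed results were lost, but the task graph can recompute them -> WARNING
        logger.warning(
            "Removing worker caused the cluster to lose %d computed task(s), "
            "which will be recomputed elsewhere: %s",
            len(recompute_keys),
            sorted(recompute_keys),
        )
    if lost_keys:
        # Scattered data has no task graph behind it, so it is gone for good -> ERROR
        logger.error(
            "Removing worker caused the cluster to lose scattered data, "
            "which cannot be recomputed: %s",
            sorted(lost_keys),
        )
```

The distinction matters for operators: a WARNING means transient extra work, while an ERROR means data the client scattered by hand is unrecoverable.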

@jrbourbeau best effort for 2024.2.1

@crusaderky force-pushed the log_remove_worker branch 2 times, most recently from 2bdcbc8 to 50def75 on February 21, 2024 at 20:31.
github-actions bot (Contributor) commented on Feb 21, 2024:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files (+4), 27 suites (+4) — runtime 9h 58m 0s ⏱️ (+2h 18m 2s)
 3 997 tests (+1):     3 885 ✅ (+1),      110 💤 (±0),   2 ❌ (±0)
50 267 runs (+9 037): 47 967 ✅ (+8 647), 2 295 💤 (+389), 5 ❌ (+1)

For more details on these failures, see this check.

Results for commit 7db2c8b. ± Comparison against base commit cbf939c.

This pull request removes 1 test and adds 2. Note that renamed tests count towards both.

Removed: distributed.tests.test_scheduler ‑ test_gather_failing_cnn_recover
Added:   distributed.tests.test_scheduler ‑ test_gather_failing_can_recover
Added:   distributed.tests.test_worker ‑ test_log_remove_worker


await f

self.sync(_)

crusaderky (Collaborator, Author) commented:

This whole block is redundant with the same logic inside Client._close.
This changes the close() and __exit__() behaviour of a synchronous Client that started its own LocalCluster (e.g. client = Client()): the client now closes itself first, releasing all of its tasks on the scheduler, and only afterwards closes the cluster, instead of the other way around.

Note that asynchronous clients already behave this way.

Without this, as the workers close, the scheduler would observe the sudden loss of all tasks the client had in who_wants.
See test: test_quiet_client_close
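The ordering change can be sketched with stand-in classes (these are not the real distributed.Client or LocalCluster; only the sequence of events is the point):

```python
events = []


class FakeCluster:
    # Stand-in for a LocalCluster that the client started itself.
    def close(self):
        events.append("cluster closed")


class FakeClient:
    def __init__(self):
        self.cluster = FakeCluster()

    def close(self):
        # New behaviour: close the client first, releasing all of its
        # tasks on the scheduler (clearing who_wants)...
        events.append("client closed")
        # ...and only then tear down the cluster, so the scheduler never
        # observes workers vanishing while tasks are still wanted.
        self.cluster.close()


FakeClient().close()
```

With the old ordering reversed, the scheduler would see every worker disappear while the client's keys were still wanted, producing the noisy "sudden loss of all tasks" described above.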

@crusaderky crusaderky marked this pull request as ready for review February 21, 2024 23:18
@@ -7222,7 +7260,7 @@ async def _track_retire_worker(
         close: bool,
         remove: bool,
         stimulus_id: str,
-    ) -> tuple[str | None, dict]:
+    ) -> tuple[str, Literal["OK", "no-recipients"], dict]:
crusaderky (Collaborator, Author) commented:

Could have used True/False, but this is IMHO more readable.
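The readability argument can be seen in a minimal stand-alone sketch mirroring the new signature (the function body and parameters here are hypothetical, not the actual implementation):

```python
from __future__ import annotations

from typing import Literal

# A named string status documents itself at every call site,
# where a bare True/False would not.
RetireStatus = Literal["OK", "no-recipients"]


def track_retire_worker(addr: str, n_recipients: int) -> tuple[str, RetireStatus, dict]:
    if n_recipients == 0:
        # No other worker can accept this worker's unique data.
        return addr, "no-recipients", {}
    return addr, "OK", {"recipients": n_recipients}


addr, status, info = track_retire_worker("tcp://10.0.0.1:12345", 0)
```

A type checker will also flag any misspelled status string, which a plain bool comparison cannot.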

@hendrikmakait (Member) left a comment:

Thanks, @crusaderky! Consider the nits non-blocking.

"action": "remove-worker",
"processing-tasks": processing_keys,
"lost-computed-tasks": recompute_keys,
"lost-scattered-tasks": lost_keys,
hendrikmakait (Member) commented:

Should we also log the erred tasks here?

crusaderky (Collaborator, Author) replied:

You don't lose the exception of erred tasks when you lose the workers where they ran.
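For illustration, a consumer of such a "remove-worker" event might look like this. The field names follow the payload shown above; the (timestamp, payload) event-log shape and the key values are assumptions:

```python
# An assumed Scheduler.events-style log: a list of (timestamp, payload) pairs.
events = [
    (
        1709160000.0,
        {
            "action": "remove-worker",
            "processing-tasks": {"slowinc-12"},
            "lost-computed-tasks": {"slowinc-3"},   # will be recomputed elsewhere
            "lost-scattered-tasks": {"scatter-0"},  # unrecoverable
        },
    ),
]

# Collect every worker-removal event, then gather the data lost for good.
removals = [msg for _, msg in events if msg.get("action") == "remove-worker"]
lost_forever = set().union(*(m["lost-scattered-tasks"] for m in removals))
```

Splitting recomputable losses from scattered-data losses lets a post-mortem tool distinguish "the cluster did extra work" from "the cluster lost data".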

@@ -5196,7 +5196,7 @@ def test_quiet_client_close(loop):
         threads_per_worker=4,
     ) as c:
         futures = c.map(slowinc, range(1000), delay=0.01)
-        sleep(0.200)  # stop part-way
+        sleep(0.2)  # stop part-way
hendrikmakait (Member) commented:

I know this is only cosmetic, but is there a better stop condition here? E.g., n tasks already being in memory? Consider this non-blocking.
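A generic way to replace a fixed sleep with a deterministic stop condition is a small polling helper (a sketch only; the actual fix adopted in this PR may differ):

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate() until it returns true or the timeout expires.

    Waiting for e.g. "n tasks are already in memory" this way is
    deterministic, unlike sleep(0.2), which races against scheduler progress.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Usage sketch against hypothetical futures:
#     wait_until(lambda: sum(f.status == "finished" for f in futures) >= 10)
```

The helper returns False instead of raising on timeout, so a test can choose between asserting on the result or failing with its own message.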

crusaderky (Collaborator, Author) replied:

Done

@hendrikmakait (Member) commented:

It looks like the rather verbose test_log_remove_worker fails consistently.

@crusaderky crusaderky merged commit 1602d74 into dask:main Feb 28, 2024
28 of 34 checks passed
@crusaderky crusaderky deleted the log_remove_worker branch February 28, 2024 23:18