Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix P2P worker cleanup #7981

Merged
merged 8 commits into from Jul 13, 2023
Merged

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Jul 10, 2023

Worker cleanup currently works "by accident" because shuffles either complete or worker state gets cleaned up on failure. However, ShuffleWorkerExtension.close() never gets triggered. This PR fixes that. It also refactors the setup such that the scheduler plugin is responsible of installing the worker plugin.

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait
Copy link
Member Author

cc @fjetter

@github-actions
Copy link
Contributor

github-actions bot commented Jul 10, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       20 files  ±0         20 suites  ±0   10h 49m 46s ⏱️ - 24m 29s
  3 716 tests ±0    3 607 ✔️ ±0     106 💤 ±0  3 ±0 
35 942 runs  ±0  34 189 ✔️ ±0  1 749 💤  - 1  4 +1 

For more details on these failures, see this check.

Results for commit 74599ee. ± Comparison against base commit 7b21399.

This pull request removes 7 and adds 7 tests. Note that renamed tests count towards both.
distributed.shuffle.tests.test_shuffle_extension
distributed.shuffle.tests.test_shuffle_extension ‑ test_installation_on_scheduler
distributed.shuffle.tests.test_shuffle_extension ‑ test_installation_on_worker
distributed.shuffle.tests.test_shuffle_extension ‑ test_split_by_partition
distributed.shuffle.tests.test_shuffle_extension ‑ test_split_by_worker
distributed.shuffle.tests.test_shuffle_extension ‑ test_split_by_worker_empty
distributed.shuffle.tests.test_shuffle_extension ‑ test_split_by_worker_many_workers
distributed.shuffle.tests.test_shuffle_plugins
distributed.shuffle.tests.test_shuffle_plugins ‑ test_installation_on_scheduler
distributed.shuffle.tests.test_shuffle_plugins ‑ test_installation_on_worker
distributed.shuffle.tests.test_shuffle_plugins ‑ test_split_by_partition
distributed.shuffle.tests.test_shuffle_plugins ‑ test_split_by_worker
distributed.shuffle.tests.test_shuffle_plugins ‑ test_split_by_worker_empty
distributed.shuffle.tests.test_shuffle_plugins ‑ test_split_by_worker_many_workers
This pull request removes 1 skipped test and adds 1 skipped test. Note that renamed tests count towards both.
distributed.shuffle.tests.test_shuffle_extension
distributed.shuffle.tests.test_shuffle_plugins

♻️ This comment has been updated with latest results.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks right, although minor linting changes required.

Comment on lines +124 to +126
await self.scheduler.register_worker_plugin(
None, dumps(worker_plugin), name="shuffle"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the magic that ensures things are cleaned up, because the client will teardown the scheduler which tears down all the worker plugins?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And pretty much everything else is downstream renaming changes....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much, yes. IIUC, extensions don't get cleaned up at all. Worker plugins will get torn down on worker close, but for that the method has to be called teardown (as seen below). Similarly, we need to move some of the initialization to setup.

@@ -855,7 +856,7 @@ async def _(
self._runs.add(shuffle)
return shuffle

async def close(self) -> None:
async def teardown(self, worker: Worker) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along with this so that the worker plugin now conforms to the WorkerPlugin interface.

@hendrikmakait
Copy link
Member Author

although minor linting changes required

I'm not sure if I missed any additional linting problems, but I think the ones on here also happen on main (sigh).

@wence-
Copy link
Contributor

wence- commented Jul 11, 2023

I'm not sure if I missed any additional linting problems, but I think the ones on here also happen on main (sigh).

#7985

@hendrikmakait hendrikmakait merged commit 8e3e0f6 into dask:main Jul 13, 2023
19 of 27 checks passed
@hendrikmakait hendrikmakait deleted the shuffle-shutdown branch July 13, 2023 07:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants