Make P2P shuffle extensible #8096
Conversation
The main things happening in this PR are the introduction of dispatches and the replacement of some dictionaries with dataclasses. The rest is moving stuff around, which I will probably extract into a separate PR to make the changes easier to track.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files ±0, 21 suites ±0, 11h 17m 6s ⏱️ (-21m 29s)

For more details on these failures, see this check. Results for commit b76bd7a. ± Comparison against base commit 8aa04a8.

This pull request removes 2 and adds 7 tests. Note that renamed tests count towards both.

♻️ This comment has been updated with latest results.
Commits: "Refactor and introduce more dispatches", "Generic ToPickle"

force-pushed from 9ec92d5 to e0adc3a
@wence-: An initial iteration on this one is done; I would appreciate your thoughts!
To implement a new shuffle-like operation, you now need to implement...
- a class handling the main logic
- a few dispatches to convert things

Ideally, I would like to reduce the implementation overhead, but I'm not sure yet how best to do that.
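The two-part extension point described above can be sketched with `functools.singledispatch`. All names here (`create_state`, the toy spec classes and their fields) are illustrative stand-ins, not the actual `distributed.shuffle` API:

```python
# Hypothetical sketch of the dispatch-based extension point: one spec class
# per shuffle-like operation, plus a dispatch that converts specs to state.
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class ShuffleSpec:
    id: str


@dataclass(frozen=True)
class ArrayRechunkSpec(ShuffleSpec):
    old: tuple
    new: tuple


@singledispatch
def create_state(spec: ShuffleSpec) -> dict:
    # Fallback for spec types nobody registered a converter for.
    raise NotImplementedError(type(spec))


@create_state.register
def _(spec: ArrayRechunkSpec) -> dict:
    # Rechunk-specific conversion lives next to the rechunk spec.
    return {"id": spec.id, "kind": "rechunk"}


print(create_state(ArrayRechunkSpec(id="x", old=(2,), new=(1,)))["kind"])  # rechunk
```

A new shuffle-like operation then only adds a spec subclass and registers its converters, without touching the core.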
A few small suggestions in this round. Perhaps handling the dispatch as required methods on the base dataclass objects is cleaner?
distributed/shuffle/_rechunk.py (Outdated)

    return ArrayRechunkState(
        id=spec.id,
        run_id=next(SchedulerShuffleState._run_id_iterator),
Can the SchedulerShuffleState be responsible for setting the run id? This is meant to be a unique sequence number AIUI, but it feels like it's easy to do something wrong here.
Perhaps:

    from functools import partial
    from itertools import count
    from dataclasses import dataclass, field

    @dataclass
    class SchedulerShuffleState:
        ...
        run_id: int = field(init=False, default_factory=partial(next, count()))

    @dataclass
    class ArrayRechunkState(SchedulerShuffleState):
        old: ...
        new: ...
This way run_id is created automagically, and doesn't even appear in the list of arguments to __init__ (so you can't get it wrong).
    @@ -441,47 +447,98 @@ def _() -> dict[str, tuple[NDIndex, bytes]]:
             return self.run_id

         async def get_output_partition(
    -        self, partition_id: NDIndex, key: str, meta: pd.DataFrame | None = None
    +        self, partition_id: NDIndex, key: str, **kwargs: Any
Is this switch to kwargs future-proofing?
Future-proofing and making it completely transparent to the plugin.
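The "transparent to the plugin" idea can be illustrated with a toy example: the generic caller forwards whatever options it was handed, and each backend picks out only what it understands. The class and function names below are illustrative, not the real distributed API:

```python
# Sketch: **kwargs lets the generic call site stay identical across backends,
# even when one backend grows a new, backend-only option (here: meta).
from typing import Any


class DataFrameRun:
    def get_output_partition(self, partition_id: int, key: str, **kwargs: Any) -> str:
        meta = kwargs.get("meta")  # dataframe-specific optional argument
        return f"df partition {partition_id} (meta={meta})"


class ArrayRun:
    def get_output_partition(self, partition_id: int, key: str, **kwargs: Any) -> str:
        # Array rechunking ignores options it doesn't know about.
        return f"array partition {partition_id}"


def fetch(run, partition_id: int, key: str, **options: Any) -> str:
    # The generic caller forwards options blindly; adding a backend-only
    # option never changes this call site.
    return run.get_output_partition(partition_id, key, **options)


print(fetch(DataFrameRun(), 0, "k", meta="empty-frame"))
print(fetch(ArrayRun(), 0, "k"))
```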
distributed/shuffle/_core.py (Outdated)

    @dataclass(frozen=True)
    class ShuffleSpec(abc.ABC, Generic[_T_partition_id]):
        id: ShuffleId
        run_id: int
        output_workers: set[str]

        def create_new_run(
            self,
            plugin: ShuffleSchedulerPlugin,
        ) -> SchedulerShuffleState:
            worker_for = self._pin_output_workers(plugin)
            return SchedulerShuffleState(
                run_spec=ShuffleRunSpec(spec=self, worker_for=worker_for),
                participating_workers=set(worker_for.values()),
            )

        @abc.abstractmethod
        def _pin_output_workers(
            self, plugin: ShuffleSchedulerPlugin
        ) -> dict[_T_partition_id, str]:
            """TODO"""

        @abc.abstractmethod
        def initialize_run_on_worker(
            self,
            run_id: int,
            worker_for: dict[_T_partition_id, str],
            plugin: ShuffleWorkerPlugin,
        ) -> ShuffleRun:
            """TODO"""
All you need to do now is implement concrete subclasses of ShuffleSpec and ShuffleRun.
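A hypothetical skeleton of such a concrete subclass, following the two abstract hooks shown in the diff. The plugin arguments are stubbed out and `MyShuffleSpec` with its fields is invented for illustration; only the shape of the extension point mirrors the PR:

```python
# Illustrative skeleton: a spec subclass supplies (1) the partition-to-worker
# pinning on the scheduler and (2) the worker-side run construction.
import abc
from dataclasses import dataclass


class ShuffleRun:
    """Stand-in for the real worker-side run class."""


@dataclass(frozen=True)
class ShuffleSpec(abc.ABC):
    id: str

    @abc.abstractmethod
    def _pin_output_workers(self, plugin) -> dict[int, str]:
        """Map each output partition to the worker that will hold it."""

    @abc.abstractmethod
    def initialize_run_on_worker(self, run_id, worker_for, plugin) -> ShuffleRun:
        """Build the worker-side state for one run of this shuffle."""


@dataclass(frozen=True)
class MyShuffleSpec(ShuffleSpec):
    npartitions: int

    def _pin_output_workers(self, plugin) -> dict[int, str]:
        workers = ["w0", "w1"]  # a real impl would query the scheduler plugin
        return {i: workers[i % len(workers)] for i in range(self.npartitions)}

    def initialize_run_on_worker(self, run_id, worker_for, plugin) -> ShuffleRun:
        return ShuffleRun()


spec = MyShuffleSpec(id="s", npartitions=3)
print(spec._pin_output_workers(None))  # {0: 'w0', 1: 'w1', 2: 'w0'}
```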
Thanks, I think this looks neater now. Coverage went down in the tests, but I'm not sure if that will fix itself in the latest test run.
    -    _run_id_iterator: ClassVar[itertools.count] = itertools.count(1)

         @dataclass(frozen=True)
         class ShuffleRunSpec(Generic[_T_partition_id]):
    +        run_id: int = field(init=False, default_factory=partial(next, itertools.count(1)))  # type: ignore
FWIW, this is a known mypy bug.
This PR replaces hard-coded logic for handling dataframe shuffles and array rechunking with a dispatch mechanism. The main goal is to improve the separation of concerns and make the P2P mechanism extensible for other kinds of "shuffle-like" operations.

Passes pre-commit run --all-files