
Make P2P shuffle extensible #8096

Merged
merged 7 commits into dask:main from p2p-dispatch on Aug 17, 2023
Conversation

@hendrikmakait (Member) commented Aug 11, 2023

This PR replaces hard-coded logic for handling dataframe shuffles and array rechunking with a dispatch mechanism. The main goals are to improve the separation of concerns and to make the P2P mechanism extensible to other kinds of "shuffle-like" operations.

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait (Member, Author) commented:

The main changes in this PR are the introduction of dispatches and the replacement of some dictionaries with dataclasses. The rest is moving code around, which I will probably extract into a separate PR to make the changes easier to track.
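
As a rough illustration of the dictionaries-to-dataclasses part (the field names here are made up for the example, not the PR's actual state):

# Hypothetical before/after sketch of the dict-to-dataclass change.
from dataclasses import dataclass

# Before: loosely typed state passed around as a plain dict.
state = {"id": "shuffle-abc", "run_id": 1, "participating_workers": set()}

# After: typed, named fields; typos and missing keys fail loudly.
@dataclass
class SchedulerShuffleState:
    id: str
    run_id: int
    participating_workers: set[str]

state = SchedulerShuffleState(
    id="shuffle-abc", run_id=1, participating_workers=set()
)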

github-actions bot (Contributor) commented Aug 11, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    21 files  ±0      21 suites  ±0    11h 17m 6s ⏱️ −21m 29s
 3 772 tests  +5    3 663 ✔️  +9      108 💤 +1    1 ❌ −5
36 475 runs  +60   34 665 ✔️ +63    1 809 💤 +4    1 ❌ −7

For more details on these failures, see this check.

Results for commit b76bd7a. Comparison against base commit 8aa04a8.

This pull request removes 2 tests and adds 7. Note that renamed tests count towards both.

Removed:
distributed.tests.test_preload ‑ test_failure_doesnt_crash
distributed.tests.test_worker ‑ test_gather_dep_cancelled_error

Added:
distributed.shuffle.tests.test_shuffle ‑ test_basic_cudf_support
distributed.tests.test_core ‑ test_remove_cancels_connect_before_task_running
distributed.tests.test_preload ‑ test_failure_doesnt_crash_client
distributed.tests.test_preload ‑ test_failure_doesnt_crash_nanny
distributed.tests.test_preload ‑ test_failure_doesnt_crash_scheduler
distributed.tests.test_preload ‑ test_failure_doesnt_crash_worker
distributed.tests.test_preload ‑ test_preload_manager_sequence

♻️ This comment has been updated with latest results.

This was referenced Aug 11, 2023:

  • Refactor and introduce dispatch
  • more dispatches
  • Generic ToPickle
@hendrikmakait changed the title from "[WIP] Refactor P2P and make it more extensible" to "[WIP] Make P2P shuffle through dispatching" on Aug 15, 2023
@hendrikmakait changed the title from "[WIP] Make P2P shuffle through dispatching" to "[WIP] Make P2P shuffle extensible through dispatching" on Aug 15, 2023
@hendrikmakait changed the title from "[WIP] Make P2P shuffle extensible through dispatching" to "[RPC] Make P2P shuffle extensible through dispatching" on Aug 16, 2023
@hendrikmakait (Member, Author) commented:

@wence-: An initial iteration on this one is done; I would appreciate your thoughts!

@hendrikmakait (Member, Author) commented:

To implement a new shuffle-like operation, you now need to implement...
a few data classes

  • subclass of ShuffleSpec
  • subclass of ShuffleRunSpec
  • subclass of SchedulerShuffleState

a class handling the main logic

  • subclass of ShuffleRun

a few dispatches to convert things

  • spec_to_scheduler_state
  • scheduler_state_to_run_spec
  • run_spec_to_worker_run

Ideally, I would like to reduce the implementation overhead, but I'm not sure yet how to best do that.
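
For orientation, here is a self-contained toy version of that wiring. Only the class and dispatch names mirror the list above; the singledispatch mechanism, all bodies, and the MyRechunkSpec example are hypothetical simplifications, not the PR's actual code.

# Toy model of the wiring described above; everything beyond the names
# is a hypothetical simplification.
from __future__ import annotations

import itertools
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class ShuffleSpec:
    """User-facing description of a shuffle-like operation."""
    id: str


@dataclass(frozen=True)
class ShuffleRunSpec:
    """Serializable description of one run, sent to workers."""
    spec: ShuffleSpec
    run_id: int


@dataclass
class SchedulerShuffleState:
    """Scheduler-side bookkeeping for one shuffle."""
    run_spec: ShuffleRunSpec


class ShuffleRun:
    """Worker-side object holding the main transfer logic."""


_run_ids = itertools.count(1)


@singledispatch
def spec_to_scheduler_state(spec: ShuffleSpec) -> SchedulerShuffleState:
    raise NotImplementedError(type(spec))


@singledispatch
def scheduler_state_to_run_spec(state: SchedulerShuffleState) -> ShuffleRunSpec:
    return state.run_spec


@singledispatch
def run_spec_to_worker_run(run_spec: ShuffleRunSpec) -> ShuffleRun:
    raise NotImplementedError(type(run_spec))


# A new shuffle-like operation then registers its own conversions:
@dataclass(frozen=True)
class MyRechunkSpec(ShuffleSpec):
    old: tuple[int, ...] = ()
    new: tuple[int, ...] = ()


@spec_to_scheduler_state.register
def _(spec: MyRechunkSpec) -> SchedulerShuffleState:
    return SchedulerShuffleState(
        run_spec=ShuffleRunSpec(spec=spec, run_id=next(_run_ids))
    )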

@wence- (Contributor) left a comment:

A few small suggestions in this round. Perhaps handling the dispatch as required methods on the base dataclass objects would be cleaner?


return ArrayRechunkState(
    id=spec.id,
    run_id=next(SchedulerShuffleState._run_id_iterator),
@wence- (Contributor) commented:

Can the SchedulerShuffleState be responsible for setting the run id? This is meant to be a unique sequence number AIUI, but it's easy to do something wrong here, it feels like.

Perhaps:

from functools import partial
from itertools import count
from dataclasses import dataclass, field

@dataclass
class SchedulerShuffleState:
    ...
    run_id: int = field(init=False, default_factory=partial(next, count()))

@dataclass
class ArrayRechunkState(SchedulerShuffleState):
    old: ...
    new: ...

This way run_id is created automagically, and doesn't even appear in the list of arguments to __init__ (so you can't get it wrong).
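
A quick standalone check of that pattern (the State class here is hypothetical):

# Standalone demonstration of the suggested pattern; `State` is made up.
from dataclasses import dataclass, field
from functools import partial
from itertools import count

@dataclass
class State:
    name: str
    run_id: int = field(init=False, default_factory=partial(next, count(1)))

a, b = State("a"), State("b")
assert (a.run_id, b.run_id) == (1, 2)  # ids assigned in creation order
# State("c", run_id=5) raises TypeError: run_id is not an __init__ parameter.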

@@ -441,47 +447,98 @@ def _() -> dict[str, tuple[NDIndex, bytes]]:
         return self.run_id

     async def get_output_partition(
-        self, partition_id: NDIndex, key: str, meta: pd.DataFrame | None = None
+        self, partition_id: NDIndex, key: str, **kwargs: Any
@wence- (Contributor) commented:

Is this switch to kwargs future-proofing?

@hendrikmakait (Member, Author) replied:

Future-proofing and making it completely transparent to the plugin.
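
A rough sketch of what that transparency buys (the names are hypothetical stand-ins, not the real plugin API):

# Simplified stand-ins illustrating the **kwargs pass-through.
from typing import Any

class RunSketch:
    async def get_output_partition(
        self, partition_id: Any, key: str, **kwargs: Any
    ) -> Any:
        # A dataframe run might read kwargs["meta"]; an array run ignores it.
        ...

class WorkerPluginSketch:
    def __init__(self, run: RunSketch) -> None:
        self.run = run

    async def get_output_partition(
        self, partition_id: Any, key: str, **kwargs: Any
    ) -> Any:
        # The plugin forwards extra arguments untouched, so it needs no
        # knowledge of any concrete shuffle type.
        return await self.run.get_output_partition(partition_id, key, **kwargs)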

@hendrikmakait changed the title from "[RPC] Make P2P shuffle extensible through dispatching" to "[RFC] Make P2P shuffle extensible through dispatching" on Aug 16, 2023
@hendrikmakait changed the title from "[RFC] Make P2P shuffle extensible through dispatching" to "[RFC] Make P2P shuffle extensible" on Aug 16, 2023
Comment on lines 270 to 297
@dataclass(frozen=True)
class ShuffleSpec(abc.ABC, Generic[_T_partition_id]):
    id: ShuffleId
    run_id: int
    output_workers: set[str]

    def create_new_run(
        self,
        plugin: ShuffleSchedulerPlugin,
    ) -> SchedulerShuffleState:
        worker_for = self._pin_output_workers(plugin)
        return SchedulerShuffleState(
            run_spec=ShuffleRunSpec(spec=self, worker_for=worker_for),
            participating_workers=set(worker_for.values()),
        )

    @abc.abstractmethod
    def _pin_output_workers(
        self, plugin: ShuffleSchedulerPlugin
    ) -> dict[_T_partition_id, str]:
        """TODO"""

    @abc.abstractmethod
    def initialize_run_on_worker(
        self,
        run_id: int,
        worker_for: dict[_T_partition_id, str],
        plugin: ShuffleWorkerPlugin,
    ) -> ShuffleRun:
        """TODO"""
@hendrikmakait (Member, Author) commented:

All you need to do now is implement concrete subclasses of ShuffleSpec and ShuffleRun.
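
For example, a minimal hypothetical implementation against the interface quoted above might look like this (the round-robin pinning policy and MyShuffleRun are made up for illustration, not from the PR):

# Hypothetical subclass of the ShuffleSpec interface quoted above.
@dataclass(frozen=True)
class RoundRobinSpec(ShuffleSpec[int]):
    npartitions: int

    def _pin_output_workers(
        self, plugin: ShuffleSchedulerPlugin
    ) -> dict[int, str]:
        # Assign output partition i to worker i mod n (assumes the plugin
        # exposes the scheduler's worker addresses).
        workers = sorted(plugin.scheduler.workers)
        return {i: workers[i % len(workers)] for i in range(self.npartitions)}

    def initialize_run_on_worker(
        self,
        run_id: int,
        worker_for: dict[int, str],
        plugin: ShuffleWorkerPlugin,
    ) -> ShuffleRun:
        # MyShuffleRun would be a concrete ShuffleRun doing the transfers.
        return MyShuffleRun(run_id=run_id, worker_for=worker_for)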

@hendrikmakait marked this pull request as ready for review on August 17, 2023 10:32
@hendrikmakait marked this pull request as draft on August 17, 2023 10:32
@hendrikmakait changed the title from "[RFC] Make P2P shuffle extensible" to "Make P2P shuffle extensible" on Aug 17, 2023
@hendrikmakait marked this pull request as ready for review on August 17, 2023 10:42
@wence- (Contributor) left a comment:

Thanks, I think this looks neater now. Coverage went mad in the tests, but I'm not sure if that will fix itself in the latest test run.

_run_id_iterator: ClassVar[itertools.count] = itertools.count(1)

@dataclass(frozen=True)
class ShuffleRunSpec(Generic[_T_partition_id]):
    run_id: int = field(init=False, default_factory=partial(next, itertools.count(1)))  # type: ignore
@wence- (Contributor) commented:

FWIW, this is this mypy bug

@hendrikmakait hendrikmakait merged commit 8645c77 into dask:main Aug 17, 2023
25 of 28 checks passed
@hendrikmakait hendrikmakait deleted the p2p-dispatch branch August 17, 2023 11:38
Labels: none
Projects: none
Linked issues: none

2 participants