Restructure P2P code #8098
Conversation
**Unit Test Results**

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files ±0 · 21 suites ±0 · 11h 11m 52s ⏱️ -59m 23s

For more details on these failures, see this check.

Results for commit f361e71. ± Comparison against base commit f27e9a2.

♻️ This comment has been updated with latest results.
cc @wence-
For minimal code-movement changes, some of the new modules now need `import pandas as pd` and `import numpy as np` gated behind `TYPE_CHECKING`, with the types spelt as strings, I think. That might be the best way to get this in, but one could consider making the `get_output_partition` method generic over `meta`.
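For readers unfamiliar with the gating pattern being described, a minimal sketch might look like this (the function is a simplified stand-in, not the real signature from this PR):

```python
# Sketch: gate heavy imports behind TYPE_CHECKING so the module imports
# cleanly even when pandas/numpy are not installed at runtime.
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; never executed at runtime.
    import pandas as pd


def get_output_partition(meta: pd.DataFrame | None = None) -> None:
    # With `from __future__ import annotations`, the annotation above is
    # stored as the string "pd.DataFrame | None" and never evaluated,
    # so the missing runtime import of pandas is harmless.
    return None
```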
`distributed/shuffle/_core.py` (outdated diff)

```python
from enum import Enum
from typing import TYPE_CHECKING, Any, ClassVar, Generic, NewType, TypeVar

import pandas as pd
```
Previously this was guarded by `if TYPE_CHECKING`, which looks to be the cause of most of the test failures.

I note `get_output_partition` advertises that it accepts `meta: pd.DataFrame | None`. Probably that should be `meta: _T_partition_type | None`, and then the pandas import could go from this generic core.
Thanks, I just looked at the test report and missed the complete failure of `mindeps`.
```python
    @abc.abstractmethod
    async def get_output_partition(
        self, partition_id: _T_partition_id, key: str, meta: pd.DataFrame | None = None
```
`pd.DataFrame` seems oddly specific for a generic implementation.
Very true, but I would like to make more involved changes such as adding generic typing in a separate PR (e.g., #8096); otherwise it would be impossible to spot any meaningful changes in this PR.
```python
    if plugin is None:
        raise RuntimeError(
            f"The worker {worker.address} does not have a ShuffleExtension. "
            "Is pandas installed on the worker?"
```
Is that now necessary even if only using shuffle extensions for array rechunking?
This exception is outdated, but the worker plugin is still involved in array rechunking. Note that we have conflicting names between the generic "P2P shuffle" approach, as in sending data all-to-all across the cluster, and the specific `dataframe.shuffle` operation. If you have naming suggestions for resolving this, please let me know. As with other comments, I would rather not address this here.
```diff
@@ -252,3 +264,226 @@ def split_axes(old: ChunkedAxes, new: ChunkedAxes) -> SplitAxes:
             old_chunk.sort(key=lambda split: split.slice.start)
         axes.append(old_axis)
     return axes
+
+
+def convert_chunk(data: bytes) -> np.ndarray:
```
How do these type annotations work for normal execution? AFAICT, `np` is not imported at the top level, so at runtime (rather than type-checking time) this name is not defined?
They are not interpreted at runtime thanks to `from __future__ import annotations`.

> Note: If `from __future__ import annotations` is used, annotations are not evaluated at function definition time. Instead, they are stored as strings in `__annotations__`. This makes it unnecessary to use quotes around the annotation (see PEP 563).

https://docs.python.org/3/library/typing.html#typing.TYPE_CHECKING
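The behaviour quoted above is easy to observe: with the future import, an annotation referencing an undefined name does not raise at definition time, because it is only stored as a string.

```python
from __future__ import annotations


def convert_chunk(data: bytes) -> np.ndarray:  # np is never imported!
    # Defining and calling this function works fine even though `np`
    # does not exist at runtime; the annotation is just a string.
    return data


# Annotations are stored as their source text:
print(convert_chunk.__annotations__)  # {'data': 'bytes', 'return': 'np.ndarray'}
```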
```python
        return self.run_id

    async def get_output_partition(
        self, partition_id: NDIndex, key: str, meta: pd.DataFrame | None = None
```
As above, if this function were generic over meta as `_T_partition_type`, then `meta` could be `meta: np.ndarray | None` here, and rechunking wouldn't require pandas for this type annotation.
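Concretely, the rechunk side could then bind the type variable without touching pandas at all. This is a sketch under the same assumption; `ShuffleRun` and `ArrayRechunkRun` are illustrative names, not the ones in the PR.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Generic, TypeVar

if TYPE_CHECKING:
    import numpy as np  # needed only by the type checker

_T_partition_type = TypeVar("_T_partition_type")


class ShuffleRun(Generic[_T_partition_type]):
    """Hypothetical generic base class."""

    async def get_output_partition(
        self,
        partition_id: tuple[int, ...],
        key: str,
        meta: _T_partition_type | None = None,
    ) -> _T_partition_type | None:
        return meta


class ArrayRechunkRun(ShuffleRun["np.ndarray"]):
    """Binds the partition type to np.ndarray; pandas never enters the picture."""
```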
Remaining failures do not look related, I think.
- Extracted from #8096
- Code split into the `_shuffle.py` or `_rechunk.py` files, with shared code in `_core.py`.
- Passes `pre-commit run --all-files`