
Restructure P2P code #8098

Merged
hendrikmakait merged 6 commits into dask:main from restructure-p2p-code on Aug 15, 2023

Conversation

@hendrikmakait (Member) commented Aug 11, 2023

Extracted from #8096

  • Moves code related to dataframe shuffles or rechunking to the respective _shuffle.py or _rechunk.py files.
  • Moves base core logic to _core.py.
  • Adjusts a few public/private identifiers.
  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions bot commented Aug 11, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    21 files ±0      21 suites ±0      11h 11m 52s ⏱️ (-59m 23s)
 3 767 tests ±0:   3 655 ✔️ passed (-3)    107 💤 skipped (±0)    5 ❌ failed (+3)
36 415 runs  ±0:  34 605 ✔️ passed (-1)  1 805 💤 skipped (-1)    5 ❌ failed (+2)

For more details on these failures, see this check.

Results for commit f361e71. ± Comparison against base commit f27e9a2.

♻️ This comment has been updated with latest results.

@hendrikmakait (Member Author):

cc @wence-

@wence- (Contributor) left a comment:

For minimal code-movement changes, some of the new modules now need import pandas as pd and import numpy as np gated behind TYPE_CHECKING, with the types spelt as strings, I think.

That might be the best way to get this in, but one could consider making the get_output_partition method generic over meta.
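
For illustration, a minimal sketch of the TYPE_CHECKING-gated pattern described above (the body is elided and the pattern is illustrative, not the PR's actual code):

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; pandas and numpy are never imported at
    # runtime, so the module still imports in a minimal-dependency environment.
    import numpy as np
    import pandas as pd


def convert_chunk(data: bytes) -> "np.ndarray":
    # Without `from __future__ import annotations`, the annotation must be
    # spelt as a string because the name `np` does not exist at runtime.
    ...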

from enum import Enum
from typing import TYPE_CHECKING, Any, ClassVar, Generic, NewType, TypeVar

import pandas as pd
@wence- (Contributor), commenting on the snippet above:

Previously this was guarded by if TYPE_CHECKING; dropping that guard looks to be the cause of most of the test failures.

I note get_output_partition advertises that it accepts meta: pd.DataFrame | None. Probably that should be meta: _T_partition_type | None, and then the pandas import could be dropped from this generic core module.
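
A rough sketch of the generic variant being suggested (class and TypeVar names are illustrative guesses, not necessarily what distributed ends up using):

from __future__ import annotations

import abc
from typing import Generic, TypeVar

_T_partition_id = TypeVar("_T_partition_id")
_T_partition_type = TypeVar("_T_partition_type")


class ShuffleRunBase(abc.ABC, Generic[_T_partition_id, _T_partition_type]):
    """Core run logic that knows nothing about pandas or numpy."""

    @abc.abstractmethod
    async def get_output_partition(
        self,
        partition_id: _T_partition_id,
        key: str,
        meta: _T_partition_type | None = None,
    ) -> _T_partition_type:
        """Return the output partition identified by partition_id."""

The dataframe and array subclasses would then bind _T_partition_type to pd.DataFrame and np.ndarray respectively, keeping both imports out of the shared core.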

@hendrikmakait (Member Author) replied:

Thanks, I just looked at the test report and missed the complete failure of mindeps.


    @abc.abstractmethod
    async def get_output_partition(
        self, partition_id: _T_partition_id, key: str, meta: pd.DataFrame | None = None
@wence- (Contributor), commenting on the snippet above:

pd.DataFrame seems oddly specific for a generic implementation.

@hendrikmakait (Member Author) replied:

Very true, but I would like to make more involved changes such as adding generic typing in a separate PR (e.g., #8096). Otherwise it would be impossible to spot any meaningful changes in this PR.

if plugin is None:
    raise RuntimeError(
        f"The worker {worker.address} does not have a ShuffleExtension. "
        "Is pandas installed on the worker?"
@wence- (Contributor), commenting on the snippet above:

Is that now necessary even if only using shuffle extensions for array rechunking?

@hendrikmakait (Member Author) replied:

This exception is outdated, but the worker plugin is still involved in array rechunking. Note that we have conflicting names between the generic "P2P shuffle" approach (sending data all-to-all across the cluster) and the specific dataframe.shuffle operation. If you have naming suggestions for resolving this, please let me know. As with the other comments, I would rather not address this here.

@@ -252,3 +264,226 @@ def split_axes(old: ChunkedAxes, new: ChunkedAxes) -> SplitAxes:
            old_chunk.sort(key=lambda split: split.slice.start)
        axes.append(old_axis)
    return axes


def convert_chunk(data: bytes) -> np.ndarray:
@wence- (Contributor), commenting on the snippet above:

How do these type annotations work for normal execution? AFAICT, np is not imported at the top level, so at runtime (rather than type-checking time) this name is not defined.

@hendrikmakait (Member Author) replied:

They are not interpreted at runtime thanks to from __future__ import annotations.

Note: If from __future__ import annotations is used, annotations are not evaluated at function definition time. Instead, they are stored as strings in __annotations__. This makes it unnecessary to use quotes around the annotation (see PEP 563).

https://docs.python.org/3/library/typing.html#typing.TYPE_CHECKING
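
A tiny self-contained demonstration of that behaviour (illustrative only, not code from this PR):

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import numpy as np  # never imported at runtime


def convert_chunk(data: bytes) -> np.ndarray:  # no quotes needed under PEP 563
    ...


# The annotations are stored as plain strings and never evaluated, so this
# runs even though numpy was never imported:
print(convert_chunk.__annotations__)
# -> {'data': 'bytes', 'return': 'np.ndarray'}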

        return self.run_id

    async def get_output_partition(
        self, partition_id: NDIndex, key: str, meta: pd.DataFrame | None = None
@wence- (Contributor), commenting on the snippet above:

As above, if this function were generic over meta as _T_partition_type, then meta could be np.ndarray | None here, and rechunking wouldn't require pandas for this type annotation.
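
Continuing the ShuffleRunBase sketch from the earlier comment, the rechunk-side subclass could then narrow meta to an ndarray while the shared core never touches pandas (ArrayRechunkRun is an illustrative placeholder, and NDIndex is spelt out here only to keep the sketch self-contained):

from __future__ import annotations

import numpy as np  # the rechunk module genuinely needs numpy at runtime

NDIndex = tuple[int, ...]  # stand-in alias for the sketch


class ArrayRechunkRun(ShuffleRunBase[NDIndex, np.ndarray]):
    """Illustrative subclass, not the PR's actual class."""

    async def get_output_partition(
        self,
        partition_id: NDIndex,
        key: str,
        meta: np.ndarray | None = None,
    ) -> np.ndarray:
        # Body elided; only the narrowed signature matters for this sketch.
        ...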

@wence- (Contributor) left a comment:

Remaining failures do not look related, I think.

hendrikmakait merged commit 8aa04a8 into dask:main on Aug 15, 2023
22 of 28 checks passed
hendrikmakait deleted the restructure-p2p-code branch on August 15, 2023 at 10:30