Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas-only P2P shuffling #8635

Closed
wants to merge 27 commits into from
Closed
Show file tree
Hide file tree
Changes from 26 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
37119a9
p2p shuffle without pyArrow
crusaderky Mar 28, 2024
b9725c5
Serialize dataframes manually
crusaderky Mar 29, 2024
8a732a7
Don't send columns for every shard
crusaderky Apr 4, 2024
21c9954
Prevent strings from blowing up memory
hendrikmakait Apr 18, 2024
db8db7d
Merge branch 'main' into p2p-pandas
hendrikmakait Apr 30, 2024
3d1128a
Remove outdated tests
hendrikmakait May 2, 2024
191c9d7
Fix tests and refactor
hendrikmakait May 3, 2024
e35a738
Remove offload
hendrikmakait May 3, 2024
252e44b
Remove orphan executor mentions
hendrikmakait May 3, 2024
3a25c64
Remove executor from tests
hendrikmakait May 3, 2024
8c2ebbf
Fix categoricals and ensure copying
phofl May 6, 2024
e30b540
deep copy
hendrikmakait May 6, 2024
2920472
Fix categoricals
hendrikmakait May 6, 2024
311f91f
Ignore unnecessary shards
hendrikmakait May 6, 2024
ea22a5a
Don't expect warning
hendrikmakait May 6, 2024
712e211
Rewire env (REVERT ME)
hendrikmakait May 6, 2024
c41597c
Merge branch 'main' into p2p-pandas
hendrikmakait May 6, 2024
418edf8
Trigger CI
hendrikmakait May 6, 2024
814c73a
Adjust mindeps (REVERT ME)
hendrikmakait May 6, 2024
9db59e4
Fix pandas compat
hendrikmakait May 7, 2024
c70f28f
Fix tests on mindeps
hendrikmakait May 7, 2024
1602135
Skip in-memory shuffle
hendrikmakait May 7, 2024
474b097
Skip in-memory P2P dataframe shuffling tests
hendrikmakait May 7, 2024
aa9bf21
Skip in-memory P2P dataframe shuffling tests
hendrikmakait May 7, 2024
fb43405
Do not install pyarrow in mindeps-pandas
hendrikmakait May 7, 2024
95d5377
Improved typing
hendrikmakait May 7, 2024
98b1c32
Replace del by drop
hendrikmakait May 22, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
4 changes: 2 additions & 2 deletions .github/workflows/tests.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -83,12 +83,12 @@ jobs:
- os: ubuntu-latest
environment: mindeps
label: pandas
extra_packages: [numpy=1.21, pandas=1.3, pyarrow=7, pyarrow-hotfix]
extra_packages: [numpy=1.21, pandas=1.3]
partition: "ci1"
- os: ubuntu-latest
environment: mindeps
label: pandas
extra_packages: [numpy=1.21, pandas=1.3, pyarrow=7, pyarrow-hotfix]
extra_packages: [numpy=1.21, pandas=1.3]
partition: "not ci1"

- os: ubuntu-latest
Expand Down
4 changes: 2 additions & 2 deletions continuous_integration/environment-3.10.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -45,8 +45,8 @@ dependencies:
- zict # overridden by git tip below
- zstandard
- pip:
- git+https://github.com/dask/dask
- git+https://github.com/dask-contrib/dask-expr
- git+https://github.com/hendrikmakait/dask@p2p-pandas
- git+https://github.com/hendrikmakait/dask-expr@p2p-pandas
Comment on lines +48 to +49
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to revert 712e211

- git+https://github.com/dask/zict
# Revert after https://github.com/dask/distributed/issues/8614 is fixed
# - git+https://github.com/dask/s3fs
Expand Down
4 changes: 2 additions & 2 deletions continuous_integration/environment-3.11.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -45,8 +45,8 @@ dependencies:
- zict # overridden by git tip below
- zstandard
- pip:
- git+https://github.com/dask/dask
- git+https://github.com/dask-contrib/dask-expr
- git+https://github.com/hendrikmakait/dask@p2p-pandas
- git+https://github.com/hendrikmakait/dask-expr@p2p-pandas
- git+https://github.com/dask/zict
# Revert after https://github.com/dask/distributed/issues/8614 is fixed
# - git+https://github.com/dask/s3fs
Expand Down
4 changes: 2 additions & 2 deletions continuous_integration/environment-3.12.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -45,8 +45,8 @@ dependencies:
- zict # overridden by git tip below
- zstandard
- pip:
- git+https://github.com/dask/dask
- git+https://github.com/dask-contrib/dask-expr
- git+https://github.com/hendrikmakait/dask@p2p-pandas
- git+https://github.com/hendrikmakait/dask-expr@p2p-pandas
- git+https://github.com/dask/zict
# Revert after https://github.com/dask/distributed/issues/8614 is fixed
# - git+https://github.com/dask/s3fs
Expand Down
4 changes: 2 additions & 2 deletions continuous_integration/environment-3.9.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ dependencies:
- zict
- zstandard
- pip:
- git+https://github.com/dask/dask
- git+https://github.com/dask-contrib/dask-expr
- git+https://github.com/hendrikmakait/dask@p2p-pandas
- git+https://github.com/hendrikmakait/dask-expr@p2p-pandas
- git+https://github.com/dask/crick # Only tested here
- keras
2 changes: 1 addition & 1 deletion continuous_integration/environment-mindeps.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ dependencies:
# Distributed depends on the latest version of Dask
- pip
- pip:
- git+https://github.com/dask/dask
- git+https://github.com/hendrikmakait/dask@p2p-pandas
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to revert 814c73a

# test dependencies
- pytest
- pytest-cov
Expand Down
11 changes: 11 additions & 0 deletions distributed/protocol/serialize.py
Original file line number Diff line number Diff line change
Expand Up @@ -854,6 +854,17 @@ def _deserialize_memoryview(header, frames):
return out


@dask_serialize.register(PickleBuffer)
def _serialize_picklebuffer(obj):
return _serialize_memoryview(obj.raw())


@dask_deserialize.register(PickleBuffer)
def _deserialize_picklebuffer(header, frames):
out = _deserialize_memoryview(header, frames)
return PickleBuffer(out)


#########################
# Descend into __dict__ #
#########################
Expand Down
2 changes: 0 additions & 2 deletions distributed/shuffle/__init__.py
Original file line number Diff line number Diff line change
@@ -1,14 +1,12 @@
from __future__ import annotations

from distributed.shuffle._arrow import check_minimal_arrow_version
from distributed.shuffle._merge import HashJoinP2PLayer, hash_join_p2p
from distributed.shuffle._rechunk import rechunk_p2p
from distributed.shuffle._scheduler_plugin import ShuffleSchedulerPlugin
from distributed.shuffle._shuffle import P2PShuffleLayer, rearrange_by_column_p2p
from distributed.shuffle._worker_plugin import ShuffleWorkerPlugin

__all__ = [
"check_minimal_arrow_version",
"hash_join_p2p",
"HashJoinP2PLayer",
"P2PShuffleLayer",
Expand Down
201 changes: 0 additions & 201 deletions distributed/shuffle/_arrow.py

This file was deleted.