Pandas-only P2P shuffling #8635
Conversation
* Update
* Fix other copy issue
- git+https://github.com/hendrikmakait/dask@p2p-pandas
- git+https://github.com/hendrikmakait/dask-expr@p2p-pandas
Reminder to revert 712e211
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

29 files +2   29 suites +2   10h 21m 53s ⏱️ +1h 22m 10s

For more details on these failures and errors, see this check.

Results for commit 98b1c32. ± Comparison against base commit 1ec61a1.

This pull request removes 141 and adds 138 tests. Note that renamed tests count towards both.
This pull request removes 1 skipped test and adds 59 skipped tests. Note that renamed tests count towards both.
This pull request skips 5 and un-skips 1 tests.

♻️ This comment has been updated with latest results.
@hendrikmakait - Is there any background or motivation you can share about this work? Seems very pandas-specific at the moment.
    index: pd.Index, blocks: Sequence[Block], meta: pd.DataFrame
) -> pd.DataFrame:
    import pandas as pd
    from pandas.core.internals import BlockManager, make_block
Would be nice in the long run if any logic using pandas internals could be dispatched.
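A minimal sketch of what that could look like using dask.utils.Dispatch. The name from_shard_parts and the reconstruction details are illustrative assumptions, not this PR's actual API:

import pandas as pd
from dask.utils import Dispatch

# Hypothetical dispatch object; other backends could register their own
# implementation without distributed importing them eagerly.
from_shard_parts = Dispatch("from_shard_parts")

@from_shard_parts.register(pd.DataFrame)
def _pandas_from_shard_parts(meta, index, blocks):
    # Simplified mirror of the quoted helper: rebuild the frame from its index
    # and the already-materialized pandas blocks via the (internal) BlockManager.
    from pandas.core.internals import BlockManager

    return pd.DataFrame(BlockManager(blocks, [meta.columns, index]))

@from_shard_parts.register_lazy("cudf")
def _register_cudf():
    # A cudf-specific reconstruction would be registered here, and only
    # executed once cudf is actually imported.
    ...

The dispatch object picks an implementation based on the type of its first argument (meta here), which is the same pattern dask already uses for helpers like make_meta.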
@@ -22,7 +22,7 @@ dependencies:
   # Distributed depends on the latest version of Dask
   - pip
   - pip:
-    - git+https://github.com/dask/dask
+    - git+https://github.com/hendrikmakait/dask@p2p-pandas
Reminder to revert 814c73a
I'll let @hendrikmakait answer fully, but my sense is that early experiments showed substantial speedups (like 30% on full TPC-H Benchmarks, not just on the shuffling portion).
I can see how this would be convenient for RAPIDS folks (although probably a little inconvenient for others due to complexity). If we go this direction then maybe helping address/reduce this complexity is something that RAPIDS people could help design/execute/maintain?
Very nice!
This is a breaking change for rapids "as is". I suppose it would be "convenient" for rapids to continue working with p2p shuffling :)
I will certainly help with all of the above if doing so is both welcome and feasible. Feasibility probably depends on it being possible to isolate the logic that interacts with pandas internals. If organizing the code in this way really is "too complex", then cudf may need to rely on a completely distinct version of "p2p" to avoid slowing down Coiled engineers. However, my hope is that this is a natural way to organize things even if cudf wasn't in the picture. (fingers crossed)

For this PR, it seems fine to focus on pandas-only details (especially for now). @hendrikmakait - Perhaps we can meet offline so I can figure out whether rapids needs to prioritize (1) rolling back p2p support, or (2) enabling the pyarrow-free approach with cudf-backed data.
There are three main benefits I see here:
Sounds good! I'll ping you offline. Looking at the code we currently have in place, I expect it to be fairly straightforward to dispatch the library-specific pieces of code, or to dispatch instantiation of shuffle runs and implement a cuDF-specific version.
    concatenate3.
    """
    with path.open(mode="r+b") as fh:
        buffer = memoryview(mmap.mmap(fh.fileno(), 0))
The mmap causes us to lose strict control over the file descriptor lifecycle. This creates problems with cleanup on Windows because we can't remove files if they still have an open file descriptor.
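A minimal sketch (not the PR's code) of the lifecycle issue: on Windows, the file backing a mapping cannot be removed while the mapping or any exported memoryview is still alive, so cleanup requires releasing both explicitly. The filename here is hypothetical; in the PR the returned buffer escapes this scope, which is exactly why the explicit cleanup below is not straightforward.

import mmap
from pathlib import Path

path = Path("shard-spill.bin")  # hypothetical spill file
path.write_bytes(b"\x00" * 1024)

with path.open(mode="r+b") as fh:
    mm = mmap.mmap(fh.fileno(), 0)
    buffer = memoryview(mm)
    try:
        ...  # read shard bytes out of `buffer`
    finally:
        buffer.release()  # drop the exported view first ...
        mm.close()        # ... then close the mapping itself

path.unlink()  # only now can the file actually be removed on Windows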
t = t.drop([column])
splits = np.where(partition[1:] != partition[:-1])[0] + 1
splits = np.concatenate([[0], splits])
base = df[column].values.base  # type: ignore[union-attr]
Can you explain the base logic a bit? In cudf/cupy, df[column].values.base produces None. Does it make sense to protect against the case that base is None here? (Otherwise, len(base) will produce an error.)
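To illustrate the question, a small hypothetical pandas/numpy example of the view-vs-base relationship and the suggested guard (the variable names mirror the quoted snippet; the guard itself is only a suggestion, not code from this PR):

import numpy as np
import pandas as pd

df = pd.DataFrame({"_partition": np.array([0, 0, 1, 1, 2])})
column = "_partition"

partition = df[column].values
# With plain pandas/numpy, `partition` is usually a view into a block, so
# `.base` points at the owning ndarray. With cudf/cupy (or after a copy),
# `.base` can be None, hence the suggested guard before calling len(base).
base = partition.base if partition.base is not None else partition

# Start offsets of the runs of equal partition ids, as in the quoted code.
splits = np.where(partition[1:] != partition[:-1])[0] + 1
splits = np.concatenate([[0], splits])
print(splits)  # [0 2 4]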
return pickle_bytelist(
    (input_part_id, shard.index, *shard._mgr.blocks), prelude=False
)
Fun fact: This PR works fine with cudf-backed data when I do something like the following:
shard = shard.to_pandas() if hasattr(shard, "to_pandas") else shard
return pickle_bytelist(
    (input_part_id, shard.index, *shard._mgr.blocks), prelude=False
)
as long as I also modify the last line of unpickle_and_concat_dataframe_shards to be:
result = dd.methods.concat(shards, copy=True)
if hasattr(type(meta), "from_pandas"):
    return type(meta).from_pandas(result)
return result
I'm sure there are cases where the cudf->pandas->cudf round trip isn't perfect, but the small amount of logic I needed to work around the changes is promising. (The performance is certainly not good, but that's a separate concern.)
I've been experimenting with this PR using cudf-backed data. I think it will take some work on the rapids side to bring performance in line. Something like the following would make the library-specific pieces dispatchable:

def deconstruct_dataframe_shard(shard: pd.DataFrame) -> tuple[Any, ...]:
    """Deconstruct a DataFrame shard into the information needed to reconstruct it later.

    Dispatch on the type of `shard`.
    The elements of the result should be whatever is needed by `restore_dataframe_shard`.
    """
    ...

def restore_dataframe_shard(meta: pd.DataFrame, unpickled_shard: tuple[Any, ...]) -> pd.DataFrame:
    """Reconstruct an un-pickled DataFrame shard.

    Dispatch on the type of `meta`.
    `unpickled_shard` corresponds to the round-tripped output of `deconstruct_dataframe_shard`.
    Converts a tuple of un-pickled data (e.g. index and blocks) back to a DataFrame.
    """
    ...

For the sake of generality, pickle_dataframe_shard would become

def pickle_dataframe_shard(input_part_id: int, shard: pd.DataFrame) -> list[pickle.PickleBuffer]:
    return pickle_bytelist(
        (input_part_id,) + deconstruct_dataframe_shard(shard),
        prelude=False,
    )

and unpickle_and_concat_dataframe_shards

def unpickle_and_concat_dataframe_shards(
    b: bytes | bytearray | memoryview, meta: pd.DataFrame
) -> pd.DataFrame:
    import dask.dataframe as dd

    unpickled_shards = list(unpickle_bytestream(b))
    unpickled_shards.sort(key=first)
    shards = []
    for _, unpickled_shard in unpickled_shards:
        shards.append(restore_dataframe_shard(meta, unpickled_shard))
    return dd.methods.concat(shards, copy=True)

I suppose the only "complex" part of this general suggestion is that the dispatch functions themselves would need to be exposed/defined somewhere that a cudf backend can register against them.
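A hedged sketch of what that registration might look like if the two hooks above were exposed as dask Dispatch objects. The names are hypothetical, and the to_pandas()/from_pandas() bodies simply reuse the placeholder round trip from the earlier experiment, not a performant cudf path:

from dask.utils import Dispatch

# Hypothetical dispatch objects standing in for the two proposed hooks.
deconstruct_dataframe_shard = Dispatch("deconstruct_dataframe_shard")
restore_dataframe_shard = Dispatch("restore_dataframe_shard")

def _register_cudf():
    import cudf

    @deconstruct_dataframe_shard.register(cudf.DataFrame)
    def _deconstruct(shard):
        # Placeholder: ship the whole shard as one pandas frame.
        return (shard.to_pandas(),)

    @restore_dataframe_shard.register(cudf.DataFrame)
    def _restore(meta, unpickled_shard):
        (pdf,) = unpickled_shard
        return type(meta).from_pandas(pdf)

# Only run the registration if/when cudf is actually imported.
deconstruct_dataframe_shard.register_lazy("cudf", _register_cudf)
restore_dataframe_shard.register_lazy("cudf", _register_cudf)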
TL;DR: After running several A/B tests and profiling the code, I have concluded not to move forward with this PR because it is not a universal improvement.

Analysis

A/B tests have shown improvements ranging from 5%-30% in end-to-end runtime for almost all TPC-H queries. However, they have also shown disastrous results for other workloads. It looks like this PR is overfitting on OLAP-style queries that involve narrow dataframes with few columns and heavily reduce data, at the cost of ETL-style transformations that involve wide dataframes with many columns.

Profiling the code before and after, I saw that the transfer phase (among others) got more expensive. This is likely due to many small shards with very few rows. I have tried multiple possible quick wins without success.
Example py-spy profiles for future reference.
Supersedes #8606
Passes pre-commit run --all-files