Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved conversion between pyarrow and pandas in P2P shuffling #7896

Merged
merged 8 commits into from Jun 15, 2023
Merged
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
15 changes: 14 additions & 1 deletion distributed/shuffle/_arrow.py
Expand Up @@ -46,6 +46,7 @@


def convert_partition(data: bytes, meta: pd.DataFrame) -> pd.DataFrame:
import pandas as pd

Check warning on line 49 in distributed/shuffle/_arrow.py

View check run for this annotation

Codecov / codecov/patch

distributed/shuffle/_arrow.py#L49

Added line #L49 was not covered by tests
import pyarrow as pa

file = BytesIO(data)
Expand All @@ -56,7 +57,19 @@
shards.append(sr.read_all())
table = pa.concat_tables(shards)
df = table.to_pandas(self_destruct=True)
return df.astype(meta.dtypes)

def default_types_mapper(pyarrow_dtype: pa.DataType) -> object:

Check warning on line 61 in distributed/shuffle/_arrow.py

View check run for this annotation

Codecov / codecov/patch

distributed/shuffle/_arrow.py#L61

Added line #L61 was not covered by tests
# Avoid converting strings from `string[pyarrow]` to `string[python]`
# if we have *some* `string[pyarrow]`
if (

Check warning on line 64 in distributed/shuffle/_arrow.py

View check run for this annotation

Codecov / codecov/patch

distributed/shuffle/_arrow.py#L64

Added line #L64 was not covered by tests
pyarrow_dtype in {pa.large_string(), pa.string()}
and pd.StringDtype("pyarrow") in meta.dtypes.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are actually two implementations of pyarrow-strings in pandas. The one you have here and also pd.ArrowDtype(pa.string()). What you have here is fine for now, especially since it's just a performance optimization. Over in dask/dask we're also using pd.StringDtype("pyarrow") as it, historically, has been more feature complete than pd.ArrowDtype(pa.string()). That said, I think the situation has changed in pandas=2, so we may switch to pd.ArrowDtype(pa.string()) at some point in the future. This is mostly just an FYI in case we need to circle back to here in the future.

):
return pd.StringDtype("pyarrow")
return None

Check warning on line 69 in distributed/shuffle/_arrow.py

View check run for this annotation

Codecov / codecov/patch

distributed/shuffle/_arrow.py#L68-L69

Added lines #L68 - L69 were not covered by tests

df = table.to_pandas(self_destruct=True, types_mapper=default_types_mapper)
return df.astype(meta.dtypes, copy=False)

Check warning on line 72 in distributed/shuffle/_arrow.py

View check run for this annotation

Codecov / codecov/patch

distributed/shuffle/_arrow.py#L71-L72

Added lines #L71 - L72 were not covered by tests
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy=False to make this a no-op for columns with matching dtypes.



def list_of_buffers_to_table(data: list[bytes]) -> pa.Table:
Expand Down