Improved conversion between pyarrow and pandas in P2P shuffling #7896
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
20 files ±0, 20 suites ±0, 12h 8m 3s ⏱️ +40m 15s
For more details on these failures, see this check.
Results for commit 8e93913. ± Comparison against base commit 74a1bcd.
♻️ This comment has been updated with latest results.
Oh cool. So we'll avoid ever making python object strings now?
We'll avoid creating them per default. If there's a column that uses
Sure. That makes sense. I'm curious to see how this impacts performance on current shuffling workloads. Both due to just general performance, but also possibly complex GIL-interactions with network bandwidth. Do we currently use
7cef449 to 87fef7c
I've never used |
    return None

df = table.to_pandas(self_destruct=True, types_mapper=default_types_mapper)
return df.astype(meta.dtypes, copy=False)
`copy=False` to make this a no-op for columns with matching dtypes.
Thanks @hendrikmakait -- LGTM
There are a bunch of shuffle-related test failures here. Not fully convinced they're actually related to the changes here since they're array rechunking tests (my guess is #7856), but am holding off on merging until you get a chance to look at things.
# if we have *some* `string[pyarrow]`
if (
    pyarrow_dtype in {pa.large_string(), pa.string()}
    and pd.StringDtype("pyarrow") in meta.dtypes.values
There are actually two implementations of `pyarrow`-strings in `pandas`. The one you have here and also `pd.ArrowDtype(pa.string())`. What you have here is fine for now, especially since it's just a performance optimization. Over in `dask/dask` we're also using `pd.StringDtype("pyarrow")` as it, historically, has been more feature complete than `pd.ArrowDtype(pa.string())`. That said, I think the situation has changed in `pandas=2`, so we may switch to `pd.ArrowDtype(pa.string())` at some point in the future. This is mostly just an FYI in case we need to circle back to here in the future.
@jrbourbeau: The test failures disappeared after merging with the latest |
Thanks @hendrikmakait! This is in
Addresses #7880
Incorporates and blocked by #7895

Performance seems to vary wildly between runs on the reproducer from #7880, but not converting to `string[python]` and back to `string[pyarrow]` generally seems like a good idea.

pre-commit run --all-files