Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved conversion between pyarrow and pandas in P2P shuffling #7896

Merged
merged 8 commits into from Jun 15, 2023

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Jun 8, 2023

Addresses #7880
Incorporates and blocked by #7895

Performance seems to vary wildly between runs on the reproducer from #7880, but not converting to string[python] and back to string[pyarrow] generally seems like a good idea.

  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions
Copy link
Contributor

github-actions bot commented Jun 8, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       20 files  ±0         20 suites  ±0   12h 8m 3s ⏱️ + 40m 15s
  3 679 tests ±0    3 564 ✔️  - 2     108 💤 ±0  7 +2 
35 580 runs  ±0  33 805 ✔️  - 6  1 768 💤 +4  7 +2 

For more details on these failures, see this check.

Results for commit 8e93913. ± Comparison against base commit 74a1bcd.

♻️ This comment has been updated with latest results.

@mrocklin
Copy link
Member

mrocklin commented Jun 9, 2023

Oh cool. So we'll avoid ever making python object strings now?

@hendrikmakait
Copy link
Member Author

Oh cool. So we'll avoid ever making python object strings now?

We'll avoid creating them per default. If there's a column that uses string[python] in meta, it will still get converted in return df.astype(meta.dtypes).

@mrocklin
Copy link
Member

mrocklin commented Jun 9, 2023

Sure. That makes sense. I'm curious to see how this impacts performance on current shuffling workloads. Both due to just general performance, but also possibly complex GIL-interactions with network bandwidth.

Do we currently use string[pyarrow] in shuffling benchmarks? If so, I'd be curious to see how this impacts performance.

@phofl
Copy link
Collaborator

phofl commented Jun 12, 2023

I've never used self_destruct=True personally, not sure how this impacts the arrays backing the table. Just mentioning that string[pyarrow] is zero-copy and uses the arrays from the table.

return None

df = table.to_pandas(self_destruct=True, types_mapper=default_types_mapper)
return df.astype(meta.dtypes, copy=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy=False to make this a no-op for columns with matching dtypes.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hendrikmakait -- LGTM

There are a bunch of shuffle-related test failures here. Not fully convinced they're actually related to the changes here since they're array rechunking tests (my guess is #7856), but am holding off on merging until you get a chance to look at things.

# if we have *some* `string[pyarrow]`
if (
pyarrow_dtype in {pa.large_string(), pa.string()}
and pd.StringDtype("pyarrow") in meta.dtypes.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are actually two implementations of pyarrow-strings in pandas. The one you have here and also pd.ArrowDtype(pa.string()). What you have here is fine for now, especially since it's just a performance optimization. Over in dask/dask we're also using pd.StringDtype("pyarrow") as it, historically, has been more feature complete than pd.ArrowDtype(pa.string()). That said, I think the situation has changed in pandas=2, so we may switch to pd.ArrowDtype(pa.string()) at some point in the future. This is mostly just an FYI in case we need to circle back to here in the future.

@hendrikmakait
Copy link
Member Author

@jrbourbeau: The test failures disappeared after merging with the latest main and skipping caches. This should be ready to merge.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hendrikmakait! This is in

@jrbourbeau jrbourbeau merged commit 9343965 into dask:main Jun 15, 2023
22 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants