Improved conversion between `pyarrow` and `pandas` in P2P shuffling #7896

hendrikmakait · 2023-06-08T16:05:10Z

Addresses #7880
~~Incorporates and blocked by #7895~~

Performance seems to vary wildly between runs on the reproducer from #7880, but not converting to `string[python]` and back to `string[pyarrow]` generally seems like a good idea.

Tests added / passed
Passes `pre-commit run --all-files`

github-actions · 2023-06-08T17:14:41Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      20 files ±0       20 suites ±0 12h 8m 3s ⏱️ + 40m 15s
  3 679 tests ±0   3 564 ✔️ - 2   108 💤 ±0 7 ❌ +2
35 580 runs ±0 33 805 ✔️ - 6 1 768 💤 +4 7 ❌ +2

For more details on these failures, see this check.

Results for commit 8e93913. ± Comparison against base commit 74a1bcd.

♻️ This comment has been updated with latest results.

mrocklin · 2023-06-09T11:35:01Z

Oh cool. So we'll avoid ever making python object strings now?

hendrikmakait · 2023-06-09T11:50:03Z

> Oh cool. So we'll avoid ever making python object strings now?

We'll avoid creating them by default. If there's a column that uses `string[python]` in `meta`, it will still get converted in `return df.astype(meta.dtypes)`.

mrocklin · 2023-06-09T11:51:55Z

Sure. That makes sense. I'm curious to see how this impacts performance on current shuffling workloads, both in general and through possibly complex GIL interactions with network bandwidth.

Do we currently use `string[pyarrow]` in shuffling benchmarks? If so, I'd be curious to see how this impacts performance.

distributed/shuffle/_arrow.py

phofl · 2023-06-12T08:03:12Z

I've never used `self_destruct=True` personally, so I'm not sure how it affects the arrays backing the table. Just mentioning that `string[pyarrow]` is zero-copy and uses the arrays from the table.

hendrikmakait · 2023-06-12T16:02:51Z

distributed/shuffle/_arrow.py

+        return None
+
+    df = table.to_pandas(self_destruct=True, types_mapper=default_types_mapper)
+    return df.astype(meta.dtypes, copy=False)


`copy=False` makes this a no-op for columns with matching dtypes.
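A small sketch of that idea, assuming a made-up frame and target dtypes: only the column whose dtype differs needs real conversion work. Whether memory is actually shared for the matching column depends on the pandas version (newer releases with copy-on-write may handle or deprecate the `copy` keyword), so the check is printed rather than asserted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(3, dtype="int64"), "y": [0.5, 1.5, 2.5]})

# Assumed target dtypes, playing the role of meta.dtypes:
# "x" already matches, "y" needs a real cast.
target = pd.Series({"x": np.dtype("int64"), "y": np.dtype("float32")})

out = df.astype(target, copy=False)

print(out.dtypes)
# Informational only: does the unchanged column reuse the original buffer?
print(np.shares_memory(df["x"].to_numpy(), out["x"].to_numpy()))
```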

jrbourbeau

Thanks @hendrikmakait -- LGTM

There are a bunch of shuffle-related test failures here. I'm not fully convinced they're actually related to the changes here, since they're array rechunking tests (my guess is #7856), but I'm holding off on merging until you get a chance to look at things.

jrbourbeau · 2023-06-12T21:58:37Z

distributed/shuffle/_arrow.py

+        # if we have *some* `string[pyarrow]`
+        if (
+            pyarrow_dtype in {pa.large_string(), pa.string()}
+            and pd.StringDtype("pyarrow") in meta.dtypes.values


There are actually two implementations of pyarrow strings in pandas: the one you have here and also `pd.ArrowDtype(pa.string())`. What you have here is fine for now, especially since it's just a performance optimization. Over in dask/dask we're also using `pd.StringDtype("pyarrow")`, as it has historically been more feature complete than `pd.ArrowDtype(pa.string())`. That said, I think the situation has changed in pandas=2, so we may switch to `pd.ArrowDtype(pa.string())` at some point in the future. This is mostly just an FYI in case we need to circle back here in the future.

hendrikmakait · 2023-06-13T17:43:32Z

@jrbourbeau: The test failures disappeared after merging with the latest main and skipping caches. This should be ready to merge.

jrbourbeau

Thanks @hendrikmakait! This is in.

hendrikmakait requested a review from fjetter as a code owner June 8, 2023 16:05

Add types mapper

87fef7c

hendrikmakait force-pushed the dont-convert-strings branch from 7cef449 to 87fef7c Compare June 9, 2023 13:19

phofl reviewed Jun 12, 2023

large_string

3f46581

hendrikmakait added 3 commits June 12, 2023 16:43

Minor

21f129d

[skip-caching]

684d988

Copy only if necessary

38898f7

hendrikmakait commented Jun 12, 2023

jrbourbeau reviewed Jun 12, 2023

hendrikmakait added 2 commits June 13, 2023 12:30

Merge branch 'main' into dont-convert-strings

c27be14

[skip-caching]

f793bcc

[skip-caching]

8e93913

jrbourbeau approved these changes Jun 15, 2023

jrbourbeau merged commit 9343965 into dask:main Jun 15, 2023
22 of 27 checks passed

rjzamora mentioned this pull request Jun 20, 2023

Use general kwargs in pyarrow_table_dispatch functions dask/dask#10364

Merged

jorisvandenbossche mentioned this pull request Jul 10, 2023

Remove accidental duplicated conversion of pyarrow Table to pandas #7983

Merged

2 tasks

hendrikmakait mentioned this pull request Aug 16, 2023

[Tracking] Advancements for P2P #8043

Open

15 tasks
