Fix shuffle code to work with pyarrow 13 #8009
Conversation
@@ -951,7 +951,7 @@ def split_by_worker(
     # bytestream such that it cannot be deserialized anymore
     t = pa.Table.from_pandas(df, preserve_index=True)
     t = t.sort_by("_worker")
-    codes = np.asarray(t.select(["_worker"]))[0]
+    codes = np.asarray(t["_worker"])
So before, you were converting a single-column Table object to a 2D ndarray of shape (1, n), and then getting the 1D array of shape (n,) with that [0] indexing operation. With the latest pyarrow, the conversion gives shape (n, 1), and then [0] no longer does the correct thing. But we can also directly convert the column to a numpy array, which directly gives a 1D array.
I don't know why we chose to do it like this. I suppose both are pretty much identical in terms of performance, aren't they?
Is this restricting us in terms of backwards compatibility?
It should actually be better in terms of performance, because it avoids making a 2D array (and thus avoids one additional copy, I think).
No, I think this should work across all versions (getting a column of a Table and calling np.asarray on it has been supported for a long time).
> (and thus avoids one additional copy, I think)
Awesome. If that's the case, we should see this in our benchmarks soon (I don't think it's necessary to run a dedicated test for this).
Thanks a lot @jorisvandenbossche !
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
14 files ±0   14 suites ±0   6h 44m 50s ⏱️ +10m 36s
For more details on these failures, see this check. Results for commit 7c39d29. ± Comparison against base commit 9d516da.
OK, the Windows failures are unrelated... #8012
Thanks @jorisvandenbossche!
Thanks @jorisvandenbossche!
(cherry picked from commit b7e5f8f)
Closes #8007
Closes #8004
pre-commit run --all-files
This is a rather annoying issue (for our users to run into and diagnose...). I "fixed" the Table.__array__ conversion to be of the proper shape, and so now it returns the transpose of what it returned before (apache/arrow#34886). I was somewhat assuming that nobody would really rely on this broken behaviour, which was clearly not true.