Fix shuffle code to work with pyarrow 13 #8009

Merged
merged 1 commit into from Jul 18, 2023


jorisvandenbossche
Member

Closes #8007
Closes #8004

  • Tests added / passed
  • Passes pre-commit run --all-files

This is a rather annoying issue for our users to run into and diagnose. On the pyarrow side, I "fixed" the Table.__array__ conversion to return the proper shape, so it now returns the transpose of what it returned before (apache/arrow#34886). I had assumed that nobody would really rely on the broken behaviour, which was clearly not true.

@@ -951,7 +951,7 @@ def split_by_worker(
     # bytestream such that it cannot be deserialized anymore
     t = pa.Table.from_pandas(df, preserve_index=True)
     t = t.sort_by("_worker")
-    codes = np.asarray(t.select(["_worker"]))[0]
+    codes = np.asarray(t["_worker"])
Member Author


So before, you were converting a single-column Table object to a 2D ndarray of shape (1, n) and then getting the 1D array of shape (n,) with that indexing operation.
With the latest pyarrow, the conversion gives shape (n, 1), so [0] returns only the first row instead of the full column.

But we can also convert the column itself to a NumPy array, which directly gives a 1D array.
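
The shape problem described above can be reproduced without pyarrow at all. The following is a minimal NumPy-only sketch: the arrays `old_layout` and `new_layout` are hypothetical stand-ins for what the Table-to-ndarray conversion is described as returning before and after the pyarrow change, not actual pyarrow output.

```python
import numpy as np

# Hypothetical stand-in for the sorted "_worker" column with n = 4 rows.
column = np.array([0, 0, 1, 2])

# Layout described for older pyarrow: shape (1, n).
old_layout = column.reshape(1, -1)
# Layout described for pyarrow 13: shape (n, 1), i.e. the transpose.
new_layout = column.reshape(-1, 1)

codes_old = old_layout[0]  # the whole column, shape (n,)
codes_new = new_layout[0]  # only the first row, shape (1,)

print(codes_old.shape, codes_new.shape)
```

With the transposed layout, `[0]` silently yields a one-element array rather than raising an error, which is presumably why the breakage was hard for users to diagnose.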

Member

@fjetter fjetter left a comment


I don't know why we chose to do it like this. I suppose both are pretty much identical in terms of performance, aren't they?
Is this restricting us in terms of backwards compatibility?

@jorisvandenbossche
Member Author

> I suppose both are pretty much identical in terms of performance, aren't they?

It should actually be better in terms of performance, because it avoids building a 2D array (and thus avoids one additional copy, I think).

> Is this restricting us in terms of backwards compatibility?

No, I think this should work across all versions (getting a column of a Table and calling asarray on it has been supported for a long time).

Member

@fjetter fjetter left a comment


> (and thus avoids one additional copy, I think)

Awesome. If that's the case we should see this in our benchmarks soon (I don't think it's necessary to run a dedicated test for this).

Thanks a lot @jorisvandenbossche !

@github-actions
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       14 files  ±0         14 suites  ±0   6h 44m 50s ⏱️ +10m 36s
  3 716 tests ±0    3 607 ✔️ ±0     108 💤 ±0  1 ❌ ±0
24 892 runs  ±0  23 713 ✔️ ±0  1 178 💤 ±0  1 ❌ ±0

For more details on these failures, see this check.

Results for commit 7c39d29. ± Comparison against base commit 9d516da.

@fjetter
Member

fjetter commented Jul 18, 2023

OK, the Windows failures are unrelated; see #8012.

@fjetter fjetter merged commit b7e5f8f into dask:main Jul 18, 2023
21 of 28 checks passed
@fjetter
Member

fjetter commented Jul 18, 2023

Thanks @jorisvandenbossche !

Member

@jrbourbeau jrbourbeau left a comment


phofl pushed a commit to phofl/distributed that referenced this pull request Jul 24, 2023
@jorisvandenbossche jorisvandenbossche deleted the fix-pyarrow-13 branch August 14, 2023 09:43