Fix shuffle code to work with pyarrow 13 #8009
Conversation
@@ -951,7 +951,7 @@ def split_by_worker(
     # bytestream such that it cannot be deserialized anymore
     t = pa.Table.from_pandas(df, preserve_index=True)
     t = t.sort_by("_worker")
-    codes = np.asarray(t.select(["_worker"]))[0]
+    codes = np.asarray(t["_worker"])
So before, you were converting a single-column Table object to a 2D ndarray of shape (1, n), and then getting the 1D array of shape (n,) with that [0] indexing operation. With the latest pyarrow, the conversion gives shape (n, 1), and then [0] no longer does the correct thing. But we can also directly convert the column to a numpy array, which directly gives a 1D array.
I don't know why we chose to do it like this. I suppose both are pretty much identical in terms of performance, aren't they?
Is this restricting us in terms of backwards compatibility?
It should actually be better in terms of performance, because it avoids making a 2D array (and thus avoids one additional copy, I think).
No, I think this should work across all versions (getting a column of a Table and calling np.asarray on it has been supported for a long time).
> (and thus avoids one additional copy, I think)
Awesome. If that's the case, we should see this in our benchmarks soon (I don't think it's necessary to run a dedicated test for this).
Thanks a lot @jorisvandenbossche !
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
14 files ±0   14 suites ±0   6h 44m 50s ⏱️ +10m 36s
For more details on these failures, see this check. Results for commit 7c39d29. ± Comparison against base commit 9d516da.
OK, the Windows failures are unrelated... #8012
Thanks @jorisvandenbossche!
Thanks @jorisvandenbossche!
(cherry picked from commit b7e5f8f)
Closes #8007
Closes #8004
pre-commit run --all-files
This is a rather annoying issue (for our users to run into and diagnose...). I "fixed" the Table.__array__ conversion to be of the proper shape, and so now it returns the transpose of what it returned before (apache/arrow#34886). I was somewhat assuming that nobody would really rely on this broken behaviour, which was clearly not true.