[Python] np.asarray(parrow_table) returns a transposed representation of the data
#34886
Comments
Hi @thomasjpfan, this isn't a bug but a difference in the underlying storage layout of the objects (and the limitations of that). Arrow supports interoperability with numpy at the array level (https://arrow.apache.org/docs/python/numpy.html). What you are seeing is the zero-copy conversion of the arrow columnar storage format into numpy arrays, one per column (https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions). If you don't want to view the data in this format, a copy of the data needs to be made. This is inefficient and usually not the desired behavior at the array level. You'll need to implement the copying outside of pyarrow if you want this direct conversion. For more complex datatypes (e.g. dataframes), you'll need to use pyarrow's pandas interoperability like in your example (https://arrow.apache.org/docs/python/pandas.html#pandas-integration).
I would still call it a bug (if it works, i.e. it returns something, it shouldn't transpose the data), but I think it is indeed caused because we only implemented numpy compatibility on the array level, as Dane mentioned. When doing
So we get here a list of the column values, each being a ChunkedArray. But because those arrays now actually do have numpy compatibility with
Leaving this as is doesn't sound like a good idea, given the unexpected shape. Two options I would think of:
A simple implementation (for us or for external users) could be
I don't know if that will start to fail with more complex cases, though. Although it seems if the dtypes are not compatible,
+1 Thanks for the correction and the detailed examples @jorisvandenbossche! I agree we can call this a bug.
…nd RecordBatch (#36242)

### Rationale for this change

Currently, calling `np.asarray(table)` gives the wrong result (transpose of the expected result), because we don't implement an explicit `__array__` to numpy conversion, and then numpy falls back to iterating the object (but this iterates the columns, giving a transposed result).

To fix this unexpected result, I added an actual `__array__` (currently with a naive implementation in Python; potentially this could be optimized in our C++ conversion layer).

### Are there any user-facing changes?

This changes the behaviour of `np.asarray(table/record_batch)`.

* Closes: #34886

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
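The fallback described in this commit message can be reproduced with NumPy alone. The sketch below uses two hypothetical stand-in classes, `ColumnarTable` and `RowMajorTable`; they are not pyarrow's classes, and the `__array__` body is a naive illustration, not the implementation merged in the PR.

```python
import numpy as np


class ColumnarTable:
    """Hypothetical column-oriented container (not pyarrow.Table)."""

    def __init__(self, columns):
        self._columns = [list(col) for col in columns]

    def __len__(self):
        return len(self._columns)

    def __getitem__(self, index):
        # Indexing (and therefore iteration) yields whole columns.
        return self._columns[index]


class RowMajorTable(ColumnarTable):
    """Same container, plus an explicit NumPy conversion."""

    def __array__(self, dtype=None, copy=None):
        # Naive conversion: stack the columns, then transpose so that
        # each row of the result corresponds to one table row.
        return np.array(self._columns, dtype=dtype).T


cols = [[1, 2, 3], [4, 5, 6]]  # two columns, three rows

# Without __array__, NumPy treats the object as a sequence of columns.
fallback = np.asarray(ColumnarTable(cols))

# With __array__, NumPy uses the explicit conversion instead.
explicit = np.asarray(RowMajorTable(cols))

print(fallback.shape, explicit.shape)
```

Without `__array__`, NumPy sees a sequence whose items are columns, so the result has one row per column; defining `__array__` lets the object decide the orientation itself.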
Calling np.asarray() on a pyarrow table used to return the transpose of the underlying data. This has now been fixed in the latest version of pyarrow. See: apache/arrow#34886 for more details.
Describe the bug, including details regarding any error messages, version, and platform.
Running `np.asarray` on a PyArrow Table returns the transpose of the data.
I expect that `np.asarray` gives the same result as `np.asarray(pa_table.to_pandas())`.
Component(s)
Python