[Python] `np.asarray(parrow_table)` returns a transposed representation of the data #34886

thomasjpfan · 2023-04-04T15:14:56Z

Describe the bug, including details regarding any error messages, version, and platform.

Running np.asarray on a PyArrow Table returns the transpose of the data:

import pyarrow as pa
import pandas as pd
import numpy as np

df = pd.DataFrame({'year': [2020, 2022, 2019, 2021],
                   'n_legs': [2, 4, 5, 100]})
pa_table = pa.Table.from_pandas(df)

# Converting to pandas first gives the expected result:
print(np.asarray(pa_table.to_pandas()))
# [[2020    2]
#  [2022    4]
#  [2019    5]
#  [2021  100]]

# Calling `np.asarray` directly gives the transpose:
print(np.asarray(pa_table))
# [[2020 2022 2019 2021]
#  [   2    4    5  100]]

I expect np.asarray(pa_table) to give the same result as np.asarray(pa_table.to_pandas()).

Component(s)

Python

danepitkin · 2023-04-07T17:35:59Z

Hi @thomasjpfan,

This isn't a bug, but a difference in the underlying storage layout of the objects (and the limitations of that).

Arrow supports interoperability with numpy at the array level (https://arrow.apache.org/docs/python/numpy.html). What you are seeing is the zero-copy conversion of the arrow columnar storage format into numpy arrays for each column (https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions). If you don't want to view the data in this format, a copy of the data needs to be made. This is inefficient and usually not the desired behavior at the array level. You'll need to implement the copying outside of pyarrow if you want this direct conversion.

For more complex datatypes (e.g. dataframes), you'll need to use pyarrow's pandas interoperability like in your example (https://arrow.apache.org/docs/python/pandas.html#pandas-integration).
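The point about copying can be illustrated with plain NumPy. This is only a sketch: the two 1-D arrays below stand in for separately stored table columns (an assumption for illustration, they are not Arrow buffers). Viewing a single column needs no copy, while a row-oriented 2-D array has to be written into newly allocated memory.

```python
import numpy as np

# Two separately allocated 1-D arrays, standing in for two table columns
# (illustrative stand-ins, not actual Arrow buffers).
year = np.array([2020, 2022, 2019, 2021])
n_legs = np.array([2, 4, 5, 100])

# Viewing a single column as an array needs no copy.
year_view = np.asarray(year)
print(year_view is year)  # True

# A (n_rows, n_columns) array interleaves the columns in memory,
# so np.stack has to allocate a new buffer and copy the values into it.
rows = np.stack([year, n_legs], axis=1)
print(rows.shape)                    # (4, 2)
print(np.shares_memory(rows, year))  # False
```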

jorisvandenbossche · 2023-04-07T19:45:35Z

I would still call it a bug (if it works, i.e. it returns something, it shouldn't transpose the data), but I think it is indeed caused because we only implemented numpy compatibility on the array level, as Dane mentioned.

When calling np.asarray(..) on a pyarrow Table, numpy sees an object that has none of the protocol methods such as __array__, but it does see an iterable object with __getitem__, and so it tries to convert it to an array as it would any list-like. To illustrate, here is what converting the table to a list gives:

In [2]: table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})

In [3]: list(table)
Out[3]: 
[<pyarrow.lib.ChunkedArray object at 0x7fb21b832e30>
 [
   [
     1,
     2,
     3
   ]
 ],
 <pyarrow.lib.ChunkedArray object at 0x7fb21b8328e0>
 [
   [
     4,
     5,
     6
   ]
 ]]

So here we get a list of the columns, each being a ChunkedArray. But because those arrays do have numpy compatibility through __array__, numpy unpacks them further: instead of creating a 1D array of column objects, it creates a 2D array, with the number of columns (how the table was unpacked initially) as the first dimension. This produces the "transposed" result compared to what you would expect.
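The fallback described above can be reproduced without pyarrow. The sketch below uses a hypothetical `ColumnarTable` class (not a pyarrow class) that defines no `__array__` but can be indexed and iterated column by column. NumPy treats it like a list of columns, and because each column is itself array-like, the result is 2-D with the columns along the first axis.

```python
import numpy as np


class ColumnarTable:
    """Hypothetical stand-in for a column-oriented table (not a pyarrow class)."""

    def __init__(self, columns):
        # Each column is stored as its own 1-D array.
        self._columns = [np.asarray(values) for values in columns.values()]

    def __len__(self):
        return len(self._columns)

    def __getitem__(self, i):
        # Indexing yields whole columns; IndexError ends iteration.
        return self._columns[i]


table = ColumnarTable({"a": [1, 2, 3], "b": [4, 5, 6]})

# No __array__ is defined, so numpy falls back to treating the object as a
# sequence of columns and unpacks each column into one row of the result.
result = np.asarray(table)
print(result.shape)  # (2, 3): columns along the first axis
print(result)
```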

Leaving this as is doesn't sound like a good idea, given the unexpected shape. Two options I can think of:

* Explicitly disallow conversion to numpy (I suppose we could raise an error in __array__, although we would have to check whether numpy still falls back to the current method in that case), and leave it to users to do the conversion themselves (or to go through another library that does this).
* Actually implement Table.__array__.

A simple implementation (for us or for external users) could be np.stack([np.asarray(col) for col in table], axis=1):

In [14]: np.stack([np.asarray(col) for col in table], axis=1)
Out[14]: 
array([[1, 4],
       [2, 5],
       [3, 6]])

I don't know whether that will start to fail in more complex cases, though. It seems that when the dtypes are not compatible, np.stack gives you object dtype instead of raising an error.
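A small check of how np.stack combines columns of different dtypes, using plain NumPy arrays as stand-ins for converted columns. The object-dtype string array is an assumption about how a non-numeric column might arrive, not something taken from the pyarrow conversion itself.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([0.5, 1.5, 2.5])
# Strings held as Python objects (an assumed stand-in for a non-numeric column).
strings = np.array(["x", "y", "z"], dtype=object)

# Integer and float columns are promoted to a common float dtype.
numeric = np.stack([ints, floats], axis=1)
print(numeric.dtype)  # float64

# Mixing a numeric column with an object column yields object dtype
# rather than raising an error.
mixed = np.stack([ints, strings], axis=1)
print(mixed.dtype)  # object
print(mixed.shape)  # (3, 2)
```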

danepitkin · 2023-04-07T20:08:34Z

+1 Thanks for the correction and the detailed examples @jorisvandenbossche! I agree we can call this a bug.

…nd RecordBatch (#36242)

### Rationale for this change

Currently, calling `np.asarray(table)` gives the wrong result (transpose of the expected result), because we don't implement an explicit `__array__` to numpy conversion, and then numpy falls back to iterate the object (but this iterates the columns, giving a transposed result).

To fix this unexpected result, I added an actual `__array__` (currently with a naive implementation in python, potentially this could be optimized in our C++ conversion layer)

### Are there any user-facing changes?

This changes the behaviour of `np.asarray(table/record_batch)`

* Closes: #34886

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
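For readers who want to see the idea in isolation, here is a minimal sketch of an explicit `__array__` built on the np.stack approach from the discussion. The `RowMajorTable` class is hypothetical and is not the code merged into pyarrow; it only shows how defining `__array__` lets np.asarray return one row per record.

```python
import numpy as np


class RowMajorTable:
    """Hypothetical wrapper for illustration; not pyarrow's implementation."""

    def __init__(self, columns):
        self._columns = {name: np.asarray(values) for name, values in columns.items()}

    def __array__(self, dtype=None, copy=None):
        # Stack the columns side by side: shape (n_rows, n_columns).
        # np.stack always allocates a new array, so the result is a copy.
        stacked = np.stack(list(self._columns.values()), axis=1)
        return stacked if dtype is None else stacked.astype(dtype)


table = RowMajorTable({"year": [2020, 2022, 2019, 2021],
                       "n_legs": [2, 4, 5, 100]})

# numpy finds __array__ and uses it instead of iterating over the object.
arr = np.asarray(table)
print(arr.shape)  # (4, 2)
print(arr)
```

The `dtype` and `copy` parameters are accepted so that the method can be called both with and without those arguments; a production implementation would likely handle `copy` more carefully.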

Calling np.asarray() on a pyarrow table used to return the transpose of the underlying data. This has now been fixed in the latest version of pyarrow. See: apache/arrow#34886 for more details.

thomasjpfan added the Type: bug label Apr 4, 2023

github-actions bot added the Component: Python label Apr 4, 2023

thomasjpfan mentioned this issue Apr 7, 2023

Support other dataframes like polars and pyarrow not just pandas scikit-learn/scikit-learn#25896

Open

jorisvandenbossche changed the title ~~np.asarray(parrow_table) returns a transposed representation of the data~~ [Python] np.asarray(parrow_table) returns a transposed representation of the data Apr 11, 2023

jorisvandenbossche added this to the 13.0.0 milestone Jun 22, 2023

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jun 22, 2023

apacheGH-34886: [Python] Add correct __array__ numpy conversion for T…

1b4c865

…able and RecordBatch

github-actions bot mentioned this issue Jun 22, 2023

GH-34886: [Python] Add correct __array__ numpy conversion for Table and RecordBatch #36242

Merged

github-actions bot assigned jorisvandenbossche Jun 22, 2023

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jun 28, 2023

Merge remote-tracking branch 'upstream/main' into apachegh-34886-tabl…

c878c56

…e-asarray

jorisvandenbossche closed this as completed in #36242 Jun 29, 2023

jorisvandenbossche mentioned this issue Jul 18, 2023

Fix shuffle code to work with pyarrow 13 dask/distributed#8009

Merged

2 tasks

jorisvandenbossche added the Breaking Change Includes a breaking change to the API label Sep 1, 2023
