[Python] List array conversion to Numpy N-d array #29892

asfimport · 2021-10-14T14:56:59Z

When converting a single-dimensional array to numpy, the dtype is preserved

import pyarrow as pa
x = pa.array([.234,.345,.456])
x.to_numpy().dtype # dtype('float64')

But when doing the same for a multi-dimensional array, the dtype is lost and cannot be set manually

import numpy as np
x = pa.array([[1,2,3],[4,5,6]]).to_numpy(zero_copy_only=False)
print(x.dtype) # object
x.astype(np.float64) # ValueError: setting an array element with a sequence.

Which is to say that numpy believes this array is not uniform. The only way to get it to the proper dtype is to convert it to a python list then back to a numpy array.

Is there another way to achieve this? Or, at least, can it be fixed such that we can manually set the dtype of the numpy array after conversion?

I know that pyarrow doesn't support ndarrays with ndim>1 (https://issues.apache.org/jira/browse/ARROW-5645) but I was curious if this can be achieved going the other way.

Reporter: Ben Epstein

_{Note: This issue was originally created as ARROW-14320. Please see the migration documentation for further details.}

asfimport · 2021-10-14T15:00:35Z

Antoine Pitrou / @pitrou:
It can't work because you have a PyArrow list array and Arrow lists are variable-sized, so you cannot convert them to a rectangular 2d array.

However, it would be nice if this could work with a fixed-size-list array:

>>> pa.array([[1,2,3],[4,5,6]], type=pa.list_(pa.int32(), 3))
<pyarrow.lib.FixedSizeListArray object at 0x7f787e2e8ac0>
[
  [
    1,
    2,
    3
  ],
  [
    4,
    5,
    6
  ]
]
>>> pa.array([[1,2,3],[4,5,6]], type=pa.list_(pa.int32(), 3)).to_numpy()
Traceback (most recent call last):
  [...]
ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

asfimport · 2021-10-14T15:00:46Z

Antoine Pitrou / @pitrou:
cc @jorisvandenbossche @amol-

asfimport · 2021-10-14T15:29:38Z

Ben Epstein:
Even in that scenario you cannot manually fix the numpy array

pa.array([[1,2,3],[4,5,6]], type=pa.list_(pa.int32(), 3)).to_numpy(zero_copy_only=False).astype(np.int32) # ValueError: setting an array element with a sequence.

asfimport · 2021-10-15T16:55:59Z

Joris Van den Bossche / @jorisvandenbossche:
That's because, with that code snippet, it currently still returns a 1D numpy array that contains numpy arrays as elements (and numpy doesn't have much support for this kind of nested array).
In theory we could return a zero-copy 2D numpy array if you have a fixed-length list array type (and then calling astype on such a 2D array would work). However, I think we will always default to returning a 1D array (so that would need an additional keyword?)

asfimport · 2021-10-15T18:26:36Z

Ben Epstein:
@jorisvandenbossche The second example sets a fixed length, right?

asfimport · 2021-10-15T18:36:37Z

Joris Van den Bossche / @jorisvandenbossche:
Yes, your last example is using a fixed-length list type.

asfimport · 2021-10-15T18:39:24Z

Ben Epstein:
The issue I'm currently facing is that even after converting it to a numpy array, which has dtype=object, I cannot "convince" the numpy array that its records are of constant length. If I could do that, I could avoid ever having to put this into a python list. Do you have any ideas or workarounds for that?

asfimport · 2021-10-15T18:46:26Z

Antoine Pitrou / @pitrou:
You could work around it in three steps:

1. flatten the PyArrow list array
2. convert it to Numpy (this will give a 1D Numpy array)
3. reshape the Numpy array

>>> a = pa.array([[1,2,3], [4,5,6]])
>>> a.flatten().to_numpy().reshape((-1, 3))
array([[1, 2, 3],
       [4, 5, 6]])
>>> a.flatten().to_numpy().reshape((-1, 3)).dtype
dtype('int64')

Ben-Epstein · 2023-02-26T02:17:28Z

Does the arrow team plan on building an internal fix for this?

westonpace · 2023-02-27T22:09:09Z

There has been recent discussion around support for tensors as a canonical extension type. That might be a better and more general purpose solution for this ask. I am not as familiar with all of the nuance involved but I would encourage you to take a look at [1][2][3] and see if that proposal could work.

jorisvandenbossche · 2023-03-23T11:42:28Z

To be explicit, there is no "internal" fix to be done, as this conversion is already possible zero-copy while preserving the dtype, if you convert the flat values (i.e. what Antoine showed above):

>>> a = pa.array([[1,2,3], [4,5,6]])
>>> a.flatten().to_numpy()
array([1, 2, 3, 4, 5, 6])
>>> a.flatten().to_numpy().reshape(2, 3)
array([[1, 2, 3],
       [4, 5, 6]])

So it is more a question of what user-facing API we provide for this. Do we expect the user to do this themselves, or do we want to add some "to_numpy_2d" method to FixedSizeListArray that does that for you?
The existing to_numpy cannot do this, because this method is expected to give you a 1D array of the same length as the pyarrow array. I personally would lean towards letting the user do this themselves, since this is relatively straightforward to do and then you have full control (a method to get a 2D array would also get messy if you have a list array with multiple levels of nesting). So regarding the original topic, I would tend to close this issue.

But @westonpace makes a good point that the FixedShapeTensorArray extension type that is being added might be interesting, depending on your exact use case. The pyarrow API for that still needs to be finalized and merged, but we were planning to add a to_numpy_array method (or some other name) that gives you the actual underlying array zero-copy as a N-d array. See the examples in the documentation that is being added in #33948

davlee1972 · 2023-06-09T14:16:04Z

** Edit ** - A StructArray or just 3 arrays/vectors might be better.

Wouldn't it be better to convert an arrow tensor type into an arrow list of structs?
This would make multidimensional matrices searchable.
You could also write this efficiently to parquet.

What is missing in the solution above are the names for x (3 columns) and y (2 rows).

     | Bob | Mary | John
Kids | 1   | 2    | 3
Cars | 4   | 5    | 6

[{'Bob', 'Kids', 1}, {'Bob', 'Cars', 4}, {'Mary', 'Kids', 2}, {'Mary', 'Cars', 5}, {'John', 'Kids', 3}, {'John', 'Cars', 6}]

pa.schema([
    # a schema needs named fields; 'cells' is only a placeholder name
    pa.field('cells', pa.list_(
        pa.struct([
            pa.field('person', pa.string()),
            pa.field('has', pa.string()),
            pa.field('count', pa.int32()),
        ])
    ))
])

OR which might be more "searchable"

pa.schema([
            pa.field('person', pa.string()),
            pa.field('has', pa.string()),
            pa.field('count', pa.int32()),
])

jorisvandenbossche added Type: usage Issue is a user question and removed Type: bug labels Mar 23, 2023
