[Python] List array conversion to Numpy N-d array #29892

Open
asfimport opened this issue Oct 14, 2021 · 12 comments
Labels
Component: Python Type: usage Issue is a user question

Comments

@asfimport
Collaborator

When converting a single-dimensional array to numpy, the dtype is preserved

import pyarrow as pa
x = pa.array([.234,.345,.456])
x.to_numpy().dtype # dtype('float64')

But when doing the same for a multi-dimensional array, the dtype is lost and cannot be set manually

import numpy as np
x = pa.array([[1,2,3],[4,5,6]]).to_numpy(zero_copy_only=False)
print(x.dtype) # object
x.astype(np.float64) # ValueError: setting an array element with a sequence.

Which is to say that numpy believes this array is not uniform. The only way to get it to the proper dtype is to convert it to a python list then back to a numpy array.

Is there another way to achieve this? Or, at least, can it be fixed such that we can manually set the dtype of the numpy array after conversion?

I know that pyarrow doesn't support ndarrays with ndim>1 (https://issues.apache.org/jira/browse/ARROW-5645) but I was curious if this can be achieved going the other way.

Reporter: Ben Epstein

Note: This issue was originally created as ARROW-14320. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
It can't work because you have a PyArrow list array and Arrow lists are variable-sized, so you cannot convert them to a rectangular 2d array.

However, it would be nice if this could work with a fixed-size-list array:

>>> pa.array([[1,2,3],[4,5,6]], type=pa.list_(pa.int32(), 3))
<pyarrow.lib.FixedSizeListArray object at 0x7f787e2e8ac0>
[
  [
    1,
    2,
    3
  ],
  [
    4,
    5,
    6
  ]
]
>>> pa.array([[1,2,3],[4,5,6]], type=pa.list_(pa.int32(), 3)).to_numpy()
Traceback (most recent call last):
  [...]
ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

@asfimport
Collaborator Author

Ben Epstein:
Even in that scenario you cannot manually fix the numpy array:

pa.array([[1,2,3],[4,5,6]], type=pa.list_(pa.int32(), 3)).to_numpy(zero_copy_only=False).astype(np.int32) # ValueError: setting an array element with a sequence.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
That's because, with that code snippet, it currently still returns a 1D numpy array that contains numpy arrays as elements (and numpy doesn't have much support for this kind of nested array).
In theory we could return a zero-copy 2D numpy array if you have a fixed-length list array type (and then calling astype on such a 2D array would work). However, I think we will always default to returning a 1D array (so that would need an additional keyword?)

@asfimport
Collaborator Author

Ben Epstein:
@jorisvandenbossche  The second example sets a fixed length, right?

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Yes, your last example is using a fixed-length list type.

@asfimport
Collaborator Author

Ben Epstein:
The issue I'm currently facing is that even after converting it to a numpy array, which has dtype=object, I cannot "convince" the numpy array that its records are of constant length. If I could do that, I could avoid ever having to put this into a python list. Do you have any ideas or workarounds for that?

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
You could work around it in three steps:

  • flatten the PyArrow list array

  • convert it to Numpy (this will give a 1D Numpy array)

  • reshape the Numpy array

    >>> a = pa.array([[1,2,3], [4,5,6]])
    >>> a.flatten().to_numpy().reshape((-1, 3))
    array([[1, 2, 3],
           [4, 5, 6]])
    >>> a.flatten().to_numpy().reshape((-1, 3)).dtype
    dtype('int64')

@Ben-Epstein

Does the arrow team plan on building an internal fix for this?

@westonpace
Member

westonpace commented Feb 27, 2023

There has been recent discussion around support for tensors as a canonical extension type. That might be a better and more general purpose solution for this ask. I am not as familiar with all of the nuance involved but I would encourage you to take a look at [1][2][3] and see if that proposal could work.

@jorisvandenbossche jorisvandenbossche added Type: usage Issue is a user question and removed Type: bug labels Mar 23, 2023
@jorisvandenbossche
Member

To be explicit, there is no "internal" fix to be done: this conversion is already possible zero-copy while preserving the dtype, if you convert the flat values (i.e. what Antoine showed above):

>>> a = pa.array([[1,2,3], [4,5,6]])
>>> a.flatten().to_numpy()
array([1, 2, 3, 4, 5, 6])
>>> a.flatten().to_numpy().reshape(2, 3)
array([[1, 2, 3],
       [4, 5, 6]])

So it is more a question of what user-facing API we provide for this. Do we expect users to do this themselves, or do we want to add some "to_numpy_2d" method to FixedSizeListArray that does it for them?
The existing to_numpy cannot do this, because that method is expected to give you a 1D array of the same length as the pyarrow array. I personally would lean towards letting users do this themselves, since it is relatively straightforward and gives them full control (a method to get a 2D array would also get messy if you have a list array with multiple levels of nesting). So regarding the original topic, I would tend to close this issue.

But @westonpace makes a good point that the FixedShapeTensorArray extension type that is being added might be interesting, depending on your exact use case. The pyarrow API for that still needs to be finalized and merged, but we were planning to add a to_numpy_array method (or some other name) that gives you the actual underlying array zero-copy as a N-d array. See the examples in the documentation that is being added in #33948

@davlee1972

davlee1972 commented Jun 9, 2023

** Edit ** - A StructArray or just 3 arrays/vectors might be better.

Wouldn't it be better to convert an arrow tensor type into an arrow list of structs?
This would make multidimensional matrices searchable.
You could also write this efficiently to parquet.

What is missing in the solution above are the names for x (3 columns) and y (2 rows).

     | Bob | Mary | John
Kids | 1   | 2    | 3
Cars | 4   | 5    | 6

[{'Bob', 'Kids', 1}, {'Bob', 'Cars', 4}, {'Mary', 'Kids', 2}, {'Mary', 'Cars', 5}, {'John', 'Kids', 3}, {'John', 'Cars', 6}]

pa.schema([
    # 'cells' is a placeholder name; pa.schema expects named fields
    pa.field('cells', pa.list_(
        pa.struct([
            pa.field('person', pa.string()),
            pa.field('has', pa.string()),
            pa.field('count', pa.int32()),
        ])
    )),
])

OR which might be more "searchable"

pa.schema([
            pa.field('person', pa.string()),
            pa.field('has', pa.string()),
            pa.field('count', pa.int32()),
])
