[Python] Add Array.to_numpy functions #18249

Closed
asfimport opened this issue Mar 9, 2018 · 4 comments

asfimport commented Mar 9, 2018

There are to_pandas() functions, but no to_numpy() functions. I'd like to propose that we include both.

Also, pyarrow.lib.Array.to_pandas() returns a numpy.ndarray, which imho is very confusing :). I think it would be more intuitive for the to_pandas() functions to return pandas.Series and pandas.DataFrame objects, and the to_numpy() functions to return numpy.ndarray and either an ordered dict of numpy.ndarray or a structured numpy.ndarray depending on a flag, for example. The to_pandas() function is of course welcome to use the to_numpy() func to avoid the additional index and whatnot of the pandas.Series.

 

Reporter: Lawrence Chan / @llchan

Related issues:

Note: This issue was originally created as ARROW-2295. Please see the migration documentation for further details.

@asfimport

Jim Pivarski / @jpivarski:
I second this and would like to request that the NumPy interface have more low-level access to Arrow structures. For instance, ListArray is internally represented as two arrays: offsets and contents, and there are applications where we'd want to get a zero-copy view of these arrays. The to_pandas() function constructs a NumPy object array of subarrays, which is a performance bottleneck if you really do want the original offsets and contents.

This function could be the inverse of pyarrow.ListArray.from_arrays: something that returns the offsets and contents as NumPy arrays for a List, and something more complex for the general case (a dict mapping strings that represent a place in the hierarchy to NumPy arrays?).

A simpler interface that could be implemented immediately would be one that returns the raw bytes of the Arrow buffer, to let us identify its contents using the Arrow spec (https://github.com/apache/arrow/blob/master/format/Layout.md). But that doesn't make use of the dtype (probably just set it to uint8) and would probably make more sense as a raw buffer. (Should that be a separate JIRA ticket?)

 

@asfimport

Antoine Pitrou / @pitrou:

> Also, pyarrow.lib.Array.to_pandas() returns a numpy.ndarray, which imho is very confusing

Agreed, it also surprises me often.

> either an ordered dict of numpy.ndarray or a structured numpy.ndarray depending on a flag, for example

Converting to a struct array sounds like the reciprocal of ARROW-1886. That doesn't have to be part of a NumPy conversion function, though.

> ListArray is internally represented as two arrays: offsets and contents, and there are applications where we'd want to get a zero-copy view of these arrays

You can use Array.buffers() to get zero-copy views of those buffers and call np.frombuffer on each of them.

@asfimport

Jim Pivarski / @jpivarski:
Array.buffers() must be a new feature, after 0.8.0. I'll look for it in the next release. Thanks!

@asfimport

Todd Farmer / @toddfarmer:
Transitioning issue from Resolved to Closed based on resolution field value.
