[Python] List<Extension> arrays aren't supported in to_pandas calls #32791

Open
asfimport opened this issue Aug 26, 2022 · 5 comments

Comments


asfimport commented Aug 26, 2022

EXTENSION is not in the list of allowed types. I think in order to enable EXTENSION we need to be able to call to_pylist or similar on the original extension array from C++ code, in case there were user-provided overrides. Off the top of my head, one way of doing this would be to pass through an additional std::unordered_map<Array*, PyObject*>, where the PyObject is the bound to_pylist Python function. Are there other alternatives that might be cleaner?

Reporter: Micah Kornfield / @emkornfield

Note: This issue was originally created as ARROW-17535. Please see the migration documentation for further details.


Chang She / @changhiskhan:
What I was thinking is the following possibilities:

  1. If the ExtensionType is associated with an ExtensionArray subtype that overrides “to_pandas”, we could do the to_pandas call on the list values array and then use the offsets to create the proper pandas array

  2. If the ExtensionType is associated with an ExtensionScalar, then you can call to_pylist on the values array and then use the offsets to construct the pandas array

For computer vision data this is actually fairly important as very often we have a list-of-labels or list-of-Box2d per row (image)
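
Both possibilities above share one step: convert the flat values once, then use the list offsets to cut the result back into one entry per row. A minimal pure-Python sketch of that step follows. The helper name `rows_from_offsets`, its parameters, and the sample data are hypothetical and only illustrate the idea; this is not pyarrow's actual implementation.

```python
from typing import Any, List, Optional, Sequence


def rows_from_offsets(
    converted_values: Sequence[Any],
    offsets: Sequence[int],
    validity: Optional[Sequence[bool]] = None,
) -> List[Optional[List[Any]]]:
    """Rebuild per-row lists from a flat, already-converted values sequence.

    `offsets` has one more entry than there are rows; row i spans
    converted_values[offsets[i]:offsets[i + 1]].
    """
    n_rows = len(offsets) - 1
    rows: List[Optional[List[Any]]] = []
    for i in range(n_rows):
        if validity is not None and not validity[i]:
            rows.append(None)  # null list row
        else:
            rows.append(list(converted_values[offsets[i]:offsets[i + 1]]))
    return rows


# Hypothetical data: three images with 2, 0 and 1 converted boxes.
boxes = ["box_a", "box_b", "box_c"]
offsets = [0, 2, 2, 3]
result = rows_from_offsets(boxes, offsets)
print(result)
```

For a real pyarrow ListArray, the flat values and the offsets would presumably come from its `values` and `offsets` attributes, with the values converted through whatever conversion the extension type defines.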


Joris Van den Bossche / @jorisvandenbossche:
There is currently an open PR (#14238) that partly addresses this by just using the storage array conversion. At the moment this doesn't take into account that the ExtensionType might define a custom conversion to numpy and/or pandas in Python. My question: are we OK with falling back to the storage array conversion for now?

On the one hand, that would be consistent with StructArray, where we also fall back to the storage array at the moment. On the other hand, if we want to solve this more "properly" later, that would mean another change in behaviour.


Joris Van den Bossche / @jorisvandenbossche:
On the actual issue:

I think in order to enable EXTENSION we need to be able to call to_pylist or similar on the original extension array from C++ code, in case there were user provided overrides

For other list arrays, we actually do not convert to lists but to numpy arrays:

In [3]: pa.array([[1, 2], [3, 4, 5]]).to_numpy(zero_copy_only=False)
Out[3]: array([array([1, 2]), array([3, 4, 5])], dtype=object)

In [4]: pa.array([[1, 2], [3, 4, 5]]).to_pandas().values
Out[4]: array([array([1, 2]), array([3, 4, 5])], dtype=object)

So it could also be an option to keep using arrays, instead of using lists in case of ExtensionType. And then, if we can somehow convert the list array's values to a single array (calling into the Python to_numpy or to_pandas, since that can be overridden?), then we could continue slicing this into pieces and put that in the resulting array, as we do now (I think this is basically the first possibility that @changhiskhan mentions?)

  1. If the ExtensionType is associated with an ExtensionScalar, then you can call to_pylist on the values array and then use the offsets to construct the pandas array

That actually brings up a question: if an ExtensionType defines an ExtensionScalar (but not an associated pandas dtype or a custom to_numpy conversion), should we use this scalar's as_py() for the to_numpy/to_pandas conversion as well for plain extension arrays (not the nested case)?

Because currently, if you have an ExtensionArray like that (for example using the example from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion), we still use the storage type conversion for to_numpy/to_pandas, and only use the scalar's conversion in to_pylist.


Joris Van den Bossche / @jorisvandenbossche:
Note: the actual conversion to_pandas is now "working" after ARROW-17813 (#14238), by falling back to the storage array (the same for to_numpy) as mentioned above (https://issues.apache.org/jira/browse/ARROW-17535?focusedCommentId=17612532&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17612532).
For to_pandas, it would be good if we can still improve this using the conversion defined by the ExtensionType, as discussed above.


Micah Kornfield / @emkornfield:
Yeah, so I agree with the conclusion that scalar conversion should not be used for now, as it isn't used today except in to_pylist. I think even using the to_pandas call might be tricky, but if it can work it would be a good idea to pursue, as the approach I outlined above could get complicated.
