[Python] Nested ExtensionArray conversion to/from pandas/numpy #33036

Closed
asfimport opened this issue Sep 22, 2022 · 6 comments

asfimport commented Sep 22, 2022

user@ thread: https://lists.apache.org/thread/dhnxq0g4kgdysjowftfv3z5ngj780xpb
repro gist: https://gist.github.com/changhiskhan/4163f8cec675a2418a69ec9168d5fdd9

Arrow => numpy/pandas

For a non-nested array, pa.ExtensionArray.to_numpy automatically "lowers" to the storage type (as expected). However, this is not done for nested arrays:

import pyarrow as pa

class LabelType(pa.ExtensionType):

    def __init__(self):
        super(LabelType, self).__init__(pa.string(), "label")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return LabelType()
    
storage = pa.array(["dog", "cat", "horse"])
ext_arr = pa.ExtensionArray.from_storage(LabelType(), storage)
offsets = pa.array([0, 1])
list_arr = pa.ListArray.from_arrays(offsets, ext_arr)
list_arr.to_numpy()
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In [15], line 1
----> 1 list_arr.to_numpy()

File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:1445, in pyarrow.lib.Array.to_numpy()

File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()

ArrowNotImplementedError: Not implemented type for Arrow list to pandas: extension<label<LabelType>>

As mentioned on the user thread linked from the top, a fairly generic solution would just have the conversion default to the storage array's to_numpy.

 
pandas/numpy => Arrow

Similarly, conversion to Arrow is difficult for nested extension types.

If I have, say, a pandas DataFrame with a list-of-string column and I want to convert that column to a list-of-label Array, I currently have to:

  1. Convert the list-of-string (storage) column to an Arrow array of type pa.list_(pa.string())
  2. Convert the string values array to an ExtensionArray, then reconstitute a list array from that ExtensionArray and the offsets from the result of step 1

import pyarrow as pa
import pandas as pd
df = pd.DataFrame({'labels': [["dog", "horse", "cat"], ["person", "person", "car", "car"]]})
list_of_storage = pa.array(df.labels)
ext_values = pa.ExtensionArray.from_storage(LabelType(), list_of_storage.values)
list_of_ext = pa.ListArray.from_arrays(offsets=list_of_storage.offsets, values=ext_values)

For non-nested columns, one can achieve easier conversion by defining a pandas extension dtype, but I don't think that works for a nested column. You would instead have to fall back to something like pa.ExtensionArray.from_storage (or from_pandas?) to do the trick. Even that doesn't necessarily work for something like a dictionary column, because you'd have to pass in the dictionary somehow. Off the cuff, one could provide a custom lambda to pa.Table.from_pandas that is used for either specified column names / data types?

Thanks in advance for the consideration!

Reporter: Chang She / @changhiskhan
Assignee: Miles Granger / @milesgranger

Note: This issue was originally created as ARROW-17813. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
Arrow => numpy/pandas

For numpy, we can indeed fall back to converting the storage array. That's also what ExtensionArray.to_numpy() does at the moment. However, that is currently implemented in Python (def to_numpy(self, **kwargs)), while the ListArray conversion is done in C++. So we would need to move the logic of using the storage type into the pyarrow C++ code (which should be doable, I think).

For conversion to pandas, for plain ExtensionArrays, this is controlled by whether there is an equivalent pandas extension type to convert to. So the question is whether this should be done for ExtensionArrays within a nested type as well. That would get a bit more complicated, as then we need to call back into python from C++ (this is basically covered by ARROW-17535)

pandas/numpy => Arrow

One way this will be a bit easier is to cast to the final type, something like: list_of_storage.cast(pa.list_(LabelType())).
This is not yet possible, but related work is in progress: ARROW-14500 covers casting a storage type to an extension type, and ARROW-15545 is a different issue related to casting of extension types that might also solve the former and has an open PR: #14106. We should verify whether that PR also enables this cast.

For non-nested columns, one can achieve easier conversion by defining a pandas extension dtype, but i don't think that works for a nested column.

Indeed, that won't work without specifying a separate extension type for this nested type (until pandas supports nested types properly)

Off the cuff, one could provide a custom lambda to pa.Table.from_pandas that is used for either specified column names / data types?

That could be one option. But maybe we should start with enabling basic conversion (through the storage type) for extension types in the array conversion, which currently fails:

# this could be the equivalent of `pa.ExtensionArray.from_storage(LabelType(), pa.array(["dog", "cat", "horse"]))` ?
>>> pa.array(["dog", "cat", "horse"], type=LabelType())
ArrowNotImplementedError: extension

I opened ARROW-17834 for this.

If the above works, I think it should also work to specify a schema with the extension type in the Table.from_pandas conversion.
(we could still make it easier to allow to specify the type for one specific column, instead of having to specify the full schema)


Chang She / @changhiskhan:
@jorisvandenbossche thank you for the details above!

ExtensionArray => pandas

Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism. It's a lot simpler to define an ExtensionScalar with as_py than a pandas extension dtype. So if an ExtensionArray doesn't have an equivalent pandas dtype, would it make sense to convert it to just an object series whose elements are the result of as_py? I added it as a comment to ARROW-17353 for further discussion as well if it makes sense.

pandas/numpy => Arrow

One way this will be a bit easier is to cast to the final type, something like: list_of_storage.cast(pa.list_(LabelType())).

Yeah, that would certainly make it a lot more convenient! I don't see any tests relating to nested types in #14106 but hopefully it's not much additional effort on top of what's already there?

# this could be the equivalent of `pa.ExtensionArray.from_storage(LabelType(), pa.array(["dog", "cat", "horse"]))` ?
>>> pa.array(["dog", "cat", "horse"], type=LabelType())
ArrowNotImplementedError: extension

I opened ARROW-17834 for this.

Agreed. Thanks for opening the JIRA. One additional tricky thing here is what to do if the storage array also needs additional arguments. E.g., in CV, most canonical datasets have a predetermined dictionary, so for the above example you'd often want to read in a CSV data dictionary and pass in the class names in the right order to construct the storage DictionaryArray (cross-posted on ARROW-17834).

If the above works, I think it should also work to specify a schema with the extension type in the Table.from_pandas conversion.
(we could still make it easier to allow to specify the type for one specific column, instead of having to specify the full schema)

yeah that would be amazing. I'd love to toss away my custom type conversion code that's hard to maintain (and not to mention slow) :)


Joris Van den Bossche / @jorisvandenbossche:

ExtensionArray => pandas

Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism

I was just wondering the same in ARROW-17535, forgetting you brought that up here as well. I opened a dedicated JIRA for this part: ARROW-17925


Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 14238
#14238


Joris Van den Bossche / @jorisvandenbossche:
@changhiskhan this issue is marked as resolved now that #14238 is merged. Not all the issues you raised have been fixed, but I think the remaining ones are covered by other JIRAs (it would be good if you could verify this).
For example, after #14238 we still fall back on the basic storage type conversion in the case of a list, but we have ARROW-17535 to see if we can improve this further by actually using the proper to_pandas conversion.


Chang She / @changhiskhan:
@jorisvandenbossche this is great. I will try it out on master. thanks!

I agree with your comments in ARROW-17535 that the design can be further improved to take into account overridden to_numpy/to_pandas,
but just having this fallback would open up a lot more paths
