[Python] Nested ExtensionArray conversion to/from pandas/numpy #33036

asfimport · 2022-09-22T02:48:50Z

user@ thread: https://lists.apache.org/thread/dhnxq0g4kgdysjowftfv3z5ngj780xpb
repro gist: https://gist.github.com/changhiskhan/4163f8cec675a2418a69ec9168d5fdd9

Arrow => numpy/pandas

For a non-nested array, pa.ExtensionArray.to_numpy automatically "lowers" to the storage type (as expected). However, this is not done for nested arrays:

import pyarrow as pa

class LabelType(pa.ExtensionType):

    def __init__(self):
        super(LabelType, self).__init__(pa.string(), "label")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return LabelType()
    
storage = pa.array(["dog", "cat", "horse"])
ext_arr = pa.ExtensionArray.from_storage(LabelType(), storage)
offsets = pa.array([0, 1])
list_arr = pa.ListArray.from_arrays(offsets, ext_arr)
list_arr.to_numpy()

---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In [15], line 1
----> 1 list_arr.to_numpy()

File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:1445, in pyarrow.lib.Array.to_numpy()

File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()

ArrowNotImplementedError: Not implemented type for Arrow list to pandas: extension<label<LabelType>>

As mentioned on the user thread linked from the top, a fairly generic solution would just have the conversion default to the storage array's to_numpy.

pandas/numpy => Arrow

Equivalently, conversion to Arrow is also difficult for nested extension types:

Say I have a pandas DataFrame with a list-of-string column that I want to convert to a list-of-label Array. Currently I have to:

1. Convert the list-of-string (storage) numpy array to pa.list_(pa.string()).
2. Convert the string values array to an ExtensionArray, then reconstitute a list array using the ExtensionArray combined with the offsets from the result of step 1.

import pyarrow as pa
import pandas as pd
df = pd.DataFrame({'labels': [["dog", "horse", "cat"], ["person", "person", "car", "car"]]})
list_of_storage = pa.array(df.labels)
ext_values = pa.ExtensionArray.from_storage(LabelType(), list_of_storage.values)
list_of_ext = pa.ListArray.from_arrays(offsets=list_of_storage.offsets, values=ext_values)

For non-nested columns, one can achieve easier conversion by defining a pandas extension dtype, but I don't think that works for a nested column. You would instead have to fall back to something like pa.ExtensionArray.from_storage (or from_pandas?) to do the trick. Even that doesn't necessarily work for something like a dictionary column, because you'd have to pass in the dictionary somehow. Off the cuff, could one provide a custom lambda to pa.Table.from_pandas that is used for specified column names or data types?

Thanks in advance for the consideration!

Reporter: Chang She / @changhiskhan
Assignee: Miles Granger / @milesgranger

Related issues:

[Python] List arrays aren't supported in to_pandas calls (relates to)
[Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas? (relates to)
[Python] Allow creating ExtensionArray through pa.array(..) constructor (relates to)

PRs and other links:

GitHub Pull Request #14238

Note: This issue was originally created as ARROW-17813. Please see the migration documentation for further details.


asfimport · 2022-09-23T16:40:08Z

Joris Van den Bossche / @jorisvandenbossche:
Arrow => numpy/pandas

For numpy, we can indeed fall back to converting the storage array. That's also what happens for ExtensionArray.to_numpy() at the moment, although this is implemented in Python right now (arrow/python/pyarrow/array.pxi, line 2795 in 356e7f8: def to_numpy(self, **kwargs):), while the ListArray conversion is done in C++. So we would need to move that logic of using the storage type into the pyarrow C++ code (which should be doable, I think).

For conversion to pandas, for plain ExtensionArrays, this is controlled by whether there is an equivalent pandas extension type to convert to. So the question is whether this should be done for ExtensionArrays within a nested type as well. That would get a bit more complicated, as then we need to call back into python from C++ (this is basically covered by ARROW-17535)

pandas/numpy => Arrow

One way this will be a bit easier is to cast to the final type, something like: list_of_storage.cast(pa.list_(LabelType())).
This is currently not yet possible, but there is some work being done on that at the moment: ARROW-14500 is about casting storage type to extension type, and ARROW-15545 is a different issue related to casting of extension types, but it might actually also solve the former, and there is an open PR for it: #14106. We should verify whether that PR also enables this cast.

For non-nested columns, one can achieve easier conversion by defining a pandas extension dtype, but i don't think that works for a nested column.

Indeed, that won't work without specifying a separate extension type for this nested type (until pandas supports nested types properly)

Off the cuff, one could provide a custom lambda to pa.Table.from_pandas that is used for either specified column names / data types?

That could be one option. But maybe we should start with enabling basic conversion (through the storage type) for extension types in the array conversion, which currently fails:

# this could be the equivalent of `pa.ExtensionArray.from_storage(LabelType(), pa.array(["dog", "cat", "horse"]))` ?
>>> pa.array(["dog", "cat", "horse"], type=LabelType())
ArrowNotImplementedError: extension

I opened ARROW-17834 for this.

If the above works, I think it should also work to specify a schema with the extension type in the Table.from_pandas conversion.
(we could still make it easier to allow to specify the type for one specific column, instead of having to specify the full schema)

asfimport · 2022-09-26T17:30:32Z

Chang She / @changhiskhan:
@jorisvandenbossche thank you for the details above!

ExtensionArray => pandas

Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism. It's a lot simpler to define an ExtensionScalar with as_py than a pandas extension dtype. So if an ExtensionArray doesn't have an equivalent pandas dtype, would it make sense to convert it to just an object series whose elements are the result of as_py? I added it as a comment to ARROW-17353 for further discussion as well if it makes sense.

pandas/numpy => Arrow

One way this will be a bit easier is to cast to the final type, something like: list_of_storage.cast(pa.list_(LabelType())).

Yeah, that would certainly make it a lot more convenient! I don't see any tests relating to nested types in #14106 but hopefully it's not much additional effort on top of what's already there?

this could be the equivalent of pa.ExtensionArray.from_storage(LabelType(), pa.array(["dog", "cat", "horse"])) ?

pa.array(["dog", "cat", "horse"], type=LabelType())
ArrowNotImplementedError: extension
I opened ARROW-17834 for this.

Agreed. Thanks for opening the JIRA. One additional tricky thing here is what to do if the storage array also needs additional arguments. E.g., in CV, most canonical datasets have a predetermined dictionary, so for the above example, you'd often want to read in a CSV data dictionary and pass in the class names in the right order to construct the storage DictionaryArray (cross-posted on ARROW-17834).

If the above works, I think it should also work to specify a schema with the extension type in the Table.from_pandas conversion.
(we could still make it easier to allow to specify the type for one specific column, instead of having to specify the full schema)

yeah that would be amazing. I'd love to toss away my custom type conversion code that's hard to maintain (and not to mention slow) :)

asfimport · 2022-10-04T10:13:02Z

Joris Van den Bossche / @jorisvandenbossche:

ExtensionArray => pandas

Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism

I was just wondering the same in ARROW-17535, forgetting you brought that up here as well. I opened a dedicated JIRA for this part: ARROW-17925

asfimport · 2022-10-12T13:27:30Z

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 14238
#14238

asfimport · 2022-10-12T13:35:23Z

Joris Van den Bossche / @jorisvandenbossche:
@changhiskhan so this issue is marked as resolved now that #14238 is merged. Not all the issues you raised have been fixed, but I think the remaining ones are covered by other JIRAs (it would be good if you could verify this).
For example, after #14238, we still fall back on basic storage type conversion in case of list, but we have ARROW-17535 to see if we can further improve this by actually using the proper to_pandas conversion.

asfimport · 2022-10-12T17:29:30Z

Chang She / @changhiskhan:
@jorisvandenbossche this is great. I will try it out on master. thanks!

I agree with your comments in ARROW-17535 that the design can be further improved to take into account overridden to_numpy/to_pandas,
but just having this fallback would open up a lot more paths

asfimport closed this as completed Oct 12, 2022

asfimport assigned milesgranger Jan 11, 2023

This was referenced Jan 11, 2023

[Python] List<Extension> arrays aren't supported in to_pandas calls #32791

Open

[Python] Allow creating ExtensionArray through pa.array(..) constructor #33054

Closed

[Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas? #33134

Open
