[Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas? #33134

asfimport opened this issue Oct 4, 2022 · 5 comments

asfimport commented Oct 4, 2022

This was raised in ARROW-17813 by @changhiskhan:

ExtensionArray => pandas

Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism. It's a lot simpler to define an ExtensionScalar with as_py than a pandas extension dtype. So if an ExtensionArray doesn't have an equivalent pandas dtype, would it make sense to convert it to just an object series whose elements are the result of as_py?

and I also mentioned this in ARROW-17535:

That actually brings up a question: if an ExtensionType defines an ExtensionScalar (but not an associated pandas dtype or a custom to_numpy conversion), should we use this scalar's as_py() for the to_numpy/to_pandas conversion of plain extension arrays as well? (not the nested case)

Because currently, if you have an ExtensionArray like that (for example using the example from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion), we still use the storage type conversion for to_numpy/to_pandas, and only use the scalar's conversion in to_pylist.

Reporter: Joris Van den Bossche / @jorisvandenbossche
Watchers: Rok Mihevc / @rok

Related issues:

Note: This issue was originally created as ARROW-17925. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
To give a concrete copy-pastable example (using the one from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion):

from collections import namedtuple
import pyarrow as pa

Point3D = namedtuple("Point3D", ["x", "y", "z"])

class Point3DScalar(pa.ExtensionScalar):
    def as_py(self) -> Point3D:
        return Point3D(*self.value.as_py())

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar

storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
arr = pa.ExtensionArray.from_storage(Point3DType(), storage)

>>> arr.to_pandas().values
array([array([1., 2., 3.], dtype=float32),
       array([4., 5., 6.], dtype=float32)], dtype=object)

>>> arr.to_pylist()
[Point3D(x=1.0, y=2.0, z=3.0), Point3D(x=4.0, y=5.0, z=6.0)]

So here, to_pylist gives the nice scalars, while in to_pandas(), we have the raw numpy arrays from converting the storage list array.

We could do this automatically in to_pandas as well if we detect that the ExtensionType raises NotImplementedError for to_pandas_dtype and returns a subclass from __arrow_ext_scalar_class__.
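
To make that rule concrete, a rough standalone sketch of the decision logic could look like the following (this is not how pyarrow is implemented; the helper name fallback_to_pandas and the exact checks are assumptions for illustration):

import pandas as pd
import pyarrow as pa

def fallback_to_pandas(arr: pa.ExtensionArray) -> pd.Series:
    """Sketch of the proposed fallback order, not actual pyarrow internals."""
    typ = arr.type
    try:
        # 1. The extension type maps to a pandas dtype: keep the existing path.
        typ.to_pandas_dtype()
        return arr.to_pandas()
    except NotImplementedError:
        pass
    # 2. A custom scalar class is defined: build an object-dtype Series from
    #    the scalars' as_py() results instead of converting the storage array.
    scalar_cls = getattr(typ, "__arrow_ext_scalar_class__", None)
    if scalar_cls is not None and scalar_cls() is not pa.ExtensionScalar:
        return pd.Series(arr.to_pylist(), dtype=object)
    # 3. Otherwise, fall back to the storage conversion as today.
    return arr.storage.to_pandas()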

On the other hand, you can also do this yourself by overriding to_pandas()?
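
For reference, a minimal sketch of that manual route, reusing the Point3DScalar and storage definitions from the snippet above (the Point3DArray class and its to_pandas override are my own illustration; __arrow_ext_class__ is the hook for attaching a custom array class, and the override only kicks in when to_pandas() is called on the array directly, not when converting a table that contains it; the output at the end is roughly what the as_py() route should give):

import pandas as pd

class Point3DArray(pa.ExtensionArray):
    def to_pandas(self, **kwargs):
        # Bypass the storage conversion and go through the scalars' as_py().
        return pd.Series(self.to_pylist(), dtype=object)

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar

    def __arrow_ext_class__(self):
        return Point3DArray

arr = pa.ExtensionArray.from_storage(Point3DType(), storage)

>>> arr.to_pandas()
0    Point3D(x=1.0, y=2.0, z=3.0)
1    Point3D(x=4.0, y=5.0, z=6.0)
dtype: object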

And what about to_numpy()?

Rok Mihevc / @rok:
As a user I would like to have an opt-in 'no thinking' route and an obvious way to override if needed.

Joris Van den Bossche / @jorisvandenbossche:
@rok but what is your preferred no-thinking route? Is that to use Scalar.as_py() if you define that (and then convert to object dtype Series in pandas?), or to use the storage array conversion?

Rok Mihevc / @rok:
I suppose as_py as the overridable "thinking" route and storage array conversion as "no-thinking" (although that is not explicitly opt-in).

Chang She / @changhiskhan:
My head hurts trying to keep it all straight. So we have:

  • 3 "targets" for conversion (pylist, numpy, pandas)

  • At least 6 different knobs that can be turned:
    => 4 different overridable mechanisms (as_py, to_pylist, to_numpy, to_pandas)
    => Storage fallback
    => pandas ExtensionDtype <-> pa.ExtensionType

  • Some of these are defined/performed in C++ and others in Python

It's hard to think how to give devs clear guidance on the order of things.
