[Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas? #33134

asfimport opened this issue Oct 4, 2022 · 5 comments

asfimport commented Oct 4, 2022

This was raised in ARROW-17813 by @changhiskhan:

ExtensionArray => pandas

Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism. It's a lot simpler to define an ExtensionScalar with as_py than a pandas extension dtype. So if an ExtensionArray doesn't have an equivalent pandas dtype, would it make sense to convert it to just an object series whose elements are the result of as_py?

and I also mentioned this in ARROW-17535:

That actually brings up a question: if an ExtensionType defines an ExtensionScalar (but not an associated pandas dtype or a custom to_numpy conversion), should we use this scalar's as_py() for the to_numpy/to_pandas conversion of plain extension arrays as well? (not the nested case)

Because currently, if you have an ExtensionArray like that (for example using the example from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion), we still use the storage type conversion for to_numpy/to_pandas, and only use the scalar's conversion in to_pylist.

Reporter: Joris Van den Bossche / @jorisvandenbossche
Watchers: Rok Mihevc / @rok

Related issues:

Note: This issue was originally created as ARROW-17925. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
To give a concrete copy-pastable example (using the one from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion):

from collections import namedtuple
import pyarrow as pa

Point3D = namedtuple("Point3D", ["x", "y", "z"])

class Point3DScalar(pa.ExtensionScalar):
    def as_py(self) -> Point3D:
        return Point3D(*self.value.as_py())

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar

storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
arr = pa.ExtensionArray.from_storage(Point3DType(), storage)

>>> arr.to_pandas().values
array([array([1., 2., 3.], dtype=float32),
       array([4., 5., 6.], dtype=float32)], dtype=object)

>>> arr.to_pylist()
[Point3D(x=1.0, y=2.0, z=3.0), Point3D(x=4.0, y=5.0, z=6.0)]

So here, to_pylist gives the nice scalars, while in to_pandas(), we have the raw numpy arrays from converting the storage list array.

We could do this automatically in to_pandas as well if we detect that the ExtensionType raises NotImplementedError for to_pandas_dtype and returns a subclass from __arrow_ext_scalar_class__.
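
To make that rule concrete, a rough standalone sketch of the decision logic could look like the following (this is not how pyarrow is implemented; the helper name fallback_to_pandas and the exact checks are assumptions for illustration):

import pandas as pd
import pyarrow as pa

def fallback_to_pandas(arr: pa.ExtensionArray) -> pd.Series:
    """Sketch of the proposed fallback order, not actual pyarrow internals."""
    typ = arr.type
    try:
        # 1. The extension type maps to a pandas dtype: keep the existing path.
        typ.to_pandas_dtype()
        return arr.to_pandas()
    except NotImplementedError:
        pass
    # 2. A custom scalar class is defined: build an object-dtype Series from
    #    the scalars' as_py() results instead of converting the storage array.
    scalar_cls = getattr(typ, "__arrow_ext_scalar_class__", None)
    if scalar_cls is not None and scalar_cls() is not pa.ExtensionScalar:
        return pd.Series(arr.to_pylist(), dtype=object)
    # 3. Otherwise, fall back to the storage conversion as today.
    return arr.storage.to_pandas()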

On the other hand, you can also do this yourself by overriding to_pandas()?
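
For reference, a minimal sketch of that manual route, reusing the Point3DScalar and storage definitions from the snippet above (the Point3DArray class and its to_pandas override are my own illustration; __arrow_ext_class__ is the hook for attaching a custom array class, and the override only kicks in when to_pandas() is called on the array directly, not when converting a table that contains it; the output at the end is roughly what the as_py() route should give):

import pandas as pd

class Point3DArray(pa.ExtensionArray):
    def to_pandas(self, **kwargs):
        # Bypass the storage conversion and go through the scalars' as_py().
        return pd.Series(self.to_pylist(), dtype=object)

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar

    def __arrow_ext_class__(self):
        return Point3DArray

arr = pa.ExtensionArray.from_storage(Point3DType(), storage)

>>> arr.to_pandas()
0    Point3D(x=1.0, y=2.0, z=3.0)
1    Point3D(x=4.0, y=5.0, z=6.0)
dtype: object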

And what about to_numpy()?

Rok Mihevc / @rok:
As a user I would like to have an opt-in 'no thinking' route and an obvious way to override if needed.

Joris Van den Bossche / @jorisvandenbossche:
@rok but what is your preferred no-thinking route? Is that to use Scalar.as_py() if you define that (and then convert to object dtype Series in pandas?), or to use the storage array conversion?

Rok Mihevc / @rok:
I suppose as_py as the overridable "thinking" route and storage array conversion as "no-thinking" (although that is not explicitly opt-in).

Chang She / @changhiskhan:
My head hurts trying to keep it all straight. So we have:

  • 3 "targets" for conversion (pylist, numpy, pandas)

  • At least 6 different knobs that can be turned:
    => 4 different overridable mechanisms (as_py, to_pylist, to_numpy, to_pandas)
    => Storage fallback
    => pandas ExtensionDtype <-> pa.ExtensionType

  • Some of these are defined/performed in C++ and others in Python

It's hard to think how to give devs clear guidance on the order of things.
