Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++ #33997

Open
jorisvandenbossche opened this issue Feb 2, 2023 · 5 comments

Comments

@jorisvandenbossche
Copy link
Member

When wrapping a type (or array) in a pyarrow object, we need to define which Python class to use. Currently, for extension types, this logic lives here in pyarrow_wrap_data_type:

elif type.get().id() == _Type_EXTENSION:
ext_type = <const CExtensionType*> type.get()
cpy_ext_type = dynamic_cast[_CPyExtensionTypePtr](ext_type)
if cpy_ext_type != nullptr:
return cpy_ext_type.GetInstance()
else:
out = BaseExtensionType.__new__(BaseExtensionType)

So there are currently two options:

  • The ExtensionType is implemented in Python, by subclassing pyarrow.(Py)ExtensionType, and which links to the C++ arrow::py::PyExtensionType (a subclass of arrow::ExtensionType). In this case, we store the python type instance on the C++ instance, and return this as python object in pyarrow_wrap_data_type.
  • The ExtensionType is implemented in C++, and then we currently always fall back to wrap this in the pyarrow.BaseExtenstionType base class (there is currently a bug in this, but that is getting fixed in GH-33802).

However, that means that for such extension types implemented in C++, there is currently no way to have a "richer" python Type object (or Array object, since that is determined by the Type, and for a BaseExtensionType, that will always use the base ExtensionArray). While for an extension type, you might want to add type-specific attributes or methods.

For canonical extension types that are implemented in Arrow C++ itself (for example, the currently discussed Tensor extension type in #8510, or a previous effort to add complex type as extension type in #10565), I think it will work today to create a custom subclass of pyarrow.BaseExtensionType for the specific canonical type, and then we could add a special case to pyarrow_wrap_data_type checking the name of the extension type, and if it is a canonical one we implement ourselves, use the python subclass we implemented ourselves.

But for extension types that are implemented in C++ externally (or for extension types that are implemented in Arrow C++, but for which we don't provide a custom python subclass), that doesn't work.
I am wondering to what extent we want to allow "registering" a python class that should be used when wrapping a specific C++ extension type (and to what extent this would be useful for

@sjperkins
Copy link
Contributor

But for extension types that are implemented in C++ externally (or for extension types that are implemented in Arrow C++, but for which we don't provide a custom python subclass), that doesn't work.
I am wondering to what extent we want to allow "registering" a python class that should be used when wrapping a specific C++ extension type (and to what extent this would be useful for

arrow/python/pyarrow/types.pxi

Lines 1095 to 1100 in 45918a9

# register on the C++ side
check_status(
RegisterPyExtensionType(<shared_ptr[CDataType]> _type.sp_type))
# register on the python side
_python_extension_types_registry.append(_type)

Perhaps one way to do this would be to modify pyarrow.register_extension_type to check for collision with an existing C++ Extension when calling RegisterPyExtensionType. If the name and storage type match, then the associated Python type can be registered in _python_extension_types_registry.

@rok
Copy link
Member

rok commented Sep 24, 2023

I'm looking to subclass FixedShapeTensorType (which is difficult as it's a parametric extension type without __init__). @sjperkins did you resolve this in some way?

@sjperkins
Copy link
Contributor

@rok I didn't get too far with this.

@rok
Copy link
Member

rok commented Sep 26, 2023

If I'm reading the discussion (in #34483 and other threads) correctly the exotic part is automatic generation of types over explicit manual definition for canonical types only? Otherwise the proposed approach would be suitable, right?
Would that work for your usecase?

@sjperkins
Copy link
Contributor

If I'm reading the discussion (in #34483 and other threads) correctly the exotic part is automatic generation of types over explicit manual definition for canonical types only? Otherwise the proposed approach would be suitable, right? Would that work for your usecase?

Yes, I think it would.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment