Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33801: [Python] Generate on the fly Python Extension Types wrapping C++ Extension Types #34483

Closed

Conversation

sjperkins
Copy link
Contributor

@sjperkins sjperkins commented Mar 7, 2023

Alternative implementation to #34469

Rationale for this change

Explained in greater detail in #33997, but my summary follows:

  • C++ Extension Types currently are not well-exposed in the Python Layer
  • In particular, generic ExtensionArray and ExtensionScalar types are associated with C++ Extensions exposed through BaseExtensionType, rather than hand-coded Python types (which would be laborious to generate).

What changes are included in this PR?

  • When get_cpp_extension_type is called a Python wrapper CppExtensionType is created and cached, along with Python ExtensionArrays and ExtensionScalars
  • CppExtensionType.__arrow_ext_serialize__() and CppExtensionType.__arrow_ext_deserialize__() defer to CExtensionType.Serialize() and CExtensionType.Deserialize() respectively.

Are these changes tested?

  • __arrow_ext_class__() and __arrow_ext_scalar_class__() are tested
  • __arrow_ext_serialize__() and __arrow_ext_deserialize__() are not tested yet. I'd appreciate feedback on whether the general approach is useful going forward before adding tests.

Are there any user-facing changes?

  • I don't believe so.

@sjperkins sjperkins requested a review from AlenkaF as a code owner March 7, 2023 11:01
@github-actions
Copy link

github-actions bot commented Mar 7, 2023

@amol-
Copy link
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
@sjperkins
Copy link
Contributor Author

Would it be possible to re-open this issue? I'm awaiting review from Arrow maintainers here.

@amol-
Copy link
Member

amol- commented Mar 30, 2023

Yes of course, the cleaning was done exactly to surface the prs that still needed attention

@amol- amol- reopened this Mar 30, 2023
@jorisvandenbossche
Copy link
Member

@sjperkins sorry about closing all your PRs, and about the slow review here. I will try to get back to this one and #34469 shortly.

Other question: you couldn't reopen this PR yourself? I mean, did the GitHub UI allow that? Because if only committers can reopen PRs, that would be useful to know because the messaging should be different then if we would do some auto-closing in the future.

@sjperkins
Copy link
Contributor Author

@sjperkins sorry about closing all your PRs, and about the slow review here. I will try to get back to this one and #34469 shortly.

No problem! I assumed that the upcoming Arrow 12.0.0 release was important.

Other question: you couldn't reopen this PR yourself? I mean, did the GitHub UI allow that? Because if only committers can reopen PRs, that would be useful to know because the messaging should be different then if we would do some auto-closing in the future.

Yes, the UI didn't seem to give me the option to reopen the PR.

@sjperkins
Copy link
Contributor Author

Yes, the UI didn't seem to give me the option to reopen the PR.

But I do have the ability to close the PR..

@sjperkins sjperkins closed this Mar 31, 2023
@sjperkins sjperkins reopened this Mar 31, 2023
@sjperkins
Copy link
Contributor Author

Yes, the UI didn't seem to give me the option to reopen the PR.

OK, this does seem possible if I close the PR. Apologies I responded late yesterday evening, so I may have conflated the force push in #33805 or missed the button in this issue.

@sjperkins
Copy link
Contributor Author

sjperkins commented Mar 31, 2023

@jorisvandenbossche Regarding this issue and #33997, they're somewhat exploratory because they dynamically generate Python types.

I'm interested in whether these strategies would be a useful way of exposing pure C++ Extension Types in Python.

For instance, {from,to}_numpy() and {from,to}_pandas could then be added to these types. Without too much thought, this is now possible, but a better mechanism would be desirable:

def to_numpy(self):
   ....

array_cls = uuid_type.__arrow_ext_class__()
array_cls.to_numpy = to_numpy

However, perhaps this should be done through the existing PyExtensionType mechanisms (both Python and C++)?

@sjperkins
Copy link
Contributor Author

For instance, {from,to}_numpy() and {from,to}_pandas could then be added to these types.

Some further context to motivate for this: It would be useful to efficiently convert nested FixedSizeListArray's into numpy arrays. The default produces deeply nested numpy arrays with object dtype. See ratt-ru/arcae#2 for example.

This may also be relevant to #8510 which, from imperfect memory, uses FixedSizeListArray's to represent Tensors.

@jorisvandenbossche
Copy link
Member

Can you reopen it now? (it might depend on whether you closed it yourself)

@sjperkins
Copy link
Contributor Author

Can you reopen it now? (it might depend on whether you closed it yourself)

Ah I cannot. I think it does depend on the closer.

@jorisvandenbossche
Copy link
Member

OK, that's good to know, thanks for checking!

@jorisvandenbossche
Copy link
Member

I'm interested in whether these strategies would be a useful way of exposing pure C++ Extension Types in Python.

I am not a huge fan of the idea of creating classes on the fly .. Also, does this give something more useful than wrapping it just in a base class as is done now? (because right now this generated class also doesn't have any extension-type specific logic?)

Some further context to motivate for this: It would be useful to efficiently convert nested FixedSizeListArray's into numpy arrays.

Yeah, so if we have a way to register a python class to use as the type class for an extension type implemented in C++ (#33997), you can override the to_numpy() method. However, than only works if you call this method directly on the array object. But if you convert a table, or a chunked array, it will still go through the C++ layer which currently falls back on converting the storage array. So we might need to think about a more general mechanism here to tap into this conversion logic.

This may also be relevant to #8510 which, from imperfect memory, uses FixedSizeListArray's to represent Tensors.

Yes, that one is close to being merged, and then we can expose this in python, which might be a useful exercise to see how this goes / what we can learn from this (but of course it's an internal one, so we can hard code support for it)

@sjperkins
Copy link
Contributor Author

I am not a huge fan of the idea of creating classes on the fly .. Also, does this give something more useful than wrapping it just in a base class as is done now? (because right now this generated class also doesn't have any extension-type specific logic?)

Yes, I understand the suggestion is a bit exotic. I think what it gives is the ability to associate specific Python classes with C++ Extension Types/Arrays and attach (mainly) {from,to}_numpy functionality to the Python class. There probably are nicer ways to do it, but nothing comes immediately to mind. Perhaps there's a way to do this with the existing C++ PyExtensionType machinery.

Yeah, so if we have a way to register a python class to use as the type class for an extension type implemented in C++ (#33997), you can override the to_numpy() method. However, than only works if you call this method directly on the array object. But if you convert a table, or a chunked array, it will still go through the C++ layer which currently falls back on converting the storage array. So we might need to think about a more general mechanism here to tap into this conversion logic.

Thanks for pointing this out, this gives me a broader perspective on what would be useful.

This may also be relevant to #8510 which, from imperfect memory, uses FixedSizeListArray's to represent Tensors.

Yes, that one is close to being merged, and then we can expose this in python, which might be a useful exercise to see how this goes / what we can learn from this (but of course it's an internal one, so we can hard code support for it)

I'll think about this a bit more. let me know if there're approaches you think would be useful to experiment with.

@sjperkins
Copy link
Contributor Author

Closing because the use of dynamic types is a bit too exotic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++
3 participants