[Python] Public API to consume objects supporting the PyCapsule Arrow C Data Interface #38010

Closed
1 task done
Tracked by #39195
jorisvandenbossche opened this issue Oct 4, 2023 · 3 comments


jorisvandenbossche commented Oct 4, 2023

#37797 is adding official dunder methods to expose the Arrow C Data/Stream Interface in Python using PyCapsules (#34031 / #35531).

In addition to official dunders to expose this to other libraries, we also need public APIs in pyarrow to import / consume such PyCapsules (or rather the objects implementing the dunders to give you the PyCapsule).
#37797 already added this to the pa.array(..), pa.record_batch(..) and pa.schema(..) constructors, such that you can for example create a pyarrow array with pa.array(obj) given any object obj that supports the interface by defining __arrow_c_array__.

But that's not fully complete: we also need a way to construct a RecordBatchReader, for which no such factory function is currently available. For this, we could add a from_ function (similar to the existing from_batches), like RecordBatchReader.from_stream?

(In addition, there are also the Table, Field and DataType constructors; those all have factory functions that could support this, similar to pa.array(..) et al.)


Secondly, I am also wondering whether we want to provide APIs that accept PyCapsules directly, instead of an object that implements the dunders. For example, suppose you are a library that has data in Arrow-compatible memory and you want to convert it to pyarrow through the C Data Interface. If your library doesn't expose a Python class representing that data, you might want to pass a PyCapsule directly, to avoid having to create a small wrapper class with just the dunder to pass to the pyarrow constructor (although that is of course not difficult).


kylebarron commented Mar 20, 2024

I also just hit an instance where having the pa.field constructor consume these objects would be helpful.

In particular, I was trying to read an Arrow array with GeoArrow extension metadata while manually persisting the field metadata:

schema_capsule, array_capsule = data.__arrow_c_array__()

class SchemaHolder:
    # Minimal wrapper exposing a schema PyCapsule through the dunder.
    schema_capsule: object

    def __init__(self, schema_capsule) -> None:
        self.schema_capsule = schema_capsule

    def __arrow_c_schema__(self):
        return self.schema_capsule

class ArrayHolder:
    # Minimal wrapper exposing schema and array PyCapsules through the dunder.
    schema_capsule: object
    array_capsule: object

    def __init__(self, schema_capsule, array_capsule) -> None:
        self.schema_capsule = schema_capsule
        self.array_capsule = array_capsule

    def __arrow_c_array__(self, requested_schema=None):
        return self.schema_capsule, self.array_capsule

# Here the pa.field constructor doesn't accept PyCapsule protocol objects
field = pa.field(SchemaHolder(schema_capsule))
# Capsules are single-use, so re-export a fresh schema capsule from the field
array = pa.array(ArrayHolder(field.__arrow_c_schema__(), array_capsule))
schema = pa.schema([field.with_name("geometry")])
table = pa.Table.from_arrays([array], schema=schema)

Aside from this, the only way to maintain extension metadata is to ensure that the extension types are registered with pyarrow, which is harder to control because of its global scope.

@jorisvandenbossche
Member Author

Yes, we should provide a public way to create a Field object as well (and from there you can also get a DataType).

(short term, I would say it is safe to use pa.Field._import_from_c_capsule, if you check that the method is available)

I suppose adding this to pa.field(..) would be the easiest, although signature-wise it's not a great addition either, given that right now this constructor always takes both a name and a type as required arguments.

kylebarron added a commit to developmentseed/lonboard that referenced this issue Mar 25, 2024
Closes #425, blocked on apache/arrow#38010 (comment).
The main issue is that we need a reliable way to maintain the geoarrow
extension metadata through FFI. The easiest way would be if `pa.field()`
were able to support `__arrow_c_schema__` input. Or alternatively, one
option is to have a context manager of sorts to register global pyarrow
geoarrow extension arrays, and then deregister them after use.
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Mar 27, 2024
@raulcd raulcd modified the milestones: 16.0.0, 17.0.0 Apr 8, 2024
jorisvandenbossche added a commit that referenced this issue Apr 15, 2024
…rrow PyCapsule Protocol (#40818)

### Rationale for this change

See #38010 (comment) for more context. Right now for _consuming_ ArrowSchema-compatible objects that implement the PyCapsule interface, we only have the private `_import_from_c_capsule` (on Schema, Field, DataType) and we check for the protocol in the public `pa.schema(..)`.

But that means you currently can only consume objects that represent the schema of a batch (struct type), and not schemas of individual arrays. 

### What changes are included in this PR?

Expand the `pa.field(..)` constructor to accept objects implementing the protocol method.

### Are these changes tested?

TODO

* GitHub Issue: #38010

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche modified the milestones: 17.0.0, 16.0.0 Apr 15, 2024
@jorisvandenbossche
Member Author

Issue resolved by pull request #40818.

raulcd pushed a commit that referenced this issue Apr 15, 2024
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024