Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Use PyCapsule for communicating C Data Interface pointers at the Python level #34031

Closed
jorisvandenbossche opened this issue Feb 3, 2023 · 10 comments · Fixed by #37797

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 3, 2023

Describe the enhancement requested

Currently we have the various _export_to_c / _import_from_c methods for working with the Arrow C Interface that expect integers as arguments for the struct pointers. We could also use PyCapsule objects for this instead of integers (or (certainly initially) in addition to), inspired by a similar interface from DLPack.

DLPack provides a stable in-memory data structure that allows exchanging array data between frameworks. It essentially plays the same role for ndarrays (tensors) as what the Arrow C interface does for arrow-compatible data (columnar data). It also defines a stable C ABI with a similar C struct definitions (header file).

In the DLPack project, apart from the stable C ABI struct, they also defined a python specification (including a method name to access the protocol, i.e. __dlpack__), see https://dmlc.github.io/dlpack/latest/python_spec.html#implementation for the details. And for that specification, they return not a raw pointer, but use a PyCapsule object (a python object that represents an "opaque value", such as a pointer, and that can used by C extensions to pass such values through Python code to other C code, https://docs.python.org/3/c-api/capsule.html).

Some details based on their implementation:

  • The PyCapsule has a name, and the producer should set that to a well-defined value (giving some protection to not getting a random pointer)
  • The consumer renames the capsule (which gives some protection to the C Data pointer being consumed more than once)
  • The PyCapsule has a destructor defined that would call the release callback (in case the object never got consumed)

The proposal would be to mimic what DLPack does in places where we now expect or return a integer pointer (the interface needs to be different as the current _export_to_c, as we would now return a capsule, instead of having the return pointer as a parameter of the method).

Component(s)

Python

@lidavidm
Copy link
Member

lidavidm commented Feb 3, 2023

It would also be nice to do this in the ADBC libraries (which currently use a handwritten Python wrapper for this purpose)

@jorisvandenbossche
Copy link
Member Author

Yes, certainly, if we decide to go this way, we should support it in the places we control that interacts with the C Data Interface like ADBC (we should also check with other produces/consumers if this is possible to use)

@pitrou
Copy link
Member

pitrou commented May 10, 2023

  • The consumer renames the capsule (which gives some protection to the C Data pointer being consumed more than once)

This doesn't seem necessary in our case, as the consumer would typically move the C struct contents.

@jorisvandenbossche
Copy link
Member Author

  • The consumer renames the capsule (which gives some protection to the C Data pointer being consumed more than once)

This doesn't seem necessary in our case, as the consumer would typically move the C struct contents.

But can a consumer move the C struct content multiple times (and potentially calling the release callback multiple times?)

The other reason for renaming in the case of DLPack is that it checks for this in the capsule deleter, to only call the release callback of the struct in case it was not consumed/renamed (for example cupy's implementation). But that's in the end the same question I assume, because it also depends on whether it is OK to call the release callback multiple times.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented May 11, 2023

Ah, I suppose this is handled by setting the release callback to NULL when moving the C struct content. And then the capsule deleter should check for that (instead of checking for it being renamed) ?
Then the guideline for consuming a Arrow C struct through a PyCapsule would be to always move the struct?

@pitrou
Copy link
Member

pitrou commented May 11, 2023

Indeed, you don't call the release callback when moving the structure:
https://arrow.apache.org/docs/format/CDataInterface.html#moving-an-array

The consumer can move the ArrowArray structure by bitwise copying or shallow member-wise copying. Then it MUST mark the source structure released (see “released structure” above for how to do it) but without calling the release callback. This ensures that only one live copy of the struct is active at any given time and that lifetime is correctly communicated to the producer.

As usual, the release callback will be called on the destination structure when it is not needed anymore.

@westonpace
Copy link
Member

The PyCapsule has a destructor defined that would call the release callback (in case the object never got consumed)

I think this would still be useful.

@jorisvandenbossche
Copy link
Member Author

The PyCapsule has a destructor defined that would call the release callback (in case the object never got consumed)

I think this would still be useful.

Yes, we would still keep it, but just using a different check to see when it should be called (checking if the release callback is not null, instead of the changed capsule name, as dlpack does and as I mentioned in the top post)

@jorisvandenbossche
Copy link
Member Author

I created a PR on the adbc (apache/arrow-adbc#702) and pyarrow (#35739) side as a small experiment with this.

wjones127 pushed a commit to wjones127/arrow that referenced this issue Sep 20, 2023
wjones127 pushed a commit to wjones127/arrow that referenced this issue Sep 27, 2023
wjones127 pushed a commit to wjones127/arrow that referenced this issue Oct 1, 2023
wjones127 pushed a commit to wjones127/arrow that referenced this issue Oct 10, 2023
@jorisvandenbossche jorisvandenbossche added this to the 14.0.0 milestone Oct 11, 2023
@raulcd raulcd removed this from the 14.0.0 milestone Oct 18, 2023
@raulcd
Copy link
Member

raulcd commented Oct 18, 2023

Hi, I am creating the RC0 right now. I can add this in future RCs if they are created.

pitrou pushed a commit to wjones127/arrow that referenced this issue Oct 18, 2023
pitrou pushed a commit that referenced this issue Oct 18, 2023
### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: #34031
* Closes: #35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@jorisvandenbossche jorisvandenbossche added this to the 14.0.0 milestone Oct 18, 2023
raulcd pushed a commit that referenced this issue Oct 18, 2023
### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: #34031
* Closes: #35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…37797)

### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: apache#34031
* Closes: apache#35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…37797)

### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: apache#34031
* Closes: apache#35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…37797)

### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: apache#34031
* Closes: apache#35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment