
[Python] Expose the device interface through the Arrow PyCapsule protocol #38325

Closed · jorisvandenbossche opened this issue Oct 18, 2023 · 60 comments

@jorisvandenbossche (Member)

We added a new protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and new dunder methods __arrow_c_schema/array/stream__ (#35531 / #37797).

We recently also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972).

The currently merged PyCapsule protocol uses the stable non-device interface, so the question is how to integrate the device version into the protocol in order to expose the C Device Data Interface in Python as well. Some options:

  1. Only support the device versions going forward (like currently only the CPU version is supported, i.e. the returned capsules would always contain a device array/stream).
    (This is a backwards incompatible change, but given we labeled the protocol as experimental, we can still make such changes if we think this is the best long-term option. The capsule names would reflect this change, so this would generate a proper Python error if a consumer or producer had not yet been updated, and we could actually first deprecate the non-device support in pyarrow before removing it. All to say that, AFAIU, this is perfectly possible if we want it.)

  2. Add separate dunder methods __arrow_c_device_array__ and __arrow_c_device_stream__, and then it is up to the producer to implement those dunders if they can (and we can strongly recommend doing that, also for CPU-only libraries), and to consumers to check which ones are present.

  3. Allow the consumer to request a device array with some keyword (e.g. __arrow_c_array__(device=True)), which gives the consumer the option to request it while also still giving the producer the possibility to raise an error if they don't (yet) support the device version.

  4. Support both options in the current methods without a keyword, i.e. allow __arrow_c_array__ to return either an "arrow_array" or an "arrow_device_array" capsule (their capsule names distinguish the two), with the recommendation to always return a device version if you can, but allowing producers to still return a CPU version if they don't support the device one. This only gives some flexibility to the producer, and no control to the consumer to request the CPU version (so it essentially expects that all consumers will handle the device version).

Options 2/3/4 are probably just variants of how to expose both interfaces, and thus the main initial question is whether we want to, long term, move towards an ecosystem where everyone uses the C Device Data Interface, or to keep using both interfaces side by side (as the main interchange mechanism, I mean; the device interface of course still embeds the standard struct).
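
To illustrate option 2 from the consumer side, a minimal sketch of the feature detection involved (the import_* helpers are hypothetical stand-ins for the consumer library's own capsule-conversion routines):

def from_arrow(obj):
    if hasattr(obj, "__arrow_c_device_array__"):
        # Device-aware producer: the returned data may live on any device
        schema_capsule, device_array_capsule = obj.__arrow_c_device_array__()
        return import_device_array(schema_capsule, device_array_capsule)
    elif hasattr(obj, "__arrow_c_array__"):
        # Standard interface only: the data is guaranteed to be CPU-resident
        schema_capsule, array_capsule = obj.__arrow_c_array__()
        return import_cpu_array(schema_capsule, array_capsule)
    raise TypeError("object does not support the Arrow PyCapsule protocol")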

@jorisvandenbossche (Member Author)

Looking at the DLPack Python spec (https://dmlc.github.io/dlpack/latest/python_spec.html), they also define a __dlpack_device__ dunder method, and have a stream keyword in the __dlpack__ method. Do we need something similar here for the sync event? My understanding is that we don't need anything in addition to the actual C struct, since the sync_event is already part of that? (cc @kkraus14)

@pitrou (Member) commented Oct 18, 2023

I would be in favor of the second approach (separate dunder methods). Python methods are cheap, and these are really two different protocols.

@paleolimbot (Member)

I would be in favor of the second approach (separate dunder methods). Python methods are cheap, and these are really two different protocols.

I also would favour a separate dunder method. A consumer that is expecting an ArrowDeviceArray or stream almost certainly has the capability to construct one (i.e., the producer not implementing the method is not a barrier).

@kkraus14 (Contributor)

The downside of using a separate dunder method is that you run into ecosystem fragmentation and inefficiencies. E.g., a similar effect happened with __array_interface__ (https://numpy.org/doc/stable/reference/arrays.interface.html) and __cuda_array_interface__ (https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html). Libraries and applications developed primarily against NumPy despite the __array_function__ protocol, which made them think and develop in a CPU-centric way even though they were just using array APIs. GPU libraries then needed to either implicitly copy to CPU to support __array_interface__ or raise an exception to try to guide the developer in the right direction, neither of which is a good developer UX.

Compare that to dlpack (https://github.com/dmlc/dlpack) and the associated Python protocol surrounding it (https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html) which has had device support as a primary goal from its inception. Deep Learning libraries that have multiple device backends like PyTorch, Tensorflow, JAX, etc. all support dlpack but they don't support __array_interface__ or __cuda_array_interface__ because of those issues. Similarly, the __dataframe__ protocol (https://data-apis.org/dataframe-protocol/latest/API.html) has device support as a primary goal as well.

@pitrou (Member) commented Oct 19, 2023

I don't get the fragmentation and inefficiency argument. Implementing the C Data Interface is necessary to also implement the C Device Data Interface (since the latter is based on the former). Almost no implementation effort is saved by eschewing the former to go with the latter.

In any case, the C Data Interface already exists in C++, C#, Go, Java, Rust, nanoarrow and several third-party projects... It's simply logical to expose it in Python as well.

pitrou pushed a commit that referenced this issue Feb 28, 2024
…evice Interface (#39980)

### Rationale for this change

We have low-level methods `_import_from_c`/`_export_to_c` for the C Data Interface, we can add similar methods for the C Device data interface.

Expanding the Arrow PyCapsule protocol (i.e. a better public API for other libraries) is covered by #38325. Because of that, we might not want to keep those low-level methods long term (or at least we need to have the equivalents using capsules), but for testing it's useful to already add those.

### What changes are included in this PR?

Added methods to Array and RecordBatch classes. Currently import only works for CPU devices.

* GitHub Issue: #39979

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@jorisvandenbossche (Member Author) commented Mar 6, 2024

I would like to revive this, now that there is some movement in exposing the Device Interface in pyarrow (low-level bindings have been added, #39979) and in supporting it in cudf (rapidsai/cudf#15047).

Concretely, I think adding a separate set of dunder methods, __arrow_c_device_array__ and __arrow_c_device_stream__ (option 2 from the top post), would be the best option moving forward.
They are not entirely separate methods, though. For a producer/consumer that can handle the device interface, most of the code will be shared with the code required to handle the standard C interface, given that the C Device Interface is only a small extension on top of the C Interface struct. But keeping the dunder methods separate on the Python side allows libraries that only support the C Data Interface to still implement that part of the PyCapsule protocol. The absence of the device versions of the dunder methods is then also an indication that the producer only supports CPU data.

The actual methods can mimic the current ones (see https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#arrowarray-export), just with adapted names:

  • __arrow_c_device_array__ returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
  • __arrow_c_device_stream__ returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream"

And both methods can then similarly also accept a requested_schema keyword.
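
As a rough sketch of what a producer implementation could look like, here is a hypothetical wrapper class that delegates the export to a held pyarrow.RecordBatch (assuming pyarrow itself grows an equivalent __arrow_c_device_array__ method, as proposed here):

class MyBatch:
    """Hypothetical producer whose data is backed by a pyarrow.RecordBatch."""

    def __init__(self, batch):
        self._batch = batch

    def __arrow_c_device_array__(self, requested_schema=None):
        # Returns a (schema_capsule, device_array_capsule) pair, where the
        # second capsule is named "arrow_device_array"
        return self._batch.__arrow_c_device_array__(requested_schema=requested_schema)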

Some questions:

  • My understanding is that in the Python API for the dunder methods, we don't need to expose anything to deal with the sync_event, as that is entirely handled by the consumer through the struct itself (in contrast with the DLPack __dlpack__ method, which does have a stream keyword; but given the way the event works, this is not needed). Is this correct?
  • Do we want to add an __arrow_c_device__ method that returns the device of the array or stream?
    This would be similar to the __dlpack_device__ from DLPack, and I suppose essentially just an easier way to quickly check the device of the object.
    For DLPack, one reason to have this is to check the device before you can pass the correct stream to __dlpack__ (according to https://dmlc.github.io/dlpack/latest/python_spec.html#syntax-for-data-interchange-with-dlpack), and that's of course not a relevant reason for us (based on the previous bullet point)
  • As with the PyCapsule protocol we already added, I would for now not specify anything about what the consumer side should look like (in contrast to DLPack, which has a from_dlpack specified through the Array API). That also means it is up to the consumer library how to deal with device copies etc. (although I assume typically a copy will be avoided unless explicitly asked for?)
    EDIT: I see that the Array API standard recently added copy and device keywords to from_dlpack to be more explicit or ask to copy to a different device (Support copy and device keywords in from_dlpack data-apis/array-api#741)

@kkraus14 (Contributor) commented Mar 6, 2024

  • My understanding is that in the Python API for the dunder methods, we don't need to expose anything to deal with the sync_event, as that is entirely handled by the consumer through the struct itself (in contrast with the DLPack __dlpack__ method, which does have a stream keyword; but given the way the event works, this is not needed). Is this correct?

Yes, this is correct.

This can always be added later if there proves to be a need for it. I would push to keep things minimal for now to avoid needing to make breaking changes.

  • As with the PyCapsule protocol we already added, I would for now not specify anything about what the consumer side should look like (in contrast to DLPack, which has a from_dlpack specified through the Array API). That also means it is up to the consumer library how to deal with device copies etc. (although I assume typically a copy will be avoided unless explicitly asked for?)
    EDIT: I see that the Array API standard recently added copy and device keywords to from_dlpack to be more explicit or ask to copy to a different device (Support copy and device keywords in from_dlpack data-apis/array-api#741)

+1 to not defining from_* methods yet. I suspect we'll eventually want something similar with copy and device keywords to pass to the producer, but I think we can wait for feedback before building things out on that front.

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Mar 21, 2024
@jorisvandenbossche (Member Author) commented Mar 21, 2024

Thanks for the feedback @kkraus14. I started a PR updating the documentation page about the PyCapsule protocol to expand it with those two new dunder methods -> #40708

Apart from just defining __arrow_c_device_array__ and __arrow_c_device_stream__ and their return values, it might be useful to agree on and document some guidelines about when to implement which methods?

  • For a CPU-only library, it is encouraged to implement both the standard and device version of the protocol methods (i.e. both __arrow_c_array__ and __arrow_c_device_array__, and/or both __arrow_c_stream__ and __arrow_c_device_stream__)
  • The presence of only the standard version (e.g. only __arrow_c_array__ and not __arrow_c_device_array__) means that this is a CPU-only data object.
  • For a device-aware library, and for data structures that can only reside in non-CPU memory, you should only implement the device version of the protocol (e.g. only add __arrow_c_device_array__, and never add a __arrow_c_array__)
    • Libraries can of course have data structures that can live in either CPU or non-CPU memory, and for those it is fine to implement both versions (and error in the non-device version if the data is not on the CPU)?
      EDIT: this has to be fine of course, given that pyarrow is in this situation, and we want to define both methods. But should we error in __arrow_c_array__ for non-CPU data? (Right now we don't actually check the device here, but silently return an ArrowArray struct with null buffer pointers; see the sketch after this list.)
  • Do we want to say something about expectations that no cross-device copies happen? Although we don't specify anything about a public consumer method (like from_dlpack), we could still give the guideline not to let such a method do device copies? Although that is maybe fully up to the consumer library to decide; at least in the pyarrow consumer methods, we will not do any copies.
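
As a sketch of the guard mentioned in the EDIT above (self.is_cpu is a hypothetical property on the producer's array object; the point is only to fail loudly rather than export dangling pointers):

def __arrow_c_array__(self, requested_schema=None):
    if not self.is_cpu:
        # The plain C Data Interface implies CPU-resident buffers
        raise ValueError(
            "Cannot export non-CPU data through __arrow_c_array__; "
            "use __arrow_c_device_array__ instead"
        )
    ...  # build and return (schema_capsule, array_capsule)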

@pitrou (Member) commented Mar 21, 2024

  • For a CPU-only library, it is encouraged to implement both the standard and device version of the protocol methods (i.e. both __arrow_c_array__ and __arrow_c_device_array__, and/or both __arrow_c_stream__ and __arrow_c_device_stream__)

+0. I'm not sure it makes sense to ask producers for more effort in this regard.

  • The presence of only the standard version (e.g. only __arrow_c_array__ and not __arrow_c_device_array__) means that this is a CPU-only data object.

+1

  • For a device-aware library, and for data structures that can only reside in non-CPU memory, you should only implement the device version of the protocol (e.g. only add __arrow_c_device_array__, and never add a __arrow_c_array__)

+1

  • Libraries can of course have data structures that can live in either CPU or non-CPU memory, and for those it is fine to implement both versions (and error in the non-device version if the data is not on the CPU)?

+1

EDIT: this has to be fine of course, given that pyarrow is in this situation, and we want to define both methods. But should we error in __arrow_c_array__ for non-CPU data?

Yes, we should. The expectation of the (regular) C Data Interface is that data lives on the CPU.

  • Do we want to say something about expectations that no cross-device copies happen?

In the producer or in the consumer? IMHO the consumer is free to do whatever suits them. On the producer side the question is a bit more delicate. Perhaps we need to pass some options to __arrow_c_device_array__.

@paleolimbot (Member) commented Mar 21, 2024

+1 to all! Perhaps just +0 to:

For a CPU-only library, it is encouraged to implement both the standard and device version of the protocol methods (i.e. both __arrow_c_array__ and __arrow_c_device_array__, and/or both __arrow_c_stream__ and __arrow_c_device_stream__)

I am not sure I would recommend that (although it is easy to provide producers with boilerplate Cython or helper functions to help them do it if they would like).

Do we want to say something about expectations that no cross-device copies happen?

As a consumer, I suppose I would be very surprised if a producer performed a cross-device copy in a call to __arrow_c_device_xxx__.

@vyasr (Contributor) commented Mar 21, 2024

Do we want to say something about expectations that no cross-device copies happen?

My preference would be that at the outset the interface recommends that no implicit copies occur, but that there is no normative language prohibiting it. I agree with @paleolimbot that in a vacuum a copy would be surprising, but for objects that support data resident on either host or device (e.g. a pytorch array) it could be reasonable for this to happen. If a library already supports implicit back-and-forth copying via its public APIs then it should be acceptable for the same to happen via the capsule interface (in either direction, i.e. you could access the device interface and trigger H2D or vice versa), whereas if a library requires some explicit transfer mechanism (e.g. APIs like pytorch.tensor.to(torch.device(...))) then the library would probably require the same before accessing the associated capsule interface.

Perhaps we need to pass some options to __arrow_c_device_array__.

I would recommend that we wait on this and see how the implementations look in practice. We could always add an allow_copy=False parameter with a default later without breaking implementers of the interface. To properly support libraries that implement both the device and host interfaces, we would presumably need to add the same parameter to both, since implicit H2D and D2H copies are equally undesirable.

@vyasr (Contributor) commented Mar 21, 2024

Since @jorisvandenbossche mentioned the ongoing work in cudf's C++ to support the C device data interface in rapidsai/cudf#15047, I put together a (very, very preliminary) concept of what exposing this in Python could look like in rapidsai/cudf#15370. There is a lot of work to be done before that PR can come out of draft, but I'm happy to help people on this thread use it as a testing ground for ideas on the Python capsule interface if that's useful. Feedback is of course welcome, but given how preliminary the implementation is I wouldn't waste any time on details. For now I haven't implemented the dunder methods on any classes, but the free functions are doing the same capsule construction and should be trivial to toss into a class's dunder method once cudf decides exactly what to expose. In the meantime the free functions offer the same interface for testing.

@kkraus14 (Contributor)

  • For a CPU-only library, it is encouraged to implement both the standard and device version of the protocol methods (i.e. both __arrow_c_array__ and __arrow_c_device_array__, and/or both __arrow_c_stream__ and __arrow_c_device_stream__)

I would be +1 on encouraging being able to consume both protocols, but +0.5 on encouraging producing both protocols. Agreed with @pitrou's comment about it being more effort, with the only return being future-proofing for potential non-CPU usage.

I do think we should start discussing what it looks like for a CPU-only library to request data from a non-CPU library, e.g. if pandas wants to consume data produced by cuDF. Presuming we figure out a convenient API to pass that information to cuDF, would it return an __arrow_c_device_array__ with the device as CPU, or return an __arrow_c_array__? The former puts an onus on all consumers to support the device variants of the protocol.

@vyasr (Contributor) commented Apr 10, 2024

While I agree that supporting a version attribute is more work, it also feels more natural than introspecting an API to determine what is supported. If I read code relying on signature inspection I assume it's doing something fairly hacky by default (which isn't to say that I haven't written such code on multiple occasions before...).

If we don't want versioning, another option (that I'm not necessarily a huge fan of but is worth considering) is defining the protocol as a function that accepts arbitrary keyword arguments, obj.__arrow_c_device_array__(**kwargs). Then producers can choose which arguments to support. We could allow consumers to opt out (or opt in) to error handling:

class Foo:
    def __arrow_c_device_array__(self, *, strict=True, **kwargs):
        try:
            # parse_kwargs is a placeholder for validating whichever
            # keywords this producer actually supports
            parse_kwargs(kwargs)
        except BaseException as e:
            if strict:
                raise
            else:
                warn_about_unsupported_kwargs(kwargs, e)
        ...  # construct and return the capsule(s)

This way all options are only honored if they are supported. Otherwise, it's up to the consumer if they want this to fail loudly or if they want to handle things themselves.

One downside of this approach is that it assumes that consumers are fine stating the superset of all options that they want and then accepting whether or not producers will honor those options. If consumers want to dynamically adjust the calls based on the options available from a particular producer, they will have to resort to doing so more manually (e.g. via isinstance checks that sort of defeat the purpose of a protocol). It also precludes the signature-based introspection.

@kkraus14 (Contributor)

Versioning of or tracking of capabilities in the C Data Interface itself is out of scope for this discussion, I would say (it's an interesting topic and has come up before, I think, but a much bigger discussion, e.g. because it requires a new struct).

It's more that we're talking about this topic because of the desire to introduce something like a requested_device consumer-provided parameter, so it's not just a new struct, but a way for a consumer to provide information to a producer in a standardized way at the C-API level as opposed to the Python-API level.

@vyasr's proposal above is interesting, and we could possibly restrict it even further, in that we could document that all evolutions to the protocol that introduce new parameters should have a default value of None that corresponds to the previous behavior. I.e. we could do something like:

v1 producer

class Foo:
    def __arrow_c_device_array__(self, *, strict=True, **kwargs):
        for key, value in kwargs.items():
            if value is not None:
                raise NotImplementedError(f"{key} parameter is not supported")

        return capsule  # capsule construction elided

v2 producer

class Foo:
    def __arrow_c_device_array__(self, *, strict=True, requested_device=None, **kwargs):
        for key, value in kwargs.items():
            if value is not None:
                raise NotImplementedError(f"{key} parameter is not supported")

        # handle `requested_device` here in a meaningful way, where None
        # implies the same behavior as v1
        return capsule  # capsule construction elided

From a consumer perspective, passing any new parameter becomes purely optional, and they can handle unsupported parameters via try/except handling. I don't think signature inspection works here because a producer implementation may handle a parameter by explicitly throwing on non-default values, for example.

@jorisvandenbossche (Member Author) commented Apr 10, 2024

It's more that we're talking about this topic because of the desire to introduce something like a requested_device consumer-provided parameter, so it's not just a new struct, but a way for a consumer to provide information to a producer in a standardized way at the C-API level as opposed to the Python-API level.

Sorry, I don't understand this paragraph. How is adding the requested_device keyword or not (or ways to know whether that keyword is supported) something that plays out at the C-API level? The C API is the struct, and there is nothing in there that allows passing options; that's all at the Python level.

@vyasr's proposal above is interesting, and we could possibly restrict it even further, in that we could document that all evolutions to the protocol that introduce new parameters should have a default value of None that corresponds to the previous behavior. I.e. we could do something like:

I don't have a strong objection to adding that, but to be clear, this all still means that the consumer has to do the work to check whether the keyword is supported and whether it has been honored.

To use a concrete example, with the current proposal of no special mechanism, and if we add a requested_device keyword in the future, the consumer code could look like (for an example consumer that can only handle CPU data):

try:
    capsules = obj.__arrow_c_device_array__(requested_device=kCPU)
except TypeError:
    capsules = obj.__arrow_c_device_array__()
    # manually check the returned array capsule has a CPU device, and otherwise error

With the proposal above to add the **kwargs that raise if not None:

try:
    capsules = obj.__arrow_c_device_array__(requested_device=kCPU)
except NotImplementedError:
    capsules = obj.__arrow_c_device_array__()
    # manually check the returned array capsule has a CPU device, and otherwise error

or if the strict keyword is used (and honored by the producer), this is a bit simpler:

capsules = obj.__arrow_c_device_array__(requested_device=kCPU, strict=False)
# manually check the returned array capsule has a CPU device, and otherwise error

From a consumer perspective, passing any new parameter becomes purely optional, and they can handle unsupported parameters via try/except handling.

Exactly the same is true without the explicit **kwargs specification; the only difference is the error type to catch.

we could document that all evolutions to the protocol that introduce new parameters should have a default value of None that corresponds to the previous behavior

I think that is something we will do anyway in practice: if we add a new keyword in the future, it should always have a default that matches the previous behaviour, such that not specifying it preserves behaviour (otherwise it would be a breaking change).
This is not exactly a "default value of None", but it essentially comes down to the same thing?


To summarize, if there is a strong desire to add the **kwargs and the checking of it, and potentially the strict=False option, to the specification, I can certainly live with that.
But I do think that 1) it does not add that much compared to the base situation of not specifying it (in that case you can also handle it with a simple try/except), and 2) it adds some details to the specification that the consumer wants to rely on and that producers can get wrong (i.e. the fact that the producer will check the kwargs, the exact error type that the producer will raise if they are present, and that the producer will honor a strict=False keyword). This is of course not hard to get right, but I am personally not sure it is worth the small gain.

@pitrou (Member) commented Apr 10, 2024

I think we should avoid vague and confusing terms such as strict, because it will bear a different meaning and expectations for each person.

As for requested_device, if we think it is needed, then let's just add it immediately?

As for always taking **kwargs, this looks reasonable to me. Also, we can mandate that None be the default value for all optional arguments we add in the future. This makes checking the kwargs for non-default values very easy, for example:

    non_default_kwargs = [name for name, value in kwargs.items() if value is not None]
    if non_default_kwargs:
        raise NotImplementedError(f"some nice error message with {non_default_kwargs}")

@jorisvandenbossche (Member Author)

As for always taking **kwargs, this looks reasonable to me

Do we then require that the producer raise an error if unknown kwargs / kwargs with non-default values are passed (like in your example code snippet)?
Because if we don't require that, this might actually make it harder for the consumer (e.g. if passing requested_device and getting no error, that could mean either that it was honored or that it was ignored).

@pitrou (Member) commented Apr 10, 2024

Do we then require that the producer raise an error if unknown kwargs / kwargs with non-default values are passed (like in your example code snippet)?

Yes, we probably want to. If a producer ignores a hypothetical requested_device and returns data on another device, then the consumer might just crash.

@jorisvandenbossche (Member Author)

Python does that automatically for you (raising an error for an unknown keyword) if you don't add **kwargs, so the main benefit of adding the explicit **kwargs is that it allows a consumer to pass a keyword with a default value without getting an error, even if the producer doesn't yet support that keyword?
I.e. someone can do obj.__arrow_c_device_array__(requested_device=None) regardless of whether obj supports that keyword.
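
A small illustration of that point (all names here are illustrative):

class OldProducer:
    # A producer written before any optional keyword existed
    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        if any(v is not None for v in kwargs.values()):
            raise NotImplementedError(f"unsupported keywords: {list(kwargs)}")
        ...  # build and return the capsules

# Passing a newer keyword at its None default is harmless: no TypeError,
# and the producer's behaviour is unchanged.
OldProducer().__arrow_c_device_array__(requested_device=None)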

@jorisvandenbossche (Member Author)

As for requested_device, if we think it is needed, then let's just add it immediately?

The specifics for requested_device could then be:

  • For the default of requested_device=None, the producer should provide the data on the device that the data currently lives on.
  • When specified, it accepts integer values corresponding to ArrowDeviceType from the ABI.
  • The producer should attempt to copy the data to the provided device, and if it cannot do that (e.g. it doesn't support that device), raise an exception (ValueError?)
    (thus, not "best effort" like requested_schema, but actually raise an error if not possible to fulfill the request)

Does that sound OK?

If we add this keyword, do we keep the notion that a non-CPU library is allowed to also implement the non-device-aware methods like __arrow_c_array__, doing an implicit device-to-CPU copy? That case could then be covered by the more explicit __arrow_c_device_array__(requested_device=kCPU), although on the other hand it can still be useful for consumers that don't yet support the device variants of the structs.

@pitrou (Member) commented Apr 10, 2024

Python does that automatically for you (raising an error for an unknown keyword) if you don't add **kwargs, so the main benefit of adding the explicit **kwargs is that it allows a consumer to pass a keyword with a default value without getting an error, even if the producer doesn't yet support that keyword?

Exactly.

@pitrou (Member) commented Apr 10, 2024

Does that sound OK?

Hmm, I would rather let @vyasr and @kkraus14 decide what is best here.

@paleolimbot (Member)

Only mentioning this because I just ran across it, but dlpack apparently uses a consumer-provided version:

Starting with the Python array API standard v2023, a new max_version argument is added to __dlpack__ for the consumer to signal to the producer the maximal supported DLPack version.

For what it's worth, I think a producer-defined __arrow_c_api_version__ is the cleanest possible solution: it's very easy to implement for a producer, and very easy for a consumer to generate an informative error (or adjust calling code accordingly). The other ways are fine too, but they possibly invent a brand new way to deal with Python API evolution, and I am not sure there is sufficient evidence, based on the experience of the array API or dlpack, that we need one.
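
For concreteness, a minimal sketch of that idea (the dunder name is the one proposed above; everything else is illustrative):

class Producer:
    __arrow_c_api_version__ = 2  # protocol version this producer implements

def require_protocol_version(obj, required):
    # Producers predating the attribute implicitly implement version 1
    version = getattr(obj, "__arrow_c_api_version__", 1)
    if version < required:
        raise TypeError(
            f"producer implements protocol version {version}, "
            f"but version {required} is required"
        )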

When specified, it accepts integer values corresponding to ArrowDeviceType from the ABI

Would the request also have to specify the device_id?

Because that can then be covered with the more explicit __arrow_c_device_array__(requested_device=kCPU)

From a CPU calling library perspective, I would prefer to only have to deal with __arrow_c_array__(), or possibly __arrow_c_array__(copy=True/False/None) in some future version of that method (since a cross-device copy is not the only type of copy a caller might want to prevent).

Adding requested_device is fine too, and I'm in favour of all outcomes that let this interface start being implemented; I am just worried that adding an argument before any implementations exist means we risk having to change its interpretation or remove it (defeating the purpose of adding it early).

@jorisvandenbossche (Member Author)

Note that for DLPack this is about the C ABI version, not the Python interface. So it allows the user to request a certain version of the struct to be put in the returned capsule (DLPack has different versions of the C struct). That's something different from what we are discussing here.

@jorisvandenbossche (Member Author)

When specified, it accepts integer values corresponding to ArrowDeviceType from the ABI

Would the request also have to specify the device_id?

That's a good question; hopefully @kkraus14 or @vyasr can answer it. If needed, the argument would become a tuple of two ints, (device_type, device_id) (I see that's also what DLPack does with their dl_device keyword in __dlpack__).

I'm in favour of all outcomes that let this interface start being implemented; I am just worried that adding an argument before any implementations exist means we risk having to change its interpretation or remove it (defeating the purpose of adding it early).

Likewise here: happy to spec and implement any agreed proposal, but my personal preference would still be to just start with the simplest option (no keywords, no special kwarg handling for future evolution), clearly labeling it as experimental, and re-evaluate in some time, after we have initial implementations and usage.

@vyasr (Contributor) commented Apr 12, 2024

To clarify, it sounds like we're trying to make two separate decisions here. It's important to decouple the requirements imposed by the need for painless syntactic evolution of the protocol from the semantics of any particular feature of the protocol. The motivation behind the **kwargs proposal was to make it possible for consumers to specify arbitrary arguments without seeing errors because a particular producer doesn't support a given argument. It sounds like from the above discussion we either want the protocol to be clearly versioned in a way that consumers can query, with a strictly defined signature for each version, or we need the protocol's signature to be flexible enough to accommodate any call from the consumer without raising an exception solely on the basis of certain arguments not being supported. I think either approach is fine, but I wanted to lay out the options clearly as I see them, in case others differ there or see a preferable alternative.

That discussion is separate from the semantics of any particular option. Some options may not make sense to allow defaults for, or to have optional behavior, and we might still need to make such arguments explicit the way that Keith did with def __arrow_c_device_array__(self, *, strict=True, requested_device=None, **kwargs). In that scenario I think it's totally fine for the producer to raise an exception saying f"Parameter {param} must be specified when invoking the device array protocol". We probably need to document those semantics on a per-parameter basis and make them uniform as part of the spec, rather than allow flexibility on a per-producer basis.

On the specific topic of requested_device, I think Joris's proposal makes sense. I agree that if a specific device is requested, a producer should error if the data cannot be transferred to that device; I don't think best effort makes a lot of sense here. OTOH, zero-copy remains best-effort if requested_device=None, since it may not always be possible even on the same device to provide an entirely zero-copy experience.

@jorisvandenbossche (Member Author)

The motivation behind the **kwargs proposal was to make it possible for consumers to specify arbitrary arguments without seeing errors because a particular producer doesn't support a given argument.
..
or we need the protocol's signature to be flexible enough to accommodate any call from the consumer without raising an exception solely on the basis of certain arguments not being supported.

My understanding was that we only want to allow the consumer to pass a default value without raising an exception. As a producer, if any parameter with a non-default value is passed that the producer in question does not support, it should still raise an error?

But from your above post, it seems you are arguing for actually silencing any errors?

@vyasr (Contributor) commented Apr 18, 2024

Hmm, my interpretation of various comments above, like this one:

Well, for me the question is: how do we later add options to the API without breaking compatibility with producers that don't implement those options?

was that we wanted a way for consumers to be producer-agnostic, up to and including allowing producers to silently ignore arguments that they do not support. How would we distinguish whether an argument was provided with a default value explicitly vs. because that value is the default in the signature itself? To do that we would need to expose something like pandas's NoDefault, which doesn't work here since we're defining a protocol without any centralized implementation where such a "no default" sentinel could live.

Perhaps we should start sketching out some slightly more realistic examples than the ones in this thread so far, both of the v0 proposal and potential changes we could envision in v1 (even if we don't plan to implement them right now). Making that concrete ought to help us decide what behavior we should expect and what failure modes we should be handling. I know that a few folks have expressed a preference for getting started on a simpler prototype sooner rather than spending too long discussing, which I am in favor of too. So maybe a real v0 prototype, plus a sample v1 prototype that makes some set of changes which we could imagine being realistic but don't actually want to deal with in v0. Then we can assess based on that.

@jorisvandenbossche (Member Author) commented Apr 19, 2024

How would we distinguish whether an argument was provided with a default value explicitly vs. because that value is the default in the signature itself?

The idea mentioned earlier was to require that any new keyword will have a default of None:

Keith proposed this first (#38325 (comment)):

we could possibly restrict it even further, in that we could document that all evolutions to the protocol that introduce new parameters should have a default value of None that corresponds to the previous behavior.

And later also Antoine (#38325 (comment)):

Also, we can mandate that None be the default value for all optional arguments we add in the future. This makes checking the kwargs for non-default values very easy

(each time with example code of how handling that would look; see the links)

To me, the main reason I don't really like the idea of silently ignoring unsupported keywords is that, as a consumer, you then still need to check in the resulting data whether your keyword was honored, and have a fallback for when it was not. At that point it becomes very similar to a try/except.

For the example of a requested_device keyword: if this keyword were ignored when not supported by some producer, it essentially becomes a "best-effort" keyword (a behaviour which we didn't want for the keyword itself), and the consumer would still have to check the device type of the returned struct, because it cannot be sure that it has the device type that was requested.


Perhaps we should start sketching out some slightly more realistic examples than the ones in this thread so far, both of the v0 proposal and potential changes we could envision in v1 (even if we don't plan to implement them right now)

Fully agreed that concrete examples are helpful. But I think we can essentially treat the requested_device keyword as such an example (assuming we would not add that keyword right now but only in a year from now or so). So then a v0 would have no keywords, and a v1 would add the requested_device keyword. I think that is a realistic example?

In #38325 (comment) above, I already showed some code snippets of how it would look for a consumer to use this keyword, depending on which mechanism we use to add new keywords (no special handling; let the producer raise for non-default values; allow asking to silence errors and ignore unsupported keywords).

@vyasr (Contributor) commented Apr 30, 2024

I apologize for the exceedingly slow responses here, work has been in a state of transition recently and I'm still getting my bearings a bit.

Right, so basically there are three proposals for the API here:

  1. All arguments are specified explicitly in the signature, but the default value must be None.
  2. In addition to specifying arguments, the interface also accepts **kwargs. The producer is responsible for raising a NotImplementedError for any unsupported kwargs.
  3. In addition to specifying arguments, the interface also accepts **kwargs. The producer is free to ignore unsupported keywords or raise as desired.

Your message above effectively compares 1 and 2. Antoine's and Keith's proposals are both variants of 2. I think 2 is a bit more work for the producer for nearly identical consumer code, but I do think there is a real improvement and it's worth the extra code. A producer-generated NotImplementedError can be much more informative than a Python-generated TypeError (or an inspect.signature-based analysis).

I thought 3 might be of interest because of some of the prior discussions around making a best effort to honor some keywords but allowing them to also not be honored. In that case it might be easier via **kwargs to simply ignore such keywords, but the downside with 3 is that this behavior of accepting but not honoring keywords becomes the default. On further thought, I'm also leaning against that. A producer can always still choose not to honor a particular keyword (without failing outright) with 2, they just have to do so explicitly by handling that keyword rather than letting it be subsumed by **kwargs. So I'm on board with not silently ignoring unsupported keywords, I do think that's the better long-term choice.

@jorisvandenbossche (Member Author) commented Jun 20, 2024

Apologies for the slow response from my side now. Re-reading the discussion above, I think there is general support for what is called option 2 in the last comment just above: the interface methods accept **kwargs, and the producer is responsible for raising a NotImplementedError for any keyword that it does not support and that does not have the value None.

So then I would propose to just go with that: adding the **kwargs handling for future keywords, but for now not yet defining any new actual keyword (except for requested_schema). I will update my two open PRs with the specification and pyarrow implementation. And then I think we can still get this into the 17.0 release to make it easier to experiment with this.
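
For reference, a minimal producer template following that agreed behaviour (capsule construction elided):

class Producer:
    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        # Future keywords must default to None; reject any non-default
        # value for a keyword this producer does not understand.
        non_default = [k for k, v in kwargs.items() if v is not None]
        if non_default:
            raise NotImplementedError(
                f"Unsupported keyword arguments: {non_default}"
            )
        ...  # build and return (schema_capsule, device_array_capsule)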

@jorisvandenbossche (Member Author)

Review of the spec additions in #40708 would be very welcome.

jorisvandenbossche added a commit that referenced this issue Jun 26, 2024
… Data support (#40708)

### Rationale for this change

We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (#35531 / #37797).

We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972).

This expands the Python exposure of the interface with support for the newer Device structs.

### What changes are included in this PR?

Update the specification to define two additional dunders:

* `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
* `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream" 

### Are these changes tested?

Spec-only change

* GitHub Issue: #38325

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
jorisvandenbossche added this to the 17.0.0 milestone Jun 26, 2024
@jorisvandenbossche (Member Author)

Issue resolved by pull request #40708.

jorisvandenbossche added a commit that referenced this issue Jun 26, 2024
…yArrow (#40717)

### Rationale for this change

PyArrow implementation for the specification additions being proposed in
#40708

### What changes are included in this PR?

New `__arrow_c_device_array__` method to `pyarrow.Array` and
`pyarrow.RecordBatch`, and support in the `pyarrow.array(..)`,
`pyarrow.record_batch(..)` and `pyarrow.table(..)` functions to consume
objects that have those methods.

### Are these changes tested?

Yes (for CPU only for now, #40385 is
a prerequisite to test this for CUDA)


* GitHub Issue: #38325
@jorisvandenbossche (Member Author)

Thanks all for the discussion here! I merged the updates to the specification (#40708) and also a first basic implementation of it for pyarrow (only for the device array, not yet for the device stream; #40717)

@vyasr (Contributor) commented Jun 26, 2024

Thanks @jorisvandenbossche! Once some of the other dust settles I'll look into getting this implemented in cudf and we can start testing.
