ENH: allow using open_arrow with PyCapsule protocol (without pyarrow dependency) #349

jorisvandenbossche · 2024-01-25T14:46:16Z

I think in one of the issues/PRs related to adding the Arrow read support we already touched on this topic (about returning a pointer instead of a pyarrow object, so you can also use this without a pyarrow dependency), but don't directly find it anymore.
But now that Arrow has standardized the Arrow PyCapsule protocol for exposing those pointers in Python with capsules, I think it makes sense to reconsider this (apache/arrow#39195, https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html)

cc @kylebarron

Some notes:

In the public open_arrow, I added a use_pyarrow keyword with the current default of True, which you can then set to False to allow using this without pyarrow
When enabling that, I let it return a small wrapper class _ArrowStream that just has the protocol's dunder method for now. In theory, we could make this a bit more full-featured class, to make it more similar as pyarrow's RecordBatchReader (eg making it iterable), but not sure who would actually use that
I am still returning the (meta, stream) tuple, although in this case it would be sufficient to return a single class (as we control the class and could attach the meta to the class). But maybe better to keep the return value consistent now?

…dependency)

…suming

kylebarron · 2024-01-25T15:32:56Z

pyogrio/_io.pyx

-        import pyarrow as pa
-        reader = pa.RecordBatchStreamReader._import_from_c(stream_ptr)
+        if return_capsule:
+            reader = capsule


To simplify the return objects, can we always return the bare capsule from here?

Then in the Python code if the user wants a pyarrow object, you could call pa.RecordBatchStreamReader._import_from_c(stream_ptr)? Or with pyarrow 15 pass to your new from_stream constructor?

Yes, that would be nicer, but that requires a minimum of pyarrow 14.0 to be able to use RecordBatchReader._import_from_c_capsule, because in python space we can't easily get the pointer out of the capsule. So therefore left that code path as is for now

Although for testing purposes, we have been using ctypes.pythonapi.PyCapsule_.. methods, so maybe that way we can actually do that in Python. The other problem is that we have to somehow keep the capsule alive (pyarrow doesn't allow to attach it to the RecordBatchReader), and just having it as a variable in ogr_open_arrow before the yield didn't seem to be sufficient (although for the ogr_dataset it is, not sure if cython handles python objects differently ..)

kylebarron · 2024-01-25T15:42:06Z

In #206 (comment) there was some discussion of whether to return a RecordBatchReader (or a subclass). I don't recall a discussion of separating it from pyarrow.

Overall this is exciting! I'm looking forward to using this from geoarrow-rs without a pyarrow dependency! (I'll make a helper that calls open_arrow with return_capsule=True under the hood)

On the topic of removing dependencies, with arrow it would be feasible to also make numpy an optional dependency, right? Though I would certainly understand if you don't want the maintenance overhead of that.

jorisvandenbossche · 2024-01-25T15:57:33Z

On the topic of removing dependencies, with arrow it would be feasible to also make numpy an optional dependency, right? Though I would certainly understand if you don't want the maintenance overhead of that.

I think the main question is whether we use the numpy C API, because in that case it's probably more complicated to have it optional. We do have a cimport numpy as np in _io.pyx, but looking at the call sites of np.<something>, I am not sure there is actually any non-Python usage.
If we only use the Python API, it might be quite straightforward to ensure we don't use it in the minimal subset of what open_arrow needs.

kylebarron · 2024-01-25T16:41:54Z

I think the main question is whether we use the numpy C API, because in that case it's probably more complicated to have it optional.

I was thinking the opposite... is it true that importing the C API only requires numpy as a build dependency? Or does that also require it as a runtime dependency?

Numpy is only used in the non-arrow readers, right? In theory you could split the non-arrow and arrow-based implementations into two separate files where only the non-arrow implementation imports numpy?

jorisvandenbossche · 2024-01-25T16:45:22Z

I would think that if you build with the numpy C API, then the shared library has numpy symbols it will not find when numpy is not available at runtime. I am not sure in C/C++ you can have "optional" dependencies at runtime, rather than at build time? (either your library is built with a certain dependency, or it is not?)

kylebarron · 2024-01-25T18:28:18Z

My guess had been that those symbols would be looked up when code was called that needed, but I really don't know.

kylebarron · 2024-02-08T16:13:56Z

pyogrio/raw.py

@@ -349,10 +359,11 @@ def open_arrow(
    sql_dialect=None,
    return_fids=False,
    batch_size=65_536,
+    return_pyarrow=True,


Thoughts on return_pyarrow vs return_capsule? I'm not sure which is clearer

It doesn't return a capsule exactly, but a small wrapper object with the __arrow_c_stream__ method that will return the capsule.

Side question: would returning the capsule directly be more useful for you?

(although it would be nice to have keyword name that is "positive" about what you want, instead of indicating of not wanting pyarrow .. I just don't have good ideas for that at the moment)

(although it would be nice to have keyword name that is "positive" about what you want, instead of indicating of not wanting pyarrow .. I just don't have good ideas for that at the moment)

I think this is more along the lines of what I was trying to ask.

I think having a wrapper object is preferable than raw capsules

We could maybe also consider removing the option entirely, and always returns the capsule wrapper object. This API is still quite new and more an advanced API, so we might be OK with still changing it.

Taking the example from the docstring:

with open_arrow(path) as source: meta, reader = source for table in reader: ....

If we would just always return the wrapper object as implemented in the PR right now, that means that people using this right now and expecting a pyarrow RecordBachReader (or an iterable of pyarrow Tables), would need to change their code to:

with open_arrow(path) as source: meta, reader = source reader = pa.RecordBatchReader.from_stream(reader) for table in reader: ....

One problem with this however is that this requires pyarrow >= 15.0, so that's not ideal.

We could also add the basic methods of a RecordBatchreader to the wrapper object (__enter__, __iter__, read_next_batch, read_all, schema, etc, just calling the equivalent method of the underlying pyarrow.RecordBatchReader, which we can instantiate lazily only when one of those methods is used, i.e. which still allows you to convert it with __arrow_c_stream__ without requiring pyarrow)

That keeps compatibility for people using the Python APIs of a RecordBatchReader (like in the code snippet above, to iterate through the stream). However, that unfortunately still doesn't work when passing this reader object to some other functionality of pyarrow that can accept a reader but passes that through to C++ code (eg writing a dataset with a RecordBatchReader as input)

We could also just check if pyarrow is available, if so return a RecordBatchReader (which will also have the __arrow_c_stream__ method for recent pyarrow), and if not fallback on returning the wrapper object.

That keeps backwards compatibility perfectly, while allowing to use it without pyarrow (at the expense of a slightly confusing API / different behaviour just depending on whether another package is installed or not)

Although that might also be an extra indirection in case you don't need the pyarrow RecordBatchReader. Because what will be returned from __arrow_c_stream__ on the returned reader will be a ArrowArrayStream backed by Arrow wrapping the GDAL stream.

I don't think the extra indirection would be a problem.

I tend to think that having multiple possible return types is much more confusion than it's worth for most users. I'm not really a fan of having it depend only on if pyarrow is installed.

If we're going to have one function, having a return_pyarrow=True I think is the best option. It's probably too much duplication here, but otherwise I would've considered an open_arrow_pycapsule_interface() or similar function

We could also just check if pyarrow is available, if so return a RecordBatchReader (which will also have the __arrow_c_stream__ method for recent pyarrow), and if not fallback on returning the wrapper object.

Silently returning a different type depending on installed Python packages is generally error-prone for users. I would recommend avoiding this.

Good points, agreed that it's better to not do any "smart" return value depending on what is installed.

I updated the PR (and added some docs) for the current state of having a return_pyarrow keyword with a default of True.

I am personally also still fine with changing the default to False (so you have to ask explicitly for a pyarrow object (to avoid the one line to do it yourself), which makes the keyword also about what you want, not what you want to avoid)

The keyword mentioned in the changelog is use_pyarrow (which I think I prefer) but the implementation is return_pyarrow

…-capsule

pyogrio/_io.pyx

pyogrio/_ogr.pxd

pyogrio/tests/test_arrow.py

brendan-ward

Thanks for working on this @jorisvandenbossche !

A few minor comments, and this looks good to go (esp. since it might just resolve our periodic segfault as per #386 where I tested with capsule changes from here).

It seems difficult to fully prove that the capsule works in the absence of pyarrow (e.g., that the stream can be consumed by a different ArrowStream reader), but the tests here at least prove that it operates without an import error triggered if pyarrow is missing.

brendan-ward · 2024-04-08T23:58:54Z

pyogrio/_io.pyx

+        if return_pyarrow:
+            import pyarrow as pa
+            stream_ptr = <uintptr_t> stream
+            reader = pa.RecordBatchStreamReader._import_from_c(stream_ptr)


maybe simplify a bit and remove line 1439?

Suggested change

reader = pa.RecordBatchStreamReader._import_from_c(stream_ptr)

reader = pa.RecordBatchStreamReader._import_from_c(<uintptr_t> stream)

brendan-ward · 2024-04-09T00:05:00Z

pyogrio/raw.py

@@ -349,10 +359,11 @@ def open_arrow(
    sql_dialect=None,
    return_fids=False,
    batch_size=65_536,
+    return_pyarrow=True,


The keyword mentioned in the changelog is use_pyarrow (which I think I prefer) but the implementation is return_pyarrow

pyogrio/raw.py

kylebarron · 2024-04-09T23:49:40Z

It seems difficult to fully prove that the capsule works in the absence of pyarrow

It looks like you can use nanoarrow's python bindings for that, and the nanoarrow.c_array_stream() function.

brendan-ward · 2024-04-10T13:18:43Z

It looks like you can use nanoarrow's python bindings for that, and the nanoarrow.c_array_stream() function.

Indeed; that appears to work, thanks for the tip. I'm assuming we don't yet want to add a CI environment with nanoarrow but without pyarrow to test this, but as we reduce dependency on pyarrow that would be something to consider in the future.

kylebarron · 2024-04-10T14:22:33Z

For now you can probably just assert that the returned object is not an instance of any known pyarrow class and assert that it works with nanoarrow.c_array_stream()?

…-capsule

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

jorisvandenbossche · 2024-04-12T08:12:49Z

OK, updated this, and changed return_pyarrow to use_pyarrow for the keyword name.

I think I have a slight preference to actually make use_pyarrow=False the default (i.e. a breaking change), so you don't have to specify a keyword when not having pyarrow installed (if we would like to have this as default eventually, better to do that now IMO)

brendan-ward · 2024-04-12T15:48:10Z

make use_pyarrow=False the default

How do you see that working out downstream, e.g., in GeoPandas? I'm assuming we'd still need to use pyarrow in read_arrow for now, right? So there wouldn't yet be a change to read_dataframe that needs to be handled from downstream.

kylebarron · 2024-04-12T15:55:18Z

How do you see that working out downstream, e.g., in GeoPandas?

Any pyarrow-based downstream library can convert the stream to a pyarrow by passing it to a RecordBatchReader I think

jorisvandenbossche · 2024-04-12T16:23:27Z

Yes, it's like the docstring example I added:

with open_arrow(path, use_pyarrow=False) as source:
    meta, stream = source
    # this is the extra line you need with pyarrow=False
    reader = pa.RecordBatchReader.from_stream(stream)
    for table in reader:
        geometries = shapely.from_wkb(table[meta["geometry_name"] or "wkb_geometry"])

And of course, you can still set use_pyarrow=True as a convenience instead of writing that extra line (unless we would just do away with this keyword altogether, given it's only one line extra that users have to write to get the pyarrow object)

And indeed this is only about open_arrow, for read_arrow we continue to return a pyarrow.Table (because there we return the data as a consumed stream, so we need some Arrow library to consume it)

brendan-ward · 2024-04-12T16:28:47Z

Ok - use_pyarrow=False seems like a reasonable default and better to do sooner than later.

kylebarron · 2024-04-12T17:53:45Z

unless we would just do away with this keyword altogether, given it's only one line extra that users have to write to get the pyarrow object

I'm personally OK with this if we document that line of code. Presumably only relatively advanced users will be using open_arrow

No strong preference either way.

kylebarron · 2024-04-17T13:13:14Z

Thanks @jorisvandenbossche !

jorisvandenbossche added 2 commits January 25, 2024 15:30

ENH: allow using open_arrow with PyCapsule protocol (without pyarrow …

ab214f4

…dependency)

don't use capsule in default case as that frees the stream before con…

c48fac2

…suming

kylebarron reviewed Jan 25, 2024

View reviewed changes

kylebarron approved these changes Feb 8, 2024

View reviewed changes

jorisvandenbossche added 4 commits March 12, 2024 18:20

Merge remote-tracking branch 'upstream/main' into arrow-reader-export…

e3e498c

…-capsule

remove noexcept for cython compat

102c117

fix context manager and memory issues with non-capsule code path

947b5d1

add docstring and changelog note

83f5c40

pitrou reviewed Mar 14, 2024

View reviewed changes

pyogrio/_io.pyx Show resolved Hide resolved

pitrou reviewed Mar 14, 2024

View reviewed changes

pyogrio/_io.pyx Show resolved Hide resolved

pitrou reviewed Mar 14, 2024

View reviewed changes

pyogrio/_ogr.pxd Outdated Show resolved Hide resolved

pitrou reviewed Mar 14, 2024

View reviewed changes

pyogrio/tests/test_arrow.py Show resolved Hide resolved

jorisvandenbossche added 4 commits March 14, 2024 15:22

always allocate stream through capsule

7c33fa0

ensure it actually runs without having pyarrow installed + test this

51787e7

clean-up pxd file

437485e

fix test in case of no pyarrow

73b9b2b

brendan-ward added a commit that referenced this pull request Apr 8, 2024

Test with stream release machinery from #349

94adbc0

brendan-ward mentioned this pull request Apr 8, 2024

DO NOT MERGE: Refactor dataframe where tests, reduce IF conditional compilation blocks #386

Draft

brendan-ward reviewed Apr 9, 2024

View reviewed changes

brendan-ward added this to the 0.8.0 milestone Apr 9, 2024

jorisvandenbossche added 2 commits April 12, 2024 09:50

Merge remote-tracking branch 'upstream/main' into arrow-reader-export…

8cf1dd0

…-capsule

small cleanup

d1282ca

jorisvandenbossche and others added 3 commits April 12, 2024 10:01

Apply suggestions from code review

a554294

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

return_pyarrow -> use_pyarrow

69a0922

reflow text, expand example explanation

3070274

switch default to use_pyarrow=False

f648447

jorisvandenbossche merged commit e5aba48 into geopandas:main Apr 17, 2024
20 checks passed

jorisvandenbossche deleted the arrow-reader-export-capsule branch April 17, 2024 08:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: allow using open_arrow with PyCapsule protocol (without pyarrow dependency) #349

ENH: allow using open_arrow with PyCapsule protocol (without pyarrow dependency) #349

jorisvandenbossche commented Jan 25, 2024 •

edited

kylebarron Jan 25, 2024

jorisvandenbossche Jan 25, 2024

jorisvandenbossche Jan 25, 2024

kylebarron commented Jan 25, 2024

jorisvandenbossche commented Jan 25, 2024

kylebarron commented Jan 25, 2024

jorisvandenbossche commented Jan 25, 2024

kylebarron commented Jan 25, 2024

kylebarron Feb 8, 2024

jorisvandenbossche Feb 10, 2024

jorisvandenbossche Feb 10, 2024

kylebarron Feb 10, 2024

jorisvandenbossche Feb 12, 2024

jorisvandenbossche Feb 12, 2024

kylebarron Feb 12, 2024

pitrou Feb 28, 2024

jorisvandenbossche Mar 14, 2024

brendan-ward Apr 9, 2024

brendan-ward left a comment

brendan-ward Apr 8, 2024

brendan-ward Apr 9, 2024

kylebarron commented Apr 9, 2024

brendan-ward commented Apr 10, 2024

kylebarron commented Apr 10, 2024

jorisvandenbossche commented Apr 12, 2024

brendan-ward commented Apr 12, 2024

kylebarron commented Apr 12, 2024

jorisvandenbossche commented Apr 12, 2024

brendan-ward commented Apr 12, 2024

kylebarron commented Apr 12, 2024

kylebarron commented Apr 17, 2024

	reader = pa.RecordBatchStreamReader._import_from_c(stream_ptr)
	reader = pa.RecordBatchStreamReader._import_from_c(<uintptr_t> stream)

ENH: allow using open_arrow with PyCapsule protocol (without pyarrow dependency) #349

ENH: allow using open_arrow with PyCapsule protocol (without pyarrow dependency) #349

Conversation

jorisvandenbossche commented Jan 25, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylebarron commented Jan 25, 2024

jorisvandenbossche commented Jan 25, 2024

kylebarron commented Jan 25, 2024

jorisvandenbossche commented Jan 25, 2024

kylebarron commented Jan 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brendan-ward left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylebarron commented Apr 9, 2024

brendan-ward commented Apr 10, 2024

kylebarron commented Apr 10, 2024

jorisvandenbossche commented Apr 12, 2024

brendan-ward commented Apr 12, 2024

kylebarron commented Apr 12, 2024

jorisvandenbossche commented Apr 12, 2024

brendan-ward commented Apr 12, 2024

kylebarron commented Apr 12, 2024

kylebarron commented Apr 17, 2024

jorisvandenbossche commented Jan 25, 2024 •

edited