Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement support for RFC 86: Column-oriented read API for vector layers #367

Merged
merged 1 commit into from
Feb 8, 2023

Conversation

kylebarron
Copy link
Contributor

@kylebarron kylebarron commented Feb 5, 2023

  • I agree to follow the project's code of conduct.
  • I added an entry to CHANGES.md if knowledge of this change could be valuable to users.

Description

This is a pretty low-level/advanced function, but is very useful for performance when reading (and maybe in the future writing) from OGR into columnar memory.

This function operates on an ArrowArrayStream struct that needs to be passed in. Most of the time, users will be using a helper library for this, like arrow-rs or arrow2. The nice part about this API is that this crate does not need to declare those as dependencies.

The OGR guide is very helpful reading. Would love someone to double-check this PR in context of this paragraph:

There are extra precautions to take into account in a OGR context. Unless otherwise specified by a particular driver implementation, the ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing). Furthermore, unless otherwise specified by a particular driver implementation, only one ArrowArrayStream can be active at a time on a given layer (that is the last active one must be explicitly released before a next one is asked). Changing filter state, ignored columns, modifying the schema or using ResetReading()/GetNextFeature() while using a ArrowArrayStream is strongly discouraged and may lead to unexpected results. As a rule of thumb, no OGRLayer methods that affect the state of a layer should be called on a layer, while an ArrowArrayStream on it is active.

Change list

  • Copy in arrow_bridge.h with the Arrow C Data Interface headers.
  • Add arrow_bridge.h to the bindgen script so that gdal_3.6.rs includes a definition for ArrowArrayStream. I re-ran this locally; I'm not sure why there's such a big diff. Maybe I need to run this from 3.6.0 instead of 3.6.2?
  • Implement read_arrow_stream
  • Add example of reading arrow data to arrow2

Todo

  • Pass in options to OGR_L_GetArrowStream? According to the guide:

    The papszOptions that may be provided is a NULL terminated list of key=value strings, that may be driver specific.

    So maybe we should have an options: Option<Vec<(String, String)>> argument? Pyogrio uses this to turn off generating an fid for every row.

  • Have an option to skip reading some columns. Pyogrio does this with calls to OGR_L_SetIgnoredFields.

References

Closes #280

@lnicola
Copy link
Member

lnicola commented Feb 6, 2023

That C header seems like a recipe for disaster. The structs are defined in ogr_recordbatch.h in my version of GDAL. Is bindgen not picking them up?

@kylebarron
Copy link
Contributor Author

kylebarron commented Feb 6, 2023

The structs are defined in ogr_recordbatch.h in my version of GDAL. Is bindgen not picking them up?

Before my change, the generated ArrowArrayStream was empty:

#[doc = " Data type for a Arrow C stream Include ogr_recordbatch.h to get the definition."]
#[repr(C)]
#[derive(Debug, Copy, Clone)]
pub struct ArrowArrayStream {
_unused: [u8; 0],
}

Is there a better way to include those definitions?

That C header seems like a recipe for disaster.

Can you explain this more? Given that the definitions in that header are specified by the Arrow project to remain the same, seems fine to include the header this way. I copied that header in manually because the similar pyogrio PR (vectorized Python GDAL wrapper) did the same

@jdroenner
Copy link
Member

you also need to add
#[cfg(all(major_is_3, minor_ge_6))] // GDAL >= 3.6
to enable it only for GDAL >= 3.6. Otherwise the CI will keep failing

@kylebarron
Copy link
Contributor Author

Thank you! That was going to be my next question 😄

@lnicola
Copy link
Member

lnicola commented Feb 6, 2023

Before my change, the generated ArrowArrayStream was empty

See the doc-comment. If I add that header to wrapper.h, I get:

 #[doc = " Data type for a Arrow C stream Include ogr_recordbatch.h to get the\n definition."]                                                                                                 
 #[repr(C)]                                                                                                                                                                                    
 #[derive(Debug, Copy, Clone)]                                                                                                                                                                 
 pub struct ArrowArrayStream {                                                                                                                                                                 
     pub get_schema: ::std::option::Option<                                                                                                                                                    
         unsafe extern "C" fn(arg1: *mut ArrowArrayStream, out: *mut ArrowSchema) -> libc::c_int,                                                                                              
     >,                                                                                                                                                                                        
     pub get_next: ::std::option::Option<                                                                                                                                                      
         unsafe extern "C" fn(arg1: *mut ArrowArrayStream, out: *mut ArrowArray) -> libc::c_int,                                                                                               
     >,                                                                                                                                                                                        
     pub get_last_error: ::std::option::Option<                                                                                                                                                
         unsafe extern "C" fn(arg1: *mut ArrowArrayStream) -> *const libc::c_char,                                                                                                             
     >,                                                                                                                                                                                        
     pub release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut ArrowArrayStream)>,                                                                                                    
     pub private_data: *mut libc::c_void,                                                                                                                                                      
 }                                                                                                                                                                                             

I understand that they're unlikely to change, but I'd be more comfortable using the same definition GDAL is using. It's conceptually correct, and we ship less code, so why not do it that way?

@kylebarron
Copy link
Contributor Author

kylebarron commented Feb 6, 2023

The issue was that I'm pretty inexperienced with C and didn't know how to add ogr_recordbatch.h the first time 😅. I think the latest commit is what you were suggesting?

@lnicola
Copy link
Member

lnicola commented Feb 6, 2023

Yeah, the bindgen part looks good, but don't take this as a full approval.

I still have some doubts about the API -- I just skimmed the code, but is there a reason why we can't pass a &mut ArrowArrayStream and let the read_arrow_stream make a pointer from it? I would prefer a safe (as opposed to unsafe) API. The Box in the example might also be unnecessary, I think a stack allocation works just as well.

But I guess some unsafe code will still be required to adapt from this to arrrow-rs or arrow2. We should probably add those crates as optional features and implement conversions from their structs to ours, so that the user can enable a feature and just use safe code.

@kylebarron
Copy link
Contributor Author

is there a reason why we can't pass a &mut ArrowArrayStream and let the read_arrow_stream make a pointer from it?

That's what I was trying to do before this commit d61f2ed (#367), but was getting double-free issues:

read_ogr_arrow(49392,0x1e7680100) malloc: *** error for object 0x60000169c510: pointer being freed was not allocated
read_ogr_arrow(49392,0x1e7680100) malloc: *** set a breakpoint in malloc_error_break to debug

My conclusion was that the first free was from the Box going out of scope, and the second free was from the Drop implemented for when ArrowArrayStream goes out of scope.

I changed the API based on examples I could find using the C Data Interface. For example, Polars in its Arrow-based FFI with Python and pyarrow does a similar process of creating an empty pointer, passing those pointers to the arrow producer (in this example pyarrow._export_to_c), and then importing from those pointers into Rust objects. I modeled this PR after that, where read_arrow_stream is the producer akin to pyarrow._export_to_c.

@kylebarron
Copy link
Contributor Author

I think this discussion in arrow2 is useful background reading for the Arrow FFI API

@kylebarron
Copy link
Contributor Author

kylebarron commented Feb 6, 2023

I would prefer a safe API

One additional thing to note here is that I suppose we can expose a safe API if we depend on arrow2 directly. I think this unsafe API is best if we don't directly link to a library. We could potentially expose both this and a safe linked API?

@lnicola
Copy link
Member

lnicola commented Feb 6, 2023

We could potentially expose both this and a safe linked API?

Yeah, that was my idea. Having some features to allow using types from those two crates.

@kylebarron
Copy link
Contributor Author

kylebarron commented Feb 6, 2023

My main hesitation with that is that it makes the gdal crate more tied to the versioning of arrow and arrow2. arrow has a major release roughly every month (it's annoying but that's how all official arrow crates are versioned) and arrow2 has a pre-1.0 minor release decently often. So gdal would quickly become out of date, right? Whereas the unsafe API would work with any version as long as the C Data Interface doesn't change

@lnicola
Copy link
Member

lnicola commented Feb 6, 2023

  • unsafe: I see what you mean, it's an inherently-unsafe interface because we end up calling get_next and friends. I think we could do something similar to what arrow does, but I didn't try it:
struct ArrowArrayStream(pub Box<ArrowArrayStreamFfi>);

impl ArrowArrayStream {
    unsafe fn new(inner: Box<ArrowArrayStreamFfi>) { Self(inner) }
}

trait LayerAccess: Sized {
    fn read_arrow_stream(&mut self, out_stream: &mut ArrowArrayStream) {
        // ...
    }
}
  • Box: I see, arrow2 wants it for some reason
  • arrow versions; normally we could support multiple versions with features like arrow_30, arrow_31, arrow_32, but it ends up pretty annoying. In any case, I was thinking of something like implementing a constructor or AsMut, so there would still be an escape hatch.
  • not requiring a specific dependency: yeah, you were allocating the stream inside read_arrow_stream, but it should be constructed by the user and passed in
  • the error on free: not sure, the following diff passes Valgrind on my system (but, as mentioned above, I don't think we want it):
diff --git i/examples/read_ogr_arrow.rs w/examples/read_ogr_arrow.rs
index a0a6fcf..fb63cd4 100644
--- i/examples/read_ogr_arrow.rs
+++ w/examples/read_ogr_arrow.rs
@@ -37,14 +37,15 @@ fn run() -> Result<()> {
     let gdal_pointer: *mut gdal::ArrowArrayStream = output_stream_ptr.cast();
 
     // Read the layer's data into our provisioned pointer
-    unsafe { layer_a.read_arrow_stream(gdal_pointer) }
+    let output_stream = unsafe { layer_a.read_arrow_stream() };
 
     // The rest of this example is arrow2-specific.
 
     // `arrow2` has a helper class `ArrowArrayStreamReader` to assist with iterating over the raw
     // batches
-    let mut arrow_stream_reader =
-        unsafe { arrow2::ffi::ArrowArrayStreamReader::try_new(output_stream).unwrap() };
+    let mut arrow_stream_reader = unsafe {
+        arrow2::ffi::ArrowArrayStreamReader::try_new(std::mem::transmute(output_stream)).unwrap()
+    };
 
     // Iterate over the stream until it's finished
     // arrow_stream_reader.next() will return None when the stream has no more data
@@ -84,6 +85,7 @@ fn run() -> Result<()> {
 
         // Access the first row as WKB
         let _wkb_buffer = binary_array.value(0);
+        dbg!(_wkb_buffer);
     }
 
     Ok(())
diff --git i/src/vector/layer.rs w/src/vector/layer.rs
index 99f0c83..ceb6c93 100644
--- i/src/vector/layer.rs
+++ w/src/vector/layer.rs
@@ -481,7 +481,7 @@ pub trait LayerAccess: Sized {
     /// This uses the Arrow C Data Interface to operate on raw pointers provisioned from Rust.
     /// These pointers must be valid and provisioned according to the ArrowArrayStream spec
     #[cfg(all(major_is_3, minor_ge_6))]
-    unsafe fn read_arrow_stream(&mut self, out_stream: *mut gdal_sys::ArrowArrayStream) {
+    fn read_arrow_stream(&mut self) -> Box<gdal_sys::ArrowArrayStream> {
         unsafe {
             let version_check = gdal_sys::GDALCheckVersion(3, 6, std::ptr::null_mut());
             if version_check == 0 {
@@ -491,6 +491,16 @@ pub trait LayerAccess: Sized {
 
         self.reset_feature_reading();
 
+        let mut out_stream = Box::new(gdal_sys::ArrowArrayStream {
+            get_schema: None,
+            get_next: None,
+            get_last_error: None,
+            release: None,
+            private_data: std::ptr::null_mut(),
+        });
+
+        let out_stream_ptr = &mut *out_stream as *mut gdal_sys::ArrowArrayStream;
+
         let options = std::ptr::null_mut();
 
         unsafe {
@@ -500,11 +510,13 @@ pub trait LayerAccess: Sized {
             //     "INCLUDE_FID".as_ptr() as *const libc::c_char,
             //     "NO".as_ptr() as *const libc::c_char,
             // );
-            let success = gdal_sys::OGR_L_GetArrowStream(self.c_layer(), out_stream, options);
+            let success = gdal_sys::OGR_L_GetArrowStream(self.c_layer(), out_stream_ptr, options);
             if !success {
                 panic!("Failed to read arrow data");
             }
         }
+
+        out_stream
     }
 }

@kylebarron
Copy link
Contributor Author

  • Box: I see, arrow2 wants it for some reason

I don't think it has to be boxed, but we don't know ahead of time how big the arrays are that will be produced. So if it's not boxed you might run the risk of a stack overflow?

  • unsafe: I see what you mean, it's an inherently-unsafe interface because we end up calling get_next and friends.

Yeah... maybe it's useful to mark the function as unsafe to make clear that it's an advanced API with invariants that the caller needs to uphold?

I think we could do something similar to what arrow does, but I didn't try it:

If we have our own ArrowArrayStream struct, you'd still need a transmute to convert that into an arrow-rs or arrow2 ArrowArrayStream object, right?

In any case, I was thinking of something like implementing a constructor or AsMut, so there would still be an escape hatch.

I'm still learning and don't really know how to use AsMut/what exactly you're suggesting here.

@lnicola
Copy link
Member

lnicola commented Feb 6, 2023

I don't think it has to be boxed, but we don't know ahead of time how big the arrays are that will be produced. So if it's not boxed you might run the risk of a stack overflow?

I think the reader takes a Box. I don't think the size of the type is a problem, the struct only has a couple of pointers.

Yeah... maybe it's useful to mark the function as unsafe to make clear that it's an advanced API with invariants that the caller needs to uphold?

Not just useful, but required. The user might fill in some bogus pointers (that's safe!), we would pass them on to GDAL, and the program would crash. As long as we don't construct the stream (which would mean depending on a specific Arrow crate), it needs to be unsafe.

If we have our own ArrowArrayStream struct, you'd still need a transmute to convert that into an arrow-rs or arrow2 ArrowArrayStream object, right?

We could add some conversion functions, but yes.

I'm still learning and don't really know how to use AsMut/what exactly you're suggesting here.

Taking an S: AsMut<ArrowArrayStream> and doing stream.as_mut() as *mut _ to get a pointer. Then we could implement that for the types in the two crates, using transmutes. But it doesn't matter because of the second point.

@jdroenner
Copy link
Member

i guess this is really difficult to decide. I don't want to depend on Arrow or Arrow2. If we add a dependency it should be Arrow since that's the official one. However, Arrow versions are really messy to handle. So i would suggest to add the unsafe method and find a safe solution later? Maybe arrow/arrow2 will converge at some point...

@lnicola
Copy link
Member

lnicola commented Feb 6, 2023

Agreed, let's take a pointer and we can do a feature gate if we end up with a nicer API for arrow2. Maybe the only question is whether we want to offer a wrapper like in my comment above.

src/lib.rs Outdated Show resolved Hide resolved
src/vector/layer.rs Outdated Show resolved Hide resolved
@kylebarron
Copy link
Contributor Author

kylebarron commented Feb 6, 2023

it should be Arrow since that's the official one

The Rust Arrow ecosystem seems pretty evenly divided between arrow-rs and arrow2. Some really large projects like polars depend on arrow2 while other big projects like datafusion depend on arrow-rs. I think the API is best to depend on neither.

let options = std::ptr::null_mut();

unsafe {
// Note to self: example of how to operate on options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a nicer way to do this: a68aba7#diff-735b4e8f56e7d6ad6abe40c99808df64a2b2e05b2b13f976303cefe2dca42affR174.

You might want to remove or update the commented-out code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there are areas where CslStringList is passed in through the public API, so I'll update this to have the user pass that in

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to Option<&CslStringList>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's drop the Option, Geometry::make_valid ended up not using one. An empty CslStringList gets passed as a null pointer, so there shouldn't be any downsides.

And can you squash the commits if it's not too much of a hassle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took out the Option and squashed

// "NO".as_ptr() as *const libc::c_char,
// );
let success = gdal_sys::OGR_L_GetArrowStream(self.c_layer(), out_stream, options);
if !success {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why some of these functions return bool while others return OGRErr 😕.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♂️ it is indeed a little surprising

Copy link
Member

@lnicola lnicola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a nit about a comment

@kylebarron
Copy link
Contributor Author

Do you know the best way to fix this CI? I don't know how to have the example only exist for gdal 3.6+

@lnicola
Copy link
Member

lnicola commented Feb 8, 2023

The example seems fine now, it's just Clippy complaining about an unused import.

@kylebarron
Copy link
Contributor Author

The example seems fine now, it's just Clippy complaining about an unused import.

That's my issue... On GDAL 3.6 the example runs fine, but I need to have the example not run on <3.6 and not get checked by clippy on <3.6, or else clippy will think the imports aren't being used

@lnicola
Copy link
Member

lnicola commented Feb 8, 2023

It's complaining about layer.rs, not about the example. The feature gate in the example looks good.

@kylebarron
Copy link
Contributor Author

I see; a few of the previous commits had clippy errors on the example

@lnicola
Copy link
Member

lnicola commented Feb 8, 2023

bors r+

Thanks for the patience and for working on this. We might have Arrow support before GDAL 3.7 comes out! 😃

bors bot added a commit that referenced this pull request Feb 8, 2023
367: Implement support for RFC 86: Column-oriented read API for vector layers r=lnicola a=kylebarron

- [x] I agree to follow the project's [code of conduct](https://github.com/georust/gdal/blob/master/CODE_OF_CONDUCT.md).
- [x] I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users.
---

### Description

This is a pretty low-level/advanced function, but is very useful for performance when reading (and maybe in the future writing) from OGR into columnar memory. 

This function operates on an `ArrowArrayStream` struct that needs to be passed in. Most of the time, users will be using a helper library for this, like [`arrow-rs`](https://github.com/apache/arrow-rs) or [`arrow2`](https://github.com/jorgecarleitao/arrow2). The nice part about this API is that this crate does _not_ need to declare those as dependencies.



The [OGR guide](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface) is very helpful reading. Would love someone to double-check this PR in context of this paragraph:

> There are extra precautions to take into account in a OGR context. Unless otherwise specified by a particular driver implementation, the ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing). Furthermore, unless otherwise specified by a particular driver implementation, only one ArrowArrayStream can be active at a time on a given layer (that is the last active one must be explicitly released before a next one is asked). Changing filter state, ignored columns, modifying the schema or using ResetReading()/GetNextFeature() while using a ArrowArrayStream is strongly discouraged and may lead to unexpected results. As a rule of thumb, no OGRLayer methods that affect the state of a layer should be called on a layer, while an ArrowArrayStream on it is active.


### Change list

- Copy in `arrow_bridge.h` with the Arrow C Data Interface headers. 
- Add `arrow_bridge.h` to the bindgen script so that `gdal_3.6.rs` includes a definition for `ArrowArrayStream`. I re-ran this locally; I'm not sure why there's such a big diff. Maybe I need to run this from `3.6.0` instead of `3.6.2`?
- Implement `read_arrow_stream`
- Add example of reading arrow data to [`arrow2`](https://docs.rs/arrow2)


### Todo

- Pass in options to `OGR_L_GetArrowStream`? According to the guide:

	> The `papszOptions` that may be provided is a NULL terminated list of key=value strings, that may be driver specific.

	So maybe we should have an `options: Option<Vec<(String, String)>>` argument? Pyogrio [uses this](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1090-L1091) to turn off generating an `fid` for every row.

- Have an option to skip reading some columns. Pyogrio does this with [calls to](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1081-L1088) `OGR_L_SetIgnoredFields`. 

### References

- [OGR Guide for using the C Data Interface](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface)

Closes #280

Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
@lnicola
Copy link
Member

lnicola commented Feb 8, 2023

bors r-

@bors
Copy link
Contributor

bors bot commented Feb 8, 2023

Canceled.

@kylebarron
Copy link
Contributor Author

Ok indeed so I fixed that clippy error in layer.rs and now it found all the previous clippy errors with my conditional compilation in the example 😅

@lnicola
Copy link
Member

lnicola commented Feb 8, 2023

For the example, you can move the imports inside the function and make another version of it with the same feature gate, but negated. Then you can drop the check.

@lnicola
Copy link
Member

lnicola commented Feb 8, 2023

bors d+

(And please don't forget to squash at the end)

@bors
Copy link
Contributor

bors bot commented Feb 8, 2023

✌️ kylebarron can now approve this pull request. To approve and merge a pull request, simply reply with bors r+. More detailed instructions are available here.

@kylebarron
Copy link
Contributor Author

kylebarron commented Feb 8, 2023

Ok I think I followed your instructions correctly (hard to test because I have gdal 3.6 locally). And I squashed again (sorry, I'm used to just merging with a squash commit so I forget).

I've never used bors before so not exactly sure how that works

@lnicola
Copy link
Member

lnicola commented Feb 8, 2023

bors r+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for RFC 86: Column-oriented read API for vector layers
3 participants