Support for RFC 86: Column-oriented read API for vector layers #280

Closed
kylebarron opened this issue Jul 16, 2022 · 3 comments · Fixed by #367

Comments

@kylebarron
Contributor

GDAL 3.6 added support for a column-oriented API in RFC 86. This is a feature request to add an API for this in the Rust bindings.

For higher-level bindings to GDAL, such as from Python, this API is a big performance improvement because it moves the row-to-columnar conversion loop into C. I don't know enough about how Rust-C bindings work to say whether it would also outperform an equivalent loop written in Rust. Regardless, for a Rust application that wants to use Arrow memory, reusing the GDAL implementation would be the most ergonomic option.

@kylebarron
Contributor Author

Adding references to other implementations:

@phayes
Contributor

phayes commented Nov 11, 2022

@kylebarron, do you know if this column-oriented API is available in GDAL for all formats, or just for column-oriented formats?

@kylebarron
Contributor Author

It's available for all formats. From this part of the RFC:

  • For Arrow and Parquet it's virtually zero cost
  • For FlatGeobuf and GeoPackage there's a specialized implementation that's faster than going through the normal OGRFeature abstraction
  • For other formats, OGR does the row -> columnar transpose automatically
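
To make the "row -> columnar transpose" in the last bullet concrete, here is a minimal sketch in plain Rust. The `Row` and `Columns` types and the `transpose` function are hypothetical and only illustrate the idea; they are not GDAL or `gdal` crate APIs, and OGR's real implementation works on `OGRFeature` objects and Arrow buffers.

```rust
/// Hypothetical feature-at-a-time record: one value per field, per feature.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: i64,
    name: Option<String>,
}

/// The same data laid out by column: one vector per field.
#[derive(Debug, PartialEq)]
struct Columns {
    id: Vec<i64>,
    name: Vec<Option<String>>,
}

/// Walks the rows once and appends each field to its column.
fn transpose(rows: &[Row]) -> Columns {
    let mut cols = Columns {
        id: Vec::with_capacity(rows.len()),
        name: Vec::with_capacity(rows.len()),
    };
    for row in rows {
        cols.id.push(row.id);
        cols.name.push(row.name.clone());
    }
    cols
}
```

With RFC 86, this loop runs inside GDAL instead of in the calling code.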

bors bot added a commit that referenced this issue Feb 8, 2023
367: Implement support for RFC 86: Column-oriented read API for vector layers r=lnicola a=kylebarron

- [x] I agree to follow the project's [code of conduct](https://github.com/georust/gdal/blob/master/CODE_OF_CONDUCT.md).
- [x] I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users.
---

### Description

This is a fairly low-level, advanced function, but it is very useful for performance when reading from OGR into columnar memory (and perhaps, in the future, when writing).

This function operates on an `ArrowArrayStream` struct that needs to be passed in. Most of the time, users will be using a helper library for this, like [`arrow-rs`](https://github.com/apache/arrow-rs) or [`arrow2`](https://github.com/jorgecarleitao/arrow2). The nice part about this API is that this crate does _not_ need to declare those as dependencies.
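
For readers unfamiliar with the struct, the sketch below mirrors the `ArrowArrayStream` layout from the Arrow C stream interface as a `#[repr(C)]` Rust type. It is plain C ABI, which is why no Arrow crate is needed to name it. `ArrowSchema` and `ArrowArray` are left as opaque placeholders here, and the `empty` and `is_released` helpers are hypothetical additions for illustration; in this crate the real definition is generated by bindgen from `arrow_bridge.h`.

```rust
use std::os::raw::{c_char, c_int, c_void};
use std::ptr;

/// Opaque placeholder; the full layout is defined by the Arrow C Data Interface.
#[repr(C)]
pub struct ArrowSchema {
    _private: [u8; 0],
}

/// Opaque placeholder; the full layout is defined by the Arrow C Data Interface.
#[repr(C)]
pub struct ArrowArray {
    _private: [u8; 0],
}

/// Four callbacks plus a producer-owned pointer, in the order the spec gives.
#[repr(C)]
pub struct ArrowArrayStream {
    pub get_schema:
        Option<unsafe extern "C" fn(*mut ArrowArrayStream, *mut ArrowSchema) -> c_int>,
    pub get_next:
        Option<unsafe extern "C" fn(*mut ArrowArrayStream, *mut ArrowArray) -> c_int>,
    pub get_last_error: Option<unsafe extern "C" fn(*mut ArrowArrayStream) -> *const c_char>,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArrayStream)>,
    pub private_data: *mut c_void,
}

impl ArrowArrayStream {
    /// Hypothetical helper: a blank struct for a producer to fill in.
    pub fn empty() -> Self {
        ArrowArrayStream {
            get_schema: None,
            get_next: None,
            get_last_error: None,
            release: None,
            private_data: ptr::null_mut(),
        }
    }

    /// Per the spec, a stream whose `release` callback is NULL counts as released.
    pub fn is_released(&self) -> bool {
        self.release.is_none()
    }
}
```

A helper library such as `arrow-rs` or `arrow2` would normally own this struct and turn the batches it yields into typed arrays.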



The [OGR guide](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface) is very helpful reading. I would love for someone to double-check this PR against the following paragraph:

> There are extra precautions to take into account in a OGR context. Unless otherwise specified by a particular driver implementation, the ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing). Furthermore, unless otherwise specified by a particular driver implementation, only one ArrowArrayStream can be active at a time on a given layer (that is the last active one must be explicitly released before a next one is asked). Changing filter state, ignored columns, modifying the schema or using ResetReading()/GetNextFeature() while using a ArrowArrayStream is strongly discouraged and may lead to unexpected results. As a rule of thumb, no OGRLayer methods that affect the state of a layer should be called on a layer, while an ArrowArrayStream on it is active.
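
One possible way to encode these rules in a safe wrapper is to let the borrow checker enforce them. The sketch below is only an illustration under that assumption: `Layer`, `ArrowStreamGuard`, and `arrow_stream` are hypothetical names, not types from this PR or the `gdal` crate. The guard holds a mutable borrow of the layer, so a second stream cannot be requested, and no state-changing layer method can be called, until the guard is dropped.

```rust
/// Hypothetical stand-in for an OGR layer; not the `gdal` crate's type.
struct Layer {
    // Counts released streams so the sketch has something observable.
    released_streams: usize,
}

/// Hypothetical guard that exclusively borrows the layer while a stream is active.
struct ArrowStreamGuard<'a> {
    layer: &'a mut Layer,
}

impl Layer {
    /// Taking `&mut self` means only one guard can exist per layer at a time,
    /// and the guard cannot outlive the layer.
    fn arrow_stream(&mut self) -> ArrowStreamGuard<'_> {
        ArrowStreamGuard { layer: self }
    }
}

impl Drop for ArrowStreamGuard<'_> {
    fn drop(&mut self) {
        // A real wrapper would call the stream's `release` callback here.
        self.layer.released_streams += 1;
    }
}
```

Requesting a second guard while the first is still alive is a compile error, which matches the "only one ArrowArrayStream can be active at a time" rule quoted above.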


### Change list

- Copy in `arrow_bridge.h` with the Arrow C Data Interface headers. 
- Add `arrow_bridge.h` to the bindgen script so that `gdal_3.6.rs` includes a definition for `ArrowArrayStream`. I re-ran this locally; I'm not sure why there's such a big diff. Maybe I need to run this from `3.6.0` instead of `3.6.2`?
- Implement `read_arrow_stream`
- Add example of reading arrow data to [`arrow2`](https://docs.rs/arrow2)


### Todo

- Pass in options to `OGR_L_GetArrowStream`? According to the guide:

	> The `papszOptions` that may be provided is a NULL terminated list of key=value strings, that may be driver specific.

	So maybe we should have an `options: Option<Vec<(String, String)>>` argument? Pyogrio [uses this](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1090-L1091) to turn off generating an `fid` for every row.

- Have an option to skip reading some columns. Pyogrio does this with [calls to](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1081-L1088) `OGR_L_SetIgnoredFields`. 
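
Both todo items come down to handing GDAL a NULL-terminated array of C strings: `key=value` pairs for `papszOptions`, and field names for `OGR_L_SetIgnoredFields`. A minimal sketch of that conversion follows. `CStringList`, `stream_options`, and `ignored_fields` are hypothetical helpers written for illustration, not part of this PR, and the FFI calls themselves are left out.

```rust
use std::ffi::{CString, NulError};
use std::os::raw::c_char;
use std::ptr;

/// Owns C strings plus a NULL-terminated pointer array over them,
/// the shape expected by `char **` list parameters in the GDAL C API.
struct CStringList {
    // Keeps the string buffers alive while `pointers` is in use.
    _strings: Vec<CString>,
    pointers: Vec<*const c_char>,
}

impl CStringList {
    /// Fails if any item contains an interior NUL byte.
    fn new<I: IntoIterator<Item = String>>(items: I) -> Result<Self, NulError> {
        let strings = items
            .into_iter()
            .map(CString::new)
            .collect::<Result<Vec<_>, _>>()?;
        let mut pointers: Vec<*const c_char> = strings.iter().map(|s| s.as_ptr()).collect();
        // The trailing NULL marks the end of the list for the C side.
        pointers.push(ptr::null());
        Ok(CStringList { _strings: strings, pointers })
    }

    /// Pointer to the first entry, valid for as long as `self` is alive.
    fn as_ptr(&self) -> *const *const c_char {
        self.pointers.as_ptr()
    }
}

/// Formats `(key, value)` pairs as `key=value` strings.
fn stream_options(options: &[(String, String)]) -> Vec<String> {
    options.iter().map(|(k, v)| format!("{}={}", k, v)).collect()
}

/// Returns every field that is not in `wanted`, i.e. the list to ignore.
fn ignored_fields(all_fields: &[&str], wanted: &[&str]) -> Vec<String> {
    all_fields
        .iter()
        .filter(|f| !wanted.contains(*f))
        .map(|f| (*f).to_string())
        .collect()
}
```

For example, `stream_options(&[("INCLUDE_FID".to_string(), "NO".to_string())])` yields the single entry `INCLUDE_FID=NO`, the option that turns off the per-row `fid`.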

### References

- [OGR Guide for using the C Data Interface](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface)

Closes #280

Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
bors bot closed this as completed in c7f88b5 on Feb 8, 2023