Merge #367
367: Implement support for RFC 86: Column-oriented read API for vector layers r=lnicola a=kylebarron

- [x] I agree to follow the project's [code of conduct](https://github.com/georust/gdal/blob/master/CODE_OF_CONDUCT.md).
- [x] I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users.
---

### Description

This is a fairly low-level, advanced function, but it is very useful for performance when reading (and perhaps, in the future, writing) from OGR into columnar memory.

This function operates on an `ArrowArrayStream` struct that needs to be passed in. Most of the time, users will be using a helper library for this, like [`arrow-rs`](https://github.com/apache/arrow-rs) or [`arrow2`](https://github.com/jorgecarleitao/arrow2). The nice part about this API is that this crate does _not_ need to declare those as dependencies.
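To illustrate why no Arrow dependency is needed: the Arrow C stream interface fixes `ArrowArrayStream` as a plain C struct of four callbacks plus a private data pointer, so any crate can declare its own copy and exchange pointers to it. The sketch below is a hand-written mirror of that layout for illustration only; it is not the bindgen output in `gdal_3.6.rs`, and the helper name `cast_stream_ptr` is hypothetical.

```rust
use std::os::raw::{c_char, c_int, c_void};

/// Opaque stand-ins: the stream struct only ever holds pointers to these,
/// so their real contents do not matter for this sketch.
#[repr(C)]
pub struct ArrowSchema {
    _private: [u8; 0],
}

#[repr(C)]
pub struct ArrowArray {
    _private: [u8; 0],
}

/// Field-for-field mirror of `struct ArrowArrayStream` from the
/// Arrow C stream interface: four callbacks and one private pointer.
#[repr(C)]
pub struct ArrowArrayStream {
    pub get_schema:
        Option<unsafe extern "C" fn(*mut ArrowArrayStream, *mut ArrowSchema) -> c_int>,
    pub get_next:
        Option<unsafe extern "C" fn(*mut ArrowArrayStream, *mut ArrowArray) -> c_int>,
    pub get_last_error: Option<unsafe extern "C" fn(*mut ArrowArrayStream) -> *const c_char>,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArrayStream)>,
    pub private_data: *mut c_void,
}

impl ArrowArrayStream {
    /// An empty stream: every callback is null, so there is nothing to release.
    pub fn empty() -> Self {
        Self {
            get_schema: None,
            get_next: None,
            get_last_error: None,
            release: None,
            private_data: std::ptr::null_mut(),
        }
    }
}

/// Hypothetical helper: reinterpret a pointer to this declaration as a pointer
/// to another crate's declaration `T` of the same C struct. The cast itself is
/// safe; dereferencing the result is only sound if `T` has the identical layout.
pub fn cast_stream_ptr<T>(stream: &mut ArrowArrayStream) -> *mut T {
    (stream as *mut ArrowArrayStream).cast()
}
```

Because every declaration of this struct has the same layout, a consumer library can allocate the stream and hand a cast pointer to the producer, which is the pattern the example in this PR relies on.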



The [OGR guide](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface) is very helpful reading. I would love for someone to double-check this PR in the context of this paragraph:

> There are extra precautions to take into account in a OGR context. Unless otherwise specified by a particular driver implementation, the ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing). Furthermore, unless otherwise specified by a particular driver implementation, only one ArrowArrayStream can be active at a time on a given layer (that is the last active one must be explicitly released before a next one is asked). Changing filter state, ignored columns, modifying the schema or using ResetReading()/GetNextFeature() while using a ArrowArrayStream is strongly discouraged and may lead to unexpected results. As a rule of thumb, no OGRLayer methods that affect the state of a layer should be called on a layer, while an ArrowArrayStream on it is active.
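One way to think about these precautions in Rust terms: if an active stream held a mutable borrow of its layer, the borrow checker would reject both a second stream and any state-changing layer call while the first stream is alive. The sketch below uses mock types only. `MockLayer`, `ActiveStream`, and their methods are hypothetical and are not part of this PR, which instead exposes an `unsafe` function that takes a raw pointer.

```rust
/// Hypothetical stand-in for an OGR layer (not the `gdal` crate's `Layer`).
pub struct MockLayer {
    pub stream_active: bool,
    pub streams_opened: u32,
}

/// Hypothetical guard for an active Arrow stream. It holds `&mut MockLayer`,
/// so the layer cannot be touched or dropped while the guard exists.
pub struct ActiveStream<'a> {
    layer: &'a mut MockLayer,
}

impl MockLayer {
    pub fn new() -> Self {
        Self {
            stream_active: false,
            streams_opened: 0,
        }
    }

    /// Start a stream. The returned guard borrows the layer mutably, so a
    /// second call while the first guard is alive fails to compile.
    pub fn open_stream(&mut self) -> ActiveStream<'_> {
        self.stream_active = true;
        self.streams_opened += 1;
        ActiveStream { layer: self }
    }

    /// Stand-in for a state-changing call such as `ResetReading()`.
    /// It needs `&mut self`, so it is unavailable while a guard is alive.
    pub fn reset_reading(&mut self) {}
}

impl ActiveStream<'_> {
    pub fn is_active(&self) -> bool {
        self.layer.stream_active
    }
}

impl Drop for ActiveStream<'_> {
    fn drop(&mut self) {
        // A real binding would invoke the stream's `release` callback here.
        self.layer.stream_active = false;
    }
}
```

With this shape, calling `layer.reset_reading()` or `layer.open_stream()` between creating a guard and dropping it is a compile error, which matches the rule of thumb quoted above. Whether such a wrapper is practical for real consumers (which usually take ownership of the stream) is an open question.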


### Change list

- Copy in `arrow_bridge.h` with the Arrow C Data Interface headers. 
- Add `arrow_bridge.h` to the bindgen script so that `gdal_3.6.rs` includes a definition for `ArrowArrayStream`. I re-ran this locally; I'm not sure why there's such a big diff. Maybe I need to run this from `3.6.0` instead of `3.6.2`?
- Implement `read_arrow_stream`
- Add an example of reading Arrow data into [`arrow2`](https://docs.rs/arrow2)


### Todo

- Pass in options to `OGR_L_GetArrowStream`? According to the guide:

	> The `papszOptions` that may be provided is a NULL terminated list of key=value strings, that may be driver specific.

	So maybe we should have an `options: Option<Vec<(String, String)>>` argument? Pyogrio [uses this](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1090-L1091) to turn off generating an `fid` for every row.

- Have an option to skip reading some columns. Pyogrio does this with [calls to](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1081-L1088) `OGR_L_SetIgnoredFields`. 
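
As a rough illustration of the first item, this is one way a list of key/value pairs could be turned into the NULL-terminated array of `key=value` C strings that `papszOptions` expects. It is a standalone sketch using only the standard library; the type name `OptionList` is hypothetical. The example in this PR uses the crate's own `CslStringList` instead.

```rust
use std::ffi::{CString, NulError};
use std::os::raw::c_char;

/// Hypothetical owner of a NULL-terminated `key=value` string list.
/// The `CString`s are kept alive for as long as the pointer array is used.
pub struct OptionList {
    _strings: Vec<CString>,
    pointers: Vec<*const c_char>,
}

impl OptionList {
    /// Build the list; fails if a key or value contains an interior NUL byte.
    pub fn new(options: &[(&str, &str)]) -> Result<Self, NulError> {
        let strings = options
            .iter()
            .map(|(key, value)| CString::new(format!("{}={}", key, value)))
            .collect::<Result<Vec<_>, _>>()?;

        // Each pointer refers to a CString's heap buffer, which does not move
        // when the owning Vec is moved into the struct below.
        let mut pointers: Vec<*const c_char> = strings.iter().map(|s| s.as_ptr()).collect();
        // The C side detects the end of the list by this trailing NULL.
        pointers.push(std::ptr::null());

        Ok(Self {
            _strings: strings,
            pointers,
        })
    }

    /// Pointer suitable for a `const char* const*` style argument.
    pub fn as_ptr(&self) -> *const *const c_char {
        self.pointers.as_ptr()
    }

    /// Number of options, not counting the terminating NULL.
    pub fn len(&self) -> usize {
        self.pointers.len() - 1
    }
}
```

For example, `OptionList::new(&[("INCLUDE_FID", "NO")])` yields a two-element pointer array: one entry for `INCLUDE_FID=NO` and the terminating NULL.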

### References

- [OGR Guide for using the C Data Interface](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface)

Closes #280

Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
bors[bot] and kylebarron committed Feb 8, 2023
2 parents bd8f877 + 2e8214b commit c7f88b5
Showing 7 changed files with 797 additions and 396 deletions.
8 changes: 6 additions & 2 deletions CHANGES.md
@@ -34,10 +34,14 @@

- <https://github.com/georust/gdal/pull/355>

- Exposed various functions on `Geometry`: `make_valid`, `geometry_name`, and `point_count`.

- <https://github.com/georust/gdal/pull/356>

- Exposed `read_arrow_stream` on `Layer` to access OGR's columnar reading API.

- <https://github.com/georust/gdal/pull/367>

## 0.14

- Added new content to `README.md` and the root docs.
2 changes: 2 additions & 0 deletions Cargo.toml
@@ -32,6 +32,8 @@ semver = "1.0"

[dev-dependencies]
tempfile = "3.3"
# Only used in the example
arrow2 = "0.15"

[workspace]
members = ["gdal-sys"]
102 changes: 102 additions & 0 deletions examples/read_ogr_arrow.rs
@@ -0,0 +1,102 @@
//! Example of reading from OGR to a stream of Arrow arrays
//!
//! As of this writing (Feb 2023), there are two competing low-level Arrow libraries in Rust.
//! [`arrow-rs`](https://github.com/apache/arrow-rs) is the "official" one but uses unsafe
//! transmutes. [`arrow2`](https://github.com/jorgecarleitao/arrow2) was written to be a fully safe
//! implementation of Arrow.
//!
//! Each library implements the same Arrow memory standard, and each implements the
//! ArrowArrayStream interface, so each can integrate with the GDAL `read_arrow_stream` API.
//!
//! This example will use `arrow2` but the process should be similar using `arrow-rs`.

#[cfg(any(major_ge_4, all(major_is_3, minor_ge_6)))]
fn run() -> gdal::errors::Result<()> {
    use arrow2::array::{BinaryArray, StructArray};
    use arrow2::datatypes::DataType;
    use gdal::cpl::CslStringList;
    use gdal::vector::*;
    use gdal::Dataset;
    use std::path::Path;

    // Open a dataset and access a layer
    let dataset_a = Dataset::open(Path::new("fixtures/roads.geojson"))?;
    let mut layer_a = dataset_a.layer(0)?;

    // Instantiate an `ArrowArrayStream` for OGR to write into
    let mut output_stream = Box::new(arrow2::ffi::ArrowArrayStream::empty());

    // Access the unboxed pointer
    let output_stream_ptr = &mut *output_stream as *mut arrow2::ffi::ArrowArrayStream;

    // gdal includes its own copy of the ArrowArrayStream struct definition. These are guaranteed
    // to be the same across implementations, but we need to manually cast between the two for Rust
    // to allow it.
    let gdal_pointer: *mut gdal::ArrowArrayStream = output_stream_ptr.cast();

    let mut options = CslStringList::new();
    options.set_name_value("INCLUDE_FID", "NO")?;

    // Read the layer's data into our provisioned pointer
    unsafe { layer_a.read_arrow_stream(gdal_pointer, &options).unwrap() }

    // The rest of this example is arrow2-specific.

    // `arrow2` has a helper class `ArrowArrayStreamReader` to assist with iterating over the raw
    // batches
    let mut arrow_stream_reader =
        unsafe { arrow2::ffi::ArrowArrayStreamReader::try_new(output_stream).unwrap() };

    // Iterate over the stream until it's finished
    // arrow_stream_reader.next() will return None when the stream has no more data
    while let Some(maybe_array) = unsafe { arrow_stream_reader.next() } {
        // Access the contained array
        let top_level_array = maybe_array.unwrap();

        // The top-level array is a single logical "struct" array which includes all columns of the
        // dataset inside it.
        assert!(
            matches!(top_level_array.data_type(), DataType::Struct(..)),
            "Top-level arrays from OGR are expected to be of struct type"
        );

        // Downcast from the Box<dyn Array> to a concrete StructArray
        let struct_array = top_level_array
            .as_any()
            .downcast_ref::<StructArray>()
            .unwrap();

        // Access the underlying column metadata and data
        // Clones are cheap because they do not copy the underlying data
        let (fields, columns, _validity) = struct_array.clone().into_data();

        // Find the index of the geometry column
        let geom_column_index = fields
            .iter()
            .position(|field| field.name == "wkb_geometry")
            .unwrap();

        // Pick that column and downcast to a BinaryArray
        let geom_column = &columns[geom_column_index];
        let binary_array = geom_column
            .as_any()
            .downcast_ref::<BinaryArray<i32>>()
            .unwrap();

        // Access the first row as WKB
        let _wkb_buffer = binary_array.value(0);

        println!("Number of geometries: {}", binary_array.len());
    }

    Ok(())
}

#[cfg(not(any(major_ge_4, all(major_is_3, minor_ge_6))))]
fn run() -> gdal::errors::Result<()> {
    Ok(())
}

fn main() {
    run().unwrap();
}
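
The example stops at `_wkb_buffer`, a raw WKB byte slice. As a small follow-up sketch (not part of this commit), the first five bytes of a WKB value can be inspected without any geometry library: byte 0 is the byte order (0 for big-endian, 1 for little-endian) and the next four bytes are the geometry type code. The names `WkbHeader` and `read_wkb_header` are hypothetical, and the function returns the raw code only, without interpreting Z/M or SRID variants.

```rust
/// Byte order flag and raw geometry type code from the start of a WKB value.
#[derive(Debug, PartialEq)]
pub struct WkbHeader {
    pub little_endian: bool,
    pub geometry_type: u32,
}

/// Hypothetical helper: parse the 5-byte WKB header.
/// Returns `None` if the buffer is too short or the byte-order flag is invalid.
pub fn read_wkb_header(wkb: &[u8]) -> Option<WkbHeader> {
    if wkb.len() < 5 {
        return None;
    }

    // Byte 0: 0 = big-endian (XDR), 1 = little-endian (NDR).
    let little_endian = match wkb[0] {
        0 => false,
        1 => true,
        _ => return None,
    };

    // Bytes 1..5: geometry type code, encoded in the declared byte order.
    let type_bytes = [wkb[1], wkb[2], wkb[3], wkb[4]];
    let geometry_type = if little_endian {
        u32::from_le_bytes(type_bytes)
    } else {
        u32::from_be_bytes(type_bytes)
    };

    Some(WkbHeader {
        little_endian,
        geometry_type,
    })
}
```

In standard WKB, type code 1 is a Point and 2 is a LineString, so calling `read_wkb_header(_wkb_buffer)` inside the loop would be a cheap sanity check on what the geometry column contains.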
