Explore decomposed branchless loop WKB parsing

The `WkbHeader` will be a great optimization for a number of things we'd like to do! Another technique potentially worth exploring to optimize these very cheap functions is decomposing the parsing into a series of branchless loops. The compiler has more options to optimize a loop that contains absolutely zero `if` `match` or `||/&&`s (e.g. SIMD autovectorization). I've prototyped this in C++ before and found order-of-magnitude speed improvements for Binary arrays (probably less helpful for BinaryView arrays, where there is always an if statement to access the data).

```rust
let mut n_valid_size = 0;
for item in not_null_array {
  n_valid_size += item.len() >= 5;
}

if n_valid_size != not_null_array.len() {
  return do_the_slow_version();
}

let mut n_little_endian = 0;
for item in not_null_array {
  n_little_endian += item[0] == 0x01;
}

if n_little_endian != not_null_array.len() {
  return do_the_slow_version();
}

let mut geometry_types: u32 = 0;
let mut geometry_type_bytes: [u8; 4];
for item in not_null_array {
  geometry_type_bytes.copy_from_slice(&item[1..5]);
  let geometry_type_id = u32::from_be_bytes(&geometry_type_bytes);
  geometry_types |= 1_u32 << geometry_type_id & 0x07;
}

// Potentially do something faster if all we have are points
```

There comes a point where the multiple branchless loops become slower than a single loop with branching...I haven't experimented with this enough to know where that point is.

For functions that only operate on points, there is also a cool optimization you can do for Binary (not BinaryView) arrays: because the data for arrays with zero nulls are all lined up consecutively in memory (i.e., one WKB item after another all in the same buffer), you can loop through X values like this (once you've validated the input using the above):

```rust
let mut offset = first_x_offset;
let mut x_bytes: [u8; 8];
for i in 0..array.len() {
  x_bytes.copy_from_slice(data_buffer[offset..(offset + 8)]);
  let x = f64::from_le_bytes(x_bytes);
  offset += 21;
}
```

My initial explorations of this are here: https://github.com/geoarrow/geoarrow-c/blob/28eca0fea6f47c70113dc1719e7597e53bede461/dev/benchmarks/c/wkb_bounding_benchmark.cc#L408-L450 . Specifically for the bounding operation, the initial numbers suggested that this was able to trigger simd for at least the bounding operation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Explore decomposed branchless loop WKB parsing #201

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Explore decomposed branchless loop WKB parsing #201

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions