Skip to content

Explore decomposed branchless loop WKB parsing #201

@paleolimbot

Description

@paleolimbot

The WkbHeader will be a great optimization for a number of things we'd like to do! Another technique potentially worth exploring to optimize these very cheap functions is decomposing the parsing into a series of branchless loops. The compiler has more options to optimize a loop that contains absolutely zero if match or ||/&&s (e.g. SIMD autovectorization). I've prototyped this in C++ before and found order-of-magnitude speed improvements for Binary arrays (probably less helpful for BinaryView arrays, where there is always an if statement to access the data).

let mut n_valid_size = 0;
for item in not_null_array {
  n_valid_size += item.len() >= 5;
}

if n_valid_size != not_null_array.len() {
  return do_the_slow_version();
}

let mut n_little_endian = 0;
for item in not_null_array {
  n_little_endian += item[0] == 0x01;
}

if n_little_endian != not_null_array.len() {
  return do_the_slow_version();
}

let mut geometry_types: u32 = 0;
let mut geometry_type_bytes: [u8; 4];
for item in not_null_array {
  geometry_type_bytes.copy_from_slice(&item[1..5]);
  let geometry_type_id = u32::from_be_bytes(&geometry_type_bytes);
  geometry_types |= 1_u32 << geometry_type_id & 0x07;
}

// Potentially do something faster if all we have are points

There comes a point where the multiple branchless loops become slower than a single loop with branching...I haven't experimented with this enough to know where that point is.

For functions that only operate on points, there is also a cool optimization you can do for Binary (not BinaryView) arrays: because the data for arrays with zero nulls are all lined up consecutively in memory (i.e., one WKB item after another all in the same buffer), you can loop through X values like this (once you've validated the input using the above):

let mut offset = first_x_offset;
let mut x_bytes: [u8; 8];
for i in 0..array.len() {
  x_bytes.copy_from_slice(data_buffer[offset..(offset + 8)]);
  let x = f64::from_le_bytes(x_bytes);
  offset += 21;
}

My initial explorations of this are here: https://github.com/geoarrow/geoarrow-c/blob/28eca0fea6f47c70113dc1719e7597e53bede461/dev/benchmarks/c/wkb_bounding_benchmark.cc#L408-L450 . Specifically for the bounding operation, the initial numbers suggested that this was able to trigger simd for at least the bounding operation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions