Skip to content

perf: Use bulk-NULL builders in split and substring #4397

@mbutrovich

Description

@mbutrovich

What is the problem the feature request solves?

Several DataFusion PRs from @neilconway (#21789, #21849, #21863, #21845, #21847, #21854, #21877, #21991, #22029, #21366, #21519) replaced per-row null buffer maintenance in string functions with bulk null handling. The pattern: when an output array's null mask is identical to (or a simple function of) the input's, don't append_null/append(true/false) every iteration. Either clone the input NullBuffer up front or compute it once with NullBuffer::union[_many]. Reported wins were 3-15% on microbenchmarks.

Two of Comet's native string functions still follow the old per-row pattern.

1. native/spark-expr/src/string_funcs/split.rs

Both split_array (lines 136-161) and split_large_string_array (lines 173-198) maintain a per-row BooleanBufferBuilder even though split produces NULL output exactly when the input is NULL. The output null buffer is just string_array.nulls().cloned().

let mut null_buffer_builder = arrow::array::BooleanBufferBuilder::new(string_array.len());
for i in 0..string_array.len() {
    if string_array.is_null(i) {
        offsets.push(offsets[i]);
        null_buffer_builder.append(false);
    } else {
        ...
        null_buffer_builder.append(true);
    }
}

2. native/spark-expr/src/string_funcs/substring.rs:138-191

spark_substring_negative_start uses GenericStringBuilder / GenericBinaryBuilder with per-row append_null() across four arms (Utf8, LargeUtf8, Binary, LargeBinary). Output nulls equal input nulls. Same anti-pattern as DataFusion's substr before #21519/#21366.

Describe the potential solution

Adopt the bulk-NULL pattern in both files:

  1. Clone the input NullBuffer once up front.
  2. Iterate values only; for null rows, push an empty placeholder for the offset/value buffer and skip null bookkeeping.
  3. Attach the cloned null buffer when constructing the final array.

Two additional cleanups worth bundling into split.rs while in the file:

  • split_large_string_array returns GenericStringArray::<i32> despite a LargeUtf8 input, which can overflow offsets for large values. The result type should mirror the input (i64 offsets when input is LargeUtf8).
  • The hot loop materializes a Vec<String> from regex::Regex::split (which already yields &str) and then copies into the array. Appending &str directly via the builder skips the intermediate String allocations. Mirrors the append_with improvement from DataFusion #22029.

Additional context

Reference PRs in DataFusion that established the pattern:

Other Comet sites considered but rejected:

  • array_funcs/arrays_overlap.rs — row-by-row builder, but the inner loop has early-exit and tri-valued null logic, so bulk-NULL doesn't apply directly.
  • array_funcs/array_position.rs — already uses a combined_nulls helper plus flat-buffer scan; this is the post-optimization shape.
  • agg_funcs/{covariance,correlation,stddev,variance,avg,sum_*}.rs — already have GroupsAccumulator impls covering the territory of DataFusion #20504 / #21154.

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions