perf: Use bulk-NULL builders in split and substring

### What is the problem the feature request solves?

Several DataFusion PRs from @neilconway (#21789, #21849, #21863, #21845, #21847, #21854, #21877, #21991, #22029, #21366, #21519) replaced per-row null buffer maintenance in string functions with bulk null handling. The pattern: when an output array's null mask is identical to (or a simple function of) the input's, don't `append_null`/`append(true/false)` every iteration. Either clone the input `NullBuffer` up front or compute it once with `NullBuffer::union[_many]`. Reported wins were 3-15% on microbenchmarks.

Two of Comet's native string functions still follow the old per-row pattern.

**1. `native/spark-expr/src/string_funcs/split.rs`**

Both `split_array` (lines 136-161) and `split_large_string_array` (lines 173-198) maintain a per-row `BooleanBufferBuilder` even though `split` produces NULL output exactly when the input is NULL. The output null buffer is just `string_array.nulls().cloned()`.

```rust
let mut null_buffer_builder = arrow::array::BooleanBufferBuilder::new(string_array.len());
for i in 0..string_array.len() {
    if string_array.is_null(i) {
        offsets.push(offsets[i]);
        null_buffer_builder.append(false);
    } else {
        ...
        null_buffer_builder.append(true);
    }
}
```

**2. `native/spark-expr/src/string_funcs/substring.rs:138-191`**

`spark_substring_negative_start` uses `GenericStringBuilder` / `GenericBinaryBuilder` with per-row `append_null()` across four arms (Utf8, LargeUtf8, Binary, LargeBinary). Output nulls equal input nulls. Same anti-pattern as DataFusion's `substr` before #21519/#21366.

### Describe the potential solution

Adopt the bulk-NULL pattern in both files:

1. Clone the input `NullBuffer` once up front.
2. Iterate values only; for null rows, push an empty placeholder for the offset/value buffer and skip null bookkeeping.
3. Attach the cloned null buffer when constructing the final array.

Two additional cleanups worth bundling into `split.rs` while in the file:

- `split_large_string_array` returns `GenericStringArray::<i32>` despite a `LargeUtf8` input, which can overflow offsets for large values. The result type should mirror the input (`i64` offsets when input is LargeUtf8).
- The hot loop materializes a `Vec<String>` from `regex::Regex::split` (which already yields `&str`) and then copies into the array. Appending `&str` directly via the builder skips the intermediate `String` allocations. Mirrors the `append_with` improvement from DataFusion #22029.

### Additional context

Reference PRs in DataFusion that established the pattern:

- apache/datafusion#21789 — adds bulk NULL-aware string builders, uses them in `lower` / `upper`
- apache/datafusion#21519, #21366 — bulk-NULL applied to `substr`
- apache/datafusion#21849, #21854, #21863, #21845, #21847, #21877, #21991, #22029 — same pattern across `replace`, `repeat`, `initcap`, `uuid`, `chr`, `substr_index`, `reverse`

Other Comet sites considered but rejected:

- `array_funcs/arrays_overlap.rs` — row-by-row builder, but the inner loop has early-exit and tri-valued null logic, so bulk-NULL doesn't apply directly.
- `array_funcs/array_position.rs` — already uses a `combined_nulls` helper plus flat-buffer scan; this is the post-optimization shape.
- `agg_funcs/{covariance,correlation,stddev,variance,avg,sum_*}.rs` — already have `GroupsAccumulator` impls covering the territory of DataFusion #20504 / #21154.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: Use bulk-NULL builders in split and substring #4397

What is the problem the feature request solves?

Describe the potential solution

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

perf: Use bulk-NULL builders in split and substring #4397

Description

What is the problem the feature request solves?

Describe the potential solution

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions