You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Several DataFusion PRs from @neilconway (#21789, #21849, #21863, #21845, #21847, #21854, #21877, #21991, #22029, #21366, #21519) replaced per-row null buffer maintenance in string functions with bulk null handling. The pattern: when an output array's null mask is identical to (or a simple function of) the input's, don't append_null/append(true/false) every iteration. Either clone the input NullBuffer up front or compute it once with NullBuffer::union[_many]. Reported wins were 3-15% on microbenchmarks.
Two of Comet's native string functions still follow the old per-row pattern.
1. native/spark-expr/src/string_funcs/split.rs
Both split_array (lines 136-161) and split_large_string_array (lines 173-198) maintain a per-row BooleanBufferBuilder even though split produces NULL output exactly when the input is NULL. The output null buffer is just string_array.nulls().cloned().
letmut null_buffer_builder = arrow::array::BooleanBufferBuilder::new(string_array.len());for i in0..string_array.len(){if string_array.is_null(i){
offsets.push(offsets[i]);
null_buffer_builder.append(false);}else{
...null_buffer_builder.append(true);}}
spark_substring_negative_start uses GenericStringBuilder / GenericBinaryBuilder with per-row append_null() across four arms (Utf8, LargeUtf8, Binary, LargeBinary). Output nulls equal input nulls. Same anti-pattern as DataFusion's substr before #21519/#21366.
Describe the potential solution
Adopt the bulk-NULL pattern in both files:
Clone the input NullBuffer once up front.
Iterate values only; for null rows, push an empty placeholder for the offset/value buffer and skip null bookkeeping.
Attach the cloned null buffer when constructing the final array.
Two additional cleanups worth bundling into split.rs while in the file:
split_large_string_array returns GenericStringArray::<i32> despite a LargeUtf8 input, which can overflow offsets for large values. The result type should mirror the input (i64 offsets when input is LargeUtf8).
The hot loop materializes a Vec<String> from regex::Regex::split (which already yields &str) and then copies into the array. Appending &str directly via the builder skips the intermediate String allocations. Mirrors the append_with improvement from DataFusion #22029.
Additional context
Reference PRs in DataFusion that established the pattern:
array_funcs/arrays_overlap.rs — row-by-row builder, but the inner loop has early-exit and tri-valued null logic, so bulk-NULL doesn't apply directly.
array_funcs/array_position.rs — already uses a combined_nulls helper plus flat-buffer scan; this is the post-optimization shape.
agg_funcs/{covariance,correlation,stddev,variance,avg,sum_*}.rs — already have GroupsAccumulator impls covering the territory of DataFusion #20504 / #21154.
What is the problem the feature request solves?
Several DataFusion PRs from @neilconway (#21789, #21849, #21863, #21845, #21847, #21854, #21877, #21991, #22029, #21366, #21519) replaced per-row null buffer maintenance in string functions with bulk null handling. The pattern: when an output array's null mask is identical to (or a simple function of) the input's, don't
append_null/append(true/false)every iteration. Either clone the inputNullBufferup front or compute it once withNullBuffer::union[_many]. Reported wins were 3-15% on microbenchmarks.Two of Comet's native string functions still follow the old per-row pattern.
1.
native/spark-expr/src/string_funcs/split.rsBoth
split_array(lines 136-161) andsplit_large_string_array(lines 173-198) maintain a per-rowBooleanBufferBuildereven thoughsplitproduces NULL output exactly when the input is NULL. The output null buffer is juststring_array.nulls().cloned().2.
native/spark-expr/src/string_funcs/substring.rs:138-191spark_substring_negative_startusesGenericStringBuilder/GenericBinaryBuilderwith per-rowappend_null()across four arms (Utf8, LargeUtf8, Binary, LargeBinary). Output nulls equal input nulls. Same anti-pattern as DataFusion'ssubstrbefore #21519/#21366.Describe the potential solution
Adopt the bulk-NULL pattern in both files:
NullBufferonce up front.Two additional cleanups worth bundling into
split.rswhile in the file:split_large_string_arrayreturnsGenericStringArray::<i32>despite aLargeUtf8input, which can overflow offsets for large values. The result type should mirror the input (i64offsets when input is LargeUtf8).Vec<String>fromregex::Regex::split(which already yields&str) and then copies into the array. Appending&strdirectly via the builder skips the intermediateStringallocations. Mirrors theappend_withimprovement from DataFusion #22029.Additional context
Reference PRs in DataFusion that established the pattern:
lowerandupperdatafusion#21789 — adds bulk NULL-aware string builders, uses them inlower/uppersubstrdatafusion#21519, #21366 — bulk-NULL applied tosubstrreplacedatafusion#21849, #21854, #21863, #21845, #21847, #21877, #21991, #22029 — same pattern acrossreplace,repeat,initcap,uuid,chr,substr_index,reverseOther Comet sites considered but rejected:
array_funcs/arrays_overlap.rs— row-by-row builder, but the inner loop has early-exit and tri-valued null logic, so bulk-NULL doesn't apply directly.array_funcs/array_position.rs— already uses acombined_nullshelper plus flat-buffer scan; this is the post-optimization shape.agg_funcs/{covariance,correlation,stddev,variance,avg,sum_*}.rs— already haveGroupsAccumulatorimpls covering the territory of DataFusion #20504 / #21154.