perf: Add bulk NULL-aware string builders, use in lower and upper #21789
neilconway wants to merge 7 commits into apache:main
Rename the three builders in `datafusion/functions/src/strings.rs` to make their special-purpose nature explicit:

- `StringArrayBuilder` -> `ConcatStringBuilder`
- `LargeStringArrayBuilder` -> `ConcatLargeStringBuilder`
- `StringViewArrayBuilder` -> `ConcatStringViewBuilder`

These builders are used only by `concat` and `concat_ws`; their `write::<CHECK_VALID>(col: &ColumnarValueRef, i)` + `append_offset()` API is shaped around accumulating multiple input fragments into one output row and is not a general-purpose string-builder replacement. The `Concat` prefix makes that clear and frees the previous names for a follow-up change that introduces general-purpose bulk-nulls builders with a simpler `append_value(&str)` API.
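To illustrate why the `write` + `append_offset()` shape suits `concat` but not general use, here is a minimal, dependency-free sketch. `ConcatSketchBuilder` and `concat_columns` are hypothetical names, and fragments are plain `&str` values rather than `ColumnarValueRef` columns; this is not the code in `strings.rs`, only the general idea of writing several fragments per row and then closing the row.

```rust
/// Hypothetical, simplified stand-in for a concat-style builder.
/// The real builders take a `&ColumnarValueRef` and a row index; here
/// fragments are plain `&str` so the sketch has no dependencies.
pub struct ConcatSketchBuilder {
    /// Row boundaries into `values`; always starts with 0.
    offsets: Vec<i32>,
    /// UTF-8 bytes of all rows, back to back.
    values: Vec<u8>,
}

impl ConcatSketchBuilder {
    pub fn new() -> Self {
        Self { offsets: vec![0], values: Vec::new() }
    }

    /// Append one fragment to the row currently being built.
    pub fn write(&mut self, fragment: &str) {
        self.values.extend_from_slice(fragment.as_bytes());
    }

    /// Close the current row by recording where it ends.
    pub fn append_offset(&mut self) {
        self.offsets.push(self.values.len() as i32);
    }

    /// Number of completed rows.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Return completed row `i` as a string slice.
    pub fn row(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        // Rows end on fragment boundaries, so the bytes are valid UTF-8.
        std::str::from_utf8(&self.values[start..end]).expect("rows are built from &str")
    }
}

/// Concatenate two columns row by row, roughly as `concat(a, b)` would.
pub fn concat_columns(a: &[&str], b: &[&str]) -> ConcatSketchBuilder {
    let mut builder = ConcatSketchBuilder::new();
    for (x, y) in a.iter().zip(b.iter()) {
        builder.write(x); // first fragment of the row
        builder.write(y); // second fragment of the same row
        builder.append_offset(); // the row is complete
    }
    builder
}
```

A caller that produces exactly one string per row has to pair every `write` with an `append_offset()`, which is the friction a single `append_value(&str)` call avoids.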
Introduce three new string array builders with bulk null tracking:

- `StringArrayBuilder` (Utf8)
- `LargeStringArrayBuilder` (LargeUtf8)
- `StringViewArrayBuilder` (Utf8View)

Each builder has the following API:

- `append_value(&str)` -- add a non-NULL value (row)
- `append_placeholder()` -- add a NULL row placeholder
- `finish(Option<NullBuffer>)` -- finish the build, specify NULLs

These are the counterparts of Arrow's `GenericStringBuilder` / `StringViewBuilder`, but they skip per-row NULL buffer maintenance, which lets callers compute the NULL buffer in bulk when possible.

This PR also switches `case_conversion`, which is used to implement `lower`, `upper`, and the Spark equivalents, to the new APIs. This improves `lower` / `upper` performance by 3-15% on microbenchmarks. More UDFs (~10) will be converted to use this API in future PRs.
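The three-method API described above can be sketched without any Arrow dependency. Everything here is an assumption made for illustration: `SketchStringBuilder` and `SketchStringArray` are hypothetical types, and a `Vec<bool>` (true = non-NULL) stands in for Arrow's packed `NullBuffer`. The point is only that the builder itself never touches null state until `finish`.

```rust
/// Hypothetical stand-in for the finished array: offsets, bytes, and an
/// optional validity mask (true = non-NULL). Arrow uses a packed
/// `NullBuffer`; a `Vec<bool>` keeps this sketch dependency-free.
#[derive(Debug, PartialEq)]
pub struct SketchStringArray {
    pub offsets: Vec<i32>,
    pub values: Vec<u8>,
    pub validity: Option<Vec<bool>>,
}

impl SketchStringArray {
    /// Return row `i`, or `None` if the validity mask marks it NULL.
    pub fn get(&self, i: usize) -> Option<&str> {
        if let Some(mask) = &self.validity {
            if !mask[i] {
                return None;
            }
        }
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        Some(std::str::from_utf8(&self.values[start..end]).expect("built from &str"))
    }
}

/// Hypothetical builder that tracks offsets and bytes only; it keeps no
/// per-row null state.
pub struct SketchStringBuilder {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl SketchStringBuilder {
    pub fn new() -> Self {
        Self { offsets: vec![0], values: Vec::new() }
    }

    /// Add a non-NULL row.
    pub fn append_value(&mut self, value: &str) {
        self.values.extend_from_slice(value.as_bytes());
        self.offsets.push(self.values.len() as i32);
    }

    /// Add a zero-length row that the caller will later mark as NULL.
    pub fn append_placeholder(&mut self) {
        self.offsets.push(self.values.len() as i32);
    }

    /// Finish the build; the caller supplies the NULL information in bulk.
    pub fn finish(self, nulls: Option<Vec<bool>>) -> SketchStringArray {
        let rows = self.offsets.len() - 1;
        if let Some(mask) = &nulls {
            assert_eq!(mask.len(), rows, "validity mask must cover every row");
        }
        SketchStringArray { offsets: self.offsets, values: self.values, validity: nulls }
    }
}
```

In this sketch a placeholder row costs one offset push and nothing else, which is where a saving over per-row null-buffer writes would come from.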
…w-builder-api # Conflicts: # datafusion/functions/src/strings.rs
EeshanBembi left a comment
Nice builder abstraction — the null-deferred pattern is clean and removes the per-row null-buffer write overhead. A few non-blocking observations below.
EeshanBembi left a comment
Nit: consider exposing an append_null() alias alongside append_placeholder(). Arrow's own builders all use append_null(), and downstream contributors reaching for this builder will naturally look for that name first. Both names can coexist.
Thanks for the review @EeshanBembi!
I don't agree with adding an alias; I'd prefer to pick one name or the other.
Which issue does this PR close?
`StringViewArrayBuilder` and related types #21684

Rationale for this change

Introduce three new string array builders with bulk null tracking:

- `StringArrayBuilder` (Utf8)
- `LargeStringArrayBuilder` (LargeUtf8)
- `StringViewArrayBuilder` (Utf8View)

Each builder has the following API:

- `append_value(&str)` -- add a non-NULL value (row)
- `append_placeholder()` -- add a NULL row placeholder
- `finish(Option<NullBuffer>)` -- finish the build, specify NULLs

These are the counterparts of Arrow's `GenericStringBuilder` / `StringViewBuilder`, but they skip per-row NULL buffer maintenance, which lets callers compute the NULL buffer in bulk when possible.

This PR also switches `case_conversion`, which is used to implement `lower`, `upper`, and the Spark equivalents, to the new APIs. This improves `lower` / `upper` performance by 3-15% on microbenchmarks. More UDFs (~10) will be converted to use this API in future PRs.
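For a function like `upper`, an output row is NULL exactly when the input row is NULL, so the caller already knows the output validity before building anything. The following is a rough sketch of that caller-side pattern under simplifying assumptions: `upper_bulk_nulls` is a hypothetical function, the input is a slice of `&str` plus a `&[bool]` validity mask (true = non-NULL), and the result is returned as raw offsets, bytes, and mask instead of an Arrow array.

```rust
/// Uppercase every valid row; rows marked NULL are skipped entirely.
/// Returns (offsets, value bytes, validity). The validity is the input
/// mask copied once, not rebuilt row by row.
pub fn upper_bulk_nulls(input: &[&str], validity: &[bool]) -> (Vec<i32>, Vec<u8>, Vec<bool>) {
    assert_eq!(input.len(), validity.len(), "one validity flag per row");

    let mut offsets = Vec::with_capacity(input.len() + 1);
    offsets.push(0i32);
    let mut values = Vec::new();

    for (s, &is_valid) in input.iter().zip(validity) {
        if is_valid {
            // Equivalent of `append_value`: convert and copy the bytes.
            values.extend_from_slice(s.to_uppercase().as_bytes());
        }
        // Equivalent of `append_placeholder` for NULL rows: no bytes are
        // written, but every row still needs an end offset.
        offsets.push(values.len() as i32);
    }

    // The NULL information is attached once, in bulk, at the end.
    (offsets, values, validity.to_vec())
}
```

Whatever bytes sit in a NULL input slot are never read or converted, so the loop body carries no null bookkeeping beyond the single branch.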
What changes are included in this PR?
`case_conversion`

Are these changes tested?
Yes.
Are there any user-facing changes?
No.