perf: Add bulk NULL-aware string builders, use in lower and upper #21789

Open
neilconway wants to merge 7 commits into apache:main from neilconway:neilc/perf-string-view-builder-api

@neilconway (Contributor) commented Apr 22, 2026

Which issue does this PR close?

Rationale for this change

Introduce three new string array builders with bulk null tracking:

  • StringArrayBuilder (Utf8)
  • LargeStringArrayBuilder (LargeUtf8)
  • StringViewArrayBuilder (Utf8View)

Each builder has the following API:

  • append_value(&str) -- add a non-NULL value (row)
  • append_placeholder() -- add a NULL row placeholder
  • finish(Option<NullBuffer>) -- finish the build, specify NULLs

These are the counterparts of Arrow's GenericStringBuilder / StringViewBuilder, but they
skip per-row NULL buffer maintenance, which lets callers compute the NULL buffer in
bulk when possible.
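
To make the API shape concrete, here is a minimal std-only sketch of the deferred-NULL pattern. It is not the code in this PR: `SketchStringBuilder`, `SketchStringArray`, and `upper_sketch` are hypothetical names, and a `Vec<bool>` stands in for Arrow's `NullBuffer`.

```rust
// Hypothetical, std-only stand-in for the builders described above. The real
// builders produce Arrow arrays and take an Option<NullBuffer> in finish().
pub struct SketchStringBuilder {
    offsets: Vec<i32>, // offsets[i]..offsets[i + 1] delimits row i in `values`
    values: Vec<u8>,
}

pub struct SketchStringArray {
    pub offsets: Vec<i32>,
    pub values: Vec<u8>,
    pub validity: Option<Vec<bool>>, // None means "no NULL rows"
}

impl SketchStringBuilder {
    pub fn with_capacity(rows: usize, bytes: usize) -> Self {
        let mut offsets = Vec::with_capacity(rows + 1);
        offsets.push(0);
        Self { offsets, values: Vec::with_capacity(bytes) }
    }

    /// Append a non-NULL row. No validity bit is touched here.
    pub fn append_value(&mut self, s: &str) {
        self.values.extend_from_slice(s.as_bytes());
        self.offsets.push(self.values.len() as i32);
    }

    /// Append a zero-length row. It only becomes NULL if the validity
    /// passed to finish() marks it as such.
    pub fn append_placeholder(&mut self) {
        self.offsets.push(self.values.len() as i32);
    }

    /// Attach the validity information that the caller computed in bulk.
    pub fn finish(self, validity: Option<Vec<bool>>) -> SketchStringArray {
        if let Some(v) = &validity {
            assert_eq!(v.len(), self.offsets.len() - 1, "one validity entry per row");
        }
        SketchStringArray { offsets: self.offsets, values: self.values, validity }
    }
}

impl SketchStringArray {
    pub fn get(&self, i: usize) -> Option<&str> {
        if let Some(v) = &self.validity {
            if !v[i] {
                return None;
            }
        }
        let (start, end) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        Some(std::str::from_utf8(&self.values[start..end]).expect("rows are valid UTF-8"))
    }
}

/// Upper-case every row; the value loop never writes validity bits.
pub fn upper_sketch(input: &[Option<&str>]) -> SketchStringArray {
    let mut builder = SketchStringBuilder::with_capacity(input.len(), 0);
    for value in input {
        match value {
            Some(s) => builder.append_value(&s.to_uppercase()),
            None => builder.append_placeholder(),
        }
    }
    // Validity is derived in a single separate pass over the input.
    let validity: Vec<bool> = input.iter().map(|v| v.is_some()).collect();
    builder.finish(Some(validity))
}
```

For a unary function such as upper-casing, the output is NULL exactly where the input is NULL, so the validity information can be produced once for the whole batch instead of being updated row by row.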

This PR also switches case_conversion, which implements lower, upper, and the
Spark equivalents, to the new APIs. This improves lower / upper performance by
3-15% on microbenchmarks. More UDFs (~10) will be converted to use this API in
future PRs.

What changes are included in this PR?

  • Add new builders
  • Add unit tests
  • Adopt builders in case_conversion

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

Rename the three builders in `datafusion/functions/src/strings.rs` to make
their special-purpose nature explicit:

- `StringArrayBuilder`     -> `ConcatStringBuilder`
- `LargeStringArrayBuilder` -> `ConcatLargeStringBuilder`
- `StringViewArrayBuilder`  -> `ConcatStringViewBuilder`

These builders are used only by `concat` and `concat_ws`; their
`write::<CHECK_VALID>(col: &ColumnarValueRef, i)` + `append_offset()`
API is shaped around accumulating multiple input fragments into one
output row and is not a general-purpose string-builder replacement.
The `Concat` prefix makes that clear and frees the previous names for
a follow-up change that introduces general-purpose bulk-nulls
builders with a simpler `append_value(&str)` API.
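
For contrast, the fragment-accumulating shape described in this commit message can be sketched roughly as follows. This is a simplified, hypothetical stand-in: `ConcatSketch` and `concat_rows` are illustrative names, and `write` takes a plain `&str` here rather than a column reference and row index.

```rust
// Hypothetical simplification of a concat-style builder: several fragments
// are written into one output row, then the row is closed explicitly.
pub struct ConcatSketch {
    offsets: Vec<usize>,
    values: String,
}

impl ConcatSketch {
    pub fn new() -> Self {
        Self { offsets: vec![0], values: String::new() }
    }

    /// Append one fragment to the row currently being built.
    pub fn write(&mut self, fragment: &str) {
        self.values.push_str(fragment);
    }

    /// Close the current row by recording where it ends.
    pub fn append_offset(&mut self) {
        self.offsets.push(self.values.len());
    }

    pub fn row(&self, i: usize) -> &str {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Concatenate equal-length columns row by row.
pub fn concat_rows(columns: &[&[&str]]) -> ConcatSketch {
    let rows = columns.first().map_or(0, |c| c.len());
    let mut builder = ConcatSketch::new();
    for i in 0..rows {
        for col in columns {
            builder.write(col[i]); // one fragment per input column
        }
        builder.append_offset(); // exactly one output row per input row
    }
    builder
}
```

Because a row is only complete after `append_offset()`, this shape suits `concat`-like functions but is awkward for functions that produce one whole value per row.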
@github-actions github-actions Bot added the functions Changes to functions implementation label Apr 22, 2026
…w-builder-api

# Conflicts:
#	datafusion/functions/src/strings.rs
@neilconway changed the title from "perf: Add new string builders that support bulk NULLs, use in lower and upper" to "perf: Add bulk NULL-aware string builders, use in lower and upper" Apr 22, 2026
@EeshanBembi (Contributor) left a comment

Nice builder abstraction — the null-deferred pattern is clean and removes the per-row null-buffer write overhead. A few non-blocking observations below.

Comment thread datafusion/functions/src/strings.rs
Comment thread datafusion/functions/src/string/common.rs
@EeshanBembi (Contributor) left a comment

Nit: consider exposing an append_null() alias alongside append_placeholder(). Arrow's own builders all use append_null(), and downstream contributors reaching for this builder will naturally look for that name first. Both names can coexist.

@neilconway (Contributor, Author) commented

Thanks for the review @EeshanBembi !

> Nit: consider exposing an append_null() alias alongside append_placeholder().

I don't agree with adding an alias; I'd prefer to pick one name or the other. append_null is what StringBuilder in Arrow does, but the semantics are a little different there -- in that case, append_null is all you need to do to append a null value. In this case, calling append_placeholder isn't sufficient to add a NULL row; you also need to pass a NULL bitmap with the corresponding bit set. I think using a name other than append_null makes this a bit more clear.
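
The distinction can be illustrated with a small std-only sketch. `read_row` and `placeholder_demo` are hypothetical helpers over a simplified offsets/values layout, not Arrow's API: a placeholder row reads back as an empty string unless the validity information says it is NULL.

```rust
// Hypothetical reader over an offsets/values layout with optional validity.
pub fn read_row<'a>(
    offsets: &[usize],
    values: &'a str,
    validity: Option<&[bool]>,
    i: usize,
) -> Option<&'a str> {
    match validity {
        Some(bits) if !bits[i] => None, // marked NULL by the bitmap
        _ => Some(&values[offsets[i]..offsets[i + 1]]),
    }
}

pub fn placeholder_demo() -> (Option<&'static str>, Option<&'static str>) {
    // Row 1 was written as a placeholder: its start and end offsets coincide.
    let offsets = [0, 2, 2, 3];
    let values = "abc";
    // Without a bitmap the placeholder is an ordinary empty string.
    let without_bitmap = read_row(&offsets, values, None, 1);
    // With the matching validity entry cleared, the same row is NULL.
    let with_bitmap = read_row(&offsets, values, Some(&[true, false, true][..]), 1);
    (without_bitmap, with_bitmap)
}
```

Under these assumptions, the placeholder alone yields `Some("")`, and only the bitmap turns it into `None`, which is the two-step contract the comment above describes.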
