perf: Add bulk NULL-aware string builders, use in lower and upper #21789

Merged
mbutrovich merged 12 commits into apache:main from neilconway:neilc/perf-string-view-builder-api
Apr 24, 2026

@neilconway neilconway commented Apr 22, 2026

Which issue does this PR close?

Rationale for this change

Introduce three new string array builders with bulk null tracking:

  • StringArrayBuilder (Utf8)
  • LargeStringArrayBuilder (LargeUtf8)
  • StringViewArrayBuilder (Utf8View)

Each builder has the following API:

  • append_value(&str) -- add a non-NULL value (row)
  • append_placeholder() -- add a NULL row placeholder
  • finish(Option<NullBuffer>) -- finish the build, specify NULLs

These are the counterparts of Arrow's GenericStringBuilder / StringViewBuilder, but they
skip per-row NULL buffer maintenance, which lets callers compute the NULL buffer in
bulk when possible.
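
As a rough illustration of the pattern described above, here is a minimal, std-only sketch. It is a toy model, not the code added by this PR: `ToyStringArray`, `ToyStringBuilder`, and `toy_lower` are hypothetical names, the validity bitmap is modeled as a plain `Vec<bool>` rather than Arrow's `NullBuffer`, and offsets are stored as `usize`. It only shows the shape of the API: rows are appended without touching any NULL state, and the caller hands over the whole validity bitmap at `finish`.

```rust
/// Toy stand-in for a Utf8 array: concatenated values, row offsets, and an
/// optional validity bitmap (`false` marks a NULL row). Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyStringArray {
    offsets: Vec<usize>,
    values: String,
    validity: Option<Vec<bool>>,
}

impl ToyStringArray {
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns `None` for NULL rows, otherwise the row's string slice.
    pub fn get(&self, i: usize) -> Option<&str> {
        if let Some(validity) = &self.validity {
            if !validity[i] {
                return None;
            }
        }
        Some(&self.values[self.offsets[i]..self.offsets[i + 1]])
    }
}

/// Toy builder that never tracks NULLs per row. Hypothetical type.
pub struct ToyStringBuilder {
    offsets: Vec<usize>,
    values: String,
}

impl ToyStringBuilder {
    pub fn with_capacity(rows: usize, bytes: usize) -> Self {
        let mut offsets = Vec::with_capacity(rows + 1);
        offsets.push(0);
        Self { offsets, values: String::with_capacity(bytes) }
    }

    /// Append a non-NULL row.
    pub fn append_value(&mut self, s: &str) {
        self.values.push_str(s);
        self.offsets.push(self.values.len());
    }

    /// Reserve a zero-length slot; the row is NULL only if the bitmap
    /// passed to `finish` says so.
    pub fn append_placeholder(&mut self) {
        self.offsets.push(self.values.len());
    }

    /// Attach the caller-supplied validity bitmap in one step.
    pub fn finish(self, validity: Option<Vec<bool>>) -> ToyStringArray {
        if let Some(v) = &validity {
            assert_eq!(v.len(), self.offsets.len() - 1, "one validity entry per row");
        }
        ToyStringArray { offsets: self.offsets, values: self.values, validity }
    }
}

/// Lowercase every row. The output has exactly the input's NULLs, so the
/// input bitmap is cloned once instead of being rebuilt row by row.
pub fn toy_lower(input: &ToyStringArray) -> ToyStringArray {
    let mut builder = ToyStringBuilder::with_capacity(input.len(), input.values.len());
    for i in 0..input.len() {
        match input.get(i) {
            Some(s) => builder.append_value(&s.to_lowercase()),
            None => builder.append_placeholder(),
        }
    }
    builder.finish(input.validity.clone())
}
```

The point of the design is in the last line of `toy_lower`: for a function whose output NULLs match its input NULLs, the bitmap is reused wholesale and the per-row loop only writes string data.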

This PR also switches case_conversion, which implements lower, upper, and their
Spark equivalents, to the new APIs. This improves lower / upper
performance by 3-15% on microbenchmarks. More UDFs (~10) will be converted to use
this API in future PRs.

What changes are included in this PR?

  • Add new builders
  • Add unit tests
  • Adopt builders in case_conversion

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

Rename the three builders in `datafusion/functions/src/strings.rs` to make
their special-purpose nature explicit:

- `StringArrayBuilder`     -> `ConcatStringBuilder`
- `LargeStringArrayBuilder` -> `ConcatLargeStringBuilder`
- `StringViewArrayBuilder`  -> `ConcatStringViewBuilder`

These builders are used only by `concat` and `concat_ws`; their
`write::<CHECK_VALID>(col: &ColumnarValueRef, i)` + `append_offset()`
API is shaped around accumulating multiple input fragments into one
output row and is not a general-purpose string-builder replacement.
The `Concat` prefix makes that clear and frees the previous names for
a follow-up change that introduces general-purpose bulk-nulls
builders with a simpler `append_value(&str)` API.
@github-actions (bot) added the functions label (Changes to functions implementation) Apr 22, 2026
…w-builder-api

# Conflicts:
#	datafusion/functions/src/strings.rs
@neilconway changed the title from "perf: Add new string builders that support bulk NULLs, use in lower and upper" to "perf: Add bulk NULL-aware string builders, use in lower and upper" Apr 22, 2026

@EeshanBembi EeshanBembi left a comment

Nice builder abstraction — the null-deferred pattern is clean and removes the per-row null-buffer write overhead. A few non-blocking observations below.

Comment thread datafusion/functions/src/strings.rs
Comment thread datafusion/functions/src/string/common.rs

@EeshanBembi EeshanBembi left a comment

Nit: consider exposing an append_null() alias alongside append_placeholder(). Arrow's own builders all use append_null(), and downstream contributors reaching for this builder will naturally look for that name first. Both names can coexist.

@neilconway

Thanks for the review @EeshanBembi!

> Nit: consider exposing an append_null() alias alongside append_placeholder().

I don't agree with adding an alias; I'd prefer to pick one name or the other. append_null is what StringBuilder in Arrow does, but the semantics are a little different there -- in that case, append_null is all you need to do to append a null value. In this case, calling append_placeholder isn't sufficient to add a NULL row; you also need to pass a NULL bitmap with the corresponding bit set. I think using a name other than append_null makes this a bit more clear.
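
To make the semantic difference concrete, here is a small std-only sketch. `DeferredNullBuilder` and `build` are hypothetical names for illustration, the bitmap is a plain `Vec<bool>`, and the output is modeled as `Vec<Option<String>>`; none of this is the PR's actual code. It shows that a placeholder row only reads back as NULL when the bitmap passed to `finish` marks it so. This toy does not reject a mismatch between placeholders and bitmap; whether the real builders check for that is not shown here.

```rust
/// Hypothetical stand-in for a builder that defers NULL handling to `finish`.
struct DeferredNullBuilder {
    slots: Vec<String>,
}

impl DeferredNullBuilder {
    fn new() -> Self {
        Self { slots: Vec::new() }
    }

    fn append_value(&mut self, s: &str) {
        self.slots.push(s.to_owned());
    }

    /// Only reserves an empty slot; records nothing about NULL-ness.
    fn append_placeholder(&mut self) {
        self.slots.push(String::new());
    }

    /// `validity[i] == false` marks row `i` as NULL; `None` means no NULLs.
    fn finish(self, validity: Option<Vec<bool>>) -> Vec<Option<String>> {
        match validity {
            None => self.slots.into_iter().map(Some).collect(),
            Some(bits) => {
                assert_eq!(bits.len(), self.slots.len(), "one validity entry per row");
                self.slots
                    .into_iter()
                    .zip(bits)
                    .map(|(s, valid)| if valid { Some(s) } else { None })
                    .collect()
            }
        }
    }
}

/// Build the same two rows (a value, then a placeholder) with a given bitmap.
fn build(validity: Option<Vec<bool>>) -> Vec<Option<String>> {
    let mut b = DeferredNullBuilder::new();
    b.append_value("a");
    b.append_placeholder();
    b.finish(validity)
}
```

With `build(None)` the second row comes back as an empty string, not NULL; with `build(Some(vec![true, false]))` it comes back as NULL. That is the behavior a name like `append_null` would misrepresent.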

@neilconway

cc @mbutrovich , in case you have a free moment to assist in the next step of the quest to optimize NULL handling 😊

@neilconway

Benchmarks:

  lower_all_values_are_ascii: 1024                                  5.4µs  -> 5.3µs   -1.9%
  lower_all_values_are_ascii: 4096                                 21.4µs  -> 21.1µs  -1.4%
  lower_all_values_are_ascii: 8192                                 41.7µs  -> 41.3µs  -1.0%

  ascii_sv 4096 len10  nulls0   mixed=f                           160.2µs  -> 151.4µs -5.5%
  ascii_sv 4096 len10  nulls0   mixed=t                           179.7µs  -> 181.1µs +0.8%
  ascii_sv 4096 len10  nulls0.1 mixed=f                           164.6µs  -> 141.7µs -13.9%
  ascii_sv 4096 len10  nulls0.1 mixed=t                           181.7µs  -> 172.8µs -4.9%
  ascii_sv 4096 len64  nulls0   mixed=f                           149.0µs  -> 154.1µs +3.4%
  ascii_sv 4096 len64  nulls0   mixed=t                           187.8µs  -> 194.8µs +3.7%
  ascii_sv 4096 len64  nulls0.1 mixed=f                           147.6µs  -> 145.4µs -1.5%
  ascii_sv 4096 len64  nulls0.1 mixed=t                           189.2µs  -> 178.2µs -5.8%
  ascii_sv 4096 len128 nulls0   mixed=f                           188.7µs  -> 191.2µs +1.3%
  ascii_sv 4096 len128 nulls0   mixed=t                           204.6µs  -> 202.6µs -1.0%
  ascii_sv 4096 len128 nulls0.1 mixed=f                           190.2µs  -> 172.8µs -9.1%
  ascii_sv 4096 len128 nulls0.1 mixed=t                           201.7µs  -> 195.0µs -3.3%
  ascii_sv 8192 len10  nulls0   mixed=f                           319.4µs  -> 303.8µs -4.9%
  ascii_sv 8192 len10  nulls0   mixed=t                           366.3µs  -> 363.6µs -0.7%
  ascii_sv 8192 len10  nulls0.1 mixed=f                           334.8µs  -> 284.6µs -15.0%
  ascii_sv 8192 len10  nulls0.1 mixed=t                           362.1µs  -> 339.0µs -6.4%
  ascii_sv 8192 len64  nulls0   mixed=f                           296.6µs  -> 309.1µs +4.2%
  ascii_sv 8192 len64  nulls0   mixed=t                           384.6µs  -> 384.0µs -0.2%
  ascii_sv 8192 len64  nulls0.1 mixed=f                           299.7µs  -> 290.6µs -3.0%
  ascii_sv 8192 len64  nulls0.1 mixed=t                           378.9µs  -> 355.4µs -6.2%
  ascii_sv 8192 len128 nulls0   mixed=f                           394.9µs  -> 415.3µs +5.2%
  ascii_sv 8192 len128 nulls0   mixed=t                           408.4µs  -> 407.4µs -0.2%
  ascii_sv 8192 len128 nulls0.1 mixed=f                           419.0µs  -> 350.9µs -16.3%
  ascii_sv 8192 len128 nulls0.1 mixed=t                           409.6µs  -> 388.3µs -5.2%

  nonascii_sv 4096 len10  nulls0   mixed=f                        372.8µs  -> 377.0µs +1.1%
  nonascii_sv 4096 len10  nulls0   mixed=t                        375.2µs  -> 372.6µs -0.7%
  nonascii_sv 4096 len10  nulls0.1 mixed=f                        345.5µs  -> 357.9µs +3.6%
  nonascii_sv 4096 len10  nulls0.1 mixed=t                        346.0µs  -> 340.7µs -1.5%
  nonascii_sv 4096 len64  nulls0   mixed=f                        358.9µs  -> 370.5µs +3.2%
  nonascii_sv 4096 len64  nulls0   mixed=t                        382.0µs  -> 386.4µs +1.2%
  nonascii_sv 4096 len64  nulls0.1 mixed=f                        353.7µs  -> 349.7µs -1.1%
  nonascii_sv 4096 len64  nulls0.1 mixed=t                        363.7µs  -> 340.3µs -6.4%
  nonascii_sv 4096 len128 nulls0   mixed=f                        371.6µs  -> 393.8µs +6.0%
  nonascii_sv 4096 len128 nulls0   mixed=t                        376.2µs  -> 387.4µs +3.0%
  nonascii_sv 4096 len128 nulls0.1 mixed=f                        349.6µs  -> 349.0µs -0.2%
  nonascii_sv 4096 len128 nulls0.1 mixed=t                        363.4µs  -> 341.1µs -6.1%
  nonascii_sv 8192 len10  nulls0   mixed=f                        738.5µs  -> 771.2µs +4.4%
  nonascii_sv 8192 len10  nulls0   mixed=t                        759.2µs  -> 760.1µs +0.1%
  nonascii_sv 8192 len10  nulls0.1 mixed=f                        705.8µs  -> 700.4µs -0.8%
  nonascii_sv 8192 len10  nulls0.1 mixed=t                        710.9µs  -> 695.0µs -2.2%
  nonascii_sv 8192 len64  nulls0   mixed=f                        781.1µs  -> 759.5µs -2.8%
  nonascii_sv 8192 len64  nulls0   mixed=t                        746.3µs  -> 791.6µs +6.1%
  nonascii_sv 8192 len64  nulls0.1 mixed=f                        726.4µs  -> 696.9µs -4.1%
  nonascii_sv 8192 len64  nulls0.1 mixed=t                        733.7µs  -> 690.0µs -6.0%
  nonascii_sv 8192 len128 nulls0   mixed=f                        746.5µs  -> 769.6µs +3.1%
  nonascii_sv 8192 len128 nulls0   mixed=t                        742.5µs  -> 764.1µs +2.9%
  nonascii_sv 8192 len128 nulls0.1 mixed=f                        745.2µs  -> 717.4µs -3.7%
  nonascii_sv 8192 len128 nulls0.1 mixed=t                        707.5µs  -> 699.9µs -1.1%

  lower_the_first_value_is_nonascii: 1024                          45.6µs  -> 42.6µs  -6.6%
  lower_the_first_value_is_nonascii: 4096                         183.6µs  -> 172.7µs -5.9%
  lower_the_first_value_is_nonascii: 8192                         372.1µs  -> 356.5µs -4.2%
  lower_the_middle_value_is_nonascii: 1024                         45.8µs  -> 43.2µs  -5.7%
  lower_the_middle_value_is_nonascii: 4096                        187.0µs  -> 175.7µs -6.0%
  lower_the_middle_value_is_nonascii: 8192                        373.2µs  -> 360.0µs -3.5%


@mbutrovich mbutrovich left a comment

Thanks @neilconway! One minor request: it might be worth adding an all-placeholders test (every row null) to exercise the placeholder_count == null_count boundary. The existing tests all mix values and placeholders.
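
A sketch of what such an all-placeholders case might look like, using a std-only toy rather than the PR's code. `CountingBuilder` and `all_placeholders` are hypothetical, the bitmap is a `Vec<bool>`, and the check that the bitmap's NULL count equals the number of placeholders is an assumption made for this illustration, prompted by the `placeholder_count == null_count` wording above; the real builders may enforce something different or nothing at all.

```rust
/// Hypothetical builder that counts placeholders so `finish` can
/// cross-check the caller's bitmap against them.
struct CountingBuilder {
    offsets: Vec<usize>,
    data: String,
    placeholder_count: usize,
}

impl CountingBuilder {
    fn new() -> Self {
        Self { offsets: vec![0], data: String::new(), placeholder_count: 0 }
    }

    fn append_value(&mut self, s: &str) {
        self.data.push_str(s);
        self.offsets.push(self.data.len());
    }

    fn append_placeholder(&mut self) {
        self.offsets.push(self.data.len());
        self.placeholder_count += 1;
    }

    /// Returns (offsets, data, validity). Panics if the number of NULL rows
    /// in the bitmap differs from the number of placeholders (an invariant
    /// assumed for this sketch only).
    fn finish(self, validity: Option<Vec<bool>>) -> (Vec<usize>, String, Option<Vec<bool>>) {
        let null_count = validity
            .as_ref()
            .map_or(0, |v| v.iter().filter(|valid| !**valid).count());
        assert_eq!(null_count, self.placeholder_count, "NULLs must match placeholders");
        (self.offsets, self.data, validity)
    }
}

/// Boundary case: every row is a placeholder and every row is NULL.
fn all_placeholders(rows: usize) -> (Vec<usize>, String, Option<Vec<bool>>) {
    let mut b = CountingBuilder::new();
    for _ in 0..rows {
        b.append_placeholder();
    }
    b.finish(Some(vec![false; rows]))
}
```

In this boundary case the data buffer stays empty and every offset is zero, which is the kind of degenerate layout that mixed value/placeholder tests never reach.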

@neilconway

@mbutrovich Thanks, good idea! Done.


@mbutrovich mbutrovich left a comment

Thanks as always for the performance improvements, @neilconway!

@mbutrovich mbutrovich added this pull request to the merge queue Apr 24, 2026
Merged via the queue into apache:main with commit 794f30e Apr 24, 2026
31 checks passed