
[opt](function) optimize repeat and trim string functions #63784

Open
Mryange wants to merge 2 commits into apache:master from Mryange:opt-string-func-5.28

Conversation

Mryange commented May 28, 2026

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary:

This PR optimizes two hot BE string function families that spend significant time on repeated buffer growth and high-overhead character lookup.

  1. repeat

    • Replace the old per-row append loop with a shared two-pass execution path for both vector repeat counts and constant repeat counts.
    • Precompute res_offsets and total output size before writing results.
    • Write directly into ColumnString::Chars with StringOP::fast_repeat().
  2. trim_in / ltrim_in / rtrim_in

    • Reuse SIMD-assisted ASCII symbol search for small trim sets.
    • Add reverse symbol search helpers for right trim.
    • Use a fixed-size UTF-8 small-set lookup path for common cases while preserving fallback paths for larger trim strings.
  3. Coverage and observability

    • Add a larger repeat('a', 256) correctness case.
    • Add find_symbols unit coverage for empty ranges, embedded NUL bytes, boundary lengths, and cross-check cases.
    • Add old/new microbenchmarks for both repeat and trim paths.
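The two-pass repeat path described in item 1 can be sketched roughly as follows. This is only an illustration, not the PR's code: `repeat_by_doubling` is a hypothetical stand-in for `StringOP::fast_repeat()`, `RepeatResult` and `repeat_rows` are invented names, plain `std::vector` replaces the Doris column types, negative counts are assumed to produce an empty string, and overflow checking is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for StringOP::fast_repeat(): fills dst with `count`
// copies of src[0..len) by repeatedly copying the already-written prefix.
inline void repeat_by_doubling(char* dst, const char* src, std::size_t len,
                               std::size_t count) {
    if (len == 0 || count == 0) return;
    const std::size_t total = len * count;  // caller guarantees no overflow
    std::memcpy(dst, src, len);
    std::size_t written = len;
    while (written < total) {
        // chunk <= written, so source and destination never overlap.
        const std::size_t chunk = std::min(written, total - written);
        std::memcpy(dst + written, dst, chunk);
        written += chunk;
    }
}

// Illustrative result layout: one flat byte buffer plus per-row end offsets.
struct RepeatResult {
    std::vector<char> chars;
    std::vector<std::size_t> offsets;  // offsets[i] = end of row i in chars
};

inline RepeatResult repeat_rows(const std::vector<std::string>& rows,
                                const std::vector<int>& counts) {
    assert(rows.size() == counts.size());
    RepeatResult out;
    out.offsets.resize(rows.size());

    // Pass 1: compute every end offset and the total output size.
    std::size_t total = 0;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        const std::size_t n = counts[i] > 0 ? static_cast<std::size_t>(counts[i]) : 0;
        total += rows[i].size() * n;
        out.offsets[i] = total;
    }

    // Single allocation, then pass 2 writes each row at its final position.
    out.chars.resize(total);
    std::size_t begin = 0;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        const std::size_t n = counts[i] > 0 ? static_cast<std::size_t>(counts[i]) : 0;
        repeat_by_doubling(out.chars.data() + begin, rows[i].data(), rows[i].size(), n);
        begin = out.offsets[i];
    }
    return out;
}
```

Sizing the buffer once avoids the repeated growth that a per-row append loop incurs, and the doubling copy needs only O(log count) `memcpy` calls per row.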

Benchmark results

Local Release benchmark results on this machine:

repeat

| Case | Old | New | Speedup |
|---|---|---|---|
| RepeatVector 4096/16/8 | 86184 ns | 42878 ns | 2.01x |
| RepeatVector 4096/16/64 | 411330 ns | 146498 ns | 2.81x |
| RepeatVector 1024/128/16 | 84985 ns | 39620 ns | 2.15x |
| RepeatVector 4096/0/64 | 222362 ns | 7382 ns | 30.12x |
| RepeatConst 4096/16/8 | 127848 ns | 59495 ns | 2.15x |
| RepeatConst 4096/16/64 | 764912 ns | 205294 ns | 3.73x |
| RepeatConst 1024/128/16 | 158499 ns | 78090 ns | 2.03x |
| RepeatConst 4096/0/64 | 395137 ns | 7107 ns | 55.60x |

trim

| Case | Old | New | Speedup |
|---|---|---|---|
| ASCII, 4 trim chars, 65536 rows | 2398.526 us | 997.839 us | 2.40x |
| ASCII, 8 trim chars, 65536 rows | 2452.269 us | 872.865 us | 2.81x |
| ASCII, no match, 65536 rows | 318.894 us | 309.079 us | 1.03x |
| UTF-8, 2 trim chars, 65536 rows | 7832.364 us | 6799.569 us | 1.15x |

Note:
The medium and large trim cases reproduced the expected improvement direction on this machine. One small-input case (BM_TrimInAscii/1024) was slightly slower at 0.81x, while the 4096-row and 65536-row cases remained clearly faster.
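For readers unfamiliar with the operation these trim benchmarks exercise, here is a minimal sketch of ASCII `trim_in` semantics, assuming the function strips every leading and trailing byte that belongs to the trim set. It uses a plain 256-entry membership table instead of the SIMD-assisted symbol search the PR describes, and `trim_in_ascii` is an illustrative name, not a Doris function.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Strips leading and trailing bytes of `s` that occur in `trim_set`.
// A byte-indexed table is a simpler substitute for the SIMD search in the PR;
// it handles ASCII (single-byte) trim characters only.
inline std::string trim_in_ascii(const std::string& s, const std::string& trim_set) {
    std::bitset<256> in_set;
    for (unsigned char c : trim_set) in_set.set(c);

    std::size_t begin = 0;
    std::size_t end = s.size();
    // Left trim: advance past bytes that are in the set.
    while (begin < end && in_set.test(static_cast<unsigned char>(s[begin]))) ++begin;
    // Right trim: retreat over bytes that are in the set, never below begin.
    while (end > begin && in_set.test(static_cast<unsigned char>(s[end - 1]))) --end;
    return s.substr(begin, end - begin);
}
```

The right-trim loop compares indices rather than decrementing a pointer past the start, which sidesteps the out-of-range pointer problems raised later in the review.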

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?


Mryange commented May 28, 2026

/review


github-actions bot left a comment


I found two blocking correctness issues in the BE string-function changes.

Critical checkpoint conclusions:

  • Goal/test proof: The PR targets repeat and trim_in performance. The added tests cover normal repeat and symbol-search cases, but they do not cover repeat output-size overflow or the scalar all-trim tail in the new runtime reverse search under sanitizers.
  • Scope/focus: The effective GitHub PR diff is focused on BE string functions and benchmarks.
  • Concurrency: No new shared mutable state or locking path is introduced.
  • Lifecycle/static initialization: No new cross-TU static initialization dependency was found in the changed code.
  • Config/compatibility/protocol/storage: No new config, FE-BE protocol, storage-format, or rolling-upgrade compatibility concern was found.
  • Parallel paths: repeat vector/const paths were unified; trim left/right ASCII and UTF-8 paths were reviewed. The new reverse runtime symbol path has a distinct scalar-tail bug.
  • Error handling/memory safety: repeat now computes output sizes in uint32_t before checking, which can wrap and lead to under-allocation followed by out-of-bounds writes. The new SSE2 runtime reverse search can decrement before begin on all-trim scalar tails.
  • Test coverage: Existing tests are insufficient for these edge cases; add overflow/large-size checks for repeat and sanitizer-safe all-trim scalar-tail coverage for runtime SearchSymbols.
  • Observability/performance: The intended performance direction is reasonable, but correctness must be fixed before merging.
  • User focus: No additional user-provided review focus was specified.

const auto str_ref = source_column.get_data_at(i);
const uint32_t size = str_ref.size;
total_size += size * repeat;
res_offsets[i] = total_size;

This pre-size pass can overflow before ColumnString::check_chars_length() sees the real output size. Both total_size and size are uint32_t, and size * repeat is evaluated in 32-bit unsigned arithmetic, so a large but valid input such as one row with size=65536 and repeat=65536 wraps to zero. The later res_data.resize(total_size) then allocates too little and the second pass writes the full repeated payload through fast_repeat, causing memory corruption instead of raising STRING_OVERFLOW_IN_VEC_ENGINE. Please compute in size_t/checked arithmetic and call check_chars_length() before narrowing into ColumnString::Offset.
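The wraparound and the requested fix can be illustrated with a small sketch. The names `wrapped_product`, `add_repeated_size`, and `kMaxChars` are hypothetical; in particular, `kMaxChars` is only an assumed stand-in for whatever limit `ColumnString::check_chars_length()` actually enforces.

```cpp
#include <cstdint>
#include <limits>

// What the flagged line effectively computes: a 32-bit unsigned product,
// which is reduced modulo 2^32.
inline std::uint32_t wrapped_product(std::uint32_t size, std::uint32_t repeat) {
    return size * repeat;
}

// Assumed upper bound on total output bytes (hypothetical; the real check
// lives in ColumnString::check_chars_length()).
constexpr std::uint64_t kMaxChars = std::numeric_limits<std::uint32_t>::max();

// Adds size * repeat to `total` using 64-bit arithmetic. Returns false and
// leaves `total` unchanged if the result would exceed kMaxChars.
inline bool add_repeated_size(std::uint64_t& total, std::uint32_t size,
                              std::uint32_t repeat) {
    // Both operands fit in 32 bits, so the 64-bit product cannot overflow.
    const std::uint64_t row = static_cast<std::uint64_t>(size) * repeat;
    if (row > kMaxChars || total > kMaxChars - row) return false;
    total += row;
    return true;
}
```

With `size = 65536` and `repeat = 65536`, `wrapped_product` yields 0, while `add_repeated_size` reports failure, so the caller can raise an error before any buffer is resized or narrowed into an offset type.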


--pos;
for (; pos >= begin; --pos)
    if (maybe_negate<positive>(is_in(*pos, symbols, num_chars))) return pos;

The scalar tail has undefined behavior when every remaining byte matches the trim set. For example, find_last_not_symbols_or_null(s.data(), s.data() + 4, SearchSymbols(" ")) on four spaces reaches pos == begin, the loop body does not return, and the --pos in the loop expression decrements before the start of the array; the next pos >= begin comparison is also outside the valid range. This is the runtime SSE2 path used by the new ASCII rtrim optimization. Please use the same for (const char* p = pos; p != begin;) { --p; ... } pattern used in the SSE4.2 tail or otherwise avoid forming begin - 1.
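A minimal sketch of the decrement-inside-the-body pattern the comment recommends is shown below. `find_last_not_of_or_null` is an illustrative name rather than the helper in the PR, and `std::memchr` stands in for the PR's `is_in` lookup.

```cpp
#include <cstddef>
#include <cstring>

// Returns a pointer to the last byte in [begin, end) that is NOT one of the
// `num_chars` bytes in `symbols`, or nullptr if every byte is in the set.
// The pointer is decremented only after p != begin has been checked, so
// begin - 1 is never formed.
inline const char* find_last_not_of_or_null(const char* begin, const char* end,
                                            const char* symbols,
                                            std::size_t num_chars) {
    for (const char* p = end; p != begin;) {
        --p;
        if (std::memchr(symbols, *p, num_chars) == nullptr) return p;
    }
    return nullptr;
}
```

On an all-trim input such as four spaces, the loop ends with `p == begin` and returns `nullptr` without ever comparing or dereferencing a pointer outside the range.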


Mryange commented May 28, 2026

/review


github-actions bot left a comment


Review completed for PR 63784. I found one blocking correctness issue in the new runtime reverse-symbol SSE2 helper.

Critical checkpoints:

  • Goal/test coverage: the PR targets BE string-function performance for repeat and trim_in/ltrim_in/rtrim_in; it adds BE unit coverage and benchmarks, and the updated tests now cover the previously reported repeat overflow and scalar-tail cases.
  • Scope/focus: the BE changes are focused on the stated string-function hot paths plus related tests/benchmarks.
  • Concurrency/lifecycle/config/compatibility/persistence/data writes: not materially involved.
  • Parallel paths: runtime SearchSymbols reverse search is added alongside existing template helpers; the new runtime SSE2 path still needs the bounds-safe loop condition used by the SSE4.2 path.
  • Memory/data correctness: one undefined-behavior risk remains in pointer arithmetic before bounds are checked. Existing review threads already covered the scalar-tail UB and repeat overflow; I did not duplicate those.
  • Observability: no new observability needed for this local function optimization.
  • User focus: no additional user-provided review focus was supplied.

#if defined(__SSE2__)
    const auto needles = mm_is_in_prepare(symbols, num_chars);
    for (; pos - 16 >= begin; pos -= 16) {
        __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pos - 16));

This loop forms pos - 16 before proving that there are 16 bytes left in the range. For any runtime SearchSymbols call on a string shorter than 16 bytes (for example trim_in('abc', ' ') through the new ASCII rtrim path), pos is end and pos - 16 is outside the array before the comparison is evaluated, which is undefined behavior even if the loop body is skipped. This is distinct from the scalar-tail issue already raised: the UB happens in the SIMD loop condition before reaching the tail. Please use the same safe form as the SSE4.2 helper, e.g. static_cast<size_t>(pos - begin) >= 16, before computing pos - 16.

Suggested change
- __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pos - 16));
+ for (; static_cast<size_t>(pos - begin) >= 16; pos -= 16) {
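The bounds-safe loop shape can be sketched portably as follows. This is an assumption-laden illustration: the inner byte loop stands in for the SSE2 load-and-compare, the search is hard-coded to a single space character, and both function names are invented for the example.

```cpp
#include <cstddef>
#include <string>

// Scans [begin, end) backwards in 16-byte blocks and returns the last byte
// that is not a space, or nullptr if there is none. The remaining length is
// checked first, so pos - 16 is computed only when it lies inside the range.
inline const char* rfind_not_space_chunked(const char* begin, const char* end) {
    constexpr std::size_t kBlock = 16;
    const char* pos = end;
    while (static_cast<std::size_t>(pos - begin) >= kBlock) {
        const char* block = pos - kBlock;  // safe: at least kBlock bytes remain
        // Stand-in for the SIMD compare over one 16-byte block.
        for (std::size_t i = kBlock; i-- > 0;) {
            if (block[i] != ' ') return block + i;
        }
        pos = block;
    }
    // Scalar tail for the final 0..15 bytes; decrement only after the check.
    while (pos != begin) {
        --pos;
        if (*pos != ' ') return pos;
    }
    return nullptr;
}

// Convenience wrapper: length of `s` after removing trailing spaces.
inline std::size_t rtrimmed_length(const std::string& s) {
    const char* begin = s.data();
    const char* last = rfind_not_space_chunked(begin, begin + s.size());
    return last == nullptr ? 0 : static_cast<std::size_t>(last - begin) + 1;
}
```

For a 3-byte input such as `abc`, the block loop is skipped without any pointer subtraction below `begin`, and the scalar tail handles the whole string.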
