[opt](function) optimize repeat and trim string functions#63784
[opt](function) optimize repeat and trim string functions#63784Mryange wants to merge 2 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found two blocking correctness issues in the BE string-function changes.
Critical checkpoint conclusions:
- Goal/test proof: The PR targets repeat and trim_in performance. The added tests cover normal repeat and symbol-search cases, but they do not cover repeat output-size overflow or the scalar all-trim tail in the new runtime reverse search under sanitizers.
- Scope/focus: The effective GitHub PR diff is focused on BE string functions and benchmarks.
- Concurrency: No new shared mutable state or locking path is introduced.
- Lifecycle/static initialization: No new cross-TU static initialization dependency was found in the changed code.
- Config/compatibility/protocol/storage: No new config, FE-BE protocol, storage-format, or rolling-upgrade compatibility concern was found.
- Parallel paths: repeat vector/const paths were unified; trim left/right ASCII and UTF-8 paths were reviewed. The new reverse runtime symbol path has a distinct scalar-tail bug.
- Error handling/memory safety: repeat now computes output sizes in uint32_t before checking, which can wrap and lead to under-allocation followed by out-of-bounds writes. The new SSE2 runtime reverse search can decrement before begin on all-trim scalar tails.
- Test coverage: Existing tests are insufficient for these edge cases; add overflow/large-size checks for repeat and sanitizer-safe all-trim scalar-tail coverage for runtime SearchSymbols.
- Observability/performance: The intended performance direction is reasonable, but correctness must be fixed before merging.
- User focus: No additional user-provided review focus was specified.
| const auto str_ref = source_column.get_data_at(i); | ||
| const uint32_t size = str_ref.size; | ||
| total_size += size * repeat; | ||
| res_offsets[i] = total_size; |
There was a problem hiding this comment.
This pre-size pass can overflow before ColumnString::check_chars_length() sees the real output size. Both total_size and size are uint32_t, and size * repeat is evaluated in 32-bit unsigned arithmetic, so a large but valid input such as one row with size=65536 and repeat=65536 wraps to zero. The later res_data.resize(total_size) then allocates too little and the second pass writes the full repeated payload through fast_repeat, causing memory corruption instead of raising STRING_OVERFLOW_IN_VEC_ENGINE. Please compute in size_t/checked arithmetic and call check_chars_length() before narrowing into ColumnString::Offset.
|
|
||
| --pos; | ||
| for (; pos >= begin; --pos) | ||
| if (maybe_negate<positive>(is_in(*pos, symbols, num_chars))) return pos; |
There was a problem hiding this comment.
The scalar tail has undefined behavior when every remaining byte matches the trim set. For example, find_last_not_symbols_or_null(s.data(), s.data() + 4, SearchSymbols(" ")) on four spaces reaches pos == begin, the loop body does not return, and the --pos in the loop expression decrements before the start of the array; the next pos >= begin comparison is also outside the valid range. This is the runtime SSE2 path used by the new ASCII rtrim optimization. Please use the same for (const char* p = pos; p != begin;) { --p; ... } pattern used in the SSE4.2 tail or otherwise avoid forming begin - 1.
|
/review |
There was a problem hiding this comment.
Review completed for PR 63784. I found one blocking correctness issue in the new runtime reverse-symbol SSE2 helper.
Critical checkpoints:
- Goal/test coverage: the PR targets BE string-function performance for
repeatandtrim_in/ltrim_in/rtrim_in; it adds BE unit coverage and benchmarks, and the updated tests now cover the previously reported repeat overflow and scalar-tail cases. - Scope/focus: the BE changes are focused on the stated string-function hot paths plus related tests/benchmarks.
- Concurrency/lifecycle/config/compatibility/persistence/data writes: not materially involved.
- Parallel paths: runtime
SearchSymbolsreverse search is added alongside existing template helpers; the new runtime SSE2 path still needs the bounds-safe loop condition used by the SSE4.2 path. - Memory/data correctness: one undefined-behavior risk remains in pointer arithmetic before bounds are checked. Existing review threads already covered the scalar-tail UB and repeat overflow; I did not duplicate those.
- Observability: no new observability needed for this local function optimization.
- User focus: no additional user-provided review focus was supplied.
| #if defined(__SSE2__) | ||
| const auto needles = mm_is_in_prepare(symbols, num_chars); | ||
| for (; pos - 16 >= begin; pos -= 16) { | ||
| __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pos - 16)); |
There was a problem hiding this comment.
This loop forms pos - 16 before proving that there are 16 bytes left in the range. For any runtime SearchSymbols call on a string shorter than 16 bytes (for example trim_in('abc', ' ') through the new ASCII rtrim path), pos is end and pos - 16 is outside the array before the comparison is evaluated, which is undefined behavior even if the loop body is skipped. This is distinct from the scalar-tail issue already raised: the UB happens in the SIMD loop condition before reaching the tail. Please use the same safe form as the SSE4.2 helper, e.g. static_cast<size_t>(pos - begin) >= 16, before computing pos - 16.
| __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pos - 16)); | |
| for (; static_cast<size_t>(pos - begin) >= 16; pos -= 16) { |
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary:
This PR optimizes two hot BE string function families that spend significant time on repeated buffer growth and high-overhead character lookup.
repeatres_offsetsand total output size before writing results.ColumnString::CharswithStringOP::fast_repeat().trim_in/ltrim_in/rtrim_inCoverage and observability
repeat('a', 256)correctness case.find_symbolsunit coverage for empty ranges, embedded NUL bytes, boundary lengths, and cross-check cases.Benchmark results
Local Release benchmark results on this machine:
repeat
trim
Note:
The medium and large trim cases reproduced the expected improvement direction on this machine. One small-input case (
BM_TrimInAscii/1024) was slightly slower at 0.81x, while the 4096-row and 65536-row cases remained clearly faster.Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)