Skip to content

[opt](function) simplify FunctionStringPad fast path#62359

Merged
Mryange merged 2 commits intoapache:masterfrom
Mryange:opt-pad-dev-4.10
Apr 15, 2026
Merged

[opt](function) simplify FunctionStringPad fast path#62359
Mryange merged 2 commits intoapache:masterfrom
Mryange:opt-pad-dev-4.10

Conversation

@Mryange
Copy link
Copy Markdown
Contributor

@Mryange Mryange commented Apr 10, 2026

What problem does this PR solve?

This PR refactors BE lpad / rpad implementation in FunctionStringPad.

The previous version mixed multiple optimization dimensions in one heavily-templated path, which made the code hard to read and maintain.

This change simplifies the execution model into:

  • a fast path for length const && pad const
  • a general path for all other cases

The const/const path keeps the key optimizations:

  • direct write to ColumnString::Chars
  • one-time pad preprocessing
  • output size estimation
  • optional ASCII fast path

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 10, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 95.85% (208/217) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.04% (20152/37996)
Line Coverage 36.58% (189486/517979)
Region Coverage 32.85% (147115/447872)
Branch Coverage 33.97% (64402/189573)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 98.16% (213/217) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.60% (27386/37210)
Line Coverage 57.24% (295600/516396)
Region Coverage 54.64% (246976/451996)
Branch Coverage 56.17% (106813/190149)

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 13, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 issue.

Critical checkpoint conclusions:

  • Goal and correctness: The PR clearly aims to refactor FunctionStringPad by splitting out a const-length/const-pad fast path and a general path. The overall structure is smaller and easier to follow, but the new fast path introduces a correctness regression in its upfront output allocation logic, so the current code does not fully preserve the previous behavior.
  • Minimality and focus: The change is focused on a single function implementation, but it also changes the memory growth strategy, which is where the regression appears.
  • Concurrency: No new concurrency is involved here; execution remains row-local within the vectorized function.
  • Lifecycle/static initialization: No special lifecycle or static initialization concerns found.
  • Configuration: No configuration changes.
  • Compatibility: No FE/BE protocol, symbol, or storage format compatibility issues found.
  • Parallel code paths: The split between the const/const path and the general path is explicit; the issue I found is limited to the new const/const fast path.
  • Special conditional logic: The new ASCII-vs-UTF8 dispatch is reasonable, but the const/const path now relies on an estimated byte count before any check_chars_length() validation.
  • Test coverage: Existing BE/regression tests cover many lpad/rpad value cases, but this PR changes memory-allocation behavior and was submitted without new tests. I did not see coverage that specifically exercises overflow/error-path behavior for the new fast path.
  • Observability: No additional observability needed for this change.
  • Transaction/persistence: Not applicable.
  • Data writes/atomicity: Not applicable beyond in-memory string-column writes.
  • FE/BE variable passing: Not applicable.
  • Performance: The refactor should help readability and may help the const/const path, but correctness needs to be fixed first.
  • Other issues: None beyond the inline finding below.

Summary: The refactor is directionally good, but the new const-length/const-pad path should not pre-size ColumnString::Chars with an unchecked estimate. It needs the same overflow protection guarantees as the previous row-by-row checked growth path.

}

const PadChars* padding_ptr = padding ? &*padding : nullptr;
res_chars.resize(estimate_const_output_bytes<is_utf8>(strcol.get_chars(), target_len,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new fast path pre-sizes res_chars with an estimated total byte count before any ColumnString::check_chars_length() validation. That changes behavior compared with the old implementation, which validated each row's actual buffer_len + new_capacity before growing the destination.

A concrete failure case is a very large constant target length (or just a large row count) where estimate_const_output_bytes() exceeds ColumnString::MAX_STRING_SIZE: PaddedPODArray::resize() will try to allocate that size directly instead of raising the Doris STRING_OVERFLOW_IN_VEC_ENGINE exception from check_chars_length(). So this is not only a refactor; it weakens the existing overflow guard on the const/const path.

Please switch this to reserve() plus checked growth, or run ColumnString::check_chars_length() on the estimated size before the initial resize, and still keep the row-level checked growth for cases where the estimate is low.

@HappenLee
Copy link
Copy Markdown
Contributor

/review

@github-actions
Copy link
Copy Markdown
Contributor

No issues found in be/src/exprs/function/function_string_concat.h after reviewing the refactor against the Doris review checklist.

Critical checkpoint conclusions:

  • Goal and correctness: The PR goal is to simplify FunctionStringPad by splitting a const-length/const-pad fast path from the general path, and the new implementation still preserves the previous lpad/rpad semantics in the code paths I traced.
  • Minimality and focus: The change is focused to one BE string-function implementation and keeps the optimization-specific logic localized.
  • Concurrency: No concurrency or lock-safety concerns are involved; execution remains row-local in the vectorized function path.
  • Lifecycle/static initialization: No special lifecycle or static initialization risks found.
  • Configuration: No configuration changes.
  • Compatibility: No FE/BE protocol, persisted metadata, or storage-format compatibility issues found.
  • Parallel code paths: The new fast path and general path are both covered by the refactor; I did not find a mismatch that would require mirroring changes elsewhere.
  • Special conditional logic: The ASCII fast-path dispatch is consistent with the underlying ColumnString::is_ascii() / UTF-8 helper behavior.
  • Test coverage: Existing regression and constant-fold cases already exercise lpad/rpad, including UTF-8 examples. No new tests were added, so the residual risk is mainly around untested performance/overflow corners rather than an evident functional regression.
  • Observability: No additional observability needed for this refactor.
  • Transaction/persistence: Not applicable.
  • Data writes/atomicity: Not applicable beyond in-memory column writes.
  • FE/BE variable passing: Not applicable.
  • Performance: The refactor should improve maintainability while keeping the intended fast path.
  • Other issues: None found.

Comment thread be/src/exprs/function/function_string_concat.h
Copy link
Copy Markdown
Contributor

@HappenLee HappenLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 13, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 14, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 98.19% (217/221) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.68% (27401/37191)
Line Coverage 57.35% (296085/516236)
Region Coverage 54.62% (246810/451850)
Branch Coverage 56.28% (106967/190049)

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 14, 2026

/review

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 14, 2026

run cloud_p0

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking correctness regression in the refactor.

Critical Checkpoints

  • Goal of the task: The PR aims to simplify FunctionStringPad while preserving existing lpad/rpad behavior. Conclusion: not fully achieved because the new implementation changes results for non-UTF8 byte strings; no new test in this PR demonstrates compatibility.
  • Change size/focus: The refactor is focused on one function, but it rewrites the core execution path enough that behavior parity needs stronger validation.
  • Concurrency: Not applicable here; the code is single-threaded vectorized function logic and does not introduce shared-state synchronization.
  • Lifecycle/static initialization: No special lifecycle or static initialization concerns found.
  • Configuration: No new configuration items.
  • Compatibility/incompatible changes: No wire/storage compatibility impact, but there is a user-visible function behavior regression on arbitrary byte strings.
  • Parallel code paths: lpad and rpad both share this implementation, so the regression affects both paths.
  • Special conditional checks: The new ASCII/UTF-8 split is understandable, but the new UTF-8 string-length helper is not behavior-equivalent to the prior truncation logic.
  • Test coverage: Existing regression cases cover normal ASCII/UTF-8 usage, but I did not find coverage for malformed/non-UTF8 byte strings, which is exactly where the regression appears.
  • Observability: No additional observability needed for this function change.
  • Transaction/persistence: Not applicable.
  • Data writes/modifications: Not applicable beyond function output correctness.
  • FE/BE variable passing: Not applicable.
  • Performance: The const/const specialization is reasonable, but correctness must be fixed first.
  • Other issues: None beyond the blocking behavior change above were confirmed.

Please fix the string-length/truncation behavior so it remains compatible with the previous implementation for arbitrary byte strings, then re-run the relevant string-function coverage.

}

const size_t str_char_len = get_char_length<is_utf8>(str_data, str_byte_len);
const size_t target_char_len = static_cast<size_t>(target_len);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes lpad/rpad semantics for non-UTF8 byte strings. simd::VStringFunctions::get_char_len() counts only bytes > 0xBF, while the previous implementation decided whether the source already reached len via iterate_utf8_with_limit_length() (vstring_function.h:236), which advances by UTF8_BYTE_LENGTH[*p] and therefore treats continuation bytes 0x80..0xBF as one character.

A concrete regression is a one-byte string like unhex('80') with target length 1: before this refactor the function kept the original byte, but now str_char_len becomes 0, so we pad and return two bytes. Doris string columns can contain arbitrary byte sequences, so this is a behavior break. Please keep the source-string length/truncation logic consistent with the old iterate_utf8_with_limit_length() behavior (or make an explicit validated-UTF8-only change with tests if that is intentional).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我们默认,对于字符串函数,里面的字符串都是合法的。

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 98.19% (217/221) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.68% (27401/37191)
Line Coverage 57.35% (296085/516236)
Region Coverage 54.62% (246810/451850)
Branch Coverage 56.28% (106967/190049)

@Mryange Mryange dismissed github-actions[bot]’s stale review April 15, 2026 06:44

我们默认,对于字符串函数,里面的字符串都是合法的utf8字符。

@Mryange Mryange merged commit 5cbaa4f into apache:master Apr 15, 2026
31 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants