Skip to content

[refine](function) split monolithic function_string.h into domain-specific files#62262

Merged
BiteTheDDDDt merged 1 commit intoapache:masterfrom
Mryange:refine-function-string.h
Apr 10, 2026
Merged

[refine](function) split monolithic function_string.h into domain-specific files#62262
BiteTheDDDDt merged 1 commit intoapache:masterfrom
Mryange:refine-function-string.h

Conversation

@Mryange
Copy link
Copy Markdown
Contributor

@Mryange Mryange commented Apr 9, 2026

What problem does this PR solve?

function_string.h was a 5393-line monolithic header containing ~40 function class implementations with heavy dependencies (pugixml, ICU, Boost.Locale, digest libs, etc.). Every translation unit including it pulled in all dependencies, causing slow compilation and unnecessary rebuilds.

This PR splits it into 9 domain-specific files. 6 files with no external consumers are converted to standalone .cpp with their own registration functions. 3 files with external consumers remain as .h.

Converted to .cpp (6 files):

  • function_string_basic.cpp — Strcmp, Substring, Left, Right, NullOrEmpty
  • function_string_mask.cpp — Mask, MaskPartial
  • function_string_search.cpp — LocatePos, SplitPart, SubstringIndex, SplitByString, CountSubString
  • function_string_digest.cpp — SM3, MD5, SHA1, SHA2
  • function_string_url.cpp — ExtractURLParameter, ParseUrl, UrlDecode, UrlEncode
  • function_string_misc.cpp — AutoPartitionName, RandomBytes, ConvertTo, IntToChar, NgramSearch, Translate, XPathString, MakeSet, ExportSet, Crc32, UnicodeNormalize

Kept as .h (3 files):

  • function_string_concat.h — used by column_string_test.cpp
  • function_string_format.h — used by function_money_format_test.cpp
  • function_string_replace.h — used by function_reverse.h, function_sub_replace_test.cpp

Other changes:

  • function_string.cpp — removed original header include, calls 6 sub-registration functions via extern declarations
  • Removed unnecessary function_string.h includes from partition_transformers.h and function_split_by_regexp.cpp
  • Updated test files to include specific sub-headers
  • Fixed missing block.h include in function_array_reverse.h
  • Fixed pre-existing missing vexpr_context.h include in viceberg_merge_sink.cpp

The original function_string.h is deleted.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 9, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 92.84% (3617/3896) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.94% (26756/37192)
Line Coverage 54.91% (283373/516096)
Region Coverage 51.78% (233966/451813)
Branch Coverage 53.31% (101326/190057)

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 10, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found 1 issue that should be addressed before merge.

  1. Goal of the task: splitting function_string.h into smaller translation units does reduce include fanout, but the refactor is not fully self-contained yet because one of the new files still relies on an implicit RE2 declaration from the global PCH.
  2. Modification size/focus: the change is otherwise focused on the string-function split and related include cleanup.
  3. Concurrency: not applicable; I did not find new threading or locking changes in this PR.
  4. Special lifecycle/static initialization: no new lifecycle or static-init issues identified in the modified paths.
  5. Configuration: no config changes.
  6. Compatibility/incompatible changes: no FE/BE protocol or storage-format compatibility issue identified.
  7. Parallel code paths: registration/alias coverage for the split string functions looks consistent overall; I did not find a missing runtime registration in the reviewed paths.
  8. Special conditional checks: no new problematic conditional logic identified beyond the issue below.
  9. Test coverage: there are no new tests, and for a refactor like this the build itself is the key coverage. The missing direct include means the split file is not robust outside the current PCH build setup.
  10. Observability: not applicable.
  11. Transaction/persistence: not applicable.
  12. Data writes/modifications: not applicable.
  13. FE-BE variable passing: not applicable.
  14. Performance: the intended compile-time improvement makes sense; I did not identify a runtime-performance regression.
  15. Other issues: see the inline comment.

Summary opinion: the refactor is directionally good, but the new function_string_misc.cpp translation unit should add its direct RE2 dependency before merging so the split remains self-contained and build-robust.

// specific language governing permissions and limitations
// under the License.

#include <crc32c/crc32c.h>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FunctionAutoPartitionName::_auto_partition_type_of_range() still instantiates RE2 later in this file, but this new TU no longer includes re2/re2.h directly. After the split, it only builds because the BE target's global PCH happens to pull RE2 in transitively. That makes the refactor brittle: building this source without the PCH path will fail with an unknown RE2 type. Please add a direct #include <re2/re2.h> here so the new file is self-contained.

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Apr 10, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

@BiteTheDDDDt BiteTheDDDDt merged commit 7aa904c into apache:master Apr 10, 2026
31 of 32 checks passed
@Mryange Mryange mentioned this pull request Apr 27, 2026
29 tasks
yiguolei pushed a commit that referenced this pull request Apr 28, 2026
…cific files (#62262)

#62262

#56304


`function_string.h` was a 5393-line monolithic header containing ~40
function class implementations with heavy dependencies (`pugixml`, ICU,
Boost.Locale, digest libs, etc.). Every translation unit including it
pulled in all dependencies, causing slow compilation and unnecessary
rebuilds.

This PR splits it into 9 domain-specific files. 6 files with no external
consumers are converted to standalone `.cpp` with their own registration
functions. 3 files with external consumers remain as `.h`.

**Converted to .cpp (6 files):**
- `function_string_basic.cpp` — Strcmp, Substring, Left, Right,
NullOrEmpty
- `function_string_mask.cpp` — Mask, MaskPartial
- `function_string_search.cpp` — LocatePos, SplitPart, SubstringIndex,
SplitByString, CountSubString
- `function_string_digest.cpp` — SM3, MD5, SHA1, SHA2
- `function_string_url.cpp` — ExtractURLParameter, ParseUrl, UrlDecode,
UrlEncode
- `function_string_misc.cpp` — AutoPartitionName, RandomBytes,
ConvertTo, IntToChar, NgramSearch, Translate, XPathString, MakeSet,
ExportSet, Crc32, UnicodeNormalize

**Kept as .h (3 files):**
- `function_string_concat.h` — used by `column_string_test.cpp`
- `function_string_format.h` — used by `function_money_format_test.cpp`
- `function_string_replace.h` — used by `function_reverse.h`,
`function_sub_replace_test.cpp`

**Other changes:**
- `function_string.cpp` — removed original header include, calls 6
sub-registration functions via extern declarations
- Removed unnecessary `function_string.h` includes from
`partition_transformers.h` and `function_split_by_regexp.cpp`
- Updated test files to include specific sub-headers
- Fixed missing `block.h` include in `function_array_reverse.h`
- Fixed pre-existing missing `vexpr_context.h` include in
`viceberg_merge_sink.cpp`

The original `function_string.h` is deleted.

None

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change. - [ ] No code files have been
changed. - [ ] Other reason <!-- Add your reason? -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->

---------

Co-authored-by: linrrarity <linzhenqi@selectdb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.1.1-merged reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants