
[opt](nereids) optimize length(str_col) by only read offset sub column #62205

Merged
englefly merged 3 commits into apache:master from englefly:len-str-v2 on Apr 9, 2026

@englefly
Contributor

@englefly englefly commented Apr 8, 2026

What problem does this PR solve?

Optimizes the evaluation of length(str_col).
A string column is treated as the combination of an offset sub-column and a chars sub-column.
NestedColumnPruning prunes the string column so that the BE reads only the offset sub-column, which saves the I/O of reading the chars sub-column.
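
To illustrate why the offset sub-column alone is enough, here is a minimal sketch. It assumes the offsets store the cumulative end position of each row in the chars buffer and that length() counts bytes; the helper names `build_string_column` and `lengths_from_offsets` are hypothetical and are not taken from the Doris code base.

```python
from typing import List, Tuple


def build_string_column(values: List[str]) -> Tuple[List[int], bytes]:
    """Toy layout (assumption): one contiguous chars buffer plus cumulative end offsets."""
    chars = bytearray()
    offsets = []
    for value in values:
        chars.extend(value.encode("utf-8"))
        offsets.append(len(chars))  # end position of this row in the chars buffer
    return offsets, bytes(chars)


def lengths_from_offsets(offsets: List[int]) -> List[int]:
    """Compute per-row byte lengths without touching the chars buffer."""
    lengths = []
    previous_end = 0
    for end in offsets:
        lengths.append(end - previous_end)
        previous_end = end
    return lengths


offsets, chars = build_string_column(["doris", "", "héllo"])
byte_lengths = lengths_from_offsets(offsets)
```

In this toy layout the chars buffer is never read when computing `byte_lengths`, which is the I/O the pruning is meant to avoid.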
Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Apr 8, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@englefly
Contributor Author

englefly commented Apr 8, 2026

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 50.25% (102/203) 🎉
Increment coverage report
Complete coverage report

@englefly
Contributor Author

englefly commented Apr 8, 2026

run buildall

@englefly
Contributor Author

englefly commented Apr 8, 2026

run external

@englefly
Contributor Author

englefly commented Apr 8, 2026

run external

@englefly englefly changed the title from "Len str v2" to "[opt](nereids) optimize length(str_col) by only read offset sub column" on Apr 8, 2026
@englefly englefly marked this pull request as ready for review April 8, 2026 10:56
starocean999 previously approved these changes Apr 8, 2026
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 8, 2026
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

PR approved by anyone and no changes requested.

@englefly
Contributor Author

englefly commented Apr 8, 2026

/review

Contributor

@github-actions github-actions bot left a comment

I found 2 issues that should be addressed before merging.

  1. StringEmptyToLengthRule does not match the analyzed str_col = '' shape used in production, because type coercion wraps the empty-string literal in Cast(...). The new unit test explicitly bypasses coercion to make the rule pass, which means the optimization is not actually exercised on real rewritten expressions.
  2. ExpressionUtils.extractUniformSlot() now infers slot = '' from any length(slot) = 0, but length() also accepts VARBINARY. That leaks a string literal into uniform-slot / constant-propagation state for binary columns and can mis-rewrite downstream expressions with the wrong typed constant.
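
The first finding can be pictured with a toy expression tree. This is only a sketch of the idea, not the Nereids API: the classes `Slot`, `Literal`, `Cast`, `Length`, `EqualTo` and the function `rewrite_empty_compare` are hypothetical stand-ins, and the set of string-like type names is an assumption. The point is that a rule matching only a bare literal misses the coerced shape, while a rule that first unwraps string-to-string casts matches both.

```python
from dataclasses import dataclass
from typing import Any

# Assumed set of string-like type names for this toy model.
STRING_TYPES = {"char", "varchar", "string"}


@dataclass(frozen=True)
class Slot:
    name: str
    dtype: str


@dataclass(frozen=True)
class Literal:
    value: Any
    dtype: str


@dataclass(frozen=True)
class Cast:
    child: Any
    dtype: str


@dataclass(frozen=True)
class Length:
    child: Any


@dataclass(frozen=True)
class EqualTo:
    left: Any
    right: Any


def strip_string_casts(expr):
    """Unwrap casts whose target is string-like; other casts are left in place."""
    while isinstance(expr, Cast) and expr.dtype in STRING_TYPES:
        expr = expr.child
    return expr


def is_empty_string(expr):
    inner = strip_string_casts(expr)
    return (isinstance(inner, Literal)
            and inner.dtype in STRING_TYPES
            and inner.value == "")


def rewrite_empty_compare(expr):
    """Rewrite `slot = ''` (either operand order) to `length(slot) = 0`."""
    if not isinstance(expr, EqualTo):
        return expr
    for slot_side, other in ((expr.left, expr.right), (expr.right, expr.left)):
        if (isinstance(slot_side, Slot)
                and slot_side.dtype in STRING_TYPES
                and is_empty_string(other)):
            return EqualTo(Length(slot_side), Literal(0, "int"))
    return expr


# Pre-coercion shape, and the shape after coercion wraps the literal in a cast.
raw = EqualTo(Slot("s", "varchar"), Literal("", "varchar"))
analyzed = EqualTo(Slot("s", "varchar"), Cast(Literal("", "varchar"), "string"))
expected = EqualTo(Length(Slot("s", "varchar")), Literal(0, "int"))
```

A test built from the `analyzed` shape, rather than the handcrafted `raw` shape, is the one that would exercise the rule the way production plans do.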

Critical checkpoint conclusions:

  • Goal / correctness: The intended optimization is only partially achieved; one main rewrite path does not fire on analyzed expressions, and one new inference path is semantically too broad. Existing tests do not prove end-to-end correctness for the production expression shape.
  • Change scope / focus: The PR is focused on Nereids expression and nested-column pruning, but it also changes uniform-slot inference, which introduces an unrelated semantic risk.
  • Concurrency: No new concurrency or locking concerns found in the touched FE code.
  • Lifecycle / static init: No special lifecycle or static initialization issues found.
  • Config changes: None.
  • Compatibility: No FE/BE protocol or storage-format compatibility issue was identified from the touched code paths.
  • Parallel code paths: The optimization was wired into several access-path collectors, but the main expression rewrite path does not cover the analyzed comparison form after coercion.
  • Special conditional checks: The new delete guard is understandable, but the Literal-only empty-string check is too narrow for analyzed trees.
  • Test coverage: Added tests cover handcrafted pre-coercion expressions and explain-based regressions, but they miss the analyzed/coerced expression form and the new length(varbinary_col)=0 uniform-inference case.
  • Observability: No additional observability appears necessary for this change.
  • Transaction / persistence: Not applicable.
  • Data writes / modifications: Not applicable.
  • FE/BE variable passing: No new transmitted fields beyond access-path contents; no incompatibility confirmed in the touched path.
  • Performance: The intended pruning should help when it fires, but the missed rewrite path means some targeted queries will get no benefit.
  • Other issues: None beyond the two findings above.
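
The second finding, the overly broad inference from length(slot) = 0, amounts to a missing type guard. The sketch below is a toy model under stated assumptions: `Slot`, `LengthEqZero`, and `infer_uniform_value` are hypothetical names, not the signature of ExpressionUtils.extractUniformSlot(), and the type-name strings are placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed set of string-like type names; "varbinary" is deliberately excluded.
STRING_TYPES = {"char", "varchar", "string"}


@dataclass(frozen=True)
class Slot:
    name: str
    dtype: str


@dataclass(frozen=True)
class LengthEqZero:
    """Toy stand-in for the predicate length(slot) = 0."""
    slot: Slot


def infer_uniform_value(pred: LengthEqZero) -> Optional[Tuple[Slot, str]]:
    """Infer slot = '' only when the slot is string-typed.

    For other types accepted by length(), such as a binary column, no string
    literal is inferred, so nothing wrongly typed enters constant propagation.
    """
    if pred.slot.dtype not in STRING_TYPES:
        return None
    return (pred.slot, "")


string_case = infer_uniform_value(LengthEqZero(Slot("name", "varchar")))
binary_case = infer_uniform_value(LengthEqZero(Slot("payload", "varbinary")))
```

Here `string_case` carries the inferred empty string, while `binary_case` is `None`, which corresponds to the length(varbinary_col)=0 case the review says is not covered by tests.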

@englefly
Contributor Author

englefly commented Apr 8, 2026

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Apr 8, 2026
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 9, 2026
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

PR approved by at least one committer and no changes requested.

@englefly englefly merged commit eb2567d into apache:master Apr 9, 2026
31 of 34 checks passed

Labels

approved (Indicates a PR has been approved by one committer.), reviewed

5 participants