[opt](nereids) optimize length(str_col) by only read offset sub column#62205
Merged
englefly merged 3 commits intoapache:masterfrom Apr 9, 2026
Merged
[opt](nereids) optimize length(str_col) by only read offset sub column#62205englefly merged 3 commits intoapache:masterfrom
englefly merged 3 commits intoapache:masterfrom
Conversation
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
Author
|
run buildall |
Contributor
FE UT Coverage ReportIncrement line coverage |
Contributor
Author
|
run buildall |
Contributor
Author
|
run external |
1 similar comment
Contributor
Author
|
run external |
starocean999
previously approved these changes
Apr 8, 2026
Contributor
|
PR approved by at least one committer and no changes requested. |
Contributor
|
PR approved by anyone and no changes requested. |
Contributor
Author
|
/review |
Contributor
There was a problem hiding this comment.
I found 2 issues that should be addressed before merging.
StringEmptyToLengthRuledoes not match the analyzedstr_col = ''shape used in production, because type coercion wraps the empty-string literal inCast(...). The new unit test explicitly bypasses coercion to make the rule pass, which means the optimization is not actually exercised on real rewritten expressions.ExpressionUtils.extractUniformSlot()now infersslot = ''from anylength(slot) = 0, butlength()also acceptsVARBINARY. That leaks a string literal into uniform-slot / constant-propagation state for binary columns and can mis-rewrite downstream expressions with the wrong typed constant.
Critical checkpoint conclusions:
- Goal / correctness: The intended optimization is only partially achieved; one main rewrite path does not fire on analyzed expressions, and one new inference path is semantically too broad. Existing tests do not prove end-to-end correctness for the production expression shape.
- Change scope / focus: The PR is focused on Nereids expression and nested-column pruning, but it also changes uniform-slot inference, which introduces an unrelated semantic risk.
- Concurrency: No new concurrency or locking concerns found in the touched FE code.
- Lifecycle / static init: No special lifecycle or static initialization issues found.
- Config changes: None.
- Compatibility: No FE/BE protocol or storage-format compatibility issue was identified from the touched code paths.
- Parallel code paths: The optimization was wired into several access-path collectors, but the main expression rewrite path does not cover the analyzed comparison form after coercion.
- Special conditional checks: The new delete guard is understandable, but the
Literal-only empty-string check is too narrow for analyzed trees. - Test coverage: Added tests cover handcrafted pre-coercion expressions and explain-based regressions, but they miss the analyzed/coerced expression form and the new
length(varbinary_col)=0uniform-inference case. - Observability: No additional observability appears necessary for this change.
- Transaction / persistence: Not applicable.
- Data writes / modifications: Not applicable.
- FE/BE variable passing: No new transmitted fields beyond access-path contents; no incompatibility confirmed in the touched path.
- Performance: The intended pruning should help when it fires, but the missed rewrite path means some targeted queries will get no benefit.
- Other issues: None beyond the two findings above.
...c/test/java/org/apache/doris/nereids/rules/expression/rules/StringEmptyToLengthRuleTest.java
Show resolved
Hide resolved
fe/fe-core/src/main/java/org/apache/doris/nereids/util/ExpressionUtils.java
Show resolved
Hide resolved
Contributor
Author
|
run buildall |
starocean999
approved these changes
Apr 9, 2026
Contributor
|
PR approved by at least one committer and no changes requested. |
feiniaofeiafei
approved these changes
Apr 9, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Optimized the calculation of length(str_col).
Treat the string column as a combination of an offset sub column and a chars sub column.
Prune the string column via NestedColumnPruning so that the BE only needs to read the offset sub column, thereby saving I/O for reading the chars sub column.
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)