[opt](nereids) Optimize I/O operations for the IS NULL predicate#62304
Open
englefly wants to merge 3 commits into apache:master from
### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

When a query uses `col IS NULL` or `col IS NOT NULL` as the only access to a nullable column, the BE previously read the full column data. Nullable columns consist of a null-flag column (null map) and a data column. When the only usage is null-checking, we should prune the data column and read only the null flag.

This commit extends the existing NestedColumnPruning framework (which already handles `length(str_col)` via OFFSET paths) to detect IS NULL patterns and emit `[col_name, NULL]` access paths. The BE can then skip data reading for those columns.

### Changes

- **AccessPathInfo.java**: Add the `ACCESS_NULL = "NULL"` constant.
- **AccessPathExpressionCollector.java**: Implement `visitIsNull()` and `visitNot()` to detect IS NULL / IS NOT NULL on direct SlotReferences (without subPath), creating NULL-suffix access contexts. Add fallback nullable slot handling with guards for NestedColumnPrunable and string-like types.
- **NestedColumnPruning.java**: Add the `containsNullCheck()` early-exit guard, the `isNullCheckOnly` field and `hasNullCheckOnlyAccess()` method on DataTypeAccessTree, NULL path handling in `setAccessByPath()`, a null-only branch in `pruneDataType()`, and NULL-path stripping from allAccessPaths when data access also exists (predicateAccessPaths retains NULL paths).
- **PruneNestedColumnTest.java**: Add 3 unit tests: struct IS NULL pruning, IS NOT NULL pruning, and mixed IS NULL + field access. Update existing testFilter expectations for null-optimized paths.
- **null_column_pruning.groovy**: New regression test verifying that EXPLAIN plans show NULL access paths for struct/array/map IS NULL, IS NOT NULL, aggregates, mixed access, and full-struct projection scenarios.

### Release note

Support IS NULL / IS NOT NULL null-flag-only column reading optimization via NestedColumnPruning. When a nullable column is only used in null checks, only the null flag is read, skipping full data column reading.

### Check List (For Author)

- Test: Unit Test (PruneNestedColumnTest 38/38 pass) + Regression test (plan-only)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
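The decision described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `NullAccessPathSketch`, `Usage`, `ColumnUse`, and `accessPaths` are hypothetical names, and only the `"NULL"` suffix mirrors the `ACCESS_NULL` constant mentioned in the message. It assumes each column use has already been classified as either a null check or a data read.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from the Doris codebase
// except the "NULL" suffix, which mirrors ACCESS_NULL in the commit message.
class NullAccessPathSketch {
    static final String ACCESS_NULL = "NULL";

    // How a query touches a column: only a null check, or a read of its data.
    enum Usage { NULL_CHECK, DATA }

    // One recorded use of a column somewhere in the query.
    static final class ColumnUse {
        final String column;
        final Usage usage;

        ColumnUse(String column, Usage usage) {
            this.column = column;
            this.usage = usage;
        }
    }

    // Returns one access path per column: [col, "NULL"] when every use is a
    // null check, otherwise the plain [col] path that also reads the data.
    static Map<String, List<String>> accessPaths(List<ColumnUse> uses) {
        Map<String, Boolean> nullOnly = new LinkedHashMap<>();
        for (ColumnUse use : uses) {
            // A single data read is enough to turn the flag off for that column.
            nullOnly.merge(use.column, use.usage == Usage.NULL_CHECK, Boolean::logicalAnd);
        }
        Map<String, List<String>> paths = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> e : nullOnly.entrySet()) {
            String column = e.getKey();
            paths.put(column, e.getValue()
                    ? Arrays.asList(column, ACCESS_NULL)
                    : Collections.singletonList(column));
        }
        return paths;
    }

    public static void main(String[] args) {
        List<ColumnUse> uses = Arrays.asList(
                new ColumnUse("a", Usage.NULL_CHECK),   // a IS NULL
                new ColumnUse("a", Usage.NULL_CHECK),   // a IS NOT NULL
                new ColumnUse("b", Usage.NULL_CHECK),   // b IS NULL
                new ColumnUse("b", Usage.DATA));        // SELECT b
        System.out.println(accessPaths(uses));          // prints {a=[a, NULL], b=[b]}
    }
}
```

In this sketch, column `a` gets the NULL-suffixed path because it is used only in null checks, while `b` falls back to a plain path because its data is also read, which corresponds to the NULL-path stripping the message describes for mixed access.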
Contributor
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
…ss in column pruning

### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

When using `struct_element(struct_col, 'city') IS NULL`, the nested column pruning optimization failed to emit the NULL-suffixed access path (e.g. `[struct_col.city.NULL]`). Instead it produced `[struct_col.city]`, meaning the BE would read the full column data instead of just the null flag.

Three interconnected bugs were identified and fixed (items 1 to 3), together with one adjustment that keeps the existing behavior for variant types (item 4):

1. **visitIsNull only handled direct SlotReference**: The IS NULL visitor only recognized `col IS NULL` (direct slot), not nested expressions like `struct_element(s, 'city') IS NULL`. Fixed by broadening it to accept any nullable expression and propagating the NULL context through recognized access visitors (struct_element, element_at, etc.).
2. **setAccessByPath set accessPartialChild before the NULL check**: The NULL path marker is a flag, not a real child. Setting `accessPartialChild = true` before checking for NULL caused `isNullCheckOnly` detection to fail. Fixed by moving the NULL check before the `accessPartialChild` assignment.
3. **pruneDataType returned Optional.empty() for null-check-only nodes**: Parent nodes interpreted this as "child not accessed" and dropped the child from the pruned type. Fixed by returning `Optional.of(type)` so null-check-only children are preserved in the pruned struct type.
4. **Variant sub-column NULL stripping**: Variant types do not support the null-flag-only optimization for sub-column access. Added stripping of the NULL suffix in the variant slot reference handler to maintain existing behavior.

### Release note

None

### Check List (For Author)

- Test: Unit Test (PruneNestedColumnTest) / Manual test (explain verbose)
- Behavior changed: No (optimization was already intended but not working for nested access)
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
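Fixes 2 and 3 from the commit message above can be illustrated with a miniature access tree. This is a sketch under assumptions, not the actual `DataTypeAccessTree`: the class `AccessTreeSketch`, the `accessAll` flag, the `child` helper, and the string-based type names are hypothetical, while `setAccessByPath`, `pruneDataType`, `accessPartialChild`, and `isNullCheckOnly` reuse the names quoted in the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical miniature of an access tree; only the member names quoted in
// the commit message are taken from the source, everything else is assumed.
class AccessTreeSketch {
    static final String ACCESS_NULL = "NULL";

    final String type;                       // e.g. "struct" or "string"
    final Map<String, AccessTreeSketch> children = new LinkedHashMap<>();
    boolean accessAll;                       // hypothetical: whole node is read
    boolean accessPartialChild;
    boolean isNullCheckOnly;

    AccessTreeSketch(String type) {
        this.type = type;
    }

    AccessTreeSketch child(String name, AccessTreeSketch node) {
        children.put(name, node);
        return this;
    }

    // Marks the node reached by path[from..] as accessed.
    void setAccessByPath(List<String> path, int from) {
        if (from == path.size()) {           // path ends here: read the whole node
            accessAll = true;
            isNullCheckOnly = false;
            return;
        }
        String next = path.get(from);
        // Fix 2: test for the NULL marker first. It is a flag, not a real
        // child, so it must not set accessPartialChild.
        if (ACCESS_NULL.equals(next)) {
            if (!accessAll && !accessPartialChild) {
                isNullCheckOnly = true;
            }
            return;
        }
        AccessTreeSketch target = children.get(next);
        if (target == null) {
            throw new IllegalArgumentException("unknown child: " + next);
        }
        accessPartialChild = true;
        isNullCheckOnly = false;
        target.setAccessByPath(path, from + 1);
    }

    // Returns the pruned type, or empty when the node is never touched.
    Optional<String> pruneDataType() {
        if (accessAll || isNullCheckOnly) {
            // Fix 3: a null-check-only node still reports its type, so the
            // parent keeps it instead of treating it as "not accessed".
            return Optional.of(type);
        }
        if (!accessPartialChild) {
            return Optional.empty();
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, AccessTreeSketch> e : children.entrySet()) {
            e.getValue().pruneDataType().ifPresent(t -> kept.add(e.getKey() + ":" + t));
        }
        return Optional.of(type + "<" + String.join(",", kept) + ">");
    }

    public static void main(String[] args) {
        AccessTreeSketch root = new AccessTreeSketch("struct")
                .child("city", new AccessTreeSketch("string"))
                .child("zip", new AccessTreeSketch("int"));
        // Path for struct_element(struct_col, 'city') IS NULL; index 0 is the slot name.
        root.setAccessByPath(Arrays.asList("struct_col", "city", ACCESS_NULL), 1);
        System.out.println(root.pruneDataType().get());   // prints struct<city:string>
    }
}
```

With the NULL check moved ahead of the `accessPartialChild` assignment, the `city` node ends up null-check-only, and because `pruneDataType()` returns a present value for it, the pruned struct keeps `city` and drops the untouched `zip`.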
029bcba to
5f53583
Compare
Contributor
Author
run buildall
Contributor
FE UT Coverage Report
Increment line coverage
### What problem does this PR solve?

Treat a nullable field as the combination of a null flag and data. When evaluating the `col IS NULL` predicate, use the NestedColumnPruning rule to prune the `col` field to `col.NULL`, thereby saving I/O on the data.

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

### Release note

None

### Check List (For Author)

- Test
- Behavior changed:
- Does this need documentation?

### Check List (For Reviewer who merge this PR)