[feature](be) Support nested parquet struct predicate pruning and stats filtering #64098
Merged
suxiaogang223 merged 8 commits on Jun 4, 2026
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Support file-layer pruning for primitive leaf predicates under Parquet STRUCT columns in the new parquet reader. The change keeps row-level filtering on Expr/VExprContext, adds file-local nested predicate targets, merges filter-only nested projections, and resolves nested primitive leaves for statistics, dictionary, bloom, and page-index pruning.
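As a rough illustration of the leaf resolution and statistics pruning described above, the following is a minimal sketch, not the code in this PR. `SchemaNode`, `LeafStats`, `resolve_leaf`, and `can_prune_eq` are hypothetical names invented for the example, and the statistics are assumed to be plain int64 min/max values.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified schema node; not the Doris or Parquet schema type.
struct SchemaNode {
    int32_t field_id;
    bool is_primitive;
    std::vector<SchemaNode> children;
};

// Hypothetical min/max statistics for one int64 leaf column chunk.
struct LeafStats {
    int64_t min;
    int64_t max;
};

// Walk a child field-id path from a top-level column down to a leaf.
// Returns nullptr when the path is broken or ends on a non-primitive node,
// so the caller can skip pruning instead of guessing.
inline const SchemaNode* resolve_leaf(const SchemaNode& root,
                                      const std::vector<int32_t>& child_id_path) {
    const SchemaNode* cur = &root;
    for (int32_t id : child_id_path) {
        const SchemaNode* next = nullptr;
        for (const SchemaNode& child : cur->children) {
            if (child.field_id == id) {
                next = &child;
                break;
            }
        }
        if (next == nullptr) {
            return nullptr;
        }
        cur = next;
    }
    return cur->is_primitive ? cur : nullptr;
}

// True when `leaf == value` cannot match any row covered by `stats`.
// Missing statistics never prune.
inline bool can_prune_eq(const std::optional<LeafStats>& stats, int64_t value) {
    if (!stats.has_value()) {
        return false;
    }
    return value < stats->min || value > stats->max;
}
```

The sketch keeps pruning conservative: any unresolved path or missing statistic falls back to reading the data and relying on row-level filtering.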
### Release note
None
### Check List (For Author)
- Test: Unit Test / Manual test
- Ran build-support/clang-format.sh for modified BE files.
- Ran git diff --check.
- Local BE UT did not start because the macOS clang16 toolchain failed the CMake compiler probe with `ld: library c++ not found`. The Fedora build and UT will be run after push.
- Behavior changed: No
- Does this need documentation: Yes (updated internal design document)
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Fix a missing local Status declaration in ParquetColumnReaderTest so the BE unit test target can compile.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh for the modified test file.
- Ran git diff --check.
- Fedora BE UT will be rerun after push.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Fix duplicate and missing status variable declarations in parquet column reader unit tests so the BE UT target can compile.
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add nested STRUCT IN-list pruning hint extraction for new parquet scans and restore explicit nested scalar value index mapping so nullable struct parent/child values remain aligned with Arrow RecordReader output.
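One way to picture an IN-list pruning hint is a range check against leaf min/max statistics. The sketch below is an assumption-laden illustration, not the extraction logic in this commit: `can_prune_in_list` is a hypothetical helper, values are assumed to be non-NULL int64, and NULL handling is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True when no IN-list value lies inside [min_value, max_value], meaning
// `leaf IN (...)` cannot match any row covered by these statistics.
// An empty list matches nothing, so it also reports "prunable".
inline bool can_prune_in_list(const std::vector<int64_t>& in_values,
                              int64_t min_value, int64_t max_value) {
    for (int64_t v : in_values) {
        if (v >= min_value && v <= max_value) {
            return false;  // at least one candidate may be present
        }
    }
    return true;
}
```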
### Release note
None
### Check List (For Author)
- Test: Manual test
- build-support/clang-format.sh be/src/format/reader/column_mapper.cpp be/src/format/new_parquet/reader/nested_column_reader.h be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp be/test/format/new_parquet/parquet_reader_test.cpp
- git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add real parquet fixtures for nested struct dictionary and page-index pruning, and update the complex predicate pruning status document.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- git diff --check
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Resolve nested struct filter projection and pruning targets through ColumnMapping before falling back to file schema names, so renamed mapped children can still produce file-local pruning paths.
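The "mapping first, name second" lookup order can be sketched as follows. This is a hypothetical simplification: `FileChild`, `ChildMapping`, and `resolve_file_child_id` are invented for the example and do not mirror the real `ColumnMapping` structure. The choice to return no result when a mapped field id is absent from the file, rather than falling back to the name, is also an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical view of one child field in the file schema.
struct FileChild {
    int32_t field_id;
    std::string name;
};

// Hypothetical per-struct mapping: table child name -> file field id.
using ChildMapping = std::unordered_map<std::string, int32_t>;

inline std::optional<int32_t> resolve_file_child_id(
        const std::string& table_child_name, const ChildMapping& mapping,
        const std::vector<FileChild>& file_children) {
    // 1. Prefer the explicit mapping, which survives renames.
    auto it = mapping.find(table_child_name);
    if (it != mapping.end()) {
        for (const FileChild& child : file_children) {
            if (child.field_id == it->second) {
                return child.field_id;
            }
        }
        return std::nullopt;  // mapped, but the file has no such field
    }
    // 2. Fall back to matching the file schema by name.
    for (const FileChild& child : file_children) {
        if (child.name == table_child_name) {
            return child.field_id;
        }
    }
    return std::nullopt;
}
```

With this order, a table child renamed from `old_name` to `new_name` still resolves to its file field as long as the mapping carries its field id.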
### Release note
None
### Check List (For Author)
- Test: Unit Test
- git diff --check
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Treat table/file nested child name mismatches as complex projections so field-id mapped renamed children are read with the correct file-local projection.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Mark the completed nested parquet predicate and pruning implementation scope, and move remaining items into explicit non-goals for this branch.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Behavior changed: No
- Does this need documentation: No
Summary
Implements complex type predicate filtering and statistics-based file-layer pruning for nested Parquet STRUCT columns, aligning with DuckDB's nested filter semantics while respecting Doris' new parquet reader architecture.
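The nested paths being pruned come from `struct_element(VSlotRef(parent), literal child)` chains, as listed under Changes. A minimal sketch of turning such a chain into a root-first path might look like the following. The `Expr` type and `extract_nested_path` are hypothetical stand-ins for illustration and are not the Doris `VExpr` classes.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified expression node.
struct Expr {
    enum class Kind { SlotRef, StructElement, Literal, Other };
    Kind kind;
    std::string name;  // slot name for SlotRef, child name for Literal
    std::vector<Expr> children;
};

// Returns {column, child, child, ...} root first, or nullopt when the
// expression is not a pure struct_element(slot, literal) chain.
inline std::optional<std::vector<std::string>> extract_nested_path(const Expr& expr) {
    std::vector<std::string> reversed;
    const Expr* cur = &expr;
    while (cur->kind == Expr::Kind::StructElement) {
        if (cur->children.size() != 2 ||
            cur->children[1].kind != Expr::Kind::Literal) {
            return std::nullopt;
        }
        reversed.push_back(cur->children[1].name);
        cur = &cur->children[0];
    }
    // A bare slot reference is a top-level column, not a nested path.
    if (cur->kind != Expr::Kind::SlotRef || reversed.empty()) {
        return std::nullopt;
    }
    reversed.push_back(cur->name);
    return std::vector<std::string>(reversed.rbegin(), reversed.rend());
}
```

For `struct_element(struct_element(s, 'a'), 'b')` the sketch yields the path `s`, `a`, `b`; anything else in the chain makes it decline, which leaves the predicate to row-level evaluation.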
Changes
Row-level Expr Localization
- `struct_element(VSlotRef(parent), literal child)` chains are recognized as nested paths
- … `struct_element` form
Filter-only Nested Projection
- … `FieldProjection.children`
- … `ColumnMapping.child_mappings` to avoid affecting table output materialization
Nested File-layer Pruning Target
- `FileColumnPredicateFilter` adds `file_child_id_path` for file-local child field-id paths
- `struct_element(...) op literal` / `IN (...)` construct pruning hints
Parquet Leaf Resolution & Pruning
- `ResolvePredicateLeafSchema()` resolves top-level or nested targets to primitive leaf schema
Test Coverage
Out of Scope (intentionally)
Related
🤖 Generated with Claude Code