
[feature](be) Support nested parquet struct predicate pruning and stats filtering #64098

Merged
suxiaogang223 merged 8 commits into refact_reader_branch from codex/complex-column-predicate-stats-filtering on Jun 4, 2026
Conversation

@suxiaogang223
Member

Summary

Implements complex type predicate filtering and statistics-based file-layer pruning for nested Parquet STRUCT columns, aligning with DuckDB's nested filter semantics while respecting Doris' new parquet reader architecture.

Changes

Row-level Expr Localization

  • struct_element(VSlotRef(parent), literal child) chains are recognized as nested paths
  • Parent slot is rewritten to file-local top-level block slot while preserving struct_element form
  • Struct children are NOT registered as independent block slots
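The recognition and rewrite steps above can be sketched roughly as follows. This is a minimal C++17 illustration, not the Doris implementation: `Expr`, `NestedPath`, `extract_nested_path`, and `localize_parent_slot` are hypothetical names standing in for the real `VExpr`/`VSlotRef` machinery, and child names are modeled as plain strings.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified expression node (not the Doris VExpr hierarchy).
struct Expr {
    enum class Kind { SlotRef, Literal, StructElement, Other };
    Kind kind = Kind::Other;
    int slot_id = -1;                             // SlotRef only
    std::string literal;                          // Literal only: child field name
    std::vector<std::shared_ptr<Expr>> children;  // StructElement: {parent, name}
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr make_slot(int slot_id) {
    auto e = std::make_shared<Expr>();
    e->kind = Expr::Kind::SlotRef;
    e->slot_id = slot_id;
    return e;
}

ExprPtr make_struct_element(ExprPtr parent, std::string child_name) {
    auto name = std::make_shared<Expr>();
    name->kind = Expr::Kind::Literal;
    name->literal = std::move(child_name);
    auto e = std::make_shared<Expr>();
    e->kind = Expr::Kind::StructElement;
    e->children = {std::move(parent), std::move(name)};
    return e;
}

struct NestedPath {
    int parent_slot_id;
    std::vector<std::string> child_names;  // outermost child first
};

// Returns the nested path if expr is a pure struct_element(...) chain rooted
// at a slot reference; returns nullopt for any other shape.
std::optional<NestedPath> extract_nested_path(const Expr& expr) {
    std::vector<std::string> names;
    const Expr* cur = &expr;
    while (cur->kind == Expr::Kind::StructElement) {
        if (cur->children.size() != 2 || !cur->children[0] || !cur->children[1]) {
            return std::nullopt;
        }
        const Expr& name = *cur->children[1];
        if (name.kind != Expr::Kind::Literal) return std::nullopt;
        names.push_back(name.literal);
        cur = cur->children[0].get();
    }
    if (cur->kind != Expr::Kind::SlotRef || names.empty()) return std::nullopt;
    std::reverse(names.begin(), names.end());  // collected innermost-first
    return NestedPath{cur->slot_id, std::move(names)};
}

// Rewrites only the root slot id, leaving the struct_element chain intact.
void localize_parent_slot(Expr& expr, int file_local_slot_id) {
    Expr* cur = &expr;
    while (cur->kind == Expr::Kind::StructElement && !cur->children.empty() &&
           cur->children[0]) {
        cur = cur->children[0].get();
    }
    if (cur->kind == Expr::Kind::SlotRef) cur->slot_id = file_local_slot_id;
}
```

Because only the root slot id changes, the expression keeps its `struct_element` form and no extra block slot is needed for the child.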

Filter-only Nested Projection

  • Filter-referenced struct children are merged into the same top-level complex column's FieldProjection.children
  • Output children come first and keep their order; filter-only children are appended to the read projection
  • Filter-only children are excluded from ColumnMapping.child_mappings to avoid affecting table output materialization
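One way to picture the merge described above is the sketch below. It is an assumption-laden simplification: `ProjectionPlan` and `merge_projection` are hypothetical names, and children are identified by plain strings rather than by the real `FieldProjection` and `ColumnMapping` structures.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one top-level struct column's projection.
struct ProjectionPlan {
    std::vector<std::string> read_children;    // children the file reader decodes
    std::vector<std::string> output_children;  // children materialized to the table
};

// Output children keep their order and come first; children referenced only
// by filters are appended to the read list and never added to the output list.
ProjectionPlan merge_projection(const std::vector<std::string>& output_children,
                                const std::vector<std::string>& filter_children) {
    ProjectionPlan plan;
    plan.output_children = output_children;
    plan.read_children = output_children;
    for (const auto& name : filter_children) {
        const bool already_read =
                std::find(plan.read_children.begin(), plan.read_children.end(), name) !=
                plan.read_children.end();
        if (!already_read) plan.read_children.push_back(name);
    }
    return plan;
}
```

For example, with output children `{a, b}` and filter children `{c, a}`, the read list becomes `{a, b, c}` while the output list stays `{a, b}`.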

Nested File-layer Pruning Target

  • FileColumnPredicateFilter adds file_child_id_path for file-local child field-id paths
  • Under AND semantics, struct_element(...) op literal and struct_element(...) IN (...) predicates produce pruning hints
  • OR/NOT/arbitrary function subtrees are NOT extracted for pruning, so a hint can never prune rows that another branch would still match
  • Supports renamed nested children via table-to-file field-id mapping
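The AND-only extraction rule can be illustrated with a small sketch. Everything here is hypothetical (`Pred`, `PruningHint`, `collect_pruning_hints`), literals are restricted to integers, and IN lists are omitted for brevity. The point is that only conjuncts are safe to lift into hints, since pruning on one side of an OR could drop rows that satisfy the other side.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical predicate tree; not the Doris expression classes.
struct Pred {
    enum class Kind { And, Or, Not, Compare, Other };
    Kind kind = Kind::Other;
    std::vector<int> file_child_id_path;  // Compare only: file-local field-id path
    std::string op;                       // Compare only, e.g. ">" or "="
    long long literal = 0;                // Compare only
    std::vector<std::shared_ptr<Pred>> children;
};
using PredPtr = std::shared_ptr<Pred>;

PredPtr make_compare(std::vector<int> path, std::string op, long long literal) {
    auto p = std::make_shared<Pred>();
    p->kind = Pred::Kind::Compare;
    p->file_child_id_path = std::move(path);
    p->op = std::move(op);
    p->literal = literal;
    return p;
}

PredPtr make_node(Pred::Kind kind, std::vector<PredPtr> children) {
    auto p = std::make_shared<Pred>();
    p->kind = kind;
    p->children = std::move(children);
    return p;
}

struct PruningHint {
    std::vector<int> file_child_id_path;
    std::string op;
    long long literal;
};

// Descends through AND nodes only. OR, NOT, and unknown nodes contribute no
// hints, which may miss pruning opportunities but never prunes incorrectly.
void collect_pruning_hints(const Pred& pred, std::vector<PruningHint>& out) {
    switch (pred.kind) {
    case Pred::Kind::And:
        for (const auto& child : pred.children) {
            if (child) collect_pruning_hints(*child, out);
        }
        break;
    case Pred::Kind::Compare:
        out.push_back({pred.file_child_id_path, pred.op, pred.literal});
        break;
    default:
        break;
    }
}
```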

Parquet Leaf Resolution & Pruning

  • ResolvePredicateLeafSchema() resolves top-level or nested targets to primitive leaf schema
  • Row group min/max statistics pruning for nested struct primitives
  • Dictionary pruning for nested struct string-like columns
  • Bloom filter pruning via Arrow adapter for supported primitive types
  • Page index row range pruning for non-repeated primitive leaves only
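As a rough illustration of the min/max step, the sketch below decides whether a row group can be skipped for a single resolved primitive leaf. It assumes 64-bit integer statistics and uses hypothetical names (`Int64Stats`, `CmpOp`, `can_skip_row_group`, `can_skip_row_group_in`); the real reader works on Parquet metadata and many more types. Missing statistics always mean "do not skip".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

enum class CmpOp { EQ, LT, LE, GT, GE };

// Hypothetical min/max statistics of one leaf column in one row group.
struct Int64Stats {
    int64_t min;
    int64_t max;
};

// True only when no value in [min, max] can satisfy `leaf op literal`.
// NULL rows never satisfy a comparison, so they do not affect the decision.
bool can_skip_row_group(const std::optional<Int64Stats>& stats, CmpOp op, int64_t literal) {
    if (!stats) return false;  // no statistics: must read the row group
    switch (op) {
    case CmpOp::EQ: return literal < stats->min || literal > stats->max;
    case CmpOp::LT: return stats->min >= literal;
    case CmpOp::LE: return stats->min > literal;
    case CmpOp::GT: return stats->max <= literal;
    case CmpOp::GE: return stats->max < literal;
    }
    return false;
}

// IN list: skip only if every candidate value falls outside [min, max].
bool can_skip_row_group_in(const std::optional<Int64Stats>& stats,
                           const std::vector<int64_t>& values) {
    if (!stats) return false;
    return std::all_of(values.begin(), values.end(), [&](int64_t v) {
        return v < stats->min || v > stats->max;
    });
}
```

With statistics `[10, 20]`, a predicate `leaf > 20` lets the row group be skipped, while `leaf >= 20` does not.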

Test Coverage

  • Mapper unit tests: nested predicate filters (GT, IN_LIST, reverse comparison, deep path)
  • Renamed child projection via field-id mapping
  • Missing child and OR subtree safety (no false pruning hints)
  • Real Parquet fixture tests for statistics, dictionary, and page index pruning
  • Bloom filter unit tests via Arrow adapter

Out of Scope (intentionally)

  • LIST/MAP/repeated leaf pruning
  • Dynamic field names or non-deterministic expressions
  • Real Parquet bloom filter fixture (Arrow writer lacks stable bloom metadata API)
  • Full complex child schema change (requires FE/table reader support)

Related

🤖 Generated with Claude Code

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Support file-layer pruning for primitive leaf predicates under Parquet STRUCT columns in the new parquet reader. The change keeps row-level filtering on Expr/VExprContext, adds file-local nested predicate targets, merges filter-only nested projections, and resolves nested primitive leaves for statistics, dictionary, bloom, and page-index pruning.

### Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Ran build-support/clang-format.sh for modified BE files.
    - Ran git diff --check.
    - Local BE UT did not start because macOS clang16 failed CMake compiler probe with ld: library c++ not found. Fedora build and UT will be run after push.
- Behavior changed: No
- Does this need documentation: Yes (updated internal design document)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix a missing local Status declaration in ParquetColumnReaderTest so the BE unit test target can compile.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh for the modified test file.
    - Ran git diff --check.
    - Fedora BE UT will be rerun after push.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix duplicate and missing status variable declarations in parquet column reader unit tests so the BE UT target can compile.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add nested STRUCT IN-list pruning hint extraction for new parquet scans and restore explicit nested scalar value index mapping so nullable struct parent/child values remain aligned with Arrow RecordReader output.
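The alignment problem mentioned here can be pictured with Parquet definition levels. The sketch below is an assumption, not the code in `nested_column_reader.h`: it supposes a non-repeated leaf whose decoded values are stored densely (one entry per non-null leaf), and `build_value_index` is a hypothetical helper that maps each row to its position in that dense buffer, or to -1 when the parent struct or the child itself is null.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// For a non-repeated leaf, a row carries a stored value only when its
// definition level equals the leaf's maximum definition level. Lower levels
// mean the parent struct (or the child itself) is null at that row.
// Returns, per row, the index into the dense value buffer, or -1 for null.
std::vector<int> build_value_index(const std::vector<int16_t>& def_levels,
                                   int16_t max_def_level) {
    std::vector<int> index;
    index.reserve(def_levels.size());
    int next_value = 0;
    for (int16_t level : def_levels) {
        index.push_back(level == max_def_level ? next_value++ : -1);
    }
    return index;
}
```

For an optional child under an optional struct (maximum definition level 2), the levels `{2, 0, 1, 2}` describe a value, a null parent, a null child, and a second value, so only rows 0 and 3 consume entries from the value buffer.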

### Release note

None

### Check List (For Author)

- Test: Manual test
    - build-support/clang-format.sh be/src/format/reader/column_mapper.cpp be/src/format/new_parquet/reader/nested_column_reader.h be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp be/test/format/new_parquet/parquet_reader_test.cpp
    - git diff --check
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add real parquet fixtures for nested struct dictionary and page-index pruning, and update the complex predicate pruning status document.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - git diff --check

- Behavior changed: No

- Does this need documentation: Yes

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Resolve nested struct filter projection and pruning targets through ColumnMapping before falling back to file schema names, so renamed mapped children can still produce file-local pruning paths.
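A minimal sketch of the "mapping first, name fallback second" order might look like this. The types and the function `resolve_file_child_id` are hypothetical, and treating a mapped-but-absent field as unresolved (rather than retrying by name) is an assumption made here for safety, not something the commit message states.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical description of one child field in the file schema.
struct FileField {
    int field_id;
    std::string name;
};

// Resolves a table-side child name to a file-local field id.
// 1) If the table child has a field-id mapping, trust it (handles renames).
// 2) Otherwise fall back to matching the file schema by name.
std::optional<int> resolve_file_child_id(
        const std::string& table_child_name,
        const std::unordered_map<std::string, int>& table_to_file_field_id,
        const std::vector<FileField>& file_children) {
    auto mapped = table_to_file_field_id.find(table_child_name);
    if (mapped != table_to_file_field_id.end()) {
        for (const auto& field : file_children) {
            if (field.field_id == mapped->second) return field.field_id;
        }
        return std::nullopt;  // mapped, but this file does not contain the field
    }
    for (const auto& field : file_children) {
        if (field.name == table_child_name) return field.field_id;
    }
    return std::nullopt;
}
```

In this sketch a table child `price` mapped to field id 7 resolves to the file child named `cost`, even though the names differ.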

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - git diff --check

- Behavior changed: No

- Does this need documentation: Yes

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Treat table/file nested child name mismatches as complex projections so field-id mapped renamed children are read with the correct file-local projection.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - git diff --check

- Behavior changed: No

- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Mark the completed nested parquet predicate and pruning implementation scope, and move remaining items into explicit non-goals for this branch.

### Release note

None

### Check List (For Author)

- Test: Manual test
- Behavior changed: No
- Does this need documentation: No

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 suxiaogang223 merged commit 8805572 into refact_reader_branch Jun 4, 2026
18 of 20 checks passed
@suxiaogang223 suxiaogang223 deleted the codex/complex-column-predicate-stats-filtering branch June 4, 2026 07:36