[improvement](be) Add bloom filter pruning for new parquet reader#64025

Merged
suxiaogang223 merged 2 commits into apache:refact_reader_branch from suxiaogang223:codex/new-parquet-bloom-filter
Jun 2, 2026
Conversation

@suxiaogang223
Member

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Row group pruning in the new parquet reader did not use Parquet bloom filters, so equality and IN predicates could rely only on statistics and dictionary pruning before falling back to reading the row groups.

This PR adds conservative bloom-filter row group pruning for the new parquet reader by reusing Arrow Parquet bloom filter APIs and adapting Doris file-layer predicates to Arrow Parquet hash checks.
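
To illustrate what "conservative" means here, the following is a minimal, self-contained sketch and not the code from this PR. `ToyBloomFilter` and `can_prune_row_group` are hypothetical names, and the toy filter uses its own hash instead of the Arrow Parquet bloom filter API the PR reuses. The idea it shows: a row group is skipped only when a bloom filter exists and every literal of the equality or IN-list predicate is definitely absent; in every other case the row group is kept.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one column's bloom filter in one row group.
// The real reader would use Arrow Parquet's bloom filter, not this class.
class ToyBloomFilter {
public:
    explicit ToyBloomFilter(std::size_t num_bits) : bits_(num_bits, false) {
        assert(num_bits > 0);
    }

    void insert(int64_t value) {
        for (int k = 0; k < kNumHashes; ++k) {
            bits_[slot(value, k)] = true;
        }
    }

    // false means "definitely absent"; true means "possibly present".
    bool might_contain(int64_t value) const {
        for (int k = 0; k < kNumHashes; ++k) {
            if (!bits_[slot(value, k)]) {
                return false;
            }
        }
        return true;
    }

private:
    static constexpr int kNumHashes = 3;

    // splitmix64-style bit mixer, used only for this illustration.
    static uint64_t mix(uint64_t x) {
        x ^= x >> 30;
        x *= 0xBF58476D1CE4E5B9ULL;
        x ^= x >> 27;
        x *= 0x94D049BB133111EBULL;
        x ^= x >> 31;
        return x;
    }

    std::size_t slot(int64_t value, int k) const {
        uint64_t seed = 0x9E3779B97F4A7C15ULL * static_cast<uint64_t>(k + 1);
        return static_cast<std::size_t>(mix(static_cast<uint64_t>(value) + seed) % bits_.size());
    }

    std::vector<bool> bits_;
};

// Returns true only when the row group can safely be skipped.
// An equality predicate is modeled as an IN-list with one literal.
bool can_prune_row_group(const ToyBloomFilter* filter, const std::vector<int64_t>& in_list) {
    if (filter == nullptr) {
        return false; // no bloom filter for this column: keep the row group
    }
    if (in_list.empty()) {
        return false; // assumption: treat an empty list as unsupported and keep
    }
    for (int64_t literal : in_list) {
        if (filter->might_contain(literal)) {
            return false; // at least one literal may match: keep the row group
        }
    }
    return true; // every literal is definitely absent
}
```

Because a bloom filter can report false positives but never false negatives, this decision can only fail to prune a row group; it cannot drop one that contains a matching value.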

Release note

None

Check List (For Author)

  • Test: Unit Test
    • Local: git diff --check
    • Fedora: BUILD_TYPE=DEBUG ./build.sh --be
    • Fedora: ./run-be-ut.sh --run '--filter=ParquetBloomFilterPruningTest.*'
    • Fedora: ./run-be-ut.sh --run '--filter=NewParquetReaderTest.*:ParquetColumnReaderTest.*:ParquetBloomFilterPruningTest.*'
  • Behavior changed: Yes. The new parquet reader can prune row groups with Parquet bloom filters when bloom filter pruning is enabled and the predicates are supported equality or IN-list predicates.
  • Does this need documentation: No

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The new parquet reader internal layering document still described bloom filter pruning as a future task after the implementation was added.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation-only change)
- Behavior changed: No
- Does this need documentation: Yes
@suxiaogang223 suxiaogang223 marked this pull request as ready for review June 2, 2026 09:56
@suxiaogang223 suxiaogang223 merged commit c2e474b into apache:refact_reader_branch Jun 2, 2026
18 of 20 checks passed
@suxiaogang223 suxiaogang223 deleted the codex/new-parquet-bloom-filter branch June 3, 2026 03:29