[improvement](be) Add new parquet page-level skip and profile metrics #64214
Open
suxiaogang223 wants to merge 8 commits into
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Jun 8, 2026
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#64214
Problem Summary: The new parquet page skip profile mixed logical pruning information with physical reader work. Page skip counters used generic names that could be interpreted as all page-index-pruned pages, while they are updated only when Arrow's data page filter callback actually skips a page. ReaderSkipRows also counted scheduler-level logical skips, including rows already removed by page filtering, so it could overstate the actual RecordReader::SkipRecords work. This change renames the page skip counters to data-page-filter-specific names and updates ReaderSkipRows only for rows actually passed to Arrow RecordReader::SkipRecords. Parent complex readers and the synthetic row-position reader no longer add logical read/skip rows to the physical reader counters.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Pending: NewParquetReaderTest.*
- Behavior changed: No
- Does this need documentation: No
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Jun 8, 2026
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#64214
Problem Summary: The new parquet data page filter setup was implemented as an inline closure inside RecordReader creation, which made the page ordinal tracking, profile updates and page skip plan lookup hard to read. This refactors the filter into a small DataPageSkipFilter helper, centralizes page skip plan lookup/filter installation, and documents the important page-index invariants around data-page ordinals, non-repeated leaves, and double-skip accounting. The remaining touched reader files are formatting-only changes from the project clang-format script.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Pending: NewParquetReaderTest.*
- Behavior changed: No
- Does this need documentation: No
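To make the idea of a stateful page filter with ordinal tracking concrete, here is a minimal sketch. It is not the DataPageSkipFilter from this PR: `PageSkipFilterSketch`, `FakeDataPageStats` and `make_filter` are hypothetical names, and the sketch assumes the callback is invoked exactly once per data page, in file order, and returns `true` to skip the page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-page statistics handed to a data page filter.
struct FakeDataPageStats {
    int64_t num_rows;
};

// Hypothetical helper: walks a per-leaf skip plan by data-page ordinal and
// counts only the pages it actually tells the page reader to skip.
class PageSkipFilterSketch {
public:
    explicit PageSkipFilterSketch(std::vector<bool> skip_plan)
            : _skip_plan(std::move(skip_plan)) {}

    // Called once per data page (assumption). Returns true to skip the page.
    bool should_skip(const FakeDataPageStats& stats) {
        assert(stats.num_rows >= 0);
        const std::size_t ordinal = _next_ordinal++;
        // Pages beyond the end of the plan are never skipped.
        const bool skip = ordinal < _skip_plan.size() && _skip_plan[ordinal];
        if (skip) {
            ++_skipped_pages;
            _skipped_rows += stats.num_rows;
        }
        return skip;
    }

    int64_t skipped_pages() const { return _skipped_pages; }
    int64_t skipped_rows() const { return _skipped_rows; }

private:
    std::vector<bool> _skip_plan;
    std::size_t _next_ordinal = 0;
    int64_t _skipped_pages = 0;
    int64_t _skipped_rows = 0;
};

// Wraps the shared state in a callable so that copies of the callback keep
// updating the same ordinal and counters.
inline std::function<bool(const FakeDataPageStats&)> make_filter(
        std::shared_ptr<PageSkipFilterSketch> state) {
    return [state = std::move(state)](const FakeDataPageStats& stats) {
        return state->should_skip(stats);
    };
}
```

Holding the state behind a `std::shared_ptr` is one possible design choice: `std::function` copies its target, so a plain functor with by-value state could silently fork the ordinal counter.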
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The new parquet reader already uses the page index to narrow row ranges, but gaps between selected ranges were still skipped through RecordReader::SkipRecords(). This means page bodies in page-aligned gaps may still be read and decompressed. Directly enabling Arrow PageReader data page filtering would skip rows twice, because the scheduler also calls SkipRecords() for the same logical gap. This change carries a per-leaf page skip plan from page-index planning to the row group readers, injects Arrow PageReader::set_data_page_filter() before RecordReader::SetPageReader(), and adjusts ScalarColumnReader skip accounting so that page-filtered rows are not passed to SkipRecords(). The first phase only enables local page skipping for primitive non-repeated leaves and keeps LIST/MAP/repeated leaves on the existing row-range pruning.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran targeted clang-format v16 through build-support/run_clang_format.py for modified C++ files
- Ran git diff --check and git diff --cached --check
- Attempted ./run-be-ut.sh --run --filter='NewParquetReaderTest.*', but local CMake compiler check failed before Doris code compilation because /opt/homebrew/opt/llvm@16/bin/clang++ cannot link a simple program: ld: library 'c++' not found
- Behavior changed: No
- Does this need documentation: Yes (updated docs/page-level-skip-plan.md)
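The double-skip accounting described in the problem summary can be sketched as follows. This is a simplified illustration, not the code in this PR: `PageRows`, `GapSplit` and `split_gap` are hypothetical names, and the sketch assumes a non-repeated leaf whose pages cover disjoint row ranges, so one row maps to exactly one page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view of one data page: rows [first_row, first_row + num_rows)
// within the row group.
struct PageRows {
    int64_t first_row;
    int64_t num_rows;
};

// How one logical gap is divided between the page filter and SkipRecords().
struct GapSplit {
    std::vector<std::size_t> skipped_page_ordinals; // pages dropped by the page filter
    int64_t page_filtered_rows = 0;                 // rows removed together with those pages
    int64_t skip_records_rows = 0;                  // rows still passed to SkipRecords()
};

// Splits the unselected row gap [gap_begin, gap_end). A page goes to the page
// filter only if it lies entirely inside the gap; the rest of the gap must
// still be skipped explicitly, and never counted twice.
inline GapSplit split_gap(const std::vector<PageRows>& pages, int64_t gap_begin,
                          int64_t gap_end) {
    assert(gap_begin <= gap_end);
    GapSplit out;
    for (std::size_t i = 0; i < pages.size(); ++i) {
        const int64_t page_begin = pages[i].first_row;
        const int64_t page_end = page_begin + pages[i].num_rows;
        if (pages[i].num_rows > 0 && page_begin >= gap_begin && page_end <= gap_end) {
            out.skipped_page_ordinals.push_back(i);
            out.page_filtered_rows += pages[i].num_rows;
        }
    }
    out.skip_records_rows = (gap_end - gap_begin) - out.page_filtered_rows;
    return out;
}
```

For three 100-row pages and a gap of rows [50, 250), only the middle page is fully covered, so 100 rows are removed by the page filter and the remaining 100 rows (50 at each edge) still go through SkipRecords().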
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Page-level skip filtering in the new Parquet reader did not expose profile counters for the actual pages skipped by Arrow's data_page_filter callback or the compressed bytes associated with those skipped pages. This makes it hard to validate page-level pruning effectiveness from runtime profiles. This change wires page skip profile counters from ParquetReader through ParquetScanScheduler to ParquetColumnReaderFactory, updates them only when the callback actually skips a page, and records skipped compressed bytes from OffsetIndex page locations.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Added NewParquetReaderTest coverage for page skip compressed bytes and profile counters. Not run locally; will run on Fedora after syncing branch.
- Behavior changed: No
- Does this need documentation: Yes (updated docs/observability-profile-plan.md)
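As a rough illustration of recording skipped compressed bytes from OffsetIndex page locations, the sketch below updates two counters only when a page is reported as skipped. `PageLocation` mirrors the three fields of the Parquet OffsetIndex entry; `PageSkipCounters` and `on_page_skipped` are hypothetical names, not the counters or functions added by this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the fields of a Parquet OffsetIndex PageLocation entry.
struct PageLocation {
    int64_t offset;               // file offset of the page
    int32_t compressed_page_size; // on-disk size of the page
    int64_t first_row_index;      // first row of the page within the row group
};

// Hypothetical stand-in for the profile counters.
struct PageSkipCounters {
    int64_t skipped_pages = 0;
    int64_t skipped_compressed_bytes = 0;
};

// Intended to be called from the point where the filter callback has actually
// decided to skip the data page with the given ordinal. Ordinals without a
// matching page location are ignored rather than guessed at.
inline void on_page_skipped(const std::vector<PageLocation>& locations,
                            std::size_t page_ordinal, PageSkipCounters* counters) {
    assert(counters != nullptr);
    if (page_ordinal >= locations.size()) {
        return;
    }
    counters->skipped_pages += 1;
    counters->skipped_compressed_bytes += locations[page_ordinal].compressed_page_size;
}
```

Because the counters move only inside this hook, pages that were merely planned for skipping but never reached by the callback do not inflate the profile.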
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The observability profile plan did not clearly distinguish between counters registered by the new parquet reader and counters that currently have real update paths. This update documents the current effective profile counters, identifies registered-but-unwired counters, and records the next profile gaps to fill for scheduler, column reader, page index, nested assembler, and page-level skip observability.
### Release note
None
### Check List (For Author)
- Test: No need to test (documentation only)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The new parquet reader registered many profile counters, but several low-risk scan path counters still had no update path, making it hard to observe row-level filtering, empty batches, range-gap skips, and file reader creation from query profiles. This change publishes file reader lifecycle statistics, wires scheduler-level scan counters and timers, and documents the current profile state.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Added new parquet reader unit coverage for profile counters. Fedora unit test will be run after pushing this branch.
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: New parquet now needs observability for the first three profile priorities: scheduler/read path counters, column reader and Arrow adapter timing, and row group/page index planning timing. This change wires reader-level row counters, Arrow RecordReader and materialization timers, and planning timers into the existing ParquetReader profile, with unit test assertions and profile documentation updates.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Fedora DEBUG build and NewParquetReaderTest will be run after pushing this branch.
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The temporary new parquet page-level skip and observability profile planning documents described implementation work that is now completed on this branch. Keeping these design notes in docs would make the current implementation status harder to read, so this change removes the obsolete plan documents.
### Release note
None
### Check List (For Author)
- Test: No need to test (documentation cleanup only)
- Behavior changed: No
- Does this need documentation: No
af16607 to d9c7a81
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: This PR extends the new parquet reader with page-level skip support and the first set of observability counters needed to validate it. Page index planning now produces page skip plans consumed by Arrow page readers, avoiding double skip with row-range pruning. The reader profile now reports scheduler/read path counters, reader read/skip/select rows, Arrow RecordReader timing, materialization timing, page skip pages/bytes, and row group/page index planning timing. Temporary implementation plan docs were removed after the feature work landed.
Release note
None
Check List (For Author)