[test](be) Add new parquet reader UT coverage#64729
Merged
suxiaogang223 merged 17 commits intoJun 24, 2026
Merged
Conversation
### What problem does this PR solve? Issue Number: None Related PR: None Problem Summary: Update the new Parquet reader unit test plans to align with the current implementation. The plans now call out unsupported converted TIME behavior, schema nullability nuances, dense nullable testing requirements, nested value layout coverage, and scan helper coverage through public entry points. ### Release note None ### Check List (For Author) - Test: No need to test (documentation-only change) - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add focused BE unit tests for the new parquet reader type resolution, file-local schema conversion, and ParquetLeafReader value preparation paths. The tests cover logical/converted/physical type precedence, LIST/MAP wrapper folding and Dremel level propagation, bare repeated compatibility, dense nullable value spacing, FLOAT16 conversion, and binary chunk handling.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Attempted `./run-be-ut.sh --run '--filter=ParquetTypeTest.*:ParquetSchemaTest.*:ParquetLeafReaderTest.*' -j 8` locally, but macOS clang-16 failed during CMake compiler check because the linker could not find `libc++`. Fedora verification will be run after pushing this branch.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The focused parquet type unit test previously built a FLOAT16 logical schema with `PrimitiveNode::Make`, but the bundled Arrow schema API rejects that direct construction. Generate the FLOAT16 schema through Arrow's parquet writer instead, then resolve the descriptor from parquet metadata so the test covers the same descriptor shape produced by real files.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- `./run-be-ut.sh --run '--filter=ParquetTypeTest.*:ParquetSchemaTest.*:ParquetLeafReaderTest.*' -j 8` was run on Fedora before this fix and exposed the FLOAT16 test setup failure. The same filtered UT will be rerun after pushing this commit.
- Behavior changed: No
- Does this need documentation: No
EOF && git push
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add focused unit tests for new parquet reader control-flow helpers, virtual column readers, default column reader selection behavior, nested skip defaults, and factory projection error paths. Remove old schema/type-oriented cases from the large parquet column reader test because those behaviors are now covered by dedicated parquet type and schema tests.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Local clang-format v16 check on the modified parquet UT files.
- Fedora BE UT verification will be run after pushing this branch.
- Behavior changed: No
- Does this need documentation: No
EOF && git push
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Split bloom filter pruning checks out of the end-to-end parquet reader test file into a dedicated parquet statistics UT file, and add a reader-level global row id test that verifies schema exposure and selected-row materialization through the new parquet scan path.
None
- Test: Unit Test
- python3 build-support/run_clang_format.py --clang-format-executable /opt/homebrew/opt/llvm@16/bin/clang-format -r --style file --inplace false --extensions c,h,C,H,cpp,hpp,cc,hh,c++,h++,cxx,hxx --exclude none be/test/format_v2/parquet/parquet_statistics_test.cpp be/test/format_v2/parquet/parquet_reader_test.cpp
- ./run-be-ut.sh --run '--filter=ParquetBloomFilterPruningTest.*:NewParquetReaderTest.GlobalRowIdSchemaAndSelectionUseFileRowPosition' -j 8
- ./build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --files be/test/format_v2/parquet/parquet_statistics_test.cpp be/test/format_v2/parquet/parquet_reader_test.cpp
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Update the new parquet reader global row id UT assertions to inspect nested data from nullable output columns, matching the schema returned by the new parquet reader.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- python3 build-support/run_clang_format.py --clang-format-executable /opt/homebrew/opt/llvm@16/bin/clang-format -r --style file --inplace false --extensions c,h,C,H,cpp,hpp,cc,hh,c++,h++,cxx,hxx --exclude none be/test/format_v2/parquet/parquet_reader_test.cpp
- ./run-be-ut.sh --run '--filter=ParquetBloomFilterPruningTest.*:NewParquetReaderTest.GlobalRowIdSchemaAndSelectionUseFileRowPosition' -j 8
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add focused new parquet scan tests for aggregate pushdown and scan-range virtual global row id behavior. The tests cover COUNT and MIN/MAX over selected row groups, statistics-pruned aggregate results, current page-index count semantics, nested single-leaf MIN/MAX, rejected repeated and invalid aggregate requests, missing statistics, and GlobalRowId row ids after scan-range row group selection.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- python3 build-support/run_clang_format.py --clang-format-executable /opt/homebrew/opt/llvm@16/bin/clang-format -r --style file --inplace false --extensions c,h,C,H,cpp,hpp,cc,hh,c++,h++,cxx,hxx --exclude none be/test/format_v2/parquet/parquet_scan_test.cpp
- ./run-be-ut.sh --run '--filter=ParquetScanTest.*' -j 8
- ./build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --files be/test/format_v2/parquet/parquet_scan_test.cpp
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add focused unit coverage for new parquet reader type resolution and scan planning paths. The new cases cover converted integer mapping and decoded value kinds, logical string/date/decimal descriptors, physical fallback defaults, scan range ordering before statistics pruning, first_file_row calculation across pruned row groups, and page-index range intersection/full pruning behavior.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- ./run-be-ut.sh --run '--filter=ParquetTypeTest.*:ParquetScanTest.*' -j 8
- ./build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --files be/test/format_v2/parquet/parquet_type_test.cpp be/test/format_v2/parquet/parquet_scan_test.cpp
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add focused new parquet reader schema unit coverage according to the UT plan. The new cases cover legacy LIST element resolution, nested repeated field wrapping inside list elements, MAP entry level propagation, deep Dremel level propagation, empty root/null entry handling, additional invalid LIST/MAP layouts, and logical UTC TIME rejection.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- ./run-be-ut.sh --run '--filter=ParquetSchemaTest.*' -j 8
- ./build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --files be/test/format_v2/parquet/parquet_schema_test.cpp
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: Document the current new parquet reader UT coverage strategy, identify gaps in schema-only tests, Arrow-generated parquet coverage, def/rep level materialization, and legacy compatibility validation, and propose a layered testing plan with priorities and acceptance criteria. ### Release note None ### Check List (For Author) - Test: No need to test (documentation-only change) - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Extend the new parquet reader unit tests to cover the remaining non-column-reader plan items. The change adds type/schema matrix coverage, direct DecodedColumnView append_values checks, statistics transformation and pruning cases, and scan planning/profile coverage. A decoded value appender hook lets leaf reader tests assert the DecodedColumnView fields after parquet-specific data preparation without changing the default production path.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Fedora: ./run-be-ut.sh --run '--filter=ParquetTypeTest.*:ParquetSchemaTest.*:ParquetLeafReaderTest.*:ParquetBloomFilterPruningTest.*:ParquetStatisticsTransformTest.*:ParquetStatisticsPruningTest.*:ParquetScanTest.*' -j 8
- build-support/clang-format.sh on changed files
- clang-tidy: test files reported no warnings; parquet_leaf_reader.cpp could not be analyzed in this environment because the available clang-tidy/PCH formats are incompatible
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: Remove the temporary parquet UT plan documents for modules whose coverage has been implemented and validated. The remaining column_reader plan and broader improvement documents are kept because they still describe uncovered complex reader/materializer and compatibility corpus work. ### Release note None ### Check List (For Author) - Test: No need to test (documentation cleanup only) - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Split flat Parquet value and serde reader coverage out of the column reader unit test and into a dedicated ParquetSerdeReaderTest. Keep parquet_column_reader_test focused on column reader control flow, and add coverage for scalar read/skip/select edge cases, SelectionVector range validation, default base reader behavior, nested skip delegation, and factory error paths. The factory now reports null scalar descriptors as invalid input before flat-layout support checks.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Fedora: ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*:ParquetColumnReaderBaseTest.*:ParquetSerdeReaderTest.*' -j 8
- Local: git diff --check -- be/src/format_v2/parquet/reader/column_reader.cpp be/test/format_v2/parquet/parquet_column_reader_test.cpp be/test/format_v2/parquet/parquet_serde_reader_test.cpp
- Local: build-support/clang-format.sh be/src/format_v2/parquet/reader/column_reader.cpp be/test/format_v2/parquet/parquet_column_reader_test.cpp be/test/format_v2/parquet/parquet_serde_reader_test.cpp
- Behavior changed: Yes. Internal factory error classification for null scalar descriptors now returns InvalidArgument instead of NotSupported.
- Does this need documentation: No
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Complete the new parquet column reader unit test coverage plan and remove the completed plan document. Add deterministic white-box coverage for nested scalar value layouts, dense nullable nested rejection, nested materializer helpers, page-filtered skip accounting, STRUCT shape-source batching, MAP null-key and repetition-alignment errors, and scalar/complex MAP value paths. Refactor ParquetLeafReader nested batch parsing into a private helper so tests can inject stable ParquetLeafBatch layouts without depending on Arrow RecordReader internals. Also make parquet reader tests unwrap nullable output columns before checking values.
None
- Test: Unit Test
- Fedora: ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*:ParquetColumnReaderBaseTest.*:ParquetLeafReaderTest.*:ParquetSerdeReaderTest.*:ParquetColumnReaderControlTest.*:SelectionVectorTest.*:ParquetVirtualColumnReaderTest.*:ParquetColumnReaderFactoryTest.*:NewParquetReaderTest.*' -j 8
- Local: git diff --check on changed files
- Local: build-support/clang-format.sh on changed C++ files
- Behavior changed: No
- Does this need documentation: No
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Member
Author
|
run be-ut |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Fix a rebase leftover in the new parquet reader rowid UT. The test still referenced removed helper functions after parquet reader tests were aligned with nullable file-local columns, which caused BE UT compilation to fail before executing the filtered tests.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Pending rerun of new parquet related BE UT on Fedora.
- Behavior changed: No
- Does this need documentation: No
8e80529
into
apache:refact_reader_branch
17 of 19 checks passed
Gabriel39
pushed a commit
that referenced
this pull request
Jun 26, 2026
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Expand and reorganize UT coverage for the new parquet
reader on top of refact_reader_branch. The change splits
serde/control/statistics/type/schema/scan coverage into focused tests,
completes column reader coverage according to the UT plan, and keeps
reader tests aligned with the file-local nullable schema behavior in
refact_reader_branch.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Fedora `/home/socrates/code/doris`:
`./run-be-ut.sh --run
"--filter=NewParquetReaderTest.*:ParquetScanTest.*:ParquetSerdeReaderTest.*:ParquetColumnReaderTest.*:ParquetColumnReaderBaseTest.*:ParquetColumnReaderControlTest.*:ParquetColumnReaderFactoryTest.*:ParquetLeafReaderTest.*:ParquetSchemaTest.*:ParquetStatisticsPruningTest.*:ParquetStatisticsTransformTest.*:ParquetTypeTest.*:ParquetVirtualColumnReaderTest.*:SelectionVectorTest.*:ParquetBloomFilterPruningTest.*:FileReaderTest.*"
-j 8`
Passed: 172 tests from 16 test suites.
- Behavior changed: No
- Does this need documentation: No
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Expand and reorganize UT coverage for the new parquet reader on top of refact_reader_branch. The change splits serde/control/statistics/type/schema/scan coverage into focused tests, completes column reader coverage according to the UT plan, and keeps reader tests aligned with the file-local nullable schema behavior in refact_reader_branch.
Release note
None
Check List (For Author)
/home/socrates/code/doris:./run-be-ut.sh --run "--filter=NewParquetReaderTest.*:ParquetScanTest.*:ParquetSerdeReaderTest.*:ParquetColumnReaderTest.*:ParquetColumnReaderBaseTest.*:ParquetColumnReaderControlTest.*:ParquetColumnReaderFactoryTest.*:ParquetLeafReaderTest.*:ParquetSchemaTest.*:ParquetStatisticsPruningTest.*:ParquetStatisticsTransformTest.*:ParquetTypeTest.*:ParquetVirtualColumnReaderTest.*:SelectionVectorTest.*:ParquetBloomFilterPruningTest.*:FileReaderTest.*" -j 8Passed: 172 tests from 16 test suites.