Skip to content

[test](be) Add new parquet reader UT coverage#64729

Merged
suxiaogang223 merged 17 commits into
apache:refact_reader_branchfrom
suxiaogang223:codex/new-parquet-ut-coverage
Jun 24, 2026
Merged

[test](be) Add new parquet reader UT coverage#64729
suxiaogang223 merged 17 commits into
apache:refact_reader_branchfrom
suxiaogang223:codex/new-parquet-ut-coverage

Conversation

@suxiaogang223

@suxiaogang223 suxiaogang223 commented Jun 23, 2026

Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Expand and reorganize UT coverage for the new parquet reader on top of refact_reader_branch. The change splits serde/control/statistics/type/schema/scan coverage into focused tests, completes column reader coverage according to the UT plan, and keeps reader tests aligned with the file-local nullable schema behavior in refact_reader_branch.

Release note

None

Check List (For Author)

  • Test: Unit Test
    • Fedora /home/socrates/code/doris:
      ./run-be-ut.sh --run "--filter=NewParquetReaderTest.*:ParquetScanTest.*:ParquetSerdeReaderTest.*:ParquetColumnReaderTest.*:ParquetColumnReaderBaseTest.*:ParquetColumnReaderControlTest.*:ParquetColumnReaderFactoryTest.*:ParquetLeafReaderTest.*:ParquetSchemaTest.*:ParquetStatisticsPruningTest.*:ParquetStatisticsTransformTest.*:ParquetTypeTest.*:ParquetVirtualColumnReaderTest.*:SelectionVectorTest.*:ParquetBloomFilterPruningTest.*:FileReaderTest.*" -j 8
      Passed: 172 tests from 16 test suites.
  • Behavior changed: No
  • Does this need documentation: No

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Update the new Parquet reader unit test plans to align with the current implementation. The plans now call out unsupported converted TIME behavior, schema nullability nuances, dense nullable testing requirements, nested value layout coverage, and scan helper coverage through public entry points.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation-only change)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add focused BE unit tests for the new parquet reader type resolution, file-local schema conversion, and ParquetLeafReader value preparation paths. The tests cover logical/converted/physical type precedence, LIST/MAP wrapper folding and Dremel level propagation, bare repeated compatibility, dense nullable value spacing, FLOAT16 conversion, and binary chunk handling.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Attempted `./run-be-ut.sh --run '--filter=ParquetTypeTest.*:ParquetSchemaTest.*:ParquetLeafReaderTest.*' -j 8` locally, but macOS clang-16 failed during CMake compiler check because the linker could not find `libc++`. Fedora verification will be run after pushing this branch.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The focused parquet type unit test previously built a FLOAT16 logical schema with `PrimitiveNode::Make`, but the bundled Arrow schema API rejects that direct construction. Generate the FLOAT16 schema through Arrow's parquet writer instead, then resolve the descriptor from parquet metadata so the test covers the same descriptor shape produced by real files.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - `./run-be-ut.sh --run '--filter=ParquetTypeTest.*:ParquetSchemaTest.*:ParquetLeafReaderTest.*' -j 8` was run on Fedora before this fix and exposed the FLOAT16 test setup failure. The same filtered UT will be rerun after pushing this commit.
- Behavior changed: No
- Does this need documentation: No
EOF && git push
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add focused unit tests for new parquet reader control-flow helpers, virtual column readers, default column reader selection behavior, nested skip defaults, and factory projection error paths. Remove old schema/type-oriented cases from the large parquet column reader test because those behaviors are now covered by dedicated parquet type and schema tests.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Local clang-format v16 check on the modified parquet UT files.
    - Fedora BE UT verification will be run after pushing this branch.
- Behavior changed: No
- Does this need documentation: No
EOF && git push
Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Split bloom filter pruning checks out of the end-to-end parquet reader test file into a dedicated parquet statistics UT file, and add a reader-level global row id test that verifies schema exposure and selected-row materialization through the new parquet scan path.

None

- Test: Unit Test
    - python3 build-support/run_clang_format.py --clang-format-executable /opt/homebrew/opt/llvm@16/bin/clang-format -r --style file --inplace false --extensions c,h,C,H,cpp,hpp,cc,hh,c++,h++,cxx,hxx --exclude none be/test/format_v2/parquet/parquet_statistics_test.cpp be/test/format_v2/parquet/parquet_reader_test.cpp
    - ./run-be-ut.sh --run '--filter=ParquetBloomFilterPruningTest.*:NewParquetReaderTest.GlobalRowIdSchemaAndSelectionUseFileRowPosition' -j 8
    - ./build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --files be/test/format_v2/parquet/parquet_statistics_test.cpp be/test/format_v2/parquet/parquet_reader_test.cpp
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Update the new parquet reader global row id UT assertions to inspect nested data from nullable output columns, matching the schema returned by the new parquet reader.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - python3 build-support/run_clang_format.py --clang-format-executable /opt/homebrew/opt/llvm@16/bin/clang-format -r --style file --inplace false --extensions c,h,C,H,cpp,hpp,cc,hh,c++,h++,cxx,hxx --exclude none be/test/format_v2/parquet/parquet_reader_test.cpp
    - ./run-be-ut.sh --run '--filter=ParquetBloomFilterPruningTest.*:NewParquetReaderTest.GlobalRowIdSchemaAndSelectionUseFileRowPosition' -j 8
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add focused new parquet scan tests for aggregate pushdown and scan-range virtual global row id behavior. The tests cover COUNT and MIN/MAX over selected row groups, statistics-pruned aggregate results, current page-index count semantics, nested single-leaf MIN/MAX, rejected repeated and invalid aggregate requests, missing statistics, and GlobalRowId row ids after scan-range row group selection.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - python3 build-support/run_clang_format.py --clang-format-executable /opt/homebrew/opt/llvm@16/bin/clang-format -r --style file --inplace false --extensions c,h,C,H,cpp,hpp,cc,hh,c++,h++,cxx,hxx --exclude none be/test/format_v2/parquet/parquet_scan_test.cpp
    - ./run-be-ut.sh --run '--filter=ParquetScanTest.*' -j 8
    - ./build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --files be/test/format_v2/parquet/parquet_scan_test.cpp
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add focused unit coverage for new parquet reader type resolution and scan planning paths. The new cases cover converted integer mapping and decoded value kinds, logical string/date/decimal descriptors, physical fallback defaults, scan range ordering before statistics pruning, first_file_row calculation across pruned row groups, and page-index range intersection/full pruning behavior.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-be-ut.sh --run '--filter=ParquetTypeTest.*:ParquetScanTest.*' -j 8
    - ./build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --files be/test/format_v2/parquet/parquet_type_test.cpp be/test/format_v2/parquet/parquet_scan_test.cpp
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add focused new parquet reader schema unit coverage according to the UT plan. The new cases cover legacy LIST element resolution, nested repeated field wrapping inside list elements, MAP entry level propagation, deep Dremel level propagation, empty root/null entry handling, additional invalid LIST/MAP layouts, and logical UTC TIME rejection.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-be-ut.sh --run '--filter=ParquetSchemaTest.*' -j 8
    - ./build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --files be/test/format_v2/parquet/parquet_schema_test.cpp
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Document the current new parquet reader UT coverage strategy, identify gaps in schema-only tests, Arrow-generated parquet coverage, def/rep level materialization, and legacy compatibility validation, and propose a layered testing plan with priorities and acceptance criteria.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation-only change)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Extend the new parquet reader unit tests to cover the remaining non-column-reader plan items. The change adds type/schema matrix coverage, direct DecodedColumnView append_values checks, statistics transformation and pruning cases, and scan planning/profile coverage. A decoded value appender hook lets leaf reader tests assert the DecodedColumnView fields after parquet-specific data preparation without changing the default production path.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Fedora: ./run-be-ut.sh --run '--filter=ParquetTypeTest.*:ParquetSchemaTest.*:ParquetLeafReaderTest.*:ParquetBloomFilterPruningTest.*:ParquetStatisticsTransformTest.*:ParquetStatisticsPruningTest.*:ParquetScanTest.*' -j 8
    - build-support/clang-format.sh on changed files
    - clang-tidy: test files reported no warnings; parquet_leaf_reader.cpp could not be analyzed in this environment because the available clang-tidy/PCH formats are incompatible
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Remove the temporary parquet UT plan documents for modules whose coverage has been implemented and validated. The remaining column_reader plan and broader improvement documents are kept because they still describe uncovered complex reader/materializer and compatibility corpus work.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation cleanup only)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Split flat Parquet value and serde reader coverage out of the column reader unit test and into a dedicated ParquetSerdeReaderTest. Keep parquet_column_reader_test focused on column reader control flow, and add coverage for scalar read/skip/select edge cases, SelectionVector range validation, default base reader behavior, nested skip delegation, and factory error paths. The factory now reports null scalar descriptors as invalid input before flat-layout support checks.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Fedora: ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*:ParquetColumnReaderBaseTest.*:ParquetSerdeReaderTest.*' -j 8
    - Local: git diff --check -- be/src/format_v2/parquet/reader/column_reader.cpp be/test/format_v2/parquet/parquet_column_reader_test.cpp be/test/format_v2/parquet/parquet_serde_reader_test.cpp
    - Local: build-support/clang-format.sh be/src/format_v2/parquet/reader/column_reader.cpp be/test/format_v2/parquet/parquet_column_reader_test.cpp be/test/format_v2/parquet/parquet_serde_reader_test.cpp
- Behavior changed: Yes. Internal factory error classification for null scalar descriptors now returns InvalidArgument instead of NotSupported.
- Does this need documentation: No
Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Complete the new parquet column reader unit test coverage plan and remove the completed plan document. Add deterministic white-box coverage for nested scalar value layouts, dense nullable nested rejection, nested materializer helpers, page-filtered skip accounting, STRUCT shape-source batching, MAP null-key and repetition-alignment errors, and scalar/complex MAP value paths. Refactor ParquetLeafReader nested batch parsing into a private helper so tests can inject stable ParquetLeafBatch layouts without depending on Arrow RecordReader internals. Also make parquet reader tests unwrap nullable output columns before checking values.

None

- Test: Unit Test
    - Fedora: ./run-be-ut.sh --run '--filter=ParquetColumnReaderTest.*:ParquetColumnReaderBaseTest.*:ParquetLeafReaderTest.*:ParquetSerdeReaderTest.*:ParquetColumnReaderControlTest.*:SelectionVectorTest.*:ParquetVirtualColumnReaderTest.*:ParquetColumnReaderFactoryTest.*:NewParquetReaderTest.*' -j 8
    - Local: git diff --check on changed files
    - Local: build-support/clang-format.sh on changed C++ files
- Behavior changed: No
- Does this need documentation: No
@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 suxiaogang223 marked this pull request as ready for review June 23, 2026 08:28
@suxiaogang223

Copy link
Copy Markdown
Member Author

run be-ut

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix a rebase leftover in the new parquet reader rowid UT. The test still referenced removed helper functions after parquet reader tests were aligned with nullable file-local columns, which caused BE UT compilation to fail before executing the filtered tests.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - Pending rerun of new parquet related BE UT on Fedora.

- Behavior changed: No

- Does this need documentation: No
@suxiaogang223 suxiaogang223 merged commit 8e80529 into apache:refact_reader_branch Jun 24, 2026
17 of 19 checks passed
Gabriel39 pushed a commit that referenced this pull request Jun 26, 2026
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Expand and reorganize UT coverage for the new parquet
reader on top of refact_reader_branch. The change splits
serde/control/statistics/type/schema/scan coverage into focused tests,
completes column reader coverage according to the UT plan, and keeps
reader tests aligned with the file-local nullable schema behavior in
refact_reader_branch.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Fedora `/home/socrates/code/doris`:
`./run-be-ut.sh --run
"--filter=NewParquetReaderTest.*:ParquetScanTest.*:ParquetSerdeReaderTest.*:ParquetColumnReaderTest.*:ParquetColumnReaderBaseTest.*:ParquetColumnReaderControlTest.*:ParquetColumnReaderFactoryTest.*:ParquetLeafReaderTest.*:ParquetSchemaTest.*:ParquetStatisticsPruningTest.*:ParquetStatisticsTransformTest.*:ParquetTypeTest.*:ParquetVirtualColumnReaderTest.*:SelectionVectorTest.*:ParquetBloomFilterPruningTest.*:FileReaderTest.*"
-j 8`
      Passed: 172 tests from 16 test suites.
- Behavior changed: No
- Does this need documentation: No
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants