Test 0603 #64054
run buildall
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Squash the refactored reader branch into one commit on top of master. The change adds the refactored TableReader/FileReader stack, the new parquet reader path, table-format readers, nested projection/filter support, aggregate pushdown support, FileScannerV2, and related BE tests and design docs.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --cached --check before committing.
- Behavior changed: Yes
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: TableReader could pass shared COW columns to mutable column builders in two paths. The new Parquet scan scheduler read non-predicate output columns by calling assert_mutable() on columns still owned by the file block, which throws when the block retains another reference. Complex projection materialization also rebuilt struct, array, and map columns with child pointers shared with the source nested column, so projected nested fallback scans could fail with COW::assert_mutable. This change uses scoped block column mutation for Parquet output reads and recursively detaches complex child columns with IColumn::mutate before wrapping them in result complex columns.
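The detach-before-mutate idea above can be illustrated with a minimal sketch. It does not use the Doris COW or IColumn classes: `ToyColumn`, `ToyStruct`, `detach_for_write`, and `detach_children` are hypothetical names, and `std::shared_ptr::use_count` stands in for the real reference count. It also detaches only one level of children, whereas the description says the actual change recurses.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a reference-counted column; not Doris's IColumn.
struct ToyColumn {
    std::vector<int> data;
};

using ToyColumnPtr = std::shared_ptr<ToyColumn>;

// Returns a pointer that is safe to mutate: the same object when the caller
// hands over the only reference, otherwise a private copy.
inline ToyColumnPtr detach_for_write(ToyColumnPtr column) {
    assert(column != nullptr);
    if (column.use_count() == 1) {
        return column; // sole owner, mutate in place
    }
    return std::make_shared<ToyColumn>(*column); // shared, so copy first
}

// Hypothetical struct-like column that holds child columns.
struct ToyStruct {
    std::vector<ToyColumnPtr> children;
};

// Detaches every child before wrapping it in a new result column, so later
// appends cannot touch columns that the source still references.
inline ToyStruct detach_children(const ToyStruct& source) {
    ToyStruct result;
    result.children.reserve(source.children.size());
    for (const ToyColumnPtr& child : source.children) {
        // The source keeps its reference, so this always takes the copy path.
        result.children.push_back(detach_for_write(child));
    }
    return result;
}
```

In this sketch, moving a sole reference into `detach_for_write` returns the same object, while children still held by `source` are copied before the result column can append to them.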
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check -- be/src/format/reader/table_reader.h be/src/format/new_parquet/parquet_scan.cpp
- PARALLEL=8 JDK_17=/usr/local/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home JAVA_HOME=/usr/local/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home ./run-be-ut.sh --run --filter=TableReaderTest.ReopenSplitAfterClose:TableReaderTest.PushDownMinMaxFallsBackForProjectedListStructLeaf:TableReaderTest.PushDownMinMaxFallsBackForProjectedMapValueStructLeaf (not completed: submodule setup could not lock .git/config in the sandbox and fallback curl could not resolve github.com)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Nested Parquet scalar assembly copied present scalar values by calling insert_from on the destination column directly. When the destination is a nullable list element, struct child, or map value column, the source batch values are stored in the non-nullable nested data column, so ColumnNullable::insert_from tried to cast the source to ColumnNullable and aborted. This change appends present nested scalar values into the nullable destination's nested column and records a non-null entry in its null map.
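As a rough illustration of the fix described above, the sketch below appends a present value into a nullable destination by writing to its nested storage and recording a non-null flag. `ToyNullableInt` and `append_present` are hypothetical names, and the two parallel vectors only approximate how a nullable column pairs nested data with a null map.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified nullable column: values plus a parallel null map
// (1 = null, 0 = present). Not the Doris ColumnNullable class.
struct ToyNullableInt {
    std::vector<int32_t> nested;
    std::vector<uint8_t> null_map;
};

// Appends a value that is already known to be present. The source holds
// plain (non-nullable) values, so the value goes into the nested storage and
// the null map records a non-null entry, keeping both vectors equal in length.
inline void append_present(ToyNullableInt& dst, const std::vector<int32_t>& src,
                           std::size_t row) {
    assert(row < src.size());
    dst.nested.push_back(src[row]);
    dst.null_map.push_back(0);
}
```

The point of the sketch is that the source is never reinterpreted as a nullable column; only the destination's two parts are touched.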
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check -- be/src/format/new_parquet/reader/nested_column_reader.cpp
- Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Nested Parquet scalar batches can store values in a nullable column when the leaf schema is nullable, while the destination slot being filled may be a non-nullable child column after parent-level definition checks have already proven the value is present. The direct insert_from path then tried to insert a ColumnNullable value into a non-nullable destination such as ColumnString and aborted with a bad cast. This change unwraps nullable source batches for present values before appending, while preserving nullable destination handling.
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check -- be/src/format/new_parquet/reader/nested_column_reader.cpp
- Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Nested Parquet scalar batches can contain nullable source values. The scalar append helper previously treated any nullable source null at a present value slot as corruption, but nullable destination columns such as struct child b or nullable list/map values should receive a null entry. This change appends a default null when both source value and destination are nullable, while still rejecting null source values for required destinations.
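The four combinations of source nullness and destination nullability described above can be sketched as one small helper. This is an assumption-laden illustration, not the helper in `nested_column_reader.cpp`: `ToyDestination` and `append_scalar` are hypothetical, and `std::optional` stands in for a nullable source value.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical destination column; null_map is used only when nullable is true.
struct ToyDestination {
    bool nullable = false;
    std::vector<int32_t> values;
    std::vector<uint8_t> null_map; // 1 = null, 0 = present
};

// Appends one source value. Returns false when a null source value reaches a
// required (non-nullable) destination, which is still rejected.
inline bool append_scalar(ToyDestination& dst, const std::optional<int32_t>& src) {
    assert(!dst.nullable || dst.values.size() == dst.null_map.size());
    if (src.has_value()) {
        dst.values.push_back(*src); // unwrap the source before appending
        if (dst.nullable) {
            dst.null_map.push_back(0);
        }
        return true;
    }
    if (!dst.nullable) {
        return false; // null value for a required destination
    }
    dst.values.push_back(0); // default placeholder under the null flag
    dst.null_map.push_back(1);
    return true;
}
```

A null source appended to a nullable destination yields a default value with a null flag, while the same null offered to a required destination leaves it unchanged and reports failure.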
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check -- be/src/format/new_parquet/reader/nested_column_reader.cpp
- Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Nested Parquet scalar batches mixed two nullability models: nullable leaf values were stored in ColumnNullable while complex assemblers also interpreted definition levels to build element, value, and struct-child null maps. This allowed nullable source columns to leak into required child outputs and caused bad casts in complex type validation. The fix normalizes nested scalar batches so values_column always contains the non-nullable physical values, only max-definition-level slots receive value indices, and struct scalar children use the nullable-aware append helper to materialize child nulls from definition levels.
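The rule that only max-definition-level slots receive value indices can be sketched as follows, assuming a dense values column that stores present values only. `assign_dense_value_indices` and `kNoValue` are hypothetical names chosen for this illustration and do not come from the changed files.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Marker for a slot that has no entry in the values column (hypothetical).
constexpr int64_t kNoValue = -1;

// For each slot, returns its index into a dense values column, or kNoValue
// when the slot's definition level is below the leaf's maximum, meaning the
// value is null or missing at some level.
inline std::vector<int64_t> assign_dense_value_indices(
        const std::vector<int16_t>& def_levels, int16_t max_def_level) {
    std::vector<int64_t> indices;
    indices.reserve(def_levels.size());
    int64_t next = 0;
    for (int16_t level : def_levels) {
        assert(level <= max_def_level);
        indices.push_back(level == max_def_level ? next++ : kNoValue);
    }
    return indices;
}
```

Under this model the null map of a struct child or list element is derived from the definition levels alone, and the values column never needs a nullable wrapper.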
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check -- be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp be/src/format/new_parquet/reader/nested_column_reader.cpp be/src/format/new_parquet/reader/struct_column_reader.cpp
- Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Doris DataTypeArray creates nullable element columns by default, but Parquet LIST element nullability must follow the file schema. Required list elements were therefore materialized into ColumnNullable wrappers and later failed validation when callers expected the required element type such as ColumnInt32. This change removes the default nullable wrapper from list element data columns when the Parquet element reader type is non-nullable, including nested list elements and map LIST values.
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check -- be/src/format/new_parquet/reader/list_column_reader.cpp be/src/format/new_parquet/reader/map_column_reader.cpp
- Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Nested Parquet reads can intentionally leave value slots at the tail of a batch when the levels cross a requested parent batch boundary. The assembler moves those tail levels into overflow for the next read or skip call. read_nested_leaf_batch incorrectly required every value written by Arrow RecordReader to be consumed immediately, so map LIST skip/read overflow cases failed with an "extra values" corruption error. This change removes that eager check and lets overflow handling preserve the tail values.
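A minimal sketch of the overflow idea, assuming the usual Parquet convention that a repetition level of 0 starts a new top-level row. `levels_for_rows`, `LevelSplit`, and `split_at_row_boundary` are hypothetical names; the real assembler also carries definition levels and values, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns how many leading level entries belong to the first `rows`
// top-level rows. A repetition level of 0 starts a new row.
inline std::size_t levels_for_rows(const std::vector<int16_t>& rep_levels,
                                   std::size_t rows) {
    std::size_t rows_seen = 0;
    for (std::size_t i = 0; i < rep_levels.size(); ++i) {
        if (rep_levels[i] == 0) {
            if (rows_seen == rows) {
                return i; // entry i starts the first row past the request
            }
            ++rows_seen;
        }
    }
    return rep_levels.size();
}

struct LevelSplit {
    std::vector<int16_t> consumed; // levels used by the current call
    std::vector<int16_t> overflow; // tail kept for the next read or skip
};

// Splits buffered levels at the requested row boundary instead of treating
// the tail as an error.
inline LevelSplit split_at_row_boundary(const std::vector<int16_t>& rep_levels,
                                        std::size_t rows) {
    const std::size_t cut = levels_for_rows(rep_levels, rows);
    assert(cut <= rep_levels.size());
    const auto mid = rep_levels.begin() + static_cast<std::ptrdiff_t>(cut);
    LevelSplit split;
    split.consumed.assign(rep_levels.begin(), mid);
    split.overflow.assign(mid, rep_levels.end());
    return split;
}
```

With levels `{0, 1, 1, 0, 1, 0}` and two requested rows, the first five entries are consumed and the final entry stays in overflow rather than being reported as extra data.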
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check -- be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp
- Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Nested nullable Parquet leaves can still produce a value slot for null element/value positions. read_nested_leaf_batch skipped value index assignment when the definition level was below the leaf max definition level, which shifted all values after null slots and produced default/empty values in complex list and map reads. This change keeps value slots aligned with Arrow RecordReader output for all slots that belong to the requested shape; null materialization remains controlled by definition levels in the nullable append helpers.
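The alignment described above can be sketched under one explicit assumption: the values buffer is spaced, so every slot that exists in the requested shape occupies one position in it, including null slots. `SlotPlan`, `plan_spaced_slots`, and the `slot_def_level` threshold are hypothetical and only approximate how the reader decides that a slot exists.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct SlotPlan {
    std::vector<int64_t> value_index; // position in the spaced values buffer
    std::vector<uint8_t> is_null;     // 1 when the slot exists but is null
};

// slot_def_level: lowest definition level at which an element slot exists
// (assumed threshold). max_def_level: level of a fully defined value.
inline SlotPlan plan_spaced_slots(const std::vector<int16_t>& def_levels,
                                  int16_t slot_def_level, int16_t max_def_level) {
    assert(slot_def_level <= max_def_level);
    SlotPlan plan;
    int64_t next = 0;
    for (int16_t level : def_levels) {
        if (level < slot_def_level) {
            continue; // empty or null parent: no element slot at all
        }
        // Null slots still advance the index, so later values do not shift.
        plan.value_index.push_back(next++);
        plan.is_null.push_back(level < max_def_level ? 1 : 0);
    }
    return plan;
}
```

Here the null decision comes only from the definition level, while the index advances for every existing slot, which is what keeps values after a null slot from shifting.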
### Release note
None
### Check List (For Author)
- Test: Manual test
- git diff --check -- be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp
- Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
TPC-H: Total hot run time: 29699 ms
TPC-DS: Total hot run time: 168999 ms
TPC-H: Total hot run time: 29366 ms
TPC-DS: Total hot run time: 169411 ms