[refine](column) enforce nullable nested types for array #63088
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Automated code review summary:
No blocking issues found in this PR.
Critical checkpoint conclusions:
- Goal and tests: The PR makes array element types explicitly nullable in BE and aligns FE array type modeling with that invariant. The changed BE serde unit test covers one Arrow nested-array path, and FE tests were adjusted for array nullability matching. I did not run the test suite in this review runner.
- Scope: The change is focused on array nested nullability and related FE/BE type conversion call sites.
- Concurrency and lifecycle: No new concurrent state, locks, threads, or non-trivial lifecycle/static initialization concerns were introduced.
- Configuration and compatibility: No new configuration items. Existing serialized FE metadata keeps the
containsNullfield for Gson compatibility while treating it as true at runtime. BE thrift/protobuf type descriptors continue to emit nullable array elements. - Parallel paths: I checked FE parser/type conversion/function signatures, BE DataTypeArray/DataTypeArraySerDe, and representative BE factory/consumer paths; the nullable nested type invariant is applied consistently in the modified paths.
- Error handling and invariants: The new
DORIS_CHECKassertions are used for the intended array nested-nullable invariant rather than silently accepting inconsistent columns/serdes. - Data correctness and persistence: No transaction/version/delete-bitmap/write path changes were introduced.
- Memory/performance: The change adds no significant allocations beyond normal nullable type wrapping and does not alter hot data loops except invariant checks on deserialize entry points.
- Observability: No additional observability appears necessary for this type-system refinement.
- User focus: No additional user-provided review focus was present.
|
run buildall |
TPC-H: Total hot run time: 29557 ms |
TPC-DS: Total hot run time: 170984 ms |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29429 ms |
TPC-DS: Total hot run time: 171726 ms |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29622 ms |
TPC-DS: Total hot run time: 172004 ms |
|
run buildall |
TPC-H: Total hot run time: 29687 ms |
TPC-DS: Total hot run time: 171519 ms |
|
run beut |
FE UT Coverage ReportIncrement line coverage |
1c1bdc1 to
88ca38a
Compare
|
run buildall |
TPC-H: Total hot run time: 29725 ms |
TPC-DS: Total hot run time: 170615 ms |
FE UT Coverage ReportIncrement line coverage |
|
run beut |
|
run buildall |
TPC-H: Total hot run time: 29817 ms |
TPC-DS: Total hot run time: 170923 ms |
|
run buildall |
TPC-H: Total hot run time: 29652 ms |
TPC-DS: Total hot run time: 172120 ms |
FE UT Coverage ReportIncrement line coverage |
|
run p0 |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
There was a problem hiding this comment.
I found blocking compatibility/schema issues in the array-nullability change.
Critical checkpoint conclusions:
- Goal/test coverage: The PR enforces nullable array elements across FE and BE. Tests were updated for unit/golden outputs, but coverage does not prove compatibility with existing external schemas or mixed-version/native block serialization behavior.
- Scope/focus: The actual GitHub PR is focused on array nested nullability.
- Concurrency: No concurrency paths are modified in the actual PR diff.
- Lifecycle/static initialization: No new lifecycle or static-initialization issue was identified.
- Configuration: No configuration items are added or changed.
- Compatibility/storage/protocol: Blocking concerns were found. The FE catalog now collapses explicit non-null array element metadata, affecting external schema contracts. BE now wraps every array nested type in
Nullable, changing serialized/native block layout without version gating. - Parallel code paths: FE Nereids/catalog and BE type/serde paths were updated consistently for the new invariant, but external catalog conversion paths still consume
getContainsNull()as a schema contract. - Tests/results:
.nativeand unit expectations were updated, but there is no migration/compatibility test for reading old native blocks or for Iceberg/Paimon required array elements. - Observability: Not applicable.
- Transactions/persistence/data writes: No transaction path was changed, but persisted/external schema compatibility is affected by the catalog type change.
- FE/BE variable passing: Thrift/protobuf array nullability is forced toward nullable; mixed-version behavior needs explicit handling.
- Performance: No concrete performance issue identified.
User focus: No additional user-provided review focus was present.
|
|
||
| DataTypeArray::DataTypeArray(const DataTypePtr& nested_) : nested {nested_} {} | ||
| DataTypeArray::DataTypeArray(const DataTypePtr& nested_) { | ||
| DataTypePtr nullable_nested = make_nullable(nested_); |
There was a problem hiding this comment.
Wrapping every array element type in Nullable changes the BE block/native serialization shape for arrays: DataTypeArray::serialize() and get_uncompressed_serialized_bytes() now delegate to a DataTypeNullable, which adds a null map before the nested element payload. The PR updates all_types_single_row.native, which confirms the binary representation changes, but there is no be_exec_version branch or backward-compatible deserialize path. In a rolling upgrade or when reading native blocks/files produced by the old non-null nested representation, old/new BEs will disagree about the array payload layout and misread the stream. Please gate the new nullable-nested encoding by be_exec_version or add compatibility decoding for the old layout before changing the serialized format.
| * Always returns {@code true}. Array elements are always nullable in Doris. | ||
| */ | ||
| public boolean getContainsNull() { | ||
| return containsNull; |
There was a problem hiding this comment.
Making getContainsNull() always return true silently changes catalog type semantics that external schema conversion still relies on. Iceberg conversion uses this flag to emit optional vs required list elements, Paimon copies it into the element type, and Iceberg alter validation compares required/optional list elements against newArrayType.getContainsNull(). After this change, any existing or constructed ARRAY<... NOT NULL> is reported as nullable and can be converted/validated as optional, changing external table schema contracts. If Doris execution requires nullable array elements, external schema paths need an explicit rejection or separate preservation of external element nullability instead of collapsing the flag globally.
应该不会有兼容性的问题,如果有规划出来给be,be就炸了
What problem does this PR solve?
This PR makes the nested types inside Array explicitly nullable in BE type implementations, instead of relying on implicit caller-side conventions.
DataTypeArray now always stores nullable nested element type
DataTypeArraySerDe updated to follow the same invariant
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)