[fix](be) Avoid mutating shared Variant columns #64132
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Local exchange and join execution can share input blocks across downstream tasks. Variant cast and block serialization finalized ColumnVariant in place, and Subcolumn::insert_range_from could leave the lazy-default suffix unmaterialized when copying ranges. For local-shuffle anti-join queries that evaluate Variant path casts, one task could mutate a shared Variant column while another task is still reading it, leading to unstable results or range-copy failures. This change finalizes private deep copies for Variant cast/serialization paths, trims serialized Variant cast inputs to the requested input row prefix, and materializes pending defaults during range copy.
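The copy-then-finalize idea can be sketched in isolation. This is a minimal illustration, not the Doris code: `ToyVariantColumn` and `serialize_row_count` are hypothetical stand-ins, and `clone_finalized` here only borrows the name that appears in the unit test list.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a Variant column; not the Doris ColumnVariant API.
struct ToyVariantColumn {
    std::vector<int> rows;
    bool finalized = false;

    // In-place finalization mutates the object, so it is unsafe on a column
    // that another task may still be reading.
    void finalize() { finalized = true; }

    // Deep-copy the column, then finalize only the copy.
    std::shared_ptr<ToyVariantColumn> clone_finalized() const {
        auto copy = std::make_shared<ToyVariantColumn>(*this);
        copy->finalize();
        return copy;
    }
};

// Hypothetical consumer: receives a shared, read-only column and does its
// work on a private finalized copy, leaving the shared source untouched.
std::size_t serialize_row_count(const std::shared_ptr<const ToyVariantColumn>& shared) {
    assert(shared != nullptr);
    std::shared_ptr<ToyVariantColumn> priv = shared->clone_finalized();
    return priv->rows.size();
}
```

Taking the input as a pointer-to-const makes an accidental in-place `finalize()` on the shared column a compile error in this sketch.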
The cast path must also handle empty prefixes and legacy root-only unfinalized Variant columns. An empty prefix can otherwise create a zero-row ColumnVariant and then call helpers that assume num_rows > 0. Root-only unfinalized Variant test columns can also have a semantic input row count greater than ColumnVariant::size(), so checking the requested rows against ColumnVariant::size() can crash even though the root column contains the rows being cast.
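The two guards described above can be illustrated with a toy model. Everything here is an assumption for illustration: `ToyRootOnlyVariant`, its `reported_size` field, and `take_prefix` are hypothetical and do not mirror the actual CastFromVariant code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a root-only unfinalized column: the root holds the
// rows, while size() may report fewer rows than the root contains.
struct ToyRootOnlyVariant {
    std::vector<int> root;      // rows actually available for the cast
    std::size_t reported_size;  // value returned by size(); may lag behind root
    std::size_t size() const { return reported_size; }
};

// Returns a private copy of the first `input_rows` rows.
std::vector<int> take_prefix(const ToyRootOnlyVariant& col, std::size_t input_rows) {
    if (input_rows == 0) {
        // Empty prefix: return early instead of building a zero-row column
        // and calling helpers that assume num_rows > 0.
        return {};
    }
    // Validate the request against the root column, not against size().
    assert(input_rows <= col.root.size());
    return std::vector<int>(col.root.begin(),
                            col.root.begin() + static_cast<std::ptrdiff_t>(input_rows));
}
```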
The fix was reproduced with Variant red tests: the old code finalized source Variant columns during cast/serialization, failed prefix Variant-to-JSONB casts on private finalized copies, failed already-finalized prefix Variant-to-JSONB casts, crashed CastFromVariant on a root-only unfinalized Variant column, and failed to copy pending defaults. The same tests pass after the change. A local four-BE cluster also verified the affected local-shuffle anti-join query with Variant expressions and a non-Variant control query on the same plan shape.
### Release note
Fix an issue where local-shuffle queries using VARIANT expressions could return unstable results or fail.
### Check List (For Author)
- Test:
- BE Unit Test red/green: ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant' reproduced the root-only unfinalized Variant crash before the CastFromVariant guard fix and passed after it
- BE Unit Test: ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
- BE Unit Test: ./run-be-ut.sh --run --filter='ColumnVariantTest.insert_range_from_materializes_pending_default_suffix:ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromFinalizedVariantJsonbPrefix:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
- Build: ./build.sh --be
- Format: PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh, git diff --check
- Manual test: local four-BE cluster, 800/800 Variant local-shuffle anti-join queries passed, 800/800 non-Variant control queries passed
- Behavior changed: Yes. Variant cast and serialization no longer mutate shared source columns.
- Does this need documentation: No
Pull request overview
This PR targets correctness issues in Doris BE VARIANT handling when upstream blocks/columns are shared (e.g., local exchange / local shuffle). It avoids in-place finalization/mutation of ColumnVariant during cast and serialization, and fixes a Subcolumn::insert_range_from() case where pending default rows weren’t materialized when copying a suffix range.
Changes:
- Materialize pending default rows when copying a range from a VARIANT subcolumn (fixes missing-default suffix copy behavior).
- Stop mutating unfinalized VARIANT columns during DataTypeVariant serialization by serializing a private finalized copy instead.
- Stop mutating unfinalized VARIANT inputs during cast-from-variant by finalizing a private copy, and add/extend unit tests to validate “no source finalization” behavior and edge cases (prefix / zero-row).
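The first change, materializing pending defaults during a range copy, can be sketched with a toy subcolumn. This is a simplified model under stated assumptions: `ToySubcolumn`, `pending_defaults`, and the use of `0` as the default value are hypothetical, and the real Subcolumn::insert_range_from works on typed column parts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical subcolumn whose trailing default rows are tracked lazily.
struct ToySubcolumn {
    std::vector<int> data;             // materialized rows
    std::size_t pending_defaults = 0;  // trailing defaults not yet stored in `data`

    std::size_t size() const { return data.size() + pending_defaults; }

    // Write the lazy suffix into `data` so later appends keep row order.
    void materialize_pending() {
        data.resize(data.size() + pending_defaults, 0);
        pending_defaults = 0;
    }

    // Append rows [start, start + length) of `src` to this subcolumn.
    void insert_range_from(const ToySubcolumn& src, std::size_t start, std::size_t length) {
        assert(start + length <= src.size());
        materialize_pending();
        const std::size_t end = start + length;
        for (std::size_t i = start; i < end; ++i) {
            // Rows past src.data lie in the lazy-default suffix; copy them as
            // explicit defaults instead of silently dropping them.
            data.push_back(i < src.data.size() ? src.data[i] : 0);
        }
    }
};
```

The source stays const throughout, so the copy never needs to materialize the lazy suffix on the shared side.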
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| be/src/core/column/column_variant.cpp | Ensures insert_range_from() materializes trailing pending-default rows when the copied range extends into the lazy-default suffix. |
| be/src/core/data_type/data_type_variant.cpp | Switches serialization paths to use a private finalized VARIANT copy instead of finalizing the source column in-place. |
| be/src/exprs/function/cast/cast_to_variant.h | Adjusts cast-from-variant to avoid mutating shared VARIANT inputs and adds handling for prefix/zero-row scenarios. |
| be/test/core/column/column_variant_test.cpp | Adds BE unit tests covering deep-copy expectations for finalized clones and ensuring serialization doesn’t finalize the source VARIANT column. |
| be/test/exprs/function/cast/function_variant_cast_test.cpp | Adds BE unit tests ensuring VARIANT casts (including prefix/zero-row and nullable inputs) do not finalize/mutate the source column. |
TPC-H: Total hot run time: 29515 ms
TPC-DS: Total hot run time: 169405 ms
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Local shuffle and lazy materialization paths can share input blocks while evaluating Variant expressions. Some Variant operations finalized, cached, or extended shared subcolumns in-place, so another consumer could observe a mutated column layout or a mismatched row count. Nullable rows materialized by nested-loop join also needed to preserve null defaults when repeating a selected row. This change keeps const Variant lookups from updating the path cache, makes Variant clone/exclusive checks cover nested internal columns, preserves pending Variant defaults during cast/serialization, and materializes nullable joined rows without reading from empty nested data columns.
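One way to keep const lookups from touching a cache is to give the const overload a read-only path. The sketch below is an assumption about the general technique, not the Doris implementation: `ToyPathIndex`, its members, and the `-1` "not found" convention are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical path index with a lookup cache; not the Doris data structure.
struct ToyPathIndex {
    std::map<std::string, int> subcolumns;       // path -> subcolumn id
    std::unordered_map<std::string, int> cache;  // recently resolved paths

    // Non-const lookup: the caller owns the column, so caching is allowed.
    // Returns -1 when the path is absent.
    int lookup(const std::string& path) {
        auto cached = cache.find(path);
        if (cached != cache.end()) return cached->second;
        auto it = subcolumns.find(path);
        if (it == subcolumns.end()) return -1;
        cache.emplace(path, it->second);
        return it->second;
    }

    // Const lookup: may read the cache but never writes to it, so a column
    // shared between tasks is not modified by a read.
    int lookup(const std::string& path) const {
        auto cached = cache.find(path);
        if (cached != cache.end()) return cached->second;
        auto it = subcolumns.find(path);
        return it == subcolumns.end() ? -1 : it->second;
    }
};
```

Avoiding a `mutable` cache in the const overload trades a possible repeated lookup for the guarantee that readers of a shared column never write to it.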
### Release note
Fix unstable local-shuffle query results and Variant cast failures caused by shared Variant column mutation.
### Check List (For Author)
- env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY ./run-be-ut.sh --run --filter='ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:ColumnVariantTest.insert_range_from_materializes_pending_default_suffix:FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromFinalizedVariantJsonbPrefix:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
- PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh
- git diff --check upstream/master