[fix](be) Avoid mutating shared local shuffle columns#64092
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
9566c5e to
ff10f78
Compare
|
run buildall |
ff10f78 to
2269005
Compare
|
run buildall |
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Queries that evaluate complex nullable or VARIANT expressions after local exchange can share input blocks across downstream pipeline tasks. Some BE paths mutated those shared columns in-place: nullable CRC hashing could replace nested null data, Variant casts and serialization could finalize the source column, and nested-loop join lazy materialization expanded probe-side columns while reusing the input block. This could make local-shuffle query results unstable and could trigger lazy nested-loop join nullable-column size mismatches. This change uses private finalized Variant copies for casts and serialization, deep-clones finalized Variant columns, avoids mutating shared nullable nested columns during CRC hashing, and copies only the lazy/materialized probe columns before nested-loop join materialization appends rows.
### Release note
None
### Check List (For Author)
- Test:
- PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh be/src/core/block/block.cpp be/src/core/column/column_nullable.cpp be/src/exec/operator/nested_loop_join_probe_operator.cpp be/test/core/column/column_nullable_test.cpp be/test/core/column/column_variant_test.cpp
- git diff --cached --check
- ./run-be-ut.sh --run --filter=FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:NLJAppendProbeDataWithNullTest.*:ColumnNullableTest.UpdateCrc32cBatchDoesNotReplaceSharedNestedNullData
- ./build.sh --be
- Manual test: local 4BE repro query for nullable nested-loop join lazy materialization, 8 workers x 80 loops, 0 failures and stable output hash
- Manual test: local 4BE RQG-shaped Variant/local-shuffle query, local shuffle off baseline vs 50 local shuffle on runs, 0 mismatches
- Behavior changed: No
- Does this need documentation: No
2269005 to
00d2917
Compare
|
run buildall |
|
run external |
TPC-H: Total hot run time: 29484 ms |
TPC-DS: Total hot run time: 169084 ms |
|
/review |
TPC-H: Total hot run time: 29117 ms |
|
run p0 |
|
run cloud_p0 |
|
run nonConcurrent |
TPC-DS: Total hot run time: 169178 ms |
TPC-H: Total hot run time: 28323 ms |
TPC-DS: Total hot run time: 169386 ms |
There was a problem hiding this comment.
Code review completed for PR #64092.
No blocking issues found. The changes are focused on avoiding in-place mutation of shared BE column trees after local exchange, and the implementation consistently uses private finalized/deep-cloned copies for Variant serialization/casts and a detached probe block for lazy nested-loop join materialization.
Critical checkpoint conclusions:
- Goal and tests: The code addresses the stated shared-column mutation problem. Added/updated unit coverage exercises nullable CRC hashing, Variant cast/serialization non-mutation, finalized Variant deep copy, and the relevant lazy NLJ nullable append behavior described in the PR.
- Scope/focus: The change is reasonably small for the affected paths and avoids broad behavioral changes or storage/protocol format changes.
- Concurrency/COW: No new thread or dependency behavior is introduced. The reviewed paths avoid mutating columns that may be shared across downstream pipeline tasks; recursive/deep cloning and lazy probe-block detachment look appropriate.
- Lifecycle/static initialization: No new static initialization or unusual lifecycle hazard found.
- Configuration/compatibility: No new config items and no incompatible serialization layout changes found.
- Parallel paths: Block-level Variant serialization pre-finalizes unfinalized top-level Variant columns, while datatype-level Variant serialization still protects nested/fallback callers. Cast and lazy NLJ paths are covered separately.
- Error handling: New Status-returning paths propagate errors with RETURN_IF_ERROR/RETURN_IF_CATCH_EXCEPTION. No ignored Status found in the diff.
- Memory/performance: The fixes intentionally trade extra cloning for correctness on shared columns. Block serialization avoids cloning the same top-level Variant twice; nested Variant datatype callers may still clone independently for sizing/serialization, but I did not find a correctness issue.
- Observability/transactions/persistence: Not applicable; no transaction, persistence, or user-facing observability changes.
- User focus points: No additional user-provided review focus was present.
Review note: I inspected the diff and related call paths and ran git diff --check for the PR diff; I did not rerun the full BE unit/build suite in this review runner.
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run external |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
1 similar comment
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
1 similar comment
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Queries that evaluate complex nullable or VARIANT expressions after local exchange can share input blocks across downstream pipeline tasks. Some BE paths mutated those shared columns in-place: nullable CRC hashing could replace nested null data, Variant casts and serialization could finalize the source column, and nested-loop join lazy materialization expanded probe-side columns while reusing the input block. This could make local-shuffle query results unstable and could trigger lazy nested-loop join nullable-column size mismatches. This change uses private finalized Variant copies for casts and serialization, deep-clones finalized Variant columns, avoids mutating shared nullable nested columns during CRC hashing, and copies only the lazy/materialized probe columns before nested-loop join materialization appends rows.
Release note
None
Check List (For Author)
PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh be/src/core/block/block.cpp be/src/core/column/column_nullable.cpp be/src/exec/operator/nested_loop_join_probe_operator.cpp be/test/core/column/column_nullable_test.cpp be/test/core/column/column_variant_test.cppgit diff --cached --check./run-be-ut.sh --run --filter=FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:NLJAppendProbeDataWithNullTest.*:ColumnNullableTest.UpdateCrc32cBatchDoesNotReplaceSharedNestedNullData./build.sh --be