Skip to content

[fix](be) Avoid mutating shared Variant columns#64132

Open
eldenmoon wants to merge 1 commit into
apache:masterfrom
eldenmoon:branch-fix-variant-shared-columns
Open

[fix](be) Avoid mutating shared Variant columns#64132
eldenmoon wants to merge 1 commit into
apache:masterfrom
eldenmoon:branch-fix-variant-shared-columns

Conversation

@eldenmoon
Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Local shuffle and lazy materialization paths can share input blocks while evaluating Variant expressions. Some Variant operations finalized, cached, or extended shared subcolumns in-place, so another consumer could observe a mutated column layout or a mismatched row count. Nullable rows materialized by nested-loop join also needed to preserve null defaults when repeating a selected row. This change keeps const Variant lookups from updating the path cache, makes Variant clone/exclusive checks cover nested internal columns, preserves pending Variant defaults during cast/serialization, and materializes nullable joined rows without reading from empty nested data columns.

Release note

Fix unstable local-shuffle query results and Variant cast failures caused by shared Variant column mutation.

Check List (For Author)

  • Test:
    • BE UT: env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY ./run-be-ut.sh --run --filter='ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:ColumnVariantTest.insert_range_from_materializes_pending_default_suffix:FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromFinalizedVariantJsonbPrefix:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
    • Format: PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh
    • Check: git diff --check upstream/master
  • Behavior changed: Yes. Variant casts, serialization, and local-shuffle consumers no longer mutate shared Variant columns.
  • Does this need documentation: No

Copilot AI review requested due to automatic review settings June 5, 2026 03:53
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon eldenmoon closed this Jun 5, 2026
@eldenmoon eldenmoon force-pushed the branch-fix-variant-shared-columns branch from a658633 to e2c8340 Compare June 5, 2026 03:57
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Local exchange and join execution can share input blocks across downstream tasks. Variant cast and block serialization finalized ColumnVariant in place, and Subcolumn::insert_range_from could leave the lazy-default suffix unmaterialized when copying ranges. For local-shuffle anti-join queries that evaluate Variant path casts, one task could mutate a shared Variant column while another task is still reading it, leading to unstable results or range-copy failures. This change finalizes private deep copies for Variant cast/serialization paths, trims serialized Variant cast inputs to the requested input row prefix, and materializes pending defaults during range copy.

The cast path must also handle empty prefixes and legacy root-only unfinalized Variant columns. An empty prefix can otherwise create a zero-row ColumnVariant and then call helpers that assume num_rows > 0. Root-only unfinalized Variant test columns can also have a semantic input row count greater than ColumnVariant::size(), so checking the requested rows against ColumnVariant::size() can crash even though the root column contains the rows being cast.

The fix was reproduced with Variant red tests: the old code finalized source Variant columns during cast/serialization, failed prefix Variant-to-JSONB casts on private finalized copies, failed already-finalized prefix Variant-to-JSONB casts, crashed CastFromVariant on a root-only unfinalized Variant column, and failed to copy pending defaults. The same tests pass after the change. A local four-BE cluster also verified the affected local-shuffle anti-join query with Variant expressions and a non-Variant control query on the same plan shape.

### Release note

Fix an issue where local-shuffle queries using VARIANT expressions could return unstable results or fail.

### Check List (For Author)

- Test:
    - BE Unit Test red/green: ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant' reproduced the root-only unfinalized Variant crash before the CastFromVariant guard fix and passed after it
    - BE Unit Test: ./run-be-ut.sh --run --filter='FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
    - BE Unit Test: ./run-be-ut.sh --run --filter='ColumnVariantTest.insert_range_from_materializes_pending_default_suffix:ColumnVariantTest.clone_finalized_deep_copies_columns:ColumnVariantTest.serialize_does_not_finalize_source_column:ColumnVariantTest.block_serialize_does_not_finalize_source_column:FunctionVariantCast.CastFromVariant:FunctionVariantCast.CastFromVariantDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantJsonbPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromVariantZeroRowPrefixDoesNotFinalizeSourceColumn:FunctionVariantCast.CastFromFinalizedVariantJsonbPrefix:FunctionVariantCast.CastFromNullableVariantPrefixDoesNotFinalizeSourceColumn'
    - Build: ./build.sh --be
    - Format: PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/clang-format.sh, git diff --check
    - Manual test: local four-BE cluster, 800/800 Variant local-shuffle anti-join queries passed, 800/800 non-Variant control queries passed
- Behavior changed: Yes. Variant cast and serialization no longer mutate shared source columns.
- Does this need documentation: No
@eldenmoon eldenmoon reopened this Jun 5, 2026
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets correctness issues in Doris BE VARIANT handling when upstream blocks/columns are shared (e.g., local exchange / local shuffle). It avoids in-place finalization/mutation of ColumnVariant during cast and serialization, and fixes a Subcolumn::insert_range_from() case where pending default rows weren’t materialized when copying a suffix range.

Changes:

  • Materialize pending default rows when copying a range from a VARIANT subcolumn (fixes missing-default suffix copy behavior).
  • Stop mutating unfinalized VARIANT columns during DataTypeVariant serialization by serializing a private finalized copy instead.
  • Stop mutating unfinalized VARIANT inputs during cast-from-variant by finalizing a private copy, and add/extend unit tests to validate “no source finalization” behavior and edge cases (prefix / zero-row).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
be/src/core/column/column_variant.cpp Ensures insert_range_from() materializes trailing pending-default rows when the copied range extends into the lazy-default suffix.
be/src/core/data_type/data_type_variant.cpp Switches serialization paths to use a private finalized VARIANT copy instead of finalizing the source column in-place.
be/src/exprs/function/cast/cast_to_variant.h Adjusts cast-from-variant to avoid mutating shared VARIANT inputs and adds handling for prefix/zero-row scenarios.
be/test/core/column/column_variant_test.cpp Adds BE unit tests covering deep-copy expectations for finalized clones and ensuring serialization doesn’t finalize the source VARIANT column.
be/test/exprs/function/cast/function_variant_cast_test.cpp Adds BE unit tests ensuring VARIANT casts (including prefix/zero-row and nullable inputs) do not finalize/mutate the source column.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 97.59% (81/83) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.74% (27445/38257)
Line Coverage 55.34% (294022/531268)
Region Coverage 52.09% (245265/470877)
Branch Coverage 53.24% (106040/199190)

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29515 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f748899d68fccc0f3384212bf932f375410a111b, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17637	4026	4003	4003
q2	q3	10848	1414	810	810
q4	4688	476	346	346
q5	7544	877	594	594
q6	188	173	141	141
q7	775	863	648	648
q8	9362	1506	1570	1506
q9	5966	4605	4569	4569
q10	6780	1812	1559	1559
q11	444	264	246	246
q12	627	420	292	292
q13	18148	3401	2803	2803
q14	256	257	245	245
q15	q16	822	778	709	709
q17	942	947	971	947
q18	6896	5786	5645	5645
q19	1307	1216	1092	1092
q20	522	393	265	265
q21	6443	2792	2781	2781
q22	461	373	314	314
Total cold run time: 100656 ms
Total hot run time: 29515 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5147	4719	4773	4719
q2	q3	4926	5327	4689	4689
q4	2137	2226	1450	1450
q5	4784	4958	4753	4753
q6	232	181	128	128
q7	1825	1720	1646	1646
q8	2421	2085	2066	2066
q9	7948	7754	7422	7422
q10	4723	4699	4227	4227
q11	519	390	348	348
q12	740	741	529	529
q13	3035	3354	2831	2831
q14	266	284	251	251
q15	q16	674	707	610	610
q17	1269	1264	1251	1251
q18	7193	6776	6743	6743
q19	1109	1105	1120	1105
q20	2212	2205	1945	1945
q21	5219	4564	4460	4460
q22	531	455	427	427
Total cold run time: 56910 ms
Total hot run time: 51600 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169405 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f748899d68fccc0f3384212bf932f375410a111b, data reload: false

query5	4319	638	494	494
query6	444	193	187	187
query7	4832	597	282	282
query8	363	220	220	220
query9	8779	4084	4089	4084
query10	476	324	267	267
query11	5963	2348	2209	2209
query12	157	108	102	102
query13	1259	622	446	446
query14	6408	5411	5099	5099
query14_1	4391	4411	4414	4411
query15	207	200	179	179
query16	1024	472	459	459
query17	1136	722	613	613
query18	2732	482	360	360
query19	207	186	149	149
query20	119	110	110	110
query21	218	142	122	122
query22	13724	13626	13451	13451
query23	17266	16579	16215	16215
query23_1	16395	16376	16349	16349
query24	7550	1789	1306	1306
query24_1	1317	1317	1322	1317
query25	577	479	441	441
query26	1293	318	166	166
query27	2637	594	340	340
query28	4394	2049	2011	2011
query29	1040	586	492	492
query30	307	234	202	202
query31	1113	1078	956	956
query32	111	61	58	58
query33	515	319	241	241
query34	1165	1151	648	648
query35	770	782	681	681
query36	1393	1411	1237	1237
query37	157	100	89	89
query38	3224	3167	3085	3085
query39	954	916	898	898
query39_1	878	871	867	867
query40	230	127	104	104
query41	66	62	62	62
query42	95	96	95	95
query43	324	316	282	282
query44	
query45	198	184	181	181
query46	1108	1166	740	740
query47	2325	2383	2250	2250
query48	392	434	283	283
query49	630	467	348	348
query50	998	355	247	247
query51	4339	4298	4209	4209
query52	89	89	77	77
query53	271	272	194	194
query54	270	219	198	198
query55	79	80	71	71
query56	247	217	223	217
query57	1451	1412	1307	1307
query58	254	221	217	217
query59	1583	1672	1434	1434
query60	289	246	241	241
query61	165	158	160	158
query62	690	689	560	560
query63	231	185	185	185
query64	2519	773	629	629
query65	
query66	1754	470	345	345
query67	29837	29697	29533	29533
query68	
query69	433	310	264	264
query70	952	972	926	926
query71	292	226	212	212
query72	2924	2636	2385	2385
query73	884	737	417	417
query74	5129	4966	4778	4778
query75	2658	2592	2250	2250
query76	2329	1167	796	796
query77	370	409	302	302
query78	12486	12280	11909	11909
query79	1224	1124	753	753
query80	578	509	423	423
query81	455	282	248	248
query82	246	162	135	135
query83	322	266	252	252
query84	248	138	113	113
query85	855	548	432	432
query86	324	296	287	287
query87	3367	3397	3184	3184
query88	3651	2772	2712	2712
query89	423	374	334	334
query90	2212	182	175	175
query91	175	165	136	136
query92	65	63	55	55
query93	1465	1407	819	819
query94	541	328	313	313
query95	688	396	448	396
query96	993	800	360	360
query97	2723	2691	2568	2568
query98	212	203	206	203
query99	1157	1182	1021	1021
Total cold run time: 251086 ms
Total hot run time: 169405 ms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants