[fix](serde) Support large string arrow builder for variant serde#63718

Open
eldenmoon wants to merge 1 commit into apache:master from eldenmoon:branch-doris-24846
Conversation

@eldenmoon
Member

@eldenmoon eldenmoon commented May 27, 2026

What problem does this PR solve?

Problem Summary: DataTypeVariantSerDe::write_column_to_arrow always cast the Arrow builder to arrow::StringBuilder. During Parquet OUTFILE export, the Arrow block converter can switch utf8 columns to large_utf8 when a batch is large, which gives variant serialization an arrow::LargeStringBuilder and crashes BE on the bad cast.

This patch handles both arrow::StringBuilder and arrow::LargeStringBuilder for VARIANT Arrow serialization and adds a BE UT that reproduces the LargeStringBuilder path.

Release note

Fix BE crash when exporting VARIANT columns to Parquet OUTFILE with large Arrow string batches.

Check List (For Author)

  • Test: Unit Test
    • ./run-be-ut.sh --run --filter='DataTypeSerDeTest.VariantWriteColumnToArrowSupportsLargeString'
    • ./run-be-ut.sh --run --filter='DataTypeSerDeTest.*'
    • PATH=/mnt/disk1/claude-max/ldb_toolchain16/bin:$PATH build-support/check-format.sh
  • Behavior changed: Yes. VARIANT Arrow serialization now supports large_utf8 builders instead of aborting on a bad builder cast.
  • Does this need documentation: No

Notes

build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN --base upstream/master was attempted. It is blocked by existing diagnostics in this path, including core/types.h unmatched NOLINTEND and pre-existing modernize/readability findings in data_type_variant_serde.cpp / data_type_serde_test.cpp; the new signed/unsigned warning introduced while developing this patch was fixed before the final tests.

Copilot AI review requested due to automatic review settings May 27, 2026 05:14
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@eldenmoon
Member Author

run buildall

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a BE crash during Parquet OUTFILE export of VARIANT columns by making DataTypeVariantSerDe::write_column_to_arrow handle both Arrow utf8 (arrow::StringBuilder) and large_utf8 (arrow::LargeStringBuilder) builders, and adds a unit test that exercises the LargeStringBuilder path.

Changes:

  • Add Arrow builder type dispatch for VARIANT Arrow serialization (STRING vs LARGE_STRING) and return a clear error for unsupported Arrow types.
  • Refactor the per-row Arrow append logic into a shared templated helper.
  • Add a BE unit test to reproduce and validate the LargeStringBuilder serialization path.
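
The three changes above can be sketched in isolation. The following is a minimal illustration, not the patch itself: `MockStringBuilder`, `MockLargeStringBuilder`, `MockInt64Builder`, `BuilderKind`, `append_rows`, and `write_rows` are hypothetical stand-ins written for this example and are not Arrow or Doris APIs. The only Arrow fact it relies on is that `arrow::StringBuilder` uses 32-bit offsets while `arrow::LargeStringBuilder` uses 64-bit offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for Arrow builders (NOT the Arrow API).
enum class BuilderKind { String, LargeString, Int64 };

struct BuilderBase {
    virtual ~BuilderBase() = default;
    virtual BuilderKind kind() const = 0;
};

struct MockStringBuilder : BuilderBase {
    using offset_type = int32_t; // mirrors the 32-bit offsets of utf8
    std::vector<std::string> values;
    BuilderKind kind() const override { return BuilderKind::String; }
    void Append(const char* data, offset_type len) {
        values.emplace_back(data, static_cast<std::size_t>(len));
    }
};

struct MockLargeStringBuilder : BuilderBase {
    using offset_type = int64_t; // mirrors the 64-bit offsets of large_utf8
    std::vector<std::string> values;
    BuilderKind kind() const override { return BuilderKind::LargeString; }
    void Append(const char* data, offset_type len) {
        values.emplace_back(data, static_cast<std::size_t>(len));
    }
};

struct MockInt64Builder : BuilderBase {
    BuilderKind kind() const override { return BuilderKind::Int64; }
};

// Shared per-row append logic, parameterized on the concrete builder type.
// The size conversion is unchecked here to keep the sketch short.
template <typename BuilderType>
bool append_rows(const std::vector<std::string>& rows, BuilderType& builder) {
    for (const auto& row : rows) {
        builder.Append(row.data(),
                       static_cast<typename BuilderType::offset_type>(row.size()));
    }
    return true;
}

// Dispatch on the builder's runtime kind instead of assuming one concrete type.
// Unsupported kinds produce an error rather than an invalid downcast.
bool write_rows(const std::vector<std::string>& rows, BuilderBase* builder,
                std::string* error) {
    assert(builder != nullptr && error != nullptr);
    switch (builder->kind()) {
    case BuilderKind::String:
        return append_rows(rows, static_cast<MockStringBuilder&>(*builder));
    case BuilderKind::LargeString:
        return append_rows(rows, static_cast<MockLargeStringBuilder&>(*builder));
    default:
        *error = "unsupported builder type";
        return false;
    }
}
```

Passing a `MockLargeStringBuilder` to `write_rows` appends every row, while a `MockInt64Builder` makes it return `false` with an error message instead of performing a bad cast.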

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| be/src/core/data_type_serde/data_type_variant_serde.cpp | Dispatch VARIANT Arrow serialization across StringBuilder and LargeStringBuilder, avoiding invalid builder casts. |
| be/test/core/data_type_serde/data_type_serde_test.cpp | Add UT coverage for VARIANT serialization into arrow::LargeStringBuilder. |

Comment on lines +58 to +62
    var.serialize_one_row_to_string(i, &serialized_value, options);
    const auto serialized_size =
            cast_set<typename BuilderType::offset_type>(serialized_value.size());
    RETURN_IF_ERROR(checkArrowStatus(builder.Append(serialized_value.data(), serialized_size),
                                     column.get_name(), builder.type()->name()));
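
The snippet above narrows `serialized_value.size()` (an unsigned size) to the builder's signed offset type, which is 32-bit for `arrow::StringBuilder` and 64-bit for `arrow::LargeStringBuilder`. As a rough sketch of what a range-checked narrowing looks like, the helper below rejects sizes that do not fit. `checked_offset` is a hypothetical name invented for this example; it is not Doris's `cast_set`, whose behavior on out-of-range input is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not Doris's cast_set): converts a size to OffsetType
// and reports failure when the value does not fit, instead of wrapping.
template <typename OffsetType>
bool checked_offset(std::size_t size, OffsetType* out) {
    static_assert(std::is_integral<OffsetType>::value,
                  "OffsetType must be an integer type");
    assert(out != nullptr);
    // Compare in the widest unsigned type so neither operand is truncated.
    const auto max_offset =
            static_cast<std::uintmax_t>(std::numeric_limits<OffsetType>::max());
    if (static_cast<std::uintmax_t>(size) > max_offset) {
        return false; // would overflow the builder's offset type
    }
    *out = static_cast<OffsetType>(size);
    return true;
}
```

With this shape, a value one byte larger than `INT32_MAX` is rejected for a 32-bit offset type and accepted for a 64-bit one, so the caller can turn the failure into an error status.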
Issue Number: close DORIS-24846

Related PR: #xxx

Problem Summary: Fix BE crash when serializing VARIANT columns through Arrow large string builders during Parquet OUTFILE export.

@eldenmoon eldenmoon changed the title from "[fix](be) Support large string arrow builder for variant serde" to "[fix](serde) Support large string arrow builder for variant serde" May 27, 2026
@eldenmoon eldenmoon force-pushed the branch-doris-24846 branch from 289c900 to 21648cc Compare May 27, 2026 06:19
@eldenmoon
Member Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31565 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 21648cca34fd2b07db1221678d4381a46efe0a43, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17638	3989	3940	3940
q2	q3	10807	1349	815	815
q4	4675	478	342	342
q5	7558	2264	2073	2073
q6	254	173	142	142
q7	921	803	644	644
q8	9463	1733	1680	1680
q9	6725	4954	4919	4919
q10	6439	2217	1865	1865
q11	438	274	252	252
q12	689	422	292	292
q13	18359	3341	2774	2774
q14	258	259	241	241
q15	q16	825	762	707	707
q17	1005	865	932	865
q18	6867	5657	5687	5657
q19	1220	1362	1216	1216
q20	521	408	283	283
q21	6082	2728	2541	2541
q22	493	363	317	317
Total cold run time: 101237 ms
Total hot run time: 31565 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4777	4674	4952	4674
q2	q3	4886	5285	4636	4636
q4	2118	2179	1376	1376
q5	4941	4699	4659	4659
q6	232	178	126	126
q7	1800	1765	1441	1441
q8	2188	1932	1921	1921
q9	7321	7405	7340	7340
q10	4753	4733	4258	4258
q11	540	388	354	354
q12	722	751	537	537
q13	3036	3354	2814	2814
q14	278	287	266	266
q15	q16	678	699	612	612
q17	1277	1257	1258	1257
q18	7311	6921	6688	6688
q19	1109	1097	1110	1097
q20	2229	2218	1947	1947
q21	5235	4546	4402	4402
q22	527	443	401	401
Total cold run time: 55958 ms
Total hot run time: 50806 ms

@eldenmoon
Member Author

/review

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172080 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 21648cca34fd2b07db1221678d4381a46efe0a43, data reload: false

query5	4300	646	510	510
query6	338	214	202	202
query7	4213	560	311	311
query8	322	222	220	220
query9	8782	4060	4018	4018
query10	457	353	310	310
query11	5799	2593	2247	2247
query12	182	128	135	128
query13	1302	635	418	418
query14	6134	5474	5150	5150
query14_1	4464	4432	4467	4432
query15	216	209	184	184
query16	1071	467	439	439
query17	1169	742	616	616
query18	2719	487	368	368
query19	222	215	169	169
query20	142	133	146	133
query21	217	141	120	120
query22	13628	13522	13364	13364
query23	17360	16439	16294	16294
query23_1	16258	16320	16319	16319
query24	7470	1786	1305	1305
query24_1	1325	1311	1323	1311
query25	565	501	443	443
query26	1316	336	179	179
query27	2685	536	356	356
query28	4455	2016	2014	2014
query29	1013	645	527	527
query30	299	245	203	203
query31	1150	1090	958	958
query32	90	79	122	79
query33	527	344	289	289
query34	1229	1115	640	640
query35	764	793	702	702
query36	1347	1436	1270	1270
query37	148	101	88	88
query38	3216	3152	3066	3066
query39	920	943	911	911
query39_1	885	878	898	878
query40	231	147	120	120
query41	63	64	62	62
query42	110	110	105	105
query43	325	334	300	300
query44	
query45	214	202	199	199
query46	1073	1223	742	742
query47	2408	2360	2366	2360
query48	425	426	319	319
query49	643	500	390	390
query50	1074	355	253	253
query51	4411	4323	4239	4239
query52	104	106	98	98
query53	267	296	206	206
query54	309	265	245	245
query55	93	93	85	85
query56	299	315	302	302
query57	1428	1426	1331	1331
query58	302	280	265	265
query59	1575	1684	1432	1432
query60	314	311	313	311
query61	153	152	154	152
query62	698	639	595	595
query63	243	203	223	203
query64	2373	800	656	656
query65	
query66	1671	478	355	355
query67	29875	29655	29467	29467
query68	
query69	469	358	307	307
query70	1037	1002	1028	1002
query71	305	278	266	266
query72	3059	2684	2390	2390
query73	878	788	445	445
query74	5122	4931	4789	4789
query75	2717	2624	2274	2274
query76	2292	1133	765	765
query77	395	411	337	337
query78	12374	12410	11868	11868
query79	1495	1005	787	787
query80	1319	547	458	458
query81	517	283	243	243
query82	1195	153	121	121
query83	316	275	251	251
query84	262	144	111	111
query85	944	576	461	461
query86	459	339	339	339
query87	3431	3366	3272	3272
query88	3653	2731	2751	2731
query89	450	393	340	340
query90	1937	183	185	183
query91	180	167	138	138
query92	81	80	71	71
query93	1547	1414	840	840
query94	718	364	333	333
query95	688	472	349	349
query96	1053	767	325	325
query97	2736	2720	2683	2683
query98	248	226	227	226
query99	1177	1140	1026	1026
Total cold run time: 255296 ms
Total hot run time: 172080 ms

Contributor

@github-actions github-actions Bot left a comment

Code review completed. I did not find additional blocking issues in this PR, and I did not duplicate the already-open inline thread about cast_set overflow in data_type_variant_serde.cpp.

Critical checkpoint conclusions:

  • Goal/test: The PR targets the VARIANT-to-Arrow crash when the Arrow converter upgrades utf8 to large_utf8 for large batches. The updated SerDe now dispatches both arrow::StringBuilder and arrow::LargeStringBuilder, and the added BE UT covers the LargeStringBuilder path.
  • Scope: The change is small and focused on DataTypeVariantSerDe::write_column_to_arrow, with a localized helper mirroring the string SerDe pattern.
  • Concurrency/lifecycle: No new shared state, threads, locks, or special lifecycle/static initialization concerns were introduced.
  • Configuration/compatibility: No config items, storage format changes, or FE/BE protocol changes were introduced.
  • Parallel paths: The behavior now aligns with the existing string SerDe large_utf8 handling used by the same Arrow block conversion path.
  • Error handling: Unsupported Arrow builder types now return InvalidArgument instead of failing a bad cast. Existing review context already covers the cast-overflow concern, so I am not reopening it here.
  • Testing: A targeted BE unit test was added for the LargeStringBuilder path. I did not run tests in this review runner.
  • Observability/performance: No additional observability appears necessary for this narrow serialization fix; per-row serialization behavior is unchanged from the previous implementation.
  • User focus: No additional user-provided review focus was specified.

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 92.59% (25/27) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 73.86% (28089/38029) |
| Line Coverage | 57.79% (305062/527862) |
| Region Coverage | 55.09% (256035/464763) |
| Branch Coverage | 56.49% (110343/195349) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 92.59% (25/27) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 73.86% (28089/38029) |
| Line Coverage | 57.79% (305075/527862) |
| Region Coverage | 55.10% (256099/464763) |
| Branch Coverage | 56.49% (110355/195349) |
