[fix](variant) return raw string for element_at on scalar-string variant #64103

Open
csun5285 wants to merge 1 commit into apache:master from csun5285:fix/CIR-20498-variant-element-quote

@csun5285
Contributor

@csun5285 csun5285 commented Jun 4, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@csun5285
Contributor Author

csun5285 commented Jun 4, 2026

run buildall

@csun5285 csun5285 force-pushed the fix/CIR-20498-variant-element-quote branch from c9c907e to f5b09c1 Compare June 4, 2026 07:00
@csun5285
Contributor Author

csun5285 commented Jun 4, 2026

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 80.00% (4/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.93% (21053/39038)
Line Coverage 37.60% (200208/532441)
Region Coverage 33.68% (157003/466174)
Branch Coverage 34.65% (68702/198302)

@hello-stephen
Contributor

TPC-H: Total hot run time: 28634 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f5b09c1466ff24cb5701b4564cb7ddebbe110aa2, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17731	4013	4004	4004
q2
q3	10793	1386	823	823
q4	4705	477	351	351
q5	7516	896	601	601
q6	182	173	137	137
q7	778	861	623	623
q8	9340	1560	1568	1560
q9	5799	4434	4449	4434
q10	6787	1787	1526	1526
q11	454	277	248	248
q12	632	436	292	292
q13	18125	3398	2804	2804
q14	266	263	242	242
q15
q16	809	768	708	708
q17	985	999	823	823
q18	6968	5819	5511	5511
q19	1321	1303	996	996
q20	524	397	261	261
q21	5977	2623	2387	2387
q22	432	351	303	303
Total cold run time: 100124 ms
Total hot run time: 28634 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4374	4268	4269	4268
q2
q3	4518	4962	4355	4355
q4	2093	2195	1382	1382
q5	4473	4257	4303	4257
q6	229	175	128	128
q7	1738	1622	1686	1622
q8	2830	2275	2183	2183
q9	8214	8271	7889	7889
q10	4834	4953	4268	4268
q11	575	415	384	384
q12	739	768	548	548
q13	3418	3684	2946	2946
q14	305	322	268	268
q15
q16	733	733	622	622
q17	1347	1320	1309	1309
q18	8076	7278	7203	7203
q19	1146	1129	1194	1129
q20	2280	2232	1976	1976
q21	5316	4615	4438	4438
q22	512	452	418	418
Total cold run time: 57750 ms
Total hot run time: 51593 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168660 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f5b09c1466ff24cb5701b4564cb7ddebbe110aa2, data reload: false

query5	4349	621	482	482
query6	441	207	185	185
query7	4897	572	280	280
query8	416	214	202	202
query9	8777	4038	3975	3975
query10	462	313	264	264
query11	5916	2373	2155	2155
query12	156	105	101	101
query13	1367	590	391	391
query14	6377	5402	5074	5074
query14_1	4420	4392	4405	4392
query15	208	195	174	174
query16	1039	451	449	449
query17	1132	723	604	604
query18	2586	485	360	360
query19	240	188	149	149
query20	112	109	109	109
query21	226	151	120	120
query22	13627	13473	13399	13399
query23	17598	16416	16205	16205
query23_1	16338	16382	16326	16326
query24	7498	1779	1311	1311
query24_1	1331	1292	1334	1292
query25	622	479	400	400
query26	1334	324	175	175
query27	2726	558	343	343
query28	4453	2005	1997	1997
query29	1188	592	472	472
query30	313	234	198	198
query31	1114	1061	954	954
query32	112	65	65	65
query33	541	320	246	246
query34	1325	1209	649	649
query35	789	776	670	670
query36	1366	1407	1261	1261
query37	152	103	89	89
query38	3196	3140	3065	3065
query39	936	919	892	892
query39_1	904	873	878	873
query40	227	119	101	101
query41	64	64	62	62
query42	95	93	93	93
query43	314	316	275	275
query44	
query45	199	187	182	182
query46	1093	1200	722	722
query47	2342	2424	2225	2225
query48	386	417	301	301
query49	630	459	353	353
query50	1067	365	254	254
query51	4300	4324	4152	4152
query52	88	87	75	75
query53	239	261	197	197
query54	279	213	217	213
query55	79	76	71	71
query56	242	240	222	222
query57	1453	1394	1347	1347
query58	273	213	207	207
query59	1596	1611	1453	1453
query60	312	251	226	226
query61	161	156	152	152
query62	692	655	577	577
query63	239	189	184	184
query64	2579	782	618	618
query65	
query66	1786	464	349	349
query67	29633	29803	29605	29605
query68	
query69	427	303	265	265
query70	975	971	916	916
query71	293	223	198	198
query72	3127	2650	2390	2390
query73	835	777	430	430
query74	5153	4934	4723	4723
query75	2674	2598	2227	2227
query76	2378	1159	769	769
query77	355	369	271	271
query78	12329	12374	11843	11843
query79	1290	1041	742	742
query80	548	468	411	411
query81	462	274	235	235
query82	237	159	119	119
query83	359	282	247	247
query84	264	141	114	114
query85	1051	633	439	439
query86	343	309	300	300
query87	3380	3390	3135	3135
query88	3603	2719	2709	2709
query89	431	376	326	326
query90	2150	172	177	172
query91	177	162	138	138
query92	67	61	54	54
query93	1408	1337	842	842
query94	551	343	313	313
query95	717	487	342	342
query96	1061	773	351	351
query97	2684	2705	2555	2555
query98	215	209	202	202
query99	1141	1176	1035	1035
Total cold run time: 252133 ms
Total hot run time: 168660 ms

When extracting a string property from a scalar-string variant (the shape
produced by `cast(text as variant)`), `element_at` goes through the simdjson
document path, which stored `simdjson::to_json_string(value)` for the extracted
value. For a JSON string, that representation keeps the surrounding double
quotes (e.g. `"2026-05-20 18:40:02"`). The quotes leaked into the result and made
scalar-string variants inconsistent with the structured-subcolumn path, which
returns the string unquoted. They also broke downstream string operations: for
example, `substring(v['k'], 1, 10)` consumed the leading quote.
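
The quote leak described above can be reproduced by analogy with Python's standard `json` module. This is only an illustrative sketch, not Doris or simdjson code: `json.dumps` stands in for the JSON-text representation, and the `substring` helper is a hypothetical 1-based stand-in for the SQL function.

```python
import json

# Hypothetical document standing in for a scalar-string variant root.
doc = json.loads('{"k": "2026-05-20 18:40:02"}')

raw = doc["k"]               # the string value itself, without quotes
json_text = json.dumps(raw)  # JSON-text form, which keeps the double quotes


def substring(s, start, length):
    """1-based substring, mimicking SQL substring(s, start, length)."""
    return s[start - 1 : start - 1 + length]


# With the JSON-text form, the leading quote occupies the first position.
quoted_result = substring(json_text, 1, 10)
# With the raw form, the first ten characters are the date.
raw_result = substring(raw, 1, 10)

print(quoted_result)
print(raw_result)
```

The first result starts with a double quote and loses the last digit of the date, which is the kind of downstream breakage the commit message reports.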

Add a dedicated `string` branch in `_write_data_to_column` that extracts the
raw, unescaped value via `value.get_string()`. Number, array, and object values
keep their JSON-text representation through `to_json_string`, as before.
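
The shape of this type dispatch can be sketched in Python with the standard `json` module. This is an analogy under stated assumptions, not the actual `_write_data_to_column` code: `extract_element` is a hypothetical name, and the compact separators used for non-string values are an assumption about formatting rather than something the PR specifies.

```python
import json


def extract_element(doc_text, key):
    """Return the value at `key` as text (hypothetical helper).

    Strings come back raw and unescaped; every other JSON type keeps
    its JSON-text form. A missing key raises KeyError in this sketch.
    """
    value = json.loads(doc_text)[key]
    if isinstance(value, str):
        # Dedicated string branch: no surrounding quotes, escapes resolved.
        return value
    # Numbers, arrays, objects (and booleans/null) stay as JSON text.
    return json.dumps(value, separators=(",", ":"))


print(extract_element('{"k": "2026-05-20 18:40:02"}', "k"))
print(extract_element('{"k": "a\\"b"}', "k"))
print(extract_element('{"k": [1, 2]}', "k"))
```

Only the string case changes behavior; the fallback path is the same call as before, which mirrors how the fix leaves non-string values untouched.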

Add BE unit test `extract_string_from_scalar_root` and regression assertions
covering string/substring/escaped/number/array/object extraction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@csun5285 csun5285 force-pushed the fix/CIR-20498-variant-element-quote branch from f5b09c1 to feaa5f6 Compare June 5, 2026 06:14
@csun5285
Contributor Author

csun5285 commented Jun 5, 2026

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 80.00% (4/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.78% (21059/39158)
Line Coverage 37.48% (200249/534266)
Region Coverage 33.51% (157032/468642)
Branch Coverage 34.55% (68715/198886)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 80.00% (4/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.07% (27571/38254)
Line Coverage 55.48% (294725/531221)
Region Coverage 52.34% (246428/470847)
Branch Coverage 53.43% (106418/199169)

@hello-stephen
Contributor

TPC-H: Total hot run time: 29557 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit feaa5f6a4251f53b0939b5df96afe5ecae5140f6, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17644	4172	4087	4087
q2
q3	10756	1439	805	805
q4	4686	478	344	344
q5	7562	889	589	589
q6	187	172	137	137
q7	774	822	648	648
q8	9389	1586	1584	1584
q9	5705	4543	4511	4511
q10	6776	1833	1575	1575
q11	440	271	260	260
q12	637	440	292	292
q13	18097	3316	2754	2754
q14	271	262	237	237
q15
q16	793	781	715	715
q17	928	887	941	887
q18	6766	5826	5640	5640
q19	1340	1266	1148	1148
q20	538	409	259	259
q21	6340	2791	2774	2774
q22	468	372	311	311
Total cold run time: 100097 ms
Total hot run time: 29557 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5233	4835	4881	4835
q2
q3	4790	5266	4848	4848
q4	2098	2209	1396	1396
q5	4845	4924	4669	4669
q6	240	182	139	139
q7	1862	1724	1635	1635
q8	2523	2129	2105	2105
q9	7988	7796	7423	7423
q10	4743	4687	4233	4233
q11	529	386	349	349
q12	730	739	535	535
q13	3050	3391	2794	2794
q14	267	279	263	263
q15
q16	676	708	605	605
q17	1289	1260	1255	1255
q18	7203	6868	6811	6811
q19	1155	1062	1110	1062
q20	2295	2207	1942	1942
q21	5285	4622	4508	4508
q22	531	446	411	411
Total cold run time: 57332 ms
Total hot run time: 51818 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169761 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit feaa5f6a4251f53b0939b5df96afe5ecae5140f6, data reload: false

query5	4333	638	485	485
query6	454	203	179	179
query7	4907	564	310	310
query8	377	218	212	212
query9	8758	4084	4066	4066
query10	449	329	261	261
query11	5944	2352	2185	2185
query12	152	106	101	101
query13	1324	615	422	422
query14	6403	5407	5075	5075
query14_1	4409	4403	4375	4375
query15	206	202	175	175
query16	1007	462	470	462
query17	1130	756	552	552
query18	2438	479	340	340
query19	202	183	145	145
query20	114	109	104	104
query21	212	135	117	117
query22	13768	13583	13420	13420
query23	17416	16467	16207	16207
query23_1	16234	16347	16334	16334
query24	7485	1770	1321	1321
query24_1	1291	1308	1326	1308
query25	557	445	386	386
query26	1317	338	169	169
query27	2652	580	341	341
query28	4459	2024	2022	2022
query29	1087	605	466	466
query30	304	235	203	203
query31	1128	1073	960	960
query32	111	66	61	61
query33	520	315	250	250
query34	1155	1139	639	639
query35	742	787	684	684
query36	1411	1431	1218	1218
query37	150	103	89	89
query38	3217	3133	3037	3037
query39	924	930	896	896
query39_1	887	894	871	871
query40	222	123	111	111
query41	67	65	63	63
query42	97	96	94	94
query43	316	329	284	284
query44	
query45	205	188	182	182
query46	1066	1183	746	746
query47	2376	2472	2251	2251
query48	391	417	321	321
query49	650	501	377	377
query50	996	346	267	267
query51	4307	4300	4255	4255
query52	93	92	79	79
query53	245	272	201	201
query54	299	234	233	233
query55	82	80	74	74
query56	253	234	250	234
query57	1413	1405	1325	1325
query58	256	228	221	221
query59	1601	1688	1554	1554
query60	307	271	247	247
query61	182	175	181	175
query62	707	662	603	603
query63	236	189	193	189
query64	2609	804	622	622
query65	
query66	1803	476	348	348
query67	29149	29763	29545	29545
query68	
query69	425	299	270	270
query70	979	973	948	948
query71	301	227	212	212
query72	3092	2649	2434	2434
query73	853	769	447	447
query74	5100	4953	4793	4793
query75	2692	2594	2247	2247
query76	2330	1180	764	764
query77	348	383	278	278
query78	12316	12290	11848	11848
query79	1295	1001	773	773
query80	522	482	410	410
query81	447	283	245	245
query82	241	157	134	134
query83	278	279	251	251
query84	255	139	113	113
query85	843	546	433	433
query86	334	301	260	260
query87	3334	3356	3165	3165
query88	3635	2769	2713	2713
query89	433	381	335	335
query90	2157	189	199	189
query91	175	163	137	137
query92	66	64	56	56
query93	1643	1554	845	845
query94	546	352	302	302
query95	685	480	336	336
query96	977	806	359	359
query97	2723	2722	2574	2574
query98	211	207	203	203
query99	1168	1172	1029	1029
Total cold run time: 250243 ms
Total hot run time: 169761 ms

@csun5285
Contributor Author

csun5285 commented Jun 5, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

Code review summary: no blocking issues found.

Critical checkpoint conclusions:

  • Goal/test proof: The PR fixes scalar-string-root variant extraction so JSON string values returned by element_at are raw/unescaped instead of JSON-quoted. The changed code accomplishes this by handling simdjson::ondemand::json_type::string separately, and coverage includes BE unit tests plus regression assertions and the updated SQL expected output.
  • Scope/focus: The implementation is small and focused; non-string scalar/object/array behavior remains on the existing to_json_string path.
  • Concurrency/lifecycle: No new shared state, locks, threads, static initialization, or lifecycle-sensitive ownership was introduced.
  • Configuration/compatibility: No new configuration, storage format, or FE-BE protocol compatibility concerns.
  • Parallel paths: The change targets the scalar-string-root simdjson document path. Structured subcolumn extraction is intentionally unchanged; the scalar-string path now matches its behavior for strings.
  • Conditional checks/error handling: The new type branch is specific to confirmed simdjson string values; existing parse/extract error behavior is preserved.
  • Test coverage/results: Added tests cover string, substring downstream use, escaped string content, number, array, and object cases. Existing CI reports compile, formatter, BE UT, P0 regression, and related checks passing.
  • Observability/transactions/data writes: No new observability need and no transaction, persistence, or data-write path changes.
  • Performance/memory: The change avoids the extra JSON-token representation for strings and copies directly into ColumnString; no significant new allocation or hot-path regression identified.

User focus: No additional user-provided review focus was supplied.

