
[fix](function) deduplicate map keys after string-to-map cast#63713

Open
Mryange wants to merge 2 commits into apache:master from Mryange:fix-map-deduplicate_keys

@Mryange
Contributor

Mryange commented May 27, 2026

What problem does this PR solve?

Problem Summary:
Casting a JSON string with duplicate object keys to MAP kept every duplicate entry, because the string-to-complex cast path returned the generic wrapper directly and skipped ColumnMap::deduplicate_keys(). This made string-to-map casts inconsistent with the MAP constructor, where the last value for a key wins.

Reproduction SQL:

SELECT CAST('{"a":1,"a":2}' AS MAP<STRING,INT>);
SELECT size(CAST('{"a":1,"a":2}' AS MAP<STRING,INT>));
SELECT element_at(CAST('{"a":1,"a":2}' AS MAP<STRING,INT>), 'a');

SELECT CAST('{"outer":{"a":1,"a":2}}' AS MAP<STRING, MAP<STRING, INT>>);
SELECT element_at(element_at(CAST('{"outer":{"a":1,"a":2}}' AS MAP<STRING, MAP<STRING, INT>>), 'outer'), 'a');

SELECT map('a',1,'a',2);
SELECT size(map('a',1,'a',2));
SELECT element_at(map('a',1,'a',2), 'a');

Before this fix:

{"a":1, "a":2}
2
1

{"outer":{"a":1, "a":2}}
1

{"a":2}
1
2

After this fix:

{"a":2}
1
2

{"outer":{"a":2}}
2

{"a":2}
1
2
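
The last-value-wins behavior shown in the outputs above can be sketched in isolation. This is a minimal illustration, not the Doris implementation: `Entries` and `deduplicate_last_wins` are hypothetical names, and keeping the surviving entry at the position of the key's first occurrence is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one MAP row: entries in the order they were parsed.
using Entries = std::vector<std::pair<std::string, int>>;

// Keep one entry per key; a later value overwrites an earlier one.
// Assumption: the surviving entry stays where the key first appeared.
Entries deduplicate_last_wins(const Entries& input) {
    Entries result;
    std::unordered_map<std::string, std::size_t> position;  // key -> index in result
    for (const auto& entry : input) {
        auto it = position.find(entry.first);
        if (it == position.end()) {
            position.emplace(entry.first, result.size());
            result.push_back(entry);
        } else {
            result[it->second].second = entry.second;  // last value wins
        }
    }
    return result;
}
```

Applied to the entries parsed from `{"a":1,"a":2}`, this yields a single entry for `a` holding the later value, matching the `map('a',1,'a',2)` result.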

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Mryange
Contributor Author

Mryange commented May 27, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

I found one correctness issue in the string-to-map cast normalization. The PR fixes top-level duplicate keys, but nested MAP values parsed from the same string still keep duplicate keys, so the behavior remains inconsistent for nested MAP types.

Critical checkpoint conclusions:

  • Goal/test: The PR partially accomplishes the goal and tests the top-level case, but misses nested MAP values produced by the same string deserialization path.
  • Scope: The change is focused and small.
  • Concurrency/lifecycle: No new concurrency or special lifecycle concerns found.
  • Config/compatibility/protocol: No new config, persisted format, or FE-BE protocol compatibility concerns found.
  • Parallel paths: Table ingestion already uses recursive map deduplication; this cast path should match it for nested maps.
  • Tests: Added regression coverage is useful but incomplete for nested MAP duplicate-key behavior.
  • Observability/transactions/data writes: Not applicable to this expression-only change.
  • Performance: No material performance issue found beyond the existing dedup cost.

Focus points: No additional user-provided review focus was present.

result_column_name);
}

RETURN_IF_ERROR(map_column->deduplicate_keys());

This only deduplicates the outer ColumnMap. DataTypeMapSerDe::_from_string() recursively deserializes MAP values but does not deduplicate them, so a cast like CAST('{"outer":{"a":1,"a":2}}' AS MAP<STRING, MAP<STRING, INT>>) will still leave duplicate keys in the nested map and element_at(..., 'a') can keep returning the first value there. That means the PR fixes the top-level repro but not the same inconsistency for nested MAP types. Please call deduplicate_keys(true) here (and in the non-nullable branch) and add a nested MAP regression case.

Suggested change
- RETURN_IF_ERROR(map_column->deduplicate_keys());
+ RETURN_IF_ERROR(map_column->deduplicate_keys(true));
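
To illustrate why the outer-only call misses nested maps, here is a small self-contained sketch. It is not the ColumnMap code: `Entry` and `deduplicate` are hypothetical, and the `recursive` flag only mirrors by analogy what the boolean argument in the suggestion appears to request.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical map entry: either a scalar value or a nested map.
struct Entry {
    std::string key;
    int scalar = 0;             // used when is_map is false
    bool is_map = false;
    std::vector<Entry> nested;  // used when is_map is true
};

// Last-value-wins deduplication. With recursive == false only this level is
// cleaned; with recursive == true nested maps are cleaned as well.
std::vector<Entry> deduplicate(const std::vector<Entry>& input, bool recursive) {
    std::vector<Entry> result;
    std::unordered_map<std::string, std::size_t> position;  // key -> index in result
    for (const Entry& entry : input) {
        Entry kept = entry;
        if (recursive && kept.is_map) {
            kept.nested = deduplicate(kept.nested, true);
        }
        auto it = position.find(kept.key);
        if (it == position.end()) {
            position.emplace(kept.key, result.size());
            result.push_back(std::move(kept));
        } else {
            result[it->second] = std::move(kept);  // later entry replaces earlier
        }
    }
    return result;
}
```

With a row shaped like `{"outer":{"a":1,"a":2}}`, the non-recursive call leaves both inner `a` entries in place, while the recursive call keeps only the later one.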

@Mryange
Contributor Author

Mryange commented May 27, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

Review result: no additional blocking issues found in the latest PR head. The previously raised nested MAP concern is addressed by using deduplicate_keys(true) and adding the nested MAP regression case.

Critical checkpoint conclusions:

  • Goal/test: The change targets duplicated keys when casting JSON strings to MAP. The implementation now deduplicates string-to-map cast results, including nested MAP values, and adds top-level and nested regression coverage.
  • Scope: The code change is small and limited to the string-to-map wrapper plus regression expectations.
  • Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, configuration, storage-format, or FE-BE protocol concerns identified.
  • Parallel paths: The changed path is specifically string-to-map casts; map-to-map behavior was already deduplicating keys separately. Array/struct casts are unaffected.
  • Error handling: The wrapped cast Status is propagated, and deduplication Status is checked.
  • Tests: Added deterministic single-row qt_sql cases for cast output, size, element_at, and nested MAP behavior. I did not run the regression suite in this review environment.
  • Observability/performance: No new observability needed. Deduplication adds expected work only on string-to-map cast output.

User focus: No additional user-provided review focus was specified.
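
The error-handling point above (the cast status is propagated, then the deduplication status is checked) can be sketched as follows. Everything here is a hypothetical stand-in: `Status`, `SKETCH_RETURN_IF_ERROR`, and `cast_then_deduplicate` are illustrative names, not the Doris definitions.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical minimal status type for this sketch.
struct Status {
    bool ok;
    std::string message;
};

inline Status ok_status() { return Status{true, ""}; }
inline Status error_status(const std::string& msg) { return Status{false, msg}; }

// Illustrative early-return macro in the spirit of RETURN_IF_ERROR.
#define SKETCH_RETURN_IF_ERROR(expr) \
    do {                             \
        Status _st = (expr);         \
        if (!_st.ok) return _st;     \
    } while (false)

using Step = std::function<Status()>;

// Run the wrapped cast first; deduplicate only if the cast succeeded.
Status cast_then_deduplicate(const Step& cast, const Step& deduplicate) {
    SKETCH_RETURN_IF_ERROR(cast());
    SKETCH_RETURN_IF_ERROR(deduplicate());
    return ok_status();
}
```

In this shape a failed cast returns its own error unchanged and the deduplication step never runs, so a deduplication failure cannot mask the original cast error.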

@Mryange
Contributor Author

Mryange commented May 27, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31406 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 295881b5ff50e0441459735ffec0e77394afbbdc, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17765	4041	3987	3987
q2	q3	10816	1358	839	839
q4	4682	476	347	347
q5	7549	2233	2087	2087
q6	234	173	139	139
q7	913	790	633	633
q8	9420	1754	1672	1672
q9	6330	4980	4939	4939
q10	6442	2226	1885	1885
q11	437	270	243	243
q12	698	432	292	292
q13	18316	3422	2748	2748
q14	270	260	239	239
q15	q16	820	771	713	713
q17	950	965	857	857
q18	6965	5817	5616	5616
q19	1280	1410	1074	1074
q20	537	481	284	284
q21	5967	2709	2499	2499
q22	471	378	313	313
Total cold run time: 100862 ms
Total hot run time: 31406 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4808	4709	4797	4709
q2	q3	4940	5344	4690	4690
q4	2108	2162	1395	1395
q5	4888	4719	4693	4693
q6	247	187	132	132
q7	1881	1752	1595	1595
q8	2260	1921	1912	1912
q9	7387	7440	7340	7340
q10	4705	4643	4209	4209
q11	540	389	351	351
q12	742	733	527	527
q13	3057	3357	2771	2771
q14	283	273	250	250
q15	q16	674	700	608	608
q17	1267	1245	1248	1245
q18	7430	6965	6936	6936
q19	1142	1079	1083	1079
q20	2223	2225	1941	1941
q21	5225	4578	4369	4369
q22	530	454	397	397
Total cold run time: 56337 ms
Total hot run time: 51149 ms
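
The reported totals in both rounds are consistent with summing the first timing column for the cold total and the last column (the smaller of the two middle columns) for the hot total. The bot output does not label its columns, so this reading is an assumption; `QueryTiming` and the helper names below are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumed layout of one row: one cold run followed by two hot runs, in ms.
struct QueryTiming {
    long cold_ms;
    long hot1_ms;
    long hot2_ms;
};

// Cold total: sum of the first timing column.
long total_cold_ms(const std::vector<QueryTiming>& rows) {
    long total = 0;
    for (const QueryTiming& row : rows) total += row.cold_ms;
    return total;
}

// Hot total: sum of the better (smaller) of the two hot runs per query.
long total_hot_ms(const std::vector<QueryTiming>& rows) {
    long total = 0;
    for (const QueryTiming& row : rows) total += std::min(row.hot1_ms, row.hot2_ms);
    return total;
}
```

For example, the Round 1 row `q17	950	965	857	857` contributes 950 ms to the cold total and 857 ms to the hot total under this reading.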

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171888 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 295881b5ff50e0441459735ffec0e77394afbbdc, data reload: false

query5	4314	661	522	522
query6	328	221	198	198
query7	4255	586	289	289
query8	322	237	221	221
query9	8790	4016	4013	4013
query10	443	361	302	302
query11	5741	2535	2283	2283
query12	180	124	128	124
query13	1275	631	430	430
query14	6102	5455	5169	5169
query14_1	4506	4486	4483	4483
query15	215	201	182	182
query16	1013	451	432	432
query17	1127	751	606	606
query18	2587	477	358	358
query19	211	202	160	160
query20	139	132	126	126
query21	216	135	113	113
query22	13628	13631	13470	13470
query23	17352	16551	16175	16175
query23_1	16300	16283	16419	16283
query24	7459	1831	1346	1346
query24_1	1322	1317	1318	1317
query25	602	503	446	446
query26	1305	321	173	173
query27	2701	572	353	353
query28	4463	1980	2003	1980
query29	1016	623	490	490
query30	304	232	203	203
query31	1131	1090	956	956
query32	90	79	75	75
query33	540	355	302	302
query34	1182	1186	676	676
query35	787	792	705	705
query36	1400	1403	1236	1236
query37	167	104	95	95
query38	3201	3179	3121	3121
query39	925	908	922	908
query39_1	903	880	886	880
query40	236	149	123	123
query41	66	61	62	61
query42	114	112	111	111
query43	347	336	294	294
query44	
query45	217	201	202	201
query46	1106	1215	757	757
query47	2400	2379	2259	2259
query48	401	440	311	311
query49	630	497	380	380
query50	1045	357	251	251
query51	4411	4322	4336	4322
query52	106	105	95	95
query53	271	280	202	202
query54	312	272	274	272
query55	96	90	89	89
query56	298	317	303	303
query57	1437	1424	1337	1337
query58	303	271	272	271
query59	1673	1680	1460	1460
query60	316	329	314	314
query61	167	159	156	156
query62	696	649	563	563
query63	247	201	201	201
query64	2494	864	693	693
query65	
query66	1713	491	373	373
query67	29818	29804	28916	28916
query68	
query69	477	350	329	329
query70	1063	1061	1008	1008
query71	322	278	271	271
query72	3100	2723	2449	2449
query73	892	783	415	415
query74	5109	4934	4774	4774
query75	2662	2628	2290	2290
query76	2285	1152	805	805
query77	404	418	337	337
query78	12474	12390	11897	11897
query79	1468	1022	772	772
query80	826	534	447	447
query81	494	306	240	240
query82	1350	158	123	123
query83	357	283	250	250
query84	261	141	110	110
query85	935	560	449	449
query86	452	365	311	311
query87	3473	3425	3224	3224
query88	3657	2748	2750	2748
query89	463	392	342	342
query90	1800	196	194	194
query91	184	169	141	141
query92	80	87	77	77
query93	1533	1482	893	893
query94	631	361	319	319
query95	663	476	342	342
query96	1091	824	339	339
query97	2743	2730	2565	2565
query98	235	229	225	225
query99	1208	1152	1036	1036
Total cold run time: 255194 ms
Total hot run time: 171888 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 72.22% (26/36) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.86% (20914/38831)
Line Coverage 37.41% (197966/529242)
Region Coverage 33.70% (155142/460362)
Branch Coverage 34.68% (67501/194634)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 72.22% (26/36) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.82% (28075/38031)
Line Coverage 57.77% (304934/527886)
Region Coverage 54.90% (255170/464785)
Branch Coverage 56.45% (110284/195360)

@Mryange
Contributor Author

Mryange commented May 27, 2026

run p0

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 72.22% (26/36) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.82% (28075/38031)
Line Coverage 57.77% (304945/527886)
Region Coverage 54.90% (255177/464785)
Branch Coverage 56.46% (110292/195360)
