[fix](function) allow count distinct on array #63888

Open
Mryange wants to merge 2 commits into apache:master from Mryange:allow-const-distinct-on-array

@Mryange
Contributor

Mryange commented May 29, 2026

What problem does this PR solve?

Problem Summary:
Before this change, count(distinct array_col) was rejected in the FE with the error "COUNT DISTINCT could not process type", even though the exact-distinct backend path already supported array values.

Previous behavior:

select count(distinct arr)
from (
	select [1, 2] as arr
	union all
	select [1, 2]
	union all
	select [2, 1]
) t;

COUNT DISTINCT could not process type count(DISTINCT arr)

After this change, count(distinct array_col) is accepted and returns the correct distinct array count.

Current behavior:

select count(distinct arr)
from (
	select [1, 2] as arr
	union all
	select [1, 2]
	union all
	select [2, 1]
) t;

2
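
The result of 2 follows from comparing arrays element by element and in order: the two [1, 2] rows collapse into one value, while [2, 1] stays separate. A minimal Java sketch of that counting rule, using java.util.List equality as a stand-in for array equality. The class and method names are hypothetical and this is not Doris code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctArrayCount {
    // Counts distinct arrays. List.equals is order-sensitive,
    // so [1, 2] and [2, 1] are treated as different values.
    static int countDistinct(List<List<Integer>> rows) {
        Set<List<Integer>> seen = new HashSet<>();
        for (List<Integer> row : rows) {
            seen.add(row);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Same three rows as the query in the PR description.
        List<List<Integer>> rows = Arrays.asList(
                Arrays.asList(1, 2),
                Arrays.asList(1, 2),
                Arrays.asList(2, 1));
        System.out.println(countDistinct(rows)); // prints 2
    }
}
```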

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Mryange
Contributor Author

Mryange commented May 29, 2026

/review

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Contributor

@github-actions github-actions Bot left a comment

Review Summary

No blocking issues found. The change is small and focused: it narrows the count(distinct ...) legality check so array arguments are accepted while map/struct/object/json top-level types remain rejected. This matches the stated goal and the exact distinct path can use serialized keys for array values.

Critical checkpoint conclusions:

  • Goal and proof: The PR enables count(distinct array) and adds both a Nereids legality unit test and a regression case covering duplicate arrays and a NULL array input.
  • Scope: The implementation is minimal and localized to the existing Count legality gate.
  • Concurrency/lifecycle: Not applicable; no shared state, locking, threads, or lifecycle changes.
  • Configuration/compatibility/protocol: Not applicable; no configs, storage format, thrift, or FE-BE protocol changes.
  • Parallel paths: The relevant Nereids after-rewrite legality path is updated. Bitmap/HLL rewrite behavior remains covered by the existing test.
  • Tests: Regression output is deterministic because each query returns one aggregate row. The new .out file has the generated-file header. No additional user focus points were provided.
  • Observability/transactions/persistence/data writes: Not applicable.
  • Performance: No hot-path runtime overhead beyond the existing type check.

Residual risk: I did not run the test suite in this review environment; assessment is based on code inspection and PR diff review.
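
The review summary mentions that the exact distinct path can use serialized keys for array values, and that the regression case includes a NULL array input. How the Doris backend actually encodes its keys is not shown in this PR, so the following is only a rough Java illustration of the general idea: each array is turned into an unambiguous key, NULL rows are skipped (as COUNT(DISTINCT ...) does in standard SQL), and the keys are deduplicated in a hash set. The encoding and all names here are assumptions made for the example:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SerializedKeyDistinct {
    // Encodes an array as a length-prefixed string so that two different
    // arrays never produce the same key (hypothetical encoding).
    static String serialize(List<Integer> array) {
        StringBuilder key = new StringBuilder();
        key.append(array.size()).append(':');
        for (Integer element : array) {
            // A NULL element gets its own marker so [NULL] and [0] stay distinct.
            key.append(element == null ? "N" : "V" + element).append(',');
        }
        return key.toString();
    }

    // Counts distinct non-NULL arrays by deduplicating their serialized keys.
    static int countDistinct(List<List<Integer>> rows) {
        Set<String> keys = new HashSet<>();
        for (List<Integer> row : rows) {
            if (row == null) {
                continue; // standard SQL: COUNT(DISTINCT x) ignores NULL inputs
            }
            keys.add(serialize(row));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        List<List<Integer>> rows = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(1, 2), Arrays.asList(2, 1), null);
        System.out.println(countDistinct(rows)); // prints 2
    }
}
```

The length prefix and per-element markers matter only for keeping the key unambiguous; any injective encoding would serve the same purpose.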

@Mryange
Contributor Author

Mryange commented May 29, 2026

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

TPC-H: Total hot run time: 31682 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 50d7e55effb2cfc16678ceb966fc548e8062b639, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17910	4021	3984	3984
q2
q3	10851	1408	823	823
q4	4686	473	355	355
q5	7560	2289	2128	2128
q6	245	182	139	139
q7	962	767	650	650
q8	9423	1716	1571	1571
q9	5215	5058	5046	5046
q10	6401	2217	1916	1916
q11	446	282	245	245
q12	630	439	295	295
q13	18086	3385	2810	2810
q14	267	262	242	242
q15
q16	814	772	717	717
q17	959	963	988	963
q18	6984	5772	5620	5620
q19	1327	1218	1093	1093
q20	508	402	267	267
q21	6011	2667	2511	2511
q22	430	356	307	307
Total cold run time: 99715 ms
Total hot run time: 31682 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4371	4412	4290	4290
q2
q3	4581	4966	4409	4409
q4	2073	2192	1380	1380
q5	4446	4309	4521	4309
q6	297	216	149	149
q7	2163	1935	1674	1674
q8	2511	2193	2190	2190
q9	8313	8050	8070	8050
q10	4913	4782	4349	4349
q11	573	418	394	394
q12	746	758	532	532
q13	3241	3606	2935	2935
q14	282	293	276	276
q15
q16	720	738	645	645
q17	1492	1332	1333	1332
q18	7869	7299	7277	7277
q19	1133	1090	1077	1077
q20	2234	2210	1937	1937
q21	5266	4528	4459	4459
q22	511	453	396	396
Total cold run time: 57735 ms
Total hot run time: 52060 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171557 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 50d7e55effb2cfc16678ceb966fc548e8062b639, data reload: false

query5	4312	653	504	504
query6	331	218	209	209
query7	4294	541	322	322
query8	339	225	218	218
query9	8790	4046	3986	3986
query10	455	340	291	291
query11	5791	2414	2200	2200
query12	182	128	127	127
query13	1316	633	439	439
query14	6154	5492	5160	5160
query14_1	4516	4477	4484	4477
query15	225	209	187	187
query16	1002	454	448	448
query17	1156	753	615	615
query18	2474	498	375	375
query19	224	205	178	178
query20	153	134	133	133
query21	221	137	120	120
query22	13692	13742	13450	13450
query23	17487	16589	16232	16232
query23_1	16312	16329	16443	16329
query24	7460	1780	1333	1333
query24_1	1333	1323	1343	1323
query25	603	519	447	447
query26	1321	317	179	179
query27	2686	570	344	344
query28	4417	2011	2011	2011
query29	1027	639	510	510
query30	310	237	202	202
query31	1183	1095	950	950
query32	101	79	75	75
query33	553	361	324	324
query34	1196	1185	661	661
query35	784	792	700	700
query36	1412	1368	1253	1253
query37	156	101	101	101
query38	3199	3190	3058	3058
query39	946	924	894	894
query39_1	881	872	876	872
query40	232	145	123	123
query41	63	62	63	62
query42	112	108	104	104
query43	322	329	284	284
query44	
query45	223	203	198	198
query46	1065	1212	744	744
query47	2375	2402	2265	2265
query48	400	412	299	299
query49	631	510	386	386
query50	989	353	268	268
query51	4411	4274	4335	4274
query52	106	103	94	94
query53	247	281	207	207
query54	309	307	246	246
query55	99	91	86	86
query56	305	298	292	292
query57	1448	1417	1353	1353
query58	295	274	260	260
query59	1544	1612	1438	1438
query60	334	320	314	314
query61	157	153	159	153
query62	684	654	598	598
query63	246	206	204	204
query64	2427	807	638	638
query65	
query66	1723	481	359	359
query67	29751	29648	29429	29429
query68	
query69	473	338	306	306
query70	1029	980	1005	980
query71	319	270	255	255
query72	3043	2725	2395	2395
query73	841	758	419	419
query74	5128	4981	4805	4805
query75	2716	2572	2270	2270
query76	2284	1129	745	745
query77	396	407	333	333
query78	12403	12507	11835	11835
query79	1491	1024	804	804
query80	666	546	454	454
query81	452	277	251	251
query82	1794	156	120	120
query83	349	272	248	248
query84	259	141	111	111
query85	871	538	455	455
query86	387	348	347	347
query87	3438	3377	3239	3239
query88	3620	2767	2731	2731
query89	440	392	341	341
query90	2046	184	182	182
query91	175	171	136	136
query92	82	78	72	72
query93	1418	1485	870	870
query94	542	336	300	300
query95	685	401	346	346
query96	1084	750	346	346
query97	2758	2745	2645	2645
query98	229	227	225	225
query99	1173	1146	1035	1035
Total cold run time: 254724 ms
Total hot run time: 171557 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/2) 🎉
Increment coverage report
Complete coverage report

@Mryange
Contributor Author

Mryange commented May 29, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

Summary:
This PR fixes the intended scalar/primitive array case, but the legality check is now too broad: it allows any top-level ARRAY, including arrays whose nested element type is still an unsupported complex type. That changes previously rejected shapes into executable plans without proving correct distinct semantics for the nested MAP/STRUCT/JSON/VARIANT cases.

Critical checkpoint conclusions:

  • Goal/test proof: The intended goal, count(distinct array) support, is covered by unit and regression tests; nested unsupported array element types are not covered.
  • Scope/focus: The implementation is small, but the predicate is broader than the stated behavior.
  • Concurrency/lifecycle/config/compatibility: Not applicable; this is FE analysis and test coverage only.
  • Parallel paths: Nereids COUNT DISTINCT legality is changed; unsupported MAP/STRUCT paths remain rejected only when they are top-level arguments.
  • Conditional checks: The special array exception needs recursive element-type validation, not only a top-level type check.
  • Tests/results: Existing new results are deterministic for scalar aggregate outputs, but missing negative tests for arrays wrapping unsupported complex types.
  • Observability/transactions/data writes/FE-BE variable passing: Not applicable.
  • Performance: No performance concern found.
  • User focus: No additional user-provided review focus was specified.

  // after rewrite, count(distinct bitmap_column) should be rewritten to bitmap_union_count(bitmap_column)
  for (Expression argument : getArguments()) {
-     if (distinct && (argument.getDataType().isComplexType()
+     if (distinct && ((argument.getDataType().isComplexType() && !argument.getDataType().isArrayType())
Contributor

This now exempts every top-level ARRAY from the complex-type rejection, so unsupported nested shapes like count(distinct array(map(...))), count(distinct array(named_struct(...))), or arrays nested around those types bypass the same MAP/STRUCT rejection that this PR keeps for top-level m and s. Those types still lack an explicit COUNT DISTINCT contract here, and the new tests only cover array<int>. Please recursively inspect ArrayType.getItemType() and only allow the array element shapes that the backend distinct path actually supports, with negative coverage for arrays wrapping unsupported complex types.
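
The recursive check requested above could look roughly like the following. This is a self-contained sketch over a hypothetical miniature type model (Kind, Type, isDistinctCountable are invented for illustration), not the Doris DataType or ArrayType classes. It also assumes that nested arrays are acceptable whenever their innermost element type is, which the backend would need to confirm:

```java
public class ArrayDistinctLegality {
    // Hypothetical stand-in for the FE type system; not the Doris DataType hierarchy.
    enum Kind { INT, STRING, ARRAY, MAP, STRUCT, JSON, VARIANT }

    static final class Type {
        final Kind kind;
        final Type itemType; // element type, only set when kind == ARRAY

        private Type(Kind kind, Type itemType) {
            this.kind = kind;
            this.itemType = itemType;
        }

        static Type scalar(Kind kind) {
            return new Type(kind, null);
        }

        static Type arrayOf(Type itemType) {
            return new Type(Kind.ARRAY, itemType);
        }
    }

    // Returns true if the type is allowed as a COUNT(DISTINCT ...) argument
    // under the assumed rule: arrays are allowed only if their element type is.
    static boolean isDistinctCountable(Type type) {
        switch (type.kind) {
            case ARRAY:
                return isDistinctCountable(type.itemType); // recurse into the element type
            case MAP:
            case STRUCT:
            case JSON:
            case VARIANT:
                return false; // still rejected, at any nesting depth
            default:
                return true; // scalar types (assumed supported in this sketch)
        }
    }

    public static void main(String[] args) {
        Type arrayOfInt = Type.arrayOf(Type.scalar(Kind.INT));
        Type arrayOfMap = Type.arrayOf(Type.scalar(Kind.MAP));
        System.out.println(isDistinctCountable(arrayOfInt)); // prints true
        System.out.println(isDistinctCountable(arrayOfMap)); // prints false
    }
}
```

The array<map> and array<array<struct>> shapes that return false here are the kind of negative cases the comment asks to cover in tests.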
