[fix](function) allow count distinct on array#63888
Conversation
|
/review |
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Review Summary
No blocking issues found. The change is small and focused: it narrows the count(distinct ...) legality check so array arguments are accepted while map/struct/object/json top-level types remain rejected. This matches the stated goal and the exact distinct path can use serialized keys for array values.
Critical checkpoint conclusions:
- Goal and proof: The PR enables
count(distinct array)and adds both a Nereids legality unit test and a regression case covering duplicate arrays and a NULL array input. - Scope: The implementation is minimal and localized to the existing Count legality gate.
- Concurrency/lifecycle: Not applicable; no shared state, locking, threads, or lifecycle changes.
- Configuration/compatibility/protocol: Not applicable; no configs, storage format, thrift, or FE-BE protocol changes.
- Parallel paths: The relevant Nereids after-rewrite legality path is updated. Bitmap/HLL rewrite behavior remains covered by the existing test.
- Tests: Regression output is deterministic because each query returns one aggregate row. The new
.outfile has the generated-file header. No additional user focus points were provided. - Observability/transactions/persistence/data writes: Not applicable.
- Performance: No hot-path runtime overhead beyond the existing type check.
Residual risk: I did not run the test suite in this review environment; assessment is based on code inspection and PR diff review.
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 31682 ms |
TPC-DS: Total hot run time: 171557 ms |
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
Summary:
This PR fixes the intended scalar/primitive array case, but the legality check is now too broad: it allows any top-level ARRAY, including arrays whose nested element type is still an unsupported complex type. That changes previously rejected shapes into executable plans without proving correct distinct semantics for the nested MAP/STRUCT/JSON/VARIANT cases.
Critical checkpoint conclusions:
- Goal/test proof: The intended goal, count(distinct array) support, is covered by unit and regression tests; nested unsupported array element types are not covered.
- Scope/focus: The implementation is small, but the predicate is broader than the stated behavior.
- Concurrency/lifecycle/config/compatibility: Not applicable; this is FE analysis and test coverage only.
- Parallel paths: Nereids COUNT DISTINCT legality is changed; unsupported MAP/STRUCT paths remain rejected only when they are top-level arguments.
- Conditional checks: The special array exception needs recursive element-type validation, not only a top-level type check.
- Tests/results: Existing new results are deterministic for scalar aggregate outputs, but missing negative tests for arrays wrapping unsupported complex types.
- Observability/transactions/data writes/FE-BE variable passing: Not applicable.
- Performance: No performance concern found.
- User focus: No additional user-provided review focus was specified.
| // after rewrite, count(distinct bitmap_column) should be rewritten to bitmap_union_count(bitmap_column) | ||
| for (Expression argument : getArguments()) { | ||
| if (distinct && (argument.getDataType().isComplexType() | ||
| if (distinct && ((argument.getDataType().isComplexType() && !argument.getDataType().isArrayType()) |
There was a problem hiding this comment.
This now exempts every top-level ARRAY from the complex-type rejection, so unsupported nested shapes like count(distinct array(map(...))), count(distinct array(named_struct(...))), or arrays nested around those types bypass the same MAP/STRUCT rejection that this PR keeps for top-level m and s. Those types still lack an explicit COUNT DISTINCT contract here, and the new tests only cover array<int>. Please recursively inspect ArrayType.getItemType() and only allow the array element shapes that the backend distinct path actually supports, with negative coverage for arrays wrapping unsupported complex types.
What problem does this PR solve?
Problem Summary:
Before this change,
count(distinct array_col)was rejected in FE withCOUNT DISTINCT could not process type, even though the exact distinct backend path already supported array values.Previous behavior:
After this change,
count(distinct array_col)is accepted and returns the correct distinct array count.Current behavior:
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)