[SPARK-56944] [SQL] Trim aliases from ListAgg expression subtree#55984
[SPARK-56944] [SQL] Trim aliases from ListAgg expression subtree#55984mihailoale-db wants to merge 1 commit into
Conversation
|
@cloud-fan PTAL when you find time. Thanks! |
cloud-fan
left a comment
There was a problem hiding this comment.
Thanks Mihailo, the fix is reasonable and follows the pattern from SPARK-54871.
One question about completeness, and one about test coverage — see inline. Otherwise the change is straightforward.
|
@cloud-fan Addressed your comments, PTAL when you find time. Thanks |
cloud-fan
left a comment
There was a problem hiding this comment.
Round 2: 1 addressed (checkOrderValueDeterminism now trims aliases — thanks), 1 new (see inline on the new line-78 test). The round-1 concern about SQLQueryTestSuite running only the legacy analyzer still stands technically, but I'll defer to your judgement on the golden-files-only approach.
| SELECT grp, listagg(DISTINCT v:a::string, ',') WITHIN GROUP (ORDER BY v:a::string) FROM (SELECT 1 grp, parse_json('{"a": "x"}') v UNION ALL SELECT 1, parse_json('{"a": "y"}') UNION ALL SELECT 2, parse_json('{"a": "x"}') UNION ALL SELECT 2, parse_json('{"a": "x"}') UNION ALL SELECT 1, parse_json('{"a": "x"}')) GROUP BY grp; | ||
| -- Semi-structured extract: DISTINCT cast with non-equality-preserving order (variant) | ||
| -- Tests that checkOrderValueDeterminism strips Alias wrappers before comparing via semanticEquals | ||
| SELECT listagg(DISTINCT v:a::string, ',') WITHIN GROUP (ORDER BY v:a) FROM (SELECT parse_json('{"a": "x"}') v UNION ALL SELECT parse_json('{"a": "y"}') UNION ALL SELECT parse_json('{"a": "x"}')); |
There was a problem hiding this comment.
The comment above says this exercises checkOrderValueDeterminism's alias stripping, but ORDER BY v:a is VARIANT and fails with DATATYPE_MISMATCH.INVALID_ORDERING_TYPE from SortOrder.checkInputDataTypes before listagg's checks run — see this query's analyzer-results golden output: only that error is raised, no functionAndOrderExpressionUnsafeCastError. The test would pass identically with or without the new trimAliases you added at line 684.
To actually hit the Cast(castChild, ...) arm in checkOrderValueDeterminism, the order column needs to be orderable but non-equality-preserving when cast to String, e.g. LISTAGG(DISTINCT (v:a)::double::string, ',') WITHIN GROUP (ORDER BY (v:a)::double). Without the trim, the single-pass aliases inside castChild and orderExpressions.head.child mismatch and you'd get NonDeterministicMismatch → functionAndOrderExpressionMismatchError; with the trim, they match and you get NonDeterministicCast(Double, String) → functionAndOrderExpressionUnsafeCastError. That divergence is what pins the fix.
What changes were proposed in this pull request?
In this PR I propose to trim aliases from
ListAggexpression subtree in order to fix a discrepancy between single-pass and fixed-point analyzers.It is a safe change since it would otherwise be removed in
CleanupAliases.Why are the changes needed?
To fix a dual-run issue.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added + existing tests.
Was this patch authored or co-authored using generative AI tooling?
Yes.