[SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING #44352
Conversation
// a table `t` has columns `c1` and `c2`, for query `SELECT ... FROM t GROUP BY c1 HAVING c2 = 0`,
// even though we can resolve column `c2` here, we should undo it and fail with
// "Column c2 not found".
protected def resolveColWithAgg(e: Expression, plan: LogicalPlan): Expression = plan match {
  case Filter(_, agg: Aggregate) => resolveColWithAgg(e, agg)
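The invariant described in the comment above can be sketched in plain Python. This is an illustrative model only, not Spark's actual resolver; the function name and parameters are made up, and the real code tentatively wraps the column in a `TempResolvedColumn` and undoes the resolution later:

```python
# Illustrative model (not Spark code): a bare non-grouping column in HAVING
# must fail with "Column c2 not found" even though it is resolvable against
# the Aggregate's child output.

def resolve_having_column(col, grouping_cols, child_cols, in_agg_function=False):
    """Return 'resolved' if `col` is a legal HAVING reference, else raise."""
    if col not in child_cols:
        raise ValueError(f"Column {col} not found")
    # A bare column is only legal if it is a grouping column; a non-grouping
    # column is legal only inside an aggregate function such as sum(c2).
    if col in grouping_cols or in_agg_function:
        return "resolved"
    # Undo the tentative resolution and surface the user-facing error.
    raise ValueError(f"Column {col} not found")

print(resolve_having_column("c1", {"c1"}, {"c1", "c2"}))        # resolved
print(resolve_having_column("c2", {"c1"}, {"c1", "c2"}, True))  # resolved
```

With `GROUP BY c1`, a bare `c2` in HAVING takes the final `raise` branch, which is the "undo it and fail" behavior the comment describes.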
Can we move this into ResolveReferencesInSort? I think it is clearer to make the call side resolve the Aggregate through the Filter.
Thanks, changed as suggested
@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
SELECT udf(b), udf(c) FROM test_having
GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
-- !query analysis
Project [udf(b)#x, udf(c)#x]
do you know why the plan is changed?
I think the previous resolution matches item 4 of the ResolveReferencesInSort comments:

- Resolves the column to [[AttributeReference]] with the output of a descendant plan node.
  Spark will propagate the missing attributes from the descendant plan node to the Sort node.
  This is to allow users to ORDER BY columns that are not in the SELECT clause, which is
  widely supported in other SQL dialects. For example, `SELECT a FROM t ORDER BY b`.
With this patch, it should match item 3:

- If the child plan is Aggregate or Filter(_, Aggregate), resolves the column to
  [[TempResolvedColumn]] with the output of Aggregate's child plan.
  This is to allow Sort to host grouping expressions and aggregate functions, which can
  be pushed down to the Aggregate later. For example,
  `SELECT max(a) FROM t GROUP BY b HAVING max(a) > 1 ORDER BY min(a)`.
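The push-down in item 3 can be sketched as a toy model. This is not Catalyst; the function name is made up, and real expressions are trees rather than strings. The idea: an ORDER BY expression not already produced by the Aggregate gets hosted in the Aggregate's output, and a Project on top then prunes it back out:

```python
# Toy sketch (not Catalyst): push ORDER BY expressions that are missing from
# the Aggregate's output down into the Aggregate so Sort can reference them.

def push_order_by_into_aggregate(order_by, agg_exprs):
    """Return (new_agg_exprs, needs_pruning_project)."""
    new_agg = list(agg_exprs)
    pushed = False
    for expr in order_by:
        if expr not in new_agg:
            new_agg.append(expr)  # host the expression in the Aggregate
            pushed = True
    # If anything was pushed down, a Project on top restores the SELECT list.
    return new_agg, pushed

agg, needs_project = push_order_by_into_aggregate(
    order_by=["min(a)"], agg_exprs=["max(a)"])
print(agg)            # ['max(a)', 'min(a)']
print(needs_project)  # True
```

For `SELECT max(a) FROM t GROUP BY b ORDER BY min(a)`, `min(a)` is appended to the Aggregate and later pruned by the top-level Project.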
oh so the plan is actually more efficient now?
I think so, the plan shows it eliminates some unnecessary column propagation across operators.
That is the analyzed plan; is the optimized plan the same with this PR?
both Analyzed plan and Optimized plan are changed :)
before:

```
== Analyzed Logical Plan ==
udf(b): int, udf(c): double
Project [udf(b)#24, udf(c)#25]
+- Sort [udf(b#21) ASC NULLS FIRST, udf(cast(c#22 as double)) ASC NULLS FIRST], true
   +- Filter (udf(b)#24 = 3)
      +- Aggregate [b#21, c#22], [udf(b#21) AS udf(b)#24, udf(cast(c#22 as double)) AS udf(c)#25, b#21, c#22]
         +- SubqueryAlias spark_catalog.default.test_having
            +- Relation spark_catalog.default.test_having[a#20,b#21,c#22,d#23] parquet

== Optimized Logical Plan ==
Project [udf(b)#24, udf(c)#25]
+- Sort [udf(b#21) ASC NULLS FIRST, udf(cast(c#22 as double)) ASC NULLS FIRST], true
   +- Aggregate [b#21, c#22], [udf(b#21) AS udf(b)#24, udf(cast(c#22 as double)) AS udf(c)#25, b#21, c#22]
      +- Project [b#21, c#22]
         +- Filter (isnotnull(b#21) AND (udf(b#21) = 3))
            +- Relation spark_catalog.default.test_having[a#20,b#21,c#22,d#23] parquet
```

after:

```
== Analyzed Logical Plan ==
udf(b): int, udf(c): double
Sort [udf(b)#9 ASC NULLS FIRST, udf(c)#10 ASC NULLS FIRST], true
+- Filter (udf(b)#9 = 3)
   +- Aggregate [b#6, c#7], [udf(b#6) AS udf(b)#9, udf(cast(c#7 as double)) AS udf(c)#10]
      +- SubqueryAlias spark_catalog.default.test_having
         +- Relation spark_catalog.default.test_having[a#5,b#6,c#7,d#8] parquet

== Optimized Logical Plan ==
Sort [udf(b)#9 ASC NULLS FIRST, udf(c)#10 ASC NULLS FIRST], true
+- Aggregate [b#6, c#7], [udf(b#6) AS udf(b)#9, udf(cast(c#7 as double)) AS udf(c)#10]
   +- Project [b#6, c#7]
      +- Filter (isnotnull(b#6) AND (udf(b#6) = 3))
         +- Relation spark_catalog.default.test_having[a#5,b#6,c#7,d#8] parquet
```
LGTM
does this query work in other databases?
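As a quick cross-check in one other dialect (my addition, not from the thread; the table name `hav` and the sample rows are made up), SQLite accepts the same pattern of HAVING on an aggregate combined with ORDER BY on an aggregate:

```python
import sqlite3

# Minimal SQLite check; `hav` and the inserted rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hav(k TEXT, v INT);
    INSERT INTO hav VALUES ('one', 1), ('one', 2), ('two', 3), ('two', 4);
""")
rows = conn.execute(
    "SELECT k, sum(v) FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY sum(v)"
).fetchall()
print(rows)  # [('one', 3), ('two', 7)]
```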
assertAnalysisSuccess(
  parsePlan(
    """
      |WITH t1 as (SELECT 1 id, 'one' name)
we can put this in a golden file test as well.
Since this just checks analysis success, we can remove it if the test is already in the golden file.
@@ -33,3 +33,6 @@ SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY GROUPING SETS(t.c1) HAVING t.
SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY CUBE(t.c1) HAVING t.c1 = 1;
SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY ROLLUP(t.c1) HAVING t.c1 = 1;
SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY t.c1 HAVING t.c1 = 1;

-- SPARK-28386: Cannot resolve ORDER BY columns with GROUP BY and HAVING
SELECT k, sum(v) FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY sum(v)
does it work if we order by a different aggregate function (different from HAVING)?
Good question!
* 3. If the child plan is Aggregate, resolves the column to [[TempResolvedColumn]] with the output
*    of Aggregate's child plan. This is to allow Sort to host grouping expressions and aggregate
*    functions, which can be pushed down to the Aggregate later. For example,
*    `SELECT max(a) FROM t GROUP BY b ORDER BY min(a)`.
Shall we update the example?
Actually, it was reported by my colleague that the same SQL works on Impala but not Spark. I will investigate other popular RDBMS.
In the SQL standard, ORDER BY can only reference columns in the SELECT list, but many databases extend it to support other cases. I think in Spark the extension is that we can push down grouping expressions and aggregate functions from ORDER BY to SELECT.
I think we should dig into #44352 (comment) more. It seems we have an optimization that if the ORDER BY expression directly matches something from the SELECT list, we replace it with
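That optimization can be modeled as a toy sketch (the helper name is made up and this is not the actual Catalyst rule): if the ORDER BY expression matches a SELECT-list expression, reuse that output attribute; otherwise it has to be pushed down into the Aggregate.

```python
# Toy model (not the actual Catalyst rule): decide whether a sort key can
# reuse an existing SELECT-list output or must be pushed into the Aggregate.

def resolve_sort_key(order_expr, select_list):
    if order_expr in select_list:
        return ("attr", select_list.index(order_expr))  # reuse SELECT output
    return ("push_down", order_expr)                    # host it in Aggregate

print(resolve_sort_key("sum(v)", ["k", "sum(v)"]))  # ('attr', 1)
print(resolve_sort_key("min(v)", ["k", "sum(v)"]))  # ('push_down', 'min(v)')
```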
@cloud-fan I believe Spark already supports it when HAVING is absent, but it does not work when HAVING is present; that's why I say "maybe we can call it a bugfix".

```
SELECT xxx
FROM xxx
GROUP BY xxx
+ HAVING xxx
ORDER BY xxx
```

Let me try.
The plan change log shows it is indeed caused by this patch: see the master plan change log and this patch's plan change log.

BTW, I think this may be out of scope of this PR. The key point of this PR is:

```
SELECT xxx
FROM xxx
GROUP BY xxx
+ HAVING xxx
ORDER BY xxx
```
LGTM, please address the review comments regarding tests
@cloud-fan @beliefer I have updated the test cases as requested, please take another look when you have time
CI failure seems irrelevant
-- SPARK-28386: Resolve ORDER BY column with/without HAVING clause, while the column is present in the SELECT list
SELECT k FROM hav GROUP BY k ORDER BY k;
SELECT k FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY k;
I think these two tests can pass without this PR? This is the basic functionality of ORDER BY: referencing columns in the SELECT list.
yes, should I only retain the fixed SQL?
updated, only the fixed SQL is retained now
-- SPARK-28386: Resolve ORDER BY scalar function with/without HAVING clause, while the scalar function is not present in the SELECT list
SELECT k FROM hav GROUP BY k ORDER BY length(k);
SELECT k FROM hav GROUP BY k HAVING max(v) > 2 ORDER BY length(k);
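The two tests just above can be mirrored in SQLite as a sanity check (my addition; the sample rows are made up, chosen so key lengths differ and the ordering is deterministic):

```python
import sqlite3

# ORDER BY a scalar function of the grouping column that is not in the
# SELECT list, with and without HAVING. Sample data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hav(k TEXT, v INT);
    INSERT INTO hav VALUES ('one', 5), ('seven', 2), ('mississippi', 4);
""")
r1 = conn.execute(
    "SELECT k FROM hav GROUP BY k ORDER BY length(k)").fetchall()
print(r1)  # [('one',), ('seven',), ('mississippi',)]
r2 = conn.execute(
    "SELECT k FROM hav GROUP BY k HAVING max(v) > 2 ORDER BY length(k)"
).fetchall()
print(r2)  # [('one',), ('mississippi',)]
```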
ditto for these two tests
updated
Thanks, merged to master.
Late LGTM.
What changes were proposed in this pull request?
This PR enhances the analyzer to properly handle queries that combine all three clauses, i.e. the `SELECT ... FROM ... GROUP BY ... HAVING ... ORDER BY ...` pattern shown above.
Why are the changes needed?
The above code demonstrates the failure case: the query failed during the analysis phase when both HAVING and ORDER BY clauses are present, but succeeded if only one is present.

Does this PR introduce any user-facing change?
Yes, maybe we can call it a bugfix.
How was this patch tested?
New UTs are added
Was this patch authored or co-authored using generative AI tooling?
No.