
[SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING #44352

Closed

Conversation

pan3793
Member

@pan3793 pan3793 commented Dec 14, 2023

What changes were proposed in this pull request?

This PR enhances the analyzer to properly handle the following plan pattern:

Sort
 - Filter
   - Aggregate

Why are the changes needed?

spark-sql (default)> CREATE TABLE t1 (flag BOOLEAN, dt STRING);

spark-sql (default)>   SELECT LENGTH(dt),
                   >          COUNT(t1.flag)
                   >     FROM t1
                   > GROUP BY LENGTH(dt)
                   >   HAVING COUNT(t1.flag) > 1
                   > ORDER BY LENGTH(dt);
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `dt` cannot be resolved. Did you mean one of the following? [`length(dt)`, `count(flag)`].; line 6 pos 16;
'Sort ['LENGTH('dt) ASC NULLS FIRST], true
+- Filter (count(flag)#60L > cast(1 as bigint))
   +- Aggregate [length(dt#9)], [length(dt#9) AS length(dt)#59, count(flag#8) AS count(flag)#60L]
      +- SubqueryAlias spark_catalog.default.t1
         +- Relation spark_catalog.default.t1[flag#8,dt#9] parquet

The output above demonstrates the failure case: the query fails during the analysis phase when both the HAVING and ORDER BY clauses are present, but succeeds if only one of them is present.

Does this PR introduce any user-facing change?

Yes, maybe we can call it a bugfix.

How was this patch tested?

New unit tests are added.

Was this patch authored or co-authored using generative AI tooling?

No.

@pan3793
Member Author

pan3793 commented Dec 14, 2023

// a table `t` has columns `c1` and `c2`, for query `SELECT ... FROM t GROUP BY c1 HAVING c2 = 0`,
// even though we can resolve column `c2` here, we should undo it and fail with
// "Column c2 not found".
protected def resolveColWithAgg(e: Expression, plan: LogicalPlan): Expression = plan match {
case Filter(_, agg: Aggregate) => resolveColWithAgg(e, agg)
Contributor

Can we move this into ResolveReferencesInSort? I think it is clearer to make the call side resolve the aggregate through the filter.

Member Author

Thanks, changed as suggested

@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
SELECT udf(b), udf(c) FROM test_having
GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
-- !query analysis
Project [udf(b)#x, udf(c)#x]
Contributor

do you know why the plan is changed?

Member Author
@pan3793 pan3793 Dec 15, 2023

I think the previous resolution matched item 4 of the ResolveReferencesInSort comments.

  4. Resolves the column to [[AttributeReference]] with the output of a descendant plan node.
    Spark will propagate the missing attributes from the descendant plan node to the Sort node.
    This is to allow users to ORDER BY columns that are not in the SELECT clause, which is
    widely supported in other SQL dialects. For example, SELECT a FROM t ORDER BY b.

With this patch, it should match item 3

  3. If the child plan is Aggregate or Filter(_, Aggregate), resolves the column to
    [[TempResolvedColumn]] with the output of Aggregate's child plan.
    This is to allow Sort to host grouping expressions and aggregate functions, which can
    be pushed down to the Aggregate later. For example,
    SELECT max(a) FROM t GROUP BY b HAVING max(a) > 1 ORDER BY min(a).

Contributor
@cloud-fan cloud-fan Dec 15, 2023

oh so the plan is actually more efficient now?

Member Author

I think so; the plan shows it eliminates some unnecessary column propagation across operators.

Contributor

This is the analyzed plan; shouldn't the optimized plan be the same with this PR?

Member Author

Both the analyzed plan and the optimized plan are changed :)

before:

== Analyzed Logical Plan ==
udf(b): int, udf(c): double
Project [udf(b)#24, udf(c)#25]
+- Sort [udf(b#21) ASC NULLS FIRST, udf(cast(c#22 as double)) ASC NULLS FIRST], true
   +- Filter (udf(b)#24 = 3)
      +- Aggregate [b#21, c#22], [udf(b#21) AS udf(b)#24, udf(cast(c#22 as double)) AS udf(c)#25, b#21, c#22]
         +- SubqueryAlias spark_catalog.default.test_having
            +- Relation spark_catalog.default.test_having[a#20,b#21,c#22,d#23] parquet

== Optimized Logical Plan ==
Project [udf(b)#24, udf(c)#25]
+- Sort [udf(b#21) ASC NULLS FIRST, udf(cast(c#22 as double)) ASC NULLS FIRST], true
   +- Aggregate [b#21, c#22], [udf(b#21) AS udf(b)#24, udf(cast(c#22 as double)) AS udf(c)#25, b#21, c#22]
      +- Project [b#21, c#22]
         +- Filter (isnotnull(b#21) AND (udf(b#21) = 3))
            +- Relation spark_catalog.default.test_having[a#20,b#21,c#22,d#23] parquet

after:

== Analyzed Logical Plan ==
udf(b): int, udf(c): double
Sort [udf(b)#9 ASC NULLS FIRST, udf(c)#10 ASC NULLS FIRST], true
+- Filter (udf(b)#9 = 3)
   +- Aggregate [b#6, c#7], [udf(b#6) AS udf(b)#9, udf(cast(c#7 as double)) AS udf(c)#10]
      +- SubqueryAlias spark_catalog.default.test_having
         +- Relation spark_catalog.default.test_having[a#5,b#6,c#7,d#8] parquet

== Optimized Logical Plan ==
Sort [udf(b)#9 ASC NULLS FIRST, udf(c)#10 ASC NULLS FIRST], true
+- Aggregate [b#6, c#7], [udf(b#6) AS udf(b)#9, udf(cast(c#7 as double)) AS udf(c)#10]
   +- Project [b#6, c#7]
      +- Filter (isnotnull(b#6) AND (udf(b#6) = 3))
         +- Relation spark_catalog.default.test_having[a#5,b#6,c#7,d#8] parquet

Member
@yaooqinn yaooqinn left a comment

LGTM

@cloud-fan
Contributor

does this query work in other databases?

assertAnalysisSuccess(
parsePlan(
"""
|WITH t1 as (SELECT 1 id, 'one' name)
Contributor

We can put this in a golden file test as well.

Contributor

Since this just checks analysis success, we can remove it if the test is already in a golden file.

@@ -33,3 +33,6 @@ SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY GROUPING SETS(t.c1) HAVING t.
SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY CUBE(t.c1) HAVING t.c1 = 1;
SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY ROLLUP(t.c1) HAVING t.c1 = 1;
SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY t.c1 HAVING t.c1 = 1;

-- SPARK-28386: Cannot resolve ORDER BY columns with GROUP BY and HAVING
SELECT k, sum(v) FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY sum(v)
Contributor

Does it work if we ORDER BY a different aggregate function (different from the one in HAVING)?

Contributor

Good question!
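As a quick sanity check of that variant against another engine, SQLite accepts it; the `hav` data below is illustrative, mirroring the golden-file table's shape, not the actual Spark fixture:

```python
import sqlite3

# Hypothetical check: HAVING on sum(v), ORDER BY on a different aggregate, max(v).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hav (k TEXT, v INTEGER)")
conn.executemany("INSERT INTO hav VALUES (?, ?)", [
    ("one", 1), ("two", 2), ("two", 2), ("three", 3),
])
rows = conn.execute("""
    SELECT k, sum(v) FROM hav
     GROUP BY k
    HAVING sum(v) > 2
     ORDER BY max(v)
""").fetchall()
print(rows)  # → [('two', 4), ('three', 3)]
```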

* 3. If the child plan is Aggregate, resolves the column to [[TempResolvedColumn]] with the output
* of Aggregate's child plan. This is to allow Sort to host grouping expressions and aggregate
* functions, which can be pushed down to the Aggregate later. For example,
* `SELECT max(a) FROM t GROUP BY b ORDER BY min(a)`.
Contributor

Shall we update the example?

@@ -33,3 +33,6 @@ SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY GROUPING SETS(t.c1) HAVING t.
SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY CUBE(t.c1) HAVING t.c1 = 1;
SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY ROLLUP(t.c1) HAVING t.c1 = 1;
SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY t.c1 HAVING t.c1 = 1;

-- SPARK-28386: Cannot resolve ORDER BY columns with GROUP BY and HAVING
SELECT k, sum(v) FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY sum(v)
Contributor

Good question!

@pan3793
Member Author

pan3793 commented Dec 16, 2023

does this query work in other databases?

Actually, it was reported by my colleague that the same SQL works on Impala but not Spark. I will investigate other popular RDBMS.
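As one more data point, SQLite also accepts the pattern; a minimal sketch, with illustrative data mirroring the `t1` schema from the PR description:

```python
import sqlite3

# The problematic Spark query shape, run against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (flag BOOLEAN, dt TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [
    (1, "a"), (1, "ab"), (1, "ab"), (0, "abc"),
])
rows = conn.execute("""
    SELECT LENGTH(dt), COUNT(flag)
      FROM t1
     GROUP BY LENGTH(dt)
    HAVING COUNT(flag) > 1
     ORDER BY LENGTH(dt)
""").fetchall()
print(rows)  # → [(2, 2)]: only length 2 has more than one row
```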

@cloud-fan
Copy link
Contributor

In the SQL standard, ORDER BY can only reference columns in the SELECT list, but many databases extend it to support other cases. I think in Spark the extension is we can push down grouping expressions and aggregate functions from ORDER BY to SELECT.
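Both extensions can be sketched quickly in SQLite, which also supports them; the table name and data here are illustrative, not from the Spark tests:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 3), (2, 1), (3, 2)])

# ORDER BY a column that is not in the SELECT list.
by_col = conn.execute("SELECT a FROM t ORDER BY b").fetchall()

# ORDER BY an aggregate function that is not in the SELECT list.
by_agg = conn.execute("SELECT max(a) FROM t GROUP BY b ORDER BY min(a)").fetchall()

print(by_col)  # → [(2,), (3,), (1,)]
print(by_agg)  # → [(1,), (2,), (3,)]
```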

@cloud-fan
Contributor

I think we should dig into #44352 (comment) more. It seems we have an optimization that if the ORDER BY expression directly matches something from the SELECT list, we replace it with AttributeReference. Can you find out where the optimization is, and when it will be triggered? Is it rule-order independent?

@pan3793
Member Author

pan3793 commented Dec 18, 2023

I think in Spark the extension is we can push down grouping expressions and aggregate functions from ORDER BY to SELECT.

@cloud-fan I believe Spark already supports this when HAVING is absent, but it does not work when HAVING is present; that's why I said "maybe we can call it a bugfix".

  SELECT xxx
  FROM xxx
  GROUP BY xxx
+ HAVING xxx
  ORDER BY xxx

It seems we have an optimization that if the ORDER BY expression directly matches something from the SELECT list, we replace it with AttributReference. Can you find out where the optimization is? And when it will be triggered? Is there rule-order independent?

Let me try.

@pan3793
Member Author

pan3793 commented Dec 18, 2023

It seems we have an optimization that if the ORDER BY expression directly matches something from the SELECT list, we replace it with AttributReference. Can you find out where the optimization is? And when it will be triggered? Is there rule-order independent?

The plan change log shows it is indeed caused by ResolveReferences (which invokes the virtual rule ResolveReferencesInSort), so I suppose this answer is valid: #44352 (comment). It should be rule-order independent.

master plan change log
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                 +- 'UnresolvedHaving ('udf('b) = 3)
    +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]      +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedali...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.execution.datasources.FindDataSourceTable ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                            'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                         +- 'UnresolvedHaving ('udf('b) = 3)
    +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]              +- 'Aggregate ['b, 'c], [unresolvedalias('u...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                          'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                       +- 'UnresolvedHaving ('udf('b) = 3)
!   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]            +- 'Aggregate [b#178, c#179], [unresolvedalias...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                             +- 'UnresolvedHaving ('udf('b) = 3)
!   +- 'Aggregate [b#178, c#179], [unresolvedalias('udf(b#178)), unresolvedalias('udf(c#179))]      +- 'Aggregate [b#178, c#179],...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                           'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                        +- 'UnresolvedHaving ('udf('b) = 3)
!   +- 'Aggre...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                 +- 'UnresolvedHaving ('udf('b) = 3)
    +- Aggregate [b#178, c#179], [c...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
!+- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                 +- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, fals...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
!+- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, false)) = 3)                                                                                +- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178,...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
!+- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) = 3)                                                    +- Filter (udf(b)#181 = 3)
    +- Aggregate [b#178, ...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Project [udf(b)#181, udf(c)#182]
!+- Filter (udf(b)#181 = 3)                                                                                                                          +- 'Sort ['udf(b#178) ASC NULLS FIRST, 'udf(c#179) ASC NULLS FIRST], true
!   +- Aggregate ...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
!'Project [udf(b)#181, udf(c)#182]                                                                                                                                    Project [udf(b)#181, udf(c)#182]
!+- 'Sort ['udf(b#178) ASC NULLS FIRST, 'udf(c#179) ASC NULLS FIRST], true                                                                                            +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS ...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone ===
 Project [udf(b)#181, udf(c)#182]                                                                                                                                     Project [udf(b)#181, udf(c)#182]
 +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, cast(udf(cast(c#179 as string)) as string) ASC NULLS FIRST], true                                  +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, cas...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Result of Batch Resolution ===
!'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                    Project [udf(b)#181, udf(c)#182]
!+- 'UnresolvedHaving ('udf('b) = 3)                                                 +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, cast(udf(cast(c#179 as string)) as string) ASC NULLS FIRST], true
!   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]      +- Filter (udf(b)#181 = 3)
!      +- 'UnresolvedRelat...
this patch plan change log
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                 +- 'UnresolvedHaving ('udf('b) = 3)
    +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]      +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedali...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.execution.datasources.FindDataSourceTable ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                            'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                         +- 'UnresolvedHaving ('udf('b) = 3)
    +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]              +- 'Aggregate ['b, 'c], [unresolvedalias('u...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                          'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                       +- 'UnresolvedHaving ('udf('b) = 3)
!   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]            +- 'Aggregate [b#178, c#179], [unresolvedalias...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                             +- 'UnresolvedHaving ('udf('b) = 3)
!   +- 'Aggregate [b#178, c#179], [unresolvedalias('udf(b#178)), unresolvedalias('udf(c#179))]      +- 'Aggregate [b#178, c#179],...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                           'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                        +- 'UnresolvedHaving ('udf('b) = 3)
!   +- 'Aggre...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
 +- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                 +- 'UnresolvedHaving ('udf('b) = 3)
    +- Aggregate [b#178, c#179], [c...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
!+- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                 +- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, fals...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
!+- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, false)) = 3)                                                                                +- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178,...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
!+- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) = 3)                                                    +- Filter (udf(b)#181 = 3)
    +- Aggregate [b#178, ...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf(tempresolvedcolumn(b#178, b, false)) ASC NULLS FIRST, 'udf(tempresolvedcolumn(c#179, c, false)) ASC NULLS FIRST], true
 +- Filter (udf(b)#181 = 3)                                                                                                                    ...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
!'Sort ['udf(tempresolvedcolumn(b#178, b, false)) ASC NULLS FIRST, 'udf(tempresolvedcolumn(c#179, c, false)) ASC NULLS FIRST], true                  Sort [cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) ASC NULLS FIRST, cast(udf(cast(tempresolvedcolumn(c#179, c, false) as string)) as string) ASC NULLS FIRST], true
 +- Filter (udf(b)#181 = 3)                                                           ...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
!Sort [cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) ASC NULLS FIRST, cast(udf(cast(tempresolvedcolumn(c#179, c, false) as string)) as string) ASC NULLS FIRST], true   Sort [udf(b)#181 ASC NULLS FIRST, udf(c)#182 ASC NULLS FIRST], true
 +- Filter (udf(b)#181 = 3)                                                                                                                                ...
org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Result of Batch Resolution ===
!'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                    Sort [udf(b)#181 ASC NULLS FIRST, udf(c)#182 ASC NULLS FIRST], true
!+- 'UnresolvedHaving ('udf('b) = 3)                                                 +- Filter (udf(b)#181 = 3)
!   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]      +- Aggregate [b#178, c#179], [cast(udf(cast(b#178 as string)) as int) AS udf(b)#181, cast(udf(cast(c#179 as string)) as str...

I think in Spark the extension is we can push down grouping expressions and aggregate functions from ORDER BY to SELECT.

BTW, I think this may be out of scope for this PR. The key point of this PR is that ORDER BY should have the same resolution result whether the HAVING clause is present or not.

  SELECT xxx
  FROM xxx
  GROUP BY xxx
+ HAVING xxx
  ORDER BY xxx

Contributor
@cloud-fan cloud-fan left a comment

LGTM, please address the review comments regarding tests

@pan3793
Member Author

pan3793 commented Dec 19, 2023

@cloud-fan @beliefer I have updated the test cases as requested, please take another look when you have time

@pan3793
Member Author

pan3793 commented Dec 19, 2023

The CI failure seems unrelated:
https://github.com/pan3793/spark/actions/runs/7259161309/job/19780922398

Notice:  A new release of pip is available: 23.3.1 -> 23.3.2
Notice:  To update, run: python3.9 -m pip install --upgrade pip
ERROR:  Error installing bundler:
	The last version of bundler (>= 0) to support your Ruby & RubyGems was 2.4.22. Try installing it with `gem install bundler -v 2.4.22`
	bundler requires Ruby version >= 3.0.0. The current ruby version is 2.7.0.0.
Error: Process completed with exit code 1.


-- SPARK-28386: Resolve ORDER BY column with/without HAVING clause, while the column is present in the SELECT list
SELECT k FROM hav GROUP BY k ORDER BY k;
SELECT k FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY k;
Contributor

I think these two tests can pass without this PR? This is the basic functionality of ORDER BY: referencing columns in the SELECT list.

Member Author

yes, should I only retain the fixed SQL?

Member Author

Updated; only the fixed SQL is retained now.


-- SPARK-28386: Resolve ORDER BY scalar function with/without HAVING clause, while the scalar function is not present in the SELECT list
SELECT k FROM hav GROUP BY k ORDER BY length(k);
SELECT k FROM hav GROUP BY k HAVING max(v) > 2 ORDER BY length(k);
Contributor

ditto for these two tests

Member Author

updated

@yaooqinn yaooqinn closed this in 9fdc375 Dec 20, 2023
@yaooqinn
Member

Thanks, merged to master.

Contributor
@beliefer beliefer left a comment

Late LGTM.
