-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-14471][SQL] Aliases in SELECT could be used in GROUP BY #17191
Conversation
Test build #74110 has finished for PR 17191 at commit
|
@maropu Would your code work for the problem in the PR description?
|
@nsyca oh, good suggestion. yes, it seems the case does not work. I'll update this pr soon. Thanks! |
Test build #74320 has finished for PR 17191 at commit
|
checkAnswer( | ||
sql("SELECT k1 AS key1, key1 + 1 AS key2, COUNT(1) FROM t GROUP BY key1, key2"), | ||
Row(1, 2, 2) :: Row(2, 3, 1) :: Nil) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the alias has a collision with a valid column name, like,
select k1 as k2, k2 from t
select k1 as k2 from t group by k1
select k1 as k2 from t group by k2
This would be an interesting test case to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the target of this pr just replaces unresolved exprs in grouping keys with resolved ones by using resolved exprs in SELECT clauses. So, I think we'd better to discuss the name collision case you described in follow-up other tickets.
q.resolveChildren(nameParts, resolver).getOrElse { | ||
resolvedExprs.find(ne => resolver(ne.name, u.name)).getOrElse(u) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With your new code, unresolved attributes will get a second chance to be resolved with any aliases in ALL operators. Is this an intended behaviour? If it is, you'd better update the PR description. The description currently refers to aliases in SELECT to be used in GROUP BY.
Using your definition in the new test cases:
Seq((1, "a", 0), (2, "a", 1), (1, "a", 2)).toDF("k1", "k2", "v").createOrReplaceTempView("t")
You can see that when there is a name collision, it shows a different behaviour. The one below has the unresolved kx
in GROUP BY clause resolved to k1
.
sql("select k1 as kx from t group by kx")
While this one below returns an error that k1
in SELECT clause is not found in GROUP BY clause.
sql("select k1 as k2 from t group by k2")
I think it'd better to go back to the drawing board and make a decision what we are going to support before we verify the code works as designed.
@maropu: I have made a few comments in the code and test cases. IMO, we should go back to define what we want to fix before jumping into the code. I am sorry if my previous statement gave you an impression that we need to solve the problem in your PR description. I personally think the example in your PR description goes too far to resolve an undefined column in the SELECT clause from any aliases from its LHS. Example:
I think this should result in an error that IMO, the problem we want to focus is to resolve any unresolved references in a GROUP BY clause by trying to map them to the aliases in the SELECT clause. That is, assuming table
should resolve in an error that
should resolve the unresolved reference It may be that the code in your first commit is good for the above problem definition. You may want to involve any committers who are willing to concur with your problem definition, review your code and merge it in. In any case, the PR description should be updated and the test cases with name collision between aliases in SELECT clause and valid columns in the table should be added. |
Ah, I see. In a basic stance, I think we should do the same behaviour with existing databases. You're right and
Anyway, I agree with your suggestion; we should make PRs as much small as possible, and so I reconsider this pr again to fix the point in the description. |
Test build #74585 has started for PR 17191 at commit |
Jenkins, retest this please. |
@maropu Hello, Perhaps you have already seen .. there have been PRs to fix this in the past. Here are a few i could find - #12794 by @dongjoon-hyun Hope this helps .. |
Test build #74589 has finished for PR 17191 at commit
|
@dilipbiswal oh, I didn't notice that. I check them and I'll close this. Thanks! |
@maropu We may want to implement this extension if mysql and postgres support it. Just wanted to point you to the past discussions on this topic. Do we know if hive allows this ? |
@maropu . I think you had better ask committers before closing this issue once more. |
Could you please do a check which systems support it and which systems disallow it? Thanks! |
okay, I'll check |
@gatorsmile I checked;
And you know Oracle and SQLServer does not support this syntax; In summary, WDYT? |
We hit this issue multiple times. Although it looks not right to support it for the users of major enterprise RDBMS, two popular open source RDBMS PostgreSQL and MySQL support it. This is the design decision we need to make. Should we keep the previous decision? @rxin @marmbrus @hvanhovell @cloud-fan Thanks! |
I personally have run into this issue and was surprised that we didn't support it ... it's pretty verbose to retype everything. If Postgres and MySQL both support it, I think we should do it too! Can we introduce a config flag to not break old workloads? |
Test build #74677 has finished for PR 17191 at commit
|
}.getOrElse(u) | ||
case e => e | ||
}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Should we place this Aggregate pattern next to the other Aggregate that deals with the wildcard character (*) above? Otherwise, LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code block you suggested has some rules for replacing wildcards and so I feel putting this rule there is a little weird. But, I don't have a strong opinion on this, so both is okay to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu I did not suggest to merge the two code blocks. I suggested to move them next to each other so readers will easily see the code handling Aggregate with two different patterns.
I'll add a config soon |
It's kind of weird if we have both "group by ordinal" feature and this "group by alias" feature, shall we only pick one of them? |
Test build #74712 has finished for PR 17191 at commit
|
@cloud-fan Even in the mixed case, it seems PotgreSQL and MySQL support the syntax;
|
Test build #74713 has finished for PR 17191 at commit
|
ok makes sense, let's support it |
okay, I'll recheck the code and add tests for the case. |
I tried to move |
Test build #76187 has finished for PR 17191 at commit
|
This reverts commit 1340862.
For now, I reverted the latest commit and I'll look for other better way to fix this. |
Test build #76202 has finished for PR 17191 at commit
|
override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators { | ||
case agg @ Aggregate(groups, aggs, child) | ||
if conf.groupByAliases && child.resolved && aggs.forall(_.resolved) && | ||
groups.exists(!_.resolved) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can check groups.exists(_.isInstanceOf[UnresolvedAttribute])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -998,6 +998,22 @@ class Analyzer( | |||
} | |||
|
|||
/** | |||
* Replace unresolved expressions in grouping keys with resolved ones in SELECT clauses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment to say that this rule has to be run after ResolveReferences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll update
|
||
override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators { | ||
case agg @ Aggregate(groups, aggs, child) | ||
if conf.groupByAliases && child.resolved && aggs.forall(_.resolved) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very confident about this condition, we may mistakenly resolve grouping expression by aggregate list while it should be resolved by child output.
one example is the star. If the aggregate list contains a star, then we will expand the star in ResolveReferences
, without resolving grouping expressions. When we reach here, the condition will match but the grouping expression should not be resolved by aggregate list.
cc @gatorsmile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this code for the check? https://github.com/apache/spark/pull/17191/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R1011
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an UnresolvedAttribute
in grouping expressions here already implicitly indicates it is not in child plan's output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure though, since the resolution batch currently has this rule, it seems this rule is firstly applied into unresolved grouping keys in the star case @cloud-fan suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the UnresolvedAttribute
in grouping expressions that can be resolved by child's output, should be already resolved by ResolveReferences
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got debug info and it seemed ResolveAggAliasInGroupBy
was firstly applied in that case?
scala> sql("SELECT *, count(1) FROM (select 1 AS a, 1 AS b) GROUP BY a, b").show
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Aggregate ['a, 'b], [*, unresolvedalias('count(1), None)] 'Aggregate ['a, 'b], [a#0, b#1, unresolvedalias('count(1), None)]
+- Project [1 AS a#0, 1 AS b#1] +- Project [1 AS a#0, 1 AS b#1]
+- OneRowRelation$ +- OneRowRelation$
<-- ResolveAggAliasInGroupBy applied -->
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
!'Aggregate ['a, 'b], [a#0, b#1, unresolvedalias('count(1), None)] 'Aggregate ['a, 'b], [a#0, b#1, unresolvedalias(count(1), None)]
+- Project [1 AS a#0, 1 AS b#1] +- Project [1 AS a#0, 1 AS b#1]
+- OneRowRelation$ +- OneRowRelation$
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases ===
!'Aggregate ['a, 'b], [a#0, b#1, unresolvedalias(count(1), None)] 'Aggregate ['a, 'b], [a#0, b#1, count(1) AS count(1)#3L]
+- Project [1 AS a#0, 1 AS b#1] +- Project [1 AS a#0, 1 AS b#1]
+- OneRowRelation$ +- OneRowRelation$
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Aggregate ['a, 'b], [a#0, b#1, count(1) AS count(1)#3L] Aggregate [a#0, b#1], [a#0, b#1, count(1) AS count(1)#3L]
+- Project [1 AS a#0, 1 AS b#1] +- Project [1 AS a#0, 1 AS b#1]
+- OneRowRelation$ +- OneRowRelation$
<-- ResolveAggAliasInGroupBy applied -->
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I have an idea: first, check there is UnresolvedAttribute
in grouping expression, then check these UnresolvedAttribute
s can't be resolved by child.output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest fix is not enough for your suggestion?;
This fix checks if UnresolvedAttribute
exists there, then filter out by child.output
.
https://github.com/apache/spark/pull/17191/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R1014
+ // This is a strict check though, we put this to apply the rule only in alias expressions
+ def checkIfChildOutputHasNo(attrName: String): Boolean =
+ !child.output.exists(a => resolver(a.name, attrName))
+ agg.copy(groupingExpressions = groups.map {
+ case u: UnresolvedAttribute if checkIfChildOutputHasNo(u.name) =>
+ aggs.find(ne => resolver(ne.name, u.name)).getOrElse(u)
+ case e => e
+ })
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea it works, but maybe give it a better name like notResolvableByChild
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
Test build #76218 has finished for PR 17191 at commit
|
Test build #76220 has finished for PR 17191 at commit
|
Jenkins, retest this please. |
Test build #76224 has finished for PR 17191 at commit
|
Test build #76232 has finished for PR 17191 at commit
|
@cloud-fan ok, could you check again? Thanks! |
@@ -998,6 +998,27 @@ class Analyzer( | |||
} | |||
|
|||
/** | |||
* Replace unresolved expressions in grouping keys with resolved ones in SELECT clauses. | |||
* This rule is expected to run after [[ResolveReferences]] applied. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove this now. With the new check, the order doesn't matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
thanks, mering to master/2.2! |
## What changes were proposed in this pull request? This pr added a new rule in `Analyzer` to resolve aliases in `GROUP BY`. The current master throws an exception if `GROUP BY` clauses have aliases in `SELECT`; ``` scala> spark.sql("select a a1, a1 + 1 as b, count(1) from t group by a1") org.apache.spark.sql.AnalysisException: cannot resolve '`a1`' given input columns: [a]; line 1 pos 51; 'Aggregate ['a1], [a#83L AS a1#87L, ('a1 + 1) AS b#88, count(1) AS count(1)#90L] +- SubqueryAlias t +- Project [id#80L AS a#83L] +- Range (0, 10, step=1, splits=Some(8)) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) ``` ## How was this patch tested? Added tests in `SQLQuerySuite` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17191 from maropu/SPARK-14471. (cherry picked from commit 59e3a56) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request? This pr added `Analyzer` code for supporting aliases in CUBE/ROLLUP/GROUPING SETS (This is follow-up of apache#17191). ## How was this patch tested? Added tests in `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes apache#17948 from maropu/SPARK-20710.
## What changes were proposed in this pull request? This pr added `Analyzer` code for supporting aliases in CUBE/ROLLUP/GROUPING SETS (This is follow-up of #17191). ## How was this patch tested? Added tests in `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17948 from maropu/SPARK-20710. (cherry picked from commit 92ea7fd) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request? This pr added `Analyzer` code for supporting aliases in CUBE/ROLLUP/GROUPING SETS (This is follow-up of apache#17191). ## How was this patch tested? Added tests in `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes apache#17948 from maropu/SPARK-20710.
## What changes were proposed in this pull request? This pr added `Analyzer` code for supporting aliases in CUBE/ROLLUP/GROUPING SETS (This is follow-up of apache#17191). ## How was this patch tested? Added tests in `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes apache#17948 from maropu/SPARK-20710.
What changes were proposed in this pull request?
This pr added a new rule in
Analyzer
to resolve aliases inGROUP BY
.The current master throws an exception if
GROUP BY
clauses have aliases inSELECT
;How was this patch tested?
Added tests in
SQLQuerySuite
andSQLQueryTestSuite
.