[SPARK-16639][SQL] The query with having condition that contains grouping by column should work #14296

Closed
wants to merge 6 commits into from

Member

@viirya viirya commented Jul 21, 2016

What changes were proposed in this pull request?

A query whose HAVING condition contains a group-by expression fails during analysis. E.g.,

create table tbl(a int, b string);
select count(b) from tbl group by a + 1 having a + 1 = 2;

The HAVING condition should be able to reference group-by expressions.
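
To illustrate the failure, here is a minimal Scala sketch. It is not Spark code. It assumes HAVING is planned as a filter on top of the aggregate, so the condition can only see the aggregate's output columns. The types `Expr`, `Attr`, `Lit`, `Add`, `EqualTo` and the helpers `references` and `unresolved` are hypothetical stand-ins for the Catalyst classes.

```scala
object HavingResolutionSketch {
  // Toy expression tree (hypothetical, not Catalyst).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr

  // Names of all attributes an expression refers to.
  def references(e: Expr): Set[String] = e match {
    case Attr(n)       => Set(n)
    case Lit(_)        => Set.empty[String]
    case Add(l, r)     => references(l) ++ references(r)
    case EqualTo(l, r) => references(l) ++ references(r)
  }

  // Attributes the condition needs but the aggregate does not output.
  def unresolved(condition: Expr, aggregateOutput: Set[String]): Set[String] =
    references(condition) -- aggregateOutput

  // having a + 1 = 2
  val having: Expr = EqualTo(Add(Attr("a"), Lit(1)), Lit(2))

  // select count(b) ...: the only output column is the count.
  val missing: Set[String] = unresolved(having, Set("count(b)"))
}
```

In this sketch `missing` contains `a`: the condition refers to a column that the aggregate does not expose, even though `a + 1` is a group-by expression.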

How was this patch tested?

Jenkins tests.

case e: Expression =>
val alias = Alias(e, e.toString)()
aggregateExpressions += alias
alias.toAttribute
Contributor

Looks like we are blindly pushing all expressions into Aggregate. How about we just push the whole filter condition?

Member Author

You mean we don't replace and push the expressions, but just push the whole filter condition?

Contributor

yup

Member Author

@viirya viirya Jul 22, 2016

Looks like pushing the whole filter condition causes more problems.

Actually, the problem is not the whole-filter-condition pushdown itself but this approach. I will update this.


SparkQA commented Jul 21, 2016

Test build #62655 has finished for PR 14296 at commit 5704709.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 22, 2016

Test build #62702 has finished for PR 14296 at commit 4ca4088.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1207,6 +1207,12 @@ class Analyzer(
val alias = Alias(ae, ae.toString)()
aggregateExpressions += alias
alias.toAttribute
case ne: NamedExpression => ne
case e: Expression if grouping.exists(_.semanticEquals(e)) &&
Member Author

@viirya viirya Jul 22, 2016

I think we should not push down all expressions or the whole expression. We should push down only the group-by expressions and aggregate expressions.
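
The idea of pushing down only the matching expressions can be sketched on a toy expression tree. This is an illustration under assumptions, not the Analyzer code: `rewrite`, `pushDown`, and the `_g0` naming scheme are hypothetical, case-class equality stands in for `semanticEquals`, and the rewrite stops at the first match instead of continuing into the replaced node.

```scala
import scala.collection.mutable.ArrayBuffer

object PushDownSketch {
  // Toy expression tree (hypothetical, not Catalyst).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr

  // Simplified top-down rewrite: if the rule matches a node, replace it and stop;
  // otherwise rebuild the node from its rewritten children.
  def rewrite(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr =
    if (rule.isDefinedAt(e)) rule(e)
    else e match {
      case Add(l, r)     => Add(rewrite(l)(rule), rewrite(r)(rule))
      case EqualTo(l, r) => EqualTo(rewrite(l)(rule), rewrite(r)(rule))
      case leaf          => leaf
    }

  // Returns the rewritten HAVING condition plus the (name, expression) pairs
  // that would have to be added to the aggregate's output.
  def pushDown(having: Expr, grouping: Seq[Expr]): (Expr, Seq[(String, Expr)]) = {
    val extra = ArrayBuffer.empty[(String, Expr)]
    val rewritten = rewrite(having) {
      // Only subtrees equal to a group-by expression are replaced.
      case e if grouping.contains(e) =>
        val name = s"_g${extra.size}"
        extra += ((name, e))
        Attr(name)
    }
    (rewritten, extra.toList)
  }

  // group by a + 1 having a + 1 = 2
  val groupByExpr: Expr = Add(Attr("a"), Lit(1))
  val (condition, added) = pushDown(EqualTo(groupByExpr, Lit(2)), Seq(groupByExpr))
}
```

Here `condition` refers to the new column `_g0` instead of `a + 1`, and `added` lists the single expression the aggregate would need to output. The literal `2` and the comparison itself stay in the filter.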


SparkQA commented Jul 22, 2016

Test build #62722 has finished for PR 14296 at commit 6d0d753.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1207,6 +1207,12 @@ class Analyzer(
val alias = Alias(ae, ae.toString)()
aggregateExpressions += alias
alias.toAttribute
case ne: NamedExpression => ne
case e: Expression if grouping.exists(_.semanticEquals(e)) &&
!ResolveGroupingAnalytics.hasGroupingFunction(e) =>
Contributor

Why is !ResolveGroupingAnalytics.hasGroupingFunction(e) needed? Do we want a test case for this check?

Member Author

I am not near my laptop, but pushing down grouping functions causes test failures in SQLQuerySuite. I remember there is another rule that takes care of grouping functions.

Contributor

@cloud-fan cloud-fan Jul 25, 2016

It would be great if we could figure it out and add a comment here.

Member Author

Yes. ResolveGroupingAnalytics performs the grouping ID and grouping function replacement, so we should skip pushing them down here.

Member Author

viirya commented Jul 23, 2016

@cloud-fan any more comments? Thanks!

@@ -1207,6 +1207,12 @@ class Analyzer(
val alias = Alias(ae, ae.toString)()
aggregateExpressions += alias
alias.toAttribute
case ne: NamedExpression => ne
Contributor

why this case?

Member Author

I think we don't need to replace named expressions (attributes, aliases)?

Contributor

But case e: Expression if grouping.exists(_.semanticEquals(e)) will filter them out, right?

Member Author

@viirya viirya Jul 25, 2016

Removing it causes an error in replaceGroupingFunc. Because we create a new attribute from the grouping column by adding an extra alias, we can't find it in the Aggregate's group-by expressions.


SparkQA commented Jul 25, 2016

Test build #62808 has finished for PR 14296 at commit 8bab985.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case ne: NamedExpression => ne
// Grouping functions are handled in the rule [[ResolveGroupingAnalytics]].
case e: Expression if grouping.exists(_.semanticEquals(e)) &&
!ResolveGroupingAnalytics.hasGroupingFunction(e) =>
Contributor

So what will happen if we use a grouping function in HAVING? Will the query fail?

Member Author

You mean if we don't add this check? Yes. Because the grouping function is pushed down into the Aggregate, the rule ResolveGroupingAnalytics can't replace the grouping function.

Contributor

I mean with this check. Can we resolve grouping function inside Filter?

Member Author

Yes. SQLQuerySuite has a test with a grouping function in HAVING.

Member Author

viirya commented Jul 27, 2016

ping @cloud-fan Any more comments? Thanks.

// Replacing [[NamedExpression]] causes the error on [[Grouping]] because the
// grouping column will be new attribute created by adding additional [[Alias]].
// So we can't find the grouping column and replace it in the rule
// [[ResolveGroupingAnalytics]].
Contributor

I don't quite understand this comment, can you give a concrete example?

Member Author

Suppose there is a Grouping(col#1) in the HAVING clause. If we push it down to the Aggregate and use an Alias instead, the grouping function becomes Grouping(col#1#2). Then, when the rule ResolveGroupingAnalytics tries to find the index of col in the group-by expressions, it can't find it.

Contributor

But in the next case we check that 1) it's a grouping column and 2) it's not a grouping function. So how can this happen?

Member Author

Since we do a transform, the inner col of Grouping will match this case.
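
A toy Scala sketch of this effect, under stated assumptions: `transform` mimics a top-down rewrite that keeps descending into children, `groupingIndex` is a hypothetical stand-in for the index lookup done for grouping functions, and `col_alias` stands for the new attribute created by the extra alias. None of this is the actual Analyzer code.

```scala
object GroupingFunctionSketch {
  // Toy expression tree (hypothetical, not Catalyst).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr
  final case class Grouping(child: Expr) extends Expr

  // Top-down transform: apply the rule to a node if it matches, then keep
  // descending into the children of whatever the rule returned.
  def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val applied = if (rule.isDefinedAt(e)) rule(e) else e
    applied match {
      case EqualTo(l, r) => EqualTo(transform(l)(rule), transform(r)(rule))
      case Grouping(c)   => Grouping(transform(c)(rule))
      case leaf          => leaf
    }
  }

  val groupBy: Seq[Expr] = Seq(Attr("col"))
  val having: Expr = EqualTo(Grouping(Attr("col")), Lit(1))

  // Rule without a guard for named expressions: every group-by column is replaced
  // by a fresh attribute, including the one nested inside Grouping(...).
  val rewritten: Expr = transform(having) {
    case e if groupBy.contains(e) => Attr("col_alias")
  }

  // Position of the grouping function's argument among the group-by
  // expressions, or -1 if it is absent.
  def groupingIndex(condition: Expr): Int = condition match {
    case EqualTo(Grouping(arg), _) => groupBy.indexOf(arg)
    case _                         => -1
  }

  val indexBefore: Int = groupingIndex(having)    // argument is still Attr("col")
  val indexAfter: Int = groupingIndex(rewritten)  // argument is now Attr("col_alias")
}
```

The Grouping node itself never matches the rule, but the traversal still reaches its argument. In the sketch the lookup succeeds on the original condition and returns -1 on the rewritten one.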

Contributor

What about improving the next case:

  1. add one more check: the expression should not exist in the child output
  2. if the expression is already a NamedExpression, don't alias it.
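
The two suggestions above can be sketched as a small decision function. This is one possible reading, not the merged code: "child output" is taken to mean the columns already visible to the filter, `Attr` plays the role of a NamedExpression, case-class equality stands in for `semanticEquals`, and `decide` and the `Decision` types are hypothetical names.

```scala
object ImprovedCaseSketch {
  // Toy expression tree (hypothetical, not Catalyst).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // What to do with one candidate expression found in the HAVING condition.
  sealed trait Decision
  case object LeaveAsIs extends Decision                              // nothing to push down
  final case class AddUnaliased(e: Attr) extends Decision             // already named: no new alias
  final case class AddAliased(name: String, e: Expr) extends Decision // wrap in a new alias

  def decide(e: Expr, grouping: Seq[Expr], visibleOutput: Seq[Expr]): Decision =
    // Check 1: only group-by expressions that are not already in the output qualify.
    if (!grouping.contains(e) || visibleOutput.contains(e)) LeaveAsIs
    else e match {
      // Check 2: a named expression is added as it is.
      case a: Attr => AddUnaliased(a)
      case other   => AddAliased(other.toString, other)
    }

  val grouping: Seq[Expr] = Seq(Attr("a"), Add(Attr("a"), Lit(1)))
}
```

With `grouping` as defined, `decide(Attr("a"), grouping, Seq(Attr("a")))` leaves the column alone, `decide(Attr("a"), grouping, Seq.empty)` adds it without an alias, and `decide(Add(Attr("a"), Lit(1)), grouping, Seq.empty)` adds it under a generated alias.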

Member Author

I updated this. Please take a look.


SparkQA commented Jul 28, 2016

Test build #62961 has finished for PR 14296 at commit f5e037a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 9ade77c Jul 28, 2016
asfgit pushed a commit that referenced this pull request Jul 28, 2016
…ping by column should work

## What changes were proposed in this pull request?

A query whose HAVING condition contains a group-by expression fails during analysis. E.g.,

    create table tbl(a int, b string);
    select count(b) from tbl group by a + 1 having a + 1 = 2;

The HAVING condition should be able to reference group-by expressions.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14296 from viirya/having-contains-grouping-column.

(cherry picked from commit 9ade77c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

also backport it to 2.0

Member Author

viirya commented Jul 28, 2016

Thanks for reviewing this.

@viirya viirya deleted the having-contains-grouping-column branch December 27, 2023 18:19