Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-12720] [SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets #11283

Closed
wants to merge 52 commits into from

Conversation

gatorsmile
Copy link
Member

What changes were proposed in this pull request?

This PR is for supporting SQL generation for cube, rollup and grouping sets.

For example, a query using rollup:

SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP

Original logical plan:

  Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
            [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
             (key#17L % cast(5 as bigint))#47L AS _c1#45L,
             grouping__id#46 AS _c2#44]
  +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
             List(key#17L, value#18, null, 1)],
            [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
     +- Project [key#17L,
                 value#18,
                 (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
        +- Subquery t1
           +- Relation[key#17L,value#18] ParquetRelation

Converted SQL:

  SELECT count( 1) AS `cnt`,
         (`t1`.`key` % CAST(5 AS BIGINT)),
         grouping_id() AS `_c2`
  FROM `default`.`t1`
  GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
  GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())

How was the this patch tested?

Added eight test cases in LogicalPlanToSQLSuite.

gatorsmile and others added 29 commits November 13, 2015 14:50
@SparkQA
Copy link

SparkQA commented Feb 20, 2016

Test build #51597 has finished for PR 11283 at commit bc0c030.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

plan: Aggregate,
expand: Expand,
project: Project): String = {
require(plan.groupingExpressions.length > 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think assert is better here, as if it breaks, it means something goes wrong in our system.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Thanks!

@cloud-fan
Copy link
Contributor

LGTM except some minor comments, thanks for working on it!

cc @liancheng

@gatorsmile
Copy link
Member Author

@cloud-fan Really thank you for your time and your detailed reviews!!! : )

@@ -107,6 +107,11 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi
case p: Project =>
projectToSQL(p, isDistinct = false)

case a @ Aggregate(_, _, e @ Expand(_, _, p: Project))
if sameOutput(e.output,
p.child.output ++ a.groupingExpressions.map(_.asInstanceOf[Attribute])) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry misses this: should check isInstanceOf before calling asInstanceOf directly.
We can put all of it in one method and use it as if condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing it out! I will be more careful next time.

@SparkQA
Copy link

SparkQA commented Mar 4, 2016

Test build #52433 has finished for PR 11283 at commit 6f609fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 4, 2016

Test build #52440 has finished for PR 11283 at commit 9eaca51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// a map from group by attributes to the original group by expressions.
val groupByAttrMap = AttributeMap(groupByAttributes.zip(groupByExprs))

val groupingSet = expand.projections.map { project =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Would be nice to add type annotation to groupingSet since it's a relatively complex nested data structure and can be hard to grasp.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@liancheng
Copy link
Contributor

LGTM except a few comments. Thanks for donging this!

Would you please run HiveCompatibilitySuite locally and check sql/hive/target/unit-tests.log to confirm that all queries with grouping set are correctly generated? It takes about 10~15 min to run the suite. I found that Jenkins skipped this suite.

Grouping(groupingCol.get)
} else {
throw new UnsupportedOperationException(s"unsupported operator $a")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following version might be clearer:

val groupingCol = groupByExprs.applyOrElse(
  idx, throw new UnsupportedOperationException(s"unsupported operator $a")
Grouping(groupingCol)

And I don't quite get the meaning of the exception error message...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, if the value is out of boundary, I thought we should not continue the conversion. After rethinking this, users might call grouping_id() inside such a function. Maybe we should not throw any exception. How about changing it to

groupByExprs.lift(idx).map(Grouping).getOrElse(a)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

@gatorsmile
Copy link
Member Author

@liancheng I ran the suite in my local laptop. All the related tests works. However, I hit multiple regression failure that was introduced by another PR: #11466

I will submit a separate PR to fix the issues. Thank you for your reviews!

@SparkQA
Copy link

SparkQA commented Mar 5, 2016

Test build #52508 has finished for PR 11283 at commit 385c0d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor

Thanks for running the test. Could you please name these regressions? I couldn't find them. Seems that currently Jenkins PR builder looks good.

@liancheng
Copy link
Contributor

This LGTM now. Merging to master. Thanks!

@asfgit asfgit closed this in adce5ee Mar 5, 2016
@liancheng
Copy link
Contributor

A final thought about this PR:

For SQL generation for grouping sets, we have to depend on a set of implicit assumptions related to implementation details of specific analysis rules, which makes the implementation tend to be fragile. I think maybe we can add an auxiliary logical plan operator AnnotatePlan and an expression AnnotateExpression, which can be used to annotate a sub-tree of a logical plan / expression to indicate their original forms. These nodes can be added by the analyzer and should be erased by the optimizer before doing optimizations.

Haven't thought thoroughly about this, but with the help of these annotations, I'd expect it to be easier to recognize patterns corresponding to plans / expressions like grouping set, grouping, multi-distinct aggregation etc..

@gatorsmile
Copy link
Member Author

Yeah, definitely, it helps a lot. Otherwise, toSQL needs to identify the pattern and convert it back using a few assumptions. The pattern and assumptions we made depends on the implementation of our analyzer rules.

Before we finalize the design, I will first stop working on the SQL generation of multi-distinct aggregation. Thanks! @liancheng

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…ing Sets

#### What changes were proposed in this pull request?

This PR is for supporting SQL generation for cube, rollup and grouping sets.

For example, a query using rollup:
```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
```
Original logical plan:
```
  Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
            [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
             (key#17L % cast(5 as bigint))#47L AS _c1#45L,
             grouping__id#46 AS _c2#44]
  +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
             List(key#17L, value#18, null, 1)],
            [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
     +- Project [key#17L,
                 value#18,
                 (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
        +- Subquery t1
           +- Relation[key#17L,value#18] ParquetRelation
```
Converted SQL:
```SQL
  SELECT count( 1) AS `cnt`,
         (`t1`.`key` % CAST(5 AS BIGINT)),
         grouping_id() AS `_c2`
  FROM `default`.`t1`
  GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
  GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```

#### How was the this patch tested?

Added eight test cases in `LogicalPlanToSQLSuite`.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes apache#11283 from gatorsmile/groupingSetsToSQL.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants