
[SPARK-33229][SQL] Support partial grouping analytics and concatenated grouping analytics #30144

Closed
wants to merge 38 commits

Conversation

AngersZhuuuu
Contributor

AngersZhuuuu commented Oct 24, 2020

What changes were proposed in this pull request?

Support GROUP BY with both separate grouping columns and CUBE/ROLLUP

In PostgreSQL, for example, the following queries are supported:

select a, b, c, count(1) from t group by a, b, cube (a, b, c);
select a, b, c, count(1) from t group by a, b, rollup(a, b, c);
select a, b, c, count(1) from t group by cube(a, b), rollup (a, b, c);
select a, b, c, count(1) from t group by a, b, grouping sets((a, b), (a), ());

In this PR, we do two things:

  1. Support partial grouping analytics such as group by a, cube(a, b)
  2. Support mixed grouping analytics such as group by cube(a, b), rollup(b,c)

Partial Groupings

Partial Groupings means there are both `group_expression` and `CUBE|ROLLUP|GROUPING SETS`
in the GROUP BY clause. For example:
`GROUP BY warehouse, CUBE(product, location)` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (warehouse))`.
`GROUP BY warehouse, ROLLUP(product, location)` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.
`GROUP BY warehouse, GROUPING SETS((product, location), (product), ())` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.
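
For reference, the prepend rule can be sketched in a few lines of standalone Scala (illustrative only, not the actual analyzer code; all names here are made up): the plain group-by columns are prepended to every grouping set produced by the CUBE/ROLLUP/GROUPING SETS part.

```scala
// Illustrative sketch of the partial-grouping expansion rule.
object PartialGroupingSketch {
  type GroupingSet = Seq[String]

  // Prepend the plain GROUP BY columns to every grouping set produced by the
  // CUBE/ROLLUP/GROUPING SETS part of the clause.
  def expandPartial(plainCols: Seq[String], sets: Seq[GroupingSet]): Seq[GroupingSet] =
    sets.map(set => plainCols ++ set)

  def main(args: Array[String]): Unit = {
    // CUBE(product, location) by itself expands to these four grouping sets.
    val cubeSets: Seq[GroupingSet] =
      Seq(Seq("product", "location"), Seq("product"), Seq("location"), Seq())
    // GROUP BY warehouse, CUBE(product, location)
    expandPartial(Seq("warehouse"), cubeSets).foreach(println)
    // Prints: List(warehouse, product, location), List(warehouse, product),
    //         List(warehouse, location), List(warehouse)
  }
}
```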

Concatenated Groupings

Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified
with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product
operation enables even a small number of concatenated groupings to generate a large number of final groups.
Concatenated groupings are specified simply by listing multiple `GROUPING SETS`, `CUBE`, and `ROLLUP` clauses
and separating them with commas (see the Scala sketch after these examples). For example:
`GROUP BY GROUPING SETS((warehouse), (product)), GROUPING SETS((location), (size))` is equivalent to
`GROUP BY GROUPING SETS((warehouse, location), (warehouse, size), (product, location), (product, size))`.
`GROUP BY CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which is in turn equivalent to
`GROUP BY GROUPING SETS(
    (warehouse, product, location, size), (warehouse, product, location), (warehouse, product),
    (warehouse, location, size), (warehouse, location), (warehouse),
    (product, location, size), (product, location), (product),
    (location, size), (location), ())`.
`GROUP BY order, CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
`GROUP BY order, GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which is in turn equivalent to
`GROUP BY GROUPING SETS(
    (order, warehouse, product, location, size), (order, warehouse, product, location), (order, warehouse, product),
    (order, warehouse, location, size), (order, warehouse, location), (order, warehouse),
    (order, product, location, size), (order, product, location), (order, product),
    (order, location, size), (order, location), (order))`.
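
The cross-product itself can be sketched in standalone Scala (illustrative only; strings stand in for Catalyst expressions):

```scala
// Illustrative sketch of the concatenated-groupings rule: each final grouping
// set concatenates one grouping set from the left list with one from the right.
object ConcatenatedGroupingSketch {
  type GroupingSet = Seq[String]

  def concat(left: Seq[GroupingSet], right: Seq[GroupingSet]): Seq[GroupingSet] =
    for (l <- left; r <- right) yield l ++ r

  def main(args: Array[String]): Unit = {
    // GROUPING SETS((warehouse), (product)), GROUPING SETS((location), (size))
    concat(Seq(Seq("warehouse"), Seq("product")), Seq(Seq("location"), Seq("size")))
      .foreach(println)
    // Prints: List(warehouse, location), List(warehouse, size),
    //         List(product, location), List(product, size)
  }
}
```

The number of final grouping sets is the product of the sizes of the listed groupings, which is why even a few concatenated groupings can produce many final groups.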

Why are the changes needed?

Support more flexible grouping analytics

Does this PR introduce any user-facing change?

Users can write SQL like:

select a, b, c, agg_expr() from table group by a, cube(b, c)

How was this patch tested?

Added UT

@SparkQA

SparkQA commented Oct 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34827/

@SparkQA

SparkQA commented Oct 24, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34827/

@SparkQA

SparkQA commented Oct 24, 2020

Test build #130227 has finished for PR 30144 at commit 68c3e48.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34843/

@SparkQA

SparkQA commented Oct 25, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34843/

@SparkQA

SparkQA commented Oct 25, 2020

Test build #130243 has finished for PR 30144 at commit 68c3e48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Oct 26, 2020

I'm not familiar with this mixed case, so I just want to get the whole picture of this feature first. Do other systems support this feature? And how about the other mixed cases, as follows?

postgres=# create table t(a int, b int, c int, v int);
postgres=# insert into t values (1, 1, 1, 1);
postgres=# select a, b, c, sum(v) from t group by rollup(a, b), cube(b, c);
 a | b | c | sum 
---+---+---+-----
   |   |   |   1
 1 |   | 1 |   1
 1 |   |   |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
   | 1 | 1 |   1
   |   | 1 |   1
 1 | 1 |   |   1
 1 | 1 |   |   1
 1 | 1 |   |   1
   | 1 |   |   1
(12 rows)

postgres=# select a, b, c, sum(v) from t group by rollup(a, b), cube(b, c), grouping sets(a, c);
 a | b | c | sum 
---+---+---+-----
 1 | 1 |   |   1
 1 | 1 |   |   1
 1 | 1 |   |   1
 1 | 1 |   |   1
 1 |   |   |   1
 1 |   |   |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
   | 1 | 1 |   1
   | 1 | 1 |   1
   |   | 1 |   1
   |   | 1 |   1
 1 |   | 1 |   1
 1 |   | 1 |   1
 1 |   | 1 |   1
 1 |   | 1 |   1
(24 rows)


test("SPARK-33229: Support GROUP BY use Separate columns and CUBE/ROLLUP") {
withTable("t") {
sql("CREATE TABLE t USING PARQUET AS SELECT id AS a, id AS b, id AS c FROM range(1)")
Member

Could you move these tests into SQLQueryTestSuite?

Contributor Author

Could you move these tests into SQLQueryTestSuite?

Will update this at the end, since we need to add more UTs for the mixed-case support.

@@ -151,3 +151,26 @@ object GroupingID {
if (SQLConf.get.integerGroupingIdEnabled) IntegerType else LongType
}
}

object MixedExprsWithCube {
Member

If you define extractors for the mixed case, I think we need to make them more general for extracting more complicated cases, mix of cube/rollup, mix of rollup/grouping sets, ...

Contributor Author

If you define extractors for the mixed case, I think we need to make them more general for extracting more complicated cases, mix of cube/rollup, mix of rollup/grouping sets, ...

The current code only supports one cube/rollup expression, so I just supported one cube/rollup expression here.
Since other engines support the mixed case, IMO we should and can support this feature, and it is compatible with the previous behavior.

I will update this later.

maropu changed the title from "[SPARK-33229][SQL]Support GROUP BY use Separate columns and CUBE/ROLLUP" to "[SPARK-33229][SQL] Support GROUP BY use Separate columns and CUBE/ROLLUP" on Oct 26, 2020
@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Oct 26, 2020

postgres=# select a, b, c, sum(v) from t group by rollup(a, b), cube(b, c), grouping sets(a, c);

FYI @maropu, for this sql, we should support

SELECT A, B, SUM(C) FROM TBL GROUP BY A, grouping sets(A, B) 

first. I will raise a new JIRA for this.
How about supporting mixed CUBE/ROLLUP first, and then implementing GROUPING SETS in that PR?

@SparkQA

SparkQA commented Oct 26, 2020

Test build #130281 has finished for PR 30144 at commit a3d1b60.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34880/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34882/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34884/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34882/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34880/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34884/

@SparkQA

SparkQA commented Oct 26, 2020

Test build #130284 has finished for PR 30144 at commit dc4e148.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

Any more suggestions?

Row(0, null, 0, 1) :: Row(0, null, null, 1) ::
Row(null, 0, 0, 1) :: Row(null, 0, null, 1) ::
Row(null, null, 0, 1) :: Row(null, null, null, 1) :: Nil)
checkAnswer(sql("SELECT a, b, c, count(*) FROM t GROUP BY a, CUBE(b, c)"),
Contributor

what's the semantic of it?

Contributor Author

what's the semantic of it?

If we want a dimensional analysis that always groups by a but varies the combinations of b and c, currently we need to write
group by cube(a, b, c) and then filter out the rows where a is rolled up (e.g. a IS NOT NULL) to remove the interfering data. With this patch we can just write

group by a, cube(b, c)

This set of PRs makes grouping analytics more flexible, as in PostgreSQL, and we do have this need in our analysis.
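
As a minimal runnable sketch of the difference, assuming a local SparkSession and a toy table t(a, b, c) (all names and data here are illustrative, not taken from the PR):

```scala
// Hypothetical standalone example comparing the two query shapes.
import org.apache.spark.sql.SparkSession

object PartialCubeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("partial-cube-demo").getOrCreate()
    spark.range(2).selectExpr("id AS a", "id % 2 AS b", "id % 2 AS c").createOrReplaceTempView("t")

    // Before this patch: cube over all three columns, then filter away the rows
    // where `a` was rolled up (assuming `a` itself contains no NULLs).
    spark.sql(
      """SELECT a, b, c, cnt FROM (
        |  SELECT a, b, c, count(*) AS cnt FROM t GROUP BY CUBE(a, b, c)
        |) grouped
        |WHERE a IS NOT NULL""".stripMargin).show()

    // With this patch: keep `a` as a plain grouping column and cube only (b, c).
    spark.sql("SELECT a, b, c, count(*) AS cnt FROM t GROUP BY a, CUBE(b, c)").show()

    spark.stop()
  }
}
```

Both queries return the same rows as long as a itself contains no NULLs; the second form is shorter and skips the grouping sets that the filter would discard anyway.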

}.forall(_ == true)
if (!resolved) {
None
} else if (!exprs.exists(e => e.find(_.isInstanceOf[BaseGroupingSets]).isDefined)) {
Contributor

do we need to call find? I think BaseGroupingSets can only appear in the top level.

Contributor

BTW this check can go first, as isInstanceOf[BaseGroupingSets] is cheaper to run

Contributor Author

Done

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137128 has finished for PR 30144 at commit 0046d40.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41715/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41715/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41718/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41718/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41717/

When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.
The grouping expressions and advanced aggregations can be mixed in the `GROUP BY` clause.
See more details in the `Mixed Grouping Analytics` section. When a FILTER clause is attached to
an aggregate function, only the matching.
Contributor

only the matching rows are passed to that function.

Contributor Author

Done

case other: Expression => Seq(Seq(other))
}
val selectedGroupByExprs = unmergedSelectedGroupByExprs.init
.foldLeft(unmergedSelectedGroupByExprs.last) { (x, y) =>
Contributor

why do we put unmergedSelectedGroupByExprs.last as the first one? how about

unmergedSelectedGroupByExprs.tail.foldLeft(unmergedSelectedGroupByExprs.head)...

Contributor Author

Done
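
For reference, a standalone sketch of the suggested head/tail foldLeft shape (strings stand in for Catalyst expressions; illustrative only, not the PR's actual code):

```scala
// Illustrative sketch: fold the remaining grouping-set lists onto the first one,
// so the listed groupings are combined left to right (assumes at least one list).
object FoldCrossProductSketch {
  def merge(unmerged: Seq[Seq[Seq[String]]]): Seq[Seq[String]] =
    unmerged.tail.foldLeft(unmerged.head) { (merged, next) =>
      for (x <- merged; y <- next) yield x ++ y
    }

  def main(args: Array[String]): Unit = {
    // GROUPING SETS((a), (b)) followed by GROUPING SETS((c), ())
    merge(Seq(Seq(Seq("a"), Seq("b")), Seq(Seq("c"), Seq())))
      .foreach(println)
    // Prints: List(a, c), List(a), List(b, c), List(b)
  }
}
```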

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41720/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41720/

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137130 has finished for PR 30144 at commit 8cca908.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41722/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41722/

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137137 has finished for PR 30144 at commit c7de14c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137139 has finished for PR 30144 at commit 4359aef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137141 has finished for PR 30144 at commit 133e073.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137143 has finished for PR 30144 at commit 9d1a115.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41750/

@SparkQA

SparkQA commented Apr 11, 2021

Test build #137172 has finished for PR 30144 at commit 2ae3c16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

`GROUPING SETS` under this context. For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For example,
`GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())`
and `GROUP BY warehouse, ROLLUP(warehouse), CUBE(location, size)` is equivalent to
Contributor

typo? ROLLUP(warehouse) -> ROLLUP(product)

Contributor Author

typo? ROLLUP(warehouse) -> ROLLUP(product)

yea, thanks
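
Putting the partial and concatenated rules together, the combined example in the doc snippet above can be reproduced with a small standalone sketch (strings stand in for columns; illustrative only, not the analyzer implementation):

```scala
// Illustrative sketch: prepend the plain column, then cross-product the two
// GROUPING SETS lists, yielding 2 x 4 = 8 final grouping sets.
object CombinedGroupingSketch {
  type GroupingSet = Seq[String]

  def concat(left: Seq[GroupingSet], right: Seq[GroupingSet]): Seq[GroupingSet] =
    for (l <- left; r <- right) yield l ++ r

  def main(args: Array[String]): Unit = {
    // GROUP BY warehouse,
    //          GROUPING SETS((product), ()),
    //          GROUPING SETS((location, size), (location), (size), ())
    val plain: Seq[GroupingSet] = Seq(Seq("warehouse"))
    val gs1: Seq[GroupingSet] = Seq(Seq("product"), Seq())
    val gs2: Seq[GroupingSet] = Seq(Seq("location", "size"), Seq("location"), Seq("size"), Seq())
    concat(concat(plain, gs1), gs2).foreach(println)
    // Eight grouping sets, each prefixed with warehouse.
  }
}
```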

@cloud-fan
Contributor

The last commit just fixed a typo in the doc, no need to wait for jenkins again. Thanks, merging to master!

cloud-fan closed this in 2123237 on Apr 12, 2021
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
…d grouping analytics

Closes apache#30144 from AngersZhuuuu/SPARK-33229.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>