
[SPARK-33229][SQL] Support partial grouping analytics and concatenated grouping analytics #30144

Closed
wants to merge 38 commits

Conversation

AngersZhuuuu
Contributor

AngersZhuuuu commented Oct 24, 2020

What changes were proposed in this pull request?

Support GROUP BY with both separate grouping columns and CUBE/ROLLUP

In PostgreSQL, for example, the following queries are supported:

select a, b, c, count(1) from t group by a, b, cube (a, b, c);
select a, b, c, count(1) from t group by a, b, rollup(a, b, c);
select a, b, c, count(1) from t group by cube(a, b), rollup (a, b, c);
select a, b, c, count(1) from t group by a, b, grouping sets((a, b), (a), ());

In this PR, we do two things:

  1. Support partial grouping analytics such as group by a, cube(a, b)
  2. Support mixed grouping analytics such as group by cube(a, b), rollup(b,c)

Partial Groupings

Partial Groupings means there are both `group_expression` and `CUBE|ROLLUP|GROUPING SETS`
in the GROUP BY clause. For example:
`GROUP BY warehouse, CUBE(product, location)` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (warehouse))`.
`GROUP BY warehouse, ROLLUP(product, location)` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.
`GROUP BY warehouse, GROUPING SETS((product, location), (product), ())` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.
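
For reference, the prepend rule can be sketched in a few lines of standalone Scala (illustrative only, not the actual analyzer code; all names here are made up): the plain group-by columns are prepended to every grouping set produced by the CUBE/ROLLUP/GROUPING SETS part.

```scala
// Illustrative sketch of the partial-grouping expansion rule.
object PartialGroupingSketch {
  type GroupingSet = Seq[String]

  // Prepend the plain GROUP BY columns to every grouping set produced by the
  // CUBE/ROLLUP/GROUPING SETS part of the clause.
  def expandPartial(plainCols: Seq[String], sets: Seq[GroupingSet]): Seq[GroupingSet] =
    sets.map(set => plainCols ++ set)

  def main(args: Array[String]): Unit = {
    // CUBE(product, location) by itself expands to these four grouping sets.
    val cubeSets: Seq[GroupingSet] =
      Seq(Seq("product", "location"), Seq("product"), Seq("location"), Seq())
    // GROUP BY warehouse, CUBE(product, location)
    expandPartial(Seq("warehouse"), cubeSets).foreach(println)
    // Prints: List(warehouse, product, location), List(warehouse, product),
    //         List(warehouse, location), List(warehouse)
  }
}
```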

Concatenated Groupings

Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified
with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product
operation enables even a small number of concatenated groupings to generate a large number of final groups.
Concatenated groupings are specified simply by listing multiple `GROUPING SETS`, `CUBE`, and `ROLLUP` clauses
and separating them with commas (see the Scala sketch after these examples). For example:
`GROUP BY GROUPING SETS((warehouse), (product)), GROUPING SETS((location), (size))` is equivalent to
`GROUP BY GROUPING SETS((warehouse, location), (warehouse, size), (product, location), (product, size))`.
`GROUP BY CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
`GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which is in turn equivalent to
`GROUP BY GROUPING SETS(
    (warehouse, product, location, size), (warehouse, product, location), (warehouse, product),
    (warehouse, location, size), (warehouse, location), (warehouse),
    (product, location, size), (product, location), (product),
    (location, size), (location), ())`.
`GROUP BY order, CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
`GROUP BY order, GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which is in turn equivalent to
`GROUP BY GROUPING SETS(
    (order, warehouse, product, location, size), (order, warehouse, product, location), (order, warehouse, product),
    (order, warehouse, location, size), (order, warehouse, location), (order, warehouse),
    (order, product, location, size), (order, product, location), (order, product),
    (order, location, size), (order, location), (order))`.
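
The cross-product itself can be sketched in standalone Scala (illustrative only; strings stand in for Catalyst expressions):

```scala
// Illustrative sketch of the concatenated-groupings rule: each final grouping
// set concatenates one grouping set from the left list with one from the right.
object ConcatenatedGroupingSketch {
  type GroupingSet = Seq[String]

  def concat(left: Seq[GroupingSet], right: Seq[GroupingSet]): Seq[GroupingSet] =
    for (l <- left; r <- right) yield l ++ r

  def main(args: Array[String]): Unit = {
    // GROUPING SETS((warehouse), (product)), GROUPING SETS((location), (size))
    concat(Seq(Seq("warehouse"), Seq("product")), Seq(Seq("location"), Seq("size")))
      .foreach(println)
    // Prints: List(warehouse, location), List(warehouse, size),
    //         List(product, location), List(product, size)
  }
}
```

The number of final grouping sets is the product of the sizes of the listed groupings, which is why even a few concatenated groupings can produce many final groups.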

Why are the changes needed?

Support more flexible grouping analytics

Does this PR introduce any user-facing change?

Users can write SQL like:

select a, b, c, agg_expr() from table group by a, cube(b, c)

How was this patch tested?

Added UT

@SparkQA

SparkQA commented Oct 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34827/

@SparkQA

SparkQA commented Oct 24, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34827/

@SparkQA

SparkQA commented Oct 24, 2020

Test build #130227 has finished for PR 30144 at commit 68c3e48.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34843/

@SparkQA

SparkQA commented Oct 25, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34843/

@SparkQA

SparkQA commented Oct 25, 2020

Test build #130243 has finished for PR 30144 at commit 68c3e48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Oct 26, 2020

I'm not familiar with this mixed case, so I just want to get the whole picture of this feature first. Do other systems support this feature? And how about the other mixed cases, as follows?

postgres=# create table t(a int, b int, c int, v int);
postgres=# insert into t values (1, 1, 1, 1);
postgres=# select a, b, c, sum(v) from t group by rollup(a, b), cube(b, c);
 a | b | c | sum 
---+---+---+-----
   |   |   |   1
 1 |   | 1 |   1
 1 |   |   |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
   | 1 | 1 |   1
   |   | 1 |   1
 1 | 1 |   |   1
 1 | 1 |   |   1
 1 | 1 |   |   1
   | 1 |   |   1
(12 rows)

postgres=# select a, b, c, sum(v) from t group by rollup(a, b), cube(b, c), grouping sets(a, c);
 a | b | c | sum 
---+---+---+-----
 1 | 1 |   |   1
 1 | 1 |   |   1
 1 | 1 |   |   1
 1 | 1 |   |   1
 1 |   |   |   1
 1 |   |   |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
 1 | 1 | 1 |   1
   | 1 | 1 |   1
   | 1 | 1 |   1
   |   | 1 |   1
   |   | 1 |   1
 1 |   | 1 |   1
 1 |   | 1 |   1
 1 |   | 1 |   1
 1 |   | 1 |   1
(24 rows)


test("SPARK-33229: Support GROUP BY use Separate columns and CUBE/ROLLUP") {
withTable("t") {
sql("CREATE TABLE t USING PARQUET AS SELECT id AS a, id AS b, id AS c FROM range(1)")
Member

Could you move these tests into SQLQueryTestSuite?

Contributor Author

Could you move these tests into SQLQueryTestSuite?

Will update this at the end, since we need to add more UTs for the mixed-case support.

@@ -151,3 +151,26 @@ object GroupingID {
if (SQLConf.get.integerGroupingIdEnabled) IntegerType else LongType
}
}

object MixedExprsWithCube {
Member

If you define extractors for the mixed case, I think we need to make them more general for extracting more complicated cases, mix of cube/rollup, mix of rollup/grouping sets, ...

Contributor Author

If you define extractors for the mixed case, I think we need to make them more general for extracting more complicated cases, mix of cube/rollup, mix of rollup/grouping sets, ...

The current code only supports one cube/rollup expression, so I just supported one cube/rollup expression here.
Since other engines support the mixed case, IMO we should and can support this feature, and it is compatible with the previous behavior.

I will update this later.

maropu changed the title from "[SPARK-33229][SQL]Support GROUP BY use Separate columns and CUBE/ROLLUP" to "[SPARK-33229][SQL] Support GROUP BY use Separate columns and CUBE/ROLLUP" on Oct 26, 2020
@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Oct 26, 2020

postgres=# select a, b, c, sum(v) from t group by rollup(a, b), cube(b, c), grouping sets(a, c);

FYI @maropu, for this sql, we should support

SELECT A, B, SUM(C) FROM TBL GROUP BY A, grouping sets(A, B) 

first. I will raise a new JIRA for this.
How about supporting mixed CUBE/ROLLUP first, and then implementing GROUPING SETS in that PR?

@SparkQA

SparkQA commented Oct 26, 2020

Test build #130281 has finished for PR 30144 at commit a3d1b60.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34880/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34882/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34884/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34882/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34880/

@SparkQA

SparkQA commented Oct 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34884/

@SparkQA

SparkQA commented Oct 26, 2020

Test build #130284 has finished for PR 30144 at commit dc4e148.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

Any more suggestions?

Row(0, null, 0, 1) :: Row(0, null, null, 1) ::
Row(null, 0, 0, 1) :: Row(null, 0, null, 1) ::
Row(null, null, 0, 1) :: Row(null, null, null, 1) :: Nil)
checkAnswer(sql("SELECT a, b, c, count(*) FROM t GROUP BY a, CUBE(b, c)"),
Contributor

what's the semantic of it?

Contributor Author

what's the semantic of it?

If we want a dimensional analysis that always groups by a but varies the combinations of b and c, currently we need to write
group by cube(a, b, c) and then filter out the rows where a is rolled up (e.g. a IS NOT NULL) to remove the interfering data. With this patch we can just write

group by a, cube(b, c)

This set of PRs makes grouping analytics more flexible, as in PostgreSQL, and we do have this need in our analysis.
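
As a minimal runnable sketch of the difference, assuming a local SparkSession and a toy table t(a, b, c) (all names and data here are illustrative, not taken from the PR):

```scala
// Hypothetical standalone example comparing the two query shapes.
import org.apache.spark.sql.SparkSession

object PartialCubeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("partial-cube-demo").getOrCreate()
    spark.range(2).selectExpr("id AS a", "id % 2 AS b", "id % 2 AS c").createOrReplaceTempView("t")

    // Before this patch: cube over all three columns, then filter away the rows
    // where `a` was rolled up (assuming `a` itself contains no NULLs).
    spark.sql(
      """SELECT a, b, c, cnt FROM (
        |  SELECT a, b, c, count(*) AS cnt FROM t GROUP BY CUBE(a, b, c)
        |) grouped
        |WHERE a IS NOT NULL""".stripMargin).show()

    // With this patch: keep `a` as a plain grouping column and cube only (b, c).
    spark.sql("SELECT a, b, c, count(*) AS cnt FROM t GROUP BY a, CUBE(b, c)").show()

    spark.stop()
  }
}
```

Both queries return the same rows as long as a itself contains no NULLs; the second form is shorter and skips the grouping sets that the filter would discard anyway.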

}.forall(_ == true)
if (!resolved) {
None
} else if (!exprs.exists(e => e.find(_.isInstanceOf[BaseGroupingSets]).isDefined)) {
Contributor

do we need to call find? I think BaseGroupingSets can only appear in the top level.

Contributor

BTW this check can go first, as isInstanceOf[BaseGroupingSets] is cheaper to run

Contributor Author

Done

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137128 has finished for PR 30144 at commit 0046d40.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41715/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41715/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41718/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41718/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41717/

When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.
The grouping expressions and advanced aggregations can be mixed in the `GROUP BY` clause.
See more details in the `Mixed Grouping Analytics` section. When a FILTER clause is attached to
an aggregate function, only the matching.
Contributor

only the matching rows are passed to that function.

Contributor Author

Done

case other: Expression => Seq(Seq(other))
}
val selectedGroupByExprs = unmergedSelectedGroupByExprs.init
.foldLeft(unmergedSelectedGroupByExprs.last) { (x, y) =>
Contributor

why do we put unmergedSelectedGroupByExprs.last as the first one? how about

unmergedSelectedGroupByExprs.tail.foldLeft(unmergedSelectedGroupByExprs.head)...

Contributor Author

Done
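
For reference, a standalone sketch of the suggested head/tail foldLeft shape (strings stand in for Catalyst expressions; illustrative only, not the PR's actual code):

```scala
// Illustrative sketch: fold the remaining grouping-set lists onto the first one,
// so the listed groupings are combined left to right (assumes at least one list).
object FoldCrossProductSketch {
  def merge(unmerged: Seq[Seq[Seq[String]]]): Seq[Seq[String]] =
    unmerged.tail.foldLeft(unmerged.head) { (merged, next) =>
      for (x <- merged; y <- next) yield x ++ y
    }

  def main(args: Array[String]): Unit = {
    // GROUPING SETS((a), (b)) followed by GROUPING SETS((c), ())
    merge(Seq(Seq(Seq("a"), Seq("b")), Seq(Seq("c"), Seq())))
      .foreach(println)
    // Prints: List(a, c), List(a), List(b, c), List(b)
  }
}
```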

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41720/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41720/

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137130 has finished for PR 30144 at commit 8cca908.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41722/

@SparkQA

SparkQA commented Apr 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41722/

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137137 has finished for PR 30144 at commit c7de14c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137139 has finished for PR 30144 at commit 4359aef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137141 has finished for PR 30144 at commit 133e073.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2021

Test build #137143 has finished for PR 30144 at commit 9d1a115.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41750/

@SparkQA

SparkQA commented Apr 11, 2021

Test build #137172 has finished for PR 30144 at commit 2ae3c16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

`GROUPING SETS` under this context. For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For example,
`GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())`
and `GROUP BY warehouse, ROLLUP(warehouse), CUBE(location, size)` is equivalent to
Contributor

typo? ROLLUP(warehouse) -> ROLLUP(product)

Contributor Author

typo? ROLLUP(warehouse) -> ROLLUP(product)

yea, thanks
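
Putting the partial and concatenated rules together, the combined example in the doc snippet above can be reproduced with a small standalone sketch (strings stand in for columns; illustrative only, not the analyzer implementation):

```scala
// Illustrative sketch: prepend the plain column, then cross-product the two
// GROUPING SETS lists, yielding 2 x 4 = 8 final grouping sets.
object CombinedGroupingSketch {
  type GroupingSet = Seq[String]

  def concat(left: Seq[GroupingSet], right: Seq[GroupingSet]): Seq[GroupingSet] =
    for (l <- left; r <- right) yield l ++ r

  def main(args: Array[String]): Unit = {
    // GROUP BY warehouse,
    //          GROUPING SETS((product), ()),
    //          GROUPING SETS((location, size), (location), (size), ())
    val plain: Seq[GroupingSet] = Seq(Seq("warehouse"))
    val gs1: Seq[GroupingSet] = Seq(Seq("product"), Seq())
    val gs2: Seq[GroupingSet] = Seq(Seq("location", "size"), Seq("location"), Seq("size"), Seq())
    concat(concat(plain, gs1), gs2).foreach(println)
    // Eight grouping sets, each prefixed with warehouse.
  }
}
```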

@cloud-fan
Contributor

The last commit just fixed a typo in the doc, no need to wait for jenkins again. Thanks, merging to master!

cloud-fan closed this in 2123237 on Apr 12, 2021
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
…d grouping analytics

Closes apache#30144 from AngersZhuuuu/SPARK-33229.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>