[SPARK-45929][SQL] Support groupingSets operation in dataframe api #43813

Closed

JacobZheng0927 wants to merge 4 commits into apache:master from JacobZheng0927:SPARK-45929

Contributor

JacobZheng0927 commented Nov 15, 2023

What changes were proposed in this pull request?

Add groupingSets method in dataset api.

select col1, col2, col3, sum(col4) FROM t GROUP BY col1, col2, col3 GROUPING SETS ((col1, col2), ())
This SQL can be equivalently replaced with the following code:
df.groupingSets(Seq(Seq("col1", "col2"), Seq()), "col1", "col2", "col3").sum("col4")
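As a rough illustration of the equivalence claimed above, the sketch below runs both forms and compares their rows. It uses the `Column`-based signature that appears in the review suggestions on this PR (`groupingSets: Seq[Seq[Column]], cols: Column*`) rather than string column names. The local `SparkSession`, the sample rows, and the object and method names are assumptions made for the example, and it presumes a Spark build that already contains this change.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object GroupingSetsEquivalence {
  // Hypothetical sample data; only the column names come from the PR description.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("a", "x", "p", 1), ("a", "x", "q", 2), ("b", "y", "p", 3))
      .toDF("col1", "col2", "col3", "col4")
  }

  // DataFrame API form: grouping sets first, then the GROUP BY columns as varargs.
  def viaApi(df: DataFrame): DataFrame =
    df.groupingSets(
        Seq(Seq(col("col1"), col("col2")), Seq.empty[Column]),
        col("col1"), col("col2"), col("col3"))
      .agg(sum(col("col4")))

  // SQL form of the same query, run against a temporary view named `t`.
  def viaSql(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("t")
    spark.sql(
      """SELECT col1, col2, col3, sum(col4)
        |FROM t
        |GROUP BY col1, col2, col3 GROUPING SETS ((col1, col2), ())""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("grouping-sets-sketch")
      .getOrCreate()
    try {
      val df = sampleData(spark)
      val apiRows = viaApi(df).collect().toSet
      val sqlRows = viaSql(spark, df).collect().toSet
      // Both forms are expected to yield the same set of rows.
      assert(apiRows == sqlRows)
      viaApi(df).show()
    } finally {
      spark.stop()
    }
  }
}
```

Because `col3` is listed as a grouping column but belongs to neither grouping set, it should come back as null in every output row; the empty set `()` contributes the grand-total row.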

Why are the changes needed?

Currently, grouping sets can only be used through Spark SQL. The feature is not available when developing with the Dataset API.

Does this PR introduce any user-facing change?

Yes. This PR introduces the use of groupingSets in the dataset api.

How was this patch tested?

Tests added in DataFrameAggregateSuite.scala.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label

JacobZheng0927 changed the title ~~[SPARK-45898][SQL] Support groupingSets operation in dataframe api~~ [SPARK-45929][SQL] Support groupingSets operation in dataframe api

JacobZheng0927 force-pushed the SPARK-45929 branch from 27a65ed to e7f71c7 Compare

November 16, 2023 07:59

HyukjinKwon reviewed

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala Outdated

Member

HyukjinKwon commented Nov 20, 2023

cc @zhengruifeng FYI

zhengruifeng reviewed

Contributor

zhengruifeng left a comment

looks good, but I would like to defer it to @cloud-fan

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala Outdated

Comment on lines 1849 to 1850

Contributor

zhengruifeng Nov 20, 2023

Suggested change

      
-                                groupingSets: Seq[Seq[Column]],
-                                cols: Column*): RelationalGroupedDataset = {
+                  groupingSets: Seq[Seq[Column]],
+                  cols: Column*): RelationalGroupedDataset = {

Contributor Author

JacobZheng0927 Nov 20, 2023

Fixed. Thanks.

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala Outdated

Comment on lines 2014 to 2016

Contributor

zhengruifeng Nov 20, 2023

Suggested change

      
-                                groupingSets: Seq[Seq[String]],
-                                col1: String,
-                                cols: String*): RelationalGroupedDataset = {
+                  groupingSets: Seq[Seq[String]],
+                  col1: String,
+                  cols: String*): RelationalGroupedDataset = {

cloud-fan reviewed

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala Outdated

Contributor

cloud-fan Nov 20, 2023

Shall we stop adding string column overloads for new APIs? also cc @HyukjinKwon

Member

HyukjinKwon Nov 20, 2023

👍

Contributor Author

JacobZheng0927 Nov 20, 2023

Are you saying that the methods using string columns are no longer needed, and that I only need to keep the methods using columns of type Column?

Contributor

cloud-fan Nov 20, 2023

Yes

Contributor Author

JacobZheng0927 Nov 20, 2023

Got it. This API has been removed.
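With the string overload dropped, callers that only have column names can still reach the `Column`-based method by mapping the names through `functions.col`. The helper below is a hypothetical convenience wrapper written for illustration; it is not part of Spark or of this PR, and its name and shape are assumptions.

```scala
import org.apache.spark.sql.{Column, DataFrame, RelationalGroupedDataset}
import org.apache.spark.sql.functions.col

object GroupingSetsByName {
  // Hypothetical wrapper (not a Spark API): accepts column names and
  // forwards them to the Column-based Dataset.groupingSets.
  def groupingSetsByName(
      df: DataFrame,
      groupingSets: Seq[Seq[String]],
      cols: String*): RelationalGroupedDataset = {
    // Convert every name in every grouping set to a Column.
    val sets: Seq[Seq[Column]] = groupingSets.map(_.map(col))
    // Expand the converted GROUP BY columns back into varargs.
    df.groupingSets(sets, cols.map(col): _*)
  }
}
```

A call such as `GroupingSetsByName.groupingSetsByName(df, Seq(Seq("col1", "col2"), Seq()), "col1", "col2", "col3").sum("col4")` would then mirror the string-based example from the PR description.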

cloud-fan reviewed

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala Outdated

Contributor

cloud-fan Nov 20, 2023

The SQL syntax is GROUP BY ... GROUPING SETS (...), shall we put grouping cols as the first parameter?

Contributor Author

JacobZheng0927 Nov 20, 2023

This is because the * parameter must be placed at the end.

Contributor

cloud-fan Nov 20, 2023

oh I see, then no problem
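The constraint discussed in this thread can be seen in a small standalone sketch. `Col` and `groupingSetsSql` below are hypothetical stand-ins so the example runs without Spark; only the parameter order mirrors the method under review.

```scala
object VarargsOrder {
  // Hypothetical stand-in for Spark's Column so the sketch runs without Spark.
  final case class Col(name: String)

  // Compiles: the repeated parameter `cols` is the last one in its parameter list.
  def groupingSetsSql(sets: Seq[Seq[Col]], cols: Col*): String =
    s"GROUP BY ${cols.map(_.name).mkString(", ")} GROUPING SETS " +
      sets.map(_.map(_.name).mkString("(", ", ", ")")).mkString("(", ", ", ")")

  // Rejected by the compiler, because a repeated parameter must come last:
  // def groupingSetsSql(cols: Col*, sets: Seq[Seq[Col]]): String = ???

  def main(args: Array[String]): Unit = {
    val sql = groupingSetsSql(
      Seq(Seq(Col("col1"), Col("col2")), Seq.empty),
      Col("col1"), Col("col2"), Col("col3"))
    // prints "GROUP BY col1, col2, col3 GROUPING SETS ((col1, col2), ())"
    println(sql)
  }
}
```

Scala only requires the repeated parameter to be last within its own parameter list, so a second parameter list would be another way to put the grouping columns first; the thread settles on a single list with the grouping sets leading.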

cloud-fan approved these changes

Contributor

cloud-fan left a comment

I thought we already have this API... good catch!

HyukjinKwon approved these changes

Member

HyukjinKwon commented Nov 20, 2023

I think you should rebase against the latest master branch to fix up the test failure.

JacobZheng0927 added 4 commits

November 20, 2023 19:04


          Support groupingSets operation in dataframe api

db54ea0


          add groupingSets since 4.0.0

ea9f7e1


          add groupingSets since 4.0.0

529bc30


          remove groupingSets function with String column

ef1117b

JacobZheng0927 force-pushed the SPARK-45929 branch from 04da8b4 to ef1117b Compare

November 20, 2023 11:05

cloud-fan approved these changes

Member

HyukjinKwon commented Nov 21, 2023

Merged to master.

HyukjinKwon closed this in

847199f

HyukjinKwon mentioned this pull request

[SPARK-46048][PYTHON][SQL] Support DataFrame.groupingSets in PySpark #43951

Closed

HyukjinKwon added a commit that referenced this pull request


          [SPARK-46048][PYTHON][SQL] Support DataFrame.groupingSets in PySpark

90470ff

### What changes were proposed in this pull request?

#43813 added the Scala API `DataFrame.groupingSets`. This PR proposes adding the same API to PySpark (non-Spark Connect for now).

### Why are the changes needed?

For feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new Python API `DataFrame.groupingSets` that is equivalent to [`GROUPING SETS`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html).

### How was this patch tested?

Doctests were added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43951 from HyukjinKwon/SPARK-45929-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

HyukjinKwon mentioned this pull request

[SPARK-46085][CONNECT] Dataset.groupingSets in Scala Spark Connect client #43995

Closed

dongjoon-hyun pushed a commit that referenced this pull request


          [SPARK-46085][CONNECT] Dataset.groupingSets in Scala Spark Connect client

5211f6b

### What changes were proposed in this pull request?

This PR proposes to add the `Dataset.groupingSets` API, introduced in #43813, to the Scala Spark Connect client.

### Why are the changes needed?

For feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new API to Scala Spark Connect client.

### How was this patch tested?

A unit test was added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43995 from HyukjinKwon/SPARK-46085.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

zml1206 pushed a commit to zml1206/spark that referenced this pull request


          [SPARK-45929][SQL] Support groupingSets operation in dataframe api

f9688b2

### What changes were proposed in this pull request?
Add groupingSets method in dataset api.

`select col1, col2, col3, sum(col4) FROM t GROUP BY col1, col2, col3 GROUPING SETS ((col1, col2), ())`
This SQL can be equivalently replaced with the following code:
`df.groupingSets(Seq(Seq("col1", "col2"), Seq()), "col1", "col2", "col3").sum("col4")`

### Why are the changes needed?
Currently, grouping sets can only be used through Spark SQL. The feature is not available when developing with the Dataset API.

### Does this PR introduce _any_ user-facing change?
Yes. This PR introduces the use of groupingSets in the dataset api.

### How was this patch tested?
Tests added in `DataFrameAggregateSuite.scala`.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#43813 from JacobZheng0927/SPARK-45929.

Authored-by: JacobZheng0927 <zsh517559523@163.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
