[SPARK-45929][SQL] Support groupingSets operation in dataframe api #43813

Closed
JacobZheng0927 wants to merge 4 commits into apache:master from JacobZheng0927:SPARK-45929

@JacobZheng0927
Contributor

What changes were proposed in this pull request?

Add a groupingSets method to the Dataset API.

SELECT col1, col2, col3, sum(col4) FROM t GROUP BY col1, col2, col3 GROUPING SETS ((col1, col2), ())
This SQL is equivalent to the following code:
df.groupingSets(Seq(Seq("col1", "col2"), Seq()), "col1", "col2", "col3").sum("col4")
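To make the semantics of the query above concrete, here is a minimal sketch in plain Scala that emulates GROUPING SETS ((col1, col2), ()) over an in-memory table. It is not Spark code: the `Row` and `Out` case classes, the sample rows, and the `groupingSets` helper are all hypothetical names introduced only for illustration. Each grouping set produces its own aggregated rows, and grouping columns that are not part of the active set come back as `None` (SQL NULL).

```scala
// Hypothetical in-memory model of the query; this is not the Spark implementation.
object GroupingSetsSketch {
  // One input row of the illustrative table t.
  final case class Row(col1: String, col2: String, col3: String, col4: Long)

  // One output row: grouping columns outside the active grouping set are None (SQL NULL).
  final case class Out(col1: Option[String], col2: Option[String], col3: Option[String], sum: Long)

  // Emulates GROUP BY col1, col2, col3 GROUPING SETS (sets...) with sum(col4).
  // Assumes non-empty input; SQL's behaviour for the empty set () on an empty table is not modelled.
  def groupingSets(rows: Seq[Row], sets: Seq[Set[String]]): Seq[Out] =
    sets.flatMap { set =>
      // Keep a column's value only if the column belongs to the current grouping set.
      def keep(name: String, value: String): Option[String] =
        if (set.contains(name)) Some(value) else None

      rows
        .groupBy(r => (keep("col1", r.col1), keep("col2", r.col2), keep("col3", r.col3)))
        .map { case ((c1, c2, c3), group) => Out(c1, c2, c3, group.map(_.col4).sum) }
        .toSeq
    }

  // Made-up sample data.
  val rows: Seq[Row] = Seq(
    Row("a", "x", "p", 1L),
    Row("a", "x", "q", 2L),
    Row("b", "y", "p", 3L)
  )

  // GROUPING SETS ((col1, col2), ())
  val result: Seq[Out] = groupingSets(rows, Seq(Set("col1", "col2"), Set.empty[String]))
}
```

With this sample data the (col1, col2) set yields one row per distinct pair and the empty set yields a single grand-total row; col3 is always `None` because it appears in neither grouping set.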

Why are the changes needed?

Currently, grouping sets can only be used in Spark SQL; the feature is not available in the Dataset API.

Does this PR introduce any user-facing change?

Yes. This PR adds groupingSets to the Dataset API.

How was this patch tested?

Tests added in DataFrameAggregateSuite.scala.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 15, 2023
@JacobZheng0927 JacobZheng0927 changed the title from "[SPARK-45898][SQL] Support groupingSets operation in dataframe api" to "[SPARK-45929][SQL] Support groupingSets operation in dataframe api" Nov 16, 2023
@HyukjinKwon
Member

cc @zhengruifeng FYI

Contributor

@zhengruifeng zhengruifeng left a comment

looks good, but I would like to defer it to @cloud-fan

Comment on lines 1849 to 1850
Contributor

Suggested change
groupingSets: Seq[Seq[Column]],
cols: Column*): RelationalGroupedDataset = {
groupingSets: Seq[Seq[Column]],
cols: Column*): RelationalGroupedDataset = {

Contributor Author

Fixed. Thanks.

Comment on lines 2014 to 2016
Contributor

Suggested change
groupingSets: Seq[Seq[String]],
col1: String,
cols: String*): RelationalGroupedDataset = {
groupingSets: Seq[Seq[String]],
col1: String,
cols: String*): RelationalGroupedDataset = {

Contributor

Shall we stop adding string column overloads for new APIs? also cc @HyukjinKwon

Member

👍

Contributor Author

Are you saying that the methods using string columns are no longer needed, and that I only need to keep the methods using columns of type Column?

Contributor

Yes

Contributor Author

Got it. This API has been removed.
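With the string overload removed, callers pass Column values instead. The following is a sketch only, assuming a Spark build that includes this change and the signature quoted in the review suggestion above (`groupingSets: Seq[Seq[Column]], cols: Column*`). The object name, the local session settings, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Illustrative wrapper; the object and method names are not part of Spark.
object GroupingSetsColumnExample {
  def run(): DataFrame = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("grouping-sets-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with the column names used in the PR description.
    val df = Seq(
      ("a", "x", "p", 1L),
      ("a", "x", "q", 2L),
      ("b", "y", "p", 3L)
    ).toDF("col1", "col2", "col3", "col4")

    // GROUPING SETS ((col1, col2), ()) expressed with Column values.
    val sets: Seq[Seq[Column]] = Seq(Seq(col("col1"), col("col2")), Seq.empty[Column])

    // Grouping sets come first; the GROUP BY columns follow as varargs.
    df.groupingSets(sets, col("col1"), col("col2"), col("col3")).sum("col4")
  }
}
```

Calling `GroupingSetsColumnExample.run().show()` should print one row per (col1, col2) pair plus a grand-total row.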

Contributor

The SQL syntax is GROUP BY ... GROUPING SETS (...), shall we put grouping cols as the first parameter?

Contributor Author

This is because the varargs (*) parameter must come last in the parameter list.

Contributor

oh I see, then no problem
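The ordering constraint discussed in this thread is a Scala language rule: a repeated (`*`) parameter must be the last one in its parameter list. A small sketch in plain Scala, where `VarargsOrderSketch` and `describe` are hypothetical names that only render the equivalent SQL clause as a string:

```scala
// Hypothetical helper illustrating parameter order; it does not call Spark.
object VarargsOrderSketch {
  // Compiles: the repeated parameter `cols` is the last one in its parameter list.
  def describe(groupingSets: Seq[Seq[String]], cols: String*): String = {
    val sets = groupingSets.map(_.mkString("(", ", ", ")")).mkString(", ")
    val groupBy = cols.mkString(", ")
    s"GROUP BY $groupBy GROUPING SETS ($sets)"
  }

  // Would not compile if uncommented, because `cols: String*` is not last:
  // def describeReversed(cols: String*, groupingSets: Seq[Seq[String]]): String = ???

  // The call site lists the grouping sets first, even though SQL writes them last.
  val rendered: String =
    describe(Seq(Seq("col1", "col2"), Seq.empty[String]), "col1", "col2", "col3")
}
```

This is why the method takes the grouping sets as its first argument while the SQL syntax puts the GROUP BY columns first.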

Contributor

@cloud-fan cloud-fan left a comment

I thought we already have this API... good catch!

@HyukjinKwon
Member

I think you should rebase against the latest master branch to fix up the test failure.

@HyukjinKwon
Member

Merged to master.

HyukjinKwon added a commit that referenced this pull request Nov 22, 2023
### What changes were proposed in this pull request?

#43813 added the Scala API `DataFrame.groupingSets`. This PR proposes to add the same API to PySpark (non-Spark Connect for now).

### Why are the changes needed?

For feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new Python API `DataFrame.groupingSets` that is equivalent to [`GROUPING SETS`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html).

### How was this patch tested?

Doctests were added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43951 from HyukjinKwon/SPARK-45929-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Nov 25, 2023
…ient

### What changes were proposed in this pull request?

This PR proposes to add the `Dataset.groupingSets` API, introduced in #43813, to the Scala Spark Connect client.

### Why are the changes needed?

For feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new API to the Scala Spark Connect client.

### How was this patch tested?

A unit test was added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43995 from HyukjinKwon/SPARK-46085.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
zml1206 pushed a commit to zml1206/spark that referenced this pull request May 7, 2025
### What changes were proposed in this pull request?
Add a `groupingSets` method to the Dataset API.

`SELECT col1, col2, col3, sum(col4) FROM t GROUP BY col1, col2, col3 GROUPING SETS ((col1, col2), ())`
This SQL is equivalent to the following code:
`df.groupingSets(Seq(Seq("col1", "col2"), Seq()), "col1", "col2", "col3").sum("col4")`

### Why are the changes needed?
Currently, grouping sets can only be used in Spark SQL; the feature is not available in the Dataset API.

### Does this PR introduce _any_ user-facing change?
Yes. This PR adds `groupingSets` to the Dataset API.

### How was this patch tested?
Tests added in `DataFrameAggregateSuite.scala`.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#43813 from JacobZheng0927/SPARK-45929.

Authored-by: JacobZheng0927 <zsh517559523@163.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>