[SPARK-45929][SQL] Support groupingSets operation in dataframe api#43813
JacobZheng0927 wants to merge 4 commits into apache:master
cc @zhengruifeng FYI
zhengruifeng
left a comment
looks good, but I would like to defer it to @cloud-fan
    groupingSets: Seq[Seq[Column]],
    cols: Column*): RelationalGroupedDataset = {
Fixed. Thanks.
    groupingSets: Seq[Seq[String]],
    col1: String,
    cols: String*): RelationalGroupedDataset = {
Shall we stop adding string column overloads for new APIs? Also cc @HyukjinKwon
Are you saying that the methods using string columns are no longer needed, and that I only need to keep the methods using columns of type Column?
Got it. This API has been removed.
The SQL syntax is `GROUP BY ... GROUPING SETS (...)`. Shall we put the grouping cols as the first parameter?
This is because the varargs (`*`) parameter must be the last one in the parameter list.
oh I see, then no problem
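To illustrate the ordering constraint discussed above, here is a minimal sketch in plain Scala. `Col` and `VarargsOrder` are hypothetical stand-ins, not Spark classes; only the shape of the parameter list mirrors the signature under review.

```scala
object VarargsOrder {
  // Hypothetical stand-in for a column reference; not Spark's Column.
  final case class Col(name: String)

  // Accepted: the repeated parameter `cols: Col*` is the last one in the list,
  // so the grouping sets have to come first.
  def groupingSets(sets: Seq[Seq[Col]], cols: Col*): (Int, Int) =
    (sets.size, cols.size)

  // Rejected by the compiler if uncommented, because a repeated parameter
  // may only appear in the last position:
  // def groupingSets(cols: Col*, sets: Seq[Seq[Col]]): (Int, Int) = ???

  def main(args: Array[String]): Unit = {
    val a = Col("a")
    val b = Col("b")
    // Two grouping sets, (a, b) and (), over the grouping columns a and b.
    println(groupingSets(Seq(Seq(a, b), Seq.empty[Col]), a, b))
  }
}
```

This is why the call site reads "sets first, columns last", the reverse of the SQL clause order.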
cloud-fan
left a comment
I thought we already have this API... good catch!
I think you should rebase against the latest master branch to fix up the test failure.
Merged to master.
### What changes were proposed in this pull request?
#43813 added Scala API of `DataFrame.groupingSets`. This PR proposes to have the same API in PySpark (non-Spark Connect for now).
### Why are the changes needed?
For feature parity.
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new Python API `DataFrame.groupingSets` that is equivalent to [`GROUPING SETS`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html).
### How was this patch tested?
Doctests were added.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43951 from HyukjinKwon/SPARK-45929-followup.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ient
### What changes were proposed in this pull request?
This PR proposes to add the `Dataset.groupingSets` API, introduced in #43813, to the Scala Spark Connect client.
### Why are the changes needed?
For feature parity.
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new API to the Scala Spark Connect client.
### How was this patch tested?
A unit test was added.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43995 from HyukjinKwon/SPARK-46085.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add a `groupingSets` method to the Dataset API.
`SELECT col1, col2, col3, sum(col4) FROM t GROUP BY col1, col2, col3 GROUPING SETS ((col1, col2), ())`
This SQL can be equivalently replaced with the following code, using the `Column`-based signature that was merged (`col` comes from `org.apache.spark.sql.functions`):
`df.groupingSets(Seq(Seq(col("col1"), col("col2")), Seq()), col("col1"), col("col2"), col("col3")).sum("col4")`
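For readers unfamiliar with what the grouping sets `((col1, col2), ())` over `col1, col2, col3` compute, the following is a rough sketch of the semantics in plain Scala collections. It does not use Spark and is not how Spark implements the operation; `Row`, `GroupingSetsSketch`, and the sample data are hypothetical, and edge cases such as empty input are ignored.

```scala
object GroupingSetsSketch {
  // Hypothetical row: grouping-column values by name, plus the value to sum.
  final case class Row(keys: Map[String, String], col4: Long)

  // For each grouping set, group the rows by the columns in that set and sum col4.
  // Grouping columns that are not part of the set are reported as None,
  // playing the role of SQL NULL in the output.
  def groupingSets(rows: Seq[Row], sets: Seq[Seq[String]], cols: String*): Seq[(Seq[Option[String]], Long)] =
    sets.flatMap { set =>
      rows
        .groupBy(r => cols.map(c => if (set.contains(c)) r.keys.get(c) else None))
        .map { case (key, group) => (key, group.map(_.col4).sum) }
        .toSeq
    }

  // Made-up sample data for illustration only.
  val sample: Seq[Row] = Seq(
    Row(Map("col1" -> "a", "col2" -> "x", "col3" -> "p"), 1L),
    Row(Map("col1" -> "a", "col2" -> "x", "col3" -> "q"), 2L),
    Row(Map("col1" -> "b", "col2" -> "y", "col3" -> "p"), 3L)
  )

  def main(args: Array[String]): Unit = {
    val result = groupingSets(
      sample,
      Seq(Seq("col1", "col2"), Seq.empty[String]),
      "col1", "col2", "col3")
    // One output row per (col1, col2) pair, plus one grand-total row.
    result.foreach(println)
  }
}
```

With the sample data, the `(col1, col2)` set yields one sum per distinct pair and the empty set yields a single grand total; `col3` is never part of a set, so it is always empty in the output.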
### Why are the changes needed?
Currently, grouping sets can only be used in Spark SQL; they are not available when developing with the Dataset API.
### Does this PR introduce _any_ user-facing change?
Yes. This PR introduces `groupingSets` in the Dataset API.
### How was this patch tested?
Tests added in `DataFrameAggregateSuite.scala`.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#43813 from JacobZheng0927/SPARK-45929.
Authored-by: JacobZheng0927 <zsh517559523@163.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>