New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33229][SQL] Support partial grouping analytics and concatenated grouping analytics #30144
Conversation
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #130227 has finished for PR 30144 at commit
|
retest this please |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #130243 has finished for PR 30144 at commit
|
I'm not familiar with this mixed case, so I just want to know a whole picture of this feature first. The other systems support this feature? And, how about the other mixed cases as follows?
|
|
||
test("SPARK-33229: Support GROUP BY use Separate columns and CUBE/ROLLUP") { | ||
withTable("t") { | ||
sql("CREATE TABLE t USING PARQUET AS SELECT id AS a, id AS b, id AS c FROM range(1)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move these tests into SQLQueryTestSuite
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move these tests into
SQLQueryTestSuite
?
Update this at the end, since we need to add more UT about support mixed case.
@@ -151,3 +151,26 @@ object GroupingID { | |||
if (SQLConf.get.integerGroupingIdEnabled) IntegerType else LongType | |||
} | |||
} | |||
|
|||
object MixedExprsWithCube { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you define extractors for the mixed case, I think we need to make them more general for extracting more complicated cases, mix of cube/rollup, mix of rollup/grouping sets, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you define extractors for the mixed case, I think we need to make them more general for extracting more complicated cases, mix of cube/rollup, mix of rollup/grouping sets, ...
Since current code only support one cube/rollup expr, so I just support one cube/rollup expr.
Since other engine support mixed case, IMO, we should and we can support these feature and it's compatible with the previous behavior。
I will update this later.
FYI @maropu, for this sql, we should support
first, I will rase a new jira for this. |
Test build #130281 has finished for PR 30144 at commit
|
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test status success |
Kubernetes integration test status success |
Test build #130284 has finished for PR 30144 at commit
|
Any more suggestion? |
Row(0, null, 0, 1) :: Row(0, null, null, 1) :: | ||
Row(null, 0, 0, 1) :: Row(null, 0, null, 1) :: | ||
Row(null, null, 0, 1) :: Row(null, null, null, 1) :: Nil) | ||
checkAnswer(sql("SELECT a, b, c, count(*) FROM t GROUP BY a, CUBE(b, c)"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the semantic of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the semantic of it?
If we want some dimensional analysis group by a
and different dimensional about combine b
& c
, in current we need to write
group by cube(a, b, c)
and where a !=NULL
to remove interfering data, with this patch we can just write
group by a, cube(b, c)
And this set of PR can make Grouping Analytics more flexible as Postgres SQL. And we do have this need for analysis。
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
Outdated
Show resolved
Hide resolved
}.forall(_ == true) | ||
if (!resolved) { | ||
None | ||
} else if (!exprs.exists(e => e.find(_.isInstanceOf[BaseGroupingSets]).isDefined)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to call find
? I think BaseGroupingSets
can only appear in the top level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW this check can go first, as isInstanceOf[BaseGroupingSets]
is cheaper to run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DOne
Test build #137128 has finished for PR 30144 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test unable to build dist. exiting with code: 1 |
When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function. | ||
The grouping expressions and advanced aggregations can be mixed in the `GROUP BY` clause. | ||
See more details in the `Mixed Grouping Analytics` section. When a FILTER clause is attached to | ||
an aggregate function, only the matching. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only the matching rows are passed to that function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
case other: Expression => Seq(Seq(other)) | ||
} | ||
val selectedGroupByExprs = unmergedSelectedGroupByExprs.init | ||
.foldLeft(unmergedSelectedGroupByExprs.last) { (x, y) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we put unmergedSelectedGroupByExprs.last
as the first one? how about
unmergedSelectedGroupByExprs.tail.foldLeft(unmergedSelectedGroupByExprs.head)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #137130 has finished for PR 30144 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #137137 has finished for PR 30144 at commit
|
Test build #137139 has finished for PR 30144 at commit
|
Test build #137141 has finished for PR 30144 at commit
|
Test build #137143 has finished for PR 30144 at commit
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
Test build #137172 has finished for PR 30144 at commit
|
`GROUPING SETS` under this context. For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate | ||
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For example, | ||
`GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())` | ||
and `GROUP BY warehouse, ROLLUP(warehouse), CUBE(location, size)` is equivalent to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo? ROLLUP(warehouse)
-> ROLLUP(product)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo?
ROLLUP(warehouse)
->ROLLUP(product)
yea, thanks
The last commit just fixed a typo in the doc, no need to wait for jenkins again. Thanks, merging to master! |
…d grouping analytics ### What changes were proposed in this pull request? Support GROUP BY use Separate columns and CUBE/ROLLUP In postgres sql, it support ``` select a, b, c, count(1) from t group by a, b, cube (a, b, c); select a, b, c, count(1) from t group by a, b, rollup(a, b, c); select a, b, c, count(1) from t group by cube(a, b), rollup (a, b, c); select a, b, c, count(1) from t group by a, b, grouping sets((a, b), (a), ()); ``` In this pr, we have done two things as below: 1. Support partial grouping analytics such as `group by a, cube(a, b)` 2. Support mixed grouping analytics such as `group by cube(a, b), rollup(b,c)` *Partial Groupings* Partial Groupings means there are both `group_expression` and `CUBE|ROLLUP|GROUPING SETS` in GROUP BY clause. For example: `GROUP BY warehouse, CUBE(product, location)` is equivalent to `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (warehouse))`. `GROUP BY warehouse, ROLLUP(product, location)` is equivalent to `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`. `GROUP BY warehouse, GROUPING SETS((product, location), (producet), ())` is equivalent to `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, location), (warehouse))`. *Concatenated Groupings* Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product operation enables even a small number of concatenated groupings to generate a large number of final groups. The concatenated groupings are specified simply by listing multiple `GROUPING SETS`, `CUBES`, and `ROLLUP`, and separating them with commas. For example: `GROUP BY GROUPING SETS((warehouse), (producet)), GROUPING SETS((location), (size))` is equivalent to `GROUP BY GROUPING SETS((warehouse, location), (warehouse, size), (product, location), (product, size))`. `GROUP BY CUBE((warehouse), (producet)), ROLLUP((location), (size))` is equivalent to `GROUP BY GROUPING SETS((warehouse, product), (warehouse), (producet), ()), GROUPING SETS((location, size), (location), ())` `GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse), (product, location, size), (product, location), (product), (location, size), (location), ())`. `GROUP BY order, CUBE((warehouse), (producet)), ROLLUP((location), (size))` is equivalent to `GROUP BY order, GROUPING SETS((warehouse, product), (warehouse), (producet), ()), GROUPING SETS((location, size), (location), ())` `GROUP BY GROUPING SETS( (order, warehouse, product, location, size), (order, warehouse, product, location), (order, warehouse, product), (order, warehouse, location, size), (order, warehouse, location), (order, warehouse), (order, product, location, size), (order, product, location), (order, product), (order, location, size), (order, location), (order))`. ### Why are the changes needed? Support more flexible grouping analytics ### Does this PR introduce _any_ user-facing change? User can use sql like ``` select a, b, c, agg_expr() from table group by a, cube(b, c) ``` ### How was this patch tested? Added UT Closes apache#30144 from AngersZhuuuu/SPARK-33229. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Support GROUP BY use Separate columns and CUBE/ROLLUP
In postgres sql, it support
In this pr, we have done two things as below:
group by a, cube(a, b)
group by cube(a, b), rollup(b,c)
Partial Groupings
Concatenated Groupings
Why are the changes needed?
Support more flexible grouping analytics
Does this PR introduce any user-facing change?
User can use sql like
How was this patch tested?
Added UT