[SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated #26961
For now, I followed the Hive result because it is the simpler fix. But we might need to check the results in other DBMS-like systems, too (Oracle and DB2?). I'd like to know what Oracle returns, but I can't check because I don't have an Oracle instance. Could anyone help me and check it?
Test build #115607 has finished for PR 26961 at commit
retest this please
Test build #115613 has finished for PR 26961 at commit
Test build #115625 has finished for PR 26961 at commit
Retest this please.
cc @cloud-fan
Test build #116625 has finished for PR 26961 at commit
ok, I'll follow the pgSQL(& SQL Server) behaviour.
Test build #116686 has finished for PR 26961 at commit
retest this please
Test build #116692 has started for PR 26961 at commit
retest this please
@@ -641,11 +640,14 @@ object Expand {
      child: LogicalPlan): Expand = {
    val attrMap = groupByAttrs.zipWithIndex.toMap

    val hasDuplicateGroupingSets = groupingSetsAttrs.size !=
      groupingSetsAttrs.map(_.map(_.canonicalized).toSet).distinct.size
nit: (_.map(_.exprId).toSet)
so (k1,k2),(k2,k1)
are also duplicated?
yea, I think so;
postgres=# select k1, k2, count(v) from t group by grouping sets ((k1, k2), (k2, k1));
 k1 | k2 | count
----+----+-------
  0 |  0 |     1
  0 |  0 |     1
(2 rows)
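The duplicate check discussed in this thread can be illustrated outside Catalyst. The following is a minimal sketch, assuming a column is represented only by its name; `DuplicateGroupingSetsSketch` and the `GroupingSet` alias are hypothetical names for illustration and are not part of Spark. It shows why converting each grouping set to a `Set` before calling `distinct` makes `(k1, k2)` and `(k2, k1)` count as duplicates.

```scala
object DuplicateGroupingSetsSketch {
  // Hypothetical stand-in for Catalyst attributes: a column is just its name.
  type GroupingSet = Seq[String]

  // Order inside a grouping set is ignored because each one is turned into a Set.
  // If distinct removes anything, at least two grouping sets were equal.
  def hasDuplicateGroupingSets(groupingSets: Seq[GroupingSet]): Boolean =
    groupingSets.size != groupingSets.map(_.toSet).distinct.size

  def main(args: Array[String]): Unit = {
    // (k1, k2) and (k2, k1) contain the same columns, so they are duplicates.
    println(hasDuplicateGroupingSets(Seq(Seq("k1", "k2"), Seq("k2", "k1"))))
    // (k1) and (k1, k2) differ, so there are no duplicates.
    println(hasDuplicateGroupingSets(Seq(Seq("k1"), Seq("k1", "k2"))))
  }
}
```

The real check in the diff works on attribute expressions rather than strings, so this only mirrors the shape of the comparison.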
Test build #116751 has finished for PR 26961 at commit
retest this please
Test build #116765 has finished for PR 26961 at commit
Test build #116768 has finished for PR 26961 at commit
Merged to master.
@dongjoon-hyun Just a check; we don't need to backport this bugfix to branch-2.4?
This is a correctness bug, but for a rarely used API. I don't have a strong opinion.
yea, thanks for the check. I don't, either.
Hi, @maropu . Could you make a backporting PR for RC2? Thanks!
Yea, I saw the thread in the mail and sure!
…are duplicated

### What changes were proposed in this pull request?

This pr intends to fix wrong aggregated values in `GROUPING SETS` when there are duplicated grouping sets in a query (e.g., `GROUPING SETS ((k1),(k1))`).

For example;
```
scala> spark.table("t").show()
+---+---+---+
| k1| k2|  v|
+---+---+---+
|  0|  0|  3|
+---+---+---+

scala> sql("""select grouping_id(), k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2))""").show()
+-------------+---+----+------+
|grouping_id()| k1|  k2|sum(v)|
+-------------+---+----+------+
|            0|  0|   0|     9| <---- wrong aggregate value and the correct answer is `3`
|            1|  0|null|     3|
+-------------+---+----+------+

// PostgreSQL case
postgres=# select k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
 k1 |  k2  | sum
----+------+-----
  0 |    0 |   3
  0 |    0 |   3
  0 |    0 |   3
  0 | NULL |   3
(4 rows)

// Hive case
hive> select GROUPING__ID, k1, k2, sum(v) from t group by k1, k2 grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
1 0 NULL 3
0 0 0 3
```
[MS SQL Server has the same behaviour with PostgreSQL](#26961 (comment)). This pr follows the behaviour of PostgreSQL/SQL server; it adds one more virtual attribute in `Expand` for avoiding wrongly grouping rows with the same grouping ID.

This is the #26961 backport for `branch-2.4`

### Why are the changes needed?

To fix bugs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

The existing tests.

Closes #27229 from maropu/SPARK-29708-BRANCHC2.4.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
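What changes were proposed in this pull request?
This PR fixes wrong aggregated values in GROUPING SETS when a query contains duplicated grouping sets (e.g., GROUPING SETS ((k1),(k1))).
For example, on a table t with a single row (k1 = 0, k2 = 0, v = 3), grouping by GROUPING SETS ((k1),(k1,k2),(k2,k1),(k1,k2)) returned sum(v) = 9 for the row with grouping ID 0, where the correct answer is 3. PostgreSQL returns three separate rows for the duplicated (k1, k2) grouping sets, each with sum 3, while Hive returns a single row with sum 3.
MS SQL Server behaves the same way as PostgreSQL. This PR follows the behaviour of PostgreSQL/SQL Server; it adds one more virtual attribute in Expand so that rows from different grouping sets with the same grouping ID are not wrongly grouped together.
Why are the changes needed?
To fix bugs.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The existing tests.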
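The effect of the extra virtual attribute can be sketched with plain Scala collections. This is a simplified model under stated assumptions, not Spark's implementation: `Row`, `Expanded`, `expand`, and `sums` are hypothetical names, the expansion emits one copy of each input row per grouping set, and the grouping ID sets a bit for every key column missing from the grouping set. Grouping only by the keys and the grouping ID merges the rows produced by the duplicated `(k1, k2)` sets, which reproduces the wrong sum of 9; adding the position of the grouping set to the key keeps them apart.

```scala
object ExpandPositionSketch {
  // Hypothetical, simplified table row: two keys and one value.
  final case class Row(k1: Int, k2: Int, v: Int)

  // One expanded row: keys (None when nulled out), grouping ID,
  // position of the grouping set that produced it, and the value.
  final case class Expanded(k1: Option[Int], k2: Option[Int], gid: Int, pos: Int, v: Int)

  // Emit one expanded row per (input row, grouping set) pair.
  def expand(rows: Seq[Row], groupingSets: Seq[Set[String]]): Seq[Expanded] =
    for {
      row <- rows
      (gs, pos) <- groupingSets.zipWithIndex
    } yield {
      val k1 = if (gs("k1")) Some(row.k1) else None
      val k2 = if (gs("k2")) Some(row.k2) else None
      // Assumed encoding: a bit is set when the column is absent from the grouping set.
      val gid = (if (gs("k1")) 0 else 2) | (if (gs("k2")) 0 else 1)
      Expanded(k1, k2, gid, pos, row.v)
    }

  // Sum v per group; the position joins the grouping key only when usePosition is true.
  def sums(expanded: Seq[Expanded], usePosition: Boolean): Seq[Int] =
    expanded
      .groupBy(e => (e.k1, e.k2, e.gid, if (usePosition) e.pos else 0))
      .values.map(_.map(_.v).sum).toSeq.sorted

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(0, 0, 3))
    val sets = Seq(Set("k1"), Set("k1", "k2"), Set("k2", "k1"), Set("k1", "k2"))
    val expanded = expand(rows, sets)
    println(sums(expanded, usePosition = false)) // duplicated sets collapse into one group
    println(sums(expanded, usePosition = true))  // one group per grouping set
  }
}
```

With the single-row table from the PR description, the first call yields the sums 3 and 9, and the second yields four groups of 3, matching the four rows PostgreSQL returns.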