[SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated #26961

maropu · 2019-12-20T07:51:40Z

What changes were proposed in this pull request?

This PR fixes wrong aggregated values in GROUPING SETS when a query contains duplicated grouping sets (e.g., GROUPING SETS ((k1),(k1))).

For example:

scala> spark.table("t").show()
+---+---+---+
| k1| k2|  v|
+---+---+---+
|  0|  0|  3|
+---+---+---+

scala> sql("""select grouping_id(), k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2))""").show()
+-------------+---+----+------+                                                 
|grouping_id()| k1|  k2|sum(v)|
+-------------+---+----+------+
|            0|  0|   0|     9| <---- wrong aggregate value and the correct answer is `3`
|            1|  0|null|     3|
+-------------+---+----+------+

// PostgreSQL case
postgres=#  select k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
 k1 |  k2  | sum 
----+------+-----
  0 |    0 |   3
  0 |    0 |   3
  0 |    0 |   3
  0 | NULL |   3
(4 rows)

// Hive case
hive> select GROUPING__ID, k1, k2, sum(v) from t group by k1, k2 grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
1	0	NULL	3
0	0	0	3

MS SQL Server behaves the same as PostgreSQL. This PR follows the PostgreSQL/SQL Server behaviour: it adds one more virtual attribute to Expand so that rows coming from different grouping sets with the same grouping ID are not wrongly grouped together.
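To illustrate why the extra attribute matters, here is a minimal simulation in plain Scala. It is a sketch under stated assumptions, not Spark's actual Expand implementation: the names `Row`, `Expanded`, `expand`, `sumWithoutPos`, and `sumWithPos` are hypothetical, and the `pos` field merely stands in for the virtual attribute described above.

```scala
object GroupingSetsSketch {
  // One input row of table t: grouping keys k1, k2 and the value v.
  final case class Row(k1: Int, k2: Int, v: Int)

  // One expanded row: keys outside the grouping set are nulled out (None).
  // `pos` is a hypothetical stand-in for the extra virtual attribute.
  final case class Expanded(k1: Option[Int], k2: Option[Int], gid: Int, pos: Int, v: Int)

  // Emit one copy of every input row per grouping set, tagged with the
  // grouping ID and the position of the grouping set in the query.
  def expand(rows: Seq[Row], groupingSets: Seq[Set[String]]): Seq[Expanded] =
    for {
      row <- rows
      (gs, pos) <- groupingSets.zipWithIndex
    } yield {
      val k1 = if (gs("k1")) Some(row.k1) else None
      val k2 = if (gs("k2")) Some(row.k2) else None
      // A bit is 1 when the column is NOT part of the grouping set.
      val gid = (if (k1.isEmpty) 2 else 0) | (if (k2.isEmpty) 1 else 0)
      Expanded(k1, k2, gid, pos, row.v)
    }

  // Grouping by keys and grouping ID only: duplicated grouping sets collapse
  // into one group, so their values are summed several times.
  def sumWithoutPos(expanded: Seq[Expanded]): Seq[Int] =
    expanded.groupBy(e => (e.k1, e.k2, e.gid)).values.map(_.map(_.v).sum).toSeq.sorted

  // Adding the position to the grouping key keeps every grouping set apart.
  def sumWithPos(expanded: Seq[Expanded]): Seq[Int] =
    expanded.groupBy(e => (e.k1, e.k2, e.gid, e.pos)).values.map(_.map(_.v).sum).toSeq.sorted
}
```

With the single row (0, 0, 3) and the grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2)) from the example, `sumWithoutPos` yields the sums 3 and 9 (the wrong result shown for Spark), while `sumWithPos` yields four groups that each sum to 3, matching the PostgreSQL output.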

Why are the changes needed?

To fix a correctness bug: duplicated grouping sets currently produce wrong aggregated values.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The existing tests.

maropu · 2019-12-20T07:55:12Z

Currently, I follow the Hive result because the fix is simpler. But we might need to check the results in other DBMS-like systems, too (Oracle and DB2?). I'd like to know the Oracle answer, but I can't check it because I don't have an Oracle instance. Can anyone help me check it?

SparkQA · 2019-12-20T08:05:02Z

Test build #115607 has finished for PR 26961 at commit 16d99c7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-12-20T08:29:10Z

retest this please

SparkQA · 2019-12-20T12:00:02Z

Test build #115613 has finished for PR 26961 at commit 16d99c7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-20T18:42:32Z

Test build #115625 has finished for PR 26961 at commit 19a36b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-01-13T09:19:06Z

Retest this please.

dongjoon-hyun · 2020-01-13T09:19:38Z

cc @cloud-fan

cloud-fan · 2020-01-13T12:23:39Z

The result of SQL server:

I think pgsql is correct.

SparkQA · 2020-01-13T13:46:42Z

Test build #116625 has finished for PR 26961 at commit 19a36b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-01-14T01:31:08Z

OK, I'll follow the pgSQL (& SQL Server) behaviour.

SparkQA · 2020-01-14T08:05:01Z

Test build #116686 has finished for PR 26961 at commit cdcc4d0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-01-14T08:17:48Z

retest this please

SparkQA · 2020-01-14T08:18:40Z

Test build #116692 has started for PR 26961 at commit cdcc4d0.

maropu · 2020-01-15T06:02:09Z

retest this please

cloud-fan · 2020-01-15T07:52:13Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

@@ -641,11 +640,14 @@ object Expand {
    child: LogicalPlan): Expand = {
    val attrMap = groupByAttrs.zipWithIndex.toMap

+    val hasDuplicateGroupingSets = groupingSetsAttrs.size !=
+      groupingSetsAttrs.map(_.map(_.canonicalized).toSet).distinct.size


nit: (_.map(_.exprId).toSet)

so (k1,k2),(k2,k1) are also duplicated?

yea, I think so;

postgres=# select k1, k2, count(v) from t group by grouping sets ((k1, k2), (k2, k1));
 k1 | k2 | count
----+----+-------
  0 |  0 |     1
  0 |  0 |     1
(2 rows)
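The duplicate check discussed in this thread can be sketched in plain Scala. This is only an illustration: column names as strings stand in for Catalyst attributes (the diff compares `_.canonicalized`, the nit suggests `_.exprId`), and the object name `DuplicateGroupingSetsSketch` is hypothetical.

```scala
object DuplicateGroupingSetsSketch {
  // Each grouping set is turned into a Set, so column order is ignored:
  // (k1,k2) and (k2,k1) normalize to the same value. If dropping repeated
  // sets shrinks the list, the query contains duplicated grouping sets.
  def hasDuplicateGroupingSets(groupingSets: Seq[Seq[String]]): Boolean =
    groupingSets.size != groupingSets.map(_.toSet).distinct.size
}
```

Under this check, `Seq(Seq("k1", "k2"), Seq("k2", "k1"))` counts as duplicated, which agrees with the two identical rows PostgreSQL returns above, while `Seq(Seq("k1"), Seq("k1", "k2"))` does not.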

SparkQA · 2020-01-15T08:05:02Z

Test build #116751 has finished for PR 26961 at commit cdcc4d0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-01-15T08:05:20Z

retest this please

SparkQA · 2020-01-15T12:22:42Z

Test build #116765 has finished for PR 26961 at commit cdcc4d0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-15T12:37:50Z

Test build #116768 has finished for PR 26961 at commit b81ab18.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-01-15T13:03:16Z

Merged to master.

maropu · 2020-01-15T13:05:10Z

@dongjoon-hyun Just to check: we don't need to backport this bugfix to branch-2.4?

cloud-fan · 2020-01-15T13:06:34Z

This is a correctness bug, but for a rarely used API. I don't have a strong opinion.

maropu · 2020-01-15T13:08:11Z

yea, thanks for the check. I don't, either.

dongjoon-hyun · 2020-01-16T04:15:51Z

Hi, @maropu . Could you make a backporting PR for RC2? Thanks!

maropu · 2020-01-16T04:32:35Z

Yea, I saw the thread in the mail and sure!

…are duplicated

### What changes were proposed in this pull request?

This pr intends to fix wrong aggregated values in `GROUPING SETS` when there are duplicated grouping sets in a query (e.g., `GROUPING SETS ((k1),(k1))`).

For example;
```
scala> spark.table("t").show()
+---+---+---+
| k1| k2|  v|
+---+---+---+
|  0|  0|  3|
+---+---+---+

scala> sql("""select grouping_id(), k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2))""").show()
+-------------+---+----+------+
|grouping_id()| k1|  k2|sum(v)|
+-------------+---+----+------+
|            0|  0|   0|     9| <---- wrong aggregate value and the correct answer is `3`
|            1|  0|null|     3|
+-------------+---+----+------+

// PostgreSQL case
postgres=#  select k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
 k1 |  k2  | sum
----+------+-----
  0 |    0 |   3
  0 |    0 |   3
  0 |    0 |   3
  0 | NULL |   3
(4 rows)

// Hive case
hive> select GROUPING__ID, k1, k2, sum(v) from t group by k1, k2 grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
1	0	NULL	3
0	0	0	3
```

[MS SQL Server has the same behaviour with PostgreSQL](#26961 (comment)). This pr follows the behaviour of PostgreSQL/SQL server; it adds one more virtual attribute in `Expand` for avoiding wrongly grouping rows with the same grouping ID.

This is the #26961 backport for `branch-2.4`

### Why are the changes needed?

To fix bugs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

The existing tests.

Closes #27229 from maropu/SPARK-29708-BRANCHC2.4.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

maropu changed the title ~~[SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated~~ [WIP][SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated Dec 20, 2019

maropu changed the title ~~[WIP][SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated~~ [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated Dec 20, 2019

dongjoon-hyun added the SQL label Jan 13, 2020

maropu added 3 commits January 14, 2020 14:31

Fix

0906993

Fix

dcdd61e

Fix

cdcc4d0

maropu force-pushed the SPARK-29708 branch from 19a36b3 to cdcc4d0 Compare January 14, 2020 06:54

cloud-fan reviewed Jan 15, 2020

cloud-fan approved these changes Jan 15, 2020

nit updates

b81ab18

maropu closed this in 5f6cd61 Jan 15, 2020

maropu mentioned this pull request Jan 16, 2020

[SPARK-29708][SQL][2.4] Correct aggregated values when grouping sets are duplicated #27229

Closed
