Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated #26961

Closed
wants to merge 4 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented Dec 20, 2019

What changes were proposed in this pull request?

This pr intends to fix wrong aggregated values in GROUPING SETS when there are duplicated grouping sets in a query (e.g., GROUPING SETS ((k1),(k1))).

For example;

scala> spark.table("t").show()
+---+---+---+
| k1| k2|  v|
+---+---+---+
|  0|  0|  3|
+---+---+---+

scala> sql("""select grouping_id(), k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2))""").show()
+-------------+---+----+------+                                                 
|grouping_id()| k1|  k2|sum(v)|
+-------------+---+----+------+
|            0|  0|   0|     9| <---- wrong aggregate value and the correct answer is `3`
|            1|  0|null|     3|
+-------------+---+----+------+

// PostgreSQL case
postgres=#  select k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
 k1 |  k2  | sum 
----+------+-----
  0 |    0 |   3
  0 |    0 |   3
  0 |    0 |   3
  0 | NULL |   3
(4 rows)

// Hive case
hive> select GROUPING__ID, k1, k2, sum(v) from t group by k1, k2 grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
1	0	NULL	3
0	0	0	3

MS SQL Server has the same behaviour with PostgreSQL. This pr follows the behaviour of PostgreSQL/SQL server; it adds one more virtual attribute in Expand for avoiding wrongly grouping rows with the same grouping ID.

Why are the changes needed?

To fix bugs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The existing tests.

@maropu
Copy link
Member Author

maropu commented Dec 20, 2019

Currently, I followed the hive result because of the simpler fix. But, we might need to check results in the other DBMS-like systems, too (Oracle and DB2?). I just want to know the Oracle answer, but I cannot because I don't have a Oracle instance. Anyone can help me and check it?

@SparkQA
Copy link

SparkQA commented Dec 20, 2019

Test build #115607 has finished for PR 26961 at commit 16d99c7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Dec 20, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Dec 20, 2019

Test build #115613 has finished for PR 26961 at commit 16d99c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated [WIP][SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated Dec 20, 2019
@SparkQA
Copy link

SparkQA commented Dec 20, 2019

Test build #115625 has finished for PR 26961 at commit 19a36b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [WIP][SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated [SPARK-29708][SQL] Correct aggregated values when grouping sets are duplicated Dec 20, 2019
@dongjoon-hyun
Copy link
Member

Retest this please.

@dongjoon-hyun
Copy link
Member

cc @cloud-fan

@cloud-fan
Copy link
Contributor

The result of SQL server:
image

I think pgsql is corrected.

@SparkQA
Copy link

SparkQA commented Jan 13, 2020

Test build #116625 has finished for PR 26961 at commit 19a36b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 14, 2020

ok, I'll follow the pgSQL(& SQL Server) behaviour.

@SparkQA
Copy link

SparkQA commented Jan 14, 2020

Test build #116686 has finished for PR 26961 at commit cdcc4d0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 14, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Jan 14, 2020

Test build #116692 has started for PR 26961 at commit cdcc4d0.

@maropu
Copy link
Member Author

maropu commented Jan 15, 2020

retest this please

@@ -641,11 +640,14 @@ object Expand {
child: LogicalPlan): Expand = {
val attrMap = groupByAttrs.zipWithIndex.toMap

val hasDuplicateGroupingSets = groupingSetsAttrs.size !=
groupingSetsAttrs.map(_.map(_.canonicalized).toSet).distinct.size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: (_.map(_.exprId).toSet)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so (k1,k2),(k2,k1) are also duplicated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I think so;

postgres=# select k1, k2, count(v) from t group by grouping sets ((k1, k2), (k2, k1));
 k1 | k2 | count 
----+----+-------
  0 |  0 |     1
  0 |  0 |     1
(2 rows)

@SparkQA
Copy link

SparkQA commented Jan 15, 2020

Test build #116751 has finished for PR 26961 at commit cdcc4d0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Jan 15, 2020

Test build #116765 has finished for PR 26961 at commit cdcc4d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 15, 2020

Test build #116768 has finished for PR 26961 at commit b81ab18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu closed this in 5f6cd61 Jan 15, 2020
@maropu
Copy link
Member Author

maropu commented Jan 15, 2020

Merged to master.

@maropu
Copy link
Member Author

maropu commented Jan 15, 2020

@dongjoon-hyun Just a check; we don't need to backport this bugfix to branch-2.4?

@cloud-fan
Copy link
Contributor

This is a correctness bug, but for a rarely used API. I don't have a strong opinion.

@maropu
Copy link
Member Author

maropu commented Jan 15, 2020

yea, thanks for the check. I don't, either.

@dongjoon-hyun
Copy link
Member

Hi, @maropu . Could you make a backporting PR for RC2? Thanks!

@maropu
Copy link
Member Author

maropu commented Jan 16, 2020

Yea, I saw the thread in the mail and sure!

dongjoon-hyun pushed a commit that referenced this pull request Jan 16, 2020
…are duplicated

### What changes were proposed in this pull request?

This pr intends to fix wrong aggregated values in `GROUPING SETS` when there are duplicated grouping sets in a query (e.g., `GROUPING SETS ((k1),(k1))`).

For example;
```
scala> spark.table("t").show()
+---+---+---+
| k1| k2|  v|
+---+---+---+
|  0|  0|  3|
+---+---+---+

scala> sql("""select grouping_id(), k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2))""").show()
+-------------+---+----+------+
|grouping_id()| k1|  k2|sum(v)|
+-------------+---+----+------+
|            0|  0|   0|     9| <---- wrong aggregate value and the correct answer is `3`
|            1|  0|null|     3|
+-------------+---+----+------+

// PostgreSQL case
postgres=#  select k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
 k1 |  k2  | sum
----+------+-----
  0 |    0 |   3
  0 |    0 |   3
  0 |    0 |   3
  0 | NULL |   3
(4 rows)

// Hive case
hive> select GROUPING__ID, k1, k2, sum(v) from t group by k1, k2 grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2));
1	0	NULL	3
0	0	0	3
```
[MS SQL Server has the same behaviour with PostgreSQL](#26961 (comment)). This pr follows the behaviour of PostgreSQL/SQL server; it adds one more virtual attribute in `Expand` for avoiding wrongly grouping rows with the same grouping ID.

This is the #26961 backport  for `branch-2.4`

### Why are the changes needed?

To fix bugs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

The existing tests.

Closes #27229 from maropu/SPARK-29708-BRANCHC2.4.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants