[SPARK-32816][SQL] Fix analyzer bug when aggregating multiple distinct DECIMAL columns #29673
Conversation
spark.range(0, 100, 1, 1)
  .selectExpr("id", "cast(id as decimal(9, 0)) as decimal_col")
  .write.mode("overwrite")
  .parquet(path.getAbsolutePath)
spark.read.parquet(path.getAbsolutePath).createOrReplaceTempView("test_table")
It seems we don't need to write Parquet here.
val df = spark.range(0, 50000, 1, 1).selectExpr("id", "cast(id as decimal(9, 0)) as ss_ext_list_price")
df.createOrReplaceTempView("test_table")
sql("select avg(distinct ss_ext_list_price), sum(distinct ss_ext_list_price) from test_table").explain
This seems enough to reproduce the issue.
@@ -196,6 +195,8 @@ abstract class Optimizer(catalogManager: CatalogManager)
      EliminateSorts) :+
    Batch("Decimal Optimizations", fixedPoint,
      DecimalAggregates) :+
    Batch("Distinct Aggregate Rewrite", Once,
Can we add a comment saying that this batch must run after "Decimal Optimizations", as "Decimal Optimizations" may change the aggregate's distinct column?
Test build #128394 has finished for PR 29673 at commit
Test build #128428 has finished for PR 29673 at commit
test("SPARK-32816: aggregating multiple distinct DECIMAL columns") {
  spark.range(0, 100, 1, 1)
    .selectExpr("id", "cast(id as decimal(9, 0)) as decimal_col")
    .createOrReplaceTempView("test_table")
nit: wrap the test with withTempView
How about writing it like this, without a temp view:
test("SPARK-32816: aggregating multiple distinct DECIMAL columns") {
  checkAnswer(
    sql(
      s"""
        |SELECT AVG(DISTINCT decimal_col), SUM(DISTINCT decimal_col)
        |  FROM VALUES (CAST(1 AS DECIMAL(9, 0))) t(decimal_col)
      """.stripMargin),
    Row(XXX, XXX))
}
nit: Also, could you move this test into SQLQueryTestSuite?
Let me try this. I'm not sure if a literal will behave differently.
withTempView("test_table") {
  spark.range(0, 100, 1, 1)
    .selectExpr("id", "cast(id as decimal(9, 0)) as decimal_col")
    .createOrReplaceTempView("test_table")
Can you follow #29673 (comment)? That's a better idea.
sure
Test build #128480 has finished for PR 29673 at commit
        |SELECT AVG(DISTINCT decimal_col), SUM(DISTINCT decimal_col)
        |  FROM VALUES (CAST(1 AS DECIMAL(9, 0))) t(decimal_col)
      """.stripMargin),
    Row(1, 1))
nit: the test in group-by.sql looks enough for this issue, so could you remove this?
sorry, I forgot to remove this.
LGTM except for one minor comment.
Test build #128504 has finished for PR 29673 at commit
Test build #128540 has finished for PR 29673 at commit
thanks, merging to master!
…t DECIMAL columns

This PR fixes a conflict between `RewriteDistinctAggregates` and `DecimalAggregates`. In some cases, `DecimalAggregates` will wrap a decimal column in `UnscaledValue`, using different rules for different aggregates. This means the same distinct column used with different aggregates can turn into different distinct columns after `DecimalAggregates`. For example, `avg(distinct decimal_col), sum(distinct decimal_col)` may change to `avg(distinct UnscaledValue(decimal_col)), sum(distinct decimal_col)`. We assume that after `RewriteDistinctAggregates` there will be at most one distinct column in the aggregates, but `DecimalAggregates` breaks this assumption. To fix this, we have to switch the order of these two rules.

Bug fix; no user-facing change; tested with added test cases.

Closes apache#29673 from linhongliu-db/SPARK-32816.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 40ef5c9)
What changes were proposed in this pull request?

This PR fixes a conflict between `RewriteDistinctAggregates` and `DecimalAggregates`. In some cases, `DecimalAggregates` will wrap a decimal column in `UnscaledValue`, using different rules for different aggregates. This means the same distinct column used with different aggregates can turn into different distinct columns after `DecimalAggregates`. For example, `avg(distinct decimal_col), sum(distinct decimal_col)` may change to `avg(distinct UnscaledValue(decimal_col)), sum(distinct decimal_col)`. We assume that after `RewriteDistinctAggregates` there will be at most one distinct column in the aggregates, but `DecimalAggregates` breaks this assumption. To fix this, we have to switch the order of these two rules.

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added test cases.
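The conflict described above can be illustrated with a small sketch (plain Python, not Spark source; the `distinct_groups` helper is a hypothetical model of how the planner groups distinct aggregates by their distinct expression). The point is that wrapping only some aggregates' columns in `UnscaledValue` splits one logical distinct group into two, violating the at-most-one-distinct-column assumption:

```python
# Hypothetical model, not Spark code: each aggregate is a (function, distinct
# expression) pair, and the rewrite groups aggregates by distinct expression.

def distinct_groups(aggregates):
    """Group aggregate calls by their distinct expression."""
    groups = {}
    for func, expr in aggregates:
        groups.setdefault(expr, []).append(func)
    return groups

# Before DecimalAggregates: both aggregates share a single distinct column,
# so there is exactly one distinct group.
before = [("avg", "decimal_col"), ("sum", "decimal_col")]
assert len(distinct_groups(before)) == 1

# If DecimalAggregates runs first, avg's column is wrapped in UnscaledValue
# while sum's is not, so the single distinct group splits into two --
# breaking the assumption relied on after RewriteDistinctAggregates.
after = [("avg", "UnscaledValue(decimal_col)"), ("sum", "decimal_col")]
assert len(distinct_groups(after)) == 2
```

This is only a conceptual model of the grouping behavior; the actual fix in the PR is the rule reordering shown in the diff, so that the distinct rewrite sees the columns produced by "Decimal Optimizations".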