
[SPARK-18148][SQL] Misleading Error Message for Aggregation Without Window/GroupBy #15672

Closed
wants to merge 6 commits into apache:master from jiangxb1987/groupBy-empty

Conversation

@jiangxb1987 (Contributor) commented Oct 28, 2016

What changes were proposed in this pull request?

Aggregation without Window/GroupBy expressions fails in `checkAnalysis`, but the error message is misleading; we should generate a more specific error message for this case.

For example,

```
spark.read.load("/some-data")
  .withColumn("date_dt", to_date($"date"))
  .withColumn("year", year($"date_dt"))
  .withColumn("week", weekofyear($"date_dt"))
  .withColumn("user_count", count($"userId"))
  .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
```

creates the following output:

```
org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
```

In the error message above, `randomColumn` doesn't appear anywhere in the query (it was actually added by the `withColumn` function), so the message gives the user too little to go on when diagnosing the problem.
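For reference, here is one way such a pipeline could be rewritten so the aggregate is valid. This is only a sketch built from the column names in the example above; `weeklyWindow` and the input path are placeholders carried over from the original snippet:

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax

// Aggregate through an explicit groupBy instead of a bare withColumn(count(...)).
val weekly = spark.read.load("/some-data")
  .withColumn("date_dt", to_date($"date"))
  .withColumn("year", year($"date_dt"))
  .withColumn("week", weekofyear($"date_dt"))
  .groupBy($"year", $"week")
  .agg(count($"userId").as("user_count"))

// Window aggregates over the grouped result are fine, because max(...) is
// explicitly scoped to a window rather than to an (absent) GROUP BY.
val weeklyWindow = Window.partitionBy($"year")
val result = weekly.withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
```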

How was this patch tested?

Manually tested.

Before:

```
scala> spark.sql("select col, count(col) from tbl")
org.apache.spark.sql.AnalysisException: expression 'tbl.`col`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
```

After:

```
scala> spark.sql("select col, count(col) from tbl")
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'tbl.`col`' is not an aggregate function. Wrap '(count(col#231L) AS count(col)#239L)' in windowing function(s) or wrap 'tbl.`col`' in first() (or first_value) if you don't care which value you get.;;
```

Also added new test SQL statements in `group-by.sql`.
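To reproduce both messages locally, a minimal spark-shell sketch; the table name and `col` follow the examples above, while the second column `v` is made up for illustration. Each `spark.sql` call throws the `AnalysisException` eagerly, at analysis time:

```
import spark.implicits._

// Throwaway table standing in for `tbl`; `col` is a bigint to match the plan above.
Seq((1L, "a"), (2L, "b")).toDF("col", "v").createOrReplaceTempView("tbl")

// Empty grouping list -> the new, more specific message:
spark.sql("select col, count(col) from tbl")

// Non-empty grouping list that does not cover `col` -> the original message:
spark.sql("select col, count(col) from tbl group by v")
```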

@SparkQA commented Oct 28, 2016

Test build #67702 has finished for PR 15672 at commit 350b7a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Oct 29, 2016

Is this confusing message reproducible using SQL queries (not DataFrame APIs)? If so, perhaps we can add a test in SQLQueryTestSuite.

@jiangxb1987 (Contributor, Author) commented Oct 29, 2016

@rxin I've added new test SQL statements in `group-by.sql`, which cover both error messages generated by `checkAnalysis`. I also moved the test cases for the Aggregate operator from SQLQuerySuite to `group-by.sql`.
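For context on the harness: SQLQueryTestSuite executes each statement in `group-by.sql` and diffs the output, including error messages, against a checked-in golden result file, so both messages are pinned down by the new cases. Regenerating the golden file looks roughly like this (invocation sketched from the usual Spark dev workflow of that era):

```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite"
```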

@SparkQA commented Oct 29, 2016

Test build #67755 has finished for PR 15672 at commit 8986207.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 29, 2016

Test build #67762 has finished for PR 15672 at commit 033f43f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor, Author) commented
The failed test case doesn't look related to our change here. It didn't fail in my local environment.

@jiangxb1987 (Contributor, Author) commented
retest this please

@SparkQA commented Oct 31, 2016

Test build #67803 has finished for PR 15672 at commit 033f43f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
SELECT a, COUNT(b) FROM testData;
SELECT COUNT(a), COUNT(b) FROM testData;

-- Aggregate with non-empty GroupBy expressions.
```
Review comment (Contributor):

these are already tested earlier ain't they?


```
-- Aggregate with nulls.
SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), STDDEV(a), SUM(a), COUNT(a)
FROM testData;
```
Review comment (Contributor):

add a newline.

```
-- Test data.
CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES
(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null)
AS testData(a, b);
```
Review comment (Contributor):

Also can we just augment the initial dataset rather than introducing a new testData?

It'd be better to use one dataset.
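For local experimentation, an equivalent single dataset can also be built through the DataFrame API; this is a sketch mirroring the `testData` VALUES above (boxed `Integer` is used only so the null rows type-check):

```
import spark.implicits._

// One view serving both the plain aggregates and the null-handling cases.
Seq[(Integer, Integer)](
  (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null))
  .toDF("a", "b")
  .createOrReplaceTempView("testData")
```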

@jiangxb1987 (Contributor, Author) commented
@rxin I've updated the test SQLs to use one dataset. Thank you!

@SparkQA commented Nov 1, 2016

Test build #67889 has finished for PR 15672 at commit 43f8fa6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 1, 2016

Test build #67888 has finished for PR 15672 at commit 2542f7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Nov 1, 2016

Thanks - merging in master and branch-2.0.

@asfgit closed this in d0272b4 on Nov 1, 2016
asfgit pushed a commit that referenced this pull request Nov 1, 2016
[SPARK-18148][SQL] Misleading Error Message for Aggregation Without Window/GroupBy

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15672 from jiangxb1987/groupBy-empty.

(cherry picked from commit d0272b4)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@jiangxb1987 deleted the groupBy-empty branch on November 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-18148][SQL] Misleading Error Message for Aggregation Without Window/GroupBy

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes apache#15672 from jiangxb1987/groupBy-empty.