-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18148][SQL] Misleading Error Message for Aggregation Without Window/GroupBy #15672
Conversation
Test build #67702 has finished for PR 15672 at commit
|
Is this confusing msg reproducible using SQL queries (not dataframe APIs)? If yes, perhaps we can add a test in SQLQueryTestSuite. |
@rxin I've added new test sqls in |
Test build #67755 has finished for PR 15672 at commit
|
Test build #67762 has finished for PR 15672 at commit
|
The failed test case looks not related to our change here. It didn't fail in my local envirement. |
retest this please |
Test build #67803 has finished for PR 15672 at commit
|
SELECT a, COUNT(b) FROM testData; | ||
SELECT COUNT(a), COUNT(b) FROM testData; | ||
|
||
-- Aggregate with non-empty GroupBy expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are already tested earlier ain't they?
|
||
-- Aggregate with nulls. | ||
SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), STDDEV(a), SUM(a), COUNT(a) | ||
FROM testData; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a newline.
-- Test data. | ||
CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES | ||
(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2), (null, 1), (3, null), (null, null) | ||
AS testData(a, b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also can we just augment the initial dataset rather than introducing a new testData?
It'd be better to use one dataset.
@rxin I've updated test sqls to use one dataset. Thank you! |
Test build #67889 has finished for PR 15672 at commit
|
Test build #67888 has finished for PR 15672 at commit
|
Thanks - merging in master and branch-2.0. |
…indow/GroupBy ## What changes were proposed in this pull request? Aggregation Without Window/GroupBy expressions will fail in `checkAnalysis`, the error message is a bit misleading, we should generate a more specific error message for this case. For example, ``` spark.read.load("/some-data") .withColumn("date_dt", to_date($"date")) .withColumn("year", year($"date_dt")) .withColumn("week", weekofyear($"date_dt")) .withColumn("user_count", count($"userId")) .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow)) ) ``` creates the following output: ``` org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; ``` In the error message above, `randomColumn` doesn't appear in the query(acturally it's added by function `withColumn`), so the message is not enough for the user to address the problem. ## How was this patch tested? Manually test Before: ``` scala> spark.sql("select col, count(col) from tbl") org.apache.spark.sql.AnalysisException: expression 'tbl.`col`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; ``` After: ``` scala> spark.sql("select col, count(col) from tbl") org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'tbl.`col`' is not an aggregate function. Wrap '(count(col#231L) AS count(col)#239L)' in windowing function(s) or wrap 'tbl.`col`' in first() (or first_value) if you don't care which value you get.;; ``` Also add new test sqls in `group-by.sql`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15672 from jiangxb1987/groupBy-empty. (cherry picked from commit d0272b4) Signed-off-by: Reynold Xin <rxin@databricks.com>
…indow/GroupBy ## What changes were proposed in this pull request? Aggregation Without Window/GroupBy expressions will fail in `checkAnalysis`, the error message is a bit misleading, we should generate a more specific error message for this case. For example, ``` spark.read.load("/some-data") .withColumn("date_dt", to_date($"date")) .withColumn("year", year($"date_dt")) .withColumn("week", weekofyear($"date_dt")) .withColumn("user_count", count($"userId")) .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow)) ) ``` creates the following output: ``` org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; ``` In the error message above, `randomColumn` doesn't appear in the query(acturally it's added by function `withColumn`), so the message is not enough for the user to address the problem. ## How was this patch tested? Manually test Before: ``` scala> spark.sql("select col, count(col) from tbl") org.apache.spark.sql.AnalysisException: expression 'tbl.`col`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; ``` After: ``` scala> spark.sql("select col, count(col) from tbl") org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'tbl.`col`' is not an aggregate function. Wrap '(count(col#231L) AS count(col)#239L)' in windowing function(s) or wrap 'tbl.`col`' in first() (or first_value) if you don't care which value you get.;; ``` Also add new test sqls in `group-by.sql`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes apache#15672 from jiangxb1987/groupBy-empty.
What changes were proposed in this pull request?
Aggregation Without Window/GroupBy expressions will fail in
checkAnalysis
, the error message is a bit misleading, we should generate a more specific error message for this case.For example,
creates the following output:
In the error message above,
randomColumn
doesn't appear in the query(acturally it's added by functionwithColumn
), so the message is not enough for the user to address the problem.How was this patch tested?
Manually test
Before:
After:
Also add new test sqls in
group-by.sql
.