
[SPARK-12706] [SQL] grouping() and grouping_id() #10677

Closed
wants to merge 11 commits into from

Conversation

davies
Contributor

@davies davies commented Jan 9, 2016

grouping() returns whether a column is aggregated in the current row or not; grouping_id() returns the aggregation level as a bitmask.

grouping()/grouping_id() can be used with window functions, but do not yet work in HAVING/ORDER BY clauses; that will be fixed in a follow-up PR.

The GROUPING__ID/grouping_id() behavior in Hive is wrong (according to Hive's own docs), and we implemented it the same wrong way. This PR changes it to match the behavior of most databases (and the Hive docs).
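As a rough illustration of the semantics described above — a hypothetical pure-Python sketch, not Spark code — grouping_id() can be modeled as packing the per-column grouping() bits into one integer, first GROUP BY column as the most significant bit:

```python
def grouping(col, grouping_set):
    # grouping(col) is 1 when the column is aggregated away
    # (i.e. not part of the current grouping set), else 0
    return 0 if col in grouping_set else 1

def grouping_id(group_by_cols, grouping_set):
    # grouping_id() == (grouping(c1) << (n-1)) + (grouping(c2) << (n-2))
    #                  + ... + grouping(cn)
    gid = 0
    for col in group_by_cols:
        gid = (gid << 1) | grouping(col, grouping_set)
    return gid

# GROUP BY a, b, c with grouping set (a): b and c are aggregated away
print(grouping_id(["a", "b", "c"], {"a"}))  # 0b011 == 3
```

This is only a model of the intended standard behavior (the convention most databases use), not the Spark implementation itself.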

@SparkQA

SparkQA commented Jan 9, 2016

Test build #49049 has finished for PR 10677 at commit 0e8317d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BitwiseReverse(child: Expression, width: Int)

@SparkQA

SparkQA commented Jan 10, 2016

Test build #49050 has finished for PR 10677 at commit 736e8d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BitwiseReverse(child: Expression, width: Int)

@davies
Contributor Author

davies commented Jan 18, 2016

@nongli @rxin Could you help review this one?

/**
* Aggregate function: returns the level of grouping, equals to
*
* (grouping(c1) << (n-1)) + (grouping(c1) << (n-2)) + ... + grouping(cn)
Contributor

Second term should be grouping(c2)

@rxin
Contributor

rxin commented Jan 29, 2016

@hvanhovell can you review this one?

@hvanhovell
Contributor

I'll have a look

aggsBuffer += e
e
case e if isPartOfAggregation(e) => e
case e: GroupingID =>
Contributor

This is probably a dumb question. What happens if we use these functions without grouping sets? Do we get a nice analysis exception?

Contributor Author

Right now it will fail to resolve; agreed that it should have a better error message.

Contributor Author

Added a check for this in CheckAnalysis.

@hvanhovell
Contributor

@davies the PR is in good shape. The two main (minor) issues I could find are:

  • The use of the Hive gid, which is wrong. We could also break some compatibility and solve it at the root.
  • The usefulness of child expressions in the grouping_id function.

Davies Liu added 2 commits February 9, 2016 16:59
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@SparkQA

SparkQA commented Feb 10, 2016

Test build #51017 has finished for PR 10677 at commit 90c1655.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

@davies I did some reading on grouping_id() and the behavior differs per vendor. Oracle also allows you to use a subset of the grouping columns, see: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions063.htm. Let's keep it as it is; someone can always implement this fancier operator later (it shouldn't be too hard).

As for the Hive compatibility, we could also add a few lines of comments in the Analyzer explaining why Hive is wrong.

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51025 has finished for PR 10677 at commit 9511c2c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51027 has finished for PR 10677 at commit c008569.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Feb 10, 2016

@hvanhovell The analyzer knows nothing about Hive; where is the best place to put the comment?

@hvanhovell
Contributor

@davies Yeah, you have a point there.

We have inherited the wrong construction of the bitmask from Hive: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala#L202-L206. We could also fix/document it there.

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51043 has finished for PR 10677 at commit aa34559.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM

I left a few minor final comments in CatalystQl

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51053 has finished for PR 10677 at commit 3469e45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2016

Test build #2534 has finished for PR 10677 at commit 9c7a06f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Feb 11, 2016

merging into master, thanks!

@asfgit asfgit closed this in b5761d1 Feb 11, 2016
asfgit pushed a commit that referenced this pull request Nov 5, 2016
## What changes were proposed in this pull request?

Prior to this PR, the following code would cause an NPE:

```scala
case class point(a: String, b: String, c: String, d: Int)

val data = Seq(
  point("1", "2", "3", 1),
  point("4", "5", "6", 1),
  point("7", "8", "9", 1)
)
sc.parallelize(data).toDF().registerTempTable("table")
spark.sql("select a, b, c, count(d) from table group by a, b, c GROUPING SETS ((a))").show()
```

The reason is that when the grouping_id() behavior was changed in #10677, some code that should have been updated was left out.

Take the above code for example: prior to #10677, the bit mask for set "(a)" was `001`, while after #10677 it became `011`. However, `nonNullBitmask` was not changed accordingly.

This PR fixes the problem.
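For concreteness, here is a hypothetical Python sketch of the two bitmask conventions for `GROUP BY a, b, c` with grouping set `(a)`. The exact bit layouts are my interpretation of the `001` vs `011` values quoted above, not Spark code:

```python
GROUP_BY = ["a", "b", "c"]

def hive_style_gid(grouping_set):
    # pre-#10677 (Hive-style) assumption: bit i is set when the
    # i-th GROUP BY column IS part of the current grouping set
    gid = 0
    for i, col in enumerate(GROUP_BY):
        if col in grouping_set:
            gid |= 1 << i
    return gid

def standard_gid(grouping_set):
    # post-#10677 (standard) convention: first column is the most
    # significant bit, and a bit is set when the column is
    # aggregated away (not part of the grouping set)
    gid = 0
    for col in GROUP_BY:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid

print(f"{hive_style_gid({'a'}):03b}")  # 001
print(f"{standard_gid({'a'}):03b}")    # 011
```

Any code keyed to the old `001`-style mask (such as `nonNullBitmask` here) has to be updated in lockstep when the convention flips.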
## How was this patch tested?

add integration tests

Author: wangyang <wangyang@haizhi.com>

Closes #15416 from yangw1234/groupingid.

(cherry picked from commit fb0d608)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 5, 2016
asfgit pushed a commit that referenced this pull request Nov 5, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017