
[SPARK-12706] [SQL] grouping() and grouping_id() #10677

Closed
wants to merge 11 commits into from

Conversation

davies
Contributor

@davies davies commented Jan 9, 2016

grouping() returns whether a column is aggregated in the current row or not; grouping_id() returns the aggregation level as a bitmask.

grouping()/grouping_id() can be used with window functions, but do not yet work in HAVING/ORDER BY clauses; that will be fixed in a follow-up PR.

The GROUPING__ID/grouping_id() behavior in Hive is wrong (according to Hive's own docs), and we implemented it the same wrong way. This PR changes it to match the behavior of most databases (and the Hive docs).
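As a rough illustration of the semantics described above — a hypothetical pure-Python sketch, not Spark code — grouping_id() can be modeled as packing the per-column grouping() bits into one integer, first GROUP BY column as the most significant bit:

```python
def grouping(col, grouping_set):
    # grouping(col) is 1 when the column is aggregated away
    # (i.e. not part of the current grouping set), else 0
    return 0 if col in grouping_set else 1

def grouping_id(group_by_cols, grouping_set):
    # grouping_id() == (grouping(c1) << (n-1)) + (grouping(c2) << (n-2))
    #                  + ... + grouping(cn)
    gid = 0
    for col in group_by_cols:
        gid = (gid << 1) | grouping(col, grouping_set)
    return gid

# GROUP BY a, b, c with grouping set (a): b and c are aggregated away
print(grouping_id(["a", "b", "c"], {"a"}))  # 0b011 == 3
```

This is only a model of the intended standard behavior (the convention most databases use), not the Spark implementation itself.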

@SparkQA

SparkQA commented Jan 9, 2016

Test build #49049 has finished for PR 10677 at commit 0e8317d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BitwiseReverse(child: Expression, width: Int)

@SparkQA

SparkQA commented Jan 10, 2016

Test build #49050 has finished for PR 10677 at commit 736e8d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BitwiseReverse(child: Expression, width: Int)

@davies
Contributor Author

davies commented Jan 18, 2016

@nongli @rxin Could you help review this one?

/**
* Aggregate function: returns the level of grouping, equals to
*
* (grouping(c1) << (n-1)) + (grouping(c1) << (n-2)) + ... + grouping(cn)
Contributor

Second term should be grouping(c2)

@rxin
Contributor

rxin commented Jan 29, 2016

@hvanhovell can you review this one?

@hvanhovell
Contributor

I'll have a look

aggsBuffer += e
e
case e if isPartOfAggregation(e) => e
case e: GroupingID =>
Contributor

This is probably a dumb question. What happens if we use these functions without grouping sets? Do we get a nice analysis exception?

Contributor Author

Right now it will fail to resolve; agreed that it should have a better error message.

Contributor Author

Added a check for this in CheckAnalysis.

@hvanhovell
Contributor

@davies the PR is in good shape. The two main (minor) issues I could find are:

  • The use of the Hive gid, which is wrong. We could also break some compatibility and solve it at the root.
  • The usefulness of child expressions in the grouping_id function.

Davies Liu added 2 commits February 9, 2016 16:59
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@SparkQA

SparkQA commented Feb 10, 2016

Test build #51017 has finished for PR 10677 at commit 90c1655.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

@davies I did some reading on grouping_id() and the behavior differs per vendor. Oracle also allows you to use a subset of the grouping columns, see: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions063.htm. Let's keep it as it is; someone can always implement this fancier operator later (it shouldn't be too hard).

As for the Hive compatibility, we could also add a few lines of comments in the Analyzer explaining why Hive is wrong.

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51025 has finished for PR 10677 at commit 9511c2c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51027 has finished for PR 10677 at commit c008569.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Feb 10, 2016

@hvanhovell The analyzer knows nothing about Hive; where is the best place to put the comment?

@hvanhovell
Contributor

@davies Yeah, you have a point there.

We have inherited the wrong construction of the bitmask from Hive: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala#L202-L206. We could also fix/document it there.

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51043 has finished for PR 10677 at commit aa34559.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM

I left a few minor final comments in CatalystQl

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51053 has finished for PR 10677 at commit 3469e45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2016

Test build #2534 has finished for PR 10677 at commit 9c7a06f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Feb 11, 2016

merging into master, thanks!

@asfgit asfgit closed this in b5761d1 Feb 11, 2016
asfgit pushed a commit that referenced this pull request Nov 5, 2016
## What changes were proposed in this pull request?

Prior to this PR, the following code would cause an NPE:

```scala
case class point(a: String, b: String, c: String, d: Int)

val data = Seq(
  point("1", "2", "3", 1),
  point("4", "5", "6", 1),
  point("7", "8", "9", 1)
)
sc.parallelize(data).toDF().registerTempTable("table")
spark.sql("select a, b, c, count(d) from table group by a, b, c GROUPING SETS ((a))").show()
```

The reason is that when the grouping_id() behavior was changed in #10677, some code that should have been updated was left out.

Take the above code for example: prior to #10677, the bit mask for set "(a)" was `001`, while after #10677 it became `011`. However, `nonNullBitmask` was not changed accordingly.

This PR fixes the problem.
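For concreteness, here is a hypothetical Python sketch of the two bitmask conventions for `GROUP BY a, b, c` with grouping set `(a)`. The exact bit layouts are my interpretation of the `001` vs `011` values quoted above, not Spark code:

```python
GROUP_BY = ["a", "b", "c"]

def hive_style_gid(grouping_set):
    # pre-#10677 (Hive-style) assumption: bit i is set when the
    # i-th GROUP BY column IS part of the current grouping set
    gid = 0
    for i, col in enumerate(GROUP_BY):
        if col in grouping_set:
            gid |= 1 << i
    return gid

def standard_gid(grouping_set):
    # post-#10677 (standard) convention: first column is the most
    # significant bit, and a bit is set when the column is
    # aggregated away (not part of the grouping set)
    gid = 0
    for col in GROUP_BY:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid

print(f"{hive_style_gid({'a'}):03b}")  # 001
print(f"{standard_gid({'a'}):03b}")    # 011
```

Any code keyed to the old `001`-style mask (such as `nonNullBitmask` here) has to be updated in lockstep when the convention flips.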
## How was this patch tested?

add integration tests

Author: wangyang <wangyang@haizhi.com>

Closes #15416 from yangw1234/groupingid.

(cherry picked from commit fb0d608)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 5, 2016
asfgit pushed a commit that referenced this pull request Nov 5, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017