
[SPARK-17844] Simplify DataFrame API for defining frame boundaries in window functions #15412

Closed
wants to merge 3 commits

Conversation

rxin
Contributor

@rxin rxin commented Oct 10, 2016

What changes were proposed in this pull request?

When I was creating the example code for SPARK-10496, I realized it was pretty convoluted to define the frame boundaries for window functions when there is no partition column or ordering column. The reason is that we don't provide a way to create a WindowSpec directly with the frame boundaries. We can trivially improve this by adding rowsBetween and rangeBetween to Window object.

As an example, to compute cumulative sum using the natural ordering, before this pr:

```
df.select('key, sum("value").over(Window.partitionBy(lit(1)).rowsBetween(Long.MinValue, 0)))
```

After this pr:

```
df.select('key, sum("value").over(Window.rowsBetween(Long.MinValue, 0)))
```

Note that you could argue there is no point in specifying a window frame without partitionBy/orderBy -- but it is strange that rowsBetween and rangeBetween are the only two WindowSpec APIs not also available directly on the Window object.
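The shape of the change can be sketched as follows. This is a minimal Python model with hypothetical names (`WindowSpec`, `Window`, `rows_between`), not Spark's actual implementation: the point is that the Window object gains frame-boundary shortcuts that build a spec carrying only a frame, so no dummy `partitionBy(lit(1))` is needed.

```python
class WindowSpec:
    """Simplified stand-in for Spark's WindowSpec (hypothetical model)."""

    def __init__(self, partition_by=(), order_by=(), frame=None):
        self.partition_by = tuple(partition_by)
        self.order_by = tuple(order_by)
        # frame is ("rows" | "range", start_offset, end_offset)
        self.frame = frame

    def rows_between(self, start, end):
        return WindowSpec(self.partition_by, self.order_by, ("rows", start, end))

    def range_between(self, start, end):
        return WindowSpec(self.partition_by, self.order_by, ("range", start, end))


class Window:
    """Simplified stand-in for the Window companion object."""

    @staticmethod
    def partition_by(*cols):
        return WindowSpec(partition_by=cols)

    # The shortcuts proposed in this PR: build a frame-only spec directly,
    # with no partitioning or ordering columns attached.
    @staticmethod
    def rows_between(start, end):
        return WindowSpec().rows_between(start, end)

    @staticmethod
    def range_between(start, end):
        return WindowSpec().range_between(start, end)


# Before: Window.partition_by(...).rows_between(...); after: one call.
spec = Window.rows_between(-(2**63), 0)  # -(2**63) stands in for Long.MinValue
```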

This also fixes https://issues.apache.org/jira/browse/SPARK-17656 (removing `_root_.scala`).

How was this patch tested?

Added test cases to compute cumulative sum in DataFrameWindowSuite for Scala/Java and tests.py for Python.
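To illustrate what those cumulative-sum tests exercise, here is a small self-contained Python sketch of the rows-frame semantics. The helper name `sum_over_rows_frame` is hypothetical; it only models what `rowsBetween(Long.MinValue, 0)` means: for each row, aggregate every row from the start of the partition up to and including the current row.

```python
def sum_over_rows_frame(values, start, end):
    """Sum over a rows-based window frame (simplified model, one partition).

    start/end are offsets relative to the current row index;
    None means unbounded in that direction.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if start is None else max(0, i + start)
        hi = n - 1 if end is None else min(n - 1, i + end)
        out.append(sum(values[lo:hi + 1]) if lo <= hi else 0)
    return out


# Unbounded preceding through current row == cumulative sum:
print(sum_over_rows_frame([1, 2, 3, 4], None, 0))  # [1, 3, 6, 10]
```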

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66615 has finished for PR 15412 at commit 98b77a7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66616 has finished for PR 15412 at commit 4d02864.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Hi @rxin, I just happened to look at this PR. A gentle reminder, just in case: besides SPARK-17656, there are two more such cases in ./sql/core/src/main/scala/org/apache/spark/sql/expressions/udaf.scala. This may not be directly relevant to this PR, but the changes here rang a bell, so I wanted to let you know.

@rxin
Contributor Author

rxin commented Oct 10, 2016

Sure I can fix those in this pull request too. Thanks for the reminder.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66623 has finished for PR 15412 at commit e141868.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor Author

rxin commented Oct 11, 2016

cc @hvanhovell ?

@hvanhovell
Contributor

LGTM - merging to master.

@asfgit asfgit closed this in b515768 Oct 11, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… window functions

Author: Reynold Xin <rxin@databricks.com>

Closes apache#15412 from rxin/SPARK-17844.