
[SPARK-17844] Simplify DataFrame API for defining frame boundaries in window functions #15412

Closed
wants to merge 3 commits

Conversation

rxin
Contributor

@rxin rxin commented Oct 10, 2016

What changes were proposed in this pull request?

When I was creating the example code for SPARK-10496, I realized it was pretty convoluted to define the frame boundaries for window functions when there is no partition column or ordering column. The reason is that we don't provide a way to create a WindowSpec directly with the frame boundaries. We can trivially improve this by adding rowsBetween and rangeBetween to Window object.

As an example, to compute cumulative sum using the natural ordering, before this pr:

```
df.select('key, sum("value").over(Window.partitionBy(lit(1)).rowsBetween(Long.MinValue, 0)))
```

After this pr:

```
df.select('key, sum("value").over(Window.rowsBetween(Long.MinValue, 0)))
```

Note that you could argue there is no point in specifying a window frame without partitionBy/orderBy -- but it is strange that rowsBetween and rangeBetween are the only two WindowSpec APIs not also available directly on the Window object.
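The shape of the change can be sketched as follows. This is a minimal Python model with hypothetical names (`WindowSpec`, `Window`, `rows_between`), not Spark's actual implementation: the point is that the Window object gains frame-boundary shortcuts that build a spec carrying only a frame, so no dummy `partitionBy(lit(1))` is needed.

```python
class WindowSpec:
    """Simplified stand-in for Spark's WindowSpec (hypothetical model)."""

    def __init__(self, partition_by=(), order_by=(), frame=None):
        self.partition_by = tuple(partition_by)
        self.order_by = tuple(order_by)
        # frame is ("rows" | "range", start_offset, end_offset)
        self.frame = frame

    def rows_between(self, start, end):
        return WindowSpec(self.partition_by, self.order_by, ("rows", start, end))

    def range_between(self, start, end):
        return WindowSpec(self.partition_by, self.order_by, ("range", start, end))


class Window:
    """Simplified stand-in for the Window companion object."""

    @staticmethod
    def partition_by(*cols):
        return WindowSpec(partition_by=cols)

    # The shortcuts proposed in this PR: build a frame-only spec directly,
    # with no partitioning or ordering columns attached.
    @staticmethod
    def rows_between(start, end):
        return WindowSpec().rows_between(start, end)

    @staticmethod
    def range_between(start, end):
        return WindowSpec().range_between(start, end)


# Before: Window.partition_by(...).rows_between(...); after: one call.
spec = Window.rows_between(-(2**63), 0)  # -(2**63) stands in for Long.MinValue
```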

This also fixes https://issues.apache.org/jira/browse/SPARK-17656 (removing `_root_.scala`).

How was this patch tested?

Added test cases to compute cumulative sum in DataFrameWindowSuite for Scala/Java and tests.py for Python.
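To illustrate what those cumulative-sum tests exercise, here is a small self-contained Python sketch of the rows-frame semantics. The helper name `sum_over_rows_frame` is hypothetical; it only models what `rowsBetween(Long.MinValue, 0)` means: for each row, aggregate every row from the start of the partition up to and including the current row.

```python
def sum_over_rows_frame(values, start, end):
    """Sum over a rows-based window frame (simplified model, one partition).

    start/end are offsets relative to the current row index;
    None means unbounded in that direction.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if start is None else max(0, i + start)
        hi = n - 1 if end is None else min(n - 1, i + end)
        out.append(sum(values[lo:hi + 1]) if lo <= hi else 0)
    return out


# Unbounded preceding through current row == cumulative sum:
print(sum_over_rows_frame([1, 2, 3, 4], None, 0))  # [1, 3, 6, 10]
```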

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66615 has finished for PR 15412 at commit 98b77a7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66616 has finished for PR 15412 at commit 4d02864.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Hi @rxin, I just happened to look at this PR. A gentle reminder, just in case: besides SPARK-17656, there are two more such cases in ./sql/core/src/main/scala/org/apache/spark/sql/expressions/udaf.scala. This may not be directly relevant to this PR, but the changes here rang a bell, so I wanted to let you know.

@rxin
Contributor Author

rxin commented Oct 10, 2016

Sure I can fix those in this pull request too. Thanks for the reminder.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66623 has finished for PR 15412 at commit e141868.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor Author

rxin commented Oct 11, 2016

cc @hvanhovell ?

@hvanhovell
Contributor

LGTM - merging to master.

@asfgit asfgit closed this in b515768 Oct 11, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… window functions

Author: Reynold Xin <rxin@databricks.com>

Closes apache#15412 from rxin/SPARK-17844.