[SPARK-13049] Add First/last with ignore nulls to functions.scala #10957

hvanhovell · 2016-01-27T21:40:09Z

This PR adds the ability to specify the ignoreNulls option to the functions dsl, e.g:
df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other")))
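
For readers unfamiliar with the pattern, the window expression above is typically used as a forward fill: each row gets the most recent non-null value in its partition. Below is a rough plain-Scala model of that behaviour, not the Spark implementation. It assumes the rows of one partition are already sorted, the ordering values are distinct, and the frame runs from the start of the partition to the current row. `ForwardFillSketch` and `runningLastNonNull` are hypothetical names, and `Option` stands in for a nullable column value.

```scala
object ForwardFillSketch {
  // For each row of an already-ordered partition, return the last non-null
  // value seen so far (None if every value up to that row is null).
  // This only models last(value, ignoreNulls = true) over a running frame.
  def runningLastNonNull[A](ordered: Seq[Option[A]]): Seq[Option[A]] =
    ordered
      .scanLeft(Option.empty[A])((seen, v) => v.orElse(seen)) // carry the previous value over nulls
      .tail                                                   // drop the initial seed element
}
```

For example, `ForwardFillSketch.runningLastNonNull(Seq(None, Some(2), None))` yields `Seq(None, Some(2), Some(2))`.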

This PR is somewhere between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport it to 1.6.

cc @yhuai

yhuai · 2016-01-27T21:47:22Z

@hvanhovell Thanks for the PR. Do you know why expr/callUDF does not work?

hvanhovell · 2016-01-27T21:59:04Z

@yhuai expr("last(r, true)") would return an UnresolvedFunction(UnresolvedAttribute(r), Literal(true)). The problem is that the WindowSpec does not recognize UnresolvedFunctions.

This is the cleaner fix. We could also add a match to the WindowSpec for unresolved functions.

hvanhovell · 2016-01-27T21:59:23Z

retest this please

hvanhovell · 2016-01-27T22:21:46Z

retest this please

rxin · 2016-01-27T23:06:23Z

Why might this be a bug fix?

hvanhovell · 2016-01-27T23:21:52Z

A user is trying to get this working on 1.6 using the DataFrame API. That doesn't work directly because functions.scala lacks the functions implemented in this PR. The indirect approach using expr(...) doesn't work because WindowSpec does not support UnresolvedFunctions.

I guess this is more a feature than a bug fix....

SparkQA · 2016-01-27T23:45:10Z

Test build #50231 has finished for PR 10957 at commit defcc02.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-29T01:53:40Z

Actually can you update the Python API as well?

SparkQA · 2016-01-31T14:37:13Z

Test build #50462 has finished for PR 10957 at commit b002d60.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-31T16:11:39Z

Test build #50463 has finished for PR 10957 at commit 6e4da4f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-31T18:48:58Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

-   */
-  def first(e: Column): Column = withAggregateFunction { new First(e.expr) }
+    * Aggregate function: returns the first value in a group. The function does not consider null
+    * values when the ignoreNulls flag is set to true.


Can you write something like this to be more clear? And update all the docs (including Python).

"The function by default includes the first value it sees. When ignoreNulls is set to true, then it ignores the null values and includes the first non-null value. If all values are null, then null is returned."

rxin · 2016-01-31T18:49:20Z

Thanks - only a minor comment on the documentation to make it clearer.
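
The documentation wording proposed in the review comment above can be illustrated with a small plain-Scala model. This is only a sketch of the described semantics, not the code in functions.scala: `FirstLastSemantics` is a hypothetical name, `Option` stands in for a nullable value, and the input `Seq` represents the values of one group in the order they are seen.

```scala
object FirstLastSemantics {
  // By default, return the first value seen, even if it is null (None).
  // With ignoreNulls = true, skip nulls and return the first non-null value;
  // if all values are null, the result is null (None).
  def first[A](values: Seq[Option[A]], ignoreNulls: Boolean = false): Option[A] =
    if (ignoreNulls) values.flatten.headOption
    else values.headOption.flatten

  // Mirror image of first: the last value seen, or the last non-null value.
  def last[A](values: Seq[Option[A]], ignoreNulls: Boolean = false): Option[A] =
    if (ignoreNulls) values.flatten.lastOption
    else values.lastOption.flatten
}
```

With the group `Seq(None, Some(1), Some(2))`, `first` returns `None` by default and `Some(1)` when `ignoreNulls = true`.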

SparkQA · 2016-01-31T21:10:24Z

Test build #50464 has finished for PR 10957 at commit 809c999.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-31T21:56:11Z

Thanks - merging this in master.

First/last with ignore nulls.

defcc02

hvanhovell added 2 commits January 31, 2016 13:22

Merge remote-tracking branch 'spark/master' into SPARK-13049

f9cad44

Add first/last ignoreNulls in python

b002d60

Style.

6e4da4f

rxin reviewed Jan 31, 2016

Extra docs.

809c999

asfgit closed this in 5a8b978 Jan 31, 2016
