
[SPARK-27297] [SQL] Add higher order functions to scala API #24232

Closed
wants to merge 42 commits into from

Conversation

nvander1
Contributor

@nvander1 commented Mar 28, 2019

What changes were proposed in this pull request?

There is currently no Scala API equivalent for the higher-order functions introduced in Spark 2.4.0:

  • transform
  • aggregate
  • filter
  • exists
  • forall
  • zip_with
  • map_zip_with
  • map_filter
  • transform_values
  • transform_keys

Equivalent column-based functions should be added to the Scala API in org.apache.spark.sql.functions with the following signatures:

 

def transform(column: Column, f: Column => Column): Column = ???

def transform(column: Column, f: (Column, Column) => Column): Column = ???

def exists(column: Column, f: Column => Column): Column = ???

def filter(column: Column, f: Column => Column): Column = ???

def aggregate(
    expr: Column,
    zero: Column,
    merge: (Column, Column) => Column,
    finish: Column => Column): Column = ???

def aggregate(
    expr: Column,
    zero: Column,
    merge: (Column, Column) => Column): Column = ???

def zip_with(
    left: Column,
    right: Column,
    f: (Column, Column) => Column): Column = ???

def transform_keys(expr: Column, f: (Column, Column) => Column): Column = ???

def transform_values(expr: Column, f: (Column, Column) => Column): Column = ???

def map_filter(expr: Column, f: (Column, Column) => Column): Column = ???

def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => Column): Column = ???
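For illustration, a minimal usage sketch of the proposed column-based overloads; the SparkSession setup and example data here are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    // Hypothetical usage sketch; names and signatures mirror the proposal above.
    object HigherOrderFunctionsUsage {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("xs")

        // transform: add 1 to each element of the array column.
        df.select(transform($"xs", x => x + 1)).show()

        // filter: keep only the even elements.
        df.select(filter($"xs", x => x % 2 === 0)).show()

        // aggregate: sum the elements, starting from a zero literal.
        df.select(aggregate($"xs", lit(0), (acc, x) => acc + x)).show()

        spark.stop()
      }
    }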

How was this patch tested?

I've mimicked the existing tests in org.apache.spark.sql.DataFrameFunctionsSuite that use expr to exercise the higher order functions.

As an example of an existing test:

  test("map_zip_with function - map of primitive types") {
    val df = Seq(
      (Map(8 -> 6L, 3 -> 5L, 6 -> 2L), Map[Integer, Integer]((6, 4), (8, 2), (3, 2))),
      (Map(10 -> 6L, 8 -> 3L), Map[Integer, Integer]((8, 4), (4, null))),
      (Map.empty[Int, Long], Map[Integer, Integer]((5, 1))),
      (Map(5 -> 1L), null)
    ).toDF("m1", "m2")

    checkAnswer(df.selectExpr("map_zip_with(m1, m2, (k, v1, v2) -> k == v1 + v2)"),
      Seq(
        Row(Map(8 -> true, 3 -> false, 6 -> true)),
        Row(Map(10 -> null, 8 -> false, 4 -> null)),
        Row(Map(5 -> null)),
        Row(null)))
  }

I've added a test that performs the same logic, but uses the new column-based API I've added:

    checkAnswer(df.select(map_zip_with(df("m1"), df("m2"), (k, v1, v2) => k === v1 + v2)),
      Seq(
        Row(Map(8 -> true, 3 -> false, 6 -> true)),
        Row(Map(10 -> null, 8 -> false, 4 -> null)),
        Row(Map(5 -> null)),
        Row(null)))

*
* @group collection_funcs
*/
def exists(column: Column, f: Column => Column): Column = withExpr {
Member

But how do we support this in Java?

Contributor Author

Could we change the signatures to accept scala.runtime.AbstractFunctions instead to avoid using the Function traits?

Member

Let's add (Scala-specific) at least to each doc. BTW, please take a look at the style guide: https://github.com/databricks/scala-style-guide

Contributor Author

@nvander1 Mar 28, 2019

Actually, a better idea would probably be to use Java functional interfaces.

@FunctionalInterface
interface Function3<T1, T2, T3, R> {
  R apply(T1 t1, T2 t2, T3 t3);
}

Column map_zip_with(Column left, Column right, Function3<Column, Column, Column, Column> f) { ... }

Contributor Author

And of course we would use the existing functional interfaces from java.util.function first, but I don't think there are any that accept three parameters like some of the functions here require.

Contributor Author

It appears these interfaces already exist in the source tree: https://github.com/apache/spark/blob/v2.4.0/core/src/main/java/org/apache/spark/api/java/function/Function3.java

I'll come back later to add Java-specific APIs that utilize these.
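For a sense of how that could look, a rough sketch of a Java-friendly overload that adapts the existing Function3 interface and delegates to the Scala variant; scalaMapZipWith is a hypothetical stand-in for the Scala-facing implementation, not the merged code:

    import org.apache.spark.sql.Column
    import org.apache.spark.api.java.function.{Function3 => JFunction3}

    object JavaInteropSketch {
      // Hypothetical stand-in for the Scala map_zip_with defined in functions.scala.
      def scalaMapZipWith(
          left: Column,
          right: Column,
          f: (Column, Column, Column) => Column): Column = ???

      // Java-friendly overload: adapt the Java functional interface to a Scala function.
      def map_zip_with(
          left: Column,
          right: Column,
          f: JFunction3[Column, Column, Column, Column]): Column =
        scalaMapZipWith(left, right, (k, v1, v2) => f.call(k, v1, v2))
    }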

@srowen
Member

srowen commented Mar 28, 2019

I think this is a whole lot to add and support in the APIs, and it will be Scala-specific, in an API that is meant more to follow SQL operations than Scala.

@nvander1
Contributor Author

@srowen Can't we make the same argument against any of the Scala functions in org.apache.spark.sql.functions?

Also, I can provide equivalent methods in Java that accept lambda expressions via a functional interface.

@nvander1
Contributor Author

I've added Java-specific versions of the API as well to show how Java interop can be handled. I still need to add the corresponding tests.

@HyukjinKwon
Member

Can you hold it for a while before we go further? Growing APIs in functions.scala is a concern. cc @rxin, @gatorsmile, and @ueshin

@nvander1
Contributor Author

Sure, I'd love to get some more feedback on this!

@nvander1
Contributor Author

nvander1 commented Apr 2, 2019

Any more thoughts on this? @HyukjinKwon @rxin @gatorsmile @ueshin @srowen

@gatorsmile
Member

Also cc @hvanhovell

@nvander1
Contributor Author

Would it be more appropriate for me to close the issue and make a third-party library for these if growing the API is a concern? @HyukjinKwon @rxin @gatorsmile @ueshin @srowen @hvanhovell

@HyukjinKwon
Member

I have no strong opinion here. I would leave it to @rxin, @ueshin, @hvanhovell.

@ssimeonov
Contributor

It's extremely weird and inconsistent for there to be SparkSQL functions with no DSL equivalent. It forces companies such as Swoop to create our own (e.g., https://gist.github.com/ssimeonov/8d902d0dfda934a79c3a46ec7dc0523d) while bearing the uncertainty of what OSS Spark will do, which is not a great outcome for the ecosystem. It would have been much better if SparkSQL and DSL support had been launched jointly.

The growing size of functions feels like a red herring, but if it is a concern, what prevents us from putting various categories of functions into other "namespaces", the way Dataset does with statistics and missing-value functionality?

Either way, delaying a decision on this functionality, by which I don't mean this specific PR, does not help.

/cc @rxin

@nvander1
Contributor Author

I noticed that the implementation I initially submitted only worked for bound column references, so I've fixed that with the most recent commit. Referencing columns via col("x") will now work, instead of just dataframe("x").
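A minimal sketch of the difference, assuming the transform overload proposed in this PR; the example data is hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, transform}

    object ColumnReferenceSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq(Seq(1, 2, 3)).toDF("xs")

        // Unbound reference via col("xs") -- works after the fix.
        df.select(transform(col("xs"), x => x + 1)).show()

        // Bound reference via df("xs") -- worked already.
        df.select(transform(df("xs"), x => x + 1)).show()

        spark.stop()
      }
    }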

@rxin
Contributor

rxin commented Jun 15, 2019

I feel it's ok to have these functions. Fills a gap.

@nvander1
Contributor Author

@rxin @HyukjinKwon What are the next steps here? Can we get a Jenkins build kicked off?

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

@ssimeonov,

It's extremely weird and inconsistent for there to be SparkSQL functions with no DSL equivalent.

See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L42-L56

@SparkQA

SparkQA commented Jul 2, 2019

Test build #107117 has finished for PR 24232 at commit 6bf07d8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nvander1
Contributor Author

nvander1 commented Jul 11, 2019

The build is failing, but not at changes I have made:

[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala:184: type mismatch;
[error]  found   : Function0[Int] (in scala)
[error]  required: Function0[?]   (in org.apache.spark.api.java.function)
[error]     val sampleUDF = udf(sample)

@HyukjinKwon How should we proceed to isolate the failure from my changes?

@srowen
Member

srowen commented Jul 11, 2019

That doesn't seem to be failing in master. I suspect it is somehow related to this change, though it's hard to see how. Does it compile locally?

@HyukjinKwon
Member

ping @nvander1 are you able to compile locally?

@SparkQA

SparkQA commented Sep 19, 2019

Test build #110961 has finished for PR 24232 at commit e43033b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Sep 19, 2019

FYI: We might want to include the method for #25666.

@nvander1
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111034 has finished for PR 24232 at commit 722f0e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nvander1
Contributor Author

@ueshin Do you want to wait for #25666 to get sorted before moving forward with this? Or get this moving, then add the binding for the new filter overload in #25666?

@ueshin
Member

ueshin commented Sep 25, 2019

@nvander1 We don't need to wait for #25666. It won't affect the current behavior; it just adds a new usage. We can do it in a separate PR.

@nvander1
Contributor Author

@ueshin I think it's ready to go now then, pending maintainer review :)

Member

@ueshin left a comment

I left some nits. Otherwise LGTM.
Btw, shall we add forall in the description?
Thanks!

* @since 3.0.0
*/
def aggregate(expr: Column, zero: Column, merge: (Column, Column) => Column,
finish: Column => Column): Column = withExpr {
Member

nit: style

  def aggregate(
      expr: Column,
      zero: Column,
      merge: (Column, Column) => Column,
      finish: Column => Column): Column = withExpr {
    ...
  }

* @since 3.0.0
*/
def map_zip_with(left: Column, right: Column,
f: (Column, Column, Column) => Column): Column = withExpr {
Member

ditto.

put(2, 1);
put(4, 2);
}}),
null
Member

nit: style. one more indent?

put(1, 2);
put(2, 4);
}}),
null
Member

ditto.

Contributor Author

    @Test
    public void testTransformValues() {
        checkAnswer(
            mapDf.select(transform_values(col("x"), (k, v) -> k.plus(v))),
            toRows(
                mapAsScalaMap(new HashMap<Integer, Integer>() {{
                    put(1, 2);
                    put(2, 4);
                }}),
                null
            )
        );
    }

Does this work as well? I've moved the new HashMap up a line. @ueshin

Also, what is the general preference in the codebase: each paren and brace on a new line?

Or the more "lispy" style of every close on the same line:

    @Test
    public void testTransformValues() {
        checkAnswer(
            mapDf.select(transform_values(col("x"), (k, v) -> k.plus(v))),
            toRows(
                mapAsScalaMap(new HashMap<Integer, Integer>() {{
                    put(1, 2);
                    put(2, 4);}}),
                null));
    }

I've seen a mixture of the two to various degrees in the code. I edited this file to at least be consistent with itself (the exception here being the mapAsScalaMap / HashMap, since it really is its own entity just being converted to a Scala equivalent).

Member

Maybe the first one is preferred.
The second one needs a line break at the end of the HashMap since it's a block:

                mapAsScalaMap(new HashMap<Integer, Integer>() {{
                    put(1, 2);
                    put(2, 4);
                }}),
                null));

I'm not quite sure about the parentheses after null. Maybe we need a line break as well.

As for my comment, sorry, maybe my pointer was wrong.
I meant new HashMap ... should be indented one more level:

                mapAsScalaMap(
                    new HashMap<Integer, Integer>() {{	
                        put(1, 2);
                        put(2, 4);
                    }}
                ),
                null

@SparkQA

SparkQA commented Oct 2, 2019

Test build #111666 has finished for PR 24232 at commit 1bf2654.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 2, 2019

Test build #111669 has finished for PR 24232 at commit 64c0f87.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Oct 2, 2019

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 2, 2019

Test build #111680 has finished for PR 24232 at commit 64c0f87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Oct 2, 2019

Thanks! merging to master.

@HyukjinKwon
Member

+1 !

nvander1 added a commit to MrPowers/spark-daria that referenced this pull request Oct 3, 2019
This reverts commit 1e78335.

This was merged upstream in spark: apache/spark#24232
ueshin pushed a commit that referenced this pull request Oct 3, 2019
… functions object

### What changes were proposed in this pull request?
Add an overload for the higher order function `filter` that takes the array index as its second argument to `org.apache.spark.sql.functions`.

### Why are the changes needed?
See: SPARK-28962 and SPARK-27297. Specifically, ueshin pointed out the discrepancy here: #24232 (comment)

### Does this PR introduce any user-facing change?

### How was this patch tested?
Updated these test suites:

`test.org.apache.spark.sql.JavaHigherOrderFunctionsSuite`
and
`org.apache.spark.sql.DataFrameFunctionsSuite`

Closes #26007 from nvander1/add_index_overload_for_filter.

Authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>