[SPARK-27297] [SQL] Add higher order functions to scala API #24232
Conversation
 *
 * @group collection_funcs
 */
def exists(column: Column, f: Column => Column): Column = withExpr {
But how do we support this in Java?
Could we change the signatures to accept scala.runtime.AbstractFunctions instead, to avoid using the Function traits?
Let's add (Scala-specific) at least for each doc. BTW, please take a look at the style guide at https://github.com/databricks/scala-style-guide
Actually a better idea would probably be to use Java functional interfaces.
@FunctionalInterface
interface Function3<T1, T2, T3, R> {
  R apply(T1 t1, T2 t2, T3 t3);
}
Column map_zip_with(Column left, Column right, Function3<Column, Column, Column, Column> f) { ... }
And of course we would use the existing functional interfaces from java.util.function first, but I don't think there are any that accept three parameters like some of the functions here require.
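For context, java.util.function tops out at BiFunction, so a three-argument variant would have to be defined. A minimal, self-contained sketch of such an interface (the names here are illustrative, not Spark's actual classes) showing that it stays lambda-friendly on the Scala side too:

```scala
// Hypothetical three-argument functional interface, analogous to the
// @FunctionalInterface proposed above; java.util.function has no
// three-parameter variant.
trait TriFunction[T1, T2, T3, R] {
  def apply(t1: T1, t2: T2, t3: T3): R
}

object TriFunctionDemo {
  // In Scala 2.12+, a lambda converts to a single-abstract-method type,
  // just as a Java lambda would against a @FunctionalInterface.
  val sum: TriFunction[Int, Int, Int, Int] = (a, b, c) => a + b + c

  def main(args: Array[String]): Unit =
    println(sum(1, 2, 3)) // prints 6
}
```

A Java caller could pass a lambda such as `(a, b, c) -> a + b + c` against the same interface, which is the interop the review comments above are after.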
It appears these interfaces already exist in the source tree: https://github.com/apache/spark/blob/v2.4.0/core/src/main/java/org/apache/spark/api/java/function/Function3.java
I'll come back later to add Java-specific APIs that utilize these.
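The shape of such a Java-specific overload might look like the following self-contained sketch, where Column is a simplified stand-in for Spark's actual class and JFunction3 only mirrors the shape of org.apache.spark.api.java.function.Function3 (the expression strings are illustrative, not what Catalyst produces):

```scala
// Stand-in for Spark's Column, so the sketch compiles on its own.
case class Column(expr: String)

// Mirrors the shape of org.apache.spark.api.java.function.Function3.
trait JFunction3[T1, T2, T3, R] {
  def call(t1: T1, t2: T2, t3: T3): R
}

object functions {
  // Scala-specific variant taking a native Scala function type.
  def map_zip_with(left: Column, right: Column,
      f: (Column, Column, Column) => Column): Column = {
    val body = f(Column("k"), Column("v1"), Column("v2"))
    Column(s"map_zip_with(${left.expr}, ${right.expr}, ${body.expr})")
  }

  // Java-specific overload that simply delegates to the Scala variant,
  // so Java callers can pass a lambda against the SAM interface.
  def map_zip_with(left: Column, right: Column,
      f: JFunction3[Column, Column, Column, Column]): Column =
    map_zip_with(left, right,
      (k: Column, v1: Column, v2: Column) => f.call(k, v1, v2))
}
```

The delegation keeps a single implementation, with the Java overload reduced to an adapter.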
These signatures won't work in Java as they rely on Scala lambdas.
I think this is a whole lot to add and support in the APIs, and it will be Scala-specific, in an API that is more meant to follow SQL operations than Scala.
@srowen Can't we make the same argument against any of the Scala functions in the existing API? Also, I can provide equivalent methods in Java that will accept lambda expressions via a functional interface.
I've added Java-specific versions of the API as well to show how Java interop can be handled. I still need to add the corresponding tests.
Can you hold it for a while before we go further? Growing APIs in
Sure, I'd love to get some more feedback on this!
Any more thoughts on this? @HyukjinKwon @rxin @gatorsmile @ueshin @srowen
Also cc @hvanhovell
Would it be more appropriate for me to close the issue and make a third-party library for these if growing the API is a concern? @HyukjinKwon @rxin @gatorsmile @ueshin @srowen @hvanhovell
I have no strong opinion here. I would leave it to @rxin, @ueshin, @hvanhovell
It's extremely weird and inconsistent for there to be SparkSQL functions with no DSL equivalent. It forces companies such as Swoop to create our own (e.g., https://gist.github.com/ssimeonov/8d902d0dfda934a79c3a46ec7dc0523d) yet bear the uncertainty as to what OSS Spark does, which is not a great outcome for the ecosystem. It would have been much better if SparkSQL and DSL support had been launched jointly. The growing size of
Either way, delaying a decision on this functionality, by which I don't mean this specific PR, does not help. /cc @rxin
I noticed that the implementation I initially submitted only worked for bound column references, so I've fixed that with this most recent commit. Referencing columns via
I feel it's ok to have these functions. Fills a gap.
@rxin @HyukjinKwon What are the next steps here? Can we get a Jenkins build kicked off?
ok to test
Test build #107117 has finished for PR 24232 at commit
The build is failing, but not in the changes I have made:
@HyukjinKwon How should we proceed to isolate the failure from my changes?
That doesn't seem to be failing in master. I suspect it is somehow related to this change, though it's hard to see how. Does it compile locally?
ping @nvander1 are you able to compile locally?
Test build #110961 has finished for PR 24232 at commit
FYI: We might want to include the method for #25666.
retest this please
Test build #111034 has finished for PR 24232 at commit
@ueshin I think it's ready to go now then, pending maintainer review :)
I left some nits. Otherwise LGTM.
Btw, shall we add forall in the description?
Thanks!
 * @since 3.0.0
 */
def aggregate(expr: Column, zero: Column, merge: (Column, Column) => Column,
    finish: Column => Column): Column = withExpr {
nit: style
def aggregate(
expr: Column,
zero: Column,
merge: (Column, Column) => Column,
finish: Column => Column): Column = withExpr {
...
}
 * @since 3.0.0
 */
def map_zip_with(left: Column, right: Column,
    f: (Column, Column, Column) => Column): Column = withExpr {
ditto.
put(2, 1);
put(4, 2);
}}),
null
nit: style. one more indent?
put(1, 2);
put(2, 4);
}}),
null
ditto.
@Test
public void testTransformValues() {
checkAnswer(
mapDf.select(transform_values(col("x"), (k, v) -> k.plus(v))),
toRows(
mapAsScalaMap(new HashMap<Integer, Integer>() {{
put(1, 2);
put(2, 4);
}}),
null
)
);
}
Does this work as well? I've moved the new HashMap up a line. @ueshin
Also, what is the general preference in the codebase, each paren and brace on a new line?
Or the more "lispy" style of every close on the same line:
@Test
public void testTransformValues() {
checkAnswer(
mapDf.select(transform_values(col("x"), (k, v) -> k.plus(v))),
toRows(
mapAsScalaMap(new HashMap<Integer, Integer>() {{
put(1, 2);
put(2, 4);}}),
null));
}
I've seen a mixture of the two to various degrees in the code. I edited this file to at least be consistent with itself (the exception here being the mapAsScalaMap / HashMap, since it really is its own entity just being converted to a Scala equivalent).
Maybe the first one is more preferred.
The second one needs a line break at the end of HashMap, since it's a block:
mapAsScalaMap(new HashMap<Integer, Integer>() {{
put(1, 2);
put(2, 4);
}}),
null));
I'm not quite sure about the parentheses after null. Maybe we need a line break as well.
As for my comment, sorry, maybe my pointer was wrong. I meant new HashMap ... should be on one more indent.
mapAsScalaMap(
new HashMap<Integer, Integer>() {{
put(1, 2);
put(2, 4);
}}
),
null
Test build #111666 has finished for PR 24232 at commit
Test build #111669 has finished for PR 24232 at commit
Jenkins, retest this please.
Test build #111680 has finished for PR 24232 at commit
Thanks! Merging to master.
+1!
This reverts commit 1e78335. This was merged upstream in spark: apache/spark#24232
… functions object ### What changes were proposed in this pull request? Add an overload for the higher order function `filter` that takes array index as its second argument to `org.apache.spark.sql.functions`. ### Why are the changes needed? See: SPARK-28962 and SPARK-27297. Specifically ueshin pointing out the discrepancy here: #24232 (comment) ### Does this PR introduce any user-facing change? ### How was this patch tested? Updated these test suites: `test.org.apache.spark.sql.JavaHigherOrderFunctionsSuite` and `org.apache.spark.sql.DataFrameFunctionsSuite` Closes #26007 from nvander1/add_index_overload_for_filter. Authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
What changes were proposed in this pull request?
There is currently no Scala API equivalent for the higher order functions introduced in Spark 2.4.0.
Equivalent column-based functions should be added to the Scala API for org.apache.spark.sql.functions with the following signatures:
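A self-contained sketch of a subset of those signatures, drawn from the snippets reviewed in this thread (exists, forall, aggregate); Column is a stand-in here so the sketch compiles on its own, and the stub bodies only echo the call shape rather than building Catalyst expressions:

```scala
// Stand-in for org.apache.spark.sql.Column; the real methods live on
// org.apache.spark.sql.functions and build expressions via withExpr.
case class Column(expr: String)

object functions {
  // Illustrative subset of the higher-order-function signatures in this PR.
  def exists(column: Column, f: Column => Column): Column =
    Column(s"exists(${column.expr}, ${f(Column("x")).expr})")

  def forall(column: Column, f: Column => Column): Column =
    Column(s"forall(${column.expr}, ${f(Column("x")).expr})")

  def aggregate(
      expr: Column,
      zero: Column,
      merge: (Column, Column) => Column,
      finish: Column => Column): Column =
    Column(s"aggregate(${expr.expr}, ${zero.expr}, ...)")
}
```

The multi-line parameter layout for aggregate follows the style-guide formatting requested in the review comments above.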
How was this patch tested?
I've mimicked the existing tests for the higher order functions in org.apache.spark.sql.DataFrameFunctionsSuite that use expr to test the higher order functions. As an example of an existing test:
I've added this test that performs the same logic, but with the new column-based API.