-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-34981][SQL] Implement V2 function resolution and evaluation #32082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #137036 has finished for PR 32082 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #137039 has finished for PR 32082 at commit
|
|
Test build #137041 has finished for PR 32082 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #137064 has finished for PR 32082 at commit
|
173f7e1 to
dc9dde6
Compare
|
Test build #137227 has finished for PR 32082 at commit
|
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #137237 has finished for PR 32082 at commit
|
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ScalarFunction.java
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ScalarFunction.java
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ScalarFunction.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we add the v2 API to check function existence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes looking at SPARK-19737 I think this will be very useful (although the checking won't be complete since bind could still fail). Do you mind if I add a TODO here and do it separately later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/V2Aggregator.scala
Outdated
Show resolved
Hide resolved
...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/V2Aggregator.scala
Outdated
Show resolved
Hide resolved
97c29c3 to
c453b64
Compare
|
Test build #137989 has finished for PR 32082 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #137992 has finished for PR 32082 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #137987 has finished for PR 32082 at commit
|
|
Test build #137994 has finished for PR 32082 at commit
|
...talyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
|
|
||
| bound match { | ||
| case scalarFunc: ScalarFunction[_] => | ||
| if (isDistinct) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for readability, can we put the logic of handling ScalarFunction into a method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done. I think we can also consider moving the big chunk of handling of V1 and V2 functions separately into two functions like what I used to have:
case u @ UnresolvedFunction(AsFunctionIdentifier(ident), arguments, isDistinct,
filter, ignoreNulls) => withPosition(u) {
processV1Function(...)
}
case u @ UnresolvedFunction(nameParts, arguments, isDistinct, filter, ignoreNulls) =>
withPosition(u) {
processV2Function(...)
}to keep the main logic of pattern matching on unresolved functions clearer. Let me know if you prefer this way too.
| } | ||
| } | ||
| } | ||
| case aggFunc: V2AggregateFunction[_, _] => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, put into a new method.
| * | ||
| * <pre> | ||
| * public class IntegerAdd implements{@code ScalarFunction<Integer>} { | ||
| * public int invoke(int left, int right) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I think we can also consider adding another "static invoke" API for those stateless UDFs. From the benchmark you did sometime back it seems this can give a decent performance improvement. WDYT?
Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.6
Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
UDF perf: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
native add 14206 14516 535 70.4 14.2 1.0X
udf add 24609 25271 898 40.6 24.6 0.6X
new udf add 18657 19096 726 53.6 18.7 0.8X
new row udf add 21128 22343 1478 47.3 21.1 0.7X
static udf add 16678 16887 278 60.0 16.7 0.9X
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sunchao can you spend some time on the API design? I'd love to see this feature!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure will do. It should similar to the current invoke and we can leverage StaticInvoke for the purpose. Do you think we can do this in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Test build #138006 has finished for PR 32082 at commit
|
|
thanks, merging to master! |
|
Thanks all! |
| override def name: String = function.name() | ||
| override def dataType: DataType = function.resultType() | ||
|
|
||
| private lazy val reusedRow = new GenericInternalRow(children.size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we use SpecificInternalRow to avoid boxing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah looks like we should. Let me see change this and run the benchmark again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the delay - just opened #32647 for this.
Co-Authored-By: Chao Sun sunchao@apple.com
Co-Authored-By: Ryan Blue rblue@netflix.com
What changes were proposed in this pull request?
This implements function resolution and evaluation for functions registered through V2 FunctionCatalog SPARK-27658. In particular:
ScalarFunction.ApplyFunctionExpressionwhich evaluates input by delegating toScalarFunction.produceResultmethod.V2Aggregatorwhich is a type ofTypedImperativeAggregate. It's a wrapper of V2AggregateFunctionand mostly delegate methods to the implementation of the latter. It also uses plain Java serde for intermediate state.ScalarFunctionandAggregateFunctioninAnalyzer.ScalarFunctionthis checks if the magic method is implemented through Java reflection, and create aInvokeexpression if so. Otherwise, it checks if the defaultproduceResultis overridden. If so, it creates aApplyFunctionExpressionwhich evaluates throughInternalRow. Otherwise an analysis exception is thrown.AggregateFunction, this checks if theupdatemethod is overridden. If so, it converts it toV2Aggregator. Otherwise an analysis exception is thrown similar to the case ofScalarFunction.InMemoryTableCatalogto add the function catalog capability. Also renamed it toInMemoryCatalogsince it no longer only covers tables.Note: this currently can successfully detect whether a subclass overrides the default
produceResultorupdatemethod from the parent interface only for Java implementations. It seems in Scala it's hard to differentiate whether a subclass overrides a default method from its parent interface. In this case, it will be a runtime error instead of analysis error.A few TODOs:
V2SessionCatalogwith function catalog. This seems a little tricky since API such V2FunctionCatalog'sloadFunctionis different from V1SessionCatalog'slookupFunction.AggregateFunction.Why are the changes needed?
As V2 FunctionCatalog APIs are finalized, we should integrate it with function resolution and evaluation process so that they are actually useful.
Does this PR introduce any user-facing change?
Yes, now a function exposed through V2 FunctionCatalog can be analyzed and evaluated.
How was this patch tested?
Added new unit tests.