[SPARK-31319][SQL][DOCS] Document UDFs/UDAFs in SQL Reference #28087
Test build #120660 has finished for PR 28087 at commit
|
@@ -1,22 +1,65 @@ | |||
--- | |||
layout: global | |||
title: User defined Aggregate Functions (UDAF) | |||
displayTitle: User defined Aggregate Functions (UDAF) | |||
title: User Defined Aggregate Functions (UDAF) |
nit: `UDAF` -> `UDAFs`?
|
||
// Define, register a UDAF to calculate the sum of product of two columns | ||
// Scala | ||
import functions.udf |
`import functions.udf` -> `import org.apache.spark.sql.functions.udf`?
Also, we need to add `import org.apache.spark.sql.expressions.Aggregator` and `import org.apache.spark.sql.Encoder`?
SELECT agg(a, b) FROM testUDAF; | ||
|
||
+---------------------------------------------+ | ||
|$anon$2(CAST(a AS BIGINT), CAST(b AS BIGINT))| |
btw (this is not related to this PR, though), the `agg` name defined in `spark.udf.register("agg", agg)` is not used in the name of the plan...
docs/sql-ref-functions-udf-scalar.md
Outdated
|
||
{% highlight sql %} | ||
|
||
// Define, register a zero argument non-deterministic UDF |
`Define, register` -> `Define and register`
docs/sql-ref-functions-udf-scalar.md
Outdated
|
||
// Define, register a zero argument non-deterministic UDF | ||
// Scala | ||
import functions.udf |
`import functions.udf` -> `import org.apache.spark.sql.functions.udf`?
docs/sql-ref-functions-udf-scalar.md
Outdated
spark.udf.register("groupFunction", (n: Int) => { n > 10 }) | ||
|
||
val df = Seq(("red", 1), ("red", 2), ("blue", 10), | ||
("green", 100), ("green", 200)).toDF("color", "value") |
It is better to match either format: a single value per line or multiple values per line https://github.com/apache/spark/pull/28087/files#diff-97b11639e031df253c371da5298980d9R44
docs/sql-ref-functions-udf.md
Outdated
Spark. This document describes the SQL constructs supported by Spark in detail | ||
along with usage examples when applicable. | ||
User-Defined Functions (UDFs) are a feature of Spark SQL that allows users to define their own functions when the system's built-in functions are not enough to perform the desired task. To use UDFs in SPARK SQL, users must first define the function, then register the function with SPARK, and finally call the registered function. The User-Defined Functions can act on a single row or act on multiple rows at once. | ||
|
We don't need to document Hive UDFs here?
I will think about this.
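For reference, the define / register / call sequence described in the `sql-ref-functions-udf.md` overview under review can be sketched as follows. This is a minimal sketch rather than text from the PR: the local `SparkSession` settings and the `plusOne` name are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("scalar-udf-sketch")
  .getOrCreate()

// 1. Define: wrap an ordinary Scala function as a UserDefinedFunction.
val plusOne = udf((x: Int) => x + 1)

// 2. Register: make it callable from SQL under a (hypothetical) name.
spark.udf.register("plusOne", plusOne)

// 3. Call: invoke the registered function from a SQL query.
val result = spark.sql("SELECT plusOne(5)").head().getInt(0)
```

The function acts on one row at a time, which is the scalar case the overview distinguishes from aggregate functions.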
Do we have a page for introducing the APIs for Scala, Java, Python UDFs/UDAFs/UDTF? If not, could we introduce it? |
I will think about where and how to document these. Thanks for your review! @maropu @gatorsmile |
Test build #120664 has finished for PR 28087 at commit
|
cc @Ngone51 Please help review this? |
I added Scala APIs. Java APIs and Python APIs are similar, so I prefer not to add them here. I think the Scala APIs can already give users a very good idea of what these UDF APIs are. If they need to check the Java or Python APIs, they can refer to the API docs. I don't want the majority of the page to become API docs. |
|
||
### Examples | ||
|
||
{% highlight sql %} | ||
|
||
// Define and register a UDAF to calculate the sum of product of two columns | ||
// Scala | ||
import org.apache.spark.sql.Encoder |
I am confused with this import. My test works OK without it.
Test build #120700 has finished for PR 28087 at commit
|
df.createOrReplaceTempView("testUDAF") | ||
|
||
-- SQL | ||
SELECT agg(a, b) FROM testUDAF; |
If this is a Scala case, `sql("SELECT agg(a, b) FROM testUDAF").show()` is better?
I will change all examples to this format
<dl> | ||
<dt><code><em>reduce(b: BUF, a: IN): BUF</em></code></dt> | ||
<dd> | ||
Combine two values to produce a new value. For performance, the function may modify b and return it instead of constructing new object for b. |
`a new value` -> `a new aggregated value`?
better to wrap b with `, e.g., `b`?
changed to b because ` doesn't work inside html.
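To make the `zero` / `reduce` / `merge` / `finish` contract discussed in this thread concrete, here is a plain-Scala model of it. `MiniAggregator` is a hypothetical stand-in written for this sketch, not Spark's `Aggregator` class, and the two "partitions" are made-up data. It shows why `reduce` may mutate and return `b`, and why `zero` must then hand out a fresh buffer each time.

```scala
// Hypothetical stand-in mirroring the shape of the Aggregator contract.
trait MiniAggregator[IN, BUF, OUT] {
  def zero: BUF                          // initial value of the intermediate result
  def reduce(b: BUF, a: IN): BUF         // fold one input into the buffer
  def merge(b1: BUF, b2: BUF): BUF       // combine two partial buffers
  def finish(reduction: BUF): OUT        // turn the final buffer into the output
}

// Mutable buffer: running sum and count.
final class Avg(var sum: Long, var count: Long)

object Average extends MiniAggregator[Long, Avg, Double] {
  // A new buffer per call, because reduce and merge modify their first argument.
  def zero: Avg = new Avg(0L, 0L)
  def reduce(b: Avg, a: Long): Avg = { b.sum += a; b.count += 1; b }
  def merge(b1: Avg, b2: Avg): Avg = { b1.sum += b2.sum; b1.count += b2.count; b1 }
  def finish(reduction: Avg): Double = reduction.sum.toDouble / reduction.count
}

// Simulate two partitions: reduce within each, then merge the partial buffers.
val partitions = Seq(Seq(1L, 2L, 3L), Seq(4L, 5L))
val partials = partitions.map(_.foldLeft(Average.zero)(Average.reduce))
val average = Average.finish(partials.reduce(Average.merge))
```

Returning the mutated `b` avoids allocating a new buffer per input row, which is the performance point the documented description of `reduce` makes.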
|
||
User-Defined Aggregate Functions (UDAFs) are user-programmable routines that act on multiple rows at once and return a single aggregated value as a result. This documentation lists the classes that are required for creating and registering UDAFs. It also contains examples that demonstrate how to define and register UDAFs in Scala and invoke them in Spark SQL. | ||
|
||
### org.apache.spark.sql.expressions.Aggregator[-IN, BUF, OUT] |
How about "Interface of `Aggregator[-IN, BUF, OUT]`"? Do we need the fully qualified name here?
I guess I will just use `Aggregator[-IN, BUF, OUT]`, because `UserDefinedFunction` is an abstract class and `UDFRegistration` is a regular class.
</dd> | ||
</dl> | ||
|
||
### org.apache.spark.sql.UDFRegistration |
ditto: how about "Interface of `UDFRegistration`"?
changed to `UDFRegistration`
docs/sql-ref-functions-udf-scalar.md
Outdated
| 6| | ||
+------+ | ||
|
||
// Define a two arguments UDF and register it with Spark in one step |
`Define a two arguments` -> `Define two arguments`?
I guess I will change to a two-argument UDF?
<dl> | ||
<dt><code><em>zero: BUF</em></code></dt> | ||
<dd> | ||
A zero value for this aggregation. |
The initial value of the intermediate result for this aggregation.
docs/sql-ref-functions-udf-scalar.md
Outdated
|300 | | ||
+----------+ | ||
|
||
# Define and register a UDF using Python |
Do we need this section? I think there is doc for Python UDFs. No?
Just want to give an example using Python. There is API doc for Python UDFs, but I won't have a separate doc for Python. I am OK to keep or delete it.
Test build #120710 has finished for PR 28087 at commit
|
docs/sql-ref-functions-udf-scalar.md
Outdated
</dl> | ||
|
||
<dl> | ||
<dt><code><em>nullable: Boolean</em></code></dt> |
Do we need `deterministic` and `nullable` here? I personally think `asNonNullable`, `asNondeterministic`, and `withName` are enough for users.
docs/sql-ref-functions-udf-scalar.md
Outdated
|
||
### UserDefinedFunction | ||
|
||
A user-defined function. To create one, use the `udf` functions in `functions`. |
How about saying here `To define the properties of a user-defined function, you can use some of the methods defined in 'UserDefinedFunction'`?
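As an illustration of the property-setting methods mentioned in this thread (`asNondeterministic`, `asNonNullable`, `withName`), a zero-argument non-deterministic UDF might be defined and registered as below. This is a sketch under assumptions: the local session settings and the `myRandom` name are hypothetical and chosen only for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("udf-properties-sketch")
  .getOrCreate()

// Each method returns an updated UserDefinedFunction, so the calls can be chained.
val myRandom = udf(() => Math.random())
  .asNondeterministic()   // repeated calls are not assumed to return the same value
  .asNonNullable()        // the result is declared to never be null
  .withName("myRandom")   // name shown in query plans

spark.udf.register("myRandom", myRandom)

// Invoke the registered UDF from SQL and read back the single Double it produced.
val sample = spark.sql("SELECT myRandom()").head().getDouble(0)
```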
Test build #120789 has finished for PR 28087 at commit
|
Test build #120790 has finished for PR 28087 at commit
|
|
||
### UDFRegistration | ||
|
||
Functions for registering user-defined functions. Use `SparkSession.udf` to access this. |
Users need to know this class? I personally think not, so how about removing this section?
The user needs to call `UDFRegistration.register` to register the UDAF:
spark.udf.register("agg", agg)
I guess we keep this section?
I still think it's OK for users to just know how to register a UDF/UDAF from the document, so users don't need to know what `UDFRegistration` is.
I will remove
### Aggregator[-IN, BUF, OUT] | ||
|
||
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value. | ||
- IN The input type for the aggregation. |
nit: need a single blank?
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.
- IN The input type for the aggregation.
nit: also, how about putting a separator between `IN` and the statement? e.g., `IN - The input...`
btw, the contents are somewhat overlapped with https://spark.apache.org/docs/3.0.0-preview/sql-getting-started.html#aggregations ? Probably, we need to organize these aggregator contents.
I will think about this.
I personally think it's better to move all the contents of `sql-getting-started.html` into the SQL doc. Since the user-defined aggregator topic is an advanced one, the SQL doc is more suitable than the `Getting Started` doc. cc: @gatorsmile
@maropu sounds good to me.
Test build #120851 has finished for PR 28087 at commit
|
import org.apache.spark.sql.expressions.Aggregator | ||
import org.apache.spark.sql.functions.udaf | ||
|
||
val agg = udaf(new Aggregator[(Long, Long), Long, Long] { |
Seems we already have an example at https://spark.apache.org/docs/latest/sql-getting-started.html#aggregations.
And do we still need to keep that `Aggregations` section or not? If not, I think we could reuse the previous examples here.
{% highlight sql %} | ||
|
||
// Define and register a UDAF to calculate the sum of product of two columns | ||
// Scala |
Can we highlight the Scala code as the other docs do? e.g.,
<div data-lang="scala" markdown="1">
{% include_example generic_load_save_functions scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>
Test build #121047 has finished for PR 28087 at commit
|
|
||
spark.stop() | ||
} | ||
|
nit: remove the blank line.
docs/sql-ref-functions-udf-scalar.md
Outdated
</div> | ||
</div> | ||
|
||
|
nit: remove the single blank line.
docs/sql-ref-functions-udf-scalar.md
Outdated
|
||
User-Defined Functions (UDFs) are user-programmable routines that act on one row. This documentation lists the classes that are required for creating and registering UDFs. It also contains examples that demonstrate how to define and register UDFs and invoke them in Spark SQL. | ||
|
||
|
nit: remove the blank line.
</div> | ||
|
||
### Related Statements | ||
- [Scalar User Defined Functions (UDFs)](sql-ref-functions-udf-scalar.html) |
nit: `-` -> `*`
LGTM except for the existing comments.
Test build #121108 has finished for PR 28087 at commit
|
Test build #121109 has finished for PR 28087 at commit
|
### What changes were proposed in this pull request?
Document UDF in SQL Reference

### Why are the changes needed?
To make SQL Reference complete.

### Does this PR introduce any user-facing change?
Yes. Here are the new pages: [16 screenshots]

### How was this patch tested?
Manually build and check

Closes #28087 from huaxingao/udf.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 3bbd80d)
Signed-off-by: Sean Owen <srowen@gmail.com>
Merged to master/3.0 |
Thank you everyone! |
(String s, Integer x) -> s.length() + x, DataTypes.IntegerType | ||
); | ||
spark.udf().register("strLen", strLen); | ||
spark.sql("SELECT strLen('test', 1)").show(); |
It would probably have been better if we had added a SQL example with a UDAF too, e.g.,
val agg = Aggregator(...)
spark.udf.register("udaf_func", agg)
which was added in SPARK-27296.
The current examples there cover only the Scala/Java APIs, in `UserDefinedTypedAggregation.scala` and `JavaUserDefinedTypedAggregation.java`.
+1; @huaxingao Could you add it in a follow-up? The initial commit in this PR includes the example (b276e16#diff-97b11639e031df253c371da5298980d9R32-R54), but it seems to have disappeared during the reviews...
I just submitted a follow-up to add the SQL example.
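For readers of this thread, a SQL-invocable UDAF along the lines suggested above could look like the sketch below. It is not the follow-up's actual example: the `SumOfProducts` object, the sample rows, and the local session settings are assumptions made for illustration. It computes the sum of the product of two columns (the calculation used in this PR's UDAF example), keeps the `udaf_func` and `testUDAF` names from the discussion, and wraps the `Aggregator` with `functions.udaf` before registering it.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("udaf-sql-sketch")
  .getOrCreate()
import spark.implicits._

// IN is a pair of columns, BUF is the running sum, OUT is the final sum.
object SumOfProducts extends Aggregator[(Long, Long), Long, Long] {
  def zero: Long = 0L                                            // initial intermediate result
  def reduce(b: Long, a: (Long, Long)): Long = b + a._1 * a._2   // fold one row into the buffer
  def merge(b1: Long, b2: Long): Long = b1 + b2                  // combine partial sums
  def finish(reduction: Long): Long = reduction                  // buffer is already the output
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Wrap the typed Aggregator as an untyped UDF, then register it for SQL.
spark.udf.register("udaf_func", udaf(SumOfProducts))

// Hypothetical sample data: 1*2 + 3*4 + 5*6.
Seq((1L, 2L), (3L, 4L), (5L, 6L)).toDF("a", "b").createOrReplaceTempView("testUDAF")
val total = spark.sql("SELECT udaf_func(a, b) FROM testUDAF").head().getLong(0)
```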
What changes were proposed in this pull request?
Document UDF in SQL Reference
Why are the changes needed?
To make SQL Reference complete.
Does this PR introduce any user-facing change?
Yes. Here are the new pages:
How was this patch tested?
Manually build and check