
[SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions #42272

Closed

Conversation

@allisonwang-db (Contributor) commented Aug 1, 2023:

What changes were proposed in this pull request?

This PR adds a user guide for Python user-defined table functions (UDTFs) introduced in Spark 3.5.
[Screenshot 2023-08-04 at 14 46 13: built user guide page]

Why are the changes needed?

To help users write Python UDTFs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

docs test

@itholic (Contributor) left a comment:

Otherwise looks pretty fine to me

./bin/spark-submit examples/src/main/python/sql/udtf.py
"""

# NOTE that this file is imported in user guide in PySpark documentation.
Contributor:

nit: "user guide" -> "User Guides" to follow the official documentation name?

Also, maybe adding a doc link (https://spark.apache.org/docs/latest/api/python/user_guide/index.html) would be helpful?

Contributor Author:

Yup it's on the user guide page. I will add a screenshot in the PR description.

        self.count += 1

    def terminate(self):
        yield self.count,
Contributor:

qq: should we always yield the data as tuple for UDTF?

Contributor Author:

Yes, each element corresponds to one column in the output schema.
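For illustration, a minimal sketch of that mapping (the class and column names here are made up; the udtf decorator and its returnType argument are from pyspark.sql.functions):

    from pyspark.sql.functions import udtf

    @udtf(returnType="a: int, b: int")
    class TwoColumns:
        def eval(self):
            # Two tuple elements map to the two output columns "a" and "b".
            yield 1, 2

    TwoColumns().show()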

Python User-defined Table Functions (UDTFs)
===========================================

Spark 3.5 introduces a new type of user-defined fucntion: Python user-defined table functions (UDTFs),
Contributor:

typo: "fucntion" -> "function"


Yields:
tuple: A tuple representing a single row in the UDTF result relation.
Yield thisas many times as needed to produce multiple rows.
Contributor:

typo?: "thisas" -> "this as"


This method is required to implement.

Args:
Contributor:

I'm not sure if we should follow numpydoc style here, since we follow it throughout the rest of the PySpark code base. WDYT @HyukjinKwon?

Member:

Yeah, we should follow numpydoc style, I think.

@itholic (Contributor) commented Aug 1, 2023:

I think attaching a screen capture (or something visible from the built documentation) in the PR description would be great!

@allisonwang-db (Contributor Author):

cc @dtenedor @ueshin

@dtenedor (Contributor) left a comment:

These docs look great, thanks Allison for working on this!

===========================================

Spark 3.5 introduces a new type of user-defined fucntion: Python user-defined table functions (UDTFs),
which take zero or more arguments and return a set of rows.
Contributor:

Suggested change:
- which take zero or more arguments and return a set of rows.
+ wherein each invocation appears in the FROM clause and returns an entire
+ relation as output instead of a single result value. Every UDTF call accepts
+ zero or more arguments, each comprising either a scalar constant expression
+ or a separate input relation.
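For illustration, a minimal sketch of such invocations, assuming a UDTF already registered under the hypothetical name "my_udtf":

    # A UDTF call appears in the FROM clause and returns a relation.
    spark.sql("SELECT * FROM my_udtf(42)").show()
    # A separate input relation can be passed as a TABLE argument.
    spark.sql("SELECT * FROM my_udtf(TABLE(SELECT * FROM range(3)))").show()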


.. currentmodule:: pyspark.sql.functions

To implement a Python UDTF, you can implement this class:
Contributor:

Suggested change:
- To implement a Python UDTF, you can implement this class:
+ To implement a Python UDTF, you can define a class implementing these methods:
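For context, a minimal sketch of the class shape this suggestion describes; the class name and method bodies are illustrative, and only eval is required:

    from typing import Any, Iterator

    class MyTableFunction:
        def __init__(self):
            # Optional: called once when the UDTF is instantiated.
            self.state = 0

        def eval(self, *args: Any) -> Iterator[Any]:
            # Required: called with the input arguments; yield zero or more rows.
            yield args

        def terminate(self) -> Iterator[Any]:
            # Optional: called after all rows in a partition are processed.
            yield self.state,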

Initialize the user-defined table function (UDTF).

This method is optional to implement and is called once when the UDTF is
instantiated. Use it to perform any initialization required for the UDTF.
Contributor:

Can we also describe the UDTF class instance's lifetime here? For example, any class fields assigned here will be available for subsequent eval method call(s) to consume (either just one eval call for a UDTF call accepting only scalar constant arg(s) or several eval calls for a UDTF call accepting an input relation arg).

Member:

Also, should we mention that it should be a default constructor which doesn't accept any extra arguments?
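A sketch combining both points, based on the count example quoted earlier in this thread (the returnType and class name are illustrative): the constructor takes no extra arguments, and state assigned there is visible to later eval and terminate calls.

    from pyspark.sql.functions import udtf

    @udtf(returnType="count: int")
    class CountRows:
        def __init__(self):
            # Default constructor: no extra arguments accepted.
            self.count = 0

        def eval(self, x: int):
            # State assigned in __init__ persists across eval calls.
            self.count += 1

        def terminate(self):
            yield self.count,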


    def eval(self, *args: Any) -> Iterator[Any]:
        """
        Evaluate the function using the given input arguments.
Contributor Author:

I am thinking about this too, but I found it difficult to explain in words. The interface is the same as scalar UDFs so I think Spark users should be able to figure it out. I can provide more examples.

@dtenedor (Contributor) commented Aug 3, 2023:

👍 more examples should be helpful. Maybe we could also add:

The arguments provided to the UDTF call map to the values in this *args list,
in order. Each provided scalar expression maps to exactly one value in this
*args list. Each provided TABLE argument of N columns maps to exactly N
values in this *args list, in the order of the columns as they appear in the
table.

python/docs/source/user_guide/sql/python_udtf.rst — 1 outdated review thread, resolved

Example:
    def eval(self, x: int, y: int):
        yield x + y, x - y
Contributor:

Can we also add an example with a combination of scalar constant arguments and a relation input argument, to show how the mapping from provided SQL arguments to the Python *args works? Could we include a SQL query and its results with each example as well?

Contributor Author:

Sure! I will add a simple one here and a more complex one in the example section below.
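A simple sketch of what such an example could look like, with a deterministic result; "add_sub" is a hypothetical registration name:

    from pyspark.sql.functions import udtf

    @udtf(returnType="total: int, diff: int")
    class AddSub:
        def eval(self, x: int, y: int):
            yield x + y, x - y

    spark.udtf.register("add_sub", AddSub)
    spark.sql("SELECT * FROM add_sub(3, 1)").show()
    # +-----+----+
    # |total|diff|
    # +-----+----+
    # |    4|   2|
    # +-----+----+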


    def terminate(self) -> Iterator[Any]:
        """
        Called when the UDTF has processed all rows in a partition.
Contributor:

We haven't really precisely defined what comprises a partition yet. Should we define it using the definitions from #42100 and #42174? Alternatively if these docs are targeting Spark 3.5 but those PRs are only going into master, we could simply define a partition here as either (1) just one eval call with the provided scalar argument(s), if any, or (2) several eval calls with an undefined subset of the rows from the input relation. Then we can expand it later.



python/docs/source/user_guide/sql/python_udtf.rst — 2 outdated review threads, resolved
@allisonwang-db (Contributor Author):

@ueshin @dtenedor @itholic @allanf-db @dstrodtman-db I've addressed the comments; PTAL thanks!

@johnayoub:

@allisonwang-db would we be able to use this feature to return a dataframe? I think this would be extremely useful, especially since some functions such as dropDuplicates have no equivalent in SQL, and wrapping them would be helpful.

@allisonwang-db (Contributor Author):

@johnayoub A Python UDTF is a table-valued function and it returns a dataframe. However, I don't think you can use dataframe functions like dropDuplicates directly inside the UDTF.
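A small sketch of that point, reusing the hypothetical add_sub UDTF from the earlier sketch: the result of a UDTF call is a DataFrame, whether invoked through SQL or by calling the decorated class with column arguments.

    from pyspark.sql.functions import lit

    df1 = spark.sql("SELECT * FROM add_sub(3, 1)")  # a DataFrame
    df2 = AddSub(lit(3), lit(1))                    # also a DataFrame
    df2.show()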


The arguments provided to the UDTF call are mapped to the values in the
`*args` list sequentially. Each provided scalar expression maps to exactly
one value in this `*args` list. Each provided TABLE argument of N columns


@dtenedor Here's the line

Contributor:

@allisonwang-db it turns out this part about TABLE arguments is wrong (I think I suggested it before, sorry). Instead of:

Each provided TABLE argument of N columns
maps to exactly N values in this `*args` list, in the order of the columns
as they appear in the table.

it should be something like

Each provided TABLE argument maps to a pyspark.sql.Row object containing
the columns in the order they appear in the provided input relation.

Contributor Author:

Updated!
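A sketch of the corrected mapping (class and registration names are hypothetical): each row of the TABLE argument arrives in eval as a single pyspark.sql.Row.

    from pyspark.sql import Row
    from pyspark.sql.functions import udtf

    @udtf(returnType="id: bigint")
    class EchoId:
        def eval(self, row: Row):
            # The TABLE argument maps to one Row per input row.
            yield row["id"],

    spark.udtf.register("echo_id", EchoId)
    spark.sql("SELECT * FROM echo_id(TABLE(SELECT id FROM range(3)))").show()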

examples/src/main/python/sql/udtf.py — 2 outdated review threads, resolved
python/docs/source/user_guide/sql/python_udtf.rst — 1 outdated review thread, resolved
@dtenedor (Contributor) left a comment:

Thanks @allisonwang-db for putting in the work to get this drafted, the documentation will be very useful for Spark users!

python/docs/source/user_guide/sql/python_udtf.rst — 5 outdated review threads, resolved
------
tuple
    A tuple representing a single row in the UDTF result relation.
    Yield this if you want to return additional rows during termination.
Contributor:

should we mention here the tricky detail that you have to include a trailing comma when yielding a row of just one value (here and above in the eval description)?

Contributor Author:

Yea, let me add an example.
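A sketch of that detail (class name and schema are illustrative): a row holding a single value must still be yielded as a tuple, so the trailing comma matters.

    from pyspark.sql.functions import udtf

    @udtf(returnType="total: int")
    class SingleColumn:
        def eval(self):
            yield (42,)  # an explicit 1-tuple
            yield 42,    # the trailing comma also makes this a 1-tuple
            # "yield 42" without the comma would not yield a tuple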



The return type of the UDTF defines the schema of the table it outputs.
It must be either a ``StructType`` or a DDL string representing a struct type.
Contributor:

should we put an example with this DDL string as well? It looks useful :)

Contributor Author:

Will do. All the examples below are actually using DDL strings, but I couldn't find any documentation on this. cc @HyukjinKwon do you know if we have documentation on DDL strings of pyspark types?
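For reference, a sketch of the two equivalent ways to declare the output schema (column names are illustrative):

    from pyspark.sql.functions import udtf
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    schema = StructType([
        StructField("id", IntegerType()),
        StructField("name", StringType()),
    ])

    @udtf(returnType=schema)  # as a StructType
    class WithStruct:
        def eval(self):
            yield 1, "a"

    @udtf(returnType="id: int, name: string")  # as an equivalent DDL string
    class WithDDL:
        def eval(self):
            yield 1, "a"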

Advanced Features
-----------------

TABLE input argument
Contributor:

I would recommend proposing this as the primary way of passing relation arguments, rather than in the "additional features" section, since this syntax conforms to the SQL standard.

One way is to just move the LATERAL syntax to this "advanced features" section instead.

Contributor Author:

Moved. But we might need to improve it in the future (SPARK-44746)


spark.udtf.register("filter_udtf", FilterUDTF)

spark.sql("SELECT * FROM filter_udtf(TABLE(SELECT * FROM range(10)))").show()
Contributor:

this is good, let's also add an example just passing a table by name directly as well, e.g. TABLE(t)?

Contributor Author:

We can follow up in SPARK-44746

@allisonwang-db (Contributor Author):

I've addressed all comments. cc @HyukjinKwon, we should merge this soon for Spark 3.5. I can create follow-up PRs if there are additional comments. Thanks!

@HyukjinKwon (Member):

Merged to master and branch-3.5.

HyukjinKwon pushed a commit that referenced this pull request Sep 8, 2023
…ble functions

Closes #42272 from allisonwang-db/spark-44508-udtf-user-guide.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit aaf413c)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>