[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints by HyukjinKwon · Pull Request #27466 · apache/spark

HyukjinKwon · 2020-02-05T08:16:34Z

What changes were proposed in this pull request?

This PR documents the pandas UDF redesign with Python type hints introduced in SPARK-28264.
It is mostly self-describing; however, there are a few things to note for reviewers.

1. This PR replaces the existing pandas UDF documentation with the newer, type-hint-based style in order to promote Python type hints. I added a note that Spark 3.0 still keeps compatibility with the previous style.
2. This PR proposes to call the APIs that are not pandas UDFs "Pandas Function API".
3. SCALAR_ITER is split into two separate sections to reduce confusion:
   - `Iterator[pd.Series]` -> `Iterator[pd.Series]`
   - `Iterator[Tuple[pd.Series, ...]]` -> `Iterator[pd.Series]`
4. I removed some examples that looked like overkill to me.
5. I also removed some information from the docs that seemed duplicated or excessive.
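For reviewers who have not followed SPARK-28264, here is a minimal sketch of the three type-hinted shapes this description refers to. The function names and the helper `as_spark_udfs` are made up for illustration; the plain functions run with pandas alone, and the Spark wrapping is kept in a helper that is only meant to be called where pyspark and PyArrow are installed.

```python
from typing import Iterator, Tuple

import pandas as pd


def plus_one(s: pd.Series) -> pd.Series:
    """Series -> Series: called once per batch of rows."""
    return s + 1


def plus_one_iter(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Iterator[Series] -> Iterator[Series]: one input column, many batches."""
    for s in batches:
        yield s + 1


def multiply_iter(
    batches: Iterator[Tuple[pd.Series, pd.Series]]
) -> Iterator[pd.Series]:
    """Iterator[Tuple[Series, ...]] -> Iterator[Series]: several input columns."""
    for a, b in batches:
        yield a * b


def as_spark_udfs():
    """Hypothetical helper: wrap the functions as pandas UDFs.

    Needs pyspark and PyArrow; the UDF kind is taken from the type hints,
    so no PandasUDFType argument is passed.
    """
    from pyspark.sql.functions import pandas_udf

    return (
        pandas_udf(plus_one, returnType="long"),
        pandas_udf(plus_one_iter, returnType="long"),
        pandas_udf(multiply_iter, returnType="long"),
    )
```

The returned objects would then be used like any other column function, for example in `df.select(...)`.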

Why are the changes needed?

To document the new redesign of pandas UDFs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests should cover this.

HyukjinKwon · 2020-02-05T11:56:42Z

dev/sparktestsupport/modules.py

        "pyspark.sql.avro.functions",
        "pyspark.sql.pandas.conversion",
        "pyspark.sql.pandas.map_ops",
-        "pyspark.sql.pandas.functions",


All the tests in pyspark.sql.pandas.functions should be run conditionally: they should be skipped if pandas or PyArrow is not available. However, we have always been skipping them because there is no mechanism to run doctests conditionally.

Now the doctests in pyspark.sql.pandas.functions contain type hints, which are only valid in Python 3.5+. So even if we skip all the tests as before, they fail with a compilation error due to illegal syntax in Python 2. This is why I had to remove this module from the list, to avoid compiling the doctests at all.
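As an illustration of what a conditional mechanism could look like, here is a minimal sketch that uses only the standard library. It is not what Spark's test runner does, and it does not address the Python 2 syntax problem described above; `modules_available`, `run_doctests_if_available`, and the `demo` module are hypothetical names.

```python
import doctest
import importlib.util
import types


def modules_available(*names):
    """True only if every named top-level module can be found."""
    return all(importlib.util.find_spec(name) is not None for name in names)


def run_doctests_if_available(module, required):
    """Run the module's doctests, or return None when a dependency is missing."""
    if not modules_available(*required):
        return None  # skipped: nothing is collected or executed
    return doctest.testmod(module)


# A throwaway module whose docstring holds a single doctest example.
demo = types.ModuleType("demo", ">>> 1 + 1\n2\n")

skipped = run_doctests_if_available(demo, ["surely_not_installed_pkg_12345"])
ran = run_doctests_if_available(demo, ["json"])
print(skipped, ran)
```

In this sketch the check would be called with `["pandas", "pyarrow"]` for a module like the one discussed here.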

HyukjinKwon · 2020-02-05T11:58:13Z

docs/sql-pyspark-pandas-with-arrow.md

@@ -65,132 +65,188 @@ Spark will fall back to create the DataFrame without Arrow.

 ## Pandas UDFs (a.k.a. Vectorized UDFs)


Please see also the PR description of #27165 (comment)

HyukjinKwon · 2020-02-05T12:00:16Z

examples/src/main/python/sql/arrow.py

-    # |  9|
-    # +---+
-
-    # In the UDF, you can initialize some states before processing batches.


I removed this example. It seems too much to know, and the example itself doesn't look particularly useful.
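For context, the pattern the removed example described (set up some state once, then reuse it for every batch) can be sketched without Spark as a plain generator. `make_offset` and `add_offset` are hypothetical names, and the "state" is just a constant here.

```python
from typing import Iterator

import pandas as pd


def make_offset() -> int:
    """Hypothetical expensive setup (for example loading a model)."""
    return 10


def add_offset(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    offset = make_offset()  # runs once per call, before the first batch
    try:
        for s in batches:
            yield s + offset  # the same state is reused for every batch
    finally:
        pass  # release the state here if it needs cleanup


result = [s.tolist() for s in add_offset(iter([pd.Series([1, 2]), pd.Series([3])]))]
print(result)
```

A function of this shape carries the `Iterator[pd.Series]` -> `Iterator[pd.Series]` hints, so it could be wrapped with `pandas_udf` in the type-hinted style.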

HyukjinKwon · 2020-02-05T12:02:27Z

python/pyspark/sql/pandas/functions.py

-
-       :class:`MapType`, nested :class:`StructType` are currently not supported as output types.
-
-       Scalar UDFs can be used with :meth:`pyspark.sql.DataFrame.withColumn` and


I removed this info. To me it looks too much to know. I just said "A Pandas UDF behaves as a regular PySpark function API in general." instead.

HyukjinKwon · 2020-02-05T12:03:35Z

python/pyspark/sql/pandas/functions.py

-
-       .. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
-           column, but is the length of an internal batch used for each call to the function.
-           Therefore, this can be used, for example, to ensure the length of each returned


I removed this example. It already follows logically: the docs say that the input is not the whole series, and that the input and output lengths must be the same.
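To make the batch semantics concrete, here is a small pandas-only sketch. `apply_in_batches` is a hypothetical stand-in that simulates how a column is fed to the function in slices; it is not how Spark batches data internally.

```python
import pandas as pd


def center(s: pd.Series) -> pd.Series:
    """Subtract the mean of whatever Series is passed in."""
    return s - s.mean()


def apply_in_batches(func, column: pd.Series, batch_size: int) -> pd.Series:
    """Hypothetical stand-in for batching: call func on consecutive slices."""
    pieces = []
    for start in range(0, len(column), batch_size):
        batch = column.iloc[start:start + batch_size]
        result = func(batch)
        # The contract discussed above: output length equals batch length.
        assert len(result) == len(batch)
        pieces.append(result)
    return pd.concat(pieces)


column = pd.Series([1.0, 2.0, 3.0, 4.0])
whole = center(column).tolist()
batched = apply_in_batches(center, column, batch_size=2).tolist()
print(whole, batched)
```

The two results differ because `center` sees only one batch at a time, which is why a function that is not element-wise should not assume it receives the whole column.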

HyukjinKwon · 2020-02-05T12:05:02Z

python/pyspark/sql/pandas/functions.py

-       .. note:: It is not guaranteed that one invocation of a scalar iterator UDF will process all
-           batches from one partition, although it is currently implemented this way.
-           Your code shall not rely on this behavior because it might change in the future for
-           further optimization, e.g., one invocation processes multiple partitions.


I removed this note. Unless we explicitly document a behavior, it is not guaranteed. It seems like too much for users to have to know.

HyukjinKwon · 2020-02-05T12:05:27Z

python/pyspark/sql/pandas/functions.py

-           further optimization, e.g., one invocation processes multiple partitions.
-
-       Scalar iterator UDFs are used with :meth:`pyspark.sql.DataFrame.withColumn` and
-       :meth:`pyspark.sql.DataFrame.select`.


I removed this too, for the same reason as https://github.com/apache/spark/pull/27466/files#r375215947

HyukjinKwon · 2020-02-05T12:05:54Z

python/pyspark/sql/pandas/functions.py

-       |  9|
-       +---+
-
-       In the UDF, you can initialize some states before processing batches, wrap your code with


I removed this for the same reason as https://github.com/apache/spark/pull/27466/files#r375215066

HyukjinKwon · 2020-02-05T12:06:26Z

python/pyspark/sql/pandas/functions.py

-
-    3. GROUPED_MAP
-
-       A grouped map UDF defines transformation: A `pandas.DataFrame` -> A `pandas.DataFrame`


Moved to applyInPandas at GroupedData. Missing information from here was ported to there.
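A minimal sketch of the same `pandas.DataFrame` -> `pandas.DataFrame` transformation in the applyInPandas style follows. The column names `id` and `v` and the helper `run_on_spark` are assumptions for illustration; the Spark call is kept in a helper so the plain function can be tried with pandas alone.

```python
import pandas as pd


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receives all rows of one group; returns rows matching the output schema."""
    return pdf.assign(v=pdf.v - pdf.v.mean())


def run_on_spark(df):
    """Hypothetical helper: df is assumed to be a Spark DataFrame
    with columns id (long) and v (double)."""
    return df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double")


group = pd.DataFrame({"id": [1, 1], "v": [1.0, 3.0]})
print(subtract_mean(group))
```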

HyukjinKwon · 2020-02-05T12:07:31Z

python/pyspark/sql/pandas/functions.py

    return _create_udf(f, returnType, evalType)
-
-
-def _test():


See https://github.com/apache/spark/pull/27466/files#r375213650

HyukjinKwon · 2020-02-05T12:07:57Z

python/pyspark/sql/pandas/group_ops.py

        +---+-------------------+

-        .. seealso:: :meth:`pyspark.sql.functions.pandas_udf`
+        Alternatively, the user can pass a function that takes two arguments.


Information ported from GROUPED MAP in pandas_udf.
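A sketch of the two-argument form, assuming a grouping column `id` and a value column `v` (both hypothetical): the first argument is the grouping key as a tuple, the second is the group's data.

```python
import pandas as pd


def mean_by_key(key, pdf: pd.DataFrame) -> pd.DataFrame:
    """key is a tuple holding the grouping values for this group."""
    return pd.DataFrame([key + (pdf.v.mean(),)], columns=["id", "v"])


group = pd.DataFrame({"id": [1, 1], "v": [1.0, 2.0]})
summary = mean_by_key((1,), group)
print(summary)
```

With Spark, this would be passed as `df.groupby("id").applyInPandas(mean_by_key, schema="id long, v double")`, assuming `df` has those two columns.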

HyukjinKwon · 2020-02-05T12:11:37Z

cc @rxin, @zero323, @cloud-fan, @mengxr, @viirya, @dongjoon-hyun, @WeichenXu123, @ueshin, @BryanCutler, @icexelloss, @rberenguel FYI

I would appreciate it if you could take a quick look. This has to be in Spark 3.0, but the RC is supposed to start very soon, in mid-February 2020.

zero323 · 2020-02-05T13:15:21Z

docs/sql-pyspark-pandas-with-arrow.md

+### Iterator of Series to Iterator of Series

-The following example shows how to create scalar iterator Pandas UDFs:
+The type hint can be expressed as `Iterator[pandas.Series]` -> `Iterator[pandas.Series]`.


Nitpick. It is more Iterator[Union[Tuple[pandas.Series, ...], pandas.Series]] -> Iterator[pandas.Series], isn't it? But I guess that's too much...

True, although I haven't added support for the Iterator[Union[Tuple[pandas.Series, ...], pandas.Series]] type hint yet.

I think we can combine it later if we happen to add this type hint as well. Shouldn't be a big deal at this moment.
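To illustrate why the combined `Union` hint is a separate case from the two documented ones, here is a sketch that inspects hints with the standard `typing` helpers (Python 3.8+). It is not Spark's inference code; `Series` is a stand-in class so the snippet runs without pandas, `classify` is a hypothetical helper, and only the variadic `Tuple[Series, ...]` form is recognized.

```python
import collections.abc
from typing import Iterator, Tuple, Union, get_args, get_origin, get_type_hints


class Series:
    """Stand-in for pandas.Series so the sketch runs without pandas."""


def classify(func) -> str:
    """Label func by the shape of the hint on its single parameter."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    (param_hint,) = hints.values()  # exactly one annotated parameter expected
    if get_origin(param_hint) is not collections.abc.Iterator:
        return "not an iterator"
    (item,) = get_args(param_hint)
    if item is Series:
        return "iterator of series"
    if get_origin(item) is tuple and get_args(item) == (Series, Ellipsis):
        return "iterator of multiple series"
    return "unsupported"


def single(it: Iterator[Series]) -> Iterator[Series]:
    yield from it


def multiple(it: Iterator[Tuple[Series, ...]]) -> Iterator[Series]:
    for columns in it:
        yield columns[0]


def combined(it: Iterator[Union[Tuple[Series, ...], Series]]) -> Iterator[Series]:
    for item in it:
        yield item if isinstance(item, Series) else item[0]


print(classify(single), classify(multiple), classify(combined))
```

A structural check like this treats the `Union` form as a third shape, so supporting it would need its own branch.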

BryanCutler

I took a quick look and seems pretty good to me. Thanks @HyukjinKwon !

HyukjinKwon · 2020-02-06T00:46:24Z

python/pyspark/sql/udf.py

    globs['spark'] = spark
+    # Hack to skip the unit tests in register. These are currently being tested in proper tests.
+    # We should reenable this test once we completely drop Python 2.
+    del pyspark.sql.udf.UDFRegistration.register


To doubly make sure, I tested and checked:

- that it doesn't affect the main code:

      >>> help(spark.udf.register)
      Help on method register in module pyspark.sql.udf:
      register(name, f, returnType=None) method of pyspark.sql.udf.UDFRegistration instance ...

- the generated doc too, just to make sure.
- that the tests pass.
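The effect of deleting the attribute before the doctest run can be reproduced with the standard library alone. The sketch below uses a throwaway module named `demo_udf` with a class `Registration`; both are hypothetical stand-ins, not Spark code.

```python
import doctest
import types

# Hypothetical module source: a class whose method docstrings carry doctests.
SOURCE = '''
class Registration:
    def register(self, name):
        """
        >>> Registration().register("f")
        'registered f'
        """
        return "registered " + name

    def lookup(self, name):
        """
        >>> Registration().lookup("f")
        'f'
        """
        return name
'''


def load_demo_module():
    """Build a throwaway module object from SOURCE (stands in for an import)."""
    module = types.ModuleType("demo_udf")
    exec(SOURCE, module.__dict__)
    return module


def collected(module):
    """Names of the doctests that DocTestFinder discovers in the module."""
    return sorted(test.name for test in doctest.DocTestFinder().find(module))


demo = load_demo_module()
before = collected(demo)

# Same idea as the hack in the diff: drop the attribute in the test process
# only, so the finder never reaches the docstring attached to it.
del demo.Registration.register
after = collected(demo)

print(before)
print(after)
```

Because the deletion happens only inside the process that runs the doctests, normal imports of the module still see the method.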

SparkQA · 2020-02-06T04:01:34Z

Test build #117958 has finished for PR 27466 at commit 7b7ae90.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-06T06:58:58Z

Test build #117965 has finished for PR 27466 at commit 6d23dbd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-02-07T02:36:55Z

docs/sql-pyspark-pandas-with-arrow.md

@@ -65,132 +65,215 @@ Spark will fall back to create the DataFrame without Arrow.

 ## Pandas UDFs (a.k.a. Vectorized UDFs)


@cloud-fan what about now?

SparkQA · 2020-02-07T05:19:44Z

Test build #118008 has finished for PR 27466 at commit 47b155c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-09T07:55:24Z

Test build #118087 has finished for PR 27466 at commit 76a6da2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-02-10T07:29:35Z

Should be ready for a look.

SparkQA · 2020-02-11T11:51:18Z

Test build #118222 has finished for PR 27466 at commit 626ff3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-02-12T01:49:28Z

Thanks all. Merged to master and branch-3.0.

…Python type hints

### What changes were proposed in this pull request?

This PR targets to document the Pandas UDF redesign with type hints introduced at SPARK-28264. Mostly self-describing; however, there are few things to note for reviewers.

1. This PR replace the existing documentation of pandas UDFs to the newer redesign to promote the Python type hints. I added some words that Spark 3.0 still keeps the compatibility though.
2. This PR proposes to name non-pandas UDFs as "Pandas Function API"
3. SCALAR_ITER become two separate sections to reduce confusion:
   - `Iterator[pd.Series]` -> `Iterator[pd.Series]`
   - `Iterator[Tuple[pd.Series, ...]]` -> `Iterator[pd.Series]`
4. I removed some examples that look overkill to me.
5. I also removed some information in the doc, that seems duplicating or too much.

### Why are the changes needed?

To document new redesign in pandas UDF.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests should cover.

Closes #27466 from HyukjinKwon/SPARK-30722.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit aa6a605)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

Update documentation for Pandas UDF with Python type hints

4f85930

HyukjinKwon force-pushed the SPARK-30722 branch from d3eb543 to 4f85930 February 5, 2020 11:40

HyukjinKwon changed the title ~~[WIP][SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints~~ [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints Feb 5, 2020

HyukjinKwon commented Feb 5, 2020

zero323 reviewed Feb 5, 2020

docs/sql-pyspark-pandas-with-arrow.md

zero323 reviewed Feb 5, 2020

docs/sql-pyspark-pandas-with-arrow.md

zero323 reviewed Feb 5, 2020

cloud-fan reviewed Feb 5, 2020

docs/sql-pyspark-pandas-with-arrow.md

dongjoon-hyun added PYSPARK SQL labels Feb 5, 2020

BryanCutler reviewed Feb 6, 2020

docs/sql-pyspark-pandas-with-arrow.md

Make the tests pass first

7b7ae90

HyukjinKwon commented Feb 6, 2020

Address comments

6d23dbd

viirya reviewed Feb 6, 2020

docs/sql-pyspark-pandas-with-arrow.md

viirya reviewed Feb 6, 2020

docs/sql-pyspark-pandas-with-arrow.md

docs/sql-pyspark-pandas-with-arrow.md

cloud-fan reviewed Feb 6, 2020

docs/sql-pyspark-pandas-with-arrow.md

Address comments

47b155c

HyukjinKwon commented Feb 7, 2020

cloud-fan reviewed Feb 7, 2020

docs/sql-pyspark-pandas-with-arrow.md

cloud-fan reviewed Feb 7, 2020

docs/sql-pyspark-pandas-with-arrow.md

cloud-fan reviewed Feb 7, 2020

python/pyspark/sql/pandas/functions.py

Addres comments

76a6da2

cloud-fan reviewed Feb 10, 2020

python/pyspark/sql/pandas/functions.py

cloud-fan reviewed Feb 10, 2020

python/pyspark/sql/pandas/functions.py

cloud-fan reviewed Feb 10, 2020

python/pyspark/sql/pandas/functions.py

Address a comment

626ff3c

cloud-fan approved these changes Feb 11, 2020

zero323 approved these changes Feb 11, 2020

HyukjinKwon closed this in aa6a605 Feb 12, 2020

HyukjinKwon deleted the SPARK-30722 branch March 3, 2020 01:16
