[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints #27466

Closed
HyukjinKwon wants to merge 6 commits into apache:master from HyukjinKwon:SPARK-30722

Member

@HyukjinKwon HyukjinKwon commented Feb 5, 2020

What changes were proposed in this pull request?

This PR documents the Pandas UDF redesign with Python type hints introduced in SPARK-28264.
It is mostly self-explanatory; however, there are a few things to note for reviewers.

  1. This PR replaces the existing documentation of pandas UDFs with the newer redesign to promote Python type hints. I added a note that Spark 3.0 still keeps compatibility with the previous style, though.

  2. This PR proposes the name "Pandas Function API" for the pandas-based functions that are not pandas UDFs.

  3. SCALAR_ITER becomes two separate sections to reduce confusion:

  • Iterator[pd.Series] -> Iterator[pd.Series]
  • Iterator[Tuple[pd.Series, ...]] -> Iterator[pd.Series]

  4. I removed some examples that look like overkill to me.

  5. I also removed some information in the doc that seems duplicated or excessive.
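
For readers unfamiliar with the two iterator signatures in item 3, the following is a minimal sketch rather than the exact example from the updated docs. The function names `plus_one`, `multiply` and `as_pandas_udfs` are hypothetical. The plain functions run with pandas alone; the Spark wrapping is kept in a helper that is not called here and assumes PySpark 3.0+ with PyArrow installed.

```python
from typing import Iterator, Tuple

import pandas as pd


def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Iterator[pd.Series] -> Iterator[pd.Series]: one input column, batch by batch.
    for s in batches:
        yield s + 1


def multiply(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    # Iterator[Tuple[pd.Series, ...]] -> Iterator[pd.Series]: several input columns.
    for a, b in batches:
        yield a * b


def as_pandas_udfs():
    # Assumption: PySpark 3.0+ is installed. The type hints above tell Spark
    # which kind of pandas UDF each function is.
    from pyspark.sql.functions import pandas_udf

    return (pandas_udf(plus_one, returnType="long"),
            pandas_udf(multiply, returnType="long"))


# Local check without Spark: feed the functions small in-memory batches.
plus_one_out = list(plus_one(iter([pd.Series([1, 2]), pd.Series([3])])))
multiply_out = list(multiply(iter([(pd.Series([1, 2]), pd.Series([3, 4]))])))
```

Calling the generators directly, as in the last two lines, is only a way to exercise the logic locally; in Spark the batches are produced by the engine.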

Why are the changes needed?

To document the new redesign of pandas UDFs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests should cover this.

@HyukjinKwon HyukjinKwon changed the title from "[WIP][SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints" to "[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints" on Feb 5, 2020

"pyspark.sql.avro.functions",
"pyspark.sql.pandas.conversion",
"pyspark.sql.pandas.map_ops",
"pyspark.sql.pandas.functions",
Member Author

All the tests in pyspark.sql.pandas.functions should be run conditionally: they should be skipped if pandas or PyArrow is not available. However, we have always skipped them because there is no mechanism to run the doctests conditionally.

Now the doctests in pyspark.sql.pandas.functions contain type hints, which are only valid in Python 3.5+. So even if we skip all the tests as before, Python 2 reports a compilation error because the syntax is illegal there. This is why I had to remove this module from the module list, to avoid compiling the doctests at all.
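
As an illustration of what "conditionally running the doctests" could look like, here is a minimal sketch that uses only the standard library. It is not the mechanism used in Spark's test scripts; the helper names `have_modules` and `run_doctests_if_available` are hypothetical. Note that a check like this only covers missing dependencies; it cannot help with the Python 2 syntax error described above, because that error occurs when the module is compiled, before any skip logic runs.

```python
import doctest
import importlib.util
import types


def have_modules(*names: str) -> bool:
    # True only if every named top-level module can be found on this interpreter.
    return all(importlib.util.find_spec(name) is not None for name in names)


def run_doctests_if_available(module: types.ModuleType,
                              required=("pandas", "pyarrow")):
    # Skip (return None) when an optional dependency is missing;
    # otherwise run the module's doctests and return the results.
    if not have_modules(*required):
        return None
    return doctest.testmod(module)
```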

@@ -65,132 +65,188 @@ Spark will fall back to create the DataFrame without Arrow.

## Pandas UDFs (a.k.a. Vectorized UDFs)
Member Author

Please see also the PR description of #27165 (comment)

# | 9|
# +---+

# In the UDF, you can initialize some states before processing batches.
Member Author

I removed this example. It seems too much to know, and the example itself doesn't look particularly useful.


:class:`MapType`, nested :class:`StructType` are currently not supported as output types.

Scalar UDFs can be used with :meth:`pyspark.sql.DataFrame.withColumn` and
Member Author

I removed this info. To me it looks too much to know. I just said "A Pandas UDF behaves as a regular PySpark function API in general." instead.


.. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
column, but is the length of an internal batch used for each call to the function.
Therefore, this can be used, for example, to ensure the length of each returned
Member Author

@HyukjinKwon HyukjinKwon Feb 5, 2020

I removed this example. It already follows logically, since the note says that the input length is not that of the whole series and that the input and output lengths should be the same.
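
To make the batching point concrete, here is a small pandas-only sketch. It simulates how a Series to Series function sees one batch per call rather than the whole column. The batch size of 4 and the name `plus_one` are assumptions for illustration; in Spark the engine decides the batch boundaries.

```python
import pandas as pd


def plus_one(s: pd.Series) -> pd.Series:
    # The output must have the same length as the input batch.
    return s + 1


column = pd.Series(range(10))
batch_size = 4  # hypothetical; Spark controls the real batch size

# Split the column into batches, as the engine would before calling the UDF.
batches = [column.iloc[i:i + batch_size] for i in range(0, len(column), batch_size)]
batch_lengths = [len(b) for b in batches]

# Each call sees only one batch and returns a result of the same length.
results = [plus_one(b) for b in batches]
combined = pd.concat(results)
```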

.. note:: It is not guaranteed that one invocation of a scalar iterator UDF will process all
batches from one partition, although it is currently implemented this way.
Your code shall not rely on this behavior because it might change in the future for
further optimization, e.g., one invocation processes multiple partitions.
Member Author

I removed this note. Unless we document something explicitly, nothing is guaranteed. It seems like too much detail to me.

further optimization, e.g., one invocation processes multiple partitions.

Scalar iterator UDFs are used with :meth:`pyspark.sql.DataFrame.withColumn` and
:meth:`pyspark.sql.DataFrame.select`.
Member Author

I removed this too, for the same reason as https://github.com/apache/spark/pull/27466/files#r375215947

| 9|
+---+

In the UDF, you can initialize some states before processing batches, wrap your code with

3. GROUPED_MAP

A grouped map UDF defines transformation: A `pandas.DataFrame` -> A `pandas.DataFrame`
Member Author

Moved to applyInPandas in GroupedData. Information from here that was missing there has been ported over.

return _create_udf(f, returnType, evalType)


def _test():

+---+-------------------+

.. seealso:: :meth:`pyspark.sql.functions.pandas_udf`
Alternatively, the user can pass a function that takes two arguments.
Member Author

Information ported from GROUPED MAP in pandas_udf.
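
For context, the one-argument and two-argument forms of a grouped-map function can be sketched as follows. This is a hedged illustration, not the text of the ported docs. The names `subtract_mean`, `mean_by_key` and `run_with_spark` are hypothetical, and the Spark calls sit in a helper that is not invoked here; it assumes a Spark 3.0+ DataFrame with columns `id long, v double`.

```python
import pandas as pd


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # One-argument form: receives the rows of one group as a pandas DataFrame.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def mean_by_key(key, pdf: pd.DataFrame) -> pd.DataFrame:
    # Two-argument form: `key` is a tuple holding the grouping values.
    return pd.DataFrame([key + (pdf["v"].mean(),)], columns=["id", "v"])


def run_with_spark(df):
    # Assumption: `df` is a pyspark.sql.DataFrame with columns id and v.
    schema = "id long, v double"
    return (df.groupby("id").applyInPandas(subtract_mean, schema=schema),
            df.groupby("id").applyInPandas(mean_by_key, schema=schema))


# Local check without Spark: emulate the per-group calls with pandas.
pdf = pd.DataFrame({"id": [1, 1, 2], "v": [1.0, 3.0, 5.0]})
local_means = pd.concat([mean_by_key((k,), g) for k, g in pdf.groupby("id")])
```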

Member Author

HyukjinKwon commented Feb 5, 2020

cc @rxin, @zero323, @cloud-fan, @mengxr, @viirya, @dongjoon-hyun, @WeichenXu123, @ueshin, @BryanCutler, @icexelloss, @rberenguel FYI

I would appreciate it if you have some time to take a quick look. This has to be in Spark 3.0, but the RC is supposed to start very soon, in mid-February 2020.

### Iterator of Series to Iterator of Series

The following example shows how to create scalar iterator Pandas UDFs:
The type hint can be expressed as `Iterator[pandas.Series]` -> `Iterator[pandas.Series]`.
Member

Nitpick. It is more Iterator[Union[Tuple[pandas.Series, ...], pandas.Series]] -> Iterator[pandas.Series], isn't it? But I guess that's too much...

Member Author

True, although I haven't added support for the Iterator[Union[Tuple[pandas.Series, ...], pandas.Series]] type hint yet.

I think we can combine it later if we happen to add this type hint as well. Shouldn't be a big deal at this moment.

Member

@BryanCutler BryanCutler left a comment

I took a quick look and it seems pretty good to me. Thanks @HyukjinKwon!

globs['spark'] = spark
# Hack to skip the unit tests in register. These are currently being tested in proper tests.
# We should reenable this test once we completely drop Python 2.
del pyspark.sql.udf.UDFRegistration.register
Member Author

@HyukjinKwon HyukjinKwon Feb 6, 2020

To be doubly sure, I tested and checked:

  • that it doesn't affect the main code:
>>> help(spark.udf.register)
Help on method register in module pyspark.sql.udf:

register(name, f, returnType=None) method of pyspark.sql.udf.UDFRegistration instance
...
  • the generated doc as well, just to make sure.

  • that the tests pass.

SparkQA commented Feb 6, 2020

Test build #117958 has finished for PR 27466 at commit 7b7ae90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 6, 2020

Test build #117965 has finished for PR 27466 at commit 6d23dbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -65,132 +65,215 @@ Spark will fall back to create the DataFrame without Arrow.

## Pandas UDFs (a.k.a. Vectorized UDFs)
Member Author

@cloud-fan what about now?

[Image: Screen Shot 2020-02-07 at 11 35 48 AM]

SparkQA commented Feb 7, 2020

Test build #118008 has finished for PR 27466 at commit 47b155c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 9, 2020

Test build #118087 has finished for PR 27466 at commit 76a6da2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Should be ready for a look.

SparkQA commented Feb 11, 2020

Test build #118222 has finished for PR 27466 at commit 626ff3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Thanks all. Merged to master and branch-3.0.

HyukjinKwon added a commit that referenced this pull request Feb 12, 2020
…Python type hints

### What changes were proposed in this pull request?

This PR documents the Pandas UDF redesign with Python type hints introduced in SPARK-28264.
It is mostly self-explanatory; however, there are a few things to note for reviewers.

1. This PR replaces the existing documentation of pandas UDFs with the newer redesign to promote Python type hints. I added a note that Spark 3.0 still keeps compatibility with the previous style, though.

2. This PR proposes the name "Pandas Function API" for the pandas-based functions that are not pandas UDFs.

3. SCALAR_ITER becomes two separate sections to reduce confusion:
  - `Iterator[pd.Series]` -> `Iterator[pd.Series]`
  - `Iterator[Tuple[pd.Series, ...]]` -> `Iterator[pd.Series]`

4. I removed some examples that look like overkill to me.

5. I also removed some information in the doc that seems duplicated or excessive.

### Why are the changes needed?

To document the new redesign of pandas UDFs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests should cover this.

Closes #27466 from HyukjinKwon/SPARK-30722.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit aa6a605)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon deleted the SPARK-30722 branch March 3, 2020 01:16
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020