[SPARK-26412][PYTHON][FOLLOW-UP] Improve error messages in Scala iterator pandas UDF by HyukjinKwon · Pull Request #28135 · apache/spark

HyukjinKwon · 2020-04-06T08:41:05Z

What changes were proposed in this pull request?

This PR proposes to improve the error message from Scalar iterator pandas UDF.

Why are the changes needed?

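To show the correct error messages. The previous messages were awkwardly worded, and one of them named the wrong UDF type (MAP_ITER instead of SCALAR_ITER).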

Does this PR introduce any user-facing change?

Yes, but only in unreleased branches.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
    for _ in iterator:
        # Yields a single row per batch, fewer than the 10 input rows.
        yield pd.Series(1)

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
    for _ in iterator:
        # Yields 20 rows per batch, more than the 10 input rows.
        yield pd.Series(list(range(20)))

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()

Before:

RuntimeError: The number of output rows of pandas iterator UDF should 
be the same with input rows. The input rows number is 10 but the output 
rows number is 1.

AssertionError: Pandas MAP_ITER UDF outputted more rows than input rows.

After:

RuntimeError: The length of output in Scalar iterator pandas UDF should be 
the same with the input's; however, the length of output was 1 and the length 
of input was 10.

AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows.
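
The two messages come from two different checks: one fires as soon as the output overtakes the input, the other fires at the end if the totals differ. The following is a minimal sketch of how such checks could be structured. It is not the actual code in python/pyspark/worker.py: the helper name `checked_scalar_iter` is hypothetical, plain Python lists stand in for pandas Series batches so the snippet runs without Spark, and only input that the UDF actually consumes is counted.

```python
def checked_scalar_iter(udf, input_batches):
    """Yield output batches from `udf`, validating lengths against the input.

    Hypothetical helper for illustration only; lists stand in for pandas Series.
    """
    num_input_rows = 0

    def counting_iter():
        # Count input rows lazily, as the UDF pulls each batch.
        nonlocal num_input_rows
        for batch in input_batches:
            num_input_rows += len(batch)
            yield batch

    num_output_rows = 0
    for out in udf(counting_iter()):
        num_output_rows += len(out)
        # Fail fast once the output has overtaken the input consumed so far.
        assert num_output_rows <= num_input_rows, \
            "Pandas SCALAR_ITER UDF outputted more rows than input rows."
        yield out

    # After the UDF finishes, the total lengths must match exactly.
    if num_output_rows != num_input_rows:
        raise RuntimeError(
            "The length of output in Scalar iterator pandas UDF should be "
            "the same with the input's; however, the length of output was %d and the "
            "length of input was %d." % (num_output_rows, num_input_rows))


def too_short(iterator):
    for _ in iterator:
        yield [1]                 # 1 output row for a 10-row batch


def too_long(iterator):
    for _ in iterator:
        yield list(range(20))     # 20 output rows for a 10-row batch


def run(udf):
    """Drive the checked iterator over one 10-row batch and report the outcome."""
    try:
        list(checked_scalar_iter(udf, [list(range(10))]))
    except (AssertionError, RuntimeError) as e:
        return "%s: %s" % (type(e).__name__, e)
    return "ok"
```

Calling `run(too_short)` and `run(too_long)` reproduces the two "After" messages in this sketch, while a UDF that returns one output row per input row passes both checks.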

How was this patch tested?

Unit tests were updated accordingly.
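
As an illustration of what pinning the new wording in a test can look like, here is a small sketch using the standard `unittest` module. It is not the test from this PR: `validate_lengths` is a hypothetical stand-in that only raises the new RuntimeError, so the example runs without Spark.

```python
import unittest


def validate_lengths(num_input_rows, num_output_rows):
    # Hypothetical stand-in for the worker-side check; not PySpark code.
    if num_output_rows != num_input_rows:
        raise RuntimeError(
            "The length of output in Scalar iterator pandas UDF should be "
            "the same with the input's; however, the length of output was %d and the "
            "length of input was %d." % (num_output_rows, num_input_rows))


class ScalarIterMessageTest(unittest.TestCase):
    def test_short_output_message(self):
        # Match on the part of the message that carries the actual numbers.
        with self.assertRaisesRegex(
                RuntimeError,
                "length of output was 1 and the length of input was 10"):
            validate_lengths(10, 1)

    def test_matching_lengths_pass(self):
        # Equal lengths must not raise.
        validate_lengths(10, 10)


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScalarIterMessageTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Matching on a fragment of the message rather than the whole string keeps such a test readable while still failing if the wording regresses.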

python/pyspark/worker.py

HyukjinKwon · 2020-04-08T03:25:39Z

@WeichenXu123, mind taking another look when you're available?


SparkQA · 2020-04-09T03:39:53Z

Test build #120994 has finished for PR 28135 at commit 0c1df44.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-09T04:02:15Z

Test build #120995 has finished for PR 28135 at commit 6309034.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-04-09T04:13:28Z

Merged to master and branch-3.0.

Closes #28135 from HyukjinKwon/SPARK-26412-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 4fafdcd)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>


HyukjinKwon requested a review from WeichenXu123 April 6, 2020 08:41


Improve error messages in Scala iterator pandas UDF

504ec1d

HyukjinKwon force-pushed the SPARK-26412-followup branch from 3eef95f to 504ec1d Compare April 9, 2020 02:25

WeichenXu123 reviewed Apr 9, 2020


HyukjinKwon added 2 commits April 9, 2020 11:57

Fix the test accordingly

0c1df44

address comments

6309034

WeichenXu123 approved these changes Apr 9, 2020


HyukjinKwon closed this in 4fafdcd Apr 9, 2020

HyukjinKwon deleted the SPARK-26412-followup branch July 27, 2020 07:44

HyukjinKwon wants to merge 3 commits into apache:master from HyukjinKwon:SPARK-26412-followup
