Skip to content

[SPARK-26412][PYTHON][FOLLOW-UP] Improve error messages in Scala iterator pandas UDF#28135

Closed
HyukjinKwon wants to merge 3 commits intoapache:masterfrom
HyukjinKwon:SPARK-26412-followup
Closed

[SPARK-26412][PYTHON][FOLLOW-UP] Improve error messages in Scala iterator pandas UDF#28135
HyukjinKwon wants to merge 3 commits intoapache:masterfrom
HyukjinKwon:SPARK-26412-followup

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Apr 6, 2020

What changes were proposed in this pull request?

This PR proposes to improve the error message from Scalar iterator pandas UDF.

Why are the changes needed?

To show the correct error messages.

Does this PR introduce any user-facing change?

Yes, but only in unreleased branches.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
      for _ in iterator:
            yield pd.Series(1)

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
      for _ in iterator:
            yield pd.Series(list(range(20)))

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()

Before:

RuntimeError: The number of output rows of pandas iterator UDF should 
be the same with input rows. The input rows number is 10 but the output 
rows number is 1.
AssertionError: Pandas MAP_ITER UDF outputted more rows than input rows.

After:

RuntimeError: The length of output in Scalar iterator pandas UDF should be 
the same with the input's; however, the length of output was 1 and the length 
of input was 10.
AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows.

How was this patch tested?

Unittests were fixed accordingly.

@HyukjinKwon HyukjinKwon requested a review from WeichenXu123 April 6, 2020 08:41
@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@HyukjinKwon
Copy link
Member Author

@WeichenXu123, mind taking another look when you're available?

@HyukjinKwon HyukjinKwon force-pushed the SPARK-26412-followup branch from 3eef95f to 504ec1d Compare April 9, 2020 02:25
@SparkQA

This comment has been minimized.

@SparkQA
Copy link

SparkQA commented Apr 9, 2020

Test build #120994 has finished for PR 28135 at commit 0c1df44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 9, 2020

Test build #120995 has finished for PR 28135 at commit 6309034.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Merged to master and branch-3.0.

HyukjinKwon added a commit that referenced this pull request Apr 9, 2020
…ator pandas UDF

### What changes were proposed in this pull request?

This PR proposes to improve the error message from Scalar iterator pandas UDF.

### Why are the changes needed?

To show the correct error messages.

### Does this PR introduce any user-facing change?

Yes, but only in unreleased branches.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
      for _ in iterator:
            yield pd.Series(1)

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()
```
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
      for _ in iterator:
            yield pd.Series(list(range(20)))

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()
```

**Before:**

```
RuntimeError: The number of output rows of pandas iterator UDF should
be the same with input rows. The input rows number is 10 but the output
rows number is 1.
```
```
AssertionError: Pandas MAP_ITER UDF outputted more rows than input rows.
```

**After:**

```
RuntimeError: The length of output in Scalar iterator pandas UDF should be
the same with the input's; however, the length of output was 1 and the length
of input was 10.
```
```
AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows.
```

### How was this patch tested?

Unittests were fixed accordingly.

Closes #28135 from HyukjinKwon/SPARK-26412-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 4fafdcd)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…ator pandas UDF

### What changes were proposed in this pull request?

This PR proposes to improve the error message from Scalar iterator pandas UDF.

### Why are the changes needed?

To show the correct error messages.

### Does this PR introduce any user-facing change?

Yes, but only in unreleased branches.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
      for _ in iterator:
            yield pd.Series(1)

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()
```
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
      for _ in iterator:
            yield pd.Series(list(range(20)))

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()
```

**Before:**

```
RuntimeError: The number of output rows of pandas iterator UDF should
be the same with input rows. The input rows number is 10 but the output
rows number is 1.
```
```
AssertionError: Pandas MAP_ITER UDF outputted more rows than input rows.
```

**After:**

```
RuntimeError: The length of output in Scalar iterator pandas UDF should be
the same with the input's; however, the length of output was 1 and the length
of input was 10.
```
```
AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows.
```

### How was this patch tested?

Unittests were fixed accordingly.

Closes apache#28135 from HyukjinKwon/SPARK-26412-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon deleted the SPARK-26412-followup branch July 27, 2020 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments