
[SPARK-45523][Python] Return useful error message if UDTF returns None for any non-nullable column #43356

Closed · wants to merge 10 commits

Conversation

@dtenedor (Contributor) commented Oct 12, 2023

### What changes were proposed in this pull request?

This PR updates Python UDTF evaluation to return a useful error message if a UDTF returns None for any non-nullable column.

The implementation also checks recursively for None values in subfields of array/struct/map columns.

For example:

```python
from pyspark.sql.udtf import AnalyzeResult
from pyspark.sql.types import ArrayType, IntegerType, StructType

class Tvf:
    @staticmethod
    def analyze(*args):
        # The "result" column itself is nullable, but its array elements
        # are not (containsNull=False).
        return AnalyzeResult(
            schema=StructType().add(
                "result", ArrayType(IntegerType(), containsNull=False), True
            )
        )

    def eval(self, *args):
        yield [1, 2, 3, 4],

    def terminate(self):
        # The None array element violates containsNull=False.
        yield [1, 2, None, 3],
```

```sql
SELECT * FROM Tvf(TABLE(VALUES (0), (1)))
```

This now fails with:

```
org.apache.spark.api.python.PythonException
[UDTF_EXEC_ERROR] User defined table function encountered an error in the 'eval' or
'terminate' method: Column 0 within a returned row had a value of None, either directly or
within array/struct/map subfields, but the corresponding column type was declared as non
nullable; please update the UDTF to return a non-None value at this location or otherwise
declare the column type as nullable.
```

### Why are the changes needed?

Previously, this case resulted in a NullPointerException.

### Does this PR introduce any user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds new test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No

@dtenedor (Contributor Author) commented:

cc @ueshin

@dtenedor closed this Oct 12, 2023
@dtenedor reopened this Oct 12, 2023
python/pyspark/worker.py: review thread resolved
@dtenedor (Contributor Author) left a comment:

Thanks @ueshin for your reviews, please take another look!

python/pyspark/worker.py: review thread resolved
@dtenedor requested a review from ueshin October 14, 2023 00:22
@dtenedor changed the title from "[SPARK-45523][Python] Return useful error message if UDTF returns None for non-nullable column" to "[SPARK-45523][Python] Return useful error message if UDTF returns None for any non-nullable column" Oct 14, 2023
respond to code review comments

respond to code review comments

respond to code review comments

respond to code review comments

respond to code review comments

commit

respond to code review comments

respond to code review comments
@@ -841,6 +845,63 @@ def _remove_partition_by_exprs(self, arg: Any) -> Any:
"the query again."
)

# Compares each UDTF output row against the output schema for this particular UDTF call,
# raising an error if the two are incompatible.
def check_output_row_against_schema(row: Any) -> None:
A contributor commented:

@ueshin do you think this will add extra performance overhead if we check this for each output row?

@dtenedor (Contributor Author) commented Oct 18, 2023:

Note: In a previous iteration of this PR, I had a check to see if the schema contained any non-nullable columns in order to enable this. However, I would like to later extend these checks to compare provided row values against the expected output schema column types, which currently produce internal exceptions instead of good error messages if they don't match. We would need to check every value in every row in that case, so I figured it was OK to just do that here as well.

A reviewer (Member) commented:

Yes, this will add huge performance overhead.

@dtenedor Could we at least build the check function based on the data type in advance?

check_output_row_against_schema = _build_null_checker(return_type)

Checking the data type and nullability for each row would be too expensive.

The builder should be placed somewhere reusable.

@dtenedor (Contributor Author) commented Oct 19, 2023:

Sure, I added this check back for now.
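
To make the thread concrete: below is a minimal sketch of the pre-built-checker idea suggested above. This is not the merged implementation; the builder name `_build_null_checker` comes from the review comment, and everything else is illustrative.

```python
from typing import Any, Callable

from pyspark.sql.types import ArrayType, DataType, MapType, StructType


def _build_null_checker(dt: DataType, nullable: bool) -> Callable[[Any], None]:
    # Resolve child checkers once, at build time, so the per-row work is
    # just closure calls with no type dispatch.
    if isinstance(dt, ArrayType):
        check_element = _build_null_checker(dt.elementType, dt.containsNull)

        def check_children(value: Any) -> None:
            for element in value:
                check_element(element)

    elif isinstance(dt, MapType):
        # Spark map keys are never null, so only values need checking.
        check_value = _build_null_checker(dt.valueType, dt.valueContainsNull)

        def check_children(value: Any) -> None:
            for v in value.values():
                check_value(v)

    elif isinstance(dt, StructType):
        field_checkers = [
            _build_null_checker(f.dataType, f.nullable) for f in dt.fields
        ]

        def check_children(value: Any) -> None:
            # Iterate positionally; Row is a tuple subclass, so zip works
            # even when field names are duplicated.
            for field_checker, field_value in zip(field_checkers, value):
                field_checker(field_value)

    else:
        check_children = None

    def check(value: Any) -> None:
        if value is None:
            if not nullable:
                raise ValueError("found None for a non-nullable type")
        elif check_children is not None:
            check_children(value)

    return check
```

Since a UDTF's output schema is a `StructType`, the worker could call the builder once per query, e.g. `checker = _build_null_checker(return_type, nullable=False)`, and then invoke `checker(row)` on each emitted row, paying the type-dispatch cost only at build time.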

respond to code review comments
python/pyspark/worker.py: review thread resolved
@@ -879,6 +940,8 @@ def verify_result(result):
        verify_pandas_result(
            result, return_type, assign_cols_by_name=False, truncate_return_schema=False
        )
        for result_tuple in result.itertuples():
            check_output_row_against_schema(list(result_tuple))
A reviewer (Member) commented:

Shall we move this to before the pandas DataFrame is created?

@dtenedor (Contributor Author) commented Oct 19, 2023:

I tried that originally but the UDTF result is an Iterable and it turns out that iterating through it consumes the values, making it impossible to create the DataFrame after because the iterator is empty :)
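
For readers unfamiliar with the pitfall, here is a tiny self-contained illustration (plain Python, no Spark):

```python
# A generator can be consumed exactly once, so a validation pass taken
# before DataFrame construction would leave nothing to build it from.
def udtf_results():
    yield (1,)
    yield (2,)

rows = udtf_results()
assert [r for r in rows] == [(1,), (2,)]  # first pass consumes everything
assert list(rows) == []                   # nothing left for a second pass

# Workaround: materialize once, validate the concrete list, then build
# the DataFrame from that same list.
rows = list(udtf_results())
for row in rows:
    pass  # per-row null checks would go here
# pd.DataFrame(rows) would still see both rows.
```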

        sub_value, data_type.elementType, data_type.containsNull
    )
elif isinstance(data_type, StructType) and isinstance(value, Row):
    for field_name, field_value in value.asDict().items():
A reviewer (Member) commented:

asDict() will break in cases where there are duplicated field names.

@dtenedor (Contributor Author) commented:

Good point; I switched this to iterate through the row using column indexes instead.
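
A small demonstration of the problem and the fix, using PySpark's `Row` directly (illustrative only):

```python
from pyspark.sql import Row

# Two fields that share the name "a": legal in a Spark schema.
row = Row("a", "a")(1, None)

# asDict() collapses duplicate names into one key, losing a value.
print(row.asDict())                       # {'a': None} -- the 1 is gone

# Positional access sees every column, duplicates included.
print([row[i] for i in range(len(row))])  # [1, None]
```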

Comment on lines 887 to 888
elif isinstance(value, Row):
    items = value.asDict().items()
A reviewer (Member) commented:

In what case does this happen?

@dtenedor (Contributor Author) commented:

Turns out, never :) I removed this check; it is simpler now.

@dtenedor requested a review from ueshin October 19, 2023 22:20
@ueshin (Member) commented Oct 20, 2023:

The failed tests seem unrelated to this PR.

@ueshin (Member) commented Oct 20, 2023:

Thanks! merging to master.

@ueshin closed this in 227cd8b Oct 20, 2023
ueshin added a commit that referenced this pull request Oct 24, 2023
### What changes were proposed in this pull request?

This is a follow-up of #43356.

Refactor the null-checking to have shortcuts.

### Why are the changes needed?

The null-check can have shortcuts for some cases.
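
A sketch of the kind of shortcut this enables (hypothetical helper, not the actual patch): if a type is nullable at every level, no None value can violate it, so the per-row check can be skipped entirely for that subtree.

```python
from pyspark.sql.types import ArrayType, DataType, MapType, StructType


def needs_null_check(dt: DataType, nullable: bool) -> bool:
    """True only if a disallowed None could occur somewhere inside `dt`."""
    if not nullable:
        return True
    if isinstance(dt, ArrayType):
        return needs_null_check(dt.elementType, dt.containsNull)
    if isinstance(dt, MapType):
        # Spark map keys are never null; only values are tracked.
        return needs_null_check(dt.valueType, dt.valueContainsNull)
    if isinstance(dt, StructType):
        return any(needs_null_check(f.dataType, f.nullable) for f in dt.fields)
    return False
```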

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43492 from ueshin/issues/SPARK-45523/nullcheck.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>