
[SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures #29951

Conversation

BryanCutler
Member

@BryanCutler BryanCutler commented Oct 6, 2020

What changes were proposed in this pull request?

This improves error handling when a failure in conversion from Pandas to Arrow occurs. It also fixes tests to be compatible with the upcoming Arrow 2.0.0 release.

Why are the changes needed?

Current tests will fail with Arrow 2.0.0 because of a change in the error message produced when the schema is invalid. For these cases, the current error message also includes information on disabling the safe conversion config, which is mainly meant for floating-point truncation and overflow. The tests have been updated to use a message that is shown by both past and upcoming Arrow versions.

If the user enters an invalid schema, the error produced by pyarrow is not consistent: it is either a TypeError or an ArrowInvalid, with the latter being caught and re-raised as a RuntimeError with the extra info.

The error handling is improved by:

  • Narrowing the caught exception type to ValueError, of which ArrowInvalid is a subclass and which is raised on safe conversion failures.
  • Only including the additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" if that config is enabled in the first place.
  • Chaining the original exception so it is shown more clearly to the user.
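The improvements above can be sketched as a small, self-contained pattern. Note this is a hedged illustration: `convert` and `safecheck` are hypothetical stand-ins for pyarrow's conversion and Spark's config flag, not the actual serializer code.

```python
# Sketch of the conditional re-raise-with-chaining pattern; `convert` and
# `safecheck` are hypothetical stand-ins, not Spark's actual API.
def convert(value):
    # Stand-in for pyarrow's safe conversion; rejects truncating floats.
    if float(value) != int(value):
        raise ValueError("float value would be truncated: %r" % value)
    return int(value)

def create_array(value, safecheck):
    try:
        return convert(value)
    except ValueError as e:
        if safecheck:
            # Only mention the config when the safe check was enabled;
            # `from e` chains the original error for the user.
            raise ValueError(
                "Conversion failed; Arrow safe type check can be disabled "
                "by using SQL config "
                "`spark.sql.execution.pandas.convertToArrowArraySafely`."
            ) from e
        raise
```

When `safecheck` is off, the original error propagates untouched, so users who opted out of the safe check are not told about a config they already disabled.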

Does this PR introduce any user-facing change?

Yes, the re-raised error changes from a RuntimeError to a ValueError, which better categorizes this type of error and is in line with the original Arrow error.

How was this patch tested?

Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot

"disabled by using SQL config " + \
"`spark.sql.execution.pandas.convertToArrowArraySafely`."
raise RuntimeError(error_msg % (s.dtype, t), e)
except ValueError as e:
Member Author

errors during safe conversion will be ArrowInvalid, which subclasses ValueError

"unsafe conversions warned by Arrow. Arrow safe type check " + \
"can be disabled by using SQL config " + \
"`spark.sql.execution.pandas.convertToArrowArraySafely`."
raise ValueError(error_msg % (s.dtype, t)) from e
Member Author

Now that we dropped Python 2, this seems more appropriate
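For context (not from the PR itself): Python 3's `raise ... from e` syntax records the original exception on the new exception's `__cause__` attribute, so tracebacks display both errors. A minimal illustration:

```python
# Python 3 exception chaining: `raise ... from e` records the original
# error on the new exception's __cause__ attribute.
def parse_or_wrap(text):
    try:
        return int(text)
    except ValueError as e:
        raise ValueError("could not parse %r" % text) from e
```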

Member

In branch-3.0.

  File "/home/jenkins/workspace/spark-branch-3.0-test-sbt-hadoop-2.7-hive-1.2/python/pyspark/sql/pandas/serializers.py", line 166
    raise ValueError(error_msg % (s.dtype, t)) from e
                                              ^
SyntaxError: invalid syntax

Member

@HyukjinKwon left a comment

LGTM

with QuietTest(self.sc):
    with self.assertRaisesRegexp(Exception, "integer.*required"):
        self.spark.createDataFrame(pdf, schema=wrong_schema)
    with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": False}):
Member Author

The error here is a TypeError, but in case it ever changes to an ArrowInvalid, we do not want the original error chained, because the assertion does not include the chained message when checking.
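The behavior described above can be checked in isolation: `assertRaisesRegex` matches only the message of the exception actually raised, not the message of its chained `__cause__`. An illustrative snippet (not the actual Spark test):

```python
import unittest

class ChainedMessageTest(unittest.TestCase):
    # assertRaisesRegex matches the message of the exception that is
    # raised, not the message of any chained __cause__.
    def test_outer_message_only(self):
        with self.assertRaisesRegex(ValueError, "outer"):
            try:
                raise TypeError("inner cause")
            except TypeError as e:
                raise ValueError("outer message") from e
```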

@BryanCutler
Member Author

Thanks @HyukjinKwon , you're fast 😁
This will help get the Spark integration tests passing in apache/arrow#8352

@SparkQA

SparkQA commented Oct 6, 2020

Test build #129437 has finished for PR 29951 at commit e7a09e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34044/

@SparkQA

SparkQA commented Oct 6, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34044/

HyukjinKwon pushed a commit that referenced this pull request Oct 6, 2020
…rsion failures


Closes #29951 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 0812d6c)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon
Member

Merged to master and branch-3.0.

@BryanCutler BryanCutler deleted the arrow-better-handle-pandas-errors-SPARK-33073 branch October 6, 2020 22:35
@dongjoon-hyun
Member

Hi, @BryanCutler and @HyukjinKwon . This seems to break branch-3.0.

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 7, 2020

@HyukjinKwon
Member

Sure!

@HyukjinKwon
Member

I'll actually revert this out of branch-3.0, and leave it to @BryanCutler to open a PR to backport.

@HyukjinKwon
Member

Reverted from branch-3.0. @BryanCutler would you mind opening a PR to port back?

@dongjoon-hyun
Member

Thank you for swift fix, @HyukjinKwon .

@BryanCutler
Member Author

Shoot, sorry about that. Exception chaining is only a Python 3 thing. I'll fix this up for branch-3.0 since that is still testing with Python 2.
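For reference, one version-compatible way to attach the original error, since `raise ... from e` is a SyntaxError on Python 2 (a hedged sketch only; the actual branch-3.0 backport may use a different approach):

```python
import sys

def raise_wrapped(message, original):
    # `raise new from original` is Python 3-only syntax; embedding the
    # original error in the message works on both Python 2 and 3.
    new_exc = ValueError("%s (original error: %s)" % (message, original))
    if sys.version_info[0] >= 3:
        # On Python 3, setting __cause__ is equivalent to `raise ... from`.
        new_exc.__cause__ = original
    raise new_exc
```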

@HyukjinKwon
Member

Thanks!
