-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures #29951
[SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures #29951
Conversation
…to be compatible with future arrow
"disabled by using SQL config " + \ | ||
"`spark.sql.execution.pandas.convertToArrowArraySafely`." | ||
raise RuntimeError(error_msg % (s.dtype, t), e) | ||
except ValueError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errors during safe conversion will be ArrowInvalid
, which subclasses ValueError
"unsafe conversions warned by Arrow. Arrow safe type check " + \ | ||
"can be disabled by using SQL config " + \ | ||
"`spark.sql.execution.pandas.convertToArrowArraySafely`." | ||
raise ValueError(error_msg % (s.dtype, t)) from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we dropped Python 2, this seems more appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In branch-3.0
.
File "/home/jenkins/workspace/spark-branch-3.0-test-sbt-hadoop-2.7-hive-1.2/python/pyspark/sql/pandas/serializers.py", line 166
raise ValueError(error_msg % (s.dtype, t)) from e
^
SyntaxError: invalid syntax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
with QuietTest(self.sc): | ||
with self.assertRaisesRegexp(Exception, "integer.*required"): | ||
self.spark.createDataFrame(pdf, schema=wrong_schema) | ||
with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": False}): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error here is a TypeError
, but in case it gets changed to an ArrowInvalid
, we do not want to have the original error chained because the assert does not include it when checking
Thanks @HyukjinKwon , you're fast 😁 |
Test build #129437 has finished for PR 29951 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
…rsion failures ### What changes were proposed in this pull request? This improves error handling when a failure in conversion from Pandas to Arrow occurs. And fixes tests to be compatible with upcoming Arrow 2.0.0 release. ### Why are the changes needed? Current tests will fail with Arrow 2.0.0 because of a change in error message when the schema is invalid. For these cases, the current error message also includes information on disabling safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is show for past Arrow versions, and upcoming. If the user enters an invalid schema, the error produced by pyarrow is not consistent and either `TypeError` or `ArrowInvalid`, with the latter being caught, and raised as a `RuntimeError` with the extra info. The error handling is improved by: - narrowing the exception type to `TypeError`s, which `ArrowInvalid` is a subclass and what is raised on safe conversion failures. - The exception is only raised with additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" if it is enabled in the first place. - The original exception is chained to better show it to the user. ### Does this PR introduce _any_ user-facing change? Yes, the error re-raised changes from a RuntimeError to a ValueError, which better categorizes this type of error and in-line with the original Arrow error. ### How was this patch tested? Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot Closes #29951 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 0812d6c) Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Merged to master and branch-3.0. |
Hi, @BryanCutler and @HyukjinKwon . This seems to break branch-3.0. |
Could you take a look at that please? |
Sure! |
I'll actually revert this out of branch-3.0, and leave it to @BryanCutler to open a PR to backport. |
Reverted from branch-3.0. @BryanCutler would you mind opening a PR to port back? |
Thank you for swift fix, @HyukjinKwon . |
Shoot, sorry about that. Exception chaining is only a Python 3 thing. I'll fix this up for branch-3.0 since that is still testing with Python 2. |
Thanks! |
What changes were proposed in this pull request?
This improves error handling when a failure in conversion from Pandas to Arrow occurs. And fixes tests to be compatible with upcoming Arrow 2.0.0 release.
Why are the changes needed?
Current tests will fail with Arrow 2.0.0 because of a change in error message when the schema is invalid. For these cases, the current error message also includes information on disabling safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is show for past Arrow versions, and upcoming.
If the user enters an invalid schema, the error produced by pyarrow is not consistent and either
TypeError
orArrowInvalid
, with the latter being caught, and raised as aRuntimeError
with the extra info.The error handling is improved by:
TypeError
s, whichArrowInvalid
is a subclass and what is raised on safe conversion failures.Does this PR introduce any user-facing change?
Yes, the error re-raised changes from a RuntimeError to a ValueError, which better categorizes this type of error and in-line with the original Arrow error.
How was this patch tested?
Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot