
[SPARK-33073][PYTHON][3.0] Improve error handling on Pandas to Arrow conversion failures #29962

Conversation

BryanCutler
Member

What changes were proposed in this pull request?

This improves error handling when a failure occurs during conversion from Pandas to Arrow, and fixes tests to be compatible with the upcoming Arrow 2.0.0 release.

Why are the changes needed?

Current tests will fail with Arrow 2.0.0 because of a change in the error message produced when the schema is invalid. For these cases, the current error message also includes information on disabling the safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to check for a message that is shown by past Arrow versions as well as the upcoming one.

If the user supplies an invalid schema, the error produced by pyarrow is not consistent: it is either a TypeError or an ArrowInvalid, with the latter being caught and re-raised as a RuntimeError with the extra info.
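As a standalone illustration (not code from this PR), pyarrow's safe-conversion check rejects a lossy cast with ArrowInvalid, which is a ValueError subclass:

```python
import pandas as pd
import pyarrow as pa

s = pd.Series([1.5, 2.0])
try:
    # safe=True rejects the lossy float -> int cast.
    pa.Array.from_pandas(s, type=pa.int32(), safe=True)
except pa.ArrowInvalid as e:
    # ArrowInvalid subclasses ValueError, so `except ValueError` catches it too.
    print(type(e).__name__, "-", e)
```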

The error handling is improved by:

  • Narrowing the caught exception type to ValueError, of which ArrowInvalid is a subclass and which is raised on safe conversion failures (see the sketch after this list).
  • Only including the additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" in the re-raised error if that config is enabled in the first place.
  • Chaining the original exception so it is shown to the user (only for Spark 3.1+, which requires Python 3).
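A minimal sketch of the resulting pattern; the helper name `create_array` and the exact message wording are illustrative, not the verbatim Spark internals:

```python
import pyarrow as pa

def create_array(series, arrow_type, safecheck):
    """Convert a pandas.Series to a pyarrow.Array with clearer errors."""
    try:
        return pa.Array.from_pandas(series, type=arrow_type, safe=safecheck)
    except ValueError as e:
        # ArrowInvalid (a ValueError subclass) is what pyarrow raises on
        # safe-conversion failures such as float truncation or overflow.
        if safecheck:
            # Only mention the config when the safe check is actually enabled.
            raise ValueError(
                "Exception thrown when converting pandas.Series (%s) to "
                "Arrow Array (%s). It can be caused by overflows or other "
                "unsafe conversions warned by Arrow. Arrow safe type check "
                "can be disabled by using SQL config "
                "`spark.sql.execution.pandas.convertToArrowArraySafely`."
                % (series.dtype, arrow_type)
            ) from e  # chain the original error (Python 3 / Spark 3.1+ only)
        raise
```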

Does this PR introduce any user-facing change?

Yes, the re-raised error changes from a RuntimeError to a ValueError, which better categorizes this type of error and is in line with the original Arrow error.
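To illustrate the user-facing change, here is a hypothetical end-to-end sketch; it assumes Arrow execution is enabled, the safe-conversion check is turned on, and Arrow fallback is disabled so the error surfaces instead of falling back to the non-Arrow path:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
spark.conf.set("spark.sql.execution.pandas.convertToArrowArraySafely", "true")

pdf = pd.DataFrame({"x": [1.5, 2.0]})
try:
    # Forcing floats into an integer column trips Arrow's safe-conversion check.
    spark.createDataFrame(pdf, schema="x int")
except ValueError as e:  # was RuntimeError before this change
    print("Arrow conversion failed:", e)
```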

How was this patch tested?

Existing tests, run with pyarrow 1.0.1 and a 2.0.0 snapshot, and with pyarrow 0.15.1 on Python 2.

@dongjoon-hyun
Member

Thank you, @BryanCutler!

@BryanCutler
Member Author

cc @HyukjinKwon @dongjoon-hyun, I tested this locally with Python 2.

@SparkQA

SparkQA commented Oct 7, 2020

Test build #129492 has finished for PR 29962 at commit 5c3c195.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34098/

@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34098/

@HyukjinKwon
Member

Merged to branch-3.0.

Thanks @BryanCutler and @dongjoon-hyun.

HyukjinKwon pushed a commit that referenced this pull request Oct 7, 2020
[SPARK-33073][PYTHON][3.0] Improve error handling on Pandas to Arrow conversion failures

Closes #29962 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073-branch-3.0.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon closed this Oct 7, 2020
BryanCutler deleted the arrow-better-handle-pandas-errors-SPARK-33073-branch-3.0 branch October 7, 2020 18:43
@BryanCutler
Member Author

Thanks @HyukjinKwon and @dongjoon-hyun!

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
[SPARK-33073][PYTHON][3.0] Improve error handling on Pandas to Arrow conversion failures

Closes apache#29962 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073-branch-3.0.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>