[SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import #31902
Conversation
ok to test
Test build #136269 has finished for PR 31902 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
This looks like a flake in the tests:
cc @HyukjinKwon
Test build #136288 has finished for PR 31902 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136342 has finished for PR 31902 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136344 has finished for PR 31902 at commit
Kubernetes integration test starting
This reverts commit b21b67a.
Kubernetes integration test status failure
Tests passed at #31902 (comment). Merging.
Merged to master and branch-3.1.
[SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import

### What changes were proposed in this pull request?

Pass the raised `ImportError` on failing to import pandas/pyarrow. This will help the user identify whether pandas/pyarrow are indeed not in the environment or if they threw a different `ImportError`.

### Why are the changes needed?

This can already happen in Pandas for example where it could throw an `ImportError` on its initialisation path if `dateutil` doesn't satisfy a certain version requirement: https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438

### Does this PR introduce _any_ user-facing change?

Yes, it will now show the root cause of the exception when pandas or arrow is missing during import.

### How was this patch tested?

Manually tested.

```python
from pyspark.sql.functions import pandas_udf
spark.range(1).select(pandas_udf(lambda x: x))
```

Before:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```

After:

```
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version
    import pyarrow
ModuleNotFoundError: No module named 'pyarrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```

Closes #31902 from johnhany97/jayad/spark-34803.

Lead-authored-by: John Ayad <johnhany97@gmail.com>
Co-authored-by: John H. Ayad <johnhany97@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit ddfc75e)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
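The mechanism behind this change is Python's built-in exception chaining: `raise NewError(...) from original` stores the original exception on the new one's `__cause__`, which is what produces the "The above exception was the direct cause of the following exception" section in the traceback. A minimal, self-contained illustration of the pattern (the module name below is deliberately nonexistent, and the helper is illustrative, not the actual pyspark code):

```python
def require_example_module():
    """Illustrative version-check helper: wrap the import failure in a
    descriptive ImportError, but chain the original error so the real
    root cause stays visible in the traceback."""
    try:
        import module_that_does_not_exist_12345  # hypothetical, missing module
    except ImportError as error:
        raise ImportError(
            "module_that_does_not_exist_12345 must be installed; "
            "however, it failed to import."
        ) from error


try:
    require_example_module()
except ImportError as exc:
    # The wrapper keeps the root cause reachable via __cause__.
    print(type(exc.__cause__).__name__)  # prints "ModuleNotFoundError"
```

Without `from error`, the original `ModuleNotFoundError` (or, as in the pandas case, an `ImportError` raised deeper in the package's own initialisation) would be replaced by the generic "must be installed" message.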
Test build #136350 has finished for PR 31902 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
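As a sketch of the pattern this PR applies in the pandas/pyarrow version checks, here is a generic helper; the function name, message wording, and the fact that the version comparison is omitted are all illustrative assumptions, not the actual pyspark code:

```python
def require_module(name, minimum_version):
    """Import a module by name; on failure raise a descriptive
    ImportError that chains the original one as its __cause__.
    (The actual minimum-version comparison is omitted for brevity.)"""
    try:
        return __import__(name)
    except ImportError as error:
        # "from error" is the key part: without it, the original
        # traceback (e.g. a dateutil version problem raised inside
        # pandas' own init) would be hidden behind the generic message.
        raise ImportError(
            "%s >= %s must be installed; however, it was not found or "
            "failed to import." % (name, minimum_version)
        ) from error
```

For an installed module, e.g. `require_module("json", "0")`, the import succeeds and the module is returned; for a missing one, the raised `ImportError` carries the underlying `ModuleNotFoundError` as its `__cause__`, so the user can tell a genuinely absent package apart from one that failed during its own import.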