[SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import #31902

johnhany97 · 2021-03-19T16:21:00Z

What changes were proposed in this pull request?

Chain the originally raised ImportError onto the error PySpark raises when importing pandas or pyarrow fails. This helps the user tell whether pandas/pyarrow is genuinely absent from the environment or whether the import failed with a different ImportError.
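For readers unfamiliar with the pattern, here is a minimal sketch of the idea. It is not the code in `python/pyspark/sql/pandas/utils.py`; the helper name `require_module` and its parameters are made up for illustration, and only the message wording is borrowed from the traceback in this description.

```python
import importlib


def require_module(name, minimum_version):
    """Import `name`, chaining the original failure onto a friendlier ImportError.

    Hypothetical helper for illustration only; it does not check the
    installed version, it only demonstrates the chaining.
    """
    try:
        return importlib.import_module(name)
    except ImportError as error:
        # `from error` stores the original exception in __cause__, so the
        # traceback shows both the root cause and the summary message.
        raise ImportError(
            "%s >= %s must be installed; however, it was not found."
            % (name, minimum_version)
        ) from error
```

Calling `require_module("pyarrow", "1.0.0")` in an environment without pyarrow would then raise an ImportError whose `__cause__` is the original `ModuleNotFoundError`.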

Why are the changes needed?

This can already happen with pandas, for example: it can throw an ImportError on its initialisation path if dateutil doesn't satisfy a certain version requirement: https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438
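To make the distinction concrete, the sketch below simulates an installed-but-broken library. The module `fake_pandas_for_demo` and its error message are invented stand-ins, not pandas itself; the point is only that a module can exist on the path and still raise a plain ImportError during initialisation, which is a different failure from `ModuleNotFoundError`.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-in for a library whose initialisation rejects an old
# dependency: the module exists, but importing it raises ImportError.
package_dir = tempfile.mkdtemp()
with open(os.path.join(package_dir, "fake_pandas_for_demo.py"), "w") as f:
    f.write('raise ImportError("demo: a dependency is older than required")\n')
sys.path.insert(0, package_dir)
importlib.invalidate_caches()


def describe_import_failure(name):
    """Classify why importing `name` failed, if it failed at all."""
    try:
        importlib.import_module(name)
    except ModuleNotFoundError as error:
        # Must come first: ModuleNotFoundError is a subclass of ImportError.
        return "missing: %s" % error
    except ImportError as error:
        return "installed but broken: %s" % error
    return "ok"
```

Without the root cause, both failures would surface to the user as the same "it was not found" message, even though only the first is fixed by installing the package.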

Does this PR introduce any user-facing change?

Yes, it will now show the root cause of the exception when pandas or pyarrow fails to import.

How was this patch tested?

Manually tested.

from pyspark.sql.functions import pandas_udf
spark.range(1).select(pandas_udf(lambda x: x, "int")("id")).show()

Before:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/...//spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.

After:

Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version
    import pyarrow
ModuleNotFoundError: No module named 'pyarrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
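The "direct cause" line in the output above is what Python prints for an exception raised with `raise ... from ...`. A small standalone sketch, assuming a made-up module name and message rather than the PySpark code, shows how the same chain can be rendered and inspected programmatically:

```python
import traceback


def check_dependency():
    """Hypothetical check that wraps a failed import with a summary error."""
    try:
        import module_that_does_not_exist_for_demo  # made-up name
    except ImportError as error:
        raise ImportError(
            "demo dependency must be installed; however, it was not found."
        ) from error


try:
    check_dependency()
except ImportError as wrapped:
    # format_exception follows __cause__ by default, so the report contains
    # both tracebacks joined by the "direct cause" sentence.
    report = "".join(
        traceback.format_exception(type(wrapped), wrapped, wrapped.__traceback__)
    )
    root_cause = wrapped.__cause__
```

Here `root_cause` is the original `ModuleNotFoundError`, so calling code can tell a missing module apart from any other import failure.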

dongjoon-hyun · 2021-03-19T20:43:19Z

ok to test

SparkQA · 2021-03-19T21:20:41Z

Test build #136269 has finished for PR 31902 at commit bf47b43.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-19T21:44:25Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40851/

SparkQA · 2021-03-19T21:52:54Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40851/

johnhany97 · 2021-03-19T21:59:09Z

This looks like a flake in the tests:

Err:20 http://security.debian.org/debian-security buster/updates/main amd64 libldap-common all 2.4.47+dfsg-3+deb10u4
  404  Not Found [IP: 151.101.194.132 80]
Err:21 http://security.debian.org/debian-security buster/updates/main amd64 libldap-2.4-2 amd64 2.4.47+dfsg-3+deb10u4
  404  Not Found [IP: 151.101.194.132 80]
Fetched 7590 kB in 1s (12.4 MB/s)
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openldap/libldap-common_2.4.47+dfsg-3+deb10u4_all.deb  404  Not Found [IP: 151.101.194.132 80]
E: Failed to fetch http://security.debian.org/debian-security/pool/updates/main/o/openldap/libldap-2.4-2_2.4.47+dfsg-3+deb10u4_amd64.deb  404  Not Found [IP: 151.101.194.132 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
The command '/bin/sh -c echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list &&   apt install -y gnupg &&   (apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' || apt-key adv --keyserver keys.openpgp.org --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF') &&   apt-get update &&   apt install -y -t buster-cran35 r-base r-base-dev &&   rm -rf /var/cache/apt/*' returned a non-zero code: 100
Failed to build SparkR Docker image, please refer to Docker build output for details.
[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal (DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute (DefaultExecutor.java:166)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:804)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:751)
    at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:313)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)

dongjoon-hyun · 2021-03-20T20:26:56Z

cc @HyukjinKwon

SparkQA · 2021-03-20T21:26:22Z

Test build #136288 has finished for PR 31902 at commit 30542f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-20T21:41:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40870/

SparkQA · 2021-03-20T21:46:58Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40870/

SparkQA · 2021-03-22T11:34:55Z

Test build #136342 has finished for PR 31902 at commit 02ba207.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-22T12:01:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40926/

SparkQA · 2021-03-22T12:07:29Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40926/

SparkQA · 2021-03-22T13:17:43Z

Test build #136344 has finished for PR 31902 at commit b21b67a.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-22T14:18:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40928/

SparkQA · 2021-03-22T14:27:19Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40928/

HyukjinKwon · 2021-03-22T14:29:14Z

Tests passed at #31902 (comment). Merging.

HyukjinKwon · 2021-03-22T14:29:24Z

Merged to master and branch-3.1.

…ow fail to import

### What changes were proposed in this pull request?

Pass the raised `ImportError` on failing to import pandas/pyarrow. This will help the user identify whether pandas/pyarrow are indeed not in the environment or if they threw a different `ImportError`.

### Why are the changes needed?

This can already happen in Pandas for example where it could throw an `ImportError` on its initialisation path if `dateutil` doesn't satisfy a certain version requirement https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438

### Does this PR introduce _any_ user-facing change?

Yes, it will now show the root cause of the exception when pandas or arrow is missing during import.

### How was this patch tested?

Manually tested.

```python
from pyspark.sql.functions import pandas_udf
spark.range(1).select(pandas_udf(lambda x: x))
```

Before:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/...//spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```

After:

```
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version
    import pyarrow
ModuleNotFoundError: No module named 'pyarrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```

Closes #31902 from johnhany97/jayad/spark-34803.

Lead-authored-by: John Ayad <johnhany97@gmail.com>
Co-authored-by: John H. Ayad <johnhany97@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit ddfc75e)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

SparkQA · 2021-03-22T14:53:21Z

Test build #136350 has finished for PR 31902 at commit e63f4cf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-22T15:43:19Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40934/

SparkQA · 2021-03-22T15:54:50Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40934/

johnhany97 added 2 commits March 19, 2021 16:17

[SPARK-34803] Pass the raised ImportError if pandas or pyarrow fail to import

24b28f8

Adjust variable name to conform to Python semantics

bf47b43

dongjoon-hyun changed the title ~~[SPARK-34803] Pass the raised ImportError if pandas or pyarrow fail to import~~ [SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import Mar 19, 2021

github-actions bot added CORE PYTHON SQL labels Mar 19, 2021

dongjoon-hyun reviewed Mar 20, 2021

dongjoon-hyun reviewed Mar 20, 2021

add whitespace

30542f3

HyukjinKwon reviewed Mar 21, 2021

Use Py3 exception chaining

02ba207

HyukjinKwon approved these changes Mar 22, 2021

HyukjinKwon reviewed Mar 22, 2021

Just use error

b21b67a

Revert "Just use error"

e63f4cf

This reverts commit b21b67a.

HyukjinKwon closed this in ddfc75e Mar 22, 2021

johnhany97 deleted the jayad/spark-34803 branch March 22, 2021 15:22

johnhany97 mentioned this pull request Mar 22, 2021

[SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import palantir/spark#745

Merged
