[SPARK-38927][TESTS] Skip NumPy/Pandas tests in `test_rdd.py` if not available #36235

williamhyun · 2022-04-17T18:58:28Z

What changes were proposed in this pull request?

This PR aims to skip NumPy/Pandas tests in test_rdd.py if they are not available.
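
A minimal sketch of how such a conditional skip can be written with the standard `unittest` module. The flag names `have_numpy` and `have_pandas`, the class `ExampleRDDTests`, and the helper `run_example` are hypothetical names chosen for this illustration, and the way availability is detected here is an assumption; the actual patch may differ. The skip message mirrors the one shown in the test output of this PR.

```python
import importlib.util
import unittest

# Hypothetical availability flags for this sketch (assumption: the real
# patch may obtain them from a shared test utility instead).
have_numpy = importlib.util.find_spec("numpy") is not None
have_pandas = importlib.util.find_spec("pandas") is not None


class ExampleRDDTests(unittest.TestCase):
    # Skip, rather than error out, when either dependency is missing.
    @unittest.skipIf(
        not have_numpy or not have_pandas, "NumPy or Pandas not installed"
    )
    def test_needs_numpy_and_pandas(self):
        # The imports stay inside the test body, so merely loading the
        # module never raises ModuleNotFoundError.
        import numpy as np
        import pandas as pd

        df = pd.DataFrame({"x": np.arange(3)})
        self.assertEqual(int(df["x"].sum()), 3)


def run_example():
    """Run the example test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleRDDTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With this pattern, an environment without NumPy or Pandas reports the test as skipped with the given reason instead of raising an import error.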

Why are the changes needed?

Currently, the tests that involve NumPy or Pandas fail when NumPy and Pandas are unavailable in the underlying Python environment. These tests should be skipped instead of being reported as failures.

BEFORE

======================================================================
ERROR: test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../test_rdd.py", line 723, in test_take_on_jrdd_with_large_rows_should_not_cause_deadlock
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

----------------------------------------------------------------------
Ran 1 test in 1.990s

FAILED (errors=1)

AFTER

Finished test(python3.9): pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (1s) ... 1 tests were skipped
Tests passed in 1 seconds

Skipped tests in pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock with python3.9:
    test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests) ... skipped 'NumPy or Pandas not installed'

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

williamhyun · 2022-04-17T19:10:14Z

The CIs are running here:

https://github.com/williamhyun/spark/actions/runs/2180840175

dongjoon-hyun

+1, LGTM. Thank you, @williamhyun.
I verified this locally and updated your PR description.

$ python/run-tests --testnames 'pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock'
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python tests: ['pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.12
Starting test(python3.9): pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/python3.9__pyspark.tests.test_rdd_RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock__9j4o35gk.log)
Finished test(python3.9): pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (1s) ... 1 tests were skipped
Tests passed in 1 seconds

Skipped tests in pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock with python3.9:
    test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests) ... skipped 'NumPy or Pandas not installed'

Closes #36235 from williamhyun/skipnumpy.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c34140d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2022-04-17T20:11:51Z

FYI @ankurdave and @HyukjinKwon, since test_take_on_jrdd_with_large_rows_should_not_cause_deadlock is the new test that introduced the NumPy/Pandas dependency to RDDTests.

HyukjinKwon

LGTM!

[SPARK-38927][TESTS] Skip NumPy/Pandas tests in test_rdd if not available

1fd8258

github-actions bot added CORE PYTHON labels Apr 17, 2022

dongjoon-hyun approved these changes Apr 17, 2022

dongjoon-hyun closed this in c34140d Apr 17, 2022

HyukjinKwon reviewed Apr 17, 2022
