[SPARK-38927][TESTS] Skip NumPy/Pandas tests in test_rdd.py if not available #36235

Closed

williamhyun (Member) commented Apr 17, 2022

What changes were proposed in this pull request?

This PR aims to skip NumPy/Pandas tests in test_rdd.py if they are not available.
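The diff itself is not shown in this thread, so the following is only a minimal sketch of the general skip pattern, not the PR's actual change. The `have_numpy` and `have_pandas` flags and the `RDDTestsSketch` class are hypothetical names used for illustration; only the skip message is taken from the test output in this PR.

```python
import importlib.util
import unittest

# Hypothetical availability flags (assumption): True only if the module can be found.
have_numpy = importlib.util.find_spec("numpy") is not None
have_pandas = importlib.util.find_spec("pandas") is not None


class RDDTestsSketch(unittest.TestCase):
    # The decorator is evaluated at class-definition time, so the test body
    # (and its imports) never runs when either library is missing.
    @unittest.skipIf(
        not have_numpy or not have_pandas,
        "NumPy or Pandas not installed",
    )
    def test_needs_numpy_and_pandas(self):
        import numpy as np
        import pandas as pd

        frame = pd.DataFrame({"x": np.arange(3)})
        self.assertEqual(len(frame), 3)
```

With this shape, the imports stay inside the test body, and the decorator decides up front whether the body is executed at all.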

Why are the changes needed?

Currently, the tests that involve NumPy or Pandas fail when NumPy and Pandas are unavailable in the underlying Python environment. These tests should be skipped instead of being reported as failures.

BEFORE

======================================================================
ERROR: test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../test_rdd.py", line 723, in test_take_on_jrdd_with_large_rows_should_not_cause_deadlock
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

----------------------------------------------------------------------
Ran 1 test in 1.990s

FAILED (errors=1)

AFTER

Finished test(python3.9): pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (1s) ... 1 tests were skipped
Tests passed in 1 seconds

Skipped tests in pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock with python3.9:
    test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests) ... skipped 'NumPy or Pandas not installed'

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

@williamhyun (Member Author)

The CIs are running here:

dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @williamhyun.
I verified this locally and updated your PR description.

$ python/run-tests --testnames 'pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock'
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python tests: ['pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.12
Starting test(python3.9): pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (temp output: /var/folders/mq/c32xpgtj4tj19vt8b10wp8rc0000gn/T/python3.9__pyspark.tests.test_rdd_RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock__9j4o35gk.log)
Finished test(python3.9): pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (1s) ... 1 tests were skipped
Tests passed in 1 seconds

Skipped tests in pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock with python3.9:
    test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests) ... skipped 'NumPy or Pandas not installed'

dongjoon-hyun pushed a commit that referenced this pull request Apr 17, 2022
…available

### What changes were proposed in this pull request?
This PR aims to skip NumPy/Pandas tests in `test_rdd.py` if they are not available.

### Why are the changes needed?
Currently, the tests that involve NumPy or Pandas fail when NumPy and Pandas are unavailable in the underlying Python environment. These tests should be skipped instead of being reported as failures.

**BEFORE**
```
======================================================================
ERROR: test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../test_rdd.py", line 723, in test_take_on_jrdd_with_large_rows_should_not_cause_deadlock
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

----------------------------------------------------------------------
Ran 1 test in 1.990s

FAILED (errors=1)
```

**AFTER**
```
Finished test(python3.9): pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (1s) ... 1 tests were skipped
Tests passed in 1 seconds

Skipped tests in pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock with python3.9:
    test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests) ... skipped 'NumPy or Pandas not installed'
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #36235 from williamhyun/skipnumpy.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c34140d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Apr 17, 2022
@dongjoon-hyun (Member)

FYI @ankurdave and @HyukjinKwon, since test_take_on_jrdd_with_large_rows_should_not_cause_deadlock is the new test that introduced this NumPy/Pandas dependency to RDDTests.

HyukjinKwon (Member) left a comment

LGTM!

kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022