@dongjoon-hyun dongjoon-hyun commented Nov 21, 2025

What changes were proposed in this pull request?

This PR aims to skip test_perf_profiler_data_source if pyarrow is absent.

Why are the changes needed?

To recover the failed PyPy CIs.

```
======================================================================
ERROR: test_perf_profiler_data_source (pyspark.sql.tests.test_udf_profiler.UDFProfiler2Tests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/test_udf_profiler.py", line 609, in test_perf_profiler_data_source
    self.spark.read.format("TestDataSource").load().collect()
  File "/__w/spark/spark/python/pyspark/sql/classic/dataframe.py", line 469, in collect
    sock_info = self._jdf.collectToPython()
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
    return_value = get_return_value(
  File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", line 263, in deco
    return f(*a, **kw)
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/protocol.py", line 327, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o235.collectToPython.
: org.apache.spark.SparkException:
Error from python worker:
  Traceback (most recent call last):
    File "/usr/local/pypy/pypy3.10/lib/pypy3.10/runpy.py", line 199, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/local/pypy/pypy3.10/lib/pypy3.10/runpy.py", line 86, in _run_code
      exec(code, run_globals)
    File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 37, in <module>
    File "/usr/local/pypy/pypy3.10/lib/pypy3.10/importlib/__init__.py", line 126, in import_module
      return _bootstrap._gcd_import(name[level:], package, level)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
    File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
    File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
    File "<builtin>/frozen importlib._bootstrap_external", line 897, in exec_module
    File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
    File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/worker/plan_data_source_read.py", line 21, in <module>
      import pyarrow as pa
  ModuleNotFoundError: No module named 'pyarrow'
```
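For illustration, the usual way to guard such a test is `unittest.skipIf` keyed on an optional-dependency flag. The sketch below mirrors the pattern PySpark's test utilities use; the names `have_pyarrow` and `pyarrow_requirement_message` are modeled on those utilities here, not imported from them, and the test body is a trivial stand-in:

```python
import unittest

# Probe for the optional dependency once at import time
# (hypothetical mirror of PySpark's have_pyarrow flag).
try:
    import pyarrow  # noqa: F401
    have_pyarrow = True
except ImportError:
    have_pyarrow = False

pyarrow_requirement_message = "pyarrow is required for this test"


class ProfilerTests(unittest.TestCase):
    # Skipped (not errored) on interpreters without pyarrow, e.g. PyPy CI.
    @unittest.skipIf(not have_pyarrow, pyarrow_requirement_message)
    def test_perf_profiler_data_source(self):
        # Placeholder body; the real test reads a Python data source.
        self.assertTrue(True)
```

With this guard, the PyPy job reports the test as skipped instead of failing with `ModuleNotFoundError` in the worker.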

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-54153][PYTHON][TESTS] Skip test_perf_profiler_data_source if pyarrow is absent [SPARK-54153][PYTHON][TESTS][FOLLOWUP] Skip test_perf_profiler_data_source if pyarrow is absent Nov 21, 2025

dongjoon-hyun commented Nov 21, 2025

Could you review this too when you are here, please, @sunchao ? 😄

@sunchao sunchao left a comment


LGTM

@dongjoon-hyun

Thank you so much, @sunchao ! Have a nice Thanksgiving Holiday~

dongjoon-hyun added a commit that referenced this pull request Nov 21, 2025
…source` if `pyarrow` is absent

Closes #53162 from dongjoon-hyun/SPARK-54153.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 9b0b1ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

sunchao commented Nov 21, 2025

@dongjoon-hyun Ha thanks, you too!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-54153 branch November 21, 2025 22:59
@dongjoon-hyun

For the record, this recovers PyPy CI.

(Screenshot, 2025-11-21 16:30:09: passing PyPy CI run)
