@zhengruifeng zhengruifeng commented Dec 23, 2024

What changes were proposed in this pull request?

Add a daily build for PySpark with old dependencies

Why are the changes needed?

To guard the installation requirements described in https://apache.github.io/spark/api/python/getting_started/install.html

The installation guide is outdated:

  • pyspark-sql/connect requires
    -- pyarrow>=11.0
    -- numpy>=1.21
    -- pandas>=2.0.0

  • pyspark-pandas requires even newer versions of pandas/pyarrow/numpy
    -- pyarrow>=11.0
    -- numpy>=1.22.4
    -- pandas>=2.2.0
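
A minimal sketch of how the pyspark-sql/connect minimums listed above could be checked against an environment. The helper names (`parse_release`, `meets_minimum`, `unmet`, `installed_versions`) are hypothetical and not part of PySpark; the version table is copied from the list above rather than read from PySpark, and pre-release suffixes are ignored for simplicity.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimums for pyspark-sql/connect, copied from the list above (an assumption
# of this sketch; PySpark itself is not consulted).
SQL_CONNECT_MINIMUMS = {"pyarrow": "11.0", "numpy": "1.21", "pandas": "2.0.0"}


def parse_release(text):
    """Turn '1.22.4' into (1, 22, 4); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two release strings after padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def unmet(installed, minimums):
    """Return {package: (installed_or_None, minimum)} for each unmet minimum."""
    problems = {}
    for name, minimum in minimums.items():
        found = installed.get(name)
        if found is None or not meets_minimum(found, minimum):
            problems[name] = (found, minimum)
    return problems


def installed_versions(names):
    """Look up installed distribution versions; missing packages map to None."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print(unmet(installed_versions(SQL_CONNECT_MINIMUMS), SQL_CONNECT_MINIMUMS))
```

An empty dictionary means every listed minimum is satisfied in the current environment.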

This PR excludes pandas API on Spark (PS). We can either:

  • make PS work with the old versions, and then add it to this workflow; or
  • upgrade the minimum requirements, and add a separate workflow for it.

Does this PR introduce any user-facing change?

no, infra-only

How was this patch tested?

PR build with:

envs:
  default: '{"PYSPARK_IMAGE_TO_TEST": "python-minimum", "PYTHON_TO_TEST": "python3.9"}'

jobs:
  default: '{"pyspark": "true"}'
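
Both defaults above are JSON objects stored as strings. As a rough illustration only (how the workflow actually consumes these values is not shown in this PR), they decode as follows; note that the job flag is the string "true", not a boolean.

```python
import json

# The two default values from the PR build configuration above.
envs_default = '{"PYSPARK_IMAGE_TO_TEST": "python-minimum", "PYTHON_TO_TEST": "python3.9"}'
jobs_default = '{"pyspark": "true"}'

envs = json.loads(envs_default)
jobs = json.loads(jobs_default)

# Flags are strings, so compare against "true" rather than testing truthiness.
enabled_jobs = [name for name, flag in jobs.items() if flag == "true"]

print(envs["PYSPARK_IMAGE_TO_TEST"], envs["PYTHON_TO_TEST"], enabled_jobs)
```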

https://github.com/zhengruifeng/spark/runs/34827211339

Was this patch authored or co-authored using generative AI tooling?

no

zhengruifeng commented Dec 24, 2024

With pyarrow 10.0, all of PySpark fails:

https://github.com/zhengruifeng/spark/actions/runs/12464102622/job/34787749014

zhengruifeng commented Dec 24, 2024

Will send a separate PR to upgrade the minimum requirement of pyarrow to 11.0.0

dongjoon-hyun pushed a commit that referenced this pull request Dec 25, 2024
### What changes were proposed in this pull request?
Upgrade the minimum version of `pyarrow` to 11.0.0

### Why are the changes needed?
According to my test in #49267, PySpark with `pyarrow=10.0.0` is already broken:

- pyspark-sql failed
- pyspark-connect failed
- pyspark-pandas failed

see https://github.com/zhengruifeng/spark/actions/runs/12464102622/job/34787749014

### Does this PR introduce _any_ user-facing change?
doc changes

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #49282 from zhengruifeng/mini_arrow_11.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

zhengruifeng commented

thanks, merged to master

@zhengruifeng zhengruifeng deleted the infra_py_old branch December 26, 2024 01:27