
[SPARK-56864][INFRA][PYTHON] Consolidate python-ps-minimum image into python-minimum#55872

Closed
zhengruifeng wants to merge 2 commits into apache:master from zhengruifeng:remove-python-ps-minimum

Contributor

@zhengruifeng zhengruifeng commented May 14, 2026

What changes were proposed in this pull request?

This PR consolidates the python-ps-minimum Docker image and its CI workflow into the existing python-minimum image, eliminating a near-duplicate.

Specifically:

  • Updates the label on dev/spark-test-image/python-minimum/Dockerfile to cover both PySpark and Pandas API on Spark.
  • Deletes dev/spark-test-image/python-ps-minimum/Dockerfile.
  • Deletes .github/workflows/build_python_ps_minimum.yml.
  • Adds "pyspark-pandas": "true" to .github/workflows/build_python_minimum.yml so Pandas API on Spark minimum-deps coverage is preserved.
  • Drops the python-ps-minimum entries from .github/workflows/build_infra_images_cache.yml (the paths trigger and the build/push step).
  • Removes the build_python_ps_minimum.yml badge from README.md.
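
As a rough illustration of the workflow bullet above, the sketch below merges the module flag carried by the deleted workflow into the one that remains. It assumes that each workflow selects its test modules through a JSON object of string-valued flags and that the surviving file previously enabled only pyspark; the variable names are hypothetical, and the actual YAML is not reproduced here.

```python
import json

# Assumed shape: each workflow selects test modules via a JSON object of
# string-valued flags. The variable names are illustrative only.
python_minimum_jobs = {"pyspark": "true"}
python_ps_minimum_jobs = {"pyspark-pandas": "true"}  # flag from the deleted workflow

# After consolidation, a single workflow carries both flags.
merged_jobs = {**python_minimum_jobs, **python_ps_minimum_jobs}

print(json.dumps(merged_jobs, indent=2))
```

With both flags in one map, the single remaining workflow can run both sets of tests against the same image.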

Why are the changes needed?

To save CI resources. The two Dockerfiles were nearly identical. The only functional differences were in BASIC_PIP_PKGS:

| Package | python-minimum | python-ps-minimum |
|---|---|---|
| numpy | pinned ==1.22.4 | unpinned |
| scikit-learn | included | omitted |

Everything else (base image, apt packages, Python version, venv setup, CONNECT_PIP_PKGS) was the same. Maintaining both images doubles the image build/cache cost and runs a duplicate scheduled workflow without commensurate test value. Reusing python-minimum (which has the stricter pin and a superset of packages) for the Pandas API on Spark minimum-deps job keeps coverage while halving the image footprint and the associated CI runtime.
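
The comparison in the table above can be sketched as a small diff over two requirement lists. This is a minimal sketch, not the check used for this PR: the lists contain only the two packages named in the table plus a placeholder `shared-pkg` standing in for everything the images have in common, `scikit-learn` is assumed to be unpinned because the description does not say otherwise, and the helper names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_requirement(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'name==1.2.3' into ('name', '==1.2.3'); unpinned gives ('name', None)."""
    match = re.match(r"^([A-Za-z0-9_.\-]+)\s*(.*)$", spec.strip())
    if match is None:
        raise ValueError(f"unrecognized requirement: {spec!r}")
    name, constraint = match.group(1).lower(), match.group(2).strip()
    return name, constraint or None


def describe(pkgs: Dict[str, Optional[str]], name: str) -> str:
    """Render a package's status in one list: its pin, 'unpinned', or 'omitted'."""
    if name not in pkgs:
        return "omitted"
    return pkgs[name] or "unpinned"


def diff_pkgs(left: List[str], right: List[str]) -> Dict[str, Tuple[str, str]]:
    """Return {package: (left_status, right_status)} for packages that differ."""
    left_pkgs = dict(parse_requirement(s) for s in left)
    right_pkgs = dict(parse_requirement(s) for s in right)
    result = {}
    for name in sorted(left_pkgs.keys() | right_pkgs.keys()):
        statuses = (describe(left_pkgs, name), describe(right_pkgs, name))
        if statuses[0] != statuses[1]:
            result[name] = statuses
    return result


# Illustrative lists only: 'shared-pkg' is a placeholder, not a real dependency.
python_minimum = ["numpy==1.22.4", "scikit-learn", "shared-pkg"]
python_ps_minimum = ["numpy", "shared-pkg"]

differences = diff_pkgs(python_minimum, python_ps_minimum)
for package, (minimum, ps_minimum) in differences.items():
    print(f"{package}: python-minimum={minimum}, python-ps-minimum={ps_minimum}")
```

Under these assumptions, only numpy and scikit-learn are reported, which matches the claim that python-minimum carries the stricter pin and a superset of packages.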

Does this PR introduce any user-facing change?

No. CI-only change.

How was this patch tested?

Existing CI. The merged build_python_minimum.yml now runs both pyspark and pyspark-pandas jobs against the python-minimum image.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (model: claude-opus-4-7)

@zhengruifeng zhengruifeng changed the title from "[INFRA] Consolidate python-ps-minimum image into python-minimum" to "[SPARK-56864][INFRA][PYTHON] Consolidate python-ps-minimum image into python-minimum" May 14, 2026
@zhengruifeng zhengruifeng marked this pull request as ready for review May 14, 2026 10:36
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

cc @peter-toth

Generated-by: Claude Code (model: claude-opus-4-7)
@zhengruifeng zhengruifeng force-pushed the remove-python-ps-minimum branch from 75e22e7 to 682212e May 18, 2026 01:48
Contributor Author

thanks all, merged to master.
will send separate PRs for 4.x and 4.2

@zhengruifeng zhengruifeng deleted the remove-python-ps-minimum branch May 18, 2026 02:28
zhengruifeng added a commit that referenced this pull request May 18, 2026
… into python-minimum

### What changes were proposed in this pull request?

Backport of #55872 to branch-4.2.

This PR consolidates the `python-ps-minimum` Docker image and its CI workflow into the existing `python-minimum` image, eliminating a near-duplicate.

Specifically:
- Deletes `dev/spark-test-image/python-ps-minimum/Dockerfile`.
- Deletes `.github/workflows/build_python_ps_minimum.yml`.
- Adds `"pyspark-pandas": "true"` to `.github/workflows/build_python_minimum.yml` so Pandas API on Spark minimum-deps coverage is preserved.
- Drops the `python-ps-minimum` entries from `.github/workflows/build_infra_images_cache.yml` (the `paths` trigger and the build/push step).
- Removes the `build_python_ps_minimum.yml` badge from `README.md`.

### Why are the changes needed?

To save CI resources. The two Dockerfiles were nearly identical. The only functional differences were in `BASIC_PIP_PKGS`:

| Package | python-minimum | python-ps-minimum |
|---|---|---|
| `numpy` | pinned `==1.22.4` | unpinned |
| `scikit-learn` | included | omitted |

Everything else (base image, apt packages, Python version, venv setup, `CONNECT_PIP_PKGS`) was the same. Maintaining both images doubles the image build/cache cost and runs a duplicate scheduled workflow without commensurate test value. Reusing `python-minimum` (which has the stricter pin and a superset of packages) for the Pandas API on Spark minimum-deps job keeps coverage while halving the image footprint and the associated CI runtime.

### Does this PR introduce _any_ user-facing change?

No. CI-only change.

### How was this patch tested?

Existing CI. The merged `build_python_minimum.yml` now runs both `pyspark` and `pyspark-pandas` jobs against the `python-minimum` image.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (model: claude-opus-4-7)

Closes #55945 from zhengruifeng/backport-spark-56864-to-4.2.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request May 18, 2026
… into python-minimum

### What changes were proposed in this pull request?

Backport of #55872 to branch-4.x.

This PR consolidates the `python-ps-minimum` Docker image and its CI workflow into the existing `python-minimum` image, eliminating a near-duplicate.

Specifically:
- Deletes `dev/spark-test-image/python-ps-minimum/Dockerfile`.
- Deletes `.github/workflows/build_python_ps_minimum.yml`.
- Adds `"pyspark-pandas": "true"` to `.github/workflows/build_python_minimum.yml` so Pandas API on Spark minimum-deps coverage is preserved.
- Drops the `python-ps-minimum` entries from `.github/workflows/build_infra_images_cache.yml` (the `paths` trigger and the build/push step).
- Removes the `build_python_ps_minimum.yml` badge from `README.md`.

### Why are the changes needed?

To save CI resources. The two Dockerfiles were nearly identical. The only functional differences were in `BASIC_PIP_PKGS`:

| Package | python-minimum | python-ps-minimum |
|---|---|---|
| `numpy` | pinned `==1.22.4` | unpinned |
| `scikit-learn` | included | omitted |

Everything else (base image, apt packages, Python version, venv setup, `CONNECT_PIP_PKGS`) was the same. Maintaining both images doubles the image build/cache cost and runs a duplicate scheduled workflow without commensurate test value. Reusing `python-minimum` (which has the stricter pin and a superset of packages) for the Pandas API on Spark minimum-deps job keeps coverage while halving the image footprint and the associated CI runtime.

### Does this PR introduce _any_ user-facing change?

No. CI-only change.

### How was this patch tested?

Existing CI. The merged `build_python_minimum.yml` now runs both `pyspark` and `pyspark-pandas` jobs against the `python-minimum` image.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (model: claude-opus-4-7)

Closes #55944 from zhengruifeng/backport-spark-56864-to-4.x.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>