[SPARK-56864][INFRA][PYTHON][4.2] Consolidate python-ps-minimum image into python-minimum #55945

Closed: zhengruifeng wants to merge 1 commit into apache:branch-4.2 from zhengruifeng:backport-spark-56864-to-4.2

@zhengruifeng
Contributor

What changes were proposed in this pull request?

Backport of #55872 to branch-4.2.

This PR consolidates the python-ps-minimum Docker image and its CI workflow into the existing python-minimum image, eliminating a near-duplicate.

Specifically:

  • Deletes dev/spark-test-image/python-ps-minimum/Dockerfile.
  • Deletes .github/workflows/build_python_ps_minimum.yml.
  • Adds "pyspark-pandas": "true" to .github/workflows/build_python_minimum.yml so Pandas API on Spark minimum-deps coverage is preserved.
  • Drops the python-ps-minimum entries from .github/workflows/build_infra_images_cache.yml (the paths trigger and the build/push step).
  • Removes the build_python_ps_minimum.yml badge from README.md.
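
The workflow change in the list above amounts to adding one key to a job-selection map. Here is a minimal Python sketch of that idea, assuming the workflow passes its job selection as a JSON object of string flags. The `enabled_jobs` helper is hypothetical, and the `before` map is an assumption (the PR only states which key was added and that both jobs run afterwards).

```python
import json


def enabled_jobs(jobs_json: str) -> set[str]:
    """Return the names of jobs whose flag is the string "true"."""
    return {name for name, flag in json.loads(jobs_json).items() if flag == "true"}


# Assumed job map before this PR: only the PySpark job is enabled.
before = '{"pyspark": "true"}'
# Job map after this PR: the Pandas API on Spark job is enabled as well.
after = '{"pyspark": "true", "pyspark-pandas": "true"}'

# The set difference shows exactly which job the PR switches on.
added = enabled_jobs(after) - enabled_jobs(before)
print(sorted(added))  # prints ['pyspark-pandas']
```

Only the flag map changes in this sketch; the image being tested stays python-minimum, which is why the separate python-ps-minimum workflow can be deleted.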

Why are the changes needed?

To save CI resources. The two Dockerfiles were nearly identical. The only functional differences were in BASIC_PIP_PKGS:

| Package | python-minimum | python-ps-minimum |
|---|---|---|
| numpy | pinned ==1.22.4 | unpinned |
| scikit-learn | included | omitted |

Everything else (base image, apt packages, Python version, venv setup, CONNECT_PIP_PKGS) was the same. Maintaining both images doubles the image build/cache cost and runs a duplicate scheduled workflow without commensurate test value. Reusing python-minimum (which has the stricter pin and a superset of packages) for the Pandas API on Spark minimum-deps job keeps coverage while halving the image footprint and the associated CI runtime.
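
The "stricter pin and a superset of packages" argument can be checked mechanically. The sketch below is illustrative only: the `covers` helper and the dict representation are hypothetical, pins are modeled as exact versions, and packages shared by both images are left out because the PR describes them as identical. Whether scikit-learn is pinned in python-minimum is not stated here, so it is modeled as unpinned.

```python
from typing import Optional

# Package name -> exact version pin, or None when the package is unpinned.
Reqs = dict[str, Optional[str]]


def covers(candidate: Reqs, other: Reqs) -> bool:
    """True if `candidate` installs every package in `other` and honors its pins."""
    for name, pin in other.items():
        if name not in candidate:
            return False  # a package required by `other` is missing
        if pin is not None and candidate[name] != pin:
            return False  # `other` pins a version that `candidate` does not match
    return True


# Only the differing entries from the table above are modeled.
python_minimum: Reqs = {"numpy": "1.22.4", "scikit-learn": None}
python_ps_minimum: Reqs = {"numpy": None}

print(covers(python_minimum, python_ps_minimum))  # prints True
print(covers(python_ps_minimum, python_minimum))  # prints False
```

Under these assumptions the relation holds in one direction only, which matches the choice to keep python-minimum and drop python-ps-minimum.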

Does this PR introduce any user-facing change?

No. CI-only change.

How was this patch tested?

Existing CI. The merged build_python_minimum.yml now runs both pyspark and pyspark-pandas jobs against the python-minimum image.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (model: claude-opus-4-7)

… python-minimum

This PR consolidates the `python-ps-minimum` Docker image and its CI workflow into the existing `python-minimum` image, eliminating a near-duplicate.

Specifically:
- Updates the label on `dev/spark-test-image/python-minimum/Dockerfile` to cover both PySpark and Pandas API on Spark.
- Deletes `dev/spark-test-image/python-ps-minimum/Dockerfile`.
- Deletes `.github/workflows/build_python_ps_minimum.yml`.
- Adds `"pyspark-pandas": "true"` to `.github/workflows/build_python_minimum.yml` so Pandas API on Spark minimum-deps coverage is preserved.
- Drops the `python-ps-minimum` entries from `.github/workflows/build_infra_images_cache.yml` (the `paths` trigger and the build/push step).
- Removes the `build_python_ps_minimum.yml` badge from `README.md`.

To save CI resources. The two Dockerfiles were nearly identical. The only functional differences were in `BASIC_PIP_PKGS`:

| Package | python-minimum | python-ps-minimum |
|---|---|---|
| `numpy` | pinned `==1.22.4` | unpinned |
| `scikit-learn` | included | omitted |

Everything else (base image, apt packages, Python version, venv setup, `CONNECT_PIP_PKGS`) was the same. Maintaining both images doubles the image build/cache cost and runs a duplicate scheduled workflow without commensurate test value. Reusing `python-minimum` (which has the stricter pin and a superset of packages) for the Pandas API on Spark minimum-deps job keeps coverage while halving the image footprint and the associated CI runtime.

No. CI-only change.

Existing CI. The merged `build_python_minimum.yml` now runs both `pyspark` and `pyspark-pandas` jobs against the `python-minimum` image.

Generated-by: Claude Code (model: claude-opus-4-7)

Closes apache#55872 from zhengruifeng/remove-python-ps-minimum.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request May 18, 2026
… into python-minimum

Closes #55945 from zhengruifeng/backport-spark-56864-to-4.2.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
@zhengruifeng
Contributor Author

merged to 4.2

@zhengruifeng zhengruifeng deleted the backport-spark-56864-to-4.2 branch May 18, 2026 03:43