chore(docker): reduce base_java17 and spark_base image size #18542
Merged
voonhous merged 1 commit on Apr 30, 2026

Conversation
Addresses apache#18523. Shrinks the Java 17 integration-test image from ~3.56 GB to ~2.58 GB (~27%) without changing any runtime behavior. The container commands, environment variables, exposed ports, and entrypoint are identical.

base_java17/Dockerfile:
- Switch the base image from eclipse-temurin:17-jdk to 17-jre-jammy. The container only runs Hadoop; no Java compilation happens inside it, so the JDK toolchain is not needed.
- Convert to a multi-stage build. Stage 1 downloads and extracts the Hadoop tarball; stage 2 only COPYs the extracted tree. curl, ca-certificates, and the tar.gz no longer land in the final layer.
- Use --no-install-recommends and clean apt lists in the runtime stage.
- Drop the unused .asc signature download and the now-dead wget dependency.

spark_base/Dockerfile:
- Replace the Python-3.10.14-from-source build (which pulled in build-essential and a full compile toolchain, then built CPython with --enable-optimizations inside the image) with the distro python3-minimal + python3-pip packages. PySpark only needs a Python runtime at runtime.
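The multi-stage layout described above could look roughly like the sketch below. This is a hedged illustration, not the PR's actual Dockerfile: the Hadoop version, download URL, and install paths are assumptions; only the stage name hadoop-builder, the 17-jre-jammy base, and the COPY-only runtime stage come from the description.

```dockerfile
# Stage 1 (builder): curl, ca-certificates, and the tarball all stay here
# and never reach the final image layers.
FROM eclipse-temurin:17-jre-jammy AS hadoop-builder
ARG HADOOP_VERSION=3.4.0
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && curl -fSL "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
      -o /tmp/hadoop.tar.gz \
 && tar -xzf /tmp/hadoop.tar.gz -C /opt \
 && mv "/opt/hadoop-${HADOOP_VERSION}" /opt/hadoop

# Stage 2 (runtime): JRE-only base; only the extracted tree is copied in.
FROM eclipse-temurin:17-jre-jammy
COPY --from=hadoop-builder /opt/hadoop /opt/hadoop
ENV HADOOP_HOME=/opt/hadoop
ENV PATH="${HADOOP_HOME}/bin:${PATH}"
```

Because the runtime stage starts fresh from the JRE image, nothing installed in the builder stage contributes to the final image size — only the COPY'd tree does.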
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## master #18542 +/- ##
============================================
+ Coverage 67.78% 68.84% +1.06%
- Complexity 27798 28460 +662
============================================
Files 2467 2475 +8
Lines 135958 136563 +605
Branches 16498 16608 +110
============================================
+ Hits 92153 94022 +1869
+ Misses 36528 34978 -1550
- Partials 7277 7563 +286
Flags with carried forward coverage won't be shown.
hudi-agent (Contributor) reviewed on Apr 22, 2026:

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

No reviewable code files in this PR.
cc @yihua
@yihua Verified to work on my E2E tests. Reduction in image size can also be seen comparing

Describe the issue this Pull Request addresses
Addresses #18523 — the apachehudi/hudi-hadoop_3.4.0-base Java 17 integration-test image is significantly larger than its Java 11 counterpart. This PR applies a first round of size reductions that do not change any runtime behavior.

Summary and Changelog

Shrinks the Java 17 IT image from ~3.56 GB to ~2.58 GB (~27%, ~900 MB). All entrypoints, environment variables, PATH, exposed ports, HADOOP_HOME, SPARK_HOME, and container-level commands remain identical.

docker/hoodie/hadoop/base_java17/Dockerfile:
- Switch the base image from eclipse-temurin:17-jdk to eclipse-temurin:17-jre-jammy. The container only runs Hadoop; nothing inside the image compiles Java. Saves ~200 MB.
- Convert to a multi-stage build (stage hadoop-builder): the builder installs curl + ca-certificates, then downloads and extracts the Hadoop tarball; the runtime stage does a COPY --from=hadoop-builder of only the extracted Hadoop tree. The curl/ca-certificates install and the tar.gz never land in the final image layer.
- Use --no-install-recommends and rm -rf /var/lib/apt/lists/* in the runtime stage.
- Drop the unused .asc signature download and the now-dead wget dependency.

docker/hoodie/hadoop/spark_base/Dockerfile:
- Replace the Python-3.10.14-from-source build (which installed build-essential, libssl-dev, libgdbm-dev, libreadline-dev, libbz2-dev, libsqlite3-dev, libffi-dev, and zlib1g-dev, then built CPython with --enable-optimizations) with the distro packages python3-minimal + python3-pip. PySpark only needs a Python runtime. Saves ~600 MB.
- python and python3 are symlinked to the distro python3 exactly as before.

Before / After
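The spark_base change — distro Python instead of a CPython source build — could be sketched as the single RUN layer below. This is an assumption-laden illustration (symlink paths and layout are guesses), not the PR's actual Dockerfile; only the python3-minimal + python3-pip packages and the python/python3 symlink behavior come from the description.

```dockerfile
# Sketch: replace the from-source CPython build with distro packages.
# No build-essential, no -dev headers, no --enable-optimizations compile.
RUN apt-get update \
 && apt-get install -y --no-install-recommends python3-minimal python3-pip \
 && rm -rf /var/lib/apt/lists/* \
 # keep `python` and `python3` resolving to the distro interpreter,
 # matching the previous image's contract (paths assumed)
 && ln -sf /usr/bin/python3 /usr/local/bin/python
```

Cleaning the apt lists in the same RUN instruction matters: a later `rm -rf` in a separate layer would not shrink the image, since the earlier layer still carries the files.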
- base_java17 on master today: ~3.56 GB
- base_java17 with this PR: ~2.58 GB
- base_java11 (reference, for context)

Usage — identical before and after
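As a sanity check, the quoted percentage follows directly from the two image sizes stated above:

```shell
# (3.56 GB - 2.58 GB) / 3.56 GB ≈ 27.5%, matching the "~27%" in the summary
awk 'BEGIN { printf "reduction: %.1f%%\n", (3.56 - 2.58) / 3.56 * 100 }'
```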
Impact

- The container contract — HADOOP_HOME, SPARK_HOME, PATH, exposed ports — is unchanged.
- Faster docker pull and docker build due to the smaller image and the removal of the CPython compile step.
- Removed: javac (JDK → JRE). Nothing in the default IT stacks invokes javac inside the container; if a downstream user was relying on it, they can derive from this image and apt-get install openjdk-17-jdk-headless.

Risk Level
low
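For a downstream user who does need javac back, the derive-and-reinstall route mentioned in the impact notes could look like this. The image reference and tag are hypothetical placeholders; only the openjdk-17-jdk-headless package name comes from the PR description.

```dockerfile
# Hypothetical downstream image: restore the JDK toolchain on top of
# the now JRE-only base (image name and tag are assumptions).
FROM apachehudi/hudi-hadoop_3.4.0-base:latest
RUN apt-get update \
 && apt-get install -y --no-install-recommends openjdk-17-jdk-headless \
 && rm -rf /var/lib/apt/lists/*
```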
Further reductions (e.g. pruning the share/hadoop/tools/lib/ AWS SDK v2 bundle that ships inside the upstream Hadoop tarball) are deliberately not included here, since they narrow the image's out-of-the-box surface area. Happy to add them in a follow-up if maintainers prefer.

Documentation Update
none — no new configs, no user-facing API or behavior change, no website update required.
Contributor's checklist
setup_demo.sh stack)