chore(docker): reduce base_java17 and spark_base image size by kartikeyaagrawal · Pull Request #18542 · apache/hudi

kartikeyaagrawal · 2026-04-21T16:44:57Z

Describe the issue this Pull Request addresses

Addresses #18523 — the apachehudi/hudi-hadoop_3.4.0-base Java 17 integration-test image is significantly larger than its Java 11 counterpart. This PR applies a first round of size reductions that do not change any runtime behavior.

Summary and Changelog

Shrinks the Java 17 IT image from ~3.56 GB to ~2.58 GB (~27%, ~980 MB). All entrypoints, environment variables, PATH, exposed ports, HADOOP_HOME, SPARK_HOME, and container-level commands remain identical.

docker/hoodie/hadoop/base_java17/Dockerfile

- Switch base from eclipse-temurin:17-jdk to eclipse-temurin:17-jre-jammy. The container only runs Hadoop; nothing inside the image compiles Java. Saves ~200 MB.
- Convert to a multi-stage build:
  - Stage 1 (hadoop-builder): installs curl + ca-certificates, downloads and extracts the Hadoop tarball.
  - Stage 2 (runtime): copies only the extracted Hadoop tree via COPY --from=hadoop-builder.
  - Effect: the curl/ca-certificates install and the tar.gz never land in the final image layer.
- Use --no-install-recommends and rm -rf /var/lib/apt/lists/* in the runtime stage.
- Drop the unused .asc signature download and the now-dead wget dependency.
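
For readers unfamiliar with the pattern, the multi-stage layout described above might look roughly like the sketch below. This is not the PR's actual Dockerfile: the builder base image, the `HADOOP_VERSION` argument, the download URL, and the `/opt` install path are all assumptions chosen for illustration.

```shell
# Write an illustrative multi-stage Dockerfile into a scratch directory.
# Image tags, URL, and paths inside the heredoc are assumptions, not the PR's values.
hadoop_sketch_dir="$(mktemp -d)"
cat > "$hadoop_sketch_dir/Dockerfile" <<'EOF'
# Stage 1: fetch and unpack Hadoop. Nothing here ships unless copied explicitly.
FROM eclipse-temurin:17-jre-jammy AS hadoop-builder
ARG HADOOP_VERSION=3.4.0
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && curl -fsSL -o /tmp/hadoop.tar.gz \
      "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
 && tar -xzf /tmp/hadoop.tar.gz -C /opt \
 && rm -f /tmp/hadoop.tar.gz

# Stage 2: the runtime image receives only the extracted Hadoop tree.
FROM eclipse-temurin:17-jre-jammy
ARG HADOOP_VERSION=3.4.0
COPY --from=hadoop-builder /opt/hadoop-${HADOOP_VERSION} /opt/hadoop-${HADOOP_VERSION}
ENV HADOOP_HOME=/opt/hadoop-${HADOOP_VERSION}
EOF
echo "wrote $hadoop_sketch_dir/Dockerfile"
# To try it (requires Docker): docker build -t hadoop-sketch "$hadoop_sketch_dir"
```

Because curl, ca-certificates, and the tarball exist only in the `hadoop-builder` stage, they contribute nothing to the final image's layers.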

docker/hoodie/hadoop/spark_base/Dockerfile

- Replace the Python-3.10.14-from-source build (which installed build-essential, libssl-dev, libgdbm-dev, libreadline-dev, libbz2-dev, libsqlite3-dev, libffi-dev, zlib1g-dev, then built CPython with --enable-optimizations) with the distro packages python3-minimal + python3-pip. PySpark only needs a Python runtime. Saves ~600 MB.
- python and python3 are symlinked to the distro python3 exactly as before.
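
One way to sanity-check the symlink claim is to confirm that both names resolve to the same file. The helper below is a hypothetical sketch (the name `same_target` is not from the PR); it is demonstrated on a scratch directory that mimics the expected layout rather than on the real image.

```shell
# Succeeds when two paths resolve to the same file after following symlinks.
same_target() {
  [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# Demo on a scratch directory standing in for /usr/bin.
py_demo="$(mktemp -d)"
touch "$py_demo/python3.x"                    # stand-in for the real interpreter binary
ln -s "$py_demo/python3.x" "$py_demo/python3" # distro-style python3 link
ln -s "$py_demo/python3" "$py_demo/python"    # the extra `python` alias
if same_target "$py_demo/python" "$py_demo/python3"; then
  echo "python and python3 resolve to the same interpreter"
fi
```

Inside a running container, the equivalent check would be `same_target "$(command -v python)" "$(command -v python3)"`.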

Before / After

Image	Size
`base_java17` on master today	~3.56 GB
`base_java17` with this PR	~2.58 GB
`base_java11` (reference, for context)	~1.68 GB
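
The reduction implied by the table can be recomputed from its rounded figures. The helper name `size_reduction` is made up for this sketch, and the results are only as precise as the approximate sizes above.

```shell
# Print the absolute and relative reduction between two image sizes given in GB.
size_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.2f GB (%.1f%%)\n", before - after, (before - after) / before * 100 }'
}

size_reduction 3.56 2.58   # base_java17, master vs this PR: prints "0.98 GB (27.5%)"
size_reduction 2.58 1.68   # remaining gap to base_java11:   prints "0.90 GB (34.9%)"
```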

Usage — identical before and after

# Build (unchanged)
cd docker
./build_docker_images.sh --hadoop-version 3.4.0 --spark-version 4.0.1 --hive-version 3.1.3

# Bring up demo stack (unchanged)
./setup_demo.sh

# Inside the container (unchanged)
docker exec -it adhoc-1 hadoop fs -ls /
docker exec -it adhoc-1 hadoop version
docker exec -it adhoc-2 pyspark

Impact

User-facing: none. No Hudi source code is touched. The image's API — entrypoint, commands, env vars, HADOOP_HOME, SPARK_HOME, PATH, exposed ports — is unchanged.
Performance: faster docker pull and docker build due to the smaller image and the removal of the CPython compile step.
Behavioral difference to be aware of: the runtime image no longer contains javac (JDK → JRE). Nothing in the default IT stacks invokes javac inside the container; if a downstream user was relying on it, they can derive from this image and apt-get install openjdk-17-jdk-headless.
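
For downstream users who do need javac, a derived image along the lines suggested above could be sketched as follows. `BASE_IMAGE` is a placeholder to be supplied at build time, not a real tag, and the sketch assumes the base image's build steps run as root so that apt-get works.

```shell
# Write an illustrative derived Dockerfile that adds a JDK on top of the slimmed image.
jdk_sketch_dir="$(mktemp -d)"
cat > "$jdk_sketch_dir/Dockerfile" <<'EOF'
# BASE_IMAGE is a placeholder; pass the image you actually built or pulled.
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
RUN apt-get update \
 && apt-get install -y --no-install-recommends openjdk-17-jdk-headless \
 && rm -rf /var/lib/apt/lists/*
EOF
echo "wrote $jdk_sketch_dir/Dockerfile"
# To try it (requires Docker):
#   docker build --build-arg BASE_IMAGE=<your image:tag> -t hudi-base-with-jdk "$jdk_sketch_dir"
```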

Risk Level

low

Dockerfile-only change; no Hudi Java/Scala/Python source modified.
Hadoop and Spark versions, layout, and classpaths are unchanged.
Follow-on, potentially larger reductions (e.g., removing the ~560 MB share/hadoop/tools/lib/ AWS SDK v2 bundle that ships inside the upstream Hadoop tarball) are deliberately not included here, since they narrow the image's out-of-the-box surface area. Happy to add them in a follow-up if maintainers prefer.

Documentation Update

none — no new configs, no user-facing API or behavior change, no website update required.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable — N/A (Dockerfile-only; validated by building both images and running the existing setup_demo.sh stack)

Addresses apache#18523. Shrinks the Java 17 integration-test image from ~3.56 GB to ~2.58 GB (~27%) without changing any runtime behavior. The container commands, environment variables, exposed ports, and entrypoint are identical.

base_java17/Dockerfile:
- switch base image from eclipse-temurin:17-jdk to 17-jre-jammy. The container only runs Hadoop; no Java compilation happens inside it, so the JDK toolchain is not needed.
- convert to a multi-stage build. Stage 1 downloads and extracts the Hadoop tarball; stage 2 only COPYs the extracted tree. curl, ca-certificates, and the tar.gz no longer land in the final layer.
- use --no-install-recommends and clean apt lists in the runtime stage.
- drop the unused .asc signature download and the now-dead wget dep.

spark_base/Dockerfile:
- replace the Python-3.10.14-from-source build (which pulled in build-essential and a full compile toolchain, then built CPython with --enable-optimizations inside the image) with the distro python3-minimal + python3-pip packages. PySpark only needs a Python runtime at runtime.

hudi-bot · 2026-04-21T18:17:34Z

CI report:

43eba4f Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

codecov-commenter · 2026-04-22T05:53:12Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.84%. Comparing base (3a387da) to head (43eba4f).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18542      +/-   ##
============================================
+ Coverage     67.78%   68.84%   +1.06%     
- Complexity    27798    28460     +662     
============================================
  Files          2467     2475       +8     
  Lines        135958   136563     +605     
  Branches      16498    16608     +110     
============================================
+ Hits          92153    94022    +1869     
+ Misses        36528    34978    -1550     
- Partials       7277     7563     +286

Flag	Coverage Δ
common-and-other-modules	`44.47% <ø> (-0.16%)`	⬇️
hadoop-mr-java-client	`44.78% <ø> (+<0.01%)`	⬆️
spark-client-hadoop-common	`48.55% <ø> (+0.13%)`	⬆️
spark-java-tests	`49.41% <ø> (+0.50%)`	⬆️
spark-scala-tests	`45.31% <ø> (+2.25%)`	⬆️
utilities	`38.03% <ø> (-0.16%)`	⬇️

Flags with carried forward coverage won't be shown.
see 145 files with indirect coverage changes

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

No reviewable code files in this PR.

cc @yihua

yihua

@voonhous could you help review this and validate if this works?

voonhous · 2026-04-29T16:53:18Z

@yihua Verified to work on my E2E tests.

The reduction in image size can also be seen by comparing 1.2.0-SNAPSHOT (pulled from the remote registry) with 1.3.0-SNAPSHOT, which we built locally with the changes in this branch.

apachehudi/hudi-hadoop_3.4.0-hive_2.3.10-sparkbase_4.0.2:1.2.0-SNAPSHOT = 6.81 GB
apachehudi/hudi-hadoop_3.4.0-hive_2.3.10-sparkbase_4.0.2:1.3.0-SNAPSHOT = 5.74 GB <--
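
A comparison like this can be reproduced locally with `docker image inspect`, which reports an image's size in bytes. The helper names below are hypothetical, and the conversion assumes decimal gigabytes (10^9 bytes), which is the convention `docker image ls` uses for display.

```shell
# Convert a byte count to decimal gigabytes with two decimals.
bytes_to_gb() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1e9 }'
}

# Look up a local image's size in GB (requires Docker; defined here but not invoked).
image_size_gb() {
  bytes_to_gb "$(docker image inspect --format '{{.Size}}' "$1")"
}

# Conversion demo with literal byte counts matching the sizes quoted above.
bytes_to_gb 6810000000   # prints 6.81
bytes_to_gb 5740000000   # prints 5.74
```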

voonhous

LGTM

hudi-agent reviewed Apr 22, 2026

github-actions bot added the size:S label (PR with lines of changes in (10, 100]) on Apr 23, 2026

yihua reviewed Apr 25, 2026

voonhous reviewed Apr 29, 2026

voonhous approved these changes Apr 29, 2026

voonhous merged commit 0a28d69 into apache:master Apr 30, 2026
106 of 109 checks passed

voonhous merged 1 commit into apache:master from kartikeyaagrawal:chore/docker-reduce-java17-image-size