
chore(docker): reduce base_java17 and spark_base image size #18542

Merged: voonhous merged 1 commit into apache:master from kartikeyaagrawal:chore/docker-reduce-java17-image-size (Apr 30, 2026)

@kartikeyaagrawal
Contributor

Describe the issue this Pull Request addresses

Addresses #18523 — the apachehudi/hudi-hadoop_3.4.0-base Java 17 integration-test image is significantly larger than its Java 11 counterpart. This PR applies a first round of size reductions that do not change any runtime behavior.

Summary and Changelog

Shrinks the Java 17 IT image from ~3.56 GB to ~2.58 GB (~27%, ~900 MB). All entrypoints, environment variables, PATH, exposed ports, HADOOP_HOME, SPARK_HOME, and container-level commands remain identical.

docker/hoodie/hadoop/base_java17/Dockerfile

  • Switch base from eclipse-temurin:17-jdk to eclipse-temurin:17-jre-jammy. The container only runs Hadoop; nothing inside the image compiles Java. Saves ~200 MB.
  • Convert to a multi-stage build:
    • Stage 1 (hadoop-builder): installs curl + ca-certificates, downloads and extracts the Hadoop tarball.
    • Stage 2 (runtime): COPY --from=hadoop-builder of only the extracted Hadoop tree.
    • Effect: the curl/ca-certificates install and the tar.gz never land in the final image layer.
  • Use --no-install-recommends and rm -rf /var/lib/apt/lists/* in the runtime stage.
  • Drop the unused .asc signature download and the now-dead wget dependency.
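
The multi-stage layout described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the Dockerfile from this PR: the helper name `write_dockerfile_sketch`, the download URL, the `/opt` layout, and the choice of builder base image are all assumptions made for the example.

```shell
#!/bin/sh
# write_dockerfile_sketch PATH: write an illustrative two-stage Dockerfile.
# Assumptions (not taken from the PR): the download URL, the /opt layout,
# and the base image used for the builder stage.
write_dockerfile_sketch() {
  cat > "$1" <<'EOF'
# Stage 1: fetch and unpack Hadoop; this stage is discarded after the build.
FROM eclipse-temurin:17-jre-jammy AS hadoop-builder
ARG HADOOP_VERSION=3.4.0
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && rm -rf /var/lib/apt/lists/*
RUN curl -fSL "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
      -o /tmp/hadoop.tar.gz \
 && mkdir -p /opt \
 && tar -xzf /tmp/hadoop.tar.gz -C /opt \
 && rm /tmp/hadoop.tar.gz

# Stage 2: runtime image; only the extracted tree is copied across.
FROM eclipse-temurin:17-jre-jammy
ARG HADOOP_VERSION=3.4.0
COPY --from=hadoop-builder /opt/hadoop-${HADOOP_VERSION} /opt/hadoop-${HADOOP_VERSION}
ENV HADOOP_HOME=/opt/hadoop-${HADOOP_VERSION}
EOF
}

# Example:
#   write_dockerfile_sketch ./Dockerfile.sketch
#   docker build -f ./Dockerfile.sketch -t hadoop-sketch .
```

Because the final image is assembled only from the second `FROM` onward, anything installed or downloaded in the builder stage stays out of the shipped layers unless it is explicitly copied.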

docker/hoodie/hadoop/spark_base/Dockerfile

  • Replace the Python-3.10.14-from-source build (which installed build-essential, libssl-dev, libgdbm-dev, libreadline-dev, libbz2-dev, libsqlite3-dev, libffi-dev, zlib1g-dev, then built CPython with --enable-optimizations) with the distro packages python3-minimal + python3-pip. PySpark only needs a Python runtime. Saves ~600 MB.
  • python and python3 are symlinked to the distro python3 exactly as before.
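
For illustration, the distro-package approach might reduce to a single Dockerfile instruction along these lines. The helper name `spark_python_fragment`, the `/usr/bin` paths, and the exact symlink are assumptions; the real Dockerfile may link different locations.

```shell
#!/bin/sh
# spark_python_fragment: print an illustrative Dockerfile RUN instruction
# that installs the distro Python instead of compiling CPython.
# The symlink target and location are assumptions for this sketch.
spark_python_fragment() {
  cat <<'EOF'
RUN apt-get update \
 && apt-get install -y --no-install-recommends python3-minimal python3-pip \
 && ln -sf /usr/bin/python3 /usr/bin/python \
 && rm -rf /var/lib/apt/lists/*
EOF
}

spark_python_fragment
```

Installing and cleaning the apt lists in the same `RUN` keeps the package index out of the resulting layer.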

Before / After

Image                                 | Size
base_java17 on master today           | ~3.56 GB
base_java17 with this PR              | ~2.58 GB
base_java11 (reference, for context)  | ~1.68 GB
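
As a quick sanity check on these figures, the relative saving can be recomputed from the two base_java17 sizes. The helper name `pct_saved` is hypothetical, and the inputs are the approximate sizes quoted above.

```shell
#!/bin/sh
# pct_saved BEFORE AFTER: print the relative reduction as a percentage
# with one decimal place. Hypothetical helper, not part of the Hudi repo.
pct_saved() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Sizes in GB, taken from the table above (approximate values).
pct_saved 3.56 2.58
```

With these inputs the script prints 27.5, which is consistent with the ~27% stated in the summary.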

Usage — identical before and after

# Build (unchanged)
cd docker
./build_docker_images.sh --hadoop-version 3.4.0 --spark-version 4.0.1 --hive-version 3.1.3

# Bring up demo stack (unchanged)
./setup_demo.sh

# Inside the container (unchanged)
docker exec -it adhoc-1 hadoop fs -ls /
docker exec -it adhoc-1 hadoop version
docker exec -it adhoc-2 pyspark

Impact

  • User-facing: none. No Hudi source code is touched. The image's API — entrypoint, commands, env vars, HADOOP_HOME, SPARK_HOME, PATH, exposed ports — is unchanged.
  • Performance: faster docker pull and docker build due to the smaller image and the removal of the CPython compile step.
  • Behavioral difference to be aware of: the runtime image no longer contains javac (JDK → JRE). Nothing in the default IT stacks invokes javac inside the container; if a downstream user was relying on it, they can derive from this image and apt-get install openjdk-17-jdk-headless.
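
A downstream image that restores `javac` could look roughly like the sketch below. The helper name `derived_jdk_dockerfile` is hypothetical and the base image reference is supplied by the caller; no tag is assumed here. The sketch also assumes the base image runs apt-get as a user with sufficient privileges.

```shell
#!/bin/sh
# derived_jdk_dockerfile BASE_IMAGE: print a Dockerfile that layers a JDK
# on top of BASE_IMAGE, following the suggestion in the Impact section.
# BASE_IMAGE is whatever image reference the caller passes in.
derived_jdk_dockerfile() {
  printf 'FROM %s\n' "$1"
  cat <<'EOF'
RUN apt-get update \
 && apt-get install -y --no-install-recommends openjdk-17-jdk-headless \
 && rm -rf /var/lib/apt/lists/*
EOF
}

# Example (replace the argument with the image and tag you actually use):
#   derived_jdk_dockerfile my-registry/my-hudi-base:my-tag > Dockerfile.jdk
#   docker build -f Dockerfile.jdk -t my-hudi-base-jdk .
```

Note that the distro JDK installs alongside the JRE already in the image, so which `java` comes first on PATH may need checking in the derived image.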

Risk Level

low

  • Dockerfile-only change; no Hudi Java/Scala/Python source modified.
  • Hadoop and Spark versions, layout, and classpaths are unchanged.
  • Follow-on, potentially larger reductions (e.g., removing the ~560 MB share/hadoop/tools/lib/ AWS SDK v2 bundle that ships inside the upstream Hadoop tarball) are deliberately not included here, since they narrow the image's out-of-the-box surface area. Happy to add them in a follow-up if maintainers prefer.
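
To see where the remaining size sits before attempting follow-on reductions like the one mentioned above, a small helper along these lines may be useful. The name `largest_entries` is hypothetical, and the example path assumes HADOOP_HOME is set inside the container.

```shell
#!/bin/sh
# largest_entries DIR [N]: list the N largest direct children of DIR
# (size in KiB, then path), biggest first. N defaults to 5.
# Hypothetical helper for inspecting an image's filesystem.
largest_entries() {
  du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-5}"
}

# Example, run inside a container where HADOOP_HOME is set:
#   largest_entries "$HADOOP_HOME/share/hadoop/tools" 3
```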

Documentation Update

none — no new configs, no user-facing API or behavior change, no website update required.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable — N/A (Dockerfile-only; validated by building both images and running the existing setup_demo.sh stack)

Addresses apache#18523.

Shrinks the Java 17 integration-test image from ~3.56 GB to ~2.58 GB
(~27%) without changing any runtime behavior. The container commands,
environment variables, exposed ports, and entrypoint are identical.

base_java17/Dockerfile:
- switch base image from eclipse-temurin:17-jdk to 17-jre-jammy. The
  container only runs Hadoop; no Java compilation happens inside it,
  so the JDK toolchain is not needed.
- convert to a multi-stage build. Stage 1 downloads and extracts the
  Hadoop tarball; stage 2 only COPYs the extracted tree. curl,
  ca-certificates, and the tar.gz no longer land in the final layer.
- use --no-install-recommends and clean apt lists in the runtime stage.
- drop the unused .asc signature download and the now-dead wget dep.

spark_base/Dockerfile:
- replace the Python-3.10.14-from-source build (which pulled in
  build-essential and a full compile toolchain, then built CPython
  with --enable-optimizations inside the image) with the distro
  python3-minimal + python3-pip packages. PySpark only needs a
  Python interpreter at runtime.
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.84%. Comparing base (3a387da) to head (43eba4f).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18542      +/-   ##
============================================
+ Coverage     67.78%   68.84%   +1.06%     
- Complexity    27798    28460     +662     
============================================
  Files          2467     2475       +8     
  Lines        135958   136563     +605     
  Branches      16498    16608     +110     
============================================
+ Hits          92153    94022    +1869     
+ Misses        36528    34978    -1550     
- Partials       7277     7563     +286     
Flag Coverage Δ
common-and-other-modules 44.47% <ø> (-0.16%) ⬇️
hadoop-mr-java-client 44.78% <ø> (+<0.01%) ⬆️
spark-client-hadoop-common 48.55% <ø> (+0.13%) ⬆️
spark-java-tests 49.41% <ø> (+0.50%) ⬆️
spark-scala-tests 45.31% <ø> (+2.25%) ⬆️
utilities 38.03% <ø> (-0.16%) ⬇️

see 145 files with indirect coverage changes

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

No reviewable code files in this PR.

cc @yihua

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 23, 2026
Contributor

@yihua yihua left a comment

@voonhous could you help review this and validate if this works?

@voonhous
Member

voonhous commented Apr 29, 2026

@yihua Verified to work on my E2E tests.

The reduction in image size can also be seen by comparing 1.2.0-SNAPSHOT (pulled from the remote) with 1.3.0-SNAPSHOT (built locally with the changes in this branch).

  • apachehudi/hudi-hadoop_3.4.0-hive_2.3.10-sparkbase_4.0.2:1.2.0-SNAPSHOT = 6.81 GB
  • apachehudi/hudi-hadoop_3.4.0-hive_2.3.10-sparkbase_4.0.2:1.3.0-SNAPSHOT = 5.74 GB <--

Member

@voonhous voonhous left a comment

LGTM

@voonhous voonhous merged commit 0a28d69 into apache:master Apr 30, 2026
106 of 109 checks passed