Use custom runtime image for bundled JDK #65111

Open
mark-vieira opened this issue Nov 17, 2020 · 8 comments
Labels
:Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts Team:Delivery Meta label for Delivery team

Comments

@mark-vieira
Contributor

With Elasticsearch 8.0 we intend to ditch the no-jdk distribution (see #65109). To make that a little less painful for folks who have to download the full bundled JDK distribution even though they have no intention of using it, we can look at reducing the footprint of the bundled JDK.

One way to do this is to leverage jlink to build a custom runtime image with only the JDK modules used by Elasticsearch and its plugins. There are a few things we'll need to sort out:

  • How do we determine the list of modules to include? Ideally this would be dynamic, perhaps by generating it with jdeps against our existing libs/modules/plugins (see the sketch after this list).
  • What about reflection usages? Perhaps we just rely on existing test coverage running against the bundled/jlinked JDK to catch issues here?
  • What about external plugins that require modules not included in the bundled JDK? Is there a way to detect this, or at least blow up gracefully and instruct folks to bring their own JDK?
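
A minimal sketch of the jdeps-based approach, assuming a stock JDK 21 and placeholder jar locations (the paths and build integration here are illustrative assumptions, not the actual Elasticsearch layout):

# hypothetical: derive the module list from the jars we ship, then jlink with exactly those modules
JDK_MODULES=$(jdeps --multi-release 21 --print-module-deps --ignore-missing-deps lib/*.jar modules/*/*.jar)
jlink --add-modules "$JDK_MODULES" --output custom-jdk --no-man-pages --no-header-files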
@mark-vieira mark-vieira added the :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts label Nov 17, 2020
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Nov 17, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@danielmitterdorfer
Member

Out of curiosity I've played around with jlink. It seems that determining the correct list of modules carries the most risk, yet in my experiments it had only negligible benefit. So I wonder whether we should just do the simplest thing and invoke jlink with all modules. I've tested on JDK 21.0.1 (bundled with ES 8.11.1).

Command to create a custom JDK (used jlink from the bundled JDK):

# scenario: all-modules
export JDK_MODULES=ALL-MODULE-PATH
# scenario: min-modules (somewhat made-up list skipping things that I believe we don't need; I did not get jdeps to produce proper output for Elasticsearch)
export JDK_MODULES="java.base,java.compiler,java.instrument,java.logging,java.management,java.management.rmi,java.naming,java.net.http,java.prefs,java.scripting,java.se,java.security.jgss,java.security.sasl,java.sql,java.sql.rowset,java.xml,java.xml.crypto,jdk.accessibility,jdk.attach,jdk.charsets,jdk.compiler,jdk.crypto.cryptoki,jdk.crypto.ec,jdk.dynalink,jdk.editpad,jdk.hotspot.agent,jdk.incubator.vector,jdk.internal.ed,jdk.internal.jvmstat,jdk.internal.le,jdk.internal.opt,jdk.internal.vm.ci,jdk.internal.vm.compiler,jdk.internal.vm.compiler.management,jdk.jcmd,jdk.jconsole,jdk.jdeps,jdk.jdi,jdk.jdwp.agent,jdk.jfr,jdk.jsobject,jdk.jstatd,jdk.localedata,jdk.management,jdk.management.agent,jdk.management.jfr,jdk.naming.dns,jdk.naming.rmi,jdk.net,jdk.nio.mapmode,jdk.random,jdk.sctp,jdk.security.auth,jdk.security.jgss,jdk.xml.dom,jdk.zipfs"

jlink --add-modules $JDK_MODULES --output custom-jdk --no-man-pages --no-header-files

This resulted in the following sizes for elasticsearch-8.11.1-linux-x86_64.tar.gz. I've measured the .tar.gz distribution size because jlink compresses the modules file; while that lowers the size of the JDK on disk, it might not have as large an impact on the size of the already-compressed .tar.gz distribution.

Scenario      Size [MB]   Size [bytes]
baseline      602         630398135
all-modules   518         543013385
min-modules   516         540043484

or just the size of the jdk directory:

Scenario      Size [MB]   Size [bytes]
baseline      287         299437707
all-modules   178         185198903
min-modules   168         175241952

So a (relatively) simple jlink --add-modules ALL-MODULE-PATH --output custom-jdk --no-man-pages --no-header-files should already lead to quite a significant improvement.
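
For reference, a quick way to sanity-check such a jlinked runtime (the paths are assumptions for illustration, not part of any actual build step):

du -sh custom-jdk                      # compare against the size of the original jdk directory
./custom-jdk/bin/java --list-modules   # confirm the expected modules made it into the image
./custom-jdk/bin/java -version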

Btw, the next biggest contributor to the distribution size is ML, and the biggest dependency there (a whopping 160MB, almost as much as the entire stripped-down JDK above) is libtorch_cpu.so.

@mark-vieira
Contributor Author

So I wonder whether we should just do the simplest thing and invoke jlink with all modules. I've tested on JDK 21.0.1 (bundled with ES 8.11.1).

If we're including all modules, where are the savings coming from? Is it just compression? Is there any noticeable performance/startup cost?

Btw, the next biggest contributor to the distribution size is ML and the biggest dependency there (a whopping 160MB - almost as much as the entire size of our stripped down JDK above) is caused by libtorch_cpu.so.

This is a known "issue". It's the main reason why the aarch64 distribution is so much smaller as well, since this ML dependency is x86-only.

@danielmitterdorfer
Member

If we're including all modules, where are the savings coming from? Is it just compression? Is there any noticeable performance/startup cost?

The majority of the savings comes from the eliminated JMOD files (these are only needed to build a custom runtime image, not to run one; see this great summary on Stack Overflow). Another ~30MB comes from the two class data sharing files that were eliminated, plus the include files.
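
To illustrate where that space lives, assuming a stock OpenJDK 21 directory layout (the exact file names here are assumptions):

du -sh jdk/jmods                     # JMOD files, only needed when running jlink
du -sh jdk/lib/server/classes*.jsa   # default class data sharing archives
du -sh jdk/include                   # header files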

Is there any noticeable performance/startup cost?

I did not notice any impact at all in startup performance.

This is a known "issue"

Is this tracked anywhere? I could not find anything, either here or in the ML repo. Or is there no intention to change this?

@mark-vieira
Contributor Author

The majority of the savings comes from the eliminated JMOD files (these are only needed to build a custom runtime image, not to run one; see this great summary on Stack Overflow). Another ~30MB comes from the two class data sharing files that were eliminated, plus the include files.

Is running jlink actually necessary, then, or could we just strip these files out when copying the JDK? We may run into signing issues on platforms like macOS, though, which could also be the case with jlink.
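
A rough sketch of that alternative, purely for illustration (the directory names assume a stock OpenJDK layout, not our actual packaging logic):

# hypothetical: copy the bundled JDK and drop the pieces that are not needed at runtime
cp -r jdk custom-jdk
rm -rf custom-jdk/jmods custom-jdk/include custom-jdk/man
rm -f custom-jdk/lib/src.zip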

Is this tracked anywhere? I could not find anything, either here or in the ML repo. Or is there no intention to change this?

I'm not aware of any intention to change this. @droberts195, we are aware of this, but I don't think there is any reasonable fix here other than not using the optimized libraries. In this case we are accepting a larger distribution in exchange for better performance.

@danielmitterdorfer
Member

danielmitterdorfer commented Dec 5, 2023

Is running jlink actually necessary then or could we just strip these out when copying the JDK?

It's probably not necessary, but using jlink seems like the canonical way to do this. I'd also expect jlink to be the more robust alternative in case the JDK layout changes (granted, that's unlikely).

We may run into signing issues though on platforms like Mac, which could also be the case with jlink.

Fair point. While it would be nice to have smaller binaries on all platforms, I believe it would already be a step forward if we could reduce the binary size on most platforms. So we could add this step only on platforms where it is safe to do so and keep the original JDK on the others.
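
As a sketch of what I mean, with an entirely made-up platform variable and list (which platforms are actually safe to do this on is an open question):

case "$PLATFORM" in
  linux-x86_64|linux-aarch64)
    # platforms where jlinking is assumed to be safe
    jlink --add-modules ALL-MODULE-PATH --output custom-jdk --no-man-pages --no-header-files
    ;;
  *)
    # keep the original JDK, e.g. to avoid macOS code-signing issues
    cp -r jdk custom-jdk
    ;;
esac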

I'm not aware of any intention to change this

Ok, that's what I already suspected, but thanks for confirming.

@droberts195
Contributor

Is this tracked anywhere?

elastic/ml-cpp#2038 contains some details about this.

but I don't think there is any reasonable fix here other than not using the optimized libraries.

There are two parts: libtorch and libmkl. libtorch is the bulk of the implementation of PyTorch and cannot be removed unless we scrap our inference functionality altogether. libmkl makes inference faster on Intel processors. We only ship it with our linux-x86_64 image; we accept that the other platforms are just much slower for inference, and we always run inference on Intel Linux in our cloud offerings.

We could remove the libmkl part and take the performance hit in return for smaller image sizes. However, the difference MKL makes over the most basic PyTorch build is enormous. When @davidkyle tested the performance of PyTorch inference comparing Eigen to MKL, he found that libtorch built with MKL is much faster on Cascade Lake CPUs: for one benchmark he ran, the total time with MKL was 2.5s compared to 81s with Eigen. There are other BLAS libraries apart from MKL and Eigen that would probably land somewhere in between on both performance and size. But we've been shipping MKL for nearly 2 years now with no problems apart from the size, so somebody would need to come up with a very good business case to mess with that now, given the importance to the business of ELSER and NLP in general.

@danielmitterdorfer
Member

[...] libtorch is the bulk of the implementation of PyTorch, and cannot be removed unless we scrap our inference functionality altogether. [...]

Thanks for the background on this, that's helpful. Given what you said, shipping this library seems like the right tradeoff in terms of performance.
