Use custom runtime image for bundled JDK #65111

Open
mark-vieira opened this issue Nov 17, 2020 · 8 comments
Labels
:Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts Team:Delivery Meta label for Delivery team

Comments

@mark-vieira
Contributor

With Elasticsearch 8.0 we intend to ditch the no-jdk distribution (see #65109). To make that a little less painful for folks who have to download the full bundled JDK distribution even though they have no intention of using it, we can look at reducing the footprint of the bundled JDK.

One way to do this is to leverage jlink to build a custom runtime image with only the JDK modules used by Elasticsearch and its plugins. There are a few things we'll need to sort out:

  • How do we determine the list of modules to include? Ideally this would be dynamic, perhaps by generating it with jdeps against our existing libs/modules/plugins (see the sketch after this list).
  • What about reflection usages? Perhaps we just rely on existing test coverage running against the bundled/jlinked JDK to catch issues here?
  • What about external plugins that require modules not included in the bundled JDK? Is there a way to detect this, or at least blow up gracefully and instruct folks to bring their own JDK?
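
A minimal sketch of the jdeps-based approach, assuming a stock JDK 21 and placeholder jar locations (the paths and build integration here are illustrative assumptions, not the actual Elasticsearch layout):

# hypothetical: derive the module list from the jars we ship, then jlink with exactly those modules
JDK_MODULES=$(jdeps --multi-release 21 --print-module-deps --ignore-missing-deps lib/*.jar modules/*/*.jar)
jlink --add-modules "$JDK_MODULES" --output custom-jdk --no-man-pages --no-header-files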
@mark-vieira mark-vieira added the :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts label Nov 17, 2020
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Nov 17, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@danielmitterdorfer
Member

Out of curiosity I've played around with jlink. It seems that determining the correct list of modules carries the most risk, yet in my experiments it had only negligible benefit. So I wonder whether we should just do the simplest thing and invoke jlink with all modules. I've tested on JDK 21.0.1 (bundled with ES 8.11.1).

Command to create a custom JDK (used jlink from the bundled JDK):

# scenario: all-modules
export JDK_MODULES=ALL-MODULE-PATH
# scenario: min-modules (somewhat made-up list skipping things that I believe we don't need; I did not get jdeps to produce proper output for Elasticsearch)
export JDK_MODULES="java.base,java.compiler,java.instrument,java.logging,java.management,java.management.rmi,java.naming,java.net.http,java.prefs,java.scripting,java.se,java.security.jgss,java.security.sasl,java.sql,java.sql.rowset,java.xml,java.xml.crypto,jdk.accessibility,jdk.attach,jdk.charsets,jdk.compiler,jdk.crypto.cryptoki,jdk.crypto.ec,jdk.dynalink,jdk.editpad,jdk.hotspot.agent,jdk.incubator.vector,jdk.internal.ed,jdk.internal.jvmstat,jdk.internal.le,jdk.internal.opt,jdk.internal.vm.ci,jdk.internal.vm.compiler,jdk.internal.vm.compiler.management,jdk.jcmd,jdk.jconsole,jdk.jdeps,jdk.jdi,jdk.jdwp.agent,jdk.jfr,jdk.jsobject,jdk.jstatd,jdk.localedata,jdk.management,jdk.management.agent,jdk.management.jfr,jdk.naming.dns,jdk.naming.rmi,jdk.net,jdk.nio.mapmode,jdk.random,jdk.sctp,jdk.security.auth,jdk.security.jgss,jdk.xml.dom,jdk.zipfs"

jlink --add-modules $JDK_MODULES --output custom-jdk --no-man-pages --no-header-files

This resulted in the following sizes for elasticsearch-8.11.1-linux-x86_64.tar.gz. I've measured the .tar.gz distribution size because jlink compresses the modules file; while that lowers the size of the JDK on disk, it might not have as large an impact on the size of the already-compressed .tar.gz distribution.

Scenario      Size [MB]   Size [bytes]
baseline      602         630398135
all-modules   518         543013385
min-modules   516         540043484

or just the size of the jdk directory:

Scenario      Size [MB]   Size [bytes]
baseline      287         299437707
all-modules   178         185198903
min-modules   168         175241952

So a (relatively) simple jlink --add-modules ALL-MODULE-PATH --output custom-jdk --no-man-pages --no-header-files should already lead to quite a significant improvement.
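
For reference, a quick way to sanity-check such a jlinked runtime (the paths are assumptions for illustration, not part of any actual build step):

du -sh custom-jdk                      # compare against the size of the original jdk directory
./custom-jdk/bin/java --list-modules   # confirm the expected modules made it into the image
./custom-jdk/bin/java -version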

Btw, the next biggest contributor to the distribution size is ML, and the biggest dependency there (a whopping 160MB, almost as much as the entire stripped-down JDK above) is libtorch_cpu.so.

@mark-vieira
Contributor Author

So I wonder whether we should just do the simplest thing and invoke jlink with all modules. I've tested on JDK 21.0.1 (bundled with ES 8.11.1).

If we're including all modules, where are the savings coming from? Is it just compression? Is there any noticeable performance/startup cost?

Btw, the next biggest contributor to the distribution size is ML and the biggest dependency there (a whopping 160MB - almost as much as the entire size of our stripped down JDK above) is caused by libtorch_cpu.so.

This is a known "issue". It's the main reason why the aarch64 distribution is so much smaller as well, since this ML dependency is x86-only.

@danielmitterdorfer
Member

If we're including all modules, where are the savings coming from? Is it just compression? Is there any noticeable performance/startup cost?

The majority of the savings comes from the eliminated JMOD files (these are only needed to build a custom runtime image, not to run one; see this great summary on Stack Overflow). Another ~30MB comes from the two class data sharing files that were eliminated, plus the include files.
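
To illustrate where that space lives, assuming a stock OpenJDK 21 directory layout (the exact file names here are assumptions):

du -sh jdk/jmods                     # JMOD files, only needed when running jlink
du -sh jdk/lib/server/classes*.jsa   # default class data sharing archives
du -sh jdk/include                   # header files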

Is there any noticeable performance/startup cost?

I did not notice any impact at all in startup performance.

This is a known "issue"

Is this tracked anywhere? I could not find anything, either here or in the ML repo. Or is there no intention to change this?

@mark-vieira
Contributor Author

The majority of the savings comes from the eliminated JMOD files (these are only needed to build a custom runtime image, not to run one; see this great summary on Stack Overflow). Another ~30MB comes from the two class data sharing files that were eliminated, plus the include files.

Is running jlink actually necessary, then, or could we just strip these files out when copying the JDK? We may run into signing issues on platforms like macOS, though, which could also be the case with jlink.
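
A rough sketch of that alternative, purely for illustration (the directory names assume a stock OpenJDK layout, not our actual packaging logic):

# hypothetical: copy the bundled JDK and drop the pieces that are not needed at runtime
cp -r jdk custom-jdk
rm -rf custom-jdk/jmods custom-jdk/include custom-jdk/man
rm -f custom-jdk/lib/src.zip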

Is this tracked anywhere? I could not find anything, either here or in the ML repo. Or is there no intention to change this?

I'm not aware of any intention to change this. @droberts195, we are aware of this, but I don't think there is any reasonable fix here other than not using the optimized libraries. In this case we are accepting a larger distribution in exchange for better performance.

@danielmitterdorfer
Member

danielmitterdorfer commented Dec 5, 2023

Is running jlink actually necessary then or could we just strip these out when copying the JDK?

It's probably not necessary, but using jlink seems like the canonical way to do this. I'd also expect jlink to be the more robust alternative in case the JDK layout changes (granted, that's unlikely).

We may run into signing issues though on platforms like Mac, which could also be the case with jlink.

Fair point. While it would be nice to have smaller binaries on all platforms, I believe it would already be a step forward if we could reduce the binary size on most platforms. So we could add this step only on platforms where it is safe to do so and keep the original JDK on the others.
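
As a sketch of what I mean, with an entirely made-up platform variable and list (which platforms are actually safe to do this on is an open question):

case "$PLATFORM" in
  linux-x86_64|linux-aarch64)
    # platforms where jlinking is assumed to be safe
    jlink --add-modules ALL-MODULE-PATH --output custom-jdk --no-man-pages --no-header-files
    ;;
  *)
    # keep the original JDK, e.g. to avoid macOS code-signing issues
    cp -r jdk custom-jdk
    ;;
esac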

I'm not aware of any intention to change this

Ok, that's what I already suspected, but thanks for confirming.

@droberts195
Contributor

Is this tracked anywhere?

elastic/ml-cpp#2038 contains some details about this.

but I don't think there is any reasonable fix here other than not using the optimized libraries.

There are two parts: libtorch and libmkl. libtorch is the bulk of the implementation of PyTorch and cannot be removed unless we scrap our inference functionality altogether. libmkl makes inference faster on Intel processors. We only ship it with our linux-x86_64 image; we accept that the other platforms are just much slower for inference, and we always run inference on Intel Linux in our cloud offerings.

We could remove the libmkl part and take the performance hit in return for smaller image sizes. However, the difference MKL makes over the most basic PyTorch build is enormous. When @davidkyle tested the performance of PyTorch inference comparing Eigen to MKL, he found that libtorch built with MKL is much faster on Cascade Lake CPUs: for one benchmark he ran, the total time with MKL was 2.5s compared to 81s with Eigen. There are other BLAS libraries apart from MKL and Eigen that would probably land somewhere in between on both performance and size. But we've been shipping MKL for nearly 2 years now with no problems apart from the size, so somebody would need to come up with a very good business case to mess with that now, given the importance to the business of ELSER and NLP in general.

@danielmitterdorfer
Member

[...] libtorch is the bulk of the implementation of PyTorch, and cannot be removed unless we scrap our inference functionality altogether. [...]

Thanks for the background on this, that's helpful. Given what you said, shipping this library seems like the right tradeoff in terms of performance.
