Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine server JAR and auto-extracting ZIP #5301

Closed
jmmv opened this issue May 30, 2018 · 14 comments
Closed

Combine server JAR and auto-extracting ZIP #5301

jmmv opened this issue May 30, 2018 · 14 comments
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Performance Issues for Performance teams type: feature request

Comments

@jmmv
Copy link
Contributor

jmmv commented May 30, 2018

The Bazel binary is a combination of the executable C++ client and a ZIP file. These two components are pasted together by a simple cat operation followed by a zip -A call to realign the ZIP contents.

The ZIP file contains the server JAR. And the server JAR is, in itself, a ZIP file. Somewhat redundant, right?

It should be possible to combine both the auto-extracting ZIP with the server JAR. In other words: take the server JAR, add to it whatever embedded tools and data files we need, and then use the JAR itself as the payload for the C++ client. Then teach the client to just use itself as the JAR passed to the JVM instead of pointing the JVM to a file in the install base.

Doing this does not fully eliminate the need for the install base, but is a step towards that goal. This change would reduce the size of the install base significantly (because the server JAR is the biggest component in it), would also reduce the Bazel's startup time because the binary auto-extraction would need to unpack one fewer large file, and would also reduce the time it takes to build Bazel by shortening the packaging action (which is in the critical path).

I was able to make this work relatively easily with Bazel but then got stuck when applying the same change to Blaze. I think I hit an issue where the zip tool didn't like the large number of files within Blaze's server JAR and didn't find a way around it. Then gave up because I had other more important things to do... so filing this bug to avoid this idea getting lost.

@jmmv jmmv added type: feature request P2 We'll consider working on this in future. (Assignee optional) category: misc > release / binary labels May 30, 2018
@meisterT
Copy link
Member

#6318 and https://github.com/bazelbuild/bazel/milestone/13 seem to be related.

@jmmv
Copy link
Contributor Author

jmmv commented Oct 10, 2018

Well, this FR won't really change the size of the binary, though it might impact how well it compresses (by avoiding compressing an already-compressed file). It probably makes sense to add this bug to that milestone though.

@jin jin added team-Performance Issues for Performance teams untriaged and removed untriaged labels Feb 19, 2019
@jmmv
Copy link
Contributor Author

jmmv commented Apr 5, 2019

Looked at this again. As I had lost my previous diff, I had to redo this from scratch. See https://github.com/jmmv/bazel/tree/jar.

Same conclusion as before but with more details now: making this work for Bazel is rather easy. But we cannot make progress because Blaze cannot work. I dug some more today and concluded that, indeed, the number of files is the problem.

When a ZIP file includes more than 64k files, it has to use the ZIP64 format. That's fine in itself... except that ZIP64 files cannot be made self-extracting executables. To reproduce: just create a ZIP file with more entries than 64k and then prepend some junk to it (or the Bazel client iself). From there on, any zip calls on it will fail. Doesn't matter if you use -fz to forcibly-enable ZIP64 throughout the process. And no, Java won't read the file as a valid JAR either, so we cannot just say it's a bug in zip.

Bazel currently has 24k files in its JAR, but Blaze has 75k. Making this change only for Bazel and not for Blaze is not a good idea because we'd need to handle two code paths and the benefit of this change is not large enough to justify that cost. So the answer here is to put Blaze on a diet first. Which... seeing the number of files... may be a good thing in itself.

@jmmv
Copy link
Contributor Author

jmmv commented Apr 5, 2019

The funny thing is that unzip does seem to handle the self-extracting ZIP with too many entries, albeit after printing an error/warning:

error: End-of-centdir-64 signature not where expected (prepended bytes?)
  (attempting to process anyway)
warning [/tmp/new2.zip]:  9431024 extra bytes at beginning or within zipfile
  (attempting to process anyway)

So... why cannot zip do the same thing? Maybe if zip worked and zip -qA fixed whichever offsets, things would work?

@lberki
Copy link
Contributor

lberki commented Apr 9, 2019

64K classes should be enough for everyone... I think that limiting ourselves to 64K classes will bite us in the future. I sure don't want to have code reviews of the sort "inline this inner class because then self-extraction will fail"

Are you sure zip64 files cannot be self-extracting? Maybe it's not as easy as prepending some junk to them, but I'm hoping that it's not an inherent limitation of the file format and can be worked around by tweaking some in-file pointers appropriately.

@lberki
Copy link
Contributor

lberki commented Apr 9, 2019

At least this tidbit insinuates that it's possible:

https://s12316.pcdn.co/wp-content/documentation/xceed-filesystem-for-net/topic114.html

"Xceed's self-extracting binary modules support Zip64, they can create self-extracting files larger than 2 GB"

@jmmv
Copy link
Contributor Author

jmmv commented Apr 9, 2019

Huh, all details I found online (mostly limitations about all existing tooling I found) implied that this was not possible. I even went through the ZIP format documentation but found nothing relevant there and then, well, ran out of time to look into this on Friday.

Yeah, limiting ourselves is probably a bad idea -- unless we could make the contents significantly smaller than they are today.

@meisterT
Copy link
Member

meisterT commented Apr 9, 2019

All the work we're doing with https://github.com/bazelbuild/bazel/milestone/13 should limit the number of classes that appear in the final bazel zip. We haven't started doing anything similar internal (yet).
Note that there's also #6318.

would also reduce the Bazel's startup time

Did you measure this for your prototype?

@lberki
Copy link
Contributor

lberki commented Apr 9, 2019

Maybe we get lucky and we don't even need to invoke the zip tool because singlejar helps us out. I'd be surprised if it did not support zip64, given the size of binaries at Google and IIRC it can also prepend things to the output binary (under the sneaky name of --java_launcher -- it's called that because, well, it's used to prepend an executable...)

@jmmv
Copy link
Contributor Author

jmmv commented Apr 10, 2019

@meisterT I didn't measure yet but the impact of this change alone won't be large in the Bazel case: we have a lot of cruft outside of the JAR that we have to extract anyway, and the JAR is not thaaaat large. You can try though :) https://github.com/jmmv/bazel/tree/jar. In the Blaze case this will be more noticeable though because the JAR is much bigger and we only have a handful of files outside of it.

My eventual desire is to get rid of the self-extraction altogether (though I'm sure that'll be controversial), which will definitely have an impact. Maybe this is not a big issue for end users because the extraction "only" happens one per upgrade... but think about two more cases:

  • Integration tests: each integration test shard ends up having to extract Bazel, and there the costs add up. I bet this will especially be the case on macOS, where file system access times seem to degrade significantly under load.
  • Iterative development: imagine killing the 10-30 second zipping step at the end of every single build we perform... which, even if remote, still is super-slow because we have to transfer MBs of data.

@jmmv
Copy link
Contributor Author

jmmv commented Apr 12, 2019

@lberki Huh, good observation. I've been playing with singlejar and, indeed, if I tell it to use the Bazel client as the Java launcher, it creates a valid ZIP file (with all those 75k entries) with the prepended binary. Now, things are more complicated in Blaze because we do use the Java launcher feature there... but this now seems feasible without having a limit on the number of files.

@meisterT meisterT added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed P2 We'll consider working on this in future. (Assignee optional) category: misc > release / binary labels May 11, 2020
@meisterT meisterT added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) and removed P3 We're not considering working on this, but happy to review a PR. (No assignee) labels Apr 13, 2021
@meisterT
Copy link
Member

meisterT commented Apr 5, 2023

We do not have any plans to work on this issue.

@meisterT meisterT closed this as completed Apr 5, 2023
@brentleyjones brentleyjones closed this as not planned Won't fix, can't repro, duplicate, stale Apr 5, 2023
@jmmv
Copy link
Contributor Author

jmmv commented Apr 16, 2023

This is unfortunate. The change itself seemed straightforward. but a problem remained in the Blaze variant of this... which means even if someone else wanted to tackle the problem outside of Google, they wouldn't be able to due to the lack of a repro. Anyhow, I can see how this has little value and there are more important things to do.

@lberki
Copy link
Contributor

lberki commented Apr 17, 2023

Yeah, it's unfortunate, but we have to prioritize :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Performance Issues for Performance teams type: feature request
Projects
None yet
Development

No branches or pull requests

5 participants