Rationale behind using bzip2 for wasm-binaries.tbz2 #1235

Closed
kspalaiologos opened this issue Jun 18, 2023 · 16 comments

@kspalaiologos

Installing Emscripten for the first time on my machine takes approximately 1min 43.79s wall clock time. 1 min 29.44s out of this figure is spent in bzip2 -d decompressing the wasm-binaries.tbz2 archive, hence my question: why bzip2?

BWT codecs are not a good choice for the kind of data contained in the archive. I have run some tests with BWT codecs better than bzip2, such as bzip3, which yields an archive about 14% smaller, but this is irrelevant because the total time spent in bzip3 (-dj8) is still significant: 37.419s. BWT codecs tend to be symmetric, either because of the SACA algorithm or the entropy coding stage. Further, they do not provide any preprocessing for the executables contained in the archive.

As such, I have tested a few LZ codecs. The archive produced by zstd -9k lies between bz2 and bz3 at around 330'331'630 bytes, but it is 25 times faster to decompress than bzip2 and 9 times faster to decompress than bzip3, hence using zstandard instead of bzip2 would improve the installation time from 1min 43s to 14s.

bzip3 and zstandard are admittedly still uncommon on Linux machines, but the rather ubiquitous lzma provides an even better ratio, albeit with considerably slower decompression than zstandard. I verified this using lzma -9k and then lzma -df: the archive is 207'465'837 bytes, almost halving the distribution size (thanks to LZMA's executable code preprocessors, among others), with a decompression time of 35s.

To conclude: using zstandard (or any LZ codec) instead of bzip2 would decrease download sizes by around 10% and speed up the installation process 6 times. Why is bzip2 still used?
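A comparison of this kind can be roughly reproduced with Python's standard library alone. The sketch below is not the benchmark used above: it covers only the codecs bundled with Python (bz2, lzma, zlib; zstandard is not among them in the Python versions discussed here), and the helper names and the synthetic payload are assumptions made for illustration. Real numbers require the real archive.

```python
import bz2
import lzma
import time
import zlib


def time_decompress(name, compress, decompress, payload):
    """Compress payload once, then time a single decompression pass."""
    packed = compress(payload)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    if restored != payload:
        raise RuntimeError(name + " round trip failed")
    return {"codec": name, "size": len(packed), "seconds": elapsed}


def compare_codecs(payload):
    """Return size and decompression time for each stdlib codec."""
    codecs = [
        ("bzip2", bz2.compress, bz2.decompress),
        ("xz/lzma", lzma.compress, lzma.decompress),
        ("zlib", zlib.compress, zlib.decompress),
    ]
    return [time_decompress(n, c, d, payload) for n, c, d in codecs]


if __name__ == "__main__":
    # Synthetic stand-in data (an assumption); it does not resemble
    # the contents of wasm-binaries.tbz2.
    sample = b"".join(b"symbol_%06d\x00" % i for i in range(200000))
    for row in compare_codecs(sample):
        print("%-8s %10d bytes  %.4fs" % (row["codec"], row["size"], row["seconds"]))
```

Timings from a single pass are noisy, so repeated runs on the actual tarball would be needed before drawing conclusions.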

@sbc100
Collaborator

sbc100 commented Jun 20, 2023

> Why is bzip2 still used?

No particular reason. As long as we can decompress that archive using a module that is part of python3.6, I think we would happily switch to a different format if there are benefits to be had.

@kspalaiologos
Author

@sbc100 Is the requirement of the codec being bundled with py3.6 so hard? Of course, bzip2 archives could be pulled on systems that can not/do not support zstandard, but why not add optional support for it on systems that do have zstd in PATH, considering that it shortens the installation time by almost an order of magnitude?

@sbc100
Collaborator

sbc100 commented Jun 20, 2023

> @sbc100 Is the requirement of the codec being bundled with py3.6 so hard?

It's not set in stone, but we would rather not add more system dependencies.

Would switching to some other format that is built into python still give us some of the benefits which you are after?

> Of course, bzip2 archives could be pulled on systems that can not/do not support zstandard, but why not add optional support for it on systems that do have zstd in PATH, considering that it shortens the installation time by almost an order of magnitude?

Uploading 2 different versions of the archive is possible, but I think it would add some complexity to the upload and download process. If you would like to experiment with PRs to emscripten-releases and emsdk then we could see just how much complexity it would add. (See https://chromium.googlesource.com/emscripten-releases/+/d7a2d5b091de9ea6937bbe6513e055c1bf750e6d/src/build.py#246 and

"linux_url": "https://storage.googleapis.com/webassembly/emscripten-releases-builds/linux/%releases-tag%/wasm-binaries.tbz2",
"macos_url": "https://storage.googleapis.com/webassembly/emscripten-releases-builds/mac/%releases-tag%/wasm-binaries.tbz2",
"windows_url": "https://storage.googleapis.com/webassembly/emscripten-releases-builds/win/%releases-tag%/wasm-binaries.zip",
)

@sbc100
Collaborator

sbc100 commented Jun 20, 2023

(BTW this is the first time I've ever heard of this zstandard thing..)

@kspalaiologos
Author

> Would switching to some other format that is built into python still give us some of the benefits which you are after?

Python does support LZMA out of the box. Decompression would of course be slower than zstandard, but still around 2-3 times better than the current solution. It would also save a lot of bandwidth over bzip2.
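As an illustration of what "out of the box" could mean here, the following sketch extracts a compressed tarball using only the standard `tarfile` module, which detects bzip2, xz and gzip from the file contents when opened with mode `r:*`. The function name `untar_stripped` and its `strip` parameter are hypothetical; this is not how emsdk extracts archives, and it is not hardened against hostile archives (link targets are not rewritten, and names with a leading `./` would need extra handling).

```python
import os
import tarfile


def untar_stripped(archive_path, dest_dir, strip=1):
    """Extract a compressed tarball, dropping `strip` leading path components."""
    os.makedirs(dest_dir, exist_ok=True)
    # 'r:*' asks tarfile to detect the compression format itself.
    with tarfile.open(archive_path, "r:*") as tar:
        members = []
        for member in tar.getmembers():
            parts = member.name.split("/")[strip:]
            if not parts:
                continue  # this entry is the stripped top-level directory
            member.name = "/".join(parts)
            members.append(member)
        tar.extractall(dest_dir, members=members)
```

A call such as `untar_stripped("wasm-binaries.tar.xz", "upstream")` would then behave roughly like `tar -xf ... --strip 1` run inside the destination directory, without depending on an external decompressor.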

@sbc100
Collaborator

sbc100 commented Jun 20, 2023

Actually, looking at the code now, it looks like we call out to the system tar executable to extract these archives:

emsdk/emsdk.py

Lines 510 to 517 in 775ba04

# http://pythonicprose.blogspot.fi/2009/10/python-extract-targz-archive.html
def untargz(source_filename, dest_dir):
  print("Unpacking '" + source_filename + "' to '" + dest_dir + "'")
  mkdir_p(dest_dir)
  returncode = run(['tar', '-xvf' if VERBOSE else '-xf', sdk_path(source_filename), '--strip', '1'], cwd=dest_dir)
  # tfile = tarfile.open(source_filename, 'r:gz')
  # tfile.extractall(dest_dir)
  return returncode == 0

That code seems to date back to 2013: fb549cd

I'm guessing that code would "just work" given a .tar.xz file? (Assuming the host system has an lzma executable that tar can use. I wonder, does the base macOS image include that?)

@kspalaiologos
Author

kspalaiologos commented Jun 20, 2023

You don't actually need lzma installed on the system. That said, bzip2 is bundled with python and still emsdk does not make use of it, calling whatever is installed on my system instead :). tar -I zstd -xvf archive.tar.zst and tar -xJf file.pkg.tar.xz could work. GNU Tar detects the compression format automatically, so you can just swap out .bz2 for .xz and nobody running coreutils would notice.

@sbc100
Collaborator

sbc100 commented Jun 20, 2023

Doesn't the tar executable fork out to the underlying lzma or zstd or bzip2 executable, and if that is not installed on the system the tar command will fail, right? At least I seem to remember folks reporting that tar can fail when bzip2 is missing.

I guess it depends how tar was built and what version of tar is being used.
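One way to surface this failure mode early is to check for the helper program before invoking tar. The sketch below is only an illustration: the suffix-to-tool mapping and the name `missing_helper` are assumptions, and since some tar builds link the compression libraries directly, a missing helper does not necessarily mean extraction will fail.

```python
import shutil

# Helper programs a tar build may spawn for each archive suffix.
# This mapping is an assumption for illustration, not taken from emsdk.
HELPER_FOR_SUFFIX = {
    ".tbz2": "bzip2",
    ".tar.bz2": "bzip2",
    ".tar.xz": "xz",
    ".tar.zst": "zstd",
}


def missing_helper(archive_name, which=shutil.which):
    """Return the helper tool that is absent from PATH, or None if nothing is missing."""
    for suffix, tool in HELPER_FOR_SUFFIX.items():
        if archive_name.endswith(suffix):
            return None if which(tool) else tool
    return None  # unknown suffix: nothing to check
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it is `shutil.which`, which searches PATH.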

@sbc100
Collaborator

sbc100 commented Jun 20, 2023

Also, doesn't macOS tar also detect the compression format automatically? I assume that it must, otherwise the existing code would not work (since we just run tar -xvf).

@kspalaiologos
Author

> Doesn't the tar executable fork out to the underlying lzma or zstd or bzip2 executable, and if that is not installed on the system the tar command will fail, right? At least I seem to remember folks reporting that tar can fail when bzip2 is missing.

Indeed, that is right.

> Also, doesn't macOS tar also detect the compression format automatically? I assume that it must, otherwise the existing code would not work (since we just run tar -xvf).

Yes, likely, but I don't have any experience with Macs.

@dschuff
Member

dschuff commented Jul 1, 2023

It looks like Mac has supported tar.xz files since 10.10 (https://www.ctrl.blog/entry/archive-utility-xz.html). And it turns out we already use the xz archives for the version of Node we ship with emsdk on Linux, and nobody has complained. So I'd be in favor of switching given the size and decompression speed advantages.

We would probably have to do some hackery in the emsdk installer if we want it to support getting the bz2 archives for older versions of emscripten and xz for newer versions.

@sbc100
Collaborator

sbc100 commented Sep 20, 2023

Some results from my initial attempts at switching to .xz (xz vs bzip2):

  • File size is 25% smaller (242M vs 330M)
  • Compression time is 3 times slower (4m44 vs 1m31)
  • Decompression is about 2 times as fast (17s vs 33s)

So it seems like we should go for it. We could even look at speeding up compression using the -T0 flag to xz if that compression time is an issue.

I'm looking into adding the magic to emsdk now (I think we will have to have it check for both filenames).

@sbc100
Collaborator

sbc100 commented Sep 20, 2023

Yup! Passing the -T0 flag to xz gets compression time down to 16 seconds on my 56-core desktop (tar -I "xz -T0" -cf wasm-binaries2.tar.xz install/), and it only sacrificed less than 2% on size (246M vs 242M).

sbc100 added a commit that referenced this issue Sep 21, 2023
This is a bit of a hack but I can't think of another way to do it.
Basically when downloading SDKs, we first try the new `.xz` extension.
If that fails, we fall back to the old `.tbz2`.  Both these first two
download attempts we run in "silent" mode.  If both of them fail we
re-run the original request in non-silent mode so that the error message
will always contain the original `.xz` extension.

See #1235
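The fallback described in this commit message can be sketched as follows. This is an assumption-laden illustration, not the code from the commit: `fetch_archive` and the injected `download(url, silent)` callable are hypothetical names, and the rule for deriving the `.tbz2` URL from the `.tar.xz` one is a guess.

```python
def fetch_archive(xz_url, download):
    """Try the .tar.xz URL, then the legacy .tbz2 URL, both silently.

    `download(url, silent)` is a hypothetical callable that returns a
    local path on success and None on failure.
    """
    suffix = ".tar.xz"
    if not xz_url.endswith(suffix):
        raise ValueError("expected a URL ending in " + suffix)
    legacy_url = xz_url[: -len(suffix)] + ".tbz2"

    for url in (xz_url, legacy_url):
        path = download(url, silent=True)
        if path:
            return path

    # Both attempts failed: repeat the original request in non-silent mode
    # so that the reported error names the .xz URL.
    return download(xz_url, silent=False)
```

Passing the downloader in as a parameter keeps the sketch testable without network access; a real installer would likely use its own download routine here.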
@dschuff
Member

dschuff commented Sep 22, 2023

The emscripten-releases side CL is landing, let's keep an eye on things. Any appetite to help our Windows users too? The Windows archive has always been the largest (although not just because of the compression).

@sbc100
Collaborator

sbc100 commented Sep 22, 2023

I'm personally inclined to leave Windows alone, but mostly because I find debugging Windows issues to be a lot harder than macOS or Linux ones.

@sbc100
Collaborator

sbc100 commented Sep 26, 2023

Closing this for now since we removed the use of bzip2

@sbc100 sbc100 closed this as completed Sep 26, 2023
shlomif pushed a commit to shlomif/emsdk that referenced this issue Sep 29, 2023
sbc100 added a commit that referenced this issue Oct 10, 2023