Add new option -sGZIP_EMBEDDINGS that gzip the wasm output #21426
Conversation
Thanks for working on this. It's great to see this working out.
I wonder if we can do this without adding yet another setting.
Instead, can we just make this the default for SINGLE_FILE
whenever async compilation is available?
Also, I'd love to see the effect of this in one of our code size tests.
It looks like the current code size tests we have for SINGLE_FILE are:
other.test_minimal_runtime_code_size_hello_webgl2_wasm2js
other.test_minimal_runtime_code_size_hello_webgl2_wasm
See:
Lines 10682 to 10684 in b5b7fed
random_printf_sources = [test_file('hello_random_printf.c'),
                         '-sMALLOC=none',
                         '-sSINGLE_FILE']
If you update this PR to make gzip the default then we can expect to see the output size for these tests drop.
Run the tests with --rebase to update the expectations for the output sizes.
The Compression Streams API is only supported from these versions, which are quite a lot later than Emscripten's minimum supported browser versions: Chrome: 80

Increased startup time can be unexpected for some people, especially in large projects. That was what made me add another option.

Perhaps if we only enable it for

Fair enough. If we do make this opt-in then I suggest we go with a new setting name rather than that. So I suggest:

@sbc100 Could you please elaborate on why the =2 flavors are hard to maintain? I don't have much experience with Emscripten, sorry. I find adding a new option harder in this case, ngl, because we would have to do all the checks against -sSINGLE_FILE and -sWASM_ASYNC_COMPILATION. This only works in tandem with -sSINGLE_FILE, so in this case we only need to look for #if SINGLE_FILE and can ignore the rest.

Yes, I will do that too.

We can also have a polyfill (it can be complex IMHO, but could be worth it for large code).

Yeah, I think in this case the cost of the polyfill would outweigh the benefit (sadly).

True, using compression-streams-polyfill is only 17kb of code (7kb gzipped). Alternatively, fflate could be used directly in non-streaming mode, which is apparently faster than the native API. It would be 4kb gzipped and then we wouldn't need to worry about browser support.

Oh wow! I stand corrected. If we can potentially save several MB then including a 4kb polyfill might be a really nice idea. We could define some size threshold for it too, so it would only be included for larger modules.

Yeah, fflate is really amazing. I use it for making .zips in the browser in another project. I haven't used its async functions though. The WASM files might be big enough for that to be worth using.

So that would make it work even with WASM_ASYNC_COMPILATION on! Also, decompression-only fflate takes only 3kb for synchronous and 4kb for async, not 8kb.

I'm excited that this new setting might help me land #21217, since it could lessen the cost of the base64 encoding of the data file too.
src/base64Utils.js
return bytes;
#else
return new Response(new Blob([bytes]).stream().pipeThrough(new DecompressionStream("gzip"))).arrayBuffer();
Could we continue to allow the above nodejs optimization that avoids atob?
Oh yeah sure, only in nodejs though
One way to do that, I think, would be to have intArrayFromBase64 continue to have its existing well-defined meaning, and we could add an outer `gunzipByteArray` function?
Any chance this code can be made async? If so you can use fetch
to decode base64:
export async function parse_base64(data: string, data_type = 'octet-binary'): Promise<Uint8Array> {
// Parse base64 using a trick from https://stackoverflow.com/a/54123275/2854284
const response = await fetch(`data:application/${data_type};base64,${data}`)
if (!response.ok) {
throw new Error(`Could not parse base64: ${response.status}`)
}
return new Uint8Array(await response.arrayBuffer())
}
Not easily I fear, since we need to continue to support synchronous compilation of the wasm file
I think it might be cleaner to layer this decompression on top of the base64 utilities, i.e. have the caller of tryParseAsDataURI take care of decompressing the result?
Hmm, one particular test (test_minimal_runtime_code_size_random_printf_wasm2js - test_other.other) is always failing

Our PRs are dependent lol

If you change something that affects code size then you can run those tests with --rebase to update the expectations. I usually run
I've always regretted adding the =2 flavors such as

I think if you add a new setting, e.g.

I find this easier to follow than using a
OK if I land #21217 now? I think it should probably not conflict too much with this change and might even simplify it.

Actually I think maybe they won't conflict at all.

If this was changed to use fflate, can it be used from npm, or would it have to be vendored?

We could get it from npm (i.e. put it in the package.json for emscripten), assuming the npm version can easily be copied inline into the resulting module. But it might make more sense to add it as a git submodule.
What a change! This took SUPPORT_BASE64_EMBEDDING off my mind. But what is the use of SUPPORT_BASE64_EMBEDDING now anyway, given that memory initialization is gone? Just SINGLE_FILE seems enough?

When fflate processes a stream, it outputs chunks. I can't find any way to instantiate wasm streaming from the chunks without reassembling them, which defeats the sole purpose of streaming. I love streaming a lot, but it seems impossible here. I will add its MIT license into the folder, and remove the implicit opt-in.

Anything else to improve?
Makes sense. I did a quick test locally and base64+gzip makes a large wasm file (poppler) almost 50% larger than pure gzip.

I still wonder about the implications of the webserver doing gzip anyhow, which I think we can assume most sites use, especially ones that care about efficiency. If the webserver does gzip then we can use a different encoding than base64 that does not break byte alignment. That is, we could do a more naive encoding of the binary into text, which is not as efficient as base64 on its own, but when composed with gzip it would be as efficient as plain gzip. It seems like that would give us all the efficiency benefits here, assuming the webserver uses gzip, and it would be simpler (no fflate etc.)?
I did some tests on a .wasm file of Unity's Boat Attack demo project, which is 35.5 MB worth of Wasm: BoatAttack.Unity2023.2.wasm.gz

1. On this file, I reproduced that gzip-compressing base64-encoded data performs poorly: Uncompressed: 35,554 KB, Gzipped: 9,736 KB

2. Brotli exhibits similar behavior, but not nearly as drastically: Uncompressed: 35,554 KB, Brotlied: 8,163 KB

3. Out of curiosity, I wrote an ad hoc Python test script that embeds the binary data in strings in a way that doesn't straddle bits across multiple characters:

```python
data = open('uncompressed.wasm', 'rb').read()
#b = []
#for i in range(256):
#    b += [i]
#data = bytearray(b)
f = open('output.js', 'wb')
f.write("var js = '".encode('utf-8'))
for d in data:
    if d == 0: f.write(bytearray([196, 128])) # Replace null byte with UTF-8 \x100 (dec 256) so it can be reconstructed with a simple "x & 0xFF" operation.
    elif d == ord("'"): f.write("\\'".encode('utf-8')) # Escape the single quote ' character with a backslash since we are writing a string inside single quotes. (' -> 2 bytes)
    elif d == ord('\r'): f.write("\\r".encode('utf-8')) # Escape carriage return 0x0D as \r -> 2 bytes
    elif d == ord('\n'): f.write("\\n".encode('utf-8')) # Escape newline 0x0A as \n -> 2 bytes
    elif d == ord('\\'): f.write('\\\\'.encode('utf-8')) # Escape backslash \ as \\ -> 2 bytes
    else: f.write(f'{chr(d)}'.encode('utf-8')) # Otherwise write the original value encoded in UTF-8 (1 or 2 bytes).
f.write("';".encode('utf-8'))
```

Decompressing would look as follows:

```js
// Output string from the above compressor
var js = 'Ā�������� \n�\r������������������ !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~��������������������������������� ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';
// Decompression: read individual byte indices
for (var i = 0; i < js.length; ++i) { console.log(js.charCodeAt(i) & 0xFF); }
// Decompression: convert to typed array
var u8array = Uint8Array.from(js, x => x.charCodeAt(0));
console.log(u8array);
```

(This will need more correctness testing, but briefly looking at it, Node, Firefox and Chrome on my system were able to parse the original binary data back from the generated string.)

With this kind of an ad hoc "binary-embedded" encoding, I get the following data for the same BoatAttack .wasm file: Binary-embedded uncompressed .js file: 45,597 KB (+28.2% bigger than the original binary file, -3.79% smaller than base64-encoded). So this kind of format would still gzip and brotli quite well, and would avoid double compression.

EDIT: A small improvement to the above code would be to offset all values by
Uncompressed .js file: 41,578 KB (down from 45,597 KB of the &0xFF version)
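To sanity-check the idea above, here is an in-memory JS round trip of the same scheme. The quote/newline/backslash escapes in the Python script only matter when serializing the string into a .js file, so they are skipped here; the function names are illustrative only.

```javascript
// In-memory round trip of the "binary embedded in a JS string" idea:
// byte 0 becomes U+0100 so that (charCode & 0xFF) recovers it; every
// other byte maps to the code point of the same value.
function encodeBytes(data) {
  let s = '';
  for (const d of data) {
    s += d === 0 ? '\u0100' : String.fromCharCode(d);
  }
  return s;
}

function decodeBytes(s) {
  // Same decode as in the comment above: mask each char code to a byte.
  return Uint8Array.from(s, (c) => c.charCodeAt(0) & 0xFF);
}

// Round-trip every possible byte value 0..255.
const original = Uint8Array.from({ length: 256 }, (_, i) => i);
const roundTripped = decodeBytes(encodeBytes(original));
console.log(roundTripped.length); // 256
```

Each source byte stays a single character in the string, so gzip/brotli on the server still sees byte-aligned data, which is the whole point of the scheme.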
Simple hex encoding is also an option, at the cost of increasing the ungzipped JS file by 50% (compared to base64). But it would then be HTTP-compressed effectively. And in theory you could probably process it 4 bytes at a time (8 characters)? Don't know if that would be faster or slower than one-at-a-time charCodeAt.
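A sketch of the hex option (the function name is illustrative, not an Emscripten API):

```javascript
// Hex decoding sketch: two characters per byte, so 50% larger than
// base64 before HTTP compression, but trivially compressible and simple
// to decode one byte at a time.
function hexToBytes(hex) {
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.substr(i * 2, 2), 16);
  }
  return out;
}

// Decode the "\0asm" wasm magic from its hex form.
const magic = hexToBytes('0061736d');
console.log(magic.length); // 4
```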
We can even use base122 for 19% embedding overhead, at the cost of a worse compression ratio after embedding (which is what servers would apply). It seems that if we use a more space-efficient embedding method, we get worse compression on the server; if we use a less space-efficient one, we get more done on the server. I don't really know where the sweet spot is.
Here's a quick prototype of such binary encoding: main...juj:emscripten:binary_encode

One build yields 4,452 base64.html whereas the other yields 3,650 binenc.html

Synchronously decoding the binary encoding should be faster than synchronously decoding base64, at least in a browser context, since the code to decode (in a browser context) is a strict subset of the computations that base64 decoding performs. I wonder if there would be something here that would improve your use case @msqr1?
It seems like there are 3 main concerns that you have about shipping/using the

Regarding (1), it's MIT licensed so I don't think that will be an issue, right? We already include a bunch of MIT-licensed third-party JS code, don't we? E.g. the polyfills. Regarding (2), it looks like you have shown yourself that it can be a win, right? Regarding (3), is it the case that

Regarding your alternative approach using a custom encoding, that sounds even better, but I also don't see a need to spend too much time on this if we already have a clear size win from using fflate (even when combined with server-side brotli). We can always follow up with even better improvements if somebody wants to work on this, right?

I care about the size; I think that downloading speed can matter more than decoding speed. So I think we would have to test all of base16, 32, 64, 85. It is already a win with fflate; this is just a mini improvement over base64. Ngl, I think we should patch some stuff up and merge this PR; improvements can be done in another one.
Thanks for doing those experiments @juj! This is my summary of them:
(tiny/large is always relative to a column, not a row)

If we care about servers without gzip then this PR is the best option: it's always the tiniest, because if there is server gzip then another gzip on top is almost a no-op, and if there is no server gzip then this is a big win. But if we do not care about servers without gzip then @juj's simple encoding is just as small, because gzipping it is efficient. There are also factors like compression time that we could measure, but just regarding binary size the tradeoffs seem clear.

Do we care about servers without gzip? I'd personally guess "no" - we should encourage servers to use gzip, and not add complexity on our side as a workaround for them.
Thanks @sbc100, that's a good summary.
For Unity the choice of the license itself is not the whole problem, but merely the fact that any new body of third-party code entering the distributable will require a license audit (Unity legal maintains a network graph of licenses of all code it ships). From the Emscripten repository perspective the licensing issue was solved by adding the license file in the third_party/fflate/ directory (thanks @msqr1), so it is tightly bundled with the code. As long as all that third-party code neatly lives in its own subdirectory so it is easily identified (so that I have an easy time removing it in our redistributions if I need to), the licensing concern is all clear.
Indeed, after my previous comment, I have now seen the light that
This is the part that I have been pondering about. In my view, overall I can find the following reasons to target small code size at the expense of a tradeoff somewhere else:
Point 2 generally doesn't apply here in either positive or negative direction before or after, so can be ignored in this context. This leaves points 1 and 3. When I think more about this, it is quite rare that we would have a competition polarizing these two points, i.e. that optimizing for disk size could pessimize startup times. But this type of extra compression step certainly would be one such situation that places 1 and 3 against each other, instead of optimizing for both simultaneously. When developers choose -Oz, I think most of them do so in order to pursue 1. (i.e. choose -Oz so that their build will start up quicker, even though it might run slower) In specific cases they could be pursuing 3 instead. (i.e. choose -Oz so that their build will fit in their CDN, even though that might make it start up slower, and also run slower) Of course there is a possibility that -25% smaller pre-gzipped file size could still win on both 1 and 3 simultaneously, but that comes down to benchmarking a competition between user's network download speed and their CPU decoding speed, so it is not immediately trivial which one wins. I am ok with landing this PR as it currently stands, since it comes with an option to let users configure it.
I don't think any of the baseX methods will be good for this use; they will all likely have similarly poor compressibility, due to the same effect as base64. The binary encoding feature I mention above is smaller in download size than the two layers of gzip, and only requires one layer of gzip (or br). So it will be both faster to download and faster to start up after download.
Or if compared to how far these encodings are from the optimal "just brotli compressed" code: brotlied: 8,163 KB

There are two potential downsides however:

So if you're looking to get the smallest download with the fastest startup, binary encoding will be a clear winner over gzip+base64. But if you need to see the small file size number on disk, then this kind of binary encoding would not help. (There would be a way to get to small files by manually pre-compressing your files to

@msqr1 Would this kind of binary encoding method obsolete the need for -sGZIP_EMBEDDINGS for you? Or would that mean that we'd be looking at maintaining two features side by side, if we landed this PR and later added a
Collecting the actual numbers from Unity's Boat Attack wasm file into a table of this form, the data looks like this:
If the server has brotli compression set up, then the binary encoding will be superior to gzip+base64+gzip (-14% smaller). And with the server having gzip it is practically equal (+2% bigger). If feasible for the use case, I think this binary encoding will be the best for both bytes transferred over the wire and startup time: the decoding step will take a fraction of the time a base64 decoding step takes.
Well, I just want a small size with a good startup time. juj's approach seems superior, and it doesn't need a third-party library, so I'd take it.
If we called the option something generic like
@kripken Just curious, do you happen to still have the direct binary embedding to JS source code? I want to see how that is even possible. Also, I wouldn't call this new setting -sCOMPRESS_EMBEDDING; following juj, this would be a direct change to -sSINGLE_FILE, as this is a new encoding method.
So we maintain both base64 and juj's encoding?
Sorry, I'm not sure what you're asking here? But if you mean @juj's idea for encoding then the code is in #21478.
That's worth discussing, good question. Let's see how #21478 looks. But in general it would be nice to have just a single encoding here, if there are no serious downsides.
@kripken juj said you had this. I just want to see how on earth this is possible, and what it means to have a discrepancy between browser
One downside with @juj's encoding is that bundlers often default to ASCII output. If someone isn't careful they could massively increase the file size. But at least it would still HTTP-compress effectively...
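To illustrate that size hit, assuming a bundler that escapes non-ASCII characters as `\uXXXX` sequences (a common default):

```javascript
// A non-ASCII payload character like U+0100 costs 2 bytes in UTF-8
// output, but 6 bytes once a bundler escapes it to ASCII as "\u0100" -
// a 3x blowup before any HTTP compression.
const utf8Bytes = Buffer.byteLength('\u0100', 'utf8'); // 2
const asciiBytes = '\\u0100'.length;                   // 6
console.log(utf8Bytes, asciiBytes);
```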
I dug up the git history now. Looks like it was not Alon who did it, but another contributor @gagern (hi if you're still around). Originally the feature was added as a linker option. The feature was introduced in this PR in June 2015. I am not sure which PR removed it (I see a trail of #5401 and #5296). PR #12325 in Sep 2020 finally removed the code, but it seems like the feature was obsoleted already before that PR. Re-reading the implementation of that method, it was a bit different from what I remembered, as it used ASCII encoding and escaped bytes >= 0x80 as

The method I propose in #21478 does things a bit differently:
Out of interest, is this embedding/encoding only ever used for the wasm binary file? I seem to remember that wasm2js uses its own internal base64 encoding for data segments, so it won't be affected by this change. Are there any other places where we support embedding binary data? Would we expect the difference between
It should be equally significant for all data, not just wasm binaries
Honestly, I still like the flavor-of-a-setting idea; it lets users know that this is a variant without typing too much, while still being grep-able and holding the same meaning. Instead of MAIN_MODULE=2 we can do it like SUPPORT_LONGJMP, where it can be MAIN_MODULE=dce. What do you think @sbc100?
So for this setting you are proposing something like

Although one potential downside is that it wouldn't then cover other types of data embedding that we might have.. but perhaps this is the only remaining place we do data embedding in JS, so it might be fine.
No, I kinda gave up on this, I'm talking about your general concern over =2 variants
This new setting gzips the wasm binary before embedding, which can reduce code size; it doesn't work with WASM_ASYNC_COMPILATION off. It decompresses via the JS DecompressionStream. It can increase startup time, but the code size benefit can be massive for a big project. For mine: -O0, 5.3MB --> 1.7MB; -O3, 3.7MB --> 1.4MB. Implements #21383. Edit any
Fixes: #21383