Macro layer sketch for discussion, with implications for plain binary format #58

Closed
lukewagner opened this Issue May 12, 2015 · 20 comments

Projects

None yet

5 participants

@lukewagner
Member

Having a "macro layer" that sits in-between the raw binary encoding and generic compression (e.g., lzham) has been something we've all talked about for a while now. Details for how it works determine what size-optimizations make sense in the raw binary encoding. In particular, a question I've had for a while is how small can a use of a macro be; if it can be a single byte, then maybe we can obviate the need for any optimizations that try to fold immediates into opcodes (even the modest Java 0-3 versions) which would be rather elegant.

Here's one scheme I was thinking about that achieves this. (Forget the context-dependent index space idea, I'll mention that at the end.) I'll start with just nullary macros (#define FOO (x+1) as opposed to #define FOO(x) ((x)+1):

  1. A new "macros" module section contains a sequence of macro bodies which are simply little ASTs encoded the same way as they'd be in a function body.
  2. A use of a macro names the macro by its index in the sequence and the "semantics" of a macro use are to simply splat the body (without any recursive macro expansion) in place. Macros can also only be used in function bodies.
  3. Assumption: we don't want to "reserve" any indices/bits in the raw binary format since, after macro expansion, these would just be wasted. Also, this is clean separation of layers.
    • That means the macro layer has to use some sort of escape sequence
    • A simple two-byte scheme analogous to C string escapes is to reserve one byte value (say 255) for "macro use". Upon encountering this byte, the next var-length-int is the index of the macro. One such macro could be the opcode 255 (remember, no recursive expansion).
  4. To offer a one-byte macro form, we could instead reserve the top bit: if it is set, that means a macro use and the next 7 bits would either be:
    • a sentinel value to indicate that the macro index didn't fit in this byte, so the index is the following var-length-int.
    • a macro index 0-126.

What do we gain? This gives us a way to express 127 of the most common subtrees in a single byte. While there is an expected exponential die-off of hotness of opcodes, I expect we'll find that there are easily 127 hot subtrees in a module. In fact, based on this logic, we may want to reserve more than 127 values for macro indices. Importantly, the v.1 plan to implement the macro layer in user wasm code itself (a polyfill without native support) would allow us to experiment in this space a lot even after v.1 has shipped.

With this, we don't need a single-byte GetLocal or LiteralImmediate forms; these can just be macros. Also, constant pools wouldn't be necessary. One temporary downside is that, as long as a polyfill is used for macros, the expanded form will take up more memory which will hurt decode time (BinaryEncoding.md suggests the size hit would be around 26%, so not terrible). However, with a native macro layer, the macro body doesn't have to be copied, the native decoder can just recurse into the macro body, so this issue would disappear (and indeed get much better if the macro savings were significant).

Considering non-nullary macros: a macro could declare its arity in the module table and then, upon encountering a macro, the following n-arity subtrees would be passed as arguments. This doable but rather more complex. A first-order approximation I'm much more interested in would be to only allow macros parameterized over local variable indices. That is, you could define #define ADD1(x) (x+1) with the limitation that x had to be a variable name. The macro body could use some special macro-layer-only GetMacroArg op (erased by the macro layer) to refer to x in the above example. More experimentation is needed, of course.

Given all this, the effect of context-dependent index spaces would be to allow the index space of i32 (almost always the hottest space) to be disjoint from the other expression types' spaces and thus more macros would be able to fit in a single byte overall. How much this wins would be highly dependent on the codebase, but it's quite possible it wouldn't matter that much since any truly hot non-int32 expressions would end up being given single-byte macro indices.

Thoughts?

@kg
Contributor
kg commented May 13, 2015

My testing shows that aggressive subtree deduplication (more or less what you're proposing here as macros) can cull as much as 80% of the JS AST that represents an asm.js module. I will do some digging to find out how many 'extremely hot' subtrees it has with high hit counts, and how large they are.

I think non-nullary macros only make sense as a post-v1 feature. Their use case is narrower and they require more work on the compiler end. Using macros purely to splat subtrees in is a good approach because it's cheap to implement, reduces wire size, and reduces the heap usage of the decoded AST (since you can just wire up pointers to the macro).

Based on my tests aggressive subtree deduplication also mitigates the need for specialized constant tables - but on the other hand, you can't easily deduplicate subtrees when there are no node boundaries (inlined immediates, etc) to do the insertion at.

@jfbastien
Member

I'm a bit cautious of measuring redundancy in the original code because we know that LLVM isn't quite optimal in this respect. It's a good metric, and we don't have a better LLVM yet, but we should expect redundancy in the LLVM-generated code to eventually decrease. I'll have an intern working on this over the summer.

Another idea: could macros be layered? We could have per-module macros, and perf-function or per-block macros. That would take advantage of local usage patterns, and keep the macro table indices small.

Are we still considering specializing opcodes based on input types? I have a gut feeling that this is also helpful (even with macros) but I'm not 100% sure. I know @kg had some reservations in that.

This approach would also mean that we can build a "wasm-crush" program which takes a wasm module and tries out different compression approaches. I'm assuming we'll have some form of hill-climbing, but someone could also build some fancy optimizer that also takes into account the lzham compression effect!

@lukewagner
Member

@kg Regarding non-nullary macros and "post-v1": none of this is v.1 (at least, in my proposal above). If and when we chose to standardize this macro layer, we would have plenty of measurements so we won't have to speculate on this..

@jfbastien "specializing opcodes based on input types" is what I called "context-dependent index spaces" and discussed in the last paragraph. Independent of size, there is also a slight decoding speed advantage: you don't have to validate the return types since all ops are, by construction, of the right type. All this can be measured, though, and having an intern investigate this properly would be great.

@lukewagner
Member

@jfbastien Oops, I forgot about the layering-macros point. Yes, scoping index spaces more narrowly than the module could make sense here. Ben has also brought up this idea a few times in a slightly different context. Many functions are tiny so the cost of defining the index space could outweigh the wins, which suggests even more flexible scoping (groups of functions). I guess it all just begs experimental data showing how much it wins (particularly w/ generic compression applied). Good intern fodder :) To be clear: I'm imagining this scoping applied only at the macro layer, not the raw encoding layer.

@kripken
Member
kripken commented May 13, 2015

Is the suggestion that the macro layer is something that is specced and implemented natively by browsers, or something that is implemented in userspace (in js or wasm, shipped to the client as part of the web page, basically custom compression)?

Do we expect that a macro layer can do significantly better than gzip with a large window or lzham, etc.? The AST format should be quite easy for gzip to compress, I would expect. Is there a specific way in which we think we can do better?

@kg
Contributor
kg commented May 13, 2015

In my tests tree deduplication produces measurable gains in file size,
decode speed and heap usage for decoded AST, and the file size gains
persist even after gzip or LZMA. They're less significant post-compression,
though. Compression does a great job.

On 13 May 2015 at 16:47, Alon Zakai notifications@github.com wrote:

Is the suggestion that the macro layer is something that is specced and
implemented natively by browsers, or something that is implemented in
userspace (in js or wasm, shipped to the client as part of the web page,
basically custom compression)?

Do we expect that a macro layer can do significantly better than gzip with
a large window or lzham, etc.? The AST format should be quite easy for gzip
to compress, I would expect. Is there a specific way in which we think we
can do better?


Reply to this email directly or view it on GitHub
WebAssembly/spec#58 (comment).

@kripken
Member
kripken commented May 13, 2015

Is there a writeup of your methodology and results?

@kg
Contributor
kg commented May 14, 2015

It's in my repo, but I haven't done an extensive writeup since the experiment is ongoing. Happy to stop and write up what I have. My general approach does not correspond with the proposed spec because I started earlier and have been slowly converging.

I tested on a handful of JSIL and emscripten compiled applications, and my current test AST is an esprima JS AST. I've been iterating from that towards one that resembles (if not perfectly matches) the one being used by the current polyfill, though that's a slow process since it's impossible to interop with that polyfill without writing C++.

Tree deduplication alone culls 70-80% of the unique nodes in the AST in my tests, both for asm.js applications and JSIL applications. A bytecode representation of the esprima AST without deduplication - just the basic techniques we've been discussing, like LEB128 uints, tables, ordering indices so common items have shorter indices - is typically slightly larger than source JS (more overhead if it wasn't minified). Once deduplication kicks in, it's equal in size or considerably smaller than source JS.

Unfortunately, one thing I haven't figured out is that this current approach loses compared to gzipped javascript. But I think that's probably a distraction - esprima is not a realistic AST in terms of file-size terms, it's just useful for understanding how much of a typical JS ast can be deduplicated.

The main comparison gap here is what deduplication will look like on the proposed AST, since it has larger, more complicated nodes, and less nodes - this may make it much harder to find nodes to dedupe, and it means there are less seams where deduping can happen.

I don't have my extensive data/writeups here at home, but I just ran the prototype on bananabread. Some output elided. Ignore the encode times, since JS is a nightmare.

// encoding
05/12/2015  11:43         4,033,785 bb.js
esprima parse: 1807ms
astToModule: 90266ms
Deduped 1937310 object(s) (81.8%)
deduplicateObjects: 51803ms
tag bytes written: 12921
varint sizes: [ 774409, 257940, 831339, 0, 0, 0 ]

05/13/2015  17:08         4,133,051 bb.js.webasm
05/13/2015  17:05       109,505,224 bb.js.ast.json

// read ast json
JSON.parse: 2057ms
heapUsed 335190360

// decoding
read webasm: 32ms
  read string tables: 135ms
  read objects: 644ms
  read arrays: 29ms
bytesToModule: 938ms
heapUsed 53055160

To summarize what this run is showing: parsing bananabread into an esprima ast produced an enormous tree; 82% of the nodes were eliminated by deduplication. Most of these are killed because you can deduplicate entire subtrees in cases where context makes them equivalent - i.e. the expression 'x + y' can be deduplicated in any scope because it refers to them symbolically.

The 'tag bytes written' metric is a measurement of the most recent change I've made to the prototype - I'm walking it towards full static typing as used by luke's prototyping, with some static type definitions for the esprima ast. In the real world I don't think you'd have any tag bytes whatsoever, just opcodes and types implied by them. Before my most recent change I was losing something like 500k just to type tags...

Varint sizes are the # of varints written out of each length (1 byte, 2 byte, 3 byte, etc). This mostly demonstrates the effectiveness of sorting the tables by usage count. One optimization luke has discussed (I think his prototype currently implements it?) that would help here is not putting single-use nodes/subtrees into the tables at all, and just inlining them as immediates. Regardless, in this case 780k of the indices are 1-byte, 257k are 2-byte, and 831k are 3-byte. For smaller files 50-60% of indices are 1-byte, in my testing. Splitting indices up into smaller tables would help here - I believe this is a core element of luke's proposal - as in my current prototype, i only have 3 tables, for 'objects', 'arrays', and 'strings'. Tables of static types indexed using static type information would probably save me tons of disk space here.

The output bytecode from my prototype in this case is larger than the source JS, but only slightly. I was able to get bytecode smaller than JS by filtering the esprima AST, but that's kind of a distraction, so anyway... The ast.json is for comparison to show how large the esprima ast in question would be if serialized to JSON.

For comparison, I load the tree with JSON parse and dump the amount of wall clock time that took and how large the v8 heap representing it is.

Finally, the bytecode is decoded and used to reconstruct an AST. This is done efficiently in a single pass like the format we're proposing in a similar (but not identical) method. The key things to note here:
Bytecode can reconstruct the full tree in 938ms compared to JSON.parse's 2057ms, and the heap usage for the bytecode reconstruction is 53mb instead of 335mb.

This prototype aims to be simple and like-for-like in terms of getting a sense of what this stuff would look like in a polyfill - so it's straightforward hand-written JS, not a ton of code. A JS polyfill we actually ship could easily outperform this.

For scenarios where we are constructing an AST (in v8 or spidermonkey) instead of streaming out generated JS, I think the memory gains alone here are pretty compelling. The fact that a naive implementation is able to reconstruct the AST quickly is a good sign too.

For reference, on one of my test JSIL executables, a 3.7MB JSIL output file produces a 1.8MB bytecode file. Deduplication gains are equivalent, AST size savings are equivalent, and decode performance is still good.

Based on my results so far I actually believe that tree deduplication would eliminate the need for a lot of the micro-optimizations we're considering, but I can't prove that until I've tested against a more realistic representation of ASTs in our proposed format.

@lukewagner
Member

@kripken From the first sentence of the OP: the macro layer would sit in between the simple binary encoding (pre-order AST encoding) and generic compression (lzham). This is the "three layers" @jfbastien often refers to. As also described in the OP, macros can subsume the two techniques described in BinaryEncoding.md--each of which showed significant post-general-compression wins so I'm quite confident that the generalization will win even more.

To answer your second question, from the middleish paragraphs OP: this would not be part of WebAssembly v.1, it would be done in wasm/asm.js and could be rolled into the polyfill library along with lzham and decoding-to-asm.js (swiss-army polyfill). In the future, after we did tons of experiments and felt like we understood this macro space and wanted a moderate decoding speed boost, we could then standardize as part of some v.>1.

@kripken
Member
kripken commented May 14, 2015

@kg: Thanks for the writeup! Very interesting.

@lukewagner: Sorry if my questions weren't clear. The main thing I was wondering about was what you answered in your second paragraph, thanks.

I am still curious about the comparison to gzip, lzma, etc. Basically, macro layers find and eliminate redundancies, and so do gzip etc - I am wondering in what we think we can do better than those at their own game (given a sufficiently large window and other zopfli-like tricks). Some non-macro possibilities are:

  1. Use an AST format, as we have decided. This is already a huge factor in making the code both smaller and more obviously compressible by gzip etc!
  2. Reordering indices to use fewer bytes for common ones. Definitely something gzip etc. don't do well. It sounded like in @kg's tests this was a minor factor, though?
  3. Reorder functions. rfk experimented with this for large size wins on pypy.js. Basically this takes advantage of the fact that function order has no semantic meaning, but is something gzip etc cannot do. What's nice is that doing this doesn't require anything on the decoding side. What's less nice is that the naive encoding is n^2, but rfk manages to do well with a simplification. Note btw that I suspect we would get even bigger wins than rfk, since our AST will be more compressible than asm.js.
  4. Reorder basic blocks. We actually lose the ability to do this because of (1), although we could in theory flip if-elses by adding a not, reorder switches, etc. In any case, the win would likely be mostly on small window sizes in gzip etc., which we probably care about less.

I could be wrong, but my guess is that tests on the final AST, in its optimized format (implicit types, etc.), will show that gzip etc. do most or all of the macro-like things for us, because fundamentally that's the same type of thing that gzip is meant to do - unless I have missed something important in the macro layer idea? In general, I wonder if focusing on the things we know gzip etc. can't do or don't do well might be more beneficial. @kg, your data partially suggests I am wrong (the deduplication is massive!), but I guess I am unable to figure out why gzip etc. can't do similar deduplication on the same data? Perhaps things will change on the final AST?

@jfbastien
Member

@kripken generic compression is good at finding redundancy, but macro compression finds more opportunities because it knows about the structure of the format and can therefore permute things generic compression can't, and find relationships in the tree that generic can't. Generic is great because it can detect repeated patterns, and minimize the encoding so everything isn't byte-aligned. These two approaches are complimentary.

The macro layer implementation I had in mind would be based on a progressive refinement approach. The more time is spent trying to compress, the better the result. @sunfishcode and I discussed this yesterday, and I think we agreed.

I think we need a "deterministic and good enough" compression mode, as well as a "spend XX minutes and do the best you can, nondeterministically" mode.

The nondeterministic mode would use multiple threads and explore random mutations (we'd need to figure out ratios of mutations), with potentially some cross-talk. This can be done either as a hill-climb (accepting some dips sometimes, with checkpoint) or as some genetic algorithm (ooh trendy).

As the nondeterministic mode starts getting smaller improvements it should measure size of macro+lzham compression, not just macro size. My thinking is that the inner loop needs to be fast, so reducing macro size is a good approximate, but as we get less compression we need a slower but more precise answer so lzham should be measured.

I think this would be a great intern project, if anyone is interested!

@kripken
Member
kripken commented May 14, 2015

Oh, have I misunderstood "macro layer" too literally? I thought the intention was literally macro compression. But the first concrete example you give is to permute things based on the underlying format - perhaps like permuting functions and basic blocks, as mentioned earlier? If so then sorry for my previous confusion.

What is an example of "find relationships in the tree that generic can't"?

@lukewagner
Member

@kripken A specific way macros could beat generic compression is by being parameterized (this is the "non-nullary macro" case described in the OP). In theory generic compression could have parameterization (maybe it does! I don't know), but by knowing the structure of the payload, a macro layer should be able to be more aggressive.

@kripken
Member
kripken commented May 14, 2015

We discussed some more on irc. @lukewagner and @jfbastien convinced me that our macros are not a generic solution, since we know where ast nodes begin and end. A generic compressor would need to consider a far larger number of combinations, very likely impractically. We just look at all of a node+children, then each of those children, etc., O(nodes), as opposed to each byte to each other byte, so O(bytes*bytes). Therefore a wasm-specific macro compressor sounds very promising.

@MikeHolman
Member

Note btw that I suspect we would get even bigger wins than rfk, since our AST will be more compressible than asm.js.

I'm curious, why webasm would be more compressible than asm.js? I'd think that in general a pure binary format would be much more dense than a text format and thus less compressible, and asm.js especially where all the repetitive expressions like x|0 would certainly be folded by a compression algorithm.

In any case, I'm pretty optimistic about our macro strategy.

@kripken
Member
kripken commented May 15, 2015

For example, the types become implicit in the current wasm proposal, so x0 + x1 + ... + x100 will look very similar no matter the types of those variables, while in asm.js the |0s would make the two cases different.

asm.js also has cases where more operations are needed than in wasm, like return (~~x()) | 0;. It is true that the | 0 will likely be picked up by gzip, but depending on other things in the window, it might not be; on average this adds more burden on the compressor.

edit: made what I said more precise

@MikeHolman
Member

Ah that's right, the context based space should make this very compressible. I'm sure patterns like op[0] (getlocal(0), getlocal(1)) will be very common. And we can maximize these collisions by reordering opcode/local space, though I'm unsure how that would look in tandem with macros.

@lukewagner
Member

FWIW, while the final size of (gzip,lzma)+binary is smaller than (gzip,lzma)+asm.js, the compression ratio is much higher for asm.js (which is what I interpreted @MikeHolman's question as asking). Example numbers for the polyfill prototype on Unity AngryBots:
asm.js: 19MiB
gzip asm.js: 4.1MiB (4.6x reduction)
xz asm.js: 2.6MiB (7.3x reduction)
binary: 6.3MiB
gzip binary: 3.0MiB (2.1x reduction)
xz binary: 2.1MiB (3.0x reduction)

@MikeHolman
Member

Thanks Luke, that is what I meant! I think it would mean we were doing a pretty bad job if asm.js code could be compressed smaller than our binary encoding can be we. I guess divergence was we were thinking about 2 different metrics of compressibility.

Also, after looking at those numbers, man that gzip guy is pretty good.

@lukewagner
Member

With Alon's changes to BinaryEncoding.md, I think we have captured the main points of this discussion, so closing. I think the next steps are for someone to make a prototype on top of the WebAssembly binary format and that can open a new issue.

@lukewagner lukewagner closed this May 18, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment