
Expressionless encoding #669

Closed
ghost opened this issue Apr 22, 2016 · 11 comments

Comments

@ghost

ghost commented Apr 22, 2016

This issue is for discussion of the design, and the pros and cons, of an expressionless encoding. I'll update this opening comment with the current proposal and rationale.

The proposal:

  • A set of operators (aka instructions) that read their arguments from local variables (aka registers) and write their results to local variables, so that wasm code can be expressed without expressions. These operators would return no values if used in an expression, i.e. the same result as nop. For example: (set_local $r1 (i32.add (get_local $v1) (get_local $v2))) becomes (i32.add $r1 $v1 $v2).
  • Some opcodes would have variations that accept immediate constants. This is noted to significantly improve the encoding efficiency. For example (i32.add/lc $r1 $v1 10).
  • The existing parametric operators select and set_local become type-specific. Thus all local variable indexes are type-specific and use a type-specific index space, which helps compression and simplifies a prediction algorithm by avoiding special consideration for these operators.
  • The existing if operator might no longer be as useful and might be removed, or kept for efficiently encoding diamond control-flow patterns, but it might be subsumed by an alternative, more efficient encoding of block; one suggestion is to replace if by unless (Replace the if operator by an unless operator, #676).
  • The break operators would have no expression value. Encoding the number of values would not be necessary, a small encoding advantage.
  • Functions could only return values via the return operator. The proposed end operator might want to accept arguments.
  • Multiple function return values are supported by the call operators writing to multiple local variables, and by the return operator reading a local variable for each returned value.
  • A predictor is defined that assumes a roughly expression-stack allocation pattern in the local variables. A simple proposal is to track, for each type, the index of the last local variable written. The arguments of an operator are predicted to access the last value written, with an assumption of single use and a stack allocation pattern, so each access hit decreases the predicted index for that type, and the arguments are assumed to be read last-to-first. The result is predicted to be written at the next index. This proposal avoids a stack-top register.
  • The type of the last variable written might also be a good predictor of the type of the next operator. Where special opcodes are defined that omit arguments that are prediction hits, these might also usefully be selected based on the type of the last variable written, handling the frequent case with a smaller set of opcodes - for example, selecting between i32.add, i64.add, f32.add and f64.add based on the last variable written.
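Expanding an expression tree into this register form is a simple post-order walk that allocates temporaries in a stack pattern. The following is a minimal Python sketch, not part of the proposal itself: the flatten name, the $tN register naming, the restriction to binary i32 operators, and constants emitted as separate i32.const instructions (rather than the immediate-constant opcode variants) are all illustrative assumptions.

```python
# A minimal sketch of expanding an expression tree into the proposed
# register form. Hypothetical throughout: the 'flatten' name, the $tN
# register naming, binary i32 operators only, and constants emitted as
# separate i32.const instructions rather than immediate-constant opcodes.

def flatten(expr, out, base=0):
    """expr is ('const', n) or (op, left, right). Operands land in
    registers $t<base> and $t<base+1>; the result overwrites $t<base>,
    giving the stack-like allocation pattern the predictor assumes."""
    if expr[0] == "const":
        out.append(("i32.const", "$t%d" % base, expr[1]))
        return
    op, left, right = expr
    flatten(left, out, base)        # left operand at $t<base>
    flatten(right, out, base + 1)   # right operand one register up
    out.append((op, "$t%d" % base, "$t%d" % base, "$t%d" % (base + 1)))

# The nested adds used as the worked example below:
expr = ("i32.add",
        ("i32.add",
         ("i32.add", ("const", 1), ("const", 2)),
         ("const", 3)),
        ("i32.add", ("const", 4), ("const", 5)))
out = []
flatten(expr, out)
for instr in out:
    print(instr)
```

Note that a real producer would fuse constant operands into the add/lc immediate forms, which is what makes the worked example below shorter than this naive expansion.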

Rationale:

  • Prediction can replace the advantages of the expression stack, and a simple predictor does a relatively good job. Consider if the current wasm expressions implicitly used local variables to store expression temporary values rather than a separate stack. The encoding efficiency would not change, but there would now be the option to re-use these values. Next examine each operator, and note that it is now reading and writing local variables, just with some implicit get_local and set_local operations. The encoding efficiency has still not changed, but the runtime complexity of supporting expressions can be replaced by a local variable predictor, and a simple one at that, which predicts an expression-stack allocation pattern. The fallback (no prediction hit) operators would then accept general local variable indexes, and these would more efficiently handle the non-expression patterns. The prediction might be exploited by simply assigning the index zero to a hit, leaving it to the compressor to assign a short code.

For example, the following expression converted to expressionless statements would have a perfect prediction hit rate:

(i32.add
 (i32.add
  (i32.add (i32.const 1)
           (i32.const 2))
  (i32.const 3))
 (i32.add (i32.const 4)
          (i32.const 5)))
=>
   (i32.const $ti1 1)
   (i32.add/lc $ti1 $ti1 2)
   (i32.add/lc $ti1 $ti1 3)
   (i32.const $ti2 4)
   (i32.add/lc $ti2 $ti2 5)
   (i32.add $ti1 $ti1 $ti2)
  • The expression-based predictor would bias wasm toward expression patterns, giving developers an incentive to emit code in this allocation pattern and to reorder operators to fit it, just as the existing expression support does.
  • Existing strategies such as returning a value from store operators could be expressed in a predictor too, by simply setting the predicted next read index to the index of the local variable holding this argument. But adding such complexity to a predictor might have a small return. The last value written by set_local would be predicted to be the next value read, so that pattern is already optimized well, and changing the order of the arguments to the store operators would fit a simpler predictor.
  • The simple predictor proposed keeps an index per type, so unlike an expression stack it can efficiently pipeline operators working on different types of data. For example, it could efficiently encode two interleaved expression patterns working on different types of data, which could not have been expressed with a single expression stack.
  • The simple predictor has no top-of-stack, and resets with each write operation. Code could jump between expression stacks with little cost if this happened to help. An operator that writes results that are not used will likely cause a write prediction miss on the next write, which will reset the prediction index.
  • The simple predictor does not update the predicted index on a read miss. This supports common patterns in which one argument is outside the expression stack, for example one argument is a function argument in a local variable.
  • A producer does not need to use the predictor, and could just emit the raw indexes, which might suit a producer within a web browser that is not compressing the encoded data.
  • In contrast to a post-order encoding of a group of get_local/operator/set_local, bundling the local variable indexes into the operator allows their types to be known when decoding, which supports using type-specific variable index spaces and avoids redundant types on immediate constants when they are also bundled.
  • These expressionless operators could replace the existing expression based encodings, or be a parallel alternative.
  • Annotating code with debug information or meta information might be simpler with a flat expressionless encoding.
  • Interpreters would be able to allocate a single block of stack on entry, and would not need to dynamically allocate room for expression temporary values.
  • Expressions demand some infrastructure, so there is a runtime complexity cost.
  • Returning values from blocks is pushing the 'expression pattern', and adding complexity to do so. Often multiple values are passed around, and the 'expression pattern' can be pushed a little further with multiple-value support. Adding block1 can push it a little further. Operators could be added to produce any number and order of values, but the options to consume the values are still very limited to single use: destructuring, pass-through parametric operators, result values, or arguments to a call. A type system is needed to validate these operators. It seems like a lot of language development work and a lot of potential bike-shedding, and the product might become rather specific, and baggage from other perspectives and presentation forms.
  • A decoder that rebuilds expressions for presentation would have the option to use block local variables and other programming conveniences to present the code, even where these were considered too much or a burden for the wasm language.
  • Having immediate local variables makes multiple use of values in variables more efficient. For example, constants loaded into variables can be used repeatedly with a lower encoding size than with the wasm expressions, which need a separate get_local for each constant use. If it helped runtime compilers, these might be declarable at the start of the function and read-only, which might allow a compiler to put them in a constant pool - that is, constant-pool references sharing the index space with local variables.
  • There is a demo of AngryBots converted to an expressionless form but using the existing encoding, now updated for the post-order encoding. It can probably be decoded to wast using sexpr-wasm. It is a simple expansion of expressions, not a global per-function reallocation of local variables, and is 138% of the original size. When encoded into operators that read and write immediate local variables, with some opcodes supporting immediate arguments, the file size is 96% of the original, and this is without using if (just blocks and breaks) and without a predictor (which may not change the uncompressed size if just using a zero index for a hit). Using prediction and specialized opcodes for predicted operands reduces the file size to 80% of the original.
  • A patch for Firefox to add some support for this proposal will run AngryBots encoded using 'register'-based operators. Note this has not yet been updated for the post-order encoding. It currently uses opcode 0xbb to enable this alternative mode for operators, per function. The uncompressed file size is 113% of the original. This does not yet implement opcodes with immediate constants, which would improve the encoding efficiency, or the prediction model.
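As a concrete illustration of the rationale above, here is a hedged Python sketch of the proposed predictor (a single type only, with hypothetical names), run over the register form of the worked example with $ti1 and $ti2 assigned indexes 0 and 1; every operand and result access is a prediction hit:

```python
# A sketch (single type, hypothetical names) of the proposed predictor:
# track the index of the last local written; operands are predicted
# last-to-first from that index, each hit decrementing it (stack-style);
# the result is predicted at the next index; every write resets the state.

class Predictor:
    def __init__(self):
        self.last = -1  # index of the last local written (per type in general)

    def step(self, reads, write):
        hits = 0
        for idx in reversed(reads):      # arguments read last-to-first
            if idx == self.last:
                hits += 1
                self.last -= 1           # a hit consumes the value
            # a read miss leaves the predicted index unchanged
        if write == self.last + 1:       # result predicted at the next index
            hits += 1
        self.last = write                # the write resets the prediction
        return hits, len(reads) + 1

# The worked example above, with $ti1 = 0 and $ti2 = 1:
program = [
    ([],     0),   # (i32.const $ti1 1)
    ([0],    0),   # (i32.add/lc $ti1 $ti1 2)
    ([0],    0),   # (i32.add/lc $ti1 $ti1 3)
    ([],     1),   # (i32.const $ti2 4)
    ([1],    1),   # (i32.add/lc $ti2 $ti2 5)
    ([0, 1], 0),   # (i32.add $ti1 $ti1 $ti2)
]
p = Predictor()
hits = total = 0
for reads, write in program:
    h, n = p.step(reads, write)
    hits += h
    total += n
print(hits, total)  # a perfect hit rate on this expression pattern
```

With prediction hits encoded as index zero, every local variable index in this fragment would compress to the same short code.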
@kripken
Member

kripken commented Apr 22, 2016

Can this be summarized as moving from an expression AST to a register machine + prediction for register values?

I can sort of see how prediction could help with the added entropy of a register-based bytecode (predict the next use, and if you are right, don't write the number; and running the same deterministic prediction on client and server guarantees the same stream of registers). However, that adds complexity to the spec. And I'm not sure how good prediction could be - I'd be surprised if it's right even half the time. The expression AST has a guarantee, on the other hand.

Overall this seems like the kind of thing we should have considered a long time ago - not that we can't consider it now, but if we do, it means substantially delaying shipping. So this would require especially strong justification, I think.

@qwertie

qwertie commented Apr 23, 2016

I for one do not want an expressionless encoding. If there are some substantial technical advantages to an expressionless encoding, I wouldn't blame the CG for adopting it, but the ease of comprehension of Wasm is something I have come to really appreciate - not just as a format for pretty printing (as mentioned, a transform could produce an expression tree for that purpose) but as something I look forward to working with at a low level, because it is simple to comprehend in its "native" form.

C has sometimes been called a "high level assembly language", but Wasm fits that description much better. It's just really nice.

@ghost
Author

ghost commented Apr 23, 2016

@qwertie The 'expression' support that you get in wasm might not be much chop for 'ease of comprehension' anyway, and is very limited in scope, which is why I have given up on it. To present it for 'ease of comprehension' I am having to first decompose it into a more primitive form and then rebuild expressions, so the current expressions are just a burden and baggage, and also have a good deal of implementation cost too.

I expect some producers might have an extra burden, and might that be what you are concerned about? An algorithm for expanding an expression into statements that operate on registers is almost as simple as walking the AST, and will be very natural with the post-order encoding. Could a relatively simple library that transforms from the expression format to the expressionless format address this? Doing so seems consistent with the goals of wasm - that is, moving complexity out of the consumer and into the producer.

I want a good 'high level assembly language' for wasm too, but the wasm code section encoding is not it and never will be. I have tried and tried, I see the barriers, it's not going to happen. Thankfully the ignored-sections are part of wasm, and we can build this language there, explore many different versions, and continue development largely independent of the wasm code section. It could include 'let' variables, multiple value support, discarding of unconsumed values, etc.

@ghost
Author

ghost commented Apr 23, 2016

@kripken Yes, 'register machine + prediction for register values' is a good summary.

Can you find an example where the predictor would not have a 100% hit rate for a wasm AST expression expanded to register instructions? I see a hit rate of around 50% for AngryBots for all variable accesses not just those for expression temps, and around 25% of the instruction instances have a hit for all operands.

These operators are just bundles of set_local, get_local and existing operators, so it's not a lot of work to support them, and probably some satisfying simplifications for the implementations. When were you expecting to see wasm 'ship'?

If I get time I'll implement a decoder in a web browser to give a practical example. Is the v8 wasm decoder the benchmark? Is the Firefox SM decoder competitive and useful for a demo?

@qwertie

qwertie commented Apr 23, 2016

I expect some producers might have an extra burden, and might that be what you are concerned about?

No, I'm talking about the cognitive ease of comprehending an expression tree versus a flat instruction list.

I see the barriers, it's not going to happen.

What barriers? Barriers to what? (note that these are two separate questions.)

@ghost
Author

ghost commented Apr 23, 2016

@qwertie I agree that expressions are better for the reader, but the wasm encoding is a binary code: the opcodes are not expressions, and with the post-order encoding could be viewed as stack machine operations. There is even discussion of adding a discard operation and a tee operation. Working at the low level you see operators. It's only via a decoder that expressions are built, and this can be done from the wasm stack machine opcodes or from register-based opcodes.

A difference might be how restricted the wasm stack machine opcodes are - restricted to single-use patterns - but a decoder can determine if a register has a single use, and if so then it fits the same simple pattern; where there are multiple uses it does not have to give up, but can use a block-local let variable or other expression patterns to help the reader. So the key is making it as simple as possible for the decoder to analyse the code, and the current expression support is just baggage for this use case. The simpler an SSA-style analysis is, the simpler the job for the decoders, and the restricted expression support in wasm just complicates this.
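The single-use analysis mentioned here is cheap for a decoder. Below is a hedged Python sketch (the rebuild name, the tuple instruction form, and the (let ...) presentation are all illustrative assumptions, and side-effect ordering is ignored): single-use definitions fold back into nested expressions, multi-use definitions are named for the reader.

```python
# A sketch of the decoder analysis described above (hypothetical names;
# side-effect ordering is ignored). Operands starting with '$' are
# registers; single-use definitions are inlined as nested expressions,
# multi-use definitions are presented as block-local 'let' variables.
from collections import Counter

def rebuild(instrs):
    """instrs: list of (op, dest, operands) register instructions."""
    uses = Counter(a for _, _, ops in instrs for a in ops if a.startswith("$"))
    inline = {}   # single-use register -> its defining expression
    stmts = []
    for op, dest, ops in instrs:
        parts = [inline.pop(a, a) if uses.get(a) == 1 else a for a in ops]
        expr = "(%s%s)" % (op, "".join(" " + p for p in parts))
        if uses.get(dest) == 1:
            inline[dest] = expr     # fold into the one use site later
        elif uses.get(dest, 0) > 1:
            stmts.append("(let %s %s)" % (dest, expr))  # name multi-use values
        else:
            stmts.append(expr)      # result unused: emit as a statement
    return stmts

print(rebuild([("i32.const", "$t1", ["1"]),
               ("i32.const", "$t2", ["2"]),
               ("i32.add",   "$t3", ["$t1", "$t2"])]))
```

Single-use temporaries collapse back into one nested expression, so a presentation decoder loses nothing relative to the current AST while gaining the let form for shared values.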

I am not even sure the expressions help the producer. The llvm backend has to work hard to fit the expression pattern. With the register-based opcodes the encoding efficiency should be higher where a producer does not want to go to such lengths to shoehorn its output into restricted expression patterns.

The 'barriers' are getting the browser vendors on board and in agreement. My experience is extreme resistance to added complexity, and that there are a lot of differences of opinion. The expressionless encoding proposal removes complexity, and fits a plan to develop higher-level decoders externally from the wasm code encoding, where agreement is not necessary and different projects can explore different presentation languages.

@kripken
Member

kripken commented Apr 23, 2016

Can you find an example where the predictor would not have a 100% hit rate for a wasm AST expression expanded to register instructions? I see a hit rate of around 50% for AngryBots for all variable accesses not just those for expression temps, and around 25% of the instruction instances have a hit for all operands.

Yes, you can get all the expression registers right, but each instruction's use of a register needs to either use a predicted register or an explicitly specified one. That's extra information that is not present in the AST approach. Or, if you find a way to encode things without that extra information, you'd be isomorphic to an AST anyhow.

Overall, we did experiments early on (mentioned in the FAQ, etc.) on ASTs vs other approaches like a register machine, and ASTs were smaller.

When were you expecting to see wasm 'ship'?

I don't know, but my point was that a major change now would require redoing a lot of engineering that's already been done, which would significantly delay the ship date, whenever it is. I just don't think it's realistic to expect that to happen here, given that it's something that we've already carefully evaluated before. Unless we really missed something big back then, which is certainly possible, and if so we should certainly be willing to delay things - but I just think that is very unlikely.

@ghost
Author

ghost commented Apr 23, 2016

@kripken So we agree that if the registers are allocated appropriately for code that fits the current AST expression pattern then the prediction would be perfect.

The proposal does allow an immediate reference to a local variable that might be a prediction miss, but that is a feature that does not need to be used; for example, an explicit get_local could be issued to separate it. Where this 'miss' occurs it only costs an immediate LEB128, not a get_local opcode. There are obviously some subtle differences, and the restricted AST is better optimized for code that fits the pattern, but for patterns that do not match, the proposal is more efficient and degrades more gracefully.

The prior results comparing a register machine probably did not have the predictor, but even without the predictor the uncompressed sizes are similar! Without putting in some development work to optimize the different approaches, the results of comparisons might not be useful; for example, some opcodes accepting immediate constants were very helpful for the register machine. The register machine seems to have a lot of advantages, and even if it were not as efficient at encoding it might be worth considering.

Even if wasm were to keep the expression support, it can be argued that it needs the register opcode variations anyway, to efficiently encode patterns that do not fit the restricted expression patterns: for example, the proposed multi-value support which writes results of call operators to local variables, passing around multiple values, and code with multiple uses of definitions. But if wasm has these register opcodes then there seems to be a good argument that it does not even need the expression support, that it could be simplified. So wasm could ship with the AST, then add the new encodings, then deprecate the expression support. It does not appear a big engineering task if people really wanted to go down this path, and I have had a look at v8 and SM now. I've demonstrated AngryBots converted into the register-machine format but using explicit set_local and get_local, so it's a small matter to produce code in this format.

If I can demonstrate a more compact encoding, a simpler and faster decoder, better or simpler unoptimized code execution, then it would seem to make a good case that 'we really missed something big back then'. I can only try.

Worst case if this proposal is not adopted: tools need to unpack/pack AST expressions as part of the encoding compression (they would want to do something similar anyway to exploit prediction); and have perhaps at most 30% worse encoding efficiency and slower decoding for code that does not fit this pattern well or does not want to bother; and the runtime decoders are more complex; and the spec is more complex.

@ghost
Author

ghost commented Apr 24, 2016

@ghost
Author

ghost commented May 4, 2016

In case it helps with the ongoing discussion, I see a potentially significant problem with this expressionless encoding, namely that the number of local variables generally increases, and this increases the storage and processing requirements of the SSA decoder. For example, the v8 SSA decoder maintains a fixed vector of the local variables for SSA analysis, and at each branch splits this, creating a copy, so that when the branches merge the differences can be noted to assign Phis. It is easy to see that as the number of local variables increases, the copying and merging costs grow linearly.
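A hedged sketch of that bookkeeping (Python, hypothetical names - this is not the v8 code): the decoder keeps one SSA value per local variable, a branch clones the whole vector, and the merge point inserts a Phi for every local whose incoming values differ, so both operations cost time linear in the number of locals.

```python
# A sketch (hypothetical names - not the v8 code) of the SSA environment
# bookkeeping: one SSA value per local variable, the full vector is
# cloned at a branch, and the merge inserts a Phi for every local whose
# incoming values differ. Both operations are O(number of locals).

def split(env):
    return list(env)  # a branch clones the whole environment

def merge(env_a, env_b, make_phi):
    # locals with differing incoming values get a Phi at the merge point
    return [a if a == b else make_phi(a, b)
            for a, b in zip(env_a, env_b)]

env = ["x0", "y0"]          # SSA values of two locals at the branch
then_env = split(env)
then_env[0] = "x1"          # the taken arm assigns the first local
merged = merge(then_env, env, lambda a, b: ("phi", a, b))
print(merged)               # only the assigned local needs a Phi
```

Every extra local added by the expressionless expansion lengthens these vectors, which is the linear cost described above.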

From this perspective, the AST encoding can be seen to move the expression intermediate values out of the SSA environment and thus reduce the amount of work for the SSA decoding. With the post-order encoding, the AST can be seen as a stack machine, but the restrictions ensure that the values on the stack can never be split or merged in the control flow.

Values on this stack at the entry to a block are not usable in the block (unless they were written to a local?) and are still there on exit, and they are discarded if there is a break out of the block. All values pushed within a block are popped on any exit from the block. These restrictions on this stack machine seem sufficient to ensure that the values on the stack need not be part of the SSA environment, would people agree?

The exit values of a block, the value for a break or the fall-through, are part of the SSA environment from what I understand. In v8 these seem to be the control values? This can be seen in the example (block $l (i32.add (br_if $l <v> <cond>) <c>)) from #681 where the block $l merges two values supplied from the AST stack. If there were more than one value then all would be part of the SSA environment.

I shall explore the cost of the extra local variables; it might be mitigated by having effective stack pointers which limit the number of active local variables that need to be considered part of the SSA environment. But it also suggests another look at the AST.

Have not got my head around it all, but could we relax some of the AST stack restrictions while still ensuring that the stack values were not part of the SSA environment? For example, could it be possible to read any of them rather than just popping a value? Any chance that an SSA expert could suggest the absolute minimum restrictions on a stack machine that would still ensure that the values on the stack were not part of the SSA environment? If some restrictions can be relaxed then could these be considered?

@ghost
Author

ghost commented May 11, 2016

  1. This design depended on the runtime decoder tracking the SSA definitions in local variables, and this encoding increases the number of local variables, which increases the overhead of this operation. After some more study it appears that the SSA conversion would be simpler without local variables at all. If it were possible to return multiple values from blocks and to loop heads, then all variables could be single-assignment, and it appears that the conversion to SSA could be done without local variables.
  2. Single-use variables are so common that an efficient encoding cannot practically ignore this case. This design proposed a predictor of a stack-like allocation pattern, but an interpreter of the literal encoding would need to track these predictions too, which might have a similar overhead to a stack machine anyway.
  3. The current wasm AST does not allow reading values up the expression stack, rather just popping arguments from the top of this stack. This restriction appears unnecessary for SSA conversion, and it appears we can allow multiple uses of definitions. There are many encoding options; one decoded form would be block-scoped single-assignment variables, and this could include the results of blocks and loop heads. But it might also be expressed as a stack of definitions with an operator to access any value on the stack.

It appears best to continue making the case for multiple-value expressions and effective block-scoped single-assignment variables. One of the objections to block-scoped local variables was that they would complicate the SSA local variable environment, and that seemed a show stopper, but it seems entirely possible to eliminate local variables while still having effective single-assignment block-scoped variables (which is what expression intermediate values are). Getting past that blocker opens a range of alternative encodings and different trade-offs that might yield a simpler and faster SSA decoder and still an efficient encoding.

@ghost ghost closed this as completed May 11, 2016
This issue was closed.