Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate Expression/Statement distinction in AST? #223

Closed
titzer opened this issue Jun 24, 2015 · 46 comments
Closed

Eliminate Expression/Statement distinction in AST? #223

titzer opened this issue Jun 24, 2015 · 46 comments

Comments

@titzer
Copy link

titzer commented Jun 24, 2015

No description provided.

@qwertie
Copy link

qwertie commented Jun 24, 2015

While I haven't worked this out in detail, I know from designing Enhanced C# (unfinished) and LES that languages in the Algol tradition (C/C++/Java etc) can be viewed as purely expression-based languages like LISP or Haskell.

As I was saying elsewhere, I think all executable code (if not more of the system) should be expression-based.

  • If void (unit, ()) is understood as a proper data type, then if (without else) and do_while can be seen as expressions that return void, and if-else can be seen as equivalent to the conditional operator ?:
  • Side effects can occur inside an expression, and the comma operator provides sequences. Everyone is already familiar with this effect in C, e.g. (x = y * y, (x += y) + 5). Discarding a value happens naturally whenever one operation (whose result you want to discard) is followed by any other operation. However if the operation whose result you want to discard is the last operation in a block, you have to add a dummy 'void' (foo(), bar(), void)
  • forever can be seen as an expression whose arguments are the statements inside the loop, and break could include a 'return value', which will be the value of an enclosing forever loop.
  • return is not needed; return values mid-function can be simulated by enclosing the function in a forever that uses break to specify the return value.

Merging statements and expressions into a single concept called 'expressions' seems to produce a strict superset of the current plan. I think it would be a simpler and more capable model than what is currently proposed, and it won't substantially alter code generation for the Algol branch of languages (or, if it turns out that it does, perhaps the language could be tweaked until it doesn't) The translation to asm.js could be more complicated, admittedly, but remember that this is a short-term problem--we won't use a polyfill forever.

(goto can be shoehorned into the "everything is an expression" model too, albeit clumsily.)

@jfbastien
Copy link
Member

Yes. It doesn't seem to provide much at the moment.

@qwertie
Copy link

qwertie commented Jun 24, 2015

I'd like to be able to write code like this:

x += switch(...) {
    case 1: break foo();
    case 2: break bar();
    case 3: break baz();
}

or even:

doStuff(12345, if (...) {
        foo()
    } else {
        y = 0;
        forever {
            y += ...
            if (...) break(y);
        }
    });

(The latter, while a bit ugly, is shorter than the typical way you'd write that code, so it agrees with the design goal of "size efficiency")

Things like this are possible in LISP and Haskell (except that Haskell doesn't allow mutable variables) and I'm sure I've seen it in at least one other (otherwise traditional) imperative language.

@MikeHolman
Copy link
Member

@qwertie I think this is a pretty cool idea, but I'm concerned about polyfill to js and added complexity to VMs (it seems like you would have to build up a flowgraph for type checking).

@jfbastien
Copy link
Member

I agree that such syntax is nice if someone is writing wasm files manually. I don't think that's a goal of WebAssembly, though.

I nonetheless think that there could be advantages to what you suggest:

  • Does it make supporting other languages easier (not C/C++)?
  • Does it potentially have size savings which the macro / generic compression approach don't have?
  • Does it make compilation / parsing slightly easier / faster?

Before we discuss this further, can you please confirm that you're a member of the W3C community group? "@qwertie" isn't quite a name I can find there! The WebAssembly effort really needs to operate under the W3C's umbrella, and significant contributions require membership.

@lukewagner
Copy link
Member

FWIW, @sunfishcode have also been talking about having all statements be expressions for the purpose of avoiding get_local/set_local ops (which take 30-40% of the bytes, at least in the prototype) by leveraging AST's concise representation of single-use expressions. It'd be good to motivate this by showing size savings (even in the presence of the macro/generic compression layers), but I'm in favor.

@qwertie
Copy link

qwertie commented Jun 24, 2015

@jfbastien Yes, I'm a member of the W3C group (real name: David Piepgrass)

IMO expression-based languages are 'the future', because even if one doesn't like them stylistically, they have other advantages. For example, languages may have higher-level constructs that are naturally lowered into "an expression-with-a-loop-inside": e.g. list comprehensions, function inlining.

@kripken
Copy link
Member

kripken commented Jun 24, 2015

@lukewagner that sounds interesting, but I don't quite see it, can you please elaborate?

@jfbastien
Copy link
Member

It sounds like we agree it would be nice to have, but only if data shows that it'll be useful. Style is great, but we're not trying to optimize for that.

@kg
Copy link
Contributor

kg commented Jun 24, 2015

The only reason I suggested it is that it reduces the tree depth of a typical AST somewhat, which could improve compression. I've yet to test this theory out in depth other than doing the change in one specific case and seeing my test files get considerably smaller. (from Block -> [ExpressionStatement, ...] to Block -> [Expression, ...])

@MikeHolman
Copy link
Member

I have given this some more thought and I realize that I way overestimated the complexity for WebAssembly VMs, and from that perspective I'm fine with this. And I can believe that there are some modest wins in AST size (even considering macro layer), so that does give some motivation.

However, I still can't imagine a good way to translate this to js. The only thing I can think of which works in the general case is outlining the flow expressions into function expressions/calls, but that is obviously too costly. I'm sure we could hoist them into separate statements in many cases (and assign the result to a new local for value producing flow expressions), but that would require the polyfill to do dependence analysis...

Unless someone has a better idea for polyfill, I think we should leave this as a possible future feature for a time when the polyfill isn't as important.

@kg
Copy link
Contributor

kg commented Jun 24, 2015

My implication here is that the distinction doesn't entirely go away, but it isn't encoded into the AST. For example, if you're in a block and find an expression, it's obviously a free-standing statement. Much like how we've been leaning towards using context to infer things about operations (we know this expression is an integral one because it's enclosed inside an int setlocal operation), we can often make robust inferences about whether something is an expression or a statement in a given context.

There are only a few cases where it really matters, though. BlockStatement -> Block and ExpressionStatement -> Expression can both be pretty profound in terms of making the AST shallower and slightly smaller to encode. Best-case, each one has a type indicator (of some kind) for the 'statement' (block, expression) and then a type indicator for the inner expression in the case of the ExpressionStatement. It's possible there are other ways to optimize out these costs, of course. This might be something that's best tackled at the encoding layer much like macros.

@creationix
Copy link

I'm all for making the AST simpler and smaller, this had advantages in several areas as noted.

Is the only real blocker finding an efficient transpile target for the polyfill?

I think with @kg's idea most expressions used as statements will be obvious and won't need wrapping in functions. I would love to see some concrete examples of the AST in question so I could mull it over.

@MikeHolman
Copy link
Member

@kg I definitely think for non-value producing expressions (e.g. set_local, or void function calls) we can remove the need for ExpressionStatement->Expression. For example, we can let void type expressions share an opcode space with statements (and still keep the semantic differences).

My concern is with what @qwertie is talking about, where we do completely remove the distinction between them.

@sunfishcode
Copy link
Member

Right now, the only way a local value can be live across a basic block boundary (ignoring control flow embedded in expressions) is to live in a local variable. Making statements be expressions would mean that some values could flow out of blocks without having to go through local variables, which would theoretically reduce the size due to get_local/set_local that @lukewagner mentioned.

An interesting twist is that we could extend this from allowing a single value to flow out of a block to allowing multiple values to flow out of a block, by allowing expressions to return multiple values (something which would have other uses besides). For sanity, we don't want to deal with tuples floating through the language in general, so we'd probably require multiple return values to be immediately unpacked and named, so there'd still be some set_local/get_local overhead, but it'd be reduced.

Either way, it would be interesting to see how much this would help compression.

One question I have though is: do we really want expressions with embedded control flow at all? The current design mentions comma and short-circuit ?: but do we even want those? Embeddable control flow makes it somewhat more complex to construct a CFG from the AST, which could complicate some baseline JIT strategies. If the compression wins aren't great, this may be worth considering.

@kg
Copy link
Contributor

kg commented Jun 25, 2015

The question of whether we want embedded control flow at all is a very important one. I'll have to think about it in depth, but I personally don't like the comma operator or short-circuit ?: (mostly on instinct). I've just been assuming that we want to be able to express them in our AST. IIRC in past discussions we have used the existence of the short-circuiting ternary operator to justify omitting some other things, since asm.js leans on it heavily.

@titzer
Copy link
Author

titzer commented Jun 25, 2015

On Thu, Jun 25, 2015 at 2:45 AM, Dan Gohman notifications@github.com
wrote:

Right now, the only way a local value can be live across a basic block
boundary (ignoring control flow embedded in expressions) is to live in a
local variable. Making statements be expressions would mean that some
values could flow out of blocks without having to go through local
variables, which would theoretically reduce the size due to
get_local/set_local that @lukewagner https://github.com/lukewagner
mentioned.

An interesting twist is that we could extend this from allowing a single
value to flow out of a block to allowing multiple values to flow out of a
block, by allowing expressions to return multiple values (something which
would have other uses besides). For sanity, we don't want to deal with
tuples floating through the language in general, so we'd probably require
multiple return values to be immediately unpacked and named, so there'd
still be some set_local/get_local overhead, but it'd be reduced.

Either way, it would be interesting to see how much this would help
compression.

One question I have though is: do we really want expressions with embedded
control flow at all? The current design mentions comma and short-circuit ?:
but do we even want those? Embeddable control flow makes it somewhat more
complex to construct a CFG from the AST, which could complicate some
baseline JIT strategies. If the compression wins aren't great, this may be
worth considering.

For reference, both ?: and comma in the v8-native-prototype decoder were an
additional 10 lines of code each, approximately.


Reply to this email directly or view it on GitHub
#223 (comment).

@sunfishcode
Copy link
Member

I guess that's true, since you form the CFG while decoding. I was thinking about forming the CFG based on an in-memory AST, but that's not the common case here for baseline JITs.

@rossberg
Copy link
Member

rossberg commented Jul 1, 2015

I strongly second eliminating statements. This would be particularly useful for translating higher-level languages into wasm in the future, which will often need to create control structure inside expression contexts.

Even though supporting HL languages is not an initial goal, the statement distinction is a fundamental design decision that cannot easily be reverted later.

As for the JS polyfill, there is a proposal for adding "do-expressions" to the next version of JavaScript (which I am a co-champion of). The idea is to allow "do { <statement-list> }" as an expression. Once that arrives (hopefully next year), translation to JS would become straightforward.

@rossberg
Copy link
Member

rossberg commented Jul 1, 2015

For the record, Ben asked me whether eliminating statements would necessitate an explicit operator for discarding values. Replicating my reply here:

I think that's not needed. Basically, you can always tell from the syntactic context where a value is discarded (off-hand: left of a comma/semicolon, at the end of a loop body, and recursively in the arm of any branch construct whose value is discarded).

Another way to spec this easily is via the type system. Essentially, some expressions produce type void (e.g. loops), some expressions require their subexpressions to have type void (e.g. the left operand of a semicolon), and there is an implicit conversion from any type to void, whose operational interpretation is discarding the value.

This nicely generalises the situation of function calls that return void being expressions already.

@wora
Copy link

wora commented Jul 2, 2015

Eventually people will compile different languages to wasm. There is hardly much benefit to keep distinction between statements and expressions even for C/C++. Such distinction will very likely become a real disadvantage for handling other languages. Now would be the best time to eliminate the distinction entirely.

@MikeHolman
Copy link
Member

I don't think the polyfill can use any new, not implemented, js features (including modules for that matter). That would remove all utility from it.

@kripken
Copy link
Member

kripken commented Jul 2, 2015

It sounds like there are a few good reasons to want to eliminate the expression/statement distinction. However, we do need a good solution for polyfilling to current JS, as @MikeHolman said, so I think it would be useful to see an experiment showing that this change would not prevent that. Modifying @lukewagner's prototype might be one way.

@MI3Guy
Copy link

MI3Guy commented Jul 3, 2015

I'm not exactly sure how relevant this is given that WebAssembly doesn't really have subtyping. One useful representation for the type of jump expressions such as break and return is a "bottom" type. (A type that is a subtype of all other types and has no instances.) This would also be useful when exceptions are implemented.

@lukewagner
Copy link
Member

@kripken If we assume that the polyfill can inject temporary locals (something it already does to handle one unfortunate corner case with mixed f32/f64 typed array stores), I don't see a fundamental problem with this. The one challenge is keeping everything single-pass: if you've started emitting an expression tree and encounter a loop, it's too late. For this, the separate issue I just mentioned, and a few other twiddly details that I current cheat on in the polyfill prototype, I'm thinking we'll really end up wanting a polyfill-optimized specific layer (non-standard, so we can do whatever for short-term polyfill considerations, and then change before/if standardizing). Also, while temporary variables could in theory result in much worse code, the specific layer could be designed (in cooperation with the wasm-compiler) to effectively never need them.

@kripken
Copy link
Member

kripken commented Jul 4, 2015

Sounds plausible, but I think the real question is how much overhead that would add, compared to the much more straightforward situation we have investigated so far. That's why I'd be curious to see experiments here.

@titzer
Copy link
Author

titzer commented Oct 22, 2015

I was initially against statements == expressions due to implementation concerns, but in the intervening time have implemented blocks, ifs, loops, break, and continue in terms of expressions, finding the implementation burden to be NBD. There are nice advantages to having statements == expressions, like getting rid of return and comma and conditional, as well as allowing function bodies to be a single expression. So I support statements == expressions.

@qwertie
Copy link

qwertie commented Oct 23, 2015

A lot of work has been done since this issue was opened. The current spec doesn't have a statement-expr distinction, does it? Has a plan for polyfilling been established? And is it fair to say there is consensus on the issue?

@lukewagner
Copy link
Member

I also support stmts=exprs. There are a couple of viable polyfill strategies. If we integrate the polyfill with the specific layer as suggested above, then the specific layer binary stream could contain a marker in front of expression trees that had problematic stmt nesting (from an asm.js pov) so that the polyfill could conservatively emit that expression tree in three address code.

@rossberg
Copy link
Member

I'm also in favour (surprise!). I can see several ways of polyfilling it, starting from dumb solutions like creating auxiliary functions to ones introducing minimal numbers of auxiliary variables at the statement/expression boundary.

@kripken
Copy link
Member

kripken commented Oct 23, 2015

There are nice advantages to having statements == expressions, like getting rid of return

This would remove the need for a final return at the end of a function, but I don't see how it can avoid needing return in the middle if we want an early exit?

@titzer
Copy link
Author

titzer commented Oct 23, 2015

Early returns can be done by making the body of the function a block and
using a break-with-value from that block.

On Fri, Oct 23, 2015 at 2:35 PM, Alon Zakai notifications@github.com
wrote:

There are nice advantages to having statements == expressions, like
getting rid of return

This would remove the need for a final return at the end of a function,
but I don't see how it can avoid needing return in the middle if we want
an early exit?


Reply to this email directly or view it on GitHub
#223 (comment).

@rossberg
Copy link
Member

@kripken, and what titzer describes is how the spec desugars return already.

@kripken
Copy link
Member

kripken commented Oct 24, 2015

A break could achieve the same result, I agree. But then a concern is that the break would need to specify the nesting level. That adds some entropy that I would guess compresses more poorly. I don't see a way to do an early return that doesn't have downsides compared to return.

Overall I don't think we have clarity on whether this improves or degrades code size? I think we've seen arguments both ways: having statements == expressions lets you return a value in some cases "for free", avoiding a temp var, but for the most part statements like while etc. will not return a value anyhow, so nothing is saved. And there are some common cases like early returns where this seems detrimental. I also worry there might be cases where a value must be manually discarded, e.g.

void foo() {
  bar()
}

where bar returns a value, used in other places calling it, but that we want to ignore here.

Overall, my current understanding is that

  1. This would be nicer. Having everything be an expression is more elegant!
  2. It's not clear yet if this helps or hurts code size.
  3. There were arguments above that this would help port languages in which statements == expressions, but I'm not sure I follow that, since e.g. LLVM IR isn't like that, but pretty much everything can compile to it these days. On the other hand, having statements == expressions might help making compilers for those languages somewhat simpler, but that is a far weaker benefit.
  4. This complicates the JS polyfill, but does not make it impossible.

Seems like a mix, so far?

@titzer
Copy link
Author

titzer commented Oct 24, 2015

On Fri, Oct 23, 2015 at 5:20 PM, Alon Zakai notifications@github.com
wrote:

A break could achieve the same result, I agree. But then a concern is that
the break would need to specify the nesting level. That adds some entropy
that I would guess compresses more poorly. I don't see a way to do an early
return that doesn't have downsides compared to return.

Breaks already specify their nesting level. It's easy to add a level-0
break as a separate opcode to save a byte.

Overall I don't think we have clarity on whether this improves or degrades
code size? I think we've seen arguments both ways: having statements ==
expressions lets you return a value in some cases "for free", avoiding a
temp var, but for the most part statements like while etc. will not
return a value anyhow, so nothing is saved. And there are some common cases
like early returns where this seems detrimental. I also worry there might
be cases where a value must be manually discarded, e.g.

We're not at the point where we're squeezing bytes, but it's easy to
imagine some one-byte breaks like break0, break1, break2, when combined
with an implicit block for function bodies, is equivalent.

void foo() {
bar()
}

where bar returns a value, used in other places calling it, but that we
want to ignore here.

Void can accept and discard any value, so this is NBD.

Overall, my current understanding is that

  1. This would be nicer. Having everything be an expression is more
    elegant!

That's a plus.

  1. It's not clear yet if this helps or hurts code size.

That's a neutral.

  1. There were arguments above that this would help port languages in
    which statements == expressions, but I'm not sure I follow that, since e.g.
    LLVM IR isn't like that, but pretty much everything can compile to it these
    days. On the other hand, having statements == expressions might help making
    compilers for those languages somewhat simpler, but that is a far weaker
    benefit.

That's a plus for some languages, and a zero for LLVM. So plus.

  1. This complicates the JS polyfill, but does not make it impossible.

That's a minus, but the magnitude is unclear. It could be that other
things complicate the polyfill more.

Seems like a mix, so far?

I think we're comfortably on the positive side here, which is where the
consensus is coming from.


Reply to this email directly or view it on GitHub
#223 (comment).

@kripken
Copy link
Member

kripken commented Oct 24, 2015

I'm mostly fine with this, except that "we don't know" != "neutral". Optimally we could take the time to investigate this carefully.

In any case, I do think that this - and possibly other changes, as you said, but this seems the biggest to me - makes the polyfill kind of pointless, in the sense that it's going to be less efficient to (a) make 1 wasm build and convert that to JS on the client when necessary, vs. (b) make 1 wasm build and 1 JS build. I suspect people will end up doing (b). Does that worry anyone?

@MikeHolman
Copy link
Member

It does worry me. I don't really see any benefit other than "it feels good", and the potential impact on the polyfill is large. The polyfill is not a hugely important scenario for me, but I think it is useful for the health of the project. And it seems like we would be obliterating it in order to add a superfluous feature.

@rossberg
Copy link
Member

I assume we actually do want to keep return as a sugar-level operator (that doesn't carry an explicit depth). So that specific concern should not be an issue.

I don't agree that this makes the polyfill pointless. I also don't think that it makes it less efficient. It does make it slightly more interesting, but not in infeasible ways.

@rossberg
Copy link
Member

@kripken:

  1. There were arguments above that this would help port languages in which statements == expressions, but I'm not sure I follow that, since e.g. LLVM IR isn't like that, but pretty much everything can compile to it these days. On the other hand, having statements == expressions might help making compilers for those languages somewhat simpler, but that is a far weaker benefit.

Don't assume that all compilers will want to go through a complex pipeline like LLVM. In fact, I'm pretty sure the vast majority won't. Somewhere else I was pointing to the long list of compilers to JS -- if only a fraction of those ever get ported to Wasm (and I dearly hope so!) then we already have plenty of compilers that will benefit from expressions, some significantly.

@titzer
Copy link
Author

titzer commented Oct 24, 2015

We will try to answer the polyfill question definitively in the next two months, as @kg will be working on it.

@kripken
Copy link
Member

kripken commented Oct 24, 2015

@rossberg-chromium: yes, I completely agree, making it easier for all those languages is a very good thing. I worry about compromises towards that goal, but probably too much ;)

Regarding polyfill efficiency: If the polyfill needs to generate out of line functions as @rossberg-chromium suggested, or perform a more complex analysis to place auxiliary variables, or use three address code based on markers as @lukewagner suggested, then all of those can work, but they will necessarily reduce efficiency of either polyfill time or throughput of the generated code. Possibly substantially, in large codebases. For example, if there is a significant increase in the number of variables in a large function (e.g. three address code temporaries), then experience has shown that that can make a massive impact on performance. Emscripten works extremely hard to minimize the total number of variables, spending much of its compile time on it, because we know it is worth it. In fact, in large-enough projects, often unoptimized or partially optimized builds (having the # of variables not fully reduced) are completely unusable. The complexity of such variable reduction is similar to that of register allocation; the problems are closely related.

In small codebases, this won't matter. But in large ones, I strongly suspect it will.

I'm not saying this is in itself a big enough issue that we should consider avoiding statements == expressions. I just want to point out that we may be moving to a model where we can't polyfill well enough on the client, compared to people making two builds. Two builds will just be better: no polyfill time, fully optimized code.

I hope this doesn't sound too pessimistic. I'm glad to hear of plans to work on the polyfill, and would like to help out.

@kg
Copy link
Contributor

kg commented Oct 24, 2015

Haven't had time to catch up on this thread, but I wanted to point out that the entropy concerns re: break vs return are pretty minor. If it turns out to be an issue we can address it, but I really don't think it will (especially since we can now omit many returns entirely)

@qwertie
Copy link

qwertie commented Oct 24, 2015

If polyfilling is slow in certain situations (like when there is a loop inside an "if" condition), perhaps the code could be restructured in advance with a Wasm=>Wasm transform, rather than asking the polyfill do it. Presumably this will increase the code size slightly and decrease native Wasm performance slightly. Developers can stop using the restructuring option once most browsers have native Wasm.

@kripken
Copy link
Member

kripken commented Oct 24, 2015

It occurred to me that there are alternatives to "single build of wasm + polyfill to JS" and "2 builds, wasm and JS". For example, a single build that is not wasm, but something else that is easy to convert on the client to either wasm or asm.js. For example, the single build could be @lukewagner's asm.js polyfill prototype, or it could just be asm.js itself. This is practical because while we are making wasm => JS more complex, asm.js => wasm remains trivial.

Clearly this is not optimal, and has a bunch of obvious downsides. But it seems worth exploring, even if just as a fallback plan. And actually such a converter could have other benefits, like offline upgrading of legacy asm.js builds to wasm. I'll investigate.

@qwertie: Interesting idea!

@lukewagner
Copy link
Member

The reasoning that convinces me in favor of stmts=exprs is that comma and conditional (which we know we want and are useful for building fatter exprs) are just block and conditional and so it seems pretty arbitrary to have expr versions of just these two but not the rest with the only reason being "b/c JS". If stmts=exprs made the polyfill untenable, then it would be a different story, but that seems far from the case.

@sunfishcode
Copy link
Member

This has been implemented for some time now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests