Please Support Arbitrary Labels and Gotos. #796
Comments
oridb changed the title from "Please Support Labels and Gotos." to "Please Support Arbitrary Labels and Gotos." on Sep 8, 2016
ghost commented Sep 8, 2016
@oridb Wasm is somewhat optimized for the consumer to be able to quickly convert to SSA form, and the structure does help here for common code patterns, so the structure is not necessarily a burden for the consumer. I disagree with your assertion that 'both sides of the code generation work around the format specified'. Wasm is very much about a slim and fast consumer, and if you have some proposals to make it slimmer and faster then that might be constructive.

Blocks that can be ordered into a DAG can be expressed in the wasm blocks and branches, such as your example. The switch-loop is the style used when necessary, and perhaps consumers might do some jump threading to help here. Perhaps have a look at binaryen which might do much of the work for your compiler backend.

There have been other requests for more general CFG support, and some other approaches using loops mentioned, but perhaps the focus is elsewhere at present. I don't think there are any plans to support 'continuation passing style' explicitly in the encoding, but there has been mention of blocks and loops popping arguments (just like a lambda) and supporting multiple values (multiple lambda arguments) and adding a …
oridb commented Sep 8, 2016
I'm not seeing any common code patterns that are easier to represent in terms of branches to arbitrary labels, vs the restricted loops and blocks subset that web assembly enforces. I could see a minor benefit if there was an attempt to make the code closely resemble the input code for certain classes of language, but that doesn't seem to be a goal -- and the constructs are a bit bare if they were there for …
Yes, they can be. However, I'd strongly prefer not to add extra work to determine which ones can be represented this way, versus which ones need extra work. Realistically, I'd skip doing the extra analysis, and always just generate the switch loop form. Again, my argument isn't that loops and blocks make things impossible; It's that everything they can do is simpler and easier for a machine to write with goto, goto_if, and arbitrary, unstructured labels.
I already have a serviceable backend that I'm fairly happy with, and plans to fully bootstrap the entire compiler in my own language. I'd rather not add in a rather large extra dependency simply to work around the enforced use of loops/blocks. If I simply use switch loops, emitting the code is pretty trivial. If I try to actually use the features present in web assembly effectively, instead of doing my damndest to pretend they don't exist, it becomes a good deal more unpleasant.
I'm still not convinced that loops have any benefits -- anything that can be represented with a loop can be represented with a goto and label, and there are fast and well-known conversions to SSA from flat instruction lists. As far as CPS goes, I don't think that there needs to be explicit support -- it's popular in FP circles because it's fairly easy to convert to assembly directly, and gives similar benefits to SSA in terms of reasoning (http://mlton.org/pipermail/mlton/2003-January/023054.html); Again, I'm not an expert on it, but from what I remember, the invocation continuation gets lowered to a label, a few movs, and a goto.
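(For concreteness, here is a small made-up C sketch of the "switch loop form" mentioned above: every basic block becomes a case of a switch inside a loop, and every goto becomes an assignment to a dispatch variable followed by a break back to the switch. The function and block names are invented; the point is only that a producer can always emit this shape, at the cost of an extra dispatch on every edge.)

/* Sketch: encoding an arbitrary CFG with a dispatch loop.
   Each basic block gets an index; "goto L" becomes "next = L; break;". */
#include <stdio.h>

int collatz_steps(int n) {
    enum { ENTRY, LOOP, ODD, DONE } next = ENTRY;
    int steps = 0;
    for (;;) {
        switch (next) {
        case ENTRY:
            next = (n <= 1) ? DONE : LOOP;
            break;
        case LOOP:                      /* originally: loop: if (n == 1) goto done; */
            if (n == 1) { next = DONE; break; }
            if (n % 2 == 0) { n /= 2; steps++; next = LOOP; }
            else            { next = ODD; }
            break;
        case ODD:                       /* originally: odd: n = 3*n + 1; goto loop; */
            n = 3 * n + 1; steps++;
            next = LOOP;
            break;
        case DONE:
            return steps;
        }
    }
}

int main(void) {
    printf("%d\n", collatz_steps(27)); /* number of Collatz steps for 27 */
    return 0;
}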
ghost commented Sep 8, 2016
It would be interesting to know how they compare with wasm SSA decoders -- that is the important question. Wasm makes use of a values stack at present, and some of the benefits of that would be gone without the structure; it would hurt decoder performance. Without the values stack the SSA decoding would have more work too; I've tried a register-based encoding and decoding was slower (not sure how significant that is). Would you keep the values stack, or use a register-based design? If keeping the values stack then perhaps it becomes a CIL clone, and perhaps wasm performance could be compared to CIL -- has anyone actually checked this?
oridb commented Sep 8, 2016
I don't actually have any strong feelings on that end. I'd imagine compactness of the encoding would be one of the biggest concerns; A register design may not fare that well there -- or it may turn out to compress fantastically over gzip. I don't actually know off the top of my head. Performance is another concern, although I suspect that it might be less important given the ability to cache binary output, plus the fact that download time may outweigh the decoding by orders of magnitude.
If you're decoding to SSA, that implies that you'd also be doing a reasonable amount of optimization. I'd be curious to benchmark how significant decoding performance is in the first place. But, yes, that's definitely a good question.
titzer commented Sep 8, 2016
Thanks for your questions and concerns. It's worth noting that many of the designers and implementors of …

The design of WebAssembly's control flow constructs simplifies consumers by enabling fast, simple verification, easy, one pass conversion to SSA form …

We've had a lot of internal discussion between members about this very …
qwertie commented Sep 8, 2016
Thanks @titzer, I was developing a suspicion that Wasm's structure had a purpose beyond just similarity to asm.js. I wonder though: Java bytecode (and CIL) don't model CFGs or the value stack directly, they have to be inferred by the JIT. But in Wasm (especially if block signatures are added) the JIT can easily figure out what's going on with the value stack and control flow, so I wonder, if CFGs (or irreducible control flow specifically) were modeled explicitly like loops and blocks are, might that avoid most of the nasty corner cases you're thinking of? There's this neat optimization that interpreters use that relies on irreducible control flow to improve branch prediction...
kripken commented Sep 8, 2016
I agree that gotos are very useful for many compilers. That's why tools like Binaryen let you generate arbitrary CFGs with gotos, and they can convert that very quickly and efficiently into WebAssembly for you.

It might help to think of WebAssembly as a thing optimized for browsers to consume (as @titzer pointed out). Most compilers should probably not generate WebAssembly directly, but rather use a tool like Binaryen, so that they can emit gotos, get a bunch of optimizations for free, and don't need to think about low-level binary format details of WebAssembly (instead you emit an IR using a simple API).

Regarding polyfilling with the while-switch pattern you mention: in emscripten we started out that way before we developed the "relooper" method of recreating loops. The while-switch pattern is around 4x slower on average (but in some cases significantly less or more, e.g. small loops are more sensitive). I agree with you that in theory jump-threading optimizations could speed that up, but performance will be less predictable as some VMs will do it better than others. It is also significantly larger in terms of code size.
oridb commented Sep 8, 2016
I'm still not convinced that this aspect is going to matter that much - again, I suspect the cost of fetching the bytecode would dominate the delay the user sees, with the second biggest cost being the optimizations done, and not the parsing and validation. I'm also assuming/hoping that the bytecode would be tossed out, and the compiled output is what would be cached, making the compilation effectively a one-time cost.

But if you were optimizing for web browser consumption, why not simply define web assembly as SSA, which seems to me both more in line with what I'd expect, and less effort to 'convert' to SSA?
This comment has been minimized.
You can start to parse and compile while downloading, and some VMs might not do a full compile up front (they might just use a simple baseline for example). So download and compile times can be smaller than expected, and as a result parsing and validation can end up a significant factor in the total delay the user sees. Regarding SSA representations, they tend to have large code sizes. SSA is great for optimizing code, but not for serializing code compactly.
ghost commented Sep 9, 2016
@oridb See the comment by @titzer 'The design of WebAssembly's control flow constructs simplifies consumers by enabling fast, simple verification, easy, one pass conversion to SSA form ...' - it can generate verified SSA in one pass. Even if wasm used SSA for the encoding it would still have the burden of verifying it, of computing the dominator structure, which is easy with the wasm control flow restrictions.

Much of the encoding efficiency of wasm appears to come from being optimized for the common code pattern in which definitions have a single use that is used in stack order. I expect that an SSA encoding could do so too, so it could be of similar encoding efficiency. Operators such as …

I think wasm is not too far from being able to encode most code in SSA style. If definitions were passed up the scope tree as basic block outputs then it might be complete. Might the SSA encoding be orthogonal to the CFG matter? E.g. there could be an SSA encoding with the wasm CFG restrictions, or a register-based VM with the CFG restrictions.

A goal for wasm is to move the optimization burden out of the runtime consumer. There is strong resistance to adding complexity in the runtime compiler, as it increases the attack surface. So much of the design challenge is to ask what can be done to simplify the runtime compiler without hurting performance, and much debate!
comex commented Dec 7, 2016
Well, it's probably too late now, but I'd like to question the idea that the relooper algorithm, or variants thereof, can produce "good enough" results in all cases. They clearly can in most cases, since most source code doesn't contain irreducible control flow to start with, optimizations don't usually make things too hairy, and if they do, e.g. as part of merging duplicate blocks, they can probably be taught not to. But what about pathological cases? For example, what if you have a coroutine which a compiler has transformed to a regular function with structure like this pseudo-C:

void transformed_coroutine(struct autogenerated_context_struct *ctx) {
int arg1, arg2; // function args
int var1, var2, var3, …; // all vars used by the function
switch (ctx->current_label) { // restore state
case 0:
// initial state, load function args caller supplied and proceed to start
arg1 = ctx->arg1;
arg2 = ctx->arg2;
break;
case 1:
// restore all vars which are live at label 1, then jump there
var2 = ctx->var2;
var3 = ctx->var3;
goto resume_1;
[more cases…]
}
[main body goes here...]
[somewhere deep in nested control flow:]
// originally a yield/await/etc.
ctx->var2 = var2;
ctx->var3 = var3;
ctx->current_label = 1;
return;
resume_1:
// continue on
}

So you have mostly normal control flow, but with some gotos pointed at the middle of it. This is roughly how LLVM coroutines work.

I don't think there's any nice way to reloop something like that, if the 'normal' control flow is complex enough. (Could be wrong.) Either you duplicate massive parts of the function, potentially needing a separate copy for every yield point, or you turn the whole thing into a giant switch, which according to @kripken is 4x slower than relooper on typical code (which itself is probably somewhat slower than not needing relooper at all). The VM could reduce the overhead of a giant switch with jump threading optimizations, but surely it's more expensive for the VM to perform those optimizations, essentially guessing how the code reduces to gotos, than to just accept explicit gotos. As @kripken says, it's also less predictable.

Maybe doing that kind of transformation is a bad idea to start with, since afterward nothing dominates anything so SSA-based optimizations can't do much… maybe it's better done at the assembly level, maybe wasm should eventually get native coroutine support instead? But the compiler can perform most optimizations before doing the transformation, and it seems that at least the designers of LLVM coroutines didn't see an urgent need to delay the transformation until code generation. On the other hand, since there's a fair amount of variety in the exact semantics people want from coroutines (e.g. duplication of suspended coroutines, ability to inspect 'stack frames' for GC), when it comes to designing a portable bytecode (rather than a compiler), it's more flexible to properly support already-transformed code than to have the VM do the transformation.

Anyway, coroutines are just one example. Another example I can think of is implementing a VM-within-a-VM. While a more common feature of JITs is side exits, which don't require goto, there are situations that call for side entries - again, requiring goto into the middle of loops and such. Another would be optimized interpreters: not that interpreters targeting wasm can really match those targeting native code, which at minimum can improve performance with computed gotos, and can dip into assembly for more… but part of the motivation for computed gotos is to better leverage the branch predictor by giving each case its own jump instruction, so you might be able to replicate some of the effect by having a separate switch after each opcode handler, where the cases would all just be gotos. Or at least have an if or two to check for specific instructions that commonly come after the current one. There are some special cases of that pattern that might be representable with structured control flow, but not the general case. And so on…

Surely there's some way to allow arbitrary control flow without making the VM do a lot of work. Straw man idea, might be broken: you could have a scheme where jumps to child scopes are allowed, but only if the number of scopes you have to enter is less than a limit defined by the target block. The limit would default to 0 (no jumps from parent scopes), which preserves the current semantics, and a block's limit can't be greater than the parent block's limit + 1 (easy to check). And the VM would change its dominance heuristic from "X dominates Y if it is a parent of Y" to "X dominates Y if it is a parent of Y with distance greater than Y's child jump limit".
(This is a conservative approximation, not guaranteed to represent the exact dominator set, but the same is true for the existing heuristic - it's possible for an inner block to dominate the bottom half of an outer one.) Since only code with irreducible control flow would need to specify a limit, it wouldn't increase code size in the common case. Edit: Interestingly, that would basically make the block structure into a representation of the dominance tree. I guess it would be much simpler to express that directly: a tree of basic blocks, where a block is allowed to jump to a sibling, ancestor, or immediate child block, but not to a further descendant. I'm not sure how that best maps onto the existing scope structure, where a "block" can consist of multiple basic blocks with sub-loops in between. |
ghost commented Dec 14, 2016
FWIW: Wasm has a particular design, which is explained in just a few very significant words: "except that the nesting restriction makes it impossible to branch into the middle of a loop from outside the loop". If it were just a DAG then validation could just check that branches were forward, but with loops this would allow branching into the middle of the loop from outside the loop, hence the nested block design.

The CFG is only part of this design, the other being data flow; there is a stack of values, and blocks can also be organized to unwind the values stack, which can very usefully communicate the live range to the consumer and saves work converting to SSA.

It is possible to extend wasm to be an SSA encoding (add …). If this were extended to handle arbitrary CFGs then it might look like the following. This is an SSA-style encoding, so values are constants. It seems to still fit the stack style to a large extent, just not certain of all the details. So within …

But would web browsers ever handle this efficiently internally? Would someone with a stack machine background recognize the code pattern and be able to match it to a stack encoding?
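(To make the nesting restriction concrete, here is a tiny made-up C function whose control flow is irreducible: the cycle can be entered either at its test or at the advance label, so it has two entry points and cannot be expressed with block/loop/br alone without duplicating code or adding a dispatch variable.)

/* Made-up example of branching into the middle of a loop.
   Caller must pass a non-empty string when skip_first is set. */
int scan(const char *s, int skip_first) {
    int count = 0;
    if (skip_first) goto advance;   /* second entry point, into the loop body */
    for (;;) {
        if (*s == '\0') return count;
        count++;
advance:
        s++;
    }
}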
ghost commented Dec 14, 2016
There is some interesting discussion on irreducible loops here: http://bboissin.appspot.com/static/upload/bboissin-thesis-2010-09-22.pdf

I did not follow it all on a quick pass, but it mentions converting irreducible loops to reducible loops by adding an entry node. For wasm it sounds like adding a defined input to loops that is specifically for dispatching within the loop, similar to the current solution but with a defined variable for this. The above mentions this is virtualized, optimized away, in processing. Perhaps something like this could be an option?

If this is on the horizon, and given that producers already need to use a similar technique but with a local variable, then might it be worth considering now, so that wasm produced early has the potential to run faster on more advanced runtimes? This might also create an incentive for competition between the runtimes to explore this. This would not exactly be arbitrary labels and gotos, but something that these might be transformed into that has some chance of being efficiently compiled in the future.
flagxor added the control flow label Feb 3, 2017
flagxor added this to the Future Features milestone Feb 3, 2017
darkuranium commented Apr 13, 2017
For the record, I am strongly with @oridb and @comex on this issue.

Given the nature of WebAssembly, any mistakes you make now are likely to stick for decades to come (look at Javascript!). That's why the issue is so critical; avoid supporting gotos now for whatever reason (e.g. to ease optimization, which is --- quite frankly --- a specific implementation's influence over a generic thing, and honestly, I think it's lazy), and you'll end up with problems in the long run.

I can already see future (or current, but in the future) WebAssembly implementations trying to special-case recognize the usual while/switch patterns to implement labels in order to handle them properly. That's a hack. WebAssembly is a clean slate, so now is the time to avoid dirty hacks (or rather, the requirements for them).
This comment has been minimized.
WebAssembly as currently specified is already shipping in browsers and toolchains, and developers have already created code which takes the form laid out in that design. We therefore cannot change the design in a breaking manner. We can, however, add to the design in a backward-compatible manner. I don't think any of those involved think …

At this point in time, someone with motivation needs to come up with a proposal which makes sense and implement it. I don't see such a proposal being rejected if it provides solid data.

So I'll call your bluff: I think having the motivation you show, and not coming up with a proposal and implementation as I detail above, is quite frankly lazy. I'm being cheeky of course. Consider that we've got folks banging on our doors for threads, GC, SIMD, etc—all making passionate and sensible arguments for why their feature is most important—it would be great if you could help us tackle one of these issues. There are folks doing so for the other features I mention. None for goto. Otherwise I think …
cheery commented Apr 13, 2017
Hi. I am in the middle of writing a translation from webassembly to IR and back to webassembly, and I've had a discussion about this subject with people. It has been pointed out to me that irreducible control flow is tricky to represent in webassembly. It can prove to be troublesome for optimizing compilers that occasionally write out irreducible control flows. This might be something like the loop below, which has multiple entry points:

EBB compilers would produce the following:

Next we get to translating this to webassembly. The problem is that although we have had decompilers figured out for ages, they always had the option of adding a goto for irreducible flows. Before it gets translated, the compiler is going to do tricks on this. But eventually you get to scan through the code and position the beginnings and endings of the structures. You end up with the following candidates after you eliminate the fall-through jumps:

Next you need to build a stack out of these. Which one goes to the bottom? It is either the 'inside loop' or the 'loop'. We can't do this, so we have to cut the stack and copy things around:

Now we can translate this to webassembly. Pardon me, I'm not yet familiar with how these loops construct out. This is not a particular problem if we think about old software. It is likely that new software is translated to web assembly. But the problem is in how our compilers work. They have been doing control flow with basic blocks for decades and assume everything goes. Technically the language is translated in, then translated out. We only need a mechanism that allows the values to flow across the boundaries neatly, without the drama. The structured flow is only useful for people intending to read the code. But, for example, the following would work just as well:

The numbers would be implicit, that is... when the compiler sees a 'label', it knows that it starts a new extended block and gives it a new index number, starting to increment from 0. To produce a static stack, you could track how many items are in the stack when you encounter a jump into the label. If there ends up being an inconsistent stack after a jump into the label, the program is invalid. If you find the above bad, you can also try adding an explicit stack length to each label (perhaps as a delta from the last indexed label's stack size, if the absolute value is bad for compression), and a marker on each jump about how many values it copies in from the top of the stack during the jump. I could bet that you can't outsmart gzip in any way by how you represent the control flow, so you could choose the flow that's nice for the guys that have the hardest work here. (I can illustrate with my flexible compiler toolchain for the 'outsmarting the gzip' thing if you like, just send me a message and let's put up a demo!)
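(A rough C sketch of the single-pass consistency check described above, with an invented instruction set: the validator records the stack depth expected at each label the first time that label is mentioned, by a jump or by the label itself, and rejects the program if any later mention disagrees. Conditional jumps and unreachable-code handling are omitted.)

#include <string.h>

enum op { PUSH, DROP, LABEL, JUMP, END };
struct insn { enum op op; int target; };   /* target: label index, used by JUMP */

#define MAX_LABELS 64

/* Returns 1 if every jump reaches its label with a consistent stack depth. */
int validate(const struct insn *code) {
    int depth_at[MAX_LABELS];
    memset(depth_at, -1, sizeof depth_at);         /* -1 = depth not yet known */
    int depth = 0, next_label = 0;

    for (int pc = 0; code[pc].op != END; pc++) {
        switch (code[pc].op) {
        case PUSH: depth++; break;
        case DROP: if (depth == 0) return 0; depth--; break;
        case LABEL: {
            if (next_label >= MAX_LABELS) return 0;
            int l = next_label++;                  /* labels are numbered implicitly */
            if (depth_at[l] == -1) depth_at[l] = depth;
            else if (depth_at[l] != depth) return 0;   /* fall-through depth disagrees */
            break;
        }
        case JUMP: {
            int l = code[pc].target;
            if (l < 0 || l >= MAX_LABELS) return 0;
            if (depth_at[l] == -1) depth_at[l] = depth;    /* forward jump: record */
            else if (depth_at[l] != depth) return 0;       /* inconsistent stack depth */
            break;
        }
        default: break;
        }
    }
    return 1;
}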
cheery commented Apr 14, 2017
I feel like a shatterhead right now. I just re-read the WebAssembly spec and picked up that irreducible control flow is intentionally left out of the MVP, perhaps because emscripten had to solve the problem in its early days. The solution for how to handle irreducible control flow in WebAssembly is explained in the paper "Emscripten: An LLVM-to-JavaScript Compiler". The relooper reorganizes the program something like this:

The rationale was that structured control flow helps with reading the source code dump, and I guess it is believed to help the polyfill implementations. The people compiling from webassembly will probably adapt to handle and separate the collapsed control flow.
comex commented Apr 16, 2017
So:
qwertie commented Apr 16, 2017
It would be really nice if it were possible to jump into a loop though, wouldn't it? IIUC, if that case were accounted for then the nasty loop+br_table combo would never be needed... Edit: oh, you can make a loop without …
comex commented Apr 16, 2017
@qwertie If a given loop is not a natural loop, the wasm-targeting compiler should express it using …
lukewagner commented
Not quite: at least in SM, the IR graph is not a fully general graph; we assume certain graph invariants that follow from being generated from a structured source (JS or wasm) and often simplify and/or optimize the algorithms. Supporting a fully general CFG would either require auditing/changing many of the passes in the pipeline to not assume these invariants (either by generalizing them or pessimizing them in case of irreducibility) or node-splitting duplication up front to make the graph reducible. This is certainly doable, of course, but it's not true that this is simply a matter of wasm being an artificial bottleneck. Also, the fact that there are many options and different engines will do different things suggests that having the producer deal with irreducibility up front will produce somewhat more predictable performance in the presence of irreducible control flow.

When we've discussed backwards-compatible paths for extending wasm with arbitrary goto support in the past, one big question is what's the use case here: is it "make producers simpler by not having to run a relooper-type algorithm" or is it "allow more efficient codegen for actually-irreducible control flow"? If it's just the former, then I think we probably would want some scheme of embedding arbitrary labels/gotos (that is both backwards compatible and also composes with future block-structured try/catch); it's just a matter of weighing cost/benefit and the issues mentioned above.

But for the latter use case, one thing we've observed is that, while you do every now and then see a Duff's device case in the wild (which isn't actually an efficient way to unroll a loop...), often where you see irreducibility pop up where performance matters is interpreter loops. Interpreter loops also benefit from indirect threading which needs computed goto. Also, even in beefy offline compilers, interpreter loops tend to get the worst register allocation. Since interpreter loop performance can be pretty important, one question is whether what we really need is a control flow primitive that allows the engine to perform indirect threading and do decent regalloc. (This is an open question to me.)
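(As a concrete illustration of the threaded-dispatch technique referred to above, here is a minimal interpreter sketch using the GNU C computed-goto extension; the bytecode and handlers are invented. The relevant property is that each handler ends in its own indirect jump, which is exactly what current wasm control flow cannot express directly.)

/* Minimal token-threaded interpreter using GNU C labels-as-values. */
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

long run(const unsigned char *pc) {
    static const void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
    long acc = 0;

    /* Each handler re-dispatches itself, so the branch predictor sees
       one indirect jump site per opcode rather than one shared switch. */
    goto *dispatch[*pc++];

op_inc:  acc++; goto *dispatch[*pc++];
op_dec:  acc--; goto *dispatch[*pc++];
op_halt: return acc;
}

int main(void) {
    unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("%ld\n", run(prog));   /* prints 1 */
    return 0;
}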
comex commented Apr 17, 2017
@lukewagner
For me it's the latter; my proposal expects producers to still run a relooper-type algorithm to save the backend the work of identifying dominators and natural loops, falling back to …

I really should gather more data on how common irreducible control flow is in practice… However, my belief is that penalizing such flow is essentially arbitrary and unnecessary. In most cases, the effect on overall program runtime should be small. However, if a hotspot happens to include irreducible control flow, there will be a severe penalty; in the future, WebAssembly optimization guides might include this as a common gotcha, and explain how to identify and avoid it. If my belief is correct, this is an entirely unnecessary form of cognitive overhead for programmers. And even when the overhead is small, WebAssembly already has enough overhead compared to native code that it should seek to avoid any extra. I'm open to persuasion that my belief is incorrect.
That sounds interesting, but I think it would be better to start with a more general-purpose primitive. After all, a primitive tailored for interpreters would still require backends to deal with irreducible control flow; if you're going to bite that bullet, may as well support the general case too. Alternately, my proposal might already serve as a decent primitive for interpreters. If you combine …
lukewagner commented
@comex I guess one could simply turn off whole optimization passes at the function level in the presence of irreducible control flow (although SSA generation, regalloc, and probably a few others would be needed and thus require work), but I was assuming we wanted to actually generate quality code for functions with irreducible control flow, and that involves auditing each algorithm that previously assumed a structured graph.
This comment has been minimized.
> The nested loop structure, the thing that reducibility guarantees, is pretty much thrown away at the start. [...] I checked the current WebAssembly implementations in JavaScriptCore, V8, and SpiderMonkey, and they all seem to follow this pattern.

> Not quite: at least in SM, the IR graph is not a fully general graph; we assume certain graph invariants that follow from being generated from a structured source (JS or wasm) and often simplify and/or optimize the algorithms.

Same in V8. It is actually one of my major gripes with SSA in both respective literature and implementations that they almost never define what constitutes a "well-formed" CFG, but tend to implicitly assume various undocumented constraints anyways, usually ensured by construction by the language frontend. I bet that many/most optimisations in existing compilers would not be able to deal with truly arbitrary CFGs.

As @lukewagner says, the main use case for irreducible control probably is "threaded code" for optimised interpreters. Hard to say how relevant those are for the Wasm domain, and whether its absence actually is the biggest bottleneck.

Having discussed irreducible control flow with a number of people researching compiler IRs, the "cleanest" solution probably would be to add the notion of mutually recursive blocks. That would happen to fit Wasm's control structure quite well.
stoklund commented Apr 19, 2017
Loop optimizations in LLVM will generally ignore irreducible control flow and not attempt to optimize it. The loop analysis they're based on will only recognize natural loops, so you just have to be aware that there can be CFG cycles that are not recognized as loops. Of course, other optimizations are more local in nature and work just fine with irreducible CFGs.

From memory, and probably wrong, SPEC2006 has a single irreducible loop in 401.bzip2 and that's it. It's quite rare in practice.

Clang will only emit a single indirectbr instruction in functions using computed goto. This has the effect of turning threaded interpreters into natural loops with the indirectbr block as a loop header. After leaving LLVM IR, the single indirectbr is tail-duplicated in the code generator to reconstruct the original tangle.
titzer commented May 1, 2017
There is no single-pass verification algorithm for irreducible control flow that I am aware of. The design choice for reducible control flow only was highly influenced by this requirement.

As mentioned earlier, irreducible control flow can be modeled at least two different ways. A loop with a switch statement can actually be optimized into the original irreducible graph by a simple local jump-threading optimization (e.g. by folding the pattern where an assignment of a constant to a local variable occurs, then a branch to a conditional branch that immediately switches on that local variable).

So the irreducible control constructs are not necessary at all, and it is only a matter of a single compiler backend transformation to recover the original irreducible graph and optimize it (for engines whose compilers support irreducible control flow--which none of the 4 browsers do, to the best of my knowledge).
Best,
-Ben
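(A small made-up C sketch of the pattern described above may help: "before" is the structured form a producer emits, a constant assigned to a local and then switched on, and "after" is the control flow a consumer could recover by threading each such assignment directly to its case. Function and variable names are invented.)

/* "before": set a state local, then branch to a switch on that local. */
int before(int x) {
    int state = (x > 0) ? 1 : 2;        /* constant assigned to a local ... */
    for (;;) {
        switch (state) {                /* ... immediately switched on     */
        case 1: x -= 1; state = (x > 0) ? 1 : 2; break;
        case 2: return x;
        }
    }
}

/* "after": the same CFG with each edge threaded into a direct branch. */
int after(int x) {
    if (x > 0) goto dec; else goto done;
dec:
    x -= 1;
    if (x > 0) goto dec; else goto done;
done:
    return x;
}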
titzer commented
I can also say further that if irreducible constructs were to be added to WebAssembly, they would not work in TurboFan (V8's optimizing JIT), so such functions would either end up being interpreted (extremely slow) or being compiled by a baseline compiler (somewhat slower), since we will likely not invest effort in upgrading TurboFan to support irreducible control flow. That means functions with irreducible control flow in WebAssembly would probably end up with much worse performance.

Of course, another option would be for the WebAssembly engine in V8 to run the relooper to feed TurboFan reducible graphs, but that would make compilation (and startup) worse. Relooping should remain an offline procedure in my opinion, otherwise we are ending up with inescapable engine costs.

Best,
-Ben
This comment has been minimized.
There are established methods for linear-time verification of irreducible control flow. A notable example is the JVM: with stackmaps, it has linear-time verification. WebAssembly already has block signatures on every block-like construct. With explicit type information at every point where multiple control flow paths merge, it's not necessary to use fixed-point algorithms. (As an aside, a while ago I asked why one would disallow a hypothetical …)

The loop-with-a-switch pattern can of course be jump-threaded away, but it's not practical to rely on. If an engine doesn't optimize it away, it would have a disruptive level of overhead. If most engines do optimize it, then it's unclear what's accomplished by keeping irreducible control flow out of the language itself.
comex commented May 4, 2017
Sigh… I meant to reply earlier but life got in the way.

I've been grepping through some JS engines and I guess I have to weaken my claim about irreducible control flow 'just working'. I still don't think it would be that hard to make it work, but there are some constructs that would be difficult to adapt in a way that would actually benefit over…

Well, let's assume, for the sake of argument, that making the optimization pipeline support irreducible control flow properly is too hard. A JS engine can still easily support it in a hacky way, like this: within the backend, treat a …

There might be issues with, e.g., any optimization that merges/deletes branches, but it should be pretty easy to avoid that; the details depend on the engine design. In some sense, my suggestion is equivalent to @titzer's "simple local jump-threading optimization". I'm suggesting making 'native' irreducible control flow look like a loop+switch, but an alternative would be to identify real loop+switches – that is, @titzer's "pattern where an assignment of a constant to a local variable occurs, then a branch to a conditional branch that immediately switches on that local variable" – and add metadata allowing the indirect branch to be removed late in the pipeline. If this optimization becomes ubiquitous, it could be a decent substitute for an explicit instruction.

Either way, the obvious downside to the hacky approach is that optimizations don't understand the real control flow graph; they effectively act as if any label could jump to any other label. In particular, register allocation has to treat a variable as live in all labels, even if, say, it's always assigned right before jumping to a specific label, as in this pseudocode:

a:
control = 1;
goto x;
b:
control = 2;
goto x;
...
x:
// use control
That could lead to seriously suboptimal register use in some cases. But as I'll note later, the liveness algorithms that JITs use may be fundamentally unable to do this well, anyway… Whatever the case, optimizing late is a lot better than not optimizing at all. A single direct jump is much nicer than a jump + compare + load + indirect jump; the CPU branch predictor may eventually be able to predict the latter's target based on past state, but not as well as the compiler can. And you can avoid spending a register and/or memory on the 'current state' variable.

As for the representation, which is better: explicit (…) or implicit?

Benefits to implicit:

Drawbacks to implicit:

Anyway, I don't care that much either way - as long as, should we decide on the implicit approach, the major browsers actually commit to performing the relevant optimization.

… Going back to the question of supporting irreducible flow natively - what the obstacles are, how much benefit there is - here are some specific examples from IonMonkey of optimization passes that would have to be modified to support it:

AliasAnalysis.cpp: iterates over blocks in reverse postorder (once), and generates ordering dependencies for an instruction (as used in InstructionReordering) by looking only at previously seen stores as possibly aliasing. This doesn't work for cyclic control flow. But (explicitly marked) loops are handled specially, with a second pass that checks instructions in loops against any later stores anywhere in the same loop. -> So there'd have to be some loop marking for …

FlowAliasAnalysis.cpp: an alternative algorithm which is a bit smarter. Also iterates over blocks in reverse postorder, but on encountering each block it merges the calculated last-stores information for each of its predecessors (assumed to have already been calculated), except for loop headers, where it takes the backedge into account. -> Messier because it assumes (a) predecessors to individual basic blocks always appear before it except for loop backedges, and (b) a loop can only have one backedge. There are different ways this could be fixed, but it would probably require explicit handling of …

BacktrackingAllocator.cpp: similar behavior for register allocation: it does a linear reverse pass through the list of instructions and assumes that all uses of an instruction will appear after (i.e. be processed before) its definition, except when encountering loop backedges: registers which are live at the beginning of a loop simply stay live through the entire loop. -> Every label would need to be treated like a loop header, but the liveness would have to extend for the entire labels block. Not hard to implement, but again, the result would be no better than the hacky approach. I think.
This comment has been minimized.
@comex Another consideration here is how much wasm engines are expected to do. For example, you mention Ion's AliasAnalysis above; however, the other side of the story is that alias analysis isn't that important for WebAssembly code, at least for now while most programs are using linear memory. Ion's BacktrackingAllocator.cpp liveness algorithm would require some work, but it wouldn't be prohibitive. Most of Ion already does handle various forms of irreducible control flow, since OSR can create multiple entries into loops.

A broader question here is what optimizations WebAssembly engines will be expected to do. If one expects WebAssembly to be an assembly-like platform, with predictable performance where producers/libraries do most of the optimization, then irreducible control flow would be a pretty low cost, because engines wouldn't need the big complex algorithms where it's a significant burden. If one expects WebAssembly to be a higher-level bytecode, which does more high-level optimization automatically, and engines are more complex, then it becomes more valuable to keep irreducible control flow out of the language, to avoid the extra complexity.

BTW, also worth mentioning in this issue is Braun et al's on-the-fly SSA construction algorithm, which is simple and fast and supports irreducible control flow.
tbodt commented Apr 4, 2018
I'm interested in using WebAssembly as a qemu backend on iOS, where WebKit (and the dynamic linker, but that checks code signing) is the only program that is allowed to mark memory as executable. Qemu's codegen assumes that goto statements will be a part of any processor it has to codegen for, which makes a WebAssembly backend almost impossible without gotos being added.
eholk commented Apr 4, 2018
@tbodt - Would you be able to use Binaryen's relooper? That lets you generate what is basically Wasm-with-goto and then converts it into structured control flow for Wasm.
tbodt commented Apr 4, 2018
@eholk That sounds like it would be much much slower than a direct translation of machine code to wasm.
kripken commented Apr 4, 2018
@tbodt Using Binaryen does add an extra IR on the way, yeah, but it shouldn't be much slower, I think, it's optimized for compilation speed. And it may also have benefits other than handling gotos etc. as you can optionally run the Binaryen optimizer, which may do things the qemu optimizer doesn't (wasm-specific things). Actually I would be very interested to collaborate with you on that, if you want :) I think porting Qemu to wasm would be very useful.
tbodt commented Apr 4, 2018
So on second thought, gotos wouldn't really help a whole lot. Qemu's codegen generates the code for basic blocks when they are first run. If a block jumps to a block that hasn't been generated yet, it generates the block and patches the previous block with a goto to the next block. Dynamic code loading and patching of existing functions are not things that can be done in webassembly, as far as I know. @kripken I'd be interested in collaborating, where would be the best place to chat with you?
This comment has been minimized.
You can't patch existing functions directly, but you can use …

I'm not sure that anyone has tried this yet, though, so there's likely to be many rough edges.
tbodt commented Apr 4, 2018
That could work if tailcalls were implemented. Otherwise the stack would overflow pretty quickly. Another challenge would be allocating space in the default table. How do you map an address to a table index?
kripken commented Apr 4, 2018
Another option is to regenerate the wasm function on each new basic block. This means a number of recompiles equal to the number of used blocks, but I'd bet it's the only way to get the code to run quickly after it is compiled (especially inner loops), and it doesn't need to be a full recompile: we can reuse the Binaryen IR for each existing block, add IR for the new block, and just run the relooper on all of them. (But maybe we can get qemu to compile the whole function up front instead of lazily?) @tbodt for collaboration on doing this with Binaryen, one option is to create a repo with your work (and can use issues there etc.), another is to open a specific issue in Binaryen for qemu.
tbodt commented Apr 4, 2018
We can't get qemu to compile a whole function at a time, because qemu doesn't have a concept of a "function". As for recompiling the whole cache of blocks, that sounds like it might take a long time. I'll figure out how to use qemu's builtin profiler and then open an issue on binaryen.
davidgrenier commented Jun 28, 2018
Side question. In my view, a language targeting WebAssembly should be able to provide efficient mutually recursive functions. For a depiction of their usefulness I'd invite you to read: http://sharp-gamedev.blogspot.com/2011/08/forgotten-control-flow-construct.html

In particular, the need expressed by Cheery seems to be addressed by mutually recursive functions. I understand the need for tail recursion, but I'm wondering whether mutually recursive functions can only be implemented if the underlying machinery provides gotos or not. If they do, then to me that makes a legitimate argument in favour of them, since there'll be a ton of programming languages that'll have a hard time targeting WebAssembly otherwise. If they don't, then perhaps the minimum mechanism to support mutually recursive functions is all that would be needed (along with tail recursion).
This comment has been minimized.
@davidgrenier, the functions in a Wasm module are all mutually recursive. Can you elaborate what you regard as inefficient about them? Are you only referring to the lack of tail calls or something else? General tail calls are coming. Tail recursion (mutual or otherwise) is gonna be a special case of that.
davidgrenier commented Jun 28, 2018
I wasn't saying anything was inefficient about them. I'm saying that if you have them, you don't need general goto, because mutually recursive functions provide everything a language implementer targeting WebAssembly should need.
vasili111 commented Jun 29, 2018
Goto is very useful for code generation from diagrams in visual programming. Maybe visual programming is not very popular now, but in the future it may attract more people, and I think wasm should be ready for it. More about code generation from diagrams and goto: http://drakon-editor.sourceforge.net/generation.html
neelance commented Jun 29, 2018
The upcoming Go 1.11 release will have experimental support for WebAssembly. This will include full support for all of Go's features, including goroutines, channels, etc. However, the performance of the generated WebAssembly is currently not that good. This is mainly because of the missing goto instruction. Without the goto instruction we had to resort to using a toplevel loop and jump table in every function. Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function. The relooper can not help with this, only a goto instruction can.

It is awesome that WebAssembly got to the point where it can support a language like Go. But to be truly the assembly of the web, WebAssembly should be equally powerful as other assembly languages. Go has an advanced compiler which is able to emit very efficient assembly for a number of other platforms. This is why I would like to argue that it is mainly a limitation of WebAssembly and not of the Go compiler that it is not possible to also use this compiler to emit efficient assembly for the web.
This comment has been minimized.
Just to clarify, a regular goto would not be enough for that, a computed goto is required for your use case, is that correct?
neelance commented Jun 29, 2018
I think a regular goto would probably be sufficient in terms of performance. Jumps between basic blocks are static anyways, and for switching goroutines a …
This comment has been minimized.
It sounds like you have normal control flow in each function, but also need the ability to jump from the function entry to certain other locations in the "middle", when resuming a goroutine - how many such locations are there? If it's every single basic block, then the relooper would be forced to emit a toplevel loop that every instruction goes through, but if it's just a few, that shouldn't be a problem. (That's actually what happens with setjmp support in emscripten - we just create the extra necessary paths between LLVM's basic blocks, and let the relooper process that normally.)
neelance commented Jun 29, 2018
Every call to some other function is such a location and most basic blocks have at least one call instruction. We're more or less unwinding and restoring the call stack.
This comment has been minimized.
I see, thanks. Yeah, I agree that for that to be practical you need either static goto or call stack restoring support (which has also been considered).
Heimdell commented Jun 30, 2018
Will it be possible to call functions in CPS style or implement …?
This comment has been minimized.
@Heimdell, support for some form of delimited continuations (a.k.a. "stack switching") is on the road map, which should be enough for almost any interesting control abstraction. We cannot support undelimited continuations (i.e., full call/cc), though, since the Wasm call stack can be arbitrarily intermixed with other languages, including reentrant calls out to the embedder, and thus cannot be assumed to be copyable or movable.
oridb commented Sep 8, 2016
I'd like to point out that I haven't been involved in the web assembly effort,
and I'm not maintaining any large or widely used compilers (just my own
toy-ish language, minor contributions to the QBE compiler backend, and an
internship on IBM's compiler team), but I ended up getting a bit ranty, and
was encouraged to share more widely.
So, while I'm a bit uncomfortable jumping in and suggesting major changes
to a project I haven't been working on... here goes:
My Complaints:
When I'm writing a compiler, the first thing that I do with the high level
structure -- loops, if statements, and so on -- is validate them for semantics,
do type checking and so on. The second thing I do with them is just throw them
out, and flatten to basic blocks, and possibly to SSA form. In some other parts
of the compiler world, a popular format is continuation passing style. I'm not
an expert on compiling with continuation passing style, but it doesn't seem to
be a good fit either for the loops and scoped blocks that web assembly seems to have
embraced.
I'd like to argue that a flatter, goto based format would be far more useful as
a target for compiler developers, and would not significantly hinder the
writing of a usable polyfill.
Personally, I'm also not a big fan of nested complex expressions. They're a bit
clunkier to consume, especially if inner nodes can have side effects, but I
don't strongly object to them as a compiler implementer -- The web assembly
JIT can consume them, I can ignore them and generate the instructions that map
to my IR. They don't make me want to flip tables.
The bigger problem comes down to loops, blocks, and other syntactic elements
that, as an optimizing compiler writer, you try very hard to represent as a
graph with branches representing edges; The explicit control flow constructs
are a hindrance. Reconstructing them from the graph once you've actually done
the optimizations you want is certainly possible, but it's quite a bit of
complexity to work around a more complex format. And that annoys me: Both the
producer and the consumer are working around entirely invented problems
which would be avoided by simply dropping complex control flow constructs
from web assembly.
In addition, the insistence on higher level constructs leads to some
pathological cases. For example, Duff's Device ends up with horrible web
assembly output, as seen by messing around in The Wasm Explorer.
However, the inverse is not true: Everything that can be expressed
in web assembler can be trivially converted to an equivalent in some
unstructured, goto based format.
So, at the very least, I'd like to suggest that the web assembly team add
support for arbitrary labels and gotos. If they choose to keep the higher
level constructs, it would be a bit of wasteful complexity, but at least
compiler writers like me would be able to ignore them and generate output
directly.
Polyfilling:
One of the concerns I have heard when discussing this is that the loop
and block based structure allows for easier polyfilling of web assembly.
While this isn't entirely false, I think that a simple polyfill solution
for labels and gotos is possible. While it might not be quite as optimal,
I think that it's worth a little bit of ugliness in the bytecode in order
to avoid starting a new tool with built in technical debt.
If we assume an LLVM (or QBE) like syntax for web assembly, then some code
that looks like:
might compile to:
This could be polyfilled to Javascript that looks like:
Is it ugly? Yeah. Does it matter? Hopefully, if web assembly takes off,
not for long.
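(The code listings in the post above did not survive this copy. Purely as an illustrative stand-in -- not the author's original example, and in C rather than an LLVM/QBE-like syntax or JavaScript -- this is the kind of flat, label-and-goto shape being argued for; a polyfill could wrap the same labels in the while/switch dispatch loop discussed earlier in the thread.)

/* Made-up example: a small loop written directly with labels and gotos.
   Each label is a basic block; each goto is an explicit CFG edge. */
int sum_until_negative(const int *a, int n) {
    int s = 0, i = 0;
loop:
    if (i >= n) goto done;
    if (a[i] < 0) goto done;
    s += a[i];
    i++;
    goto loop;
done:
    return s;
}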
And if not:
Well, if I ever got around to targeting web assembly, I guess I'd generate code
using the approach I mentioned in the polyfill, and do my best to ignore all of
the high level constructs, hoping that the compilers would be smart enough to
catch on to this pattern.
But it would be nice if we didn't need to have both sides of the code generation
work around the format specified.