Skip to content
This repository has been archived by the owner on Nov 3, 2021. It is now read-only.

PROPOSAL: "frame" as a better alternative to pick/let #41

Open
xphung opened this issue Mar 7, 2020 · 8 comments
Open

PROPOSAL: "frame" as a better alternative to pick/let #41

xphung opened this issue Mar 7, 2020 · 8 comments

Comments

@xphung
Copy link

xphung commented Mar 7, 2020

Instead of providing "pick" or "let" instructions, I propose providing a "frame" block instruction (effectively providing frame-pointer like access to the wasm stack, similar to what happens for the "call" instruction), eg:

instead of:
i32.const 100 ;; push some values onto stack
i32.const 200
i32.const 300
pick 2 ;; result is 100

you would have:
i32.const 100 ;; push some values onto stack like an inline function call
i32.const 200
i32.const 300
frame param(i32 i32 i32) result(i32)
;; above 3 stack values are now locals 0,1,2 (equivalent to a call into an inlined function)
getlocal 0 ;; result is 100
end
;; stack frame is torn down, and all associated param values (100,200,300) are dropped from stack
;; (equivalent to return from an inline function)

Advantages over "pick" and "let":

  • consistent with wasm semantics of call/getlocal (whereas "pick" provides a parallel system of random access to local data, which overlaps with getlocal, but is not consistent with it)
  • can be used to inline functions - can just copy inlined function bytecode into a function call site
  • easier for humans to follow when reading/writing assembly code, as frame offsets are fixed (eg. getlocal 0 is always 1st param), whereas the index used with pick varies with each subsequent stack manipulation instruction

Disadvantages:

  • potentially increases instruction count by 2 compared to "pick", due to need enter a "frame" block before accessing local data, and then "end" block afterwards
    (however, no additional instruction count, if "frame"/"end" is replacing a "block"/"end" instruction pair that needs to be there anyway)
    (also, if used after a multi-value return, "pick" does not remove the returned values and hence needs multiple "drop" instructions after processing of return values is completed - whereas "frame"/"end" will tear down the stack values supplied as input into "frame", without any additional instructions)
  • inside the "frame" block, can't access locals which were defined outside the block, unless these are passed in as parameters
@binji
Copy link
Member

binji commented Mar 9, 2020

Agreed this sounds like a good idea, it matches nicely with the way WebAssembly already uses the stack and locals. When you say it's different to let, what do you mean? It sounds similar to the way let is described in this document: https://github.com/WebAssembly/function-references/blob/master/proposals/function-references/Overview.md.

whereas "frame"/"end" will tear down the stack values supplied as input into "frame", without any additional instructions)

This is unlike other WebAssembly block functions, which don't tear down their stack implicitly. I think it would be more natural for this to require a br 0 to clean up the stack. It does mean that the instruction sequence doesn't quite match an inlined function.

inside the "frame" block, can't access locals which were defined outside the block, unless these are passed in as parameters

The let described above appends locals to current list of locals, so it would still be possible to access locals from nested lets or the original function definition. Is there a benefit to disallowing access to outer locals?

@aardappel
Copy link

inside the "frame" block, can't access locals which were defined outside the block, unless these are passed in as parameters
I think that would be very unpractical. We might have to define it such that each nested frame adds N locals to the existing set of locals. But now you're talking about a much more heavyweight feature for engines to support it.

I generally do not see the advantage of this over pick and friends. To me, the whole reason you might want pick over getlocal in the first place is that getlocal is often very redundant: we have values on the stack, we want to consume values on the stack, but in between sit a bunch of getlocals whose sole purpose it is to recreate that stack.

Compare this hypothetical example:

i32.const 1
i32.const 2
call foo

func foo (i32 i32)
# first the args were on the stack, then we move them into locals, and now we're putting them back on the stack!
local.get 1
local.get 0
i32.add
# here we need to remove locals (and maybe args)
end

# vs a hypothetical foo that does not put args in locals, but just leaves them on the stack.
func foo (i32 i32)
# just as an example, assume args were in the wrong order, so we pick, or swap
pick 1
i32.add
# no cleanup required.
# end
end

@binji
Copy link
Member

binji commented Mar 9, 2020

But now you're talking about a much more heavyweight feature for engines to support it.

Is it? In many ways it fits nicely with what engines already have to do with block parameters in the multi-value proposal. The difference is that now those values can also be accessed by local.get, etc. But it's always statically known whether those accesses are to function params, locals, or stack values.

I generally do not see the advantage of this over pick and friends.

See some of the discussion here, in particular @rossberg's comment:

I think both complement each other: pick works well for a one-off access into the stack. But for accessing the same slot multiple times, or multiple related accesses, or operand reordering, stack twiddling with pick and swap is more difficult to write, more difficult to read and debug, and I suspect often less compact.

@xphung
Copy link
Author

xphung commented Mar 9, 2020

I generally do not see the advantage of this over pick and friends. To me, the whole reason you might want pick over getlocal in the first place is that getlocal is often very redundant: we have values on the stack, we want to consume values on the stack, but in between sit a bunch of getlocals whose sole purpose it is to recreate that stack.

;; vs a hypothetical foo that does not put args in locals, but just leaves them on the stack.
func foo (i32 i32)
;; just as an example, assume args were in the wrong order, so we pick, or swap
pick 1
i32.add
;; no cleanup required.
end
end

If wasm had "pick" from the start in MVP v1.0 (and left fn parameters on stack rather than mapping them into locals), then I agree my "frame" proposal wouldn't be as useful as "pick".

I don't think getlocal/setlocal will be deprecated any time soon though...

For better or worse, wasm today is a "hybrid" stack machine that is [stack]+[local vars]+[params in locals]. It manages this complexity by adhering strictly to stack as top-only access, ie: the wasm stack is only push/pop and doesn't try to do random access. IMO, "pick" will blur the conceptual distinction between stack vs local vars, ie: both will have random access.

Also, what the "pick" example above doesn't show is after blocks/funcs returning multi-values, extra "drop" instructions may be needed to remove left over stack values. (There is likely to be at least one left over value, as otherwise pick wouldn't be needed, to duplicate values into correct stack order).

In addition to "pick", a "swap" will also be needed to restructure the stack sequence to comply with the end of block stack signature.... once the extra swaps/drops are all added, is there really any code density benefit with providing "pick"? IMO, we need to be very clear about this before we throw out the conceptual "purity" of wasm's stack (of not providing random access to the stack).

Looking at @rossberg 's proposal for "let", it is similar as what I propose for frame, but goes an additional step of remapping outer locals into local #[orig local idx + N], where N is the locals declared by an inner let.

For CPU's with plentiful registers, the wasm "stack" is really a CPU register "window" scheme, ie: a way of directing which CPU register the next wasm arithmetic instruction will operate on. (Local vars deeper inside the stack are then candidates for a "simplistic" JIT to spill into RAM).

What both "frame" and "let" are doing (which "pick" doesn't do) is indicating upfront to a simplistic JIT compiler to pin down & not to spill certain stack slots for the lifetime of the block. However, "let" doesn't convey as much information on what local vars are safe to spill to RAM (as any outer local var can still be accessed inside), whereas "frame" is very explicit in telling the JIT upfront which local vars to keep inside CPU registers. (Effectively the outer local var "push" needed by "frame" is like the "register" keyword in C, it's an advisory hint to the JIT to keep the var in a CPU register... if these hints are not useful to real JIT's, as opposed to a simplistic JIT, then Rossberg's proposal is superior to my proposal - as it has better encoding density).

Also, my own envisaged use case for "frame" is that it will process/re-sequence a set of multi-values, which then will be handed over to the outer block to complete processing on... I am guessing in most cases, the processing done inside "frame" is likely to be small and simple and will not need to access many outer vars, but it would be interesting to see data on this.

@xphung
Copy link
Author

xphung commented Mar 11, 2020

Hi, I can't resist thinking more about the "pick" vs "let" duality, and whether there is a case to have both pick & let.

I remain a proponent of providing "let" and not providing "pick" (I've already more or less conceded above that @rossberg 's "let" is superior to my "frame").

However, here is a suggestion: provide "let" and "dup", ie: pick #0 - but not a generalised pick. dup is a top-of-stack only operation, so it doesn't change the existing wasm model - that is, the wasm stack is limited to non-random top of stack ops, and local vars are to be the mechanism for random access.

There is possibly even a case for providing a "swap #1" - ie: limited to top of stack, so only swapping top two slots of the stack.

Limiting dup and swap to only the top of stack means we don't need to encode an immediate argument for these instructions, so they will only be 1 byte rather than 2+ bytes.

This is a win for code density - as well as the keeping the stack conceptually pure, by not providing random access to deeper slots in the stack.

I suspect the 80/20 law applies to this case - ie: 80% of the use cases of a hypothetical generalised pick/swap are handled by a more simple dup/swap, without immediate arguments. The other more complex 20% are then better addressed by "let" (making generalised pick/swap unnecessary).

@binji
Copy link
Member

binji commented Mar 11, 2020

I think this is an interesting idea, but it would be much more compelling with data. let was useful for the function-references proposal because it provides a real value -- being able to define a local variable without a default. But it seems dup and swap would be purely for optimizing code size, so it would be useful to know how much savings we could expect to achieve. That said, I'm not sure the easiest way to gather this kind of information... perhaps modifying binaryen to understand these new instructions so it can "re-stackify" a function.

@xphung
Copy link
Author

xphung commented Mar 29, 2020

I think this is an interesting idea, but it would be much more compelling with data. let was useful for the function-references proposal because it provides a real value -- being able to define a local variable without a default. But it seems dup and swap would be purely for optimizing code size, so it would be useful to know how much savings we could expect to achieve. That said, I'm not sure the easiest way to gather this kind of information...

Java VM and CIL data might be useful here.

Java VM appears to have a number of variants of dup. These insert the top value of the stack into slots 2-3 levels down. It also has swap, but no pick (anyone with more detail JVM knowledge correct me if I am wrong). Hence, profiling instruction counts of the "dup" variants could give a frequency histogram of stack manipulation by slot dept (0 vs 1 vs 2 vs 3) but this would be a "reverse" of the pick data we want (so not perfect)

CIL has dup, but does not seem to have swap or pick (again, anyone with detailed CIL knowledge correct me if wrong). So CIL won't be helpful in giving data on need for pick beyond swap, although presumably the incidence of multiple pops followed by dup might be very approximate hint on this.

If there is interest in the JVM instruction count data, then I could do some profiling of some java classes to get the dup_XX instruction counts.

@binji
Copy link
Member

binji commented Mar 30, 2020

Sounds potentially useful, but I'm note quite sure how to translate that info into wasm binary savings. I guess if we knew how common dup is in JVM, and then the relative code density of the two formats, then we could make a guess. In wasm currently, a dup could be implemented by adding a local, then using local.tee followed by local.get:

(local $new i32)
...
(local.tee $new)
(local.get $new)

The new local is practically free (just requires incrementing the local count for that value type, so requires an extra byte only at 27n boundary). The local.tee and local.get instructions are 2 to 6 bytes each, but 2 is much more likely. So we can expect dup to save at least 3 bytes every time it's required. Maybe that will give us a good ballpark.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants