
Yet Another MVP proposal #94

Closed · taralx opened this issue May 30, 2020 · 64 comments

@taralx commented May 30, 2020

Update: More fleshed-out proposal in the follow-up issue.


Thinking about a true "MVP" for GC got me to this proposal, inspired by CHERI:

  • New type pointer, probably 64-bit for wasm32.
  • Memory instructions (new or augmented) to load/store from pointer instead of i32+memidx.
  • New instruction allocate that accepts an i32 size and returns a pointer to a fresh allocation of at least that size.
  • New instructions pointer.load and pointer.store that only accept pointer, not i32+memidx.
  • Add/sub instructions (either new or augmented) for moving a pointer around.

The big difference from existing proposals is that this one doesn't try to track the full structure of a thing, only which parts of it are pointers.

As constructed above, this requires tag bits to track which parts of the structure are pointers, which doesn't sound cheap. One alternative is to fix the region that can store pointers, which has the potential advantage of letting code be pointer-size agnostic.

Anyway, this is probably half-baked and doesn't work for some reason, but I figured I'd put it out there. :)

@RossTate

I suspect that if you were to refine this idea, you’d end up with something looking close to this: #78 (comment)

@taralx commented May 31, 2020

@RossTate Indeed it is similar! There are some things in here that I think are worth considering:

  1. Separate pointer type to allow identification of roots.
  2. Pointers can only come from allocate, host, or reading from another pointer, which means we can be certain about pointer validity.
  3. Dynamic tracking of pointer-ness is much simpler than asking the compiler to pre-declare the pointer region or forcing the use of weird 31-bit numbers.

@taralx commented May 31, 2020

Also, there is nothing stopping pointer values from being pointers into the host, e.g. the JS heap.

@RossTate

The thing I can’t figure out with your suggestion is how the garbage collector is supposed to walk structures after finding roots.

@taralx commented May 31, 2020

Let me write down a few more bits of my original idea:

  • pointer types are larger than 32-bit, because they carry the allocation bounds with them. (Bounds can be compressed, e.g. CHERI Concentrate.)
  • Every pointer-typed value and every memory location capable of storing a pointer has a validity bit telling us if the pointer is valid.
  • This is enough for the GC to perform full tracing. Typing gives us roots, pointers have bounds to know where to look, tag bits tell us which parts of memory are inner pointers.
  • Since pointers have bounds, we can reflect them into JS as e.g. typed arrays.

@taralx commented May 31, 2020

I then cut out all the extraneous stuff, because you can do all that in a variety of ways. But as you pointed out, that just results in a sketch that is similar to what you posted.

@RossTate commented May 31, 2020

Okay, so it sounds like a structure is an array of cells that are dynamically either pointers or 32-bit integers. On 64-bit machines, each cell would likely be 64 bits in size, taking advantage of room for bit flags to distinguish pointers from integers. On 32-bit machines, each cell would likely be 32 bits in size, with a bitset at negative offsets indicating which cells are pointers and which are integers (assuming no concurrent reads and writes). Does that sound like what you're envisioning?
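For concreteness, the two schemes might look roughly like this (purely a sketch; the exact encodings are illustrative, not part of any proposal):

;; 64-bit machines: one 64-bit cell per field, low bit as the tag
;;   (pointers are at least 2-byte aligned, so the low bit is free)
;;   cell = (value << 1) | 1    ;; 32-bit integer payload, tagged 1
;;   cell = pointer             ;; pointer, low bit 0
;;
;; 32-bit machines: one 32-bit cell per field, plus a bitmap stored
;; at negative offsets from the object's base, one bit per cell,
;; marking which cells currently hold pointers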

@taralx commented May 31, 2020

The problem with that 32-bit option is what do you do when someone writes a negative integer into the memory? You're back to basically enshrining i31s, which doesn't mesh well with existing PLs.

Another option I came up with is to not mix pointers and data, a la capnproto. allocate would have to take two parameters, number of pointers and number of data bytes. The nice part there is you don't need to know how big pointers are. The problem is I don't know how one would best represent a pointer to pointer vs a pointer to data. Doing it at the type level gets complicated fast.

@taralx commented May 31, 2020

(FWIW, in CHERI there is a separate data region that stores the "valid pointer" bits.)

@RossTate commented May 31, 2020

The problem with that 32-bit option is what do you do when someone writes a negative integer into the memory?

By "negative offsets" I meant with respect to the object's layout, not negative integers. But regardless, I can't tell what your proposal has in mind here. How are you mixing integers and pointers?

allocate would have to take two parameters, number of pointers and number of data bytes.

I had originally ruled out this option, but I figured out a way to address the problem I had foreseen. Do as you say, but have the space for pointers grow negatively and the space for integers grow positively. This still supports the common OO pattern where subclasses add more fields—both reference and primitive fields—and yet still need the layout to align with that of superclasses.
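To illustrate the layout (a sketch only, with the object's base pointer sitting between the two regions):

;;  ... ref[2]  ref[1]  ref[0] | base | byte[0]  byte[1]  byte[2] ...
;;      <--- references grow            scalar bytes grow --->
;;           at negative offsets        at positive offsets
;;
;; superclass: 2 refs + 8 bytes  ->  ref[0..1], byte[0..7]
;; subclass:  +1 ref  + 4 bytes  ->  ref[0..2], byte[0..11]
;; all superclass field offsets remain valid in the subclass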

@taralx commented Jun 1, 2020

Oh, I see. Negative offsets, not negative values.

Hm, that might work?

@RossTate commented Jun 1, 2020

I think so. It has various pros and cons, though I don't think any of them are particularly subtle at this point. So it seems we're at a good place for others to input their thoughts.

@aardappel commented Jun 1, 2020

Reading this issue, it seems to have more in common with the existing proposal than with #78: it wants to have memory managed by the engine, whose pointers are always exact, and which can be exchanged with other modules or with JS refs. That is all very different from #78.

So it is better to look at how it differs from the main proposal, and I guess it is trying to replace strongly typed GC structs/arrays with generic memory blocks, at the cost of more runtime tracking (bounds checks). I don't see that as a win. I think if we're going to have GC objects entirely engine managed we might as well statically know the references in them.

If more flexibility is required, you could make the current GC proposal more flexible by making scalars untyped: rather than declaring a struct with certain sized int/float types in them, you'd declare a struct with N exact anyrefs and M bytes of "scalar memory" that can be addressed in whatever way.

@RossTate commented Jun 1, 2020

Good point on the comparison with #78!

I think if we're going to have GC objects entirely engine managed we might as well statically know the references in them.

The current GC proposal does not provide that. An anyref and any reference to a struct can always have more references in the referent than the type indicates.

you'd declare a struct with ... M bytes of "scalar memory" that can be addressed in whatever way

This is more what I had in mind (and I'm guessing @taralx too) for the integer portion. That is, a structure is a chunk of linear memory (at positive offsets) and an array of references (at negative offsets).

I don't see that as a win.

My sense is that the win here lies in its simplicity. It adds just one or two types and doesn't introduce subtyping, and yet it seems flexible enough to express GC languages. It would be relatively easy to implement and ship quickly. And although it's not particularly efficient due to bounds checks (which there are many mitigation techniques for), it would be easy to express in a more advanced GC proposal that is more efficient, so it could serve well as a holdover.

@aardappel

Sorry, I should have said "strongly" rather than "statically". But yes, to your point, some level of dynamic checking is already required.

I do like the elegance of all GC objects being uniformly 2 adjacent arrays (refs + bytes), but it would be good to have an idea of the expected cost compared to the current proposal.

@RossTate commented Jun 1, 2020

The strategy here would do a single indirection (twice on 32-bit machines; once on 64-bit machines) to perform a (double) bounds check. You could have an instruction that just asserts up front that there are so many "scalar" and reference fields (trapping if it's not the case), after which even simple streaming optimization would be able to eliminate subsequent bounds checks. The current proposal requires a double indirection to perform a cast. So I suspect which one performs better would vary by benchmark, and I imagine that often their performances would be quite similar. My guess is binary size would likely be smaller with the strategy here, and type-checking would likely be faster with the strategy here.

@binji commented Jun 1, 2020

I think I understand the model being described here, though it's changed a bit from the initial post (in particular, I don't see how pointer.add/pointer.sub could work with the model being described here). It sounds like what we're describing is this instead:

;; allocate a new object
pointer.new <bytes> <refs> : [] -> [pointer]

;; read a pointer at given index 
pointer.get : [pointer i32] -> [pointer]

;; write a pointer at given index
pointer.set : [pointer i32 pointer] -> []

;; read memory at given offset, etc. for other loads/stores
pointer.load_i32 : [pointer i32] -> [i32]
pointer.load_f32 : [pointer i32] -> [f32]
pointer.store_i32 : [pointer i32 i32] -> []
...

And I guess we're assuming that externref <: pointer too? Or there needs to be a separate type parameter for pointer.get and pointer.set?
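For concreteness, a usage sketch with those hypothetical instructions, building a cons cell with a 4-byte integer head and a reference to the tail (none of these names or signatures are settled):

(func $cons (param $head i32) (param $tail pointer) (result pointer)
  (local $p pointer)
  (local.set $p (pointer.new 4 1))  ;; 4 bytes of data, 1 ref slot
  ;; store the integer head at byte offset 0
  (pointer.store_i32 (local.get $p) (i32.const 0) (local.get $head))
  ;; store the tail reference in ref slot 0
  (pointer.set (local.get $p) (i32.const 0) (local.get $tail))
  (local.get $p))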

@RossTate commented Jun 1, 2020

The reason I said "one or two" types is because we'd probably want a type for pointer reference fields. Let's call it primref because all the various primitive reference types (including pointer, or maybe structref?) would be coercible to and castable from primref. Of course, we could index get and set with a primitive reference type to save space and avoid the additional coercion instruction in the common case.

@taralx commented Jun 2, 2020

What do you think about the idea of pointers carrying their bounds around? It increases the pointer size, but removes the need to perform indirection to do bounds checks.

@RossTate commented Jun 2, 2020

That sounds like an implementation strategy that wouldn't need to be visible at the wasm level (so long as we didn't somehow make the design incompatible with that strategy). That is, I don't think anything above forces pointer to be a particular size or have a particular representation of bounds.

@RossTate commented Jun 9, 2020

Thoughts from others?

@taralx commented Jun 9, 2020

Well, I'm not others, but it seems like there is a qualitative difference in the goals of proposals like this one vs the existing GC proposal. Would it be a good idea to split "linear GC" from this more complex proposal?

@RossTate commented Jun 9, 2020

It started off that way, but after working out more details, it seems to serve the same purpose as the existing proposal. Is there some particular difference you have in mind?

@taralx commented Jun 9, 2020

The existing proposal has avoiding dynamic checks as an explicit goal, according to @rossberg. There's no way to avoid them in this version, since we don't know what a pointer points to at compile time.

@jakobkummerow

I appreciate the striving for simplicity here: simplicity is good.

That said, minimalism is not the only design goal that matters, even for an "MVP". From a Web Platform perspective, beyond coming up with a GC proposal that's expressive enough to work, there's also a performance requirement: it's already possible to bring whatever managed language to the web by compiling it to JavaScript, as e.g. GWT/J2CL, KotlinJS, Dart2JS, and a bunch of others are demonstrating. That's not exactly an ideal solution, especially from a technical point of view, but it works.

So with this background in mind, the mission for Wasm-GC, should we choose to accept it, is to provide a better alternative, and that primarily means better performance according to one or more of these metrics: execution performance, memory consumption, wire bytes size. So adding a bit of complexity to the concept (like fully typing each pointer field in a struct) seems like a good tradeoff if it makes it more likely that the end result will have compelling arguments going for it compared to just continuing to compile stuff to JavaScript.

@taralx commented Jun 9, 2020

Let's not lose track of the advantages of this proposal over the status quo, for example:

  1. Provides a simple, safe dynamic memory allocation interface. This allows modules to avoid including their own memory allocator if they want.
  2. Enables embeddings to pass memory buffers into WebAssembly without copying. (e.g. for streams)
  3. Provides a mechanism for ad-hoc transfer of large amounts of data between modules without needing multi-memory and the associated machinery.

@binji commented Jun 9, 2020

@taralx That's a good point, though putting it that way makes it seem like you're actually describing two different features -- the first is the ability to allocate and use garbage-collected memory buffers. The second is allowing these buffers to also contain references to other garbage-collected objects.

We could support this first case via first-class memory instances, which may be more natural for how wasm currently works.

@taralx commented Jun 9, 2020

I'm not sure first-class memory instances are a good match for this. Memories are expensive, generally requiring special guard pages, etc. I doubt you're going to want to make a memory for each malloc(). Additionally, you'd have to store the memory references in a table, making a kind of strange indirected "fat pointer" that will bloat code size.

@RossTate

(I started writing this earlier today, but then got distracted by personal responsibilities, so sorry that this forks off from earlier in the discussion.)

Both good points, but it's important to be aware that any GC proposal will have dynamic checks. Consider, for example, the following Java code (which is meant as a pedagogical device rather than something you'd see in the wild):

double diagonal(Double[][] matrix, int size) {
    double prod = 1.0;
    for (int i = 0; i < size; i++)
        prod *= matrix[i][i];
    return prod;
}

First let's consider how we'd implement this in the existing proposal. One important fact to be aware of is that the current proposal cannot guarantee a Double[][] is actually an array of arrays of doubles. (It takes a long technical digression into Java semantics and the expressiveness of the current proposal's casting system, so I'll ask that you take my word for this.) The current proposal can only guarantee that matrix is an Object[]. So let's consider the evaluation of the expression matrix[i][i] (including its implicit unboxing to a double):

  • Before the loop begins, as an optimization we'll prefetch the contents of matrix, which requires some struct.get, which we'll store in $contents.
  1. Use array.get to get the value of matrix[i], but which we only know is an Object (requires a bounds check)
  2. Use rtt.cast (which involves a load, a bounds check, another load, and a comparison) to cast that value to an Object[] (the current proposal can't cast to Double[])
  3. Use struct.get to get that array's contents (remember that the current proposal doesn't allow arrays to have headers, and Java needs all its reference values to have v-tables, so Java arrays are implemented as structs with a reference to an array)
  4. Use array.get to get the value of matrix[i][i], but which we only know is an Object (requires a bounds check)
  5. Use rtt.cast to cast that value to Double (which again involves a load, a bounds check, another load, and a comparison)
  6. Use struct.get to get that value's f64

Second, let's consider the same using structref:

  1. Use structref.load_ref to get the value of matrix[i], but as a primref (requires a bounds check)
  2. Use primref.as_structref to cast that to a structref (either a pointer-embedded bit-flag check or a load and compare)
  3. Use structref.load_ref to get the value of matrix[i][i], but as a primref (requires a bounds check)
  4. Use primref.as_structref to cast that to a structref (again, either a pointer-embedded bit-flag check or a load and compare)
  5. Use structref.load_f64 to get that value's f64 (requires a bounds check)

I think it's reasonable to expect that structref will perform much faster than the current proposal here. That's because the type system of the current proposal only saves a bounds check, whereas the type system is too inflexible to express more efficient representations and the cast system is too general-purpose to optimize.

Now there will be Java benchmarks in which the current proposal will outperform structref, and in fact Java might actually be one of the best matches for the current proposal because it's generally so monomorphic. But now consider an example from C#:

int Sum(int[] ints) {
    int sum = 0;
    for (int i = 0; i < ints.Length; i++)
        sum += ints[i];
    return sum;
}

Before I go on, it's important to realize that an int[] in C# can be passed to a generic method like T[] reverse<T>(T[]). This means an int[] needs to be represented in WebAssembly in an abstract way. In the current proposal, that means treating every array as an Object[] and casting appropriately. So ints[i] will return an Object (after a bounds check), which then has to be cast to a boxed integer, which then has to be unboxed.

With structref, though, we have a lot more flexibility. As such, we can take advantage of the fact that C# is reified (sorry to those who already know this), meaning the <T> in the signature of reverse actually corresponds to a run-time argument (unlike Java's erased types). This run-time argument can provide all sorts of useful information about T. In particular, it can tell us T's size, i.e. how many reference fields it has and how many linear-memory bytes it has. Given that information alone, reverse can reverse the contents of a structref representing a T[]. This means that an int[] can be represented by a structref exactly how you'd expect: a linear-memory block with 4 bytes for each int (plus a reference to the v-table and a reference to the run-time representation of the int type). So ints[i] returns an i32 (after a bounds check), and you're done.
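A sketch of how Sum might then compile; structref.bytes and structref.load_i32 are hypothetical names following the scheme above, and the length is assumed to be recoverable from the struct's scalar-byte count:

(func $Sum (param $ints structref) (result i32)
  (local $i i32) (local $sum i32) (local $n i32)
  ;; element count = scalar bytes / 4
  (local.set $n (i32.shr_u (structref.bytes (local.get $ints)) (i32.const 2)))
  (block $done
    (loop $next
      (br_if $done (i32.ge_u (local.get $i) (local.get $n)))
      ;; unboxed i32 load at byte offset 4*i; in bounds by construction,
      ;; so even a simple compiler can elide the per-access bounds check
      (local.set $sum (i32.add (local.get $sum)
        (structref.load_i32 (local.get $ints)
          (i32.shl (local.get $i) (i32.const 2)))))
      (local.set $i (i32.add (local.get $i) (i32.const 1)))
      (br $next)))
  (local.get $sum))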

I can give more examples where structref enables more efficient translations of various languages, and I can give examples where the current proposal is faster. I can also go on and on about casting mechanisms. My point is to not disregard the suggestion here on expectations of performance. While we do want the MVP to perform well enough to be a better target than JavaScript (and, by the way, there are reasons besides performance to target WebAssembly, like better abstraction if we ensure it), the MVP will inevitably have overheads. So we also want the MVP to be easy to embed within some more powerful future system. The suggestion here is easy to embed in a wide variety of future directions, whereas the current proposal is not nearly so easy and certainly not in so many directions.

@aardappel commented Jun 10, 2020

Before I go on, it's important to realize that an int[] in C# can be passed to a generic method like T[] reverse<T>(T[]). This means an int[] needs to be represented in WebAssembly in an abstract way. In the current proposal, that means treating every array as an Object[] and casting appropriately. So ints[i] will return an Object (after a bounds check), which then has to be cast to a boxed integer, which then has to be unboxed.

Are we sure that is the only way? I'd say not being able to represent int[] as a flat array of 32-bit scalars is a bit of a "deal breaker" for any GC design. Using boxed integers will make it fall off a cliff performance-wise, and use more memory. It would make Wasm versions of these languages unacceptably slower than "native".

Worse, can you imagine byte[] needing boxing per element? When being used for the storage of large binary blobs?

I know Java doesn't allow int[] to be passed to anything generic, so it doesn't have this problem. I guess "native" C# does JIT specialization to get around this genericity? Can Wasm-C# instead rely on AOT specialization? Would that break language semantics?

It would be rather odd if a program using int[] in an un-generic way was punished throughout for the mere possibility of a generic call.

Sounds like in the specific case of reverse, converting the entire array at the boundary would be possible, but that may not be feasible in the general case, e.g. when passed to a class that holds on to a reference.

@binji commented Jun 12, 2020

I think the main questions I have are:

  1. What's the overhead of a bounds-check on every field access, in real-world programs?
  2. How does this change with different bounds-checking strategies?
  3. How does this compare w/ cost of casts in similar real-world programs?
  4. What's the path forward for the locality issue?

@RossTate

Good questions. Regarding 1, it's an oversimplification to expect every field access to require a bounds check. If you have a structref that you're going to access multiple times, you would up front execute, say, structref.assert_bounds 5 16 to assert the struct has at least 5 reference fields and 16 bytes. From that point on, even a streaming compiler would easily be able to determine that bounds checks on appropriate field accesses could be elided. If we have an immutable version of let, a streaming compiler could even easily do this across blocks.

Maybe this is what you meant by your point 2—I just wanted to head off any misconceptions.
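In instruction form, the idea is roughly as follows (hypothetical names throughout):

;; trap unless $obj has at least 5 ref fields and 16 scalar bytes
(structref.assert_bounds (local.get $obj) 5 16)
;; after the assertion, even a streaming compiler knows these are in bounds:
(structref.load_ref (local.get $obj) (i32.const 2))   ;; no bounds check
(structref.load_f64 (local.get $obj) (i32.const 8))   ;; no bounds check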

@timjs commented Jun 13, 2020

What you're proposing, @RossTate, could be a cast from a general managed reference (let's say of type gcref) to a specific struct reference carrying info about the number of references and the length of its linear memory (of type structref 5 16), couldn't it?

For the MVP, we'd allow structs to contain a number of gcrefs (positioned at negative offsets) and a length in bytes of linear memory which is only allowed to contain non-references (positioned at positive offsets). Every general gcref would need to be cast exactly once to this specific type. From that point on, accessing any of its reference or value fields would be direct. The type system ensures the field index is in bounds. (That is, struct field indices are meant to be statically known, aren't they? Or am I missing something? For dynamic access there would be arrays...)

Alternatively, instead of a structref N M with N gcrefs and M bytes of linear memory, we could choose to go for a structref N T+ of N gcrefs and T a concrete list of value types. The question then is whether something like structref 3 i32 i64 f32 would be castable to structref 3 i64 i64 or not. Is this safe? In the case of the linear memory alternative, reinterpretation of linear memory is allowed...

@binji, what would be the best way to check the performance of real-world programs? Are there some (semi-artificial) examples, like @RossTate showed above, which cover most use cases? At first sight, I think this is highly dependent on the source language and its abilities.

@dcodeIO commented Jun 13, 2020

Fwiw, I'd suspect that every bit of additional work performed by fundamental language features will inevitably add up and is worth avoiding, since multiple tradeoffs in one code path potentially multiply. And more tradeoffs will be made, if not by the spec, then by implementers, and if not by implementers, then by users.

@RossTate

@timjs, I'm not actually proposing to have the structref type be qualified by its lengths. That is, structref.assert_bounds doesn't change the type of anything in WebAssembly. But it does inform the compiler that, after control passes that point, the structref has at least that many fields. A really simple compiler could ignore that information and just always check bounds, but even a streaming compiler could track that information to eliminate those bounds checks.

Suppose we didn't have such an instruction. Then if code were to set the first reference field, then the second reference field, and then the third reference field, this would require three bounds checks because it's possible the structref has two fields, in which case the first two sets should happen before trapping. If you write the instructions in the opposite order, then you need only one bounds check. By doing an assert_bounds 3 0 up front, then you don't have to worry about the order of the subsequent sets because they're all guaranteed to work if the assertion passes.

Note that this issue of implicit information tracking is present in the existing proposal as well. OCaml is likely to run into it in one of two ways. The reason is that OCaml has a number of operations that are intended to be implemented by walking over its data structures in an abstract manner. These include polymorphic structural equality, two inequivalent variations on structural comparison, and polymorphic hashing. It seems OCaml will have to implement this in one of two ways: have most values be arrays of references, or have most values have an associated v-table.

Let's consider the former strategy: using arrays. Something like cons would be represented as an array with three fields: one for the tag, one for the head element, and one for the tail list. Operations like List.hd and List.tl will be implemented as array accesses, running into the exact same issues as above (but without an assert_bounds operation).

Let's consider the latter strategy: using v-tables. The implementation of structural comparison for a value that's the third case of an algebraic data type will proceed as follows. It will take the value it's being compared to and br_on_cast it for belonging to each of the five cases in order (all of which are at the same rtt depth). Naively, this would involve loading the value's casting table 5 times, checking it has enough entries 5 times, loading the appropriate rtt 5 times, and then comparing that value 5 times. But since a number of languages (at least the more dynamic languages) will use this pattern a lot, I suspect the expectation is that the optimizer will track that the value has not changed (just as with assert_bounds above) and consequently only load the table once, check its length once, load the appropriate rtt once, and then compare just that value 5 times.

So the issue of implicit simple (i.e. streaming-compatible) optimization is present in both designs, though structref.assert_bounds is more explicit about incorporating it. And now I realize that there's reason to believe that OCaml might be more efficiently implemented with structref than the current proposal, though the difference (if it exists) would be much less pronounced than with C# and Go.

@taralx commented Jun 13, 2020

@timjs I'm not seeing the advantage of structref N T+ - can you explain what this would be fixing?

@timjs commented Jun 14, 2020

@RossTate, I'm trying to understand the benefits of your proposal not to encode length information into a structref type. If WebAssembly's performance should be predictable, and the optimization of eliminating bounds checking is so easy, why not enforce it? I think that reflecting this information in the type helps ensure this. It doesn't depend on the language compiler inserting a structref.assert_bounds instruction or not.

I think the implementation of a polymorphic list (type 'a list = nil | cons of 'a * 'a list in OCaml or the equivalent in Haskell/Idris/PureScript/...) would not be expressed by dynamically sized arrays, but be most naturally expressed by a structref with 2 reference fields and 0 bytes of linear memory, i.e. a structref 2 0. A monomorphic linked list with only integers (type intlist = nil | cons of int * intlist in OCaml, the equivalent in Go, or by using Haskell's levity polymorphism) would be represented by a structref 1 4, having just one reference to the next cons cell and 4 bytes of linear memory to store an i32.
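For instance, the accessors for that intlist (a structref 1 4) might look as follows, reusing the hypothetical instruction names from earlier in the thread (nil checks and tag handling elided):

(func $head (param $l structref) (result i32)
  ;; the i32 stored in the 4 scalar bytes (bounds-checked unless asserted)
  (structref.load_i32 (local.get $l) (i32.const 0)))

(func $tail (param $l structref) (result structref)
  ;; load ref index 0 (a primref), then cast it back to structref
  (primref.as_structref (structref.load_ref (local.get $l) (i32.const 0))))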

I wasn't entirely clear before that I'm assuming each structref in memory will always have a header word storing (at least) the number of references and the size of linear memory. This info is obviously needed for GC, and can also be used for bounds checking/casting operations. Additionally, the header is the place to put a variant tag, which makes this proposal easily post-MVP extensible. Structural comparison would also be implemented by using the header information: traversing all references and comparing tags & linear memory along the way. A similar thing happens for polymorphic hashing.

This model greatly mimics OCaml's memory blocks, Haskell's heap objects, and Clean's heap objects (chapter 10). As my background is mostly in the implementation of functional languages, I could be missing an important thing here to support object-oriented languages. Nevertheless, Go's structs as well as OCaml's variants should fit nicely in this scheme. The idea of putting references at negative offsets and linear memory at positive offsets would allow for the implementation of classes and casting structrefs to subtypes, i.e. casting a class Car : Vehicle {public string modelName;} to a class Vehicle {public string brand;} in C#.

@taralx, the only advantage of using structref N T+ is that it is closer to the current proposal where fields are indexed directly instead of using byte offsets. I think that reusing the concepts of linear memory makes things a bit simpler, but maybe others find it more elegant to specify the field types instead of the byte size.

@RossTate

@timjs, that's a fair question (and amongst a bunch of useful thoughts!). The high-level answer is simplicity. The following are a bunch of more detailed answers:

  • Keeping the type simple also keeps the annotation size small.
  • There are endless ways to reason about integer bounds, and we can endlessly debate which ones to incorporate into the type system. Each one we do incorporate imposes a burden on all compilers, even if they don't care about the performance improvement. And for the ones we don't incorporate, optimizers can still use them anyways. So I'm trying to avoid a tarpit by not incorporating bounds into the type at all.
  • This is the MVP; it's not meant to be the final answer. Keeping it simple makes it easier to extend and leaves room for directions for such extensions. Adding integer-bounds reasoning is known to get complicated and might rule out some such directions; it's hard to say at the moment.
  • Adding integer bounds into the type has obvious subtyping implications, which introduces a whole 'nother layer of complexity. This simple design avoids subtyping entirely.

At a meta-level, right now it is very difficult to know which typing features will actually improve performance. We simply don't have the infrastructure in place to collect such knowledge, and without that knowledge we'll just be guessing as to which extensions/refinements/variations are actually worthwhile. Having something flexible like this gets people compiling to WebAssembly without the type system limiting how they represent data. That gets us corpuses that we can analyze for patterns, and it gets us backends we can experiment with and modify to test out whether variations have impact and whether it's actually practical to expect backends to produce such variations. So keeping it simple also avoids guesswork, leaving development of extensions/refinements/variations to when they can be informed by real data.

Does that seem like overall sound reasoning?

@timjs commented Jun 16, 2020

That's a very thorough and clear answer, @RossTate! I can imagine that for the MVP it is not needed to add bound information to the types. In my proposal it can always be added later as a subtype, i.e. structref N M <: structref. For an MVP, an instruction such as structref.assert_bounds N M would suffice.

Btw, I think my description diverged somewhat from the original proposal by @taralx. Should I open another issue or write it out as a more complete proposal? Are these proposals something that the subgroup wants to discuss during meetings?

@RossTate

My sense is that this is a collective brainstorming session where we are getting a sense of how this rough direction could work, what its advantages and disadvantages might be, and whether there's interest in exploring further. If there is interest, then I think it'd be worthwhile separately developing a bunch of detailed variations (e.g. structref and pointer are themselves different in the details as well) and then communally synthesizing these variations. So my recommendation would be to hold off on making a more complete proposal until the group indicates there's interest in the overall direction.

@timjs commented Jun 16, 2020

Sounds good! More comments and ideas are always welcome.

@sabine commented Jul 7, 2020

@jakobkummerow

the mission for Wasm-GC, should we choose to accept it, is to provide a better alternative, and that primarily means better performance according to one or more of the metrics: execution performance, memory consumption, wire bytes size. So adding a bit of complexity to the concept (like fully typing each pointer field in a struct) seems like a good tradeoff if it makes it more likely that the end result will have compelling arguments going for it

Open questions:

  1. Do we have a language where the current proposal is likely to enable better performance than this conceptually simpler approach?
  2. Is adding types the right kind of complexity to add, in order to gain better performance?
  3. Are runtime types cheaper than bounds checks, or more expensive? Note that, for arrays in the current MVP, we must do bounds checks, too, when accessing them.

Looking at languages that compile to JavaScript, I think

  • there is no need for the WASM GC to be significantly faster than the JavaScript GC, but it should not be slower to compile to WASM GC than to JavaScript GC
  • running a language on WASM enables performance gains and the ability to faithfully represent language semantics which were problematic on JavaScript. The benefits here are in other areas than GC (e.g. exception handling, fine-grained control over low-level operations).
  • WASM GC must provide a reasonably friendly memory model from the point of view of compiling to it - in particular, we need to be sure that it is possible to compile various languages to use WASM GC. Ideally, without jumping through too many performance-degrading hoops in order to represent their heap model, and without redoing entire compilers. At the same time, the implementation burden on the WASM engines must be reasonable. A WASM GC that is viable in production is a compromise.
  • where execution performance is the main goal, there is no way around a language bringing its own GC on WASM linear memory. The requirements of different languages just imply very different GC strategies, so that an optimal GC will look quite different between them. Still, where wire bytes size (-> time to first render on the web for web applications) and interop with JavaScript is a main goal, the built-in WASM GC will win hands-down, despite being slower than the GC strategy usually employed by the language, for purely practical reasons.

@jakobkummerow

The current proposal seems well suited for the assumption that most code is monomorphic and casts are rare (either because types are known statically, or because a single runtime cast guards a nontrivial amount of code, i.e. a statically typed section). In such a scenario, having precise types on fields means that values loaded from fields don't have to be cast or typechecked at all.

There are plenty of examples where this assumption holds (e.g. pretty much all languages that have a notion of "classes" with "methods", so long as they don't have high-frequency calls that interleave methods defined on subtypes and supertypes). There are also plenty of examples where it doesn't hold (e.g. hashing any generic object's contents, as was discussed on the other thread).

(FWIW, in JavaScript we see both: a large part of V8's performance is based on the insight that most code ends up being monomorphic even in such a highly dynamic language; at the same time we've seen plenty of examples where code makes use of the language's flexibility, and any engine feature that assumes otherwise falls flat onto its face.)

@RossTate commented Jul 8, 2020

There are plenty of examples where this assumption holds (e.g. pretty much all languages that have a notion of "classes" with "methods", so long as they don't have high-frequency calls that interleave methods defined on subtypes and supertypes).

Unfortunately not true for C# due to reified generics, nested structures, interior pointers, and so on. Nested structures and interior pointers are also issues for Go. I don't know enough about Dart these days to assess. Also doesn't work well for Scala due to multiple inheritance of traits, though they at least have the infrastructure for compiling to the JVM that they can possibly fall back on. (I'm assuming your statement was only meant to apply to typed languages.)

So, amongst the top 50 TIOBE languages, this statement really only seems to hold for Java and Kotlin (and possibly Dart).

@taralx commented Jul 9, 2020

Hrm. Makes me wonder if the heterogeneous version from earlier isn't better, despite the need to maintain word types.

@timjs commented Jul 9, 2020

@sabine

Open questions:

  1. Do we have a language where the current proposal is likely to enable better performance than this conceptually simpler approach?
  2. Is adding types the right kind of complexity to add, in order to gain better performance?
  3. Are runtime types cheaper than bounds checks or more expensive? Note that, for arrays in the current MVP, we must do bounds-checks, too, when accessing them.

Very good questions and observations IMHO! I think that creating a heap model supporting so many different languages can only end in one of two ways:

  • A very complicated proposal with a lot of type information that tries to support as many of the memory and casting models currently used by languages as possible.
  • Or a very simple one, having only the type info needed to support GC and being as dynamic as possible in all other facets.

I think the latter is the reason why compiling high-level statically typed languages to dynamic languages like Scheme or JavaScript works so well, while you have to contort yourself when compiling a "non-OO language" to the JVM or CLR. However, Wasm wants to provide some safety with its memory model. This means we can't be too dynamic.

So the question is whether adding more type information to a GC reference, besides its lower-level layout, is better or not. I.e. what is the added value of knowing that a reference is of type struct i32 (ref (array i32)) (ref f64) i32 instead of structref 2 8, knowing that it has 2 references and 8 bytes of linear memory? For language designers as well as for engine implementors? That is, I think the common denominator to compile to is reasoning about layout and making bounds checks as cheap as possible. (And yes, I'm saying this as a very big proponent of strong type systems 😆)

@taralx

Hrm. Makes me wonder if the heterogeneous version from earlier isn't better, despite the need to maintain word types.

Which version are you referring to?

@jakobkummerow

what is the added value of knowing that a reference is of type struct i32 (ref (array i32)) (ref f64) i32 instead of structref 2 8 knowing that it has 2 references and 8 bytes of linear memory?

The value is that whenever we know the static type, we don't need any checks at all to access the struct's fields, and static type knowledge propagates over dependent field loads. So the question becomes: how often do we know the static types of things? And how large or small is the performance penalty when we don't?

When we start with a property loading chain like foo.bar.baz.qux() in the source language, and we have static types for everything, then this can compile to three loads (and a call) without any checks in between. If we lack type information at first, then we'll need a single cast for the foo object, and the rest follows without checks.
Whereas if our type system only knows the number of pointer fields and linear memory bytes, then we need a bounds check before every single load, and presumably a type check before the final call. (In fact, "knowing that a reference is of type structref 2 8" is easier said than done: even if we somehow know that foo has this type, then once we load .bar from it, all we statically know is that bar is a reference. To figure out that bar is of type structref <n> <m>, we have to query it at runtime. If the type system wanted to know that statically, then foo's type would have to be structref (references (to a structref <n> <m>) (to a structref <n'> <m'>)) (8 bytes), which would get pretty close to the fully typed system that this approach wanted to avoid, and would quickly become unwieldy if we wanted to express more than one step of nesting.)
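To make the contrast concrete (all type and instruction names hypothetical): with fully typed fields the chain is unchecked loads after one cast, while with structref alone each hop starts over with an untyped reference:

;; fully typed (current proposal), after a single up-front cast of $foo:
(local.set $bar (struct.get $Foo $bar (local.get $foo)))   ;; no check
(local.set $baz (struct.get $Bar $baz (local.get $bar)))   ;; no check
;; ...then load qux's v-table entry and call, still without checks

;; structref only: every intermediate value is just "some reference"
(local.set $bar (structref.load_ref (local.get $foo) (i32.const 0)))  ;; bounds check
(local.set $baz (structref.load_ref                                   ;; bounds check,
  (primref.as_structref (local.get $bar)) (i32.const 0)))             ;; plus a cast
;; ...cast $baz, bounds-check it, and type-check before the final call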

If we think about scenarios where most things are statically typed as anyref and we have to typecheck/cast everything all the time, then just doing bounds checks is faster than RTT-based subtype checks, no doubt about that. If we consider scenarios where most types are statically known, then having all those types allows skipping all checks, which obviously yields higher peak performance.

I suppose in an ideal world, someone would volunteer to prototype both approaches end-to-end, so that we could measure their practical performance on a wide range of real-world applications... any takers? :-)

@aardappel

@sabine

Ideally, without jumping through too many performance-degrading hoops in order to represent their heap model, and without redoing entire compilers.

Yes, this is a point I think we're not taking into account enough. Sadly I think "performance thru strong typing at the Wasm level" and "ease of porting as many existing language representations/runtimes as possible" are fundamentally at odds with each other.

where execution performance is the main goal, there is no way around for a language to bring their own GC on WASM linear memory. The requirements of different languages just imply very different GC strategies, so that an optimal GC will look quite different between them.

There is one point where a dedicated Wasm GC can win over a linear memory GC in performance, and that is the stack. A linear memory GC has to put most of its stack into memory, for it to be scannable, turning many local.get into i32.load, which will generally be a significant slowdown.

But yes, a dedicated Wasm GC shouldn't be slower than the linear memory alternative, which would make it an unattractive target for many languages also.

@sabine commented Jul 10, 2020

the common denominator to compile to is reasoning about layout and making bounds checks as cheap as possible. (And yes, I'm saying this as a very big proponent of strong type systems 😆)

I like that thought because it goes well with the mental model of a compiler person. It provides some feeling of "being in control of the layout", or maybe the superficial notion of "being able to reuse the layout of the existing compiler to the degree possible". I suspect that this is mostly prejudice on my part, though, and we must identify what the qualitative difference to the "more typed" proposals actually is. (Or whether, in the end, with all the necessary extensions, this proposal turns into more or less the same thing as the existing proposal, but with value types replaced by byte sequences.)

Generally, when looking at linear memory, and the heap, I see only the size in bytes and alignment of values (which compilers traditionally have to deal with). I don't need to see whether there is a float, an unsigned int or a signed int or whatever stored in that linear memory of a heap block.

On the linear memory implementation, WASM gets this right: only when I read from the memory and the value goes into a register does it need a type (in order for the engine to know what registers it can reasonably go in! So that makes perfect sense!).
Having "register-typed values" instead of bytes in a model of a heap block seems like an unnecessary hoop for the WASM engine implementers to jump through, and, if that needs to be type-checked, an unnecessary waste of time. As a producer, I know whether I read a (signed or unsigned) int or a float at the time the read happens (because I would also have to know what register to load that thing into).

If I write an int and read a float, that's my own, harmless mistake: WASM doesn't need to prevent me from running that, as there is no security risk associated with that, right? If I'm wrong here, correct me please.

@aardappel

There is one point where a dedicated Wasm GC can win over a linear memory GC in performance, and that is the stack. A linear memory GC has to put most of its stack into memory, for it to be scannable, turning many local.get into i32.load, which will generally be a significant slowdown.

Currently, this is true. If WASM engines implemented support for stack operations (e.g. allowing linear memory GCs to walk the local variables of the stack to relocate their references), linear memory GC could be faster.

But... for the sake of reduction of overall complexity, I would rather not want to have to deal with two (or more) GCs. What is really quite interesting is that, with one shared GC, suddenly the complexity cost of interoperating with other garbage collected languages becomes manageable. Yes, you still have to write some wrappers to account for calling conventions and heap representation, but there is no need to carefully place and remove pointers in GC root tables, deal with circular unreachable structures, etc. in addition.

@aardappel

@sabine

with one shared GC, suddenly the complexity cost of interoperating with other garbage collected languages becomes manageable. Yes, you still have to write some wrappers to account for calling conventions and heap representation, but there is no need to carefully place and remove pointers in GC root tables, deal with circular unreachable structures, etc.

Then default to copies, not pointers ;) No really, that's what I argue in #78

@sabine commented Jul 10, 2020

@jakobkummerow Thanks, I see the qualitative difference between structref (without typed reference arrays) and the current GC MVP proposal now.

@taralx commented Jul 12, 2020

Well, let's start with fleshing this proposal out, since it has drawn so much attention. I'll try to put together a more detailed proposal (including considerations like non-reference type imports) in time for the GC meeting in 2 weeks.

@timjs commented Jul 13, 2020

@taralx

Let me know if you'd like some help.

@taralx commented Jul 31, 2020

(moved to new issue since people probably muted this one)
