Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
696 lines (535 sloc) 23.6 KB

Table of Contents

Summary

This provides a general summary of the compilation process from Whiley to low-level target platforms (e.g. JavaScript, Java, C, etc). As part of this, a hypothetical Whiley Low Level Language (WyLL) is introduced as an abstraction to illustrate code generation.

Motivation

In constructing code generators for different target platforms (e.g. JavaScript, Java, C, etc) a number of common problems have arisen. The purpose of this document is to detail those problems and their solutions and, hence, prevent backends from constantly reinventing the wheel.

By far, the hardest problems in compiling for a target platform stem from the complex nature of types in Whiley. This mostly (though not exclusively) arises from flow typing. Consider this Whiley code:

function f(int|null x) -> (int r):
    if x is int:
        return x
    else:
        return 0

Here, the type of x is (int|null)&int) on the true branch and (int|null)-int on the false branch. But, how should such types be represented on a given target platform? Such types are retained at the WyIL level because they include useful nominal information which (amongst other things) is important for error reporting and verification.

The above types for x can easily be simplified to int and null respectively. But that raises another question: can all types be simplified in some way? In fact, the answer is no (though there are useful approximations we can use).

Technical Details

This document is concerned with the compilation of Whiley programs to a range of target platforms, including JavaScript, Java and C. Other platforms, such as LLVM and WebAssembly are also of interest. Whilst each of these platforms has differences, they also have lots in common.

The Whiley Low Level Language can be imagined as a simple imperative language that strips away much of the complexity found in Whiley's type system and semantics. The intention is that this captures the commonalities across target platforms. We can summarise these commonalities as follows:

  • Types. We assume the target platform provides a range of primitive types. This includes booleans, fixed-width and unbound integers, arrays, records, references and lambdas (a.k.a. function pointers or closures). Furthermore, we assume that reference types have a finite representation regardless of what they refer to. Finally, we assume that recursive types are not natively supported and, instead, must simulated using references.

  • Value Semantics. We assume the target platform does not necessarily provide value semantics for all types. For example, arrays and objects in Java have reference semantics. As such, cloning of values is necessary at different points (though the exact details are platform specific).

  • Methods. We assume the target platform provides a general concept of a function or method which can be invoked. Whilst this can accept zero or more parameters, we assume it only supports zero or one return value (i.e. no platform we are aware of natively supports multiple returns). In addition, whilst overloading may be possible on the target platform, we assume true overloading of Whiley types is not possible. As such, a mechanism for name mangling is required.

  • Statements & Expressions. We assume the usual range of statements and expressions is supported. This includes conditionals, loops, switches, etc. In general, translation of statements and expressions in Whiley to WyLL is straightforward and is largely ignored in this document.

We note that each platform supports these features in different ways. The WyLL is intended to be parameterised to allow for this.

Types

Whiley supports a wide range of both finitely sized and infinitely sized types. For example, it natively supports int:13, uint:5, int:15[], etc. As such, it is unreasonable to expect the target platform to support such a large range of types. Instead, a mapping from Whiley types to native types is assumed and is used to guide the generation of WyLL for the given platform.

The language of types supported in WyLL is given as follows:

  • (Booleans). We assume the target platform has support for encoding boolean types (even if this is done using simple integer types).

  • (Integers). We assume arbitrary fixed-width signed and unsigned integers are supported (e.g. int:13, uint:16, etc). For example, int:n maps to byte in Java when 1 <= n <= 8, to short when 9 <= n <= 16, etc. In JavaScript, we can map int:n to native numbers upto n=52, whilst in C we would map int:7 to int8_t, int:13 to int_16_t, etc. In addition, the target platform is assumed to provide a translation for unbounded integers (e.g. BigInteger in Java, or through GMP in C, etc).

  • (Records). We assume an arbitrary sized record type is available (e.g. {int x, int y}). For example, in Java this might map to a class of some kind (e.g. Struct), whilst in JavaScript this maps to an object. Open Records?

  • (Arrays). We assume an arbitrary sized array type is available (e.g. int[] maps to int[] in Java, and an array in JavaScript, etc).

  • (References). We assume an arbitrary reference type is available (e.g. &int). In both Java and JavaScript, for example, this would be achieved with some kind of boxing strategy.

  • (Lambda). We assume an arbitrary function type is available (e.g. (int,int)->(int), etc). In Java, this can be implemented with some kind of class or interface (e.g. by generalising Function). In JavaScript, closures are easily applied.

  • (Unions). We assume a mechanism exists for representing a variable which may hold values from any two different types (e.g. int|null, etc). In Java we can achieve this via Object, provided that primitives are boxed, whilst in JavaScript it's automatic.

Of importance here are the Whiley types are not supported in WyLL. Most notably, these includes intersections and differences. These must be compiled out when generated WyLL.

Representations

The question of how to translate between arbitrary Whiley types and the more restricted form supported in WyLL is challenging. There are three aspects:

  • Simple Types. These are non-recursive types which do not involve intersection or negation. They are the so-called simple types because they map directly to WyLL types.

  • Complex Types. These are more difficult and, indeed, one may wonder why the language supports them at all! Mostly, they exist because of flow typing though intersections, for example, can be useful in some cases for combining type invariants.

  • Recursive Types. These are types which are defined in terms of themselves. They are necessarily rooted at a nominal type declaration.

Simple Types. These don't warrant further discussion (though perhaps there are some questions regarding whether or not they should be flattened).

Complex Types. The Whiley Intermediate Language (WyIL) preserves a lossless view of types generated via flow type checking. Whilst such types can be complex in nature, preserving them is necessary as (amongst other things) they include useful nominal information which is needed for error reporting and verification. For example, consider this Whiley code:

function f(int|null x) -> (int r):
    if x is int:
        return x
    else:
        return 0

Here, the type of x on the true branch is (int|null)&int). This can easily be simplified to (int&int)|(null&int) and then to int. Likewise, on the false branch we have (int|null)-int which simplifies to (int-int)|(null-int) and then to null.

From the perspective of WyLL, the above is not as challenging as it may at first seem. The most important observation is that WyLL need only concern itself with types explicitly denoted in the source code. In the above example, WyLL does not consider the intermediate types generated from flow typing (e.g. (int|null)&int)). Rather, it need only coerce x from type int|null to int on the true branch.

Despite the above observation, WyLL must still handle arbitrarily complex types since the user can explicitly denote them. A good example is the following:

type XHTML is XML & HTML

This declares that the XHTML type is both an instance of the XML type and the HTML type. Unfortunately, we cannot always get rid of intersections and negations through simplification. After conversion to DNF, we are left with a union of conjuncts of the form T1 - T2 - ... - Tn. An example of a type which cannot be further simplified is {int x, ...}-{int x, int y}, and any further simplification is lossy. We observe that, for any complex type S, there is a minimal simple type T where S <: T. We refer to T as the representation type as it is the minimal underlying representation needed to store values of S. Indeed, a type of the form T1 - T2 - ... - Tn is always safely approximated by T1.

Recursive Types. The obvious way to handle recursive types is to compile back pointers into references. For example, this Whiley code:

type LinkedList is null | { LinkedList next, int data }

becomes this WyLL code:

type LinkedList is null | { &LinkedList next, int data }

The key challenge here is to work out where the backlinks are, and also to figure out how to manipulate them properly.

Coercions

Implicit or explicit type coercions present numerous challenges for code generation. Furthermore, the exact requirements for when a coercion is required depend entirely on the target platform. Consider this example:

function read(int:8[] items, usize i) -> (int:32 result):
    return items[i]

Here, an implicit coercion is placed after the array access to convert from an int:8 to an int:32. Of course, the need for this coercion depends on the target platform. On Java we need to cast from a byte to an int but, on JavaScript, nothing needs to be done.

The set of possible coercions is classified as follows:

  • Widening / Narrowing. A simple transformation between integer types (e.g. int:8 to int:16, or from int:16 to int:8). Most platforms require such coercions, but the exact details vary. For example, even in Javascript going from int:8 to int requires an explicit coercion.

  • Reordering. A simple transformation between record types (e.g. {int x, int y} to {int y, int x}). Not all platforms require such coercions. For example, in JavaScript objects are dictionaries and, hence, are order agnostic. But, in a language like C, the exact ordering is important.

  • Opening / Closing. A transformation to/from a smaller record type to/from a larger record. For example, from {int x, int y} to {int x, ...} or from {int x, ...} to {int x, int y}. Again, the need for this depends on the platform and, in particular, what type test operations are available. For example, in both Java and JavaScript, such coercions are not likely to be necessary. But, in a language like C, they are certainly necessary.

  • Tagging / Untagging. A transformation to/from a tagged union. For example, from int to int|null or from int|null to int. On the assumption that integer tags are used to distinguish unions, then most platforms will require such coercions. However, on some platforms (e.g. Java), it is possible to implement unions using only the underlying type test operators and tags could be omitted (though this not consider ideal).

  • Retagging. A transformation between union types (e.g. int|null to null|bool|int). This occurs when a value in a union flows into another value in a union. The need for such coercions, again, depends as above for tagging / untagging.

The implementation of coercions at the WyLL level assumes only a primitive operator for converting integer types (e.g. between an int:8 and an int:16, etc), and three operators on unions (entering, leaving and inspecting). All other coercions are implemented using existing WyLL types, statements and expression (though in some cases, it may be preferable to override this). The following examples attempt to clarify this somewhat:

Unions. Consider the following Whiley code:

function f(int[]|null narr) -> (int[] r):
   if narr is int[]
      return narr
   else:
      return []

This would generate (roughly) the following WyLL code:

int[] f_u2ain(int[]|null narr) {
  if(narr? == 0) { return narr#0; }
  else { return []; }
}

Here, the tag access operator ? simply returns the value of the tag, whilst the untag operator e#c coerces the payload into the given type.

Arrays. These are more complex because they necessarily require the construction of a new array. The following illustrates:

function f(int:8[] xs) -> (int:16[] ys):
    return xs

This would generate (roughly) the following WyLL code:

int:16[] f_ai(int:8[] xs) {
   return cast$1(xs);
}

int:16[] cast$1(int:8[] xs) {
  int:16[] ys = [0; |xs|]
  for(int i=0;i!=|ys|;++i) {
    ys[i] = (int:16) xs[i];
  }
  return ys;	
}

Observe that on platforms where, for example, int:8 has the same representation as int:16, the above coercion is unnecessary. The given type mapping then prevents the coercion from being created as the Whiley type int:8 would map to the WyLL type int:16, etc.

Records. These are relatively straightforward using a record initialiser. The following illustrates:

function f({int:8 x} rec) -> ({int:16 x} res):
    return rec

This would generate (roughly) the following WyLL code:

{int:16 x} f_ri1x({int:8 x} rec) {
   return { x: (int:16) rec.x };
}

Again, the primitive integer coercion operator (int:16) is used as the fundamental building block.

Type Propagation

Type propagation is essentially an optimisation to improve the placement of coercions. The following illustrates:

function f(int x) -> ((int|bool)[] xs):
   return [x,x]

By default, the array initialiser will construct an array of type int[] which would then need to be coerced to an array of type (int|bool)[] (which amounts to cloning the array). However, if we know the ultimate destination is the type (int|bool)[] then we can just create the original array with that representation.

Whilst back-propagation makes a lot of sense, there are complications to deal with. Consider this variation:

function f(int:8 x) -> (int:16[]|bool xs):
   return [x,x]

The question here is: what type should the array initialiser have? Certainly, it cannot have type int:16[]|bool as this is not an array type. Likewise, type int:8 is not optimal for reasons explained above. Therefore, we intersect the actual type (int:8) with the target type (int:16[]|bool) to determine the correct type.

Accessors

The presence of readable/writeable types motivates the need for the so-called accessors. Consider the following example:

function f(int[]|bool[] arr) -> (int|bool r):
   return arr[0]

Here, we need to generate (roughly) the following WyLL code:

int|bool f(int[]|bool[] arr) {
  return read$1(arr,0)
}

int|bool read$1(int[]|bool[] arr, int i) {
   if(arr? == 0) { return arr#0[i]; }
   else { return arr#1[i]; }
}

In this case, each array access is compiled down to use a concrete array type. Records and references also support readable types in a similar fashion.

Writeable types work in roughly the same way. For example:

type Record is {int|bool field}|{bool|null field}
function f(Record rec) -> (Record r):
   rec.field = false
   return rec

Here, we need to a setter rather than a getter. It might look something like this:

Record f(Record rec) {
  rec = write$1(rec,false);
  return rec;
}

Record write$1(Record rec, bool v) {
   if(rec? == 0) { rec#0.field = v; }
   else { rec#1.field = v; }
   return rec;
}

Again, both record assignments now operate on record types directly, rather than on union types.

Runtime Type Information

(See also RFC#0017)

Runtime type information is required to answer runtime type tests. There are two primary categories of information here: finite and infinite. The former can be handled relatively easily through the use of runtime type tags, whilst the latter is more complex and different approaches are needed for different platforms.

Finite Type Information. This is generated through the use of union types. The following illustrates the canonical example:

type msg is {int kind, int payload}|{int kind, int[] payload}

function msg(int k, int p) -> (msg r):
   return {kind: k, payload: p}

Here, finite type information is generated at the point of the return in the form of a runtime type tag. The value {kind: k, payload: p} is said to enter the union at this point. In WyRL this looks like the following:

msg msg(int k, int p) {
  return {kind: k, payload: p}#0;
}

Here the expression {kind: k, payload: p}#0 adds tag 0 to the left-hand side, where the tag value is determined at compile time.

Infinite Type Information. An outstanding issue is how to deal with the representation of type information in open records. These types are complex because they represent an infinite number of types, rather than a finite number of cases. For example, we cannot easily use integer tags to distinguish between open records (as we did above) since this would require a potentially infinite tag space.

Borrowing

The presence of value semantics in Whiley presents some challenges for efficient compilation. Specifically, we want to avoid cloning large data structures (e.g. arrays) as much as possible.

A move analysis phase helps determine when a value of a dynamically sized data type can be "moved" or must be "copied". Moving is preferable (when permitted) because then the original reference can be used without copying the underlying data (hence, is significantly more efficient). The following provides a useful example:

function g(int[] xs, int i) -> (int[] ys):
    xs[j] = get(xs,i,i)
    return xs

In the invocation get(xs,i,i) the array xs cannot be moved since it is live afterwards. Instead, we must clone xs at this point. However, the subsequent use of xs in the return statement does not require a clone and can be moved (since xs is no longer live).

The following illustrates another situation where temporary moves or "borrows" are permitted:

function get(int[] xs, int i, int j) -> (int r):
    int sum = xs[i]
    sum = sum + xs[j]
    return sum

Clearly, there is no need to clone xs when it is used in the initialiser for sum. This is because the use is temporary and does not modify xs. We say that xs is not consumed.

Finally, it's worth noting that, in fact, the invocation above to get(xs,i,i) could (in theory) avoid cloning xs. This is possible when it is known at the call site that xs is not modified to by the invocation. Such knowledge could come from an interprocedural analysis, though this may be expensive. However, it can also come from the presence of final modifiers on variables. For example, if we updated the signature for get() as follows:

function get(final int[] xs, int i, int j) -> (int r):
   ...

Here, calles to get() know that the xs parameter will not be modified and, hence, they can safely avoid cloning.

Name Mangling

WyLL provides a default mechanism for type mangling which translates Whiley types into ASCII strings containing only lowercase/uppercase characters, digits and the underscore. The translation is:

  • Primitives. Each primitive types is encoded using a single capital letter. Here, Null is N, Bool is B, Int is I.

  • Nominals. These are encoded using a single uppercase prefix Q followed by, for each component, the number of characters and then the component identifier. For example, the name std::ascii::string gives Q3std4ascii6string.

  • Arrays. These are encoded using a single lowercase prefix a, followed by the mangled element type. For example, int[] gives "aI".

  • References. These are encoded using a single lowercase prefix of either q. For non-static references, this is followed by the lifetime index which is a De Bruijn index. Eitherway, the mangled element type follows next. For example, &int gives "qI", whilst &l:bool gives "q0B" (assuming lifetime l is the first declared lifetime variable).

  • Records. These are encoded using a single lowercase prefix r followed by the number of fields then, for each field, its mangled type followed by the number of characters in the field name and then the field name itself. For example, the type {int op, bool flag} gives the mangle "r2I2opB4flag". This encoding accounts correctly for the possibility of nested records.

  • Callables. Functions, Methods and Properties are encoded using the prefixes f, m and p respectively. These are followed by the number of parameters and the parameter mangles and, likewise, for the returns. For example, function(int,bool)->any becomes "f2IB1A".

  • Unions, Intersections and Negations. Negations are handled as for arrays using the prefix n. Unions and intersections are handled using the prefixes u and i respectively. This is followed by the number of components and then the components themselves. For example, int|null becomes "u2IN".

Multiple Returns

No target platform that we're aware of natively supports multiple returns (though, potentially, something like LLVM might). Therefore, multiple returns are compiled away when generating WyLL and this is relatively straightforward. For assignments it's really easy, for multiple returns it's slightly harder (to do well).

Multiple Assignments. Consider a multiple assignment like this:

int x, int y, int z 

x,y,z = 1,f()

Here, f() returns two ints. The obvious translation is this:

x = 1
{int f1, int f2} tmp = f()
y = tmp.f1
z = tmp.f2

Obviously, we need to be careful about the variable name introduced to prevent clashes.

Multiple Returns. These can be handled in exactly the same way above by reusing the declared returns. However, we might want something neater in some cases. For example, we could translate this:

return x,y

into this:

return {x: x, y: y}

But, of course, we can't do that with an actual multiple return. Therefore, this:

return x,f()

becomes this:

{int f1, int f2} tmp = f()
return x, tmp.f1, tmp.f2

However, care needs to be taken for side-effecting expressions as these must still be evaluated in the correct order.

Drawbacks and Limitations

None.

Unresolved Issues

Plenty.