- Feature Name: statically_sized_types
- Start Date: 22-05-2017
- RFC PR: https://github.com/Whiley/RFCs/pull/3
Summary
This proposal is for introducing statically sized types to the Whiley language. Such types occupy a fixed (and known) amount of space at runtime. For example, a statically sized integer type might occupy exactly 32bits at runtime.
Motivation
The Whiley Language currently supports dynamically-sized data types.
For example, the int type in Whiley is unbound and can
hold any possible integer (theoretically, at least). Such types are
naturally difficult to represent efficiently on a machine.
Furthermore, they impede communication between Whiley and foreign
code which employs fixed-sized data types (e.g. int32 in C).
This proposal addresses the problem of specifying what underlying representation a given type should have. That is, providing a way for the programmer to dictate in absolute terms what representation should be used. This allows the programmer to optimise the underlying representation as he/she sees fit. It also enables them to interface with foreign code more easily.
Several alternatives that could be considered to this proposal:
-
Library definitions of statically sized types. The standard library currently includes various definitions of statically sized types (e.g.
u8,i16,u32, etc). Knowledge of these could be hard-coded into the compiler. However, this unnecessarily couples the compiler with the standard library. Furthermore, there are certain foreign data types (e.g. C strings) which cannot be easily modelled in this way. -
Range Analysis. Another approach previously explored was the use of integer range analysis (see this paper). Here, the compiler examines the invariants given on a datatype to determine an appropriate representation. For example, consider a type
(int n)wheren >= 0 && n <= 255. The range returned for this type would be0 .. 255and, hence, it could fit into a singlebyte. This approach improves upon the previous in that it does not couple the compiler with the standard library. However, it still cannot encode certain data types (e.g. C strings). Furthermore, the range analysis process is hard to state clearly because of the complex nature of invariants in Whiley. For example, what range should we return for the constraintn >= 0 && (n <= 255 || n < 128)? Probably0 .. 255makes the most sense here, but invariant expressions can be arbitrarily complex and we need a clear and simple mechanism for specifying representation. Finally, using range analysis makes interoperation with foreign code more challenging. This is because we may have foreign data types whose representation is not minimal for their true invariants. For example, anintparameter with only seven possible values (e.g.1..7). In such case, we need to separate the representation from the constraints.
As a result of their limitations, the above approaches are not considered suitable for addressing the problem.
Technical Details
This proposal introduces the following data types which are referred to as statically sized:
- Signed Integers. The type of statically-sized signed integers is
int:n, wherenis a numeric constant denoting the number of bits of precision used for a twos-compliment representation (typically8,16,32, etc). - Unsigned Integers. The type of statically-sized unsigned
integers is
uint:n, wherenis a numeric constant denoting the number of bits of precision used (typically8,16,32, etc). In addition, the special typeusizedenotes a fixed-width unsigned integer sufficient to the hold the length of an array. - Arrays. The type of fixed-length arrays is given by
T[n]wherenis an unsigned numeric constant. A fixed-length array is statically sized if its element typeTis statically sized. For a fixed-length arrayarrof typeT[n], it is implicit that|arr| == n. - Unknown-Length Arrays. The type of unknown-length arrays is
given by
T[?]. Such arrays have a maximum length, but this is unknown and the array length operator is defined for such types only within specification elements.
In addition, the existing types int and T[] are referred to as
dynamically sized. In contrast, the existing types bool, null,
&T and function/method types are referred to as statically sized.
The status of other existing compound types (e.g. records, unions,
intersections, negations) is determined by their element types.
Specifically, if all element types are statically-sized then the
enclosing type is statically sized. For example, { int:8 x, int:8 y} is statically sized, whilst int|bool is not.
Foreign Functions
The proposed types allow one to specify foreign functions more
precisely. For example, the well-known strncpy function has the
following type in C:
char *strncpy(char *, const char *, size_t);
Using the above types, we can give a corresponding Whiley type for this:
type C_string is (uint:8[?] items)
// C strings are null terminated
where some { i in 0..|items| | items[i] == NULL }
&C_string strncpy(&C_string , &C_string, usize)
Here, the type &(uint:8[?]) denotes a reference to an array of
bytes of unknown length. The key is that denoting the array as having
unknown length affects its underlying representation. Specifically,
no length field is stored with the array. Thus, it can be represented
as just a sequence of zero or more elements to match the C representation.
The above illustrates the use of the array length operator on an
unknown-length array in a specification element (i.e. |items|). By
allowing this operator in specification elements, we can still reason
about the array's actual length in an abstract sense.
Finally, as an example to illustrate, we can provide the following mapping from C99 fixed-width data types:
int8_t=>int:8,int16_t=>int:16,int32_t=>int:32, etcuint8_t=>uint:8,uint16_t=>uint:16,uint32_t=>uint:32, etcsize_t=> usizechar=>uint:8
Similar mappings exist for other languages. For example, a Java int
maps to int:32 whilst a long maps to int:64, etc. Likewise, a
Java int[] maps to a int:32[], etc.
Coercions
This proposal treats the status of a type (i.e. statically sized or
dynamically sized) as a type modifier rather than as a type. This
has significant implications for the meaning of such types. For
example, a variable of type int:32 can flow into a variable of type
int:8 without the need for an explicit cast. The following
illustrates a valid Whiley function under this proposal:
function abs(int:32 x) -> (uint:32 r)
ensures r >= 0:
//
if x >= 0:
return x
else:
return -x
From a typing perspective, this function allows a variable of type
int:32 to flow into one of type uint:32. This is clearly unsound
in an unrestricted fashion. For example, -1 is a valid member of
type int:32 but not uint:32. Nevertheless, the above program is
considered valid under this proposal.
The reason the above program is considered valid is that, under verification, the constraints imposed by the different integer representations are shown to hold. In other words, validity of the underlying representation depends upon verification.
As one exception, we note that int[?] cannot flow into int[] (with
or without a cast). This is simply because it is logically impossible
to implement as, at the point we have an instance of int[?] we no
longer have a length variable. However, it is valid for a variable
arr of type int[?] to flow into a variable of type int[n] (for
some n). The safety of this relies on the verifier to establish
that |arr| == n.
Typing
An important question is how expressions involving statically-sized data types will be typed. For example, consider this Whiley program:
function sum(int:16 x) -> (int:32 y):
return x + 1
What type should be inferred for the expression x+1? There are two
possible approaches:
- Forward Propagation. In this approach, the type of an expression
is inferred by determining the worse-case bound on its output. In
the above example, the inferred type would
int:17. - Backward Propagation. In this approach, the type of an
expression is inferred by from the target location. That is, the
type of the variable into which it will flow. In the above example,
the inferred type would
int:32as this follows from the return type.
This proposal does not currently mandate which of these approaches should be taken. This is because, at this stage, it is unclear what effect either of these will have.
Some guidelines regarding typing of expressions (see this paper for more details):
operator+(int:n,int m)produces a result of typeint:(n+m).operator-(int:n,int m)produces a result of typeint:(n+m).operator*(int:n,int m)produces a result of typeint:(n*m).operator/(int:n,int m)produces a result of typeint:n.operator%(int:n,int m)produces a result of typeint:m.operator||(T[])produces a result of typeusize.operator[](T[],usize)produces a result of typeT.
Further details regarding the typing of operators will need to be worked out. We note the above implies the operands of an operator are not required to have the same static size, as this would be quite unwieldy.
Verification
Whilst coercions between different statically-sized types happen
implicitly at the type level, safety constraints are imposed during
verification. We now clarify the constraints imposed, where x^y
is taken to mean x to the power of y (e.g. 2^4 gives 16).
-
Signed Integers. Values of a signed integer
int:nshould be in the range-2^(n-1)to2^(n-1)-1. -
Unsigned Integers. Values of an unsigned integer
uintshould be in the range0to2^n.
For a more detailed explanation of these constraints, see the Wikipedia page on Two's Complement. As an example, the following program should verify:
function f(int:8 x) -> (int:4 r):
if x >= 0 && x <= 7:
return x
else:
return 0
This verifies primarily because 7 <= 2^(4-1)-1 reduces to 7 <= 7
which clearly holds. In contrast, the following does not verify:
function f(int:8 x) -> (int:4 r):
return x
This fails to verify because 2^(8-1)-1 <= 2^(4-1)-1 reduces to 127 <= 7
which clearly does not hold.
Terminology
-
Statically-Sized Type. A statically-sized type is one whose representation size (in bits) is known at compile time.
-
Dynamically-Sized Type. A dynamically-sized type is one whose representation size (in bits) is unknown at compile time.
-
Fixed-Length Array. A fixed-length array is one whose length (in elements) is known at compile time.
-
Unknown-Length Array. An unknown-length array is one whose length is unknown either at compile time or runtime.
Drawbacks and Limitations
- Implicit Coercions. The flexibility offered by implicit coercions between types of different size presents a potential hazard. In particular, for programs which are not verified, there is a potential (and silent) loss of information.
Unresolved Issues
- Constraints on
usize. At this time, it's unclear how one can safely verify programs involving types ofusizeas there are no implicit bounds placed on them. For the case of a variable (e.g.i) being used as an index into an array (e.g.arr) it is relatively straightforward as we already requirei < |arr|. Hence,imust fit safetly into the typeusize. However, consider an expression of the formx < |arr|wherexhas some statically sized typeint:n. Should we castxto typeusizefor the comparison? This seems questionable since we could lose information. A clever compiler could spot this and returnfalsefor the condition, but this is unclear.