- Feature Name: statically_sized_types
- Start Date: 22-05-2017
- RFC PR: #3
This proposal is for introducing statically sized types to the Whiley language. Such types occupy a fixed (and known) amount of space at runtime. For example, a statically sized integer type might occupy exactly 32bits at runtime.
The Whiley Language currently supports dynamically-sized data types.
For example, the int
type in Whiley is unbound and can
hold any possible integer (theoretically, at least). Such types are
naturally difficult to represent efficiently on a machine.
Furthermore, they impede communication between Whiley and foreign
code which employs fixed-sized data types (e.g. int32
in C).
This proposal addresses the problem of specifying what underlying representation a given type should have. That is, providing a way for the programmer to dictate in absolute terms what representation should be used. This allows the programmer to optimise the underlying representation as he/she sees fit. It also enables them to interface with foreign code more easily.
Several alternatives that could be considered to this proposal:
-
Library definitions of statically sized types. The standard library currently includes various definitions of statically sized types (e.g.
u8
,i16
,u32
, etc). Knowledge of these could be hard-coded into the compiler. However, this unnecessarily couples the compiler with the standard library. Furthermore, there are certain foreign data types (e.g. C strings) which cannot be easily modelled in this way. -
Range Analysis. Another approach previously explored was the use of integer range analysis (see this paper). Here, the compiler examines the invariants given on a datatype to determine an appropriate representation. For example, consider a type
(int n)
wheren >= 0 && n <= 255
. The range returned for this type would be0 .. 255
and, hence, it could fit into a singlebyte
. This approach improves upon the previous in that it does not couple the compiler with the standard library. However, it still cannot encode certain data types (e.g. C strings). Furthermore, the range analysis process is hard to state clearly because of the complex nature of invariants in Whiley. For example, what range should we return for the constraintn >= 0 && (n <= 255 || n < 128)
? Probably0 .. 255
makes the most sense here, but invariant expressions can be arbitrarily complex and we need a clear and simple mechanism for specifying representation. Finally, using range analysis makes interoperation with foreign code more challenging. This is because we may have foreign data types whose representation is not minimal for their true invariants. For example, anint
parameter with only seven possible values (e.g.1
..7
). In such case, we need to separate the representation from the constraints.
As a result of their limitations, the above approaches are not considered suitable for addressing the problem.
This proposal introduces the following data types which are referred to as statically sized:
- Signed Integers. The type of statically-sized signed integers is
int:n
, wheren
is a numeric constant denoting the number of bits of precision used for a twos-compliment representation (typically8
,16
,32
, etc). - Unsigned Integers. The type of statically-sized unsigned
integers is
uint:n
, wheren
is a numeric constant denoting the number of bits of precision used (typically8
,16
,32
, etc). In addition, the special typeusize
denotes a fixed-width unsigned integer sufficient to the hold the length of an array. - Arrays. The type of fixed-length arrays is given by
T[n]
wheren
is an unsigned numeric constant. A fixed-length array is statically sized if its element typeT
is statically sized. For a fixed-length arrayarr
of typeT[n]
, it is implicit that|arr| == n
. - Unknown-Length Arrays. The type of unknown-length arrays is
given by
T[?]
. Such arrays have a maximum length, but this is unknown and the array length operator is defined for such types only within specification elements.
In addition, the existing types int
and T[]
are referred to as
dynamically sized. In contrast, the existing types bool
, null
,
&T
and function/method types are referred to as statically sized.
The status of other existing compound types (e.g. records, unions,
intersections, negations) is determined by their element types.
Specifically, if all element types are statically-sized then the
enclosing type is statically sized. For example, { int:8 x, int:8 y}
is statically sized, whilst int|bool
is not.
The proposed types allow one to specify foreign functions more
precisely. For example, the well-known strncpy
function has the
following type in C:
char *strncpy(char *, const char *, size_t);
Using the above types, we can give a corresponding Whiley type for this:
type C_string is (uint:8[?] items)
// C strings are null terminated
where some { i in 0..|items| | items[i] == NULL }
&C_string strncpy(&C_string , &C_string, usize)
Here, the type &(uint:8[?])
denotes a reference to an array of
bytes of unknown length. The key is that denoting the array as having
unknown length affects its underlying representation. Specifically,
no length field is stored with the array. Thus, it can be represented
as just a sequence of zero or more elements to match the C representation.
The above illustrates the use of the array length operator on an
unknown-length array in a specification element (i.e. |items|
). By
allowing this operator in specification elements, we can still reason
about the array's actual length in an abstract sense.
Finally, as an example to illustrate, we can provide the following mapping from C99 fixed-width data types:
int8_t
=>int:8
,int16_t
=>int:16
,int32_t
=>int:32
, etcuint8_t
=>uint:8
,uint16_t
=>uint:16
,uint32_t
=>uint:32
, etcsize_t
=> usizechar
=>uint:8
Similar mappings exist for other languages. For example, a Java int
maps to int:32
whilst a long
maps to int:64
, etc. Likewise, a
Java int[]
maps to a int:32[]
, etc.
This proposal treats the status of a type (i.e. statically sized or
dynamically sized) as a type modifier rather than as a type. This
has significant implications for the meaning of such types. For
example, a variable of type int:32
can flow into a variable of type
int:8
without the need for an explicit cast. The following
illustrates a valid Whiley function under this proposal:
function abs(int:32 x) -> (uint:32 r)
ensures r >= 0:
//
if x >= 0:
return x
else:
return -x
From a typing perspective, this function allows a variable of type
int:32
to flow into one of type uint:32
. This is clearly unsound
in an unrestricted fashion. For example, -1
is a valid member of
type int:32
but not uint:32
. Nevertheless, the above program is
considered valid under this proposal.
The reason the above program is considered valid is that, under verification, the constraints imposed by the different integer representations are shown to hold. In other words, validity of the underlying representation depends upon verification.
As one exception, we note that int[?]
cannot flow into int[]
(with
or without a cast). This is simply because it is logically impossible
to implement as, at the point we have an instance of int[?]
we no
longer have a length variable. However, it is valid for a variable
arr
of type int[?]
to flow into a variable of type int[n]
(for
some n
). The safety of this relies on the verifier to establish
that |arr| == n
.
An important question is how expressions involving statically-sized data types will be typed. For example, consider this Whiley program:
function sum(int:16 x) -> (int:32 y):
return x + 1
What type should be inferred for the expression x+1
? There are two
possible approaches:
- Forward Propagation. In this approach, the type of an expression
is inferred by determining the worse-case bound on its output. In
the above example, the inferred type would
int:17
. - Backward Propagation. In this approach, the type of an
expression is inferred by from the target location. That is, the
type of the variable into which it will flow. In the above example,
the inferred type would
int:32
as this follows from the return type.
This proposal does not currently mandate which of these approaches should be taken. This is because, at this stage, it is unclear what effect either of these will have.
Some guidelines regarding typing of expressions (see this paper for more details):
operator+(int:n,int m)
produces a result of typeint:(n+m)
.operator-(int:n,int m)
produces a result of typeint:(n+m)
.operator*(int:n,int m)
produces a result of typeint:(n*m)
.operator/(int:n,int m)
produces a result of typeint:n
.operator%(int:n,int m)
produces a result of typeint:m
.operator||(T[])
produces a result of typeusize
.operator[](T[],usize)
produces a result of typeT
.
Further details regarding the typing of operators will need to be worked out. We note the above implies the operands of an operator are not required to have the same static size, as this would be quite unwieldy.
Whilst coercions between different statically-sized types happen
implicitly at the type level, safety constraints are imposed during
verification. We now clarify the constraints imposed, where x^y
is taken to mean x
to the power of y
(e.g. 2^4
gives 16
).
-
Signed Integers. Values of a signed integer
int:n
should be in the range-2^(n-1)
to2^(n-1)-1
. -
Unsigned Integers. Values of an unsigned integer
uint
should be in the range0
to2^n
.
For a more detailed explanation of these constraints, see the Wikipedia page on Two's Complement. As an example, the following program should verify:
function f(int:8 x) -> (int:4 r):
if x >= 0 && x <= 7:
return x
else:
return 0
This verifies primarily because 7 <= 2^(4-1)-1
reduces to 7 <= 7
which clearly holds. In contrast, the following does not verify:
function f(int:8 x) -> (int:4 r):
return x
This fails to verify because 2^(8-1)-1 <= 2^(4-1)-1
reduces to 127 <= 7
which clearly does not hold.
-
Statically-Sized Type. A statically-sized type is one whose representation size (in bits) is known at compile time.
-
Dynamically-Sized Type. A dynamically-sized type is one whose representation size (in bits) is unknown at compile time.
-
Fixed-Length Array. A fixed-length array is one whose length (in elements) is known at compile time.
-
Unknown-Length Array. An unknown-length array is one whose length is unknown either at compile time or runtime.
- Implicit Coercions. The flexibility offered by implicit coercions between types of different size presents a potential hazard. In particular, for programs which are not verified, there is a potential (and silent) loss of information.
- Constraints on
usize
. At this time, it's unclear how one can safely verify programs involving types ofusize
as there are no implicit bounds placed on them. For the case of a variable (e.g.i
) being used as an index into an array (e.g.arr
) it is relatively straightforward as we already requirei < |arr|
. Hence,i
must fit safetly into the typeusize
. However, consider an expression of the formx < |arr|
wherex
has some statically sized typeint:n
. Should we castx
to typeusize
for the comparison? This seems questionable since we could lose information. A clever compiler could spot this and returnfalse
for the condition, but this is unclear.