Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tagged unions #1132

Open
15 of 23 tasks
degory opened this issue Mar 21, 2024 · 1 comment
Open
15 of 23 tasks

Tagged unions #1132

degory opened this issue Mar 21, 2024 · 1 comment

Comments

@degory
Copy link
Owner

degory commented Mar 21, 2024

Tagged Union Types

Introduction

This issue proposes the addition of tagged union types to the ghūl programming language. Tagged unions (also known as discriminated unions or sum types) allow defining a type as a union of distinct cases, each of which can carry its own data.

Syntax

The syntax for defining tagged unions in ghūl is:

union Shape is
    CIRCLE(radius: float);
    SQUARE(side: float);
si

union Option[T] is
    SOME(value: T);
    NONE;
si

union Result[T, E] is
    OK(value: T);
    ERROR(error: E);
si
  • A union is defined using the union keyword followed the union name, then an optional list of generic type parameters, and finally an is / si delimited body
  • Each variant of the union is defined with the variant name followed by an optional parenthesis delimited list of fields and terminated with a semicolon
  • Fields are specified as name: type pairs.
union_definition ::= "union" identifier type_parameters? modifiers? "is" variant_definition+ "si"

type_parameters ::= "[" type_parameter ("," type_parameter)* "]"
type_parameter ::= identifier

variant_definition ::= identifier variant_fields? ";"
variant_fields ::= "(" variant_field ("," variant_field)* ")"
variant_field ::= identifier ":" type_expression

type_arguments ::= "[" type_expression ("," type_expression)* "]"

identifier ::= ...
type_expression ::= ...
modifiers ::= ...

Semantics

  • A union type is a sum type that can hold a value of any of its variants at runtime.
  • Variants are distinguished by their names and the types of their fields.
  • Variants without fields are called unit variants (e.g., NONE in Option[T]).
  • Unions can be generic, with type parameters specified in square brackets after the union name (e.g., Option[T], Result[T, E]).
  • While it may be useful in future to allow generic variants, the initial release will not support them

Immutability

  • Values of union types are immutable after construction: its not possible to change their variant, or the fields of their variant

Construction

Unions are constructed using the new keyword followed by the qualified variant name (including any necessary actual generic type arguments) and any necessary field values. Note that type arguments are applied to the union type, not the variant:

let circle: Shape = new Shape.CIRCLE(10.0);
let square: Shape = new Shape.SQUARE(5.0);
let some_int: Option[int] = new Option[int].SOME(42);
let ok_result: Result[int, string] = new Result[int, string].OK(42);
let error_result: Result[int, string] = new Result[int, string].ERROR("Error message");

Although unit variants contain no fields, construction still requires parentheses, for consistency with use of new with other parameterless constructors in ghūl:

let none_int: Option[int] = new Option[int].NONE();

For convenience, constructor functions can be defined for commonly used unions, to avoid having to specify generic argument types:

some[T](value: T) -> Option[T] => new Option[T].SOME(value);
none[T]() -> Option[T] => new Option[T].NONE();
ok[T, E](value: T) -> Result[T, E] => new Result[T, E].OK(value);
error[T, E](error: E) -> Result[T, E] => new Result[T, E].ERROR(error);

Representation

The compiler can choose the most efficient representation for each union type based on its characteristics:

  • For unions with a small number of variants and small total size, a single value type (struct) can be used, with a discriminator field to identify the active variant and fields for the variant data. In particular unions with only two variants, one of which is a unit variant are an obvious candidate for struct representation.
  • For unions with a large number of variants or large total size, a reference type (class) hierarchy can be used, with a base class for the union and derived classes for each variant.
  • The compiler may use additional optimizations, such as storing variant fields at overlapping offsets, when it can prove it is safe to do so.
  • The compiler will guarantee that a union representation with the same variants and variant fields will remain compatible across builds, so that multiple assemblies in a project consuming the same union will interoperate. However, backwards compatibility won't be guaranteed if variants are added to a union or if fields are added to a variant.

The initial implementation may be limited representing a union and variants as a base class and derived classes, in which case struct representation support will be delivered under an separate issue.

Requirements Checklist

  • Define unions with the union keyword, name, optional generic type parameters, and is / si delimited body
  • Define union variants with name, optional fields, terminated with semicolon
  • Support unit variants (variants with no fields)
  • Support generic union types
  • Enforce immutability of union values after construction
  • Construct unions with new keyword, qualified variant name, generic type arguments applied to union type, and field values
  • Require parentheses in construction of unit variants for consistency
  • Support defining constructor functions for commonly used unions to avoid specifying generic argument types
  • Choose efficient representation for union types based on characteristics (struct for small unions, class hierarchy for large unions)
  • Guarantee compatible representation across builds for same union definition
  • Represent unions as base class and variants as derived classes

Optional:

  • Support struct representation for unions

Implementation Notes

The parser, the type expression syntax trees, and the generic type system all need some changes in order to handle variants of a generic union type.

Type Expression Syntax Trees

Need to update syntax trees so we can represent type member expressions where the part to the left of the dot is a generic type application

Parser

Need to update the parser to handle type member expressions where the part to the left of the dot is a generic type application

Type System and Generic Specialization

Specialization of variants of generic unions needs special handling.

In principle we could specialize variants in the usual way, treating them as members of their owning union and replacing all references to the parent unions formal type arguments within the variant with the actual type arguments supplied in the generic application type expression.

However, the .NET representation of the variants will be as classes in their own right. Hence it's more appropriate to make them generic classes with the same formal type arguments as the owning union as this will better align with the actual .NET representation.

So

union Option[T] is
    SOME(value: T);
    NONE;
si

will be represented as

class Option[T] is
    _tag: ubyte; // unless we use a type check or a virtual method to determine variant type
si

class SOME[T]: Option[T] is
    value: T;
si

class NONE[T]: Option[T] is
  // even though NONE has no fields, it needs to be assignment compatible with
  // Option[T] so it still needs a formal type argument T that is forwarded to Option[T]
si

Implementation Notes

The parser, the type expression syntax trees, and the generic type system all need some changes in order to handle variants of a generic union type.

Type Expression Syntax Trees and Parser

Need to update syntax trees and the parser to represent and handle type member expressions where the part to the left of the dot is a generic type application, such as Option[T].SOME.

Type System and Generic Specialization

Specialization of variants of generic unions needs special handling.

In principle, we could specialize variants in the usual way, treating them as members of their owning union and replacing all references to the parent union's formal type arguments within the variant with the actual type arguments supplied in the generic application type expression.

However, the .NET representation of the variants will be as classes in their own right. Hence, it's more appropriate to make them generic classes with the same formal type arguments as the owning union, as this will better align with the actual .NET representation.

For example:

union Option[T] is
    SOME(value: T);
    NONE;
si

will be represented as:

class Option[T] is
    _tag: ubyte; // unless we use a type check or a virtual method to determine variant type
si

class SOME[T]: Option[T] is
    value: T;
si

class NONE[T]: Option[T] is
  // even though NONE has no fields, it needs to be assignment compatible with
  // Option[T] so it still needs a formal type argument T that is forwarded to Option[T]
si

When declaring symbols, we need to ensure:

  • Variants take the same formal type arguments as their parent union
  • Variants are derived from their parent union
  • Variants supply their formal type arguments as actual type arguments to the parent union
  • Unions are derived from System.Object

Because variants contain no code that could reference any symbols, there's actually no need for them to be inside their parent's scope. It may actually be easier, however, not to attempt to break them out of it - it likely won't matter in practice either way. Plus, the one symbol in the parent we do potentially want to reference from the variants (the tag, which we might need to reference from compiler-generated code) will be inherited.

For a type expression like Option[int].SOME, we can't construct a GENERIC type or a GENERIC symbol that represents it, nor do we want to since we actually want to apply the type arguments to the variant, not the union. Hence, we should transform Option[int].SOME into Option.SOME[int].

We could do this blindly in the parser, before we even know what kind of symbol Option will be declared as, since the only kinds of generic types that contain (accessible) type members will be unions. This wouldn't be very future-proof, however, and it might result in confusing error messages.

It would be preferable to defer rewriting Option[int].SOME into Option.SOME[int] until the resolve-types phase, at which point we'll be able to check that Option really is a union, that it's generic and takes one type parameter, and that SOME is a member of it. We can then pretend the user had actually written a qualified identifier generic type application Option.SOME[int] and proceed to specialize that. In the case of any errors resolving the type, though, we will still have the information needed to report it as Option[int].SOME.

We should also notify the symbol use listener that both Option and SOME are referenced here, to ensure hover, symbol navigation, and rename all work (in particular, rename will fail to rename Option otherwise as it won't be referenced in any phase after resolve-types).

For a type expression Option[int].SOME, we should not attempt to specialize Option[int] (making a copy of Option[T] with int substituted for T) and search for member SOME within it. Instead, we should search the unspecialized non-generic Option class for its member SOME[T] and then specialize that with actual type argument int.

Within the body of a variant, any references in type expressions to other variants of the containing union should not be legal. References to the containing union are allowed - this is required to support recursive union types such as lists and trees. If a union is generic, then references to its type within the body of one of its variants need appropriate formal type arguments, which can (and probably should) be the formal type arguments of the union itself.

union Tree[T] is
    NODE(value: T, left: Tree[T], right: Tree[T]);
    LEAF;
si

We need to synthesize an init(...) method for each variant that takes an argument for each of the variant's fields and then initializes the fields accordingly

union Tree[T] is
    NODE(value: T, left: Tree[T], right: Tree[T]);
    LEAF;
si
...
class NODE[T]: Tree[T] is
    value: T;
    left: NODE[T];
    right: NODE[T];

    init(value: T, left: Tree[T], right: Tree[T]) is
        self.value = value;
        self.left = left;
        self.right = right;
    si
si

This can be done similarly to synthesized read and assign accessor methods - we can just construct appropriate syntax trees and inject them into the variant's syntax tree before the define symbols pass (e.g. in add accessors for properties pass)

However, unit variant instances need to be singletons and this complicates things. If the code that handles new expressions needs to be updated to handle this case, then perhaps instead we can add two new IR values - one for instantiating variants with arguments and the other for getting the singleton value for a given unit variant

Implementation Checklist

  • Update syntax trees to represent type member expressions where the part to the left of the dot is a generic type application (e.g., Option[T].SOME)
  • Update the parser to handle type member expressions where the part to the left of the dot is a generic type application when parsing type expressions
  • When parsing variants synthesize an appropriate init() method
  • When declaring symbols, ensure variants are derived from their containing union
  • When declaring symbols, ensure variants are given the same formal type parameters as their containing union
  • In the resolve-types pass, translate type expressions like Option[T].SOME to Option.SOME[T]
  • In the resolve-types pass, if a generic type application's member is not found, reference the original type expression, not the translated expression (i.e., report that SOME is not found in Option[T], not that SOME[T] isn't found in Option)
  • Ensure that within the body of a variant, any references in type expressions to other variants of the containing union are reported as an error
  • Notify the symbol use listener of references to both the union and its variant when resolving type expressions like Option[int].SOME, to support hover, symbol navigation, and rename functionality
  • Handle construction of variants with arguments, either via a synthesized init() method, by a new IR value, or by an innate call
  • Handle getting a unit variant type's singleton instance, by a new IR value or by an innate call
@degory degory changed the title Tagged unions Tagged unions with pattern matching Mar 21, 2024
@degory degory mentioned this issue Mar 22, 2024
19 tasks
@degory degory changed the title Tagged unions with pattern matching Tagged unions Mar 22, 2024
@degory
Copy link
Owner Author

degory commented Mar 22, 2024

Prerequisite for #1134

degory added a commit to degory/ghul-vsce that referenced this issue Mar 22, 2024
Enhancements:
- Highlight the `union` keyword (see degory/ghul#1132)
- Highlight !!! (error) and ??? (not inferred) types as invalid / red in in hover
degory added a commit to degory/ghul-vsce that referenced this issue Mar 22, 2024
Enhancements:
- Highlight the `union` keyword (see degory/ghul#1132)
- Highlight !!! (error) and ??? (not inferred) types as invalid / red in in hover
degory added a commit that referenced this issue Mar 26, 2024
Enhancements:
- Initial support for unions (see #1132)
- Union variant member access and variant type test properties (see #1139)
degory added a commit that referenced this issue Mar 26, 2024
Enhancements:
- Initial support for unions (see #1132)
- Union variant member access and variant type test properties (see #1139)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

1 participant