RTValue, MDM, IR #529

edwardpeters · 2024-02-08T16:05:51Z

edwardpeters
Feb 8, 2024
Collaborator

In the morphir-scala world, we have three very closely linked concepts:

IR (also called Value, but not in this document), which is Morphir IR that represents logic
MDM (Morphir Data Model), which is a representation of pure data (not logic) used by Morphir on its peripheries
RTValue, which is a representation used internally to the evaluator runtime, to represent the results of evaluation (which may not be pure data)

In this document, we refer to “Value” as any instance of any of these three - IR.Value, RTValue, or MDM.Data.

These three types are very similar: You can represent

List(
Tuple(Int(1), String(“Red”)),
Tuple(Int(2), String(“Green”))
)

in any of the above, and the trees will look quite similar. If you run this through the evaluator path, you would start with Elm code like [(1, “Red”), (2, “Green”)], which would be compiled to IR, which the evaluator would evaluate to RTValues and then convert to MDM; each structure along the way would look almost identical, while existing in completely different type trees.
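To make the "same shape, different type trees" point concrete, here is a minimal sketch using toy stand-ins for the three models. None of these names (ThreeModelsSketch, IntLit, StrLit, TupleOf, ListOf, evaluate, toData) are the real morphir-scala types; they are assumptions made only so that the example is self-contained.

```scala
// Toy stand-ins for the three models; the real morphir-scala types are much richer.
object ThreeModelsSketch {
  sealed trait IR
  object IR {
    final case class IntLit(value: Int)           extends IR
    final case class StrLit(value: String)        extends IR
    final case class TupleOf(elements: List[IR])  extends IR
    final case class ListOf(elements: List[IR])   extends IR
  }

  sealed trait RTValue
  object RTValue {
    final case class IntLit(value: Int)               extends RTValue
    final case class StrLit(value: String)            extends RTValue
    final case class TupleOf(elements: List[RTValue]) extends RTValue
    final case class ListOf(elements: List[RTValue])  extends RTValue
  }

  sealed trait Data
  object Data {
    final case class IntLit(value: Int)            extends Data
    final case class StrLit(value: String)         extends Data
    final case class TupleOf(elements: List[Data]) extends Data
    final case class ListOf(elements: List[Data])  extends Data
  }

  // IR -> RTValue: for pure data this is a node-by-node copy into another type tree.
  def evaluate(ir: IR): RTValue = ir match {
    case IR.IntLit(v)   => RTValue.IntLit(v)
    case IR.StrLit(v)   => RTValue.StrLit(v)
    case IR.TupleOf(es) => RTValue.TupleOf(es.map(evaluate))
    case IR.ListOf(es)  => RTValue.ListOf(es.map(evaluate))
  }

  // RTValue -> MDM: again structurally identical for pure data.
  def toData(rt: RTValue): Data = rt match {
    case RTValue.IntLit(v)   => Data.IntLit(v)
    case RTValue.StrLit(v)   => Data.StrLit(v)
    case RTValue.TupleOf(es) => Data.TupleOf(es.map(toData))
    case RTValue.ListOf(es)  => Data.ListOf(es.map(toData))
  }

  val ir: IR = IR.ListOf(List(
    IR.TupleOf(List(IR.IntLit(1), IR.StrLit("Red"))),
    IR.TupleOf(List(IR.IntLit(2), IR.StrLit("Green")))
  ))
  val data: Data = toData(evaluate(ir))
}
```

The three trees only diverge once you add nodes that exist in one model but not the others (variables, references, lambdas), which is what the rest of this document is about.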

However, the three formats have significant differences:
What Each Represents (And how):

MDM
- Consists of only data, no logic
- Stores types such as LocalDate and HashMap directly in equivalent scala types
- Data is fully defined locally without references
- Stores type information alongside values (“Data” and “Concept”)
IR
- Consists of all morphir logic in its unevaluated state
- Represents many types only as “Apply” calls
- Values may be defined as variables or references, requiring local or global scope
- Stores type information alongside values
RTValue
- Consists of the results of evaluating morphir IR
- May represent values which are not “pure” data, such as lambdas
- Stores types such as LocalDate directly in equivalent scala types
- Values may refer to references requiring global scope, but never depend on local scope
- (Currently) does not store type information at all
- May contain IR nested beneath it, in the case of function values.

Dimensions
Representation of Complex Types

Values of a number of types including LocalDates, Maps and user-defined types may be represented differently between the models

MDM and RTValue represent Scala-equivalent types as their native Scala types - a LocalDate is a java.time.LocalDate - and user-defined types as a map of values with a name
IR can only represent these types as the Apply nodes to functions (references or constructors) that create them. A LocalDate will be represented as something like Apply(Apply(Apply(“LocalDate.fromParts”, 1989), 1), 5)
- It is a possible action item to flatten out this currying and allow apply nodes to take multiple arguments in the IR. This would simplify these, but they would still be represented as apply nodes with nested IR representing the function and arguments.
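As a rough illustration of the curried-Apply representation, the sketch below unwinds nested Apply nodes into a head and an argument list, then recovers a java.time.LocalDate. The node types, the uncurry helper and the string form of the reference are all hypothetical, and the (year, month, day) argument order is just taken from the example above.

```scala
import java.time.LocalDate

object CurriedApplySketch {
  // Toy IR: only the nodes needed for this example.
  sealed trait IR
  final case class Reference(fqn: String)            extends IR
  final case class IntLit(value: Int)                extends IR
  final case class Apply(function: IR, argument: IR) extends IR

  // Unwind nested Apply nodes into (head, arguments in application order).
  @annotation.tailrec
  def uncurry(ir: IR, args: List[IR] = Nil): (IR, List[IR]) = ir match {
    case Apply(f, a) => uncurry(f, a :: args)
    case head        => (head, args)
  }

  // Recognise a fully applied fromParts call over literals; anything else is None.
  // Note: LocalDate.of throws on an invalid date; a real implementation would guard that.
  def toLocalDate(ir: IR): Option[LocalDate] = uncurry(ir) match {
    case (Reference("LocalDate.fromParts"), List(IntLit(y), IntLit(m), IntLit(d))) =>
      Some(LocalDate.of(y, m, d))
    case _ => None
  }

  val dateIr: IR =
    Apply(Apply(Apply(Reference("LocalDate.fromParts"), IntLit(1989)), IntLit(1)), IntLit(5))
}
```

Even with a flattened, multi-argument Apply, the IR would still hold "the call that makes the date" rather than the date itself; only the uncurry step would go away.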

Self Sufficiency
Trees in each model may make sense “on their own”, or they may include references that depend on some surrounding context. For instance, if true then 1 else 2 is entirely self sufficient; if true then 1 else IntFunctions.foo() requires a global context that includes foo; if true then 1 else x requires a local context in which x is defined.

A consequence of this is that the different representations are differently “Mobile”. An MDM value may be moved to any context (including other runtimes) and maintain its meaning. RTValues may move around within a runtime and remain valid, but may lose or change their meaning if sent to a different runtime with different global bindings. IR subtrees are only guaranteed to be well-defined at the location they appear in the IR; moving them may cause variable or reference nodes within them to lose meaning.

MDM values are always self sufficient and wholly defined within themselves (no variables or references)
RTValues may rely on global context (references) but not on local context
IR may rely on either local or global context
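The three levels of self sufficiency can be sketched as a pair of checks over a toy IR: collect the free variables (local context needed) and the references (global context needed). All node and function names here are hypothetical, not the morphir-scala API.

```scala
object SelfSufficiencySketch {
  sealed trait IR
  final case class IntLit(value: Int)                                      extends IR
  final case class BoolLit(value: Boolean)                                 extends IR
  final case class Variable(name: String)                                  extends IR
  final case class Reference(fqn: String)                                  extends IR
  final case class IfThenElse(cond: IR, thenBranch: IR, elseBranch: IR)    extends IR
  final case class Lambda(param: String, body: IR)                         extends IR

  // Variables not bound by an enclosing Lambda: these need a local context.
  def freeVariables(ir: IR): Set[String] = ir match {
    case Variable(n)         => Set(n)
    case IfThenElse(c, t, e) => freeVariables(c) ++ freeVariables(t) ++ freeVariables(e)
    case Lambda(p, b)        => freeVariables(b) - p
    case _                   => Set.empty
  }

  // References to named definitions: these need a global context.
  def references(ir: IR): Set[String] = ir match {
    case Reference(fqn)      => Set(fqn)
    case IfThenElse(c, t, e) => references(c) ++ references(t) ++ references(e)
    case Lambda(_, b)        => references(b)
    case _                   => Set.empty
  }

  // MDM-like: needs nothing. RTValue-like: may need globals, never locals.
  def isSelfSufficient(ir: IR): Boolean   = freeVariables(ir).isEmpty && references(ir).isEmpty
  def needsOnlyGlobals(ir: IR): Boolean   = freeVariables(ir).isEmpty

  val selfContained: IR = IfThenElse(BoolLit(true), IntLit(1), IntLit(2))
  val needsGlobal: IR   = IfThenElse(BoolLit(true), IntLit(1), Reference("IntFunctions.foo"))
  val needsLocal: IR    = IfThenElse(BoolLit(true), IntLit(1), Variable("x"))
}
```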

Unique/Canonical Representation
Depending on the model, a single value may or may not have different ways it can be represented (for some definition of single value).

MDM (Possibly excluding type information, below) does (I think?) have unique representations of all values: List(1, 2) can only ever be Data.List(Data.Int(1), Data.Int(2))
IR may have any number of representations of a single value: 1 + 2, if true then 3 else 4, 5 - 2 and if true then 3 else 5 are all valid representations of “3”
RTValues have unique representations of non-function data; whether function data is uniquely represented depends upon a definition of function equivalence that even I don’t want to get into.

Type Information
Trees in each model may or may not include type information on each element. In some cases, this type information may be derivable from the tree itself - if true then “Red” else “Green” is always a String, for instance - but in others the explicit type information may be necessary to know the type, such as for empty lists or Maybe.Nothing.

IR and MDM have type information alongside value information
RTValues currently hold no type information at all
This was originally done partially out of laziness (or, more charitably, efficiency), and partially for performance concerns

Adding type information to RTValues may be a priority

Type Level Self Sufficiency
Similar to the values, this type information may or may not depend upon surrounding context

MDM holds types that are self sufficient - the entire type tree is defined
“Aliases” exist, but the alias tree includes not just the reference, but the thing being referenced
As an aside, this means that MDM type trees may be inconsistent (e.g., if two things have the same alias but different defined types)
IR type information may be contextual
Reference types exist, which require a context in which the reference may be looked up to know the actual type
Type Variables exist, which would only have concrete type in an actual invocation of the function that contains them

This may complicate attaching type information to RTValues, as RTValues can “move” in a way which IR cannot, and could leave the scope in which their type information is valid

Type Level Unique/Canonical Representation
Similar to values, types may be represented in different ways. This complicates type checking, as asserting that two types are equivalent is not a simple structural equality check.

MDM has (mostly) only one way to define a type
“Alias” types muddy the waters somewhat, but exist because Alias differences are treated as real to represent translation to scala types that would otherwise be indistinct
IR has many ways to define a type (references, aliases, type variables, and let’s not talk about extensible record types)

Remaining Work To Do
A value may be “done” - just data - or it may have remaining computation.

MDM is always “done”
IR can represent “more to do” - i.e., if true then 1 else 2 can still be evaluated further. This means that IR can represent things that are not pure data, such as function calls to infinite loops.
RTValues are more complicated, as they can represent functions. These functions can’t be evaluated without an argument being passed to them, so unlike the if statement, there is no immediate work to do, but more may be done later.

How Each is Used

MDM is used to represent data both as inputs and outputs of the evaluator. Additionally, derivers convert Scala values to MDM. These derivers (currently) rely on Scala 3 macros to parse type information and create the derivation functions
RTValue is used internally by the evaluator. The recursive evaluation loop of the evaluator produces RTValues, and the “store” of bound variables/function parameters is essentially a map from variable names to RTValues
IR is, of course, the logic of a compiled Elm program
IR is ALSO used as input to the evaluator - see “How Evaluator Input Works”

Possible Changes & Challenges

Switch How Evaluator Input Works

Current Model

The current model of calling a function in a morphir IR file with a scala value is to create a synthetic IR Node representing the application of IR representing that scala value to a reference to that function, and evaluate this in a context which has the global namespace from the IR file. Step by step:

You have the FQN of the entry point function you want to call - e.g., “MyPackage:MyModule:foo”
You have the scala value you want to call it with - e.g. “Red”
The scala value gets converted to MDM - e.g., Data.String(“Red”)
The MDM gets converted to IR - Value.Literal(StringLiteral(“Red”))
A reference IR node is created for the entry point - Value.Reference(fqn”MyPackage:MyModule:foo”)
An apply node is created combining the reference node with the input IR:
Value.Apply(Value.Reference(fqn”MyPackage:MyModule:foo”), Value.Literal(StringLiteral(“Red”)))
That apply node is evaluated (with the global namespace of that IR file available, including the actual definition of the function foo)
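The steps above, up to (but not including) the final evaluation, can be sketched as follows. Data, IR, dataToIR and entryPointCall are toy names invented for this example, and the FQN is kept as a plain string rather than the real fqn"..." interpolator.

```scala
object SyntheticApplySketch {
  // Toy MDM: only the cases needed here.
  sealed trait Data
  final case class StringData(value: String) extends Data
  final case class IntData(value: Int)       extends Data

  // Toy IR.
  sealed trait IR
  final case class StringLiteral(value: String)      extends IR
  final case class IntLiteral(value: Int)            extends IR
  final case class Reference(fqn: String)            extends IR
  final case class Apply(function: IR, argument: IR) extends IR

  // MDM -> IR. Trivial for primitives; maps, dates and user-defined types
  // would instead become nested Apply nodes (the "recipe for cake").
  def dataToIR(data: Data): IR = data match {
    case StringData(s) => StringLiteral(s)
    case IntData(i)    => IntLiteral(i)
  }

  // Build the synthetic node that the evaluator is then asked to evaluate.
  def entryPointCall(fqn: String, input: Data): IR =
    Apply(Reference(fqn), dataToIR(input))

  val call: IR = entryPointCall("MyPackage:MyModule:foo", StringData("Red"))
}
```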

This model can be tricky for some types, such as maps, local dates and user-defined types: in these cases, the IR “Representation” of the input becomes a sequence of apply nodes to functions that would produce the value. For analogy, if you want to give the evaluator a cake, you de-construct the cake into flour, eggs, milk and sugar, and pass those to the evaluator along with a recipe for cake.

This poses a potential performance concern when chaining multiple evaluator runs together, with each fed the output of the former.

Possible Alternative
An alternative model has occasionally been considered, which would instead convert the input to an RTValue, look up the body of the entry point function, and evaluate that body in a context including a store pre-populated with the input (bound to the argument name specified in the entry point). In this model:

You have the FQN of the entry point function you want to call - e.g., “MyPackage:MyModule:foo”
You have the scala value you want to call it with - e.g. “Red”
You look up the entry point in your IR, and find a function definition with parameters and a body - say the parameter is x, and the body is an IR representation of “if x == “Red” then 1 else 2”
You convert the input to an RTValue (optionally passing through MDM in the process), getting RTValue.Primitive.String(“Red”)
You create a store in which this parameter is bound to the name - Store(Map(“x” -> RTValue.Primitive.String(“Red”)))
You evaluate the body of the entry point with that local variable store
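A minimal sketch of this alternative, with a toy evaluator: the entry point body is evaluated directly against a store that already holds the input as an RTValue. Everything here (FunctionDef, Store as a plain Map, callEntryPoint, the node names) is an assumption made for illustration and not the actual evaluator.

```scala
object PrePopulatedStoreSketch {
  sealed trait IR
  final case class IntLit(value: Int)                                   extends IR
  final case class StrLit(value: String)                                extends IR
  final case class Variable(name: String)                               extends IR
  final case class Equals(left: IR, right: IR)                          extends IR
  final case class IfThenElse(cond: IR, thenBranch: IR, elseBranch: IR) extends IR

  sealed trait RTValue
  final case class RTInt(value: Int)        extends RTValue
  final case class RTString(value: String)  extends RTValue
  final case class RTBool(value: Boolean)   extends RTValue

  // A single-parameter function definition, as found in the IR's global namespace.
  final case class FunctionDef(param: String, body: IR)
  type Store = Map[String, RTValue]

  def evaluate(ir: IR, store: Store): RTValue = ir match {
    case IntLit(v)    => RTInt(v)
    case StrLit(v)    => RTString(v)
    case Variable(n)  => store.getOrElse(n, throw new NoSuchElementException(s"unbound variable $n"))
    case Equals(l, r) => RTBool(evaluate(l, store) == evaluate(r, store))
    case IfThenElse(c, t, e) =>
      evaluate(c, store) match {
        case RTBool(true)  => evaluate(t, store)
        case RTBool(false) => evaluate(e, store)
        case other         => throw new IllegalStateException(s"non-boolean condition: $other")
      }
  }

  // Look up the entry point, bind the input to its parameter name, evaluate the body.
  def callEntryPoint(globals: Map[String, FunctionDef], fqn: String, input: RTValue): RTValue = {
    val fn = globals(fqn)
    evaluate(fn.body, Map(fn.param -> input))
  }

  val globals: Map[String, FunctionDef] = Map(
    "MyPackage:MyModule:foo" ->
      FunctionDef("x", IfThenElse(Equals(Variable("x"), StrLit("Red")), IntLit(1), IntLit(2)))
  )
}
```

Note that the input never takes an IR form here, so chaining runs would just mean passing the resulting RTValue to the next callEntryPoint (subject to the risks discussed below).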

If this model were adopted, it could have two benefits:
We could eliminate the tricky process of representing values as the IR that would produce those values (Passing the evaluator a cake just means passing it the cake)
If multiple evaluator runs are chained together, no conversion would be necessary between the output of one and the input of the next
However, it is possible that risks would be introduced. RTValues can include things like references to functions in the current IR file, which may be invalid if you blindly pass them along to another evaluator instance using a different IR file.

Eliminating RTValue
Given the complexity and overhead of maintaining three models, there have been several suggestions that RTValue be eliminated and replaced with one of the other two.

Replace with MDM
Replacing RTValue with MDM would mean expanding MDM with function representations, which could have IR within them, which could refer to functions specific to a given runtime.

Replace with IR
Replacing RTValue with IR might be more reasonable, but has some non-trivial difficulties to explore:

The representation of some types (such as LocalDates) can be difficult to express as IR; at present, the only way (at least in morphir scala) is to represent them as function calls to SDK functions.
- This isn’t ideal if those results then need to be passed to other functions, as it requires the application of downstream functions to lazily evaluate the call
- This also poses challenges for pattern matching - Some(date) needs to know that it matches LocalDate.fromParts(1, 5, 1989) which is obvious enough when that value is represented as RTValue.ConstructorResult(Maybe.Some, localDate), but less so if it’s Apply(Apply(Apply…
- This could be mitigated by having a canonical representation of a LocalDate, and evaluating something like fromParts not into an application of itself, but into the application of a constructor wrapping a function with a non-optional return.
Significant diligence would be needed to ensure that values do not leak outside the scope in which they are defined. When building a store in response to let x = y, it is not enough to just bind the IR Variable(y) to x in the store and delay evaluation, as x may leave the scope in which y is meaningfully defined.
Implementing lexical scope required attaching the store where a function was defined to that function; thus stores exist in RTValues. To do the same with IR would require representing the store as IR; this is possible by having a LetDefinition enclose it, but clunky and may interact with the above problems.
The significant laziness introduced by this model might be a performance concern. While it’s possible to save time in some cases by not evaluating un-taken code paths, it’s also possible to lose time repeatedly re-evaluating the same code path (i.e., if you have let x = , using x 10,000 times may re-trigger that computation.)
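The re-evaluation concern in the last point can be illustrated with a plain Scala analogy (this is not evaluator code): a binding that is re-computed on every use versus one that is computed once and cached. The Counter class and both function names are made up for the example.

```scala
object LazinessCostSketch {
  // Counts how many times the bound expression is actually computed.
  final class Counter { var evaluations: Int = 0 }

  // Analogous to binding unevaluated IR to a name: every use re-runs the computation.
  def useByName(times: Int, counter: Counter): Int = {
    def x: Int = { counter.evaluations += 1; 21 * 2 }
    (1 to times).map(_ => x).sum
  }

  // Analogous to binding an already-evaluated value (or memoizing): computed once.
  def useMemoized(times: Int, counter: Counter): Int = {
    lazy val x: Int = { counter.evaluations += 1; 21 * 2 }
    (1 to times).map(_ => x).sum
  }
}
```

Both functions return the same total; only the number of evaluations differs (once per use versus once overall), which is the cost that an IR-only store would have to manage explicitly.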

Notably, this model is used in the morphir-elm interpreter - however, that interpreter is unable to handle certain cases, such as partial function application (the code examples below work in native Elm but not in morphir-elm develop):

type MyEnum = TwoArg Int Int

unable_to_compute : Int -> MyEnum
unable_to_compute x = 
    let 
        inner = 
            let y = 5 in \z -> TwoArg y z
    in
        inner x

wrong_result : Int -> MyEnum
wrong_result x = 
    let 
        inner = TwoArg 5
    in
        inner x
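For contrast, here is a rough sketch of how an RTValue-style representation can handle both examples: lambdas evaluate to closures that carry the store they were defined in, and constructors evaluate to partially applied values that collect arguments. This is a toy evaluator with hypothetical names (Closure, PartialConstructor, Let, and so on), not the morphir-scala or morphir-elm implementation; only ConstructorResult echoes a name used earlier in this document.

```scala
object ClosureSketch {
  sealed trait IR
  final case class IntLit(value: Int)                  extends IR
  final case class Variable(name: String)              extends IR
  final case class Constructor(name: String, arity: Int) extends IR
  final case class Lambda(param: String, body: IR)     extends IR
  final case class Apply(function: IR, argument: IR)   extends IR
  final case class Let(name: String, value: IR, in: IR) extends IR

  sealed trait RTValue
  final case class RTInt(value: Int) extends RTValue
  // A function value: the IR body plus the store captured where it was defined.
  final case class Closure(param: String, body: IR, captured: Map[String, RTValue]) extends RTValue
  // A constructor still waiting for some of its arguments.
  final case class PartialConstructor(name: String, arity: Int, args: List[RTValue]) extends RTValue
  final case class ConstructorResult(name: String, args: List[RTValue]) extends RTValue

  type Store = Map[String, RTValue]

  def evaluate(ir: IR, store: Store): RTValue = ir match {
    case IntLit(v)             => RTInt(v)
    case Variable(n)           => store(n)
    case Constructor(n, 0)     => ConstructorResult(n, Nil)
    case Constructor(n, arity) => PartialConstructor(n, arity, Nil)
    case Lambda(p, b)          => Closure(p, b, store) // lexical scope: capture the store
    case Let(n, v, in)         => evaluate(in, store + (n -> evaluate(v, store)))
    case Apply(f, a) =>
      val arg = evaluate(a, store)
      evaluate(f, store) match {
        case Closure(p, b, captured) => evaluate(b, captured + (p -> arg))
        case PartialConstructor(n, arity, args) =>
          val all = args :+ arg
          if (all.length == arity) ConstructorResult(n, all) else PartialConstructor(n, arity, all)
        case other => throw new IllegalStateException(s"cannot apply $other")
      }
  }

  val twoArg: IR = Constructor("TwoArg", 2)

  // let inner = (let y = 5 in \z -> TwoArg y z) in inner x
  val unableToCompute: IR = Let(
    "inner",
    Let("y", IntLit(5), Lambda("z", Apply(Apply(twoArg, Variable("y")), Variable("z")))),
    Apply(Variable("inner"), Variable("x"))
  )

  // let inner = TwoArg 5 in inner x
  val wrongResult: IR = Let("inner", Apply(twoArg, IntLit(5)), Apply(Variable("inner"), Variable("x")))
}
```

In the first example y is out of scope by the time inner is applied, which is why the closure has to carry its own store rather than rely on the caller's.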

Attach Type Information to RTValues
There’s been a desire to have type information ride along with RTValues, both for additional safety and to provide context when those values are passed downstream.

In particular, some functions (List.sum and List.product) have executions which require the type information to produce the correct result (in the case of empty lists).
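A small sketch of why the empty-list case needs type information: with untyped values there is no principled zero to return, while a type tag riding along with the list makes the choice obvious. The value and type names below are toy assumptions, not the real RTValue hierarchy.

```scala
object TypedSumSketch {
  sealed trait RTValue
  final case class IntValue(value: Long)              extends RTValue
  final case class FloatValue(value: Double)          extends RTValue
  final case class ListValue(elements: List[RTValue]) extends RTValue

  // A deliberately tiny stand-in for "the element type of the list".
  sealed trait ElementType
  case object IntType   extends ElementType
  case object FloatType extends ElementType

  def add(a: RTValue, b: RTValue): RTValue = (a, b) match {
    case (IntValue(x), IntValue(y))     => IntValue(x + y)
    case (FloatValue(x), FloatValue(y)) => FloatValue(x + y)
    case _ => throw new IllegalArgumentException("mixed or non-numeric elements")
  }

  // Without type information, an empty list gives us nothing to inspect: None.
  def sumUntyped(list: ListValue): Option[RTValue] =
    list.elements.reduceOption[RTValue](add)

  // With the element type available, the zero is determined even for an empty list.
  def sumTyped(list: ListValue, elementType: ElementType): RTValue = {
    val zero: RTValue = elementType match {
      case IntType   => IntValue(0L)
      case FloatType => FloatValue(0.0)
    }
    list.elements.foldLeft(zero)(add)
  }
}
```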

This was initially thought to be a simple change, as the type information is attached to the IR node from which the RTValue is evaluated, so we should be able to just put that type information on the RTValue and have typed RTValues. On further thought this has some problems:

IR Types, like IR, may be dependent on their local scope for meaning. A node’s IR may be of type Type.Variable(“a”); that’s fine locally, but if we then build a tuple of values of Type.Variable(“a”) and return them from the generic function, we could have an RTValue like (1 : Type.Variable(“a”), 2 : Type.Variable(“a”)) : (Int, Int), which could then be passed to some scope in which a refers to a different type. I’m not sure specifically what bugs this would produce, but I think we’d rather not learn the hard way.

IR Types may be represented as aliases, with generic parameters applied along the way. For instance, if you have type alias MyList a = List a and type alias IntList = List Int, then three different type trees - MyList Int, IntList and List Int - all refer to the same type. This is not insurmountable - it is possible to de-alias either on the fly or as a pre-evaluation transformation - but it will need to be done for the type information to be useful. For example, in our motivating List.sum example, if your type is SomeTypeAlias SomeOtherTypeAlias, you need to be able to de-alias to tell what type of 0 to produce.

Performance hits may be possible. At least some work will need to be done at runtime to manage types, but this will likely be low; the greater concern is if we need to repeatedly explore type trees during execution to maintain sanity or correctness. For instance, if at each evaluation step we have to re-explore the tree to fix each level (to avoid our (0 : a, 1 : a) : (Int, Int) example above), this could represent a several fold increase in the total calculations performed.

It turns out that our motivating example may remain an issue anyway - List.sum may be carried out at a code position where the type is generic (number), and testing seems to indicate that such values may even be returned to the top level.
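To make the alias point above concrete, here is a minimal de-aliasing sketch over a toy type tree, using the MyList / IntList example. The Type encoding, the aliases table and the dealias function are all hypothetical, and the sketch assumes aliases are not recursive.

```scala
object DealiasSketch {
  sealed trait Type
  // A named type applied to arguments, e.g. List Int, MyList Int, or IntList.
  final case class Ref(name: String, args: List[Type]) extends Type
  final case class TypeVar(name: String)               extends Type

  // type alias <name> <params> = <body>
  final case class Alias(params: List[String], body: Type)

  val aliases: Map[String, Alias] = Map(
    "MyList"  -> Alias(List("a"), Ref("List", List(TypeVar("a")))),
    "IntList" -> Alias(Nil, Ref("List", List(Ref("Int", Nil))))
  )

  // Replace type variables according to the given bindings.
  def substitute(t: Type, bindings: Map[String, Type]): Type = t match {
    case TypeVar(n)   => bindings.getOrElse(n, t)
    case Ref(n, args) => Ref(n, args.map(substitute(_, bindings)))
  }

  // Expand aliases everywhere in the tree (assumes no recursive aliases).
  def dealias(t: Type): Type = t match {
    case Ref(n, args) =>
      val cleanArgs = args.map(dealias)
      aliases.get(n) match {
        case Some(Alias(params, body)) => dealias(substitute(body, params.zip(cleanArgs).toMap))
        case None                      => Ref(n, cleanArgs)
      }
    case v: TypeVar => v
  }

  val int: Type     = Ref("Int", Nil)
  val listInt: Type = Ref("List", List(int))
}
```

After de-aliasing, type equality for these three trees is plain structural equality, which is the property the pre-evaluation pass discussed below would aim for.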

Should RTValue’s Type be more limited than IR’s type?
One possible consideration is that RTValues should have a type representation that is less powerful but more self-sufficient than that of IR. Several of the above problems could be eliminated (On the RTValue level) if RTValue types never contained aliases or type variables, similar to MDM.

This does not easily solve the problem, because we still need a sane and consistent way to generate these “concrete” type trees from the potentially richer type trees of Morphir, but it would at least give us a strong static guarantee that we never had such knots on the RTValue level.

Do we need pre-evaluation transformation passes to support this?
Another possible option for managing these concerns is to do pre-evaluation transformations of the IR:

De-alias all types (up to opaque types)
- This would make checking type equality easier for transformed code, as type conformance would (usually) be simple equality, excluding type variables and extensible record types
- It may be more performant to do this once in advance, especially for re-entrant code
Generate concrete variants of generic functions
- By creating multiple concrete implementations of each generic function and binding each call site to one such, it should be possible to eliminate type variables from the IR
- This would eliminate the issue of context-dependent types
- Somewhat ironically, this might obviate our List.sum use case, as we could reify that to a concrete implementation in advance rather than using type information on the RTValue

Create Unified AST Tree
The driving distinctions between the three models now come down to the partially overlapping sets of values each can represent, more than to the representation of those values. A significant chunk of each - all lists, tuples, records and primitives, along with their associated types - has the same core meaning and structure in each interpretation.

It could be possible to simplify things and reduce boilerplate by creating a single sealed trait hierarchy in which different nodes extend different subsets of these three. It should be possible to express such a type hierarchy (in Scala and in Rust - possibly not Elm) in such a way that some forms, such as Tuple and List, will belong to given types only if their child types do as well.
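One possible encoding of this idea in Scala, offered only as a sketch: a single sealed hierarchy with a covariant phantom parameter K saying which models a node belongs to. Leaves shared by all three models are tagged with a marker that extends all three; containers take their tag from their children. The marker traits, node names and the All tag are assumptions of this example, not an existing design.

```scala
object UnifiedTreeSketch {
  // Phantom markers for the three models, plus one for "valid in all of them".
  sealed trait MDM
  sealed trait RT
  sealed trait IR
  sealed trait All extends MDM with RT with IR

  // Covariant in K, so Node[All] is usable as Node[MDM], Node[RT] or Node[IR].
  sealed trait Node[+K]
  final case class IntLit(value: Int)                        extends Node[All]
  final case class StrLit(value: String)                     extends Node[All]
  // Containers belong to a model only if all of their children do.
  final case class ListOf[+K](items: List[Node[K]])          extends Node[K]
  final case class TupleOf[+K](items: List[Node[K]])         extends Node[K]
  // Model-specific nodes.
  final case class Variable(name: String)                    extends Node[IR]
  final case class Closure(param: String, body: Node[IR])    extends Node[RT]

  // Pure data: one tree, accepted by all three views.
  val pureData: ListOf[All] = ListOf[All](List(
    TupleOf[All](List(IntLit(1), StrLit("Red"))),
    TupleOf[All](List(IntLit(2), StrLit("Green")))
  ))
  val asData: Node[MDM] = pureData
  val asRT: Node[RT]    = pureData
  val asIR: Node[IR]    = pureData

  // A list holding a closure is an RT node only.
  val withClosure: Node[RT] = ListOf[RT](List(IntLit(1), Closure("x", Variable("x"))))
  // val notData: Node[MDM] = withClosure // rejected by the compiler: RT is not MDM
}
```

The design choice here is that membership is checked by the compiler through subtyping, so no conversion functions are needed for the shared subset; whether this scales to the full Morphir node set is an open question.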

edwardpeters · 2024-02-08T18:14:39Z

edwardpeters
Feb 8, 2024
Collaborator Author

@AttilaMihaly Thinking about our discussion today, I don't think there's anything foundational to RTValue that would prevent it being represented as "Part of Morphir" rather than "Part of Scala". Specifically:

It includes Scala types, but I wouldn't call that foundational - I think Derived Types might fill all of the use cases there (I'm still not clear on them) but if not it's easy to envision other structures that would.
First class functions with partial application and lexical scope don't fit neatly within Morphir IR, but they're not scala-specific either (they don't need to be backed with scala lambdas or anything). Everything in them can be represented easily with algebraic data structures over common values.

I imagine adding another AST to the Morphir ecosystem alongside Value, Type, Pattern, etc. would not be the easiest lift, but I feel like it would simplify things in the long run.
