Transpiler Architecture

Chris Campbell edited this page Apr 29, 2024 · 7 revisions

Introduction

The SDEverywhere transpiler converts models written in the Vensim modeling language to either C or JavaScript. The transpiler supports the Vensim language features and Vensim library functions that are most commonly used in models, including subscripts.

The transpiler is published in the @sdeverywhere/compile package, which is used by the sde command line tool (from the @sdeverywhere/cli package). The source code can be found in the packages/compile directory in this repo.

The @sdeverywhere/compile package is written in the ECMAScript 2015 language (also known as ES6), a modern, standardized version of JavaScript. Much of the code is written in a functional programming style using the Ramda toolkit. (Most other packages in the SDEverywhere repo are written in TypeScript, but the cli and compile packages are currently written in JavaScript.)

Note that the term "SDEverywhere" generally refers to the collection of libraries and tools that are developed in this repo, but when this document refers to SDEverywhere, it is using it as a shorthand for the SDEverywhere transpiler (the compile package).

Overview of Transpilation

High level view

From a high-level perspective, the transpiler can be thought of as a black box that takes model files as input, performs some computation, and generates C or JavaScript files as output.

graph TB;
    input["Model files"]
    compiler("Transpiler")
    output["C or JS files"]

    input-->compiler
    compiler-->output

    style input stroke:none,fill:green,color:white
    style output stroke:none,fill:royalblue,color:white

Medium level view

The next diagram shows the above sequence in terms of the actual files and high-level function. We can see that the input files consist of one or more Vensim model (.mdl) files, zero or more exogenous data files (e.g., in .xlsx, .csv, or .dat format), and a spec.json file that tells the transpiler what input/output variables to include, where to find the data files, and so on. These input files are fed to the parseAndGenerate function, which returns the generated code (the content of a .c or .js file) as output.

graph TB;
    input_mdl["{model}.mdl"]
    input_data[".xlsx | .csv | .dat"]
    input_spec["spec.json"]
    compiler("parseAndGenerate<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
    output_c["{model}.c"]
    output_js["{model}.js"]

    input_mdl-->compiler
    input_data-->compiler
    input_spec-->compiler
    compiler-->output_c
    compiler-->output_js

    classDef input stroke:none,fill:green,color:white
    classDef output stroke:none,fill:royalblue,color:white

    class input_mdl,input_data,input_spec input
    class output_c,output_js output

Low level view

The following diagram shows the phases of the transpilation process (i.e., the parseAndGenerate function from above) in more detail, including the intermediate files/objects that are passed from one phase to the next. The remainder of this document will explain each of these phases in more detail.

graph TB;
    input_mdl["{model}.mdl"]
    preprocessor("preprocessVensimModel<br/><div style='font-size:0.8em'>(@sdeverywhere/parse)</div>")
    intermediate_mdl["processed.mdl"]
    parser("parseVensimModel<br/><div style='font-size:0.8em'>(@sdeverywhere/parse)</div>")
    intermediate_ast["AST"]
    reader("readDimensionDefs + readVariables<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
    intermediate_model1["Unresolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
    analyzer("analyze<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
    intermediate_model2["Resolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
    generator("generateCode<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
    output_c["processed.c"]

    input_mdl-->preprocessor
    preprocessor-->intermediate_mdl
    intermediate_mdl-->parser
    parser-->intermediate_ast
    intermediate_ast-->reader
    reader-->intermediate_model1
    intermediate_model1-->analyzer
    analyzer-->intermediate_model2
    intermediate_model2-->generator
    generator-->output_c

    classDef input stroke:none,fill:green,color:white
    classDef intermediate stroke:none,fill:orangered,color:white
    classDef output stroke:none,fill:royalblue,color:white

    class input_mdl input
    class intermediate_mdl,intermediate_ast,intermediate_model1,intermediate_model2 intermediate
    class output_c output

Phase 1: Preprocess and Flatten

graph LR;
    input_mdl["{model}.mdl"]
    preprocessor("preprocessVensimModel<br/><div style='font-size:0.8em'>(@sdeverywhere/parse)</div>")
    intermediate_mdl["processed.mdl"]

    input_mdl-->preprocessor
    preprocessor-->intermediate_mdl

    classDef input stroke:none,fill:green,color:white
    classDef intermediate stroke:none,fill:orangered,color:white

    class input_mdl input
    class intermediate_mdl intermediate

In the first phase, the Vensim model file(s) are preprocessed to make the definitions easier for the parsing phase to digest.

In the common case where there is a single .mdl file, that file is passed through a preprocessor. The preprocessor removes some things that the parser grammar can't yet handle, such as macros, tabbed arrays, and the graph/sketch/view definitions in the private section of the .mdl file. The preprocessor produces a new .mdl file that contains only the relevant dimension definitions and equations needed for the parsing phase.
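As a rough illustration of one preprocessing step, the sketch below drops the sketch/view section from the text of a .mdl file. It assumes the Vensim convention that this section begins at the marker `\\\---///`. The helper name `stripSketchSection` is hypothetical and is not the actual `preprocessVensimModel` implementation, which also handles macros, tabbed arrays, and other details.

```javascript
// Hypothetical helper: keep only the equations portion of a .mdl file.
// Assumption: the private sketch section starts at the first occurrence
// of the marker `\\\---///` (three backslashes, then `---///`).
function stripSketchSection(mdlText) {
  const marker = '\\\\\\---///'
  const idx = mdlText.indexOf(marker)
  if (idx < 0) {
    // No sketch section found, so return the text unchanged
    return mdlText
  }
  return mdlText.slice(0, idx).trimEnd() + '\n'
}
```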

In the case of a complex model that consists of multiple "submodels" (i.e., multiple .mdl files), the sde flatten command must be used to preprocess all .mdl files and combine duplicate definitions to produce a single .mdl file that contains the resolved dimension definitions and equations.

Phase 2: Parse

graph LR;
    intermediate_mdl["processed.mdl"]
    parser("parseVensimModel<br/><div style='font-size:0.8em'>(@sdeverywhere/parse)</div>")
    intermediate_ast["AST"]

    intermediate_mdl-->parser
    parser-->intermediate_ast

    classDef intermediate stroke:none,fill:orangered,color:white
 
    class intermediate_mdl,intermediate_ast intermediate

In the second phase, a single preprocessed .mdl file is passed to the parseVensimModel function (part of the @sdeverywhere/parse package), which parses the model definitions and produces an abstract syntax tree (AST).

The AST is an in-memory representation of the model that allows later phases to work with the model definitions in a way that is not strongly tied to the source file format. Though currently we only have support for Vensim models as an input format, we plan to add support for the XMILE format used by Stella. The AST is designed to be file format agnostic, meaning that once a model file is parsed into an AST, the later phases of the transpiler can work with that AST structure without needing separate, special-cased logic for Vensim models and XMILE models.

Internally, the parseVensimModel function uses the antlr4-vensim package to parse the dimension definitions and equations from the .mdl file and produce the AST.

Phase 3: Read Dimensions and Variables

graph LR;
    intermediate_ast["AST"]
    reader("readDimensionDefs + readVariables<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
    intermediate_model1["Unresolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]

    intermediate_ast-->reader
    reader-->intermediate_model1

    classDef intermediate stroke:none,fill:orangered,color:white

    class intermediate_ast,intermediate_model1 intermediate

The AST from the second phase is the input to the third phase. This phase has two parts.

First, the dimension definitions ("subscript ranges" in Vensim terminology) from the AST are read. For each dimension definition, a corresponding Subscript object is created and managed by the subscript module. For more on subscripts and dimensions, consult the Dimensions and Subscripts section below.

Second, the equations and data variable definitions from the AST are read. For each equation or data variable, a corresponding Variable object is created and managed by the model module. For more on variables and the Variable class, consult the Variables section below.

During this phase, the Variable objects are not fully resolved. They contain a reference to the parsed Equation, and the left-hand side (i.e., the name of the variable) is determined, but the right-hand side of the equation has not yet been examined. That will happen in the next phase.

Detail: readVariables

Syntactically, an equation can be one of three things: a variable, a lookup, or a constant list. The readVariables function creates a separate variable for each constant in a constant list. Subscripts are put into normal form.

When a variable is added to the model, the Model object checks to see if there is an index subscript on the LHS. If so, the variable is a non-apply-to-all array, and is added to the nonAtoANames list indexed by the variable name, with a value of an array of flags for each subscript in normal order, indicating whether the subscript is an index or not.
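The bookkeeping described above can be sketched as follows. This is a simplified illustration, not the Model object's actual code: the function names `indexFlags` and `recordNonAtoA` are hypothetical, and the set of dimension names is assumed to be available from the previously read dimension definitions.

```javascript
// Hypothetical sketch: for each LHS subscript (already in normal order),
// record true if it is an index and false if it is a dimension.
function indexFlags(lhsSubscripts, dimensionNames) {
  return lhsSubscripts.map(sub => !dimensionNames.has(sub))
}

// Register a variable in a nonAtoANames-style map only when at least
// one LHS subscript is an index (i.e., the variable is non-apply-to-all).
function recordNonAtoA(nonAtoANames, varName, lhsSubscripts, dimensionNames) {
  const flags = indexFlags(lhsSubscripts, dimensionNames)
  if (flags.some(Boolean)) {
    nonAtoANames[varName] = flags
  }
  return nonAtoANames
}

const dims = new Set(['_dima', '_dimb'])
const names = {}
recordNonAtoA(names, '_x', ['_a1', '_dimb'], dims) // index on the LHS, recorded
recordNonAtoA(names, '_y', ['_dima', '_dimb'], dims) // apply-to-all, not recorded
```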

A subscripted constant variable can be defined with all of the constants in a list on the RHS. This notation is handled as a top-level alternative for the RHS in the grammar. When readVariables finds a constant list, it creates new Variable instances, one for each index in the constant list.
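A minimal sketch of this expansion, assuming the indices of the LHS dimension are already known and the constants appear in the same order. The function `expandConstantList` and the plain-object shape it returns are hypothetical stand-ins for the real Variable instances.

```javascript
// Hypothetical sketch: expand an equation like `x[DimA] = 1, 2, 3` into
// one variable-like object per index in the dimension.
function expandConstantList(varName, dimIndices, constants) {
  if (dimIndices.length !== constants.length) {
    throw new Error(`Expected ${dimIndices.length} constants for ${varName}`)
  }
  return dimIndices.map((index, i) => ({
    varName,
    subscripts: [index],
    value: constants[i]
  }))
}

const expanded = expandConstantList('_x', ['_a1', '_a2', '_a3'], [1, 2, 3])
```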

Phase 4: Analyze Equations

graph LR;
    intermediate_model1["Unresolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
    analyzer("analyze<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
    intermediate_model2["Resolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]

    intermediate_model1-->analyzer
    analyzer-->intermediate_model2

    classDef intermediate stroke:none,fill:orangered,color:white

    class intermediate_model1,intermediate_model2 intermediate

In the fourth phase, the Variable objects created during the third phase are analyzed, and the right-hand side of each equation is examined to determine which variables and functions it references.

During this phase, the variable type (varType) for each Variable instance is determined. Function call arguments are validated to make sure the number and format of the arguments match what is expected by the functions (as specified by the Vensim documentation, in the case of Vensim models). For complex function calls, this phase may store additional metadata in the Variable object that prepares that variable for the code generation phase.

This phase will throw an error if it encounters inconsistencies, such as variable dependencies that cannot be resolved (unknown variable references) or unresolved data variables (for which the data cannot be found in the associated external data files).

At the completion of this phase, if everything is valid, the Variable objects are considered fully resolved and ready to be passed to the code generation phase.

Detail: readEquations

When readEquation finds lookup syntax on the RHS, it creates a lookup variable by setting the points, optional range, and variable type in the Variable. If a variable has no references, the variable type is set to const. If a function name such as _INTEG is encountered, the variable type is set to level.

If the variable is non-apply-to-all, and it has a dimension subscript on the RHS in the same position as an index subscript on the LHS, then the equation references each element of the non-apply-to-all variable separately, one for each index in the dimension. The readEquation function constructs a refId for each of the expanded variables and adds it to the expandedRefIds list. The references are added later in the addReferencesToList function.

Detail: removeUnusedVariables

After the first part of this phase is complete, the Variable objects form a dependency tree. The spec.json file is consulted to see which input and output variables are specified for inclusion in the generated model code. The removeUnusedVariables function walks the dependency tree (consulting the references and initReferences properties of each Variable object). Only the variables specified in the outputVarNames array from the spec.json file and their dependencies are retained in the generated model. After the removeUnusedVariables function completes, the variables array in the model module will contain only the retained Variable objects; the rest are discarded.

To help illustrate this, consider the following model (defined in pseudo-Vensim code):

w = 5 ~~|
x = 4 ~~|
y = x + 3 ~~|
z = 1 ~~|
u = y * 2 ~~|

Suppose the spec.json file has:

{
  "outputVarNames": ["u", "z"]
}

In this case, the generated model will include:

  • u (because it is a declared "output")
  • y (because it is referenced by u)
  • x (because it is referenced by y)
  • z (because it is a declared "output")

The generated model will not include w because it is not referenced in the dependency tree.
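The walk over the example above can be sketched as follows. This is an illustration under simplifying assumptions rather than the actual removeUnusedVariables code: variables are plain objects keyed by name instead of refId, and the function name `retainUsedVariables` is hypothetical.

```javascript
// Simplified stand-ins for Variable objects: each has a name plus the
// `references` and `initReferences` lists described above.
const variables = [
  { name: 'w', references: [], initReferences: [] },
  { name: 'x', references: [], initReferences: [] },
  { name: 'y', references: ['x'], initReferences: [] },
  { name: 'z', references: [], initReferences: [] },
  { name: 'u', references: ['y'], initReferences: [] }
]

// Hypothetical sketch: keep the output variables and everything that is
// reachable from them through references or initReferences.
function retainUsedVariables(allVars, outputVarNames) {
  const byName = new Map(allVars.map(v => [v.name, v]))
  const used = new Set()
  const pending = [...outputVarNames]
  while (pending.length > 0) {
    const name = pending.pop()
    if (used.has(name) || !byName.has(name)) continue
    used.add(name)
    const v = byName.get(name)
    pending.push(...v.references, ...v.initReferences)
  }
  // Preserve the original definition order for the retained variables
  return allVars.filter(v => used.has(v.name))
}

const retained = retainUsedVariables(variables, ['u', 'z']).map(v => v.name)
// retained is ['x', 'y', 'z', 'u']; `w` is dropped
```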

Phase 5: Generate Code

graph LR;
    intermediate_model2["Resolved Model<br/><div style='font-size:0.8em'>(DimensionDefs + Variables)</div>"]
    generator("generateCode<br/><div style='font-size:0.8em'>(@sdeverywhere/compile)</div>")
    output_c["processed.c"]

    intermediate_model2-->generator
    generator-->output_c

    classDef intermediate stroke:none,fill:orangered,color:white
    classDef output stroke:none,fill:royalblue,color:white

    class intermediate_model2 intermediate
    class output_c output

In the last phase, the analyzed equations are used to generate C or JavaScript code that can be used to run the model.

The generated code is divided into distinct sections and functions:

  • variable declaration
  • lookup/data initialization (initLookups)
  • constant initialization (initConstants)
  • level variable initialization (initLevels)
  • level variable evaluation (evalLevels)
  • aux variable evaluation (evalAux)
  • input variable handling (setInputsFromBuffer)
  • output variable handling (storeOutputData)

For each section, the generateCode function calls generateEquation for each Variable instance to generate lines of code for that section and variable:

  • There will be one C variable declaration (either a double or Lookup*) generated for each Variable instance.
  • For each data variable or lookup, there will be code generated that initializes the Lookup data structure corresponding to that variable.
  • For each constant value, there will be code generated that initializes the constant value once at the start of the model run.
  • For each level variable, there will be code generated that initializes the level at the start of the model run and separate code that evaluates the level variable at each time step.
  • For each aux variable, there will be code generated that evaluates the variable at each time step.
  • The setInputsFromBuffer function will contain one line for each input variable declared in the spec.json file (in the inputVarNames array).
  • The storeOutputData function will contain one line for each output variable declared in the spec.json file (in the outputVarNames array).

Detail: Variable Lists

The code generator gets lists of variables for each section of the program and calls the generateEquation function to generate code for each variable.

The Model object supplies the variable lists, relying on the following internal functions:

  • varsOfType returns Variable instances for a given varType.
  • sortVarsOfType returns aux or level Variable instances sorted in dependency order using eval time references.
  • sortInitVars does the same using init time references. The other difference is that aux and level vars are evaluated separately at eval time, while a mixture of level vars and the aux vars they depend on are evaluated at init time.
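Sorting in dependency order amounts to a topological sort over the reference lists. The following depth-first sketch shows the idea; it is not necessarily how sortVarsOfType or sortInitVars are implemented, and the function name `sortInDependencyOrder` and the `getRefs` callback are assumptions made for illustration.

```javascript
// Hypothetical sketch: order variables so that each one appears after the
// variables it references. `getRefs` selects which reference list to follow
// (eval-time `references` or init-time `initReferences`).
function sortInDependencyOrder(vars, getRefs) {
  const byId = new Map(vars.map(v => [v.refId, v]))
  const sorted = []
  const state = new Map() // refId -> 'visiting' | 'done'

  function visit(v) {
    const s = state.get(v.refId)
    if (s === 'done') return
    if (s === 'visiting') throw new Error(`Circular dependency at ${v.refId}`)
    state.set(v.refId, 'visiting')
    for (const refId of getRefs(v)) {
      // References to variables outside this list (e.g., constants) are skipped
      const dep = byId.get(refId)
      if (dep) visit(dep)
    }
    state.set(v.refId, 'done')
    sorted.push(v)
  }

  vars.forEach(v => visit(v))
  return sorted
}

const auxVars = [
  { refId: '_u', references: ['_y'] },
  { refId: '_y', references: ['_x'] }
]
const order = sortInDependencyOrder(auxVars, v => v.references).map(v => v.refId)
// order is ['_y', '_u']
```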

Detail: generateEquation

The generateEquation function maintains a context object (see GenExprContext type) that has a number of properties that hold intermediate results as the AST is visited.

Code is generated differently in the init section of the program. This is controlled by the mode flag in the GenExprContext.

Array functions such as SUM require the creation of a temporary variable and a loop. These intermediate variables are tracked in the GenExprContext.

Subscripted variables are also evaluated in a loop. The subscript loop opening and closing code are tracked in the GenExprContext, as is the array function code.

Array functions mark one dimension that the function operates over. The dimension is marked by a ! character at the end of the dimension name. If this is detected, the ! is removed and the name of the marked dimension is saved in the markedDimIds array in the GenExprContext.
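The handling of the ! marker can be sketched as below. The function name `extractMarkedDims` and its return shape are hypothetical; in the transpiler the marked names are stored in the markedDimIds array of the GenExprContext.

```javascript
// Hypothetical sketch: strip the trailing `!` from marked dimension names
// and collect the marked names separately.
function extractMarkedDims(subscriptNames) {
  const markedDimIds = []
  const subscripts = subscriptNames.map(name => {
    if (name.endsWith('!')) {
      const dimId = name.slice(0, -1)
      markedDimIds.push(dimId)
      return dimId
    }
    return name
  })
  return { subscripts, markedDimIds }
}

// For an expression like SUM(a[DimA!, DimB]):
const result = extractMarkedDims(['_dima!', '_dimb'])
// result.subscripts is ['_dima', '_dimb']; result.markedDimIds is ['_dima']
```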

Terminology

SDEverywhere uses XMILE terminology in most cases. A Vensim subscript range becomes a "dimension" that has "indices". (The XMILE specification has "element" as the child of "dimension" in the model XML format, but uses "index" informally, so SDEverywhere sticks with "index".) XMILE does not include the notion of subranges. SDEverywhere calls subranges "subdimensions".

Vensim refers to variables and equations interchangeably. This usually makes sense, since most variables are defined by a single equation. In SDEverywhere, models define variables with equations. However, a subscripted variable may be defined by multiple equations. In XMILE terminology, an apply-to-all array has an equation that defines all indices of the variable. There is just one array variable. A non-apply-to-all array is defined by different equations for each index. This means there are multiple variables, one for each index.

The Variable class is the heart of SDEverywhere. An equation has a left-hand side (LHS), usually the variable name, and a right-hand side (RHS), usually a formula expression that is evaluated to determine the variable's value. The RHS could also be a Vensim lookup (a set of data points) or a constant array. For more detail, consult the Variables section below.

Dimensions and Subscripts

  • A Vensim "subscript range definition" defines an SDEverywhere dimension.
  • Subscript range definitions give a list of subscripts that can include dimensions, indices, or both.
  • A dimension can map to multiple dimensions listed in the mapping value.
  • In a subscript range definition with a mapping, the map-from dimension is on the left, and the map-to dimensions are on the right after the -> marker.
  • In an equation, the map-to dimension is on the LHS, and a map-from dimension is on the RHS.
  • Dimensions cannot be defined in a mapping.
  • A subscript is not an index if it is defined as a dimension.
  • An index in a map-from dimension can be mapped to a dimension with multiple indices in the map-to dimension.
  • When a map-to dimension lists subscripts, it has the same semantics as a regular subscript range definition.
  • The reasons to list subscripts in a map-to dimension are to map subscripts in a different order than the dimension's definition, or to map an index in the map-from dimension to more than one index in the map-to dimension.

Subscript range definition forms

  • dimension: subscripts
  • dimension: subscripts -> dimensions
  • dimension: subscripts -> (dimension: map-to subscripts)

The dimensions given to the right of the -> marker are the "mapping value" of the mapping.

Subscript mapping example

Here is a mapped dimension with three subscripts in the map-from and map-to dimensions.

DimA: R1, R2, R3 -> (EFGroups: Group1, Group2, Group3)

But EFGroups has four subscripts!

EFGroups: DimF, E1, E2, E3

The map-to dimension does not really have three subscripts. The mapping must list three subscripts to match the number of indices in the map-from dimension. But the total number of indices in the map-to dimension can be greater than in the map-from dimension.

These dimensions are simple lists of indices.

DimE: E1, E2, E3
DimF: F1, F2, F3
DimR: R1, R2, R3

If we expand the DimF dimension in the EFGroups map-to dimension, we see that EFGroups has a total of six indices.

EFGroups: F1, F2, F3, E1, E2, E3

Therefore, the mapping in DimA maps the three indices in DimA to the six indices in EFGroups in a different order than they occur in the definition of EFGroups, through other dimensions Group1, Group2, and Group3.

DimA: R1, R2, R3 -> (EFGroups: Group1, Group2, Group3)

Group1: F1, E1
Group2: F2, E2
Group3: F3, E3

R1 → Group1 → F1, E1
R2 → Group2 → F2, E2
R3 → Group3 → F3, E3

What this mapping accomplishes is to group the subscripts in EFGroups in a different way when it occurs in an equation with DimA. For instance:

x[EFGroups] = a[DimA] * 10

Notice that in an equation, the map-to dimension is on the LHS and the map-from dimension is on the RHS, the opposite of how they occur in the subscript range definition. This subscripted equation is evaluated as follows when expanded over its indices by SDEverywhere:

x[F1] = a[R1] * 10
x[E1] = a[R1] * 10
x[F2] = a[R2] * 10
x[E2] = a[R2] * 10
x[F3] = a[R3] * 10
x[E3] = a[R3] * 10
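The expansion above can be reproduced with a short sketch. It uses the Vensim names from the example directly rather than canonical names, and the function `expandMapping` is a hypothetical illustration, not the subscript module's actual API.

```javascript
// Definitions from the example above
const groups = {
  Group1: ['F1', 'E1'],
  Group2: ['F2', 'E2'],
  Group3: ['F3', 'E3']
}
const dimA = ['R1', 'R2', 'R3']
const mapTo = ['Group1', 'Group2', 'Group3'] // matched by position with dimA

// Hypothetical sketch: pair each map-to index with its map-from index.
function expandMapping(mapFromIndices, mapToSubscripts, dimensionDefs) {
  const pairs = []
  mapFromIndices.forEach((fromIndex, i) => {
    const target = mapToSubscripts[i]
    // A map-to subscript may itself be a dimension; if so, expand it
    const toIndices = dimensionDefs[target] ?? [target]
    for (const toIndex of toIndices) {
      pairs.push([toIndex, fromIndex])
    }
  })
  return pairs
}

const equations = expandMapping(dimA, mapTo, groups).map(
  ([to, from]) => `x[${to}] = a[${from}] * 10`
)
```

Each entry of `equations` corresponds to one line of the expanded form shown above, in the same order.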

Variables

The Variable class is defined in the variable module and contains the parsed Equation along with other metadata that was determined during the "read" and "analyze" phases.

The parsedEqn property holds a reference to the parsed Equation from the AST. This enables the code generator to walk the subtree for the variable.

In the Variable object, the modelLHS and modelFormula properties preserve the Vensim variable name (left-hand side of the equation, aka LHS) and the Vensim formula (right-hand side, aka RHS). Everywhere else, names of variables are in a canonical format compatible with the C programming language. The Vensim name is converted to lower case (it is case insensitive), spaces are replaced with underscores, and an underscore is prepended to the name. Vensim function names are similar, but are upper-cased instead.
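The naming rules in the previous paragraph can be sketched as follows. The function names are hypothetical, and this version only covers the rules stated here (case, spaces, leading underscore); treating a run of whitespace as a single underscore is an assumption, and the real implementation may handle additional characters.

```javascript
// Hypothetical sketch: convert a Vensim variable name to canonical form.
function canonicalVarName(vensimName) {
  return '_' + vensimName.trim().toLowerCase().replace(/\s+/g, '_')
}

// Function names follow the same rules but are upper-cased instead.
function canonicalFunctionName(vensimName) {
  return '_' + vensimName.trim().toUpperCase().replace(/\s+/g, '_')
}
```

For example, `canonicalVarName('Total Population')` yields `_total_population`, and `canonicalFunctionName('integ')` yields `_INTEG`.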

The unsubscripted form of the Vensim variable name, in canonical format, is saved in the varName property. If there are subscripts in the LHS, the maximal canonical dimension names in sorted "normal" order establish subscript families by position in the families property. The subscripts are saved as canonical dimension or index names in the LHS in normal order in the subscripts property.

Lookup variables do not have a formula. Instead, they have a list of 2D points and an optional range. These are saved in the points and range properties.

Each variable has a refId property that gives the variable's LHS in a normal form that can be used in lists of references. The refId is the same as the varName for unsubscripted variables. A subscripted variable can include both dimension and index subscripts on the LHS. When another variable refers to the subscripted variable, we add its refId to the list of references. The normal form for a refId has the canonical name of each dimension or index sorted by their subscript families, separated by commas in a single pair of brackets, for example: _a[_dima,_dimb].
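A minimal sketch of the refId format, assuming the subscripts passed in are already canonical names sorted into normal order. The function name `makeRefId` is hypothetical.

```javascript
// Hypothetical sketch: build a refId from a canonical variable name and its
// canonical subscripts (assumed to be in normal order already).
function makeRefId(varName, subscripts = []) {
  if (subscripts.length === 0) {
    // Unsubscripted variables use the varName as the refId
    return varName
  }
  return `${varName}[${subscripts.join(',')}]`
}
```

For example, `makeRefId('_a', ['_dima', '_dimb'])` produces `_a[_dima,_dimb]`, and `makeRefId('_x')` produces `_x`.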

The references array property lists the refIds of variables that this variable's formula references. This determines the dependency order and thus evaluation order during code generation. Some Vensim functions such as _INTEG have a special initialization argument that is evaluated before the normal run loop. The references in the expression for this argument are stored in the initReferences property and do not appear in references unless they occur elsewhere in the formula.

The varType property holds the variable type, which determines where the variable is evaluated in the sim’s run loop. The Vensim variable types that SDEverywhere supports are:

  • constant
  • auxiliary
  • level
  • lookup
  • initial
  • data

Lookups may occur as function arguments as well as variables in their own right. When this happens, the code generator generates an internal lookup variable to hold the lookup's points. The name of the generated variable is saved in the lookupArgVarName property. It replaces the lookup as the function argument when code is generated.

SMOOTH* calls are replaced by a generated level variable named in smoothVarName. DELAY3* calls are replaced by a level named in delayVarName and an aux variable named in delayTimeVarName.

Generated Model Run Loop

Each section of a complete model program in C is written in sequence. The decl section declares C variables, including arrays of the proper size. The init section initializes constant variables and evaluates levels and the auxiliary variables necessary to evaluate them. The eval section is the main run loop. It evaluates aux variables and then outputs the state. The time is advanced to the next time step. Levels are evaluated next, and then the loop is finished. The input/output section has the code that sends output variable values to the output channel and optionally sets input values when the program starts.
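The ordering described above can be sketched in JavaScript as follows. This is an illustration only: the `runModel` function, the `setInputs` and `captureOutputs` names, and all signatures are assumptions, and the tiny model uses simple Euler integration for its single level.

```javascript
// Hypothetical sketch of the run loop ordering (not the generated code itself).
function runModel(model, { initialTime, finalTime, timeStep }) {
  model.initConstants()
  model.setInputs()
  model.initLevels()

  const outputs = []
  const numSteps = Math.round((finalTime - initialTime) / timeStep)
  let time = initialTime
  for (let step = 0; step <= numSteps; step++) {
    model.evalAux(time) // evaluate aux variables at time = t
    outputs.push(model.captureOutputs(time)) // capture output values
    time = initialTime + (step + 1) * timeStep // advance the time
    model.evalLevels(timeStep) // evaluate levels for the new time
  }
  return outputs
}

// A tiny hand-written model with one constant, one aux, and one level
const model = {
  initConstants() { this.growthRate = 0.5 },
  setInputs() {},
  initLevels() { this.population = 100 },
  evalAux() { this.births = this.population * this.growthRate },
  evalLevels(dt) { this.population += this.births * dt },
  captureOutputs(time) { return { time, population: this.population } }
}

const outputs = runModel(model, { initialTime: 0, finalTime: 2, timeStep: 1 })
// outputs holds the population at times 0, 1, and 2: 100, 150, 225
```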

graph TB;
    A["Declare variables"]
    B["Initialize constants"]
    C["Set input variable values"]
    D["Initialize levels"]
    E["Evaluate aux variables<br/><div style='color:red;font-size:0.8em'>time = <em>t</em>"]
    F["Capture output variable values"]
    G["Advance the time<br/><div style='color:red;font-size:0.8em'>time = <em>t</em> + time step"]
    H["Evaluate level variables"]

    A-->B
    B-->C
    C-->D
    D-->E
    E-->F
    H-->E
    F-->G
    G-->H