Interpreter Internals

← History and Evolution | Next: Building from Source

This chapter describes the internal architecture of the nuBASIC interpreter. It is aimed at developers who want to understand how the interpreter works, contribute to the codebase, or build upon it for their own language experiments. All classes and files mentioned are in the nu namespace; header files are in include/ and implementations in lib/.

Main Components

The interpreter follows the classical pipeline: source text → tokens → abstract syntax tree → execution.

Component	Class / Type	Key file
Tokenizer	`tokenizer_t`	`include/nu_tokenizer.h`
Token	`token_t`	`include/nu_token.h`
Token List	`token_list_t`	`include/nu_token_list.h`
Expression Parser	`expr_parser_t`	`include/nu_expr_parser.h`
Expression node base	`expr_any_t`	`include/nu_expr_any.h`
Statement Parser	`statement_parser_t`	`include/parser/nu_statement_parser.h`
Statement base	`stmt_t`	`include/nu_stmt.h`
Static Program Context	`prog_ctx_t`	`include/nu_prog_ctx.h`
Runtime Program Context	`rt_prog_ctx_t`	`include/nu_rt_prog_ctx.h`
Program line map	`prog_line_t`	`include/nu_program.h`
Interpreter	`interpreter_t`	`include/nu_interpreter.h`
Variant value	`variant_t`	`include/nu_variant.h`
Built-in help	`builtin_help_t`	`include/nu_builtin_help.h`

Tokenizer — breaks a source line into a flat sequence of typed token_t objects.
Expression Parser — reads a token sequence representing an expression and returns an expr_any_t::handle_t that can be evaluated on demand.
Statement Parser — builds an executable statement object tree from a complete source line.
Static Program Context (prog_ctx_t) — accumulated during the build phase; holds procedure prototypes, structure definitions, label tables, and multi-line construct metadata.
Runtime Program Context (rt_prog_ctx_t) — extends prog_ctx_t with variable values, the call stack, loop runtime data, file descriptors, the program counter, and function return registers.
Interpreter (interpreter_t) — owns the source lines, the parsed program, and implements the CLI command loop, rebuild(), run(), and the debugger.

A Line-Oriented Interpreter

BASIC is a line-oriented language: the position of a hard line break is syntactically significant. nuBASIC preserves this throughout the entire pipeline.

Source text is stored in two parallel maps inside interpreter_t:

_source_line — maps a line_num_t to the raw source string.
_prog_line — maps line numbers to compiled statement objects plus debug information.

During rebuild(), each source line is handed to stmt_parser_t::compile_line(). The static context (prog_ctx_t) is built at the same time, stitching together multi-line constructs (For/Next, While/Wend, If/ElseIf/EndIf, Sub/Function, etc.).

At execution time, run() iterates over _prog_line in key order, calling the run() method of each block statement object. Jump statements (GoTo, GoSub, Return, procedure calls) modify the runtime program counter directly to redirect execution.

Tokens and the Tokenizer

Each token_t carries:

Attribute	Member	Description
Original text	`_org_id`	Preserves the original capitalisation
Normalised text	`_identifier`	Lowercase version, used for keyword matching
Token class	`_type` (`tkncl_t` enum)	Category of the token
Source position	`_position`	Byte offset within the source line

The tkncl_t enumeration:

enum class tkncl_t {
    UNDEFINED,
    BLANK,
    NEWLINE,
    IDENTIFIER,    // keywords and variable names
    INTEGRAL,      // integer literals
    REAL,          // floating-point literals
    OPERATOR,      // +  -  *  /  ^  =  <>  <  <=  >  >=  \  &
    SUBEXP_BEGIN,  // (
    SUBEXP_END,    // )
    STRING_LITERAL,
    STRING_COMMENT,
    SUBSCR_BEGIN,  // [
    SUBSCR_END,    // ]
    LINE_COMMENT   // '  or  Rem
};

The Statement Parser uses _type to make fast branching decisions — an IDENTIFIER token whose _identifier is "print" immediately routes to parse_print().

Token List Container

token_list_t wraps a std::deque<token_t> and adds:

Efficient removal from either end (front-popping as tokens are consumed)
Marker insertion: SUBEXP_BEGIN/SUBEXP_END tokens delimit sub-expressions after operator-precedence reordering
Helpers to peek at the front token, check emptiness, and skip blanks

Parsing the Code

Expression Parser (expr_parser_t) — analyses a single expression and returns an expr_any_t::handle_t:

expr_any_t::handle_t compile(expr_tknzr_t& tknzr);
expr_any_t::handle_t compile(token_list_t tl, size_t expr_pos);

Statement Parser (stmt_parser_t) — analyses one complete source line:

stmt_t::handle_t compile_line(prog_ctx_t& ctx, const std::string& source_line);
stmt_t::handle_t parse_block(prog_ctx_t& ctx, token_list_t& tl, ...);
stmt_t::handle_t parse_stmt (prog_ctx_t& ctx, token_list_t& tl);

parse_block() loops while the token list is non-empty, calling parse_stmt() on each iteration. parse_stmt() inspects the first token and dispatches to a dedicated parse_xxx() method for each statement keyword.

CLI Parser (inside interpreter_t) — handles interpreter-specific commands (List, Run, Load, Save, New, Help, …) via interpreter_t::exec_command().

Expression Parsing in Detail

The central challenge is operator precedence: 2 + 4 * 17 must evaluate to 2 + (4 * 17) = 70, not (2 + 4) * 17 = 102.

The Expression Parser resolves this in three steps.

Step 1 — Tokenise. Feed the expression string to a tokenizer_t.

Step 2 — Reorder by precedence. Insert SUBEXP_BEGIN/SUBEXP_END markers to encode the following precedence order (highest to lowest):

Unary identity and negation (+, -)
Exponentiation (^)
Multiplication and floating-point division (*, /)
Integer division (\, Div)
Modulus (Mod)
Addition and subtraction (+, -)
Comparison (=, <>, <, <=, >, >=)
Logical and bitwise (And, Or, Xor, and bitwise variants)

Step 3 — Build the syntax tree. expr_parser_t::parse() recursively constructs a tree of expr_any_t nodes. For 2 + 4 * 17:

        binary_expression(+)
              /          \
        literal          binary_expression(*)
        (integer, 2)           /        \
                          literal      literal
                          (int, 4)     (int, 17)

The expr_any_t hierarchy:

Binary expression — one operator, two child nodes
Unary expression — one operator, one child
Function call — built-in or user-defined, with an argument list of expr_any_t nodes
Variable — name looked up in the runtime context at eval() time
Literal constant — integer, floating-point, or string value embedded at compile time

The base class interface:

class expr_any_t {
public:
    using handle_t    = std::shared_ptr<expr_any_t>;
    using func_args_t = std::vector<expr_any_t::handle_t>;

    virtual variant_t eval(rt_prog_ctx_t& ctx) const = 0;
    virtual bool      empty()  const noexcept = 0;
    virtual std::string name() const noexcept = 0;
    virtual ~expr_any_t() {}
};

The Variant Type

variant_t (include/nu_variant.h) is the universal value type. Every expression evaluates to a variant_t; every variable stores one.

Enumerator	Meaning
`UNDEFINED`	Uninitialised / not-set
`INTEGER`	32-bit signed integer
`DOUBLE`	Double-precision floating point
`STRING`	Text string
`BYTEVECTOR`	Raw byte array
`BOOLEAN`	Boolean value
`STRUCT`	User-defined structure
`OBJECT`	Object handle (for GUI/external objects)
`ANY`	Wildcard (used in procedure signatures)

Internally, scalar integers, doubles, booleans, and strings use inline storage, so the hot paths for arithmetic, comparisons, and short strings avoid the old one-element vector allocation. Array variables still use the vector payload path for indexed storage. Structure and object metadata is boxed behind a payload pointer, and struct payload copies use copy-on-write so pass-by-value is cheap until a copy is mutated. Struct field storage preserves declaration order, which is important for native interop layouts and predictable introspection.

Tracing Execution of a Simple Program

The complete path from keyboard input to printed output for:

10 Print 2 + 4 * 17

1. Entry. The user types RUN. interpreter_t::exec_command("RUN") is called.

2. Build phase. rebuild() iterates over _source_line. For line 10 it calls stmt_parser_t::compile_line(ctx, "10 PRINT 2+4*17"). The tokenizer produces:

{ ("print", IDENTIFIER), (" ", BLANK),
  ("2", INTEGRAL), ("+", OPERATOR),
  ("4", INTEGRAL), ("*", OPERATOR), ("17", INTEGRAL) }

parse_stmt() sees "print" → dispatches to parse_print().

3. Building the Print statement. parse_print() calls the Expression Parser on 2 + 4 * 17. The parser builds the tree:

binary_expression(+)
 ├── literal(integer, 2)
 └── binary_expression(*)
      ├── literal(integer, 4)
      └── literal(integer, 17)

The root handle is wrapped in a stmt_print_t object stored in _prog_line[10].

4. Run phase. run() calls stmt_print_t::run(ctx) for line 10.

5. Evaluation. stmt_print_t::run() calls node->eval(ctx) on the root:

literal(2) returns variant_t(INTEGER, 2)
binary_expression(*) evaluates to variant_t(INTEGER, 68) (4 × 17)
binary_expression(+) adds them: variant_t(INTEGER, 70)

6. Output. The variant is converted to "70" and written to the output stream.

Extending the Built-in Function Set

Adding a new built-in function requires touching exactly three source files. The example is a 1-D convolution function Conv.

File	Change
`lib/nu_global_function_tbl.cc`	1. Add the C++ algorithm. 2. Add the functor. 3. Register with `fmap["conv"] = conv_functor;`
`lib/nu_builtin_help.cc`	Add one `help_content_t` entry to `_help_content[]`
`lib/nu_reserved_keywords.cc`	Add `"conv"` to the `list()` set

Step 1 — Implement the C++ Algorithm

template <typename T>
std::vector<T> conv(const std::vector<T>& v1, const std::vector<T>& v2)
{
    const int n = int(v1.size());
    const int m = int(v2.size());
    const int k = n + m - 1;
    std::vector<T> w(k, T());

    for (int i = 0; i < k; ++i) {
        const int jmn = (i >= m - 1) ? i - (m - 1) : 0;
        const int jmx = (i < n - 1)  ? i            : n - 1;
        for (int j = jmn; j <= jmx; ++j)
            w[i] += v1[j] * v2[i - j];
    }
    return w;
}

Step 2 — Write the Functor

Every built-in function is registered as a C++ callable with the signature:

variant_t functor_name(
    rt_prog_ctx_t&        ctx,
    const std::string&    name,
    const nu::func_args_t& args
);

Arguments are evaluated in order (args[0], args[1], …); each eval() call may update variables or advance I/O state, so the order is semantically significant.

Step 3 — Register the Functor

Inside global_function_tbl_t::get_instance():

fmap["conv"] = conv_functor;

The key is a case-insensitive string (icstring_t), so users may write Conv, conv, or CONV interchangeably.

Step 4 — Add Inline Help

In lib/nu_builtin_help.cc, add to _help_content[]:

{ lang_item_t::FUNCTION, "conv",
    "Returns a vector of Double as result of convolution of 2 given vectors of numbers",
    "Conv( v1, v2 [, count1, count2 ] )" },

Step 5 — Add to the Reserved Keyword List

In lib/nu_reserved_keywords.cc, add "conv" (lowercase) to the set returned by reserved_keywords_t::list(). This enables syntax highlighting and F12 auto-completion in the IDE.

← History and Evolution | Next: Building from Source

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interpreter Internals

Interpreter Internals

Contents

Main Components

A Line-Oriented Interpreter

Tokens and the Tokenizer

Token List Container

Parsing the Code

Expression Parsing in Detail

The Variant Type

Tracing Execution of a Simple Program

Extending the Built-in Function Set

Step 1 — Implement the C++ Algorithm

Step 2 — Write the Functor

Step 3 — Register the Functor

Step 4 — Add Inline Help

Step 5 — Add to the Reserved Keyword List

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally