Syntax Trees
Suppose you want to find the print
token in a BASIC program. You might naively try pattern matching against print
. But this would fail, if the program were, say,
10 print "print is tokenized as 186"
because of the print
in the string literal. Next you write a regular expression that excludes quoted text. This works for the above, but not for a program like, say,
10 print "print is tokenized as 186": rem print something about "print"
20 data print,186
Now you have to exclude comments and DATA
statements as well. Furthermore, when searching for string literals, do you want to include quoted text in comments? Say you have accounted for all that, and someone comes with a program
10 let print = 3
20 print print
Now your algorithm reports there are 3 print
tokens, when the programmer intended two of these as variable names. Using print
as a variable name is illegal, but the point is the parser should find the error.
The purpose of a syntax tree is to provide information about the content of an expression (in the broadest sense) in a way that does not require complex pattern matching. In other words, if you have the syntax tree, you don't need to worry about any of the above.
S-expressions are a way of writing a syntax tree in textual form. Given the second example above, this parser yields the S-expression
(source_file
(line
(linenum) (statement (tok_print) (str)) (statement (tok_rem) (comment_text)))
(line
(linenum) (statement (tok_data) (data_literal) (data_int))))
Containment of a bracket pair within another bracket pair indicates a child-parent relationship. For example, since (line...)
occurs inside the brackets associated with (source_file...)
, line
is a child of source_file
.
Now consider the task of finding all the print
tokens in the code. Rather than pattern matching against the actual text, you search for the node type you want, in this case tok_print
. This search is a straightforward string comparison. The perfect parser would allow you to find anything you need in this simple way. Of course no parser is perfect, so you may need to use other information, such as looking at the node's parent type, to find what you want.
The specifics of walking a syntax tree depend on the parser API. This parser is built using Tree-sitter, which exposes an API that allows you to move a cursor through the tree. For example, using the WASM binding from TypeScript one might have
// assume we have `parser` object and code in the string `mycode`
const tree = parser.parse(mycode);
const curs = tree.walk();
curs.gotoFirstChild(); // should be a line
curs.gotoFirstChild(); // should be a line number
curs.gotoNextSibling(); // should be a statement
curs.gotoFirstChild(); // could be many things
if (curs.nodeType=='tok_print')
console.log('first statement is print');
Applications will often make use of a standard function to walk the whole tree using cursor functions. As an example, vscode-language-applesoft does it like this:
walk(syntaxTree: Parser.Tree,visit: (node: Parser.TreeCursor) => WalkerChoice) {
this.depth = 0;
const curs = syntaxTree.walk();
let choice : WalkerChoice = WalkerOptions.gotoChild;
do {
if (choice == WalkerOptions.gotoChild && curs.gotoFirstChild()) {
this.depth++;
choice = visit(curs);
}
else if (choice == WalkerOptions.gotoParentSibling && curs.gotoParent() && curs.gotoNextSibling()) {
this.depth--;
choice = visit(curs);
}
else if (choice==WalkerOptions.gotoSibling && curs.gotoNextSibling())
choice = visit(curs);
else if (curs.gotoNextSibling())
choice = visit(curs);
else if (curs.gotoParent()) {
this.depth--;
choice = WalkerOptions.gotoSibling;
}
else
choice = WalkerOptions.exit;
} while (choice!=WalkerOptions.exit);
}
Note that the visit function will generally need to be bound to some class of which walk
is a member, in order to pass information out of the function. In particular, you will have something like
this.walk(tree,this.visit_node.bind(this));