This document is still a work in progress.
Clojure-ts-mode is based on the tree-sitter-clojure grammar.
If you want to contribute to clojure-ts-mode, it is recommended that you familiarize yourself with how tree-sitter works. The official documentation is a great place to start: https://tree-sitter.github.io/tree-sitter/ These guides for Emacs tree-sitter development are also useful:
- https://casouri.github.io/note/2023/tree-sitter-starter-guide/index.html
- Developing major modes with tree-sitter (from the Emacs 29+ manual: C-h i, then search for tree-sitter)
In short: Tree-sitter is a tool that generates parser libraries for programming languages, and provides an API for interacting with those parsers. The generated parsers can create syntax trees from source code text. The nodes of those trees are defined by the grammar. Emacs can use these generated parsers to provide major modes with things like syntax highlighting, indentation, navigation, structural editing, and many other things.
- Parser: A dynamic library compiled from C source code that is generated by the tree-sitter tool. A parser reads source code for a particular language and produces a syntax tree.
- Grammar: The rules that define how a parser will create the syntax tree for a language. The grammar is written in JavaScript. Tree-sitter tooling consumes the grammar as input and outputs C source (which can be compiled into a parser).
- Syntax Tree: A tree data structure composed of syntax nodes that represents some source code text.
- Concrete Syntax Tree: Syntax trees that contain nodes for every token in the source code, including things like brackets and parentheses. Tree-sitter creates Concrete Syntax Trees.
- Abstract Syntax Tree: A syntax tree with less important details removed. An AST may contain a node for a list, but not individual parentheses. Tree-sitter does not create Abstract Syntax Trees.
- Syntax Node: A node in a syntax tree. It represents some subset of a source code text. Each node has a type, defined by the grammar used to produce it. Some common node types represent language constructs like strings, integers, operators.
  - Named Syntax Node: A node that can be identified by a name given to it in the tree-sitter Grammar. In clojure-ts-mode, list_lit is a named node for lists.
  - Anonymous Syntax Node: A node that cannot be identified by a name. In the Grammar these are identified by simple strings, not by complex Grammar rules. In clojure-ts-mode, "(" and ")" are anonymous nodes.
- Font Locking: What Emacs calls "Syntax Highlighting".
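The difference between named and anonymous nodes, and between concrete and abstract syntax trees, can be sketched with plain Clojure data. This is only a hand-written model of the tree for the source text (a 1). The map keys :type, :named?, and :children and the function named-only are assumptions made for illustration; they are not part of tree-sitter or clojure-ts-mode.

```clojure
;; Hypothetical, hand-written model of the concrete syntax tree for "(a 1)".
;; Anonymous nodes such as "(" and ")" are present, as in a concrete syntax tree.
(def cst
  {:type "list_lit" :named? true
   :children [{:type "(" :named? false}
              {:type "sym_lit" :named? true
               :children [{:type "sym_name" :named? true}]}
              {:type "num_lit" :named? true}
              {:type ")" :named? false}]})

(defn named-only
  "Return a copy of node that keeps only named nodes (illustrative helper)."
  [node]
  (let [kids (->> (:children node)
                  (filter :named?)
                  (mapv named-only))]
    (cond-> {:type (:type node)}
      (seq kids) (assoc :children kids))))

(named-only cst)
;; => {:type "list_lit", :children [{:type "sym_lit", :children [{:type "sym_name"}]} {:type "num_lit"}]}
```

Dropping the anonymous nodes gives something closer to an abstract syntax tree. Tree-sitter itself always hands the consumer the full concrete tree.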
Clojure-ts-mode uses the tree-sitter-clojure grammar, which can be found at https://github.com/sogaiu/tree-sitter-clojure. The tree-sitter-clojure grammar provides very basic, low-level nodes that try to match Clojure's very light syntax.
There are nodes to represent:
- Symbols (sym_lit)
  - Contain (sym_ns) and (sym_name) nodes
- Keywords (kwd_lit)
  - Contain (kwd_ns) and (kwd_name) nodes
- Strings (str_lit)
- Chars (char_lit)
- Nil (nil_lit)
- Booleans (bool_lit)
- Numbers (num_lit)
- Comments (comment, dis_expr)
  - dis_expr is the #_ discard expression
- Lists (list_lit)
- Vectors (vec_lit)
- Maps (map_lit)
- Sets (set_lit)
There are also nodes to represent metadata, which appear on meta: child fields of the nodes the metadata is defined on.
For example, a simple vector with metadata defined on it like so:
    ^:has-metadata [1]
will produce a parse tree like so:
    (vec_lit
      meta: (meta_lit
        value: (kwd_lit name: (kwd_name)))
      value: (num_lit))
The best place to learn more about the tree-sitter-clojure grammar is to read the grammar.js file from the tree-sitter-clojure repository.
An important observation that anyone familiar with popular tree-sitter grammars may have picked up on is that there are no nodes representing things like functions, macros, types, and other semantic concepts.
Representing the semantics of Clojure in a tree-sitter grammar is much more difficult than it is for traditional languages, which do not rely on macros as heavily as Clojure and other Lisps do.
Understanding what an expression represents in Clojure source code requires macro-expansion of that code.
Macro-expansion requires a runtime, and tree-sitter does not have, and never will have, access to a Clojure runtime.
Additionally, tree-sitter never looks back on what it has parsed, only forward, considering what is directly ahead of it. So even if it could identify a macro like myspecialdef, it would forget about it as soon as it moved past the declaring defmacro node.
Another way to think about this: tree-sitter is designed to be fast and good enough for tooling to implement syntax highlighting, indentation, and other editing conveniences. It is not meant for interpreting or executing code.
Consider the following macro:
    (defmacro defn2 [sym args & body]
      `(defn ~sym ~args ~@body))
    (defn2 dog [] "bark")
This macro lets the caller define a function, but a hypothetical tree-sitter-clojure semantic grammar might just see a function call where a variable dog is passed as an argument.
How should tree-sitter know that dog should be highlighted like a function? It would have to evaluate the defn2 macro to understand that.
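What "evaluating the macro" means here can be seen at a Clojure REPL with macroexpand-1, which needs a running Clojure process that a parser does not have. The snippet below is a small sketch that repeats the defn2 definition so it can be run on its own.

```clojure
(defmacro defn2 [sym args & body]
  `(defn ~sym ~args ~@body))

;; Only after expansion does it become visible that dog names a function.
(macroexpand-1 '(defn2 dog [] "bark"))
;; => (clojure.core/defn dog [] "bark")

(defn2 dog [] "bark")
(dog)
;; => "bark"
```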
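As a second example, consider the following macro and definitions:
    (defmacro no-defn [body]
      (if (= 'defn (first body))
        (rest body)
        body))
    (defn foo [& rest] 1)
    (no-defn (defn foo [] 2))
The last expression evaluates to 1, because no-defn rewrites its argument into the call (foo [] 2), and the following
    (foo)
also evaluates to 1.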
How is tree-sitter supposed to understand that the (defn foo [] 2) part of the expression (no-defn (defn foo [] 2)) is not a function declaration? It would have to evaluate the no-defn macro.
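The same check with macroexpand-1 shows why the inner form is not a declaration. This is a sketch for a Clojure REPL that repeats the definitions above so it runs on its own.

```clojure
(defmacro no-defn [body]
  (if (= 'defn (first body))
    (rest body)
    body))

(defn foo [& rest] 1)

;; The macro strips the leading defn, leaving an ordinary call to foo.
(macroexpand-1 '(no-defn (defn foo [] 2)))
;; => (foo [] 2)

(no-defn (defn foo [] 2))
;; => 1

;; foo was never redefined, so it still returns 1.
(foo)
;; => 1
```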
While these examples are silly, they illustrate the issue with encoding semantics into the tree-sitter-clojure grammar. If we tried to make the grammar understand functions, macros, types, and other semantic elements, it would end up giving false positives and negatives in the parse tree. While this is an inevitability for simple static analysis of Clojure code, tree-sitter-clojure chooses to avoid making these kinds of mistakes altogether. Instead, it is up to the Emacs Lisp code and other consumers of the tree-sitter-clojure grammar to make decisions about the semantic meaning of Clojure code.
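To make the trade-off concrete, here is a minimal sketch of the kind of heuristic a consumer might apply. It is an assumption made for illustration, not how clojure-ts-mode is implemented: the function definition-form? is hypothetical, and it works on forms read by the Clojure reader rather than on tree-sitter nodes.

```clojure
(require '[clojure.string :as str])

(defn definition-form?
  "Hypothetical heuristic: treat a list as a definition when its first
  element is a symbol whose name starts with \"def\"."
  [form]
  (and (seq? form)
       (symbol? (first form))
       (str/starts-with? (name (first form)) "def")))

(definition-form? '(defn2 dog [] "bark"))      ;; => true
(definition-form? '(no-defn (defn foo [] 2)))  ;; => false
(definition-form? '(defn foo [] 2))            ;; => true, a false positive inside no-defn
```

The heuristic handles defn2 correctly but still misreads the form wrapped by no-defn. A consumer can accept, tune, or replace such a rule without any change to the grammar.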
The decision to have tree-sitter-clojure consider only syntax and not semantics has pros and cons. Some of the upsides (non-exhaustive):
- No semantic false positives or negatives in the parse tree.
- Simple grammar to maintain, with fewer nodes and rules
- Small, fast grammar (with a small set of grammar rules, tree-sitter-clojure has one of the smallest binaries and fastest grammars in widespread use)
- Stability: the grammar changes infrequently and is very stable for downstream consumers
And the primary downside: Semantics must be (re)implemented in tools that consume the grammar. While this results in more work for tooling authors, the tools that use the grammar are easier to change than the grammar itself. The inaccurate nature of statically interpreting Clojure semantics means that not every decision made for the grammar would meet the needs of the various grammar consumers. This would lead to bugs and feature requests. Nearly all changes to the grammar will result in some sort of breakage for its consumers, so changes are best avoided once the grammar has stabilized. Avoiding these semantic interpretations in the grammar is therefore one of the best ways to minimize changes in the grammar.
- https://github.com/sogaiu/tree-sitter-clojure/blob/master/doc/scope.md
- https://tree-sitter.github.io/tree-sitter/using-parsers#named-vs-anonymous-nodes
TODO
TODO
TODO: demonstrate how clojure-ts-mode creates semantic meaning from a given syntax tree. Show examples of how new semantic meaning can be added (with highlighting, indentation, etc).