Permalink
Switch branches/tags
Nothing to show
Find file
90ef070 Dec 3, 2011
453 lines (354 sloc) 18.9 KB

A Guided Tour of an Implementation

of Clojure Destructuring Bind

in Emacs Lisp

Introduction

Clojure is a (relatively) new Lisp variant which has attracted a lot of attention due to its modern features and close relationship with the JVM. Emacs Lisp is a lisp so antiquated that it attracts more ridicule than plaudits. This document is an attempt to bring some of the nicer syntactical features from Clojure into Elisp.

Why? I spend a lot of time in Emacs, which I use to organize notes, experiment logs and to coordinate other programming languages. Because I have limited opportunity to use other languages in my day to day life, I like to implement features from other languages in Elisp so that I get to use them on a daily basis. In the same vein I've implemented something similar to Clojure's multimethods, syntax for monadic operation, many monads, lazy lists, and a stack-based concatenative language inspired by Factor. I'm currently working on an implementation of Kanren, the logic language described in "The Reasoned Schemer." This archive includes just the projects described here, but all my code is available on github.

I've submitted this particular code because it represents several months of work, demonstrates an auto-didactic streak which I think is an important part of my personality, and also involves my interest in functional programming languages and programming language theory and design. I also like this particular project because it demonstrates how plucky even an old Lisp can be - we can import very modern features into a lisp dialect which is quite old and crusty. I'm not, however, a Lisp zealot - I'm quite interested in working with other functional languages.

Running the Code

This code requires GNU Emacs 23.1.1. Extract the files located with this readme to somewhere convenient. Then start emacs with emacs -q. In the scratch buffer, execute:

(add-to-list 'load-path <Location/of/files/>)

You should then be able to (require 'defn) and execute the examples in this document and/or play with the library. Note that like most other macro-heavy Elisp projects (like the partial Common Lisp implementation) this library performs best when byte-compiled.

Destructuring Bind

Clojure owes a certain debt to the statically typed functional languages. It's emphasis on Lazy values calls to mind languages like Haskell, and its destructuring bind forms refer to the pattern matching (efficiently) afforded by static type systems like those in Standard ML and its variants. Destructuring bind has appeared in dynamic languages like Lua and Python, where simple tuples can be destructured, but Clojure's support is somewhat more extensive in that it extends to generic destructuring of sequences and tables. A few examples will illustrate the facility.

In Scheme, for instance, we might wish to swap two numbers held in the first and second slot of a list.

(define (swap lst) 
 (list (cadr lst) (car lst)))

In Clojure we could specify how to extract values from the arguments of a function right in the function definition:

(defn swap [[a b]] (list b a))

(Clojure's binding forms are always literal vectors.) Here swap takes only one argument, but it binds two names. The inner [a b] expression indicates that the single argument is expected to be a list, and that its first and second values should be bound to the symbols a and b respectively. Clojure's destructuring bind supports recursive binding forms. If, for instance, we wanted to write a function which accepts a single list, whose second element is a list, whose first value we wish to extract, we could write:

(defn extract [[_ [a]]] a)

Where we have nested the destructuring deep into the sequence to pull out a. Error checking aside, this is a pretty nice feature, particularly because Clojure also allows destructuring on tables, which have their own source-level representation. To write a function which pulls out the values in a table located at keys :x and :y and returns them in a flat list, we can write:

(defn extract [{a :x b :y}] (list a b))

Suppose the value at x were a list and we wished to get its second element:

(defn extract [{[_ part] :x}] part)

Would do the trick. We can combine, recursively, table and sequence destructuring syntax. This project is an implementation of this feature in Emacs Lisp.

In short, Clojure's destructuring bind is a mini-language for describing access to data structures.

Bonus Material: Recur

Tail call optimization is a controversial subject in some arenas. The feature was somewhat famously kept out of Python by its "dictator for life." Scheme implementations, on the other hand, are required to support this feature. More for reasons related to the JVM than any political sensitivity, Clojure takes a middle path: tail calls to "oneself" can be made by virtue of an explicit form, recur, which resembles a function call but can only be invoked from tail position, and which reuses the current stack frame instead of creating a new one. This allows many basic algorithms which depend on tail calls for elegant expression to be written naturally in Clojure.

This library also allows the use of a recur special form, statically checked to be only from tail position.

Syntactic & Other Notes

Clojure supports tables at the level of source code via a curly braces notation:

{:x 10 :y 11} 

Would be the table with keys :x, :y pointing to 10 and 11 respectively. Unfortunately, Emacs Lisp does not provide facilities to extend the reader, which we could use to create a syntax for tables if we were using (say) Common Lisp. We'll be using a quasi-ad-hoc solution instead. My standard library provides functions to create tables succinctly:

(tbl! :x 10 :y 11)

is equivalent to the above Clojure. For destructuring syntax we will use a vector to represent sequences ([a b c]) and a vector with a special head token to represent tables ([:: a :x b :y]). That is, :: indicates an expression represents a table, rather than sequence destructuring.

Additionally, we'll allow the table syntax to destructure association-lists, since these are common table surrogates in other lisp dialects and because they can be made persistant (as data structures) more easily than Emacs Lisp's tables. This has no impact on the syntax for destructuring bind, however.

Finally, this implementation uses a few non-standard special forms from my standard library which bear remarking upon. The form let-seq, defined in utils.el is a simple form for destructuring lists. It creates a context where a series of symbols are bound to the values in a list:

(let-seq (a b c) (list 3 2 1) 
  (list a b c))

Evaluates to '(1 2 3).

The form let-tbl allows a very simple form of table destructuring.

(let-tbl 
 ((x :a)
  (y :b))
 (tbl! :a 10 :b 11)
 (+ x y))

Evaluates to 21.

Both let-seq and let-tbl are used to parse the destructuring syntax only. Destructuring bind itself does not expand into them, basically because destructuring can have recursive structures and these forms are "flat".

The implementation makes use of other functions in the utils.el library, but these are the most conspicuous departures from recognize- able lisp.

On a final note, Emacs Lisp is, by default, a dynamically scoped language. Functional programming depends a lot on closures (even your ad asks "Do you think in closures?"), and so you'll see the lexical-let form throughout the source code. This is form from a partial implementation of Common Lisp that ships with Emacs which creates a lexical scope rather than a dynamic one. Lexical scope is a little more expensive than dynamic scope in Emacs, so you'll see this form peppered throughout the code where necessary, rather than used exclusively.

Implementation Sketch

Any destructuring bind expression will ultimately expand to a let* form in Emacs Lisp. Eg:

(dlet [[x y] (list 1 2)]
   (+ x y))

Expands to something not unlike:

(let* ((_ (list 1 2))
       (x (elt _ 0))
       (y (elt _ 1)))
    (+ x y))

(Where _ in the actual expansion would be a fresh symbol for the purposes of macro hygiene, probably generated with gensym.)

The task of this project is to parse the clojure-style forms and convert them into a series of symbol/value pairs for a let* expression.

Two sub-parsers are defined in the files parse-seq-binder.el and parser-table-binder.el, for sequences and tables respectively. These parsers check and extract the syntax of their respective types, but do not recursively parse sub-parts. Recursive parsing is supported at a higher level, essentially by trampolining.

Both sub-parsers are ad-hoc parsers which fold series of state dependent functions over the sequence of tokens representing the binding expression. Ultimately they return a list of important information about each binding expression which allows the equivalent pure emacs lisp let forms to be constructed.

The action of these parsers is coordinated in defn.el's handle-binding function which handles detecting which type of binder needs to be interpreted and dispatching to the appropriate parser. Recursive parsing of binding forms is supported at this level.

Emacs Lisp enforces a max-nesting depth and so the slightly more obvious strategy of creating nested-let*s is put aside in favor of passing around an accumulator for the entire sequence of binding pairs. Hence, the functions handle-binding, handle-tbl-binding and handle-seq-bindind sport an additional argument previous-lets so that a list of let* binders can be collected. There is much room for optimization in this collection of of let forms, but since macro-expansion is itself pretty expensive in Elisp for various reasons, in practice we'll almost always compile code using destructuring. The Emacs Lisp byte compiler is capable of performing these optimizations for us.

Once a sequence of symbol/value pairs is constructed, the rest of the work of the macro is essentially bookkeeping. fn is the form which does most of the work, since defn expands in terms of fn. The body of the expanded fn is a lambda expression. This expression takes any number of arguments and checks this number against the arities of the binders it was defined with. If it finds a match, it invokes the appropriate body - each body wrapped in a let* form whose binding expressions were generated by parsing and expanding the appropriate destructuring expression.

The body of the function is expanded in slightly different ways depending on whether the function tries to call itself recursively using recur.

Recur Implementation

Tail recursion is similar to a loop in that it essentially indicates that we re-execute a set of forms in a context where variables have been bound to new values. This library supports tail recursion by using a codewalker to find and appropriately expand recur forms in the body of a recur supporting form, enclosing the entire body in a while loop. An invisible loop sentinel controls whether we "recur" when execution terminates or whether we rebind the appropriate variables (using a setq expression) and execute the body again.

Because we are destructively updating the variables in our recursive context, it is particularly important that we check whether recur is invoked from tail position. If we were to expand a recur form in any context where there were subsequent expressions which depended on the values in the scope, these expressions would behave anomalously. Hence, we need a codewalker to check each occurrence of recur is in tail position.

Given an expression expr, a tail position is any position for which there are no subsequent expressions to be evaluated. For example:

(progn a b c)

In the above, c is in tail position.

(progn a b (recur c))

would be a legal place for recur to occur, but:

(progn (recur a) b c)

Would produce an error because (recur a) occurs before b or c are evaluated. In an if statement, tail-ness is preserved by either branch of the statement. So if an if statement is in tail position, either branch is also in tail position, because they are mutually exclusive. Similar reasoning applies to cond and other special forms.

The codewalker implemented in the second half of defn.el bears this basic idea out, covering the entire bestiary of basic Emacs Lisp forms. Because Emacs Lisp is extensible via macros, our codewalker cannot understand all possible un-expanded Emacs Lisp fragments. Hencec this library calls macroexpand-all on code before it is walked in order boil down any Elisp into a form containing only primitive expressions. The macro dsetq is then used to expand recur into an expression which sets the local variables to their new values.

If the codewalker detects a recur in non-tail position, it produces an error.

Note on Recur and Lexical Scope

This version of recursion will not handle lexically scoped variables completely correctly. This is because it mutates the context of a recursion without considering whether a lambda expression or other artifact is hanging onto a reference to some value in that context. Supporting the correct behavior would require a much more complex codewalker, and, in any case the need is somewhat obviated by the requirement that Emacs Lisp programmers explicitly indicate a desire for lexical scope using a lexical-let expression. Each lexical-let then closes over its own version of the variables in scope. So even though we are employing mutation under the hood:

(setq closures (dloop [i 0
                          acc nil]
                      (if (= i 10) (reverse acc)
                        (recur (+ i 1) 
                               (cons
                                (lexical-let ((x i))
                                  (lambda () x)) acc)))))

, which collects a series of lambda's enclosing different values of i, the following code produces the correct values:

(funcall (elt closures 0)) -> 0
(funcall (elt closures 4)) -> 4

etc.

Emacs Lisp will support real lexical scope in the next release of emacs. I'm pretty excited by this development, but I'm not sure how it will impact these libraries. Probably they should be completely rewritten to exploit the new model of execution.

Example Usage

Destructuring bind lets us construct some fairly concise implementations of certain functions. The function prod, which takes a list and returns the product of its elements, might be implemented with (lambda (lst) (reduce #'* lst)), of course, but it makes a nice example:

(require 'defn)
(defn prod
 ;;; prod will have two bodies, differentiated by arity
 ([[head & tail :as list] acc]
  (if list (recur tail (* acc head)) acc))
 ([list]
  (prod list 1)))

This example demonstrates dispatch based on arity, destructuring of a sequence (first argument, first body) and recursion.

(prod '(1 2 3 4 5 6 7)); -> 5040

The form defn always compiles its innards, so the resulting code is pretty zippy. Note that when making a call to an alternate arity version of the same function, we must use a regular function invokation rather than recur. We'd get a binding error otherwise.

Suppose we are using an association list to represent people in some kind of database. People have first and last names, but they might not have a middle name, so both:

'((:first-name "George") (:last-name "Washington"))
'((:first-name "Timofey") (:middle-name "Pavlovich") (:last-name
  "Pnin"))

are legitimate "persons" (neglecting that "Pavlovich" is a patronym rather than a proper middle name). How might we write a function to extract the middle name succinctly?

(defn extract-middle-name 
      [[:: middle :middle-name :or 
       (alist>>
        :middle-name :NMI)]]
     middle)

(extract-middle-name '((:first-name "Timofey") 
                       (:middle-name "Pavlovich") 
                       (:last-name "Pnin"))); -> "Pavlovich"

(extract-middle-name '((:first-name "George") (:last-name
"Washington"))); -> :NMI 

(NMI stands for "No Middle Name")

The function extract-middle-name demonstrates the table destructuring syntax and the use of an :or clause, which specifies an alternative expression to destructuring the event that any destructuring of the input fails. It can be used to provide default values for variables.

Destructuring bind in Clojure provides one other feature which is supported by this library. Tables may use a short hand :keys expression:

(defn demo-keys [[:: :keys [a b c]]] (list a b c))
(demo-keys 
  (alist>> :a 'x :b 'y :c 'z))

which is equivalent to:

(defn demo-keys-eqv [[:: a :a b :b c :c]] (list a b c))

That about covers it!

If I'd Known Then What I Know Now

This project is a little more than a year old, and in the intervening time, I've learned a few new things which would have changed, somewhat, the implementation. For one, the more complete understanding of the destructuring bind syntax which I developed while creating this library would probably have informed a more elegant implementation. There are definite signs of gradual accretion in this code. A complete rewrite is on my todo list.

More concretely, I now recognize that parsing the binding expressions is really a job for less ad-hoc parsing strategy. Subsequent to this project, I produced an implementation of monadic parser combinators based on SMUG, for Common Lisp. If I were to re-implement this project now, I'd use a combinator approach for parsing. For another project I've written a Common-Lisp lambda list parser using parser combinators, which is in simplified-lambda-list-parser.el. That library should convey the style of what a more informed implementation might look like (although CL-lambda lists are simpler than Clojure destructuring forms). See also monad-parse.el and monads.el, which are my implementations of monadic operation and parsing, though more closely based on previous work than this library.

The question of recur is separate, basically, from the destructuring bind problem, and handling the two things together is responsible for some of the mess of this implementation. I've subsequently factored out recur support into a separate library. A rewrite would concern itself mostly with destructuring bind, and use the recur.el library to support self-recursion. The implementation in recur.el is also a little bit cleaner, and supports a better error checking model than the one in this file. I've included it, if the reader is interested. There are also some stylistic differences - the codewalker dispatches to a function for each form encountered rather than doing processing in place, for instance. This makes for a more readable implementation.

Thanks for reading, I hope you've found it interesting.