This repo contains a command-line app that takes in a simple regular expression (letters and digits with no spaces and the operators "|", "*" and "+" together with parenthesization) and outputs the source to a Go program which parses inputs for matches to the same expression.
It is intended for educational purposes and not for use in any production system.
We will use the example from Thompson's paper for demonstrative purposes. Executing
./thompson-regex 'a(b|c)*d'
should produce output like the below (after running go fmt
):
package main
import (
"fmt"
"log"
"os"
)
/* trimmed some stuff up here */
func main() {
if len(os.Args) != 2 {
log.Fatalln("must supply input string")
}
input := []rune(os.Args[1])
expmatcher := concat{
concat{
char('a'),
closure{
or{
char('b'),
char('c'),
},
0,
},
},
char('d'),
}
matches := []string{}
for i := 0; i < len(input); {
if ok, n := expmatcher.match(input[i:]); ok {
matches = append(matches, string(input[i:i+n]))
i += n
continue
}
i++
}
fmt.Printf("%q\n", matches)
}
Additional output languages can be added here. For example,
./thompson-regex 'a(b|c)*d' -l python3
import sys
# trimmed a ton of stuff from here
if len(sys.argv) != 2:
print("must supply input string")
sys.exit()
inputstr = sys.argv[1]
exprmatcher = ((Char('a') + (Char('b') | Char('c')) ** 0) + Char('d'))
matches = []
i = 0
while i < len(inputstr):
ismatch, n = exprmatcher.match(inputstr[i:])
if ismatch:
matches += [inputstr[i:i+n]]
i += n
else:
i += 1
print(matches)
Ken Thompson's famous paper on implementing regular expressions is surprisingly simple and yet extremely sophisticated at the same time. The "heart" of the method he proposed was the use of a stack in order to compile the expression to code that parses it, which in this case was IBM 7094 object code.
I find this idea beautiful, and wanted a chance to play around with it for myself, as well as demonstrate to anyone interested the way the algorithm works. Since the aim is demonstration rather than a concrete (or usable) implementation, it makes sense to compile to a high-level language, which is why I have started by targeting Go.
I have used the term assembler to refer to the portion of the code that produces Go source (and hopefully other sources in future), which may be slightly confusing given its usual association with the program that produces machine code out of assembly langauge. The aim is to draw an analogy: Thompson's goal was efficiency, so his implementation produced object code; mine is illustration and education, so it makes sense to target a high level language with the human mind being the real intended machine.
Feel free to create issues to share feedback or make suggestions. The way the codebase is designed it is quite easy to add other output formats, provided they are text-based: see here.