Weighted Regular Expressions, an experiment in porting an academic Haskell library to Rust

Rigged Regular Expressions

The two code branches contain versions of Weighted Regular Expressions as described in the paper A Play on Regular Expressions: A Functional Pearl (paper, slides) by Sebastian Fischer, Frank Huch, and Thomas Wilke. The Haskell version is faithful to the paper, and is more or less what's presented there, entered by hand (as that's pretty much the only way anything gets into my brain these days). The Rust version is a port of the Haskell version, inefficiencies and all, to see what it takes to reimplement academic results in what is presented as a "systems" language.

The paper uses the term "weighted" to describe the post-processing it does with an algebraic construct known as a semiring, which exploits structural similarities among the booleans, the ordinary numbers, and other data structures to turn the return value of a successful recognition into a parseable dataset all its own. I'm using the word "rig" as it's shorter; a semiring is related to the algebraic construct called a "ring," but cannot handle negative elements, and so... a "rig."
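As a sketch (my illustration here, not necessarily the code in either branch), the semiring interface can be expressed as a Haskell type class, with the booleans and the integers as two canonical instances:

```haskell
-- A semiring: two associative operations with identities, where (.*.)
-- distributes over (.+.) and zero annihilates under (.*.).
-- There is no negation, hence "rig" (a ring without negatives).
class Semiring s where
  zero, one    :: s
  (.+.), (.*.) :: s -> s -> s

-- Recognition: "did the expression match?"
instance Semiring Bool where
  zero  = False
  one   = True
  (.+.) = (||)
  (.*.) = (&&)

-- Counting: "in how many ways did the expression match?"
instance Semiring Int where
  zero  = 0
  one   = 1
  (.+.) = (+)
  (.*.) = (*)
```

Running the same matcher with Bool weights answers yes or no, while running it with Int weights counts ambiguous parses; swapping the semiring is the whole point of the "weighted" construction.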

I'm hip to Fritz Henglein's dream of extracting all possible parsing information via a catamorphism over the parse tree, and that's exactly what the semiring does. I'm evaluating the use of the semiring against Might's parser-combinator approach to see which is easier to implement and which provides the most bang for the buck. So having regular expressions "rigged" just seemed like the right term.

Status

The paper describes itself as "a play," and is presented in three acts: the first act has two scenes, and the second and third acts have three scenes each. The end of each scene describes a regular expression engine of some efficiency and capability, for a total of eight different regular expression engines.

Currently written:

Todo

  • Section II.2 of the paper (Substring matching)
  • An abstraction of the Brzozowski algorithm, without an 'enum' dispatch type.
  • Rigged implementations of Brzozowski's algorithm, in Haskell
  • Rigged implementations of Brzozowski's algorithm, in Rust
  • An explanation of how Might & Adams's code is secretly a concrete (and naive?) implementation of semirings, a result of their being more Lisp-minded than Haskell-minded.
  • An exploration of whether or not extended regular expressions (regular expressions supporting the Intersection, Negation, and Interleave operations) are theoretically sound.
  • An implementation of a Rigged Brzozowski Algorithm with Adams's and Darais's optimizations.
  • An implementation of a Rigged Brzozowski Algorithm with Might's recursion enabled, using Darais's laziness as its basis.

Lessons learned

A major goal of this exercise is to illustrate the challenges, and perhaps impossibilities, of porting idiomatic Haskell to idiomatic Rust. The regular expressions being built are static, constructed at compile time along with the rest of the code.

The best thing I've learned is that many Haskell idioms do port easily to Rust. Looking at the Rust version of accept in the first two examples and comparing it to the Haskell, you can readily see just how straightforward the translation is.

On the other hand, Haskell's ability to define list processing recursively really shines in the definitions of split and parts, which are each three lines long in Haskell but 21 and 29 lines long, respectively, in Rust.
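For reference, here is the paper's Act I recognizer in Haskell (transcribed from the paper, so treat this rendering as a sketch), showing just how compact those recursive definitions are:

```haskell
-- A regular expression over characters: empty string, a symbol,
-- alternation, sequencing, and Kleene repetition.
data Reg = Eps | Sym Char | Alt Reg Reg | Seq Reg Reg | Rep Reg

-- Naive recognizer: True iff the whole string matches the expression.
accept :: Reg -> String -> Bool
accept Eps       u = null u
accept (Sym c)   u = u == [c]
accept (Alt p q) u = accept p u || accept q u
accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- split u]
accept (Rep r)   u = or [and [accept r ui | ui <- ps] | ps <- parts u]

-- All ways to cut a list into a prefix and a suffix (either may be empty).
split :: [a] -> [([a], [a])]
split []     = [([], [])]
split (c:cs) = ([], c:cs) : [(c:s1, s2) | (s1, s2) <- split cs]

-- All ways to break a list into non-empty contiguous pieces.
parts :: [a] -> [[[a]]]
parts []     = [[]]
parts [c]    = [[[c]]]
parts (c:cs) = concat [[(c:p):ps, [c]:p:ps] | p:ps <- parts cs]
```

Each of split and parts is three lines of Haskell; the Rust counterparts have to manage ownership and allocation explicitly, which is where the extra lines go.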

The other thing is that the "Rigged Kleene regular expressions in Haskell with Parse Forests" implementation makes me unhappy; I don't have a good theoretical model for the change I made. My expectation is that "one" here isn't just one, but also carries with it knowledge of what that "one-ness" means; the zero/one relationship holds as far as the poset is concerned, but the one value now carries knowledge the semiring needs to assemble the resulting parse forest. This is similar to Matt Might's "smart epsilons," but I haven't gotten there yet.

LICENSE

The Rust code is Copyright Elf M. Sternberg (c) 2019.

The Haskell code as presented so far is straight from the paper and therefore belongs to Sebastian Fischer, Frank Huch, and Thomas Wilke. In the event that I choose an alternative DFA constructor (and I probably will, since they use Glushkov's construction and I'm trying to use Brzozowski's algorithm), the files will be annotated to indicate which is which.