# elfsternberg/riggedregex

Weighted Regular Expressions, an experiment in porting an academic Haskell library to Rust

# Rigged Regular Expressions

The two main code branches, `rust` and `haskell`, contain versions of Weighted Regular Expressions as described in the paper A Play on Regular Expressions: A Functional Pearl (paper, slides), by Sebastian Fischer, Frank Huch, and Thomas Wilke. The early Haskell versions are faithful to the paper, more or less what's presented there, entered by hand (as that's pretty much the only way anything gets into my brain these days). The Rust versions are ports of the Haskell versions, inefficiencies and all, to see what it takes to take academic results and reimplement them in what is presented as a "systems" language.

The paper uses the term "weighted" to describe the post-processing they do with an algebraic construct known as a semiring, which exploits categorical similarities between the set of booleans, the set of ordinary numbers, and other data structures to turn the return value of a successful recognition into an analyzable dataset all its own. I'm using the word "rig" as it's shorter; a semiring is related to the algebraic construct called a "ring," but cannot handle negative elements, and so... a "rig."
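As a sketch of the idea (the trait and method names here are my own choices, not necessarily the repository's), a semiring in Rust might look like this, with the boolean instance giving recognition and the integer instance giving match-counting:

```rust
// Sketch only: `Semiring`, `zero`, `one`, `add`, and `mul` are
// illustrative names, not necessarily the repository's API.
trait Semiring {
    fn zero() -> Self;                  // annihilator: the failed match
    fn one() -> Self;                   // identity: the trivial match
    fn add(&self, rhs: &Self) -> Self;  // combine alternative matches
    fn mul(&self, rhs: &Self) -> Self;  // combine sequential matches
}

// Booleans: recognition ("does it match?").
impl Semiring for bool {
    fn zero() -> Self { false }
    fn one() -> Self { true }
    fn add(&self, rhs: &Self) -> Self { *self || *rhs }
    fn mul(&self, rhs: &Self) -> Self { *self && *rhs }
}

// Unsigned integers: counting ("in how many ways does it match?").
impl Semiring for usize {
    fn zero() -> Self { 0 }
    fn one() -> Self { 1 }
    fn add(&self, rhs: &Self) -> Self { self + rhs }
    fn mul(&self, rhs: &Self) -> Self { self * rhs }
}

fn main() {
    assert_eq!(true.mul(&false), false);                // "and"
    assert_eq!(3usize.add(&4), 7);                      // "plus"
    assert_eq!(<usize as Semiring>::zero().mul(&5), 0); // zero annihilates
}
```

Swapping the boolean instance for the integer one changes the same recognizer from answering "does it match?" to "how many ways does it match?" without touching the matching code, which is the whole appeal.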

I'm interested in Fritz Henglein's dream of extracting all possible parsing information via a catamorphism over the parse tree, and that's exactly what the semiring does. I'm evaluating the use of the semiring against Might's parser-combinator approach to see which is easier to implement and which provides the most bang for the buck. So having regular expressions "rigged" just seemed like the right term.

Later implementations in both branches apply the Play's strategies to Brzozowski's "derivatives of regular expressions" algorithm, and demonstrate that the results hold.
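To make the derivative idea concrete, here is a minimal, unweighted sketch of Brzozowski's algorithm in Rust; the type and function names are illustrative, not the repository's:

```rust
// Sketch of Brzozowski derivatives (unweighted); names are assumptions.
#[derive(Clone)]
enum Re {
    Empty,                  // matches nothing
    Eps,                    // matches the empty string
    Sym(char),              // matches a single character
    Alt(Box<Re>, Box<Re>),  // alternation
    Seq(Box<Re>, Box<Re>),  // sequence
    Star(Box<Re>),          // Kleene star
}

// Does the expression accept the empty string?
fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Sym(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Alt(a, b) => nullable(a) || nullable(b),
        Re::Seq(a, b) => nullable(a) && nullable(b),
    }
}

// The derivative of r with respect to c: the language of suffixes of
// strings in r that begin with c.
fn deriv(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Sym(s) => if *s == c { Re::Eps } else { Re::Empty },
        Re::Alt(a, b) => Re::Alt(Box::new(deriv(a, c)), Box::new(deriv(b, c))),
        Re::Seq(a, b) => {
            let left = Re::Seq(Box::new(deriv(a, c)), b.clone());
            if nullable(a) {
                Re::Alt(Box::new(left), Box::new(deriv(b, c)))
            } else {
                left
            }
        }
        Re::Star(a) => Re::Seq(Box::new(deriv(a, c)), Box::new(Re::Star(a.clone()))),
    }
}

// Matching is repeated derivation followed by a nullability check.
fn matches(r: &Re, s: &str) -> bool {
    nullable(&s.chars().fold(r.clone(), |r, c| deriv(&r, c)))
}

fn main() {
    // (a|b)* a
    let r = Re::Seq(
        Box::new(Re::Star(Box::new(Re::Alt(
            Box::new(Re::Sym('a')),
            Box::new(Re::Sym('b')),
        )))),
        Box::new(Re::Sym('a')),
    );
    assert!(matches(&r, "abba"));
    assert!(!matches(&r, "abb"));
}
```

The "rigged" versions replace the boolean returned by `nullable` with a semiring value, which is the step that lets the same machinery produce weights instead of a yes/no answer.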

This proves what I set out to show: that Goodman's 1979 paper on Semiring Parsing and Brzozowski's Parsing With Derivatives are compatible, and that these results are expressible in a strongly typed systems language like Rust.

Primary mission accomplished.

## Secondary mission

While the primary goal is complete, the project as a whole is not. Several things remain before I consider this project done:

• Implement recursive regular expressions
• Apply the equational optimizations for efficiency.
• Supply an aggregating semiring so that multiple and correlative semantic extractions can happen at the same time.
• Supply Adams's LoL: Language of Languages as an interface layer to a richer parsing experience.
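The aggregating-semiring item can be illustrated with a small sketch: the componentwise product of two semirings is again a semiring, so two extractions can run in one pass. The names here are my own, hypothetical, and the components (recognition and match-counting) are just the two simplest instances:

```rust
// Sketch (names assumed, not the repo's API): an aggregating pair
// semiring computing recognition and match-counting simultaneously.
// The componentwise product of two semirings is itself a semiring.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Both(bool, usize); // (recognized?, number of matches)

impl Both {
    fn zero() -> Self { Both(false, 0) }
    fn one() -> Self { Both(true, 1) }
    fn add(self, r: Self) -> Self { Both(self.0 || r.0, self.1 + r.1) }
    fn mul(self, r: Self) -> Self { Both(self.0 && r.0, self.1 * r.1) }
}

fn main() {
    assert_eq!(Both::one().mul(Both(true, 2)), Both(true, 2));   // identity
    assert_eq!(Both(true, 2).add(Both(true, 3)), Both(true, 5)); // alternatives
    assert_eq!(Both::zero().mul(Both(true, 3)), Both::zero());   // annihilates
}
```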

These are probably separate projects, as they're specific meta-syntactic wrappers around the final version:

• Use the semiring implementation to provide a PEG engine
• Use the PEG to provide Rosie in Rust

And then there's a reality that Russ Cox and Geoff Langdale are so far ahead of me that it's not even funny.

## Status

The paper describes itself as "a play," and is presented in three acts; the first act has two scenes, and the second and third acts have three scenes each. The end of each scene describes a regular expression engine of some efficiency and capability. In total, eight different regular expression engines are implemented.

Currently written:

## Todo

• An exploration of whether or not extended regular expressions (regular expressions supporting the Intersection, Negation, and Interleave operations) are theoretically sound.
• An implementation of a Rigged Brzozowski Algorithm with Adams's and Darais's optimizations.
• An implementation of a Rigged Brzozowski Algorithm with Might's recursion enabled, using Adams's laziness as its basis.

## Lessons learned

A major goal of this exercise is to illustrate the challenges, and perhaps impossibilities, of porting idiomatic Haskell to idiomatic Rust. The regular expressions being built are static, compiled with the rest of the code at compile time.

The best thing I've learned is that many Haskell idioms do port easily to Rust. Looking at the Rust version of `accept` in the first two examples and comparing it to the Haskell, you can readily see just how straightforwardly it translates.
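For flavor, here is a hedged sketch of how the paper's first `accept` might read in Rust, restricted to a star-free fragment to keep it short; the repository's actual definitions may differ:

```rust
// A sketch, not the repository's code: the unweighted `accept` from
// the paper's opening scene, without Kleene star.
enum Reg {
    Eps,                      // the empty string
    Sym(char),                // a single character
    Alt(Box<Reg>, Box<Reg>),  // alternation
    Seq(Box<Reg>, Box<Reg>),  // sequence
}

fn accept(r: &Reg, u: &[char]) -> bool {
    match r {
        Reg::Eps => u.is_empty(),
        Reg::Sym(c) => u.len() == 1 && u[0] == *c,
        Reg::Alt(p, q) => accept(p, u) || accept(q, u),
        // Try every division of u between the two subexpressions, as
        // the Haskell comprehension over `split u` does.
        Reg::Seq(p, q) => (0..=u.len())
            .any(|i| accept(p, &u[..i]) && accept(q, &u[i..])),
    }
}

fn main() {
    // a(b|ε)
    let r = Reg::Seq(
        Box::new(Reg::Sym('a')),
        Box::new(Reg::Alt(Box::new(Reg::Sym('b')), Box::new(Reg::Eps))),
    );
    assert!(accept(&r, &['a', 'b']));
    assert!(accept(&r, &['a']));
    assert!(!accept(&r, &['b']));
}
```

The match arms line up one-to-one with the Haskell equations, which is the straightforwardness being claimed.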

On the other hand, Haskell's ability to define list processing recursively really shines in the definitions of `split` and `parts`, which are each three lines long in Haskell but 21 and 29 lines long, respectively, in Rust.
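For a feel of the gap, the paper's three-line `split`, which enumerates every way of dividing a word into a prefix and a suffix, might transcribe into Rust along these lines (a sketch, not the repository's code):

```rust
// A direct, recursive transcription of the Haskell `split` into Rust;
// representative of the extra ceremony, not the repository's exact code.
fn split(s: &[char]) -> Vec<(Vec<char>, Vec<char>)> {
    match s.split_first() {
        // split "" = [("", "")]
        None => vec![(vec![], vec![])],
        // split (c:cs) = ("", c:cs) : [(c:p, q) | (p, q) <- split cs]
        Some((c, cs)) => {
            let mut out = vec![(vec![], s.to_vec())];
            for (p, q) in split(cs) {
                let mut prefix = vec![*c];
                prefix.extend(p);
                out.push((prefix, q));
            }
            out
        }
    }
}

fn main() {
    let splits = split(&['a', 'b']);
    assert_eq!(splits.len(), 3); // "", "a", "ab" as prefixes
    assert_eq!(splits[0], (vec![], vec!['a', 'b']));
    assert_eq!(splits[2], (vec!['a', 'b'], vec![]));
}
```

What a Haskell list comprehension expresses in one line costs Rust explicit allocation, mutation, and ownership bookkeeping, which is where the line counts balloon.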

The other thing is that the "Rigged Kleene regular expressions in Haskell with Parse Forests" example made me unhappy at the time, because I didn't have a good theoretical model for the change I made. It turns out my theoretical model was just fine; I just didn't know or understand it sufficiently. The expectation is that yes, you can return an annihilator, or you can return a value. The default value is "one," but the meaning of "one" can be whatever you want it to mean, so long as the semiring laws hold for the definitions you invent. In short, the semiring treats zero and one as "annihilator" and "identity" operations on your return value. When the value is primitive, such as in the Boolean or Integer case, all you need are zero and one; when the value is compositional, you need the `mul` operator to do the dirty work for you.
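A compositional example (my own illustration, not a type from the repository): a semiring of sets of strings, where `zero` is the annihilating empty set, `one` is the identity `{""}`, and `mul` does the real work:

```rust
use std::collections::BTreeSet;

// Sketch: a compositional semiring value. Unlike bool or usize, the
// interesting behavior lives in `mul`, not in the constants.
#[derive(Clone, Debug, PartialEq)]
struct Lang(BTreeSet<String>);

impl Lang {
    fn zero() -> Self { Lang(BTreeSet::new()) }           // annihilator
    fn one() -> Self { Lang([String::new()].into()) }     // identity {""}
    fn add(&self, rhs: &Self) -> Self {
        // Alternatives: union of the two sets.
        Lang(self.0.union(&rhs.0).cloned().collect())
    }
    fn mul(&self, rhs: &Self) -> Self {
        // Sequencing: pairwise concatenation, the compositional step.
        Lang(self.0.iter()
            .flat_map(|a| rhs.0.iter().map(move |b| format!("{a}{b}")))
            .collect())
    }
}

fn main() {
    let ab = Lang(["a".to_string(), "b".to_string()].into());
    let c = Lang(["c".to_string()].into());
    assert_eq!(ab.mul(&Lang::one()), ab);            // one is identity
    assert_eq!(ab.mul(&Lang::zero()), Lang::zero()); // zero annihilates
    assert_eq!(ab.mul(&c).0.len(), 2);               // {"ac", "bc"}
}
```

Concatenating with `{""}` changes nothing and concatenating with the empty set destroys everything, which is exactly the "identity" and "annihilator" reading described above.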

And then you remove the 'identity' pass because, well, the CPU doesn't need to waste time on that.