This repository has been archived by the owner. It is now read-only.
A *correct* implementation of regular expression matching (using finite state automata).
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
redone
tests
.gitignore
LICENSE
README.md
TODO
test.py

README.md

redone

A correct implementation of regular expressions, using finite state automata rather than backtracking.

tl;dr

Most regex implementations (except for GNU grep, GNU awk and a few others) use backtracking, which (while adding support for backreferences) massively reduces the efficiency of the implementation. Even the best backtracking implementations have O(2^n) time complexity in several cases.

On the other hand, by using finite state automata, regular expression matching can have O(n) time complexity (or possibly superlinear, if the implementation uses NFAs and doesn't cache them to produce DFAs). In either case, the differences can be quite drastic:

Source: Russ Cox / rsc@swtch.com 2007

Usage

To use this, you just need to do the following:

>>> import redone
>>>
>>> pattern = "some pattern here"
>>> string = "some string here"
>>>
>>> # Pre-compiled version (recommended).
>>> r = redone.compile(pattern)
>>> r.match(string)
# <RegexMatch(...) ...>
>>> r.fullmatch(string)
# <RegexMatch(...) ...>
>>> r.search(string)
# <RegexMatch(...) ...>
>>> r.findall(string)
# [<RegexMatch(...) ...>, ...]
>>> r.finditer(string)
# <generator object ...>
>>> r.sub("replacement", string)
# "replaced string"
>>>
>>> # On-the-fly version.
>>> redone.match(pattern, string)
# <RegexMatch(...) ...>
>>> redone.fullmatch(pattern, string)
# <RegexMatch(...) ...>
>>> redone.search(pattern, string)
# <RegexMatch(...) ...>
>>> redone.findall(pattern, string)
# [<RegexMatch(...) ...>, ...]
>>> redone.finditer(pattern, string)
# <generator object ...>
>>> redone.sub(pattern, "replacement", string)
# "replaced string"

The following features are still "in the works":

  • Proper UTF-8 support (the regex alphabet only includes string.printable).
  • Flags (mainly case insensitivity).
  • Submatch extraction.
  • Assertions (^, $, \b and the like).
  • Character class ranges (because apparently they are used a lot).

The following features are likely not to be implemented:

  • ASCII escape sequences (there's no need, just embed them in the pattern).
  • POSIX "special" character classes ([:alnum:]), just use character classes.
  • Backreferences. Since implementing them in NFAs with polynomial time is NP-complete, the only solution is to use backtracking in that one case (which would massively complicate the codebase).

Features

The redone regular expression language currently contains the following features:

  • Wildcard matching (.).
  • Unions (a|b).
  • Character sets ([abc][^def]).
  • Regex grouping ((ab(c))).
  • Repetition (a?b*c+).
  • Counted repetition (a{2}b{3,}c{4,5}).

Warnings

  • This was made in order to prove a point (and as a programming exercise). Python's mantra is that it is "fast enough", so the backtracking approach used by re seems to match that. Also, I'd argue that using the well-tested re module is much safer than using an implementation some random on the internet wrote, and that the speed issue isn't a big deal on most regular expressions.
  • While finite state automata are theoretically faster than backtracking, in practice, the C code written is better optimised for common patterns and is better tested. That doesn't mean that the re module doesn't suffer from the "perl problem", it just means that (in most usecases) that isn't an issue.
  • Not suitable for programmers under the age of 3.