Skip to content

bert2/nfa-string-matching

Repository files navigation

Regular Expression Matching using Nondeterministic Finite Automata

This is an implementation of a nondeterministic finite automaton (NFA) that matches text input against regular expressions. The implementation is based on Ken Thompson's NFA construction and inspired by Russ Cox' article Regular Expression Matching Can Be Simple And Fast.

Given a regular expression and an input text the implementation first parses the pattern to construct the corresponding NFA and then feeds the input text into the NFA. Whether the input matches the pattern is printed to the output and reflected with the exit code of the application. Apart from the regular expressions the parser also understands the glob syntax.

Supported Regex Syntax

Pattern Description
a, b, … Simple match looking for the specified character
. Wildcard: match any character
a|b Alternation: match either a or b
a* Repetition: match a zero or more times
a+ Repetition: match a one or more times
a? Option: match a zero or one times
(ab) Grouping

Supported Glob Syntax

Pattern Description
a, b, … Simple match looking for the specified character
? Matches any character
* Matches any string of characters including the empty string
[a-f], [0-9], … Matches any character within the specified range
\*, \?, \\, \[, \] Escape character for matching meta characters literally

Build

Build nfa-string-matching.sln using Visual Studio 2017.

Usage

> .\StringMatcher\bin\Release\StringMatcher.exe --regex "my .* pattern" --text "my regex pattern"
Match: true
> .\StringMatcher\bin\Release\StringMatcher.exe --glob "my * pattern" --text "my glob pattern"
Match: true
> .\StringMatcher\bin\Debug\StringMatcher.exe --help
USAGE: StringMatcher.exe [--help] [--glob <string>] [--regex <string>] [--text <string>] [--printgraph] [--interactive]

OPTIONS:

    --glob, -g <string>   the glob pattern describing the strings to match (not combinable with '--regex')
    --regex, -r <string>  the regex pattern describing the strings to match (not combinable with '--glob')
    --text, -t <string>   the string to be matched
    --printgraph, -p      prints the graph of the generated NFA as a link to https://gravizo.com/.
    --interactive, -i     starts interactive mode
    --help                display this list of options.

Performance

In order to track the impact of optimization attempts a performance test project has been added to the solution. It repeatedly executes the implementation using a glob pattern and input text of increasing size. The results are labeled with the current git commit hash, stored in a CSV file and rendered to a graph.

The test matches the input text "an" against the regular expression "a?nan" with n ranging from 1 to 100. For instance, for n = 2 the input "aa" is matched with the pattern "a?a?aa". The test provokes the worst case scenario where the implementation has to try to match the input with all occurences of "a?", before it can match it successfully against the suffixed "a" characters.

Graph of performance test results

The figure above shows the most recent performance graph. The pattern length n is plotted against the X axis, the runtime in milliseconds against the Y axis. The Y axis has been log-scaled in order to capture the wide range of measured runtimes more accurately. The legend shows the commit hash for each performance test ordered from oldest (top) to newest (bottom). Because the utilized plot library (FSharp.Charting) throws errors when rendering 0 values on a log-scaled axis, the measurements have been clamped to always be greater or equal 1 ms.

For comparison the results when matching the same input text with C#'s regex implementation have been plotted as well (test cut short at n = 32, because there are only so many milliseconds left until the sun expands and swallows our star system).

A Note on the Performance Results

Up until commit 73335672a0c60fd970624779cc2820381355c10a an NFA implementation was used that lend itself to easy concatination of automata, but required more cycles to execute them. Later commits used a refactored version with opposite properties, i.e. faster execution at the expense of a more complex construction.

Commit 91ed9108bc3ec0a608b69881af7a7aeb68099b8f introduced another refactoring that also fixed a major bug. Due to this bug incorrect automata were generated that allowed for illegal shortcuts during execution and resulted in false runtimes.

Execution of automata got a solid boost around commit bc8473528193061a17725de9d9bbd71eb4bce8f8 by getting rid of list concatenations and introducing sets to avoid List.distinct.

The improved runtime around commit fbce41e50032893beee3775de3ca80306a5d73c5 is due to changing the performance test itself. The tests are now based on regular expressions rather than glob patterns. The new expressions yield smaller automatons.

Reproducing the Performance Results

Each test run of the implementation was executed and recorded the following way:

> cd PerformanceTest\
> .\bin\Release\PerformanceTest.exe 1 100 5
Performance testing commit 46e6e40
Warming up with pattern of length 100...done
Running automaton with pattern of length 100 (05x)
Analyzing runtime behaviour...done
Saving results to perftest-results.csv...done
Rendering performance results to perftest-results.png...done
> git add *
> git commit -m '...'

The test run for C#'s regex class can be executed like this:

> cd PerformanceTest\
> .\bin\Release\PerformanceTest.exe --testCSharpRegex

TODO

  • Finish README.md
  • Render performance graph y axis starting at 0 (unable to produce a nicely looking rendering)
  • Use FParsec instead of diy'ed parser combinator
  • Fix AutomatonPrinter to reflect breaking changes of the Automaton module
  • Cache DFA state (doesn't seem to be worth the overhead)
  • Implement regular expression syntax
    • CLI to regular expression matcher
    • Kleene star
    • Kleene plus
    • Optional match
    • Submatch expressions between '(' and ')'
    • Metacharacter '.'
    • Metacharacter escaping
    • Alternations
    • Character classes
    • Counted repetitions
    • ...
  • Use FParsec's OperatorPrecendenceParser for regex syntax
  • Don't misuse GetHashCode as "unique" value for CompareTo
  • Try merging expandAll and moveNext into one function yuck
  • Try dotnet core
  • FSCheck regular expression matcher
  • Performance test randomized regex patterns
  • Discard some of the collected performance test results
  • Try NFA State type that is a Letter -> State list (doesn't seem to make sense)
  • Clean up messy performance test code
    • Split into modules
    • Use parser combinator to read CSV file, just because they are fun

About

Learning F#: regular expression matching using NFAs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages