Skip to content

coder5506/Cuis-Smalltalk-Regexp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Cuis-Smalltalk-Regexp

This package implements linear-time1 regular expression matching for Cuis Smalltalk, with full Unicode support, that runs in space proportional only to the size of the pattern.

This package is intended to update the older Regex package originally ported from Squeak which, in turn, was based on an older package originally written for VisualWorks in 1996. This is a new implementation supporting Unicode and providing many newer features most have come to expect from regular expression libraries.

Do note that this new implementation is not significantly faster on "normal" regular expressions,2 but rather avoids exponential slowdowns caused by pathological expressions. Also, aside from the Unicode support, most of the new features are merely conveniences and do not introduce new capabilities unavailable in the older library.

See

See also chapter 6 of Deep into Pharo, available separately, or the corresponding documentation from

Status

Not all planned features are yet implemented. In particular, look-ahead/-behind patterns and backreferences are not implemented, and set operations are not well tested. That said, all Regex features from the older library are fully-supported, and all tests pass.

It is planned to optimize matching Unicode strings by performing byte-level matching on their UTF-8 encoding, but this optimization is not yet implemented.

License

MIT License

Dependencies

Requires Cuis-Smalltalk-Unicode.

Implementation

This package implements the "virtual machine" approach to regular expression matching, pioneered by Ken Thompson and later documented by Russ Cox.

Essentially, a regular expression (RegexpPattern) is compiled (ReCompiler) to a sequence of instructions (ReProgram) where the position of the instruction in the sequence implicitly represents the state of the match as it might be encoded in a finite automata. This approach is more flexible than simulating a finite automata directly, because it is easy customize the instructions to the needs of the matcher.

The program is then executed (ReProcess) as a collection of "threads"(ReThread), each thread having its own counter to represent its state. The collection of threads together simulate a non-deterministic finite automata executing the match.

Thompson further had the idea to execute the threads together, iterating through each character of the input and giving each thread a timeslice in which to process that character. As the state of each thread is determined entirely by its program counter, the number of possible threads is limited to the number instructions, ensuring the program runs in space bounded by the size of the pattern, independent of the size of the input.

Contributing

Issues preferably, where the best contribution would be a failing test case.

Footnotes

  1. Linear-time so long as you do not use look-ahead/-behind patterns or backreferences, all of which necessarily recur. See also status.

  2. There have been some optimizations, primarily in search, and more are planned, but for now performance should not be considered a reason to prefer this package to the older one.

About

Modern regular expression matching for Cuis Smalltalk

Resources

License

Stars

Watchers

Forks