Skip to content

cpp-for-everything/RegexMatcher

Repository files navigation

RegexMatcher: High-Performance Multi-Pattern Regex Engine in C++

Matching Text from Start to Finish Against Multiple Regular Expressions


Overview

RegexMatcher is a high-performance C++ library for matching input text against multiple regular expressions simultaneously, from the beginning to the end of a string. It is designed for efficiency and accuracy, enabling deterministic behavior even in large-scale, high-throughput systems such as web servers, content filters, or custom compilers.

This project implements an advanced data structure inspired by finite automata and combines it with regex parsing to allow effective matching, leveraging an iterative algorithmic design. Unlike traditional regex engines that operate one pattern at a time, RegexMatcher is optimized for batch processing of many regular expressions with bounded overhead.

Scientific Foundation

The core algorithmic approach of this project is described in the following peer-reviewed publication:

I. Stankov and A. Tsvetanov, "Matching Text from Start to Finish Against Multiple Regular Expressions," 2024 32nd National Conference with International Participation (TELECOM), Sofia, Bulgaria, pp. 1–4, doi: 10.1109/TELECOM63374.2024.10812214.

The paper introduces a hybrid model that blends deterministic finite automata (DFA) with trie-based backtracking and partial Aho-Corasick integration. The algorithm is designed to guarantee full-string matches and minimize ambiguity, supporting practical applications in:

  • Web server request parsing
  • URL pattern recognition
  • Telecom stream filtering
  • High-throughput automated testing tools

Keywords

Uniform resource locators, C++ languages, Data structures, Web servers, Iterative algorithms, Telecommunications, Testing, Regex, Pattern matching, Aho-Corasick

Features

  • Multi-pattern support: Match multiple regex patterns simultaneously from start to end.
  • 🧠 Deterministic traversal: Uses a hybrid state machine model with no backtracking on successful paths.
  • 🛠 Written in modern C++: Clean, modular, and efficient C++17 implementation.
  • 🔍 Debuggable and testable: Includes testing harnesses and debugging modes for state machine introspection.
  • 🌐 Designed for scalability: Suitable for integration into web servers, telecom processing units, and parsers.

Quick Start

Clone and Build

git clone https://github.com/cpp-for-everything/RegexMatcher.git
cd RegexMatcher
mkdir build && cd build
cmake ..
make

Example Usage

You can construct a matcher and feed it multiple regexes like so:

#include "regex_matcher.h"

RegexMatcher matcher;
matcher.add_pattern("^a(bc)*d$");
matcher.add_pattern("^123[0-9]+abc$");

if (matcher.match("123456abc")) {
    std::cout << "Match found!" << std::endl;
}

Benchmarks and Performance

RegexMatcher achieves competitive throughput compared to existing open-source libraries, especially in workloads with many patterns and strict matching requirements. Benchmarking against traditional engines (e.g., PCRE, std::regex) demonstrates its superior performance in multi-pattern deterministic scenarios.

More detailed evaluation and formal analysis can be found in the scientific paper.

Applications

RegexMatcher is designed for real-world scenarios where correctness and speed are critical. This includes:

  • Parsing and validating URLs and network traffic
  • Lexical analysis for domain-specific languages
  • High-performance rule engines for web filtering
  • Static or runtime analysis of log patterns and structured data

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use this library in academic work, please cite the following:

@inproceedings{stankov2024regexmatcher,
  author    = {I. Stankov and A. Tsvetanov},
  title     = {Matching Text from Start to Finish Against Multiple Regular Expressions},
  booktitle = {2024 32nd National Conference with International Participation (TELECOM)},
  year      = {2024},
  pages     = {1--4},
  doi       = {10.1109/TELECOM63374.2024.10812214}
}

Contributing

Contributions are welcome! If you have suggestions, optimizations, or bug fixes, feel free to open a pull request or issue.

About

No description, website, or topics provided.

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

No packages published