Skip to content

almondtools/patternsearchalgorithms

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PatternSearchAlgorithms

codecov Codacy Badge

PatternSearchAlgorithms is a library using a deterministic finite automaton (DFA) for matching and searching. It does not provide features as group capturing, matching modes or backreferences. It is just efficient for complex patterns and large texts. The match time of such an automaton is linear dependent on the number of chars to match (O(n), where n = number of chars to match).

Common regex packages use nondeterministic automatons (NFA) to capture the regular expression, allowing several special features but implying performance leaks. The match time of an NFA based solution is dependent on the nodes in the automaton and the chars to match (O(m^2*n), where m = number of automaton nodes, n = number of chars to match).

Matching a pattern

    Pattern pattern = Pattern.compile("(more|less) efficient", OptimizationTarget.MATCH);

    Matcher matcher = pattern.matcher("more efficient");
    System.out.prinltn("matches: " + matcher.matches()); // matches:true
    
    ...
    
    Matcher matcher = pattern.matcher("patternsearchalgorithms is more efficient than java.util.regex");
    System.out.prinltn("matches: " + matcher.matches()); // matches:false

Finding a pattern

    Pattern pattern = Pattern.compile("(more|less) efficient", OptimizationTarget.SEARCH);

    Finder matcher = pattern.matcher("patternsearchalgorithms is more efficient than java.util.regex");
    while (matcher.find()) {
        System.out.println("found text = " + matcher.match.text()); // found text = more efficient
        System.out.println("at = " + matcher.match.start()); // at = 27
        System.out.println("to = " + matcher.match.end()); // to = 14
    }

    ...
    
    Finder matcher = pattern.matcher("patternsearchalgorithms is more efficient than java.util.regex");
    for (Match match : matcher.findAll()) {
        System.out.println("found text = " + match.text()); // found text = more efficient
        System.out.println("at = " + match.start()); // at = 27
        System.out.println("to = " + match.end()); // to = 14
    }

Syntax

We support the regular expression syntax of regexparser.

NFA-Expressions (java.util.Pattern) vs. DFA-Expressions

Java regular expressions (java.util.Pattern) are quickly created and optimized. Simple regular expressions are executed quite fast.

PatternSearchAlgorithms regular expressions need a long creation and optimization time. After this initial effort the execution time is no longer dependent on pattern complexity.

Use Java regular expressions:

  • if the expression is short and simple
  • if the expression is matched only a few times
  • if the expression is often created (e.g. in a loop)

Use patternsearchalgorithms regular expressions:

  • if the expression is long or complex
  • if the expression is matched many times
  • if the expression is once created and often applied

Performance Comparison

A performance benchmark for regex packages can be found at https://github.com/almondtools/regexbench.

This benchmark does not only check the performance but also the correctness of the results:

  • each benchmark fails if the expected number matches is not found
  • DFA packages cannot compute the same groups as NFA packages - accepted difference
  • Non-Posix-NFA packages (as jregex and java.util.regex) do not always detect the longest leftmost match - accepted difference

Maven Dependency

<dependency>
    <groupId>net.amygdalum</groupId>
    <artifactId>regexparser</artifactId>
    <version>0.0.5</version>
</dependency>

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages