Skip to content
Alexey Grigorev edited this page Oct 13, 2015 · 8 revisions

rseq: Sequence pattern matching made easier

rseq is a Java library that allows to capture patterns in sequences (i.e. lists). It is especially useful in NLP applications. This tutorial introduces this library.

If you'd rather see the code first, you can go have a look at the tests for this tutorial, and come back for the explanation later.

Basic classes:

  • Matcher is an interface with a method boolean match(E e) which returns true if an object matchers some certain criteria. Some matchers are already implemented in the Matchers class, e.g.:
    • eq for checking if an object is equal to a provided one (by using the equals method)
    • and and or for combining several matchers
    • and others, see the Matchers class for a complete list
  • Pattern combines several matchers
  • Match represents a successful match of some subsequence against the provided Pattern

To get started, add a dependency to rseq to your pom:

<dependency>
  <groupId>com.alexeygrigorev</groupId>
  <artifactId>rseq</artifactId>
  <version>0.0.1</version>
</dependency>

Simple Example

Suppose we have the following sequence of String objects: [Where, E, is, the, energy, and, λ, is, the, wavelength]

List<String> words = 
    Arrays.asList("Where E is the energy and λ is the wavelength".split(" "));

Firstly, we would like to check if this list contain a subsequence [E, is, the, energy]. To check it, we create the following Pattern:

Pattern<String> pattern = 
    Pattern.create(eq("E"), eq("is"), eq("the"), eq("energy"));

Where eq is imported from Matchers.eq and returns a successful match when all the object of the subsequence is equal to the passed one. Now we run this pattern against our sequence of words:

List<Match<String>> matches = pattern.find(words);

We expect the result to contain one match, so the size of matches is one.

The Match contains some additional information about the match, e.g. what is the sub-sequence, and the indexes where the match starts and ends:

Match<String> match = matches.get(0);
assertEquals(Arrays.asList("E", "is", "the", "energy"), 
                 match.getMatchedSubsequence());
assertEquals(1, match.matchedFrom());
assertEquals(1 + 4, match.matchedTo());

Okay, now we also want to match the second part of the sentence with "λ" and "wavelength". One of the ways to do it can be using the or matcher:

Pattern.create(eq("E").or(eq("λ")), eq("is"), eq("the"), 
               eq("energy").or(eq("wavelength")));

Now we expect to have two matches:

List<Match<String>> matches = pattern.find(words);
assertEquals(2, matches.size());

assertEquals(Arrays.asList("E", "is", "the", "energy"), 
                matches.get(0).getMatchedSubsequence());
assertEquals(Arrays.asList("λ", "is", "the", "wavelength"), 
                matches.get(1).getMatchedSubsequence());

What if we have more than 2 mathematical identifiers to check, not just "E" and "λ"? For that we can use the in matcher. It can be used with collections or with arrays (and vararg arrays):

List<String> ids = Arrays.asList("E", "λ", "p", "m", "c");
Pattern<String> pattern = 
    Pattern.create(in(ids), eq("is"), eq("the"), 
                   in("energy", "wavelength"));

Now, suppose we have a slightly different sequence: [Where, E, is, the, energy, and, λ, is, wavelength]. The first subsequence is the same, but the second one does not have "the" before wavelength. To still be able to get the two subsequences, we can make the the matcher optional:

Pattern.create(in(ids), eq("is"), eq("the").optional(), 
               in("energy", "wavelength"));

It still returns two matches.

We can create matchers ourselves, and to do it, we need to implement the Matcher interface, or to extend the XMatcher class. The second option is better because it will provide the fluent syntax for or, and, optional and other matchers. If we want to create a matcher that tests all String objects against a regular expression . (i.e. it matches with all one-character strings), we can do the following:

XMatcher<String> oneLetterRegexp = new XMatcher<String>() {
    @Override
    public boolean match(String object) {
        return object.matches(".");
    }
};

In Java 8, you can use the syntax for lambdas:

Matcher<String> oneLetterRegexp = o -> o.matches(".");

We can use this matcher in the same way as other matchers:

Pattern.create(oneLetterRegexp, eq("is"), eq("the").optional(), 
               in("energy", "wavelength"));

Even better, with Java 8 you can pass lambdas directly to Pattern.create:

Pattern.create(o -> o.matches("."), eq("is"), eq("the"), eq("energy"));

Now let's do something more useful: we will capture all identifiers and their definitions. We assume that an identifier is a one character string, so we'll use the oneLetterRegexp matcher. The definition can be any string, and to get it we can use the Matchers.anything matcher.

To capture a value of some matcher, we use the captureAs method of the XMatcher class (or, you can use Matchers.capture):

XMatcher<String> anything = anything();
Pattern<String> pattern = 
    Pattern.create(oneLetterRegexp.captureAs("ID"), eq("is"), 
                   eq("the").optional(),
                   anything.captureAs("DEF"));

Here we capture the matched value for oneLetterRegexp and put it into a variable ID, and capture values for anything into DEF. We can use this Pattern as previously, by applying the find method to some sequence:

List<Match<String>> matches = pattern.find(words);
assertEquals(2, matches.size());

Match<String> match1 = matches.get(0);
assertEquals("E", match1.getVariable("ID"));
assertEquals("energy", match1.getVariable("DEF"));

Match<String> match2 = matches.get(1);
assertEquals("λ", match2.getVariable("ID"));
assertEquals("wavelength", match2.getVariable("DEF"));

What if we want to capture a subsequence? For example, the definition may or may not contain the article "the", and we want to keep the article as well for cases where the article is present. To do it, we can use the group matcher:

XMatcher<String> anything = anything();
Pattern<String> pattern = Pattern.create(
        oneLetterRegexp.captureAs("ID"), eq("is"), 
        group(eq("the").optional(), anything).captureAs("DEF"));
List<Match<String>> matches = pattern.find(words);

The groups are accessed via the getCapturedGroup method on the Match class:

Match<String> match1 = matches.get(0);
assertEquals("E", match1.getVariable("ID"));
assertEquals(Arrays.asList("the", "energy"), match1.getCapturedGroup("DEF"));

Match<String> match2 = matches.get(1);
assertEquals("λ", match2.getVariable("ID"));
assertEquals(Arrays.asList("wavelength"), match2.getCapturedGroup("DEF"));

It is also possible to replace the matched subsequence with something else. Suppose, you have some knowledge base where you have some information about concepts like "energy" or "wavelength", and you want to have your text link to this base.

final Map<String, String> db = new HashMap<String, String>();
db.put("energy", "https://en.wikipedia.org/wiki/Energy");
db.put("wavelength", "https://en.wikipedia.org/wiki/Wavelength");

List<String> replaced = pattern.replace(words, new TransformerToList<String>() {
    @Override
    public List<String> transform(Match<String> match) {
        List<String> result = new ArrayList<>(match.getMatchedSubsequence());
        String link = db.get(match.getVariable("DEF"));
        result.add("(see " + link + ")");
        return result;
    }
});

The result is [Where, E, is, the, energy, (see https://en.wikipedia.org/wiki/Energy), and, λ, is, wavelength, (see https://en.wikipedia.org/wiki/Wavelength)]

If you know that you want to replace the whole matched subsequence with just one element, you can use a shortcut version of the replace method - replaceToOne:

List<String> actual = pattern.replaceToOne(words, 
                              new TransformerToElement<String>() {
    @Override
    public String transform(Match<String> match) {
        return StringUtils.join(match.getMatchedSubsequence(), " ");
    }
});

The result is [Where, E is the energy, and, λ is wavelength]

Finally, there is the oneOrMore and zeroOrMore matchers (and their greedy counterparts oneOrMoreGreedy and zeroOrMoreGreedy).

For example, suppose, for some reason, you want to find consecutive sequences of Strings with at least 3 characters. In [Where, E, is, the, energy, and, λ, is, the, wavelength] that would be subsequences [E, is, the] and [and, λ, is]. To do that, you can define a matcher

XMatcher<String> treeLettersOrLess = new XMatcher<String>() {
    @Override
    public boolean match(String object) {
        return object.length() <= 3;
    }
};

And then use the oneOrMore modifier on this matcher:

Pattern<String> pattern = Pattern.create(treeLettersOrLess.oneOrMore());

Beans: More Complex Example

The example above is quite simple: it applies rseq to String sequences, and all this can possibly be done with usual regular expressions. But rseq can do more complex things: it can be applied to a sequence of any class, not just strings. It can be Integers, Bytes or any other Java classes, including user-defined classes.

Suppose you have a text, and you tokenized it and annotated the tokes with part-of-speech tags (e.g. with Standford Core NLP). You store the results in a Word class:

public class Word {
    private String token;
    private String pos;
    
    // constructor, getters, setters, hashCode, equals...
}

Now suppose we have a sentence "where U is the internal energy, T is the absolute temperature, and S is the entropy." (taken from wiki), which after annotating becomes

where/WRB U/NNP is/VBZ the/DT internal/JJ energy/NN ,/, T/NNP is/VBZ the/DT absolute/JJ temperature/NN ,/, and/CC S/NNP is/VBZ the/DT entropy/NN ./.

In that sequence we are interested in definitions of identifiers U, T and S. We note that all the identifiers are one-character strings, and all the definitions begin with the, then there could be an adjective and a noun. So we can create the following pattern to capture it:

Pattern<Word> pattern = Pattern.create(id.captureAs("ID"), is, 
        group(the, adjective.optional(), noun).captureAs("DEF"));

We can use the Java 8 anonymous functions to declare these matchers:

XMatcher<Word> id = x(w -> w.getToken().length() == 1);
XMatcher<Word> is = x(w -> "is".equals(w.getToken()));
XMatcher<Word> the = x(w -> "the".equals(w.getToken()));
XMatcher<Word> adjective = x(w -> "JJ".equals(w.getPos()));
XMatcher<Word> noun = x(w -> "NN".equals(w.getPos()));

Note the use of the Matchers.x method: it wraps a user-defined matcher (the implementation of the Matcher interface) to an XMatcher. This way you can both have a nice and not cumbersome way of defining Matchers and have access to the fluent interface methods like or, optional, etc.

We expect 3 matches:

List<Match<Word>> result = pattern.find(words);
assertEquals(3, result.size());

Match<Word> match1 = result.get(0);
assertEquals("U", match1.getVariable("ID").getToken());
assertEquals(words("the/DT internal/JJ energy/NN"), 
                            match1.getCapturedGroup("DEF"));

Match<Word> match2 = result.get(1);
assertEquals("T", match2.getVariable("ID").getToken());
assertEquals(words("the/DT absolute/JJ temperature/NN"), 
                            match2.getCapturedGroup("DEF"));

Match<Word> match3 = result.get(2);
assertEquals("S", match3.getVariable("ID").getToken());
assertEquals(words("the/DT entropy/NN"), 
                            match3.getCapturedGroup("DEF"));

Here words is a function that converts the string representation to a List of Word objects.

If your Java objects follow the Java Beans naming convention (at least they should have the getters), then you can use the BeanMatchers class that can use reflection to access your objects' properties.

For example, the same matchers as above could be created as

XMatcher<Word> id = BeanMatchers.regex(Word.class, "token", ".");
XMatcher<Word> is = BeanMatchers.eq(Word.class, "token", "is");
XMatcher<Word> the = BeanMatchers.eq(Word.class, "token", "the");
XMatcher<Word> adjective = BeanMatchers.eq(Word.class, "pos", "JJ");
XMatcher<Word> noun = BeanMatchers.eq(Word.class, "pos", "NN");

This way can be useful if you don't use Java 8 and can't use anonymous functions.