Alexey Grigorev edited this page Oct 4, 2015 · 8 revisions

rseq: Sequence pattern matching made easier

rseq is a Java library for capturing patterns in sequences (i.e. lists). It is especially useful in NLP applications.

Basic classes:

  • Matcher is an interface with a method boolean match(E e), which returns true if an object matches some criterion. Some matchers are already implemented in the Matchers class, e.g.:
    • eq for checking if an object is equal to a provided one (by using the equals method)
    • and and or for combining several matchers
    • and others, see the Matchers class for a complete list
  • Pattern combines several matchers
  • Match is ...
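To make the idea of a matcher and its combinators concrete, here is a minimal sketch that does not use rseq itself. SketchMatcher and its methods are hypothetical names chosen for illustration; they only mimic the behaviour described above (eq based on equals, plus and/or), and the real Matcher and Matchers classes may be shaped differently.

```java
import java.util.Objects;

// Demo class comes first so that single-file launch picks up its main method.
final class SketchMatcherDemo {
    public static void main(String[] args) {
        SketchMatcher<String> eOrLambda =
                SketchMatcher.<String>eq("E").or(SketchMatcher.eq("\u03bb")); // Greek small lambda
        System.out.println(eOrLambda.match("E"));  // prints true
        System.out.println(eOrLambda.match("is")); // prints false
    }
}

// Hypothetical stand-in for a matcher interface; not the rseq source code.
interface SketchMatcher<E> {
    boolean match(E e);

    // Succeeds only when both matchers accept the element.
    default SketchMatcher<E> and(SketchMatcher<E> other) {
        return e -> this.match(e) && other.match(e);
    }

    // Succeeds when at least one of the matchers accepts the element.
    default SketchMatcher<E> or(SketchMatcher<E> other) {
        return e -> this.match(e) || other.match(e);
    }

    // Equality matcher based on equals(), null-safe via Objects.equals.
    static <E> SketchMatcher<E> eq(E expected) {
        return e -> Objects.equals(expected, e);
    }
}
```

Because the interface has a single abstract method, a matcher in this sketch can also be written as a lambda.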

Simple Example

Suppose we have the following sequence of String objects: [Where, E, is, the, energy, and, λ, is, the, wavelength]

List<String> words = Arrays.asList("Where E is the energy and λ is the wavelength".split(" "));

First, we would like to check whether this list contains the sublist [E, is, the, energy]. To do so, we create the following Pattern:

Pattern<String> pattern = Pattern.create(eq("E"), eq("is"), eq("the"), eq("energy"));

Here eq is imported from Matchers.eq and matches when an object in the sequence is equal to the passed one. Now we run this pattern against our sequence of words:

List<Match<String>> matches = pattern.find(words);

We expect the result to contain one match, so the size of matches is one.

The Match contains additional information about the match, such as the matched subsequence and the indexes where the match starts and ends:

Match<String> match = matches.get(0);
assertEquals(Arrays.asList("E", "is", "the", "energy"), match.getMatchedSubsequence());
assertEquals(1, match.matchedFrom());
assertEquals(1 + 4, match.matchedTo());
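The index convention in these assertions (start inclusive, end exclusive, so the end is 1 + 4 = 5) can be illustrated with a naive sliding-window search. This is only a sketch under our own assumptions: WindowFinder, eq and find are hypothetical helpers built on java.util.function.Predicate, not the way rseq is implemented. Unlike a real pattern engine, this version reports every window, including overlapping ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class WindowFinder {
    // Equality test based on equals(), similar in spirit to Matchers.eq.
    static <E> Predicate<E> eq(E expected) {
        return expected::equals;
    }

    // Returns {from, to} index pairs (to is exclusive) for every window of
    // seq whose elements satisfy the predicates position by position.
    @SafeVarargs
    static <E> List<int[]> find(List<E> seq, Predicate<? super E>... pattern) {
        List<int[]> spans = new ArrayList<>();
        int n = pattern.length;
        if (n == 0) {
            return spans; // an empty pattern matches nothing in this sketch
        }
        for (int from = 0; from + n <= seq.size(); from++) {
            boolean ok = true;
            for (int k = 0; k < n && ok; k++) {
                ok = pattern[k].test(seq.get(from + k));
            }
            if (ok) {
                spans.add(new int[] {from, from + n});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Where E is the energy and \u03bb is the wavelength".split(" "));
        for (int[] span : find(words, eq("E"), eq("is"), eq("the"), eq("energy"))) {
            System.out.println(span[0] + ".." + span[1]); // prints 1..5
        }
    }
}
```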

Now we also want to match the second part of the sentence, with λ and wavelength. One way to do this is with the or matcher:

Pattern<String> pattern = Pattern.create(eq("E").or(eq("λ")), eq("is"), eq("the"), eq("energy").or(eq("wavelength")));

Now we expect to have two matches:

List<Match<String>> matches = pattern.find(words);
assertEquals(2, matches.size());

assertEquals(Arrays.asList("E", "is", "the", "energy"), matches.get(0).getMatchedSubsequence());
assertEquals(Arrays.asList("λ", "is", "the", "wavelength"), matches.get(1).getMatchedSubsequence());

What if we have more than two mathematical identifiers to check, not just E and λ? For that we can use the in matcher. It can be used with collections or with arrays:

List<String> ids = Arrays.asList("E", "λ", "p", "m", "c");
Pattern<String> pattern = Pattern.create(in(ids), eq("is"), eq("the"), in("energy", "wavelength"));
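Conceptually, an in-style matcher is a set-membership test. The sketch below shows one way to offer both a collection form and an array (varargs) form. MembershipSketch and its two in overloads are hypothetical and built on java.util.function.Predicate; they are not taken from the Matchers class.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

final class MembershipSketch {
    // Collection form: true when the element is one of the given values.
    static <E> Predicate<E> in(Collection<? extends E> values) {
        Set<E> allowed = new HashSet<>(values); // copy once for fast lookups
        return allowed::contains;
    }

    // Varargs form, convenient for short inline lists.
    @SafeVarargs
    static <E> Predicate<E> in(E... values) {
        return in(Arrays.asList(values));
    }

    public static void main(String[] args) {
        Predicate<String> defs = in("energy", "wavelength");
        System.out.println(defs.test("energy")); // prints true
        System.out.println(defs.test("the"));    // prints false
    }
}
```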

Now suppose we have a slightly different sequence: [Where, E, is, the, energy, and, λ, is, wavelength]. The first subsequence is the same, but the second one does not have "the" before "wavelength". To still get both subsequences, we can make the matcher for "the" optional:

Pattern.create(in(ids), eq("is"), eq("the").optional(), in("energy", "wavelength"));

It still returns two matches.
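One way to think about optional elements is as a small backtracking search: first try to consume the current element with the step, and if that fails and the step is optional, skip the step without consuming anything. The following is a sketch under that assumption only. OptionalSketch, Step, required and optional are hypothetical names, and rseq may well implement optional matchers differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class OptionalSketch {
    // One pattern position: a test plus a flag saying whether it may be skipped.
    static final class Step<E> {
        final Predicate<E> test;
        final boolean optional;

        Step(Predicate<E> test, boolean optional) {
            this.test = test;
            this.optional = optional;
        }
    }

    static <E> Step<E> required(Predicate<E> test) { return new Step<>(test, false); }
    static <E> Step<E> optional(Predicate<E> test) { return new Step<>(test, true); }

    // Tries to match steps[stepIdx..] starting at seq[pos].
    // Returns the exclusive end index of the match, or -1 if there is none.
    static <E> int matchAt(List<E> seq, int pos, List<Step<E>> steps, int stepIdx) {
        if (stepIdx == steps.size()) {
            return pos; // every step is satisfied
        }
        Step<E> step = steps.get(stepIdx);
        // First try to consume the current element with this step.
        if (pos < seq.size() && step.test.test(seq.get(pos))) {
            int end = matchAt(seq, pos + 1, steps, stepIdx + 1);
            if (end >= 0) {
                return end;
            }
        }
        // Otherwise an optional step may be skipped without consuming anything.
        return step.optional ? matchAt(seq, pos, steps, stepIdx + 1) : -1;
    }

    // Scans left to right and reports non-overlapping {from, to} spans.
    static <E> List<int[]> find(List<E> seq, List<Step<E>> steps) {
        List<int[]> spans = new ArrayList<>();
        int from = 0;
        while (from < seq.size()) {
            int end = matchAt(seq, from, steps, 0);
            if (end > from) { // ignore empty matches
                spans.add(new int[] {from, end});
                from = end;
            } else {
                from++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Where E is the energy and \u03bb is wavelength".split(" "));
        List<String> ids = Arrays.asList("E", "\u03bb", "p", "m", "c");
        List<Step<String>> steps = new ArrayList<>();
        steps.add(required(ids::contains));
        steps.add(required("is"::equals));
        steps.add(optional("the"::equals));
        steps.add(required(w -> w.equals("energy") || w.equals("wavelength")));
        for (int[] span : find(words, steps)) {
            System.out.println(span[0] + ".." + span[1]); // prints 1..5, then 6..9
        }
    }
}
```

The second span is one element shorter than the first because the optional step was skipped there.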

We can also create matchers ourselves, either by implementing the Matcher interface or by extending the XMatcher class. The second option is better because it provides the fluent syntax for or, and, optional and other matchers.

Suppose we want to create a matcher that tests String objects against the regular expression "." (i.e. it matches all one-character strings):

XMatcher<String> oneLetterRegexp = new XMatcher<String>() {
    @Override
    public boolean match(String object) {
        return object.matches(".");
    }
};

If you're using Java 8, you can use the syntax for lambdas:

Matcher<String> oneLetterRegexp = o -> o.matches(".");

We can use this matcher in the same way as other matchers:

Pattern.create(oneLetterRegexp, eq("is"), eq("the").optional(), in("energy", "wavelength"));

With Java 8 you can pass lambdas directly to Pattern.create:

Pattern.create(o -> o.matches("."), eq("is"), eq("the"), eq("energy"));

Now let's do something more useful: we will capture all identifiers and their definitions. We assume that an identifier is a one-character string, so we'll use the oneLetterRegexp matcher. The definition can be any string, and to get it we can use the Matchers.anything matcher.

To capture the value matched by a matcher, we use the captureAs method of the XMatcher class (or you can use Matchers.capture). Since captureAs belongs to XMatcher, oneLetterRegexp below is assumed to be the XMatcher version defined earlier, not the lambda-based Matcher:

XMatcher<String> anything = anything();
Pattern<String> pattern = Pattern.create(oneLetterRegexp.captureAs("ID"), eq("is"), eq("the").optional(),
    anything.captureAs("DEF"));

Here we capture the value matched by oneLetterRegexp into a variable ID, and the value matched by anything into DEF. We can use this Pattern as before, by applying the find method to a sequence:

List<Match<String>> matches = pattern.find(words);
assertEquals(2, matches.size());

Match<String> match1 = matches.get(0);
assertEquals("E", match1.getVariable("ID"));
assertEquals("energy", match1.getVariable("DEF"));

Match<String> match2 = matches.get(1);
assertEquals("λ", match2.getVariable("ID"));
assertEquals("wavelength", match2.getVariable("DEF"));
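The capturing idea itself is simple: while a pattern is being matched, every step that carries a name stores the element it matched under that name. Here is a minimal sketch of that bookkeeping. CaptureSketch, step and capture are hypothetical names, and to keep it short the sketch handles fixed-length patterns only (no optional steps), so it illustrates the concept rather than the behaviour of captureAs in rseq.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class CaptureSketch {
    // A pattern position with an optional variable name (null = do not capture).
    static final class Step {
        final Predicate<String> test;
        final String name;

        Step(Predicate<String> test, String name) {
            this.test = test;
            this.name = name;
        }
    }

    static Step step(Predicate<String> test) { return new Step(test, null); }
    static Step capture(Predicate<String> test, String name) { return new Step(test, name); }

    // Returns one variable map per non-overlapping match, in order of appearance.
    static List<Map<String, String>> find(List<String> seq, Step... steps) {
        List<Map<String, String>> result = new ArrayList<>();
        int n = steps.length;
        int from = 0;
        while (n > 0 && from + n <= seq.size()) {
            Map<String, String> vars = new LinkedHashMap<>();
            boolean ok = true;
            for (int k = 0; k < n && ok; k++) {
                String element = seq.get(from + k);
                ok = steps[k].test.test(element);
                if (ok && steps[k].name != null) {
                    vars.put(steps[k].name, element); // remember the captured value
                }
            }
            if (ok) {
                result.add(vars);
                from += n; // continue after the match
            } else {
                from++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Where E is the energy and \u03bb is the wavelength".split(" "));
        List<Map<String, String>> found = find(words,
                capture(w -> w.matches("."), "ID"),
                step("is"::equals),
                step("the"::equals),
                capture(w -> true, "DEF"));
        System.out.println(found.size());            // prints 2
        System.out.println(found.get(0).get("DEF")); // prints energy
    }
}
```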

What if we want to capture

Beans: More Complex Example

The example above is quite simple: it applies rseq to String sequences, and all this can be done with regular expressions. But rseq can do more complex things: it can be applied to a sequence of any class, not just strings. It can be Integers, Bytes or any other Java classes, including user-defined classes. If your classes follow the Java bean convention (i.e. they use getters), there's a helper class BeanMatchers that simplifies the process of building Matchers.
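As a rough illustration of matching on bean properties, the sketch below builds a predicate from a getter and an expected value. Everything in it is hypothetical: the Token bean, its word and pos properties, and the property helper are invented for this example and do not reflect the actual BeanMatchers API.

```java
import java.util.Objects;
import java.util.function.Function;
import java.util.function.Predicate;

final class BeanSketch {
    // Hypothetical token bean; the property names are illustrative only.
    static final class Token {
        private final String word;
        private final String pos;

        Token(String word, String pos) {
            this.word = word;
            this.pos = pos;
        }

        public String getWord() { return word; }
        public String getPos() { return pos; }
    }

    // Builds a predicate that compares one getter's value with an expected value.
    static <T, V> Predicate<T> property(Function<? super T, ? extends V> getter, V expected) {
        return bean -> Objects.equals(getter.apply(bean), expected);
    }

    public static void main(String[] args) {
        Predicate<Token> isNoun = property(Token::getPos, "NN");
        System.out.println(isNoun.test(new Token("energy", "NN"))); // prints true
        System.out.println(isNoun.test(new Token("the", "DT")));    // prints false
    }
}
```

Such a predicate can be used on each element of a sequence of tokens in the same way the string matchers above are used on words.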

Suppose you have a sequence of tokens (obtained e.g. with Stanford CoreNLP)
