Tutorial
rseq is a Java library for capturing patterns in sequences (i.e. lists). It is especially useful in NLP applications.
Basic classes:
- Matcher is an interface with a method boolean match(E e) which returns true if an object matches certain criteria. Some matchers are already implemented in the Matchers class, e.g.:
  - eq for checking if an object is equal to a provided one (by using the equals method)
  - and and or for combining several matchers
  - and others, see the Matchers class for a complete list
- Pattern combines several matchers
- Match is ...
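To make the idea of combining matchers concrete, here is a minimal, self-contained sketch. SimpleMatcher is a hypothetical stand-in written only for this illustration; it is not rseq's Matcher or Matchers code, and the real classes may be organised differently.

```java
import java.util.Objects;

// Hypothetical stand-in for illustration only; this is NOT rseq's Matcher or Matchers code.
interface SimpleMatcher<E> {
    boolean match(E e);

    // Succeeds only when both matchers accept the element.
    default SimpleMatcher<E> and(SimpleMatcher<E> other) {
        return e -> this.match(e) && other.match(e);
    }

    // Succeeds when at least one of the two matchers accepts the element.
    default SimpleMatcher<E> or(SimpleMatcher<E> other) {
        return e -> this.match(e) || other.match(e);
    }

    // Counterpart of the eq matcher described above: compares with equals.
    static <E> SimpleMatcher<E> eq(E expected) {
        return e -> Objects.equals(expected, e);
    }
}

class SimpleMatcherDemo {
    public static void main(String[] args) {
        SimpleMatcher<String> isOrThe = SimpleMatcher.<String>eq("is").or(SimpleMatcher.eq("the"));
        System.out.println(isOrThe.match("the"));    // prints true
        System.out.println(isOrThe.match("energy")); // prints false
    }
}
```

Because the interface has a single abstract method, a lambda such as s -> s.length() == 1 can be used wherever a SimpleMatcher is expected.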
Suppose we have the following sequence of String objects: [Where, E, is, the, energy, and, λ, is, the, wavelength]
List<String> words = Arrays.asList("Where E is the energy and λ is the wavelength".split(" "));
First, we would like to check whether this list contains the sublist [E, is, the, energy]. To check this, we create the following Pattern:
Pattern<String> pattern = Pattern.create(eq("E"), eq("is"), eq("the"), eq("energy"));
Here eq is imported from Matchers.eq and returns a successful match when an object in the sequence is equal to the passed one. Now we run this pattern against our sequence of words:
List<Match<String>> matches = pattern.find(words);
We expect the result to contain one match, so the size of matches is one.
The Match contains additional information about the match, e.g. the matched subsequence and the indexes where the match starts and ends:
Match<String> match = matches.get(0);
assertEquals(Arrays.asList("E", "is", "the", "energy"), match.getMatchedSubsequence());
assertEquals(1, match.matchedFrom());
assertEquals(1 + 4, match.matchedTo());
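The assertions above imply that the start index is inclusive and the end index is exclusive (1 and 5 for a four-element match). The same convention can be checked with the standard library alone. The sketch below uses Collections.indexOfSubList and List.subList rather than rseq; the class and method names are made up for the illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SubListIndexDemo {
    // Returns {from, to} for the first occurrence of target in source, or null if absent.
    // "from" is inclusive and "to" is exclusive, like the values asserted above.
    static int[] firstOccurrence(List<String> source, List<String> target) {
        int from = Collections.indexOfSubList(source, target);
        if (from < 0) {
            return null;
        }
        return new int[] { from, from + target.size() };
    }

    public static void main(String[] args) {
        // \u03bb is the Greek letter lambda from the tutorial's sentence.
        List<String> words = Arrays.asList("Where E is the energy and \u03bb is the wavelength".split(" "));
        int[] range = firstOccurrence(words, Arrays.asList("E", "is", "the", "energy"));
        System.out.println(range[0] + " " + range[1]);         // prints 1 5
        System.out.println(words.subList(range[0], range[1])); // prints [E, is, the, energy]
    }
}
```

This only handles exact sublists; it has no equivalent of or, in, or optional matchers.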
Now we also want to match the second part of the sentence, with λ and wavelength. One way to do this is to use the or matcher:
Pattern<String> pattern = Pattern.create(eq("E").or(eq("λ")), eq("is"), eq("the"), eq("energy").or(eq("wavelength")));
Now we expect to have two matches:
List<Match<String>> matches = pattern.find(words);
assertEquals(2, matches.size());
assertEquals(Arrays.asList("E", "is", "the", "energy"), matches.get(0).getMatchedSubsequence());
assertEquals(Arrays.asList("λ", "is", "the", "wavelength"), matches.get(1).getMatchedSubsequence());
What if we have more than two mathematical identifiers to check, not just E and λ? For that we can use the in matcher. It can be used with collections or with arrays:
List<String> ids = Arrays.asList("E", "λ", "p", "m", "c");
Pattern<String> pattern = Pattern.create(in(ids), eq("is"), eq("the"), in("energy", "wavelength"));
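Conceptually, an in matcher is a membership test. The sketch below shows one way such a pair of overloads (collection and varargs) could be written with plain java.util.function.Predicate. InPredicateDemo and its in methods are hypothetical helpers for this illustration, not rseq's Matchers.in.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

class InPredicateDemo {
    // Hypothetical helper, not rseq's Matchers.in: membership test backed by a HashSet.
    static <E> Predicate<E> in(Collection<E> allowed) {
        Set<E> set = new HashSet<>(allowed);
        return set::contains;
    }

    // Varargs overload, so values can be listed inline.
    @SafeVarargs
    static <E> Predicate<E> in(E... allowed) {
        return in(Arrays.asList(allowed));
    }

    public static void main(String[] args) {
        Predicate<String> ids = in(Arrays.asList("E", "p", "m", "c"));
        Predicate<String> defs = in("energy", "wavelength");
        System.out.println(ids.test("p"));   // prints true
        System.out.println(defs.test("is")); // prints false
    }
}
```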
Now, suppose we have a slightly different sequence: [Where, E, is, the, energy, and, λ, is, wavelength]. The first subsequence is the same, but the second one does not have "the" before wavelength. To still get both subsequences, we can make the eq("the") matcher optional:
Pattern.create(in(ids), eq("is"), eq("the").optional(), in("energy", "wavelength"));
It still returns two matches.
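To show why an optional element lets both subsequences match, here is a small stand-alone scanner. It is a simplified sketch and not rseq's actual matching algorithm: Step, req, opt, matchAt and findAll are names invented for this example, and the real library may resolve optional matchers differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class OptionalStepDemo {
    // One pattern position: a test plus a flag saying whether it may be skipped.
    static final class Step {
        final Predicate<String> test;
        final boolean optional;
        Step(Predicate<String> test, boolean optional) {
            this.test = test;
            this.optional = optional;
        }
    }

    static Step req(Predicate<String> test) { return new Step(test, false); }
    static Step opt(Predicate<String> test) { return new Step(test, true); }

    // Tries to match steps[stepIdx..] starting at words[pos].
    // Returns the end index (exclusive) on success, or -1 on failure.
    static int matchAt(List<String> words, int pos, List<Step> steps, int stepIdx) {
        if (stepIdx == steps.size()) {
            return pos;
        }
        Step step = steps.get(stepIdx);
        // First try to consume the current word with this step.
        if (pos < words.size() && step.test.test(words.get(pos))) {
            int end = matchAt(words, pos + 1, steps, stepIdx + 1);
            if (end >= 0) {
                return end;
            }
        }
        // An optional step may also consume nothing.
        return step.optional ? matchAt(words, pos, steps, stepIdx + 1) : -1;
    }

    // Scans left to right and collects non-overlapping matched sublists.
    static List<List<String>> findAll(List<String> words, List<Step> steps) {
        List<List<String>> result = new ArrayList<>();
        int pos = 0;
        while (pos < words.size()) {
            int end = matchAt(words, pos, steps, 0);
            if (end > pos) {
                result.add(words.subList(pos, end));
                pos = end;
            } else {
                pos++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Where E is the energy and \u03bb is wavelength".split(" "));
        List<String> ids = Arrays.asList("E", "\u03bb", "p", "m", "c");
        List<String> defs = Arrays.asList("energy", "wavelength");
        List<Step> steps = Arrays.asList(
                req(ids::contains), req("is"::equals), opt("the"::equals), req(defs::contains));
        System.out.println(findAll(words, steps).size()); // prints 2
    }
}
```

The optional step is tried in its consuming form first and skipped only if that fails, which is why the first match still includes "the".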
We can also create matchers ourselves. To do so, we either implement the Matcher interface or extend the XMatcher class. The second option is better because it provides fluent syntax for or, and, optional and other matchers.
Suppose we want to create a matcher that tests String objects against the regular expression "." (i.e. it matches all one-character strings):
XMatcher<String> oneLetterRegexp = new XMatcher<String>() {
    @Override
    public boolean match(String object) {
        return object.matches(".");
    }
};
If you're using Java 8, you can use the syntax for lambdas:
Matcher<String> oneLetterRegexp = o -> o.matches(".");
We can use this matcher in the same way as other matchers:
Pattern.create(oneLetterRegexp, eq("is"), eq("the").optional(), in("energy", "wavelength"));
With Java 8 you can pass lambdas directly to Pattern.create:
Pattern.create(o -> o.matches("."), eq("is"), eq("the"), eq("energy"));
Now let's do something more useful: we will capture all identifiers and their definitions. We assume that an identifier is a one-character string, so we'll use the oneLetterRegexp matcher. The definition can be any string, and to get it we can use the Matchers.anything matcher.
To capture the value matched by a matcher, we use the captureAs method of the XMatcher class (alternatively, you can use Matchers.capture):
XMatcher<String> anything = anything();
Pattern<String> pattern = Pattern.create(oneLetterRegexp.captureAs("ID"), eq("is"), eq("the").optional(),
anything.captureAs("DEF"));
Here we capture the matched value for oneLetterRegexp and put it into a variable ID, and capture values for anything into DEF. We can use this Pattern as previously, by applying the find method to some sequence:
List<Match<String>> matches = pattern.find(words);
assertEquals(2, matches.size());
Match<String> match1 = matches.get(0);
assertEquals("E", match1.getVariable("ID"));
assertEquals("energy", match1.getVariable("DEF"));
Match<String> match2 = matches.get(1);
assertEquals("λ", match2.getVariable("ID"));
assertEquals("wavelength", match2.getVariable("DEF"));
What if we want to capture
The example above is quite simple: it applies rseq to String sequences, and all of this can be done with regular expressions. But rseq can do more complex things: it can be applied to a sequence of any class, not just strings. The elements can be Integers, Bytes or objects of any other Java class, including user-defined classes. If your classes follow the Java bean convention (i.e. they use getters), there's a helper class BeanMatchers that simplifies the process of building Matchers.
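For comparison, here is how the String-only identifier/definition capture could look with java.util.regex and named groups, as the remark about regular expressions suggests. This is a sketch of one possible regex, working on the sentence as a single string instead of a token list. Note that Pattern and Matcher in this snippet are the java.util.regex classes, not rseq's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCaptureDemo {
    // (?<!\S) keeps the identifier from starting in the middle of a longer word.
    // ID is a single non-space character, "the " is optional, DEF is the next word.
    private static final Pattern ID_DEF =
            Pattern.compile("(?<!\\S)(?<ID>\\S) is (?:the )?(?<DEF>\\S+)");

    // Returns {identifier, definition} pairs found in the sentence.
    static List<String[]> extract(String sentence) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = ID_DEF.matcher(sentence);
        while (m.find()) {
            pairs.add(new String[] { m.group("ID"), m.group("DEF") });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : extract("Where E is the energy and \u03bb is the wavelength")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

This approach only works for text; for sequences of arbitrary objects there is no string to run a regex against, which is where a sequence-level pattern becomes useful.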
Suppose you have a sequence of tokens (obtained e.g. with Stanford CoreNLP)