Tutorial
rseq is a Java library that allows to capture patterns in sequences (i.e. lists). It is especially useful in NLP applications. This tutorial introduces this library.
If you'd rather see the code first, you can go have a look at the tests for this tutorial, and come back for the explanation later.
Basic classes:
-
Matcher
is an interface with a methodboolean match(E e)
which returnstrue
if an object matchers some certain criteria. Some matchers are already implemented in theMatchers
class, e.g.:-
eq
for checking if an object is equal to a provided one (by using theequals
method) -
and
andor
for combining several matchers - and others, see the
Matchers
class for a complete list
-
-
Pattern
combines several matchers -
Match
represents a successful match of some subsequence against the providedPattern
To get started, add a dependency to rseq to your pom:
<dependency>
<groupId>com.alexeygrigorev</groupId>
<artifactId>rseq</artifactId>
<version>0.0.1</version>
</dependency>
Suppose we have the following sequence of String
objects: [Where, E, is, the, energy, and, λ, is, the, wavelength]
List<String> words =
Arrays.asList("Where E is the energy and λ is the wavelength".split(" "));
Firstly, we would like to check if this list contain a subsequence [E, is, the, energy]
. To check it, we create the following Pattern
:
Pattern<String> pattern =
Pattern.create(eq("E"), eq("is"), eq("the"), eq("energy"));
Where eq
is imported from Matchers.eq
and returns a successful match when all the object of the subsequence is equal to the passed one. Now we run this pattern against our sequence of words:
List<Match<String>> matches = pattern.find(words);
We expect the result to contain one match, so the size of matches
is one.
The Match
contains some additional information about the match, e.g. what is the sub-sequence, and the indexes where the match starts and ends:
Match<String> match = matches.get(0);
assertEquals(Arrays.asList("E", "is", "the", "energy"),
match.getMatchedSubsequence());
assertEquals(1, match.matchedFrom());
assertEquals(1 + 4, match.matchedTo());
Okay, now we also want to match the second part of the sentence with "λ" and "wavelength". One of the ways to do it can be using the or
matcher:
Pattern.create(eq("E").or(eq("λ")), eq("is"), eq("the"),
eq("energy").or(eq("wavelength")));
Now we expect to have two matches:
List<Match<String>> matches = pattern.find(words);
assertEquals(2, matches.size());
assertEquals(Arrays.asList("E", "is", "the", "energy"),
matches.get(0).getMatchedSubsequence());
assertEquals(Arrays.asList("λ", "is", "the", "wavelength"),
matches.get(1).getMatchedSubsequence());
What if we have more than 2 mathematical identifiers to check, not just "E" and "λ"? For that we can use the in
matcher. It can be used with collections or with arrays (and vararg arrays):
List<String> ids = Arrays.asList("E", "λ", "p", "m", "c");
Pattern<String> pattern =
Pattern.create(in(ids), eq("is"), eq("the"),
in("energy", "wavelength"));
Now, suppose we have a slightly different sequence: [Where, E, is, the, energy, and, λ, is, wavelength]
. The first subsequence is the same, but the second one does not have "the" before wavelength
. To still be able to get the two subsequences, we can make the the
matcher optional:
Pattern.create(in(ids), eq("is"), eq("the").optional(),
in("energy", "wavelength"));
It still returns two matches.
We can create matchers ourselves, and to do it, we need to implement the Matcher
interface, or to extend the XMatcher
class. The second option is better because it will provide the fluent syntax for or
, and
, optional
and other matchers. If we want to create a matcher that tests all String
objects against a regular expression .
(i.e. it matches with all one-character strings), we can do the following:
XMatcher<String> oneLetterRegexp = new XMatcher<String>() {
@Override
public boolean match(String object) {
return object.matches(".");
}
};
In Java 8, you can use the syntax for lambdas:
Matcher<String> oneLetterRegexp = o -> o.matches(".");
We can use this matcher in the same way as other matchers:
Pattern.create(oneLetterRegexp, eq("is"), eq("the").optional(),
in("energy", "wavelength"));
Even better, with Java 8 you can pass lambdas directly to Pattern.create
:
Pattern.create(o -> o.matches("."), eq("is"), eq("the"), eq("energy"));
Now let's do something more useful: we will capture all identifiers and their definitions. We assume that an identifier is a one character string, so we'll use the oneLetterRegexp
matcher. The definition can be any string, and to get it we can use the Matchers.anything
matcher.
To capture a value of some matcher, we use the captureAs
method of the XMatcher
class (or, you can use Matchers.capture
):
XMatcher<String> anything = anything();
Pattern<String> pattern =
Pattern.create(oneLetterRegexp.captureAs("ID"), eq("is"),
eq("the").optional(),
anything.captureAs("DEF"));
Here we capture the matched value for oneLetterRegexp
and put it into a variable ID
, and capture values for anything
into DEF
. We can use this Pattern
as previously, by applying the find
method to some sequence:
List<Match<String>> matches = pattern.find(words);
assertEquals(2, matches.size());
Match<String> match1 = matches.get(0);
assertEquals("E", match1.getVariable("ID"));
assertEquals("energy", match1.getVariable("DEF"));
Match<String> match2 = matches.get(1);
assertEquals("λ", match2.getVariable("ID"));
assertEquals("wavelength", match2.getVariable("DEF"));
What if we want to capture a subsequence? For example, the definition may or may not contain the article "the", and we want to keep the article as well for cases where the article is present. To do it, we can use the group
matcher:
XMatcher<String> anything = anything();
Pattern<String> pattern = Pattern.create(
oneLetterRegexp.captureAs("ID"), eq("is"),
group(eq("the").optional(), anything).captureAs("DEF"));
List<Match<String>> matches = pattern.find(words);
The groups are accessed via the getCapturedGroup
method on the Match
class:
Match<String> match1 = matches.get(0);
assertEquals("E", match1.getVariable("ID"));
assertEquals(Arrays.asList("the", "energy"), match1.getCapturedGroup("DEF"));
Match<String> match2 = matches.get(1);
assertEquals("λ", match2.getVariable("ID"));
assertEquals(Arrays.asList("wavelength"), match2.getCapturedGroup("DEF"));
It is also possible to replace the matched subsequence with something else. Suppose, you have some knowledge base where you have some information about concepts like "energy" or "wavelength", and you want to have your text link to this base.
final Map<String, String> db = new HashMap<String, String>();
db.put("energy", "https://en.wikipedia.org/wiki/Energy");
db.put("wavelength", "https://en.wikipedia.org/wiki/Wavelength");
List<String> replaced = pattern.replace(words, new TransformerToList<String>() {
@Override
public List<String> transform(Match<String> match) {
List<String> result = new ArrayList<>(match.getMatchedSubsequence());
String link = db.get(match.getVariable("DEF"));
result.add("(see " + link + ")");
return result;
}
});
The result is [Where, E, is, the, energy, (see https://en.wikipedia.org/wiki/Energy), and, λ, is, wavelength, (see https://en.wikipedia.org/wiki/Wavelength)]
If you know that you want to replace the whole matched subsequence with just one element, you can use a shortcut version of the replace
method - replaceToOne
:
List<String> actual = pattern.replaceToOne(words,
new TransformerToElement<String>() {
@Override
public String transform(Match<String> match) {
return StringUtils.join(match.getMatchedSubsequence(), " ");
}
});
The result is [Where, E is the energy, and, λ is wavelength]
Finally, there is the oneOrMore
and zeroOrMore
matchers (and their greedy counterparts oneOrMoreGreedy
and zeroOrMoreGreedy
).
For example, suppose, for some reason, you want to find consecutive sequences of String
s with at least 3 characters. In [Where, E, is, the, energy, and, λ, is, the, wavelength]
that would be subsequences [E, is, the]
and [and, λ, is]
. To do that, you can define a matcher
XMatcher<String> treeLettersOrLess = new XMatcher<String>() {
@Override
public boolean match(String object) {
return object.length() <= 3;
}
};
And then use the oneOrMore
modifier on this matcher:
Pattern<String> pattern = Pattern.create(treeLettersOrLess.oneOrMore());
The example above is quite simple: it applies rseq to String
sequences, and all this can possibly be done with usual regular expressions. But rseq can do more complex things: it can be applied to a sequence of any class, not just strings. It can be Integer
s, Byte
s or any other Java classes, including user-defined classes.
Suppose you have a text, and you tokenized it and annotated the tokes with part-of-speech tags (e.g. with Standford Core NLP). You store the results in a Word
class:
public class Word {
private String token;
private String pos;
// constructor, getters, setters, hashCode, equals...
}
Now suppose we have a sentence "where U is the internal energy, T is the absolute temperature, and S is the entropy." (taken from wiki), which after annotating becomes
where/WRB U/NNP is/VBZ the/DT internal/JJ energy/NN ,/, T/NNP is/VBZ the/DT absolute/JJ temperature/NN ,/, and/CC S/NNP is/VBZ the/DT entropy/NN ./.
In that sequence we are interested in definitions of identifiers U
, T
and S
. We note that all the identifiers are one-character strings, and all the definitions begin with the
, then there could be an adjective and a noun. So we can create the following pattern to capture it:
Pattern<Word> pattern = Pattern.create(id.captureAs("ID"), is,
group(the, adjective.optional(), noun).captureAs("DEF"));
We can use the Java 8 anonymous functions to declare these matchers:
XMatcher<Word> id = x(w -> w.getToken().length() == 1);
XMatcher<Word> is = x(w -> "is".equals(w.getToken()));
XMatcher<Word> the = x(w -> "the".equals(w.getToken()));
XMatcher<Word> adjective = x(w -> "JJ".equals(w.getPos()));
XMatcher<Word> noun = x(w -> "NN".equals(w.getPos()));
Note the use of the Matchers.x
method: it wraps a user-defined matcher (the implementation of the Matcher
interface) to an XMatcher
. This way you can both have a nice and not cumbersome way of defining Matchers and have access to the fluent interface methods like or
, optional
, etc.
We expect 3 matches:
List<Match<Word>> result = pattern.find(words);
assertEquals(3, result.size());
Match<Word> match1 = result.get(0);
assertEquals("U", match1.getVariable("ID").getToken());
assertEquals(words("the/DT internal/JJ energy/NN"),
match1.getCapturedGroup("DEF"));
Match<Word> match2 = result.get(1);
assertEquals("T", match2.getVariable("ID").getToken());
assertEquals(words("the/DT absolute/JJ temperature/NN"),
match2.getCapturedGroup("DEF"));
Match<Word> match3 = result.get(2);
assertEquals("S", match3.getVariable("ID").getToken());
assertEquals(words("the/DT entropy/NN"),
match3.getCapturedGroup("DEF"));
Here words
is a function that converts the string representation to a List
of Word
objects.
If your Java objects follow the Java Beans naming convention (at least they should have the getters), then you can use the BeanMatchers
class that can use reflection to access your objects' properties.
For example, the same matchers as above could be created as
XMatcher<Word> id = BeanMatchers.regex(Word.class, "token", ".");
XMatcher<Word> is = BeanMatchers.eq(Word.class, "token", "is");
XMatcher<Word> the = BeanMatchers.eq(Word.class, "token", "the");
XMatcher<Word> adjective = BeanMatchers.eq(Word.class, "pos", "JJ");
XMatcher<Word> noun = BeanMatchers.eq(Word.class, "pos", "NN");
This way can be useful if you don't use Java 8 and can't use anonymous functions.