Developing Annotators with Resources
Annotators are the Baleen components that extract information and entities from content passing through the pipeline. Shared resources are Baleen components that provide access to a resource, such as a file or a database, in a way that lets multiple annotators use it efficiently. This avoids, for example, loading the same data into memory several times or opening multiple connections to the same database.
In this guide we will develop a gazetteer annotator that annotates countries using the SharedCountryResource. The SharedCountryResource provides access to a preloaded database of countries that includes GeoJSON, demonyms, and other data.
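To make the idea of a gazetteer concrete, here is a minimal sketch in plain Java of what such a component does: it scans text for a fixed list of known terms and reports where each one occurs. This is not Baleen code. The class and method names (GazetteerSketch, Match, findAll) are hypothetical, and the scan is deliberately naive, whereas an Aho-Corasick implementation searches for all terms in a single pass.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a naive gazetteer. None of these names come from Baleen.
class GazetteerSketch {

    /** A matched term and its character offsets in the text (end is exclusive). */
    static final class Match {
        final String term;
        final int begin;
        final int end;

        Match(String term, int begin, int end) {
            this.term = term;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Finds every non-overlapping occurrence of each term in the text. */
    static List<Match> findAll(String text, List<String> terms, boolean caseSensitive) {
        List<Match> matches = new ArrayList<>();
        for (String term : terms) {
            if (term.isEmpty()) {
                continue; // an empty term would match everywhere
            }
            for (int i = 0; i + term.length() <= text.length(); i++) {
                if (text.regionMatches(!caseSensitive, i, term, 0, term.length())) {
                    matches.add(new Match(term, i, i + term.length()));
                    i += term.length() - 1; // continue after this match
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> countries = List.of("Jamaica", "France");
        String text = "Last month, Peter visited the coast of Jamaica";
        for (Match m : findAll(text, countries, true)) {
            System.out.println(m.term + " [" + m.begin + ", " + m.end + ")");
        }
    }
}
```

The sketch does no word-boundary checking, so a short term could match inside a longer word. A real gazetteer would need to handle that, along with alternative names for the same country.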
As we are developing a new annotator, we need to ensure we have a dependency on the baleen-annotators module, as this will provide many of the base and utility classes that we will use as well as access to other common dependencies. To do this, we need to add the following to our POM file:
<dependency>
    <groupId>uk.gov.dstl.baleen</groupId>
    <artifactId>baleen-annotators</artifactId>
    <version>2.4.0</version>
</dependency>
To start with, let's create a new Java class called Country which extends AbstractAhoCorasickAnnotator. The AbstractAhoCorasickAnnotator class provides most of the functionality for us, so we just need to provide code to configure the gazetteer. We will create it in the uk.gov.dstl.baleen.annotators.guides package to keep it separate from existing annotators. The finished annotator will be identical to the Country annotator in the uk.gov.dstl.baleen.annotators.gazetteer package.
    package uk.gov.dstl.baleen.annotators.guides;

    import java.util.Collections;

    import com.google.common.collect.ImmutableSet;

    import uk.gov.dstl.baleen.annotators.gazetteer.helpers.AbstractAhoCorasickAnnotator;
    import uk.gov.dstl.baleen.core.pipelines.orderers.AnalysisEngineAction;
    import uk.gov.dstl.baleen.exceptions.BaleenException;
    import uk.gov.dstl.baleen.resources.gazetteer.IGazetteer;
    import uk.gov.dstl.baleen.types.semantic.Location;

    public class Country extends AbstractAhoCorasickAnnotator {

        public Country() {
        }

        @Override
        public IGazetteer configureGazetteer() throws BaleenException {
            // Placeholder: the gazetteer is configured later in this guide
            return null;
        }

        @Override
        public AnalysisEngineAction getAction() {
            // This annotator requires no input types and produces Location annotations
            return new AnalysisEngineAction(Collections.emptySet(), ImmutableSet.of(Location.class));
        }
    }
Baleen uses uimaFIT to handle resource injection, so we can inject the resource by adding the following field to the top of our class:
    @ExternalResource(key = "country")
    private SharedCountryResource country;
The key we give the resource identifies a shared instance, so every time the same key is used, the same instance of that class is provided. That is, we are using the same instance of SharedCountryResource as every other class that accesses the resource with the key country.
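The effect of sharing by key can be sketched in plain Java as a registry that creates a resource the first time a key is requested and returns the same object on every later request. This is only an illustration of the idea, not how uimaFIT or Baleen implement it. SharedResourceSketch, get, and CountryData are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative only: not how uimaFIT or Baleen implement resource sharing.
class SharedResourceSketch {

    private static final Map<String, Object> RESOURCES = new ConcurrentHashMap<>();

    /** Returns the resource bound to the key, creating it on the first request only. */
    @SuppressWarnings("unchecked")
    static <T> T get(String key, Supplier<T> loader) {
        return (T) RESOURCES.computeIfAbsent(key, k -> loader.get());
    }

    /** Hypothetical stand-in for a resource that is expensive to load. */
    static final class CountryData {
        static int loads = 0; // counts how many times the data was "loaded"

        CountryData() {
            loads++;
        }
    }

    public static void main(String[] args) {
        CountryData first = get("country", CountryData::new);
        CountryData second = get("country", CountryData::new);
        System.out.println("same instance: " + (first == second));
        System.out.println("loads: " + CountryData.loads);
    }
}
```

Both callers receive the same object and the loader runs once, which is the saving described above for data that is costly to hold in memory more than once.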
The variable country now references the SharedCountryResource, and we could use it directly to access the resource if we wanted. Fortunately though, a lot of the hard work is done for us by the AbstractAhoCorasickAnnotator.
Configuring our gazetteer is straightforward, as there are helper functions to do most of the hard work. All we need to do in this case is force the entity type to be Location (usually AbstractAhoCorasickAnnotator allows the user to specify the type, but in this case we always want extracted countries to be locations), and then pass our SharedCountryResource to the gazetteer.
    package uk.gov.dstl.baleen.annotators.guides;

    import java.util.Collections;

    import com.google.common.collect.ImmutableSet;

    import org.apache.uima.UimaContext;
    import org.apache.uima.fit.descriptor.ExternalResource;
    import org.apache.uima.resource.ResourceInitializationException;

    import uk.gov.dstl.baleen.annotators.gazetteer.helpers.AbstractAhoCorasickAnnotator;
    import uk.gov.dstl.baleen.annotators.gazetteer.helpers.GazetteerUtils;
    import uk.gov.dstl.baleen.core.pipelines.orderers.AnalysisEngineAction;
    import uk.gov.dstl.baleen.exceptions.BaleenException;
    import uk.gov.dstl.baleen.resources.SharedCountryResource;
    import uk.gov.dstl.baleen.resources.gazetteer.CountryGazetteer;
    import uk.gov.dstl.baleen.resources.gazetteer.IGazetteer;
    import uk.gov.dstl.baleen.types.semantic.Location;

    public class Country extends AbstractAhoCorasickAnnotator {

        // Shared instance of the country resource, injected by uimaFIT
        @ExternalResource(key = "country")
        private SharedCountryResource country;

        @Override
        public void doInitialize(UimaContext aContext) throws ResourceInitializationException {
            // Force the entity type before the parent class initialises
            type = "Location";
            super.doInitialize(aContext);
        }

        @Override
        public IGazetteer configureGazetteer() throws BaleenException {
            // Build the gazetteer from the shared country resource
            IGazetteer gaz = new CountryGazetteer();
            gaz.init(country, GazetteerUtils.configureCountry(caseSensitive));
            return gaz;
        }

        @Override
        public AnalysisEngineAction getAction() {
            // This annotator requires no input types and produces Location annotations
            return new AnalysisEngineAction(Collections.emptySet(), ImmutableSet.of(Location.class));
        }
    }
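The order of the two statements in the doInitialize override above matters. The following plain Java sketch shows the general pattern, under the assumption that the base class reads the type field while it initialises. All names here (ForcedTypeSketch, ConfigurableAnnotator, CountryLikeAnnotator, resolvedType) are hypothetical and stand in for the Baleen classes only loosely.

```java
import java.util.Map;

// Illustrative only: hypothetical classes, not Baleen's.
class ForcedTypeSketch {

    /** Base annotator whose entity type is normally supplied by configuration. */
    static class ConfigurableAnnotator {
        protected String type;
        protected String resolvedType;

        ConfigurableAnnotator(Map<String, String> config) {
            // Configuration is applied before initialisation, like an injected parameter
            this.type = config.getOrDefault("type", "Entity");
        }

        void doInitialize() {
            // Whatever value `type` holds at this point is the one that is used
            resolvedType = type;
        }
    }

    /** Subclass that ignores the configured type and always produces locations. */
    static class CountryLikeAnnotator extends ConfigurableAnnotator {
        CountryLikeAnnotator(Map<String, String> config) {
            super(config);
        }

        @Override
        void doInitialize() {
            type = "Location";    // overwrite any user-supplied value first
            super.doInitialize(); // then let the base class set itself up
        }
    }

    public static void main(String[] args) {
        ConfigurableAnnotator annotator = new CountryLikeAnnotator(Map.of("type", "Person"));
        annotator.doInitialize();
        System.out.println(annotator.resolvedType);
    }
}
```

If the assignment came after the call to the parent method, the base class in this sketch would already have resolved the user-supplied value, and the forced type would have no effect.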
And that's it! We should now have a working gazetteer annotator that uses an external resource. You can include it in your pipeline following the information at Using Third Party Components.
To check that it works, we can write and run unit tests to ensure the output is as expected.
    package uk.gov.dstl.baleen.annotators.guides;

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertNotNull;

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.fit.factory.AnalysisEngineFactory;
    import org.apache.uima.fit.factory.ExternalResourceFactory;
    import org.apache.uima.fit.util.JCasUtil;
    import org.apache.uima.resource.ExternalResourceDescription;
    import org.junit.Test;

    import uk.gov.dstl.baleen.annotators.testing.AnnotatorTestBase;
    import uk.gov.dstl.baleen.resources.SharedCountryResource;
    import uk.gov.dstl.baleen.types.semantic.Location;

    // Country is not imported, so it resolves to the class in this package
    public class CountryGazetteerTest extends AnnotatorTestBase {

        @Test
        public void test() throws Exception {
            // Bind the shared resource to the same key used by the annotator
            ExternalResourceDescription erd =
                ExternalResourceFactory.createExternalResourceDescription("country", SharedCountryResource.class);
            AnalysisEngineDescription aed =
                AnalysisEngineFactory.createEngineDescription(Country.class, "country", erd);
            AnalysisEngine ae = AnalysisEngineFactory.createEngine(aed);

            jCas.setDocumentText("Last month, Peter visited the coast of Jamaica");
            ae.process(jCas);

            // Exactly one location should be found, with its GeoJSON populated
            assertEquals(1, JCasUtil.select(jCas, Location.class).size());
            Location l = JCasUtil.selectByIndex(jCas, Location.class, 0);
            assertEquals("Jamaica", l.getValue());
            assertNotNull(l.getGeoJson());

            ae.destroy();
        }
    }