
Developing Annotators with Resources

James Baker edited this page Jun 21, 2017 · 2 revisions

Annotators are the Baleen components that extract information and entities from content as it passes through the pipeline. Shared resources are Baleen components that provide access to a resource, such as a file or a database, in a way that lets multiple annotators use it efficiently. This avoids, for example, loading the same data into memory several times or opening multiple connections to the same database.

We will develop a gazetteer annotator that annotates countries using the SharedCountryResource. The SharedCountryResource provides access to a preloaded database of countries that includes GeoJSON, demonyms, etc.

Configuring Dependencies

As we are developing a new annotator, we need to ensure we have a dependency on the baleen-annotators module, as this will provide many of the base and utility classes that we will use as well as access to other common dependencies. To do this, we need to add the following to our POM file:

<dependency>
    <groupId>uk.gov.dstl.baleen</groupId>
    <artifactId>baleen-annotators</artifactId>
    <version>2.4.0</version>
</dependency>

Creating the Class

To start with, let's create a new Java class called Country which extends AbstractAhoCorasickAnnotator. The AbstractAhoCorasickAnnotator class provides most of the functionality for us, so we only need to provide the code that configures the gazetteer. We will create the class in the uk.gov.dstl.baleen.annotators.guides package to keep it separate from existing annotators. The finished annotator will be identical to the existing Country annotator in the uk.gov.dstl.baleen.annotators.gazetteer package.

package uk.gov.dstl.baleen.annotators.guides;

import java.util.Collections;
import com.google.common.collect.ImmutableSet;

import uk.gov.dstl.baleen.annotators.gazetteer.helpers.AbstractAhoCorasickAnnotator;
import uk.gov.dstl.baleen.core.pipelines.orderers.AnalysisEngineAction;
import uk.gov.dstl.baleen.exceptions.BaleenException;
import uk.gov.dstl.baleen.resources.gazetteer.IGazetteer;
import uk.gov.dstl.baleen.types.semantic.Location;

public class Country extends AbstractAhoCorasickAnnotator {

	public Country() {
	}
	
	@Override
	public IGazetteer configureGazetteer() throws BaleenException {
		return null;
	}
	
	@Override
	public AnalysisEngineAction getAction() {
		return new AnalysisEngineAction(Collections.emptySet(), ImmutableSet.of(Location.class));
	}
}
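
For intuition about the kind of matching a gazetteer annotator performs, here is a minimal, self-contained sketch of Aho-Corasick multi-term search, the algorithm the base class is named after. It is not Baleen's implementation: the class TinyAhoCorasick, its findAll method, and the "term@offset" result format are all hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical, simplified Aho-Corasick matcher; NOT Baleen's implementation. */
final class TinyAhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                      // longest proper suffix that is also a trie prefix
        final List<String> outputs = new ArrayList<>(); // terms that end at this node
    }

    private final Node root = new Node();

    TinyAhoCorasick(List<String> terms) {
        // 1. Build a trie containing every non-empty term.
        for (String term : terms) {
            if (term.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : term.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(term);
        }
        // 2. Breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> e : current.next.entrySet()) {
                Node child = e.getValue();
                Node f = current.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Terms ending at the failure node also end at this node.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns every match as "term@startOffset", ordered by where the match ends. */
    List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Follow failure links until some node accepts the character.
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String term : node.outputs) {
                matches.add(term + "@" + (i - term.length() + 1));
            }
        }
        return matches;
    }
}
```

The point of the technique is that the text is scanned once regardless of how many terms the gazetteer holds, which is why it suits large term lists such as country names. A real annotator would turn each match into an annotation with begin and end offsets rather than a string.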

Adding the External Resource

Baleen uses uimaFIT to handle resource injection, so we can add the resource by adding the following lines at the top of our class body:

@ExternalResource(key = "country")
private SharedCountryResource country;

The key given to the resource identifies a shared instance: every time the same key is used, the same instance of that class is provided. In other words, we are using the same instance of SharedCountryResource as every other class that accesses the resource with the key country.
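
The "same key, same instance" behaviour can be pictured with a small sketch. This is only an illustration of the idea, not how uimaFIT or Baleen manage resources internally: SharedResourceRegistry and CountryData are hypothetical classes that exist nowhere in Baleen.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical registry illustrating key-based sharing; NOT part of Baleen or uimaFIT. */
final class SharedResourceRegistry {
    private final Map<String, Object> instances = new ConcurrentHashMap<>();

    /** Returns the instance bound to the key, running the loader only on the first request. */
    @SuppressWarnings("unchecked")
    <T> T get(String key, Supplier<T> loader) {
        return (T) instances.computeIfAbsent(key, k -> loader.get());
    }
}

/** Hypothetical stand-in for an expensive resource such as a country database. */
final class CountryData {
    // Counts constructions so the example can show the data is loaded once
    // (a plain int is enough for this single-threaded illustration).
    static int loads = 0;

    CountryData() {
        loads++;
    }
}
```

Two callers that both ask for registry.get("country", CountryData::new) receive the same object and the data is constructed once, whereas a different key yields a separate instance.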

The variable country now references the SharedCountryResource, and we could use it directly to access the resource if we wanted. Fortunately though, a lot of the hard work is done for us by the AbstractAhoCorasickAnnotator.

Configuring the Gazetteer

Configuring our gazetteer is straightforward, as there are helper functions to do most of the hard work. All we need to do in this case is force the entity type to be Location (AbstractAhoCorasickAnnotator usually allows the user to specify the type, but here we always want extracted countries to be locations), and then pass our SharedCountryResource to the gazetteer.

package uk.gov.dstl.baleen.annotators.guides;

import java.util.Collections;
import com.google.common.collect.ImmutableSet;

import org.apache.uima.UimaContext;
import org.apache.uima.fit.descriptor.ExternalResource;
import org.apache.uima.resource.ResourceInitializationException;

import uk.gov.dstl.baleen.annotators.gazetteer.helpers.AbstractAhoCorasickAnnotator;
import uk.gov.dstl.baleen.annotators.gazetteer.helpers.GazetteerUtils;
import uk.gov.dstl.baleen.core.pipelines.orderers.AnalysisEngineAction;
import uk.gov.dstl.baleen.exceptions.BaleenException;
import uk.gov.dstl.baleen.resources.SharedCountryResource;
import uk.gov.dstl.baleen.resources.gazetteer.CountryGazetteer;
import uk.gov.dstl.baleen.resources.gazetteer.IGazetteer;
import uk.gov.dstl.baleen.types.semantic.Location;

public class Country extends AbstractAhoCorasickAnnotator {
	@ExternalResource(key = "country")
	private SharedCountryResource country;
	
	@Override
	public void doInitialize(UimaContext aContext) throws ResourceInitializationException {
		type = "Location";
		super.doInitialize(aContext);
	}
	
	@Override
	public IGazetteer configureGazetteer() throws BaleenException {
		IGazetteer gaz = new CountryGazetteer();
		gaz.init(country, GazetteerUtils.configureCountry(caseSensitive));
		
		return gaz;
	}
	
	@Override
	public AnalysisEngineAction getAction() {
		return new AnalysisEngineAction(Collections.emptySet(), ImmutableSet.of(Location.class));
	}
}

Testing

And that's it! We should now have a working gazetteer annotator that uses an external resource. You can include it in your pipeline following the information at Using Third Party Components.

To check that it works, we can write and run unit tests to ensure the output is as expected.

package uk.gov.dstl.baleen.annotators.guides;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;

import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.ExternalResourceFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.resource.ExternalResourceDescription;
import org.junit.Test;

// The Country class under test is in this package, so it needs no import
import uk.gov.dstl.baleen.annotators.testing.AnnotatorTestBase;
import uk.gov.dstl.baleen.resources.SharedCountryResource;
import uk.gov.dstl.baleen.types.semantic.Location;

public class CountryGazetteerTest extends AnnotatorTestBase {
	@Test
	public void test() throws Exception {
		ExternalResourceDescription erd = ExternalResourceFactory.createExternalResourceDescription("country", SharedCountryResource.class);
		AnalysisEngineDescription aed = AnalysisEngineFactory.createEngineDescription(Country.class, "country", erd);
		
		AnalysisEngine ae = AnalysisEngineFactory.createEngine(aed);
		
		jCas.setDocumentText("Last month, Peter visited the coast of Jamaica");
		
		ae.process(jCas);
		
		assertEquals(1, JCasUtil.select(jCas, Location.class).size());
		Location l = JCasUtil.selectByIndex(jCas, Location.class, 0);
		assertEquals("Jamaica", l.getValue());
		assertNotNull(l.getGeoJson());
		
		ae.destroy();
	}
}