James Baker edited this page May 4, 2017 · 2 revisions

A large number of documents are highly structured, either in their entirety or in parts. Most document authors develop a common template to standardise content, which reduces the time taken to write a document and helps readers quickly assimilate information through consistent layout and style. Templates often provide ‘data oriented’ areas, for example the author and version of the document, an abstract, or an executive summary.

Baleen 2.4 provides the ability to exploit templating information through a set of new annotators. This guide will talk through the process of configuring these annotators in order to extract content from templated documents.

Template Document

To start with, a template document is required from which Baleen will create a template configuration file. The template document should resemble the documents you wish to process, but have Records and Fields marked up instead of actual content.

Records

Records are marked up using two strings of text in the template document to signify the beginning and end of the record.

Beginning of a record

The beginning of a record region is marked up with the following text:

<<record:RecordName:begin>>

where RecordName is a user-specified name that should be unique within the template. Take care when placing the marker that no additional structural elements (e.g. extra paragraphs) are inserted into the document, as these will offset the selector paths that are generated.

End of a record

The end of a record region is marked up with the following text:

<<record:RecordName:end>>

Records should generally not be nested within each other. Nesting will not prevent logical output from being created, but data will be replicated between the records. Again, take care not to insert extra structural elements into the document.

Note that the begin and end suffixes on the record markers are optional: if they are omitted, the first marker with a given name is treated as the beginning and the second marker with that name as the end.
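
To illustrate the pairing rule above, here is a minimal Python sketch. It is not Baleen's implementation: the regular expression and the pair_record_markers helper are hypothetical, and it works on plain text offsets rather than on document structure.

```python
import re

# Hypothetical pattern for the record markers described above;
# Baleen's own parser may differ.
RECORD_MARKER = re.compile(
    r"<<record:(?P<name>\w+)(?::(?P<kind>begin|end))?(?P<attrs>[^>]*)>>"
)


def pair_record_markers(text):
    """Return {record name: (begin_offset, end_offset)} for each marked-up record."""
    spans = {}
    for match in RECORD_MARKER.finditer(text):
        name, kind = match.group("name"), match.group("kind")
        if kind is None:
            # Suffix omitted: the first marker with this name opens the record,
            # the next one closes it.
            kind = "end" if name in spans else "begin"
        if kind == "begin":
            spans[name] = [match.end(), None]
        elif name in spans:
            spans[name][1] = match.start()
    return {name: tuple(span) for name, span in spans.items()}


template = "<<record:Person>>Name: <<field:name>><<record:Person>>"
begin, end = pair_record_markers(template)["Person"]
print(template[begin:end])
```

The sketch returns the offsets of the text between the two markers, so slicing the template with them gives the content of the record region.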

Repeating records

Records can repeat in a document, for example one record for each row of a table, or whole repeating sections. To indicate that a record can repeat in the document, add the word repeat or repeat=true to the begin markup.

<<record:RecordName repeat>>

The repeating algorithm looks for recurrences of the same structural elements, from the preceding structure up to the following structure. With each occurrence, the structural changes are recorded in the RecordStructureManager so that path lookups are updated to take account of the repeated parts.

Fields

Fields are labelled using a single string of text:

<<field:FieldName>>

where FieldName is a user-specified name for the field. Field names should ideally be unique throughout the document, but MUST be unique within a covering record.

A field can be marked as required by setting required="true" (or simply adding required). If a record is missing such a field, the record is invalid and should be discarded.

Fields may have a default value, set using defaultValue="value". This is useful in cases where the author is liable to omit the value, but it can be safely provided.

A more powerful feature for focused extraction is provided by attaching a regular expression to a field, using regex="[^|]*(?=\s)". This allows a field to extract only part of the data within the structural element (for example a table cell) in which it is contained. Care needs to be taken in escaping the regex, which should be treated as if it were in HTML, e.g. use &gt; for >.
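
The effect of a field regex, including the HTML-style escaping, can be sketched in Python as follows. This is an illustration only: extract_field_value is a hypothetical helper, the choice to return the first match is an assumption, and Baleen itself evaluates the pattern with Java's regular expression engine, whose syntax differs from Python's in places.

```python
import html
import re


def extract_field_value(cell_text, escaped_regex):
    """Return the first match of the field's regex in the cell text, or None."""
    # Assumption: the regex attribute is HTML-escaped in the template,
    # so it is unescaped before being compiled.
    pattern = re.compile(html.unescape(escaped_regex))
    match = pattern.search(cell_text)
    return match.group(0) if match else None


# The IP address pattern contains nothing that needs escaping.
ip_regex = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
print(extract_field_value("Host address: 192.168.0.1 (primary)", ip_regex))

# "(?&lt;=&gt; )\w+$" unescapes to "(?<=> )\w+$": the last word after "> ".
print(extract_field_value("Section > Title", r"(?&lt;=&gt; )\w+$"))
```

Only the matched part of the cell is kept, so surrounding labels and punctuation are dropped from the field value.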

Fields can also repeat, so multiple instances of the field occur in the record. This can be used to take all the items in a list, or all the paragraphs in a table cell. This replicates some functionality of repeating records but in some cases it is more appropriate to repeat the field within the record. To indicate a field can repeat add repeat="true" (or simply repeat) to the field.

Some example field markup:

<<field:name>>
<<field:ip regex="\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}" >>
<<field:name defaultValue="Joe Bloggs" >>
<<field:rank required >>
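
As a rough illustration of how the attributes above fit together, the following Python sketch parses field markup and applies the required and defaultValue rules to a record. The parse_field and build_record helpers and both patterns are hypothetical; Baleen's own parsing and validation may behave differently.

```python
import re

# Hypothetical patterns for field markers and their attributes.
FIELD_MARKER = re.compile(r"<<field:(?P<name>\w+)(?P<attrs>[^>]*)>>")
ATTRIBUTE = re.compile(r'(\w+)(?:="([^"]*)"|=(\S+))?')


def parse_field(markup):
    """Parse one field marker into its name and attributes."""
    match = FIELD_MARKER.fullmatch(markup.strip())
    if match is None:
        raise ValueError(f"not a field marker: {markup!r}")
    attrs = {}
    for attr in ATTRIBUTE.finditer(match.group("attrs")):
        key, quoted, bare = attr.groups()
        if quoted is not None:
            attrs[key] = quoted
        elif bare is not None:
            attrs[key] = bare
        else:
            # A bare attribute such as "required" is shorthand for "true".
            attrs[key] = "true"
    return {
        "name": match.group("name"),
        "required": attrs.get("required", "false") == "true",
        "repeat": attrs.get("repeat", "false") == "true",
        "defaultValue": attrs.get("defaultValue"),
        "regex": attrs.get("regex"),
    }


def build_record(fields, extracted):
    """Apply defaults; return None if a required field has no value."""
    record = {}
    for field in fields:
        value = extracted.get(field["name"], field["defaultValue"])
        if value is None and field["required"]:
            return None  # invalid record, discarded
        if value is not None:
            record[field["name"]] = value
    return record


fields = [
    parse_field('<<field:name defaultValue="Joe Bloggs" >>'),
    parse_field("<<field:rank required >>"),
]
print(build_record(fields, {"rank": "Captain"}))
print(build_record(fields, {"name": "Jane Doe"}))
```

In the first call the missing name falls back to its default value; in the second the required rank field is absent, so the whole record is discarded.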

Creating a Configuration File

Next, you will need to run a Baleen pipeline to convert the template document into a template configuration file. A sample pipeline for doing this is as follows (refer to the Javadoc for full details):

collectionreader:
  class: FolderReader
  folders:
  - ./recordDefinitionInput

annotators:
- templates.TemplateFieldDefinitionAnnotator
- templates.TemplateRecordDefinitionAnnotator

consumers:
- class: template.TemplateRecordConfigurationCreatingConsumer
  outputDirectory: recordDefinitions

Processing Documents

Now that we have our configuration file, we can use it to process documents conforming to the template. For instance:

collectionreader:
  class: FolderReader
  folders:
  - ./documents

annotators:
- class: templates.TemplateAnnotator
  recordDefinitionsDirectory: recordDefinitions

The TemplateAnnotator annotates the actual document (JCas) with the new Record and Field types. Multiple templates can be used within the same pipeline (that is, multiple record annotators with different template definitions as input), so record annotations are given a source attribute to allow other annotators to understand their provenance.

Downstream annotators will need to be aware of Record and Field annotation types if you wish to exploit these in your pipeline (for instance the annotators under the uk.gov.dstl.baleen.annotators.templates package). Alternatively, one of the template consumers could be used to output the annotated document to, for example, Mongo and a second pipeline could then ingest the data from Mongo and process only the relevant fields.