GROBID Training Data Generation
GROBID uses a hierarchy of models, and each model uses training data specific to that model. In the general GROBID training workflow, the training data is generated from the provided PDFs: the existing models pre-annotate the generated XML, and a human annotator then corrects those annotations. The JATS XML is not used in this process.
We extended the process by replacing the model-specific human annotation with auto-annotation based on the provided JATS XML.
For example, for 030544v1, GROBID might generate the following XML fragment for the header model:
<docTitle>
<titlePart>Title<lb/> Sub-tomogram averaging in RELION<lb/></titlePart>
</docTitle>
The actual title according to the JATS XML is Sub-tomogram averaging in RELION, i.e. it doesn’t include the word Title.
In that case the auto-annotation corrects the annotation. Text that has no other corresponding tag is labelled as a note:
<note type="other">Title<lb/></note>
<docTitle>
<titlePart>Sub-tomogram averaging in RELION<lb/></titlePart>
</docTitle>
This relatively simple example would provide GROBID with new training data.
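The relabelling in this example can be illustrated with a minimal Python sketch. It is not the project's implementation: the function name `auto_annotate`, the `target_fields` mapping, and the simplification to whole-line, case-insensitive comparison are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def auto_annotate(
    lines: List[str], target_fields: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Label each line with the field whose target text it equals, else 'note'.

    Hypothetical helper: `target_fields` maps a GROBID tag name to the text
    taken from the JATS XML for that field.
    """
    labelled = []
    for line in lines:
        label = "note"  # default for text without a corresponding tag
        for field, target in target_fields.items():
            if line.strip().lower() == target.strip().lower():
                label = field
                break
        labelled.append((label, line))
    return labelled


# Lines as they appear in the pre-annotated fragment (split at <lb/>).
lines = ["Title", "Sub-tomogram averaging in RELION"]
result = auto_annotate(lines, {"docTitle": "Sub-tomogram averaging in RELION"})
print(result)
# [('note', 'Title'), ('docTitle', 'Sub-tomogram averaging in RELION')]
```

The stray word Title ends up as a note, and only the text that matches the JATS title keeps the docTitle label.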
The auto-annotation uses a fuzzy search to tolerate small differences between the PDF text and the JATS XML. For example, line breaks in the PDF may introduce extra hyphens or other minor changes.
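As a rough illustration of such a fuzzy comparison, the sketch below uses `difflib.SequenceMatcher` from the Python standard library. The normalisation steps, the helper names, and the 0.9 threshold are assumptions chosen for this example, not the matching algorithm the auto-annotation actually uses.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Re-join words hyphenated at a line break, collapse whitespace, lower-case."""
    text = re.sub(r"-\s*\n\s*", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def fuzzy_score(expected: str, candidate: str) -> float:
    """Similarity in [0, 1] between the normalised texts."""
    return SequenceMatcher(None, normalize(expected), normalize(candidate)).ratio()


def is_fuzzy_match(expected: str, candidate: str, threshold: float = 0.9) -> bool:
    # The threshold is an arbitrary value picked for this sketch.
    return fuzzy_score(expected, candidate) >= threshold


jats_title = "Sub-tomogram averaging in RELION"
# The PDF text broke the word across two lines, adding a hyphen.
print(is_fuzzy_match(jats_title, "Sub-tomo-\ngram averaging in RELION"))  # True
print(is_fuzzy_match(jats_title, "Title"))  # False
```

A threshold below 1.0 also tolerates the case where a genuine hyphen sat at the line break and was removed by the normalisation.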
That approach works well as long as the JATS XML matches the document closely. In some cases there isn’t a good match. We excluded from the training set any samples that didn’t meet certain model-specific requirements, e.g. samples with no matching abstract.
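A filter of this kind could look roughly like the following Python sketch. The `REQUIRED_FIELDS` table, the field names, and the sample identifiers are hypothetical; the text above only names a missing abstract as one example of an unmet requirement.

```python
from typing import Dict, Set

# Hypothetical requirements: fields that must have been matched per model.
REQUIRED_FIELDS: Dict[str, Set[str]] = {
    "header": {"title", "abstract"},
    "segmentation": {"abstract"},
}


def meets_requirements(model: str, matched_fields: Set[str]) -> bool:
    """Return True if every field required for the model was matched."""
    required = REQUIRED_FIELDS.get(model, set())
    return required <= matched_fields


# Made-up samples: document id -> fields the auto-annotation could match.
samples = {
    "doc-a": {"title", "abstract", "authors"},
    "doc-b": {"title"},  # no matching abstract
}
kept = [
    name for name, fields in samples.items()
    if meets_requirements("header", fields)
]
print(kept)  # ['doc-a']
```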
The requirements differ between the GROBID models. For example, the overall segmentation model needs to know which text belongs to the “front”. In that case one could include all of the text up to the last element that is considered part of that section, such as the abstract.
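The "up to the last front element" idea can be sketched in Python as follows. The function `label_front`, the `is_front_element` callback, the "other" label for the remaining text, and the sample lines are all assumptions for illustration.

```python
from typing import Callable, List


def label_front(
    lines: List[str], is_front_element: Callable[[str], bool]
) -> List[str]:
    """Label every line up to the last matched front element as 'front'."""
    last_front = -1
    for index, line in enumerate(lines):
        if is_front_element(line):
            last_front = index
    # 'other' is a placeholder label for everything after the front section.
    return ["front" if i <= last_front else "other" for i in range(len(lines))]


# Placeholder document lines; only the title comes from the example above.
lines = [
    "Title",
    "Sub-tomogram averaging in RELION",
    "Some abstract text",
    "1 Introduction",
]
front_texts = {"Sub-tomogram averaging in RELION", "Some abstract text"}
labels = label_front(lines, lambda line: line in front_texts)
print(labels)  # ['front', 'front', 'front', 'other']
```

Text that precedes the last matched element, such as the stray word Title, is included in the front even though it has no match of its own.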
Because of those differences, some model-specific code is required, built on top of a more generic document alignment.
We currently have implementations for the following GROBID models:
- Segmentation
- Header
- Affiliation-Address
- Reference-Segmenter
- Reference / Citation
- FullText
- Figure (label and caption)
- Table (label and caption)