How to set up a DEV environment
Required Python version >= 3.6
Setting up the environment with pipenv
pipenv is a utility that manages virtual environments and pip dependencies at the same time. To install it, navigate to the project's root directory and run:
pip3 install pipenv
This makes sure that pipenv uses your latest version of Python 3, which should be 3.6 or higher. Please refer to the official website for more information on pipenv.
A Makefile has been created for convenience, so that you can install the project dependencies, download the required models, test and build the tool easily.
To install all of the required packages for development and testing run:
To execute the unit tests run:
make test
Code quality checks can be run with:
make lint
A wheel distribution of this tool can be created with:
How to write your own NER model
NERDS is a framework that provides some NER capabilities, including the option of creating ensembles of NER models, but it is primarily made to be extended. In the following sections we take a look at the basic data exchange classes, and how you can use them to create your own models.
Understanding the main data exchange classes
There are 3 main classes in the nerds.core.model.input.* package that are used in our NER models:
The Document class is the abstract representation of a raw document. It should always implement the plain_text_ attribute, which returns the plain text representation of the object, because that is the text on which NER is performed. Therefore, to process any new document format (XML, PDF, JSON, brat, etc.), the only requirement is to write an adapter that reads the file(s) from an input directory and transforms them into
Document objects. The default
Document object works seamlessly with
The Annotation class contains the data for a single annotation: the text (e.g. "fox"), the label (e.g. "ANIMAL") and the offsets of that text in the plain_text_ representation of a Document (e.g. 40-42).
Important to note: the offsets are a 2-tuple of integers that represent the positions of the first and the last character of the annotation. Be careful, because some libraries end the offset one character after the final character, i.e. at start_offset + len(word). This is not the case here: we currently end the offsets at exactly the final character, i.e. at start_offset + len(word) - 1.
The AnnotatedDocument class is a combination of a Document and a list of Annotation objects, and it can represent two things:
- Ground truth data (e.g. brat annotation files).
- Predictions on documents after they run through our NER models.
The AnnotatedDocument class exposes the annotated_text_ attribute, which returns the plain text representation of the document with inline annotations.
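To make the relationship between the three classes and the inclusive offset convention concrete, here is a minimal, self-contained sketch. The classes below are hypothetical stand-ins written for illustration, not the actual nerds.core.model.input implementations: the constructor arguments, the make_annotation helper, and the "text[LABEL]" inline format are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Document:
    """Hypothetical stand-in: holds raw content and exposes plain_text_."""
    content: str

    @property
    def plain_text_(self) -> str:
        return self.content


@dataclass
class Annotation:
    """Hypothetical stand-in: text, label and inclusive (start, end) offsets."""
    text: str
    label: str
    offset: Tuple[int, int]


@dataclass
class AnnotatedDocument(Document):
    """Hypothetical stand-in: a document plus a list of annotations."""
    annotations: List[Annotation] = field(default_factory=list)

    @property
    def annotated_text_(self) -> str:
        # The inline format "text[LABEL]" is an assumption made for this sketch.
        pieces, cursor = [], 0
        for ann in sorted(self.annotations, key=lambda a: a.offset[0]):
            _, end = ann.offset
            # The end offset is inclusive, so the slice must stop at end + 1.
            pieces.append(self.plain_text_[cursor:end + 1])
            pieces.append(f"[{ann.label}]")
            cursor = end + 1
        pieces.append(self.plain_text_[cursor:])
        return "".join(pieces)


def make_annotation(text: str, word: str, label: str) -> Annotation:
    """Hypothetical helper: annotate the first occurrence of word in text."""
    start = text.index(word)
    # Inclusive end: start_offset + len(word) - 1.
    return Annotation(word, label, (start, start + len(word) - 1))


text = "The quick brown fox jumps over the lazy dog."
fox = make_annotation(text, "fox", "ANIMAL")
doc = AnnotatedDocument(text, [fox])

start, end = fox.offset
# Converting to a Python slice (exclusive end) requires end + 1.
assert text[start:end + 1] == "fox"
print(fox.offset)
print(doc.annotated_text_)
```

When exchanging offsets with a library that uses exclusive end offsets, subtract 1 from its end offset on the way in and add 1 on the way out.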
Extending the base model class
The basic class that every model needs to extend is the NERModel class in the nerds.core.model.ner.base package. The model class implements a fit/transform API, similar to sklearn. To implement a new model, one must implement the following methods at minimum:
fit: Trains a model given a list of
transform: Gets a list of Document objects and transforms them to
save: Persists a trained model to disk.
load: Loads a previously saved model from disk.
Please note that all of the class methods, utility functions, etc. should operate on AnnotatedDocument objects, to maintain compatibility with the rest of the framework. The only exceptions are "private" methods used internally in classes.
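The four methods can be illustrated with a toy dictionary-based model. This is only a sketch under stated assumptions: the NERModel class and the data classes below are hypothetical stand-ins so that the example runs on its own, the method signatures are modelled on the sklearn convention rather than taken from the framework, and it is assumed that fit receives annotated documents and that transform returns annotated documents. DictionaryNER is an invented name.

```python
import json
import re
from collections import namedtuple

# Hypothetical stand-ins for the framework classes, so the sketch runs alone.
Document = namedtuple("Document", ["plain_text_"])
Annotation = namedtuple("Annotation", ["text", "label", "offset"])
AnnotatedDocument = namedtuple("AnnotatedDocument", ["plain_text_", "annotations"])


class NERModel:
    """Stand-in for the base class; the real signatures may differ."""

    def fit(self, X, y=None):
        raise NotImplementedError

    def transform(self, X, y=None):
        raise NotImplementedError

    def save(self, file_path):
        raise NotImplementedError

    def load(self, file_path):
        raise NotImplementedError


class DictionaryNER(NERModel):
    """Toy model: memorizes annotated strings and looks them up again."""

    def __init__(self):
        self.lexicon = {}

    def fit(self, X, y=None):
        # X: list of annotated documents (assumed to be the ground truth).
        for doc in X:
            for ann in doc.annotations:
                self.lexicon[ann.text] = ann.label
        return self

    def transform(self, X, y=None):
        # X: list of documents; returns one annotated document per input.
        results = []
        for doc in X:
            found = []
            for text, label in self.lexicon.items():
                pattern = r"\b%s\b" % re.escape(text)
                for match in re.finditer(pattern, doc.plain_text_):
                    # The end offset is inclusive, hence match.end() - 1.
                    offset = (match.start(), match.end() - 1)
                    found.append(Annotation(text, label, offset))
            found.sort(key=lambda ann: ann.offset)
            results.append(AnnotatedDocument(doc.plain_text_, found))
        return results

    def save(self, file_path):
        with open(file_path, "w", encoding="utf-8") as handle:
            json.dump(self.lexicon, handle)

    def load(self, file_path):
        with open(file_path, encoding="utf-8") as handle:
            self.lexicon = json.load(handle)
        return self


train = [AnnotatedDocument("A fox ran.", [Annotation("fox", "ANIMAL", (2, 4))])]
model = DictionaryNER().fit(train)
predictions = model.transform([Document("The lazy fox sleeps.")])
print(predictions[0].annotations)
```

Returning self from fit and load is a design choice borrowed from sklearn so that calls can be chained; nothing in the text above requires it.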
So, let's assume you have a dataset that contains annotated text. If it's in a format that is already supported (e.g. brat), then you may just load it into AnnotatedDocument objects using the built-in classes. Otherwise, you will have to extend the nerds.core.model.input.DataInput class to support the format. Then, you may use the built-in NER models (or create your own), either alone or in an ensemble, and evaluate their predictive capabilities on your dataset.
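As an illustration of what such an adapter has to do, the sketch below reads a made-up JSON format from a directory and converts it to annotated documents. Everything here is an assumption: the JSON layout (a "text" field plus "entities" with exclusive end offsets), the read_json_dir function, and the stand-in data classes. The methods that DataInput expects are not described here, so the logic is shown as a plain function that a real subclass would wrap.

```python
import json
from collections import namedtuple
from pathlib import Path

# Hypothetical stand-ins for the framework classes, so the sketch runs alone.
Annotation = namedtuple("Annotation", ["text", "label", "offset"])
AnnotatedDocument = namedtuple("AnnotatedDocument", ["plain_text_", "annotations"])


def read_json_dir(input_dir):
    """Read every *.json file in input_dir into annotated documents.

    Assumed file layout (invented for this example):
    {"text": "...", "entities": [{"start": 2, "end": 5, "label": "ANIMAL"}]}
    where "end" is exclusive, as in Python slices.
    """
    documents = []
    for path in sorted(Path(input_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        text = record["text"]
        annotations = []
        for entity in record.get("entities", []):
            start, end = entity["start"], entity["end"]
            # Convert the exclusive end offset to the inclusive convention.
            annotations.append(
                Annotation(text[start:end], entity["label"], (start, end - 1))
            )
        documents.append(AnnotatedDocument(text, annotations))
    return documents
```

The offset conversion is the step most likely to go wrong when porting a new format, so it is worth covering with a unit test in any real adapter.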
In the nerds.core.model.evaluate package, there are helper methods and classes to perform k-fold cross-validation. Please refer to the nerds.examples package, where you may look at working code examples with real datasets.
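For readers unfamiliar with k-fold cross-validation, the idea can be sketched in plain Python. The two functions below are hypothetical and are not the helpers from nerds.core.model.evaluate; the scoring rule (an entity counts only when label and both offsets match exactly) is one common choice, assumed here for illustration.

```python
def k_fold_splits(items, k):
    """Yield (train, test) lists for k folds, without shuffling."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and len(items)")
    # Fold i takes every k-th item starting at position i.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def exact_match_f1(gold, predicted):
    """F1 score over collections of (label, start, end) triples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


splits = list(k_fold_splits(list(range(6)), 3))
for train, test in splits:
    print(train, test)

gold = [("ANIMAL", 16, 18), ("ANIMAL", 40, 42)]
predicted = [("ANIMAL", 16, 18), ("ANIMAL", 4, 8)]
print(exact_match_f1(gold, predicted))
```

In practice each fold would train a fresh model on the train split, predict on the test split, and average the per-fold scores.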
Contributing to the project
New models and input adapters are always welcome. Please make sure your code is well-documented and readable. Before creating a pull request make sure:
make test shows that all the unit tests pass.
make lint shows no Python code violations.