# GateNLP

When you develop an NLP system, it's likely you'll end up combining many different components that operate on your text documents: a tokenizer that splits them up into tokens, an NER model (for example from `spaCy` or `Stanza`) that identifies the named entities, a text classification model that you've trained yourself with `transformers`, a gazetteer with words and phrases you want to identify, a list of regular expressions, etc. Integrating all these components is often a hassle. That's why there are tools such as GateNLP, which allow you to build an elegant NLP pipeline with a variety of annotators and to send your documents through it. 

## Creating a document

In this notebook, we'll work with an example document from a corpus of medical transcriptions ([source](https://github.com/socd06/medical-nlp)). Before we start annotating, we create the document and store a short description of its contents as a document feature:

In [1]:
from gatenlp import Document

text = "CHIEF COMPLAINT:,  The patient comes for three-week postpartum checkup, complaining of allergies.,HISTORY OF PRESENT ILLNESS:,  She is doing well postpartum.  She has had no headache.  She is breastfeeding and feels like her milk is adequate.  She has not had much bleeding.  She is using about a mini pad twice a day, not any cramping or clotting and the discharge is turned from red to brown to now slightly yellowish.  She has not yet had sexual intercourse.  She does complain that she has had a little pain with the bowel movement, and every now and then she notices a little bright red bleeding.  She has not been particularly constipated but her husband says she is not eating her vegetables like she should.  Her seasonal allergies have back developed and she is complaining of extremely itchy watery eyes, runny nose, sneezing, and kind of a pressure sensation in her ears.,MEDICATIONS:,  Prenatal vitamins.,ALLERGIES:,  She thinks to Benadryl.,FAMILY HISTORY: , Mother is 50 and healthy.  Dad is 40 and healthy.  Half-sister, age 34, is healthy.  She has a sister who is age 10 who has some yeast infections.,PHYSICAL EXAMINATION:,VITALS:  Weight:  124 pounds.  Blood pressure 96/54.  Pulse:  72.  Respirations:  16.  LMP:  10/18/03.  Age:  39.,HEENT:  Head is normocephalic.  Eyes:  EOMs intact.  PERRLA.  Conjunctiva clear.  Fundi:  Discs flat, cups normal.  No AV nicking, hemorrhage or exudate.  Ears:  TMs intact.  Mouth:  No lesion.  Throat:  No inflammation.  She has allergic rhinitis with clear nasal drainage, clear watery discharge from the eyes.,Abdomen:  Soft.  No masses.,Pelvic:  Uterus is involuting.,Rectal:  She has one external hemorrhoid which has inflamed.  Stool is guaiac negative and using anoscope, no other lesions are identified.,ASSESSMENT/PLAN:,  Satisfactory three-week postpartum course, seasonal allergies.  We will try Patanol eyedrops and Allegra 60 mg twice a day.  She was cautioned about the possibility that this may alter her milk supply.  She is to drink extra fluids and call if she has problems with that.  We will try ProctoFoam HC.  For the hemorrhoids, also increase the fiber in her diet.  That prescription was written, as well as one for Allegra and Patanol.  She additionally will be begin on Micronor because she would like to protect herself from pregnancy until her husband get scheduled in and has a vasectomy, which is their ultimate plan for birth control, and she anticipates that happening fairly soon.  She will call and return if she continues to have problems with allergies.  Meantime, rechecking in three weeks for her final six-week postpartum checkup.soap / chart / progress notes, checkup, allergies, postpartum, complaining of allergies, seasonal allergies, postpartum checkup,"
doc = Document(text)

In [2]:
doc.features["description"] = "Three-Week Postpartum Checkup"

In [3]:
print(doc.save_mem(fmt="json"))

{"annotation_sets": {}, "text": "CHIEF COMPLAINT:,  The patient comes for three-week postpartum checkup, complaining of allergies.,HISTORY OF PRESENT ILLNESS:,  She is doing well postpartum.  She has had no headache.  She is breastfeeding and feels like her milk is adequate.  She has not had much bleeding.  She is using about a mini pad twice a day, not any cramping or clotting and the discharge is turned from red to brown to now slightly yellowish.  She has not yet had sexual intercourse.  She does complain that she has had a little pain with the bowel movement, and every now and then she notices a little bright red bleeding.  She has not been particularly constipated but her husband says she is not eating her vegetables like she should.  Her seasonal allergies have back developed and she is complaining of extremely itchy watery eyes, runny nose, sneezing, and kind of a pressure sensation in her ears.,MEDICATIONS:,  Prenatal vitamins.,ALLERGIES:,  She thinks to Benadryl.,FAMILY HISTOR

## Tokenizing a document

Most of the annotations we'd like to add to the document are at the token level, so we first have to tokenize the text. Because GateNLP has a ready-to-use integration with NLTK, we'll use NLTK's `TreebankWordTokenizer`. We'll add the tokens to the `NLTK` annotation set, so that we remember which annotator was responsible for this annotation step.

In [4]:
from gatenlp.processing.tokenizer import NLTKTokenizer
from nltk.tokenize import TreebankWordTokenizer

tokenizer = NLTKTokenizer(nltk_tokenizer=TreebankWordTokenizer(), out_set="NLTK")
tokenizer(doc)
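
To make the idea of token annotations concrete: a tokenizer annotator records, for each token, where it starts and ends in the text. The following plain-Python sketch imitates that with a simple whitespace split. It is only a stand-in and does not reproduce the rules of the `TreebankWordTokenizer` or GateNLP's internals; the function name `whitespace_token_spans` and the sample string are made up for this illustration.

```python
import re


def whitespace_token_spans(text):
    """Return (start, end, token) triples for every run of non-whitespace characters."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(r"\S+", text)]


sample = "Blood pressure 96/54.  Pulse:  72."
spans = whitespace_token_spans(sample)
for start, end, token in spans:
    # Slicing the text with the stored offsets gives back the token.
    assert sample[start:end] == token
    print(start, end, token)
```

Storing offsets instead of copies of the token strings is what lets later annotators refer back to exact positions in the original text.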

## Adding annotations to a document

Adding a new annotation to a document is straightforward: you specify the offset of the first character, the offset just after the last character, the annotation's label (its type), and any additional features. Here we create a new annotator that takes a regular expression and annotates all its matches in the document with a given label.

In [5]:
import re

from gatenlp.processing.annotator import Annotator

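class RegexAnnotator(Annotator):
    """Annotate every match of a regular expression with a fixed label."""

    def __init__(self, regex_pattern, label, outset=""):
        self.regex = regex_pattern
        self.label = label
        self.outset = outset

    def __call__(self, document, **kwargs):
        # Look up (or create) the output annotation set on the document we were given.
        annset = document.annset(self.outset)
        for match in re.finditer(self.regex, document.text):
            # match.end() is already the offset just after the last matched character.
            annset.add(match.start(), match.end(), self.label, {"text": match.group(0)})

        return document


# Raw string, so the backslashes reach the regex engine unchanged.
quantity_annotator = RegexAnnotator(r"\d+(\.\d+)? (pounds|mg)", "Quantity", outset="regex")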

In [6]:
doc = quantity_annotator(doc)

for annotation in doc.annset("regex").with_type("Quantity"):
    print(annotation)

Annotation(1159,1169,Quantity,features=Features({'text': '124 pounds'}),id=0)
Annotation(1891,1896,Quantity,features=Features({'text': '60 mg'}),id=1)
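
The offsets in this output follow the start/end convention described above: the end offset points just past the last matched character, so slicing the text with both offsets returns the match. Here is a small standard-library check on a shortened sample string; the sample and the variable names are only for illustration, so the offsets differ from those in the full document.

```python
import re

sample = "Weight:  124 pounds.  We will try Allegra 60 mg twice a day."
pattern = r"\d+(\.\d+)? (pounds|mg)"

# Collect (start, end, matched text) for every match of the pattern.
matches = [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, sample)]
for start, end, matched in matches:
    # The half-open span [start, end) covers exactly the matched text.
    assert sample[start:end] == matched
    print(start, end, matched)
```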


We can also visualize the results in the document:

In [7]:
doc

## Applying a gazetteer

Next, we have a small gazetteer of drug names that we'd like to identify in the document.

In [8]:
medications = [
    ("Benadryl", dict(url="https://www.drugs.com/benadryl.html")),
    ("Patanol", dict(url="https://www.drugs.com/patanol.html")),
    ("Allegra", dict(url="https://www.drugs.com/allegra.html"))
]

First we need to tokenize the entries in the gazetteer. We'll do this by simply splitting on whitespace.

Then we create a `TokenGazetteer`, where

- `fmt` sets the format of the gazetteer. In our case, this is `gazlist`: a list of tuples, each containing a list of token strings followed by a dictionary of features,
- `outtype` specifies the type of the annotations that are created,
- `annset` names the annotation set the gazetteer reads its tokens from, and
- `outset` names the annotation set the new annotations are added to.

In [9]:
from gatenlp.processing.gazetteer import TokenGazetteer

medications = [(txt.split(), feats) for txt, feats in medications]

medications_gazetteer = TokenGazetteer(medications, fmt="gazlist", outtype="Medication", annset="NLTK", outset="gazetteer")
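
The `gazlist` format and the idea of a token-based lookup can be illustrated without GateNLP. The sketch below is a simplified stand-in, not the `TokenGazetteer` implementation: the helper `match_token_gazetteer` is hypothetical, it compares token strings exactly, and it works on a plain list of tokens instead of an annotation set. The two-token entry and its `note` feature are invented for the example.

```python
def match_token_gazetteer(tokens, gazlist):
    """Return (first_index, end_index, features) for each gazetteer entry found in tokens."""
    hits = []
    for i in range(len(tokens)):
        for entry_tokens, features in gazlist:
            n = len(entry_tokens)
            # Compare a window of tokens of the same length as the entry.
            if tokens[i:i + n] == entry_tokens:
                hits.append((i, i + n, features))
    return hits


entries = [
    ("Patanol", {"url": "https://www.drugs.com/patanol.html"}),
    ("ProctoFoam HC", {"note": "hypothetical two-token entry"}),
]
# Same conversion as in the notebook: split each entry into a list of tokens.
gazlist = [(txt.split(), feats) for txt, feats in entries]

tokens = "We will try ProctoFoam HC and Patanol eyedrops".split()
hits = match_token_gazetteer(tokens, gazlist)
for first, end, features in hits:
    print(tokens[first:end], features)
```

In this sketch the comparison is on whole tokens, so an entry is only found when the tokenizer produces exactly the same token strings as the ones stored in the list.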

Let's apply the gazetteer and look at the results:

In [10]:
doc = medications_gazetteer(doc)

for annotation in doc.annset("gazetteer").with_type("Medication"):
    print(annotation)

Annotation(1862,1869,Medication,features=Features({'url': 'https://www.drugs.com/patanol.html'}),id=0)
Annotation(1883,1890,Medication,features=Features({'url': 'https://www.drugs.com/allegra.html'}),id=1)
Annotation(2196,2203,Medication,features=Features({'url': 'https://www.drugs.com/allegra.html'}),id=2)


In [11]:
doc

## Adding spaCy annotations

Finally, GateNLP also integrates with spaCy, which lets us add all of spaCy's annotations (tokens, sentences, and entities) to the document.

In [12]:
from gatenlp.lib_spacy import AnnSpacy
import spacy

spacy_pipeline = spacy.load("en_core_web_sm")
spacy_annotator = AnnSpacy(pipeline=spacy_pipeline, outsetname="spaCy")
spacy_annotator(doc)