In [1]:
from IPython.core.display import HTML

def css_styling():
    styles = open("../data/www/styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

# Named Entity Recognition

You got a taste of named entity recognition (NER) at the very end of our first introduction to text processing and analytics. NER is relatively simple in concept: we want models to recognize that

`Harriet Tubman`

is not just the written words `Harriet` and `Tubman`, but that `Harriet Tubman` is the `name` of a `person`. This requires more semantic understanding of what is being written.

NER is one of the core tasks and challenges in NLP. Machine learning models are what allow modern packages to perform NER out of the box at a high level.

In [2]:
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('Apple had a strong third quarter, with revenues of more than 3 trillion USD reported.')
doc.ents

(Apple, a strong third quarter, more than 3 trillion USD)

spaCy performs entity recognition out of the box with a loaded model like `en_core_web_lg`. In the sample document it identifies three entities: `Apple`, `a strong third quarter`, and `more than 3 trillion USD`. We can investigate further which entity types it thinks it has found.

In [3]:
for ent in doc.ents:
    print(ent)
    print(ent.label_)
    print('-------')

Apple
ORG
-------
a strong third quarter
DATE
-------
more than 3 trillion USD
MONEY
-------


So these entities were assigned the labels `ORG`, `DATE`, and `MONEY`, which are among the labels `en_core_web_lg` was trained on. We can also view this inline in a nicer way.

In [4]:
from spacy import displacy

displacy.render(doc, style="ent")

Entities have to be taught to a model. You can think of the set of entity labels as a sort of 'ontology' that teaches the model how to view the world. You can check which entities are built in by looking at the documentation for the [model itself](https://spacy.io/models/en#en_core_web_lg).

Out of the box `en_core_web_lg` is taught to recognize `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, and `WORK_OF_ART`. You can also check the context of the dataset it was trained on -- namely web documents.

Proper training is important because it makes the model more tolerant of less-than-perfect writing.

In [5]:
doc = nlp('apple had a strong third quarter, with revenues of more than 3 trillion USD reported.')
displacy.render(doc, style="ent")

In [6]:
doc = nlp("This year's apple phone is the worst version yet.")
doc.ents

(year,)

But that does not mean the model is robust to every variation in written language. General-purpose NER models are limited in their ability to recognize all entities -- it is essential that you profile the model's performance on a test corpus that you have labelled **by hand** so that you learn where the model fails and how it performs generally.
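
One way to make this profiling concrete is to score the model's predictions against your hand labels. The sketch below assumes entities are stored as `(start_char, end_char, label)` tuples and counts only exact matches. The helper names `span` and `score_entities` and the gold labels are made up for illustration; the predictions mirror the spaCy output for the Apple sentence shown earlier.

```python
def span(text, phrase, label):
    """Build a (start_char, end_char, label) tuple for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def score_entities(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


text = ("Apple had a strong third quarter, with revenues of "
        "more than 3 trillion USD reported.")

# Hypothetical hand labels: this annotator marks only "third quarter" as the DATE
gold = [
    span(text, "Apple", "ORG"),
    span(text, "third quarter", "DATE"),
    span(text, "more than 3 trillion USD", "MONEY"),
]

# Model output, e.g. [(e.start_char, e.end_char, e.label_) for e in doc.ents]
predicted = [
    span(text, "Apple", "ORG"),
    span(text, "a strong third quarter", "DATE"),
    span(text, "more than 3 trillion USD", "MONEY"),
]

precision, recall, f1 = score_entities(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Two of the three predictions match the hand labels exactly, so all three scores come out at 0.67. Exact matching is strict: the `DATE` prediction overlaps the gold span but still counts as an error, which is usually what you want to see when looking for failure cases.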

# Context and localization

In that vein, it is important to assess model performance when you change contexts.

Let's examine Othello again.

In [7]:
othello = open('../data/Othello.txt').read()

We can identify and clean up the lines to run it through.

In [8]:
othello[86:260]

"\n  RODERIGO. Tush, never tell me! I take it much unkindly\n    That thou, Iago, who hast had my purse\n    As if the strings were thine, shouldst know of this.\n  IAGO. 'Sblood,"

In [9]:
line = ' '.join(othello[86:260].split('RODERIGO. ')[-1].split(' IAGO.')[0].split())
line

'Tush, never tell me! I take it much unkindly That thou, Iago, who hast had my purse As if the strings were thine, shouldst know of this.'

In [10]:
doc = nlp(line)
displacy.render(doc, style='ent')

Or we could skip the cleanup and pass the raw text straight to the model.

In [11]:
othello[694:1000]

't circumstance\n    Horribly stuff\'d with epithets of war,\n    And, in conclusion,\n    Nonsuits my mediators; for, "Certes," says he,\n    "I have already chose my officer."\n    And what was he?\n    Forsooth, a great arithmetician,\n    One Michael Cassio, a Florentine\n    (A fellow almost damn\'d in a fair w'

In [12]:
doc = nlp(othello[700:1000])
displacy.render(doc, style="ent")

The truth is that the formatting has little effect on the NER model. (This refers to line breaks and indentation -- not to whether whitespace exists at all, which is crucial.)
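
The `' '.join(text.split())` idiom used in these cells can be checked on its own. This small sketch uses a two-line excerpt from the Othello output above and shows that the idiom flattens line breaks and indentation while leaving the sequence of words untouched:

```python
# A short excerpt of the Othello text, with its original line break and indentation
raw = "    One Michael Cassio, a Florentine\n    (A fellow almost damn'd in a fair w"

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces with single spaces flattens the layout
normalized = " ".join(raw.split())
print(normalized)

# The whitespace-separated words are exactly the same before and after
print(raw.split() == normalized.split())
```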

In [13]:
doc = nlp(' '.join(othello[700:1000].split()))
displacy.render(doc, style="ent")

But identification can be sensitive to exactly where the text starts.

In [14]:
doc = nlp(' '.join(othello[695:1000].split()))
displacy.render(doc, style="ent")

Why would I bring up an example that chops a word in half? You may assume that your data is always a "perfect" replica of the source material, but a wide range of data that you may be interested in comes through an OCR pipeline (i.e. scanned physical documents). The context of your data covers both how it came to be in its current form and its content (subject, time of creation, culture and language it was written in). These are the important details that you need to take into account when using a pre-trained NLP model.

In [15]:
doc = nlp(othello[:1009])
displacy.render(doc, style='ent')

Here we can tell that there is a context issue -- that the `en_core_web_lg` model has not been trained with the text of plays. 

# Custom trained models

One advantage of progress in the NLP space has been the sharing of pre-trained models. There are a number of pre-trained models and surrounding packages that are appropriate for different contexts.

Here we will try [BookNLP](https://github.com/booknlp/booknlp), which has been trained for books. BookNLP uses spaCy for POS tagging and BERT models for its other functions.

In [16]:
#!pip install booknlp

In [20]:
#Housekeeping for multiple openMP start-ups
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

from booknlp.booknlp import BookNLP

model_params={
        "pipeline":"entity,quote,supersense,event,coref", #This is not the full pipeline for speed
        "model":"small" #This is for our laptops
}

booknlp=BookNLP("en", model_params)

{'pipeline': 'entity,quote,supersense,event,coref', 'model': 'small'}
--- startup: 7.905 seconds ---


In [21]:
input_file="../data/Othello.txt"

output_directory="output_dir/othello/"

book_id="othello"

booknlp.process(input_file, output_directory, book_id)

--- spacy: 5.657 seconds ---
--- entities: 38.121 seconds ---
--- quotes: 0.052 seconds ---
--- attribution: 4.006 seconds ---
--- name coref: 0.126 seconds ---
--- coref: 11.642 seconds ---
--- TOTAL (excl. startup): 59.757 seconds ---, 35797 words


In [22]:
!ls output_dir/othello/*

output_dir/othello/othello.book       output_dir/othello/othello.quotes
output_dir/othello/othello.book.html  output_dir/othello/othello.supersense
output_dir/othello/othello.entities   output_dir/othello/othello.tokens


Lots to unpack here. The first thing we'll want to look at is the `book.html` -- this will give us the final summary and the annotated text of the passage.

In [23]:
!open output_dir/othello/othello.book.html

We can examine all of the other files too. Each one of these is structured as a table for further use. `tokens` has every token, `entities` has the list of found entities, `quotes` has the quotes from the text, and `supersense` holds tagging related to WordNet.

In [25]:
import pandas as pd

df = pd.read_csv('output_dir/othello/othello.entities', sep='\t')
df.head(15)

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
0,1,2,2,PROP,GPE,Venice
1,2,4,4,PROP,GPE,Cyprus
2,1,9,9,PROP,GPE,Venice
3,203,11,12,NOM,FAC,A street
4,31,15,15,PROP,GPE,Roderigo
5,89,17,17,PROP,GPE,Iago
6,31,19,19,PROP,PER,RODERIGO
7,32,21,21,PROP,PER,Tush
8,0,25,25,PRON,PER,me
9,0,27,27,PRON,PER,I


### On your own

Evaluate the outputs and gather a sense of where the package struggles with Othello -- what is working and what is not?
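
One possible starting point is to look for coreference chains whose mentions disagree about the entity category. The sketch below re-types the ten rows shown above as tab-separated text and reads them with the standard `csv` module. It assumes the file's columns are `COREF`, `start_token`, `end_token`, `prop`, `cat`, and `text` (the index column from the table display is dropped); the variable names are illustrative.

```python
import csv
import io
from collections import defaultdict

# First rows of othello.entities, copied from the table above (tab-separated).
# To analyse the whole file, pass open("output_dir/othello/othello.entities") instead.
sample = "\n".join([
    "COREF\tstart_token\tend_token\tprop\tcat\ttext",
    "1\t2\t2\tPROP\tGPE\tVenice",
    "2\t4\t4\tPROP\tGPE\tCyprus",
    "1\t9\t9\tPROP\tGPE\tVenice",
    "203\t11\t12\tNOM\tFAC\tA street",
    "31\t15\t15\tPROP\tGPE\tRoderigo",
    "89\t17\t17\tPROP\tGPE\tIago",
    "31\t19\t19\tPROP\tPER\tRODERIGO",
    "32\t21\t21\tPROP\tPER\tTush",
    "0\t25\t25\tPRON\tPER\tme",
    "0\t27\t27\tPRON\tPER\tI",
])

# Collect the set of category labels seen for each coreference id
categories = defaultdict(set)
for row in csv.DictReader(io.StringIO(sample), delimiter="\t"):
    categories[row["COREF"]].add(row["cat"])

# A chain with more than one category is worth a closer look
conflicts = {coref: sorted(cats) for coref, cats in categories.items() if len(cats) > 1}
print(conflicts)
```

On these ten rows the only conflict is chain `31`: `Roderigo` in the cast list is tagged `GPE`, while `RODERIGO` as a speaker is tagged `PER`. Rows such as `Iago` as `GPE` and `Tush` as `PER` do not show up here, so a category-by-category read of the table is still needed.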

# Custom training a NER model

There will come a point where you will need to train your own NER model -- your text differs too much from the context the available models were trained on, you need better performance, or you need custom entity classes. When you train an NER model you will generally select a pre-existing, trained NER model and give it training data with your labelled entities.

Selecting a pre-existing model depends on several questions -- What is the intended purpose? Is there a subject-specific model? What hardware do you have available, or can you afford to use?

For our purposes in class we will use spaCy for simplicity. Just know that the workflow for transformer models with Hugging Face is effectively the same.

## Step 1 -- Training Data

To keep things simple we will introduce a new `SPEAKER` entity, which will recognize and identify instances where a `SPEAKER` occurs. Creating training data is straightforward but tedious -- we need to build a set of docs, with the entity of interest and its character spans identified like so:

```
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
```
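
Because the spans are plain character offsets, slicing the text with them is a quick way to confirm that each annotation covers what you intended. This is a minimal sketch; `check_spans` and `examples` are names invented here, and the check only catches out-of-range offsets and stray whitespace at the span edges.

```python
def check_spans(examples):
    """Return (text, start, end, problem) for every suspicious annotation."""
    problems = []
    for text, annotations in examples:
        for start, end, label in annotations:
            snippet = text[start:end]
            if not 0 <= start < end <= len(text):
                problems.append((text, start, end, "offsets out of range"))
            elif snippet != snippet.strip():
                problems.append((text, start, end, "span starts or ends with whitespace"))
    return problems


examples = [
    ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]

# Print the text each span actually covers
for text, annotations in examples:
    for start, end, label in annotations:
        print(repr(text[start:end]), label)

# An empty list means no problems were found
print(check_spans(examples))
```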

In [26]:
#First I want to make a directory to hold all of our data that we will be generating
os.makedirs('corpus', exist_ok=True)  #does not fail if the directory already exists

### Exercise.

Create 1,000 training data examples of `SPEAKER` for us to use in our pipeline as a `training_data` list.
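
One possible approach, sketched here under an assumption about the play's layout: a speech begins with indentation, an upper-case speaker name, and a period (as in `  RODERIGO. Tush, ...`). The regular expression and the `make_example` helper are illustrative, not a reference solution, and the pattern will need adjusting for any speaker headings that are formatted differently.

```python
import re

# Assumption: optional indentation, then an upper-case name of two or more
# characters (spaces allowed inside), then a period
SPEAKER_PATTERN = re.compile(r"^\s*([A-Z][A-Z ]*[A-Z])\.")


def make_example(line):
    """Return (line, [(start, end, 'SPEAKER')]), or None if no speaker prefix is found."""
    match = SPEAKER_PATTERN.match(line)
    if match is None:
        return None
    return (line, [(match.start(1), match.end(1), "SPEAKER")])


# Three lines taken from the opening of the Othello text
lines = [
    "  RODERIGO. Tush, never tell me! I take it much unkindly",
    "    That thou, Iago, who hast had my purse",
    "  IAGO. 'Sblood,",
]

examples = [ex for ex in map(make_example, lines) if ex is not None]
for ex in examples:
    print(ex)
```

Only the first and third lines produce an example; the continuation line has no speaker prefix and is skipped.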

Now that we have our training data, we will split it and process it for the spaCy model.

In [65]:
len(training_data)

1180

In [66]:
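from spacy.tokens import DocBin

nlp = spacy.blank("en")

#Convert the first 400 examples into spaCy's binary training format
db = DocBin()
for text, annotations in training_data[:400]:
    doc = nlp(text)
    try:
        ents = []
        for start, end, label in annotations:
            span = doc.char_span(start, end, label=label)
            #char_span returns None when the offsets do not line up with token boundaries
            if span is None:
                raise ValueError("span does not align with token boundaries")
            ents.append(span)
        doc.ents = ents
        db.add(doc)
    except ValueError:
        #Report and skip any example with a misaligned or overlapping span
        print(text, annotations)
db.to_disk("./corpus/train.spacy")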

In [67]:
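#And now for the evaluation data
db = DocBin()
for text, annotations in training_data[400:800]:
    doc = nlp(text)
    try:
        ents = []
        for start, end, label in annotations:
            span = doc.char_span(start, end, label=label)
            #char_span returns None when the offsets do not line up with token boundaries
            if span is None:
                raise ValueError("span does not align with token boundaries")
            ents.append(span)
        doc.ents = ents
        db.add(doc)
    except ValueError:
        #Report and skip any example with a misaligned or overlapping span
        print(text, annotations)
db.to_disk("./corpus/dev.spacy")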
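
The cells above take the first 400 examples for training and the next 400 for evaluation. If `training_data` is in play order, each slice covers a different part of the text, so a shuffled split is a common alternative. This is a minimal sketch; `split_examples`, `dev_fraction`, and `seed` are names invented here, and the toy list stands in for the real `training_data`.

```python
import random


def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, dev) lists."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in data; in the notebook this would be the real training_data list
toy_examples = [(f"line {i}", []) for i in range(10)]

train_examples, dev_examples = split_examples(toy_examples)
print(len(train_examples), len(dev_examples))
```

The two lists can then be passed through the same `DocBin` loop in place of the fixed slices.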

Now we're essentially just working on the command line (but we will do it from the notebook). We're going to need a configuration file -- spaCy has a nice interactive widget on its website for generating one: https://spacy.io/usage/training

Get the configuration and save it as `corpus/base_config.cfg`

Once we have our base configuration then we can fill it in for our specific machine.


In [46]:
!python -m spacy init fill-config ./corpus/base_config.cfg ./corpus/config.cfg

✔ Auto-filled config with all values
✔ Saved config
corpus/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [69]:
!python -m spacy train ./corpus/config.cfg --output ./corpus/ --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

ℹ Saving to output directory: corpus
ℹ Using CPU

[2021-12-21 15:43:26,307] [INFO] Set up nlp object from config
[2021-12-21 15:43:26,314] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-21 15:43:26,317] [INFO] Created vocabulary
[2021-12-21 15:43:26,318] [INFO] Finished initializing nlp object
[2021-12-21 15:43:26,679] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     42.83   21.15   11.88   96.00    0.21
  4     200          1.64    584.09  100.00  100.00  100.00    1.00
  9     400          0.00      0.00  100.00  100.00  100.00    1.00
 15     600          0.00      0.00  100.00  100.00  100.00    1.00
 23     800          0.00      0.00  100.00  100.00  100.0

And as we can see, the scores on the evaluation data reach 100 within the first 200 steps, so we achieve a good model extremely quickly. From the outputs we can load up our trained model.

In [70]:
sp_nlp = spacy.load('corpus/model-best/')

In [71]:
training_data[0]

('  RODERIGO. Tush, never tell me! I take it much unkindly',
 [(2, 10, 'SPEAKER')])

In [74]:
doc = sp_nlp(training_data[0][0])
displacy.render(doc, style='ent')

Now let's try it on a larger piece of the text.

In [77]:
doc = sp_nlp(othello[:1209])
displacy.render(doc, style='ent')