# Named Entity Recognition

You got a taste of named entity recognition (NER) at the very end of our first introduction to text processing and analytics. NER is relatively simple in concept: we want a model to recognize that

`Harriet Tubman`

is not just the written words `Harriet` and `Tubman`, but that `Harriet Tubman` is the `name` of a `person`. This requires more semantic understanding of what is being written.

NER is one of the core tasks and challenges in NLP. Machine learning models are what allow modern packages to perform NER out of the box at a high level.

In [1]:
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('Apple had a strong third quarter, with revenues of more than 3 trillion USD reported.')
doc.ents

ModuleNotFoundError: No module named 'spacy'

spaCy performs entity recognition out of the box with a loaded model like `en_core_web_lg`. In the sample document it identifies three entities: `Apple`, `a strong third quarter`, and `more than 3 trillion USD`. We can investigate further what kinds of entities it thinks it has found.

In [None]:
for ent in doc.ents:
    print(ent)
    print(ent.label_)
    print('-------')

So these entities have triggered the trained labels `ORG`, `DATE`, and `MONEY` in `en_core_web_lg`. We can also see this in-line in a nicer way.

In [None]:
from spacy import displacy

displacy.render(doc, style="ent")

Entities have to be taught to a model. You can think of the set of entity labels as a sort of 'ontology' that teaches the model how to view the world. You can check which entities are built in by looking at the documentation for the [model itself](https://spacy.io/models/en#en_core_web_lg).

Out of the box `en_core_web_lg` is taught to recognize `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, and `WORK_OF_ART`. You can also check the context of the dataset that it was trained on -- namely web documents.

Proper training is important because it makes the model more robust to less-than-perfect writing, such as missing capitalization.

In [None]:
doc = nlp('apple had a strong third quarter, with revenues of more than 3 trillion USD reported.')
displacy.render(doc, style="ent")

In [None]:
doc = nlp("This year's apple phone is the worst version yet.")
doc.ents

But that does not mean the model is immune to the quirks of written language. General purpose NER models have limits on their ability to recognize all entities -- it is essential that you profile the model performance on a test corpus that you have evaluated **by hand** so that you become familiar with where the model fails and how it performs generally.
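As a rough sketch of what that profiling could look like, the helper below compares hand-labelled entity spans against predicted ones using exact-match precision, recall, and F1. The function name `score_entities` and both example span lists are hypothetical; in practice the predicted spans could be collected as `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc.ents`.

```python
def score_entities(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical hand labels for "This year's apple phone is the worst version yet."
gold = [(0, 9, "DATE"), (12, 17, "ORG")]
# Hypothetical model output that finds the date but misses the lower-cased company
predicted = [(0, 9, "DATE")]

scores = score_entities(gold, predicted)
print(scores)
```

Exact matching is the strictest choice: a span that is off by one character counts as a miss. Whether that is the right standard depends on how you plan to use the entities.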

# Context and localization

In that vein, it is important to assess model performance when you change contexts.

Let's examine Othello again.

In [2]:
with open('../../data/Othello.txt') as f:
    othello = f.read()

We can pick out a single line of dialogue and clean it up before running it through the model.

In [None]:
othello[86:260]

In [None]:
line = ' '.join(othello[86:260].split('RODERIGO. ')[-1].split(' IAGO.')[0].split())
line

In [None]:
doc = nlp(line)
displacy.render(doc, style='ent')

Or we could skip the cleanup and pass in the raw text.

In [None]:
othello[694:1000]

In [None]:
doc = nlp(othello[700:1000])
displacy.render(doc, style="ent")

The truth is that the formatting has little effect on the NER model (we are talking about line breaks here -- not the existence of whitespace at all, which is crucial).

In [None]:
doc = nlp(' '.join(othello[700:1000].split()))
displacy.render(doc, style="ent")

But identification can be sensitive to exactly where the text is cut.

In [None]:
doc = nlp(' '.join(othello[695:1000].split()))
displacy.render(doc, style="ent")

Why would I bring up an example that chops a word in half? You may think of always having "perfect" replication of your source material as data by default, but there is a wide range of data that you may be interested in that comes through an OCR pipeline (i.e. scanned physical documents). The context of your data includes both how it was converted into its current form and the context of its content (subject, time of creation, culture/language written in). These are the important details that you need to take into account when using a pre-trained NLP model.
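One way to get a feel for this sensitivity is to damage clean text on purpose and compare the entities before and after. The sketch below is a crude stand-in for OCR errors, not a model of any real OCR engine: the function name `add_ocr_noise`, the confusion table, and the error rate are all illustrative assumptions.

```python
import random


def add_ocr_noise(text, error_rate=0.05, seed=0):
    """Randomly substitute or drop letters to mimic OCR errors."""
    rng = random.Random(seed)
    # A few visually confusable pairs; illustrative, not exhaustive
    confusions = {"l": "1", "O": "0", "e": "c", "m": "rn", "S": "5"}
    noisy = []
    for char in text:
        if char.isalpha() and rng.random() < error_rate:
            if char in confusions:
                noisy.append(confusions[char])
            # Letters without a confusion entry are dropped entirely
        else:
            noisy.append(char)
    return "".join(noisy)


sample = "Apple had a strong third quarter, with revenues of more than 3 trillion USD reported."
noisy_sample = add_ocr_noise(sample, error_rate=0.2, seed=1)
print(noisy_sample)
```

Running both `sample` and `noisy_sample` through the same pipeline and comparing the resulting `doc.ents` shows which entities survive the damage.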

In [None]:
doc = nlp(othello[:1009])
displacy.render(doc, style='ent')

Here we can tell that there is a context issue -- that the `en_core_web_lg` model has not been trained with the text of plays. 

# Custom trained models

A major advantage of progress in the NLP space has been the sharing of pre-trained models. There are a number of pre-trained models and surrounding packages that are appropriate for different contexts.

Here we will try [BookNLP](https://github.com/booknlp/booknlp), which has been trained for books. BookNLP uses spaCy for POS tagging and BERT models for its other functions.

In [8]:
!conda install pytorch -y

Channels:
 - defaults
 - conda-forge
Platform: osx-arm64

## Package Plan ##

  environment location: /Users/adampah/anaconda3/envs/cssma2025

  added / updated specs:
    - pytor

In [9]:
!pip install booknlp

Collecting booknlp
  Using cached booknlp-1.0.8-py3-none-any.whl.metadata (345 bytes)
INFO: pip is looking at multiple versions of booknlp to determine which version is compatible with other requirements. This could take a while.
  Using cached booknlp-1.0.7.1.tar.gz (2.4 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting spacy>=3 (from booknlp)
  Downloading spacy-3.8.2.tar.gz (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 10.4 MB/s eta 0:00:00
  Installing build dependencies ... error
  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [456 lines of output]
      Ignoring numpy: markers 'python_version < "3.9"' don't match your environment


In [10]:
#Housekeeping for multiple openMP start-ups
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

from booknlp.booknlp import BookNLP

model_params={
        "pipeline":"entity,quote,supersense,event,coref", #This is not the full pipeline for speed
        "model":"small" #This is for our laptops
}

booknlp=BookNLP("en", model_params)

ModuleNotFoundError: No module named 'booknlp'

In [None]:
input_file="../../data/Othello.txt"

output_directory="output_dir/othello/"

book_id="othello"

booknlp.process(input_file, output_directory, book_id)

In [None]:
!ls output_dir/othello/*

Lots to unpack here. The first thing we'll want to look at is the `book.html` -- this will give us the final summary and the annotated text of the passage.

In [None]:
!open output_dir/othello/othello.book.html

We can examine all of the other files too. Each one is structured as a table for further use: tokens has every token, entities has the list of found entities, quotes holds the quotes from the text, and supersense holds tagging related to WordNet.

In [None]:
import pandas as pd

df = pd.read_csv('output_dir/othello/othello.entities', sep='\t')
df.head(15)

### On your own

Evaluate the outputs and gather a sense of where the package struggles with Othello -- what is working and what is not?

# Custom training a NER model

There will come a point where you will need to train your own NER model -- either your text differs too much from the context of the available models, you need to improve performance, or you need custom entity classes. When you train a NER model you will generally select a pre-existing, trained NER model and give it training data with your labelled entities.

Several factors influence which pre-existing model you select: what is the intended purpose? Is there a model trained on your specific subject? What hardware do you have available, or what can you afford to use?

For our purposes in class we will use spaCy for simplicity. Just know that using transformer models with Hugging Face follows effectively the same workflow.

## Step 1 -- Training Data

To keep things simple we will introduce a new `SPEAKER` entity, which will recognize and identify instances where a `SPEAKER` occurs. Creating training data is straightforward but tedious -- we need to build a set of docs, with the entity of interest and its character spans identified like so:

```
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
```
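For `SPEAKER`, some of the tedium can be reduced by generating candidate spans with a regular expression and then checking them by hand. The sketch below assumes a speaker label is an all-caps name followed by a period at the start of a line (like `IAGO.`). Check that assumption against your own copy of the text. The helper name `label_speakers` and the sample dialogue are made up for illustration.

```python
import re

# Assumed convention: an all-caps name followed by a period at the start of a line
SPEAKER_PATTERN = re.compile(r"^([A-Z][A-Z ]+)\.", flags=re.MULTILINE)


def label_speakers(text):
    """Return one (text, [(start, end, 'SPEAKER'), ...]) training example."""
    spans = [
        (match.start(1), match.end(1), "SPEAKER")
        for match in SPEAKER_PATTERN.finditer(text)
    ]
    return (text, spans)


# Hypothetical dialogue, not quoted from the play
example = label_speakers(
    "RODERIGO. A hypothetical first line.\nIAGO. A hypothetical reply."
)
print(example)
```

A pattern like this will miss speakers formatted differently and may pick up all-caps stage directions, so the output is a starting point for manual review rather than finished training data.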

In [None]:
#First I want to make a directory to hold all of our data that we will be generating
os.makedirs('corpus', exist_ok=True)  # does not fail if the directory already exists

### Exercise.

Create 1,000 training data examples of `SPEAKER` for us to use in our pipeline as a `training_data` list

Now that we have our training data, we will split it and process it for the spaCy model.
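If the examples were collected in document order, a fixed slice can put early scenes in training and later scenes in evaluation. One alternative, sketched here with only the standard library, is to shuffle before splitting. The helper name `split_examples`, the 80/20 split, and the toy examples are assumptions for illustration.

```python
import random


def split_examples(examples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the examples and split it into train and dev lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the real training_data list
toy_examples = [(f"SPEAKER{i}. line {i}", [(0, 8, "SPEAKER")]) for i in range(10)]
train_examples, dev_examples = split_examples(toy_examples)
print(len(train_examples), len(dev_examples))
```

Fixing the seed keeps the split reproducible between runs, which matters when you compare models trained at different times.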

In [None]:
len(training_data)

In [None]:
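from spacy.tokens import DocBin

nlp = spacy.blank("en")

db = DocBin()
for text, annotations in training_data[:400]:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        # char_span returns None when the offsets do not line up with token boundaries
        span = doc.char_span(start, end, label=label)
        if span is None:
            print('Skipping misaligned span:', text, (start, end, label))
        else:
            ents.append(span)
    try:
        doc.ents = ents
        db.add(doc)
    except ValueError:
        # Raised, for example, when entity spans overlap
        print(text, annotations)
db.to_disk("./corpus/train.spacy")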

In [None]:
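#And now for the evaluation data
db = DocBin()
for text, annotations in training_data[400:800]:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        # char_span returns None when the offsets do not line up with token boundaries
        span = doc.char_span(start, end, label=label)
        if span is None:
            print('Skipping misaligned span:', text, (start, end, label))
        else:
            ents.append(span)
    try:
        doc.ents = ents
        db.add(doc)
    except ValueError:
        # Raised, for example, when entity spans overlap
        print(text, annotations)
db.to_disk("./corpus/dev.spacy")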

Now we're essentially just working on the command line (but we will do it from the notebook). We're going to need a configuration file -- spaCy has a nice interactive widget for generating one: https://spacy.io/usage/training

Get the configuration and save it as `corpus/base_config.cfg`

Once we have our base configuration then we can fill it in for our specific machine.


In [None]:
!python -m spacy init fill-config ./corpus/base_config.cfg ./corpus/config.cfg

In [None]:
!python -m spacy train ./corpus/config.cfg --output ./corpus/ --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

And as we can see, we achieve a good model extremely quickly. The training run writes `model-best` and `model-last` directories to the output path, and we can load the best model from there.

In [None]:
sp_nlp = spacy.load('corpus/model-best/')

In [None]:
training_data[0]

In [None]:
doc = sp_nlp(training_data[0][0])
displacy.render(doc, style='ent')

Now let's try a longer stretch of text.

In [None]:
doc = sp_nlp(othello[:1209])
displacy.render(doc, style='ent')

In [None]:
doc = sp_nlp(othello[4000:5500])
displacy.render(doc, style='ent')

# Disambiguation

Disambiguation is fundamentally different from the NER task that we have been dealing with and follows as a next step once you have extracted entities. 

Disambiguation is **extremely difficult** in almost any real big data setting. There is just no way around the fact that text data that you obtain will contain ambiguity that is impossible to resolve with 100% certainty unless you were able to speak to the data creator or were a part of the data creation process. 

As a simple example, one of the most active areas of disambiguation research is in Author Name Disambiguation (AND) for scientific publications. If you have two manuscripts that both have the author `Yang, Y.`, how can you be certain that `Yang, Y.` of `Manuscript_1` is the same `Yang, Y.` of `Manuscript_2`? More than one person can have the same name, there can be both a `Paris, TX` and `Paris, FR`, and so on.

There is 'duplication' in language. This is why at national levels people have some form of a number (social security number in the US) that is unique to each individual -- even if they have the same name. For locations, each city in the US has a FIPS code (like we used with the census) that is unique, even if a town or city shares a name with a town in another state.

Disambiguation is the process of creating these unique entity indices for our own entities -- we are essentially taking all of the entities that we find and trying to reduce the pool down to the unique entities and assign a "SSN" to each unique pool of entities. Thus, we know that `IAGO` in line 10 is the same entity as `IAGO` in line 1000.
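As a minimal sketch of the "assign an SSN" idea, the function below gives every mention an integer ID and reuses that ID whenever the normalized name has been seen before. Exact matching after normalization is the simplest possible rule and is an assumption made here for illustration; it would wrongly merge two different people who share a name. The function name and mention list are hypothetical.

```python
from itertools import count


def assign_entity_ids(mentions):
    """Pair each mention with an ID shared by mentions with the same normalized name."""
    next_id = count(1)
    ids_by_name = {}
    assigned = []
    for mention in mentions:
        # Normalize whitespace and case so "IAGO" and "Iago" share a key
        key = " ".join(mention.split()).casefold()
        if key not in ids_by_name:
            ids_by_name[key] = next(next_id)
        assigned.append((mention, ids_by_name[key]))
    return assigned


mentions = ["IAGO", "Othello", "Iago", "OTHELLO", "Desdemona"]
entity_ids = assign_entity_ids(mentions)
print(entity_ids)
```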

## Exercise Discussion. 

What features would you think to use in something like author name disambiguation?


## Back to our usage

We don't have much of a real example to work through here because what we would care about is direct matching. In the real world we would typically have to deal with misspellings or variations in spelling, which is what grows the pipeline. One of the most fundamental techniques in disambiguation is leveraging fuzzy matching to generate potential candidates. Names with a high similarity are kept as potential match candidates for further consideration, and low-similarity candidates are discarded.
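The candidate-generation step can be sketched with the standard library alone. Here `difflib.SequenceMatcher` stands in for a dedicated fuzzy-matching package, so the scores may not match other libraries exactly. The helper names and the cut-off of 80 are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale based on difflib's matching-block ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def candidate_matches(name, known_names, threshold=80):
    """Return (known_name, score) pairs at or above the threshold, best first."""
    scored = [(other, similarity(name, other)) for other in known_names]
    kept = [(other, score) for other, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


known = ["IAGO", "OTHELLO", "DESDEMONA", "RODERIGO"]
candidates = candidate_matches("IAG", known)
print(candidates)
```

Only the surviving candidates go on to the more expensive checks, such as comparing surrounding context, which is what keeps the pipeline tractable on large entity pools.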

In [None]:
!pip install fuzzywuzzy

In [None]:
from fuzzywuzzy import fuzz

fuzz.ratio("IAGO", "IAG")

In [None]:
fuzz.ratio("OTHELLO", "IAGO")

In [None]:
fuzz.ratio("DESDEMONA", "IAGO")

In [None]:
fuzz.ratio("DESDEMONA", "OTHELLO")

As you can see here, a dropped character still gives us a high match ratio, while different names give poor matches. Of course, choosing where 'high' and 'low' sit is effectively another algorithm that you are creating. As such, it needs to be "trained" and optimized to be specific to your data. Importantly, fuzzy matching is not a silver bullet since it relies on numerical measures -- the length of your target strings will have an impact on the calculated ratios (and thus on how you might want to set your cut-off for a high match).
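To see the length effect directly, the sketch below makes the same one-character error on names of different lengths. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, so the exact numbers are an assumption and may differ from other fuzzy-matching packages. The helper name is hypothetical.

```python
from difflib import SequenceMatcher


def drop_last_char_score(name):
    """Similarity (0-100) between a name and the same name missing its final character."""
    return 100 * SequenceMatcher(None, name, name[:-1]).ratio()


for name in ["IAGO", "OTHELLO", "DESDEMONA"]:
    print(name, round(drop_last_char_score(name), 1))
```

The longer the name, the less a single dropped character costs, so a cut-off that cleanly separates short names may let through near-misses among long ones.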