## Named Entity Linking with spaCy and TEI
![](https://explosion.ai/blog/img/spacy-transformers.jpg)

![](https://pbs.twimg.com/media/D0aHPzXWwAEgRwU?format=jpg&name=900x900)

In [3]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")
doc = nlp(
"The School of Information Sciences, also The iSchool at Illinois, is a graduate school at the University of Illinois at Urbana–Champaign. Its Master of Science in Library and Information Science is currently accredited in full good standing by the American Library Association.")
displacy.render(doc, style="ent")

## Problem:
*This works very well for many 20th- and 21st-century texts. But what about early modern English?*

In [4]:
doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
displacy.render(doc, style="ent")

![](./out_of_domain.png)

### In this example, our goal is to teach an existing English-language model to identify early modern place names.

There are several approaches that we could take to this problem. Different approaches can yield better or worse results, and experimentation is an essential part of any machine learning project.

#### How can we teach a statistical language model that Sweveland is a place?

Richard Hakluyt's *The Principal Navigations, Voyages, Traffiques, and Discoveries of the English Nation* (1599)

![](http://www.sequiturbooks.com/image/cache/Product%20Images/2015-12/The-Principal-1512150003/5ae35178-800x800.jpeg)

--- 

### Download the TEI files from Perseus
- We're going to extract a list of all the place names from the text to create training data.
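As a rough illustration of what "extracting place names" means at the XML level, the sketch below pulls `<name type="place">` elements out of a small TEI-like fragment using only the standard library. The fragment, its `key` values, and the helper name `extract_place_names` are invented for illustration; the real Perseus markup may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment; element names follow the notebook's output,
# but the key values are made up.
SAMPLE_TEI = """<TEI.2><text lang="en"><body><div1 type="narrative">
<p>the king of <name type="place" key="tgn,0000001">Denmarke</name>,
<name type="place" key="tgn,0000002">Norway</name> and
<name type="pers">Henry</name></p>
</div1></body></text></TEI.2>"""


def extract_place_names(xml_text):
    """Return (surface form, key) pairs for every <name type="place">."""
    root = ET.fromstring(xml_text)
    pairs = []
    for name in root.iter("name"):
        if name.get("type") == "place":
            surface = "".join(name.itertext()).strip()
            pairs.append((surface, name.get("key")))
    return pairs


place_names = extract_place_names(SAMPLE_TEI)
print(place_names)
```

The cells that follow take a different route (standoff annotations with character offsets), which keeps the surrounding text available as training context.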


In [74]:
import os 
import pickle
from collections import Counter
spec = {"tei":"http://www.tei-c.org/ns/1.0"}
from urllib.request import urlopen
from lxml import etree
from standoffconverter import Converter

def tei_loader(url):
    tei = urlopen(url).read()
    return etree.XML(tei)

table_of_contents_url = "http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D1"
table_of_contents_xml = tei_loader(table_of_contents_url)


chunks = table_of_contents_xml.xpath("//chunk[@ref]")
refs = [chunk.get('ref') for chunk in chunks] 
# an example ref 'Perseus%3Atext%3A1999.03.0070%3Anarrative%3D6'


standoffs = []

for ref in refs:
    try:
        url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + ref

        tei = tei_loader(url)
        so = Converter.from_tree(tei)
        standoffs.append(so)
    except Exception as e:
        print(e)


xmlParseEntityRef: no name, line 103, column 75 (<string>, line 103)
xmlParseEntityRef: no name, line 199, column 94 (<string>, line 199)
xmlParseEntityRef: no name, line 186, column 94 (<string>, line 186)
xmlParseEntityRef: no name, line 803, column 109 (<string>, line 803)
xmlParseEntityRef: no name, line 455, column 89 (<string>, line 455)
xmlParseEntityRef: no name, line 441, column 89 (<string>, line 441)
Unescaped '<' not allowed in attributes values, line 22, column 25 (<string>, line 22)
xmlParseEntityRef: no name, line 49, column 152 (<string>, line 49)
xmlParseEntityRef: no name, line 6, column 152 (<string>, line 6)
xmlParseEntityRef: no name, line 4, column 111 (<string>, line 4)
xmlParseEntityRef: no name, line 34, column 106 (<string>, line 34)
xmlParseEntityRef: no name, line 3, column 149 (<string>, line 3)
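Most of these failures are `xmlParseEntityRef: no name`, which the parser reports when it meets an `&` that does not start an entity reference. One possible workaround, not part of the original notebook, is to escape bare ampersands before parsing. The sketch below uses the standard library only; the helper name `escape_bare_ampersands` and the sample string are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that is NOT followed by "name;", "#123;" or "#x1F;"
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z_][\w.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(xml_text):
    """Replace stray '&' characters with '&amp;' and leave real entities alone."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)


broken = "<p>Norway & Sweveland, losses &amp; perils, &#38; hurts</p>"
fixed = escape_bare_ampersands(broken)
root = ET.fromstring(fixed)  # parses now that the stray '&' is escaped
print(root.text)
```

In the notebook's loader the response would need decoding to `str` first, since `urlopen(url).read()` returns bytes. This does not address the one `Unescaped '<'` error in the log, and named entities the parser does not know would still fail.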


In [75]:
# Here's the text from the TEI document 
standoffs[0].plain

'\nA branch of a Statute made in the eight yeere of Henry the sixt, for the trade to Norwey, Sweveland, Den marke, and Fynmarke. \nITEM because that the kings most deare Uncle, the king\nof Denmarke, Norway\n & Sweveland, as the same our\nsoveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils,\nhurts and damage which have late happened aswell to\nhim and his, as to other foraines and strangers, and also\nfriends and speciall subjects of our said soveraigne Lord\n\n\nthe king of his Realme of England, by ye going in,\nentring & passage of such forain & strange persons into\nhis realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him,\nspecially into his Iles of Fynmarke, and elswhere, aswell\nin their persons as their things and goods: for eschuing\nof such losses, perils, hurts & damages, and that such\nlike (which God forbid) should not hereafter happen: our\nsaid soveraigne Lord t

In [76]:
# Here are the annotations
import json

print(json.dumps(json.loads(standoffs[0].to_json()), indent=2))  # or just standoffs[0].to_json()

[
  {
    "begin": 0,
    "end": 2933,
    "attrib": {},
    "depth": 0,
    "tag": "TEI.2"
  },
  {
    "begin": 0,
    "end": 2933,
    "attrib": {
      "lang": "en"
    },
    "depth": 1,
    "tag": "text"
  },
  {
    "begin": 0,
    "end": 2933,
    "attrib": {},
    "depth": 2,
    "tag": "body"
  },
  {
    "begin": 0,
    "end": 2933,
    "attrib": {
      "type": "narrative",
      "org": "uniform",
      "sample": "complete"
    },
    "depth": 3,
    "tag": "div1"
  },
  {
    "begin": 1,
    "end": 127,
    "attrib": {},
    "depth": 4,
    "tag": "head"
  },
  {
    "begin": 50,
    "end": 55,
    "attrib": {
      "type": "pers"
    },
    "depth": 5,
    "tag": "name"
  },
  {
    "begin": 83,
    "end": 89,
    "attrib": {},
    "depth": 5,
    "tag": "name"
  },
  {
    "begin": 91,
    "end": 100,
    "attrib": {},
    "depth": 5,
    "tag": "name"
  },
  {
    "begin": 117,
    "end": 125,
    "attrib": {},
    "depth": 5,
    "tag": "name"
  },
  {
    "begin": 128

In [77]:
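# Get the text from the TEI document and create training data
import json

WINDOW = 300   # characters of context kept on each side of a place name
places = []    # NER examples: (text, {"entities": [(start, end, label)]})
entities = []  # entity-linker examples: (text, {"links": {(start, end): {id: prob}}})
for standoff in standoffs:
    for annotation in json.loads(standoff.to_json()):
        attrib = annotation['attrib']
        # Skip annotations that are not places or that carry no "key" identifier
        if attrib.get('type') != 'place' or 'key' not in attrib:
            continue
        key_parts = attrib['key'].split(',')
        if len(key_parts) < 2:
            continue
        key = key_parts[1]  # the part after the comma is the TGN number
        #modern_name = attrib['reg']
        begin = annotation['begin']
        end = annotation['end']
        # Clamp the window start at 0; a negative slice index would count from the end
        start = max(0, begin - WINDOW)
        sent = standoff.plain[start:end + WINDOW]
        begin, end = begin - start, end - start  # offsets relative to `sent`
        places.append((sent, {'entities': [(begin, end, "GPE")]}))
        entities.append((sent, {"links": {(begin, end): {key: 1.0}}}))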
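The loop above keeps only annotations whose `type` attribute is `place`, yet several `name` annotations in the dump shown earlier carry no `type` at all. A quick tally of tag and type combinations shows how much is being skipped. The sketch below runs on five annotations copied from that dump (offsets and attributes as printed); on real data the list would come from `json.loads(standoff.to_json())`.

```python
from collections import Counter

# Five annotations copied from the JSON output shown earlier
sample_annotations = [
    {"begin": 1, "end": 127, "attrib": {}, "depth": 4, "tag": "head"},
    {"begin": 50, "end": 55, "attrib": {"type": "pers"}, "depth": 5, "tag": "name"},
    {"begin": 83, "end": 89, "attrib": {}, "depth": 5, "tag": "name"},
    {"begin": 91, "end": 100, "attrib": {}, "depth": 5, "tag": "name"},
    {"begin": 117, "end": 125, "attrib": {}, "depth": 5, "tag": "name"},
]

# Count (tag, type) pairs; a missing type attribute is recorded as None
tag_type_counts = Counter(
    (a["tag"], a["attrib"].get("type")) for a in sample_annotations
)

for (tag, type_), count in tag_type_counts.most_common():
    print(tag, type_, count)
```

Here three of the four `name` annotations have no `type`, so a `type == 'place'` filter would not pick them up as training examples.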
        
    

Here is the documentation for training the named entity recognizer. Note the format expected for training data:  https://spacy.io/usage/training#ner

In [78]:
print('found',len(places),'places')
places[4]


found 10000 places


('and brede.\nWhat hath then Flanders, bee Flemings lieffe or loth,\nBut a little Mader and Flemish Cloth:\nBy Drapering of our wooll in substance\nLiven her commons, this is her governance,\nWithout wich they may not live at ease.\nThus must hem sterve, or with us must have peace.\n\n\n\nOf the commodities of Portugal\n. The second Chapter.\n\n       THE Marchandy also of Portugal\n\n       By divers lands turne into sale.\n       Portugalers with us have trouth in hand:\n       Whose Marchandy commeth much into England.\n       They ben our friends, with their commodities,\n       And wee English passen into their countr',
 {'entities': [(300, 309, 'GPE')]})
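Each `(start, end, label)` triple is a pair of character offsets into the example text, so it can be worth checking that the offsets fall inside the text and that the slice they select has no stray whitespace at either end. The sketch below is one hypothetical sanity check (the helper name `find_suspect_spans` and the toy examples are invented here); it could be run over `places` before training.

```python
def find_suspect_spans(examples):
    """Return (example index, span, reason) for spans worth a second look."""
    problems = []
    for i, (text, annotations) in enumerate(examples):
        for start, end, _label in annotations["entities"]:
            if not (0 <= start < end <= len(text)):
                problems.append((i, (start, end), "out of range"))
                continue
            mention = text[start:end]
            if mention != mention.strip():
                problems.append((i, (start, end), "leading/trailing whitespace"))
    return problems


# Toy examples, invented for illustration
toy_examples = [
    ("his realme of Norwey & other dominions", {"entities": [(14, 20, "GPE")]}),
    ("commodities of Portugal\n. The second", {"entities": [(15, 24, "GPE")]}),
    ("Iles of Fynmarke", {"entities": [(8, 40, "GPE")]}),
]

suspects = find_suspect_spans(toy_examples)
for problem in suspects:
    print(problem)
```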

Note the format for entity-linker training data: https://spacy.io/usage/training#entity-linker

In [79]:
print('found',len(entities),'entities')
entities[4]

found 10000 entities


('and brede.\nWhat hath then Flanders, bee Flemings lieffe or loth,\nBut a little Mader and Flemish Cloth:\nBy Drapering of our wooll in substance\nLiven her commons, this is her governance,\nWithout wich they may not live at ease.\nThus must hem sterve, or with us must have peace.\n\n\n\nOf the commodities of Portugal\n. The second Chapter.\n\n       THE Marchandy also of Portugal\n\n       By divers lands turne into sale.\n       Portugalers with us have trouth in hand:\n       Whose Marchandy commeth much into England.\n       They ben our friends, with their commodities,\n       And wee English passen into their countr',
 {'links': {(300, 309): {'1000090': 1.0}}})

## Quick digression: what do we get with our TGN number?

In [80]:
from skosprovider_getty.providers import TGNProvider

tgn = TGNProvider(metadata={'id': 'TGN'})

def print_place_info(tgn_id: str) -> None:
    """Print the labels and notes that TGN holds for one place identifier."""
    place = tgn.get_by_id(tgn_id)

    print('Labels')
    print('------')
    for l in place.labels:
        print(l.language + ': ' + l.label + ' [' + l.type + ']')

    print('Notes')
    print('-----')
    for n in place.notes:
        print(n.language + ': ' + n.note + ' [' + n.type + ']')

# Look up the TGN number attached to the third training example
links = entities[2][1]['links']
for span in links:
    tgn_id = list(links[span].keys())[0]
    print_place_info(tgn_id)




Labels
------
en: Kingston upon Thames [prefLabel]
und: Cyningestum [altLabel]
en: Kingston [altLabel]
und: Moreford [altLabel]
Notes
-----
en: Residential suburb of London; fomerly county town of Surrey, until absorbed by Greater London; 11 Saxon kings crowned here; after London Bridge, first bridge above River Thames built here ca. 1750; once center for brewing, tanning & river barge traffic. [scopeNote]


# Back to making an early modern place-name model

In [64]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")
nlp.pipe_names

['tagger', 'parser', 'ner']

In [65]:
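ner = nlp.get_pipe('ner')
# Note: the training examples built above are labelled "GPE", not "TGN",
# so this extra label is registered but never used by that data.
ner.add_label("TGN")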

In [70]:
import random
from tqdm.autonotebook import tqdm
from spacy.util import minibatch, compounding


TRAIN_DATA = places
n_iter = 1  # number of passes over the training data

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):  # only train NER
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
            )
        print("Losses", losses)

# save once the other pipes are restored, so the full pipeline is written
nlp.to_disk("./tgn")

KeyboardInterrupt: 
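The training loop draws its batch sizes from `compounding(4.0, 32.0, 1.001)`: the size starts at 4 and is multiplied by 1.001 after every batch until it is capped at 32. The pure-Python sketch below approximates that behaviour so the schedule can be inspected without running spaCy. The `_sketch` functions are re-implementations written for this illustration, not spaCy's own code.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Group items into batches whose sizes come from the `sizes` iterator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:
            break
        yield batch


# 10,000 stand-in items, matching the number of examples reported above
batch_lengths = [
    len(batch)
    for batch in minibatch_sketch(range(10000), compounding_sketch(4.0, 32.0, 1.001))
]
print(len(batch_lengths), "batches; first:", batch_lengths[0], "largest:", max(batch_lengths))
```

At a growth rate of 1.001 the size needs roughly 2,080 batches to reach 32, and because the loop creates a new `compounding` generator inside each iteration, the schedule restarts at 4 on every pass over the data.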

In [None]:
### from https://spacy.io/usage/training#ner ###

import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", itn, losses)

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

TRAIN_DATA = places
main(model="en_core_web_lg",output_dir="./tgn",n_iter=10)

Loaded model 'en_core_web_lg'
Losses {'ner': 24174.444000461794}
Losses {'ner': 20239.468402190556}
Losses {'ner': 19364.363913456233}
Losses {'ner': 18863.677987238767}
Losses {'ner': 18814.93196198962}


### Testing
Did it work? 
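Rendering one sentence gives a qualitative answer. For a number, predicted spans can be compared with gold spans on examples the model never saw during training. Note that the spaCy example script above tests on `TRAIN_DATA` itself, which tends to flatter the model. The sketch below computes exact-match precision, recall, and F1 over span sets; the function name `span_prf` and the toy offsets are invented for illustration.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy spans: two of three gold spans are found, plus one false positive
gold = [(53, 61, "GPE"), (63, 69, "GPE"), (72, 81, "GPE")]
predicted = [(53, 61, "GPE"), (63, 69, "GPE"), (20, 25, "GPE")]

precision, recall, f1 = span_prf(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

With a trained pipeline, the predicted spans for a text would be `[(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]`.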

In [84]:
nlp = spacy.load("./tgn")

doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
displacy.render(doc, style="ent")


# Now it's time to train the entity linker! 

In [None]:
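from pathlib import Path

import spacy
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

# EntityEncoder is imported from the spaCy source repository (bin/wiki_entity_linking),
# so this import only works when run from a checkout of that repository.
from bin.wiki_entity_linking.train_descriptions import EntityEncoder


INPUT_DIM = 300  # dimension of pretrained input vectors
DESC_WIDTH = 64  # dimension of output entity vectors
n_iter = 50
output_dir = Path("./tgn_kb")  # assumed output location; set to None to skip saving

nlp = spacy.load("./tgn")  # the NER model saved above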


# check the length of the nlp vectors
if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
    raise ValueError(
        "The `nlp` object should have access to pretrained word vectors, "
        " cf. https://spacy.io/usage/models#languages."
    )

kb = KnowledgeBase(vocab=nlp.vocab)

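# set up the data
# NOTE: this loop expects a dict of {entity_id: (description, frequency)}.
# The `entities` list built earlier holds (text, {"links": ...}) tuples instead,
# so it has to be converted to that shape before this cell can run.
entity_ids = []
descriptions = []
freqs = []
for key, value in entities.items():
    desc, freq = value
    entity_ids.append(key)
    descriptions.append(desc)
    freqs.append(freq)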

# training entity description encodings
# this part can easily be replaced with a custom entity encoder
encoder = EntityEncoder(
    nlp=nlp,
    input_dim=INPUT_DIM,
    desc_width=DESC_WIDTH,
    epochs=n_iter,
)
encoder.train(description_list=descriptions, to_print=True)

# get the pretrained entity vectors
embeddings = encoder.apply_encoder(descriptions)

# set the entities, can also be done by calling `kb.add_entity` for each entity
kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=embeddings)

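# adding aliases, the entities need to be defined in the KB beforehand
# NOTE: this alias and its Wikidata IDs are left over from spaCy's example script;
# replace them with place-name aliases and TGN IDs that exist in this KB.
kb.add_alias(
    alias="Russ Cochran",
    entities=["Q2146908", "Q7381115"],
    probabilities=[0.24, 0.7],  # the sum of these probabilities should not exceed 1
)

# test the trained model
print()
print(kb.get_size_entities(), "kb entities:", kb.get_entity_strings())
print(kb.get_size_aliases(), "kb aliases:", kb.get_alias_strings())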

# save model to output directory
if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    kb_path = str(output_dir / "kb")
    kb.dump(kb_path)
    print()
    print("Saved KB to", kb_path)

    vocab_path = output_dir / "vocab"
    kb.vocab.to_disk(vocab_path)
    print("Saved vocab to", vocab_path)

    print()

    # test the saved model
    # always reload a knowledge base with the same vocab instance!
    print("Loading vocab from", vocab_path)
    print("Loading KB from", kb_path)
    vocab2 = Vocab().from_disk(vocab_path)
    kb2 = KnowledgeBase(vocab=vocab2)
    kb2.load_bulk(kb_path)
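The knowledge-base cell needs entity frequencies and alias probabilities, while the `entities` list built earlier holds `(text, {"links": {(start, end): {id: prob}}})` tuples. One possible bridge, sketched below under that assumption, counts how often each identifier occurs and how often each surface form points to it. The helper name `kb_inputs_from_links`, the toy sentences, and the `id-A`/`id-B` identifiers are invented for illustration; entity descriptions would still have to come from elsewhere, for example the TGN notes.

```python
from collections import Counter, defaultdict


def kb_inputs_from_links(link_examples):
    """Derive entity frequencies and alias prior probabilities from link examples."""
    entity_freqs = Counter()
    alias_counts = defaultdict(Counter)
    for text, annotation in link_examples:
        for (start, end), candidates in annotation["links"].items():
            mention = text[start:end].strip()
            for entity_id in candidates:
                entity_freqs[entity_id] += 1
                alias_counts[mention][entity_id] += 1

    # Turn raw counts into probabilities that sum to 1 for each surface form
    alias_priors = {}
    for mention, counts in alias_counts.items():
        total = sum(counts.values())
        alias_priors[mention] = {eid: n / total for eid, n in counts.items()}
    return entity_freqs, alias_priors


# Toy examples with made-up identifiers
toy_links = [
    ("trade to Norwey and beyond", {"links": {(9, 15): {"id-A": 1.0}}}),
    ("his realme of Norwey", {"links": {(14, 20): {"id-A": 1.0}}}),
    ("a towne called Norwey", {"links": {(15, 21): {"id-B": 1.0}}}),
]

entity_freqs, alias_priors = kb_inputs_from_links(toy_links)
print(dict(entity_freqs))
print(alias_priors)
```

Each entry of `alias_priors` maps onto one `kb.add_alias(alias=..., entities=[...], probabilities=[...])` call, and `entity_freqs` supplies the values for `freq_list`.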





