## Named Entity Linking with spaCy and TEI
![](https://explosion.ai/blog/img/spacy-transformers.jpg)

![](https://pbs.twimg.com/media/D0aHPzXWwAEgRwU?format=jpg&name=900x900)

In [1]:
from IPython.display import IFrame
# Youtube
IFrame("https://www.youtube.com/embed/PW3RJM8tDGo", 560, 315)

In [2]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md") # Note that I use the medium model because entity linking requires a model with vectors
doc = nlp(
"The School of Information Sciences, also The iSchool at Illinois, is a graduate school at the University of Illinois at Urbana–Champaign. Its Master of Science in Library and Information Science is currently accredited in full good standing by the American Library Association.")
displacy.render(doc, style="ent")

## Problem I: Domain
*This works very well for many 20th- and 21st-century texts. But what about early modern English?*

In [3]:
doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
displacy.render(doc, style="ent")

![](./out_of_domain.png)

### In this first example, our goal is to teach an existing English-language model to identify early modern place names.

There are several approaches that we could take to this problem. Different approaches can yield better or worse results, and experimentation is an essential part of any machine learning project.

#### How can we teach a statistical language model that Sweveland is a place? Where can we get data on early modern places?

One source is Richard Hakluyt's *The Principal Navigations, Voyages, Traffiques, and Discoveries of the English Nation* (1599).

![](http://www.sequiturbooks.com/image/cache/Product%20Images/2015-12/The-Principal-1512150003/5ae35178-800x800.jpeg)

--- 

### Download the TEI files from Perseus
- We're going to extract a list of all the place names from the text to create training data.
- To make working with the TEI/XML easier, we're using the standoffconverter by David Lassner.
- The converter separates the text from its annotations.
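
To make the idea of standoff annotation concrete, here is a minimal sketch that uses only the standard library. It is not the standoffconverter implementation: the `to_standoff` function and the sample snippet are hypothetical, and the record fields simply mirror the ones that show up in the converter's JSON output further down.

```python
import xml.etree.ElementTree as ET


def to_standoff(root):
    """Return (plain_text, annotations) for an ElementTree element.

    Each annotation records the character offsets that the element's
    text occupies in the plain text, plus its tag, attributes and depth.
    """
    pieces = []       # text fragments in document order
    annotations = []  # one record per XML element
    position = 0      # current length of the plain text

    def visit(element, depth):
        nonlocal position
        record = {
            "begin": position,
            "end": None,
            "attrib": dict(element.attrib),
            "depth": depth,
            "tag": element.tag,
        }
        annotations.append(record)
        if element.text:
            pieces.append(element.text)
            position += len(element.text)
        for child in element:
            visit(child, depth + 1)
            if child.tail:  # text that follows the child, inside this element
                pieces.append(child.tail)
                position += len(child.tail)
        record["end"] = position

    visit(root, 0)
    return "".join(pieces), annotations


# Hypothetical TEI-like snippet, not taken from the Perseus files
sample = ET.fromstring(
    '<div1><head>The trade to <name type="place">Norwey</name> and '
    '<name type="place">Fynmarke</name>.</head></div1>'
)
plain, annotations = to_standoff(sample)
for a in annotations:
    if a["tag"] == "name":
        print(a["begin"], a["end"], plain[a["begin"]:a["end"]])
```

The markup is gone from `plain`, but every element can still be recovered by slicing with its `begin` and `end` offsets, which is the property the training-data cells rely on.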


In [4]:
import os
import pickle
from collections import Counter
from urllib.request import urlopen

from lxml import etree
from standoffconverter import Converter

spec = {"tei": "http://www.tei-c.org/ns/1.0"}  # TEI namespace map

def tei_loader(url):
    tei = urlopen(url).read()
    return etree.XML(tei)

table_of_contents_url = "http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D1"
table_of_contents_xml = tei_loader(table_of_contents_url)


chunks = table_of_contents_xml.xpath("//chunk[@ref]")
refs = [chunk.get('ref') for chunk in chunks] 
# an example ref 'Perseus%3Atext%3A1999.03.0070%3Anarrative%3D6'


standoffs = []

for ref in refs:
    try:
        url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + ref

        tei = tei_loader(url)
        so = Converter.from_tree(tei)
        standoffs.append(so)
    except Exception as e:
        print(e)


xmlParseEntityRef: no name, line 103, column 75 (<string>, line 103)
xmlParseEntityRef: no name, line 199, column 94 (<string>, line 199)
xmlParseEntityRef: no name, line 186, column 94 (<string>, line 186)
xmlParseEntityRef: no name, line 803, column 109 (<string>, line 803)
xmlParseEntityRef: no name, line 455, column 89 (<string>, line 455)
xmlParseEntityRef: no name, line 441, column 89 (<string>, line 441)
Unescaped '<' not allowed in attributes values, line 22, column 25 (<string>, line 22)
xmlParseEntityRef: no name, line 49, column 152 (<string>, line 49)
xmlParseEntityRef: no name, line 6, column 152 (<string>, line 6)
xmlParseEntityRef: no name, line 4, column 111 (<string>, line 4)
xmlParseEntityRef: no name, line 34, column 106 (<string>, line 34)
xmlParseEntityRef: no name, line 3, column 149 (<string>, line 3)
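
Each message above corresponds to a chunk that was skipped. The parser reports `xmlParseEntityRef: no name` when an `&` is not followed by an entity name, which usually points to a bare ampersand in the text. A possible repair, sketched here with the standard library only, is to escape those ampersands before parsing. The `escape_bare_ampersands` helper and the sample string are hypothetical, and this does not address the separate `Unescaped '<'` error.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does not start a named (&amp;) or numeric (&#38;, &#x26;) reference
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text):
    """Replace bare '&' characters with '&amp;' and leave valid references alone."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)


# Hypothetical malformed chunk, not copied from the Perseus response
raw = "<p>the king of Denmarke, Norway & Sweveland</p>"

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False  # the bare ampersand makes the chunk ill-formed

repaired = ET.fromstring(escape_bare_ampersands(raw))
print(parsed_raw, repaired.text)
```

In the loader, the same substitution would have to run on the decoded response before it is handed to the XML parser. Whether that recovers all of the skipped chunks has not been checked here.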


In [5]:
# Here's the text from the TEI document 
standoffs[0].plain

'\nA branch of a Statute made in the eight yeere of Henry the sixt, for the trade to Norwey, Sweveland, Den marke, and Fynmarke. \nITEM because that the kings most deare Uncle, the king\nof Denmarke, Norway\n & Sweveland, as the same our\nsoveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils,\nhurts and damage which have late happened aswell to\nhim and his, as to other foraines and strangers, and also\nfriends and speciall subjects of our said soveraigne Lord\n\n\nthe king of his Realme of England, by ye going in,\nentring & passage of such forain & strange persons into\nhis realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him,\nspecially into his Iles of Fynmarke, and elswhere, aswell\nin their persons as their things and goods: for eschuing\nof such losses, perils, hurts & damages, and that such\nlike (which God forbid) should not hereafter happen: our\nsaid soveraigne Lord t

In [6]:
# Here are the annotations 
import json
print(json.dumps(json.loads(standoffs[0].to_json()), indent=2)) # or just standoffs[0].to_json()

[
  {
    "begin": 0,
    "end": 2933,
    "attrib": {},
    "depth": 0,
    "tag": "TEI.2"
  },
  {
    "begin": 0,
    "end": 2933,
    "attrib": {
      "lang": "en"
    },
    "depth": 1,
    "tag": "text"
  },
  {
    "begin": 0,
    "end": 2933,
    "attrib": {},
    "depth": 2,
    "tag": "body"
  },
  {
    "begin": 0,
    "end": 2933,
    "attrib": {
      "type": "narrative",
      "org": "uniform",
      "sample": "complete"
    },
    "depth": 3,
    "tag": "div1"
  },
  {
    "begin": 1,
    "end": 127,
    "attrib": {},
    "depth": 4,
    "tag": "head"
  },
  {
    "begin": 50,
    "end": 55,
    "attrib": {
      "type": "pers"
    },
    "depth": 5,
    "tag": "name"
  },
  {
    "begin": 83,
    "end": 89,
    "attrib": {},
    "depth": 5,
    "tag": "name"
  },
  {
    "begin": 91,
    "end": 100,
    "attrib": {},
    "depth": 5,
    "tag": "name"
  },
  {
    "begin": 117,
    "end": 125,
    "attrib": {},
    "depth": 5,
    "tag": "name"
  },
  {
    "begin": 128

In [39]:
# Get the text from the TEI document and create training data
import json
places = []
entities = []
place_names = [] 
place_ids = []
names = []
ADD_NAMES = False  # if True, the dataset will include all marked-up names from the TEI
ADD_PLACE = True
for standoff in standoffs:
    for annotation in json.loads(standoff.to_json()):
        try:
            if annotation['tag'] == 'name' and ADD_NAMES:
                begin = annotation['begin']
                end = annotation['end']
                length = end-begin
                sent = standoff.plain[begin-300:end+ 300]
                assert len(sent) > 0
                begin = 300
                end = begin+length
                if '\n' in sent[begin:end]:
                    end -= 1
                place_names.append(sent[begin:end])
                place = (sent, {'entities':[(begin,end,"NAME")]})
                places.append(place)
                
            if annotation['attrib']['type'] == 'place' and ADD_PLACE:
                begin = annotation['begin']
                end = annotation['end']
                length = end-begin
                key = annotation['attrib']['key']
                key = key.split(',')[1]
                place_ids.append(key)
                #modern_name = annotation['attrib']['reg']
                sent = standoff.plain[begin-300:end+ 300]
                assert len(sent) > 0
                begin = 300
                end = begin+length
                if '\n' in sent[begin:end]:
                    end -= 1
                place_names.append(sent[begin:end])
                place = (sent, {'entities':[(begin,end,"TGN")]})
                places.append(place)
                
                dict_1 = {(begin, end): {key: 1.0,}}
                entities.append((sent, {"links": dict_1}))
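        except (KeyError, IndexError, AssertionError):
            # skip annotations without a type/key attribute or with an empty context window
            pass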
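
One detail of the cell above is worth a closer look. It slices `standoff.plain[begin-300:end+300]` and then assumes the entity starts at offset 300 in the window. For an annotation in the first 300 characters, `begin-300` is negative, and Python counts a negative slice start from the end of the string, so the window may come back empty (and the example is skipped) or misaligned. A minimal sketch of a clamped window follows; `context_window` is a hypothetical helper, not part of the notebook.

```python
def context_window(plain, begin, end, margin=300):
    """Return (window, new_begin, new_end) for the span plain[begin:end].

    The window keeps up to `margin` characters on each side, and the
    returned offsets point at the same span inside the window.
    """
    start = max(0, begin - margin)         # never wrap around to the end of the string
    stop = min(len(plain), end + margin)
    return plain[start:stop], begin - start, end - start


# Hypothetical text that is shorter than the margin
text = "A Statute for the trade to Norwey, Sweveland, Denmarke, and Fynmarke."
begin = text.index("Norwey")
window, b, e = context_window(text, begin, begin + len("Norwey"), margin=300)
print(window[b:e])
```

Because the offsets are recomputed from the actual start of the window, the entity span stays correct whether or not a full margin of context is available.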
        
    

In [32]:
import pandas as pd
# Experiment with place names that are labeled as name, but not type:place
names= []
for standoff in standoffs: 
    for annotation in json.loads(standoff.to_json()):
        if annotation['tag'] == 'name':
            begin = annotation['begin']
            end = annotation['end']
            word = standoff.plain[begin:end]
            names.append(word)
names = set(names)
df = pd.DataFrame(names, columns =['name'])
df.to_csv('names.csv', index=False)

In [30]:
df.head()

Unnamed: 0,name
0,Molgomsey
1,Helike Kirke
2,Canaria Ilands
3,Parthions
4,captaine Venner


Here is the documentation for training the named entity recognizer. Note the format expected for training data:  https://spacy.io/usage/training#ner
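
Each training example is a `(text, {"entities": [(start, end, label)]})` pair with character offsets into the text. Since the offsets here are computed rather than annotated by hand, a quick sanity check can help before training. The sketch below is a hypothetical helper (`check_ner_example` is not a spaCy function); it flags spans that fall outside the text or that begin or end on whitespace, which tend not to line up with tokens.

```python
def check_ner_example(example):
    """Return (surface, label, problem) for each span; problem is None if it looks fine."""
    text, annotations = example
    report = []
    for start, end, label in annotations["entities"]:
        surface = text[start:end]
        if not 0 <= start < end <= len(text):
            problem = "offsets out of range"
        elif surface != surface.strip():
            problem = "leading or trailing whitespace"
        else:
            problem = None
        report.append((surface, label, problem))
    return report


# Hypothetical examples built from a phrase in the statute
text = "his Realme of Norwey and other dominions"
start = text.index("Norwey")
good = (text, {"entities": [(start, start + 6, "TGN")]})
bad = (text, {"entities": [(start - 1, start + 6, "TGN")]})  # includes the preceding space

print(check_ner_example(good))
print(check_ner_example(bad))
```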

In [40]:
print('found',len(places),'places')
places[4]


found 90052 places


(' and goods: for eschuing\nof such losses, perils, hurts & damages, and that such\nlike (which God forbid) should not hereafter happen: our\nsaid soveraigne Lord the king hath ordeined and statuted,\nthat all and singular strangers, aswell Englishmen and\nothers willing to apply by Ship and come into his Realme\nof Norwey and other dominions, streits, territories,\njurisdictions, Isles & places aforesaid with their ships,\nto the intent to get or have fish or any other Marchandises,\nor goods, shall apply and come to his Towne of Northberne, where the said king of Denmarke hath specially\nordained and stablished his sta',
 {'entities': [(300, 315, 'NAME')]})

Note the format for entity-linker training data: https://spacy.io/usage/training#entity-linker
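
For the entity linker, each example maps a character span to a dictionary of knowledge-base IDs and scores: 1.0 for the correct ID and 0.0 for competing candidates. The sketch below builds one such example; `to_link_example` is a hypothetical helper, the ID `1000090` is the one that appears in the notebook output for this passage, and `0000000` is a made-up placeholder for a competing candidate.

```python
def to_link_example(text, begin, end, gold_id, candidate_ids=()):
    """Build one entity-linker training example for a single gold span."""
    scores = {candidate: 0.0 for candidate in candidate_ids}
    scores[gold_id] = 1.0  # the correct knowledge-base ID for this span
    return (text, {"links": {(begin, end): scores}})


text = "THE Marchandy also of Portugal"
begin = text.index("Portugal")
end = begin + len("Portugal")

example = to_link_example(text, begin, end, "1000090", candidate_ids=["0000000"])
print(example)
```

The notebook's own data only ever stores the gold ID, so each span has a single candidate with score 1.0.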

In [41]:
print('found',len(entities),'entities')
entities[4]

found 10000 entities


('and brede.\nWhat hath then Flanders, bee Flemings lieffe or loth,\nBut a little Mader and Flemish Cloth:\nBy Drapering of our wooll in substance\nLiven her commons, this is her governance,\nWithout wich they may not live at ease.\nThus must hem sterve, or with us must have peace.\n\n\n\nOf the commodities of Portugal\n. The second Chapter.\n\n       THE Marchandy also of Portugal\n\n       By divers lands turne into sale.\n       Portugalers with us have trouth in hand:\n       Whose Marchandy commeth much into England.\n       They ben our friends, with their commodities,\n       And wee English passen into their countr',
 {'links': {(300, 308): {'1000090': 1.0}}})

In [42]:
from collections import Counter 
place_counts = Counter(place_names)
# place_counts['England'] == 797
place_counts.most_common(10)

[('England', 2189),
 ('English', 1131),
 ('Spaniards', 1004),
 ('', 738),
 ('Spaine', 665),
 ('Countrey', 654),
 ('Indians', 618),
 ('Iland', 615),
 ('America', 580),
 ('Guiana', 564)]

## Quick digression, what do we get with our TGN number?

In [37]:
from skosprovider_getty.providers import TGNProvider

tgn = TGNProvider(metadata={'id': 'TGN'})

def print_place_info(tgn_id: str) -> None:
    """Print the labels and notes that the Getty TGN holds for one place ID."""
    place = tgn.get_by_id(tgn_id)

    print('Labels')
    print('------')
    for l in place.labels:
        print(l.language + ': ' + l.label + ' [' + l.type + ']')

    print('Notes')
    print('-----')
    for n in place.notes:
        print(n.language + ': ' + n.note + ' [' + n.type + ']')

# Look up the TGN ID attached to each linked span of one training example
links = entities[2][1]['links']
for offsets in links.keys():
    tgn_id = list(links[offsets].keys())[0]
    print_place_info(tgn_id)




Labels
------
en: Kingston upon Thames [prefLabel]
und: Moreford [altLabel]
en: Kingston [altLabel]
und: Cyningestum [altLabel]
Notes
-----
en: Residential suburb of London; fomerly county town of Surrey, until absorbed by Greater London; 11 Saxon kings crowned here; after London Bridge, first bridge above River Thames built here ca. 1750; once center for brewing, tanning & river barge traffic. [scopeNote]


# Back to making an early modern place-name model

In [44]:
### from https://spacy.io/usage/training#ner (spaCy v2.x API) ###

import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
from spacy.pipeline import EntityRuler


def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    ruler = EntityRuler(nlp)
    patterns = []
    for text, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            start = ent[0]
            end = ent[1]
            label_ = ent[2]
            word = text[start:end]
            row = {"label":label_, "pattern":word}
            patterns.append(row)    
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)
    print(nlp.pipe_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec", "entity_ruler"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        print(nlp.pipe_names)

        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", itn, losses)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA[:2]:
            doc = nlp2(text)
            displacy.render(doc, style="ent")
            #print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            #print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

TRAIN_DATA = places[:10]
main(model="en_core_web_md",output_dir="./tgn",n_iter=10)

Loaded model 'en_core_web_md'
['tagger', 'parser', 'ner', 'entity_ruler']
['ner', 'entity_ruler']
Losses 0 {'ner': 1293.541628498584}
Losses 1 {'ner': 1245.728833436966}
Losses 2 {'ner': 1274.3573137521744}
Losses 3 {'ner': 1171.7467963695526}
Losses 4 {'ner': 1176.1816000938416}
Losses 5 {'ner': 1194.9532996416092}
Losses 6 {'ner': 1144.7757304906845}
Losses 7 {'ner': 1133.6882915496826}
Losses 8 {'ner': 1087.0501599013805}
Losses 9 {'ner': 1070.9576173126698}
Saved model to tgn
Loading from tgn
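
The training function above adds one ruler pattern per annotation, so a name that occurs many times in the data is added many times, and an empty span would become an empty pattern. A minimal sketch of collecting each distinct pattern once is shown below. `build_patterns` is a hypothetical helper, and whether the duplicates matter in practice has not been measured here.

```python
def build_patterns(train_data):
    """Collect one EntityRuler-style pattern per distinct (label, surface form)."""
    seen = set()
    patterns = []
    for text, annotations in train_data:
        for start, end, label in annotations.get("entities", []):
            surface = text[start:end].strip()
            if not surface or (label, surface) in seen:
                continue  # skip empty strings and repeats
            seen.add((label, surface))
            patterns.append({"label": label, "pattern": surface})
    return patterns


# Hypothetical training examples in the same format as `places`
first = "his Realme of Norwey & other dominions"
second = "Iles of Fynmarke"
s1 = first.index("Norwey")
s2 = second.index("Fynmarke")
sample_data = [
    (first, {"entities": [(s1, s1 + 6, "TGN")]}),
    (first, {"entities": [(s1, s1 + 6, "TGN")]}),           # repeated name
    (second, {"entities": [(s2, s2 + 8, "TGN"), (0, 0, "TGN")]}),  # second span is empty
]

patterns = build_patterns(sample_data)
print(patterns)
```

The resulting list has the same dictionary shape that the cell passes to `ruler.add_patterns`.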


### Testing
Did it work? 

In [45]:
nlp = spacy.load("./tgn")

doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
displacy.render(doc, style="ent")

In [46]:
# Test on early modern text not in the training data
doc = nlp(
    """The army marched from Konia to Kaiseria (Caesarea), and thence to Sivas, where the feast of the Korbân (sacrifice) was celebrated. Here Mustafâ Pâshâ, the emperor's favourite, was promoted to the rank of second vezir, and called into the divân. The army then continued its march to Erzerum. Besides tiie guns provided by the commander-in-chief, there were forty large guns dragged by two thousand pairs of buftaloes. The army entered the castle of Kazmaghan, and halted under the walls of Eriviin in the year 1044 (1634).  
"""
)
displacy.render(doc, style="ent")
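
Looking at the rendered entities gives a first impression, but a number is easier to compare across experiments. The sketch below computes precision, recall and F1 over exact span matches. It is an illustration only: `span_prf` is a hypothetical helper and the spans are invented, not results from the model trained above.

```python
def span_prf(gold_spans, predicted_spans):
    """Precision, recall and F1 over exact (start, end, label) matches."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented spans for illustration
gold = [(0, 5, "TGN"), (10, 16, "TGN"), (20, 28, "TGN"), (30, 35, "TGN")]
predicted = [(0, 5, "TGN"), (10, 16, "TGN"), (40, 45, "TGN")]

precision, recall, f1 = span_prf(gold, predicted)
print(precision, recall, f1)
```

With a spaCy `doc`, the predicted spans could be collected as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`. The gold spans would need to come from held-out annotations that were not used for training.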

# Now it's time to train the entity linker! 

In [17]:
place_counts['Fynmarke'] 

0

In [None]:
entities[0]

## Problem with NER: Yep, it's a PERSON, but *which* person? 

In [None]:
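from wikipedia2vec import Wikipedia2Vec
# MODEL_FILE must point to a pretrained Wikipedia2Vec model file; it is not defined in this notebook
wiki2vec = Wikipedia2Vec.load(MODEL_FILE)
wiki2vec.most_similar(wiki2vec.get_word('yoda'), 5)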

In [None]:
# Build {tgn_id: (surface form, frequency)} from the entity-linker training examples
kb_entities = {}
for text, annotation in entities:
    links = annotation['links']
    start, end = list(links.keys())[0]
    word = text[start:end]
    frequency = place_counts[word]
    place_id = list(links[(start, end)].keys())[0]
    kb_entities[place_id] = (word, frequency)

# Shape used in the spaCy example:
# {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}

# Keep only the first ID seen for each distinct (word, frequency) pair
result = {}
for key, value in kb_entities.items():
    if value not in result.values():
        result[key] = value

kb_entities = result

In [None]:
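# NOTE: `kb` must already be a KnowledgeBase that contains these entities;
# the same loop also runs inside main() in the next cell.
for key in kb_entities.keys():
    kb.add_alias(
        alias=kb_entities[key][0],
        entities=[key],
        probabilities=[1.0],
    )

# An alias can also map to several candidates, as in the spaCy example:
# kb.add_alias(
#     alias="Russ Cochran",
#     entities=["Q2146908", "Q7381115"],
#     probabilities=[0.24, 0.7],  # the sum of these probabilities should not exceed 1
# )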
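
The loop above gives every alias a single candidate with probability 1.0. When the same surface form is linked to more than one ID in the data, the prior probabilities could instead be estimated from how often each pairing occurs. The sketch below shows that calculation with the standard library; `alias_priors` is a hypothetical helper, and `ID_A` and `ID_B` are placeholder IDs rather than real TGN numbers.

```python
from collections import Counter, defaultdict


def alias_priors(pairs):
    """Map each alias to ([kb_id, ...], [prior, ...]) from observed (alias, kb_id) pairs."""
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    alias_totals = Counter(alias for alias, _ in pairs)
    grouped = defaultdict(list)
    for (alias, kb_id), count in pair_counts.items():
        grouped[alias].append((kb_id, count / alias_totals[alias]))
    return {
        alias: ([kb_id for kb_id, _ in candidates], [p for _, p in candidates])
        for alias, candidates in grouped.items()
    }


# Placeholder observations: "Kingston" is linked to two different IDs
observed = (
    [("Kingston", "ID_A")] * 3
    + [("Kingston", "ID_B")]
    + [("Portugal", "1000090")]
)

priors = alias_priors(observed)
for alias, (kb_ids, probabilities) in priors.items():
    print(alias, kb_ids, probabilities)
```

Each result could then be passed on as `kb.add_alias(alias=alias, entities=kb_ids, probabilities=probabilities)`, provided every ID has already been added to the knowledge base. The probabilities for one alias sum to 1, which satisfies the constraint noted in the spaCy example.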

    

In [None]:
from spacy.vocab import Vocab
import spacy
from spacy.kb import KnowledgeBase
from pathlib import Path

from bin.wiki_entity_linking.train_descriptions import EntityEncoder


ENTITIES = kb_entities # {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}

INPUT_DIM = 300  # dimension of pretrained input vectors
DESC_WIDTH = 64  # dimension of output entity vectors


def main(model=None, output_dir=None, n_iter=50):
    """Load the model, create the KB and pretrain the entity encodings.
    If an output_dir is provided, the KB will be stored there in a file 'kb'.
    The updated vocab will also be written to a directory in the output_dir."""

    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)

    # check the length of the nlp vectors
    if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
        raise ValueError(
            "The `nlp` object should have access to pretrained word vectors, "
            " cf. https://spacy.io/usage/models#languages."
        )

    kb = KnowledgeBase(vocab=nlp.vocab)

    # set up the data
    entity_ids = []
    descriptions = []
    freqs = []
    for key, value in ENTITIES.items():
        desc, freq = value
        entity_ids.append(key)
        descriptions.append(desc)
        freqs.append(freq)

    # training entity description encodings
    # this part can easily be replaced with a custom entity encoder
    encoder = EntityEncoder(
        nlp=nlp,
        input_dim=INPUT_DIM,
        desc_width=DESC_WIDTH,
        epochs=n_iter,
    )
    encoder.train(description_list=descriptions, to_print=True)

    # get the pretrained entity vectors
    embeddings = encoder.apply_encoder(descriptions)

    # set the entities, can also be done by calling `kb.add_entity` for each entity
    kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=embeddings)

    # adding aliases, the entities need to be defined in the KB beforehand    
    for key in kb_entities.keys():
        kb.add_alias(
            alias = kb_entities[key][0],
            entities = [key],
            probabilities=[1.0]
        )



    # test the trained model
    print()
    _print_kb(kb)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        kb_path = str(output_dir / "kb")
        kb.dump(kb_path)
        print()
        print("Saved KB to", kb_path)

        vocab_path = output_dir / "vocab"
        kb.vocab.to_disk(vocab_path)
        print("Saved vocab to", vocab_path)


        


def _print_kb(kb):
    print(kb.get_size_entities(), "kb entities:", kb.get_entity_strings())
    print(kb.get_size_aliases(), "kb aliases:", kb.get_alias_strings())


main(model="./tgn",output_dir="./tgn_kb",n_iter=50)

In [None]:
"""
Compatible with: spaCy v2.2.3
Last tested with: v2.2.3
https://spacy.io/usage/training#entity-linker
"""
from __future__ import unicode_literals, print_function

import random
from pathlib import Path

from spacy.symbols import PERSON
from spacy.vocab import Vocab

import spacy
from spacy.kb import KnowledgeBase
from spacy.pipeline import EntityRuler
from spacy.tokens import Span
from spacy.util import minibatch, compounding


# training data
TRAIN_DATA = entities



def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
    """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
    The `vocab` should be the one used during creation of the KB."""
    vocab = Vocab().from_disk(vocab_path)
    # create blank Language class with correct vocab
    nlp = spacy.blank("en", vocab=vocab)
    nlp.vocab.vectors.name = "spacy_pretrained_vectors"
    print("Created blank 'en' model with vocab from '%s'" % vocab_path)

    # Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
    nlp.add_pipe(nlp.create_pipe('sentencizer'))

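    # Add a custom component to recognize "Russ Cochran" as an entity for the example training data.
    # Note that in a realistic application, an actual NER algorithm should be used instead.
    # NOTE: this pattern is carried over from the spaCy example; for the TGN data, the place names
    # must be recognized as entities here, because the entity linker only links spans in doc.ents.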
    ruler = EntityRuler(nlp)
    patterns = [{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    # Create the Entity Linker component and add it to the pipeline.
    if "entity_linker" not in nlp.pipe_names:
        # use only the predicted EL score and not the prior probability (for demo purposes)
        cfg = {"incl_prior": False}
        entity_linker = nlp.create_pipe("entity_linker", cfg)
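        # NOTE: with the loading code below commented out, `kb` must already exist
        # as a global KnowledgeBase before main() is called.
        #kb = KnowledgeBase(vocab=nlp.vocab)
        #kb.load_bulk(kb_path)
        #print("Loaded Knowledge Base from '%s'" % kb_path)
        entity_linker.set_kb(kb)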
        nlp.add_pipe(entity_linker, last=True)

    # Convert the texts to docs to make sure we have doc.ents set for the training examples.
    # Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
    kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
    TRAIN_DOCS = []
    for text, annotation in TRAIN_DATA:
        with nlp.disable_pipes("entity_linker"):
            doc = nlp(text)
        annotation_clean = annotation
        for offset, kb_id_dict in annotation["links"].items():
            new_dict = {}
            for kb_id, value in kb_id_dict.items():
                if kb_id in kb_ids:
                    new_dict[kb_id] = value
                else:
                    print(
                        "Removed", kb_id, "from training because it is not in the KB."
                    )
            annotation_clean["links"][offset] = new_dict
        TRAIN_DOCS.append((doc, annotation_clean))

    # get names of other pipes to disable them during training
    pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train entity linker
        # reset and initialize the weights randomly
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DOCS)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                    sgd=optimizer,
                )
            print(itn, "Losses", losses)

    # test the trained model
    _apply_model(nlp)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print()
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        _apply_model(nlp2)


def _apply_model(nlp):
    for text, annotation in TRAIN_DATA:
        # apply the entity linker which will now make predictions for the 'Russ Cochran' entities
        doc = nlp(text)
        print()
        print("Entities", [(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_kb_id_) for t in doc])


main(kb_path="/Users/ds/projects/spaCy_workshops/iSchool/tgn_kb", vocab_path="/Users/ds/projects/spaCy_workshops/iSchool/tgn_kb/vocab",output_dir="./tgn_kb_1", n_iter=50)


In [None]:
# Reload the knowledge base with the same vocab that was saved alongside it
vocab = Vocab().from_disk("./tgn_kb/vocab")
kb = KnowledgeBase(vocab=vocab)
kb.load_bulk("/Users/ds/projects/spaCy_workshops/iSchool/tgn_kb/kb")
kb_ids = kb.get_entity_strings()