## Named Entity Linking with spaCy and TEI
![](https://explosion.ai/blog/img/spacy-transformers.jpg)

![](https://pbs.twimg.com/media/D0aHPzXWwAEgRwU?format=jpg&name=900x900)

In [0]:
from IPython.display import IFrame
# Youtube
IFrame("https://www.youtube.com/embed/PW3RJM8tDGo", 560, 315)

In [0]:
import spacy
spacy.__version__

'2.1.9'

### In this first example, our goal is to teach an existing English-language model to identify early modern place names.

There are several approaches that we could take to this problem. Different approaches can yield better or worse results, and experimentation is an essential part of any machine learning project.

#### How can we teach a statistical language model that Sweveland is a place? Where can we get data on early modern places?

Richard Hakluyt's The Principal Navigations, Voyages, Traffiques, and Discoveries of the English Nation (1599)

![](http://www.sequiturbooks.com/image/cache/Product%20Images/2015-12/The-Principal-1512150003/5ae35178-800x800.jpeg)

--- 

### Download the TEI files from Perseus
- We're going to extract a list of all the place names from the text to create training data.
- To make working with the TEI/XML easier, we're using the `standoffconverter` package by David Lassner
- The converter separates the text and annotations 
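
To make the idea of "separating the text and annotations" concrete, here is a minimal sketch in plain Python. It is not the `standoffconverter` implementation: `to_standoff` is a hypothetical helper, the snippet is invented for illustration, and the `key` value is a placeholder rather than a real TGN id. It walks an XML element and returns the plain text plus a list of records holding each tag's character offsets.

```python
import xml.etree.ElementTree as ET


def to_standoff(element):
    """Return (plain_text, annotations) for an XML element.

    Each annotation records the tag, its attributes, and the character
    offsets of the text that the tag wrapped in the original markup.
    """
    chunks = []
    annotations = []
    pos = 0

    def walk(node):
        nonlocal pos
        begin = pos
        if node.text:
            chunks.append(node.text)
            pos += len(node.text)
        for child in node:
            walk(child)
            # text that follows the child's closing tag belongs to the parent
            if child.tail:
                chunks.append(child.tail)
                pos += len(child.tail)
        annotations.append(
            {"tag": node.tag, "attrib": dict(node.attrib), "begin": begin, "end": pos}
        )

    walk(element)
    return "".join(chunks), annotations


# Invented snippet; "tgn,0000000" is a placeholder key, not a real identifier.
snippet = '<p>the king of <name type="place" key="tgn,0000000">Denmarke</name> and Norway</p>'
plain, annotations = to_standoff(ET.fromstring(snippet))

print(plain)
for annotation in annotations:
    print(annotation["tag"], plain[annotation["begin"]:annotation["end"]])
```

Because the offsets point into the plain text, the markup can be dropped without losing track of where each place name was.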


In [0]:
!wget https://github.com/apjanco/spaCy_workshops/raw/master/Session_6/en_early_modern_places-2.1.0.tar.gz
!pip install en_early_modern_places-2.1.0.tar.gz
!pip install standoffconverter
!pip install skosprovider_getty

--2020-02-22 20:55:34--  https://github.com/apjanco/spaCy_workshops/raw/master/Session_6/en_early_modern_places-2.1.0.tar.gz
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/apjanco/spaCy_workshops/master/Session_6/en_early_modern_places-2.1.0.tar.gz [following]
--2020-02-22 20:55:34--  https://raw.githubusercontent.com/apjanco/spaCy_workshops/master/Session_6/en_early_modern_places-2.1.0.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11156243 (11M) [application/octet-stream]
Saving to: ‘en_early_modern_places-2.1.0.tar.gz.2’


2020-02-22 20:55:35 (134 MB/s) - ‘en_early_modern_places-2.1.0



In [0]:
# Restart the kernel 
import spacy
nlp = spacy.load("en_early_modern_places")

In [0]:
import spacy
from spacy import displacy
from IPython.display import HTML


nlp = spacy.load("en_early_modern_places")

doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
HTML(displacy.render(doc, style="ent"))

There's more information in the TEI than just the place names. Many of the records also carry an id number from the Getty Thesaurus of Geographic Names (TGN). If we add an `entity_linker` component to the model's pipeline, the model will not only recognize that a span is a place, but also link it to a specific place.
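
The two tasks need two shapes of training data. As a small sketch of the formats built later in this notebook: an NER example pairs a text with labelled character spans, and an entity-linking example pairs the same span with a knowledge-base id and a score. The sentence fragment and `PLACEHOLDER_ID` are assumptions made for illustration; the id is not a real TGN identifier.

```python
text = "the king of Denmarke, Norway & Sweveland"
start = text.index("Sweveland")
end = start + len("Sweveland")

PLACEHOLDER_ID = "0000000"  # hypothetical id, stands in for a TGN number

# NER: "this span is a place"
ner_example = (text, {"entities": [(start, end, "TGN")]})

# Entity linking: "this span is this particular place"
el_example = (text, {"links": {(start, end): {PLACEHOLDER_ID: 1.0}}})

print(ner_example[1])
print(el_example[1])
```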

In [0]:
import os 
import pickle
from collections import Counter
spec = {"tei":"http://www.tei-c.org/ns/1.0"}
from urllib.request import urlopen
from lxml import etree
from standoffconverter import Converter

def tei_loader(url):
    tei = urlopen(url).read()
    return etree.XML(tei)

table_of_contents_url = "http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D1"
table_of_contents_xml = tei_loader(table_of_contents_url)


chunks = table_of_contents_xml.xpath("//chunk[@ref]")
refs = [chunk.get('ref') for chunk in chunks] 
# an example ref 'Perseus%3Atext%3A1999.03.0070%3Anarrative%3D6'


standoffs = []

for ref in refs:
    try:
        url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + ref

        tei = tei_loader(url)
        so = Converter.from_tree(tei)
        standoffs.append(so)
    except Exception as e:
        print(e)



xmlParseEntityRef: no name, line 103, column 75 (<string>, line 103)
xmlParseEntityRef: no name, line 199, column 94 (<string>, line 199)
xmlParseEntityRef: no name, line 186, column 94 (<string>, line 186)
xmlParseEntityRef: no name, line 803, column 109 (<string>, line 803)
xmlParseEntityRef: no name, line 455, column 89 (<string>, line 455)
xmlParseEntityRef: no name, line 441, column 89 (<string>, line 441)
Unescaped '<' not allowed in attributes values, line 22, column 25 (<string>, line 22)
xmlParseEntityRef: no name, line 49, column 152 (<string>, line 49)
xmlParseEntityRef: no name, line 6, column 152 (<string>, line 6)
xmlParseEntityRef: no name, line 4, column 111 (<string>, line 4)
xmlParseEntityRef: no name, line 34, column 106 (<string>, line 34)
xmlParseEntityRef: no name, line 3, column 149 (<string>, line 3)


In [0]:
# Get the text from the TEI document and create training data
import json
places = []
entities = []
place_names = [] 
place_ids = []
names = []
ADD_NAMES = False # if True, the dataset will include all marked-up names from the TEI
ADD_PLACE = True
for standoff in standoffs:
    for annotation in json.loads(standoff.to_json()):
        try:
            if annotation['tag'] == 'name' and ADD_NAMES:
                begin = annotation['begin']
                end = annotation['end']
                length = end-begin
                sent = standoff.plain[begin-300:end+ 300]
                assert len(sent) > 0
                begin = 300
                end = begin+length
                if '\n' in sent[begin:end]:
                    end -= 1
                place_names.append(sent[begin:end])
                place = (sent, {'entities':[(begin,end,"NAME")]})
                places.append(place)
                
            if annotation['attrib']['type'] == 'place' and ADD_PLACE:
                begin = annotation['begin']
                end = annotation['end']
                length = end-begin
                key = annotation['attrib']['key']
                key = key.split(',')[1]
                place_ids.append(key)
                #modern_name = annotation['attrib']['reg']
                sent = standoff.plain[begin-300:end+ 300]
                assert len(sent) > 0
                begin = 300
                end = begin+length
                if '\n' in sent[begin:end]:
                    end -= 1
                place_names.append(sent[begin:end])
                place = (sent, {'entities':[(begin,end,"TGN")]})
                places.append(place)
                
                dict_1 = {(begin, end): {key: 1.0,}}
                entities.append((sent, {"links": dict_1}))
        except Exception:
            # skip annotations without a type/key attribute or without usable context
            pass
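
One detail of the cell above is worth spelling out. The slice `standoff.plain[begin-300:end+300]` assumes at least 300 characters precede the annotation; when `begin` is smaller than 300 the start index goes negative, and the example is usually dropped by the `assert`. A possible alternative, sketched here with a hypothetical helper `context_window` that is not part of the notebook, clamps the window to the text and recomputes the offsets:

```python
def context_window(plain, begin, end, width=300):
    """Cut a context window around plain[begin:end].

    Returns the window text and the span's offsets inside that window.
    The window is clamped so that it never reaches outside the text.
    """
    left = max(0, begin - width)
    right = min(len(plain), end + width)
    sent = plain[left:right]
    new_begin = begin - left
    new_end = new_begin + (end - begin)
    return sent, new_begin, new_end


plain = "abc Denmarke xyz"
sent, b, e = context_window(plain, 4, 12, width=2)
print(repr(sent), sent[b:e])
```

With this variant, annotations near the start or end of a chunk stay in the training data instead of being skipped.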

In [0]:
from collections import Counter 
place_counts = Counter(place_names)
# place_counts['England'] == 797
place_counts.most_common(10)

[('England', 797),
 ('America', 290),
 ('Guiana', 281),
 ('China', 255),
 ('Goa', 244),
 ('Pegu', 220),
 ('Russia', 212),
 ('Peru', 202),
 ('Mosco', 195),
 ('Russe', 193)]

In [0]:
kb_entities = {}
for entity in entities: 
    id = entity[1]['links']
    start, end = list(id.keys())[0]
    word = entity[0][start:end]
    frequency = place_counts[word]
    place_id = list(id[(start,end)].keys())
    kb_entities[place_id[0]] = (word, frequency)

    
result = {}

for key,value in kb_entities.items():
    if value not in result.values():
        result[key] = value

kb_entities = result
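
The cell above keeps at most one id per surface form, and the knowledge-base code further down registers each alias with a single entity and a probability of 1.0. If the same spelling points to several places, one option is to estimate a prior for each candidate from the annotation counts. The sketch below is an assumption about how that could be done, not what this notebook does; `alias_priors` and the ids in `mentions` are invented for illustration.

```python
from collections import Counter, defaultdict


def alias_priors(mentions):
    """Estimate P(kb_id | alias) from (surface_form, kb_id) pairs."""
    pair_counts = Counter(mentions)
    per_alias = defaultdict(list)
    for (alias, kb_id), count in pair_counts.items():
        per_alias[alias].append((kb_id, count))

    priors = {}
    for alias, candidates in per_alias.items():
        total = sum(count for _, count in candidates)
        # most frequent candidate first
        candidates.sort(key=lambda item: item[1], reverse=True)
        priors[alias] = [(kb_id, count / total) for kb_id, count in candidates]
    return priors


# Invented mentions; the ids are placeholders, not TGN numbers.
mentions = [
    ("Mosco", "id-A"), ("Mosco", "id-A"), ("Mosco", "id-A"),
    ("Mosco", "id-B"),
    ("Pegu", "id-C"),
]
priors = alias_priors(mentions)
for alias, candidates in priors.items():
    print(alias, candidates)
```

Each alias's list could then be passed to `kb.add_alias` as parallel `entities` and `probabilities` lists, the same arguments the notebook uses with a single candidate.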

In [0]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
!unzip wiki-news-300d-1M.vec.zip

--2020-02-22 21:02:17--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.20.22.166, 104.20.6.166, 2606:4700:10::6814:6a6, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.20.22.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip.3’


2020-02-22 21:02:32 (44.1 MB/s) - ‘wiki-news-300d-1M.vec.zip.3’ saved [681808098/681808098]

Archive:  wiki-news-300d-1M.vec.zip
replace wiki-news-300d-1M.vec? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


spaCy creates entity vectors and requires a model with word vectors, so let's add them. Here we'll add Facebook's fastText vectors. We could just as easily use the GloVe vectors that come with the large English model.
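
The `.vec` file is plain text: a header line with the number of rows and the vector width, then one line per word with the word followed by its values. The sketch below parses a tiny made-up sample in that layout with plain Python lists, to show why the loading code splits from the right. `read_vec` and the sample values are illustrative assumptions, not part of fastText or spaCy.

```python
import io


def read_vec(handle):
    """Parse text in the .vec layout into (nr_row, nr_dim, {word: values})."""
    header = handle.readline()
    nr_row, nr_dim = (int(x) for x in header.split())
    vectors = {}
    for line in handle:
        # split from the right so that a token containing spaces stays whole
        pieces = line.rstrip().rsplit(" ", nr_dim)
        vectors[pieces[0]] = [float(v) for v in pieces[1:]]
    return nr_row, nr_dim, vectors


# Made-up sample with 2 rows of width 3
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nNew York 0.4 0.5 0.6\n")
nr_row, nr_dim, vectors = read_vec(sample)
print(nr_row, nr_dim, sorted(vectors))
```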

In [0]:
import spacy 
import numpy as np
from tqdm.autonotebook import tqdm

path_to_cc_XX_300_vec = "wiki-news-300d-1M.vec"

nlp = spacy.load("en_early_modern_places")

with open(path_to_cc_XX_300_vec, 'rb') as file_:
    header = file_.readline()
    nr_row, nr_dim = header.split()
    nlp.vocab.reset_vectors(width=int(nr_dim))
    for line in tqdm(file_, total=999994):
        line = line.rstrip().decode('utf8')
        pieces = line.rsplit(' ', int(nr_dim))
        word = pieces[0]
        vector = np.asarray([float(v) for v in pieces[1:]], dtype='f')
        nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab

nlp.to_disk('places_vectors')

HBox(children=(IntProgress(value=0, max=999994), HTML(value='')))




In [0]:
#https://spacy.io/usage/training#entity-linker

from spacy.vocab import Vocab
import spacy
from spacy.kb import KnowledgeBase
from pathlib import Path

from bin.wiki_entity_linking.train_descriptions import EntityEncoder


ENTITIES = kb_entities # {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}

INPUT_DIM = 300  # dimension of pretrained input vectors
DESC_WIDTH = 64  # dimension of output entity vectors


def main(model=None, output_dir=None, n_iter=50):
    """Load the model, create the KB and pretrain the entity encodings.
    If an output_dir is provided, the KB will be stored there in a file 'kb'.
    The updated vocab will also be written to a directory in the output_dir."""

    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)

    # check the length of the nlp vectors
    if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
        raise ValueError(
            "The `nlp` object should have access to pretrained word vectors, "
            " cf. https://spacy.io/usage/models#languages."
        )

    kb = KnowledgeBase(vocab=nlp.vocab,entity_vector_length=64)

    # set up the data
    entity_ids = []
    descriptions = []
    freqs = []
    for key, value in ENTITIES.items():
        desc, freq = value
        entity_ids.append(key)
        descriptions.append(desc)
        freqs.append(freq)

    # training entity description encodings
    # this part can easily be replaced with a custom entity encoder
    encoder = EntityEncoder(
        nlp=nlp,
        input_dim=INPUT_DIM,
        desc_width=DESC_WIDTH,
        #epochs=n_iter,
    )
    encoder.train(description_list=descriptions, to_print=True)

    # get the pretrained entity vectors
    embeddings = encoder.apply_encoder(descriptions)

    # set the entities, can also be done by calling `kb.add_entity` for each entity
    kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=embeddings)

    # adding aliases, the entities need to be defined in the KB beforehand    
    for key in kb_entities.keys():
        kb.add_alias(
            alias = kb_entities[key][0],
            entities = [key],
            probabilities=[1.0]
        )



    # test the trained model
    print()
    _print_kb(kb)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        kb_path = str(output_dir / "kb")
        kb.dump(kb_path)
        print()
        print("Saved KB to", kb_path)

        vocab_path = output_dir / "vocab"
        kb.vocab.to_disk(vocab_path)
        print("Saved vocab to", vocab_path)


        


def _print_kb(kb):
    print(kb.get_size_entities(), "kb entities:", kb.get_entity_strings())
    print(kb.get_size_aliases(), "kb aliases:", kb.get_alias_strings())


main(model="places_vectors",output_dir="./tgn_kb",n_iter=50)

Loaded model 'places_vectors'




0 0 nan
Trained on 721 entities across 5 epochs
Final loss: nan





721 kb entities: ['7011546', '1000144', '7015155', '1014466', '7014300', '7013071', '7010744', '2540217', 'Carthage', '1023771', '7014673', '7005903', '1121113', '1000090', '7016768', '1000056', '7017072', '2050214', '7011961', '7007652', '7004456', '1009213', '1007394', '7006190', '1127666', '7011374', '7010547', '1136549', '7011380', '1125922', '7005064', '7002759', '7007664', '7002354', '7009120', '1141024', '1055512', '7011385', '1002308', '7005554', '1020948', '7010028', '6006673', '4003876', '1130786', '7015528', '7002351', '7004109', '7000645', '7006278', '7016548', '7008157', '7009002', '4008282', '7000630', '7007127', '7005286', '7002444', '7004545', '6000442', '1045359', 'Placentia', '7013300', '1127524', '1130850', '7010874', '7005468', '7009090', '7006653', '7016845', '7003387', '7011024', '7012974', '1000160', '1062347', '7015156', '7011953', '1066243', '7012981', '1046884', '1020019', '1136825', '1000149', '7011173', '2226898', '7005019', '1001657', '1000047', '1121336',

In [0]:
"""
Compatible with: spaCy v2.2.3
Last tested with: v2.2.3
https://spacy.io/usage/training#entity-linker
"""
from __future__ import unicode_literals, print_function

import random
from pathlib import Path

from spacy.symbols import PERSON
from spacy.vocab import Vocab

import spacy
from spacy.kb import KnowledgeBase
from spacy.pipeline import EntityRuler
from spacy.tokens import Span
from spacy.util import minibatch, compounding
from bin.wiki_entity_linking.train_descriptions import EntityEncoder

INPUT_DIM = 300  # dimension of pretrained input vectors
DESC_WIDTH = 64  # dimension of output entity vectors

# training data
TRAIN_DATA = places
TRAIN_DATA_ENTS = entities
ENTITIES = kb_entities
n_iter=1
kb_path="tgn_kb/kb"
vocab_path="tgn_kb/vocab"
output_dir="./tgn_kb_1"


"""Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
The `vocab` should be the one used during creation of the KB."""
vocab = Vocab().from_disk(vocab_path)
# load the model with vectors saved earlier (the commented-out line is the blank-model alternative)
nlp = spacy.load('places_vectors')
#nlp = spacy.blank("en", vocab=vocab)
nlp.vocab.vectors.name = "spacy_pretrained_vectors"
print("Created blank 'en' model with vocab from '%s'" % vocab_path)

# Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
if 'sentencizer' not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe('sentencizer'))

# Add an EntityRuler that marks every annotated place name from the training data as an entity.
# Note that in a realistic application, an actual NER algorithm should be used instead.
ruler = EntityRuler(nlp)
patterns = []
for text, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        start = ent[0]
        end = ent[1]
        label_ = ent[2]
        word = text[start:end]
        row = {"label":label_, "pattern":word}
        patterns.append(row)    
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
    

# Create the Entity Linker component and add it to the pipeline.
if "entity_linker" not in nlp.pipe_names:
    # use only the predicted EL score and not the prior probability (for demo purposes)
    cfg = {"incl_prior": False}
    entity_linker = nlp.create_pipe("entity_linker", cfg)
    kb = KnowledgeBase(vocab=nlp.vocab,entity_vector_length=64)

    # set up the data
    entity_ids = []
    descriptions = []
    freqs = []
    for key, value in ENTITIES.items():
        desc, freq = value
        entity_ids.append(key)
        descriptions.append(desc)
        freqs.append(freq)

    # training entity description encodings
    # this part can easily be replaced with a custom entity encoder
    encoder = EntityEncoder(
        nlp=nlp,
        input_dim=INPUT_DIM,
        desc_width=DESC_WIDTH,
        #epochs=n_iter,
    )
    encoder.train(description_list=descriptions, to_print=True)

    # get the pretrained entity vectors
    embeddings = encoder.apply_encoder(descriptions)

    # set the entities, can also be done by calling `kb.add_entity` for each entity
    kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=embeddings)

    # adding aliases, the entities need to be defined in the KB beforehand    
    for key in kb_entities.keys():
        kb.add_alias(
            alias = kb_entities[key][0],
            entities = [key],
            probabilities=[1.0]
        )
    
    #print("Loaded Knowledge Base from '%s'" % kb_path)
    entity_linker.set_kb(kb)
    nlp.add_pipe(entity_linker, last=True)

# Convert the texts to docs to make sure we have doc.ents set for the training examples.
# Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
TRAIN_DOCS = []
for text, annotation in TRAIN_DATA_ENTS:
    with nlp.disable_pipes("entity_linker"):
        doc = nlp(text)
    annotation_clean = annotation
    for offset, kb_id_dict in annotation["links"].items():
        new_dict = {}
        for kb_id, value in kb_id_dict.items():
            if kb_id in kb_ids:
                new_dict[kb_id] = value
            else:
                print(
                    "Removed", kb_id, "from training because it is not in the KB."
                )
        annotation_clean["links"][offset] = new_dict
    TRAIN_DOCS.append((doc, annotation_clean))

# get names of other pipes to disable them during training
pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):  # only train entity linker
    # reset and initialize the weights randomly
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DOCS)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            try:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                    sgd=optimizer,
                )
            except Exception as e:
                #print(e)
                pass
        print(itn, "Losses", losses)



# save model to output directory
if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)
    print()
    print("Saved model to", output_dir)






Created blank 'en' model with vocab from 'tgn_kb/vocab'




0 0 nan
Trained on 721 entities across 5 epochs
Final loss: nan




0 Losses {'entity_linker': nan}

Saved model to tgn_kb_1
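
The training loop above takes its batch sizes from `compounding(4.0, 32.0, 1.001)`. As a rough illustration of that schedule, the sketch below re-creates the idea in plain Python: start at 4, multiply by 1.001 after every batch, and never go above 32. `compounding_sketch` is a stand-in written for this note, assuming an increasing schedule; it is not spaCy's own function.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 3000))
print(sizes[0], sizes[1], sizes[-1])
```

Small batches early on give the model many updates while it knows little; the batches then grow slowly toward the cap.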


In [0]:
for text, annotation in places[:3]:
    # apply the entity linker, which will now make predictions for the recognized place names
    doc = nlp(text)
    print()
    print("Entities", [(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_kb_id_) for t in doc])


Entities [('Hans', 'TGN', '4004801')]
Tokens [('nmarke', '', ''), ('hath', '', ''), ('specially', '', ''), ('\n', '', ''), ('ordained', '', ''), ('and', '', ''), ('stablished', '', ''), ('his', '', ''), ('staple', '', ''), ('for', '', ''), ('the', '', ''), ('concourses', '', ''), ('of', '', ''), ('\n', '', ''), ('strangers', '', ''), ('and', '', ''), ('specially', '', ''), ('of', '', ''), ('Englishmen', '', ''), (',', '', ''), ('to', '', ''), ('the', '', ''), ('exercise', '', ''), ('\n', '', ''), ('of', '', ''), ('such', '', ''), ('Marchandises', '', ''), (':', '', ''), ('granting', '', ''), ('to', '', ''), ('the', '', ''), ('said', '', ''), ('Englishmen', '', ''), ('\n', '', ''), ('that', '', ''), ('they', '', ''), ('shall', '', ''), ('there', '', ''), ('injoy', '', ''), ('in', '', ''), ('and', '', ''), ('by', '', ''), ('all', '', ''), ('things', '', ''), ('the', '', ''), ('same', '', ''), ('\n', '', ''), ('favour', '', ''), (',', '', ''), ('privileges', '', ''), ('and', '', ''), ('p

In [0]:
from skosprovider_getty.providers import TGNProvider
tgn = TGNProvider(metadata={'id': 'TGN'})
def get_place_name(id: str) -> None:
    """Print the labels and notes that TGN holds for the given id."""
    place = tgn.get_by_id(id)

    print('Labels')
    print('------')
    for l in place.labels:
        print(l.language + ': ' + l.label + ' [' + l.type + ']')

    print('Notes')
    print('-----')
    for n in place.notes:
        print(n.language + ': ' + n.note + ' [' + n.type + ']')
for text, annotation in places[:1]:
    doc = nlp(text)
    print()
    print("Entities", [(get_place_name(ent.kb_id_)) for ent in doc.ents])



Labels
------
und: Hans [prefLabel]
Notes
-----
Entities [None]
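
The final line reads `Entities [None]` because `get_place_name` prints its findings and returns nothing. If a name is wanted as a value, one option is a small helper that picks the preferred label from the label objects. The sketch below assumes only the three attributes the cell above already reads (`label`, `type`, `language`); `preferred_label` and `FakeLabel` are hypothetical names, and `FakeLabel` stands in for the provider's label objects so the example runs without a network call.

```python
from collections import namedtuple


def preferred_label(labels, language=None):
    """Return the text of the first prefLabel, favouring the given language."""
    pref = [l for l in labels if l.type == "prefLabel"]
    if language is not None:
        in_language = [l for l in pref if l.language == language]
        if in_language:
            return in_language[0].label
    if pref:
        return pref[0].label
    return None


# Stand-in for the label objects; mirrors the printed record "und: Hans [prefLabel]"
FakeLabel = namedtuple("FakeLabel", "label type language")
labels = [FakeLabel("Hans", "prefLabel", "und")]
print(preferred_label(labels))
```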


In [0]:
import spacy

doc = nlp(
    """The army marched from Konia to Kaiseria (Caesarea), and thence to Sivas, where the feast of the Korbân (sacrifice) was celebrated. Here Mustafâ Pâshâ, the emperor's favourite, was promoted to the rank of second vezir, and called into the divân. The army then continued its march to Erzerum. Besides tiie guns provided by the commander-in-chief, there were forty large guns dragged by two thousand pairs of buftaloes. The army entered the castle of Kazmaghan, and halted under the walls of Eriviin in the year 1044 (1634).  
"""
)
HTML(displacy.render(doc, style="ent"))

### Alternative approach: Wikipedia2Vec

In [0]:
if not Path('enwiki_20180420_500d.pkl.bz2').exists():
    !wget http://wikipedia2vec.s3.amazonaws.com/models/en/2018-04-20/enwiki_20180420_500d.pkl.bz2
    !bzip2 -d enwiki_20180420_500d.pkl.bz2

--2020-02-23 17:11:51--  http://wikipedia2vec.s3.amazonaws.com/models/en/2018-04-20/enwiki_20180420_500d.pkl.bz2
Resolving wikipedia2vec.s3.amazonaws.com (wikipedia2vec.s3.amazonaws.com)... 52.219.0.197
Connecting to wikipedia2vec.s3.amazonaws.com (wikipedia2vec.s3.amazonaws.com)|52.219.0.197|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17294111805 (16G) [application/x-bzip2]
Saving to: ‘enwiki_20180420_500d.pkl.bz2.1’


2020-02-23 17:16:21 (61.2 MB/s) - ‘enwiki_20180420_500d.pkl.bz2.1’ saved [17294111805/17294111805]


bzip2: Control-C or similar caught, quitting.
bzip2: Deleting output file enwiki_20180420_500d.pkl, if it exists.


In [0]:
ls -al

total 16888804
drwxr-xr-x 1 root root        4096 Feb 23 17:08 ./
drwxr-xr-x 1 root root        4096 Feb 23 16:36 ../
drwxr-xr-x 1 root root        4096 Feb 19 17:12 .config/
-rw-r--r-- 1 root root 17294111805 May 17  2018 enwiki_20180420_500d.pkl.bz2
drwxr-xr-x 1 root root        4096 Feb  5 18:37 sample_data/


In [0]:
try:
    from wikipedia2vec import Wikipedia2Vec
except ModuleNotFoundError:
    !pip install wikipedia2vec
    from wikipedia2vec import Wikipedia2Vec

wiki2vec = Wikipedia2Vec.load('enwiki_20180420_500d.pkl')
result = wiki2vec.most_similar(wiki2vec.get_word('yoda'), 5)
result

Collecting wikipedia2vec
  Using cached https://files.pythonhosted.org/packages/d8/88/751037c70ca86581d444824e66bb799ef9060339a1d5d1fc1804c422d7cc/wikipedia2vec-1.0.4.tar.gz
Collecting marisa-trie
  Using cached https://files.pythonhosted.org/packages/20/95/d23071d0992dabcb61c948fb118a90683193befc88c23e745b050a29e7db/marisa-trie-0.7.5.tar.gz
Collecting mwparserfromhell
  Using cached https://files.pythonhosted.org/packages/23/03/4fb04da533c7e237c0104151c028d8bff856293d34e51d208c529696fb79/mwparserfromhell-0.5.4.tar.gz
Building wheels for collected packages: wikipedia2vec, marisa-trie, mwparserfromhell
  Building wheel for wikipedia2vec (setup.py) ... done
  Created wheel for wikipedia2vec: filename=wikipedia2vec-1.0.4-cp36-cp36m-linux_x86_64.whl size=4581859 sha256=1d1e02205670bcd50042320c31979d1cf0281a716a88cca5a4e70770a76e601e
  Stored in directory: /root/.cache/pip/wheels/16/e7/02/852c8ce366cc10adcf5d43c6471bbf926dd15c277578c13184
  Building wheel for marisa-trie (setup.

FileNotFoundError: ignored

In [0]:
print(result)

NameError: ignored

In [0]:
# Load a text, identify entity spans, get wiki2vec entity 
for i in result: 
    if type(i[0]).__name__ == 'Entity':
        confidence = i[1]
        name = i[0].title 
        print(name, confidence) 


NameError: ignored

### Another approach to the problem: DBpedia Spotlight

In [0]:
try:
    import spotlight
except ModuleNotFoundError:
    !pip install pyspotlight
    import spotlight 

annotations = spotlight.annotate('https://api.dbpedia-spotlight.org/en/annotate', 'Baby Yoda is cute', confidence=0.4, support=20)
annotations 


[{'URI': 'http://dbpedia.org/resource/Infant',
  'offset': 0,
  'percentageOfSecondRank': 0.00147352792162747,
  'similarityScore': 0.9978001054278687,
  'support': 3349,
  'surfaceForm': 'Baby',
  'types': ''},
 {'URI': 'http://dbpedia.org/resource/Yoda',
  'offset': 5,
  'percentageOfSecondRank': 1.225279889190865e-05,
  'similarityScore': 0.9999754186031542,
  'support': 840,
  'surfaceForm': 'Yoda',
  'types': 'Http://xmlns.com/foaf/0.1/Person,Wikidata:Q95074,Wikidata:Q5,Wikidata:Q24229398,Wikidata:Q215627,DUL:NaturalPerson,DUL:Agent,Schema:Person,DBpedia:Person,DBpedia:FictionalCharacter,DBpedia:Agent'}]
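
Each annotation in the response carries an `offset` and a `surfaceForm`, which is enough to rebuild character spans in the same `(start, end, id)` style used for the training data earlier. The sketch below does this for a trimmed copy of the response shown above; `annotations_to_spans` is a hypothetical helper, not part of `pyspotlight`.

```python
def annotations_to_spans(annotations):
    """Turn Spotlight-style annotation dicts into (start, end, uri) tuples."""
    spans = []
    for item in annotations:
        start = int(item["offset"])
        end = start + len(item["surfaceForm"])
        spans.append((start, end, item["URI"]))
    return spans


text = "Baby Yoda is cute"
# Trimmed copy of the response above, keeping only the fields used here
annotations = [
    {"URI": "http://dbpedia.org/resource/Infant", "offset": 0, "surfaceForm": "Baby"},
    {"URI": "http://dbpedia.org/resource/Yoda", "offset": 5, "surfaceForm": "Yoda"},
]

for start, end, uri in annotations_to_spans(annotations):
    print(text[start:end], uri)
```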