## Named Entity Linking with spaCy and TEI

In [6]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    """The iSchool at Illinois leads the way in shaping the future of information. This success is the result of a commitment to research, education, and engagement and a long history of innovation. Our School consistently earns recognition as one of the best destinations—both on campus and online—for professional studies in the information sciences."""
)
displacy.render(doc, style="ent")

## Problem:
*This works very well for many 20th- and 21st-century texts. But what about early modern English?*

In [7]:
doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
displacy.render(doc, style="ent")

![](./out_of_domain.png)

### In this example, our goal is to teach an existing English-language model to identify early modern place names.

There are several approaches that we could take to this problem. Different approaches can yield better or worse results, and experimentation is an essential part of any machine learning project.

#### How can we teach a statistical language model that Sweveland is a place?

### Download the TEI files from Perseus
- We're going to extract a list of all the place names from the text to create a patterns JSONL file.
- We'll also extract the raw text to create a set of training documents. 
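
As a rough sketch of the first bullet, the snippet below turns a list of place names into the one-JSON-object-per-line pattern format that spaCy's `EntityRuler` reads (`{"label": ..., "pattern": ...}`). The helper names `build_patterns` and `to_jsonl`, the `GPE` label, the `min_count` threshold, and the sample names are assumptions made for illustration, not part of the original notebook.

```python
import json
from collections import Counter


def build_patterns(places, label="GPE", min_count=1):
    """Deduplicate place names and wrap each one in an EntityRuler-style pattern."""
    # Normalize internal whitespace and drop empty strings before counting.
    counts = Counter(" ".join(p.split()) for p in places if p and p.strip())
    return [
        {"label": label, "pattern": name}
        for name, count in counts.most_common()
        if count >= min_count
    ]


def to_jsonl(patterns):
    """Serialize patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in patterns)


# Hypothetical sample; in the notebook this would be the extracted places list.
sample_places = ["Norwey", "Sweveland", "Norwey", "Fynmarke", " "]
patterns = build_patterns(sample_places)
print(to_jsonl(patterns))
```

Writing the returned string to a file would give a patterns file that could be loaded into an entity ruler. Raising `min_count` is one way to drop names that occur only once, which may or may not be desirable for rare early modern spellings.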

We are going to download the table of contents and create a list of the 937 segments of the document. We will then fetch each segment, extract the place names (`<name type="place">Utrect</name>`), and add them to a places list.
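
The extraction step can be sketched on an inline snippet before running it against the live site. The version below is a minimal sketch that addresses the two kinds of messages printed by the download loop further down: `'NoneType' object has no attribute 'replace'`, which occurs when a `<name>` element has no leading text of its own (for example because its text sits inside a child element), and `xmlParseEntityRef: no name`, which typically indicates a bare `&` in the markup. It uses the standard library's `xml.etree.ElementTree` so that it runs without network access; lxml elements offer the same `itertext()` method. The helper names and the sample markup are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches an "&" that does not start an entity such as &amp; or &#38;.
BARE_AMPERSAND = re.compile(r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)")


def escape_bare_ampersands(xml_text):
    """Replace stray ampersands so the XML parser does not reject the chunk."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)


def extract_places(xml_text):
    """Return the text of every <name type="place"> element in a chunk."""
    root = ET.fromstring(escape_bare_ampersands(xml_text))
    places = []
    for name in root.iterfind(".//name[@type='place']"):
        # itertext() also collects text nested in child elements,
        # where .text alone would be None.
        text = " ".join("".join(name.itertext()).split())
        if text:
            places.append(text)
    return places


# Hypothetical chunk with a bare "&", nested markup, and a line break in a name.
sample = """<div>
  <p>the king of <name type="place">Denmarke</name>, Norway &
  <name type="place"><hi>Sweveland</hi></name>, and his
  <name type="place">Iles of
  Fynmarke</name></p>
</div>"""

places = extract_places(sample)
print(places)  # ['Denmarke', 'Sweveland', 'Iles of Fynmarke']
```

Unlike the notebook's `replace('\n', '')`, this sketch collapses each run of whitespace to a single space, so a name broken across lines keeps its word boundary. Named entities that the parser does not know would still fail here; with lxml, a parser created as `etree.XMLParser(recover=True)` is another option to consider for malformed chunks.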


In [10]:
import os
import pickle
from collections import Counter
from urllib.request import urlopen

from lxml import etree

# TEI namespace map; the unprefixed paths used below do not reference it.
spec = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_loader(url):
    """Download a document and parse it into an lxml element tree."""
    tei = urlopen(url).read()
    return etree.XML(tei)


table_of_contents_url = "http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D1"
table_of_contents_xml = tei_loader(table_of_contents_url)

# Download the segments only when no cached results exist from an earlier run.
if not os.path.exists('refs.pickle'):
    # Each <chunk ref="..."> in the table of contents identifies one segment,
    # e.g. 'Perseus%3Atext%3A1999.03.0070%3Anarrative%3D6'
    chunks = table_of_contents_xml.xpath("//chunk[@ref]")
    refs = [chunk.get('ref') for chunk in chunks]

    places = []

    for ref in refs:
        url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + ref
        try:
            tei = tei_loader(url)

            # Collect the text of every <name type='place'> tag.
            # place.text is None when the tag has no leading text; the resulting
            # AttributeError is caught below and skips the rest of that chunk.
            for place in tei.findall(".//name[@type='place']", namespaces=spec):
                places.append(place.text.replace('\n', ''))
        except Exception as e:
            print(e)

    # Cache the results so later runs can skip the download.
    with open('places.pickle', 'wb') as f:
        pickle.dump(places, f)
    with open('refs.pickle', 'wb') as f:
        pickle.dump(refs, f)

else:
    with open('places.pickle', 'rb') as f:
        places = pickle.load(f)
    with open('refs.pickle', 'rb') as f:
        refs = pickle.load(f)
    print('pickles loaded')

'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
xmlParseEntityRef: no name, line 103, column 75 (<string>, line 103)
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
'NoneType' object has no attribute 'replace'
xmlParseEntityRef: no name, line 199, column 94 (<string>, line 199)
xmlParseEntityRef: no name, line 186, column 94 (<string>, line 186)
xmlParseEntityRef: no name, line 803, column 109 (<string>, line 803)
xmlParseEntityRef: no name, line 455, column 89 