## Working with GPUs
spaCy can run on a GPU, and the examples below run about twice as fast with one. To enable it, open the Edit menu, choose the notebook settings, and select GPU under hardware acceleration.

In [0]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


In [0]:
!pip install spacy[cuda100]

Collecting cupy-cuda100>=5.0.0b4; extra == "cuda100"
  Downloading https://files.pythonhosted.org/packages/d1/70/1022cc25659bbef5932c590dcd44443a68dad723229fbc49e540c864ea6d/cupy_cuda100-8.0.0a1-cp36-cp36m-manylinux1_x86_64.whl (337.6MB)
Collecting thinc-gpu-ops<0.1.0,>=0.0.1; extra == "cuda100"
  Downloading https://files.pythonhosted.org/packages/a4/ad/11ab80a24bcedd7dd0cfabaedba2ceaeca11f1aaeeff432a3d2e63ca7d02/thinc_gpu_ops-0.0.4.tar.gz (483kB)
Building wheels for collected packages: thinc-gpu-ops
  Building wheel for thinc-gpu-ops (setup.py) ... done
  Created wheel for thinc-gpu-ops: filename=thinc_gpu_ops-0.0.4-cp36-cp36m-linux_x86_64.whl size=220577 sha256=393d8a3c7d59be2b12dd494301081e610d03c8a108c2143615da4fa20bc50740
  Stored in directory: /root/.cache/pip/wheels/eb/ba/a3/9af9f326ed0d75a4540378af64a05a0e42be39d9b8513f3aea
Success

In [0]:
import spacy 
spacy.require_gpu()

True
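
`spacy.require_gpu()` raises an error when no GPU is available, which stops the notebook on a CPU-only runtime. A minimal sketch of a softer check, assuming you would rather fall back to the CPU: `spacy.prefer_gpu()` returns `True` or `False` instead of raising. The helper name `activate_gpu` is hypothetical.

```python
def activate_gpu():
    """Return True if spaCy was switched to the GPU, False otherwise."""
    try:
        import spacy
    except ImportError:
        # spaCy is not installed in this environment
        return False
    # prefer_gpu() uses the GPU when it can and reports the outcome as a bool
    return bool(spacy.prefer_gpu())

print("Running on GPU:", activate_gpu())
```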

### In this first example, our goal is to teach an existing English-language model to identify early modern place names.

There are several approaches that we could take to this problem. Different approaches can yield better or worse results, and experimentation is an essential part of any machine learning project.

#### How can we teach a statistical language model that Sweveland is a place? Where can we get data on early modern places?

Richard Hakluyt's The Principal Navigations, Voyages, Traffiques, and Discoveries of the English Nation (1599)

![](http://www.sequiturbooks.com/image/cache/Product%20Images/2015-12/The-Principal-1512150003/5ae35178-800x800.jpeg)

--- 

### Download the TEI files from Perseus
- We're going to extract a list of all the place names from the text to create training data.
- To make working with the TEI/XML easier, we're using standoffconverter by David Lassner.
- The converter separates the plain text from the annotations.
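
To make the idea of standoff annotation concrete, here is a minimal sketch using only the standard library. It is not how standoffconverter is implemented. The function name `to_standoff`, the `tag` key, and the XML snippet are assumptions made for illustration; the `attrib`, `begin`, and `end` keys mirror the ones this notebook reads later.

```python
import xml.etree.ElementTree as ET

def to_standoff(root):
    """Flatten an element tree into plain text plus character-offset annotations."""
    parts = []        # text fragments in document order
    annotations = []  # one dict per element: tag, attrib, begin, end
    position = 0      # running character offset into the plain text

    def visit(element):
        nonlocal position
        begin = position
        if element.text:
            parts.append(element.text)
            position += len(element.text)
        for child in element:
            visit(child)
            # Text that follows a child belongs to the parent, not the child
            if child.tail:
                parts.append(child.tail)
                position += len(child.tail)
        annotations.append({
            "tag": element.tag,
            "attrib": dict(element.attrib),
            "begin": begin,
            "end": position,
        })

    visit(root)
    return "".join(parts), annotations

# Invented snippet for illustration, not taken from the Perseus files
snippet = '<p>the king of <name type="place">Denmarke</name>, Norway</p>'
plain, annotations = to_standoff(ET.fromstring(snippet))
place = next(a for a in annotations if a["attrib"].get("type") == "place")
print(plain[place["begin"]:place["end"]])
```

The plain text carries no markup, and each annotation points back into it by offset, which is the shape of data spaCy's NER training expects.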


In [0]:
import spacy 
from spacy import displacy
from IPython.display import HTML

nlp = spacy.load("en_core_web_sm")

doc = nlp(
    """Pittsburgh was named in 1758, by General John Forbes, in honor of British statesman William Pitt, 1st Earl of Chatham. As Forbes was a Scot, he probably pronounced the name /ˈpɪtsbərə/ PITS-bər-ə (similar to Edinburgh).[20][21] Pittsburgh was incorporated as a borough on April 22, 1794, with the following Act:[22] "Be it enacted by the Pennsylvania State Senate and Pennsylvania House of Representatives of the Commonwealth of Pennsylvania ... by the authority of the same, that the said town of Pittsburgh shall be ... erected into a borough, which shall be called the borough of Pittsburgh for ever."""
)
HTML(displacy.render(doc, style="ent"))

In [0]:
doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
HTML(displacy.render(doc, style="ent"))

In [0]:
!pip install standoffconverter

Collecting standoffconverter
  Downloading https://files.pythonhosted.org/packages/70/7f/770cec5bf31098722457d395aaed8dd59a001306eceb34b28c729498cf3a/standoffconverter-0.6.3-py3-none-any.whl
Installing collected packages: standoffconverter
Successfully installed standoffconverter-0.6.3


In [0]:
import os 
import pickle
from collections import Counter
from urllib.request import urlopen
from lxml import etree
from standoffconverter import Converter

spec = {"tei":"http://www.tei-c.org/ns/1.0"}

def tei_loader(url):
    """Download a TEI/XML document and parse it into an lxml element tree."""
    tei = urlopen(url).read()
    return etree.XML(tei)

table_of_contents_url = "http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D1"
table_of_contents_xml = tei_loader(table_of_contents_url)


# Each <chunk ref="..."> in the table of contents points to one section of the text
chunks = table_of_contents_xml.xpath("//chunk[@ref]")
refs = [chunk.get('ref') for chunk in chunks] 
# an example ref 'Perseus%3Atext%3A1999.03.0070%3Anarrative%3D6'


standoffs = []

for ref in refs:
    try:
        url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + ref

        tei = tei_loader(url)
        so = Converter.from_tree(tei)
        standoffs.append(so)
    except Exception as e:
        # Some chunks are not well-formed XML; report the error and skip them
        print(e)

xmlParseEntityRef: no name, line 103, column 75 (<string>, line 103)
xmlParseEntityRef: no name, line 199, column 94 (<string>, line 199)
xmlParseEntityRef: no name, line 186, column 94 (<string>, line 186)
xmlParseEntityRef: no name, line 803, column 109 (<string>, line 803)
xmlParseEntityRef: no name, line 455, column 89 (<string>, line 455)
xmlParseEntityRef: no name, line 441, column 89 (<string>, line 441)
Unescaped '<' not allowed in attributes values, line 22, column 25 (<string>, line 22)
xmlParseEntityRef: no name, line 49, column 152 (<string>, line 49)
xmlParseEntityRef: no name, line 6, column 152 (<string>, line 6)
xmlParseEntityRef: no name, line 4, column 111 (<string>, line 4)
xmlParseEntityRef: no name, line 34, column 106 (<string>, line 34)
xmlParseEntityRef: no name, line 3, column 149 (<string>, line 3)
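
The `xmlParseEntityRef: no name` messages typically mean the parser met a bare `&` in the text, which is plausible here given how often the source uses `&` for "and". A hedged sketch of one possible repair, assuming that is the cause: escape stray ampersands before parsing. The helper name `escape_bare_ampersands` and the sample string are hypothetical, and this does nothing for the `Unescaped '<'` error.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does not start an entity such as &amp; or &#38; or &#x26;
BARE_AMPERSAND = re.compile(r"&(?![A-Za-z_][\w.-]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)")

def escape_bare_ampersands(xml_text):
    """Replace stray '&' characters with '&amp;' so the parser accepts them."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)

# Invented example: one bare ampersand and one that is already escaped
broken = "<p>Denmarke, Norway & Sweveland &amp; Fynmarke</p>"
repaired = escape_bare_ampersands(broken)
print(ET.fromstring(repaired).text)
```

To use it with `tei_loader`, the downloaded bytes would need to be decoded to a string first and re-encoded before parsing.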


In [0]:
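import json 
label_ = "GPE" # Here we can either create a new label or fine-tune the existing GPE (place) label
window = 300 # characters of context kept on each side of the place name
places = []

for standoff in standoffs:
    for annotation in json.loads(standoff.to_json()):
        try:
            if annotation['attrib']['type'] == 'place':
                begin = annotation['begin']
                end = annotation['end']
                length = end-begin
                
                #modern_name = annotation['attrib']['reg']
                # Clamp at 0: a negative start would slice from the end of the text
                start = max(0, begin-window)
                sent = standoff.plain[start:end+window]
                assert len(sent) > 0
                # Offsets of the place name inside the context window
                begin = begin-start
                end = begin+length
                # Drop a trailing newline from the entity span
                if '\n' in sent[begin:end]:
                    end -= 1
                place = (sent, {'entities':[(begin,end,label_)]})
                places.append(place)
                
        except Exception as e:
            # Annotations without a 'type' attribute are skipped
            pass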
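
A standalone sketch of the windowing arithmetic in this cell, handy for testing it on short strings. The function name `context_window` and the sample sentence are hypothetical. It clamps the window at the start of the text, because a negative slice start would count from the end of the string.

```python
def context_window(plain, begin, end, window=300):
    """Return (snippet, new_begin, new_end) for the span plain[begin:end]."""
    # Clamp at 0: a negative start index would count from the end of the string
    start = max(0, begin - window)
    snippet = plain[start:end + window]
    # Offsets of the original span relative to the snippet
    new_begin = begin - start
    new_end = new_begin + (end - begin)
    return snippet, new_begin, new_end

text = "his Iles of Fynmarke, and elswhere"
begin = text.index("Fynmarke")
snippet, b, e = context_window(text, begin, begin + len("Fynmarke"), window=5)
print(repr(snippet), snippet[b:e])
```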

In [0]:
import random 

i = random.choice(range(len(places)))
start, end, label = places[i][1]['entities'][0]
text = places[i][0][start:end]
print(text)
print(places[i])

Russe
('e away: and if it should so happen, he were\nin great danger of loosing his head: for which cause\nhe requested to have some one for a pledge: wherefore\nM. Garrard one of the factors offered himselfe to go, who,\nbecause he could not speake the Russe\n tongue, tooke with\nhim Christopher Burrough, and a Russe\n interpretour:\nthat night they road from the seaside, to a village about\nten miles off, where at supper time the captaine had much\ntalke with M. Garrard of our countrey, demanding where\nabout it did lie, what countreys were neare unto it, and\nwith whom we had traffike, for by the Russe\n name of our\n', {'entities': [(300, 305, 'GPE')]})
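
Printing one random example is a useful spot check, but spans with stray whitespace or out-of-range offsets tend to cause trouble during NER training. A minimal sketch of a check over the whole list, assuming the `(text, {'entities': [...]})` format used above. The function name `find_bad_spans` and the three examples are hypothetical.

```python
def find_bad_spans(examples):
    """Return (index, reason) pairs for examples whose entity offsets look wrong."""
    problems = []
    for index, (text, annotations) in enumerate(examples):
        for begin, end, _label in annotations["entities"]:
            span = text[begin:end]
            if not 0 <= begin < end <= len(text):
                problems.append((index, "offsets out of range"))
            elif span != span.strip():
                problems.append((index, "leading or trailing whitespace"))
    return problems

examples = [
    ("could not speake the Russe\n tongue", {"entities": [(21, 26, "GPE")]}),
    ("could not speake the Russe\n tongue", {"entities": [(21, 27, "GPE")]}),
    ("a Russe", {"entities": [(2, 40, "GPE")]}),
]
print(find_bad_spans(examples))
```

Running `find_bad_spans(places)` before training would show how many examples need attention.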


In [0]:
### Source https://github.com/koaning/spacy-youtube-material/blob/master/04-statistical-model.ipynb
import spacy 
import random 
from tqdm.autonotebook import tqdm

def create_blank_nlp(train_data):
    """Create a blank English pipeline whose NER component knows every label in train_data."""
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")  # spaCy v2 API
    nlp.add_pipe(ner, last=True)
    for _, annotations in tqdm(train_data):
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    return nlp

sample_percent = 1  # fraction of the examples to train on (1 = all of them)
sample = int(sample_percent * len(places))
TRAIN_DATA = random.sample(places, sample)
nlp = create_blank_nlp(TRAIN_DATA)





In [0]:
import random 
import datetime as dt
from spacy.util import minibatch, compounding

optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)  # present the examples in a new order each epoch
    losses = {}
    # minibatch returns a generator, so run it once to count the batches for the progress bar
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    count = sum(1 for _ in batches)
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in tqdm(batches, total=count):
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  # batch of texts
            annotations,  # batch of annotations
            sgd=optimizer,  # optimizer returned by begin_training
            drop=0.1,  # dropout - make it harder to memorise data
            losses=losses,
        )
    print(f"Losses at iteration {i} - {dt.datetime.now()} {losses}")

Losses at iteration 0 - 2020-03-06 16:39:54.052450 {'ner': 64015.273220987525}
Losses at iteration 1 - 2020-03-06 16:42:47.941316 {'ner': 18405.679875610644}
Losses at iteration 2 - 2020-03-06 16:45:37.906529 {'ner': 17229.986627071597}
Losses at iteration 3 - 2020-03-06 16:48:28.884443 {'ner': 16419.887193931383}
Losses at iteration 4 - 2020-03-06 16:51:16.285467 {'ner': 15695.742968863548}
Losses at iteration 5 - 2020-03-06 16:54:00.982245 {'ner': 14935.751244885}
Losses at iteration 6 - 2020-03-06 16:56:47.009638 {'ner': 14377.267166666674}
Losses at iteration 7 - 2020-03-06 16:59:34.360910 {'ner': 13730.002785727373}
Losses at iteration 8 - 2020-03-06 17:02:22.394558 {'ner': 13369.529635285333}
Losses at iteration 9 - 2020-03-06 17:05:14.120340 {'ner': 12735.254624424986}
Losses at iteration 10 - 2020-03-06 17:08:04.566856 {'ner': 12094.455210836299}
Losses at iteration 11 - 2020-03-06 17:10:54.146822 {'ner': 11572.156696648392}
Losses at iteration 12 - 2020-03-06 17:13:48.679927 {'ner': 11027.50051713701}
Losses at iteration 13 - 2020-03-06 17:16:42.070295 {'ner': 10584.228161432184}
Losses at iteration 14 - 2020-03-06 17:19:36.238994 {'ner': 10073.448767667636}
Losses at iteration 15 - 2020-03-06 17:22:30.451292 {'ner': 9788.210693213432}
Losses at iteration 16 - 2020-03-06 17:25:23.247898 {'ner': 9478.433262232602}
Losses at iteration 17 - 2020-03-06 17:28:15.807117 {'ner': 8950.475582000445}
Losses at iteration 18 - 2020-03-06 17:31:04.738801 {'ner': 8677.351878590965}
Losses at iteration 19 - 2020-03-06 17:33:56.455402 {'ner': 8311.775366346721}
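
The training loop sizes its batches with `compounding(4.0, 32.0, 1.001)`: the batch size starts at 4 and grows by a constant factor until it reaches 32. The sketch below re-creates that schedule with the standard library so the effect is easy to see. It is an illustration, not spaCy's own code. The names `compounding_sizes` and `batches_of` are hypothetical, and the growth factor is raised to 1.5 so the change shows up within a few batches.

```python
from itertools import islice

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

def batches_of(items, sizes):
    """Split items into consecutive batches whose lengths follow sizes."""
    iterator = iter(items)
    while True:
        # Fractional sizes are truncated to whole numbers of items
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch

# A faster growth factor than the notebook's 1.001 so the effect is visible
lengths = [len(b) for b in batches_of(range(100), compounding_sizes(4.0, 32.0, 1.5))]
print(lengths)
```

Small early batches give the model frequent updates while it knows little, and larger later batches make each pass over the data faster.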


In [0]:
doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
HTML(displacy.render(doc, style="ent"))