## Section III 

Prodigy 

Exercise I: Basic NER with pre-trained spaCy models [source](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)

In [43]:
import os 
import pickle
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
spec = {"tei":"http://www.tei-c.org/ns/1.0"}


In [38]:
doc = nlp('Hosted by Utrecht University, the 2019 iteration of the Digital Humanities (DH) conference, the annual international conference of the Alliance of Digital Humanities Organizations, will take place in the medieval city of Utrecht, one of the oldest cities in the Netherlands. The city’s rapid modernization and growth has inspired the conference’s guiding theme, complexity.')
#for ent in doc.ents:
#    print(ent.text, ent.label_)
displacy.render(doc, style="ent")


## Problem
This works very well for many 20th- and 21st-century texts. But what about 17th-century English?

In [39]:
doc = nlp('ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods')
#for ent in doc.ents:
#    print(ent.text, ent.label_)
displacy.render(doc, style="ent")


## TEI to spaCy patterns 

```json
PATTERNS.JSONL
{"label": "JOB_TITLE", "pattern": [{"lower": "engineering"}, {"lower": "manager"}]}
{"label": "JOB_TITLE", "pattern": [{"orth": "CEO"}]}
```

Exercise 1) Improve results for a specific task 

Training on new category from TEI training data 

Exercise 2) Add a new category from a list of examples 

#create corpus of texts for our project 

MY_DATA.JSONL
{"text": "Pinterest Hires Its First Head of Diversity"}
{"text": "Airbnb and Others Set Terms for Employees to Cash Out"}
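
A file in the MY_DATA.JSONL shape shown above (one JSON object per line, each with a `text` key) can be written with the standard `json` module. This is a minimal sketch; the helper name `to_jsonl` is hypothetical and not part of spaCy or Prodigy.

```python
import json

def to_jsonl(texts):
    """Return one JSON object per line, each holding a single "text" key."""
    return "".join(json.dumps({"text": t}, ensure_ascii=False) + "\n" for t in texts)

headlines = [
    "Pinterest Hires Its First Head of Diversity",
    "Airbnb and Others Set Terms for Employees to Cash Out",
]
jsonl = to_jsonl(headlines)
print(jsonl, end="")
```

Using `json.dumps` rather than string concatenation means quotation marks and backslashes inside a text are escaped automatically.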

## Here we're going to download a TEI file from Perseus 
We'll extract a list of all the place names from the text to create a patterns file.
We'll also extract the raw text to create a set of training documents. 

In [40]:
from urllib.request import urlopen
from lxml import etree

def tei_loader(url):
    tei = urlopen(url).read()
    return etree.XML(tei)
    


Here we download the table of contents and build a list of the document's 937 parts. We then fetch each part, extract the place names, and add them to a `places` list.

In [43]:
table_of_contents_url = "http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D1"
table_of_contents_xml = tei_loader(table_of_contents_url)

if not os.path.exists('refs.pickle'):
    chunks = table_of_contents_xml.xpath("//chunk[@ref]")
    refs = [chunk.get('ref') for chunk in chunks] 
    # an example ref 'Perseus%3Atext%3A1999.03.0070%3Anarrative%3D6'


    places = []

    for ref in refs:

        url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + ref
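        try:
            tei = tei_loader(url)
        except Exception as e:
            # skip parts that fail to download or parse, but say so
            print(f'skipping {ref}: {e}')
            continue

        # get all <name type='place'> tags
        for place in tei.findall(".//name[@type='place']", namespaces=spec):
            # place.text is None for empty tags, so guard before calling replace()
            if place.text:
                places.append(place.text.replace('\n',''))

    # close the pickle files explicitly once written
    with open('places.pickle', 'wb') as f:
        pickle.dump(places, f)
    with open('refs.pickle', 'wb') as f:
        pickle.dump(refs, f)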

else:
    print('pickles loaded')
    places = pickle.load(open('places.pickle', 'rb'))
    refs = pickle.load(open('refs.pickle', 'rb'))



pickles loaded
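
The extraction step can be tried offline on a small fragment. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `lxml`, and the TEI snippet in it is made up for illustration (it is not taken from the Perseus text). It also collapses line breaks inside a name into a single space rather than deleting them, which is one possible way to avoid fused tokens.

```python
import xml.etree.ElementTree as ET

# A made-up TEI fragment for illustration only.
SAMPLE_TEI = """<TEI.2><text><body>
<p>They sailed from <name type="place">Cape
Race</name> to <name type="place">Emden</name>, and an empty tag
<name type="place"/> is skipped. <name type="person">Henry</name></p>
</body></text></TEI.2>"""

def extract_places(xml_string):
    """Collect the text of every <name type="place"> element, whitespace collapsed."""
    root = ET.fromstring(xml_string)
    places = []
    for name in root.iterfind(".//name[@type='place']"):
        # itertext() also picks up text nested inside child elements
        text = " ".join("".join(name.itertext()).split())
        if text:  # ignore empty tags
            places.append(text)
    return places

print(extract_places(SAMPLE_TEI))
```

With `.replace('\n','')`, a name broken across two lines would come out as one fused word, which may explain entries such as `ileof` in the generated patterns.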


In [45]:
print('number of documents: ',len(refs))
print('number of places found: ',len(set(places)))
places[10]

number of documents:  937
number of places found:  2279


'Kingston'

### Here we create a patterns.jsonl file with seed terms.  
These terms give the model examples from which to learn the new category. Use as many terms as is practical, preferably 100-200. Further examples of patterns files can be found [here](https://github.com/explosion/prodigy-recipes/tree/master/example-patterns), and there is a very helpful tool for creating patterns that are relevant to your projects and texts [here](https://explosion.ai/demos/matcher).

```json
{"label": "GPE", "pattern": [{"lower": "república"}, {"lower": "de"}, {"lower": "angola"}]}
```

In [46]:
if not os.path.exists('patterns.jsonl'):
    with open('patterns.jsonl','w') as f:
        for place in set(places):
            pattern = '['
            for token in place.split()[:-1]:
                pattern += '{"lower": "' + token.lower() + '"},'
            pattern += '{"lower": "' + place.split()[-1].lower() + '"}'
            pattern += ']'
            row = '{"label": "PLACE", "pattern": ' + pattern + '}\n'
            f.write(row)
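
Building each row by string concatenation breaks if a place name contains a quotation mark or a backslash. As an alternative sketch, the same rows can be produced with `json.dumps`; the helper name `place_to_pattern` is hypothetical. Note that splitting on whitespace only approximates spaCy's tokenizer, so names containing punctuation may need patterns written by hand.

```python
import json

def place_to_pattern(place, label="PLACE"):
    """Turn a place name into one patterns.jsonl row with lower-cased tokens."""
    tokens = [{"lower": tok.lower()} for tok in place.split()]
    return json.dumps({"label": label, "pattern": tokens}, ensure_ascii=False)

for place in ["Island of Fowlay", "Emden", 'Bay of "Quotes"']:
    print(place_to_pattern(place))
```

Each returned string is a complete JSON object, so writing one per line yields a valid JSONL file.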

In [47]:
with open('patterns.jsonl','r') as f:
    print(f.read())

{"label": "PLACE", "pattern": [{"lower": "emden"}]}
{"label": "PLACE", "pattern": [{"lower": "bonna"}]}
{"label": "PLACE", "pattern": [{"lower": "island"},{"lower": "of"},{"lower": "fowlay"}]}
{"label": "PLACE", "pattern": [{"lower": "rome"}]}
{"label": "PLACE", "pattern": [{"lower": "sicill"}]}
{"label": "PLACE", "pattern": [{"lower": "solis"}]}
{"label": "PLACE", "pattern": [{"lower": "ile"},{"lower": "of"},{"lower": "pinos"}]}
{"label": "PLACE", "pattern": [{"lower": "ileof"},{"lower": "palme"}]}
{"label": "PLACE", "pattern": [{"lower": "amsterdam"}]}
{"label": "PLACE", "pattern": [{"lower": "cape"},{"lower": "race"}]}
{"label": "PLACE", "pattern": [{"lower": "ilands"},{"lower": "of"},{"lower": "tanaseri"}]}
{"label": "PLACE", "pattern": [{"lower": "perse"}]}
{"label": "PLACE", "pattern": [{"lower": "cali"}]}
{"label": "PLACE", "pattern": [{"lower": "malacca"}]}
{"label": "PLACE", "pattern": [{"lower": "exeter"}]}
{"label": "PLACE", "pattern": [{"lower": "cape"},{"lower": "of"},{"lo

In [49]:
#Now to extract the full text
percent_of_texts_to_process = 0.05
end_index = int(len(refs) * percent_of_texts_to_process)

if os.path.exists('principal_navigations.txt'):
    print('The file already exists. This process takes several minutes. To re-run, change process_anew to True. You can also adjust the percentage of the corpus to be gathered.')

process_anew = False
if process_anew:
    txts = []
    for ref in refs[:end_index]:

        url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + ref
        try:
            tei = tei_loader(url)

            new_txt = []
            for body in tei.iter('body'):
                new_txt.append(''.join(body.itertext()).strip().replace('\n',''))
                txts.append(''.join(new_txt))

        except Exception as e:
            #print(e)
            pass

    full_text = [txt.replace('        ',' ').replace('   ',' ').replace('  ',' ') for txt in txts]

    with open('principal_navigations.txt','w') as f:
        f.write(str(full_text))

The file already exists. This process takes several minutes. To re-run, change process_anew to True. You can also adjust the percentage of the corpus to be gathered.


In [50]:
# This is a sizable amount of text, so let's print just the first 500 characters for now.
with open('principal_navigations.txt','r') as f:
    
    print(f.read()[:500])
    

['A branch of a Statute made in the eight yeere of Henry the sixt, for the trade to Norwey, Sweveland, Den marke, and Fynmarke. ITEM because that the kings most deare Uncle, the kingof Denmarke, Norway & Sweveland, as the same oursoveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils,hurts and damage which have late happened aswell tohim and his, as to other foraines and strangers, and alsofriends and speciall subjects of our said soveraigne L
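
The sample above shows words run together ("kingof", "oursoveraigne"), which is consistent with line breaks being deleted rather than replaced by a space. It also begins with `['`, because the list was written with `str()`. A minimal sketch of one way to clean the text follows; the function names are hypothetical, and writing one document per line assumes the downstream tool reads plain text line by line (worth checking against the Prodigy documentation).

```python
def normalize_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into a single space."""
    return " ".join(text.split())

def to_lines(texts):
    """Return the cleaned documents, one per line, skipping empty ones."""
    cleaned = (normalize_whitespace(t) for t in texts)
    return "".join(t + "\n" for t in cleaned if t)

raw = "the king\nof Denmarke,   Norway &\n        Sweveland"
print(normalize_whitespace(raw))
```

The string returned by `to_lines(txts)` could then be written to `principal_navigations.txt` in place of `str(full_text)`.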


## With patterns and text files created, we can now work with Prodigy!

In [13]:
!prodigy dataset historic_places "A dataset for British historic places" --author Andy


  ✨  ERROR: 'historic_places' already exists in database
  SQLite.



## We need a training text. First, we'll train on a set from the original corpus. We'll then try it on a comparable document from the same period.

In [None]:
!prodigy ner.teach historic_places en_core_web_sm principal_navigations.txt --label PLACE --patterns patterns.jsonl

Using 1 labels: PLACE

  ✨  Starting the web server at http://spacy.apjan.co:8080 ...
  Open the app in your browser and start annotating!

Task queue depth is 1
Task queue depth is 1


In [20]:
!prodigy ner.batch-train historic_places en_core_web_sm --output new_model --label PLACE

Using 1 labels: PLACE

Loaded model new_model
Using 50% of accept/reject examples (15) for evaluation
Using 100% of remaining examples (15) for training
Dropout: 0.2  Batch size: 4  Iterations: 10  


BEFORE      0.417            
Correct     10
Incorrect   14
Entities    74               
Unknown     62               

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
01           187.649      7            17           36           0            0.292     
02           164.241      7            17           25           0            0.292     
03           101.649      8            16           22           0            0.333     
04           106.901      10           14           23           0            0.417     
05           157.445      11           13           28           0            0.458     
06           118.410      11           13           24           0            0.458     
07           94.879      

If you think the model would benefit from more training...

In [26]:
!prodigy ner.batch-train historic_places new_model --output new_model --label PLACE

Using 1 labels: PLACE

Loaded model new_model
Using 50% of accept/reject examples (15) for evaluation
Using 100% of remaining examples (15) for training
Dropout: 0.2  Batch size: 4  Iterations: 10  


BEFORE      0.500            
Correct     12
Incorrect   12
Entities    33               
Unknown     21               

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
01           34.487       11           13           28           0            0.458     
02           59.485       11           13           25           0            0.458     
03           16.853       11           13           21           0            0.458     
04           18.268       11           13           20           0            0.458     
05           59.328       10           14           19           0            0.417     
06           64.039       11           13           25           0            0.458     
07           68.571      

Would our model improve with more data?  More training? 

In [25]:
!prodigy ner.train-curve historic_places new_model --label PLACE --n-iter 10 --eval-split 0.2 --dropout 0.2  --n-samples 4 

Using 1 labels: PLACE

Starting with model new_model
Dropout: 0.2  Batch size: 32  Iterations: 10  Samples: 4

%            RIGHT        WRONG        ACCURACY  
25%          8            12           0.40         +0.40
50%          9            11           0.45         +0.05
75%          9            11           0.45         +0.00
100%         9            11           0.45         +0.00
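
The ACCURACY column in these tables appears to be RIGHT / (RIGHT + WRONG); for example, 10 right and 14 wrong gives 0.417. The sketch below recomputes the train-curve rows under that assumption and prints the change from one sample size to the next.

```python
def accuracy(right, wrong):
    """Share of evaluated examples that the model got right."""
    return right / (right + wrong)

# (share of training data, RIGHT, WRONG) copied from the train-curve output
curve = [("25%", 8, 12), ("50%", 9, 11), ("75%", 9, 11), ("100%", 9, 11)]

previous = 0.0
for share, right, wrong in curve:
    acc = accuracy(right, wrong)
    print(f"{share:>4}  accuracy={acc:.2f}  change={acc - previous:+.2f}")
    previous = acc
```

Accuracy stops rising after the 50% sample, but with only 20 evaluation examples a single answer moves the score by 0.05, so the curve is probably too noisy to settle whether more data would help.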


To load our new model and see its results:

In [51]:
import spacy
nlp = spacy.load('new_model')
doc = nlp('ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods')
displacy.render(doc, style="ent")

That's not too bad. What about a text that the model has never seen before? How about Evliya Çelebi's [Narrative of travels in Europe, Asia, and Africa](https://archive.org/details/narrativeoftrave01evli/page/n4)?

In [44]:
import spacy
nlp = spacy.load('new_model')
doc = nlp("""It is to this day celebrated throughout the world as an extraordinary inscription, 
and is visited by travellers from Rum (Greece), 'Arab (Arabia), and 'Ajem 
(Persia). Some of them, who, in the expectation of finding hidden treasures, 
began to work at these ancient buildings with pickaxes like Ftrhad's, perished 
in the attempt, and were also buried there. Some holy men make pilgrimages 
to this place barefoot on Friday nights, and recite the chapter entitled Tekasur 
(Koran, chap. 102) ; for many thousands of illustrious companions (of the Prophet) 
JMohc'ijinn, (who followed him in his flight), and Aiixur.s (auxiliaries) are buried 
in this place. It has been also attested by some thousands of the pious, that 
this burial ground has been seen some thousands of times covered with lights on 
the holy night of Alkadr (i. e. sixth of Ramazân). 
""")
displacy.render(doc, style="ent")

## Last step: let's use our model to automatically identify historical place names and produce a marked-up TEI document. 

In [2]:
text = """It is to this day celebrated throughout the world as an extraordinary inscription, 
and is visited by travellers from Rum (Greece), 'Arab (Arabia), and 'Ajem 
(Persia). Some of them, who, in the expectation of finding hidden treasures, 
began to work at these ancient buildings with pickaxes like Ftrhad's, perished 
in the att
empt, and were also buried there. Some holy men make pilgrimages 
to this place barefoot on Friday nights, and recite the chapter entitled Tekasur 
(Koran, chap. 102) ; for many thousands of illustrious companions (of the Prophet) 
JMohc'ijinn, (who followed him in his flight), and Aiixur.s (auxiliaries) are buried 
in this place. It has been also attested by some thousands of the pious, that 
this burial ground has been seen some thousands of times covered with lights on 
the holy night of Alkadr (i. e. sixth of Ramazân). 
"""
import spacy
nlp = spacy.load('new_model')
doc = nlp(text)
text_list = [i.text for i in doc]

for token in doc:
    if token.ent_type_ == 'PLACE': 
        text_list[token.i] = '<place>' + text_list[token.i] + '</place>'

text_list = ' '.join([token for token in text_list])
text_list
#TODO how to turn list of tokens back into formatted text?

"It is to this day celebrated throughout the world as an extraordinary inscription , \n and is visited by travellers from <place>Rum</place> ( Greece ) , ' <place>Arab</place> ( Arabia ) , and ' <place>Ajem</place> \n ( Persia ) . Some of them , who , in the expectation of finding hidden treasures , \n began to work at these ancient buildings with pickaxes like Ftrhad 's , perished \n in the att \n empt , and were also buried there . Some holy men make pilgrimages \n to this place barefoot on <place>Friday</place> nights , and recite the chapter entitled Tekasur \n ( Koran , chap . 102 ) ; for many thousands of illustrious companions ( of the Prophet ) \n JMohc'ijinn , ( who followed him in his flight ) , and Aiixur.s ( auxiliaries ) are buried \n in this place . It has been also attested by some thousands of the pious , that \n this burial ground has been seen some thousands of times covered with lights on \n the holy night of Alkadr ( i. e. sixth of Ramazân ) . \n"

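Joining tokens with spaces changes the original spacing, as the output above shows ("inscription ,", "( Greece )"). One way around this is to insert tags by character offset into the untouched text. The sketch below is pure Python with a hypothetical helper `tag_spans`; with spaCy, the offsets could come from `ent.start_char` and `ent.end_char` for each `ent` in `doc.ents`. It also escapes `&` and `<` so that the result can be embedded in XML.

```python
from xml.sax.saxutils import escape

def tag_spans(text, spans, tag="place"):
    """Wrap each (start, end) character span of `text` in <tag>...</tag>.

    Spans must not overlap. Text outside and inside the tags is XML-escaped.
    """
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(escape(text[cursor:start]))
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))
    return "".join(out)

sample = "travellers from Rum (Greece), 'Arab (Arabia) & beyond"
# Offsets are worked out by hand here for illustration.
start = sample.index("Rum")
spans = [(start, start + len("Rum"))]
print(tag_spans(sample, spans))
```

Another route, if you prefer to stay at token level, is to join `token.text_with_ws` instead of `token.text`, which preserves each token's trailing whitespace.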
In [None]:
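# save text as tei 
from lxml import etree

filename = 'my_tei.xml'

# text_list is the tagged string built in the previous cell.
# It is inserted as markup, so any stray & or < in the text must be escaped first.
tei_string = f"""
<TEI.2>
  <text lang="en">
    <body>
      <p>{text_list}</p>
    </body>
  </text>
</TEI.2>
"""
doc = etree.fromstring(tei_string)
print(etree.tostring(doc, encoding='unicode'))

# filename already ends in .xml, so use it as is
tree = etree.ElementTree(doc)
tree.write(filename, pretty_print=True, xml_declaration=True, encoding="utf-8")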

Stretch goal: how to do this with spaCy and no Prodigy? See [EntityRecognizer](https://spacy.io/api/entityrecognizer).