# ✨Prodigy 

[Prodigy Documentation](https://spacy.apjan.co/docs/)

Exercise I: Basic NER with pre-trained spaCy models [source](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)

In [86]:
import os 
import pickle
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

nlp = en_core_web_sm.load()
spec = {"tei":"http://www.tei-c.org/ns/1.0"}


In [29]:
doc = nlp(
    "Hosted by Utrecht University, the 2019 iteration of the Digital Humanities (DH) conference, the annual international conference of the Alliance of Digital Humanities Organizations, will take place in the medieval city of Utrecht, one of the oldest cities in the Netherlands. The city’s rapid modernization and growth has inspired the conference’s guiding theme, complexity."
)
displacy.render(doc, style="ent")


## Problem
This works very well for many 20th and 21st century texts.  But what about 17th century English?

In [30]:
doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
displacy.render(doc, style="ent")


## Introduction to ✨Prodigy

Prodigy is an annotation tool for active machine teaching.  It makes it possible to experiment with and improve models quickly, using relatively little data and time, and typically only one annotator.  Prodigy sorts a model's results by certainty and asks for human input where the results are most ambiguous. It stores these annotations and uses them to update the model.  Machine teaching can be used to: 

- improve an existing model for a specific context or set of documents
- add a new entity or category to a text model 
- build custom image categorization and object recognition models
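
To make "ask for human input where the results are most ambiguous" concrete, here is a minimal sketch of uncertainty-based ordering. It is not Prodigy's actual implementation; the function name `most_uncertain_first` and the scores are made up for illustration. The idea is simply that a score near 0.5 means the model is least sure, so those candidates go to the annotator first.

```python
def most_uncertain_first(scored):
    """Order (text, score) pairs so that scores closest to 0.5 come first."""
    return sorted(scored, key=lambda pair: abs(pair[1] - 0.5))

# Hypothetical model scores for "is this a PLACE?" (illustration only).
candidates = [("Norway", 0.95), ("Sweveland", 0.52), ("Uncle", 0.10), ("Fynmarke", 0.40)]

for text, score in most_uncertain_first(candidates):
    print(f"{text}: {score}")
```

Confident predictions such as "Norway" end up last, so annotation effort goes to the cases that are likely to teach the model the most.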

Prodigy requires a license. Researchers at degree-granting academic institutions can request a free research license [here](https://prodi.gy/forms/research-license).

In this example, our goal is to teach an existing English-language model to identify 17th century place names.

There are several approaches that we could take to this problem.  Different approaches can yield better or worse results, and experimentation is an essential part of any machine learning project. 

## How can we teach a statistical language model that Sweveland is a place?



**Manual annotation**

Some researchers may prefer to add seed annotations manually.  Using Prodigy, you can add seed terms by hand.   
`prodigy ner.manual historic_places en_core_web_sm principal_navigations.txt --label PLACE`
For certain personalities, myself included, this is actually kind of fun and a good way to think about the text and the goals of your experiment.   

For the current problem, this is not my suggested method.  Place names are very distinct and hard to identify from context.  If the model knew that York is a place, how could it learn that New York is also a place? By contrast, titles typically occur before proper names.  It could learn that both Professor and Doctor are titles without knowing that those specific terms are used as titles.  In our current case, it's best to give the model as many examples of place names as possible.  Note that with spaCy, we are working with place names as [patterns](https://spacy.apjan.co/docs/#match-patterns) so the model can learn that place names often have several parts.  
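
To illustrate why multi-part patterns matter, here is a simplified stand-in for token-pattern matching. It is not spaCy's Matcher API; `find_matches` is a hypothetical helper that only understands the `lower` key, and it splits on whitespace instead of using a real tokenizer.

```python
def find_matches(tokens, pattern):
    """Return (start, end) token index pairs where every pattern entry matches in order."""
    lowered = [t.lower() for t in tokens]
    wanted = [entry["lower"] for entry in pattern]
    n = len(wanted)
    return [(i, i + n) for i in range(len(lowered) - n + 1) if lowered[i:i + n] == wanted]

tokens = "He sailed from York to New York".split()
york = [{"lower": "york"}]
new_york = [{"lower": "new"}, {"lower": "york"}]

print(find_matches(tokens, york))      # both occurrences of "York"
print(find_matches(tokens, new_york))  # only the two-token name
```

A one-token pattern for "York" fires inside "New York" as well, while the two-token pattern captures the whole name. That is the reason each place name is stored as a list of tokens.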

**Seed term patterns** 

We're going to give the model as many examples of historic place names as we can to get the learning process started. To do this, I have chosen to mine a text that has already been marked up in TEI with 17th century place names and is available from the Perseus Project.  We're going to use Richard Hakluyt's [*The Principal Navigations, Voyages, Traffiques, and Discoveries of the English Nation* (1599)](http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D1).  I have chosen an English-language example because we'll be annotating the text in this workshop and English is a working language for DH2019.  However, you can work with texts in [any language](https://spacy.io/usage/adding-languages), even those that do not have an existing spaCy language model ([colonial Zapotec](https://ticha.haverford.edu/) for example!).    



**New category or existing?** 

We will need to choose whether we want to add a new entity to the model (let's call it "PLACE") or improve the existing "GPE" entity for our historic places.  Either one works.  If we improve an existing entity, we retain the existing training and examples. A new entity makes it easier to distinguish between what we have taught the model and previous training.  This can also address potential bias in previous training data.  For that reason, we'll add "PLACE" as a new category.     

### Download the TEI files from Perseus 
- We're going to extract a list of all the place names from the text to create a patterns JSONL file.
- We'll also extract the raw text to create a set of training documents. 

We are going to download the table of contents and create a list of the 937 segments of the document. We will then get each page, extract the place names (`<name type="place">Utrecht</name>`) and add them to a places list.
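
Before running this against the live site, the extraction step can be tried offline on a small fragment. The sketch below uses the standard library's `xml.etree.ElementTree` rather than `lxml`, and the TEI snippet is hand-written for illustration (it is not Perseus data). `extract_places` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written TEI fragment (an assumption for illustration only).
sample = """<TEI.2><text><body><p>From <name type="place">York</name> to
<name type="place">Utrecht</name>, with <name type="person">Hakluyt</name>.</p></body></text></TEI.2>"""

def extract_places(xml_string):
    """Collect the text of every <name type="place"> element, skipping empty ones."""
    root = ET.fromstring(xml_string)
    places = []
    for name in root.iter("name"):
        if name.get("type") == "place" and name.text:
            # Collapse internal line breaks and repeated spaces to single spaces.
            places.append(" ".join(name.text.split()))
    return places

print(extract_places(sample))
```

Only the two `type="place"` names are returned; the person name is ignored.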

In [88]:
from urllib.request import urlopen
from lxml import etree

def tei_loader(url):
    tei = urlopen(url).read()
    return etree.XML(tei)

table_of_contents_url = "http://www.perseus.tufts.edu/hopper/xmltoc?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D1"
table_of_contents_xml = tei_loader(table_of_contents_url)

if not os.path.exists('refs.pickle'):
    chunks = table_of_contents_xml.xpath("//chunk[@ref]")
    refs = [chunk.get('ref') for chunk in chunks] 
    # an example ref 'Perseus%3Atext%3A1999.03.0070%3Anarrative%3D6'


    places = []

    for ref in refs:

        url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + ref
        try:
            tei = tei_loader(url)

            #get all <name type='place'> tags
            for place in tei.findall(".//name[@type='place']", namespaces=spec):
                if place.text:  # skip <name> elements with no direct text
                    places.append(place.text.replace('\n',''))
        except Exception as e:
            print(e)
            
    pickle.dump(places, open('places.pickle', 'wb'))
    pickle.dump(refs, open('refs.pickle', 'wb'))

else:
    places = pickle.load(open('places.pickle', 'rb'))
    refs = pickle.load(open('refs.pickle', 'rb'))
    print('pickles loaded')


pickles loaded


In [35]:
print('number of documents: ', len(refs))
print('number of places found: ', len(set(places)))
places[9]

number of documents:  937
number of places found:  2279


'York'

### Create a patterns.jsonl file with seed terms.  
These terms provide examples that the model can use to learn the new category. It is good to use as many terms as is practical, preferably at least 100-200. Further examples of patterns files can be found [here](https://github.com/explosion/prodigy-recipes/tree/master/example-patterns), and there is a very helpful tool for creating patterns that are relevant to your projects and texts [here](https://explosion.ai/demos/matcher).


In [36]:
import json
new_label = 'PLACE'

if not os.path.exists('patterns.jsonl'):
    with open('patterns.jsonl','w') as f:
        for place in set(places):   # A set is used here to remove duplicate place names
            
            row = {}
                
            row['label'] = new_label
            row['pattern'] = []
            for token in place.split():
                pattern = {}
                pattern['lower'] = token.lower()
                row['pattern'].append(pattern)
        
            f.write(json.dumps(row) + '\n')
            
        # Polite intervention:
        # Sweveland is not in our list of historic and beautiful places, but it is essential to the narrative of this notebook.
        # To correct this omission, the line below has been added, with apologies to our colleagues from Sweveland.
        row = {"label": "PLACE", "pattern": [{"lower": "sweveland"}]}
        f.write(json.dumps(row) + '\n')

In [37]:
with open('patterns.jsonl','r') as f:
    print(f.read()[:185])

{"label": "PLACE", "pattern": [{"lower": "wardhouse"}]}
{"label": "PLACE", "pattern": [{"lower": "sea"}, {"lower": "southwest"}]}
{"label": "PLACE", "pattern": [{"lower": "silauria"}]}
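
A quick sanity check on a patterns file can catch malformed lines before Prodigy sees them. The sketch below works on a list of strings so it runs without any files; `pattern_line` and `check_patterns` are hypothetical helper names, and the check itself (a `label` plus a non-empty `pattern` list) is an assumption about what a usable row looks like, based on the output above.

```python
import json

def pattern_line(place, label="PLACE"):
    """Build one JSONL line with a lowercase token pattern for a place name."""
    pattern = [{"lower": token.lower()} for token in place.split()]
    return json.dumps({"label": label, "pattern": pattern})

def check_patterns(lines):
    """Parse each non-empty line and confirm it has a label and a non-empty pattern."""
    parsed = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        if "label" not in row or not row.get("pattern"):
            raise ValueError(f"line {number} is not a usable pattern: {line!r}")
        parsed.append(row)
    return parsed

lines = [pattern_line("Sea Southwest"), pattern_line("Wardhouse")]
rows = check_patterns(lines)
print(rows[0])
```

To check the real file, the same function could be called with `open('patterns.jsonl')` in place of `lines`.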



In [38]:
#Now to extract the full text
percent_of_texts_to_process = 0.05 
'''We are only using a tiny portion of the text for annotation.  
Unlike many machine learning tasks, in this case, more text data for annotations does not necessarily
improve learning. More text will dramatically increase Prodigy's memory usage, so keep that in mind.'''

end_index = int(len(refs) * percent_of_texts_to_process)

if os.path.exists("principal_navigations.txt"):
    print(
        "The file already exists. This process takes several minutes. To re-run, change process_anew to True. You can also adjust the percentage of the corpus to be gathered."
    )
    
process_anew = False
if process_anew:
    txts = []
    for ref in refs[:end_index]:

        url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + ref
        try:
            tei = tei_loader(url)

            new_txt = []
            for body in tei.iter('body'):
                new_txt.append(''.join(body.itertext()).strip().replace('\n',''))
                txts.append(''.join(new_txt))

        except Exception as e:
            print(e)
            

    full_text = [
        txt.replace("        ", " ").replace("   ", " ").replace("  ", " ")
        for txt in txts
    ]
    
    with open('principal_navigations.txt','w') as f:
        f.write(str(full_text))

The file already exists. This process takes several minutes. To re-run, change process_anew to True. You can also adjust the percentage of the corpus to be gathered.


In [39]:
with open('principal_navigations.txt','r') as f: 
    print(f.read()[:600])
    

['A branch of a Statute made in the eight yeere of Henry the sixt, for the trade to Norwey, Sweveland, Den marke, and Fynmarke. ITEM because that the kings most deare Uncle, the kingof Denmarke, Norway & Sweveland, as the same oursoveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils,hurts and damage which have late happened aswell tohim and his, as to other foraines and strangers, and alsofriends and speciall subjects of our said soveraigne Lordthe king of his Realme of England, by ye going in,entring & passage of such forain & strange pers
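
Note the fused words in the output above ("kingof", "oursoveraigne", "tohim"). A likely cause is that line breaks are deleted with `replace('\n','')` instead of being replaced by a space. A minimal sketch of an alternative, using a hypothetical helper `normalize_whitespace` and a made-up sample string:

```python
import re

def normalize_whitespace(raw):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", raw).strip()

raw = "the king\nof Denmarke,   Norway &\n        Sweveland"

print(raw.replace("\n", ""))       # deleting newlines fuses "king" and "of"
print(normalize_whitespace(raw))   # words stay separated
```

This also replaces the chain of `.replace("   ", " ")` calls with a single regular expression.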


## With patterns and text files created, we can now work with ✨Prodigy!

The `dataset` command will create a database table to save your annotations. The default is SQLite, but you can connect to [MySQL or PostgreSQL](https://spacy.apjan.co/docs/#database-setup).

In [7]:
!prodigy dataset historic_places "A dataset for British historic places" --author Andy


  ✨  Successfully added 'historic_places' to database SQLite.



If you'd like to delete a dataset:   

`prodigy drop historic_places`



## A. Plaintext to TEI


To start the annotation application for a named entity recognition task, we use the `ner.teach` recipe. Similar [built-in recipes](https://spacy.apjan.co/docs/#built-in-recipes) are available for: 
- text categorization (textcat.teach) 
- part of speech tagging (pos.teach)  
- vectors & terminology (terms.teach)
- computer vision (image.manual)

The first command below uses the `ner.manual` recipe to annotate place names by hand. The second uses the `ner.teach` recipe to collect annotations in the `historic_places` dataset, using the `en_core_web_sm` model to add the new entity `PLACE` after loading the patterns in `patterns.jsonl`.  

**Please be patient, this next step can take a few minutes to load.**  
Click on the stop button to interrupt the kernel when you're done. 

In [9]:
!prodigy ner.manual historic_places en_core_web_sm principal_navigations.txt --label PLACE

Using 1 labels: PLACE

  ✨  Starting the web server at http://spacy.apjan.co:8080 ...
  Open the app in your browser and start annotating!

^C


In [18]:
!prodigy ner.teach historic_places en_core_web_sm principal_navigations.txt --label PLACE --patterns patterns.jsonl

Using 1 labels: PLACE

  ✨  Starting the web server at http://spacy.apjan.co:8080 ...
  Open the app in your browser and start annotating!

^C


Once you have completed adding annotations, the next step is to train the model.  

In the command below, we use `ner.batch-train` to use the annotations in the `historic_places` dataset to train the `en_core_web_sm` model on the new entity `PLACE`.  We then save the updated model as `new_model`.

In [11]:
!prodigy ner.batch-train historic_places en_core_web_sm --label PLACE --output new_model 

Using 1 labels: PLACE

Loaded model en_core_web_sm
Using 50% of accept/reject examples (14) for evaluation
Using 100% of remaining examples (14) for training
Dropout: 0.2  Batch size: 4  Iterations: 10  


BEFORE      0.000            
Correct     0
Incorrect   19
Entities    58               
Unknown     0                

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
01           739.134      0            19           14           0            0.000     
02           698.114      2            17           130          0            0.105     
03           597.772      2            17           96           0            0.105     
04           614.653      1            18           62           0            0.053     
05           543.914      2            17           45           0            0.105     
06           619.696      1            18           29           0            0.053     
07           585.380

If you think the model would benefit from more training you can run this process again to load and update `new_model`.  You can also add the `--n-iter` argument to specify the number of iterations.

In [12]:
!prodigy ner.batch-train historic_places new_model --output new_model --label PLACE --n-iter 120

Using 1 labels: PLACE

Loaded model new_model
Using 50% of accept/reject examples (14) for evaluation
Using 100% of remaining examples (14) for training
Dropout: 0.2  Batch size: 4  Iterations: 120  


BEFORE      0.632           
Correct     12
Incorrect   7
Entities    20              
Unknown     7               

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
01           13.781       12           7            25           0            0.632     
02           13.590       12           7            29           0            0.632     
03           43.161       12           7            28           0            0.632     
04           7.324        12           7            26           0            0.632     
05           33.595       12           7            22           0            0.632     
06           15.679       12           7            22           0            0.632     
07           37.153       12 

Would our model improve with more data?

In [13]:
!prodigy ner.train-curve historic_places new_model --label PLACE --n-iter 10 --eval-split 0.2 --dropout 0.2  --n-samples 4 

Using 1 labels: PLACE

Starting with model new_model
Dropout: 0.2  Batch size: 32  Iterations: 10  Samples: 4

%            RIGHT        WRONG        ACCURACY  
25%          4            3            0.57         +0.57
50%          4            3            0.57         +0.00
75%          4            3            0.57         +0.00
100%         4            3            0.57         +0.00
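
The ACCURACY column in the tables above appears to be RIGHT divided by RIGHT plus WRONG. This is inferred from the printed numbers, not taken from Prodigy's documentation. A short check:

```python
def accuracy(right, wrong):
    """Fraction of evaluated examples the model got right."""
    total = right + wrong
    return right / total if total else 0.0

print(round(accuracy(2, 17), 3))   # first batch-train run, iteration 02
print(round(accuracy(12, 7), 3))   # second batch-train run
print(round(accuracy(4, 3), 2))    # train-curve rows
print(round(1 / 7, 3))             # how much one example moves accuracy with 7 evaluation examples
```

With only 7 evaluation examples, a single example shifts accuracy by roughly 14 percentage points, so a flat curve here says little about whether more annotations would help.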


To load the new model and see its results:

In [25]:
import spacy

nlp = spacy.load("new_model")
doc = nlp(
    "ITEM because that the kings most deare Uncle, the king of Denmarke, Norway & Sweveland, as the same our soveraigne Lord the king of his intimation hath understood, considering the manifold & great losses, perils, hurts and damage which have late happened aswell to him and his, as to other foraines and strangers, and also friends and speciall subjects of our said soveraigne Lord the king of his Realme of England, by ye going in, entring & passage of such forain & strange persons into his realme of Norwey & other dominions, streits, territories, jurisdictions & places subdued and subject to him, specially into his Iles of Fynmarke, and elswhere, aswell in their persons as their things and goods"
)
displacy.render(doc, style="ent")

That's not too bad.  What about a text that the model has never seen before? Let's try Evliya Çelebi's [Narrative of travels in Europe, Asia, and Africa](https://archive.org/details/narrativeoftrave01evli/page/n4).

In [26]:
import spacy

nlp = spacy.load("new_model")
doc = nlp(
    """The army marched from Konia to Kaiseria (Caesarea), and thence to Sivas, where the feast of the Korbân (sacrifice) was celebrated. Here Mustafâ Pâshâ, the emperor's favourite, was promoted to the rank of second vezir, and called into the divân. The army then continued its march to Erzerum. Besides tiie guns provided by the commander-in-chief, there were forty large guns dragged by two thousand pairs of buftaloes. The army entered the castle of Kazmaghan, and halted under the walls of Eriviin in the year 1044 (1634).  
"""
)
counter = 0
for ent in doc.ents:
    if ent.text in places:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
        counter += 1

print(f"{counter} of the place ids were in the training data")
displacy.render(doc, style="ent")

0 of the place ids were in the training data


Mustafâ Pâshâ is obviously a person, but the model has done a passable job. Let's use our model to automatically identify historical place names and produce a marked-up TEI document. 

In [42]:
text = """The army marched from Konia to Kaiseria (Caesarea), and thence to Sivas, where the feast of the Korbân (sacrifice) was celebrated. Here Mustafâ Pâshâ, the emperor's favourite, was promoted to the rank of second vezir, and called into the divân. The army then continued its march to Erzerum. Besides tiie guns provided by the commander-in-chief, there were forty large guns dragged by two thousand pairs of buftaloes. The army entered the castle of Kazmaghan, and halted under the walls of Eriviin in the year 1044 (1634)."""  

import spacy
nlp = spacy.load('new_model')
doc = nlp(text)
text_list = [i.text for i in doc]

for token in doc:
    if token.ent_type_ == 'PLACE': 
        text_list[token.i] = '<name type="place">' + text_list[token.i] + '</name>'

punct = ['.',"'",',',')',':',';']

text=''
for i, token in enumerate(text_list):
    try:
        if text_list[i+1] in punct:
            text += token

        else: 
            text += token + ' '
        
    except IndexError:
        pass
   
    
text.replace('\n','').replace('( ','(')

'The army marched from <name type="place">Konia</name> to <name type="place">Kaiseria</name> (<name type="place">Caesarea</name>), and thence to <name type="place">Sivas</name>, where the feast of the Korbân (sacrifice) was celebrated. Here Mustafâ <name type="place">Pâshâ</name>, the emperor \'s favourite, was promoted to the rank of second vezir, and called into the divân. The army then continued its march to Erzerum. Besides tiie guns provided by the commander - in - chief, there were forty large guns dragged by two thousand pairs of buftaloes. The army entered the castle of <name type="place">Kazmaghan</name>, and halted under the walls of Eriviin in the year 1044 (1634)'
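
Rebuilding the text token by token changes the spacing (for example `commander - in - chief` and `emperor 's`) and drops the final token. An alternative is to insert tags at character offsets and leave the rest of the string untouched. The sketch below is one way to do that; `wrap_spans` is a hypothetical helper, and the offsets are written by hand here. With spaCy they could come from `ent.start_char` and `ent.end_char`.

```python
from xml.sax.saxutils import escape

def wrap_spans(text, spans, open_tag='<name type="place">', close_tag="</name>"):
    """Wrap each (start, end) character span in tags, leaving all other text as it was."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        pieces.append(escape(text[cursor:start]))            # text before the span
        pieces.append(open_tag + escape(text[start:end]) + close_tag)
        cursor = end
    pieces.append(escape(text[cursor:]))                      # text after the last span
    return "".join(pieces)

sentence = "The army marched from Konia to Kaiseria (Caesarea)."
spans = [(22, 27), (31, 39)]  # hand-written offsets for "Konia" and "Kaiseria"
print(wrap_spans(sentence, spans))
```

`escape` also protects characters such as `&`, which would otherwise break the XML parsing step that follows.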

In [43]:
# save text as tei 
filename = 'my_tei.xml'
language = 'en'

tei_string = f"""
<TEI.2>
  <text lang="{language}">
    <body>
        <p>
            {text}
        </p>
    </body>
  </text>
</TEI.2>
"""
doc = etree.fromstring(tei_string)
tree = etree.ElementTree(doc)
tree.write(f'{filename}', pretty_print=True, xml_declaration=False,   encoding="utf-8")


In [44]:
with open('my_tei.xml','r') as f:
    print(f.read())

<TEI.2>
  <text lang="en">
    <body>
        <p>
            The army marched from <name type="place">Konia</name> to <name type="place">Kaiseria</name> ( <name type="place">Caesarea</name>), and thence to <name type="place">Sivas</name>, where the feast of the Korbân ( sacrifice) was celebrated. Here Mustafâ <name type="place">Pâshâ</name>, the emperor 's favourite, was promoted to the rank of second vezir, and called into the divân. The army then continued its march to Erzerum. Besides tiie guns provided by the commander - in - chief, there were forty large guns dragged by two thousand pairs of buftaloes. The army entered the castle of <name type="place">Kazmaghan</name>, and halted under the walls of Eriviin in the year 1044 ( 1634)
        </p>
    </body>
  </text>
</TEI.2>



## B. TEI to TEI


In [23]:
!prodigy dataset historic_places_xml "A dataset for British historic places with TEI input" --author Andy


  ✨  Successfully added 'historic_places_xml' to database SQLite.



In [101]:
#save xml file to disk
with open('my_tei.txt','w') as f:
    url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + refs[2]
    print(url)
    tei = urlopen(url).read()
    tei = etree.XML(tei)
    text = etree.tostring(tei)
    f.write(str(text))

with open('my_tei.txt','r') as fr:
    print(fr.read()[:100])

http://www.perseus.tufts.edu/hopper/xmlchunk?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D3
b'<TEI.2><text lang="en"><body><div1 type="narrative" org="uniform" sample="complete"><head><foreign
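
The leading `b'` in the output above shows that the file contains the printed form of a bytes object, not plain XML, because `str()` was applied to the serialized bytes. A minimal sketch of the difference, using the standard library's `xml.etree.ElementTree` and a made-up fragment (to my knowledge `lxml`'s `etree.tostring` accepts `encoding="unicode"` in the same way, but treat that as an assumption to verify):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<TEI.2><text lang="en"><body><p>Fynmarke</p></body></text></TEI.2>')

as_bytes = ET.tostring(root)                      # bytes
as_text = ET.tostring(root, encoding="unicode")   # str, ready to write to a text file

print(str(as_bytes)[:10])   # starts with b' because str() shows the bytes repr
print(as_text[:10])         # starts with the actual markup
```

Writing `as_text` (or `as_bytes.decode("utf-8")`) keeps the stray `b'...'` wrapper out of the annotation input.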


In [104]:
!prodigy ner.teach historic_places_xml en_core_web_sm my_tei.txt --label PLACE --patterns patterns.jsonl

Using 1 labels: PLACE

  ✨  Starting the web server at http://spacy.apjan.co:8080 ...
  Open the app in your browser and start annotating!

^C


## Prodigy as a manual annotation tool 
text > tei manual markup

In [91]:
text = """The army marched from Konia to Kaiseria (Caesarea), and thence to Sivas, where the feast of the Korbân (sacrifice) was celebrated. Here Mustafâ Pâshâ, the emperor's favourite, was promoted to the rank of second vezir, and called into the divân. The army then continued its march to Erzerum. Besides tiie guns provided by the commander-in-chief, there were forty large guns dragged by two thousand pairs of buftaloes. The army entered the castle of Kazmaghan, and halted under the walls of Eriviin in the year 1044 (1634)."""  
with open('text2tei.txt','w') as f:
    f.write(text)

In [47]:
!prodigy dataset text2tei "A dataset for British historic places with text input" --author Andy


  ✨  Successfully added 'text2tei' to database SQLite.



In [48]:
!prodigy ner.manual text2tei en_core_web_sm text2tei.txt

Using 18 labels from model: LOC, FAC, PRODUCT, GPE, CARDINAL, QUANTITY, PERSON, EVENT, ORDINAL, MONEY, NORP, DATE, TIME, LAW, ORG, LANGUAGE, PERCENT, WORK_OF_ART

  ✨  Starting the web server at http://spacy.apjan.co:8080 ...
  Open the app in your browser and start annotating!

Task queue depth is 1
^C

Saved 1 annotations to database SQLite
Dataset: text2tei
Session ID: 2019-06-30_14-10-50



In [51]:
!prodigy db-out text2tei .


  ✨  Exported 1 annotations for 'text2tei' from database SQLite
  /home/ajanco/spaCy_DH2019_workshop/unit3/text2tei.jsonl



In [94]:
with open('text2tei.jsonl','r') as f:
    jsonl = json.loads(f.read())
    print(jsonl['spans'])

[{'start': 22, 'end': 27, 'token_start': 4, 'token_end': 4, 'label': 'GPE'}, {'start': 31, 'end': 39, 'token_start': 6, 'token_end': 6, 'label': 'GPE'}, {'start': 66, 'end': 71, 'token_start': 14, 'token_end': 14, 'label': 'GPE'}, {'start': 83, 'end': 102, 'token_start': 18, 'token_end': 21, 'label': 'EVENT'}, {'start': 136, 'end': 149, 'token_start': 29, 'token_end': 30, 'label': 'PERSON'}, {'start': 282, 'end': 289, 'token_start': 59, 'token_end': 59, 'label': 'GPE'}, {'start': 448, 'end': 457, 'token_start': 92, 'token_end': 92, 'label': 'GPE'}, {'start': 489, 'end': 496, 'token_start': 100, 'token_end': 100, 'label': 'GPE'}, {'start': 509, 'end': 513, 'token_start': 104, 'token_end': 104, 'label': 'DATE'}]
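
`json.loads(f.read())` works here because the export holds a single annotation. A JSONL file with more than one record has one JSON object per line and needs to be parsed line by line. A minimal sketch with made-up records (`read_jsonl` is a hypothetical helper):

```python
import json

def read_jsonl(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

# Two made-up records in the same shape as the export above.
exported = (
    '{"text": "to Konia", "spans": [{"start": 3, "end": 8, "label": "GPE"}]}\n'
    '{"text": "to Sivas", "spans": [{"start": 3, "end": 8, "label": "GPE"}]}\n'
)

records = read_jsonl(exported.splitlines())
for record in records:
    for span in record["spans"]:
        print(span["label"], record["text"][span["start"]:span["end"]])
```

For a file on disk, passing the open file object (`read_jsonl(f)`) works the same way, since iterating a file yields lines.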


In [84]:
import json 

def jsonl_to_xml(jsonl):    
    jsonl = json.loads(jsonl)
    text = jsonl['text']
    offset = 0
    
    for span in jsonl['spans']:

        new_text = f"<{span['label']}>" + text[span["start"] +offset : span["end"] + offset] +  f"</{span['label']}>" 
        text = text[:span["start"] + offset] + new_text + text[span["end"] + offset:]
        offset += len(new_text) - (span["end"] - span["start"])
    
    return text
        

with open('text2tei.jsonl','r') as f:
    jsonl = f.read()
    xml = jsonl_to_xml(jsonl)
    print(xml)

The army marched from <GPE>Konia</GPE> to <GPE>Kaiseria</GPE> (Caesarea), and thence to <GPE>Sivas</GPE>, where the <EVENT>feast of the Korbân</EVENT> (sacrifice) was celebrated. Here <PERSON>Mustafâ Pâshâ</PERSON>, the emperor's favourite, was promoted to the rank of second vezir, and called into the divân. The army then continued its march to <GPE>Erzerum</GPE>. Besides tiie guns provided by the commander-in-chief, there were forty large guns dragged by two thousand pairs of buftaloes. The army entered the castle of <GPE>Kazmaghan</GPE>, and halted under the walls of <GPE>Eriviin</GPE> in the year <DATE>1044</DATE> (1634).
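
The output above uses spaCy labels such as `GPE` directly as element names, which are not TEI elements. One option is to map labels onto the `<name type="...">` form used earlier in this notebook. The mapping below is hypothetical and should be adjusted to your own encoding guidelines; `spans_to_tei` is likewise an illustrative helper, not part of Prodigy or spaCy.

```python
# Hypothetical mapping from spaCy labels to <name type="..."> values.
LABEL_TO_TYPE = {"GPE": "place", "LOC": "place", "PERSON": "person", "EVENT": "event"}

def spans_to_tei(text, spans):
    """Insert <name type="..."> tags for mapped labels; other spans are left untagged."""
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        name_type = LABEL_TO_TYPE.get(span["label"])
        if name_type is None or span["start"] < cursor:
            continue  # skip unmapped labels and overlapping spans
        out.append(text[cursor:span["start"]])
        out.append(f'<name type="{name_type}">{text[span["start"]:span["end"]]}</name>')
        cursor = span["end"]
    out.append(text[cursor:])
    return "".join(out)

text = "from Konia in the year 1044"
spans = [{"start": 23, "end": 27, "label": "DATE"}, {"start": 5, "end": 10, "label": "GPE"}]
print(spans_to_tei(text, spans))
```

Here the `DATE` span is left as plain text because it has no entry in the mapping.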


In [90]:
# save text as tei 
filename = 'annotated_tei.xml'
language = 'en'

tei_string = f"""
<TEI.2>
  <text lang="{language}">
    <body>
        <p>
            {xml}
        </p>
    </body>
  </text>
</TEI.2>
"""
doc = etree.fromstring(tei_string)
tree = etree.ElementTree(doc)
tree.write(f'{filename}', pretty_print=True, xml_declaration=False,   encoding="utf-8")

with open(filename,'r') as f:
    print(f.read())

<TEI.2>
  <text lang="en">
    <body>
        <p>
            The army marched from <GPE>Konia</GPE> to <GPE>Kaiseria</GPE> (Caesarea), and thence to <GPE>Sivas</GPE>, where the <EVENT>feast of the Korbân</EVENT> (sacrifice) was celebrated. Here <PERSON>Mustafâ Pâshâ</PERSON>, the emperor's favourite, was promoted to the rank of second vezir, and called into the divân. The army then continued its march to <GPE>Erzerum</GPE>. Besides tiie guns provided by the commander-in-chief, there were forty large guns dragged by two thousand pairs of buftaloes. The army entered the castle of <GPE>Kazmaghan</GPE>, and halted under the walls of <GPE>Eriviin</GPE> in the year <DATE>1044</DATE> (1634).
        </p>
    </body>
  </text>
</TEI.2>



tei > tei manual markup

In [20]:
with open('output.html','r') as f:
    print(f.read())

None


In [24]:
# load tei
# tei to standoff 
# tei to annotations
# run ner.manual
# export annotations prodigy db-out my_set /tmp
# write annotations to TEI

In [None]:
#problem with training on standoff is that we lose ability to correct or remove markup

In [68]:
import json
import standoffconverter # standoff_to_xml, tree_to_standoff

url = 'http://www.perseus.tufts.edu/hopper/xmlchunk?doc=' + refs[2]
print(url)
tei = urlopen(url).read()
tei = etree.XML(tei)

def xml_to_jsonl(xml):
    standoff = standoffconverter.tree_to_standoff(xml)
    jsonl = {}
    jsonl['text'] = standoff[0]
    jsonl['spans'] = []
    for tag in standoff[1]:
        jsonl['spans'].append(tag)
    
    # rename keys: 'begin' to 'start', 'tag' to 'label'
    for span in jsonl['spans']:
        span['start'] = span.pop('begin')
        span['label'] = span.pop('tag')
        del span['attrib']
        del span['depth']
    return  json.dumps(jsonl)

with open('apjanco.jsonl','w') as f:
    
    f.write(xml_to_jsonl(tei))

http://www.perseus.tufts.edu/hopper/xmlchunk?doc=Perseus%3Atext%3A1999.03.0070%3Anarrative%3D3
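
To show what a TEI-to-standoff conversion involves, here is a standard-library sketch of the idea. It is not how `standoffconverter` is implemented and does not reproduce its output format; `tree_to_text_and_spans` is a hypothetical helper and the fragment is made up. It walks the tree in document order, collects the plain text, and records one character span per element.

```python
import xml.etree.ElementTree as ET

def tree_to_text_and_spans(root):
    """Return the plain text of the tree and a start/end/label span for every element."""
    chunks = []
    spans = []
    length = 0

    def visit(element):
        nonlocal length
        start = length
        if element.text:
            chunks.append(element.text)
            length += len(element.text)
        for child in element:
            visit(child)
            if child.tail:  # text that follows the child inside this element
                chunks.append(child.tail)
                length += len(child.tail)
        spans.append({"start": start, "end": length, "label": element.tag})

    visit(root)
    return "".join(chunks), spans

root = ET.fromstring('<p>From <name type="place">York</name> to <name type="place">Fynmarke</name>.</p>')
text, spans = tree_to_text_and_spans(root)
places = [text[s["start"]:s["end"]] for s in spans if s["label"] == "name"]
print(text)
print(places)
```

Each span carries `start`, `end` and `label`, the same keys that appear in the exported annotations earlier in this notebook.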


In [None]:
!prodigy ner.manual historic_places en_core_web_sm apjanco.jsonl

Using 18 labels from model: TIME, PERCENT, PERSON, FAC, NORP, ORDINAL, DATE, LAW, MONEY, LOC, LANGUAGE, GPE, CARDINAL, PRODUCT, QUANTITY, EVENT, ORG, WORK_OF_ART

  ✨  Starting the web server at http://spacy.apjan.co:8080 ...
  Open the app in your browser and start annotating!

Task queue depth is 1
Exception when serving /get_session_questions
Traceback (most recent call last):
  File "/home/ajanco/spacy/lib/python3.7/site-packages/waitress/channel.py", line 336, in service
    task.service()
  File "/home/ajanco/spacy/lib/python3.7/site-packages/waitress/task.py", line 175, in service
    self.execute()
  File "/home/ajanco/spacy/lib/python3.7/site-packages/waitress/task.py", line 452, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "/home/ajanco/spacy/lib/python3.7/site-packages/hug/api.py", line 451, in api_auto_instantiate
    return module.__hug_wsgi__(*args, **kwargs)
  File "/home/ajanco/spacy/lib/python3.7/site-packages/falcon/api.py", li