This tutorial is based on work developed by Elizabeth Cary with Pacific Northwest National Lab.
POC: Elizabeth Cary, elizabeth.cary@pnnl.gov

# Applying NER and Coreference Resolution with spaCy and AllenNLP


## Load spaCy model
[en_core_web_sm](https://spacy.io/models/en#en_core_web_sm) is typically considered spaCy's default English model and comes pre-loaded with a number of components: tok2vec, tagger, parser, senter, ner, attribute_ruler, and lemmatizer. For this demo, we'll be focusing on the NER component, though you can check out the linked documentation for more information on this model and its offerings.

> Note: Take a look at the information included in the model documentation. What should we keep in mind when using this model? In particular, what type of training data was used to train these components? How will this affect how we use this model?

In [1]:
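# Import packages
import json
import os

import pandas as pd
import spacy
from spacy import displacy  # ensures spacy.displacy is available for the rendering cells

# Load the small English pipeline
nlp = spacy.load('en_core_web_sm')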

In [2]:
# Pick a dataset and load it into memory
dataset = 4
data_file = f'../Data/Dataset_{dataset}/Documents/Documents_Dataset_{dataset}.json'
df = pd.read_json(data_file, orient='records')
df.head()

| Unnamed: 0 | conversationTitle | conversationNumber | tapeName | startDateTime | duration | summaryURL | id | speakers | waveFileDataURL | location | uuid | contents | endDateTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rmn_e634a.mp3-008 | 634a-8 | rmn_e634a.mp3-8 | 61102800000.0 | 47 | http://nixontapeaudio.org/logs/634.rtf | https://s3.amazonaws.com/las.public/c0JClqYx6W... | [{'name': 'Manolo Sanchez'}, {'name': 'The Pre... | https://s3.amazonaws.com/las.public/M2qMyTKx/n... | Oval Office (Room) | 6f8686c7-88a7-443b-94f9-ebde59dd4473 | [0:00:40] spk_0: spirit Very perceivable to br... | |
| 1 | rmn_e621a.mp3-006 | 621a-6 | rmn_e621a.mp3-6 | 59288400000.0 | 84 | http://nixontapeaudio.org/logs/621.rtf | https://s3.amazonaws.com/las.public/c0JClqYx6W... | [] | https://s3.amazonaws.com/las.public/M2qMyTKx/n... | Oval Office (Room) | 7362635f-220b-4454-a904-21f55e506a73 | [0:01:03] spk_0: Did you go over There will be... | |
| 2 | rmn_e617b.mp3-016 | 617b-16 | rmn_e617b.mp3-16 | 59029200000.0 | 206 | http://nixontapeaudio.org/logs/617.rtf | https://s3.amazonaws.com/las.public/c0JClqYx6W... | [{'name': 'George H. Boldt'}, {'name': 'Stephe... | https://s3.amazonaws.com/las.public/M2qMyTKx/n... | Oval Office (Room) | a7efe77a-9ba5-4c05-9272-197d6e8b7c2d | [0:01:58] spk_0: Yeah. What Yeah, yeah. Walk o... | |
| 3 | rmn_e014a.mp3-052 | 14a-52 | rmn_e014a.mp3-52 | 58683600000.0 | 7 | http://nixontapeaudio.org/logs/014.rtf | https://s3.amazonaws.com/las.public/c0JClqYx6W... | [{'name': 'White House operator'}, {'name': 'T... | https://s3.amazonaws.com/las.public/M2qMyTKx/n... | White House (Telephone) | 8fcaf0e6-09fc-4428-941d-fbb3991621dc | [0:00:00] spk_0: I appreciate it very much. Yo... | |
| 4 | rmn_e274c.mp3-044 | 274c-44 | rmn_e274c.mp3-44 | 53150400000.0 | 5875 | http://nixontapeaudio.org/logs/274.rtf | https://s3.amazonaws.com/las.public/c0JClqYx6W... | [{'name': 'The President'}, {'name': 'John D. ... | https://s3.amazonaws.com/las.public/M2qMyTKx/n... | Executive Office Building (Room) | 1bf3ece5-b31d-43bd-bb72-a58bb1f1591a | [0:53:15] spk_1: he is working with E O. You s... | |


## Named Entity Recognition
Now that we have our data and spaCy model loaded, let's explore the model in a little more detail.

First, let's list the entity labels the NER component can assign, along with spaCy's explanation of each, to better understand what we're being shown:

In [3]:
for label in nlp.get_pipe('ner').labels:
    print(label, '|', spacy.explain(label))

CARDINAL | Numerals that do not fall under another type
DATE | Absolute or relative dates or periods
EVENT | Named hurricanes, battles, wars, sports events, etc.
FAC | Buildings, airports, highways, bridges, etc.
GPE | Countries, cities, states
LANGUAGE | Any named language
LAW | Named documents made into laws.
LOC | Non-GPE locations, mountain ranges, bodies of water
MONEY | Monetary values, including unit
NORP | Nationalities or religious or political groups
ORDINAL | "first", "second", etc.
ORG | Companies, agencies, institutions, etc.
PERCENT | Percentage, including "%"
PERSON | People, including fictional
PRODUCT | Objects, vehicles, foods, etc. (not services)
QUANTITY | Measurements, as of weight or distance
TIME | Times smaller than a day
WORK_OF_ART | Titles of books, songs, etc.


Let's test how this works on the first document:

In [4]:
# Run the pipeline on the first document and print each entity with its label
doc = nlp(df['contents'][0])
for ent in doc.ents:
    print(ent.text, '|', ent.label_)
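
For a long transcript, the entity-by-entity printout can be hard to scan, so a tally per label may give a quicker overview. The following is a minimal sketch, assuming the entities have first been pulled out as `(text, label)` pairs, for example with `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `count_labels` and the sample pairs are hypothetical and are not model output.

```python
from collections import Counter


def count_labels(pairs):
    """Count how many entity mentions were found for each label.

    `pairs` is an iterable of (text, label) tuples.
    """
    return Counter(label for _text, label in pairs)


# Hypothetical sample pairs for illustration only; in the notebook these
# would come from [(ent.text, ent.label_) for ent in doc.ents].
sample_pairs = [
    ("Washington", "GPE"),
    ("Nixon", "PERSON"),
    ("Washington", "GPE"),
]

label_counts = count_labels(sample_pairs)
for label, count in label_counts.most_common():
    print(label, '|', count)
```

Note that this counts mentions rather than unique entities, so a name repeated throughout a conversation is counted each time it appears.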

In [5]:
# Highlight every recognized entity inline in the document text
spacy.displacy.render(doc, style='ent')



In [6]:
# Restrict the visualization to PERSON and GPE entities
options = {'ents': ['PERSON', 'GPE']}
spacy.displacy.render(doc, style='ent', options=options)

## Getting entities for all the documents

In [None]:
# Collect the unique GPE and PERSON entities found in each document
documentIDs = []
documentGeos = []
documentPeople = []

# nlp.pipe streams the texts through the pipeline in batches,
# which is usually more efficient than calling nlp() on each one
for doc_id, doc in zip(df['id'], nlp.pipe(df['contents'])):
    documentIDs.append(doc_id)
    documentGeos.append(list({ent.text for ent in doc.ents if ent.label_ == 'GPE'}))
    documentPeople.append(list({ent.text for ent in doc.ents if ent.label_ == 'PERSON'}))
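
Building each list from a set removes duplicates, but a set has no guaranteed order, so the saved lists can come out in a different order from one run to the next. If a stable order matters, one option is to de-duplicate while keeping the order of first appearance. This is a sketch only: the helper name `unique_in_order` and the sample mentions are hypothetical.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the order of first appearance.

    Dictionaries preserve insertion order, so dict.fromkeys keeps the
    first occurrence of each item and ignores later repeats.
    """
    return list(dict.fromkeys(items))


# Hypothetical entity mentions, in the order they appear in a document
mentions = ["Washington", "China", "Washington", "Moscow", "China"]

unique_mentions = unique_in_order(mentions)
print(unique_mentions)  # ['Washington', 'China', 'Moscow']
```

In the loop above, this would mean passing a generator such as `(ent.text for ent in doc.ents if ent.label_ == 'GPE')` to the helper in place of the set comprehension.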

In [None]:
# Make output JSON objects: one record per document
output = [
    {"id": doc_id, "Geos": geos, "People": people}
    for doc_id, geos, people in zip(documentIDs, documentGeos, documentPeople)
]

# Alternative layout: wrap the list in an object instead
# outJSON = {'Documents': output}
outJSON = output.copy()

In [None]:
# Save to file
filename = f'../Data/Dataset_{dataset}/Documents/Entities_Dataset_{dataset}.json'

def write_json_data_to_file(file_path, data):
    """Write data to file_path as JSON, creating parent directories if needed."""
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    # ensure_ascii=False keeps non-ASCII characters as-is, so write the file as UTF-8.
    # The with block closes the file automatically.
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False)
    print("file written to", file_path)

write_json_data_to_file(filename, outJSON)
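
As a quick sanity check, the saved file can be read back and compared with what was written. The sketch below is illustrative only: it writes a hypothetical record to a temporary directory, not to the dataset folder, and the helper name `read_json_data_from_file` is an assumption, not part of the original notebook.

```python
import json
import os
import tempfile


def read_json_data_from_file(file_path):
    """Load JSON from file_path, decoding it as UTF-8."""
    with open(file_path, encoding='utf-8') as file:
        return json.load(file)


# Hypothetical record in the same shape as the entries in outJSON
sample = [{"id": "doc-1", "Geos": ["Genève"], "People": ["Ana"]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Entities_example.json')
    # Write with the same settings the notebook uses for non-ASCII text
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(sample, file, ensure_ascii=False)
    loaded = read_json_data_from_file(path)

# True when the data survives the round trip unchanged
print(loaded == sample)
```

Reading with an explicit `encoding='utf-8'` matters here because, with `ensure_ascii=False`, names containing accented characters are stored as raw characters, not as `\u` escapes.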