# 35-entity-linker
>Facilitating model training using spaCy's entity linking functionality

**Purpose**  This notebook contains code that accomplishes the following tasks:
1. [Building a knowledge base](#Building-the-knowledge-base) - read data exported from the Spatial Historian to build a knowledge base containing all entities in a given corpus
2. [Building training data](#Building-training-data-for-modelling-with-entity-linker) - convert data exported from the Spatial Historian to the specific format required to train an entity linking model

See https://spacy.io/usage/training#entity-linker for more on spaCy's entity linking functionality

In [1]:
#default_exp ent_link

In [2]:
#export

import csv
import pandas as pd
from ssda.collate import *
from ssda.entity_corpus import *
from ssda.xml_parser import *
from ssda.add_ent import *
from ssda.split_data import *
import spacy
from spacy.kb import KnowledgeBase
import os

# Building the knowledge base

The functions below take three pieces of input from the Spatial Historian (a csv containing all of the people who appear in a specific volume, a csv linking these people to events described in that volume, and an xml file containing the full transcription of the volume) and produce a knowledge base that can be attached to a spaCy NLP object.

### Helper functions

The four functions below (`build_kb_seed`, `generate_descriptions`, `generate_altnames`, and `build_aliases`) compartmentalize various tasks required to build the knowledge base and are subsequently combined to perform the build in `create_kb`.

In [3]:
#export

def build_kb_seed(peopleCSV, ppeCSV):
    '''
    Parse Spatial Historian csvs
        peopleCSV: a csv containing all people who appear in a given volume
        ppeCSV: a csv linking these people to specific events

        returns: lists of identifiers, names, references, and frequency counts for each person
    '''
    
    #extract identifiers and names from peopleCSV
    with open(peopleCSV, encoding="utf-8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        ids = []
        names = []
        first = True
        for row in csvreader:
            if first:
                first = False
                continue
            ids.append(row[0])
            names.append(row[1])

    #extract folio numbers and event attendees from ppeCSV
    with open(ppeCSV, encoding="utf-8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        sources = []
        people = []
        first = True
        for row in csvreader:
            if first:
                first = False
                continue
            sources.append(row[1])       
            ppl = []
            ppl.append(row[3])
            for attendee in row[4].split(';'):
                person = attendee[attendee.find("P0"):]
                if person not in ppl:
                    ppl.append(person)
            for other in row[5].split(';'):
                person = other[other.find("P0"):]
                if (other != '') and (person not in ppl):                   
                    ppl.append(person)
            people.append(ppl)

    #finds the first reference to each person in the corpus and counts the total number of events that each person appears in
    refs = []
    freqs = [0] * len(ids)

    for j in range(len(ids)):
        ref = False
        for i in range(len(people)):
            if ids[j] in people[i]:
                freqs[j] += 1
                if not ref:
                    ref = True
                    refs.append(sources[i])    
    
    return ids, names, refs, freqs
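
The slice `attendee[attendee.find("P0"):]` used above relies on every field containing an ID: when `find` returns -1, the slice keeps only the field's last character. Below is a minimal sketch of a stricter alternative. It assumes, based only on the sample output in this notebook, that IDs look like `P009-001234`; the helper name `extract_person_ids` and the sample field contents are hypothetical.

```python
import re

# Assumed ID shape, inferred from IDs such as "P009-001234" in this notebook's output
PERSON_ID = re.compile(r"P\d{3}-\d{6}")

def extract_person_ids(field):
    """Return each distinct person ID found in a ';'-separated field, in order of appearance."""
    found = []
    for person_id in PERSON_ID.findall(field):
        if person_id not in found:
            found.append(person_id)
    return found

# hypothetical field contents, for illustration only
print(extract_person_ids("Jorge Villar P009-001240;Manuela Otero P009-001241"))
print(extract_person_ids(""))  # a field with no ID yields an empty list
```

Matching on the ID pattern rather than slicing means empty or malformed fields contribute nothing to the attendee list, at the cost of hard-coding the ID shape.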

In [4]:
#export

def generate_descriptions(names, refs, xml_df):
    '''
    Generates descriptions for each person
        names: a list of names from build_kb_seed
        refs: a list of references from build_kb_seed
        xml_df: dataframe built by the parse_xml function from ssda.xml_parser

        returns: lists of descriptions for each person
    '''
    
    descriptions = []
    #converts dataframe to lists
    volume_ids, volume_titles, folio_ids, entry_numbers, entry_texts = parseXML(xml_df)
    
    #uses list of references to locate entry text from the dataframe and locates the first sentence containing a reference
    #to the desired individual
    for i in range(len(refs)):
        trim = refs[i][refs[i].find('-') + 1:]
        #drops leading zeros on identifiers
        while trim[0] == '0':
            trim = trim[1:]
        found = False
        for j in range(len(entry_texts)):
            if found == True:
                break
            if folio_ids[j] == trim:
                if names[i] in entry_texts[j]:
                    #rudimentary fix for names containing periods (which they shouldn't)
                    if '.' in names[i]:
                        found = True                        
                        descriptions.append(entry_texts[j])
                        break
                    sentences = entry_texts[j].split('.')
                    #finds specific sentence
                    for sentence in sentences:
                        if names[i] in sentence:
                            found = True
                            if sentence[len(sentence) - 1] == ' ':
                                sentence = sentence[:-1]
                            descriptions.append(sentence + '.')
                            break
        
        #in the first pass, approximately 20% of names didn't find a match b/c they were manipulated in some way
        #between the verbatim entry text and being ingested into the Spatial Historian (e.g. last names added)
        #this loop attempts to address that
        if found == False:           
            alt_names = []
            name_parts = names[i].split(' ')
            #missing parts of compound name (or multiple characters as #)
            for name in name_parts:
                alt_names.append(name)            
            #individual characters replaced by #
            for k in range(1, len(names[i])):
                alt_names.append((names[i][:k]) + '#' + names[i][k + 1:])
            #individual characters replaced by ' '
            for l in range(1, len(names[i])):
                alt_names.append((names[i][:l]) + ' ' + names[i][l:])
            #names not capitalized
            for m in range(len(name_parts)):
                if name_parts[m][0].isupper():
                    name_parts[m] = name_parts[m][0].lower() + name_parts[m][1:]
            compound = name_parts[0]
            alt_names.append(compound)
            for k in range(1,len(name_parts)):
                compound += ' ' + name_parts[k]
                alt_names.append(compound)
            #remove #s from name
            if names[i].find('#') != -1:
                no_pound = names[i].replace('#', '')
                alt_names.append(no_pound)
                for np in no_pound.split(' '):
                    alt_names.append(np)
            #check for all possible alternate names
            for alt_name in alt_names:
                for j in range(len(entry_texts)):
                    if found == True:
                        break
                    if folio_ids[j] == trim:
                        if alt_name in entry_texts[j]:
                            sentences = entry_texts[j].split('.')
                            for sentence in sentences:
                                if alt_name in sentence:
                                    found = True
                                    sentence = sentence.replace(alt_name, names[i])
                                    if sentence[len(sentence) - 1] == ' ':
                                        sentence = sentence[:-1]
                                    descriptions.append(sentence + '.')
                                    break
                                    
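        #this occurs if the name was completely illegible in the original
        #(only applies when nothing was found above, so that each person gets exactly one description)
        if (found == False) and ("Unknown" in names[i]):
            descriptions.append("Who knows???")
            found = True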
            
        #this should not happen
        if found == False:
            print("Failed to find a description for " + names[i] + " in " + refs[i])
                            
    return descriptions
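
The core lookup in `generate_descriptions` (trim the reference down to a folio ID, then take the first sentence of the entry that mentions the name) can be sketched in isolation. This is a simplified illustration rather than a replacement: the helper names and the sample reference and text are hypothetical, and it keeps the same naive split on periods.

```python
def trim_reference(ref):
    """Keep the part after the first '-' and drop any leading zeros."""
    return ref[ref.find('-') + 1:].lstrip('0')

def first_sentence_with(name, text):
    """Return the first period-delimited sentence containing name, or None if there is none."""
    for sentence in text.split('.'):
        if name in sentence:
            # unlike the function above, this strips whitespace on both sides
            return sentence.strip() + '.'
    return None

# hypothetical reference and entry text, for illustration only
print(trim_reference("15834-0012"))
print(first_sentence_with("Pablo Ayende", "Número 1. En la ciudad bauticé a Pablo Ayende. Lo firmé."))
```

Returning `None` when nothing matches makes the "no description found" case explicit, so a caller could append a placeholder and keep the descriptions list aligned with the names list.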

#### Unit testing: `build_kb_seed` and `generate_descriptions`

Tests these functions using data from both volumes in our initial sample.

In [5]:
#no_test

ids, names, refs, freqs = build_kb_seed("FourPeople.csv","FourPeoplePerEntry.csv")
descriptions = generate_descriptions(names, refs, parse_xml("four.xml"))

#knowledge base structure required by spaCy
kb_input = {"id": ids, "desc": descriptions, "freq": freqs}

#convert to dataframe for visual inspection
kb_df = pd.DataFrame(kb_input)
kb_df.head(10)

| | id | desc | freq |
|---|---|---|---|
| 0 | P009-001234 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 47 |
| 1 | P009-001235 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 1 |
| 2 | P009-001236 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 1 |
| 3 | P009-001237 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 1 |
| 4 | P009-001238 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 1 |
| 5 | P009-001239 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 1 |
| 6 | P009-001240 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 7 |
| 7 | P009-001241 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 1 |
| 8 | P009-001242 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 1 |
| 9 | P009-001243 | Número 1 Pablo Ayende Maria Josefa Gomés En la... | 1 |


In [6]:
#no_test

ids, names, refs, freqs = build_kb_seed("SevenPeople.csv","SevenPeoplePerEntry.csv")
descriptions = generate_descriptions(names, refs, parse_xml("seven.xml"))
kb_input = {"id": ids, "desc": descriptions, "freq": freqs}
kb_df = pd.DataFrame(kb_input)
kb_df.head(10)

| | id | desc | freq |
|---|---|---|---|
| 0 | P009-001522 | Partida 1a Francisca Bentura Lunes veinte y oc... | 2 |
| 1 | P009-001523 | Partida 1a Francisca Bentura Lunes veinte y oc... | 2 |
| 2 | P009-001524 | Partida 1a Francisca Bentura Lunes veinte y oc... | 2 |
| 3 | P009-001525 | Partida 1a Francisca Bentura Lunes veinte y oc... | 2 |
| 4 | P009-001526 | Partida 1a Francisca Bentura Lunes veinte y oc... | 1 |
| 5 | P009-001562 | Partida 2a Maria Ysabel Martes veinte y nueve... | 2 |
| 6 | P009-001563 | Partida 1a Francisca Bentura Lunes veinte y oc... | 7 |
| 7 | P009-001564 | Partida 2a Maria Ysabel Martes veinte y nueve... | 2 |
| 8 | P009-001565 | Partida 2a Maria Ysabel Martes veinte y nueve... | 1 |
| 9 | P009-001579 | Partida 3 José Pantaleon Jueves treinta y uno ... | 2 |


This function takes a list of names and returns a list of lists, one per input name. Each inner list starts with the name itself, followed by every other name in the input that either contains it or is contained in it as a substring. It's used to generate aliases, as well as prior probabilities for those aliases, for the knowledge base. This is a very crude way of building the synonym list, and it should be refined if the entity linker yields meaningful results.

In [7]:
#export

def generate_altnames(names):
    '''
    Generates a list of alternate names for each person
        names: a list of names from build_kb_seed        

        returns: a list of alternate names for each person
    '''
    
    altnames = []
    
    for i in range(len(names)):
        alts = [names[i]]
        for j in range(len(names)):
            if ((names[j] in names[i]) or (names[i] in names[j])) and (names[j] != names[i]):
                alts.append(names[j])
        altnames.append(alts)
        
    return altnames

This function takes the synonym list built by `generate_altnames` as well as the ID list from `build_kb_seed` and returns three parallel lists: the unique name strings, the list of possible ID matches for each name, and the list of prior probabilities for each of those matches. For the time being, the prior probabilities for a given name are all set equal.

In [8]:
#export

def build_aliases(altnames, ids):
    '''
    Generates aliases and prior probabilities for the knowledge base
        altnames: a list of alternate names from generate_altnames        
        ids: a list of identifiers from build_kb_seed

        returns: three parallel lists: the unique name strings, 
        a list of possible ID matches for each name, 
        and a list of prior probabilities for each of those matches
    '''
    
    unames = []
    poss_matches = []
    
    for item in altnames:
        for name in item:
            if name not in unames:
                unames.append(name)
                
    for i in range(len(unames)):
        temp = []
        for j in range(len(altnames)):
            if unames[i] in altnames[j]:
                temp.append(ids[j])
        poss_matches.append(temp)
                
    probs = []
    
    for k in range(len(unames)):
        probs.append([(1 / len(poss_matches[k]))] * len(poss_matches[k]))
        
    
    return unames, poss_matches, probs

#### Unit testing: `generate_altnames` and `build_aliases`

Tests these functions using dummy data.

In [9]:
#no_test
#expected output: [["Daniel", "Dan", "Daniel Genkins"], ["Dan", "Daniel", "Daniel Genkins"], ["Tyrion"], 
#["Daniel Genkins", "Daniel", "Dan"]]

list_of_names = ["Daniel", "Dan", "Tyrion", "Daniel Genkins"]
altnames = generate_altnames(list_of_names)
print(altnames)

[['Daniel', 'Dan', 'Daniel Genkins'], ['Dan', 'Daniel', 'Daniel Genkins'], ['Tyrion'], ['Daniel Genkins', 'Daniel', 'Dan']]


In [10]:
#no_test
#expected output
#unames = ['Daniel', 'Dan', 'Daniel Genkins', 'Tyrion']
#poss_matches = [[1, 2, 4], [1, 2, 4], [1, 2, 4], [3]]
#probs = [[.333, .333, .333], [.333, .333, .333], [.333, .333, .333], [1]]

list_of_ids = [1, 2, 3, 4]
build_aliases(altnames, list_of_ids)

(['Daniel', 'Dan', 'Daniel Genkins', 'Tyrion'],
 [[1, 2, 4], [1, 2, 4], [1, 2, 4], [3]],
 [[0.3333333333333333, 0.3333333333333333, 0.3333333333333333],
  [0.3333333333333333, 0.3333333333333333, 0.3333333333333333],
  [0.3333333333333333, 0.3333333333333333, 0.3333333333333333],
  [1.0]])
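
Since the priors are all set equal for now, one possible refinement is to weight each candidate by how often that person appears in the corpus, using the `freqs` list that `build_kb_seed` already returns. The sketch below is an assumption about how that could look, not something this notebook implements; `weighted_priors` is a hypothetical helper and the frequencies are made up.

```python
def weighted_priors(poss_matches, ids, freqs):
    """Return one list of priors per alias, proportional to each candidate's event count."""
    freq_by_id = dict(zip(ids, freqs))
    probs = []
    for matches in poss_matches:
        counts = [freq_by_id[m] for m in matches]
        total = sum(counts)
        if total == 0:
            # fall back to equal priors when no candidate has any recorded events
            probs.append([1 / len(matches)] * len(matches))
        else:
            probs.append([c / total for c in counts])
    return probs

# same dummy aliases as the unit test above, with made-up frequencies for IDs 1-4
poss_matches = [[1, 2, 4], [1, 2, 4], [1, 2, 4], [3]]
print(weighted_priors(poss_matches, [1, 2, 3, 4], [6, 1, 2, 3]))
```

The priors for each alias still sum to 1, so the output has the same shape as the third list returned by `build_aliases`.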

### Combining helper functions

`create_kb` combines the helper functions defined above to build a spaCy knowledge base, which can then be saved using `save_kb`.

The loops that build the knowledge base are adapted from https://github.com/explosion/projects/blob/master/nel-emerson/scripts/el_tutorial.py

In [11]:
#export

def create_kb(peopleCSV, ppeCSV, xml_df):
    '''
    Creates a spaCy knowledge base
        peopleCSV: a csv containing all people who appear in a given volume
        ppeCSV: a csv linking these people to specific events
        xml_df: dataframe built by the parse_xml function from ssda.xml_parser

        returns: a spaCy knowledge base
    '''
    
    #using helper functions from above to build all required pieces for kb
    ids, names, refs, freqs = build_kb_seed(peopleCSV, ppeCSV)
    descriptions = generate_descriptions(names, refs, xml_df)
    altnames = generate_altnames(names)
    unames, poss_matches, probs = build_aliases(altnames, ids)
    
    nlp = spacy.load("es_core_news_md")
    #just grabbed md for convenience since it also includes vectors
    #lg is likely a better option if this yields meaningful results
    
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=50)
    #entity_vector_length must equal the length of the vectors passed to add_entity below (desc_doc.vector),
    #so 50 has to match the dimensionality of the vectors in the model loaded above
    
    for i in range(len(ids)):
        desc_doc = nlp(descriptions[i])
        desc_enc = desc_doc.vector
        kb.add_entity(entity=ids[i], entity_vector=desc_enc, freq=freqs[i])        
    
    for j in range(len(unames)):
        kb.add_alias(alias=unames[j], entities = poss_matches[j], probabilities = probs[j])                
        
    return kb

In [12]:
#export

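def save_kb(kb, output_dir):
    '''
    Saves a spaCy knowledge base and its vocab to disk
        kb: a spaCy knowledge base, e.g. one built by create_kb
        output_dir: directory to save to (created if it doesn't already exist)

        returns: paths to the saved knowledge base and vocab, or (None, None) if output_dir is an existing file
    '''
    
    #a path that points to an existing file can't be used as a directory
    if os.path.isfile(output_dir):
        print("Could not save KB:", output_dir, "is a file, not a directory")
        return None, None
    
    #creates the desired directory if it doesn't already exist
    if not os.path.isdir(output_dir):
        os.mkdir(output_dir)    
    
    #saves knowledge base (os.path.join picks the right separator for the current platform)
    kb_path = os.path.join(output_dir, "kb")
    kb.dump(kb_path)        
    print("Saved KB to", kb_path)

    vocab_path = os.path.join(output_dir, "vocab")
    kb.vocab.to_disk(vocab_path)
    print("Saved vocab to", vocab_path)
        
    return kb_path, vocab_path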

#### Unit testing: `create_kb` and `save_kb`

In [13]:
#no_test

test_kb = create_kb("FourPeople.csv","FourPeoplePerEntry.csv", parse_xml("four.xml"))
save_kb(test_kb, "C:\\Users\\Daniel Genkins\\kb_test")

print("\n" + f"Entities in the KB: {test_kb.get_entity_strings()}" + "\n")
print(f"Aliases in the KB: {test_kb.get_alias_strings()}")

Saved KB to C:\Users\Daniel Genkins\kb_test\kb
Saved vocab to C:\Users\Daniel Genkins\kb_test\vocab

Entities in the KB: ['P009-001417', 'P009-001339', 'P009-001583', 'P009-001370', 'P009-001592', 'P009-001296', 'P009-001241', 'P009-001240', 'P009-001375', 'P009-001325', 'P009-001578', 'P009-001577', 'P009-001368', 'P009-001435', 'P009-001363', 'P009-001520', 'P009-001359', 'P009-001396', 'P009-001418', 'P009-001488', 'P009-001354', 'P009-001387', 'P009-001242', 'P009-001426', 'P009-001567', 'P009-001288', 'P009-001312', 'P009-001383', 'P009-001425', 'P009-001286', 'P009-001561', 'P009-001310', 'P009-001458', 'P009-001265', 'P009-001268', 'P009-001302', 'P009-001326', 'P009-001439', 'P009-001357', 'P009-001602', 'P009-001519', 'P009-001596', 'P009-001595', 'P009-001253', 'P009-001351', 'P009-001450', 'P009-001423', 'P009-001616', 'P009-001401', 'P009-001335', 'P009-001476', 'P009-001255', 'P009-001275', 'P009-001397', 'P009-001518', 'P009-001340', 'P009-001249', 'P009-001420', 'P009-00

# Building training data for modelling with entity linker

The functions below are not directly related to those above. They transform Spatial Historian data into the format required to train a spaCy model with entity linking (so they *are* indirectly related, in that all of the functions in this notebook are required in order to execute the full entity linking training loop).
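
The target format, as it appears in the output at the end of this notebook, is a list of `(text, {"links": {(start, end): {id: probability}}})` tuples, one per unique entry text. Below is a minimal sketch of that grouping with made-up mentions. The helper name `group_links` and the sample data are hypothetical, and the candidate lookup is assumed to have happened already.

```python
def group_links(mentions):
    """Group (text, start, end, candidate_ids) tuples into entity linking training examples."""
    links_by_text = {}  # dicts preserve insertion order, so texts keep first-seen order
    for text, start, end, candidate_ids in mentions:
        if not candidate_ids:
            continue  # skip mentions with no candidate rather than divide by zero
        prob = 1 / len(candidate_ids)
        links = links_by_text.setdefault(text, {})
        # every candidate ID for this span gets an equal share of the probability
        links[(start, end)] = {cid: prob for cid in candidate_ids}
    return [(text, {"links": links}) for text, links in links_by_text.items()]

# hypothetical mentions: two spans in the same entry text
mentions = [
    ("Bauticé a Maria, hija de Juan.", 10, 15, ["P009-000001"]),
    ("Bauticé a Maria, hija de Juan.", 25, 29, ["P009-000002", "P009-000003"]),
]
print(group_links(mentions))
```

Keying on the entry text means repeated texts collapse into a single training example whose `links` dictionary holds every annotated span.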

In [14]:
#export

def get_poss_ids(sources, folio_id, events, ent_no, people, entity, ids, names):
    '''
    Helper function to streamline build_el_training_data
    
    Takes a variety of inputs and returns a list of possible IDs
    '''

    poss_ev = []
    poss_ids = []    
    
    #identify possible events
    for j in range(len(sources)):
        if sources[j] == folio_id:            
            poss_ev.append(events[j])
    
    #isolate specific event    
    event = poss_ev[ent_no - 1]    
            
    #identify attendees of specific event
    for k in range(len(events)):
        if events[k] == event:            
            poss_ppl = people[k]            
   
    #find all possible IDs for each entity mention   
    for person in poss_ppl:
        for l in range(len(ids)):            
            if (person == ids[l]) and (entity == names[l]):                                      
                poss_ids.append(ids[l])
                    
            
    #catches corner cases where names are garbled
    if len(poss_ids) == 0:
        for person in poss_ppl:
            for l in range(len(ids)):
                if (person == ids[l]) and (entity in names[l]):                   
                    poss_ids.append(ids[l])   
                        
    return poss_ids

In [15]:
#export

def build_el_training_data(train_df, ppeCSV, peopleCSV):
    '''
    Builds training data for a spaCy entity linking model
        train_df: dataframe containing portion of data earmarked for training by ssda.split_data
        ppeCSV: a csv linking these people to specific events
        peopleCSV: a csv containing all people who appear in a given volume

        returns: training data formatted for use with a spaCy entity linking model
    '''
    
    #turn df into lists for ease of manipulation
    entry_texts = train_df["text"].tolist()    
    starts = train_df["start"].tolist()
    ends = train_df["end"].tolist()
    entities = train_df["entity"].tolist()
    folio_ids = train_df["fol_id"].tolist()
    entry_ids = train_df["entry_no"].tolist()
    
    #set up data structures
    u_txt = []
    dicts = []   
    #a list of tuples in which the first element in each tuple is an entry text and the second is a dictionary of dictionaries
    el_inp = []
    
    #read in CSVs
    with open(peopleCSV, encoding="utf-8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        ids = []
        names = []
        first = True
        for row in csvreader:
            if first:
                first = False
                continue
            ids.append(row[0])
            names.append(row[1])
            
    with open(ppeCSV, encoding="utf-8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        sources = []
        people = []
        events = []
        first = True
        for row in csvreader:
            if first:
                first = False
                continue
            sources.append(row[1][-4:])
            events.append(row[0][-3:])
            ppl = []
            ppl.append(row[3])
            for attendee in row[4].split(';'):
                person = attendee[attendee.find("P0"):]
                if person not in ppl:
                    ppl.append(person)
            for other in row[5].split(';'):
                person = other[other.find("P0"):]
                if (other != '') and (person not in ppl):                   
                    ppl.append(person)
            people.append(ppl)    
      
    #loop through each entity reference and attach it to the appropriate entry text
    for i in range(len(entry_texts)):        
        folio_id = folio_ids[i]
        ent_no = int(entry_ids[i][-1:])
        start = int(starts[i])
        end = int(ends[i])
        entry_text = entry_texts[i]
        entity = entities[i]
        
        #build output components
        poss_ids = get_poss_ids(sources, folio_id, events, ent_no, people, entity, ids, names)
        prob = 1 / len(poss_ids)
        
        #combine output components
        if entry_text in u_txt:
            pos = u_txt.index(entry_text)
        else:
            #first time this entry text is seen, so pos must point at its new (empty) dictionary
            u_txt.append(entry_text)
            dicts.append({})
            pos = len(u_txt) - 1
        #every possible ID for this span gets an equal share of the probability
        dicts[pos][start, end] = {poss_id: prob for poss_id in poss_ids}
            
    #append output components to return value
    for x in range(len(u_txt)):
        el_inp.append((u_txt[x], {"links": dicts[x]}))            
    
    return el_inp

#### Unit testing: `build_el_training_data`

In [16]:
#no_test

# load and create df from xml
xml_df = parse_xml("four.xml")

# Create entity df from entity csvs
ent_df = entity_df_maker("FourPeoplePerEntry.csv", "FourPeople.csv")

# Put these two dfs together, along with entity span info
collated_df = collate_frames(xml_df, ent_df)

#train/test split
train, test = split_data(collated_df)

el_train_data = build_el_training_data(train, "FourPeoplePerEntry.csv", "FourPeople.csv")

#print output for visual inspection
for i in range(5):
    print(el_train_data[i])

('N. 46 Jose Camilo Montenegro Crespo, y Maria Concep.n Otero. En la ciudad de la Habana en veinte y dos de # de mil ochocientos catorce años Habiendo precedido las dilig# de estilo, y leidose las tres canonicas proclamas en tres dias festivos sin resultar impedimen# Yo don Jacinto Beltran presb.o encargrado por # de la Iglesia Auxiliar del Santo Angel Custodio: Desp# y vele segun el Ritual Romano á Jose Camil# Montenegro Crespo, hijo natural de Juan # po, natural y vecino de esta ciudad y feligre# 13.  #ardo libre; y á Maria de la Concep.n Otero, de ig.l clase, y de la prop.a naturalidad y vecindario, hija lexitima de Jose Maria Otero, y de Manuela de Jesus Ortiz; habiendoles preguntado tube por respuesta su mutuo consentim.to confesaron, comulgaron, fueron examinados en la Doctrina Cristiana, siendo #adrinos Jorge Villar, y Manuela Otero, y #tigos don Pedro Garcia, y don Manuel Valdes, y lo firmé. Jacinto Beltran y #quero ', {'links': {(448, 457): {'P009-001575': 1.0}, (864, 876): {'

In [17]:
#no_test

from nbdev.export import notebook2script
notebook2script()

Converted 10-process-data.ipynb.
Converted 11-entity-corpus-dataframe.ipynb.
Converted 12-ssda-xml-parser.ipynb.
Converted 20-data-exploration.ipynb.
Converted 30-dataset-generation.ipynb.
Converted 31-collate-xml-entities-spans.ipynb.
Converted 32-gen-spacy-input.ipynb.
Converted 33-split-data.ipynb.
Converted 34-add-entities.ipynb.
Converted 35-entity-linker.ipynb.
Converted 40-features-models-reports.ipynb.
Converted 41-generic-framework-for-spacy-training.ipynb.
Converted 42-testing-full-pipeline.ipynb.
