# Entity Ruler for Location Data
This Entity Ruler assigns the "LOC" label to every location name in "data/extracted_locations/combined_list". This will help with annotating the various articles later in the process.

Import spaCy and its EntityRuler pipeline component to facilitate tagging, and json to read the data files.

In [1]:
# Import spaCy to enable EntityRuler creation, and json to read the data files

import spacy
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
import json

# Helper functions to read and write JSON files, if needed.
def load_data(file):
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

def save_data(file, data):
    with open(file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

In [2]:
# Load the data. Forward slashes work on every OS and avoid backslash escape sequences in the string.
data = load_data("../../data/extracted_locations/combined_locations.json")

The function create_training_data takes two parameters: the list of data and the label name. This makes it easy to add other labels later if required. For now, all locations take the LOC label.

In [9]:
# Function to create the patterns in the {"label": label, "pattern": example} format that spaCy's EntityRuler uses
def create_training_data(data_list, label):
    # "label" replaces the original parameter name "type", which shadowed the Python built-in
    patterns = []
    for location in data_list:
        pattern = {
            "label": label,
            "pattern": location
        }
        patterns.append(pattern)
    return patterns

In [10]:
# Input the data and label type, and call the function to create the tagged list
patterns = create_training_data(data, "LOC")
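
To make the resulting format concrete, here is a minimal sketch of what the tagged list looks like. The two location names are hypothetical stand-ins for the real data file, and the list comprehension is only an equivalent shorthand for the loop in create_training_data, not the notebook's own code.

```python
# Hypothetical sample names; the real list is loaded from the JSON data file.
sample_locations = ["New York", "Paris"]

# Each entry pairs the label with a phrase pattern (a plain string).
sample_patterns = [{"label": "LOC", "pattern": name} for name in sample_locations]

print(sample_patterns[0])  # {'label': 'LOC', 'pattern': 'New York'}
```

Plain strings like these are matched as exact phrases, so a name is only tagged when it appears in the text with the same spelling and casing.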

The patterns of location names and their labels can now be added to a pipeline through spaCy's add_pipe method. As this is for NER, an EntityRuler pipe is created and placed before the standard model's NER component, so that its matches take primacy over the model's own predictions.

In [11]:
# Load spaCy's medium English model. Any model is fine here, because the tagged list takes primacy over (partially overrides) the model's own NER.
nlp = spacy.load("en_core_web_md")

# An EntityRuler pipe is created and placed before the original model's NER pipe to give primacy to the EntityRuler.
ruler = nlp.add_pipe("entity_ruler", before="ner")

# The patterns from the tagged list are then added to the EntityRuler
ruler.add_patterns(patterns)

In [None]:
# Save the pipeline, including the EntityRuler patterns, to disk
nlp.to_disk("../../models/loc_er")
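
A saved pipeline can be reloaded with spacy.load and run on a sentence to check that the patterns survive the round trip. The sketch below is not the notebook's own code. It assumes spaCy v3, and so that it runs on its own it uses a blank English pipeline, a temporary directory, and two hypothetical location names rather than en_core_web_md, the models folder, and the real data.

```python
import tempfile

import spacy

# Assumption: a blank English pipeline stands in for en_core_web_md,
# so there is no "ner" component and no need for before="ner".
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns([
    {"label": "LOC", "pattern": "New York"},  # hypothetical sample names
    {"label": "LOC", "pattern": "Paris"},
])

# Save to a temporary directory and load the pipeline back from it.
with tempfile.TemporaryDirectory() as tmp_dir:
    nlp_demo.to_disk(tmp_dir)
    reloaded = spacy.load(tmp_dir)

# Run the reloaded pipeline and collect the tagged entities.
doc = reloaded("She flew from New York to Paris.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

With the full model, the same check applies to the directory written by nlp.to_disk above. Spans that do not match any pattern are still left to the model's own NER component.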