# Entity Ruler for Location Data

This notebook creates the "Dictionary of Locations" Entity Ruler pipe. It reads location names from the [combined_locations.json](../../data/extracted_locations/combined_locations.json) file and assigns the "LOC" label to each entry.

The resulting "Dictionary of Locations" Entity Ruler pipe is later added to a pipeline, either for direct use in a model or for creating annotations automatically.

Import spaCy and its EntityRuler pipeline to facilitate tagging, and JSON to read data files.

In [1]:
# Import spaCy to enable EntityRuler creation, and JSON to retrieve data files

import spacy
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
import json

# Helper to read data from a JSON file. A matching writer is kept below, commented out, in case it is needed.
def load_data(file):
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

# def save_data(file, data):
#     with open(file, "w", encoding="utf-8") as f:
#         json.dump(data, f, indent=4)

In [2]:
# Load the data. Forward slashes avoid invalid backslash escapes and also work on Windows.
data = load_data("../../data/extracted_locations/combined_locations.json")

Each entry in the Entity Ruler pipe follows the format below:
```
{"label": type, "pattern": example}
```
The function below wraps each location entry as the "pattern" and assigns "LOC" as its label.

In [7]:
# Function to create the patterns in the {"label": type, "pattern": example} format that spaCy's EntityRuler uses
def create_entityruler_items(data_list, tag_name):
    tagged_locations_list = []
    for location in data_list:
        pattern = {
            "label": tag_name,
            "pattern": location
        }
        tagged_locations_list.append(pattern)
    return tagged_locations_list

In [8]:
# Input the data and label type and call the function to create the tagged list
patterns = create_entityruler_items(data, "LOC")
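As a quick illustration of the output format, the sketch below runs an equivalent function on a small hypothetical list. The place names are invented for the example and are not taken from combined_locations.json.

```python
# Equivalent to the function above, repeated so this example runs on its own.
def create_entityruler_items(data_list, tag_name):
    return [{"label": tag_name, "pattern": location} for location in data_list]

# Hypothetical sample entries, not taken from the real data file.
sample_locations = ["Vienna", "Lake Geneva"]
sample_patterns = create_entityruler_items(sample_locations, "LOC")

for item in sample_patterns:
    print(item)
# {'label': 'LOC', 'pattern': 'Vienna'}
# {'label': 'LOC', 'pattern': 'Lake Geneva'}
```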

The "Dictionary of Locations" Entity Ruler can now be added to a model of choice.

Below, it is added to spaCy's standard medium English model (`en_core_web_md`) to create the **loc_er** model, which a later script uses to create annotations.

In [11]:
# Load spaCy's medium English model. Any model with an "ner" pipe works here, because the tagged list takes primacy over (partially overrides) the model's own NER.
nlp = spacy.load("en_core_web_md")

# An EntityRuler pipe is created and placed before the original model's NER pipe to give primacy to the EntityRuler.
ruler = nlp.add_pipe("entity_ruler", before="ner")

# The patterns from the tagged list are then added to the EntityRuler
ruler.add_patterns(patterns)
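To check that the ruler behaves as intended, it can help to try the same steps on a blank pipeline first. The sketch below is a stand-in under stated assumptions: it uses `spacy.blank("en")` instead of `en_core_web_md` (so there is no "ner" pipe and no `before` argument), and the two patterns and the sentence are made up for the example.

```python
import spacy

# Blank English pipeline: tokenizer only, so no model download is needed.
demo_nlp = spacy.blank("en")
demo_ruler = demo_nlp.add_pipe("entity_ruler")

# Hypothetical patterns in the same {"label": ..., "pattern": ...} format.
demo_ruler.add_patterns([
    {"label": "LOC", "pattern": "Vienna"},
    {"label": "LOC", "pattern": "Lake Geneva"},
])

# Run the pipeline and collect (text, label) pairs for every entity found.
doc = demo_nlp("She travelled from Vienna to Lake Geneva.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

With these patterns, both place names should come back labelled "LOC". String patterns match the exact token text by default, so a lowercase "vienna" would not be tagged here.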

In [None]:
nlp.to_disk("../../models/loc_er")
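The saved directory can later be loaded back with `spacy.load`. The sketch below shows the round trip on a blank pipeline written to a temporary directory, so the folder name `loc_er_demo` and the single pattern are placeholders rather than the notebook's real **loc_er** model.

```python
import tempfile
from pathlib import Path

import spacy

# Build a small stand-in pipeline with one hypothetical pattern.
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("entity_ruler").add_patterns(
    [{"label": "LOC", "pattern": "Vienna"}]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir) / "loc_er_demo"
    demo_nlp.to_disk(model_dir)       # saves the pipeline, including the patterns
    reloaded = spacy.load(model_dir)  # rebuilds the pipeline from that folder

# The reloaded pipeline should still carry the ruler and its patterns.
reloaded_ents = [(ent.text, ent.label_) for ent in reloaded("A train to Vienna.").ents]
print(reloaded.pipe_names, reloaded_ents)
```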