## 1. Downloading spaCy models

The first step is to download the spaCy model (`en_core_web_trf`), which has been pre-trained on annotated English corpora. You only have to run the code cells below the first time you run the notebook; after that, you can skip right to step 2 and carry on from there. (If you run them again later, nothing bad will happen; the model will just download again.) Once the model is installed, you can also use spaCy in other notebooks on your computer without downloading it again.
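Because the download is large (about 460 MB), you may want to check whether the model package is already installed before running the install cell. Here is a minimal sketch that uses only the standard library; the helper name `model_is_installed` is made up for illustration and is not part of spaCy.

```python
import importlib.util


def model_is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# en_core_web_trf is the package name installed by the pip command in this section
if model_is_installed("en_core_web_trf"):
    print("Model already installed; you can skip the install cell.")
else:
    print("Model not found; run the install cell.")
```

This only checks that the package can be found, not that its version matches your spaCy version.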

In [1]:
#Imports the module you need to download and install the spaCy models
import sys

In [2]:
#Installs the English spaCy model
!{sys.executable} -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz (460.2 MB)
     |████████████████████████████████| 460.2 MB 15 kB/s
  Preparing metadata (setup.py) ... done


## 2. Importing spaCy and setting up NLP

Run the code cells below to import the spaCy module and create a function that loads the English model and runs the NLP algorithms (including named-entity recognition).

In [3]:
#Imports spaCy
import spacy

#Imports the English model
import en_core_web_trf

In [4]:
#Sets up a function so you can run the English model on texts
nlp = en_core_web_trf.load()
# nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemmatizer"])

## 3. Importing other modules

There are various other modules that will be useful in this notebook. The code cell below imports all of them, and the code comments explain what each one is for.

In [5]:
#io is used for opening and writing files
import io

#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob

#os is used to navigate your folders (e.g. change to the folder where your files are stored)
import os

# for handling data frames, etc.
import pandas as pd

# Import the spaCy visualizer
from spacy import displacy

# Import the Entity Ruler for making custom entities
from spacy.pipeline import EntityRuler
from spacy.language import Language  # type: ignore

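# requests is used for fetching data over HTTP
import requests

# csv is used for reading and writing CSV files
import csv

# pathlib is used for working with file paths as objects
import pathlib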

# ! pip install spacy-lookup

# allows you to add custom entities for NER
#from spacy_lookup import Entity
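As a rough illustration of how `glob`, `io`, and `os` fit together, here is a minimal sketch that collects every text file in a folder into a dictionary. The helper name `read_text_files` and the `.txt` extension are assumptions made for this example; adjust them to match your own files.

```python
import glob
import io
import os


def read_text_files(folder: str) -> dict:
    """Map each .txt filename in `folder` to the text it contains."""
    texts = {}
    # Build a pattern such as "subset/*.txt" and visit matches in sorted order
    pattern = os.path.join(folder, "*.txt")
    for path in sorted(glob.glob(pattern)):
        with io.open(path, "r", encoding="utf-8") as handle:
            texts[os.path.basename(path)] = handle.read()
    return texts
```

For example, `read_text_files("subset")` would return one entry per text file in a folder called `subset`, or an empty dictionary if nothing matches.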

## 4. Directory setup

Assuming you’re running Jupyter Notebook from your computer’s home directory, this code cell lets you change into the directory where you’re keeping your project files. I’ve put just a few of the ANSP volumes into a folder called `subset`.

In [6]:
#Define the file directory here
filedirectory = '/Users/thalassa/streamlit/streamlit-ansp'

#Change the working directory to the one you just defined
os.chdir(filedirectory)

In [7]:
species = pd.read_json("/Users/thalassa/streamlit/streamlit-ansp/data/ansp-taxa.json")
habitats = pd.read_json("/Users/thalassa/streamlit/streamlit-ansp/data/ansp-habitat.json")
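The two `read_json` calls above repeat the full project path. One way to avoid that is to build each data path from a single base directory with `pathlib`. This is only a sketch: `data_file` is a hypothetical helper, and the `data` subfolder name is taken from the paths used above.

```python
from pathlib import Path


def data_file(project_dir: str, filename: str) -> Path:
    """Build the path to a file inside the project's data folder."""
    return Path(project_dir) / "data" / filename


# Same project directory as in the cell that defines filedirectory
taxa_path = data_file("/Users/thalassa/streamlit/streamlit-ansp", "ansp-taxa.json")
print(taxa_path)
```

With this in place you could write `pd.read_json(data_file(filedirectory, "ansp-taxa.json"))`, and moving the project would only require changing `filedirectory`.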

In [8]:
# Iterate through the species and habitat dictionaries to wrap each value in a list
species_dict = dict(species)
for key, val in species_dict.items():
    species_dict[key] = [val,]

habitats_dict = dict(habitats)
for key, val in habitats_dict.items():
    habitats_dict[key] = [val,]

@Language.factory(name="species_entity")
def create_species_entity(nlp: Language, name: str):
    return Entity(name=name, keywords_dict=species_dict, label="TAXA")

@Language.factory(name="habitat_entity")
def create_habitat_entity(nlp: Language, name: str):
    # habitats_list = list(habitats.Habitat)
    return Entity(name=name, keywords_dict=habitats_dict, label="HABITAT")


In [9]:
#ruler = EntityRuler(nlp)
#nlp.add_pipe(ruler)
nlp.add_pipe("species_entity")
nlp.add_pipe("habitat_entity")

NameError: name 'Entity' is not defined

In [None]:
nlp.to_disk("ansp_ner")

In [10]:
nlp.pipeline


[('transformer',
  <spacy_transformers.pipeline_component.Transformer at 0x16ffa2f40>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x16fffd360>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x170004760>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x16ff8bd00>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x170035f40>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x170004d60>)]

## 5. Define an entity ruler

We need to tell the pipeline where to add the new component. The entity ruler should go *before* the `ner` step, so that the statistical entity recognizer respects the spans the ruler has already labeled and adjusts its predictions around them.

In [None]:
ruler = nlp.add_pipe("entity_ruler", before='ner') # <- this is directly from spacy documentation
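To see what an entity ruler does on its own, here is a minimal sketch that runs one on a blank English pipeline, so no model download is needed. A blank pipeline has no `ner` component, so `before='ner'` is left out here. The names `nlp_demo` and `ruler_demo` are chosen to avoid overwriting the notebook's `nlp` and `ruler`, and the example terms are made up for illustration rather than taken from the ANSP data.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")

# A pattern is either an exact phrase (string) or a list of token descriptions
ruler_demo.add_patterns([
    {"label": "TAXA", "pattern": "Quercus alba"},
    {"label": "HABITAT", "pattern": [{"LOWER": "salt"}, {"LOWER": "marsh"}]},
])

doc = nlp_demo("Quercus alba was not found in the salt marsh.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

String patterns match the exact phrase, while token patterns such as `{"LOWER": "salt"}` match regardless of capitalization.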

In [None]:
# Load the new patterns (your list of custom entities) from the properly formatted JSONL file.
ruler.from_disk("/Users/thalassa/Rcode/blog/data/ansp-entity-ruler.jsonl")
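"Properly formatted" here means one JSON object per line, each with a `label` and a `pattern` key. The sketch below shows one way to produce that text from a plain list of terms. The helper `patterns_to_jsonl` is hypothetical, and the example terms are made up; the real contents of `ansp-entity-ruler.jsonl` are not shown in this notebook.

```python
import json


def patterns_to_jsonl(terms, label):
    """Return JSONL text with one {"label": ..., "pattern": ...} object per line."""
    lines = [
        json.dumps({"label": label, "pattern": term}, ensure_ascii=False)
        for term in terms
    ]
    return "\n".join(lines) + "\n"


# Made-up example terms, labeled with the TAXA label used earlier in the notebook
example = patterns_to_jsonl(["Quercus alba", "Acer rubrum"], "TAXA")
print(example)
```

Writing the returned text to a file with a `.jsonl` extension gives you something `ruler.from_disk` can read.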