# Introducing myself to spaCy for future work using named entity recognition & entity relations on the Proceedings of the Academy of Natural Sciences of Philadelphia (ANSP).

The setup blocks for this notebook are adapted from portions of The Data-Sitters Club, specifically this notebook:

Skallerup Bessette, Lee, and Quinn Dombrowski. “DSC Multilingual Mystery 2: Beware, Lee and Quinn!” February 27, 2020. https://datasittersclub.github.io/site/dscm2.html.

I am so grateful for the work that Quinn so generously documents and shares openly. 

## 1. Downloading spaCy models

The first step is to download the spaCy model. The model has been pre-trained on annotated English corpora. You only have to run these code cells below the first time you run the notebook; after that, you can skip right to step 2 and carry on from there. (If you run them again later, nothing bad will happen; it’ll just download again.) You can also run spaCy in other notebooks on your computer in the future, and you’ll be able to skip the step of downloading the models.

In [None]:
#Imports the module you need to download and install the spaCy models
import sys

In [None]:
#Installs the English spaCy model
!{sys.executable} -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz
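
If you'd rather not download the model again by accident, one option is to check first whether it can already be imported. This is a minimal sketch using only the standard library; `model_is_installed` is a hypothetical helper name, not part of spaCy.

```python
import importlib.util

def model_is_installed(package_name):
    """Return True if a pip-installed package (such as a spaCy model) can be imported."""
    return importlib.util.find_spec(package_name) is not None

# The package name matches the model installed in the cell above
if model_is_installed("en_core_web_trf"):
    print("Model already installed; skip to step 2.")
else:
    print("Model not found; run the install cell above.")
```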

## 2. Importing spaCy and setting up NLP

Run the code cell below to import the spaCy module and create an `nlp` object from the English model that runs the NLP algorithms (including named-entity recognition).

In [2]:
#Imports spaCy
import spacy

#Imports the English model
import en_core_web_trf

#Loads the English model as nlp, so you can run it on texts with nlp(text)
nlp = en_core_web_trf.load()

## 3. Importing other modules

There are various other modules that will be useful in this notebook. The code comments explain what each one is for. This code cell imports all of them.

In [3]:
#io is used for opening and writing files
import io

#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob

#os is used to navigate your folder directories (e.g. change folders to where your files are stored)
import os

# for handling data frames, etc.
import pandas as pd

# Import the spaCy visualizer
from spacy import displacy

# Import the Entity Ruler for making custom entities
from spacy.pipeline import EntityRuler

# ! pip install spacy-lookup

# allows you to add custom entities for NER
#from spacy_lookup import Entity

## 4. Directory setup

Assuming you’re running Jupyter Notebook from your computer’s home directory, this code cell gives you the opportunity to change into the directory where you’re keeping your project files. I've put just a few of the ANSP volumes into a folder called `subset`.

In [4]:
#Define the file directory here
filedirectory = '/Users/thalassa/Google Drive/My Drive/LEADING/corpus/subset'

#Change the working directory to the one you just defined
os.chdir(filedirectory)
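
If you'd rather not change the working directory, a possible alternative is to list the text files by their full paths. This is a sketch, not what the rest of the notebook does; `list_text_files` is a hypothetical helper, and its `skip_suffix` default assumes output files are named with `_ner_loc.txt`, as in the NER loop further down.

```python
from pathlib import Path

def list_text_files(directory, skip_suffix="_ner_loc.txt"):
    """Return the .txt files in `directory`, sorted by name, skipping earlier NER output files."""
    return sorted(
        path for path in Path(directory).glob("*.txt")
        if not path.name.endswith(skip_suffix)
    )

# Demonstration on the current directory; in this notebook you would pass filedirectory
for path in list_text_files("."):
    print(path.name)
```

Each item is a `Path` object, so it can be passed straight to `open()` without calling `os.chdir` first.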

## Running spaCy

In this first test, we are going to look at how spaCy's default place-name recognition performs. The code below keeps only the entities labeled `GPE`, spaCy's label for geopolitical entities such as countries, cities, and states. One goal of this NLP project is to identify species occurrences in the ANSP volumes. A species occurrence requires three data points: a SPECIES, seen at a specific PLACE, at a specific DAY/TIME. I will load a custom entity set for the species names in future steps.
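
To make the three data points concrete, here is one way an occurrence record could be represented. This is only a sketch under my own assumptions: the class name `SpeciesOccurrence`, its field names, and the `is_complete` method are hypothetical and not part of spaCy or of any existing code in this project.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeciesOccurrence:
    """One candidate occurrence record; any field may still be unknown (None)."""
    species: Optional[str] = None
    place: Optional[str] = None
    date: Optional[str] = None

    def is_complete(self):
        """True only when all three data points are present."""
        return None not in (self.species, self.place, self.date)

# Hypothetical record: a species and a place have been found, but no date yet
record = SpeciesOccurrence(species="Ariolimax niger", place="California")
print(record.is_complete())
```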

## Note - this takes a while - do not run this chunk unless you want to see the LOC results.

In [None]:
#Sort all the files in the directory you specified above, alphabetically.

#For each of those files...
for filename in sorted(os.listdir(filedirectory)):
    #If the filename ends with .txt (i.e. if it's actually a text file) and isn't an output file from an earlier run
    if filename.endswith('.txt') and not filename.endswith('_ner_loc.txt'):
        #Write out below the name of the file
        print(filename)
        #The file name of the output file adds _ner_loc to the end of the file name of the input file
        outfilename = filename.replace('.txt', '_ner_loc.txt')
        #Open the input filename
        with open(filename, 'r') as f:
            #Create and open the output filename
            with open(outfilename, 'w') as out:
                #Read the contents of the input file
                voltext = f.read()
                #Do English NLP on the contents of the input file
                volner = nlp(voltext)
                #For each recognized entity
                for ent in volner.ents:
                    #If that entity is labeled as a place
                    if ent.label_ == 'GPE':
                        #Print the entity, and the label (which should be GPE)
                        print(ent.text, ent.label_)
                        #Write the entity to the output file
                        out.write(ent.text)
                        #Write a newline character to the output file
                        out.write('\n')
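
The loop above keeps only `GPE` entities. spaCy's English models also use `LOC` (non-political locations such as mountain ranges and bodies of water) and `FAC` (facilities such as buildings and bridges), which may matter for collecting localities. The sketch below counts place names across a set of labels. `count_places` and `PLACE_LABELS` are hypothetical names, and the choice of which labels count as "places" is my assumption. The function only needs objects with `.text` and `.label_` attributes, so the demonstration uses stand-ins instead of a real spaCy `Doc`.

```python
from collections import Counter
from types import SimpleNamespace

# Assumption: these are the entity labels to treat as "places"
PLACE_LABELS = frozenset({"GPE", "LOC", "FAC"})

def count_places(ents, labels=PLACE_LABELS):
    """Count how often each place name occurs among entities with .text and .label_."""
    return Counter(ent.text for ent in ents if ent.label_ in labels)

# Stand-in entities for demonstration; with spaCy you would pass volner.ents instead
sample = [
    SimpleNamespace(text="California", label_="GPE"),
    SimpleNamespace(text="Cooper", label_="PERSON"),
    SimpleNamespace(text="California", label_="GPE"),
    SimpleNamespace(text="San Mateo", label_="GPE"),
]
print(count_places(sample).most_common())
```

Inside the loop, `count_places(volner.ents).most_common(10)` would give the ten most frequent place names in a volume.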


## Exploring spaCy tagging

Before running spaCy on a whole text, I'm going to try it on a few sentences in order to understand what the different annotations are and how they work. This snippet of text is from "Binney, W. G.: On the Anatomy and Lingual Dentition of Ariolimax and other Pulmonata" in the 1874 volume of the Proceedings of the Academy of Natural Sciences of Philadelphia. https://www.biodiversitylibrary.org/item/84867


In [5]:
# Example text
texts = [
    "I have examined one specimen of Ariolimax niger, J. G. Coop., preserved in spirit, belonging to the state collection of California, labelled and presented by Dr. Cooper, and in all respects an authentic type.",
    "Agreeing with this type I have other specimens from various California localities, so that I believe the species to be well established and generally distributed along the coast of California.",
    "From the Museum of Comparative Zoology at Cambridge, Mr. Anthony has sent me a specimen, long preserved in alcohol, marked from San Mateo, California.",
    "For reasons given below, I am inclined to consider this the form described by Dr. Cooper as A. Californicus.",
    "I have had the opportunity of examining another specimen of this form, received from Mr. Stearns, who collected it near San Francisco.",
]

# Just run the default NER, disabling the other NLP junk
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Print each sentence's entities as (text, label) pairs, then highlight them in the text
    print([(ent.text, ent.label_) for ent in doc.ents])
    displacy.render(doc, style="ent")

[('one', 'CARDINAL'), ('California', 'GPE'), ('Cooper', 'PERSON')]


[('California', 'GPE'), ('California', 'GPE')]


[('the Museum of Comparative Zoology', 'ORG'), ('Cambridge', 'ORG'), ('Anthony', 'PERSON'), ('San Mateo', 'GPE'), ('California', 'GPE')]


[('Cooper', 'PERSON')]


[('Stearns', 'PERSON'), ('San Francisco', 'GPE')]


In [None]:
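# look at the dependencies found
# `example` is not defined earlier in this notebook; here it is assumed to be the example sentences parsed as one Doc
example = nlp(" ".join(texts))
displacy.render(example, style="dep")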

In [None]:
# That was a really big plot - let's break the visualization into sentences and visualize those
sentence_spans = list(example.sents)
displacy.render(sentence_spans, style="dep") 

# Tagging custom entities with spaCy

I'm going to be using the Global Names Recognition and Discovery tools to automate identifying genus and species names in the ANSP volumes. From there, I'll need to load the species names into spaCy so they can be tagged via the NER pipeline. Let's do a small test on our example from above.

We already loaded spaCy's model `en_core_web_trf` as "nlp" in a prior code chunk. Now we need to create a new pipeline step using Entity Ruler (https://spacy.io/api/entityruler). We use the scientific names as our new entity pattern, and use SPECIES as the entity label.

These are the species names mentioned above, plus one I added just as a test:

`species = ['Ariolimax niger', 'A. Californicus', 'Homo sapiens']`

I created a JSONL file (really a text file with the file extension changed, lol) with the required format:

>{"label": "SPECIES", "pattern": "Ariolimax niger"}  
>{"label": "SPECIES", "pattern": "A. Californicus"}  
>{"label": "SPECIES", "pattern": "Homo sapiens"}

and saved it as `species-example.jsonl`.
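
Once the species list gets long, typing the patterns by hand won't scale. The sketch below writes the same one-pattern-per-line format from a Python list using the standard `json` module. `write_patterns` is a hypothetical helper, and the output name `species-generated.jsonl` is made up so that it doesn't overwrite the hand-made file.

```python
import json

def write_patterns(names, path, label="SPECIES"):
    """Write one entity ruler phrase pattern per line (JSONL) and return the patterns."""
    patterns = [{"label": label, "pattern": name} for name in names]
    with open(path, "w", encoding="utf-8") as out:
        for pattern in patterns:
            # ensure_ascii=False keeps any accented characters in names readable
            out.write(json.dumps(pattern, ensure_ascii=False) + "\n")
    return patterns

species = ["Ariolimax niger", "A. Californicus", "Homo sapiens"]
# Hypothetical output filename, written to the current working directory
write_patterns(species, "species-generated.jsonl")
```

The resulting file can be loaded into the entity ruler the same way as the hand-made one.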

Now, we will add these entities to the NLP pipeline by using the Entity Ruler (the Ruler of All Entities?).

This is the spaCy NLP pipeline (from spacy.io)

<img src="https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg" alt="NLP-Pipeline" width="600"/>

## Define an entity ruler

We need to tell the pipeline where to add the new component. We want the new entity ruler to run *before* the NER step, so that the entity recognizer keeps the SPECIES spans the ruler has already set and makes its own predictions around them.

In [6]:
ruler = nlp.add_pipe("entity_ruler", before='ner') # <- this is directly from spacy documentation

In [7]:
# Load the new pattern (your list of custom entities) by adding them from the properly formatted jsonl file.

ruler.from_disk("/Volumes/GoogleDrive/My Drive/LEADING/corpus/subset/species-example.jsonl")

<spacy.pipeline.entityruler.EntityRuler at 0x7ff2612cd240>

In [8]:
# Check the pipeline. The entity_ruler should happen *before* ner.

nlp.pipeline

[('transformer',
  <spacy_transformers.pipeline_component.Transformer at 0x7ff29f6a9b30>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7ff2a1a82630>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7ff2a1b02520>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7ff2a1b2e580>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7ff2a1b373c0>),
 ('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x7ff2612cd240>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7ff2a1b02b20>)]
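
Rather than reading the order off the printed list, the check can also be done in code. This is a small sketch; `runs_before` is a hypothetical helper, and the list of names is copied from the output above so that the example runs on its own. With a live pipeline you could pass `nlp.pipe_names` instead.

```python
def runs_before(pipe_names, first, second):
    """Return True if component `first` comes earlier than `second` in the pipeline order."""
    return pipe_names.index(first) < pipe_names.index(second)

# Component names copied from the pipeline output above
pipe_names = ["transformer", "tagger", "parser", "attribute_ruler",
              "lemmatizer", "entity_ruler", "ner"]

# Fails loudly if the entity ruler was added in the wrong place
assert runs_before(pipe_names, "entity_ruler", "ner")
print("entity_ruler runs before ner")
```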

# Test if the new entities (SPECIES) are identified in the new pipeline

In [20]:
# look at the first sentence. I know there's a species name in there. 

doc_new = nlp(texts[0])

In [21]:
displacy.render(doc_new, style="ent")

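Now we need to define a custom function that returns the entities labeled SPECIES.

In [None]:
def cust_ruler(sent):
    #Run the pipeline (now including the entity ruler) on the text
    doc_new = nlp(sent)
    #Collect every SPECIES entity, rather than returning only the first label
    return [(ent.text, ent.label_) for ent in doc_new.ents if ent.label_ == 'SPECIES']

In [None]:
#Run it on the first example sentence and put the results in a data frame
data = pd.DataFrame(cust_ruler(texts[0]), columns=['entity', 'label'])

In [None]:
data.head()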