# A notebook to test a custom entity set for doing NER on the Stanford Oceanographic Expeditions, looking for words associated with neuston. 

In partnership with Rebecca Helm, Assistant Professor, University of North Carolina, Asheville and Research Associate, Smithsonian Institution, NMNH

The setup blocks for this notebook are adapted from portions of The Data-Sitters Club, specifically this notebook:

Skallerup Bessette, Lee, and Quinn Dombrowski. “DSC Multilingual Mystery 2: Beware, Lee and Quinn!” February 27, 2020. https://datasittersclub.github.io/site/dscm2.html.

I am so grateful for the work that Quinn so generously documents and shares openly. 

## 1. Downloading spaCy models

The first step is to download the spaCy model. The model has been pre-trained on annotated English corpora. You only have to run these code cells below the first time you run the notebook; after that, you can skip right to step 2 and carry on from there. (If you run them again later, nothing bad will happen; it’ll just download again.) You can also run spaCy in other notebooks on your computer in the future, and you’ll be able to skip the step of downloading the models.

In [2]:
#Installs the English spaCy model
!python -m spacy download en_core_web_trf

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz (460.2 MB)


## 2. Importing spaCy and setting up NLP

Run the code cell below to import the spaCy module and load the English model. Loading the model creates a function, `nlp`, that runs the NLP pipeline (including named-entity recognition) on a text.

In [3]:
#Imports spaCy
import spacy

#Imports the English model (the same one downloaded in step 1)
import en_core_web_trf

#Loads the English model as a function you can run on texts
nlp = en_core_web_trf.load()
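
If you can't remember whether you've already done the download step, one way to check is to ask Python whether the model package can be found. This is a minimal sketch using only the standard library; `model_installed` is a hypothetical helper name, and the two package names are just examples of spaCy's English models.

```python
import importlib.util

def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None

# spaCy models are installed as ordinary Python packages, so this works for them too
for name in ("en_core_web_sm", "en_core_web_trf"):
    status = "installed" if model_installed(name) else "not installed; run the download cell"
    print(f"{name}: {status}")
```

If the model you want shows up as not installed, go back and run the download cell in step 1.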

## 3. Importing other modules

There are various other modules that will be useful in this notebook. The code comments explain what each one is for. This code cell imports all of them.

In [4]:
#io is used for opening and writing files
import io

#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob

#os is used to navigate your folder directories (e.g. change folders to where your files are stored)
import os

# for handling data frames, etc.
import pandas as pd

# Import the spaCy visualizer
from spacy import displacy

# Import the Entity Ruler for making custom entities
from spacy.pipeline import EntityRuler

## 4. Directory setup

Assuming you’re running Jupyter Notebook from your computer’s home directory, this code cell gives you the opportunity to change directories, into the directory where you’re keeping your project files. I've put just a few of the ANSP volumes into a folder called `subset`.

In [5]:
#Define the file directory here
filedirectory = '../data'

#Change the working directory to the one you just defined
os.chdir(filedirectory)

# Tagging custom entities with spaCy

I'm going to be using the Global Names Recognition and Discovery tools to automate identifying genus and species names in the ANSP volumes. From there, I'll need to load the species names into spaCy so they can be tagged via the NER pipeline. Let's do a small test on our example from above.

We already loaded spaCy's model `en_core_web_trf` as "nlp" in a prior code chunk. Now we need to create a new pipeline step using Entity Ruler (https://spacy.io/api/entityruler). We use the scientific names as our new entity pattern, and use SPECIES as the entity label.

These are the species names mentioned above, plus one I added just as a test.
species = ['Ariolimax niger', 'A. Californicus', 'Homo sapiens']

I created a JSONL file (really a text file with the file extension changed, lol) with the required format:

>{"label": "SPECIES", "pattern": "Ariolimax niger"}  
>{"label": "SPECIES", "pattern": "A. Californicus"}  
>{"label": "SPECIES", "pattern": "Homo sapiens"}

and saved it as species-examples.jsonl
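
Instead of typing the file by hand, the same lines can be generated from the `species` list. This is a minimal sketch using the standard `json` module; `to_jsonl` is a hypothetical helper name, and the commented-out last line assumes you want the same file name as above.

```python
import json

species = ['Ariolimax niger', 'A. Californicus', 'Homo sapiens']

def to_jsonl(names, label="SPECIES"):
    """Build one JSON object per line, in the pattern format the Entity Ruler reads."""
    lines = [json.dumps({"label": label, "pattern": name}) for name in names]
    return "\n".join(lines) + "\n"

jsonl_text = to_jsonl(species)
print(jsonl_text)

# To save it, uncomment the next line:
# open("species-examples.jsonl", "w", encoding="utf-8").write(jsonl_text)
```

Each printed line should look like the examples above, one pattern per line.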

Now, we will add these entities to the NLP pipeline by using the Entity Ruler (the Ruler of All Entities?).

This is the spaCy NLP pipeline (from spacy.io)

<img src="https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg" alt="NLP-Pipeline" width="600"/>

## Define an entity ruler

We need to tell the pipeline where we want to add in the new pipeline component. We want it to add the new entity ruler *before* we do the NER step.

In [6]:
ruler = nlp.add_pipe("entity_ruler", before='ner') # <- this is directly from spacy documentation

In [7]:
# Load the new pattern (your list of custom entities) by adding them from the properly formatted jsonl file.
ruler.from_disk("/Users/thalassa/Notebooks/helm.jsonl")

<spacy.pipeline.entityruler.EntityRuler at 0x7fd160aaf440>

In [8]:
# ruler = nlp.remove_pipe("ner")

In [9]:
# Check the pipeline. The entity_ruler should be listed *before* ner.
nlp.pipeline

[('transformer',
  <spacy_transformers.pipeline_component.Transformer at 0x7fd1626a8ef0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7fd162704d60>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fd1626fbb80>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7fd1627470c0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fd162747f40>),
 ('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x7fd160aaf440>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fd1627641c0>)]

# Test if the new entities (HELM) are identified in the new pipeline

In [10]:
with open("/Users/thalassa/Google Drive/Shared drives/Miller Library/Collections/StanfordOceanographicExpeditions/OLD_TE VEGA NARRATIVES/txt/Cruise 01 Installment 01 Narrative_MT.txt") as f:
    text = f.read()

In [11]:
# Look at the first bit of text (slice from the start so the first character isn't dropped)
print(text[:500])

TE VEGA EXPEDITIONS MAIDEN VOYAGE--GENERAL NARRATIVE 

July 14. 1963. In the prospectus for Cruise 1, TE VEGA was scheduled to sail "on or about June 17." Toward the first of June, it became clear that there would be a short delay and the students were notified that they should await further orders before joining the ship. As time went on, all hope of having the ship leave from Monterey was abandoned, and the students were finally instructed to join the ship in fan Diego on June 30th, in time fo


In [12]:
# Run the full pipeline (including the new entity ruler and the default NER) on the text
doc = nlp(text)

In [13]:
# look at the results
# print([(ent.text, ent.label_) for ent in doc.ents])

In [15]:
displacy.render(doc, style="ent", jupyter=True)

In [41]:
# put the results into a dataframe - why not?!
data = pd.DataFrame([ (ent.text, ent.label_ ) for ent in doc.ents], columns=["Text", "Label"] )

In [42]:
# Let's check it out
data.head()

|   | Text | Label |
|---|------|-------|
| 0 | 13 | CARDINAL |
| 1 | August 15th | DATE |
| 2 | Mayotte | GPE |
| 3 | Comoro | LOC |
| 4 | fish | HELM |
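
Since the point is to find the neuston-related words, the rows that matter most are the ones labelled HELM. Here is a minimal sketch in plain Python, using the five rows shown above as sample data; `rows`, `label_counts`, and `helm_terms` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Sample (text, label) pairs copied from the first five rows of the results
rows = [
    ("13", "CARDINAL"),
    ("August 15th", "DATE"),
    ("Mayotte", "GPE"),
    ("Comoro", "LOC"),
    ("fish", "HELM"),
]

# How many entities were found for each label
label_counts = Counter(label for _, label in rows)

# Just the terms tagged with the custom HELM label
helm_terms = [text for text, label in rows if label == "HELM"]

print(label_counts["HELM"])  # 1
print(helm_terms)            # ['fish']
```

On a real run, the same list comprehension over `doc.ents` (using `ent.text` and `ent.label_`) would give the pairs to count.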


# Exporting NER results

Now that we've figured out the process, it's time to go through each file, do the NLP, and save the results of the NER.

In [32]:
#Sort all the files in the directory you specified above, alphabetically.

#For each of those files...
for filename in sorted(os.listdir(filedirectory)):
    #If the filename ends with .txt (i.e. if it's actually a text file)
    if filename.endswith('.txt'):
        #Write out below the name of the file
        print(filename)
        #The output file name adds _ner to the input file name and uses .csv,
        #so the results aren't picked up as input texts on a later run
        outfilename = filename.replace('.txt', '_ner.csv')
        # Open the input filename
        with open(filename, 'r') as f:
            # Read the contents of the input file
            text = f.read()
        # Do NLP on the contents of the input file
        ner = nlp(text)
        # Create a dataframe with the NER results for this file (ner.ents, not the earlier doc.ents)
        data = pd.DataFrame([(ent.text, ent.label_) for ent in ner.ents], columns=["Text", "Label"])
        # Save results as CSV file
        data.to_csv(outfilename, index_label="Index")

Installment 01 Narrative.txt
Installment 02 Narrative.txt
Installment 03 Narrative.txt
Installment 04 Narrative.txt
Installment 05 Narrative.txt
Installment 06 Narrative.txt
Installment 07 Narrative.txt
Installment 08 Narrative.txt
Installment 09 Narrative.txt
Installment 11 Narrative.txt
Installment 12 Narrative.txt
Installment 13 Narrative.txt


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 17: invalid start byte
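
That error means one of the files isn't saved as UTF-8. Byte `0xa3` is the pound sign (£) in Windows-1252 and Latin-1, so one possible workaround is to try a few encodings in order. This is a minimal sketch, not a confirmed fix: I don't know the real encoding of the narrative files, `read_text_with_fallback` is a hypothetical helper, and the list of encodings is an assumption.

```python
import tempfile
from pathlib import Path

def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"None of {encodings} could decode {path}")

# Demo on a throwaway file containing the same 0xa3 byte as in the traceback
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"Harbour dues: \xa3 2")
    decoded_text, encoding_used = read_text_with_fallback(sample)

print(encoding_used)             # cp1252
print("\u00a3" in decoded_text)  # True: the pound sign was recovered
```

In the loop, the `open(...)` and `f.read()` steps could be swapped for a call to this helper. It's worth printing which encoding was used for each file, since a wrong guess can silently produce odd characters.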

# Exporting annotation visuals

In [51]:
svg = displacy.render(doc, style="ent", jupyter=False)

In [53]:
outfilename = filename.replace('.txt', '.svg')

Installment 13 Narrative.svg


In [68]:
from pathlib import Path

In [70]:
output_path = Path("/images/" + outfilename)
print(output_path)

/images/Installment 13 Narrative.svg


In [73]:
# outfilename is a plain string, so it has no .open(); write through the Path object instead
output_path.open("w", encoding="utf-8").write(svg)

In [74]:
html = displacy.render(doc, style="ent", page=True)
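
With `style="ent"`, displaCy returns HTML markup rather than SVG, so saving the result with an `.html` extension may be the safer choice. This is a minimal sketch of writing that markup to disk with `pathlib`; `save_markup` is a hypothetical helper, the markup string is a stand-in for what `displacy.render(..., jupyter=False)` returns, and the demo writes into a temporary folder rather than a real `images` directory.

```python
import tempfile
from pathlib import Path

def save_markup(markup: str, folder: Path, source_name: str, suffix: str = ".html") -> Path:
    """Write displaCy markup into `folder`, naming the file after the source text file."""
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if it doesn't exist yet
    out_path = folder / Path(source_name).with_suffix(suffix).name
    out_path.write_text(markup, encoding="utf-8")
    return out_path

# Stand-in for: displacy.render(doc, style="ent", jupyter=False, page=True)
markup = "<html><body><mark>fish</mark></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_markup(markup, Path(tmp) / "images", "Installment 13 Narrative.txt")
    saved_name = saved.name
    round_trip = saved.read_text(encoding="utf-8")

print(saved_name)  # Installment 13 Narrative.html
```

Using a relative folder (for example `Path("images")`) keeps the output next to the notebook's working directory instead of at the root of the file system.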