# A notebook to test a custom entity set for doing NER on the Stanford Oceanographic Expeditions, looking for words associated with neuston. 

In partnership with Rebecca Helm, Assistant Professor, University of North Carolina, Asheville and Research Associate, Smithsonian Institution, NMNH

The setup blocks for this notebook are adapted from portions of The Data-Sitters Club, specifically this notebook:

Skallerup Bessette, Lee and Quinn Dombrowski. “DSC Multilingual Mystery 2: Beware, Lee and Quinn!”. February 27, 2020. https://datasittersclub.github.io/site/dscm2.html.

I am so grateful for the work that Quinn so generously documents and shares openly. 

## 1. Downloading spaCy models

The first step is to download the spaCy model. The model has been pre-trained on annotated English corpora. You only have to run these code cells below the first time you run the notebook; after that, you can skip right to step 2 and carry on from there. (If you run them again later, nothing bad will happen; it’ll just download again.) You can also run spaCy in other notebooks on your computer in the future, and you’ll be able to skip the step of downloading the models.

In [1]:
#Installs the English spaCy model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
     |████████████████████████████████| 13.6 MB 7.0 MB/s
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
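
Since the download only needs to happen once, a quick check can tell you whether the model package is already installed. This is a minimal sketch using only the Python standard library; `model_installed` is a hypothetical helper name I made up for illustration, not part of spaCy.

```python
import importlib.util

def model_installed(package_name):
    """Return True if a Python package with this name can be found for import."""
    return importlib.util.find_spec(package_name) is not None

# en_core_web_sm is the model package name used in this notebook
if model_installed("en_core_web_sm"):
    print("Model already installed; skip to step 2.")
else:
    print("Model not found; run the download cell above.")
```

The check only looks for the package; it does not confirm that the installed model version matches your spaCy version.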


## 2. Importing spaCy and setting up NLP

Run the code cell below to import the spaCy module and load the English model, which runs the NLP algorithms (including named-entity recognition).

In [2]:
#Imports spaCy
import spacy

#Imports the English model
import en_core_web_sm

#Loads the English model as nlp, so you can call nlp(text) to run it on texts
nlp = en_core_web_sm.load()

## 3. Importing other modules

There are various other modules that will be useful in this notebook. This code cell imports all of them, and the code comments explain what each one is for.

In [3]:
#io is used for opening and writing files
import io
#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob
#os is used to navigate your folder directories (e.g. change folders to where your files are stored)
import os
# for handling data frames, etc.
import pandas as pd
# Import the spaCy visualizer
from spacy import displacy
# Import the Entity Ruler for making custom entities
from spacy.pipeline import EntityRuler

## 4. Directory setup

Assuming you’re running Jupyter Notebook from the Binder I've set up for our GitHub repo, these are all relative paths.

In [4]:
#Define the file directory here
filedirectory = '../data'

#Change the working directory to the one you just defined
os.chdir(filedirectory)

# Tagging custom entities with spaCy

A custom set of words ("entities") was created for this exploration. We'll load them into spaCy so they can be tagged via the NER pipeline. We already loaded spaCy's model `en_core_web_sm` as "nlp" in a prior code chunk. Now we need to create a new pipeline step using Entity Ruler (https://spacy.io/api/entityruler). We use HELM as the entity label.

I created a JSONL file (really a text file with the file extension changed, lol) with the required format:

>{"label": "HELM", "pattern": "raft"}

>{"label": "HELM", "pattern": "wood"}

>{"label": "HELM", "pattern": "logs"}

and saved it as helm.jsonl
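
If your word list gets long, typing those lines by hand gets tedious. Here is a minimal sketch of generating the same format with Python's built-in `json` module. The `words` list is just the three examples above, and the file is written to a temporary folder (my assumption, so the sketch doesn't overwrite the real helm.jsonl).

```python
import json
import tempfile
from pathlib import Path

# The three example words from above; extend this list with your own terms.
words = ["raft", "wood", "logs"]

# Written to a temporary folder here so the sketch doesn't touch the real helm.jsonl.
patterns_path = Path(tempfile.mkdtemp()) / "helm.jsonl"
with patterns_path.open("w", encoding="utf-8") as out:
    for word in words:
        # One JSON object per line is what makes the file JSONL
        out.write(json.dumps({"label": "HELM", "pattern": word}) + "\n")

print(patterns_path.read_text(encoding="utf-8"))
```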

Now, we will add these entities to the NLP pipeline by using the Entity Ruler (the Ruler of All Entities?).

This is the spaCy NLP pipeline (from spacy.io)

<img src="https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg" alt="NLP-Pipeline" width="600"/>

## Define an entity ruler

We need to tell the pipeline where we want to add in the new pipeline component. We want it to add the new entity ruler *before* we do the NER step.

In [5]:
ruler = nlp.add_pipe("entity_ruler", before='ner') # <- this is directly from spacy documentation

In [8]:
# Load the new pattern (your list of custom entities) by adding them from the properly formatted jsonl file.
ruler.from_disk('../docs/helm.jsonl')

<spacy.pipeline.entityruler.EntityRuler at 0x7fdefe63f440>
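
To see what the Entity Ruler does on its own, here is a small sketch that uses a blank English pipeline (tokenizer only, so no model download is needed) and adds patterns in code rather than from the JSONL file. The sentence is made up for the demo, and the names `nlp_demo` and `ruler_demo` are mine. It also shows one thing to watch for: a plain string pattern like "raft" is matched exactly, so "Raft" is not tagged, while a token pattern using `LOWER` ignores case.

```python
import spacy

# A blank English pipeline: tokenizer only, no pre-trained model required
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns([
    {"label": "HELM", "pattern": "raft"},               # exact string match (case-sensitive)
    {"label": "HELM", "pattern": [{"LOWER": "logs"}]},  # token pattern, case-insensitive
])

# A made-up sentence, just for the demo
demo_doc = nlp_demo("Logs and a raft drifted past; the Raft was covered in barnacles.")
demo_ents = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(demo_ents)
```

If capitalized forms matter for your texts, the `LOWER` style of pattern is one way to catch them without listing every variant.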

## NOTE: If all of the tags are distracting after you run this, come back to the line below and un-comment it. The pipeline will then no longer tag the default spaCy entities, like `LOC`, `DATE`, and `TIME`.

In [88]:
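# Un-comment the next line to drop the default NER step. (Don't assign the result to ruler; that would overwrite the entity ruler variable.)
# nlp.remove_pipe("ner")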

In [89]:
# Check the pipeline. The entity_ruler should be listed *before* ner.
# (If you removed ner above, as in the output below, entity_ruler will be the last step.)
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fdf00cfba90>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7fdf00cdd4a0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fdf00b4dbe0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7fdf00d20740>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fdf00d79200>),
 ('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x7fdefe63f440>)]

# Test if the new entities (HELM) are identified in the new pipeline

## these are your files to choose from:

In [75]:
for filename in sorted(os.listdir(filedirectory)):
    print(filename)

.DS_Store
Cruise-01-Installment-01-Narrative.txt
Cruise-01-Installment-02-Narrative.txt
Cruise-01-Installment-03-Narrative.txt
Cruise-01-Installment-04-Narrative.txt
Cruise-01-Installment-05-Narrative.txt
Cruise-02-Additional-Post-Cruise-Narrative.txt
Cruise-02-Installment-06-Narrative.txt
Cruise-02-Installment-07-Narrative.txt
Cruise-02-Installment-08-Narrative.txt
Cruise-02-Installment-09-Narrative.txt
Cruise-03-Installment-10-Narrative.txt
Cruise-04-Installment-11-Narrative.txt
Cruise-04-Installment-12-Narrative.txt
Cruise-04-Installment-13-Narrative.txt
Cruise-05-Abbott-Letter.txt
Cruise-06-Installment-14-Narrative.txt
Cruise-06-Installment-15-Narrative.txt
Cruise-06-Installment-16-Narrative.txt
Cruise-07-Installment-17-Narrative.txt
Cruise-07-Installment-18-Narrative.txt
Cruise-07-Installment-19-Narrative.txt
Cruise-07-Installment-20-Narrative.txt
Cruise-08-Installment-21-Narrative.txt
Cruise-08-Installment-22-Narrative.txt
Cruise-08-Installment-23-Narrative.txt
Cruise-08-Installm
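
Notice that the listing picks up `.DS_Store`, which isn't one of our texts. Here is a minimal sketch of listing only the text files with `glob` (imported earlier). It builds a throwaway folder with a few stand-in file names so it can run anywhere; in the notebook you would point the pattern at the data folder instead.

```python
import glob
import os
import tempfile

# Build a throwaway folder that mimics the data directory (file names are stand-ins)
demo_dir = tempfile.mkdtemp()
for name in [".DS_Store", "Cruise-01-Installment-01-Narrative.txt", "Cruise-05-Abbott-Letter.txt"]:
    open(os.path.join(demo_dir, name), "w").close()

# Keep only the .txt files, sorted alphabetically
txt_files = sorted(os.path.basename(p) for p in glob.glob(os.path.join(demo_dir, "*.txt")))
for filename in txt_files:
    print(filename)
```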

In [90]:
# copy a filename above and paste it here to process it.
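# encoding='utf-8-sig' reads UTF-8 and drops the invisible byte-order mark at the start of the file, if there is one
with open('../data/Cruise-06-Installment-15-Narrative.txt', encoding='utf-8-sig') as f: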
    text = f.read()

In [91]:
# Look at the first bit of text to be sure it loaded successfully
print(text[0:500])

﻿TE VEGA EXPEDITION
GENERAL NARRATIVE--INSTALLMENT #15

Our departure from beautiful Pulau Gaya was occasioned by imminent exhaustion of fresh water supplies, and perhaps slightly by the reports of pirates 25 miles away. This brought a Malaysian police patrol boat to our side the last night at anchorage; its lights at the reef passage gave us confidence all night.

The run up to Zamboanga took us past the north end of Sibutu, and the mountains at Port Bongao, both originally planned instead of P


In [92]:
# Run the spaCy NLP pipeline
doc = nlp(text)

In [93]:
# You can look at just the text of the results by un-commenting the following line.
# print([(ent.text, ent.label_) for ent in doc.ents])

In [94]:
# This is the visual display of the entity tagging
displacy.render(doc, style="ent", jupyter=True) 

In [86]:
# Want to see how many words were tagged in each entity type?

In [87]:
ent_labels = [e.label_ for e in doc.ents]
freq = dict()
for l in ent_labels:
    freq[l] = ent_labels.count(l)
print(freq)

{'HELM': 6}
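
The same kind of count can be done with `collections.Counter`, which also makes it easy to see which words were tagged most often. This is a sketch with stand-in data: the `tagged` list below is made up, and in the notebook you would build it from `doc.ents` as shown in the comment.

```python
from collections import Counter

# Stand-in (text, label) pairs; in the notebook you would build this list with
# [(ent.text, ent.label_) for ent in doc.ents]
tagged = [("raft", "HELM"), ("logs", "HELM"), ("raft", "HELM"), ("wood", "HELM")]

# How many entities per label
label_counts = Counter(label for _, label in tagged)
# How many times each word was tagged (lower-cased so "Raft" and "raft" count together)
word_counts = Counter(text.lower() for text, _ in tagged)

print(dict(label_counts))
print(word_counts.most_common())
```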


## Note
Everything below here has to do with running the NER and exporting the tagged entities. This has already been done, but I'm leaving it here for reference.

# Exporting NER results

Now that we've figured out the process, it's time to go through each file, do the NLP, and save the results of the NER.

In [20]:
# Sort all the files in the directory you specified above, alphabetically.

# For each of those files...
for filename in sorted(os.listdir(filedirectory)):
    # If the filename ends with .txt (i.e. if it's actually a text file)
    if filename.endswith('.txt'):
        # Write out below the name of the file
        print(filename)
        # The file name of the output file adds _ner to the end of the file name of the input file
        outfilename = filename.replace('.txt', '_ner.txt')
        # Open the input file
        with open(filename, 'r') as f:
            # Read the contents of the input file
            text = f.read()
        # Do NLP on the contents of the input file
        ner = nlp(text)
        # Create a dataframe with the NER results for *this* file (ner.ents, not the earlier doc.ents)
        data = pd.DataFrame([(ent.text, ent.label_) for ent in ner.ents], columns=["Text", "Label"])
        # Save results as CSV file (pandas creates the output file itself)
        data.to_csv(outfilename, index_label="Index")

Cruise-01-Installment-01-Narrative.txt
Cruise-01-Installment-02-Narrative.txt
Cruise-01-Installment-03-Narrative.txt
Cruise-01-Installment-04-Narrative.txt
Cruise-01-Installment-05-Narrative.txt
Cruise-02-Additional-Post-Cruise-Narrative.txt
Cruise-02-Installment-06-Narrative.txt
Cruise-02-Installment-07-Narrative.txt
Cruise-02-Installment-08-Narrative.txt
Cruise-02-Installment-09-Narrative.txt
Cruise-03-Installment-10-Narrative.txt
Cruise-04-Installment-11-Narrative.txt
Cruise-04-Installment-12-Narrative.txt
Cruise-04-Installment-13-Narrative.txt
Cruise-05-Abbott-Letter.txt
Cruise-06-Installment-14-Narrative.txt
Cruise-06-Installment-15-Narrative.txt
Cruise-06-Installment-16-Narrative.txt
Cruise-07-Installment-17-Narrative.txt
Cruise-07-Installment-18-Narrative.txt
Cruise-07-Installment-19-Narrative.txt
Cruise-07-Installment-20-Narrative.txt
Cruise-08-Installment-21-Narrative.txt
Cruise-08-Installment-22-Narrative.txt
Cruise-08-Installment-23-Narrative.txt
Cruise-08-Installment-24-Nar
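
One thing to be aware of with the loop above: the `_ner.txt` output files land in the same folder as the inputs and also end in `.txt`, so a second run would pick them up as inputs. Here is a sketch of one way around that, writing the results to a separate folder with a `.csv` extension. It uses only the standard library so it can run without spaCy; `export_entities` and `toy_tag` are hypothetical names, and the toy tagger is just a stand-in for the real pipeline.

```python
import csv
import tempfile
from pathlib import Path

def export_entities(in_dir, out_dir, tag):
    """Write one CSV of (Text, Label) rows per .txt file in in_dir.

    `tag` is any function that takes a string and returns (text, label) pairs.
    In the notebook it could be:
        lambda text: [(e.text, e.label_) for e in nlp(text).ents]
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(in_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        out_path = out_dir / (path.stem + "_ner.csv")
        with out_path.open("w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["Text", "Label"])
            # Rows come from tagging this file's own text
            writer.writerows(tag(text))
        written.append(out_path)
    return written

# Toy run: a stand-in tagger that labels the word "raft" wherever it appears
def toy_tag(text):
    return [(word, "HELM") for word in text.split() if word == "raft"]

work = Path(tempfile.mkdtemp())
(work / "in").mkdir()
(work / "in" / "a.txt").write_text("a raft and a raft", encoding="utf-8")
(work / "in" / "b.txt").write_text("no matches here", encoding="utf-8")

results = export_entities(work / "in", work / "out", toy_tag)
print([p.name for p in results])
```

Passing the tagger in as a function keeps the file handling separate from the NLP step, so the same loop works whether or not the default `ner` step is in the pipeline.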

---
Now, create one file with all of the NER results together

In [60]:
# Collect the NER results from every file into one list of dataframes.
# The file name of the combined output file (written in a later cell)
outfilename = 'soe-ner-all.txt'
df = []
for filename in sorted(os.listdir(filedirectory)):
    # If the filename ends with .txt (i.e. if it's actually a text file)
    if filename.endswith('.txt'):
        # Write out below the name of the file
        print(filename)
        # Open the input file
        with open(filename, 'r') as f:
            # Read the contents of the input file
            text = f.read()
        # Do NLP on the contents of the input file
        ner = nlp(text)
        # Create a dataframe with this file's NER results (ner.ents, not the earlier doc.ents)
        df2 = pd.DataFrame([(filename, ent.text, ent.label_) for ent in ner.ents], columns=["File", "Text", "Label"])
        df.append(df2)

Cruise-01-Installment-01-Narrative.txt
Cruise-01-Installment-02-Narrative.txt
Cruise-01-Installment-03-Narrative.txt
Cruise-01-Installment-04-Narrative.txt
Cruise-01-Installment-05-Narrative.txt
Cruise-02-Additional-Post-Cruise-Narrative.txt
Cruise-02-Installment-06-Narrative.txt
Cruise-02-Installment-07-Narrative.txt
Cruise-02-Installment-08-Narrative.txt
Cruise-02-Installment-09-Narrative.txt
Cruise-03-Installment-10-Narrative.txt
Cruise-04-Installment-11-Narrative.txt
Cruise-04-Installment-12-Narrative.txt
Cruise-04-Installment-13-Narrative.txt
Cruise-05-Abbott-Letter.txt
Cruise-06-Installment-14-Narrative.txt
Cruise-06-Installment-15-Narrative.txt
Cruise-06-Installment-16-Narrative.txt
Cruise-07-Installment-17-Narrative.txt
Cruise-07-Installment-18-Narrative.txt
Cruise-07-Installment-19-Narrative.txt
Cruise-07-Installment-20-Narrative.txt
Cruise-08-Installment-21-Narrative.txt
Cruise-08-Installment-22-Narrative.txt
Cruise-08-Installment-23-Narrative.txt
Cruise-08-Installment-24-Nar

In [63]:
df = pd.concat(df)

In [64]:
# Save results as CSV file
df.to_csv(outfilename, index_label= "Index")

# Exporting annotation visuals

THIS SECTION DOESN'T WORK YET.

In [21]:
svg = displacy.render(doc, style="ent", jupyter=False)

In [22]:
outfilename = filename.replace('.txt', '.svg')

In [23]:
from pathlib import Path

In [24]:
output_path = Path("/images/" + outfilename)
print(output_path)

/images/Cruise-20-Installment-Narrative.svg


In [25]:
# This doesn't work: outfilename is a plain string, so it has no .open() method (the Path object is output_path).
outfilename.open("w", encoding="utf-8").write(svg)

AttributeError: 'str' object has no attribute 'open'

In [26]:
html = displacy.render(doc, style="ent", page=True)
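
A possible way forward for this section, offered as a sketch rather than a tested fix for this repo: the `AttributeError` above comes from calling `.open` on the filename string instead of on the `Path` object. Also, displaCy's `ent` style produces HTML markup rather than SVG, so an `.html` extension is a better fit. The example below uses a blank pipeline, a made-up sentence, and a temporary folder (all my assumptions) so it runs without the model or the `/images/` folder.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy

# Blank pipeline plus one HELM pattern, just to have something to render
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns([{"label": "HELM", "pattern": "raft"}])
demo_doc = nlp_demo("A raft drifted past the ship.")

# style="ent" returns HTML markup; jupyter=False makes render() return
# the markup as a string instead of displaying it in the notebook
html = displacy.render(demo_doc, style="ent", page=True, jupyter=False)

# Call the file methods on the Path object, not on the filename string
output_path = Path(tempfile.mkdtemp()) / "demo_ner.html"
output_path.write_text(html, encoding="utf-8")
print(output_path)
```

In the notebook, the same two lines at the end should work with the real `doc`, as long as the output folder exists and is a path you can write to.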