## 1. Downloading spaCy models

The first step is to download the spaCy model, which has been pre-trained on annotated English corpora. You only need to run the code cells below the first time you use this notebook; after that, you can skip straight to step 2. (Running them again does no harm; the model is simply downloaded again.) Once the model is installed, you can also use spaCy in other notebooks on this computer without repeating the download.
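
Because the download is large (about 460 MB in the output further down), it can be worth checking whether the model package is already installed before running the install cell. A minimal sketch, assuming the model is importable under the package name `en_core_web_trf` (as the import in step 2 suggests); the helper name `model_installed` is made up for illustration:

```python
import importlib.util


def model_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


if model_installed("en_core_web_trf"):
    print("Model already installed; skip to step 2.")
else:
    print("Model not found; run the install cell below.")
```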

In [1]:
#Imports the module you need to download and install the spaCy models
import sys

In [2]:
#Installs the English spaCy model
!{sys.executable} -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz (460.2 MB)
     |████████████████████████████████| 460.2 MB 14 kB/s
  Preparing metadata (setup.py) ... done


## 2. Importing spaCy and setting up NLP

Run the code cell below to import the spaCy module and the English model, which is loaded later in the notebook to run the NLP pipeline (including named-entity recognition).

In [3]:
#Imports spaCy
import spacy

#Imports the English model
import en_core_web_trf

## 3. Importing other modules

Various other modules will be useful in this notebook. The code cell below imports all of them, and the code comments explain what each one is for.

In [4]:
#io is used for opening and writing files
import io

#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob

#os is used to navigate your folder directories (e.g. change folders to where your files are stored)
import os

# for handling data frames, etc.
import pandas as pd

# Import the spaCy visualizer
from spacy import displacy

# Import the Entity Ruler for making custom entities
from spacy.pipeline import EntityRuler

# datetime provides date and time utilities
import datetime


## 4. Directory setup

Assuming you’re running Jupyter Notebook from your computer’s home directory, this code cell lets you change into the directory where you keep your project files. I've put just a few of the ANSP volumes into a folder called `subset`.

In [5]:
#Define the file directory here
filedirectory = '/Users/thalassa/Rcode/blog/data/animals-clean'

#Change the working directory to the one you just defined
os.chdir(filedirectory)
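
A quick check that the folder exists and holds the expected text files can save confusion later. The sketch below is one way to do that with `os` and `glob`; `list_text_files` is a hypothetical helper, and the `*.txt` pattern assumes the volumes are stored as plain-text files.

```python
import glob
import os


def list_text_files(directory, pattern="*.txt"):
    """Return the sorted paths of files in `directory` matching `pattern`."""
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"No such directory: {directory}")
    return sorted(glob.glob(os.path.join(directory, pattern)))
```

For example, `list_text_files(filedirectory)` would list the text files in the folder defined above.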

In [6]:
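#json is used to read the list of taxonomic names
import json

#nltk provides sentence and word tokenizers and a part-of-speech tagger
import nltk
from nltk import tokenize, pos_tag_sents
from nltk.tokenize import word_tokenize, sent_tokenize

#Download the tagger model and the tag documentation (skipped if already up to date)
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')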

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/thalassa/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     /Users/thalassa/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [8]:
with open('/Users/thalassa/Rcode/blog/data/animals-clean/44pg098-clean-taxa.txt', 'r') as f:
    text = f.read()

In [9]:
sentences = tokenize.sent_tokenize(text)

In [10]:
print(sentences)

['THE BIRDS OF SOUTHEASTERN TEXAS AND SOUTHERN ARIZONA OBSERVED DURING MAY JUNE AND JULY 1891.', 'BY SAMUEL N. RHOADS.', 'With the idea of investigating the avifauna of the southern border of the United States and collectiug a series of the birds of Florida Texas and Arizona I left Philadelphia March 26th 1891 arriving at Jacksonville Florida on the fifth of the following month.', 'A sojourn of five weeks was made in the southwestern part of the state and considerable collections obtained.', 'Few facts additional to what has been already written on the bird life of this region were ascertained and it is not my intention to treat in detail of this part of the trip.', "I arrived at Corpus Christi Texas May 17th and here a three weeks' stay was made.", 'I then journeyed westward to Tucson Arizona arriving on the tenth of June and collecting birds in the immediate vicinity until the nineteenth.', 'That morning I took stage for Oracle a posthamlet situated in the oak belt forty miles north 

In [11]:
def load_json_data(file):
    """Read a JSON file and return its contents."""
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

In [12]:
def load_text_data(file):
    """Read a text file and return its contents as a single string."""
    with open(file, 'r') as f:
        text = f.read()
    return text

In [13]:
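def create_tagged_sents(textfile, jsonfile):
    """Return the sentences in textfile that contain at least one token from the taxa list."""
    taxa = set(load_json_data(jsonfile))
    text = load_text_data(textfile)
    sentences_with_taxa = []
    for sen in sent_tokenize(text):
        tokens = word_tokenize(sen)
        #Keep the sentence if any of its tokens appears in the taxa list
        if taxa.intersection(tokens):
            sentences_with_taxa.append(sen)
    return sentences_with_taxa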
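
The filter above keeps a sentence when at least one of its tokens appears in the taxa list. Because the comparison is made token by token, only single-word entries can match, and the match is case-sensitive. The sketch below isolates that logic; `sentences_matching_terms` is a hypothetical name, and it uses plain whitespace splitting in place of NLTK's `word_tokenize` so that it runs without any downloads.

```python
def sentences_matching_terms(sentences, terms):
    """Keep sentences that share at least one whitespace-delimited token with `terms`."""
    term_set = set(terms)
    matches = []
    for sentence in sentences:
        tokens = set(sentence.split())
        # A non-empty intersection means at least one token is a listed term
        if tokens & term_set:
            matches.append(sentence)
    return matches


sample = [
    "Ardea herodias Linn.",
    "I arrived at Corpus Christi Texas May 17th.",
]
print(sentences_matching_terms(sample, ["Ardea", "Larus"]))  # ['Ardea herodias Linn.']
```

A two-word entry such as `"Ardea herodias"` would never match under this approach, which is one reason the notebook later uses a spaCy entity ruler for multi-token names.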

In [14]:
sent_with_taxa = create_tagged_sents("/Users/thalassa/Rcode/blog/data/animals-clean/44pg098-clean-taxa.txt", "/Users/thalassa/Rcode/blog/data/ansp-taxa-clean.json")

In [15]:
print(sent_with_taxa)

["Mr. Beckham's personal observations of Texan birds terminated in March and so far as I can discover very few if any of our observers have recorded data relating to the early summer birds of the Corpus Christi region Dresser's summer notes relating chiefly to the vicinity of San Antonio\nFurther description of the region included in the following notes would be superfluous after all that the aforementioned authors have written on the subject.", 'Larus atricilla Linn.', 'Larus franklinii Sw.  Rich.', 'Phalacrocorax mexicanus Brandt.', 'Anas fulvigula maculosa Senn..\nMottled Duck.', 'According to Mr. Priour the Spoonbill attains its maximum plumage development some time in January but he was unable to state whether this was due to a second moult in December or whether there is merely a wearing away of the tips of the feathers as in Agelaius and other birds.', 'Botaurus exilisGmel..', 'Ardea herodias Linn.', 'Ardea egretta GmeL American Egret.', 'Ardea tricolor ruficoUis Gosse.', 'Ardea

In [18]:
with open('44pg098_sents_with_taxa.txt', 'w') as f:
    for item in sent_with_taxa:
        f.write("%s\n" % item)

In [19]:
with open('44pg098_sents.txt', 'w') as f:
    for item in sentences:
        f.write("%s\n" % item)
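
Some of the sentences printed earlier contain embedded newline characters (for example, `...San Antonio\nFurther description...`), so writing each one followed by `\n` does not guarantee one sentence per line in the output file. A minimal sketch of one workaround, collapsing internal whitespace before writing; `write_one_per_line` is a hypothetical helper, not part of NLTK:

```python
def write_one_per_line(sentences, path):
    """Write each sentence on its own line, collapsing internal whitespace."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            # split() with no arguments breaks on any run of whitespace,
            # including newlines, so the rejoined sentence fits on one line
            f.write(" ".join(sentence.split()) + "\n")
```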

## Use NLTK to tag parts of speech

The end goal is to keep only the sentences that contain a verb (separating the wheat from the chaff).

In [20]:
with open('/Users/thalassa/Rcode/blog/data/animals-clean/44pg098_sents.txt', 'r') as f:
    text = f.read()

In [25]:
words_text = nltk.word_tokenize(text)
text_sentence_tokens = sent_tokenize(text)

In [23]:
print(text_sentence_tokens[0:20])

['THE BIRDS OF SOUTHEASTERN TEXAS AND SOUTHERN ARIZONA OBSERVED DURING MAY JUNE AND JULY 1891.', 'BY SAMUEL N. RHOADS.', 'With the idea of investigating the avifauna of the southern border of the United States and collectiug a series of the birds of Florida Texas and Arizona I left Philadelphia March 26th 1891 arriving at Jacksonville Florida on the fifth of the following month.', 'A sojourn of five weeks was made in the southwestern part of the state and considerable collections obtained.', 'Few facts additional to what has been already written on the bird life of this region were ascertained and it is not my intention to treat in detail of this part of the trip.', "I arrived at Corpus Christi Texas May 17th and here a three weeks' stay was made.", 'I then journeyed westward to Tucson Arizona arriving on the tenth of June and collecting birds in the immediate vicinity until the nineteenth.', 'That morning I took stage for Oracle a posthamlet situated in the oak belt forty miles north 

In [26]:
print(words_text[0:10])

['THE', 'BIRDS', 'OF', 'SOUTHEASTERN', 'TEXAS', 'AND', 'SOUTHERN', 'ARIZONA', 'OBSERVED', 'DURING']


In [27]:
tagged_text = nltk.pos_tag(words_text)

In [28]:
print(tagged_text[0:20])

[('THE', 'DT'), ('BIRDS', 'NNP'), ('OF', 'NNP'), ('SOUTHEASTERN', 'NNP'), ('TEXAS', 'NNP'), ('AND', 'NNP'), ('SOUTHERN', 'NNP'), ('ARIZONA', 'NNP'), ('OBSERVED', 'NNP'), ('DURING', 'NNP'), ('MAY', 'NNP'), ('JUNE', 'NNP'), ('AND', 'NNP'), ('JULY', 'NNP'), ('1891', 'CD'), ('.', '.'), ('BY', 'NNP'), ('SAMUEL', 'NNP'), ('N.', 'NNP'), ('RHOADS', 'NNP')]
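
In the output above, almost every token of the all-caps title is tagged `NNP` (proper noun), including `OF`, `AND`, and `OBSERVED`. A quick tally makes this kind of skew easy to spot. The sketch below counts tags with `collections.Counter`; the `sample` list is a few of the pairs printed above, and `tag_counts` is a hypothetical helper.

```python
from collections import Counter


def tag_counts(tagged_tokens):
    """Count how often each POS tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)


# A few (word, tag) pairs copied from the output above
sample = [('THE', 'DT'), ('BIRDS', 'NNP'), ('OF', 'NNP'), ('SOUTHEASTERN', 'NNP'),
          ('TEXAS', 'NNP'), ('AND', 'NNP'), ('1891', 'CD'), ('.', '.')]
print(tag_counts(sample).most_common())
```

Running `tag_counts(tagged_text)` on the full tagged text would give the distribution for the whole document.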


In [29]:
text_word_tokens = []
for sentence_token in text_sentence_tokens:
    text_word_tokens.append(word_tokenize(sentence_token))

In [30]:
print(text_word_tokens[0:10])

[['THE', 'BIRDS', 'OF', 'SOUTHEASTERN', 'TEXAS', 'AND', 'SOUTHERN', 'ARIZONA', 'OBSERVED', 'DURING', 'MAY', 'JUNE', 'AND', 'JULY', '1891', '.'], ['BY', 'SAMUEL', 'N.', 'RHOADS', '.'], ['With', 'the', 'idea', 'of', 'investigating', 'the', 'avifauna', 'of', 'the', 'southern', 'border', 'of', 'the', 'United', 'States', 'and', 'collectiug', 'a', 'series', 'of', 'the', 'birds', 'of', 'Florida', 'Texas', 'and', 'Arizona', 'I', 'left', 'Philadelphia', 'March', '26th', '1891', 'arriving', 'at', 'Jacksonville', 'Florida', 'on', 'the', 'fifth', 'of', 'the', 'following', 'month', '.'], ['A', 'sojourn', 'of', 'five', 'weeks', 'was', 'made', 'in', 'the', 'southwestern', 'part', 'of', 'the', 'state', 'and', 'considerable', 'collections', 'obtained', '.'], ['Few', 'facts', 'additional', 'to', 'what', 'has', 'been', 'already', 'written', 'on', 'the', 'bird', 'life', 'of', 'this', 'region', 'were', 'ascertained', 'and', 'it', 'is', 'not', 'my', 'intention', 'to', 'treat', 'in', 'detail', 'of', 'this', 

In [31]:
text_tagged = pos_tag_sents(text_word_tokens)

In [32]:
print(text_tagged[0:10])

[[('THE', 'DT'), ('BIRDS', 'NNP'), ('OF', 'NNP'), ('SOUTHEASTERN', 'NNP'), ('TEXAS', 'NNP'), ('AND', 'NNP'), ('SOUTHERN', 'NNP'), ('ARIZONA', 'NNP'), ('OBSERVED', 'NNP'), ('DURING', 'NNP'), ('MAY', 'NNP'), ('JUNE', 'NNP'), ('AND', 'NNP'), ('JULY', 'NNP'), ('1891', 'CD'), ('.', '.')], [('BY', 'NNP'), ('SAMUEL', 'NNP'), ('N.', 'NNP'), ('RHOADS', 'NNP'), ('.', '.')], [('With', 'IN'), ('the', 'DT'), ('idea', 'NN'), ('of', 'IN'), ('investigating', 'VBG'), ('the', 'DT'), ('avifauna', 'NN'), ('of', 'IN'), ('the', 'DT'), ('southern', 'JJ'), ('border', 'NN'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('and', 'CC'), ('collectiug', 'VB'), ('a', 'DT'), ('series', 'NN'), ('of', 'IN'), ('the', 'DT'), ('birds', 'NNS'), ('of', 'IN'), ('Florida', 'NNP'), ('Texas', 'NNP'), ('and', 'CC'), ('Arizona', 'NNP'), ('I', 'PRP'), ('left', 'VBD'), ('Philadelphia', 'NNP'), ('March', 'NNP'), ('26th', 'CD'), ('1891', 'CD'), ('arriving', 'NN'), ('at', 'IN'), ('Jacksonville', 'NNP'), ('Florid

In [33]:
#upenn_tagset prints the tag descriptions itself, so no print() is needed
nltk.help.upenn_tagset('V.*')

VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
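
With every sentence tagged, the goal of keeping only sentences that contain a verb comes down to checking each sentence for a tag that starts with `VB`, since the Penn Treebank verb tags listed above all share that prefix. A minimal sketch, assuming input in the shape returned by `pos_tag_sents` (a list of sentences, each a list of `(word, tag)` pairs); `sentences_with_verb` is a hypothetical helper, and the two sample sentences are abridged from the output shown earlier:

```python
def sentences_with_verb(tagged_sentences):
    """Return the sentences (as strings) that contain at least one VB* tag."""
    kept = []
    for tagged in tagged_sentences:
        if any(tag.startswith("VB") for _, tag in tagged):
            kept.append(" ".join(word for word, _ in tagged))
    return kept


sample = [
    [('BY', 'NNP'), ('SAMUEL', 'NNP'), ('N.', 'NNP'), ('RHOADS', 'NNP'), ('.', '.')],
    [('I', 'PRP'), ('left', 'VBD'), ('Philadelphia', 'NNP'), ('.', '.')],
]
print(sentences_with_verb(sample))  # ['I left Philadelphia .']
```

Rejoining tokens with spaces does not restore the original spacing around punctuation. To keep the original text, carry each untokenized sentence alongside its tags and append that instead.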
 

# spaCy NLP

In [36]:
#Loads the English model as a pipeline you can run on texts
nlp = en_core_web_trf.load()

#Add the custom entity set (habitats and taxonomic names)
ruler = nlp.add_pipe("entity_ruler", before='ner')

#This is a large entity set, so it takes a while to load.
ruler.from_disk("/Users/thalassa/streamlit/streamlit-ansp/data/ansp-clean-patterns.jsonl")

<spacy.pipeline.entityruler.EntityRuler at 0x185d35f00>
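
The patterns file loaded above is in spaCy's JSONL pattern format: one JSON object per line, each with a `label` and a `pattern` (either an exact phrase string or a list of token-attribute dictionaries). The contents of `ansp-clean-patterns.jsonl` are not shown in this notebook, so the labels and entries below (`TAXON`, `HABITAT`) are assumptions for illustration only. The sketch writes a small file of that shape using only the standard library; `write_patterns_jsonl` is a hypothetical helper.

```python
import json


def write_patterns_jsonl(patterns, path):
    """Write entity-ruler style patterns to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern, ensure_ascii=False) + "\n")


# Hypothetical entries: the labels and terms are made up for this example.
example_patterns = [
    # Exact phrase pattern
    {"label": "TAXON", "pattern": "Ardea herodias"},
    # Token pattern: matches "oak belt" regardless of case
    {"label": "HABITAT", "pattern": [{"LOWER": "oak"}, {"LOWER": "belt"}]},
]
```

A file written this way could then be loaded with `ruler.from_disk(path)`, as in the cell above.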

## Run code on a single file to see how it works.

In [37]:
doc = nlp(text) #text was loaded in previous steps

In [63]:
rows = []

for token in doc:
    rows.append(
        {
            'Token': token.text, 
            'Lemma': token.lemma_,
            'POS': token.pos_,
            'Tag': token.tag_,
            'Dependency': token.dep_,
            'Head': token.head,
            'Head POS':token.head.pos_,
            'Children':[child for child in token.children],
            'Ent Type': token.ent_type_,
            'IsAlpha': token.is_alpha,
            'IsPunct': token.is_punct,
            'IsStop': token.is_stop
        }
    )   
tokes = pd.DataFrame(rows)

In [64]:
tokes.head(15)

,Token,Lemma,POS,Tag,Dependency,Head,Head POS,Children,Ent Type,IsAlpha,IsPunct,IsStop
0,THE,the,DET,DT,det,BIRDS,NOUN,[],,True,False,True
1,BIRDS,bird,NOUN,NNS,nsubj,OBSERVED,VERB,"[THE, OF]",,True,False,False
2,OF,of,ADP,IN,prep,BIRDS,NOUN,[TEXAS],,True,False,True
3,SOUTHEASTERN,southeastern,ADJ,JJ,amod,TEXAS,PROPN,[],,True,False,False
4,TEXAS,TEXAS,PROPN,NNP,pobj,OF,ADP,"[SOUTHEASTERN, AND, ARIZONA]",GPE,True,False,False
5,AND,and,CCONJ,CC,cc,TEXAS,PROPN,[],,True,False,True
6,SOUTHERN,southern,ADJ,JJ,amod,ARIZONA,PROPN,[],LOC,True,False,False
7,ARIZONA,ARIZONA,PROPN,NNP,conj,TEXAS,PROPN,[SOUTHERN],LOC,True,False,False
8,OBSERVED,observe,VERB,VBN,ROOT,OBSERVED,VERB,"[BIRDS, DURING, JUNE, .]",,True,False,False
9,DURING,dure,VERB,VBG,prep,OBSERVED,VERB,[],,True,False,True


In [45]:
tokes.to_csv('/Users/thalassa/Rcode/blog/data/animals-clean/44pg098_tokens.txt', sep='\t', index = False, header=True)

In [70]:
rows = []

for chunk in doc.noun_chunks:
    rows.append(
        {
            'Chunk': chunk.text, 
            'Chunk Root':chunk.root.text, 
            'Chunk Dep':chunk.root.dep_,
            'Chunk Head':chunk.root.head.text
        }
    )   
chunks = pd.DataFrame(rows)

In [76]:
chunks.head(30)

,Chunk,Chunk Root,Chunk Dep,Chunk Head
0,THE BIRDS,BIRDS,nsubj,OBSERVED
1,SOUTHEASTERN TEXAS,TEXAS,pobj,OF
2,SOUTHERN ARIZONA,ARIZONA,conj,TEXAS
3,SAMUEL N. RHOADS,RHOADS,pobj,BY
4,the idea,idea,pobj,With
5,the avifauna,avifauna,dobj,investigating
6,the southern border,border,pobj,of
7,the United States,States,pobj,of
8,a series,series,dobj,collectiug
9,the birds,birds,pobj,of


In [79]:
chunks.to_csv('/Users/thalassa/Rcode/blog/data/animals-clean/44pg098_spacy-chunks.txt', sep='\t', index = False, header=True)

In [None]:
[sent.text for sent in doc.sents]

In [55]:
with open('44pg098_spacy-sents.txt', 'w') as f:
    for sent in doc.sents:
        f.write("%s\n" % sent.text)

In [None]:
[chunk.text for chunk in doc.noun_chunks]

In [56]:
with open('44pg098_spacy-chunks.txt', 'w') as f:
    for chunk in doc.noun_chunks:
        f.write("%s\n" % chunk.text)

In [61]:
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

THE det BIRDS NOUN []
BIRDS nsubj OBSERVED VERB [THE, OF]
OF prep BIRDS NOUN [TEXAS]
SOUTHEASTERN amod TEXAS PROPN []
TEXAS pobj OF ADP [SOUTHEASTERN, AND, ARIZONA]
AND cc TEXAS PROPN []
SOUTHERN amod ARIZONA PROPN []
ARIZONA conj TEXAS PROPN [SOUTHERN]
OBSERVED ROOT OBSERVED VERB [BIRDS, DURING, JUNE, .]
DURING prep OBSERVED VERB []
MAY compound JUNE PROPN []
JUNE npadvmod OBSERVED VERB [MAY, AND, JULY, 1891]
AND cc JUNE PROPN []
JULY conj JUNE PROPN []
1891 nummod JUNE PROPN []
. punct OBSERVED VERB []

 ROOT 
 SPACE []
BY ROOT BY ADP [RHOADS, .]
SAMUEL compound RHOADS PROPN []
N. compound RHOADS PROPN []
RHOADS pobj BY ADP [SAMUEL, N.]
. punct BY ADP []

 punct left VERB []
With prep left VERB [idea]
the det idea NOUN []
idea pobj With ADP [the, of]
of prep idea NOUN [investigating]
investigating pcomp of ADP [avifauna, and, collectiug]
the det avifauna NOUN []
avifauna dobj investigating VERB [the, of]
of prep avifauna NOUN [border]
the det border NOUN []
southern amod border NOUN 

In [40]:
displacy.render(doc, style="ent")

In [None]:
#Display options for the dependency visualization
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}

displacy.render(doc, style="dep", jupyter=True, options=options)