## 1. Downloading spaCy models

The first step is to download the spaCy model, which has been pre-trained on annotated English corpora. You only need to run the code cells below the first time you run the notebook; after that, you can skip straight to step 2. (If you run them again later, nothing bad will happen; it’ll just download again.) Once the model is installed, other notebooks that use the same Python environment can load it too, so you won’t need to download it again.
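
If you can’t remember whether you have already installed the model, a quick check like the one below may help. This is only a sketch: `model_is_installed` is a helper name made up for this example, and it simply asks Python whether a package with that name can be found in the current environment.

```python
import importlib.util

def model_is_installed(package_name="en_core_web_trf"):
    """Return True if a package with this name can be found (hypothetical helper)."""
    return importlib.util.find_spec(package_name) is not None

if model_is_installed():
    print("Model already installed; you can skip to step 2.")
else:
    print("Model not found; run the install cells below.")
```

The check only looks at the Python environment the notebook is running in, so a model installed in a different environment will not be found.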

In [19]:
#Imports the module you need to download and install the spaCy models
import sys

In [20]:
#Installs the English spaCy model
!{sys.executable} -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz (460.2 MB)
     |████████████████████████████████| 460.2 MB 13 kB/s               
  Preparing metadata (setup.py) ... done


## 2. Importing spaCy and setting up NLP

Run the code cell below to import the spaCy module and the English model. The model is loaded further down to run the NLP algorithms (including named-entity recognition).

In [21]:
#Imports spaCy
import spacy

#Imports the English model
import en_core_web_trf

## 3. Importing other modules

There are various other modules that will be useful in this notebook. This code cell imports all of them, and the code comments explain what each one is for.

In [2]:
#io is used for opening and writing files
import io

#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob

#os is used to navigate your folder directories (e.g. change folders to where your files are stored)
import os

# for handling data frames, etc.
import pandas as pd

# Import the spaCy visualizer
from spacy import displacy

# Import the Entity Ruler for making custom entities
from spacy.pipeline import EntityRuler

# datetime is used to time the long-running cells
import datetime


## 4. Directory setup

Assuming you’re running Jupyter Notebook from your computer’s home directory, this code cell lets you change into the directory where you’re keeping your project files. I've put just a few of the ANSP volumes into a folder called `subset`.

In [3]:
#Define the file directory here
filedirectory = '/Users/thalassa/Rcode/blog/data/'

#Change the working directory to the one you just defined
os.chdir(filedirectory)
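
If the path is mistyped, `os.chdir` stops with an error that can be hard to read. The sketch below wraps the same step in a small check first. The helper name `change_to_project_dir` is made up for this example, and the demonstration uses a throwaway temporary folder rather than the real project path so that it runs anywhere.

```python
import os
import tempfile
from pathlib import Path

def change_to_project_dir(path):
    """Change into `path` and return the new working directory (hypothetical helper)."""
    project_dir = Path(path).expanduser()
    # Fail early with a readable message if the folder does not exist
    if not project_dir.is_dir():
        raise FileNotFoundError(f"Project directory not found: {project_dir}")
    os.chdir(project_dir)
    return Path.cwd()

# Demonstration with a temporary folder standing in for the project folder
original_dir = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    print(change_to_project_dir(tmp))
    os.chdir(original_dir)  # step back out before the temporary folder is deleted
```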

In [46]:
import nltk
from nltk import tokenize
from nltk.tokenize import word_tokenize, sent_tokenize
import json
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag_sents
nltk.download('tagsets')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/thalassa/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     /Users/thalassa/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

In [9]:
with open('/Users/thalassa/Rcode/blog/data/animals-clean/44pg145-clean-taxa.txt', 'r') as f:
    text = f.read()

In [10]:
sentences = tokenize.sent_tokenize(text)

In [None]:
print(sentences)

In [23]:
def load_json_data(file):
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return (data)

In [30]:
def load_text_data(file):
    with open(file, 'r', encoding="utf-8") as f:
        text = f.read()
    return (text)

In [31]:
def create_tagged_sents(textfile, jsonfile):
    taxalist = load_json_data(jsonfile) 
    text = load_text_data(textfile)
    sentences_with_taxa = []
    for sen in sent_tokenize(text):
        l = word_tokenize(sen)
        if len(set(l).intersection(taxalist))>0:
            sentences_with_taxa.append(sen)
    return (sentences_with_taxa)
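
To see what the set intersection in `create_tagged_sents` is doing, here is a tiny stand-alone sketch of the same idea. It is a simplification under stated assumptions: it splits on whitespace instead of using NLTK's `word_tokenize`, and both the function name `sentences_mentioning` and the two-item name list are made up for illustration (they are not the contents of the real JSON file).

```python
def sentences_mentioning(sentences, names):
    """Keep the sentences that share at least one whole token with `names`."""
    names = set(names)
    kept = []
    for sentence in sentences:
        # Crude tokenization: treat full stops as spaces, then split on whitespace
        tokens = sentence.replace(".", " ").split()
        if names.intersection(tokens):
            kept.append(sentence)
    return kept

sample_sentences = [
    "Emys clausa Schw.",
    "In the English Catalogue are described several species.",
    "Kinosternum PENNSYLVANicuM.",
]
sample_names = ["Emys", "Kinosternum"]  # hypothetical stand-in for the taxa list

print(sentences_mentioning(sample_sentences, sample_names))
```

Because the comparison is between whole tokens, it is case-sensitive and it works one word at a time: a two-word entry such as "Emys clausa" would never equal a single token, so it would not match on its own.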

In [32]:
sent_with_taxa = create_tagged_sents("/Users/thalassa/Rcode/blog/data/animals-clean/44pg145-clean-taxa.txt", "/Users/thalassa/Rcode/blog/data/ansp-taxa-clean.json")

In [None]:
print(sent_with_taxa)

In [None]:
#Record the start time so the total run time can be reported at the end
start = datetime.datetime.utcnow()
sent_out = []
#For each of the files in the working directory, in alphabetical order...
for filename in sorted(os.listdir(filedirectory)):
    #If the filename ends with .txt (i.e. if it's actually a text file)
    if filename.endswith('.txt'):
        #Write out below the name of the file
        print(filename)
        #Collect the sentences in this file that mention a taxon
        sent_with_taxa = create_tagged_sents(filename, "/Users/thalassa/Rcode/blog/data/ansp-taxa-clean.json")
        sent_out = sent_out + sent_with_taxa
        
end = datetime.datetime.utcnow()
print(f"Finished at {end}, total time {(end-start).total_seconds() / 60.} minutes.")


In [49]:
with open('sentences_with_taxa.txt', 'w') as f:
    for item in sent_out:
        f.write("%s\n" % item)

## Use NLTK to tag parts of speech. End goal is to keep only sentences with a verb (wheat from chaff, etc.)

In [32]:
with open('/Users/thalassa/Rcode/blog/data/sentences_with_taxa.txt', 'r') as f:
    text = f.read()

In [33]:
# Tokenize the whole text into words (used by the cells below), and separately into sentences
words_text = word_tokenize(text)
text_sentence_tokens = sent_tokenize(text)

In [34]:
print(text_sentence_tokens[0:20])

['CATALOGUE OF AMERICAN Testudinata\nChelonuea serpentina\nEmysaurus aliorum.', 'Kinosternum PENNSYLVANicuM.', 'Cistuda id.', 'Terrapene odorata et Boscii Menem 1. c. p. 27.', 'Cistttdo odorata Say 1. c. Emys odorata Schw.', 'Cistudo Clausa\nCistudo Carolina alior.', 'Emys clausa Schw.', 'Emys virgidata ejusd.', 'Terrapene Carolina viaculata et nelulosa Bell Zool.', 'Cistudo Blaiidiugii Holbr.', 'Emys Muhlenbergii Schcepff.', 'Chersine Muhlenhergii Merrem.', 'In the Catalogue of Amphibia in the collection of the British Museum and in that of the Jardin des Plantes the following species of tortoises are mentioned as coming from the United States.', 'In the English Catalogue are described Emys rivulata E. scripta E. Holbrookii E. macrocephala and E. Bennetii.', "Till' Kinosternum DouMedayii however forms an cxceptiitn.", "In Scha'pff Testudo tricarinata a young animal of some Kinosternum T. cinerea a young picta  T. scripia a young serrata or reticulata  T. rosliata a young Trionyx.", "'

In [35]:
print(words_text[0:10])

['CATALOGUE', 'OF', 'AMERICAN', 'Testudinata', 'Chelonuea', 'serpentina', 'Emysaurus', 'aliorum', '.', 'Kinosternum']


In [36]:
tagged_text = nltk.pos_tag(words_text)

In [37]:
print(tagged_text[0:20])

[('CATALOGUE', 'NN'), ('OF', 'IN'), ('AMERICAN', 'NNP'), ('Testudinata', 'NNP'), ('Chelonuea', 'NNP'), ('serpentina', 'VBD'), ('Emysaurus', 'NNP'), ('aliorum', 'NN'), ('.', '.'), ('Kinosternum', 'NNP'), ('PENNSYLVANicuM', 'NNP'), ('.', '.'), ('Cistuda', 'NNP'), ('id', 'NN'), ('.', '.'), ('Terrapene', 'NNP'), ('odorata', 'MD'), ('et', 'VB'), ('Boscii', 'NNP'), ('Menem', 'NNP')]


In [39]:
text_word_tokens = []
for sentence_token in text_sentence_tokens:
    text_word_tokens.append(word_tokenize(sentence_token))

In [42]:
print(text_word_tokens[0:10])

[['CATALOGUE', 'OF', 'AMERICAN', 'Testudinata', 'Chelonuea', 'serpentina', 'Emysaurus', 'aliorum', '.'], ['Kinosternum', 'PENNSYLVANicuM', '.'], ['Cistuda', 'id', '.'], ['Terrapene', 'odorata', 'et', 'Boscii', 'Menem', '1.', 'c.', 'p.', '27', '.'], ['Cistttdo', 'odorata', 'Say', '1.', 'c.', 'Emys', 'odorata', 'Schw', '.'], ['Cistudo', 'Clausa', 'Cistudo', 'Carolina', 'alior', '.'], ['Emys', 'clausa', 'Schw', '.'], ['Emys', 'virgidata', 'ejusd', '.'], ['Terrapene', 'Carolina', 'viaculata', 'et', 'nelulosa', 'Bell', 'Zool', '.'], ['Cistudo', 'Blaiidiugii', 'Holbr', '.']]


In [43]:
text_tagged = pos_tag_sents(text_word_tokens)

In [54]:
print(text_tagged[0:10])

[[('CATALOGUE', 'NN'), ('OF', 'IN'), ('AMERICAN', 'NNP'), ('Testudinata', 'NNP'), ('Chelonuea', 'NNP'), ('serpentina', 'VBD'), ('Emysaurus', 'NNP'), ('aliorum', 'NN'), ('.', '.')], [('Kinosternum', 'NNP'), ('PENNSYLVANicuM', 'NNP'), ('.', '.')], [('Cistuda', 'NNP'), ('id', 'NN'), ('.', '.')], [('Terrapene', 'NNP'), ('odorata', 'MD'), ('et', 'VB'), ('Boscii', 'NNP'), ('Menem', 'NNP'), ('1.', 'CD'), ('c.', 'NN'), ('p.', 'NN'), ('27', 'CD'), ('.', '.')], [('Cistttdo', 'NNP'), ('odorata', 'NNS'), ('Say', 'NNP'), ('1.', 'CD'), ('c.', 'NN'), ('Emys', 'NNP'), ('odorata', 'NN'), ('Schw', 'NNP'), ('.', '.')], [('Cistudo', 'NNP'), ('Clausa', 'NNP'), ('Cistudo', 'NNP'), ('Carolina', 'NNP'), ('alior', 'NN'), ('.', '.')], [('Emys', 'NNP'), ('clausa', 'NN'), ('Schw', 'NNP'), ('.', '.')], [('Emys', 'NNP'), ('virgidata', 'NN'), ('ejusd', 'NN'), ('.', '.')], [('Terrapene', 'NNP'), ('Carolina', 'NNP'), ('viaculata', 'NN'), ('et', 'NN'), ('nelulosa', 'NN'), ('Bell', 'NNP'), ('Zool', 'NNP'), ('.', '.')], 

In [47]:
# upenn_tagset prints the tag descriptions itself, so no print() is needed
nltk.help.upenn_tagset('V.*')

VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
 

In [52]:
verb = list(text_tagged[0:10])

In [53]:
print(verb[0:10])

[[('CATALOGUE', 'NN'), ('OF', 'IN'), ('AMERICAN', 'NNP'), ('Testudinata', 'NNP'), ('Chelonuea', 'NNP'), ('serpentina', 'VBD'), ('Emysaurus', 'NNP'), ('aliorum', 'NN'), ('.', '.')], [('Kinosternum', 'NNP'), ('PENNSYLVANicuM', 'NNP'), ('.', '.')], [('Cistuda', 'NNP'), ('id', 'NN'), ('.', '.')], [('Terrapene', 'NNP'), ('odorata', 'MD'), ('et', 'VB'), ('Boscii', 'NNP'), ('Menem', 'NNP'), ('1.', 'CD'), ('c.', 'NN'), ('p.', 'NN'), ('27', 'CD'), ('.', '.')], [('Cistttdo', 'NNP'), ('odorata', 'NNS'), ('Say', 'NNP'), ('1.', 'CD'), ('c.', 'NN'), ('Emys', 'NNP'), ('odorata', 'NN'), ('Schw', 'NNP'), ('.', '.')], [('Cistudo', 'NNP'), ('Clausa', 'NNP'), ('Cistudo', 'NNP'), ('Carolina', 'NNP'), ('alior', 'NN'), ('.', '.')], [('Emys', 'NNP'), ('clausa', 'NN'), ('Schw', 'NNP'), ('.', '.')], [('Emys', 'NNP'), ('virgidata', 'NN'), ('ejusd', 'NN'), ('.', '.')], [('Terrapene', 'NNP'), ('Carolina', 'NNP'), ('viaculata', 'NN'), ('et', 'NN'), ('nelulosa', 'NN'), ('Bell', 'NNP'), ('Zool', 'NNP'), ('.', '.')], 
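
The cell above only copies the first ten tagged sentences. To move toward the stated goal of keeping only sentences with a verb, one possible approach is sketched below. It assumes that "has a verb" means "contains at least one Penn Treebank tag starting with `VB`", and the function names are made up for this example. The sample data is copied from the tagged output above.

```python
def has_verb(tagged_sentence):
    """Return True if any (word, tag) pair carries a verb tag (VB, VBD, VBG, ...)."""
    return any(tag.startswith("VB") for _, tag in tagged_sentence)

def sentences_with_verbs(tagged_sentences):
    """Keep only the tagged sentences that contain at least one verb tag."""
    return [sentence for sentence in tagged_sentences if has_verb(sentence)]

# Three tagged sentences copied from the output above
sample_tagged = [
    [('CATALOGUE', 'NN'), ('OF', 'IN'), ('AMERICAN', 'NNP'), ('Testudinata', 'NNP'),
     ('Chelonuea', 'NNP'), ('serpentina', 'VBD'), ('Emysaurus', 'NNP'),
     ('aliorum', 'NN'), ('.', '.')],
    [('Kinosternum', 'NNP'), ('PENNSYLVANicuM', 'NNP'), ('.', '.')],
    [('Emys', 'NNP'), ('clausa', 'NN'), ('Schw', 'NNP'), ('.', '.')],
]

for sentence in sentences_with_verbs(sample_tagged):
    # Join the words back together for display
    print(" ".join(word for word, _ in sentence))
```

Note that the first sample sentence is kept only because the tagger labelled the Latin name "serpentina" as a past-tense verb. A filter like this will therefore let some catalogue lines through, so it is worth spot-checking the results.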

In [None]:
#Loads the English model so you can run it on texts
nlp = en_core_web_trf.load()

#Add the custom entity set (habitats and taxonomic names) ahead of the built-in entity recognizer
ruler = nlp.add_pipe("entity_ruler", before='ner')

# this is a large entity set - it takes a while to load.
ruler.from_disk("/Users/thalassa/streamlit/streamlit-ansp/data/ansp-clean-patterns.jsonl")
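
For readers who have not seen an entity ruler pattern file before: it is a JSONL file, meaning one JSON object per line, each with a `label` and a `pattern`. The pattern is either a plain string (an exact phrase) or a list of token descriptions. The sketch below writes and re-reads a tiny file of that shape. The labels `TAXON` and `HABITAT` and the file name are assumptions made for this example, not necessarily what the real ANSP pattern file uses; the names and phrases are borrowed from the sample text used later in this notebook.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical labels; phrases borrowed from the sample text in this notebook
example_patterns = [
    {"label": "TAXON", "pattern": "Hemigrapsus affinis"},
    {"label": "TAXON", "pattern": [{"LOWER": "ampelis"}, {"LOWER": "cedrorum"}]},
    {"label": "HABITAT", "pattern": "near river mouths"},
]

def write_patterns(path, patterns):
    """Write one JSON object per line (the JSONL layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")

def read_patterns(path):
    """Read a JSONL file back into a list of dictionaries, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round trip through a temporary file so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    pattern_file = Path(tmp) / "example-patterns.jsonl"
    write_patterns(pattern_file, example_patterns)
    loaded = read_patterns(pattern_file)

print(len(loaded), "patterns loaded")
```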

## Run code on a single file to see how it works.

In [None]:
text = "Frances Naomi Clark was an American ichthyologist born in 1894, and was one of the first woman fishery researchers to receive world-wide recognition. Frances Naomi Clark was an American ichthyologist born in 1894, and was one of the first woman fishery researchers to receive world-wide recognition. Seven Ampelis cedrorum specimens were collected in a meadow near lowland fruit trees. Some habitats we know are in the json file are near large rocks, near river mouths, near the bottom and near the ocean. Some species names are Hemigrapsus affinis, Hemigrapsus crassimanus, Hendersonia alternifoliae and Hendersonia celtifolia."
doc = nlp(text)

In [12]:
rows = []

for token in doc:
    rows.append(
        {
            'Token': token.text, 
            'Lemma': token.lemma_,
            'POS': token.pos_,
            'Tag': token.tag_,
            'Dependency': token.dep_,
            'Head': token.head,
            'Ent Type': token.ent_type_,
            'IsAlpha': token.is_alpha,
            'IsPunct': token.is_punct,
            'IsStop': token.is_stop
        }
    )   
tokes = pd.DataFrame(rows)

In [14]:
tokes.head(15)

| | Token | Lemma | POS | Tag | Dependency | Head | Ent Type | IsAlpha | IsPunct | IsStop |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Frances | Frances | PROPN | NNP | compound | Clark | PERSON | True | False | False |
| 1 | Naomi | Naomi | PROPN | NNP | compound | Clark | PERSON | True | False | False |
| 2 | Clark | Clark | PROPN | NNP | nsubj | was | PERSON | True | False | False |
| 3 | was | be | AUX | VBD | ROOT | was | | True | False | True |
| 4 | an | an | DET | DT | det | ichthyologist | | True | False | True |
| 5 | American | american | ADJ | JJ | amod | ichthyologist | NORP | True | False | False |
| 6 | ichthyologist | ichthyologist | NOUN | NN | attr | was | | True | False | False |
| 7 | born | bear | VERB | VBN | acl | ichthyologist | | True | False | False |
| 8 | in | in | ADP | IN | prep | born | | True | False | True |
| 9 | 1894 | 1894 | NUM | CD | pobj | in | DATE | False | False | False |
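
Once the tokens are in rows like these, a quick tally of the `Ent Type` column gives a rough picture of what the model found. The sketch below does this with plain Python on a handful of rows copied from the table above; `entity_token_counts` is a name made up for this example.

```python
from collections import Counter

# A few rows copied from the table above (only the columns needed here)
sample_rows = [
    {"Token": "Frances", "Ent Type": "PERSON"},
    {"Token": "Naomi", "Ent Type": "PERSON"},
    {"Token": "Clark", "Ent Type": "PERSON"},
    {"Token": "was", "Ent Type": ""},
    {"Token": "American", "Ent Type": "NORP"},
    {"Token": "1894", "Ent Type": "DATE"},
]

def entity_token_counts(rows):
    """Count tokens per entity label, ignoring tokens with an empty label."""
    return Counter(row["Ent Type"] for row in rows if row["Ent Type"])

print(entity_token_counts(sample_rows))
```

Keep in mind that this counts tokens, not entities: "Frances Naomi Clark" is one person but contributes three `PERSON` tokens.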


## Running spaCy

This step will run every text file through the complete spaCy pipeline.

## Note - this takes a while - do not run this chunk unless you want to see the LOC results.

In [None]:
#Record the start time so the total run time can be reported at the end
start = datetime.datetime.utcnow()

#For each of the files in the directory you specified above, in alphabetical order...
for filename in sorted(os.listdir(filedirectory)):
    #If the filename ends with .txt (i.e. if it's actually a text file)
    if filename.endswith('.txt'):
        #Write out below the name of the file
        print(filename)
        #The file name of the output file adds _nlp to the end of the file name of the input file
        outfilename = filename.replace('.txt', '_nlp.txt')
        #Open the input file
        with open(filename, 'r') as f:
            #Read the contents of the input file
            voltext = f.read()
        #Do English NLP on the contents of the input file
        volner = nlp(voltext)
        #For each token in the processed file, record its attributes
        rows = []
        for token in volner:
            rows.append(
                {
                    'Token': token.text, 
                    'Lemma': token.lemma_,
                    'POS': token.pos_,
                    'Tag': token.tag_,
                    'Dependency': token.dep_,
                    'Head': token.head,
                    'Ent Type': token.ent_type_,
                    'IsAlpha': token.is_alpha,
                    'IsPunct': token.is_punct,
                    'IsStop': token.is_stop
                }
            )   
        #Write the token table to the output file as tab-separated values
        tokes = pd.DataFrame(rows)
        tokes.to_csv(outfilename, sep='\t', index = False, header=True)
                
end = datetime.datetime.utcnow()
print(f"Finished at {end}, total time {(end-start).total_seconds() / 60.} minutes.")


17669.txt
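
One thing to watch for: the loop treats every `.txt` file in the folder as input, and that includes the `_nlp.txt` files written by an earlier run and `sentences_with_taxa.txt` if it was saved in the same folder. The sketch below shows one way you might pick out only the original volumes. The helper name `input_text_files` and its default skip lists are assumptions based on the file names used in this notebook.

```python
import tempfile
from pathlib import Path

def input_text_files(directory,
                     skip_suffixes=("_nlp.txt",),
                     skip_names=("sentences_with_taxa.txt",)):
    """Return the .txt files in `directory`, sorted, minus files this notebook writes."""
    files = []
    for path in sorted(Path(directory).glob("*.txt")):
        # Leave out earlier output files so they are not processed as input
        if path.name in skip_names or path.name.endswith(skip_suffixes):
            continue
        files.append(path)
    return files

# Demonstration in a temporary folder with made-up, empty files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["17669.txt", "17669_nlp.txt", "sentences_with_taxa.txt", "notes.md"]:
        (Path(tmp) / name).touch()
    demo_names = [path.name for path in input_text_files(tmp)]

print(demo_names)
```

You could then loop over `input_text_files(filedirectory)` in place of `sorted(os.listdir(filedirectory))`, using `path.name` wherever the loop currently uses `filename`.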
