## 1. Downloading spaCy models

The first step is to download the spaCy model, which has been pre-trained on annotated English corpora. You only need to run the code cells below the first time you run the notebook; after that, you can skip straight to step 2 and carry on from there. (If you run them again later, nothing bad will happen; the model will just download again.) Once the model is installed, you can also use spaCy in other notebooks on your computer without downloading it again.

In [1]:
#Imports the module you need to download and install the spaCy models
import sys

In [None]:
#Installs the English spaCy model
!{sys.executable} -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz
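
If you are not sure whether the model is already installed, a check along these lines can tell you before you re-run the download. This is a minimal sketch using only the standard library; the helper name `model_is_installed` is hypothetical, and it assumes the model was installed as an importable package named `en_core_web_trf`, as the pip command above does.

```python
import importlib.util


def model_is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    # find_spec returns None when no top-level module of that name is found
    return importlib.util.find_spec(package_name) is not None


# Hypothetical usage: only re-run the install cell when this prints False
print(model_is_installed("en_core_web_trf"))
```

The check only looks for the package; it does not load the model, so it is quick to run.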

## 2. Importing spaCy and setting up NLP

Run the code cell below to import the spaCy module and load the English model, which runs the NLP algorithms (including named-entity recognition).

In [1]:
#Imports spaCy
import spacy

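#Imports the English model
import en_core_web_trf

#Loads the English model into the `nlp` pipeline object used in later cells
nlp = en_core_web_trf.load()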

## 3. Importing other modules

There are various other modules that will be useful in this notebook. This code cell imports all of them, and the code comments explain what each one is for.

In [3]:
#io is used for opening and writing files
import io

#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob

#os is used to navigate your folder directories (e.g. change folders to where your files are stored)
import os

# for handling data frames, etc.
import pandas as pd

# Import the spaCy visualizer
from spacy import displacy

# Import the Entity Ruler for making custom entities
from spacy.pipeline import EntityRuler

# datetime is used to time the long-running cells
import datetime

# pre-processing pipeline
import textacy
from textacy import preprocessing

## 4. Directory setup

Assuming you’re running Jupyter Notebook from your computer’s home directory, this code cell lets you change into the directory where you keep your project files. I've put just a few of the ANSP volumes into a folder called `subset`.

In [4]:
#Define the file directory here
filedirectory = '/Users/thalassa/Rcode/blog/data/animals-clean/'

#Change the working directory to the one you just defined
os.chdir(filedirectory)
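
If the path is mistyped, `os.chdir` fails with an error that can be easy to miss in a long notebook. The sketch below wraps the directory change in a small check. The helper name `change_to_directory` is hypothetical, and a temporary folder stands in for the real project folder so the example can be run anywhere.

```python
import os
import tempfile


def change_to_directory(path):
    """Change into `path`, raising a clear error if it does not exist."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Directory not found: {path}")
    os.chdir(path)
    return os.getcwd()


# Demonstration with a temporary directory standing in for the project folder
original = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    print(change_to_directory(tmp))
    os.chdir(original)  # step back out before the temporary folder is removed
```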

In [18]:
import nltk
from nltk import tokenize
from nltk.tokenize import word_tokenize, sent_tokenize
import json

In [9]:
with open('/Users/thalassa/Rcode/blog/data/animals-clean/44pg145-clean-taxa.txt', 'r') as f:
    text = f.read()

In [10]:
sentences = tokenize.sent_tokenize(text)

In [None]:
print(sentences)

In [23]:
def load_json_data(file):
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return (data)

In [30]:
def load_text_data(file):
    #Read the whole file as UTF-8 text (the same encoding load_json_data uses)
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

In [31]:
def create_tagged_sents(textfile, jsonfile):
    #Load the taxa names as a set for fast membership tests
    taxa = set(load_json_data(jsonfile))
    text = load_text_data(textfile)
    sentences_with_taxa = []
    for sen in sent_tokenize(text):
        #Keep the sentence if at least one of its tokens is a known taxon
        tokens = set(word_tokenize(sen))
        if tokens & taxa:
            sentences_with_taxa.append(sen)
    return sentences_with_taxa
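
To see what this filter does and does not catch, here is a small standalone sketch of the same set-intersection idea. It is an illustration under assumptions, not the notebook's code: it swaps NLTK's `word_tokenize` for a rough regular-expression tokenizer so it runs without extra downloads, and the helper names and example data are made up.

```python
import re


def simple_tokens(sentence):
    """Split a sentence into word tokens (a rough stand-in for word_tokenize)."""
    return re.findall(r"[A-Za-z]+", sentence)


def sentences_mentioning(sentences, taxa):
    """Keep sentences in which at least one token is in `taxa`."""
    taxa = set(taxa)
    return [s for s in sentences if taxa & set(simple_tokens(s))]


# Made-up example data, not taken from the ANSP volumes
taxa = ["Salmo", "Salmo trutta"]
sentences = [
    "The genus Salmo includes several trout.",
    "No animals are named here.",
    "Specimens of salmo were collected.",
]
print(sentences_mentioning(sentences, taxa))
```

Only the first sentence is kept. The match is case-sensitive, so "salmo" is missed, and a two-word entry such as "Salmo trutta" can never equal a single token, so it only matches through its one-word entry "Salmo".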

In [32]:
sent_with_taxa = create_tagged_sents("/Users/thalassa/Rcode/blog/data/animals-clean/44pg145-clean-taxa.txt", "/Users/thalassa/Rcode/blog/data/ansp-taxa-clean.json")

In [None]:
print(sent_with_taxa)

In [43]:
start = datetime.datetime.utcnow()
sent_out = []
#For each file in the directory, in alphabetical order...
for filename in sorted(os.listdir(filedirectory)):
    #If the filename ends with .txt (i.e. if it's actually a text file)
    if filename.endswith('.txt'):
        #Write out below the name of the file
        print(filename)
        #Collect the sentences in this file that mention a taxon
        sent_with_taxa = create_tagged_sents(filename, "/Users/thalassa/Rcode/blog/data/ansp-taxa-clean.json")
        sent_out = sent_out + sent_with_taxa
        
end = datetime.datetime.utcnow()
print(f"Finished at {end}, total time {(end-start).seconds / 60.} minutes.")


07sxx30-clean-taxa.txt
12sxx03-clean-taxa.txt
19sxx03-clean-taxa.txt
23sxx01-clean-taxa.txt
23sxx04-clean-taxa.txt
23sxx06-clean-taxa.txt
24sxx03-clean-taxa.txt
24sxx07-clean-taxa.txt
24sxx08-clean-taxa.txt
25sxx02-clean-taxa.txt
25sxx07-clean-taxa.txt
25sxx10-clean-taxa.txt
25sxx11-clean-taxa.txt
25sxx12-clean-taxa.txt
25sxx14-clean-taxa.txt
25sxx15-clean-taxa.txt
26sxx03-clean-taxa.txt
26sxx04-clean-taxa.txt
26sxx05-clean-taxa.txt
26sxx06-clean-taxa.txt
26sxx08-clean-taxa.txt
26sxx09-clean-taxa.txt
26sxx10-clean-taxa.txt
27sxx03-clean-taxa.txt
27sxx06-clean-taxa.txt
27sxx09-clean-taxa.txt
27sxx21-clean-taxa.txt
27sxx62-clean-taxa.txt
29sxx05-clean-taxa.txt
29sxx06-clean-taxa.txt
30sxx01-clean-taxa.txt
30sxx04-clean-taxa.txt
30sxx05-clean-taxa.txt
30sxx06-clean-taxa.txt
30sxx08-clean-taxa.txt
30sxx09-clean-taxa.txt
30sxx10-clean-taxa.txt
31sxx07-clean-taxa.txt
31sxx08-clean-taxa.txt
31sxx10-clean-taxa.txt
31sxx11-clean-taxa.txt
31sxx13-clean-taxa.txt
31sxx14-clean-taxa.txt
31sxx16-cle

In [45]:
with open('sentences_with_taxa.txt', 'w') as f:
    for item in sent_out:
        f.write("%s\n" % item)
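
One thing to watch when writing one sentence per line: a tokenized sentence can itself contain a line break, which would spread it over several lines of the output file. The sketch below collapses internal whitespace before writing and then reads the file back. The function names, file name, and sentences are made up for illustration.

```python
import os
import tempfile


def write_sentences(sentences, path):
    """Write one sentence per line, collapsing internal whitespace first."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(" ".join(sentence.split()) + "\n")


def read_sentences(path):
    """Read the file back into a list with one sentence per entry."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Made-up sentences; the second one contains a line break
example = ["First sentence.", "Second sentence,\nsplit over two lines."]
path = os.path.join(tempfile.mkdtemp(), "sentences_demo.txt")
write_sentences(example, path)
print(read_sentences(path))
```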

In [46]:
type(sent_out)


list

In [12]:
preproc = preprocessing.make_pipeline(
    preprocessing.normalize.whitespace,
    preprocessing.normalize.hyphenated_words,
    preprocessing.normalize.unicode,
    preprocessing.normalize.quotation_marks,
    )

In [16]:
preproc("Frances Naomi Clark was an Amer-ican ichthyologist born in 1894, and was one of the first wom.an fishery researchers to receive world-wide recognition.  ")

'Frances Naomi Clark was an Amer-ican ichthyologist born in 1894, and was one of the first wom.an fishery researchers to receive world-wide recognition.'
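
Note that the output above still contains "Amer-ican". As far as I understand the textacy documentation, `hyphenated_words` is meant for words that were split across a line break, so a hyphen inside a single line is left alone. The sketch below is a rough standard-library approximation of that idea, not textacy's actual implementation; the pattern and function name are my own.

```python
import re

# A word fragment, a hyphen, a line break (with optional spaces), then the next fragment
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")


def join_line_break_hyphens(text):
    """Join word halves that were split by a hyphen at the end of a line."""
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)


print(join_line_break_hyphens("an Amer-\nican ichthyologist"))  # halves are joined
print(join_line_break_hyphens("an Amer-ican ichthyologist"))    # left unchanged
```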

In [12]:
rows = []

#`doc` is assumed to be a spaCy Doc created in an earlier cell, e.g. doc = nlp(text)
for token in doc:
    rows.append(
        {
            'Token': token.text, 
            'Lemma': token.lemma_,
            'POS': token.pos_,
            'Tag': token.tag_,
            'Dependency': token.dep_,
            'Head': token.head.text,
            'Ent Type': token.ent_type_,
            'IsAlpha': token.is_alpha,
            'IsPunct': token.is_punct,
            'IsStop': token.is_stop
        }
    )   
tokes = pd.DataFrame(rows)

In [14]:
tokes.head(15)

| | Token | Lemma | POS | Tag | Dependency | Head | Ent Type | IsAlpha | IsPunct | IsStop |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Frances | Frances | PROPN | NNP | compound | Clark | PERSON | True | False | False |
| 1 | Naomi | Naomi | PROPN | NNP | compound | Clark | PERSON | True | False | False |
| 2 | Clark | Clark | PROPN | NNP | nsubj | was | PERSON | True | False | False |
| 3 | was | be | AUX | VBD | ROOT | was | | True | False | True |
| 4 | an | an | DET | DT | det | ichthyologist | | True | False | True |
| 5 | American | american | ADJ | JJ | amod | ichthyologist | NORP | True | False | False |
| 6 | ichthyologist | ichthyologist | NOUN | NN | attr | was | | True | False | False |
| 7 | born | bear | VERB | VBN | acl | ichthyologist | | True | False | False |
| 8 | in | in | ADP | IN | prep | born | | True | False | True |
| 9 | 1894 | 1894 | NUM | CD | pobj | in | DATE | False | False | False |


## 5. Running spaCy

This step runs every text file through the complete spaCy pipeline and writes the token-level results for each file to a tab-separated output file.

## Note - this takes a while - do not run this chunk unless you want to see the LOC results.

In [None]:
#Sort all the files in the directory you specified above, alphabetically.

#Record the start time so the total run time can be reported at the end
start = datetime.datetime.utcnow()

#For each of those files...
for filename in sorted(os.listdir(filedirectory)):
    #If the filename ends with .txt (i.e. if it's actually a text file)
    if filename.endswith('.txt'):
        #Write out below the name of the file
        print(filename)
        #The file name of the output file adds _nlp to the end of the file name of the input file
        outfilename = filename.replace('.txt', '_nlp.txt')
        #Open the input file
        with open(filename, 'r') as f:
            #Read the contents of the input file
            voltext = f.read()
        #Do English NLP on the contents of the input file
        volner = nlp(voltext)
        #For each token, record its text and linguistic annotations
        rows = []
        for token in volner:
            rows.append(
                {
                    'Token': token.text, 
                    'Lemma': token.lemma_,
                    'POS': token.pos_,
                    'Tag': token.tag_,
                    'Dependency': token.dep_,
                    'Head': token.head.text,
                    'Ent Type': token.ent_type_,
                    'IsAlpha': token.is_alpha,
                    'IsPunct': token.is_punct,
                    'IsStop': token.is_stop
                }
            )   
        tokes = pd.DataFrame(rows)
        #Write the token table to the output file as tab-separated values
        tokes.to_csv(outfilename, sep='\t', index = False, header=True)
                
end = datetime.datetime.utcnow()
print(f"Finished at {end}, total time {(end-start).seconds / 60.} minutes.")


17669.txt
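
Because the output files also end in `.txt`, running the loop a second time would treat the `_nlp.txt` files from the first run as new inputs. One possible guard is sketched below; the helper name `is_input_file` and the example listing are hypothetical, and the suffix is assumed to match the one used when building `outfilename`.

```python
def is_input_file(filename, output_suffix="_nlp.txt"):
    """Return True for .txt inputs, False for outputs from an earlier run."""
    return filename.endswith(".txt") and not filename.endswith(output_suffix)


# Made-up directory listing containing one output from a previous run
names = ["17669.txt", "17669_nlp.txt", "notes.md"]
print([name for name in names if is_input_file(name)])
```

In the loop, this test could stand in for the plain `filename.endswith('.txt')` check.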
