## 1. Downloading spaCy models

The first step is to download the spaCy model. The model has been pre-trained on annotated English corpora. You only have to run these code cells below the first time you run the notebook; after that, you can skip right to step 2 and carry on from there. (If you run them again later, nothing bad will happen; it’ll just download again.) You can also run spaCy in other notebooks on your computer in the future, and you’ll be able to skip the step of downloading the models.

In [1]:
#Imports the module you need to download and install the spaCy models
import sys

In [None]:
#Installs the English spaCy model
!{sys.executable} -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.1.0/en_core_web_trf-3.1.0.tar.gz

## 2. Importing spaCy and setting up NLP

Run the code cell below to import the spaCy module, and create a functions to loads the Englsih model and run the NLP algorithms (includes named-entity recognition).

In [1]:
#Imports spaCy
import spacy

#Imports the English model
import en_core_web_trf

## 3. Importing other modules

There’s various other modules that will be useful in this notebook. The code comments explain what each one is for. This code cell imports all of those.

In [2]:
#io is used for opening and writing files
import io

#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob

#os is used to navigate your folder directories (e.g. change folders to where you files are stored)
import os

# for handling data frames, etc.
import pandas as pd

# Import the spaCy visualizer
from spacy import displacy

# Import the Entityt Ruler for making custom entities
from spacy.pipeline import EntityRuler

import datetime 

# pre-processing pipeline
import textacy
from textacy import preprocessing

## 4. Diretory setup

Assuming you’re running Jupyter Notebook from your computer’s home directory, this code cell gives you the opportunity to change directories, into the directory where you’re keeping your project files. I've put just a few of the ANSP volumes into a folder called `subset`.

In [3]:
#Define the file directory here
filedirectory = '/Users/thalassa/Rcode/blog/data/animals/'

#Change the working directory to the one you just defined
os.chdir(filedirectory)

In [4]:
#Sets up a function so you can run the English model on texts
nlp = en_core_web_trf.load()

#add the custom entity set (habitats ans taxonomic names)
#ruler = nlp.add_pipe("entity_ruler", before='ner')

# this is a large entity set - it takes a while to load.
#ruler.from_disk("/Users/thalassa/streamlit/streamlit-ansp/ansp-patterns.jsonl")

## Run code on a single file to see how it works.

In [33]:
text = ["Frances Naomi Clark was an Amer-ican ichthyologist born in 1894, and was one of the first wom.an fishery researchers to receive world-wide recognition.  Frances Naomi Clark was an American ichthyologist born in 1894, and was one of the first woman fishery researchers to receive world-wide recognition. Seven Ampelis cedrorum specimens were collected in a meadow near lowland fruit trees. Some habitats we know are in the json file are near      large rocks, near river mouths, near the bottom and near the ocean. Some species names are Hemigrapsus affinis, Hemigrapsus crassimanus, Hendersonia alternifoliae and Hendersonia celtifolia."
       ]
doc = nlp(text)

In [34]:
# Read Dataset 
Df = pd.read_csv('New Task.csv', encoding = 'latin-1')
# Show Dataset
Df.head()

['Frances Naomi Clark was an Amer-ican ichthyologist born in 1894, and was one of the first wom.an fishery researchers to receive world-wide recognition.  Frances Naomi Clark was an American ichthyologist born in 1894, and was one of the first woman fishery researchers to receive world-wide recognition. Seven Ampelis cedrorum specimens were collected in a meadow near lowland fruit trees. Some habitats we know are in the json file are near      large rocks, near river mouths, near the bottom and near the ocean. Some species names are Hemigrapsus affinis, Hemigrapsus crassimanus, Hendersonia alternifoliae and Hendersonia celtifolia.']


In [35]:
# Importing Libraries 
import unidecode 
import pandas as pd 
import re 
import time 
import nltk 
from nltk.corpus import stopwords 
nltk.download('stopwords') 
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from autocorrect import Speller 
from bs4 import BeautifulSoup 
from nltk.corpus import stopwords 
from nltk import word_tokenize 
import string 

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/thalassa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
def remove_whitespace(text):
    """ This function will remove 
        extra whitespaces from the text
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after extra whitespaces removed .
        
    Example:
    Input : How   are   you   doing   ?
    Output : How are you doing ?     
        
    """
    pattern = re.compile(r'\s+') 
    Without_whitespace = re.sub(pattern, ' ', text)
    # There are some instances where there is no space after '?' & ')', 
    # So I am replacing these with one space so that It will not consider two words as one token.
    text = Without_whitespace.replace('?', ' ? ').replace(')', ') ')
    return text

def remove_newlines_tabs(text):
    """
    This function will remove all the occurrences of newlines, tabs, and combinations like: \\n, \\.
    
    arguments:
        input_text: "text" of type "String". 
                    
    return:
        value: "text" after removal of newlines, tabs, \\n, \\ characters.
        
    Example:
    Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
    Output : This is her first day at this place. Please, Be nice to her. 
    
    """
    
    # Replacing all the occurrences of \n,\\n,\t,\\ with a space.
    Formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t',' ').replace('\\', ' ').replace('. com', '.com')
    return Formatted_text

In [37]:
remove_newlines_tabs(text)

AttributeError: 'list' object has no attribute 'replace'

In [38]:
remove_whitespace(text)

TypeError: expected string or bytes-like object

In [12]:
preproc = preprocessing.make_pipeline(
    preprocessing.normalize.whitespace,
    preprocessing.normalize.hyphenated_words,
    preprocessing.normalize.unicode,
    preprocessing.normalize.quotation_marks,
    )

In [16]:
preproc("Frances Naomi Clark was an Amer-ican ichthyologist born in 1894, and was one of the first wom.an fishery researchers to receive world-wide recognition.  ")

'Frances Naomi Clark was an Amer-ican ichthyologist born in 1894, and was one of the first wom.an fishery researchers to receive world-wide recognition.'

In [12]:
rows = []

for token in doc:
    rows.append(
        {
            'Token': token.text, 
            'Lemma': token.lemma_,
            'POS': token.pos_,
            'Tag': token.tag_,
            'Dependency': token.dep_,
            'Head': token.head,
            'Ent Type': token.ent_type_,
            'IsAlpha': token.is_alpha,
            'IsPunct': token.is_punct,
            'IsStop': token.is_stop
        }
    )   
tokes = pd.DataFrame(rows)

In [14]:
tokes.head(15)

Unnamed: 0,Token,Lemma,POS,Tag,Dependency,Head,Ent Type,IsAlpha,IsPunct,IsStop
0,Frances,Frances,PROPN,NNP,compound,Clark,PERSON,True,False,False
1,Naomi,Naomi,PROPN,NNP,compound,Clark,PERSON,True,False,False
2,Clark,Clark,PROPN,NNP,nsubj,was,PERSON,True,False,False
3,was,be,AUX,VBD,ROOT,was,,True,False,True
4,an,an,DET,DT,det,ichthyologist,,True,False,True
5,American,american,ADJ,JJ,amod,ichthyologist,NORP,True,False,False
6,ichthyologist,ichthyologist,NOUN,NN,attr,was,,True,False,False
7,born,bear,VERB,VBN,acl,ichthyologist,,True,False,False
8,in,in,ADP,IN,prep,born,,True,False,True
9,1894,1894,NUM,CD,pobj,in,DATE,False,False,False


## Running spaCy

This step will run every text file throught the complete spaCy pipeline

## Note - this takes a while - do not run this chunk unless you want to see the LOC results.

In [None]:
#Sort all the files in the directory you specified above, alphabetically.

start = datetime.datetime.utcnow()

#For each of those files...
for filename in sorted(os.listdir(filedirectory)):
    #If the filename ends with .txt (i.e. if it's actually a text files)
    if filename.endswith('.txt'):
        #Write out below the name of the file
        print(filename)
        #The file name of the output file adds _ner_loc to the end of the file name of the input file
        outfilename = filename.replace('.txt', '_nlp.txt')
        #Open the infput filename
        with open(filename, 'r') as f:
            #Create and open the output filename
            with open(outfilename, 'w') as out:
                #Read the contents of the input file
                voltext = f.read()
                #Do English NLP on the contents of the input file
                volner = nlp(voltext)
                #For each recognized entity
                rows = []
                for token in doc:
                    rows.append(
                        {
                            'Token': token.text, 
                            'Lemma': token.lemma_,
                            'POS': token.pos_,
                            'Tag': token.tag_,
                            'Dependency': token.dep_,
                            'Head': token.head,
                            'Ent Type': token.ent_type_,
                            'IsAlpha': token.is_alpha,
                            'IsPunct': token.is_punct,
                            'IsStop': token.is_stop
                        }
                    )   
                tokes = pd.DataFrame(rows)
                tokes.to_csv(outfilename, sep='\t', index = False, header=True)
                
end = datetime.datetime.utcnow()
print(f"Finished at {end}, total time {(end-start).seconds / 60.} minutes.")


17669.txt
