___
<a href='https://agroinformatics.org'> <img src='GEMS-logo2d.png' /></a>
___

# spaCy Basics

**spaCy** (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. 


# Installation and Setup

Installation is a two-step process. First, install spaCy using either conda or pip. Next, download the specific model you want, based on language.<br> For more info visit https://spacy.io/usage/

### 1. From the command line or terminal:
> `conda install -c conda-forge spacy`
> <br>*or*<br>
> `pip install -U spacy`

> ### Alternatively you can create a virtual environment:
> `conda create -n spacyenv python=3 spacy=2`

### 2. Next, also from the command line (you must run this as admin or use sudo):

> `python -m spacy download en`

> ### If successful, you should see a message like:

> **`Linking successful`**<br>
> `    PathToSpacyEnvironment\spacyenv\lib\site-packages\en_core_web_sm -->`<br>
> `    PathToSpacyEnvironment\spacyenv\lib\site-packages\spacy\data\en`<br>
> ` `<br>
> `    You can now load the model via spacy.load('en')`


# Working with spaCy in Python

This is a typical set of instructions for importing and working with spaCy. It will likely take a while, since spaCy has a fairly large library to load:

In [74]:
# Import spaCy and load the language library
import spacy
import re
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(u'G.E.M.S is applying for an NSF grant worth $5 million. Good luck!')

# Check to see what components are currently live
nlp.pipe_names

['tagger', 'parser', 'ner']

In [75]:
# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)

G.E.M.S PROPN nsubj
is AUX aux
applying VERB ROOT
for ADP prep
an DET det
NSF PROPN compound
grant NOUN pobj
worth ADJ amod
$ SYM quantmod
5 NUM compound
million NUM npadvmod
. PUNCT punct
Good ADJ amod
luck NOUN ROOT
! PUNCT punct


You can use spaCy's `explain` function to look up the meaning of a tag:

In [76]:
spacy.explain('PROPN')

Right away we see some interesting things happen in the token output above:
1. `G.E.M.S` is recognized as a proper noun, not just a word at the start of a sentence.
2. `G.E.M.S` is kept together as one entity (a 'token') rather than being broken on its periods.

___
# spaCy Objects

After importing the spacy module in the cell above we loaded a **model** and named it `nlp`.<br>Next we created a **Doc** object by applying the model to our text, and named it `doc`.<br>

___
# Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data.   Image source: https://spacy.io/usage/spacy-101#pipelines

<img src="pipeline1.png" width="600">

We can check to see what components currently live in the pipeline. 

In [77]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fa9b280d898>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fa9b28567c8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fa9b2856828>)]

___
## Tokenization
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. 

In [78]:
# NOTE: I have no idea if this is true. Kevin and Jesse would know better. Just demonstrating Spacy. 
doc2 = nlp(u"G.E.M.S isn't   looking into hiring anymore.")

for token in doc2:
    print(token.text, token.pos_, token.dep_)

G.E.M.S PROPN nsubj
is AUX aux
n't PART neg
   SPACE 
looking VERB ROOT
into ADP prep
hiring NOUN pcomp
anymore ADV advmod
. PUNCT punct


Notice how `isn't` has been split into two tokens. spaCy recognizes both the root verb `is` and the negation attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens. spaCy distinguishes the extended whitespace from the ordinary single space between words, which might be useful for texts such as poems.

NOTE: Even though `doc2` contains processed information about each token, it also retains the original text:

In [79]:
doc2

G.E.M.S isn't   looking into hiring anymore.

In [80]:
doc2[0]

G.E.M.S

In [81]:
type(doc2)

spacy.tokens.doc.Doc

___
## Part-of-Speech Tagging (POS)
The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `G.E.M.S` was recognized to be a ***proper noun***. Some statistical modeling is required here. For example, words that follow "the" are typically nouns. A big advantage of using spaCy is that we did not have to do any training ourselves: the loaded model is already trained.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

In [82]:
doc2[0].pos_

'PROPN'

___
## Dependencies
In addition to tagging, spaCy assigns a syntactic dependency to each token, which describes the relationship between that token and the other words in the sentence. Notice below that spaCy recognizes that the token `Great` has a different relationship in each of the two sentences.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)

In [83]:
doc2 = nlp(u"Great Engineers are in demand.")
doc3 = nlp(u"Great Britain is in trouble.")

In [84]:
print(doc2[0].text, ",", doc2[0].pos_, ",", doc2[0].dep_)

Great , PROPN , amod


In [85]:
print(doc3[0].text, ",", doc3[0].pos_, ",", doc3[0].dep_)

Great , PROPN , compound


To see the full name of a tag use `spacy.explain(tag)`

In [86]:
print("ADJ =",spacy.explain('ADJ'))
print("amod =",spacy.explain('amod'))
print("PROPN =",spacy.explain('PROPN'))
print("compound =",spacy.explain('compound'))

ADJ = adjective
amod = adjectival modifier
PROPN = proper noun
compound = compound


### Visualizing dependencies

In [87]:
# Import visualization package
from spacy import displacy

# Render the dependency parse immediately inside Jupyter:
displacy.render(doc2, style='dep', jupyter=True, options={'distance': 150})

___
## Additional Token Attributes
Some of the other information that spaCy assigns to tokens:

In [88]:
# Lemmas (the base form of the word):
print(doc2[2].text)
print(doc2[2].lemma_)

are
be


In [89]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[2].pos_)
print(doc2[2].tag_ + ' / ' + spacy.explain(doc2[2].tag_))

AUX
VBP / verb, non-3rd person singular present


In [90]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)
print(doc2[3].is_stop)

True
False
True


## Named Entity Recognition (NER)
spaCy has an **'ner'** pipeline component that identifies token spans fitting a predetermined set of named entities. These are available as the `ents` property of a `Doc` object.

In [91]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(u'G.E.M.S is applying for an NSF grant worth $5 million.')

for ent in doc.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

NSF - ORG - Companies, agencies, institutions, etc.
$5 million - MONEY - Monetary values, including unit


### Visualizing (NER)

In [92]:
displacy.render(doc, style='ent', jupyter=True)

___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [93]:
doc3 = nlp(u'One of the reasons I enjoy working with this group is its diverse pool of talent.  \
To quote Karl Popper “Disciplines are distinguished partly for historical reasons and reasons of \
administrative convenience.......We are not students of some subject matter but students of problems. \
And problems may cut right across the borders of any subject matter or discipline” ')

In [94]:
quote_source = doc3[20:22]
print(quote_source)

Karl Popper


In [95]:
type(quote_source)

spacy.tokens.span.Span

## Aside: Generators
For efficiency, spaCy uses a lot of generators. This reduces the memory footprint, because data are not generated until they are needed. <br/>

### Example: Given a list x = [4, 12, 3] <br/>
We know a user might need the doubled value of only the first item or two, not of the whole list.

### Solution 1: Write a function


In [96]:
def doubleListFunc(lst):
    newList = []
    for item in lst:
        newList.append(item * 2)
    return newList

x = [4, 12, 3]
x1 = doubleListFunc(x) # This solution doubles everything in the list even if we might only need 1st item
print(x1[0])

8


### Solution 2: Write a generator

In [97]:
def doubleListGen(lst):
    for item in lst:
        yield item * 2
        
x = [4, 12, 3]
our_generator = doubleListGen(x)
next(our_generator) # Gets executed only when the item is needed.

8

In [98]:
next(our_generator)

24

In [99]:
next(our_generator)

6

In [100]:
# next(our_generator) # This would raise StopIteration: the generator has reached the end of the list
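
The exhausted-generator case can also be handled without an error. The following is a minimal sketch that redefines `doubleListGen` so it runs on its own: `next()` accepts an optional default that is returned once the generator runs out, and catching `StopIteration` is the explicit alternative. The variable names `first_three` and `exhausted` are illustrative only.

```python
def doubleListGen(lst):
    """Yield each item of lst doubled, one at a time."""
    for item in lst:
        yield item * 2

our_generator = doubleListGen([4, 12, 3])

# Consume the three available items
first_three = [next(our_generator) for _ in range(3)]
print(first_three)  # [8, 24, 6]

# A fourth plain next() would raise StopIteration; a default value avoids that
print(next(our_generator, None))  # None

# Catching the exception explicitly works too
try:
    next(our_generator)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)  # True
```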

### With a generator, if you need all the items, you will need a loop

In [101]:
x = [4, 12, 3]
x2 = []
our_generator = doubleListGen(x)
for item in our_generator:  # use a new name so the list x is not overwritten
    x2.append(item)
print(x2[2])

6


### Or, use a list comprehension

In [102]:
x = [4, 12, 3]
our_generator = doubleListGen(x)
x2 = [x for x in our_generator]
print(x2[2])

6
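
When only the first few doubled values are needed, `itertools.islice` from the standard library can take them lazily, without processing the rest of the list. The sketch below is illustrative: the `calls` list is a hypothetical counter, added only to show which items were actually computed.

```python
from itertools import islice

calls = []  # records which items were actually processed (illustration only)

def doubleListGen(lst):
    for item in lst:
        calls.append(item)
        yield item * 2

x = [4, 12, 3]

# Take just the first two doubled values; the third item is never touched
first_two = list(islice(doubleListGen(x), 2))
print(first_two)  # [8, 24]
print(calls)      # [4, 12]
```

This is the saving the generator approach is meant to provide: the function-based solution would have doubled all three items before returning anything.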


___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. 

In [103]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [104]:
# doc4.sents is a generator, so indexing it (doc4.sents[0]) raises a TypeError; loop over it instead
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [105]:
doc4[6].is_sent_start

True

# Extracting Barley Related Ag Data

Instructions for extracting agriculture-related data from descriptions of barley cultivars. Before extracting ag data, we need to add new named entities to the model being trained. These entities are: <br/>
ALAS = varietal_alias <br/>
CROP = crop <br/>
CVAR = crop_variety <br/>
JRNL = journal_reference <br/>
PATH = pathogen <br/>
PED  = pedigree <br/>
PLAN = plant_anatomy <br/>
PPTD = plant_predisposition_to_disease <br/>
TRAT = trait
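
Because these labels are custom, `spacy.explain` may not know them. A small lookup table can play the same role. The sketch below is only an illustration: the names `AG_LABELS` and `explain_ag_label` are hypothetical and are not part of `nerTraining`.

```python
# Custom entity labels and their descriptions, copied from the list above
AG_LABELS = {
    "ALAS": "varietal_alias",
    "CROP": "crop",
    "CVAR": "crop_variety",
    "JRNL": "journal_reference",
    "PATH": "pathogen",
    "PED": "pedigree",
    "PLAN": "plant_anatomy",
    "PPTD": "plant_predisposition_to_disease",
    "TRAT": "trait",
}

def explain_ag_label(label):
    """Return the description of a custom label, or None if it is unknown."""
    return AG_LABELS.get(label)

print(explain_ag_label("TRAT"))  # trait
print(explain_ag_label("ORG"))   # None (not one of the custom labels)
```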

## Import training data and utility functions

In [106]:
from nerTraining import *

## Inspect training data

In [107]:
TRAIN_DATA[0]

('It was released by the Idaho AES in 2002.',
 {'entities': [(23, 32, 'ORG'), (36, 40, 'DATE')]})

In [108]:
TRAIN_DATA[1]

('It was released by Busch Agricultural Resources in 1989.',
 {'entities': [(19, 47, 'ORG'), (51, 55, 'DATE')]})
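
Each annotation is a `(start, end, label)` triple. Judging from the two examples above, the offsets are character positions with an exclusive end, so slicing the text with each pair is a quick sanity check. In this sketch the helper name `entity_texts` is hypothetical, and the two examples are copied from the output above rather than imported from `nerTraining`.

```python
examples = [
    ('It was released by the Idaho AES in 2002.',
     {'entities': [(23, 32, 'ORG'), (36, 40, 'DATE')]}),
    ('It was released by Busch Agricultural Resources in 1989.',
     {'entities': [(19, 47, 'ORG'), (51, 55, 'DATE')]}),
]

def entity_texts(example):
    """Return (entity text, label) pairs by slicing the text at each offset pair."""
    text, annotations = example
    return [(text[start:end], label)
            for start, end, label in annotations['entities']]

for example in examples:
    print(entity_texts(example))
# [('Idaho AES', 'ORG'), ('2002', 'DATE')]
# [('Busch Agricultural Resources', 'ORG'), ('1989', 'DATE')]
```

A slice that comes back with a stray space or a clipped word usually points to a mislabeled offset in the training data.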

## Train model to recognize ag data entities e.g., TRAT

In [None]:
# NOTE: You need to change the value of output_dir to a directory in your system
output_dir="/home/ksilvers/Projects/IaaAgDataNER/NerModel"
n_iter = 100
trainModel(None,output_dir,n_iter)

Created blank 'en' model
Training 25% done
Training 50% done
Training 75% done


## Load model and do a simple test

In [None]:
agdata_nlp = spacy.load(output_dir)
test_text = '''Kold is a six-rowed winter feed barley. It was released by the Oregon and Idaho AESs in 1993'''
doc = agdata_nlp(test_text)

## Visualize classifications

In [None]:
# Import the displaCy library
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

### Check to see if PEDs are being split: Steveland/Luther//Wintermalt is split into two tokens

In [None]:
agdata_nlp = spacy.load(output_dir)
doc = agdata_nlp("It was selected from the cross Steveland/Luther//Wintermalt")
for token in doc:
    print(token.text)

## Try on a PDF file
### Open PDF file, extract one page, classify and display  

In [None]:
import random
import PyPDF2

# Note: PdfFileReader, numPages, getPage and extractText are the legacy PyPDF2 names (before 3.0)
# Open the PDF file for reading; it is closed automatically when the block exits
with open("/home/ksilvers/Projects/IaaAgDataNER/BarCvDescLJ11.pdf", mode="rb") as pdfFile:
    pdfReader = PyPDF2.PdfFileReader(pdfFile)

    # Find the total number of pages
    numPages = pdfReader.numPages

    # Randomly pick one page
    pageNumber = random.randrange(0, numPages)

    # Get the text of that page
    OnePage = pdfReader.getPage(pageNumber)
    OnePageText = OnePage.extractText()

# Remove newlines. Several newlines in a row appear to make
# spaCy think a sentence has ended, and the PDF reader returns the text in
# an odd fashion
OnePageText = OnePageText.replace('\n','')


# Customize the displaCy entity labels and colors
colors = {'ALAS':'BlueViolet','CROP': 'Aqua','CVAR':'Chartreuse','PATH':'red','PED':'orange','PLAN':'pink','PPTD':'brown','TRAT':'yellow'}
cust_options = {'ents': ['ALAS','CROP','CVAR','PATH','PED','PLAN','PPTD','TRAT'], 'colors':colors}
#model = "/Users/gonsongo/Desktop/research/iaa/Projects/python/IaaAgDataNER/NerModel/model-best"
model = output_dir
agdata_nlp = spacy.load(model)

doc = agdata_nlp(OnePageText)

if doc.ents:
    displacy.render(doc, style='ent', jupyter=True, options=cust_options)
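
One caveat with `replace('\n','')`: when a newline is the only thing separating two words, removing it fuses them together. A possible alternative, sketched here with the standard `re` module, collapses every run of whitespace into a single space. The helper name `normalize_whitespace` is hypothetical, and the sample string is made up for illustration rather than taken from the PDF.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical extracted text with newlines in the middle of a sentence
sample = "Kold is a six-rowed\n\nwinter feed\nbarley."

print(sample.replace('\n', ''))      # Kold is a six-rowedwinter feedbarley.
print(normalize_whitespace(sample))  # Kold is a six-rowed winter feed barley.
```

Whether this helps depends on how the PDF reader places newlines on a given page, so it is worth comparing both versions on real output before switching.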
    