# Some Very Basic spaCy Examples

spaCy is an open-source, industrial-strength NLP library that performs many common tasks out of the box. It strikes a good balance between processing speed and prediction accuracy.  Its pretrained English models are trained on the [OntoNotes5](https://catalog.ldc.upenn.edu/LDC2013T19) data set, so they can already do part of speech tagging, dependency parsing, lemmatization, and named entity recognition.  spaCy can also be trained to do text classification and a number of other tasks in the standard NLP stack.  It can be a handy way of analyzing some text. Another use is annotating text to create a labelled training set that you then use to train your own model independent of spaCy.

Pretrained models are also available for many other languages.  Before you use spaCy you need to select and load a specific language model.

spaCy uses a combination of techniques including embeddings and convolutional neural nets.  There's a video of [Matthew Honnibal](https://spacy.io/universe/project/video-spacys-ner-model) (at 10:00) describing how spaCy performs named entity recognition.  He describes the architectural choices he made in adding the functionality.  You can also add your own training on top of the existing training to enhance the model.

Take a look at [this page](https://spacy.io/usage) on the spaCy website to see if you can run it on your machine.  It works on Linux, Mac, and Windows and will operate quite well on a machine without a GPU.  Even though it uses Cython, you can use pip or conda to install a working package that doesn't require you to run a C compiler.

To use this notebook, you'll need to install spaCy on your machine.  If you're just experimenting, it is a good idea to install it inside a virtual environment so it will not adversely affect your existing setup.

Here are the steps, repeated from [this spaCy page](https://spacy.io/usage):

#### setup is simple when you create a virtual environment
1. python -m venv .env
2. source .env/bin/activate
3. pip install -U spacy
4. pip install -U spacy-lookups-data

#### get the large english model (it comes with pre-trained embeddings)
- python -m spacy download en_core_web_lg

#### if you haven't already you should also install pandas so you can capture data for subsequent analysis and use
- pip install pandas

##### you can also make a special kernel in your jupyter notebook so you know you're running the right environment.
##### you can create that in your virtualenv

###### to create the kernel for your notebook
- python -m ipykernel install --user --name spacy3 --display-name "Python 3 Spacy3"
###### it will tell you it has created the kernel
Installed kernelspec spacy3 in /Users/markhb/Library/Jupyter/kernels/spacy3

#### if you want to get a specific version of a language model (it must be compatible with your installed spaCy version)
- python -m spacy download en_core_web_sm-2.1.0 --direct

#### if you want to get a non-english model e.g. japanese model
- python -m spacy download ja_core_news_sm
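Before moving on, it can help to confirm that the installs above actually landed in the environment your notebook kernel is using. The following is a minimal sketch that relies only on the standard library; it assumes that spaCy models are installed as ordinary Python packages (which is how `spacy download` delivers them), and the helper name `is_installed` is made up for this example.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None

# Package names taken from the setup steps above.
for name in ("spacy", "pandas", "en_core_web_lg"):
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```

If a package shows up as missing here but you installed it, the notebook is probably running a different kernel than the virtual environment you set up.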

In [1]:
import spacy
import pandas as pd

print(spacy.__version__)
print(pd.__version__)

3.1.3
1.2.4


### Pre-trained Language Models
Make sure you first load a language model. We're selecting English via the large model which gives us access to embeddings.  There are many other options and other languages.

In [2]:
#load an english model -- the large model includes word embeddings
nlp = spacy.load("en_core_web_lg")

### Natural Language Processing
When you invoke spaCy on some input text it generates a set of objects.  spaCy processes "document"-like objects; a document can be a single sentence or many sentences.  You pass text to the nlp object, which returns a Doc object.  That Doc object contains a sequence of Token objects, each of which carries a set of annotations.  Many of the examples below simply harvest the labels attached to each token after the document has been processed.

In [3]:
doc = nlp("This is a sentence.")

print("The first word is: ") 
doc[0].text

The first word is: 


'This'

### Similarity Calculations

In [4]:
#We're taking advantage of the large model, which includes word embeddings.
#Synonymous words have similar embeddings, so we can compute the similarity of two sentences.
doc1 = nlp("How do I adopt a cat?")
doc2 = nlp("How do I obtain a pet?")

doc1.similarity(doc2)

0.9510932235818049

In [5]:
#Doc.similarity compares averaged word vectors, not sentence-level embeddings,
#so these semantically similar sentences aren't as similar as you might expect.
doc3 = nlp("How old are you?")
doc4 = nlp("What is your age?")

doc3.similarity(doc4)

0.8914506294237656
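To make the word-based behaviour concrete, here is a small sketch of the default idea behind the similarity score: average the word vectors of each text, then take the cosine of the two averages. The three-dimensional vectors below are invented purely for illustration (the real model's vectors are much larger), and `average_vector` and `cosine` are hypothetical helper names, not spaCy functions.

```python
import math

# Toy word vectors, made up for this example only.
TOY_VECTORS = {
    "how":  [0.9, 0.1, 0.0],
    "old":  [0.1, 0.8, 0.3],
    "are":  [0.8, 0.2, 0.1],
    "you":  [0.7, 0.1, 0.4],
    "what": [0.9, 0.0, 0.1],
    "is":   [0.8, 0.2, 0.0],
    "your": [0.7, 0.2, 0.4],
    "age":  [0.1, 0.9, 0.2],
}

def average_vector(words):
    """Element-wise mean of the toy vectors for the given words."""
    dims = len(next(iter(TOY_VECTORS.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(TOY_VECTORS[word]):
            total[i] += value
    return [value / len(words) for value in total]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sent_a = ["how", "old", "are", "you"]
sent_b = ["what", "is", "your", "age"]
score = cosine(average_vector(sent_a), average_vector(sent_b))
print(round(score, 3))

# Averaging throws away word order: a shuffled sentence scores the same.
shuffled = cosine(average_vector(sent_a), average_vector(sent_a[::-1]))
print(round(shuffled, 3))
```

Because the average ignores word order and gives every word equal weight, frequent function words can dominate the score, which is one reason two paraphrases may score lower than you would expect.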

## Multiple functions
spaCy is able to perform multiple functions that you might expect from an NLP stack.  Here are examples of some of those functions.  These functions include sentence boundary detection, lemmatization, part of speech tagging, rule based matching, dependency parsing, noun phrase detection, and named entity recognition.  All of this functionality is combined under one umbrella.

### Sentence Boundary Detection

Sentence boundary detection is a hard problem because you can't just look for a period: abbreviations can contain periods too.  Often, but not always, the boundary of a sentence is a period followed by a space or two and then a capital letter.  A well trained classifier can handle the many possibilities.
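To see why simple rules fall short, here is a rough sketch that applies the two hand-written rules just described using only regular expressions. The sample text is made up for this illustration (it contains three sentences), and neither rule is what spaCy does internally.

```python
import re

# Three real sentences, two of which contain an abbreviation.
text = ("Great advances have been made in the U.S. in the past decade. "
        "Dr. Smith presented the results. New neural nets can help.")

# Rule 1: split after every period that is followed by whitespace.
naive = re.split(r"(?<=\.)\s+", text)

# Rule 2: also require that the next character is a capital letter.
capital = re.split(r"(?<=\.)\s+(?=[A-Z])", text)

print(len(naive), naive)
print(len(capital), capital)
```

Rule 1 breaks after "U.S." and after "Dr.", producing five pieces. Rule 2 repairs the "U.S." case but still splits "Dr." from "Smith", producing four pieces instead of three.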

In [6]:
#sentence detection
# Given an input block of text, identify where the sentences end.

about_text = ('Sentence boundary detection is actually'
              ' a pretty hard problem.  Great advances'
              ' have been made in the U.S. in the past decade. New neural nets'
              ' like a CNN can help improve results on this classification task.')
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
#len(sentences)

#now print out the sentences
for sentence in sentences:
    print(sentence)

Sentence boundary detection is actually a pretty hard problem.
 
Great advances have been made in the U.S. in the past decade.
New neural nets like a CNN can help improve results on this classification task.


### Lemmatization

Lemmatization is the identification of the root form of a word.  This means that plural nouns are converted to their singular form.  Similarly, a verb is converted to its base form.  It can be useful when you want to count word occurrences and consolidate different forms of the same word.

In [7]:
#lemmatization
#since each token is a word you can just print the lemma after the word even though it doesn't look great
organize_papers_text = ('We are helping organize all of the'
    ' conference papers on Natural Language'
    ' Processing from all conferences. We keep clustering the papers'
    ' in to different sets of subdomains.')
organize_papers_doc = nlp(organize_papers_text)

#print out each token and its associated lemma
for token in organize_papers_doc:
    print(token, token.lemma_)

We we
are be
helping help
organize organize
all all
of of
the the
conference conference
papers paper
on on
Natural Natural
Language Language
Processing Processing
from from
all all
conferences conference
. .
We we
keep keep
clustering cluster
the the
papers paper
in in
to to
different different
sets set
of of
subdomains subdomain
. .
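As a sketch of the counting use case, the snippet below takes the token and lemma pairs printed above (copied by hand, with the punctuation left out) and compares counts of surface forms with counts of lemmas. It uses only the standard library, so it runs without a model.

```python
from collections import Counter

# (token, lemma) pairs copied from the cell output above, punctuation omitted.
pairs = [
    ("We", "we"), ("are", "be"), ("helping", "help"), ("organize", "organize"),
    ("all", "all"), ("of", "of"), ("the", "the"), ("conference", "conference"),
    ("papers", "paper"), ("on", "on"), ("Natural", "Natural"),
    ("Language", "Language"), ("Processing", "Processing"), ("from", "from"),
    ("all", "all"), ("conferences", "conference"), ("We", "we"),
    ("keep", "keep"), ("clustering", "cluster"), ("the", "the"),
    ("papers", "paper"), ("in", "in"), ("to", "to"),
    ("different", "different"), ("sets", "set"), ("of", "of"),
    ("subdomains", "subdomain"),
]

surface_counts = Counter(token for token, _ in pairs)
lemma_counts = Counter(lemma for _, lemma in pairs)

# "conference" and "conferences" are separate entries as surface forms...
print(surface_counts["conference"], surface_counts["conferences"])
# ...but they are consolidated into one entry once you count lemmas.
print(lemma_counts["conference"])
```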


### Part of Speech Tagging
Part of speech tagging can be very valuable.  Tagging words can allow you to quickly distinguish "things" from "actions" or "events." spaCy exposes two part of speech attributes on each token, both shown below: tag_ is the fine-grained tag (Penn Treebank style for English) and pos_ is the coarse-grained universal tag.

In [8]:
#POS with unpretty print

for token in about_doc:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

Sentence NN NOUN noun, singular or mass
boundary NN NOUN noun, singular or mass
detection NN NOUN noun, singular or mass
is VBZ AUX verb, 3rd person singular present
actually RB ADV adverb
a DT DET determiner
pretty RB ADV adverb
hard JJ ADJ adjective (English), other noun-modifier (Chinese)
problem NN NOUN noun, singular or mass
. . PUNCT punctuation mark, sentence closer
  _SP SPACE whitespace
Great JJ ADJ adjective (English), other noun-modifier (Chinese)
advances NNS NOUN noun, plural
have VBP AUX verb, non-3rd person singular present
been VBN AUX verb, past participle
made VBN VERB verb, past participle
in IN ADP conjunction, subordinating or preposition
the DT DET determiner
U.S. NNP PROPN noun, proper singular
in IN ADP conjunction, subordinating or preposition
the DT DET determiner
past JJ ADJ adjective (English), other noun-modifier (Chinese)
decade NN NOUN noun, singular or mass
. . PUNCT punctuation mark, sentence closer
New JJ ADJ adjective (English), other noun-modifier (C

In [9]:
#POS
#capturing the output in a pandas dataframe makes it easier to view
dpos = pd.DataFrame()
dpos['text'] = [token.text for token in about_doc]
dpos['tag'] = [token.tag_ for token in about_doc]
dpos['pos'] = [token.pos_ for token in about_doc]
dpos['explain'] = [spacy.explain(token.tag_) for token in about_doc]

dpos


| | text | tag | pos | explain |
|---|---|---|---|---|
| 0 | Sentence | NN | NOUN | noun, singular or mass |
| 1 | boundary | NN | NOUN | noun, singular or mass |
| 2 | detection | NN | NOUN | noun, singular or mass |
| 3 | is | VBZ | AUX | verb, 3rd person singular present |
| 4 | actually | RB | ADV | adverb |
| 5 | a | DT | DET | determiner |
| 6 | pretty | RB | ADV | adverb |
| 7 | hard | JJ | ADJ | adjective (English), other noun-modifier (Chin... |
| 8 | problem | NN | NOUN | noun, singular or mass |
| 9 | . | . | PUNCT | punctuation mark, sentence closer |
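As a small illustration of separating "things" from everything else, the sketch below groups the first ten tokens by their coarse-grained tag. The rows are copied by hand from the output above, so no model is needed to run it.

```python
from collections import defaultdict

# (text, fine-grained tag, coarse-grained tag) copied from the first ten rows above.
rows = [
    ("Sentence", "NN", "NOUN"), ("boundary", "NN", "NOUN"),
    ("detection", "NN", "NOUN"), ("is", "VBZ", "AUX"),
    ("actually", "RB", "ADV"), ("a", "DT", "DET"),
    ("pretty", "RB", "ADV"), ("hard", "JJ", "ADJ"),
    ("problem", "NN", "NOUN"), (".", ".", "PUNCT"),
]

# Group token texts under their coarse-grained part of speech.
by_pos = defaultdict(list)
for text, tag, pos in rows:
    by_pos[pos].append(text)

nouns = by_pos["NOUN"]
print(nouns)
print(sorted(by_pos))
```

The same grouping works on a live Doc by iterating over its tokens and reading `token.text` and `token.pos_`.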


### Rule-Based Matching
spaCy offers a rule based matching capability that allows you to construct rules to match strings and extract them.  This works well if you know all the things you're looking for, like an unambiguous list of your company's product names.  The example below just picks out the first proper noun that is followed by another proper noun.

In [10]:
#Rule based matching
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# register the pattern once: two consecutive proper nouns
pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
matcher.add('FULL_NAME', [pattern])

nnp_doc = nlp('Marti Hearst and Dan Jurafsky studied with Robert Wilensky at UC Berkeley.')

def extract_full_name(nlp_doc):
    # return the text of the first match, or None if nothing matches
    for match_id, start, end in matcher(nlp_doc):
        return nlp_doc[start:end].text
    return None

extract_full_name(nnp_doc)

'Marti Hearst'
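The cell above returns only the first match. To show what the two-proper-noun pattern would pick up across the whole sentence, here is a plain Python stand-in that slides a window over tagged tokens. The POS tags are hand-assigned assumptions for this example rather than model output, and `match_pos_sequence` is a hypothetical helper, not how spaCy's Matcher is implemented.

```python
# (text, coarse POS) pairs; the tags are assigned by hand for illustration.
tagged = [
    ("Marti", "PROPN"), ("Hearst", "PROPN"), ("and", "CCONJ"),
    ("Dan", "PROPN"), ("Jurafsky", "PROPN"), ("studied", "VERB"),
    ("with", "ADP"), ("Robert", "PROPN"), ("Wilensky", "PROPN"),
    ("at", "ADP"), ("UC", "PROPN"), ("Berkeley", "PROPN"), (".", "PUNCT"),
]

def match_pos_sequence(tokens, pattern):
    """Return the text of every window whose POS tags equal `pattern`."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if [pos for _, pos in window] == pattern:
            matches.append(" ".join(text for text, _ in window))
    return matches

names = match_pos_sequence(tagged, ["PROPN", "PROPN"])
print(names)
```

With these tags the pattern also picks up "UC Berkeley", which is a reminder that a purely part of speech based rule cannot tell a person from an organization.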

### Dependency Parsing
spaCy performs dependency parsing right out of the box.  This can be a very handy way of identifying words and the relations between them.  Sometimes those relations fundamentally change the meaning of the word as in the case of negation.

In [11]:
#dependency parsing
w266_text = 'Students are learning Natural Language Processing in the W266 class.'
w266_doc = nlp(w266_text)
for token in w266_doc:
    print(token.text, token.tag_, token.head.text, token.dep_)

Students NNS learning nsubj
are VBP learning aux
learning VBG learning ROOT
Natural NNP Language compound
Language NNP Processing compound
Processing NNP learning dobj
in IN learning prep
the DT class det
W266 CD class compound
class NN in pobj
. . learning punct
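Each output line above is one arc in a tree: a token, its head, and the relation between them. As a sketch of how those arcs can be used, the snippet below rebuilds the tree from the printed values (copied by hand) and collects everything that hangs under a given word. Looking tokens up by their text only works here because every word in this sentence is distinct; the helper names are made up for this example.

```python
from collections import defaultdict

# (text, head text, dependency label) copied from the output above.
arcs = [
    ("Students", "learning", "nsubj"), ("are", "learning", "aux"),
    ("learning", "learning", "ROOT"), ("Natural", "Language", "compound"),
    ("Language", "Processing", "compound"), ("Processing", "learning", "dobj"),
    ("in", "learning", "prep"), ("the", "class", "det"),
    ("W266", "class", "compound"), ("class", "in", "pobj"),
    (".", "learning", "punct"),
]

index_of = {text: i for i, (text, _, _) in enumerate(arcs)}

# Map each head position to the positions of its direct children.
children = defaultdict(list)
for i, (text, head, dep) in enumerate(arcs):
    if dep != "ROOT":
        children[index_of[head]].append(i)

def subtree(i):
    """Positions of token i and all of its descendants, in sentence order."""
    found = [i]
    for child in children[i]:
        found.extend(subtree(child))
    return sorted(found)

def phrase(word):
    """Text of the subtree rooted at `word`."""
    return " ".join(arcs[i][0] for i in subtree(index_of[word]))

print(phrase("Processing"))  # Natural Language Processing
print(phrase("in"))          # in the W266 class
```

On a live Doc you would not need to rebuild anything, since each token already exposes its `head`, `children`, and `subtree`.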


In [12]:
# more parsing labels - same w266_doc
# you can extract many labels for use in downstream processes
for token in w266_doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Students student NOUN NNS nsubj Xxxxx True False
are be AUX VBP aux xxx True True
learning learn VERB VBG ROOT xxxx True False
Natural Natural PROPN NNP compound Xxxxx True False
Language Language PROPN NNP compound Xxxxx True False
Processing Processing PROPN NNP dobj Xxxxx True False
in in ADP IN prep xx True True
the the DET DT det xxx True True
W266 w266 NUM CD compound Xddd False False
class class NOUN NN pobj xxxx True False
. . PUNCT . punct . False False


In [13]:
#if you capture the tags in a dataframe you can then perform additional 
#operations like counting and filtering and searching

df = pd.DataFrame()
df['text'] = [token.text for token in w266_doc]
df['lemma'] = [token.lemma_ for token in w266_doc]
df['is_punctuation'] = [token.is_punct for token in w266_doc]
df['is_space'] = [token.is_space for token in w266_doc]
df['shape'] = [token.shape_ for token in w266_doc]
df['part_of_speech'] = [token.pos_ for token in w266_doc]
df['pos_tag'] = [token.tag_ for token in w266_doc]
df['head'] = [token.head.text for token in w266_doc] 
df['dep'] = [token.dep_ for token in w266_doc]

df

| | text | lemma | is_punctuation | is_space | shape | part_of_speech | pos_tag | head | dep |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Students | student | False | False | Xxxxx | NOUN | NNS | learning | nsubj |
| 1 | are | be | False | False | xxx | AUX | VBP | learning | aux |
| 2 | learning | learn | False | False | xxxx | VERB | VBG | learning | ROOT |
| 3 | Natural | Natural | False | False | Xxxxx | PROPN | NNP | Language | compound |
| 4 | Language | Language | False | False | Xxxxx | PROPN | NNP | Processing | compound |
| 5 | Processing | Processing | False | False | Xxxxx | PROPN | NNP | learning | dobj |
| 6 | in | in | False | False | xx | ADP | IN | learning | prep |
| 7 | the | the | False | False | xxx | DET | DT | class | det |
| 8 | W266 | w266 | False | False | Xddd | NUM | CD | class | compound |
| 9 | class | class | False | False | xxxx | NOUN | NN | in | pobj |


### Noun Phrase Detection
spaCy can identify the noun phrases in the input text.  This can be an interesting set of objects to count if you're doing some basic analytics.  If you simply grabbed bi-grams you would introduce a lot of noise and miss some important parts of phrases.

In [14]:
#noun phrase detection

for chunk in w266_doc.noun_chunks:
    print(chunk)

Students
Natural Language Processing
the W266 class
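To illustrate the point about bi-grams, this sketch lists every bi-gram in the same sentence and checks which of them fall inside one of the noun chunks printed above. The tokens and chunks are copied by hand from the earlier output (the final period is left out).

```python
# Tokens and noun chunks copied from the outputs above.
tokens = ["Students", "are", "learning", "Natural", "Language",
          "Processing", "in", "the", "W266", "class"]
noun_chunks = ["Students", "Natural Language Processing", "the W266 class"]

# Every pair of adjacent tokens.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Bi-grams that are fully contained in some noun chunk.
inside = [b for b in bigrams if any(b in chunk for chunk in noun_chunks)]
# Bi-grams that straddle a chunk boundary or contain no chunk material.
noise = [b for b in bigrams if b not in inside]

print(len(bigrams), "bi-grams in total")
print("inside a noun chunk:", inside)
print("noise:", noise)
```

More than half of the bi-grams are noise such as "Processing in", and none of them captures a three-word phrase like "Natural Language Processing" in full.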


### Named Entity Recognition
spaCy is also trained to do some basic NER out of the box.  The English models are trained on OntoNotes5, so they annotate text with that corpus's set of entity tags.  If those don't work for you, then you can train spaCy to identify different entities or use a different tag set.

In [15]:
#NER example
for ent in nnp_doc.ents:
    print(ent.text, ent.start_char, ent.end_char,
        ent.label_, spacy.explain(ent.label_))

Marti Hearst 0 12 PERSON People, including fictional
Dan Jurafsky 17 29 PERSON People, including fictional
Robert Wilensky 43 58 PERSON People, including fictional
UC Berkeley 62 73 ORG Companies, agencies, institutions, etc.
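The character offsets in the output above are what make these annotations reusable, for example when building a labelled data set. The sketch below uses the offsets exactly as printed to group entity strings by label and to produce an inline-annotated copy of the sentence. The bracket notation is just an ad hoc choice for this example, not a spaCy format.

```python
from collections import defaultdict

text = 'Marti Hearst and Dan Jurafsky studied with Robert Wilensky at UC Berkeley.'

# (start_char, end_char, label) copied from the output above.
spans = [(0, 12, "PERSON"), (17, 29, "PERSON"),
         (43, 58, "PERSON"), (62, 73, "ORG")]

# Group the entity strings by their label.
by_label = defaultdict(list)
for start, end, label in spans:
    by_label[label].append(text[start:end])

# Insert markup from right to left so earlier offsets stay valid.
annotated = text
for start, end, label in sorted(spans, reverse=True):
    annotated = (annotated[:start] + "[" + annotated[start:end] + "]("
                 + label + ")" + annotated[end:])

print(dict(by_label))
print(annotated)
```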


### SVO Triple Extraction example
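You can leverage the dependency graph to identify subject-verb-object triples.  These can be used to populate a knowledge graph or to extract "facts" from text.  The function below is a simple heuristic: it collects every verb, subject, and object it finds in the document, so it is best suited to single-clause sentences.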

In [16]:
#SVO extraction

# object and subject constants
OBJECT_DEPS = {"dobj", "dative", "attr", "oprd"}
SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "agent", "expl"}

# extract the subject, object and verb from the input
def extract_triples(doc):
    sub = []
    at = []
    ve = []
    for token in doc:
        # is this a verb?
        if token.pos_ == "VERB":
            ve.append(token.text)
        # is this the object?
        if token.dep_ in OBJECT_DEPS or token.head.dep_ in OBJECT_DEPS:
            at.append(token.text)
        # is this the subject?
        if token.dep_ in SUBJECT_DEPS or token.head.dep_ in SUBJECT_DEPS:
            sub.append(token.text)
    return " ".join(sub).strip().lower(), " ".join(ve).strip().lower(), " ".join(at).strip().lower()


# print out the pos and deps
for token in w266_doc:
    print("Token {} POS: {}, dep: {}".format(token.text, token.pos_, token.dep_))

# process the input information
subject, verb, attribute = extract_triples(w266_doc)
print("svo triple:, subject: {}, verb: {}, attribute: {}".format(subject, verb, attribute))

Token Students POS: NOUN, dep: nsubj
Token are POS: AUX, dep: aux
Token learning POS: VERB, dep: ROOT
Token Natural POS: PROPN, dep: compound
Token Language POS: PROPN, dep: compound
Token Processing POS: PROPN, dep: dobj
Token in POS: ADP, dep: prep
Token the POS: DET, dep: det
Token W266 POS: NUM, dep: compound
Token class POS: NOUN, dep: pobj
Token . POS: PUNCT, dep: punct
svo triple:, subject: students, verb: learning, attribute: language processing
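As a rough idea of what populating a knowledge graph could look like, the sketch below stores triples in a dictionary keyed by subject. The first triple is the one printed above; the second is hypothetical and is included only so the lookup has more than one edge to return. The `facts_about` helper is made up for this example.

```python
from collections import defaultdict

triples = [
    ("students", "learning", "language processing"),  # from the output above
    ("students", "attending", "class"),               # hypothetical extra triple
]

# subject -> list of (verb, object) edges
graph = defaultdict(list)
for subject, verb, obj in triples:
    graph[subject].append((verb, obj))

def facts_about(subject):
    """Render every stored edge for a subject as a short sentence."""
    return ["{} {} {}".format(subject, verb, obj) for verb, obj in graph.get(subject, [])]

print(facts_about("students"))
print(facts_about("teachers"))
```

A real pipeline would normally normalize subjects and objects (for example by lemma) before storing them, so that different surface forms of the same entity end up under one key.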


### Question Identification

spaCy can also be used to identify questions in text.  Questions can be yes/no questions like "Do you..." or "Can you..." or "Will you..."  Questions can also use a wh- word like who, what, where, when, or how.

In [17]:
#question identification
w266_qtext = 'What are students learning in the W266 class?'
#w266_qtext = 'Do students learn natural language processing in the W266 class?'
w266_question = nlp(w266_qtext)


# tags that define whether the word is a wh- word
WH_WORDS = {"WP", "WP$", "WRB"}


# whether the doc is a question, as well as the wh-word if any
def is_question(doc):
    # is the first token an auxiliary or modal verb?
    if len(doc) > 0 and doc[0].pos_ == "AUX":  # covers both auxiliary & modal verbs
        return True, "yes/no question"
    # go over all words
    for token in doc:
        # is it a wh- word?
        if token.tag_ in WH_WORDS:
            return True, token.text.lower()
    return False, ""


for token in w266_question:
    print("Token {} POS: {}, dep: {}".format(token.text, token.pos_, token.dep_))

# test the input statement
question, wh_word = is_question(w266_question)
print("question type: {}".format(wh_word))

Token What POS: PRON, dep: dobj
Token are POS: AUX, dep: aux
Token students POS: NOUN, dep: nsubj
Token learning POS: VERB, dep: ROOT
Token in POS: ADP, dep: prep
Token the POS: DET, dep: det
Token W266 POS: NUM, dep: compound
Token class POS: NOUN, dep: pobj
Token ? POS: PUNCT, dep: punct
question type: what
