# NLP with spaCy: 
## Named Entity Recognition (NER) & Part-of-Speech (POS) Tagging

*Lauren F. Klein wrote version 1.0, based on a notebook by Allison Parrish. I have supplemented it with material from Melanie Walsh's chapters [Named Entity Recognition](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Text-Analysis/Named-Entity-Recognition.html) and [Part-of-Speech Tagging](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Text-Analysis/POS-Keywords.html) from her online textbook [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/welcome.html).*

# Named Entity Recognition

### Why is NER useful?

NER is useful for extracting key information from texts. You might use NER to identify the most frequently appearing characters in a novel or build a network of characters. Or you might use NER to identify the geographic locations mentioned in texts, a first step toward mapping them.

### What is Natural Language Processing?

Named Entity Recognition is a fundamental task in the field of natural language processing (NLP). What is NLP, exactly? NLP is an interdisciplinary field that blends linguistics, statistics, and computer science. The heart of NLP is to understand human language with statistics and computers. Applications of NLP are all around us. Have you ever heard of a little thing called spellcheck? How about autocomplete, Google Translate, chatbots, and Siri? These are all examples of NLP in action!

## spaCy

We're going to use an open-source Python library, spaCy, for NER and POS tagging today. spaCy relies on machine learning models that were trained on a large amount of carefully labeled texts. (These texts were, in fact, often labeled and corrected by hand.)

The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more. (Like a lot of other major machine learning projects, OntoNotes was also sponsored by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.)

Okay, let's get started.

### Installing spaCy

From a notebook cell, run:

    !pip install -U spacy

### Downloading the spaCy language model 

Next we need to download the English-language model (en_core_web_sm), which will be processing and making predictions about our texts. This is the model that was trained on the annotated “OntoNotes” corpus. You can download the en_core_web_sm model by running the cell below:

    !python -m spacy download en_core_web_sm
    
The leading `!` is notebook syntax for running a shell command; omit it if you type either command directly into a terminal. Neither command requires `sudo`.

### Load language model

Once the model is downloaded, we need to load it.

In [1]:
import en_core_web_sm
nlp = en_core_web_sm.load()

### Import libraries

We’re going to import spacy and displacy, a special spaCy module for visualization. spaCy expects all strings to be Unicode strings. In Python 3 every string already is, so the `from __future__ import unicode_literals` line below only matters if you are running Python 2; it is harmless otherwise.

In [2]:
from __future__ import unicode_literals
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

### Process document

We first need to process our document with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.

After processing, the document object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

Let's create a `Doc` object (spaCy's name for a processed document) with a few sentences from the Universal Declaration of Human Rights:

In [3]:
doc = nlp("All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone has the right to life, liberty and security of person.")

## What can we do with spaCy?

### Sentences

spaCy offers an easy way to identify sentences with its `.sents` attribute. Once you've created a document object, you can iterate over the sentences it contains using `doc.sents`:

In [4]:
print("Here is the doc: ")
print(doc)
print("\n")
print("And here are the doc's sentences: ")

for item in doc.sents:
    print(item.text)

Here is the doc: 
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone has the right to life, liberty and security of person.


And here are the doc's sentences: 
All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
Everyone has the right to life, liberty and security of person.


Note: The `.sents` attribute is a generator, so you can't index or count it directly. To index or count the .sents attribute, you'll need to convert it to a list first using the `list()` function:

In [5]:
sentences_as_list = list(doc.sents)

In [6]:
# check the length to make sure it worked

print("Here's the number of sentences: " + str(len(sentences_as_list)))

Here's the number of sentences: 3


### Words

Iterating over a document yields each token (a word or punctuation mark) in turn. Tokens are represented with spaCy [Token](https://spacy.io/docs/api/token) objects, which have several interesting attributes.

The `.text` attribute gives the word, and the `.lemma_` attribute gives the word's "lemma."

Here are Daniel Jurafsky and James H. Martin on lemmas:

"**Lemmas and Senses**

Let’s start by looking at how one word (we’ll choose mouse) might be defined in a dictionary:
mouse (N)

1.  any of numerous small rodents...
2.  a hand-operated device that controls a cursor...

Here the form mouse is the `lemma`, also called the citation form. The form mouse would also be the lemma for the word mice; dictionaries don’t have separate definitions for inflected forms like mice. Similarly sing is the lemma for sing, sang, sung. In many languages the infinitive form is used as the lemma for the verb, so Spanish dormir “to sleep” is the lemma for duermes “you sleep”. The specific forms _sung_ or _carpets_ or _sing_ or _duermes_ are called `wordforms`."

Let's take a look:

In [7]:
print("Word, lemma\n")
for word in doc:
    print(word.text + ", " + word.lemma_)
    
# Note: On the underscore at the end of a variable, see
# https://www.datacamp.com/community/tutorials/role-underscore-python

Word, lemma

All, all
human, human
beings, being
are, be
born, bear
free, free
and, and
equal, equal
in, in
dignity, dignity
and, and
rights, right
., .
They, -PRON-
are, be
endowed, endow
with, with
reason, reason
and, and
conscience, conscience
and, and
should, should
act, act
towards, towards
one, one
another, another
in, in
a, a
spirit, spirit
of, of
brotherhood, brotherhood
., .
Everyone, everyone
has, have
the, the
right, right
to, to
life, life
,, ,
liberty, liberty
and, and
security, security
of, of
person, person
., .


Individual sentences can also be iterated over to get a list of words:

In [8]:
sentence = list(doc.sents)[1]  # same as sentence = sentences_as_list[1]

for word in sentence:
    print(word.text, word.lemma_)

They -PRON-
are be
endowed endow
with with
reason reason
and and
conscience conscience
and and
should should
act act
towards towards
one one
another another
in in
a a
spirit spirit
of of
brotherhood brotherhood
. .


## spaCy & entities

Identifying sentences and words is just the beginning of what spaCy can do! Encoded now in the document is a great deal of information about named entities: that is, about people, places, and many types of things.

### Loading data from a file

You can load data from a file easily with spaCy. You just have to make sure that the file is read as Unicode text. An easy way to do this in Python 3 is to pass `encoding="utf-8"` to `open()` when you read the file. Let's take a look by loading up one of the songs from our lyrics corpus. Let's try Demi Lovato's "Made in the USA."

Remember: **Be sure that you have a folder titled 'lyrics' in the same folder with this notebook on your computer.**

In [9]:
with open("./lyrics/Demi-lovato-made-in-the-usa.txt", "r", encoding="utf-8") as file:
    lyrics = file.read()
print(lyrics)

Our love runs deep like a Chevy
If you fall, I'll fall with you, baby
'Cause that's the way we like to do it
That's the way we like
You run around, open doors like a gentleman
Tell me, girl, every day of my everything
'Cause that's the way you like to do it
That's the way you like

Just a little West Coast and a bit of sunshine
Hair blowing in the wind, losing track of time
Just you and I
Just you and I
Whoa, whoa

No matter how far we go
I want the whole world to know I want you bad
And I won't have it any other way
No matter what the people say
I know that we'll never break
'Cause our love was made, made in the USA
Made in the USA, yeah

You always reading my mind like a letter
When I'm cold, you're there like a sweater
'Cause that's the way we like to do it
That's the way we like
And never ever let the world get the best of you
Every night we're apart, I'm still next to you
'Cause that's the way I like to do it
That's the way I like

We touch down on the East Coast
Dinner in the sky
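If the `lyrics` folder is not where the notebook expects it, the `open()` call above fails with a fairly terse error. As a minimal sketch (the helper name `read_lyrics` is made up for illustration and is not part of spaCy), the same read can be wrapped so that the error message says what to check:

```python
from pathlib import Path


def read_lyrics(path):
    """Read a UTF-8 text file and return its contents as a string."""
    path = Path(path)
    if not path.is_file():
        # Hypothetical, friendlier message than the default error.
        raise FileNotFoundError(
            f"Could not find {path}. Is the 'lyrics' folder next to this notebook?"
        )
    return path.read_text(encoding="utf-8")
```

Calling `read_lyrics("./lyrics/Demi-lovato-made-in-the-usa.txt")` would then stand in for the `with open(...)` block.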

Next, let's turn the lyrics into a spaCy document:

In [10]:
doc2 = nlp(lyrics)

Below is a Named Entities chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different named entities that spaCy can identify as well as their corresponding type labels.

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|

To quickly see spaCy's NER in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) with the `style=` parameter set to "ent"  (short for entities):

In [11]:
displacy.render(doc2, style="ent")

As you can see, spaCy can identify "every day" and "Winter" as `DATE`s, West Coast and East Coast as `LOC`s, and USA as a `GPE`. It recognizes Chevy but thinks it's an `ORG`, the company Chevrolet, rather than a `PRODUCT`. This is a good reminder that spaCy is far from perfect and you always need to check how it does!
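`displacy.render` draws straight into a notebook. Outside a notebook, passing `jupyter=False` makes it return the markup as a string instead. The sketch below builds a tiny document with a tokenizer-only pipeline (`spacy.blank("en")`) and labels one entity by hand, so that it runs without the downloaded model; in this lesson you would pass `doc2` instead.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Tokenizer-only pipeline: splits text into tokens but predicts nothing.
blank_nlp = spacy.blank("en")
toy_doc = blank_nlp("Our love was made in the USA")

# Stand-in annotation: mark the last token as a GPE by hand.
toy_doc.ents = [Span(toy_doc, 6, 7, label="GPE")]

# jupyter=False returns the HTML as a string; page=True wraps it in a full page.
html = displacy.render(toy_doc, style="ent", jupyter=False, page=True)
print(len(html), "characters of HTML")
```

The resulting `html` string can be written to a file with an ordinary `open(..., "w", encoding="utf-8")` and opened in a browser.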

This is all we're going to do with NER today. As you can see this is just the very tip of the iceberg. You could now easily get, say, a full list of persons or orgs or products from any text. Feel free to explore further on your own.
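As a rough sketch of that last idea: every processed document exposes its entities through `doc.ents`, and each entity has a `.text` and a `.label_`. The helper name `entities_with_label` is invented for this example, and the toy document below has its entities labeled by hand so that the code runs without the downloaded model. With the lesson's objects you would call the helper on `doc2`.

```python
from collections import Counter

import spacy
from spacy.tokens import Span


def entities_with_label(doc, label):
    """Return the text of every entity in `doc` whose type label matches."""
    return [ent.text for ent in doc.ents if ent.label_ == label]


# Tokenizer-only pipeline: splits text into tokens but predicts nothing.
blank_nlp = spacy.blank("en")
toy_doc = blank_nlp("West Coast to East Coast made in the USA")

# Stand-in annotations (a trained model would predict these).
toy_doc.ents = [
    Span(toy_doc, 0, 2, label="LOC"),
    Span(toy_doc, 3, 5, label="LOC"),
    Span(toy_doc, 8, 9, label="GPE"),
]

locations = entities_with_label(toy_doc, "LOC")
label_counts = Counter(ent.label_ for ent in toy_doc.ents)
print(locations)
print(label_counts.most_common())
```

Swapping `"LOC"` for `"PERSON"`, `"ORG"`, or `"PRODUCT"` gives the corresponding lists, and `label_counts` shows at a glance which entity types dominate a text.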

# Parts of speech

In this part of the lesson, we're going to learn about the textual analysis method _part-of-speech tagging_.  

spaCy's `pos_` attribute gives a general part of speech; the `tag_` attribute gives a more specific designation.

## Why is Part-of-Speech Tagging Useful?

I don't mean to go all [Language Nerd](https://xkcd.com/1443/) on you, but parts of speech are important. Even if they seem kind of boring. *Parts of speech* are the grammatical units of language — such as (in English) nouns, verbs, adjectives, adverbs, pronouns, and prepositions. Each of these parts of speech plays a different role in a sentence.

<img src="https://imgs.xkcd.com/comics/language_nerd.png" >


By computationally identifying parts of speech, we can start computationally exploring *syntax*, the relationship between words, rather than only focusing on words in isolation.

## spaCy Part-of-Speech Tagging

What **parts** of speech does spaCy recognize and identify?

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | ’s, not,                                      |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝             |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     |                                               |

Above is a POS chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different parts of speech that spaCy can identify as well as their corresponding labels. 
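If you forget what a label stands for, spaCy ships a small glossary that can be queried with `spacy.explain`. It covers the coarse `pos_` labels in the chart above as well as the fine-grained `tag_` labels and the entity types, and it does not need a loaded model. A quick sketch:

```python
import spacy

# One coarse part-of-speech label, one fine-grained tag, one entity type.
for label in ["PROPN", "VBZ", "GPE"]:
    print(label, "->", spacy.explain(label))
```

This is handy when reading the `pos_` and `tag_` columns side by side, since the fine-grained tags are much less self-explanatory than the coarse ones.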

### Let's see it in action!

We'll check out the parts of speech in Demi Lovato's 'Made in the USA', which, remember, is stored in `doc2`.

In [12]:
print("Word, POS, tag\n")

for item in doc2:
    print(item.text, item.pos_, item.tag_)

Word, POS, tag

Our DET PRP$
love NOUN NN
runs VERB VBZ
deep ADV RB
like SCONJ IN
a DET DT
Chevy PROPN NNP

 SPACE _SP
If SCONJ IN
you PRON PRP
fall VERB VBP
, PUNCT ,
I PRON PRP
'll VERB MD
fall VERB VB
with ADP IN
you PRON PRP
, PUNCT ,
baby NOUN NN

 SPACE _SP
'Cause ADP IN
that DET DT
's AUX VBZ
the DET DT
way NOUN NN
we PRON PRP
like VERB VBP
to PART TO
do AUX VB
it PRON PRP

 SPACE _SP
That DET DT
's AUX VBZ
the DET DT
way NOUN NN
we PRON PRP
like VERB VBP

 SPACE _SP
You PRON PRP
run VERB VBP
around ADV RB
, PUNCT ,
open ADJ JJ
doors NOUN NNS
like SCONJ IN
a DET DT
gentleman NOUN NN

 SPACE _SP
Tell VERB VB
me PRON PRP
, PUNCT ,
girl NOUN NN
, PUNCT ,
every DET DT
day NOUN NN
of ADP IN
my DET PRP$
everything PRON NN

 SPACE _SP
'Cause ADP IN
that DET DT
's AUX VBZ
the DET DT
way NOUN NN
you PRON PRP
like VERB VBP
to PART TO
do AUX VB
it PRON PRP

 SPACE _SP
That DET DT
's AUX VBZ
the DET DT
way NOUN NN
you PRON PRP
like VERB VBP


 SPACE _SP
Just ADV RB
a DET DT
little ADJ JJ
We

We can also quickly see spaCy's POS tagging in action by using the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) on `doc2` with the `style=` parameter set to "dep" (short for dependency parsing):

In [13]:
#Set some display options for the visualizer
options = {"compact": True, "distance": 90, "color": "yellow", "bg": "black", "font": "Gill Sans"}

displacy.render(doc2, style="dep", options=options)

### Extracting words by part of speech

Now we can write simple code to extract and recombine words by their part of speech. The following code creates lists of all nouns and adjectives in Lovato's song:

In [14]:
nouns = []
adjectives = []
for item in doc2:
    if item.pos_ == 'NOUN' and item.text not in nouns:
        nouns.append(item.text)
for item in doc2:
    if item.pos_ == 'ADJ' and item.text not in adjectives:
        adjectives.append(item.text)
print("here are the nouns: " + str(nouns))
print("and here are the adjectives: " + str(adjectives))

here are the nouns: ['love', 'baby', 'way', 'doors', 'gentleman', 'girl', 'day', 'bit', 'sunshine', 'Hair', 'wind', 'track', 'time', 'world', 'people', 'mind', 'letter', 'sweater', 'night', 'sky', 'rise', 'Winter', 'city', 'lights', 'bullet', 'blow']
and here are the adjectives: ['open', 'little', 'whole', 'bad', 'other', 'cold', 'best', 'next']


And below, some code to print out random pairings of an adjective from the text with a noun from the text:

In [26]:
import random
print(random.choice(adjectives) + " " + random.choice(nouns))

other track
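The lists above keep each word only once. To see how *often* each noun or adjective occurs, the `Counter` imported at the start of the notebook can tally tokens by part of speech. This is a sketch, assuming spaCy v3 (whose `Doc` constructor accepts `pos=`); `count_pos` is a made-up helper name, and the toy document is tagged by hand so that the example runs without the model. In the lesson you would call `count_pos(doc2, 'NOUN')`.

```python
from collections import Counter

from spacy.tokens import Doc
from spacy.vocab import Vocab


def count_pos(doc, pos):
    """Count lowercased token texts in `doc` that carry the given coarse POS label."""
    return Counter(tok.text.lower() for tok in doc if tok.pos_ == pos)


# Hand-tagged stand-in for a processed document (assumes spaCy v3).
words = ["the", "way", "we", "like", "the", "Way", "you", "like"]
pos = ["DET", "NOUN", "PRON", "VERB", "DET", "NOUN", "PRON", "VERB"]
toy_doc = Doc(Vocab(), words=words, pos=pos)

noun_counts = count_pos(toy_doc, "NOUN")
print(noun_counts.most_common(5))
```

Lowercasing before counting merges forms such as "Way" and "way"; drop the `.lower()` call if capitalization matters for your question.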


### Exercise!

Making a list of verbs works similarly. You try:

In [27]:
verbs = []
for item in doc2:
    if item.pos_ == 'VERB' and item.text not in verbs:
        verbs.append(item.text)    
verbs

['runs',
 'fall',
 "'ll",
 'like',
 'run',
 'Tell',
 'blowing',
 'losing',
 'go',
 'want',
 'know',
 'wo',
 'say',
 'break',
 'made',
 'Made',
 'reading',
 'let',
 'touch',
 'walking',
 'take']

The `.tag_` attribute allows us to be more specific about the kinds of verbs we want. 

**HOW WOULD WE GET ONLY PAST-PARTICIPLE VERBS?**
_Hint: you'll find help [here](https://spacy.io/api/annotation#pos-tagging)._

In [29]:
only_past = []

for item in doc2:
    if item.tag_ == 'VBN' and item.text not in only_past:
        only_past.append(item.text)      
      
only_past

['made', 'Made']

## Larger syntactic units

So we can get individual words by their part of speech. Great! But what if we want larger chunks, based on their syntactic role in the sentence? The easy way is `.noun_chunks`, which is an attribute of a document or a sentence that evaluates to a generator of [spans](https://spacy.io/docs/api/span) of noun phrases, regardless of their position in the document:

In [30]:
for item in doc2.noun_chunks:
    print(item.text)

Our love
a Chevy
you
I
you
the way
we
it
the way
we
You
open doors
a gentleman
me
my everything
the way
you
it
the way
you
Just a little West Coast
a bit
sunshine
Hair
the wind
track
time
Just you
I
Just you
I
we
I
the whole world
I
you
I
it
what
the people
I
we
our love
the USA
the USA
You
my mind
a letter
I
you
a sweater
the way
we
it
the way
we
the world
you
we
I
you
the way
I
it
the way
I
We
the East Coast
Dinner
the sky
Winter
the best time
the city lights
You
I
You
I
we
I
the whole world
I
you
I
it
what
the people
I
we
our love
the USA
I
the bullet
the blow
love
Our love
the USA
the USA
the USA
we
I
the whole world
I
you
I
it
what
the people
I
we
our love
the USA
the USA
the US
the US
the USA


For anything more sophisticated than this, though, we'll need to learn about how spaCy parses sentences into its syntactic components.

### Understanding dependency grammars

![displacy parse](http://static.decontextualize.com/syntax_example.png)

The idea of a dependency grammar is that every word in a sentence is a "dependent" on some other word, which is that word's "head." Those "head" words are in turn dependents of other words. The finite verb in the sentence is the ultimate "head" of the sentence, and is not itself dependent on any other word. (The dependents of a particular head are sometimes called its "children.")

The question of how to know what constitutes a "head" and a "dependent" is complicated. As a starting point, here's a passage from [Dependency Grammar and Dependency Parsing](http://stp.lingfil.uu.se/~nivre/docs/05133.pdf):

> Here are some of the criteria that have been proposed for identifying a syntactic relation between a head H and a dependent D in a construction C (Zwicky, 1985; Hudson, 1990):
>
> 1. H determines the syntactic category of C and can often replace C.
> 2. H determines the semantic category of C; D gives semantic specification.
> 3. H is obligatory; D may be optional.
> 4. H selects D and determines whether D is obligatory or optional.
> 5. The form of D depends on H (agreement or government).
> 6. The linear position of D is specified with reference to H.

There are different *types* of relationships between heads and dependents, and each type of relation has its own name. 

**Visit [the displaCy visualizer](https://demos.explosion.ai/displacy/?text=Everyone%20has%20the%20right%20to%20life%2C%20liberty%20and%20security%20of%20person&model=en&cpu=1&cph=0) to see how a particular sentence is parsed, and what the relations between the heads and dependents are.**

Here's a list of a few dependency relations and what they mean. ([A more complete list can be found here.](http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf))

* `nsubj`: this word's head is a verb, and this word is itself the subject of the verb
* `nsubjpass`: same as above, but for subjects in sentences in the passive voice
* `dobj`: this word's head is a verb, and this word is itself the direct object of the verb
* `iobj`: same as above, but indirect object
* `aux`: this word's head is a verb, and this word is an "auxiliary" verb (like "have", "will", "be")
* `attr`: this word's head is a copula (like "to be"), and this is the description attributed to the subject of the sentence (e.g., in "This product is a global brand", `brand` is dependent on `is` with the `attr` dependency relation)
* `det`: this word's head is a noun, and this word is a determiner of that noun (like "the," "this," etc.)
* `amod`: this word's head is a noun, and this word is an adjective describing that noun
* `prep`: this word is a preposition that modifies its head
* `pobj`: this word is a dependent (object) of a preposition

In [31]:
# Let's take a look at how this works in practice
# We'll go back to using our first doc for the rest of this notebook

for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Tag:", word.tag_)
    print("Head:", word.head.text)
    print("Dependency relation:", word.dep_)
    print("Children:", list(word.children))
    print("")

Word: Everyone
Tag: NN
Head: has
Dependency relation: nsubj
Children: []

Word: has
Tag: VBZ
Head: has
Dependency relation: ROOT
Children: [Everyone, right, .]

Word: the
Tag: DT
Head: right
Dependency relation: det
Children: []

Word: right
Tag: NN
Head: has
Dependency relation: dobj
Children: [the, to]

Word: to
Tag: IN
Head: right
Dependency relation: prep
Children: [life]

Word: life
Tag: NN
Head: to
Dependency relation: pobj
Children: [,, liberty]

Word: ,
Tag: ,
Head: life
Dependency relation: punct
Children: []

Word: liberty
Tag: NN
Head: life
Dependency relation: conj
Children: [and, security, of]

Word: and
Tag: CC
Head: liberty
Dependency relation: cc
Children: []

Word: security
Tag: NN
Head: liberty
Dependency relation: conj
Children: []

Word: of
Tag: IN
Head: liberty
Dependency relation: prep
Children: [person]

Word: person
Tag: NN
Head: of
Dependency relation: pobj
Children: []

Word: .
Tag: .
Head: has
Dependency relation: punct
Children: []
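
Because every token points to its head, and the root is its own head (note `Head: has` for the word `has` above), you can climb from any word to the root of its sentence by following `.head` repeatedly. Here is a minimal sketch, assuming spaCy v3, where the `Doc` constructor accepts `heads=` (absolute token positions) and `deps=`. `path_to_root` is a hypothetical helper, and the toy parse copies the first six tokens of the output above by hand rather than coming from the model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def path_to_root(token):
    """Follow .head links upward; the root token is its own head."""
    path = [token]
    while token.head.i != token.i:
        token = token.head
        path.append(token)
    return path


# Hand-built parse of "Everyone has the right to life" (assumes spaCy v3).
words = ["Everyone", "has", "the", "right", "to", "life"]
heads = [1, 1, 3, 1, 3, 4]  # position of each token's head
deps = ["nsubj", "ROOT", "det", "dobj", "prep", "pobj"]
toy_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

print([tok.text for tok in path_to_root(toy_doc[5])])
```

With the lesson's objects, the same function could be called on any token of `doc`, for example a token taken from `list(doc.sents)[2]`.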



### Using .subtree for extracting syntactic units

The `.subtree` attribute evaluates to a generator that can be flattened by passing it to `list()`. This is a list of the word and all of its syntactic dependents, essentially the phrase that the word heads.

This function merges a subtree and returns a string with the text of the words contained in it:

In [32]:
def flatten_subtree(st):
    # Join each token's text plus its trailing whitespace, then trim the ends
    return ''.join([w.text_with_ws for w in list(st)]).strip()

With this function in our toolbox, we can write a loop that prints out the subtree for each word in a sentence:

In [33]:
for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Flattened subtree: ", flatten_subtree(word.subtree))
    print("")

Word: Everyone
Flattened subtree:  Everyone

Word: has
Flattened subtree:  Everyone has the right to life, liberty and security of person.

Word: the
Flattened subtree:  the

Word: right
Flattened subtree:  the right to life, liberty and security of person

Word: to
Flattened subtree:  to life, liberty and security of person

Word: life
Flattened subtree:  life, liberty and security of person

Word: ,
Flattened subtree:  ,

Word: liberty
Flattened subtree:  liberty and security of person

Word: and
Flattened subtree:  and

Word: security
Flattened subtree:  security

Word: of
Flattened subtree:  of person

Word: person
Flattened subtree:  person

Word: .
Flattened subtree:  .



Using the subtree and our knowledge of dependency relation types, we can write code that extracts larger syntactic units based on their relationship with the rest of the sentence. For example, to get all of the noun phrases that are subjects of a verb:

In [34]:
subjects = []
for word in doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))

In [35]:
subjects

['All human beings', 'They', 'Everyone']

Or every prepositional phrase:

In [36]:
prep_phrases = []
for word in doc:
    if word.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(word.subtree))

In [37]:
prep_phrases

['in dignity and rights',
 'with reason and conscience',
 'towards one another',
 'in a spirit of brotherhood',
 'of brotherhood',
 'to life, liberty and security of person',
 'of person']
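
The two loops above differ only in which dependency labels they test, so they can be folded into one helper. The sketch below assumes spaCy v3 and uses a hand-built parse so that it runs without the model. `phrases_by_dep` is a name made up for this example, and it repeats the `flatten_subtree` logic from earlier so that the block stands alone.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def phrases_by_dep(doc, labels):
    """Return the flattened subtree of every token whose dep_ is in `labels`."""
    return [
        "".join(tok.text_with_ws for tok in word.subtree).strip()
        for word in doc
        if word.dep_ in labels
    ]


# Hand-built parse of "Everyone has the right to life" (assumes spaCy v3).
words = ["Everyone", "has", "the", "right", "to", "life"]
heads = [1, 1, 3, 1, 3, 4]  # position of each token's head
deps = ["nsubj", "ROOT", "det", "dobj", "prep", "pobj"]
toy_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

subjects = phrases_by_dep(toy_doc, {"nsubj", "nsubjpass"})
objects = phrases_by_dep(toy_doc, {"dobj"})
print(subjects, objects)
```

Passing `{"prep"}` reproduces the prepositional-phrase loop, and other labels from the list of dependency relations (such as `attr` or `pobj`) work the same way.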

## Further reading and resources

[A few example programs can be found here.](https://github.com/aparrish/rwet-examples/tree/master/spacy)

We've barely scratched the surface of what it's possible to do with spaCy. [There's a great page of tutorials on the official site](https://spacy.io/docs/usage/tutorials) that you should check out!