# POS and NER with spaCy


[![Open In Colab](colab-badge.svg)](https://colab.research.google.com/github/alexisperrier/intro2nlp/blob/master/notebooks/intro2nlp_05_spacy_pos_ner.ipynb)


Part-of-speech (POS) tagging and named entity recognition (NER) are essential tasks in NLP.

- POS is used for information extraction (finding all the adjectives associated with a person or a product, for example) and facilitates language understanding for complex NLP tasks (text generation, for instance). 

- NER is used across many domains to identify specific entities from the text (medical terms, legal concepts, people, …). 

When a spaCy model parses a text with ```doc = nlp(text)```, it also performs POS tagging and NER.


In [None]:
# install spacy if you haven't done so already and download the small English model
!pip install -U spacy
!python -m spacy download en_core_web_sm

# install NLTK 
!pip install nltk 

In [None]:
# load spacy and the small English model
import spacy
nlp = spacy.load("en_core_web_sm")


## Part of Speech Tagging 

Let's start by exploring POS tagging.


In [None]:
text = "If you don't know where you are going any road can take you there."
doc = nlp(text)

# print the part of speech of each token
for token in doc:
    print(f"{token.text}\t {token.pos_} ")

In [None]:
# and now for some Shakespeare

doc = nlp("Grace me no grace, nor uncle me no uncle")
for t in doc: 
    print(t, t.pos_)

spaCy correctly tags _grace_ and _uncle_ as verbs where they are used as verbs and as nouns where they are used as nouns.

NLTK, on the other hand, is confused: it tags _grace_ and _uncle_ as nouns in every occurrence.

In [None]:
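import nltk

# tokenizer, tagger and tagset resources; the "_tab" and "_eng" names are the
# ones used by recent NLTK releases, the shorter names by older ones
for resource in ("punkt", "punkt_tab", "averaged_perceptron_tagger",
                 "averaged_perceptron_tagger_eng", "universal_tagset"):
    nltk.download(resource)

tokens = nltk.word_tokenize("Grace me no grace, nor uncle me no uncle")

nltk.pos_tag(tokens, tagset='universal')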
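
To compare two taggers systematically rather than by eye, their outputs can be lined up token by token. The sketch below is only an illustration: `tag_disagreements` is a hypothetical helper, and the sample pairs are written by hand from the description above (verb then noun for spaCy, noun everywhere for NLTK), not captured from the models.

```python
def tag_disagreements(tags_a, tags_b):
    """Return (position, token, tag_a, tag_b) wherever two taggers disagree.

    Both inputs are lists of (token, tag) pairs over the same tokenization.
    """
    if [tok for tok, _ in tags_a] != [tok for tok, _ in tags_b]:
        raise ValueError("the two tag sequences are not aligned on the same tokens")
    return [
        (i, tok, tag_a, tag_b)
        for i, ((tok, tag_a), (_, tag_b)) in enumerate(zip(tags_a, tags_b))
        if tag_a != tag_b
    ]


# Hand-written pairs for the four tokens discussed above, transcribed from the
# prose (an assumption, not actual model output).
spacy_like = [("Grace", "VERB"), ("grace", "NOUN"), ("uncle", "VERB"), ("uncle", "NOUN")]
nltk_like = [("Grace", "NOUN"), ("grace", "NOUN"), ("uncle", "NOUN"), ("uncle", "NOUN")]

for row in tag_disagreements(spacy_like, nltk_like):
    print(row)
```

With real output, the first argument could be `[(t.text, t.pos_) for t in doc]` and the second the list returned by `nltk.pos_tag`. Some label names differ between the two tagsets (spaCy writes punctuation as `PUNCT`, NLTK's universal tagset as `.`), so those pairs will also be reported as disagreements.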

## Named Entity Recognition (NER)

Now let's see how we can extract the names of people, places, etc. from a text with spaCy.

And let's see which persons can be found in Alice in Wonderland.




In [None]:
import requests
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

# text from Alice in Wonderland
r = requests.get('http://www.gutenberg.org/files/11/11-0.txt')

# remove the header, the footer and some weird characters 
text = ' '.join(r.text.split('***')[1:])
text = text.split("END OF THE PROJECT GUTENBERG")[0]
text = text.encode('ascii',errors='ignore').decode('utf-8')
print(text)
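
The cleanup above relies on the `*** START OF ... ***` and `*** END OF ... ***` marker lines that surround the book in the Project Gutenberg file. Splitting on `***` appears to leave the wording of the START marker itself at the top of `text`. A possible variant, sketched here on a made-up toy string, cuts at the marker lines instead; `strip_gutenberg_boilerplate` is a hypothetical helper, and the exact marker prefixes are an assumption about the file format.

```python
def strip_gutenberg_boilerplate(raw: str) -> str:
    """Keep only the text between the START and END marker lines.

    Assumes the markers begin with "*** START OF" and "*** END OF";
    if a marker is missing, that side of the text is left untouched.
    """
    start = raw.find("*** START OF")
    if start != -1:
        # drop everything up to and including the end of the marker line
        line_end = raw.find("\n", start)
        raw = raw[line_end + 1:] if line_end != -1 else ""
    end = raw.find("*** END OF")
    if end != -1:
        raw = raw[:end]
    return raw.strip()


# Toy input, invented for illustration only.
sample = (
    "Title and licence header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Alice was beginning to get very tired.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Licence footer\n"
)
print(strip_gutenberg_boilerplate(sample))
```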

In [None]:
# and parse the text
doc = nlp(text)

# Find all the 'persons' in the text
persons = []
# For each entity in the doc 
for ent in doc.ents:
    # if the entity is a person
    if ent.label_ == 'PERSON':
        # add to the list of persons
        persons.append(ent.text)

# note we could have written the last bit in one line with
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

# list the 20 most common ones
Counter(persons).most_common(20)


The Rabbit, although a very frequent character in the book, does not appear among the top 20 identified persons.

Let's see how the Rabbit entity is classified.


In [None]:
rabbit_ner = [(ent.text, ent.label_) for ent in doc.ents if "Rabbit" in ent.text]
Counter(rabbit_ner).most_common(10)

Interestingly, the Rabbit is identified as a location, an event and even a work of art! But not as a person.
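
The same lookup is useful for any character, and counting labels alone (ignoring spelling variants such as "White Rabbit" and "the White Rabbit") gives a quicker overview. A minimal sketch, assuming only that each entity exposes `.text` and `.label_` as spaCy entities do; `Ent` is a hypothetical stand-in so the example runs without a model, and the sample entities are made up.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy entity: only .text and .label_ are used.
Ent = namedtuple("Ent", ["text", "label_"])


def label_distribution(ents, needle):
    """Count entity labels over every entity whose text contains `needle`."""
    return Counter(ent.label_ for ent in ents if needle in ent.text)


# Made-up sample entities, for illustration only (not real model output).
sample_ents = [
    Ent("the White Rabbit", "LOC"),
    Ent("White Rabbit", "WORK_OF_ART"),
    Ent("Rabbit", "LOC"),
    Ent("Alice", "PERSON"),
]
print(label_distribution(sample_ents, "Rabbit").most_common())
# → [('LOC', 2), ('WORK_OF_ART', 1)]
```

In the notebook, the call would be `label_distribution(doc.ents, "Rabbit")`.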

Let's see if we get better results by using a larger spaCy model.


In [None]:
# Download and load the large English model.
# Note: Better to comment out the line after you've downloaded the model the first time
# to avoid downloading it each time you run the notebook!
!python -m spacy download en_core_web_lg


In [None]:
nlp_lg = spacy.load("en_core_web_lg")

In [None]:
# parse the text again, this time with the large model
doc = nlp_lg(text)

# Find all the 'persons' in the text
persons = []
# For each entity in the doc 
for ent in doc.ents:
    # if the entity is a person
    if ent.label_ == 'PERSON':
        # add to the list of persons
        persons.append(ent.text)

# note we could have written the last bit in one line with
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

# list the 20 most common ones
Counter(persons).most_common(20)


In [None]:
rabbit_ner = [(ent.text, ent.label_) for ent in doc.ents if "Rabbit" in ent.text]
Counter(rabbit_ner).most_common(10)

Well, that did not really work out either. The poor Rabbit is now an organisation and still not a person or character.

Note that the larger model identifies Alice as a person 293 times, whereas the smaller model does so only 191 times. So although the larger model still can't identify the entity class of the Rabbit, it does a better job on other characters.
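
One way to make this comparison for every character at once is to put the two PERSON counters side by side. The sketch below is an assumption-laden illustration: `compare_counts` is a hypothetical helper, the "Alice" figures are the ones quoted above, and the "Hatter" figures are placeholders invented for the example.

```python
from collections import Counter


def compare_counts(small, large, n=10):
    """Return (name, small_count, large_count) rows, largest change first."""
    names = set(small) | set(large)
    # A Counter returns 0 for names it has never seen.
    rows = [(name, small[name], large[name]) for name in names]
    # Sort by absolute difference, then by name so the order is deterministic.
    rows.sort(key=lambda row: (-abs(row[2] - row[1]), row[0]))
    return rows[:n]


# "Alice" uses the counts quoted in the text; "Hatter" is a made-up placeholder.
persons_sm = Counter({"Alice": 191, "Hatter": 5})
persons_lg = Counter({"Alice": 293, "Hatter": 7})

for name, sm, lg in compare_counts(persons_sm, persons_lg):
    print(f"{name}\t{sm}\t{lg}\t{lg - sm:+d}")
```

In the notebook, `persons` is overwritten when the large model is run, so the small-model `Counter(persons)` would need to be saved under its own name first.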

Let's see which other ORGs we can find in the book.

In [None]:
orgs = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
Counter(orgs).most_common(10)

In [None]:
# and works of art

woas = [ent.text for ent in doc.ents if ent.label_ == 'WORK_OF_ART']
Counter(woas).most_common(10)
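
The filter-and-count pattern used for PERSON, ORG and WORK_OF_ART can be applied to every label in a single pass. This is a minimal sketch rather than part of the original notebook: `top_entities_by_label` and the `Ent` stand-in are hypothetical names, the sample entities are made up, and the whitespace normalization is an optional extra for entity strings that span a line break.

```python
from collections import Counter, defaultdict, namedtuple

# Hypothetical stand-in for a spaCy entity: only .text and .label_ are used.
Ent = namedtuple("Ent", ["text", "label_"])


def top_entities_by_label(ents, n=10):
    """Map each entity label to its n most frequent entity texts."""
    counts = defaultdict(Counter)
    for ent in ents:
        # collapse line breaks and repeated spaces inside the entity text
        counts[ent.label_][" ".join(ent.text.split())] += 1
    return {label: counter.most_common(n) for label, counter in counts.items()}


# Made-up sample entities, for illustration only (not real model output).
sample_ents = [
    Ent("Alice", "PERSON"),
    Ent("Alice", "PERSON"),
    Ent("White\r\nRabbit", "ORG"),
    Ent("White Rabbit", "ORG"),
]
print(top_entities_by_label(sample_ents))
# → {'PERSON': [('Alice', 2)], 'ORG': [('White Rabbit', 2)]}
```

In the notebook, `top_entities_by_label(doc.ents)` would give the most common entities for each label found by the model.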