# Text preprocessing, POS tagging and NER
  
In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the `spaCy` library. Upon mastering these concepts, you will proceed to make the Gettysburg address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

**Helpful Links**
  
[SpaCy Documentation](https://spacy.io)  
["To-be" Verbs: What is that?](https://www.flsinternationalonline.net/blog/to-be-verbs-completely-explained#:~:text=What%20are%20“to%20be”%20verbs,nationality%2C%20job%20or%20other%20traits.)

In [26]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import re                           # Regular Expressions:      Text manipulation
import spacy                        # spaCy:                    Natural Language Processing
from pprint import pprint           # Pretty Print:             Advanced printing operations

## Tokenization and Lemmatization
  
In NLP, we usually have to deal with texts from a variety of sources.
  
**Text sources**
  
For instance, it can be a news article where the text is grammatically correct and proofread. It could be tweets containing shorthand and hashtags. It could also be comments on YouTube, where people tend to overuse capital letters and punctuation.
  
**Making text machine friendly**
  
It is important that we standardize these texts into a machine-friendly format. We want our models to treat similar words as the same. Consider the words "Dogs" and "dog". Strictly speaking, they are different strings. However, they denote the same thing. Similarly, "reduces", "reducing" and "reduce" should also be standardized to the same string regardless of their form and case usage. Other examples include "don't" and "do not", and "won't" and "will not". In the next couple of lessons, we will learn techniques to achieve this.
  
**Text preprocessing techniques**
  
The text processing techniques you use are dependent on the application you're working on. We'll be covering the common ones, including converting words into lowercase, removing unnecessary whitespace, removing punctuation, removing commonly occurring words or stopwords, expanding contracted words like don't and removing special characters such as numbers and emojis.
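
Three of these steps (lowercasing, whitespace removal and contraction expansion) need nothing beyond plain Python. The sketch below is only an illustration: the `normalize` function and the tiny `CONTRACTIONS` map are hypothetical names introduced here, and a real application would need a far larger contraction list.

```python
# Hypothetical, deliberately tiny contraction map (an assumption for this sketch)
CONTRACTIONS = {"don't": "do not", "won't": "will not", "can't": "cannot"}


def normalize(text):
    """Lowercase, expand a few contractions, and collapse whitespace."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # split() with no arguments drops leading/trailing whitespace and treats
    # any run of spaces, tabs or newlines as a single separator
    return " ".join(text.split())


print(normalize("  I DON'T   know\twhy it\nWON'T work  "))
# i do not know why it will not work
```

Lowercasing happens first so that the lowercase keys in the map also catch "DON'T" and "Don't".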
  
**Tokenization**
  
To do this, we must first understand tokenization. *Tokenization* is the process of splitting a string into its constituent tokens. These tokens may be sentences, words or punctuation marks, and the rules for finding them are specific to a particular language. In this course, we will primarily focus on word and punctuation tokens. For instance, consider this sentence. 
  
<img src='../_images/nlp-tokenization-and-lemmatization.png' alt='img' width='740'>
  
Tokenizing it into its constituent words and punctuation marks will yield the following list of tokens. Tokenization also splits contracted words into their parts. Therefore, a word like "don't" gets decomposed into two tokens, "do" and "n't", as can be seen in this example.
  
<img src='../_images/nlp-tokenization-and-lemmatization1.png' alt='img' width='740'>
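
To make the idea concrete without any library, here is a rough regular-expression tokenizer. It is a simplified stand-in, not the algorithm `spaCy` uses, and the names `TOKEN_PATTERN` and `simple_tokenize` are made up for this sketch. It only handles plain English words, punctuation and a few apostrophe patterns.

```python
import re

# Simplified stand-in for a real tokenizer (this is NOT spaCy's algorithm).
# At each position the alternatives are tried from top to bottom.
TOKEN_PATTERN = re.compile(
    r"\w+(?=n't)"   # word stem directly before "n't" (e.g. "do" in "don't")
    r"|n't"         # the negation part itself
    r"|'\w+"        # other apostrophe suffixes such as "'re", "'ve", "'s"
    r"|\w+"         # ordinary words and numbers
    r"|[^\w\s]"     # any single punctuation mark or symbol
)


def simple_tokenize(text):
    """Split text into word, contraction and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Hello! I don't know what we're doing here."))
# ['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'we', "'re", 'doing', 'here', '.']
```

A trained tokenizer copes with many cases this pattern ignores, such as URLs, abbreviations and curly apostrophes, which is one reason to prefer a library in practice.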
  
**Tokenization using `spaCy`**
  
To perform tokenization in Python, we will use the `spacy` library. We first import the `spacy` library. Next, we load a pre-trained English model `'en_core_web_sm'` using `spacy.load()`. This will return a Language object that has the know-how to perform tokenization. This is stored in the variable `nlp`. Let's now define a string we want to tokenize. We pass this string into `nlp` to generate a `spaCy` Doc object. We store this in a variable named `doc`. This `doc` object contains the required tokens (and many other things, as we will soon find out). We generate the list of tokens by using a list comprehension as shown. This is essentially looping over `doc` and extracting the text of each token in each iteration. The result is as follows.
  
<img src='../_images/nlp-tokenization-and-lemmatization2.png' alt='img' width='740'>
  
**Lemmatization**
  
*Lemmatization* is the process of converting a word into its lowercased base form or lemma. This is an extremely powerful process of standardization. For instance, the words "reducing", "reduces" and "reduced", when lemmatized, are all converted into the base form "reduce". Similarly, "to be" verbs such as "am", "are" and "is" are converted into "be". Lemmatization also allows us to convert words with apostrophes into their full forms. Therefore, "n't" is converted to "not" and "'ve" is converted to "have".
  
<img src='../_images/nlp-tokenization-and-lemmatization3.png' alt='img' width='740'>
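
The simplest way to picture lemmatization is as a table lookup. The sketch below assumes a hypothetical `LEMMA_TABLE` that covers only the examples mentioned above. Real lemmatizers rely on much larger tables, rules and part-of-speech information, so treat this purely as an illustration of the input and output.

```python
# Hypothetical lookup table covering only the examples from the text
LEMMA_TABLE = {
    "reducing": "reduce", "reduces": "reduce", "reduced": "reduce",
    "am": "be", "are": "be", "is": "be",
    "n't": "not", "'ve": "have",
}


def lookup_lemma(token):
    """Return the lowercased base form, or the lowercased token if unknown."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


print([lookup_lemma(t) for t in ["She", "is", "reducing", "it"]])
# ['she', 'be', 'reduce', 'it']
```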
  
**Lemmatization using `spaCy`**
  
When you pass the string into `nlp`, `spaCy` performs lemmatization by default. Therefore, generating lemmas is identical to generating tokens, except that we extract `token.lemma_` instead of `token.text` in each iteration of the list comprehension. Also, observe how `spaCy` converted the "I" into `-PRON-`. This is standard behavior in `spaCy` 2.x, where every pronoun is converted into the string `-PRON-`.
  
<img src='../_images/nlp-tokenization-and-lemmatization4.png' alt='img' width='740'>
  
**Text preprocessing techniques**  
  
- Converting words into lowercase  
- Removing leading and trailing whitespaces  
- Removing punctuation  
- Removing stopwords  
- Expanding contractions  
- Removing special characters such as numbers and emojis
  
**Tokenization**  
- The process of splitting a string into its constituent tokens
  
**Lemmatization**  
- The process of converting a word into its lowercased base form or lemma
  
**Let's practice!**
  
Once we understand how to perform tokenization and lemmatization, performing the text preprocessing techniques described earlier becomes easier. Before we move to that, let's first practice our understanding of the concepts introduced so far.

### Identifying lemmas
  
Identify the list of words from the choices which do not have the same lemma.
  
Possible Answers
  
- [ ] He, She, I, They
- [ ] Am, Are, Is, Was
- [ ] Increase, Increases, Increasing, Increased
- [x] Car, Bike, Truck, Bus
  
Good job! Although all the words refer to vehicles, they are words with distinct base forms.

### Tokenizing the Gettysburg Address
  
In this exercise, you will be tokenizing one of the most famous speeches of all time: the Gettysburg Address delivered by American President Abraham Lincoln during the American Civil War.
  
The entire speech is available as a string named `gettysburg`.
  
1. Load the `en_core_web_sm` model using `spacy.load()`.
2. Create a Doc object `doc` for the `gettysburg` string.
3. Using list comprehension, loop over `doc` to generate the token texts.

In [27]:
with open('../_datasets/gettysburg.txt', 'r') as f:
    gettysburg = f.read()

In [28]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.', 'Now', 'we', "'re", 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'We', "'re", 'met', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', 'We', "'ve", 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave', 'their', 'lives', 'that', 'that', 'nation', 'might', 'live', '.', 'It', "'s", 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', 'But', ',', 'in', 'a', 'larger', 'sense', ',', 'we', 'ca', "n't", 'dedicate', '-', 'we', '

You now know how to tokenize a piece of text. In the next exercise, we will perform similar steps and conduct lemmatization.

### Lemmatizing the Gettysburg address
  
In this exercise, we will perform lemmatization on the same Gettysburg Address from before.
  
This time, however, we will also look at the speech before and after lemmatization and try to judge what kinds of changes make the piece more machine friendly.
  
1. Print the Gettysburg Address to the console.
2. Loop over `doc` and extract the lemma for each token of `gettysburg`.
3. Convert `lemmas` into a string using `.join`.

In [29]:
# Print the gettysburg address
print(gettysburg)

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so no

In [30]:
# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))

four score and seven year ago our father bring forth on this continent , a new nation , conceive in Liberty , and dedicate to the proposition that all man be create equal . now we be engage in a great civil war , test whether that nation , or any nation so conceive and so dedicated , can long endure . we be meet on a great battlefield of that war . we 've come to dedicate a portion of that field , as a final resting place for those who here give their life that that nation might live . it be altogether fitting and proper that we should do this . but , in a large sense , we can not dedicate - we can not consecrate - we can not hallow - this ground . the brave man , living and dead , who struggle here , have consecrate it , far above our poor power to add or detract . the world will little note , nor long remember what we say here , but it can never forget what they do here . it be for we the living , rather , to be dedicate here to the unfinished work which they who fight here have thus

You're now proficient at performing lemmatization using `spaCy`. Observe the lemmatized version of the speech. It isn't very readable to humans but it is in a much more convenient format for a machine to process.

## Text cleaning
  
Now that we know how to convert a string into a list of lemmas, we are in a good position to perform basic text cleaning.
  
**Text cleaning techniques**
  
Some of the most common text cleaning steps include removing extra whitespace, escape sequences, punctuation, special characters such as numbers, and stopwords. In other words, it is very common to remove non-alphabetic tokens and words that occur so commonly that they are not very useful for analysis.
  
**`.isalpha()`**
  
Every Python string has an `.isalpha()` method that returns `True` if all the characters of the string are alphabetic. Therefore, `"Dog".isalpha()` will return `True` but `"3dogs".isalpha()` will return `False`, as it contains the non-alphabetic character 3. Similarly, numbers, punctuation and emojis all return `False` too. This makes it an extremely convenient way to remove all (lemmatized) tokens that are or contain numbers, punctuation and emojis.
  
<img src='../_images/text-cleaning-for-nlp.png' alt='img' width='740'>
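
A quick way to build intuition is to call `.isalpha()` on a handful of strings. The sample values and the `keep_alpha` helper below are made up for illustration; only `str.isalpha()` itself is standard Python.

```python
def keep_alpha(tokens):
    """Keep only the tokens made up entirely of alphabetic characters."""
    return [token for token in tokens if token.isalpha()]


# Only "Dog" consists purely of letters
for sample in ["Dog", "3dogs", "12", "!", "U.S.", "word2vec", "😀"]:
    print(repr(sample), sample.isalpha())

print(keep_alpha(["dog", "3dogs", "!", "run", "😀", "42"]))
# ['dog', 'run']
```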
  
**A word of caution**
  
If `.isalpha()` sounds too good to be true as a silver bullet for cleaning text, that's because it is. `.isalpha()` tends to return `False` on words we would not want to remove. Examples include abbreviations written with periods, such as U.S.A. and U.K., and proper nouns with numbers in them, such as word2vec and xto10x. For such nuanced cases, `.isalpha()` may not be sufficient. 
  
It may be advisable to write your own custom functions, typically using regular expressions, to ensure you're not inadvertently removing useful words.
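
As one possible shape for such a custom function, the sketch below keeps tokens that start with a letter and consist of letters and digits, optionally separated by single periods. The rule, the `KEEP_PATTERN` name and the `is_wordlike` helper are all assumptions made for this example. The right rule depends on your data, and this one only covers ASCII letters.

```python
import re

# Hypothetical rule: start with a letter, then letters/digits, with optional
# single periods between groups and an optional trailing period (ASCII only)
KEEP_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:\.[A-Za-z0-9]+)*\.?")


def is_wordlike(token):
    """Return True for tokens such as 'dog', 'U.S.' or 'word2vec'."""
    return KEEP_PATTERN.fullmatch(token) is not None


tokens = ["dog", "U.S.", "word2vec", "xto10x", "2019", "...", "!"]
print([t for t in tokens if is_wordlike(t)])
# ['dog', 'U.S.', 'word2vec', 'xto10x']
```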
  
**Removing non-alphabetic characters**
  
Consider the string here. This has a lot of punctuations, unnecessary extra whitespace, escape sequences, numbers and emojis. We will generate the lemmatized tokens like before.
  
<img src='../_images/text-cleaning-for-nlp1.png' alt='img' width='740'>
  
**Removing non-alphabetic characters**
  
Next, we loop through the tokens again and choose only those words that are either `-PRON-` or contain only alphabetic characters. Let's now print out the sanitized string. We see that all the non-alphabetic characters have been removed and each word is separated by a single space.
  
<img src='../_images/text-cleaning-for-nlp2.png' alt='img' width='740'>
  
**Stopwords**
  
There are some words in the English language that occur so commonly that it is often a good idea to just ignore them. Examples include articles such as "a" and "the", 'be verbs' such as "is" and "am", and pronouns such as "he" and "she".
  
**Removing stopwords using spaCy**
  
`spaCy` has a built-in list of stopwords which we can access using `spacy.lang.en.stop_words.STOP_WORDS`.
  
<img src='../_images/text-cleaning-for-nlp3.png' alt='img' width='740'>
  
**Removing stopwords using spaCy**
  
We make a small tweak to the `a_lemmas` generation step. Notice that we have removed the `-PRON-` condition, as pronouns are stopwords anyway and should be removed. Additionally, we have introduced a new condition to check if the word belongs to `spaCy`'s list of stopwords. The output is as follows. Notice how the string consists only of base form words. Always exercise caution while using third-party stopword lists. An application often finds useful certain words that a third-party list treats as stopwords. It is often advisable to create your own custom stopword list.
  
<img src='../_images/text-cleaning-for-nlp4.png' alt='img' width='740'>
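
The same filtering step works with any set of strings, which makes it easy to swap in a custom list. In the sketch below, `custom_stopwords`, the sample lemmas and the `remove_stopwords` helper are hypothetical and chosen only to show the pattern.

```python
# Hypothetical custom stopword list; extend or trim it for your application
custom_stopwords = {"a", "an", "the", "be", "i", "he", "she", "it", "of", "to"}


def remove_stopwords(lemmas, stopwords):
    """Keep alphabetic lemmas that are not in the given stopword set."""
    return [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]


lemmas = ["the", "dog", "be", "chase", "a", "cat", "!", "2", "time"]
print(" ".join(remove_stopwords(lemmas, custom_stopwords)))
# dog chase cat time
```

Using a `set` rather than a list keeps each membership check fast, which matters once the function is applied to thousands of documents.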
  
**Other text preprocessing techniques**
  
There are other preprocessing techniques that are used but have been omitted for the sake of brevity. Some of them include removing HTML or XML tags, replacing accented characters, and correcting spelling errors and shorthand.
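
Two of these omitted steps can be sketched with the standard library alone. The helper names `strip_tags` and `strip_accents` are invented for this example, and the tag-stripping regular expression is a rough heuristic rather than a full HTML parser. For messy real-world markup, a dedicated parser is the safer choice.

```python
import html
import re
import unicodedata


def strip_tags(text):
    """Remove anything that looks like <...>, decode entities, tidy spaces."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(html.unescape(without_tags).split())


def strip_accents(text):
    """Decompose accented characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_tags("<p>Caf&eacute; <b>open</b> today</p>"))  # Café open today
print(strip_accents("Café déjà vu"))                       # Cafe deja vu
```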
  
**A word of caution**
  
We have covered a lot of text preprocessing techniques in the last couple of lessons. However, a word of caution is in order. The text preprocessing techniques you use always depend on the application. Many applications may find punctuation, numbers and emojis useful, so it may be wise not to remove them. In other cases, the use of all caps may be a good indicator of something. Remember to use only those techniques that are relevant to your particular use case.


NOTE: If the specific version of `spaCy` you are using does not include the `-PRON-` token, it's possible that it has been removed or the default behavior has been changed in that version.
  
In newer versions of `spaCy`, instead of using `-PRON-` as a placeholder for pronouns, `spaCy` replaces pronouns with their respective lemma (base form). This change was made to provide more accurate lemmatization.

### Cleaning a blog post
  
In this exercise, you have been given an excerpt from a blog post. Your task is to clean this text into a more machine friendly format. This will involve converting to lowercase, lemmatization and removing stopwords, punctuations and non-alphabetic characters.
  
The excerpt is available as a string `blog` and has been printed to the console. The list of stopwords is available as `stopwords`.
  
1. Using list comprehension, loop through `doc` to extract the `.lemma_` of each token.
2. Remove stopwords and non-alphabetic tokens using stopwords and `.isalpha()`.

In [31]:
with open('../_datasets/blog.txt', 'r') as file:
    blog = file.read()
    
print(blog)  # Before standardization
stopwords = spacy.lang.en.stop_words.STOP_WORDS
blog = blog.lower()
print(blog)  # After standardization





In [32]:
# Generate doc Object: doc
doc = nlp(blog)

# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))



Take a look at the cleaned text; it is lowercased and devoid of numbers, punctuation and commonly used stopwords. Also, note that the word U.S. was present in the original text. Since it contains periods, our text cleaning process removed it completely. This may not be ideal behavior. For such nuanced cases, it is advisable to use your own custom functions in place of `.isalpha()`.

### Cleaning TED talks in a dataframe
  
In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe `ted` consisting of 500 TED Talks. Your task is to clean these talks using the techniques discussed earlier by writing a function `preprocess` and applying it to the `transcript` feature of the dataframe.
  
The stopwords list is available as `stopwords`.
  
1. Generate the `Doc` object for `text`. Ignore the disable argument for now.
2. Generate lemmas using list comprehension using the `.lemma_` attribute.
3. Remove non-alphabetic characters using `.isalpha()` in the if condition.

In [33]:
ted = pd.read_csv('../_datasets/ted.csv') # Importing Ted Talk df

ted['transcript'] = ted['transcript'].str.lower()  # Standardization
print(ted.shape)  # Shape of df
ted.head()  # Viewing df

(500, 2)


Unnamed: 0,transcript,url
0,"we're going to talk — my — a new lecture, just...",https://www.ted.com/talks/al_seckel_says_our_b...
1,"this is a representation of your brain, and yo...",https://www.ted.com/talks/aaron_o_connell_maki...
2,it's a great honor today to share with you the...,https://www.ted.com/talks/carter_emmart_demos_...
3,"my passions are music, technology and making t...",https://www.ted.com/talks/jared_ficklin_new_wa...
4,it used to be that if you wanted to get a comp...,https://www.ted.com/talks/jeremy_howard_the_wo...


In [34]:
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]  # Standardization process
    
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]  # Applying standardization
    
    return ' '.join(a_lemmas)


# Applying preprocess() to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])

0      talk new lecture ted I illusion create ted I t...
1      representation brain brain break left half log...
2      great honor today share digital universe creat...
3      passion music technology thing combination thi...
4      use want computer new program programming requ...
                             ...                        
495    today I unpack example iconic design perfect s...
496    brother belong demographic pat percent accord ...
497    john hockenberry great tom I want start questi...
498    right moment kill car internet little mobile d...
499    real problem math education right basically ha...
Name: transcript, Length: 500, dtype: object


You have preprocessed all the TED talk transcripts contained in `ted`, and they are now in good shape for operations such as vectorization, as we will soon see. You now have a good understanding of how text preprocessing works and why it is important. In the next lessons, we will move on to generating word-level features for our texts.

## Part-of-speech tagging
  
In this lesson, we will cover part-of-speech tagging, which is one of the most widely used feature engineering techniques in NLP.
  
**Applications**
  
Part-of-speech tagging, or POS tagging, has an immense number of applications in NLP. It is used in word-sense disambiguation to identify the sense of a word in a sentence. For instance, consider the sentences "the bear is a majestic animal" and "please bear with me". Both sentences use the word "bear" but they mean different things. 
  
**Part-of-Speech (POS)**  
  
- Helps distinguish the two uses by tagging one "bear" as a noun and the other as a verb  
- Word-sense disambiguation  
- Example: "The bear is a majestic animal"  
- Example: "Please bear with me"  
  
POS tagging helps in identifying this distinction by tagging one "bear" as a noun and the other as a verb. POS tagging is also used in sentiment analysis, question answering systems and linguistic approaches to detecting fake news and opinion spam. For example, one paper discovered that fake news headlines, on average, tend to use fewer common nouns and more proper nouns than mainstream headlines. Generating the POS tags for these words proved extremely useful in detecting false or hyperpartisan news.
  
**Applications for Part-of-Speech Tagging**  
  
- Sentiment analysis  
- Question answering  
- Fake news and opinion spam detection  
  
**POS tagging**
  
So what is POS tagging? It is the process of assigning every word (or token) in a piece of text its corresponding part of speech. For instance, consider the sentence "Jane is an amazing guitarist". A typical POS tagger will label "Jane" as a proper noun, "is" as a verb (the tag set used by `spaCy` labels it more specifically as an auxiliary, `AUX`), "an" as a determiner (or an article), "amazing" as an adjective and finally, "guitarist" as a noun.
  
"Jane is an amazing guitarist"  
  
Jane -> Proper noun  
is -> Verb  
an -> Determiner (or an article)  
amazing -> Adjective  
guitarist -> Noun  
  
**POS tagging**  
  
- Assigning every word, its corresponding part of speech  
  
**POS tagging using `spaCy`**
  
POS Tagging is extremely easy to do using `spaCy`'s models and performing it is almost identical to generating tokens or lemmas. As usual, we import the `spacy` library and load the `'en_core_web_sm'` model as `nlp`. We will use the same sentence "Jane is an amazing guitarist" from before. We will then create a `Doc` object that will perform POS tagging, by default.
  
```python
import spacy


# Load the en_core_web_sm pre-trained model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Jane is an amazing guitarist"

# Doc object instantiation
doc = nlp(string)
```
  
**POS tagging using `spaCy`**
  
Using list comprehension, we generate a list of tuples `pos` where the first element of the tuple is the `token` and is generated using `token.text` and the second element is its POS tag, which is generated using `token.pos_`. Printing `pos` will give us the following output. Note how the tagger correctly identified all the parts-of-speech as we had discussed earlier. That said, remember that POS tagging is not an exact science. 
  
*`spaCy` infers the POS tags of these words based on the predictions given by its pre-trained models. In other words, the accuracy of the POS tagging is dependent on the data that the model has been trained on and the data that it is being used on.*
  
```python
...
...
# Generating list of tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

out[2]: 
[('Jane', 'PROPN'), 
('is', 'AUX'), 
('an', 'DET'), 
('amazing', 'ADJ'), 
('guitarist', 'NOUN')]
```
  
**POS annotations in `spaCy`**
  
`spaCy` is capable of identifying close to 20 parts of speech and, as we saw in the previous example, it uses specific annotations to denote each one. For instance, `PROPN` refers to a proper noun and `DET` refers to a determiner. You can find the complete list of POS annotations used by `spaCy` in `spaCy`'s documentation.
  
**POS annotation in `spaCy`**  
  
`PROPN` -> Proper noun  
`DET` -> Determiner  
  
**Let's practice!**
  
Great! Let's now practice our understanding of POS tagging in the next few exercises.

### POS tagging in Lord of the Flies
  
In this exercise, you will perform part-of-speech tagging on a famous passage from one of the most well-known novels of all time, Lord of the Flies, authored by William Golding.
  
The passage is available as `lotf` and has already been printed to the console.
  
`"He found himself understanding the wearisomeness of this life, where every path was an improvisation and a considerable part of one’s waking life was spent watching one’s feet."`
  
1. Load the `en_core_web_sm` model.
2. Create a `doc` object for `lotf` using `nlp()`.
3. Using the `text` and `.pos_` attributes, generate tokens and their corresponding POS tags.

In [35]:
with open('../_datasets/lotf.txt', 'r') as file:
    lotf = file.read()

print(lotf)

He found himself understanding the wearisomeness of this life, where every path was an improvisation and a considerable part of one’s waking life was spent watching one’s feet.


In [36]:
# Creating a Doc object
doc = nlp(lotf)

# Generating tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
pprint(pos)

[('He', 'PRON'),
 ('found', 'VERB'),
 ('himself', 'PRON'),
 ('understanding', 'VERB'),
 ('the', 'DET'),
 ('wearisomeness', 'NOUN'),
 ('of', 'ADP'),
 ('this', 'DET'),
 ('life', 'NOUN'),
 (',', 'PUNCT'),
 ('where', 'SCONJ'),
 ('every', 'DET'),
 ('path', 'NOUN'),
 ('was', 'AUX'),
 ('an', 'DET'),
 ('improvisation', 'NOUN'),
 ('and', 'CCONJ'),
 ('a', 'DET'),
 ('considerable', 'ADJ'),
 ('part', 'NOUN'),
 ('of', 'ADP'),
 ('one', 'NUM'),
 ('’s', 'PART'),
 ('waking', 'VERB'),
 ('life', 'NOUN'),
 ('was', 'AUX'),
 ('spent', 'VERB'),
 ('watching', 'VERB'),
 ('one', 'NUM'),
 ('’s', 'PART'),
 ('feet', 'NOUN'),
 ('.', 'PUNCT')]


Examine the POS tags attached to each token and evaluate whether they make intuitive sense to you. You will notice that most of them agree with the standard rules of English grammar, although a few are debatable: "one" in "one’s", for example, is tagged `NUM` even though it acts as a pronoun here.

### Counting nouns in a piece of text
  
In this exercise, we will write two functions, `nouns()` and `proper_nouns()` that will count the number of other nouns and proper nouns in a piece of text respectively.
  
These functions will take in a piece of text and generate a list containing the POS tag of each token. They will then return the number of proper nouns or other nouns that the text contains. We will use these functions in the next exercise to generate interesting insights about fake news.
  
The `en_core_web_sm` model has already been loaded as `nlp` in this exercise.
  
1. Using the list `.count()` method, count the number of proper nouns (annotated as `PROPN`) in the `pos` list.
2. Using the list `.count()` method, count the number of other nouns (annotated as `NOUN`) in the `pos` list.

In [37]:
# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of proper nouns
    return pos.count('PROPN')


print(proper_nouns('Abdul, Bill and Cathy went to the market to buy apples.', nlp))

3


In [38]:
# Returns number of other nouns
def nouns(text, model=nlp):
    # create doc object
    doc = model(text)
    
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of other nouns
    return pos.count('NOUN')


print(nouns('Abdul, Bill and Cathy went to the market to buy apples.', nlp))

2


You now know how to write functions that compute the number of instances of a particular POS tag in a given piece of text. In the next exercise, we will use these functions to generate features from text in a dataframe.
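
If you need counts for several tags at once, `collections.Counter` can tally them in a single pass instead of calling `.count()` once per tag. The sketch below is one possible variation and reuses the (token, tag) pairs from the "Jane is an amazing guitarist" output shown earlier. The variable name `tag_counts` is introduced here for illustration.

```python
from collections import Counter

# (token, tag) pairs copied from the "Jane is an amazing guitarist" output
pos = [("Jane", "PROPN"), ("is", "AUX"), ("an", "DET"),
       ("amazing", "ADJ"), ("guitarist", "NOUN")]

# Tally every tag in one pass over the list
tag_counts = Counter(tag for _, tag in pos)

# A Counter returns 0 for tags that never occurred, so no KeyError here
print(tag_counts["PROPN"], tag_counts["NOUN"], tag_counts["VERB"])
# 1 1 0
```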

### Noun usage in fake news
  
In this exercise, you have been given a dataframe `headlines` that contains news headlines that are either fake or real. Your task is to generate two new features `num_propn` and `num_noun` that represent the number of proper nouns and other nouns contained in the `title` feature of `headlines`.
  
Next, we will compute the mean number of proper nouns and other nouns used in fake and real news headlines and compare the values. If there is a remarkable difference, then there is a good chance that using the `num_propn` and `num_noun` features in fake news detectors will improve their performance.
  
To accomplish this task, the functions `proper_nouns` and `nouns` that you had built in the previous exercise have already been made available to you.
  
1. Create a new feature `num_propn` by applying `proper_nouns` to `headlines['title']`.
2. Filter `headlines` to compute the mean number of proper nouns in fake news using the `.mean()` method.
3. Repeat the process for other nouns: create a feature `num_noun` using `nouns` and compute the mean of other nouns.

In [39]:
headlines = pd.read_csv('../_datasets/fakenews.csv')
headlines.head()

Unnamed: 0.1,Unnamed: 0,title,label
0,0,You Can Smell Hillary’s Fear,FAKE
1,1,Watch The Exact Moment Paul Ryan Committed Pol...,FAKE
2,2,Kerry to go to Paris in gesture of sympathy,REAL
3,3,Bernie supporters on Twitter erupt in anger ag...,FAKE
4,4,The Battle of New York: Why This Primary Matters,REAL


In [40]:
# Applying the prior functions to the feature 'title', returns counts and appends a new feature.
headlines['num_propn'] = headlines['title'].apply(proper_nouns)
headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Compute mean of other nouns
real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
print("Mean no. of proper nouns in real and fake headlines are {:.2f} and {:.2f} respectively.".format(real_propn, fake_propn))
print("Mean no. of other nouns in real and fake headlines are {:.2f} and {:.2f} respectively.".format(real_noun, fake_noun))

Mean no. of proper nouns in real and fake headlines are 2.47 and 4.86 respectively.
Mean no. of other nouns in real and fake headlines are 2.25 and 1.53 respectively.


You now know how to construct features using POS tag information. Notice how the mean number of proper nouns is considerably higher for fake news than it is for real news. The opposite seems to be true in the case of other nouns. This fact can be put to great use in designing fake news detectors.

## Named entity recognition
  
The final technique we will learn as part of this chapter is named entity recognition.
  
**Applications of Named Entity Recognition (NER)**
  
Named entity recognition, or NER, has a host of extremely useful applications. It is used to build efficient search algorithms and question answering systems. For instance, let us say you have a piece of text and you ask your system about the people that are being talked about in the text. NER would help the system answer this question by identifying all the entities in the text that refer to a person. NER is also used by news providers to categorize their articles and by customer service centers to classify and record complaints efficiently.
  
**Applications of NER**
- Efficient search algorithms
- Question answering (*e.g. chatbots*)
- Document segmentation (*classifying documents or records, e.g. customer service complaints*)
- News article classification
  
**Named entity recognition**
  
Let us now get down to the definitions. A named entity is anything that can be denoted with a proper name or a proper noun. Named entity recognition, or NER, is therefore the process of identifying such named entities in a piece of text and classifying them into predefined categories such as person, organization, or country. For example, consider the text "John Doe is a software engineer working at Google. He lives in France." Performing NER on this text tells us that there are three named entities: John Doe, who is a person; Google, which is an organization; and France, which is a country (or geopolitical entity).
  
"John Doe is a software engineer working at Google. He lives in France."  
  
John Doe -> Person  
Google -> Organization  
France -> Country (geopolitical entity)  
  
**NER using spaCy**
  
Like POS tagging, NER is straightforward with spaCy's pre-trained models. Let's find the named entities in the sentence we used earlier. As usual, we import the `spacy` library, load the required model and create a `Doc` object for `string`. When we do this, `spaCy` automatically computes all the named entities and makes them available as the `ents` attribute of `doc`. To access each named entity and its category, we use a list comprehension to loop over `doc.ents` and build tuples of the entity text, accessed with `ent.text`, and the entity category, accessed with `ent.label_`. Printing this list gives the following output. We see that `spaCy` has correctly identified and classified all the named entities in this string.
  
```python
import spacy

# Load the en_core_web_sm pre-trained model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "John Doe is a software engineer working at Google. He lives in France."

# Doc object instantiation
doc = nlp(string)

# Generating named entities
ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)

# Output: [('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]
```
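
Once entities are available as `(text, label)` tuples, a common next step is to group them by category. The sketch below shows one way to do that; `group_entities` is a hypothetical helper, not part of `spaCy`, and the list of tuples is hardcoded from the output shown above so that the example runs without loading a model.

```python
from collections import defaultdict

def group_entities(pairs):
    """Group (text, label) tuples into a dict of label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

# Hardcoded copy of the tuples produced for the example sentence
ne = [('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]
print(group_entities(ne))
# {'PERSON': ['John Doe'], 'ORG': ['Google'], 'GPE': ['France']}
```

In a notebook you would pass the list built from `doc.ents` instead of the hardcoded one.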
  
**NER annotations in `spaCy`**
  
Currently, spaCy's models are capable of identifying more than 15 different types of named entities. The complete list of categories and their annotations can be found in spaCy's documentation.
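
Label abbreviations such as `GPE` are not always self-explanatory. A small sketch of how to look them up from code, assuming `spacy` is installed: `spacy.explain` returns a short description for a label it knows about.

```python
import spacy

# spacy.explain maps a tag or label to a short human-readable description
for label in ['PERSON', 'ORG', 'GPE']:
    description = spacy.explain(label)
    print(label, '->', description)

# With a loaded pipeline, nlp.get_pipe('ner').labels lists the labels
# that particular model can predict
```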
  
**A word of caution**
  
In this chapter, we have used spaCy's models to accomplish several tasks. However, remember that these models are not perfect: their performance depends on the data they were trained on and the data they are applied to. For instance, if we are trying to extract named entities from texts in a heavily technical field, such as medicine, spaCy's pre-trained models may not do a great job. In such cases, it is better to train your own models on your specialized data. Also, remember that spaCy's models are language specific. This is understandable considering that each language has its own grammar and nuances. The `en_core_web_sm` model that we've been using is, as the name suggests, only suitable for English texts.
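
One way to check how well a model fits your own data is to compare its entities against a small hand-labelled sample. The following is a minimal sketch under stated assumptions: entities are represented as `(text, label)` tuples, only exact matches count as correct, and `ner_precision_recall` is a hypothetical helper. Both sets below, including the `DRUG` label, are invented for illustration and are not real model output.

```python
def ner_precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of (text, label) tuples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Invented example: hand-labelled entities vs. a hypothetical model's output
gold = {('aspirin', 'DRUG'), ('Mayo Clinic', 'ORG'), ('Jane Roe', 'PERSON')}
predicted = {('Mayo Clinic', 'ORG'), ('Jane Roe', 'PERSON'),
             ('aspirin', 'ORG'), ('Roe', 'PERSON')}

precision, recall = ner_precision_recall(predicted, gold)
print("Precision: {:.2f}, recall: {:.2f}".format(precision, recall))
# Precision: 0.50, recall: 0.67
```

Low scores on a sample from your own domain are a sign that a general-purpose model may need additional training data.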
  
**Let's practice!**
  
This concludes our lesson on named entity recognition. Let us practice our understanding of this technique in the exercises.

### Named entities in a sentence
  
In this exercise, we will identify and classify the named entities in a body of text using one of spaCy's statistical models. We will also check whether the assigned labels are correct.
  
1. Use `spacy.load()` to load the `en_core_web_sm` model.
2. Create a `Doc` instance `doc` using `text` and `nlp`.
3. Loop over `doc.ents` to print all the named entities and their corresponding labels.

In [41]:
# Load the required model
nlp = spacy.load('en_core_web_sm')

# Create a Doc instance 
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'
doc = nlp(text)

# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text,'->', ent.label_)

Sundar Pichai -> PERSON
Google -> ORG
Mountain View -> GPE


It is possible to train `spaCy` models on your custom data. You will learn to do this in more advanced NLP courses.

### Identifying people mentioned in a news article
  
In this exercise, you have been given an excerpt from a news article published in TechCrunch. Your task is to write a function `find_persons` that identifies the names of people mentioned in a piece of text. You will then use `find_persons` to identify the people of interest in the article.
  
The article is available as the string `tc` and has been printed to the console. The required `spacy` model has also been already loaded as `nlp`.
  
1. Create a `Doc` object for `text`.
2. Using list comprehension, loop through `doc.ents` and create a list of named entities whose label is `PERSON`.
3. Using `find_persons()`, print the people mentioned in `tc`.

In [42]:
with open('../_datasets/tc.txt', 'r') as file:
    tc = file.read()

print(tc)


It’s' been a busy day for Facebook  exec op-eds. Earlier this morning, Sheryl Sandberg broke the site’s silence around the Christchurch massacre, and now Mark Zuckerberg is calling on governments and other bodies to increase regulation around the sorts of data Facebook traffics in. He’s hoping to get out in front of heavy-handed regulation and get a seat at the table shaping it.


In [43]:
def find_persons(text):
    # Create Doc object
    doc = nlp(text)
    
    # Identify the persons
    persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    
    # Return persons
    return persons


print(find_persons(tc))

['Sheryl Sandberg', 'Mark Zuckerberg']


The article was related to Facebook and our function correctly identified both the people mentioned. You can now see how NER could be used in a variety of applications. Publishers may use a technique like this to classify news articles by the people mentioned in them. 
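
The publisher use case can be sketched as a small index from person names to the articles that mention them. This is only an illustration: `index_by_person` is a hypothetical helper, the two article snippets are invented, and `fake_extractor` is a stand-in that matches a fixed list of names so the sketch runs without a model. In the notebook you would pass `find_persons` as the extractor instead.

```python
from collections import defaultdict

def index_by_person(articles, extract_persons):
    """Map each person name to the sorted ids of the articles mentioning them."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for person in extract_persons(text):
            index[person].add(article_id)
    return {person: sorted(ids) for person, ids in index.items()}

def fake_extractor(text):
    """Stand-in for find_persons: returns names from a fixed list found in text."""
    known = ['Sheryl Sandberg', 'Mark Zuckerberg']
    return [name for name in known if name in text]

# Invented article snippets keyed by a made-up id
articles = {
    'a1': 'Sheryl Sandberg and Mark Zuckerberg both published op-eds.',
    'a2': 'Mark Zuckerberg called for more regulation.',
}
print(index_by_person(articles, fake_extractor))
# {'Sheryl Sandberg': ['a1'], 'Mark Zuckerberg': ['a1', 'a2']}
```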
  
A question answering system could also use something like this to answer questions such as 'Who are the people mentioned in this passage?'. With this, we come to the end of this chapter. In the next chapter, we will learn how to vectorize documents.