Note: This notebook was completed as part of DataCamp's course of the same name.

# Feature Engineering for NLP in Python
In this course, you will learn techniques that will allow you to extract useful information from text and process it into a format suitable for applying ML models. More specifically, you will learn about POS tagging, named entity recognition, readability scores, the n-gram and tf-idf models, and how to implement them using scikit-learn and spaCy. You will also learn to compute how similar two documents are to each other. In the process, you will predict the sentiment of movie reviews and build movie and Ted Talk recommenders. Following the course, you will be able to engineer critical features out of any text and solve some of the most challenging problems in data science!

**Instructor:** Rounak Banik, Data Scientist at Fractal Analytics

In [15]:
from textatistic import Textatistic
import spacy

# $\star$ Chapter 1: Basic features and readability scores
Learn to compute basic features such as number of words, number of characters, average word length and number of special characters (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the amount of education required to comprehend a piece of text.

### Introduction to NLP feature engineering
* Learn to extract useful features out of text and convert them into formats that are suitable for machine learning algorithms
* Recall that for any ML algorithm, data fed into it must be in tabular form and all the training features must be numerical
* ML algorithms can also work with categorical data provided the categories are converted into numerical form through one-hot-encoding.

#### One-hot encoding with pandas

```
# Import the pandas library
import pandas as pd

# Perform one-hot encoding on the 'sex' feature of df
df = pd.get_dummies(df, columns=['sex'])
```
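As a quick illustration of what `get_dummies()` produces, here is a minimal sketch on a made-up dataframe (the `people` frame and its values are hypothetical and not part of the course data):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the encoding
people = pd.DataFrame({
    'age': [23, 35, 41],
    'sex': ['female', 'male', 'female'],
})

# One-hot encode the categorical 'sex' column
people_encoded = pd.get_dummies(people, columns=['sex'])

# The 'sex' column is replaced by one indicator column per category
print(people_encoded.columns.tolist())
```

The numerical `age` column is left untouched, while `sex` becomes the two columns `sex_female` and `sex_male`.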
* **Note** that *not* mentioning columns will lead pandas to automatically encode all non-numerical features
* Consider the following movie reviews dataset:

<img src='data/mov_rev_data.png' width="400" height="200" align="center"/>

* The above data cannot be used by an ML algorithm as it stands
* The training feature `review` is not numerical
    * Nor is it categorical, so one-hot encoding cannot be applied to it
    
#### Text pre-processing
* We need to perform two steps to make this dataset suitable for ML:
    * 1) **Standardize the text:**
        * converting words to lowercase
        * lemmatization (converting words to their root form)
        * example: `Reduction` gets converted to `reduce`
    * 2) **Vectorization:**
        * After standardization, the reviews are converted into a set of numerical training features through a process known as **vectorization**.
        * After vectorization, our original review dataset gets converted into something like this:
        
<img src='data/vect_ex.png' width="300" height="150" align="center"/>

* We will learn techniques to achieve this in later lessons

#### Basic features
* We can also extract certain basic features from text like:
    * **word count**
    * **character count**
    * **average word length**
* When working with niche data, such as tweets, it also may be useful to know how many hashtags have been used in a given tweet

#### POS tagging
* Some NLP applications may require you to extract features for individual words
* For instance, you may want to do **parts-of-speech** or **POS** tagging to know the different parts-of-speech present in your text as shown:

<img src='data/POS_tagging.png' width="150" height="75" align="center"/>

* Consider the example above; POS tagging will label each word with its corresponding part-of-speech

#### Named Entity Recognition (NER)
* You may also want to perform named entity recognition to find out whether a particular noun refers to a person, organization, country, or some other kind of entity

#### Concepts covered
* Text preprocessing
* Basic features 
* Word features
* Vectorization

#### Exercises: One-hot-encoding

```
# Print the features of df1
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
print(df1.columns)

# Print first five rows of df1
print(df1.head())
```

## Basic feature extraction
* While not very powerful, basic features can give us a good idea of the text we are dealing with
* The most basic feature we can extract from text is **number of characters** (including whitespaces)

### Number of characters
* The most basic feature we can extract from text
* **Includes whitespaces**
* For example, the string `I don't know.` has **13 characters**.
* The number of characters is the length of the string, or: `len(string)`
* If our dataframe `df` has a textual feature (say `review`), we can compute the number of characters for each review and store it as a new feature `num_chars` by using the pandas dataframe `apply()` method:
    * **`df['num_chars'] = df['review'].apply(len)`**
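To make the `apply(len)` step concrete, here is a small sketch on a hypothetical two-row dataframe (the `reviews` frame and its contents are made up for illustration):

```python
import pandas as pd

# Hypothetical reviews, used only for illustration
reviews = pd.DataFrame({'review': ["I don't know.", "Great movie!"]})

# len() counts every character, including spaces and punctuation
reviews['num_chars'] = reviews['review'].apply(len)
print(reviews)
```

The first review gets a `num_chars` of 13, matching the count given above.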

### Number of words
* Assuming that every word is separated by a space, we can use a string's `split()` method to convert it into a list where every element is a word.

In [1]:
# Split the string into words
text = "Mary had a little lamb."
words = text.split()

# Print the list containing words
print(words)

['Mary', 'had', 'a', 'little', 'lamb.']


In [2]:
# Print number of words
print(len(words))

5


* To do this for a textual feature in a dataframe, we first define a function that takes in a string as an argument and returns the number of words in it:

In [3]:
# Function that returns number of words in string
def word_count(string):
    # Split the string into words
    words = string.split()
    
    # Return length of words list
    return len(words)

* We can now pass this function, `word_count()` to `apply()` and create `df['num_words']`:

```
# Create num_words feature in df
df['num_words'] = df['review'].apply(word_count)
```

### Average word length
* Let's define a function `avg_word_length()` which takes in a string and returns the average word length

In [5]:
# Function that returns average word length
def avg_word_length(x):
    # Split the string into words
    words = x.split()
    # Compute length of each word and store in a separate list
    word_lengths = [len(word) for word in words]
    # Compute average word length
    avg_word_length = sum(word_lengths)/len(words)
    # Return average word length
    return(avg_word_length)

* We can now pass this function (`avg_word_length()`) into `apply()` to generate an average word length feature in the df

```
# Create a new feature avg_word_length
df['avg_word_length'] = df['review'].apply(avg_word_length)
```
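One edge case worth noting: `avg_word_length()` divides by the number of words, so an empty string (or one containing only whitespace) raises a `ZeroDivisionError`. A minimal sketch of a guarded variant follows; the name `safe_avg_word_length` and the choice to return `0.0` for empty text are assumptions made for illustration.

```python
def safe_avg_word_length(text):
    # Split the string into words
    words = text.split()
    # Assumption: treat text with no words as having average length 0.0
    if not words:
        return 0.0
    # Average the lengths of the individual words
    return sum(len(word) for word in words) / len(words)

print(safe_avg_word_length("Mary had a little lamb."))
print(safe_avg_word_length("   "))
```

Note that trailing punctuation is counted as part of a word here (`lamb.` has length 5), just as in the original function.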

### Special features
* When working with data such as tweets, it may be useful to compute the number of hashtags or mentions used.

### Hashtags and mentions

In [6]:
# Function that returns number of hashtags
def hashtag_count(string):
    # Split the string into words
    words = string.split()
    # Create a list of hashtags
    hashtags = [word for word in words if word.startswith('#')]
    # Return number of hashtags
    return len(hashtags)

* The procedure to compute the number of mentions is identical, except that we check if a word starts with `@` instead of `#`:

In [7]:
# Function that returns number of mentions
def mention_count(string):
    # Split the string into words
    words = string.split()
    # Create a list of mentions
    mentions = [word for word in words if word.startswith('@')]
    # Return number of mentions
    return len(mentions)

In [8]:
hashtag_count("@janedoe This is my first tweet! #FirstTweet #Happy")

2

In [9]:
mention_count("@janedoe This is my first tweet! #FirstTweet #Happy")

1

#### Other features
* There are other basic features we can compute such as:
    * Number of sentences 
    * Number of paragraphs
    * Number of words starting with an uppercase
    * All-capital words
    * Numeric quantities
    * etc. ...
* The procedure to extract the above features is extremely similar to the ones we've already covered
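A rough sketch of three of these features, following the same split-and-filter pattern as `hashtag_count()`. The function names and the sample sentence are hypothetical, and the definitions of "word" and "numeric" are deliberately simple assumptions:

```python
def count_capitalized(text):
    # Words whose first character is an uppercase letter
    return len([word for word in text.split() if word[0].isupper()])

def count_all_caps(text):
    # Words in which every cased character is uppercase
    return len([word for word in text.split() if word.isupper()])

def count_numeric(text):
    # Words that are purely digits once surrounding punctuation is stripped
    return len([word for word in text.split() if word.strip('.,!?').isdigit()])

sample = "I paid 20 USD for THIS Book in 2019."
print(count_capitalized(sample), count_all_caps(sample), count_numeric(sample))
```

With these definitions the single-letter word `I` counts as both capitalized and all-capital; whether that is desirable depends on the application.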

#### Exercises: Character count of Russian tweets

```
# Create a feature char_count
tweets['char_count'] = tweets['content'].apply(len)

# Print the average character count
print(tweets['char_count'].mean())
```

#### Exercises: Word count of TED talks

```
# Function that returns number of words in a string
def count_words(string):
    # Split the string into words
    words = string.split()
    
    # Return the number of words
    return len(words)

# Create a new feature word_count
ted['word_count'] = ted['transcript'].apply(count_words)

# Print the average word count of the talks
print(ted['word_count'].mean())
```

#### Hashtags and mentions in Russian tweets

```
# Import matplotlib for plotting
import matplotlib.pyplot as plt

# Function that returns number of hashtags in a string
def count_hashtags(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are hashtags
    hashtags = [word for word in words if word.startswith('#')]
    
    # Return number of hashtags
    return(len(hashtags))

# Create a feature hashtag_count and display distribution
tweets['hashtag_count'] = tweets['content'].apply(count_hashtags)
tweets['hashtag_count'].hist()
plt.title('Hashtag count distribution')
plt.show()
```
***

```
# Function that returns number of mentions in a string
def count_mentions(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are mentions
    mentions = [word for word in words if word.startswith('@')]
    
    # Return number of mentions
    return(len(mentions))

# Create a feature mention_count and display distribution
tweets['mention_count'] = tweets['content'].apply(count_mentions)
tweets['mention_count'].hist()
plt.title('Mention count distribution')
plt.show()
```

### Readability tests
* Here we will look at a set of interesting features known as **readability tests**, which are used to determine the readability of a particular passage (in English)
* In other words, they indicate the educational level a person needs in order to comprehend a particular piece of text
* The scale usually ranges from **primary school** up to **college graduate level** and is in context of the American education system
* **Readability tests** are done using a mathematical formula that utilizes the word, syllable, and sentence count of the passage.
* Readability tests are routinely used by organizations to determine how difficult their publications are to understand (or not).
* Readability tests have also found applications in domains such as **fake news**, and **opinion spam detection**.
* There are a variety of readability tests in use

#### Readability test examples
* Some common examples:
    * **Flesch reading ease**
    * **Gunning fog index**
    * **Simple Measure of Gobbledygook (SMOG)**
    * **Dale-Chall score**
* $\star$ **Note** that all of these tests are used for texts in **English**
* Tests for other languages also exist that take into consideration the nuances of that particular language
* In this lesson, we will cover the first two scores (Flesch reading ease and Gunning fog index) in detail
    * However, once you understand these two, you will be in a good position to understand and use the other scores as well.
    
### Flesch reading ease
* The Flesch Reading Ease is one of the **oldest** and **most widely used** readability tests
* Dependent on two factors:
    * **1) The greater the average sentence length, the harder a text is to read.**
    * **2) The greater the average number of syllables in a word, the harder a text is to read.**
* The higher the Flesch Reading Ease score, the greater the readability
    * A higher score indicates that the text is easier to understand
* **The higher the score, the greater the readability**
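The two factors above enter the standard Flesch Reading Ease formula, $206.835 - 1.015 \times (\text{words}/\text{sentences}) - 84.6 \times (\text{syllables}/\text{words})$. The sketch below applies it to counts supplied by hand; the function name and the example counts are illustrative assumptions, and counting syllables reliably is the hard part that a library handles for you.

```python
def flesch_reading_ease(num_words, num_sentences, num_syllables):
    # Longer sentences lower the score
    avg_sentence_length = num_words / num_sentences
    # More syllables per word lower the score
    avg_syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# Hypothetical counts: 5 words, 1 sentence, 6 syllables
print(round(flesch_reading_ease(5, 1, 6), 2))
```

Because both terms are subtracted, increasing either the sentence length or the syllables per word can only lower the score.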
    
<img src='data/flesch_scores.png' width="500" height="250" align="center"/>

### Gunning fog index
* Developed in 1954
* Dependent on:
    * **1) Average sentence length**
    * **2) The greater the percentage of complex words, the harder the text is to read.**
        * Here, "complex words" refers to all words that have three or more syllables
* Unlike Flesch, the formula for Gunning fog index is such that the higher the score, the more difficult the passage is to understand
* **The higher the index, the lower the readability.**
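For reference, the standard Gunning fog formula is $0.4 \times \left[(\text{words}/\text{sentences}) + 100 \times (\text{complex words}/\text{words})\right]$. A minimal sketch with hand-supplied counts follows; the function name and the example counts are assumptions for illustration only.

```python
def gunning_fog(num_words, num_sentences, num_complex_words):
    # Average number of words per sentence
    avg_sentence_length = num_words / num_sentences
    # Percentage of words with three or more syllables
    pct_complex_words = 100 * num_complex_words / num_words
    return 0.4 * (avg_sentence_length + pct_complex_words)

# Hypothetical counts: 5 words, 1 sentence, no complex words
print(gunning_fog(5, 1, 0))
```

Here both terms are added, so longer sentences and more complex words both push the index up.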

<img src='data/gunning_scores.png' width="500" height="250" align="center"/>

### The textatistic library 
* We can conduct these readability tests in Python using the Textatistic library 

In [11]:
#pip install textatistic



In [14]:
print(text)

Mary had a little lamb.


In [13]:
# Create a Textatistic object and retrieve its readability scores
readability_scores = Textatistic(text).scores

# Print the Flesch reading ease and Gunning fog scores
print(readability_scores['flesch_score'])
print(readability_scores['gunningfog_score'])

100.24000000000002
2.0


#### Exercises: Readability of 'The Myth of Sisyphus'

```
# Import Textatistic
from textatistic import Textatistic

# Compute the readability scores 
readability_scores = Textatistic(sisyphus_essay).scores

# Print the flesch reading ease score
flesch = readability_scores['flesch_score']
print("The Flesch Reading Ease is %.2f" % (flesch))
```

#### Exercises: Readability of various publications

```
# Import Textatistic
from textatistic import Textatistic

# List of excerpts
excerpts = [forbes, harvard_law, r_digest, time_kids]

# Loop through excerpts and compute gunning fog index
gunning_fog_scores = []
for excerpt in excerpts:
  readability_scores = Textatistic(excerpt).scores
  gunning_fog = readability_scores['gunningfog_score']
  gunning_fog_scores.append(gunning_fog)

# Print the gunning fog indices
print(gunning_fog_scores)
```

# $\star$ Chapter 2: Text preprocessing, POS tagging and NER
In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. Upon mastering these concepts, you will proceed to make the Gettysburg address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

### Tokenization and Lemmatization
#### Text sources (all very different styles, grammars, and vocab)
* News articles 
* Tweets
* Social media comments

#### Making text machine friendly
* It is important to standardize all of these (above) texts into a machine-friendly format
* We want our models to treat similar words as the same

### Text preprocessing techniques
* The text processing techniques you use are dependent on the application you're working on
* Some of the common ones we'll be covering include:
    * Converting words into lowercase
    * Removing leading and trailing whitespaces
    * Removing punctuation
    * Removing commonly occurring words (**stopwords**)
    * Expanding contractions
    * Removing special characters (numbers, emojis, etc)
    
### Tokenization
* **Tokenization** is the process of splitting a string into its constituent tokens
* These tokens may be sentences, words, or punctuation marks and *are specific to a particular language.*
* In this course, we will focus primarily on word and punctuation tokens
* Tokenization also involves **splitting contracted words** into separate tokens (for example, `don't` becomes `do` and `n't`).

#### Tokenization using spaCy
* We load a pre-trained English model, `en_core_web_sm` using `spacy.load()`
    * This will return a language object that has the know-how to perform tokenization

In [16]:
# import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Hello! I don't know what I'm doing here."

# Create a Doc object
doc = nlp(string)

* The `doc` object defined above contains the required tokens (and many other things, as we will soon find out).
* We generate the list of tokens by using list comprehension as shown:

In [18]:
# Generate list of tokens
tokens = [token.text for token in doc]
print(tokens)

['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'I', "'m", 'doing', 'here', '.']


### Lemmatization
* **Lemmatization** is the process of converting a word into its lowercased base form, or **lemma**.
* This is an extremely powerful process of standardization
* Examples:
    * `reducing`, `reduces`, `reduced`, `reduction` $\Rightarrow$ **`reduce`**
    * `am`, `are`, `is` $\Rightarrow$ **`be`**
    * `n't` $\Rightarrow$ **`not`**
    * `'ve` $\Rightarrow$ **`have`**
* When you pass the string into `nlp`, spaCy performs lemmatization by default. Therefore, generating lemmas is identical to generating tokens, except that we extract `token.lemma_` instead of `token.text` in each iteration inside the list comprehension.

In [20]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Hello! I don't know what I'm doing here."

# Create a Doc object
doc = nlp(string)

# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)

['hello', '!', '-PRON-', 'do', 'not', 'know', 'what', '-PRON-', 'be', 'do', 'here', '.']


* **Also note that spaCy converted each `I` into `-PRON-`**; this is standard behavior in spaCy 2.x, where every pronoun is lemmatized to the string `-PRON-` (spaCy 3 and later no longer use this placeholder)

#### Exercises: Tokenizing the Gettysburg Address

```
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)
```

#### Exercises: Lemmatizing the Gettysburg address

```
print(gettysburg)

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))
```

### Text cleaning

#### Text cleaning techniques
* Unnecessary whitespaces and escape sequences
* Punctuations
* Special characters (numbers, emojis, etc.)
* Stopwords


* In other words, it is very common to remove non-alphabetic tokens and words that occur so commonly that they are not very useful for analysis

#### isalpha()
* Every Python string has an **`isalpha()`** method that returns `True` if all the characters of the string are alphabetic
* This is an extremely convenient method to remove all (lemmatized) tokens that are or that contain numbers, punctuation and emojis
* **A word of caution:** `isalpha()` has a tendency to return `False` on words we would not want to remove. Examples:
    * Abbreviations: `U.S.A.`, `U.K.`, etc.
    * Proper nouns with numbers in them: `word2vec` and `xto10x`
    * For such nuanced cases, `isalpha()` may not be sufficient, and it may be advisable to write your own custom functions (typically using regex).
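The caveat above can be checked directly, and a custom rule is easy to sketch with a regular expression. The helper `is_wordlike` and its pattern are hypothetical; they simply illustrate one possible rule (start with a letter, then allow letters, digits, and periods):

```python
import re

# isalpha() rejects tokens we may want to keep
for token in ["hello", "U.S.A.", "word2vec", "!!!"]:
    print(token, token.isalpha())

def is_wordlike(token):
    # Assumption: keep tokens that start with a letter and contain
    # only letters, digits, or periods
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9.]*", token) is not None

print([t for t in ["hello", "U.S.A.", "word2vec", "!!!", "5"] if is_wordlike(t)])
```

A rule like this keeps `U.S.A.` and `word2vec` while still dropping pure punctuation and bare numbers; the right pattern depends on the text you are cleaning.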

#### Removing non-alphabetic characters
* First, we generate the lemmatized tokens like before:

In [21]:
string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

# Generate list of lemmas
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]

* Next, we loop through the tokens again and choose only those words that are either `-PRON-` or contain only alphabetic characters.

In [22]:
# Remove tokens that are not alphabetic
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() or lemma =='-PRON-']

# Print string after text cleaning
print(' '.join(a_lemmas))

OMG this be like the good thing ever wow such an amazing song -PRON- be hooked Top definitely


* Convert to lowercase (in the course video, the code above produced lowercase output automatically; it did not here, so the lowercasing is done in a separate step).

In [28]:
al_lemmas = []
for lemma in a_lemmas:
    al_lemmas.append(lemma.lower())

In [29]:
print(al_lemmas)

['omg', 'this', 'be', 'like', 'the', 'good', 'thing', 'ever', 'wow', 'such', 'an', 'amazing', 'song', '-pron-', 'be', 'hooked', 'top', 'definitely']


### Stopwords
* There are some words in the English language that occur so commonly that it is often a good idea to just ignore them
* Examples: 
    * articles: 
        * a
        * the
    * be verbs:
        * is
        * am
    * pronouns:
        * he
        * she
        * they
* **`spaCy` has a built-in list of stopwords**

In [30]:
# Get list of stopwords
stopwords = spacy.lang.en.stop_words.STOP_WORDS
string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

In [31]:
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))

OMG like good thing wow amazing song hooked Top definitely


In [33]:
al_lemmas = []
for lemma in a_lemmas:
    al_lemmas.append(lemma.lower())

In [34]:
print(' '.join(al_lemmas))

omg like good thing wow amazing song hooked top definitely


* **Notice** that we have removed the `-PRON-` condition as pronouns are stopwords anyway and should be removed
* Additionally, we have introduced a new condition to check if the word belongs to spaCy's list of stopwords
* **Notice also** how the string consists only of base form words
* **Always** exercise caution while using third-party stopword lists
    * It is common for an application to find certain words useful that may be considered stopwords by third-party lists
    * **It is often advisable to create your own custom stopword lists**
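A minimal sketch of filtering with a custom stopword list is shown here. The set `custom_stopwords`, the helper `remove_stopwords`, and the sample lemmas are all hypothetical; the point is only that the list is under your control:

```python
# Hypothetical custom stopword list; tailor it to your application
custom_stopwords = {'a', 'an', 'the', 'be', 'this', 'such'}

def remove_stopwords(lemmas, stopwords):
    # Compare in lowercase so that 'The' and 'the' are treated alike
    return [lemma for lemma in lemmas if lemma.lower() not in stopwords]

sample_lemmas = ['This', 'be', 'like', 'the', 'good', 'thing', 'ever']
print(remove_stopwords(sample_lemmas, custom_stopwords))
```

Lowercasing before the membership test matters because stopword lists are usually stored in lowercase, so a capitalized lemma would otherwise slip through.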
    
#### Other text preprocessing techniques
* There are other preprocessing techniques that are used but have been omitted for the sake of brevity
* Some of them include:
    * **Removing HTML or XML tags**
    * **Replacing accented characters**
    * **Correcting spelling errors and shorthands**
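Two of these techniques can be sketched with the standard library alone. The helper names below are hypothetical, and the regex-based tag removal is a deliberately naive assumption that will not cope with every kind of markup:

```python
import re
import unicodedata

def strip_tags(text):
    # Naive tag removal; a real HTML parser is safer for messy markup
    return re.sub(r'<[^>]+>', '', text)

def strip_accents(text):
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

raw = "<p>Caf\u00e9 <b>menu</b></p>"
print(strip_accents(strip_tags(raw)))
```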
    
    
* **A word of caution:** the text preprocessing techniques you use are always dependent on the application
* There are many applications which may find punctuations, numbers, and emojis useful, so in these cases it may not be wise to remove them
* **Always use only those text preprocessing techniques that are relevant to your application.**

#### Exercises: Cleaning a blog post

```
# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)

# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))
```

#### Exercises: Cleaning TED talks in a dataframe

```
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])
```

In [35]:
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)

### Part-of-speech tagging
* Part-of-speech tagging (or **POS tagging**) is one of the most widely used feature engineering techniques in NLP

#### Applications
* **Word-sense disambiguation:**
    * `"The bear is a majestic animal"`
    * `"Please bear with me"`
* **Sentiment analysis**
* **Question answering systems**
* **Fake news and opinion spam detection** (linguistic approaches)
    * For example, one paper discovered that fake news headlines, on average, tend to use fewer common nouns and more proper nouns than mainstream headlines
    * Generating the POS tags for these words proved extremely useful in detecting false or hyperpartisan news
    
#### POS tagging using spaCy
* **POS tagging** is the process of assigning to every word (or token) in a piece of text its corresponding part of speech.
* Performing POS tagging with spaCy is almost identical to generating tokens or lemmas.

In [36]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

In [37]:
# Initialize string
string = "Jane is an amazing guitarist"

In [38]:
# Create a Doc object
doc = nlp(string)

Using a list comprehension, we generate a list of tuples: the first element of each tuple is the token, generated using `token.text`, and the second is its POS tag, generated using `token.pos_`.

In [39]:
# Generate list of tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'), ('amazing', 'ADJ'), ('guitarist', 'NOUN')]


* spaCy infers the POS tags of these words based on the predictions given by its pre-trained models.
* In other words, **the accuracy of the POS tagging is dependent on the data that the model has been trained on and the data that it is being used on.**
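Once the `(token, tag)` pairs are available, turning them into count features needs only the standard library. The sketch below hard-codes the pairs from the output above (under the hypothetical name `tagged`) so that it runs without spaCy:

```python
from collections import Counter

# (token, tag) pairs copied from the output above
tagged = [('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'),
          ('amazing', 'ADJ'), ('guitarist', 'NOUN')]

# Count how often each POS tag occurs
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts['PROPN'], tag_counts['NOUN'])
```

A `Counter` returns 0 for tags that never occur, which is convenient when building the same set of features for many documents.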

#### POS annotations in spaCy
* spaCy is capable of identifying close to 20 parts-of-speech and it uses specific annotations to denote a particular part of speech
* complete spaCy annotation list [HERE](https://spacy.io/api/annotation)

<img src='data/POS_annot.png' width="600" height="300" align="center"/>

#### POS tagging in Lord of the Flies

```
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(lotf)

# Generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)
```

#### Exercises: Counting nouns in a piece of text
In this exercise, we will write two functions, `nouns()` and `proper_nouns()` that will count the number of other nouns and proper nouns in a piece of text respectively.

These functions will take in a piece of text and generate a list containing the POS tags for each word. Each will then return the number of proper nouns/other nouns that the text contains. We will use these functions in the next exercise to generate interesting insights about fake news.

```
nlp = spacy.load('en_core_web_sm')

# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of proper nouns
    return pos.count('PROPN')

print(proper_nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))
```
***

```
nlp = spacy.load('en_core_web_sm')

# Returns number of other nouns
def nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of other nouns
    return pos.count('NOUN')

print(nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))
```

In [40]:
# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]

    # Return number of proper nouns
    return pos.count('PROPN')

In [41]:
# Returns number of other nouns
def nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]

    # Return number of other nouns
    return pos.count('NOUN')

#### Exercises: Noun usage in fake news

```
headlines['num_propn'] = headlines['title'].apply(proper_nouns)

# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Print results
print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively"%(real_propn, fake_propn))
```
*** 

```
headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of other nouns
real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively"%(real_noun, fake_noun))
```

### Named entity recognition
* **Named entity recognition** or **NER** has a host of extremely useful applications

#### Applications
* Efficient search algorithms 
* Question answering systems
* News article classification
* Customer service centers (to classify and record complaints efficiently)

#### Named entity recognition
* A **named entity** is anything that can be denoted with a proper name or a proper noun. 
* **NER** is the process of identifying such named entities in a piece of text and classifying them into predefined categories
* Categories include person, organization, country, etc.

#### NER using spaCy
* Performing NER is extremely easy using spaCy's pre-trained models

In [42]:
string = "John Doe is a software engineer working at Google. He lives in France."

In [43]:
# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)

In [44]:
# Generate named entities
ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)

[('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]
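A list of `(text, label)` pairs like the one above is often easier to work with when grouped by label. The following sketch hard-codes that output (under the hypothetical name `named_entities`) so that it runs without spaCy:

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above
named_entities = [('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]

# Group entity texts under their labels
entities_by_label = defaultdict(list)
for ent_text, ent_label in named_entities:
    entities_by_label[ent_label].append(ent_text)

print(dict(entities_by_label))
```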


* Note that `GPE` is "Geopolitical Entity"
* Currently spaCy's models are capable of identifying more than 15 different types of named entities
* Find [complete list here](https://spacy.io/api/annotation#named-entities)
* Below is a small snapshot:

<img src='data/NER_annote.png' width="500" height="250" align="center"/>

* **Word of caution:** if we are trying to extract named entities from texts in a heavily technical field (such as medicine), spaCy's pretrained models may not perform very well.
* In such nuanced cases, it is better to train your own models with your specialized data.
* Also remember that spaCy's models are **language specific**

#### Exercises: Named entities in a sentence

```
# Load the required model
nlp = spacy.load('en_core_web_sm')

# Create a Doc instance 
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'
doc = nlp(text)

# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)
```

#### Exercises: Identifying people mentioned in a news article

```
def find_persons(text):
  # Create Doc object
  doc = nlp(text)
  
  # Identify the persons
  persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
  
  # Return persons
  return persons

print(find_persons(tc))
```

In [45]:
def find_persons(text):
  # Create Doc object
  doc = nlp(text)
  
  # Identify the persons
  persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
  
  # Return persons
  return persons

# $\star$ Chapter 3: N-Gram models
Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.

### Building a bag-of-words model

<img src='data/NER_example.png' width="600" height="300" align="center"/>