Note: This notebook was completed as part of DataCamp's course of the same name.

# Feature Engineering for NLP in Python
In this course, you will learn techniques that allow you to extract useful information from text and process it into a format suitable for applying ML models. More specifically, you will learn about POS tagging, named entity recognition, readability scores, the n-gram and tf-idf models, and how to implement them using scikit-learn and spaCy. You will also learn to compute how similar two documents are to each other. In the process, you will predict the sentiment of movie reviews and build movie and Ted Talk recommenders. Following the course, you will be able to engineer critical features out of any text and solve some of the most challenging problems in data science!

**Instructor:** Rounak Banik, Data Scientist at Fractal Analytics

In [54]:
import pandas as pd
from textatistic import Textatistic
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

# $\star$ Chapter 1: Basic features and readability scores
Learn to compute basic features such as number of words, number of characters, average word length and number of special characters (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the amount of education required to comprehend a piece of text.

### Introduction to NLP feature engineering
* Learn to extract useful features out of text and convert them into formats that are suitable for machine learning algorithms
* Recall that for any ML algorithm, data fed into it must be in tabular form and all the training features must be numerical
* ML algorithms can also work with categorical data provided the categories are converted into numerical form through one-hot-encoding.

#### One-hot encoding with pandas

```
# Import the pandas library
import pandas as pd

# Perform one-hot encoding on the 'sex' feature of df
df = pd.get_dummies(df, columns=['sex'])
```
* **Note** that *not* specifying `columns` leads pandas to automatically encode all non-numerical features
* Consider the following movie reviews dataset:

<img src='data/mov_rev_data.png' width="400" height="200" align="center"/>

* The above data cannot be utilized by any ML algorithm
* The training feature `review` is not numerical
    * Neither is it categorical to perform one-hot encoding on
    
#### Text pre-processing
* We need to perform two steps to make this dataset suitable for ML:
    * 1) **Standardize the text:**
        * converting words to lowercase
        * lemmatization/ converting to root form
        * example: `Reduction` gets converted to `reduce`
    * 2) **Vectorization:**
        * After standardization, the reviews are converted into a set of numerical training features through a process known as **vectorization**.
        * After vectorization, our original review dataset gets converted into something like this:
        
<img src='data/vect_ex.png' width="300" height="150" align="center"/>

* We will learn techniques to achieve this in later lessons

#### Basic features
* We can also extract certain basic features from text like:
    * **word count**
    * **character count**
    * **average word length**
* When working with niche data, such as tweets, it also may be useful to know how many hashtags have been used in a given tweet

#### POS tagging
* Some NLP applications may require you to extract features for individual words
* For instance, you may want to do **parts-of-speech** or **POS** tagging to know the different parts-of-speech present in your text as shown:

<img src='data/POS_tagging.png' width="150" height="75" align="center"/>

* Consider the example above; POS tagging will label each word with its corresponding part-of-speech

#### Named Entity Recognition (NER)
* You may also want to perform named entity recognition to find out whether a particular noun refers to a person, organization, country, or something else

#### Concepts covered
* Text preprocessing
* Basic features 
* Word features
* Vectorization

#### Exercises: One-hot-encoding

```
# Print the features of df1
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
print(df1.columns)

# Print first five rows of df1
print(df1.head())
```

## Basic feature extraction
* While not very powerful, basic features can give us a good idea of the text we are dealing with
* The most basic feature we can extract from text is **number of characters** (including whitespaces)

### Number of characters
* The most basic feature we can extract from text
* **Includes whitespaces**
* For example, the string `I don't know.` has **13 characters**.
* The number of characters is the length of the string, or: `len(string)`
* If our dataframe `df` has a textual feature (say `review`), we can compute the number of characters for each review and store it as a new feature `num_chars` by using the pandas dataframe `apply()` method:
    * **`df['num_chars'] = df['review'].apply(len)`**

### Number of words
* Assuming that every word is separated by a space, we can use a string's `split()` method to convert it into a list where every element is a word.

In [1]:
# Split the string into words
text = "Mary had a little lamb."
words = text.split()

# Print the list containing words
print(words)

['Mary', 'had', 'a', 'little', 'lamb.']


In [2]:
# Print number of words
print(len(words))

5


* To do this for a textual feature in a dataframe, we first define a function that takes in a string as an argument and returns the number of words in it:

In [3]:
# Function that returns number of words in string
def word_count(string):
    # Split the string into words
    words = string.split()
    
    # Return length of words list
    return len(words)

* We can now pass this function, `word_count()` to `apply()` and create `df['num_words']`:

```
# Create num_words feature in df
df['num_words'] = df['review'].apply(word_count)
```

### Average word length
* Let's define a function `avg_word_length()` which takes in a string and returns the average word length

In [5]:
# Function that returns average word length
def avg_word_length(x):
    # Split the string into words
    words = x.split()
    # Compute length of each word and store in a separate list
    word_lengths = [len(word) for word in words]
    # Compute average word length
    avg_word_length = sum(word_lengths)/len(words)
    # Return average word length
    return(avg_word_length)

* We can now pass this function (`avg_word_length()`) into `apply()` to generate an average word length feature in the df

```
# Create a new feature avg_word_length
df['avg_word_length'] = df['review'].apply(avg_word_length)
```
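
One edge case worth noting: `avg_word_length()` as defined above divides by `len(words)`, so an empty or whitespace-only string raises a `ZeroDivisionError`. A minimal sketch of a guarded variant follows. The name `safe_avg_word_length` and the choice to return `0.0` for empty input are assumptions made for illustration, not part of the course.

```python
# Hypothetical guarded variant of avg_word_length (not from the course)
def safe_avg_word_length(text):
    # Split the string into words on whitespace
    words = text.split()
    # Assumption: text with no words is given an average length of 0.0
    if not words:
        return 0.0
    # Total characters across all words divided by the number of words
    return sum(len(word) for word in words) / len(words)

print(safe_avg_word_length("Mary had a little lamb."))
print(safe_avg_word_length("   "))
```

Whether `0.0`, `None`, or a missing value is the right result for empty text depends on how the feature is used downstream.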

### Special features
* When working with data such as tweets, it may be useful to compute the number of hashtags or mentions used.

### Hashtags and mentions

In [6]:
# Function that returns number of hashtags
def hashtag_count(string):
    # Split the string into words
    words = string.split()
    # Create a list of hashtags
    hashtags = [word for word in words if word.startswith('#')]
    # Return number of hashtags
    return len(hashtags)

* The procedure to compute the number of mentions is identical, except that we check if a word starts with `@` instead of `#`:

In [7]:
# Function that returns number of mentions
def mention_count(string):
    # Split the string into words
    words = string.split()
    # Create a list of mentions
    mentions = [word for word in words if word.startswith('@')]
    # Return number of mentions
    return len(mentions)

In [8]:
hashtag_count("@janedoe This is my first tweet! #FirstTweet #Happy")

2

In [9]:
mention_count("@janedoe This is my first tweet! #FirstTweet #Happy")

1

#### Other features
* There are other basic features we can compute such as:
    * Number of sentences 
    * Number of paragraphs
    * Number of words starting with an uppercase
    * All-capital words
    * Numeric quantities
    * etc. ...
* The procedure to extract the above features is extremely similar to the ones we've already covered
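
As a rough illustration of how a few of these could be computed with the same split-and-filter pattern, here is a minimal sketch. The function name `other_basic_features`, the feature names, and the specific rules (for example, requiring at least two characters for an all-capital word) are assumptions chosen for this example.

```python
# Hypothetical helper (not from the course) computing a few extra basic features
def other_basic_features(text):
    # Split the string into whitespace-separated words
    words = text.split()
    return {
        # Words whose first character is an uppercase letter
        'num_capitalized': sum(1 for w in words if w[0].isupper()),
        # Words with no lowercase letters and at least two characters (skips "I" and "A")
        'num_all_caps': sum(1 for w in words if w.isupper() and len(w) > 1),
        # Tokens made up only of digits
        'num_numeric': sum(1 for w in words if w.isdigit()),
        # Rough paragraph count: non-empty blocks separated by blank lines
        'num_paragraphs': len([p for p in text.split('\n\n') if p.strip()]),
    }

sample = "OMG this is GREAT. I watched it 3 times.\n\nTop 5 for sure."
print(other_basic_features(sample))
```

Each value could be stored as its own dataframe column by applying a single-feature function with `apply()`, as done for the earlier features.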

#### Exercises: Character count of Russian tweets

```
# Create a feature char_count
tweets['char_count'] = tweets['content'].apply(len)

# Print the average character count
print(tweets['char_count'].mean())
```

#### Exercises: Word count of TED talks

```
# Function that returns number of words in a string
def count_words(string):
    # Split the string into words
    words = string.split()
    
    # Return the number of words
    return len(words)

# Create a new feature word_count
ted['word_count'] = ted['transcript'].apply(count_words)

# Print the average word count of the talks
print(ted['word_count'].mean())
```

#### Hashtags and mentions in Russian tweets

```
# Import matplotlib for plotting the distributions below
import matplotlib.pyplot as plt

# Function that returns number of hashtags in a string
def count_hashtags(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are hashtags
    hashtags = [word for word in words if word.startswith('#')]
    
    # Return number of hashtags
    return(len(hashtags))

# Create a feature hashtag_count and display distribution
tweets['hashtag_count'] = tweets['content'].apply(count_hashtags)
tweets['hashtag_count'].hist()
plt.title('Hashtag count distribution')
plt.show()
```
***

```
# Function that returns number of mentions in a string
def count_mentions(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are mentions
    mentions = [word for word in words if word.startswith('@')]
    
    # Return number of mentions
    return(len(mentions))

# Create a feature mention_count and display distribution
tweets['mention_count'] = tweets['content'].apply(count_mentions)
tweets['mention_count'].hist()
plt.title('Mention count distribution')
plt.show()
```

### Readability tests
* Here we will look at a set of interesting features known as **readability tests**, which are used to determine the readability of a particular passage (in English)
* In other words, it indicates at what educational level a person needs to be, in order to comprehend a particular piece of text
* The scale usually ranges from **primary school** up to **college graduate level** and is in context of the American education system
* **Readability tests** are done using a mathematical formula that utilizes the word, syllable, and sentence count of the passage.
* Readability tests are routinely used by organizations to determine how difficult their publications are to understand (or not).
* Readability tests have also found applications in domains such as **fake news**, and **opinion spam detection**.
* There are a variety of readability tests in use

#### Readability test examples
* Some common examples:
    * **Flesch reading ease**
    * **Gunning fog index**
    * **Simple Measure of Gobbledygook (SMOG)**
    * **Dale-Chall score**
* $\star$ **Note** that all of these tests are used for texts in **English**
* Tests for other languages also exist that take into consideration the nuances of that particular language
* In this lesson, we will cover the first two scores (Flesch reading ease and Gunning fog index) in detail
    * However, once you understand these two, you will be in a good position to understand and use the other scores as well.
    
### Flesch reading ease
* The Flesch Reading Ease is one of the **oldest** and **most widely used** readability tests
* Dependent on two factors:
    * **1) The greater the average sentence length, the harder a text is to read.**
    * **2) The greater the average number of syllables in a word, the harder a text is to read.**
* The higher the Flesch Reading Ease score, the greater the readability
    * A higher score indicates that the text is easier to understand
* **Higher the score, greater the readability**
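
For reference, the commonly cited form of the Flesch reading ease formula is `206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)`. The sketch below simply evaluates that formula from counts supplied by the caller. The function name `flesch_reading_ease` is an illustrative choice, and counting syllables reliably is the hard part that a dedicated library takes care of.

```python
# Illustrative helper: evaluates the standard Flesch reading ease formula
def flesch_reading_ease(num_words, num_sentences, num_syllables):
    # Factor 1: average sentence length (words per sentence)
    avg_sentence_length = num_words / num_sentences
    # Factor 2: average number of syllables per word
    avg_syllables_per_word = num_syllables / num_words
    # Both factors are subtracted, so larger values lower the score
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# Same word and syllable counts, split into fewer (longer) sentences
print(flesch_reading_ease(100, 5, 150))
print(flesch_reading_ease(100, 2, 150))
```

The second call uses longer sentences and therefore yields a lower score, matching the first factor listed above.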
    
<img src='data/flesch_scores.png' width="500" height="250" align="center"/>

### Gunning fog index
* Developed in 1954
* Dependent on:
    * **1) Average sentence length**
    * **2) The greater the percentage of complex words, the harder the text is to read.**
        * Here, "complex words" refers to all words that have three or more syllables
* Unlike Flesch, the formula for Gunning fog index is such that the higher the score, the more difficult the passage is to understand
* **Higher the index, lower the readability.**
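
The commonly cited form of the Gunning fog formula is `0.4 * ((words / sentences) + 100 * (complex words / words))`. A minimal sketch that evaluates it from precomputed counts is shown below. The function name `gunning_fog_index` is an illustrative choice, and deciding which words count as complex is left to the caller.

```python
# Illustrative helper: evaluates the standard Gunning fog formula
def gunning_fog_index(num_words, num_sentences, num_complex_words):
    # Factor 1: average sentence length (words per sentence)
    avg_sentence_length = num_words / num_sentences
    # Factor 2: percentage of words with three or more syllables
    percent_complex = 100 * num_complex_words / num_words
    return 0.4 * (avg_sentence_length + percent_complex)

# Five words, one sentence, no complex words
print(gunning_fog_index(5, 1, 0))
# Same length, but two of the five words are complex
print(gunning_fog_index(5, 1, 2))
```

With five words, one sentence and no complex words the formula gives 2.0, and the index rises as the share of complex words grows.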

<img src='data/gunning_scores.png' width="500" height="250" align="center"/>

### The textatistic library 
* We can conduct these readability tests in Python using the Textatistic library 

In [11]:
#pip install textatistic



In [14]:
print(text)

Mary had a little lamb.


In [13]:
# Create a Textatistic Object
readability_scores = Textatistic(text).scores

# Generate scores
print(readability_scores['flesch_score'])
print(readability_scores['gunningfog_score'])

100.24000000000002
2.0


#### Exercises: Readability of 'The Myth of Sisyphus'

```
# Import Textatistic
from textatistic import Textatistic

# Compute the readability scores 
readability_scores = Textatistic(sisyphus_essay).scores

# Print the flesch reading ease score
flesch = readability_scores['flesch_score']
print("The Flesch Reading Ease is %.2f" % (flesch))
```

#### Exercises: Readability of various publications

```
# Import Textatistic
from textatistic import Textatistic

# List of excerpts
excerpts = [forbes, harvard_law, r_digest, time_kids]

# Loop through excerpts and compute gunning fog index
gunning_fog_scores = []
for excerpt in excerpts:
  readability_scores = Textatistic(excerpt).scores
  gunning_fog = readability_scores['gunningfog_score']
  gunning_fog_scores.append(gunning_fog)

# Print the gunning fog indices
print(gunning_fog_scores)
```

# $\star$ Chapter 2: Text preprocessing, POS tagging and NER
In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. Upon mastering these concepts, you will proceed to make the Gettysburg address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

### Tokenization and Lemmatization
#### Text sources (all very different styles, grammars, and vocab)
* News articles 
* Tweets
* Social media comments

#### Making text machine friendly
* It is important to standardize all of these (above) texts into a machine-friendly format
* We want our models to treat similar words as the same

### Text preprocessing techniques
* The text processing techniques you use are dependent on the application you're working on
* Some of the common ones we'll be covering include:
    * Converting words into lowercase
    * Removing leading and trailing whitespaces
    * Removing punctuation
    * Removing commonly occurring words (**stopwords**)
    * Expanding contractions
    * Removing special characters (numbers, emojis, etc)
    
### Tokenization
* **Tokenization** is the process of splitting a string into its constituent tokens
* These tokens may be sentences, words, or punctuation marks and *are specific to a particular language.*
* In this course, we will focus primarily on word and punctuation tokens
* Tokenization also involves **splitting contracted words** into their component tokens (for example, `don't` becomes `do` and `n't`).

#### Tokenization using spaCy
* We load a pre-trained English model, `en_core_web_sm` using `spacy.load()`
    * This will return a language object that has the know-how to perform tokenization

In [16]:
# import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Hello! I don't know what I'm doing here."

# Create a Doc object
doc = nlp(string)

* The `doc` object defined above contains the required tokens (and many other things, as we will soon find out).
* We generate the list of tokens by using list comprehension as shown:

In [18]:
# Generate list of tokens
tokens = [token.text for token in doc]
print(tokens)

['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'I', "'m", 'doing', 'here', '.']


### Lemmatization
* **Lemmatization** is the process of converting a word into its lowercased base form, or **lemma**.
* This is an extremely powerful process of standardization
* Examples:
    * `reducing`, `reduces`, `reduced`, `reduction` $\Rightarrow$ **`reduce`**
    * `am`, `are`, `is` $\Rightarrow$ **`be`**
    * `n't` $\Rightarrow$ **`not`**
    * `'ve` $\Rightarrow$ **`have`**
* When you pass the string into `nlp`, spaCy automatically performs lemmatization by default. Therefore, generating lemmas is identical to generating tokens, except that we extract `token.lemma_` in each iteration inside the list comprehension instead of `token.text`.

In [20]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Hello! I don't know what I'm doing here."

# Create a Doc object
doc = nlp(string)

# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)

['hello', '!', '-PRON-', 'do', 'not', 'know', 'what', '-PRON-', 'be', 'do', 'here', '.']


* **Also note that spaCy converted each `I` into `-PRON-`**; this is standard behavior in spaCy 2.x, where every pronoun is lemmatized to the string `-PRON-` (spaCy 3.x no longer uses this placeholder)

#### Exercises: Tokenizing the Gettysburg Address

```
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)
```

#### Exercises: Lemmatizing the Gettysburg address

```
print(gettysburg)

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))
```

### Text cleaning

#### Text cleaning techniques
* Unnecessary whitespaces and escape sequences
* Punctuations
* Special characters (numbers, emojis, etc.)
* Stopwords


* In other words, it is very common to remove non-alphabetic tokens and words that occur so commonly that they are not very useful for analysis

#### isalpha()
* Every Python string has an **`isalpha()`** method that returns `True` if all the characters of the string are alphabetic
* This is an extremely convenient method to remove all (lemmatized) tokens that are or that contain numbers, punctuation and emojis
* **A word of caution:** `isalpha()` has a tendency to return `False` on words we would not want to remove. Examples:
    * Abbreviations: `U.S.A.`, `U.K.`, etc.
    * Proper nouns with numbers in them: `word2vec` and `xto10x`
    * For such nuanced cases, `isalpha()` may not be sufficient and it may be advisable to write your own custom functions (typically using regex).
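
One possible shape for such a custom function is sketched below. The pattern and the name `is_wordlike` are assumptions made for this example: it accepts tokens made only of ASCII letters, digits and periods, provided they contain at least one letter. Real applications will likely need different rules, for instance to handle non-English letters.

```python
import re

# Hypothetical pattern: letters, digits and periods only, with at least one letter
WORDLIKE = re.compile(r'(?=.*[A-Za-z])[A-Za-z0-9.]+')

def is_wordlike(token):
    # fullmatch requires the whole token to fit the pattern, not just a prefix
    return WORDLIKE.fullmatch(token) is not None

# Compare isalpha() with the custom check on a few tokens
for token in ['U.S.A.', 'word2vec', 'xto10x', 'song', '5', '!!!!', '.']:
    print(token, token.isalpha(), is_wordlike(token))
```

Here `U.S.A.`, `word2vec` and `xto10x` fail `isalpha()` but pass the custom check, while `5`, `!!!!` and `.` are rejected by both.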

#### Removing non-alphabetic characters
* First, we generate the lemmatized tokens like before:

In [21]:
string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

# Generate list of tokens
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]

* Next, we loop through the tokens again and choose only those words that are either `-PRON-` or contain only alphabetic characters.

In [22]:
# Remove tokens that are not alphabetic
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() or lemma =='-PRON-']

# Print string after text cleaning
print(' '.join(a_lemmas))

OMG this be like the good thing ever wow such an amazing song -PRON- be hooked Top definitely


* Convert to lowercase (in the video this happened automatically with the code above; I'm not sure why it didn't here, so I'm lowercasing in a separate step).

In [28]:
al_lemmas = []
for lemma in a_lemmas:
    al_lemmas.append(lemma.lower())

In [29]:
print(al_lemmas)

['omg', 'this', 'be', 'like', 'the', 'good', 'thing', 'ever', 'wow', 'such', 'an', 'amazing', 'song', '-pron-', 'be', 'hooked', 'top', 'definitely']


### Stopwords
* There are some words in the English language that occur so commonly that it is often a good idea to just ignore them
* Examples: 
    * articles: 
        * a
        * the
    * be verbs:
        * is
        * am
    * pronouns:
        * he
        * she
        * they
* **`spaCy` has a built-in list of stopwords**

In [30]:
# Get list of stopwords
stopwords = spacy.lang.en.stop_words.STOP_WORDS
string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

In [31]:
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))

OMG like good thing wow amazing song hooked Top definitely


In [33]:
al_lemmas = []
for lemma in a_lemmas:
    al_lemmas.append(lemma.lower())

In [34]:
print(' '.join(al_lemmas))

omg like good thing wow amazing song hooked top definitely


* **Notice** that we have removed the `-PRON-` condition as pronouns are stopwords anyway and should be removed
* Additionally, we have introduced a new condition to check if the word belongs to spaCy's list of stopwords
* **Notice also** how the string consists only of base form words
* **Always** exercise caution while using third-party stopword lists
    * It is common for an application to find certain words useful that may be considered stopwords by third-party lists
    * **It is often advisable to create your own custom stopword lists**
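
A minimal sketch of building a custom stopword list with plain Python sets follows. The small `base_stopwords` set, the words kept, and the words added are all hypothetical choices for a sentiment-analysis setting; in practice the base set could come from a library's list.

```python
# Hypothetical starting list; in practice this could come from a library
base_stopwords = {'a', 'an', 'the', 'is', 'be', 'not', 'no', 'this'}

# Assumption: negations carry signal for sentiment analysis, so keep them
keep = {'not', 'no'}
# Assumption: domain-specific filler words we also want to drop
extra = {'movie', 'film'}

# Remove the words to keep, then add the extra words
custom_stopwords = (base_stopwords - keep) | extra

lemmas = ['this', 'movie', 'be', 'not', 'good']
filtered = [lemma for lemma in lemmas if lemma not in custom_stopwords]
print(filtered)  # ['not', 'good']
```

Had `not` stayed in the stopword list, the cleaned text would read `good`, which reverses the meaning of the original review.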
    
#### Other text preprocessing techniques
* There are other preprocessing techniques that are used but have been omitted for the sake of brevity
* Some of them include:
    * **Removing HTML or XML tags**
    * **Replacing accented characters**
    * **Correcting spelling errors and shorthands**
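
The first two of these can be approximated with the standard library alone, as in the rough sketch below. The function names `strip_tags` and `strip_accents` are illustrative, and the regex-based tag removal is a naive assumption that will not cope with all real-world HTML, where a proper parser is usually the safer choice.

```python
import re
import unicodedata

def strip_tags(text):
    # Naive approach: drop anything that looks like <...>
    return re.sub(r'<[^>]+>', '', text)

def strip_accents(text):
    # Decompose accented characters into base letter + combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # Discard every non-ASCII character, which includes the combining marks
    return decomposed.encode('ascii', 'ignore').decode('ascii')

raw = '<p>The <b>café</b> was naïve.</p>'
print(strip_accents(strip_tags(raw)))  # The cafe was naive.
```

Note that the ASCII step also drops any character that has no ASCII equivalent, so it is only suitable when that loss is acceptable.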
    
    
* **A word of caution:** the text preprocessing techniques you use are always dependent on the application
* There are many applications which may find punctuations, numbers, and emojis useful, so in these cases it may not be wise to remove them
* **Always use only those text preprocessing techniques that are relevant to your application.**

#### Exercises: Cleaning a blog post

```
# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)

# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))
```

#### Exercises: Cleaning TED talks in a dataframe

```
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])
```

In [35]:
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)

### Part-of-speech tagging
* Part-of-speech tagging (or **POS tagging**) is one of the most widely used feature engineering techniques in NLP

#### Applications
* **Word-sense disambiguation:**
    * `"The bear is a majestic animal"`
    * `"Please bear with me"`
* **Sentiment analysis**
* **Question answering systems**
* **Fake news and opinion spam detection** (linguistic approaches)
    * For example, one paper discovered that fake news headlines, on average, tend to use fewer common nouns and more proper nouns than mainstream headlines
    * Generating the POS tags for these words proved extremely useful in detecting false or hyperpartisan news
    
#### POS tagging using spaCy
* **POS tagging** is the process of assigning every word (or token) in a piece of text its corresponding part of speech.
* Performing POS tagging with spaCy is almost identical to generating tokens or lemmas.

In [36]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

In [37]:
# Initialize string
string = "Jane is an amazing guitarist"

In [38]:
# Create a Doc object
doc = nlp(string)

Using a list comprehension, we generate a list of tuples: the first element of each tuple is the token, obtained with `token.text`, and the second is its POS tag, obtained with `token.pos_`.

In [39]:
# Generate list of tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'), ('amazing', 'ADJ'), ('guitarist', 'NOUN')]


* SpaCy infers the POS tags of these words based on the predictions given by its pre-trained models.
* In other words, **the accuracy of the POS tagging is dependent on the data that the model has been trained on and the data that it is being used on.**
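
Once the `(token, tag)` pairs are available, tallying tags needs nothing beyond the standard library. The sketch below hard-codes the pairs from the output printed above so that it runs without spaCy; `tag_counts` is an illustrative name.

```python
from collections import Counter

# (token, tag) pairs copied from the output printed above
pos = [('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'),
       ('amazing', 'ADJ'), ('guitarist', 'NOUN')]

# Tally how often each POS tag occurs
tag_counts = Counter(tag for _, tag in pos)

# A Counter returns 0 for tags that never occurred
print(tag_counts['PROPN'], tag_counts['NOUN'], tag_counts['VERB'])  # 1 1 0
```

Counts such as the number of proper nouns or other nouns can then be used directly as numerical features.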

#### POS annotations in spaCy
* spaCy is capable of identifying close to 20 parts-of-speech and it uses specific annotations to denote a particular part of speech
* complete spaCy annotation list [HERE](https://spacy.io/api/annotation)

<img src='data/POS_annot.png' width="600" height="300" align="center"/>

#### POS tagging in Lord of the Flies

```
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(lotf)

# Generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)
```

#### Exercises: Counting nouns in a piece of text
In this exercise, we will write two functions, `nouns()` and `proper_nouns()` that will count the number of other nouns and proper nouns in a piece of text respectively.

These functions will take in a piece of text and generate a list containing the POS tags for each word. Each will then return the number of proper nouns or other nouns that the text contains. We will use these functions in the next exercise to generate interesting insights about fake news.

```
nlp = spacy.load('en_core_web_sm')

# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of proper nouns
    return pos.count('PROPN')

print(proper_nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))
```
***

```
nlp = spacy.load('en_core_web_sm')

# Returns number of other nouns
def nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of other nouns
    return pos.count('NOUN')

print(nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))
```

In [40]:
# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]

    # Return number of proper nouns
    return pos.count('PROPN')

In [41]:
# Returns number of other nouns
def nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]

    # Return number of other nouns
    return pos.count('NOUN')

#### Exercises: Noun usage in fake news

```
headlines['num_propn'] = headlines['title'].apply(proper_nouns)

# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Print results
print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively"%(real_propn, fake_propn))
```
*** 

```
headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of other nouns
real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively"%(real_noun, fake_noun))
```

### Named entity recognition
* **Named entity recognition** or **NER** has a host of extremely useful applications

#### Applications
* Efficient search algorithms 
* Question answering systems
* News article classification
* Customer service centers (to classify and record complaints efficiently)

#### Named entity recognition
* A **named entity** is anything that can be denoted with a proper name or a proper noun. 
* **NER** is the process of identifying such named entities in a piece of text and classifying them into predefined categories
* Categories include person, organization, country, etc.

#### NER using spaCy
* Performing NER is extremely easy using spaCy's pre-trained models

In [42]:
string = "John Doe is a software engineer working at Google. He lives in France."

In [43]:
# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)

In [44]:
# Generate named entities
ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)

[('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]


* Note that `GPE` is "Geopolitical Entity"
* Currently, spaCy's models are capable of identifying more than 15 different types of named entities
* Find [complete list here](https://spacy.io/api/annotation#named-entities)
* Below is a small snapshot:

<img src='data/NER_annote.png' width="500" height="250" align="center"/>

* **Word of caution:** if we are trying to extract named entities from texts in a heavily technical field (such as medicine), spaCy's pre-trained models may not perform very well.
* In such nuanced cases, it is better to train your own models with your specialized data.
* Also remember that spaCy's models are **language specific**

#### Exercises: Named entities in a sentence

```
# Load the required model
nlp = spacy.load('en_core_web_sm')

# Create a Doc instance 
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'
doc = nlp(text)

# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)
```

#### Exercises: Identifying people mentioned in a news article

```
def find_persons(text):
  # Create Doc object
  doc = nlp(text)
  
  # Identify the persons
  persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
  
  # Return persons
  return persons

print(find_persons(tc))
```

In [45]:
def find_persons(text):
  # Create Doc object
  doc = nlp(text)
  
  # Identify the persons
  persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
  
  # Return persons
  return persons

# $\star$ Chapter 3: N-Gram models
Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.

### Building a bag-of-words model

* **Vectorization** is the process of converting text (or other data) into vectors
* Recall that for any ML algorithm, data must be in tabular form and training features must all be numerical
* **Bag-of-words** is a technique that converts text documents into vectors we can use in ML algorithms
* The **bag-of-words model** is a procedure for extracting word tokens from a text document, computing the frequency of these word tokens, and constructing a word vector based on these frequencies and the vocabulary of the entire corpus of documents

#### Bag of words model
* Extract word tokens
* Compute frequency of word tokens
* Construct a word vector out of these frequencies and vocabulary of corpus
* With, for example, 15 words in our vocabulary, our word vectors will have 15 dimensions and each dimension's value will correspond to the frequency of the word token corresponding to that dimension
    * For instance, the second dimension will correspond to the number of times the second word in the vocabulary occurs in the document 
    
<img src='data/bow_ex1.png' width="600" height="300" align="center"/>

* **Note** that performing text preprocessing usually leads to smaller vocabularies (which is often a good thing).
* When working with vectorization, it is routine to form word vectors running into thousands of dimensions. Keeping the vocabulary (and hence the number of dimensions) to a minimum helps improve performance.
* **Reducing number of dimensions helps improve performance.**
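
The three steps listed above can be sketched in plain Python. This is only a minimal illustration and not how `CountVectorizer` is implemented: the lowercase-and-split tokenizer, the alphabetically sorted vocabulary, and the helper name `bow_vector` are all assumptions made for the example.

```python
from collections import Counter

# Toy corpus of three short sentences
docs = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Step 1: extract word tokens (assumption: lowercase + whitespace split)
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary of the entire corpus (assumption: sorted alphabetically)
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def bow_vector(tokens, vocabulary):
    """Steps 2 and 3: count token frequencies and lay them out along the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each vector has one dimension per vocabulary word, so the vector length grows with the vocabulary, which is why a smaller vocabulary means fewer dimensions.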

In [48]:
corpus = pd.Series([
        "The lion is the king of the jungle",
        "Lions have lifespans of a decade",
        "The lion is an endangered species"
])

(For now we will ignore text preprocessing)

In [50]:
# Import CountVectorizer
# from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

print(bow_matrix.toarray())

[[0 0 0 0 1 1 1 0 1 0 1 0 3]
 [0 1 0 1 0 0 0 1 0 1 1 0 0]
 [1 0 1 0 1 0 0 0 1 0 0 1 1]]


* **Note** that the `bow_matrix` is a sparse matrix that can be printed out in its 2D form using `bow_matrix.toarray()`

In [51]:
print(bow_matrix)

  (0, 12)	3
  (0, 8)	1
  (0, 4)	1
  (0, 6)	1
  (0, 10)	1
  (0, 5)	1
  (1, 10)	1
  (1, 9)	1
  (1, 3)	1
  (1, 7)	1
  (1, 1)	1
  (2, 12)	1
  (2, 8)	1
  (2, 4)	1
  (2, 0)	1
  (2, 2)	1
  (2, 11)	1


* **Notice** that the output of `bow_matrix.toarray()` is different from the word vectors generated in the image above.
    * This is because `CountVectorizer` automatically lowercases words and ignores single-character tokens such as `'a'`.
    * Also, the column order follows the vocabulary that `CountVectorizer` learns (stored in `vectorizer.vocabulary_`), which need not match the order of the words shown in the image
    * We can use this `bow_matrix` as our training features in ML models
    
#### Exercises: BoW model for movie taglines 
In this exercise, you have been provided with a `corpus` of more than 7000 movie tag lines. Your job is to generate the bag of words representation `bow_matrix` for these taglines. For this exercise, we will ignore the text preprocessing step and generate `bow_matrix` directly.

We will also investigate the shape of the resultant `bow_matrix`.

```
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Print the shape of bow_matrix
print(bow_matrix.shape)
```

#### Exercises: Analyzing dimensionality and preprocessing
Your job is to generate the bag of words representation `bow_lem_matrix` for these lemmatized taglines and compare its shape with that of `bow_matrix` obtained in the previous exercise. 

```
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Print the shape of bow_lem_matrix
print(bow_lem_matrix.shape)
```

#### Exercises: Mapping feature indices with feature names
We saw that the columns of `bow_matrix` are identified only by their index, and that this order need not match the order in which we might list the words ourselves. In this exercise, we will learn to map each feature index to its corresponding feature name from the vocabulary.

```
# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())

# Map the column names to vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
bow_df.columns = vectorizer.get_feature_names_out()

# Print bow_df
print(bow_df)
```

### Building a BoW Naive Bayes classifier
#### The spam filtering problem
* **Steps:**
    * 1) Text preprocessing
    * 2) Building a bag-of-words model (or representation)
    * 3) Machine learning (predictive modeling)
    
#### Text preprocessing using CountVectorizer
* `CountVectorizer` arguments:
    * **`lowercase`:** `False`, `True`
    * **`strip_accents`:** `'unicode'`, `'ascii'`, `None`
    * **`stop_words`:** `'english'`, `list`, `None`
    * **`token_pattern`:** `regex`
        * specify tokenization using a regular expression as the value of the `token_pattern` argument
    * **`tokenizer`:** `function`
        * tokenization can also be specified using a `tokenizer` argument
        * here, you can pass a function that takes a string as an argument and returns a list of tokens

* In these (2) ways, `CountVectorizer` allows usage of `spaCy`'s tokenization techniques
* `CountVectorizer` cannot perform certain steps such as lemmatization automatically.
    * This is where `spaCy` is useful
* Although it performs tokenization and preprocessing, `CountVectorizer`'s main job is to convert a corpus into a matrix of numerical vectors
* When building the spam-detecting BoW model below, we set `lowercase` to `False` because spam messages tend to abuse all-capital words and we might want to preserve this information for the ML step.

```
# Import CountVectorizer
# from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=False)

# Import train_test_split
# from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.25)

# Generate training BoW vectors
X_train_bow = vectorizer.fit_transform(X_train)

# Generate test BoW vectors
X_test_bow = vectorizer.transform(X_test)
```
* **Note** that it is possible that there may be some words in the test data that are not in the vocabulary of the vectorizer (which was trained only with the vocabulary contained in the training set). **In such cases, `CountVectorizer` simply ignores these words.**

```
# Import MultinomialNB
from sklearn.naive_bayes import MultinomialNB

# Create MultinomialNB object
clf = MultinomialNB()

# Train clf
clf.fit(X_train_bow, y_train)

# Compute accuracy on test set
accuracy = clf.score(X_test_bow, y_test)
print(accuracy)
```

#### Exercises: BoW vectors for movie reviews

```
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# Fit and transform X_train
X_train_bow = vectorizer.fit_transform(X_train)

# Transform X_test
X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)
```

#### Exercises: Predicting the sentiment of a movie review

```
# Create a MultinomialNB object
clf = MultinomialNB()

# Fit the classifier
clf.fit(X_train_bow, y_train)

# Measure the accuracy
accuracy = clf.score(X_test_bow, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = "The movie was terrible. The music was underwhelming and the acting mediocre."
prediction = clf.predict(vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))
```

### Building n-gram models

<img src='data/bow_shortcomings.png' width="300" height="150" align="center"/>

* If we were to construct BoW vectors for these reviews, we would get identical vectors.
* **The biggest shortcoming of the bag of words model: the *context* of words is lost.**
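
The loss of context can be reproduced with two made-up reviews (not the ones in the image) that use exactly the same words in a different order:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reviews: same words, opposite meaning
reviews = [
    "the movie was good and not boring",
    "the movie was not good and boring",
]

bow = CountVectorizer().fit_transform(reviews).toarray()
print(bow)

# Both rows are identical because word order is discarded
identical = bool((bow[0] == bow[1]).all())
print(identical)
```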

### n-grams
* An **n-gram** is a contiguous sequence of n elements (or words) in a given document
* n = 1 $\Rightarrow$ bag-of-words

#### Applications
* sentence completion
* spelling correction
* machine translation correction

#### Building n-gram models using scikit-learn
* `CountVectorizer` takes in an argument **`ngram_range`**, which is a tuple containing the lower and upper bound for the range of n-values
* For example, the following only generates bigrams:
    * `bigrams = CountVectorizer(ngram_range=(2,2))`
* The following generates unigrams, bigrams, and trigrams:
    * `ngrams = CountVectorizer(ngram_range=(1,3))`
    
#### Shortcomings
* Curse of dimensionality
* Higher order n-grams are rare
* Keep $n$ small

#### Exercises: n-gram models for movie tag lines

```
# Generate n-grams up to n=1
vectorizer_ng1 = CountVectorizer(ngram_range=(1,1))
ng1 = vectorizer_ng1.fit_transform(corpus)

# Generate n-grams up to n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1,2))
ng2 = vectorizer_ng2.fit_transform(corpus)

# Generate n-grams up to n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))
ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" % (ng1.shape[1], ng2.shape[1], ng3.shape[1]))
```

#### Exercises: Higher order n-grams for sentiment analysis

```
# Define an instance of MultinomialNB 
clf_ng = MultinomialNB()

# Fit the classifier 
clf_ng.fit(X_train_ng, y_train)

# Measure the accuracy 
accuracy = clf_ng.score(X_test_ng, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = "The movie was not good. The plot had several holes and the acting lacked panache."
prediction = clf_ng.predict(ng_vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))
```
#### Exercises: Comparing performance of n-gram models

```
import time

start_time = time.time()
# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(df['review'], df['sentiment'], test_size=0.5, random_state=42, stratify=df['sentiment'])

# Generating ngrams
vectorizer = CountVectorizer(ngram_range=(1,1))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. The ngram representation had %i features." % (time.time() - start_time, clf.score(test_X, test_y), train_X.shape[1]))
```
***

```
start_time = time.time()
# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(df['review'], df['sentiment'], test_size=0.5, random_state=42, stratify=df['sentiment'])

# Generating ngrams
vectorizer = CountVectorizer(ngram_range=(1,3))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. The ngram representation had %i features." % (time.time() - start_time, clf.score(test_X, test_y), train_X.shape[1]))
```

# $\star$ Chapter 4: TF-IDF and similarity scores
Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie and a TED Talk recommender. Finally, you will also learn about word embeddings and using word vector representations, you will compute similarities between various Pink Floyd songs.

### Building tf-idf document vectors

#### n-gram modeling
* Weight of dimension dependent on the frequency of the word corresponding to the dimension

#### Application
* Automatically detect stopwords
* Search algorithms
* Recommender systems
* Better performance in predictive modeling for some cases

#### Term frequency-inverse document frequency
* Tf-idf is a weighting mechanism that adjusts the importance of commonly occurring words
* **It is based on the idea that the weight of a term in a document should be proportional to its frequency and an inverse function of the number of documents in which it occurs.**

<img src='data/tfidf_formula2.png' width="600" height="600" align="center"/>

* **In general, the higher the tf-idf weight, the more important the word is in characterizing the document.**
* **A high tf-idf weight for a word in a document may imply that the word is relatively exclusive to that particular document, or that the word occurs extremely commonly in the document, or both.**
* The parameters and methods available within `TfidfVectorizer` are almost identical to those of `CountVectorizer`
* The only difference is that `TfidfVectorizer` assigns weights using the tf-idf formula in the image above and has extra parameters related to inverse document frequency that `CountVectorizer` does not have.
* Weights are non-integer (floats)
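
The weighting idea can be sketched by hand. The snippet assumes the classic form `tf * log(N / df)`, where `tf` is the term's count in the document, `N` is the number of documents and `df` is the number of documents containing the term. The corpus and the helper name `tfidf_weight` are made up for illustration. Note that `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Hypothetical corpus, already tokenized
docs = [
    "the lion is the king of the jungle".split(),
    "the lion is an endangered species".split(),
    "the tiger is the largest cat".split(),
]

def tfidf_weight(term, doc, docs):
    """Classic tf-idf: term frequency times log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

w_the = tfidf_weight("the", docs[0], docs)        # occurs in every document
w_lion = tfidf_weight("lion", docs[0], docs)      # occurs in two documents
w_jungle = tfidf_weight("jungle", docs[0], docs)  # occurs in one document
print(w_the, w_lion, w_jungle)
```

Under this formula, `'the'` gets a weight of zero because it appears in every document, while `'jungle'`, which is exclusive to the first document, gets the highest weight of the three.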

```
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
```

#### Exercises: tf-idf vectors for TED talks

```
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)
```

### Cosine similarity
* We'll now explore techniques that allow us to determine how similar two vectors and, consequently, two documents are to each other
* **Cosine similarity score** is one of the most popular metrics in NLP

<img src='data/cos_sim.png' width="400" height="200" align="center"/>

* Mathematically, cosine similarity is the ratio of the dot product of the two vectors to the product of their magnitudes

<img src='data/vector_dot_products.png' width="600" height="300" align="center"/>

#### Magnitude of a vector
* The **magnitude** of a vector is essentially the length of the vector
    * Mathematically it is defined as the square root of the sum of the squares of values across all the dimensions of a vector
    
<img src='data/vector_magnitude.png' width="600" height="300" align="center"/>

#### The cosine score

<img src='data/cos_score.png' width="600" height="300" align="center"/>

#### Cosine Score: points to remember:
* In general, the value lies between -1 and 1
* In NLP, the value lies between 0 and 1, because word counts and tf-idf weights are never negative (0 = no similarity, 1 = identical)
* Robust to document length
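
The definition can be checked by hand with a few lines of plain Python. The helper name `cosine_score` is an assumption for this sketch. It simply divides the dot product by the product of the two magnitudes.

```python
import math

def cosine_score(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

A = (4, 7, 1)
B = (5, 2, 3)
score = cosine_score(A, B)
print(round(score, 4))  # 0.7388

# Scaling a vector (e.g. a document repeated twice) leaves the score unchanged
A_doubled = tuple(2 * x for x in A)
unchanged = math.isclose(cosine_score(A_doubled, B), score)
print(unchanged)
```

The second check illustrates robustness to document length: doubling every count changes the magnitude of the vector but not its direction.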

```
# Import the cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Define two 3-dimensional vectors A and B
A = (4, 7, 1)
B = (5, 2, 3)

# Compute the cosine score of A and B
score = cosine_similarity([A], [B])

# Print the cosine score 
print(score)
```
* **Remember** that `cosine_similarity` only accepts 2D arrays as inputs. Passing 1D arrays will throw an error.

In [55]:
# Import the cosine_similarity
# from sklearn.metrics.pairwise import cosine_similarity

# Define two 3-dimensional vectors A and B
A = (4, 7, 1)
B = (5, 2, 3)

# Compute the cosine score of A and B
score = cosine_similarity([A], [B])

# Print the cosine score 
print(score)

[[0.73881883]]


* Note that we got the same answer in the calculations performed in the illustration above. 

#### Exercises: Computing dot product

```
import numpy as np

# Initialize numpy vectors
A = np.array([1,3])
B = np.array([-2,2])

# Compute dot product
dot_prod = np.dot(A, B)

# Print dot product
print(dot_prod)
```

#### Exercises: Cosine similarity matrix of a corpus

```
# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
```

### Building a plot line based recommender

#### Steps
* 1) Text preprocessing
* 2) Generate tf-idf vectors
* 3) Generate cosine similarity matrix (containing the pairwise similarity scores of every movie with every other movie)

#### The recommender function
* 1) Take a movie title, cosine similarity matrix and indices series as arguments
    * The **indices series** is a reverse mapping from movie titles to their indices in the original dataframe
* 2) Extract pairwise cosine similarity scores for the movie
* 3) Sort the scores in descending order
* 4) Output titles corresponding to the highest scores
* 5) Ignore the highest similarity score (of 1)
    * **This is because the movie most similar to a given movie is the movie itself!**
    
#### Generating tf-idf vectors

```
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(movie_plots)
```

#### Generating cosine similarity matrix

```
# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Generate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
```
* This generates a matrix that contains the pairwise similarity score of every movie with every other movie.
* The value corresponding to the $i$th row and the $j$th column is the cosine similarity score of movie $i$ with movie $j$
* The diagonal elements of the matrix will be 1 (movie with itself)

#### The linear_kernel function
* With `TfidfVectorizer`'s default L2 normalization, the magnitude of each tf-idf vector is 1
* The cosine score between two such tf-idf vectors is therefore just their dot product
* Can significantly improve computation time
* Use `linear_kernel` instead of `cosine_similarity`
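
A quick sanity check of this claim, using three made-up plot lines. It relies on `TfidfVectorizer`'s default `norm='l2'`. With a different `norm` setting, the two functions would no longer agree.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical plot lines
plots = [
    "a lion cub becomes king of the jungle",
    "a young lion must reclaim his kingdom",
    "a detective hunts a masked criminal in the city",
]

tfidf_matrix = TfidfVectorizer().fit_transform(plots)

# With the default norm='l2', every row has magnitude 1
row_norms = np.sqrt(tfidf_matrix.multiply(tfidf_matrix).sum(axis=1))
print(np.allclose(row_norms, 1.0))

# So the plain dot product (linear_kernel) matches the cosine score
same = np.allclose(
    linear_kernel(tfidf_matrix, tfidf_matrix),
    cosine_similarity(tfidf_matrix, tfidf_matrix),
)
print(same)
```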

#### Generating cosine similarity matrix

```
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Generate cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
```
* Using the `linear_kernel` outputs the same result, but takes significantly less time to compute.


#### The get_recommendations function
* `get_recommendations('The Lion King', cosine_sim, indices)`

#### Exercises: Comparing linear_kernel and cosine_similarity

```
import time

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))
```

#### Exercises: Plot recommendation engine

```
# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations 
print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))
```

#### Exercises: The recommender function

```
# Generate mapping between titles and index
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]
```    

In [56]:
def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

#### Exercises: TED talk recommender

```
# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(transcripts)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations 
print(get_recommendations('5 ways to kill your dreams', cosine_sim, indices))
```

### Beyond n-grams: word embeddings

#### The problem with BoW and tf-idf
* Consider the following sentences:
    * `'I am happy'`
    * `'I am joyous'`
    * `'I am sad'`
* If we were to compute the similarities, `'I am happy'` and `'I am joyous'` would have the same score as `'I am happy'` and `'I am sad'`... regardless of how we vectorize it.
    * This is because `happy`, `joyous`, and `sad` are considered to be completely different words. 
* The *meaning* of the words is something that the vectorization techniques that we've covered so far simply cannot capture

### Word embeddings
* Mapping words into an n-dimensional vector space
* Produced using deep learning and huge amounts of data
* Once generated, these vectors can be used to discern how similar two words are to each other
* Used to detect synonyms and antonyms
* Captures complex relationships
    * `King` : `Queen` $\Rightarrow$ `Man` : `Woman`
    * `France` : `Paris` $\Rightarrow$ `Russia` : `Moscow`
* **Note** that the word embeddings are not trained on your data; they come with the pre-trained spaCy model you're using and are independent of the size (and contents) of your dataset
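
The relationship idea can be illustrated with hand-made vectors. The 2-dimensional embeddings below are invented purely for illustration and are not taken from any model. Real embeddings have hundreds of dimensions and are learned from data. The helper `cosine` is also an assumption of this sketch.

```python
import numpy as np

# Invented toy embeddings: dimension 0 ~ "royalty", dimension 1 ~ "gender"
toy_vectors = {
    "king": np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man": np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
}

def cosine(u, v):
    """Cosine similarity between two 1D numpy vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman should land near queen
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
closest = max(toy_vectors, key=lambda w: cosine(toy_vectors[w], target))
print(closest)
```

Because related words sit in consistent directions in the vector space, simple vector arithmetic followed by a cosine comparison is enough to recover the analogy in this toy setting.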

#### Word embeddings using spaCy
* **Note** that it is advisable to load larger spaCy models while working with word vectors
* This is because the `en_core_web_sm` model does not technically ship with word vectors but with context-specific tensors, which tend to give relatively poorer results

```
import spacy

# Load model and create Doc object
nlp = spacy.load('en_core_web_lg')
doc = nlp('I am happy')

# Generate word vectors for each token
for token in doc:
    print(token.vector)
```

#### Word similarities

```
doc = nlp("happy joyous sad")
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
```

#### Document similarities

```
# Generate doc objects
sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyous")

# Compute similarity between sent1 and sent2
sent1.similarity(sent2)

# Compute similarity between sent1 and sent3
sent1.similarity(sent3)
```

In [57]:
doc = nlp("happy joyous sad")
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

happy happy 1.0
happy joyous 0.5097088
happy sad 0.20651372
joyous happy 0.5097088
joyous joyous 1.0
joyous sad 0.39810386
sad happy 0.20651372
sad joyous 0.39810386
sad sad 1.0



In [59]:
# Generate doc objects
sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyous")

# Compute similarity between sent1 and sent2
print(sent1.similarity(sent2))

# Compute similarity between sent1 and sent3
print(sent1.similarity(sent3))

0.8965221275043582
0.9127636276495428



#### Exercises: Generating word vectors

```
# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
  for token2 in doc:
    print(token1.text, token2.text, token1.similarity(token2))
```
***

#### Exercises: Computing similarity of Pink Floyd songs

```
# Create Doc objects
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# Print similarity between mother and hopes
print(mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print(mother_doc.similarity(hey_doc))
```