# TF-IDF and similarity scores
  
Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.

**Resources**
  
[SpaCy Documentation](https://spacy.io)  
[Scikit-learn Documentation](https://scikit-learn.org/stable/user_guide.html)  
[List of LaTeX mathematical symbols](https://oeis.org/wiki/List_of_LaTeX_mathematical_symbols)  
[Classical ML Equations in LaTeX](https://blmoistawinde.github.io/ml_equations_latex/)  
[SpaCy en_core_web_lg Documentation](https://spacy.io/models/en)  

In [1]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import re                           # Regular Expressions:      Text manipulation
import spacy                        # spaCy:                    Natural Language Processing
from pprint import pprint           # Pretty Print:             Advanced printing operations



## Building tf-idf document vectors
  
In the last chapter, we learned about n-gram modeling.
  
**n-gram modeling**
  
In n-gram modeling, the weight of a dimension in a document's vector representation depends on the number of times the corresponding word occurs in the document. Say a document contains the word 'human' 5 times. Then the dimension of its vector representation corresponding to 'human' has the value 5. Picture a matrix with one row per document and one column per vocabulary term, where 'human' is one of the columns.
  
**Motivation**
  
However, some words occur very commonly across all the documents in the corpus. As a result, the vector representations are dominated by these dimensions. Consider a corpus of documents on the Universe. Let's say there is a particular document on Jupiter where the words 'jupiter' and 'universe' both occur about 20 times. However, 'jupiter' rarely figures in the other documents, whereas 'universe' is just as common in them. We could argue that although both *jupiter* and *universe* occur 20 times, *jupiter* should be given a larger weight on account of its exclusivity. In other words, the word 'jupiter' characterizes the document more than 'universe'.
  
**Applications**
  
Weighting words this way has a huge number of applications. The weights can be used to automatically detect stopwords for the corpus instead of relying on a generic list. They are used in search algorithms to determine the ranking of pages containing the search query, and in recommender systems, as we will soon find out. In a lot of cases, this kind of weighting also yields better performance during predictive modeling.
  
- Automatically detect stopwords  
- Search algorithms  
- Recommendation systems  
- Better performance for predictive modeling in some cases  
  
**Term frequency-inverse document frequency**
  
The weighting mechanism we've described is known as term frequency-inverse document frequency or tf-idf for short. It is based on the idea that the weight of a term in a document should be proportional to its frequency and an inverse function of the number of documents in which it occurs.
  
$formula.$
  
$\Large w_{i, j} = \text{tf}_{i, j} \cdot \log (\frac{N}{\text{df}_{i}})$
  
$where.$
  
$w_{i,j}$ = Weight of term $i$ in document $j$  
$\text{tf}_{i,j}$ = Term frequency of term $i$ in document $j$  
$N$ = Number of documents in the corpus  
$\text{df}_{i}$ = Number of documents containing term $i$  


**Mathematical formula**
  
Therefore, let's say the word 'library' occurs in a document 5 times. There are 20 documents in the corpus and 'library' occurs in 8 of them. Then the tf-idf weight of 'library' in the vector representation of this document is 5 times the base-10 log of 20 divided by 8, which is approximately 2. In general, the higher the tf-idf weight, the more important the word is in characterizing the document. A high tf-idf weight for a word in a document may imply that the word is relatively exclusive to that document, that it occurs very frequently in the document, or both.
  
$example.$
  
$\Large w_{\text{library}, \text{document}} = 5 \cdot \log_{10} \left( \frac{20}{8} \right) \approx 2$
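
The worked example can be checked with a few lines of Python. This is a minimal sketch of the plain formula above, not of any library's implementation. The helper name `tfidf_weight` is made up for illustration, and base 10 is assumed for the logarithm because that is the base that reproduces the value of roughly 2.

```python
import math


def tfidf_weight(tf, n_docs, df, base=10):
    """Plain tf-idf weight: tf * log(N / df). Hypothetical helper for illustration."""
    return tf * math.log(n_docs / df, base)


# 'library' occurs 5 times in the document and in 8 of the 20 documents
library_weight = tfidf_weight(tf=5, n_docs=20, df=8)
print(round(library_weight, 2))  # 1.99
```

A term that appears in every document gets `log(1) = 0` under this formula, regardless of how often it occurs.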
  
**tf-idf using `scikit-learn`**
  
Generating vectors that use tf-idf weighting is almost identical to what we have done so far. Instead of `CountVectorizer`, we use the `TfidfVectorizer` class of `scikit-learn`. Its parameters and methods are almost identical to those of `CountVectorizer`. The only difference is that `TfidfVectorizer` assigns weights using tf-idf and has extra parameters related to inverse document frequency, which we will not cover in this course. By default it applies a smoothed variant of the idf term and scales each document vector to unit length, so its values differ from those of the plain formula above. Here, we can see how using `TfidfVectorizer` is almost identical to using `CountVectorizer` for a corpus. However, notice that the weights are non-integer and reflect values calculated by the tf-idf formula.
  
<img src='../_images/tf-idf-sklearn-tfidfvect.png' alt='img' width='740'>
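
For readers who cannot see the screenshot, here is a small sketch of the same comparison. The three sentences in `docs` are made-up sample data, and the variable names are chosen only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up sample corpus (an assumption for illustration)
docs = [
    "the lion is the king of the jungle",
    "lions have lifespans of a decade",
    "the lion is an endangered species",
]

# Both vectorizers share the same interface and build the same vocabulary
count_matrix = CountVectorizer().fit_transform(docs)
tfidf_demo = TfidfVectorizer().fit_transform(docs)

print(count_matrix.shape == tfidf_demo.shape)
print(count_matrix.toarray()[0])          # integer counts
print(tfidf_demo.toarray()[0].round(2))   # non-integer tf-idf weights
```

Both matrices have one row per document and one column per vocabulary term; only the values differ.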
  

### tf-idf weight of commonly occurring words
  
The word `bottle` occurs 5 times in a particular document `D` and also occurs in every document of the corpus. What is the tf-idf weight of `bottle` in `D`?
  
Possible Answers

- [x] 0
- [ ] 1
- [ ] Not defined
- [ ] 5
  
Correct! In fact, the tf-idf weight for `bottle` in every document will be 0. This is because the inverse document frequency is constant across documents in a corpus and since `bottle` occurs in every document, its value is log(1), which is 0.
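
The answer of 0 follows from the plain formula. As far as the scikit-learn documentation describes it, `TfidfVectorizer` with its default `smooth_idf=True` uses a smoothed idf instead, so a term found in every document does not drop to exactly 0 there. The sketch below compares the two idf terms; the function names are hypothetical and the corpus size of 500 is just an example value.

```python
import math


def textbook_idf(n_docs, df):
    # idf term of the plain formula: log(N / df)
    return math.log(n_docs / df)


def smoothed_idf(n_docs, df):
    # Smoothed form documented for scikit-learn (smooth_idf=True):
    # ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1


n_docs = 500  # example corpus size
print(textbook_idf(n_docs, df=n_docs))  # 0.0
print(smoothed_idf(n_docs, df=n_docs))  # 1.0
```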

### tf-idf vectors for TED talks
  
In this exercise, you have been given a corpus `ted` which contains the transcripts of 500 TED Talks. Your task is to generate the tf-idf vectors for these talks.
  
In a later lesson, we will use these vectors to generate recommendations of similar talks based on the transcript.
  
1. Import `TfidfVectorizer` from `sklearn.feature_extraction.text`.
2. Create a `TfidfVectorizer` object. Name it `vectorizer`.
3. Generate `tfidf_matrix` for `ted` using the `.fit_transform()` method.

In [2]:
df = pd.read_csv('../_datasets/ted.csv')
df.head()

| Unnamed: 0 | transcript | url |
|---|---|---|
| 0 | We're going to talk — my — a new lecture, just... | https://www.ted.com/talks/al_seckel_says_our_b... |
| 1 | This is a representation of your brain, and yo... | https://www.ted.com/talks/aaron_o_connell_maki... |
| 2 | It's a great honor today to share with you The... | https://www.ted.com/talks/carter_emmart_demos_... |
| 3 | My passions are music, technology and making t... | https://www.ted.com/talks/jared_ficklin_new_wa... |
| 4 | It used to be that if you wanted to get a comp... | https://www.ted.com/talks/jeremy_howard_the_wo... |


In [3]:
# Grabbing the transcript feature
ted = df['transcript']

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)

(500, 29158)


You now know how to generate tf-idf vectors for a given corpus of text. You can use these vectors to perform predictive modeling just like we did with `CountVectorizer`. In the next few lessons, we will see another extremely useful application of the vectorized form of documents: generating recommendations.

## Cosine similarity
  
We now know how to compute vectors out of text documents. With this representation in mind, let us explore techniques that allow us to determine how similar two vectors, and consequently two documents, are to each other. More specifically, we will learn about the cosine similarity score, which is one of the most popular similarity metrics in NLP.
  
**Mathematical formula**
  
Very simply put, the cosine similarity score of two vectors is the cosine of the angle between the vectors. Mathematically, it is the ratio of the dot product of the vectors and the product of the magnitude of the two vectors. Let's walk through what this formula really means.
  
$formula.$
  
$\Large Cosine(x,y) = \frac{x \cdot y}{\Vert x \Vert \, \Vert y \Vert}$
  
<img src='../_images/cosine-distance-simularity-score.png' alt='img' width='740'>
  
**The dot product**
  
The dot product is computed by summing the products of values across corresponding dimensions of the vectors. Say we have two $n$-dimensional vectors $V$ and $W$. Then the dot product is $v_1$ times $w_1$ plus $v_2$ times $w_2$ and so on until $v_n$ times $w_n$. As an example, consider two vectors $A$ and $B$. By applying the formula, we see that the dot product comes to 37.
  
Consider two vectors,  
$\Large V = (v_1, v_2, \dots, v_n), W = (w_1, w_2, \dots, w_n)$
  
Then the dot product of $V$ and $W$ is,  
$\Large V \cdot W = (v_1 \times w_1) + (v_2 \times w_2) + \dots + (v_n \times w_n)$
  
$example.$
  
$A = (4, 7, 1)$  
$B = (5, 2, 3)$  
  
$A \cdot B = (4 \times 5) + (7 \times 2) + (1 \times 3)$  
$A \cdot B = 20 + 14 + 3 = 37$  
  
<img src='../_images/cosine-distance-simularity-score1.png' alt='img' width='400'>
  
**Magnitude of a vector**
  
The magnitude of a vector is essentially the length of the vector. Mathematically, it is defined as the square root of the sum of the squares of the values across all the dimensions of the vector. Therefore, for an $n$-dimensional vector $V$, the magnitude $\Vert V \Vert$ is computed as the square root of $v_1$ squared plus $v_2$ squared and so on until $v_n$ squared. Consider the vector $A$ from before. Using the above formula, we compute its magnitude to be $\sqrt{66}$, or roughly 8.12.
  
For any vector,  
$\Large V = (v_1, v_2, \dots, v_n)$
  
The magnitude is defined as,  
$\Large \Vert V \Vert = \sqrt{(v_1)^2 + (v_2)^2 + \dots + (v_n)^2}$
  
$example.$
  
$A = (4, 7, 1)$  
$B = (5, 2, 3)$  
  
$\Large \Vert A \Vert = \sqrt{(4)^2 + (7)^2 + (1)^2}$  
$\Large \Vert A \Vert = \sqrt{16 + 49 + 1} = \sqrt{66} \approx 8.124$  
  
$\Large \Vert B \Vert = \sqrt{(5)^2 + (2)^2 + (3)^2}$  
$\Large \Vert B \Vert = \sqrt{25 + 4 + 9} = \sqrt{38} \approx 6.164$  
  
<img src='../_images/cosine-distance-simularity-score2.png' alt='img' width='400'>
  
**The cosine score**
  
We are now in a position to compute the cosine similarity score of $A$ and $B$. It is the dot product, which is 37, divided by the product of the magnitudes of $A$ and $B$, which are $\sqrt{66}$ and $\sqrt{38}$ respectively. The value comes out to approximately 0.7388, which is the cosine of the angle $\theta$ between the two vectors, $\cos(\theta)$.
  
$example.$
  
$A = (4, 7, 1)$  
$B = (5, 2, 3)$  
  
$\Large Cosine(A,B) = \frac{A \cdot B}{\Vert A \Vert \, \Vert B \Vert}$  
  
$\Large Cosine(A,B) = \frac{37}{\sqrt{66} \times \sqrt{38}} \approx 0.7388$  
  
<img src='../_images/cosine-distance-simularity-score3.png' alt='img' width='390'>  
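
The three steps (dot product, magnitudes, ratio) can be reproduced with `numpy`. This is a minimal sketch; the helper name `cosine_score` is an assumption made for illustration and is not a library function.

```python
import numpy as np


def cosine_score(x, y):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


A = np.array([4, 7, 1])
B = np.array([5, 2, 3])

print(np.dot(A, B))                  # 37
score = cosine_score(A, B)
print(round(score, 4))               # 0.7388
```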
  
**Cosine Score: points to remember**
  
Since the cosine score is simply the cosine of the angle between two vectors, its value is bounded between -1 and 1. However, in NLP, document vectors almost always use non-negative weights. Therefore, cosine scores vary between 0 and 1, where 0 indicates no similarity (no shared terms) and 1 indicates that the two vectors point in the same direction, as they do for identical documents. Finally, since the cosine score ignores the magnitude of the vectors, it is fairly robust to document length. This may be an advantage or a disadvantage depending on the use case.
  
- Value between -1 and 1  
- In NLP, value between 0 (no similarity) and 1 (same direction)  
- Robust to document length  
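
The robustness to document length can be illustrated with made-up count vectors: repeating a document ten times multiplies every count by 10 but leaves the direction of the vector, and therefore the cosine score, unchanged. The vectors and the helper `cosine_score` below are assumptions for illustration only.

```python
import numpy as np


def cosine_score(x, y):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


short_doc = np.array([2, 1, 0, 3])   # made-up term counts
long_doc = 10 * short_doc            # the same text repeated ten times
other_doc = np.array([0, 1, 4, 0])   # a different made-up document

# The long and the short version are indistinguishable by cosine score
print(np.isclose(cosine_score(short_doc, long_doc), 1.0))
print(np.isclose(cosine_score(short_doc, other_doc),
                 cosine_score(long_doc, other_doc)))
```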
  
**Implementation using scikit-learn**
  
Scikit-learn offers a `cosine_similarity()` function that outputs a similarity matrix containing the pairwise cosine scores for a set of vectors. You can import `cosine_similarity` from `sklearn.metrics.pairwise`. However, remember that `cosine_similarity()` takes 2-D arrays as arguments. Passing in 1-D arrays will throw an error. Let us compute the cosine similarity score of vectors $A$ and $B$ from before. We get the same answer as before, approximately 0.7388.
  
<img src='../_images/cosine-distance-simularity-score4.png' alt='img' width='740'>  
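
A text version of this computation might look like the sketch below. It is not necessarily the exact code in the screenshot; reshaping each vector to a single-row 2-D array with `reshape(1, -1)` is one common way to satisfy the 2-D requirement.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([4, 7, 1])
B = np.array([5, 2, 3])

# Each vector becomes a 2-D array with a single row
score_matrix = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))

print(score_matrix.shape)             # (1, 1)
print(round(score_matrix[0, 0], 4))   # 0.7388
```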

### Range of cosine scores
  
Which of the following is a possible cosine score for a pair of document vectors?
  
Possible Answers
  
- [x] 0.86
- [ ] -0.52
- [ ] 2.36
- [ ] -1.32
  
Great job! Since document vectors use only non-negative weights, the cosine score lies between 0 and 1.

### Computing dot product
  
In this exercise, we will learn to compute the dot product between two vectors, `A` = (1, 3) and `B` = (-2, 2), using the `numpy` library. More specifically, we will use the `np.dot()` function to compute the dot product of two `numpy` arrays.
  
1. Initialize `A` (1,3) and `B` (-2,2) as `numpy` arrays using `np.array()`.
2. Compute the dot product using `np.dot()` and passing `A` and `B` as arguments.

In [5]:
# Initialize numpy vectors
A = np.array([1, 3])
B = np.array([-2, 2])

# Compute dot product
dot_prod = np.dot(A, B)

# Print dot product
print(dot_prod)

4


Good job! The dot product of the two vectors is 1 * -2 + 3 * 2 = 4, which is indeed the output produced. We will not be using `np.dot()` too much in this course but it can prove to be a helpful function while computing dot products between two standalone vectors.

### Cosine similarity matrix of a corpus
  
In this exercise, you have been given a `corpus`, which is a list containing five sentences. The `corpus` is printed in the console. You have to compute the cosine similarity matrix which contains the pairwise cosine similarity score for every pair of sentences (vectorized using tf-idf).
  
```python
corpus:
 ['The sun is the largest celestial body in the solar system', 
 'The solar system consists of the sun and eight revolving planets', 
 'Ra was the Egyptian Sun God', 
 'The Pyramids were the pinnacle of Egyptian architecture', 
 'The quick brown fox jumps over the lazy dog']
```
  
Remember, the value corresponding to the $i$-th row and $j$-th column of a similarity matrix denotes the similarity score for the $i$-th and $j$-th vector.
  
1. Initialize an instance of `TfidfVectorizer`. Name it `tfidf_vectorizer`.
2. Using `.fit_transform()`, generate the tf-idf vectors for `corpus`. Name it `tfidf_matrix`.
3. Use `cosine_similarity()` and pass `tfidf_matrix` to compute the cosine similarity matrix `cosine_sim`.

In [6]:
corpus = ['The sun is the largest celestial body in the solar system', 
          'The solar system consists of the sun and eight revolving planets', 
          'Ra was the Egyptian Sun God', 
          'The Pyramids were the pinnacle of Egyptian architecture', 
          'The quick brown fox jumps over the lazy dog']

In [7]:
from sklearn.metrics.pairwise import cosine_similarity


# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus, datatype is a csr_matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]


As you will see in a subsequent lesson, computing the cosine similarity matrix lies at the heart of many practical systems such as recommenders. From our similarity matrix, we see that the first and second sentences are the most similar. Also, the fifth sentence has, on average, the lowest pairwise cosine scores. This is intuitive, as it contains entities that are not present in the other sentences.
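
Both observations can be read off the matrix programmatically. The sketch below hard-codes the values printed above (so it runs on its own) and uses plain `numpy`; the variable names are chosen only for this illustration.

```python
import numpy as np

# Similarity matrix copied from the output above
sim = np.array([
    [1.0,        0.36413198, 0.18314713, 0.18435251, 0.16336438],
    [0.36413198, 1.0,        0.15054075, 0.21704584, 0.11203887],
    [0.18314713, 0.15054075, 1.0,        0.21318602, 0.07763512],
    [0.18435251, 0.21704584, 0.21318602, 1.0,        0.12960089],
    [0.16336438, 0.11203887, 0.07763512, 0.12960089, 1.0],
])

# Mask the diagonal so a sentence is not matched with itself
off_diag = sim.copy()
np.fill_diagonal(off_diag, -np.inf)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(i, j)  # 0 1  (first and second sentence)

# Mean score of each sentence against the other four
mean_scores = (sim.sum(axis=1) - np.diag(sim)) / (sim.shape[0] - 1)
print(np.argmin(mean_scores))  # 4  (fifth sentence)
```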

## Building a plot line based recommender
  
In this lesson, we will use tf-idf vectors and cosine scores to build a recommender system that suggests movies based on overviews.
  
**Movie recommender**
  
We have a dataset containing movie overviews. Here, we can see two movies, Shanghai Triad and Cry, the Beloved Country, and their overviews.
  
<img src='../_images/building-a-plot-line-based-recommender.png' alt='img' width='740'>
  
**Movie recommender**
  
Our task is to build a system that takes in a movie title and outputs a list of movies that have similar plot lines. For instance, if we passed in 'The Godfather', we could expect output like this. Notice how a lot of the movies listed here have to do with crime and gangsters, just like The Godfather.
  
<img src='../_images/building-a-plot-line-based-recommender1.png' alt='img' width='740'>
  
**Steps**
  
Following are the steps involved. The first step, as always, is to preprocess movie overviews. The next step is to generate the tf-idf vectors for our overviews. Finally, we generate a cosine similarity matrix which contains the pairwise similarity scores of every movie with every other movie. Once the cosine similarity matrix is computed, we can proceed to build the recommender function.
  
1. Text preprocessing  
2. Generate tf-idf vectors  
3. Generate cosine similarity matrix  
4. Recommender function  
  
**The recommender function**
  
We will build a recommender function as part of this course. Let's take a look at how it works. The recommender function takes a movie title, the cosine similarity matrix and an indices series as arguments. The indices series is a reverse mapping of movie titles with their indices in the original dataframe. The function extracts the pairwise cosine similarity scores of the movie passed in with every other movie. Next, it sorts these scores in descending order. Finally, it outputs the titles of movies corresponding to the highest similarity scores. Note that the function ignores the highest similarity score of 1. This is because the movie most similar to a given movie is the movie itself!
  
1. Function arguments: takes a movie title, cosine similarity matrix, and indices series  
2. Extract pair-wise cosine similarity scores for the movie  
3. Sort the scores in descending order  
4. Output titles corresponding to highest similarity scores  
5. Ignore the highest similarity score of 1 (score < 1)  
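
The steps listed above can be sketched on a toy example. Everything here is made up for illustration: the titles, the similarity values, and the helper name `top_similar`. The exercises later build the actual function on top of a pandas Series.

```python
import numpy as np

titles = ["Movie A", "Movie B", "Movie C", "Movie D"]  # hypothetical titles
toy_sim = np.array([                                    # hypothetical scores
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
])


def top_similar(title, titles, sim, k=2):
    """Return the k titles with the highest similarity to `title`."""
    idx = titles.index(title)                     # position of the query movie
    order = np.argsort(sim[idx])[::-1]            # positions sorted by score, descending
    order = [i for i in order if i != idx][:k]    # skip the movie itself
    return [titles[i] for i in order]


print(top_similar("Movie A", titles, toy_sim))  # ['Movie C', 'Movie B']
```

Filtering out the query position explicitly, rather than dropping the first sorted entry, is a design choice made in this sketch so that ties with a score of 1 do not remove the wrong movie.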
  
**Generating tf-idf vectors**
  
Let's say we already have the preprocessed movie overviews as `movie_plots`. We already know how to generate the tf-idf vectors.
  
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Create Vectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(movie_plots)
```
  
**Generating cosine similarity matrix**
  
Generating the cosine similarity matrix is also extremely simple. We simply pass in `tfidf_matrix` as both the first and second argument of `cosine_similarity`. This generates a matrix that contains the pairwise similarity score of every movie with every other movie. The value corresponding to the $i$-th row and the $j$-th column is the cosine similarity score of movie $i$ with movie $j$. Notice that the diagonal elements of this matrix are 1. This is because, as stated earlier, the cosine similarity score of movie $k$ with itself is 1.
  
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create Vectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(movie_plots)

# Generate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
```
  
**The `linear_kernel` function**
  
With the default settings of `TfidfVectorizer` (`norm='l2'`), every tf-idf vector it produces is scaled to a magnitude of 1. Recall from the previous lesson that the cosine score is the ratio of the dot product to the product of the magnitudes of the vectors. Since each magnitude is 1, the cosine score of two such tf-idf vectors is equal to their dot product! This lets us speed up the computation of the cosine similarity matrix, as we do not need to compute the magnitudes while working with these vectors. Therefore, while working with tf-idf vectors, we can use the `linear_kernel` function, which computes the pairwise dot product of every vector with every other vector.
  
- Magnitude of a tf-idf vector from `TfidfVectorizer` is 1 by default  
- Cosine score between two such tf-idf vectors is their dot product  
- Together, these facts let us skip the magnitude computation and save time  
- Use the `linear_kernel` function in place of `cosine_similarity` to take advantage of this  
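
The reasoning can be checked numerically with `numpy` alone. In this sketch, random non-negative vectors stand in for document vectors (an assumption for illustration). Once every row is scaled to unit length, the matrix of plain dot products matches the matrix of cosine scores.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.random((4, 6))   # four made-up "document" vectors

# Scale every row to unit length (magnitude 1)
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Pairwise dot products of the unit vectors
dot_products = unit @ unit.T

# Pairwise cosine scores of the original vectors, computed the long way
norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / np.outer(norms, norms)

print(np.allclose(dot_products, cosine))  # True
```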
  
**Generating cosine similarity matrix**
  
Let us replace the `cosine_similarity` function with `linear_kernel`. As you can see, the output remains the same but takes significantly less time to compute.
  
<img src='../_images/building-a-plot-line-based-recommender2.png' alt='img' width='740'>
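
The equivalence depends on the vectors having unit length. The sketch below uses three made-up sentences to show both sides: with the default `TfidfVectorizer` the two functions agree, while with `norm=None` (no length normalization) they do not.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up sample corpus (an assumption for illustration)
docs = [
    "a gangster family fights for control of the city",
    "a detective hunts a gangster across the city",
    "two friends travel across the country by train",
]

unit_vectors = TfidfVectorizer().fit_transform(docs)           # default norm='l2'
raw_vectors = TfidfVectorizer(norm=None).fit_transform(docs)   # no normalization

# Unit-length vectors: dot product and cosine score coincide
print(np.allclose(cosine_similarity(unit_vectors), linear_kernel(unit_vectors)))

# Unnormalized vectors: they no longer coincide
print(np.allclose(cosine_similarity(raw_vectors), linear_kernel(raw_vectors)))
```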
  
**The get_recommendations function**
  
The recommender function and the indices series described earlier will be built in the exercises. You can use this function to generate recommendations using the cosine similarity matrix.
  
<img src='../_images/building-a-plot-line-based-recommender3.png' alt='img' width='740'>
  
**Let's practice!**
  
In the exercises, you will build recommendation systems of your own and see them in action. Let's practice!

### Comparing `linear_kernel` and `cosine_similarity`
  
In this exercise, you have been given `tfidf_matrix` which contains the tf-idf vectors of a thousand documents. Your task is to generate the cosine similarity matrix for these vectors first using `cosine_similarity` and then, using `linear_kernel`.
  
We will then compare the computation times for both functions.
  
1. Compute the cosine similarity matrix for `tfidf_matrix` using `cosine_similarity`.
2. Compute the cosine similarity matrix for `tfidf_matrix` using `linear_kernel`.

In [8]:
import time


# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: {} seconds".format(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.006113767623901367 seconds


In [9]:
from sklearn.metrics.pairwise import linear_kernel


# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: {} seconds".format(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.003468036651611328 seconds


Notice how both `linear_kernel` and `cosine_similarity` produced the same result. However, `linear_kernel` took a smaller amount of time to execute. When you're working with a very large amount of data and your vectors are in the tf-idf representation, it is good practice to default to `linear_kernel` to improve performance. 
  
*NOTE: If you see `linear_kernel` taking more time, it's because the dataset we're dealing with is extremely small and a single `time.time()` measurement is too coarse and noisy to capture such small time differences reliably.*
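
For a steadier comparison, one option is the standard library's `timeit` module, which repeats each call several times. The sketch below is an assumption-laden illustration: it uses 1000 random unit-length vectors in place of real tf-idf vectors, and the measured times depend entirely on the machine, so no particular result is claimed.

```python
import timeit

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.preprocessing import normalize

# 1000 made-up vectors, each scaled to unit L2 norm like default tf-idf vectors
rng = np.random.default_rng(42)
vectors = normalize(rng.random((1000, 300)))

# Best of 3 rounds, 5 calls per round
t_cos = min(timeit.repeat(lambda: cosine_similarity(vectors), number=5, repeat=3))
t_lin = min(timeit.repeat(lambda: linear_kernel(vectors), number=5, repeat=3))

print(f"cosine_similarity: {t_cos:.4f} s for 5 calls")
print(f"linear_kernel:     {t_lin:.4f} s for 5 calls")
```

Taking the minimum over several rounds reduces the influence of background activity on the measurement.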

### The recommender function
  
In this exercise, we will build a recommender function `get_recommendations()`, as discussed in the lesson and the previous exercise. As we know, it takes in a title, a cosine similarity matrix, and a movie title and index mapping as arguments and outputs a list of 10 titles most similar to the original title (excluding the title itself).
  
You have been given a dataset `metadata` that consists of the movie titles and overviews. The head of this dataset has been printed to the console.
  
1. Get the index of the movie that matches the title by using `title` as the key of `indices`.
2. Extract the ten most similar movies from `sim_scores` and store it back in `sim_scores`.

In [19]:
metadata = pd.read_csv('../_datasets/movie_metadata.csv').dropna()
print(metadata.shape)
metadata.head(10)

(820, 5)


| Unnamed: 0.1 | Unnamed: 0 | id | title | overview | tagline |
|---|---|---|---|---|---|
| 0 | 0 | 49026 | The Dark Knight Rises | Following the death of District Attorney Harve... | The Legend Ends |
| 1 | 1 | 414 | Batman Forever | The Dark Knight of Gotham City confronts a das... | Courage now, truth always... |
| 2 | 2 | 268 | Batman | The Dark Knight of Gotham City begins his war ... | Have you ever danced with the devil in the pal... |
| 3 | 3 | 364 | Batman Returns | Having defeated the Joker, Batman now faces th... | The Bat, the Cat, the Penguin. |
| 4 | 4 | 415 | Batman & Robin | Along with crime-fighting partner Robin and ne... | Strength. Courage. Honor. And loyalty. |
| 5 | 5 | 14919 | Batman: Mask of the Phantasm | An old flame of Bruce Wayne's strolls into tow... | The Dark Knight fights to save Gotham city fro... |
| 6 | 6 | 2661 | Batman | The Dynamic Duo faces four super-villains who ... | He's Here Big As Life In A Real Bat-Epic |
| 7 | 7 | 272 | Batman Begins | Driven by tragedy, billionaire Bruce Wayne ded... | Evil fears the knight. |
| 8 | 8 | 40662 | Batman: Under the Red Hood | Batman faces his ultimate challenge as the mys... | Dare to Look Beneath the Hood. |
| 9 | 9 | 69735 | Batman: Year One | Two men come to Gotham City: Bruce Wayne after... | A merciless crime turns a man into an outlaw. |


In [16]:
# Generate mapping between titles and index
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()
print('The Series has the title as the index, and the feature as the titles original index')
print(indices)

# Return the 10 titles most similar to a movie, given its title, the cosine similarity matrix, and the title-to-index mapping
def get_recommendations(title, cosine_sim, indices):
    # Get the index label of the movie that matches the title
    idx = indices[title]
    # dropna() left gaps in the index labels, so convert the label
    # to the row position that cosine_sim and .iloc expect
    pos = metadata.index.get_loc(idx)
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[pos]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for the 10 most similar movies (skip the movie itself)
    sim_scores = sim_scores[1:11]
    # Get the movie positions
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]



The Series has the title as the index, and the feature as the titles original index
title
The Dark Knight Rises       0
Batman Forever              1
Batman                      2
Batman Returns              3
Batman & Robin              4
                         ... 
Braindead                1002
Glory                    1003
Manhattan                1005
Miller's Crossing        1006
Dead Poets Society       1007
Length: 820, dtype: int64


With this recommender function in our toolkit, we are now in a very good place to build the rest of the components of our recommendation engine.
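
One detail worth checking when building the `indices` mapping: after `dropna()`, the index labels of a DataFrame can have gaps, while the rows of a similarity matrix are numbered by position. The toy sketch below (made-up titles and overviews, hypothetical variable names) shows one way to build a title-to-position mapping and keep only the first entry for a repeated title.

```python
import pandas as pd

# Made-up data: one missing overview and one repeated title
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Alpha"],
    "overview": ["first plot", None, "third plot", "fourth plot"],
}).dropna()

print(list(toy.index))  # [0, 2, 3]  (label 1 was dropped)

# Map each title to its row position (0, 1, 2), not its index label
positions = pd.Series(range(len(toy)), index=toy["title"])

# Keep the first occurrence of a repeated title
positions = positions[~positions.index.duplicated(keep="first")]

print(positions["Gamma"])  # 1
```

Note that `Series.drop_duplicates()` removes repeated values, not repeated index entries, which is why the sketch filters on `index.duplicated()` instead.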

### Plot recommendation engine
  
In this exercise, we will build a recommendation engine that suggests movies based on the similarity of plot lines. You have been given a `get_recommendations()` function that takes in the title of a movie, a similarity matrix and an indices series as its arguments and outputs a list of the most similar movies. `indices` has already been provided to you.
  
You have also been given a `movie_plots` Series that contains the plot lines of several movies. Your task is to generate a cosine similarity matrix for the tf-idf vectors of these plots.
  
Then, we will check the potency of our engine by generating recommendations for one of my favorite movies, The Dark Knight Rises.
  
1. Initialize a `TfidfVectorizer` with English `stop_words`. Name it `tfidf`.
2. Construct `tfidf_matrix` by fitting and transforming the movie plot data using `.fit_transform()`.
3. Generate the cosine similarity matrix `cosine_sim` using `tfidf_matrix`. Don't use `cosine_similarity()`!
4. Use `get_recommendations()` to generate recommendations for `'The Dark Knight Rises'`.

In [17]:
# Extracting the target
movie_plots = metadata['overview']

In [18]:
# Initialize the TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Generate recommendations
print(get_recommendations("The Dark Knight Rises", cosine_sim, indices))

1                              Batman Forever
2                                      Batman
8                  Batman: Under the Red Hood
3                              Batman Returns
9                            Batman: Year One
10    Batman: The Dark Knight Returns, Part 1
11    Batman: The Dark Knight Returns, Part 2
5                Batman: Mask of the Phantasm
7                               Batman Begins
4                              Batman & Robin
Name: title, dtype: object


You've just built your very first recommendation system. Notice how the recommender correctly identifies `'The Dark Knight Rises'` as a Batman movie and recommends other Batman movies as a result. This system is, of course, very primitive and there are a host of ways in which it could be improved. One method would be to look at the cast, crew and genre in addition to the plot to generate recommendations. We will not be covering this in this course but you have all the tools necessary to accomplish this. Do give it a try!

### TED talk recommender
  
In this exercise, we will build a recommendation system that suggests TED Talks based on their transcripts. You have been given a `get_recommendations()` function that takes in the title of a talk, a similarity matrix and an indices series as its arguments, and outputs a list of the most similar talks. `indices` has already been provided to you.
  
You have also been given a `transcripts` Series that contains the transcripts of around 500 TED talks. Your task is to generate a cosine similarity matrix for the tf-idf vectors of the talk transcripts.
  
Then, we will generate recommendations for a talk titled '5 ways to kill your dreams' by Brazilian entrepreneur Bel Pesce.  
  
1. Initialize a `TfidfVectorizer` with `stop_words='english'`. Name it `tfidf`.
2. Construct `tfidf_matrix` by fitting and transforming transcripts.
3. Generate the cosine similarity matrix `cosine_sim` using `tfidf_matrix`.
4. Use `get_recommendations()` to generate recommendations for `'5 ways to kill your dreams'`.

In [38]:
ted = pd.read_csv('../_datasets/ted_clean.csv', index_col=0)
print(ted.columns)

# Dropping an un-used import-export error column
ted = ted.drop('Unnamed: 0.1', axis=1)

# Drop the name of the index, it was 'Unnamed: 0' before (what a "clean" dataset haha)
ted = ted.rename_axis(None)

print(ted.columns)
print(ted.shape)
ted.head()

Index(['Unnamed: 0.1', 'title', 'url', 'transcript'], dtype='object')
Index(['title', 'url', 'transcript'], dtype='object')
(499, 3)


| Unnamed: 0 | title | url | transcript |
|---|---|---|---|
| 0 | 10 top time-saving tech tips | https://www.ted.com/talks/david_pogue_10_top_t... | I've noticed something interesting about socie... |
| 1 | Who am I? Think again | https://www.ted.com/talks/hetain_patel_who_am_... | Hetain Patel: (In Chinese)Yuyu Rau: Hi, I'm He... |
| 2 | "Awoo" | https://www.ted.com/talks/sofi_tukker_awoo\n | (Music)Sophie Hawley-Weld: OK, you don't have ... |
| 3 | What I learned from 2,000 obituaries | https://www.ted.com/talks/lux_narayan_what_i_l... | Joseph Keller used to jog around the Stanford ... |
| 4 | Why giving away our wealth has been the most s... | https://www.ted.com/talks/bill_and_melinda_gat... | Chris Anderson: So, this is an interview with ... |


In [39]:
# Recommendation function for TED talks
def get_recommendations(title, cosine_sim, indices):
    # Get the index of the talk that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the talks based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for the 10 most similar talks (position 0 is the talk itself)
    sim_scores = sim_scores[1:11]
    # Get the talk indices
    talk_indices = [i[0] for i in sim_scores]
    # Return the titles of the 10 most similar talks
    return ted['title'].iloc[talk_indices]

In [40]:
# Generate mapping between titles and index
indices = pd.Series(ted.index, index=ted['title']).drop_duplicates()

# Extracting the text required to make the matrix
transcripts = ted['transcript']

In [41]:
# Initialize the TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(transcripts)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Generate recommendations
print(get_recommendations('5 ways to kill your dreams', cosine_sim, indices))

453             Success is a continuous journey
157                        Why we do what we do
494                   How to find work you love
149          My journey into movies that matter
447                        One Laptop per Child
230             How to get your ideas to spread
497         Plug into your hard-wired happiness
495    Why you will fail to have a great career
179             Be suspicious of simple stories
53                          To upgrade is human
Name: title, dtype: object


You have successfully built a TED talk recommender. This recommender works surprisingly well despite being trained only on a small subset of TED talks. In fact, three of the talks recommended by our system are also recommended by the official TED website as talks to watch next after `'5 ways to kill your dreams'`!

## Beyond n-grams: word embeddings
  
We have covered a lot of ground in the last 4 chapters. However, before we bid adieu, we will cover one advanced topic that has a large number of applications in NLP.
  
**The problem with BoW and tf-idf**
  
Consider the three sentences, 
  
"I am happy"   
"I am joyous"  
"I am sad"  
  
If we were to compute the similarities, "I am happy" and "I am joyous" would receive the same score as "I am happy" and "I am sad", whether we vectorize with BoW or tf-idf. This is because 'happy', 'joyous' and 'sad' are treated as completely unrelated words. However, we know that 'happy' and 'joyous' are more similar to each other than either is to 'sad'. This is something that the vectorization techniques we've covered so far simply cannot capture.
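
To make the problem concrete, here is a minimal sketch in plain Python, with no scikit-learn involved. The helper names `bow_vector` and `cosine_similarity` are invented for this illustration, and tokenization is a simple whitespace split. It builds bag-of-words count vectors for the three sentences and compares them.

```python
import math
from collections import Counter


def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sentences = ["I am happy", "I am joyous", "I am sad"]

# Shared vocabulary across all three sentences, in a fixed order
vocabulary = sorted({word for s in sentences for word in s.lower().split()})
happy, joyous, sad = (bow_vector(s, vocabulary) for s in sentences)

sim_happy_joyous = cosine_similarity(happy, joyous)
sim_happy_sad = cosine_similarity(happy, sad)

# Both pairs share exactly "i" and "am", so the two scores are identical
print(sim_happy_joyous, sim_happy_sad)
```

Each pair of sentences overlaps only on "I" and "am", so the count vectors cannot tell a synonym from an antonym.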
  
**Word embeddings**
  
Word embedding is the process of mapping words into an $n$-dimensional vector space. These vectors are usually produced using deep learning models and huge amounts of data. The techniques used are beyond the scope of this course. However, once generated, these vectors can be used to discern how similar two words are to each other. Consequently, they can also be used to detect synonyms and antonyms. Word embeddings are also capable of capturing complex relationships. For instance, they can be used to detect that the words "king" and "queen" relate to each other the same way as "man" and "woman", or that "France" and "Paris" are related in the same way as "Russia" and "Moscow". One last thing to note is that word embeddings are not trained on your data; they depend on the pre-trained `spacy` model you're using and are independent of the size of your dataset.
  
- Mapping words into an $n$-dimensional vector space  
- Produced using deep learning and huge amounts of data  
- Discern how similar two words are to each other  
- Used to detect synonyms and antonyms  
- Captures complex relationships (e.g. King, Queen -> Man, Woman)  
- Dependent on spacy model; independent of dataset you use  
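
The king/queen relationship can be illustrated with vector arithmetic. The sketch below uses hand-made two-dimensional vectors that are invented purely for illustration; real embeddings have hundreds of dimensions and are learned from data, so treat `toy_vectors` and its values as assumptions, not model output.

```python
import math

# Hand-made 2-D "embeddings", invented for illustration only.
# Axis 0 loosely encodes royalty, axis 1 loosely encodes gender.
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# "king is to man as ? is to woman": compute king - man + woman
target = tuple(
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
)

# Pick the word whose vector points in the direction closest to the target
best_match = max(
    toy_vectors, key=lambda word: cosine_similarity(toy_vectors[word], target)
)
print(best_match)
```

With these toy values the arithmetic lands exactly on the vector for "queen". With real embeddings the result is only approximately aligned with the expected word.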
  
**Word embeddings using `spaCy`**
  
Generating word embeddings is easy using spaCy's pre-trained models. As usual, we load the `spacy` model and create the `doc` object for our string. Note that it is advisable to load larger `spacy` models while working with word vectors. This is because the `en_core_web_sm` model does not technically ship with word vectors, only context-sensitive tensors, which tend to give relatively poorer results. We generate word vectors for each word by looping through the tokens and accessing the `.vector` attribute. The truncated output is as shown.
  
```python
import spacy

# Load model and create Doc object
nlp = spacy.load('en_core_web_lg')
doc = nlp('<string>')

# Generate word vectors for each token
for token in doc:
    print(token.vector)

# Output: one NumPy ndarray per token (truncated here)
```
  
**Word similarities**
  
We can compute how similar two words are to each other by using the `.similarity()` method of a `spacy` token. Let's say we want to compute how similar "happy", "joyous" and "sad" are to each other. We define a `doc` containing the three words. We then use a nested loop to calculate the similarity scores between each pair of words. As expected, "happy" and "joyous" are more similar to each other than either is to "sad".
  
<img src='../_images/beyond-n-grams-word-embeddings.png' alt='img' width='740'>
  
**Document similarities**
  
spaCy also allows us to directly compute the similarity between two documents by using the average of the word vectors of all the words in a particular document. Let's consider the three sentences from before. We create `doc` objects for the sentences. Like `spacy` tokens, docs also have a `.similarity()` method. Therefore, we can compute the similarity between two docs as follows. As expected, "I am happy" is more similar to "I am joyous" than it is to "I am sad". Note that the similarity scores are high in both cases because all sentences share two of their three words, "I" and "am".
  
<img src='../_images/beyond-n-grams-word-embeddings1.png' alt='img' width='740'>
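
To see why shared words push document scores up, here is a small sketch of the averaging idea in plain Python. The two-dimensional vectors in `toy_vectors` are invented for illustration and are not spaCy output, and `doc_vector` is a hypothetical helper that only mimics the "average the word vectors, then take the cosine" approach described above.

```python
import math

# Invented 2-D word vectors, for illustration only (not spaCy output)
toy_vectors = {
    "i": (1.0, 0.0),
    "am": (1.0, 0.2),
    "happy": (0.2, 1.0),
    "joyous": (0.3, 0.9),
    "sad": (0.2, -1.0),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def doc_vector(sentence):
    """Average the toy vectors of all words in the sentence."""
    vectors = [toy_vectors[word] for word in sentence.lower().split()]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


happy_doc = doc_vector("I am happy")
joyous_doc = doc_vector("I am joyous")
sad_doc = doc_vector("I am sad")

doc_sim_joyous = cosine_similarity(happy_doc, joyous_doc)
doc_sim_sad = cosine_similarity(happy_doc, sad_doc)

# Word-level comparison of the only words that differ
word_sim_sad = cosine_similarity(toy_vectors["happy"], toy_vectors["sad"])

print(doc_sim_joyous, doc_sim_sad, word_sim_sad)
```

In this toy setup the words "happy" and "sad" point in nearly opposite directions, yet the two sentences still score as fairly similar because "I" and "am" dominate both averages.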
  
**Let's practice!**
  
With this, we come to the end of this lesson. Let's now practice our newfound skills in the last set of exercises.

> Note: Before using word embeddings through spaCy, you need to download the `en_core_web_lg` model  
> Terminal: `python3 -m spacy download en_core_web_lg`

### Generating word vectors
  
In this exercise, we will generate the pairwise similarity scores of all the words in a sentence. The sentence is available as `sent` and has been printed to the console for your convenience.
  
1. Create a `Doc` object `doc` for `sent`.
2. In the nested loop, compute the similarity between `token1` and `token2`.

In [42]:
!python3 -m spacy download en_core_web_lg
import spacy
nlp = spacy.load('en_core_web_lg')

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 704.0 kB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [43]:
sent = 'I like apples and orange'

# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
    for token2 in doc: 
        print(token1.text, token2.text, token1.similarity(token2))

I I 1.0
I like 0.3184410631656647
I apples 0.1975560337305069
I and -0.0979200005531311
I orange 0.06804359704256058
like I 0.3184410631656647
like like 1.0
like apples 0.29574331641197205
like and 0.24359610676765442
like orange 0.3216366171836853
apples I 0.1975560337305069
apples like 0.29574331641197205
apples apples 1.0
apples and 0.24472734332084656
apples orange 0.5736395120620728
and I -0.0979200005531311
and like 0.24359610676765442
and apples 0.24472734332084656
and and 1.0
and orange 0.2520448565483093
orange I 0.06804359704256058
orange like 0.3216366171836853
orange apples 0.5736395120620728
orange and 0.2520448565483093
orange orange 1.0


`apples orange 0.5736395120620728`  
`orange apples 0.5736395120620728`  
  
Notice how the words `'apples'` and `'orange'` have the highest pairwise similarity score of any two distinct words. This is expected as they are both fruits and are more related to each other than any other pair of words.

### Computing similarity of Pink Floyd songs
  
In this final exercise, you have been given lyrics of three songs by the British band Pink Floyd, namely 'High Hopes', 'Hey You' and 'Mother'. The lyrics to these songs are available as `hopes`, `hey` and `mother` respectively.
  
Your task is to compute the pairwise similarity between `mother` and `hopes`, and `mother` and `hey`.
  
1. Create `Doc` objects for `mother`, `hopes` and `hey`.
2. Compute the similarity between `mother` and `hopes`.
3. Compute the similarity between `mother` and `hey`.


In [56]:
with open('../_datasets/mother.txt', 'r') as f:
    mother = f.read()
    
with open('../_datasets/hopes.txt', 'r') as f:
    hopes = f.read()
    
with open('../_datasets/hey.txt', 'r') as f:
    hey = f.read()

print(mother[:250])     # Displaying the first 250 chars in the string
print(hopes[:250])      # Displaying the first 250 chars in the string
print(hey[:250])        # Displaying the first 250 chars in the string


Mother do you think they'll drop the bomb?
Mother do you think they'll like this song?
Mother do you think they'll try to break my balls?
Ooh, ah
Mother should I build the wall?
Mother should I run for President?
Mother should I trust the government

Beyond the horizon of the place we lived when we were young
In a world of magnets and miracles
Our thoughts strayed constantly and without boundary
The ringing of the division bell had begun
Along the Long Road and on down the Causeway
Do they still

Hey you, out there in the cold
Getting lonely, getting old
Can you feel me?
Hey you, standing in the aisles
With itchy feet and fading smiles
Can you feel me?
Hey you, don't help them to bury the light
Don't give in without a fight
Hey you out there


In [57]:
nlp = spacy.load('en_core_web_lg')

# Create Doc objects
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# Print similarity between mother and hopes
print(mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print(mother_doc.similarity(hey_doc))

0.5779929666352768
0.9465446706762218


Notice that 'Mother' and 'Hey You' have a similarity score of 0.947, whereas 'Mother' and 'High Hopes' have a score of only 0.578. This is probably because 'Mother' and 'Hey You' are both songs from the same album, 'The Wall', and were penned by Roger Waters. On the other hand, 'High Hopes' is part of the album 'The Division Bell', with lyrics by David Gilmour and his wife, Polly Samson. Treat yourself by listening to these songs. They're some of the best!