1\. Building tf-idf document vectors
------------------------------------

00:00 - 00:04

In the last chapter, we learned about n-gram modeling.

2\. n-gram modeling
-------------------

00:04 - 00:29

In n-gram modeling, the weight of each dimension in a document's vector representation depends on the number of times the word corresponding to that dimension occurs in the document. Let's say we have a document in which the word 'human' occurs 5 times. Then, the dimension of its vector representation corresponding to 'human' would have the value 5.

- Weight of dimension dependent on the frequency of the word corresponding to the dimension.
- Document contains the word *human* in five places.
- Dimension corresponding to *human* has weight 5.
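
The count-based weighting described above can be sketched with the standard library alone. This is only an illustration: the sample sentence and the whitespace tokenization are assumptions made for the example, not part of the course material.

```python
from collections import Counter

# Hypothetical document in which 'human' appears five times
document = "human rights are human values and every human treats a human as human"

# Naive tokenization: lowercase and split on whitespace (an assumption;
# real pipelines usually deal with punctuation as well)
counts = Counter(document.lower().split())

# The weight of the 'human' dimension is simply its count
human_weight = counts["human"]
print(human_weight)  # 5
```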


3\. Motivation
--------------

00:29 - 01:17

However, some words occur very commonly across all the documents in the corpus. As a result, the vector representations end up being characterized mostly by these dimensions. Consider a corpus of documents on the Universe. Let's say there is a particular document on Jupiter where the words 'jupiter' and 'universe' both occur about 20 times. However, 'jupiter' rarely figures in the other documents, whereas 'universe' is just as common in them. We could argue that although both *jupiter* and *universe* occur 20 times, *jupiter* should be given a larger weight on account of its exclusivity. In other words, the word 'jupiter' characterizes the document more than 'universe' does.

- Some words occur very commonly across all documents
- Corpus of documents on the universe:
  - One document has *jupiter* and *universe* occurring 20 times each.
  - *jupiter* rarely occurs in the other documents. *universe* is common.
  - Give more weight to *jupiter* on account of exclusivity.


4\. Applications
----------------

01:17 - 01:48

Weighting words this way has a huge number of applications. The weights can be used to automatically detect stopwords for the corpus instead of relying on a generic list. They're used in search algorithms to determine the ranking of pages containing the search query and in recommender systems, as we will soon find out. In a lot of cases, this kind of weighting also yields better performance in predictive modeling.

- Automatically detect stopwords
- Search
- Recommender systems
- Better performance in predictive modeling for some cases


5\. Term frequency-inverse document frequency
---------------------------------------------

01:48 - 02:09

The weighting mechanism we've described is known as term frequency-inverse document frequency or tf-idf for short. It is based on the idea that the weight of a term in a document should be proportional to its frequency and an inverse function of the number of documents in which it occurs.

- Proportional to term frequency
- Inverse function of the number of documents in which it occurs


6\. Mathematical formula
------------------------

02:09 - 02:16

Mathematically, the weight of a term i in document j is computed as

$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)


7\. Mathematical formula
------------------------

02:16 - 02:20

The first factor, \(tf_{i,j}\), is the term frequency of term i in document j.


$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
- \(tf_{i,j}\) → term frequency of term \(i\) in document \(j\)


8\. Mathematical formula
------------------------

02:20 - 02:32

It is multiplied by the log of the ratio of the number of documents in the corpus, N, to the number of documents in which term i occurs, \(df_i\).


$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
- \(tf_{i,j}\) → term frequency of term \(i\) in document \(j\)
- \(N\) → number of documents in the corpus
- \(df_i\) → number of documents containing term \(i\)



9\. Mathematical formula
------------------------

02:32 - 03:18

Therefore, let's say the word 'library' occurs in a document 5 times. There are 20 documents in the corpus and 'library' occurs in 8 of them. Then, the tf-idf weight of 'library' in the vector representation of this document will be 5 times the log of 20 by 8, which is approximately 2 when the logarithm is taken to base 10. In general, the higher the tf-idf weight, the more important the word is in characterizing the document. A high tf-idf weight for a word in a document may imply that the word is relatively exclusive to that particular document, that the word occurs extremely commonly in the document, or both.


$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
- \(tf_{i,j}\) → term frequency of term \(i\) in document \(j\)
- \(N\) → number of documents in the corpus
- \(df_i\) → number of documents containing term \(i\)

**Example:**

$$
w_{\text{library},\,\text{document}} = 5 \cdot \log_{10}\left(\frac{20}{8}\right) \approx 2
$$
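
The 'library' arithmetic can be checked with a few lines of Python. This is a minimal sketch of the plain formula only; the helper name `tfidf_weight` and its `log` parameter are made up for this illustration. It also shows that the result of about 2 corresponds to a base-10 logarithm, while a natural logarithm gives a different number.

```python
import math

def tfidf_weight(tf, n_docs, df, log=math.log10):
    """Plain tf-idf weight: tf * log(N / df). The log base is a parameter."""
    return tf * log(n_docs / df)

# 'library': 5 occurrences, 20 documents in the corpus, present in 8 of them
w_base10 = tfidf_weight(5, 20, 8)                 # about 1.99
w_natural = tfidf_weight(5, 20, 8, log=math.log)  # about 4.58

# A term that appears in every document gets weight 0, whatever its count
w_everywhere = tfidf_weight(5, 20, 20)

print(round(w_base10, 2), round(w_natural, 2), w_everywhere)
```

Whichever base is used, a term present in every document has N / df = 1, so its weight is zero.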


10\. tf-idf using scikit-learn
------------------------------

03:18 - 04:10

Generating vectors that use tf-idf weighting is almost identical to what we've done so far. Instead of using CountVectorizer, we use the TfidfVectorizer class of scikit-learn. Its parameters and methods are almost identical to those of CountVectorizer. The only difference is that TfidfVectorizer assigns weights based on the tf-idf formula from before and has extra parameters related to inverse document frequency which we will not cover in this course. With its default settings, scikit-learn uses a smoothed variant of the idf term and scales each document vector to unit length, so the values will not exactly match a hand calculation with the plain formula. Here, we can see how using TfidfVectorizer is almost identical to using CountVectorizer for a corpus. However, notice that the weights are non-integer and reflect values calculated by the tf-idf formula.

```python
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
```

```plaintext
[[0.         0.         0.         0.25434658 0.33443519 0.33443519
  0.25434658 0.         0.25434658 0.         0.76303975]
 [0.         0.46735098 0.         0.         0.46735098 0.
  0.         0.46735098 0.35543247 0.         0.        ]
...
```
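
To see how the weights relate to document frequency, the sketch below fits `TfidfVectorizer` on a tiny made-up corpus (the three sentences are assumptions for illustration). It relies on scikit-learn's default settings, which, as far as its documentation describes, use a smoothed idf of the form ln((1 + N) / (1 + df)) + 1 and L2-normalize each row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: 'universe' is in every document, 'jupiter' in one
corpus = [
    "the universe is vast",
    "jupiter is a planet in the universe",
    "the universe expands",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index; idf_ holds the learned idf values
vocab = vectorizer.vocabulary_
idf_universe = vectorizer.idf_[vocab["universe"]]
idf_jupiter = vectorizer.idf_[vocab["jupiter"]]

# With the default settings, every row has unit Euclidean length
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)

print(idf_universe, idf_jupiter)
print(row_norms)
```

The rarer term ends up with the larger idf, which is the behaviour the motivation section asks for.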

11\. Let's practice!
--------------------

04:10 - 04:14

That's enough theory for now. Let's practice!

tf-idf weight of commonly occurring words
=========================================

The word `bottle` occurs 5 times in a particular document `D` and also occurs in every document of the corpus. What is the tf-idf weight of `bottle` in `D`?

##### Answer the question

#### Possible Answers

Select one answer

[x] -   0

-   1

-   Not defined

-   5

tf-idf vectors for TED talks
============================

In this exercise, you have been given a corpus `ted` which contains the transcripts of 500 TED Talks. Your task is to generate the tf-idf vectors for these talks.

In a later lesson, we will use these vectors to generate recommendations of similar talks based on the transcript.

Instructions
------------

-   Import `TfidfVectorizer` from `sklearn`.
-   Create a `TfidfVectorizer` object. Name it `vectorizer`.
-   Generate `tfidf_matrix` for `ted` using the `fit_transform()` method.

```python
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)
```

1\. Cosine similarity
---------------------

00:00 - 00:25

We now know how to compute vectors out of text documents. With this representation in mind, let us now explore techniques that will allow us to determine how similar two vectors, and consequently two documents, are to each other. More specifically, we will learn about the cosine similarity score, which is one of the most widely used similarity metrics in NLP.

2\. Mathematical formula
------------------------

00:25 - 00:45

Very simply put, the cosine similarity score of two vectors is the cosine of the angle between the vectors. Mathematically, it is the ratio of the dot product of the vectors to the product of their magnitudes. Let's walk through what this formula really means.



## Cosine Similarity

\[
sim(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|}
\]


```plaintext
          y
          ^
          |
      10 -|        🐼 (A)
          |      /
          |     /
       5 -|    /θ
          |   /
          |  /_______ 🐯 (B)
          | /          
          |/____________________> x
            5    10    15
```

3\. The dot product
-------------------

00:45 - 01:21

The dot product is computed by summing the product of values across corresponding dimensions of the vectors. Let's say we have two n-dimensional vectors V and W as shown. Then, the dot product here would be v1 times w1 plus v2 times w2 and so on until vn times wn. As an example, consider two vectors A and B. By applying the formula above, we see that the dot product comes to 37.


Consider two vectors,

\[
V = (v_1, v_2, \dots, v_n), W = (w_1, w_2, \dots, w_n)
\]

Then the dot product of \( V \) and \( W \) is,

\[
V \cdot W = (v_1 \times w_1) + (v_2 \times w_2) + \dots + (v_n \times w_n)
\]

**Example:**

\[
A = (4, 7, 1), B = (5, 2, 3)
\]

\[
A \cdot B = (4 \times 5) + (7 \times 2) + (1 \times 3)
\]

\[
= 20 + 14 + 3 = 37
\]
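
The same calculation can be written in plain Python. The helper name `dot_product` is made up for this sketch.

```python
def dot_product(v, w):
    """Sum of the products of corresponding components."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(vi * wi for vi, wi in zip(v, w))

A = (4, 7, 1)
B = (5, 2, 3)

print(dot_product(A, B))  # 37
```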

*Figure: the vectors A (4, 7, 1) and B (5, 2, 3).*

4\. Magnitude of a vector
-------------------------

01:21 - 01:57

The magnitude of a vector is essentially the length of the vector. Mathematically, it is defined as the square root of the sum of the squares of values across all the dimensions of a vector. Therefore, for an n-dimensional vector V, the magnitude, mod V, is computed as the square root of v1 squared plus v2 squared and so on until vn squared. Consider the vector A from before. Using the above formula, we compute its magnitude to be root 66.


For any vector,

\[
V = (v_1, v_2, \dots, v_n)
\]

The magnitude is defined as,

\[
\|V\| = \sqrt{(v_1)^2 + (v_2)^2 + \dots + (v_n)^2}
\]

**Example:**

\[
A = (4, 7, 1), B = (5, 2, 3)
\]

\[
\|A\| = \sqrt{(4)^2 + (7)^2 + (1)^2}
\]

\[
= \sqrt{16 + 49 + 1} = \sqrt{66}
\]
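
As a quick check of the arithmetic, here is a small sketch that computes the magnitudes of both A and B; the helper name `magnitude` is an assumption of this example.

```python
import math

def magnitude(v):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(vi ** 2 for vi in v))

A = (4, 7, 1)
B = (5, 2, 3)

mag_A = magnitude(A)  # sqrt(66), about 8.124
mag_B = magnitude(B)  # sqrt(38), about 6.164

print(round(mag_A, 3), round(mag_B, 3))
```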

*Figure: vector A (4, 7, 1), with magnitude √66, shown alongside vector B (5, 2, 3).*

5\. The cosine score
--------------------

01:57 - 02:23

We are now in a position to compute the cosine similarity score of A and B. It is the dot product, which is 37, divided by the product of the magnitudes of A and B, which are root 66 and root 38 respectively. The value comes out to be approximately 0.739, which is the value of the cosine of the angle theta between the two vectors.


For vectors,

\[
A : (4, 7, 1), B : (5, 2, 3)
\]

The cosine score,

\[
sim(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|}
\]

\[
= \frac{37}{\sqrt{66} \times \sqrt{38}} \approx 0.7388
\]
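
Putting the dot product and the magnitudes together gives the score directly. This is a from-scratch sketch (the function name `cosine_score` is hypothetical); it assumes neither vector is all zeros, since that would mean dividing by zero.

```python
import math

def cosine_score(v, w):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi ** 2 for vi in v))
    norm_w = math.sqrt(sum(wi ** 2 for wi in w))
    return dot / (norm_v * norm_w)

score = cosine_score((4, 7, 1), (5, 2, 3))

# Recover the angle between the vectors from its cosine
theta_degrees = math.degrees(math.acos(score))

print(round(score, 4))  # 0.7388
```

A score of about 0.739 corresponds to an angle of a little over 42 degrees between A and B.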

*Figure: the angle θ between vectors A (4, 7, 1) and B (5, 2, 3).*

6\. Cosine Score: points to remember
------------------------------------

02:23 - 03:03

Since the cosine score is simply the cosine of the angle between two vectors, its value is bounded between -1 and 1. However, in NLP, document vectors almost always use non-negative weights. Therefore, cosine scores vary between 0 and 1, where 0 indicates no similarity and 1 indicates that the two document vectors point in exactly the same direction, as they do for identical documents. Finally, since the cosine score ignores the magnitude of the vectors, it is fairly robust to document length. This may be an advantage or a disadvantage depending on the use case.

- Value between -1 and 1.
- In NLP, value between 0 and 1.
- Robust to document length.
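
The points above can be illustrated with made-up count vectors. In this sketch, `long_doc` has the same mix of words as `short_doc` but is ten times longer; all vectors and names are assumptions for the example.

```python
import math

def cosine_score(v, w):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi ** 2 for vi in v))
    norm_w = math.sqrt(sum(wi ** 2 for wi in w))
    return dot / (norm_v * norm_w)

short_doc = (2, 1, 0)
long_doc = tuple(10 * x for x in short_doc)  # same proportions, ten times the counts
other_doc = (0, 1, 3)

same_direction = cosine_score(short_doc, long_doc)   # length does not matter
different_mix = cosine_score(short_doc, other_doc)   # partial overlap
no_overlap = cosine_score((1, 0, 0), (0, 2, 5))      # no shared words

print(same_direction, different_mix, no_overlap)
```

Scaling a vector leaves the score unchanged, and with non-negative weights every score stays between 0 and 1.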


7\. Implementation using scikit-learn
-------------------------------------

03:03 - 03:42

Scikit-learn offers a cosine_similarity function that outputs a similarity matrix containing the pairwise cosine scores for a set of vectors. You can import cosine_similarity from sklearn.metrics.pairwise. However, remember that cosine_similarity takes in 2-D arrays as arguments. Passing in 1-D arrays will throw an error. Let us compute the cosine similarity score of vectors A and B from before. We see that we get the same answer of approximately 0.739 as before.

```python
# Import the cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Define two 3-dimensional vectors A and B
A = (4, 7, 1)
B = (5, 2, 3)

# Compute the cosine score of A and B
score = cosine_similarity([A], [B])

# Print the cosine score
print(score)
```
```plaintext
[[0.73881883]]
```
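
The 2-D requirement is easy to trip over when the vectors are NumPy arrays. The sketch below shows the failure and one common fix, reshaping each vector into a single-row matrix with `reshape(1, -1)`; the variable `raised` exists only to record what happened.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([4, 7, 1])
B = np.array([5, 2, 3])

# 1-D input is rejected with a ValueError
try:
    cosine_similarity(A, B)
    raised = False
except ValueError:
    raised = True

# Reshape each vector into a 1 x 3 matrix (one sample, three features)
score = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))

print(raised, score.shape, score[0, 0])
```

Wrapping the vectors in a list, as in `cosine_similarity([A], [B])`, achieves the same thing.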

8\. Let's practice!
-------------------

03:42 - 03:46

That's enough theory for now. Let's practice!

Range of cosine scores
======================

Which of the following is a possible cosine score for a pair of document vectors?

##### Answer the question

#### Possible Answers

Select one answer

[x] -   0.86

-   -0.52

-   2.36

-   -1.32

Computing dot product
=====================

In this exercise, we will learn to compute the dot product between two vectors, A = (1, 3) and B = (-2, 2), using the `numpy` library. More specifically, we will use the `np.dot()` function to compute the dot product of two numpy arrays.

Instructions
------------

-   Initialize `A` (1,3) and `B` (-2,2) as `numpy` arrays using `np.array()`.
-   Compute the dot product using `np.dot()` and passing `A` and `B` as arguments.

```python
# Import numpy
import numpy as np

# Initialize numpy vectors
A = np.array([1, 3])
B = np.array([-2, 2])

# Compute dot product
dot_prod = np.dot(A, B)

# Print dot product
print(dot_prod)
```

Cosine similarity matrix of a corpus
====================================

In this exercise, you have been given a `corpus`, which is a list containing five sentences. The `corpus` is printed in the console. You have to compute the cosine similarity matrix which contains the pairwise cosine similarity score for every pair of sentences (vectorized using tf-idf). 

Remember, the value corresponding to the ith row and jth column of a similarity matrix denotes the similarity score for the ith and jth vector.

Instructions
------------

-   Initialize an instance of `TfidfVectorizer`. Name it `tfidf_vectorizer`.
-   Using `fit_transform()`, generate the tf-idf vectors for `corpus`. Name it `tfidf_matrix`.
-   Use `cosine_similarity()` and pass `tfidf_matrix` to compute the cosine similarity matrix `cosine_sim`.

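```python
# Import TfidfVectorizer and cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
```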

1\. Building a plot line based recommender
------------------------------------------

00:00 - 00:09

In this lesson, we will use tf-idf vectors and cosine scores to build a recommender system that suggests movies based on overviews.

2\. Movie recommender
---------------------

00:09 - 00:19

We have a dataset containing movie overviews. Here, we can see two movies, Shanghai Triad and Cry, the Beloved Country, along with their overviews.

| Title | Overview |
|-------|----------|
| Shanghai Triad | A provincial boy related to a Shanghai crime family is recruited by his uncle into cosmopolitan Shanghai in the 1930s to be a servant to a ganglord's mistress. |
| Cry, the Beloved Country | A South-African preacher goes to search for his wayward son who has committed a crime in the big city. |

3\. Movie recommender
---------------------

00:19 - 00:40

Our task is to build a system that takes in a movie title and outputs a list of movies that have similar plot lines. For instance, if we passed in 'The Godfather', we could expect output like this. Notice how a lot of the movies listed here have to do with crime and gangsters, just like The Godfather.

```python
get_recommendations("The Godfather")
```
```plaintext
1178     The Godfather: Part II
44030    The Godfather Trilogy: 1972–1990
1914     The Godfather: Part III
23126    Blood Ties
11297    Household Saints
34717    Start Liquidation
10821    Election
38030    Goodfellas
17729    Short Sharp Shock
26293    Beck 28 - Familjen
Name: title, dtype: object
```

4\. Steps
---------

00:40 - 01:09

The steps involved are as follows. The first step, as always, is to preprocess movie overviews. The next step is to generate the tf-idf vectors for our overviews. Finally, we generate a cosine similarity matrix which contains the pairwise similarity scores of every movie with every other movie. Once the cosine similarity matrix is computed, we can proceed to build the recommender function.

1. Text preprocessing
2. Generate tf-idf vectors
3. Generate cosine similarity matrix
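
The three steps can be sketched end to end on a toy example. Everything here is an assumption made for illustration: the three overviews are invented, and the preprocessing step is reduced to lowercasing and trimming, whereas a real pipeline would do more.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical movie overviews
overviews = [
    "A crime family fights rival gangsters.",
    "Gangsters and a crime boss rule the city.",
    "A lion cub grows up to become king.",
]

# Step 1: text preprocessing (placeholder: lowercase and trim only)
cleaned = [text.lower().strip() for text in overviews]

# Step 2: generate tf-idf vectors, dropping English stop words
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(cleaned)

# Step 3: generate the pairwise cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix)

print(cosine_sim.round(2))
```

The two crime plots share the terms 'crime' and 'gangsters', so they score higher with each other than either does with the lion plot, which shares no terms with them.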

5\. The recommender function
----------------------------

01:09 - 01:59

We will build a recommender function as part of this course. Let's take a look at how it works. The recommender function takes a movie title, the cosine similarity matrix and an indices series as arguments. The indices series is a reverse mapping of movie titles with their indices in the original dataframe. The function extracts the pairwise cosine similarity scores of the movie passed in with every other movie. Next, it sorts these scores in descending order. Finally, it outputs the titles of movies corresponding to the highest similarity scores. Note that the function ignores the highest similarity score of 1. This is because the movie most similar to a given movie is the movie itself!

1. Take a movie title, cosine similarity matrix and indices series as arguments.
2. Extract pairwise cosine similarity scores for the movie.
3. Sort the scores in descending order.
4. Output titles corresponding to the highest scores.
5. Ignore the highest similarity score (of 1).
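
The steps above can be sketched with NumPy on a hand-written similarity matrix. The titles, the scores, the dictionary standing in for the indices series, and the function name `recommend` are all hypothetical; the course's own function is built in the exercises.

```python
import numpy as np

# Hypothetical titles and a made-up, symmetric similarity matrix
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
indices = {title: i for i, title in enumerate(titles)}  # title -> row index
cosine_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.0],
    [0.3, 0.4, 0.0, 1.0],
])

def recommend(title, cosine_sim, indices, top_n=2):
    """Return the top_n titles most similar to the given title."""
    idx = indices[title]
    # Row of pairwise scores, sorted in descending order of similarity
    ranked = np.argsort(cosine_sim[idx])[::-1]
    # Drop the movie itself, which always has a score of 1
    ranked = [i for i in ranked if i != idx]
    return [titles[i] for i in ranked[:top_n]]

print(recommend("Movie A", cosine_sim, indices))  # ['Movie B', 'Movie D']
```

This sketch removes the query movie by index rather than by skipping the first sorted position, which also behaves sensibly if another movie happens to have a score of 1.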


6\. Generating tf-idf vectors
-----------------------------

01:59 - 02:10

Let's say we already have the preprocessed movie overviews as 'movie_plots'. We already know how to generate the tf-idf vectors.

```python
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(movie_plots)
```

7\. Generating cosine similarity matrix
---------------------------------------

02:10 - 02:51

Generating the cosine similarity matrix is also extremely simple. We simply pass in tfidf_matrix as both the first and second arguments of cosine_similarity. This generates a matrix that contains the pairwise similarity score of every movie with every other movie. The value corresponding to the ith row and the jth column is the cosine similarity score of movie i with movie j. Notice that the diagonal elements of this matrix are 1. This is because, as stated earlier, the cosine similarity score of movie k with itself is 1.

```python
# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Generate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
```
```plaintext
array([[1.        , 0.27435345, 0.23092036, ..., 0.00758112],
       [0.27435345, 1.        , 0.1246955 , ..., 0.00740494],
       ...,
       [0.00758112, 0.00740494, 0.        , ..., 1.        ]])
```

8\. The linear_kernel function
------------------------------

02:51 - 03:37

The magnitude of a tf-idf vector generated by TfidfVectorizer with its default settings is always 1, because scikit-learn scales each document vector to unit length. Recall from the previous lesson that the cosine score is computed as the ratio of the dot product and the product of the magnitudes of the vectors. Since the magnitudes are 1, the cosine score of two tf-idf vectors is equal to their dot product! This fact can help us greatly improve the speed of computation of our cosine similarity matrix as we do not need to compute the magnitudes while working with tf-idf vectors. Therefore, while working with tf-idf vectors, we can use the linear_kernel function which computes the pairwise dot product of every vector with every other vector.

- Magnitude of a tf-idf vector from `TfidfVectorizer` (default settings) is 1.
- Cosine score between two tf-idf vectors is their dot product.
- Can significantly improve computation time.
- Use `linear_kernel` instead of `cosine_similarity`.
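
The equivalence holds only when the vectors have unit length. The sketch below checks this on a made-up corpus: with the default settings the two functions agree, while with `norm=None` (normalization switched off) they do not. The corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical documents
docs = [
    "the universe is vast",
    "jupiter is a planet in the universe",
    "the universe expands",
]

# Default settings: rows are L2-normalized, so dot product equals cosine score
normalized = TfidfVectorizer().fit_transform(docs)
same_when_normalized = np.allclose(
    linear_kernel(normalized, normalized),
    cosine_similarity(normalized, normalized),
)

# Without normalization the dot products are no longer cosine scores
unnormalized = TfidfVectorizer(norm=None).fit_transform(docs)
same_when_unnormalized = np.allclose(
    linear_kernel(unnormalized, unnormalized),
    cosine_similarity(unnormalized, unnormalized),
)

print(same_when_normalized, same_when_unnormalized)  # True False
```

So `linear_kernel` is a safe shortcut only for vectors that are already unit length.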

9\. Generating cosine similarity matrix
---------------------------------------

03:37 - 03:48

Let us replace the cosine_similarity function with linear_kernel. As you can see, the output remains the same but it takes significantly less time to compute.

```python
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Generate cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
```
```plaintext
array([[1.        , 0.27435345, 0.23092036, ..., 0.00758112],
       [0.27435345, 1.        , 0.1246955 , ..., 0.00740494],
       ...,
       [0.00758112, 0.00740494, 0.        , ..., 1.        ]])
```

10\. The get_recommendations function
-------------------------------------

03:48 - 04:01

The recommender function and the indices series described earlier will be built in the exercises. You can use this function to generate recommendations using the cosine similarity matrix.

```python
get_recommendations('The Lion King', cosine_sim, indices)
```
```plaintext
7782     African Cats
5877     The Lion King 2: Simba's Pride
4524     Born Free
2719     The Bear
4770     Once Upon a Time in China III
7070     Crows Zero
739      The Wizard of Oz
8926     The Jungle Book
1749     Shadow of a Doubt
7993     October Baby
Name: title, dtype: object
```

11\. Let's practice!
--------------------

04:01 - 04:10

In the exercises, you will build recommendation systems of your own and see them in action. Let's practice!

Comparing linear_kernel and cosine_similarity
=============================================

In this exercise, you have been given `tfidf_matrix` which contains the tf-idf vectors of a thousand documents. Your task is to generate the cosine similarity matrix for these vectors, first using `cosine_similarity` and then using `linear_kernel`.

We will then compare the computation times for both functions.

Instructions 1/2
----------------

-   Compute the cosine similarity matrix for `tfidf_matrix` using `cosine_similarity`.

```python
# Import time and cosine_similarity
import time
from sklearn.metrics.pairwise import cosine_similarity

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" % (time.time() - start))
```


Instructions 2/2
----------------

-   Compute the cosine similarity matrix for `tfidf_matrix` using `linear_kernel`.

```python
# Import time and linear_kernel
import time
from sklearn.metrics.pairwise import linear_kernel

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" % (time.time() - start))
```

Plot recommendation engine
==========================

In this exercise, we will build a recommendation engine that suggests movies based on similarity of plot lines. You have been given a `get_recommendations()` function that takes in the title of a movie, a similarity matrix and an `indices` series as its arguments and outputs a list of most similar movies. `indices` has already been provided to you.

You have also been given a `movie_plots` Series that contains the plot lines of several movies. Your task is to generate a cosine similarity matrix for the tf-idf vectors of these plots.

We will then check the potency of our engine by generating recommendations for one of my favorite movies, The Dark Knight Rises.

Instructions
------------

-   Initialize a `TfidfVectorizer` with English `stop_words`. Name it `tfidf`.
-   Construct `tfidf_matrix` by fitting and transforming the movie plot data using `fit_transform()`.
-   Generate the cosine similarity matrix `cosine_sim` using `tfidf_matrix`. Don't use `cosine_similarity()`!
-   Use `get_recommendations()` to generate recommendations for `'The Dark Knight Rises'`.

```python
# Import TfidfVectorizer and linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Initialize the TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Generate recommendations
print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))
```


The recommender function
========================

In this exercise, we will build a recommender function `get_recommendations()`, as discussed in the lesson and the previous exercise. As we know, it takes in a title, a cosine similarity matrix, and a movie title and index mapping as arguments and outputs a list of 10 titles most similar to the original title (excluding the title itself).

You have been given a dataset `metadata` that consists of the movie titles and overviews. The head of this dataset has been printed to console.

Instructions
------------

-   Get index of the movie that matches the title by using the `title` key of `indices`.
-   Extract the ten most similar movies from `sim_scores` and store it back in `sim_scores`.

```python
import pandas as pd

# Reverse mapping from movie title to its index in metadata
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Pair each movie index with its similarity score to the given movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies by similarity score in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 10 most similar movies, skipping the first (the movie itself)
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]
```

TED talk recommender
====================

In this exercise, we will build a recommendation system that suggests TED Talks based on their transcripts. You have been given a `get_recommendations()` function that takes in the title of a talk, a similarity matrix and an `indices` series as its arguments, and outputs a list of most similar talks. `indices` has already been provided to you.

You have also been given a `transcripts` series that contains the transcripts of around 500 TED talks. Your task is to generate a cosine similarity matrix for the tf-idf vectors of the talk transcripts.

We will then generate recommendations for a talk titled '5 ways to kill your dreams' by Brazilian entrepreneur Bel Pesce.

Instructions
------------

-   Initialize a `TfidfVectorizer` with English stopwords. Name it `tfidf`.
-   Construct `tfidf_matrix` by fitting and transforming `transcripts`.
-   Generate the cosine similarity matrix `cosine_sim` using `tfidf_matrix`.
-   Use `get_recommendations()` to generate recommendations for '5 ways to kill your dreams'.

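```python
# Import TfidfVectorizer and linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Initialize the TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(transcripts)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Generate recommendations
print(get_recommendations('5 ways to kill your dreams', cosine_sim, indices))
```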

1\. Beyond n-grams: word embeddings
-----------------------------------

00:00 - 00:12

We have covered a lot of ground in the last 4 chapters. However, before we bid adieu, we will cover one advanced topic that has a large number of applications in NLP.

2\. The problem with BoW and tf-idf
-----------------------------------

00:12 - 00:50

Consider the three sentences, I am happy, I am joyous and I am sad. Now if we were to compute the similarities, I am happy and I am joyous would have the same score as I am happy and I am sad, regardless of how we vectorize them. This is because 'happy', 'joyous' and 'sad' are considered to be completely different words. However, we know that happy and joyous are more similar to each other than either is to sad. This is something that the vectorization techniques we've covered so far simply cannot capture.

```markdown
'I am happy'
'I am joyous'
'I am sad'
```
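
The problem is easy to reproduce with a bag-of-words model written from scratch. The sketch below uses naive whitespace tokenization and raw counts (both assumptions made for brevity); the helper name `bow_cosine` is made up.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine score between the bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(count ** 2 for count in a.values()))
    norm_b = math.sqrt(sum(count ** 2 for count in b.values()))
    return dot / (norm_a * norm_b)

happy_joyous = bow_cosine("I am happy", "I am joyous")
happy_sad = bow_cosine("I am happy", "I am sad")

print(happy_joyous, happy_sad)
```

Both pairs share exactly two of three words, so the two scores are identical (about 0.667) even though only one pair is close in meaning.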

3\. Word embeddings
-------------------

00:50 - 01:53

Word embedding is the process of mapping words into an n-dimensional vector space. These vectors are usually produced using deep learning models and huge amounts of data. The techniques used are beyond the scope of this course. However, once generated, these vectors can be used to discern how similar two words are to each other. Consequently, they can also be used to detect synonyms and antonyms. Word embeddings are also capable of capturing complex relationships. For instance, they can be used to detect that the words king and queen relate to each other the same way as man and woman, or that France and Paris are related in the same way as Russia and Moscow. One last thing to note is that word embeddings are not trained on user data; they are dependent on the pre-trained spaCy model you're using and are independent of the size of your dataset.

- Mapping words into an n-dimensional vector space
- Produced using deep learning and huge amounts of data
- Discern how similar two words are to each other
- Used to detect synonyms and antonyms
- Captures complex relationships
  - King - Queen → Man - Woman
  - France - Paris → Russia - Moscow
- Dependent on spaCy model; independent of dataset you use

4\. Word embeddings using spaCy
-------------------------------

01:53 - 02:34

Generating word embeddings is easy using spaCy's pre-trained models. As usual, we load the spaCy model and create the doc object for our string. Note that it is advisable to load larger spaCy models while working with word vectors. This is because the en_core_web_sm model does not technically ship with word vectors but with context-specific tensors, which tend to give relatively poorer results. We generate word vectors for each word by looping through the tokens and accessing the vector attribute. The truncated output is as shown.

```python
import spacy

# Load model and create Doc object
nlp = spacy.load('en_core_web_lg')
doc = nlp('I am happy')

# Generate word vectors for each token
for token in doc:
    print(token.vector)
```
```plaintext
[-1.0747459e+00  4.8677087e-02  5.6630421e+00  1.6680446e+00 -1.3194644e+00 -1.5142369e+00  1.1940931e+00 -3.0168812e+00 ...]
```

5\. Word similarities
---------------------

02:34 - 03:02

We can compute how similar two words are to each other by using the similarity method of a spaCy token. Let's say we want to compute how similar happy, joyous and sad are to each other. We define a doc containing the three words. We then use a nested loop to calculate the similarity scores between each pair of words. As expected, happy and joyous are more similar to each other than they are to sad.

```python
doc = nlp("happy joyous sad")
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
```
```plaintext
happy happy 1.0
happy joyous 0.63244456
happy sad 0.37338886
joyous happy 0.63244456
joyous joyous 1.0
joyous sad 0.5340932
...
```
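By default, the score returned by the similarity method is the cosine similarity of the two word vectors. The sketch below computes that quantity by hand. The vectors in `toy_vectors` are made up for illustration, so the scores will not match the spaCy output above; only the pattern is meant to carry over.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, not taken from any real model.
toy_vectors = {
    "happy":  [0.9, 0.8, 0.1],
    "joyous": [0.8, 0.9, 0.2],
    "sad":    [0.2, 0.1, 0.9],
}

# Same nested loop as before, applied to the toy vectors.
for word1, vec1 in toy_vectors.items():
    for word2, vec2 in toy_vectors.items():
        print(word1, word2, round(cosine_similarity(vec1, vec2), 3))
```

Because cosine similarity depends only on the angle between the vectors, a word compared with itself always scores 1.0, and the score is the same whichever word comes first.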

6\. Document similarities
-------------------------

03:02 - 03:45

spaCy also allows us to directly compute the similarity between two documents by using the average of the word vectors of all the words in a particular document. Let's consider the three sentences from before. We create doc objects for the sentences. Like spaCy tokens, docs also have a similarity method. Therefore, we can compute the similarity between two docs as follows. As expected, 'I am happy' is more similar to 'I am joyous' than it is to 'I am sad'. Note that the similarity scores are high in both cases because all sentences share two of their three words, 'I' and 'am'.

```python
# Generate doc objects
sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyous")

# Compute similarity between sent1 and sent2
sent1.similarity(sent2)
```
```plaintext
0.9273363837282105
```

```python
# Compute similarity between sent1 and sent3
sent1.similarity(sent3)
```
```plaintext
0.9403554938594568
```
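To see why the shared words push both scores up, we can sketch the averaging step by hand. The word vectors in `toy_vectors` and the helper `doc_vector` are hypothetical stand-ins; the helper splits on whitespace rather than using spaCy's tokenizer, and the numbers are not the ones a spaCy model would produce.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration.
toy_vectors = {
    "I":      [0.5, 0.5, 0.5],
    "am":     [0.4, 0.6, 0.5],
    "happy":  [0.9, 0.8, 0.1],
    "joyous": [0.8, 0.9, 0.2],
    "sad":    [0.2, 0.1, 0.9],
}

def doc_vector(text):
    """Average the toy vectors of the whitespace-separated words in text."""
    vectors = [toy_vectors[word] for word in text.split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

happy_doc = doc_vector("I am happy")
sad_doc = doc_vector("I am sad")
joyous_doc = doc_vector("I am joyous")

print(cosine_similarity(happy_doc, sad_doc))
print(cosine_similarity(happy_doc, joyous_doc))
```

Averaging pulls every sentence vector towards the vectors for 'I' and 'am', so even the happy and sad sentences end up far more similar than the words 'happy' and 'sad' are on their own.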

7\. Let's practice!
-------------------

03:45 - 03:55

With this, we come to the end of this lesson. Let's now practice our newfound skills in the last set of exercises.

Generating word vectors
=======================

In this exercise, we will generate the pairwise similarity scores of all the words in a sentence. The sentence is available as `sent` and has been printed to the console for your convenience.

Instructions
------------

-   Create a `Doc` object `doc` for `sent`.
-   In the nested loop, compute the similarity between `token1` and `token2`.

```python
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
```

Computing similarity of Pink Floyd songs
========================================

In this final exercise, you have been given lyrics of three songs by the British band Pink Floyd, namely 'High Hopes', 'Hey You' and 'Mother'. The lyrics to these songs are available as `hopes`, `hey` and `mother` respectively.

Your task is to compute the pairwise similarity between `mother` and `hopes`, and between `mother` and `hey`.

Instructions
------------

-   Create `Doc` objects for `mother`, `hopes` and `hey`.
-   Compute the similarity between `mother` and `hopes`.
-   Compute the similarity between `mother` and `hey`.

```python
# Create Doc objects
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# Print similarity between mother and hopes
print(mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print(mother_doc.similarity(hey_doc))
```

1\. Congratulations!
--------------------

00:00 - 00:04

Congratulations on making it to the end of the course!

2\. Review
----------

00:04 - 01:16

In this course, we learned about various feature engineering techniques for natural language processing in Python. We started off by computing basic features such as character length and word length of documents. We then moved on to readability scores and learned various metrics that could help us deduce the amount of education required to comprehend a piece of text fully. Next, we were introduced to the spaCy library and learned to perform tokenization and lemmatization. Building on these techniques, we proceeded to explore text cleaning. We also learned how to perform part-of-speech tagging and named entity recognition using spaCy models and had a sneak peek at their applications. The third chapter was dedicated to n-gram modeling. We also explored an application of it in sentiment analysis of movie reviews. The final chapter saw us covering tf-idf vectors and cosine similarity. Using these concepts, we built a movie and a TED Talk recommender. The final lesson gave you a sneak peek into word embeddings and their use cases.

3\. Further resources
---------------------

01:16 - 01:28

This is by no means the end of the road. Once you're done with this course, it is highly recommended that you take the following courses, also offered by DataCamp, to strengthen your skills further.

4\. Thank you!
--------------

01:28 - 01:37

We hope you have enjoyed taking this course as much as we did developing it. Thank you and all the best with your data science journey!