1\. Building tf-idf document vectors
------------------------------------

00:00 - 00:04

In the last chapter, we learned about n-gram modeling.

2\. n-gram modeling
-------------------

00:04 - 00:29

In n-gram modeling, the weight of a dimension for the vector representation of a document is dependent on the number of times the word corresponding to the dimension occurs in the document. Let's say we have a document that has the word 'human' occurring 5 times. Then, the dimension of its vector representation corresponding to 'human' would have the value 5.

- Weight of dimension dependent on the frequency of the word corresponding to the dimension.
- Document contains the word *human* in five places.
- Dimension corresponding to *human* has weight 5.
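
As a rough illustration of the counting idea above, here is a minimal sketch using only the standard library. The sample sentence and the whitespace tokenization are assumptions made for the example; they are not part of the course material.

```python
from collections import Counter

# Hypothetical document in which 'human' appears five times
document = "human rights are human values and every human helps a human stay human"

# Count how often each whitespace-separated token occurs
counts = Counter(document.lower().split())

# The sorted vocabulary fixes the order of the dimensions in the vector
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
print(counts["human"])  # 5
```

The dimension that corresponds to *human* holds the raw count, which is exactly the weighting that tf-idf will adjust.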


3\. Motivation
--------------

00:29 - 01:17

However, some words occur very commonly across all the documents in the corpus. As a result, the vector representations are dominated by these dimensions. Consider a corpus of documents on the Universe. Let's say there is a particular document on Jupiter where the words 'jupiter' and 'universe' both occur about 20 times. However, 'jupiter' rarely figures in the other documents, whereas 'universe' is just as common in them. We could argue that although both *jupiter* and *universe* occur 20 times, *jupiter* should be given a larger weight on account of its exclusivity. In other words, the word 'jupiter' characterizes the document more than 'universe'.

- Some words occur very commonly across all documents
- Corpus of documents on the universe:
  - One document has *jupiter* and *universe* occurring 20 times each.
  - *jupiter* rarely occurs in the other documents. *universe* is common.
  - Give more weight to *jupiter* on account of exclusivity.


4\. Applications
----------------

01:17 - 01:48

Weighting words this way has a huge number of applications. The weights can be used to automatically detect stopwords for the corpus instead of relying on a generic list. They're used in search algorithms to determine the ranking of pages containing the search query and in recommender systems, as we will soon find out. In a lot of cases, this kind of weighting also yields better performance during predictive modeling.

- Automatically detect stopwords
- Search
- Recommender systems
- Better performance in predictive modeling for some cases


5\. Term frequency-inverse document frequency
---------------------------------------------

01:48 - 02:09

The weighting mechanism we've described is known as term frequency-inverse document frequency or tf-idf for short. It is based on the idea that the weight of a term in a document should be proportional to its frequency and an inverse function of the number of documents in which it occurs.

- Proportional to term frequency
- Inverse function of the number of documents in which it occurs


6\. Mathematical formula
------------------------

02:09 - 02:16

Mathematically, the weight of a term i in document j is computed as

```markdown
$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
```

7\. Mathematical formula
------------------------

02:16 - 02:20

the term frequency of the term i in document j

```markdown
$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
- \(tf_{i,j}\) → term frequency of term \(i\) in document \(j\)
```

8\. Mathematical formula
------------------------

02:20 - 02:32

multiplied by the log of the ratio of the number of documents in the corpus, N, to the number of documents in which the term i occurs, df_i.

```markdown
$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
- \(tf_{i,j}\) → term frequency of term \(i\) in document \(j\)
- \(N\) → number of documents in the corpus
- \(df_i\) → number of documents containing term \(i\)
```
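
The formula can be sketched in plain Python as follows. The toy corpus, the helper name `tfidf_weight`, and the choice of a base-10 logarithm are assumptions made for illustration; the slide only writes "log".

```python
import math

# Hypothetical toy corpus, already tokenized
corpus = [
    ["jupiter", "universe", "jupiter", "planet"],
    ["universe", "galaxy", "star"],
    ["universe", "star", "planet"],
]

def tfidf_weight(term, doc, corpus):
    """Return tf * log10(N / df) for `term` in `doc` (illustrative helper)."""
    tf = doc.count(term)                       # term frequency in this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    if df == 0:
        return 0.0                             # term never occurs; treat its weight as 0
    return tf * math.log10(len(corpus) / df)

doc = corpus[0]
weights = {term: tfidf_weight(term, doc, corpus) for term in sorted(set(doc))}
print(weights)
```

In this sketch *universe* appears in every document, so its weight is 0, while *jupiter*, which appears in only one document, receives the largest weight.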


9\. Mathematical formula
------------------------

02:32 - 03:18

Therefore, let's say the word 'library' occurs in a document 5 times. There are 20 documents in the corpus and 'library' occurs in 8 of them. Then, the tf-idf weight of 'library' in the vector representation of this document will be 5 times the log (base 10) of 20 divided by 8, which is approximately 2. In general, the higher the tf-idf weight, the more important the word is in characterizing the document. A high tf-idf weight for a word in a document may imply that the word is relatively exclusive to that particular document, that the word occurs extremely commonly in the document, or both.

```markdown
$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
- \(tf_{i,j}\) → term frequency of term \(i\) in document \(j\)
- \(N\) → number of documents in the corpus
- \(df_i\) → number of documents containing term \(i\)

**Example:**

$$
w_{\text{library}, \text{document}} = 5 \cdot \log_{10}\left(\frac{20}{8}\right) \approx 2
$$
```
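
A quick numeric check of the 'library' example is sketched below. The result of roughly 2 is only reproduced with a base-10 logarithm; the natural-log value is printed for comparison. The variable names are chosen for illustration.

```python
import math

# Values from the 'library' example
tf, n_docs, df = 5, 20, 8

weight_base10 = tf * math.log10(n_docs / df)   # base-10 logarithm
weight_natural = tf * math.log(n_docs / df)    # natural logarithm, for comparison

print(round(weight_base10, 2))   # 1.99
print(round(weight_natural, 2))  # 4.58
```

The base of the logarithm rescales every weight by the same constant factor, so it changes the absolute values but not the relative ordering of terms.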

10\. tf-idf using scikit-learn
------------------------------

03:18 - 04:10

Generating vectors that use tf-idf weighting is almost identical to what we've already done so far. Instead of using CountVectorizer, we use the TfidfVectorizer class of scikit-learn. Its parameters and methods are almost identical to those of CountVectorizer. The main difference is that TfidfVectorizer assigns tf-idf weights (by default, scikit-learn uses a smoothed, natural-log variant of the formula from before and scales each document vector to unit length) and has extra parameters related to inverse document frequency which we will not cover in this course. Here, we can see how using TfidfVectorizer is almost identical to using CountVectorizer for a corpus. However, notice that the weights are non-integer and reflect values calculated from tf-idf.

```python
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
```

```plaintext
[[0.         0.         0.         0.25434658 0.33443519 0.33443519
  0.25434658 0.         0.25434658 0.         0.76303975]
 [0.         0.46735098 0.         0.         0.46735098 0.
  0.         0.46735098 0.35543247 0.         0.        ]
...
```

11\. Let's practice!
--------------------

04:10 - 04:14

That's enough theory for now. Let's practice!

tf-idf weight of commonly occurring words
=========================================

The word `bottle` occurs 5 times in a particular document `D` and also occurs in every document of the corpus. What is the tf-idf weight of `bottle` in `D`?

#### Possible Answers

-   [x] 0
-   1
-   Not defined
-   5

tf-idf vectors for TED talks
============================

In this exercise, you have been given a corpus `ted` which contains the transcripts of 500 TED Talks. Your task is to generate the tf-idf vectors for these talks.

In a later lesson, we will use these vectors to generate recommendations of similar talks based on the transcript.

Instructions
------------

-   Import `TfidfVectorizer` from `sklearn`.
-   Create a `TfidfVectorizer` object. Name it `vectorizer`.
-   Generate `tfidf_matrix` for `ted` using the `fit_transform()` method.

```python
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)
```

1\. Cosine similarity
---------------------

00:00 - 00:25

We now know how to compute vectors out of text documents. With this representation in mind, let us now explore techniques that will allow us to determine how similar two vectors, and consequently two documents, are to each other. More specifically, we will learn about the cosine similarity score, which is one of the most widely used similarity metrics in NLP.

2\. Mathematical formula
------------------------

00:25 - 00:45

Very simply put, the cosine similarity score of two vectors is the cosine of the angle between the vectors. Mathematically, it is the ratio of the dot product of the vectors to the product of the magnitudes of the two vectors. Let's walk through what this formula really means.


```markdown
## Cosine Similarity

\[
sim(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|}
\]

```
```plaintext
          y
          ^
          |
      10 -|        🐼 (A)
          |      /
          |     /
       5 -|    /θ
          |   /
          |  /_______ 🐯 (B)
          | /          
          |/____________________> x
            5    10    15
```

3\. The dot product
-------------------

00:45 - 01:21

The dot product is computed by summing the products of values across corresponding dimensions of the vectors. Let's say we have two n-dimensional vectors V and W as shown. Then, the dot product here would be v1 times w1 plus v2 times w2 and so on until vn times wn. As an example, consider two vectors A and B. By applying the formula above, we see that the dot product comes to 37.

```markdown
Consider two vectors,

\[
V = (v_1, v_2, \dots, v_n), W = (w_1, w_2, \dots, w_n)
\]

Then the dot product of \( V \) and \( W \) is,

\[
V \cdot W = (v_1 \times w_1) + (v_2 \times w_2) + \dots + (v_n \times w_n)
\]

**Example:**

\[
A = (4, 7, 1), B = (5, 2, 3)
\]

\[
A \cdot B = (4 \times 5) + (7 \times 2) + (1 \times 3)
\]

\[
= 20 + 14 + 3 = 37
\]
```
```plaintext
[Figure: vectors A = (4, 7, 1) and B = (5, 2, 3) drawn from the origin]
```
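
The dot product described above can be sketched in a few lines of plain Python. The function name `dot` is an illustrative choice, not something defined in the course.

```python
def dot(v, w):
    """Sum the products of values across corresponding dimensions."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(vi * wi for vi, wi in zip(v, w))

A = (4, 7, 1)
B = (5, 2, 3)

print(dot(A, B))  # 37
```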

4\. Magnitude of a vector
-------------------------

01:21 - 01:57

The magnitude of a vector is essentially the length of the vector. Mathematically, it is defined as the square root of the sum of the squares of values across all the dimensions of a vector. Therefore, for an n-dimensional vector V, the magnitude, ‖V‖, is computed as the square root of v1 squared plus v2 squared and so on until vn squared. Consider the vector A from before. Using the above formula, we compute its magnitude to be root 66.

```markdown
For any vector,

\[
V = (v_1, v_2, \dots, v_n)
\]

The magnitude is defined as,

\[
\|V\| = \sqrt{(v_1)^2 + (v_2)^2 + \dots + (v_n)^2}
\]

**Example:**

\[
A = (4, 7, 1), B = (5, 2, 3)
\]

\[
\|A\| = \sqrt{(4)^2 + (7)^2 + (1)^2}
\]

\[
= \sqrt{16 + 49 + 1} = \sqrt{66}
\]
```
```plaintext
[Figure: vector A = (4, 7, 1) with magnitude √66, shown alongside B = (5, 2, 3)]
```
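
The magnitude formula can be checked with a short sketch. The helper name `magnitude` is an assumption made for this example.

```python
import math

def magnitude(v):
    """Square root of the sum of squared values across all dimensions."""
    return math.sqrt(sum(x * x for x in v))

A = (4, 7, 1)
B = (5, 2, 3)

print(magnitude(A))  # square root of 66, roughly 8.12
print(magnitude(B))  # square root of 38, roughly 6.16
```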

5\. The cosine score
--------------------

01:57 - 02:23

We are now in a position to compute the cosine similarity score of A and B. It is the dot product, which is 37, divided by the product of the magnitudes of A and B, which are root 66 and root 38 respectively. The value comes out to be approximately 0.739, which is the value of the cosine of the angle theta between the two vectors.

```markdown
For vectors,

\[
A = (4, 7, 1), B = (5, 2, 3)
\]

The cosine score,

\[
sim(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}
\]

\[
= \frac{37}{\sqrt{66} \times \sqrt{38}} \approx 0.7388
\]
```
```plaintext
[Figure: vectors A = (4, 7, 1) and B = (5, 2, 3) with the angle θ between them]
```
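
Putting the dot product and the magnitudes together, the cosine score can be sketched from scratch as below. The function name `cosine_score` and the zero-vector check are choices made for this illustration.

```python
import math

def cosine_score(v, w):
    """Cosine of the angle between v and w: dot product over product of magnitudes."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        raise ValueError("cosine score is undefined for a zero vector")
    return dot / (norm_v * norm_w)

A = (4, 7, 1)
B = (5, 2, 3)

score = cosine_score(A, B)
print(round(score, 4))  # 0.7388

# The angle itself, in degrees, recovered from the score
theta = math.degrees(math.acos(score))
print(theta)
```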

6\. Cosine Score: points to remember
------------------------------------

02:23 - 03:03

Since the cosine score is simply the cosine of the angle between two vectors, its value is bounded between -1 and 1. However, in NLP, document vectors almost always use non-negative weights. Therefore, cosine scores vary between 0 and 1, where 0 indicates no similarity and 1 indicates that the document vectors point in the same direction, as they do for identical documents. Finally, since the cosine score ignores the magnitude of the vectors, it is fairly robust to document length. This may be an advantage or a disadvantage depending on the use case.

- Value between -1 and 1.
- In NLP, value between 0 and 1.
- Robust to document length.
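
The three points above can be checked numerically with a small sketch. The helper `cosine_score` and the sample vectors are assumptions for illustration; scaling a vector by 3 stands in for a document that repeats the same content three times.

```python
import math

def cosine_score(v, w):
    """Cosine of the angle between v and w (assumes neither is a zero vector)."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

A = (4, 7, 1)

# Same direction, three times the length: the score stays at 1
tripled = tuple(3 * x for x in A)
same_direction = cosine_score(A, tripled)

# No shared non-zero dimensions: the score is 0
no_overlap = cosine_score((1, 0, 0), (0, 1, 0))

# Opposite directions need negative weights, which document vectors rarely have
opposite = cosine_score((1, 2), (-1, -2))

print(same_direction, no_overlap, opposite)
```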


7\. Implementation using scikit-learn
-------------------------------------

03:03 - 03:42

Scikit-learn offers a cosine_similarity function that outputs a similarity matrix containing the pairwise cosine scores for a set of vectors. You can import cosine_similarity from sklearn.metrics.pairwise. However, remember that cosine_similarity takes in 2-D arrays as arguments. Passing in 1-D arrays will throw an error. Let us compute the cosine similarity score of vectors A and B from before. We see that we get the same answer of approximately 0.739 as before.

```python
# Import the cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Define two 3-dimensional vectors A and B
A = (4, 7, 1)
B = (5, 2, 3)

# Compute the cosine score of A and B
score = cosine_similarity([A], [B])

# Print the cosine score
print(score)
```
```plaintext
[[0.73881883]]
```
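
As a further sketch, passing a single 2-D array to cosine_similarity returns the full pairwise similarity matrix, and a 1-D vector can be reshaped into a one-row 2-D array before it is passed in. The third vector below is a made-up multiple of A, added only to show a score of 1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three vectors stacked as rows of a 2-D array; the last row is 2 * A
vectors = np.array([
    [4, 7, 1],    # A
    [5, 2, 3],    # B
    [8, 14, 2],   # hypothetical C = 2 * A
])

# Pairwise scores: entry (i, j) is the cosine score of row i and row j
sim = cosine_similarity(vectors)
print(sim.round(4))

# A 1-D vector must be reshaped to shape (1, n_features) first
a = np.array([4, 7, 1])
b = np.array([5, 2, 3])
score = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]
print(score)
```

The matrix is symmetric with ones on the diagonal, since every vector has a cosine score of 1 with itself.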

8\. Let's practice!
-------------------

03:42 - 03:46

That's enough theory for now. Let's practice!