1\. Building tf-idf document vectors
------------------------------------

00:00 - 00:04

In the last chapter, we learned about n-gram modeling.

2\. n-gram modeling
-------------------

00:04 - 00:29

In n-gram modeling, the weight of a dimension for the vector representation of a document is dependent on the number of times the word corresponding to the dimension occurs in the document. Let's say we have a document that has the word 'human' occurring 5 times. Then, the dimension of its vector representation corresponding to 'human' would have the value 5.

- Weight of dimension dependent on the frequency of the word corresponding to the dimension.
- Document contains the word *human* in five places.
- Dimension corresponding to *human* has weight 5.
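
As a minimal sketch of this count-based weighting, the snippet below tallies word occurrences in a made-up document. The document text and the plain whitespace tokenization are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy document in which "human" appears exactly five times.
document = (
    "human rights are human concerns and every human "
    "deserves what any other human has as a human"
)

# Count how often each word occurs (simple whitespace tokenization).
counts = Counter(document.lower().split())

# The dimension for "human" carries its raw count as the weight.
human_weight = counts["human"]
print(human_weight)  # 5
```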


3\. Motivation
--------------

00:29 - 01:17

However, some words occur very commonly across all the documents in the corpus. As a result, the vector representations end up dominated by these dimensions. Consider a corpus of documents on the Universe. Let's say there is a particular document on Jupiter where the words 'jupiter' and 'universe' both occur about 20 times. However, 'jupiter' rarely figures in the other documents, whereas 'universe' is just as common in them. We could argue that although both *jupiter* and *universe* occur 20 times, *jupiter* should be given a larger weight on account of its exclusivity. In other words, the word 'jupiter' characterizes the document more than 'universe'.

- Some words occur very commonly across all documents
- Corpus of documents on the universe:
  - One document has *jupiter* and *universe* occurring 20 times each.
  - *jupiter* rarely occurs in the other documents. *universe* is common.
  - Give more weight to *jupiter* on account of exclusivity.
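
To make the notion of exclusivity concrete, here is a small sketch that counts how many documents contain each word. The four-document corpus and the `document_frequency` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical toy corpus about the universe; each string is one document.
corpus = [
    "jupiter is a giant planet and jupiter has many moons in our universe",
    "the universe is expanding",
    "stars form galaxies across the universe",
    "the universe is very old",
]

def document_frequency(term, documents):
    """Return the number of documents that contain the term at least once."""
    return sum(term in doc.lower().split() for doc in documents)

df_jupiter = document_frequency("jupiter", corpus)
df_universe = document_frequency("universe", corpus)

# "jupiter" is exclusive to one document; "universe" appears in all of them.
print(df_jupiter, df_universe)  # 1 4
```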


4\. Applications
----------------

01:17 - 01:48

Weighting words this way has a huge number of applications. The weights can be used to automatically detect stopwords for the corpus instead of relying on a generic list. They're used in search algorithms to determine the ranking of pages containing the search query and in recommender systems, as we will soon find out. In a lot of cases, this kind of weighting also leads to better performance in predictive modeling.

- Automatically detect stopwords
- Search
- Recommender systems
- Better performance in predictive modeling for some cases
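
One possible way to approach the stopword idea is sketched below: words that appear in nearly every document are flagged as corpus-specific stopwords. The `corpus_stopwords` helper, the 0.8 threshold, and the toy corpus are all assumptions for illustration, not a method prescribed by the course.

```python
from collections import Counter

def corpus_stopwords(documents, max_doc_ratio=0.8):
    """Flag words that occur in at least max_doc_ratio of the documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Use a set so each word is counted once per document.
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, count in doc_freq.items()
        if count / n_docs >= max_doc_ratio
    )

# Hypothetical toy corpus.
corpus = [
    "the universe is expanding",
    "the universe contains jupiter",
    "stars fill the universe",
    "the universe is old",
]

stopwords = corpus_stopwords(corpus)
print(stopwords)  # ['the', 'universe']
```

Note that 'universe' is flagged alongside 'the', which a generic stopword list would not do.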


5\. Term frequency-inverse document frequency
---------------------------------------------

01:48 - 02:09

The weighting mechanism we've described is known as term frequency-inverse document frequency or tf-idf for short. It is based on the idea that the weight of a term in a document should be proportional to its frequency and an inverse function of the number of documents in which it occurs.

- Proportional to term frequency
- Inverse function of the number of documents in which it occurs


6\. Mathematical formula
------------------------

02:09 - 02:16

Mathematically, the weight of a term i in document j is computed as

```markdown
$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
```

7\. Mathematical formula
------------------------

02:16 - 02:20

the term frequency of term i in document j,

```markdown
$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
- \(tf_{i,j}\) → term frequency of term \(i\) in document \(j\)
```

8\. Mathematical formula
------------------------

02:20 - 02:32

multiplied by the log of the ratio of the number of documents in the corpus, N, to the number of documents in which term i occurs, df_i.

```markdown
$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
- \(tf_{i,j}\) → term frequency of term \(i\) in document \(j\)
- \(N\) → number of documents in the corpus
- \(df_i\) → number of documents containing term \(i\)
```
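
The formula can be sketched as a small function. The name `tfidf_weight` is hypothetical, and the choice of a base-10 logarithm is an assumption, since the formula itself does not fix the base.

```python
import math

def tfidf_weight(tf, n_docs, df):
    """Weight of a term: tf * log10(N / df).

    tf     -- term frequency of the term in the document
    n_docs -- number of documents in the corpus (N)
    df     -- number of documents containing the term
    """
    if df <= 0 or df > n_docs:
        raise ValueError("df must satisfy 0 < df <= n_docs")
    return tf * math.log10(n_docs / df)

# A term present in every document gets zero weight, however frequent it is.
print(tfidf_weight(tf=20, n_docs=10, df=10))  # 0.0

# A term found in 1 of 100 documents is weighted up.
print(tfidf_weight(tf=2, n_docs=100, df=1))   # 4.0
```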


9\. Mathematical formula
------------------------

02:32 - 03:18

Therefore, let's say the word 'library' occurs in a document 5 times. There are 20 documents in the corpus and 'library' occurs in 8 of them. Then, the tf-idf weight of 'library' in the vector representation of this document will be 5 times the log of 20 divided by 8, which is approximately 2 when the logarithm is taken to base 10. In general, the higher the tf-idf weight, the more important the word is in characterizing the document. A high tf-idf weight for a word in a document may imply that the word is relatively exclusive to that particular document, that the word occurs very frequently in the document, or both.

```markdown
$$
w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)
$$

- \(w_{i,j}\) → weight of term \(i\) in document \(j\)
- \(tf_{i,j}\) → term frequency of term \(i\) in document \(j\)
- \(N\) → number of documents in the corpus
- \(df_i\) → number of documents containing term \(i\)

**Example:**

$$
w_{\text{library}, \text{document}} = 5 \cdot \log\left(\frac{20}{8}\right) \approx 2
$$
```
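
The arithmetic in the 'library' example can be checked with a short sketch. It evaluates the formula with both a base-10 and a natural logarithm; only the base-10 version gives a value close to 2.

```python
import math

# Values from the example: tf = 5, N = 20, df = 8.
tf, n_docs, df = 5, 20, 8

weight_base10 = tf * math.log10(n_docs / df)
weight_natural = tf * math.log(n_docs / df)

print(round(weight_base10, 2))   # 1.99
print(round(weight_natural, 2))  # 4.58
```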

10\. tf-idf using scikit-learn
------------------------------

03:18 - 04:10

Generating vectors that use tf-idf weighting is almost identical to what we've done so far. Instead of using CountVectorizer, we use the TfidfVectorizer class of scikit-learn. Its parameters and methods are almost identical to those of CountVectorizer. The only difference is that TfidfVectorizer assigns weights based on the tf-idf formula from before (by default, scikit-learn applies a smoothed idf and normalizes each document vector to unit length, so the values will not match a hand calculation exactly) and has extra parameters related to inverse document frequency, which we will not cover in this course. Here, we can see how using TfidfVectorizer is almost identical to using CountVectorizer for a corpus. However, notice that the weights are non-integer and reflect values calculated by the tf-idf formula.

```python
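# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
# (corpus is assumed to be an iterable of document strings defined earlier)
tfidf_matrix = vectorizer.fit_transform(corpus)

# Convert the sparse matrix to a dense array for display
print(tfidf_matrix.toarray())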
```

```python
[[0.         0.         0.         0.25434658 0.33443519 0.33443519
  0.25434658 0.         0.25434658 0.         0.76303975]
 [0.         0.46735098 0.         0.         0.46735098 0.
  0.         0.46735098 0.35543247 0.         0.        ]
...
```
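
For a self-contained variant, the sketch below fits TfidfVectorizer on a made-up three-document corpus and inspects the learned idf values. The corpus is hypothetical, and the printed behavior relies on scikit-learn's default settings (smoothed idf and L2 normalization).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; each string is one document.
corpus = [
    "the universe is vast",
    "jupiter is a planet in the universe",
    "the universe is expanding",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index in the matrix.
vocab = vectorizer.vocabulary_
idf_jupiter = vectorizer.idf_[vocab["jupiter"]]
idf_universe = vectorizer.idf_[vocab["universe"]]

# "jupiter" occurs in one document, "universe" in all three,
# so "jupiter" receives the larger idf.
print(idf_jupiter > idf_universe)  # True

# With the default settings, each document vector has unit length.
weights = tfidf_matrix.toarray()
row_norms = (weights ** 2).sum(axis=1) ** 0.5
print(row_norms)
```

In the second document both 'jupiter' and 'universe' occur once, yet 'jupiter' ends up with the larger weight because of its higher idf.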

11\. Let's practice!
--------------------

04:10 - 04:14

That's enough theory for now. Let's practice!