# Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF is a statistical measure of how important a word is to a document within a collection (corpus). It is widely used in information retrieval and text mining.

## Intuition

- **Term Frequency (TF):** Measures how frequently a term appears in a document. However, common words like "the" or "is" may appear frequently but are not informative.
- **Inverse Document Frequency (IDF):** Down-weights terms that appear in many documents, highlighting words that are specific to a few documents.
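To make the limitation of raw TF concrete, here is a minimal sketch on a made-up sentence (the sentence is an assumption chosen for illustration and is not part of the corpus used later). The most frequent word turns out to be "the", which says nothing about the topic.

```python
from collections import Counter

# Hypothetical sentence, chosen only to illustrate the point.
doc = "the cat sat on the mat".split()

counts = Counter(doc)
# Raw term frequency: occurrences divided by the document length.
tf = {word: count / len(doc) for word, count in counts.items()}

top_word = max(tf, key=tf.get)
print(top_word, round(tf[top_word], 3))  # the 0.333
```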



**Term Frequency (TF):**

$$
\text{TF}(t, d) = \frac{\text{Occurrence of term } t \text{ in document } d}{\text{Total words in } d}
$$

**Inverse Document Frequency (IDF):**

$$
\text{IDF}(t, D) = \log \left( \frac{\text{Number of documents in corpus}}{\text{Number of documents where term } t \text{ appears}} \right)
$$
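
The TF-IDF weight of a term in a document is the product of these two quantities. Written out, assuming $\log$ is the natural logarithm (which is what Python's `math.log` computes by default):

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \cdot \text{IDF}(t, D)
$$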

```python
import math

sentences = [
    "dog bark",
    "dog run",
    "run fast dog"
]
```

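## Tokenizing sentences and building the vocabulary

```python
from collections import Counter

tokenized = [sentence.split() for sentence in sentences]

# Order the vocabulary by corpus frequency (ties keep first-seen order) so the
# column order is reproducible; iterating a plain set gives no guaranteed order.
word_counts = Counter(word for sent in tokenized for word in sent)
vocab = [word for word, _ in word_counts.most_common()]
```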

## Calculation of TF

```python
# Term frequency: one row per sentence, one column per vocabulary word
tf_matrix = []
for sent in tokenized:
    tf_row = []
    for word in vocab:
        count = sent.count(word)
        tf = count / len(sent)
        tf_row.append(tf)
    tf_matrix.append(tf_row)
```
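
The same computation can be packaged as a small helper. This is only a sketch: the function name `term_frequencies` is hypothetical, and the vocabulary is assumed to be a list in the fixed order dog, run, bark, fast so that the columns are easy to read.

```python
def term_frequencies(tokens, vocab):
    """Return the TF of every vocabulary word in one tokenized document."""
    total = len(tokens)
    return [tokens.count(word) / total for word in vocab]

vocab = ["dog", "run", "bark", "fast"]  # fixed order, assumed for illustration
tokenized = [["dog", "bark"], ["dog", "run"], ["run", "fast", "dog"]]

tf_matrix = [term_frequencies(sent, vocab) for sent in tokenized]
for row in tf_matrix:
    print([round(x, 3) for x in row])
# [0.5, 0.0, 0.5, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.333, 0.333, 0.0, 0.333]
```

Each row sums to 1 here because every word in a sentence is also in the vocabulary.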

## Problem

TF alone rewards any word that occurs often. Common words like "the" or "is" appear in almost every document, so a high TF for them says little about what a document is about. To address this, we use **IDF** to down-weight such common terms.

## Calculation of IDF

For each word in the vocabulary, we calculate how many sentences contain it and then compute its IDF.

```python
# IDF for each vocabulary word (math.log is the natural logarithm)
N = len(sentences)
idf_vector = []
for word in vocab:
    containing = sum(1 for sent in tokenized if word in sent)
    idf = math.log(N / containing)
    idf_vector.append(idf)
```
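
As a quick check of the values this step produces, the sketch below wraps the loop in a helper and prints the IDF of each word. The function name `inverse_document_frequencies` is hypothetical, and the vocabulary order dog, run, bark, fast is assumed for readability.

```python
import math

def inverse_document_frequencies(tokenized, vocab):
    """Return ln(N / df) for each vocabulary word, in vocabulary order."""
    n_docs = len(tokenized)
    idf = []
    for word in vocab:
        # df: number of documents that contain the word at least once
        df = sum(1 for sent in tokenized if word in sent)
        idf.append(math.log(n_docs / df))
    return idf

vocab = ["dog", "run", "bark", "fast"]  # fixed order, assumed for illustration
tokenized = [["dog", "bark"], ["dog", "run"], ["run", "fast", "dog"]]

for word, value in zip(vocab, inverse_document_frequencies(tokenized, vocab)):
    print(f"{word}: {value:.3f}")
# dog: 0.000
# run: 0.405
# bark: 1.099
# fast: 1.099
```

Note that a word occurring in no document would make `df` zero and raise `ZeroDivisionError`; that cannot happen here because the vocabulary is built from the corpus itself.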

## What Does This Solve?

If a word like "the" appears in **all** documents of a corpus with 3 documents, then:

$$
\text{IDF}("the", D) = \log\left( \frac{3}{3} \right) = \log(1) = 0
$$

Its TF-IDF weight is therefore 0 in every document, so it does not affect the representation.
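
As a worked instance on the three-sentence corpus used in this notebook, the word "dog" occurs in every sentence, so its weight in the sentence "dog bark" vanishes even though its TF there is $\frac{1}{2}$:

$$
\text{TF}(\text{dog}, \text{"dog bark"}) \cdot \text{IDF}(\text{dog}, D) = \frac{1}{2} \cdot \log\left( \frac{3}{3} \right) = \frac{1}{2} \cdot 0 = 0
$$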

## Example

**Sentences:**

- $s_1 = \text{"dog bark"}$
- $s_2 = \text{"dog run"}$
- $s_3 = \text{"run fast dog"}$

**Vocabulary:**

| Word  | Count |
|-------|-------|
| dog   | 3     |
| run   | 2     |
| bark  | 1     |
| fast  | 1     |

## Term Frequency (TF)

| Word  | $s_1$         | $s_2$         | $s_3$         |
|-------|---------------|---------------|---------------|
| dog   | $\frac{1}{2}$ | $\frac{1}{2}$ | $\frac{1}{3}$ |
| run   | $0$           | $\frac{1}{2}$ | $\frac{1}{3}$ |
| bark  | $\frac{1}{2}$ | $0$           | $0$           |
| fast  | $0$           | $0$           | $\frac{1}{3}$ |


## Inverse Document Frequency (IDF)

| Word  | IDF (natural logarithm)                       |
|-------|-----------------------------------------------|
| dog   | $\log\left(\frac{3}{3}\right) = 0$            |
| run   | $\log\left(\frac{3}{2}\right) \approx 0.405$  |
| bark  | $\log\left(\frac{3}{1}\right) \approx 1.099$  |
| fast  | $\log\left(\frac{3}{1}\right) \approx 1.099$  |

## Multiplying each TF by the matching IDF gives the following vectors

With components ordered (dog, run, bark, fast) and the natural logarithm:

- $s_1 = [0, 0, 0.549, 0]$
- $s_2 = [0, 0.203, 0, 0]$
- $s_3 = [0, 0.135, 0, 0.366]$

## Calculation of TF-IDF vectors

```python
# TF-IDF: multiply each TF entry by the IDF of the same vocabulary word
tfidf_matrix = []
for tf_row in tf_matrix:
    tfidf_row = []
    for i in range(len(vocab)):
        tfidf = tf_row[i] * idf_vector[i]
        tfidf_row.append(tfidf)
    tfidf_matrix.append(tfidf_row)
```

```python
# Print the TF-IDF vector of each sentence
print("\nTF-IDF Vectors:")
for i, vec in enumerate(tfidf_matrix):
    print(f"Sentence {i+1}:", vec)
```

Output, with components in the order dog, run, bark, fast:

```text
TF-IDF Vectors:
Sentence 1: [0.0, 0.0, 0.5493061443340549, 0.0]
Sentence 2: [0.0, 0.2027325540540822, 0.0, 0.0]
Sentence 3: [0.0, 0.13515503603605478, 0.0, 0.3662040962227032]
```
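
Bare lists of numbers are hard to read without knowing the column order. One possible way around this, sketched below, is to return a `{word: weight}` dictionary per sentence. The function name `tfidf_by_word` is hypothetical, and the vocabulary is ordered by first appearance here, which is an assumption of this sketch rather than something the steps above require.

```python
import math

def tfidf_by_word(sentences):
    """Return one {word: weight} dict per sentence, using TF * ln(N / df)."""
    tokenized = [s.split() for s in sentences]
    # dict.fromkeys keeps first-appearance order, so the vocabulary is reproducible.
    vocab = list(dict.fromkeys(w for sent in tokenized for w in sent))
    n_docs = len(tokenized)
    # df is the number of sentences containing the word (booleans sum as 0/1).
    idf = {w: math.log(n_docs / sum(w in sent for sent in tokenized)) for w in vocab}
    return [
        {w: sent.count(w) / len(sent) * idf[w] for w in vocab}
        for sent in tokenized
    ]

weights = tfidf_by_word(["dog bark", "dog run", "run fast dog"])
for i, row in enumerate(weights, start=1):
    nonzero = {w: round(x, 3) for w, x in row.items() if x > 0}
    print(f"Sentence {i}:", nonzero)
# Sentence 1: {'bark': 0.549}
# Sentence 2: {'run': 0.203}
# Sentence 3: {'run': 0.135, 'fast': 0.366}
```

The word "dog" never shows up in the non-zero entries because it occurs in all three sentences and its IDF is 0.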
