## TF-IDF

<a href="https://colab.research.google.com/github/febse/ta2025/blob/main/02-03-TF-IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

Until now we have looked at the term frequency matrix (counts of words in documents). Raw counts, however, do not
reflect how informative a word is. For example, the word "the" is likely to appear in most documents, but it tells us
little about any one of them. The term frequency-inverse document frequency (TF-IDF) is a measure that weights each
count by how rare the word is across the corpus, so that common words are down-weighted.

**Term frequency** ($TF(i, d)$) is the number of occurrences of word $i$ in document $d$. It depends strongly
on how general a word is (e.g. "has" vs. "cosine" in general literature) and also on the length of the document.

**Document frequency** ($DF(i)$) is the number of documents that contain word $i$.
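
As a minimal sketch of these two definitions, the snippet below counts term and document frequencies with the standard library only. It uses the three toy sentences that appear later in this section, and it assumes plain whitespace tokenization, which is simpler than what `CountVectorizer` does in general.

```python
from collections import Counter

# The three toy documents used later in this section.
corpus = [
    "the quick brown fox",
    "the fast brown dog",
    "the quick red fox",
]

# Assumption: splitting on whitespace is a good enough tokenizer for this sketch.
tokenized = [doc.split() for doc in corpus]

# Term frequency: occurrences of each word within a single document.
tf = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain the word at least once.
# set(tokens) ensures a word is counted once per document.
df = Counter(word for tokens in tokenized for word in set(tokens))

print(tf[0]["quick"])  # 1
print(df["the"])       # 3
print(df["dog"])       # 1
```

Note that the term frequency is defined per (word, document) pair, while the document frequency is a single number per word for the whole corpus.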

**Inverse document frequency** ($IDF(i)$) is simply the inverse relative frequency of the word in the set of documents.
With $N$ documents the IDF is given by:

$$
    IDF(i) = \frac{N}{DF(i)}
$$

It is large for words that appear in only a few documents and small for words that occur in many documents. A word that occurs in every document has $IDF(i) = 1$, the smallest possible value.

A problem with this definition is that the IDF becomes very large for large corpora (large N) so it is commonly replaced
by its logarithm.

$$
    IDF(i) = 1 + \log\left(\frac{N}{DF(i)}\right)
$$
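
To illustrate why the logarithm helps, the following sketch compares the two definitions for a word that occurs in a single document while the corpus grows. The function names `idf_raw` and `idf_log` are chosen here for illustration and do not come from any library.

```python
import math

def idf_raw(n_docs, doc_freq):
    """Raw inverse document frequency: N / DF."""
    return n_docs / doc_freq

def idf_log(n_docs, doc_freq):
    """Log-scaled inverse document frequency: 1 + ln(N / DF)."""
    return 1 + math.log(n_docs / doc_freq)

# A word that occurs in exactly one document, for growing corpus sizes.
for n_docs in (10, 1_000, 1_000_000):
    print(n_docs, idf_raw(n_docs, 1), round(idf_log(n_docs, 1), 3))
# 10 10.0 3.303
# 1000 1000.0 7.908
# 1000000 1000000.0 14.816

# A word that occurs in every document keeps a weight of 1 instead of 0.
print(idf_log(1_000, 1_000))  # 1.0
```

The raw value grows linearly with the corpus size, whereas the log-scaled value grows only slowly.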

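The addition of 1 in the above equation ensures that words that occur in all documents (for which the logarithm is zero) are not entirely discarded. Here $\log$ denotes the natural logarithm. With its default setting `smooth_idf=True`, `TfidfVectorizer` also adds 1 to both the numerator and the denominator, as if one extra document contained every word once, which prevents division by zero: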

$$
    IDF(i) = 1 + \log\left(\frac{N + 1}{DF(i) + 1}\right)
$$

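The TF-IDF weight of word $i$ in document $d$ is the product of its term frequency and its inverse document frequency:

$$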
    \text{TF-IDF}(i, d) = TF(i, d) \times IDF(i)
$$
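
The product can be sketched in a few lines of plain Python. The helper names `smoothed_idf` and `tf_idf` and the counts passed to them are hypothetical and chosen only for illustration. The IDF is the smoothed variant with the natural logarithm.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: 1 + ln((N + 1) / (DF + 1))."""
    return 1 + math.log((n_docs + 1) / (doc_freq + 1))

def tf_idf(term_freq, n_docs, doc_freq):
    """TF-IDF weight: term frequency times smoothed IDF."""
    return term_freq * smoothed_idf(n_docs, doc_freq)

# Hypothetical counts: a corpus of 3 documents and a word that occurs
# twice in the document of interest.
print(tf_idf(2, n_docs=3, doc_freq=3))            # 2.0 (word is in every document)
print(round(tf_idf(2, n_docs=3, doc_freq=1), 6))  # 3.386294 (word is in one document only)

# A word that does not occur in the document always gets weight 0.
print(tf_idf(0, n_docs=3, doc_freq=1))            # 0.0
```

With equal term frequency, the rarer word receives the larger weight.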

Let's calculate it for the toy corpus with just three documents:

```
    "the quick brown fox",
    "the fast brown dog",
    "the quick red fox"
```


In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
import pandas as pd

corpus = [
    'the quick brown fox',
    'the fast brown dog',
    'the quick red fox'
]

c_vect = CountVectorizer()

term_matrix = c_vect.fit_transform(corpus)
term_matrix_dense = term_matrix.toarray()

pd.DataFrame(term_matrix_dense, columns=c_vect.get_feature_names_out())

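|   | brown | dog | fast | fox | quick | red | the |
|---|-------|-----|------|-----|-------|-----|-----|
| 0 | 1     | 0   | 0    | 1   | 1     | 0   | 1   |
| 1 | 1     | 1   | 1    | 0   | 0     | 0   | 1   |
| 2 | 0     | 0   | 0    | 1   | 1     | 1   | 1   |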


In [3]:
tfidf_vect = TfidfVectorizer(smooth_idf=True, use_idf=True, norm=None)
tfidf_term_matrix = tfidf_vect.fit_transform(corpus)
pd.DataFrame(
    tfidf_term_matrix.toarray(),
    columns=tfidf_vect.get_feature_names_out(),
    index=[f"doc{i}" for i in range(1, len(corpus) + 1)]
    )

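|      | brown    | dog      | fast     | fox      | quick    | red      | the |
|------|----------|----------|----------|----------|----------|----------|-----|
| doc1 | 1.287682 | 0.0      | 0.0      | 1.287682 | 1.287682 | 0.0      | 1.0 |
| doc2 | 1.287682 | 1.693147 | 1.693147 | 0.0      | 0.0      | 0.0      | 1.0 |
| doc3 | 0.0      | 0.0      | 0.0      | 1.287682 | 1.287682 | 1.693147 | 1.0 |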


In [4]:
# Get the inverse document frequency
tfidf_vect.idf_

array([1.28768207, 1.69314718, 1.69314718, 1.28768207, 1.28768207,
       1.69314718, 1.        ])

In [5]:
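# IDF for "the", which occurs in all 3 documents
import math

idf_the = 1 + math.log((3 + 1) / (3 + 1))
print(idf_the)

# IDF for "brown", which occurs in 2 of the 3 documents.
# Every word occurs at most once per document here (TF = 1),
# so this value is also the TF-IDF of "brown" in the first document.

idf_brown = 1 + math.log((3 + 1) / (2 + 1))
print(idf_brown)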


1.0
1.2876820724517808
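
The same calculation can be extended to every word and document. The sketch below rebuilds the unnormalized TF-IDF matrix for the toy corpus with the standard library only. It assumes whitespace tokenization and an alphabetically sorted vocabulary, which happens to match the column order shown above for this corpus. It is not a reimplementation of `TfidfVectorizer`.

```python
import math
from collections import Counter

corpus = [
    "the quick brown fox",
    "the fast brown dog",
    "the quick red fox",
]

# Assumption: whitespace tokenization is sufficient for these sentences.
tokenized = [doc.split() for doc in corpus]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(corpus)

# Document frequency of every word (each word counted once per document).
df = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed IDF: 1 + ln((N + 1) / (DF + 1)).
idf = {word: 1 + math.log((n_docs + 1) / (df[word] + 1)) for word in vocab}

# TF-IDF without normalization: raw count times IDF.
tfidf = [[tokens.count(word) * idf[word] for word in vocab] for tokens in tokenized]

print(vocab)
for row in tfidf:
    print([round(value, 6) for value in row])
# ['brown', 'dog', 'fast', 'fox', 'quick', 'red', 'the']
# [1.287682, 0.0, 0.0, 1.287682, 1.287682, 0.0, 1.0]
# [1.287682, 1.693147, 1.693147, 0.0, 0.0, 0.0, 1.0]
# [0.0, 0.0, 0.0, 1.287682, 1.287682, 1.693147, 1.0]
```

The rows agree with the matrix produced with `norm=None` above.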
