# Term Frequency-Inverse Document Frequency

TF-IDF is a statistical measure of how important a word is to a document within a collection. A word's importance increases in proportion to the number of times it appears in the document, but it is offset by how many documents in the collection contain the word. A word that appears in many documents is not a distinctive identifier of any one of them, so it receives a lower weight.

TF-IDF is the product of two statistics, term frequency and inverse document frequency.

$$
\begin{aligned}
\text{tf-idf}(t, d, D) &= \text{tf}(t, d) \times \text{idf}(t, D) \\
&= \text{tf}(t, d) \times \log \frac{|D|}{|\{d \in D: t \in d\}|}
\end{aligned}
$$

where $t$ is a term, $d$ is a document, $D$ is a collection of documents, $\text{tf}(t, d)$ is the number of times term $t$ appears in document $d$, and $\text{idf}(t, D)$ is the logarithm of the number of documents in $D$ divided by the number of documents in which $t$ appears.
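As a worked illustration of the formula, here is a minimal pure-Python sketch. It uses the same four sentences as the scikit-learn cell in this notebook. The `tokenize` helper and the function names are assumptions made for this example, and the natural logarithm is assumed since the formula does not fix a base.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenize(text):
    # Assumption: lowercase the text and keep runs of word characters.
    return re.findall(r"\w+", text.lower())


docs = [tokenize(text) for text in corpus]


def tf(term, doc):
    # Raw count of `term` in one tokenized document.
    return Counter(doc)[term]


def idf(term, docs):
    # log(|D| / number of documents containing `term`).
    # Raises ZeroDivisionError if `term` appears in no document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# Scores for three terms of the second document.
for term in ("the", "document", "second"):
    print(term, tf_idf(term, docs[1], docs))
```

In this sketch "the" scores exactly zero because it occurs in all four documents, while "second", which occurs in only one, scores highest of the three.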

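This notebook demonstrates how to compute TF-IDF using `sklearn.feature_extraction.text.TfidfVectorizer`. With its default settings, `TfidfVectorizer` uses a smoothed variant of the idf and L2-normalizes each document vector, so its scores differ numerically from the formula above.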

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a simple corpus of documents
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and return the document-term matrix
X = vectorizer.fit_transform(corpus)

# Get the feature names (the vocabulary terms, in column order)
feature_names = vectorizer.get_feature_names_out()

print(feature_names)
print(X.toarray())
```

```
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
```


As seen above, the TF-IDF vectorizer first learns a vocabulary from the collection of documents. It then computes a TF-IDF score for every vocabulary term in each document.
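These two steps can be reproduced by hand as a check. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, no sublinear tf scaling). Under those defaults the idf is $\ln\frac{1 + |D|}{1 + \text{df}(t)} + 1$, where $\text{df}(t)$ is the number of documents containing $t$, and each row is scaled to unit Euclidean length. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Step 1: learn the vocabulary and count raw term occurrences.
counts = CountVectorizer().fit_transform(corpus).toarray()

# Step 2: smoothed idf, ln((1 + n) / (1 + df)) + 1.
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Step 3: weight the counts by idf, then L2-normalize each row.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the vectorizer's own result.
reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

The smoothing keeps the idf positive even for terms that occur in every document, which is why "is", "the", and "this" have non-zero scores in the output above.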

Transforming the corpus produces a document-term matrix, where each row represents a document and each column represents a term. The value of each cell is the TF-IDF score of that term in that document.