# 1972 - term frequency-inverse document frequency (TF-IDF)

[1972 TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
TF-IDF expanded on Bag of Words (BoW) by giving more weight to rare words and less to common terms, improving the model's ability to detect document relevance. Like BoW, however, it does not capture word context.

In this Jupyter notebook, we follow a similar approach to the Bag of Words (BoW) model, but this time we use term frequency-inverse document frequency (TF-IDF) to give more weight to rarer words. Reducing the impact of common words improves the model's ability to identify relevant documents.

### Step-by-Step Explanation:

The notebook follows this process:

1. **Collect Words and Calculate Frequencies**: As with the BoW model, we start by creating a list of unique words from all the sentences and counting how often each word occurs in each sentence. Each sentence is treated as one document.
2. **Calculate Term Frequency (TF)**: For each word in a sentence, we calculate the term frequency, which is the number of times a word appears in a sentence divided by the total number of words in that sentence.
3. **Calculate Inverse Document Frequency (IDF)**: We then calculate the inverse document frequency for each word, which is the logarithm of the total number of documents divided by the number of documents that contain the word. This helps in giving less importance to common words across all documents.
4. **Calculate TF-IDF**: Finally, we multiply the TF and IDF values to get the TF-IDF score for each word in each document.
5. **Create a TF-IDF Table**: Similar to the BoW model, we put this information into a table where each row represents a sentence and each column represents a word from our vocabulary.
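
The five steps above can be sketched in plain Python without any library. This is a minimal illustration, not the method scikit-learn uses: the `tokenize` helper, the natural logarithm, and the variable names are assumptions made for the example, and the sentences are the same three used later in the notebook.

```python
import math
import re
from collections import Counter

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "The quick brown fox is quick and fast.",
]


def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of word characters.
    return re.findall(r"\w+", text.lower())


tokenized = [tokenize(doc) for doc in documents]

# Step 1: vocabulary of unique words across all sentences.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


# Step 2: term frequency = count of the word / total words in the sentence.
def term_frequencies(tokens):
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


# Step 3: inverse document frequency = log(N / documents containing the word).
n_docs = len(tokenized)
idf = {
    word: math.log(n_docs / sum(word in tokens for tokens in tokenized))
    for word in vocabulary
}

# Steps 4 and 5: multiply TF by IDF and arrange the scores as a table
# with one row per sentence and one column per vocabulary word.
tfidf_table = []
for tokens in tokenized:
    tf = term_frequencies(tokens)
    tfidf_table.append([tf.get(word, 0.0) * idf[word] for word in vocabulary])

print(vocabulary)
for row in tfidf_table:
    print([round(score, 3) for score in row])
```

With this unsmoothed formula, a word that appears in every sentence (here "the") gets an IDF of 0 and therefore a TF-IDF score of 0 in every row.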

### How It Works:
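- The `TfidfVectorizer` in scikit-learn is used to automatically compute the TF-IDF scores for each word in each document. Its default settings use raw counts for TF, a smoothed IDF, and L2-normalized rows, so its values differ from the plain formula described in the steps above.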
- The `fit_transform` method learns the vocabulary from the documents and calculates the TF-IDF scores.
- The resulting matrix gives a numeric representation of the importance of each word in each sentence, helping in text analysis tasks.

### Explanation of the Output:
- **Vocabulary**: Lists all the unique words found in the documents.
- **TF-IDF Matrix**: Provides the TF-IDF score for each word in each document, showing the importance of each word within the context of the document.

This code snippet provides a straightforward way to demonstrate the use of TF-IDF in text analysis, building on the foundation of the Bag of Words model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "The quick brown fox is quick and fast."
]

# Initialize the TfidfVectorizer
# TfidfVectorizer will tokenize the documents, calculate the TF-IDF score for each word
vectorizer = TfidfVectorizer()

# Fit and transform the documents
# This step learns the vocabulary from the documents and transforms them into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the vocabulary (unique words in the corpus)
# get_feature_names_out() provides the list of all unique words in the documents
vocab = vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to an array
# tfidf_matrix.toarray() converts the sparse matrix to a dense numpy array for easier viewing
tfidf_array = tfidf_matrix.toarray()

# Display the results
# Print the vocabulary which lists all unique words in the documents
print("Vocabulary:\n", vocab)
# Print the TF-IDF matrix which shows the TF-IDF score of each word in each document
print("\nTF-IDF Matrix:\n", tfidf_array)
```

```
Vocabulary:
 ['and' 'brown' 'dog' 'fast' 'fox' 'is' 'jump' 'jumps' 'lazy' 'never'
 'over' 'quick' 'quickly' 'the']

TF-IDF Matrix:
 [[0.         0.31401745 0.31401745 0.         0.31401745 0.
  0.         0.41289521 0.31401745 0.         0.31401745 0.31401745
  0.         0.48772512]
 [0.         0.         0.33729513 0.         0.         0.
  0.44350256 0.         0.33729513 0.44350256 0.33729513 0.
  0.44350256 0.26193976]
 [0.38294157 0.29123694 0.         0.38294157 0.29123694 0.38294157
  0.         0.         0.         0.         0.         0.58247387
  0.         0.22617146]]
```


### TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is the product of two values: Term Frequency (TF) and Inverse Document Frequency (IDF).

#### 1. Term Frequency (TF)
The term frequency $ \text{TF}(w_i, D_j) $ of a word $ w_i $ in a document $ D_j $ is defined as:

$$
\text{TF}(w_i, D_j) = \frac{f(w_i, D_j)}{\sum_{k} f(w_k, D_j)}
$$

Where:
- $ f(w_i, D_j) $ is the frequency of word $ w_i $ in document $ D_j $.
- The denominator is the total number of words in document $ D_j $.
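
As a quick check against the notebook's first sample sentence, "The quick brown fox jumps over the lazy dog." has nine words once punctuation is dropped and case is ignored (an assumption about tokenization), so:

$$
\text{TF}(\text{the}, D_1) = \frac{2}{9} \approx 0.222, \qquad \text{TF}(\text{fox}, D_1) = \frac{1}{9} \approx 0.111
$$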

#### 2. Inverse Document Frequency (IDF)
The inverse document frequency $ \text{IDF}(w_i, \mathcal{D}) $ for a word $ w_i $ in a corpus $ \mathcal{D} $ is given by:

$$
\text{IDF}(w_i, \mathcal{D}) = \log \left( \frac{N}{|\{ D_j \in \mathcal{D} : w_i \in D_j \}|} \right)
$$

Where:
- $ N $ is the total number of documents in the corpus $ \mathcal{D} $.
- $ |\{ D_j \in \mathcal{D} : w_i \in D_j \}| $ is the number of documents that contain the word $ w_i $.
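
For the three sample sentences in the notebook ($ N = 3 $), and assuming the natural logarithm since the formula leaves the base open, a word found in every document gets zero weight while a word found in only one gets the largest weight:

$$
\text{IDF}(\text{the}, \mathcal{D}) = \log \frac{3}{3} = 0, \qquad
\text{IDF}(\text{quick}, \mathcal{D}) = \log \frac{3}{2} \approx 0.405, \qquad
\text{IDF}(\text{jumps}, \mathcal{D}) = \log \frac{3}{1} \approx 1.099
$$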

#### 3. TF-IDF Score
The TF-IDF score for a word $ w_i $ in a document $ D_j $ is then computed as:

$$
\text{TF-IDF}(w_i, D_j, \mathcal{D}) = \text{TF}(w_i, D_j) \times \text{IDF}(w_i, \mathcal{D})
$$
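
To illustrate with the first sample sentence (nine words, three documents in the corpus, natural logarithm assumed), the rare word "jumps" keeps a positive score while "the", which appears in all three documents, drops to zero:

$$
\text{TF-IDF}(\text{jumps}, D_1, \mathcal{D}) = \frac{1}{9} \times \log \frac{3}{1} \approx 0.122, \qquad
\text{TF-IDF}(\text{the}, D_1, \mathcal{D}) = \frac{2}{9} \times \log \frac{3}{3} = 0
$$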

#### 4. TF-IDF Matrix for the Corpus
Given a corpus $ \mathcal{D} $ with documents $ D_1, D_2, \ldots, D_m $ and a vocabulary of words $ w_1, w_2, \ldots, w_n $, the TF-IDF matrix $ T $ for the corpus is an $ m \times n $ matrix where each element is the TF-IDF score for the corresponding word in the document:

$$
T = 
\begin{bmatrix}
\text{TF-IDF}(w_1, D_1, \mathcal{D}) & \text{TF-IDF}(w_2, D_1, \mathcal{D}) & \cdots & \text{TF-IDF}(w_n, D_1, \mathcal{D}) \\
\text{TF-IDF}(w_1, D_2, \mathcal{D}) & \text{TF-IDF}(w_2, D_2, \mathcal{D}) & \cdots & \text{TF-IDF}(w_n, D_2, \mathcal{D}) \\
\vdots & \vdots & \ddots & \vdots \\
\text{TF-IDF}(w_1, D_m, \mathcal{D}) & \text{TF-IDF}(w_2, D_m, \mathcal{D}) & \cdots & \text{TF-IDF}(w_n, D_m, \mathcal{D}) \\
\end{bmatrix}
$$

### Explanation in the Context of the Code
- **Vocabulary**: The set of unique words $ w_1, w_2, \ldots, w_n $ is extracted from the documents.
- **TF-IDF Matrix**: Each document $ D_j $ is transformed into a vector $ \text{TF-IDF}(D_j) $ using the TF-IDF scores for each word in the vocabulary, forming an $ m \times n $ matrix $ T $.
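- The `TfidfVectorizer` in the code computes a matrix of this shape, where $ m $ is the number of documents and $ n $ is the number of unique words in the vocabulary. Each cell in the matrix represents the TF-IDF score for a specific word in a specific document.
- With its default settings, scikit-learn uses raw counts for TF, the smoothed IDF $ \ln \frac{1 + N}{1 + \text{df}(w_i)} + 1 $ (where $ \text{df}(w_i) $ is the number of documents containing $ w_i $), and scales each row to unit Euclidean length. The printed values therefore differ from the formulas above: "the" appears in all three documents and would score 0 under the plain formula, but scores about 0.488 in the first row of the output.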
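
The printed matrix can be reproduced by hand, which makes the difference from the plain formulas concrete. The sketch below is written under the assumption that the vectorizer runs with its defaults: lowercase tokens of two or more word characters, raw counts as TF, smoothed IDF $ \ln \frac{1 + N}{1 + \text{df}} + 1 $, and L2 normalization of each row. The helper name `smoothed_idf` is illustrative.

```python
import math
import re
from collections import Counter

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "The quick brown fox is quick and fast.",
]

# Assumed to mirror the default tokenization: lowercase, tokens of 2+ word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)


def smoothed_idf(word):
    # Smoothed variant: add one to numerator and denominator, then add one to the log.
    doc_freq = sum(word in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    # Raw count times smoothed IDF for every vocabulary word (0 if absent).
    raw = [counts[word] * smoothed_idf(word) for word in vocabulary]
    # Scale the row to unit Euclidean (L2) length.
    norm = math.sqrt(sum(value * value for value in raw))
    rows.append([value / norm for value in raw])

for row in rows:
    print([round(value, 8) for value in row])
```

Under these assumptions the rows agree with the notebook output, for example 0.48772512 for "the" in the first sentence and 0.58247387 for "quick" in the third.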
