# TF-IDF

While bag of words and one-hot encoding are straightforward and easy to implement, they have limitations. The main drawbacks are:

1) **High Dimensionality**: For a large vocabulary, the resulting vectors are very high-dimensional and sparse.

2) **Lack of Semantic Information**: The techniques do not capture any semantic relationships between words. For instance, "cat" and "dog" are represented as entirely different vectors with no indication that they are semantically related.
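
The second drawback can be seen in a minimal sketch. The three-word vocabulary and the helper names `one_hot` and `dot` are illustrative assumptions, not part of any library.

```python
# Toy vocabulary; the words and their order are illustrative assumptions.
vocabulary = ["cat", "dog", "car"]


def one_hot(word, vocabulary):
    """Return a vector with a 1 at the word's vocabulary index and 0 elsewhere."""
    return [1 if word == term else 0 for term in vocabulary]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


cat, dog, car = (one_hot(word, vocabulary) for word in vocabulary)

# Any two distinct words have a dot product of 0, so the encoding gives
# no hint that "cat" is closer in meaning to "dog" than to "car".
print(dot(cat, dog), dot(cat, car))
```

Each vector also has one entry per vocabulary word, which is where the high dimensionality comes from once the vocabulary grows.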

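TF-IDF (Term Frequency-Inverse Document Frequency) refines raw word counts rather than replacing them. It weights each count by how informative the word is across the corpus, giving more weight to rare, distinctive words and less to words that appear in almost every document. The resulting vectors are still sparse and vocabulary-sized, and related words such as "cat" and "dog" still occupy unrelated dimensions, but the weights are a better signal of which words characterize a document.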

TF-IDF combines two measures:

1) Term Frequency (TF): The frequency of a term in a document.
2) Inverse Document Frequency (IDF): A measure of how much information the word provides, based on its rarity across the entire corpus.

If a term appears in every document, its unsmoothed IDF is $\log(N/N) = 0$: the term is too common to help tell documents apart, so it contributes nothing to the score.
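
A quick numeric check of this behaviour, assuming the unsmoothed natural-log form of IDF. The function name `idf` and the corpus size of 6 are hypothetical, chosen only for illustration.

```python
import math


def idf(num_docs, doc_freq):
    """Unsmoothed IDF: natural log of (number of documents / document frequency)."""
    return math.log(num_docs / doc_freq)


num_docs = 6  # hypothetical corpus size

idf_everywhere = idf(num_docs, 6)  # term found in all 6 documents
idf_rare = idf(num_docs, 1)        # term found in a single document

print(idf_everywhere, idf_rare)
```

The rarer the term, the larger its IDF, which is what lets TF-IDF favour distinctive words.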

![tf-idf-formula.jpg](../imgs/tf-idf-formula.jpg)
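
For reference, the variant computed by hand later in this notebook can be written out as follows. This is one common formulation (length-normalized TF, natural-log IDF without smoothing); that the figure shows exactly this variant is an assumption.

$$
\text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of terms in } d},
\qquad
\text{IDF}(t) = \ln\left(\frac{N}{\text{df}(t)}\right),
\qquad
\text{TF-IDF}(t, d) = \text{TF}(t, d) \cdot \text{IDF}(t)
$$

As a worked example with the six-sentence corpus used in this notebook and whitespace tokenization, "love" occurs once among the three terms of "I love cats." and in no other document:

$$
\text{TF} = \frac{1}{3}, \qquad \text{IDF} = \ln\left(\frac{6}{1}\right) \approx 1.7918, \qquad \text{TF-IDF} \approx \frac{1.7918}{3} \approx 0.5973
$$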

Import necessary libraries

In [1]:
import numpy as np
import pandas as pd

Define the corpus with a few sample sentences

In [2]:
corpus = [
    "I love cats.",
    "I hate cats and dogs.",
    "I have a dog.",
    "The quick brown fox jumps over the lazy dog.",
    "Never odd or even.",
    "Was it a car or a cat I saw?"
]

## Function to calculate TF-IDF manually

In [3]:
def calculate_tf_idf(corpus):
    # Calculate Term Frequency (TF)
    tf = []
    for doc in corpus:
        doc_tf = {}
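        # Naive whitespace tokenization: punctuation and case are kept,
        # so "cats." and "cats" (or "The" and "the") count as different terms.
        words = doc.split()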
        for word in words:
            doc_tf[word] = doc_tf.get(word, 0) + 1
        doc_len = len(words)
        for word in doc_tf:
            doc_tf[word] /= doc_len
        tf.append(doc_tf)
    
    # Calculate Document Frequency (DF)
    df = {}
    for doc in corpus:
        words = set(doc.split())
        for word in words:
            df[word] = df.get(word, 0) + 1
    
    # Calculate Inverse Document Frequency (IDF) with the natural log and no smoothing,
    # so a term that appears in every document gets IDF = 0
    N = len(corpus)
    idf = {word: np.log(N / df[word]) for word in df}
    
    # Calculate TF-IDF
    tf_idf = []
    for doc_tf in tf:
        doc_tf_idf = {word: doc_tf[word] * idf[word] for word in doc_tf}
        tf_idf.append(doc_tf_idf)
    
    return tf_idf

In [4]:
# Calculate TF-IDF manually
manual_tfidf = calculate_tf_idf(corpus)

In [5]:
# Display the manually calculated TF-IDF
print("\nManually Calculated TF-IDF:")
for idx, doc_tf_idf in enumerate(manual_tfidf):
    print(f"Document {idx + 1}: {doc_tf_idf}")


Manually Calculated TF-IDF:
Document 1: {'I': 0.13515503603605478, 'love': 0.5972531564093516, 'cats.': 0.5972531564093516}
Document 2: {'I': 0.08109302162163289, 'hate': 0.358351893845611, 'cats': 0.358351893845611, 'and': 0.358351893845611, 'dogs.': 0.358351893845611}
Document 3: {'I': 0.1013662770270411, 'have': 0.44793986730701374, 'a': 0.27465307216702745, 'dog.': 0.27465307216702745}
Document 4: {'The': 0.19908438546978388, 'quick': 0.19908438546978388, 'brown': 0.19908438546978388, 'fox': 0.19908438546978388, 'jumps': 0.19908438546978388, 'over': 0.19908438546978388, 'the': 0.19908438546978388, 'lazy': 0.19908438546978388, 'dog.': 0.12206803207423442}
Document 5: {'Never': 0.44793986730701374, 'odd': 0.44793986730701374, 'or': 0.27465307216702745, 'even.': 0.44793986730701374}
Document 6: {'Was': 0.19908438546978388, 'it': 0.19908438546978388, 'a': 0.24413606414846883, 'car': 0.19908438546978388, 'or': 0.12206803207423442, 'cat': 0.19908438546978388, 'I': 0.04505167867868493, '
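
In the output above, "cats." and "cats" (and likewise "The" and "the") are scored as separate terms because the function splits on whitespace only. A minimal sketch of a normalizing tokenizer follows; `tokenize` is a hypothetical helper, not part of the function above.

```python
import re


def tokenize(doc):
    """Lowercase the text and return runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", doc.lower())


# Whitespace splitting keeps the trailing period on the last word.
print("I hate cats and dogs.".split())

# The normalizing tokenizer strips punctuation and merges case variants.
print(tokenize("I hate cats and dogs."))
print(tokenize("The quick brown fox jumps over the lazy dog."))
```

Replacing `doc.split()` with `tokenize(doc)` in both loops of `calculate_tf_idf` would merge these variants, and the scores would change accordingly.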

## TF-IDF with Sklearn

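With its default settings, `TfidfVectorizer` in sklearn uses the raw term count as TF. Sublinear TF scaling, which replaces the count with $1 + \log(\text{TF})$, is applied only when `sublinear_tf=True` is passed.

By default (`smooth_idf=True`), `TfidfVectorizer` computes the inverse document frequency (IDF) using the formula $$\text{IDF}(t) = \log\left(\frac{1 + N}{1 + \text{df}(t)}\right) + 1 $$

where $N$ is the total number of documents and $\text{df}(t)$ is the number of documents containing the term $t$. Each document vector is then scaled to unit L2 norm (`norm='l2'`). The default tokenizer also lowercases the text and keeps only tokens of two or more characters, so single-character words such as "I" and "a" are dropped from the vocabulary. These differences are why the values below do not match the manual calculation above.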
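
To make the link between this formula and sklearn's numbers concrete, here is a dependency-free sketch that mimics the default settings (`sublinear_tf=False`, `smooth_idf=True`, `norm='l2'`, lowercasing, tokens of two or more word characters). The helper names `tokenize` and `smoothed_tfidf` are hypothetical, and this is an approximation of the default behaviour rather than sklearn's actual implementation.

```python
import math
import re

corpus = [
    "I love cats.",
    "I hate cats and dogs.",
    "I have a dog.",
    "The quick brown fox jumps over the lazy dog.",
    "Never odd or even.",
    "Was it a car or a cat I saw?",
]


def tokenize(doc):
    # Mirrors the default token pattern: lowercase, tokens of 2+ word characters.
    return re.findall(r"(?u)\b\w\w+\b", doc.lower())


def smoothed_tfidf(corpus):
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    # Smoothed IDF: log((1 + N) / (1 + df)) + 1, using the natural log.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in docs:
        weights = {}
        for term in tokens:
            # Adding idf once per occurrence equals raw count * idf.
            weights[term] = weights.get(term, 0.0) + idf[term]
        # Scale the row to unit L2 norm (guard against an empty document).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = smoothed_tfidf(corpus)
print(rows[0])
```

For the first document this gives roughly 0.634086 for "cats" and 0.773262 for "love", matching the first row of the TF-IDF matrix printed by the notebook.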

In [6]:
# Create a TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [7]:
# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

In [8]:
# Convert the TF-IDF matrix to a DataFrame for better readability
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

In [9]:
print("TF-IDF Matrix:")
print(tfidf_df)

TF-IDF Matrix:
        and     brown       car       cat      cats       dog      dogs  \
0  0.000000  0.000000  0.000000  0.000000  0.634086  0.000000  0.000000   
1  0.521823  0.000000  0.000000  0.000000  0.427903  0.000000  0.521823   
2  0.000000  0.000000  0.000000  0.000000  0.000000  0.634086  0.000000   
3  0.000000  0.306104  0.000000  0.000000  0.000000  0.251009  0.000000   
4  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
5  0.000000  0.000000  0.419871  0.419871  0.000000  0.000000  0.000000   

       even       fox      hate  ...      lazy      love     never       odd  \
0  0.000000  0.000000  0.000000  ...  0.000000  0.773262  0.000000  0.000000   
1  0.000000  0.000000  0.521823  ...  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000   
3  0.000000  0.306104  0.000000  ...  0.306104  0.000000  0.000000  0.000000   
4  0.521823  0.000000  0.000000  ...  0.000000  0.000000  0