# TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is often used in text mining and information retrieval to identify the most significant words in documents.

### In Layman's Terms
Imagine you have a set of documents and you want to find out which words matter most in each one. Simply counting words (term frequency) isn't enough, because common words like "the" and "is" appear frequently in almost every document without telling you much about any of them. TF-IDF compensates by reducing the weight of words that are common across the corpus and increasing the weight of words that are rare in it.
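
To make the problem with raw counts concrete, here is a minimal sketch using only the standard library. The three-sentence corpus is an illustrative assumption, not real data. It counts how often each word occurs overall and in how many documents it appears.

```python
from collections import Counter

# Illustrative mini-corpus (an assumption made for this sketch).
corpus = [
    "the cat sits on the mat",
    "the dog chases the cat",
    "the mat is under the table",
]

# Raw term counts over the whole corpus.
counts = Counter(word for doc in corpus for word in doc.split())

# Number of documents each word appears in (each word counted once per document).
doc_counts = Counter(word for doc in corpus for word in set(doc.split()))

print(counts.most_common(3))
print(doc_counts["the"], doc_counts["table"])
```

"the" tops the raw counts yet appears in every document, so it says nothing about what distinguishes one document from another. A word like "table" is rare across the corpus and is therefore more informative about the document it appears in.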

### Math Behind It
1. **Term Frequency (TF)**: Measures how frequently a term appears in a document. 
   - \( \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \)

2. **Inverse Document Frequency (IDF)**: Measures how rare, and therefore how distinctive, a term is across the corpus. TF alone treats every term as equally important; IDF reduces the weight of terms that occur in many documents and increases the weight of terms that occur in few.
   - \( \text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents } N}{\text{Number of documents with term } t} \right) \)

3. **TF-IDF**: Combines the two measures.
   - \( \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) \)
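
The three formulas can be sketched directly in plain Python. This is a minimal illustration, not a production implementation: the function names are made up for this sketch, whitespace splitting stands in for real tokenization, the natural logarithm is assumed (the formula does not fix a base), and `idf` assumes the term occurs in at least one document. The corpus is the same four sentences used in the example further down.

```python
import math

def tf(term, doc_tokens):
    """Share of the document's tokens that equal `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """log(N / number of documents containing `term`), natural log assumed."""
    n_containing = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    """Product of the two measures above."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus_tokens = [doc.lower().split() for doc in [
    "The cat sits on the mat",
    "The dog plays with the cat",
    "Dogs and cats are great pets",
    "The mat is under the table",
]]

first_doc = corpus_tokens[0]
print(round(tf_idf("cat", first_doc, corpus_tokens), 4))  # (1/6) * ln(4/2) -> 0.1155
print(round(tf_idf("the", first_doc, corpus_tokens), 4))  # (2/6) * ln(4/3) -> 0.0959
```

Although "the" occurs twice as often as "cat" in the first document, it appears in three of the four documents, so its IDF is low enough that it ends up with the smaller score.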


<img src="images/tfidf.png">

### Example Code
Here's a simple implementation of TF-IDF using the `sklearn` library in Python.

### Installation
First, install `scikit-learn`, `nltk`, and `pandas` (the example imports all three) if you haven't already:
```bash
pip install scikit-learn nltk pandas
```

### Code

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
import pandas as pd
import nltk
```

```python
# Download NLTK data files (only need to run once)
nltk.download('punkt')
```

```
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gauravkandel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
```

```python
# Sample corpus
documents = [
    "The cat sits on the mat",
    "The dog plays with the cat",
    "Dogs and cats are great pets",
    "The mat is under the table"
]
```

```python
# Preprocess the documents (lowercase, tokenize, and re-join into strings)
tokenized_documents = [" ".join(word_tokenize(doc.lower())) for doc in documents]
```

```python
# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(tokenized_documents)
```

```python
# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
```

```python
# Convert the sparse TF-IDF matrix to a dense array and wrap it in a DataFrame
dense_tfidf = tfidf_matrix.toarray()
tfidf_df = pd.DataFrame(dense_tfidf, columns=feature_names)
```

```python
print("TF-IDF Matrix:")
print(tfidf_df)
```

```
TF-IDF Matrix:
        and       are       cat      cats       dog      dogs     great  \
0  0.000000  0.000000  0.357160  0.000000  0.000000  0.000000  0.000000
1  0.000000  0.000000  0.344051  0.000000  0.436384  0.000000  0.000000
2  0.408248  0.408248  0.000000  0.408248  0.000000  0.408248  0.408248
3  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000

         is       mat        on      pets     plays      sits     table  \
0  0.000000  0.357160  0.453012  0.000000  0.000000  0.453012  0.000000
1  0.000000  0.000000  0.000000  0.000000  0.436384  0.000000  0.000000
2  0.000000  0.000000  0.000000  0.408248  0.000000  0.000000  0.000000
3  0.436384  0.344051  0.000000  0.000000  0.000000  0.000000  0.436384

        the     under      with
0  0.578303  0.000000  0.000000
1  0.557077  0.000000  0.436384
2  0.000000  0.000000  0.000000
3  0.557077  0.436384  0.000000
```
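
These numbers do not come straight from the textbook formulas in the "Math Behind It" section. With its default settings, `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1, and then scales each document's row to unit L2 length. The sketch below reproduces row 0 by hand with the standard library only. The helper names are made up for this illustration, and the regular expression approximates the vectorizer's default token pattern (lowercased tokens of two or more word characters).

```python
import math
import re
from collections import Counter

documents = [
    "The cat sits on the mat",
    "The dog plays with the cat",
    "Dogs and cats are great pets",
    "The mat is under the table",
]

# Approximate the default tokenization: lowercase, tokens of 2+ word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalize the row."""
    counts = Counter(tokens)
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row0 = tfidf_row(tokenized[0])
print(round(row0["cat"], 6), round(row0["the"], 6))
```

The two printed values should agree with the `cat` and `the` entries in row 0 of the matrix above (0.357160 and 0.578303). The smoothing keeps the IDF from ever reaching zero, and the normalization is why the values in each row depend on every other term in the same document.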


```python
# Example: Get the TF-IDF score for a specific word in a specific document
doc_index = 0  # First document
word = 'cat'
word_index = feature_names.tolist().index(word)
tfidf_score = tfidf_matrix[doc_index, word_index]
```

```python
print(f"TF-IDF score for '{word}' in document {doc_index}: {tfidf_score}")
```

```
TF-IDF score for 'cat' in document 0: 0.35715970626521537
```
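
Looking up one score at a time extends naturally to ranking all terms in a document. The sketch below is a small illustration: the scores are the non-zero entries of row 0 copied from the printed matrix, and `top_terms` is a hypothetical helper name, not a scikit-learn function.

```python
# Non-zero entries of row 0, copied from the printed TF-IDF matrix.
row0_scores = {
    "cat": 0.357160,
    "mat": 0.357160,
    "on": 0.453012,
    "sits": 0.453012,
    "the": 0.578303,
}

def top_terms(scores, k=3):
    """Return the k (term, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

print(top_terms(row0_scores))
```

Here "the" still ranks first: it occurs twice in a six-word document, and with only four documents in the corpus the IDF cannot discount it very much. On a corpus this small, TF-IDF alone does not fully suppress common words.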


### Explanation
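1. **Tokenize Documents**: We lowercase each document, tokenize it with NLTK's `word_tokenize`, and join the tokens back into a single string. `TfidfVectorizer` also lowercases and tokenizes its input by default, so this step is illustrative rather than strictly required.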
2. **TF-IDF Vectorizer**: We initialize a `TfidfVectorizer` from `sklearn`.
3. **Fit and Transform**: We fit the vectorizer to the tokenized documents and transform them into a TF-IDF matrix.
4. **Feature Names**: We retrieve the feature names (words) from the vectorizer.
5. **Dense Format**: We convert the sparse TF-IDF matrix to a dense format for easier display and create a DataFrame for better readability.
6. **TF-IDF Score**: We get the TF-IDF score for a specific word in a specific document.

### Note:
In a real-world scenario, you'd typically preprocess the text further (e.g., removing stop words, stemming/lemmatization) and use a larger corpus of documents. The example above is simplified for clarity.
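
As a rough idea of what such preprocessing might look like, here is a minimal standard-library sketch that lowercases, tokenizes, and drops stop words. The `STOP_WORDS` set is a tiny hand-picked list and `preprocess` is a hypothetical helper, both assumptions for illustration; real stop-word lists are much longer, and stemming or lemmatization is left out here.

```python
import re

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "on", "with", "and", "are", "under"}

def preprocess(text):
    """Lowercase, tokenize on word characters, and drop stop words."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

print(preprocess("The cat sits on the mat"))  # ['cat', 'sits', 'mat']
```

In practice you may not need to write this yourself: `TfidfVectorizer` accepts a `stop_words` argument (for example `stop_words="english"`) and a custom `tokenizer` callable, which is one place a stemmer such as NLTK's `PorterStemmer` could be applied.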