# TF-IDF (Term Frequency – Inverse Document Frequency)

<h3>TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to a collection of documents (the corpus). A word's score rises with the number of times it appears in the document and is offset by the number of documents in the corpus that contain it.</h3>

### Detailed Explanation:

- **Term Frequency (TF):** Measures how frequently a term occurs in a document. It is the ratio of the number of times a term appears in a document to the total number of terms in the document.
  
$$
\text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$
  

- **Inverse Document Frequency (IDF):** Measures how informative a term is across the corpus. TF alone treats all terms as equally important, yet common words such as "is", "of", and "that" appear often while carrying little information. IDF therefore down-weights terms that occur in many documents and up-weights rare ones.

$$
\text{IDF}(t) = \log \left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t}\right)
$$

- **TF-IDF:** The TF-IDF score is the product of TF and IDF scores. This score helps to identify words that are important to a document but not to other documents.

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$
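The three formulas above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the toy corpus, the helper names `tf`, `idf`, and `tf_idf`, and the use of the natural logarithm are assumptions made for the example (the formula above does not fix a log base).

```python
import math

# Toy corpus, already tokenized (hypothetical example data).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "birds", "can", "fly"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (total documents / documents containing `term`).

    Assumes `term` occurs in at least one document; otherwise the
    denominator would be zero.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Product of the two scores above."""
    return tf(term, doc) * idf(term, docs)

print(round(tf("the", docs[0]), 3))            # 0.333
print(round(tf_idf("mat", docs[0], docs), 3))  # 0.183
print(round(tf_idf("cat", docs[0], docs), 3))  # 0.068
print(round(tf_idf("the", docs[0], docs), 3))  # 0.0
```

"the" is the most frequent term in the first document, but it appears in every document, so its IDF is log(1) = 0 and its TF-IDF score is zero. "mat" appears once but only in this document, so it gets the highest score of the three.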

### Advantages of TF-IDF:

- **Simple and Effective:** Easy to implement and understand.
- **Captures Importance:** Balances term frequency with inverse document frequency to highlight important words.
- **Sparse Representation:** Efficient for large datasets due to sparse matrix representation.

### Disadvantages of TF-IDF:

- **Ignores Semantics:** Does not capture the semantic meaning of words.
- **Context Insensitivity:** Treats words independently, ignoring context and word order.
- **High Dimensionality:** Each document vector has one dimension per vocabulary term, so the representation can be memory-intensive for very large vocabularies.
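The insensitivity to word order is easy to see in a small sketch. TF-IDF weights are computed from per-document term counts, so two sentences that contain the same words in a different order produce the same counts and therefore the same vector. The helper `bag_of_words` and the two example sentences below are hypothetical, and the simple `\w+` tokenizer is an assumption for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count word tokens, discarding their order."""
    return Counter(re.findall(r"\w+", text.lower()))

a = bag_of_words("The dog chased the cat.")
b = bag_of_words("The cat chased the dog.")

# The two sentences mean different things but have identical term counts,
# so any weighting built only on these counts cannot tell them apart.
print(a == b)  # True
```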

In [1]:
# import requirements

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Define the sentences
sentences = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The cat and the dog are friends.",
    "Birds can fly high in the sky."
]

sentences

['The cat sat on the mat.',
 'The dog barked at the cat.',
 'The cat and the dog are friends.',
 'Birds can fly high in the sky.']

In [3]:
# clean text if necessary
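
For these sentences little manual cleaning is needed, because `TfidfVectorizer` with default settings lowercases the text and keeps only tokens of two or more word characters, which also drops punctuation. The sketch below approximates that default behaviour with the standard library only; the helper name `tokenize` is hypothetical, and the regex is the vectorizer's documented default `token_pattern`.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, then extract tokens with the pattern above."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("The cat sat on the mat."))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(tokenize("A bird, I think!"))
# ['bird', 'think']
```

Single-character words such as "a" and "I" are discarded by this pattern, while common words such as "the" are kept because no stop-word list is applied by default.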

In [4]:
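# Create an instance of TfidfVectorizer
# Note: with default settings scikit-learn uses raw counts for TF, a smoothed
# IDF of ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row, so the values
# differ from the textbook formulas above.
vectorizer = TfidfVectorizer()
# Fit the vectorizer on the sentences and transform them into a sparse TF-IDF matrix
tfidf = vectorizer.fit_transform(sentences)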

In [5]:
# vocabulary_ maps each term to its column index; sorting the terms alphabetically lists them in column order

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)

['and', 'are', 'at', 'barked', 'birds', 'can', 'cat', 'dog', 'fly', 'friends', 'high', 'in', 'mat', 'on', 'sat', 'sky', 'the']


In [6]:
# dense matrix of TF-IDF weights: one row per sentence, one column per vocabulary term

tfidf_matrix = tfidf.toarray()
tfidf_matrix = np.round(tfidf_matrix, 2)  # round to 2 decimal places
tfidf_matrix

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.3 , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.47, 0.47, 0.47, 0.  , 0.49],
       [0.  , 0.  , 0.49, 0.49, 0.  , 0.  , 0.31, 0.39, 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.51],
       [0.44, 0.44, 0.  , 0.  , 0.  , 0.  , 0.28, 0.35, 0.  , 0.44, 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.46],
       [0.  , 0.  , 0.  , 0.  , 0.4 , 0.4 , 0.  , 0.  , 0.4 , 0.  , 0.4 ,
        0.4 , 0.  , 0.  , 0.  , 0.4 , 0.21]])

In [7]:
# create a DataFrame to show the weight of each term in each sentence

pd.DataFrame(tfidf_matrix, index=sentences, columns=vocabulary)

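| Sentence | and | are | at | barked | birds | can | cat | dog | fly | friends | high | in | mat | on | sat | sky | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The cat sat on the mat. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.47 | 0.47 | 0.47 | 0.0 | 0.49 |
| The dog barked at the cat. | 0.0 | 0.0 | 0.49 | 0.49 | 0.0 | 0.0 | 0.31 | 0.39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.51 |
| The cat and the dog are friends. | 0.44 | 0.44 | 0.0 | 0.0 | 0.0 | 0.0 | 0.28 | 0.35 | 0.0 | 0.44 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.46 |
| Birds can fly high in the sky. | 0.0 | 0.0 | 0.0 | 0.0 | 0.4 | 0.4 | 0.0 | 0.0 | 0.4 | 0.0 | 0.4 | 0.4 | 0.0 | 0.0 | 0.0 | 0.4 | 0.21 |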
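
The weights in this output do not follow directly from the plain TF and IDF formulas in the explanation section. With default settings, `TfidfVectorizer` uses the raw term count as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces the first row by hand using only the standard library; the helper names `smoothed_idf` and `tfidf_row` are hypothetical, and the tokenizer is a simplified stand-in for the vectorizer's default preprocessing.

```python
import math
import re
from collections import Counter

sentences = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The cat and the dog are friends.",
    "Birds can fly high in the sky.",
]

# Lowercase and keep tokens of two or more word characters.
docs = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
n_docs = len(docs)

# Document frequency: number of sentences that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF, then L2-normalized over the row."""
    counts = Counter(doc)
    weights = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(docs[0])
print({t: round(w, 2) for t, w in sorted(row.items())})
# {'cat': 0.3, 'mat': 0.47, 'on': 0.47, 'sat': 0.47, 'the': 0.49}
```

This matches the first row of the table. Note that "the" still receives the largest weight in that row: the smoothed IDF never drops below 1, and "the" occurs twice, so the vectorizer down-weights common terms without zeroing them out.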

