In [1]:
! pip install nltk



# Term frequency - inverse document frequency

Term frequency-inverse document frequency (TF-IDF) is another technique for representing text numerically before further analysis. It converts the text data we’re working with into numerical vectors, making it suitable for training machine-learning models. Here’s a breakdown of what TF-IDF means:

![image.png](attachment:78526b23-dcd0-433c-b61e-9c72c4736794.png)

![image.png](attachment:1f9bb24e-8277-497d-9926-c682525cf59c.png)

![image.png](attachment:52df296b-db8f-4b0c-8563-ecbd4f188def.png)
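
For reference, one common textbook formulation of the three quantities is sketched below. This is an assumption about the intended definitions: the figures above may use a different logarithm base or a smoothed IDF variant. The symbols are introduced here for illustration only: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

$$
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
$$

Under this formulation, a term that appears in every document has $\mathrm{IDF}(t) = \log 1 = 0$, so its TF-IDF score is zero no matter how often it occurs.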

# Calculating the TF-IDF score

To demonstrate the calculation of TF and IDF, consider a small corpus of just three documents, A, B, and C:

A: “The cat jumped on the table” 

B: “The dog chased the cat”

C: “The cat and the dog played together”

To calculate the TF-IDF score for the term “dog” in document B:

![image.png](attachment:404f6881-3259-426e-b380-72dc536b1f8b.png)

![image.png](attachment:c6f428d1-c6bc-4dfa-91b9-785ba9ab1a3f.png)

![image.png](attachment:84350e6e-2643-4b1e-81be-0596cbb09c9a.png)

Here’s a table that shows the TF, IDF, and TF-IDF scores for each term with respect to each document. Each row represents one unique term in the corpus (including stopwords).

![image.png](attachment:9958826a-09ea-4b24-8fd9-3807a2de8635.png)

As a general observation, terms like “jumped,” “on,” “table,” and “chased” receive higher TF-IDF scores, indicating their importance within specific documents. These terms are rare in the corpus, so they carry more weight where they do appear. In contrast, terms like “the” and “cat” receive lower TF-IDF scores because they appear in every document. They lack distinctiveness and are therefore not useful for distinguishing one document from another.
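
The calculation for “dog” in document B can be reproduced with a short sketch. It assumes TF is the term count divided by the document length and IDF is the natural logarithm of the number of documents divided by the number of documents containing the term. The figures may use a different log base, in which case the absolute values differ but the ranking of terms does not. The function names are illustrative.

```python
import math

# The three example documents, lowercased and split on whitespace
corpus = {
    "A": "the cat jumped on the table".split(),
    "B": "the dog chased the cat".split(),
    "C": "the cat and the dog played together".split(),
}

def tf(term, tokens):
    """Relative frequency of `term` within one document."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Natural log of (number of documents / documents containing `term`).

    Assumes `term` occurs in at least one document.
    """
    doc_freq = sum(term in tokens for tokens in corpus.values())
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, doc_id, corpus):
    """TF-IDF score of `term` in the document identified by `doc_id`."""
    return tf(term, corpus[doc_id]) * idf(term, corpus)

print(f"TF('dog', B)     = {tf('dog', corpus['B']):.3f}")       # 0.200
print(f"IDF('dog')       = {idf('dog', corpus):.3f}")           # 0.405
print(f"TF-IDF('dog', B) = {tf_idf('dog', 'B', corpus):.3f}")   # 0.081
print(f"TF-IDF('the', B) = {tf_idf('the', 'B', corpus):.3f}")   # 0.000
```

Here “dog” occurs once among the five tokens of document B and appears in two of the three documents, while “the” appears in all three and therefore scores zero.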

# Implementation steps

![image.png](attachment:cb85560e-400f-49fd-8d41-a15a51a6e809.png)

Both BoW and TF-IDF represent text data as numerical features. However, TF-IDF captures more information about the importance of words by assigning higher weights to terms that are rare across the corpus than to common ones. This weighting addresses a limitation of BoW, where common words can dominate the representation. As a result, TF-IDF is useful in tasks like information retrieval and classification, where the significance of terms within the text matters.



In [1]:
# import necessary libraries

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [4]:
reviews = df['text']
reviews

0     The software had a steep learning curve at fir...
1     I'm really impressed with the user interface o...
2     The latest update to the software fixed severa...
3     I encountered a few glitches while using the s...
4     I was skeptical about trying the software init...
5     The analytics features have provided us with v...
6     I appreciate the regular updates that the soft...
7     I attended a training session for the software...
8     The software documentation could be more compr...
9     I've recommended the software to colleagues du...
10    The software integration with third-party plug...
11    I'm looking forward to the upcoming release of...
12    The user community is active and supportive, m...
13    I've been using the software for a while now, ...
14    The user interface could use some modernizatio...
15    I went for a run and the software did a good j...
Name: text, dtype: object

In [16]:
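# Learn the vocabulary and IDF weights, then build the TF-IDF matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)  # sparse matrix: one row per review

# One column per vocabulary term
feature_names = vectorizer.get_feature_names_out()

# Convert to a dense DataFrame for inspection (fine for a small corpus)
tfidf_df = pd.DataFrame(X.toarray(), columns=feature_names)
print(tfidf_df)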

      about    active   address  advanced     after  analytics       and  \
0   0.00000  0.000000  0.000000  0.000000  0.284909   0.000000  0.000000   
1   0.00000  0.000000  0.000000  0.000000  0.000000   0.000000  0.177111   
2   0.00000  0.000000  0.000000  0.000000  0.000000   0.000000  0.171966   
3   0.00000  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000   
4   0.27494  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000   
5   0.00000  0.000000  0.000000  0.000000  0.000000   0.264963  0.000000   
6   0.00000  0.000000  0.000000  0.000000  0.000000   0.000000  0.153989   
7   0.00000  0.000000  0.000000  0.301491  0.000000   0.000000  0.157078   
8   0.00000  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000   
9   0.00000  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000   
10  0.00000  0.000000  0.000000  0.000000  0.000000   0.000000  0.152935   
11  0.00000  0.000000  0.270952  0.000000  0.000000   0.000000  0.000000   
12  0.00000 
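
The values in this matrix are not the plain TF × IDF products from the hand calculation. By default, `TfidfVectorizer` uses raw term counts, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit L2 length. The following is a minimal pure-Python sketch of that default weighting, applied to the three-document toy corpus rather than the reviews dataset. The helper name `sklearn_style_tfidf` is hypothetical, and the sketch mimics the documented defaults rather than reproducing the library's implementation.

```python
import math
import re
from collections import Counter

# Toy corpus from the worked example (not the reviews dataset)
docs = [
    "The cat jumped on the table",
    "The dog chased the cat",
    "The cat and the dog played together",
]

def sklearn_style_tfidf(docs):
    """Hypothetical helper mimicking TfidfVectorizer's default weighting."""
    # Default tokenization: lowercase, keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    n_docs = len(tokenized)
    vocab = sorted({t for tokens in tokenized for t in tokens})
    # Document frequency: number of documents containing each term
    df = {t: sum(t in tokens for tokens in tokenized) for t in vocab}
    # Smoothed IDF; the trailing "+ 1" keeps terms found in every document above zero
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})  # unit L2 length
    return rows

# Weights for document B, largest first
weights_b = sklearn_style_tfidf(docs)[1]
for term, weight in sorted(weights_b.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s}{weight:.3f}")
```

Under this weighting, “the” ends up with the largest weight in document B, because it occurs twice and its smoothed IDF is 1 rather than 0. This is why common words such as “and” still show nonzero values in the matrix above. Passing `stop_words="english"` to `TfidfVectorizer` is one way to drop such words before weighting.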

# Benefits and limitations

Compared with other text representation methods, TF-IDF has several benefits:

TF-IDF weighs each term by its frequency within a document and its rarity across the entire corpus, allowing it to highlight key terms that are both frequent in a document and distinctive.

It also handles common words effectively by downweighting them, focusing on informative terms that better represent the content and meaning of a document.

However, it has some limitations:

TF-IDF does not capture semantic relationships between words, which limits its ability to reflect context and meaning.

Additionally, it relies solely on frequency statistics as an indicator of term importance, which might not accurately reflect the significance of a term in certain contexts.
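
The first limitation can be seen in a small sketch. It assumes unsmoothed TF × IDF weighting (term count over document length, natural-log IDF) and three made-up two-word documents. The documents and helper names are illustrative assumptions.

```python
import math

# Made-up documents: 0 and 1 mean nearly the same thing but share no words;
# 0 and 2 share the word "great" but differ in meaning
docs = [
    "great film".split(),
    "excellent movie".split(),
    "great disaster".split(),
]
vocab = sorted({term for doc in docs for term in doc})

def tfidf_vector(tokens):
    """Unsmoothed TF x IDF vector over the shared vocabulary."""
    vec = []
    for term in vocab:
        tf = tokens.count(term) / len(tokens)
        df = sum(term in doc for doc in docs)
        vec.append(tf * math.log(len(docs) / df))
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v0, v1, v2 = (tfidf_vector(doc) for doc in docs)
print(f"'great film' vs 'excellent movie': {cosine(v0, v1):.3f}")
print(f"'great film' vs 'great disaster':  {cosine(v0, v2):.3f}")
```

The first similarity is exactly zero because the two documents share no terms, even though they are close in meaning. The second is positive purely because of the shared word “great”.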
