## TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF is a numerical statistic used in **Natural Language Processing (NLP)** to reflect how important a word is to a document in a collection or corpus.

It is commonly used in **text mining**, **information retrieval**, and as a feature for **machine learning models**.

---

### Why Use TF-IDF?

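* **TF (Term Frequency)** measures how often a term occurs in a single document.
* **IDF (Inverse Document Frequency)** lowers the weight of terms that appear in many documents and raises the weight of terms that appear in few.

A term therefore scores high only when it is frequent in one document and rare across the rest of the corpus.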

---

### TF-IDF Formula

For a term `t` in a document `d` from a corpus `D`:

#### Term Frequency (TF)

$$
TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

#### Inverse Document Frequency (IDF)

$$
IDF(t, D) = \log \left( \frac{N}{1 + DF(t)} \right)
$$

Where:

* `N` = Total number of documents in the corpus
* `DF(t)` = Number of documents containing the term `t`
* `1` is added to the denominator to avoid division by zero when a term appears in no document

#### TF-IDF Score

$$
TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)
$$
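The three formulas above can be sketched directly in plain Python. This is a minimal illustration, not the scikit-learn implementation: the function names and the toy corpus are made up for the example, and the natural logarithm is assumed because the formula does not state a base.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """TF(t, d): occurrences of `term` divided by the number of tokens in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def inverse_document_frequency(term, corpus):
    """IDF(t, D) = log(N / (1 + DF(t))), using the natural log (an assumption)."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / (1 + doc_freq))


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)


# Hypothetical toy corpus: each document is already a list of tokens.
corpus = [
    ["great", "phone", "great", "camera"],
    ["poor", "battery"],
    ["average", "camera"],
    ["helpful", "support"],
]

print(tf_idf("great", corpus[0], corpus))   # TF = 2/4, IDF = log(4/2)
print(tf_idf("camera", corpus[0], corpus))  # TF = 1/4, IDF = log(4/3)
```

With this variant of IDF, a term that occurs in every document gets a slightly negative weight, because `N / (1 + N)` is below 1.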


In [1]:
import pandas as pd
import numpy as np 

In [2]:
df = pd.read_csv('product_reviews.csv')

In [3]:
import nltk 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Anshum
[nltk_data]     Banga\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from nltk.corpus import stopwords

In [5]:
swords = stopwords.words('english')

In [6]:
# Lowercase so that "Phone" and "phone" count as the same term
df.review = df.review.apply(lambda x: x.lower())

In [7]:
import re 

In [8]:
# Remove punctuation: drop every character that is neither a word character nor whitespace
df.review = df.review.apply(lambda x: re.sub(r'[^\w\s]','',x))

In [9]:
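# word_tokenize needs NLTK's Punkt tokenizer data ('punkt', or 'punkt_tab' on recent NLTK releases)
df.review = df.review.apply(lambda x: nltk.word_tokenize(x))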

In [13]:
# Drop English stop words from each token list
df.review = df.review.apply(lambda x: [word for word in x if word not in swords])

In [14]:
df.review

0      [absolutely, love, phone, best, purchase, ever]
1              [battery, life, great, camera, average]
2    [terrible, experience, phone, stopped, working...
3      [amazing, camera, quality, smooth, performance]
4                     [worth, price, im, disappointed]
5                  [works, well, basic, tasks, gaming]
6               [highly, recommend, product, everyone]
7               [worst, phone, ever, used, full, bugs]
8                  [customer, service, helpful, quick]
9            [poor, build, quality, terrible, support]
Name: review, dtype: object
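The four preprocessing steps above (lowercase, strip punctuation, tokenize, remove stop words) can be folded into one function. The sketch below is an illustration under stated assumptions: `preprocess` and `STOP_WORDS` are hypothetical names, the stop-word set is a tiny hand-written stand-in for NLTK's English list, whitespace splitting stands in for `nltk.word_tokenize`, and the sample sentence is invented.

```python
import re

# Tiny illustrative stop-word set (an assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"i", "this", "the", "is", "a", "my", "it"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, split into tokens, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # same pattern as the notebook cell
    tokens = text.split()                # whitespace split stands in for nltk.word_tokenize
    return [token for token in tokens if token not in stop_words]


print(preprocess("I'm happy: this is the BEST phone!"))
# ['im', 'happy', 'best', 'phone']
```

Because punctuation is removed before the stop-word filter runs, a contraction such as "I'm" becomes `im`, which no longer matches any stop word. That is why `im` survives in review 4 of the output above.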

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [26]:
x = df.iloc[:,0:1]
x

| | review |
|---|---|
| 0 | [absolutely, love, phone, best, purchase, ever] |
| 1 | [battery, life, great, camera, average] |
| 2 | [terrible, experience, phone, stopped, working... |
| 3 | [amazing, camera, quality, smooth, performance] |
| 4 | [worth, price, im, disappointed] |
| 5 | [works, well, basic, tasks, gaming] |
| 6 | [highly, recommend, product, everyone] |
| 7 | [worst, phone, ever, used, full, bugs] |
| 8 | [customer, service, helpful, quick] |
| 9 | [poor, build, quality, terrible, support] |


In [18]:
tfidf = TfidfVectorizer()
tfidf

In [31]:
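# TfidfVectorizer expects one string per document, so join each token list back into a string
x = x.review.apply(lambda tokens: ' '.join(tokens))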

In [34]:
# fit_transform returns a SciPy sparse matrix; toarray() makes it dense for display
data = tfidf.fit_transform(x).toarray()

In [35]:
pd.DataFrame(data,columns=tfidf.get_feature_names_out())

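| | absolutely | amazing | average | basic | battery | best | bugs | build | camera | customer | ... | support | tasks | terrible | used | weeks | well | working | works | worst | worth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.435368 | 0.0 | 0.0 | 0.0 | 0.0 | 0.435368 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.460158 | 0.0 | 0.460158 | 0.0 | 0.0 | 0.0 | 0.391176 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.370102 | 0.0 | 0.435368 | 0.0 | 0.435368 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.474295 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.403194 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 |
| 5 | 0.0 | 0.0 | 0.0 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.447214 | 0.0 | 0.447214 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.435368 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.435368 | 0.0 | 0.0 | 0.0 | 0.0 | 0.435368 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.474295 | 0.0 | 0.0 | ... | 0.474295 | 0.0 | 0.403194 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |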


In [40]:
print(tfidf.idf_)
tfidf.get_feature_names_out().size

[2.70474809 2.70474809 2.70474809 2.70474809 2.70474809 2.70474809
 2.70474809 2.70474809 2.29928298 2.70474809 2.70474809 2.29928298
 2.70474809 2.70474809 2.70474809 2.70474809 2.70474809 2.70474809
 2.70474809 2.70474809 2.70474809 2.70474809 2.70474809 2.01160091
 2.70474809 2.70474809 2.70474809 2.70474809 2.29928298 2.70474809
 2.70474809 2.70474809 2.70474809 2.70474809 2.70474809 2.70474809
 2.29928298 2.70474809 2.70474809 2.70474809 2.70474809 2.70474809
 2.70474809 2.70474809]


44
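The printed `idf_` values do not come from the textbook formula `log(N / (1 + DF))` given earlier. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes `ln((1 + N) / (1 + DF)) + 1`, uses raw counts for TF, and then scales each row to unit Euclidean length. The sketch below reproduces the numbers by hand. The helper name `smoothed_idf` is made up for the example, and the document frequencies for review 1 are read off the token lists shown earlier, so treat them as assumptions.

```python
import math

N_DOCS = 10  # the notebook's DataFrame holds 10 reviews (rows 0-9)


def smoothed_idf(doc_freq, n_docs=N_DOCS):
    """scikit-learn's default IDF (smooth_idf=True): ln((1 + N) / (1 + DF)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# The three distinct values in tfidf.idf_ correspond to DF = 1, 2 and 3.
for doc_freq in (1, 2, 3):
    print(doc_freq, round(smoothed_idf(doc_freq), 8))

# Review 1 tokens: battery, life, great, camera, average (each occurs once).
# Assumed document frequencies: "camera" appears in 2 reviews, the rest in 1.
doc_freqs = {"battery": 1, "life": 1, "great": 1, "camera": 2, "average": 1}
raw = {term: 1 * smoothed_idf(df_t) for term, df_t in doc_freqs.items()}  # raw count x IDF

# Default norm="l2": divide each row by its Euclidean length.
length = math.sqrt(sum(weight * weight for weight in raw.values()))
row1 = {term: weight / length for term, weight in raw.items()}
print({term: round(weight, 6) for term, weight in row1.items()})
```

The same normalisation explains the `0.5` entries in rows 4 and 8: four tokens with equal IDF each end up at `1 / sqrt(4)`.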



### Advantages of TF-IDF

* **Fixed-size representation**
  Converts every document into a vector whose length equals the vocabulary size (44 terms in the example above), which is the input shape most machine learning models expect.

* **Intuitive and simple to implement**
  Based on basic statistics (counts and a logarithm), so it is easy to understand and cheap to compute.

* **Captures word importance**
  Gives higher weight to terms that are rare across the corpus and lower weight to terms that appear in most documents.


### Disadvantages of TF-IDF

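* **Sparsity**
  Each vector has one dimension per vocabulary term and most entries are zero, so large vocabularies produce high-dimensional sparse matrices.

* **Out-of-vocabulary (OOV) issues**
  Terms that were not seen when the vectorizer was fitted have no column, so they are dropped at transform time and contribute nothing to the vector.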
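Both disadvantages are easy to observe with `TfidfVectorizer` itself. The following is a small sketch on an invented three-document corpus (the documents and variable names are assumptions for illustration): the fitted matrix is a SciPy sparse matrix, and a new document made up entirely of unseen words becomes an all-zero row.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus; the vocabulary is learned from these documents only.
train_docs = ["great camera quality", "poor battery life", "great battery"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)

# Sparsity: the result is a SciPy sparse matrix with one column per vocabulary term.
print(sparse.issparse(train_matrix), train_matrix.shape)
print(sorted(vectorizer.vocabulary_))

# OOV: "stunning" and "display" were never seen during fitting, so they are dropped.
new_matrix = vectorizer.transform(["stunning display", "great display"])
print(new_matrix.toarray())
```

The first new document maps to a row of zeros. In the second, only `great` is known, so after L2 normalisation it carries the whole weight of the row.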
