<a href="https://colab.research.google.com/github/ahmedjajan93/nlp-preprocessing/blob/main/TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**TF-IDF (Term Frequency-Inverse Document Frequency) :**

**TF-IDF** is a statistical measure of how important a word is to one document relative to the rest of a collection of documents (a corpus).

**It's widely used in :**

* Information retrieval

* Text mining

* Search engines

* Document classification

**How TF-IDF Works :**

TF-IDF combines two metrics:

* Term Frequency (TF): How often a word appears in a document

* Inverse Document Frequency (IDF): How rare a word is across all documents; a word that appears in most documents gets a low weight

**Mathematical Formulation :**

**TF(t, d)** = (Number of times term t appears in document d) / (Total number of terms in document d)

**IDF(t, D)** = log(Total number of documents / Number of documents containing term t)

**TF-IDF(t, d, D)** = TF(t, d) * IDF(t, D)
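
A small worked example may make the three formulas concrete. The sketch below applies them literally to a made-up, pre-tokenized corpus. The corpus, the function names, and the choice of the natural logarithm are assumptions for illustration, since the formulas above do not fix a log base.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only); each document is pre-tokenized.
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

def tf(term, doc):
    """Share of the tokens in `doc` that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Natural log of (number of documents / documents containing `term`)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)

# Score three terms of the first document.
for term in ["good", "movie", "plot"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In the first document, `plot` scores higher than `good` even though `good` occurs twice, because `plot` appears in only one of the three documents. As written, `idf` divides by zero for a term that appears in no document, so a real implementation needs to handle that case.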

In [13]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from datasets import load_dataset
nltk.download('stopwords')
nltk.download('wordnet')

# Example: IMDB Reviews (Sentiment Analysis)
dataset = load_dataset("imdb")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [21]:
# Convert the training split (25,000 labelled reviews) to a DataFrame
df = dataset['train'].to_pandas()

In [33]:
df['text'][1]

'"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn\'t matter what one\'s political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn\'t true. I\'ve seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don\'t exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we\'re treated to the site of Vincent Gallo\'s throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the s

In [34]:
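wordlemmatizer = WordNetLemmatizer()
# Build the stop-word set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))
corpus = []
for text in df['text']:
    # Replace everything except ASCII letters with a space
    review = re.sub('[^a-zA-Z]', ' ', text)
    review = review.lower().split()
    # Drop stop words, then lemmatize what is left
    review = [wordlemmatizer.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))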
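
To see what the cleaning steps do to a single review, here is a minimal sketch on a made-up sentence. It uses a tiny hard-coded stop list (an assumption, so the example runs without NLTK downloads) and skips lemmatization; the `clean` helper and the sample text are hypothetical.

```python
import re

# Hypothetical mini stop list so the sketch runs without NLTK downloads;
# the notebook uses stopwords.words('english') instead.
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "of", "this", "was"}

def clean(text):
    """Keep ASCII letters only, lowercase, split, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

sample = "This movie was GREAT!<br />The plot_twist is a 10/10."
print(clean(sample))  # movie great br plot twist

# The look-alike range 'A-z' spans ASCII 65-122, so it also keeps [ \ ] ^ _ `
print(re.sub(r"[^a-zA-z]", " ", "plot_twist"))  # plot_twist
print(re.sub(r"[^a-zA-Z]", " ", "plot_twist"))  # plot twist
```

The stray `br` token comes from the `<br />` tags in the raw reviews, which is why `br` shows up in the fitted vocabulary. Stripping the tags before the letter filter would remove it.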

In [41]:
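from sklearn.feature_extraction.text import TfidfVectorizer
# Keep the 5,000 most frequent unigrams and bigrams across the corpus
tfidf = TfidfVectorizer(max_features=5000,ngram_range=(1,2))
# fit_transform returns a sparse matrix; .toarray() makes it dense (25,000 x 5,000 floats, about 1 GB)
X = tfidf.fit_transform(corpus).toarray()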
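
The weights in `X` will not match the formulas given earlier exactly. With its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), `TfidfVectorizer` uses the raw term count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit Euclidean length. The following is a rough pure-Python sketch of that weighting on a made-up corpus, not scikit-learn's actual implementation; the corpus and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only); each document is pre-tokenized.
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

def smoothed_idf(term, docs):
    """ln((1 + n) / (1 + df)) + 1, the smooth_idf=True variant."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf_row(doc, docs):
    """Raw count times smoothed IDF, then scale the row to unit Euclidean length."""
    counts = Counter(doc)
    weights = {t: c * smoothed_idf(t, docs) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# Weights for the first document, one line per term.
row = tfidf_row(docs[0], docs)
for term, weight in sorted(row.items()):
    print(term, round(weight, 3))
```

Because of the `+ 1` in the smoothed IDF, a term that occurs in every document still gets a non-zero weight, whereas the plain `log(N / df)` formula would give it zero.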

In [26]:
import numpy as np
np.set_printoptions(edgeitems=30, linewidth=100000,
    formatter=dict(float=lambda x: "%.3g" % x))

In [40]:
tfidf.vocabulary_

{'rented': np.int64(3653),
 'curious': np.int64(1058),
 'yellow': np.int64(4986),
 'video': np.int64(4768),
 'store': np.int64(4253),
 'surrounded': np.int64(4367),
 'first': np.int64(1707),
 'released': np.int64(3623),
 'also': np.int64(143),
 'heard': np.int64(2051),
 'ever': np.int64(1514),
 'tried': np.int64(4611),
 'enter': np.int64(1468),
 'country': np.int64(987),
 'therefore': np.int64(4481),
 'fan': np.int64(1622),
 'film': np.int64(1689),
 'considered': np.int64(920),
 'controversial': np.int64(950),
 'really': np.int64(3567),
 'see': np.int64(3893),
 'br': np.int64(507),
 'plot': np.int64(3300),
 'centered': np.int64(681),
 'around': np.int64(239),
 'young': np.int64(4992),
 'swedish': np.int64(4384),
 'drama': np.int64(1316),
 'student': np.int64(4285),
 'named': np.int64(2944),
 'lena': np.int64(2554),
 'want': np.int64(4824),
 'learn': np.int64(2537),
 'everything': np.int64(1519),
 'life': np.int64(2576),
 'particular': np.int64(3181),
 'focus': np.int64(1737),
 'attenti