## Feature Engineering for Machine Learning - Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that measures how important a word is to a document within a collection of documents. 

It weights each word by how often it appears in a specific document and by how rare it is across all documents.

Think of it as a smart scoring system that identifies words that are both:

- Frequent in a particular document (common locally)

- Rare in other documents (distinctive globally)

Simple Analogy:
Imagine you're analyzing book reviews. The word "book" appears in every review (not unique), but "terrible" appears frequently in negative reviews and rarely in positive ones. TF-IDF gives "terrible" a high score in negative reviews because it's both frequent there and rare elsewhere.

The Two Components:

1. Term Frequency (TF): How often a word appears in a specific document

- "amazing" appears 5 times in Review A

- Higher TF = more relevant to that document

2. Inverse Document Frequency (IDF): How rare the word is across all documents

- "amazing" appears in only 2% of all reviews

- Higher IDF = rarer, more distinctive word

TF-IDF = TF × IDF
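One common way to write the two components as formulas is sketched below. The symbols are assumptions introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}_t$ is the number of documents that contain $t$.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
$$

Under this plain definition, a term that occurs in every document has $\mathrm{idf}(t) = \ln 1 = 0$. Libraries vary in the details: scikit-learn's `TfidfVectorizer`, for example, uses raw counts for TF, a smoothed IDF of $\ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$, and L2-normalizes each document vector by default, so its numbers will not match the plain formula exactly.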



Key Point:

TF-IDF automatically discounts common words (like "the", "and") that appear everywhere, while highlighting words that are characteristic of specific documents.

Example:

In a collection of movie reviews:

- Word: "cinematography"

    - High TF in reviews about visually stunning films

    - High IDF because it rarely appears in other reviews

    - Result: High TF-IDF score - this word is very important for identifying reviews about visual quality

- Word: "movie"

    - High TF in many reviews

    - Very low IDF because it appears in almost every review

    - Result: Low TF-IDF score - this word is not distinctive
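The contrast between the two words can be reproduced with a few lines of plain Python. This is a minimal sketch on a hypothetical three-review corpus, using the plain (unsmoothed) definition TF × IDF with a natural logarithm; the function names `tf`, `idf`, and `tf_idf` are illustrative and not part of any library.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tiny, already-tokenized "reviews".
docs = [
    "the movie had stunning cinematography and cinematography awards".split(),
    "the movie was long".split(),
    "the movie was funny".split(),
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Plain IDF: ln(N / df). Assumes `term` occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    """TF-IDF score of `term` in `doc`, relative to `corpus`."""
    return tf(term, doc) * idf(term, corpus)

for word in ("cinematography", "movie"):
    print(word, round(tf_idf(word, docs[0], docs), 3))
# prints:
# cinematography 0.275
# movie 0.0
```

"movie" appears in all three reviews, so its IDF is ln(3/3) = 0 and its score vanishes no matter how often it occurs. "cinematography" appears in only one review, so its IDF is ln(3) and its score is high in that review.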

Why it's used:

TF-IDF is used to identify the words that best characterize each document. Common applications include:

- Search Engines: Ranking search results by relevance

- Document Classification: Identifying key features that distinguish categories

- Information Retrieval: Finding the most important terms in large text collections

- Text Summarization: Extracting key phrases from documents

TF-IDF is essentially a refinement of Bag of Words: instead of using raw counts, it reweights each count so that words distinctive to a document matter more than words that appear everywhere.

## TF-IDF Code Implementation

In [1]:
import pandas as pd


In [3]:
# Tab-separated file; quoting=3 (csv.QUOTE_NONE) keeps quote characters as literal text
train_data = pd.read_csv('(bow)labeledTrainData.csv', header=0, delimiter="\t", quoting=3)

In [4]:
train_data.shape

(25000, 3)

In [5]:
train_data.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |


### Data cleaning and preprocessing

In [6]:
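import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Stopword list used for filtering below
nltk.download('stopwords')
# WordNetLemmatizer also needs the 'wordnet' corpus; run this once if it is missing:
# nltk.download('wordnet')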


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/aljebra/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Define the fraction of the dataset to use (e.g., 10%)
fraction = 0.10 

# Calculate the number of rows to use
num_rows = int(len(train_data) * fraction)

# Select a random sample of rows
train_data = train_data.sample(n=num_rows, random_state=42) 

In [8]:
lematization = WordNetLemmatizer()

In [9]:
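processed_word = []
# Build the stopword set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))

for sentence in train_data.index:
    # Replace every non-letter character with a space, then split into words
    review = re.sub('[^a-zA-Z]', " ", train_data.loc[sentence, 'review'])
    review = review.split()
    # The stopword check runs before lowercasing, so capitalized stopwords ("I", "The") are kept
    lemmatized_word = [lematization.lemmatize(word.lower()) for word in review if word not in stop_words]
    processed_word.append(" ".join(lemmatized_word))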



In [10]:
processed_word[:10]

['i read there girl my soup came peter seller low period watching movie i surprised almost nothing happens movie seemingly presence seller goldie hawn help movie the whole movie seems like randomly filmed whatever happened without scripting anything maybe i seen every movie middle aged elderly people trying hippy one give movie pretty bad name br br all seller hawn starred much better movie waste time pretty worthless',
 'this film pull get go grab attention acknowledging yeah story opening clich funeral br br in hand judi i given material done the great reunion famous pick one please team army platoon theatre group singer band br br but movie never stoop cheap sentimentalization think going swoop another direction a case point flower sent admirer judi br br the band member interesting group ride clich one jail one found religion one alkie one sunk dementia but joie de vivre rediscovered judi ignited granddaughter interest carry u along make u overlook sometimes simplistic nature plot 
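The output above still contains tokens such as "i", "the", "this", and "br". The first three survive because the stopword check in the loop runs on the original-case word ("I", "The", "This"), and "br" is most likely left over from `<br />` HTML tags in the raw reviews. A minimal sketch of a variant that strips tags and lowercases before filtering is shown below. It is not the notebook's method: `clean_review` and `STOP_WORDS` are hypothetical names, the stopword set is a tiny stand-in so the example runs without NLTK data, and lemmatization is left out for brevity.

```python
import re

# Hypothetical stand-in for stopwords.words('english'), kept tiny so the
# sketch runs without downloading NLTK data.
STOP_WORDS = {"i", "the", "this", "a", "is", "was", "it", "and"}

def clean_review(text, stop_words=STOP_WORDS):
    """Strip HTML tags, keep letters only, lowercase, then drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)     # remove tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)   # replace non-letters with spaces
    tokens = text.lower().split()            # lowercase BEFORE the stopword check
    return " ".join(tok for tok in tokens if tok not in stop_words)

print(clean_review("I read it.<br /><br />The movie was pretty bad!"))
# prints: read movie pretty bad
```

In the notebook, the same ordering could be applied by lowercasing each word before comparing it against the stopword list.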

In [11]:
# Keep only the 100 terms with the highest total frequency across the corpus
vectorizer = TfidfVectorizer(max_features=100)

In [12]:
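# Learn the vocabulary and IDF weights, then convert the sparse result to a dense array
# (one row per review, one column per term)
X = vectorizer.fit_transform(processed_word).toarray()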