# **Sentiment Analysis on [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data)**

## 🚀 **Let's Connect!**
<p align="left"> <a href="https://github.com/chiragpc2004" target="_blank"> <img src="https://img.shields.io/badge/GitHub-%23181717.svg?&style=for-the-badge&logo=github&logoColor=white" alt="GitHub"/> </a> <a href="https://www.linkedin.com/in/chiragpc2004/" target="_blank"> <img src="https://img.shields.io/badge/LinkedIn-%230077B5.svg?&style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn"/> </a> <a href="https://mail.google.com/mail/?view=cm&fs=1&to=chiragpc2004@gmail.com" target="_blank"> <img src="https://img.shields.io/badge/Gmail-%23D14836.svg?&style=for-the-badge&logo=gmail&logoColor=white" alt="Gmail"/> </a> </p>

## **Text Preprocessing of IMDB Movie Reviews**

### **Preprocessing**

Before the review text can be fed into a machine learning model, it must be cleaned and transformed into a suitable format. The key preprocessing steps are:

- **Remove duplicates**: Duplicate reviews can bias the model by over-representing some data points. Removing duplicates ensures each review is unique.

- **Remove HTML tags**: Some reviews contain HTML tags (such as `<br />` or `<p>`) that are not useful for analysis. We use BeautifulSoup to extract only the text content.

- **Convert to lowercase**: Text data can have the same word in different cases (e.g., “Good” and “good”). Converting everything to lowercase makes the data uniform and helps in consistent processing.

- **Remove special characters and punctuation**: Characters like `!`, `?`, `#`, or numbers may not carry useful information for sentiment analysis and can add noise. Removing them cleans the text.

- **Remove stopwords**: Stopwords are common words like “the”, “is”, “and”, which usually do not contribute much to the sentiment or meaning. Removing them reduces noise and focuses the model on important words.

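- **Lemmatization**: Words often appear in different forms (e.g., “running”, “ran”, “runs”). Lemmatization converts these variants to their base or dictionary form (e.g., “run”), so the model treats them as the same word. NLTK's `WordNetLemmatizer` treats each word as a noun unless a part-of-speech tag is supplied, so verb forms such as “running” and “ran” are only reduced to “run” when it is called with `pos='v'`.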

- **Tokenization**: This is the process of breaking down the text into smaller units called tokens (usually words). Models work on these tokens instead of raw text.

- **Vectorization**: After tokenization, the tokens need to be converted into a numerical format so they can be processed by machine learning models. This can be done using methods like:
  - **TF-IDF** (for traditional machine learning models)
  - **Integer encoding** (e.g., `Tokenizer.texts_to_sequences()` for deep learning models)


- **Padding sequences**: Since reviews vary in length, sequences are padded (with zeros) to make all inputs the same length. This is important for batch processing in models like LSTMs.

These preprocessing steps transform raw text reviews into a clean, consistent, and machine-readable format suitable for various machine learning algorithms.
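The last two steps in the list, integer encoding and padding, can be illustrated in plain Python. The sketch below is a simplified stand-in for what `Tokenizer.texts_to_sequences()` and `pad_sequences()` do, not the Keras implementation: the helper names (`fit_vocabulary`, `texts_to_sequences`, `pad`) and the two sample reviews are hypothetical. It assumes words are ranked by frequency starting at 1, index 0 is reserved for padding, and zeros are added at the front of short sequences.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each word to an integer, most frequent word first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, vocab):
    """Replace every known word with its index; unknown words are skipped."""
    return [[vocab[w] for w in text.split() if w in vocab] for text in texts]

def pad(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`, keeping at most the last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

reviews = ["great movie great cast", "boring movie"]  # hypothetical cleaned reviews
vocab = fit_vocabulary(reviews)
sequences = texts_to_sequences(reviews, vocab)
padded = pad(sequences, maxlen=5)

print(vocab)
print(padded)
```

Every row of `padded` has the same length, which is what allows reviews of different lengths to be stacked into one batch.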

#### Import statements

In [1]:
# Core and Data Handling
import pandas as pd
import numpy as np
import re

# Text Preprocessing
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Encoding and Feature Extraction
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer # type: ignore
from tensorflow.keras.preprocessing.sequence import pad_sequences # type: ignore

In [2]:
# Read the dataset
df = pd.read_csv("D:/imdb-sentiment-classifier/data/raw/IMDB_Dataset.csv")
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [3]:
# Work on an explicit copy so that later column assignments cannot trigger a SettingWithCopyWarning
df = df.copy()

In [4]:
# Drop duplicates
df = df.drop_duplicates()
print("Total number of reviews after dropping duplicates: ", df.shape[0])

Total number of reviews after dropping duplicates:  49582


In [5]:
# Encoding sentiment column
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])
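The encoding cell does not show which integer each label receives. `LabelEncoder` numbers classes in sorted order, so, assuming `positive` and `negative` are the only two labels in the column, the mapping can be reproduced with a small plain-Python sketch (the `labels` list is made-up sample data):

```python
labels = ["positive", "negative", "positive", "negative", "positive"]

# Sorted unique labels get consecutive integer codes, mirroring LabelEncoder's ordering
classes = sorted(set(labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}
encoded = [label_to_id[label] for label in labels]

print(label_to_id)  # {'negative': 0, 'positive': 1}
print(encoded)      # [1, 0, 1, 0, 1]
```

In the notebook itself, `le.classes_` lists the labels in the same order as their codes.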

In [15]:
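# Remove HTML tags; separator=" " keeps the words on either side of a tag (e.g. <br />) from being fused together
df['review'] = df['review'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text(separator=" "))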
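One detail of tag removal is easy to miss: when `<br />` tags sit directly between two sentences, dropping the tags without inserting a separator glues the neighbouring words together. The sketch below shows this with Python's standard `html.parser` as a dependency-free stand-in for BeautifulSoup. `TextExtractor` and `strip_tags` are hypothetical helper names, and the sample string is made up.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html, separator=""):
    """Return the text content of `html`, joining text nodes with `separator`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)

sample = "A wonderful little production.<br /><br />The filming technique is great."

joined = strip_tags(sample)                 # text nodes concatenated directly
spaced = strip_tags(sample, separator=" ")  # text nodes separated by a space

print(joined)
print(spaced)
```

In `joined`, "production." and "The" become a single token once punctuation is removed later. BeautifulSoup's `get_text()` accepts a `separator` argument that avoids this in the same way.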

In [7]:
# Convert all reviews to lowercase
df['review'] = df['review'].str.lower()

In [8]:
# Keep only letters and whitespace (drops punctuation, digits, and other special characters)
df['review'] = df['review'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
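To see what this pattern does to contractions, digits, and hyphenated words, here is a small sketch on a made-up sentence. It compares deleting the unwanted characters, as the cell above does, with replacing them by a space. The second option is only an alternative for comparison, not what the notebook does.

```python
import re

PATTERN = r'[^a-zA-Z\s]'  # same pattern as the cleaning cell
sample = "I didn't like it... 10/10 for well-made sets!".lower()

stripped = re.sub(PATTERN, '', sample)  # delete unwanted characters
spaced = re.sub(PATTERN, ' ', sample)   # alternative: turn them into spaces

print(stripped.split())  # ['i', 'didnt', 'like', 'it', 'for', 'wellmade', 'sets']
print(spaced.split())    # ['i', 'didn', 't', 'like', 'it', 'for', 'well', 'made', 'sets']
```

Deleting characters fuses "well-made" into "wellmade" and turns "didn't" into "didnt", while replacing with a space splits them into separate tokens. Which behaviour is preferable depends on how the later stopword and vectorization steps treat those tokens.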

#### Remove Stopwords

In [9]:
# Download the NLTK stopword list and drop every English stopword from each review
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
df['review'] = df['review'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

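NLTK's English stopword list includes negation words such as "not" and "no", which can matter for sentiment. The sketch below uses a tiny hypothetical stopword set (`demo_stop_words`, not the real NLTK list) to show the effect of removing them, and one possible way to keep them. Whether to keep negations is a modelling choice that this notebook does not make.

```python
# Hypothetical miniature stopword list, standing in for stopwords.words('english')
demo_stop_words = {"the", "is", "and", "a", "this", "not", "no"}
negations = {"not", "no", "nor", "never"}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated word that appears in `stop_words`."""
    return " ".join(word for word in text.split() if word not in stop_words)

review = "this movie is not good"

plain = remove_stopwords(review, demo_stop_words)             # negation removed
kept = remove_stopwords(review, demo_stop_words - negations)  # negation preserved

print(plain)  # movie good
print(kept)   # movie not good
```

With the full list, "not good" and "good" reduce to the same token, so a bag-of-words model can no longer tell them apart.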


#### Lemmatization

In [10]:
# Download the WordNet data required by WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()



In [11]:
def lemmatize_text(text):
    # lemmatize() defaults to pos='n': "movies" -> "movie", but verb forms like "running" stay as-is
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

df['review'] = df['review'].apply(lemmatize_text)

### Preprocessed Data

In [14]:
# Save the cleaned dataset without writing the index column
df.to_csv("../data/processed/cleaned_data.csv", index=False)
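The TF-IDF option mentioned in the preprocessing list is not run in the cells shown here. As a rough illustration of the idea, the sketch below computes the textbook weighting, term frequency times `log(N / document frequency)`, in plain Python on three made-up documents. The function name `tfidf` is hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalisation by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs
    df_counts = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df_counts[word])
            for word, count in tf.items()
        })
    return weights

docs = ["great movie great cast", "boring movie", "movie fun"]  # hypothetical cleaned reviews
weights = tfidf(docs)

for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

A word that occurs in every document ("movie" here) gets weight 0, while words concentrated in a single review receive the largest weights.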