# 🎬 Sentiment Analysis on IMDB Movie Reviews

## 📌 Project Overview
This project applies **Natural Language Processing (NLP)** and **Machine Learning** techniques to classify IMDB movie reviews as **Positive (1)** or **Negative (0)**.  
The pipeline includes **data preprocessing, feature extraction, multiple model training, performance evaluation, and visualization** to identify the best-performing sentiment classification model.

## 📊 Dataset
- **Source**: IMDB Dataset of 50K Movie Reviews (Kaggle)  
- **Size**: 50,000 labeled reviews  
- **Features**:  
  - `review`: text content of movie reviews  
  - `sentiment`: target label (0 = Negative, 1 = Positive)  

## ⚙️ Project Features

### 🔍 Data Preprocessing
- Removed **HTML tags, punctuation, numbers, and special characters**  
- Converted text to lowercase  
- Applied **tokenization, stopword removal, and lemmatization**  
- Shuffled the dataset before splitting (the two sentiment classes are already evenly represented)  
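
The cleaning steps listed above can be sketched on a single review using only the standard library. This is a minimal illustration, not the project's full function: `clean_review` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's English stopword list), and lemmatization is left out.

```python
import re

# Hypothetical mini stopword list, for illustration only.
# The notebook itself uses NLTK's English stopword list.
STOP_WORDS = {"the", "a", "an", "is", "was", "this", "it", "and", "i", "of"}

def clean_review(text):
    """Strip HTML, lowercase, keep letters only, tokenize, drop stopwords."""
    text = re.sub(r"<.*?>", " ", text)   # remove HTML tags such as <br />
    text = text.lower()                  # normalize case
    text = re.sub(r"[^a-z]", " ", text)  # replace digits and punctuation with spaces
    tokens = text.split()                # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "This movie was GREAT!<br />I rated it 10/10."
cleaned = clean_review(sample)
print(cleaned)  # ['movie', 'great', 'rated']
```

Lowercasing before the letters-only filter means the pattern only needs to match `a-z`.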

### 🧩 Feature Extraction
- **Bag-of-Words (CountVectorizer)**  
- **TF-IDF Vectorization** for SVM model  
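
To make the difference between the two representations concrete, here is a small pure-Python sketch on a toy two-document corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn's `TfidfVectorizer` applies by default; the corpus and helper names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, for illustration only
docs = ["great movie great cast", "boring movie"]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Bag-of-Words: raw term counts per document
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: number of documents containing each term
n_docs = len(docs)
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency (assumed to match the sklearn default)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(counts):
    """Weight counts by idf, then scale the row to unit L2 length."""
    weighted = [c * idf[t] for c, t in zip(counts, vocab)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm for w in weighted]

tfidf = [tfidf_row(row) for row in bow]

print(vocab)  # ['boring', 'cast', 'great', 'movie']
print(bow)    # [[0, 1, 2, 1], [1, 0, 0, 1]]
print([round(w, 3) for w in tfidf[1]])
```

In the second document, "boring" and "movie" have the same count, but TF-IDF gives "movie" a lower weight because it appears in every document.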

### 🤖 Machine Learning Models
1. **Naïve Bayes**  
   - Simple baseline model  
   - Moderate accuracy, quick training  

2. **Logistic Regression**  
   - Strong linear model  
   - Performed well with balanced precision/recall  

3. **Support Vector Machine (SVM)**  
   - Implemented with TF-IDF features  
   - Trained on a 10,000-review sample to keep training time manageable  

### 📊 Model Evaluation
- Metrics: **Accuracy, Precision, Recall, F1-Score, Confusion Matrix, ROC-AUC**  
- Side-by-side comparison of all models  
- Ranking system based on key metrics (Accuracy, F1, ROC-AUC)  
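
As a reference for how these metrics relate, the sketch below derives accuracy, precision, recall, and F1 from the four confusion-matrix counts, and computes ROC-AUC as the probability that a randomly chosen positive outranks a randomly chosen negative. All numbers are hypothetical and the function names are illustrative. It also shows that ROC-AUC computed from hard 0/1 predictions can differ from ROC-AUC computed from continuous scores.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts, not results from this project
acc, pre, rec, f1 = binary_metrics(tn=40, fp=10, fn=5, tp=45)
print(round(acc, 3), round(pre, 3), round(rec, 3), round(f1, 3))

# Hypothetical labels and model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded predictions

auc_from_scores = roc_auc(labels, scores)  # 0.75
auc_from_labels = roc_auc(labels, hard)    # 0.5
print(auc_from_scores, auc_from_labels)
```

Thresholding discards the ranking information that ROC-AUC is designed to measure, which is why the two values diverge.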

## 📈 Visualizations
- 📌 **Confusion Matrix Heatmap** (for best-performing model)  
- 📌 **Word Clouds** for Positive and Negative reviews  
- 📌 **Histograms** of review length (Positive vs Negative)  
- 📌 **Bar Chart & Histogram of Review Length Distribution** by categories (Short, Medium, Long)  

## 🏆 Results & Best Model
- **Naïve Bayes** → Good baseline, fast but less accurate  
- **Logistic Regression** → Reliable, balanced performance  
- **SVM** → Delivered the **highest overall performance**  

✅ **Best Model Selected: Support Vector Machine (SVM)**  
- Accuracy: ~**89%**  
- Precision/Recall: High & consistent  
- ROC-AUC: ~**90%** (strong classification power)

## 📌 Conclusion
This project demonstrates the power of **text preprocessing + ML models** in sentiment classification.  
The **SVM model with TF-IDF features** proved to be the best choice for analyzing IMDB reviews, achieving strong accuracy and robust evaluation scores.


## Load Training and Testing Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix,roc_auc_score

#load csv
df=pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
print("\nOriginal Data: ",df.head(5))

#check for null values
print(df.isnull().sum())

#convert sentiment to numeric
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment']) #0 for negative and 1 for positive
print("\nAfter Encoding")
print(df.head(5))

#shuffle the dataframe
df=df.sample(frac=1, random_state=42).reset_index(drop=True) #fixed seed keeps the shuffle reproducible


## Data Preprocessing (Tokenization + Stopwords removal + Lemmatization)

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Convert text to string
df['review'] = df['review'].astype(str)

# Before preprocessing print data
print("\nBefore Preprocessing")
print(df.head(2))

# Build the stopword set and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Text preprocessing function
def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation, numbers, and special characters (keep only letters)
    text = re.sub(r'[^a-z]', ' ', text)
    
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Tokenization
    tokens = text.split()
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Join tokens back into string
    return " ".join(tokens)

# Apply preprocessing
df['review'] = df['review'].apply(preprocess_text)

# After preprocessing print data
print("\nAfter Preprocessing")
print(df.head(5))

## Feature Extraction & Naïve Bayes Prediction

In [None]:
#features
X=df['review']

#target
y=df['sentiment']


#convert text into numerical features
vectorizer=CountVectorizer()
X_counts=vectorizer.fit_transform(df['review'])

#Train,test and split data
X_train,X_test,y_train,y_test=train_test_split(X_counts,y,test_size=0.3,random_state=42)

#Naive Bayes model
model=MultinomialNB()
model.fit(X_train,y_train)

#prediction
y_pred=model.predict(X_test)
print("===========Model Naive Bayes=============")
print("First 10 predicted sentiments: ",y_pred[:10])

#metrics evaluation
nb_acc=accuracy_score(y_test, y_pred)
nb_pre=precision_score(y_test, y_pred)
nb_rec=recall_score(y_test, y_pred)
nb_f1=f1_score(y_test, y_pred)
nb_cm=confusion_matrix(y_test, y_pred)
#ROC-AUC here is computed from hard 0/1 predictions; pass model.predict_proba(X_test)[:, 1] for a threshold-independent score
nb_roc=roc_auc_score(y_test, y_pred)

print("==========================================")
print("============Model Performance=============")
print("Accuracy:",nb_acc)
print("Precision:",nb_pre)
print("Recall:",nb_rec)
print("F1-Score:",nb_f1)
print("Confusion Matrix:\n",nb_cm)
print("ROC-AUC:",nb_roc)


## Logistic Regression Prediction

In [None]:
#logistic regression
reg=LogisticRegression(max_iter=1000)
reg.fit(X_train,y_train)

#predictions
reg_pred=reg.predict(X_test)
print("===========Model Logistic Regression=============")
print("First 10 predicted sentiments: ",reg_pred[:10])

#metrics evaluation
lr_acc=accuracy_score(y_test, reg_pred)
lr_pre=precision_score(y_test, reg_pred)
lr_rec=recall_score(y_test, reg_pred)
lr_f1=f1_score(y_test, reg_pred)
lr_cm=confusion_matrix(y_test, reg_pred)
#ROC-AUC here is computed from hard 0/1 predictions; pass reg.predict_proba(X_test)[:, 1] for a threshold-independent score
lr_roc=roc_auc_score(y_test, reg_pred)

print("==========================================")
print("============Model Performance=============")
print("Accuracy:",lr_acc)
print("Precision:",lr_pre)
print("Recall:",lr_rec)
print("F1-Score:",lr_f1)
print("Confusion Matrix:\n",lr_cm)
print("ROC-AUC:",lr_roc)

## SVM Predictions

In [None]:
# STRATEGIC SAMPLING: Take a smaller representative sample 
sample_size = 10000  
df_sampled = df.sample(n=sample_size, random_state=42, replace=False)

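# Only the SVM is trained on this sample; Naive Bayes and Logistic Regression above used the full dataset.
# Note: this cell overwrites vectorizer, X_counts, X_train, X_test, y_train and y_test.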
X_sampled = df_sampled['review']
y_sampled = df_sampled['sentiment']

print(f"Working with a manageable sample of {sample_size} reviews for model comparison.")

# Convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000)
X_counts = vectorizer.fit_transform(X_sampled)

# Train, test and split data
X_train, X_test, y_train, y_test = train_test_split(X_counts, y_sampled, test_size=0.3, random_state=42)

# SVM model

svm = SVC(random_state=42)  
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)


# Predictions
print("===========Model Support Vector Machine=============")
print("First 10 predictions: ", svm_pred[:10])
print("Actual first 10 labels:", y_test.values[:10])

# Metrics evaluation
svm_acc = accuracy_score(y_test, svm_pred)
svm_pre = precision_score(y_test, svm_pred)
svm_rec = recall_score(y_test, svm_pred)
svm_f1 = f1_score(y_test, svm_pred)
svm_cm = confusion_matrix(y_test, svm_pred)
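# ROC-AUC here is computed from hard 0/1 predictions; pass svm.decision_function(X_test) for a threshold-independent score
svm_roc = roc_auc_score(y_test, svm_pred)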

print("==========================================")
print("============Model Performance=============")
print(f"Accuracy: {svm_acc:.4f}")
print(f"Precision: {svm_pre:.4f}")
print(f"Recall: {svm_rec:.4f}")
print(f"F1-Score: {svm_f1:.4f}")
print("Confusion Matrix:\n", svm_cm)
print(f"ROC-AUC: {svm_roc:.4f}")

# Additional useful info
print(f"\nAdditional Info:")
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")

## Comparing All Models and Choosing the Best One
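
The selection in this section ranks every model on each key metric, sums the ranks, and picks the model with the lowest total. A minimal pure-Python sketch of that rank-sum idea follows. The model names and scores are made up for illustration, and ties are not averaged here, unlike the default behavior of pandas `rank`.

```python
# Hypothetical scores, for illustration only (not results from this project)
scores = {
    "Model A": {"Accuracy": 0.85, "F1-Score": 0.84, "ROC-AUC": 0.91},
    "Model B": {"Accuracy": 0.88, "F1-Score": 0.88, "ROC-AUC": 0.94},
    "Model C": {"Accuracy": 0.87, "F1-Score": 0.89, "ROC-AUC": 0.93},
}
metrics = ["Accuracy", "F1-Score", "ROC-AUC"]

def rank_sum(scores, metrics):
    """Sum each model's per-metric rank (1 = best); lower totals are better."""
    totals = {name: 0 for name in scores}
    for metric in metrics:
        ordered = sorted(scores, key=lambda name: scores[name][metric], reverse=True)
        for position, name in enumerate(ordered, start=1):
            totals[name] += position
    return totals

totals = rank_sum(scores, metrics)
best = min(totals, key=totals.get)
print(totals)  # {'Model A': 9, 'Model B': 4, 'Model C': 5}
print(best)    # Model B
```

Summing ranks weights each metric equally, so a model that wins two of three metrics can still come out ahead of one that wins only the third.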

In [None]:
results=pd.DataFrame({
    'Model':['Naive Bayes','Logistic Regression','Support Vector Machine'],
    'Accuracy':[nb_acc,lr_acc,svm_acc],
    'Precision':[nb_pre,lr_pre,svm_pre],
    'Recall':[nb_rec,lr_rec,svm_rec],
    'F1-Score':[nb_f1,lr_f1,svm_f1],
    'Confusion Matrix':[nb_cm,lr_cm,svm_cm],
    'ROC-AUC':[nb_roc,lr_roc,svm_roc]
})

print("============== 📊 Model Comparison Table ==============")
print(results.to_string())

# Rank models based on metrics (lower rank = better)
metrics_to_rank = ["Accuracy", "F1-Score", "ROC-AUC"]
for metric in metrics_to_rank:
    results[f"{metric}_Rank"] = results[metric].rank(ascending=False)

# Calculate overall rank
results["Overall_Rank"] = results[[f"{m}_Rank" for m in metrics_to_rank]].sum(axis=1)

# Sort by overall rank
results_sorted = results.sort_values("Overall_Rank")
print("\n============== 🏆 Ranked Models ==============")
print(results_sorted.to_string())

# Select best model
best_model = results_sorted.iloc[0]["Model"]
print("\n🏆 Best Model Selected:", best_model)

## Visualizations

## Confusion Matrix for the Best Model (SVM)

In [None]:
# Normalize to percentage
svm_cm_percent = svm_cm.astype("float") / svm_cm.sum() * 100

plt.figure(figsize=(7, 8))
plt.imshow(svm_cm_percent, interpolation="nearest", cmap="ocean")
plt.title("SVM - Confusion Matrix (%)", fontsize=14, fontweight="bold")
plt.colorbar(label="Percentage")

# Axis labels
plt.xticks([0, 1], ["Predicted Negative (0)", "Predicted Positive (1)"], fontsize=10)
plt.yticks([0, 1], ["Actual Negative (0)", "Actual Positive (1)"], fontsize=10)
plt.ylabel("Actual Class", fontsize=12)
plt.xlabel("Predicted Class", fontsize=12)

# Annotate cells with percentage
for i in range(svm_cm_percent.shape[0]):
    for j in range(svm_cm_percent.shape[1]):
        plt.text(j, i, f"{svm_cm_percent[i, j]:.2f}%", 
                 ha="center", va="center", color="black", fontsize=11, fontweight="bold")

plt.tight_layout()

plt.show()

## Wordcloud for Positive/Negative Reviews

In [None]:
from wordcloud import WordCloud

# Get positive and negative reviews
positive_reviews = df[df['sentiment']==1]['review']
negative_reviews = df[df['sentiment']==0]['review']

# Create word clouds for positive and negative reviews
positive_wordcloud = WordCloud(colormap='Greens').generate(' '.join(positive_reviews))  
negative_wordcloud = WordCloud(colormap='Reds').generate(' '.join(negative_reviews))


# Plot word clouds
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.imshow(positive_wordcloud, interpolation='bilinear')
plt.title('Positive Reviews')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(negative_wordcloud, interpolation='bilinear')
plt.title('Negative Reviews')
plt.axis('off')

plt.tight_layout()
plt.show()


## Histogram – Review Length Distribution (Positive vs Negative Reviews)

In [None]:


# Calculate review lengths
# apply() runs the function on every row: split each review into words and count them
df['review_length'] = df['review'].apply(lambda x: len(x.split()))

# Separate lengths by sentiment
pos_lengths = df[df['sentiment']==1]['review_length']
neg_lengths = df[df['sentiment']==0]['review_length']

# Plot side-by-side histograms
plt.figure(figsize=(12,5))

plt.subplot(1, 2, 1)
plt.hist(pos_lengths, bins=20, color='red', alpha=0.6, edgecolor='black')
plt.title('Positive Reviews Length')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(neg_lengths, bins=20, color='purple', alpha=0.6, edgecolor='black')
plt.title('Negative Reviews Length')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


## Review Length Distribution by Word Count Category
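
Bucketing by word count is easy to get subtly wrong at the boundaries, so here is a small standard-library sketch of one consistent convention. It assumes half-open ranges: Short is fewer than 50 words, Medium is 50 to 149, and Long is 150 or more. The `length_category` helper is hypothetical and is not part of the notebook.

```python
from bisect import bisect_right

EDGES = [50, 150]                     # assumed cut points, in words
LABELS = ["Short", "Medium", "Long"]  # one more label than edges

def length_category(n_words):
    """Map a word count to its bucket using left-closed intervals."""
    # bisect_right counts how many edges are <= n_words, so a review with
    # exactly 50 words is Medium and one with exactly 150 words is Long.
    return LABELS[bisect_right(EDGES, n_words)]

for n in (0, 49, 50, 149, 150, 400):
    print(n, length_category(n))
```

With this convention every non-negative word count falls into exactly one bucket, including empty reviews and reviews that sit exactly on a cut point.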

## Histogram

In [None]:
# Calculate review lengths and flag each review as short, medium, or long
df['review_length'] = df['review'].apply(lambda x: len(x.split()))
df['short_review'] = df['review_length'] < 50
df['medium_review'] = (df['review_length'] >= 50) & (df['review_length'] < 150)
df['long_review'] = df['review_length'] >= 150  # >= so a review of exactly 150 words is not left uncategorized

#subplots for each review
plt.figure(figsize=(14,6))

#subplot for short reviews
plt.subplot(1,3,1)
plt.hist(df[df['short_review'] == True]['review_length'], bins=20, color='red', alpha=0.6, edgecolor='black')
plt.title('Short Reviews Length')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')

#subplot for medium reviews
plt.subplot(1,3,2)
plt.hist(df[df['medium_review']==True]['review_length'],bins=20,color='yellow',alpha=0.6,edgecolor='black')
plt.title('Medium Reviews Length')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')

#subplot for long reviews
plt.subplot(1,3,3)
plt.hist(df[df['long_review'] == True]['review_length'], bins=20, color='green', alpha=0.6, edgecolor='black')
plt.title('Long Reviews Length')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

## Bar Chart

In [None]:

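# Review length categories: Short (<50 words), Medium (50-149), Long (>=150)
df['review_length'] = df['review'].apply(lambda x: len(x.split()))
df['review_type'] = pd.cut(df['review_length'],
                           bins=[0, 50, 150, float('inf')],
                           labels=['Short', 'Medium', 'Long'],
                           right=False)  # left-closed bins, so 0-word reviews count as Short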

# Count values for each category
counts = df['review_type'].value_counts().reindex(['Short', 'Medium', 'Long'])

# Bar chart with border
plt.figure(figsize=(8,6))
bars = plt.bar(counts.index, counts.values, 
               color=['red', 'yellow', 'green'], 
               edgecolor='black', linewidth=1.5)

plt.title('Distribution of Short, Medium, and Long Reviews')
plt.xlabel('Review Type')
plt.ylabel('Count')
plt.show()


## Submission.csv

In [None]:
# Save predictions of the best model (SVM) on its test split to submission.csv

submission = pd.DataFrame({
    "id": range(len(svm_pred)),
    "sentiment": svm_pred
})

submission.to_csv("submission.csv", index=False)
print("✅ submission.csv file generated!")