# Sentiment Analysis of IMDB Movie Reviews

This notebook performs sentiment analysis on the IMDB movie review dataset.

The notebook can be viewed and edited here: https://drive.google.com/file/d/1oCf8DUa75G3hPsMYjZmVwIwShaoLLR98/view?usp=sharing

**Includes:**
- Data Loading and Exploration
- Text Preprocessing
- Training and Evaluation of:
  - Multinomial Naive Bayes
  - Logistic Regression
  - LSTM (TensorFlow/Keras)
- Model Evaluation and Comparison

## 1. Data Loading and Exploration

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("Loading dataset...")
df = pd.read_csv('IMDB Dataset.csv')

df.info()
df.head()


In [None]:

# Check for missing values and basic stats
print("Missing values:")
print(df.isnull().sum())
print("\nSentiment distribution:")
print(df['sentiment'].value_counts())


In [None]:

# Plot sentiment distribution
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='sentiment')
plt.title("Sentiment Distribution")
plt.show()


In [None]:

# Review length analysis
df['review_length'] = df['review'].apply(len)
print(df['review_length'].describe())

plt.figure(figsize=(10,5))
sns.histplot(df['review_length'], bins=50)
plt.title("Review Length Distribution")
plt.xlabel("Review Length (characters)")
plt.show()


## 2. Text Preprocessing

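We clean each review by removing HTML tags, replacing every non-letter character (punctuation and digits) with a space, lowercasing, tokenizing, dropping stopwords and single-character tokens, and lemmatizing the remaining words.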

In [None]:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

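nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
# Recent NLTK releases load the tokenizer data from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')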

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

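def preprocess_text(text):
    # Replace HTML tags with a space so words around <br /> are not fused together
    text = re.sub(r'<.*?>', ' ', text)
    # Keep letters only; digits and punctuation become spaces
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    # Drop stopwords and single-character tokens, then lemmatize what remains
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 1]
    return ' '.join(tokens)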

df['cleaned_review'] = df['review'].apply(preprocess_text)
df['sentiment_label'] = df['sentiment'].map({'negative': 0, 'positive': 1})
df[['review', 'cleaned_review']].head()
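
To make the regex steps concrete, the sketch below applies only the tag-stripping, letter-filtering, and lowercasing stages to a made-up review, leaving out the NLTK stages. The helper name `strip_markup`, its `tag_replacement` parameter, and the sample string are illustrative assumptions and not part of the notebook. It shows that whether a tag is replaced by an empty string or by a space matters when `<br />` sits between words with no surrounding whitespace.

```python
import re

def strip_markup(text, tag_replacement=' '):
    """Hypothetical helper: regex cleaning and lowercasing only (no NLTK steps)."""
    text = re.sub(r'<.*?>', tag_replacement, text)  # remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)          # non-letters become spaces
    return ' '.join(text.lower().split())           # lowercase, collapse whitespace

sample = "Loved it<br /><br />Would watch again, 10/10!"
print(strip_markup(sample, tag_replacement=''))   # loved itwould watch again
print(strip_markup(sample, tag_replacement=' '))  # loved it would watch again
```

With an empty replacement, "it" and "Would" merge into a single token that no stopword list or lemmatizer will recognise.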


## 3. Splitting Data and TF-IDF Feature Extraction

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X = df['cleaned_review']
y = df['sentiment_label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

tfidf_vectorizer = TfidfVectorizer(max_features=20000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
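
The vectorizer is fitted on the training split only and then reused to transform the test split, so test-only words never enter the vocabulary. The pure-Python sketch below illustrates that idea on three toy documents. It assumes whitespace tokenization and uses the smoothed weighting that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The function names `fit_idf` and `transform` are hypothetical, and the sketch is a simplification, not a reimplementation of `TfidfVectorizer`.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def transform(doc, idf):
    """Weight term counts by IDF, drop unseen terms, and L2-normalize."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train_docs = ["great movie", "terrible movie", "great acting"]
idf = fit_idf(train_docs)

# "plot" never appeared in training, so it is ignored at transform time
vec = transform("great plot great movie", idf)
print(vec)
```

Fitting on the training split alone keeps document frequencies from the test set out of the features, which is why the notebook calls `fit_transform` on `X_train` and only `transform` on `X_test`.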


## 4. Training Traditional Machine Learning Models

In [None]:

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_tfidf, y_train)


## 5. Training LSTM Model (Deep Learning)

In [None]:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping

MAX_WORDS = 20000
MAX_LEN = 250
EMBEDDING_DIM = 128

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

X_train_pad = pad_sequences(X_train_seq, maxlen=MAX_LEN)
X_test_pad = pad_sequences(X_test_seq, maxlen=MAX_LEN)

lstm_model = Sequential([
    # input_length is deprecated in recent Keras releases; the length is inferred from the padded input
    Embedding(MAX_WORDS, EMBEDDING_DIM),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = lstm_model.fit(X_train_pad, y_train,
                         validation_split=0.1,
                         epochs=10,
                         batch_size=128,
                         callbacks=[early_stop])
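
Because `pad_sequences` is called without `padding` or `truncating` arguments, it uses its defaults: both padding and truncation happen at the start of the sequence. The pure-Python sketch below mimics that default behaviour on toy token ids. The helper name `pad_or_truncate` and the sample sequences are illustrative assumptions.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad on the left, truncate from the left.

    Assumes maxlen >= 1.
    """
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # left-pad short sequences

short_review = [12, 7, 301]
long_review = [5, 9, 2, 44, 8, 17, 3]

print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 12, 7, 301]
print(pad_or_truncate(long_review, maxlen=5))   # [2, 44, 8, 17, 3]
```

For this notebook, that means a cleaned review longer than `MAX_LEN` tokens keeps only its final 250 tokens. If the opening of a review is expected to carry more signal, passing `truncating='post'` to `pad_sequences` is one option to try.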


## 6. Model Evaluation

In [None]:

from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score, roc_auc_score,
)

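def evaluate_model(model, X, y, model_name, is_dl=False):
    """Print test metrics and plot a confusion matrix for a fitted model."""
    if is_dl:
        # Keras returns sigmoid probabilities with shape (n, 1); flatten to 1-D
        y_pred_proba = model.predict(X).ravel()
        y_pred = (y_pred_proba > 0.5).astype(int)
    else:
        y_pred = model.predict(X)
        # Fall back to hard labels when the model exposes no probabilities
        y_pred_proba = model.predict_proba(X)[:, 1] if hasattr(model, 'predict_proba') else y_pred
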
    print(f"\nEvaluation for {model_name}")
    print(f"Accuracy: {accuracy_score(y, y_pred):.4f}")
    print(f"Precision: {precision_score(y, y_pred):.4f}")
    print(f"Recall: {recall_score(y, y_pred):.4f}")
    print(f"F1 Score: {f1_score(y, y_pred):.4f}")
    print(f"AUC: {roc_auc_score(y, y_pred_proba):.4f}")

    print("\nConfusion Matrix:")
    sns.heatmap(confusion_matrix(y, y_pred), annot=True, fmt='d', cmap='Blues')
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.title(f"Confusion Matrix: {model_name}")
    plt.show()

evaluate_model(nb_model, X_test_tfidf, y_test, "Naive Bayes")
evaluate_model(lr_model, X_test_tfidf, y_test, "Logistic Regression")
evaluate_model(lstm_model, X_test_pad, y_test, "LSTM", is_dl=True)
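
As a worked illustration of the reported metrics, the sketch below computes them by hand for six made-up predictions. Accuracy, precision, recall, and F1 depend on the labels produced by the 0.5 threshold, while AUC depends only on how the probabilities rank positives against negatives. The function names `binary_metrics` and `pairwise_auc` and all the numbers are illustrative assumptions, not results from the notebook.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from hard 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
y_pred = [int(p > 0.5) for p in probs]  # same 0.5 threshold as evaluate_model

print(binary_metrics(y_true, y_pred))
print(round(pairwise_auc(y_true, probs), 4))
```

Here one positive scored 0.4 and one negative scored 0.6, so the thresholded metrics each come out at 2/3, while the AUC is 8/9 because only one of the nine positive-negative pairs is ranked in the wrong order.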


## 7. Conclusion and Comparison

- **Naive Bayes**: Fast to train and easy to interpret, which makes it a good baseline.
- **Logistic Regression**: Performs strongly with TF-IDF features and offers an excellent balance of simplicity and accuracy.
- **LSTM**: Models word order, which TF-IDF features discard, and is potentially more accurate with further tuning, at the cost of longer training.

Final thoughts: For quick and reliable sentiment analysis, Logistic Regression with TF-IDF is highly effective. For NLP tasks that depend more heavily on context and word order, LSTMs and transformers can offer performance gains.