<a href="https://colab.research.google.com/github/arooshahz/imdb-sentiment-analysis/blob/main/IMDb_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from datasets import load_dataset

from collections import Counter
import re

sns.set(style="whitegrid")


In [None]:
dataset = load_dataset("imdb")
dataset

The labeled portion of the IMDB dataset consists of 50,000 movie reviews, split evenly into 25,000 training and 25,000 test samples. Each sample contains a review text and a binary sentiment label (0 = negative, 1 = positive).

In [None]:
train_df = pd.DataFrame(dataset["train"])
test_df = pd.DataFrame(dataset["test"])

train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.isnull().sum()

No missing values are present in the text or label columns of the training set, so no imputation or row filtering is needed before modeling.

In [None]:
label_counts = train_df["label"].value_counts()

label_counts

In [None]:
plt.figure(figsize=(5,4))
sns.barplot(x=label_counts.index, y=label_counts.values)
plt.xticks([0,1], ["Negative", "Positive"])
plt.title("Class Distribution in Training Set")
plt.ylabel("Count")
plt.xlabel("Sentiment")
plt.show()

The dataset is perfectly balanced, with an equal number of positive and negative reviews. This allows us to rely on accuracy and F1-score without concerns about class imbalance.
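
To make the point about class balance concrete, here is a minimal sketch with hypothetical toy labels (not IMDB data). It scores a degenerate classifier that always predicts "negative" on a balanced and on an imbalanced label set. The helper `accuracy_and_f1` is an illustrative name, not part of any library.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 of the positive class) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Hypothetical toy labels: a classifier that always predicts "negative" (0).
always_negative = [0] * 100
balanced_true = [0] * 50 + [1] * 50
imbalanced_true = [0] * 95 + [1] * 5

balanced_scores = accuracy_and_f1(balanced_true, always_negative)
imbalanced_scores = accuracy_and_f1(imbalanced_true, always_negative)
print("balanced:  ", balanced_scores)
print("imbalanced:", imbalanced_scores)
```

On the imbalanced labels the useless classifier still gets high accuracy, while on the balanced labels its accuracy drops to chance level. That is why accuracy is a reasonable headline metric for a balanced dataset.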

In [None]:
train_df["word_count"] = train_df["text"].apply(lambda x: len(x.split()))
train_df["char_count"] = train_df["text"].apply(len)

train_df[["word_count", "char_count"]].describe()

In [None]:
plt.figure(figsize=(7,4))
sns.histplot(train_df["word_count"], bins=50)
plt.title("Distribution of Review Lengths (Word Count)")
plt.xlabel("Number of Words")
plt.ylabel("Frequency")
plt.show()

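Most reviews fall between 100 and 300 words, with a long tail of significantly longer reviews. This suggests that truncation will affect only part of the samples when using Transformer-based models. Note, however, that subword tokenizers produce more tokens than there are whitespace-separated words, so the affected share is larger than the word counts alone imply.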
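
A rough way to check how many reviews a length limit would truncate is to count the sequences that exceed it. The sketch below uses hypothetical word counts and an assumed tokens-per-word ratio of 1.3. Both are placeholders; the real figure should be measured with the actual tokenizer.

```python
def fraction_over_budget(lengths, max_length):
    """Fraction of sequences whose length exceeds max_length."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > max_length) / len(lengths)

# Hypothetical word counts, for illustration only (not taken from IMDB).
toy_word_counts = [90, 120, 150, 180, 230, 260, 310, 480, 700, 1100]

# Assumed ratio of subword tokens to words; an illustrative guess, not a measured value.
tokens_per_word = 1.3
toy_token_estimates = [int(n * tokens_per_word) for n in toy_word_counts]

share_by_words = fraction_over_budget(toy_word_counts, 256)
share_by_tokens = fraction_over_budget(toy_token_estimates, 256)
print(share_by_words, share_by_tokens)
```

In the notebook, the same function could be applied to `train_df["word_count"]`, or to token counts obtained from the tokenizer, which is the more reliable check.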

In [None]:
plt.figure(figsize=(7,4))
sns.boxplot(x="label", y="word_count", data=train_df)
plt.xticks([0,1], ["Negative", "Positive"])
plt.title("Review Length by Sentiment")
plt.xlabel("Sentiment")
plt.ylabel("Word Count")
plt.show()

Positive reviews tend to be slightly longer on average, which may indicate that users elaborate more when expressing positive opinions.

In [None]:
train_df.sort_values("word_count").head(3)[["text", "label", "word_count"]]

In [None]:
train_df.sort_values("word_count", ascending=False).head(3)[["label", "word_count"]]

In [None]:
def contains_html(text):
    return bool(re.search(r"<.*?>", text))

train_df["has_html"] = train_df["text"].apply(contains_html)
train_df["has_html"].mean()

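The value above is the fraction of training reviews that contain at least one HTML tag (typically `<br />` line breaks). These tags carry no sentiment and should be stripped during preprocessing for baseline models.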
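
One detail matters when stripping tags: deleting a tag outright can fuse the words on either side of it. A minimal sketch, assuming the tags are simple ones such as `<br />` (the names `TAG_PATTERN` and `strip_html` are illustrative):

```python
import re

# Matches anything that looks like a single HTML tag, e.g. "<br />".
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(text):
    """Replace each HTML tag with a space, then collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

sample = "Great cast.<br /><br />Weak plot."
print(TAG_PATTERN.sub("", sample))  # tags deleted: the two sentences are fused
print(strip_html(sample))           # tags replaced by a space: the words stay separate
```

Replacing tags with a space keeps word boundaries intact, which matters for any word-level statistics computed afterwards.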

In [None]:
positive_reviews = train_df[train_df["label"] == 1]["text"]
negative_reviews = train_df[train_df["label"] == 0]["text"]

In [None]:
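def clean_text(text):
    text = text.lower()
    # replace HTML tags with a space so words on either side of "<br />" are not fused
    text = re.sub(r"<.*?>", " ", text)
    # keep letters and whitespace only (this also drops apostrophes, so "don't" becomes "dont")
    text = re.sub(r"[^a-z\s]", "", text)
    return text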

In [None]:
positive_clean = positive_reviews.apply(clean_text)
negative_clean = negative_reviews.apply(clean_text)

In [None]:
positive_words = Counter(" ".join(positive_clean).split())
negative_words = Counter(" ".join(negative_clean).split())

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

for stopword in ENGLISH_STOP_WORDS:
    positive_words.pop(stopword, None)
    negative_words.pop(stopword, None)

In [None]:
top_pos = positive_words.most_common(20)
top_neg = negative_words.most_common(20)

top_pos, top_neg

In [None]:
def plot_top_words(word_counts, title):
    words, counts = zip(*word_counts)
    plt.figure(figsize=(8,4))
    sns.barplot(x=list(counts), y=list(words))
    plt.title(title)
    plt.xlabel("Frequency")
    plt.ylabel("Word")
    plt.show()

plot_top_words(top_pos, "Top 20 Words in Positive Reviews")
plot_top_words(top_neg, "Top 20 Words in Negative Reviews")

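Positive reviews frequently include words such as "great", "like", and "love", while negative reviews are dominated by terms like "bad", "dont" ("don't" with its apostrophe stripped by the cleaning step), and "like". Generic words such as "like" rank highly in both classes, so raw counts only partly separate them. Nonetheless, the clearly polar terms ("great", "love", "bad") indicate that sentiment is strongly reflected in word choice, making the dataset suitable for both classical and Transformer-based text classification models.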

In [None]:
def get_relative_freq(word, counter, total_words):
    return counter[word] / total_words

total_pos_words = sum(positive_words.values())
total_neg_words = sum(negative_words.values())

diff_words = []

for word in positive_words.keys() | negative_words.keys():
    pos_freq = get_relative_freq(word, positive_words, total_pos_words)
    neg_freq = get_relative_freq(word, negative_words, total_neg_words)
    diff_words.append((word, pos_freq - neg_freq))

diff_words_sorted = sorted(diff_words, key=lambda x: abs(x[1]), reverse=True)
diff_words_sorted[:20]


Certain words exhibit a strong sentiment polarity, appearing disproportionately in either positive or negative reviews. This observation further supports the effectiveness of lexical features for sentiment classification.
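
The difference of relative frequencies favors very common words. A smoothed log-odds ratio is a common alternative that puts frequent and rare words on a comparable scale. The sketch below is one possible formulation with add-alpha smoothing, shown on hypothetical toy counts; it is not the method used in the cell above.

```python
import math
from collections import Counter

def smoothed_log_odds(pos_counts, neg_counts, alpha=1.0):
    """Map each word to log(P(word | positive) / P(word | negative)) with add-alpha smoothing."""
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + alpha * len(vocab)
    neg_total = sum(neg_counts.values()) + alpha * len(vocab)
    scores = {}
    for word in vocab:
        pos_rate = (pos_counts[word] + alpha) / pos_total
        neg_rate = (neg_counts[word] + alpha) / neg_total
        scores[word] = math.log(pos_rate / neg_rate)
    return scores

# Hypothetical toy counts, for illustration only.
toy_pos = Counter({"great": 8, "film": 10, "bad": 2})
toy_neg = Counter({"great": 2, "film": 10, "bad": 8})

scores = smoothed_log_odds(toy_pos, toy_neg)
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:>6}: {score:+.3f}")
```

Positive scores mark words that lean positive, negative scores mark words that lean negative, and a word used equally in both classes scores zero.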

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [None]:
X = train_df["text"]
y = train_df["label"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

A stratified split is used to preserve the class distribution in both training and validation sets.
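
To illustrate what stratification does, the sketch below splits indices class by class so that each class contributes the same share to the validation set. It is a simplified pure-Python illustration with hypothetical toy labels, not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, val_indices) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_indices, val_indices = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_indices.extend(indices[:n_val])
        train_indices.extend(indices[n_val:])
    return train_indices, val_indices

# Hypothetical toy labels: 50 negatives followed by 50 positives.
toy_labels = [0] * 50 + [1] * 50
train_indices, val_indices = stratified_split(toy_labels)

val_positive_share = sum(toy_labels[i] for i in val_indices) / len(val_indices)
print(len(train_indices), len(val_indices), val_positive_share)
```

Because the dataset here is balanced, a purely random split would already come close to this. Stratification removes the remaining sampling noise.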

In [None]:
from sklearn.linear_model import LogisticRegression

logreg_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=20000,
        ngram_range=(1,2),
        stop_words="english"
    )),
    ("clf", LogisticRegression(max_iter=1000))
])

In [None]:
logreg_pipeline.fit(X_train, y_train)

In [None]:
y_pred = logreg_pipeline.predict(X_val)

acc = accuracy_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

acc, f1

In [None]:
print(classification_report(y_val, y_pred, target_names=["Negative", "Positive"]))

The Logistic Regression baseline achieves strong performance on the validation split, indicating that sentiment in the IMDB dataset is highly correlated with lexical features.
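
For intuition about the features the baseline sees, here is a small hand-rolled TF-IDF sketch on hypothetical toy documents. It uses the smoothed IDF form ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is the form documented for scikit-learn's default settings. Stop-word removal, bigrams, and the `max_features` cap are deliberately left out.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical toy corpus, for illustration only.
toy_docs = ["great movie", "bad movie", "great great acting"]
vectors = tfidf_vectors(toy_docs)
for doc, vec in zip(toy_docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

Terms that occur in fewer documents ("bad", "acting") receive a higher weight than terms shared across documents ("movie"). This is what lets a linear classifier focus on discriminative words.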

In [None]:
from sklearn.svm import LinearSVC

svm_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=20000,
        ngram_range=(1,2),
        stop_words="english"
    )),
    ("clf", LinearSVC())
])

In [None]:
svm_pipeline.fit(X_train, y_train)

In [None]:
y_pred_svm = svm_pipeline.predict(X_val)

acc_svm = accuracy_score(y_val, y_pred_svm)
f1_svm = f1_score(y_val, y_pred_svm)

acc_svm, f1_svm

In [None]:
results_df = pd.DataFrame({
    "Model": ["Logistic Regression", "Linear SVM"],
    "Accuracy": [acc, acc_svm],
    "F1-score": [f1, f1_svm]
})

results_df

While both classical models perform strongly, they rely on surface-level unigram and bigram features and cannot capture longer-range context such as the scope of a negation, motivating the use of Transformer-based models.

In [None]:
feature_names = logreg_pipeline.named_steps["tfidf"].get_feature_names_out()
coefficients = logreg_pipeline.named_steps["clf"].coef_[0]

top_positive = sorted(
    zip(feature_names, coefficients),
    key=lambda x: x[1],
    reverse=True
)[:20]

top_negative = sorted(
    zip(feature_names, coefficients),
    key=lambda x: x[1]
)[:20]

top_positive, top_negative

The most influential features align with intuitive sentiment-bearing words, confirming that the model learns meaningful patterns from the data.

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

When a GPU is available, the model is trained on it to significantly reduce training time; otherwise the code falls back to the CPU.

In [None]:
!pip install -q transformers datasets evaluate accelerate

In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

from datasets import load_dataset
import evaluate
import numpy as np

In [None]:
dataset = load_dataset("imdb")
dataset

We directly use the HuggingFace Dataset object to ensure seamless integration with the Trainer API.

In [None]:
model_checkpoint = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

bert-base-uncased is chosen as a strong and widely used baseline Transformer model for English text classification.

In [None]:
max_length = 256

In [None]:
def tokenize_function(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length
    )

In [None]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)
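
The tokenization step above both truncates and pads every review to `max_length`. The sketch below illustrates the idea on plain lists of token ids. The ids are hypothetical placeholders, and the sketch simply cuts the list, whereas the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a token-id list to max_length or right-pad it, and build the matching attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }

# Hypothetical token ids, for illustration only.
short_example = pad_or_truncate([101, 2307, 3185, 102], max_length=6)
long_example = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)

print(short_example)
print(long_example)
```

Padding every sample to a fixed length is simple but wastes computation on short reviews. Padding each batch to its longest member is a common alternative.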

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

In [None]:
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["test"]

In [None]:
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_metric.compute(
        predictions=predictions, references=labels
    )
    f1 = f1_metric.compute(
        predictions=predictions, references=labels
    )
    return {
        "accuracy": accuracy["accuracy"],
        "f1": f1["f1"]
    }

Both accuracy and F1-score are reported to ensure a comprehensive evaluation of model performance.
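
The metric function receives raw logits, not probabilities. As a minimal sketch with hypothetical logits, the snippet below shows how a row of logits maps to a predicted label and, through a softmax, to a confidence value. The helper names are illustrative.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (predicted_label, probability_of_that_label) for one row of logits."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

# Hypothetical logits for three reviews: [negative_logit, positive_logit].
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-3.0, 3.0]]
toy_predictions = [predict(row)[0] for row in toy_logits]
print(toy_predictions)
print([round(predict(row)[1], 3) for row in toy_logits])
```

Since softmax preserves the ordering of the logits, taking the argmax of the logits directly, as `compute_metrics` does, yields the same labels.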

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=2
)

model.to(device)

In [None]:
training_args = TrainingArguments(
    output_dir="./bert-imdb",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=100
)
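
As a quick sanity check on these settings, the number of optimizer steps can be worked out by hand. The sketch assumes a single device, no gradient accumulation, and the 25,000-sample training split; `total_optimizer_steps` is an illustrative helper, not a Trainer API.

```python
import math

def total_optimizer_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

steps = total_optimizer_steps(n_examples=25_000, batch_size=16, epochs=2)
log_events = steps // 100  # logging_steps=100 means one log line every 100 steps

print(f"{steps} optimizer steps, about {log_events} logging events")
```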

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
{
  'eval_loss': ...,
  'eval_accuracy': 0.92,
  'eval_f1': 0.92
}

With roughly 0.92 accuracy and F1 on the test set, fine-tuned BERT outperforms the classical baselines by capturing contextual and semantic information beyond surface-level lexical features.

In [None]:
# Run inference on the test set; the result holds the raw logits and the gold labels.
predictions = trainer.predict(eval_dataset)

logits = predictions.predictions
y_true = predictions.label_ids
# Predicted class = index of the largest logit.
y_pred = np.argmax(logits, axis=1)

In [None]:
errors = []

for i in range(len(y_true)):
    if y_true[i] != y_pred[i]:
        errors.append({
            "text": dataset["test"][i]["text"],
            "true_label": y_true[i],
            "pred_label": y_pred[i]
        })

error_df = pd.DataFrame(errors)
error_df.head()

In [None]:
error_df.sample(5)

Some misclassified samples contain mixed sentiments or sarcasm, which remains challenging even for Transformer-based models.

In [None]:
error_df["word_count"] = error_df["text"].apply(lambda x: len(x.split()))

error_df["word_count"].describe()

In [None]:
train_df["word_count"].describe()

Misclassified test reviews tend to be longer on average than reviews in the training set, suggesting that truncation to 256 tokens may contribute to information loss.
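
A more direct check of the length effect is the error rate per length bucket within a single evaluation set. The sketch below uses hypothetical word counts and error flags; the bucket edges are arbitrary choices for illustration.

```python
def error_rate_by_length(word_counts, is_error, edges=(100, 200, 300)):
    """Group samples into length buckets and return {bucket_label: error_rate}."""
    labels = (
        [f"<{edges[0]}"]
        + [f"{lo}-{hi - 1}" for lo, hi in zip(edges, edges[1:])]
        + [f">={edges[-1]}"]
    )
    totals = [0] * len(labels)
    errors = [0] * len(labels)
    for n_words, err in zip(word_counts, is_error):
        bucket = sum(n_words >= edge for edge in edges)  # index of the bucket n_words falls into
        totals[bucket] += 1
        errors[bucket] += int(err)
    return {label: errors[i] / totals[i] for i, label in enumerate(labels) if totals[i]}

# Hypothetical data, for illustration only.
toy_counts = [50, 80, 150, 180, 250, 260, 400, 500]
toy_errors = [False, False, False, True, False, False, True, True]

rates = error_rate_by_length(toy_counts, toy_errors)
print(rates)
```

In the notebook, the inputs could be the word counts of all test reviews and a boolean array such as `y_true != y_pred`. A rising error rate in the longest buckets would support the truncation hypothesis.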

In [None]:
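# Match "not" as a whole word; a plain substring search would also hit "nothing", "another", "note", ...
error_df[error_df["text"].str.contains(r"\bnot\b", case=False, regex=True)].head(3)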

Negation and subtle sentiment shifts appear among the misclassified samples and remain a common source of error, even for pretrained language models.
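
When searching for negation, word boundaries matter: a plain substring search for "not" also matches words such as "nothing" or "another". A minimal sketch follows; the cue list is a small, assumed sample and by no means exhaustive.

```python
import re

# Whole-word negation cues plus the contracted form "n't" (as in "don't").
NEGATION_PATTERN = re.compile(r"\b(?:not|no|never)\b|n't\b", re.IGNORECASE)

def has_negation(text):
    """True if the text contains one of the negation cues as a separate word or contraction."""
    return bool(NEGATION_PATTERN.search(text))

examples = [
    "This is not good",
    "I don't like it",
    "Nothing special about another note",
]
for text in examples:
    print(has_negation(text), "|", text)
```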

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Negative", "Positive"],
            yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix: BERT on IMDB")
plt.show()

The confusion matrix shows balanced performance across both classes, with no strong bias toward either sentiment.
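
"Balanced performance" can be quantified by the per-class recall read off the confusion matrix. The sketch below builds the matrix from hypothetical toy labels with rows as true classes and columns as predicted classes, the same layout as the heatmap above.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Return a matrix where entry [t][p] counts samples of true class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Recall of class i = correctly predicted samples of class i / all samples of class i."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Hypothetical toy labels, for illustration only.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 1]

toy_matrix = confusion_counts(toy_true, toy_pred)
toy_recall = per_class_recall(toy_matrix)
print(toy_matrix, toy_recall)
```

Similar recall values for both classes indicate that the model is not biased toward either sentiment.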

In [None]:
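# Evaluate BERT once and reuse the metrics; each trainer.evaluate() call runs a full pass over the test set.
bert_metrics = trainer.evaluate()

final_results = pd.DataFrame({
    "Model": [
        "Logistic Regression (TF-IDF)",
        "Linear SVM (TF-IDF)",
        "BERT (Fine-tuned)"
    ],
    "Accuracy": [
        acc,
        acc_svm,
        bert_metrics["eval_accuracy"]
    ],
    "F1-score": [
        f1,
        f1_svm,
        bert_metrics["eval_f1"]
    ]
})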

final_results

In [None]:
final_results.set_index("Model")[["Accuracy", "F1-score"]].plot.bar(
    figsize=(8,4), ylim=(0.8,1.0)
)
plt.title("Model Comparison on IMDB Dataset")
plt.ylabel("Score")
plt.show()

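Fine-tuned BERT achieves the best overall performance, outperforming classical baselines by leveraging contextual semantic representations. The comparison is not strictly like-for-like, however: the baselines were trained on 80% of the training set and scored on the held-out validation split, whereas BERT was trained on the full training set and scored on the test set. Scoring the baselines on the test set as well would make the gap exact.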