# Fine-Tuning BERT for Sentiment Analysis on IMDB

## Project Overview
In this project, we fine-tune **BERT (Bidirectional Encoder Representations from Transformers)** to perform **binary sentiment classification** on the **IMDB movie reviews dataset**.  
The task is to classify a given movie review as **positive** or **negative**.

---

## Why BERT?
- **Contextual embeddings**: Unlike Word2Vec or GloVe, which assign each word a single fixed vector, BERT computes a token's representation from both its left and right context.
- **Transfer learning power**: BERT is pretrained on large unlabeled corpora (BooksCorpus and English Wikipedia) and can then be fine-tuned with a relatively small labeled dataset.
- **Strong track record**: BERT and its variants have achieved leading results on NLP tasks such as sentiment analysis, question answering, and text classification.

---

## Workflow
We’ll follow a **complete NLP project pipeline**:

1. **Setup & Installations** → Install Hugging Face Transformers, Datasets, and PyTorch.
2. **Dataset Preparation** → Load the IMDB dataset, take a subsample of its train/test splits, and preprocess with `BertTokenizer`.
3. **Fine-Tuning BERT** → Train `bert-base-uncased` for sentiment classification.
4. **Evaluation** → Measure accuracy on the test set.
5. **Model Saving** → Save locally and push to Hugging Face Hub.




## 1. Setup in Colab

In [1]:
!pip install transformers datasets torch --quiet

In [2]:
# import Libraries
import torch
from transformers import BertTokenizer, BertModel

In [3]:
# Load Pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load BERT model
model = BertModel.from_pretrained('bert-base-uncased')

# Put model in evaluation mode (not training yet)
model.eval()



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

**Now let's test the tokenizer and Model output**

In [4]:
# example sentence
sentence = "BERT is amazing for NLP tasks!"

# tokenize and convert to tensor
inputs = tokenizer(sentence, return_tensors = 'pt')

# run through BERT (no gradients are needed for this inference-only check)
with torch.no_grad():
    outputs = model(**inputs)

# inspect the inputs and the shape of the last hidden state
print("Input IDs :", inputs['input_ids'])
print("Attention Mask:", inputs['attention_mask'])
print("Last hidden state shape:", outputs.last_hidden_state.shape)

Input IDs : tensor([[  101, 14324,  2003,  6429,  2005, 17953,  2361,  8518,   999,   102]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Last hidden state shape: torch.Size([1, 10, 768])
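
The sentence has six words and one punctuation mark, yet the tokenizer returns ten IDs. Two of them are the special tokens `[CLS]` (101) and `[SEP]` (102), so at least one word was split into sub-word pieces. The sketch below illustrates greedy longest-match-first WordPiece splitting. It is a simplified illustration, not the library's implementation: `toy_vocab`, `wordpiece`, and `toy_tokenize` are hypothetical names, and the vocabulary is a tiny made-up subset rather than the real BERT vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for illustration only).
toy_vocab = {"bert", "is", "amazing", "for", "nl", "##p", "tasks", "!"}


def toy_tokenize(text, vocab):
    """Lowercase, split off '!', apply WordPiece, and add the special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().replace("!", " !").split():
        tokens.extend(wordpiece(word, vocab))
    return tokens + ["[SEP]"]


tokens = toy_tokenize("BERT is amazing for NLP tasks!", toy_vocab)
print(tokens)
print(len(tokens))
```

With this toy vocabulary the word "nlp" is not present as a whole, so it is emitted as two pieces, which is how a seven-item sentence can become ten tokens once the special tokens are added.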


## 2. Dataset Preparation


In [5]:
from datasets import load_dataset

# Load IMDB dataset
dataset = load_dataset("imdb")
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [6]:
# Check a Sample
print(dataset["train"][0])



{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

**We’ll apply the BERT tokenizer to convert the raw text into `input_ids` and an `attention_mask`.**

In [7]:
# tokenization function

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",  # pad shorter sentences
        truncation=True,       # cut longer sentences
        max_length=128         # max tokens (BERT supports up to 512; 128 keeps training fast but truncates longer reviews)
    )

# Apply to dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)


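
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to one sequence of token IDs. The helper `pad_or_truncate` is a hypothetical name, not a tokenizer API. The IDs 101 and 102 are the `[CLS]` and `[SEP]` values seen in the earlier tokenizer output, and the pad ID of 0 is assumed from `padding_idx=0` in the printed model.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0, cls_id=101, sep_id=102):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Reserve two positions for [CLS] and [SEP], then cut from the right.
    body = token_ids[: max_length - 2]
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_length - len(ids)
    # Padding positions get pad_id and a 0 in the attention mask.
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short sequence is padded with zeros that the attention mask tells the model to ignore, while the long sequence loses everything past the cut-off. With `max_length=128`, the end of a long review is discarded in the same way.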

In [8]:
# Format for PyTorch
tokenized_datasets = tokenized_datasets.remove_columns(["text"])  # drop raw text
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")  # Hugging Face expects "labels"
tokenized_datasets.set_format("torch")  # convert to PyTorch tensors


In [9]:
# Take only 2000 training samples and 1000 test samples
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))


In [10]:
# Create Train & Test Dataloaders
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, batch_size=16, shuffle=True)
test_dataloader = DataLoader(small_test_dataset, batch_size=16)

## 3. Fine-Tuning BERT for Sentiment Classification

In [11]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2  # Binary classification: 0 = negative, 1 = positive
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# Move Model to GPU (if available)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [13]:
# Optimizer (the loss is computed inside the model when "labels" are passed)

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)



In [14]:
#  Learning Rate Scheduler

from transformers import get_scheduler

num_training_steps = len(train_dataloader) * 2  # 2 epochs
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
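
As a rough illustration of what the `"linear"` schedule does with `num_warmup_steps=0`, the sketch below recomputes the learning rate by hand. The function `linear_lr` is a hypothetical helper intended to mirror the linear warmup-then-decay rule, not a call into the library. It assumes the 2,000 training samples, batch size of 16, and 2 epochs used in this notebook.

```python
import math


def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, num_warmup_steps)
    # Afterwards decay linearly so the rate reaches 0 at the final step.
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


steps_per_epoch = math.ceil(2000 / 16)      # batches per epoch
num_training_steps = steps_per_epoch * 2    # 2 epochs
for step in (0, 125, 249, 250):
    print(step, linear_lr(step, 5e-5, 0, num_training_steps))
```

With no warmup, training starts at the full rate of 5e-5, reaches half of it after the first epoch, and ends at zero.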


In [15]:
# Training Loop
# The model returns the loss itself because each batch contains "labels",
# so no separate loss function is needed.

from tqdm import tqdm

epochs = 2
progress_bar = tqdm(range(num_training_steps))

model.train()

for epoch in range(epochs):
    for batch in train_dataloader:
        # Move batch to the selected device (GPU if available, otherwise CPU)
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        progress_bar.update(1)

100%|██████████| 250/250 [1:37:31<00:00, 22.56s/it]
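
The training loop reads `outputs.loss` directly. For a two-label classification head with integer labels, this is expected to be the mean cross-entropy between the logits and the labels. The sketch below works that quantity out in pure Python. `cross_entropy` and `batch_loss` are hypothetical helper names used only for illustration, and the logits are made-up numbers rather than model outputs.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true label under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def batch_loss(batch_logits, labels):
    """Mean cross-entropy over a batch."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Made-up logits in the order [negative, positive].
print(cross_entropy([0.0, 0.0], 1))    # undecided model
print(cross_entropy([-2.0, 2.0], 1))   # confident and correct
print(cross_entropy([-2.0, 2.0], 0))   # confident and wrong
print(batch_loss([[0.0, 0.0], [-2.0, 2.0]], [1, 1]))
```

An undecided model pays a loss of ln 2 per example. Confident correct predictions drive the loss toward zero, while confident mistakes are penalized heavily, which is what pushes the classifier head toward the right label during fine-tuning.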

In [16]:
# Evaluate accuracy on the 1,000-review test subset

model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        predictions = torch.argmax(outputs.logits, dim=-1)

        correct += (predictions == batch["labels"]).sum().item()
        total += batch["labels"].size(0)

accuracy = correct / total
print("Test Accuracy:", accuracy)


Test Accuracy: 0.851
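
Because the accuracy is measured on a 1,000-review subset rather than the full 25,000-review test split, it carries some sampling uncertainty. As a rough sketch, assuming the reviews are an independent random sample, a normal-approximation interval can be computed as follows. `accuracy_interval` is a hypothetical helper, not part of any library used above.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n samples."""
    # Standard error of a proportion under the binomial model.
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


low, high = accuracy_interval(0.851, 1000)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 0.83 to 0.87, so small differences between runs on this subset should be interpreted with care.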


## 4. Save the model and predictions

In [24]:
# Save the Model & Tokenizer
model.save_pretrained("bert-sentiment")
tokenizer.save_pretrained("bert-sentiment")



('bert-sentiment/tokenizer_config.json',
 'bert-sentiment/special_tokens_map.json',
 'bert-sentiment/vocab.txt',
 'bert-sentiment/added_tokens.json')

In [25]:

from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained("bert-sentiment")
tokenizer = BertTokenizer.from_pretrained("bert-sentiment")

# The reloaded model starts on the CPU; move it to the same device as the inputs
model.to(device)


In [26]:
# prediction on new sentences

def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        prediction = torch.argmax(probs, dim=-1).item()

    labels = ["Negative", "Positive"]
    return labels[prediction], probs.squeeze().tolist()

# Test
print(predict_sentiment("I absolutely loved this movie!"))
print(predict_sentiment("This was the worst film I’ve ever seen."))


('Positive', [0.028649233281612396, 0.9713507890701294])
('Negative', [0.9661880731582642, 0.03381199389696121])


## 5. Push Model to Hugging Face Hub

In [30]:
# Install and login
!pip install huggingface_hub --quiet
from huggingface_hub import notebook_login

notebook_login()



In [33]:
#  Push Model to Hub
from transformers import BertForSequenceClassification, BertTokenizer

# Reload your saved model & tokenizer
model = BertForSequenceClassification.from_pretrained("bert-sentiment")
tokenizer = BertTokenizer.from_pretrained("bert-sentiment")

# Push to Hugging Face Hub
model.push_to_hub("Avinash-panda/bert-sentiment-imdb", private = False)
tokenizer.push_to_hub("Avinash-panda/bert-sentiment-imdb", private = False)


No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/Avinash-panda/bert-sentiment-imdb/commit/17ebad17507e476c7ea64e6227f320ee49ae0cd4', commit_message='Upload tokenizer', commit_description='', oid='17ebad17507e476c7ea64e6227f320ee49ae0cd4', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Avinash-panda/bert-sentiment-imdb', endpoint='https://huggingface.co', repo_type='model', repo_id='Avinash-panda/bert-sentiment-imdb'), pr_revision=None, pr_num=None)

In [35]:
# Load the model

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Avinash-panda/bert-sentiment-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)



In [36]:
# Run inference (sentiment prediction)

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("The movie was fantastic!"))
print(classifier("This was the worst film I’ve ever seen."))


Device set to use cpu


[{'label': 'LABEL_1', 'score': 0.9687082767486572}]
[{'label': 'LABEL_0', 'score': 0.9661880731582642}]
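
The pipeline reports the generic names `LABEL_0` and `LABEL_1` because no human-readable label names were stored with the model. One lightweight option, sketched below, is to rename them after the fact using the same convention as earlier in this notebook (0 = negative, 1 = positive). `LABEL_NAMES` and `rename_labels` are hypothetical names introduced for this example.

```python
# Mapping assumed from the notebook's convention: 0 = negative, 1 = positive.
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Positive"}


def rename_labels(results, mapping=LABEL_NAMES):
    """Return a copy of pipeline-style results with readable label names."""
    # Unknown labels are passed through unchanged.
    return [{**r, "label": mapping.get(r["label"], r["label"])} for r in results]


# Results copied from the pipeline output above.
raw = [{"label": "LABEL_1", "score": 0.9687082767486572}]
readable = rename_labels(raw)
print(readable)
```

An alternative worth considering is to set the `id2label` and `label2id` entries in the model configuration before saving and pushing, so that the readable names travel with the model.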
