# Fine-Tuning BERT for Sentiment Analysis on the IMDb Dataset

## Project Overview

This project fine-tunes `bert-base-uncased`, a transformer-based model, on the IMDb movie reviews dataset for **binary sentiment classification** (positive or negative). It uses the **Hugging Face Transformers** library and **PyTorch**, and is designed to run on **Google Colab** with GPU support. To keep the run short, it trains on a 500-review subset for a single epoch.

---

## Theoretical Background

### 1. BERT (Bidirectional Encoder Representations from Transformers)

#### What is BERT?
BERT, developed by Google, is a **transformer-based model** for natural language processing (NLP). Unlike left-to-right language models, which condition each token only on the text before it, BERT processes text **bidirectionally**: during pre-training it learns each token's context from both its **left and right**.

#### Architecture
- Model: `bert-base-uncased`
- Layers: 12 Transformer blocks
- Hidden Size: 768
- Attention Heads: 12
- Parameters: ~110 million

It uses **self-attention** to weigh the importance of different words in a sentence, capturing rich contextual information.
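
The weighting described above can be illustrated with a minimal, dependency-free sketch of single-head scaled dot-product attention. It is a simplification under stated assumptions: the real model applies learned query, key, and value projections and runs 12 heads per layer, whereas this sketch feeds the same toy vectors in as queries, keys, and values. The function names and input numbers are made up for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: 3 tokens with 2-dimensional vectors (made-up numbers)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence; the weights in a row sum to 1.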

#### Pre-training Tasks
- **Masked Language Modeling (MLM):** Randomly selects 15% of the input tokens (most of which are replaced with a `[MASK]` token) and trains the model to predict the originals from the surrounding context.
- **Next Sentence Prediction (NSP):** Predicts whether the second of two input sentences follows the first in the original text.

BERT is pre-trained on these tasks using large corpora: English **Wikipedia** and **BookCorpus**.
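
The MLM corruption step can be sketched in a few lines of plain Python. This is an illustrative approximation, not the original pre-training code: it masks each token independently with probability 0.15 and applies the 80/10/10 replacement rule from the BERT paper (replace with `[MASK]`, replace with a random token, or leave unchanged). The function name `mask_tokens` and the toy vocabulary are hypothetical, and special tokens such as `[CLS]` are ignored for brevity.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

tokens = "the movie was surprisingly good".split()
toy_vocab = ["film", "bad", "great"]  # made-up vocabulary for random replacements
corrupted, labels = mask_tokens(tokens, toy_vocab)
print(corrupted)
print(labels)
```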

#### Why Use BERT?
- Leverages bidirectional context
- Transfers powerful language understanding to downstream tasks like sentiment classification
- Requires less labeled data due to pre-training

---

### 2. Fine-Tuning

#### Definition
Fine-tuning means continuing the training of a **pre-trained model** on a smaller, task-specific dataset. In this case, we fine-tune BERT for **binary sentiment classification** on the IMDb reviews dataset.

#### Process
- A **classification head** (linear layer) is added on top of BERT.
- The `[CLS]` token's final hidden state, passed through BERT's pooling layer, is used as input to the classification head.
- The model is trained **end-to-end**: both the BERT weights and the new head are updated.
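
What the classification head computes can be shown with a small sketch: one logit per class from a linear map over the `[CLS]` vector, followed by an argmax. The numbers below are made up, and a 4-dimensional vector stands in for the 768-dimensional representation; the helper name `linear_head` is hypothetical. The pooling and dropout layers that sit in front of the head in the real model are left out.

```python
def linear_head(cls_vector, weight, bias):
    """Compute one logit per class: logits[c] = weight[c] . cls_vector + bias[c]."""
    return [sum(w * h for w, h in zip(row, cls_vector)) + b
            for row, b in zip(weight, bias)]

# Toy values: a 4-dimensional stand-in for the 768-dimensional [CLS] vector
cls_vector = [0.5, -1.0, 0.25, 2.0]
weight = [[0.1, 0.2, 0.0, -0.3],   # row for class 0 (negative)
          [-0.2, 0.1, 0.4, 0.5]]   # row for class 1 (positive)
bias = [0.0, 0.1]

logits = linear_head(cls_vector, weight, bias)
predicted_class = max(range(len(logits)), key=lambda c: logits[c])
print(logits, predicted_class)
```

During fine-tuning, the entries of `weight` and `bias` start from random initialization and are learned together with the BERT weights.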

#### Hyperparameters
| Parameter     | Value  | Rationale                                               |
|---------------|--------|---------------------------------------------------------|
| Learning Rate | `2e-5` | Small enough for stable updates to pre-trained weights  |
| Batch Size    | `4`    | Kept small to fit in Colab GPU memory                   |
| Epochs        | `1`    | Keeps the run short on the 500-review training subset   |
| Weight Decay  | `0.01` | Regularization to reduce overfitting                    |
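
As a quick worked example, these settings determine how many weight updates the run performs. The sketch below assumes a single device, no gradient accumulation, and the 500-review training subset selected in the training cell; the helper name `total_optimizer_steps` is hypothetical.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Number of weight updates, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # a partial last batch still counts
    return steps_per_epoch * epochs

steps = total_optimizer_steps(num_examples=500, batch_size=4, epochs=1)
print(steps)  # 125
```

With only 125 updates at a learning rate of `2e-5`, the model has limited opportunity to adapt, which is worth keeping in mind when reading the training loss.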

---

### 3. Hugging Face Transformers

#### Overview
The [Transformers](https://huggingface.co/transformers) library by Hugging Face provides:
- Pre-trained models
- Tokenizers
- APIs for training and inference

It simplifies working with complex NLP models like BERT.

#### Components Used
- `BertTokenizer`: Tokenizes and encodes input text into BERT-compatible format using WordPiece tokenization.
- `BertForSequenceClassification`: BERT model with a classification layer.
- `Trainer` & `TrainingArguments`: High-level APIs for training, evaluation, and logging.
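
To make the WordPiece step concrete, here is a toy sketch of greedy longest-match-first splitting, in which continuation pieces carry a `##` prefix. It is not the `BertTokenizer` implementation: the vocabulary is invented for the example, the function name `wordpiece_tokenize` is hypothetical, and details such as lowercasing, punctuation splitting, and special tokens are omitted.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest-match-first."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##believ", "##able", "movie", "##s"}  # invented vocabulary
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("movies", toy_vocab))        # ['movie', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```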

#### Advantages
- Abstracts away low-level PyTorch code
- Easy to use with GPU/TPU
- Supports model saving, loading, and deployment

---





In [1]:
!pip install transformers==4.44.2 datasets==3.0.1 torch==2.4.1


In [1]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch

# Load the IMDb dataset
dataset = load_dataset("imdb")

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize each review, truncating or padding it to 256 tokens
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)  # Reduced from BERT's 512-token limit to save memory

# Tokenize in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Use small subsets to keep the run short
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(500))  # Reduced from 1000
eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(200))    # Held out; unused below because evaluation is disabled

# Minimal training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="no",       # Disable evaluation to save time/memory (formerly evaluation_strategy, now deprecated)
    learning_rate=2e-5,       # Standard learning rate for fine-tuning BERT
    per_device_train_batch_size=4,  # Reduced from 8 to lower memory usage
    num_train_epochs=1,       # Single pass over the training subset
    weight_decay=0.01,        # Reduced from 0.1
    logging_steps=50,         # Increased from 10 to reduce logging overhead
    use_cpu=False,            # Use GPU if available (Colab default); replaces the deprecated no_cuda
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Start fine-tuning
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./finetuned_bert")
tokenizer.save_pretrained("./finetuned_bert")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




| Step | Training Loss |
|------|---------------|
| 50   | 0.6923        |


('./finetuned_bert/tokenizer_config.json',
 './finetuned_bert/special_tokens_map.json',
 './finetuned_bert/vocab.txt',
 './finetuned_bert/added_tokens.json')
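
The training cell disables evaluation, so the held-out `eval_dataset` is never scored. As a starting point, accuracy can be computed from per-example logits as sketched below. The function name `accuracy_from_logits` and the toy numbers are hypothetical; the function works on plain nested lists, and adapting it into a metrics callback for `Trainer` is left as an assumption about how it might be wired in.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the true label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=lambda c: row[c])  # argmax over classes
        correct += int(predicted == label)
    return correct / len(labels)

# Made-up logits for four reviews; columns are [negative, positive]
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 0.8]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```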

In [6]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model = BertForSequenceClassification.from_pretrained("./finetuned_bert")
tokenizer = BertTokenizer.from_pretrained("./finetuned_bert")

# Example text
text = "This movie is good"

# Tokenize and predict without tracking gradients (inference only)
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Map prediction to label
labels = ["Negative", "Positive"]
print(f"Sentiment: {labels[predictions.item()]}")

Sentiment: Positive
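
The argmax above reports only the winning label. To see how confident the model is, the logits can be converted to probabilities with a softmax. The sketch below uses plain Python and made-up logit values, not output from the fine-tuned model; in the notebook, the equivalent call would be `torch.softmax(outputs.logits, dim=-1)`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Positive"]
example_logits = [-0.4, 0.9]  # made-up values, not from the fine-tuned model
probs = softmax(example_logits)
best = max(range(len(probs)), key=lambda i: probs[i])
print(f"Sentiment: {labels[best]} (p={probs[best]:.2f})")
```

A probability near 0.5 would indicate that the model is close to guessing, which is plausible after a single short epoch.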
