# Fine-tune DistilBERT on the IMDb dataset to classify movie reviews as positive or negative

## Introduction

**DistilBERT** is a smaller, faster, and lighter version of the BERT (Bidirectional Encoder Representations from Transformers) model.

It is designed to retain 97% of BERT's language understanding capabilities while being 40% smaller and 60% faster.

DistilBERT achieves this through a process called knowledge distillation, where a smaller "student" model learns to mimic a larger "teacher" model. This makes DistilBERT an efficient alternative for various natural language processing tasks like text classification, sentiment analysis, and question answering, especially in environments with limited computational resources.
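
The student-teacher idea can be sketched in a few lines of plain Python. This is a minimal illustration of a soft-target distillation loss, not DistilBERT's actual training objective (which combines this kind of term with others); the function names, the example logits, and the temperature of 2.0 are assumptions chosen for the demo.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's predictions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits over three classes
teacher = [2.0, 0.5, -1.0]
close_student = [1.9, 0.6, -1.1]   # nearly agrees with the teacher
far_student = [-1.0, 0.5, 2.0]     # ranks the classes in reverse order

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close < loss_far)
```

The loss is smaller when the student's output distribution is closer to the teacher's, which is what training the student minimizes.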


![](https://www.scaler.com/topics/images/tokenization-text.webp)

## Setup

In [None]:
!pip install transformers datasets evaluate accelerate

In [None]:
# Log in to the Hugging Face Hub (required later to push the model)
from huggingface_hub import notebook_login

notebook_login()

## Load the IMDb Dataset

In [None]:
from datasets import load_dataset

imdb = load_dataset("imdb")

In [None]:
imdb

In [None]:
# Look at the first training example
imdb["train"][0]

There are two fields in this dataset:

- `text`: the movie review text.
- `label`: a value that is either `0` for a negative review or `1` for a positive review.
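
Before training, it can help to check how the two labels are distributed. The sketch below counts labels over any iterable of records shaped like the fields above; `label_counts`, `LABEL_NAMES`, and the three toy records are hypothetical and are not taken from the dataset.

```python
from collections import Counter

# Assumed mapping, matching the field description above
LABEL_NAMES = {0: "negative", 1: "positive"}

def label_counts(examples):
    """Count how many records carry each label name."""
    return Counter(LABEL_NAMES[example["label"]] for example in examples)

# Hypothetical records with the same two fields as the dataset
toy_examples = [
    {"text": "A wonderful, moving film.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
    {"text": "Great cast, great script.", "label": 1},
]

counts = label_counts(toy_examples)
print(counts)
```

Passing `imdb["train"]` in place of `toy_examples` should work the same way, assuming each item is a dict with a `label` key.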

## Preprocess

In [None]:
# Load a DistilBERT tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# Preprocess the entire dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(preprocess_function, batched=True)

In [None]:
# Use DataCollatorWithPadding to build batches:
# it pads each batch dynamically to the length of its longest example during collation

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
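
To make dynamic padding concrete, here is a dependency-free sketch of the idea. It is a toy stand-in, not the `DataCollatorWithPadding` implementation; `pad_batch` is a hypothetical helper, the token ids are made up, and a padding id of `0` is assumed.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence to the longest length in this batch only."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two tokenized reviews of different lengths (illustrative ids)
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the target length is computed per batch, a batch of short reviews is not padded out to the length of the longest review in the whole dataset.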

## Evaluation Metrics

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

In [None]:
# Compute accuracy from the model's predictions (logits) and the true labels
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(
        predictions=predictions,
        references=labels
    )
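
The two steps in `compute_metrics` (take the arg-max of each row of logits, then compare with the labels) can be checked by hand on a tiny example. The sketch below uses plain Python instead of NumPy and `evaluate`; `argmax_accuracy` and the logits are hypothetical.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row and report the fraction that is correct."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Hypothetical logits for four reviews, columns = [negative, positive]
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 0.4], [1.0, 0.8]]
labels = [0, 1, 0, 0]

result = argmax_accuracy(logits, labels)
print(result)  # {'accuracy': 0.75}
```

The third review is predicted positive but labeled negative, so three of four predictions match.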

## Train the Model

In [None]:
# Map label ids to label names, and label names back to ids
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [None]:
# Load the DistilBERT model
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

In [None]:
# Define training hyperparameters in TrainingArguments
training_args = TrainingArguments(
    output_dir="imdb-distilbert-funetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True
)
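
As a rough sanity check on these hyperparameters, the number of optimizer steps can be estimated by hand. The sketch assumes the 25,000-review IMDb training split, a single device, no gradient accumulation, and that the last partial batch is kept; `total_training_steps` is a hypothetical helper.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last partial batch counts
    return steps_per_epoch * num_epochs

steps = total_training_steps(num_examples=25_000, batch_size=16, num_epochs=2)
print(steps)  # 3126
```

With several devices the effective batch size grows, so the step count reported by the progress bar may be smaller than this estimate.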

In [None]:
# Pass the training arguments to Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
# Call train() to finetune your model
trainer.train()

In [None]:
# Share the model on the Hub
trainer.push_to_hub()

## Inference

### Perform inference using pipeline with a fine-tuned DistilBERT model

In [None]:
# Example text for inference
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [None]:
# Instantiate a pipeline for sentiment analysis with our model
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="Ashaduzzaman/imdb-distilbert-funetuned",
)

# Run inference
classifier(text)

### Perform inference using Gradio with a fine-tuned DistilBERT model

In [None]:
!pip install gradio

In [None]:
import gradio as gr
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

# Load the fine-tuned DistilBERT model and tokenizer
model_name = "Ashaduzzaman/imdb-distilbert-funetuned"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

# Switch to evaluation mode so dropout is disabled
model.eval()

# Define the prediction function
def classify_text(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Gradients are not needed for inference
    with torch.no_grad():
        logits = model(**inputs).logits
    # Convert logits to class probabilities
    probabilities = torch.softmax(logits, dim=1)
    labels = ['Negative', 'Positive']
    predicted_label = labels[probabilities.argmax().item()]
    confidence = probabilities.max().item()
    return predicted_label, confidence

# Create the Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="Enter a movie review..."),
    outputs=[
        gr.Label(num_top_classes=2),  # For the predicted label
        gr.Number(label="Confidence Score")  # For the confidence score
    ],
    title="IMDb Movie Review Sentiment Classifier",
    description="Enter a movie review to classify it as positive or negative sentiment."
)

# Launch the interface
iface.launch()
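
The post-processing inside `classify_text` (softmax over the logits, then pick the most probable class) can be tried without loading the model. This is a plain-Python sketch with made-up logits; `logits_to_prediction` is a hypothetical helper, and the label names mirror the `id2label` mapping used for training.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def logits_to_prediction(logits):
    """Return the most probable label and its softmax probability."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# Hypothetical logits for a single review: [negative, positive]
label, confidence = logits_to_prediction([-1.2, 2.3])
print(label, round(confidence, 3))
```

With two classes, the confidence depends only on the difference between the two logits, so it always lies between 0.5 and 1.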

Sample reviews for testing the interface:

**Positive Reviews:**
1. "This movie was an absolute delight! The story was captivating, and the acting was top-notch. I would highly recommend it to anyone looking for a feel-good film."
2. "A masterpiece! The cinematography was breathtaking, and the plot twists kept me on the edge of my seat. Definitely one of the best movies I've seen this year."
3. "I loved every minute of this film. The characters were well-developed, and the emotional depth was just incredible. A must-watch for sure!"

**Negative Reviews:**
1. "What a disappointment. The plot was all over the place, and the acting was subpar. I honestly regret wasting my time on this movie."
2. "The movie had potential, but it was ruined by poor scriptwriting and lackluster performances. It just didn't live up to the hype."
3. "I found the film to be quite boring and predictable. There were no interesting characters or memorable moments. I wouldn't recommend it."