# Finetune DistilBERT on the IMDb dataset to determine whether a movie review is positive or negative.

## Introduction

**DistilBERT** is a smaller, faster, and lighter version of the BERT (Bidirectional Encoder Representations from Transformers) model.

It is designed to retain 97% of BERT's language understanding capabilities while being 40% smaller and 60% faster.

DistilBERT achieves this through a process called knowledge distillation, where a smaller "student" model learns to mimic a larger "teacher" model. This makes DistilBERT an efficient alternative for various natural language processing tasks like text classification, sentiment analysis, and question answering, especially in environments with limited computational resources.
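To make the student/teacher idea concrete, here is a minimal, dependency-free sketch of the soft-target term used in knowledge distillation. The logits, the temperature value, and the helper names `softmax` and `distillation_loss` are illustrative assumptions and are not taken from the DistilBERT training code, which also combines this term with other losses.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and the student's softened outputs."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))

# Hypothetical logits for a single example over three classes
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # student that mimics the teacher well
far_student = [0.5, 1.0, 4.0]     # student that disagrees with the teacher

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge, which is the signal the student is trained to minimize.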


![](https://www.scaler.com/topics/images/tokenization-text.webp)

## Setup

In [None]:
!pip install transformers datasets evaluate accelerate

In [2]:
# Log in to the Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()


## Load the IMDb Dataset

In [3]:
from datasets import load_dataset

imdb = load_dataset("imdb")


In [5]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [6]:
# Look at the first training example
imdb["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

There are two fields in this dataset:

- `text`: the movie review text.
- `label`: a value that is either `0` for a negative review or `1` for a positive review.

## Preprocess

In [8]:
# Load a DistilBERT tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


In [9]:
# Preprocess the entire dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(preprocess_function, batched=True)


In [10]:
# Create a data collator that builds batches of examples and
# dynamically pads each batch to the length of its longest sequence

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
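As an illustration of what dynamic padding does, the following sketch pads a toy batch by hand. It is not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a padding id of `0` is an assumption.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three reviews of different lengths
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 102], [101, 102]]
padded = pad_batch(batch)

print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the target length is chosen per batch rather than for the whole dataset, short reviews are not padded out to the longest review in the corpus, which saves computation.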

## Evaluation Metrics

In [11]:
import evaluate

accuracy = evaluate.load("accuracy")


In [12]:
# Compute accuracy from the model's predictions (logits) and the true labels
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(
        predictions=predictions,
        references=labels
    )
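To show what this function computes, here is a small worked example in plain Python. The logits and labels are made up, and `accuracy_from_logits` is a hypothetical stand-in that mirrors the argmax-then-compare logic without NumPy or `evaluate`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Hypothetical logits for four reviews: [score for class 0, score for class 1]
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 1.5]]
toy_labels = [0, 1, 1, 1]

print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

The third review is misclassified (predicted class 0, labeled 1), so three of four predictions are correct.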

## Train the Model

In [13]:
# Map label ids to label names, and label names back to ids
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
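The two dictionaries need to be exact inverses of each other. One optional way to guarantee that, sketched here as a suggestion rather than as part of the original notebook, is to derive the second mapping from the first:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Build the reverse mapping by inverting id2label
label2id = {label: idx for idx, label in id2label.items()}

# Sanity check: a round trip through both mappings returns the original label
assert all(id2label[label2id[label]] == label for label in label2id)

print(label2id)  # {'NEGATIVE': 0, 'POSITIVE': 1}
```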

In [14]:
# Load the DistilBERT model
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
# Define training hyperparameters in TrainingArguments
training_args = TrainingArguments(
    output_dir="imdb-distilbert-funetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True
)



In [16]:
# Pass the training arguments to Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [17]:
# Call train() to finetune your model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.2239        | 0.202609        | 0.92268  |
| 2     | 0.1468        | 0.231859        | 0.93196  |


TrainOutput(global_step=3126, training_loss=0.20507300944947618, metrics={'train_runtime': 3298.7566, 'train_samples_per_second': 15.157, 'train_steps_per_second': 0.948, 'total_flos': 6556904415524352.0, 'train_loss': 0.20507300944947618, 'epoch': 2.0})
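The `global_step=3126` value in the output can be checked from the training arguments with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

train_examples = 25_000  # size of the IMDb train split
batch_size = 16          # per_device_train_batch_size
epochs = 2               # num_train_epochs
devices = 1              # assumption: one GPU, no gradient accumulation

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(train_examples / (batch_size * devices))
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1563 3126
```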

In [19]:
# Push the fine-tuned model to the Hub
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/Ashaduzzaman/imdb-distilbert-funetuned/commit/24b57f3dd9cb38150ee4fa5571e5bf425e1db31a', commit_message='End of training', commit_description='', oid='24b57f3dd9cb38150ee4fa5571e5bf425e1db31a', pr_url=None, pr_revision=None, pr_num=None)

## Inference

### Perform inference using pipeline with a fine-tuned DistilBERT model

In [20]:
# Example text to run inference on
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [23]:
# Instantiate a pipeline for sentiment analysis with our model
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="Ashaduzzaman/imdb-distilbert-funetuned",
)

# Run inference
classifier(text)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.9949856996536255}]

### Perform inference using Gradio with a fine-tuned DistilBERT model

In [3]:
!pip install gradio


In [4]:
import gradio as gr
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

# Load the fine-tuned DistilBERT model and tokenizer
model_name = "Ashaduzzaman/imdb-distilbert-funetuned"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

# Define the prediction function
def classify_text(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Inference does not need gradients, so disable gradient tracking
    with torch.no_grad():
        logits = model(**inputs).logits
    # Convert logits to probabilities over the two classes
    probabilities = torch.softmax(logits, dim=1)
    labels = ['Negative', 'Positive']
    predicted_label = labels[probabilities.argmax().item()]
    confidence = probabilities.max().item()
    return predicted_label, confidence

# Create the Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="Enter a movie review..."),
    outputs=[
        gr.Label(num_top_classes=2),  # For the predicted label
        gr.Number(label="Confidence Score")  # For the confidence score
    ],
    title="IMDb Movie Review Sentiment Classifier",
    description="Enter a movie review to classify it as positive or negative sentiment."
)

# Launch the interface
iface.launch()
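The prediction function turns the model's two logits into a label and a confidence score. The sketch below reproduces that step without PyTorch on made-up logits, and also builds a per-class dictionary of confidences, a form that `gr.Label` can display when more than one class should be shown. The logits and the name `per_class` are illustrative assumptions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Positive"]
logits = [-2.5, 2.8]  # hypothetical model output for one review

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])

# Single label plus confidence, as returned by the prediction function
print(labels[best], round(probs[best], 4))

# Per-class confidences keyed by label name
per_class = dict(zip(labels, probs))
print(per_class)
```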


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://f8adbce31a47e6e286.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)



In [None]:
"""
**Positive Reviews:**
1. "This movie was an absolute delight! The story was captivating, and the acting was top-notch. I would highly recommend it to anyone looking for a feel-good film."
2. "A masterpiece! The cinematography was breathtaking, and the plot twists kept me on the edge of my seat. Definitely one of the best movies I've seen this year."
3. "I loved every minute of this film. The characters were well-developed, and the emotional depth was just incredible. A must-watch for sure!"

**Negative Reviews:**
1. "What a disappointment. The plot was all over the place, and the acting was subpar. I honestly regret wasting my time on this movie."
2. "The movie had potential, but it was ruined by poor scriptwriting and lackluster performances. It just didn't live up to the hype."
3. "I found the film to be quite boring and predictable. There were no interesting characters or memorable moments. I wouldn't recommend it."

"""