# Fine-Tuning BERT for Sentiment Analysis: A Detailed Step-by-Step Guide

Welcome! This notebook will guide you through the complete process of fine-tuning a pre-trained BERT model for a sentiment analysis task. We will use the popular IMDb movie review dataset.

This version is highly detailed, with each step broken down into individual cells and extensive debugging printouts to make the process as clear as possible.

### Step 1: Installing Required Libraries

First, we need to install the necessary Python packages. We'll use the `!pip` command to install them directly within our Colab environment.
- `transformers`: Provides the BERT model and tokenizer.
- `datasets`: Helps us easily load the IMDb dataset.
- `torch`: The deep learning framework that powers the training.
- `evaluate`: A library from Hugging Face for model evaluation metrics.

In [1]:
!pip install transformers datasets torch evaluate


### Step 2: Importing Libraries

Now that the libraries are installed, we import the specific modules and classes we'll need for our script. We'll import them one by one for clarity.

In [2]:
import torch

In [3]:
from datasets import load_dataset

In [4]:
from transformers import AutoModelForSequenceClassification

In [5]:
from transformers import AutoTokenizer

In [6]:
from transformers import TrainingArguments

In [7]:
from transformers import Trainer

In [8]:
import numpy as np

In [9]:
import evaluate

Let's confirm the imports were successful.

In [10]:
print("All libraries imported successfully!")

All libraries imported successfully!


### Step 3: Loading the Dataset

We'll use the `datasets` library to download and load the IMDb movie review dataset.

In [11]:
print("Loading IMDb dataset...")

Loading IMDb dataset...


In [12]:
dataset = load_dataset("imdb")


In [13]:
print("Dataset loaded successfully.")

Dataset loaded successfully.


**(Debugging)** Let's inspect the loaded dataset to understand its structure.

In [14]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


**(Debugging)** Let's also look at a single example from the training set. Notice the `text` and `label` fields.

In [15]:
print("\nExample of a training sample:")
print(dataset["train"][0])


Example of a training sample:
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and n

### Step 4: Data Preprocessing

Machine learning models can't work with raw text. We need to convert the text reviews into a numerical format that BERT can understand. This process is called **tokenization**.
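To make the idea of tokenization concrete, here is a minimal sketch of the greedy longest-match-first subword splitting that WordPiece-style tokenizers such as BERT's use. The vocabulary, its IDs, and the function names (`wordpiece`, `encode`) are made up for illustration; the real `bert-base-uncased` vocabulary has roughly 30,000 entries with different IDs, and the real tokenizer also handles punctuation and other details this sketch ignores.

```python
# Toy vocabulary (an assumption for illustration, not the real BERT vocab).
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "the": 4, "act": 5, "##ing": 6, "was": 7, "superb": 8,
}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest pieces found in the vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, and wrap with [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return [vocab[token] for token in tokens]


print(wordpiece("acting", TOY_VOCAB))
print(encode("The acting was superb", TOY_VOCAB))
```

Note how "acting" is not in the toy vocabulary as a whole word, so it is split into `act` and `##ing`. This is how a fixed vocabulary can still represent words it has never seen in full.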

#### 4.1 - Load the Tokenizer

We must use the exact same tokenizer that was used to pre-train the BERT model. The `AutoTokenizer` class from Hugging Face handles this for us, automatically downloading the correct tokenizer for `bert-base-uncased`.

In [16]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [17]:
print("Tokenizer loaded successfully.")

Tokenizer loaded successfully.


#### 4.2 - Create a Tokenization Function

We'll define a function that takes examples from our dataset and applies the tokenizer to them.
- `padding='max_length'`: Pads every sequence to the tokenizer's maximum length (512 tokens for `bert-base-uncased`) by appending special `[PAD]` tokens, so all examples end up the same length.
- `truncation=True`: Cuts off any sequence longer than that maximum, since BERT cannot process more than 512 tokens at once.

In [18]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
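The effect of these two options on a single sequence can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and a small `max_length` of 8 is used so the result is easy to read (the real limit here is 512). The only detail taken from `bert-base-uncased` is that `[PAD]` has ID 0.

```python
PAD_ID = 0  # ID of the [PAD] token in bert-base-uncased


def pad_or_truncate(input_ids, max_length, pad_id=PAD_ID):
    """Return fixed-length input_ids plus the matching attention mask."""
    if len(input_ids) > max_length:
        # Simplification: keep the final token (e.g. [SEP]) when truncating.
        ids = input_ids[: max_length - 1] + [input_ids[-1]]
    else:
        ids = list(input_ids)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short_review = [101, 2023, 3185, 102]                      # 4 tokens
long_review = [101] + list(range(2000, 2010)) + [102]      # 12 tokens

print(pad_or_truncate(short_review, max_length=8))
print(pad_or_truncate(long_review, max_length=8))
```

The attention mask is what lets the model tell real tokens apart from padding, which is why the tokenizer returns it alongside `input_ids`.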

#### 4.3 - Apply Tokenization to the Entire Dataset

Now, we use the `.map()` method to apply our `tokenize_function` to every review in the dataset. Using `batched=True` processes multiple examples at once, which is much faster.

In [19]:
print("Tokenizing the dataset...")

Tokenizing the dataset...


In [20]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [21]:
print("Dataset tokenized successfully.")

Dataset tokenized successfully.
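With `batched=True`, `.map()` does not call our function once per review. Instead it passes a dictionary that maps each column name to a list of values (1,000 rows at a time by default) and merges the returned columns back into the dataset. The sketch below imitates that behaviour on plain Python lists. `map_batched` and `add_word_counts` are hypothetical names used only for illustration, and the real library does considerably more (caching, Arrow storage, multiprocessing).

```python
def map_batched(rows, fn, batch_size=1000):
    """Apply fn to column-wise batches and merge the new columns back in."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            out.append(merged)
    return out


def add_word_counts(batch):
    # Receives {"text": [...]} and returns one value per input row.
    return {"n_words": [len(text.split()) for text in batch["text"]]}


reviews = [{"text": "good movie"}, {"text": "bad"}, {"text": "a very long film"}]
print(map_batched(reviews, add_word_counts, batch_size=2))
```

This is also why `tokenize_function` can pass `examples["text"]` straight to the tokenizer: in batched mode it is a list of strings, which the tokenizer processes in one call.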


**(Debugging)** Let's look at the dataset again. Notice the new columns: `input_ids`, `token_type_ids`, and `attention_mask`. These were added by the tokenizer.

In [22]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})


#### 4.4 - Format the Dataset for Training

The tokenization process added three new columns (`input_ids`, `token_type_ids`, and `attention_mask`), so we no longer need the original `text` column.

In [23]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])

**(Debugging)** Let's check the columns again to confirm 'text' is gone.

In [24]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})


Next, we rename the `label` column to `labels`, as this is the name the model expects.

In [25]:
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

**(Debugging)** Let's check the columns one more time to confirm 'label' has been renamed to 'labels'.

In [26]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})


Finally, we set the format of our dataset to `torch` tensors, which is the format PyTorch uses.

In [27]:
tokenized_datasets.set_format("torch")

#### 4.5 - Create Smaller Subsets for a Quicker Run

Training on the full IMDb dataset (25,000 training reviews) can take a while. For this demonstration, we'll create smaller random subsets of 1,000 examples each from the training and test sets, using a fixed seed so the selection is reproducible. This allows us to run through the entire process quickly. For best results on a real project, you would use the full dataset.

In [28]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))

In [29]:
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
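Conceptually, `shuffle(seed=42).select(range(1000))` permutes the row indices with a seeded random generator and keeps the first 1,000. The sketch below shows the same idea with Python's `random` module. `shuffled_subset` is a hypothetical helper, and because `datasets` uses a different random generator internally, this will not pick the same rows as the cells above; it only illustrates why a fixed seed makes the subset reproducible.

```python
import random


def shuffled_subset(num_rows, subset_size, seed):
    """Return subset_size distinct row indices, chosen reproducibly."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the order is repeatable
    return indices[:subset_size]


first_run = shuffled_subset(25000, 1000, seed=42)
second_run = shuffled_subset(25000, 1000, seed=42)

print(len(first_run), len(set(first_run)))  # 1000 distinct indices
print(first_run == second_run)              # same seed, same subset
```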

**(Debugging)** Let's check the size of our new, smaller datasets.

In [30]:
print(f"Size of the small training set: {len(small_train_dataset)}")

Size of the small training set: 1000


In [31]:
print(f"Size of the small evaluation set: {len(small_eval_dataset)}")

Size of the small evaluation set: 1000


### Step 5: Model Training

This is the core of our task: fine-tuning the pre-trained BERT model.

#### 5.1 - Load the Pre-trained Model

We use `AutoModelForSequenceClassification` to load the `bert-base-uncased` model. This class automatically adds a classification "head" on top of the base BERT model. This head is a small, untrained neural network layer that we will train to perform our specific sentiment analysis task. We specify `num_labels=2` because we have two output classes: positive and negative.

In [32]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
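The warning above is expected: `classifier.weight` and `classifier.bias` are the new head, and they start from random values. As a rough sketch of what that head computes, the code below builds a single linear layer that maps a 768-dimensional sentence representation (the hidden size of `bert-base-uncased`) to 2 logits. The function names are hypothetical, the initialization (small Gaussian weights, zero bias) is an assumption about typical defaults, and the real head also applies dropout before the linear layer.

```python
import random

HIDDEN_SIZE = 768  # hidden size of bert-base-uncased
NUM_LABELS = 2     # negative, positive


def init_head(hidden_size, num_labels, seed=0, std=0.02):
    """Create randomly initialized weights and a zero bias (assumed defaults)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, std) for _ in range(hidden_size)]
              for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weight, bias


def head_forward(pooled_output, weight, bias):
    """Compute one logit per label: logits = W @ pooled_output + b."""
    return [sum(w * x for w, x in zip(row, pooled_output)) + b
            for row, b in zip(weight, bias)]


weight, bias = init_head(HIDDEN_SIZE, NUM_LABELS)
fake_pooled_output = [0.1] * HIDDEN_SIZE  # stand-in for BERT's output
logits = head_forward(fake_pooled_output, weight, bias)

num_parameters = HIDDEN_SIZE * NUM_LABELS + NUM_LABELS
print(f"Logits: {logits}")
print(f"Parameters in the head: {num_parameters}")
```

The head is tiny compared with the rest of the model, which is part of why fine-tuning can work with relatively little labeled data: most of the knowledge is already in the pre-trained layers.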


In [33]:
print("Pre-trained model with a new classification head loaded successfully.")

Pre-trained model with a new classification head loaded successfully.


#### 5.2 - Define Training Arguments

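The `TrainingArguments` class lets us configure all the hyperparameters and settings for the training process. We also add `report_to="none"` to disable logging to Weights & Biases (wandb) for this simple tutorial. One caveat: with the 1,000-example subset and a batch size of 8, training runs for only 375 steps (the `global_step` reported after training), so `warmup_steps=500` means the learning rate is still warming up when training ends. For a run of this size, a smaller warmup value would be more appropriate.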

In [34]:
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
)
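To see how these settings interact, we can work out the number of optimizer steps and the learning rate reached at the end of training. This is a minimal sketch under several assumptions: a single device, no gradient accumulation, the default learning rate of 5e-5, and the default linear schedule with warmup. The function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = total_training_steps(num_examples=1000, batch_size=8, num_epochs=3)
final_lr = linear_warmup_lr(step=total, base_lr=5e-5,
                            warmup_steps=500, total_steps=total)

print(f"Total training steps: {total}")
print(f"Learning rate at the last step: {final_lr:.2e}")
```

Under these assumptions the run ends after 375 steps, before the 500 warmup steps are over, so the learning rate only reaches three quarters of its configured peak and never enters the decay phase.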

**(Debugging)** Let's print the training arguments to see our configuration.

In [35]:
print("Training Arguments:")
print(training_args)

Training Arguments:
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use

### Step 6: Define Evaluation Metrics

To know how well our model is doing, we need to define a metric. We'll use **accuracy**, which is a common metric for classification tasks.

#### 6.1 - Load the Accuracy Metric

We use the `evaluate` library to load the accuracy metric.

In [36]:
metric = evaluate.load("accuracy")


#### 6.2 - Create a `compute_metrics` Function

The `Trainer` needs a function that it can call during evaluation to compute the metrics. This function will take the model's predictions (`logits`) and the true labels, and return the calculated accuracy.

In [37]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
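For intuition, here is what `compute_metrics` does, written out in plain Python on a few made-up logits and labels. The helper names (`argmax`, `accuracy`) are hypothetical; in the notebook, `np.argmax` and the `evaluate` metric do this work.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up scores for four reviews: [negative score, positive score]
example_logits = [[-2.0, 2.0], [1.5, -0.5], [0.1, 0.3], [0.9, 0.2]]
example_labels = [1, 0, 0, 0]

print(accuracy(example_logits, example_labels))
```

The third example is predicted as class 1 but labeled 0, so three of the four predictions are correct.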

### Step 7: Initialize and Run the Trainer

Now we bring everything together. The `Trainer` class from Hugging Face takes our model, training arguments, datasets, and metrics function, and handles the entire training loop for us.

In [38]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [39]:
print("Trainer initialized. Ready to start training.")

Trainer initialized. Ready to start training.


Now, we can start the training by calling `.train()`.

In [40]:
print("Starting model training...")

Starting model training...


In [41]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.4286 | 0.355415 | 0.867 |
| 2 | 0.4603 | 0.339773 | 0.866 |
| 3 | 0.3913 | 0.34241 | 0.895 |



TrainOutput(global_step=375, training_loss=0.41940143950780234, metrics={'train_runtime': 371.0916, 'train_samples_per_second': 8.084, 'train_steps_per_second': 1.011, 'total_flos': 789333166080000.0, 'train_loss': 0.41940143950780234, 'epoch': 3.0})

In [42]:
print("Training finished.")

Training finished.


### Step 8: Evaluate the Final Model

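After training is complete, we evaluate the model on our 1,000-example subset of the test split. `trainer.evaluate()` uses the `eval_dataset` passed to the `Trainer`, which is the same subset used for the per-epoch evaluation during training, so the result matches the epoch 3 row above. The model was never trained on these examples, but for an unbiased final score on a real project you would evaluate on the full test set.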

In [43]:
print("Evaluating the model...")

Evaluating the model...


In [44]:
evaluation_results = trainer.evaluate()

In [45]:
print("\nEvaluation results:")


Evaluation results:


In [46]:
print(evaluation_results)

{'eval_loss': 0.34241020679473877, 'eval_accuracy': 0.895, 'eval_runtime': 27.3818, 'eval_samples_per_second': 36.521, 'eval_steps_per_second': 4.565, 'epoch': 3.0}


### Step 9: Inference

The final step is to use our fine-tuned model to make predictions on new sentences. This is known as **inference**.

#### 9.1 - Inference on a Positive Review

In [47]:
text_1 = "This movie was absolutely fantastic! The acting was superb and the plot was thrilling."

In [48]:
print(f"Analyzing sentiment for text: '{text_1}'")

Analyzing sentiment for text: 'This movie was absolutely fantastic! The acting was superb and the plot was thrilling.'


Tokenize the input text.

In [49]:
inputs = tokenizer(text_1, return_tensors="pt")

**(Debugging)** Let's see the tokenized inputs.

In [50]:
print(inputs)

{'input_ids': tensor([[  101,  2023,  3185,  2001,  7078, 10392,   999,  1996,  3772,  2001,
         21688,  1998,  1996,  5436,  2001, 26162,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


Move the inputs to the same device as the model (GPU if available).

In [51]:
inputs = {k: v.to(model.device) for k, v in inputs.items()}

Get predictions from the model.

In [52]:
with torch.no_grad():
    outputs = model(**inputs)


The model outputs 'logits', which are the raw scores for each class.

In [53]:
logits = outputs.logits

**(Debugging)** Let's look at the raw logit scores. The first number is the score for 'negative' and the second is for 'positive'.

In [54]:
print(f"Logits: {logits}")

Logits: tensor([[-2.0206,  2.0464]], device='cuda:0')


We use `argmax` to get the index of the highest score, which corresponds to the predicted class.

In [55]:
predicted_class_id = torch.argmax(logits, dim=1).item()
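Logits are unnormalized scores, so they are hard to read as confidence. If you want probabilities, one common option is to apply a softmax. The sketch below does this in plain Python on the logit values printed above; `softmax` here is a small helper written for illustration (with tensors you would normally use `torch.softmax` instead).

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([-2.0206, 2.0464])  # logits from the positive review
print(f"P(negative) = {probabilities[0]:.4f}")
print(f"P(positive) = {probabilities[1]:.4f}")
```

Softmax preserves the ordering of the scores, so `argmax` over the probabilities gives the same predicted class as `argmax` over the logits.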

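Finally, we use the model's configuration to map the predicted ID back to its label name. Because we did not set custom label names, the configuration uses the defaults (0 -> 'LABEL_0', 1 -> 'LABEL_1'). In the IMDb dataset, label 0 is negative and label 1 is positive, so `LABEL_1` means a positive review.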

In [56]:
sentiment = model.config.id2label[predicted_class_id]

In [57]:
print(f"Predicted sentiment: {sentiment}")

Predicted sentiment: LABEL_1
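`LABEL_1` is not very readable. A simple way to get friendlier output is to keep your own mapping from class IDs to names, as in the sketch below. The names `NEGATIVE` and `POSITIVE` and the helper `to_sentiment` are choices made for this illustration. As far as we know, the same two dictionaries can also be passed as `id2label` and `label2id` when loading the model with `from_pretrained`, so that `model.config.id2label` returns these names directly, but treat that as something to verify against the Transformers documentation.

```python
# IMDb convention: 0 = negative review, 1 = positive review.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {name: idx for idx, name in id2label.items()}


def to_sentiment(predicted_class_id, mapping=id2label):
    """Translate a predicted class ID into a readable sentiment name."""
    return mapping[predicted_class_id]


print(f"Predicted sentiment: {to_sentiment(1)}")
print(f"Predicted sentiment: {to_sentiment(0)}")
```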


#### 9.2 - Inference on a Negative Review

In [58]:
text_2 = "I was really disappointed with this film. It was boring and the story was weak."

In [59]:
print(f"\nAnalyzing sentiment for text: '{text_2}'")


Analyzing sentiment for text: 'I was really disappointed with this film. It was boring and the story was weak.'


In [60]:
inputs_2 = tokenizer(text_2, return_tensors="pt")

In [61]:
inputs_2 = {k: v.to(model.device) for k, v in inputs_2.items()}

In [62]:
with torch.no_grad():
    outputs_2 = model(**inputs_2)

In [63]:
logits_2 = outputs_2.logits

**(Debugging)** Let's look at the raw logit scores for the negative review.

In [64]:
print(f"Logits: {logits_2}")

Logits: tensor([[ 2.2189, -2.6327]], device='cuda:0')


In [65]:
predicted_class_id_2 = torch.argmax(logits_2, dim=1).item()

In [66]:
sentiment_2 = model.config.id2label[predicted_class_id_2]

In [67]:
print(f"Predicted sentiment: {sentiment_2}")

Predicted sentiment: LABEL_0
