# Run this notebook on Google Colab

# Fine-Tuning BERT for Sentiment Analysis: A Detailed Step-by-Step Guide

### Step 1: Installing Required Libraries

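First, we need to install the necessary Python packages. We'll use the `!pip` command to install them directly within our Colab environment.
- `transformers`: Provides the BERT model and tokenizer.
- `datasets`: Helps us easily load the IMDb dataset.
- `torch`: The deep learning framework that powers the training.
- `evaluate`: A library from Hugging Face for model evaluation metrics.
- `accelerate`: Required by the `Trainer` when training with PyTorch.
- `tf_keras`: The Keras 2 compatibility package, presumably included to avoid Keras version errors when `transformers` finds TensorFlow in the Colab environment.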

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [7]:
!pip install transformers datasets torch evaluate tf_keras 'accelerate>=0.26.0'



### Step 2: Importing Libraries

Now that the libraries are installed, we import the specific modules and classes we'll need for our script.

In [8]:
import torch

In [9]:
from datasets import load_dataset

In [10]:
# This class is the starting point for any sequence classification task, which involves assigning a label to a whole piece of text.

from transformers import AutoModelForSequenceClassification

In [11]:
# Automatically loads the correct tokenizer for a model. It converts raw text into numerical inputs (token IDs) that the model can understand.
# The three main types of tokenizers are word-based, character-based, and subword-based. BERT uses a subword tokenizer (WordPiece).

from transformers import AutoTokenizer

In [12]:
# A configuration class that holds all the settings and hyperparameters for training (e.g., learning rate, batch size, number of epochs).

from transformers import TrainingArguments

In [13]:
# Handles the entire training and evaluation loop, abstracting away the complex boilerplate code required to train a model.

from transformers import Trainer

In [14]:
import numpy as np

In [15]:
# The Hugging Face `evaluate` library provides a simple, unified way to assess model performance.

import evaluate

Let's confirm the imports were successful.

### Step 3: Loading the Dataset

We'll use the `datasets` library to download and load the IMDb movie review dataset.

In [19]:
dataset = load_dataset("imdb")

In [20]:
print("Dataset loaded successfully.")

Dataset loaded successfully.


**(Debugging)** Let's inspect the loaded dataset to understand its structure.

In [21]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


**(Debugging)** Let's also look at a single example from the training set. Notice the `text` and `label` fields.

In [22]:
print("\nExample of a training sample:")
print(dataset["train"][0])


Example of a training sample:
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and n

### Step 4: Data Preprocessing

Machine learning models can't work with raw text. We need to convert the text reviews into a numerical format that BERT can understand. This process is called **tokenization**.

#### 4.1 - Load the Tokenizer

We must use the exact same tokenizer that was used to pre-train the BERT model. The `AutoTokenizer` class from Hugging Face handles this for us, automatically downloading the correct tokenizer for `bert-base-uncased`.

- `bert`: Stands for Bidirectional Encoder Representations from Transformers. It identifies the model architecture, which reads the entire sequence at once so that the representation of each word depends on the context on both sides of it.
- `base`: Indicates the size of the model. The original BERT was released in two sizes:
  - `base`: A smaller model with 12 layers and about 110 million parameters. It's faster and requires less computational power.
  - `large`: A much larger, more powerful model with 24 layers and about 340 million parameters.
- `uncased`: Describes the text preprocessing used during pre-training. The text was converted to lowercase before tokenization, so the model does not distinguish between "Movie" and "movie". The alternative, a `cased` model, preserves capitalization.
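To make the meaning of `uncased` concrete, here is a minimal pure-Python sketch of the kind of normalization an uncased tokenizer applies before splitting text into tokens: lowercasing and stripping accent marks. The function name `normalize_uncased` is hypothetical and is not part of `transformers`; the real tokenizer does considerably more than this (for example, splitting words into subword pieces).

```python
import unicodedata


def normalize_uncased(text: str) -> str:
    """Lowercase text and strip accent marks (illustrative helper, not a library API)."""
    lowered = text.lower()
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn"), keeping the base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize_uncased("The Movie was SUPERB, très bien!"))
```

After this step, "SUPERB" and "superb" map to the same string, which is why an uncased model cannot use capitalization as a signal.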

In [25]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [26]:
print("Tokenizer loaded successfully.")

Tokenizer loaded successfully.


#### 4.2 - Create a Tokenization Function

We'll define a function that takes examples from our dataset and applies the tokenizer to them.
- `padding='max_length'`: Pads every sequence to the same length, here the model's maximum of 512 tokens. Shorter sequences have special `[PAD]` tokens appended.
- `truncation=True`: Cuts off any sequence that is longer than the maximum length BERT can handle (512 tokens).
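The effect of these two options can be sketched in plain Python on lists of token IDs. This is an illustration only: `pad_or_truncate` is a hypothetical helper, the maximum length of 8 is chosen for readability, and the padding ID of 0 is an assumption of the sketch. The real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not attempt.

```python
PAD_ID = 0  # assumption for this sketch: ID of the [PAD] token


def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)           # padding: how many [PAD] tokens to add
    input_ids = kept + [PAD_ID] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every review to the maximum length is simple but wasteful for short texts. Padding each batch only to its longest sequence (for example with the `DataCollatorWithPadding` class from `transformers`) is a common alternative worth considering.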

In [28]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

#### 4.3 - Apply Tokenization to the Entire Dataset

Now, we use the `.map()` method to apply our `tokenize_function` to every review in the dataset. Using `batched=True` processes multiple examples at once, which is much faster.
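To see what `batched=True` changes, the sketch below mimics the two calling conventions in plain Python. It is not how `datasets` is implemented: `map_rows` and `map_batched` are hypothetical names, and the toy function counts whitespace-separated words in place of a real tokenizer. The point is that in batched mode the function receives a dictionary whose values are lists, which is why `tokenize_function` can pass `examples["text"]` straight to the tokenizer.

```python
rows = [{"text": "great movie"}, {"text": "boring"}, {"text": "not bad at all"}]


def count_words_single(example):
    # Unbatched: `example` is one row, so example["text"] is a string.
    return {"n_words": len(example["text"].split())}


def count_words_batched(examples):
    # Batched: `examples` holds whole columns, so examples["text"] is a list of strings.
    return {"n_words": [len(t.split()) for t in examples["text"]]}


def map_rows(data, fn):
    """Call fn once per row and merge its output into the row."""
    return [{**row, **fn(row)} for row in data]


def map_batched(data, fn, batch_size=2):
    """Call fn once per batch of rows, passing a dict of columns."""
    out = []
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        columns = {key: [row[key] for row in batch] for key in batch[0]}
        new_columns = fn(columns)
        for i, row in enumerate(batch):
            out.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return out


# Both conventions produce the same rows; batching only reduces the number of calls.
assert map_rows(rows, count_words_single) == map_batched(rows, count_words_batched)
print(map_batched(rows, count_words_batched))
```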

In [29]:
print("Tokenizing the dataset...")

Tokenizing the dataset...


In [30]:
# Setting batched=True makes the .map() operation process multiple examples from the dataset at once, which provides a significant performance boost

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [31]:
print("Dataset tokenized successfully.")

Dataset tokenized successfully.


**(Debugging)** Let's look at the dataset again. Notice the new columns: `input_ids`, `token_type_ids`, and `attention_mask`. These were added by the tokenizer.

- `input_ids`: The numerical IDs of the tokens in the text.
- `token_type_ids`: Distinguish the two segments of a sentence pair (0s for sentence A, 1s for sentence B). Each review is a single segment, so here they are all 0.
- `attention_mask`: A list of 1s and 0s that tells the model which tokens to attend to: 1 for real tokens and 0 for padding.
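Since every IMDb review is a single segment, `token_type_ids` never shows anything but zeros in this notebook. The following sketch illustrates the sentence-pair case in plain Python. `build_pair_inputs` is a hypothetical helper, the short ID lists are made up, and 101 and 102 are assumed to be the `[CLS]` and `[SEP]` IDs.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble [CLS] A [SEP] B [SEP] with matching token_type_ids and attention_mask."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment A (including [CLS] and the first [SEP]) is marked 0, segment B is marked 1.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding here, so every position is a real token.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pair = build_pair_inputs([2023, 3185], [2307])
for name, values in pair.items():
    print(name, values)
```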

In [32]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})


#### 4.4 - Format the Dataset for Training

The tokenization process adds new columns (`input_ids`, `token_type_ids`, `attention_mask`). The model only needs these numerical columns, so we can drop the original `text` column.

In [33]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])

**(Debugging)** Let's check the columns again to confirm 'text' is gone.

In [34]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})


Next, we rename the `label` column to `labels`, as this is the name the model expects.

In [35]:
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

**(Debugging)** Let's check the columns one more time to confirm 'label' has been renamed to 'labels'.

In [36]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})


Finally, we set the format of our dataset to `torch`, so that indexing the dataset returns PyTorch tensors instead of Python lists.

In [40]:
tokenized_datasets.set_format("torch")

#### 4.5 - Create Smaller Subsets for a Quicker Run

Training on the full IMDb dataset can take a while. For this demonstration, we'll create smaller, random subsets of the training and test sets. This allows us to run through the entire process quickly. For best results on a real project, you would use the full dataset.
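When working with a small random subset, it is worth checking that both classes are still well represented. The sketch below shows the shuffle-then-take-the-first-N idea in plain Python and counts the labels in the result. It is illustrative only: `sample_indices` is a hypothetical helper, the label list is a stand-in for the 25,000 training labels (half negative, half positive), and Python's `random` module will not reproduce the exact permutation that `datasets` generates for the same seed.

```python
import random
from collections import Counter

# Stand-in for the training labels: 12,500 negative (0) and 12,500 positive (1).
labels = [0] * 12500 + [1] * 12500


def sample_indices(n_total, n_keep, seed):
    """Shuffle all indices with a fixed seed and keep the first n_keep."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_keep]


subset = sample_indices(len(labels), 1000, seed=42)
counts = Counter(labels[i] for i in subset)
print(f"negative: {counts[0]}, positive: {counts[1]}")
```

Fixing the seed makes the subset reproducible, so repeated runs train and evaluate on the same reviews.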

In [41]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))

In [42]:
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

**(Debugging)** Let's check the size of our new, smaller datasets.

In [43]:
print(f"Size of the small training set: {len(small_train_dataset)}")

Size of the small training set: 1000


In [44]:
print(f"Size of the small evaluation set: {len(small_eval_dataset)}")

Size of the small evaluation set: 1000


### Step 5: Model Training

This is the core of our task: fine-tuning the pre-trained BERT model.

#### 5.1 - Load the Pre-trained Model

We use `AutoModelForSequenceClassification` to load the `bert-base-uncased` model. This class automatically adds a classification "head" on top of the base BERT model. We specify `num_labels=2` because we have two output classes: positive and negative.

- A head is a small, untrained neural network layer added on top of a pre-trained base model (like BERT) to adapt it to a specific task. Its weights start out random, which is why loading the model prints a warning about newly initialized `classifier` weights; we train them during fine-tuning.
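Conceptually, such a head can be as simple as one linear layer that turns a sentence representation into one score per class. The sketch below shows that computation in plain Python. The hidden size of 4, the random initialization, and the input vector are all made up for readability, and `classification_head` is a hypothetical name rather than the library's implementation.

```python
import random

HIDDEN_SIZE = 4   # assumption for readability; the real model uses a much larger vector
NUM_LABELS = 2    # negative and positive

rng = random.Random(0)
# Newly initialized parameters, playing the role of the classifier weight and bias.
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS


def classification_head(pooled):
    """Linear layer: one logit per label, computed from a single sentence vector."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weight, bias)]


pooled_output = [0.5, -1.0, 0.25, 2.0]  # made-up sentence representation
head_logits = classification_head(pooled_output)
print(head_logits)
```

Because the weights are random at this point, the two scores carry no information about sentiment yet. Fine-tuning adjusts both the head and the pre-trained layers beneath it.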

In [46]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [47]:
print("Pre-trained model with a new classification head loaded successfully.")

Pre-trained model with a new classification head loaded successfully.


#### 5.2 - Define Training Arguments

The `TrainingArguments` class lets us configure all the hyperparameters and settings for the training process.

In [49]:
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
)
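It helps to work out how many optimizer steps these settings produce. The sketch below does the arithmetic and evaluates a linear warmup-then-decay schedule at the final step. It rests on several assumptions: a single device, no gradient accumulation, a linear schedule, and a peak learning rate of 5e-5 (none of these is set explicitly in the arguments above). `linear_schedule_lr` is a hypothetical helper, not a `transformers` function.

```python
import math

num_examples = 1000      # size of small_train_dataset
batch_size = 8           # per_device_train_batch_size
num_epochs = 3           # num_train_epochs
warmup_steps = 500       # warmup_steps
peak_lr = 5e-5           # assumption: learning rate left at its default

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_schedule_lr(step):
    """Linear warmup to peak_lr, followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"learning rate at the last step: {linear_schedule_lr(total_steps):.2e}")
```

Under these assumptions the run has 375 steps, which agrees with the `global_step=375` reported in the training output. Since `warmup_steps=500` exceeds that total, the learning rate would still be ramping up when training ends and would never reach its peak. With a subset this small, a shorter warmup may be worth trying.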

### Step 6: Define Evaluation Metrics

To know how well our model is doing, we need to define a metric. We'll use **accuracy**, which is a common metric for classification tasks.

#### 6.1 - Load the Accuracy Metric

We use the `evaluate` library to load the accuracy metric.

In [51]:
metric = evaluate.load("accuracy")

Downloading builder script: 0.00B [00:00, ?B/s]

#### 6.2 - Create a `compute_metrics` Function

The `Trainer` needs a function that it can call during evaluation to compute the metrics. This function will take the model's predictions (`logits`) and the true labels, and return the calculated accuracy.

In [52]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
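For intuition, the same calculation can be written out by hand without NumPy or `evaluate`. This is a sketch on made-up logits and labels, with hypothetical helper names; it mirrors what `compute_metrics` does rather than replacing it.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]
print(accuracy(toy_logits, toy_labels))  # 3 of the 4 predictions match: 0.75
```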

### Step 7: Initialize and Run the Trainer

Now we bring everything together. The `Trainer` class from Hugging Face takes our model, training arguments, datasets, and metrics function, and handles the entire training loop for us.

In [53]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [54]:
print("Trainer initialized. Ready to start training.")

Trainer initialized. Ready to start training.


Now, we can start the training by calling `.train()`.

In [55]:
print("Starting model training...")

Starting model training...


In [56]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.4851 | 0.421611 | 0.84 |
| 2 | 0.4208 | 0.326882 | 0.868 |
| 3 | 0.324 | 0.473593 | 0.857 |




TrainOutput(global_step=375, training_loss=0.42337818161646523, metrics={'train_runtime': 383.0337, 'train_samples_per_second': 7.832, 'train_steps_per_second': 0.979, 'total_flos': 789333166080000.0, 'train_loss': 0.42337818161646523, 'epoch': 3.0})

In [58]:
print("Training finished.")

Training finished.


### Step 8: Evaluate the Final Model

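After training is complete, we evaluate the model on the evaluation subset, which was drawn from the test split and never used for training. Because `trainer.evaluate()` uses the same 1,000 reviews as the per-epoch evaluation, the results match the final row of the training log.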

In [59]:
print("Evaluating the model...")

Evaluating the model...


In [60]:
evaluation_results = trainer.evaluate()



In [61]:
print("\nEvaluation results:")


Evaluation results:


In [62]:
print(evaluation_results)

{'eval_loss': 0.47359317541122437, 'eval_accuracy': 0.857, 'eval_runtime': 30.7192, 'eval_samples_per_second': 32.553, 'eval_steps_per_second': 4.069, 'epoch': 3.0}


### Step 9: Inference

The final step is to use our fine-tuned model to make predictions on new sentences. This is known as **inference**.

#### 9.1 - Inference on a Positive Review

In [63]:
text_1 = "This movie was absolutely fantastic! The acting was superb and the plot was thrilling."

In [64]:
print(f"Analyzing sentiment for text: '{text_1}'")

Analyzing sentiment for text: 'This movie was absolutely fantastic! The acting was superb and the plot was thrilling.'


Tokenize the input text.

In [65]:
# "pt": Returns PyTorch tensors (torch.Tensor). This is what you need if you are working with a PyTorch model.

inputs = tokenizer(text_1, return_tensors="pt")

**(Debugging)** Let's see the tokenized inputs.

In [66]:
print(inputs)

{'input_ids': tensor([[  101,  2023,  3185,  2001,  7078, 10392,   999,  1996,  3772,  2001,
         21688,  1998,  1996,  5436,  2001, 26162,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


Move the inputs to the same device as the model (GPU if available).

In [67]:
# In PyTorch, a model and its input data must be on the same device to perform computations
# If your model is on a GPU for faster processing, your data tensors must also be moved from the CPU (their default location) to that same GPU.

inputs = {k: v.to(model.device) for k, v in inputs.items()}

Get predictions from the model.

In [68]:
# Disables gradient tracking, which is not needed for inference. This saves memory and speeds up the forward pass.

with torch.no_grad():
    outputs = model(**inputs)



The model outputs 'logits', which are the raw scores for each class.

In [69]:
logits = outputs.logits

**(Debugging)** Let's look at the raw logit scores. The first number is the score for 'negative' and the second is for 'positive'.

In [70]:
print(f"Logits: {logits}")

Logits: tensor([[-2.9120,  2.3884]], device='cuda:0')
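Logits are not probabilities, but a softmax turns them into values between 0 and 1 that sum to 1. The sketch below applies a hand-written softmax to the two scores printed above. It uses plain Python for illustration; in the notebook itself, `torch.softmax(logits, dim=-1)` performs the same computation on the tensor.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


logits_example = [-2.9120, 2.3884]  # scores printed above for the positive review
probs = softmax(logits_example)
print(f"P(negative) = {probs[0]:.4f}, P(positive) = {probs[1]:.4f}")
```

The gap between the two logits determines how confident the prediction is: the larger the gap, the closer the winning probability is to 1.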


We use `argmax` to get the index of the highest score, which corresponds to the predicted class.

In [71]:
predicted_class_id = torch.argmax(logits, dim=1).item()

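Finally, we use the model's configuration to map the predicted ID back to its label name. Because we did not provide label names when loading the model, the names are generic: 0 -> 'LABEL_0' (negative) and 1 -> 'LABEL_1' (positive).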

In [72]:
sentiment = model.config.id2label[predicted_class_id]

In [73]:
print(f"Predicted sentiment: {sentiment}")

Predicted sentiment: LABEL_1
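A name like `LABEL_1` is not very readable. One option is to keep a small mapping from class index to a descriptive name, as sketched below. The names `NEGATIVE` and `POSITIVE` and the helper `readable_label` are choices made for this illustration.

```python
# Class 0 is the negative score and class 1 the positive score, as noted above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {name: idx for idx, name in id2label.items()}


def readable_label(class_id):
    """Translate a predicted class index into a human-readable name."""
    return id2label[class_id]


print(readable_label(1))
print(label2id)
```

As far as I know, the same two dictionaries can also be passed as the `id2label` and `label2id` keyword arguments when loading the model with `from_pretrained`, so that `model.config.id2label` returns the descriptive names directly. Treat that as something to verify against the `transformers` documentation for your installed version.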


#### 9.2 - Inference on a Negative Review

In [74]:
text_2 = "I was really disappointed with this film. It was boring and the story was weak."

In [75]:
print(f"\nAnalyzing sentiment for text: '{text_2}'")


Analyzing sentiment for text: 'I was really disappointed with this film. It was boring and the story was weak.'


In [76]:
inputs_2 = tokenizer(text_2, return_tensors="pt")

In [77]:
inputs_2 = {k: v.to(model.device) for k, v in inputs_2.items()}

In [78]:
with torch.no_grad():
    outputs_2 = model(**inputs_2)

In [79]:
logits_2 = outputs_2.logits

**(Debugging)** Let's look at the raw logit scores for the negative review.

In [80]:
print(f"Logits: {logits_2}")

Logits: tensor([[ 2.3058, -1.7182]], device='cuda:0')


In [81]:
predicted_class_id_2 = torch.argmax(logits_2, dim=1).item()

In [82]:
sentiment_2 = model.config.id2label[predicted_class_id_2]

In [83]:
print(f"Predicted sentiment: {sentiment_2}")

Predicted sentiment: LABEL_0


In [84]:
# Define a directory to save the model
output_dir = "./my_awesome_model"

# Save the model
model.save_pretrained(output_dir)

# Save the tokenizer
tokenizer.save_pretrained(output_dir)

('./my_awesome_model/tokenizer_config.json',
 './my_awesome_model/special_tokens_map.json',
 './my_awesome_model/vocab.txt',
 './my_awesome_model/added_tokens.json',
 './my_awesome_model/tokenizer.json')

In [85]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

loaded_model = AutoModelForSequenceClassification.from_pretrained(output_dir)
loaded_tokenizer = AutoTokenizer.from_pretrained(output_dir)

In [90]:
text_3 = "It don't know but I am not unsatisified. I think its worth watching if you are geeting bored and you need to spare some time. It worth watching one"

inputs_3 = loaded_tokenizer(text_3, return_tensors="pt")

inputs_3 = {k: v.to(loaded_model.device) for k, v in inputs_3.items()}

with torch.no_grad():
    outputs_3 = loaded_model(**inputs_3)

logits_3 = outputs_3.logits

print(f"Logits: {logits_3}")

predicted_class_id_3 = torch.argmax(logits_3, dim=1).item()

sentiment_3 = loaded_model.config.id2label[predicted_class_id_3]

print(f"Predicted sentiment: {sentiment_3}")



Logits: tensor([[0.1209, 0.9124]])
Predicted sentiment: LABEL_1
