## ***Step 1: Install Required Libraries***

We begin by installing the Hugging Face `transformers` and `datasets` libraries, which provide easy access to pretrained models and datasets.


In [1]:
!pip install transformers datasets --quiet


In [3]:
# Install and upgrade necessary packages
!pip install --upgrade pip --quiet
!pip install --upgrade transformers datasets huggingface_hub fsspec --quiet


[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m1.8/1.8 MB[0m [31m24.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m10.8/10.8 MB[0m [31m130.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m4/4[0m [datasets]
[1A[2K[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.6.0+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cublas-cu12 12.5.3.2 which is incompatible.
torch 2.6.0+cu124 requires nvidia-cuda-cupti-cu12==12.4.127; platform_sy

##  **Step 2: Import Python Libraries**

We import PyTorch, Hugging Face classes, and sklearn metrics.


In [1]:
import torch
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import matplotlib.pyplot as plt


## **Step 3: Load the IMDB Movie Reviews Dataset**

This is a binary sentiment classification dataset with 'positive' and 'negative' reviews.


In [2]:
dataset = load_dataset("imdb")
dataset["train"][0]


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

## **Step 4: Preprocess the Dataset with BERT Tokenizer**

We use the `bert-base-uncased` tokenizer to prepare the text for the model.


In [3]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "labels"])


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

##  **Step 5: Define Evaluation Metrics**

We use Accuracy, Precision, Recall, and F1-score for evaluation.


In [8]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}


## **Step 6: Load and Fine-Tune BERT Model (Config A)**

This uses:
- Learning Rate = 2e-5
- Batch Size = 16
- Epochs = 2


In [5]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [6]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results_A",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_dir="./logs_A"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
    compute_metrics=compute_metrics,
)

trainer.train()

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.482035,0.791,0.0,0.0,0.0
2,0.264700,0.320248,0.895,0.0,0.0,0.0


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


TrainOutput(global_step=626, training_loss=0.23825911134957506, metrics={'train_runtime': 1031.2361, 'train_samples_per_second': 9.697, 'train_steps_per_second': 0.607, 'total_flos': 2631110553600000.0, 'train_loss': 0.23825911134957506, 'epoch': 2.0})

##  **Step 7: Evaluate Baseline Model (Config A)**

In [7]:
metrics = trainer.evaluate()
metrics


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'eval_loss': 0.320247620344162,
 'eval_accuracy': 0.895,
 'eval_precision': 0.0,
 'eval_recall': 0.0,
 'eval_f1': 0.0,
 'eval_runtime': 30.7369,
 'eval_samples_per_second': 32.534,
 'eval_steps_per_second': 2.05,
 'epoch': 2.0}

##  **Step 8: Define Function to Train with Different Hyperparameters**

This function trains a model with given learning rate, batch size, and number of epochs.


In [9]:
def train_model(learning_rate, batch_size, epochs, label):
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    args = TrainingArguments(
        output_dir=f"./results_{label}",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=epochs,
        weight_decay=0.01,
        load_best_model_at_end=True,
        logging_dir=f"./logs_{label}",
        metric_for_best_model="accuracy"
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(5000)),
        eval_dataset=tokenized_datasets["test"].select(range(1000)),
        compute_metrics=compute_metrics,
    )

    print(f"\nüß™ Training {label} with LR={learning_rate}, BS={batch_size}, Epochs={epochs}")
    trainer.train()
    metrics = trainer.evaluate()
    return label, metrics


##  **Step 9: Train Models with Different Hyperparameters**

We compare 3 settings:
- A: LR=2e-5, BS=16, Epochs=2
- B: LR=5e-5, BS=32, Epochs=3
- C: LR=3e-5, BS=16, Epochs=4


In [10]:
results = []
results.append(train_model(2e-5, 16, 2, "A"))
results.append(train_model(5e-5, 32, 3, "B"))
results.append(train_model(3e-5, 16, 4, "C"))

# Show all metrics
for label, metric in results:
    print(f"\nüìä {label} -> Accuracy: {metric['eval_accuracy']:.4f}, F1: {metric['eval_f1']:.4f}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'

## Step 10: Save Best Model and Tokenizer


In [None]:
model.save_pretrained("./bert-final-model")
tokenizer.save_pretrained("./bert-final-model")


## Step 11: Predict Sentiment from Raw Text


In [None]:
def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    label = "Positive" if torch.argmax(probs) == 1 else "Negative"
    return label, probs.detach().numpy()

predict_sentiment("This movie was an absolute masterpiece.")


##  Plot Accuracy, F1, Precision, and Recall for All Configs

We now visualize and compare how different hyperparameter configurations affect model performance.


In [None]:
import matplotlib.pyplot as plt

# Convert results to named lists
labels = [r[0] for r in results]
accuracies = [r[1]['eval_accuracy'] for r in results]
precisions = [r[1]['eval_precision'] for r in results]
recalls = [r[1]['eval_recall'] for r in results]
f1_scores = [r[1]['eval_f1'] for r in results]

# Plot
plt.figure(figsize=(12, 6))
plt.plot(labels, accuracies, marker='o', label="Accuracy")
plt.plot(labels, precisions, marker='s', label="Precision")
plt.plot(labels, recalls, marker='^', label="Recall")
plt.plot(labels, f1_scores, marker='D', label="F1 Score")
plt.title("Model Performance Across Configurations")
plt.xlabel("Model Config")
plt.ylabel("Score")
plt.ylim(0.5, 1.0)
plt.grid(True)
plt.legend()
plt.show()
