# Fine-Tuning a Transformer for Multi-Class Classification Using the AG News Dataset

# Introduction

This notebook demonstrates the fine-tuning of a pre-trained transformer model for a multi-class text classification task using the AG News dataset. The AG News dataset is widely used in NLP tasks and consists of four distinct categories: World, Sports, Business, and Science/Technology. Each sample in the dataset contains a news article headline and body text, labeled according to its category.

# Objective

The primary goal of this project is to fine-tune the `bert-base-cased` transformer model to classify news articles into the four categories. We use the Hugging Face Transformers library to streamline fine-tuning and evaluate the model with the following metrics:

- Accuracy (primary metric)
- Precision and Recall (macro-averaged)

By following this notebook, you will learn how to:

- Load and explore the AG News dataset.
- Preprocess and tokenize the text data using a transformer tokenizer.
- Fine-tune a pre-trained transformer model on the training set.
- Evaluate the model's performance on the test set using appropriate metrics for multi-class classification.

In [1]:
# Install the transformers and datasets libraries
!pip install transformers datasets --quiet


In [2]:
!git clone https://github.com/huggingface/transformers
!pip install /content/transformers

Cloning into 'transformers'...
remote: Enumerating objects: 257639, done.
remote: Counting objects: 100% (30053/30053), done.
remote: Compressing objects: 100% (2424/2424), done.
remote: Total 257639 (delta 29198), reused 27635 (delta 27627), pack-reused 227586 (from 4)
Receiving objects: 100% (257639/257639), 258.87 MiB | 12.46 MiB/s, done.
Resolving deltas: 100% (190766/190766), done.
Processing ./transformers
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.50.0.dev0-py3-none-any.whl size=10770185 sha256=1f5fecacc0aa53060b33efa2a2e797d92b01ef86e3f8c5d096d0dde79816613d
  Stored in directory: /tmp/pip-ephem-wheel-cache-hmgj_toa/wheels/9f/62/72/77fdff469e8308ad837268261590df

In [3]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-1.4.0-py3-none-any.whl.metadata (19 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=2.0.0

### ‼️ Restart runtime after installs!

# Fine-tuning the pretrained model on Multi-Class Classification

## Preparing the dataset

Before we can fine-tune a pretrained model, we download a dataset and prepare it for training.

We begin by loading the [AG News](https://huggingface.co/datasets/sh0416/ag_news) dataset:

In [14]:
# Load the dataset and inspect one row.
from datasets import load_dataset

dataset = load_dataset("ag_news")
dataset["train"][10]

{'text': "Oil and Economy Cloud Stocks' Outlook  NEW YORK (Reuters) - Soaring crude prices plus worries  about the economy and the outlook for earnings are expected to  hang over the stock market next week during the depth of the  summer doldrums.",
 'label': 2}

In [16]:
# Mapping of label IDs to their corresponding names
label_names = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Sci/Tech"
}

# Finding the occurrence of each label
from collections import Counter

label_counts = Counter(dataset['train']['label'])
for label, count in label_counts.items():
    print(f"Label: {label_names.get(label, 'Unknown')}, Count: {count}")


Label: Business, Count: 30000
Label: Sci/Tech, Count: 30000
Label: Sports, Count: 30000
Label: World, Count: 30000


In [17]:
# Creating a smaller subset of the full dataset to fine-tune on to reduce the training time.
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(50))
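
A random draw of 100 examples is not guaranteed to contain the four classes in equal numbers. As a minimal sketch, assuming a class-balanced subset is wanted, the hypothetical helper `stratified_indices` below picks the same number of row indices per class from a plain list of labels. It is illustrative only and not part of the `datasets` library; the toy label list stands in for `dataset["train"]["label"]`.

```python
import random
from collections import defaultdict


def stratified_indices(labels, per_class, seed=42):
    """Return shuffled row indices with `per_class` examples of each label.

    Hypothetical helper for illustration; not part of the `datasets` library.
    """
    rng = random.Random(seed)

    # Group row indices by their label
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    # Draw the same number of rows from every class
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        if len(indices) < per_class:
            raise ValueError(f"label {label} has only {len(indices)} examples")
        chosen.extend(rng.sample(indices, per_class))

    rng.shuffle(chosen)  # avoid grouping the subset by class
    return chosen


# Toy labels standing in for dataset["train"]["label"]
toy_labels = [0, 1, 2, 3] * 50
subset = stratified_indices(toy_labels, per_class=25)
print(len(subset))  # 100
```

The resulting index list could then be passed to `dataset["train"].select(subset)` in place of `range(100)`.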

In [18]:
# Tokenizing the text in the training and evaluation datasets using a BERT tokenizer from HuggingFace.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_small_train_dataset = small_train_dataset.map(tokenize_function, batched=True)
tokenized_small_eval_dataset = small_eval_dataset.map(tokenize_function, batched=True)


In [19]:
# Checking the dimensions (rows, columns) of the tokenized training dataset
tokenized_small_train_dataset.shape

(100, 5)
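
The five columns reported above are presumably the original `text` and `label` plus the three fields a BERT tokenizer returns: `input_ids`, `token_type_ids`, and `attention_mask`. With `padding="max_length"` and `truncation=True`, every sequence ends up with the same length. The sketch below illustrates that idea in plain Python with a hypothetical helper, `pad_or_truncate`. It is a simplification: the token ids are arbitrary, and a real tokenizer also takes care of special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token ids to exactly `max_length` entries.

    Simplified illustration; special-token handling is ignored here.
    """
    kept = token_ids[:max_length]                  # truncation
    padding = [pad_id] * (max_length - len(kept))  # padding
    input_ids = kept + padding
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([101, 7592, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short["input_ids"])       # [101, 7592, 102, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 0, 0, 0]
print(long["input_ids"])        # [1, 2, 3, 4, 5, 6]
```

Padding every example to the maximum length is simple but costly for short news snippets. Padding each batch only to its longest sequence, for example with `DataCollatorWithPadding`, is a common alternative.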

## Training using HuggingFace Transformers Trainer (PyTorch)

In [20]:
# Loading pre-trained BERT model for the 4-class sequence classification.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=4)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training hyperparameters

In [21]:
# Creating TrainingArguments with default hyperparameters; results are saved to the "test_trainer" folder.
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

### Evaluating

In [22]:
# Installing the evaluate library
!pip install evaluate --quiet

In [23]:
# Importing numpy and evaluate library and defining the metrics used to evaluate the model
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

In [25]:
# Defining a function for the metrics used to evaluate the model (Accuracy, Precision & Recall)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Compute metrics
    accuracy_score = accuracy.compute(predictions=predictions, references=labels)
    precision_score = precision.compute(predictions=predictions, references=labels, average="macro")
    recall_score = recall.compute(predictions=predictions, references=labels, average="macro")

    # Combine all metrics in a dictionary
    return {
        "accuracy": accuracy_score["accuracy"],
        "precision": precision_score["precision"],
        "recall": recall_score["recall"],
    }
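
To make the `average="macro"` setting concrete, here is a small pure-Python sketch of macro-averaged precision and recall: each class gets its own score, and the per-class scores are averaged with equal weight. The helper name `macro_precision_recall` is hypothetical and the example labels are made up. A class that is never predicted is given a precision of 0 here, which, as far as I understand, mirrors the default scikit-learn behavior that triggers an undefined-metric warning.

```python
def macro_precision_recall(predictions, references, num_classes):
    """Macro-averaged precision and recall over classes 0..num_classes-1.

    Illustrative re-implementation; assumes every class appears in
    `references`. Undefined per-class scores are set to 0.
    """
    precisions, recalls = [], []
    for cls in range(num_classes):
        true_pos = sum(1 for p, r in zip(predictions, references) if p == cls and r == cls)
        predicted = sum(1 for p in predictions if p == cls)
        actual = sum(1 for r in references if r == cls)
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    return sum(precisions) / num_classes, sum(recalls) / num_classes


# Made-up example: class 3 is never predicted
references = [0, 0, 1, 1, 2, 2, 3, 3]
predictions = [0, 0, 1, 0, 2, 2, 2, 2]
macro_p, macro_r = macro_precision_recall(predictions, references, num_classes=4)
print(round(macro_p, 4))  # 0.5417
print(macro_r)            # 0.625
```

Because every class counts equally, a single class that the model never predicts pulls the macro precision down sharply, even when the other classes are handled well.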


In [26]:
# Importing TrainingArguments and Trainer, then specifying key parameters: the number of epochs,
# the batch size, where to save outputs and logs, and when to evaluate the model.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=8,
    eval_strategy="epoch",
    logging_dir='./logs',
)

### Trainer

In [27]:
# Creating a trainer object including the model, training arguments, training and test datasets and the evaluation function.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_small_train_dataset,
    eval_dataset=tokenized_small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [28]:
# Fine-tuning the model by using .train()
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall |
|---|---|---|---|---|---|
| 1 | No log | 1.231423 | 0.46 | 0.269048 | 0.459936 |
| 2 | No log | 1.129888 | 0.52 | 0.668494 | 0.573851 |


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


TrainOutput(global_step=26, training_loss=1.231619614821214, metrics={'train_runtime': 1409.025, 'train_samples_per_second': 0.142, 'train_steps_per_second': 0.018, 'total_flos': 52623156019200.0, 'train_loss': 1.231619614821214, 'epoch': 2.0})
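
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so the configured batch size of 8 is the effective batch size. With 100 training examples this gives 13 optimizer steps per epoch and 26 in total, matching `global_step=26`. If `logging_steps` is left at what I understand to be its default of 500, no training loss is logged before training ends, which would explain the `No log` entries in the results table.

```python
import math

num_examples = 100        # size of small_train_dataset
batch_size = 8            # per_device_train_batch_size
num_epochs = 2            # num_train_epochs
train_runtime = 1409.025  # seconds, taken from the TrainOutput above

# The last batch of each epoch is smaller, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch)               # 13
print(total_steps)                   # 26
print(round(samples_per_second, 3))  # 0.142
print(round(steps_per_second, 3))    # 0.018
```

Both throughput figures agree with the reported `train_samples_per_second` and `train_steps_per_second`.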

# Upload to HuggingFace

In [29]:
!pip install huggingface_hub




In [30]:
!huggingface-cli login


    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
The token `test1234` has been saved to /root/.cache/huggingface/stored_tokens
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-aut

In [31]:
# Save the fine-tuned model and tokenizer to a local directory
trainer.save_model("fine_tuned_model")
tokenizer.save_pretrained("fine_tuned_model")

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer from the directory where you saved them
model = AutoModelForSequenceClassification.from_pretrained("fine_tuned_model")
tokenizer = AutoTokenizer.from_pretrained("fine_tuned_model")

In [32]:
# Push the model and tokenizer to the Hugging Face Hub
model.push_to_hub("UlrikN123/fine_tuned_model")
tokenizer.push_to_hub("UlrikN123/fine_tuned_model")

CommitInfo(commit_url='https://huggingface.co/UlrikN123/fine_tuned_model/commit/5d65af57227c6d86ffe0cbbdab2cca0addc49924', commit_message='Upload tokenizer', commit_description='', oid='5d65af57227c6d86ffe0cbbdab2cca0addc49924', pr_url=None, repo_url=RepoUrl('https://huggingface.co/UlrikN123/fine_tuned_model', endpoint='https://huggingface.co', repo_type='model', repo_id='UlrikN123/fine_tuned_model'), pr_revision=None, pr_num=None)
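
Once the model is saved or pushed, its raw outputs still have to be turned into category names. As a minimal sketch of that last step, the hypothetical helper `logits_to_label` applies a softmax to one row of four logits and looks up the winning class in the label mapping defined earlier in this notebook. The logits in the example are made up for illustration and are not outputs of the fine-tuned model.

```python
import math

# Same mapping as used earlier in the notebook
label_names = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def logits_to_label(logits):
    """Convert one row of class logits into (label name, probability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_names[best], probs[best]


# Made-up logits for a single article
name, prob = logits_to_label([0.1, -1.2, 2.3, 0.4])
print(name)  # Business
```

With a real model, the input row would be one row of the model's output logits converted to a plain Python list.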