# Fine-Tuning a Transformer for Multi-Class Classification Using the AG News Dataset

# Introduction

This notebook demonstrates fine-tuning a pre-trained transformer model for multi-class text classification on the AG News dataset. AG News is widely used in NLP and covers four categories: World, Sports, Business, and Science/Technology. Each sample contains a news headline and body text in a single `text` field, labeled with its category.

# Objective

The primary goal of this project is to fine-tune the `bert-base-cased` transformer model to classify news articles into the four categories. We use the Hugging Face Transformers library to streamline fine-tuning and evaluate the model with the following metrics:

- Accuracy (primary metric)
- Precision, Recall, and F1-Score (macro-averaged)

By following this notebook, you will gain a clear understanding of how to:

- Load and explore the AG News dataset.
- Preprocess and tokenize the text data using a transformer tokenizer.
- Fine-tune a pre-trained transformer model on the training set.
- Evaluate the model’s performance on the test set using appropriate metrics for multi-class classification.

In [None]:
# Install the Transformers and Datasets libraries
!pip install -q transformers datasets


In [None]:
!git clone https://github.com/huggingface/transformers
!pip install /content/transformers

Cloning into 'transformers'...
remote: Enumerating objects: 257190, done.
remote: Counting objects: 100% (30710/30710), done.
remote: Compressing objects: 100% (2371/2371), done.
remote: Total 257190 (delta 29892), reused 28339 (delta 28339), pack-reused 226480 (from 1)
Receiving objects: 100% (257190/257190), 257.97 MiB | 12.18 MiB/s, done.
Resolving deltas: 100% (190684/190684), done.
Processing ./transformers
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.49.0.dev0-py3-none-any.whl size=10770282 sha256=6612ebdd13c2a7be4d3a04767d7310a1814fd84ec283414400219942a0b781d9
  Stored in directory: /tmp/pip-ephem-wheel-cache-sjv5t8q7/wheels/9f/62/72/77fdff469e8308ad837268261590df

In [None]:
!pip install accelerate -U

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.wh

### ‼️ Restart runtime after installs!

# Fine-tuning the pretrained model on Multi-Class Classification

## Preparing the dataset

Before we can fine-tune a pretrained model, we download a dataset and prepare it for training.

We begin by loading the [AG News](https://huggingface.co/datasets/fancyzhx/ag_news) dataset from the Hugging Face Hub:

In [None]:
# Loading the dataset and inspecting one row of the training split.
from datasets import load_dataset

dataset = load_dataset("ag_news")
dataset["train"][10]

Generating train split:   0%|          | 0/120000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/7600 [00:00<?, ? examples/s]

{'text': "Oil and Economy Cloud Stocks' Outlook  NEW YORK (Reuters) - Soaring crude prices plus worries  about the economy and the outlook for earnings are expected to  hang over the stock market next week during the depth of the  summer doldrums.",
 'label': 2}
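The integer label in the row above can be mapped back to a category name. The following is a minimal sketch, assuming the usual AG News label order (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech). `LABEL_NAMES` and `label_to_name` are hypothetical helpers that are not part of the original notebook. With the dataset loaded, `dataset["train"].features["label"].names` should return the same list and is the safer source of truth.

```python
# Assumed AG News label order (hypothetical helper, not from the original notebook).
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def label_to_name(label_id: int) -> str:
    """Return the category name for an integer AG News label."""
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"Unknown label id: {label_id}")
    return LABEL_NAMES[label_id]


# The example row shown above carries label 2.
print(label_to_name(2))  # Business
```

Under this assumption, the Reuters article about oil prices and the stock market is a Business example, which matches its content.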

In [None]:
# Counting how many training examples carry each label
from collections import Counter

label_counts = Counter(dataset['train']['label'])
for label, count in label_counts.items():
    print(f"Label: {label}, Count: {count}")


Label: 2, Count: 30000
Label: 3, Count: 30000
Label: 1, Count: 30000
Label: 0, Count: 30000


In [None]:
# Creating a smaller subset of the full dataset to fine-tune on to reduce the training time.
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(25))


Transformer models expect tokenized input, and batches need a padding and truncation strategy to handle varying sequence lengths. The [`map`](https://huggingface.co/docs/datasets/process.html#map) method from Hugging Face Datasets applies a preprocessing function to the entire dataset in one step.



In [None]:
# Tokenizing the text in the training and evaluation datasets using a BERT tokenizer from HuggingFace.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    # Pad every example to the model's maximum length and truncate longer texts.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_small_train_dataset = small_train_dataset.map(tokenize_function, batched=True)
tokenized_small_eval_dataset = small_eval_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

In [None]:
# Checking the dimensions (rows, columns) of the tokenized training dataset
tokenized_small_train_dataset.shape

(100, 5)
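The shape `(100, 5)` corresponds to 100 rows and five columns: the original `text` and `label`, plus the `input_ids`, `token_type_ids`, and `attention_mask` columns added by the BERT tokenizer. To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, the pad id of 0 is an assumption, and a tiny `max_length` is used only for readability.

```python
from typing import Dict, List


def pad_and_truncate(
    token_ids: List[int], max_length: int, pad_id: int = 0
) -> Dict[str, List[int]]:
    """Toy illustration of fixed-length padding and truncation.

    Real tokenizers also take care of special tokens such as [CLS] and [SEP]
    when truncating; this sketch simply cuts the list.
    """
    truncated = token_ids[:max_length]   # drop tokens beyond max_length
    n_pad = max_length - len(truncated)  # number of padding positions needed
    return {
        "input_ids": truncated + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(truncated) + [0] * n_pad,
    }


short = pad_and_truncate([101, 2023, 102], max_length=5)
long = pad_and_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, which is what allows the rows to be stacked into batches during training.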

## Training using HuggingFace Transformers Trainer (PyTorch)

In [None]:
# Loading pre-trained BERT model for the 4-class sequence classification.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=4)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training hyperparameters

In [None]:
# Creating TrainingArguments with default hyperparameters; checkpoints are saved to the "test_trainer" folder.
# Note: this object is replaced by the customized TrainingArguments defined further below.
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

### Evaluating

In [None]:
# Installing evaluate library
!pip install -q evaluate


In [None]:
# Importing numpy and evaluate library and defining the metrics used to evaluate the model
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

In [None]:
# Defining a function for the metrics used to evaluate the model (Accuracy, Precision & Recall)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Compute metrics
    accuracy_score = accuracy.compute(predictions=predictions, references=labels)
    precision_score = precision.compute(predictions=predictions, references=labels, average="macro")
    recall_score = recall.compute(predictions=predictions, references=labels, average="macro")

    # Combine all metrics in a dictionary
    return {
        "accuracy": accuracy_score["accuracy"],
        "precision": precision_score["precision"],
        "recall": recall_score["recall"],
    }
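The objective lists a macro-averaged F1-score, while `compute_metrics` above reports accuracy, precision, and recall. As a sketch of what macro-F1 measures, the plain-Python function below computes the per-class F1 and takes the unweighted mean. `macro_f1` and the example predictions are hypothetical and not part of the original notebook. Inside the notebook, `evaluate.load("f1")` with `average="macro"` would be the more natural route.

```python
from typing import Sequence


def macro_f1(
    predictions: Sequence[int], references: Sequence[int], num_labels: int
) -> float:
    """Unweighted mean of per-class F1 scores.

    A class with no true positives, false positives, or false negatives
    contributes an F1 of 0.0 in this sketch.
    """
    f1_scores = []
    for cls in range(num_labels):
        pairs = list(zip(predictions, references))
        tp = sum(p == cls and r == cls for p, r in pairs)  # true positives
        fp = sum(p == cls and r != cls for p, r in pairs)  # false positives
        fn = sum(p != cls and r == cls for p, r in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_labels


# Hypothetical predictions and labels for the four AG News classes.
preds = [0, 1, 2, 3, 0, 1]
refs = [0, 1, 2, 3, 1, 1]
score = macro_f1(preds, refs, num_labels=4)
print(round(score, 4))  # 0.8667
```

Macro averaging gives each class equal weight, which suits AG News because the four classes are balanced in the training set.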


In [None]:
# Importing TrainingArguments and Trainer, then specifying key parameters: batch size, number of epochs,
# where to save outputs and logs, and when to evaluate the model.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=8,
    eval_strategy="epoch",
    logging_dir='./logs',
)

### Trainer

In [None]:
# Creating a trainer object including the model, training arguments, training and test datasets and the evaluation function.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_small_train_dataset,
    eval_dataset=tokenized_small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
# Fine-tuning the model by using .train()
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall |
|---|---|---|---|---|---|
| 1 | No log | 1.14487 | 0.68 | 0.759615 | 0.647619 |


TrainOutput(global_step=13, training_loss=1.23077025780311, metrics={'train_runtime': 718.1057, 'train_samples_per_second': 0.139, 'train_steps_per_second': 0.018, 'total_flos': 26311578009600.0, 'train_loss': 1.23077025780311, 'epoch': 1.0})
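The `global_step=13` in the output follows from the subset size and batch size. The arithmetic is sketched below, assuming a single device and no gradient accumulation; `steps_per_epoch` is a hypothetical helper. Because 13 steps is well below the Trainer's default logging interval (`logging_steps=500`), no training loss was logged before evaluation, which is most likely why the table shows "No log" in that column.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps in one epoch (single device, no accumulation)."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(num_examples / batch_size)


# 100 training examples with per_device_train_batch_size=8
print(steps_per_epoch(100, 8))  # 13
```

Setting a smaller `logging_steps` in `TrainingArguments` (for example 5) would make the training loss appear for a run this short.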

# Upload to HuggingFace

In [17]:
!pip install huggingface_hub




In [29]:
!huggingface-cli login



In [27]:
# Save the fine-tuned model and tokenizer to a local directory
trainer.save_model("fine_tuned_model")
tokenizer.save_pretrained("fine_tuned_model")

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer from the directory where you saved them
model = AutoModelForSequenceClassification.from_pretrained("fine_tuned_model")
tokenizer = AutoTokenizer.from_pretrained("fine_tuned_model")
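Once reloaded, the model returns one logit per class for each input text. The sketch below shows how such logits can be turned into probabilities and a predicted class in plain Python. The logit values are made up for illustration and `softmax` is a hypothetical helper. In the notebook the logits would typically come from something like `model(**tokenizer(text, return_tensors="pt")).logits`.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one article, one value per class (4 classes).
example_logits = [0.2, -1.3, 2.1, 0.4]
probs = softmax(example_logits)

# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted)  # 2
```

Taking the argmax of the logits directly gives the same class, which is what `compute_metrics` does with `np.argmax`; the softmax is only needed when probabilities are wanted.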

In [31]:
# Push the model and tokenizer to the Hugging Face Hub
model.push_to_hub("UlrikN123/fine_tuned_model")
tokenizer.push_to_hub("UlrikN123/fine_tuned_model")

CommitInfo(commit_url='https://huggingface.co/UlrikN123/fine_tuned_model/commit/52c078fc0d46954ee17147612e4e75f0f5c4cb6c', commit_message='Upload tokenizer', commit_description='', oid='52c078fc0d46954ee17147612e4e75f0f5c4cb6c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/UlrikN123/fine_tuned_model', endpoint='https://huggingface.co', repo_type='model', repo_id='UlrikN123/fine_tuned_model'), pr_revision=None, pr_num=None)