[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZGObhOKJCQhJJZFakc-v2ykj-hXm7K2o?usp=sharing)


# Fine-tuning RoBERTa for Topic Classification with Hugging Face Transformers and Datasets Library


This is the code for the Medium post [Fine-tuning RoBERTa for Topic Classification with Hugging Face Transformers and Datasets Library](https://medium.com/@achillesmoraites/fine-tuning-roberta-for-topic-classification-with-hugging-face-transformers-and-datasets-library-c6f8432d0820).

**The code and the post assume that**:
- You have a Hugging Face 🤗 account and are familiar with the platform (at least with creating a model repo and access tokens).
- You are experienced with Machine Learning (ML), Deep Learning, and NLP.
- You have some experience with deep learning frameworks like PyTorch or TensorFlow.
- You have coding experience with Python.
- You have access to a Jupyter environment with a GPU that can support the training process, and you are proficient in using it.

## ⚠️Warning
The post and the accompanying code are not intended to teach ML, Deep Learning, or NLP!

The aim of the post and the code is to illustrate the process of fine-tuning a RoBERTa model and publishing it to the Hugging Face 🤗 platform.

Building a production-level ML model involves steps and processes not covered by the post and the code.


In [None]:
!pip install transformers datasets huggingface_hub tensorboard==2.11
!sudo apt-get install git-lfs --yes

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.3-py3-none-any.whl (6.8 MB)
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting tensorboard==2.11
  Downloading tensorboard-2.11.0-py3-none-any.whl (6.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokeni

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    TrainingArguments,
    Trainer,
    AutoConfig,
)
from huggingface_hub import HfFolder, notebook_login

In [None]:
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
model_id = "roberta-base"
dataset_id = "ag_news"
# Replace this with your own repository, in the form <username>/<model-name>.
# Before you start, create an empty model repository on Hugging Face 🤗 at https://huggingface.co/new
repository_id = "achimoraites/roberta-base_ag_news"

In [None]:
# Load dataset
dataset = load_dataset(dataset_id)
train_dataset = dataset["train"]

# Split the original test split into two halves: one for testing, one for validation
test_dataset = dataset["test"].shard(num_shards=2, index=0)
val_dataset = dataset["test"].shard(num_shards=2, index=1)

# Preprocessing
tokenizer = RobertaTokenizerFast.from_pretrained(model_id)

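def tokenize(batch):
    # padding=True pads to the longest sequence in the batch; truncation caps it at max_length tokens
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=256)

# batch_size=len(...) tokenizes each split as a single batch, so all of its examples are padded to the same length
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))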

train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
val_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Extract the number of classes and their names
num_labels = dataset['train'].features['label'].num_classes
class_names = dataset["train"].features["label"].names
print(f"number of labels: {num_labels}")
print(f"the labels: {class_names}")

# Create an id2label mapping and its inverse, label2id
# With these in the config, the pipeline outputs class names directly, so no label mapping is needed later.
id2label = {i: label for i, label in enumerate(class_names)}
label2id = {label: i for i, label in id2label.items()}

# Update the model's configuration with the label mappings
config = AutoConfig.from_pretrained(model_id)
config.update({"id2label": id2label, "label2id": label2id})




number of labels: 4
the labels: ['World', 'Sports', 'Business', 'Sci/Tech']
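
The two `shard(num_shards=2, index=...)` calls above cut one split into two non-overlapping halves. The following is a minimal pure-Python sketch of that idea, not the library's implementation: `shard_indices` is a hypothetical helper, and whether the real `Dataset.shard` keeps strided or contiguous rows depends on its `contiguous` argument, so check the documentation for the installed `datasets` version.

```python
def shard_indices(n_rows, num_shards, index, contiguous=False):
    """Return the row indices that shard `index` out of `num_shards` would keep."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    if contiguous:
        # Consecutive blocks; the first (n_rows % num_shards) shards get one extra row.
        div, mod = divmod(n_rows, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Strided: shard i keeps rows i, i + num_shards, i + 2 * num_shards, ...
    return list(range(index, n_rows, num_shards))


n_rows = 10  # toy size for illustration, not the real dataset size
test_idx = shard_indices(n_rows, num_shards=2, index=0)
val_idx = shard_indices(n_rows, num_shards=2, index=1)

print(test_idx)  # [0, 2, 4, 6, 8]
print(val_idx)   # [1, 3, 5, 7, 9]

# The two halves never share a row, and together they cover the whole split.
assert set(test_idx).isdisjoint(val_idx)
assert sorted(test_idx + val_idx) == list(range(n_rows))
```

Either way of slicing gives a validation set and a test set with no examples in common, which is the property the split relies on.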


In [None]:
# Model
model = RobertaForSequenceClassification.from_pretrained(model_id, config=config)

# TrainingArguments
training_args = TrainingArguments(
    output_dir=repository_id,
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=500,
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=2,
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi
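
The `TrainingArguments` above set `learning_rate=5e-5` and `warmup_steps=500` and leave the scheduler type at its default, which in Transformers is a linear schedule: the rate climbs linearly from 0 during warmup, then decays linearly to 0 at the last step. A rough sketch of that shape in plain Python follows. `linear_lr` is a hypothetical helper, and `total_steps=37_500` is taken from the `global_step` this run reports after training.

```python
def linear_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=37_500):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall from base_lr back to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 19_000, 37_500):
    print(f"step {step:>6}: lr = {linear_lr(step):.2e}")
```

With these values the rate peaks at step 500 and is back to half of its peak at step 19,000, midway through the decay phase.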


In [None]:
# Fine-tune the model
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.3692        | 0.430518        |
| 2     | 1.6035        | 1.807135        |
| 3     | 0.6766        | 0.449418        |
| 4     | 0.3733        | 0.394254        |
| 5     | 0.2483        | 0.358265        |


TrainOutput(global_step=37500, training_loss=0.5209779283046723, metrics={'train_runtime': 5369.3904, 'train_samples_per_second': 55.872, 'train_steps_per_second': 6.984, 'total_flos': 3.94673670144e+16, 'train_loss': 0.5209779283046723, 'epoch': 5.0})
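
The `global_step` in this output follows from the number of epochs, the batch size, and the number of training examples. The sketch below works through that arithmetic with the logged values. `steps_per_epoch` is a hypothetical helper, and the calculation assumes a single device and no gradient accumulation, which are the defaults when neither is configured.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1, grad_accum_steps=1):
    """Optimizer steps needed to see every example once."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch)


# Values reported by the run above.
global_step = 37_500
num_train_epochs = 5
per_device_train_batch_size = 8
train_runtime = 5369.3904  # seconds

logged_steps_per_epoch = global_step // num_train_epochs
print(logged_steps_per_epoch)  # 7500

# Assumption: one device, no gradient accumulation.
# This is an upper bound, since the last batch of an epoch may be smaller.
implied_examples = logged_steps_per_epoch * per_device_train_batch_size
print(implied_examples)  # 60000
print(steps_per_epoch(implied_examples, per_device_train_batch_size))  # 7500

# Cross-check against the logged train_steps_per_second.
print(round(global_step / train_runtime, 3))  # 6.984
```

The same relationship is handy before training: plugging in the dataset size and batch size gives the total step count, which helps when choosing `warmup_steps` and `logging_steps`.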

In [None]:
trainer.evaluate()

{'eval_loss': 0.3582654297351837,
 'eval_runtime': 216.3315,
 'eval_samples_per_second': 277.352,
 'eval_steps_per_second': 34.669,
 'epoch': 5.0}
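
`trainer.evaluate()` reports only the loss and timing here because the `Trainer` was created without a `compute_metrics` function. The following is a minimal, dependency-free sketch of such a function, assuming it receives a `(logits, labels)` pair and returns a dict of metric names to values. The toy logits are made up for illustration; in practice the inputs are arrays and NumPy would be the usual choice.

```python
def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Argmax over the class dimension: one row of logits per example.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Toy logits for three examples over four classes (made-up numbers).
toy_logits = [
    [2.1, 0.3, -1.0, 0.2],
    [0.1, 3.2, 0.0, -0.5],
    [0.4, 0.2, 0.1, 1.9],
]
toy_labels = [0, 1, 2]

# Two of the three toy predictions match their labels.
print(compute_metrics((toy_logits, toy_labels)))
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` should make the evaluation output include an `eval_accuracy` entry next to `eval_loss`.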

In [None]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()

To https://huggingface.co/achimoraites/roberta-base_ag_news
   589ec3c..5f76a9d  main -> main



'https://huggingface.co/achimoraites/roberta-base_ag_news/commit/5f76a9d92ad90029d29343de48c0f709067c87fd'

In [None]:
# Test the model

from transformers import pipeline

# Load the fine-tuned model from the Hub as a text-classification pipeline
classifier = pipeline("text-classification", model=repository_id)

text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing: quot;After the crucifixion comes the resurrection. quot; .."
result = classifier(text)

# The pipeline returns a list with one dict per input; "label" holds the class name from id2label
predicted_label = result[0]["label"]
print(f"Predicted label: {predicted_label}")

Predicted label: Sports
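
The pipeline result is a list of dicts, each with a `label` and a `score`. For a single-label model with several classes, the text-classification pipeline is expected to apply a softmax to the logits and report the top class by its `id2label` name. The sketch below mimics that last step in plain Python. `to_prediction` is a hypothetical helper and the logits are invented, so it illustrates the mapping rather than the pipeline's actual code.

```python
import math

id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Turn raw logits into a {'label', 'score'} dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}


example_logits = [-1.2, 4.5, -0.8, -2.0]  # made-up values, not real model output
prediction = to_prediction(example_logits)
print(prediction["label"])  # Sports
```

Without the `id2label` mapping in the model config, the same step would yield generic names such as `LABEL_1` instead of `Sports`.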
