[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZGObhOKJCQhJJZFakc-v2ykj-hXm7K2o?usp=sharing)

# Fine-tuning RoBERTa for Topic Classification with Hugging Face Transformers and Datasets Library

This is the code for the Medium post [Fine-tuning RoBERTa for Topic Classification with Hugging Face Transformers and Datasets Library](https://medium.com/@achillesmoraites/fine-tuning-roberta-for-topic-classification-with-hugging-face-transformers-and-datasets-library-c6f8432d0820).

**The code and the post assume that**:
- You have a Hugging Face 🤗 account and are familiar with the platform (at least with creating a model repo and access tokens).
- You are experienced with Machine Learning (ML), Deep Learning, and NLP.
- You have some experience with deep learning frameworks like PyTorch or TensorFlow.
- You have coding experience with Python.
- You have access to a Jupyter Environment with a GPU that can support the training process, and you are proficient in using it.

## ⚠️ Warning
The post and the accompanying code are not intended to teach ML, Deep Learning, or NLP!

The aim of the post and the code is to illustrate the process of fine-tuning a RoBERTa model and publishing it to the Hugging Face 🤗 platform.

Building a production-level ML model involves steps and processes not covered by the post and the code.

In [1]:
!pip install -U transformers accelerate datasets huggingface_hub tensorboard==2.11
!sudo apt-get install git-lfs --yes


In [5]:
import torch
from datasets import load_dataset
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    TrainingArguments,
    Trainer,
    AutoConfig,
)
from huggingface_hub import HfFolder, notebook_login

In [4]:
pip install ipywidgets


In [6]:
notebook_login()


In [65]:
model_id = "roberta-base"

repository_id = "harshal-11/roberta-political-bias"

In [13]:
import pandas as pd

# Load your dataset
df = pd.read_csv('Data_for_model_training.csv')

# This CSV has a 'title' column for the input text, a 'label' column for the target,
# and a few empty 'Unnamed' columns that are dropped in the next cell
print(df.head())

                                               title       label Unnamed: 2  \
0            Free Speech and the University, Part IV  Right Wing        NaN   
1             BREAKING: Fauci hints at new lockdowns  Right Wing        NaN   
2                                 Economics and Time  Right Wing        NaN   
3      Forced COVID Vaccination For Kids Is Unlawful  Right Wing        NaN   
4  Why "Voluntarism" Instead of Voluntaryism? Per...  Right Wing        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  


In [15]:
# Drop the empty 'Unnamed' columns that came in with the CSV
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

# Now, print the DataFrame to see the result
print(df.head())

                                               title       label
0            Free Speech and the University, Part IV  Right Wing
1             BREAKING: Fauci hints at new lockdowns  Right Wing
2                                 Economics and Time  Right Wing
3      Forced COVID Vaccination For Kids Is Unlawful  Right Wing
4  Why "Voluntarism" Instead of Voluntaryism? Per...  Right Wing


In [20]:
pip install scikit-learn


In [27]:
from sklearn.model_selection import train_test_split

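# Split data into training (80%) and remaining (20%) data.
# Note: tokenization later reads a 'text' column, so run the rename cell (title -> text) before this split.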
train_df, remaining_df = train_test_split(df, test_size=0.2, random_state=42)

# Split remaining data into validation and test sets
val_df, test_df = train_test_split(remaining_df, test_size=0.5, random_state=42)
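As a quick sanity check on the split sizes, the arithmetic can be sketched in plain Python. This is only a sketch: it assumes scikit-learn rounds a fractional `test_size` up and gives the remainder to the other side, and it uses the 24,447-row count that the notebook reports for this dataset. The helper name `split_sizes` is made up for illustration.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_first, n_second) for a fractional test_size (hypothetical helper)."""
    # Assumption: the held-out side is rounded up, the rest goes to the first side.
    n_second = math.ceil(test_size * n_rows)
    return n_rows - n_second, n_second


# 80% train, then the remaining 20% is halved into validation and test.
n_train, n_remaining = split_sizes(24447, 0.2)
n_val, n_test = split_sizes(n_remaining, 0.5)
print(n_train, n_val, n_test)  # 19557 2445 2445
```

The training count agrees with the 19,554 string labels plus 3 NaN labels printed for `train_df`.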

In [23]:
df.columns

Index(['title', 'label'], dtype='object')

In [24]:
# df = pd.read_csv('path/to/your/file.csv')
# If necessary, rename columns to ensure consistency
df.rename(columns={'title': 'text'}, inplace=True)


In [26]:
df.columns

Index(['text', 'label'], dtype='object')

In [32]:
# Replace nulls with a placeholder string.
# Assigning the result back avoids the pandas FutureWarning raised by chained `inplace=True` calls.
df['text'] = df['text'].fillna("Missing text")


In [34]:
# Check data types in the text column
print(df['text'].apply(type).value_counts())


text
<class 'str'>    24447
Name: count, dtype: int64


In [37]:
print(train_df['label'].unique())  # Display unique label values
print(train_df['label'].apply(type).value_counts())  # Check data types of labels


['Left Wing' 'Right Wing' 'Neutral' ' whenever I leave the West'
 ' and that I may be better off then they are because I still have elders that I can go to who will make me feel at home for a while as they cleanse me. Sometimes I find myself wondering'
 nan '1/25/22 18:45']
label
<class 'str'>      19554
<class 'float'>        3
Name: count, dtype: int64


In [38]:
import pandas as pd
import numpy as np

# Display unique values before cleaning
print("Unique labels before cleaning:", train_df['label'].unique())

# Clean labels: Only keep valid categories, set others to NaN
valid_labels = ['Left Wing', 'Right Wing', 'Neutral']
train_df['label'] = train_df['label'].apply(lambda x: x if x in valid_labels else np.nan)
val_df['label'] = val_df['label'].apply(lambda x: x if x in valid_labels else np.nan)
test_df['label'] = test_df['label'].apply(lambda x: x if x in valid_labels else np.nan)

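# Option to drop NaNs if your dataset allows.
# If these stay commented out, the NaN rows are kept and end up as their own 'nan' class when encoded.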
# train_df.dropna(subset=['label'], inplace=True)
# val_df.dropna(subset=['label'], inplace=True)
# test_df.dropna(subset=['label'], inplace=True)

# Display unique values after cleaning
print("Unique labels after cleaning:", train_df['label'].unique())


Unique labels before cleaning: ['Left Wing' 'Right Wing' 'Neutral' ' whenever I leave the West'
 ' and that I may be better off then they are because I still have elders that I can go to who will make me feel at home for a while as they cleanse me. Sometimes I find myself wondering'
 nan '1/25/22 18:45']
Unique labels after cleaning: ['Left Wing' 'Right Wing' 'Neutral' nan]


In [39]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Fit the encoder on the training data and transform all datasets.
# astype(str) turns NaN into the string 'nan', which the encoder treats as a fourth class.
train_df['label'] = encoder.fit_transform(train_df['label'].astype(str))
val_df['label'] = encoder.transform(val_df['label'].astype(str))
test_df['label'] = encoder.transform(test_df['label'].astype(str))

# Check transformed labels
print("Encoded labels:", train_df['label'].unique())


Encoded labels: [0 2 1 3]
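The printed `[0 2 1 3]` is the order in which the ids first appear in `train_df`, not the mapping itself. A minimal sketch of the mapping, assuming the encoder numbers classes in sorted order (uppercase letters sort before lowercase, so `'nan'` comes last):

```python
# Labels in the order they first appear in the training split (after cleaning).
labels_seen = ["Left Wing", "Right Wing", "Neutral", "nan"]

# Assumption: ids are assigned by sorted class name, which is what LabelEncoder is understood to do.
classes = sorted(set(labels_seen))
label_to_id = {name: i for i, name in enumerate(classes)}

print(label_to_id)
# {'Left Wing': 0, 'Neutral': 1, 'Right Wing': 2, 'nan': 3}
print([label_to_id[name] for name in labels_seen])
# [0, 2, 1, 3]
```

Any hand-written list of class names used later has to follow this same sorted order, otherwise predicted labels will be mislabeled.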


In [40]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)

def tokenize_data(df):
    # Tokenize the text
    texts = df['text'].astype(str).tolist()
    labels = df['label'].tolist()
    tokenized = tokenizer(texts, padding="max_length", truncation=True, max_length=256, return_tensors="pt")
    # Add labels
    tokenized['labels'] = torch.tensor(labels, dtype = torch.long)
    return tokenized

try:
    train_dataset = tokenize_data(train_df)
    val_dataset = tokenize_data(val_df)
    test_dataset = tokenize_data(test_df)
    print("Tokenization successful")
except Exception as e:
    print(f"Tokenization failed: {e}")

Tokenization successful


In [41]:
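from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Wraps the tokenizer output so the Trainer can index it one example at a time."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Return one row: the idx-th entry of input_ids, attention_mask, and labels
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)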

# Create datasets
train_dataset = TextDataset(train_dataset)
val_dataset = TextDataset(val_dataset)
test_dataset = TextDataset(test_dataset)
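To illustrate what the wrapper does, here is a torch-free sketch of the same column-to-row indexing. The class name `ColumnDataset` and the token ids are invented for the example; a plain dict stands in for the tokenizer output.

```python
class ColumnDataset:
    """Toy stand-in for TextDataset: stores columns, serves rows."""

    def __init__(self, columns):
        self.columns = columns

    def __getitem__(self, idx):
        # Pick the idx-th entry from every column to build one example.
        return {key: values[idx] for key, values in self.columns.items()}

    def __len__(self):
        return len(self.columns["input_ids"])


# Made-up token ids, for illustration only.
toy = ColumnDataset({
    "input_ids": [[0, 100, 2], [0, 200, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "labels": [0, 2],
})
print(len(toy))  # 2
print(toy[1])    # {'input_ids': [0, 200, 2], 'attention_mask': [1, 1, 1], 'labels': 2}
```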


In [42]:
from transformers import AutoConfig

# Manually define class names if they are known.
# The order must match the LabelEncoder ids (classes in sorted order).
class_names = ['LeftWing', 'Neutral', 'RightWing', 'nan']  # replace with your actual class names
num_labels = len(class_names)
# Create the id2label mapping and its inverse
id2label = {i: name for i, name in enumerate(class_names)}
label2id = {name: i for i, name in id2label.items()}
config = AutoConfig.from_pretrained(model_id, num_labels=num_labels, id2label=id2label, label2id=label2id)
print(f"number of labels: {num_labels}")
print(f"the labels: {class_names}")


number of labels: 4
the labels: ['LeftWing', 'Neutral', 'RightWing', 'nan']


In [None]:
# # Load dataset
# dataset = load_dataset(dataset_id)
# train_dataset = dataset['train'].shard(num_shards=40, index=0)
# test_dataset = dataset["test"].shard(num_shards=2, index=0)

# # Use the second half of the test split as the validation set
# val_dataset = dataset['test'].shard(num_shards=2, index=1)

# # Preprocessing
# tokenizer = RobertaTokenizerFast.from_pretrained(model_id)

# def tokenize(batch):
#     return tokenizer(batch["text"], padding=True, truncation=True, max_length=256)

# train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
# val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))
# test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

# train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# val_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# # Extract the number of classes and their names
# num_labels = dataset['train'].features['label'].num_classes
# class_names = dataset["train"].features["label"].names
# print(f"number of labels: {num_labels}")
# print(f"the labels: {class_names}")

# # Create an id2label mapping
# # We will need this to directly output the class names when using the pipeline without needing to map the labels later.
# id2label = {i: label for i, label in enumerate(class_names)}

# # Update the model's configuration with the id2label mapping
# config = AutoConfig.from_pretrained(model_id)
# config.update({"id2label": id2label})


In [44]:
# Model
model = RobertaForSequenceClassification.from_pretrained(model_id, config=config)

# TrainingArguments
training_args = TrainingArguments(
    output_dir=repository_id,
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=500,
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=2,
    # report_to="tensorboard",
    # push_to_hub=True,
    # hub_strategy="every_save",
    # hub_model_id=repository_id,
    # hub_token=HfFolder.get_token(),
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [45]:
# Fine-tune the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.578,0.647581
2,0.6646,0.69091
3,0.475,0.751751
4,0.2359,0.961122
5,0.4465,1.047401


TrainOutput(global_step=12225, training_loss=0.502140113137251, metrics={'train_runtime': 742.9544, 'train_samples_per_second': 131.616, 'train_steps_per_second': 16.455, 'total_flos': 1.286438827834368e+16, 'train_loss': 0.502140113137251, 'epoch': 5.0})
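The `global_step=12225` figure can be reproduced from the training arguments. A back-of-the-envelope sketch, assuming a single device, no gradient accumulation, and the 19,557 training rows implied by the label counts printed earlier:

```python
import math

n_train = 19557      # 19554 string labels + 3 NaN labels in train_df
batch_size = 8       # per_device_train_batch_size
epochs = 5           # num_train_epochs
warmup_steps = 500   # warmup_steps

# The last, partially filled batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, round(warmup_fraction, 3))
# 2445 12225 0.041
```

So the 500 warmup steps cover roughly 4% of the run, about a fifth of the first epoch.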

In [46]:
trainer.evaluate()

{'eval_loss': 0.6475806832313538,
 'eval_runtime': 4.5609,
 'eval_samples_per_second': 536.078,
 'eval_steps_per_second': 33.546,
 'epoch': 5.0}
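The reported `eval_loss` of about 0.6476 equals the validation loss of epoch 1, not epoch 5. That is consistent with `load_best_model_at_end=True` reloading the checkpoint with the lowest validation loss, assuming the default selection metric is the loss. The selection can be sketched from the logged values:

```python
# Validation loss per epoch, copied from the training log above.
val_loss = {1: 0.647581, 2: 0.69091, 3: 0.751751, 4: 0.961122, 5: 1.047401}

# The "best" checkpoint under a loss criterion is the one with the smallest value.
best_epoch = min(val_loss, key=val_loss.get)

# Validation loss rising every epoch while training loss trends down suggests overfitting.
rising_every_epoch = all(val_loss[e] < val_loss[e + 1] for e in range(1, 5))

print(best_epoch, val_loss[best_epoch], rising_every_epoch)
# 1 0.647581 True
```

Given this trend, fewer epochs or a lower learning rate may be worth trying; this is a suggestion, not something the run above verifies.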

In [50]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='macro')
    acc = accuracy_score(labels, predictions)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
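To make the `average='macro'` setting concrete, here is a small pure-Python sketch of the same idea: compute precision, recall, and F1 per class, then take the unweighted mean. The function name `macro_scores` and the toy labels are invented for illustration; it is not a replacement for the scikit-learn call.

```python
def macro_scores(labels, predictions):
    """Unweighted per-class average of precision, recall, and F1, plus accuracy."""
    classes = sorted(set(labels) | set(predictions))
    pairs = list(zip(labels, predictions))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for y, p in pairs if y == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example: one class-0 item is mistaken for class 1.
scores = macro_scores([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print({k: round(v, 4) for k, v in scores.items()})
# {'accuracy': 0.8, 'precision': 0.8889, 'recall': 0.8333, 'f1': 0.8222}
```

Because every class counts equally, a small or poorly predicted class can pull macro F1 well below accuracy, which is one plausible reason the two can diverge on an imbalanced label set.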


In [51]:
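# Rebuild the Trainer with compute_metrics so that evaluate() also reports accuracy, F1, precision, and recall
trainer = Trainer(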
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)


In [52]:
results = trainer.evaluate()
print(results)


{'eval_loss': 0.6475806832313538, 'eval_accuracy': 0.7456032719836401, 'eval_f1': 0.6622051265363189, 'eval_precision': 0.6755439946989243, 'eval_recall': 0.6534952320230859, 'eval_runtime': 4.6102, 'eval_samples_per_second': 530.342, 'eval_steps_per_second': 33.187}



In [66]:
import os

from transformers import TrainingArguments, Trainer

repository_id = 'harshal-11/roberta-political-bias'  # This is your model's Hugging Face Repository ID

training_args = TrainingArguments(
    output_dir='./results',  # Local directory for saving training outputs
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="no",  # Since you are using push_to_hub, you may not need to save locally
    logging_dir='./logs',  # Local directory for saving logs
    logging_steps=10,
    push_to_hub=True,  # Enable push to hub
    hub_model_id=repository_id,  # Repository ID where the model will be pushed
    hub_strategy="every_save",  # Push to hub every time save is called
    hub_token=os.getenv('HF_TOKEN'),  # Hugging Face authentication token
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Before saving and pushing to the Hub, make sure you are authenticated with Hugging Face.
# Either call notebook_login() as done earlier, or run this in a separate cell beforehand:
# !huggingface-cli login
# and follow the instructions to log in with your token.

# Save tokenizer and create a model card in your Hugging Face repository
tokenizer.save_pretrained(training_args.output_dir)
trainer.create_model_card()

# Push the tokenizer, model, and model card to the hub
trainer.push_to_hub(commit_message="Training completed")


CommitInfo(commit_url='https://huggingface.co/harshal-11/roberta-political-bias/commit/890987d852c773cc97f11c3fc7962d000d5e36e4', commit_message='Training completed', commit_description='', oid='890987d852c773cc97f11c3fc7962d000d5e36e4', pr_url=None, pr_revision=None, pr_num=None)

In [59]:
import os

# Replace 'your_token' with the actual token you copied from Hugging Face.
# Never commit or share a notebook that contains a real token; revoke any token that has been exposed.
os.environ['HF_TOKEN'] = 'your_token'

# Use this environment variable when you create the `Trainer` or call `push_to_hub`.
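One way to avoid writing the token into a cell at all is to read it from the environment and only prompt when it is missing. The following is a minimal sketch using the standard library; the helper name `load_hf_token` and its `prompt` parameter are made up for this example.

```python
import os
from getpass import getpass


def load_hf_token(env_var="HF_TOKEN", prompt=getpass):
    """Return the token from the environment, asking for it only if it is not set."""
    token = os.environ.get(env_var)
    if not token:
        # getpass hides the typed value, so it does not end up in the notebook output.
        token = prompt("Hugging Face token: ")
        os.environ[env_var] = token
    return token
```

It could then be passed along as `hub_token=load_hf_token()` when building `TrainingArguments`.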


In [63]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/harshal-11/results/commit/4ee1513a028374cc0dae813f220c3383f8398d4b', commit_message='End of training', commit_description='', oid='4ee1513a028374cc0dae813f220c3383f8398d4b', pr_url=None, pr_revision=None, pr_num=None)

In [84]:
# TEST MODEL

from transformers import pipeline
# from datasets import load_dataset

# dataset = load_dataset(dataset_id)
# class_names = dataset["train"].features["label"].names

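# Named `classifier` rather than `pip` to avoid confusion with the package installer
classifier = pipeline('text-classification', model=repository_id)


text = "republican supports army"
result = classifier(text)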

predicted_label = result[0]["label"]
print(f"Predicted label: {predicted_label}")

Predicted label: RightWing
