# ModernBERT-cats-and-dogs

This notebook fine-tunes a ModernBERT base model (about 150M parameters) on a pre-existing dataset of sentences about:
- Cats
- Dogs
- Undefined (any other topic)

Beware that training usually consumes a lot of resources regardless of the model size, so only run this notebook locally if you have a good GPU.

Otherwise, I suggest running it in Colab, where training takes just a few minutes (with 5 epochs it reaches an F1-score of 94%).
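
Before starting, it may help to check what hardware is available. The snippet below is a minimal sketch, not part of the original notebook: `describe_accelerator` is a hypothetical helper name, and it only reports whether PyTorch sees a CUDA GPU and whether that GPU supports bfloat16, which matters because the training arguments further down set `bf16 = True`.

```python
def describe_accelerator() -> str:
    """Return a short description of the available training device."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return "No CUDA GPU detected; training would fall back to CPU"

    name = torch.cuda.get_device_name(0)
    bf16 = torch.cuda.is_bf16_supported()
    return f"GPU: {name} (bfloat16 supported: {bf16})"


print(describe_accelerator())
```

If the GPU does not report bfloat16 support, switching `bf16` off in the training arguments is probably the simplest workaround.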

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import numpy as np
from sklearn.metrics import f1_score
from huggingface_hub import HfFolder

In [2]:
# Load the dataset from Hugging Face

dataset_id = "fkuhne/cats-and-dogs"
local_dataset_cache = ".datasets/" + dataset_id

train_dataset = load_dataset(
    path = dataset_id,
    cache_dir = local_dataset_cache,
    split='train'
)

split_dataset = train_dataset.train_test_split(test_size=0.1)

if "label" in split_dataset["train"].features.keys():
    split_dataset = split_dataset.rename_column("label", "labels") # to match Trainer

split_dataset['train'][:10]

{'text': ['Domestic cats are often referred to as the primary carrier of a parasite that causes toxoplasmosis in humans.',
  'Dogs are highly social animals that thrive on interaction with their human family members. They have been known to form strong bonds with their owners and can become depressed if left alone for extended periods of time.',
  "In ancient Egyptian mythology, the cat was associated with the goddess Bastet, often depicted as a woman with the head of a cat. This association is thought to have originated from the cat's ability to protect grain stores from rodents, which were considered pests. As a result, the cat became a symbol of fertility, motherhood, and protection.",
  'Domestic cats are often referred to as the most popular pet in many countries due to their affectionate nature, relatively low maintenance, and ability to provide companionship.',
  'Did you know that Greyhounds are bred specifically for their speed, with some dogs reaching up to 45 miles per hour?
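
To make the 90/10 split more concrete, here is a pure-Python sketch of what a shuffled split does. It is an illustration only, not the `datasets` implementation: `split_indices` is a hypothetical helper, the seed is arbitrary, and the row count of 499 is inferred from the 449 + 50 examples reported by the tokenization step below.

```python
import math
import random


def split_indices(n_rows: int, test_size: float, seed: int):
    """Shuffle row indices and hold out a `test_size` fraction as the test set."""
    n_test = math.ceil(n_rows * test_size)  # assumption: fractional sizes round up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[:-n_test], indices[-n_test:]


train_idx, test_idx = split_indices(n_rows=499, test_size=0.1, seed=42)
print(len(train_idx), len(test_idx))
```

Note that the call in the cell above does not pass a seed, so the split differs between runs. `train_test_split` accepts a `seed` argument, which should make the split reproducible.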

In [3]:
# Load the tokenizer
model_id = "answerdotai/ModernBERT-base"
local_dir = ".models/" + model_id

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path = model_id,
    cache_dir = local_dir,
    model_max_length = 8192
)

def tokenize(batch):
    return tokenizer(batch['text'], padding = True, truncation = True, return_tensors = "pt")
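
The `padding` and `truncation` flags are easier to reason about with a toy example. The sketch below mimics the idea on plain lists of token ids. It is an assumption-laden simplification: `pad_batch`, the pad id of 0 and the maximum length of 8 are made up for illustration, and special tokens are ignored.

```python
def pad_batch(sequences, pad_id=0, max_length=8):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row ends up with the same length, which is what allows a batch to be stacked into a single tensor.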

In [4]:
# Tokenize the dataset

tokenized_dataset = split_dataset.map(tokenize, batched = True, remove_columns = ["text"])

tokenized_dataset["train"].features.keys()

Map:   0%|          | 0/449 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

dict_keys(['labels', 'input_ids', 'attention_mask'])

In [5]:
# Prepare model labels - useful for inference

labels = tokenized_dataset["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = i  # integer class ids
    id2label[i] = label
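
As a rough illustration of why these mappings are useful at inference time, the sketch below turns a vector of logits into a readable label. The label order, the logits, and the helper names `softmax` and `decode` are all hypothetical. The real label order comes from the dataset's `labels` feature.

```python
import math

# Hypothetical label order, used only for this illustration.
example_labels = ["cats", "dogs", "undefined"]
example_id2label = {i: name for i, name in enumerate(example_labels)}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, id2label):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


label, confidence = decode([0.2, 2.5, -1.0], example_id2label)
print(label, round(confidence, 3))
```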

In [6]:
# Download the model from huggingface.co/models

model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path = model_id,
    cache_dir = local_dir,
    num_labels = num_labels,
    label2id = label2id,
    id2label = id2label
)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
# Define training args

output_model_dir = ".models/ModernBERT-cats-and-dogs"

training_args = TrainingArguments(
    output_dir = output_model_dir,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 16,
    learning_rate = 5e-5,
    num_train_epochs = 5,
    bf16 = True, # bfloat16 training (requires a GPU with bfloat16 support)
    optim = "adamw_torch_fused", # fused AdamW implementation from PyTorch
    # logging & evaluation strategies
    logging_strategy = "steps",
    logging_steps = 100,
    eval_strategy = "epoch",
    save_strategy = "epoch",
    save_total_limit = 2,
    load_best_model_at_end = True,
    metric_for_best_model = "f1",
    # push to hub parameters
    push_to_hub = False,
    hub_strategy = "every_save",
    hub_token = HfFolder.get_token(),
)
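
To get a feeling for how long this configuration trains, the number of optimizer steps can be estimated with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept. `training_steps` is a hypothetical helper, and the 449 training examples come from the tokenization output above.

```python
import math


def training_steps(n_examples: int, batch_size: int, epochs: int):
    """Return (steps per epoch, total steps) for a simple training loop."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


per_epoch, total = training_steps(n_examples=449, batch_size=32, epochs=5)
print(per_epoch, total)
```

Under these assumptions the run has fewer than 100 steps in total, so `logging_steps = 100` would likely never trigger a step-level training log. A smaller value may be more informative for a dataset of this size.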

In [8]:
def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the reference label ids
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis = 1)
    # Weighted F1: per-class F1 averaged by class support
    score = f1_score(labels, predictions, average = "weighted")
    return {"f1": float(score)}
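
For readers unfamiliar with the weighted F1-score, the following pure-Python sketch shows what the metric computes: the F1 of each class, averaged with weights proportional to how often the class appears in the reference labels. `weighted_f1` is a hypothetical re-implementation for illustration, and the label ids are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n  # weight each class by its number of true examples
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))
```

In this toy case classes 0 and 1 each score 0.8 and class 2 scores 1.0, so the weighted average is (3 × 0.8 + 2 × 0.8 + 1 × 1.0) / 6 = 5/6.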

In [9]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset["train"],
    eval_dataset = tokenized_dataset["test"],
    compute_metrics = compute_metrics,
)

In [None]:
# This command will take too long on a normal machine. I suggest running this
# notebook in a Colab session and downloading the model files from there.

trainer.train()