# Fine-tune classifier with ModernBERT

In this implementation example, we will fine-tune ModernBERT to classify user prompts, with the goal of building an intelligent LLM router.
You will learn how to:
1. Setup environment and install libraries
2. Load and prepare the classification dataset
3. Fine-tune & evaluate ModernBERT with the Hugging Face Trainer
4. Run inference & test model

### 1. Setup environment and install libraries
Our first step is to install PyTorch and the Hugging Face libraries, including transformers and datasets.

In [1]:
# Install Pytorch & other libraries
!pip install "torch==2.4.1" tensorboard 
!pip install flash-attn "setuptools<71.0.0" scikit-learn 
 
# Install Hugging Face libraries
!pip install  --upgrade \
  "datasets==3.1.0" \
  "accelerate==1.2.1" \
  "hf-transfer==0.1.8"
  #"transformers==4.47.1" \
 
# ModernBERT is not yet available in an official release, so we need to install it from github
!pip install "git+https://github.com/huggingface/transformers.git@6e0515e99c39444caae39472ee1b2fd76ece32f1" --upgrade


In [13]:
!pip install -U scikit-learn scipy matplotlib


We'll utilize the Hugging Face Hub as a remote service for model versioning. This allows us to automatically upload our model, logs, and related information to the Hub during training. To get started, you'll need to create an account on Hugging Face. Once registered, we'll use the login utility from the huggingface_hub package to log in and securely save your access token on your local system.

In [5]:
from huggingface_hub import login
 
login(token="<YOUR_HF_TOKEN>") # ADD YOUR TOKEN HERE (never commit a real token)

### 2. Load and prepare the classification dataset
In this example, our goal is to fine-tune ModernBERT to function as a router for user prompts. To achieve this, we require a classification dataset that includes user prompts along with their corresponding "difficulty" labels. For this purpose, we'll use the DevQuasar/llm_router_dataset-synth dataset, a synthetic collection of approximately 15,000 user prompts labeled as either "large_llm" (1) or "small_llm" (0). The cell below, as run here, loads legacy-datasets/banking77 instead (the router dataset id is left commented out), so the printed sizes and examples come from banking77.
To load the dataset, we'll use the load_dataset() function from the 🤗 Datasets library.

In [6]:
from datasets import load_dataset
 
# Dataset id from huggingface.co/datasets
# dataset_id = "DevQuasar/llm_router_dataset-synth"
dataset_id = "legacy-datasets/banking77"
 
# Load raw dataset
raw_dataset = load_dataset(dataset_id)
 
print(f"Train dataset size: {len(raw_dataset['train'])}")
print(f"Test dataset size: {len(raw_dataset['test'])}")

Generating train split: 100%|██████████| 10003/10003 [00:00<00:00, 673239.67 examples/s]
Generating test split: 100%|██████████| 3080/3080 [00:00<00:00, 878806.55 examples/s]

Train dataset size: 10003
Test dataset size: 3080





Let's check out an example of the dataset.

In [7]:
from random import randrange
 
random_id = randrange(len(raw_dataset['train']))
raw_dataset['train'][random_id]
# Example record from the router dataset (banking77 records use 'text' and 'label' instead):
# {'id': '6225a9cd-5cba-4840-8e21-1f9cf2ded7e6',
# 'prompt': 'How many legs does a spider have?',
# 'label': 0}

{'text': 'I forgot my passcode. Now what?', 'label': 44}

To train our model, we need to convert our text prompts to token IDs. This is done by a Tokenizer, which tokenizes the inputs.

In [8]:
from transformers import AutoTokenizer
 
# Model id to load the tokenizer
model_id = "answerdotai/ModernBERT-base"
# model_id = "google-bert/bert-base-uncased"
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 512 # cap inputs at 512 tokens, which is enough for the prompts here
 
# Tokenize helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True, return_tensors="pt")
 
# Tokenize dataset
raw_dataset = raw_dataset.rename_column("label", "labels") # to match Trainer
tokenized_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])
 
print(tokenized_dataset["train"].features.keys())
# dict_keys(['labels', 'input_ids', 'attention_mask'])

Map: 100%|██████████| 10003/10003 [00:01<00:00, 6934.10 examples/s]
Map: 100%|██████████| 3080/3080 [00:00<00:00, 7304.08 examples/s]

dict_keys(['labels', 'input_ids', 'attention_mask'])
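To make the padding and truncation behavior concrete, here is a toy sketch in plain Python. The whitespace tokenizer, the tiny vocabulary, and the names `toy_encode`, `PAD_ID`, and `UNK_ID` are illustrative assumptions and do not reflect how the ModernBERT tokenizer actually splits text; the point is only the shape of `input_ids` and `attention_mask` after `padding='max_length'` and `truncation=True`.

```python
PAD_ID = 0  # hypothetical padding id
UNK_ID = 1  # hypothetical id for words missing from the toy vocabulary


def toy_encode(text, vocab, max_length):
    """Whitespace-tokenize, map words to ids, then truncate and pad to max_length."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]                   # truncation
    mask = [1] * len(ids)                    # 1 marks real tokens
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * pad,   # padding to a fixed length
        "attention_mask": mask + [0] * pad,  # 0 marks padding positions
    }


vocab = {"i": 2, "forgot": 3, "my": 4, "passcode.": 5}
encoded = toy_encode("I forgot my passcode. Now what?", vocab, max_length=8)
print(encoded["input_ids"])       # [2, 3, 4, 5, 1, 1, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The model ignores positions where `attention_mask` is 0, which is why padding every example to the same length does not change the prediction.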





### 3. Fine-tune & evaluate ModernBERT with the Hugging Face Trainer
Once our dataset is preprocessed, we're ready to train the model. For this, we'll use the answerdotai/ModernBERT-base model. The first step is to load the model from the Hugging Face Hub with the AutoModelForSequenceClassification class. This initializes ModernBERT's pre-trained weights and adds a classification head on top. We'll take the number of classes from our dataset (2 for the router dataset, 77 for banking77) and include the label names so the outputs are interpretable during inference.

In [10]:
from transformers import AutoModelForSequenceClassification
 
# Model id to load the model; it must match the tokenizer loaded above
model_id = "answerdotai/ModernBERT-base"
# model_id = "google-bert/bert-base-uncased"  # BERT baseline; reload the tokenizer with the same id
 
# Prepare model labels - useful for inference
labels = tokenized_dataset["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = i  # integer ids, as the model config expects
    id2label[i] = label
 
# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label,
)
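As a small illustration of why the label mappings matter at inference time, the sketch below builds `label2id` and `id2label` for the two router labels described earlier and decodes one logit vector. The logits are made up for illustration only.

```python
labels = ["small_llm", "large_llm"]  # index 0 and 1, as in the router dataset description

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}

logits = [-1.3, 2.1]  # made-up scores for a single prompt
predicted_id = max(range(len(logits)), key=logits.__getitem__)  # argmax over classes
print(id2label[predicted_id])  # large_llm
```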


We evaluate our model during training. The Trainer supports this through a compute_metrics function that we pass in. We use scikit-learn's f1_score to calculate the weighted F1 metric on our test split.

In [14]:
import numpy as np
from sklearn.metrics import f1_score
 
# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to predicted class ids
    predictions = np.argmax(predictions, axis=1)
    # Weighted F1: per-class F1 averaged by class support
    score = f1_score(labels, predictions, average="weighted")
    return {"f1": float(score)}
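To show what the weighted average does, here is a minimal pure-Python sketch of weighted F1: the per-class F1 is computed one class at a time and then averaged with weights equal to each class's share of the true labels. The function name `weighted_f1` and the toy labels are assumptions for illustration; in the notebook itself the metric comes from scikit-learn.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.8333
```

In this toy example classes 0 and 1 each reach an F1 of 0.8 and class 2 reaches 1.0, so the support-weighted mean is (3·0.8 + 2·0.8 + 1·1.0) / 6 = 5/6.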

The last step is to define the hyperparameters (TrainingArguments) for our training. Here we enable bfloat16 training and the fused AdamW optimizer (adamw_torch_fused) for faster training times.
We also use the Hugging Face Hub integration of the Trainer to push our checkpoints, logs, and metrics to a repository during training.

In [15]:
from huggingface_hub import HfFolder
from transformers import Trainer, TrainingArguments
 
# Define training args
training_args = TrainingArguments(
    output_dir= "modernbert-llm-router",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=5,
    bf16=True, # bfloat16 training 
    optim="adamw_torch_fused", # improved optimizer 
    # logging & evaluation strategies
    logging_strategy="steps",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    # push to hub parameters
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_token=HfFolder.get_token(),
 
)
 
# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)
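To get a feel for how these arguments translate into training length, the sketch below estimates the number of optimizer steps from the train split size printed earlier (10003), the batch size, and the epoch count. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; `total_optimizer_steps` is a hypothetical helper, not a Trainer API.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch is kept
    return steps_per_epoch * num_epochs


steps = total_optimizer_steps(num_examples=10003, batch_size=32, num_epochs=5)
print(steps)  # 1565
```

Under these assumptions, `logging_steps=100` would produce about 15 log entries over the whole run, and evaluation plus checkpointing would happen 5 times, once per epoch.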


We can start our training by using the train method of the Trainer.  

In [16]:
# Start training
trainer.train()

RuntimeError: `fused=True` requires all the params to be floating point Tensors of supported devices: ['cuda', 'xpu', 'cpu', 'privateuseone'].
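The RuntimeError above lists the device types the fused optimizer accepts, so a model placed on any other device (for example Apple's `mps` backend) would trigger it. A minimal sketch of one way to handle this, assuming you choose the optimizer and precision flags before building `TrainingArguments`; `pick_training_flags` and `FUSED_ADAMW_DEVICES` are hypothetical names, and the device list is copied from the error message rather than from the PyTorch documentation.

```python
# Device types listed as supported in the RuntimeError message above.
FUSED_ADAMW_DEVICES = {"cuda", "xpu", "cpu", "privateuseone"}


def pick_training_flags(device_type, bf16_supported=False):
    """Return TrainingArguments keyword overrides for the given device type."""
    use_fused = device_type in FUSED_ADAMW_DEVICES
    return {
        # Fall back to the regular PyTorch AdamW when fused is not available.
        "optim": "adamw_torch_fused" if use_fused else "adamw_torch",
        # Only enable bfloat16 when the caller has confirmed hardware support.
        "bf16": bool(bf16_supported),
    }


print(pick_training_flags("mps"))
print(pick_training_flags("cuda", bf16_supported=True))
```

With PyTorch installed, the device type could be derived from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, bfloat16 support from `torch.cuda.is_bf16_supported()`, and the resulting dict unpacked into `TrainingArguments(..., **flags)` in place of the hard-coded `optim` and `bf16` values.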

Fine-tuning answerdotai/ModernBERT-base on ~15,000 synthetic prompts for 5 epochs took 321 seconds, and our best model achieved an F1 score of 0.993. 🚀 I also ran the training with bert-base-uncased to compare training time and performance. The original BERT achieved an F1 score of 0.99 and took 1048 seconds to train.
Note: ModernBERT and BERT achieve almost the same performance. This indicates that the dataset is not challenging and could probably be solved with a logistic regression classifier. I ran the same code on the banking77 dataset, which contains ~13,000 customer service queries across 77 classes. There, ModernBERT outperformed the original BERT by 3 points (F1 score of 0.93 vs 0.90).
Let's save our final best model and tokenizer to the Hugging Face Hub and create a model card.

In [None]:
# Save processor and create model card
tokenizer.save_pretrained("modernbert-llm-router")  # save the tokenizer next to the model checkpoints
trainer.create_model_card()
trainer.push_to_hub()

### 4. Run Inference & test model
To wrap up this tutorial, we will run inference on a few examples to test our model. We will use the pipeline function from the transformers library to run inference.

In [None]:
from transformers import pipeline
 
# load the fine-tuned model from the local output directory (or pass your Hub repository id)
classifier = pipeline("text-classification", model="modernbert-llm-router", device=0)  # device=0 is the first GPU
 
sample = "How does the structure and function of plasmodesmata affect cell-to-cell communication and signaling in plant tissues, particularly in response to environmental stresses?"
 
 
pred = classifier(sample)
print(pred)
# [{'label': 'large_llm', 'score': 1.0}]
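To connect the prediction back to the routing use case, here is a minimal sketch of how the pipeline output could drive model selection. The model names in `ROUTES`, the `route_prompt` helper, and the 0.8 confidence threshold are hypothetical; falling back to the larger model on low confidence is one possible design choice, not something prescribed by this tutorial.

```python
# Hypothetical model names; replace them with the endpoints you actually route to.
ROUTES = {"small_llm": "my-small-model", "large_llm": "my-large-model"}


def route_prompt(prediction, min_confidence=0.8):
    """Pick a target model from one prediction dict with 'label' and 'score' keys."""
    label, score = prediction["label"], prediction["score"]
    # When the classifier is unsure, fall back to the larger model (an assumption).
    if score < min_confidence:
        return ROUTES["large_llm"]
    return ROUTES[label]


pred = [{"label": "large_llm", "score": 1.0}]  # same shape as the pipeline output above
print(route_prompt(pred[0]))  # my-large-model
```

Sending uncertain prompts to the larger model trades some cost for answer quality; a cost-sensitive deployment might choose the opposite fallback.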