<a href="https://colab.research.google.com/github/bsmider/colab/blob/main/ModernBERT_Large_LLMRouter_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning ModernBERT Large for LLM Router Classification

An annotated Colab notebook for fine-tuning [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) based on Philipp Schmid's [Fine-tune classifier with ModernBERT in 2025](https://www.philschmid.de/fine-tune-modern-bert-in-2025) blog post. Methodology and code credits go to Philipp; check out his other great writeups!


---
## Dependencies

In [None]:
%%capture
# Install PyTorch & other libraries
%pip install "torch==2.4.1" tensorboard
%pip install flash-attn "setuptools<71.0.0" scikit-learn
%pip install --upgrade torchvision

# Install Hugging Face libraries
%pip install  --upgrade \
  "datasets==3.1.0" \
  "accelerate==1.2.1" \
  "hf-transfer==0.1.8"

# ModernBERT is not yet available in an official transformers release, so we need to install it from github
%pip install "git+https://github.com/huggingface/transformers.git@6e0515e99c39444caae39472ee1b2fd76ece32f1" --upgrade

---
## HuggingFace Login

We grab our HuggingFace token both to access hub models and datasets and to push our fine-tuned model back to the hub. Set up your own token [using this link](https://huggingface.co/settings/tokens) and either store it in your Colab secrets, as I have here, or pass it directly through the `token` argument.

In [None]:
from huggingface_hub import login
from google.colab import userdata

login(token=userdata.get('HF_TOKEN'), add_to_git_credential=True)

---
## Dataset Preparation

Our main goal is to fine-tune ModernBERT-large to predict whether a query should be routed to a large or small LLM. To teach this classification, we'll use the [DevQuasar/llm_router_dataset-synth](https://huggingface.co/datasets/DevQuasar/llm_router_dataset-synth) dataset, a collection of ~15,000 prompt and label pairs.

In [None]:
from datasets import load_dataset

# Dataset id from huggingface.co/datasets
# dataset_id = "DevQuasar/llm_router_dataset-synth"

# Load raw dataset
dataset = load_dataset("open-r1/OpenR1-Math-220k", "default")
print(dataset)
# # Split into our Test & Train sets
# train_dataset = raw_dataset['train']
# test_dataset = raw_dataset['test']

# print(f"Train dataset size: {len(train_dataset)}")
# print(f"Test dataset size: {len(test_dataset)}")

Train dataset size: 15306
Test dataset size: 4921


The Transformers trainer expects not just a train and test set but also two columns, one named `labels` and one named `text`. We can clean up our data using the datasets `.remove_columns` and `.rename_column` methods.

In [None]:
# Keep only the problem text and its category, then rename them to `text` and `labels`
keep_columns = ["problem", "problem_type"]
cleaned_dataset = dataset.remove_columns([col for col in dataset['train'].column_names if col not in keep_columns])
new_dataset = cleaned_dataset.rename_column("problem", "text")
new_dataset = new_dataset.rename_column("problem_type", "labels")

print(new_dataset)

# Drop the examples labeled "Other"
new_dataset['train'] = new_dataset['train'].filter(lambda example: example["labels"] != "Other")
print(new_dataset)

# Collect the distinct label names
unique_values = set(new_dataset["train"]["labels"])
print(unique_values)

In [None]:
# Map label names to integer class ids
# Sort the names so the mapping is identical on every run (set order is not stable)
indexing = sorted(unique_values)
print(indexing)

# Integer ids are used so the labels can be turned into tensors for training
label2id = {label: i for i, label in enumerate(indexing)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
print(id2label)
# For the two-class router dataset this would look like
# {"small_llm": 0, "large_llm": 1} and {0: "small_llm", 1: "large_llm"}

Dataset({
    features: ['text', 'labels'],
    num_rows: 4921
})

In [None]:
from collections import Counter

# Convert every label name to its id
print(new_dataset)
new_dataset["train"] = new_dataset["train"].map(lambda example: {"labels": label2id.get(example["labels"], example["labels"])})
print(new_dataset)

# Count the examples per class
label_counts = Counter(new_dataset["train"]["labels"])
print(label_counts)

In [None]:
# Hold out 80% of the examples as the test split and train on the remaining 20%
split_dataset = new_dataset['train'].train_test_split(test_size=0.8)
print(split_dataset)
print(split_dataset['train'][0])

In [None]:
from collections import Counter

# Count the examples per class in the training split
label_counts = Counter(split_dataset['train']["labels"])
print(label_counts)

In [None]:
import random
from datasets import Dataset

# Undersample every class down to the size of the smallest class
# Find minimum class size
min_count = min(label_counts.values())

# Group samples by label
label_to_samples = {label: [] for label in label_counts}
for example in split_dataset['train']:
    label_to_samples[example["labels"]].append(example)

# Undersample each class to match the minority class size
balanced_data = []
for label, samples in label_to_samples.items():
    balanced_data.extend(random.sample(samples, min_count))

# Create a new dataset
balanced_dataset = Dataset.from_list(balanced_data)
print(balanced_dataset)

# Check that every class now has the same number of examples
label_counts = Counter(balanced_dataset["labels"])
print(label_counts)

In [None]:
print(split_dataset)
split_dataset['train'] = balanced_dataset
print(split_dataset)

In [None]:
# Keep only the first 4000 examples of the test split
print(split_dataset)
split_dataset['test'] = split_dataset['test'].select(range(4000))
print(split_dataset)

We then need to tokenize our inputs. Tokenization is the process of converting raw text into numbers (token IDs) that the underlying neural network can process.

Most text-based models, like BERT or LLMs, have a separate tokenizer that does this conversion: once at the start, when the text is input, and, for models that generate text, once again at the end to convert the numerical token output back to text.

We will load and use ModernBERT-large's tokenizer, which can be found in the [hub's files](https://huggingface.co/answerdotai/ModernBERT-large/tree/main).
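To make the text-to-IDs round trip concrete, here is a minimal sketch using a toy whitespace tokenizer. The vocabulary and the `toy_encode` / `toy_decode` helpers are made up for illustration; ModernBERT's real tokenizer uses a learned subword vocabulary and produces different IDs.

```python
# Toy vocabulary, for illustration only (not ModernBERT's vocabulary)
toy_vocab = {"[UNK]": 0, "why": 1, "is": 2, "the": 3, "sky": 4, "blue": 5}
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

def toy_encode(text):
    """Map each lowercased word to its ID, falling back to [UNK]."""
    return [toy_vocab.get(word, toy_vocab["[UNK]"]) for word in text.lower().split()]

def toy_decode(ids):
    """Map token IDs back to words."""
    return " ".join(toy_inverse[i] for i in ids)

ids = toy_encode("Why is the sky blue")
print(ids)              # [1, 2, 3, 4, 5]
print(toy_decode(ids))  # why is the sky blue
```

Words outside the toy vocabulary collapse to `[UNK]`, which is one reason real tokenizers split text into subword pieces instead.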

In [None]:
from transformers import AutoTokenizer

# Model ID
model_id = "answerdotai/ModernBERT-large"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Note that all inputs should be at most 1024 tokens.
# Longer text will be truncated, and shorter text will be padded with special tokens to maintain consistency.
tokenizer.model_max_length = 1024

# Tokenize helper function
# Take in a batch of text to tokenize, return back the tokenized text.
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True, return_tensors="pt")

In [None]:
# Tokenize train dataset
tokenized_train_dataset = split_dataset['train'].map(tokenize, batched=True, remove_columns=["text"])

# Tokenize test dataset
tokenized_test_dataset = split_dataset['test'].map(tokenize, batched=True, remove_columns=["text"])

print(tokenized_train_dataset.features.keys())
print(tokenized_test_dataset.features.keys())

Map:   0%|          | 0/4921 [00:00<?, ? examples/s]

dict_keys(['labels', 'input_ids', 'attention_mask'])
dict_keys(['labels', 'input_ids', 'attention_mask'])


We remove the original `text` column since we no longer need it; from here on we work directly with the tokenized version.

Looking at our feature keys (the structure of our dataset), we now have `labels`, the numerical class label (in the router case, a small or large LLM classification); `input_ids`, our tokenized text; and `attention_mask`, which tells the model which tokens are content and which are padding, if any padding was added.

A pseudocode example of what this might look like:

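```python
text = "hello world"
# Illustrative IDs only; the real values depend on the tokenizer's vocabulary,
# and the padding continues up to the maximum length
#                 [CLS] hello world [SEP] [PAD] [PAD] [PAD]
input_ids      = [101,  7592, 2087, 102,  0,    0,    0]
attention_mask = [1,    1,    1,    1,    0,    0,    0]
labels = 1  # Example for binary classification
```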
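The layout above can be reproduced with a few lines of plain Python. This is only a sketch of what `padding='max_length'` and `truncation=True` do conceptually: the `pad_and_mask` helper and the pad ID of `0` are assumptions for illustration, and the real tokenizer supplies its own padding token ID.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the matching mask."""
    kept = token_ids[:max_length]          # truncation
    n_pad = max_length - len(kept)         # how many padding slots are needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = content, 0 = padding
    return input_ids, attention_mask

input_ids, attention_mask = pad_and_mask([101, 7592, 2087, 102], max_length=7)
print(input_ids)       # [101, 7592, 2087, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0]
```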

---
## Model Preparation

Now that we have our dataset ready, we need to prep ModernBERT-large for classification training. Using the transformers library, we can load [the model](https://huggingface.co/answerdotai/ModernBERT-large) using `AutoModelForSequenceClassification`. This configures our model with a new classification head, to which we pass our label mappings, for example:

```python
label2id = {"small_llm": "0", "large_llm": "1"}
id2label = {"0": "small_llm", "1": "large_llm"}
```

This prepares the model to be trained for our specific classification task, with the right final layer attached to predict our new labels.
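As a rough sketch of how these mappings are used: the classification head emits one score (logit) per class, and the index of the largest score is looked up in `id2label`. The names `example_label2id`, `example_id2label`, and `decode_prediction` are hypothetical, and this sketch uses integer indices for simplicity.

```python
label_names = ["small_llm", "large_llm"]  # list order defines the class indices

example_label2id = {name: idx for idx, name in enumerate(label_names)}
example_id2label = {idx: name for name, idx in example_label2id.items()}

def decode_prediction(logits):
    """Return the label whose logit is largest."""
    best_idx = max(range(len(logits)), key=lambda i: logits[i])
    return example_id2label[best_idx]

print(example_label2id)                # {'small_llm': 0, 'large_llm': 1}
print(decode_prediction([-1.2, 3.4]))  # large_llm
```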

In [None]:
print(label2id)
print(id2label)

# Number of classes, with id2label rebuilt as the inverse of label2id
num_labels = len(label2id)
id2label = {v: k for k, v in label2id.items()}

print(label2id)
print(id2label)

In [None]:
from transformers import AutoModelForSequenceClassification

# Model id to load the model
model_id = "answerdotai/ModernBERT-large"

print(label2id)
print(id2label)
num_labels = (len(label2id))

# # Prepare model labels - useful for inference
# labels = tokenized_train_dataset.features["labels"].names
# num_labels = len(labels)
# label2id1, id2label1 = dict(), dict()
# for i, label in enumerate(labels):
#     label2id1[label] = str(i)
#     id2label1[str(i)] = label

# print(label2id1)
# print(id2label1)
# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label,
).to('cuda')

model.safetensors:   0%|          | 0.00/1.58G [00:00<?, ?B/s]

You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in ModernBertForSequenceClassification is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-large and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this

---
## Training Evaluation Metric

To evaluate our model's performance during training, we use the F1 score metric.

The F1 score combines
1. **Precision**: Out of all the times we predicted `large_llm`, how many were actually meant for a large LLM?
2. **Recall**: Out of all the actual `large_llm` labels, how many did we catch?

into `F1 = 2 * (precision * recall) / (precision + recall)`.

The `compute_metrics` function processes our model's predictions in two steps:
1. Converts the model's raw output logits into predictions using `argmax` (selecting the class with the highest score)
2. Calculates the weighted F1 score comparing these predictions against the true labels

We use a weighted F1 score, which computes F1 separately for each class and averages the results weighted by the number of true examples in each class, so every class is accounted for even if our dataset isn't perfectly balanced. Note that scikit-learn's `pos_label` argument only applies when `average="binary"`, so it has no effect on a weighted score.

This metric will be calculated during training to help us understand how well our model is learning.
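A small worked example may help here. The sketch below computes per-class and weighted F1 in plain Python on made-up labels; the `f1_for_class` and `weighted_f1` helpers are illustrative stand-ins, not the scikit-learn implementation used in the notebook.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating that class as the positive one."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's true count."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == cls)
        score += (support / total) * f1_for_class(y_true, y_pred, cls)
    return score

# Made-up labels: 1 = large_llm, 0 = small_llm
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

print(f1_for_class(y_true, y_pred, 1))  # 0.75
print(f1_for_class(y_true, y_pred, 0))  # 0.5
print(weighted_f1(y_true, y_pred))      # (4 * 0.75 + 2 * 0.5) / 6, about 0.667
```

The larger class pulls the weighted score toward its own F1, which is why the result sits closer to 0.75 than to 0.5.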

In [None]:
import numpy as np
from sklearn.metrics import f1_score

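# Metric helper method
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Pick the class with the highest logit for each example
    predictions = np.argmax(logits, axis=1)
    # Weighted F1; the `labels=` argument of f1_score expects the set of class ids,
    # not the array of true labels, so it is left at its default here
    score = f1_score(labels, predictions, pos_label=1, average="weighted")
    return {"f1": float(score)}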

---
## Training the Model

We then use transformers to initialize a trainer with the following hyperparameters:

In [None]:
from huggingface_hub import HfFolder
from transformers import Trainer, TrainingArguments

# Define our training configuration
training_args = TrainingArguments(
    output_dir= "ModernBERT-large-llm-router",  # Directory where model checkpoints will be saved
    per_device_train_batch_size=32,         # Number of samples processed at once during training
    per_device_eval_batch_size=16,          # Number of samples processed at once during evaluation
    learning_rate=5e-5,                     # How quickly the model updates its weights
    num_train_epochs=5,                     # Number of complete passes through the training data
    bf16=True,                              # Use bfloat16 for faster, memory-efficient training
    optim="adamw_torch_fused",              # Optimized version of AdamW optimizer for better performance

    # Configure how and when to log training progress
    logging_strategy="steps",
    logging_steps=100,                      # Log metrics every 100 training steps
    eval_strategy="epoch",                  # Evaluate model after each epoch
    save_strategy="epoch",                  # Save model after each epoch
    save_total_limit=2,                     # Keep at most 2 checkpoints on disk
    load_best_model_at_end=True,            # Load the best model when training finishes
    metric_for_best_model="f1",             # Use F1 score to determine which model is best

    # HuggingFace Hub integration settings
    report_to="tensorboard",                # Log metrics to Tensorboard
    push_to_hub=True,                       # Upload model to HuggingFace Hub
    hub_strategy="every_save",              # Push to Hub whenever we save a checkpoint
    hub_token=HfFolder.get_token(),         # Authentication for HuggingFace Hub
)

# Create trainer with our model, data, and training configuration
trainer = Trainer(
    model=model,                            # Our BERT model with classification head
    args=training_args,                     # Training configuration we defined above
    train_dataset=tokenized_train_dataset,  # Our processed training data
    eval_dataset=tokenized_test_dataset,    # Our processed test data
    compute_metrics=compute_metrics,        # Our F1 score calculation function
)

And then run our training!

In [None]:
# Start training
trainer.train()

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 0.0303 | 0.031701 | 0.988087 |
| 2 | 0.014 | 0.037432 | 0.992731 |
| 3 | 0.0044 | 0.050241 | 0.992126 |
| 4 | 0.0004 | 0.055424 | 0.992731 |
| 5 | 0.0003 | 0.053579 | 0.993337 |


TrainOutput(global_step=2395, training_loss=0.015399816034202287, metrics={'train_runtime': 698.514, 'train_samples_per_second': 109.561, 'train_steps_per_second': 3.429, 'total_flos': 1.6186952304488448e+17, 'train_loss': 0.015399816034202287, 'epoch': 5.0})
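The `global_step=2395` in this output is consistent with the 15,306-example training set size printed earlier, a batch size of 32, and 5 epochs. A quick check of the arithmetic, assuming a single device and no gradient accumulation:

```python
import math

train_examples = 15306   # train dataset size printed earlier in the notebook
batch_size = 32          # per_device_train_batch_size
epochs = 5               # num_train_epochs

# Assumes one device and no gradient accumulation; the last batch of each epoch is partial
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 479
print(total_steps)      # 2395
```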

Finally, we save our trained model to the HuggingFace Hub.

This particular model is saved at [AdamLucek/ModernBERT-large-llm-router](https://huggingface.co/AdamLucek/ModernBERT-large-llm-router)

In [None]:
# Save tokenizer and create model card
tokenizer.save_pretrained("ModernBERT-large-llm-router")
trainer.create_model_card()
trainer.push_to_hub()

events.out.tfevents.1736072493.b29abe2b1e5d.7227.0:   0%|          | 0.00/12.6k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/AdamLucek/ModernBERT-large-llm-router/commit/58ac0e040fd3ddb11f344c8b8775b0d695c11a62', commit_message='End of training', commit_description='', oid='58ac0e040fd3ddb11f344c8b8775b0d695c11a62', pr_url=None, repo_url=RepoUrl('https://huggingface.co/AdamLucek/ModernBERT-large-llm-router', endpoint='https://huggingface.co', repo_type='model', repo_id='AdamLucek/ModernBERT-large-llm-router'), pr_revision=None, pr_num=None)

---
## Using the Model

Now that the model has been trained for our task, we can load it and use it for classification. We'll use the transformers `pipeline` helper with `text-classification` to load it easily.

In [None]:
# Note - Run above dependencies cell to install necessary packages

from transformers import pipeline

# load model from huggingface.co/models using our repository id
classifier = pipeline("text-classification", model="AdamLucek/ModernBERT-large-llm-router", device=0)

config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.58G [00:00<?, ?B/s]

You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in ModernBertForSequenceClassification is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`


tokenizer_config.json:   0%|          | 0.00/20.8k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.58M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

Device set to use cuda:0


In [None]:
sample_1 = "What role does chromatin remodeling play in epigenetic regulation during embryonic development, particularly in cell fate determination and tissue specification?"
prediction_1 = classifier(sample_1)
print(prediction_1)



[{'label': 'large_llm', 'score': 1.0}]


In [None]:
sample_2 = "Why is the sky blue?"
prediction_2 = classifier(sample_2)
print(prediction_2)

[{'label': 'small_llm', 'score': 1.0}]
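To close the loop on routing, here is a minimal sketch of how a prediction in the format above could be turned into a routing decision. The `ROUTES` table, the backend names, and the `route_query` helper are hypothetical placeholders, not part of the trained model or the transformers API.

```python
# Hypothetical backend identifiers; replace with whatever models you actually serve
ROUTES = {"small_llm": "my-small-model", "large_llm": "my-large-model"}

def route_query(prediction, default="my-large-model"):
    """Pick a backend from a text-classification pipeline result.

    `prediction` is a list like [{'label': 'large_llm', 'score': 1.0}].
    Unknown labels fall back to `default`.
    """
    label = prediction[0]["label"]
    return ROUTES.get(label, default)

print(route_query([{"label": "large_llm", "score": 1.0}]))  # my-large-model
print(route_query([{"label": "small_llm", "score": 1.0}]))  # my-small-model
```

Falling back to the larger model for unexpected labels is a design choice that favors answer quality over cost; the opposite default is equally easy to configure.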
