# ModernBert

ModernBERT was published in late 2024 and has shown substantial improvements over other BERT family models (BERT, RoBERTa, ALBERT, etc.). This notebook showcases the improvements of ModernBERT compared to BERT. Specifically, we will look at the differences in tokenization, long-context capability, model architecture, model outputs, and inference speed.

Here are some helpful resources:

https://huggingface.co/docs/transformers/main/en/model_doc/modernbert

https://huggingface.co/blog/modernbert

https://huggingface.co/answerdotai/ModernBERT-base

https://huggingface.co/docs/transformers/model_doc/bert

# Import libaries

In [1]:
!pip install -q -U transformers
!pip install -q -U datasets

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.0/44.0 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.0/12.0 MB[0m [31m77.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m506.3/506.3 kB[0m [31m38.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 MB[0m [31m63.0 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.[0m[31m
[0m

In [2]:
import numpy as np
import pandas as pd
import torch

from datasets import load_dataset

from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from transformers import AutoTokenizer, AutoModel

# Load models & tokenizer

In [3]:
# Let's first load the BERT model
bert_checkpoint = "bert-base-cased"
bert_tokenizer = BertTokenizer.from_pretrained(bert_checkpoint)
bert_model = BertModel.from_pretrained(bert_checkpoint)

# BERT tokenizer and model can also be loaded with AutoTokenizer and AutoModel
#bert_tokenizer = AutoTokenizer.from_pretrained(bert_checkpoint)
#bert_model = AutoModel.from_pretrained(bert_checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

In [4]:
# Then we will load the ModernBERT model
mbert_checkpoint = "answerdotai/ModernBERT-base"
mbert_tokenizer = AutoTokenizer.from_pretrained(mbert_checkpoint)
mbert_model = AutoModel.from_pretrained(mbert_checkpoint)

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/599M [00:00<?, ?B/s]

# Tokenizer

Let's first take a look at the difference between the tokenizers for the two models.

In [5]:
text = "This is MIDS 266. Let's learn some NLP!"

In [6]:
# Check out how BERT tokenizing things
bert_inputs = bert_tokenizer(text, return_tensors="pt")
bert_inputs

{'input_ids': tensor([[  101,  1188,  1110, 26574, 13675,  1744,  1545,   119,  2421,   112,
           188,  3858,  1199, 21239,  2101,   106,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [9]:
# How is it "really" handled?
bert_tokenizer.convert_ids_to_tokens(bert_inputs["input_ids"][0])

['[CLS]',
 'This',
 'is',
 'MI',
 '##DS',
 '26',
 '##6',
 '.',
 'Let',
 "'",
 's',
 'learn',
 'some',
 'NL',
 '##P',
 '!',
 '[SEP]']

In [7]:
mbert_inputs = mbert_tokenizer(text, return_tensors="pt")
mbert_inputs

{'input_ids': tensor([[50281,  1552,   310,   353, 15782, 30610,    15,  1281,   434,  3037,
           690,   427, 13010,     2, 50282]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [8]:
mbert_tokenizer.convert_ids_to_tokens(mbert_inputs["input_ids"][0])


['[CLS]',
 'This',
 'Ġis',
 'ĠM',
 'IDS',
 'Ġ266',
 '.',
 'ĠLet',
 "'s",
 'Ġlearn',
 'Ġsome',
 'ĠN',
 'LP',
 '!',
 '[SEP]']

What do we notice at first glance?

Nice mini side-by-side! A few things pop right out:

* **Special tokens & ranges**

  * **BERT** uses `[CLS]=101` … `[SEP]=102`.
  * **ModernBERT** adds BOS/EOS-style tokens with big IDs `50281` / `50282` (typical of byte-level BPE vocabularies with specials at the top).

* **Segment (token type) ids**

  * **BERT** returns `token_type_ids` (all zeros here since there’s one segment); classic BERT adds **token-type embeddings**.
  * **ModernBERT** returns **no `token_type_ids`** (it doesn’t use them).

* **Subword system / vocabulary**

  * **BERT-base-cased** uses **WordPiece**, case-sensitive; you’ll often see words split into smaller pieces and punctuation as separate tokens.
  * **ModernBERT** uses a **byte-level BPE/SentencePiece-style** tokenizer, so IDs are larger and tokenization decisions differ (often fewer/smarter splits on rare/cased strings and punctuation like “Let’s”, “NLP”, “MIDS 266”, “!”).

* **Sequence length here**

  * BERT made **17 tokens** (incl. `[CLS]`/`[SEP]`), ModernBERT **15** (incl. BOS/EOS). Different subword rules → different lengths.

* **Model features (not visible but relevant)**

  * **BERT**: 512-token window, absolute positional embeddings.
  * **ModernBERT**: much **longer context (8k)** and **RoPE**; no token-type embeddings.

Let's now take a closer look at the `input_ids`.

In [10]:
bert_inputs.input_ids.shape

torch.Size([1, 17])

In [11]:
mbert_inputs.input_ids.shape

torch.Size([1, 15])

We can see that the two models have different length of input_ids for texts with the same word count!

Let's then take a look at how the tokenization differs between the two models.

In [12]:
btokens = bert_tokenizer.tokenize(text)
print(btokens)

['This', 'is', 'MI', '##DS', '26', '##6', '.', 'Let', "'", 's', 'learn', 'some', 'NL', '##P', '!']


In [13]:
mtokens = mbert_tokenizer.tokenize(text)
print(mtokens)

['This', 'Ġis', 'ĠM', 'IDS', 'Ġ266', '.', 'ĠLet', "'s", 'Ġlearn', 'Ġsome', 'ĠN', 'LP', '!']


Wow! The tokens look quite different between the two models!

We already know BERT uses CLS and SEP tokens, does ModernBERT do the same?

In [14]:
bert_tokenizer.decode(bert_tokenizer.encode(text))

"[CLS] This is MIDS 266. Let's learn some NLP! [SEP]"

In [15]:
bert_inputs.input_ids

tensor([[  101,  1188,  1110, 26574, 13675,  1744,  1545,   119,  2421,   112,
           188,  3858,  1199, 21239,  2101,   106,   102]])

In [16]:
mbert_tokenizer.decode(mbert_tokenizer.encode(text))

"[CLS]This is MIDS 266. Let's learn some NLP![SEP]"

In [17]:
mbert_inputs.input_ids

tensor([[50281,  1552,   310,   353, 15782, 30610,    15,  1281,   434,  3037,
           690,   427, 13010,     2, 50282]])

Similar to other BERT family models, ModernBERT also uses CLS and SEP tokens. Can you guess the input_id for these special tokens?
 - 50281 and 05282

Let's now try batch encode, what's different now?

Read the [ModernBert Config](https://huggingface.co/docs/transformers/main/en/model_doc/modernbert#transformers.ModernBertConfig) to identify other special tokens and the input ids for each of them.

In [18]:
bert_input = bert_tokenizer.batch_encode_plus(
    ['This is great!', 'This is terrible!'],  # a batch of 2 single-sentence strings
    max_length=10,         # cap each sequence at 10 tokens (incl. [CLS]/[SEP])
    truncation=True,       # if longer than 10, cut off the tail
    padding='max_length',  # pad shorter ones up to length 10
    return_tensors='pt'    # return PyTorch tensors (not lists)
)


bert_input

{'input_ids': tensor([[ 101, 1188, 1110, 1632,  106,  102,    0,    0,    0,    0],
        [ 101, 1188, 1110, 6434,  106,  102,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

In [19]:
mbert_input = mbert_tokenizer.batch_encode_plus(
    ['This is great!', 'This is terrible!'],
    max_length=10,
    truncation=True,
    padding='max_length',
    return_tensors='pt'
)

mbert_input

{'input_ids': tensor([[50281,  1552,   310,  1270,     2, 50282, 50283, 50283, 50283, 50283],
        [50281,  1552,   310, 11527,     2, 50282, 50283, 50283, 50283, 50283]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

# Long Context Inputs

For long context illustration, we will use the [long-context retrieval (MLDR)](https://huggingface.co/datasets/sentence-transformers/mldr) dataset. This dataset has 10K triples of anchor-positive-negative datapoints, and is ideal for information retrieval. We will cover information retrieval in Week 10.

We can directly load the dataset from HuggingFace Hub. For this exercise, we will only take the first 5 datapoints as an example.

In [20]:
dataset = load_dataset("sentence-transformers/mldr", "en-triplet", split="train").take(5)

README.md: 0.00B [00:00, ?B/s]

en-triplet/train-00000-of-00001.parquet:   0%|          | 0.00/211M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [21]:
dataset

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 5
})

This dataset is designed for long context retrieval, let's take a look at the first example

In [22]:
text = dataset[0]["positive"]
len(text)

12379

In [24]:
text[:100]

'John Vincent "Jack" Geraghty, Jr. (born February 23, 1934) is an Irish American civic politician, jo'

Wow this is surely a long text, what happens if we try to tokenize it for BERT?

In [25]:
bert_inputs = bert_tokenizer(text, return_tensors="pt")
bert_inputs

Token indices sequence length is longer than the specified maximum sequence length for this model (2450 > 512). Running this sequence through the model will result in indexing errors


{'input_ids': tensor([[  101,  1287,  5665,  ...,  4052, 12762,   102]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

Uhoh, we see a warning that the sequence length exceeds the max sequence length allowed for the model.

What happens if we directly use this long text in the BERT model without any preprocessing? All the texts after the 512th token will be lost!

Think about what you can do to combat this issue if you'd like to use BERT model on this dataset? What disadvantage would this impose compared to using a model that has long context capabilities?

Now let's tokenize it for ModernBERT

In [26]:
mbert_inputs = mbert_tokenizer(text, return_tensors="pt")
mbert_inputs

{'input_ids': tensor([[50281,  8732, 26456,  ..., 13416, 20759, 50282]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

In [27]:
mbert_inputs.input_ids.shape

torch.Size([1, 2671])

No more warnings about the sequence length exceeding the maximum sequence length. This is because ModernBERT allows for long context up to 8192 tokens! This is huge!

# Model Architecture

Let's now take a look at the model architecture!

In [28]:
for name, param in bert_model.named_parameters():
    print(name)

embeddings.word_embeddings.weight
embeddings.position_embeddings.weight
embeddings.token_type_embeddings.weight
embeddings.LayerNorm.weight
embeddings.LayerNorm.bias
encoder.layer.0.attention.self.query.weight
encoder.layer.0.attention.self.query.bias
encoder.layer.0.attention.self.key.weight
encoder.layer.0.attention.self.key.bias
encoder.layer.0.attention.self.value.weight
encoder.layer.0.attention.self.value.bias
encoder.layer.0.attention.output.dense.weight
encoder.layer.0.attention.output.dense.bias
encoder.layer.0.attention.output.LayerNorm.weight
encoder.layer.0.attention.output.LayerNorm.bias
encoder.layer.0.intermediate.dense.weight
encoder.layer.0.intermediate.dense.bias
encoder.layer.0.output.dense.weight
encoder.layer.0.output.dense.bias
encoder.layer.0.output.LayerNorm.weight
encoder.layer.0.output.LayerNorm.bias
encoder.layer.1.attention.self.query.weight
encoder.layer.1.attention.self.query.bias
encoder.layer.1.attention.self.key.weight
encoder.layer.1.attention.self.key

In [29]:
for name, param in mbert_model.named_parameters():
    print(name)

embeddings.tok_embeddings.weight
embeddings.norm.weight
layers.0.attn.Wqkv.weight
layers.0.attn.Wo.weight
layers.0.mlp_norm.weight
layers.0.mlp.Wi.weight
layers.0.mlp.Wo.weight
layers.1.attn_norm.weight
layers.1.attn.Wqkv.weight
layers.1.attn.Wo.weight
layers.1.mlp_norm.weight
layers.1.mlp.Wi.weight
layers.1.mlp.Wo.weight
layers.2.attn_norm.weight
layers.2.attn.Wqkv.weight
layers.2.attn.Wo.weight
layers.2.mlp_norm.weight
layers.2.mlp.Wi.weight
layers.2.mlp.Wo.weight
layers.3.attn_norm.weight
layers.3.attn.Wqkv.weight
layers.3.attn.Wo.weight
layers.3.mlp_norm.weight
layers.3.mlp.Wi.weight
layers.3.mlp.Wo.weight
layers.4.attn_norm.weight
layers.4.attn.Wqkv.weight
layers.4.attn.Wo.weight
layers.4.mlp_norm.weight
layers.4.mlp.Wi.weight
layers.4.mlp.Wo.weight
layers.5.attn_norm.weight
layers.5.attn.Wqkv.weight
layers.5.attn.Wo.weight
layers.5.mlp_norm.weight
layers.5.mlp.Wi.weight
layers.5.mlp.Wo.weight
layers.6.attn_norm.weight
layers.6.attn.Wqkv.weight
layers.6.attn.Wo.weight
layers.6.mlp

What differences do you see when comparing the layers?

# Model outputs

We have seen the differences in inputs and model architecture, let's now take a look at model outputs

Remember, our text is too long for BERT, so we must truncate it before we can feed the tokenized inputs to the model

In [30]:
bert_inputs = bert_tokenizer(text, return_tensors="pt", truncation=True)
bert_outputs = bert_model(**bert_inputs)
bert_outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.5935,  0.0940, -0.1687,  ..., -0.3806,  0.0434,  0.3832],
         [ 0.4253, -0.4766,  0.7440,  ..., -0.3942,  0.3230,  0.2140],
         [ 0.4182, -0.5374,  0.5692,  ..., -0.2323, -0.0629,  1.2690],
         ...,
         [ 0.5098, -0.5685,  0.2344,  ..., -0.5356,  0.0076,  0.3847],
         [ 0.4078, -0.4184,  0.0023,  ..., -0.1491, -0.1271,  0.2958],
         [ 1.3882,  0.1162,  1.2960,  ...,  0.3271,  0.2593, -0.2280]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-5.8410e-01,  3.0264e-01,  9.9930e-01, -9.6620e-01,  8.9615e-01,
          9.3145e-01,  7.0277e-01, -9.9296e-01, -8.9288e-01,  6.4332e-02,
          9.1099e-01,  9.9563e-01, -9.9678e-01, -9.9903e-01,  7.9328e-01,
         -8.7767e-01,  9.3608e-01, -5.5321e-01, -9.9983e-01, -4.3656e-01,
         -5.8476e-01, -9.9944e-01,  1.0964e-01,  9.4960e-01,  7.4010e-01,
          9.5588e-02,  9.5480e-01,  9.9985e-01,  3.6780e-01, -7.141

BERT has two outputs, do you remember what they are?

You’re seeing the two *standard* BERT outputs:

1. **`last_hidden_state`** – shape `[batch, seq_len, hidden]`
   The final-layer vector for **every token**, including `[CLS]` and `[SEP]`. Use this when you need token features (sequence labeling) or when you want to pool over tokens yourself (e.g., mean/max/attention pooling) for a sentence/document vector.

2. **`pooler_output`** – shape `[batch, hidden]`
   A **single vector per sequence** made by taking the **[CLS] token** from `last_hidden_state`, passing it through a learned dense layer + `tanh`. It existed for BERT’s original **Next Sentence Prediction** objective and is often used as a quick sentence representation for classification heads.

A few tips:

* Many modern setups **ignore the pooler** and instead use `[CLS]` directly (`last_hidden_state[:,0,:]`) or **mean-pool** the non-pad tokens; these can work better depending on the task.
* If you want **more** (all layers or attention maps), set:

  ```python
  out = bert_model(**bert_inputs, output_hidden_states=True, output_attentions=True, return_dict=True)
  ```
* Models like **ModernBERT** typically **don’t have a pooler**, so you’ll just use your own pooling on `last_hidden_state`.


In [31]:
print('Shape of first BERT output: ', bert_outputs[0].shape)
print('Shape of second BERT output: ', bert_outputs[1].shape)

Shape of first BERT output:  torch.Size([1, 512, 768])
Shape of second BERT output:  torch.Size([1, 768])


With ModernBERT, thanks to the long context capabilities, we do not need to truncate our long text before feeding it to the model.

In [32]:
# this cell takes a minute to run because our text is so long
mbert_inputs = mbert_tokenizer(text, return_tensors="pt")
mbert_outputs = mbert_model(**mbert_inputs)
mbert_outputs

BaseModelOutput(last_hidden_state=tensor([[[ 0.1680, -0.3384, -0.7694,  ..., -0.4745, -0.1105, -0.7589],
         [-0.8740, -0.3295,  0.6591,  ..., -1.9694, -2.3188,  0.4383],
         [ 0.3498, -1.4112,  0.0969,  ...,  0.4195,  0.0887, -0.6138],
         ...,
         [ 1.8556, -0.1904, -1.1067,  ..., -1.8617,  0.1445, -0.0550],
         [ 0.2851, -0.9601, -0.9530,  ..., -1.6096,  0.1400,  0.2290],
         [ 0.1789, -0.0395,  0.0412,  ...,  0.0570,  0.1783,  0.1189]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

Interesting! ModernBERT only has one output!

In [33]:
print('Shape of ModernBERT output: ', mbert_outputs[0].shape)

Shape of ModernBERT output:  torch.Size([1, 2671, 768])


Compare the first output from BERT and the only output from ModernBERT, what is the difference? What does the 2nd dimension represent?

Two key differences:

* **What you get back**

  * **BERT:** `BaseModelOutputWithPooling...` → it has **`last_hidden_state`** *and* a **`pooler_output`**.
  * **ModernBERT:** `BaseModelOutput` → only **`last_hidden_state`** (no pooler).

* **How long the sequence is**

  * Your BERT call used `truncation=True`, so `last_hidden_state` is **[1, 512, 768]** (BERT can’t exceed 512).
  * ModernBERT returned **[1, 2671, 768]** — it processed the **entire tokenized sequence** thanks to its long context.

And to the second question: the **2nd dimension is the sequence length in tokens** (subword tokens, including specials like `[CLS]`/`[SEP]`).
So shapes are **[batch_size, seq_len, hidden_size]** → here: batch=1, seq_len=2671, hidden=768.


In [None]:
bert_lhs_shape = bert_outputs.last_hidden_state.shape
bert_lhs_data = bert_outputs.last_hidden_state

print("Last hidden state shape")
print(bert_lhs_shape)

print("Last hidden state")
print(bert_lhs_data)

Last hidden state shape
torch.Size([1, 512, 768])
Last hidden state
tensor([[[ 0.5935,  0.0940, -0.1687,  ..., -0.3806,  0.0434,  0.3832],
         [ 0.4253, -0.4766,  0.7440,  ..., -0.3942,  0.3230,  0.2140],
         [ 0.4182, -0.5374,  0.5692,  ..., -0.2323, -0.0629,  1.2690],
         ...,
         [ 0.5098, -0.5685,  0.2344,  ..., -0.5356,  0.0076,  0.3847],
         [ 0.4078, -0.4184,  0.0023,  ..., -0.1491, -0.1271,  0.2958],
         [ 1.3882,  0.1162,  1.2960,  ...,  0.3271,  0.2593, -0.2280]]],
       grad_fn=<NativeLayerNormBackward0>)


In [None]:
mbert_lhs_shape = mbert_outputs.last_hidden_state.shape
mbert_lhs_data = mbert_outputs.last_hidden_state

print("Last hidden state shape")
print(mbert_lhs_shape)

print("Last hidden state")
print(mbert_lhs_data)

Last hidden state shape
torch.Size([1, 2671, 768])
Last hidden state
tensor([[[ 0.1680, -0.3384, -0.7694,  ..., -0.4745, -0.1105, -0.7589],
         [-0.8740, -0.3295,  0.6591,  ..., -1.9694, -2.3188,  0.4383],
         [ 0.3498, -1.4112,  0.0969,  ...,  0.4195,  0.0887, -0.6138],
         ...,
         [ 1.8556, -0.1904, -1.1067,  ..., -1.8617,  0.1445, -0.0550],
         [ 0.2851, -0.9601, -0.9530,  ..., -1.6096,  0.1400,  0.2290],
         [ 0.1789, -0.0395,  0.0412,  ...,  0.0570,  0.1783,  0.1189]]],
       grad_fn=<NativeLayerNormBackward0>)


As expected, the output tensors from the two models are different.

What advantage does long context capability have over short context model? Why might one prefer one model over another for their task?

Great question. Here’s the quick intuition and a pragmatic checklist.

# Why long-context helps

* **Captures far-apart evidence.** Lets the model reason over cues that are **hundreds/thousands of tokens apart** (e.g., early history vs. late assessment in a discharge note; cross-section references in legal/technical docs).
* **Less glue code.** No need for **chunking, sliding windows, or hierarchical pooling**—fewer engineering choices that can drop signal or bias results.
* **Global coherence.** Better at **coreference, timelines, discourse** and “A refers to B above” style dependencies.
* **Fewer truncation errors.** You don’t silently lose important tail content at 512 tokens.

# Why short-context can still win

* **Speed & cost.** Short models are **smaller, cheaper, faster**, with higher batch throughput and simpler deployments.
* **Task fit.** Many tasks only need **a sentence/paragraph** (NER, short QA, sentence classification, pairwise STS). Long context adds little.
* **Maturity & availability.** Tons of **well-tested checkpoints, adapters, and recipes** exist for 512-token models.
* **Noise control.** Long documents contain **distractors**; a short-context + **retrieval/selection** pipeline can be more precise and cheaper.

# Trade-offs at a glance

* **Compute/memory:** Long context ↑ memory/latency (even with FlashAttention/RoPE); short context is lean.
* **Data efficiency:** Long context may need **more data/regularization** to not overfit spurious long-range patterns.
* **Implementation:** Long context is simpler at modeling time; short context needs **chunking/RAG/hierarchical** tricks—but those can be tailored and audited.

# What to choose for common scenarios

* **Clinical readmission from discharge notes:** Long-context (e.g., ModernBERT 8k) usually better—signals spread across sections (HPI, Course, A/P).
  If constrained to 512: do **section-aware chunking + attention pooling** and consider adding structured EHR features.
* **ICD code assignment from short summaries:** Either works; short-context is fine if summaries fit ≤512.
* **Legal/finance reports, research papers:** Long-context shines for cross-section reasoning; otherwise do retrieval over sections + short model.
* **Sentence-level NER, small intent classification, pair matching:** Short-context—faster, cheaper, no benefit from long windows.

# Simple decision checklist

1. **Does important evidence span >512 tokens?** → pick **long**; else **short**.
2. **Latency/cost budget tight?** → **short** (or long only at inference for hard cases).
3. **Need auditability/consistency?** → short with **explicit retrieval/sectioning** can be easier to inspect.
4. **Engineering capacity:** if you don’t want chunking/aggregation logic → **long**.
5. **Data volume:** low data + long docs → consider **long** + strong regularization, or **short** with retrieval to reduce noise.

Bottom line: use **long-context** when meaning depends on **distant context** or you want to avoid chunking complexity; stick with **short-context** when inputs are short, budgets tight, or you prefer modular retrieval + lightweight models.


# Inference Speed

To make an apple-to-apple comparison of inference speed between the two models, let's truncate the long text to 512 tokens for both models.

In [34]:
texts = dataset["positive"]
len(texts)

5

In [35]:
dataset['positive'][4]

'Bridgewater is a town in Aroostook County, Maine, United States. The population was 532 at the 2020 census.\n\nGeography\nAccording to the United States Census Bureau, the town has a total area of , of which  is land and  is water.\n\nClimate\nThis climatic region is typified by large seasonal temperature differences, with warm to hot (and often humid) summers and cold (sometimes severely cold) winters. According to the Köppen Climate Classification system, Bridgewater has a humid continental climate, abbreviated "Dfb" on climate maps.\n\nDemographics\n\n2010 census\nAs of the census of 2010, there were 610 people, 263 households, and 175 families living in the town. The population density was . There were 326 housing units at an average density of . The racial makeup of the town was 96.7% White, 0.7% Native American, 0.2% Asian, 1.0% from other races, and 1.5% from two or more races. Hispanic or Latino of any race were 1.1% of the population.\n\nThere were 263 households, of which 25

In [36]:
bert_inputs = bert_tokenizer(dataset['positive'][0],
                             max_length=512,
                             padding=True,
                             truncation=True,
                             return_tensors='pt')

mbert_inputs = mbert_tokenizer(dataset['positive'][0],
                               max_length=512,
                               padding=True,
                               truncation=True,
                               return_tensors='pt')

In [37]:
%%time

bert_outputs = bert_model(**bert_inputs)

CPU times: user 1.32 s, sys: 0 ns, total: 1.32 s
Wall time: 222 ms


In [38]:
%%time

mbert_outputs = mbert_model(**mbert_inputs)

CPU times: user 1.67 s, sys: 0 ns, total: 1.67 s
Wall time: 280 ms


Even with the enhanced capabilities of ModernBERT, the inference time is comparable between the two models on CPU.

We can also run inference on a GPU and compare the speed on a GPU.

Select a GPU runtime and run the cells below.

In [39]:
!pip install transformers datasets -q -U

import numpy as np
import pandas as pd
import torch

from datasets import load_dataset

from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from transformers import AutoTokenizer, AutoModel

In [40]:
bert_checkpoint = "bert-base-cased"
bert_tokenizer = BertTokenizer.from_pretrained(bert_checkpoint)
bert_model = BertModel.from_pretrained(bert_checkpoint)

In [41]:
mbert_checkpoint = "answerdotai/ModernBERT-base"
mbert_tokenizer = AutoTokenizer.from_pretrained(mbert_checkpoint)
mbert_model = AutoModel.from_pretrained(mbert_checkpoint)

In [42]:
dataset = load_dataset("sentence-transformers/mldr", "en-triplet", split="train").take(5)
texts = dataset["positive"][0-4]

Again, we truncate the long context texts to the same length to compare the two models.

In [43]:
bert_inputs = bert_tokenizer(texts,
                             max_length=512,
                             padding=True,
                             truncation=True,
                             return_tensors='pt')

mbert_inputs = mbert_tokenizer(texts,
                               max_length=512,
                               padding=True,
                               truncation=True,
                               return_tensors='pt')

In [44]:
%%time

bert_outputs = bert_model(**bert_inputs)

CPU times: user 1.17 s, sys: 191 ms, total: 1.36 s
Wall time: 227 ms


In [45]:
%%time

mbert_outputs = mbert_model(**mbert_inputs)

CPU times: user 1.85 s, sys: 388 ms, total: 2.23 s
Wall time: 373 ms


GPU certainly makes things much faster and the inference speed between the two models are comparable to each other!

Next, take a look at the [ModernBERT documentation](https://huggingface.co/docs/transformers/main/en/model_doc/modernbert) to see if you can finetune a ModernBERT model for a downstream task!