# BioClinical ModernBERT vs ModernBERT

## ADE Document Classification (Synthetic)

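This notebook compares **BioClinical ModernBERT** with vanilla **ModernBERT** on a synthetic clinical notes dataset. We build *silver* (rule-derived) labels for adverse drug event (ADE) presence: a note is positive when it mentions an ADE term, contains no negation cue, and belongs to an admission flagged `T == 1` and `AKI == 1` in the cohort table. We then fine-tune each model for a few epochs and compare metrics and training time.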

**Data:** `data/synth_clinical/notes.csv`, joined with `data/synth_clinical/cohort.csv` on `hadm_id`

**Task:** Binary classification: ADE present (1) vs not (0).

In [17]:
import os, random
from datasets import Dataset
import numpy as np
import pandas as pd
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, accuracy_score

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

DATA_PATH = os.environ.get("DATA_DIR", "../data/synth_clinical")
NOTES_CSV = os.path.join(DATA_PATH, "notes.csv")
COHORT_CSV = os.path.join(DATA_PATH, "cohort.csv")

# Both files are read later, so check each one explicitly
for path in (NOTES_CSV, COHORT_CSV):
    assert os.path.exists(path), f"Missing data file: {path}"
print("Using data:", NOTES_CSV, COHORT_CSV)

Using data: ../data/synth_clinical/notes.csv ../data/synth_clinical/cohort.csv


In [20]:
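# Build silver labels: an ADE term is present, no negation cue appears anywhere
# in the note, and the cohort table flags the admission with T == 1 and AKI == 1.
# Caveat: matching is by plain substring, so "aki" also matches inside words such
# as "taking", and a single negation cue suppresses the label for the whole note.
terms = ["acute kidney injury", "aki", "hypotension", "cough", "angioedema", "hyperkalemia"]
negs = ["denies", "no signs of", "not consistent with", "without evidence of", "rule out"]

def has_term(t):
    return any(w in t for w in terms)

def has_neg(t):
    return any(n in t for n in negs)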

notes = pd.read_csv(NOTES_CSV)
cohort = pd.read_csv(COHORT_CSV)

df = notes.merge(cohort[["hadm_id", "T", "AKI"]], on="hadm_id", how="left")

t = df["text"].str.lower()
df['label'] = ((t.apply(has_term)) & (~t.apply(has_neg)) & (df["T"]==1) & (df["AKI"]==1)).astype(int)
print(df['label'].value_counts())
df.head(2)

label
0    575
1     25
Name: count, dtype: int64


,doc_id,hadm_id,charttime,category,text,T,AKI,label
0,D000000,H000001,2023-06-14 12:00:00,Progress note,Patient admitted to ICU with SOFA score of 7. ...,1,0,0
1,D000001,H000002,2023-09-09 04:00:00,Discharge summary,Patient admitted to ICU with SOFA score of 0. ...,1,0,0
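
Plain substring checks can match a term inside a longer word (for example, `"aki"` inside `"taking"`). Below is a minimal sketch of word-boundary matching, assuming the same term and cue lists as the labeling cell. The names `compile_phrases`, `TERM_RE`, `NEG_RE`, and `silver_text_flag` are hypothetical, and this variant was not used to produce the label counts shown above.

```python
import re

# Same term and negation-cue lists as the labeling cell above.
terms = ["acute kidney injury", "aki", "hypotension", "cough", "angioedema", "hyperkalemia"]
negs = ["denies", "no signs of", "not consistent with", "without evidence of", "rule out"]

def compile_phrases(phrases):
    """Compile one case-insensitive pattern that matches any phrase as whole words."""
    alternation = "|".join(re.escape(p) for p in phrases)
    return re.compile(rf"\b(?:{alternation})\b", flags=re.IGNORECASE)

TERM_RE = compile_phrases(terms)
NEG_RE = compile_phrases(negs)

def silver_text_flag(text: str) -> bool:
    """True if the note mentions an ADE term and contains no negation cue."""
    return bool(TERM_RE.search(text)) and not NEG_RE.search(text)

print(silver_text_flag("Patient is taking lisinopril."))          # False: 'aki' only inside 'taking'
print(silver_text_flag("New AKI after starting lisinopril."))     # True
print(silver_text_flag("Denies cough. No signs of angioedema."))  # False: negation cue present
```

The text flag would still need to be combined with the `T` and `AKI` cohort columns to match the label definition used in this notebook.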


In [21]:
ds = Dataset.from_pandas(df[['text', 'label']])
ds = ds.train_test_split(test_size=0.2, seed=SEED)
len(ds['train']), len(ds['test'])

(480, 120)
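
With only 25 positives among 600 notes, a purely random 80/20 split can leave very few positives in the test set. A stratified split keeps the class ratio roughly equal in both parts. The sketch below is a standard-library illustration of the idea, not the split used in this notebook; `stratified_split` is a hypothetical helper, and the toy labels only mirror the class counts printed earlier.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class ratio in each part."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)                        # shuffle within each class
        n_test = round(len(idx) * test_size)    # per-class test quota
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same class counts as the notebook: 575 negatives, 25 positives.
labels = [0] * 575 + [1] * 25
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))        # 480 120
print(sum(labels[i] for i in test_idx))     # 5 positives in the test part
```

The resulting index lists can be passed to `Dataset.select`. Library alternatives exist as well: scikit-learn's `train_test_split` accepts a `stratify` argument, and `Dataset.train_test_split` accepts `stratify_by_column` when the label column is a `ClassLabel` feature.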

In [22]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to the probability of class 1
    if logits.ndim == 2 and logits.shape[1] == 2:
        # Softmax, shifted by the row max for numerical stability
        shifted = logits - logits.max(axis=1, keepdims=True)
        exp = np.exp(shifted)
        probs = exp[:, 1] / exp.sum(axis=1)
    else:
        # Single-logit head: sigmoid, flattened to 1-D
        probs = 1 / (1 + np.exp(-np.ravel(logits)))
    preds = (probs >= 0.5).astype(int)
    return {
        "accuracy": accuracy_score(labels, preds),
        # zero_division=0 keeps F1 at 0 without a warning when it is undefined
        "f1": f1_score(labels, preds, zero_division=0),
        "auroc": roc_auc_score(labels, probs),
        "auprc": average_precision_score(labels, probs)
    }
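
`compute_metrics` subtracts the row maximum before exponentiating. The shift leaves the softmax unchanged mathematically but avoids overflow in `exp`. A small sketch of the difference, using made-up logits; `softmax_naive` and `softmax_stable` are illustrative names, not part of the notebook:

```python
import numpy as np

def softmax_naive(logits):
    """Direct softmax; exp() overflows for large logits."""
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def softmax_stable(logits):
    """Softmax after subtracting the row max; same result, no overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# First row has very large logits, second row is well-behaved.
logits = np.array([[1000.0, 1001.0], [-2.0, 3.0]])

# Silence the overflow/invalid warnings raised by the naive version.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)   # first row becomes nan (inf / inf)

stable = softmax_stable(logits)
print(naive[0], stable[0])
```

For moderate logits the two versions agree, so the shift costs nothing in accuracy.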

In [23]:
def run_model(model_name: str, max_len=1024, epochs=5, batch=16, lr=2e-5, fp16=True):
    import time

    tok = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
    # Tokenize without padding; the Trainer pads each batch dynamically
    enc = ds.map(lambda x: tok(x['text'], max_length=max_len, truncation=True),
                 batched=True, remove_columns=['text'])
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    args = TrainingArguments(
        # One output directory per model, e.g. ../reports/doc_cls_answerdotai_ModernBERT-base
        output_dir=f"../reports/doc_cls_{model_name.replace('/', '_')}",
        per_device_train_batch_size=batch,
        per_device_eval_batch_size=batch,
        num_train_epochs=epochs,
        learning_rate=lr,
        fp16=fp16,  # mixed precision; needs a GPU
        logging_steps=50,
        eval_strategy="epoch",
        save_strategy="no",
        report_to=[],
        seed=SEED,
    )
    # processing_class replaces the deprecated tokenizer argument of Trainer
    trainer = Trainer(model=model, args=args, train_dataset=enc["train"],
                      eval_dataset=enc["test"], processing_class=tok,
                      compute_metrics=compute_metrics)

    # Wall-clock training time; includes the evaluation pass run after each epoch
    t0 = time.perf_counter()
    trainer.train()
    train_seconds = time.perf_counter() - t0

    metrics = trainer.evaluate()
    metrics["seconds"] = train_seconds
    return metrics

In [24]:
results = {}

bio_model = "thomas-sounack/BioClinical-ModernBERT-base"
van_model = "answerdotai/ModernBERT-base"

for name, mn in [("BioClinical ModernBERT", bio_model),
                 ("ModernBERT (vanilla)", van_model)]:
    print(f"\n==== Training {name}: {mn} ====")
    metrics = run_model(mn, max_len=1024, epochs=5, batch=16, lr=2e-5, fp16=True)
    results[name] = metrics

results


==== Training BioClinical ModernBERT: thomas-sounack/BioClinical-ModernBERT-base ====


Map: 100%|██████████| 480/480 [00:00<00:00, 18820.67 examples/s]
Map: 100%|██████████| 120/120 [00:00<00:00, 14648.33 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at thomas-sounack/BioClinical-ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Auroc,Auprc
1,No log,0.032415,1.0,1.0,1.0,1.0
2,0.145100,1.3e-05,1.0,1.0,1.0,1.0
3,0.145100,4e-06,1.0,1.0,1.0,1.0
4,0.000000,3e-06,1.0,1.0,1.0,1.0
5,0.000000,3e-06,1.0,1.0,1.0,1.0



==== Training ModernBERT (vanilla): answerdotai/ModernBERT-base ====


Map: 100%|██████████| 480/480 [00:00<00:00, 17092.72 examples/s]
Map: 100%|██████████| 120/120 [00:00<00:00, 16902.29 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Auroc,Auprc
1,No log,0.028425,0.975,0.0,1.0,1.0
2,0.152800,4.1e-05,1.0,1.0,1.0,1.0
3,0.152800,3e-06,1.0,1.0,1.0,1.0
4,0.000100,3e-06,1.0,1.0,1.0,1.0
5,0.000000,3e-06,1.0,1.0,1.0,1.0


{'BioClinical ModernBERT': {'eval_loss': 3.266334488216671e-06,
  'eval_accuracy': 1.0,
  'eval_f1': 1.0,
  'eval_auroc': 1.0,
  'eval_auprc': 1.0,
  'eval_runtime': 0.2328,
  'eval_samples_per_second': 515.529,
  'eval_steps_per_second': 34.369,
  'epoch': 5.0,
  'seconds': 14.32555623799999},
 'ModernBERT (vanilla)': {'eval_loss': 2.5063752673304407e-06,
  'eval_accuracy': 1.0,
  'eval_f1': 1.0,
  'eval_auroc': 1.0,
  'eval_auprc': 1.0,
  'eval_runtime': 0.2287,
  'eval_samples_per_second': 524.7,
  'eval_steps_per_second': 34.98,
  'epoch': 5.0,
  'seconds': 13.975709999998799}}
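
The epoch-1 row for vanilla ModernBERT above reports accuracy 0.975 and F1 0.0 alongside AUROC 1.0. That pattern is consistent with a test split holding 3 positives out of 120, all scored below the 0.5 threshold yet ranked above every negative. The sketch below reproduces the pattern with made-up probabilities (not the model's actual outputs), computing each metric by hand:

```python
import numpy as np

# Hypothetical scores: 120 test notes, 3 positives, every probability below 0.5,
# but each positive scored above every negative.
labels = np.array([1] * 3 + [0] * 117)
probs = np.concatenate([np.full(3, 0.40), np.full(117, 0.05)])

preds = (probs >= 0.5).astype(int)   # thresholding predicts all-negative

tp = int(((preds == 1) & (labels == 1)).sum())
fp = int(((preds == 1) & (labels == 0)).sum())
fn = int(((preds == 0) & (labels == 1)).sum())

accuracy = float((preds == labels).mean())
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0

# AUROC as the fraction of (positive, negative) pairs ranked correctly,
# with ties counted as half.
pos, neg = probs[labels == 1], probs[labels == 0]
diff = pos[:, None] - neg[None, :]
auroc = float(((diff > 0) + 0.5 * (diff == 0)).mean())

print(accuracy, f1, auroc)   # 0.975 0.0 1.0
```

Accuracy and F1 depend on the 0.5 threshold, while AUROC and AUPRC depend only on the ranking, which is why they can disagree this sharply on an imbalanced test set.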

In [25]:
pd.DataFrame(results).T[['eval_accuracy','eval_f1','eval_auroc','eval_auprc','eval_loss','seconds']].sort_values('eval_f1', ascending=False)


model,eval_accuracy,eval_f1,eval_auroc,eval_auprc,eval_loss,seconds
BioClinical ModernBERT,1.0,1.0,1.0,1.0,3e-06,14.325556
ModernBERT (vanilla),1.0,1.0,1.0,1.0,3e-06,13.97571
