# BioClinical ModernBERT vs ModernBERT

## ADE Document Classification (Synthetic)

This notebook compares **BioClinical ModernBERT** with vanilla **ModernBERT** on a synthetic clinical notes dataset. We build *silver* (rule-based) labels for adverse drug event (ADE) presence (term affirmed, not negated), fine-tune each model for two epochs, and compare evaluation metrics and training time.

**Data:** `data/synth_clinical/notes.csv`

**Task:** Binary classification: ADE present (1) vs not (0).

In [4]:
import os, random
from datasets import Dataset
import numpy as np
import pandas as pd
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, accuracy_score

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

DATA_PATH = os.environ.get("DATA_DIR", "../data/synth_clinical")
NOTES_CSV = os.path.join(DATA_PATH, "notes.csv")
assert os.path.exists(NOTES_CSV), f"Missing notes.csv at {NOTES_CSV}"
print("Using data:", NOTES_CSV)

Using data: ../data/synth_clinical/notes.csv


In [2]:
# Build silver labels: an ADE term is present AND no negation cue appears.
# Both checks are substring matches over the whole note, so one negation cue
# anywhere suppresses the label, and "aki" also matches inside words like "taking".
terms = ["acute kidney injury", "aki", "hypotension", "cough", "angioedema", "hyperkalemia"]
negs = ["denies", "no signs of", "not consistent with", "without evidence of", "rule out"]

def silver_label(text: str) -> int:
    t = str(text).lower()
    has_term = any(w in t for w in terms)
    has_neg = any(n in t for n in negs)
    return int(has_term and not has_neg)

df = pd.read_csv(NOTES_CSV)
df['label'] = df['text'].apply(silver_label).astype(int)
print(df['label'].value_counts())
df.head(2)

label
0    544
1     56
Name: count, dtype: int64


Unnamed: 0,doc_id,hadm_id,charttime,category,text,label
0,D000000,H000001,2023-06-14 12:00:00,Progress note,Patient admitted to ICU with SOFA score of 7. ...,0
1,D000001,H000002,2023-09-09 04:00:00,Discharge summary,Patient admitted to ICU with SOFA score of 0. ...,0
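
The labeling rule above works on the whole document with plain substring checks. A minimal sketch of a stricter variant follows, assuming sentence-level negation scope and word-boundary matching suit this data. The name `silver_label_scoped` and the naive sentence splitter are hypothetical and not part of the notebook; using it would change the label counts printed above.

```python
import re

TERMS = ["acute kidney injury", "aki", "hypotension", "cough", "angioedema", "hyperkalemia"]
NEGS = ["denies", "no signs of", "not consistent with", "without evidence of", "rule out"]

def _compile(phrases):
    # \b anchors keep "aki" from matching inside words such as "taking".
    return re.compile(r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b")

TERM_RE = _compile(TERMS)
NEG_RE = _compile(NEGS)

def silver_label_scoped(text: str) -> int:
    """Return 1 if any sentence mentions an ADE term without a negation cue."""
    # Naive splitter (assumption): sentences end at '.', '!', '?' or a newline.
    sentences = re.split(r"[.!?\n]+", str(text).lower())
    return int(any(TERM_RE.search(s) and not NEG_RE.search(s) for s in sentences))
```

Because the splitter also breaks on decimal points (for example in doses), a proper sentence segmenter would be a safer choice for real notes.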


In [5]:
ds = Dataset.from_pandas(df[['text', 'label']])
ds = ds.train_test_split(test_size=0.2, seed=SEED)
len(ds['train']), len(ds['test'])

(480, 120)
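
With only 56 positives among 600 notes, a purely random 80/20 split can leave the test set with few positive examples. A minimal, dependency-free sketch of a stratified split is shown here. `stratified_split` is a hypothetical helper, not part of the notebook; scikit-learn's `train_test_split(..., stratify=labels)` provides equivalent behavior.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class ratio in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)  # shuffle within each class before cutting
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Same class counts as the label distribution printed earlier: 544 negatives, 56 positives.
labels = [0] * 544 + [1] * 56
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

The index lists can then select rows, for example `df.iloc[train_idx]`, before building each `Dataset`.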

In [6]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to probs for class 1
    if logits.ndim == 2 and logits.shape[1] == 2:
        probs = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(probs)
        probs = probs[:, 1] / probs.sum(axis=1)
    else:
        probs = 1 / (1 + np.exp(-logits))
    preds = (probs >= 0.5).astype(int)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
        "auroc": roc_auc_score(labels, probs),
        "auprc": average_precision_score(labels, probs)
    }
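
The two-class branch of `compute_metrics` applies a max-shifted softmax and keeps the class-1 column. For two logits this equals the sigmoid of their difference, which a small dependency-free check can illustrate. The helper names below are illustrative only and do not appear in the notebook.

```python
import math

def softmax_p1(z0: float, z1: float) -> float:
    """Class-1 probability via softmax, shifted by the max for numerical stability."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e1 / (e0 + e1)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Softmax over two logits matches the sigmoid of (z1 - z0).
for z0, z1 in [(0.3, -1.2), (2.0, 2.0), (-4.0, 5.5)]:
    assert math.isclose(softmax_p1(z0, z1), sigmoid(z1 - z0), rel_tol=1e-9)

# The max shift keeps large logits from overflowing exp().
print(softmax_p1(1000.0, 1001.0))
```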

In [11]:
def run_model(model_name: str, max_len=512, epochs=2, batch=16, lr=2e-5, fp16=True):
    tok = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
    enc = ds.map(lambda x: tok(x['text'], max_length=max_len, truncation=True),
                 batched=True, remove_columns=['text'])
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    args = TrainingArguments(
        output_dir=f"../reports/doc_cls_{model_name.replace('/', '_')}",
        per_device_train_batch_size=batch,
        per_device_eval_batch_size=batch,
        num_train_epochs=epochs,
        learning_rate=lr,
        fp16=fp16,
        logging_steps=50,
        eval_strategy="epoch",
        save_strategy="no",
        report_to=[],
        seed=SEED,
    )
    trainer = Trainer(model=model, args=args, train_dataset=enc["train"],
                      eval_dataset=enc["test"], tokenizer=tok, compute_metrics=compute_metrics)
    
    # Time only the fine-tuning loop; the final evaluation pass is excluded.
    import time
    t0 = time.perf_counter()
    trainer.train()
    train_seconds = time.perf_counter() - t0

    metrics = trainer.evaluate()
    metrics["seconds"] = train_seconds
    return metrics
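
`run_model` defaults to `fp16=True`, which generally assumes a CUDA GPU is present. A hedged sketch of choosing the mixed-precision flags from the available hardware follows. `pick_precision_flags` is a hypothetical helper, not part of the notebook, and it falls back to full precision when PyTorch or CUDA is missing.

```python
def pick_precision_flags() -> dict:
    """Choose mixed-precision flags for TrainingArguments based on the hardware."""
    flags = {"fp16": False, "bf16": False}
    try:
        import torch
    except ImportError:
        return flags  # no PyTorch available: stay in full precision
    if torch.cuda.is_available():
        if torch.cuda.is_bf16_supported():
            flags["bf16"] = True  # prefer bfloat16 where the GPU supports it
        else:
            flags["fp16"] = True
    return flags

print(pick_precision_flags())
```

One way to use it would be `TrainingArguments(..., **pick_precision_flags())` in place of the `fp16=fp16` argument.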

In [12]:
results = {}

bio_model = "thomas-sounack/BioClinical-ModernBERT-base"
van_model = "answerdotai/ModernBERT-base"

for name, mn in [("BioClinical ModernBERT", bio_model),
                 ("ModernBERT (vanilla)", van_model)]:
    print(f"\n==== Training {name}: {mn} ====")
    metrics = run_model(mn, max_len=512, epochs=2, batch=16, lr=2e-5, fp16=True)
    results[name] = metrics

results


==== Training BioClinical ModernBERT: thomas-sounack/BioClinical-ModernBERT-base ====


Map: 100%|██████████| 480/480 [00:00<00:00, 21425.02 examples/s]
Map: 100%|██████████| 120/120 [00:00<00:00, 16091.19 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at thomas-sounack/BioClinical-ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.26.0`: Please run `pip install transformers[torch]` or `pip install 'accelerate>=0.26.0'`