# BioClinical ModernBERT vs ModernBERT — Causal ADE Classification (Synthetic Notes)

This notebook benchmarks **BioClinical ModernBERT** against **vanilla ModernBERT**
for detecting **causal Adverse Drug Events (ADEs)** in synthetic ICU notes.

We use a synthetic dataset (`notes_hard_v4.csv`, `doc_labels_hard_v4.csv`) that contains both **textual** and **structural** signals
for drug–ADE relationships. Each note is labeled positive only when both of the following hold:

- The patient was *treated with an ACE inhibitor* (`T=1`), **and**
- The note explicitly links the treatment to an **adverse outcome** (`AKI=1` with causal phrasing).
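
The labeling rule above can be sketched as a small function. This is only an illustration: the dataset generator is not shown in this notebook, and `hard_label` and the `causal_phrase` flag are hypothetical names.

```python
def hard_label(treated: int, aki: int, causal_phrase: bool) -> int:
    """Return 1 only if the patient was treated, AKI occurred, and the note links them."""
    return int(treated == 1 and aki == 1 and causal_phrase)

# Hypothetical (T, AKI, causal phrasing) combinations
examples = [
    (1, 1, True),   # treated, AKI, note attributes AKI to the drug
    (1, 1, False),  # treated, AKI, but no causal link stated
    (0, 1, True),   # not treated
    (1, 0, False),  # treated, no AKI
]
labels = [hard_label(*e) for e in examples]
print(labels)
```

Only the first combination is positive, so a model cannot succeed by detecting the drug mention or the AKI mention alone.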

## 1. Load the Synthetic Dataset

We load `notes_hard_v4.csv` and `doc_labels_hard_v4.csv`, then merge them on `doc_id`.  
Each note contains free text describing an ICU admission, while `label` encodes the binary ADE outcome and `split` assigns the note to the train or test set.

The dataset has:
- Text field (`text`)
- Label field (`label`)
- Roughly balanced class distribution

Let's preview and confirm the merge worked correctly.


In [3]:
import os, random
from datasets import Dataset
import numpy as np
import pandas as pd
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, DataCollatorWithPadding)

from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, accuracy_score

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

DATA_DIR = os.environ.get("DATA_DIR", "../data/synth_clinical")

notes_v4  = os.path.join(DATA_DIR, "notes_hard_v4.csv")
labels_v4 = os.path.join(DATA_DIR, "doc_labels_hard_v4.csv")

notes = pd.read_csv(notes_v4)
labs  = pd.read_csv(labels_v4)
df = notes.merge(labs[['doc_id','label','split']], on='doc_id', how='left')
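
One way to confirm that a left merge like this worked is to ask pandas to validate key uniqueness and to flag unmatched rows. The sketch below uses small toy frames (hypothetical rows, not the real CSVs) and the standard `validate` and `indicator` arguments of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the notes and labels files (hypothetical rows)
notes = pd.DataFrame({"doc_id": [1, 2, 3], "text": ["note a", "note b", "note c"]})
labs = pd.DataFrame({"doc_id": [1, 2], "label": [1, 0], "split": ["train", "test"]})

# validate="one_to_one" raises MergeError if doc_id is duplicated on either side;
# indicator=True adds a `_merge` column showing where each row came from.
merged = notes.merge(labs, on="doc_id", how="left", validate="one_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} notes have no label row")
print(merged["label"].value_counts(dropna=False))
```

With `how='left'`, notes without a label row survive the merge with `NaN` in `label` and `split`, so they would silently drop out of both subsets in the next step; a check like this makes that visible.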

## 2. Prepare the Dataset for Modeling

We convert the merged dataframe into Hugging Face `Dataset` objects, using the predefined `split` column
to separate training and test subsets (6,464 and 1,536 notes, roughly 80/20).

This allows us to evaluate generalization on unseen synthetic notes.

In [4]:
train_df = df[df['split']=='train'][['text','label']].reset_index(drop=True)
test_df  = df[df['split']=='test'][['text','label']].reset_index(drop=True)

ds = {
    "train": Dataset.from_pandas(train_df),
    "test":  Dataset.from_pandas(test_df),
}
len(ds["train"]), len(ds["test"])

(6464, 1536)

## 3. Define Evaluation Metrics

For consistency with biomedical NLP literature, we evaluate:

- **Accuracy** — overall classification correctness  
- **F1 Score** — harmonic mean of precision & recall  
- **AUROC** — area under the ROC curve (ranking ability)  
- **AUPRC** — area under the precision–recall curve (robust to imbalance)
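
To make the definitions above concrete, here is a worked example in plain Python on hypothetical labels and scores. It covers accuracy, F1, and AUROC (computed as the probability that a random positive outranks a random negative); it is a sketch for intuition, not a replacement for the scikit-learn functions used in this notebook.

```python
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]  # hypothetical P(label = 1)
preds = [int(s >= 0.5) for s in scores]

# Confusion-matrix counts at the 0.5 threshold
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# AUROC: fraction of (positive, negative) pairs ranked correctly; ties count 0.5
pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
auroc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(accuracy, f1, auroc)
```

Accuracy and F1 depend on the 0.5 threshold, while AUROC depends only on how the scores rank the notes, which is why the two kinds of metric can move independently across epochs.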

In [5]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Numerically stable softmax: subtract the row-wise max before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp[:, 1] / exp.sum(axis=1)  # P(label = 1)
    preds = (probs >= 0.5).astype(int)
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1': f1_score(labels, preds),
        'auroc': roc_auc_score(labels, probs),
        'auprc': average_precision_score(labels, probs)
    }
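
The max-subtraction inside `compute_metrics` is the usual trick for a numerically stable softmax. A minimal standalone sketch with hypothetical logits shows why it matters: the first row would overflow `np.exp` without the shift.

```python
import numpy as np

logits = np.array([[1000.0, 1001.0],   # large values: a naive exp() would overflow
                   [2.0, -1.0]])

shifted = logits - logits.max(axis=1, keepdims=True)  # row-wise max becomes 0
exp = np.exp(shifted)
p_pos = exp[:, 1] / exp.sum(axis=1)                    # P(label = 1) per row

print(p_pos)
```

Subtracting a per-row constant leaves the softmax unchanged, because the same factor cancels in the numerator and denominator.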


## 4. Define Model Training Routine

We create a helper `run_model()` that:

1. Loads the chosen pretrained ModernBERT tokenizer & model  
2. Tokenizes text up to a maximum sequence length  
3. Trains for a configurable number of epochs  
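4. Logs key metrics on the evaluation set (here, the test split) after each epoch

The function returns a dictionary of performance statistics for easy comparison. Because the best checkpoint is selected by AUPRC on that same test split, the reported scores are somewhat optimistic.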


In [21]:
def run_model(model_name:str, max_len=768, epochs=3, batch=16, lr=2e-5, fp16=True):
    tok = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)

    def tok_fn(b):
        return tok(b["text"], max_length=max_len, truncation=True)

    enc = {k: v.map(tok_fn, batched=True, remove_columns=["text"]) for k, v in ds.items()}
    data_collator = DataCollatorWithPadding(tok)  # dynamic padding per batch
    
    model=AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    args=TrainingArguments(
        output_dir=f"../reports/doc_cls_hard_{model_name.replace('/','_')}",
        per_device_train_batch_size=batch,
        per_device_eval_batch_size=batch,
        num_train_epochs=epochs,
        learning_rate=lr,
        weight_decay=0.1,
        lr_scheduler_type='cosine',
        warmup_ratio=0.1,
        fp16=fp16,
        logging_steps=50,
        eval_strategy='epoch',
        save_strategy='epoch',
        label_smoothing_factor=0.2,
        load_best_model_at_end=True,
        metric_for_best_model="eval_auprc",
        report_to=[],
        max_grad_norm=1.0,
        greater_is_better=True,
        save_total_limit=2,
        seed=SEED
    )
    trainer=Trainer(model=model, args=args, train_dataset=enc['train'], eval_dataset=enc['test'], processing_class=tok, 
                    data_collator=data_collator, compute_metrics=compute_metrics)
    import time
    t0 = time.time()
    trainer.train()
    dur = time.time() - t0
    metrics = trainer.evaluate()
    metrics['seconds'] = dur  # wall-clock training time
    return metrics
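
`run_model()` relies on `DataCollatorWithPadding` to pad each batch only to the length of its own longest sequence. The idea can be sketched in plain Python as follows; this is an illustration rather than the library's implementation, and `pad_batch` and `pad_id=0` are hypothetical (the real tokenizer defines its own pad token id).

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to the length of that batch's longest sequence."""
    width = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token-id sequences of varying length
short_batch = pad_batch([[5, 6], [7, 8, 9]])
long_batch = pad_batch([[1] * 40, [2] * 83])

print(len(short_batch["input_ids"][0]), len(long_batch["input_ids"][0]))
```

Each batch gets its own width, so short notes are not padded out to `max_len`, which saves compute when most notes are well below the limit.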


## 5. Train and Compare Models

We benchmark two checkpoints that share the ModernBERT architecture:

| Model | Description |
|--------|--------------|
| **BioClinical ModernBERT** | Domain-adapted version trained on biomedical corpora (MIMIC, PubMed) |
| **ModernBERT (vanilla)** | General English version trained on large web text |

Both are fine-tuned for binary sequence classification on our synthetic ADE dataset, using `max_len=128`, 6 epochs, batch size 32, and a learning rate of `1e-5`.


In [22]:
results={}
for name, mn in [("BioClinical ModernBERT", "thomas-sounack/BioClinical-ModernBERT-base"),
                 ("ModernBERT (vanilla)", "answerdotai/ModernBERT-base")]:
    print(f"\n==== Training {name}: {mn} ====")
    results[name]=run_model(mn, max_len=128, epochs=6, batch=32, lr=1e-5)
results



==== Training BioClinical ModernBERT: thomas-sounack/BioClinical-ModernBERT-base ====


Map: 100%|██████████| 6464/6464 [00:00<00:00, 28459.27 examples/s]
Map: 100%|██████████| 1536/1536 [00:00<00:00, 25080.10 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at thomas-sounack/BioClinical-ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Auroc | Auprc |
|-------|---------------|-----------------|----------|----|-------|-------|
| 1 | 0.4584 | 0.425841 | 0.947266 | 0.948831 | 0.950244 | 0.942866 |
| 2 | 0.4197 | 0.418756 | 0.954427 | 0.955471 | 0.952118 | 0.947214 |
| 3 | 0.4242 | 0.408924 | 0.954427 | 0.955471 | 0.952524 | 0.951783 |
| 4 | 0.3869 | 0.410186 | 0.954427 | 0.955471 | 0.949895 | 0.951002 |
| 5 | 0.39 | 0.408628 | 0.954427 | 0.955471 | 0.951059 | 0.947857 |
| 6 | 0.3803 | 0.408561 | 0.954427 | 0.955471 | 0.95116 | 0.949261 |



==== Training ModernBERT (vanilla): answerdotai/ModernBERT-base ====


Map: 100%|██████████| 6464/6464 [00:00<00:00, 28376.67 examples/s]
Map: 100%|██████████| 1536/1536 [00:00<00:00, 29342.55 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


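| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Auroc | Auprc |
|-------|---------------|-----------------|----------|----|-------|-------|
| 1 | 0.4642 | 0.438118 | 0.933594 | 0.936803 | 0.9548 | 0.951175 |
| 2 | 0.409 | 0.408706 | 0.954427 | 0.955471 | 0.954127 | 0.955611 |
| 3 | 0.4145 | 0.405766 | 0.954427 | 0.955471 | 0.95464 | 0.947457 |
| 4 | 0.3874 | 0.405943 | 0.954427 | 0.955471 | 0.951401 | 0.950117 |
| 5 | 0.3891 | 0.40524 | 0.954427 | 0.955471 | 0.953182 | 0.952365 |
| 6 | 0.38 | 0.405845 | 0.954427 | 0.955471 | 0.953383 | 0.952728 |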


{'BioClinical ModernBERT': {'eval_loss': 0.4089237153530121,
  'eval_accuracy': 0.9544270833333334,
  'eval_f1': 0.955470737913486,
  'eval_auroc': 0.9525244465571099,
  'eval_auprc': 0.9517825891984698,
  'eval_runtime': 1.8817,
  'eval_samples_per_second': 816.263,
  'eval_steps_per_second': 25.508,
  'epoch': 6.0,
  'seconds': 173.26331639289856},
 'ModernBERT (vanilla)': {'eval_loss': 0.4087055027484894,
  'eval_accuracy': 0.9544270833333334,
  'eval_f1': 0.955470737913486,
  'eval_auroc': 0.9541270541898683,
  'eval_auprc': 0.9556107462964034,
  'eval_runtime': 2.0301,
  'eval_samples_per_second': 756.62,
  'eval_steps_per_second': 23.644,
  'epoch': 6.0,
  'seconds': 171.17205333709717}}

## 6. Results Summary

Below we summarize final evaluation metrics after fine-tuning.

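Both models score about **0.95** on every metric, and from epoch 2 onward they report identical accuracy (0.954427) and F1 (0.955471); the small differences in AUROC and AUPRC slightly favor vanilla ModernBERT. Scores this high suggest the task is relatively easy for these models, possibly because
the dataset contains strong lexical cues ("after starting", "denies", "no evidence of", etc.)
that the models can exploit directly, although the keyword baseline in Section 7 reaches only 0.540 test accuracy, so simple cue matching alone does not explain the result.

Successive versions of the dataset (`hard_v2`, `hard_v3`, `hard_v4`) progressively remove such shortcuts
to test deeper reasoning and contextual understanding; this notebook uses `hard_v4`.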


In [23]:
pd.DataFrame(results).T[['eval_accuracy','eval_f1','eval_auroc','eval_auprc','eval_loss','seconds']]


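| Model | eval_accuracy | eval_f1 | eval_auroc | eval_auprc | eval_loss | seconds |
|-------|---------------|---------|------------|------------|-----------|---------|
| BioClinical ModernBERT | 0.954427 | 0.955471 | 0.952524 | 0.951783 | 0.408924 | 173.263316 |
| ModernBERT (vanilla) | 0.954427 | 0.955471 | 0.954127 | 0.955611 | 0.408706 | 171.172053 |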
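
An `eval_loss` near 0.41 can look high next to 95% accuracy, but the training configuration uses `label_smoothing_factor=0.2`. The sketch below computes the lowest cross-entropy reachable against a smoothed target. It assumes the evaluation loss is computed with the same smoothing as training and that the smoothed target has the form (1 - ε) · one-hot + ε / K; `smoothed_ce_floor` is a hypothetical helper.

```python
import math

def smoothed_ce_floor(epsilon: float, num_classes: int) -> float:
    """Minimum cross-entropy against the smoothed target (1 - eps) * one_hot + eps / K."""
    p_true = (1 - epsilon) + epsilon / num_classes
    p_other = epsilon / num_classes
    # The minimum is reached when predictions equal the target: the target's entropy.
    return -(p_true * math.log(p_true) + (num_classes - 1) * p_other * math.log(p_other))

floor = smoothed_ce_floor(epsilon=0.2, num_classes=2)
print(round(floor, 4))
```

Under these assumptions the loss cannot fall below roughly 0.325 for two classes, so the reported values of about 0.41 sit fairly close to the floor and should not be compared with unsmoothed cross-entropy from other runs.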


In [14]:
print(df.groupby('split')['label'].value_counts())  # pass normalize=True for proportions


split  label
test   1         796
       0         740
train  0        3235
       1        3229
Name: count, dtype: int64


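## 7. Simple Keyword Baseline Metrics

As a sanity check, we score a rule that predicts positive when a note contains a causal cue phrase and no negation phrase.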



In [None]:
CUES  = ["after starting", "following initiation", "soon after",
         "temporal association", "shortly post-initiation", "in close proximity"]
NEGS  = ["no evidence of", "denies", "without signs of", "not ", "unlikely to"]

def keyword_baseline(text):
    t = text.lower()
    cue = any(c in t for c in CUES)
    neg = any(n in t for n in NEGS)
    return int(cue and not neg)   # 1 = positive guess, else 0

for split in ["train","test"]:
    p = df[df.split==split]
    preds = p.text.map(keyword_baseline).values
    acc = (preds == p.label.values).mean()
    print(f"{split} keyword baseline accuracy: {acc:.3f}")


train keyword baseline accuracy: 0.595
test keyword baseline accuracy: 0.540
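
To put these numbers in context, it helps to compare them with a majority-class baseline. The sketch below uses the class counts from the `value_counts()` output and the keyword accuracies printed in this notebook; nothing is recomputed from the data.

```python
# Class counts copied from the value_counts() output in this notebook
counts = {
    "train": {0: 3235, 1: 3229},
    "test": {0: 740, 1: 796},
}
keyword_acc = {"train": 0.595, "test": 0.540}  # reported keyword-baseline accuracy

majority_acc = {}
for split, c in counts.items():
    # Accuracy of always predicting the most frequent class in that split
    majority_acc[split] = max(c.values()) / sum(c.values())
    print(f"{split}: majority={majority_acc[split]:.3f}  keyword={keyword_acc[split]:.3f}")
```

On the test split the keyword rule beats the majority-class guess by only about two percentage points, which suggests these surface cues carry little signal in `hard_v4`.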


In [24]:
assert set(df[df.split=='train'].subject_id) & set(df[df.split=='test'].subject_id) == set(), "Subject leakage!"

# Inspect token lengths: if most notes are short, a smaller max_len cuts compute without truncating much text
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("thomas-sounack/BioClinical-ModernBERT-base", add_prefix_space=True)
lens = df['text'].sample(200).apply(lambda s: len(tok(s, truncation=True)['input_ids']))
print("p90 length:", np.percentile(lens, 90))

p90 length: 83.0
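
A p90 value alone does not say how many notes a given `max_len` would truncate. A small helper along these lines reports both; `truncation_report` and the token counts are hypothetical, and in the notebook the lengths would come from the tokenizer as in the cell above.

```python
import numpy as np

def truncation_report(lengths, max_len):
    """Return (p90 token length, fraction of notes longer than max_len)."""
    lengths = np.asarray(lengths)
    return float(np.percentile(lengths, 90)), float((lengths > max_len).mean())

# Hypothetical token counts for eight notes
lens = [40, 55, 60, 72, 83, 90, 130, 150]
for max_len in (64, 128, 768):
    p90, frac = truncation_report(lens, max_len)
    print(f"max_len={max_len}: p90={p90:.1f}, truncated={frac:.1%}")
```

Running the same report on the real lengths would show directly how much text `max_len=128` discards compared with the function's default of 768.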
