# BioClinical ModernBERT vs ModernBERT — Causal ADE Classification (Synthetic Notes)

This notebook benchmarks **BioClinical ModernBERT** against **vanilla ModernBERT**
for detecting **causal Adverse Drug Events (ADEs)** in synthetic ICU notes.

We use a synthetic dataset (`notes_hard_v4.csv`, `doc_labels_hard_v4.csv`) that contains both **textual** and **structural** signals
for drug–ADE relationships. Each note is labeled positive only when both:

- The patient was *treated with an ACE inhibitor* (`T=1`), **and**
- The note explicitly links the treatment to an **adverse outcome** (`AKI=1` with causal phrasing).

## 1. Load the Synthetic Dataset

We load `notes_hard_v4.csv` and `doc_labels_hard_v4.csv`, then merge them on `doc_id`.  
Each note contains free text describing an ICU admission, while `label` encodes the binary ADE outcome.

The dataset has:
- Text field (`text`)
- Label field (`label`)
- Roughly balanced class distribution

Let's preview and confirm the merge worked correctly.


In [None]:
import os, random
from datasets import Dataset
import numpy as np
import pandas as pd
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, accuracy_score

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

DATA_DIR = os.environ.get("DATA_DIR", "../data/synth_clinical")

notes_v4  = os.path.join(DATA_DIR, "notes_hard_v4.csv")
labels_v4 = os.path.join(DATA_DIR, "doc_labels_hard_v4.csv")

notes = pd.read_csv(notes_v4)
labs  = pd.read_csv(labels_v4)
df = notes.merge(labs[['doc_id','label','split']], on='doc_id', how='left')

(6464, 1536)
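
Because the merge uses `how='left'`, a note without a matching label row would silently receive `NaN` in `label` and `split`. The sketch below shows one way to confirm that the merge worked, using toy frames in place of the real CSV files. The `check_merge` helper and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def check_merge(notes: pd.DataFrame, labs: pd.DataFrame) -> pd.DataFrame:
    """Left-merge labels onto notes and fail loudly on missing or duplicate keys."""
    merged = notes.merge(
        labs[["doc_id", "label", "split"]],
        on="doc_id",
        how="left",
        validate="one_to_one",  # raises MergeError if doc_id repeats on either side
        indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
    )
    unmatched = int((merged["_merge"] != "both").sum())
    if unmatched:
        raise ValueError(f"{unmatched} notes have no label row")
    return merged.drop(columns="_merge")

# Toy frames standing in for the notes and label files (illustrative only)
toy_notes = pd.DataFrame({"doc_id": [1, 2, 3], "text": ["a", "b", "c"]})
toy_labs = pd.DataFrame({"doc_id": [1, 2, 3],
                         "label": [0, 1, 0],
                         "split": ["train", "train", "test"]})
toy_df = check_merge(toy_notes, toy_labs)
print(toy_df.shape)  # (3, 4)
```

The same helper could be called on `notes` and `labs` in place of the bare `merge`, assuming `doc_id` is unique in both files.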

## 2. Prepare the Dataset for Modeling

We convert the merged dataframe into Hugging Face `Dataset` objects, using the dataset's
predefined `split` column to separate training and test subsets (6,464 and 1,536 notes, roughly 80/20).

This allows us to evaluate generalization on unseen synthetic notes.

In [None]:
train_df

Unnamed: 0,text,label
0,Patient admitted to ICU with SOFA score of 3. ...,0
1,Patient admitted to ICU with SOFA score of 9. ...,1
2,[ADE] noted duing treatment with [DRUG]. Patie...,1
3,Care plan discussed with team. Patient admitte...,0
4,Patient admitted to ICU with SOFA score of 4. ...,0
...,...,...
6459,Care plan discussed with team. Patient admitte...,1
6460,"Soon after {drug} was begun, creatinine remain...",0
6461,Patient admitted to ICU with SOFA score of 4. ...,0
6462,Patinet admitted to ICU with SOFA score of 7. ...,0


In [None]:
train_df = df[df['split']=='train'][['text','label']].reset_index(drop=True)
test_df  = df[df['split']=='test'][['text','label']].reset_index(drop=True)

ds = {
    "train": Dataset.from_pandas(train_df),
    "test":  Dataset.from_pandas(test_df),
}
len(ds["train"]), len(ds["test"])

## 3. Define Evaluation Metrics

For consistency with biomedical NLP literature, we evaluate:

- **Accuracy** — overall classification correctness  
- **F1 Score** — harmonic mean of precision & recall  
- **AUROC** — area under the ROC curve (ranking ability)  
- **AUPRC** — area under the precision–recall curve (robust to imbalance)

In [28]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Numerically stable softmax: subtract the row-wise max before exponentiating
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp[:, 1] / exp.sum(axis=1)   # probability of the positive class (label 1)
    preds = (probs >= 0.5).astype(int)    # fixed 0.5 decision threshold
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1': f1_score(labels, preds),
        'auroc': roc_auc_score(labels, probs),            # threshold-free
        'auprc': average_precision_score(labels, probs),  # threshold-free
    }
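
As a quick check of the softmax step used in `compute_metrics`, the sketch below applies the same max-subtraction trick to toy logits, including a row with very large values that would overflow a naive `np.exp`. The `softmax` helper and the toy logits are illustrative and not part of the notebook.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row max keeps np.exp from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy two-class logits: one extreme row, one confident negative, one tie
toy_logits = np.array([[1000.0, 1001.0],
                       [2.0, -3.0],
                       [0.0, 0.0]])
probs = softmax(toy_logits)
pos_prob = probs[:, 1]                  # probability of label 1
preds = (pos_prob >= 0.5).astype(int)   # a tie at exactly 0.5 counts as positive
print(preds)  # [1 0 1]
```

Every row of `probs` sums to 1 and stays finite, even though `np.exp(1001.0)` on its own overflows to `inf`.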


## 4. Define Model Training Routine

We create a helper `run_model()` that:

1. Loads the chosen pretrained ModernBERT tokenizer & model  
2. Tokenizes text up to a maximum sequence length  
3. Trains for a configurable number of epochs  
4. Logs key metrics on the held-out test split after each epoch

The function returns a dictionary of performance statistics for easy comparison.


In [33]:
def run_model(model_name:str, max_len=768, epochs=3, batch=16, lr=2e-5, fp16=True):
    tok=AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)

    def tokenize(batch):
        return tok(batch["text"], max_length=max_len, truncation=True)
    
    
    # Tokenize each split. Iterating over items() works for a plain dict of
    # Datasets and for a DatasetDict, which is itself a dict subclass.
    enc = {split: d.map(tokenize, batched=True, remove_columns=["text"])
           for split, d in ds.items()}
    train_ds = enc["train"]
    # Compare against None explicitly: an empty Dataset is falsy, so `or` could skip it.
    eval_ds = enc.get("test")
    if eval_ds is None:
        eval_ds = enc.get("validation")
    if eval_ds is None:
        raise ValueError("No eval split found in ds; expected 'test' or 'validation'.")

    
    model=AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    args=TrainingArguments(
        output_dir=f"../reports/doc_cls_hard_{model_name.replace('/','_')}",
        per_device_train_batch_size=batch,
        per_device_eval_batch_size=batch,
        num_train_epochs=epochs,
        learning_rate=lr,
        weight_decay=0.05,
        lr_scheduler_type='cosine',
        warmup_ratio=0.1,
        fp16=fp16,
        logging_steps=50,
        eval_strategy='epoch',
        save_strategy='no',
        label_smoothing_factor=0.05,
        report_to=[],
        seed=SEED
    )
    import time
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        tokenizer=tok,  # newer transformers releases prefer `processing_class=tok`
        compute_metrics=compute_metrics,
    )
    t0 = time.time()
    trainer.train()
    dur = time.time() - t0
    metrics = trainer.evaluate()
    metrics['seconds'] = dur  # wall-clock training time, excluding the final evaluation
    return metrics


## 5. Train and Compare Models

We benchmark two architectures:

| Model | Description |
|--------|--------------|
| **BioClinical ModernBERT** | Domain-adapted version trained on biomedical and clinical corpora (PubMed, MIMIC) |
| **ModernBERT (vanilla)** | General English version trained on large web text |

Both are fine-tuned for binary sequence classification on our synthetic ADE dataset.


In [34]:
results={}
for name, mn in [("BioClinical ModernBERT", "thomas-sounack/BioClinical-ModernBERT-base"),
                 ("ModernBERT (vanilla)", "answerdotai/ModernBERT-base")]:
    print(f"\n==== Training {name}: {mn} ====")
    results[name]=run_model(mn, max_len=512, epochs=10, batch=32, lr=1e-5)
results



==== Training BioClinical ModernBERT: thomas-sounack/BioClinical-ModernBERT-base ====


Map: 100%|██████████| 6464/6464 [00:00<00:00, 28757.36 examples/s]
Map: 100%|██████████| 1536/1536 [00:00<00:00, 22951.05 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at thomas-sounack/BioClinical-ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Auroc,Auprc
1,0.3155,0.3296,0.945312,0.946085,0.952462,0.940787
2,0.2678,0.274074,0.953776,0.954863,0.95278,0.949004
3,0.2866,0.26031,0.953776,0.954806,0.954908,0.948426
4,0.2362,0.262458,0.954427,0.955471,0.950197,0.943677


KeyboardInterrupt: 

## 6. Results Summary

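Below we summarize final evaluation metrics after fine-tuning.

The table reports perfect scores (**1.0**) for both models. Its cell (`In [20]`) was last executed before the training cell above (`In [34]`), whose run was interrupted after four epochs with BioClinical ModernBERT at roughly 0.95 accuracy and F1, so the table does not reflect that run. Scores of 1.0 indicate that the task was relatively easy for these models, likely because
the notes contained strong lexical cues ("after starting", "denies", "no evidence of", etc.)
that the models can exploit directly.

Successive versions of the dataset (`hard_v2`, `hard_v3`, `hard_v4`) progressively remove such shortcuts
to test deeper reasoning and contextual understanding.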


In [20]:
pd.DataFrame(results).T[['eval_accuracy','eval_f1','eval_auroc','eval_auprc','eval_loss','seconds']]


Unnamed: 0,eval_accuracy,eval_f1,eval_auroc,eval_auprc,eval_loss,seconds
BioClinical ModernBERT,1.0,1.0,1.0,1.0,0.000944,30.437208
ModernBERT (vanilla),1.0,1.0,1.0,1.0,0.000246,29.71271


In [19]:
print(df.groupby('split')['label'].value_counts(normalize=True))


split  label
test   0        0.8250
       1        0.1750
train  0        0.8375
       1        0.1625
Name: proportion, dtype: float64
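
With the proportions printed above, a classifier that always predicts the majority class (label 0) already reaches 0.825 accuracy on the test split, so accuracy is best read against that floor. The sketch below computes this floor from a label column; the `majority_baseline` helper and the toy series are illustrative and not part of the notebook.

```python
import pandas as pd

def majority_baseline(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(labels.value_counts(normalize=True).max())

# Toy labels with the same 82.5% / 17.5% ratio as the test split printed above
toy_labels = pd.Series([0] * 33 + [1] * 7)
floor = majority_baseline(toy_labels)
print(f"majority-class accuracy: {floor:.3f}")  # majority-class accuracy: 0.825
```

On the real data the call would be `majority_baseline(df[df['split'] == 'test']['label'])`, assuming the merged `df` from Section 1.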


In [None]:
# notes = pd.read_csv(f"{DATA_DIR}/notes_hard_v4.csv")
# labs  = pd.read_csv(f"{DATA_DIR}/doc_labels_hard_v4.csv")
# df = notes.merge(labs[['doc_id','split','label']], on='doc_id')

CUES  = ["after starting", "following initiation", "soon after",
         "temporal association", "shortly post-initiation", "in close proximity"]
NEGS  = ["no evidence of", "denies", "without signs of", "not ", "unlikely to"]

def keyword_baseline(text):
    t = text.lower()
    cue = any(c in t for c in CUES)
    neg = any(n in t for n in NEGS)
    return int(cue and not neg)   # 1 = positive guess, else 0

for split in ["train","test"]:
    p = df[df["split"] == split]   # bracket access avoids clashes with DataFrame attributes
    preds = p["text"].map(keyword_baseline).values
    acc = (preds == p["label"].values).mean()
    print(f"{split} keyword baseline accuracy: {acc:.3f}")
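
The substring test in `keyword_baseline` also fires inside longer words: `"not "` matches `"cannot "`, for example. A possible variant using word-boundary regular expressions is sketched below. The `keyword_baseline_re` name and the shortened cue lists are illustrative, and whether a phrase like "cannot be excluded" should count as a negation is a design decision this sketch does not settle.

```python
import re

# Shortened, illustrative phrase lists (the notebook's CUES / NEGS are longer)
CUES = ["after starting", "following initiation", "soon after"]
NEGS = ["no evidence of", "denies", "without signs of", "not", "unlikely to"]

def _compile(phrases):
    """Build one case-insensitive pattern that matches any phrase as whole words."""
    body = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)

CUE_RE = _compile(CUES)
NEG_RE = _compile(NEGS)

def keyword_baseline_re(text: str) -> int:
    """Return 1 if a causal cue is present and no negation phrase is, else 0."""
    has_cue = CUE_RE.search(text) is not None
    has_neg = NEG_RE.search(text) is not None
    return int(has_cue and not has_neg)

examples = [
    "AKI developed soon after lisinopril was begun.",
    "Soon after the drug was begun, there was no evidence of AKI.",
    "Creatinine rose after starting lisinopril; cause cannot be excluded.",
]
print([keyword_baseline_re(t) for t in examples])  # [1, 0, 1]
```

The third example is where the two versions differ: the substring rule sees `"not "` inside `"cannot "` and returns 0, while the word-boundary rule does not.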
