# BioClinical ModernBERT vs ModernBERT

## ADE Document Classification (Synthetic)

This notebook compares **BioClinical ModernBERT** with vanilla **ModernBERT** on a synthetic clinical notes dataset. We build *silver* (rule-based) labels for adverse drug event (ADE) presence: a note is positive when it mentions an ADE term and contains no negation cue. We then fine-tune each model briefly and compare metrics and training time.

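**Data:** `data/synth_clinical/notes.csv` (the code below looks in `../data/synth_clinical` relative to the notebook unless the `DATA_DIR` environment variable is set)

**Task:** Binary classification: ADE present (1) vs. absent (0).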

In [4]:
import os
import random

import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, accuracy_score
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Seed the Python and NumPy RNGs so the split below is repeatable.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# DATA_DIR overrides the default location of the synthetic notes.
DATA_PATH = os.environ.get("DATA_DIR", "../data/synth_clinical")
NOTES_CSV = os.path.join(DATA_PATH, "notes.csv")
assert os.path.exists(NOTES_CSV), f"Missing notes.csv at {NOTES_CSV}"
print("Using data:", NOTES_CSV)

Using data: ../data/synth_clinical/notes.csv


In [2]:
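# Build silver labels: ADE term present AND no negation cue anywhere in the note.
# Caveats: matching is plain substring search, so "aki" also matches inside longer
# words (e.g. "taking"), and one negation cue anywhere in the note forces label 0
# even when it refers to something other than the ADE term.
terms = ["acute kidney injury", "aki", "hypotension", "cough", "angioedema", "hyperkalemia"]
negs = ["denies", "no signs of", "not consistent with", "without evidence of", "rule out"]

def silver_label(text: str) -> int:
    t = str(text).lower()
    has_term = any(w in t for w in terms)   # any ADE term as a substring
    has_neg = any(n in t for n in negs)     # any negation cue as a substring
    return int(has_term and not has_neg)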

df = pd.read_csv(NOTES_CSV)
df['label'] = df['text'].apply(silver_label).astype(int)
print(df['label'].value_counts())
df.head(2)

label
0    544
1     56
Name: count, dtype: int64


| | doc_id | hadm_id | charttime | category | text | label |
|---|---|---|---|---|---|---|
| 0 | D000000 | H000001 | 2023-06-14 12:00:00 | Progress note | Patient admitted to ICU with SOFA score of 7. ... | 0 |
| 1 | D000001 | H000002 | 2023-09-09 04:00:00 | Discharge summary | Patient admitted to ICU with SOFA score of 0. ... | 0 |
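
The labeling cell uses substring matching and treats a negation cue anywhere in the note as negating every ADE term. One possible refinement is sketched below. It assumes word-boundary matching and a crude sentence split on `.`, `!`, `?`, and newlines. The function name `silver_label_scoped` is hypothetical, and this variant was not run on the notebook's data, so the label counts above do not reflect it.

```python
import re

# Same term and cue lists as the labeling cell.
TERMS = ["acute kidney injury", "aki", "hypotension", "cough", "angioedema", "hyperkalemia"]
NEGS = ["denies", "no signs of", "not consistent with", "without evidence of", "rule out"]

# \b anchors keep "aki" from matching inside words such as "taking".
TERM_RE = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in TERMS) + r")\b")
NEG_RE = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in NEGS) + r")\b")

def silver_label_scoped(text: str) -> int:
    """Hypothetical variant: 1 if some sentence has an ADE term and no negation cue."""
    # Assumption: splitting on ., !, ? and newlines is a good enough sentence boundary.
    for sentence in re.split(r"[.!?\n]+", str(text).lower()):
        if TERM_RE.search(sentence) and not NEG_RE.search(sentence):
            return 1
    return 0

print(silver_label_scoped("Patient is taking lisinopril."))                       # 0
print(silver_label_scoped("Denies chest pain. Developed hyperkalemia on day 2."))  # 1
print(silver_label_scoped("No signs of angioedema."))                             # 0
```

The trade-offs differ from the original heuristic: inflected forms such as "coughing" no longer match, and the sentence split also breaks on decimal points, so this is a starting point and not a replacement for a real negation detector.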


In [5]:
ds = Dataset.from_pandas(df[['text', 'label']])
ds = ds.train_test_split(test_size=0.2, seed=SEED)
len(ds['train']), len(ds['test'])

(480, 120)
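
With only 56 positives out of 600 notes, a purely random 80/20 split does not guarantee that the test set keeps the same positive rate. A stratified split fixes the per-class proportions. The dependency-free sketch below is an illustration only: `stratified_split` is a hypothetical helper, not part of the notebook, and the toy label list simply mirrors the class counts printed earlier.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at roughly test_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)                       # shuffle within the class only
        n_test = round(len(idx) * test_size)   # per-class test quota
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same class counts as the silver labels (544 negative, 56 positive).
labels = [0] * 544 + [1] * 56
train_idx, test_idx = stratified_split(labels)
n_pos_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_pos_test)  # 480 120 11
```

On the unsplit `Dataset`, index lists like these could be applied with `Dataset.select`. Where scikit-learn is available, `train_test_split` with its `stratify` argument is the more common way to get the same effect.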