## Transformer-Based Experiment: Using `roberta-base`

This notebook evaluates the performance of the `roberta-base` model for multiclass classification of primary progressive aphasia (PPA) subtypes using clinical interview transcripts.

### Objective

To benchmark a transformer model, fine-tuned directly for text classification, against traditional machine learning pipelines.

### Preventing Data Leakage

To ensure valid evaluation, 5-fold **Group K-Fold cross-validation** (scikit-learn's `GroupKFold`, grouped by `SubjectID`) is applied:

- Each participant (`SubjectID`) appears in only one fold.
- This ensures that no data from the same individual is present in both training and testing sets, preventing data leakage and overestimation of performance.
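
The grouping guarantee can be checked directly. The following is a minimal sketch on hypothetical toy data (the subject IDs and the integer stand-in for the text column are illustrative assumptions, not the study data); it confirms that no participant appears on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 10 utterances from 5 participants (2 each).
subject_ids = np.array(["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"])
utterances = np.arange(len(subject_ids))  # stand-in for the text column

splitter = GroupKFold(n_splits=5)
overlaps = []
for train_idx, test_idx in splitter.split(utterances, groups=subject_ids):
    train_subjects = set(subject_ids[train_idx])
    test_subjects = set(subject_ids[test_idx])
    # Participants present in both the training and the test side of this split.
    overlaps.append(train_subjects & test_subjects)

# True when every intersection is empty, i.e. no participant leaks across a split.
print(all(len(o) == 0 for o in overlaps))
```

A plain `KFold` on the same rows would split by utterance and could place one participant's utterances in both sets, which is the leakage this strategy avoids.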

### Experiment Details

- **Model**: `roberta-base` from Hugging Face Transformers
- **Tokenization**: Applied using `AutoTokenizer` with truncation, padding, and a maximum length of 128 tokens
- **Training**:
  - Optimizer: AdamW (learning rate 2e-5)
  - Epochs: 10 per fold
  - Batch size: 16
- **Evaluation Metrics** (computed on the held-out fold):
  - F1-score (weighted)
  - Balanced Accuracy
  - Precision (weighted)
  - Recall (weighted)
  - Hamming Loss
  - AUC (One-vs-Rest multiclass setting, weighted)
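
Several of these metrics are closely related when every sample has exactly one label. The sketch below, written in plain Python on hypothetical labels, shows that support-weighted recall equals accuracy, that Hamming loss is one minus accuracy, and that balanced accuracy is the unweighted mean of per-class recall. The label values are illustrative assumptions only.

```python
from collections import Counter

# Hypothetical labels for six utterances over three classes (0, 1, 2).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

support = Counter(y_true)  # number of true samples per class


def recall_for(cls):
    """Fraction of samples of class `cls` that were predicted as `cls`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return hits / support[cls]


per_class_recall = {cls: recall_for(cls) for cls in support}
n = len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
# Recall averaged with class-frequency weights.
weighted_recall = sum(per_class_recall[c] * support[c] for c in support) / n
# Recall averaged with equal weight per class.
balanced_accuracy = sum(per_class_recall.values()) / len(per_class_recall)
# Fraction of misclassified samples.
hamming = sum(t != p for t, p in zip(y_true, y_pred)) / n

print(accuracy, weighted_recall, balanced_accuracy, hamming)
```

This is why the reported Recall and Hamming Loss sum to 1 in every fold, and why Balanced Accuracy can sit well below weighted Recall when minority classes are harder to classify.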

### Dataset Description

The dataset contains transcribed utterances labeled by subtype. It includes four target classes:

- Logopenic Variant (lvPPA)
- Semantic Variant (svPPA)
- Nonfluent Variant (nfvPPA)
- Healthy Controls

Each entry is associated with:
- `SubjectID` (participant ID)
- `Text` (utterance)
- `Subtype` (target label)
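
As a rough illustration of this layout, the sketch below builds a small hypothetical frame with the same three columns and encodes the labels the way the notebook does with `LabelEncoder`. The rows, subject IDs, label strings, and the extra `SubtypeId` column are assumptions for illustration; the notebook itself overwrites `Subtype` in place.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; the values are placeholders, not taken from the real data.
df = pd.DataFrame({
    "SubjectID": ["s1", "s1", "s2", "s3"],
    "Text": ["utterance a", "utterance b", "utterance c", "utterance d"],
    "Subtype": ["lvPPA", "lvPPA", "svPPA", "nfvPPA"],
})

encoder = LabelEncoder()
df["SubtypeId"] = encoder.fit_transform(df["Subtype"].tolist())

# classes_ holds the original names in sorted order; position i is integer label i.
id_to_name = dict(enumerate(encoder.classes_))
print(id_to_name)
```

Keeping the fitted encoder (or `id_to_name`) around makes it possible to map predicted integers back to subtype names when inspecting per-class results.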

### Output

The notebook prints:

- Fold-wise performance metrics
- Averaged scores across all five folds
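
The averaging step can be sketched as follows. The F1 values are the fold-wise scores from this notebook's output; the AUC list is hypothetical and only demonstrates why `np.nanmean` is used for a metric that may be undefined in a fold. The standard deviation is an addition for illustration and is not printed by the notebook.

```python
import numpy as np

# Weighted F1 per fold, as reported in this notebook's output.
f1_per_fold = [0.6093, 0.5764, 0.5363, 0.5929, 0.5310]
# Hypothetical AUC list with one undefined fold, to show the NaN handling.
auc_per_fold = [0.80, np.nan, 0.78]

mean_f1 = np.mean(f1_per_fold)
std_f1 = np.std(f1_per_fold, ddof=1)  # sample standard deviation across folds
mean_auc = np.nanmean(auc_per_fold)   # skips the NaN fold instead of returning NaN

print(f"F1: {mean_f1:.4f} (std {std_f1:.4f})")
print(f"AUC: {mean_auc:.4f}")
```

Reporting the spread alongside the mean gives a sense of how much the score depends on which participants land in the held-out fold.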

### Notes

This experiment complements the others in the study: here the transformer is fine-tuned end to end, rather than used as a feature extractor.

In [14]:
import pandas as pd
import io
import os

In [1]:
# import data here

In [16]:
import random

import numpy as np
import torch
from sklearn.metrics import (
    balanced_accuracy_score,
    f1_score,
    hamming_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import LabelEncoder
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer


In [17]:
seed_value = 42
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
np.random.seed(seed_value)
random.seed(seed_value)

In [18]:
label_encoder = LabelEncoder()
df['Subtype'] = label_encoder.fit_transform(df['Subtype'])

In [19]:
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

In [20]:
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        text = str(self.texts[index])
        label = self.labels[index]
        # call the tokenizer directly (`encode_plus` is deprecated);
        # special tokens are added by default
        encoding = self.tokenizer(
            text,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        # return_tensors='pt' yields shape (1, max_len); flatten to (max_len,)
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

MAX_LEN = 128
# dataset over the full corpus; the cross-validation loop below builds its own per-fold datasets
dataset = TextDataset(df['Text'].to_numpy(), df['Subtype'].to_numpy(), tokenizer, MAX_LEN)


In [21]:
# define the device for computations
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [22]:
# initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

# define parameters
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 10

# define 5-fold cross-validation, grouped by participant so no SubjectID spans train and validation
groups = df['SubjectID']
kf = GroupKFold(n_splits=5)

# initialize metrics storage
f1_scores = []
balanced_accuracies = []
precisions = []
recalls = []
hamming_losses = []
auc_scores = []

# perform cross-validation
for fold, (train_index, val_index) in enumerate(kf.split(df, groups=groups)):
    print(f"\nFold {fold + 1}")

    # split the data for the current fold
    train_texts, val_texts = df.iloc[train_index]['Text'], df.iloc[val_index]['Text']
    train_labels, val_labels = df.iloc[train_index]['Subtype'], df.iloc[val_index]['Subtype']

    # create datasets and dataloaders
    train_dataset = TextDataset(train_texts.to_numpy(), train_labels.to_numpy(), tokenizer, MAX_LEN)
    val_dataset = TextDataset(val_texts.to_numpy(), val_labels.to_numpy(), tokenizer, MAX_LEN)
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

    # re-initialize model and optimizer from the pretrained checkpoint for each fold,
    # so nothing learned on earlier folds carries over
    model = AutoModelForSequenceClassification.from_pretrained(
        "FacebookAI/roberta-base", num_labels=df['Subtype'].nunique()
    )
    model = model.to(device)
    optimizer = AdamW(model.parameters(), lr=2e-5)

    # training loop
    for epoch in range(EPOCHS):
        model.train()
        total_loss = 0
        for batch in tqdm(train_loader):
            optimizer.zero_grad()
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

            loss.backward()
            optimizer.step()

        print(f"Epoch {epoch + 1}, Loss: {total_loss / len(train_loader)}")

    # evaluate on the held-out fold after the final epoch
    model.eval()
    true_labels = []
    pred_labels = []
    probabilities = []

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            probs = torch.softmax(logits, dim=-1)  # get probabilities for AUC

            true_labels.extend(labels.cpu().numpy())
            pred_labels.extend(predictions.cpu().numpy())
            probabilities.extend(probs.cpu().numpy())

    # calculate metrics for this fold
    f1 = f1_score(true_labels, pred_labels, average='weighted')
    balanced_acc = balanced_accuracy_score(true_labels, pred_labels)
    precision = precision_score(true_labels, pred_labels, average='weighted')
    recall = recall_score(true_labels, pred_labels, average='weighted')
    hamming = hamming_loss(true_labels, pred_labels)

    # calculate AUC (one-vs-rest for multiclass)
    try:
        auc = roc_auc_score(
            true_labels, probabilities, multi_class='ovr', average='weighted'
        )
    except ValueError:
        auc = np.nan  # handle edge cases where AUC is undefined

    # append metrics for this fold
    f1_scores.append(f1)
    balanced_accuracies.append(balanced_acc)
    precisions.append(precision)
    recalls.append(recall)
    hamming_losses.append(hamming)
    auc_scores.append(auc)

    print(
        f"Fold {fold + 1} - F1-Score: {f1:.4f}, Balanced Accuracy: {balanced_acc:.4f}, "
        f"Precision: {precision:.4f}, Recall: {recall:.4f}, Hamming Loss: {hamming:.4f}, AUC: {auc:.4f}"
    )

# calculate and print the average metrics across all folds
avg_f1 = np.mean(f1_scores)
avg_balanced_acc = np.mean(balanced_accuracies)
avg_precision = np.mean(precisions)
avg_recall = np.mean(recalls)
avg_hamming = np.mean(hamming_losses)
avg_auc = np.nanmean(auc_scores)

print("\n5-Fold Cross-Validation Results:")
print(f"Average F1-Score: {avg_f1:.4f}")
print(f"Average Balanced Accuracy: {avg_balanced_acc:.4f}")
print(f"Average Precision: {avg_precision:.4f}")
print(f"Average Recall: {avg_recall:.4f}")
print(f"Average Hamming Loss: {avg_hamming:.4f}")
print(f"Average AUC: {avg_auc:.4f}")



Fold 1


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 114/114 [02:06<00:00,  1.11s/it]


Epoch 1, Loss: 1.234344541503672


100%|██████████| 114/114 [02:05<00:00,  1.10s/it]


Epoch 2, Loss: 1.031129674691903


100%|██████████| 114/114 [02:08<00:00,  1.13s/it]


Epoch 3, Loss: 0.8680872206102338


100%|██████████| 114/114 [02:09<00:00,  1.14s/it]


Epoch 4, Loss: 0.7437477281741929


100%|██████████| 114/114 [02:09<00:00,  1.13s/it]


Epoch 5, Loss: 0.6202081637947183


100%|██████████| 114/114 [02:11<00:00,  1.15s/it]


Epoch 6, Loss: 0.4962686546016158


100%|██████████| 114/114 [02:08<00:00,  1.13s/it]


Epoch 7, Loss: 0.4185000001207778


100%|██████████| 114/114 [02:08<00:00,  1.12s/it]


Epoch 8, Loss: 0.3804126736430222


100%|██████████| 114/114 [02:08<00:00,  1.13s/it]


Epoch 9, Loss: 0.33000585989078934


100%|██████████| 114/114 [02:05<00:00,  1.10s/it]


Epoch 10, Loss: 0.2744802177409854
Fold 1 - F1-Score: 0.6093, Balanced Accuracy: 0.5180, Precision: 0.7020, Recall: 0.5933, Hamming Loss: 0.4067, AUC: 0.8451

Fold 2


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 1, Loss: 1.17694117432147


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 2, Loss: 0.9451387798891658


100%|██████████| 113/113 [02:08<00:00,  1.14s/it]


Epoch 3, Loss: 0.7852349502850423


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 4, Loss: 0.6613738075294325


100%|██████████| 113/113 [02:07<00:00,  1.12s/it]


Epoch 5, Loss: 0.5372118545053279


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 6, Loss: 0.4106873018024242


100%|██████████| 113/113 [02:09<00:00,  1.14s/it]


Epoch 7, Loss: 0.36205054644857887


100%|██████████| 113/113 [02:10<00:00,  1.16s/it]


Epoch 8, Loss: 0.3075812291518777


100%|██████████| 113/113 [02:10<00:00,  1.15s/it]


Epoch 9, Loss: 0.29598533765998036


100%|██████████| 113/113 [02:09<00:00,  1.14s/it]


Epoch 10, Loss: 0.26378823905787635
Fold 2 - F1-Score: 0.5764, Balanced Accuracy: 0.5139, Precision: 0.5938, Recall: 0.5724, Hamming Loss: 0.4276, AUC: 0.8035

Fold 3


100%|██████████| 113/113 [02:09<00:00,  1.14s/it]


Epoch 1, Loss: 1.1802423485612448


100%|██████████| 113/113 [02:07<00:00,  1.13s/it]


Epoch 2, Loss: 0.9450128099559683


100%|██████████| 113/113 [02:10<00:00,  1.15s/it]


Epoch 3, Loss: 0.7811068995336515


100%|██████████| 113/113 [02:09<00:00,  1.15s/it]


Epoch 4, Loss: 0.6644550242782694


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 5, Loss: 0.5524008801553102


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 6, Loss: 0.4442527854600839


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 7, Loss: 0.367040793320774


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 8, Loss: 0.3198084186729604


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 9, Loss: 0.27609268618764077


100%|██████████| 113/113 [02:06<00:00,  1.12s/it]


Epoch 10, Loss: 0.26456744953231737
Fold 3 - F1-Score: 0.5363, Balanced Accuracy: 0.4951, Precision: 0.5384, Recall: 0.5473, Hamming Loss: 0.4527, AUC: 0.7845

Fold 4


100%|██████████| 114/114 [02:07<00:00,  1.12s/it]


Epoch 1, Loss: 1.2064670235441441


100%|██████████| 114/114 [02:04<00:00,  1.10s/it]


Epoch 2, Loss: 0.9905226207093188


100%|██████████| 114/114 [02:04<00:00,  1.09s/it]


Epoch 3, Loss: 0.7975142172031235


100%|██████████| 114/114 [02:05<00:00,  1.10s/it]


Epoch 4, Loss: 0.6492735327858674


100%|██████████| 114/114 [02:05<00:00,  1.10s/it]


Epoch 5, Loss: 0.5459830175366318


100%|██████████| 114/114 [02:05<00:00,  1.10s/it]


Epoch 6, Loss: 0.4253568646024194


100%|██████████| 114/114 [02:06<00:00,  1.11s/it]


Epoch 7, Loss: 0.3731549193331024


100%|██████████| 114/114 [02:05<00:00,  1.10s/it]


Epoch 8, Loss: 0.29726220816863996


100%|██████████| 114/114 [02:05<00:00,  1.10s/it]


Epoch 9, Loss: 0.3230994434275648


100%|██████████| 114/114 [02:06<00:00,  1.11s/it]


Epoch 10, Loss: 0.24566557550835505
Fold 4 - F1-Score: 0.5929, Balanced Accuracy: 0.4971, Precision: 0.5983, Recall: 0.6084, Hamming Loss: 0.3916, AUC: 0.8143

Fold 5


100%|██████████| 114/114 [02:07<00:00,  1.12s/it]


Epoch 1, Loss: 1.1692485160994948


100%|██████████| 114/114 [02:12<00:00,  1.17s/it]


Epoch 2, Loss: 0.9286467064368097


100%|██████████| 114/114 [02:14<00:00,  1.18s/it]


Epoch 3, Loss: 0.7706947441686663


100%|██████████| 114/114 [02:12<00:00,  1.17s/it]


Epoch 4, Loss: 0.6529102613005722


100%|██████████| 114/114 [02:12<00:00,  1.16s/it]


Epoch 5, Loss: 0.5227076473988985


100%|██████████| 114/114 [02:12<00:00,  1.16s/it]


Epoch 6, Loss: 0.4379646831698585


100%|██████████| 114/114 [02:12<00:00,  1.16s/it]


Epoch 7, Loss: 0.32158186440274383


100%|██████████| 114/114 [02:12<00:00,  1.16s/it]


Epoch 8, Loss: 0.28701276193258535


100%|██████████| 114/114 [02:13<00:00,  1.17s/it]


Epoch 9, Loss: 0.2564951217439222


100%|██████████| 114/114 [02:12<00:00,  1.17s/it]


Epoch 10, Loss: 0.20498949174948952
Fold 5 - F1-Score: 0.5310, Balanced Accuracy: 0.5255, Precision: 0.5464, Recall: 0.5457, Hamming Loss: 0.4543, AUC: 0.7731

5-Fold Cross-Validation Results:
Average F1-Score: 0.5692
Average Balanced Accuracy: 0.5099
Average Precision: 0.5958
Average Recall: 0.5734
Average Hamming Loss: 0.4266
Average AUC: 0.8041
