# **Approach 2: Deep Learning Baseline (DeBERTa)**

In this notebook, we implement **Approach 2**, which utilizes a pre-trained Transformer model (**DeBERTa-v3**) fine-tuned on the RSD-15K dataset. Unlike the feature-based XGBoost baseline (Approach 1), this model learns contextual representations directly from raw text.

---
### **Import & Setup**
We define the following hyperparameters based on standard practices for fine-tuning Transformers on consumer hardware (e.g., Apple Silicon M3 Pro):
* **Model Architecture:** `microsoft/deberta-v3-base` (12 layers, 768 hidden size). We select the 'base' version to balance performance and computational efficiency.
* **Max Sequence Length:** `512` tokens. This covers the majority of social media post lengths.
* **Batch Size:** `16`. Sized to fit comfortably in 36GB+ Unified Memory at a sequence length of 512.
* **Learning Rate:** `2e-5`. A conservative learning rate to prevent catastrophic forgetting during fine-tuning.

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score, classification_report
import sys
import os
from tqdm.auto import tqdm
sys.path.append(os.path.abspath(os.path.join('..')))
from src.utils import compute_graded_metrics


In [2]:
# Configuration
MODEL_NAME = "microsoft/deberta-v3-base" 
MAX_LEN = 512 # limit is 512 tokens
BATCH_SIZE = 16
EPOCHS = 4
LEARNING_RATE = 2e-5
NUM_CLASSES = 4

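# Detect MPS for Mac; fall back to CUDA or CPU so `device` is always defined
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS (Apple Silicon GPU)")
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"MPS not available; using {device}")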

DATA_DIR = '../data/processed'

Using MPS (Apple Silicon GPU)


---
### **Data Preparation: From Text to Tokens**
1.  **Define a Custom Dataset Class (`SuicideDataset`):** This handles the tokenization process, converting raw posts into numerical input IDs and attention masks using the **DeBERTa-v3 tokenizer**.
2.  **Load Processed Data:** We load the 80/10/10 split data prepared in `00_preprocessing.ipynb`.
3.  **Create DataLoaders:** These efficiently batch the data (Batch Size: 16, as set in `BATCH_SIZE`) to feed into the GPU during training.
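
To make the padding and truncation step concrete, here is a minimal sketch of what `padding='max_length'` and `truncation=True` do to a single post. It uses a toy whitespace tokenizer with a hypothetical four-word vocabulary, not the DeBERTa-v3 SentencePiece tokenizer, and it omits the special tokens that the real tokenizer adds. The function name `encode` and the `pad_id`/`unk_id` values are assumptions made for illustration only.

```python
def encode(text, vocab, max_len, pad_id=0, unk_id=1):
    """Toy stand-in for a tokenizer call with padding='max_length', truncation=True."""
    # Map each whitespace-separated word to an id (unknown words -> unk_id)
    ids = [vocab.get(tok, unk_id) for tok in text.lower().split()]
    ids = ids[:max_len]                # truncation: drop everything past max_len
    mask = [1] * len(ids)              # 1 marks a real token
    n_pad = max_len - len(ids)         # number of padding positions needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,   # 0 marks padding
    }

# Hypothetical vocabulary for the illustration
vocab = {"i": 2, "feel": 3, "tired": 4, "today": 5}

short = encode("I feel tired", vocab, max_len=6)
long = encode("I feel tired today I feel tired today", vocab, max_len=6)

print(short)  # padded up to 6 positions
print(long)   # truncated down to 6 positions
```

Every sample therefore has the same length, which is what lets the `DataLoader` stack them into one tensor per batch. The attention mask tells the model which positions are padding and should be ignored.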

In [3]:
# ===== BEGIN: Gemini-generated block =====

class SuicideDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    # Called by the DataLoader to fetch and tokenize one sample by index
    def __getitem__(self, item):
        text = str(self.texts[item])
        label = self.labels[item]

        # Tokenize the text (calling the tokenizer directly replaces the deprecated encode_plus)
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# ===== END: Gemini-generated block =====

# Load Data
train_df = pd.read_pickle(os.path.join(DATA_DIR, 'train.pkl'))
val_df = pd.read_pickle(os.path.join(DATA_DIR, 'val.pkl'))
test_df = pd.read_pickle(os.path.join(DATA_DIR, 'test.pkl'))

# Initialize Tokenizer (DeBERTa-v3 uses SentencePiece)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# ===== BEGIN: Gemini-generated block =====

# Create DataLoaders -> split the data into batches of the configured size
def create_data_loader(df, tokenizer, max_len, batch_size, shuffle=False):
    ds = SuicideDataset(
        texts=df.text.to_numpy(),
        labels=df.label_ordinal.to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    # num_workers=0 is safer for MPS on Mac to avoid multiprocessing errors
    return DataLoader(ds, batch_size=batch_size, shuffle=shuffle, num_workers=0)

# ===== END: Gemini-generated block =====

# Shuffle only the training data; validation/test order stays fixed for reproducible evaluation
train_loader = create_data_loader(train_df, tokenizer, MAX_LEN, BATCH_SIZE, shuffle=True)
val_loader = create_data_loader(val_df, tokenizer, MAX_LEN, BATCH_SIZE)
test_loader = create_data_loader(test_df, tokenizer, MAX_LEN, BATCH_SIZE)

print(f"Train batches: {len(train_loader)}")

Train batches: 749




## **Model Initialization & Training Setup**

1.  **Model Architecture:** We load the pre-trained `microsoft/deberta-v3-base` and add a classification head on top for our 4-class problem.
2.  **Loss Function:** For Approach 2 (Deep Learning Baseline), we strictly adhere to the proposal by using **Standard Cross-Entropy Loss**. This treats the risk levels as independent categories, ignoring their ordinal nature (unlike Approach 3).
3.  **Optimizer:** We use **AdamW** (Adaptive Moment Estimation with Weight Decay), the standard optimizer for Transformer models.
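4.  **Scheduler:** A linear learning-rate schedule decays the rate to zero over the course of training. `get_linear_schedule_with_warmup` also supports a warmup phase to stabilize the early stages of fine-tuning, but this run sets `num_warmup_steps=0`.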
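
The effect of the linear schedule can be sketched in plain Python. The snippet below assumes the usual rule for this kind of schedule: the multiplier ramps linearly from 0 to 1 during warmup, then decays linearly to 0 at the final step. The helper names `linear_schedule_factor` and `lr_at` are hypothetical, and the 300-step warmup is an illustrative value that this notebook does not use. The total of 2,996 steps comes from 749 training batches times 4 epochs.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 at the last training step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

BASE_LR = 2e-5
TOTAL_STEPS = 749 * 4  # batches per epoch * epochs = 2996

def lr_at(step, warmup=0):
    return BASE_LR * linear_schedule_factor(step, warmup, TOTAL_STEPS)

# This notebook's setting (no warmup): training starts at the full rate
for step in (0, 1498, 2996):
    print(f"step {step:>4}  lr = {lr_at(step):.2e}")

# Illustrative alternative with 300 warmup steps (an assumption, not used here)
for step in (0, 150, 300):
    print(f"step {step:>4}  lr = {lr_at(step, warmup=300):.2e}")
```

With zero warmup the very first update already uses the full `2e-5`, so the schedule only provides decay. A non-zero warmup would instead start from a rate of 0 and reach `2e-5` at the end of the warmup phase.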

In [4]:
# ===== BEGIN: Gemini-generated block =====

# --- 1. Initialize Model ---
# Load DeBERTa-v3 with a classification head for 4 output classes
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_CLASSES)
model = model.to(device) # Move model to the selected device

# --- 2. Optimization Setup ---
# AdamW is generally preferred for Transformers over standard SGD
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

# Learning Rate Scheduler
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,         # No warmup steps needed for this dataset size
    num_training_steps=total_steps
)

# --- 3. Loss Function ---
# standard Cross Entropy Loss
loss_fn = nn.CrossEntropyLoss().to(device)

# --- 4. Training Helper Functions ---

def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
    """
    Runs one full pass over the training data.
    """
    model = model.train() # Set model to training mode
    losses = []
    correct_predictions = 0

    # Iterate over the mini-batches produced by the DataLoader
    for d in tqdm(data_loader, desc="Training Batch"):
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["labels"].to(device)

        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        
        # DeBERTa outputs 'logits' (raw scores before Softmax)
        logits = outputs.logits
        
        # Calculate Loss
        loss = loss_fn(logits, targets)
        
        # Calculate Accuracy for monitoring
        _, preds = torch.max(logits, dim=1)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())

        # Backward pass (backpropagation computes the gradients)
        loss.backward()
        
        # Gradient Clipping (Prevents "exploding gradients" in deep networks)
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()   # Update weights
        scheduler.step()   # Update learning rate
        optimizer.zero_grad() # Reset gradients

    # .item() returns a plain Python number, so the history holds floats rather than device tensors
    return correct_predictions.item() / n_examples, np.mean(losses)

def eval_model(model, data_loader, loss_fn, device, n_examples):
    """
    Evaluates the model on validation/test data (No gradient updates).
    """
    model = model.eval() # Set model to evaluation mode
    losses = []
    correct_predictions = 0

    with torch.no_grad(): # Disable gradient calculation for speed
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["labels"].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            logits = outputs.logits
            loss = loss_fn(logits, targets)

            _, preds = torch.max(logits, dim=1)
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())

    # .item() returns a plain Python number, so the history holds floats rather than device tensors
    return correct_predictions.item() / n_examples, np.mean(losses)

# ===== END: Gemini-generated block =====

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## **Model Training**

We train for a fixed **4 epochs**, iterating over the full training set in each one.

1.  **Training & Validation:** For each epoch, the model updates its weights on the training set and then evaluates its performance on the validation set.
2.  **Monitoring:** We track loss and accuracy history to visualize learning progress.
3.  **Model Checkpointing (Early Saving):** Instead of simply using the final model (which might be overfitted), we automatically save the model state (`best_deberta_model.bin`) whenever it achieves a new high score in **Validation Accuracy**. This ensures we use the most generalizable version of the model for testing.
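
The checkpointing rule can be traced by hand. The sketch below replays the same "save only on a new best" comparison over the four validation accuracies logged by this run, without touching the model. The variable names `val_acc_by_epoch` and `best_epoch` are introduced here for illustration.

```python
# Validation accuracy per epoch, copied from this run's training log
val_acc_by_epoch = [0.7514, 0.7389, 0.7508, 0.7477]

best_accuracy = 0.0
best_epoch = None

for epoch, val_acc in enumerate(val_acc_by_epoch, start=1):
    # Same rule as the training loop: keep a checkpoint only on strict improvement
    if val_acc > best_accuracy:
        best_accuracy = val_acc
        best_epoch = epoch
        print(f"Epoch {epoch}: new best ({val_acc:.4f}), checkpoint would be saved")
    else:
        print(f"Epoch {epoch}: no improvement ({val_acc:.4f})")

print(f"Checkpoint used for testing: epoch {best_epoch}")
```

For these values only epoch 1 triggers a save, so the test set is evaluated with the epoch-1 weights. Later epochs lower the training loss but never beat that validation accuracy.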

In [5]:
# ===== BEGIN: Gemini-generated block =====

# Store training history for plotting
history = {'train_acc': [], 'train_loss': [], 'val_acc': [], 'val_loss': []}

# Initialize best accuracy to save the best model
best_accuracy = 0

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    # --- Training Step ---
    train_acc, train_loss = train_epoch(
        model, 
        train_loader, 
        loss_fn, 
        optimizer, 
        device, 
        scheduler, 
        len(train_df)
    )
    print(f'Train loss {train_loss:.4f} accuracy {train_acc:.4f}')

    # --- Validation Step ---
    val_acc, val_loss = eval_model(
        model, 
        val_loader, 
        loss_fn, 
        device, 
        len(val_df)
    )
    print(f'Val   loss {val_loss:.4f} accuracy {val_acc:.4f}')

    # --- History Tracking ---
    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)

    # --- Model Checkpointing ---
    # Save the model only if validation accuracy improves
    if val_acc > best_accuracy:
        torch.save(model.state_dict(), '../models/best_deberta_model.bin')
        best_accuracy = val_acc
        print("--> Best Model Saved")
        
    print() # Empty line for readability

# ===== END: Gemini-generated block =====

Epoch 1/4
----------


Training Batch: 100%|█████████████████████████| 749/749 [45:45<00:00,  3.67s/it]


Train loss 0.7902 accuracy 0.6896
Val   loss 0.6232 accuracy 0.7514
--> Best Model Saved

Epoch 2/4
----------


Training Batch: 100%|███████████████████████| 749/749 [1:01:05<00:00,  4.89s/it]


Train loss 0.5654 accuracy 0.7836
Val   loss 0.6293 accuracy 0.7389

Epoch 3/4
----------


Training Batch: 100%|███████████████████████| 749/749 [1:00:02<00:00,  4.81s/it]


Train loss 0.4403 accuracy 0.8380
Val   loss 0.6744 accuracy 0.7508

Epoch 4/4
----------


Training Batch: 100%|█████████████████████████| 749/749 [59:18<00:00,  4.75s/it]


Train loss 0.3484 accuracy 0.8791
Val   loss 0.7512 accuracy 0.7477



## **Evaluation**

In [6]:
# ===== BEGIN: Gemini-generated block =====

# Load the best saved model
# map_location loads the saved weights directly onto the active device
model.load_state_dict(torch.load('../models/best_deberta_model.bin', map_location=device))
model = model.to(device)
model.eval()

print("--- Predicting on Test Set ---")
y_pred_list = []
y_true_list = []

with torch.no_grad():
    for d in test_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["labels"].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        _, preds = torch.max(outputs.logits, dim=1)
        
        y_pred_list.extend(preds.cpu().numpy())
        y_true_list.extend(targets.cpu().numpy())

# ===== END: Gemini-generated block =====

# 1. Standard Metrics
acc = accuracy_score(y_true_list, y_pred_list)
print(f"\nSimple Accuracy: {acc:.4f}")

# 2. Detailed Report
target_names = ['Indicator', 'Ideation', 'Behavior', 'Attempt']
print("\nClassification Report:")
print(classification_report(y_true_list, y_pred_list, target_names=target_names))

# 3. Graded Metrics
graded_metrics = compute_graded_metrics(y_true_list, y_pred_list)
print("\n=== Graded Metrics (Approach 2) ===")
print(f"Graded Precision: {graded_metrics['graded_precision']:.4f}")
print(f"Graded Recall:    {graded_metrics['graded_recall']:.4f}")
print(f"Graded F1-Score:  {graded_metrics['graded_f1']:.4f}")

--- Predicting on Test Set ---

Simple Accuracy: 0.6969

Classification Report:
              precision    recall  f1-score   support

   Indicator       0.73      0.71      0.72       305
    Ideation       0.73      0.77      0.75       530
    Behavior       0.54      0.47      0.50       135
     Attempt       0.51      0.53      0.52        66

    accuracy                           0.70      1036
   macro avg       0.63      0.62      0.62      1036
weighted avg       0.69      0.70      0.69      1036


=== Graded Metrics (Approach 2) ===
Graded Precision: 0.8562
Graded Recall:    0.8407
Graded F1-Score:  0.8484
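
As a reading aid for the classification report, the following sketch recomputes the two averaged F1 rows from the per-class F1 scores and supports printed above. Because it starts from values already rounded to two decimals, it reproduces the report only approximately; scikit-learn computes the averages from unrounded scores. The dictionary name `report` is an assumption made for this illustration.

```python
# (F1 score, support) per class, copied from the classification report above
report = {
    "Indicator": (0.72, 305),
    "Ideation":  (0.75, 530),
    "Behavior":  (0.54 * 0 + 0.50, 135),
    "Attempt":   (0.52, 66),
}

total = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of size
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print(f"Total support: {total}")
print(f"Macro F1:      {macro_f1:.2f}")
print(f"Weighted F1:   {weighted_f1:.2f}")
```

The macro average sits below the weighted average because the two smallest classes, Behavior and Attempt, have the lowest F1 scores and carry the same weight as the larger classes in the macro figure.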
