# Approach 3: Time-Aware Ordinal SISMO Model (DeBERTa + LSTM)

This notebook implements the ordinal regression model based on the SISMO framework, extending
a transformer backbone (DeBERTa-v3) with a BiLSTM and an ordinal loss designed for graded
suicide risk detection.

## Overview

Approach 3 introduces an ordinal deep learning model following the SISMO framework
(Sawhney et al.), adapted to the RSD-15K dataset.

Our model integrates:

1. **DeBERTa-v3 Transformer Backbone**  
   Captures contextual semantic information from posts.

2. **BiLSTM Layer**  
   Learns sequential dependencies across the tokens of each post.

3. **Ordinal Regression Loss (SISMO Loss)**  
   Penalizes predictions proportionally to the distance between true and predicted risk level.

4. **Class-Balanced Training**  
   Weights the loss per class to address class imbalance, especially for *Behavior* and *Attempt*.

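Because the loss respects the order of the risk levels, confusing adjacent levels (for example, *Behavior* and *Attempt*) is penalized less than confusing distant ones (for example, *Indicator* and *Attempt*).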
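
To make the distance-proportional penalty concrete, here is a minimal plain-Python sketch of the soft target distribution that the `OrdinalLoss` class later in this notebook builds from the cost matrix `|i - j|`. The helper names `soft_targets` and `soft_cross_entropy` and the example probabilities are illustrative and are not part of `src.loss`; `alpha` controls how sharply the probability mass concentrates on the true level.

```python
import math

LABELS = ["Indicator", "Ideation", "Behavior", "Attempt"]

def soft_targets(true_idx, num_classes=4, alpha=1.0):
    """Softmax over -alpha * |true_idx - j|, one value per class j."""
    scores = [-alpha * abs(true_idx - j) for j in range(num_classes)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(probs, target_dist):
    """Cross-entropy between predicted probabilities and the soft targets."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs))

# Soft targets for a post whose true label is Attempt.
target = soft_targets(LABELS.index("Attempt"), alpha=1.5)

near_miss = [0.05, 0.05, 0.80, 0.10]  # mostly predicts Behavior (distance 1)
far_miss = [0.80, 0.10, 0.05, 0.05]   # mostly predicts Indicator (distance 3)

print("soft targets:", [round(t, 3) for t in target])
print("near-miss loss:", round(soft_cross_entropy(near_miss, target), 3))
print("far-miss loss :", round(soft_cross_entropy(far_miss, target), 3))
```

In this example the near miss receives the smaller loss, which is the behaviour the ordinal loss is meant to encourage.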

In [1]:
MODEL_NAME = "microsoft/deberta-v3-base"   # Backbone model
MAX_LEN = 512                                # Tokenization max length
BATCH_SIZE = 8                               # Training batch size
EPOCHS = 4                                    # Number of training epochs
LEARNING_RATE = 2e-5                          # LR
NUM_CLASSES = 4                               # (Indicator, Ideation, Behavior, Attempt)

print("MODEL_NAME    :", MODEL_NAME)
print("MAX_LEN       :", MAX_LEN)
print("BATCH_SIZE    :", BATCH_SIZE)
print("EPOCHS        :", EPOCHS)
print("LEARNING_RATE :", LEARNING_RATE)
print("NUM_CLASSES   :", NUM_CLASSES)

MODEL_NAME    : microsoft/deberta-v3-base
MAX_LEN       : 512
BATCH_SIZE    : 8
EPOCHS        : 4
LEARNING_RATE : 2e-05
NUM_CLASSES   : 4


## Training Setup

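- **Backbone**: `bert-base-uncased` in the training cells below, which override the `microsoft/deberta-v3-base` value set in the config cell  
- **Max sequence length**: 512  
- **Batch size**: 32 effective (micro-batch of 8 × 4 accumulation steps)  
- **Optimizer**: AdamW (weight decay 0.01)  
- **Learning rate**: 2e-5 (frozen backbone), 1e-5 (full fine-tuning)  
- **Epochs**: 5 (frozen backbone), 4 (full fine-tuning)  
- **Gradient Accumulation**: 4 steps  
- **Scheduler**: Linear warmup (10%)  
- **Device**: Apple MPS or CPU  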
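
The interaction between micro-batches, gradient accumulation, and the warmup schedule can be checked with a little arithmetic. The sketch below is plain Python; the function name `schedule_summary` is hypothetical, and 11,972 is the training split size printed by the data-loading cell.

```python
import math

def schedule_summary(num_examples, micro_batch, accum_steps, epochs, warmup_ratio=0.1):
    """Summarise batch and step counts for a gradient-accumulation setup."""
    batches_per_epoch = math.ceil(num_examples / micro_batch)
    # Total passed to the scheduler in this notebook: float division, then truncated.
    planned_steps = int(batches_per_epoch * epochs / accum_steps)
    # Optimizer steps actually taken when a partial accumulation window
    # at the end of each epoch is flushed with one extra step.
    actual_steps = math.ceil(batches_per_epoch / accum_steps) * epochs
    return {
        "effective_batch": micro_batch * accum_steps,
        "batches_per_epoch": batches_per_epoch,
        "planned_steps": planned_steps,
        "warmup_steps": int(planned_steps * warmup_ratio),
        "actual_steps": actual_steps,
    }

summary = schedule_summary(11972, micro_batch=8, accum_steps=4, epochs=5)
for key, value in summary.items():
    print(f"{key:<18} {value}")
```

With these inputs the planned total is 1871, matching the `Total Steps` line in the training log, while the loop takes 1875 optimizer steps because each epoch ends with a flushed partial window. The last few steps therefore run after the linear schedule has already reached its end.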

### Class Weights  
Because *Attempt* and *Behavior* are under-represented,
we compute class-balanced weights:

| Class     | Count | Weight |
|-----------|--------|--------|
| Indicator | 305 | … |
| Ideation  | 530 | … |
| Behavior  | 135 | ↑ heavier |
| Attempt   | 66  | ↑↑ strongest |
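
As an illustration of how such weights can be derived, the following sketch applies inverse log-frequency weighting to the counts in the table. This is the same formula the fine-tuning cell later in this notebook uses, rewritten in plain Python instead of `torch`; the helper name `log_smoothed_weights` is an assumption made for illustration.

```python
import math

def log_smoothed_weights(counts):
    """Inverse log-frequency weights, rescaled to sum to the number of classes."""
    raw = [1.0 / math.log(c + 1) for c in counts]
    scale = len(counts) / sum(raw)
    return [r * scale for r in raw]

counts = {"Indicator": 305, "Ideation": 530, "Behavior": 135, "Attempt": 66}
weights = log_smoothed_weights(list(counts.values()))

for name, weight in zip(counts, weights):
    print(f"{name:<10} {weight:.3f}")
```

These values agree with the log-smoothed weights printed by the fine-tuning run (about 0.90, 0.82, 1.05, and 1.23). Log smoothing keeps the ratio between the largest and smallest weight near 1.5, whereas plain inverse-frequency weighting on the same counts would give a ratio of about 8.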

In [2]:
# ===== Cell 1: Imports & Load Processed Data =====
import os
import sys
from pathlib import Path

import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel


current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

from src.utils import compute_graded_metrics
from src.loss import OrdinalLoss


PROCESSED_DATA_DIR = os.path.join(parent_dir, 'data', 'processed')
print("PROCESSED_DATA_DIR:", PROCESSED_DATA_DIR)

train_df = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'train.pkl'))
val_df   = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'val.pkl'))
test_df  = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'test.pkl'))

print(f"Train, Val, Test size: {len(train_df)}, {len(val_df)}, {len(test_df)}")
display(train_df.head())


if torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
    print("Using device: MPS (Apple Silicon GPU)")
else:
    DEVICE = torch.device("cpu")
    print("Using device: CPU")


# ---- Micro-batch ----
MICRO_BATCH_SIZE = 8                        
ACCUM_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE 

print(f"Effective batch size = {MICRO_BATCH_SIZE} x {ACCUM_STEPS} = {BATCH_SIZE}")

PROCESSED_DATA_DIR: /Users/serenechien/Desktop/Suicide-Risk-Detection/data/processed
Train, Val, Test size: 11972, 1605, 1036


| Unnamed: 0 | users | text | sentiment | time | timestamp_dt | label_ordinal |
|------------|-------|------|-----------|------|--------------|---------------|
| 0 | 1 | No one understands how much I desperately want... | Ideation | 1648483701 | 2022-03-28 16:08:21 | 1 |
| 1 | 2 | Today I never wanted to live to see 25. That m... | Behavior | 1651130449 | 2022-04-28 07:20:49 | 2 |
| 2 | 3 | Suicidal thoughts at / because of school For s... | Ideation | 1662712545 | 2022-09-09 08:35:45 | 1 |
| 3 | 4 | I feel like the pain will never end Everyday f... | Ideation | 1638628371 | 2021-12-04 14:32:51 | 1 |
| 4 | 4 | Is there even a point to living if you're not ... | Indicator | 1639749228 | 2021-12-17 13:53:48 | 0 |


Using device: MPS (Apple Silicon GPU)
Effective batch size = 8 x 1 = 8


In [3]:
# ===== Cell 2: Tokenizer, Dataset & DataLoaders =====

TEXT_COL = "text"
LABEL_COL = "label_ordinal"

label2id = {
    "Indicator": 0,
    "Ideation": 1,
    "Behavior": 2,
    "Attempt": 3,
}
id2label = {v: k for k, v in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class RSDDataset(Dataset):
    def __init__(self, df, text_col, label_col):
        self.texts = df[text_col].tolist()
        self.labels = df[label_col].tolist()

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = int(self.labels[idx])

        enc = tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=MAX_LEN,
            return_tensors="pt"
        )

        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "label": torch.tensor(label, dtype=torch.long),
        }

train_loader = DataLoader(
    RSDDataset(train_df, TEXT_COL, LABEL_COL),
    batch_size=MICRO_BATCH_SIZE,
    shuffle=True
)

val_loader = DataLoader(
    RSDDataset(val_df, TEXT_COL, LABEL_COL),
    batch_size=MICRO_BATCH_SIZE * 2,
    shuffle=False
)

test_loader = DataLoader(
    RSDDataset(test_df, TEXT_COL, LABEL_COL),
    batch_size=MICRO_BATCH_SIZE * 2,
    shuffle=False
)

batch = next(iter(train_loader))
print("Batch input_ids shape     :", batch["input_ids"].shape)
print("Batch attention_mask shape:", batch["attention_mask"].shape)
print("Batch labels shape        :", batch["label"].shape)

Batch input_ids shape     : torch.Size([8, 512])
Batch attention_mask shape: torch.Size([8, 512])
Batch labels shape        : torch.Size([8])




In [4]:
import torch
from torch import nn
import torch.nn.functional as F
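from transformers import AutoModel, AutoTokenizer, get_linear_schedule_with_warmup
import os

NUM_CLASSES = 4
# NOTE: this overrides the DeBERTa MODEL_NAME from the config cell, so this run uses BERT.
MODEL_NAME = 'bert-base-uncased'
# Reload the tokenizer so token IDs match the backbone's vocabulary
# (the earlier tokenizer was created from the previous MODEL_NAME).
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
EPOCHS = 5
ACCUM_STEPS = 4               # 8 (micro-batch) x 4 = effective batch size of 32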

# ---------------- Device ----------------
if torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
    print("Using Apple Silicon GPU (MPS) for training.")
else:
    DEVICE = torch.device("cpu")
    print("MPS not available. Falling back to CPU.")
    

# ---------------- Ordinal Loss ----------------
class OrdinalLoss(nn.Module):
    """
    Ordinal Regression Loss based on SISMO paper, with class weights.
    """
    def __init__(self, alpha=1.0, num_classes=4, class_weights=None, device='cpu'):
        super().__init__()
        self.alpha = alpha
        self.num_classes = num_classes


        if class_weights is not None:
            if isinstance(class_weights, list):
                class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)
            elif isinstance(class_weights, torch.Tensor):
                class_weights = class_weights.to(device)
            self.register_buffer("class_weights", class_weights)
        else:
            self.class_weights = None

        cost_matrix = torch.zeros((num_classes, num_classes))
        for i in range(num_classes):
            for j in range(num_classes):
                cost_matrix[i, j] = abs(i - j)
        self.register_buffer("cost_matrix", cost_matrix.to(device))

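    def forward(self, logits, targets):
        # Row of the cost matrix for each target: distance from the true class to every class.
        cost = self.cost_matrix[targets]

        # Soft targets: probability mass decays with ordinal distance; alpha sets the sharpness.
        soft_targets = torch.softmax(-self.alpha * cost, dim=1)

        log_probs = F.log_softmax(logits, dim=1)

        # Cross-entropy against the soft targets, one value per sample.
        loss = -torch.sum(soft_targets * log_probs, dim=1)

        # Optionally scale each sample's loss by the weight of its true class.
        if self.class_weights is not None:
            loss = loss * self.class_weights[targets]

        return loss.mean()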


# ---------------- SISMO model ----------------
class SISMOOrdinalModel(nn.Module):
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()

        self.backbone = AutoModel.from_pretrained(MODEL_NAME)
        
      
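        # Strategy A: freeze every backbone parameter so that only the
        # BiLSTM and the linear classifier receive gradient updates.
        for param in self.backbone.parameters():
            param.requires_grad = False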
       
        
        hidden = self.backbone.config.hidden_size

        self.lstm = nn.LSTM(
            input_size=hidden,
            hidden_size=256,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )

        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(256 * 2, num_classes)
     
    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
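        seq_output = outputs.last_hidden_state  # (batch, seq_len, hidden)

        # NOTE: inputs are padded to MAX_LEN and the LSTM does not mask padding,
        # so the final states below also depend on padding positions.
        lstm_out, (h_n, _) = self.lstm(seq_output)

        # h_n has shape (num_layers * 2, batch, 256); its last two entries are the
        # final forward and backward states of the top layer.
        h_forward = h_n[-2]
        h_backward = h_n[-1]
        pooled = torch.cat([h_forward, h_backward], dim=-1)  # (batch, 512)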

        logits = self.classifier(self.dropout(pooled))
        return logits


# ---------------- Train / Eval function ----------------
def train_one_epoch(model, data_loader, optimizer, criterion, device, scheduler, ACCUM_STEPS):
    model.train()
    total_loss = 0.0
    total_examples = 0

    optimizer.zero_grad()
    accum_counter = 0

    for step, batch in enumerate(data_loader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        raw_loss = criterion(
            logits=model(input_ids, attention_mask),
            targets=labels
        )

        loss = raw_loss / ACCUM_STEPS
        loss.backward()
        accum_counter += 1

        bs = input_ids.size(0)
        total_loss += raw_loss.item() * bs
        total_examples += bs

        if accum_counter == ACCUM_STEPS:
            optimizer.step()
            optimizer.zero_grad()
            if scheduler is not None:
                scheduler.step()
            accum_counter = 0

        if (step + 1) % (ACCUM_STEPS * 10) == 0:
            print(f"  Step {step+1} | Loss={raw_loss.item():.4f}")


    # Flush gradients left over from an incomplete accumulation window at the end of the epoch.
    if accum_counter > 0:
        optimizer.step()
        optimizer.zero_grad()
        if scheduler is not None:
            scheduler.step()

    avg_loss = total_loss / total_examples
    return avg_loss


def evaluate(model, data_loader, device, compute_graded_metrics):
    model.eval()
    all_labels = []
    all_preds = []

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            logits = model(input_ids, attention_mask)
            preds = torch.argmax(logits, dim=1)

            all_labels.extend(labels.cpu().tolist())
            all_preds.extend(preds.cpu().tolist())


    metrics = compute_graded_metrics(all_labels, all_preds)
    gp = metrics["graded_precision"]
    gr = metrics["graded_recall"]
    gf1 = metrics["graded_f1"]

    acc = (torch.tensor(all_labels) == torch.tensor(all_preds)).float().mean().item()

    return acc, gp, gr, gf1


# ---------------- Training Config ----------------
print("===== Start Training SISMO Ordinal Model =====")
best_val_gf1 = 0.0

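# Per-class counts in label order (Indicator, Ideation, Behavior, Attempt); printed for reference only.
# NOTE: these counts sum to 1036, the size of the test split. Deriving them from
# train_df[LABEL_COL].value_counts() would avoid relying on test-set statistics.
support_counts = torch.tensor([305, 530, 135, 66], dtype=torch.float32)
print(f" Input Sample Counts (Support Counts): {support_counts.cpu().tolist()}")

# Hand-set class weights for this run: the rarer the class, the larger its weight.
class_weights = torch.tensor([0.7, 0.7, 1.3, 1.7], dtype=torch.float32).to(DEVICE)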

model = SISMOOrdinalModel(num_classes=NUM_CLASSES).to(DEVICE)

criterion = OrdinalLoss(
    alpha=1.2,                     
    num_classes=NUM_CLASSES,
    class_weights=class_weights,
    device=DEVICE
).to(DEVICE)

LEARNING_RATE = 2e-5
WEIGHT_DECAY = 0.01

TOTAL_STEPS = len(train_loader) * EPOCHS / ACCUM_STEPS
WARMUP_RATIO = 0.1

optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(TOTAL_STEPS * WARMUP_RATIO),
    num_training_steps=int(TOTAL_STEPS)
)

print(f"Training Epochs: {EPOCHS} | Total Steps: {int(TOTAL_STEPS)}")
print("===== Configuration Complete. Starting Training Loop =====")

for epoch in range(1, EPOCHS + 1):
    print(f"\nEpoch {epoch}/{EPOCHS}")

    train_loss = train_one_epoch(
        model,
        train_loader,
        optimizer,
        criterion,
        DEVICE,
        scheduler,
        ACCUM_STEPS
    )

    val_acc, val_gp, val_gr, val_gf1 = evaluate(model, val_loader, DEVICE, compute_graded_metrics)
    
    print(f"[Epoch {epoch}] "
          f"train_loss={train_loss:.4f} | "
          f"val_acc={val_acc:.4f} | "
          f"GP={val_gp:.4f} | GR={val_gr:.4f} | GF1={val_gf1:.4f}")

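    # Track the best validation GF1. No checkpoint is saved here, so the test
    # evaluation below uses the weights from the final epoch.
    if val_gf1 > best_val_gf1:
        best_val_gf1 = val_gf1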

print("\nBest Val Graded F1:", best_val_gf1)

Using Apple Silicon GPU (MPS) for training.
===== Start Training SISMO Ordinal Model =====
 Input Sample Counts (Support Counts): [305.0, 530.0, 135.0, 66.0]
Training Epochs: 5 | Total Steps: 1871
===== Configuration Complete. Starting Training Loop =====

Epoch 1/5
  Step 40 | Loss=1.3631
  Step 80 | Loss=1.2487
  Step 120 | Loss=1.0366
  Step 160 | Loss=0.9237
  Step 200 | Loss=1.0187
  Step 240 | Loss=1.3495
  Step 280 | Loss=1.0075
  Step 320 | Loss=1.0036
  Step 360 | Loss=0.8924
  Step 400 | Loss=1.2793
  Step 440 | Loss=1.1134
  Step 480 | Loss=1.2534
  Step 520 | Loss=0.8785
  Step 560 | Loss=0.9956
  Step 600 | Loss=0.9980
  Step 640 | Loss=0.8883
  Step 680 | Loss=1.2450
  Step 720 | Loss=0.8559
  Step 760 | Loss=0.9544
  Step 800 | Loss=1.1152
  Step 840 | Loss=1.1128
  Step 880 | Loss=1.2532
  Step 920 | Loss=0.9780
  Step 960 | Loss=0.9587
  Step 1000 | Loss=1.1398
  Step 1040 | Loss=0.9935
  Step 1080 | Loss=1.3549
  Step 1120 | Loss=1.3602
  Step 1160 | Loss=1.1075
  ... (output truncated)

# Evaluation

We evaluate the model using:

## 1. Standard Accuracy
Fraction of posts whose predicted label exactly matches the true label.

## 2. Classification Report
Precision, recall, and F1 for each of the four ordinal labels.

## 3. Graded Metrics (Required for Project)
These metrics account for ordinal distance between labels:

- **Graded Precision (GP)**
- **Graded Recall (GR)**
- **Graded F1 (GF1)**
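
The implementation of `compute_graded_metrics` lives in `src.utils` and is not shown in this notebook. The sketch below is one plausible definition, stated here as an assumption: over-estimates of risk count as false positives, under-estimates count as false negatives, and each is expressed as a fraction of all samples. The function name `graded_metrics_sketch` and the toy labels are hypothetical.

```python
def graded_metrics_sketch(y_true, y_pred):
    """Ordinal-aware precision, recall, and F1 under the assumed definition above."""
    n = len(y_true)
    over = sum(1 for t, p in zip(y_true, y_pred) if p > t)   # predicted risk too high
    under = sum(1 for t, p in zip(y_true, y_pred) if p < t)  # predicted risk too low
    gp = 1.0 - over / n
    gr = 1.0 - under / n
    gf1 = 2 * gp * gr / (gp + gr) if (gp + gr) > 0 else 0.0
    return {"graded_precision": gp, "graded_recall": gr, "graded_f1": gf1}

# Toy example: one over-estimate (1 -> 2) and one under-estimate (3 -> 1).
y_true = [0, 1, 2, 3, 3, 1]
y_pred = [0, 2, 2, 1, 3, 1]
print(graded_metrics_sketch(y_true, y_pred))
```

This definition is consistent with the accuracy and graded scores reported by the evaluation cells in this notebook, but that agreement does not confirm it is the exact implementation in `src.utils`.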

## Strategy A (Frozen Backbone): Bottleneck

When the BERT base model parameters were frozen:

- The classifier received only generic, static BERT features.
- These features were insufficient to identify the rare *Attempt* class.
- Even with interventions such as class weighting and bias shifting, the model never predicted *Attempt* correctly (recall = 0.00 on the test set).
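
The bias shifting mentioned above can be sketched as adding a fixed offset to the logits of the rarer classes before taking the argmax. This plain-Python version uses the offsets that the evaluation cells in this notebook apply (+0.2 for *Behavior*, +0.4 for *Attempt*); the function name `predict_with_offsets` and the example logits are illustrative.

```python
OFFSETS = [0.0, 0.0, 0.2, 0.4]  # Indicator, Ideation, Behavior, Attempt

def predict_with_offsets(logits, offsets=OFFSETS):
    """Return the index of the largest logit after adding per-class offsets."""
    adjusted = [logit + offset for logit, offset in zip(logits, offsets)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)

logits = [0.1, 1.0, 0.5, 0.75]
print("plain argmax  :", predict_with_offsets(logits, [0.0] * 4))
print("shifted argmax:", predict_with_offsets(logits))
```

Offsets like these only move the decision boundary. They cannot recover a class whose logits are never competitive, which is consistent with the frozen-backbone result. Such offsets are usually tuned on the validation split rather than the test split.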


## Strategy B (Full Fine-Tuning): Success

The final adopted strategy, full fine-tuning, succeeded for the following reasons:

1. **Unlocking BERT parameters**  
   Allowed the model to adjust millions of backbone parameters and learn task-specific semantics for the *Attempt* class.

2. **Gradient checkpointing**  
   Prevented out-of-memory (OOM) errors on an M3 machine with 16 GB of RAM:
   - Reduced memory usage by 15–30%.
   - Slower training, but stable.

3. **Final performance**  
   Achieved GF1 ≈ 0.79 on the test set, with:
   - Detection of the rare *Attempt* class (recall 0.38, up from 0.00 with the frozen backbone)
   - Balanced graded precision and recall (≈ 0.79 each)
   - Evaluation with the graded metrics (GP, GR, GF1) described above


In [5]:
# ===== BEGIN: Gemini-generated block =====

import torch
from sklearn.metrics import accuracy_score, classification_report


model_to_eval = model  

model_to_eval.eval()

all_labels = []
all_preds = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["label"].to(DEVICE)

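        logits = model_to_eval(input_ids, attention_mask)

        # Bias shifting: add fixed offsets to the rarer classes before the argmax.
        logits_adj = logits.clone()
        logits_adj[:, 2] += 0.2   # Behavior
        logits_adj[:, 3] += 0.4   # Attempt

        preds = torch.argmax(logits_adj, dim=1)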

        all_labels.extend(labels.cpu().tolist())
        all_preds.extend(preds.cpu().tolist())

y_test = all_labels
y_pred = all_preds

acc = accuracy_score(y_test, y_pred)
print(f"\nSimple Accuracy: {acc:.4f}")
# ===== END: Gemini-generated block =====

target_names = ['Indicator', 'Ideation', 'Behavior', 'Attempt']
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

graded_metrics = compute_graded_metrics(y_test, y_pred)
print("\n=== Graded Metrics ===")
print(f"Graded Precision: {graded_metrics['graded_precision']:.4f}")
print(f"Graded Recall:    {graded_metrics['graded_recall']:.4f}")
print(f"Graded F1-Score:  {graded_metrics['graded_f1']:.4f}")


Simple Accuracy: 0.5531

Classification Report:
              precision    recall  f1-score   support

   Indicator       0.51      0.51      0.51       305
    Ideation       0.58      0.77      0.66       530
    Behavior       0.36      0.06      0.10       135
     Attempt       0.00      0.00      0.00        66

    accuracy                           0.55      1036
   macro avg       0.36      0.34      0.32      1036
weighted avg       0.49      0.55      0.50      1036


=== Graded Metrics ===
Graded Precision: 0.8504
Graded Recall:    0.7027
Graded F1-Score:  0.7695




In [6]:
import torch
from torch import nn
import torch.nn.functional as F
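from transformers import AutoModel, AutoTokenizer, get_linear_schedule_with_warmup
import os

NUM_CLASSES = 4
# NOTE: this overrides the DeBERTa MODEL_NAME from the config cell, so this run uses BERT.
MODEL_NAME = 'bert-base-uncased'
# Reload the tokenizer so token IDs match the backbone's vocabulary.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
EPOCHS = 4
ACCUM_STEPS = 4               # 8 (micro-batch) x 4 = effective batch size of 32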

if torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
    print("Apple Silicon GPU (MPS)")
else:
    DEVICE = torch.device("cpu")
    print("CPU ")
# ===== BEGIN: GPT-5 -generated block =====
class OrdinalLoss(nn.Module):
    """
    Ordinal Regression Loss based on SISMO paper, with class weights.
    """
    def __init__(self, alpha=1.0, num_classes=4, class_weights=None, device='cpu'):
        super().__init__()
        self.alpha = alpha
        self.num_classes = num_classes


        if class_weights is not None:
            if isinstance(class_weights, list):
                class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)
            elif isinstance(class_weights, torch.Tensor):
                class_weights = class_weights.to(device)
            self.register_buffer("class_weights", class_weights)
        else:
            self.class_weights = None

        cost_matrix = torch.zeros((num_classes, num_classes))
        for i in range(num_classes):
            for j in range(num_classes):
                cost_matrix[i, j] = abs(i - j)
        self.register_buffer("cost_matrix", cost_matrix.to(device))

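    def forward(self, logits, targets):
        # Row of the cost matrix for each target: distance from the true class to every class.
        cost = self.cost_matrix[targets]

        # Soft targets: probability mass decays with ordinal distance; alpha sets the sharpness.
        soft_targets = torch.softmax(-self.alpha * cost, dim=1)

        log_probs = F.log_softmax(logits, dim=1)

        # Cross-entropy against the soft targets, one value per sample.
        loss = -torch.sum(soft_targets * log_probs, dim=1)

        # Optionally scale each sample's loss by the weight of its true class.
        if self.class_weights is not None:
            loss = loss * self.class_weights[targets]

        return loss.mean()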


class SISMOOrdinalModel(nn.Module):
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()

        self.backbone = AutoModel.from_pretrained(MODEL_NAME)
        
       
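        # Strategy B: keep the backbone trainable and enable gradient checkpointing,
        # which recomputes activations during the backward pass to save memory.
        try: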
            self.backbone.gradient_checkpointing_enable()
            print("Gradient Checkpointing")
        except AttributeError:
            print("Warning: Model does not support gradient_checkpointing_enable.")
        
        print("Model Backbone UNFROZEN")
        
        hidden = self.backbone.config.hidden_size

        # LSTM head
        self.lstm = nn.LSTM(
            input_size=hidden,
            hidden_size=256,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )

        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(256 * 2, num_classes)
  
        
    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
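        seq_output = outputs.last_hidden_state  # (batch, seq_len, hidden)

        # NOTE: inputs are padded to MAX_LEN and the LSTM does not mask padding,
        # so the final states below also depend on padding positions.
        lstm_out, (h_n, _) = self.lstm(seq_output)

        # h_n has shape (num_layers * 2, batch, 256); its last two entries are the
        # final forward and backward states of the top layer.
        h_forward = h_n[-2]
        h_backward = h_n[-1]
        pooled = torch.cat([h_forward, h_backward], dim=-1)  # (batch, 512)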

        logits = self.classifier(self.dropout(pooled))
        return logits


def train_one_epoch(model, data_loader, optimizer, criterion, device, scheduler, ACCUM_STEPS):
    model.train()
    total_loss = 0.0
    total_examples = 0

    optimizer.zero_grad()
    accum_counter = 0

    for step, batch in enumerate(data_loader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        raw_loss = criterion(
            logits=model(input_ids, attention_mask),
            targets=labels
        )

        loss = raw_loss / ACCUM_STEPS
        loss.backward()
        accum_counter += 1

        bs = input_ids.size(0)
        total_loss += raw_loss.item() * bs
        total_examples += bs

        if accum_counter == ACCUM_STEPS:
            optimizer.step()
            optimizer.zero_grad()
            if scheduler is not None:
                scheduler.step()
            accum_counter = 0

        if (step + 1) % (ACCUM_STEPS * 10) == 0:
            print(f"  Step {step+1} | Loss={raw_loss.item():.4f}")


    # Flush gradients left over from an incomplete accumulation window at the end of the epoch.
    if accum_counter > 0:
        optimizer.step()
        optimizer.zero_grad()
        if scheduler is not None:
            scheduler.step()

    avg_loss = total_loss / total_examples
    return avg_loss


def evaluate(model, data_loader, device, compute_graded_metrics):
    model.eval()
    all_labels = []
    all_preds = []

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            logits = model(input_ids, attention_mask)
            preds = torch.argmax(logits, dim=1)

            all_labels.extend(labels.cpu().tolist())
            all_preds.extend(preds.cpu().tolist())


    metrics = compute_graded_metrics(all_labels, all_preds)
    gp = metrics["graded_precision"]
    gr = metrics["graded_recall"]
    gf1 = metrics["graded_f1"]

    acc = (torch.tensor(all_labels) == torch.tensor(all_preds)).float().mean().item()

    return acc, gp, gr, gf1

print("===== SISMO (Full Fine-Tuning) =====")
best_val_gf1 = 0.0

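# NOTE: these counts sum to 1036, the size of the test split. Deriving them from
# train_df[LABEL_COL].value_counts() would avoid relying on test-set statistics.
support_counts = torch.tensor([305, 530, 135, 66], dtype=torch.float32)
print(f" input (Support Counts): {support_counts.cpu().tolist()}")

# Log-smoothed inverse-frequency weights, rescaled so they sum to the number of classes.
raw_weights = 1.0 / torch.log(support_counts + 1)
class_weights = raw_weights / raw_weights.sum() * len(support_counts)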
class_weights = class_weights.to(DEVICE)

print(f" Log-Smoothed  (Class Weights): {class_weights.cpu().tolist()}")


model = SISMOOrdinalModel(num_classes=NUM_CLASSES).to(DEVICE)

criterion = OrdinalLoss(
    alpha=1.5,                   
    num_classes=NUM_CLASSES,
    class_weights=class_weights, 
    device=DEVICE
).to(DEVICE)

LEARNING_RATE = 1e-5             
WEIGHT_DECAY = 0.01

# Note: train_loader, val_loader, and compute_graded_metrics must be defined elsewhere
TOTAL_STEPS = len(train_loader) * EPOCHS / ACCUM_STEPS
WARMUP_RATIO = 0.1

optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(TOTAL_STEPS * WARMUP_RATIO),
    num_training_steps=int(TOTAL_STEPS)
)

print(f"Epochs: {EPOCHS} | steps: {int(TOTAL_STEPS)}")
print("===== Full Fine-Tuning Loop =====")

for epoch in range(1, EPOCHS + 1):
    print(f"\nEpoch {epoch}/{EPOCHS}")

    train_loss = train_one_epoch(
        model,
        train_loader,
        optimizer,
        criterion,
        DEVICE,
        scheduler,
        ACCUM_STEPS
    )

    val_acc, val_gp, val_gr, val_gf1 = evaluate(model, val_loader, DEVICE, compute_graded_metrics)
    
    print(f"[Epoch {epoch}] "
              f"train_loss={train_loss:.4f} | "
              f"val_acc={val_acc:.4f} | "
              f"GP={val_gp:.4f} | GR={val_gr:.4f} | GF1={val_gf1:.4f}")

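    # Track the best validation GF1. No checkpoint is saved here, so the test
    # evaluation below uses the weights from the final epoch.
    if val_gf1 > best_val_gf1:
        best_val_gf1 = val_gf1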
# ===== END: GPT-5 -generated block =====
print("\nBest Graded F1:", best_val_gf1)

Apple Silicon GPU (MPS)
===== SISMO (Full Fine-Tuning) =====
 input (Support Counts): [305.0, 530.0, 135.0, 66.0]
 Log-Smoothed  (Class Weights): [0.9012120962142944, 0.8220493197441101, 1.049974799156189, 1.2267636060714722]
Gradient Checkpointing
Model Backbone UNFROZEN
Epochs: 4 | steps: 1497
===== Full Fine-Tuning Loop =====

Epoch 1/4
  Step 40 | Loss=1.2638
  Step 80 | Loss=1.4319
  Step 120 | Loss=1.1543
  Step 160 | Loss=1.1092
  Step 200 | Loss=1.0261
  Step 240 | Loss=1.3535
  Step 280 | Loss=1.1016
  Step 320 | Loss=1.2285
  Step 360 | Loss=1.3069
  Step 400 | Loss=0.9568
  Step 440 | Loss=1.0198
  Step 480 | Loss=1.1157
  Step 520 | Loss=1.1905
  Step 560 | Loss=1.0742
  Step 600 | Loss=0.9837
  Step 640 | Loss=1.2951
  Step 680 | Loss=0.9486
  Step 720 | Loss=1.1452
  Step 760 | Loss=1.1099
  Step 800 | Loss=1.1205
  Step 840 | Loss=1.1459
  Step 880 | Loss=1.0300
  Step 920 | Loss=1.1039
  Step 960 | Loss=1.0704
  Step 1000 | Loss=1.3050
  Step 1040 | Loss=1.1404
  ... (output truncated)

In [7]:
# ===== BEGIN: Gemini-generated block =====

import torch
from sklearn.metrics import accuracy_score, classification_report


model_to_eval = model  

model_to_eval.eval()

all_labels = []
all_preds = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["label"].to(DEVICE)

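        logits = model_to_eval(input_ids, attention_mask)

        # Bias shifting: add fixed offsets to the rarer classes before the argmax.
        logits_adj = logits.clone()
        logits_adj[:, 2] += 0.2   # Behavior
        logits_adj[:, 3] += 0.4   # Attempt

        preds = torch.argmax(logits_adj, dim=1)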

        all_labels.extend(labels.cpu().tolist())
        all_preds.extend(preds.cpu().tolist())

y_test = all_labels
y_pred = all_preds

acc = accuracy_score(y_test, y_pred)
print(f"\nSimple Accuracy: {acc:.4f}")
# ===== END: Gemini-generated block =====

target_names = ['Indicator', 'Ideation', 'Behavior', 'Attempt']
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

graded_metrics = compute_graded_metrics(y_test, y_pred)
print("\n=== Graded Metrics ===")
print(f"Graded Precision: {graded_metrics['graded_precision']:.4f}")
print(f"Graded Recall:    {graded_metrics['graded_recall']:.4f}")
print(f"Graded F1-Score:  {graded_metrics['graded_f1']:.4f}")


Simple Accuracy: 0.5898

Classification Report:
              precision    recall  f1-score   support

   Indicator       0.64      0.70      0.67       305
    Ideation       0.69      0.63      0.66       530
    Behavior       0.29      0.30      0.30       135
     Attempt       0.31      0.38      0.34        66

    accuracy                           0.59      1036
   macro avg       0.48      0.50      0.49      1036
weighted avg       0.60      0.59      0.59      1036


=== Graded Metrics ===
Graded Precision: 0.7944
Graded Recall:    0.7954
Graded F1-Score:  0.7949
