## 0) Setup (libraries and reproducibility)

**Function Description:**
This cell initializes the environment by importing all necessary libraries and setting up reproducibility controls. It loads tools for data manipulation, machine learning, and deep learning, then seeds the random number generators so that repeated runs produce comparable results.

**Syntax Explanation:**
The imports follow a logical grouping pattern. Standard Python libraries like `os`, `math`, and `random` come first, followed by numerical computing tools (`numpy`, `pandas`), then the deep learning stack (`torch`, `datasets`, `transformers`), and finally the metric helpers from `sklearn`. `AutoTokenizer` and `AutoModelForSequenceClassification` are convenience classes from Hugging Face that detect and load the correct tokenizer and model architecture based on the checkpoint name you provide. The `TrainingArguments` class acts as a container for all training hyperparameters, while `Trainer` wraps the training loop and handles evaluation, logging, and checkpointing automatically. I set the random seed using `random.seed()`, `np.random.seed()`, `torch.manual_seed()`, and `torch.cuda.manual_seed_all()` to control randomness across all libraries. Seeding removes most run-to-run variation, although some GPU operations are non-deterministic, so results on CUDA may still differ slightly. The device detection uses `torch.cuda.is_available()` to check for GPU availability.

**Inputs:**
This cell takes no external inputs. It operates on the Python environment itself, importing modules that are either built-in or installable via pip. The SEED value (42) is hardcoded as a constant.

**Outputs:**
You'll see a single print statement showing which device you're using - "cuda" if a GPU is available, "cpu" otherwise. This confirmation helps you understand whether training will be fast (GPU) or slow (CPU). GPU training can be 10-50x faster than CPU training for transformer models.

**Code Flow:**
The cell progresses from general to specific imports, starting with basic Python utilities and ending with specialized deep learning components. After imports, it sets reproducibility seeds across all random number generators. Finally, it detects and prints the compute device. This setup happens once at the beginning and affects all subsequent cells.

**Comments and Observations:**
Reproducibility is important for scientific experiments and debugging. Without setting seeds, the randomly initialized classification head, dropout, and data shuffling would differ each time you run the notebook, making it hard to compare results. The data splits are pinned separately through `random_state=42` in `train_test_split()`. The seed value 42 is arbitrary but conventional in machine learning tutorials. GPU availability dramatically impacts training time - a full fine-tuning run that takes 3 hours on CPU might finish in 15 minutes on a GPU. If you're running on Google Colab, make sure you've enabled GPU in Runtime > Change runtime type > Hardware accelerator > GPU. The imports might take 10-30 seconds the first time you run them because Colab needs to load the libraries into memory.
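
As a minimal sketch of why seeding matters, the snippet below uses only Python's built-in `random` module. The helper name `draw_numbers` is hypothetical and not part of the notebook. Two generators created with the same seed produce the same sequence, which is the behavior the `SEED = 42` calls rely on; NumPy and PyTorch generators follow the same principle.

```python
import random

def draw_numbers(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, does not touch global state
    return [rng.randint(0, 99) for _ in range(n)]

run_a = draw_numbers(42)
run_b = draw_numbers(42)

# Same seed, same sequence: this is what makes a rerun comparable.
print(run_a == run_b)
```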

In [None]:

# Every import has an explanatory comment.
import os                         # file paths and environment checks
import math                       # math helpers (may be useful for schedules)
import random                     # Python's RNG for reproducibility
import numpy as np                # numerical arrays and metrics support
import pandas as pd               # data loading and manipulation
from pathlib import Path          # convenient and robust path handling

# Hugging Face / PyTorch stack (for transformer fine-tuning)
import torch                      # tensor and GPU utilities
from datasets import Dataset      # lightweight dataset wrapper around pandas
from transformers import (       # core HF components for tokenization and training
    AutoTokenizer,               # auto-loads the right tokenizer for a given model checkpoint
    AutoModelForSequenceClassification,  # classification head on top of a transformer
    TrainingArguments,           # training hyperparameters container
    Trainer                      # training loop helper (handles eval and logging)
)

# Metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Make runs reproducible (seed Python, NumPy, and PyTorch)
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Detect device once and print for visibility
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")  # shows 'cuda' when a GPU is available in Colab


Using device: cuda


## 1) Load Dataset

**Function Description:**
This cell handles the complete data loading pipeline, from file upload through data cleaning and label encoding. It prompts you to upload a CSV file, validates the required columns, removes any problematic rows, and converts text labels into numerical format that machine learning models can process.

**Syntax Explanation:**
The `files.upload()` function from `google.colab` opens a browser file picker that lets you select a CSV from your computer. I capture the uploaded file using `list(uploaded.keys())[0]`, which grabs the first filename from the dictionary returned by the upload function. The `Path` object then points at the uploaded file under Colab's working directory (`/content`). After loading with `pd.read_csv()`, I use `assert` to verify that both 'statement' and 'status' columns exist - if they don't, the code stops with an error message showing which columns are missing. The `dropna()` method removes any rows where either the text or label is missing, and `copy()` creates a new DataFrame to avoid pandas warnings about modifying views. Converting the statement column with `astype(str)` ensures all entries are strings, even if some got parsed as numbers. The `LabelEncoder` from sklearn creates a mapping from unique text labels to integers (0, 1, 2, etc.) using `fit_transform()`. I temporarily store the encoded values in a new column, then replace the original status column and drop the temporary one.

**Inputs:**
You provide a CSV file through the browser upload dialog. The CSV must contain at least two columns: 'statement' with the text you want to classify (like "I feel overwhelmed and can't cope"), and 'status' with the mental health label (like "Stress", "Anxiety", "Normal"). The labels can be text strings or already-encoded numbers. If you have extra columns like 'Unnamed: 0' (a common artifact from saving DataFrames), they won't break anything.

**Outputs:**
You'll see several outputs: a confirmation message showing the file path, the label encoding map (which number represents which condition), the count of samples per class, and the first three rows of your cleaned dataset. The label encoding map is particularly important because you'll need it later to interpret predictions - if the model predicts "5", you need to know that means "Stress". The value counts reveal class imbalance, which affects how you should train your model.

**Code Flow:**
The flow moves through four distinct phases. First, file upload and path resolution. Second, loading and validation (checking for required columns). Third, data cleaning (removing nulls, ensuring correct data types). Fourth, label encoding (converting text to numbers) with the final reassignment of the status column. Each step depends on the previous one succeeding, which is why I use assertions for critical validations.

**Comments and Observations:**
Class imbalance is probably your biggest challenge here. With 16,343 Normal samples but only 1,077 Personality disorder samples, the model sees Normal about 15 times more often and will naturally bias toward predicting it. This is why I use class weights in later sections. The `LabelEncoder` assigns numbers in sorted (alphabetical) order, so "Anxiety" becomes 0, "Bipolar" becomes 1, and so on. This ordering doesn't affect model performance but does affect how you read the results. Some datasets have text encoding issues (weird characters, emojis) that can cause problems during tokenization. If you see strange symbols in the data preview, you might need to add `encoding='utf-8'` or `encoding='latin-1'` to the `read_csv()` call. The label encoder will not flag typos in your status column (like "Stres" vs "Stress"); it silently treats them as separate classes. Always check your label counts to catch these issues.
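
The sorted ordering and the typo problem can be illustrated without scikit-learn. The sketch below mimics the mapping with `sorted(set(...))`, which orders string labels the same way; the helper `build_label_map` and the sample label list are illustrative assumptions, not the notebook's code.

```python
def build_label_map(labels):
    """Map each distinct label to an integer ID in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

clean = ["Normal", "Stress", "Anxiety", "Suicidal", "Depression",
         "Bipolar", "Personality disorder", "Stress"]
clean_map = build_label_map(clean)

# A single typo silently becomes its own class and shifts the IDs after it.
noisy_map = build_label_map(clean + ["Stres"])

print(clean_map)
print(len(clean_map), len(noisy_map))
```

Comparing the number of classes against the number you expect is a cheap way to catch such typos before training.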

In [None]:
# --- Load Dataset (Upload version, auto-encodes text labels) ---
import pandas as pd
from pathlib import Path
from google.colab import files

print("📂 Please upload your dataset CSV (e.g., Combined Data.csv)")
uploaded = files.upload()

# Automatically pick the first uploaded file
filename = list(uploaded.keys())[0]
csv_path = Path(f"/content/{filename}")

print(f"✅ File uploaded successfully: {csv_path}")

# Load the CSV
df = pd.read_csv(csv_path)

# --- Validate columns ---
expected_cols = {'statement', 'status'}
assert expected_cols.issubset(df.columns), f"❌ Missing required columns: {expected_cols - set(df.columns)}"

# --- Clean ---
df = df.dropna(subset=['statement', 'status']).copy()
df['statement'] = df['statement'].astype(str)

# --- Encode text labels into integers ---
# This maps each unique label (like 'Anxiety', 'Stress', etc.) to a numeric ID
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['status_encoded'] = le.fit_transform(df['status'])

# Optional: print mapping for your reference
print("🔤 Label encoding map:")
for code, label in enumerate(le.classes_):
    print(f"  {code} → {label}")

# Replace 'status' with the encoded version
df['status'] = df['status_encoded']
df.drop(columns=['status_encoded'], inplace=True)

print("\n✅ Dataset loaded and label-encoded successfully!")
print(df['status'].value_counts(dropna=False))
df.head(3)


📂 Please upload your dataset CSV (e.g., Combined Data.csv)


Saving Combined Data.csv to Combined Data.csv
✅ File uploaded successfully: /content/Combined Data.csv
🔤 Label encoding map:
  0 → Anxiety
  1 → Bipolar
  2 → Depression
  3 → Normal
  4 → Personality disorder
  5 → Stress
  6 → Suicidal

✅ Dataset loaded and label-encoded successfully!
status
3    16343
2    15404
6    10652
0     3841
1     2777
5     2587
4     1077
Name: count, dtype: int64


| Unnamed: 0.1 | Unnamed: 0 | statement | status |
| --- | --- | --- | --- |
| 0 | 0 | oh my gosh | 0 |
| 1 | 1 | trouble sleeping, confused mind, restless hear... | 0 |
| 2 | 2 | All wrong, back off dear, forward doubt. Stay ... | 0 |


## 2) Baseline Models (TF-IDF + Linear)

**Function Description:**
This cell establishes performance baselines using traditional machine learning before moving to deep learning. It splits your data, converts text to numerical features using TF-IDF, trains two simple linear models (Logistic Regression and Linear SVM), and reports their accuracy, precision, recall, and F1 scores.

**Syntax Explanation:**
The `train_test_split()` function from sklearn is called twice to produce a three-way split: 70% training, 15% validation, and 15% held-out test, with random state 42 for reproducibility. The first call sets aside the test set; the second divides the remainder using `VAL_SIZE / (TRAIN_SIZE + VAL_SIZE)` as its `test_size`. The `stratify` parameter ensures every subset maintains the same class distribution as your original data. `TfidfVectorizer` converts text into numbers by analyzing word frequencies - the `ngram_range=(1,2)` parameter means it considers both individual words and two-word phrases, `min_df=2` ignores terms appearing in fewer than 2 documents (filtering out typos and rare terms), and `max_features=40000` keeps only the 40,000 terms with the highest total frequency in the training corpus. The vectorizer's `fit_transform()` learns the vocabulary from training data and converts it to features in one step, while `transform()` applies that learned vocabulary to validation data without learning anything new. Both `LogisticRegression` and `LinearSVC` use `class_weight="balanced"`, which adjusts for class imbalance by computing weights inversely proportional to class frequencies. The `precision_recall_fscore_support()` function calculates precision, recall, and F1 in one call (accuracy comes from `accuracy_score()`), and the `average` parameter determines how to aggregate across classes (the weighted average weights each class by its number of true samples).
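
To make the `class_weight="balanced"` setting concrete, here is a small sketch of the formula scikit-learn documents for it, `n_samples / (n_classes * count_per_class)`. The counts are the full-dataset counts printed in Section 1 and are used here as an assumption for illustration; the real weights are computed from the training split only.

```python
# Per-class sample counts, indexed by encoded label (0..6), from the Section 1 output.
counts = [3841, 2777, 15404, 16343, 1077, 2587, 10652]

n_samples = sum(counts)
n_classes = len(counts)

# "balanced" heuristic: rarer classes receive proportionally larger weights.
weights = [n_samples / (n_classes * c) for c in counts]

for label_id, (c, w) in enumerate(zip(counts, weights)):
    print(f"class {label_id}: count={c:>6}  weight={w:.3f}")
```

With these weights every class contributes the same total weight to the loss, regardless of how many samples it has.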

**Inputs:**
This cell takes the cleaned DataFrame from the previous section and uses the 'statement' column (text) as features and the 'status' column (labels) as targets. `train_test_split()` randomly assigns samples to the training, validation, and test sets based on the `test_size` ratios and the random seed.

**Outputs:**
You get a summary of the split sizes, the filenames of the exported test set, and performance metrics for both baseline models printed in a compact format showing accuracy, precision, recall, and F1 score. The cell also prints which averaging method it's using (binary for 2 classes, weighted for more than 2) based on automatic detection of the number of unique classes. Typical baseline scores range from 75-85% accuracy depending on your data quality and class separability; the run shown below lands at 78-79%.
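
For readers curious how `average="weighted"` aggregates per-class scores, the sketch below recomputes it by hand on a toy example. The function `weighted_prf` and the toy labels are illustrative assumptions, not the notebook's code. It also shows why the recall column matches accuracy in the printed results: support-weighted recall is algebraically the same quantity as accuracy.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over the classes in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, q in zip(y_true, y_pred) if t == cls and q == cls)
        predicted = sum(1 for q in y_pred if q == cls)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / n_cls
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        share = n_cls / n  # each class is weighted by its share of true samples
        precision += share * p_cls
        recall += share * r_cls
        f1 += share * f_cls
    return precision, recall, f1

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]

p, r, f = weighted_prf(y_true, y_pred)
accuracy = sum(t == q for t, q in zip(y_true, y_pred)) / len(y_true)
print(f"P={p:.3f}  R={r:.3f}  F1={f:.3f}  Acc={accuracy:.3f}")
```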

**Code Flow:**
The code follows a standard machine learning pipeline. First, split the data into train, validation, and held-out test sets, and export the test set to CSV for the final evaluation. Second, fit the TF-IDF vectorizer on training text and transform the training and validation sets. Third, train the first model (Logistic Regression), make predictions on the validation set, and calculate metrics. Fourth, repeat the training and evaluation process for the second model (Linear SVM). The vectorizer must be fit before the classifiers because the classifiers need fixed-size numerical inputs.
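
The second split needs a rescaled `test_size`, because it operates on the 85% of the data left after the test set is removed. A quick arithmetic check, using the same constants as the notebook cell:

```python
TRAIN_SIZE = 0.70
VAL_SIZE = 0.15
TEST_SIZE = 0.15

# After removing the test set, train + validation make up 85% of the data.
remaining = TRAIN_SIZE + VAL_SIZE

# Fraction of the *remaining* data that must go to validation.
val_ratio = VAL_SIZE / remaining

print(f"val_ratio = {val_ratio:.4f}")
print(f"validation share of full data = {val_ratio * remaining:.2f}")
print(f"training share of full data   = {(1 - val_ratio) * remaining:.2f}")
```

Passing `VAL_SIZE` directly to the second call would instead give a validation set of 15% of 85%, which is about 12.75% of the full data.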

**Comments and Observations:**
These baseline models serve two purposes: they give you a performance floor that deep learning should beat, and they train in seconds rather than hours, letting you quickly spot data quality issues. If your baseline F1 is below 60%, something is likely wrong with your data (mislabeled samples, too much noise, or classes that aren't distinguishable from text alone). TF-IDF works surprisingly well for text classification because it up-weights words that are frequent in a document but rare across the corpus, and the linear model then learns which of those words are distinctive for each class. For example, the word "overwhelmed" might appear frequently in stress-related texts but rarely elsewhere, so it gets a high TF-IDF value in those texts and the classifier can assign it a large weight for the Stress class. The `ngram_range=(1,2)` parameter helps capture phrases like "panic attack" or "feel good" that carry more meaning than individual words. Linear models like Logistic Regression are also interpretable - you could examine the feature weights to see which words most strongly predict each class. The `max_features` limit prevents the feature space from exploding (some datasets have 100k+ unique terms) and also acts as a mild form of regularization by dropping the rarest terms. Linear SVM often performs slightly better than Logistic Regression on sparse text features, but both usually give similar results; in this run the gap is about one point of accuracy. If SVM and Logistic Regression give very different scores (more than a 5-point gap), that may hint at class boundaries that linear models struggle with and that deep learning might handle better.
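
The TF-IDF weighting can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: it uses whitespace tokenization on a made-up three-sentence corpus, unigrams only, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization that scikit-learn documents as its defaults. The names `doc_freq`, `idf`, and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

docs = [
    "i feel overwhelmed today",
    "i feel good today",
    "i feel fine",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

def tfidf_vector(doc):
    """Term counts times IDF, scaled to unit L2 length."""
    counts = Counter(doc)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

vec = tfidf_vector(tokenized[0])
for term, value in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:<12} {value:.3f}")
```

The rare word "overwhelmed" ends up with the largest value in the first document, while "feel", which occurs in every document, gets the smallest.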

In [None]:
# --- Baseline Models (TF-IDF + Linear, supports multi-class) ---
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np

# ============================================================================
# Three-Way Data Split: Train / Validation / Ground-Truth Test Set
# ============================================================================
# Split: 70% train, 15% validation, 15% held-out test set (for final evaluation)
# ============================================================================

from datetime import datetime

# Configuration
TRAIN_SIZE = 0.70
VAL_SIZE = 0.15
TEST_SIZE = 0.15
RANDOM_STATE = 42

# Ground truth test set metadata
GROUND_TRUTH_TEST_SET = {
    'created_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'description': 'Held-out test set for final model evaluation. Do not use for training or validation.',
    'split_ratio': f'Train: {TRAIN_SIZE*100:.0f}%, Validation: {VAL_SIZE*100:.0f}%, Test: {TEST_SIZE*100:.0f}%',
    'random_state': RANDOM_STATE
}

# First split: Separate out the test set (15%)
X_temp, X_test, y_temp, y_test = train_test_split(
    df['statement'].values,
    df['status'].values,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=df['status'].values
)

# Second split: Split remaining data into train (70%) and validation (15%)
val_ratio = VAL_SIZE / (TRAIN_SIZE + VAL_SIZE)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp,
    y_temp,
    test_size=val_ratio,
    random_state=RANDOM_STATE,
    stratify=y_temp
)

# Print split summary
print("=" * 80)
print("DATA SPLIT SUMMARY")
print("=" * 80)
print(f"Training set:   {len(X_train):,} samples ({len(X_train)/len(df)*100:.1f}%)")
print(f"Validation set: {len(X_val):,} samples ({len(X_val)/len(df)*100:.1f}%)")
print(f"Test set:       {len(X_test):,} samples ({len(X_test)/len(df)*100:.1f}%)")
print(f"Total:          {len(df):,} samples")
print("=" * 80)

# Automatic Export: Save Ground-Truth Test Set to CSV
test_df = pd.DataFrame({
    'statement': X_test,
    'status': y_test
})
test_df['test_set_id'] = range(1, len(test_df) + 1)
test_df['created_at'] = GROUND_TRUTH_TEST_SET['created_at']
test_df['description'] = GROUND_TRUTH_TEST_SET['description']
test_df = test_df[['test_set_id', 'statement', 'status', 'created_at', 'description']]

# Save test set with encoded labels
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
csv_filename = f"Ground_Truth_Test_Set_Final_Version_{timestamp}.csv"
test_df.to_csv(csv_filename, index=False)
print(f"\n✅ Ground-Truth Test Set exported to: {csv_filename}")

# Also save with original labels if LabelEncoder is available
if 'le' in globals():
    test_df_with_labels = test_df.copy()
    test_df_with_labels['status_label'] = le.inverse_transform(test_df['status'])
    csv_filename_labels = f"Ground_Truth_Test_Set_Final_Version_With_Labels_{timestamp}.csv"
    test_df_with_labels.to_csv(csv_filename_labels, index=False)
    print(f"✅ Ground-Truth Test Set (with labels) exported to: {csv_filename_labels}")

print("\n⚠️  IMPORTANT: The test set is held-out for final evaluation only!")
print("   Do not use it for training, validation, or hyperparameter tuning.")
print("=" * 80)

# Convert raw text into TF-IDF features
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=40000)
Xtr = tfidf.fit_transform(X_train)
Xva = tfidf.transform(X_val)

# Detect if this is binary or multiclass
num_classes = len(np.unique(y_train))
avg_type = "binary" if num_classes == 2 else "weighted"
print(f"Detected {num_classes} classes → using average='{avg_type}' for metrics.\n")

# --- Baseline 1: Logistic Regression ---
logreg = LogisticRegression(max_iter=2000, class_weight="balanced")
logreg.fit(Xtr, y_train)
pred_lr = logreg.predict(Xva)
p, r, f, _ = precision_recall_fscore_support(y_val, pred_lr, average=avg_type)
acc = accuracy_score(y_val, pred_lr)
print(f"[Baseline-LR] Acc={acc:.3f}  P={p:.3f}  R={r:.3f}  F1={f:.3f}")

# --- Baseline 2: Linear SVM ---
svm = LinearSVC(class_weight="balanced")
svm.fit(Xtr, y_train)
pred_svm = svm.predict(Xva)
p, r, f, _ = precision_recall_fscore_support(y_val, pred_svm, average=avg_type)
acc = accuracy_score(y_val, pred_svm)
print(f"[Baseline-SVM] Acc={acc:.3f}  P={p:.3f}  R={r:.3f}  F1={f:.3f}")


DATA SPLIT SUMMARY
Training set:   36,875 samples (70.0%)
Validation set: 7,903 samples (15.0%)
Test set:       7,903 samples (15.0%)
Total:          52,681 samples

✅ Ground-Truth Test Set exported to: Ground_Truth_Test_Set_Final_Version_20251119_115238.csv
✅ Ground-Truth Test Set (with labels) exported to: Ground_Truth_Test_Set_Final_Version_With_Labels_20251119_115238.csv

⚠️  IMPORTANT: The test set is held-out for final evaluation only!
   Do not use it for training, validation, or hyperparameter tuning.
Detected 7 classes → using average='weighted' for metrics.

[Baseline-LR] Acc=0.779  P=0.785  R=0.779  F1=0.777
[Baseline-SVM] Acc=0.789  P=0.786  R=0.789  F1=0.786


## 3) Pre-Trained Models (Tokenization and Dataset Prep)

**Function Description:**
This cell prepares your text data for transformer models by loading a specialized tokenizer and converting all text into the numerical format that BERT-based models expect. It tokenizes both training and validation texts, then packages them into HuggingFace Dataset objects that work seamlessly with the Trainer API.

**Syntax Explanation:**
I define two model checkpoint names as constants - `CLINICAL_BERT` points to a model trained on clinical text, while `DISTIL_BERT` points to a smaller, faster baseline. The `AutoTokenizer.from_pretrained()` method downloads and initializes the tokenizer that matches your chosen model architecture. The `tokenize_texts()` helper function takes a list of strings and converts them to token IDs - the `padding=True` parameter pads shorter sequences with the tokenizer's padding token (ID 0 in BERT vocabularies) so all sequences in the call have the same length, `truncation=True` cuts off text exceeding the max_length, `max_length=128` sets the sequence limit, and `return_tensors="pt"` formats the output as PyTorch tensors rather than lists. Because each split is tokenized in a single call, every sequence in that split is padded to the longest one, capped at 128 tokens. After tokenizing, I use `Dataset.from_dict()` to create HuggingFace datasets, passing dictionaries that contain input_ids (the tokenized text), attention_mask (which positions are real tokens vs padding), and labels (your encoded status values wrapped in `torch.tensor()`).

**Inputs:**
This cell uses the `X_train`, `X_val`, `y_train`, and `y_val` arrays created by the train_test_split in the previous section. X_train and X_val contain the text statements, while y_train and y_val contain the corresponding numerical labels.

**Outputs:**
You'll see progress bars as the tokenizer downloads (first run only), then the final line shows the sizes of your train and validation datasets as a tuple, (36875, 7903) in the run shown here. This matches the 70% training and 15% validation shares from the three-way split in Section 2. The tokenized datasets are stored in memory as `train_ds` and `val_ds` objects ready for training.

**Code Flow:**
The flow is straightforward and sequential. First, I define model checkpoints and select one as the default backbone. Second, I load the tokenizer for that backbone. Third, I define a helper function that wraps the tokenizer with specific parameters. Fourth, I apply that function to both train and validation texts. Fifth, I package the tokenized outputs and labels into Dataset objects. This preparation step only happens once before training multiple experiments.

**Comments and Observations:**
The choice between ClinicalBERT and DistilBERT matters more than you might think. ClinicalBERT was trained on clinical notes, discharge summaries, and medical text, so it understands medical terminology and the way healthcare professionals write. This makes it better suited for mental health classification where text might include clinical terms or formal descriptions. DistilBERT is a compressed version of BERT with 40% fewer parameters - it trains faster and uses less memory but might miss subtle patterns that the full model catches. The max_length=128 setting is a practical choice that balances speed and information retention. Most mental health statements are 20-80 words, which translates to roughly 30-120 tokens after subword tokenization. Setting max_length too high wastes computation on padding, while setting it too low truncates important information. The tokenizer uses WordPiece subword tokenization, meaning it breaks rare or complex words into pieces and marks continuation pieces with a `##` prefix - for example, "unhappiness" might become ["un", "##happiness"]. This helps the model handle words it has never seen before by understanding their components. The attention_mask is important because it tells the model which tokens are real (value of 1) and which are padding (value of 0), preventing padding tokens from influencing the model's predictions. When you see the tokenizer downloading files, it's fetching the vocabulary file (which maps tokens to IDs) and the config file (which stores model and tokenization settings). These downloads only happen once and get cached locally.
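
The interplay of truncation, padding, and the attention mask can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the helper `pad_and_truncate` is hypothetical, the token IDs are made up, and a real tokenizer keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one kept."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs, not real vocabulary entries.
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 5, 102]]
enc = pad_and_truncate(batch, max_length=6)

for ids, mask in zip(enc["input_ids"], enc["attention_mask"]):
    print(ids, mask)
```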

In [None]:

# Choose your checkpoints.
# We include ClinicalBERT (for clinical text) and DistilBERT (fast baseline).
CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
DISTIL_BERT   = "distilbert-base-uncased"

# Pick one as the default backbone for experiments below.
BACKBONE = CLINICAL_BERT

# Initialize tokenizer for the chosen backbone
tokenizer = AutoTokenizer.from_pretrained(BACKBONE)

# Helper to tokenize a sequence of texts (list, NumPy array, or pandas Series)
def tokenize_texts(texts, max_length=128):
    # Apply the tokenizer: returns dict with input_ids and attention_mask
    return tokenizer(
        list(texts),                 # a Python list of strings
        padding=True,                # pad to the longest sequence in this call
        truncation=True,             # cut off text exceeding max_length
        max_length=max_length,       # cap sequence length
        return_tensors="pt"          # return PyTorch tensors
    )

# Tokenize train/validation splits
train_enc = tokenize_texts(X_train)
val_enc   = tokenize_texts(X_val)

# Wrap into HF Datasets with labels
train_ds = Dataset.from_dict({
    "input_ids": train_enc["input_ids"],
    "attention_mask": train_enc["attention_mask"],
    "labels": torch.tensor(y_train)
})
val_ds = Dataset.from_dict({
    "input_ids": val_enc["input_ids"],
    "attention_mask": val_enc["attention_mask"],
    "labels": torch.tensor(y_val)
})

len(train_ds), len(val_ds)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

(36875, 7903)

## 4) Training Setup (Trainer utilities and metrics)

**Function Description:**
This cell sets up the infrastructure you need for training - specifically the metric computation function and the custom weighted loss. It defines how to evaluate model performance and how to handle class imbalance during training by penalizing mistakes on rare classes more heavily.

**Syntax Explanation:**
The `compute_metrics()` function takes an `eval_pred` tuple containing logits (raw model outputs before softmax) and true labels. Inside the function, `np.argmax(logits, axis=-1)` converts logits to class predictions by selecting the index of the highest value along the last dimension. The `precision_recall_fscore_support()` function calculates precision, recall, and F1 in one call, and `accuracy_score()` supplies the fourth metric. In this cell the `average` parameter is hardcoded to `"binary"`, which only works for two-class labels; Section 5 redefines the function with weighted averaging for the multi-class case. For class weights, I count how many samples exist in each class using `(y_train == 1).sum()` for the positive class and similar for negative, then compute the weight for the positive class as `neg / max(pos, 1)`, which gives higher weight to the minority class. The `max(pos, 1)` prevents division by zero if you somehow have zero positive samples. I create a PyTorch tensor from these weights and move it to the correct device using `.to(device)`. The `WeightedTrainer` class inherits from HuggingFace's `Trainer` and overrides only the `compute_loss()` method. Inside that method, I extract labels from inputs, run the model on the remaining inputs (everything except labels), get the logits from outputs, create a CrossEntropyLoss function with the class weights, and calculate loss by comparing predictions to true labels.
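
The logits-to-predictions step can be sketched without NumPy. The logits and labels below are made-up values for illustration, and the `argmax` helper is a hypothetical stand-in for `np.argmax(..., axis=-1)` applied row by row.

```python
def argmax(row):
    """Index of the largest value in a list (the first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical logits for three samples and three classes.
logits = [
    [2.1, -0.3, 0.4],
    [0.2, 0.1, 1.7],
    [-1.0, 0.9, 0.8],
]
labels = [0, 2, 2]

preds = [argmax(row) for row in logits]
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
print(preds, round(accuracy, 3))
```

No softmax is needed for this step, because softmax preserves the ordering of the logits.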

**Inputs:**
This cell uses `y_train` from the 80/20 split at the top of this section to compute class frequencies and create weights. The compute_metrics function receives predictions from the Trainer during evaluation, while the WeightedTrainer receives model inputs, labels, and the model itself during training.

**Outputs:**
You'll see the class weights printed as a two-element list, `[w_neg, w_pos]`. For a binary case with 20,000 negative and 5,000 positive samples, you'd see weights like [1.0, 4.0], meaning errors on the positive class count four times as much in the loss. On this seven-class dataset the cell only compares labels 0 and 1, which is why the output shows a modest weight of about 1.38. For multi-class problems with severe imbalance, some weights might be 10x or higher.

**Code Flow:**
The code sets up three related pieces of infrastructure. First, it defines the metric function used during evaluation. Second, it computes and prints class weights that quantify the imbalance in your data. Third, it creates a custom Trainer class that uses those weights during loss calculation. These components work together during training - the Trainer calls compute_loss on every batch to calculate weighted loss, and calls compute_metrics at each evaluation pass on the validation set (once per epoch when evaluation is configured that way).

**Comments and Observations:**
Class imbalance is one of the biggest challenges in mental health classification. Without weighting, a model trained on data that's 80% Normal and 20% Other could achieve 80% accuracy by always predicting Normal and completely ignoring the minority classes. Weighted loss forces the model to care about all classes by making errors on rare classes expensive. The math behind class weighting is intuitive - if you have 4x more samples of class A than class B, you give class B a weight of 4.0 so that one mistake on class B costs as much as four mistakes on class A. This balances the gradient updates and prevents the model from ignoring minority classes. CrossEntropyLoss is the standard loss function for classification because it measures the difference between predicted probability distributions and true labels. Adding weights multiplies each sample's loss by its class weight; with PyTorch's default mean reduction, the sum is then divided by the total weight of the samples in the batch. The custom Trainer override is necessary because the default Trainer doesn't support class weights out of the box. By inheriting and overriding just the compute_loss method, you keep all the other Trainer functionality (logging, checkpointing, evaluation) while adding custom loss calculation. The `return_outputs` parameter in compute_loss determines whether to return just the loss (for training) or both loss and full model outputs (for when you need predictions), and I handle both cases with the conditional return statement. This weighted approach works well for moderate imbalance (ratios up to 10:1 or 20:1), but for extreme imbalance you might need additional techniques like oversampling the minority class or using focal loss.
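
The effect of class weights on the loss can be checked by hand. The sketch below is a plain-Python illustration, assuming the weighted-mean behavior that PyTorch documents for `CrossEntropyLoss` with a `weight` tensor and the default mean reduction; the function name `weighted_cross_entropy` and the toy logits are hypothetical.

```python
import math

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample cross-entropy losses."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, label in zip(logits, labels):
        # Negative log-softmax of the true class, using the max trick for stability.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_norm - row[label]
        w = class_weights[label]
        weighted_sum += w * nll
        weight_total += w
    return weighted_sum / weight_total

logits = [[2.0, 0.0], [1.5, 0.5], [1.0, 0.0]]  # the model favors class 0 every time
labels = [0, 0, 1]                             # the last sample is a minority-class miss

plain = weighted_cross_entropy(logits, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, [1.0, 4.0])
print(f"unweighted={plain:.3f}  weighted={weighted:.3f}")
```

With the minority class weighted 4x, the single missed minority sample dominates the batch loss, which is the pressure that pushes the model away from always predicting the majority class.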

In [None]:
from sklearn.model_selection import train_test_split

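# Assuming your cleaned & encoded dataframe is called df
# with columns: 'statement' (text) and 'status' (numeric label)
# NOTE: this re-split overwrites X_train/X_val/y_train/y_val from Section 2 and
# draws from the full df, so rows from the exported held-out test set can land
# in the training data. Skip this cell to keep that test set untouched.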
X = df['statement']
y = df['status']

# Split into 80% train, 20% validation (you can adjust ratio)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("✅ Data split complete:")
print(f"Train size: {len(X_train)} | Validation size: {len(X_val)}")

✅ Data split complete:
Train size: 42144 | Validation size: 10537


In [None]:

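# Metric function for the Trainer: computes Accuracy, Precision, Recall, F1
# NOTE: average="binary" only works for two-class labels; Section 5 redefines
# this function with weighted averaging for the multi-class case.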
def compute_metrics(eval_pred):
    # eval_pred is a tuple of (logits, labels)
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

# Optional: class weights for imbalanced datasets (two-class version)
# Weight the positive class by neg/pos; labels other than 0 and 1 are ignored here
pos = (y_train == 1).sum()
neg = (y_train == 0).sum()
w_pos = neg / max(pos, 1)   # weight for positive class
w_neg = 1.0                 # keep negative as baseline
class_weights = torch.tensor([w_neg, w_pos], dtype=torch.float).to(device)
print(f"Class weights (neg, pos): {class_weights.tolist()}" )

# Custom Trainer that injects weighted loss
from torch.nn import CrossEntropyLoss
class WeightedTrainer(Trainer):
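    # **kwargs absorbs extra arguments that newer Trainer versions pass (e.g. num_items_in_batch)
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):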
        labels = inputs.get("labels")
        outputs = model(**{k: v for k, v in inputs.items() if k != "labels"})
        logits = outputs.get("logits")
        loss_fct = CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss


Class weights (neg, pos): [1.0, 1.3836109638214111]


## 5) Fine-tuning (Three Experiments)

**Function Description:**
This cell runs three complete fine-tuning experiments with different hyperparameter configurations. It trains transformer models on your data, evaluates them on the validation set, and creates a leaderboard ranking them by F1 score. Each experiment uses different settings for learning rate, batch size, epochs, and model architecture to find the best configuration for your specific dataset.

**Syntax Explanation:**
The code starts by detecting the number of unique classes with `len(np.unique(y_train))` and sets the averaging strategy for metrics accordingly. The `compute_metrics()` function is similar to the one in Section 4 but adapts to multi-class data by using weighted averaging. For class weights, I use `np.bincount(y_train, minlength=num_labels)`, which counts occurrences of each class, then compute weights as `counts.max() / np.maximum(counts, 1)`. This gives the most frequent class a weight of 1.0 and rarer classes proportionally higher weights, while avoiding division by zero. The weights become a PyTorch tensor on the correct device. The `WeightedTrainer` override works as in Section 4 but now receives one weight per class, and its `compute_loss` accepts `**kwargs` so that extra arguments passed by newer Trainer versions are ignored. The `tokenize_texts()` function accepts a `max_length` parameter so that different experiments can use different sequence lengths. The `make_training_args()` function is a factory that creates `TrainingArguments` objects with version compatibility - it first tries the signature with `evaluation_strategy` and `save_strategy`, and if that raises a `TypeError`, it falls back to a reduced set of legacy parameters such as `do_eval`. The `TypeError` can come from an old transformers version that lacks these keywords or from a recent one that renamed `evaluation_strategy` to `eval_strategy`. The `run_experiment()` function orchestrates everything: it re-tokenizes data with the specified `max_length`, creates fresh Dataset objects, loads the pre-trained model with `AutoModelForSequenceClassification.from_pretrained()` while specifying the correct number of output classes, creates training arguments, instantiates the `WeightedTrainer`, calls `trainer.train()` to run training, calls `trainer.evaluate()` to get final metrics, and returns both the metrics and the trainer object. Each of the three experiments (A, B, C) calls `run_experiment()` with different parameters and stores its result in an OrderedDict. Finally, I extract the F1 scores from each result, sort experiments by F1, and print a leaderboard.
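An alternative to the try/except fallback is to inspect the constructor signature and pick whichever keyword the installed version accepts. The sketch below illustrates the idea with the standard-library `inspect` module. The helper `pick_supported_kwarg` and the two stand-in factories are hypothetical names invented for this example, so it runs without transformers installed.

```python
import inspect

def pick_supported_kwarg(callable_obj, candidates):
    """Return the first name in `candidates` that `callable_obj` accepts as a parameter."""
    params = inspect.signature(callable_obj).parameters
    for name in candidates:
        if name in params:
            return name
    raise TypeError(f"{callable_obj!r} accepts none of {candidates}")

# Hypothetical stand-ins for two library versions with different keyword names
def old_style_args(output_dir, evaluation_strategy="no"):
    return {"output_dir": output_dir, "eval": evaluation_strategy}

def new_style_args(output_dir, eval_strategy="no"):
    return {"output_dir": output_dir, "eval": eval_strategy}

for factory in (old_style_args, new_style_args):
    key = pick_supported_kwarg(factory, ["eval_strategy", "evaluation_strategy"])
    # Pass the strategy under whichever keyword this "version" understands
    config = factory(output_dir="./runs/demo", **{key: "epoch"})
    print(factory.__name__, "->", key, config)
```

Applied to the real library, the same helper should work when given `TrainingArguments` in place of the stand-in factories. The advantage over a blanket fallback is that per-epoch evaluation, warmup, and best-model reloading stay enabled on every version.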

**Inputs:**
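This cell uses `X_train`, `X_val`, `y_train`, and `y_val` from the train-test split. It also uses the model checkpoint names (CLINICAL_BERT, DISTIL_BERT) and the tokenizer defined in earlier sections. Each experiment re-tokenizes the data with its specific `max_length` setting. All three experiments share that one tokenizer, so it should match the backbone being trained: token IDs produced with one checkpoint's vocabulary do not correspond to another checkpoint's vocabulary.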

**Outputs:**
During training, you'll see progress bars showing epoch progress, batch progress within each epoch, loss values that should decrease over time, and periodic evaluation metrics. After each experiment finishes, you'll see a summary of its final performance metrics including accuracy, precision, recall, and F1 score. At the very end, a leaderboard ranks all three experiments by F1 score, showing which configuration worked best. The entire cell might take 20-60 minutes to run depending on whether you have GPU and how large your dataset is.

**Code Flow:**
The flow is hierarchical and modular. At the top level, I set up shared infrastructure (metrics function, class weights). Then I define helper functions (tokenization, training args factory, experiment runner). Finally, I call the experiment runner three times with different parameters and collect results. Each experiment is independent - they don't share trained weights, though they do share the data and evaluation metrics. The leaderboard aggregation happens after all experiments complete, sorting by F1 score and displaying in descending order.

**Comments and Observations:**
Hyperparameter selection is part science, part art. Learning rates for fine-tuning transformers typically range from 1e-5 to 5e-5 because these models are already well-trained and you don't want to disturb the pre-trained weights too much. Going higher risks catastrophic forgetting, where the model loses its pre-trained knowledge. Batch size is constrained by GPU memory - if you get out-of-memory errors, reduce the batch size. Smaller batches (8-16) give noisier gradient updates but sometimes generalize better, while larger batches (32-64) give more stable updates and fewer steps per epoch but can generalize slightly worse unless the learning rate is retuned. The number of epochs depends on dataset size - smaller datasets need more epochs to converge, but too many epochs cause overfitting. Weight decay shrinks the weights a little at every update (the Trainer's default AdamW optimizer applies it as decoupled weight decay, which plays the same regularizing role as an L2 penalty); it helps prevent overfitting, but too much can underfit. Warmup ratio gradually increases the learning rate from near-zero to the target value over the first X% of training steps, which stabilizes early training while the randomly initialized classification head is still producing large, noisy gradients.

The version compatibility fallback exists because the `TrainingArguments` signature has changed across transformers releases. Older versions lacked `evaluation_strategy`, and recent versions renamed it to `eval_strategy`, so the `TypeError` can be raised on either side. In the recorded run below the fallback was triggered, which means that run used the legacy arguments: no per-epoch evaluation, no warmup, and no best-model reloading.

Experiment A (conservative) uses safe defaults that should work reliably but might not achieve peak performance. Experiment B (aggressive) raises the learning rate to 5e-5 and trains for four epochs, which might find better optima but risks overfitting. Experiment C (fast baseline) uses DistilBERT for speed comparison - if DistilBERT matches ClinicalBERT performance, you might prefer it for production due to faster inference. The leaderboard at the end tells you which approach worked best on your specific data. Sometimes the aggressive approach wins, sometimes the conservative one does - it depends on your data characteristics and how much overfitting risk you face. The F1 score is the key metric here because it balances precision and recall in a single number. Note that with `average="weighted"` each class's F1 is weighted by its number of samples, so frequent classes dominate the score; macro-averaged F1 is the variant that gives every class equal weight.
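To make the difference between weighted and macro averaging concrete, here is a small pure-Python sketch on a made-up two-class example. The labels, predictions, and helper names are assumptions for illustration only. In the notebook itself, the same numbers would come from scikit-learn's `precision_recall_fscore_support` with the corresponding `average` argument.

```python
def per_class_f1(y_true, y_pred, num_classes):
    """Return a (f1, support) pair for each class, computed one-vs-rest."""
    results = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        results.append((f1, tp + fn))  # support = number of true samples of class c
    return results

def macro_f1(scores):
    """Every class counts equally."""
    return sum(f1 for f1, _ in scores) / len(scores)

def weighted_f1(scores):
    """Each class counts in proportion to its support."""
    total = sum(support for _, support in scores)
    return sum(f1 * support for f1, support in scores) / total

# Hypothetical imbalanced example: 8 samples of class 0, 2 of class 1,
# with one of the two minority samples misclassified.
y_true = [0] * 8 + [1, 1]
y_pred = [0] * 8 + [1, 0]

scores = per_class_f1(y_true, y_pred, num_classes=2)
print("per-class (f1, support):", [(round(f, 4), s) for f, s in scores])
print(f"macro F1:    {macro_f1(scores):.4f}")
print(f"weighted F1: {weighted_f1(scores):.4f}")
```

Here the weighted score is higher than the macro score because the well-predicted majority class carries most of the weight. When minority-class performance matters, it is worth reporting both.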

In [None]:
# --- 5) Fine-tuning (Three Experiments) [version-compatible] ---
import os
os.environ["WANDB_DISABLED"] = "true"

import numpy as np
import torch
from collections import OrderedDict
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from torch.nn import CrossEntropyLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1) Metrics: binary vs multiclass handled automatically
num_labels = len(np.unique(y_train))
avg_type = "binary" if num_labels == 2 else "weighted"
print(f"[Fine-tune] Detected {num_labels} classes → metrics average='{avg_type}'")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    p, r, f, _ = precision_recall_fscore_support(labels, preds, average=avg_type)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f}

# 2) Class weights for imbalanced data (size == num_labels)
counts = np.bincount(y_train, minlength=num_labels)
# Heuristic: inverse frequency, scaled so the most frequent class has weight 1.0
# and rarer classes get proportionally larger weights
weights = counts.max() / np.maximum(counts, 1)
class_weights = torch.tensor(weights, dtype=torch.float32, device=device)
print(f"[Fine-tune] Class weights: {class_weights.tolist()}")

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**{k: v for k, v in inputs.items() if k != "labels"})
        logits = outputs.get("logits")
        loss_fct = CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# 3) Helper: tokenizer already defined above. Re-tokenize per max_length
def tokenize_texts(texts, max_length=160):
    return tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )

# 4) Version-compatible TrainingArguments factory
import inspect

def make_training_args(name, batch_size, lr, epochs, weight_decay, warmup_ratio):
    kwargs_modern = dict(
        output_dir=f"./runs/{name}",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=lr,
        num_train_epochs=epochs,
        weight_decay=weight_decay,
        warmup_ratio=warmup_ratio,
        logging_steps=50,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        fp16=torch.cuda.is_available(),
        report_to=[]
    )
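    try:
        # Try the `evaluation_strategy` signature first
        return TrainingArguments(**kwargs_modern)
    except TypeError:
        # Raised when the installed transformers rejects a keyword above: older
        # releases lack evaluation_strategy/save_strategy, and recent releases
        # renamed evaluation_strategy to eval_strategy. The fallback below trains
        # without per-epoch evaluation, warmup, or best-model reloading.
        print("[Fine-tune] Using legacy TrainingArguments fallback.")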
        kwargs_legacy = dict(
            output_dir=f"./runs/{name}",
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            learning_rate=lr,
            num_train_epochs=epochs,
            weight_decay=weight_decay,
            logging_steps=50,
            do_eval=True,          # legacy way to enable evaluation
            save_steps=500,        # periodic saving
            overwrite_output_dir=True,
            fp16=torch.cuda.is_available()
        )
        return TrainingArguments(**kwargs_legacy)

def run_experiment(name, backbone, batch_size=16, lr=2e-5, epochs=3,
                   weight_decay=0.01, warmup_ratio=0.1, max_length=160):
    # Re-tokenize for this max_length
    tr = tokenize_texts(X_train, max_length=max_length)
    va = tokenize_texts(X_val,   max_length=max_length)

    train_ds_local = Dataset.from_dict({
        "input_ids": tr["input_ids"],
        "attention_mask": tr["attention_mask"],
        "labels": torch.tensor(y_train.to_numpy(), dtype=torch.long)   # <-- use .to_numpy()
    })
    val_ds_local = Dataset.from_dict({
        "input_ids": va["input_ids"],
        "attention_mask": va["attention_mask"],
        "labels": torch.tensor(y_val.to_numpy(), dtype=torch.long)     # <-- use .to_numpy()
    })


    # Load backbone with correct num_labels
    model = AutoModelForSequenceClassification.from_pretrained(
        backbone, num_labels=num_labels
    ).to(device)

    args = make_training_args(
        name=name, batch_size=batch_size, lr=lr, epochs=epochs,
        weight_decay=weight_decay, warmup_ratio=warmup_ratio
    )

    trainer = WeightedTrainer(
        model=model,
        args=args,
        train_dataset=train_ds_local,
        eval_dataset=val_ds_local,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer
    )

    trainer.train()
    metrics = trainer.evaluate()
    print(f"\n>>> {name} results: {metrics}\n")
    return metrics, trainer

# --- Define backbones (already set earlier) ---
CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
DISTIL_BERT   = "distilbert-base-uncased"

results = OrderedDict()

# Exp-A: ClinicalBERT, conservative LR, small batch
results['expA_clinicalbert_bs16_lr2e-5_ep3'] = run_experiment(
    name="expA_clinicalbert_bs16_lr2e-5_ep3",
    backbone=CLINICAL_BERT,
    batch_size=16, lr=2e-5, epochs=3,
    weight_decay=0.01, warmup_ratio=0.1, max_length=160
)

# Exp-B: ClinicalBERT, slightly higher LR, more epochs
results['expB_clinicalbert_bs16_lr5e-5_ep4'] = run_experiment(
    name="expB_clinicalbert_bs16_lr5e-5_ep4",
    backbone=CLINICAL_BERT,
    batch_size=16, lr=5e-5, epochs=4,
    weight_decay=0.01, warmup_ratio=0.06, max_length=160
)

# Exp-C: DistilBERT fast baseline
results['expC_distilbert_bs32_lr3e-5_ep3'] = run_experiment(
    name="expC_distilbert_bs32_lr3e-5_ep3",
    backbone=DISTIL_BERT,
    batch_size=32, lr=3e-5, epochs=3,
    weight_decay=0.01, warmup_ratio=0.1, max_length=128
)

# Leaderboard
board = []
for k,(m,_t) in results.items():
    board.append((k, m.get('eval_f1', float('nan')), m.get('eval_accuracy', float('nan'))))
board = sorted(board, key=lambda x: x[1], reverse=True)
print("\nLeaderboard (by F1):")
for name, f1, acc in board:
    print(f"{name:35s}  F1={f1:.4f}  Acc={acc:.4f}")


[Fine-tune] Detected 7 classes → metrics average='weighted'
[Fine-tune] Class weights: [4.254474639892578, 5.886537551879883, 1.0609430074691772, 1.0, 15.16705322265625, 6.31594181060791, 1.5343269109725952]


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


[Fine-tune] Using legacy TrainingArguments fallback.


| Step | Training Loss |
|------|---------------|
| 50   | 1.9126 |
| 100  | 1.667  |
| 150  | 1.3683 |
| 200  | 1.2414 |
| 250  | 1.112  |
| 300  | 1.0793 |
| 350  | 1.0089 |
| 400  | 0.9846 |
| 450  | 0.9603 |
| 500  | 0.8787 |


## 5b) Fast Fine-Tuning (Optimized for Speed)

This section provides optimized training configurations to significantly speed up fine-tuning while maintaining good performance.


In [None]:
# ============================================================================
# FAST FINE-TUNING - Optimized for Speed
# ============================================================================
# This version uses multiple optimizations to train much faster
# ============================================================================

import os
os.environ["WANDB_DISABLED"] = "true"

import numpy as np
import torch
from collections import OrderedDict
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from torch.nn import CrossEntropyLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Speed optimizations configuration
SPEED_OPTIMIZATIONS = {
    'use_gradient_checkpointing': True,  # Saves memory (allows larger batches) at the cost of extra compute
    'gradient_accumulation_steps': 4,    # Simulate larger batch size
    'dataloader_num_workers': 4,         # Parallel data loading
    'dataloader_pin_memory': True,        # Faster GPU transfer
    'max_length': 128,                    # Reduced from 160 (faster)
    'logging_steps': 100,                 # Less frequent logging
    'eval_steps': 500,                    # Less frequent evaluation
    'save_steps': 1000,                   # Less frequent saving
    'fp16': True,                         # Mixed precision (applied only when CUDA is available)
    'optim': 'adamw_torch',               # PyTorch's AdamW implementation
}

print("=" * 80)
print("FAST FINE-TUNING CONFIGURATION")
print("=" * 80)
print("Speed Optimizations Enabled:")
for key, value in SPEED_OPTIMIZATIONS.items():
    print(f"  {key}: {value}")
print("=" * 80)

# Reuse metrics and class weights from previous cell
if 'num_labels' not in globals():
    num_labels = len(np.unique(y_train))
avg_type = "binary" if num_labels == 2 else "weighted"

if 'compute_metrics' not in globals():
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        from sklearn.metrics import accuracy_score, precision_recall_fscore_support
        p, r, f, _ = precision_recall_fscore_support(labels, preds, average=avg_type)
        acc = accuracy_score(labels, preds)
        return {"accuracy": acc, "precision": p, "recall": r, "f1": f}

if 'class_weights' not in globals():
    counts = np.bincount(y_train, minlength=num_labels)
    weights = counts.max() / np.maximum(counts, 1)
    class_weights = torch.tensor(weights, dtype=torch.float32, device=device)

if 'WeightedTrainer' not in globals():
    class WeightedTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
            labels = inputs.get("labels")
            outputs = model(**{k: v for k, v in inputs.items() if k != "labels"})
            logits = outputs.get("logits")
            loss_fct = CrossEntropyLoss(weight=class_weights)
            loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
            return (loss, outputs) if return_outputs else loss

# Fast tokenization with reduced length
def tokenize_texts_fast(texts, max_length=128):
    return tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )

def make_fast_training_args(name, batch_size, lr, epochs, weight_decay=0.01, warmup_ratio=0.1):
    """Create optimized training arguments for speed"""
    effective_batch_size = batch_size * SPEED_OPTIMIZATIONS['gradient_accumulation_steps']

    kwargs = dict(
        output_dir=f"./runs_fast/{name}",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 2,  # Larger eval batch (faster)
        gradient_accumulation_steps=SPEED_OPTIMIZATIONS['gradient_accumulation_steps'],
        learning_rate=lr,
        num_train_epochs=epochs,
        weight_decay=weight_decay,
        warmup_ratio=warmup_ratio,
        logging_steps=SPEED_OPTIMIZATIONS['logging_steps'],
        eval_steps=SPEED_OPTIMIZATIONS['eval_steps'],
        save_steps=SPEED_OPTIMIZATIONS['save_steps'],
        evaluation_strategy="steps",  # Evaluate by steps, not epoch (faster)
        save_strategy="steps",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        fp16=SPEED_OPTIMIZATIONS['fp16'] and torch.cuda.is_available(),
        dataloader_num_workers=SPEED_OPTIMIZATIONS['dataloader_num_workers'],
        dataloader_pin_memory=SPEED_OPTIMIZATIONS['dataloader_pin_memory'],
        optim=SPEED_OPTIMIZATIONS['optim'],
        report_to=[],
        # Disable unnecessary features for speed
        save_total_limit=2,  # Keep only 2 checkpoints
        prediction_loss_only=False,
    )

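    try:
        return TrainingArguments(**kwargs)
    except TypeError:
        pass

    try:
        # Recent transformers releases renamed `evaluation_strategy` to `eval_strategy`;
        # retry with the new keyword so step-based evaluation stays enabled
        kwargs_renamed = dict(kwargs)
        kwargs_renamed['eval_strategy'] = kwargs_renamed.pop('evaluation_strategy')
        return TrainingArguments(**kwargs_renamed)
    except TypeError:
        # Last resort for versions that accept neither strategy keyword
        kwargs_legacy = {k: v for k, v in kwargs.items() if k not in ['evaluation_strategy', 'save_strategy']}
        kwargs_legacy['do_eval'] = True
        return TrainingArguments(**kwargs_legacy)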

def run_fast_experiment(name, backbone, batch_size=32, lr=2e-5, epochs=3,
                       weight_decay=0.01, warmup_ratio=0.1):
    """Run optimized fast experiment"""
    max_length = SPEED_OPTIMIZATIONS['max_length']

    print(f"\n{'='*80}")
    print(f"FAST EXPERIMENT: {name}")
    print(f"{'='*80}")
    print(f"Backbone: {backbone}")
    print(f"Batch size: {batch_size} (effective: {batch_size * SPEED_OPTIMIZATIONS['gradient_accumulation_steps']} with accumulation)")
    print(f"Max length: {max_length} (reduced for speed)")
    print(f"Epochs: {epochs}")
    print(f"{'='*80}\n")

    # Tokenize with reduced length
    tr = tokenize_texts_fast(X_train, max_length=max_length)
    va = tokenize_texts_fast(X_val, max_length=max_length)

    train_ds_local = Dataset.from_dict({
        "input_ids": tr["input_ids"],
        "attention_mask": tr["attention_mask"],
        "labels": torch.tensor(y_train.to_numpy(), dtype=torch.long)
    })
    val_ds_local = Dataset.from_dict({
        "input_ids": va["input_ids"],
        "attention_mask": va["attention_mask"],
        "labels": torch.tensor(y_val.to_numpy(), dtype=torch.long)
    })

    # Load model
    model = AutoModelForSequenceClassification.from_pretrained(
        backbone, num_labels=num_labels
    ).to(device)

    # Enable gradient checkpointing to save memory (allows larger batches)
    if SPEED_OPTIMIZATIONS['use_gradient_checkpointing']:
        if hasattr(model, 'gradient_checkpointing_enable'):
            model.gradient_checkpointing_enable()
            print("✅ Gradient checkpointing enabled (saves memory)")

    args = make_fast_training_args(
        name=name, batch_size=batch_size, lr=lr, epochs=epochs,
        weight_decay=weight_decay, warmup_ratio=warmup_ratio
    )

    trainer = WeightedTrainer(
        model=model,
        args=args,
        train_dataset=train_ds_local,
        eval_dataset=val_ds_local,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer
    )

    import time
    start_time = time.time()
    trainer.train()
    train_time = time.time() - start_time

    metrics = trainer.evaluate()
    total_time = time.time() - start_time

    print(f"\n>>> {name} results: {metrics}")
    print(f">>> Training time: {train_time/60:.1f} minutes ({train_time:.0f} seconds)")
    print(f">>> Total time: {total_time/60:.1f} minutes\n")

    return metrics, trainer, total_time

# Run fast experiments
CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
DISTIL_BERT = "distilbert-base-uncased"

fast_results = OrderedDict()

print("\n" + "=" * 80)
print("STARTING FAST EXPERIMENTS")
print("=" * 80)
print("These experiments use speed optimizations:")
print("  - Larger batch sizes with gradient accumulation")
print("  - Reduced sequence length (128 vs 160)")
print("  - Optimized data loading")
print("  - Less frequent evaluation/saving")
print("  - Gradient checkpointing")
print("=" * 80)

# Fast Exp 1: DistilBERT (fastest model)
fast_results['fast_exp1_distilbert_bs32_ep2'] = run_fast_experiment(
    name="fast_exp1_distilbert_bs32_ep2",
    backbone=DISTIL_BERT,
    batch_size=32,
    lr=3e-5,
    epochs=2,  # Fewer epochs for speed
    weight_decay=0.01,
    warmup_ratio=0.1
)

# Fast Exp 2: ClinicalBERT with optimizations
fast_results['fast_exp2_clinicalbert_bs32_ep3'] = run_fast_experiment(
    name="fast_exp2_clinicalbert_bs32_ep3",
    backbone=CLINICAL_BERT,
    batch_size=32,  # Larger batch
    lr=2e-5,
    epochs=3,
    weight_decay=0.01,
    warmup_ratio=0.1
)

# Fast Exp 3: ClinicalBERT with even larger batch (if memory allows)
# Uncomment if you have enough GPU memory
# fast_results['fast_exp3_clinicalbert_bs64_ep3'] = run_fast_experiment(
#     name="fast_exp3_clinicalbert_bs64_ep3",
#     backbone=CLINICAL_BERT,
#     batch_size=64,  # Very large batch
#     lr=2e-5,
#     epochs=3,
#     weight_decay=0.01,
#     warmup_ratio=0.1
# )

# Leaderboard
print("\n" + "=" * 80)
print("FAST EXPERIMENTS LEADERBOARD")
print("=" * 80)
board = []
for k, (m, t, time_taken) in fast_results.items():
    board.append((k, m.get('eval_f1', float('nan')), m.get('eval_accuracy', float('nan')), time_taken))
board = sorted(board, key=lambda x: x[1], reverse=True)
print(f"{'Experiment':40s}  F1 Score    Accuracy    Time (min)")
print("-" * 80)
for name, f1, acc, time_taken in board:
    print(f"{name:40s}  F1={f1:.4f}  Acc={acc:.4f}  {time_taken/60:.1f} min")
print("=" * 80)


### Speed Optimization Summary

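The fast fine-tuning configuration combines the following settings. Not all of them reduce wall-clock time directly; some trade compute for memory so that larger batches fit.

1. **Larger Batch Sizes**: Batch size 32-64 instead of 16, so each epoch takes fewer steps
2. **Gradient Accumulation**: Accumulates gradients over 4 batches before each optimizer update, simulating a larger batch without extra memory; it does not by itself reduce compute
3. **Reduced Sequence Length**: 128 tokens instead of 160 (20% fewer tokens per sample)
4. **Optimized Data Loading**: Parallel workers and pinned memory
5. **Less Frequent Evaluation**: Every 500 optimizer steps instead of every epoch
6. **Gradient Checkpointing**: Saves memory and allows larger batches, at the cost of recomputing activations during the backward pass
7. **Mixed Precision (FP16)**: Enabled whenever a GPU is available, as in the standard run
8. **Fewer Checkpoints**: `save_total_limit=2` caps the number of checkpoints kept on disk

**Expected Speed Improvement**: roughly 2-4x faster than standard training. This is an estimate, so compare it against the times recorded in the leaderboard.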
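How batch size, gradient accumulation, and `eval_steps` interact is easier to see with numbers. The sketch below is a back-of-the-envelope calculation and not part of the training code. It assumes a single device and simple ceiling rounding (the Trainer's exact rounding may differ slightly), and `steps_summary` is a hypothetical helper. The inputs mirror the first fast experiment: 42,144 training samples, batch size 32, accumulation over 4 batches, 2 epochs, evaluation every 500 optimizer steps.

```python
import math

def steps_summary(num_samples, per_device_batch, accumulation_steps, epochs, eval_every):
    """Estimate optimizer updates and evaluation count for one training run."""
    # Forward/backward passes per epoch (one per batch)
    micro_batches = math.ceil(num_samples / per_device_batch)
    # Optimizer updates per epoch (one per `accumulation_steps` batches)
    updates_per_epoch = math.ceil(micro_batches / accumulation_steps)
    total_updates = updates_per_epoch * epochs
    return {
        "effective_batch_size": per_device_batch * accumulation_steps,
        "micro_batches_per_epoch": micro_batches,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        # Step-based evaluation is counted in optimizer updates
        "evaluations": total_updates // eval_every,
    }

summary = steps_summary(
    num_samples=42144, per_device_batch=32,
    accumulation_steps=4, epochs=2, eval_every=500,
)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under these assumptions the run performs only a few hundred optimizer updates in total, so an `eval_steps` of 500 yields a single mid-training evaluation and a `save_steps` of 1000 is never reached. For short runs, lowering both values gives `load_best_model_at_end` more checkpoints to choose from.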


In [None]:
# Record all experiment results to Excel log file
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from datetime import datetime
import re

# Create a new workbook
wb = Workbook()
ws = wb.active
ws.title = "Experiment_Logs"

# Define header style
header_fill = PatternFill(start_color="366092", end_color="366092", fill_type="solid")
header_font = Font(bold=True, color="FFFFFF")

# Define headers
headers = [
    "Experiment_ID",
    "Model_Backbone",
    "Batch_Size",
    "Learning_Rate",
    "Epochs",
    "Weight_Decay",
    "Warmup_Ratio",
    "Max_Length",
    "Accuracy",
    "F1_Score",
    "Precision",
    "Recall"
]

# Write headers
for col_idx, header in enumerate(headers, 1):
    cell = ws.cell(row=1, column=col_idx, value=header)
    cell.fill = header_fill
    cell.font = header_font
    cell.alignment = Alignment(horizontal="center", vertical="center")

# Function to parse experiment name and extract hyperparameters
def parse_experiment_name(exp_name):
    """Extract hyperparameters from experiment name"""
    params = {
        "backbone": "Unknown",
        "batch_size": None,
        "learning_rate": None,
        "epochs": None
    }

    # Extract backbone
    if "clinicalbert" in exp_name.lower():
        params["backbone"] = "ClinicalBERT"
    elif "distilbert" in exp_name.lower():
        params["backbone"] = "DistilBERT"

    # Extract batch size (bs16, bs32, etc.)
    bs_match = re.search(r'bs(\d+)', exp_name.lower())
    if bs_match:
        params["batch_size"] = int(bs_match.group(1))

    # Extract learning rate (lr2e-5, lr5e-5, etc.)
    lr_match = re.search(r'lr([\d.e-]+)', exp_name.lower())
    if lr_match:
        lr_str = lr_match.group(1)
        # float() parses scientific notation such as "2e-5" directly and avoids
        # the rounding error introduced by multiplying by 10 ** exponent
        params["learning_rate"] = float(lr_str)

    # Extract epochs (ep3, ep4, etc.)
    ep_match = re.search(r'ep(\d+)', exp_name.lower())
    if ep_match:
        params["epochs"] = int(ep_match.group(1))

    return params

# Store experiment configurations (you may need to adjust these based on your actual runs)
experiment_configs = {
    "expA_clinicalbert_bs16_lr2e-5_ep3": {
        "weight_decay": 0.01,
        "warmup_ratio": 0.1,
        "max_length": 160
    },
    "expB_clinicalbert_bs16_lr5e-5_ep4": {
        "weight_decay": 0.01,
        "warmup_ratio": 0.06,
        "max_length": 160
    },
    "expC_distilbert_bs32_lr3e-5_ep3": {
        "weight_decay": 0.01,
        "warmup_ratio": 0.1,
        "max_length": 128
    }
}

# Write experiment data
row = 2
for exp_name, (metrics, trainer) in results.items():
    # Parse experiment name
    parsed = parse_experiment_name(exp_name)
    config = experiment_configs.get(exp_name, {})

    # Write data
    ws.cell(row=row, column=1, value=exp_name)  # Experiment_ID
    ws.cell(row=row, column=2, value=parsed["backbone"])  # Model_Backbone
    ws.cell(row=row, column=3, value=parsed["batch_size"])  # Batch_Size
    ws.cell(row=row, column=4, value=parsed["learning_rate"])  # Learning_Rate
    ws.cell(row=row, column=5, value=parsed["epochs"])  # Epochs
    ws.cell(row=row, column=6, value=config.get("weight_decay", "N/A"))  # Weight_Decay
    ws.cell(row=row, column=7, value=config.get("warmup_ratio", "N/A"))  # Warmup_Ratio
    ws.cell(row=row, column=8, value=config.get("max_length", "N/A"))  # Max_Length
    ws.cell(row=row, column=9, value=metrics.get("eval_accuracy", "N/A"))  # Accuracy
    ws.cell(row=row, column=10, value=metrics.get("eval_f1", "N/A"))  # F1_Score
    ws.cell(row=row, column=11, value=metrics.get("eval_precision", "N/A"))  # Precision
    ws.cell(row=row, column=12, value=metrics.get("eval_recall", "N/A"))  # Recall

    row += 1

# Auto-adjust column widths
for col in ws.columns:
    max_length = 0
    col_letter = col[0].column_letter
    for cell in col:
        max_length = max(max_length, len(str(cell.value)))
    adjusted_width = min(max_length + 2, 30)
    ws.column_dimensions[col_letter].width = adjusted_width

# Save the file
excel_filename = f"Exercise_F2_Experiment_Logs_{datetime.now().strftime('%Y%m%d_%H%M%S')}.xlsx"
wb.save(excel_filename)

print(f"✅ Experiment logs saved to: {excel_filename}")
print(f"   Total experiments logged: {len(results)}")
print(f"   Columns: {', '.join(headers)}")

# Automatically download the file
try:
    from google.colab import files
    files.download(excel_filename)
    print(f"✅ File automatically downloaded: {excel_filename}")
except ImportError:
    print("Note: Not running in Google Colab. File saved locally.")
except Exception as e:
    print(f"Note: Could not auto-download. File saved at: {excel_filename}")
    print(f"   Error: {e}")

## Epoch Analysis: Determine Feasible Number of Epochs

This cell analyzes your dataset and training configuration to determine how many epochs are feasible based on:
- Dataset size
- Training time per epoch
- GPU/CPU constraints
- Best practices for fine-tuning


In [None]:
# ============================================================================
# Epoch Feasibility Analysis
# ============================================================================
# Analyzes dataset and training configuration to determine optimal epochs
# ============================================================================

import math
import time

print("=" * 80)
print("EPOCH FEASIBILITY ANALYSIS")
print("=" * 80)

# Get dataset information
if 'X_train' in globals() and 'y_train' in globals():
    train_size = len(X_train)
    val_size = len(X_val) if 'X_val' in globals() else 0
    print(f"\n📊 Dataset Information:")
    print(f"  Training samples: {train_size:,}")
    print(f"  Validation samples: {val_size:,}")
    print(f"  Total training data: {train_size + val_size:,}")
else:
    print("⚠️  Training data not found. Please run data split cells first.")
    train_size = 0
    val_size = 0

# Get device information
if 'device' in globals():
    is_cuda = torch.cuda.is_available() and "cuda" in str(device).lower()
    device_type = "GPU (CUDA)" if is_cuda else "CPU"
    print(f"\n🖥️  Device: {device_type}")

    if is_cuda:
        try:
            gpu_name = torch.cuda.get_device_name(0)
            gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
            print(f"  GPU Name: {gpu_name}")
            print(f"  GPU Memory: {gpu_memory:.2f} GB")
        except Exception:
            print("  GPU details unavailable")
else:
    device_type = "Unknown"
    is_cuda = False
    print("\n⚠️  Device information not available")

# Training configuration analysis
print(f"\n⚙️  Training Configuration Analysis:")

# Common batch sizes and their impact
batch_sizes = [8, 16, 32]
print(f"\nBatch Size Impact on Training:")
for bs in batch_sizes:
    steps_per_epoch = math.ceil(train_size / bs) if train_size > 0 else 0
    print(f"  Batch size {bs:2d}: ~{steps_per_epoch:,} steps per epoch")

# Estimate training time per epoch (rough estimates)
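print(f"\n⏱️  Estimated Training Time Per Epoch:")
print(f"  (Based on typical transformer fine-tuning speeds)")

if is_cuda:
    # GPU estimate: a fixed per-step time is a rough placeholder, since real step
    # time grows with batch size and sequence length; time a few steps to calibrate
    time_per_step_gpu = 0.5  # assumed seconds per step for ClinicalBERT on GPU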
    for bs in batch_sizes:
        steps = math.ceil(train_size / bs) if train_size > 0 else 0
        time_per_epoch = (steps * time_per_step_gpu) / 60  # in minutes
        print(f"  Batch size {bs:2d}: ~{time_per_epoch:.1f} minutes per epoch ({steps:,} steps)")
else:
    # CPU estimates (much slower)
    time_per_step_cpu = 5.0  # seconds per step (typical for ClinicalBERT on CPU)
    for bs in batch_sizes:
        steps = math.ceil(train_size / bs) if train_size > 0 else 0
        time_per_epoch = (steps * time_per_step_cpu) / 60  # in minutes
        print(f"  Batch size {bs:2d}: ~{time_per_epoch:.1f} minutes per epoch ({steps:,} steps)")

# Recommended epochs based on dataset size
print(f"\n📈 Recommended Epochs Based on Dataset Size:")

if train_size > 0:
    if train_size < 1000:
        recommended_min = 10
        recommended_max = 20
        reason = "Small dataset - more epochs needed to learn patterns"
    elif train_size < 10000:
        recommended_min = 5
        recommended_max = 10
        reason = "Medium dataset - moderate epochs sufficient"
    elif train_size < 50000:
        recommended_min = 3
        recommended_max = 7
        reason = "Large dataset - fewer epochs needed, risk of overfitting"
    else:
        recommended_min = 2
        recommended_max = 5
        reason = "Very large dataset - minimal epochs, focus on regularization"

    print(f"  Dataset size: {train_size:,} samples")
    print(f"  Recommended range: {recommended_min}-{recommended_max} epochs")
    print(f"  Reason: {reason}")
else:
    recommended_min = 3
    recommended_max = 5

# Best practices for fine-tuning
print(f"\n💡 Best Practices for Fine-Tuning Transformers:")
print(f"  1. Start with 3-5 epochs for initial experiments")
print(f"  2. Use early stopping to prevent overfitting")
print(f"  3. Monitor validation loss - stop if it starts increasing")
print(f"  4. For hyperparameter search: 2-3 epochs per trial (faster)")
print(f"  5. For final model: 3-7 epochs (depending on dataset size)")
print(f"  6. Maximum practical: 10-15 epochs (rarely needed)")

# Calculate feasible epochs based on time constraints
print(f"\n⏰ Feasible Epochs Based on Time Constraints:")

time_constraints = [
    (30, "30 minutes"),
    (60, "1 hour"),
    (120, "2 hours"),
    (240, "4 hours"),
    (480, "8 hours"),
    (1440, "24 hours")
]

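if train_size > 0:
    typical_batch = 16  # Most common batch size
    steps = math.ceil(train_size / typical_batch)
    # Use the per-step estimate that matches the detected device,
    # so the table is also printed when running on CPU
    time_per_step = time_per_step_gpu if is_cuda else time_per_step_cpu
    time_per_epoch = (steps * time_per_step) / 60  # minutes

    print(f"  Using batch size {typical_batch}, ~{time_per_epoch:.1f} min per epoch:")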
    for max_time, label in time_constraints:
        max_epochs = int(max_time / time_per_epoch)
        if max_epochs > 0:
            print(f"    {label:12s}: Up to {max_epochs} epochs")

# Current configuration summary
print(f"\n📋 Current Configuration Summary:")
if 'results' in globals():
    print(f"  Experiments run: {len(results)}")
    for exp_name in results.keys():
        # Extract epochs from experiment name
        if 'ep' in exp_name.lower():
            import re
            ep_match = re.search(r'ep(\d+)', exp_name.lower())
            if ep_match:
                epochs_used = int(ep_match.group(1))
                print(f"    {exp_name}: {epochs_used} epochs")
else:
    print(f"  No experiments run yet")

# Final recommendations
print(f"\n" + "=" * 80)
print("🎯 FINAL RECOMMENDATIONS")
print("=" * 80)
print(f"Recommended epoch range: {recommended_min}-{recommended_max} epochs")
print(f"\nFor different scenarios:")
print(f"  • Quick experiments/hyperparameter search: 2-3 epochs")
print(f"  • Standard fine-tuning: {recommended_min}-{recommended_max} epochs")
print(f"  • Final model training: {min(recommended_max + 2, 10)} epochs (with early stopping)")
print(f"  • Maximum safe: 10-15 epochs (monitor for overfitting)")

if train_size > 0:
    print(f"\n💾 With your dataset size ({train_size:,} samples):")
    print(f"   Optimal: {recommended_min}-{recommended_max} epochs")
    print(f"   Maximum practical: {min(recommended_max * 2, 15)} epochs")
    print(f"   Use early stopping to automatically find best epoch")

print("=" * 80)

# Store recommendations
EPOCH_RECOMMENDATIONS = {
    'min_epochs': recommended_min,
    'max_epochs': recommended_max,
    'optimal_range': f"{recommended_min}-{recommended_max}",
    'quick_experiments': '2-3',
    'final_training': min(recommended_max + 2, 10),
    'maximum_safe': 15,
    'dataset_size': train_size,
    'device_type': device_type
}

print(f"\n✅ Recommendations stored in: EPOCH_RECOMMENDATIONS")
print(f"   Access with: EPOCH_RECOMMENDATIONS['optimal_range']")


## 6) Eval (Pick Best and Run Inference)

**Function Description:**
This cell identifies the best-performing experiment from your fine-tuning runs, saves that model to disk for future use, and demonstrates how to make predictions on new text. It shows you the complete inference pipeline from raw text to predicted class and confidence scores.

**Syntax Explanation:**
The selection logic iterates through the results dictionary using `.items()`, which gives you both the experiment name and its (metrics, trainer) tuple. For each experiment, I check if its `eval_f1` score beats the current best, and if so, update `best_f1` and `best_name` and store the trainer object. After finding the winner, `trainer.save_model()` writes the model weights to disk at the specified path, and `tokenizer.save_pretrained()` saves the tokenizer configuration alongside it. The `predict()` function encapsulates the inference pipeline. It loads the saved tokenizer with `AutoTokenizer.from_pretrained()`, loads the saved model with `AutoModelForSequenceClassification.from_pretrained()`, and moves the model to the correct device with `.to(device)`. It then tokenizes the input texts using the same parameters as training and wraps the forward pass in `torch.no_grad()` to disable gradient computation, which speeds up inference and saves memory. From the model outputs it extracts the logits, applies `torch.argmax()` to get the predicted classes, and applies `torch.softmax()` to convert logits to probabilities. It returns the predictions together with the probability of class 1 (stressed) for each text, after moving both from GPU to CPU and converting them to numpy arrays. For the demo, I define three test sentences covering different scenarios (clearly calm, clearly stressed, ambiguous) and call predict on them. The output loop uses zip to iterate over texts, predictions, and probabilities simultaneously, formatting each as a readable string with the predicted label and the stressed-class probability.

**Inputs:**
This cell uses the results dictionary populated in Section 5, which contains metrics and trainer objects from all three experiments. The predict function takes a list of text strings and optionally a model directory path.

**Outputs:**
You'll see a message identifying which experiment won and what its F1 score was, followed by the save directory path. Then you'll see three prediction lines showing the predicted class (as both a label and a number), the probability the model assigns to the stressed class, and the original text. For example: "[stressed(1) p=0.873] My chest is tight and I cannot focus, I think I am very stressed."

**Code Flow:**
The flow divides into three phases. First, iterate through all experiment results to find the highest F1 score and corresponding trainer. Second, save both the model and tokenizer to disk. Third, demonstrate inference by defining a predict function, creating test samples, calling predict, and formatting the output. The save and load operations prove that you can persist your model and reload it later without retraining.

**Comments and Observations:**
Saving the model is important because fine-tuning takes significant time and compute resources - you don't want to retrain every time you need to make predictions. The saved directory contains multiple files including model weights (pytorch_model.bin, or model.safetensors in newer library versions), model configuration (config.json), and tokenizer files (vocab.txt, tokenizer_config.json). Together these files fully specify your trained model and can be loaded on any machine with compatible library versions. The predict function is a reasonable starting point for a web API or batch processing script, but note that it reloads the tokenizer and model from disk on every call; for production use you would load them once and reuse them.

The `torch.no_grad()` context manager is important for inference because it tells PyTorch not to track gradients, which reduces memory usage (intermediate activations are not kept for backpropagation) and speeds up computation. The difference between logits and probabilities matters: logits are raw scores that can be any value from negative to positive infinity, while probabilities are normalized to sum to 1.0 and range from 0 to 1. Softmax converts logits to probabilities using the formula exp(logit_i) / sum(exp(logit_j)). Note that `predict()` returns the probability of class 1 (stressed), not the probability of whichever class was predicted: 0.95 means the model is highly confident the text is stressed, 0.05 means it is highly confident the text is not stressed, and values near 0.5 mean it is barely confident either way. In production, you might set a threshold like 0.7 on the predicted class's probability and only act on predictions above that threshold, sending lower-confidence predictions to human review.

The three test sentences demonstrate different difficulty levels. The first ("calm and in control") should be easy for the model - clear language indicating low stress. The second ("chest is tight, cannot focus, very stressed") contains multiple stress indicators and explicitly mentions stress, so the model should confidently predict stressed. The third ("workload is heavy but manageable") is ambiguous - "heavy" suggests stress but "manageable" suggests coping, so this tests whether the model can handle nuance. If the model gets the easy cases right but is uncertain on the ambiguous one, that is reasonable behavior and suggests it is not just matching keywords.

You can expand this demo by adding more test cases, especially edge cases like very short text ("I'm fine"), very long text (multiple paragraphs), or text with mixed signals. The model architecture (ClinicalBERT vs DistilBERT) affects inference speed - DistilBERT has half as many transformer layers and is considerably faster for the same input, which matters if you're processing millions of texts.
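
To make the logits-versus-probabilities distinction concrete, here is a minimal sketch of softmax and of the review-threshold idea mentioned above. It uses only the standard library. The function names `softmax` and `route_prediction` and the example logits are illustrative assumptions, not part of the notebook's code.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the resulting probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_prediction(logits, threshold=0.7):
    """Return (predicted_class, confidence, action) for one example.

    Hypothetical helper: predictions whose top probability falls below
    `threshold` are flagged for human review instead of being acted on.
    """
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[pred]
    action = "accept" if confidence >= threshold else "human_review"
    return pred, confidence, action

# Made-up logits for a two-class model: [not-stressed, stressed]
for logits in ([-1.2, 1.9], [0.1, 0.3]):
    pred, conf, action = route_prediction(logits)
    print(f"logits={logits} -> class {pred}, confidence={conf:.3f}, {action}")
```

Here the confidence is the probability of the predicted class, which is the natural quantity to threshold; the notebook's `predict()` instead returns the probability of class 1 for every text.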

In [None]:

# Select the best run from 'results' dict above
best_name, best_f1 = None, -1.0
best_trainer = None
for name,(metrics, trainer) in results.items():
    if metrics['eval_f1'] > best_f1:
        best_f1 = metrics['eval_f1']
        best_name = name
        best_trainer = trainer

print(f"Best run: {best_name} with F1={best_f1:.4f}")

# Save the best model for reuse
save_dir = f"./best_model_{best_name}"
best_trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

# Simple inference helper
def predict(texts, model_dir=save_dir):
    tok = AutoTokenizer.from_pretrained(model_dir)
    mdl = AutoModelForSequenceClassification.from_pretrained(model_dir).to(device)
    enc = tok(list(texts), padding=True, truncation=True, max_length=160, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = mdl(**enc).logits
    pred = torch.argmax(logits, dim=-1).cpu().numpy()
    prob = torch.softmax(logits, dim=-1).cpu().numpy()[:,1]
    return pred, prob

# Demo predictions on a few samples
samples = [
    "I feel calm and in control today.",
    "My chest is tight and I cannot focus, I think I am very stressed.",
    "Workload is heavy but manageable so far."
]
pred, prob = predict(samples)
for s, y, p in zip(samples, pred, prob):
    lab = "stressed(1)" if y==1 else "not-stressed(0)"
    print(f"[{lab}  p={p:.3f}]  {s}")


# Exercise F3: Automated Hyperparameter Optimization


This cell installs the packages needed for hyperparameter optimization. You need transformers for model training, datasets for data handling, accelerate for faster training, ray and optuna for search algorithms, and openpyxl for creating Excel files.

The pip install command uses the quiet flag (`-q`) to reduce output noise, and the `-U` flag upgrades all listed packages to their latest versions.

In [None]:
# Install required packages for hyperparameter optimization
!pip install transformers datasets accelerate "ray[tune]" optuna openpyxl -U -q

This cell prepares your environment for automated hyperparameter tuning. It imports libraries for timing, Excel file creation, and model training. The code uses the same data splits and model from Exercise F2.

The setup does several things. First, it imports time to track how long each search takes. It imports openpyxl to create Excel workbooks with formatted cells. It imports transformers components for model training and evaluation.

The code reuses X_train, X_val, y_train, and y_val from Exercise F2. These variables contain your training and validation data splits. It also uses the same ClinicalBERT model and tokenizer.

The tokenize_texts function converts text into token IDs that the model understands. It takes your text data and a maximum length parameter. The function returns input_ids and attention_mask tensors.

The code creates Dataset objects from your tokenized data. These objects wrap your data in a format that the Trainer class expects. Each dataset contains input_ids, attention_mask, and labels.

Class weights get computed using the same method as Exercise F2. The weights balance your classes since you have imbalanced data. The compute_metrics function calculates accuracy, precision, recall, and F1 score during evaluation.

The WeightedTrainer class extends the standard Trainer. It applies class weights to the loss function during training. This helps the model learn from minority classes better.
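
The weighting scheme can be illustrated without a model. The sketch below recomputes the same `counts.max() / counts` weights on a made-up label list and shows how a weighted cross-entropy averages the per-example losses (PyTorch's `CrossEntropyLoss(weight=...)` with the default mean reduction divides by the sum of the weights of the target classes). The label list, the probabilities, and the helper names are illustrative assumptions.

```python
import math
from collections import Counter

def class_weights(labels, num_labels):
    """Weight each class by max_count / class_count (majority class gets 1.0)."""
    counts = Counter(labels)
    max_count = max(counts.values())
    # max(..., 1) guards against classes that never appear, as in the notebook.
    return [max_count / max(counts.get(c, 0), 1) for c in range(num_labels)]

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the weights used."""
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den

# Made-up imbalanced labels: 8 examples of class 0, 2 of class 1
labels = [0] * 8 + [1] * 2
weights = class_weights(labels, num_labels=2)
print("class weights:", weights)

# Pretend the model gives 0.9 to the true class for class 0 but only 0.4 for class 1
probs = [[0.9, 0.1]] * 8 + [[0.6, 0.4]] * 2
uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(probs, labels, weights)
print(f"unweighted loss={uniform:.4f}, weighted loss={weighted:.4f}")
```

With these toy numbers the weighted loss is larger than the unweighted one, because the poorly predicted minority class contributes more to the average. That is the pressure the weighted loss puts on the model during training.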

When you run this cell, it prints the device being used and the number of classes detected. The setup completes when you see the success message.

In [None]:
# ============================================================================
# Exercise F3: Setup for Automated Hyperparameter Optimization
# ============================================================================
# IMPORTANT: This exercise uses the SAME data and model from Exercise F2:
#   - Same model: ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT)
#   - Same data splits: X_train, X_val, y_train, y_val (from Exercise F2)
#   - Same class weights and metrics computation
#   - Only difference: Using automated hyperparameter optimization
# ============================================================================

import time
import json
from datetime import datetime
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
import torch
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    set_seed
)
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Set seed for reproducibility
set_seed(42)

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Model and tokenizer setup (using ClinicalBERT from Exercise F2)
CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(CLINICAL_BERT)

# Number of classes (from Exercise F2 - using same y_train variable)
num_labels = len(np.unique(y_train))
avg_type = "binary" if num_labels == 2 else "weighted"
print(f"Detected {num_labels} classes → using average='{avg_type}' for metrics")

# Tokenize datasets (reusing SAME X_train, X_val, y_train, y_val from Exercise F2)
def tokenize_texts(texts, max_length=160):
    return tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )

train_enc = tokenize_texts(X_train, max_length=160)
val_enc = tokenize_texts(X_val, max_length=160)

train_ds = Dataset.from_dict({
    "input_ids": train_enc["input_ids"],
    "attention_mask": train_enc["attention_mask"],
    "labels": torch.tensor(y_train.to_numpy(), dtype=torch.long)
})

val_ds = Dataset.from_dict({
    "input_ids": val_enc["input_ids"],
    "attention_mask": val_enc["attention_mask"],
    "labels": torch.tensor(y_val.to_numpy(), dtype=torch.long)
})

# Compute class weights (from Exercise F2)
counts = np.bincount(y_train, minlength=num_labels)
weights = counts.max() / np.maximum(counts, 1)
class_weights = torch.tensor(weights, dtype=torch.float32, device=device)

# Metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    p, r, f, _ = precision_recall_fscore_support(labels, preds, average=avg_type)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f}

# Weighted Trainer class
from torch.nn import CrossEntropyLoss

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**{k: v for k, v in inputs.items() if k != "labels"})
        logits = outputs.get("logits")
        loss_fct = CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

print("✅ Exercise F3 setup complete!")


### F3.2: Random Search Implementation


This cell runs Random Search to find the best hyperparameters. Random Search picks hyperparameter values randomly from the ranges you define. It can explore continuous ranges that Grid Search cannot.

The random_search_hp_space function defines the hyperparameter space differently than Grid Search. It uses trial.suggest_float for continuous values and trial.suggest_categorical for discrete choices.

For learning rate, Random Search samples from a continuous range between 1e-5 and 3e-5. Setting `log=True` makes it sample uniformly on a logarithmic scale, which suits learning rates because their useful values tend to span orders of magnitude.
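
To see what sampling on a logarithmic scale means, the sketch below draws values uniformly in log space using only the standard library. It mimics the idea behind `suggest_float(..., log=True)` but is not Optuna's implementation. `sample_log_uniform` is a hypothetical helper, and the bounds are deliberately wider than the notebook's so the effect is visible.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
low, high = 1e-6, 1e-2  # wide example range, not the notebook's search range
n = 10_000

log_samples = [sample_log_uniform(low, high, rng) for _ in range(n)]
lin_samples = [rng.uniform(low, high) for _ in range(n)]

# Fraction of samples that land in the lowest two decades (below 1e-4)
frac_log = sum(x < 1e-4 for x in log_samples) / n
frac_lin = sum(x < 1e-4 for x in lin_samples) / n
print(f"log-scale sampling:    {frac_log:.1%} of draws below 1e-4")
print(f"linear-scale sampling: {frac_lin:.1%} of draws below 1e-4")
```

Log-scale sampling spends about as many draws on each order of magnitude, while linear sampling almost never visits the small values. On a narrow range such as the one used in this exercise the two schemes differ much less.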

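For batch size, it picks randomly from two options, 8 and 16. For weight decay, it samples from a continuous range between 0.0 and 0.01. For epochs, it draws an integer with `trial.suggest_int` between `min_epochs` and `max_epochs`; these come from `EPOCH_RECOMMENDATIONS` when the epoch analysis cell has been run (at least 3, up to `maximum_safe`, which is 15) and default to 3 and 15 otherwise.

Random Search runs 6 trials (`n_trials=6`), which the code comments describe as matching the Grid Search budget so the comparison is fair. Each trial picks random hyperparameter values and trains a model.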

The TrainingArguments and WeightedTrainer work the same way as in Grid Search. The only difference is how hyperparameters get selected.

Random Search can find better hyperparameters in fewer trials when the optimal values are not on your grid points. It explores the continuous space more efficiently.

When Random Search finishes, it returns the best trial with the highest F1 score. The code prints the best hyperparameters and execution time. You use this information to compare against Grid Search results.

In [None]:
# Random Search Implementation
# This randomly samples from the hyperparameter space

from transformers import AutoModelForSequenceClassification

def random_search_hp_space(trial):
    """
    Define the hyperparameter space for Random Search.
    Random Search samples RANDOMLY from continuous/discrete ranges.
    """
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 3e-5, log=True)  # Lower range
    per_device_train_batch_size = trial.suggest_categorical("per_device_train_batch_size", [8, 16])  # Smaller batches
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.01)  # Same (already low)

    # Use maximum feasible epochs based on recommendations
    # Check if EPOCH_RECOMMENDATIONS exists from epoch analysis
    if 'EPOCH_RECOMMENDATIONS' in globals():
        max_epochs = EPOCH_RECOMMENDATIONS.get('maximum_safe', 15)
        min_epochs = max(EPOCH_RECOMMENDATIONS.get('min_epochs', 3), 3)  # At least 3
    else:
        # Default range: 3 to 15 epochs (reasonable maximum)
        min_epochs = 3
        max_epochs = 15

    num_train_epochs = trial.suggest_int("num_train_epochs", min_epochs, max_epochs)

    return {
        "learning_rate": learning_rate,
        "per_device_train_batch_size": per_device_train_batch_size,
        "weight_decay": weight_decay,
        "num_train_epochs": num_train_epochs,
    }

# Training arguments template (same as grid search)
random_training_args = TrainingArguments(
    output_dir="./random_search_results",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    fp16=torch.cuda.is_available(),
    report_to="none",
    warmup_steps=500,
    logging_steps=100,
)

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        CLINICAL_BERT,
        num_labels=num_labels
    )


# Initialize trainer for random search
random_trainer = WeightedTrainer(
    model_init=model_init,
    args=random_training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

print("--- Starting Random Search ---")
print("Random Search will sample 6 trials from the hyperparameter space")
print("This allows exploration of continuous ranges efficiently.")

# Display epoch range being used
if 'EPOCH_RECOMMENDATIONS' in globals():
    max_epochs = EPOCH_RECOMMENDATIONS.get('maximum_safe', 15)
    min_epochs = max(EPOCH_RECOMMENDATIONS.get('min_epochs', 3), 3)
    print(f"Epoch range: {min_epochs}-{max_epochs} (from EPOCH_RECOMMENDATIONS)")
else:
    print(f"Epoch range: 3-15 (default maximum)")
print()

# Track start time
random_start_time = time.time()

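# Execute Random Search
# Use same number of trials as grid search for fair comparison
# Optimize F1 explicitly: without compute_objective, the Trainer's default
# objective is the sum of all reported eval metrics, not the F1 score alone.
random_best_trial = random_trainer.hyperparameter_search(
    backend="optuna",
    hp_space=random_search_hp_space,
    compute_objective=lambda metrics: metrics["eval_f1"],
    direction="maximize",
    n_trials=6,  # Same number of trials as grid search for fair comparison
)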

random_end_time = time.time()
random_total_time = random_end_time - random_start_time

print(f"\n--- Random Search Complete (Time: {random_total_time:.2f} seconds) ---")
print("\nBEST HYPERPARAMETERS FROM RANDOM SEARCH:")
if random_best_trial:
    print(random_best_trial)
    random_best_hps = random_best_trial.hyperparameters
    print("\nBest Hyperparameters:")
    for key, value in random_best_hps.items():
        print(f"  {key}: {value}")
    print(f"\nBest F1 Score: {random_best_trial.objective:.4f}")
else:
    print("Random search failed or no best trial found.")
    random_best_hps = {}


### F3.3: Extract All Trial Results and Log to Excel




This cell prepares the Random Search results for Excel logging. The transformers library does not directly expose every trial from `hyperparameter_search`, so the cell does not extract individual trials and works from the best trial instead.

The cell prints a short note explaining this limitation. The best trial information (its hyperparameters and objective score) remains available through `random_best_trial` and `random_best_hps`.

The code builds a one-row summary dataframe for Random Search. It includes the best F1 score, the best hyperparameters, the total number of trials, the total execution time, and the average time per trial. This summary helps you understand the results at a glance.

The summary gets printed to the console so you can see it immediately. You also use this data when creating the Excel log file in the next cell.
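
If you do have access to the underlying Optuna study (for example, when you create the study yourself rather than going through `hyperparameter_search`), collecting every completed trial could look like the sketch below. `extract_trial_results` here is an illustrative helper, and the stand-in objects only imitate the `trials`, `state`, `number`, `params`, and `value` attributes that Optuna studies and their trials provide. Nothing in it is taken from the notebook's code.

```python
from types import SimpleNamespace

def extract_trial_results(study):
    """Collect number, hyperparameters, and objective of each completed trial."""
    rows = []
    for trial in study.trials:
        # Optuna marks finished trials with TrialState.COMPLETE; comparing the
        # enum member's name avoids importing optuna in this standalone sketch.
        if trial.state.name != "COMPLETE":
            continue
        rows.append({"trial": trial.number, **trial.params, "objective": trial.value})
    # Best objective first
    return sorted(rows, key=lambda r: r["objective"], reverse=True)

# Stand-in objects with made-up values, used only to exercise the helper
def fake_trial(number, state, params, value):
    return SimpleNamespace(number=number, state=SimpleNamespace(name=state),
                           params=params, value=value)

fake_study = SimpleNamespace(trials=[
    fake_trial(0, "COMPLETE", {"learning_rate": 2e-5, "num_train_epochs": 4}, 0.81),
    fake_trial(1, "FAIL", {"learning_rate": 1e-5, "num_train_epochs": 9}, None),
    fake_trial(2, "COMPLETE", {"learning_rate": 1.5e-5, "num_train_epochs": 6}, 0.84),
])

for row in extract_trial_results(fake_study):
    print(row)
```

Each returned row is a plain dictionary, so the list can be passed straight to `pd.DataFrame(...)` or written to a worksheet row by row.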

In [None]:
# Prepare Random Search results for Excel logging
# Note: Individual trial extraction is limited by transformers library
# We'll log the best trial results with full metrics

print("Preparing Random Search results for Excel logging...")
print("Note: Individual trial extraction may be limited by transformers library.")
print("Best trial results will be logged to Excel with full metrics.")
print("Creating summary and Excel log sheet...")

# Create summary data for Excel logging
# Since we can't easily extract all individual trials from hyperparameter_search,
# we'll create a summary with the best Random Search results

import pandas as pd

# Create summary data for Excel
summary_data = []

# Random Search Summary
if random_best_trial:
    summary_data.append({
        "Search_Type": "Random Search (Automated)",
        "Best_F1_Score": random_best_trial.objective,
        "Best_Learning_Rate": random_best_hps.get("learning_rate", "N/A"),
        "Best_Batch_Size": random_best_hps.get("per_device_train_batch_size", "N/A"),
        "Best_Weight_Decay": random_best_hps.get("weight_decay", "N/A"),
        "Best_Epochs": random_best_hps.get("num_train_epochs", "N/A"),
        "Total_Trials": 6,  # Updated for fast config
        "Total_Time_Seconds": random_total_time,
        "Time_Per_Trial_Seconds": random_total_time / 6,
        "Strategy": "Random Sampling - Continuous ranges",
    })

if summary_data:
    summary_df = pd.DataFrame(summary_data)
    print("\n=== RANDOM SEARCH SUMMARY ===")
    print(summary_df.to_string(index=False))
    print("\nNote: This will be compared to Exercise F2 manual experiments in the Excel file.")
else:
    print("⚠️  Random Search did not complete. Please run Random Search first.")

In this section of my code, I create an Excel workbook to log and analyze the results from my Random Search hyperparameter optimization experiment (Exercise F3). I start by initializing a new workbook using Workbook() from the openpyxl library and naming the active worksheet "F3_Random_Search_Results". I write a styled header row (bold white text on a blue fill, centered), then record the best trial results in row 2, including my member number, a "Best" label, all hyperparameters retrieved using .get() with "N/A" fallbacks, the objective F1 score, total training time, and a formatted timestamp. The Accuracy, Precision, and Recall columns stay empty because the best-trial object only carries the objective value. I then create a second worksheet called "Comparison_Analysis" with four columns to compare my automated Random Search results against my previous manual tuning from Exercise F2, populating it with three key metrics: Best F1 Score (4 decimals), Total Time (2 decimals), and Efficiency calculated as F1/Time (6 decimals) with zero-division protection. The Exercise F2 column holds placeholders that I fill in manually. Below the comparison data, I add bolded "Analysis Notes:" followed by four explanatory points about the differences between automated and manual approaches.

In [None]:
# Create Excel log sheet for Random Search only
wb = Workbook()
ws = wb.active
ws.title = "F3_Random_Search_Results"

# Header styling
header_fill = PatternFill(start_color="366092", end_color="366092", fill_type="solid")
header_font = Font(bold=True, color="FFFFFF")

# Write headers
headers = [
    "Member", "Trial #", "Learning Rate", "Batch Size", "Weight Decay",
    "Epochs", "F1 Score", "Accuracy", "Precision", "Recall",
    "Training Time (s)", "Timestamp"
]

for col_idx, header in enumerate(headers, 1):
    cell = ws.cell(row=1, column=col_idx, value=header)
    cell.fill = header_fill
    cell.font = header_font
    cell.alignment = Alignment(horizontal="center")

row = 2

# Add Random Search best result
if random_best_trial:
    ws.cell(row=row, column=1, value=f"Member 3")  # Use member number from config
    ws.cell(row=row, column=2, value="Best")
    ws.cell(row=row, column=3, value=random_best_hps.get("learning_rate", "N/A"))
    ws.cell(row=row, column=4, value=random_best_hps.get("per_device_train_batch_size", "N/A"))
    ws.cell(row=row, column=5, value=random_best_hps.get("weight_decay", "N/A"))
    ws.cell(row=row, column=6, value=random_best_hps.get("num_train_epochs", "N/A"))
    ws.cell(row=row, column=7, value=random_best_trial.objective)
    ws.cell(row=row, column=11, value=random_total_time)
    ws.cell(row=row, column=12, value=datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    row += 1

# Add comparison sheet (Random Search vs Exercise F2)
ws2 = wb.create_sheet("Comparison_Analysis")

comparison_headers = [
    "Metric", "Random Search (Automated)", "Exercise F2 (Manual)", "Notes"
]

for col_idx, header in enumerate(comparison_headers, 1):
    cell = ws2.cell(row=1, column=col_idx, value=header)
    cell.fill = header_fill
    cell.font = header_font
    cell.alignment = Alignment(horizontal="center")

# Comparison data (you'll need to add Exercise F2 best F1 score manually)
comparison_data = []

if random_best_trial:
    comparison_data.append([
        "Best F1 Score",
        f"{random_best_trial.objective:.4f}",
        "Add Exercise F2 best F1 here",
        "Random Search uses automated optimization"
    ])

    comparison_data.append([
        "Total Time (seconds)",
        f"{random_total_time:.2f}",
        "Add Exercise F2 total time here",
        "Time for 6 automated trials"
    ])

    # Efficiency
    random_efficiency = random_best_trial.objective / random_total_time if random_total_time > 0 else 0
    comparison_data.append([
        "Efficiency (F1/Time)",
        f"{random_efficiency:.6f}",
        "Calculate from F2",
        "Higher is better"
    ])

# Write comparison data
for row_idx, data in enumerate(comparison_data, 2):
    for col_idx, value in enumerate(data, 1):
        ws2.cell(row=row_idx, column=col_idx, value=value)

# Add analysis notes
notes_row = len(comparison_data) + 3
ws2.cell(row=notes_row, column=1, value="Analysis Notes:").font = Font(bold=True)
notes_row += 1
ws2.cell(row=notes_row, column=1, value="1. Random Search uses automated hyperparameter optimization")
notes_row += 1
ws2.cell(row=notes_row, column=1, value="2. Exercise F2 used manual hyperparameter tuning")
notes_row += 1
ws2.cell(row=notes_row, column=1, value="3. Random Search can explore continuous hyperparameter ranges")
notes_row += 1
ws2.cell(row=notes_row, column=1, value="4. Efficiency = Best F1 Score / Total Time")

# Auto-adjust column widths
for col in ws.columns:
    max_length = 0
    col_letter = col[0].column_letter
    for cell in col:
        try:
            if len(str(cell.value)) > max_length:
                max_length = len(str(cell.value))
        except:
            pass
    adjusted_width = min(max_length + 2, 30)
    ws.column_dimensions[col_letter].width = adjusted_width

# Save Excel file
excel_filename = f"Exercise_F3_Random_Search{3}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.xlsx"
wb.save(excel_filename)

print(f"\n✅ Excel log saved: {excel_filename}")
print(f"   - Sheet 1: F3_Random_Search_Results")
print(f"   - Sheet 2: Comparison_Analysis (vs Exercise F2)")

# Automatically download the file
try:
    from google.colab import files
    files.download(excel_filename)
    print(f"✅ File automatically downloaded: {excel_filename}")
except ImportError:
    print("Note: Not running in Google Colab. File saved locally.")
except Exception as e:
    print(f"Note: Could not auto-download. File saved at: {excel_filename}")
    print(f"   Error: {e}")

## Stress Identification System

**Function Description:**

This system identifies patients experiencing stress from their statements using a fine-tuned transformer model. It provides multiple fallback strategies for loading trained models, processes patient statements through the model, and returns detailed predictions with probability scores. The system includes utility functions for displaying results, filtering stress cases, and ensuring the model is properly loaded before use.


**Syntax Explanation:**

* `LABEL_MAP` dictionary - Maps numeric class indices (0-6) to human-readable mental health labels
* `STRESS_LABEL = 5` - Defines stress as class 5 in the classification scheme
* **Strategy 1**: `AutoTokenizer.from_pretrained(model_dir)` and `AutoModelForSequenceClassification.from_pretrained(model_dir)` - Attempts to load from saved directory if `best_name` variable exists
* **Strategy 2**: Tries default saved directory path `"./best_model_expB_clinicalbert_bs16_lr5e-5_ep4"`
* **Strategy 3**: `results.items()` - Iterates through results dictionary to find trainer with highest F1 score
* `metrics.get('eval_f1', 0)` - Safely retrieves F1 score with default value of 0
* `best_trainer_obj.model` - Extracts the trained model from the best trainer object
* **Strategy 4**: Uses `best_trainer.model` if available in global scope
* **Strategy 5**: `AutoModelForSequenceClassification.from_pretrained(CLINICAL_BERT, num_labels=num_labels)` - Loads pre-trained model as last resort (not fine-tuned)
* `torch.device("cuda" if torch.cuda.is_available() else "cpu")` - Detects and sets appropriate device (GPU or CPU)
* `model.to(device)` - Moves model to the selected device
* `model.eval()` - Sets model to evaluation mode (disables dropout, batch normalization updates)
* `identify_stress()` function - Main prediction function with automatic model loading capabilities
* `isinstance(statements, str)` - Checks if input is single statement or list
* `tokenizer(statements, padding=True, truncation=True, max_length=160, return_tensors="pt")` - Tokenizes input text with padding, truncation to 160 tokens, and returns PyTorch tensors
* `.to(device)` - Moves tokenized inputs to correct device
* `torch.no_grad()` - Disables gradient computation for inference (saves memory and speeds up)
* `torch.softmax(logits, dim=-1)` - Converts raw model outputs to probability distributions
* `np.argmax(probabilities, axis=-1)` - Gets predicted class by finding highest probability
* `stress_prob = probs[STRESS_LABEL]` - Extracts probability for stress class specifically
* `stress_prob >= stress_threshold` - Checks if stress probability meets or exceeds the threshold (default 0.3)
* `result["all_probabilities"]` - Optional dictionary comprehension creating label-to-probability mapping
* `display_stress_results()` - Formats and prints results in readable format with stress warnings
* `filter_stress_cases()` - List comprehension filtering only cases with detected stress
* `ensure_model_loaded()` - Helper function that attempts all loading strategies to ensure model availability
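
The prediction steps listed above (softmax, argmax, stress-probability lookup, threshold check) can be sketched without loading a model. This is a minimal illustration in plain Python: the logits are made-up numbers, and `softmax` and `decide` are hypothetical helpers written for this sketch, not functions from the notebook.

```python
import math

STRESS_LABEL = 5  # same index the notebook uses for "Stress"

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def decide(logits, stress_threshold=0.3):
    """Hypothetical helper mirroring the flags built in identify_stress()."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)  # argmax over class indices
    stress_prob = probs[STRESS_LABEL]
    is_stress = pred == STRESS_LABEL
    above_threshold = stress_prob >= stress_threshold
    return {
        "predicted_class": pred,
        "stress_probability": stress_prob,
        "is_stress": is_stress,
        "above_threshold": above_threshold,
        "needs_attention": is_stress or above_threshold,
    }

# Made-up logits: class 0 (Anxiety) wins, but Stress is a close second.
example = decide([2.0, 0.0, 0.0, 0.0, 0.0, 1.8, 0.0])
print(example)
```

In this example the top class is not Stress, yet the stress probability is still above 0.3, so `needs_attention` comes out `True`. That is the case the threshold is meant to catch.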


**Inputs:**

The system requires a trained model (from previous fine-tuning experiments) and can accept either a single patient statement as a string or a list of statements. Optional parameters include custom model/tokenizer objects, `return_all_probs` flag, and `stress_threshold` value.


**Outputs:**

Returns a dictionary (for single input) or list of dictionaries (for multiple inputs) containing:
* Original statement
* Predicted class (numeric and label)
* Stress detection flags (`is_stress`, `above_threshold`, `needs_attention`)
* Stress probability score
* Optionally all class probabilities

The `display_stress_results()` function provides formatted console output with visual indicators (⚠️ for stress detected, ✓ for no stress).


**Code Flow:**

The cell first attempts to load a trained model using five fallback strategies in order of preference, from most specific (saved best model directory) to most general (pre-trained base model). Once loaded, the model is moved to the appropriate device and set to evaluation mode. The `identify_stress()` function handles automatic model loading if needed, tokenizes input statements, runs inference, applies softmax to get probabilities, and packages results into dictionaries. Helper functions provide additional functionality for displaying results in readable format, filtering stress cases, and ensuring model availability.
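
The "try each strategy in order" flow described above can be expressed as one small loop. The sketch below is an assumption about how the pattern could be factored, not how the notebook implements it: `load_with_fallbacks` and the three stand-in loaders are hypothetical and return plain strings instead of real models.

```python
def load_with_fallbacks(strategies):
    """Try each (name, loader) pair in order and return the first success.

    `strategies` is a list of (str, callable) pairs. Each callable takes no
    arguments and either returns a loaded object or raises an exception.
    Returns (name, loaded_object, errors); name and object are None if all fail.
    """
    errors = []
    for name, loader in strategies:
        try:
            return name, loader(), errors
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            errors.append((name, str(exc)))
    return None, None, errors

# Stand-in loaders (hypothetical): the first two fail, the third succeeds.
def from_saved_dir():
    raise FileNotFoundError("saved model directory not found")

def from_results_dict():
    raise KeyError("results")

def from_base_checkpoint():
    return "base-model (not fine-tuned)"

source, model, errors = load_with_fallbacks([
    ("saved directory", from_saved_dir),
    ("results dictionary", from_results_dict),
    ("base checkpoint", from_base_checkpoint),
])
print(f"Loaded from: {source}")
for name, message in errors:
    print(f"  skipped {name}: {message}")
```

Keeping the strategies in a list makes the order of preference explicit and avoids repeating the same `try`/`except` block for every source.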


**Comments and Observations:**

The multiple fallback strategies make the cell robust: it can find the trained model in various scenarios, depending on which cells were run and in what order. Strategy 3 (extracting from the `results` dictionary) is the most reliable right after running the training experiments, because it needs nothing saved to disk. Strategy 5 (pre-trained model) loads successfully but prints warnings: the classification head placed on top of the base checkpoint is newly initialized and has not been fine-tuned on your data, so its predictions should not be relied on until the training cells have been run.

The stress threshold of 0.3 (30%) catches cases where stress isn't the top prediction but still has significant probability, which improves recall for this critical use case at the cost of more false positives. The `needs_attention` flag combines both direct stress predictions and threshold-based detection for comprehensive screening. Because `probs[STRESS_LABEL]` is read regardless of the predicted class, stress likelihood can be monitored even when another condition is predicted, which is valuable for clinical decision support where multiple concerns may coexist.

Setting `max_length=160` truncates longer statements to 160 tokens, a balance between capturing full patient statements and staying efficient. The automatic device detection ensures the code works on both GPU and CPU systems. The `eval()` mode is critical for accurate predictions because it disables training-specific behaviors like dropout.
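
Strategy 3 picks the run with the highest `eval_f1` from a `name -> (metrics, trainer)` dictionary. The same selection can be written with `max()` and a key function. This is a sketch under assumptions: `pick_best` is a hypothetical helper, and the run names, scores, and `SimpleNamespace` trainers below are invented stand-ins rather than real training output.

```python
from types import SimpleNamespace

# Stand-in for the notebook's results dictionary: name -> (metrics, trainer).
# Run names and metric values are invented for illustration.
results_demo = {
    "run_a": ({"eval_f1": 0.71}, SimpleNamespace(model="model_a")),
    "run_b": ({"eval_f1": 0.78}, SimpleNamespace(model="model_b")),
    "run_c": ({}, SimpleNamespace(model="model_c")),  # missing metric is treated as 0
}

def pick_best(results, metric="eval_f1"):
    """Return (name, score, trainer) for the run with the highest metric."""
    if not results:
        raise ValueError("results is empty")
    name, (metrics, trainer) = max(
        results.items(),
        key=lambda item: item[1][0].get(metric, 0),
    )
    return name, metrics.get(metric, 0), trainer

best_name_demo, best_f1_demo, best_trainer_demo = pick_best(results_demo)
print(best_name_demo, best_f1_demo, best_trainer_demo.model)
```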

In [None]:
# ============================================================================
# Stress Identification System
# Identifies patients experiencing stress from their statements
# ============================================================================

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Label mapping (from the dataset encoding)
LABEL_MAP = {
    0: "Anxiety",
    1: "Bipolar",
    2: "Depression",
    3: "Normal",
    4: "Personality disorder",
    5: "Stress",
    6: "Suicidal"
}

STRESS_LABEL = 5  # Stress is class 5

# Load the best model (use the saved model from previous experiments)
# Multiple fallback strategies to find the trained model
stress_model = None
stress_tokenizer = None
model_loaded = False

# Strategy 1: Try to load from saved directory (if best_name exists)
if 'best_name' in globals():
    try:
        model_dir = f"./best_model_{best_name}"
        stress_tokenizer = AutoTokenizer.from_pretrained(model_dir)
        stress_model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        print(f"✅ Loaded model from saved directory: {model_dir}")
        model_loaded = True
    except Exception as e:
        print(f"⚠️  Could not load from {model_dir}: {e}")

# Strategy 2: Try default saved directory
if not model_loaded:
    try:
        model_dir = "./best_model_expB_clinicalbert_bs16_lr5e-5_ep4"
        stress_tokenizer = AutoTokenizer.from_pretrained(model_dir)
        stress_model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        print(f"✅ Loaded model from default directory: {model_dir}")
        model_loaded = True
    except Exception as e:
        print(f"⚠️  Could not load from default directory: {e}")

# Strategy 3: Extract best model from results dictionary
if not model_loaded and 'results' in globals():
    try:
        # Find the best model from results
        best_f1 = -1.0
        best_trainer_obj = None
        for name, (metrics, trainer) in results.items():
            if metrics.get('eval_f1', 0) > best_f1:
                best_f1 = metrics.get('eval_f1', 0)
                best_trainer_obj = trainer

        if best_trainer_obj is not None:
            stress_model = best_trainer_obj.model
            stress_tokenizer = tokenizer if 'tokenizer' in globals() else AutoTokenizer.from_pretrained(CLINICAL_BERT)
            print(f"✅ Using best model from results dictionary (F1={best_f1:.4f})")
            model_loaded = True
    except Exception as e:
        print(f"⚠️  Could not extract model from results: {e}")

# Strategy 4: Use best_trainer if available
if not model_loaded and 'best_trainer' in globals():
    try:
        stress_model = best_trainer.model
        stress_tokenizer = tokenizer if 'tokenizer' in globals() else AutoTokenizer.from_pretrained(CLINICAL_BERT)
        print("✅ Using best_trainer's model")
        model_loaded = True
    except Exception as e:
        print(f"⚠️  Could not use best_trainer: {e}")

# Strategy 5: Load pre-trained model as last resort (will need fine-tuning)
if not model_loaded:
    try:
        CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
        num_labels = 7  # Default: 7 classes (Anxiety, Bipolar, Depression, Normal, Personality disorder, Stress, Suicidal)

        # Try to get num_labels from existing data
        if 'y_train' in globals():
            num_labels = len(np.unique(y_train))

        stress_tokenizer = AutoTokenizer.from_pretrained(CLINICAL_BERT)
        stress_model = AutoModelForSequenceClassification.from_pretrained(
            CLINICAL_BERT,
            num_labels=num_labels
        )
        print("⚠️  Loaded PRE-TRAINED model (not fine-tuned).")
        print("   This model has NOT been trained on your data yet.")
        print("   For best results, please run the training cells first.")
        print("   You can still use this model, but accuracy will be lower.")
        model_loaded = True
    except Exception as e:
        print(f"❌ Could not load pre-trained model: {e}")
        print("\n" + "="*80)
        print("INSTRUCTIONS:")
        print("="*80)
        print("To use the stress identification system:")
        print("1. Run the training cells (Cell 12) to fine-tune the model")
        print("2. Run the model saving cell (Cell 15) to save the best model")
        print("3. Then run this cell again")
        print("="*80)
        stress_model = None
        stress_tokenizer = None

if stress_model is not None:
    # Ensure device is available
    if 'device' not in globals():
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Device set to: {device}")

    stress_model = stress_model.to(device)
    stress_model.eval()  # Set to evaluation mode
    print(f"✅ Model loaded and ready on device: {device}")

def identify_stress(statements, model=None, tokenizer=None,
                   return_all_probs=False, stress_threshold=0.3):
    """
    Identify patients experiencing stress from their statements.

    Parameters:
    -----------
    statements : str or list of str
        Single statement or list of patient statements
    model : torch.nn.Module, optional
        Trained classification model (if None, uses global stress_model)
    tokenizer : AutoTokenizer, optional
        Tokenizer for the model (if None, uses global stress_tokenizer)
    return_all_probs : bool
        If True, return probabilities for all classes
    stress_threshold : float
        Minimum probability threshold to consider as stress (default: 0.3)

    Returns:
    --------
    dict or list of dict
        Prediction results with stress identification
    """
    # Use global model/tokenizer if not provided, or try to load automatically
    global stress_model, stress_tokenizer, device

    # Debug: Check current state
    # print(f"DEBUG: model={model is not None}, stress_model={stress_model is not None}, tokenizer={tokenizer is not None}, stress_tokenizer={stress_tokenizer is not None}")

    if model is None:
        if stress_model is not None:
            model = stress_model
        else:
            # Try to load model automatically from results.
            # Read the notebook-level names through globals(): inside this
            # function `results` is a local variable (assigned further down)
            # and `tokenizer` is a parameter, so the bare names would not
            # refer to the objects created by the training cells.
            if 'results' in globals():
                try:
                    best_f1 = -1.0
                    best_trainer_obj = None
                    for name, (metrics, trainer) in globals()['results'].items():
                        if metrics.get('eval_f1', 0) > best_f1:
                            best_f1 = metrics.get('eval_f1', 0)
                            best_trainer_obj = trainer

                    if best_trainer_obj is not None:
                        stress_model = best_trainer_obj.model
                        model = stress_model
                        if stress_tokenizer is None:
                            if globals().get('tokenizer') is not None:
                                stress_tokenizer = globals()['tokenizer']
                            else:
                                CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
                                stress_tokenizer = AutoTokenizer.from_pretrained(CLINICAL_BERT)
                        print(f"✅ Auto-loaded model from results (F1={best_f1:.4f})")
                except Exception as e:
                    print(f"⚠️  Could not auto-load from results: {e}")

            # If still None, try pre-trained model
            if model is None:
                try:
                    CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
                    num_labels = 7
                    if 'y_train' in globals():
                        num_labels = len(np.unique(y_train))
                    stress_tokenizer = AutoTokenizer.from_pretrained(CLINICAL_BERT)
                    stress_model = AutoModelForSequenceClassification.from_pretrained(
                        CLINICAL_BERT, num_labels=num_labels
                    )
                    model = stress_model
                    print("⚠️  Auto-loaded PRE-TRAINED model (not fine-tuned)")
                except Exception as e:
                    raise ValueError(f"Model not loaded and could not auto-load: {e}\nPlease run the model loading cell (Cell 30) first.")

    if tokenizer is None:
        if stress_tokenizer is not None:
            tokenizer = stress_tokenizer
        else:
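            # Try to load tokenizer. In this branch the `tokenizer` parameter
            # is None, so the notebook-level tokenizer is read via globals().
            if globals().get('tokenizer') is not None:
                stress_tokenizer = globals()['tokenizer']
                tokenizer = stress_tokenizer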
            else:
                try:
                    CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
                    stress_tokenizer = AutoTokenizer.from_pretrained(CLINICAL_BERT)
                    tokenizer = stress_tokenizer
                except Exception as e:
                    raise ValueError(f"Tokenizer not loaded and could not auto-load: {e}\nPlease run the model loading cell (Cell 30) first.")

    # Ensure device is available and model is on correct device
    if 'device' not in globals():
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    else:
        device = globals()['device']

    # Ensure model is on the correct device and in eval mode
    if model is not None:
        model = model.to(device)
        model.eval()
        # Update global stress_model if it was just loaded
        if stress_model is not None and stress_model is not model:
            stress_model = model

    # Convert single statement to list
    if isinstance(statements, str):
        statements = [statements]
        single_input = True
    else:
        single_input = False

    # Tokenize statements
    encoded = tokenizer(
        statements,
        padding=True,
        truncation=True,
        max_length=160,
        return_tensors="pt"
    ).to(device)

    # Get predictions
    with torch.no_grad():
        outputs = model(**encoded)
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=-1).cpu().numpy()
        predictions = np.argmax(probabilities, axis=-1)

    # Process results
    results = []
    for i, (statement, pred, probs) in enumerate(zip(statements, predictions, probabilities)):
        stress_prob = probs[STRESS_LABEL]
        is_stress = bool(pred == STRESS_LABEL)  # plain bool instead of numpy.bool_
        above_threshold = bool(stress_prob >= stress_threshold)

        result = {
            "statement": statement,
            "predicted_class": int(pred),
            "predicted_label": LABEL_MAP[int(pred)],
            "is_stress": is_stress,
            "stress_probability": float(stress_prob),
            "above_threshold": above_threshold,
            "needs_attention": is_stress or above_threshold
        }

        if return_all_probs:
            result["all_probabilities"] = {
                LABEL_MAP[i]: float(prob) for i, prob in enumerate(probs)
            }

        results.append(result)

    return results[0] if single_input else results

def display_stress_results(results):
    """
    Display stress identification results in a readable format.

    Parameters:
    -----------
    results : dict or list of dict
        Results from identify_stress function
    """
    if isinstance(results, dict):
        results = [results]

    print("=" * 80)
    print("STRESS IDENTIFICATION RESULTS")
    print("=" * 80)

    stress_count = sum(1 for r in results if r['is_stress'] or r['above_threshold'])

    for i, result in enumerate(results, 1):
        print(f"\n[Patient Statement {i}]")
        print(f"Statement: {result['statement']}")
        print(f"Predicted Class: {result['predicted_label']} (Class {result['predicted_class']})")
        print(f"Stress Probability: {result['stress_probability']:.1%}")

        if result['is_stress']:
            print("⚠️  STRESS DETECTED - Primary Prediction")
        elif result['above_threshold']:
            print("⚠️  STRESS LIKELY - Above threshold")
        else:
            print("✓ No significant stress detected")

        if 'all_probabilities' in result:
            print("\nAll Class Probabilities:")
            for label, prob in sorted(result['all_probabilities'].items(),
                                     key=lambda x: x[1], reverse=True):
                marker = " ←" if label == result['predicted_label'] else ""
                print(f"  {label}: {prob:.1%}{marker}")

    print("\n" + "=" * 80)
    print(f"Summary: {stress_count} out of {len(results)} patients show signs of stress")
    print("=" * 80)

def filter_stress_cases(results):
    """
    Filter and return only cases where stress is detected.

    Parameters:
    -----------
    results : dict or list of dict
        Results from identify_stress function

    Returns:
    --------
    list of dict
        Only cases with stress detected
    """
    if isinstance(results, dict):
        results = [results]

    return [r for r in results if r['is_stress'] or r['above_threshold']]

def ensure_model_loaded():
    """
    Helper function to ensure model and tokenizer are loaded.
    Can be called before using identify_stress if needed.
    """
    global stress_model, stress_tokenizer, device

    if stress_model is None or stress_tokenizer is None:
        print("⚠️  Model not loaded. Attempting to load now...")

        # Try Strategy 3: Extract from results (most likely to work if training was done)
        if 'results' in globals():
            try:
                best_f1 = -1.0
                best_trainer_obj = None
                for name, (metrics, trainer) in results.items():
                    if metrics.get('eval_f1', 0) > best_f1:
                        best_f1 = metrics.get('eval_f1', 0)
                        best_trainer_obj = trainer

                if best_trainer_obj is not None:
                    stress_model = best_trainer_obj.model
                    if 'tokenizer' in globals():
                        stress_tokenizer = tokenizer
                    else:
                        CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
                        stress_tokenizer = AutoTokenizer.from_pretrained(CLINICAL_BERT)
                    print(f"✅ Loaded model from results dictionary (F1={best_f1:.4f})")
                else:
                    raise ValueError("No trainer found in results")
            except Exception as e:
                print(f"⚠️  Could not load from results: {e}")

        # Try Strategy 5: Load pre-trained model
        if stress_model is None:
            try:
                CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
                num_labels = 7
                if 'y_train' in globals():
                    num_labels = len(np.unique(y_train))

                stress_tokenizer = AutoTokenizer.from_pretrained(CLINICAL_BERT)
                stress_model = AutoModelForSequenceClassification.from_pretrained(
                    CLINICAL_BERT, num_labels=num_labels
                )
                print("⚠️  Loaded PRE-TRAINED model (not fine-tuned)")
            except Exception as e:
                print(f"❌ Could not load model: {e}")
                raise

    # Ensure device is set
    if 'device' not in globals():
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move model to device and set to eval mode
    if stress_model is not None:
        stress_model = stress_model.to(device)
        stress_model.eval()

    return stress_model is not None and stress_tokenizer is not None

# Example usage with sample patient statements
print("\n" + "=" * 80)
print("STRESS IDENTIFICATION SYSTEM READY")
print("=" * 80)
if stress_model is not None and stress_tokenizer is not None:
    print("✅ Model and tokenizer are loaded and ready!")
else:
    print("⚠️  Model not yet loaded. Run ensure_model_loaded() if needed.")
print("\nExample Usage:")
print("  results = identify_stress('I feel overwhelmed and cannot cope with work')")
print("  display_stress_results(results)")
print("\n" + "=" * 80)


## Quick Test - Stress Identification System

**Function Description:**

This cell performs a quick verification test to ensure the stress identification system is properly loaded and functioning. It runs a simple test statement through the model and displays the results, providing immediate feedback on whether the system is ready for use or if troubleshooting is needed.


**Syntax Explanation:**

* `print("=" * 80)` - Creates a visual separator line of 80 equal signs for better readability
* `try-except` block - Wraps the test in error handling to catch and gracefully handle any loading or prediction errors
* `test_statement = "I feel very stressed and overwhelmed"` - Defines a sample patient statement that should trigger stress detection
* `identify_stress(test_statement)` - Calls the main prediction function with the test statement
* `result['predicted_label']` - Accesses the human-readable predicted class label from the result dictionary
* `result['stress_probability']:.1%` - Formats the stress probability as a percentage with one decimal place
* `result['is_stress']` - Boolean flag indicating whether stress was the top predicted class
* `✅` and `❌` emoji markers - Visual indicators for success or failure status
* `Exception as e` - Catches any errors that occur during model loading or prediction
* `print(f"❌ Error: {e}")` - Displays the specific error message to help with debugging
* Troubleshooting guide - Prints step-by-step instructions if the test fails
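
As a small illustration of two formatting idioms from the list above: the `:.1%` format spec multiplies the value by 100, rounds to one decimal place, and appends a percent sign, while `"=" * 80` repeats a string. The probability below is a made-up value, not model output.

```python
stress_probability = 0.8734  # made-up value for illustration

line = f"Stress Probability: {stress_probability:.1%}"
separator = "=" * 80

print(separator)
print(line)  # Stress Probability: 87.3%
```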


**Inputs:**

This cell requires that the stress identification system (from the previous cell) has been run. It uses the globally loaded `stress_model`, `stress_tokenizer`, and the `identify_stress()` function. No user input is required - the test statement is hardcoded.


**Outputs:**

**Success case**: Displays a success message with the test statement, predicted label, stress probability percentage, and a confirmation that the system is ready.

**Failure case**: Shows an error message with the specific exception, followed by a detailed troubleshooting guide with numbered steps to resolve common issues (model not loaded, training not completed, etc.).


**Code Flow:**

The cell starts by printing a header announcing the test. It then enters a try-except block where it defines a test statement, passes it to `identify_stress()`, and if successful, extracts and displays key information from the result dictionary including the predicted label and stress probability. If any exception occurs (model not loaded, function not defined, etc.), the except block catches it, displays the error, and prints a comprehensive troubleshooting guide with step-by-step instructions for common problems.


**Comments and Observations:**

This test cell serves as a quick sanity check before running the system on real patient data. The test statement "I feel very stressed and overwhelmed" has deliberately clear stress indicators, so a correctly fine-tuned model should predict stress for it. Note that the cell reports success whenever `identify_stress()` returns without raising an exception; it does not check the prediction itself, so read the printed label and probability before trusting the result. In particular, `identify_stress()` can fall back to the pre-trained (not fine-tuned) checkpoint, in which case the call succeeds but the prediction is not meaningful.

If the test fails, it typically means either the Stress Identification System cell hasn't been run yet (run Cell 30), or no model could be loaded (run Cell 12 to train, then re-run Cell 30). The troubleshooting guide provides clear next steps based on these common failure scenarios. Running this test is recommended before analyzing actual patient statements to avoid wasting time on a non-functional system. The emoji indicators make it immediately obvious whether the call worked: ✅ on success, ❌ on failure. This cell is safe to run multiple times and can be used as a quick verification after making any changes to the system.
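
A stricter variant of this check would also inspect the returned dictionary instead of treating "no exception" as success. The sketch below is one possible shape, assuming the result keys described for `identify_stress()`; `smoke_test` and `fake_predict` are hypothetical names, and the stub's values are invented so the example runs without a model.

```python
def smoke_test(predict, statement, expected_label="Stress"):
    """Return (ok, message).

    `predict` is any callable that, given one string, returns a dict shaped
    like the single-statement output of identify_stress().
    """
    try:
        result = predict(statement)
    except Exception as exc:
        return False, f"prediction raised: {exc}"

    required = {"predicted_label", "stress_probability", "is_stress"}
    missing = required - result.keys()
    if missing:
        return False, f"result is missing keys: {sorted(missing)}"
    if result["predicted_label"] != expected_label:
        return False, f"expected {expected_label!r}, got {result['predicted_label']!r}"
    return True, "ok"

# Stub predictor standing in for identify_stress(); values are invented.
def fake_predict(statement):
    return {"predicted_label": "Normal", "stress_probability": 0.05, "is_stress": False}

ok, message = smoke_test(fake_predict, "I feel very stressed and overwhelmed")
print(ok, message)
```

With the real function, the call would be `smoke_test(identify_stress, "I feel very stressed and overwhelmed")`. A failure here does not prove the model is broken, only that its answer for an obvious case is worth a closer look.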

In [None]:
# Quick test to verify model is loaded and working
print("Testing stress identification system...")
print("=" * 80)

try:
    # Test with a simple statement
    test_statement = "I feel very stressed and overwhelmed"
    result = identify_stress(test_statement)

    print("✅ Model is loaded and working!")
    print(f"\nTest Statement: '{test_statement}'")
    print(f"Predicted: {result['predicted_label']}")
    print(f"Stress Probability: {result['stress_probability']:.1%}")
    print(f"Is Stress: {result['is_stress']}")
    print("\n" + "=" * 80)
    print("✅ System is ready to analyze patient statements!")

except Exception as e:
    print(f"❌ Error: {e}")
    print("\n" + "=" * 80)
    print("TROUBLESHOOTING:")
    print("=" * 80)
    print("1. Make sure you have run Cell 30 (Stress Identification System)")
    print("2. If you see 'Model not loaded', try running Cell 12 (Training) first")
    print("3. Then re-run Cell 30 to load the trained model")
    print("4. If 'results' dictionary exists, the model will auto-load from there")
    print("=" * 80)


## Example: Stress Identification Demo

**Function Description:**

This cell demonstrates the complete functionality of the stress identification system using sample patient statements. It shows how to analyze multiple statements at once, display comprehensive results with all class probabilities, filter only the stress cases for prioritized attention, and analyze individual statements. This serves as both a demonstration and a template for using the system in practice.


**Syntax Explanation:**

* `patient_statements = [...]` - Creates a list of 7 sample patient statements with varying levels and types of mental health concerns
* `identify_stress(patient_statements, return_all_probs=True)` - Analyzes all statements at once with the flag to include probability scores for all 7 classes
* `return_all_probs=True` - Optional parameter that adds complete probability distribution across all mental health categories to the results
* `display_stress_results(results)` - Calls the display function to print formatted results with visual indicators and probability scores
* `filter_stress_cases(results)` - Filters the results to return only cases where stress was detected (either as top prediction or above threshold)
* `enumerate(stress_cases, 1)` - Iterates through filtered stress cases with 1-based indexing for readable numbering
* `case['statement']` - Accesses the original patient statement from the result dictionary
* `case['stress_probability']:.1%` - Formats stress probability as percentage with one decimal place
* `case['predicted_label']` - Gets the human-readable predicted mental health category
* `if stress_cases:` - Checks if any stress cases were found before attempting to display them
* `identify_stress("single statement", return_all_probs=True)` - Demonstrates analyzing a single statement (string input instead of list)
* `display_stress_results(single_result)` - Shows results for single statement analysis


**Inputs:**

This cell uses the previously loaded stress identification system including the `identify_stress()`, `display_stress_results()`, and `filter_stress_cases()` functions. The input data consists of 7 hardcoded sample patient statements designed to represent a range of mental health states from healthy/normal to various levels of stress and anxiety. No external data files are required.


**Outputs:**

The cell produces three main output sections:

* **Full Analysis Section**: Displays complete results for all 7 patient statements including predicted class, stress probability, detection flags (⚠️ or ✓), and probability distributions across all 7 mental health categories for each statement.

* **Filtered Stress Cases Section**: Shows only the statements where stress was detected, providing a prioritized list of patients who need attention with their stress probability and predicted class.

* **Single Statement Analysis Section**: Demonstrates analyzing one statement at a time, showing the same detailed output format as the full analysis but for a single patient.

The output includes visual indicators, formatted percentages, summary statistics, and clear section headers for easy interpretation.


**Code Flow:**

The cell begins by defining a diverse set of 7 sample patient statements ranging from clear stress indicators to normal/healthy states. It then calls `identify_stress()` with the entire list and `return_all_probs=True` to get comprehensive results. These results are displayed using `display_stress_results()` which formats and prints all predictions with probabilities. Next, it filters only stress cases using `filter_stress_cases()` and displays them in a focused summary section. If no stress cases exist, it prints an appropriate message. Finally, it demonstrates single-statement analysis by processing one new statement individually and displaying its results, showing the flexibility of the system for both batch and individual analyses.


**Comments and Observations:**

This demo cell is designed to be educational and practical. The 7 sample statements were chosen to resemble realistic patient scenarios: statements 1, 3, 4, and 6 contain clear stress indicators and should trigger detection; statements 2, 5, and 7 represent normal/healthy states and should not trigger stress detection. This variety gives a quick sanity check covering both expected positives and expected negatives, although seven hand-written statements are not a substitute for evaluation on a held-out test set.

The `return_all_probs=True` flag is particularly useful for clinical review because it shows not just the top prediction but the model's confidence distribution across all mental health categories. For example, a patient might be classified as "Anxiety" but have a high stress probability (30-40%) that still warrants attention. The filtered stress cases section is especially valuable in real-world applications where clinicians need to prioritize which patients require immediate follow-up. The single statement analysis at the end shows how the system can be used interactively for individual patient assessments during consultations.

You can modify the `patient_statements` list with your own data or load statements from a file/database. The stress threshold (default 0.3, or 30%) can be adjusted through the `stress_threshold` argument of `identify_stress()` if you want to be more or less sensitive to potential stress cases: lower thresholds catch more cases but may produce more false positives, while higher thresholds are more conservative but might miss some at-risk patients.
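
To make the threshold trade-off concrete, the sketch below counts how many statements would be flagged at different thresholds. The seven probabilities are invented for illustration (they are not model output), `flagged_positions` is a hypothetical helper, and the sketch applies only the threshold test, without the separate top-prediction check that `identify_stress()` also performs.

```python
# Invented stress probabilities for seven statements (not model output).
stress_probs = [0.82, 0.03, 0.64, 0.35, 0.02, 0.91, 0.28]

def flagged_positions(probs, threshold):
    """Return 1-based positions whose stress probability meets the threshold."""
    return [i for i, p in enumerate(probs, 1) if p >= threshold]

for threshold in (0.2, 0.3, 0.5):
    hits = flagged_positions(stress_probs, threshold)
    print(f"threshold={threshold:.1f}: {len(hits)} flagged -> {hits}")
```

Lowering the threshold from 0.3 to 0.2 adds statement 7 to the flagged set, while raising it to 0.5 drops statement 4. That is the recall versus false-positive trade-off in miniature.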

In [None]:

# ============================================================================
# Example: Stress Identification Demo
# ============================================================================

# Sample patient statements for testing
patient_statements = [
    "I feel overwhelmed and cannot cope with my workload. My chest feels tight and I can't sleep.",
    "I'm doing great today, feeling calm and in control of my life.",
    "The pressure at work is too much. I feel stressed all the time and it's affecting my health.",
    "I have been experiencing anxiety attacks and feeling very stressed about my future.",
    "Everything is fine, I'm managing well and feeling positive about things.",
    "I can't handle this anymore. The stress is killing me and I don't know what to do.",
    "I feel normal today, nothing out of the ordinary happening."
]

# Identify stress in all patient statements
print("Analyzing patient statements for stress...")
print("\n")

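# Note: this assignment rebinds the notebook-level name `results`. If an earlier
# cell stored the training runs under that name, the loader's Strategy 3 can no
# longer read them after this cell runs.
results = identify_stress(patient_statements, return_all_probs=True)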

# Display results
display_stress_results(results)

# Filter only stress cases
print("\n" + "=" * 80)
print("FILTERED: PATIENTS WITH STRESS DETECTED")
print("=" * 80)
stress_cases = filter_stress_cases(results)

if stress_cases:
    for i, case in enumerate(stress_cases, 1):
        print(f"\n[Stress Case {i}]")
        print(f"Statement: {case['statement']}")
        print(f"Stress Probability: {case['stress_probability']:.1%}")
        print(f"Predicted Class: {case['predicted_label']}")
else:
    print("\nNo stress cases detected in the provided statements.")

# Example: Single statement analysis
print("\n" + "=" * 80)
print("SINGLE STATEMENT ANALYSIS")
print("=" * 80)
single_result = identify_stress(
    "I am extremely stressed about my exams and cannot focus on anything else.",
    return_all_probs=True
)
display_stress_results(single_result)


## Analyze CSV File for Stress Cases

**Function Description:**

This cell reads a CSV file containing patient statements and performs batch stress identification analysis on all records. It processes statements in configurable batches to manage memory efficiently, displays real-time progress updates, generates comprehensive statistics, shows example stress cases, and automatically saves results to timestamped CSV files. This is designed for large-scale analysis of patient datasets.


**Syntax Explanation:**

* `CSV_FILE = "Combined Data.csv"` - Specifies the name of the CSV file to analyze
* `STATEMENT_COL = "statement"` - Defines which column contains the patient statements to analyze
* `MAX_ROWS = None` - Optional limit on number of rows to process (None = process all rows)
* `BATCH_SIZE = 100` - Number of statements to process in each batch for memory efficiency and progress tracking
* `Path(CSV_FILE)` - Creates a Path object for better file path handling across operating systems
* `csv_file.exists()` - Checks if the CSV file exists before attempting to load it
* `pd.read_csv(csv_file)` - Loads the entire CSV file into a pandas DataFrame
* `if STATEMENT_COL not in df.columns:` - Validates that the specified column name exists in the dataset
* `df[STATEMENT_COL].astype(str).tolist()` - Extracts statements column, converts all to strings, and creates a Python list
* `df.head(MAX_ROWS)` - Limits DataFrame to first MAX_ROWS if specified
* `(total_statements + BATCH_SIZE - 1) // BATCH_SIZE` - Calculates number of batches using ceiling division
* `for batch_num in range(num_batches):` - Iterates through each batch for processing
* `min(start_idx + BATCH_SIZE, total_statements)` - Ensures last batch doesn't exceed total statements
* `batch_statements = statements[start_idx:end_idx]` - Slices list to get current batch of statements
* `identify_stress(batch_statements, return_all_probs=False)` - Analyzes batch without full probability distributions for faster processing
* `all_results.extend(batch_results)` - Appends batch results to master results list
* `print(..., end='\r')` - Prints progress on same line using carriage return for live updates
* `pd.DataFrame(all_results)` - Converts list of result dictionaries into pandas DataFrame for easy manipulation
* `results_df.insert(0, 'patient_id', range(1, len(results_df) + 1))` - Adds sequential patient IDs as first column
* `results_df['is_stress'].sum()` - Counts number of True values in boolean column for stress statistics
* `filter_stress_cases(all_results)` - Extracts only cases where stress was detected or probability exceeded threshold
* `stmt[:70] + "..."` - Truncates long statements to 70 characters for readable display
* `datetime.now().strftime('%Y%m%d_%H%M%S')` - Generates timestamp in format YYYYMMDD_HHMMSS for unique filenames
* `results_df.to_csv(all_file, index=False)` - Saves DataFrame to CSV without row index column
* `results_df[results_df['needs_attention'] == True]` - Filters DataFrame to only rows where stress needs attention
* `traceback.print_exc()` - Prints full error traceback for debugging if unexpected exception occurs
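
The batching arithmetic from the list above (ceiling division plus a `min()` clamp on the last batch) can be checked on its own. This is a minimal sketch; `batch_bounds` is a hypothetical helper name introduced only for illustration.

```python
def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    num_batches = (total + batch_size - 1) // batch_size  # ceiling division
    for batch_num in range(num_batches):
        start = batch_num * batch_size
        end = min(start + batch_size, total)  # the last batch may be shorter
        yield start, end


# 250 statements in batches of 100: two full batches and one of 50
print(list(batch_bounds(250, 100)))
```

Because the slice `statements[start:end]` never reaches past `total`, no statement is processed twice and none is skipped.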


**Inputs:**

This cell requires:
* A CSV file named in `CSV_FILE` variable (default: "Combined Data.csv") located in the same directory as the notebook
* The CSV must contain a column matching `STATEMENT_COL` (default: "statement") with patient text data
* The stress identification system must be loaded and functional (from previous cells)
* Optional configuration: `MAX_ROWS` to limit processing, `BATCH_SIZE` to control memory usage and progress update frequency


**Outputs:**

**Console Output**:
* Loading confirmation showing filename and total rows
* Real-time progress updates showing batch number, percentage complete, and statements processed
* Summary statistics including total patients, stress detected count and percentage, attention needed count and percentage
* First 5 stress cases with truncated statements, stress probability, and predicted class
* Confirmation messages for saved files with filenames and case counts

**File Output**:
* `stress_analysis_all_[timestamp].csv` - Complete results for all patients with columns: patient_id, statement, predicted_class, predicted_label, is_stress, stress_probability, above_threshold, needs_attention
* `stress_cases_only_[timestamp].csv` - Filtered results containing only patients flagged for attention (stress detected or above threshold)


**Code Flow:**

The cell begins by setting configuration variables and printing a header. It attempts to load the CSV file using pandas, validating that the file exists and the specified column is present. If `MAX_ROWS` is set, it limits the dataset. The cell extracts all statements as a string list, then calculates how many batches are needed based on `BATCH_SIZE`. It enters a loop processing each batch: slicing the statements list, calling `identify_stress()` on the batch, collecting results, and updating progress display. After all batches complete, it converts results to a DataFrame, adds patient IDs, and calculates summary statistics (total, stress count, attention needed). It displays the first 5 stress cases as examples, then saves two CSV files - one with all results and one with only stress cases, both with timestamped filenames. Comprehensive error handling catches file not found errors and other exceptions, providing helpful troubleshooting messages.


**Comments and Observations:**

Batch processing is critical for large datasets to avoid memory errors and provide progress feedback. A `BATCH_SIZE` of 100 works well for most systems - increase to 200-500 for faster processing if you have sufficient RAM, or decrease to 32-64 for memory-constrained environments. Setting `return_all_probs=False` speeds up processing and keeps the results compact, since full probability distributions aren't needed for batch analysis. The `MAX_ROWS` parameter is useful for testing on a subset before running the full dataset - try `MAX_ROWS = 100` first to verify everything works. The progress indicator updates in real-time using `end='\r'`, which overwrites the same line, giving immediate feedback on long-running analyses.

Timestamped output files prevent accidentally overwriting previous results and create a historical record of analyses. The two output files serve different purposes: the "all" file is comprehensive for record-keeping and further analysis, while the "stress_cases_only" file provides a focused list for clinical follow-up. Patient IDs are added as sequential numbers starting from 1, but you can modify this to use an ID column from your CSV if one exists (e.g., `results_df['patient_id'] = df['patient_id'].values`).

The first 5 stress cases preview helps you quickly assess if the model is performing correctly - you should see reasonable stress detections. If many non-stress statements appear in the preview, consider retraining the model or adjusting the threshold.

Processing time depends on dataset size, batch size, and hardware. As a rough guide, expect 1-2 seconds per 100 statements on CPU, or 0.2-0.5 seconds per 100 on GPU. At those rates, a dataset of 10,000 statements takes roughly 2-3 minutes on CPU or 20-50 seconds on GPU.

The error handling distinguishes between file not found errors (which get specific instructions) and other errors (which print a full traceback for debugging). Always check that your CSV column name matches `STATEMENT_COL` exactly - column name mismatches are the most common error.
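
One detail worth checking before a full run is missing text. Depending on the pandas version, `astype(str)` can turn an empty CSV cell into the literal text "nan", which is then classified like any other statement. The sketch below shows one possible way to skip such rows while remembering which original rows were kept. The sample data and the `clean` and `row_ids` names are assumptions for illustration, not part of the cell below.

```python
import numpy as np
import pandas as pd

# Hypothetical input with one missing and one whitespace-only statement
df = pd.DataFrame(
    {"statement": ["I feel tense all day", np.nan, "   ", "Sleeping well lately"]}
)

# Drop missing rows first, then rows that contain only whitespace
clean = df.dropna(subset=["statement"])
clean = clean[clean["statement"].astype(str).str.strip() != ""]

statements = clean["statement"].astype(str).tolist()
row_ids = clean.index.tolist()  # original row positions, usable as patient IDs

print(statements)
print(row_ids)
```

Keeping `row_ids` matters because sequential IDs assigned after filtering would no longer line up with the rows of the source file.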

In [None]:
# ============================================================================
# Analyze CSV File for Stress Cases
# ============================================================================
# This cell reads your CSV file and identifies all patients with stress
# ============================================================================

import pandas as pd
from pathlib import Path
from datetime import datetime

# Configuration
CSV_FILE = "Combined Data.csv"  # Your CSV file name
STATEMENT_COL = "statement"      # Column name with patient statements
MAX_ROWS = None                  # Set to number (e.g., 1000) to limit, or None for all
BATCH_SIZE = 100                 # Process this many statements at a time (lower = more progress updates, higher = faster)

print("=" * 80)
print("ANALYZING CSV FILE FOR STRESS CASES")
print("=" * 80)

try:
    # Load CSV
    csv_file = Path(CSV_FILE)
    if not csv_file.exists():
        raise FileNotFoundError(f"File not found: {CSV_FILE}")

    print(f"Loading: {csv_file}")
    df = pd.read_csv(csv_file)
    print(f"Loaded {len(df)} rows")

    # Check column exists
    if STATEMENT_COL not in df.columns:
        print(f"Available columns: {list(df.columns)}")
        raise ValueError(f"Column '{STATEMENT_COL}' not found")

    # Limit rows if specified
    if MAX_ROWS and MAX_ROWS < len(df):
        df = df.head(MAX_ROWS)
        print(f"Limited to {MAX_ROWS} rows")

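    # Get statements
    statements = df[STATEMENT_COL].astype(str).tolist()
    total_statements = len(statements)
    if total_statements == 0:
        # An empty file would otherwise fail later with a confusing KeyError
        raise ValueError("The CSV file contains no rows to analyze")
    print(f"\nAnalyzing {total_statements} patient statements...")
    print("(Processing in batches for better performance)\n")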

    # Process in batches to avoid memory issues and show progress
    all_results = []

    num_batches = (total_statements + BATCH_SIZE - 1) // BATCH_SIZE

    for batch_num in range(num_batches):
        start_idx = batch_num * BATCH_SIZE
        end_idx = min(start_idx + BATCH_SIZE, total_statements)
        batch_statements = statements[start_idx:end_idx]

        # Analyze this batch
        batch_results = identify_stress(batch_statements, return_all_probs=False)
        all_results.extend(batch_results)

        # Show progress
        progress = (batch_num + 1) / num_batches * 100
        print(f"Progress: {batch_num + 1}/{num_batches} batches ({progress:.1f}%) - {end_idx}/{total_statements} statements", end='\r')

    print(f"\n✅ Completed analysis of {total_statements} statements!")

    # Convert all results to DataFrame
    results_df = pd.DataFrame(all_results)

    # Add patient IDs
    results_df.insert(0, 'patient_id', range(1, len(results_df) + 1))

    # Move the statement column to position 1, right after patient_id
    if 'statement' in results_df.columns:
        stmt = results_df['statement'].values.copy()
        results_df = results_df.drop(columns=['statement'])
        results_df.insert(1, 'statement', stmt)

    # Calculate statistics
    total = len(results_df)
    stress = results_df['is_stress'].sum()
    attention = results_df['needs_attention'].sum()

    print("=" * 80)
    print("RESULTS SUMMARY")
    print("=" * 80)
    print(f"Total Patients: {total}")
    print(f"Stress Detected: {stress} ({stress/total*100:.1f}%)")
    print(f"Need Attention: {attention} ({attention/total*100:.1f}%)")
    print("=" * 80)

    # Get stress cases
    stress_cases = filter_stress_cases(all_results)
    print(f"\nTotal Stress Cases: {len(stress_cases)}")

    # Show examples
    if stress_cases:
        print("\nFirst 5 stress cases:")
        for i, case in enumerate(stress_cases[:5], 1):
            stmt = case['statement'][:70] + "..." if len(case['statement']) > 70 else case['statement']
            print(f"{i}. [{case['stress_probability']:.1%}] {stmt}")
            print(f"   Class: {case['predicted_label']}")

    # Save results
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

    all_file = f"stress_analysis_all_{timestamp}.csv"
    results_df.to_csv(all_file, index=False)
    print(f"\n✅ Saved all results: {all_file}")

    if len(stress_cases) > 0:
        stress_df = results_df[results_df['needs_attention'] == True]
        stress_file = f"stress_cases_only_{timestamp}.csv"
        stress_df.to_csv(stress_file, index=False)
        print(f"✅ Saved stress cases: {stress_file} ({len(stress_df)} cases)")

    print("\n" + "=" * 80)
    print("ANALYSIS COMPLETE!")
    print("=" * 80)

except FileNotFoundError as e:
    print(f"\n❌ Error: {e}")
    print("\nMake sure:")
    print("1. The CSV file is in the same folder as this notebook")
    print("2. The file name is correct")

except Exception as e:
    print(f"\n❌ Error: {e}")
    import traceback
    traceback.print_exc()


## Export All Stress Cases to CSV File

**Function Description:**

This cell extracts all patients identified with stress from analysis results and exports them to a dedicated CSV file. It intelligently searches for results in multiple locations (memory, specified file, or most recent auto-saved file), filters for stress cases using multiple criteria, sorts by stress probability, generates detailed statistics and breakdowns, and saves a clean, focused dataset of only stress-positive patients for clinical review and follow-up.


**Syntax Explanation:**

* `RESULTS_FILE = "stress_analysis_all_20241117_123456.csv"` - Optional variable to specify a saved results file to load (commented out by default)
* `if 'results_df' in globals()` - Checks if results DataFrame exists in global scope from previous analysis
* `results_df.copy()` - Creates a copy of the DataFrame to avoid modifying the original
* `elif 'RESULTS_FILE' in locals() and RESULTS_FILE:` - Checks if user uncommented and set a specific results file
* `glob.glob("stress_analysis_all_*.csv")` - Searches for all files matching the pattern with wildcard
* `max(result_files, key=lambda x: Path(x).stat().st_mtime)` - Finds most recently modified file using modification timestamp
* `Path(x).stat().st_mtime` - Gets file modification time for comparison
* `pd.read_csv(latest_file)` - Loads the automatically detected most recent results file
* `df_to_use[(df_to_use['is_stress'] == True) | (df_to_use['needs_attention'] == True) | (df_to_use['above_threshold'] == True)]` - Filters DataFrame using OR logic across three stress criteria
* `|` operator - Logical OR for pandas boolean indexing (combines multiple conditions)
* `.copy()` - Creates copy of filtered DataFrame to avoid SettingWithCopyWarning
* `stress_df.sort_values('stress_probability', ascending=False)` - Sorts by stress probability in descending order (highest risk first)
* `stress_count / total_patients * 100` - Calculates percentage of patients with stress
* `if total_patients > 0 else 0` - Ternary operator to avoid division by zero
* `stress_df['predicted_label'].value_counts()` - Counts occurrences of each predicted class label
* `class_counts.items()` - Iterates through class names and their counts
* `stress_df['stress_probability'].mean()` - Calculates average stress probability across all stress cases
* `.min()`, `.max()`, `.median()` - Statistical functions for probability distribution analysis
* `datetime.now().strftime('%Y%m%d_%H%M%S')` - Generates timestamp string in YYYYMMDD_HHMMSS format
* `output_file = f"stress_cases_identified_{timestamp}.csv"` - Creates timestamped filename for export
* `columns_to_save = [...]` - Defines ordered list of columns to include in export
* `[col for col in columns_to_save if col in stress_df.columns]` - List comprehension filtering only existing columns
* `stress_df[available_columns].to_csv(output_file, index=False)` - Saves selected columns to CSV without row indices
* `stress_df.head(5).iterrows()` - Iterates through first 5 rows with index and row data
* `row.get('patient_id', 'N/A')` - Safely retrieves patient_id with default value if missing
* `stmt[:100] + "..."` - Truncates statements longer than 100 characters for display
* `except NameError as e:` - Catches errors when required variables don't exist
* `except FileNotFoundError as e:` - Catches errors when specified file cannot be found
* `traceback.print_exc()` - Prints full error traceback for debugging unexpected exceptions
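
The auto-detect step (`glob.glob()` plus `max()` keyed on modification time) can be tried in a scratch directory. In this sketch `latest_matching` is a hypothetical helper and the two files are throwaway placeholders created only for the demonstration.

```python
import glob
import os
import tempfile
import time
from pathlib import Path


def latest_matching(pattern):
    """Return the most recently modified path matching pattern, or None."""
    matches = glob.glob(pattern)
    if not matches:
        return None
    return max(matches, key=lambda p: Path(p).stat().st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    older = Path(tmp) / "stress_analysis_all_20240101_000000.csv"
    newer = Path(tmp) / "stress_analysis_all_20240102_000000.csv"
    older.write_text("patient_id\n")
    newer.write_text("patient_id\n")

    # Set modification times explicitly so the outcome does not depend on timing
    now = time.time()
    os.utime(older, (now - 60, now - 60))
    os.utime(newer, (now, now))

    found = latest_matching(str(Path(tmp) / "stress_analysis_all_*.csv"))
    print(Path(found).name)
```

Note that the choice is based on when the file was last modified, not on the timestamp in its name, so re-saving an old results file makes it the "latest" one.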


**Inputs:**

This cell requires results from a previous stress analysis, which it attempts to find using three strategies in priority order:
1. **Memory**: Uses `results_df` from the most recent analysis cell execution
2. **Specified File**: Uses file path defined in `RESULTS_FILE` variable if uncommented
3. **Auto-detect**: Searches for most recent `stress_analysis_all_*.csv` file in current directory

No manual input required if analysis was just run. To use a specific file, uncomment the `RESULTS_FILE` line and set the filename.


**Outputs:**

**Console Output**:
* Source confirmation (memory, specified file, or auto-detected file)
* Summary statistics showing total patients analyzed, stress cases count, and percentage
* Breakdown by predicted class showing distribution of mental health categories among stress cases
* Stress probability statistics including average, minimum, maximum, and median values
* First 5 stress cases as examples with patient ID, truncated statement, stress probability, and predicted class
* File save confirmation with filename, total cases, and included columns

**File Output**:
* `stress_cases_identified_[timestamp].csv` - Contains only patients flagged with stress, sorted by stress probability (highest first), with columns: patient_id, statement, predicted_label, predicted_class, stress_probability, is_stress, above_threshold, needs_attention


**Code Flow:**

The cell starts by checking three locations for results data in priority order: first checking global memory for `results_df`, then checking if user specified `RESULTS_FILE`, and finally auto-detecting the most recent saved results file using glob pattern matching. Once data is loaded, it filters the DataFrame using OR logic to include any row where `is_stress`, `needs_attention`, or `above_threshold` is True. The filtered stress cases are sorted by stress probability in descending order to prioritize highest-risk patients. Statistical summaries are calculated and printed including total counts, percentages, class distribution breakdown, and probability statistics. The cell defines a column ordering for readable output, filters to only available columns, and saves to a timestamped CSV file. Finally, it displays the first 5 stress cases as examples with truncated statements for quick verification. Comprehensive error handling catches missing variables, file not found errors, and other exceptions with helpful troubleshooting messages.
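
The filter-then-sort step can be reproduced on a small table. The sketch below assumes boolean flag columns with the same names as the analysis output. The four rows are invented for illustration.

```python
import pandas as pd

# Invented results in the shape the analysis cell produces
df = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "is_stress": [True, False, False, False],
        "above_threshold": [True, True, False, False],
        "needs_attention": [True, True, False, False],
        "stress_probability": [0.36, 0.81, 0.05, 0.12],
    }
)

# A row is kept if ANY of the three flags is True
mask = df["is_stress"] | df["needs_attention"] | df["above_threshold"]

# Highest stress probability first
stress_df = df[mask].sort_values("stress_probability", ascending=False).copy()
print(stress_df[["patient_id", "stress_probability"]])
```

Patient 2 is kept even though stress was not the top prediction, and it sorts ahead of patient 1 because its stress probability is higher.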


**Comments and Observations:**

The three-tiered loading strategy makes this cell very flexible - it works whether you just ran the analysis, want to reprocess an older file, or forgot which file to use. Auto-detection using `glob.glob()` with `max()` on modification time is particularly helpful when you have multiple result files and want the latest one without remembering the exact timestamp.

The stress filtering uses OR logic `(condition1) | (condition2) | (condition3)`, which is more inclusive than AND logic - a patient is included if they meet ANY of the three stress criteria, not all three. This catches edge cases where stress probability is high but another condition was predicted, or where the threshold was exceeded but stress wasn't the top prediction. Sorting by stress probability puts the highest-risk patients first, making the output immediately actionable for clinical triage - start reviewing from the top of the file.

The class breakdown is valuable for understanding which other predicted conditions appear alongside stress - you might see many "Anxiety" predictions with high stress probability, which suggests the two overlap in the model's predictions. Probability statistics help you understand the confidence distribution: if the average is 65% and minimum is 35%, that's different from average 45% and minimum 31% - the first suggests stronger, more confident detections.

The timestamp ensures you never accidentally overwrite exports, and creates an audit trail of when analyses were performed. Column ordering in `columns_to_save` is deliberately chosen to put the most important information first (patient_id, statement) followed by predictions and flags. The `available_columns` filtering prevents errors if your results file is missing some columns (perhaps from an older version of the analysis code). If no stress cases are found, the cell gracefully reports this rather than erroring - in a healthy population sample, it's possible to have zero stress detections.

For large datasets with thousands of stress cases, consider adding `MAX_EXPORT = 1000` to limit the export file size, or create multiple files by stress probability ranges (high risk, medium risk, etc.). The 5-example preview lets you quickly sanity-check that the filtering worked correctly - you should see statements with clear stress indicators in the examples.
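
Splitting the export by probability range could look like the sketch below. The band edges (0.3, 0.5, 0.7) and the `risk_band` column are assumptions chosen for illustration, and the sample rows are invented. Adjust both to your own triage policy.

```python
import pandas as pd

# Invented stress cases (not real model output)
stress_df = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "stress_probability": [0.92, 0.55, 0.31, 0.74],
    }
)

# Left-closed bands: [0.3, 0.5) low, [0.5, 0.7) moderate, [0.7, inf) high.
# An open-ended top band keeps a probability of exactly 1.0 inside "high".
stress_df["risk_band"] = pd.cut(
    stress_df["stress_probability"],
    bins=[0.3, 0.5, 0.7, float("inf")],
    labels=["low", "moderate", "high"],
    right=False,
)

bands = stress_df["risk_band"].astype(str).tolist()
counts = stress_df["risk_band"].value_counts().to_dict()
print(bands)
print(counts)
```

Each band can then be written to its own file by filtering on `risk_band` and calling `to_csv()` on the subset, in the same way the cell below saves the combined export.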

In [None]:
# ============================================================================
# Export All Stress Cases to CSV File
# ============================================================================
# This cell extracts all patients identified with stress and saves to CSV
# ============================================================================

import pandas as pd
from pathlib import Path
from datetime import datetime
import glob

# Option 1: Use results from memory (if you just ran the analysis cell above)
# Option 2: Load from a saved results file (uncomment and set the filename)
# RESULTS_FILE = "stress_analysis_all_20241117_123456.csv"  # Change to your file name

print("=" * 80)
print("EXPORTING STRESS CASES TO CSV")
print("=" * 80)

try:
    # Try to use results from memory first
    if 'results_df' in globals() and results_df is not None:
        print("✅ Using results from memory (from previous analysis)")
        df_to_use = results_df.copy()
    elif 'RESULTS_FILE' in locals() and RESULTS_FILE:
        # Load from saved file
        print(f"📂 Loading results from: {RESULTS_FILE}")
        df_to_use = pd.read_csv(RESULTS_FILE)
        print(f"✅ Loaded {len(df_to_use)} records")
    else:
        # Try to find the most recent results file
        result_files = glob.glob("stress_analysis_all_*.csv")
        if result_files:
            # Get the most recent file
            latest_file = max(result_files, key=lambda x: Path(x).stat().st_mtime)
            print(f"📂 Found recent results file: {latest_file}")
            df_to_use = pd.read_csv(latest_file)
            print(f"✅ Loaded {len(df_to_use)} records")
        else:
            raise ValueError("No results found. Please run the analysis cell first or specify RESULTS_FILE.")

    # Filter for stress cases
    # A patient has stress if: is_stress=True OR needs_attention=True OR above_threshold=True
    stress_df = df_to_use[
        (df_to_use['is_stress'] == True) |
        (df_to_use['needs_attention'] == True) |
        (df_to_use['above_threshold'] == True)
    ].copy()

    # Sort by stress probability (highest first)
    if 'stress_probability' in stress_df.columns:
        stress_df = stress_df.sort_values('stress_probability', ascending=False)

    # Summary
    total_patients = len(df_to_use)
    stress_count = len(stress_df)
    stress_percentage = (stress_count / total_patients * 100) if total_patients > 0 else 0

    print("\n" + "=" * 80)
    print("STRESS CASES SUMMARY")
    print("=" * 80)
    print(f"Total Patients Analyzed: {total_patients}")
    print(f"Patients with Stress: {stress_count} ({stress_percentage:.1f}%)")
    print("=" * 80)

    if len(stress_df) > 0:
        # Show breakdown by predicted class
        if 'predicted_label' in stress_df.columns:
            print("\nBreakdown by Predicted Class:")
            class_counts = stress_df['predicted_label'].value_counts()
            for class_name, count in class_counts.items():
                print(f"  {class_name}: {count}")

        # Show stress probability statistics
        if 'stress_probability' in stress_df.columns:
            print(f"\nStress Probability Statistics:")
            print(f"  Average: {stress_df['stress_probability'].mean():.1%}")
            print(f"  Minimum: {stress_df['stress_probability'].min():.1%}")
            print(f"  Maximum: {stress_df['stress_probability'].max():.1%}")
            print(f"  Median: {stress_df['stress_probability'].median():.1%}")

        # Save to CSV
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        output_file = f"stress_cases_identified_{timestamp}.csv"

        # Select columns to save (in a readable order)
        columns_to_save = [
            'patient_id',
            'statement',
            'predicted_label',
            'predicted_class',
            'stress_probability',
            'is_stress',
            'above_threshold',
            'needs_attention'
        ]

        # Only include columns that exist
        available_columns = [col for col in columns_to_save if col in stress_df.columns]
        stress_df[available_columns].to_csv(output_file, index=False)

        print(f"\n✅ Stress cases saved to: {output_file}")
        print(f"   Total stress cases: {len(stress_df)}")
        print(f"   Columns: {', '.join(available_columns)}")

        # Show first few examples
        print("\n" + "=" * 80)
        print("SAMPLE STRESS CASES (First 5)")
        print("=" * 80)
        for idx, row in stress_df.head(5).iterrows():
            print(f"\n[Patient {row.get('patient_id', 'N/A')}]")
            stmt = row['statement']
            if len(stmt) > 100:
                stmt = stmt[:100] + "..."
            print(f"Statement: {stmt}")
            print(f"Stress Probability: {row.get('stress_probability', 0):.1%}")
            print(f"Predicted Class: {row.get('predicted_label', 'N/A')}")

        print("\n" + "=" * 80)
        print("EXPORT COMPLETE!")
        print("=" * 80)
        print(f"\nüìÑ File saved: {output_file}")
        print(f"üìä Total stress cases: {stress_count}")

    else:
        print("\n‚ö†Ô∏è  No stress cases found in the analyzed data.")
        print("   All patients appear to be stress-free.")

except NameError as e:
    print(f"\n❌ Error: {e}")
    print("\nPlease run the analysis cell first, or specify a RESULTS_FILE to load from.")

except FileNotFoundError as e:
    print(f"\n❌ Error: File not found - {e}")
    print("\nPlease check the file path and try again.")

except Exception as e:
    print(f"\n❌ Error occurred: {e}")
    import traceback
    traceback.print_exc()


## User Input: Check Stress Level from Statement

Enter a patient statement below to check if the patient is experiencing stress and see the stress probability percentage.


In [None]:
# ============================================================================
# User Input: Check Stress Level from Statement
# ============================================================================
# Enter a patient statement to check stress level and probability
# ============================================================================

def check_stress_from_statement(statement, model=None, tokenizer=None):
    """
    Check if a patient statement indicates stress and return probability.

    Parameters:
    -----------
    statement : str
        Patient statement to analyze
    model : optional
        Trained model (uses best model if available)
    tokenizer : optional
        Tokenizer (uses global tokenizer if available)

    Returns:
    --------
    dict : Results with stress detection and probability
    """
    # Try to get model and tokenizer
    if model is None:
        if 'best_trainer' in globals():
            model = best_trainer.model
        elif 'stress_model' in globals():
            model = stress_model
        else:
            raise ValueError("No model found. Please train a model first or load one.")

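    if tokenizer is None:
        # Look the global up explicitly: inside this function the bare name
        # `tokenizer` is the (None) parameter, not the notebook-level tokenizer.
        if globals().get('tokenizer') is not None:
            tokenizer = globals()['tokenizer']
        elif 'stress_tokenizer' in globals():
            tokenizer = stress_tokenizer
        else:
            raise ValueError("No tokenizer found. Please load tokenizer first.")

    # Ensure model is on correct device and in eval mode.
    # Reuse the notebook-level `device` if it exists. Always binding the local
    # name avoids an UnboundLocalError when the global is already defined.
    device = globals().get('device')
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()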

    # Label mapping
    LABEL_MAP = {
        0: "Anxiety",
        1: "Bipolar",
        2: "Depression",
        3: "Normal",
        4: "Personality disorder",
        5: "Stress",
        6: "Suicidal"
    }
    STRESS_LABEL = 5

    # Tokenize statement
    encoded = tokenizer(
        statement,
        padding=True,
        truncation=True,
        max_length=160,
        return_tensors="pt"
    ).to(device)

    # Get prediction
    with torch.no_grad():
        outputs = model(**encoded)
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=-1).cpu().numpy()[0]
        prediction = np.argmax(probabilities)

    # Extract stress probability
    stress_probability = probabilities[STRESS_LABEL]
    is_stress = prediction == STRESS_LABEL

    # Determine stress level
    if stress_probability >= 0.7:
        stress_level = "HIGH STRESS"
        recommendation = "⚠️  Patient shows high signs of stress. Immediate attention recommended."
    elif stress_probability >= 0.5:
        stress_level = "MODERATE STRESS"
        recommendation = "⚠️  Patient shows moderate signs of stress. Monitoring recommended."
    elif stress_probability >= 0.3:
        stress_level = "LOW STRESS"
        recommendation = "⚠️  Patient shows some signs of stress. Regular check-ins recommended."
    else:
        stress_level = "NO SIGNIFICANT STRESS"
        recommendation = "✓ Patient appears to have low stress levels."

    result = {
        'statement': statement,
        'is_stress': is_stress,
        'stress_probability': float(stress_probability),
        'stress_percentage': float(stress_probability * 100),
        'stress_level': stress_level,
        'predicted_class': int(prediction),
        'predicted_label': LABEL_MAP[int(prediction)],
        'all_probabilities': {LABEL_MAP[i]: float(prob) for i, prob in enumerate(probabilities)},
        'recommendation': recommendation
    }

    return result

def display_stress_result(result):
    """Display stress analysis result in a user-friendly format."""
    print("=" * 80)
    print("STRESS ANALYSIS RESULT")
    print("=" * 80)
    print(f"\nPatient Statement:")
    print(f'  "{result["statement"]}"')
    print("\n" + "-" * 80)
    print(f"STRESS PROBABILITY: {result['stress_percentage']:.1f}%")
    print(f"Stress Level: {result['stress_level']}")
    print("-" * 80)

    if result['is_stress']:
        print("🔴 STRESS DETECTED - Primary Prediction")
    else:
        print("🟢 No stress detected as primary prediction")

    print(f"\nPredicted Mental Health Status: {result['predicted_label']}")
    print(f"\n{result['recommendation']}")

    print("\n" + "=" * 80)
    print("DETAILED PROBABILITIES")
    print("=" * 80)
    print("All class probabilities:")
    for label, prob in sorted(result['all_probabilities'].items(),
                             key=lambda x: x[1], reverse=True):
        marker = " ← PREDICTED" if label == result['predicted_label'] else ""
        print(f"  {label:25s}: {prob*100:5.1f}%{marker}")
    print("=" * 80)

# ============================================================================
# USER INPUT SECTION
# ============================================================================
# Enter your patient statement here
# ============================================================================

# Option 1: Interactive prompt (disabled by default; change False to True to be asked for input)
patient_statement = input("Enter patient statement: ") if False else None

# Option 2: Set statement directly (uncomment and modify)
# patient_statement = "I feel overwhelmed and cannot cope with my workload. My chest feels tight and I can't sleep."

# If no statement provided, prompt user
if patient_statement is None or patient_statement.strip() == "":
    print("=" * 80)
    print("STRESS DETECTION SYSTEM")
    print("=" * 80)
    print("\nPlease enter a patient statement to analyze for stress.")
    print("\nYou can:")
    print("  1. Modify the 'patient_statement' variable in this cell")
    print("  2. Or use the function directly: result = check_stress_from_statement('your statement here')")
    print("\nExample:")
    print('  result = check_stress_from_statement("I feel very stressed and overwhelmed")')
    print('  display_stress_result(result)')
    print("=" * 80)
else:
    # Analyze the statement
    try:
        result = check_stress_from_statement(patient_statement)
        display_stress_result(result)
    except Exception as e:
        print(f"❌ Error: {e}")
        import traceback
        traceback.print_exc()
        print("\nMake sure you have:")
        print("  1. Trained a model (run training cells)")
        print("  2. Or loaded a saved model")
        print("  3. Tokenizer is available")


### Quick Test: Enter Statement Here

Simply modify the statement below and run the cell to get instant stress analysis.
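
The dictionary returned by `check_stress_from_statement` can also be inspected directly instead of going through `display_stress_result`. The following is a minimal sketch that assumes the keys visible in the function's return value; the label names, the probability values, and the `top_k_labels` helper are made up for illustration and are not part of the notebook.

```python
# Hypothetical result shaped like the dictionary returned by
# check_stress_from_statement; all values and label names are made up.
example_result = {
    "statement": "I feel overwhelmed and cannot cope with my workload.",
    "is_stress": True,
    "stress_probability": 0.72,
    "stress_percentage": 72.0,
    "predicted_label": "Stress",
    "all_probabilities": {"Normal": 0.08, "Stress": 0.72, "Anxiety": 0.20},
}


def top_k_labels(result, k=2):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(result["all_probabilities"].items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:k]


for label, prob in top_k_labels(example_result):
    print(f"{label}: {prob * 100:.1f}%")
```

Looking at the runner-up class as well as the top prediction can be useful when the two probabilities are close.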


In [None]:
# ============================================================================
# QUICK STRESS CHECK - Enter Your Statement Here
# ============================================================================
# Just modify the statement below and run this cell!
# ============================================================================

# 👇 ENTER PATIENT STATEMENT HERE 👇
patient_statement = "I feel overwhelmed and cannot cope with my workload. My chest feels tight and I can't sleep."

# ============================================================================
# (No need to modify anything below this line)
# ============================================================================

try:
    # Check if functions are available
    if 'check_stress_from_statement' not in globals():
        print("⚠️  Please run the previous cell (Cell 42) first to load the functions.")
    else:
        # Analyze the statement
        result = check_stress_from_statement(patient_statement)
        display_stress_result(result)

        # Additional summary
        print("\n" + "=" * 80)
        print("QUICK SUMMARY")
        print("=" * 80)
        print(f"Stress Detected: {'YES' if result['is_stress'] else 'NO'}")
        print(f"Stress Probability: {result['stress_percentage']:.1f}%")
        print(f"Stress Level: {result['stress_level']}")
        print("=" * 80)

except NameError as e:
    print(f"❌ Error: {e}")
    print("\nPlease make sure:")
    print("  1. You have run Cell 42 to load the functions")
    print("  2. You have a trained model available")
    print("  3. The model and tokenizer are loaded")

except Exception as e:
    print(f"❌ Error: {e}")
    import traceback
    traceback.print_exc()


## Final Evaluation on Ground-Truth Test Set

This section provides functions to evaluate your trained model on the held-out ground-truth test set, giving an unbiased final estimate of its performance.
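
For multi-class data, the evaluation cell in this section reports precision, recall, and F1 with `average="weighted"`. As a rough illustration of what that averaging means, the sketch below computes the same quantities by hand in plain Python: each class's score is weighted by its share of the true labels. The `weighted_metrics` helper and the toy labels are assumptions made for this example only.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall and F1, computed by hand."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = support[label]
        # Undefined ratios fall back to 0, mirroring zero_division=0
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / total        # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy labels, made up for illustration (not taken from the real test set)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

precision, recall, f1 = weighted_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

One consequence worth knowing: support-weighted recall always equals overall accuracy, so in the multi-class report those two numbers match by construction and only precision and F1 add new information.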


In [None]:
# ============================================================================
# Load Ground-Truth Test Set from CSV
# ============================================================================

import glob
from pathlib import Path

def load_ground_truth_test_set_from_csv(csv_file=None):
    """
    Load the ground truth test set from a CSV file.

    Parameters:
    -----------
    csv_file : str, optional
        Path to the CSV file. If None, automatically finds the most recent test set.

    Returns:
    --------
    tuple : (X_test, y_test) - Test statements and labels
    """
    if csv_file is None:
        # Automatically find the most recent ground truth test set
        test_files = glob.glob("Ground_Truth_Test_Set_Final_Version_*.csv")
        if not test_files:
            raise FileNotFoundError(
                "No ground truth test set CSV found. "
                "Please run Cell 5 to create the test set first."
            )
        # Pick the file with the most recent modification time
        csv_file = max(test_files, key=lambda f: Path(f).stat().st_mtime)
        print(f"📂 Auto-detected test set: {csv_file}")

    # Load the CSV
    test_df = pd.read_csv(csv_file)

    # Validate required columns
    if 'statement' not in test_df.columns or 'status' not in test_df.columns:
        raise ValueError(f"CSV must contain 'statement' and 'status' columns. Found: {test_df.columns.tolist()}")

    X_test = test_df['statement'].values
    y_test = test_df['status'].values

    print(f"✅ Loaded ground truth test set: {len(X_test)} samples")
    print(f"   File: {csv_file}")

    return X_test, y_test

# Load the test set (uses in-memory X_test, y_test if available, otherwise loads from CSV)
if 'X_test' in globals() and 'y_test' in globals():
    print("✅ Using in-memory test set (from Cell 5)")
    print(f"   Test set size: {len(X_test)} samples")
else:
    print("⚠️  In-memory test set not found. Loading from CSV...")
    try:
        X_test, y_test = load_ground_truth_test_set_from_csv()
    except Exception as e:
        print(f"❌ Error loading test set: {e}")
        print("\nMake sure you have:")
        print("  1. Run Cell 5 to create the test set, OR")
        print("  2. Have a Ground_Truth_Test_Set_Final_Version_*.csv file available")
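

Auto-detecting the test set comes down to choosing, among the files that match a glob pattern, the one with the latest modification time. The sketch below shows that selection in isolation using only the standard library. The `newest_file` helper and the `_1`/`_2` file suffixes are hypothetical; the files are created in a temporary directory purely for demonstration.

```python
import os
import tempfile
from pathlib import Path


def newest_file(directory, pattern):
    """Return the path matching `pattern` with the latest mtime, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical test-set files with explicitly different timestamps
    old = Path(tmp) / "Ground_Truth_Test_Set_Final_Version_1.csv"
    new = Path(tmp) / "Ground_Truth_Test_Set_Final_Version_2.csv"
    for path in (old, new):
        path.write_text("statement,status\n")
    os.utime(old, (1_000, 1_000))   # (access time, modification time)
    os.utime(new, (2_000, 2_000))

    picked = newest_file(tmp, "Ground_Truth_Test_Set_Final_Version_*.csv")
    picked_name = picked.name
    missing = newest_file(tmp, "no_such_file_*.csv")
    print(picked_name)
```

Returning `None` for an empty match leaves it to the caller to decide whether a missing file is an error, which is where a `FileNotFoundError` with a helpful message would be raised.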


In [None]:
# ============================================================================
# Final Evaluation on Ground-Truth Test Set
# ============================================================================

from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support,
    classification_report, confusion_matrix
)

def evaluate_on_ground_truth_test_set(model=None, tokenizer=None, X_test=None, y_test=None):
    """
    Evaluate the trained model on the held-out ground truth test set.

    Parameters:
    -----------
    model : optional
        Trained model. If None, uses best_trainer.model or best saved model.
    tokenizer : optional
        Tokenizer. If None, uses global tokenizer.
    X_test : optional
        Test statements. If None, uses global X_test.
    y_test : optional
        Test labels. If None, uses global y_test.

    Returns:
    --------
    dict : Evaluation metrics
    """
    # Get model
    if model is None:
        if 'best_trainer' in globals():
            model = best_trainer.model
            print("✅ Using best_trainer.model")
        elif 'best_name' in globals():
            save_dir = f"./best_model_{best_name}"
            try:
                model = AutoModelForSequenceClassification.from_pretrained(save_dir)
                print(f"✅ Loaded model from: {save_dir}")
            except Exception as e:
                # Chain the original error so the real cause stays visible
                raise ValueError("No trained model found. Please train a model first.") from e
        else:
            raise ValueError("No trained model found. Please train a model first.")

    # Get tokenizer
    if tokenizer is None:
        if 'tokenizer' in globals():
            # Read the global explicitly: the parameter of the same name shadows it here
            tokenizer = globals()['tokenizer']
        elif 'best_name' in globals():
            save_dir = f"./best_model_{best_name}"
            try:
                tokenizer = AutoTokenizer.from_pretrained(save_dir)
                print(f"✅ Loaded tokenizer from: {save_dir}")
            except Exception as e:
                raise ValueError("No tokenizer found.") from e
        else:
            raise ValueError("No tokenizer found.")

    # Get test data
    if X_test is None:
        if 'X_test' not in globals():
            raise ValueError("Test set not found. Please run Cell 49 to load it.")
        X_test = globals()['X_test']

    if y_test is None:
        if 'y_test' not in globals():
            raise ValueError("Test labels not found. Please run Cell 49 to load them.")
        y_test = globals()['y_test']

    # Ensure model is on correct device
    if 'device' not in globals():
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    else:
        device = globals()['device']
    model = model.to(device)
    model.eval()

    print("=" * 80)
    print("FINAL EVALUATION ON GROUND-TRUTH TEST SET")
    print("=" * 80)
    print(f"Test set size: {len(X_test)} samples")
    print(f"Device: {device}")
    print("=" * 80)

    # Tokenize test set
    print("\nTokenizing test set...")
    test_enc = tokenizer(
        list(X_test),
        padding=True,
        truncation=True,
        max_length=160,
        return_tensors="pt"
    ).to(device)

    # Get predictions
    print("Running inference...")
    predictions = []
    probabilities = []

    # Process in batches to avoid memory issues
    batch_size = 32
    with torch.no_grad():
        for i in range(0, len(X_test), batch_size):
            end_idx = min(i + batch_size, len(X_test))
            batch_input_ids = test_enc['input_ids'][i:end_idx]
            batch_attention_mask = test_enc['attention_mask'][i:end_idx]

            batch_enc = {
                'input_ids': batch_input_ids,
                'attention_mask': batch_attention_mask
            }

            outputs = model(**batch_enc)
            logits = outputs.logits
            batch_probs = torch.softmax(logits, dim=-1).cpu().numpy()
            batch_preds = np.argmax(batch_probs, axis=-1)

            predictions.extend(batch_preds)
            probabilities.extend(batch_probs)

            if (i // batch_size + 1) % 10 == 0:
                print(f"  Processed {end_idx}/{len(X_test)} samples...")

    predictions = np.array(predictions)
    probabilities = np.array(probabilities)

    # Calculate metrics
    print("\n" + "=" * 80)
    print("EVALUATION METRICS")
    print("=" * 80)

    # Overall metrics
    num_classes = len(np.unique(y_test))
    avg_type = "binary" if num_classes == 2 else "weighted"

    accuracy = accuracy_score(y_test, predictions)
    precision, recall, f1, support = precision_recall_fscore_support(
        y_test, predictions, average=avg_type, zero_division=0
    )

    print(f"\nOverall Performance:")
    print(f"  Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1 Score:  {f1:.4f}")

    # Per-class metrics
    print(f"\nPer-Class Performance:")
    precision_per_class, recall_per_class, f1_per_class, support_per_class = precision_recall_fscore_support(
        y_test, predictions, average=None, zero_division=0
    )

    # Get label names if available
    if 'le' in globals():
        label_names = le.classes_
    else:
        label_names = [f"Class {i}" for i in range(num_classes)]

    print(f"\n{'Class':<20} {'Precision':<12} {'Recall':<12} {'F1':<12} {'Support':<10}")
    print("-" * 80)
    for i, label in enumerate(label_names):
        print(f"{label:<20} {precision_per_class[i]:<12.4f} {recall_per_class[i]:<12.4f} "
              f"{f1_per_class[i]:<12.4f} {support_per_class[i]:<10}")

    # Confusion matrix
    print(f"\n{'=' * 80}")
    print("CONFUSION MATRIX")
    print("=" * 80)
    cm = confusion_matrix(y_test, predictions)
    print("\nRows = True labels, Columns = Predicted labels")
    print(f"\n{'':<15}", end="")
    for label in label_names:
        print(f"{label[:10]:<12}", end="")
    print()
    for i, label in enumerate(label_names):
        print(f"{label[:14]:<15}", end="")
        for j in range(len(label_names)):
            print(f"{cm[i, j]:<12}", end="")
        print(f"  (True: {support_per_class[i]})")

    # Detailed classification report
    print(f"\n{'=' * 80}")
    print("DETAILED CLASSIFICATION REPORT")
    print("=" * 80)
    print(classification_report(y_test, predictions, target_names=label_names, zero_division=0))

    # Store results
    results = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'per_class_metrics': {
            label: {
                'precision': precision_per_class[i],
                'recall': recall_per_class[i],
                'f1': f1_per_class[i],
                'support': support_per_class[i]
            }
            for i, label in enumerate(label_names)
        },
        'confusion_matrix': cm,
        'predictions': predictions,
        'probabilities': probabilities
    }

    print("=" * 80)
    print("✅ Final evaluation complete!")
    print("=" * 80)

    return results

# Run final evaluation
print("Ready to evaluate on ground truth test set.")
print("Run: results = evaluate_on_ground_truth_test_set()")
print("\nOr if you want to evaluate now, uncomment the line below:")
# results = evaluate_on_ground_truth_test_set()
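

Once `evaluate_on_ground_truth_test_set` has returned, the `per_class_metrics` entry of its result can be scanned for classes the model handles poorly. The sketch below assumes the dictionary layout built in the function above; the class names, the scores, and the `classes_below_recall` helper are made up for illustration.

```python
# Hypothetical excerpt of the dictionary returned by
# evaluate_on_ground_truth_test_set; class names and scores are made up.
example_results = {
    "accuracy": 0.81,
    "per_class_metrics": {
        "Normal": {"precision": 0.90, "recall": 0.93, "f1": 0.91, "support": 120},
        "Stress": {"precision": 0.70, "recall": 0.55, "f1": 0.62, "support": 40},
        "Anxiety": {"precision": 0.76, "recall": 0.80, "f1": 0.78, "support": 60},
    },
}


def classes_below_recall(results, threshold=0.6):
    """List (class, recall) pairs with recall under the threshold, lowest first."""
    low = [(name, metrics["recall"])
           for name, metrics in results["per_class_metrics"].items()
           if metrics["recall"] < threshold]
    return sorted(low, key=lambda pair: pair[1])


print(classes_below_recall(example_results))
```

Recall is used here because a missed case is usually the costlier error in a screening setting; the threshold of 0.6 is arbitrary and would need to be chosen for the application.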
