## 0) Setup (libraries and reproducibility)

**Function Description:**
This cell initializes the environment by importing all necessary libraries and setting up reproducibility controls. It loads tools for data manipulation, machine learning, and deep learning, then configures random seeds to ensure consistent results across multiple runs.

**Syntax Explanation:**
The imports follow a logical grouping pattern. Standard Python libraries like `os`, `math`, and `random` come first, followed by numerical computing tools (`numpy`, `pandas`), then the deep learning stack (`torch`, `datasets`, `transformers`), and finally the scikit-learn metric functions. The `AutoTokenizer` and `AutoModelForSequenceClassification` are convenience classes from Hugging Face that automatically detect and load the correct tokenizer and model architecture based on the checkpoint name you provide. The `TrainingArguments` class acts as a container for all training hyperparameters, while `Trainer` wraps the training loop and handles evaluation, logging, and checkpointing automatically. I set the random seed using `random.seed()`, `np.random.seed()`, `torch.manual_seed()`, and `torch.cuda.manual_seed_all()` so that Python, NumPy, and PyTorch (CPU and GPU) all start from the same random state. The device detection uses `torch.cuda.is_available()` to check for GPU availability.

**Inputs:**
This cell takes no external inputs. It operates on the Python environment itself, importing modules that are either built-in or installable via pip. The SEED value (42) is hardcoded as a constant.

**Outputs:**
You'll see a single print statement showing which device you're using - "cuda" if a GPU is available, "cpu" otherwise. This confirmation tells you whether training will be fast (GPU) or slow (CPU). For transformer models, GPU training is often 10-50x faster than CPU training, though the exact speedup depends on the hardware, batch size, and sequence length.

**Code Flow:**
The cell progresses from general to specific imports, starting with basic Python utilities and ending with specialized deep learning components. After imports, it sets reproducibility seeds across all random number generators. Finally, it detects and prints the compute device. This setup happens once at the beginning and affects all subsequent cells.

**Comments and Observations:**
Reproducibility is important for scientific experiments and debugging. Without setting seeds, you'd get a different classification-head initialization, different batch shuffling, and different dropout masks each time you run the notebook, making it hard to compare results. (The train/validation split in Section 2 is pinned separately through `random_state=42`.) Seeding narrows run-to-run variation but does not guarantee bit-identical results on a GPU, because some CUDA operations are non-deterministic. The seed value 42 is arbitrary but conventional in machine learning tutorials. GPU availability dramatically impacts training time - a full fine-tuning run that takes 3 hours on CPU might finish in 15 minutes on a GPU. If you're running on Google Colab, make sure you've enabled GPU in Runtime > Change runtime type > Hardware accelerator > GPU. The imports might take 10-30 seconds the first time you run them because Colab needs to load the libraries into memory.
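
As a minimal sketch of why seeding matters, the snippet below uses only Python's built-in `random` module; the same idea applies to the NumPy and PyTorch generators seeded in this cell. The helper name `draw_numbers` is made up for illustration and does not appear in the notebook.

```python
import random

def draw_numbers(seed, n=3):
    """Reseed Python's global RNG, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw_numbers(42)
second = draw_numbers(42)   # same seed, so the sequence repeats
other = draw_numbers(7)     # different seed, so the sequence changes

print(first == second)  # True
print(first == other)   # False
```

Re-running a cell without reseeding continues the existing random stream, which is why the seeds are set once at the top of the notebook.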

In [None]:

# Every import has an explanatory comment.
import os                         # file paths and environment checks
import math                       # math helpers (may be useful for schedules)
import random                     # Python's RNG for reproducibility
import numpy as np                # numerical arrays and metrics support
import pandas as pd               # data loading and manipulation
from pathlib import Path          # convenient and robust path handling

# Hugging Face / PyTorch stack (for transformer fine-tuning)
import torch                      # tensor and GPU utilities
from datasets import Dataset      # lightweight dataset wrapper around pandas
from transformers import (       # core HF components for tokenization and training
    AutoTokenizer,               # auto-loads the right tokenizer for a given model checkpoint
    AutoModelForSequenceClassification,  # classification head on top of a transformer
    TrainingArguments,           # training hyperparameters container
    Trainer                      # training loop helper (handles eval and logging)
)

# Metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Make runs reproducible (seed Python, NumPy, and PyTorch)
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Detect device once and print for visibility
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")  # shows 'cuda' when a GPU is available in Colab


Using device: cpu


## 1) Load Dataset

**Function Description:**
This cell handles the complete data loading pipeline, from file upload through data cleaning and label encoding. It prompts you to upload a CSV file, validates the required columns, removes any problematic rows, and converts text labels into numerical format that machine learning models can process.

**Syntax Explanation:**
The `files.upload()` function from `google.colab` opens a browser file picker that lets you select a CSV from your computer. I capture the uploaded file using `list(uploaded.keys())[0]`, which grabs the first filename from the dictionary returned by the upload function. The `Path` object wraps the location of the uploaded file; note that the `/content/` prefix is specific to Colab, so the cell as written only runs there. After loading with `pd.read_csv()`, I use `assert` to verify that both 'statement' and 'status' columns exist - if they don't, the code stops with an error message showing which columns are missing. The `dropna()` method removes any rows where either the text or label is missing, and `copy()` creates a new DataFrame to avoid pandas warnings about modifying views. Converting the statement column with `astype(str)` ensures all entries are strings, even if some got parsed as numbers. The `LabelEncoder` from sklearn automatically creates a mapping from unique text labels to integers (0, 1, 2, etc.) using `fit_transform()`. I temporarily store the encoded values in a new column, then replace the original status column and drop the temporary one.

**Inputs:**
You provide a CSV file through the browser upload dialog. The CSV must contain at least two columns: 'statement' with the text you want to classify (like "I feel overwhelmed and can't cope"), and 'status' with the mental health label (like "Stress", "Anxiety", "Normal"). The labels can be text strings or already-encoded numbers. If you have extra columns like 'Unnamed: 0' (a common artifact from saving DataFrames), they won't break anything.

**Outputs:**
You'll see several outputs: a confirmation message showing the file path, the label encoding map (which number represents which condition), the count of samples per class, and the first three rows of your cleaned dataset. The label encoding map is particularly important because you'll need it later to interpret predictions - if the model predicts "5", you need to know that means "Stress". The value counts reveal class imbalance, which affects how you should train your model.

**Code Flow:**
The flow moves through four distinct phases. First, file upload and path resolution. Second, loading and validation (checking for required columns). Third, data cleaning (removing nulls, ensuring correct data types). Fourth, label encoding (converting text to numbers) with the final reassignment of the status column. Each step depends on the previous one succeeding, which is why I use assertions for critical validations.

**Comments and Observations:**
Class imbalance is probably your biggest challenge here. If you see something like 16,343 Normal samples but only 1,077 Personality Disorder samples, your model will naturally bias toward predicting Normal because it sees that class about 15 times more often. This is why I use class weights in later sections. The `LabelEncoder` assigns numbers in sorted (alphabetical) order, so "Anxiety" becomes 0, "Bipolar" becomes 1, and so on. This ordering doesn't affect model performance but does affect how you read the results. Some datasets have text encoding issues (weird characters, emojis) that can cause problems during tokenization. If you see strange symbols in the data preview, you might need to add encoding='utf-8' or encoding='latin-1' to the `read_csv()` call. The label encoder will not raise an error if your status column has typos (like "Stres" vs "Stress"); it silently treats them as different classes, which adds spurious labels and splits the samples between them. Always check your label counts to catch these issues.
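
The sorted-order mapping and the typo problem can be sketched in plain Python. This is not how `LabelEncoder` is implemented, only an illustration that produces the same kind of mapping; the `raw_labels` list is a made-up toy sample, not data from the uploaded CSV.

```python
from collections import Counter

# Toy labels with one deliberate typo ("Stres")
raw_labels = ["Stress", "Anxiety", "Normal", "Stres", "Anxiety", "Stress"]

# Sorted unique labels, mirroring the order LabelEncoder stores in classes_
classes = sorted(set(raw_labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}
encoded = [label_to_id[label] for label in raw_labels]

print(label_to_id)          # the typo gets its own integer ID
print(encoded)
print(Counter(raw_labels))  # a count of 1 is a hint that a label is misspelled
```

A class with only a handful of samples in the value counts is usually the quickest signal that a label needs cleaning before encoding.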

In [None]:
# --- Load Dataset (Upload version, auto-encodes text labels) ---
import pandas as pd
from pathlib import Path
from google.colab import files

print("📂 Please upload your dataset CSV (e.g., Combined Data.csv)")
uploaded = files.upload()

# Automatically pick the first uploaded file
filename = list(uploaded.keys())[0]
csv_path = Path(f"/content/{filename}")

print(f"✅ File uploaded successfully: {csv_path}")

# Load the CSV
df = pd.read_csv(csv_path)

# --- Validate columns ---
expected_cols = {'statement', 'status'}
assert expected_cols.issubset(df.columns), f"❌ Missing required columns: {expected_cols - set(df.columns)}"

# --- Clean ---
df = df.dropna(subset=['statement', 'status']).copy()
df['statement'] = df['statement'].astype(str)

# --- Encode text labels into integers ---
# This maps each unique label (like 'Anxiety', 'Stress', etc.) to a numeric ID
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['status_encoded'] = le.fit_transform(df['status'])

# Optional: print mapping for your reference
print("🔤 Label encoding map:")
for code, label in enumerate(le.classes_):
    print(f"  {code} → {label}")

# Replace 'status' with the encoded version
df['status'] = df['status_encoded']
df.drop(columns=['status_encoded'], inplace=True)

print("\n✅ Dataset loaded and label-encoded successfully!")
print(df['status'].value_counts(dropna=False))
df.head(3)


📂 Please upload your dataset CSV (e.g., Combined Data.csv)


Saving Combined Data.csv to Combined Data (2).csv
✅ File uploaded successfully: /content/Combined Data (2).csv
🔤 Label encoding map:
  0 → Anxiety
  1 → Bipolar
  2 → Depression
  3 → Normal
  4 → Personality disorder
  5 → Stress
  6 → Suicidal

✅ Dataset loaded and label-encoded successfully!
status
3    16343
2    15404
6    10652
0     3841
1     2777
5     2587
4     1077
Name: count, dtype: int64


Unnamed: 0.1,Unnamed: 0,statement,status
0,0,oh my gosh,0
1,1,"trouble sleeping, confused mind, restless hear...",0
2,2,"All wrong, back off dear, forward doubt. Stay ...",0


## 2) Baseline Models (TF-IDF + Linear)

**Function Description:**
This cell establishes performance baselines using traditional machine learning before moving to deep learning. It splits your data, converts text to numerical features using TF-IDF, trains two simple linear models (Logistic Regression and Linear SVM), and reports their accuracy, precision, recall, and F1 scores.

**Syntax Explanation:**
The `train_test_split()` function from sklearn divides your data into 80% training and 20% validation using the random state 42 for reproducibility. The `stratify` parameter ensures both sets maintain the same class distribution as your original data. `TfidfVectorizer` converts text into numbers by analyzing word frequencies - the `ngram_range=(1,2)` parameter means it considers both individual words and two-word phrases, `min_df=2` ignores terms appearing in fewer than 2 documents (filtering out typos and rare terms), and `max_features=40000` keeps only the 40,000 terms that occur most often across the training corpus. The vectorizer's `fit_transform()` learns the vocabulary from training data and converts it to features in one step, while `transform()` applies that learned vocabulary to validation data without learning anything new. Both `LogisticRegression` and `LinearSVC` use `class_weight="balanced"`, which automatically adjusts for class imbalance by computing weights inversely proportional to class frequencies. The `precision_recall_fscore_support()` function calculates precision, recall, and F1 at once, and the `average` parameter determines how to aggregate across multiple classes (the weighted average weights each class's score by its number of true samples).
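
To make "inversely proportional to class frequencies" concrete, here is a small sketch of the formula scikit-learn documents for `class_weight="balanced"`: `n_samples / (n_classes * count)`. The counts are the full-dataset value counts printed in Section 1; the notebook itself fits on the 80% training split, so the real weights are computed from slightly different counts, though stratification keeps the ratios nearly identical.

```python
# Class counts from the Section 1 output (label id -> number of rows)
counts = {0: 3841, 1: 2777, 2: 15404, 3: 16343, 4: 1077, 5: 2587, 6: 10652}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
balanced_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for c, w in sorted(balanced_weights.items()):
    print(f"class {c}: count={counts[c]:>6}  weight={w:.3f}")
```

The rarest class (4, Personality disorder) receives the largest weight and the most common class (3, Normal) the smallest, so each class contributes the same total weight to the loss.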

**Inputs:**
This cell takes the cleaned DataFrame from the previous section and specifically uses the 'statement' column (text) as features and 'status' column (labels) as targets. The `train_test_split()` randomly selects which samples go into training vs validation based on the test_size ratio and random seed.

**Outputs:**
You get performance metrics for both baseline models printed in a compact format showing accuracy, precision, recall, and F1 score. The cell also prints which averaging method it's using (binary for 2 classes, weighted for more than 2) based on automatic detection of the number of unique classes. In the run shown below, both baselines land at roughly 78% accuracy; your own numbers will vary with data quality and class separability.
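
The difference between averaging methods is easiest to see on a tiny example. The sketch below computes per-class F1 by hand and then aggregates it two ways; `y_true` and `y_pred` are made-up toy labels, and the helper `per_class_f1` is illustrative rather than part of the notebook.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data: class 0 is common, class 2 is rare and never predicted
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

classes = sorted(set(y_true))
f1s = {c: per_class_f1(y_true, y_pred, c) for c in classes}
support = {c: y_true.count(c) for c in classes}

macro_f1 = sum(f1s.values()) / len(classes)                          # every class counts equally
weighted_f1 = sum(f1s[c] * support[c] for c in classes) / len(y_true)  # weighted by support

print(f"per-class F1: {f1s}")
print(f"macro F1:     {macro_f1:.3f}")
print(f"weighted F1:  {weighted_f1:.3f}")
```

Because the weighted average follows class sizes, a missed rare class lowers it far less than it lowers the macro average.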

**Code Flow:**
The code follows a standard machine learning pipeline. First, split the data to create independent train and validation sets. Second, fit the TF-IDF vectorizer on training text and transform both sets. Third, train the first model (Logistic Regression), make predictions, and calculate metrics. Fourth, repeat the training and evaluation process for the second model (Linear SVM). The vectorizer must be fit before the classifiers because the classifiers need fixed-size numerical inputs.

**Comments and Observations:**
These baseline models serve two purposes: they give you a performance floor that deep learning should beat, and they train in seconds to minutes rather than hours, letting you quickly spot data quality issues. If your baseline F1 is far below what you expect (say, under 60%), suspect the data first: mislabeled samples, too much noise, or classes that aren't actually distinguishable from text alone. TF-IDF works surprisingly well for text classification because it up-weights terms that are frequent within a document but rare across the corpus, and the linear model then learns which of those terms point to which class. For example, if the word "overwhelmed" appears frequently in stress-related texts but rarely in normal texts, the classifier learns a large weight for it on the Stress class. The ngram_range=(1,2) parameter helps capture phrases like "panic attack" or "feel good" that carry more meaning than individual words. Linear models like Logistic Regression are also interpretable - you could examine the feature weights to see which words most strongly predict each class. The max_features limit prevents the feature space from exploding (some datasets have 100k+ unique words) and also acts as a mild regularizer by discarding the rarest terms. Linear SVM and Logistic Regression usually give similar results on TF-IDF features, as they do here (F1 of 0.780 vs 0.777); which one comes out ahead varies by dataset. If SVM and Logistic Regression give very different scores (more than 5% gap), that may indicate class boundaries that linear models struggle with and that might benefit from deep learning.
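
The weighting idea behind TF-IDF can be sketched by hand. The snippet assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization) but simplifies the tokenization to whitespace splitting and unigrams only; the three documents are made-up examples.

```python
import math
from collections import Counter

docs = [
    "i feel overwhelmed and tired",
    "i feel good today",
    "so overwhelmed by work",
]
tokenized = [doc.split() for doc in docs]  # simplified tokenizer (assumption)
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1.0

def tfidf_vector(tokens):
    """Raw count times IDF, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

vec = tfidf_vector(tokenized[2])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:.3f}")
```

In the third document, "overwhelmed" gets a lower weight than "work" because it also occurs in another document; the vectorizer only measures rarity, and it is the classifier that links a term to a class.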

In [None]:
# --- Baseline Models (TF-IDF + Linear, supports multi-class) ---
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    df['statement'].values,
    df['status'].values,
    test_size=0.2,
    random_state=42,
    stratify=df['status'].values
)

# Convert raw text into TF-IDF features
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=40000)
Xtr = tfidf.fit_transform(X_train)
Xva = tfidf.transform(X_val)

# Detect if this is binary or multiclass
num_classes = len(np.unique(y_train))
avg_type = "binary" if num_classes == 2 else "weighted"
print(f"Detected {num_classes} classes → using average='{avg_type}' for metrics.\n")

# --- Baseline 1: Logistic Regression ---
logreg = LogisticRegression(max_iter=2000, class_weight="balanced")
logreg.fit(Xtr, y_train)
pred_lr = logreg.predict(Xva)
p, r, f, _ = precision_recall_fscore_support(y_val, pred_lr, average=avg_type)
acc = accuracy_score(y_val, pred_lr)
print(f"[Baseline-LR] Acc={acc:.3f}  P={p:.3f}  R={r:.3f}  F1={f:.3f}")

# --- Baseline 2: Linear SVM ---
svm = LinearSVC(class_weight="balanced")
svm.fit(Xtr, y_train)
pred_svm = svm.predict(Xva)
p, r, f, _ = precision_recall_fscore_support(y_val, pred_svm, average=avg_type)
acc = accuracy_score(y_val, pred_svm)
print(f"[Baseline-SVM] Acc={acc:.3f}  P={p:.3f}  R={r:.3f}  F1={f:.3f}")


Detected 7 classes → using average='weighted' for metrics.

[Baseline-LR] Acc=0.778  P=0.787  R=0.778  F1=0.777
[Baseline-SVM] Acc=0.782  P=0.779  R=0.782  F1=0.780


## 3) Pre-Trained Models (Tokenization and Dataset Prep)

**Function Description:**
This cell prepares your text data for transformer models by loading a specialized tokenizer and converting all text into the numerical format that BERT-based models expect. It tokenizes both training and validation texts, then packages them into HuggingFace Dataset objects that work seamlessly with the Trainer API.

**Syntax Explanation:**
I define two model checkpoint names as constants - `CLINICAL_BERT` points to a model trained on clinical text, while `DISTIL_BERT` points to a smaller, faster baseline. The `AutoTokenizer.from_pretrained()` method downloads and initializes the tokenizer that matches your chosen checkpoint. The `tokenize_texts()` helper function takes a list of strings and converts them to token IDs - the `padding=True` parameter pads shorter sequences with the tokenizer's padding token (ID 0 in BERT-style vocabularies) up to the longest sequence in the call, `truncation=True` cuts off text exceeding the max_length, `max_length=128` sets the sequence limit, and `return_tensors="pt"` formats the output as PyTorch tensors rather than lists. Because each split is tokenized in a single call, every sequence in that split ends up padded to the same length (at most 128). After tokenizing, I use `Dataset.from_dict()` to create HuggingFace datasets, passing dictionaries that contain input_ids (the tokenized text), attention_mask (which positions are real tokens vs padding), and labels (your encoded status values wrapped in `torch.tensor()`).

**Inputs:**
This cell uses the `X_train`, `X_val`, `y_train`, and `y_val` arrays created by the train_test_split in the previous section. X_train and X_val contain the text statements, while y_train and y_val contain the corresponding numerical labels.

**Outputs:**
You'll see progress bars as the tokenizer downloads (first run only), then the final line shows the sizes of your train and validation datasets as a tuple like (42144, 10537). This confirms you have roughly 80% of samples in training and 20% in validation. The tokenized datasets are stored in memory as `train_ds` and `val_ds` objects ready for training.

**Code Flow:**
The flow is straightforward and sequential. First, I define model checkpoints and select one as the default backbone. Second, I load the tokenizer for that backbone. Third, I define a helper function that wraps the tokenizer with specific parameters. Fourth, I apply that function to both train and validation texts. Fifth, I package the tokenized outputs and labels into Dataset objects. This preparation step only happens once before training multiple experiments.

**Comments and Observations:**
The choice between ClinicalBERT and DistilBERT matters more than you might think. ClinicalBERT was trained on clinical notes, discharge summaries, and medical text, so it understands medical terminology and the way healthcare professionals write. This may make it better suited for mental health classification where text includes clinical terms or formal descriptions. DistilBERT is a compressed version of BERT with roughly 40% fewer parameters - it trains faster and uses less memory but might miss subtle patterns that the full model catches. The max_length=128 setting is a practical choice that balances speed and information retention. If most statements are 20-80 words, that translates to roughly 30-120 tokens after subword tokenization. Setting max_length too high wastes computation on padding, while setting it too low truncates important information. The tokenizer uses subword tokenization, meaning it breaks rare or complex words into pieces - for example, "unhappiness" might become ["un", "##happiness"], where the `##` prefix marks a continuation piece (the exact split depends on the checkpoint's vocabulary). This helps the model handle words it's never seen before by understanding their components. The attention_mask is important because it tells the model which tokens are real (value of 1) and which are padding (value of 0), preventing padding tokens from influencing the model's predictions. When you see the tokenizer downloading files, it's fetching the vocabulary file (which maps tokens to IDs) and the config file (which stores tokenization parameters). These downloads only happen once and get cached locally.
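
How padding, truncation, and the attention mask fit together can be sketched without a tokenizer. The helper `pad_batch`, the `PAD_ID` value of 0, and the token IDs are assumptions chosen for illustration; a real tokenizer also keeps the closing special token when it truncates, which this simplified version does not.

```python
PAD_ID = 0  # assumed padding ID, as in BERT-style vocabularies

def pad_batch(sequences, max_length):
    """Truncate to max_length, pad to the longest sequence, and build masks."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = padding
    return input_ids, attention_mask

# Made-up token IDs for three texts of different lengths
batch = [
    [101, 2023, 102],
    [101, 2023, 2003, 1037, 3231, 102],
    [101, 102],
]
ids, mask = pad_batch(batch, max_length=5)

for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Every row comes out with the same length, which is what lets the batch be stacked into a single tensor.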

In [None]:

# Choose your checkpoints.
# We include ClinicalBERT (for clinical text) and DistilBERT (fast baseline).
CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
DISTIL_BERT   = "distilbert-base-uncased"

# Pick one as the default backbone for experiments below.
BACKBONE = CLINICAL_BERT

# Initialize tokenizer for the chosen backbone
tokenizer = AutoTokenizer.from_pretrained(BACKBONE)

# Helper to tokenize an iterable of strings (e.g. a NumPy array or pandas Series)
def tokenize_texts(texts, max_length=128):
    # Apply the tokenizer: returns dict with input_ids and attention_mask
    return tokenizer(
        list(texts),                 # a Python list of strings
        padding=True,                # pad to the longest sequence in this call
        truncation=True,             # cut off text exceeding max_length
        max_length=max_length,       # cap sequence length
        return_tensors="pt"          # return PyTorch tensors
    )

# Tokenize train/validation splits
train_enc = tokenize_texts(X_train)
val_enc   = tokenize_texts(X_val)

# Wrap into HF Datasets with labels
train_ds = Dataset.from_dict({
    "input_ids": train_enc["input_ids"],
    "attention_mask": train_enc["attention_mask"],
    "labels": torch.tensor(y_train)
})
val_ds = Dataset.from_dict({
    "input_ids": val_enc["input_ids"],
    "attention_mask": val_enc["attention_mask"],
    "labels": torch.tensor(y_val)
})

len(train_ds), len(val_ds)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

(42144, 10537)

## 4) Training of Data (Trainer utilities and metrics)

**Function Description:**
This cell sets up the infrastructure you need for training - specifically the metric computation function and the custom weighted loss. It defines how to evaluate model performance and how to handle class imbalance during training by penalizing mistakes on rare classes more heavily.

**Syntax Explanation:**
The `compute_metrics()` function takes an `eval_pred` tuple containing logits (raw model outputs before softmax) and true labels. Inside the function, `np.argmax(logits, axis=-1)` converts logits to class predictions by selecting the index of the highest value along the last dimension. The `precision_recall_fscore_support()` function returns precision, recall, and F1 in one call (accuracy comes from a separate `accuracy_score()` call), and its `average` parameter specifies how to aggregate across classes (binary for 2-class, weighted for multi-class). The class-weight code in this cell is written for a two-class problem: I count the samples labeled 1 with `(y_train == 1).sum()` (the "positive" class) and the samples labeled 0 the same way (the "negative" class), then compute the weight for the positive class as `neg / max(pos, 1)`, which gives the positive class a higher weight when it is the rarer one. With the seven-class dataset loaded above, this only compares Anxiety (0) with Bipolar (1) and produces a two-element weight tensor, so it cannot be used as-is with a seven-label model; Section 5 recomputes the weights for every class. The `max(pos, 1)` prevents division by zero if you somehow have zero positive samples. I create a PyTorch tensor from these weights and move it to the correct device using `.to(device)`. The `WeightedTrainer` class inherits from HuggingFace's `Trainer` and overrides only the `compute_loss()` method. Inside that method, I extract labels from inputs, run the model on the remaining inputs (everything except labels), get the logits from outputs, create a CrossEntropyLoss function with the class weights, and calculate loss by comparing predictions to true labels.

**Inputs:**
This cell uses `y_train` from the earlier train-test split to compute class frequencies and create weights. The compute_metrics function receives predictions from the Trainer during evaluation, while the WeightedTrainer receives model inputs, labels, and the model itself during training.

**Outputs:**
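You'll see the class weights printed as a two-element list, `[w_neg, w_pos]`. For a binary case with 20,000 negative and 5,000 positive samples, you'd see weights like [1.0, 4.0], meaning an error on a positive sample costs 4x as much as an error on a negative one. In the run shown below, the printed value of about 1.38 is simply the ratio of Anxiety (label 0) to Bipolar (label 1) training samples, because this cell only looks at labels 0 and 1. For multi-class problems with severe imbalance, some weights might be 10x or higher.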
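
The binary weighting rule can be checked with a few lines of plain Python. The function name `binary_class_weights` is made up for this sketch; the arithmetic follows the `neg / max(pos, 1)` rule described above, applied to the 20,000 / 5,000 example.

```python
def binary_class_weights(labels):
    """Return [w_neg, w_pos] with the negative class fixed at 1.0."""
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == 0)
    w_pos = neg / max(pos, 1)  # max(..., 1) guards against division by zero
    return [1.0, w_pos]

labels = [0] * 20000 + [1] * 5000
print(binary_class_weights(labels))       # [1.0, 4.0]

# With no positive samples, the guard keeps the computation from crashing
print(binary_class_weights([0, 0, 0]))    # [1.0, 3.0]
```

Labels other than 0 and 1 are ignored by this rule, which is why it is only suitable for two-class data.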

**Code Flow:**
The code sets up two separate but related pieces of infrastructure. First, it defines and prints class weights that quantify the imbalance in your data. Second, it creates a custom Trainer class that uses those weights during loss calculation. These components work together during training - the Trainer calls compute_loss every batch to calculate weighted loss, and calls compute_metrics at each evaluation pass (once per epoch when evaluation is configured per epoch) to score the validation set.

**Comments and Observations:**
Class imbalance is one of the biggest challenges in mental health classification. Without weighting, a model trained on data that's 80% Normal and 20% Other could achieve 80% accuracy by always predicting Normal and completely ignoring the minority classes. Weighted loss forces the model to care about all classes by making errors on rare classes expensive. The math behind class weighting is intuitive - if you have 4x more samples of class A than class B, you give class B a weight of 4.0 so that one mistake on class B costs as much as four mistakes on class A. This balances the gradient updates and prevents the model from ignoring minority classes. CrossEntropyLoss is the standard loss function for classification because it measures the difference between predicted probability distributions and true labels. Adding weights multiplies each sample's loss by the weight of its true class; with PyTorch's default mean reduction, the sum is then divided by the total of those weights rather than by the number of samples. The custom Trainer override is necessary because the default Trainer doesn't support class weights out of the box. By inheriting and overriding just the compute_loss method, you keep all the other Trainer functionality (logging, checkpointing, evaluation) while adding custom loss calculation. The `return_outputs` parameter in compute_loss determines whether to return just the loss (for training) or both loss and full model outputs (for when you need predictions), and I handle both cases with the conditional return statement. This weighted approach works well for moderate imbalance (ratios up to 10:1 or 20:1) but for extreme imbalance you might need additional techniques like oversampling the minority class or using focal loss.
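
A hand-computed version of weighted cross-entropy may help make the formula tangible. The sketch mirrors what `CrossEntropyLoss(weight=...)` does under its default mean reduction (each sample's loss is scaled by the weight of its true class, and the total is divided by the sum of those weights); the logits, labels, and weights are toy values, and the helper names are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(batch_logits, labels, class_weights):
    """Weighted mean of -log p(true class), normalized by the summed weights."""
    weighted_sum, weight_total = 0.0, 0.0
    for logits, y in zip(batch_logits, labels):
        w = class_weights[y]
        weighted_sum += w * -math.log(softmax(logits)[y])
        weight_total += w
    return weighted_sum / weight_total

# Two samples for which the model favors class 0; the second is really class 1
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]

plain = weighted_cross_entropy(batch_logits, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(batch_logits, labels, [1.0, 4.0])

print(f"unweighted loss: {plain:.4f}")
print(f"weighted loss:   {weighted:.4f}")
```

The weighted loss is higher because the misclassified sample belongs to the up-weighted class, so the gradient pushes harder to fix that mistake.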

In [None]:

# Metric function for the Trainer: computes Accuracy, Precision, Recall, F1
def compute_metrics(eval_pred):
    # eval_pred is a tuple of (logits, labels)
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # avg_type was set in Section 2: "binary" for 2 classes, "weighted" otherwise
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average=avg_type)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

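# Optional: class weights for imbalanced datasets (binary version)
# NOTE: this only compares label 0 ("neg") with label 1 ("pos"). With the seven
# classes loaded above, those are Anxiety and Bipolar; Section 5 recomputes the
# weights for every class before any training happens.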
pos = (y_train == 1).sum()
neg = (y_train == 0).sum()
w_pos = neg / max(pos, 1)   # weight for positive class
w_neg = 1.0                 # keep negative as baseline
class_weights = torch.tensor([w_neg, w_pos], dtype=torch.float).to(device)
print(f"Class weights (neg, pos): {class_weights.tolist()}" )

# Custom Trainer that injects weighted loss
from torch.nn import CrossEntropyLoss
class WeightedTrainer(Trainer):
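    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer transformers releases pass in
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):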
        labels = inputs.get("labels")
        outputs = model(**{k: v for k, v in inputs.items() if k != "labels"})
        logits = outputs.get("logits")
        loss_fct = CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss


Class weights (neg, pos): [1.0, 1.3836109638214111]


## 5) Fine-tuning (Three Experiments)

**Function Description:**
This cell runs three complete fine-tuning experiments with different hyperparameter configurations. It trains transformer models on your data, evaluates them on the validation set, and creates a leaderboard ranking them by F1 score. Each experiment uses different settings for learning rate, batch size, epochs, and model architecture to find the best configuration for your specific dataset.

**Syntax Explanation:**
The code starts by detecting the number of unique classes with `len(np.unique(y_train))` and setting the averaging strategy for metrics accordingly. The `compute_metrics()` function here is similar to Section 4 but adapts to multi-class by using weighted averaging. For class weights, I use `np.bincount(y_train, minlength=num_labels)`, which counts occurrences of each class, then compute weights as `counts.max() / np.maximum(counts, 1)`, which gives the most common class a weight of 1.0 and rarer classes proportionally higher weights while avoiding division by zero. The weights become a PyTorch tensor on the correct device. The `WeightedTrainer` class override has the same structure as in Section 4 but now receives one weight per class, so it handles the multi-class case properly. The `tokenize_texts()` function accepts a max_length parameter to allow different experiments to use different sequence lengths. The `make_training_args()` function is a factory that creates `TrainingArguments` objects with version compatibility - it first tries the signature with `evaluation_strategy` and `save_strategy` (recent `transformers` releases rename the former to `eval_strategy`), and if that raises a `TypeError` because the installed version does not accept those names, it falls back to legacy parameters like `do_eval`. The `run_experiment()` function orchestrates everything: it re-tokenizes data with the specified max_length, creates fresh Dataset objects, loads the pre-trained model with `AutoModelForSequenceClassification.from_pretrained()` while specifying the correct number of output classes, creates training arguments, instantiates the WeightedTrainer, calls `trainer.train()` to run training, calls `trainer.evaluate()` to get final metrics, and returns both metrics and the trainer object. Each of the three experiments (A, B, C) calls `run_experiment()` with different parameters and stores results in an OrderedDict. Finally, I extract the F1 scores from each result, sort experiments by F1, and print a leaderboard.
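
The multi-class weighting rule can be reproduced in plain Python. This sketch applies `max_count / max(count, 1)` to the full-dataset counts printed in Section 1, listed in label order 0-6; the notebook itself uses the training split and NumPy, so treat the numbers as an approximation of what the cell prints.

```python
# Samples per class in label order 0..6 (from the Section 1 value counts)
counts = [3841, 2777, 15404, 16343, 1077, 2587, 10652]

def max_ratio_weights(class_counts):
    """Weight each class by (largest class count) / (its own count)."""
    largest = max(class_counts)
    return [largest / max(n, 1) for n in class_counts]  # max(n, 1) avoids dividing by zero

weights = max_ratio_weights(counts)
for label_id, (n, w) in enumerate(zip(counts, weights)):
    print(f"class {label_id}: count={n:>6}  weight={w:.2f}")

# A class that never appears gets the largest possible weight instead of an error
print(max_ratio_weights([10, 0, 5]))  # [1.0, 10.0, 2.0]
```

Under this rule the Normal class gets weight 1.0 and Personality disorder roughly 15, matching the 15x imbalance noted in Section 1.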

**Inputs:**
This cell uses `X_train`, `X_val`, `y_train`, and `y_val` from the train-test split. It also uses the model checkpoint names (CLINICAL_BERT, DISTIL_BERT) and the tokenizer defined in earlier sections. Each experiment re-tokenizes the data with its specific max_length setting.

**Outputs:**
During training, you'll see progress bars showing epoch progress, batch progress within each epoch, loss values that should decrease over time, and periodic evaluation metrics. After each experiment finishes, you'll see a summary of its final performance metrics including accuracy, precision, recall, and F1 score. At the very end, a leaderboard ranks all three experiments by F1 score, showing which configuration worked best. The entire cell might take 20-60 minutes to run depending on whether you have GPU and how large your dataset is.

**Code Flow:**
The flow is hierarchical and modular. At the top level, I set up shared infrastructure (metrics function, class weights). Then I define helper functions (tokenization, training args factory, experiment runner). Finally, I call the experiment runner three times with different parameters and collect results. Each experiment is independent - they don't share trained weights, though they do share the data and evaluation metrics. The leaderboard aggregation happens after all experiments complete, sorting by F1 score and displaying in descending order.
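
The leaderboard step boils down to sorting a dictionary of results by one metric. Since the fine-tuning results are not shown here, the sketch below reuses the two baseline scores from Section 2 as stand-in entries; the key name `"f1"` is an assumption (metrics returned by `Trainer.evaluate()` are normally prefixed, for example `eval_f1`).

```python
from collections import OrderedDict

# Stand-in results: the baseline scores reported in Section 2
results = OrderedDict()
results["Baseline-LR"] = {"accuracy": 0.778, "f1": 0.777}
results["Baseline-SVM"] = {"accuracy": 0.782, "f1": 0.780}

# Sort experiment names by F1, best first
leaderboard = sorted(results.items(), key=lambda item: item[1]["f1"], reverse=True)

for rank, (name, metrics) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name:14s} F1={metrics['f1']:.3f}  Acc={metrics['accuracy']:.3f}")
```

Keeping the raw results in insertion order and sorting only for display makes it easy to rank by a different metric later.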

**Comments and Observations:**
Hyperparameter selection is part science, part art. Learning rates for fine-tuning transformers typically range from 1e-5 to 5e-5 because these models are already well-trained and you don't want to disturb the pre-trained weights too much. Going higher risks catastrophic forgetting, where the model loses its pre-trained knowledge. Batch size is constrained by GPU memory - if you get out-of-memory errors, reduce batch size. Smaller batches (8-16) give noisier gradient updates but sometimes generalize better, while larger batches (32-64) give smoother updates but fewer of them per epoch, so they often need a retuned learning rate. The number of epochs depends on dataset size - smaller datasets may need more passes to converge, but too many epochs cause overfitting. Weight decay shrinks the weights slightly at every step, which discourages overfitting, though too much of it can underfit. With the AdamW optimizer that `Trainer` uses by default this is decoupled weight decay, which plays the same role as L2 regularization but is not literally an L2 penalty added to the loss. Warmup ratio gradually increases the learning rate from near-zero to the target value over the first X% of training steps, which stabilizes training when starting from a randomly initialized classification head. The version compatibility fallback exists because `TrainingArguments` parameter names have changed between transformers releases. By catching the TypeError and falling back to legacy parameters, the code works across a wider range of library versions. The fallback is not equivalent to the modern path, though: it drops `warmup_ratio`, per-epoch evaluation, and `load_best_model_at_end`, so a run that falls back is not directly comparable to one that doesn't. Experiment A (conservative) uses safe defaults that should work reliably but might not achieve peak performance. Experiment B (aggressive) pushes the learning rate higher (5e-5) and trains for one more epoch, which might find better optima but risks overfitting. Experiment C (fast baseline) uses DistilBERT for speed comparison - if DistilBERT matches ClinicalBERT performance, you might prefer it for production due to faster inference. The leaderboard at the end tells you which approach scored best on your validation split. Because the winner is picked on that same split, treat its score as slightly optimistic and confirm it on held-out test data if you have any. Sometimes the aggressive approach wins, sometimes the conservative one does - it depends on your data characteristics and how much overfitting risk you face. The F1 score is the key metric here because it balances precision and recall in a single number. With more than two classes the code uses weighted averaging, which weights each class's F1 by its support; with exactly two classes it reports the F1 of the positive class only.
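
To make the warmup behaviour concrete, the sketch below computes the learning rate at a given step under a linear warmup followed by a linear decay to zero. It assumes the linear scheduler (the `Trainer` default as far as I know) and a made-up run length of 1000 steps; the function name `linear_warmup_lr` is hypothetical and this is an illustration of the shape, not the library's implementation.

```python
import math

def linear_warmup_lr(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at `step`: linear ramp-up during warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale from 0 up towards base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale from base_lr down to 0 at the final step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Assumed values for illustration: 1000 training steps, Experiment A's lr and warmup ratio.
TOTAL_STEPS, BASE_LR, WARMUP_RATIO = 1000, 2e-5, 0.1
schedule = [linear_warmup_lr(s, TOTAL_STEPS, BASE_LR, WARMUP_RATIO) for s in range(TOTAL_STEPS + 1)]

for s in (0, 50, 100, 550, 1000):
    print(f"step {s:4d}: lr = {schedule[s]:.2e}")
```

The first 10% of steps ramp the learning rate up, so the freshly initialized classification head does not receive full-size updates while its gradients are still large and noisy.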

## 5) Fine-tuning (Three Experiments)

In [None]:
# --- 5) Fine-tuning (Three Experiments) [version-compatible] ---
import numpy as np
import torch
from collections import OrderedDict
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from torch.nn import CrossEntropyLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1) Metrics: binary vs multiclass handled automatically
num_labels = len(np.unique(y_train))
avg_type = "binary" if num_labels == 2 else "weighted"
print(f"[Fine-tune] Detected {num_labels} classes → metrics average='{avg_type}'")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    p, r, f, _ = precision_recall_fscore_support(labels, preds, average=avg_type)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f}

# 2) Class weights for imbalanced data (size == num_labels)
counts = np.bincount(y_train, minlength=num_labels)
# Heuristic: inverse-frequency, scaled so the most frequent class has weight 1.0 (rarer classes get > 1.0)
weights = counts.max() / np.maximum(counts, 1)
class_weights = torch.tensor(weights, dtype=torch.float32, device=device)
print(f"[Fine-tune] Class weights: {class_weights.tolist()}")

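class WeightedTrainer(Trainer):
    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer
    # transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**{k: v for k, v in inputs.items() if k != "labels"})
        logits = outputs.get("logits")
        # Keep the class weights on the same device as the logits
        loss_fct = CrossEntropyLoss(weight=class_weights.to(logits.device))
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss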

# 3) Helper: tokenizer already defined above. Re-tokenize per max_length
def tokenize_texts(texts, max_length=160):
    return tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )

# 4) Version-compatible TrainingArguments factory
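import inspect

def make_training_args(name, batch_size, lr, epochs, weight_decay, warmup_ratio):
    # Newer transformers releases renamed `evaluation_strategy` to `eval_strategy`;
    # use whichever keyword the installed version accepts.
    ta_params = inspect.signature(TrainingArguments.__init__).parameters
    eval_key = "eval_strategy" if "eval_strategy" in ta_params else "evaluation_strategy"
    kwargs_modern = dict(
        output_dir=f"./runs/{name}",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=lr,
        num_train_epochs=epochs,
        weight_decay=weight_decay,
        warmup_ratio=warmup_ratio,
        logging_steps=50,
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        fp16=torch.cuda.is_available(),
        report_to=[]
    )
    kwargs_modern[eval_key] = "epoch"  # evaluate once per epoch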
    try:
        # Try modern signature first
        return TrainingArguments(**kwargs_modern)
    except TypeError:
        # Fallback for older transformers (no evaluation_strategy/save_strategy)
        print("[Fine-tune] Using legacy TrainingArguments fallback.")
        kwargs_legacy = dict(
            output_dir=f"./runs/{name}",
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            learning_rate=lr,
            num_train_epochs=epochs,
            weight_decay=weight_decay,
            logging_steps=50,
            do_eval=True,          # legacy way to enable evaluation
            save_steps=500,        # periodic saving
            overwrite_output_dir=True,
            fp16=torch.cuda.is_available()
        )
        return TrainingArguments(**kwargs_legacy)

def run_experiment(name, backbone, batch_size=16, lr=2e-5, epochs=3,
                   weight_decay=0.01, warmup_ratio=0.1, max_length=160):
    # Re-tokenize for this max_length
    tr = tokenize_texts(X_train, max_length=max_length)
    va = tokenize_texts(X_val,   max_length=max_length)

    train_ds_local = Dataset.from_dict({
        "input_ids": tr["input_ids"],
        "attention_mask": tr["attention_mask"],
        "labels": torch.tensor(y_train, dtype=torch.long)
    })
    val_ds_local = Dataset.from_dict({
        "input_ids": va["input_ids"],
        "attention_mask": va["attention_mask"],
        "labels": torch.tensor(y_val, dtype=torch.long)
    })

    # Load backbone with correct num_labels
    model = AutoModelForSequenceClassification.from_pretrained(
        backbone, num_labels=num_labels
    ).to(device)

    args = make_training_args(
        name=name, batch_size=batch_size, lr=lr, epochs=epochs,
        weight_decay=weight_decay, warmup_ratio=warmup_ratio
    )

    trainer = WeightedTrainer(
        model=model,
        args=args,
        train_dataset=train_ds_local,
        eval_dataset=val_ds_local,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer
    )

    trainer.train()
    metrics = trainer.evaluate()
    print(f"\n>>> {name} results: {metrics}\n")
    return metrics, trainer

# --- Define backbones (already set earlier) ---
CLINICAL_BERT = "emilyalsentzer/Bio_ClinicalBERT"
DISTIL_BERT   = "distilbert-base-uncased"

results = OrderedDict()

# Exp-A: ClinicalBERT, conservative LR, small batch
results['expA_clinicalbert_bs16_lr2e-5_ep3'] = run_experiment(
    name="expA_clinicalbert_bs16_lr2e-5_ep3",
    backbone=CLINICAL_BERT,
    batch_size=16, lr=2e-5, epochs=3,
    weight_decay=0.01, warmup_ratio=0.1, max_length=160
)

# Exp-B: ClinicalBERT, slightly higher LR, more epochs
results['expB_clinicalbert_bs16_lr5e-5_ep4'] = run_experiment(
    name="expB_clinicalbert_bs16_lr5e-5_ep4",
    backbone=CLINICAL_BERT,
    batch_size=16, lr=5e-5, epochs=4,
    weight_decay=0.01, warmup_ratio=0.06, max_length=160
)

# Exp-C: DistilBERT fast baseline
results['expC_distilbert_bs32_lr3e-5_ep3'] = run_experiment(
    name="expC_distilbert_bs32_lr3e-5_ep3",
    backbone=DISTIL_BERT,
    batch_size=32, lr=3e-5, epochs=3,
    weight_decay=0.01, warmup_ratio=0.1, max_length=128
)

# Leaderboard
board = []
for k,(m,_t) in results.items():
    board.append((k, m.get('eval_f1', float('nan')), m.get('eval_accuracy', float('nan'))))
board = sorted(board, key=lambda x: x[1], reverse=True)
print("\nLeaderboard (by F1):")
for name, f1, acc in board:
    print(f"{name:35s}  F1={f1:.4f}  Acc={acc:.4f}")


NameError: name 'y_train' is not defined

## 6) Eval (Pick Best and Run Inference)

**Function Description:**
This cell identifies the best-performing experiment from your fine-tuning runs, saves that model to disk for future use, and demonstrates how to make predictions on new text. It shows you the complete inference pipeline from raw text to predicted class and confidence scores.

**Syntax Explanation:**
The selection logic iterates through the results dictionary using `.items()`, which gives you both the experiment name and its (metrics, trainer) tuple. For each experiment, I check if its `eval_f1` score beats the current best, and if so, update `best_f1`, `best_name`, and `best_trainer`. After finding the winner, `trainer.save_model()` writes the model weights to disk at the specified path, and `tokenizer.save_pretrained()` saves the tokenizer configuration alongside it. The `predict()` function encapsulates the inference pipeline - it loads the saved tokenizer with `AutoTokenizer.from_pretrained()`, loads the saved model with `AutoModelForSequenceClassification.from_pretrained()`, moves the model to the correct device with `.to(device)`, tokenizes input texts with the same padding and truncation settings and the `max_length=160` used in Experiments A and B, wraps the forward pass in `torch.no_grad()` to disable gradient computation (speeds up inference and saves memory), extracts logits from model outputs, applies `torch.argmax()` to get predicted classes, applies `torch.softmax()` to convert logits to probabilities and keeps column 1 (the probability of class 1, "stressed"), and returns both arrays after moving them from GPU to CPU and converting to numpy arrays. Because the returned probability is always that of class 1, a low value on a not-stressed prediction means the model is confident, not uncertain. For the demo, I define three test sentences covering different scenarios (clearly calm, clearly stressed, ambiguous) and call predict on them. The output loop uses zip to iterate over texts, predictions, and probabilities simultaneously, formatting each as a readable string with the predicted label and probability.
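
The selection loop described above can also be written with `max()` and a key function. The sketch below is one way to do that which additionally tolerates runs whose metrics lack the chosen key (the loop indexes `metrics['eval_f1']` directly, which raises `KeyError` in that case). The function name `pick_best` and the scores are hypothetical, `None` stands in for the trainer objects, and the sketch assumes at least one run reports the metric.

```python
def pick_best(results, metric="eval_f1"):
    """Return (name, score, trainer) for the run with the highest `metric`."""
    # Runs that lack the metric are treated as -inf so they can never win.
    best_name = max(results, key=lambda name: results[name][0].get(metric, float("-inf")))
    metrics, trainer = results[best_name]
    return best_name, metrics[metric], trainer

# Hypothetical metrics; None stands in for each run's trainer object.
toy_results = {
    "expA": ({"eval_f1": 0.81}, None),
    "expB": ({}, None),  # no F1 recorded for this run
    "expC": ({"eval_f1": 0.86}, None),
}
best_name, best_f1, best_trainer = pick_best(toy_results)
print(f"Best run: {best_name} with F1={best_f1:.4f}")
```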

**Inputs:**
This cell uses the results dictionary populated in Section 5, which contains metrics and trainer objects from all three experiments. The predict function takes a list of text strings and optionally a model directory path.

**Outputs:**
You'll see a message identifying which experiment won and what its F1 score was. Then you'll see three prediction lines showing the predicted class (as both a label and a number), the model's probability for the stressed class, and the original text. For example: "[stressed(1) p=0.873] My chest is tight and I cannot focus, I think I am very stressed."

**Code Flow:**
The flow divides into three phases. First, iterate through all experiment results to find the highest F1 score and corresponding trainer. Second, save both the model and tokenizer to disk. Third, demonstrate inference by defining a predict function, creating test samples, calling predict, and formatting the output. The save and load operations prove that you can persist your model and reload it later without retraining.

**Comments and Observations:**
Saving the model is important because fine-tuning takes significant time and compute resources - you don't want to retrain every time you need to make predictions. The saved directory contains multiple files including model weights (pytorch_model.bin, or model.safetensors in newer transformers versions), model configuration (config.json), and tokenizer files (vocab.txt, tokenizer_config.json). Together these files fully specify your trained model and can be loaded on any machine with compatible library versions. The predict function is a reasonable starting point for a web API or batch processing script, but as written it reloads the tokenizer and model from disk on every call, so for production use you would load them once and reuse them. The `torch.no_grad()` context manager is important for inference because it tells PyTorch not to track gradients, which reduces memory usage and speeds up computation. The difference between logits and probabilities matters: logits are raw scores that can be any value from negative to positive infinity, while probabilities are normalized to sum to 1.0 and range from 0 to 1. Softmax converts logits to probabilities using the formula exp(logit_i) / sum(exp(logit_j)). The value returned here is the probability of the stressed class, so confidence is read from how far it sits from 0.5 - 0.95 means highly confident in "stressed", 0.05 means highly confident in "not-stressed", and 0.55 means barely decided. In production, you might set a threshold like 0.7 on the predicted class's probability and only act on predictions above that threshold, sending lower-confidence predictions to human review. The three test sentences demonstrate different difficulty levels. The first ("calm and in control") should be easy for the model - clear language indicating low stress. The second ("chest is tight, cannot focus, very stressed") contains multiple stress indicators and explicitly mentions stress, so the model should confidently predict stressed. The third ("workload is heavy but manageable") is ambiguous - "heavy" suggests stress but "manageable" suggests coping, so this tests whether the model can handle nuance. If the model is confident on the easy cases and close to 0.5 on the ambiguous one, that is sensible behavior; a confident prediction on the ambiguous sentence would suggest it is latching onto single keywords. You can expand this demo by adding more test cases, especially edge cases like very short text ("I'm fine"), very long text (multiple paragraphs), or text with mixed signals. The model architecture (ClinicalBERT vs DistilBERT) affects inference speed - DistilBERT's authors report it as about 60% faster than BERT-base, which matters if you're processing millions of texts.
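
The softmax formula and the confidence-threshold idea discussed above can be tried without a model. The sketch below is a dependency-free illustration: `softmax` follows exp(logit_i) / sum(exp(logit_j)) with the usual max-subtraction for numerical stability, and `route` applies a threshold to the predicted class's probability. The function names, the 0.7 threshold, and the example logits are illustrative assumptions, not values taken from the trained model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, threshold=0.7):
    """Return (predicted class, its probability, routing decision)."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[pred]
    decision = "auto" if confidence >= threshold else "human_review"
    return pred, confidence, decision

# Hypothetical logits for [not-stressed, stressed].
for logits in ([2.0, -1.0], [0.1, 0.3]):
    pred, confidence, decision = route(logits)
    print(f"logits={logits}  pred={pred}  confidence={confidence:.3f}  -> {decision}")
```

Thresholding on the predicted class's probability (rather than on the stressed-class probability alone) treats confident not-stressed predictions the same way as confident stressed ones.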

## 6) Eval (Pick Best and Run Inference)

In [None]:

# Select the best run from 'results' dict above
best_name, best_f1 = None, -1.0
best_trainer = None
for name,(metrics, trainer) in results.items():
    if metrics['eval_f1'] > best_f1:
        best_f1 = metrics['eval_f1']
        best_name = name
        best_trainer = trainer

print(f"Best run: {best_name} with F1={best_f1:.4f}")

# Save the best model for reuse
save_dir = f"./best_model_{best_name}"
best_trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

# Simple inference helper
def predict(texts, model_dir=save_dir):
    tok = AutoTokenizer.from_pretrained(model_dir)
    mdl = AutoModelForSequenceClassification.from_pretrained(model_dir).to(device)
    enc = tok(list(texts), padding=True, truncation=True, max_length=160, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = mdl(**enc).logits
    pred = torch.argmax(logits, dim=-1).cpu().numpy()
    # Column 1 is P(class 1, "stressed"), not the confidence of the predicted class
    prob = torch.softmax(logits, dim=-1).cpu().numpy()[:,1]
    return pred, prob

# Demo predictions on a few samples
samples = [
    "I feel calm and in control today.",
    "My chest is tight and I cannot focus, I think I am very stressed.",
    "Workload is heavy but manageable so far."
]
pred, prob = predict(samples)
for s, y, p in zip(samples, pred, prob):
    lab = "stressed(1)" if y==1 else "not-stressed(0)"
    print(f"[{lab}  p={p:.3f}]  {s}")
