# Task 1_2: Fine-Tuning BERT for GoEmotions Multi-Label Classification

This notebook fulfills the requirements for the Deep Learning Final Term (UAS) assignment.
It covers:
1. Data Preprocessing for multi-label classification.
2. Traditional Machine Learning Baseline (Logistic Regression).
3. Fine-Tuning BERT (Encoder model) using HuggingFace.
4. Evaluation and Comparison.

## 1. Setup Environment
Import the required libraries, following the reference notebooks, and check whether a GPU is available.

In [1]:
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, accuracy_score, classification_report
from sklearn.preprocessing import MultiLabelBinarizer

# Check for GPU (T4 in Colab)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")



Using device: cpu


## 2. Load Dataset
The GoEmotions dataset contains about 58k Reddit comments labeled with 27 emotion categories plus a neutral class, for 28 labels in total.

In [2]:
dataset = load_dataset("go_emotions", "simplified") # Using simplified for 28 labels
num_labels = 28

# Preview the first rows of each split
train_df = pd.DataFrame(dataset['train'])
print("Sample Data:")
print(train_df[['text', 'labels']].head())

valid_df = pd.DataFrame(dataset['validation'])
print("\nValidation Data:")
print(valid_df[['text', 'labels']].head())

test_df = pd.DataFrame(dataset['test'])
print("\nTest Data:")
print(test_df[['text', 'labels']].head())



Sample Data:
                                                text labels
0  My favourite food is anything I didn't have to...   [27]
1  Now if he does off himself, everyone will thin...   [27]
2                     WHY THE FUCK IS BAYLESS ISOING    [2]
3                        To make her feel threatened   [14]
4                             Dirty Southern Wankers    [3]

Validation Data:
                                                text   labels
0  Is this in New Orleans?? I really feel like th...     [27]
1  You know the answer man, you are programmed to...  [4, 27]
2               I've never been this sad in my life!     [25]
3  The economy is heavily controlled and subsidiz...  [4, 27]
4  He could have easily taken a real camera from ...     [20]

Test Data:
                                                text labels
0  I’m really sorry about your situation :( Altho...   [25]
1    It's wonderful because it's awful. At not with.    [0]
2  Kings fan here, good luck to you guys! Wil

## 3. Data Preprocessing (Multi-Label)
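Since a text can have multiple emotions (e.g., [4, 27]), each label list is converted to a 28-dimensional multi-hot vector. `MultiLabelBinarizer` does this for the baseline, and `preprocess_labels` builds the float vector that BERT needs.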

In [3]:
mlb = MultiLabelBinarizer(classes=list(range(num_labels)))

def preprocess_labels(example):
    # For BERT training, we need float labels for BCEWithLogitsLoss
    label_vector = np.zeros(num_labels, dtype=float)
    for l in example["labels"]:
        label_vector[l] = 1.0
    example["label_vector"] = label_vector
    return example

dataset = dataset.map(preprocess_labels)
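
As a small illustration of the encoding above, the sketch below turns a label list into a multi-hot row. The helper name `to_multi_hot` is hypothetical, and `float32` is chosen here on the assumption that the targets are fed to a binary cross-entropy loss alongside float32 logits.

```python
import numpy as np

NUM_LABELS = 28  # 27 emotions + neutral, as in the "simplified" config

def to_multi_hot(label_ids, num_labels=NUM_LABELS):
    """Return a float32 vector with 1.0 at every index in label_ids."""
    vec = np.zeros(num_labels, dtype=np.float32)
    vec[list(label_ids)] = 1.0
    return vec

row = to_multi_hot([4, 27])
print(row.nonzero()[0])  # indices of the active labels
```

Every example gets a vector of the same length, so a whole batch can be stacked into a single `(batch_size, 28)` target matrix.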

## 4. Traditional ML Baseline (TF-IDF + OneVsRest Logistic Regression)
The assignment requires comparing a traditional ML model with the deep learning model, so a TF-IDF + logistic regression baseline is trained first.

In [4]:
print("\n--- Training Traditional ML Baseline ---")
X_train_text = dataset['train']['text']
X_test_text = dataset['test']['text']

y_train = mlb.fit_transform(dataset['train']['labels'])
y_test = mlb.transform(dataset['test']['labels'])

vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train_text)
X_test_tfidf = vectorizer.transform(X_test_text)

# OneVsRest is used because it's a multi-label problem
lr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
lr_model.fit(X_train_tfidf, y_train)

lr_preds = lr_model.predict(X_test_tfidf)
print(f"Baseline (LR) Micro-F1 Score: {f1_score(y_test, lr_preds, average='micro'):.4f}")


--- Training Traditional ML Baseline ---
Baseline (LR) Micro-F1 Score: 0.4242
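
Micro-F1 pools all label decisions, so frequent labels dominate the score. The toy sketch below, which uses made-up arrays rather than GoEmotions data, shows how micro- and macro-F1 can diverge when rare labels are never predicted. Reporting both for the baseline may give a fuller picture.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label data (illustrative only): label 0 is frequent, labels 1 and 2 are rare.
y_true = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
# A classifier that only ever predicts the frequent label.
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 0]])

micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"micro-F1: {micro:.4f}, macro-F1: {macro:.4f}")
```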


## 5. BERT Fine-Tuning
Using bert-base-uncased as specified in the reference materials.

In [29]:
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Remove original 'labels', 'id', and 'text' columns as they are no longer needed for training
tokenized_datasets = tokenized_datasets.remove_columns(["labels", "id", "text"])

# Rename 'label_vector' to 'labels' for Trainer compatibility
tokenized_datasets = tokenized_datasets.rename_column("label_vector", "labels")

# Set up the model for Multi-Label Classification
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,
    problem_type="multi_label_classification" # Critical for multi-label loss
)
model.to(device)

# Metrics function for multi-label
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Apply sigmoid to logits for multi-label predictions
    probs = 1 / (1 + np.exp(-logits))
    predictions = (probs > 0.5).astype(int)
    return {
        # zero_division=0 avoids warnings when no label is predicted
        "f1_micro": f1_score(labels, predictions, average="micro", zero_division=0),
        # For multi-label inputs this is exact-match (subset) accuracy
        "accuracy": accuracy_score(labels, predictions)
    }


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
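
The fixed 0.5 cut-off in `compute_metrics` can select no label at all when a briefly trained model keeps every sigmoid probability low. The sketch below sweeps candidate thresholds and keeps the one with the best micro-F1. The helper `best_threshold` and the synthetic logits are assumptions for illustration, not part of the notebook. In practice the sweep would run on validation logits.

```python
import numpy as np
from sklearn.metrics import f1_score

def sigmoid(x):
    """Numerically stable logistic function for float arrays."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

def best_threshold(logits, labels, candidates=None):
    """Return (threshold, micro-F1) for the best candidate (hypothetical helper)."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.5, 10)
    probs = sigmoid(logits)
    scores = [
        f1_score(labels, (probs > t).astype(int), average="micro", zero_division=0)
        for t in candidates
    ]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

# Synthetic stand-in for validation logits: all negative, so 0.5 selects nothing.
rng = np.random.default_rng(0)
labels = (rng.random((200, 28)) < 0.05).astype(int)
logits = -3.0 + 2.0 * labels + rng.uniform(-0.5, 0.5, size=labels.shape)

f1_at_half = f1_score(labels, (sigmoid(logits) > 0.5).astype(int),
                      average="micro", zero_division=0)
threshold, f1_tuned = best_threshold(logits, labels)
print(f"micro-F1 at 0.5: {f1_at_half:.4f}; at {threshold:.2f}: {f1_tuned:.4f}")
```

A threshold chosen this way should be tuned on the validation split only and then applied unchanged to the test split.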


## 6. Training Configuration

### 6.1 Stratified Sampling for Training and Validation Datasets

To ensure that the smaller training and validation subsets resemble the full dataset, we draw them by stratified sampling. Stratifying on the exact label combination is impractical in multi-label classification because many combinations are rare, so a common proxy is to stratify on the number of positive labels in each example.

In [36]:
from sklearn.model_selection import train_test_split
import numpy as np

# Calculate the number of positive labels for stratification
def get_num_positive_labels(dataset_split):
    return [np.sum(example['labels']) for example in dataset_split]

# Get stratification keys for train and validation splits
train_num_positive_labels = get_num_positive_labels(tokenized_datasets['train'])
valid_num_positive_labels = get_num_positive_labels(tokenized_datasets['validation'])

# --- Stratified sampling for the training dataset ---
total_train_size = len(tokenized_datasets['train'])
desired_train_subset_size = 2000

# Identify unique label counts and their frequencies
unique_train_counts, train_frequencies = np.unique(train_num_positive_labels, return_counts=True)

# Create a mapping from count value to its frequency
train_count_freq_map = dict(zip(unique_train_counts, train_frequencies))

# Keep only indices whose label count occurs at least twice (train_test_split cannot stratify on singleton classes)
filtered_train_indices = [i for i, count in enumerate(train_num_positive_labels) if train_count_freq_map[count] >= 2]

if len(filtered_train_indices) >= desired_train_subset_size:
    # Create a temporary dataset and corresponding stratification labels from filtered indices
    temp_train_dataset = tokenized_datasets['train'].select(filtered_train_indices)
    temp_train_labels = [train_num_positive_labels[i] for i in filtered_train_indices]

    # Perform stratified sampling on the temporary dataset's indices
    train_indices_for_subset, _ = train_test_split(
        range(len(temp_train_dataset)),
        train_size=desired_train_subset_size,
        stratify=temp_train_labels,
        random_state=42
    )
    # Select the final stratified subset using the indices from the temporary dataset
    stratified_train_dataset = temp_train_dataset.select(train_indices_for_subset)
else:
    print("Warning: Not enough stratifiable samples for desired_train_subset_size after filtering rare classes. Reverting to random sampling for train.")
    stratified_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(desired_train_subset_size))


# --- Stratified sampling for the validation dataset ---
total_valid_size = len(tokenized_datasets['validation'])
desired_valid_subset_size = 200

# Identify unique label counts and their frequencies
unique_valid_counts, valid_frequencies = np.unique(valid_num_positive_labels, return_counts=True)

# Create a mapping from count value to its frequency
valid_count_freq_map = dict(zip(unique_valid_counts, valid_frequencies))

# Keep only indices whose label count occurs at least twice (train_test_split cannot stratify on singleton classes)
filtered_valid_indices = [i for i, count in enumerate(valid_num_positive_labels) if valid_count_freq_map[count] >= 2]

if len(filtered_valid_indices) >= desired_valid_subset_size:
    # Create a temporary dataset and corresponding stratification labels from filtered indices
    temp_valid_dataset = tokenized_datasets['validation'].select(filtered_valid_indices)
    temp_valid_labels = [valid_num_positive_labels[i] for i in filtered_valid_indices]

    # Perform stratified sampling on the temporary dataset's indices
    valid_indices_for_subset, _ = train_test_split(
        range(len(temp_valid_dataset)),
        train_size=desired_valid_subset_size,
        stratify=temp_valid_labels,
        random_state=42
    )
    # Select the final stratified subset using the indices from the temporary dataset
    stratified_eval_dataset = temp_valid_dataset.select(valid_indices_for_subset)
else:
    print("Warning: Not enough stratifiable samples for desired_valid_subset_size after filtering rare classes. Reverting to random sampling for validation.")
    stratified_eval_dataset = tokenized_datasets['validation'].shuffle(seed=42).select(range(desired_valid_subset_size))


print(f"Original training dataset size: {total_train_size}")
print(f"Stratified training dataset size: {len(stratified_train_dataset)}")
print(f"Original validation dataset size: {total_valid_size}")
print(f"Stratified validation dataset size: {len(stratified_eval_dataset)}")

Original training dataset size: 43410
Stratified training dataset size: 2000
Original validation dataset size: 5426
Stratified validation dataset size: 200
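
The training and validation blocks above repeat the same logic. A minimal sketch of a shared helper follows. The name `stratified_indices` is hypothetical, and the label counts are synthetic stand-ins for the output of `get_num_positive_labels`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_indices(strata, subset_size, seed=42):
    """Return sorted indices of a subset whose strata proportions follow the input.

    Hypothetical helper: `strata` holds one key per example, for instance the
    number of positive labels. Strata with a single member are dropped because
    train_test_split cannot stratify on them.
    """
    strata = np.asarray(strata)
    values, counts = np.unique(strata, return_counts=True)
    keep = np.flatnonzero(np.isin(strata, values[counts >= 2]))
    if len(keep) <= subset_size:
        raise ValueError("not enough stratifiable examples for the requested size")
    chosen, _ = train_test_split(
        keep, train_size=subset_size, stratify=strata[keep], random_state=seed
    )
    return np.sort(chosen)

# Synthetic label counts (illustrative only): most examples carry one label.
rng = np.random.default_rng(0)
label_counts = rng.choice([1, 2, 3, 5], size=1000, p=[0.83, 0.15, 0.019, 0.001])
subset = stratified_indices(label_counts, subset_size=200)
print(len(subset), np.bincount(label_counts[subset]))
```

In the notebook, the returned indices could then be passed to `tokenized_datasets['train'].select(subset.tolist())`, and the same call with a different size would cover the validation split.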


### 6.2 Training Configuration with Stratified Datasets

Now, we will update the `Trainer` to use the `stratified_train_dataset` and `stratified_eval_dataset`.

In [37]:
training_args = TrainingArguments(
    output_dir="./finetuning-bert-text-classification",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2, # kept low for demonstration speed; training uses only the 2,000-example subset
    weight_decay=0.01,
    load_best_model_at_end=True,
)

# Using the stratified subsets for demonstration speed
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=stratified_train_dataset,
    eval_dataset=stratified_eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | F1 Micro | Accuracy |
|---|---|---|---|---|
| 1 | No log | 0.189069 | 0.0 | 0.0 |
| 2 | No log | 0.170387 | 0.0 | 0.0 |




TrainOutput(global_step=250, training_loss=0.2355742492675781, metrics={'train_runtime': 5978.973, 'train_samples_per_second': 0.669, 'train_steps_per_second': 0.042, 'total_flos': 263172476928000.0, 'train_loss': 0.2355742492675781, 'epoch': 2.0})
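
The `global_step=250` and the "No log" entries in the training-loss column can be checked with simple arithmetic. The sketch below assumes a single device, no gradient accumulation, and a `logging_steps` default of 500 in `TrainingArguments`; that default is an assumption about the installed `transformers` version.

```python
import math

train_examples = 2000   # size of stratified_train_dataset
batch_size = 16         # per_device_train_batch_size
epochs = 2              # num_train_epochs
logging_steps = 500     # assumed TrainingArguments default

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print("training loss would be logged:", total_steps >= logging_steps)
```

Under these assumptions the run ends before the first logging step, so passing a smaller value such as `logging_steps=25` to `TrainingArguments` should make the training loss appear in the table.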

## 7. Final Comparison & Inference

In [39]:
results = trainer.evaluate(tokenized_datasets["test"])
print(f"\nFinal BERT Micro-F1: {results['eval_f1_micro']:.4f}")
print(f"Final BERT Accuracy: {results['eval_accuracy']:.4f}")

# Save the model
model.save_pretrained("./goemotions-bert-model")
tokenizer.save_pretrained("./goemotions-bert-model")

# Inference Example
classifier = pipeline(
    "text-classification",
    model="./goemotions-bert-model",
    top_k=3,
    device=0 if torch.cuda.is_available() else -1
)

sample_text = "I'm so happy that I finally passed the deep learning exam!"
print(f"\nText: {sample_text}")
print(f"Top 3 Predictions: {classifier(sample_text)}")

| Epoch | Training Loss | Validation Loss | F1 Micro | Accuracy |
|---|---|---|---|---|
| 0 | No log | 0.15038 | 0.0 | 0.0 |



Final BERT Micro-F1: 0.0000
Final BERT Accuracy: 0.0000


Device set to use cpu



Text: I'm so happy that I finally passed the deep learning exam!
Top 3 Predictions: [[{'label': 'LABEL_27', 'score': 0.24556677043437958}, {'label': 'LABEL_0', 'score': 0.13823865354061127}, {'label': 'LABEL_4', 'score': 0.093324676156044}]]
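
The predictions show generic names such as `LABEL_27` because the saved model config carries no emotion names. The sketch below maps them back to readable labels. The `EMOTIONS` list is an assumption about the label order of the `simplified` config, and `rename_predictions` is a hypothetical helper. In the notebook the names could instead be read from `dataset["train"].features["labels"].feature.names` and passed as `id2label` and `label2id` to `from_pretrained`.

```python
# Assumed label order of the GoEmotions "simplified" config (verify against the dataset).
EMOTIONS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]
id2label = dict(enumerate(EMOTIONS))
label2id = {name: idx for idx, name in id2label.items()}

def rename_predictions(predictions, id2label=id2label):
    """Replace generic 'LABEL_<n>' names with emotion names (hypothetical helper)."""
    renamed = []
    for pred in predictions:
        idx = int(pred["label"].split("_")[-1])
        renamed.append({"label": id2label[idx], "score": pred["score"]})
    return renamed

# Top-3 output copied from the inference cell above.
top3 = [
    {"label": "LABEL_27", "score": 0.24556677043437958},
    {"label": "LABEL_0", "score": 0.13823865354061127},
    {"label": "LABEL_4", "score": 0.093324676156044},
]
readable = rename_predictions(top3)
print(readable)
```

Under this assumed order, the top prediction for the sample sentence corresponds to the neutral class, with every score well below 0.5. That is consistent with the micro-F1 of 0.0 reported for the 0.5 threshold.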
