# RoBERTa Fine-Tuning for Sentiment Analysis

Fine-tuning RoBERTa for sentiment classification on airline reviews. The implementation addresses class imbalance through custom loss functions and ensures reproducibility with deterministic seeding.

## Environment & Configuration

Setting up the necessary libraries and configuring the hardware environment.

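> **Reproducibility**: `CUBLAS_WORKSPACE_CONFIG` is set explicitly and PyTorch's deterministic algorithms are enabled, so that experiments are repeatable and not subject to hardware non-determinism. Because `warn_only=True` is used, operations without a deterministic implementation raise a warning instead of an error, so full determinism is not guaranteed.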


In [None]:
# For CUDA determinism
import os
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'  

# Force deterministic ops
import torch
torch.use_deterministic_algorithms(True, warn_only=True)  

from transformers import set_seed as hf_set_seed 
import random
import numpy as np

def set_seed(seed_value=12):
    """Fully deterministic seed setting"""
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    hf_set_seed(seed_value)  
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

# Initialize & Set SEED 
SEED = 12
set_seed(SEED)



In [None]:
# Core
import pandas as pd

# PyTorch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn utilities
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report

# Hugging Face ecosystem
import evaluate
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
    pipeline
    )

# Experiment tracking
import wandb
from kaggle_secrets import UserSecretsClient

# Configure device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Load Weights & Biases API key
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("wandb")

## Data Ingestion & Analysis

Loading the raw dataset and performing a preliminary inspection.

> [!IMPORTANT]
> **Strategic Insight**: Analyzing the class distribution early informs the modeling strategy. Identifying skewness here allows for proactive design of the loss function to handle class imbalance later.


In [None]:
data = pd.read_csv('/kaggle/input/sg-data0-85/0.85_labeled__dataset.csv')
label2id = {'positive': 0,'negative': 1, 'mixed sentiment': 2}
id2label = {0: "positive", 1: "negative", 2: "mixed sentiment"}

data['label'] = data['final_sentiment'].map(label2id)

df = data[['Text', 'label']]

In [None]:
display(df.head(),df.value_counts('label'), df.shape)

| | Text | label |
|---|---|---|
| 0 | Ok. We used this airline to go from Singapore ... | 2 |
| 1 | The service in Suites Class makes one feel lik... | 2 |
| 2 | don't give them your money. Booked, paid and r... | 1 |
| 3 | Best Airline in the World. Best airline in the... | 0 |
| 4 | Premium Economy Seating on Singapore Airlines ... | 1 |


label
0    6070
1    2412
2    1518
Name: count, dtype: int64

(10000, 2)
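The counts above can be turned into class shares to make the skew concrete. This is a small arithmetic sketch that uses only the numbers printed by `value_counts`; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts output above
# (label ids 0, 1, 2 mapped back through id2label).
counts = {"positive": 6070, "negative": 2412, "mixed sentiment": 1518}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {name: n / total for name, n in counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in shares.items():
    print(f"{name:>16}: {share:.1%}")
print(f"largest / smallest class: {imbalance_ratio:.2f}x")
```

The positive class is roughly four times the size of the mixed sentiment class, which is the skew the class weights and focal loss are meant to address.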

## Model Initialization

Initializing the `roberta-base` architecture.

> [!TIP]
> **Model choice**: The code below loads `roberta-base`, not DistilBERT. DistilBERT remains a lighter alternative: it retains about 97% of BERT's performance while being 40% smaller and 60% faster, which suits production environments where inference latency is a constraint.


#### Custom Metric
To properly evaluate the model, I implemented a custom `compute_metrics` function that calculates accuracy and several F1 variants at every evaluation step during training.

This includes macro-averaged F1, weighted F1, and per-class F1 scores, allowing the training process to surface strengths and weaknesses across all labels rather than inflating performance due to dominant classes. Importantly, the function extracts the F1 score specifically for the mixed sentiment class, which served as a key indicator of whether Focal Loss and class weighting were improving performance where it mattered most.
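To illustrate why macro and weighted F1 can diverge on imbalanced data, here is a minimal sketch on made-up labels. The helper `per_class_f1` and the toy predictions are hypothetical; the notebook itself relies on the `evaluate` library rather than this hand-rolled arithmetic.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} computed from true/false positive counts."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# Toy labels (hypothetical): 0 = positive, 1 = negative, 2 = mixed sentiment.
y_true = [0] * 6 + [1] * 2 + [2] * 2
y_pred = [0] * 6 + [1, 1] + [0, 2]   # one mixed review is predicted as positive

f1_by_class = per_class_f1(y_true, y_pred, labels=[0, 1, 2])
support = {c: y_true.count(c) for c in f1_by_class}

# Macro: every class counts equally. Weighted: classes count by their support.
f1_macro = sum(f1_by_class.values()) / len(f1_by_class)
f1_weighted = sum(f1_by_class[c] * support[c] for c in f1_by_class) / len(y_true)

print("per class:", f1_by_class)
print(f"macro: {f1_macro:.3f}  weighted: {f1_weighted:.3f}")
```

In this toy case the weighted score is higher than the macro score because the large, well-predicted class dominates the average, which is why the per-class score for the minority label is tracked separately.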

## Prepare & Tokenize For Modeling

In [None]:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    acc = accuracy.compute(predictions=predictions, references=labels)['accuracy']
    f1_macro_score = f1.compute(predictions=predictions, references=labels, average='macro')['f1']
    f1_weighted_score = f1.compute(predictions=predictions, references=labels, average='weighted')['f1']
    f1_per_class = f1.compute(predictions=predictions, references=labels, average=None)['f1']
    return {
        "accuracy": acc,
        "f1_macro": f1_macro_score,
        "f1_weighted": f1_weighted_score,
        "f1_mixed": f1_per_class[2]
    }


In [None]:
# Setup Tokenizer & Model
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, use_fast = True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=3, 
    id2label=id2label, 
    label2id=label2id
).to(device)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Data Processing Pipeline

The data is split into training, validation, and test sets to ensure robust evaluation. Text is then tokenized and class weights are calculated.

> [!IMPORTANT]
> **Handling Imbalance**: To counteract the imbalance observed in the EDA phase, class weights are computed and injected into the custom loss function to penalize the model more for misclassifying minority classes.


In [None]:
# 72/8/20 split: 20% is held out for testing, then 10% of the remaining 80% becomes validation
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, stratify=train_df['label'], random_state=42)
train_df.shape, val_df.shape, test_df.shape

((7200, 2), (800, 2), (2000, 2))

In [None]:
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))
test_dataset = Dataset.from_pandas(test_df.reset_index(drop=True))

## Tokenize

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["Text"], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True, num_proc=1)
val_dataset = val_dataset.map(tokenize_function, batched=True, num_proc=1)
test_dataset = test_dataset.map(tokenize_function, batched=True, num_proc=1)


In [None]:
# Class Weight
train_labels = train_dataset[:]['label']

class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_labels),
    y=train_labels
)
#boost= 1.5
#class_weights[2] = class_weights[2]*boost
#print(class_weights)

weights_tensor = torch.tensor(class_weights, dtype=torch.float).to(device)

print("Class Weights:", weights_tensor)


Class Weights: tensor([0.5492, 1.3817, 2.1958], device='cuda:0')
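For reference, scikit-learn's `class_weight='balanced'` heuristic is `n_samples / (n_classes * count_c)`. The sketch below reproduces that arithmetic by hand. The per-class training counts are not printed in the notebook; they are inferred here from the stratified 7,200-row training split and should be read as an assumption.

```python
# Per-class counts in the training split.
# ASSUMPTION: inferred from the stratified 7,200-row split, not printed in the notebook.
train_counts = {0: 4370, 1: 1737, 2: 1093}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in balanced_weights.items():
    print(label, round(weight, 4))
```

Under these assumed counts the rounded weights agree with the tensor printed above, with the mixed sentiment class (label 2) receiving about four times the weight of the positive class.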


In [None]:
train_dataset = train_dataset.with_format("torch", columns=["input_ids", "attention_mask", "label"])
val_dataset = val_dataset.with_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset = test_dataset.with_format("torch", columns=["input_ids", "attention_mask", "label"])

## Training Configuration

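Defining the hyperparameters and training arguments. The setup uses `bf16` mixed precision, which is hardware-accelerated on GPUs with native bfloat16 support (Ampere or newer, such as the A100). The T4 is a Turing card without native `bf16` acceleration, so `fp16` is usually the better fit there. A `cosine` learning rate scheduler decays the learning rate smoothly after warmup.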


> [!TIP]
> **💡 Custom Loss Function**: The trainer implements Focal Loss with weighted cross-entropy to handle class imbalance and focus on hard-to-classify examples. Implementation details below.


In [None]:
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

        # Use the weights_tensor for weighted loss
        loss_fct = nn.CrossEntropyLoss(weight=weights_tensor)
        loss = loss_fct(logits, labels)

        return (loss, outputs) if return_outputs else loss


class CustomTrainerWithFocal(CustomTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

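        # Focal loss on top of class-weighted cross-entropy, with gamma = 1.25
        # (not the commonly used value of 2).
        ce_loss = F.cross_entropy(logits, labels, weight=weights_tensor, reduction='none')
        # Note: ce_loss is already class-weighted, so pt is exp(-w * CE)
        # rather than the raw predicted probability of the true class.
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** 1.25 * ce_loss).mean()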

        return (focal_loss, outputs) if return_outputs else focal_loss
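The effect of the focal term can be seen on single examples with plain arithmetic. The sketch below mirrors the expression in `CustomTrainerWithFocal` for one example at a time; `focal_term` and the probabilities 0.9 and 0.3 are hypothetical illustrations, not values from the training run.

```python
import math

GAMMA = 1.25  # focusing exponent used in CustomTrainerWithFocal

def focal_term(p_true, class_weight=1.0, gamma=GAMMA):
    """Per-example loss following the notebook's expression.

    p_true is the predicted probability of the correct class (hypothetical input).
    """
    ce = -class_weight * math.log(p_true)   # class-weighted cross-entropy
    pt = math.exp(-ce)                      # equals p_true only when class_weight == 1
    return (1.0 - pt) ** gamma * ce

easy = focal_term(0.9)   # confidently correct prediction
hard = focal_term(0.3)   # poorly classified example

# How much more the hard example contributes than the easy one.
plain_ratio = math.log(0.3) / math.log(0.9)   # under plain cross-entropy
focal_ratio = hard / easy                     # under the focal formulation

print(f"plain CE ratio: {plain_ratio:.1f}, focal ratio: {focal_ratio:.1f}")
```

The focal factor shrinks the loss of well-classified examples far more than that of hard ones, so hard examples take up a larger share of the batch loss. Passing a `class_weight` other than 1 shows how weighting the cross-entropy before computing `pt` also changes the modulating factor.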

## Initialization: Config, Train, Test

In [16]:
wandb.login(key=secret_value_0)
wandb.init(project="SG_roberta", name='v1.5_re-run' ) 

[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mvinilpatel-ai[0m ([33mvinilpatel-ai-personal[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [17]:
# NOTE: `model_dir` and `logging_dir` are not defined in the cells shown here;
# they must be set to output paths before this cell runs.
training_args = TrainingArguments(
    output_dir=model_dir,
    num_train_epochs=4,                                # Slightly more training for refinement
    per_device_train_batch_size=16,                    # Batch size per device
    per_device_eval_batch_size=16,
    learning_rate=4e-5,                                # Trainer default is 5e-5
    warmup_ratio=0.1,                                  # Use ratio instead of fixed warmup steps
    weight_decay=0.05,                                 # Weight decay applied by the optimizer
    bf16=True,                                         # bfloat16 mixed precision
    logging_dir=logging_dir,
    logging_steps=10,
    eval_strategy="epoch",                             # Changed from evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_weighted",
    save_total_limit=2,
    max_grad_norm=1.0,
    lr_scheduler_type='cosine',                        # Options include 'linear' and 'cosine'
    seed=SEED,
    dataloader_num_workers=0
)
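To make the `warmup_ratio` and `cosine` settings concrete, the sketch below traces the learning rate over the run. It assumes the usual shape of a cosine schedule with warmup (a linear ramp to the peak, then a half-cosine decay to zero) and takes the 900 total optimizer steps from the `global_step` reported by the training output later in this notebook. The helper `lr_at` is illustrative and is not the scheduler object the Trainer builds.

```python
import math

PEAK_LR = 4e-5       # learning_rate from TrainingArguments
TOTAL_STEPS = 900    # global_step reported by the training run
WARMUP_STEPS = 90    # warmup_ratio=0.1 of the 900 total steps

def lr_at(step):
    """Learning rate at a given optimizer step (illustrative helper)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Half-cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 45, 90, 495, 900):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at step 90, falls to half its peak midway through the decay phase, and reaches zero at the final step.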

In [18]:
trainer = CustomTrainerWithFocal(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer, # Although deprecated, keeping for now based on previous code
    compute_metrics=compute_metrics,  # accuracy, F1 etc
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)



## Training & Evaluation

The model is trained on the training data and validated at the end of each epoch. Once training is complete, a final evaluation is run on the held-out test set to estimate performance on unseen data.


In [19]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | F1 Weighted | F1 Mixed |
|---|---|---|---|---|---|---|
| 1 | 0.2488 | 0.233288 | 0.85375 | 0.808149 | 0.865864 | 0.623794 |
| 2 | 0.1584 | 0.230412 | 0.85 | 0.809873 | 0.861528 | 0.614379 |
| 3 | 0.0562 | 0.296025 | 0.87125 | 0.82933 | 0.880246 | 0.659933 |
| 4 | 0.0137 | 0.376121 | 0.88625 | 0.834207 | 0.888989 | 0.656371 |


TrainOutput(global_step=900, training_loss=0.16939397189352248, metrics={'train_runtime': 1620.0131, 'train_samples_per_second': 17.778, 'train_steps_per_second': 0.556, 'total_flos': 7577666430566400.0, 'train_loss': 0.16939397189352248, 'epoch': 4.0})
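The per-epoch log can be used to check which checkpoint each selection criterion would favor. This sketch copies the validation numbers from the training log above. It assumes that `metric_for_best_model="f1_weighted"` is treated as higher-is-better, which I understand to be the Trainer's default for metrics that are not a loss.

```python
# Validation metrics per epoch, copied from the training log above.
history = [
    {"epoch": 1, "eval_loss": 0.233288, "f1_weighted": 0.865864, "f1_mixed": 0.623794},
    {"epoch": 2, "eval_loss": 0.230412, "f1_weighted": 0.861528, "f1_mixed": 0.614379},
    {"epoch": 3, "eval_loss": 0.296025, "f1_weighted": 0.880246, "f1_mixed": 0.659933},
    {"epoch": 4, "eval_loss": 0.376121, "f1_weighted": 0.888989, "f1_mixed": 0.656371},
]

# Epoch favored by each criterion.
best_by_f1 = max(history, key=lambda row: row["f1_weighted"])
best_by_loss = min(history, key=lambda row: row["eval_loss"])
best_by_mixed = max(history, key=lambda row: row["f1_mixed"])

print("best weighted F1 :", best_by_f1["epoch"])
print("lowest eval loss :", best_by_loss["epoch"])
print("best mixed F1    :", best_by_mixed["epoch"])
```

The three criteria disagree: weighted F1 favors the last epoch, validation loss is lowest at epoch 2, and the mixed sentiment F1 peaks at epoch 3. Validation loss rising while training loss keeps falling is a common sign of overfitting, so the choice of selection metric matters here.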

In [22]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.29901665449142456,
 'eval_accuracy': 0.903,
 'eval_f1_macro': 0.8592961914780917,
 'eval_f1_weighted': 0.9058134497012248,
 'eval_f1_mixed': 0.7073170731707317,
 'eval_runtime': 34.338,
 'eval_samples_per_second': 58.244,
 'eval_steps_per_second': 1.835,
 'epoch': 4.0}

In [23]:
predictions = trainer.predict(test_dataset)

y_true = predictions.label_ids
y_pred = predictions.predictions.argmax(axis=1)

print(classification_report(y_true, y_pred, target_names=label2id.keys()))

                 precision    recall  f1-score   support

       positive       0.96      0.94      0.95      1214
       negative       0.93      0.91      0.92       482
mixed sentiment       0.66      0.76      0.71       304

       accuracy                           0.90      2000
      macro avg       0.85      0.87      0.86      2000
   weighted avg       0.91      0.90      0.91      2000
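As a sanity check, the summary rows of the report can be recomputed from the per-class rows. The sketch below uses only the rounded values printed above, so small rounding differences against the unrounded metrics are expected; the helper `f1_from` is illustrative.

```python
# Per-class rows copied from the classification report above.
report = {
    "positive":        {"precision": 0.96, "recall": 0.94, "f1": 0.95, "support": 1214},
    "negative":        {"precision": 0.93, "recall": 0.91, "f1": 0.92, "support": 482},
    "mixed sentiment": {"precision": 0.66, "recall": 0.76, "f1": 0.71, "support": 304},
}

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

mixed = report["mixed sentiment"]
mixed_f1 = f1_from(mixed["precision"], mixed["recall"])

# Macro average: unweighted mean over classes.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: mean over classes weighted by support.
total_support = sum(row["support"] for row in report.values())
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(f"mixed F1: {mixed_f1:.2f}, macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
```

The mixed sentiment class has higher recall than precision, meaning the model tends to over-predict that label. Because the class has the smallest support, it pulls the macro average down more than the weighted average.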



In [29]:
wandb.finish()

| Metric | History (sparkline) |
|---|---|
| eval/accuracy | ▁▁▄▆█ |
| eval/f1_macro | ▁▁▄▅█ |
| eval/f1_mixed | ▂▁▄▄█ |
| eval/f1_weighted | ▂▁▄▅█ |
| eval/loss | ▁▁▄█▄ |
| eval/runtime | ▁▁▁▁█ |
| eval/samples_per_second | ▆█▃▅▁ |
| eval/steps_per_second | ▄▅▁▃█ |
| test/accuracy | ▁ |
| test/f1_macro | ▁ |

| Metric | Final value |
|---|---|
| eval/accuracy | 0.903 |
| eval/f1_macro | 0.8593 |
| eval/f1_mixed | 0.70732 |
| eval/f1_weighted | 0.90581 |
| eval/loss | 0.29902 |
| eval/runtime | 34.338 |
| eval/samples_per_second | 58.244 |
| eval/steps_per_second | 1.835 |
| test/accuracy | 0.903 |
| test/f1_macro | 0.8593 |

## Model Publishing

Finally, the trained model and tokenizer are pushed to the Hugging Face Hub, making the model accessible for inference and future deployment.

In [25]:
from huggingface_hub import notebook_login
notebook_login()


In [26]:
repo_id = "Yoshaaa7/roberta-SGairline-sentiment-private"

In [27]:
model.push_to_hub(repo_id, private=True)
tokenizer.push_to_hub(repo_id, private=True)


CommitInfo(commit_url='https://huggingface.co/Yoshaaa7/roberta-SGairline-sentiment-private/commit/4d2b1c687950ef8bce83cc1cda0b2c5c932d547a', commit_message='Upload tokenizer', commit_description='', oid='4d2b1c687950ef8bce83cc1cda0b2c5c932d547a', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Yoshaaa7/roberta-SGairline-sentiment-private', endpoint='https://huggingface.co', repo_type='model', repo_id='Yoshaaa7/roberta-SGairline-sentiment-private'), pr_revision=None, pr_num=None)

## Results & Retrospective

### Performance Analysis