# DistilBERT Fine-Tuning for Sentiment Analysis

This notebook fine-tunes DistilBERT for three-class sentiment classification (positive, negative, mixed) on airline reviews. The implementation addresses class imbalance through a class-weighted focal loss and supports reproducibility with deterministic seeding.


## Environment & Configuration

Setting up the necessary libraries and configuring the hardware environment.

> [!NOTE]
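> **Reproducibility**: `CUBLAS_WORKSPACE_CONFIG` is set explicitly and deterministic algorithms are enabled in PyTorch, so that experiments are repeatable and less exposed to hardware non-determinism. With `warn_only=True`, operations that lack a deterministic implementation raise a warning instead of an error.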


In [None]:
# For CUDA determinism
import os
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'  

# Force deterministic ops
import torch
torch.use_deterministic_algorithms(True, warn_only=True)  

from transformers import set_seed as hf_set_seed 
import random
import numpy as np

def set_seed(seed_value=12):
    """Fully deterministic seed setting"""
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    hf_set_seed(seed_value)  
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

# Initialize & Set SEED 
SEED = 12
set_seed(SEED)
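
As a quick sanity check of the seeding idea, the sketch below uses only Python's standard `random` module to show that re-seeding reproduces the same draws. The helper `draw_numbers` is hypothetical and not part of the notebook; the same pattern can be applied to `np.random` and `torch` after calling `set_seed`.

```python
import random

def draw_numbers(seed_value, n=5):
    """Seed the stdlib RNG and return n pseudo-random floats."""
    random.seed(seed_value)
    return [random.random() for _ in range(n)]

first_run = draw_numbers(12)    # same seed as the notebook
second_run = draw_numbers(12)   # re-seeding with the same value
other_seed = draw_numbers(13)   # a different seed for comparison

print("same seed, same draws:", first_run == second_run)
print("different seed, same draws:", first_run == other_seed)
```

If two runs with the same seed ever disagree in a check like this, the cause is usually a source of randomness that was not seeded or an operation without a deterministic implementation.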



In [None]:
# Core
import pandas as pd

# PyTorch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn utilities
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report

# Hugging Face ecosystem
import evaluate
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
    pipeline
    )

# Experiment tracking
import wandb
from kaggle_secrets import UserSecretsClient

# Configure device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Load Weights & Biases API key
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("wandb")

Using device: cuda


## Data Ingestion & Analysis

Loading the raw dataset and performing a preliminary inspection.

> [!IMPORTANT]
> **Strategic Insight**: Analyzing the class distribution early informs the modeling strategy. Identifying skewness here allows for proactive design of the loss function to handle class imbalance later.


In [7]:
data = pd.read_csv('/kaggle/input/sg-data0-85/0.85_labeled__dataset.csv')
label2id = {'positive': 0,'negative': 1, 'mixed sentiment': 2}
id2label = {0: "positive", 1: "negative", 2: "mixed sentiment"}

data['label'] = data['final_sentiment'].map(label2id)

df = data[['Text', 'label']]

In [8]:
display(df.head(),df.value_counts('label'), df.shape)

| | Text | label |
|---|---|---|
| 0 | Ok. We used this airline to go from Singapore ... | 2 |
| 1 | The service in Suites Class makes one feel lik... | 2 |
| 2 | don't give them your money. Booked, paid and r... | 1 |
| 3 | Best Airline in the World. Best airline in the... | 0 |
| 4 | Premium Economy Seating on Singapore Airlines ... | 1 |


label
0    6070
1    2412
2    1518
Name: count, dtype: int64

(10000, 2)
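
To put the skew in numbers, here is a small sketch that turns the label counts above into class shares and a majority-to-minority ratio. The counts are copied from the `value_counts` output and the names come from `id2label`; the variable names are illustrative only.

```python
# Label counts copied from the value_counts output above
label_counts = {"positive": 6070, "negative": 2412, "mixed sentiment": 1518}

total = sum(label_counts.values())
shares = {name: count / total for name, count in label_counts.items()}
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for name, share in shares.items():
    print(f"{name:>16}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

Roughly six in ten reviews are positive and the mixed class is about a quarter the size of the positive class, which is why plain accuracy alone would be a misleading target.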

## Model Initialization

Initializing the `distilbert-base-uncased` architecture.

> [!TIP]
> **Why DistilBERT?** DistilBERT was chosen for its efficiency-performance trade-off. Its authors report that it retains 97% of BERT's language-understanding performance while being 40% smaller and 60% faster, which makes it a good fit for production environments where inference latency is a constraint.


#### Custom Metrics
To evaluate the model properly, I implemented a custom `compute_metrics` function that calculates several complementary metrics at each evaluation during training.

This includes macro-averaged F1, weighted F1, and per-class F1 scores, allowing the training process to surface strengths and weaknesses across all labels rather than inflating performance due to dominant classes. Importantly, the function extracts the F1 score specifically for the mixed sentiment class, which served as a key indicator of whether Focal Loss and class weighting were improving performance where it mattered most.
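
To make the difference between the averaging modes concrete, the following sketch computes per-class, macro, and weighted F1 in plain Python on made-up toy labels. It is an illustration only: the helper `per_class_f1` and the toy data are hypothetical and not part of the notebook, which relies on the `evaluate` library instead.

```python
def per_class_f1(y_true, y_pred, num_classes):
    """Return the F1 score of each class, computed from TP/FP/FN counts."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy labels (made up for illustration): 6 positive, 2 negative, 2 mixed
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 2]  # one mixed review predicted as positive

f1_scores = per_class_f1(y_true, y_pred, num_classes=3)
support = [y_true.count(c) for c in range(3)]

# Macro: every class counts equally. Weighted: classes count by their support.
f1_macro = sum(f1_scores) / len(f1_scores)
f1_weighted = sum(f * s for f, s in zip(f1_scores, support)) / sum(support)

print("per-class:", [round(f, 3) for f in f1_scores])
print("macro:", round(f1_macro, 3), "weighted:", round(f1_weighted, 3))
```

In this toy case the single missed mixed review pulls the macro average down further than the weighted average, because the weighted average is dominated by the majority class.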

In [11]:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    acc = accuracy.compute(predictions=predictions, references=labels)['accuracy']
    f1_macro_score = f1.compute(predictions=predictions, references=labels, average='macro')['f1']
    f1_weighted_score = f1.compute(predictions=predictions, references=labels, average='weighted')['f1']
    f1_per_class = f1.compute(predictions=predictions, references=labels, average=None)['f1']
    return {
        "accuracy": acc,
        "f1_macro": f1_macro_score,
        "f1_weighted": f1_weighted_score,
        "f1_mixed": f1_per_class[2]  # index 2 = "mixed sentiment"
    }


In [12]:
# Setup Tokenizer & Model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, use_fast = True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=3, 
    id2label=id2label, 
    label2id=label2id
).to(device)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Data Processing Pipeline

The data is split into training, validation, and test sets to ensure robust evaluation. Text is then tokenized and class weights are calculated.

> [!IMPORTANT]
> **Handling Imbalance**: To counteract the imbalance observed in the EDA phase, class weights are computed and injected into the custom loss function to penalize the model more for misclassifying minority classes.


In [13]:
# 72/8/20 train/val/test split: 20% held out for test, then 10% of the remainder for validation
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, stratify=train_df['label'], random_state=42)
train_df.shape, val_df.shape, test_df.shape

((7200, 2), (800, 2), (2000, 2))
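
Because the validation set is carved out of the remaining 80% rather than the full dataset, the effective proportions are 72/8/20. The arithmetic can be sketched as follows (the variable names are illustrative):

```python
n_total = 10_000

n_test = round(n_total * 0.2)        # first split: 20% held out for testing
n_remaining = n_total - n_test
n_val = round(n_remaining * 0.1)     # second split: 10% of the remaining 80%
n_train = n_remaining - n_val

fractions = {
    "train": n_train / n_total,
    "val": n_val / n_total,
    "test": n_test / n_total,
}
print(n_train, n_val, n_test)
print(fractions)
```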

In [14]:
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))
test_dataset = Dataset.from_pandas(test_df.reset_index(drop=True))

In [15]:
def tokenize_function(examples):
    return tokenizer(examples["Text"], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True, num_proc=1)
val_dataset = val_dataset.map(tokenize_function, batched=True, num_proc=1)
test_dataset = test_dataset.map(tokenize_function, batched=True, num_proc=1)


In [None]:
# Class Weight
train_labels = train_dataset[:]['label']

class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_labels),
    y=train_labels
)
#boost= 1.5
#class_weights[2] = class_weights[2]*boost
#print(class_weights)
weights_tensor = torch.tensor(class_weights, dtype=torch.float).to(device)

print("Class Weights:", weights_tensor)

Class Weights: tensor([0.5492, 1.3817, 2.1958], device='cuda:0')
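
The printed weights can be reproduced by hand. With `class_weight='balanced'`, scikit-learn uses `n_samples / (n_classes * count_of_class)`. The per-class training counts below are an assumption: they are inferred from the stratified 72% training share of the full label counts and are not printed anywhere in the notebook.

```python
# Assumed per-class training counts (inferred from stratification, not printed
# in the notebook): about 72% of 6070, 2412, and 1518 respectively.
train_counts = [4370, 1737, 1093]

n_samples = sum(train_counts)
n_classes = len(train_counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = [n_samples / (n_classes * count) for count in train_counts]

print([round(w, 4) for w in balanced_weights])
```

Under these assumed counts the result matches the tensor printed above, so a mixed-sentiment example contributes roughly four times as much to the loss as a positive one.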


In [17]:
train_dataset = train_dataset.with_format("torch", columns=["input_ids", "attention_mask", "label"])
val_dataset = val_dataset.with_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset = test_dataset.with_format("torch", columns=["input_ids", "attention_mask", "label"])

## Training Configuration

Defining the hyperparameters and training arguments. The setup uses `bf16` mixed precision, which is natively accelerated on Ampere-class GPUs such as the A100 (the T4 lacks native `bf16` hardware support), and a `cosine` learning rate scheduler with a 10% warmup.


> [!TIP]
> **💡 Custom Loss Function**: The trainer implements Focal Loss with weighted cross-entropy to handle class imbalance and focus on hard-to-classify examples. Implementation details below.


In [None]:
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

        # Used weights_tensor for weighted loss
        loss_fct = nn.CrossEntropyLoss(weight=weights_tensor)
        loss = loss_fct(logits, labels)

        return (loss, outputs) if return_outputs else loss


class CustomTrainerWithFocal(CustomTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

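        # Focal loss (gamma = 1.25) on top of class-weighted cross-entropy.
        # Note: ce_loss already includes the class weight w, so pt = exp(-ce_loss)
        # equals p_true ** w rather than the raw true-class probability.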
        ce_loss = F.cross_entropy(logits, labels, weight=weights_tensor, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** 1.25 * ce_loss).mean()

        return (focal_loss, outputs) if return_outputs else focal_loss
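
To see what the focal term does, consider the unweighted form `(1 - p_t) ** gamma * CE`, where `p_t` is the predicted probability of the true class. The sketch below evaluates it in plain Python for one easy and one hard example. It deliberately leaves out the class weights, and `focal_term` is a hypothetical helper, not the notebook's implementation.

```python
import math

GAMMA = 1.25  # focusing parameter used in the notebook

def focal_term(p_true, gamma=GAMMA):
    """Unweighted focal loss for one example with true-class probability p_true."""
    ce = -math.log(p_true)                        # plain cross-entropy
    modulating_factor = (1.0 - p_true) ** gamma   # shrinks toward 0 as p_true -> 1
    return modulating_factor * ce

easy = focal_term(0.9)   # confidently correct example
hard = focal_term(0.3)   # poorly classified example

print(f"easy: CE={-math.log(0.9):.4f}, focal={easy:.4f}")
print(f"hard: CE={-math.log(0.3):.4f}, focal={hard:.4f}")
```

The easy example's loss is scaled down far more than the hard example's, so the gap between them widens and the gradient signal concentrates on examples the model still gets wrong.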

In [19]:
wandb.login(key=secret_value_0)
wandb.init(project="SG_FineTune", name='6_re-run12_v17_fl1.25_deploy' ) 

[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mvinilpatel-ai[0m ([33mvinilpatel-ai-personal[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
training_args = TrainingArguments(
    # model_dir and logging_dir are not defined in the cells shown; set them before running this cell
    output_dir=model_dir,
    num_train_epochs=3,                   # Slightly more training for refinement
    per_device_train_batch_size=16,                   
    per_device_eval_batch_size=16,
    learning_rate=6e-5,                                
    warmup_ratio=0.1,                    # Used ratio instead of fixed warmup steps
    weight_decay=0.05,                      
    bf16=True,                           # bf16 mixed precision (natively accelerated on Ampere or newer GPUs)
    logging_dir=logging_dir,
    logging_steps=10,
    eval_strategy="epoch", 
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_weighted",  # For Class Wise balance
    save_total_limit=2 ,
    max_grad_norm=1.0,
    lr_scheduler_type='cosine',
    seed = SEED,
    dataloader_num_workers=0
)
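
The combination of `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` can be pictured with a small pure-Python sketch of linear warmup followed by a half-cosine decay. It assumes 675 total optimizer steps (the `global_step` reported by the training run) and that the warmup length is 10% of that, rounded up; `lr_at` is a hypothetical helper, not the scheduler object that `Trainer` builds.

```python
import math

BASE_LR = 6e-5
TOTAL_STEPS = 675                              # global_step reported by the training run
WARMUP_STEPS = math.ceil(0.1 * TOTAL_STEPS)    # assumption: warmup_ratio rounded up

def lr_at(step):
    """Linear warmup to BASE_LR, then half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 34, WARMUP_STEPS, 371, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

The learning rate climbs from zero to its peak over the warmup steps and then decays smoothly, which avoids large updates to the freshly initialized classifier head at the very start of training.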

In [None]:
trainer = CustomTrainerWithFocal(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # Custom metrics: accuracy, macro F1, weighted F1, mixed-class F1
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)



## Training & Evaluation

The model is trained on the training data and validated at the end of each epoch. Once training is complete, a final evaluation is run on the held-out test set to estimate performance on unseen data.


In [22]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | F1 Weighted | F1 Mixed |
|---|---|---|---|---|---|---|
| 1 | 0.259 | 0.243451 | 0.80625 | 0.775305 | 0.826047 | 0.567335 |
| 2 | 0.142 | 0.254122 | 0.84875 | 0.804538 | 0.862038 | 0.618297 |
| 3 | 0.0586 | 0.304402 | 0.86875 | 0.81442 | 0.875144 | 0.625899 |


TrainOutput(global_step=675, training_loss=0.20620376997523837, metrics={'train_runtime': 575.8208, 'train_samples_per_second': 37.512, 'train_steps_per_second': 1.172, 'total_flos': 2861346838118400.0, 'train_loss': 0.20620376997523837, 'epoch': 3.0})
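
With `load_best_model_at_end=True` and `metric_for_best_model="f1_weighted"`, the checkpoint that is restored depends on which metric is tracked. The sketch below replays that selection on the logged validation numbers. It is a simplified illustration that assumes higher is better for F1 and lower is better for loss, and is not the `Trainer` internals.

```python
# Validation metrics per epoch, copied from the training log above
eval_history = [
    {"epoch": 1, "eval_loss": 0.243451, "f1_weighted": 0.826047},
    {"epoch": 2, "eval_loss": 0.254122, "f1_weighted": 0.862038},
    {"epoch": 3, "eval_loss": 0.304402, "f1_weighted": 0.875144},
]

best_by_f1 = max(eval_history, key=lambda row: row["f1_weighted"])
best_by_loss = min(eval_history, key=lambda row: row["eval_loss"])

print("best epoch by weighted F1:", best_by_f1["epoch"])
print("best epoch by validation loss:", best_by_loss["epoch"])
```

The two criteria disagree here: validation loss rises after the first epoch while weighted F1 keeps improving, so the choice of `metric_for_best_model` determines which checkpoint is kept.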

In [25]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.2434888333082199,
 'eval_accuracy': 0.8885,
 'eval_f1_macro': 0.8464833523306581,
 'eval_f1_weighted': 0.8944322055157344,
 'eval_f1_mixed': 0.6827880512091038,
 'eval_runtime': 17.1392,
 'eval_samples_per_second': 116.692,
 'eval_steps_per_second': 3.676,
 'epoch': 3.0}

In [None]:
# Get predictions 
predictions = trainer.predict(test_dataset)

y_true = predictions.label_ids
y_pred = predictions.predictions.argmax(axis=1)

print(classification_report(y_true, y_pred, target_names=label2id.keys()))

                 precision    recall  f1-score   support

       positive       0.97      0.91      0.94      1214
       negative       0.95      0.89      0.92       482
mixed sentiment       0.60      0.79      0.68       304

       accuracy                           0.89      2000
      macro avg       0.84      0.86      0.85      2000
   weighted avg       0.91      0.89      0.89      2000
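
The summary rows of the report follow directly from the per-class rows. As a worked check, the sketch below recomputes the mixed-class F1 from its precision and recall, and the macro and weighted averages from the per-class F1 values. The inputs are the two-decimal figures printed above, so the results agree with the report only up to rounding.

```python
# Per-class figures copied from the classification report above
report = {
    "positive": {"precision": 0.97, "recall": 0.91, "f1": 0.94, "support": 1214},
    "negative": {"precision": 0.95, "recall": 0.89, "f1": 0.92, "support": 482},
    "mixed sentiment": {"precision": 0.60, "recall": 0.79, "f1": 0.68, "support": 304},
}

# F1 is the harmonic mean of precision and recall
mixed = report["mixed sentiment"]
mixed_f1 = 2 * mixed["precision"] * mixed["recall"] / (mixed["precision"] + mixed["recall"])

total_support = sum(row["support"] for row in report.values())
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(f"mixed F1: {mixed_f1:.3f}")
print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

The mixed class pairs high recall with low precision: the model catches most mixed reviews but also labels a fair number of positive or negative reviews as mixed.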



In [27]:
#wandb.finish()

## Model Publishing

Finally, the trained model and tokenizer are pushed to the Hugging Face Hub, making the model accessible for inference and future deployment.


In [28]:
from huggingface_hub import notebook_login
notebook_login()


In [29]:
distilbert_repo_id = "Yoshaaa7/distilbert-SGairline-sentiment-private"
model.push_to_hub(distilbert_repo_id, private=True)
tokenizer.push_to_hub(distilbert_repo_id, private=True)


CommitInfo(commit_url='https://huggingface.co/Yoshaaa7/distilbert-SGairline-sentiment-private/commit/43d59982241506bdbc79c598b82de60c818d3708', commit_message='Upload tokenizer', commit_description='', oid='43d59982241506bdbc79c598b82de60c818d3708', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Yoshaaa7/distilbert-SGairline-sentiment-private', endpoint='https://huggingface.co', repo_type='model', repo_id='Yoshaaa7/distilbert-SGairline-sentiment-private'), pr_revision=None, pr_num=None)

## Results & Retrospective

### Performance Analysis

The model achieved a weighted F1-score of **0.89** on the held-out test set (accuracy 0.89, macro F1 0.85), which is a strong result for an imbalanced 3-class problem.

- **Positive/Negative**: The model is highly effective at distinguishing clear sentiment (F1 of 0.94 and 0.92, respectively)
- **The Challenge of "Mixed"**: The "Mixed Sentiment" class was the hardest to predict (F1 ~0.68, with recall of 0.79 but precision of only 0.60). This is consistent with the hypothesis that subtle, conflicting sentiments are difficult for the model to disentangle

### Impact of Technical Decisions

- **Focal Loss**: Implementing this (with γ = 1.25) was important for the "Mixed" class. Without it, the model would likely have favored the majority classes at the expense of "Mixed"
- **Class Weights**: These ensured that the minority classes weren't drowned out during gradient updates

### Tools & Skills

- **Weights & Biases**: W&B was used to track experiments. Seeing the loss curves diverge early helped tune the learning rate
- **Hugging Face**: Leveraging the `Trainer` API allowed focus on the custom loss logic rather than writing boilerplate training loops
