# Introduction

### What is BERT? (Bidirectional Encoder Representations from Transformers)

***BERT is a deep learning model developed by Google in 2018.***  
  
***It is based on the Transformer architecture.***  
  
***It processes an entire sentence at once, so each token's representation draws on context to both its left and its right. This is called bidirectional.***
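
To make "bidirectional" concrete, here is a minimal sketch (not BERT's actual implementation) that compares which tokens are visible to each position in a left-to-right model and in a bidirectional encoder. The helper name `visibility_mask` and the three-token sentence are illustrative assumptions.

```python
def visibility_mask(seq_len, bidirectional):
    """Return a seq_len x seq_len grid; entry [i][j] is 1 if token i may attend to token j."""
    return [
        [1 if (bidirectional or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = ["the", "cat", "sat"]  # hypothetical example sentence

# A left-to-right model lets each token see only itself and earlier tokens.
left_to_right = visibility_mask(len(tokens), bidirectional=False)

# A bidirectional encoder such as BERT lets every token see the whole sentence.
both_directions = visibility_mask(len(tokens), bidirectional=True)

for name, mask in [("left-to-right", left_to_right), ("bidirectional", both_directions)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>4}: {row}")
```

In the left-to-right grid, "the" cannot see "sat"; in the bidirectional grid every row is all ones.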

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805).

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)


# Exploratory Data Analysis and Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [None]:
import torch  # PyTorch: the deep learning framework used to build and train the model
import pandas as pd
from tqdm.notebook import tqdm
pd.set_option('display.max_rows', None)

In [None]:
df = pd.read_csv('./smileannotationsfinal.csv', names=['id', 'text', 'category'])
df.set_index('id', inplace=True) #Modifies the DataFrame df directly instead of returning a new DataFrame.


In [None]:
df.head()

id,text,category
611857364396965889,@aandraous @britishmuseum @AndrewsAntonio Merc...,nocode
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy
614877582664835073,@Sofabsports thank you for following me back. ...,happy
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy


In [None]:
df.category.value_counts()

category,count
nocode,1572
happy,1137
not-relevant,214
angry,57
surprise,35
sad,32
happy|surprise,11
happy|sad,9
disgust|angry,7
disgust,6


In [None]:
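# Drop tweets annotated with more than one emotion (e.g. "happy|sad").
# The raw string keeps "\|" as a literal pipe in the regex and avoids an invalid-escape warning.
df = df[~df.category.str.contains(r'\|')]

In [None]:
# Drop unannotated tweets; .copy() makes later column assignments act on an independent DataFrame.
df = df[df.category != 'nocode'].copy()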

In [None]:
df.category.value_counts()

category,count
happy,1137
not-relevant,214
angry,57
surprise,35
sad,32
disgust,6


In [None]:
possible_labels = df.category.unique()

In [None]:
#Assign a unique index to each label in the DataFrame.
#This is useful for converting categorical labels into numerical indices, which is often required for machine learning
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
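# map() looks up each category in label_dict and returns integer labels directly,
# which avoids the downcasting warning raised by replace() in recent pandas versions.
df['label'] = df.category.map(label_dict)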


In [None]:
df.head()

id,text,category,label
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0
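
The label encoding above can be reproduced on a toy list without pandas. This is a small sketch under the assumption that labels are numbered in order of first appearance, as `unique()` does; the sample categories are made up for illustration.

```python
categories = ["happy", "not-relevant", "happy", "angry", "happy"]  # toy stand-in for df.category

# dict.fromkeys keeps first-appearance order, like pandas' unique()
possible_labels = list(dict.fromkeys(categories))

# Forward mapping (name -> id) and inverse mapping (id -> name)
label_dict = {label: index for index, label in enumerate(possible_labels)}
label_dict_inverse = {index: label for label, index in label_dict.items()}

labels = [label_dict[c] for c in categories]
print(label_dict)
print(labels)
```

The inverse mapping is what lets a predicted id be turned back into a readable category name later on.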


# Training/Validation Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df.index.values,
                                                  df.label.values,
                                                  test_size=0.15,
                                                  random_state=17,
                                                  stratify=df.label.values)

In [None]:
df['data_type'] = ['not_set']*df.shape[0]

In [None]:
df

id,text,category,label,data_type
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0,not_set
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0,not_set
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0,not_set
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0,not_set
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0,not_set
614499696015503361,Lucky @FitzMuseum_UK! Good luck @MirandaStearn...,happy,0,not_set
613601881441570816,Yr 9 art students are off to the @britishmuseu...,happy,0,not_set
613696526297210880,@RAMMuseum Please vote for us as @sainsbury #s...,not-relevant,1,not_set
610746718641102848,#AskTheGallery Have you got plans to privatise...,not-relevant,1,not_set
612648200588038144,@BarbyWT @britishmuseum so beautiful,happy,0,not_set


In [None]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [None]:
df.groupby(['category', 'label', 'data_type']).count()

category,label,data_type,text
angry,2,train,48
angry,2,val,9
disgust,3,train,5
disgust,3,val,1
happy,0,train,966
happy,0,val,171
not-relevant,1,train,182
not-relevant,1,val,32
sad,4,train,27
sad,4,val,5
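
Because `stratify=df.label.values` was passed to `train_test_split`, each category should send roughly 15% of its tweets to the validation set. The quick check below uses only the counts shown in the table above; the variable names are illustrative.

```python
# (train, val) counts copied from the groupby output above
split_counts = {
    "angry": (48, 9),
    "disgust": (5, 1),
    "happy": (966, 171),
    "not-relevant": (182, 32),
    "sad": (27, 5),
}

# Share of each category that ended up in the validation set
val_fraction = {
    category: val / (train + val)
    for category, (train, val) in split_counts.items()
}

for category, fraction in val_fraction.items():
    print(f"{category:<13} validation share: {fraction:.3f}")
```

Small classes such as disgust drift furthest from 0.15 because their counts can only be whole numbers.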


# Loading Tokenizer and Encoding our Data

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [None]:
# Tokenizer arguments used below:
#   add_special_tokens     adds the special tokens BERT expects ([CLS] at the start, [SEP] at the end).
#   return_attention_mask  returns a mask with 1 for real tokens and 0 for padding.
#   padding=True           pads every sequence to the longest one in the batch
#                          (the older pad_to_max_length=True flag is deprecated).
#   truncation=True        explicitly truncates sequences longer than max_length.
#   max_length=256         maximum number of tokens per sequence.
#   return_tensors='pt'    returns PyTorch tensors (use 'tf' for TensorFlow).

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors='pt'
)
print(encoded_data_train)
encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors='pt'
)



# Token ids for the training set.
input_ids_train = encoded_data_train['input_ids']

# Attention masks for the training set: 1 marks a token to attend to, 0 marks padding.
attention_masks_train = encoded_data_train['attention_mask']

# Training labels as a PyTorch tensor.
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

# The same three tensors for the validation set.
input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)



{'input_ids': tensor([[  101, 16092,  3897,  ...,     0,     0,     0],
        [  101,  1030, 27034,  ...,     0,     0,     0],
        [  101,  1030, 10682,  ...,     0,     0,     0],
        ...,
        [  101, 11047,  1030,  ...,     0,     0,     0],
        [  101,  1030,  3680,  ...,     0,     0,     0],
        [  101,  1030,  2120,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
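
The trailing zeros in `input_ids` and `attention_mask` above come from padding. The sketch below imitates that step in plain Python to show how the two tensors relate; `pad_batch` is a hypothetical helper, not part of the tokenizer API, and the token ids between 101 and 102 are placeholders.

```python
PAD_ID = 0  # [PAD] id in bert-base-uncased, visible as the trailing zeros above

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# 101 = [CLS], 102 = [SEP]; the ids in between are made-up placeholders
batch = [
    [101, 2001, 2002, 2003, 102],
    [101, 2004, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Every 0 in the mask lines up with a padding id, which is how the model knows to ignore those positions.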


In [None]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
# Creates a TensorDataset for the training set, which combines input IDs, attention masks, and labels into a single dataset.
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [None]:
len(dataset_train)

1258

In [None]:
len(dataset_val)

223

# Setting up BERT Pretrained Model

In [None]:
from transformers import BertForSequenceClassification

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Creating Data Loaders

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [None]:
batch_size = 32

dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val,
                                   sampler=SequentialSampler(dataset_val),
                                   batch_size=batch_size)

# Setting Up Optimiser and Scheduler

In [None]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

In [None]:
# AdamW is an optimizer that implements the Adam algorithm with weight decay, which is commonly used for training transformer models.
# It helps in optimizing the model parameters during training.
optimizer = AdamW(model.parameters(),
                  lr=2e-5,  # Reduced learning rate
                  eps=1e-8,
                  weight_decay=0.1
                  )


In [None]:
epochs = 5

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=len(dataloader_train) // 2,  # Add warmup
    num_training_steps=len(dataloader_train) * epochs
)
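
As a rough illustration of what the scheduler does, the sketch below reimplements the documented shape of `get_linear_schedule_with_warmup` in plain Python: the learning rate rises linearly from 0 during warmup, then decays linearly to 0. The function name `linear_warmup_factor` is an assumption made for this example; the step counts follow from the 1258 training samples and batch size 32 used above.

```python
import math

def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
batches_per_epoch = math.ceil(1258 / 32)  # len(dataloader_train) for 1258 samples, batch size 32
epochs = 5
num_warmup_steps = batches_per_epoch // 2
num_training_steps = batches_per_epoch * epochs

for step in (0, num_warmup_steps, num_training_steps // 2, num_training_steps):
    lr = base_lr * linear_warmup_factor(step, num_warmup_steps, num_training_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With these settings the warmup covers half of the first epoch, after which the rate falls steadily until the last step.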


# Defining our Performance Metrics

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

In [None]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
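
Both metric functions start by turning logits into predicted class ids with an argmax. The following sketch walks through that step and the per-class hit counting on made-up logits, without NumPy; the `argmax` helper and the sample values are illustrative assumptions.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up logits for four samples over three classes, plus their true labels
logits = [
    [2.1, 0.3, -1.0],
    [0.2, 1.7, 0.1],
    [1.5, 1.4, 0.0],
    [-0.5, 0.0, 0.9],
]
true_labels = [0, 1, 1, 2]

predicted = [argmax(row) for row in logits]

# For each true class, count how many of its samples were predicted correctly
hits_per_class = {}
for truth, pred in zip(true_labels, predicted):
    correct, total = hits_per_class.get(truth, (0, 0))
    hits_per_class[truth] = (correct + int(pred == truth), total + 1)

for label, (correct, total) in sorted(hits_per_class.items()):
    print(f"class {label}: {correct}/{total}")
```

Note that this per-class "accuracy" is the fraction of a class's true samples that were recovered, which is the same quantity usually called recall.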

# Creating our Training Loop

In [None]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(dataloader_val)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [None]:
accuracy_per_class(predictions, true_vals)

Class: happy
Accuracy: 5/171

Class: not-relevant
Accuracy: 0/32

Class: angry
Accuracy: 9/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 0/5
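
The pipeline in the next cell cleans each tweet with a `clean_text` helper before tokenization. The standalone copy below shows what it does to a few inputs; the sample tweets and the `example.com` URL are invented for illustration.

```python
import re

def clean_text(text):
    """Strip URLs, @mentions and #hashtags, then collapse whitespace."""
    if isinstance(text, str):
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'@\w+', '', text)
        text = re.sub(r'#\w+', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text if len(text) > 1 else "[EMPTY]"
    return "[EMPTY]"

samples = [
    "@britishmuseum What a beautiful exhibition! #art https://example.com/x",
    "@NationalGallery #AskTheGallery",  # nothing left after cleaning
    None,                               # non-string input
]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

One trade-off worth keeping in mind: hashtags are removed entirely, so any emotional signal they carry (for example `#LoveWins`) is lost along with the noise.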



In [None]:
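# Complete pipeline in a single cell: data cleaning, tokenization, training with early stopping, and evaluation.
# It reuses `df`, `label_dict`, and `accuracy_per_class` from the cells above.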
import torch
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
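from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_class_weight  # needed for the class_weights computation below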
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW  # Import from torch.optim instead
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from tqdm.notebook import tqdm
import re
import random

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# ==================== DATA CLEANING & ANALYSIS ====================
print("=== DATA ANALYSIS ===")
print(f"Total samples: {len(df)}")
print("Class distribution:")
print(df['category'].value_counts())

# Check for data issues
print(f"\nMissing values: {df['text'].isnull().sum()}")
print(f"Empty texts: {(df['text'].str.strip() == '').sum()}")

# Enhanced text cleaning
def clean_text(text):
    if isinstance(text, str):
        # Remove URLs, mentions, hashtags
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'@\w+', '', text)
        text = re.sub(r'#\w+', '', text)
        # Remove extra spaces and trim
        text = re.sub(r'\s+', ' ', text)
        text = text.strip()
        return text if len(text) > 1 else "[EMPTY]"
    return "[EMPTY]"

df['text_clean'] = df['text'].apply(clean_text)

# Remove empty texts (.copy() so the column assignments below act on an independent DataFrame)
df = df[df['text_clean'] != "[EMPTY]"].copy()
print(f"After cleaning: {len(df)} samples")

# ==================== SIMPLIFIED APPROACH ====================
# Let's try without complex weighting first

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=17,
    stratify=df.label.values
)

df['data_type'] = 'not_set'
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

# Tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(df.label.values),
    y=df.label.values
)
class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)

def encode_data(texts, max_length=128):  # Reduced max_length
    return tokenizer.batch_encode_plus(
        texts,
        add_special_tokens=True,
        return_attention_mask=True,
        padding='max_length',  # Changed from True
        max_length=max_length,
        truncation=True,
        return_tensors='pt'
    )

# Encode data
encoded_train = encode_data(df[df.data_type=='train'].text_clean.values)
encoded_val = encode_data(df[df.data_type=='val'].text_clean.values)

# Create datasets
dataset_train = TensorDataset(
    encoded_train['input_ids'],
    encoded_train['attention_mask'],
    torch.tensor(df[df.data_type=='train'].label.values)
)

dataset_val = TensorDataset(
    encoded_val['input_ids'],
    encoded_val['attention_mask'],
    torch.tensor(df[df.data_type=='val'].label.values)
)

# ==================== SIMPLIFIED MODEL ====================
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_dict),
    output_attentions=False,
    output_hidden_states=False
)

model.to(device)

# ==================== CONSERVATIVE TRAINING ====================
# Small batch size, more epochs, careful learning rate
batch_size = 8  # Smaller batches
epochs = 10
warmup_steps = 0

# Data loaders (NO weighted sampling for now)
train_dataloader = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

val_dataloader = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=batch_size
)

# Conservative optimizer
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,  # Very small learning rate
    eps=1e-8
    )

# Scheduler
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

# ==================== DIAGNOSTIC TRAINING ====================
def evaluate_simple(dataloader):
    model.eval()
    predictions, true_vals = [], []

    for batch in dataloader:
        batch = tuple(b.to(device) for b in batch)

        with torch.no_grad():
            outputs = model(batch[0], attention_mask=batch[1])

        logits = outputs.logits
        logits = logits.detach().cpu().numpy()
        label_ids = batch[2].cpu().numpy()

        predictions.append(logits)
        true_vals.append(label_ids)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return predictions, true_vals

# Training loop with diagnostics
best_val_f1 = 0
no_improvement_count = 0
patience = 3

print("\n=== STARTING TRAINING ===")

for epoch in range(1, epochs + 1):
    # Training
    model.train()
    total_loss = 0

    for step, batch in enumerate(train_dataloader):
        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }

        outputs = model(**inputs)
        loss = outputs.loss  # Use default loss first

        total_loss += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_loss / len(train_dataloader)

    # Validation
    predictions, true_vals = evaluate_simple(val_dataloader)
    preds_flat = np.argmax(predictions, axis=1).flatten()
    val_f1 = f1_score(true_vals, preds_flat, average='weighted')

    print(f'Epoch {epoch}')
    print(f'  Train Loss: {avg_train_loss:.4f}')
    print(f'  Val F1: {val_f1:.4f}')

    # Check for improvement
    if val_f1 > best_val_f1:
        best_val_f1 = val_f1
        no_improvement_count = 0
        torch.save(model.state_dict(), 'best_model_simple.pt')
        print(f'  → New best model saved!')
    else:
        no_improvement_count += 1
        print(f'  → No improvement ({no_improvement_count}/{patience})')

    # Early stopping
    if no_improvement_count >= patience:
        print(f'Early stopping at epoch {epoch}')
        break

# Load best model
model.load_state_dict(torch.load('best_model_simple.pt'))
print(f"\nBest F1 Score: {best_val_f1:.4f}")

# Final evaluation
predictions, true_vals = evaluate_simple(val_dataloader)
accuracy_per_class(predictions, true_vals)

Using: cuda

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch 1 Summary:
Train Loss=1.1969, Train Acc=0.5860
Val F1=0.6785, Val Acc=0.7763
✅ New best model saved!

Epoch 2 Summary:
Train Loss=0.5732, Train Acc=0.8200
Val F1=0.7841, Val Acc=0.8265
✅ New best model saved!

Epoch 3 Summary:
Train Loss=0.4051, Train Acc=0.8789
Val F1=0.8696, Val Acc=0.8858
✅ New best model saved!

Epoch 4 Summary:
Train Loss=0.2990, Train Acc=0.9144
Val F1=0.8514, Val Acc=0.8721
No improvement (1/3)

Epoch 5 Summary:
Train Loss=0.2250, Train Acc=0.9322
Val F1=0.8888, Val Acc=0.9041
✅ New best model saved!

Epoch 6 Summary:
Train Loss=0.1619, Train Acc=0.9532
Val F1=0.8971, Val Acc=0.9087
✅ New best model saved!

Epoch 7 Summary:
Train Loss=0.1303, Train Acc=0.9645
Val F1=0.8964, Val Acc=0.9087
No improvement (1/3)

Epoch 8 Summary:
Train Loss=0.0924, Train Acc=0.9750
Val F1=0.8956, Val Acc=0.9087
No improvement (2/3)

Epoch 9 Summary:
Train Loss=0.0734, Train Acc=0.9806
Val F1=0.8859, Val Acc=0.8995
No improvement (3/3)
Early stopping!

🔥 Final Evaluation on Validation Set

Classification Report:
              precision    recall  f1-score   support

       happy       0.95      0.96      0.96       170
not-relevant       0.76      0.76      0.76        29
       angry       0.82      1.00      0.90         9
     disgust       0.00      0.00      0.00         1
         sad       0.00      0.00      0.00         5
    surprise       0.62      1.00      0.77         5

    accuracy                           0.91       219
   macro avg       0.53      0.62      0.56       219
weighted avg       0.89      0.91      0.90       219


Confusion Matrix:
[[163   5   0   0   0   2]
 [  5  22   1   0   0   1]
 [  0   0   9   0   0   0]
 [  0   1   0   0   0   0]
 [  3   1   1   0   0   0]
 [  0   0   0   0   0   5]]

Per-Class Accuracy:
Label: happy                | Accuracy: 95.88% (170 samples)
Label: not
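
The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does so in plain Python, reading rows as true classes and columns as predicted classes (the row sums match the support column); the `safe_div` helper is an assumption added to handle classes that were never predicted.

```python
labels = ["happy", "not-relevant", "angry", "disgust", "sad", "surprise"]

# Rows are true classes, columns are predicted classes (copied from the output above)
confusion = [
    [163, 5, 0, 0, 0, 2],
    [5, 22, 1, 0, 0, 1],
    [0, 0, 9, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [3, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 5],
]

def safe_div(num, den):
    """Return 0.0 instead of raising when the denominator is zero."""
    return num / den if den else 0.0

support = [sum(row) for row in confusion]                 # true samples per class
predicted_totals = [sum(col) for col in zip(*confusion)]  # predictions per class
total = sum(support)

recall = [safe_div(confusion[i][i], support[i]) for i in range(len(labels))]
precision = [safe_div(confusion[i][i], predicted_totals[i]) for i in range(len(labels))]
f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]

accuracy = sum(confusion[i][i] for i in range(len(labels))) / total
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

for name, p, r, f, s in zip(labels, precision, recall, f1, support):
    print(f"{name:>12}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}  support={s}")
print(f"accuracy={accuracy:.4f}  weighted F1={weighted_f1:.4f}")
```

The recomputed accuracy and weighted F1 agree with the `Val Acc=0.9087` and `Val F1=0.8971` logged for epoch 6, the checkpoint that was saved as best. The zero rows for disgust and sad show why the macro average is so much lower than the weighted one.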



In [None]:
# -*- coding: utf-8 -*-
import torch
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_class_weight
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from tqdm.notebook import tqdm
import re
import random

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# ==================== DATA CLEANING & ANALYSIS ====================
print("=== DATA ANALYSIS ===")
print(f"Total samples: {len(df)}")
print("Class distribution:")
print(df['category'].value_counts())

# Check for data issues
print(f"\nMissing values: {df['text'].isnull().sum()}")
print(f"Empty texts: {(df['text'].str.strip() == '').sum()}")

# Enhanced text cleaning
def clean_text(text):
    if isinstance(text, str):
        # Remove URLs, mentions, hashtags
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'@\w+', '', text)
        text = re.sub(r'#\w+', '', text)
        # Remove extra spaces and trim
        text = re.sub(r'\s+', ' ', text)
        text = text.strip()
        return text if len(text) > 1 else "[EMPTY]"
    return "[EMPTY]"

df['text_clean'] = df['text'].apply(clean_text)

# Remove empty texts (.copy() so the column assignments below act on an independent DataFrame)
df = df[df['text_clean'] != "[EMPTY]"].copy()
print(f"After cleaning: {len(df)} samples")

# ==================== OPTIMIZED HYPERPARAMETERS ====================
# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=42,  # Changed for different split
    stratify=df.label.values
)

df['data_type'] = 'not_set'
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

# Tokenization with optimized length
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Calculate class weights for handling imbalance
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(df.label.values),
    y=df.label.values
)
class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)

print("\nClass weights for handling imbalance:")
for cls_name, cls_idx in label_dict.items():
    # Index by the label id rather than the loop position, so names and weights cannot drift apart.
    print(f"  {cls_name}: {class_weights[cls_idx]:.2f}")

def encode_data(texts, max_length=192):  # Optimized length
    return tokenizer.batch_encode_plus(
        texts,
        add_special_tokens=True,
        return_attention_mask=True,
        padding='max_length',
        max_length=max_length,
        truncation=True,
        return_tensors='pt'
    )

# Encode data
encoded_train = encode_data(df[df.data_type=='train'].text_clean.values)
encoded_val = encode_data(df[df.data_type=='val'].text_clean.values)

# Create datasets
dataset_train = TensorDataset(
    encoded_train['input_ids'],
    encoded_train['attention_mask'],
    torch.tensor(df[df.data_type=='train'].label.values)
)

dataset_val = TensorDataset(
    encoded_val['input_ids'],
    encoded_val['attention_mask'],
    torch.tensor(df[df.data_type=='val'].label.values)
)

# ==================== OPTIMIZED MODEL ====================
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_dict),
    output_attentions=False,
    output_hidden_states=False,
    hidden_dropout_prob=0.2,  # Increased dropout for regularization
    attention_probs_dropout_prob=0.2
)

model.to(device)

# ==================== OPTIMIZED TRAINING PARAMETERS ====================
# Optimized batch size
batch_size = 12  # Balanced between 8 and 16
epochs = 15  # More epochs with early stopping
warmup_ratio = 0.1  # 10% warmup

# Data loaders
train_dataloader = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size,
    drop_last=True  # Prevent small last batch
)

val_dataloader = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=batch_size
)

# Optimized optimizer with weight decay
optimizer = AdamW(
    model.parameters(),
    lr=2e-5,  # Optimal for BERT fine-tuning
    eps=1e-8,
    weight_decay=0.01)

# Optimized scheduler with warmup
total_steps = len(train_dataloader) * epochs
warmup_steps = int(total_steps * warmup_ratio)

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

# ==================== ENHANCED TRAINING ====================
def evaluate_simple(dataloader):
    model.eval()
    predictions, true_vals = [], []

    for batch in dataloader:
        batch = tuple(b.to(device) for b in batch)

        with torch.no_grad():
            outputs = model(batch[0], attention_mask=batch[1])

        logits = outputs.logits
        logits = logits.detach().cpu().numpy()
        label_ids = batch[2].cpu().numpy()

        predictions.append(logits)
        true_vals.append(label_ids)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return predictions, true_vals

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    print("\n" + "="*50)
    print("CLASS-WISE ACCURACY")
    print("="*50)
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        accuracy = len(y_preds[y_preds==label]) / len(y_true)
        print(f'Class: {label_dict_inverse[label]:<15} Accuracy: {accuracy:.3f} ({len(y_preds[y_preds==label])}/{len(y_true)})')

# Training loop with enhanced diagnostics
best_val_f1 = 0
no_improvement_count = 0
patience = 4  # Increased patience
training_history = []

print("\n=== OPTIMIZED TRAINING STARTING ===")
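# Read the values from the optimizer so the log cannot disagree with the actual configuration.
print(f"Batch size: {batch_size}, Learning rate: {optimizer.defaults['lr']}, Epochs: {epochs}")
print(f"Warmup steps: {warmup_steps}, Weight decay: {optimizer.defaults['weight_decay']}")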

for epoch in range(1, epochs + 1):
    # Training
    model.train()
    total_loss = 0
    total_correct = 0
    total_samples = 0

    progress_bar = tqdm(train_dataloader, desc=f'Epoch {epoch}')

    for step, batch in enumerate(progress_bar):
        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }

        outputs = model(**inputs)

        # Use class-weighted loss
        logits = outputs.logits
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits, inputs['labels'])

        # Calculate training accuracy
        preds = torch.argmax(logits, dim=1)
        correct = (preds == inputs['labels']).sum().item()
        total_correct += correct
        total_samples += inputs['labels'].size(0)

        total_loss += loss.item()
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

        # Update progress bar
        train_acc = total_correct / total_samples
        progress_bar.set_postfix({
            'loss': f'{loss.item():.4f}',
            'acc': f'{train_acc:.4f}'
        })

    avg_train_loss = total_loss / len(train_dataloader)
    avg_train_acc = total_correct / total_samples

    # Validation
    predictions, true_vals = evaluate_simple(val_dataloader)
    preds_flat = np.argmax(predictions, axis=1).flatten()
    val_f1 = f1_score(true_vals, preds_flat, average='weighted')
    val_acc = (preds_flat == true_vals).mean()

    print(f'\nEpoch {epoch} Summary:')
    print(f'  Train Loss: {avg_train_loss:.4f} | Train Acc: {avg_train_acc:.4f}')
    print(f'  Val F1: {val_f1:.4f} | Val Acc: {val_acc:.4f}')

    # Store history
    training_history.append({
        'epoch': epoch,
        'train_loss': avg_train_loss,
        'train_acc': avg_train_acc,
        'val_f1': val_f1,
        'val_acc': val_acc
    })

    # Check for improvement
    if val_f1 > best_val_f1:
        best_val_f1 = val_f1
        no_improvement_count = 0
        torch.save(model.state_dict(), 'best_model_optimized.pt')
        print(f'  → New best model saved! (F1: {val_f1:.4f})')
    else:
        no_improvement_count += 1
        print(f'  → No improvement ({no_improvement_count}/{patience})')

    # Early stopping
    if no_improvement_count >= patience:
        print(f'Early stopping at epoch {epoch}')
        break

# Load best model
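# map_location keeps the checkpoint loadable if the device differs from the one used for training
model.load_state_dict(torch.load('best_model_optimized.pt', map_location=device))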
print(f"\nBest F1 Score: {best_val_f1:.4f}")

# Final evaluation
predictions, true_vals = evaluate_simple(val_dataloader)
accuracy_per_class(predictions, true_vals)

# Print training summary
print("\n" + "="*50)
print("TRAINING SUMMARY")
print("="*50)
for history in training_history:
    print(f"Epoch {history['epoch']:2d}: "
          f"Train Loss: {history['train_loss']:.4f}, "
          f"Val F1: {history['val_f1']:.4f}")

Using device: cuda
=== DATA ANALYSIS ===
Total samples: 1457
Class distribution:
category
happy           1134
not-relevant     194
angry             56
surprise          35
sad               32
disgust            6
Name: count, dtype: int64

Missing values: 0
Empty texts: 0
After cleaning: 1457 samples





Class weights for handling imbalance:
  happy: 0.21
  not-relevant: 1.25
  angry: 4.34
  disgust: 40.47
  sad: 7.59
  surprise: 6.94


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



=== OPTIMIZED TRAINING STARTING ===
Batch size: 12, Learning rate: 2e-05, Epochs: 15
Warmup steps: 154, Weight decay: 0.01


Epoch 1:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 1 Summary:
  Train Loss: 1.7585 | Train Acc: 0.3382
  Val F1: 0.6866 | Val Acc: 0.6758
  → New best model saved! (F1: 0.6866)


Epoch 2:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 2 Summary:
  Train Loss: 1.4751 | Train Acc: 0.6756
  Val F1: 0.8221 | Val Acc: 0.8311
  → New best model saved! (F1: 0.8221)


Epoch 3:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 3 Summary:
  Train Loss: 1.1732 | Train Acc: 0.8220
  Val F1: 0.8378 | Val Acc: 0.8356
  → New best model saved! (F1: 0.8378)


Epoch 4:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 4 Summary:
  Train Loss: 0.8218 | Train Acc: 0.8770
  Val F1: 0.8510 | Val Acc: 0.8493
  → New best model saved! (F1: 0.8510)


Epoch 5:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 5 Summary:
  Train Loss: 0.5865 | Train Acc: 0.9328
  Val F1: 0.8690 | Val Acc: 0.8721
  → New best model saved! (F1: 0.8690)


Epoch 6:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 6 Summary:
  Train Loss: 0.3941 | Train Acc: 0.9571
  Val F1: 0.8850 | Val Acc: 0.8904
  → New best model saved! (F1: 0.8850)


Epoch 7:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 7 Summary:
  Train Loss: 0.2940 | Train Acc: 0.9660
  Val F1: 0.8766 | Val Acc: 0.8858
  → No improvement (1/4)


Epoch 8:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 8 Summary:
  Train Loss: 0.2246 | Train Acc: 0.9822
  Val F1: 0.8851 | Val Acc: 0.8995
  → New best model saved! (F1: 0.8851)


Epoch 9:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 9 Summary:
  Train Loss: 0.1365 | Train Acc: 0.9879
  Val F1: 0.8858 | Val Acc: 0.8904
  → New best model saved! (F1: 0.8858)


Epoch 10:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 10 Summary:
  Train Loss: 0.1236 | Train Acc: 0.9903
  Val F1: 0.8733 | Val Acc: 0.8767
  → No improvement (1/4)


Epoch 11:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 11 Summary:
  Train Loss: 0.0965 | Train Acc: 0.9935
  Val F1: 0.8627 | Val Acc: 0.8676
  → No improvement (2/4)


Epoch 12:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 12 Summary:
  Train Loss: 0.0829 | Train Acc: 0.9943
  Val F1: 0.8686 | Val Acc: 0.8721
  → No improvement (3/4)


Epoch 13:   0%|          | 0/103 [00:00<?, ?it/s]


Epoch 13 Summary:
  Train Loss: 0.0629 | Train Acc: 0.9951
  Val F1: 0.8675 | Val Acc: 0.8676
  → No improvement (4/4)
Early stopping at epoch 13

Best F1 Score: 0.8858

CLASS-WISE ACCURACY
Class: happy           Accuracy: 0.977 (167/171)
Class: not-relevant    Accuracy: 0.655 (19/29)
Class: angry           Accuracy: 0.750 (6/8)
Class: disgust         Accuracy: 0.000 (0/1)
Class: sad             Accuracy: 0.000 (0/5)
Class: surprise        Accuracy: 0.600 (3/5)

TRAINING SUMMARY
Epoch  1: Train Loss: 1.7585, Val F1: 0.6866
Epoch  2: Train Loss: 1.4751, Val F1: 0.8221
Epoch  3: Train Loss: 1.1732, Val F1: 0.8378
Epoch  4: Train Loss: 0.8218, Val F1: 0.8510
Epoch  5: Train Loss: 0.5865, Val F1: 0.8690
Epoch  6: Train Loss: 0.3941, Val F1: 0.8850
Epoch  7: Train Loss: 0.2940, Val F1: 0.8766
Epoch  8: Train Loss: 0.2246, Val F1: 0.8851
Epoch  9: Train Loss: 0.1365, Val F1: 0.8858
Epoch 10: Train Loss: 0.1236, Val F1: 0.8733
Epoch 11: Train Loss: 0.0965, Val F1: 0.8627
Epoch 12: Train Loss
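
The early-stopping rule in the loop above can be replayed offline on the logged validation F1 values. This is a minimal sketch: the helper name `replay_early_stopping` is hypothetical, and the scores are copied from the per-epoch summaries in the output above.

```python
def replay_early_stopping(val_scores, patience):
    """Replay patience-based early stopping over per-epoch validation scores.

    Returns (best_epoch, best_score, stop_epoch). Epochs are 1-indexed and
    stop_epoch is None if training would have run through every epoch.
    """
    best_score, best_epoch, no_improvement = 0.0, None, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Strict improvement: remember it and reset the patience counter.
            best_score, best_epoch, no_improvement = score, epoch, 0
        else:
            no_improvement += 1
            if no_improvement >= patience:
                return best_epoch, best_score, epoch
    return best_epoch, best_score, None


# Validation F1 per epoch, copied from the log above.
logged_val_f1 = [0.6866, 0.8221, 0.8378, 0.8510, 0.8690, 0.8850, 0.8766,
                 0.8851, 0.8858, 0.8733, 0.8627, 0.8686, 0.8675]

best_epoch, best_score, stop_epoch = replay_early_stopping(logged_val_f1, patience=4)
print(f"best epoch: {best_epoch}, best F1: {best_score}, stopped at epoch: {stop_epoch}")
```

The replay agrees with the log: the best checkpoint comes from epoch 9, and four consecutive epochs without improvement end training at epoch 13.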

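The class weights printed above are consistent with the "balanced" heuristic that scikit-learn's `compute_class_weight` applies: each class gets `n_samples / (n_classes * class_count)`. The sketch below recomputes them in plain Python from the class distribution in the output; the helper name `balanced_class_weights` is an assumption made for illustration.

```python
def balanced_class_weights(class_counts):
    """Return n_samples / (n_classes * count) for every class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Class distribution copied from the "DATA ANALYSIS" output above.
counts = {"happy": 1134, "not-relevant": 194, "angry": 56,
          "surprise": 35, "sad": 32, "disgust": 6}

weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

With only 6 `disgust` tweets out of 1457, that class is weighted roughly 190 times more heavily than `happy`, so a single misclassified `disgust` example can dominate a batch's loss.
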
In [None]:
# =====================================
# 🚀 Optimized BERT Fine-Tuning for Sentiment Analysis (with Evaluation)
# =====================================
import torch
import numpy as np
import pandas as pd
import random, re
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup, DataCollatorWithPadding
)
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Dataset
from torch.optim import AdamW
from tqdm.auto import tqdm
from torch.cuda.amp import GradScaler, autocast

# --------------------
# 1️⃣ Setup & Reproducibility
# --------------------
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using:", device)

# --------------------
# 2️⃣ Data Preparation
# --------------------
df = pd.read_csv('./smileannotationsfinal.csv', names=['id', 'text', 'category']).set_index('id')

# Clean categories
df = df[~df.category.str.contains(r'\|', regex=True)]
df = df[df.category != 'nocode']

# Label encoding
label_dict = {cat: idx for idx, cat in enumerate(df.category.unique())}
df['label'] = df.category.map(label_dict)  # map avoids the downcasting FutureWarning raised by replace

# Clean text
def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r"http\S+|@\w+|#\w+", "", text)
        return re.sub(r"\s+", " ", text).strip()
    return "[EMPTY]"

df['text'] = df['text'].apply(clean_text)
df = df[df['text'].str.strip() != ""]

# Train/Val Split
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=seed_val,
    stratify=df.label.values
)

df['data_type'] = 'not_set'
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

# --------------------
# 3️⃣ Tokenization (Dynamic Padding)
# --------------------
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def encode_data(texts):
    return tokenizer(
        texts.tolist(),
        truncation=True,
        padding=False,     # dynamic padding handled by DataCollator
        max_length=192
    )

train_enc = encode_data(df[df.data_type=='train'].text)
val_enc   = encode_data(df[df.data_type=='val'].text)

# Custom Dataset for dynamic padding
class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = SentimentDataset(train_enc, df[df.data_type=='train'].label.values)
val_dataset   = SentimentDataset(val_enc, df[df.data_type=='val'].label.values)

# --------------------
# 4️⃣ Model Setup
# --------------------
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_dict),
    hidden_dropout_prob=0.2,
    attention_probs_dropout_prob=0.2
).to(device)

# Class weights (computed for reference; the loop below trains on outputs.loss, which is unweighted)
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(df.label.values),
    y=df.label.values
)
class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)

# --------------------
# 5️⃣ Training Configuration
# --------------------
epochs = 10
batch_size = 16
gradient_accumulation_steps = 2
lr = 2e-5
warmup_ratio = 0.1
weight_decay = 0.01

train_dataloader = DataLoader(
    train_dataset,
    sampler=RandomSampler(train_dataset),
    batch_size=batch_size,
    collate_fn=data_collator
)
val_dataloader = DataLoader(
    val_dataset,
    sampler=SequentialSampler(val_dataset),
    batch_size=batch_size,
    collate_fn=data_collator
)

optimizer = AdamW(model.parameters(), lr=lr, eps=1e-8, weight_decay=weight_decay)
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(total_steps * warmup_ratio),
    num_training_steps=total_steps
)
scaler = GradScaler()

# --------------------
# 6️⃣ Training Loop
# --------------------
best_f1 = 0
patience, no_improve = 3, 0

for epoch in range(1, epochs+1):
    model.train()
    total_loss, total_correct, total_samples = 0, 0, 0
    progress = tqdm(train_dataloader, desc=f"Epoch {epoch}")

    for step, batch in enumerate(progress):
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()

        with autocast():
            outputs = model(**batch)
            loss = outputs.loss
            logits = outputs.logits

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        total_correct += (preds == batch['labels']).sum().item()
        total_samples += len(batch['labels'])

        progress.set_postfix(loss=f"{loss.item():.3f}", acc=f"{total_correct/total_samples:.3f}")

        # Note: gradients are cleared and the optimizer steps on every batch,
        # so gradient_accumulation_steps has no effect in this loop.

    avg_train_loss = total_loss / len(train_dataloader)
    avg_train_acc = total_correct / total_samples

    # --- Validation ---
    model.eval()
    preds_all, true_all = [], []
    with torch.no_grad():
        for batch in val_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            preds_all.extend(torch.argmax(outputs.logits, dim=1).cpu().numpy())
            true_all.extend(batch['labels'].cpu().numpy())

    val_f1 = f1_score(true_all, preds_all, average='weighted')
    val_acc = (np.array(preds_all) == np.array(true_all)).mean()

    print(f"\nEpoch {epoch} Summary:")
    print(f"Train Loss={avg_train_loss:.4f}, Train Acc={avg_train_acc:.4f}")
    print(f"Val F1={val_f1:.4f}, Val Acc={val_acc:.4f}")

    if val_f1 > best_f1:
        best_f1 = val_f1
        no_improve = 0
        model.save_pretrained("best_bert_sentiment")
        tokenizer.save_pretrained("best_bert_sentiment")
        print("✅ New best model saved!")
    else:
        no_improve += 1
        print(f"No improvement ({no_improve}/{patience})")
        if no_improve >= patience:
            print("Early stopping!")
            break

# --------------------
# 7️⃣ Evaluation Helpers
# --------------------
def evaluate_simple(dataloader):
    """Runs model evaluation and returns predictions + true labels"""
    model.eval()
    preds_all, true_all = [], []
    with torch.no_grad():
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            preds_all.extend(torch.argmax(outputs.logits, dim=1).cpu().numpy())
            true_all.extend(batch['labels'].cpu().numpy())
    return np.array(preds_all), np.array(true_all)


def accuracy_per_class(preds, true, label_dict=label_dict):
    """Prints accuracy per class with readable labels"""
    for label, idx in label_dict.items():
        idxs = np.where(true == idx)
        acc = (preds[idxs] == true[idxs]).mean() * 100
        print(f"Label: {label:<20} | Accuracy: {acc:.2f}% ({len(idxs[0])} samples)")


# --------------------
# 8️⃣ Final Evaluation
# --------------------
print("\n🔥 Final Evaluation on Validation Set")
model = AutoModelForSequenceClassification.from_pretrained("best_bert_sentiment").to(device)
predictions, true_vals = evaluate_simple(val_dataloader)

print("\nClassification Report:")
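# zero_division=0 reports 0.0 for classes that were never predicted instead of emitting warnings
print(classification_report(true_vals, predictions,
                            target_names=list(label_dict.keys()), zero_division=0))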

print("\nConfusion Matrix:")
print(confusion_matrix(true_vals, predictions))

print("\nPer-Class Accuracy:")
accuracy_per_class(predictions, true_vals)


Using: cuda


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 1 Summary:
Train Loss=1.1969, Train Acc=0.5860
Val F1=0.6785, Val Acc=0.7763
✅ New best model saved!


Epoch 2:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 2 Summary:
Train Loss=0.5732, Train Acc=0.8200
Val F1=0.7841, Val Acc=0.8265
✅ New best model saved!


Epoch 3:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 3 Summary:
Train Loss=0.4051, Train Acc=0.8789
Val F1=0.8696, Val Acc=0.8858
✅ New best model saved!


Epoch 4:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 4 Summary:
Train Loss=0.2990, Train Acc=0.9144
Val F1=0.8514, Val Acc=0.8721
No improvement (1/3)


Epoch 5:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 5 Summary:
Train Loss=0.2250, Train Acc=0.9322
Val F1=0.8888, Val Acc=0.9041
✅ New best model saved!


Epoch 6:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 6 Summary:
Train Loss=0.1619, Train Acc=0.9532
Val F1=0.8971, Val Acc=0.9087
✅ New best model saved!


Epoch 7:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 7 Summary:
Train Loss=0.1303, Train Acc=0.9645
Val F1=0.8964, Val Acc=0.9087
No improvement (1/3)


Epoch 8:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 8 Summary:
Train Loss=0.0924, Train Acc=0.9750
Val F1=0.8956, Val Acc=0.9087
No improvement (2/3)


Epoch 9:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 9 Summary:
Train Loss=0.0734, Train Acc=0.9806
Val F1=0.8859, Val Acc=0.8995
No improvement (3/3)
Early stopping!

🔥 Final Evaluation on Validation Set

Classification Report:
              precision    recall  f1-score   support

       happy       0.95      0.96      0.96       170
not-relevant       0.76      0.76      0.76        29
       angry       0.82      1.00      0.90         9
     disgust       0.00      0.00      0.00         1
         sad       0.00      0.00      0.00         5
    surprise       0.62      1.00      0.77         5

    accuracy                           0.91       219
   macro avg       0.53      0.62      0.56       219
weighted avg       0.89      0.91      0.90       219


Confusion Matrix:
[[163   5   0   0   0   2]
 [  5  22   1   0   0   1]
 [  0   0   9   0   0   0]
 [  0   1   0   0   0   0]
 [  3   1   1   0   0   0]
 [  0   0   0   0   0   5]]

Per-Class Accuracy:
Label: happy                | Accuracy: 95.88% (170 samples)
Label: not


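The aggregate scores in the classification report above can be recomputed directly from the confusion matrix, which makes it easier to see why the weighted F1 (about 0.90) sits so far above the macro F1 (about 0.56). This is a sketch with the matrix copied from the output, assuming rows are true labels and columns are predictions, as `sklearn.metrics.confusion_matrix` returns them.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted).
labels = ["happy", "not-relevant", "angry", "disgust", "sad", "surprise"]
cm = np.array([
    [163,  5, 0, 0, 0, 2],
    [  5, 22, 1, 0, 0, 1],
    [  0,  0, 9, 0, 0, 0],
    [  0,  1, 0, 0, 0, 0],
    [  3,  1, 1, 0, 0, 0],
    [  0,  0, 0, 0, 0, 5],
])

tp = np.diag(cm)             # correctly classified samples per class
support = cm.sum(axis=1)     # true samples per class
predicted = cm.sum(axis=0)   # predicted samples per class

recall = tp / support
# F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (predicted + support).
# The denominator is non-zero here because every class has support > 0.
f1 = 2 * tp / (predicted + support)

accuracy = tp.sum() / cm.sum()
macro_f1 = f1.mean()                                 # every class counts equally
weighted_f1 = (f1 * support).sum() / support.sum()   # classes count by support

for name, r, f in zip(labels, recall, f1):
    print(f"{name:<13} recall={r:.2f}  f1={f:.2f}")
print(f"accuracy={accuracy:.4f}  macro_f1={macro_f1:.4f}  weighted_f1={weighted_f1:.4f}")
```

Because `happy` supplies 170 of the 219 validation samples, the weighted average is dominated by that one class, while the macro average exposes that `disgust` and `sad` are never predicted correctly.
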

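The training loop in the cell above defines `gradient_accumulation_steps` but still steps the optimizer on every batch. For comparison, here is a minimal sketch of what gradient accumulation usually looks like in PyTorch. The toy linear model, the random data, and the names `toy_model`, `toy_optimizer`, and `micro_batches` are assumptions made for illustration, not part of the notebook.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins (hypothetical): a linear classifier and eight micro-batches of 2 samples.
toy_model = torch.nn.Linear(4, 3)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()
micro_batches = [(torch.randn(2, 4), torch.randint(0, 3, (2,))) for _ in range(8)]

accumulation_steps = 2
optimizer_steps = 0

toy_optimizer.zero_grad()
for step, (features, targets) in enumerate(micro_batches):
    # Divide by accumulation_steps so the summed gradients form an average.
    loss = loss_fn(toy_model(features), targets) / accumulation_steps
    loss.backward()  # gradients add up in .grad until zero_grad() is called

    # Only update the weights once every accumulation_steps micro-batches.
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(toy_model.parameters(), 1.0)
        toy_optimizer.step()
        toy_optimizer.zero_grad()
        optimizer_steps += 1

print(f"{len(micro_batches)} micro-batches -> {optimizer_steps} optimizer steps")
```

If this pattern were adopted in the real loop, the learning-rate scheduler would need to count optimizer steps rather than batches, and under mixed precision the `scaler.step(optimizer)` and `scaler.update()` calls would move inside the same `if` block.
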
In [None]:
# ============================================
# 🔍 Multi-Model Fine-Tuning for Sentiment Analysis
# Compare: BERT, RoBERTa, DeBERTa, DistilBERT, Twitter-RoBERTa
# ============================================
import torch
import numpy as np
import pandas as pd
import random, re, time
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup, DataCollatorWithPadding
)
from torch.optim import AdamW
from tqdm.auto import tqdm
from torch.cuda.amp import GradScaler, autocast

# ---------------------------
# 1️⃣ Config & Reproducibility
# ---------------------------
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"[INFO] Using device: {device}")

# ---------------------------
# 2️⃣ Dataset Preparation
# ---------------------------
df = pd.read_csv('./smileannotationsfinal.csv', names=['id', 'text', 'category']).set_index('id')

df = df[~df.category.str.contains(r'\|', regex=True)]
df = df[df.category != 'nocode']

label_dict = {cat: idx for idx, cat in enumerate(df.category.unique())}
df['label'] = df.category.map(label_dict)  # map avoids the downcasting FutureWarning raised by replace

def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r"http\S+|@\w+|#\w+", "", text)
        return re.sub(r"\s+", " ", text).strip()
    return "[EMPTY]"

df['text'] = df['text'].apply(clean_text)
df = df[df['text'].str.strip() != ""]

X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=seed_val,
    stratify=df.label.values
)

df['data_type'] = 'not_set'
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

# ---------------------------
# 3️⃣ Dataset class
# ---------------------------
class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item

    def __len__(self):
        return len(self.labels)

# ---------------------------
# 4️⃣ Train/Eval Utilities
# ---------------------------
def train_and_eval(model_name):
    print(f"\n🚀 Training model: {model_name}")
    start_time = time.time()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    def encode_data(texts):
        return tokenizer(texts.tolist(), truncation=True, padding=False, max_length=192)

    train_enc = encode_data(df[df.data_type == 'train'].text)
    val_enc = encode_data(df[df.data_type == 'val'].text)

    train_dataset = SentimentDataset(train_enc, df[df.data_type == 'train'].label.values)
    val_dataset   = SentimentDataset(val_enc, df[df.data_type == 'val'].label.values)

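    # DistilBERT's config names its dropout settings `dropout` / `attention_dropout`,
    # so the BERT-style keyword arguments are only passed to the other models.
    if "distilbert" in model_name:
        dropout_kwargs = {"dropout": 0.2, "attention_dropout": 0.2}
    else:
        dropout_kwargs = {"hidden_dropout_prob": 0.2, "attention_probs_dropout_prob": 0.2}

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(label_dict),
        # Lets a checkpoint whose classification head has a different size
        # (e.g. a 3-class sentiment model) load with a freshly initialized head.
        ignore_mismatched_sizes=True,
        **dropout_kwargs
    ).to(device)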

    # Class weights are computed for reference only; outputs.loss below is unweighted.
    class_weights = compute_class_weight(
        class_weight='balanced',
        classes=np.unique(df.label.values),
        y=df.label.values
    )
    class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)

    batch_size = 16
    epochs = 5
    lr = 2e-5
    warmup_ratio = 0.1
    weight_decay = 0.01
    gradient_accumulation_steps = 2  # defined for parity with the previous cell; not used by the loop below

    train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset),
                              batch_size=batch_size, collate_fn=data_collator)
    val_loader = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset),
                            batch_size=batch_size, collate_fn=data_collator)

    optimizer = AdamW(model.parameters(), lr=lr, eps=1e-8, weight_decay=weight_decay)
    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(total_steps * warmup_ratio),
        num_training_steps=total_steps
    )
    scaler = GradScaler()

    best_f1 = 0
    patience, no_improve = 2, 0

    for epoch in range(1, epochs + 1):
        model.train()
        total_loss, correct, total = 0, 0, 0
        loop = tqdm(train_loader, desc=f"[{model_name}] Epoch {epoch}")
        for step, batch in enumerate(loop):
            batch = {k: v.to(device) for k, v in batch.items()}
            optimizer.zero_grad()

            with autocast():
                outputs = model(**batch)
                loss = outputs.loss
                logits = outputs.logits

            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()

            total_loss += loss.item()
            preds = torch.argmax(logits, dim=1)
            correct += (preds == batch['labels']).sum().item()
            total += len(batch['labels'])
            loop.set_postfix(loss=f"{loss.item():.3f}", acc=f"{correct/total:.3f}")

        train_loss = total_loss / len(train_loader)
        train_acc = correct / total

        # ---- Validation ----
        model.eval()
        preds_all, true_all = [], []
        with torch.no_grad():
            for batch in val_loader:
                batch = {k: v.to(device) for k, v in batch.items()}
                outputs = model(**batch)
                preds_all.extend(torch.argmax(outputs.logits, dim=1).cpu().numpy())
                true_all.extend(batch['labels'].cpu().numpy())

        val_f1 = f1_score(true_all, preds_all, average='weighted')
        val_acc = (np.array(preds_all) == np.array(true_all)).mean()

        print(f"\nEpoch {epoch}: TrainLoss={train_loss:.4f}, TrainAcc={train_acc:.4f}, "
              f"ValF1={val_f1:.4f}, ValAcc={val_acc:.4f}")

        if val_f1 > best_f1:
            best_f1 = val_f1
            no_improve = 0
            model.save_pretrained(f"best_{model_name.replace('/', '_')}")
            tokenizer.save_pretrained(f"best_{model_name.replace('/', '_')}")
            print("✅ Saved new best model")
        else:
            no_improve += 1
            if no_improve >= patience:
                print("Early stopping!")
                break

    duration = time.time() - start_time
    print(f"⏱ Training time: {duration/60:.2f} min")

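    # Final report: uses the predictions from the last completed epoch,
    # which is not necessarily the saved best checkpoint.
    preds_all, true_all = np.array(preds_all), np.array(true_all)
    report = classification_report(true_all, preds_all, target_names=list(label_dict.keys()),
                                   digits=4, zero_division=0)
    print(report)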

    return {
        "model": model_name,
        "best_f1": best_f1,
        "val_acc": val_acc,
        "train_acc": train_acc,
        "duration_min": round(duration / 60, 2)
    }

# ---------------------------
# 5️⃣ Run all experiments
# ---------------------------
models_to_try = [
    "bert-base-uncased",
    "roberta-base",
    "microsoft/deberta-v3-base",
    "distilbert-base-uncased",
    "cardiffnlp/twitter-roberta-base-sentiment"
]

results = []
for model_name in models_to_try:
    try:
        results.append(train_and_eval(model_name))
    except Exception as e:
        print(f"❌ Failed for {model_name}: {e}")

# ---------------------------
# 6️⃣ Compare all results
# ---------------------------
df_results = pd.DataFrame(results)
df_results = df_results.sort_values(by="best_f1", ascending=False)
print("\n🏆 Final Model Comparison:")
print(df_results)
df_results.to_csv("bert_experiment_results.csv", index=False)


[INFO] Using device: cuda

🚀 Training model: bert-base-uncased





Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[bert-base-uncased] Epoch 1:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 1: TrainLoss=1.0769, TrainAcc=0.6368, ValF1=0.7176, ValAcc=0.7945
✅ Saved new best model


[bert-base-uncased] Epoch 2:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 2: TrainLoss=0.5398, TrainAcc=0.8224, ValF1=0.7803, ValAcc=0.8265
✅ Saved new best model


[bert-base-uncased] Epoch 3:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 3: TrainLoss=0.3922, TrainAcc=0.8773, ValF1=0.8500, ValAcc=0.8676
✅ Saved new best model


[bert-base-uncased] Epoch 4:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 4: TrainLoss=0.2892, TrainAcc=0.9120, ValF1=0.8689, ValAcc=0.8858
✅ Saved new best model


[bert-base-uncased] Epoch 5:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 5: TrainLoss=0.2567, TrainAcc=0.9322, ValF1=0.8781, ValAcc=0.8950
✅ Saved new best model
⏱ Training time: 1.74 min
              precision    recall  f1-score   support

       happy     0.9171    0.9765    0.9459       170
not-relevant     0.7826    0.6207    0.6923        29
       angry     0.8182    1.0000    0.9000         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.7500    0.6000    0.6667         5

    accuracy                         0.8950       219
   macro avg     0.5447    0.5329    0.5341       219
weighted avg     0.8663    0.8950    0.8781       219


🚀 Training model: roberta-base





Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[roberta-base] Epoch 1:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 1: TrainLoss=1.1518, TrainAcc=0.6271, ValF1=0.6785, ValAcc=0.7763
✅ Saved new best model


[roberta-base] Epoch 2:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 2: TrainLoss=0.6005, TrainAcc=0.8160, ValF1=0.7895, ValAcc=0.8356
✅ Saved new best model


[roberta-base] Epoch 3:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 3: TrainLoss=0.4737, TrainAcc=0.8547, ValF1=0.8410, ValAcc=0.8630
✅ Saved new best model


[roberta-base] Epoch 4:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 4: TrainLoss=0.3939, TrainAcc=0.8725, ValF1=0.8552, ValAcc=0.8767
✅ Saved new best model


[roberta-base] Epoch 5:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 5: TrainLoss=0.3383, TrainAcc=0.8959, ValF1=0.8628, ValAcc=0.8813
✅ Saved new best model
⏱ Training time: 1.78 min
              precision    recall  f1-score   support

       happy     0.9126    0.9824    0.9462       170
not-relevant     0.8421    0.5517    0.6667        29
       angry     0.5714    0.8889    0.6957         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.6667    0.4000    0.5000         5

    accuracy                         0.8813       219
   macro avg     0.4988    0.4705    0.4681       219
weighted avg     0.8586    0.8813    0.8628       219


🚀 Training model: microsoft/deberta-v3-base





Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[microsoft/deberta-v3-base] Epoch 1:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 1: TrainLoss=1.2267, TrainAcc=0.5803, ValF1=0.6785, ValAcc=0.7763
✅ Saved new best model


[microsoft/deberta-v3-base] Epoch 2:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 2: TrainLoss=0.6001, TrainAcc=0.8192, ValF1=0.7880, ValAcc=0.8356
✅ Saved new best model


[microsoft/deberta-v3-base] Epoch 3:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 3: TrainLoss=0.4788, TrainAcc=0.8652, ValF1=0.8069, ValAcc=0.8447
✅ Saved new best model


[microsoft/deberta-v3-base] Epoch 4:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 4: TrainLoss=0.3987, TrainAcc=0.8757, ValF1=0.8054, ValAcc=0.8493


[microsoft/deberta-v3-base] Epoch 5:   0%|          | 0/78 [00:00<?, ?it/s]




Epoch 5: TrainLoss=0.3641, TrainAcc=0.8830, ValF1=0.8104, ValAcc=0.8493
✅ Saved new best model
⏱ Training time: 2.35 min
              precision    recall  f1-score   support

       happy     0.9081    0.9882    0.9465       170
not-relevant     0.5294    0.6207    0.5714        29
       angry     0.0000    0.0000    0.0000         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.0000    0.0000    0.0000         5

    accuracy                         0.8493       219
   macro avg     0.2396    0.2682    0.2530       219
weighted avg     0.7750    0.8493    0.8104       219


🚀 Training model: distilbert-base-uncased




tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

❌ Failed for distilbert-base-uncased: DistilBertForSequenceClassification.__init__() got an unexpected keyword argument 'hidden_dropout_prob'

🚀 Training model: cardiffnlp/twitter-roberta-base-sentiment


config.json:   0%|          | 0.00/747 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

❌ Failed for cardiffnlp/twitter-roberta-base-sentiment: Error(s) in loading state_dict for Linear:
	size mismatch for weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([6, 768]).

🏆 Final Model Comparison:
                       model   best_f1   val_acc  train_acc  duration_min
0          bert-base-uncased  0.878119  0.894977   0.932203          1.74
1               roberta-base  0.862758  0.881279   0.895884          1.78
2  microsoft/deberta-v3-base  0.810378  0.849315   0.882970          2.35
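
The gap between `weighted avg` and `macro avg` in the reports above comes from class imbalance: the weighted F1 is dominated by the 170 `happy` samples, while the macro F1 gives the single `disgust` sample the same say as every other class. A minimal sketch, recomputing both averages from the per-class F1 and support columns of the `roberta-base` report (the variable names are illustrative and not part of the notebook):

```python
# Per-class (f1, support) copied from the roberta-base classification report.
report = {
    "happy": (0.9462, 170),
    "not-relevant": (0.6667, 29),
    "angry": (0.6957, 9),
    "disgust": (0.0000, 1),
    "sad": (0.0000, 5),
    "surprise": (0.5000, 5),
}

total_support = sum(n for _, n in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(f"macro F1    = {macro_f1:.4f}")
print(f"weighted F1 = {weighted_f1:.4f}")
```

Up to rounding, the two values match the `macro avg` and `weighted avg` rows of that report. Since the `best_f1` column is a weighted F1, a model can rank well there while never predicting `sad` or `disgust` at all.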


In [None]:
# ============================================
# 🔍 Multi-Model Fine-Tuning for Sentiment Analysis
# Compatible with PyTorch ≥ 2.3
# ============================================
import torch, numpy as np, pandas as pd, random, re, time
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report
from sklearn.utils.class_weight import compute_class_weight
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup, DataCollatorWithPadding
)
from torch.optim import AdamW
from tqdm.auto import tqdm
from torch.amp import GradScaler, autocast  # ✅ correct modern import

# ---------------------------
# 1️⃣ Config & Reproducibility
# ---------------------------
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"[INFO] Using device: {device}")

# ---------------------------
# 2️⃣ Dataset Preparation
# ---------------------------
df = pd.read_csv('./smileannotationsfinal.csv', names=['id', 'text', 'category']).set_index('id')
df = df[~df.category.str.contains(r'\|', regex=True)]
df = df[df.category != 'nocode']

label_dict = {cat: idx for idx, cat in enumerate(df.category.unique())}
df['label'] = df.category.map(label_dict)  # map() avoids the pandas FutureWarning that replace() raises here

def clean_text(t):
    if isinstance(t, str):
        t = re.sub(r"http\S+|@\w+|#\w+", "", t)
        return re.sub(r"\s+", " ", t).strip()
    return "[EMPTY]"

df['text'] = df['text'].apply(clean_text)
df = df[df['text'].str.strip() != ""]

X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=seed_val,
    stratify=df.label.values
)
df['data_type'] = 'not_set'
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

# ---------------------------
# 3️⃣ Dataset class
# ---------------------------
class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item

    def __len__(self):
        return len(self.labels)

# ---------------------------
# 4️⃣ Train/Eval Function
# ---------------------------
def train_and_eval(model_name):
    print(f"\n🚀 Training model: {model_name}")
    start_time = time.time()

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    def encode(texts):
        return tokenizer(texts.tolist(), truncation=True, padding=False, max_length=192)

    train_enc = encode(df[df.data_type == 'train'].text)
    val_enc   = encode(df[df.data_type == 'val'].text)
    train_dataset = SentimentDataset(train_enc, df[df.data_type == 'train'].label.values)
    val_dataset   = SentimentDataset(val_enc, df[df.data_type == 'val'].label.values)

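    # Skip the BERT-style dropout overrides for DistilBERT, BERTweet and cardiffnlp checkpoints
    # (DistilBERT rejects `hidden_dropout_prob`, as the earlier failed run shows)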
    model_kwargs = dict(num_labels=len(label_dict))
    if not any(n in model_name for n in ["distilbert", "bertweet", "cardiffnlp"]):
        model_kwargs.update({
            "hidden_dropout_prob": 0.2,
            "attention_probs_dropout_prob": 0.2
        })

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        ignore_mismatched_sizes=True,
        **model_kwargs
    ).to(device)

    if "bertweet" in model_name:
        tokenizer.do_lower_case = False

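    # NOTE: these weights are computed but never passed to a loss function;
    # outputs.loss in the loop below is the model's default, unweighted cross-entropy.
    class_weights = compute_class_weight(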
        class_weight='balanced',
        classes=np.unique(df.label.values),
        y=df.label.values
    )
    class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)

    batch_size, epochs, lr = 16, 5, 2e-5
    warmup_ratio, weight_decay = 0.1, 0.01
    grad_accum = 2  # defined but not applied: the loop below steps the optimizer on every batch

    train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset),
                              batch_size=batch_size, collate_fn=data_collator)
    val_loader = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset),
                            batch_size=batch_size, collate_fn=data_collator)

    optimizer = AdamW(model.parameters(), lr=lr, eps=1e-8, weight_decay=weight_decay)
    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer, int(total_steps * warmup_ratio), total_steps
    )
    scaler = GradScaler("cuda", enabled=(device.type == "cuda"))  # loss scaling is a no-op when not on CUDA

    best_f1, patience, no_improve = 0, 2, 0

    for epoch in range(1, epochs + 1):
        model.train()
        total_loss, correct, total = 0, 0, 0
        loop = tqdm(train_loader, desc=f"[{model_name}] Epoch {epoch}")
        for step, batch in enumerate(loop):
            batch = {k: v.to(device) for k, v in batch.items()}
            optimizer.zero_grad()

            # Mixed precision on CUDA only; autocast is disabled when running on CPU
            with autocast(device_type=device.type, enabled=(device.type == "cuda")):
                outputs = model(**batch)
                loss = outputs.loss
                logits = outputs.logits

            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()

            total_loss += loss.item()
            preds = torch.argmax(logits, dim=1)
            correct += (preds == batch['labels']).sum().item()
            total += len(batch['labels'])
            loop.set_postfix(loss=f"{loss.item():.3f}", acc=f"{correct/total:.3f}")

        train_loss, train_acc = total_loss / len(train_loader), correct / total

        # ---- Validation ----
        model.eval()
        preds_all, true_all = [], []
        with torch.no_grad():
            for batch in val_loader:
                batch = {k: v.to(device) for k, v in batch.items()}
                outputs = model(**batch)
                preds_all.extend(torch.argmax(outputs.logits, dim=1).cpu().numpy())
                true_all.extend(batch['labels'].cpu().numpy())

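        # zero_division=0 keeps the score at 0 for never-predicted classes without emitting warnings
        val_f1 = f1_score(true_all, preds_all, average='weighted', zero_division=0)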
        val_acc = (np.array(preds_all) == np.array(true_all)).mean()

        print(f"\nEpoch {epoch}: TrainLoss={train_loss:.4f}, TrainAcc={train_acc:.4f}, "
              f"ValF1={val_f1:.4f}, ValAcc={val_acc:.4f}")

        if val_f1 > best_f1:
            best_f1, no_improve = val_f1, 0
            save_dir = f"best_{model_name.replace('/', '_')}"
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)
            print("✅ Saved new best model")
        else:
            no_improve += 1
            if no_improve >= patience:
                print("Early stopping!")
                break

    dur = (time.time() - start_time) / 60
    print(f"⏱ Training time: {dur:.2f} min")

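    # Report is for the last epoch's predictions; zero_division=0 silences the undefined-metric warnings
    print(classification_report(true_all, preds_all, target_names=list(label_dict.keys()),
                                digits=4, zero_division=0))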
    return {"model": model_name, "best_f1": best_f1, "val_acc": val_acc,
            "train_acc": train_acc, "duration_min": round(dur, 2)}

# ---------------------------
# 5️⃣ Run Experiments
# ---------------------------
models = [
    "bert-base-uncased",
    "roberta-base",
    "microsoft/deberta-v3-base",
    "distilbert-base-uncased",
    "vinai/bertweet-base"
]

results = []
for m in models:
    try:
        results.append(train_and_eval(m))
    except Exception as e:
        print(f"❌ Failed for {m}: {e}")

# ---------------------------
# 6️⃣ Comparison
# ---------------------------
df_results = pd.DataFrame(results)
if not df_results.empty:
    df_results = df_results.sort_values(by="best_f1", ascending=False)
    print("\n🏆 Final Model Comparison:")
    print(df_results)
    df_results.to_csv("bert_experiment_results.csv", index=False)
else:
    print("⚠️ No successful runs recorded. Check earlier logs for errors.")


[INFO] Using device: cuda

🚀 Training model: bert-base-uncased


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[bert-base-uncased] Epoch 1:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 1: TrainLoss=1.0769, TrainAcc=0.6368, ValF1=0.7176, ValAcc=0.7945
✅ Saved new best model


[bert-base-uncased] Epoch 2:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 2: TrainLoss=0.5398, TrainAcc=0.8224, ValF1=0.7803, ValAcc=0.8265
✅ Saved new best model


[bert-base-uncased] Epoch 3:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 3: TrainLoss=0.3922, TrainAcc=0.8773, ValF1=0.8500, ValAcc=0.8676
✅ Saved new best model


[bert-base-uncased] Epoch 4:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 4: TrainLoss=0.2892, TrainAcc=0.9120, ValF1=0.8689, ValAcc=0.8858
✅ Saved new best model


[bert-base-uncased] Epoch 5:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 5: TrainLoss=0.2567, TrainAcc=0.9322, ValF1=0.8781, ValAcc=0.8950
✅ Saved new best model
⏱ Training time: 2.66 min
              precision    recall  f1-score   support

       happy     0.9171    0.9765    0.9459       170
not-relevant     0.7826    0.6207    0.6923        29
       angry     0.8182    1.0000    0.9000         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.7500    0.6000    0.6667         5

    accuracy                         0.8950       219
   macro avg     0.5447    0.5329    0.5341       219
weighted avg     0.8663    0.8950    0.8781       219


🚀 Training model: roberta-base


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[roberta-base] Epoch 1:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 1: TrainLoss=1.1518, TrainAcc=0.6271, ValF1=0.6785, ValAcc=0.7763
✅ Saved new best model


[roberta-base] Epoch 2:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 2: TrainLoss=0.6005, TrainAcc=0.8160, ValF1=0.7895, ValAcc=0.8356
✅ Saved new best model


[roberta-base] Epoch 3:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 3: TrainLoss=0.4737, TrainAcc=0.8547, ValF1=0.8410, ValAcc=0.8630
✅ Saved new best model


[roberta-base] Epoch 4:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 4: TrainLoss=0.3939, TrainAcc=0.8725, ValF1=0.8552, ValAcc=0.8767
✅ Saved new best model


[roberta-base] Epoch 5:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 5: TrainLoss=0.3383, TrainAcc=0.8959, ValF1=0.8628, ValAcc=0.8813
✅ Saved new best model
⏱ Training time: 2.19 min
              precision    recall  f1-score   support

       happy     0.9126    0.9824    0.9462       170
not-relevant     0.8421    0.5517    0.6667        29
       angry     0.5714    0.8889    0.6957         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.6667    0.4000    0.5000         5

    accuracy                         0.8813       219
   macro avg     0.4988    0.4705    0.4681       219
weighted avg     0.8586    0.8813    0.8628       219


🚀 Training model: microsoft/deberta-v3-base


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[microsoft/deberta-v3-base] Epoch 1:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 1: TrainLoss=1.2267, TrainAcc=0.5803, ValF1=0.6785, ValAcc=0.7763
✅ Saved new best model


[microsoft/deberta-v3-base] Epoch 2:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 2: TrainLoss=0.6001, TrainAcc=0.8192, ValF1=0.7880, ValAcc=0.8356
✅ Saved new best model


[microsoft/deberta-v3-base] Epoch 3:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 3: TrainLoss=0.4788, TrainAcc=0.8652, ValF1=0.8069, ValAcc=0.8447
✅ Saved new best model


[microsoft/deberta-v3-base] Epoch 4:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 4: TrainLoss=0.3987, TrainAcc=0.8757, ValF1=0.8054, ValAcc=0.8493


[microsoft/deberta-v3-base] Epoch 5:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 5: TrainLoss=0.3641, TrainAcc=0.8830, ValF1=0.8104, ValAcc=0.8493
✅ Saved new best model
⏱ Training time: 3.93 min
              precision    recall  f1-score   support

       happy     0.9081    0.9882    0.9465       170
not-relevant     0.5294    0.6207    0.5714        29
       angry     0.0000    0.0000    0.0000         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.0000    0.0000    0.0000         5

    accuracy                         0.8493       219
   macro avg     0.2396    0.2682    0.2530       219
weighted avg     0.7750    0.8493    0.8104       219


🚀 Training model: distilbert-base-uncased


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[distilbert-base-uncased] Epoch 1:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 1: TrainLoss=1.1180, TrainAcc=0.6513, ValF1=0.7592, ValAcc=0.8174
✅ Saved new best model


[distilbert-base-uncased] Epoch 2:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 2: TrainLoss=0.4752, TrainAcc=0.8588, ValF1=0.8102, ValAcc=0.8447
✅ Saved new best model


[distilbert-base-uncased] Epoch 3:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 3: TrainLoss=0.3270, TrainAcc=0.9031, ValF1=0.8292, ValAcc=0.8539
✅ Saved new best model


[distilbert-base-uncased] Epoch 4:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 4: TrainLoss=0.2419, TrainAcc=0.9314, ValF1=0.8671, ValAcc=0.8767
✅ Saved new best model


[distilbert-base-uncased] Epoch 5:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 5: TrainLoss=0.1966, TrainAcc=0.9508, ValF1=0.8613, ValAcc=0.8767
⏱ Training time: 1.30 min
              precision    recall  f1-score   support

       happy     0.9106    0.9588    0.9341       170
not-relevant     0.6786    0.6552    0.6667        29
       angry     0.8889    0.8889    0.8889         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.6667    0.4000    0.5000         5

    accuracy                         0.8767       219
   macro avg     0.5241    0.4838    0.4983       219
weighted avg     0.8485    0.8767    0.8613       219


🚀 Training model: vinai/bertweet-base


emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at vinai/bertweet-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[vinai/bertweet-base] Epoch 1:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 1: TrainLoss=1.1506, TrainAcc=0.6659, ValF1=0.7263, ValAcc=0.7991
✅ Saved new best model


[vinai/bertweet-base] Epoch 2:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 2: TrainLoss=0.6149, TrainAcc=0.8273, ValF1=0.8118, ValAcc=0.8539
✅ Saved new best model


[vinai/bertweet-base] Epoch 3:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 3: TrainLoss=0.4580, TrainAcc=0.8733, ValF1=0.8113, ValAcc=0.8447


[vinai/bertweet-base] Epoch 4:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 4: TrainLoss=0.3505, TrainAcc=0.8935, ValF1=0.8575, ValAcc=0.8813
✅ Saved new best model


[vinai/bertweet-base] Epoch 5:   0%|          | 0/78 [00:00<?, ?it/s]


Epoch 5: TrainLoss=0.2957, TrainAcc=0.9128, ValF1=0.8483, ValAcc=0.8676
⏱ Training time: 1.88 min
              precision    recall  f1-score   support

       happy     0.9371    0.9647    0.9507       170
not-relevant     0.6333    0.6552    0.6441        29
       angry     0.5000    0.7778    0.6087         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.0000    0.0000    0.0000         5

    accuracy                         0.8676       219
   macro avg     0.3451    0.3996    0.3672       219
weighted avg     0.8319    0.8676    0.8483       219


🏆 Final Model Comparison:
                       model   best_f1   val_acc  train_acc  duration_min
0          bert-base-uncased  0.878119  0.894977   0.932203          2.66
3    distilbert-base-uncased  0.867086  0.876712   0.950767          1.30
1               roberta-base  0.862758  0.881279   0.895884          2.19
4        vinai/bertweet-base  0.857
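
The code cell calls `compute_class_weight(class_weight='balanced', ...)`, which for each class returns `n_samples / (n_classes * count)`. As a rough illustration of how skewed the result is on this label distribution, the sketch below applies that formula to the validation support counts from the reports above. This is an assumption made for illustration: the notebook computes the weights over the full DataFrame, so its actual values will differ.

```python
# Validation support per class, copied from the classification reports.
support = {"happy": 170, "not-relevant": 29, "angry": 9,
           "disgust": 1, "sad": 5, "surprise": 5}

n_samples = sum(support.values())
n_classes = len(support)

# 'balanced' heuristic: n_samples / (n_classes * class_count)
balanced = {name: n_samples / (n_classes * count) for name, count in support.items()}

for name, weight in balanced.items():
    print(f"{name:>12}: {weight:.3f}")
```

Weights like these only influence training if they are passed to the loss, for example through the `weight` argument of `torch.nn.CrossEntropyLoss`. The `outputs.loss` used in the training loop is unweighted.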

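
The scheduler in the code cell is sized as `len(train_loader) * epochs`, with 10% of those steps used for warmup. A small sketch of that arithmetic, using the 78 batches per epoch shown in the progress bars. The `schedule_steps` helper is hypothetical, and the second call only illustrates how the counts would change if `grad_accum = 2` were actually applied (the training loop as written steps the optimizer on every batch).

```python
import math

def schedule_steps(batches_per_epoch, epochs, warmup_ratio, grad_accum=1):
    """Return (total optimizer updates, warmup updates) for a linear warmup schedule."""
    # With accumulation, one optimizer update happens every `grad_accum` batches.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    total = updates_per_epoch * epochs
    return total, int(total * warmup_ratio)

no_accum = schedule_steps(78, 5, 0.1)                  # one update per batch
with_accum = schedule_steps(78, 5, 0.1, grad_accum=2)  # one update per two batches

print(no_accum)
print(with_accum)
```

If accumulation were added to the loop, the scheduler would need the smaller step count, otherwise the learning rate would only decay about halfway before training ends.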
