## Emotion Classification Project - Based on BiLSTM

In [53]:
from datasets import load_dataset
import random
from sklearn.metrics import f1_score
from tokenizers import Tokenizer, models, trainers
from tokenizers.pre_tokenizers import Whitespace
import torch.nn as nn
import torch
from torch.nn.utils.rnn import pad_sequence

## 1. Load the emotion classification dataset
Use the `datasets` library to load the public `dair-ai/emotion` dataset. It contains 6 emotion labels and is already split into training, validation, and test sets.

In [54]:
dataset = load_dataset("dair-ai/emotion", "split")
train_data = dataset["train"]
validation_data = dataset["validation"]
test_data = dataset["test"]

In [55]:
labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]

## 2. Text preprocessing (Tokenization)
Before text can be fed into the neural network, it must be converted into a sequence of integer IDs by a `tokenizer`. The cells below train a BPE tokenizer on the training text, define the preprocessing functions, and apply them to each split.

Reference: [Omseeth's Blog - CNN Sentiment Analysis (2024)](https://omseeth.github.io/blog/2024/CNN_sentiment_analysis)
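
To make the conversion concrete, here is a minimal sketch of mapping text to a fixed-length ID sequence. It uses a hypothetical whitespace vocabulary (`toy_vocab`) and helper (`encode`) rather than the BPE tokenizer trained in this notebook, so the IDs are illustrative only.

```python
# Toy illustration: text -> fixed-length list of integer IDs.
# `toy_vocab` and `encode` are hypothetical; the notebook itself uses a trained BPE tokenizer.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "feel": 3, "happy": 4, "sad": 5}


def encode(text: str, vocab: dict, length: int) -> list:
    """Split on whitespace, map tokens to IDs, then truncate or pad to `length`."""
    ids = [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]
    ids = ids[:length]                               # truncation
    ids += [vocab["[PAD]"]] * (length - len(ids))    # padding
    return ids


print(encode("I feel happy", toy_vocab, length=5))              # [2, 3, 4, 0, 0]
print(encode("i feel so very sad today", toy_vocab, length=5))  # [2, 3, 1, 1, 5]
```

Unknown words map to `[UNK]`, short inputs are filled with `[PAD]`, and long inputs are cut off, which mirrors the padding and truncation settings used in the next cell.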

In [56]:
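# Tokenization
vocab_n = 5000
sequence_len = 64

# Initialize a tokenizer using BPE (Byte Pair Encoding).
# [UNK] stands in for symbols never seen during training.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Register [PAD] and [UNK] as special tokens so they receive IDs 0 and 1.
# Otherwise "[PAD]" is missing from the vocabulary and padding ID 0 collides with a real token.
tokenizer_trainer = trainers.BpeTrainer(vocab_size=vocab_n, special_tokens=["[PAD]", "[UNK]"])
tokenizer.train_from_iterator(train_data["text"], trainer=tokenizer_trainer)

# Pad or truncate every sequence to exactly sequence_len tokens
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]", length=sequence_len)
tokenizer.enable_truncation(max_length=sequence_len)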

In [57]:
def preprocess_text(text: str, tokenizer: Tokenizer):
    return torch.tensor(tokenizer.encode(text).ids)


def preprocess_label(label: int):
    return torch.tensor(label)


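def preprocess(data: dict, tokenizer: Tokenizer):
    """Convert a dataset split into a list of (input_ids, label) tensor pairs."""
    instances = []

    for text, label in zip(data["text"], data["label"]):
        input_ids = preprocess_text(text, tokenizer)  # avoid shadowing the built-in input()
        label = preprocess_label(label)

        instances.append((input_ids, label))

    return instances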

In [58]:
train_instances = preprocess(train_data, tokenizer)
val_instances = preprocess(validation_data, tokenizer)
test_instances = preprocess(test_data, tokenizer)

## 3. Batching
The BiLSTM is trained on mini-batches, so the preprocessed instances are grouped into batches whose inputs share one length. This lets each batch be stacked into a single tensor for efficient training.

Reference: [Omseeth's Blog - CNN Sentiment Analysis (2024)](https://omseeth.github.io/blog/2024/CNN_sentiment_analysis)
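
As a rough illustration of the grouping step, the sketch below chunks a list into consecutive batches. `make_batches` is a hypothetical helper, not part of the notebook; the 2,000 items match the validation-set size shown in the classification report at the end.

```python
import math


def make_batches(items: list, batch_size: int) -> list:
    """Chunk `items` into consecutive slices of at most `batch_size` elements."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


items = list(range(2000))  # stand-in for 2,000 preprocessed instances
batches = make_batches(items, batch_size=32)

print(len(batches), math.ceil(len(items) / 32))  # 63 63
print(len(batches[0]), len(batches[-1]))         # 32 16 -> the last batch is smaller
```

The last batch holds only the leftover instances, so any code that consumes the batches should not assume a fixed batch size.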

In [59]:
# Batching for LSTM input

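def batching_lstm(instances: list, batch_size: int, shuffle: bool):
    """Group (input_ids, label) pairs into (inputs, labels) batch tensors."""
    if shuffle:
        # Shuffle a copy so the caller's list is left untouched
        instances = random.sample(instances, len(instances))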

    batches = []

    for i in range(0, len(instances), batch_size):
        batch = instances[i : i + batch_size]

        # Split the batch into inputs and labels
        batch_inputs = [item[0] for item in batch]  # list of tensors, each of shape [seq_len]
        batch_labels = torch.stack([item[1] for item in batch])  # tensor of shape [batch_size]

        # Pad to a common length -> [batch_size, max_seq_len].
        # The tokenizer already pads to sequence_len, so this is only a safeguard;
        # padding_value=0 assumes the padding token has ID 0.
        padded_inputs = pad_sequence(batch_inputs, batch_first=True, padding_value=0)

        batches.append((padded_inputs, batch_labels))

    return batches

## 4. Building the BiLSTM emotion classification model

In this project, we build a simple `BiLSTM` (bidirectional long short-term memory network) to classify the emotion expressed in a text.

### Model structure:

- **Input layer**: Receives the sequence of token IDs produced by the tokenizer.

- **Embedding layer**: Maps each token ID to a fixed-dimensional vector, which is learned during training.

- **BiLSTM layer**: Reads the sequence in both directions to capture contextual dependencies; the final hidden states of the two directions are concatenated.

- **Dropout layer**: Randomly zeroes 30% of the concatenated features during training to reduce overfitting.

- **Fully connected layer**: Maps the concatenated hidden state to one score (logit) per emotion category.

- **Output**: The model returns raw logits. The softmax is applied implicitly by `CrossEntropyLoss` during training, and `argmax` over the logits gives the predicted category.
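
In the code that follows, `forward` returns the output of the linear layer directly, so turning those scores into probabilities is a separate step. A minimal pure-Python sketch of that step, with made-up scores for the six categories:

```python
import math

labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
logits = [0.2, 2.5, 1.0, -0.5, -1.0, 0.1]  # hypothetical scores for one sentence


def softmax(scores: list) -> list:
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
predicted = labels[probs.index(max(probs))]

print(predicted)             # joy
print(round(sum(probs), 6))  # 1.0
```

Because softmax preserves the ordering of the scores, taking `argmax` over the logits yields the same prediction, which is what the training loop does.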

In [60]:
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, padding_idx):
        super(LSTMClassifier, self).__init__()
        
        # Word embedding layer, randomly initialized
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim, padding_idx=padding_idx)

        # Single-layer bidirectional LSTM
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=1,
            bidirectional=True,
            batch_first=True
        )

        # Dropout layer
        self.dropout = nn.Dropout(0.3)

        # Fully connected layer, output 6 types of emotions
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, x):
        # x shape: [batch_size, seq_len]
        embedded = self.embedding(x)  # [batch_size, seq_len, embedding_dim]

        output, (hidden, _) = self.lstm(embedded)  # hidden shape: [2, batch_size, hidden_dim]

        # Take the final hidden states of the forward and backward directions and concatenate them
        hidden_forward = hidden[-2, :, :]  # [batch_size, hidden_dim]
        hidden_backward = hidden[-1, :, :]  # [batch_size, hidden_dim]
        combined = torch.cat((hidden_forward, hidden_backward), dim=1)  # [batch, hidden_dim * 2]

        out = self.dropout(combined)
        return self.fc(out)  # Output shape: [batch_size, output_dim]
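
One limitation of this design is worth noting: because every input is padded to a fixed length, the forward direction's final hidden state is computed after the LSTM has also read the padding positions. A possible variation, which is not what this notebook does, is to pack the sequences so that the LSTM skips the padding. The sketch below assumes the padding ID is 0 and uses small hypothetical sizes.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

PAD_ID = 0  # assumption: the padding token has ID 0

embedding = nn.Embedding(num_embeddings=20, embedding_dim=8, padding_idx=PAD_ID)
lstm = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)

# Two padded sequences with 3 and 5 real tokens
x = torch.tensor([[4, 7, 2, 0, 0],
                  [5, 3, 9, 6, 1]])

# Count the non-padding tokens per sequence (at least 1, since packing rejects length 0)
lengths = (x != PAD_ID).sum(dim=1).clamp(min=1)

packed = pack_padded_sequence(embedding(x), lengths.cpu(), batch_first=True, enforce_sorted=False)
_, (hidden, _) = lstm(packed)  # hidden: [2, batch_size, hidden_size], in the original batch order

combined = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(combined.shape)  # torch.Size([2, 32])
```

The concatenated tensor has the same shape as `combined` in `LSTMClassifier.forward`, so the rest of the model would not need to change.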
    

In [61]:
# Get the vocabulary dictionary from the tokenizer (word → ID)
word2idx = tokenizer.get_vocab()  # e.g., {'i': 4, 'love': 5, 'this': 6, ...}

# Inverse mapping (ID → token)
idx2word = {idx: word for word, idx in word2idx.items()}

vocab_size = len(word2idx)
padding_idx = word2idx.get("[PAD]", 0)  # falls back to 0 if "[PAD]" is not in the vocabulary

## 5. Training and Validation Loop

- **Model training**: switch the model to training mode with `model.train()` and optimize its parameters on the training set, one epoch at a time.

- **Model validation**: switch the model to evaluation mode with `model.eval()`. After each epoch, measure performance on the validation set without updating the model parameters.

- **Record and print results**: print the Train Loss, Train Accuracy, Train F1, Val Accuracy, and Val F1 for each epoch.
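
For reference, the two reported metrics can be computed by hand. The sketch below is a pure-Python version of accuracy and support-weighted F1 on a tiny made-up label set; the notebook itself relies on `accuracy_score` and `f1_score(..., average='weighted')` from scikit-learn.

```python
from collections import Counter


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true: list, y_pred: list) -> float:
    """Per-class F1, averaged with each class weighted by its number of true samples."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * n
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]  # made-up labels
y_pred = [0, 0, 1, 1, 1, 0]

print(round(accuracy(y_true, y_pred), 4))     # 0.6667
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6
```

Weighting by support means that frequent classes dominate the score, which is why the weighted F1 tends to track accuracy closely on an imbalanced dataset.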

In [62]:
from sklearn.metrics import accuracy_score, f1_score


def train_and_evaluate(model, train_batches, val_batches, num_epochs=5, lr=1e-3, device="cpu"):
    """Train for num_epochs and print training and validation metrics after each epoch."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        model.train()
        train_losses = []
        all_preds, all_labels = [], []

        for batch_x, batch_y in train_batches:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)

            optimizer.zero_grad()
            outputs = model(batch_x)  # shape: [batch_size, 6]
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()

            train_losses.append(loss.item())
            preds = torch.argmax(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(batch_y.cpu().numpy())

        train_acc = accuracy_score(all_labels, all_preds)
        train_f1 = f1_score(all_labels, all_preds, average='weighted')

        # ---- validation ----
        model.eval()
        val_preds, val_labels = [], []
        with torch.no_grad():
            for val_x, val_y in val_batches:
                val_x, val_y = val_x.to(device), val_y.to(device)
                val_out = model(val_x)
                val_pred = torch.argmax(val_out, dim=1)
                val_preds.extend(val_pred.cpu().numpy())
                val_labels.extend(val_y.cpu().numpy())

        val_acc = accuracy_score(val_labels, val_preds)
        val_f1 = f1_score(val_labels, val_preds, average='weighted')

        print(f"Epoch {epoch+1}/{num_epochs} | "
              f"Train Loss: {sum(train_losses)/len(train_losses):.4f} | "
              f"Train Acc: {train_acc:.4f} | F1: {train_f1:.4f} || "
              f"Val Acc: {val_acc:.4f} | Val F1: {val_f1:.4f}")

## 6. Parameter Tuning
Try different hyperparameters and compare the resulting validation metrics to find the best configuration. The cells below train and evaluate one such configuration.
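
One simple way to organize such a comparison is a small grid search. The sketch below is only an outline: the candidate values are hypothetical, and `evaluate_config` is a placeholder that returns a dummy score. In practice it would build the model, train it, and return a validation metric such as the weighted F1.

```python
from itertools import product

# Hypothetical candidate values; the notebook trains embedding_dim=100, hidden_dim=128, lr=1e-3
grid = {
    "embedding_dim": [50, 100],
    "hidden_dim": [64, 128],
    "lr": [1e-3, 5e-4],
}


def evaluate_config(config: dict) -> float:
    """Placeholder: replace with real training that returns a validation score."""
    return 0.0  # dummy value so the sketch runs without training


results = []
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    results.append((evaluate_config(config), config))

# With real scores, this picks the configuration with the highest validation metric
best_score, best_config = max(results, key=lambda item: item[0])
print(len(results))  # 8 configurations in this grid
```

Each configuration should be scored on the validation set only, so that the test set remains untouched for the final evaluation.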

In [63]:
# Model hyperparameters
model = LSTMClassifier(
    vocab_size=len(word2idx),        # Vocabulary size
    embedding_dim=100,               # Dimension of the token embeddings
    hidden_dim=128,                  # Hidden state size per LSTM direction
    output_dim=6,                    # Number of output categories
    padding_idx=word2idx.get("[PAD]", 0)  # Index of the padding token
)


In [64]:
train_instances = preprocess(dataset["train"], tokenizer)
val_instances = preprocess(dataset["validation"], tokenizer)

# Batches are built once here, so the shuffled training order is reused in every epoch
train_batches = batching_lstm(train_instances, batch_size=32, shuffle=True)
val_batches = batching_lstm(val_instances, batch_size=32, shuffle=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
train_and_evaluate(model, train_batches, val_batches, num_epochs=10, device=device)


Epoch 1/10 | Train Loss: 1.4541 | Train Acc: 0.4280 | F1: 0.3565 || Val Acc: 0.6005 | Val F1: 0.5405
Epoch 2/10 | Train Loss: 0.8679 | Train Acc: 0.7076 | F1: 0.6702 || Val Acc: 0.7635 | Val F1: 0.7425
Epoch 3/10 | Train Loss: 0.5047 | Train Acc: 0.8297 | F1: 0.8215 || Val Acc: 0.8255 | Val F1: 0.8238
Epoch 4/10 | Train Loss: 0.3343 | Train Acc: 0.8871 | F1: 0.8857 || Val Acc: 0.8525 | Val F1: 0.8561
Epoch 5/10 | Train Loss: 0.2218 | Train Acc: 0.9253 | F1: 0.9248 || Val Acc: 0.8690 | Val F1: 0.8715
Epoch 6/10 | Train Loss: 0.1598 | Train Acc: 0.9464 | F1: 0.9462 || Val Acc: 0.8850 | Val F1: 0.8861
Epoch 7/10 | Train Loss: 0.1281 | Train Acc: 0.9561 | F1: 0.9560 || Val Acc: 0.8610 | Val F1: 0.8675
Epoch 8/10 | Train Loss: 0.1004 | Train Acc: 0.9649 | F1: 0.9648 || Val Acc: 0.8875 | Val F1: 0.8880
Epoch 9/10 | Train Loss: 0.0932 | Train Acc: 0.9688 | F1: 0.9687 || Val Acc: 0.8790 | Val F1: 0.8797
Epoch 10/10 | Train Loss: 0.0833 | Train Acc: 0.9711 | F1: 0.9710 || Val Acc: 0.8980 | Val 

In [65]:
from sklearn.metrics import classification_report

# Get the label name order
label_names = dataset["train"].features["label"].names  # ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Run the model on the validation set
true_labels = []
pred_labels = []

model.eval()
with torch.no_grad():
    for x_batch, y_batch in val_batches:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        outputs = model(x_batch)
        predictions = torch.argmax(outputs, dim=1)

        true_labels.extend(y_batch.cpu().numpy())
        pred_labels.extend(predictions.cpu().numpy())

# Print each category's precision, recall, f1-score
print(classification_report(true_labels, pred_labels, target_names=label_names))


              precision    recall  f1-score   support

     sadness       0.92      0.93      0.92       550
         joy       0.93      0.93      0.93       704
        love       0.84      0.72      0.78       178
       anger       0.88      0.88      0.88       275
        fear       0.85      0.89      0.87       212
    surprise       0.80      0.84      0.82        81

    accuracy                           0.90      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.90      0.90      0.90      2000
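
The two summary rows of the report can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted F1 averages from the rounded per-class F1 scores and supports printed above, so the results agree with the report only up to rounding.

```python
# (f1-score, support) per class, copied from the report above
report = {
    "sadness": (0.92, 550),
    "joy": (0.93, 704),
    "love": (0.78, 178),
    "anger": (0.88, 275),
    "fear": (0.87, 212),
    "surprise": (0.82, 81),
}

total_support = sum(n for _, n in report.values())

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(total_support)          # 2000
print(round(macro_f1, 2))     # 0.87
print(round(weighted_f1, 2))  # 0.9
```

The macro average is lower because the two smallest classes, love and surprise, have the weakest F1 scores but count as much as the large classes.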



## References

- [1] Omseeth's Blog - CNN Sentiment Analysis (2024). https://omseeth.github.io/blog/2024/CNN_sentiment_analysis