# Simple bag-of-words baseline for emotion classification (Task 1)

Authors: Christine de Kock

## Introduction

In this starter notebook, we will take you through the process of emotion classification from text. The notebook was adapted from a notebook for SemEval 2024 Shared Task 1: SemRel.

## Imports

In [1]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import recall_score, precision_score, f1_score

import torch 
from torch import nn
from torch import optim

## Data Import

Each sample in the training data consists of a short text and binary labels representing human judgments of the emotions in the text.

The data is structured as a CSV file with the following fields:
- id: a unique identifier for the sample
- text: a sentence or short text
- 5 binary fields representing emotion annotations: Joy, Fear, Anger, Sadness, Surprise

The data is multilabel, meaning that more than one of the emotion classes may apply to a given text. 

Below we will show you how to load and re-format the provided data file.

In [2]:
# Load the training and validation data

train = pd.read_csv('data/eng-data/Track 1/eng_track1_train.csv')
val = pd.read_csv('data/eng-data/Track 1/eng_track1_dev.csv')

train.head()

Unnamed: 0,id,text,Joy,Fear,Anger,Sadness,Surprise
0,eng_train_track1_001,None of us has mentioned the incident since.,0,1,0,1,1
1,eng_train_track1_002,"I was 7 and woke up early, so I went to the ba...",1,0,0,0,0
2,eng_train_track1_003,By that point I felt like someone was stabbing...,0,1,0,0,0
3,eng_train_track1_004,watching her leave with dudes drove me crazy.,0,1,1,1,0
4,eng_train_track1_005,`` My eyes widened.,0,1,0,0,1


## Bag-of-words representation

In this tutorial, we use a simple bag-of-words representation to obtain a vector for each text. This vector can then be fed into a machine learning model. More advanced models, including LSTMs and transformers, operate on the text sequence directly and do not require this vectorisation step.

### Preprocessing 
We choose to use unigrams (that is, individual words) and bigrams (two-word sequences) as features. Texts are lowercased before being vectorised.
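To make the unigram and bigram counting concrete, here is a minimal pure-Python sketch. The helper `ngram_counts` is a hypothetical name introduced for illustration, and the regular expression only approximates the default tokenisation of `CountVectorizer` (tokens of two or more word characters); it is not the library's implementation.

```python
import re
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams in a lowercased text (illustrative helper)."""
    # Keep tokens of two or more word characters, which is roughly
    # what CountVectorizer's default token pattern does.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter()
    smallest, largest = ngram_range
    for n in range(smallest, largest + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("My eyes widened. My eyes!")
for ngram, count in sorted(counts.items()):
    print(f"{ngram!r}: {count}")
```

Note that bigrams are formed across the sentence boundary here ("widened my"), since the text is treated as one token sequence.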

Further preprocessing steps may include: 
- stopword removal,
- TFIDF normalisation,
- lemmatisation / stemming, or
- using a different tokeniser.
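As a sketch of the first two options in the list above, the vectoriser can be swapped for scikit-learn's `TfidfVectorizer` with its built-in English stopword list. The two example texts are taken from the data preview; the parameter choices are assumptions for illustration rather than tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "None of us has mentioned the incident since.",
    "watching her leave with dudes drove me crazy.",
]

# TF-IDF weighting with unigrams and bigrams, dropping English stopwords.
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = tfidf.fit_transform(texts)  # sparse matrix with one row per text

print(X.shape)
print(sorted(tfidf.vocabulary_))
```

In the notebook, this would replace the `CountVectorizer` call, with `fit_transform` applied to the training texts and `transform` to the validation texts as before.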

In [3]:
vectorizer = CountVectorizer(ngram_range=(1,2))
X_train = vectorizer.fit_transform(train['text'].str.lower()).toarray()
X_val = vectorizer.transform(val['text'].str.lower()).toarray()

emotions = ['Joy','Sadness','Surprise','Fear','Anger']
y_train = train[emotions].values
y_val = val[emotions].values

Finally, we cast the transformed vectors to PyTorch tensors.

In [4]:
X_train_t = torch.Tensor(X_train)
y_train_t = torch.Tensor(y_train)

X_val_t = torch.Tensor(X_val)
y_val_t = torch.Tensor(y_val)

## Characteristics of the data

Statistics of the data are printed below. There are 2768 samples in the training data. The input representation consists of 29001 input features and there are 5 output classes. There is an imbalance in the dataset, with the "Fear" class being assigned to 58% of samples but the "Anger" class to only 12%.

(Due to the multilabel nature of the data, the percentages do not sum to 100%.)

In [5]:
print(f'Shape of X: {X_train.shape}')
print(f'Shape of y: {y_train.shape}')
print(f'Number of positives per emotion class:')
_ = [print(f' - {e}: {v} ({round(100*v/len(y_train))}%)') for e,v in zip(emotions, y_train.sum(axis=0))]

Shape of X: (2768, 29001)
Shape of y: (2768, 5)
Number of positives per emotion class:
 - Joy: 674 (24%)
 - Sadness: 878 (32%)
 - Surprise: 839 (30%)
 - Fear: 1611 (58%)
 - Anger: 333 (12%)


## Define the model 

We define a simple neural network model with one hidden layer for this task. This can be made arbitrarily more complex, e.g. by experimenting with the types of inputs and layers, layer size, depth and regularisation.

In [6]:
model = nn.Sequential(
          nn.Linear(X_train.shape[1], 100),
          nn.ReLU(),
          nn.Dropout(0.3),
          nn.Linear(100, y_train.shape[1])
        )
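When experimenting with layer sizes, it can help to keep the number of trainable parameters in mind. The arithmetic below is a small sketch using the shapes reported earlier (29001 input features, 100 hidden units, 5 classes); the variable names are introduced here for illustration.

```python
n_features, n_hidden, n_classes = 29001, 100, 5

# Each linear layer has (inputs x outputs) weights plus one bias per output.
hidden_params = n_features * n_hidden + n_hidden   # 2,900,200
output_params = n_hidden * n_classes + n_classes   # 505
total_params = hidden_params + output_params

print(f"Total trainable parameters: {total_params:,}")  # 2,900,705
```

With 2768 training samples, almost all parameters sit in the first layer, so the vocabulary size and the hidden layer width are the main levers for model capacity.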

## Define the optimisation parameters

To perform multilabel classification, we use binary cross entropy with logits, which treats each emotion as an independent binary decision. See [here](https://discuss.pytorch.org/t/is-there-an-example-for-multi-class-multilabel-classification-in-pytorch/53579/6) for a motivation. Here, one can explore different optimisers, regularisation levels, learning rates, etc.

In [7]:
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)
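To make the loss concrete, the sketch below computes binary cross entropy from logits by hand for a single sample with five labels. It is a simplified illustration, not PyTorch's implementation (which uses a numerically more stable formulation); the function names and toy values are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    """Mean binary cross entropy over all labels, computed from raw logits."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)  # probability that this label applies
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

# Toy example: one text with five emotion labels.
logits = [0.0, 2.0, -1.0, 0.5, -3.0]
targets = [0, 1, 0, 1, 0]
print(round(bce_with_logits(logits, targets), 4))

# When every logit is 0 (probability 0.5 for every label), the loss is ln(2).
print(round(bce_with_logits([0.0] * 5, targets), 4))  # 0.6931
```

This is why a freshly initialised model, whose logits are close to zero, typically starts with a loss near 0.69.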

## Run the training loop

In [8]:
# Train for a set number of epochs
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(X_train_t)
    loss = criterion(output, y_train_t)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f'Epoch {epoch}: Loss: {round(loss.item(),3)}')

Epoch 0: Loss: 0.682
Epoch 100: Loss: 0.592
Epoch 200: Loss: 0.573
Epoch 300: Loss: 0.568
Epoch 400: Loss: 0.565
Epoch 500: Loss: 0.563
Epoch 600: Loss: 0.561
Epoch 700: Loss: 0.56
Epoch 800: Loss: 0.559
Epoch 900: Loss: 0.558


## Get class predictions

The model outputs raw logits, which is what `BCEWithLogitsLoss` expects. We use a sigmoid transformation to obtain a score in the range (0,1). We then need a classification threshold to obtain a binary prediction from the real-valued model output. This threshold can be determined based on the validation data, and may be different for each emotion. Given the imbalance in the data, we set it to 0.45, slightly lower than the standard value of 0.5.

In [9]:
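def get_predictions(X_val, model, threshold=0.5):
    model.eval()  # switch off dropout so that predictions are deterministic
    with torch.no_grad():  # gradients are not needed for inference
        yhat = torch.sigmoid(model(X_val)).numpy()  # scores in (0,1)
    y_pred = yhat > threshold  # binary prediction per emotion

    return y_pred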

In [10]:
y_pred = get_predictions(X_val_t, model, 0.45)
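The idea of choosing a separate threshold per emotion from the validation data could be sketched as follows. The helpers `f1_binary` and `tune_thresholds`, the candidate grid and the toy scores are all assumptions introduced for illustration; in the notebook, `scores` would be the sigmoid outputs on the validation set.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for one class from binary arrays."""
    y_pred = y_pred.astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_thresholds(y_true, scores, candidates=None):
    """For each class, pick the candidate threshold with the highest F1."""
    if candidates is None:
        candidates = np.round(np.arange(0.1, 0.95, 0.05), 2)
    best = []
    for j in range(y_true.shape[1]):
        f1s = [f1_binary(y_true[:, j], scores[:, j] > t) for t in candidates]
        best.append(candidates[int(np.argmax(f1s))])  # first best candidate
    return np.array(best)

# Toy validation data: 4 samples, 2 classes.
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
scores = np.array([[0.42, 0.31], [0.37, 0.11], [0.22, 0.43], [0.12, 0.06]])

thresholds = tune_thresholds(y_true, scores)
print(thresholds)
y_pred_toy = scores > thresholds  # broadcasting applies one threshold per column
```

Thresholds tuned this way are fitted to the validation set, so the resulting scores on that same set will tend to be optimistic.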

## Evaluate

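We evaluate the model based on the micro- and macro-averaged F1 score. The former pools the true positives, false positives and false negatives over all classes before computing the metric, so frequent classes dominate. The latter computes the metric for each class separately and then takes the unweighted mean, so every class counts equally.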
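To illustrate how the two averages can diverge, the sketch below computes both from hypothetical per-class counts of true positives, false positives and false negatives. The counts are made up for illustration: one common class that is always predicted, and one rare class that is never predicted.

```python
def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical (tp, fp, fn) counts per class.
counts = {
    "common": (50, 40, 0),  # predicted for every sample
    "rare": (0, 0, 30),     # never predicted
}

# Macro: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts over all classes, then compute F1 once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1_from_counts(tp, fp, fn)

print(f"micro F1: {micro_f1:.4f}")  # 0.5882
print(f"macro F1: {macro_f1:.4f}")  # 0.3571
```

The macro average is pulled down by the class with an F1 of zero, whereas the micro average is dominated by the common class.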

In [11]:
def evaluate(y_val, y_pred):
    for average in ['micro', 'macro']:
        recall = recall_score(y_val, y_pred, average=average, zero_division=0)
        precision = precision_score(y_val, y_pred, average=average, zero_division=0)
        f1 = f1_score(y_val, y_pred, average=average, zero_division=0)
    
        print(f'{average.upper()} recall: {round(recall, 4)}, precision: {round(precision, 4)}, f1: {round(f1, 4)}')

In [12]:
evaluate(y_val, y_pred)

MICRO recall: 0.358, precision: 0.5431, f1: 0.4315
MACRO recall: 0.2, precision: 0.1086, f1: 0.1408


The results show that the macro-averaged F1 is much lower than the micro-averaged score. This indicates that the model might be performing poorly on some of the classes. Below, we evaluate each class separately.

In [13]:
def evaluate_per_class(y_val, y_pred):
    for i, emotion in enumerate(emotions):
        print(f'*** {emotion} ***')
    
        recall = recall_score(y_val[:,i], y_pred[:,i], zero_division=0)
        precision = precision_score(y_val[:,i], y_pred[:,i], zero_division=0)
        f1 = f1_score(y_val[:,i], y_pred[:,i], zero_division=0)
        
        print(f'recall: {round(recall, 4)}, precision: {round(precision, 4)}, f1: {round(f1, 4)}\n')

In [14]:
evaluate_per_class(y_val, y_pred)

*** Joy ***
recall: 0.0, precision: 0.0, f1: 0.0

*** Sadness ***
recall: 0.0, precision: 0.0, f1: 0.0

*** Surprise ***
recall: 0.0, precision: 0.0, f1: 0.0

*** Fear ***
recall: 1.0, precision: 0.5431, f1: 0.7039

*** Anger ***
recall: 0.0, precision: 0.0, f1: 0.0



## Weighting classes

We can see that the model only ever predicts the "Fear" class, which is the most common. Its recall of 1.0 combined with a precision of 0.5431 suggests that every validation sample is labelled as "Fear", while all samples are classified as negative for the other four classes. One way to address this is by assigning weights to the different classes to increase the effect of positive samples from rare classes. For example, the snippet below can be used to calculate weights based on the relative frequency of each class.

In [15]:
weights = y_train.sum(axis=0)/y_train.sum()
weights = max(weights)/weights
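To see what the snippet above produces, the same calculation can be worked through in plain Python using the positive counts printed earlier. The variable names are introduced here for illustration.

```python
emotion_names = ['Joy', 'Sadness', 'Surprise', 'Fear', 'Anger']
positives = [674, 878, 839, 1611, 333]  # positives per class in the training data

total = sum(positives)
relative_freq = [p / total for p in positives]
# Dividing the largest relative frequency by each class's own frequency
# gives the most common class a weight of 1 and rarer classes larger weights.
class_weights = [max(relative_freq) / f for f in relative_freq]

for name, w in zip(emotion_names, class_weights):
    print(f' - {name}: {w:.2f}')
# - Joy: 2.39
# - Sadness: 1.83
# - Surprise: 1.92
# - Fear: 1.00
# - Anger: 4.84
```

Each weight is simply 1611 divided by the class's own count. A common alternative, not used in this notebook, is the ratio of negative to positive samples per class.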

These weights can then be assigned to the loss function for training. 

In [16]:
# Define model 
model = nn.Sequential(
          nn.Linear(X_train.shape[1], 100),
          nn.ReLU(),
          nn.Dropout(0.3),
          nn.Linear(100, y_train.shape[1])
        )

# Define training parameters
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor(weights)) # <-- weights assigned to the loss function
optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)

# Train for a number of epochs
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(X_train_t)
    loss = criterion(output, y_train_t)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f'Epoch {epoch}: Loss: {round(loss.item(),3)}')

# Get predictions
y_pred = get_predictions(X_val_t, model, 0.45)

# Evaluate
print('\n\nEVALUATION\n')
evaluate(y_val, y_pred)

print('\nPER CLASS BREAKDOWN\n')
evaluate_per_class(y_val, y_pred)

Epoch 0: Loss: 0.878
Epoch 100: Loss: 0.863
Epoch 200: Loss: 0.856
Epoch 300: Loss: 0.851
Epoch 400: Loss: 0.846
Epoch 500: Loss: 0.84
Epoch 600: Loss: 0.835
Epoch 700: Loss: 0.829
Epoch 800: Loss: 0.823
Epoch 900: Loss: 0.817


EVALUATION

MICRO recall: 0.7443, precision: 0.4441, f1: 0.5563
MACRO recall: 0.6461, precision: 0.3981, f1: 0.4785

PER CLASS BREAKDOWN

*** Joy ***
recall: 0.2581, precision: 0.2963, f1: 0.2759

*** Sadness ***
recall: 0.8857, precision: 0.3735, f1: 0.5254

*** Surprise ***
recall: 0.7742, precision: 0.4444, f1: 0.5647

*** Fear ***
recall: 1.0, precision: 0.5431, f1: 0.7039

*** Anger ***
recall: 0.3125, precision: 0.3333, f1: 0.3226



Using this approach, we can see that the model performs much better on average (and particularly on the less common classes), even though the final training loss is higher: the macro-averaged F1 rises from 0.1408 to 0.4785 and the micro-averaged F1 from 0.4315 to 0.5563.
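One caveat when comparing the two training losses: with `pos_weight`, the loss term for every positive label is multiplied by the class weight, so the weighted and unweighted losses are not on the same scale. The sketch below illustrates this for a single label; the function name and the values are assumptions for illustration.

```python
import math

def weighted_bce(p, y, pos_weight=1.0):
    """Binary cross entropy for one label, with the positive term scaled."""
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))

p = 0.3  # predicted probability for one label

# Positive label: the weighted loss is pos_weight times the unweighted loss.
unweighted_pos = weighted_bce(p, 1)
weighted_pos = weighted_bce(p, 1, pos_weight=4.84)
print(round(weighted_pos / unweighted_pos, 2))  # 4.84

# Negative label: the weight has no effect.
print(weighted_bce(p, 0) == weighted_bce(p, 0, pos_weight=4.84))  # True
```

A higher weighted loss therefore does not by itself mean that the model fits the training data worse.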