# Simple bag-of-words baseline for emotion classification (Task 1)

Authors: Christine de Kock

## Introduction

In this starter notebook, we will take you through the process of emotion classification from text. The notebook was adapted from a notebook for SemEval 2024 Shared Task 1: SemRel.

## Imports

In [30]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import recall_score, precision_score, f1_score

import torch 
from torch import nn
from torch import optim

## Data Import

The training data consists of a short text and binary labels representing human judgments of the emotions in the text. 

The data is structured as a CSV file with the following fields:
- id: a unique identifier for the sample
- text: a sentence or short text
- 6 binary fields representing emotion annotations: joy, fear, anger, sadness, surprise

The data is multilabel, meaning that more than one of the emotion classes may apply to a given text. 

Below we will show you how to load and re-format the provided data file.

In [31]:
def get_data(dir= 'data', split= 'train', track= 'a', language= 'ptbr'):
    
    archive = language + '.csv' if split == 'train' else language + '_' + track + '.csv'

    path = f'{dir}/{split}/track_{track}/{archive}'
    
    return pd.read_csv(path)

In [32]:
# Load the training and validation data

train = get_data(split= 'train', track= 'a', language= 'eng')
val = get_data(split= 'dev', track= 'a', language= 'ptbr')

train.head()

Unnamed: 0,id,text,Anger,Fear,Joy,Sadness,Surprise
0,eng_train_track_a_00001,But not very happy.,0,0,1,1,0
1,eng_train_track_a_00002,Well she's not gon na last the whole song like...,0,0,1,0,0
2,eng_train_track_a_00003,She sat at her Papa's recliner sofa only to mo...,0,0,0,0,0
3,eng_train_track_a_00004,"Yes, the Oklahoma city bombing.",1,1,0,1,1
4,eng_train_track_a_00005,They were dancing to Bolero.,0,0,1,0,0


In [33]:
val.head()

Unnamed: 0,id,text,Anger,Disgust,Fear,Joy,Sadness,Surprise
0,ptbr_dev_track_a_00001,ele me passou o numero dele dps vamos sair geral,,,,,,
1,ptbr_dev_track_a_00002,"Eu n sou o com quem vc conversava, só fui most...",,,,,,
2,ptbr_dev_track_a_00003,"Gol cagado de bate rebate, numa falta mequetre...",,,,,,
3,ptbr_dev_track_a_00004,"puta merda eu também, toda vez que eu vejo to ...",,,,,,
4,ptbr_dev_track_a_00005,"Sempre irão aparecer pessoas assim, mas cabe a...",,,,,,


Ter a validação zerada nos traz problemas, pois torna impossível avaliar o modelo, já que não temos os labels, assim, vamos amostrar uma validação do conjunto de treino

In [34]:
validation_fraction = 0.1

val = train.sample(frac=validation_fraction, random_state=42)
train = train.drop(val.index)

print('Tamanho dataset Validação', len(val))
print('Tamanho dataset de Treino', len(train))
val.head()

Tamanho dataset Validação 277
Tamanho dataset de Treino 2491


Unnamed: 0,id,text,Anger,Fear,Joy,Sadness,Surprise
1378,eng_train_track_a_01379,I smoke weed alone I have a tendency to become...,0,0,1,0,0
839,eng_train_track_a_00840,Nothing but fine grey and tan sand as far as m...,0,0,0,1,0
2164,eng_train_track_a_02165,After an evening there we were driving back be...,0,1,0,1,0
2619,eng_train_track_a_02620,"It never freaked me out, because it happened e...",0,1,0,0,0
927,eng_train_track_a_00928,Only damage done was scarring and a broken col...,0,1,0,1,0


In [35]:
train.head()

Unnamed: 0,id,text,Anger,Fear,Joy,Sadness,Surprise
0,eng_train_track_a_00001,But not very happy.,0,0,1,1,0
1,eng_train_track_a_00002,Well she's not gon na last the whole song like...,0,0,1,0,0
2,eng_train_track_a_00003,She sat at her Papa's recliner sofa only to mo...,0,0,0,0,0
3,eng_train_track_a_00004,"Yes, the Oklahoma city bombing.",1,1,0,1,1
4,eng_train_track_a_00005,They were dancing to Bolero.,0,0,1,0,0


## Bag-of-words representation

In this tutorial, we use a simple bag-of-words representation to obtain a vector for each text. This vector can then be fed into a machine learning model. More advanced models, including LSTMs and transformers, operate on text directly and to not require the vectorisation step. 

### Preprocessing 
We choose to unigrams (that is, individual words) and bigrams (two-word sequences). Texts are lowercased before being vectorised. 

Further preprocessing steps may include: 
- stopword removal,
- TFIDF normalisation,
- lemmatisation / stemming, or
- using a different tokeniser.

In [36]:
vectorizer = CountVectorizer(ngram_range=(1,2))
X_train = vectorizer.fit_transform(train['text'].str.lower()).toarray()
X_val = vectorizer.transform(val['text'].str.lower()).toarray()

emotions = ['Joy','Sadness','Surprise','Fear','Anger']
y_train = train[emotions].values
y_val = val[emotions].values

Finally, we cast the transformed vectors to PyTorch tensors.

In [37]:
X_train_t = torch.Tensor(X_train)
y_train_t = torch.Tensor(y_train)

X_val_t = torch.Tensor(X_val)
y_val_t = torch.Tensor(y_val)

## Characteristics of the data

Statistics of the data are printed below. There are 2768 samples in the training data. The input representation consists of 29001 input features and there are 5 output clsees. There is an imbalance in the dataset, with the "fear" class being assigned to 58% of samples but the "anger" class to only 12%. 

(Due to the multilabel nature of the data, the percentages do not sum to 1.)

In [38]:
print(f'Shape of X: {X_train.shape}')
print(f'Shape of y: {y_train.shape}')
print(f'Number of positives per emotion class:')
_ = [print(f' - {e}: {v} ({round(100*v/len(y_train))}%)') for e,v in zip(emotions, y_train.sum(axis=0))]

Shape of X: (2491, 26858)
Shape of y: (2491, 5)
Number of positives per emotion class:
 - Joy: 606 (24%)
 - Sadness: 791 (32%)
 - Surprise: 754 (30%)
 - Fear: 1461 (59%)
 - Anger: 304 (12%)


## Define the model 

We define a simple neural network model with 1 hidden layer for this task. This can be made arbitrarily more complex, eg. by experimenting with the types of inputs and layers, layer size, depth and regularisation. 

In [39]:
model = nn.Sequential(
          nn.Linear(X_train.shape[1], 100),
          nn.ReLU(),
          nn.Dropout(0.3),
          nn.Linear(100, y_train.shape[1])
        )

## Define the optimisation parameters

 To perform multilabel classification, we use binary cross entropy with logits. See [here](https://discuss.pytorch.org/t/is-there-an-example-for-multi-class-multilabel-classification-in-pytorch/53579/6) for a motivation. Here, one can explore different optimizers, regularisation levels, learning rates, etc.

In [40]:
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)

## Run the training loop

In [41]:
# Train for a set number of epochs
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(X_train_t)
    loss = criterion(output, y_train_t)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f'Epoch {epoch}: Loss: {round(loss.item(),3)}')

Epoch 0: Loss: 0.701
Epoch 100: Loss: 0.592
Epoch 200: Loss: 0.572
Epoch 300: Loss: 0.566
Epoch 400: Loss: 0.564
Epoch 500: Loss: 0.562
Epoch 600: Loss: 0.559
Epoch 700: Loss: 0.559
Epoch 800: Loss: 0.557
Epoch 900: Loss: 0.555


## Get class predictions

The model outputs logits to coordinate with the BCE. We use a sigmoid transformation to obtain a score in the range of (0,1). We need to define a classification threshold to obtain a binary prediction from the real-valued model output. This can be determined based on the validation data, and may be different for each emotion. Given the imbalance in the data, we set it slightly lower than 0.5 (the standard).

In [42]:
def get_predictions(X_val, model, threshold=0.5):
    sig = nn.Sigmoid() 
    yhat = sig(model(X_val)).detach().numpy()
    y_pred = yhat > threshold
    
    return y_pred

In [43]:
y_pred = get_predictions(X_val_t, model, 0.45)

## Evaluate

We evaluate the model based on the micro- and macro-averaged F1 score. The former aggregates the metrics at the per-sample level, whereas the latter does it at the per-class level. 

In [44]:
def evaluate(y_val, y_pred):
    for average in ['micro', 'macro']:
        recall = recall_score(y_val, y_pred, average=average, zero_division=0)
        precision = precision_score(y_val, y_pred, average=average, zero_division=0)
        f1 = f1_score(y_val, y_pred, average=average, zero_division=0)
    
        print(f'{average.upper()} recall: {round(recall, 4)}, precision: {round(precision, 4)}, f1: {round(f1, 4)}')

In [45]:
evaluate(y_val, y_pred)

MICRO recall: 0.358, precision: 0.5415, f1: 0.431
MACRO recall: 0.2, precision: 0.1083, f1: 0.1405


Micro: Calcula as métricas considerando todas as instâncias igualmente.


Macro: Calcula as métricas para cada classe individualmente e depois faz a média.


The results show that the macro-averaged F1 is much lower than the micro-averaged score. This indicates that the model might be performing poorly on some of the classes. Below, we evaluate each class separately.

In [46]:
def evaluate_per_class(y_val, y_pred):
    for i, emotion in enumerate(emotions):
        print(f'*** {emotion} ***')
    
        recall = recall_score(y_val[:,i], y_pred[:,i], zero_division=0)
        precision = precision_score(y_val[:,i], y_pred[:,i], zero_division=0)
        f1 = f1_score(y_val[:,i], y_pred[:,i], zero_division=0)
        
        print(f'recall: {round(recall, 4)}, precision: {round(precision, 4)}, f1: {round(f1, 4)}\n')

In [47]:
evaluate_per_class(y_val, y_pred)

*** Joy ***
recall: 0.0, precision: 0.0, f1: 0.0

*** Sadness ***
recall: 0.0, precision: 0.0, f1: 0.0

*** Surprise ***
recall: 0.0, precision: 0.0, f1: 0.0

*** Fear ***
recall: 1.0, precision: 0.5415, f1: 0.7026

*** Anger ***
recall: 0.0, precision: 0.0, f1: 0.0



## Weighing classes

We can see that the model performs well on the "fear" class, which is the most common, but poorly on all others, classifying all samples as negative. One way to address this is by assigning weights to the different classes to increase the effect of samples from rare classes. For example, the below snippet can be used to calculate weights based on their relative frequency.

In [48]:
weights = y_train.sum(axis=0)/y_train.sum()
weights = max(weights)/weights

These weights can then be assigned to the loss function for training. 

In [49]:
# Define model 
model = nn.Sequential(
          nn.Linear(X_train.shape[1], 100),
          nn.ReLU(),
          nn.Dropout(0.3),
          nn.Linear(100, y_train.shape[1])
        )

# Define training parameters
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor(weights)) # <-- weights assigned to optimiser
optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)

# Train for a number of epochs
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(X_train_t)
    loss = criterion(output, y_train_t)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f'Epoch {epoch}: Loss: {round(loss.item(),3)}')

# Get predictions
y_pred = get_predictions(X_val_t, model, 0.45)

# Evaluate
print('\n\nEVALUATION\n')
evaluate(y_val, y_pred)

print('\nPER CLASS BREAKDOWN\n')
evaluate_per_class(y_val, y_pred)

Epoch 0: Loss: 0.887
Epoch 100: Loss: 0.865
Epoch 200: Loss: 0.858
Epoch 300: Loss: 0.851
Epoch 400: Loss: 0.845
Epoch 500: Loss: 0.839
Epoch 600: Loss: 0.834
Epoch 700: Loss: 0.828
Epoch 800: Loss: 0.821
Epoch 900: Loss: 0.814


EVALUATION

MICRO recall: 0.7232, precision: 0.4329, f1: 0.5416
MACRO recall: 0.6047, precision: 0.3711, f1: 0.4571

PER CLASS BREAKDOWN

*** Joy ***
recall: 0.3971, precision: 0.3176, f1: 0.3529

*** Sadness ***
recall: 0.6437, precision: 0.3294, f1: 0.4358

*** Surprise ***
recall: 0.7412, precision: 0.4961, f1: 0.5943

*** Fear ***
recall: 1.0, precision: 0.5415, f1: 0.7026

*** Anger ***
recall: 0.2414, precision: 0.1707, f1: 0.2



Using this approach, we can see that the model performs much better on average (and particularly on the less common classes), even though the final training error is higher.