# Simple bag-of-words baseline for emotion classification (Task 1)

Authors: Christine de Kock

## Introduction

In this starter notebook, we will take you through the process of emotion classification from text. The notebook was adapted from a notebook for SemEval 2024 Shared Task 1: SemRel.

In [18]:
!pip install torch 



In [19]:
!pip install nltk



## Imports

In [20]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import recall_score, precision_score, f1_score

import torch 
from torch import nn
from torch import optim

In [21]:
import os
import re
import sys

import matplotlib.pyplot as plt
import nltk
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer

# The stopword list and the WordNet lemmatizer need these NLTK resources
# (some NLTK versions also need 'omw-1.4')
nltk.download('stopwords')
nltk.download('wordnet')

## Data Import

Each training sample consists of a short text and binary labels representing human judgments of the emotions in the text.

The data is structured as a CSV file with the following fields:
- id: a unique identifier for the sample
- text: a sentence or short text
- 5 binary fields representing emotion annotations: joy, fear, anger, sadness, surprise

The data is multilabel, meaning that more than one of the emotion classes may apply to a given text. 
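As a small illustration of the multilabel format, the sketch below maps label rows back to emotion names. The rows are copied from the `train.head()` output in this notebook; the helper name `active_labels` is our own and not part of the task code.

```python
# Column order as it appears in the CSV header
EMOTIONS = ["Anger", "Fear", "Joy", "Sadness", "Surprise"]

def active_labels(row):
    """Return the names of all emotions flagged with 1 in a label row."""
    return [emotion for emotion, flag in zip(EMOTIONS, row) if flag == 1]

# Label rows copied from the first training samples
examples = {
    "But not very happy.": [0, 0, 1, 1, 0],
    "Yes, the Oklahoma city bombing.": [1, 1, 0, 1, 1],
    "They were dancing to Bolero.": [0, 0, 1, 0, 0],
}

for text, row in examples.items():
    print(text, "->", active_labels(row))
```

A text can carry several labels at once, or none at all, which is why each emotion is treated as its own binary decision later on.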

Below we will show you how to load and re-format the provided data file.

In [24]:
# Load the training and validation data

train = pd.read_csv('../public_data/train/track_a/eng.csv')
val = pd.read_csv('../public_data/dev/track_a/eng_a.csv')

train.head()

| Unnamed: 0 | id | text | Anger | Fear | Joy | Sadness | Surprise |
|---|---|---|---|---|---|---|---|
| 0 | eng_train_track_a_00001 | But not very happy. | 0 | 0 | 1 | 1 | 0 |
| 1 | eng_train_track_a_00002 | Well she's not gon na last the whole song like... | 0 | 0 | 1 | 0 | 0 |
| 2 | eng_train_track_a_00003 | She sat at her Papa's recliner sofa only to mo... | 0 | 0 | 0 | 0 | 0 |
| 3 | eng_train_track_a_00004 | Yes, the Oklahoma city bombing. | 1 | 1 | 0 | 1 | 1 |
| 4 | eng_train_track_a_00005 | They were dancing to Bolero. | 0 | 0 | 1 | 0 | 0 |


In [25]:
# val.head()


## Bag-of-words representation

In this tutorial, we use a simple bag-of-words representation to obtain a vector for each text. This vector can then be fed into a machine learning model. More advanced models, including LSTMs and transformers, operate on text directly and do not require the vectorisation step.
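To make the idea concrete, here is a minimal sketch of a bag-of-words vectoriser in plain Python. It is only an illustration: the function names `build_vocabulary` and `vectorise` are our own, tokenisation is a simple whitespace split, and the notebook itself relies on scikit-learn's `CountVectorizer` instead.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map every distinct lowercased token to a column index (sorted order)."""
    vocab = sorted({token for text in texts for token in text.lower().split()})
    return {token: index for index, token in enumerate(vocab)}

def vectorise(text, vocab):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts.get(token, 0) for token in vocab]

texts = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(texts)
vectors = [vectorise(text, vocab) for text in texts]

print(list(vocab))  # column order of the vectors
print(vectors)      # one count vector per text
```

Word order is discarded: each text becomes a fixed-length vector of counts, with one column per vocabulary entry.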

### Preprocessing 
We choose to use unigrams (that is, individual words) and bigrams (two-word sequences). Texts are lowercased before being vectorised.

Further preprocessing steps may include: 
- stopword removal,
- TFIDF normalisation,
- lemmatisation / stemming, or
- using a different tokeniser.

In [26]:
def pre_process(text, config):
    """ 
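    Applies the preprocessing steps enabled in `config` to a single text.

    Parameters:
    text (string): a line of text (assume sentence segmentation has already been done)
    config (dict): boolean flags: sep_pn, rm_pn, apply_stemming, apply_lemmatization, add_bigrams, rm_sw

    Returns:
    string: the processed tokens joined by single spaces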
    """

    def separate_punctuation(text):
        text = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", text)
        text = re.sub(r"([.,;:!?'\"“\(\)])(\w)", r"\1 \2", text)
        return text

    def remove_punctuation(text):
        text = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1", text)
        text = re.sub(r"([.,;:!?'\"“\(\)])(\w)", r"\2", text)
        return text
        
    def tokenize_text(text):
        tokens = re.split(r"\s+",text)
        tokens = [t.lower() for t in tokens]
        return tokens

    def apply_stemming(tokens):
        stemmer = PorterStemmer()
        stemmed_tokens = [stemmer.stem(token) for token in tokens]
        return stemmed_tokens

    def apply_lemmatization(tokens):
        lemmatizer = WordNetLemmatizer()
        lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
        return lemmatized_tokens

    def generate_ngrams_from_tokens(tokens, n):
        return list(ngrams(tokens, n))

    # Separate Punctuation otherwise Remove it
    
    if config["sep_pn"] and not config["rm_pn"]:
        text = separate_punctuation(text)

    if config["rm_pn"] and not config["sep_pn"]:
        text = remove_punctuation(text)
    
    # tokenize text
    
    tokens = tokenize_text(text)

    # Apply Lemmatization or Stemming

    if config["apply_stemming"]:
        tokens = apply_stemming(tokens)
    if config["apply_lemmatization"]:
        tokens = apply_lemmatization(tokens)

    # Generate bigrams and append them to the unigram tokens
    if config["add_bigrams"]:
        bigrams = generate_ngrams_from_tokens(tokens, 2)
        bg = [i + " " + j for (i,j) in bigrams]
        tokens += bg

    # Remove Stop words
    
    if config["rm_sw"]:
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in tokens if w not in stop_words]

    return " ".join(tokens)

In [40]:
import itertools

def grid_configurations():
    options = [
        "sep_pn", "rm_pn", "apply_lemmatization", "apply_stemming", "add_bigrams", "rm_sw"
    ]
    
    # Generate all combinations of True/False for each option
    combinations = itertools.product([True, False], repeat=len(options))
    
    configurations = []
    
    # Create a dictionary for each combination
    for combo in combinations:
        config = {options[i]: combo[i] for i in range(len(options))}
        configurations.append(config)
    
    return configurations

# Example usage
results = []
configs = grid_configurations()
for _i, config in enumerate(configs):
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform([pre_process(i, config) for i in train["text"]]).toarray()
    X_val = vectorizer.transform([pre_process(i, config) for i in val["text"]]).toarray()

    emotions = ['Joy','Sadness','Surprise','Fear','Anger']
    y_train = train[emotions].values
    y_val = val[emotions].values

    X_train_t = torch.Tensor(X_train)
    y_train_t = torch.Tensor(y_train)

    X_val_t = torch.Tensor(X_val)
    y_val_t = torch.Tensor(y_val)

    model = nn.Sequential(
          nn.Linear(X_train.shape[1], 100),
          nn.ReLU(),
          nn.Dropout(0.3),
          nn.Linear(100, y_train.shape[1])
        )

    weights = y_train.sum(axis=0)/y_train.sum()
    weights = max(weights)/weights
    
    criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor(weights))
    optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)
    
    # criterion = nn.BCEWithLogitsLoss()
    # optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)
    loss = None
    
    # Train for a set number of epochs
    for epoch in range(200):
        optimizer.zero_grad()
        output = model(X_train_t)
        loss = criterion(output, y_train_t)
        loss.backward()
        optimizer.step()
    results.append((round(loss.item(),3), _i))
    print(_i, round(loss.item(),3), config)
print("MIN LOSS: ", min(results))

0 0.833 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': True}
1 0.825 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False}
2 0.866 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': False, 'rm_sw': True}
3 0.86 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': False, 'rm_sw': False}
4 0.835 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': False, 'add_bigrams': True, 'rm_sw': True}
5 0.827 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': False, 'add_bigrams': True, 'rm_sw': False}
6 0.866 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': False, 'add_bigrams': False, 'rm_sw': True}
7 0.86 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': Fal
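
In `pre_process`, punctuation is only separated or removed when exactly one of `sep_pn` and `rm_pn` is set, so configurations with both flags set behave like configurations with neither. A possible refinement, shown as a sketch with hypothetical names (`all_configurations`, `distinct_configurations`), is to drop those redundant combinations before running the grid.

```python
import itertools

OPTIONS = [
    "sep_pn", "rm_pn", "apply_lemmatization", "apply_stemming", "add_bigrams", "rm_sw"
]

def all_configurations():
    """Every True/False assignment over the six options."""
    return [dict(zip(OPTIONS, combo))
            for combo in itertools.product([True, False], repeat=len(OPTIONS))]

def distinct_configurations():
    """Skip settings where sep_pn and rm_pn are both set, since they cancel out."""
    return [c for c in all_configurations() if not (c["sep_pn"] and c["rm_pn"])]

print(len(all_configurations()))       # 2 ** 6 combinations
print(len(distinct_configurations()))  # combinations left after filtering
```

This removes a quarter of the runs without losing any distinct preprocessing behaviour.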

In [36]:
results = []
configs = [
    {'sep_pn': True, 'rm_pn': False, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False}, # 18
    {'sep_pn': False, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': False, 'add_bigrams': True, 'rm_sw': False}, # 38
    {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False},
    {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': False, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False},
    {'sep_pn': True, 'rm_pn': False, 'apply_lemmatization': True, 'apply_stemming': False, 'add_bigrams': True, 'rm_sw': False}, # 22
    {'sep_pn': True, 'rm_pn': False, 'apply_lemmatization': False, 'apply_stemming': False, 'add_bigrams': True, 'rm_sw': False},
    {'sep_pn': False, 'rm_pn': False, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False},
    {'sep_pn': False, 'rm_pn': False, 'apply_lemmatization': True, 'apply_stemming': False, 'add_bigrams': False, 'rm_sw': False},
]
for _i, config in enumerate(configs):
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform([pre_process(i, config) for i in train["text"]]).toarray()
    X_val = vectorizer.transform([pre_process(i, config) for i in val["text"]]).toarray()

    emotions = ['Joy','Sadness','Surprise','Fear','Anger']
    y_train = train[emotions].values
    y_val = val[emotions].values

    X_train_t = torch.Tensor(X_train)
    y_train_t = torch.Tensor(y_train)

    X_val_t = torch.Tensor(X_val)
    y_val_t = torch.Tensor(y_val)

    model = nn.Sequential(
          nn.Linear(X_train.shape[1], 100),
          nn.ReLU(),
          nn.Dropout(0.3),
          nn.Linear(100, y_train.shape[1])
        )
    
    # weights = y_train.sum(axis=0)/y_train.sum()
    # weights = max(weights)/weights
    
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)
    # criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor(weights))
    # optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)
    loss = None
    
    # Train for a set number of epochs
    for epoch in range(1000):
        optimizer.zero_grad()
        output = model(X_train_t)
        loss = criterion(output, y_train_t)
        loss.backward()
        optimizer.step()
    results.append((round(loss.item(),3), _i))
    print(_i, round(loss.item(),3), config)
print("MIN LOSS: ", min(results))

0 0.499 {'sep_pn': True, 'rm_pn': False, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False}
1 0.508 {'sep_pn': False, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': False, 'add_bigrams': True, 'rm_sw': False}
2 0.509 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False}
3 0.51 {'sep_pn': True, 'rm_pn': True, 'apply_lemmatization': False, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False}
4 0.503 {'sep_pn': True, 'rm_pn': False, 'apply_lemmatization': True, 'apply_stemming': False, 'add_bigrams': True, 'rm_sw': False}
5 0.507 {'sep_pn': True, 'rm_pn': False, 'apply_lemmatization': False, 'apply_stemming': False, 'add_bigrams': True, 'rm_sw': False}
6 0.509 {'sep_pn': False, 'rm_pn': False, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False}
7 0.56 {'sep_pn': False, 'rm_pn': False, 'apply_lemmatization': True, 'apply_ste

In [60]:
vectorizer = CountVectorizer()
config = {'sep_pn': True, 'rm_pn': False, 'apply_lemmatization': True, 'apply_stemming': True, 'add_bigrams': True, 'rm_sw': False}
X_train = vectorizer.fit_transform([pre_process(i, config) for i in train["text"]]).toarray()
X_val = vectorizer.transform([pre_process(i, config) for i in val["text"]]).toarray()

emotions = ['Joy','Sadness','Surprise','Fear','Anger']
y_train = train[emotions].values
y_val = val[emotions].values

# print(val)

Finally, we cast the transformed vectors to PyTorch tensors.

In [61]:
X_train_t = torch.Tensor(X_train)
y_train_t = torch.Tensor(y_train)

X_val_t = torch.Tensor(X_val)
y_val_t = torch.Tensor(y_val)

In [62]:
# Local Train/Test Split

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train_train, X_train_test, y_train_train, y_train_test = train_test_split(X_train_t, y_train_t, test_size=0.20, random_state=42)

## Characteristics of the data

Statistics of the data are printed below. There are 2768 samples in the training data, of which 2214 (80%) fall in the local training split. The input representation consists of 4255 input features and there are 5 output classes. There is an imbalance in the dataset, with the "fear" class being assigned to 58% of samples but the "anger" class to only 12%. The class counts are computed over the full training data.

(Due to the multilabel nature of the data, the percentages do not sum to 100%.)
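The arithmetic behind these percentages can be reproduced from the class counts printed in the cell output of this section. The short sketch below uses those counts directly; the variable names are our own.

```python
emotions = ["Joy", "Sadness", "Surprise", "Fear", "Anger"]
positives = [674, 878, 839, 1611, 333]  # positives per class, from the notebook output
n_samples = 2768                        # size of the full training set

# Share of samples carrying each label, in percent
shares = {e: round(100 * c / n_samples) for e, c in zip(emotions, positives)}

# Average number of labels attached to one sample
labels_per_sample = sum(positives) / n_samples

print(shares)
print(round(labels_per_sample, 2))
```

The shares add up to well over 100% because a sample carries about 1.57 labels on average.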

In [63]:
# print(f'Shape of X: {X_train.shape}')
# print(f'Shape of y: {y_train.shape}')
# print(f'Number of positives per emotion class:')
# _ = [print(f' - {e}: {v} ({round(100*v/len(y_train))}%)') for e,v in zip(emotions, y_train.sum(axis=0))]

In [64]:
# Local train split

print(f'Shape of X: {X_train_train.shape}')
print(f'Shape of y: {y_train_train.shape}')
# Note: the class counts below are computed over the full training set (y_train),
# not only over the local training split
print('Number of positives per emotion class:')
_ = [print(f' - {e}: {v} ({round(100*v/len(y_train))}%)') for e,v in zip(emotions, y_train.sum(axis=0))]

Shape of X: torch.Size([2214, 4255])
Shape of y: torch.Size([2214, 5])
Number of positives per emotion class:
 - Joy: 674 (24%)
 - Sadness: 878 (32%)
 - Surprise: 839 (30%)
 - Fear: 1611 (58%)
 - Anger: 333 (12%)


## Define the model 

We define a simple neural network model with 1 hidden layer for this task. This can be made arbitrarily more complex, e.g. by experimenting with the types of inputs and layers, layer size, depth and regularisation.

In [65]:
# model = nn.Sequential(
#           nn.Linear(X_train.shape[1], 100),
#           nn.ReLU(),
#           nn.Dropout(0.3),
#           nn.Linear(100, y_train.shape[1])
#         )

In [66]:
# Local train train split

model = nn.Sequential(
          nn.Linear(X_train_train.shape[1], 100),
          nn.ReLU(),
          nn.Dropout(0.3),
          nn.Linear(100, y_train_train.shape[1])
        )

## Define the optimisation parameters

To perform multilabel classification, we use binary cross-entropy with logits, which treats each emotion as an independent binary decision. See [here](https://discuss.pytorch.org/t/is-there-an-example-for-multi-class-multilabel-classification-in-pytorch/53579/6) for a motivation. Here, one can explore different optimizers, regularisation levels, learning rates, etc.
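For intuition, the quantity that binary cross-entropy with logits computes can be written out in a few lines of NumPy. This is a sketch of the standard formula in a numerically stable form, not PyTorch's actual implementation, and the function name `bce_with_logits` is our own.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Stable rewrite of -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
    per_element = (np.maximum(logits, 0)
                   - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_element.mean()

# With all logits at zero, every label is predicted with probability 0.5
print(bce_with_logits([[0.0, 0.0]], [[1.0, 0.0]]))  # equals ln(2)
```

A freshly initialised model produces logits close to zero, so its loss starts near ln(2) ≈ 0.693, which matches the value of about 0.698 reported at epoch 0 of the unweighted run in this notebook.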

In [67]:
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)

## Run the training loop

In [68]:
# # Train for a set number of epochs
# for epoch in range(1000):
#     optimizer.zero_grad()
#     output = model(X_train_t)
#     loss = criterion(output, y_train_t)
#     loss.backward()
#     optimizer.step()
#     if epoch % 100 == 0:
#         print(f'Epoch {epoch}: Loss: {round(loss.item(),3)}')

In [69]:
# Train for a set number of epochs
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(X_train_train)
    loss = criterion(output, y_train_train)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f'Epoch {epoch}: Loss: {round(loss.item(),3)}')

Epoch 0: Loss: 0.698
Epoch 100: Loss: 0.579
Epoch 200: Loss: 0.563
Epoch 300: Loss: 0.552
Epoch 400: Loss: 0.544
Epoch 500: Loss: 0.534
Epoch 600: Loss: 0.526
Epoch 700: Loss: 0.519
Epoch 800: Loss: 0.508
Epoch 900: Loss: 0.499


## Get class predictions

The model outputs logits, as expected by `BCEWithLogitsLoss`. We apply a sigmoid transformation to obtain a score in the range (0, 1), and then need a classification threshold to turn this real-valued score into a binary prediction. The threshold can be tuned on the validation data and may differ per emotion. Given the imbalance in the data, we set it slightly below the standard value of 0.5, to 0.45.
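One way to choose a separate threshold per emotion is a simple grid search that maximises F1 on held-out data. The sketch below is an illustration under that assumption; `f1_binary` and `tune_thresholds` are hypothetical helpers, and the toy scores are invented purely for the example.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for one class, given 0/1 arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_thresholds(y_true, scores, grid=np.arange(0.05, 1.0, 0.05)):
    """For each class column, pick the grid threshold with the best F1."""
    best = []
    for j in range(y_true.shape[1]):
        f1s = [f1_binary(y_true[:, j], (scores[:, j] > t).astype(int)) for t in grid]
        best.append(float(grid[int(np.argmax(f1s))]))
    return np.array(best)

# Toy held-out data: one class, two positives followed by two negatives
y_toy = np.array([[1], [1], [0], [0]])
scores_toy = np.array([[0.42], [0.33], [0.22], [0.11]])
thresholds = tune_thresholds(y_toy, scores_toy)
print(thresholds)
```

The thresholds should be tuned on data that was not used for training, otherwise they tend to overfit.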

In [70]:
# def get_predictions(X_val, model, threshold=0.5):
#     sig = nn.Sigmoid() 
#     yhat = sig(model(X_val)).detach().numpy()
#     y_pred = yhat > threshold
    
#     return y_pred

In [71]:
## Local Train test split

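def get_predictions(X_train_test, model, threshold=0.5):
    # Switch off dropout so that predictions are deterministic
    model.eval()
    with torch.no_grad():
        # The sigmoid maps the logits to scores in (0, 1)
        yhat = torch.sigmoid(model(X_train_test)).numpy()
    # Binarise the scores with the classification threshold
    y_pred = yhat > threshold

    return y_pred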

In [72]:
y_pred_test = get_predictions(X_train_test, model, 0.45)

In [73]:
# y_pred = get_predictions(X_val_t, model, 0.45)
# # print(y_pred)

# # Create a DataFrame to save to CSV
# val_data_with_pred = pd.DataFrame(y_pred, columns=['Joy', 'Anger', 'Sadness', 'Surprise', 'Fear'])  # Adjust column names as per your features
# # val_data_with_pred['True_Label'] = y_test
# # val_data_with_pred['Predictions'] = dummy_predictions

# if not 'id' in val_data_with_pred.columns:
#     val_data_with_pred['id'] = val['id']
#     val_data_with_pred['text'] = val['text']

# val_data_with_pred = val_data_with_pred[['id', 'text', 'Joy', 'Anger', 'Sadness', 'Surprise', 'Fear']]

# # Save to CSV
# val_data_with_pred.to_csv('pred_eng_a.csv', index=False)

# print(val_data_with_pred)

In [74]:
# y_pred_test = get_predictions(X_, model, 0.45)

## Evaluate

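We evaluate the model based on the micro- and macro-averaged F1 score. The former pools the true positives, false positives and false negatives over all classes before computing the metric, so frequent classes dominate. The latter computes the metric for each class separately and then takes the unweighted mean, so every class counts equally.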
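The difference between the two averages is easiest to see on a tiny example. The following sketch computes both by hand with NumPy; the helper names are our own, and in the notebook itself the scores come from scikit-learn's `f1_score`.

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """F1 from raw counts, defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    # Per-class counts (one entry per label column)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    # Micro: pool the counts over classes, then compute F1 once
    micro = f1_from_counts(tp.sum(), fp.sum(), fn.sum())
    # Macro: compute F1 per class, then take the unweighted mean
    macro = float(np.mean([f1_from_counts(*c) for c in zip(tp, fp, fn)]))
    return micro, macro

# A frequent class that is always predicted correctly, and a rare class that is always missed
y_true_toy = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred_toy = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true_toy, y_pred_toy)
print(round(micro, 4), round(macro, 4))
```

Here the micro average stays high because the frequent class dominates the pooled counts, whereas the macro average is halved by the missed rare class.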

In [75]:
def evaluate(y_val, y_pred):
    for average in ['micro', 'macro']:
        recall = recall_score(y_val, y_pred, average=average, zero_division=0)
        precision = precision_score(y_val, y_pred, average=average, zero_division=0)
        f1 = f1_score(y_val, y_pred, average=average, zero_division=0)
    
        print(f'{average.upper()} recall: {round(recall, 4)}, precision: {round(precision, 4)}, f1: {round(f1, 4)}')

In [76]:
# print(y_pred)

In [77]:
# evaluate(y_val, y_pred)

In [80]:
# evaluate(y_train_test, y_pred_test)

The results show that the macro-averaged F1 is much lower than the micro-averaged score. This indicates that the model might be performing poorly on some of the classes. Below, we evaluate each class separately.

In [81]:
def evaluate_per_class(y_val, y_pred):
    for i, emotion in enumerate(emotions):
        print(f'*** {emotion} ***')
    
        recall = recall_score(y_val[:,i], y_pred[:,i], zero_division=0)
        precision = precision_score(y_val[:,i], y_pred[:,i], zero_division=0)
        f1 = f1_score(y_val[:,i], y_pred[:,i], zero_division=0)
        
        print(f'recall: {round(recall, 4)}, precision: {round(precision, 4)}, f1: {round(f1, 4)}\n')

In [82]:
# evaluate_per_class(y_val, y_pred)

In [84]:
# evaluate_per_class(y_train_test, y_pred_test)

## Weighting classes

We can see that the model performs well on the "fear" class, which is the most common, but poorly on all others, classifying all samples as negative. One way to address this is by assigning weights to the different classes to increase the effect of samples from rare classes. For example, the snippet below can be used to calculate weights based on the relative frequency of each class.

In [85]:
# weights = y_train.sum(axis=0)/y_train.sum()
# weights = max(weights)/weights

In [86]:
weights = y_train_train.sum(axis=0)/y_train_train.sum()
weights = max(weights)/weights

These weights can then be assigned to the loss function for training. 
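As a worked example of this weighting scheme, the sketch below applies the same formula to the class counts printed earlier for the full training set. The notebook computes its weights on the local split, so the actual values will differ slightly. The second scheme, the ratio of negatives to positives per class, is a common alternative for `pos_weight` and is included only for comparison.

```python
import numpy as np

emotions = ["Joy", "Sadness", "Surprise", "Fear", "Anger"]
positives = np.array([674, 878, 839, 1611, 333], dtype=float)  # from the notebook output
n_samples = 2768

# Scheme used in this notebook: most frequent class gets weight 1,
# rarer classes get proportionally larger weights
rel_freq = positives / positives.sum()
weights_demo = rel_freq.max() / rel_freq

# Alternative (assumption, not used in this notebook): negatives / positives per class
pos_weight_alt = (n_samples - positives) / positives

for e, w, a in zip(emotions, weights_demo, pos_weight_alt):
    print(f"{e}: notebook scheme {w:.2f}, negatives/positives {a:.2f}")
```

Under the notebook's scheme no class is ever weighted below 1, while the negatives-to-positives ratio drops below 1 for classes that are positive in more than half of the samples, such as "fear".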

In [87]:
# # Define model 
# model = nn.Sequential(
#           nn.Linear(X_train.shape[1], 100),
#           nn.ReLU(),
#           nn.Dropout(0.3),
#           nn.Linear(100, y_train.shape[1])
#         )

# # Define training parameters
# criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor(weights)) # <-- weights assigned to the loss function
# optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)

# # Train for a number of epochs
# for epoch in range(1000):
#     optimizer.zero_grad()
#     output = model(X_train_t)
#     loss = criterion(output, y_train_t)
#     loss.backward()
#     optimizer.step()
#     if epoch % 100 == 0:
#         print(f'Epoch {epoch}: Loss: {round(loss.item(),3)}')

# # Get predictions
# y_pred = get_predictions(X_val_t, model, 0.45)

# # Evaluate
# print('\n\nEVALUATION\n')
# evaluate(y_val, y_pred)

# print('\nPER CLASS BREAKDOWN\n')
# evaluate_per_class(y_val, y_pred)

In [88]:
# Define model 
model = nn.Sequential(
          nn.Linear(X_train_train.shape[1], 100),
          nn.ReLU(),
          nn.Dropout(0.3),
          nn.Linear(100, y_train_train.shape[1])
        )

# Define training parameters
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor(weights)) # <-- weights assigned to the loss function
optimizer = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-2)

# Train for a number of epochs
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(X_train_train)
    loss = criterion(output, y_train_train)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f'Epoch {epoch}: Loss: {round(loss.item(),3)}')

# Get predictions
y_pred_test = get_predictions(X_train_test, model, 0.45)

# Evaluate
print('\n\nEVALUATION\n')
evaluate(y_train_test, y_pred_test)

print('\nPER CLASS BREAKDOWN\n')
evaluate_per_class(y_train_test, y_pred_test)

Epoch 0: Loss: 0.883
Epoch 100: Loss: 0.843
Epoch 200: Loss: 0.817
Epoch 300: Loss: 0.79
Epoch 400: Loss: 0.762
Epoch 500: Loss: 0.73
Epoch 600: Loss: 0.697
Epoch 700: Loss: 0.665
Epoch 800: Loss: 0.639
Epoch 900: Loss: 0.613


EVALUATION



ValueError: Input arrays use different devices: cpu, cpu

Using this approach, we can see that the model performs much better on average (and particularly on the less common classes), even though the final training loss is higher.
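
The `ValueError` in the output of the last cell is most likely caused by mixing array types: `y_train_test` is a PyTorch tensor (it comes from `train_test_split` applied to tensors), whereas `get_predictions` returns a NumPy array, and recent scikit-learn versions appear to reject such mixed inputs. A minimal sketch of a workaround, assuming this diagnosis is right, is to convert both arguments to NumPy before calling the metric functions. The helper `to_numpy` is hypothetical and not part of the original notebook.

```python
import numpy as np

def to_numpy(x):
    """Return x as a NumPy array; PyTorch tensors are detached and moved to the CPU first."""
    if hasattr(x, "detach"):
        x = x.detach().cpu().numpy()
    return np.asarray(x)

# Intended use with the notebook's own variables and functions:
# evaluate(to_numpy(y_train_test), to_numpy(y_pred_test))
# evaluate_per_class(to_numpy(y_train_test), to_numpy(y_pred_test))

print(to_numpy([[1, 0], [0, 1]]).shape)
```

The helper leaves NumPy arrays unchanged, so it is safe to apply to both the labels and the predictions.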