# Multi-Label Emotion Recognition from Text

### Objective: Develop a system to classify multiple emotions (e.g., joy, sadness, anger) present in textual data.

In [None]:
pip install transformers datasets scikit-learn torch



In [None]:
!pip install --upgrade transformers



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import os
import random
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertModel, BertTokenizerFast, BertForSequenceClassification, Trainer, TrainingArguments


## Understanding the Dataset

The Google GoEmotions dataset consists of Reddit comments, each annotated with the emotions it expresses (27 emotion categories plus neutral). It is designed for training models that perform fine-grained emotion classification of text.


In [None]:
df = pd.read_csv('/content/go_emotions_dataset.csv')

df.head(5)

Unnamed: 0,id,text,example_very_unclear,admiration,amusement,anger,annoyance,approval,caring,confusion,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,eew5j0j,That game hurt.,False,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,eemcysk,>sexuality shouldn’t be a grouping category I...,True,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ed2mah1,"You do right, if you don't care then fuck 'em!",False,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,eeibobj,Man I love reddit.,False,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,eda6yn6,"[NAME] was nowhere near them, he was by the Fa...",False,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [None]:
df.shape

(211225, 31)

In [None]:
df.isnull().sum()

Unnamed: 0,0
id,0
text,0
example_very_unclear,0
admiration,0
amusement,0
anger,0
annoyance,0
approval,0
caring,0
confusion,0


In [None]:
# Dropping the columns that are not needed for modelling
df.drop(columns=['id', 'example_very_unclear'], inplace=True)


### Data Visualisation

In [None]:
# Selecting every emotion column by name: after the drops above, 'text' is the only non-label column
emotion_cols = [col for col in df.columns if col not in ['text', 'clean_text']]
emotion_counts = df[emotion_cols].sum().sort_values(ascending=False)

# Creating a DataFrame for plotting
emotion_df = emotion_counts.reset_index()
emotion_df.columns = ['Emotion', 'Count']

# Plot using Plotly Express (no matplotlib figure is needed)
fig = px.bar(emotion_df, x='Emotion', y='Count', title='Go Emotions')

# Customize layout
fig.update_layout(height=600,
                  xaxis_title='',
                  yaxis_title='Number of Texts',
                  plot_bgcolor='#fff')
fig.show()

## Data Preprocessing

In [None]:
# Function to clean the dataset
import re

def clean_text(text):
    text = text.lower()  # lowercase the text
    text = re.sub(r"http\S+", "", text)  # remove URLs
    # remove everything except letters, digits, whitespace and . , ! ? '
    # (this also strips curly apostrophes and the brackets around [NAME])
    text = re.sub(r"[^a-zA-Z0-9\s.,!?']", "", text)
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text

df['clean_text'] = df['text'].apply(clean_text)  # new column holding the cleaned text

# Displaying a random sample of original and cleaned text
display(df[['text', 'clean_text']].sample(5))

Unnamed: 0,text,clean_text
121069,IMO the car was mostly at fault for abruptly j...,imo the car was mostly at fault for abruptly j...
166846,Lovely. This is a good deed and he should be l...,lovely. this is a good deed and he should be l...
46709,You guys can’t stay married. Attraction is the...,you guys cant stay married. attraction is the ...
45098,my name’s [NAME] and i usually get a lot of [N...,my names name and i usually get a lot of name ...
50897,"Sorry, you misread. I thought you'd written ""w...","sorry, you misread. i thought you'd written wh..."
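
The sample above shows a side effect of the character filter: typographic apostrophes (’) are not in the allowed set, so "can’t" becomes "cant" and "name’s" becomes "names". If that is not intended, one option is to map curly quotes to straight ones before filtering. The sketch below is a hypothetical variant (the name `clean_text_keep_apostrophes` is not part of the original notebook), not a change the author made:

```python
import re

def clean_text_keep_apostrophes(text):
    """Hypothetical variant of clean_text that preserves contractions."""
    text = text.lower()
    # Map typographic single quotes to the ASCII apostrophe before filtering
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = re.sub(r"http\S+", "", text)            # remove URLs
    text = re.sub(r"[^a-z0-9\s.,!?']", "", text)   # same allowed set as clean_text
    text = re.sub(r"\s+", " ", text).strip()       # collapse repeated whitespace
    return text

cleaned = clean_text_keep_apostrophes("You guys can’t stay married.")
print(cleaned)
```

Placeholders such as `[NAME]` are still reduced to the plain word `name`, exactly as in the original function.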


Since this is a multi-label classification task, we want to check the class distribution of all the emotions. The 'neutral' label is by far the most frequent (55,298 rows), while the other labels range from 673 (grief) to 17,620 (approval). This is a class imbalance problem: a few labels heavily outweigh the rest, which can bias the model towards the frequent labels during training and lead to poor generalisation on the minority classes.

In [None]:
# Collecting the label columns (every column except the text fields)
label_cols = [col for col in df.columns if col not in ['text', 'clean_text', 'id']]

# check class distribution
label_counts = df[label_cols].sum().sort_values(ascending=False)
print(label_counts)


neutral           55298
approval          17620
admiration        17131
annoyance         13618
gratitude         11625
disapproval       11424
curiosity          9692
amusement          9245
realization        8785
optimism           8715
disappointment     8469
love               8191
anger              8084
joy                7983
confusion          7359
sadness            6758
caring             5999
excitement         5629
surprise           5514
disgust            5301
desire             3817
fear               3197
remorse            2525
embarrassment      2476
nervousness        1810
pride              1302
relief             1289
grief               673
dtype: int64


We can handle the imbalanced data by assigning class weights. Each label gets a weight inversely proportional to its frequency, so that the loss function used during training puts more emphasis on errors made on the minority classes.

In [None]:
# Storing the computed weight of each emotion in the class_weights dict
total_samples = len(df)
class_weights = {}
for label in label_cols:
    # Count how many rows carry this emotion
    positive_count = df[label].sum()
    # Inverse-frequency weight: total rows / rows with this label
    class_weights[label] = total_samples / positive_count

print(class_weights)

{'admiration': np.float64(12.32998657404705), 'amusement': np.float64(22.847485127095727), 'anger': np.float64(26.12877288471054), 'annoyance': np.float64(15.51072110442062), 'approval': np.float64(11.987797956867196), 'caring': np.float64(35.21003500583431), 'confusion': np.float64(28.702948770213343), 'curiosity': np.float64(21.793747420553032), 'desire': np.float64(55.337961750065496), 'disappointment': np.float64(24.940961152438305), 'disapproval': np.float64(18.489583333333332), 'disgust': np.float64(39.846255423505), 'embarrassment': np.float64(85.3089660743134), 'excitement': np.float64(37.524427074080656), 'fear': np.float64(66.06975289333751), 'gratitude': np.float64(18.169892473118278), 'grief': np.float64(313.8558692421991), 'joy': np.float64(26.459351121132407), 'love': np.float64(25.787449639848614), 'nervousness': np.float64(116.6988950276243), 'optimism': np.float64(24.23694779116466), 'pride': np.float64(162.2311827956989), 'realization': np.float64(24.04382470119522), 
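
To see how such weights change the cost of an error, here is a minimal NumPy sketch of a binary cross-entropy on logits in which the positive term of each label is scaled by its weight. The function name `weighted_bce_with_logits` is hypothetical, and the two counts are copied from the label distribution printed above; this illustrates the idea and is not the loss the notebook's `Trainer` applies.

```python
import numpy as np

def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean binary cross-entropy over all labels, positives scaled by pos_weight."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    pos_weight = np.asarray(pos_weight, dtype=np.float64)
    # softplus(z) = log(1 + exp(z)), computed stably with logaddexp
    loss_pos = np.logaddexp(0.0, -logits)  # equals -log(sigmoid(logit))
    loss_neg = np.logaddexp(0.0, logits)   # equals -log(1 - sigmoid(logit))
    per_label = pos_weight * targets * loss_pos + (1.0 - targets) * loss_neg
    return per_label.mean()

# Counts taken from the printed label distribution
total_samples = 211225
counts = {"neutral": 55298, "grief": 673}
pos_weight = np.array([total_samples / c for c in counts.values()])

# One example where both labels are truly present and the model is undecided
# (logit 0 corresponds to a predicted probability of 0.5)
logits = np.zeros((1, 2))
targets = np.ones((1, 2))

unweighted = weighted_bce_with_logits(logits, targets, np.ones(2))
weighted = weighted_bce_with_logits(logits, targets, pos_weight)
print(f"unweighted: {unweighted:.4f}, weighted: {weighted:.4f}")
```

In PyTorch the same scaling is available through the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`, whose documentation describes the weight as the ratio of negative to positive examples, a slightly different formula from total/positive. Note that the `class_weights` dictionary is never passed to the `Trainer` further down, so as written it does not yet influence training. Weights as large as the one for grief (about 314) can also dominate the loss, so capping or dampening them may be worth considering.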

In [None]:
# Splitting the dataset

X = df['clean_text'] # the cleaned text is our input
y = df[label_cols] # predicting the emotions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Training transformer model BERT for multi-label classification

BERT is pre-trained with a masked language modelling objective: it predicts missing words in a sentence using both the left and the right context (bidirectional context), which lets it learn the relationships between the words in a sentence.

After loading a pre-trained BERT tokenizer and encoding the text, we create a custom dataset that PyTorch's DataLoader (and Hugging Face's Trainer) can use to feed inputs and labels into the model during training and evaluation.

We use BertForSequenceClassification, a BERT model with an additional classification layer on top. Setting problem_type="multi_label_classification" makes it score each label independently and train with a binary cross-entropy loss on the logits (BCEWithLogitsLoss), instead of the softmax cross-entropy used for single-label tasks.

Then we set the training arguments. These are the recommended settings for the free Google Colab GPU; increasing the batch size, for example, might result in the session crashing.

Source I used to modify my code: https://medium.com/@abdurhmanfayad_73788/fine-tuning-bert-for-a-multi-label-classification-problem-on-colab-5ca5b8759f3f
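
A multi-label head scores each emotion independently with a sigmoid, whereas a single-label head uses a softmax that forces the scores to compete. The toy comparison below uses made-up logits and a hypothetical three-label subset to show the difference; the 0.5 threshold is an assumption, not something the notebook sets:

```python
import numpy as np

labels = ["joy", "sadness", "anger"]   # hypothetical subset of the 28 labels
logits = np.array([2.0, 1.5, -3.0])    # made-up model outputs for one text

# Single-label view: softmax probabilities sum to 1, so only one label can win
softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()
single_label = labels[int(np.argmax(softmax))]

# Multi-label view: each label gets its own independent probability
sigmoid = 1.0 / (1.0 + np.exp(-logits))
multi_label = [name for name, p in zip(labels, sigmoid) if p >= 0.5]

print("softmax picks:", single_label)
print("sigmoid keeps:", multi_label)
```

With the sigmoid, a text can carry several emotions at once (or none), which is what the multi-hot label vectors in this dataset require.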

In [None]:
# Using transformers library to use BERT
from transformers import BertTokenizer

# Loading a pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize and encode the text
# Converting the training and test set to lists as tokenizer expects a list of strings
train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(X_test.tolist(), truncation=True, padding=True, max_length=128)


In [None]:
# Custom Dataset class

class EmotionDataset(Dataset):
    def __init__(self, encodings, labels):
        # dictionary containing tokenized inputs
        self.encodings = encodings
        # array for multi-label emotion vectors containing 1s and 0s
        self.labels = torch.tensor(labels.values, dtype=torch.float32)


    def __len__(self):
        return len(self.labels) #returns number of samples in dataset

    def __getitem__(self, idx):
        # Retrieves the idx-th sample
        # For each tokenized field in encodings dictionary, we get the corresponding item
        # Convert it into a PyTorch tensor and add label tensor
        # This is the format expected by Hugging Face's Trainer
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


# Creating instances of test and train dataset which can be used by the Hugging Face's Trainer library to train the model
train_dataset = EmotionDataset(train_encodings, y_train)
test_dataset = EmotionDataset(test_encodings, y_test)


In [None]:
# Defining the BERT model

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_cols),
    problem_type="multi_label_classification"
)

# Training arguments for Google Colab
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    eval_strategy="epoch",
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

trainer.train()
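
With `eval_strategy="epoch"` and no metric function, the `Trainer` reports only the evaluation loss. The sketch below shows one way to add multi-label metrics using NumPy alone. It assumes the evaluation object unpacks into `(logits, labels)`, as the `Trainer`'s `EvalPrediction` does, and the 0.5 threshold and the choice of metrics are assumptions rather than part of the original notebook:

```python
import numpy as np

def compute_metrics(eval_pred, threshold=0.5):
    """Micro-averaged F1 and exact-match ratio for multi-hot labels."""
    logits, labels = eval_pred
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels).astype(int)

    # Independent probability per label, then a fixed decision threshold
    probs = 1.0 / (1.0 + np.exp(-logits))
    preds = (probs >= threshold).astype(int)

    # Micro averaging pools true/false positives over every label
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    denom = 2 * tp + fp + fn
    micro_f1 = float(2 * tp / denom) if denom > 0 else 0.0

    # Share of examples whose whole label vector is predicted correctly
    exact_match = float(np.mean(np.all(preds == labels, axis=1)))
    return {"micro_f1": micro_f1, "exact_match": exact_match}

# Toy check with two examples and three labels
toy_logits = np.array([[2.0, -1.0, 0.5], [-3.0, 1.5, -0.2]])
toy_labels = np.array([[1, 0, 0], [0, 1, 1]])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` would report these values at every evaluation. Because the labels are imbalanced, a macro-averaged F1 or per-label scores may be more informative than the micro average alone.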


