Place the training data in a "Data" folder in the same directory as this notebook, and rename the training data folder to "Raw_data", so that the code below can read it from "Data/Raw_data/".

# Importing libraries

In [1]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import os
import numpy as np
import whisper
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from PIL import Image
import random
import json
import shutil
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, matthews_corrcoef, accuracy_score
from sklearn.preprocessing import label_binarize

# Directories Generator

The following code creates the directories "Audio_data", "Spectrograms", and "MFCCs".

### Audio_data
All the audio files are copied into this directory so that they can be accessed from one place. Before copying, I check that the folder is empty, because when this notebook is run multiple times I don't want the audio files to be copied into the same directory again.

### Spectrograms
This folder is created to store all the generated spectrograms.

### MFCCs
This folder is created to store all the generated MFCCs.


In [2]:
path = "Data/Raw_data/"
save_audio = "Data/Audio_data/"
save_spec = "Data/Spectrograms/"
save_mfcc = "Data/MFCCs/"

os.makedirs(os.path.dirname(save_spec), exist_ok=True)
os.makedirs(os.path.dirname(save_mfcc), exist_ok=True)
os.makedirs(os.path.dirname(save_audio), exist_ok=True)

# Actor folders are named Actor_01 ... Actor_24
Directory = [f"{path}Actor_{i:02d}" for i in range(1, 25)]
            
if not os.listdir(save_audio):
    for actor_path in Directory:
        for actor in os.listdir(actor_path):
            audio_path = os.path.join(actor_path, actor)
            shutil.copy(audio_path, save_audio)
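
The copy-only-when-empty check described above can be isolated into a small helper. The following is a minimal sketch rather than the notebook's own code: the function name `copy_if_empty`, the file names, and the temporary directories are assumptions made purely for illustration.

```python
import os
import shutil
import tempfile


def copy_if_empty(source_dirs, destination):
    """Copy every file from source_dirs into destination, but only when
    destination holds no entries yet. Returns the number of files copied."""
    os.makedirs(destination, exist_ok=True)
    if os.listdir(destination):
        return 0  # already populated by an earlier run; do nothing

    copied = 0
    for source in source_dirs:
        for name in sorted(os.listdir(source)):
            shutil.copy(os.path.join(source, name), destination)
            copied += 1
    return copied


# Demonstration on throwaway directories (hypothetical file names).
with tempfile.TemporaryDirectory() as root:
    actor_dir = os.path.join(root, "Actor_01")
    os.makedirs(actor_dir)
    for name in ("a.wav", "b.wav"):
        open(os.path.join(actor_dir, name), "w").close()

    target = os.path.join(root, "Audio_data")
    first_run = copy_if_empty([actor_dir], target)   # copies both files
    second_run = copy_if_empty([actor_dir], target)  # skips: target is not empty
```

Running the helper twice leaves the destination unchanged on the second call, which is the behaviour the notebook relies on when it is re-executed.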

# CNN with Spectrograms

## Spectrogram Generator
A function that generates the spectrogram of an audio file, given the audio file location and the save location.

In [3]:
def Spectrograms_generator(audio_path, save_path, sr=22050):
    y, sr = librosa.load(audio_path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    S_dB = librosa.power_to_db(S, ref=np.max)

    plt.figure(figsize=(2.24, 2.24))  # size fits CNN input nicely
    librosa.display.specshow(S_dB, sr=sr, x_axis='time', y_axis='mel')
    plt.axis('off')  # no axis for CNN input
    plt.tight_layout(pad=0)
    plt.savefig(save_path, bbox_inches='tight', pad_inches=0)
    plt.close()
    # plt.show()

The following code iterates through each audio file in the folder "Audio_data", generates the corresponding spectrogram, and saves it to the folder "Spectrograms".

In [4]:
for audio in os.listdir(save_audio):
    name = audio.split(".")
    Save_path = save_spec + name[0] + ".png"
    file_path = save_audio + audio
    
    Spectrograms_generator(file_path, Save_path)

## CNN Implementation

The following CNN class defines a Convolutional Neural Network for classifying images into 8 classes. It contains two convolutional layers: the first converts a 3-channel input (like RGB) to 16 feature maps, and the second converts those to 32 feature maps. Both use 3×3 kernels and preserve spatial dimensions via padding. Max pooling follows each convolution and halves the height and width while preserving the key features. For a 64 x 64 input, the output tensor after the second pooling has shape 32 x 16 x 16, which is flattened into a vector of size 8192. This vector is passed through a two-layer fully connected network, the first layer containing 128 neurons and the second containing 8 neurons. Overall, the model combines convolution for feature extraction and fully connected layers for classification.

In [5]:
class CNN(nn.Module):
    def __init__(self, classes=8):
        
        super(CNN, self).__init__()

        # a Max pooling layer that downsamples the input by taking the max over 2×2 regions, and reducing spatial dimensions by half
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  
        
        # Convolution layer 1
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)  # Output is of the form [16, 64, 64]

        # Convolutional layer 2
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)  # Output is of the form [32, 32, 32]

        # Flattening the 3D tensor of size [32, 16, 16] (after the second pooling) into a 1D tensor of size 8192 and passing it to a two-layer neural network
        self.fc1 = nn.Linear(in_features = 32 * 16 * 16, out_features = 128)
        self.fc2 = nn.Linear(in_features = 128, out_features = classes )

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))    # Input batch passes through the first convolutional layer and pooling; output is [B,16,32,32]
        x = self.pool(F.relu(self.conv2(x)))    # Output of the first block passes through the second convolutional layer and pooling; output is [B,32,16,16]
        x = x.view(x.size(0), -1)               # Flattening the 4D tensor of size [B,32,16,16] to a 2D tensor of size [B,8192]
        features = F.relu(self.fc1(x))          # Passing the flattened tensor to the first fully connected layer, with ReLU activation
        logits = self.fc2(features)             # Passing the output of the first fully connected layer to the second layer
        return logits, features
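
The shape bookkeeping behind `32 * 16 * 16` can be checked with plain arithmetic. The sketch below uses the standard output-size formula for convolution and pooling; the helper names `conv2d_out` and `pool2d_out` are hypothetical, and the 64×64 input size is taken from the resize step used later in the notebook.

```python
def conv2d_out(size, kernel, padding=0, stride=1):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool2d_out(size, kernel=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - kernel) // stride + 1


size = 64                                      # images are resized to 64x64
size = conv2d_out(size, kernel=3, padding=1)   # conv1 keeps the size at 64
size = pool2d_out(size)                        # first pooling halves it to 32
after_first_block = size
size = conv2d_out(size, kernel=3, padding=1)   # conv2 keeps the size at 32
size = pool2d_out(size)                        # second pooling halves it to 16
flattened = 32 * size * size                   # 32 channels x 16 x 16
```

If the resize target or the number of pooling steps changes, `in_features` of `fc1` has to change with it, and the same two helpers give the new value.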

The following SpectrogramDataset class is used for loading spectrogram images and their labels. It takes the image directory, the labels, and an optional transformation as inputs (we will use the transformation to convert each image to a tensor). The `__len__` method returns the dataset size, while `__getitem__` loads and transforms the image at a given index, returning the transformed image and its label. This class helps in loading spectrogram images for model training.

In [6]:
class SpectrogramDataset(Dataset):
    def __init__(self, image_dir, labels, transform=None):
        self.image_dir = image_dir
        self.transform = transform
        self.labels = labels
        self.filenames = list(labels.keys())

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        image_file = self.filenames[index]
        label = self.labels[image_file]
        image_path = os.path.join(self.image_dir, image_file)
        image = Image.open(image_path).convert('RGB')
        if self.transform:
            image = self.transform(image)
        return image, label

## Training Data

The following code sets up and trains a CNN for emotion classification using spectrogram images. The images are resized to 64×64 and converted to tensors. The labels are extracted from the filenames, and the data is split randomly into 70% for training and 30% for testing. The SpectrogramDataset and DataLoader are used to group the data into batches of 20. The model is trained for 30 epochs, since the training accuracy reaches almost 100% within 30 epochs.

In [7]:
transform = transforms.Compose([
    transforms.Resize((64, 64)),    # Resize images to 64x64 and then
    transforms.ToTensor()           # Converting it to a PyTorch tensor
])

# -------------------------------------------------------------------------- Preparing labels for Dataset --->

labels = {}
for filename in os.listdir("Data/Spectrograms/"):
    if filename.endswith(".png"):
        # The third hyphen-separated field of the filename is the emotion code (1-8); shift it to 0-7
        label = int(filename.split("-")[2]) - 1
        labels[filename] = label

# ------------------------------------------------------------------------- Creating train and test Dataset and DataLoader --->

sample = int(70*len(labels.keys())/100)

train_labels = dict(random.sample(list(labels.items()), sample)) # randomly choosing 70% of data for training

test_labels = {x: y for x, y in labels.items() if x not in train_labels} # using the rest of the data for testing

train_dataset = SpectrogramDataset(image_dir="Data/Spectrograms/", labels=train_labels, transform=transform)
train_dataloader = DataLoader(train_dataset, batch_size = 20, shuffle=True) #loading the train data

test_dataset = SpectrogramDataset(image_dir="Data/Spectrograms/", labels=test_labels, transform=transform)
test_dataloader = DataLoader(test_dataset, batch_size = 20, shuffle=True) #loading the test data

# ------------------------------------------------------------------------- Initializing Model, Loss Function, and Optimizer --->
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CNN(classes=8).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) # updates the parameters during training

# ------------------------------------------------------------------------- Training the network --->
num_epochs = 30

for epoch in range(num_epochs):
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in train_dataloader:
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        outputs, _ = model(images) # output after a pass
        loss = criterion(outputs, labels) # calculating the cross entropy loss

        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate metrics
        running_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    epoch_loss = running_loss / len(train_dataloader)
    epoch_acc = 100 * correct / total
    print(f"Epoch {epoch+1}/{num_epochs}\t|\tLoss: {epoch_loss:.4f}\t|\tAccuracy: {epoch_acc:.2f}%")

Epoch 1/30	|	Loss: 2.0255	|	Accuracy: 18.65%
Epoch 2/30	|	Loss: 1.7618	|	Accuracy: 32.84%
Epoch 3/30	|	Loss: 1.5223	|	Accuracy: 44.74%
Epoch 4/30	|	Loss: 1.3265	|	Accuracy: 49.80%
Epoch 5/30	|	Loss: 1.1395	|	Accuracy: 58.83%
Epoch 6/30	|	Loss: 0.9495	|	Accuracy: 66.57%
Epoch 7/30	|	Loss: 0.8529	|	Accuracy: 69.84%
Epoch 8/30	|	Loss: 0.7124	|	Accuracy: 75.79%
Epoch 9/30	|	Loss: 0.5592	|	Accuracy: 82.14%
Epoch 10/30	|	Loss: 0.4872	|	Accuracy: 83.93%
Epoch 11/30	|	Loss: 0.4487	|	Accuracy: 85.62%
Epoch 12/30	|	Loss: 0.3480	|	Accuracy: 88.89%
Epoch 13/30	|	Loss: 0.2986	|	Accuracy: 90.18%
Epoch 14/30	|	Loss: 0.2346	|	Accuracy: 93.15%
Epoch 15/30	|	Loss: 0.1460	|	Accuracy: 96.23%
Epoch 16/30	|	Loss: 0.1167	|	Accuracy: 97.42%
Epoch 17/30	|	Loss: 0.0985	|	Accuracy: 97.52%
Epoch 18/30	|	Loss: 0.0834	|	Accuracy: 98.12%
Epoch 19/30	|	Loss: 0.0495	|	Accuracy: 99.31%
Epoch 20/30	|	Loss: 0.0364	|	Accuracy: 99.70%
Epoch 21/30	|	Loss: 0.0228	|	Accuracy: 100.00%
Epoch 22/30	|	Loss: 0.0150	|	Accuracy: 99.
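
The training cell draws its 70% sample from the unseeded global `random` module, so each run selects a different training set. The sketch below shows the same label extraction and 70/30 split with a fixed seed. It is only an illustration: the helper names, the seed value, and the file names (which merely follow the hyphen-separated pattern) are assumptions.

```python
import random


def emotion_label(filename):
    """Return the zero-based emotion class from the third hyphen-separated field."""
    return int(filename.split("-")[2]) - 1


def split_labels(labels, train_percent=70, seed=0):
    """Split a {filename: label} dict into disjoint train and test dicts."""
    items = sorted(labels.items())   # fixed order before shuffling
    rng = random.Random(seed)        # local generator; global state is untouched
    rng.shuffle(items)
    cut = train_percent * len(items) // 100
    return dict(items[:cut]), dict(items[cut:])


# Hypothetical file names: 8 emotion codes x 5 actors = 40 files.
names = [f"03-01-{emotion:02d}-01-01-01-{actor:02d}.png"
         for emotion in range(1, 9) for actor in range(1, 6)]
labels_demo = {name: emotion_label(name) for name in names}
train_demo, test_demo = split_labels(labels_demo)
```

With a fixed seed the same files land in the test set on every run, which makes accuracy figures from different runs comparable.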

## Testing Data

In [8]:
model.eval()  # Setting the model to evaluation mode
test_loss = 0.0
test_correct = 0
test_total = 0
confusion_matrix = torch.zeros(8, 8)
y_preds = []
y_labels = []
y_probs = []

with torch.no_grad():
    for images, labels in test_dataloader:
        images, labels = images.to(device).float(), labels.to(device)
        
        outputs, _ = model(images)
        loss = criterion(outputs, labels)
        
        test_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        
        # Updating the confusion matrix
        for t, p in zip(labels.view(-1), predicted.view(-1)):
            confusion_matrix[t.long(), p.long()] += 1

        y_preds.extend(predicted.cpu().numpy())
        y_labels.extend(labels.cpu().numpy())
        
        probs = torch.softmax(outputs, dim=1)
        y_probs.extend(probs.cpu().numpy())

y_preds = np.array(y_preds)
y_labels = np.array(y_labels)
y_probs = np.array(y_probs)


# Calculating the Accuracy
accuracy = 100 * accuracy_score(y_labels, y_preds)

# Calculating the Precision
precision = precision_score(y_labels, y_preds, average='macro')

# Calculating the Recall
recall = recall_score(y_labels, y_preds, average='macro')

# Calculating the F1 Score
f1 = f1_score(y_labels, y_preds, average='macro')

# Calculating the AUC-ROC
y_labels_bin = label_binarize(y_labels, classes=np.arange(8))
try:
    auc_roc = roc_auc_score(y_labels_bin, y_probs, multi_class='ovr')
except ValueError:
    auc_roc = float('nan')  # ROC AUC can't be computed if some classes are missing

# MCC (for multiclass)
mcc = matthews_corrcoef(y_labels, y_preds)

print(f"Accuracy:\t{accuracy:.2f}%")
print(f"Precision:\t{precision:.4f}")
print(f"Recall:\t\t{recall:.4f}")
print(f"F1 Score:\t{f1:.4f}")
print(f"AUC-ROC:\t{auc_roc:.4f}")
print(f"MCC:\t\t{mcc:.4f}")

Accuracy:	73.38%
Precision:	0.7453
Recall:		0.7185
F1 Score:	0.7232
AUC-ROC:	0.9467
MCC:		0.6947
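
The test cell fills `confusion_matrix` with true labels as rows and predictions as columns. The macro-averaged precision, recall, and F1 can be recovered from such a matrix alone: compute each metric per class, then take the unweighted mean. The following is a minimal pure-Python sketch of that calculation. The function name `macro_scores` is hypothetical and the 3-class counts are made up, not taken from the results above.

```python
def macro_scores(confusion):
    """Macro-averaged precision, recall and F1 from a square confusion matrix
    where confusion[t][p] counts samples of true class t predicted as p."""
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted_c = sum(confusion[t][c] for t in range(n))  # column sum
        actual_c = sum(confusion[c])                          # row sum
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy 3-class matrix (made-up counts).
toy = [
    [4, 1, 0],
    [1, 3, 1],
    [0, 0, 5],
]
macro_p, macro_r, macro_f1 = macro_scores(toy)
```

Because every class contributes equally to a macro average, a class that the model rarely predicts correctly pulls the score down even if it has few samples.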


# CNN with MFCCs

## MFCCs Generator

In [9]:
def MFCCs_generator(audio_path, save_path, sr=22050, n_mfcc=20):
    y, sr = librosa.load(audio_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    plt.figure(figsize=(2.24, 2.24))
    librosa.display.specshow(mfcc, x_axis='time')
    plt.axis('off')
    plt.savefig(save_path, bbox_inches='tight', pad_inches=0)
    plt.close()
    # plt.show()

In [10]:
for audio in os.listdir(save_audio):
    name = audio.split(".")
    Save_path = save_mfcc + name[0] + ".png"
    file_path = save_audio + audio
    
    MFCCs_generator(file_path, Save_path)

I have used the same CNN architecture and trained it on MFCCs. Compared to training on spectrograms, it took more epochs to converge and achieved lower training and test accuracy.

In [23]:
transform = transforms.Compose([
    transforms.Resize((64, 64)),    # Resize images to 64x64 and then
    transforms.ToTensor()           # Converting it to a PyTorch tensor
])

# -------------------------------------------------------------------------- Preparing labels for Dataset --->

labels = {}
for filename in os.listdir("Data/MFCCs/"):
    if filename.endswith(".png"):
        # The third hyphen-separated field of the filename is the emotion code (1-8); shift it to 0-7
        label = int(filename.split("-")[2]) - 1
        labels[filename] = label

# ------------------------------------------------------------------------- Creating train and test Dataset and DataLoader --->

sample = int(70*len(labels.keys())/100)

train_labels = dict(random.sample(list(labels.items()), sample)) # randomly choosing 70% of data for training

test_labels = {x: y for x, y in labels.items() if x not in train_labels} # using the rest of the data for testing

train_dataset = SpectrogramDataset(image_dir="Data/MFCCs/", labels=train_labels, transform=transform)
train_dataloader = DataLoader(train_dataset, batch_size = 20, shuffle=True) #loading the train data

test_dataset = SpectrogramDataset(image_dir="Data/MFCCs/", labels=test_labels, transform=transform)
test_dataloader = DataLoader(test_dataset, batch_size = 20, shuffle=True) #loading the test data

# ------------------------------------------------------------------------- Initializing Model, Loss Function, and Optimizer --->
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CNN(classes=8).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) # updates the parameters during training

# ------------------------------------------------------------------------- Training the network --->
num_epochs = 100

for epoch in range(num_epochs):
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in train_dataloader:
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        outputs, _ = model(images) # output after a pass
        loss = criterion(outputs, labels) # calculating the cross entropy loss

        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate metrics
        running_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    epoch_loss = running_loss / len(train_dataloader)
    epoch_acc = 100 * correct / total
    print(f"Epoch {epoch+1}/{num_epochs}\t|\tLoss: {epoch_loss:.4f}\t|\tAccuracy: {epoch_acc:.2f}%")

Epoch 1/100	|	Loss: 2.0793	|	Accuracy: 13.99%
Epoch 2/100	|	Loss: 2.0684	|	Accuracy: 13.59%
Epoch 3/100	|	Loss: 2.0678	|	Accuracy: 13.99%
Epoch 4/100	|	Loss: 2.0568	|	Accuracy: 16.17%
Epoch 5/100	|	Loss: 1.9868	|	Accuracy: 23.71%
Epoch 6/100	|	Loss: 1.9344	|	Accuracy: 24.40%
Epoch 7/100	|	Loss: 1.8112	|	Accuracy: 31.15%
Epoch 8/100	|	Loss: 1.7558	|	Accuracy: 31.25%
Epoch 9/100	|	Loss: 1.6920	|	Accuracy: 33.83%
Epoch 10/100	|	Loss: 1.6590	|	Accuracy: 36.61%
Epoch 11/100	|	Loss: 1.6164	|	Accuracy: 36.61%
Epoch 12/100	|	Loss: 1.6174	|	Accuracy: 38.00%
Epoch 13/100	|	Loss: 1.5409	|	Accuracy: 41.07%
Epoch 14/100	|	Loss: 1.5184	|	Accuracy: 41.37%
Epoch 15/100	|	Loss: 1.4774	|	Accuracy: 41.96%
Epoch 16/100	|	Loss: 1.4620	|	Accuracy: 42.76%
Epoch 17/100	|	Loss: 1.4346	|	Accuracy: 43.06%
Epoch 18/100	|	Loss: 1.4104	|	Accuracy: 45.34%
Epoch 19/100	|	Loss: 1.3786	|	Accuracy: 46.63%
Epoch 20/100	|	Loss: 1.3780	|	Accuracy: 47.72%
Epoch 21/100	|	Loss: 1.3241	|	Accuracy: 48.71%
Epoch 22/100	|	Loss: 1

In [12]:
model.eval()  # Setting the model to evaluation mode
test_loss = 0.0
test_correct = 0
test_total = 0
confusion_matrix = torch.zeros(8, 8)
y_preds = []
y_labels = []
y_probs = []

with torch.no_grad():
    for images, labels in test_dataloader:
        images, labels = images.to(device).float(), labels.to(device)
        
        outputs, _ = model(images)
        loss = criterion(outputs, labels)
        
        test_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        
        # Updating the confusion matrix
        for t, p in zip(labels.view(-1), predicted.view(-1)):
            confusion_matrix[t.long(), p.long()] += 1

        y_preds.extend(predicted.cpu().numpy())
        y_labels.extend(labels.cpu().numpy())
        
        probs = torch.softmax(outputs, dim=1)
        y_probs.extend(probs.cpu().numpy())

y_preds = np.array(y_preds)
y_labels = np.array(y_labels)
y_probs = np.array(y_probs)


# Calculating the Accuracy
accuracy = 100 * accuracy_score(y_labels, y_preds)

# Calculating the Precision
precision = precision_score(y_labels, y_preds, average='macro')

# Calculating the Recall
recall = recall_score(y_labels, y_preds, average='macro')

# Calculating the F1 Score
f1 = f1_score(y_labels, y_preds, average='macro')

# Calculating the AUC-ROC
y_labels_bin = label_binarize(y_labels, classes=np.arange(8))
try:
    auc_roc = roc_auc_score(y_labels_bin, y_probs, multi_class='ovr')
except ValueError:
    auc_roc = float('nan')  # ROC AUC can't be computed if some classes are missing

# MCC (for multiclass)
mcc = matthews_corrcoef(y_labels, y_preds)

print(f"Accuracy:\t{accuracy:.2f}%")
print(f"Precision:\t{precision:.4f}")
print(f"Recall:\t\t{recall:.4f}")
print(f"F1 Score:\t{f1:.4f}")
print(f"AUC-ROC:\t{auc_roc:.4f}")
print(f"MCC:\t\t{mcc:.4f}")

Accuracy:	59.95%
Precision:	0.6034
Recall:		0.5709
F1 Score:	0.5537
AUC-ROC:	0.9038
MCC:		0.5442


# RNN Implementation

## Transcript Generator

The following code loads the Whisper model to transcribe all audio files from the Audio_data directory. It saves the transcriptions in a dictionary with filenames as keys and the transcriptions as its corresponding values and then saves it to transcripts.json. Before doing this, it checks if transcripts.json already exists so that it doesn't try to recreate the file again every time the notebook is run.

In [13]:
model = whisper.load_model("base")
audio_dir = "Data/Audio_data/"
transcripts = {}

if not os.path.exists("Data/transcripts.json"):
    for file in os.listdir(audio_dir):
        if file.endswith(".wav"):
            result = model.transcribe(os.path.join(audio_dir, file), fp16 = False)
            transcripts[file] = result['text']
    # print(transcripts)
    with open("Data/transcripts.json", "w") as f:
        json.dump(transcripts, f)
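
In the cell above, the `transcripts` dictionary is filled only inside the `if` branch, so it stays empty on later runs when the JSON file already exists. A load-or-create helper returns the same contents in both cases. The sketch below is one way to write it. The function name and the `fake_transcribe` stand-in are assumptions, and no real speech-to-text call is made.

```python
import json
import os
import tempfile


def load_or_create_transcripts(cache_path, audio_files, transcribe):
    """Return {filename: text}, calling `transcribe` only when no cache exists."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)

    transcripts = {name: transcribe(name) for name in audio_files}
    with open(cache_path, "w") as f:
        json.dump(transcripts, f)
    return transcripts


calls = []


def fake_transcribe(name):
    """Stand-in for a speech-to-text call; records how often it runs."""
    calls.append(name)
    return f"text for {name}"


with tempfile.TemporaryDirectory() as root:
    cache = os.path.join(root, "transcripts.json")
    first = load_or_create_transcripts(cache, ["a.wav", "b.wav"], fake_transcribe)
    second = load_or_create_transcripts(cache, ["a.wav", "b.wav"], fake_transcribe)
```

The second call reads the cached file and never invokes the transcription function, so the expensive step runs once per set of audio files.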

## Dataloader for RNN

The TextEmotionDataset class processes a dataset of transcriptions and their corresponding emotion labels. It loads the transcript data from the JSON file, extracts text and emotion labels, and builds a vocabulary dictionary. The class encodes each text into a sequence of indices based on the vocabulary. It implements methods for getting the length of the dataset and accessing individual samples.

The collate function is used to prepare batches of data by padding sequences to ensure uniform length and converting emotion labels into tensors.

In [14]:
class TextEmotionDataset(Dataset):
    def __init__(self, transcript_file, train = True):
        with open(transcript_file) as file:
            self.data = json.load(file)    # loads the data
        
        complete_data = list(self.data.items())
        split_index = int(70* len(complete_data)/100)
        random.seed(42)
        random.shuffle(complete_data)

        if train:
            sampled_data = dict(complete_data[:split_index])
        else:
            sampled_data = dict(complete_data[split_index:])
            
        self.data = sampled_data
        
        self.texts = list(self.data.values())    # gets all the text sentences from the transcripts
        self.files = list(self.data.keys())      # gets the file name 
        self.labels = [int(f.split("-")[2]) - 1 for f in self.files]  # gets the emotion code and makes it start from 0 instead of 1

        self.word_dictionary = {"<PAD>": 0, "<UNK>": 1}    # dictionary that assigns numbers to words
        self._build_vocab()                                # this function fills up the dictionary 

    def _build_vocab(self):
        index = 2    # starting from index 2, as 0 and 1 are already assigned to <PAD> and <UNK>

        # each new word in each sentence is assigned the next index and added to the dictionary
        for text in self.texts:
            for word in text.lower().split(): 
                if word not in self.word_dictionary:
                    self.word_dictionary[word] = index
                    index += 1

    def encode(self, text):
        # encodes a given sentence using the dictionary from the class; unknown words map to <UNK> (index 1)
        return torch.tensor([
            self.word_dictionary.get(word.lower(), 1) for word in text.split()
        ], dtype=torch.long)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        return self.encode(self.texts[index]), self.labels[index]

def rnn_collate(batch):
    texts, labels = zip(*batch)    # unzips the batch from a list of tuples into two tuples: texts and labels
    texts = pad_sequence(texts, batch_first=True, padding_value=0) # pads the shorter sequences with 0 and making all the sequences of same length
    return texts, torch.tensor(labels)
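
Each `TextEmotionDataset` instance builds `word_dictionary` from its own split, so a training instance and a test instance can assign different indices to the same word. One common alternative is to build the vocabulary once from the training texts and reuse it for encoding every split. The sketch below illustrates that idea in plain Python. The function names and the example sentences are assumptions, not part of the notebook.

```python
PAD, UNK = 0, 1


def build_vocab(texts):
    """Assign an index to every distinct lower-cased word, in first-seen order."""
    vocab = {"<PAD>": PAD, "<UNK>": UNK}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Map words to indices; words outside the vocabulary become <UNK>."""
    return [vocab.get(word, UNK) for word in text.lower().split()]


# Illustrative sentences standing in for the two splits.
train_texts = ["Kids are talking by the door", "Dogs are sitting by the door"]
test_texts = ["Cats are talking by the window"]

vocab = build_vocab(train_texts)              # built once, from training text only
encoded_test = encode(test_texts[0], vocab)   # "cats" and "window" become <UNK>
```

With a shared vocabulary, the indices fed to the embedding layer at test time refer to the same words the layer saw during training.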

## RNN Implementation

The following code implements a basic Recurrent Neural Network (RNN) for text classification. It first converts word IDs to word vectors with an embedding layer and feeds them into a GRU. The final hidden state is used as the feature representation, which is fed into a fully connected layer for the emotion class prediction. Both the class scores (logits) and the 128-dimensional features are returned.

In [15]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        _, h_n = self.rnn(x)
        features = h_n[-1]
        logits = self.fc(features)
        return logits, features

## Training RNN model

The following code trains the RNN model to classify emotions from the text transcripts for 200 epochs. Even after that many epochs, the training accuracy does not improve significantly beyond 20%.

In [16]:
dataset = TextEmotionDataset("Data/transcripts.json")
vocab_size = len(dataset.word_dictionary)
loader = DataLoader(dataset, batch_size = 20, shuffle = True, collate_fn = rnn_collate)

# Setting up the Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = RNN(vocab_size=vocab_size).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

# Training the Model
for epoch in range(200):
    model.train()
    total_loss, correct = 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs, _ = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _, preds = outputs.max(1)
        correct += (preds == labels).sum().item()

    print(f"Epoch {epoch+1}, Loss: {total_loss:.2f}, Accuracy: {100 * correct / len(dataset):.2f}%")

Epoch 1, Loss: 106.07, Accuracy: 15.38%
Epoch 2, Loss: 103.98, Accuracy: 17.36%
Epoch 3, Loss: 103.12, Accuracy: 16.87%
Epoch 4, Loss: 103.13, Accuracy: 16.96%
Epoch 5, Loss: 102.84, Accuracy: 17.76%
Epoch 6, Loss: 102.54, Accuracy: 18.06%
Epoch 7, Loss: 102.44, Accuracy: 18.15%
Epoch 8, Loss: 101.69, Accuracy: 20.14%
Epoch 9, Loss: 101.10, Accuracy: 19.15%
Epoch 10, Loss: 100.25, Accuracy: 20.24%
Epoch 11, Loss: 100.02, Accuracy: 19.05%
Epoch 12, Loss: 99.62, Accuracy: 20.54%
Epoch 13, Loss: 99.03, Accuracy: 21.13%
Epoch 14, Loss: 99.14, Accuracy: 20.34%
Epoch 15, Loss: 99.03, Accuracy: 20.04%
Epoch 16, Loss: 98.19, Accuracy: 22.22%
Epoch 17, Loss: 98.71, Accuracy: 20.73%
Epoch 18, Loss: 98.98, Accuracy: 19.94%
Epoch 19, Loss: 98.39, Accuracy: 22.32%
Epoch 20, Loss: 98.33, Accuracy: 20.73%
Epoch 21, Loss: 98.58, Accuracy: 19.74%
Epoch 22, Loss: 98.42, Accuracy: 22.12%
Epoch 23, Loss: 98.44, Accuracy: 22.02%
Epoch 24, Loss: 98.38, Accuracy: 20.63%
Epoch 25, Loss: 98.45, Accuracy: 21.23

When an RNN is trained on transcripts alone for emotion classification, one major drawback is that the text does not carry vocal context. For instance, consider the sentence **"Kids are talking by the door."** That exact sentence can carry different emotion labels in the dataset depending on how it was spoken: nervously, angrily, happily, or seriously. The RNN learns only from the text, so it cannot tell these cases apart when the emotion changes but the words do not. It has no access to the tone, pitch, or manner of speech, so it cannot learn properly when the same sentence appears with different emotion labels. As a result, it often just guesses, which leads to low training accuracy, around 20% here, because the emotions are not clear from the text alone.
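
The argument above can be turned into a number. If identical texts carry different labels, the best a text-only classifier can do is answer each text with its most frequent label. The sketch below computes that ceiling. The function name is hypothetical, and the toy data assumes two sentences that each occur exactly once with each of 8 emotions. The real transcripts may be distributed differently and can also vary through transcription errors.

```python
from collections import Counter, defaultdict


def text_only_ceiling(samples):
    """Best accuracy any text-only classifier can reach on `samples`,
    a list of (text, label) pairs: for each distinct text it can at most
    always answer that text's most frequent label."""
    labels_by_text = defaultdict(Counter)
    for text, label in samples:
        labels_by_text[text][label] += 1
    best_hits = sum(max(counts.values()) for counts in labels_by_text.values())
    return best_hits / len(samples)


# Toy data: two sentences, each paired once with each of 8 emotion classes.
toy_samples = [(sentence, emotion)
               for sentence in ("kids are talking by the door",
                                "dogs are sitting by the door")
               for emotion in range(8)]
ceiling = text_only_ceiling(toy_samples)   # 2 correct out of 16
```

Under these toy assumptions the ceiling equals the chance level for 8 classes, which is the situation the paragraph describes.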

## Testing the RNN model

In [17]:
test_dataset = TextEmotionDataset("Data/transcripts.json", train=False)
test_loader = DataLoader(test_dataset, batch_size=20, shuffle=False, collate_fn=rnn_collate)

# Setting the Model to evaluation mode
model.eval()
y_preds = []
y_labels = []
y_probs = []

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs = inputs.to(device).long()
        labels = labels.to(device)

        outputs, _ = model(inputs)
        probs = torch.softmax(outputs, dim=1)
        _, predicted_output = probs.max(1)

        y_preds.extend(predicted_output.cpu().numpy())
        y_labels.extend(labels.cpu().numpy())
        y_probs.extend(probs.cpu().numpy())  # for AUC-ROC calculation


y_labels = np.array(y_labels)
y_pred = np.array(y_preds)
y_probs = np.array(y_probs)

# Accuracy Calculation
accuracy = 100 * accuracy_score(y_labels, y_pred)

# Precision Calculation
precision = precision_score(y_labels, y_pred, average='macro', zero_division=0)

# Recall Calculation
recall = recall_score(y_labels, y_pred, average='macro', zero_division=0)

# F1 score Calculation
f1 = f1_score(y_labels, y_pred, average='macro', zero_division=0)

# MCC Calculation
mcc = matthews_corrcoef(y_labels, y_pred)

# AUC-ROC Calculation
y_labels_bin = label_binarize(y_labels, classes=np.arange(8))
try:
    auc_roc = roc_auc_score(y_labels_bin, y_probs, multi_class='ovr')
except ValueError:
    auc_roc = float('nan')  # ROC AUC can't be computed if some classes are missing

# Printing results
print(f"Accuracy:  {accuracy:.4f}%")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"AUC-ROC:   {auc_roc:.4f}")
print(f"MCC:       {mcc:.4f}")

Accuracy:  12.9630%
Precision: 0.0947
Recall:    0.1260
F1-Score:  0.0576
AUC-ROC:   0.5164
MCC:       0.0006


# Fusion Model

## Fusion Dataset loader

In [18]:
class FusionDataset(Dataset):
    def __init__(self, image_dir, transcript_file, vocab_limit=10000, train = True):
        self.image_dir = image_dir

        with open(transcript_file) as file:
            self.transcripts = json.load(file)    # loads the data
        
        complete_data = list(self.transcripts.items())
        split_index = int(70* len(complete_data)/100)
        random.seed(42)
        random.shuffle(complete_data)

        if train:
            sampled_data = dict(complete_data[:split_index])
        else:
            sampled_data = dict(complete_data[split_index:])
            
        self.transcripts = sampled_data
        self.filenames = list(self.transcripts.keys())
        self.labels = [int(f.split("-")[2]) - 1 for f in self.filenames]

        self.texts = list(self.transcripts.values())
        self.word_dictionary = {"<PAD>": 0, "<UNK>": 1}
        
        self._build_vocab(vocab_limit)

        # Image transform
        self.transform = transforms.Compose([
            transforms.Resize((64, 64)),
            transforms.ToTensor()
        ])
    def _build_vocab(self, vocab_limit):
        word_freq = {}
        for text in self.texts:
            for word in text.lower().split():
                word_freq[word] = word_freq.get(word, 0) + 1

        sorted_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:vocab_limit - 2]
        for i, (word, _) in enumerate(sorted_words, start=2):
            self.word_dictionary[word] = i

    def encode(self, text):
        # encodes a given sentence using the dictionary from the class; unknown words map to <UNK> (index 1)
        return torch.tensor([
            self.word_dictionary.get(word.lower(), 1) for word in text.split()
        ], dtype=torch.long)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        file = self.filenames[index]
        label = self.labels[index]

        # Image input
        image_path = os.path.join(self.image_dir, file.replace(".wav", ".png"))
        image = Image.open(image_path).convert("RGB")
        image = self.transform(image)

        # Text input
        text = self.encode(self.transcripts[file])
        return image, text, label

def fusion_collate(batch):
    images, texts, labels = zip(*batch)
    texts = pad_sequence(texts, batch_first=True, padding_value=0)
    return torch.stack(images), texts, torch.tensor(labels)
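
For readers unfamiliar with padding, the following pure-Python sketch shows what right-padding a batch of variable-length index sequences with 0 looks like. It mirrors the effect of `pad_sequence(..., batch_first=True, padding_value=0)` on nested lists rather than tensors. The helper name `pad_batch` and the sample indices are assumptions.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad each sequence with padding_value to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (longest - len(seq))
            for seq in sequences]


batch = [[5, 2, 9], [7], [4, 4]]   # three encoded sentences of different length
padded = pad_batch(batch)          # every row now has length 3
```

Index 0 is reserved for `<PAD>` in the vocabulary, so the padded positions cannot be confused with real words.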


## Implementing Fusion model

In [19]:
class FusionModel(nn.Module):
    def __init__(self, audio_dim=128, text_dim=128, num_classes=8):
        super().__init__()
        self.fc1 = nn.Linear(audio_dim + text_dim, 256)
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, cnn_features, rnn_features):
        # Concatenate audio and text features along feature dimension
        x = torch.cat([cnn_features, rnn_features], dim=1)  # Shape: [batch_size, 256]
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
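
A quick shape check can confirm that the fusion head accepts two 128-dimensional feature vectors and returns one logit per class. The sketch below repeats the class definition so it runs on its own, and uses random tensors as stand-ins for the CNN and RNN features (an assumption made only for illustration).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModel(nn.Module):  # same definition as above, repeated so the snippet is self-contained
    def __init__(self, audio_dim=128, text_dim=128, num_classes=8):
        super().__init__()
        self.fc1 = nn.Linear(audio_dim + text_dim, 256)
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, cnn_features, rnn_features):
        x = torch.cat([cnn_features, rnn_features], dim=1)
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)

fusion_check = FusionModel()
fusion_check.eval()  # disable dropout for a deterministic forward pass

# Random stand-ins for a batch of 4 audio and text feature vectors
audio_feat = torch.randn(4, 128)
text_feat = torch.randn(4, 128)

with torch.no_grad():
    logits = fusion_check(audio_feat, text_feat)

print(logits.shape)  # torch.Size([4, 8])
```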

## Training Fusion Model

In [20]:
train_dataset = FusionDataset("Data/Spectrograms/", "Data/transcripts.json")
vocab_size = len(train_dataset.word_dictionary)  # vocabulary built by the fusion training set
fusion_loader = DataLoader(train_dataset, batch_size=20, shuffle=True, collate_fn=fusion_collate)

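# NOTE: cnn and rnn are newly constructed here. Unless their trained weights are
# loaded first, the fusion head is trained on features from randomly initialized networks.
cnn = CNN(classes=8).to(device)
rnn = RNN(vocab_size=vocab_size, num_classes=8).to(device)
fusion = FusionModel(audio_dim=128, text_dim=128).to(device)

# The feature extractors are not updated below, so keep them in evaluation mode
cnn.eval()
rnn.eval()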

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(fusion.parameters(), lr=0.003)

for epoch in range(200):
    total_loss, correct = 0, 0
    fusion.train()

    for images, texts, labels in fusion_loader:
        images, texts, labels = images.to(device), texts.to(device), labels.to(device)

        # Feature extractors are frozen: no gradients flow into cnn or rnn
        with torch.no_grad():
            _, audio_feat = cnn(images)
            _, text_feat = rnn(texts)

        outputs = fusion(audio_feat, text_feat)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        correct += (outputs.argmax(1) == labels).sum().item()

    accuracy = correct / len(fusion_loader.dataset) * 100
    print(f"Epoch {epoch+1}: Loss={total_loss:.2f}, Accuracy={accuracy:.2f}%")

Epoch 1: Loss=104.45, Accuracy=15.67%
Epoch 2: Loss=102.93, Accuracy=17.76%
Epoch 3: Loss=102.71, Accuracy=16.17%
Epoch 4: Loss=102.31, Accuracy=17.16%
Epoch 5: Loss=102.24, Accuracy=16.17%
Epoch 6: Loss=101.96, Accuracy=17.56%
Epoch 7: Loss=101.52, Accuracy=20.73%
Epoch 8: Loss=101.60, Accuracy=19.25%
Epoch 9: Loss=101.31, Accuracy=19.64%
Epoch 10: Loss=101.94, Accuracy=18.35%
Epoch 11: Loss=101.27, Accuracy=19.44%
Epoch 12: Loss=101.54, Accuracy=19.35%
Epoch 13: Loss=101.01, Accuracy=19.74%
Epoch 14: Loss=100.66, Accuracy=22.32%
Epoch 15: Loss=100.72, Accuracy=19.05%
Epoch 16: Loss=100.08, Accuracy=20.44%
Epoch 17: Loss=99.85, Accuracy=21.92%
Epoch 18: Loss=99.78, Accuracy=21.03%
Epoch 19: Loss=99.39, Accuracy=20.73%
Epoch 20: Loss=99.23, Accuracy=23.02%
Epoch 21: Loss=97.90, Accuracy=24.90%
Epoch 22: Loss=97.41, Accuracy=25.20%
Epoch 23: Loss=97.38, Accuracy=24.21%
Epoch 24: Loss=97.39, Accuracy=25.79%
Epoch 25: Loss=96.25, Accuracy=27.58%
Epoch 26: Loss=96.20, Accuracy=25.50%
Epoch

## Testing Fusion model

In [21]:
# Create test dataset and loader
test_dataset = FusionDataset("Data/Spectrograms/", "Data/transcripts.json", train=False)
test_loader = DataLoader(test_dataset, batch_size=20, shuffle=False, collate_fn=fusion_collate)

fusion.eval()  # Set model to evaluation mode
cnn.eval()
rnn.eval()

y_preds = []
y_labels = []
y_probs = []

total_loss = 0
correct = 0

with torch.no_grad():
    for images, texts, labels in test_loader:
        images, texts, labels = images.to(device), texts.to(device), labels.to(device)

        _, audio_feat = cnn(images)
        _, text_feat = rnn(texts)

        outputs = fusion(audio_feat, text_feat)
        loss = criterion(outputs, labels)
        probs = torch.softmax(outputs, dim=1)
        preds = torch.argmax(probs, dim=1)

        total_loss += loss.item()
        y_preds.extend(preds.cpu().numpy())
        y_labels.extend(labels.cpu().numpy())
        y_probs.extend(probs.cpu().numpy())

y_labels = np.array(y_labels)
y_pred = np.array(y_preds)
y_probs = np.array(y_probs)

# Accuracy Calculation
accuracy = 100 * accuracy_score(y_labels, y_pred)

# Precision Calculation
precision = precision_score(y_labels, y_pred, average='macro', zero_division=0)

# Recall Calculation
recall = recall_score(y_labels, y_pred, average='macro', zero_division=0)

# F1 score Calculation
f1 = f1_score(y_labels, y_pred, average='macro', zero_division=0)

# MCC Calculation
mcc = matthews_corrcoef(y_labels, y_pred)

# AUC-ROC Calculation
y_labels_bin = label_binarize(y_labels, classes=np.arange(8))
try:
    auc_roc = roc_auc_score(y_labels_bin, y_probs, multi_class='ovr')
except ValueError:
    auc_roc = float('nan')  # ROC AUC can't be computed if some classes are missing

print(f"Test Loss     = {total_loss:.2f}")
print(f"Accuracy      = {accuracy:.4f}")
print(f"Precision     = {precision:.4f}")
print(f"Recall        = {recall:.4f}")
print(f"F1 Score      = {f1:.4f}")
print(f"AUC-ROC       = {auc_roc:.4f}")
print(f"Matthews Corr = {mcc:.4f}")

Test Loss     = 35.91
Accuracy      = 39.8148
Precision     = 0.3910
Recall        = 0.4129
F1 Score      = 0.3741
AUC-ROC       = 0.8152
Matthews Corr = 0.3220


The fusion model's low accuracy (about 40% on the test set) is largely due to the weak performance of the RNN, which struggles to classify emotions from transcripts alone. The transcriptions themselves are accurate; the core problem is that the words alone often don’t carry enough emotional context. As a result, the RNN performs little better than guessing, typically achieving only around 15–20% accuracy against a chance level of 12.5% for 8 classes, and these weak text features dilute the stronger audio features when the two are concatenated in the fusion model.
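
One way to probe this explanation is to re-evaluate the fusion head with the text features replaced by zeros and compare the result with the normal run. The sketch below is only an illustration of that call pattern: `fusion_accuracy` and `TinyFusion` are hypothetical helpers, and the features and labels are random stand-ins rather than outputs of the notebook's CNN and RNN. Zeroing features on a head that was trained with them is a rough check, not a substitute for retraining an audio-only model.

```python
import torch
import torch.nn as nn

def fusion_accuracy(fusion, audio_feat, text_feat, labels, zero_text=False):
    """Accuracy (%) of a fusion head; optionally replace text features with zeros."""
    if zero_text:
        text_feat = torch.zeros_like(text_feat)
    fusion.eval()
    with torch.no_grad():
        preds = fusion(audio_feat, text_feat).argmax(dim=1)
    return 100.0 * (preds == labels).float().mean().item()

class TinyFusion(nn.Module):
    """Stand-in fusion head (hypothetical), used only to make the snippet runnable."""
    def __init__(self, audio_dim=128, text_dim=128, num_classes=8):
        super().__init__()
        self.fc = nn.Linear(audio_dim + text_dim, num_classes)

    def forward(self, audio_feat, text_feat):
        return self.fc(torch.cat([audio_feat, text_feat], dim=1))

torch.manual_seed(0)
head = TinyFusion()
audio = torch.randn(16, 128)           # random stand-in for CNN features
text = torch.randn(16, 128)            # random stand-in for RNN features
labels = torch.randint(0, 8, (16,))    # random stand-in labels

acc_both = fusion_accuracy(head, audio, text, labels)
acc_audio_only = fusion_accuracy(head, audio, text, labels, zero_text=True)
chance = 100.0 / 8                     # chance level for 8 balanced classes

print(f"both modalities: {acc_both:.2f}%")
print(f"text zeroed:     {acc_audio_only:.2f}%")
print(f"chance level:    {chance:.1f}%")
```

With the trained models, `audio` and `text` would be the features returned by `cnn` and `rnn` on the test set; if accuracy barely changes when the text features are zeroed, the text branch is contributing little.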