## Emotion Recognition with Multimodal Data: Combining Audio and Text Inputs

This notebook builds an emotion recognition system that combines text and audio inputs, using the TESS (Toronto Emotional Speech Set) dataset. It first trains a text-only baseline: a BERT-based classifier that predicts emotions from the text transcripts alone. This step simulates a traditional call center setup that relies solely on text.

Next, the notebook introduces a multimodal model that uses audio features alongside the text features. Audio captures nuances in speech, such as pitch and pace, that help detect emotions like sarcasm or stress that text alone might miss. By combining MFCC audio features with text embeddings from BERT, the multimodal model is intended to recognize emotions more reliably and give better insight into a customer's emotional state.
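
The fusion idea described above can be illustrated with a toy example. The sketch below is not the notebook's model: it uses random NumPy arrays in place of real BERT and audio embeddings, and the names `fuse_and_classify`, `text_dim`, and `audio_dim` are hypothetical. It only shows how concatenating a text vector with an audio vector yields one feature vector that a single linear layer maps to seven emotion scores.

```python
import numpy as np

def fuse_and_classify(text_emb, audio_emb, weights, bias):
    """Concatenate per-example text and audio vectors, then apply one linear layer."""
    fused = np.concatenate([text_emb, audio_emb], axis=1)  # (batch, text_dim + audio_dim)
    return fused @ weights + bias                           # (batch, num_classes)

rng = np.random.default_rng(0)
batch, text_dim, audio_dim, num_classes = 4, 768, 64, 7  # illustrative sizes

text_emb = rng.normal(size=(batch, text_dim))    # stand-in for a BERT pooled output
audio_emb = rng.normal(size=(batch, audio_dim))  # stand-in for projected audio features
weights = rng.normal(size=(text_dim + audio_dim, num_classes)) * 0.01
bias = np.zeros(num_classes)

logits = fuse_and_classify(text_emb, audio_emb, weights, bias)
print(logits.shape)  # (4, 7)
```

This style of combination is often called late fusion by concatenation: each modality is encoded separately, and only the final classifier sees both.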


### Section 1: Setup and Load Dependencies

In [None]:
# Install required libraries for working with Hugging Face datasets, text processing, and audio features.

!pip install datasets==3.6.0 transformers librosa torch scikit-learn

# Import required modules
from datasets import load_dataset
import librosa
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

Collecting datasets==3.6.0
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 4.0.0
    Uninstalling datasets-4.0.0:
      Successfully uninstalled datasets-4.0.0
Successfully installed datasets-3.6.0



### Section 2: Load and Preprocess the TESS Dataset

In [None]:
# Load the dataset from Hugging Face and explore its structure.

dataset = load_dataset("myleslinder/tess", split="train", trust_remote_code=True)
df = pd.DataFrame(dataset)  # Convert to pandas DataFrame for easier manipulation

# Use a 20% random sample of the data so the notebook runs quickly
df = df.sample(frac=0.2, random_state=42)
display(df.head())  # Display the first few rows to understand the dataset structure

# Preprocess the Dataset
# Extract text transcripts, audio paths, and labels, then split them into training and validation sets.
texts = df['text']  # Extract text transcripts
audio_paths = df['audio'].apply(lambda x: x['path'])  # Extract paths to audio files
labels = df['label']  # Extract emotion labels

# Split the dataset into training and validation sets (80% train, 20% validation)
train_texts, val_texts, train_paths, val_paths, train_labels, val_labels = train_test_split(
    texts, audio_paths, labels, test_size=0.2, random_state=42)

Unnamed: 0,path,audio,speaker_id,speaker_age,text,word,label
1088,/root/.cache/huggingface/datasets/downloads/ex...,{'path': '/root/.cache/huggingface/datasets/do...,OAF,64,Say the word shout,shout,0
772,/root/.cache/huggingface/datasets/downloads/ex...,{'path': '/root/.cache/huggingface/datasets/do...,OAF,64,Say the word near,near,1
2161,/root/.cache/huggingface/datasets/downloads/ex...,{'path': '/root/.cache/huggingface/datasets/do...,YAF,26,Say the word nag,nag,6
1192,/root/.cache/huggingface/datasets/downloads/ex...,{'path': '/root/.cache/huggingface/datasets/do...,OAF,64,Say the word third,third,1
1916,/root/.cache/huggingface/datasets/downloads/ex...,{'path': '/root/.cache/huggingface/datasets/do...,YAF,26,Say the word keep,keep,6


### Section 3: Text-only Model - Train a Text-based Classifier


In [None]:
# Tokenize the text data and train a text-based emotion classifier using a pre-trained transformer model.

# 3.1 Tokenization and Model Setup
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # Load tokenizer for BERT model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=7)  # Load BERT model with 7 labels for emotions

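# Define a function to tokenize texts
def tokenize_function(texts):
    # Pad to the longest transcript in the list rather than BERT's 512-token maximum;
    # the TESS transcripts are only a few words long, so this saves memory and compute.
    return tokenizer(texts, padding=True, truncation=True, return_tensors='pt')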

# Tokenize training and validation texts
train_encodings = tokenize_function(train_texts.tolist())
val_encodings = tokenize_function(val_texts.tolist())

# 3.2 Dataset Preparation
# Create a PyTorch dataset to use with the Trainer
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

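    def __getitem__(self, idx):
        # Return each item as a dictionary containing input encodings and the corresponding label.
        # The encodings are already tensors (return_tensors='pt'), so index them directly;
        # re-wrapping them with torch.tensor() triggers a copy-construct UserWarning.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item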

    def __len__(self):
        return len(self.labels)

# Prepare the datasets
train_dataset = TextDataset(train_encodings, train_labels.tolist())
val_dataset = TextDataset(val_encodings, val_labels.tolist())

# 3.3 Training the Text-only Model
# Define training arguments and train the model using the Hugging Face Trainer
epochs = 3
training_args = TrainingArguments(
    output_dir='./temp_output',  # Set a temporary directory (required by Hugging Face)
    eval_strategy='epoch',  # Evaluate the model at the end of each epoch
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=epochs,  # Set the number of training epochs
    logging_dir='./logs',  # Directory for training logs
    report_to="none",  # Disable reporting to external loggers such as Weights & Biases
)

# Initialize the Trainer for text-only model
text_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=lambda p: {
        'accuracy': accuracy_score(p.label_ids, np.argmax(p.predictions, axis=1)),
        'f1': f1_score(p.label_ids, np.argmax(p.predictions, axis=1), average='weighted')
    }
)

# Train the text-only model
text_trainer.train()

# Evaluate the text-only model
text_eval_results = text_trainer.evaluate()
text_accuracy = text_eval_results['eval_accuracy']
text_f1 = text_eval_results['eval_f1']
print("Text-Only Model - Accuracy:", text_accuracy)
print("Text-Only Model - F1 Score:", text_f1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,2.00986,0.080357,0.011954
2,No log,1.94924,0.133929,0.031637
3,No log,1.949364,0.125,0.028455




Text-Only Model - Accuracy: 0.125
Text-Only Model - F1 Score: 0.028455284552845527
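
One way to read these numbers is against a chance-level reference. With seven emotion classes, uniform random guessing is right about 1/7 ≈ 0.143 of the time, and the sample rows shown in Section 2 suggest that every transcript has the form "Say the word ...", so the text may carry little or no emotion signal. The sketch below computes a majority-class baseline. The helper `majority_baseline` and the toy labels are hypothetical; in this notebook one would pass `val_labels.tolist()` instead.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

toy_labels = [0, 1, 1, 2, 1, 6, 3]  # made-up labels, for illustration only
label, acc = majority_baseline(toy_labels)
print(f"Always predicting class {label} gives accuracy {acc:.3f}")
print(f"Uniform guessing over 7 classes: {1 / 7:.3f}")
```

A classifier that does not clearly beat both references has probably not learned anything useful from its inputs.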


### Section 4: Multimodal Model - Train a Model with Combined Text and Audio Features


In [None]:
# Train a model using both text and audio features.

# 4.1 Extract Audio Features
# Use librosa to extract MFCC (Mel-Frequency Cepstral Coefficients) features from each audio file.
def extract_audio_features(audio_paths):
    features = []
    for path in audio_paths:
        y, sr = librosa.load(path, sr=None)  # Load the audio file
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # Extract 13 MFCC features
        features.append(np.mean(mfcc, axis=1))  # Average the features across time
    return np.array(features)

# Extract features for training and validation sets
train_audio_features = extract_audio_features(train_paths)
val_audio_features = extract_audio_features(val_paths)

# 4.2 Define a Combined Dataset
# Combine text and audio features in a custom dataset class.
class CombinedDataset(torch.utils.data.Dataset):
    def __init__(self, text_encodings, audio_features, labels):
        self.text_encodings = text_encodings
        self.audio_features = audio_features
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.text_encodings.items()}  # Encodings are already tensors
        item['audio_features'] = torch.tensor(self.audio_features[idx], dtype=torch.float32)
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Prepare the datasets for multimodal input
train_combined_dataset = CombinedDataset(train_encodings, train_audio_features, train_labels.tolist())
val_combined_dataset = CombinedDataset(val_encodings, val_audio_features, val_labels.tolist())

# 4.3 Define the Multimodal Model
# Create a model that takes both text and audio features as input.
class CombinedModel(torch.nn.Module):
    def __init__(self):
        super(CombinedModel, self).__init__()
        self.text_model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=7)
        self.fc_audio = torch.nn.Linear(13, 64)  # A fully connected layer for audio features
        self.fc_combined = torch.nn.Linear(64 + self.text_model.config.hidden_size, 7)  # Final layer combining both inputs

    def forward(self, input_ids, attention_mask, audio_features):
        text_output = self.text_model.bert(input_ids, attention_mask=attention_mask).pooler_output  # Get pooled output from BERT
        audio_output = torch.relu(self.fc_audio(audio_features))  # Apply ReLU activation to audio features
        combined = torch.cat((text_output, audio_output), dim=1)  # Concatenate text and audio features
        return self.fc_combined(combined)  # Pass through the final layer for classification

combined_model = CombinedModel()

# 4.4 Train the Multimodal Model
# Training loop for the multimodal model using PyTorch (manual training loop for custom models).
from torch.utils.data import DataLoader

# Define a DataLoader for batching
train_loader = DataLoader(train_combined_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_combined_dataset, batch_size=8)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
combined_model.to(device)
optimizer = torch.optim.Adam(combined_model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Training loop
for epoch in range(epochs):
    combined_model.train()
    total_loss = 0

    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        audio_features = batch['audio_features'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass
        outputs = combined_model(input_ids=input_ids, attention_mask=attention_mask, audio_features=audio_features)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(train_loader)}")

# 4.5 Evaluate Multimodal Model
combined_model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        audio_features = batch['audio_features'].to(device)
        labels = batch['labels'].to(device)

        outputs = combined_model(input_ids=input_ids, attention_mask=attention_mask, audio_features=audio_features)
        preds = torch.argmax(outputs, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Calculate metrics for the multimodal model
combined_accuracy = accuracy_score(all_labels, all_preds)
combined_f1 = f1_score(all_labels, all_preds, average='weighted')
print("Multimodal Model - Accuracy:", combined_accuracy)
print("Multimodal Model - F1 Score:", combined_f1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 6.104967811277935
Epoch 2, Loss: 4.765188093696322
Epoch 3, Loss: 4.095476506011827
Multimodal Model - Accuracy: 0.13392857142857142
Multimodal Model - F1 Score: 0.10079053172993335
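
The first-epoch loss of about 6.1 is well above ln(7) ≈ 1.95, the cross-entropy of a uniform guess over seven classes. One possible cause, not verified here, is that the time-averaged MFCC values are fed to the network unscaled, and MFCC coefficients can differ widely in magnitude. A minimal sketch of standardizing the features with statistics from the training set only follows. `standardize` is a hypothetical helper, and the arrays are random stand-ins for `train_audio_features` and `val_audio_features`.

```python
import numpy as np

def standardize(train, val, eps=1e-8):
    """Scale each feature column to zero mean and unit variance using training statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / (std + eps), (val - mean) / (std + eps)

rng = np.random.default_rng(0)
# Random stand-ins with deliberately mismatched scales across the 13 columns
scales = np.linspace(1.0, 300.0, 13)
train_feats = rng.normal(size=(100, 13)) * scales
val_feats = rng.normal(size=(25, 13)) * scales

train_std, val_std = standardize(train_feats, val_feats)
print(train_std.mean(axis=0).round(3))  # close to 0 for every column
print(train_std.std(axis=0).round(3))   # close to 1 for every column
```

In this notebook, the call would go between feature extraction (4.1) and building `CombinedDataset` (4.2). Reusing the training mean and standard deviation for the validation set avoids leaking validation statistics into preprocessing.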


### Section 5: Compare the Two Models


In [None]:
# Display the comparison in a tabular format using pandas
data = {
    'Model': ['Text-Only', 'Multimodal (Text + Audio)'],
    'Accuracy': [text_accuracy, combined_accuracy],
    'F1 Score': [text_f1, combined_f1]
}
comparison_df = pd.DataFrame(data)
print(comparison_df)

                       Model  Accuracy  F1 Score
0                  Text-Only  0.125000  0.028455
1  Multimodal (Text + Audio)  0.133929  0.100791
