# Multimodal Emotion Recognition System

## Aim

**Aim:** create a multimodal emotion recognition system based on speech and text.

I aim to combine: 
* emotion2vec's audio-based emotion embeddings
* text-based emotion embeddings

The text-based emotion encoder proposed here treats sentiments as structured entities within an embedding space. Unlike a typical classifier, which treats emotions as simple labels to be extracted, the present encoder embeds *emotion sequences* rather than extracting discrete labels. This framework was chosen because it provides a means of encoding the latent information that underlies the relationships between emotions within the flow of a text (Liampis, Karanikola, and Kotsiantis [2025](https://www.sciencedirect.com/science/article/pii/S0925231225004941)).


## Method

### Tools

* [emotion2vec: audio-based emotion embeddings](https://huggingface.co/emotion2vec/emotion2vec_plus_large)
* [emotion-english-distilroberta-base](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base)
* [CrisperWhisper](https://github.com/nyrahealth/CrisperWhisper)

**Emotion2vec** is the first universal speech emotion representation model. Through self-supervised pre-training, emotion2vec can extract emotion representations across different tasks, languages, and scenarios. It aims to create a universal emotional representation space.

**Emotion-english-distilroberta-base** is a fine-tuned checkpoint of DistilRoBERTa-base, trained on 6 diverse datasets, that predicts Ekman's 6 basic emotions plus a neutral class. DistilRoBERTa is a distilled version of RoBERTa (Liu et al. 2019), a transformer model pretrained on a large corpus of English data in a self-supervised fashion.

**CrisperWhisper** is an advanced variant of OpenAI's Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps. Unlike the original Whisper, which tends to omit disfluencies and follows more of an intended transcription style, CrisperWhisper aims to transcribe every spoken word exactly as it is, including fillers, pauses, stutters and false starts.

### Data

**Input**

* Audio file: a `.wav` file containing speech (e.g., `audiotest.wav`).
* Transcripts: this analysis uses the CrisperWhisper output (JSON format), which provides word-level chunks with timestamps.

However, the distributional emotion embedding approach that we want to implement works at the sentence level, so we also need to gather sentence-level timestamps.
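To make the gap between word-level and sentence-level timing concrete, here is a minimal sketch of the structure this notebook assumes for the transcript JSON: a `text` string plus `chunks`, each holding one word and a `[start, end]` timestamp. The sample values and the helper `span_of` are invented for illustration and are not part of CrisperWhisper.

```python
# Hypothetical sample shaped like the transcript JSON used in this notebook:
# "text" holds the full transcript, "chunks" holds one entry per word.
sample_output = {
    "text": "I am fine. Are you?",
    "chunks": [
        {"text": "I", "timestamp": [0.00, 0.12]},
        {"text": "am", "timestamp": [0.15, 0.30]},
        {"text": "fine.", "timestamp": [0.32, 0.80]},
        {"text": "Are", "timestamp": [1.10, 1.25]},
        {"text": "you?", "timestamp": [1.27, 1.60]},
    ],
}

def span_of(chunks, first, last):
    """Return (start, end) covering chunks[first..last], inclusive."""
    return (chunks[first]["timestamp"][0], chunks[last]["timestamp"][1])

# A sentence-level timestamp runs from the start of the sentence's first word
# to the end of its last word.
sentence_spans = [
    span_of(sample_output["chunks"], 0, 2),  # "I am fine."
    span_of(sample_output["chunks"], 3, 4),  # "Are you?"
]
print(sentence_spans)
```

The word indices are hard-coded here; in practice they have to be recovered by matching the tokenized sentences against the word chunks.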

## Multimodal Processing

### Audio embeddings extraction algorithm

1. Use emotion2vec to extract audio based emotional embeddings

### Transcript generation

Transcribe the audio: convert the audio file into text using CrisperWhisper.

### Text embeddings extraction algorithm

**Emotions as distributional entities.** [Liampis 2025](https://www.sciencedirect.com/science/article/pii/S0925231225004941)


#### 1. Sentence Tokenization

1. Segment the transcript into sentences (initially via nltk); these sentences are the units for emotion analysis.
2. Allocate timestamps to each sentence.

#### 2. Emotion classification

1. Create Emotion Sequences:
    1. Apply a multi-label emotion classifier to each sentence in the transcribed text
        
#### 3. Create Emotion label texts

1. Sort extracted emotions by their probability values
2. Arrange emotion labels in a serially ordered layout (as strings)
   
#### 4. Embeddings extraction or generation
1. Generate Distributional Representations:
    1. Extract embeddings from these emotion sequences using a pre-trained transformer model to extract contextual representations *OR*
    2. Combine multiple embedding layers through a weighted average scheme
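The second option can be sketched as follows. This is only an illustration of a weighted average over per-layer vectors: the function name `weighted_layer_average`, the toy numbers, and the choice of weights are assumptions, and the weighting scheme used in the paper is not reproduced here. Plain Python lists stand in for tensors so the sketch runs without dependencies.

```python
def weighted_layer_average(layer_vectors, weights):
    """Combine one vector per layer into a single vector.

    layer_vectors: list of equal-length lists (one embedding per layer)
    weights: one non-negative weight per layer; normalized here to sum to 1
    """
    if len(layer_vectors) != len(weights):
        raise ValueError("need exactly one weight per layer")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(norm, layer_vectors))
        for d in range(dim)
    ]

# Toy example: three "layers" with 2-dimensional embeddings (made-up numbers)
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined = weighted_layer_average(layers, weights=[1, 1, 2])
print(combined)
```

With a Hugging Face model, the per-layer vectors would typically come from calling the model with `output_hidden_states=True`; how the weights are chosen (fixed or learned) remains a design decision.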


### Multimodal Integration

3. Integration with Audio Embeddings (either 1 or 2):
    1. Audio-Text Alignment:
        1. Map audio segments to corresponding text sentences
        2. Extract emotion embeddings from both modalities
    2. Cross-Modal Fusion:
        1. Concatenate audio and text emotion embeddings     
4. Enhanced Classification Framework ([Ni & Ni 2024](https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2024.1490796))
    1. Emotion Matching: Find the most suitable emotions from an emotion candidate list for the given multimodal input
    2. Emotion Correlation Modeling:
        1. Implement an attention-based emotion correlation modeling module that learns semantic correlations between emotions from data
        2. Obtain correlation-enhanced emotion embedding representations
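The alignment and concatenation steps above can be sketched as follows. This is a sketch under stated assumptions, not a settled design: each sentence is paired with the audio segment it overlaps most in time, and the two vectors are simply concatenated. The function names, the dictionary layout, and the tiny embeddings are all hypothetical.

```python
def overlap(a, b):
    """Length in seconds of the overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def align_and_fuse(audio_segments, text_sentences):
    """Pair each sentence with its best-overlapping audio segment, then concatenate.

    audio_segments: list of {"timestamp": (start, end), "embedding": [...]}
    text_sentences: list of {"timestamp": (start, end), "embedding": [...]}
    """
    fused = []
    for sent in text_sentences:
        best = max(
            audio_segments,
            key=lambda seg: overlap(seg["timestamp"], sent["timestamp"]),
        )
        fused.append({
            "timestamp": sent["timestamp"],
            # Audio features first, text features second
            "embedding": list(best["embedding"]) + list(sent["embedding"]),
        })
    return fused

# Made-up segments and sentences with very small embeddings
audio = [
    {"timestamp": (0.0, 3.0), "embedding": [0.1, 0.2]},
    {"timestamp": (3.0, 6.0), "embedding": [0.3, 0.4]},
]
text = [
    {"timestamp": (0.5, 2.5), "embedding": [1.0]},
    {"timestamp": (2.8, 5.0), "embedding": [2.0]},
]
fused = align_and_fuse(audio, text)
print([f["embedding"] for f in fused])
```

With the 1024-dimensional emotion2vec vectors and 768-dimensional DistilBERT vectors, the concatenated vector would have 1792 dimensions.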

The key innovation of using distributional embeddings is treating emotion states as tokens in a specialized vocabulary, where each token connects to others through sequential occurrences within text. This allows for capturing the dynamic and sequential nature of emotions as they unfold in content ([Liampis, Karanikola, Kotsiantis 2025](https://www.sciencedirect.com/science/article/pii/S0925231225004941)).



## Data input

In [None]:
import librosa
import numpy as np
import matplotlib.pyplot as plt
from funasr import AutoModel
import pandas as pd
import soundfile as sf
import os
import json


## Import audio and text data

In [None]:
# Load the audio file
audio_file = "../data/audiotest.wav"

# Load the transcript file in JSON format
transcript_file = "../data/transcripts/audiotest_json.json"

with open(transcript_file, 'r') as file:
    transcription_data = json.load(file)


## Audio embeddings extraction

### Pre-processing
Preprocess the audio file to ensure it is in the correct format for emotion2vec (required sampling rate: 16 kHz).

In [None]:
import librosa
import soundfile as sf

def preprocess_audio(audio_file, target_sr=16000):
    """
    Preprocess audio file to ensure it's in the correct format for emotion2vec
    
    Args:
        audio_file (str): Path to the audio file
        target_sr (int): Target sampling rate (emotion2vec requires 16kHz)
        
    Returns:
        str: Path to the processed audio file
    """
    # Load audio with its original sampling rate
    waveform, sr = librosa.load(audio_file, sr=None)
    
    # Check if resampling is needed
    if sr != target_sr:
        # Resample to 16kHz
        waveform = librosa.resample(waveform, orig_sr=sr, target_sr=target_sr)
        
        # Save the resampled audio to a temporary file
        temp_file = "temp_resampled.wav"
        sf.write(temp_file, waveform, target_sr)
        return temp_file
    
    return audio_file


In [None]:

audio_file_16KhZ = preprocess_audio(audio_file)



[emotion2vec](https://huggingface.co/emotion2vec/emotion2vec_plus_large) is a speech emotion recognition foundation model that can extract emotional features directly from audio waveforms. The emotion2vec_plus_large model is the largest version, trained on 42,526 hours of data. It supports 9 emotion classes:

0. angry
1. disgusted
2. fearful
3. happy
4. neutral
5. other
6. sad
7. surprised
8. unknown


In [None]:
## loading the emotion2vec_plus_large model
from funasr import AutoModel

# model="iic/emotion2vec_base"
# model="iic/emotion2vec_base_finetuned"
# model="iic/emotion2vec_plus_seed"
# model="iic/emotion2vec_plus_base" 
model_id = "iic/emotion2vec_plus_large"

model_audio = AutoModel(
    model=model_id,
    hub="hf",  # "ms" or "modelscope" for China mainland users; "hf" or "huggingface" for other overseas users
    disable_update=True
)

*analyze_full_audio()*: analyze a full audio file to extract both the overall emotional tone and the 1024-dimensional embedding for the whole file.

In [None]:
# Analyze a full audio file
def analyze_full_audio(audio_file):
    # Create output directory if it doesn't exist
    os.makedirs("./outputs", exist_ok=True)
    
    # Generate emotion predictions and extract embeddings
    result = model_audio.generate(
        audio_file, 
        output_dir="./outputs", 
        granularity="utterance", 
        extract_embedding=True
    )
    
    # Extract embeddings
    # result is a list, and each item in the list contains a dictionary with keys like feats, scores, and labels.
    if 'feats' in result[0]:
        embeddings = result[0]['feats']
        print(f"Embedding dimensionality: {embeddings.shape}")
        return embeddings, result
    else:
        print("No embeddings found in result")
        return None, result


*analyze_audio_over_time()*: analyze audio over time with sliding windows and hop_length settings
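As a quick check of how many windows a given recording produces, the sketch below reproduces only the start-index arithmetic of a sliding-window loop of this kind (segment length and hop length in seconds, converted to samples). The helper name `window_start_times` is hypothetical, and the 10-second duration is just an example.

```python
def window_start_times(duration_s, sample_rate=16000, segment_length=3, hop_length=1):
    """Start times (in seconds) of the windows visited by a sliding-window loop."""
    n_samples = int(duration_s * sample_rate)
    segment_samples = int(segment_length * sample_rate)
    hop_samples = int(hop_length * sample_rate)
    # Only full-length windows are visited; a trailing partial window is skipped
    starts = range(0, n_samples - segment_samples + 1, hop_samples)
    return [s / sample_rate for s in starts]

starts = window_start_times(10)
print(len(starts), starts)  # 8 windows, starting at 0.0 s through 7.0 s
```

Note that a recording shorter than one segment yields no windows at all, so very short clips need either a smaller `segment_length` or whole-file analysis.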

In [None]:
# Function to analyze audio over time with sliding windows
#
# Segment_length (default value of 3 seconds): determines the duration of each 
#    audio segment that will be analyzed individually.
#
# Hop_length (default value of 1 second): determines how much the analysis window 
#    moves forward between consecutive segments. When the hop length is smaller than 
#    the segment length, consecutive windows overlap.

def analyze_audio_over_time(audio_file, segment_length=3, hop_length=1):
    # Load the audio file
    waveform, sample_rate = sf.read(audio_file)
    
    # Calculate samples
    # For example, with a sample rate of 16,000 Hz, a segment length of 3 seconds 
    # would correspond to 48,000 samples, and a hop length of 1 second would correspond 
    # to 16,000 samples.
    segment_samples = int(segment_length * sample_rate)
    hop_samples = int(hop_length * sample_rate)
    
    # Collect per-segment emotion scores and their start times
    emotions_over_time = []
    timestamps = []
    
    # Create output directory if it doesn't exist
    os.makedirs("./outputs", exist_ok=True)

    # Path of the temporary segment file; defined before the loop so the
    # clean-up step below still works when the audio is shorter than one segment
    temp_path = "temp_segment.wav"

    # The loop below uses these sample counts to extract overlapping segments from the audio waveform:
    # Each segment is saved to a temporary file and analyzed by the emotion recognition model, with the 
    # results being recorded along with the corresponding timestamp.
    for start_idx in range(0, len(waveform) - segment_samples + 1, hop_samples):
        
        # Extract segment
        segment = waveform[start_idx:start_idx + segment_samples]
        
        # Save temporary segment
        sf.write(temp_path, segment, sample_rate)
        
        # Analyze segment
        result = model_audio.generate(
            temp_path, 
            output_dir="./outputs",
            granularity="utterance", 
            extract_embedding=False
        )
        
        # Get emotion scores and labels
        if 'scores' in result[0] and 'labels' in result[0]:
            scores = result[0]['scores']
            labels = result[0]['labels']
            
            # Record timestamp and scores
            timestamp = start_idx / sample_rate
            timestamps.append(timestamp)
            
            # Extract English labels (assuming format "Chinese/English")
            english_labels = [label.split('/')[1] if '/' in label else label for label in labels]
            
            # Create a dictionary mapping emotion names to scores
            emotion_scores = {}
            for i, label in enumerate(english_labels):
                emotion_scores[label] = scores[i]
            
            emotions_over_time.append(emotion_scores)
        
    # Clean up temporary file
    if os.path.exists(temp_path):
        os.remove(temp_path)
    
    # Convert to DataFrame for easier plotting
    df = pd.DataFrame(emotions_over_time, index=timestamps)
    
    # Fill missing values with 0 (emotions not detected in some segments)
    df = df.fillna(0)
    
    return df



In [None]:
# Function to plot emotion changes over time
def plot_emotions_over_time(df):
    plt.figure(figsize=(15, 6))
    for emotion in df.columns:
        plt.plot(df.index, df[emotion], label=emotion)
    
    plt.xlabel('Time (seconds)')
    plt.ylabel('Emotion Probability')
    plt.title('Emotion Changes Over Time')
    plt.legend()
    plt.grid(True)
    plt.show()

In [None]:
# Analyze full audio to get embeddings
#embeddings, full_result = analyze_full_audio(audio_file_16KhZ)

# Analyze audio over time
emotion_df = analyze_audio_over_time(audio_file_16KhZ)
                                     
# Plot emotions over time
plot_emotions_over_time(emotion_df)

## Text embeddings extraction
### Sentence Tokenization

In [None]:
# This is a basic sentence tokenizer: it only splits text into sentences, with no further linguistic processing
import nltk  # requires: pip install nltk

nltk.download("punkt")
nltk.download('punkt_tab')



In [None]:
from nltk.tokenize import sent_tokenize

# The JSON output from crisperwhisper contains:
#  1. transcribed text without timestamps  as transcription_data["text"]
#  2. detailed word-level timestamps as transcription_data["chunks"]
def segment_into_sentences(whisper_output):
    # Extract full text
    full_text = whisper_output['text']
    
    # Tokenize into sentences
    sentences = sent_tokenize(full_text)
    
    # Map sentences to timestamps using the chunks
    sentence_timestamps = []
    chunks = whisper_output['chunks']
    
    current_sentence_idx = 0
    current_sentence_words = []
    sentence_start_time = chunks[0]['timestamp'][0]
    
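    for chunk in chunks:
        # Stop once every sentence has been matched; this avoids indexing past
        # the end of `sentences` if trailing chunks remain
        if current_sentence_idx >= len(sentences):
            break

        # Chunk text may carry surrounding whitespace, so strip it before joining
        word = chunk['text'].strip()
        timestamp = chunk['timestamp']

        # The first word of a sentence sets the sentence start time
        if not current_sentence_words:
            sentence_start_time = timestamp[0]

        current_sentence_words.append(word)

        # Check if this word ends the current sentence
        reconstructed = ' '.join(current_sentence_words)
        if reconstructed and sentences[current_sentence_idx].strip().endswith(reconstructed):
            sentence_timestamps.append({
                'text': sentences[current_sentence_idx],
                'timestamp': (sentence_start_time, timestamp[1])
            })

            # Move to next sentence
            current_sentence_idx += 1
            current_sentence_words = []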
    
    return sentence_timestamps

In [None]:
sentences = segment_into_sentences(transcription_data)

Two different representations are needed to construct our distributional embedding:

1. Base corpus processing:
    1. The base corpus (shown on the left) undergoes sentence tokenization (previous step).
    2. Emotion extraction is performed on these sentences using a transformer-based multi-label classifier, producing a "Basic Emotion Corpus" that contains sentences with their corresponding emotion arrays and probability values.

2. Target data processing:
    1. In parallel, target data (the specific data you want to model) also undergoes sentence tokenization.
    2. This creates a "Target Emotion Corpus" with emotion tokens.


In the previous section we tokenized the base corpus into sentences (via `segment_into_sentences()`). Now, [the paper](https://www.sciencedirect.com/science/article/pii/S0925231225004941) suggests that:

1. we use a pre-trained transformer to generate a dataset of labelled sentences *and*
2. we assume that the output of the classifier is our ground truth in terms of assigning emotion labels to sentences.

According to the paper, building the distributional representations requires us to:

1. *"treat each sentence as an observation in a multivariate series of emotions"* 
2. *"transform the emotional flow of a text into a sequence of emotion strings."*

The textual sequence (our transcript) will have a corresponding emotion-based vector representation.

We achieve this with `classify_emotions()` in two steps:

### Emotion classification
* Step 1: we extract the emotions for each sentence using the pre-trained transformer ([Emotion English DistilRoBERTa-base](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base)). This yields, for each sentence, an array of emotions with their probability values.
### Create Emotion Label Texts
* Step 2: we take the classifier output and, sentence by sentence, interpret the probability of each emotion as a measure of emotion dominance. We then arrange the corresponding textual labels in a serially ordered layout, as if they were parts of a text. For each sentence in the transcript we find the three most dominant emotions by probability (what the paper calls *"sort extracted emotions by their probability values"*). `classify_emotions()` then generates an `emotion_sequence` string for each sentence by joining the top 3 emotions with "__" separators. This step implements what [the paper](https://www.sciencedirect.com/science/article/pii/S0925231225004941) calls *"arrange emotion labels in a serially ordered layout as if the labels were parts of a text. The vocabulary of such a text is the emotion labels."*
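A small worked example may help here. The sketch below applies the sort-and-join step to one sentence's scores; the helper name `to_emotion_string` and the probabilities are made up, while the seven labels are those of the classifier.

```python
def to_emotion_string(scores, top_k=3, sep="__"):
    """Join the top_k labels, ordered by descending score, into one string."""
    ranked = sorted(scores, key=lambda item: item["score"], reverse=True)
    return sep.join(item["label"] for item in ranked[:top_k])

# Made-up probabilities over the seven classes of the classifier
example_scores = [
    {"label": "anger", "score": 0.02},
    {"label": "disgust", "score": 0.01},
    {"label": "fear", "score": 0.03},
    {"label": "joy", "score": 0.61},
    {"label": "neutral", "score": 0.12},
    {"label": "sadness", "score": 0.04},
    {"label": "surprise", "score": 0.17},
]
emotion_string = to_emotion_string(example_scores)
print(emotion_string)  # joy__surprise__neutral
```

The position of a label in the string therefore encodes its rank, not its probability: the scores themselves are dropped at this step.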
   

In [None]:
from transformers import pipeline
import torch # for GPU use


def classify_emotions(sentences, device=None):  # device=None selects a GPU if available, else the CPU
    # Set device if not provided
    if device is None:
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
    
    # Initialize emotion classifier with device specification
    emotion_classifier = pipeline(
        "text-classification", 
        model="j-hartmann/emotion-english-distilroberta-base",
        # model="j-hartmann/emotion-english-roberta-large",
        return_all_scores=True,
        device=device  # e.g. "cuda:0" for the first GPU, "cuda:1" for the second, or "cpu"
    )
    
    emotion_sequences = []
    
    for sentence in sentences:
        
        # Step 1: extract the emotions for each sentence
        #   Get emotion probabilities
        emotion_scores = emotion_classifier(sentence['text'])[0]

        # Step 2: build the "emotion sequence" (e.g. "joy__surprise__neutral")
        #   Sort emotions by probability
        sorted_emotions = sorted(emotion_scores, key=lambda x: x['score'], reverse=True)

        #   Create emotion sequence (taking top 3 emotions)
        top_emotions = sorted_emotions[:3]
        emotion_sequence = "__".join([e['label'] for e in top_emotions])
        
        emotion_sequences.append({
            'text': sentence['text'],
            'timestamp': sentence['timestamp'],
            'emotion_sequence': emotion_sequence,
            'emotion_scores': sorted_emotions
        })
    
    return emotion_sequences


In [None]:
emotion_sequences = classify_emotions(sentences)


In [None]:
from IPython.display import Image, display

#https://www.sciencedirect.com/science/article/pii/S0925231225004941

# the authors use two emotions per sentence; we use three
image_url = "https://ars.els-cdn.com/content/image/1-s2.0-S0925231225004941-gr2.jpg"
display(Image(url=image_url))

These labels describe the emotional state of each sentence. *"These emotion states are serial and interdependent and can be treated as novel tokens of a textual sequence consisting only of emotional categorizations - as if the elements in question are, after all, strings."*

[The paper](https://www.sciencedirect.com/science/article/pii/S0925231225004941) specifically mentions that their approach "treats each sentence as an observation in a multivariate series of emotions" and transforms "the emotional flow of a text into a sequence of emotion strings." 

The key innovation proposed is treating emotions as having a "distributional layout" in text: the emotion sequences appear and alternate with each other in structured patterns of interrelations. We have now created emotion sequences that capture the dominant emotions of each sentence in order of probability, and turned them into text. The next step is to use a transformer model to generate embeddings that capture the contextual relationships between emotions.

This method makes it possible to capture both the contextual and the sequential dependencies between emotional states as they unfold in the text. *"Unlike conventional sentiment classification models, which treat emotions as discrete outputs, the methodology we propose encodes sentiments as structured entities within an embedding space."*

#### Embedding Extraction or Generation
After constructing an emotion-state-based corpus (where each sentence has been assigned emotion labels through a multi-label classifier), the process for extracting embeddings involves two alternative approaches:

1. **Extraction-based approach**: This uses pre-trained models to extract semantic representations of emotion strings. The paper explains: "To simply extract the embeddings means that one incorporates a pre-trained scheme to extract the more or less semantic representations of the given emotion strings". This leverages existing contextual understanding from transformer-based models. *OR*
2. **Generation-based approach**: This involves training an embedding scheme from scratch on the serially ordered emotion states. As the paper notes: "to generate means to train an embedding scheme over the serially ordered emotion states". This approach captures the unique distributional patterns specific to emotion sequences.
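To give a feel for what "training over serially ordered emotion states" means, here is a deliberately simple, count-based stand-in: each emotion label is represented by how often every other label occurs next to it in the emotion flow. This is not the embedding scheme of the paper (a neural scheme would replace the counting), and the function names and example flows are assumptions made for illustration.

```python
import math
from collections import Counter

def cooccurrence_vectors(sequences, window=1):
    """Build one count vector per emotion label from neighbouring labels."""
    vocab = sorted({label for seq in sequences for label in seq})
    counts = {label: Counter() for label in vocab}
    for seq in sequences:
        for i, label in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[label][seq[j]] += 1
    return vocab, {label: [counts[label][v] for v in vocab] for label in vocab}

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up emotion flows: one dominant label per sentence, one list per document
flows = [
    ["joy", "surprise", "joy", "neutral"],
    ["sadness", "fear", "sadness", "neutral"],
]
vocab, vectors = cooccurrence_vectors(flows)
print(vocab)
print(vectors["joy"], vectors["sadness"])
print(cosine(vectors["joy"], vectors["sadness"]))
```

Labels that appear in similar emotional contexts end up with similar vectors, which is the distributional idea in its most basic form.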


Here, `generate_emotion_embeddings()` uses the **extraction-based** approach, where *"one incorporates a pre-trained scheme"* (in this case `distilbert-base-uncased`) *"to extract the more or less semantic representations of the given emotion strings."*

1. We use DistilBERT to capture contextual relationships between emotions in the sequence. The task of the transformer is to extract contextual representations of the emotion sequences (e.g. "joy__sadness__anger", "joy__fear__disgust"). The transformer's self-attention mechanism creates contextual representations in which each emotion's embedding is influenced by the other emotions in the sequence. Each self-attention layer relates the word "joy" to the surrounding words in the emotion sequence, and these relationships influence how "joy" is represented in the vector space. This means the embedding for "joy" in "joy__sadness__anger" differs from the one in "joy__fear__disgust", because the surrounding emotions provide different contexts.

2. We then take the embedding of the CLS (**CL**a**S**sification) token as a representation of the entire emotion sequence: we leverage the pre-trained transformer's ability to aggregate information from the entire emotion sequence into a single vector. As discussed in the previous point, the model processes the emotion sequence through its attention mechanisms, where "each attention head can focus on different aspects of meaning" and "multiple transformer layers progressively refine the representation".
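The context dependence described above can be illustrated with a toy version of self-attention. The sketch below is a single-head scaled dot-product attention with identity projections over made-up 2-dimensional label vectors; the real DistilBERT uses learned query, key and value projections, several heads and layers, and positional information, none of which is modelled here.

```python
import math

def self_attention(vectors):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this token to every token in the sequence
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        # Softmax turns the scores into attention weights that sum to 1
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted mix of all token vectors
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
    return outputs

# Made-up static embeddings for a few emotion labels
static = {
    "joy": [1.0, 0.0],
    "sadness": [-1.0, 0.2],
    "anger": [-0.5, 1.0],
    "fear": [-0.8, -0.6],
    "disgust": [-0.3, 0.9],
}

seq_a = ["joy", "sadness", "anger"]
seq_b = ["joy", "fear", "disgust"]
joy_in_a = self_attention([static[t] for t in seq_a])[0]
joy_in_b = self_attention([static[t] for t in seq_b])[0]
print(joy_in_a)
print(joy_in_b)
```

Although "joy" starts from the same static vector in both sequences, its output vector differs because it is mixed with different neighbours.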

   

In [None]:
### Generate Distributional Representations
from transformers import AutoTokenizer
# Aliased so that it does not shadow funasr's AutoModel imported earlier in the notebook
from transformers import AutoModel as HFAutoModel
import torch

def generate_emotion_embeddings(emotion_sequences):
    # Initialize transformer model for contextual representations
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = HFAutoModel.from_pretrained("distilbert-base-uncased")
    
    embeddings = []
    
    for seq in emotion_sequences:
        # Tokenize emotion sequence
        inputs = tokenizer(seq['emotion_sequence'], return_tensors="pt", padding=True, truncation=True)
        
        # Generate embeddings
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Use CLS token embedding as sequence representation
        embedding = outputs.last_hidden_state[:, 0, :].numpy()
        
        seq['embedding'] = embedding
        embeddings.append(seq)
    
    return embeddings


In [None]:
emotion_embeddings = generate_emotion_embeddings(emotion_sequences)

## Temporal Emotion Analysis


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def visualize_emotion_flow(emotion_embeddings):
    # Extract timestamps and emotions
    timestamps = [e['timestamp'] for e in emotion_embeddings]
    emotion_labels = [e['emotion_scores'][0]['label'] for e in emotion_embeddings]
    
    # Create time points (midpoint of each sentence)
    time_points = [(t[0] + t[1])/2 for t in timestamps]
    
    # Create a mapping of emotions to colors
    unique_emotions = list(set(emotion_labels))
    colors = plt.cm.tab10(np.linspace(0, 1, len(unique_emotions)))
    emotion_colors = {emotion: colors[i] for i, emotion in enumerate(unique_emotions)}
    
    # Plot emotions over time
    plt.figure(figsize=(12, 6))
    
    for i, (time, emotion) in enumerate(zip(time_points, emotion_labels)):
        plt.scatter(time, 1, color=emotion_colors[emotion], s=100)
        plt.text(time, 1.05, emotion, rotation=45, ha='center')
    
    # Add sentence text as annotations
    for i, embedding in enumerate(emotion_embeddings):
        plt.annotate(embedding['text'][:30] + '...', 
                     xy=(time_points[i], 0.9),
                     xytext=(time_points[i], 0.7),
                     arrowprops=dict(arrowstyle='->'),
                     ha='center')
    
    plt.yticks([])
    plt.xlabel('Time (seconds)')
    plt.title('Emotion Flow Over Time')
    plt.tight_layout()
    plt.show()


In [None]:
visualize_emotion_flow(emotion_embeddings)