# Multimodal Emotion Recognition Dataset

## Aim

**Aim**: To create a multimodal emotion recognition dataset from natural speech collected in a naturalistic environment. The dataset will integrate two modalities: voice and the corresponding transcripts.

## Introduction
The proposed audio- and text-based emotion encoder follows and expands the work of Liapis, Karanikola, and Kotsiantis ([2025](https://www.sciencedirect.com/science/article/pii/S0925231225004941)), who model the serial, distributional structure of emotions in text by converting sequences of sentences into sequences of emotion labels and then generating distributional emotion embeddings from these sequences. The idea behind their work is that emotions in text are not isolated states: they unfold as structured, serially dependent patterns, akin to a time series, and form a sequence that reflects the emotional evolution of the discourse. The authors argue that emotional states exhibit distributional patterns as they appear and alternate in text, much as semantic relationships between words are shaped by their order and co-occurrence. In our multimodal pipeline, we extend this idea with an audio modality: we extract high-level emotion embeddings from each segment using emotion2vec, which captures contextual acoustic emotion features. We then use timestamps to align and concatenate these representations at the segment level to form a multimodal feature space.


### Conceptual Framework

#### Text component
* *Serial Emotional Footprint in Text*: Building on Liapis, Karanikola, and Kotsiantis (2025), we treat the emotional flow in natural language as a structured, serially dependent pattern. Each segment of text (e.g., sentence or utterance) is assigned a set of emotion labels using a multi-label classifier. These labels are then arranged in order, forming an "emotion string" that reflects the temporal progression of emotions throughout the transcript. This sequence is modeled as a time-series-like structure, capturing the distributional regularities and interdependencies of emotions as they appear and alternate in the text.
* *Distributional Emotion Embeddings for Text*: The emotion label sequences (emotion strings) derived from the text are embedded using distributional models (e.g., Word2Vec trained on the emotion label corpus) and/or contextual transformer-based models. This process encodes both the semantic and sequential properties of emotional expression in the text, allowing downstream models to leverage latent information about emotion transitions and dependencies that cannot be captured by isolated emotion labels alone.
#### Audio Component
* *Contextual Acoustic Emotion Embeddings for Audio*: For the audio modality, we extract high-level, segment-level emotion embeddings using a state-of-the-art model (e.g. emotion2vec). These embeddings capture the contextual acoustic features relevant to emotional expression in speech but are not modeled as distributional or sequential emotion label strings. Instead, each audio segment (the audio stream corresponding to a phrasal unit) is represented by its corresponding emotion2vec embedding, which encodes paralinguistic cues and affective information from the speech signal.

#### Multimodal integration
* *Multimodal Alignment and Fusion*: Audio and text segments are aligned using precise timestamps obtained from the transcription process (via CrisperWhisper). For each aligned segment, the contextual acoustic embedding (from audio) and the distributional emotion embedding (from text) are concatenated to form a multimodal feature vector. This approach is intended to integrate the complementary strengths of both modalities: the structured, sequential modeling of emotions in text and the rich paralinguistic information in speech. By aligning and fusing these representations at the segment level, we create a multimodal feature space for emotion recognition that is suited to naturalistic, diary-style speech data.

## Dataset Structure
Two data modes:
* **Audio**: segment-level emotion embeddings and labels.
* **Text**: Tokenized transcripts, segment-level emotion embeddings and labels.

### Data Collection

* **Speech Recordings**: High-quality (16 kHz), natural (and naturalistic) speech samples.
* **Text Transcripts**: Time-aligned transcripts for each audio recording (synchronization at the utterance level).
* **Emotion Annotation**: The present pipeline uses only automatic emotion classifiers (e.g., fine-tuned transformers, emotion2vec). In addition, human annotators will need to label each segment with one or more emotion categories.


## Method

### Tools

* [emotion2vec: audio-based emotion embeddings](https://huggingface.co/emotion2vec/emotion2vec_plus_large)
* [emotion-english-distilroberta-base](https://huggingface.co/j-hartmann/emotion-english-roberta-large)
* [CrisperWhisper](https://github.com/nyrahealth/CrisperWhisper)

**Emotion2vec** is presented by its authors as the first universal speech emotion representation model. Through self-supervised pre-training, emotion2vec can extract emotion representations across different tasks, languages, and scenarios, with the aim of creating a universal emotional representation space.

**Emotion-english-distilroberta-base** is a fine-tuned checkpoint of DistilRoBERTa-base trained on 6 diverse datasets; it predicts Ekman's 6 basic emotions plus a neutral class. A non-distilled variant based on RoBERTa-large is also available. RoBERTa-large (Liu et al. 2019) is a transformer model pretrained on a large corpus of English data in a self-supervised fashion.

**CrisperWhisper** is an advanced variant of OpenAI's Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps. Unlike the original Whisper, which tends to omit disfluencies and follows more of an intended transcription style, CrisperWhisper aims to transcribe every spoken word exactly as it is, including fillers, pauses, stutters, and false starts.

**Text Anonymization Pipeline**: Automated and manual redaction of personal identifiers in transcripts with Presidio (to be done). Presidio is an open-source NLP library developed by Microsoft, designed for automated text redaction and anonymization.

### Data

| Item             | Description                                                             | Included in Dataset  | Requires Anonymization |
|------------------|-------------------------------------------------------------------------|----------------------|------------------------|
| Audio File       | A .wav file containing speech (e.g., audiotest.wav)                     | No                   | No                     |
| Audio embeddings | Extracted, non-reconstructable, privacy-checked embeddings              | Yes                  | No                     |
| Transcripts      | Output of CrisperWhisper (JSON format) with utterance-level timestamps  | Yes                  | Yes                    |
| Timestamps       | Used for alignment, included in metadata                                | Yes                  | No                     |


### Multimodal Processing

### MODE 1: Text analysis

This is the process used to generate distributional embeddings ([Liapis et al., 2025](https://www.sciencedirect.com/science/article/pii/S0925231225004941)).

Two different representations are needed for the final embedding to be constructed. 

1. **base corpus**. The first representation concerns an (ideally extensive) textual corpus that is called *base corpus*, from which the base embeddings are generated (and/or) extracted.
2. **target data**. The second representation concerns the data one wants to model or *target data*.

The process follows **three algorithms**:

#### Base Corpus Processing (Algorithm 1)
This is the computationally expensive part (building the distributional embedding space). Once this is done, new data can be mapped quickly using the pre-trained models and the emotion vocabularies that we have embedded.

* Purpose: Build the emotion embedding space (Word2Vec and/or transformer-based) from a large, diverse corpus.
* How Often: Only once or whenever we want to update the base_corpus distributional embedding space.
* Pipeline:
    * Tokenizes the base corpus into sentences
    * Classifies each sentence for emotions (using a multi-label classifier)
    * Creates serial emotion strings (e.g., Joy__Trust__Surprise)
    * Trains or extracts embeddings for these emotion strings (Word2Vec and/or transformer)
**output**: a mapping from emotion strings to embeddings (this is our "base embedding space").

#### Target Data Processing (Algorithm 2)
* Purpose: Map new (target) data into the embedding space created from the base corpus.
* How Often: Every time we have new target data (step 1 of 2 for new data).
* Pipeline:
    * Tokenizes the new transcript into sentences
    * Classifies each sentence for emotions
    * Creates emotion strings for each sentence
    * Maps these emotion strings to the pre-trained embeddings from the base corpus (using the same vocabulary and embedding models as in algo. 1)
    * Combines (weighted average) the different embedding layers (e.g., Word2Vec and transformer)
**output**: distributional emotion embeddings for each sentence of the target transcript.

#### Final Embedding Construction (Algorithm 3)
* Purpose: For downstream tasks, concatenate the distributional emotion embeddings with sentence-level semantic embeddings.
* How Often: Every time we process new target data (step 2 of 2 for new data).
* What it does:
    * Gets the distributional emotion embeddings (from Algorithm 2)
    * Extracts sentence embeddings for the target data (using a transformer)
    * Flattens and concatenates these embeddings for each sentence
**output**: A final feature space for each sentence, ready for use in downstream models.
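
The concatenation step of Algorithm 3 can be sketched as follows. This is a minimal illustration, assuming both inputs are 384-dimensional (the Word2Vec `vector_size` used in this notebook and the output size of all-MiniLM-L6-v2), which gives 768 features per sentence. The helper name `concatenate_features` and the placeholder values are hypothetical.

```python
def concatenate_features(distributional, semantic):
    """Concatenate per-sentence emotion and sentence embeddings row by row."""
    if len(distributional) != len(semantic):
        raise ValueError("both inputs need exactly one row per sentence")
    return [list(d) + list(s) for d, s in zip(distributional, semantic)]


# Toy input: 2 sentences, 384-dimensional vectors filled with placeholder values
dist_emb = [[0.1] * 384, [0.2] * 384]   # distributional emotion embeddings
sent_emb = [[0.3] * 384, [0.4] * 384]   # sentence (semantic) embeddings

features = concatenate_features(dist_emb, sent_emb)
print(len(features), len(features[0]))
```

Keeping the emotion block first and the semantic block second means the first 384 columns can always be read as the emotion part of the feature space.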



# RESULTS

In [None]:
import librosa
import numpy as np
import matplotlib.pyplot as plt
from funasr import AutoModel
import pandas as pd
import soundfile as sf
import os
import json
import torch
from pathlib import Path


## MODE 1: Text-based distributional embeddings

### Transcript generation

* Transcribe the audio: Convert the audio file into text using CrisperWhisper to obtain transcripts with word-level timestamps.
    * Anonymisation: we use Microsoft Presidio (automated tool) - **This needs testing, not implemented yet**


In [None]:
import nltk ## need pip install nltk
import re  # for regular expressions

nltk.download("punkt")
nltk.download('punkt_tab')



In [None]:
import ast
import re

def load_and_flatten_transcripts(filename):
    ## Merge the lists into one: the transcripts were produced with Whisper in manually separated chunks
    with open(filename, "r", encoding="utf-8") as f:
        content = f.read()
    # Find all lists of dicts using regex
    lists = re.findall(r'\[.*?\]', content, re.DOTALL)
    all_entries = []
    for l in lists:
        try:
            entries = ast.literal_eval(l)
            all_entries.extend(entries)
        except Exception as e:
            print("Error in segment:", e)
    return all_entries


def retime_transcript(transcript, threshold=1.0):
    retimed = []
    time_offset = 0.0
    prev_end = 0.0

    for entry in transcript:
        start, end = entry['timestamp']
        # If timestamps reset, increment offset
        if start < prev_end - threshold:
            time_offset += prev_end
        adj_start = start + time_offset
        adj_end = end + time_offset
        retimed.append({'text': entry['text'], 'timestamp': (adj_start, adj_end)})
        prev_end = end
    return retimed


def segment_sentences_with_timestamps(whisper_output):
    """
    Segments transcript into sentences with aligned timestamps.
    Handles edge cases and maintains compatibility with emotion pipeline.
    
    Args:
        whisper_output (dict): CrisperWhisper JSON output with 'text' and 'chunks'
        
    Returns:
        list[dict]: List of {'text': str, 'timestamp': (start, end)} 
    """
    nltk.download('punkt', quiet=True)
    
    # Input validation
    if not whisper_output or 'text' not in whisper_output or 'chunks' not in whisper_output:
        return []
        
    text = whisper_output['text'].strip()
    chunks = whisper_output['chunks']
    
    if not text or not chunks:
        return []
    
    # Tokenize into sentences
    sentences = nltk.sent_tokenize(text)
    
    # Normalize chunks for better matching (strip whitespace, since word-level
    # chunks can carry leading or trailing spaces)
    chunk_words = [re.sub(r'[^\w\s\']', '', c['text'].lower()).strip() for c in chunks]
    chunk_times = [c['timestamp'] for c in chunks]
    
    results = []
    chunk_idx = 0
    
    for sent in sentences:
        # Normalize sentence words
        sent_words = [re.sub(r'[^\w\s\']', '', w.lower()) for w in sent.split()]
        sent_len = len(sent_words)
        
        match_found = False
        
        # Look for sentence match in remaining chunks. A separate search index
        # keeps chunk_idx unchanged when no match is found, so one unmatched
        # sentence does not prevent later sentences from matching.
        search_idx = chunk_idx
        while search_idx <= len(chunk_words) - sent_len:
            window = chunk_words[search_idx:search_idx + sent_len]
            
            # Check if window matches sentence words
            if window == sent_words:
                results.append({
                    'text': sent,
                    'timestamp': (
                        chunk_times[search_idx][0],          # Start time
                        chunk_times[search_idx + sent_len - 1][1]  # End time
                    )
                })
                chunk_idx = search_idx + sent_len
                match_found = True
                break
            search_idx += 1
            
        if not match_found:
            # No exact match (fuzzy matching is not implemented yet): fall back
            # to the last known end time, or 0.0 for the first sentence, and
            # estimate a duration of 1 second
            last_end = results[-1]['timestamp'][1] if results else 0.0
            results.append({
                'text': sent,
                'timestamp': (last_end, last_end + 1.0)
            })
    
    return results


### Sentence Tokenisation
The base corpus and the target data should be tokenized into sentences (for the target data, via `segment_sentences_with_timestamps()`).

> The Whisper output transcribed by Jiawei is already tokenized.

In [None]:
# Load the target transcript file in JSON format (TARGET data)
target_file = "../data/transcripts/audiotest_json.json"
with open(target_file, 'r') as file:
    transcription_data = json.load(file)

target_data = segment_sentences_with_timestamps(transcription_data)
#target_data = transcription_data


base_corpus_unparsed = load_and_flatten_transcripts("../data/transcripts/base_corpus.txt") # this needs timestamps to be re-parsed 
base_corpus = retime_transcript(base_corpus_unparsed)



In [None]:

#base_corpus
#transcription_data
#target_data

len(base_corpus)


### Base Corpus Processing (Algorithm 1)

![algo1](https://ars.els-cdn.com/content/image/1-s2.0-S0925231225004941-fx1001.jpg)

> in the code this is done via **algorithm_1_embeddings_extraction()**

Following this algorithm:
1. Segment tokenisation (already done): segment the transcript into sentences with NLTK (a natural language tokenizer).
    * Allocate timestamps to each sentence (precise start and stop time of each segment); this preserves the alignment between audio and text segments for later on.
2. We extract the emotions of each sentence with a transformer-based multi-label emotion classifier. **Ideally we will label the base corpus ourselves to form a new labelled dataset**. For now, we use a pre-trained transformer and assume that the output of the emotion classifier is the ground truth. For each segment:
    1. We apply a Hartmann emotion classifier ([Emotion English RoBERTa-large](https://huggingface.co/j-hartmann/emotion-english-roberta-large); the class defined in this notebook loads `j-hartmann/emotion-english-distilroberta-base`) to obtain the emotion probabilities. Any other suitable multi-label emotion classifier could be used instead.
    2. We serialise the top-$k$ ($k=3$) emotion labels of each segment, in order of their probability value, into a new text string Emotion1__Emotion2__Emotion3 (e.g., "joy__surprise__fear"). We call this token an **emotion string**. Its vocabulary grows with $k$: a larger $k$ gives a more fine-grained representation of the underlying emotion.
3. We concatenate the emotion strings in their temporal order of appearance to create a time-series-like structure (the **base emotion corpus**). This corpus organises emotion strings as tokens in a "text" whose vocabulary consists only of emotion labels, e.g.:

![serialised_labels](https://ars.els-cdn.com/content/image/1-s2.0-S0925231225004941-gr2.jpg)

Each *emotion string* captures the emotion dominance (top-$k$ probability ranking) of a segment, while the concatenation of these emotion strings (the *base emotion corpus*) captures their temporal sequence and can be subject to distributional logic.
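
As a rough sketch of the serialisation described above, the snippet below builds top-$k$ emotion strings from per-sentence scores and keeps them in temporal order. The scores are invented toy values, the label names are assumed to be those of a seven-class Ekman-plus-neutral classifier, and `emotion_string` is a hypothetical helper rather than part of the pipeline code.

```python
import math


def emotion_string(scores, k=3, sep="__"):
    """Serialise the k most probable labels, most probable first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sep.join(ranked[:k])


# Hypothetical classifier output for three consecutive sentences
sentence_scores = [
    {"joy": 0.70, "surprise": 0.20, "fear": 0.06, "neutral": 0.04},
    {"neutral": 0.60, "sadness": 0.30, "joy": 0.10},
    {"sadness": 0.55, "fear": 0.25, "anger": 0.20},
]

# Emotion strings kept in temporal order form the emotion corpus
emotion_corpus = [emotion_string(s) for s in sentence_scores]
print(emotion_corpus)

# Upper bound on the vocabulary size for 7 labels: ordered selections of k labels
vocab_bounds = {k: math.perm(7, k) for k in (1, 2, 3)}
print(vocab_bounds)
```

The vocabulary bound grows quickly with $k$, so a small base corpus will only cover a fraction of the possible triples.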

4. Base embedding extraction and/or generation:
    - Extraction : Use a pretrained sentence transformer (e.g., all-MiniLM-L6-v2) to embed emotion strings, capturing semantic relationships between them: the embeddings will be a representation of the emotion combinations and _do not_ include any serial (temporal) interdependencies between the emotion label tokens (i.e. emotion strings).
    - Generation : Train a Word2Vec model on the emotion string sequences to encode sequential dependencies (e.g., how "Joy" often precedes "Surprise").
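
To illustrate what "sequential dependencies" means for the generation route, the sketch below counts which emotion tokens appear near each other in a temporally ordered sequence. This is a plain co-occurrence count, not Word2Vec: it only shows the kind of (target, context) pairs that a window-based model sees. The function name, the toy sequence, and the window size are assumptions made for the example.

```python
from collections import Counter


def context_pairs(sequence, window=2):
    """Count (target, context) pairs within +/- `window` positions."""
    pairs = Counter()
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs[(target, sequence[j])] += 1
    return pairs


# Toy emotion sequence in temporal order (one token per sentence)
sequence = ["joy", "surprise", "joy", "sadness", "neutral"]
pairs = context_pairs(sequence, window=1)
print(pairs.most_common(3))
```

A pretrained sentence transformer never sees these neighbourhood counts, which is why the extraction route alone cannot encode the order in which emotions occur.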



### Distributional Embeddings (Algorithm 2)

![algo2](https://ars.els-cdn.com/content/image/1-s2.0-S0925231225004941-fx1002.jpg)

> in the code this is done via **algorithm_2_distributional_embeddings()**

Similarly to what we did for the base corpus:
1. we compute the base embeddings
    1. tokenize the target data
    2. we use the same RoBERTa model we used for the base corpus to predict the corresponding emotion probabilities for the target textual data.
    3.  Using these predictions, we again extract the *emotion string* for each text sample
2. We map the embeddings generated from the base corpus onto the new emotion labels extracted from the target dataset.
    1. This is done by matching the emotion labels between the base and target datasets and retrieving the corresponding embeddings from the relevant dictionaries, thus using the preexisting embeddings derived from the base corpus.
3. Once the embeddings are mapped to the target dataset, the resulting embeddings are flattened, expanding the embedding vectors into separate columns while maintaining the corresponding dimensionality. This transformation allows the embeddings to be treated as distinct features.
4. Then we extend this embedding representation by leveraging a Word2Vec model to capture the sequential interdependencies of the emotion strings. Using the emotion label corpus extracted from these vocabularies, we train a Word2Vec scheme with a vector size of 384 dimensions. The Word2Vec scheme is trained with a window size of 10, allowing it to capture contextual relationships between different emotion combinations within the emotion-based corpus.
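
The mapping and flattening steps (items 2 and 3) can be sketched as a dictionary lookup followed by a column expansion. The 3-dimensional vectors, the `emb_` column prefix, and both helper names are hypothetical; the zero-vector fallback for emotion strings missing from the base vocabulary is an assumption that mirrors the `np.zeros(384)` default used in the class defined in this notebook.

```python
def map_to_base(target_strings, base_lookup, dim):
    """Look up each target emotion string; unseen strings get a zero vector."""
    return [base_lookup.get(s, [0.0] * dim) for s in target_strings]


def flatten_to_columns(vectors, prefix="emb"):
    """Expand each vector into named columns, one dict per sample."""
    return [{f"{prefix}_{i}": v for i, v in enumerate(vec)} for vec in vectors]


# Toy base embedding space: emotion string -> vector
base_lookup = {
    "joy__surprise": [0.2, 0.1, 0.7],
    "neutral__sadness": [0.5, 0.4, 0.1],
}
target_strings = ["joy__surprise", "fear__anger"]  # second one is unseen

mapped = map_to_base(target_strings, base_lookup, dim=3)
rows = flatten_to_columns(mapped)
print(rows[0])
```

How often the fallback fires is a useful diagnostic: a high rate suggests the base corpus is too small for the chosen $k$.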
   
### Construct Final Embedding space (Algorithm 3)
The final embeddings are a weighted average of the MiniLM and Word2Vec embeddings. So that the emotion embeddings carry both semantic and distributional information, we blend the embeddings generated by the all-MiniLM-L6-v2 model with those from the Word2Vec model using a weighted averaging scheme.

![algo2_1](https://ars.els-cdn.com/content/image/1-s2.0-S0925231225004941-gr5.jpg)
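
A minimal sketch of the blending step, assuming the two vectors have the same length and using the weights set in this notebook's code (`alpha_extraction = 0.4`, `alpha_generation = 0.6`). The function name `blend` and the toy vectors are hypothetical.

```python
def blend(extraction, generation, alpha_extraction=0.4, alpha_generation=0.6):
    """Element-wise weighted sum of two equally sized embedding vectors."""
    if len(extraction) != len(generation):
        raise ValueError("embeddings must have the same dimensionality")
    return [
        alpha_extraction * e + alpha_generation * g
        for e, g in zip(extraction, generation)
    ]


minilm_vec = [1.0, 0.0, 2.0]   # stands in for a MiniLM (extraction) embedding
w2v_vec = [0.0, 1.0, 2.0]      # stands in for a Word2Vec (generation) embedding

blended = blend(minilm_vec, w2v_vec)
print(blended)
```

Because the weights sum to 1, a dimension on which both models agree keeps its value, while the other dimensions are pulled towards the Word2Vec side.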





In [None]:
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from gensim.models import Word2Vec

class DistributionalEmotionEmbeddings:
    def __init__(self):
        self.emotion_classifier = pipeline(
            "text-classification",
            model="j-hartmann/emotion-english-distilroberta-base",
            return_all_scores=True
        )
        self.base_embeddings = {}  # Store base corpus embeddings
        self.base_emotion_vocab = {}  # Store base emotion vocabulary
        self.word2vec_model = None
        self.sentence_transformer = SentenceTransformer("all-MiniLM-L6-v2")
    
    def algorithm_1_embeddings_extraction(self, corpus, process_type="both"):
        """Algorithm 1: Embeddings Extraction from base corpus"""
        sentences = corpus
        
        # Step 2: Emotion Classification
        emotion_labels_corpus = []
        for sentence in sentences:
            scores = self.emotion_classifier(sentence['text'])[0]
            sorted_emotions = sorted(scores, key=lambda x: x['score'], reverse=True)
            
            single = sorted_emotions[0]['label']
            double = "__".join([e['label'] for e in sorted_emotions[:2]])
            triple = "__".join([e['label'] for e in sorted_emotions[:3]])
            
            emotion_labels_corpus.append({
                'single': single,
                'double': double,
                'triple': triple
            })
        
        # Step 3: Create Emotion Label Texts
        emotion_sequences = {
            'single': [item['single'] for item in emotion_labels_corpus],
            'double': [item['double'] for item in emotion_labels_corpus],
            'triple': [item['triple'] for item in emotion_labels_corpus]
        }
        
        # Store vocabulary for mapping
        self.base_emotion_vocab = emotion_sequences
        
        # Step 4: Embedding Extraction or Generation
        embeddings = {}
        
        if process_type in ["extraction", "both"]:
            # These embeddings, however, are extracted, not generated 
            # or trained from scratch in a way that captures the sequential 
            # or temporal dynamics of emotion fluctuations across the text
            # That is, the extracted embeddings in this step are merely contextual 
            # semantic embeddings that do not include any serial interdependencies 
            # between the emotion label tokens.
            for vocab_type in ['single', 'double', 'triple']:
                embeddings[f'{vocab_type}_extraction'] = self.sentence_transformer.encode(
                    emotion_sequences[vocab_type]
                )
        
        if process_type in ["generation", "both"]:
            # After constructing the emotion embeddings from the base corpus, 
            # we proceed to further enrich the embedding representation by leveraging 
            # a Word2Vec model so that we can model sequential interdependencies too.
            tokenized_sequences = []
            for vocab_type in ['single', 'double', 'triple']:
                tokenized_sequences.extend([seq.split("__") for seq in emotion_sequences[vocab_type]])
            
            self.word2vec_model = Word2Vec(
                # this configuration replicates the spec of the paper
                sentences=tokenized_sequences,
                vector_size=384,
                # The Word2Vec scheme is trained with a window size of 10, allowing it to capture 
                # contextual relationships between different emotion combinations within the 
                # emotion-based corpus
                window=10,
                min_count=1,
                workers=4
            )
            
            for vocab_type in ['single', 'double', 'triple']:
                # To model sequential interdependencies we utilize the emotion word combinations 
                # (single, double, and triple emotions) extracted from the base corpus
                w2v_embeddings = []
                for seq in emotion_sequences[vocab_type]:
                    tokens = seq.split("__")
                    token_embeddings = [self.word2vec_model.wv[token] for token in tokens if token in self.word2vec_model.wv]
                    if token_embeddings:
                        w2v_embeddings.append(np.mean(token_embeddings, axis=0))
                    else:
                        w2v_embeddings.append(np.zeros(384))
                embeddings[f'{vocab_type}_generation'] = np.array(w2v_embeddings)
        
        return embeddings
    
    def algorithm_2_distributional_embeddings(self, base_corpus, target_data, num_layers=2):
        """Algorithm 2: Distributional Emotion Embeddings"""
        # Step 1: Compute Base Embeddings
        self.base_embeddings = self.algorithm_1_embeddings_extraction(base_corpus, "both")
        
        # Step 2: Map Target Emotion Data to Base Embeddings
        # Once the Word2Vec model is trained (step1), the embeddings generated are used in conjunction with 
        # the target dataset in the context of a process similar to the one applied in the 
        # earlier steps described for the base corpus:
        target_results = []
        for sentence in target_data:
            scores = self.emotion_classifier(sentence['text'])[0]
            # the emotion probabilities are extracted for each text entry of the target data set using the multi-label classifier
            sorted_emotions = sorted(scores, key=lambda x: x['score'], reverse=True)
            
            emotion_strings = {
                # Then the single-, double-, and triple-emoword constituents of the emotion states are 
                # extracted for each instance
                'single': sorted_emotions[0]['label'],
                'double': "__".join([e['label'] for e in sorted_emotions[:2]]),
                'triple': "__".join([e['label'] for e in sorted_emotions[:3]])
            }
            
            # Map to base embeddings using Word2Vec model
            # The corresponding labels of the emowords are then mapped to their corresponding embeddings derived from the Word2Vec model.
            mapped_embeddings = {}
            for vocab_type in ['single', 'double', 'triple']:
                emotion_string = emotion_strings[vocab_type]
                
                # Generation embeddings (Word2Vec)
                if self.word2vec_model:
                    tokens = emotion_string.split("__")
                    token_embeddings = [self.word2vec_model.wv[token] for token in tokens if token in self.word2vec_model.wv]
                    if token_embeddings:
                        mapped_embeddings[f'{vocab_type}_generation'] = np.mean(token_embeddings, axis=0)
                    else:
                        mapped_embeddings[f'{vocab_type}_generation'] = np.zeros(384)
                
                # Extraction embeddings (Sentence Transformer)
                mapped_embeddings[f'{vocab_type}_extraction'] = self.sentence_transformer.encode([emotion_string])[0]
            
            target_results.append({
                'text': sentence['text'],
                'timestamp': sentence['timestamp'],
                'emotion_strings': emotion_strings,
                'embeddings': mapped_embeddings
            })
        
        # Step 3: Weighted Average of Embeddings
        alpha_extraction = 0.4
        alpha_generation = 0.6
        
        final_embeddings = []
        for result in target_results:
            # Combine extraction and generation embeddings (using single emotions as example)
            extraction_emb = result['embeddings']['single_extraction']
            generation_emb = result['embeddings']['single_generation']
            
            combined = alpha_extraction * extraction_emb + alpha_generation * generation_emb
            final_embeddings.append(combined)
        
        return np.array(final_embeddings), target_results
    
    def algorithm_3_final_embedding_space(self, base_corpus, target_data, num_layers=2):
        """Algorithm 3: Construct Final Embedding Space"""
        # Step 1: Compute Distributional Emotion Embeddings
        distributional_embeddings, target_results = self.algorithm_2_distributional_embeddings(
            base_corpus, target_data, num_layers
        )
        
        # Step 2: Extract Textual Sentence Embeddings
        target_texts = [sentence['text'] for sentence in target_data]
        sentence_embeddings = self.sentence_transformer.encode(target_texts)
        
        # Step 3: Flatten Embeddings (if needed)
        de_flat = distributional_embeddings.reshape(distributional_embeddings.shape[0], -1)
        se_flat = sentence_embeddings.reshape(sentence_embeddings.shape[0], -1)
        
        # Step 4: Concatenate Embeddings
        final_embeddings = np.concatenate([de_flat, se_flat], axis=1)
        
        return final_embeddings, target_results


In [None]:
embedder = DistributionalEmotionEmbeddings()

# Algorithm 1: Extract embeddings from base corpus (this is called internally)
# embedder.algorithm_1_embeddings_extraction(base_corpus)



# Algorithm 2: Generate distributional emotion embeddings for target data
#    adjust the num_layers so that:
#         Single: Vocabulary size = number of emotion labels (e.g., 11).
#         Double: Vocabulary size = number of possible ordered pairs (e.g., 11 × 10 = 110 - order matters).
#         Triple: Vocabulary size = number of possible ordered triplets (e.g., 11 × 10 × 9 = 990).

# The method returns a tuple: (embedding array, per-sentence results)
distributional_embeddings, _ = embedder.algorithm_2_distributional_embeddings(
    base_corpus=base_corpus, 
    target_data=target_data, 
    num_layers=2  # My base corpus sample is relatively small (1514 sentences)
)

# Algorithm 3: Generate final text embeddings (distributional emotion embeddings combined with sentence semantics)
final_embeddings, detailed_results = embedder.algorithm_3_final_embedding_space(
    base_corpus=base_corpus, 
    target_data=target_data
)



In [None]:
for i, result in enumerate(detailed_results):
    print(f"Sentence {i}: {result['text']}")
    print(f"Emotion strings: {result['emotion_strings']}")
    print(f"Final embedding shape: {final_embeddings[i].shape}")
    print("---")




#### Step 2. Audio embeddings extraction (mode 2)

We use emotion2vec to extract high-level emotion embeddings from the audio file.

Note: We only publish these embeddings, not the raw audio. We will also look into further processing of the embeddings to minimize speaker-identifiable information (e.g., dimensionality reduction via UMAP).


#### Step 3. Multimodal Integration

1. Align audio and text segments using timestamps.
    * Segment Matching: Map audio segments (from emotion2vec) to text segments using timestamps (e.g., a 3-second audio chunk and its corresponding transcribed sentence).
    * Embedding Concatenation: For each aligned segment, concatenate:
        * Audio: Emotion2vec embedding (contextual acoustic features).
        * Text: Distributional emotion embedding (serial emotion patterns).

Concatenating the audio emotion embedding with the corresponding text-based emotion embedding yields one multimodal feature vector per segment.
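
The alignment and concatenation can be sketched as below. This assumes each segment is a dict with a `timestamp` (start, end) and an `embedding`, and it pairs each text segment with the audio segment that overlaps it most in time. The overlap rule, the field names, and the tiny vectors are assumptions for illustration; if audio is cut with the sentence timestamps, the pairing is one-to-one and no search is needed.

```python
def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def fuse_segments(text_segments, audio_segments):
    """Pair each text segment with its best-overlapping audio segment."""
    fused = []
    if not audio_segments:
        return fused
    for text in text_segments:
        best = max(
            audio_segments,
            key=lambda a: overlap(text["timestamp"], a["timestamp"]),
        )
        if overlap(text["timestamp"], best["timestamp"]) == 0.0:
            continue  # no audio segment covers this sentence
        fused.append({
            "timestamp": text["timestamp"],
            # audio embedding first, then text embedding
            "features": list(best["embedding"]) + list(text["embedding"]),
        })
    return fused


text_segments = [
    {"timestamp": (0.0, 2.9), "embedding": [0.1, 0.2]},
    {"timestamp": (3.1, 6.0), "embedding": [0.3, 0.4]},
]
audio_segments = [
    {"timestamp": (0.0, 3.0), "embedding": [1.0, 1.1, 1.2]},
    {"timestamp": (3.0, 6.0), "embedding": [2.0, 2.1, 2.2]},
]

fused = fuse_segments(text_segments, audio_segments)
print(fused[0]["features"])
```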



In [None]:
from IPython.display import Image, display

#https://www.sciencedirect.com/science/article/pii/S0925231225004941

# the authors use two emotions per sentence, we are using three
image_url = "https://ars.els-cdn.com/content/image/1-s2.0-S0925231225004941-gr2.jpg"
display(Image(url=image_url))

## Step 2. Audio embeddings extraction


### Pre-processing
Preprocess the audio file to ensure it is in the correct format for emotion2vec (required sampling rate: 16 kHz).

In [None]:
# Path to the target audio file
target_file = "../data/audiotest.wav"




In [None]:
import librosa
import soundfile as sf

def preprocess_audio(audio_file, target_sr=16000):
    """
    Preprocess audio file to ensure it's in the correct format for emotion2vec
    
    Args:
        audio_file (str): Path to the audio file
        target_sr (int): Target sampling rate (emotion2vec requires 16kHz)
        
    Returns:
        str: Path to the processed audio file
    """
    # Load audio with its original sampling rate
    waveform, sr = librosa.load(audio_file, sr=None)
    
    # Check if resampling is needed
    if sr != target_sr:
        # Resample to 16kHz
        waveform = librosa.resample(waveform, orig_sr=sr, target_sr=target_sr)
        
        # Save the resampled audio to a temporary file
        temp_file = "temp_resampled.wav"
        sf.write(temp_file, waveform, target_sr)
        return temp_file
    
    return audio_file


In [None]:

audio_file_16KhZ = preprocess_audio(target_file, target_sr=16000)


In [None]:
# Load the emotion2vec_plus_large model
from funasr import AutoModel

# Alternative checkpoints:
# model_id = "iic/emotion2vec_base"
# model_id = "iic/emotion2vec_base_finetuned"
# model_id = "iic/emotion2vec_plus_seed"
# model_id = "iic/emotion2vec_plus_base"
model_id = "iic/emotion2vec_plus_large"

model_audio = AutoModel(
    model=model_id,
    hub="hf",  # "ms" or "modelscope" for China mainland users; "hf" or "huggingface" for other overseas users
    disable_update=False
)

### Loading emotion2vec

[emotion2vec](https://huggingface.co/emotion2vec/emotion2vec_plus_large) is a speech emotion recognition foundation model that extracts emotional features directly from audio waveforms. The emotion2vec_plus_large model is the largest version, trained on 42,526 hours of data. It predicts 9 emotion classes:

0. angry
1. disgusted
2. fearful
3. happy
4. neutral
5. other
6. sad
7. surprised
8. unknown
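
To make the class indices concrete, the sketch below maps a score vector onto the nine labels listed above and keeps the highest-scoring classes above a threshold. The score values are made up for illustration, and `top_emotions` is a hypothetical helper, not part of emotion2vec or FunASR.

```python
# Labels in the index order listed above
EMOTION2VEC_LABELS = [
    "angry", "disgusted", "fearful", "happy", "neutral",
    "other", "sad", "surprised", "unknown",
]


def top_emotions(scores, k=3, threshold=0.0):
    """Return up to k (label, score) pairs with score >= threshold,
    sorted from most to least likely."""
    if len(scores) != len(EMOTION2VEC_LABELS):
        raise ValueError("expected one score per emotion class")
    pairs = [
        (label, score)
        for label, score in zip(EMOTION2VEC_LABELS, scores)
        if score >= threshold
    ]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:k]


# Hypothetical class probabilities for one utterance
example_scores = [0.05, 0.01, 0.02, 0.60, 0.25, 0.01, 0.03, 0.02, 0.01]
print(top_emotions(example_scores, k=3, threshold=0.1))
# [('happy', 0.6), ('neutral', 0.25)]
```

With a threshold of 0.1, fewer than `k` labels can be returned, which is why the number of emotions per segment may vary.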



In [None]:
import os  # needed for the temporary-file cleanup at the end of the loop

def process_audio_emotion_utterances(sentences, audio_file, model_audio, threshold=0.0):
    """
    Process audio segments into emotion sequences and embeddings, aligning with text segments.
    
    Args:
        sentences (list): List of segmented sentences with timestamps
        audio_file (str): Path to audio file
        model_audio: Loaded emotion2vec model
        threshold (float): Minimum probability threshold for emotion detection
        
    Returns:
        list: List of dicts containing:
            - text: Original sentence text
            - timestamp: (start, end) times
            - emotion_sequence: Concatenated top emotions 
            - emotion_scores: List of (emotion, score) tuples
            - embedding: emotion2vec embedding vector
    """
    
    def extract_audio_segment(audio_file, start_time, end_time, sr=16000):
        """Extract precise audio segment matching text timestamp"""
        waveform, sr = librosa.load(audio_file, sr=sr)
        start_sample = int(start_time * sr)
        end_sample = int(end_time * sr)
        return waveform[start_sample:end_sample]
    
    audio_emotion_sequences = []
    
    for sentence in sentences:
        # 1. Get exact timestamp alignment
        start_time, end_time = sentence['timestamp']
        
        # 2. Extract aligned audio segment
        segment = extract_audio_segment(audio_file, start_time, end_time)
        
        # Save segment for emotion2vec processing
        temp_path = "temp_segment.wav"
        sf.write(temp_path, segment, 16000)
        
        # 3. Generate emotion embeddings and labels
        result = model_audio.generate(
            temp_path,
            output_dir="./outputs",
            granularity="utterance",
            extract_embedding=True  # Get both embeddings and emotion scores
        )
        
        if 'scores' in result[0] and 'labels' in result[0]:
            # Process emotion labels
            scores = result[0]['scores']
            labels = result[0]['labels']
            english_labels = [label.split('/')[1] if '/' in label else label 
                            for label in labels]
            
            # Filter emotions by threshold
            emotion_scores = [
                (label, score) 
                for score, label in zip(scores, english_labels)
                if score >= threshold
            ]
            
            # Sort by score and take top 3
            emotion_scores.sort(key=lambda x: x[1], reverse=True)
            top_emotions = emotion_scores[:3]
            
            # Create emotion sequence string (optional: audio emotions are not sequenced at present)
            emotion_sequence = "__".join([e[0] for e in top_emotions])
            
            # Store results with embeddings
            audio_emotion_sequences.append({
                'text': sentence['text'],
                'timestamp': sentence['timestamp'],
                'emotion_sequence': emotion_sequence,  # optional: not used for audio sequencing at present
                'emotion_scores': top_emotions,
                'embedding': result[0].get('feats', None)  # 1024-dim emotion embedding
            })
        
        # Cleanup
        if os.path.exists(temp_path):
            os.remove(temp_path)
            
    return audio_emotion_sequences

In [None]:
audio_sequences = process_audio_emotion_utterances(
    sentences=sentences,
    audio_file=audio_file_16KhZ,
    model_audio=model_audio,
    threshold=0.1 # Minimum probability threshold for emotion detection
)

In [None]:
#audio_sequences

In [None]:
# Process sentences into emotion sequences
text_emo_sequences = process_emotional_flow(sentences)
# Process audio segments into emotion sequences
audio_emo_sequences = process_audio_emotion_utterances(
    sentences=sentences,
    audio_file=audio_file_16KhZ,
    model_audio=model_audio
)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def visualize_emotion_flow(emotion_embeddings):
    # Extract timestamps and emotions
    timestamps = [e['timestamp'] for e in emotion_embeddings]
    emotion_labels = [e['emotion_scores'][0]['label'] for e in emotion_embeddings]
    
    # Create time points (midpoint of each sentence)
    time_points = [(t[0] + t[1])/2 for t in timestamps]
    
    # Create a mapping of emotions to colors
    unique_emotions = list(set(emotion_labels))
    colors = plt.cm.tab10(np.linspace(0, 1, len(unique_emotions)))
    emotion_colors = {emotion: colors[i] for i, emotion in enumerate(unique_emotions)}
    
    # Plot emotions over time
    plt.figure(figsize=(12, 6))
    
    for time, emotion in zip(time_points, emotion_labels):
        plt.scatter(time, 1, color=emotion_colors[emotion], s=100)
        plt.text(time, 1.05, emotion, rotation=45, ha='center')
    
    # Add sentence text as annotations
    for i, embedding in enumerate(emotion_embeddings):
        plt.annotate(embedding['text'][:30] + '...', 
                     xy=(time_points[i], 0.9),
                     xytext=(time_points[i], 0.7),
                     arrowprops=dict(arrowstyle='->'),
                     ha='center')
    
    plt.yticks([])
    plt.xlabel('Time (seconds)')
    plt.title('Emotion Flow Over Time')
    plt.tight_layout()
    plt.show()


In [None]:
emotion_embeddings = generate_emotion_embeddings(text_emo_sequences)

emotion_embeddings_MM = generate_emotion_embeddings_MM(text_emo_sequences, audio_emo_sequences)

In [None]:
def display_emotion_analysis(emotion_embeddings_MM):
    """Display a detailed summary of multimodal emotion analysis"""
    for entry in emotion_embeddings_MM:
        print("\n" + "="*80)
        print(f"Sentence: {entry['text'][:100]}...")
        print("-"*80)
        
        # Display timestamps
        start, end = entry['timestamp']
        print(f"Time window: {start:.2f}s - {end:.2f}s (duration: {end-start:.2f}s)")
        
        # Display emotion sequences
        print("\nText Emotions:", entry['text_emotion_sequence'])
        print("Audio Emotions:", entry['audio_emotion_sequence'])
        print("Combined:", entry['combined_sequence'])
        
        # Display top text emotions with scores
        print("\nText Emotion Scores:")
        for emotion, score in sorted(entry['text_vector'].items(), key=lambda x: x[1], reverse=True)[:3]:
            print(f"  {emotion:<10}: {score:.3f}")
        
        # Display top audio emotions with scores
        print("\nAudio Emotion Scores:")
        for emotion, score in entry['audio_scores']:
            print(f"  {emotion:<10}: {score:.3f}")
        
        # Show embedding dimensionality
        print(f"\nEmbedding shape: {entry['embedding'].shape}")

# Display the analysis
display_emotion_analysis(emotion_embeddings_MM)

## Temporal Emotion Analysis


In [None]:
visualize_emotion_flow(emotion_embeddings)

## Alternative using multimodal embeddings

# Audio: Sharing Embeddings Instead of Raw Recordings
**Privacy-preserving embeddings:**
Work on privacy-preserving speech representations indicates that audio embeddings can be designed to retain emotion-relevant features while suppressing speaker identity and other biometric information. Techniques include adversarial training and feature-importance-based modification of embeddings, which support emotion recognition without exposing speaker identity or allowing reconstruction of the original audio signal.

**Utility and compliance:**
Such embeddings aim to preserve utility for emotion detection while reducing the risks associated with voiceprints and biometric data. Because no raw recordings are shared, this approach also supports compliance with GDPR and similar regulations, provided the embeddings cannot be linked back to individual speakers.

# Text: Anonymization Strategies
**Automated anonymization tools:**
Use transformer-based models (e.g., BERT, GPT-2) or specialized tools such as Microsoft Presidio to detect and redact personal information (names, locations, dates, etc.) in transcripts. These tools replace sensitive data with placeholders, so the text remains useful for analysis while becoming much harder to trace back to individuals.
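
The placeholder-replacement idea can be illustrated with a small standard-library sketch. It is a deliberately simplified stand-in, not Presidio and not a transformer-based detector: the `PATTERNS` table and the `redact` helper are assumptions for this example, and regular expressions alone will miss names, locations, and most indirect identifiers.

```python
import re

# Illustrative patterns only; real transcripts also need NER-based detection
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def redact(text):
    """Replace every match of each pattern with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


sample = "Call me on +44 20 7946 0958 or write to jane.doe@example.com before 12/05/2024."
print(redact(sample))
# Call me on <PHONE> or write to <EMAIL> before <DATE>.
```

Typed placeholders such as `<DATE>` keep the sentence structure intact, so downstream emotion classification still sees a grammatical utterance.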

**Manual review and metadata scrubbing:**
After automated anonymization, manually review transcripts for indirect identifiers (e.g., rare job titles, unique life events) and remove or generalize them as needed. Also strip or generalize metadata (e.g., device, location, demographics) to further reduce re-identification risk.
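
As a rough sketch of the metadata step, the function below drops device-level fields and coarsens age and location. All field names (`device`, `gps`, `ip_address`, `age`, `city`, `country`) and the 10-year bin width are hypothetical; the actual schema and granularity depend on the dataset.

```python
def generalize_metadata(record, drop_keys=("device", "gps", "ip_address"), age_bin=10):
    """Return a copy of record with identifying fields dropped or coarsened."""
    # Remove fields that identify a device or a precise location
    cleaned = {key: value for key, value in record.items() if key not in drop_keys}

    # Replace an exact age with a range, e.g. 34 -> "30-39"
    if "age" in cleaned:
        low = (cleaned.pop("age") // age_bin) * age_bin
        cleaned["age_range"] = f"{low}-{low + age_bin - 1}"

    # Drop the city; any coarser "country" field is left untouched
    cleaned.pop("city", None)
    return cleaned


record = {
    "speaker_id": "S017",
    "age": 34,
    "city": "Leeds",
    "country": "UK",
    "device": "Pixel 7",
}
print(generalize_metadata(record))
# {'speaker_id': 'S017', 'country': 'UK', 'age_range': '30-39'}
```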