# Multimodal Emotion Recognition Dataset

## Aim

**Aim**: to create a multimodal emotion recognition dataset from natural speech collected in a naturalistic environment. The dataset will integrate two modalities: voice and the corresponding transcripts.

## Introduction
The audio- and text-based emotion encoder proposed here follows and extends the work of Liapis, Karanikola, and Kotsiantis ([2025](https://www.sciencedirect.com/science/article/pii/S0925231225004941)), who model the serial, distributional structure of emotions in text by converting sequences of sentences into sequences of emotion labels and then generating distributional emotion embeddings from these sequences. The idea behind their work is that emotions in text are not isolated states but unfold as structured, serially dependent patterns, akin to a time series, forming a sequence that reflects the emotional evolution throughout the discourse. The authors argue that emotional states exhibit distributional patterns as they appear and alternate in text, much as semantic relationships between words are shaped by their order and co-occurrence. In our multimodal pipeline, we extend this idea with an audio modality: we extract high-level emotion embeddings from each audio segment using emotion2vec, which capture contextual acoustic emotion features. We then use timestamps to align and concatenate these representations at the segment level to form a multimodal feature space.


### Conceptual Framework

#### Text component
* *Serial Emotional Footprint in Text*: Building on Liapis, Karanikola, and Kotsiantis (2025), we treat the emotional flow in natural language as a structured, serially dependent pattern. Each segment of text (e.g., sentence or utterance) is assigned a set of emotion labels using a multi-label classifier. These labels are then arranged in order, forming an "emotion string" that reflects the temporal progression of emotions throughout the transcript. This sequence is modeled as a time-series-like structure, capturing the distributional regularities and interdependencies of emotions as they appear and alternate in the text.
* *Distributional Emotion Embeddings for Text*: The emotion label sequences (emotion strings) derived from the text are embedded using distributional models (e.g., Word2Vec trained on the emotion label corpus) and/or contextual transformer-based models. This process encodes both the semantic and sequential properties of emotional expression in the text, allowing downstream models to leverage latent information about emotion transitions and dependencies that cannot be captured by isolated emotion labels alone.
#### Audio Component
* *Contextual Acoustic Emotion Embeddings for Audio*: For the audio modality, we extract high-level, segment-level emotion embeddings using a state-of-the-art model (e.g. emotion2vec). These embeddings capture the contextual acoustic features relevant to emotional expression in speech but are not modeled as distributional or sequential emotion label strings. Instead, each audio segment (the audio stream corresponding to a phrasal unit) is represented by its corresponding emotion2vec embedding, which encodes paralinguistic cues and affective information from the speech signal.

#### Multimodal integration
* *Multimodal Alignment and Fusion*: Audio and text segments are aligned using the timestamps obtained from the transcription process (via CrisperWhisper). For each aligned segment, the contextual acoustic embedding (from audio) and the distributional emotion embedding (from text) are concatenated to form a multimodal feature vector. This approach is intended to integrate the complementary strengths of both modalities: the structured, sequential modeling of emotions in text and the rich paralinguistic information in speech. By aligning and fusing these representations at the segment level, we create a multimodal feature space for emotion recognition, suitable for analysis of naturalistic, diary-style speech data.

## Dataset Structure
Two data modes:
* **Audio**: segment-level emotion embeddings and labels.
* **Text**: Tokenized transcripts, segment-level emotion embeddings and labels.

### Data Collection

* **Speech Recordings**: High-quality (16 kHz), natural (and naturalistic) speech samples.
* **Text Transcripts**: Time-aligned transcripts for each audio recording (synchronization at the utterance level).
* **Emotion Annotation**: The present pipeline uses only automatic emotion classifiers (e.g., fine-tuned transformers, emotion2vec). In addition, human annotators will be needed to label each segment with one or more emotion categories.


## Method

### Tools

* [emotion2vec: audio-based emotion embeddings](https://huggingface.co/emotion2vec/emotion2vec_plus_large)
* [emotion-english-distilroberta-base](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base)
* [CrisperWhisper](https://github.com/nyrahealth/CrisperWhisper)

**Emotion2vec** is described by its authors as the first universal speech emotion representation model. Through self-supervised pre-training, emotion2vec can extract emotion representations across different tasks, languages, and scenarios, with the aim of creating a universal emotional representation space.

**Emotion-english-distilroberta-base** is a fine-tuned checkpoint of DistilRoBERTa-base, trained on 6 diverse datasets, that predicts Ekman's 6 basic emotions plus a neutral class. DistilRoBERTa-base is a distilled version of RoBERTa-base (Liu et al. 2019), a transformer model pretrained on a large corpus of English data in a self-supervised fashion.

**CrisperWhisper** is an advanced variant of OpenAI's Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps. Unlike the original Whisper, which tends to omit disfluencies and follows more of an intended transcription style, CrisperWhisper aims to transcribe every spoken word exactly as it is, including fillers, pauses, stutters and false starts.

**Text Anonymization Pipeline**: automated and manual redaction of personal identifiers in transcripts with Presidio (to be done). Presidio is an open-source library developed by Microsoft for detecting and anonymizing personally identifiable information in text.

### Data

| Item             | Description                                                             | Included in Dataset  | Requires Anonymization |
|------------------|-------------------------------------------------------------------------|----------------------|------------------------|
| Audio File       | A .wav file containing speech (e.g., audiotest.wav)                     | No                   | No                     |
| Audio embeddings | Extracted, non-reconstructable, privacy-checked embeddings              | Yes                  | No                     |
| Transcripts      | Output of crisperwhisper (JSON format) with utterance-level timestamps  | Yes                  | Yes                    |
| Timestamps       | Used for alignment, included in metadata                                | Yes                  | No                     |


### Multimodal Processing

#### Step 0. Transcript generation

* Transcribe the audio: convert the audio file into text using CrisperWhisper to obtain transcripts with word-level timestamps.
    * Anonymisation: we plan to use Microsoft Presidio (automated tool). **This needs testing; it is not implemented yet.**
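
Since the anonymisation step is not implemented yet, the following is only a minimal sketch of how it might look, assuming sentence-level segments shaped like `{'text': ..., 'timestamp': ...}`. The helper names `presidio_redact` and `anonymize_segments` are hypothetical. The Presidio calls assume the `presidio-analyzer` and `presidio-anonymizer` packages plus a spaCy English model are installed; they are imported lazily so the rest of the sketch runs without them.

```python
def presidio_redact(text, language="en"):
    """Replace detected PII with placeholders such as <PERSON> using Presidio."""
    # Imported lazily so the rest of the sketch runs without Presidio installed
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    findings = AnalyzerEngine().analyze(text=text, language=language)
    return AnonymizerEngine().anonymize(text=text, analyzer_results=findings).text


def anonymize_segments(segments, redact=presidio_redact):
    """Return copies of the segments with redacted text and unchanged timestamps."""
    return [{**seg, "text": redact(seg["text"])} for seg in segments]


# Usage with a stand-in redactor (hypothetical; no Presidio needed for this demo)
segments = [{"text": "I called Maria yesterday.", "timestamp": (0.0, 1.8)}]
redacted = anonymize_segments(
    segments, redact=lambda t: t.replace("Maria", "<PERSON>")
)
print(redacted)
```

Redacting sentence by sentence keeps the timestamps untouched, at the cost of giving the recogniser less context than the full transcript would; which trade-off is acceptable still needs testing.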


#### Step 1. Text embeddings extraction (mode 1)
This is the process to generate distributional embeddings ([Liapis et al. 2025](https://www.sciencedirect.com/science/article/pii/S0925231225004941)).

Two different representations are needed for the final embedding to be constructed. 

1. The first representation concerns an (ideally extensive) textual corpus that is called *base corpus*, from which the base embeddings are generated (and/or) extracted.
2. The second representation concerns the data one wants to model or *target data*.

First we extract and/or generate our base embeddings from the base corpus (Step 1.1), and then we generate distributional embeddings (Step 1.2).

##### Step 1.1 Embedding generation/extraction (from the base corpus)

![Emotion string sequence](https://ars.els-cdn.com/content/image/1-s2.0-S0925231225004941-fx1001.jpg)

Following this algorithm, we start by (1) tokenizing the base corpus into sentences and (2) extracting, through transformer-based multi-label emotion classification, the corresponding emotions for each sentence. Point (2) assumes that the output of the emotion classifier is the ground truth. The authors suggest that it would be ideal to have a pre-labeled base corpus instead of using a transformer classifier to allocate these labels. Steps (1) and (2) are conducted as follows:

1. Segment tokenisation: segment the transcript into sentences via NLTK (a natural language toolkit that provides a sentence tokenizer).
    1. Allocate timestamps to each sentence (precise start and stop timing of each segment); this preserves the alignment between audio and text segments for later steps.
2. For each segment:
    1. We apply the [Emotion English DistilRoBERTa-base](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base) model to obtain Hartmann's emotion probabilities (any other suitable multi-label emotion classifier could be used instead).
    2. We serialise the top-$k$ ($k=3$) emotion labels of each segment, in descending order of probability, into a new text string `Emotion1__Emotion2__Emotion3` (e.g. `joy__surprise__fear`). We call this token an **emotion string**. The vocabulary of these tokens grows with $k$: a larger $k$ gives a more fine-grained representation of the underlying emotion.
3. We concatenate the emotion strings in their temporal order of appearance to create a time-series-like structure (the **base emotion corpus**), e.g. `Sentence 1: "anger__joy__trust"; Sentence 2: "frustration__neutral__anger"; ...`. This corpus organises emotion strings as tokens in a "text" where the vocabulary consists only of emotion labels.

Thus, while each *emotion string* captures the emotion dominance (top-$k$ by probability) within a segment, the concatenation of these emotion strings into the *base emotion corpus* captures the temporal sequence of those emotion strings. This distinction matters when deciding whether to generate base embeddings, extract them, or both.
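
The serialisation described above can be sketched as follows. This is an illustration only: the helper `emotion_string` and the probability values are hypothetical, not taken from the paper or from a real classifier run. The last line works out how many distinct ordered top-3 strings are possible with 7 labels (Ekman's six plus neutral).

```python
import math


def emotion_string(probabilities, k=3, sep="__"):
    """Join the k most probable emotion labels, most probable first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return sep.join(label for label, _ in ranked[:k])


# Hypothetical classifier outputs for two consecutive sentences
sentence_probs = [
    {"joy": 0.61, "surprise": 0.22, "fear": 0.09, "neutral": 0.05, "anger": 0.03},
    {"anger": 0.48, "neutral": 0.30, "sadness": 0.15, "joy": 0.07},
]

# One emotion string per sentence, kept in temporal order
base_emotion_corpus = [emotion_string(p) for p in sentence_probs]
print(base_emotion_corpus)  # ['joy__surprise__fear', 'anger__neutral__sadness']

# Upper bound on distinct ordered top-3 strings over 7 labels: 7 * 6 * 5
print(math.perm(7, 3))  # 210
```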

4. Base embedding extraction/generation:
    - Option 1: Extraction (Contextual Embeddings): Use a pretrained sentence transformer (e.g., all-MiniLM-L6-v2) to embed emotion strings, capturing semantic relationships between them: the embeddings will be a representation of the emotion combinations and _do not_ include any serial (temporal) interdependencies between the emotion label tokens (i.e. emotion strings).
    - Option 2: Generation (Distributional Embeddings): Train a Word2Vec model on the emotion string sequences to encode sequential dependencies (e.g., how "Joy" often precedes "Surprise").
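
To make Option 2 more concrete, the sketch below lists the (center, context) pairs that a skip-gram model is trained on when each whole emotion string is one token and the base emotion corpus is one long sequence. That reading of the method is an assumption, and the function `skipgram_pairs` and the example strings are hypothetical. With gensim, this setup would roughly correspond to passing `sentences=[emotion_corpus]` to `Word2Vec`.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical base emotion corpus: one emotion string per sentence, in order
emotion_corpus = [
    "joy__surprise__neutral",
    "surprise__joy__fear",
    "fear__sadness__neutral",
]

pairs = skipgram_pairs(emotion_corpus, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Because the context window runs across sentences, emotion strings that tend to follow one another end up with similar vectors, which is the serial dependency Option 1 does not capture.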

Both options can be carried out and combined in Step 1.2 (distributional embedding generation):

##### Step 1.2 Distributional Embeddings (from the base corpus)



![Distributional embeddings](https://ars.els-cdn.com/content/image/1-s2.0-S0925231225004941-gr5.jpg)



#### Step 2. Audio embeddings extraction (mode 2)

We use emotion2vec to extract high-level emotion embeddings from the audio file.

Note: We only publish these embeddings, not the raw audio. We will also look into further processing of the embeddings to minimize speaker-identifiable information (e.g. dimensionality reduction via UMAP).


#### Step 3. Multimodal Integration

1. Align audio and text segments using timestamps.
    * Segment Matching: Map audio segments (from emotion2vec) to text segments using timestamps (e.g., a 3-second audio chunk and its corresponding transcribed sentence).
    * Embedding Concatenation: For each aligned segment, concatenate:
        * Audio: Emotion2vec embedding (contextual acoustic features).
        * Text: Distributional emotion embedding (serial emotion patterns).

As we concatenate the audio emotion embedding and the corresponding text-based emotion embedding we form a multimodal feature vector.
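
A minimal sketch of this alignment and concatenation step is given below, assuming each segment is a dict with a `timestamp` pair and an `embedding`. The function names, the overlap-based matching rule, and the toy dimensions (4 and 3 standing in for the real audio and text embedding sizes) are assumptions made for illustration.

```python
import numpy as np


def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def fuse_segments(audio_segments, text_segments):
    """Pair each text segment with the audio segment it overlaps most, then concatenate."""
    fused = []
    if not audio_segments:
        return fused
    for text_seg in text_segments:
        best = max(
            audio_segments,
            key=lambda a: overlap(a["timestamp"], text_seg["timestamp"]),
        )
        if overlap(best["timestamp"], text_seg["timestamp"]) == 0.0:
            continue  # no audio segment covers this text segment
        # Flatten both embeddings so (1, d) and (d,) shapes concatenate cleanly
        vector = np.concatenate(
            [np.ravel(best["embedding"]), np.ravel(text_seg["embedding"])]
        )
        fused.append({"timestamp": text_seg["timestamp"], "embedding": vector})
    return fused


# Toy data with made-up values
audio = [
    {"timestamp": (0.0, 3.0), "embedding": np.ones(4)},
    {"timestamp": (3.0, 5.5), "embedding": np.zeros(4)},
]
text = [
    {"timestamp": (0.1, 2.9), "embedding": np.full(3, 0.5)},
    {"timestamp": (3.1, 5.4), "embedding": np.full((1, 3), 0.2)},
]

fused = fuse_segments(audio, text)
print([f["embedding"].shape for f in fused])
```

When audio segments are cut directly from the sentence timestamps, the two lists are already in one-to-one order and the overlap search reduces to a simple `zip`.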



# RESULTS

In [None]:
import librosa
import numpy as np
import matplotlib.pyplot as plt
from funasr import AutoModel
import pandas as pd
import soundfile as sf
import os
import json
import torch
from pathlib import Path


## Multimodal processing

## Step.0 transcript generation




## Step.1  Text Embedding Extraction/generation

The base corpus is tokenized into sentences (via `segment_sentences_with_timestamps()`). Now, [the paper](https://www.sciencedirect.com/science/article/pii/S0925231225004941) suggests that:

1. we use a pre-trained transformer to generate a dataset of labelled sentences *and*
2. we assume that the output of the classifier is our ground truth in terms of assigning emotion labels to sentences.



The textual sequence (our transcript) will then have a corresponding emotion-based vector representation.

We define the function `classify_emotions()` in 2 steps:

### Emotion classification
* Step 1: we use the pre-trained transformer ([Emotion English DistilRoBERTa-base](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base)) to extract the emotions for each sentence. This yields, for each sentence, an array of emotions with their probability values.

### Create Emotion Label Texts
* Step 2: We take the output of `classify_emotions()` and, sentence by sentence, interpret the probability value of each emotion in terms of emotion dominance and arrange the corresponding textual labels in a serially ordered layout, as if the labels were parts of a text. For each sentence in the transcript we therefore find the three most predominant emotions by probability (what the paper calls *"sort extracted emotions by their probability values"*). Subsequently, for each sentence, `classify_emotions()` generates an `emotion_sequences` text by joining, in order, the top 3 emotions with `__` separators. This step implements what [the paper](https://www.sciencedirect.com/science/article/pii/S0925231225004941) calls *"arrange emotion labels in a serially ordered layout as if the labels were parts of a text. The vocabulary of such a text is the emotion labels."*


#### Method A: Contextual Embedding Generation
For each emotion string, a transformer model (here, `distilbert-base-uncased`) generates a contextual embedding for the sequence, using the [CLS] token as the representation. The result is a set of contextualized vector representations for the serial emotion strings.
The final output is a sequence of embeddings, one per sentence/utterance, each representing the dominant emotions and their context in the sequence.

#### Method B: Distributional Embedding Generation

According to the paper, distributional representations require us to:

1. *"treat each sentence as an observation in a multivariate series of emotions"* 
2. *"transform the emotional flow of a text into a sequence of emotion strings."*

### Fusion with Semantic Embeddings
The paper’s final step is to combine these distributional emotion embeddings with semantic sentence embeddings (e.g., from MiniLM or RoBERTa applied to the sentence text itself), typically via weighted averaging or concatenation.
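
The two combination strategies can be sketched as below. This is not the paper's implementation: the function names, the L2 normalisation before combining, and the weight `alpha` are assumptions added for illustration. Note that weighted averaging only makes sense when both vectors have the same length, whereas concatenation does not need that.

```python
import numpy as np


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def combine(emotion_vec, semantic_vec, mode="concat", alpha=0.5):
    """Combine an emotion embedding with a semantic sentence embedding."""
    e, s = l2_normalize(emotion_vec), l2_normalize(semantic_vec)
    if mode == "concat":
        return np.concatenate([e, s])
    if mode == "average":
        if e.shape != s.shape:
            raise ValueError("weighted averaging needs vectors of equal length")
        # alpha weights the emotion embedding, (1 - alpha) the semantic one
        return alpha * e + (1.0 - alpha) * s
    raise ValueError(f"unknown mode: {mode}")


# Toy vectors with made-up values
emotion_vec = [3.0, 4.0]
semantic_vec = [0.0, 2.0]

concatenated = combine(emotion_vec, semantic_vec, mode="concat")
averaged = combine(emotion_vec, semantic_vec, mode="average", alpha=0.25)
print(concatenated.shape, averaged.shape)
```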
   

In [None]:
import nltk ## need pip install nltk
import re  # for regular expressions

nltk.download("punkt")
nltk.download('punkt_tab')



In [None]:
def load_corpus_data(text_file):
    """Convert plain text file into Whisper-like JSON structure
    
    Args:
        text_file (str): Path to text file
        
    Returns:
        dict: Text data formatted like Whisper JSON output
    """
    try:
        if not os.path.exists(text_file):
            print(f"Error: File not found: {text_file}")
            return None
            
        # Read the plain text
        with open(text_file, 'r', encoding='utf-8') as file:
            text = file.read().strip()
            
        # Pre-tokenize into sentences just once
        sentences = nltk.sent_tokenize(text)
        
        # Create Whisper-like chunks with estimated timestamps
        current_time = 0.0
        chunks = []
        
        for sent in sentences:
            duration = len(sent.split()) * 0.3  # Rough estimate
            chunks.append({
                'text': sent,
                'timestamp': [current_time, current_time + duration]
            })
            current_time += duration
            
        # Return in same format as Whisper JSON
        return {
            'text': text,
            'chunks': chunks
        }
                
    except Exception as e:
        print(f"Error loading text file: {str(e)}")
        return None

def segment_sentences_with_timestamps(whisper_output):
    """
    Segments transcript into sentences with aligned timestamps.
    Handles edge cases and maintains compatibility with emotion pipeline.
    
    Args:
        whisper_output (dict): CrisperWhisper JSON output with 'text' and 'chunks'
        
    Returns:
        list[dict]: List of {'text': str, 'timestamp': (start, end)} 
    """
    nltk.download('punkt', quiet=True)
    
    # Input validation
    if not whisper_output or 'text' not in whisper_output or 'chunks' not in whisper_output:
        return []
        
    text = whisper_output['text'].strip()
    chunks = whisper_output['chunks']
    
    if not text or not chunks:
        return []
    
    # Tokenize into sentences
    sentences = nltk.sent_tokenize(text)
    
    # Normalize chunk text (lowercase, strip punctuation and padding) for matching
    chunk_words = [re.sub(r'[^\w\s\']', '', c['text'].lower()).strip() for c in chunks]
    chunk_times = [c['timestamp'] for c in chunks]
    
    results = []
    chunk_idx = 0
    
    for sent in sentences:
        # Normalize sentence words the same way as the chunks
        sent_words = [re.sub(r'[^\w\s\']', '', w.lower()) for w in sent.split()]
        sent_len = len(sent_words)
        
        match_found = False
        
        # Search forward from the last matched position without consuming it,
        # so one unmatched sentence cannot exhaust the chunks for later ones
        search_idx = chunk_idx
        while search_idx <= len(chunk_words) - sent_len:
            # Check if the window of chunks matches the sentence words
            if chunk_words[search_idx:search_idx + sent_len] == sent_words:
                results.append({
                    'text': sent,
                    'timestamp': (
                        chunk_times[search_idx][0],                # Start time
                        chunk_times[search_idx + sent_len - 1][1]  # End time
                    )
                })
                chunk_idx = search_idx + sent_len
                match_found = True
                break
            search_idx += 1
            
        if not match_found:
            # No exact match (fuzzy matching is not implemented yet):
            # fall back to a 1-second placeholder after the last known end time
            last_end = results[-1]['timestamp'][1] if results else 0.0
            results.append({
                'text': sent,
                'timestamp': (last_end, last_end + 1.0)  # Estimate 1 second duration
            })
    
    return results

In [None]:
base_corpus_file = "../data/transcripts/penelope.txt"

# Load the target transcript file in JSON format (TARGET data)
transcript_file = "../data/transcripts/audiotest_json.json"
with open(transcript_file, 'r', encoding='utf-8') as file:
    transcription_data = json.load(file)


corpus_data = load_corpus_data(base_corpus_file)

# load_corpus_data() already returns sentence-level chunks with estimated
# timestamps, so word-level matching is only needed for the target transcript
corpus_sentences = corpus_data['chunks'] if corpus_data else []
target_sentences = segment_sentences_with_timestamps(transcription_data)

corpus_data
#target_sentences



## Step.2 Audio embeddings extraction


### Pre-processing
Preprocess the audio file to ensure it is in the correct format for emotion2vec (required sampling rate: 16 kHz).

In [None]:
# Load the audio file
target_file = "../data/audiotest.wav"




In [None]:
import librosa
import soundfile as sf

def preprocess_audio(audio_file, target_sr=16000):
    """
    Preprocess audio file to ensure it's in the correct format for emotion2vec
    
    Args:
        audio_file (str): Path to the audio file
        target_sr (int): Target sampling rate (emotion2vec requires 16kHz)
        
    Returns:
        str: Path to the processed audio file
    """
    # Load audio with its original sampling rate
    waveform, sr = librosa.load(audio_file, sr=None)
    
    # Check if resampling is needed
    if sr != target_sr:
        # Resample to 16kHz
        waveform = librosa.resample(waveform, orig_sr=sr, target_sr=target_sr)
        
        # Save the resampled audio to a temporary file
        temp_file = "temp_resampled.wav"
        sf.write(temp_file, waveform, target_sr)
        return temp_file
    
    return audio_file


In [None]:

audio_file_16KhZ = preprocess_audio(target_file, target_sr=16000)


In [None]:
## loading the emotion2vec_plus_large model
from funasr import AutoModel

# model="iic/emotion2vec_base"
# model="iic/emotion2vec_base_finetuned"
# model="iic/emotion2vec_plus_seed"
# model="iic/emotion2vec_plus_base" 
model_id = "iic/emotion2vec_plus_large"

model_audio = AutoModel(
    model=model_id,
    hub="hf",  # "ms" or "modelscope" for China mainland users; "hf" or "huggingface" for other overseas users
    disable_update=True
)

### loading EMO2VEC

[emotion2vec](https://huggingface.co/emotion2vec/emotion2vec_plus_large) is a speech emotion recognition foundation model that can extract emotional features directly from audio waveforms. The emotion2vec_plus_large model is the largest version, trained on 42,526 hours of data. It supports 9 emotion classes:

0. angry
1. disgusted
2. fearful
3. happy
4. neutral
5. other
6. sad
7. surprised
8. unknown



In [None]:
def process_audio_emotion_utterances(sentences, audio_file, model_audio, threshold=0.0):
    """
    Process audio segments into emotion sequences and embeddings, aligning with text segments.
    
    Args:
        sentences (list): List of segmented sentences with timestamps
        audio_file (str): Path to audio file
        model_audio: Loaded emotion2vec model
        threshold (float): Minimum probability threshold for emotion detection
        
    Returns:
        list: List of dicts containing:
            - text: Original sentence text
            - timestamp: (start, end) times
            - emotion_sequence: Concatenated top emotions 
            - emotion_scores: List of (emotion, score) tuples
            - embedding: emotion2vec embedding vector
    """
    
    # Load the waveform once (resampled to 16 kHz) instead of once per sentence
    sr = 16000
    waveform, _ = librosa.load(audio_file, sr=sr)
    
    def extract_audio_segment(start_time, end_time):
        """Extract precise audio segment matching text timestamp"""
        start_sample = int(start_time * sr)
        end_sample = int(end_time * sr)
        return waveform[start_sample:end_sample]
    
    audio_emotion_sequences = []
    
    for sentence in sentences:
        # 1. Get exact timestamp alignment
        start_time, end_time = sentence['timestamp']
        
        # 2. Extract aligned audio segment
        segment = extract_audio_segment(start_time, end_time)
        
        # Save segment for emotion2vec processing
        temp_path = "temp_segment.wav"
        sf.write(temp_path, segment, 16000)
        
        # 3. Generate emotion embeddings and labels
        result = model_audio.generate(
            temp_path,
            output_dir="./outputs",
            granularity="utterance",
            extract_embedding=True  # Get both embeddings and emotion scores
        )
        
        if 'scores' in result[0] and 'labels' in result[0]:
            # Process emotion labels
            scores = result[0]['scores']
            labels = result[0]['labels']
            english_labels = [label.split('/')[1] if '/' in label else label 
                            for label in labels]
            
            # Filter emotions by threshold
            emotion_scores = [
                (label, score) 
                for score, label in zip(scores, english_labels)
                if score >= threshold
            ]
            
            # Sort by score and take top 3
            emotion_scores.sort(key=lambda x: x[1], reverse=True)
            top_emotions = emotion_scores[:3]
            
            # Create emotion sequence string (this is optional I am not doing sequencing for audio atm)
            emotion_sequence = "__".join([e[0] for e in top_emotions])
            
            # Store results with embeddings
            audio_emotion_sequences.append({
                'text': sentence['text'],
                'timestamp': sentence['timestamp'],
                'emotion_sequence': emotion_sequence, #(this is optional I am not doing sequencing for audio atm)
                'emotion_scores': top_emotions,
                'embedding': result[0].get('feats', None)  # 1024-dim emotion embedding
            })
        
        # Cleanup
        if os.path.exists(temp_path):
            os.remove(temp_path)
            
    return audio_emotion_sequences

In [None]:
audio_sequences = process_audio_emotion_utterances(
    sentences=target_sentences,  # sentence segments of the target transcript
    audio_file=audio_file_16KhZ,
    model_audio=model_audio,
    threshold=0.1 # Minimum probability threshold for emotion detection
)

In [None]:
#audio_sequences

In [None]:
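# Note: this rebinds AutoModel to the transformers class, shadowing the funasr
# AutoModel imported earlier; model_audio must already be loaded at this point
from transformers import pipeline, AutoTokenizer, AutoModel
import torch
from gensim.models.word2vec import Word2Vec
import numpy as np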

def process_emotional_flow(sentences, use_distributional=True):
    """Process sentences into both distributional and contextual representations
    
    Args:
        sentences: List of sentence dictionaries with text and timestamps
        use_distributional: Whether to use Word2Vec (True) or transformer (False) embeddings
    """
    emotion_processor = EmotionDistributionalRepresentation()
    
    # Step 1: Create multivariate series
    emotion_series = emotion_processor.create_multivariate_series(sentences)
    
    # Step 2: Transform to emotion strings and get embeddings
    if use_distributional:
        emotion_sequences = emotion_processor.transform_to_emotion_strings(emotion_series, use_distributional=True)
    else:
        # Get emotion sequences without distributional embeddings
        emotion_sequences = emotion_processor.transform_to_emotion_strings(emotion_series, use_distributional=False)
        # Then apply transformer embeddings
        emotion_sequences = generate_emotion_embeddings(emotion_sequences)
        for seq in emotion_sequences:
            seq['embedding_type'] = 'contextual'
    
    return emotion_sequences

class EmotionDistributionalRepresentation:
    def __init__(self, model_name="j-hartmann/emotion-english-distilroberta-base"):
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.classifier = pipeline(
            "text-classification",
            model=model_name,
            return_all_scores=True,
            device=self.device
        )
        self.w2v_model = None
    
    def create_multivariate_series(self, sentences):
        """Treat each sentence as an observation in a multivariate series of emotions"""
        emotion_series = []
        for sentence in sentences:
            scores = self.classifier(sentence['text'])[0]
            emotion_vector = {score['label']: score['score'] for score in scores}
            emotion_series.append({
                'text': sentence['text'],
                'timestamp': sentence['timestamp'],
                'emotion_vector': emotion_vector
            })
        return emotion_series
    
    def transform_to_emotion_strings(self, emotion_series, use_distributional=True, top_k=3):
        """Transform emotional flow into sequence of emotion strings"""
        emotion_sequences = []
        all_sequences = []
        
        # First pass: collect all emotion sequences
        for entry in emotion_series:
            sorted_emotions = sorted(
                entry['emotion_vector'].items(),
                key=lambda x: x[1],
                reverse=True
            )
            top_emotions = sorted_emotions[:top_k]
            emotions = [emotion for emotion, _ in top_emotions]
            all_sequences.append(emotions)
            
            # Create basic sequence entry
            sequence_entry = {
                'text': entry['text'],
                'timestamp': entry['timestamp'],
                'emotion_sequence': "__".join(emotions),
                'emotion_vector': entry['emotion_vector'],
                'emotions': emotions
            }
            emotion_sequences.append(sequence_entry)
        
        if use_distributional:
            # Train Word2Vec on complete emotion sequence corpus
            self.w2v_model = Word2Vec(
                sentences=all_sequences,
                vector_size=300,
                window=2,  # Model direct emotion transitions
                min_count=1,
                sg=1,  # Skip-gram
                workers=4
            )
            
            # Add distributional embeddings
            for seq in emotion_sequences:
                seq['embedding'] = np.mean(
                    [self.w2v_model.wv[e] for e in seq['emotions']], 
                    axis=0
                )
                seq['embedding_type'] = 'distributional'
        
        return emotion_sequences
    
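    def get_emotion_transitions(self, emotion1, emotion2):
        """Get the cosine similarity between two emotion embeddings.

        This is a proxy for how closely two emotions co-occur,
        not a transition probability.
        """
        if self.w2v_model:
            return self.w2v_model.wv.similarity(emotion1, emotion2)
        return None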

def generate_emotion_embeddings(emotion_sequences):
    """Generate contextual embeddings using transformer"""
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased")
    
    for seq in emotion_sequences:
        inputs = tokenizer(seq['emotion_sequence'], return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        seq['embedding'] = outputs.last_hidden_state[:, 0, :].numpy()
    
    return emotion_sequences

In [None]:
# Process sentences into emotion sequences
text_emo_sequences = process_emotional_flow(target_sentences)
# Process audio segments into emotion sequences
audio_emo_sequences = process_audio_emotion_utterances(
    sentences=target_sentences,
    audio_file=audio_file_16KhZ,
    model_audio=model_audio
)

In [None]:

for text_seq, audio_seq in zip(text_emo_sequences, audio_emo_sequences):
    print(f"Sentence: {text_seq['text'][:50]}...")
    print(f"Text emotions: {text_seq['emotion_sequence']}")
    print(f"Audio emotions: {audio_seq['emotion_sequence']}\n")

#audio_emo_sequences

# Continue with embedding generation as before
#emotion_embeddings = generate_emotion_embeddings(emotion_sequences)
#emotion_embeddings

In [None]:
from IPython.display import Image, display

#https://www.sciencedirect.com/science/article/pii/S0925231225004941

# the authors use two emotions per sentence; we are using three
image_url = "https://ars.els-cdn.com/content/image/1-s2.0-S0925231225004941-gr2.jpg"
display(Image(url=image_url))

These labels describe the emotion state of each sentence. *"These emotion states are serial and interdependent and can be treated as novel tokens of a textual sequence consisting only of emotional categorizations - as if the elements in question are, after all, strings"*.

[The paper](https://www.sciencedirect.com/science/article/pii/S0925231225004941) specifically mentions that their approach "treats each sentence as an observation in a multivariate series of emotions" and transforms "the emotional flow of a text into a sequence of emotion strings." 

The key innovation proposed is treating emotions as having a "distributional layout" in text: the emotion sequences appear and alternate with each other in structured patterns of interrelations. Therefore, after we created emotion sequences that capture the dominant emotions for each sentence in order of probability and we turned them into text we use a transformer model to generate embeddings that capture the contextual relationships between emotions.

This method captures both the contextual and the sequential dependencies between emotional states as they unfold in the text. *"Unlike conventional sentiment classification models, which treat emotions as discrete outputs, the methodology we propose encodes sentiments as structured entities within an embedding space."*
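As an illustration of the serialization step described above, the following is a minimal sketch of how per-sentence classifier scores could be turned into emotion strings. The function name `build_emotion_string`, the example scores, and the choice of `top_k=3` (matching the three emotions per sentence used in this notebook) are assumptions for illustration; the `__` separator follows the examples used elsewhere in this notebook.

```python
def build_emotion_string(scores, top_k=3, sep="__"):
    """Join the top_k emotion labels, ordered by descending score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return sep.join(label for label, _ in ranked[:top_k])


# Hypothetical multi-label classifier outputs (label -> probability), one dict per sentence.
sentence_scores = [
    {"joy": 0.71, "sadness": 0.12, "anger": 0.09, "fear": 0.05},
    {"fear": 0.55, "joy": 0.25, "disgust": 0.15, "anger": 0.03},
]

emotion_strings = [build_emotion_string(s) for s in sentence_scores]
print(emotion_strings)  # ['joy__sadness__anger', 'fear__joy__disgust']
```

Keeping the labels in probability order means that the position of a label inside the string carries information, which the downstream embedding model can exploit.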

#### Embedding Extraction or Generation
After constructing an emotion-state-based corpus (where each sentence has been assigned emotion labels through a multi-label classifier), embeddings can be obtained through one of two alternative approaches:

1. **Extraction-based approach**: This uses pre-trained models to extract semantic representations of emotion strings. The paper explains: "To simply extract the embeddings means that one incorporates a pre-trained scheme to extract the more or less semantic representations of the given emotion strings". This leverages the contextual understanding already present in transformer-based models.
2. **Generation-based approach**: This involves training an embedding scheme from scratch on the serially ordered emotion states. As the paper notes: "to generate means to train an embedding scheme over the serially ordered emotion states". This approach captures the unique distributional patterns specific to emotion sequences.
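
To make the generation-based idea more concrete, here is a small sketch that learns one vector per emotion label directly from the serially ordered emotion states. It uses a count-based PPMI (positive pointwise mutual information) co-occurrence scheme as a simple stand-in; this is not the embedding scheme used in the paper, and the function name `train_ppmi_embeddings`, the window size, and the example strings are assumptions.

```python
import math
from collections import Counter


def train_ppmi_embeddings(emotion_strings, window=2, sep="__"):
    """Learn one PPMI vector per emotion label from ordered emotion strings."""
    # Flatten the per-sentence strings into a single ordered stream of labels.
    stream = [label for s in emotion_strings for label in s.split(sep)]
    vocab = sorted(set(stream))

    # Count how often each label co-occurs with its neighbours in a symmetric window.
    pair_counts = Counter()
    for i, target in enumerate(stream):
        lo, hi = max(0, i - window), min(len(stream), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(target, stream[j])] += 1

    total = sum(pair_counts.values())
    row_totals = Counter()
    for (target, _), count in pair_counts.items():
        row_totals[target] += count

    # PPMI(t, c) = max(0, log2(P(t, c) / (P(t) * P(c)))).
    embeddings = {}
    for target in vocab:
        vector = []
        for context in vocab:
            count = pair_counts[(target, context)]
            if count == 0:
                vector.append(0.0)
            else:
                pmi = math.log2(count * total / (row_totals[target] * row_totals[context]))
                vector.append(max(pmi, 0.0))
        embeddings[target] = vector
    return vocab, embeddings


# Hypothetical emotion strings, one per sentence, in document order.
vocab, embeddings = train_ppmi_embeddings(
    ["joy__sadness__anger", "joy__fear__disgust", "sadness__fear__anger"]
)
print(vocab)
print(len(embeddings["joy"]))
```

Each label ends up with a vector whose dimensions correspond to the other labels, so emotions that tend to appear in similar neighbourhoods receive similar vectors. A neural scheme trained on the same stream would follow the same principle with dense vectors.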


Here, with `generate_emotion_embeddings()`, we use the **extraction-based** approach, where *"one incorporates a pre-trained scheme* (in this case `distilbert-base-uncased`) *to extract the more or less semantic representations of the given emotion strings."*

1. We use DistilBERT to capture contextual relationships between the emotions in a sequence. The transformer's task is to extract contextual representations of the emotion sequences (e.g. "joy__sadness__anger", "joy__fear__disgust"). Its self-attention mechanism produces representations in which each emotion's embedding is influenced by the other emotions in the sequence: each self-attention layer relates the token "joy" to the surrounding tokens, which shapes how "joy" is represented in the vector space. The embedding for "joy" in "joy__sadness__anger" therefore differs from the one in "joy__fear__disgust", because the surrounding emotions provide different contexts.

2. We then take the CLS (**CL**a**S**sification) token embedding as the representation of the entire emotion sequence, leveraging the pre-trained transformer's ability to aggregate information from the whole sequence into a single vector. As described in the previous point, the model processes the emotion sequence through its attention mechanisms, where "each attention head can focus on different aspects of meaning" and "multiple transformer layers progressively refine the representation".
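
The effect of context on a single emotion token can be illustrated with a toy example. The sketch below is not DistilBERT: it applies one head of scaled dot-product self-attention with identity projections to random, hypothetical label vectors, purely to show that the same label receives a different vector depending on its neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = ["joy", "sadness", "anger", "fear", "disgust"]
# Toy context-free vectors, one per emotion label (random, for illustration only).
static = {label: rng.normal(size=8) for label in labels}


def self_attention(sequence):
    """Single-head scaled dot-product self-attention with identity projections."""
    x = np.stack([static[label] for label in sequence])       # (seq_len, dim)
    scores = x @ x.T / np.sqrt(x.shape[1])                    # pairwise similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)             # row-wise softmax
    return weights @ x                                        # context-mixed vectors


joy_a = self_attention(["joy", "sadness", "anger"])[0]
joy_b = self_attention(["joy", "fear", "disgust"])[0]

# The static vector for "joy" is identical in both sequences,
# but the attention output mixes in different neighbours.
print(np.allclose(joy_a, joy_b))
```

In a real transformer the queries, keys, and values are learned projections and several layers are stacked, but the mechanism by which neighbouring emotions reshape a token's representation is the same.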

   

In [None]:
### Generate Distributional Representations
from transformers import AutoTokenizer, AutoModel
import torch

def generate_emotion_embeddings(emotion_sequences):
    # Initialize transformer model for contextual representations
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased")
    
    embeddings = []
    
    for seq in emotion_sequences:
        # Tokenize emotion sequence
        inputs = tokenizer(seq['emotion_sequence'], return_tensors="pt", padding=True, truncation=True)
        
        # Generate embeddings
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Use CLS token embedding as sequence representation
        embedding = outputs.last_hidden_state[:, 0, :].numpy()
        
        seq['embedding'] = embedding
        embeddings.append(seq)
    
    return embeddings

def generate_emotion_embeddings_MM(emotion_sequences, audio_sequences):
    """Generate embeddings for text and audio emotion sequences combined"""
    # Initialize transformer model for contextual representations
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased")
    
    embeddings = []
    
    for text_seq, audio_seq in zip(emotion_sequences, audio_sequences):
        # Combine text and audio emotion sequences
        combined_sequence = text_seq['emotion_sequence'] + "__" + audio_seq['emotion_sequence']
        
        # Tokenize combined emotion sequence 
        inputs = tokenizer(combined_sequence, return_tensors="pt", padding=True, truncation=True)
        
        # Generate embeddings
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Use CLS token embedding as sequence representation
        embedding = outputs.last_hidden_state[:, 0, :].numpy()
        
        # Create combined sequence entry
        combined_entry = {
            'text': text_seq['text'],
            'timestamp': text_seq['timestamp'],
            'text_emotion_sequence': text_seq['emotion_sequence'],
            'audio_emotion_sequence': audio_seq['emotion_sequence'], 
            'combined_sequence': combined_sequence,
            'embedding': embedding,
            'text_vector': text_seq.get('emotion_vector'),
            'audio_scores': audio_seq.get('emotion_scores')
        }
        
        embeddings.append(combined_entry)
    
    return embeddings


In [None]:
emotion_embeddings = generate_emotion_embeddings(text_emo_sequences)

emotion_embeddings_MM = generate_emotion_embeddings_MM(text_emo_sequences, audio_emo_sequences)

This implementation:

* Maintains the paper's concept of "distributional layout" by:
  * Preserving sequential relationships in both modalities
  * Treating emotion sequences as interdependent series
  * Using contextual embeddings that capture inter-emotion relationships
* Creates a multimodal representation that:
  * Keeps both text and audio emotion sequences aligned
  * Preserves the temporal relationship between modalities
  * Enables analysis of how emotions manifest differently in text vs audio
  * Uses the CLS token embedding as recommended to aggregate sequence information
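
One simple way to examine how emotions differ between the two modalities is to measure the label overlap per segment. The sketch below computes a Jaccard overlap between the text and audio emotion strings; the function name `emotion_overlap` and the example entries are hypothetical, and it assumes both classifiers use the same label names (if they do not, for instance "joy" versus "happy", the labels would need to be mapped to a shared vocabulary first).

```python
def emotion_overlap(text_sequence, audio_sequence, sep="__"):
    """Jaccard overlap between the label sets of two emotion strings."""
    text_labels = set(text_sequence.split(sep))
    audio_labels = set(audio_sequence.split(sep))
    union = text_labels | audio_labels
    return len(text_labels & audio_labels) / len(union) if union else 0.0


# Hypothetical entries mimicking the fields produced by generate_emotion_embeddings_MM.
entries = [
    {"text_emotion_sequence": "joy__sadness__anger",
     "audio_emotion_sequence": "joy__neutral__sadness"},
    {"text_emotion_sequence": "fear__disgust__anger",
     "audio_emotion_sequence": "joy__neutral__surprise"},
]

overlaps = [
    emotion_overlap(e["text_emotion_sequence"], e["audio_emotion_sequence"])
    for e in entries
]
print(overlaps)  # [0.5, 0.0]
```

Segments with low overlap are candidates for closer inspection, since they are the ones where what is said and how it is said diverge.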

In [None]:
def display_emotion_analysis(emotion_embeddings_MM):
    """Display a detailed summary of multimodal emotion analysis"""
    for entry in emotion_embeddings_MM:
        print("\n" + "="*80)
        print(f"Sentence: {entry['text'][:100]}...")
        print("-"*80)
        
        # Display timestamps
        start, end = entry['timestamp']
        print(f"Time window: {start:.2f}s - {end:.2f}s (duration: {end-start:.2f}s)")
        
        # Display emotion sequences
        print("\nText Emotions:", entry['text_emotion_sequence'])
        print("Audio Emotions:", entry['audio_emotion_sequence'])
        print("Combined:", entry['combined_sequence'])
        
        # Display top text emotions with scores
        print("\nText Emotion Scores:")
        for emotion, score in sorted(entry['text_vector'].items(), key=lambda x: x[1], reverse=True)[:3]:
            print(f"  {emotion:<10}: {score:.3f}")
        
        # Display top audio emotions with scores
        print("\nAudio Emotion Scores:")
        for emotion, score in entry['audio_scores']:
            print(f"  {emotion:<10}: {score:.3f}")
        
        # Show embedding dimensionality
        print(f"\nEmbedding shape: {entry['embedding'].shape}")

# Display the analysis
display_emotion_analysis(emotion_embeddings_MM)

## Temporal Emotion Analysis


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def visualize_emotion_flow(emotion_embeddings):
    # Extract timestamps and emotions
    timestamps = [e['timestamp'] for e in emotion_embeddings]
    emotion_labels = [e['emotion_scores'][0]['label'] for e in emotion_embeddings]
    
    # Create time points (midpoint of each sentence)
    time_points = [(t[0] + t[1])/2 for t in timestamps]
    
    # Create a mapping of emotions to colors
    unique_emotions = list(set(emotion_labels))
    colors = plt.cm.tab10(np.linspace(0, 1, len(unique_emotions)))
    emotion_colors = {emotion: colors[i] for i, emotion in enumerate(unique_emotions)}
    
    # Plot emotions over time
    plt.figure(figsize=(12, 6))
    
    for t, emotion in zip(time_points, emotion_labels):
        plt.scatter(t, 1, color=emotion_colors[emotion], s=100)
        plt.text(t, 1.05, emotion, rotation=45, ha='center')
    
    # Add sentence text as annotations
    for i, embedding in enumerate(emotion_embeddings):
        plt.annotate(embedding['text'][:30] + '...', 
                     xy=(time_points[i], 0.9),
                     xytext=(time_points[i], 0.7),
                     arrowprops=dict(arrowstyle='->'),
                     ha='center')
    
    plt.yticks([])
    plt.xlabel('Time (seconds)')
    plt.title('Emotion Flow Over Time')
    plt.tight_layout()
    plt.show()


In [None]:
visualize_emotion_flow(emotion_embeddings)

## Alternative using multimodal embeddings

# Audio: Sharing Embeddings Instead of Raw Recordings
**Privacy-Preserving Embeddings:**
Research shows that audio embeddings can be designed to retain emotion-relevant features while suppressing speaker identity and other biometric information. Techniques include adversarial training and feature-importance-based modification of embeddings, which enable emotion recognition without exposing speaker identity or allowing the original audio signal to be reconstructed.

**Utility and Compliance:**
Such embeddings are considered privacy-enabled and utility-preserving: they support emotion detection while reducing the risks associated with voiceprints and biometric data. This approach also supports compliance with GDPR and similar regulations, as it avoids sharing personally identifiable information.

# Text: Anonymization Strategies
**Automated Anonymization Tools:**
Use transformer-based models (e.g., BERT, GPT-2) or specialized tools like Microsoft Presidio to detect and redact personal information (names, locations, dates, etc.) from transcripts. These tools can replace sensitive data with placeholders, so that the text remains useful for analysis but is much harder to trace back to individuals.
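
The placeholder-replacement idea can be sketched with a few regular expressions. This is a deliberately simple stand-in, not Presidio or a transformer-based detector: the patterns, placeholder names, and the `redact` function are assumptions, and they only cover e-mail addresses, numeric dates, and phone numbers. Names and locations would still require a named-entity recognition step.

```python
import re

# Illustrative patterns only; order matters (dates are replaced before phone numbers).
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "DATE": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace each matched span with a placeholder such as <EMAIL>."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(f"<{placeholder}>", text)
    return text


sample = "Call me on +44 20 7946 0958 or write to ana@example.org before 12/05/2024."
print(redact(sample))
# Call me on <PHONE> or write to <EMAIL> before <DATE>.
```

Typed placeholders (rather than blank deletions) keep the sentence structure intact, which matters here because the redacted transcript is still passed to the text emotion classifier.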

**Manual Review and Metadata Scrubbing:**
After automated anonymization, manually review the transcripts for indirect identifiers (e.g., rare job titles, unique life events) and remove or generalize them as needed. Also strip or generalize metadata (e.g., device, location, demographics) to further reduce the risk of re-identification.