<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# DSI-SG-42 Capstone Project:
### FeelFlow AI: Decoding Emotions, Advancing Patient Support

---

### **Background**

In Singapore, the urgency to address mental health issues among younger generations, particularly GenZ and millennials, is critical due to increasing pressures from work, school, and personal relationships leading to anxiety, depression, and substance abuse. Recognizing this, the Ministry of Health and AI Singapore (NUS) have initiated the "Mental Health with AI" Seminar to integrate AI technologies with clinical practices, enhancing therapeutic processes.

The aims of this study is to develop a real-time emotion predictor app. The objective is to alleviate the layer of assessing patients' emotional well-being, which is crucial in enabling a more accurate diagnosis and treatment from. The app is in its beta stages, but seeks to be presented at the seminar for. Further discussions to adoption and integration into pre-existing app/softwares can be opened during this seminar. 

### **Problem Statement**
##### *Where discerning people’s emotion can sometimes be an unnerving guessing game. How can clinicians use speech emotion recognition technology to accurately assess patients' emotional well-being, thereby improving diagnosis and treatment outcomes?*

### **Table of Contents**

### 2. [Preprocessing the Data](#preprocessing-the-data)
   #### 2.1 [Data Augmentation](#data-augmentation)
   ##### 2.1.1 [CREMA-D (Seen)](#crema-d-seen)
   ##### 2.1.2 [YouTube dataset (Unseen)](#youtube-dataset-unseen)
   ##### 2.1.3 [ESD data (Seen + Unseen)](#esd-data-seen-unseen)
   ##### 2.1.3.1 [ESD (Seen)](#esd-seen)
   ##### 2.1.3.2 [ESD (Unseen)](#esd-unseen)
   ##### 2.1.4 [TESS dataset (Seen)](#tess-dataset-seen)
   #### 2.2 [Label Mapping](#label-mapping)
   ##### 2.2.1 [CREMA-D (Seen)](#crema-d-seen-label)
   ##### 2.2.2 [ESD](#esd-label)
   ##### 2.2.2.1 [ESD (Seen)](#esd-seen-label)
   ##### 2.2.2.2 [ESD (Unseen)](#esd-unseen-label)
   ##### 2.2.3 [TESS dataset (Seen)](#tess-dataset-seen-label)
   #### 2.3 [Feature Extraction](#feature-extraction)
   ##### 2.3.1 [CREMA-D (Seen)](#crema-d-seen-feature)
   ##### 2.3.2 [YouTube dataset (Unseen)](#youtube-dataset-unseen-feature)
   ##### 2.3.3 [ESD](#esd-feature)
   ##### 2.3.3.1 [ESD (Seen)](#esd-seen-feature)
   ##### 2.3.3.2 [ESD (Unseen)](#esd-unseen-feature)
   ##### 2.3.4 [TESS dataset (Seen)](#tess-dataset-seen-feature)
   #### 2.4 [Combine Dataset (Seen)](#combine-dataset-seen)

## **2. Preprocessing the Data**<a id='preprocessing-the-data'></a>

##### *Note: It is advisable to run this notebook on Python 3.9.6.*

### For our study, we will look at 4 datasets in total:

#### 1) [Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D)](https://github.com/CheyneyComputerScience/CREMA-D)
    
CREMA-D is a data set of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities (African America, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences. The sentences were presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) and four different emotion levels (Low, Medium, High, and Unspecified).

For the simplicity of the study, we will not look into the emotion levels as a feature/variable.

Data can be downloaded [here](https://www.kaggle.com/datasets/ejlok1/cremad).

#### 2) YouTube

This includes audio data extracted from YouTube videos, documenting respondents sharing their experience with the mental health struggles. More details can be found in the [ReadMe](README.md) or [previous notebook](code/02_Preprocessing.ipynb). 

#### 3) [Toronto Emotional Speech Set (TESS)](https://tspace.library.utoronto.ca/handle/1807/24487)

TESS consist of 2,800 2-sec audio clips, with a set of 200 target words were spoken in the carrier phrase "Say the word _' by two actresses (aged 26 and 64 years) and recordings were made of the set portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral).

The dataset is organised such that each of the two female actor and their emotions are contain within its own folder. And within that, all 200 target words audio file can be found. 

Data can be downloaded [here](https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess).

#### 4) [Emotion Speech Dataset (ESD)](https://github.com/HLTSingapore/Emotional-Speech-Data)

The ESD database has 35,000 2-sec audio clips, and was recorded with the aim to provide the community with a large emotional speech database with a sufficient variety of speaker and lexical coverage.

The ESD database consists of a total of 29 hours of audio recordings from 10 native English speakers and 10 native Chinese speakers, covering 5 different emotion categories (neutral, happy, angry, sad and surprise). It represents one of the largest emotional speech databases publicly available, in terms of speakers and lexical variability. All the recordings are conducted in the studio with professional devices to guarantee audio quality.

Data can be downloaded [here](https://drive.google.com/file/d/1Ypu1lQ5ThmslmCGhFZyq2vFkLmVqy9uS/view?usp=share_link).

Thereafter, we will split them into 2 groups - Seen and Unseen Data. 
* Seen Data - This includes the data that have been trained on before, in previous studies, and the labels are predefined by the data collectors. The reason for combining dataset is to ensure a more robust model training, and thereby prediction of emotions through speech.
    - CREMA-D
    - TESS
    - ESD 
* Unseen Data - On the flipside, the unseen data is one that has never been trained on before. The only exception will be that the we'll use a pre-allocated evaluation data from ESD (`ESD_eval`) as a medium of validating our model, after predicting the YouTube data.
    - YouTube 
    - `ESD_eval`

### Importing Libraries

In [None]:
import librosa
import numpy as np
import os
import pandas as pd
import soundfile as sf

### **2.1 Data Augmentation**<a id='data-augmentation'></a>

Audio data Augmentation is important in especially our use case, where speech is analysed either offline or online, this is to ensure:

1) Robustness: Audio augmentations introduce controlled variability into training data, which helps in developing models that are robust and can generalize well across different acoustic environments and recording conditions.

2) Data Augmentation: Especially in cases where the amount of training data is limited, augmentations effectively increase the dataset size, providing more training examples and helping prevent overfitting.

3) Variability and Generalization: By simulating various real-world conditions, augmentations ensure that the model can handle a wide range of audio inputs, making it more effective in practical applications.

The type of audio augmentation we seek to introduce here is the injection of Noise, Time Stretch, Time Shifting and Pitch Shifting. The following is an explanation of each augmentation type, and what we have done in with our audio data.

* Noise Addition (`noise(data)`)
    - This simulates real-world scenarios where background noise is present, training the model to be robust against noisy environments. We have added random noise to the audio signal. The amplitude of the noise is a fraction (up to 3.5%) of the maximum amplitude of the audio signal, ensuring that the noise level is substantial but not overwhelming.

* Time Stretching (`stretch(data)`)
    - Helps the model learn to recognize emotions regardless of slight variations in speech speed, which can vary from person to person or due to emotional state. We stretche or compresses the audio signal by a random factor between 95% and 105% of the original length.

* Time Shifting (`shift(data)`)
    - Introduces a variation in the temporal location of speech within the audio files, mimicking the effect of speaking earlier or later within a recording window. We have circularly shift the audio data by a random number of samples in the range of -300 to 300. 

* Pitch Shifting (`pitch(data, sr)`)
    - Accounts for variations in pitch, which can be influenced by the speaker's mood, age, gender, or emotional state, thereby training the model to be invariant to pitch changes. Here we modify the pitch of the audio signal by a random number of semitones, ranging from -2 to 2.

#### **2.1.1 CREMA-D (Seen)**<a id='crema-d-seen'></a>

In [None]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)  # Reduced noise amplitude
    return data + noise_amp * np.random.normal(size=data.shape[0])

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05)  # Less aggressive range
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def shift(data):
    shift_range = int(np.random.uniform(low=-300, high=300))  # Reduced shift range
    return np.roll(data, shift_range)

def pitch(data, sr):
    pitch_factor = np.random.uniform(low=-2, high=2)  # Less pitch variation
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def augment_audio(file_path, output_folder, file_name):
    audio, sr = librosa.load(file_path, sr=None)
    audio = librosa.util.normalize(audio)

    # Apply augmentations
    audio = noise(audio)
    audio = stretch(audio)
    audio = shift(audio)
    audio = pitch(audio, sr)

    # Saving the augmented audio
    output_file_path = os.path.join(output_folder, f"aug_{file_name}")
    sf.write(output_file_path, audio, sr)

def process_crema_files(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    for filename in os.listdir(input_folder):
        if filename.endswith(".wav"):
            path = os.path.join(input_folder, filename)
            augment_audio(path, output_folder, filename)

input_folder = '../dataset/CREMA'
output_folder = '../dataset/CREMA_aug'
process_crema_files(input_folder, output_folder) # ignore the error (files will be provided in the dataset folder)

FileNotFoundError: [Errno 2] No such file or directory: '../dataset/CREMA'

#### **2.1.2 YouTube dataset (Unseen)**<a id='youtube-dataset-unseen'></a>

In [103]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)  # Reduced noise amplitude
    return data + noise_amp * np.random.normal(size=data.shape[0])

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05)  # Less aggressive range
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def shift(data):
    shift_range = int(np.random.uniform(low=-300, high=300))  # Reduced shift range
    return np.roll(data, shift_range)

def pitch(data, sr):
    pitch_factor = np.random.uniform(low=-2, high=2)  # Less pitch variation
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def augment_audio(file_path, output_folder, file_name):
    audio, sr = librosa.load(file_path, sr=None)
    audio = librosa.util.normalize(audio)

    # Apply augmentations
    audio = noise(audio)
    audio = stretch(audio)
    audio = shift(audio)
    audio = pitch(audio, sr)

    # Saving the augmented audio
    output_file_path = os.path.join(output_folder, f"aug_{file_name}")
    sf.write(output_file_path, audio, sr)

def process_audio_files(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    for filename in os.listdir(input_folder):
        if filename.endswith(".wav"):
            path = os.path.join(input_folder, filename)
            augment_audio(path, output_folder, filename)

input_folder = '../dataset/YouTube'
output_folder = '../dataset/YouTube_aug'
process_audio_files(input_folder, output_folder)

#### **2.1.3 ESD data (Seen + Unseen)**<a id='esd-data-seen-unseen'></a>

##### **2.1.3.1 ESD (Seen)**<a id='esd-seen'></a>

This includes the pre-allocated training and test data, of 33,000 original 2-sec clips.

In [104]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)  # Reduced noise amplitude
    return data + noise_amp * np.random.normal(size=data.shape[0])

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05)  # Less aggressive range
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def shift(data):
    shift_range = int(np.random.uniform(low=-300, high=300))  # Reduced shift range
    return np.roll(data, shift_range)

def pitch(data, sr):
    pitch_factor = np.random.uniform(low=-2, high=2)  # Less pitch variation
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def augment_audio(file_path, output_folder, file_name):
    audio, sr = librosa.load(file_path, sr=None)
    audio = librosa.util.normalize(audio)

    # Apply augmentations
    audio = noise(audio)
    audio = stretch(audio)
    audio = shift(audio)
    audio = pitch(audio, sr)

    # Saving the augmented audio
    output_file_path = os.path.join(output_folder, f"aug_{file_name}")
    sf.write(output_file_path, audio, sr)

def process_audio_files(input_folder, output_folder):
    for actor_folder in os.listdir(input_folder):
        actor_path = os.path.join(input_folder, actor_folder)
        if os.path.isdir(actor_path):
            for emotion_folder in os.listdir(actor_path):
                emotion_path = os.path.join(actor_path, emotion_folder)
                if os.path.isdir(emotion_path):
                    output_emotion_folder = os.path.join(output_folder, actor_folder, emotion_folder)
                    if not os.path.exists(output_emotion_folder):
                        os.makedirs(output_emotion_folder)
                    for file in os.listdir(emotion_path):
                        if file.endswith(".wav"):
                            file_path = os.path.join(emotion_path, file)
                            augment_audio(file_path, output_emotion_folder, file)

input_folder = '../dataset/ESD'
output_folder = '../dataset/ESD_aug'
process_audio_files(input_folder, output_folder)

##### **2.1.3.2 ESD (Unseen)**<a id='esd-unseen'></a>

This includes the pre-allocated evaluation data, of 2,000 original 2-sec clips.

In [41]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)  # Reduced noise amplitude
    return data + noise_amp * np.random.normal(size=data.shape[0])

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05)  # Less aggressive range
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def shift(data):
    shift_range = int(np.random.uniform(low=-300, high=300))  # Reduced shift range
    return np.roll(data, shift_range)

def pitch(data, sr):
    pitch_factor = np.random.uniform(low=-2, high=2)  # Less pitch variation
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def augment_audio(file_path, output_folder, file_name):
    audio, sr = librosa.load(file_path, sr=None)
    audio = librosa.util.normalize(audio)

    # Apply augmentations
    audio = noise(audio)
    audio = stretch(audio)
    audio = shift(audio)
    audio = pitch(audio, sr)

    # Saving the augmented audio
    output_file_path = os.path.join(output_folder, f"aug_{file_name}")
    sf.write(output_file_path, audio, sr)

def process_audio_files(input_folder, output_folder):
    for actor_folder in os.listdir(input_folder):
        actor_path = os.path.join(input_folder, actor_folder)
        if os.path.isdir(actor_path):
            for emotion_folder in os.listdir(actor_path):
                emotion_path = os.path.join(actor_path, emotion_folder)
                if os.path.isdir(emotion_path):
                    output_emotion_folder = os.path.join(output_folder, actor_folder, emotion_folder)
                    if not os.path.exists(output_emotion_folder):
                        os.makedirs(output_emotion_folder)
                    for file in os.listdir(emotion_path):
                        if file.endswith(".wav"):
                            file_path = os.path.join(emotion_path, file)
                            augment_audio(file_path, output_emotion_folder, file)

input_folder = '../dataset/ESD_eval'
output_folder = '../dataset/ESD_eval_aug'
process_audio_files(input_folder, output_folder)

#### **2.1.4 TESS dataset (Seen)**<a id='tess-dataset-seen'></a>

In [106]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)  # Adjusted noise amplitude
    return data + noise_amp * np.random.normal(size=data.shape[0])

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05)  # Adjusted stretch range
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def shift(data):
    shift_range = int(np.random.uniform(low=-300, high=300))  # Adjusted shift range
    return np.roll(data, shift_range)

def pitch(data, sr):
    pitch_factor = np.random.uniform(low=-2, high=2)  # Adjusted pitch variation
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def augment_audio(file_path, output_folder, file_name):
    audio, sr = librosa.load(file_path, sr=None)
    audio = librosa.util.normalize(audio)

    # Apply augmentations
    audio = noise(audio)
    audio = stretch(audio)
    audio = shift(audio)
    audio = pitch(audio, sr)

    # Saving the augmented audio
    output_file_path = os.path.join(output_folder, f"aug_{file_name}")
    sf.write(output_file_path, audio, sr)

def process_tess_files(input_folder, output_folder):
    for age_group in os.listdir(input_folder):  # 'OAF' or 'YAF'
        age_path = os.path.join(input_folder, age_group)
        if os.path.isdir(age_path):
            for emotion_folder in os.listdir(age_path):
                emotion_path = os.path.join(age_path, emotion_folder)
                if os.path.isdir(emotion_path):
                    output_emotion_folder = os.path.join(output_folder, age_group, emotion_folder)
                    if not os.path.exists(output_emotion_folder):
                        os.makedirs(output_emotion_folder)
                    for file in os.listdir(emotion_path):
                        if file.endswith(".wav"):
                            file_path = os.path.join(emotion_path, file)
                            augment_audio(file_path, output_emotion_folder, file)

# Set the input and output folders for TESS
input_folder = '../dataset/TESS'
output_folder = '../dataset/TESS_aug'
process_tess_files(input_folder, output_folder)

### **2.2 Label Mapping**<a id='label-mapping'></a>

Although most of the datasets (except YouTube) already have pre-defined labels, their labels are may vary to each other.

This section aims to standardise the mapping of the emotion labels to the CREMA-D. The rationale for choosing this is because the CREMA-D data provides a breadth of emotions commonly expressed by people. Emotions like 'Surprise' or 'Pleasant Surprise' - where evident in ESD and TESS respectively - do not fully capture an emotion type. Hence, we mapped it to the closest we thought it would be - `Neutral` and `Happy` respectively.

We will only be Label Mapping the *augmented* Seen Data and the `ESD_eval` (Unseen Data).

#### **2.2.1 CREMA-D (Seen)**<a id='crema-d-seen-label'></a>

In [16]:
# Define the directory containing the CREMA dataset
crema_directory = '../dataset/CREMA_aug'

# List all files in the directory
crema_directory_list = os.listdir(crema_directory)

# Lists to hold file emotions and file paths
file_emotion = []
file_path = []

# Populate the lists
for file in crema_directory_list:
    # Construct the full file path
    full_path = os.path.join(crema_directory, file)
    file_path.append(full_path)

    # Extract the emotion part from the filename
    parts = file.split('_')
    if len(parts) > 3:
        emotion_code = parts[3][:3]  # The emotion code is typically at this position
        emotion = {
            'NEU': 'Neutral',
            'ANG': 'Angry',
            'SAD': 'Sad',
            'FEA': 'Fear',
            'HAP': 'Happy',
            'DIS': 'Disgust'
        }.get(emotion_code, 'Unknown')  # Default to 'Unknown' if not found
        file_emotion.append(emotion)
    else:
        file_emotion.append('Unknown')

# Create DataFrames from the lists
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
path_df = pd.DataFrame(file_path, columns=['Path'])

# Concatenate the DataFrames to a single DataFrame
crema_df = pd.concat([emotion_df, path_df], axis=1)
# Extract only the filename from the full path and update the 'Path' column
crema_df['Path'] = crema_df['Path'].apply(lambda x: os.path.basename(x))

# Save the DataFrame to a CSV file
crema_df.to_csv('../csv/CREMA_Dataset.csv', index=False)

In [108]:
crema_df.head()

Unnamed: 0,Emotions,Path
0,Happy,aug_1075_TAI_HAP_XX.wav
1,Sad,aug_1051_IEO_SAD_MD.wav
2,Sad,aug_1044_IEO_SAD_MD.wav
3,Happy,aug_1060_TAI_HAP_XX.wav
4,Sad,aug_1005_IWL_SAD_XX.wav


#### **2.2.2 ESD**<a id='esd-label'></a>

##### **2.2.2.1 ESD (Seen)**<a id='esd-seen-label'></a>

In [44]:
# Label Mapping for the emotions in the ESD dataset
label_mapping = {
    'Angry': 'Angry',
    'Happy': 'Happy',
    'Neutral': 'Neutral',
    'Sad': 'Sad',
    'Surprise': 'Neutral'  # Mapping 'Surprise' to 'Neutral'
}

def map_emotion(emotion_folder):
    return label_mapping.get(emotion_folder, 'Unknown')

def process_audio_files(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    file_emotion = []
    file_path = []

    for root, dirs, files in os.walk(input_folder):
        for file in files:
            if file.endswith(".wav"):
                full_path = os.path.join(root, file)
                file_path.append(full_path)

                # Extract the emotion part from the directory structure
                parts = root.split(os.sep)
                emotion_folder = parts[-1]  # Assuming the emotion is the last part of the path
                mapped_emotion = map_emotion(emotion_folder)
                file_emotion.append(mapped_emotion)

                if mapped_emotion == 'Unknown':
                    print(f"Unmapped emotion from path: {root}")  # Debug: print unmapped path

    # Create DataFrames from the lists
    emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
    path_df = pd.DataFrame(file_path, columns=['Path'])

    # Concatenate the DataFrames into a single DataFrame
    esd_df = pd.concat([emotion_df, path_df], axis=1)

    # Save the DataFrame to a CSV file
    esd_df.to_csv(f'{output_folder}/ESD_Dataset.csv', index=False)

# Define paths
input_folder = '../dataset/ESD_aug'  # Make sure this is the correct path
output_folder = '../csv'
process_audio_files(input_folder, output_folder)

# Load the generated file to check its content
esd_df = pd.read_csv(f'{output_folder}/ESD_Dataset.csv')
print(esd_df.head())

In [52]:
esd_df = pd.read_csv(f'{output_folder}/ESD_Dataset.csv')
esd_df.head()

Unnamed: 0,Emotions,Path
0,Happy,../dataset/ESD_aug/0003/Happy/aug_0003_000957.wav
1,Happy,../dataset/ESD_aug/0003/Happy/aug_0003_000943.wav
2,Happy,../dataset/ESD_aug/0003/Happy/aug_0003_000994.wav
3,Happy,../dataset/ESD_aug/0003/Happy/aug_0003_000758.wav
4,Happy,../dataset/ESD_aug/0003/Happy/aug_0003_000980.wav


##### **2.2.2.2 ESD (Unseen)**<a id='esd-unseen-label'></a>

In [48]:
# Label Mapping for the emotions in the ESD_eval dataset
label_mapping = {
    'Angry': 'Angry',
    'Happy': 'Happy',
    'Neutral': 'Neutral',
    'Sad': 'Sad',
    'Surprise': 'Neutral'
}

def map_emotion(emotion_folder):
    return label_mapping.get(emotion_folder, 'Unknown')

def process_audio_files(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    file_emotion = []
    file_path = []

    for root, dirs, files in os.walk(input_folder):
        for file in files:
            if file.endswith(".wav"):
                full_path = os.path.join(root, file)
                file_path.append(full_path)

                # Extract the emotion part from the directory structure
                parts = root.split(os.sep)
                emotion_folder = parts[-1]  # Assuming the emotion is the last part of the path
                mapped_emotion = map_emotion(emotion_folder)
                file_emotion.append(mapped_emotion)

                if mapped_emotion == 'Unknown':
                    print(f"Unmapped emotion from path: {root}")  # Debug: print unmapped path

    # Create DataFrames from the lists
    emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
    path_df = pd.DataFrame(file_path, columns=['Path'])

    # Concatenate the DataFrames into a single DataFrame
    esd_eval_df = pd.concat([emotion_df, path_df], axis=1)

    # Save the DataFrame to a CSV file
    esd_eval_df.to_csv(f'{output_folder}/ESD_eval_Dataset.csv', index=False)

# Define paths
input_folder = '../dataset/ESD_eval_aug'  # Make sure this is the correct path
output_folder = '../csv'
process_audio_files(input_folder, output_folder)

In [66]:
esd_eval_df = pd.read_csv(f'{output_folder}/ESD_eval_Dataset.csv')
esd_eval_df.head()

Unnamed: 0,Emotions,Path
0,Happy,../dataset/ESD_eval_aug/0003/Happy/aug_0003_00...
1,Happy,../dataset/ESD_eval_aug/0003/Happy/aug_0003_00...
2,Happy,../dataset/ESD_eval_aug/0003/Happy/aug_0003_00...
3,Happy,../dataset/ESD_eval_aug/0003/Happy/aug_0003_00...
4,Happy,../dataset/ESD_eval_aug/0003/Happy/aug_0003_00...


#### **2.2.3 TESS dataset (Seen)**<a id='tess-dataset-seen-label'></a>

In [68]:
# Label Mapping for the emotions in the TESS dataset
label_mapping = {
    'angry': 'Angry',
    'happy': 'Happy',
    'neutral': 'Neutral',
    'sad': 'Sad',
    'disgust': 'Disgust',
    'fear': 'Fear',
    'pleasantsurprise': 'Happy'  # Assuming 'ps' stands for pleasant surprise
}

def map_emotion(emotion_folder):
    # Extract the emotion part and map it
    emotion_part = emotion_folder.split('_')[-1].lower()  # Get the last part after '_'
    return label_mapping.get(emotion_part, 'Unknown')

def process_audio_files(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    file_emotion = []
    file_path = []

    for root, dirs, files in os.walk(input_folder):
        for file in files:
            if file.endswith(".wav"):
                full_path = os.path.join(root, file)
                file_path.append(full_path)

                # Extract the emotion part from the directory structure
                parts = root.split(os.sep)
                emotion_folder = parts[-2] + '_' + parts[-1]  # Combining voice type and emotion
                mapped_emotion = map_emotion(emotion_folder)
                file_emotion.append(mapped_emotion)

                if mapped_emotion == 'Unknown':
                    print(f"Unmapped emotion from path: {root}")  # Debug: print unmapped path

    # Create DataFrames from the lists
    emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
    path_df = pd.DataFrame(file_path, columns=['Path'])

    # Concatenate the DataFrames into a single DataFrame
    tess_df = pd.concat([emotion_df, path_df], axis=1)

    # Save the DataFrame to a CSV file
    tess_df.to_csv(f'{output_folder}/TESS_Dataset.csv', index=False)

# Define paths
input_folder = '../dataset/TESS_aug'  # Adjust as necessary
output_folder = '../csv'
process_audio_files(input_folder, output_folder)

In [70]:
tess_df = pd.read_csv(f'{output_folder}/TESS_Dataset.csv')
tess_df.head()

Unnamed: 0,Emotions,Path
0,Happy,../dataset/TESS_aug/OAF/OAF_happy/aug_OAF_vine...
1,Happy,../dataset/TESS_aug/OAF/OAF_happy/aug_OAF_seiz...
2,Happy,../dataset/TESS_aug/OAF/OAF_happy/aug_OAF_bar_...
3,Happy,../dataset/TESS_aug/OAF/OAF_happy/aug_OAF_door...
4,Happy,../dataset/TESS_aug/OAF/OAF_happy/aug_OAF_nice...


In [72]:
# Combine the CREMA, TESS, ESD datasets (Label Mapping) 
combined_df_map = pd.concat([crema_df, tess_df, esd_df], ignore_index=True)
combined_df_map.to_csv('../csv/combined_Dataset.csv', index=False)

### **2.3 Feature Extraction**<a id='feature-extraction'></a>

The extracted functions look at extracting from the previous augmentation types. The extracted functions are as follows:
* Zero-Crossing Rate (1 value)
    - The rate at which the signal changes from positive to negative or back. This is a simple measure of the frequency content of a signal and is often used to distinguish percussive sounds from harmonic sounds. The output is typically a single mean value per time frame (or the entire audio clip if you average across frames), which represents the average rate of zero-crossings.

* Chroma Short-Time Fourier Transform or STFT (12 values)
    - Captures the essence of music pitches, ignoring aspects like timbre and loudness. Chroma refers to the 12 different pitch classes; each feature in a chromagram corresponds to one of the twelve distinct semitones (or notes) in an octave.

* Mel-Frequency Cepstral Coefficients - MFCC (20 values)
    - Represents the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. The number of MFCCs you choose to compute can vary, but 20 is a common choice in many applications, providing a good trade-off between capturing important aspects of the sound and computational efficiency. Each MFCC captures different characteristics of the sound, and together they provide a compact representation of the spectral envelope.

* Root Mean Square (1 value)
    - It is the measure of the average power or amplitude of the audio signal, essentially representing the 'loudness' of the sound. The output for RMS is usually a single value per frame which can be averaged across all frames to give a single measure of the overall signal strength. Hence, there is only 1 value.

* Spectral Contrast (7 values)
    - Measures the dynamic range of the spectrum within different sub-bands. By analyzing the contrast between the most prominent tones and the less intense sounds that surround them, spectral contrast provides a measure of the perceptual quality of the sound and can be useful for distinguishing different types of sound textures and timbres. The choice of seven features for spectral contrast generally follows the common practice of dividing the audible spectrum into seven sub-bands, which aligns with perceptually meaningful frequency ranges that correspond to different musical or speech attributes. 

#### **2.3.1 CREMA-D (Seen)**<a id='crema-d-seen-feature'></a>

In [None]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)
    data = data + noise_amp * np.random.normal(size=data.shape[0])
    return data

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05) 
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def pitch(data, sr, pitch_factor=2):
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def extract_features(data, sample_rate):
    n_fft = 2048
    hop_length = 512
    # Collect features for different modifications of the data
    features = []
    transformations = [lambda x: x, noise, stretch, lambda x: pitch(x, sample_rate, 2)]
    for transform in transformations:
        modified_data = transform(data)
        #ZCR
        zcr = librosa.feature.zero_crossing_rate(y=modified_data, hop_length=hop_length).T.mean(axis=0)
        #Chroma SFTF
        chroma_stft = librosa.feature.chroma_stft(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        #MFCC
        mfcc = librosa.feature.mfcc(y=modified_data, sr=sample_rate, n_mfcc=20, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        #RMS
        rms = librosa.feature.rms(y=modified_data, hop_length=hop_length).T.mean(axis=0)
        #Spectral Contrast
        spectral_contrast = librosa.feature.spectral_contrast(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        feature_array = np.hstack((zcr, chroma_stft, mfcc, rms, spectral_contrast))
        features.append(feature_array)
    # Flatten the list of arrays into a single array
    return np.hstack(features)

def generate_feature_names():
    features_info = {
        'ZCR': 1,
        'Chroma': 12,
        'MFCC': 20,
        'RMS': 1,
        'Spectral_Contrast': 7
    }
    types = ['original', 'noisy', 'stretched', 'pitched']
    names = []
    for feat_name, count in features_info.items():
        for i in range(1, count + 1):
            for t in types:
                names.append(f"{feat_name}_{i}_{t}")
    return names

def map_emotion(emotion_code):
    emotions = {'NEU': 'Neutral', 'ANG': 'Angry', 'SAD': 'Sad', 'FEA': 'Fear', 'HAP': 'Happy', 'DIS': 'Disgust'}
    return emotions.get(emotion_code, 'Unknown')

def process_files_crema(directory):
    features_list = []
    labels = []
    filenames = [f for f in os.listdir(directory) if f.endswith(".wav")]
    for filename in filenames:
        file_path = os.path.join(directory, filename)
        data, sample_rate = librosa.load(file_path, duration=2.5, offset=0.6)
        original_features = extract_features(data, sample_rate)
        features_list.append(original_features)
        parts = filename.split('_')
        emotion_code = parts[3]  # Adjust filename pattern as needed
        labels.append(map_emotion(emotion_code))
    feature_names = generate_feature_names()
    features_df = pd.DataFrame(features_list, columns=feature_names)
    features_df['Label'] = labels
    return features_df

# Define path to the dataset
crema_dataset_path = '../dataset/CREMA_aug'
crema_features_df = process_files_crema(crema_dataset_path)

# Save to CSV
crema_features_df.to_csv('../csv/CREMA_aug_features.csv', index=False)

  return pitch_tuning(


In [None]:
crema_features_df.head()

Unnamed: 0,ZCR_1_original,ZCR_1_noisy,ZCR_1_stretched,ZCR_1_pitched,Chroma_1_original,Chroma_1_noisy,Chroma_1_stretched,Chroma_1_pitched,Chroma_2_original,Chroma_2_noisy,...,Spectral_Contrast_5_pitched,Spectral_Contrast_6_original,Spectral_Contrast_6_noisy,Spectral_Contrast_6_stretched,Spectral_Contrast_6_pitched,Spectral_Contrast_7_original,Spectral_Contrast_7_noisy,Spectral_Contrast_7_stretched,Spectral_Contrast_7_pitched,Label
0,0.058203,0.168413,0.14026,0.222031,0.205495,0.329408,0.406689,0.287099,0.210801,0.316133,...,5.502777,0.054518,15.121599,16.979465,22.924174,23.677009,22.195865,19.999263,65.080437,Happy
1,0.049544,0.248724,0.239492,0.279569,0.269511,0.267364,0.369389,0.340688,0.429027,0.74855,...,-15.320172,0.059525,14.03652,14.78067,15.58615,15.448796,16.0052,16.309107,66.911804,Sad
2,0.048374,0.390639,0.236833,0.227314,0.248309,0.256225,0.355961,0.268141,0.288257,0.298744,...,-6.954707,0.050574,15.823275,15.035871,17.236838,17.33111,16.973861,17.195164,65.26174,Sad
3,0.125309,0.464438,0.444223,0.343996,0.32223,0.334812,0.260258,0.34067,0.33361,0.319384,...,-9.659683,0.039167,13.341967,18.273013,19.682507,20.911843,17.991627,17.974893,69.016132,Happy
4,0.129747,0.342628,0.373324,0.495987,0.503508,0.380763,0.363083,0.435024,0.391251,0.419177,...,-12.033329,0.02542,18.242384,14.34461,18.553798,17.762759,17.671194,16.58653,70.563453,Sad


In [None]:
crema_features_df = pd.read_csv('../csv/CREMA_aug_features.csv')

feature_ranges = crema_features_df.describe().loc[['min', 'max']]

print(feature_ranges)

     ZCR_1_original  ZCR_1_noisy  ZCR_1_stretched  ZCR_1_pitched  \
min        0.000000     0.000000         0.000000       0.000000   
max        0.330717     0.916395         0.821496       0.874335   

     Chroma_1_original  Chroma_1_noisy  Chroma_1_stretched  Chroma_1_pitched  \
min           0.000000        0.000000            0.000000          0.000000   
max           0.810113        0.793729            0.842867          0.898079   

     Chroma_2_original  Chroma_2_noisy  ...  Spectral_Contrast_5_stretched  \
min           0.000000        0.000000  ...                     -16.374187   
max           0.840144        0.955284  ...                      22.734421   

     Spectral_Contrast_5_pitched  Spectral_Contrast_6_original  \
min                   -21.689333                      0.000000   
max                    17.644751                      0.142532   

     Spectral_Contrast_6_noisy  Spectral_Contrast_6_stretched  \
min                   0.000000                       0.

#### **2.3.2 YouTube dataset (Unseen)**<a id='youtube-dataset-unseen-feature'></a>

In [122]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)
    data = data + noise_amp * np.random.normal(size=data.shape[0])
    return data

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05) 
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def pitch(data, sr, pitch_factor=2):
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def extract_features(data, sample_rate):
    n_fft = 2048
    hop_length = 512
    # Collect features for different modifications of the data
    features = []
    transformations = [lambda x: x, noise, stretch, lambda x: pitch(x, sample_rate, 2)]
    for transform in transformations:
            modified_data = transform(data)
            #ZCR
            zcr = librosa.feature.zero_crossing_rate(y=modified_data, hop_length=hop_length).T.mean(axis=0)
            #Chroma SFTF
            chroma_stft = librosa.feature.chroma_stft(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
            #MFCC
            mfcc = librosa.feature.mfcc(y=modified_data, sr=sample_rate, n_mfcc=20, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
            #RMS
            rms = librosa.feature.rms(y=modified_data, hop_length=hop_length).T.mean(axis=0)
            #Spectral Contrast
            spectral_contrast = librosa.feature.spectral_contrast(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
            feature_array = np.hstack((zcr, chroma_stft, mfcc, rms, spectral_contrast))
            features.append(feature_array)
    # Flatten the list of arrays into a single array
    return np.hstack(features)

def generate_feature_names():
    features_info = {
        'ZCR': 1,
        'Chroma': 12,
        'MFCC': 20,
        'RMS': 1,
        'Spectral_Contrast': 7
    }
    types = ['original', 'noisy', 'stretched', 'pitched']
    names = []
    for feat_name, count in features_info.items():
        for i in range(1, count + 1):
            for t in types:
                names.append(f"{feat_name}_{i}_{t}")
    return names

def process_files_youtube(directory):
    features_list = []
    filenames = [f for f in os.listdir(directory) if f.endswith(".wav")]
    for filename in filenames:
        file_path = os.path.join(directory, filename)
        data, sample_rate = librosa.load(file_path, duration=2.5, offset=0.6)
        features = extract_features(data, sample_rate)
        features_list.append(features)
    feature_names = generate_feature_names()
    features_df = pd.DataFrame(features_list, columns=feature_names)
    return features_df

# Define path to the dataset for the unseen YouTube data
dataset_path_youtube = '../dataset/YouTube_aug'
features_df_youtube = process_files_youtube(dataset_path_youtube)

# Save to CSV
features_df_youtube.to_csv('../csv/YouTube_aug_features.csv', index=False)

  return pitch_tuning(


In [None]:
yt_features_df = pd.read_csv('../csv/YouTube_aug_features.csv')

feature_ranges = yt_features_df.describe().loc[['min', 'max']]

print(feature_ranges)

     ZCR_1_original  ZCR_1_noisy  ZCR_1_stretched  ZCR_1_pitched  \
min        0.000000     0.000000         0.000000       0.000000   
max        0.609013     0.876495         0.929679       0.902725   

     Chroma_1_original  Chroma_1_noisy  Chroma_1_stretched  Chroma_1_pitched  \
min           0.000000        0.000000            0.000000           0.00000   
max           0.918657        0.925917            0.990164           0.97337   

     Chroma_2_original  Chroma_2_noisy  ...  Spectral_Contrast_5_stretched  \
min           0.000000        0.000000  ...                     -20.488947   
max           0.928489        0.885999  ...                      12.891306   

     Spectral_Contrast_5_pitched  Spectral_Contrast_6_original  \
min                   -17.072622                      0.000000   
max                    14.112740                      0.153067   

     Spectral_Contrast_6_noisy  Spectral_Contrast_6_stretched  \
min                   0.000000                        0

#### **2.3.3 ESD**<a id='esd-feature'></a>

##### **2.3.3.1 ESD (Seen)**<a id='esd-seen-feature'></a>

In [None]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)
    data = data + noise_amp * np.random.normal(size=data.shape[0])
    return data

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05) 
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def pitch(data, sr, pitch_factor=2):
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def extract_features(data, sample_rate):
    n_fft = 2048
    hop_length = 512
    features = []
    transformations = [lambda x: x, noise, stretch, lambda x: pitch(x, sample_rate, 2)]
    for transform in transformations:
        modified_data = transform(data)
        zcr = librosa.feature.zero_crossing_rate(y=modified_data, hop_length=hop_length).T.mean(axis=0)
        chroma_stft = librosa.feature.chroma_stft(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        mfcc = librosa.feature.mfcc(y=modified_data, sr=sample_rate, n_mfcc=20, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        rms = librosa.feature.rms(y=modified_data, hop_length=hop_length).T.mean(axis=0)
        spectral_contrast = librosa.feature.spectral_contrast(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        feature_array = np.hstack((zcr, chroma_stft, mfcc, rms, spectral_contrast))
        features.append(feature_array)
    return np.hstack(features)

def generate_feature_names():
    features_info = {'ZCR': 1, 'Chroma': 12, 'MFCC': 20, 'RMS': 1, 'Spectral_Contrast': 7}
    types = ['original', 'noisy', 'stretched', 'pitched']
    names = []
    for t in types:
        for feat_name, count in features_info.items():
            for i in range(1, count + 1):
                names.append(f"{feat_name}_{i}_{t}")
    return names

def map_emotion(directory_name):
    emotion_mapping = {'Angry': 'Angry', 'Happy': 'Happy', 'Neutral': 'Neutral', 'Sad': 'Sad', 'Surprise': 'Happy'}
    return emotion_mapping.get(directory_name, 'Unknown')

def process_files_esd(directory):
    features_list = []
    labels = []
    for root, _, files in os.walk(directory):
        for filename in files:
            if filename.endswith(".wav"):
                file_path = os.path.join(root, filename)
                data, sample_rate = librosa.load(file_path, duration=2.5, offset=0.6)
                features = extract_features(data, sample_rate)
                features_list.append(features)
                labels.append(map_emotion(root.split(os.sep)[-1]))
    feature_names = generate_feature_names()  
    features_df = pd.DataFrame(features_list, columns=feature_names)
    features_df['Label'] = labels
    return features_df

input_folder_esd = '../dataset/ESD_aug'
features_df_esd = process_files_esd(input_folder_esd)
features_df_esd.to_csv('../csv/ESD_aug_features.csv', index=False)

In [None]:
esd_features_df = pd.read_csv('../csv/ESD_aug_features.csv')

feature_ranges = esd_features_df.describe().loc[['min', 'max']]

print(feature_ranges)

     ZCR_1_original  Chroma_1_original  Chroma_2_original  Chroma_3_original  \
min        0.035283           0.022347           0.039505           0.045234   
max        0.362988           0.839597           0.888793           0.818959   

     Chroma_4_original  Chroma_5_original  Chroma_6_original  \
min           0.032947           0.033066           0.026778   
max           0.812716           0.861563           0.939885   

     Chroma_7_original  Chroma_8_original  Chroma_9_original  ...  \
min           0.025347           0.016972           0.021873  ...   
max           0.869625           0.930398           0.871389  ...   

     MFCC_19_pitched  MFCC_20_pitched  RMS_1_pitched  \
min       -14.547981       -24.669319       0.007123   
max        45.623779        30.724112       0.138693   

     Spectral_Contrast_1_pitched  Spectral_Contrast_2_pitched  \
min                     9.907291                    11.455714   
max                    36.115294                    31.5041

##### **2.3.3.2 ESD (Unseen)**<a id='esd-unseen-feature'></a>

In [None]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)
    data = data + noise_amp * np.random.normal(size=data.shape[0])
    return data

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05) 
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def pitch(data, sr, pitch_factor=2):
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def extract_features(data, sample_rate):
    n_fft = 2048
    hop_length = 512
    features = []
    transformations = [lambda x: x, noise, stretch, lambda x: pitch(x, sample_rate, 2)]
    for transform in transformations:
        modified_data = transform(data)
        zcr = librosa.feature.zero_crossing_rate(y=modified_data, hop_length=hop_length).T.mean(axis=0)
        chroma_stft = librosa.feature.chroma_stft(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        mfcc = librosa.feature.mfcc(y=modified_data, sr=sample_rate, n_mfcc=20, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        rms = librosa.feature.rms(y=modified_data, hop_length=hop_length).T.mean(axis=0)
        spectral_contrast = librosa.feature.spectral_contrast(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        feature_array = np.hstack((zcr, chroma_stft, mfcc, rms, spectral_contrast))
        features.append(feature_array)
    return np.hstack(features)

def generate_feature_names():
    features_info = {'ZCR': 1, 'Chroma': 12, 'MFCC': 20, 'RMS': 1, 'Spectral_Contrast': 7}
    types = ['original', 'noisy', 'stretched', 'pitched']
    names = []
    for t in types:
        for feat_name, count in features_info.items():
            for i in range(1, count + 1):
                names.append(f"{feat_name}_{i}_{t}")
    return names

def map_emotion(directory_name):
    emotion_mapping = {'Angry': 'Angry', 'Happy': 'Happy', 'Neutral': 'Neutral', 'Sad': 'Sad', 'Surprise': 'Happy'}
    return emotion_mapping.get(directory_name, 'Unknown')

def process_files_esd(directory):
    features_list = []
    labels = []
    for root, _, files in os.walk(directory):
        for filename in files:
            if filename.endswith(".wav"):
                file_path = os.path.join(root, filename)
                data, sample_rate = librosa.load(file_path, duration=2.5, offset=0.6)
                features = extract_features(data, sample_rate)
                features_list.append(features)
                labels.append(map_emotion(root.split(os.sep)[-1]))
    feature_names = generate_feature_names()  
    features_df = pd.DataFrame(features_list, columns=feature_names)
    features_df['Label'] = labels
    return features_df

input_folder_esd = '../dataset/ESD_eval_aug'
features_df_esd = process_files_esd(input_folder_esd)
features_df_esd.to_csv('../csv/ESD_eval_aug_features.csv', index=False)

In [None]:
esd_eval_features_df = pd.read_csv('../csv/ESD_eval_aug_features.csv')

feature_ranges = esd_eval_features_df.describe().loc[['min', 'max']]

print(feature_ranges)

     ZCR_1_original  Chroma_1_original  Chroma_2_original  Chroma_3_original  \
min        0.051016           0.065614           0.085547           0.067468   
max        0.338617           0.849386           0.765298           0.754705   

     Chroma_4_original  Chroma_5_original  Chroma_6_original  \
min           0.016535           0.018146           0.031143   
max           0.826904           0.744975           0.769989   

     Chroma_7_original  Chroma_8_original  Chroma_9_original  ...  \
min           0.032557           0.031282           0.054921  ...   
max           0.775975           0.834109           0.810564  ...   

     MFCC_19_pitched  MFCC_20_pitched  RMS_1_pitched  \
min       -14.683578       -18.586187       0.012181   
max        25.541897        16.717937       0.101416   

     Spectral_Contrast_1_pitched  Spectral_Contrast_2_pitched  \
min                    10.900086                    11.970423   
max                    31.750141                    26.3412

#### **2.3.4 TESS dataset (Seen)**<a id='tess-dataset-seen-feature'></a>

In [None]:
def noise(data):
    noise_amp = 0.035 * np.random.uniform() * np.amax(data)
    return data + noise_amp * np.random.normal(size=data.shape[0])

def stretch(data):
    stretch_rate = np.random.uniform(low=0.95, high=1.05)
    return librosa.effects.time_stretch(data, rate=stretch_rate)

def pitch(data, sr):
    pitch_factor = np.random.uniform(low=-2, high=2)
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)

def extract_features(data, sample_rate):
    n_fft = 2048
    hop_length = 512
    transformations = [lambda x: x, noise, stretch, lambda x: pitch(x, sample_rate)]
    features = []
    for transform in transformations:
        modified_data = transform(data)
        zcr = librosa.feature.zero_crossing_rate(modified_data, hop_length=hop_length).T.mean(axis=0)
        chroma_stft = librosa.feature.chroma_stft(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        mfcc = librosa.feature.mfcc(y=modified_data, sr=sample_rate, n_mfcc=20, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        rms = librosa.feature.rms(y=modified_data, hop_length=hop_length).T.mean(axis=0)
        spectral_contrast = librosa.feature.spectral_contrast(y=modified_data, sr=sample_rate, n_fft=n_fft, hop_length=hop_length).T.mean(axis=0)
        feature_array = np.hstack((zcr, chroma_stft, mfcc, rms, spectral_contrast))
        features.append(feature_array)
    return np.hstack(features)

def generate_feature_names():
    features_info = {'ZCR': 1, 'Chroma': 12, 'MFCC': 20, 'RMS': 1, 'Spectral_Contrast': 7}
    types = ['original', 'noisy', 'stretched', 'pitched']
    names = []
    for t in types:
        for feature_name, count in features_info.items():
            for i in range(1, count + 1):
                names.append(f"{feature_name}_{i}_{t}")
    return names

def map_emotion(audio_file_name):
    emotion_keywords = {
        'angry': 'Angry',
        'happy': 'Happy',
        'neutral': 'Neutral',
        'sad': 'Sad',
        'disgust': 'Disgust',
        'fear': 'Fear',
        'ps': 'Happy'
    }
    for key in emotion_keywords:
        if key in audio_file_name:
            return emotion_keywords[key]
    return 'Unknown'

def process_files_tess(directory):
    features_list = []
    labels = []
    feature_names = generate_feature_names()
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(".wav"):
                file_path = os.path.join(root, file)
                data, sample_rate = librosa.load(file_path, sr=None)
                features = extract_features(data, sample_rate)
                features_list.append(features)
                label = map_emotion(file)
                labels.append(label)
    features_df = pd.DataFrame(features_list, columns=feature_names)
    features_df['Label'] = labels
    return features_df

input_folder_tess = '../dataset/TESS_aug'
features_df_tess = process_files_tess(input_folder_tess)
features_df_tess.to_csv('../csv/TESS_aug_features.csv', index=False)

In [None]:
tess_features_df = pd.read_csv('../csv/TESS_aug_features.csv')

feature_ranges = tess_features_df.describe().loc[['min', 'max']]

print(feature_ranges)

     ZCR_1_original  Chroma_1_original  Chroma_2_original  Chroma_3_original  \
min        0.053142           0.080655           0.092351           0.057976   
max        0.356191           0.813709           0.758412           0.824369   

     Chroma_4_original  Chroma_5_original  Chroma_6_original  \
min           0.059733           0.058978           0.046615   
max           0.815024           0.838086           0.850368   

     Chroma_7_original  Chroma_8_original  Chroma_9_original  ...  \
min           0.043898           0.029222           0.029962  ...   
max           0.899741           0.917119           0.930549  ...   

     MFCC_19_pitched  MFCC_20_pitched  RMS_1_pitched  \
min       -19.988125       -12.419239       0.013126   
max        22.311520        33.925964       0.208303   

     Spectral_Contrast_1_pitched  Spectral_Contrast_2_pitched  \
min                    10.230365                    13.027649   
max                    30.393639                    29.2860

### **2.4 Combine Dataset (Seen)**<a id='combine-dataset-seen'></a>

For the purpose of training and testing, we will combine the seen data together. The pros of using a diverse set of data is as follows:

1. Increased Diversity
Each dataset typically captures a variety of emotional expressions under different conditions and from different demographics. For example:

    * CREMA-D: This dataset is known for its diversity in terms of actor demographics and has a variety of vocal expressions recorded in a controlled studio environment.
    * TESS (Toronto Emotional Speech Set): It features a range of emotions spoken by Canadian English speakers, often with a focus on older adults.
    * ESD (Emotional Speech Dataset): It might offer different linguistic backgrounds or recording conditions.

Combining these datasets ensures that the model is exposed to a broader spectrum of vocal qualities, accents, intonations, and speech nuances, which can differ substantially across datasets due to geographical, cultural, and individual speaker differences.

2. Robustness and Generalizability

    * Avoid Overfitting: Training and testing on a single dataset can sometimes lead models to overfit to the specific characteristics and idiosyncrasies of that dataset. Using multiple datasets can help ensure that the model performs well across a more generalized set of data conditions, not just the one it was trained on.
    * Real-World Application: In practical applications, a model may encounter a wide range of speech inputs from users of different ages, ethnic backgrounds, and emotional states. Testing the model across different datasets helps verify that it can handle this variability effectively.

3. Enhanced Validation and Testing
    * Cross-Dataset Validation: Models can be validated more rigorously by using one dataset for training and others for testing. This cross-dataset testing helps to highlight any biases the model may have towards the training data.
    * Reliability and Accuracy: By testing on multiple datasets, you can assess the reliability and accuracy of the model across different types of emotional speech data, which is crucial for applications like clinical diagnosis, customer service, and interactive AI.

4. Statistical Significance
    * Increased Sample Size: More data points can lead to more statistically significant results and conclusions. It reduces the impact of outliers and allows for more complex modeling techniques that require larger datasets.
    * Variety of Testing Scenarios: It enables testing the model under different scenarios and conditions, which can be critical for ensuring the model’s utility in real-world applications.

In [None]:
# Combine the features from the CREMA, TESS, ESD 
combined_df = pd.concat([crema_features_df, esd_features_df, tess_features_df], ignore_index=True)
combined_df.to_csv('../csv/combined_aug_features.csv', index=False)