# The Audio, Speech, Vision Processing Lab - Emotional Sound Database (ASVP-ESD)

## Description
The dataset contains 12,625 audio files covering both speech and non-speech emotional sounds. The audio was collected from movies, TV shows, YouTube, and other websites.

## Emotion Classes

The dataset contains a total of 12 distinct emotions:

* Boredom
* Neutral
* Happiness
* Sadness
* Anger
* Fear
* Surprise
* Disgust
* Excitement
* Pleasure
* Pain
* Disappointment

## File Naming

Each audio file has a unique filename. The filename consists of numerical identifiers (e.g. 02-01-06-01-02-105-02-01-02.wav); these identifiers define the characteristics of the stimulus.

### Filename Identifiers

1. Modality - (03 = audio-only)
2. Vocal Channel - (01 = speech, 02 = non-speech)
3. Emotion - (01 = boredom/sigh, 02 = neutral/calm, 03 = happy/laugh/gaggle, 04 = sad/cry, 05 = angry/grunt/frustration, 06 = fearful/scream/panic, 07 = disgust/dislike/contempt, 08 = surprised/gasp/amazed, 09 = excited, 10 = pleasure, 11 = pain/groan, 12 = disappointment/disapproval, 13 = breath)
4. Emotional Intensity - (01 = normal, 02 = high)
5. Statement - (as the recordings are unscripted, this helps to refer approximately to data collected from the same period or source, based on their rank)
6. Actor - (even numbers represent male, odd numbers represent female)
7. Age - (01 = above 65 years, 02 = between 20 and 64 years, 03 = below 20 years)
8. Source - (01 & 02 = movies/YouTube/website, 03 = movies)
9. Language - (01 = Chinese, 02 = English, 04 = French, others = Russian/others)

Example: 03-01-06-01-02-12-02-01-02-16.wav = audio_only-speech-fearful/scream/panic-normal-statement-male_actor_12-(20, 65)-movies/youtube/website-english-similar_with_16_other_audio_files
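
The naming scheme can be decoded mechanically. The following is a minimal sketch, not part of the original notebook: `FIELD_NAMES` and `parse_filename` are hypothetical names, and the tenth field is labelled `similar_count` only because the example describes it as the number of similar audio files.

```python
from pathlib import PurePosixPath

# Hypothetical field labels, in the order the identifiers appear in a file name.
FIELD_NAMES = [
    "modality", "vocal_channel", "emotion", "emotional_intensity",
    "statement", "actor", "age", "source", "language",
]


def parse_filename(path: str) -> dict:
    """Split an ASVP-ESD style file name into labelled identifiers."""
    parts = PurePosixPath(path).stem.split("-")
    parsed = dict(zip(FIELD_NAMES, parts))
    # Some file names carry a trailing tenth field (assumed label).
    if len(parts) > len(FIELD_NAMES):
        parsed["similar_count"] = parts[len(FIELD_NAMES)]
    # Even actor numbers are male, odd actor numbers are female.
    if "actor" in parsed:
        parsed["sex"] = "male" if int(parsed["actor"]) % 2 == 0 else "female"
    return parsed


info = parse_filename("03-01-06-01-02-12-02-01-02-16.wav")
print(info["emotion"], info["actor"], info["sex"], info["language"])
```

The identifiers stay as zero-padded strings on purpose, so they can be compared directly against the codes listed in the table.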

In [115]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from glob import glob

import librosa
import librosa.display

import IPython.display as ipd

import torch
import torchaudio

from enum import Enum
from typing import Literal, List
from uuid import uuid4
import pathlib
import shutil

## Data Preparation

The first step is to filter the audio files down to the subset that we are interested in.

We are interested in the following characteristics:

* Speech or (just) Sound
* English Language

We are interested in the following emotions:

* Happy
* Excited
* Tender
* Scared
* Angry
* Sad

The "raw" data must first be transformed into a new subset that meets these criteria. This has to happen before we can move on to data analysis or any other data-related step.


First, let's define the various types of identifiers as enums for easy use later.

In [117]:
# define base functionality of enum
class BaseEnum(str, Enum):

    def __str__(self) -> str:
        return self.value


# the medium in which emotion is conveyed
class Modality(BaseEnum):

    AUDIO_ONLY = "03"


# the type of audio (speech, or non speech)
class VocalChannel(BaseEnum):

    SPEECH = "01"
    NON_SPEECH = "02"


# the type of emotion, this will be reduced to the base six mentioned previously
class Emotion(BaseEnum):

    BOREDOM = "01"
    NEUTRAL = "02"
    HAPPY = "03"
    SAD = "04"
    ANGRY = "05"
    FEARFUL = "06"
    DISGUST = "07"
    SURPRISED = "08"
    EXCITED = "09"
    PLEASURE = "10"
    PAIN = "11"
    DISAPPOINTMENT = "12"
    BREATH = "13"


class EmotionalIntensity(BaseEnum):

    NORMAL = "01"
    HIGH = "02"


# the age of the person expressing an emotion
class Age(BaseEnum):

    ABOVE_65 = "01"
    BETWEEN_20_AND_64 = "02"
    BELOW_20 = "03"


# the sex of the actor in the audio file
class Sex(BaseEnum):

    MALE = "male"
    FEMALE = "female"


# the source of the audio
class Source(BaseEnum):

    WEBSITE = "01"
    YOUTUBE = "02"
    MOVIES = "03"


# the language spoken in the audio file
class Language(BaseEnum):

    CHINESE = "01"
    ENGLISH = "02"
    FRENCH = "04"  # matches the filename identifier table (04 = French)
    OTHER = "others"
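
Because the enums inherit from `str`, their members compare equal to the raw identifier strings taken from a file name, which is what the filtering later relies on. The sketch below illustrates this with `DemoChannel`, a hypothetical stand-in that is not part of the notebook.

```python
from enum import Enum


class DemoChannel(str, Enum):
    """Hypothetical stand-in for the str-based enums in this notebook."""

    SPEECH = "01"
    NON_SPEECH = "02"

    def __str__(self) -> str:
        return self.value


identifier = "01"  # as it would come out of file_name.split("-")

# A str-based enum member is equal to the plain string with the same value.
is_equal = identifier == DemoChannel.SPEECH
# Membership tests on a list of members therefore work with plain strings.
is_member = identifier in [DemoChannel.SPEECH, DemoChannel.NON_SPEECH]
# The overridden __str__ returns the bare identifier.
as_text = str(DemoChannel.NON_SPEECH)

print(is_equal, is_member, as_text)
```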
    

Second, let's define a function that identifies whether the actor is male or female.

In [118]:
# even number represents male, odd number represents female
def get_sex_identifier(sex_identifier: int) -> Sex:
    # use the remainder of a division by two to test for an even number
    return Sex.MALE if sex_identifier % 2 == 0 else Sex.FEMALE

In [119]:
def get_file_name_from_full_path(file_path: str = None) -> str:
    return file_path.split("/")[-1]
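
Splitting on "/" assumes POSIX-style paths. As a hedged alternative, the sketch below uses `pathlib.PurePath`, which follows the path conventions of the platform it runs on and can drop the extension in the same step. `file_stem` is a hypothetical helper, not part of the notebook.

```python
from pathlib import PurePath


def file_stem(file_path: str) -> str:
    """Return the file name without its directory and without its extension."""
    return PurePath(file_path).stem


example = "../../data/asvp-esd/Audio/actor_94/03-01-02-01-07-94-02-02-01-32.wav"
stem = file_stem(example)
print(stem)
```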

Third and last, let's define a method that returns all audio files based on the provided identifiers.

In [121]:
def get_audio_files_by_identifiers(
    files: List[str],
    modality: List[Modality] = None,
    vocal_channel: List[VocalChannel] = None,
    emotion: List[Emotion] = None,
    emotional_intensity: List[EmotionalIntensity] = None,
    age: List[Age] = None,
    source: List[Source] = None,
    language: Language = Language.ENGLISH
) -> List[str]:
    # default to audio-only files
    if modality is None:
        modality = [Modality.AUDIO_ONLY]

    # a filter left as None accepts every value for that identifier
    def matches(value: str, allowed: list) -> bool:
        return allowed is None or value in allowed

    ret_val = []  # initialize the return value as an empty list

    for file in files:
        # extract the file name from the full file path
        file_name = get_file_name_from_full_path(file_path=file)

        # remove the extension of the file name
        file_name = file_name.replace(".wav", "")

        # extract the identifiers from the file name
        file_identifiers = file_name.split("-")

        # there are 99 cases where the file name has fewer than nine identifiers; skip those
        if len(file_identifiers) > 8:
            # filter out the desired files by the criteria
            if (
                matches(file_identifiers[0], modality)
                and matches(file_identifiers[1], vocal_channel)
                and matches(file_identifiers[2], emotion)
                and matches(file_identifiers[3], emotional_intensity)
                and matches(file_identifiers[6], age)
                and matches(file_identifiers[7], source)
                and file_identifiers[8] == language
            ):
                ret_val.append(file)

    return ret_val
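
To make the filtering rule concrete, here is a small self-contained sketch that applies the same index-based matching to plain file names. `filter_by_position` and its `allowed` mapping are hypothetical and only illustrate the idea; the sample names are shortened versions of paths that appear in this notebook's output.

```python
from typing import Dict, List, Set


def filter_by_position(files: List[str], allowed: Dict[int, Set[str]]) -> List[str]:
    """Keep files whose identifier at each given position is in the allowed set."""
    kept = []
    for file in files:
        name = file.split("/")[-1].replace(".wav", "")
        identifiers = name.split("-")
        # Skip names that are too short to hold every requested position.
        if len(identifiers) <= max(allowed, default=-1):
            continue
        if all(identifiers[pos] in values for pos, values in allowed.items()):
            kept.append(file)
    return kept


sample = [
    "actor_94/03-01-02-01-07-94-02-02-01-32.wav",
    "actor_94/03-02-11-01-10-94-02-02-01-31.wav",
    "actor_94/03-02-05-01-12-94-02-02-02-99.wav",
]

# Position 1 = vocal channel, 2 = emotion, 8 = language.
angry_english_sound = filter_by_position(sample, {1: {"02"}, 2: {"05"}, 8: {"02"}})
print(angry_english_sound)
```

Only the third sample name is non-speech (02), angry (05), and English (02), so it is the only one kept.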

    

Time to load the data and see what we are working with.

In [122]:
audio_files: List[str] = glob("../../data/asvp-esd/Audio/*/*.wav")
audio_files[:5]

['../../data/asvp-esd/Audio/actor_94/03-01-02-01-07-94-02-02-01-32.wav',
 '../../data/asvp-esd/Audio/actor_94/03-02-11-01-10-94-02-02-01-31.wav',
 '../../data/asvp-esd/Audio/actor_94/03-02-05-01-12-94-02-02-02-99.wav',
 '../../data/asvp-esd/Audio/actor_94/03-02-08-02-01-94-04-02-01-38.wav',
 '../../data/asvp-esd/Audio/actor_94/03-02-03-01-20-94-02-02-01-33.wav']

In [123]:
# the length of the python list represents the number of audio files
len(audio_files)

12625

Time to filter the audio files by the criteria set previously.

In [124]:
# speech files

## happy
happy_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.SPEECH], 
    emotion=[Emotion.HAPPY],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)

## excited
excited_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.SPEECH], 
    emotion=[Emotion.EXCITED],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)

## tender

## scared
scared_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.SPEECH], 
    emotion=[Emotion.FEARFUL],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)

## angry
angry_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.SPEECH], 
    emotion=[Emotion.ANGRY],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)

## sad
sad_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.SPEECH], 
    emotion=[Emotion.SAD],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)


# non speech files

## happy
happy_non_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.NON_SPEECH], 
    emotion=[Emotion.HAPPY],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)

## excited
excited_non_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.NON_SPEECH], 
    emotion=[Emotion.EXCITED],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)

## tender

## scared
scared_non_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.NON_SPEECH], 
    emotion=[Emotion.FEARFUL],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)

## angry
angry_non_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.NON_SPEECH], 
    emotion=[Emotion.ANGRY],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)

## sad
sad_non_speech = get_audio_files_by_identifiers(
    files=audio_files, 
    modality=[Modality.AUDIO_ONLY], 
    vocal_channel=[VocalChannel.NON_SPEECH], 
    emotion=[Emotion.SAD],
    emotional_intensity=[EmotionalIntensity.NORMAL, EmotionalIntensity.HIGH],
    age=[Age.BELOW_20, Age.BETWEEN_20_AND_64, Age.ABOVE_65],
    source=[Source.MOVIES, Source.WEBSITE, Source.YOUTUBE],
)

In [127]:
len(happy_speech), len(excited_speech), len(scared_speech), len(angry_speech), len(sad_speech)

(196, 127, 56, 265, 109)

In [128]:
len(happy_non_speech), len(excited_non_speech), len(scared_non_speech), len(angry_non_speech), len(sad_non_speech)

(438, 94, 467, 297, 272)
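
The ten near-identical calls could also be expressed as a single pass that groups files by vocal channel and emotion. The sketch below is one possible way to do that, assuming the identifier positions described in the naming scheme. `group_by_channel_and_emotion`, `EMOTION_LABELS`, and `CHANNEL_LABELS` are hypothetical names, and "tender" is left out because no emotion code in the naming scheme maps to it. It only checks vocal channel, emotion, and language, so its counts may differ slightly from the cells in this notebook, which also restrict age, source, and intensity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Emotion and channel codes mapped to the folder names used in this notebook.
EMOTION_LABELS = {"03": "happy", "09": "excited", "06": "scared", "05": "angry", "04": "sad"}
CHANNEL_LABELS = {"01": "speech", "02": "sound"}


def group_by_channel_and_emotion(
    files: List[str], language: str = "02"
) -> Dict[Tuple[str, str], List[str]]:
    """Group audio files of one language by (vocal channel, emotion) in one pass."""
    groups = defaultdict(list)
    for file in files:
        ids = file.split("/")[-1].replace(".wav", "").split("-")
        # Skip short names and files in another language.
        if len(ids) < 9 or ids[8] != language:
            continue
        channel = CHANNEL_LABELS.get(ids[1])
        emotion = EMOTION_LABELS.get(ids[2])
        if channel and emotion:
            groups[(channel, emotion)].append(file)
    return dict(groups)


sample = [
    "03-01-03-01-07-94-02-02-02-32.wav",  # made-up name: English happy speech
    "03-02-05-01-12-94-02-02-02-99.wav",  # English angry non-speech
    "03-02-03-01-20-94-02-02-01-33.wav",  # language code 01, skipped
]
groups = group_by_channel_and_emotion(sample)
print({key: len(value) for key, value in groups.items()})
```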

## Saving the new data

In [129]:
# define base path
BASE_FILEPATH = "../../data/asvp-esd"

In [135]:
# remove output folders from a previous run so we start from a clean state
shutil.rmtree(path=f"{BASE_FILEPATH}/sound", ignore_errors=True)
shutil.rmtree(path=f"{BASE_FILEPATH}/speech", ignore_errors=True)

In [136]:
# initialize folders to save files to
for vocal_channel in ["sound", "speech"]:
    for emotion in ["happy", "excited", "tender", "scared", "angry", "sad"]:
        pathlib.Path(f"{BASE_FILEPATH}/{vocal_channel}/{emotion}").mkdir(parents=True, exist_ok=True)

In [137]:
# helper function to save new subsets of audio files
def save_audio(audio_files: List[str], vocal_channel: str, emotion: str) -> None:
    for audio_file in audio_files:
        # load the waveform and sample rate of the audio file
        # (torchaudio.load / torchaudio.save are the backend-independent entry points)
        waveform, sr = torchaudio.load(audio_file)

        # save under a new unique file name, keeping the original sample rate
        torchaudio.save(
            f"{BASE_FILEPATH}/{vocal_channel}/{emotion}/{uuid4()}.wav",
            waveform,
            sr
        )
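
Because each clip is written under a random UUID, the link to the original file name (and with it the actor, age, and intensity identifiers) is lost. One hedged alternative is to copy the files byte for byte and keep a manifest that maps new names to original names. The sketch below uses only the standard library; `copy_with_manifest` and the `manifest.csv` file name are hypothetical.

```python
import csv
import shutil
import tempfile
from pathlib import Path
from typing import List
from uuid import uuid4


def copy_with_manifest(audio_files: List[str], target_dir: str) -> Path:
    """Copy files under new UUID names and record the original name of each."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    manifest_path = target / "manifest.csv"
    with manifest_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["new_name", "original_name"])
        for audio_file in audio_files:
            new_name = f"{uuid4()}.wav"
            # copy2 copies the bytes as-is, so the audio is not re-encoded
            shutil.copy2(audio_file, target / new_name)
            writer.writerow([new_name, Path(audio_file).name])
    return manifest_path


# Demonstration on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "03-01-03-01-07-94-02-02-02-32.wav"
    source.write_bytes(b"not real audio")
    manifest = copy_with_manifest([str(source)], str(Path(tmp) / "speech" / "happy"))
    rows = list(csv.reader(manifest.read_text().splitlines()))
    print(rows[1][1])
```

Copying also avoids decoding and re-encoding every clip, at the cost of not normalizing the audio format along the way.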

In [138]:
# happy
save_audio(audio_files=happy_speech, vocal_channel="speech", emotion="happy")
save_audio(audio_files=happy_non_speech, vocal_channel="sound", emotion="happy")

# excited
save_audio(audio_files=excited_speech, vocal_channel="speech", emotion="excited")
save_audio(audio_files=excited_non_speech, vocal_channel="sound", emotion="excited")

# tender

# scared
save_audio(audio_files=scared_speech, vocal_channel="speech", emotion="scared")
save_audio(audio_files=scared_non_speech, vocal_channel="sound", emotion="scared")

# angry
save_audio(audio_files=angry_speech, vocal_channel="speech", emotion="angry")
save_audio(audio_files=angry_non_speech, vocal_channel="sound", emotion="angry")

# sad
save_audio(audio_files=sad_speech, vocal_channel="speech", emotion="sad")
save_audio(audio_files=sad_non_speech, vocal_channel="sound", emotion="sad")

## Analyse new data

In [113]:
def plot_waveform(waveform: torch.Tensor, sr: int, title: str = "Waveform") -> None:
    waveform = waveform.numpy()

    num_channels, num_frames = waveform.shape
    # one time stamp (in seconds) per frame
    time_axis = np.arange(num_frames) / sr

    # squeeze=False always returns a 2D array of axes, also for mono audio
    figure, axes = plt.subplots(num_channels, 1, squeeze=False)
    for channel in range(num_channels):
        axes[channel, 0].plot(time_axis, waveform[channel], linewidth=1)
        axes[channel, 0].grid(True)

    figure.suptitle(title)

    plt.show(block=False)


def plot_spectogram(specgram, title: str = "Spectrogram (dB)", ylabel: str = "freq_bin") -> None:
    fig, axs = plt.subplots(1, 1)

    axs.set_title(title)
    axs.set_ylabel(ylabel)
    axs.set_xlabel("frame")

    # convert the power spectrogram to decibels before plotting
    im = axs.imshow(librosa.power_to_db(specgram), origin="lower", aspect="auto")
    fig.colorbar(im, ax=axs)

    plt.show(block=False)


def plot_fbank(fbank, title: str = "Filter bank") -> None:
    fig, axs = plt.subplots(1, 1)

    axs.set_title(title)
    axs.imshow(fbank, aspect="auto")
    axs.set_ylabel("freq_bin")
    axs.set_xlabel("mel_bin")

    plt.show(block=False)
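
The spectrogram plot relies on `librosa.power_to_db` to turn power values into decibels. As a rough illustration of that conversion, the pure-Python sketch below computes 10 * log10(S / ref) with a small floor and a dynamic-range clip. The parameter values are assumed to mirror librosa's defaults (`ref=1.0`, `amin=1e-10`, `top_db=80.0`); `power_to_db_sketch` is a hypothetical stand-in, not librosa's implementation.

```python
import math
from typing import List


def power_to_db_sketch(power: List[float], ref: float = 1.0,
                       amin: float = 1e-10, top_db: float = 80.0) -> List[float]:
    """Convert power values to decibels relative to `ref`."""
    # Floor tiny values so the logarithm stays finite.
    db = [10.0 * math.log10(max(amin, p)) - 10.0 * math.log10(max(amin, ref))
          for p in power]
    # Clip everything more than `top_db` below the loudest value.
    floor = max(db) - top_db
    return [max(value, floor) for value in db]


values = power_to_db_sketch([1.0, 0.1, 1e-12])
print([round(v, 1) for v in values])
```

A tenfold drop in power corresponds to a drop of 10 dB, and very quiet bins are clipped so that they do not dominate the colour scale of the plot.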

### Speech

In [139]:
audio_files: List[str] = glob("../../data/asvp-esd/speech/*/*.wav")

In [140]:
len(audio_files)

753

### Sound

In [141]:
audio_files: List[str] = glob("../../data/asvp-esd/sound/*/*.wav")

In [142]:
len(audio_files)

1568
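
As a quick sanity check, the folder totals should equal the sum of the per-emotion counts reported earlier. The numbers below are copied from this notebook's outputs.

```python
# Per-emotion counts in the order happy, excited, scared, angry, sad.
speech_counts = (196, 127, 56, 265, 109)
sound_counts = (438, 94, 467, 297, 272)

speech_total = sum(speech_counts)
sound_total = sum(sound_counts)

print(speech_total, sound_total)  # 753 1568
```

Both totals match the number of files found in the `speech` and `sound` folders, so no file was dropped or duplicated while saving.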