# Speech Emotion Recognition - Signal Preprocessing

A project for the French Employment Agency

Telecom ParisTech 2018-2019

## I. Context

The aim of this notebook is to set up the signal preprocessing and audio feature extraction used for speech emotion recognition.

### Audio features:
The complete list of the implemented short-term features is presented below:
- **Zero Crossing Rate**: The rate of sign changes of the signal within a frame.
- **Energy**: The sum of squares of the signal values, normalized by the respective frame length.
- **Entropy of Energy**: The entropy of sub-frames' normalized energies. It can be interpreted as a measure of abrupt changes.
- **Spectral Centroid**: The center of gravity of the spectrum.
- **Spectral Spread**: The second central moment of the spectrum.
- **Spectral Entropy**: Entropy of the normalized spectral energies for a set of sub-frames.
- **Spectral Flux**: The squared difference between the normalized magnitudes of the spectra of two successive frames.
- **Spectral Rolloff**: The frequency below which 90% of the magnitude distribution of the spectrum is concentrated.
- **MFCCs**: Mel Frequency Cepstral Coefficients form a cepstral representation in which the frequency bands are not linear but distributed according to the mel scale.
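
As a rough illustration of the definitions above, a few of these features can be sketched for a single frame with NumPy. This is not the `AudioLibrary` implementation: the function names, the small epsilon, and the test tone are assumptions made for the example. The spread is computed as the second central moment, as described above, although some libraries report its square root instead.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.sign(frame)
    return np.sum(signs[1:] != signs[:-1]) / (len(frame) - 1)

def energy(frame):
    """Sum of squared samples, normalized by the frame length."""
    return np.sum(frame ** 2) / len(frame)

def magnitude_spectrum(frame, sample_rate):
    """Return (frequencies in Hz, magnitudes) of the one-sided spectrum."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, magnitude

def spectral_centroid_spread(frame, sample_rate):
    """Magnitude-weighted mean frequency and second central moment."""
    freqs, magnitude = magnitude_spectrum(frame, sample_rate)
    weights = magnitude / (np.sum(magnitude) + 1e-12)
    centroid = np.sum(freqs * weights)
    spread = np.sum(((freqs - centroid) ** 2) * weights)
    return centroid, spread

def spectral_rolloff(frame, sample_rate, fraction=0.90):
    """Lowest frequency below which `fraction` of the magnitude lies."""
    freqs, magnitude = magnitude_spectrum(frame, sample_rate)
    cumulative = np.cumsum(magnitude)
    index = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[min(index, len(freqs) - 1)]

# Hypothetical test frame: 25 ms of a 1 kHz sine sampled at 16 kHz
sample_rate = 16000
t = np.arange(400) / sample_rate
frame = np.sin(2 * np.pi * 1000 * t + 0.1)

zcr = zero_crossing_rate(frame)
frame_energy = energy(frame)
centroid, spread = spectral_centroid_spread(frame, sample_rate)
rolloff = spectral_rolloff(frame, sample_rate)
```

For a pure tone that falls exactly on an FFT bin, the centroid and the rolloff both sit at the tone frequency and the spread is close to zero, which makes this a convenient sanity check.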

Global statistics are then computed on the features above:
- **mean, std, med, kurt, skew, q1, q99, min, max and range**
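
A minimal sketch of how such statistics could be computed over one feature track (one value per frame) is given below. It makes several assumptions that the notebook does not confirm: `q1` and `q99` are read as the 1st and 99th percentiles, `kurt` as excess kurtosis, and `std` as the population standard deviation. The function name `global_statistics` is hypothetical.

```python
import numpy as np

def global_statistics(values):
    """Summarize one short-term feature track with ten global statistics."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (assumption)
    centered = values - mean
    # Standardized third and fourth moments; constant tracks yield 0
    skew = np.mean(centered ** 3) / std ** 3 if std > 0 else 0.0
    kurt = np.mean(centered ** 4) / std ** 4 - 3.0 if std > 0 else 0.0
    return {
        'mean': mean,
        'std': std,
        'med': np.median(values),
        'kurt': kurt,
        'skew': skew,
        'q1': np.percentile(values, 1),    # 1st percentile (assumption)
        'q99': np.percentile(values, 99),  # 99th percentile (assumption)
        'min': values.min(),
        'max': values.max(),
        'range': values.max() - values.min(),
    }

# Toy feature track with five frames
stats = global_statistics([0.0, 1.0, 2.0, 3.0, 4.0])
```

Applying the same function to every feature track and concatenating the results gives one fixed-length vector per audio file, regardless of the file's duration.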

### Data:
**RAVDESS**: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes *calm*, *happy*, *sad*, *angry*, *fearful*, *surprise*, and *disgust* expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. (https://zenodo.org/record/1188976#.XA48aC17Q1J)

## II. General import

In [10]:
!pip install pydub



In [11]:
### General imports ###
from glob import glob
import os
import pickle
import itertools
import numpy as np

### Audio preprocessing imports ###
from AudioLibrary.AudioSignal import *
from AudioLibrary.AudioFeatures import *

## III. Set labels

In [12]:
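# RAVDESS Database
# In the RAVDESS file naming scheme '01' is neutral and '02' is calm;
# here '02' is labeled 'NEU' and '01' files are skipped.
label_dict_ravdess = {'02': 'NEU', '03': 'HAP', '04': 'SAD', '05': 'ANG', '06': 'FEA', '07': 'DIS', '08': 'SUR'}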

# Set audio file labels
def set_label_ravdess(audio_file, gender_differentiation):
    # Characters 6-7 of the file name hold the emotion code
    label = label_dict_ravdess.get(audio_file[6:-16])
    if gender_differentiation:
        # Characters 18-19 hold the actor ID: even = female, odd = male
        if int(audio_file[18:-4]) % 2 == 0:
            label = 'f_' + label
        else:
            label = 'm_' + label
    return label
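
The slicing above relies on every file name having exactly the RAVDESS layout of seven two-digit fields (modality, vocal channel, emotion, intensity, statement, repetition, actor) followed by `.wav`. A slightly more defensive variant, sketched here under a hypothetical name `parse_ravdess_label`, splits on `-` and returns `None` for anything that does not match, such as hidden system files. The mapping is copied from `label_dict_ravdess`.

```python
import os

# Same mapping as label_dict_ravdess in the cell above
EMOTION_LABELS = {'02': 'NEU', '03': 'HAP', '04': 'SAD', '05': 'ANG',
                  '06': 'FEA', '07': 'DIS', '08': 'SUR'}

def parse_ravdess_label(audio_file, gender_differentiation=True):
    """Return a label such as 'f_FEA', or None if the file is not usable."""
    stem, ext = os.path.splitext(os.path.basename(audio_file))
    fields = stem.split('-')
    if ext.lower() != '.wav' or len(fields) != 7:
        return None
    emotion, actor = fields[2], fields[6]
    label = EMOTION_LABELS.get(emotion)
    if label is None or not actor.isdigit():
        return None
    if gender_differentiation:
        # Even actor IDs are female, odd actor IDs are male
        label = ('f_' if int(actor) % 2 == 0 else 'm_') + label
    return label

example = parse_ravdess_label('03-01-06-01-02-01-12.wav')
```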

## IV. Import audio files

In [13]:
# Start data import
print("Import Data: START")

# Audio file path and names
file_path = '/Users/aryasoni/Documents/GitHub/AI-Interview/Audio/Dataset/RAVDESS/'
file_names = os.listdir(file_path)

# Initialize signal and labels list
signal = []
labels = []

# Sample rate (44.1 kHz)
sample_rate = 44100     

# Read every matching audio file and set its label
for audio_index, audio_file in enumerate(file_names):

    # Select audio file
    if audio_file[6:-16] in label_dict_ravdess.keys():
        
        # Read audio file
        signal.append(AudioSignal(sample_rate, filename=file_path + audio_file))
        
        # Set label
        labels.append(set_label_ravdess(audio_file, True))

        # Print running...
        if (audio_index % 100 == 0):
            print("Import Data: RUNNING ... {} files".format(audio_index))
        
# Cast labels to array
labels = np.asarray(labels).ravel()

# Stop data import
print("Import Data: END \n")
print("Number of audio files imported: {}".format(labels.shape[0]))

Import Data: START
Import Data: END 

Number of audio files imported: 0
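
The run above imported 0 files, meaning that no file name in `file_path` matched an emotion code in `label_dict_ravdess`, so every later step operates on empty data. One possible guard, sketched with the standard library only, is to collect the `.wav` files first and fail early when none are found. The helper name `list_wav_files` is hypothetical and the demonstration uses a throwaway directory rather than the real dataset.

```python
import os
import tempfile

def list_wav_files(directory):
    """Return the sorted .wav file names in `directory`, or raise if none exist."""
    names = sorted(n for n in os.listdir(directory) if n.lower().endswith('.wav'))
    if not names:
        raise FileNotFoundError("No .wav files found in {}".format(directory))
    return names

# Demonstration on a temporary directory with one audio-like file
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, '03-01-06-01-02-01-12.wav'), 'wb').close()
    open(os.path.join(tmp, 'notes.txt'), 'wb').close()
    found = list_wav_files(tmp)
```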


## V. Audio features extraction

In [14]:
# Audio features extraction function
def global_feature_statistics(y, win_size=0.025, win_step=0.01, nb_mfcc=12, mel_filter=40,
                             stats = ['mean', 'std', 'med', 'kurt', 'skew', 'q1', 'q99', 'min', 'max', 'range'],
                             features_list =  ['zcr', 'energy', 'energy_entropy', 'spectral_centroid', 'spectral_spread', 'spectral_entropy', 'spectral_flux', 'sprectral_rolloff', 'mfcc']):
    
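    # Extract features (nb_mfcc and mel_filter are accepted but not forwarded here)
    audio_features = AudioFeatures(y, win_size, win_step)
    # The feature names returned alongside the values are not needed
    features, _ = audio_features.global_feature_extraction(stats=stats, features_list=features_list)
    return features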
    
# Features extraction parameters
sample_rate = 16000 # Sample rate (16.0 kHz)
win_size = 0.025    # Short term window size (25 msec)
win_step = 0.01     # Short term window step (10 msec)
nb_mfcc = 12        # Number of MFCC coefficients (12)
nb_filter = 40      # Number of filter banks (40)
stats = ['mean', 'std', 'med', 'kurt', 'skew', 'q1', 'q99', 'min', 'max', 'range'] # Global statistics
features_list =  ['zcr', 'energy', 'energy_entropy', 'spectral_centroid', 'spectral_spread', # Audio features
                      'spectral_entropy', 'spectral_flux', 'sprectral_rolloff', 'mfcc']
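
To make the window parameters concrete: at 16 kHz, a 25 ms window spans 400 samples and a 10 ms step spans 160 samples. The sketch below frames a signal accordingly. It is an illustration rather than the `AudioLibrary` framing code; the name `frame_signal` is hypothetical, and dropping the trailing partial frame is an assumption.

```python
import numpy as np

def frame_signal(samples, sample_rate, win_size=0.025, win_step=0.01):
    """Split a 1-D signal into overlapping frames (one row per frame)."""
    frame_length = int(round(win_size * sample_rate))  # 400 samples at 16 kHz
    hop_length = int(round(win_step * sample_rate))    # 160 samples at 16 kHz
    if len(samples) < frame_length:
        return np.empty((0, frame_length))
    # Trailing samples that do not fill a whole frame are dropped (assumption)
    nb_frames = 1 + (len(samples) - frame_length) // hop_length
    starts = hop_length * np.arange(nb_frames)
    return np.stack([samples[s:s + frame_length] for s in starts])

# One second of silence at 16 kHz
frames = frame_signal(np.zeros(16000), 16000)
```

Each row of `frames` is one short-term frame on which the features listed in section I would be computed.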

In [15]:
# Start feature extraction
print("Feature extraction: START")

# Compute global feature statistics for all audio files
features = np.asarray(list(map(global_feature_statistics, signal)))

# Stop feature extraction
print("Feature extraction: END!")

Feature extraction: START
Feature extraction: END!


## VI. Save as

In [16]:
# Save features and labels to pickle
with open("/Users/aryasoni/Documents/GitHub/AI-Interview/Audio/Dataset/[RAVDESS][HAP-SAD-NEU-ANG-FEA-DIS-SUR][GLOBAL_STATS].p", 'wb') as f:
    pickle.dump([features, labels], f)
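
For reference, a file written this way can be read back by unpacking the same two-element list. The round trip below is a sketch on placeholder arrays in a temporary directory; the file name `example.p` and the array contents are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

# Placeholder data with the same [features, labels] layout as the saved file
features = np.zeros((2, 3))
labels = np.asarray(['f_HAP', 'm_SAD'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.p')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump([features, labels], f)
    with open(path, 'rb') as f:
        loaded_features, loaded_labels = pickle.load(f)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.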