## Speech Emotion Recognizer

This program implements a speech emotion recognizer in Python with scikit-learn. It detects
emotion from the tone of human speech. The model is trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS).

The whole pipeline is as follows (the same as most machine learning pipelines):


1. Preparing the Dataset: Downloading the dataset and converting it into a form suitable for feature extraction.

2. Loading the Dataset: Loading the audio files in Python and extracting features such as power, pitch, and vocal tract configuration from the speech signal. The librosa library is used for this.

3. Training the Model: Once the dataset has been prepared and loaded, a suitable scikit-learn model is trained on it.

4. Testing the Model: Measuring how well the model performs.





In [None]:
import soundfile # to read audio file
import numpy as np
import librosa # to extract speech features
import glob
import os
import pickle # to save model after training
from sklearn.model_selection import train_test_split # for splitting training and testing
from sklearn.neural_network import MLPClassifier # multi-layer perceptron model
from sklearn.metrics import accuracy_score # to measure how good we are



### Feature extraction

The following function handles feature extraction, that is, converting the speech waveform into a parametric representation
at a relatively lower data rate.

(Here, a parametric representation means a small set of numbers, the parameters, that summarizes the signal in place of the raw samples.)

MFCC, chroma, and mel spectrogram features are used as speech features rather than the raw waveform, since the raw waveform may contain unnecessary data
that doesn't help with classification.

In [None]:
def extract_feature(file_name, **kwargs):
    """
    Extract feature from audio file `file_name`
        Features supported:
            - MFCC (mfcc): mel-frequency cepstral coefficients, a representation of the short-term power spectrum of sound
            - Chroma (chroma): chroma feature, which refers to the twelve different pitch classes
            - MEL Spectrogram Frequency (mel)
            - Contrast (contrast)
            - Tonnetz (tonnetz) tonal space
        e.g:
        `features = extract_feature(path, mel=True, mfcc=True)`
    """
    mfcc = kwargs.get("mfcc")
    chroma = kwargs.get("chroma")
    mel = kwargs.get("mel")
    contrast = kwargs.get("contrast")
    tonnetz = kwargs.get("tonnetz")
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        if chroma or contrast:
            stft = np.abs(librosa.stft(X))
        result = np.array([])
        if mfcc:
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma))
        if mel:
            # pass the signal as keyword `y`; newer librosa releases reject it as a positional argument
            mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel))
        if contrast:
            contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T,axis=0)
            result = np.hstack((result, contrast))
        if tonnetz:
            tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T,axis=0)
            result = np.hstack((result, tonnetz))
    return result
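
The time-averaging step used in the function above can be illustrated without any audio. The sketch below uses random matrices as stand-ins for the librosa outputs (an assumption made only so the example runs on its own) and shows that clips of different lengths end up as vectors of the same size. The shapes assume 40 MFCCs, 12 chroma bins, and 128 mel bands, which matches `n_mfcc=40` and librosa's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_vector(n_frames):
    """Average hypothetical frame-level features over time into one vector."""
    # Stand-ins for librosa outputs, each shaped (n_coefficients, n_frames).
    mfcc = rng.normal(size=(40, n_frames))    # 40 MFCCs, as with n_mfcc=40
    chroma = rng.normal(size=(12, n_frames))  # 12 pitch classes
    mel = rng.normal(size=(128, n_frames))    # 128 mel bands (assumed default)
    # Transpose to (n_frames, n_coefficients), then average over the frames.
    parts = [np.mean(m.T, axis=0) for m in (mfcc, chroma, mel)]
    return np.hstack(parts)

short_clip = pooled_vector(n_frames=50)
long_clip = pooled_vector(n_frames=400)
print(short_clip.shape, long_clip.shape)  # (180,) (180,)
```

Averaging over time discards how the features evolve within a clip, but it gives the classifier a fixed-length input regardless of clip duration.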




### Data loading

After extraction, we need to load the dataset, which contains voice samples from different actors.
RAVDESS file names already encode the emotion as a two-digit code; the mapping from code to label is defined in `int2emotion`.


In [None]:
int2emotion = {
    "01": "neutral",
    "02": "calm",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust",
    "08": "surprised"
}


# Only the following emotions are currently used for classification, as adding more would hurt accuracy
AVAILABLE_EMOTIONS = {
    "angry",
    "sad",
    "neutral",
    "happy",
    # "surprised",
    # "fearful"
}

def load_data(test_size=0.2):
    X, y = [], []
    for file in glob.glob("dataset/Actor_*/*.wav"):
        # get the base name of the audio file
        basename = os.path.basename(file)
        # get the emotion label
        emotion = int2emotion[basename.split("-")[2]]
        # we allow only AVAILABLE_EMOTIONS we set
        if emotion not in AVAILABLE_EMOTIONS:
            continue
        # extract speech features
        features = extract_feature(file, mfcc=True, chroma=True, mel=True)
        # add to data
        X.append(features)
        y.append(emotion)

    # split the data to training and testing and return it
    return train_test_split(np.array(X), y, test_size=test_size, random_state=7)
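
The label lookup in `load_data` relies on the emotion code being the third dash-separated field of the file name. As a minimal sketch, the helper below (`parse_ravdess_name` is a hypothetical name, not part of the original code) splits a file name into named fields. The field order is an assumption based on the published RAVDESS naming convention, and the example path is made up for illustration.

```python
import os

# Assumed field order of a RAVDESS file name such as 03-01-05-01-02-01-12.wav
RAVDESS_FIELDS = (
    "modality", "vocal_channel", "emotion", "intensity",
    "statement", "repetition", "actor",
)

def parse_ravdess_name(path):
    """Split a RAVDESS-style file name into a dict of two-digit codes."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("-")
    if len(parts) != len(RAVDESS_FIELDS):
        raise ValueError(f"unexpected file name format: {path!r}")
    return dict(zip(RAVDESS_FIELDS, parts))

info = parse_ravdess_name("dataset/Actor_12/03-01-05-01-02-01-12.wav")
print(info["emotion"], info["actor"])  # 05 12
```

Validating the number of fields up front turns a stray file in the dataset folder into a clear error instead of a confusing `KeyError` during label lookup.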



Now we can load the dataset with 75% for training and 25% for testing:

In [None]:
# load RAVDESS dataset, 75% training 25% testing
X_train, X_test, y_train, y_test = load_data(test_size=0.25)

Some logging regarding dataset information:

In [None]:
# print some details
# number of samples in training data
print("[+] Number of training samples:", X_train.shape[0])
# number of samples in testing data
print("[+] Number of testing samples:", X_test.shape[0])
# number of features used
# this is the length of the feature vector extracted
# by the extract_feature() function
print("[+] Number of features:", X_train.shape[1])
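
Besides the sample counts, it can help to look at how the labels are distributed, since accuracy is hard to interpret when one class dominates. The sketch below uses a short hypothetical label list in place of `y_train` and reports the share of each emotion, plus the accuracy a model would reach by always predicting the most frequent class.

```python
from collections import Counter

# Hypothetical labels standing in for y_train
labels = ["angry", "sad", "sad", "neutral", "happy", "angry", "sad"]

counts = Counter(labels)
total = len(labels)
for emotion, n in counts.most_common():
    print(f"{emotion:>8}: {n} ({n / total:.0%})")

# Accuracy of a trivial model that always predicts the most frequent class
majority_share = counts.most_common(1)[0][1] / total
print(f"majority baseline: {majority_share:.2f}")
```

Any reported accuracy is only meaningful relative to this baseline.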


Next, a grid search over MLPClassifier is needed to find good hyperparameters. The values used here were tuned based on findings on the internet and on what worked acceptably at the time.

The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.

In machine learning, a hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are derived via training.
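
A grid search of this kind could be sketched with scikit-learn's `GridSearchCV`. The data here is synthetic, and the grid values are illustrative assumptions, not the grid that produced the parameters used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 120 samples, 20 features, 4 emotion labels
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(120, 20))
y_demo = np.repeat(["angry", "sad", "neutral", "happy"], 30)
# Shift one feature per class so there is something to learn
X_demo[:, 0] += np.repeat([0.0, 2.0, 4.0, 6.0], 30)

# Illustrative grid: every combination is trained and cross-validated
param_grid = {
    "alpha": [0.001, 0.01],
    "hidden_layer_sizes": [(50,), (300,)],
}
search = GridSearchCV(
    MLPClassifier(max_iter=200, random_state=7),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",  # the metric that guides the search
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```

The cost grows with the product of the grid sizes times the number of folds, so in practice the grid is kept small or replaced by a randomized search.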




In [None]:
# best model, determined by a grid search

model_params = {
    'alpha': 0.01,
    'batch_size': 256,
    'epsilon': 1e-08,
    'hidden_layer_sizes': (300,),
    'learning_rate': 'adaptive',
    'max_iter': 500,
}

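It is important to mention that this is a fully connected (dense) neural network with one hidden layer of 300 units, a batch size of 256, up to 500 iterations,
and an adaptive learning rate. Note that in scikit-learn the `learning_rate` setting only takes effect with `solver='sgd'`; with the default `adam` solver it is ignored.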
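
To make the architecture concrete, the sketch below fits an `MLPClassifier` with the same hidden layer size on random stand-in data and inspects the learned weight matrices. The 180 input features and four classes are assumptions chosen to mirror this notebook; the small `batch_size` and `max_iter` are only there to keep the demo fast.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Random stand-in data: 80 samples, 180 features, 4 emotion labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 180))
y_demo = np.repeat(["angry", "happy", "neutral", "sad"], 20)

demo_model = MLPClassifier(
    hidden_layer_sizes=(300,),  # one hidden layer with 300 units
    batch_size=32,
    max_iter=5,                 # a few passes are enough to inspect shapes
    random_state=0,
)
demo_model.fit(X_demo, y_demo)

# One weight matrix per connection between consecutive layers
print([w.shape for w in demo_model.coefs_])  # [(180, 300), (300, 4)]
print(demo_model.n_layers_)                  # 3: input, hidden, output
```

scikit-learn may emit a convergence warning here because training is cut short on purpose.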

Now we can initialize the model with `model_params`:

In [None]:
# initialize Multi Layer Perceptron classifier
# with the best parameters (so far)
model = MLPClassifier(**model_params)

After initialization, we can start training the model on the loaded dataset:

In [None]:
# train the model
model.fit(X_train, y_train)


Now we can calculate the accuracy score and print it to measure how good the model is:

In [None]:
# predict on the 25% test split to measure how good we are
y_pred = model.predict(X_test)

# calculate the accuracy
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)

print("Accuracy: {:.2f}%".format(accuracy*100))
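
A single accuracy number hides which emotions get confused with each other. As a sketch, a confusion matrix can be computed alongside the accuracy; the label lists below are made up for illustration and are not results from this model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["angry", "happy", "neutral", "sad"]
# Hypothetical true and predicted labels for six test clips
y_true_demo = ["angry", "angry", "happy", "sad", "neutral", "sad"]
y_pred_demo = ["angry", "happy", "happy", "sad", "neutral", "neutral"]

acc = accuracy_score(y_true_demo, y_pred_demo)
# Rows are true labels, columns are predicted labels, both in `labels` order
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

print(f"accuracy: {acc:.2f}")  # accuracy: 0.67
for name, row in zip(labels, cm):
    print(f"{name:>8}: {row.tolist()}")
```

Off-diagonal entries show the specific mistakes; in this made-up example one "sad" clip is predicted as "neutral".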

Last but not least, we save the model!

In [None]:
# make result directory if it doesn't exist yet
os.makedirs("result", exist_ok=True)

# save the trained model; the context manager closes the file afterwards
with open("result/mlp_classifier.model", "wb") as f:
    pickle.dump(model, f)


Let's load the model back:

In [None]:
# load the saved model; the context manager closes the file afterwards
with open("result/mlp_classifier.model", "rb") as f:
    loaded_model = pickle.load(f)

Now let's extract features from a WAV file and feed them to the trained model to predict its emotion:

In [None]:
features = extract_feature("tests/happybutangry.wav", mfcc=True, chroma=True, mel=True).reshape(1, -1)
result = loaded_model.predict(features)[0]

print("result:", result)
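
For an ambiguous clip, the single predicted label may be less informative than the probability the model assigns to each emotion. The sketch below shows this with `predict_proba` on a small model trained on synthetic data; the data, model size, and names such as `demo_model` are assumptions made so the example runs without the dataset.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 100 samples, 10 features, 4 emotion labels
rng = np.random.default_rng(1)
y_demo = np.repeat(["angry", "happy", "neutral", "sad"], 25)
X_demo = rng.normal(size=(100, 10))
X_demo[:, 0] += np.repeat([0.0, 3.0, 6.0, 9.0], 25)  # separate the classes

demo_model = MLPClassifier(hidden_layer_sizes=(20,), max_iter=300, random_state=1)
demo_model.fit(X_demo, y_demo)

sample = X_demo[:1]  # a single sample, shaped (1, n_features)
probs = demo_model.predict_proba(sample)[0]

# Probabilities are ordered like demo_model.classes_; print the most likely first
for emotion, p in sorted(zip(demo_model.classes_, probs), key=lambda t: -t[1]):
    print(f"{emotion:>8}: {p:.2f}")
```

The same call works on the loaded model, as long as the features are reshaped to `(1, -1)` as in the cell above.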

