<a href="https://colab.research.google.com/github/akhilkusuma0502/DesignProjects/blob/master/SpeechEmotionRecognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Speech Emotion Recognition
Speech Emotion Recognition (SER) is the task of recognizing human emotion and affective states from speech. It relies on the fact that the voice often reflects underlying emotion through tone and pitch. Animals such as dogs and horses are thought to use the same cues to understand human emotion.

Download the dataset from [here](https://drive.google.com/file/d/1wWsrN2Ep7x6lWqOXfr4rpKGYrJhWc8z7/view).

In [None]:
%pip install librosa soundfile numpy scikit-learn pyaudio

In [10]:
import zipfile

# Extract the RAVDESS archive into /tmp/ravdess-data.
local_zip = '/tmp/speech-emotion-recognition-ravdess-data.zip'
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/tmp/ravdess-data')

Import the required libraries. We need librosa to extract features such as MFCC, chroma, and mel; NumPy to represent the features as vectors; and glob to iterate over all sound files.

**librosa is a Python library for analyzing audio and music. It offers a flat package layout, standardized interfaces and names, backwards compatibility, modular functions, and readable code.**

We also need to split the data into training and test sets.

We use MLPClassifier from scikit-learn's neural_network module.

In [None]:
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# **Speech Emotion Recognition – Objective**
## To build a model to recognize emotion from speech using the librosa and sklearn libraries and the RAVDESS dataset.

The function below takes a sound file as input; all sound files here are .wav files.
Set mfcc, chroma, and mel to True when calling it to extract all three features.
It reads the audio samples and the sample rate, and initializes a NumPy array to hold the feature vector.

In [27]:
# Extract features (MFCC, chroma, mel) from a sound file.
def extract_feature(file_name, mfcc, chroma, mel):
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        if chroma:
            # Chroma is computed from the magnitude of the short-time Fourier transform.
            stft = np.abs(librosa.stft(y=X))
        result = np.array([])
        if mfcc:
            # 40 MFCCs, each averaged over all time frames.
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            # Separate names avoid overwriting the boolean arguments.
            chroma_feat = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma_feat))
        if mel:
            # Recent librosa versions require the signal as the keyword argument y.
            mel_feat = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel_feat))
    return result

The dictionary below maps the emotion codes used in the RAVDESS file names (the third hyphen-separated field) to emotion labels.
observed_emotions lists the subset of emotions we train the classifier on.

In [None]:
emotions = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}
observed_emotions = ['calm', 'happy', 'fearful', 'disgust']
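
As a minimal sketch of how the emotion code is read from a file name, the snippet below decodes the third hyphen-separated field. The helper name `emotion_from_filename` and the example path are hypothetical and only serve as illustration; the mapping is the same as in the dictionary above.

```python
import os

# Same code-to-label mapping as the notebook's `emotions` dictionary.
EMOTIONS = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
}

def emotion_from_filename(path):
    """Hypothetical helper: return the emotion label encoded in a file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    fields = stem.split("-")
    if len(fields) < 3:
        raise ValueError(f"Unexpected file name format: {path}")
    # The third field holds the emotion code.
    return EMOTIONS[fields[2]]

# Illustrative path; the actor folder and file name are assumptions.
example = emotion_from_filename("/tmp/ravdess-data/Actor_12/03-01-06-01-02-01-12.wav")
print(example)
```

Files whose decoded label is not in observed_emotions are skipped when the data is loaded.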

The function below loads the data and returns it split into training and test sets. The default test size is 0.2. It uses the extract_feature function to extract the mel, chroma, and MFCC features and stores them in the feature variable.

In [None]:
def load_data(test_size=0.2):
    x, y = [], []
    for file in glob.glob("/tmp/ravdess-data/**/*.wav"):
        file_name = os.path.basename(file)
        emotion = emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, train_size=1 - test_size, test_size=test_size, random_state=9)

Call the load_data() function, which returns four variables: x_train, x_test, y_train, and y_test.

In [None]:
x_train, x_test, y_train, y_test = load_data()
print((x_train.shape[0], x_test.shape[0]))

Let's see the number of features extracted per file.

In [29]:
print(f'Features extracted: {x_train.shape[1]}')

Features extracted: 180
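
The figure of 180 can be accounted for by adding up the sizes of the three feature groups. The sketch below assumes librosa's default settings for chroma (12 bins) and mel (128 bands), together with the n_mfcc=40 passed in extract_feature; the variable names are illustrative only.

```python
# Sizes of each feature group after averaging over time frames.
feature_sizes = {
    "mfcc": 40,     # n_mfcc=40 in extract_feature
    "chroma": 12,   # assumed librosa default number of chroma bins
    "mel": 128,     # assumed librosa default number of mel bands
}

total_features = sum(feature_sizes.values())
print(total_features)
```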


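Data preprocessing is done.
All that remains is to create a classifier: a multilayer perceptron (MLP) neural network.
Initialize it with alpha=0.01, batch_size=256, and epsilon=1e-08,
a single hidden layer of 300 units, and max_iter=500.
The learning rate is set to adaptive, although scikit-learn only uses this setting when solver='sgd', so it has no effect with the default adam solver.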

In [30]:
model=MLPClassifier(alpha=0.01,
                    batch_size=256, 
                    epsilon=1e-08,
                    hidden_layer_sizes=(300,), 
                    learning_rate='adaptive', 
                    max_iter=500)
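
To give a feel for the size of this network, here is a rough sketch that counts its trainable parameters. It assumes 180 input features and four output units (one per observed emotion); the helper `mlp_parameter_count` is hypothetical and not part of scikit-learn.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between consecutive layers
        total += n_out         # one bias per unit in the receiving layer
    return total

# Assumed sizes: 180 inputs, one hidden layer of 300 units, 4 emotion classes.
layer_sizes = [180, 300, 4]
n_params = mlp_parameter_count(layer_sizes)
print(n_params)
```

After fitting, the same number could be cross-checked by summing the sizes of the arrays in model.coefs_ and model.intercepts_.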

Fit the model to the training data.

In [31]:
model.fit(x_train,y_train)

MLPClassifier(activation='relu', alpha=0.01, batch_size=256, beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(300,), learning_rate='adaptive',
              learning_rate_init=0.001, max_fun=15000, max_iter=500,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

Pass x_test to predict and store the result in y_pred.

In [32]:
y_pred=model.predict(x_test)

Use accuracy_score from sklearn.metrics, passing y_test and the predicted values, to get the accuracy on the test data.

In [33]:
accuracy=accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))

Accuracy: 71.43%


We got about 71% accuracy: out of every 10 files we feed to the network, it classifies roughly 7 correctly and 3 incorrectly. With four emotion classes this is well above what random guessing would give, so it is a reasonable model for now.
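
A single accuracy figure hides how the model does on each emotion. The sketch below shows, on made-up labels, how accuracy is computed and how a per-class breakdown (recall) could be derived. The toy lists and helper names are hypothetical and are not results from this notebook.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true label, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels for illustration only.
toy_true = ["calm", "calm", "happy", "happy", "fearful", "fearful", "disgust", "disgust"]
toy_pred = ["calm", "calm", "happy", "fearful", "fearful", "happy", "disgust", "calm"]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_recall = per_class_recall(toy_true, toy_pred)
print(toy_accuracy, toy_recall)
```

On the real data, the same functions could be called with y_test and y_pred to see which emotions the model confuses most often.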




**SER is tough because emotions are subjective and annotating audio is challenging.**

# Let's predict on a single item from our test set to recognize the emotion in the speech.

In [36]:
model.predict([x_test[3]])

array(['happy'], dtype='<U7')

# It predicted happy