# Speech Emotion Recognition Pipeline

## 1. Introduction

This notebook implements an end-to-end deep learning pipeline for emotion classification on audio recordings. The objective is to classify each recording, whether speech or song, into one of eight emotional states: neutral, calm, happy, sad, angry, fearful, disgust, or surprised. The workflow covers data loading, feature extraction, augmentation, model training, evaluation, and inference.


## 2. Library Installation and Imports

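We import the libraries required for audio processing, data handling, and deep learning: `librosa` for audio loading and MFCC extraction, `numpy` for array handling, `scikit-learn` for label encoding, data splitting, class weights, and metrics, and `tensorflow.keras` for the model. No installation commands are run in this notebook, so these packages are assumed to be installed already.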


In [1]:
import os
import glob
import numpy as np
import librosa
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report, confusion_matrix


## 3. Emotion Mapping and Data Path Configuration

We define the mapping between emotion codes and their corresponding labels, and specify the file paths for the speech and song audio datasets. This mapping is crucial for extracting the correct emotion label from each audio file's name.



In [3]:
# Emotion mapping (adapt for your dataset)
emotions = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'
}
observed_emotions = list(emotions.values())
DATA_PATHS = [
    '/Users/dushyantyadav/Downloads/Audio_Speech_Actors_01-24/Actor*/**/*.wav',
    '/Users/dushyantyadav/Downloads/Audio_Song_Actors_01-24/Actor*/**/*.wav'
]
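
The label lookup described above can be sketched in isolation. This is a minimal illustration, assuming hyphen-separated filenames in which the third field is the emotion code (the convention that `load_data_dl` relies on later). The helper name `emotion_from_filename` and the example filename are hypothetical and not part of the notebook.

```python
import os

# Same code-to-label mapping as in the cell above
emotions = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'
}

def emotion_from_filename(path):
    """Return the label for the third hyphen-separated field, or None if absent/unknown."""
    parts = os.path.basename(path).split("-")
    if len(parts) < 3:
        return None  # filename does not follow the expected pattern
    return emotions.get(parts[2])

# Hypothetical filename used only for illustration
print(emotion_from_filename("Actor_12/03-01-06-01-02-01-12.wav"))
```

Returning `None` for unexpected names lets the caller skip such files instead of raising an `IndexError`.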


## 4. Data Augmentation Functions

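To improve model robustness, we define two waveform-level augmentation functions: one adds Gaussian noise and one shifts the signal in time. Augmentation increases the diversity of training samples and helps the model generalize better. Because every file is augmented in the same way, it enlarges the dataset without changing the class proportions; class imbalance is handled separately through class weights (Section 8).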


In [5]:
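def add_noise(data, noise_factor=0.005):
    # Add zero-mean Gaussian noise scaled by noise_factor
    noise = np.random.randn(len(data))
    return data + noise_factor * noise

def shift(data, shift_max=0.2, shift_direction='both'):
    # Circular shift by a random amount below shift_max * len(data) samples
    shift_amt = np.random.randint(int(len(data) * shift_max))
    # np.roll moves samples to the right for positive shifts, so negate for 'left'
    if shift_direction == 'left':
        shift_amt = -shift_amt
    elif shift_direction == 'both':
        # Pick the direction at random
        if np.random.randint(0, 2) == 1:
            shift_amt = -shift_amt
    # Samples pushed past one end wrap around to the other
    return np.roll(data, shift_amt)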
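
One property of the `shift` function above is that `np.roll` wraps samples around, so audio pushed off one end reappears at the other. The sketch below contrasts that with a zero-filled shift on a tiny array. `shift_zero_pad` is a hypothetical alternative shown only for comparison; the notebook itself uses the wrapping version.

```python
import numpy as np

def shift_zero_pad(data, shift_amt):
    """Shift a 1-D signal by shift_amt samples, filling the vacated region with zeros."""
    out = np.zeros_like(data)
    if shift_amt > 0:
        out[shift_amt:] = data[:-shift_amt]   # move right, zeros at the start
    elif shift_amt < 0:
        out[:shift_amt] = data[-shift_amt:]   # move left, zeros at the end
    else:
        out[:] = data
    return out

signal = np.arange(1, 6, dtype=float)  # [1, 2, 3, 4, 5]
print(np.roll(signal, 2))              # wraps around: [4. 5. 1. 2. 3.]
print(shift_zero_pad(signal, 2))       # zero-filled:  [0. 0. 1. 2. 3.]
```

Whether wrapping matters depends on the recordings: if clips begin and end in silence, the two variants produce nearly the same signal.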


## 5. Feature Extraction

We extract Mel Frequency Cepstral Coefficients (MFCCs) from each audio file, which serve as input features for the model. Each MFCC sequence is padded or truncated to a fixed length to ensure uniform input dimensions.

In [7]:
def extract_mfcc_sequence(file_path, n_mfcc=40, max_len=200):
    y, sr = librosa.load(file_path, res_type='scipy')
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate for batching
    if mfcc.shape[1] < max_len:
        pad_width = max_len - mfcc.shape[1]
        mfcc = np.pad(mfcc, pad_width=((0,0),(0,pad_width)), mode='constant')
    else:
        mfcc = mfcc[:, :max_len]
    return mfcc.T  # (max_len, n_mfcc)
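
The pad-or-truncate step can be checked without any audio. The sketch below applies the same logic to synthetic arrays; `pad_or_truncate` is a hypothetical standalone helper, and the duration estimate assumes librosa's defaults (22,050 Hz sample rate on load, hop length of 512 samples), which the notebook does not override.

```python
import numpy as np

def pad_or_truncate(features, max_len):
    """Force a (n_coeffs, n_frames) array to exactly max_len frames."""
    n_frames = features.shape[1]
    if n_frames < max_len:
        # Zero-pad on the right of the time axis only
        return np.pad(features, ((0, 0), (0, max_len - n_frames)), mode='constant')
    return features[:, :max_len]

short_clip = np.ones((40, 150))   # fewer frames than max_len
long_clip = np.ones((40, 260))    # more frames than max_len

print(pad_or_truncate(short_clip, 200).shape)   # (40, 200)
print(pad_or_truncate(long_clip, 200).shape)    # (40, 200)

# Approximate audio duration covered by 200 frames under the assumed defaults
seconds = 200 * 512 / 22050
print(round(seconds, 2))
```

Clips longer than this window lose their tail, and shorter clips carry trailing zeros, so `max_len` is a trade-off between coverage and padding.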


## 6. Data Loading and Preprocessing

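We load all relevant audio files, extract their features, optionally add two augmented copies of each file (noise and shift), and encode the emotion labels. Labels are integer-encoded with `LabelEncoder` and then one-hot encoded for model training. The data is split into training and test sets using stratified sampling to preserve class distributions. Note that augmentation happens before the split, so augmented copies of one recording can land in both sets, which can make the test metrics optimistic.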


In [9]:
def load_data_dl(test_size=0.2, max_len=200, augment=True):
    x, y = [], []
    files = []
    for path in DATA_PATHS:
        files.extend(glob.glob(path, recursive=True))
    for file in files:
        file_name = os.path.basename(file)
        emotion_code = file_name.split("-")[2]
        emotion = emotions.get(emotion_code)
        if emotion not in observed_emotions:
            continue
        try:
            # Original
            mfcc_seq = extract_mfcc_sequence(file, max_len=max_len)
            x.append(mfcc_seq)
            y.append(emotion)
            if augment:
                # Load the waveform once and derive both augmented variants (noise, shift) from it
                y_audio, sr = librosa.load(file, res_type='scipy')
                for y_aug in (add_noise(y_audio), shift(y_audio)):
                    mfcc_aug = librosa.feature.mfcc(y=y_aug, sr=sr, n_mfcc=40)
                    # Pad or truncate to max_len frames, as in extract_mfcc_sequence
                    if mfcc_aug.shape[1] < max_len:
                        pad_width = max_len - mfcc_aug.shape[1]
                        mfcc_aug = np.pad(mfcc_aug, pad_width=((0, 0), (0, pad_width)), mode='constant')
                    else:
                        mfcc_aug = mfcc_aug[:, :max_len]
                    x.append(mfcc_aug.T)
                    y.append(emotion)
        except Exception as e:
            print(f"Error processing {file}: {e}")
    x = np.array(x)
    y = np.array(y)
    le = LabelEncoder()
    y_enc = le.fit_transform(y)
    y_cat = to_categorical(y_enc)
    x_train, x_test, y_train, y_test = train_test_split(
        x, y_cat, test_size=test_size, random_state=42, stratify=y)
    return x_train, x_test, y_train, y_test, le, y
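
Since the loader above augments before splitting, variants of one recording can end up on both sides of the split. A common way to avoid this is to split at the file level first and augment only the training files. The sketch below illustrates that ordering on placeholder strings; `split_then_augment` and `fake_augment` are hypothetical names, and stratification is omitted for brevity (in the notebook, `train_test_split` with `stratify` applied to the file list would fill that role).

```python
import numpy as np

def split_then_augment(files, labels, augment_fn, test_size=0.2, seed=42):
    """Split by file first, then add augmented variants to the training set only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(files))
    n_test = int(round(len(files) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    train = []
    for i in train_idx:
        train.append((files[i], labels[i]))
        # Augmented copies inherit the label of their source file
        train.extend((variant, labels[i]) for variant in augment_fn(files[i]))
    test = [(files[i], labels[i]) for i in test_idx]   # test files stay untouched
    return train, test

# Placeholder data: strings stand in for audio clips
files = [f"clip_{i}.wav" for i in range(10)]
labels = ["happy", "sad"] * 5
fake_augment = lambda f: [f + "+noise", f + "+shift"]

train, test = split_then_augment(files, labels, fake_augment)
print(len(train), len(test))
```

With this ordering, no test recording has a near-duplicate in the training set, so test accuracy reflects performance on unseen recordings.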


## 7. Model Architecture: CNN-LSTM

We define the deep learning model architecture, which combines 1D convolutional layers for local feature extraction with an LSTM layer for capturing temporal dependencies in the audio data. The code below uses an LSTM; a GRU layer could be substituted, but it is not imported or used here. Batch normalization and dropout are used for regularization.


In [11]:
def build_cnn_lstm(input_shape, num_classes):
    model = Sequential()
    model.add(Conv1D(64, kernel_size=5, activation='relu', input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(0.3))
    model.add(Conv1D(128, kernel_size=5, activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(0.3))
    model.add(LSTM(128, return_sequences=False))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
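
To see what the LSTM actually receives, it helps to trace the sequence length through the convolution and pooling layers. The arithmetic below is a sketch that assumes the Keras defaults used in the model above: `'valid'` padding and stride 1 for `Conv1D`, and a pooling stride equal to `pool_size` for `MaxPooling1D`. The helper names are hypothetical.

```python
def conv1d_valid_len(n_steps, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return n_steps - kernel_size + 1

def maxpool1d_len(n_steps, pool_size):
    """Output length of MaxPooling1D with stride equal to pool_size."""
    return n_steps // pool_size

n = 200                      # max_len frames entering the network
n = conv1d_valid_len(n, 5)   # 196 after the first Conv1D
n = maxpool1d_len(n, 2)      # 98 after the first pooling layer
n = conv1d_valid_len(n, 5)   # 94 after the second Conv1D
n = maxpool1d_len(n, 2)      # 47 after the second pooling layer
print(n)  # 47
```

Under these assumptions the LSTM processes 47 time steps of 128 channels each, which is much shorter than the 200 input frames and reduces recurrent computation accordingly.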


## 8. Class Weights Calculation

To address class imbalance, we compute class weights based on the frequency of each class in the training data. These weights are used during model training to ensure that minority classes are not neglected.


In [13]:
# After loading data
x_train, x_test, y_train, y_test, le, y_all = load_data_dl(test_size=0.2, max_len=200, augment=True)
y_train_labels = np.argmax(y_train, axis=1)
class_weights = compute_class_weight('balanced', classes=np.unique(y_train_labels), y=y_train_labels)
class_weight_dict = dict(enumerate(class_weights))
print("Class weights:", class_weight_dict)


Class weights: {0: 0.815410199556541, 1: 0.8145071982281284, 2: 1.5954446854663775, 3: 0.815410199556541, 4: 0.815410199556541, 5: 1.630820399113082, 6: 0.815410199556541, 7: 1.5954446854663775}
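
The `'balanced'` mode weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula with NumPy on toy labels; `balanced_class_weights` is a hypothetical helper written for illustration, not a replacement for `compute_class_weight`.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels: class 0 appears twice as often as classes 1 and 2
toy_labels = [0, 0, 0, 0, 1, 1, 2, 2]
print(balanced_class_weights(toy_labels))
```

A class with half as many samples gets twice the weight. Read this way, the weights near 1.6 for classes 2, 5, and 7 in the output above suggest that those classes have roughly half as many training samples as the classes weighted near 0.8.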


## 9. Model Training

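The model is trained on the processed and augmented data, with class weights applied to improve performance on minority classes. Early stopping monitors the validation loss with a patience of 10 epochs and restores the best weights to limit overfitting. Because the test set is passed as `validation_data`, it also drives early stopping, so the reported test accuracy is not measured on fully held-out data.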
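
The patience rule behind early stopping can be sketched in plain Python. This is a simplified model of the behaviour (no `min_delta`, strict improvement required), not the Keras implementation; `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, under a patience rule."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # stop; the best weights would be restored
    return len(val_losses) - 1, best_epoch  # ran to the end without triggering

# Made-up validation losses for illustration
print(early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=2))  # (3, 1)
```

With `restore_best_weights=True`, the model kept after stopping corresponds to `best_epoch` rather than the last epoch trained.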


In [15]:
input_shape = x_train.shape[1:]  # (max_len, n_mfcc)
num_classes = y_train.shape[1]
model = build_cnn_lstm(input_shape, num_classes)

early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=60,
    batch_size=32,
    callbacks=[early_stop],
    class_weight=class_weight_dict
)

# Evaluate
loss, acc = model.evaluate(x_test, y_test)
print("Test accuracy:", acc)

y_pred = np.argmax(model.predict(x_test), axis=1)
y_true = np.argmax(y_test, axis=1)
print(classification_report(y_true, y_pred, target_names=le.classes_))
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))


2025-06-21 21:49:57.622698: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1
2025-06-21 21:49:57.622960: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 8.00 GB
2025-06-21 21:49:57.622987: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 2.67 GB
2025-06-21 21:49:57.623362: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-06-21 21:49:57.623824: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Epoch 1/60
...
Epoch 60/60
Test accuracy: 0.890625
              precision    recall  f1-score   support

       angry       0.98      0.90      0.94       226
        calm       0.89      0.94      0.92       225
     disgust       0.94      0.85      0.89       115
     fearful       0.90      0.84      0.86       226
     

## 10. Model Evaluation

After training, we evaluate the model on the test set. The evaluation code runs at the end of the training cell in Section 9 and reports the overall accuracy, the per-class precision, recall, and F1-score, and the confusion matrix. In the run shown, the test accuracy was 0.890625. These metrics are used to check that the model meets the project criteria for balanced and accurate emotion classification.
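
Per-class precision and recall can be read directly off a confusion matrix whose rows are true classes and whose columns are predicted classes, which is the layout `confusion_matrix` returns. The sketch below shows the calculation on a made-up 2x2 matrix; `per_class_precision_recall` is a hypothetical helper, and it assumes every class is predicted at least once (otherwise a column sum is zero).

```python
import numpy as np

def per_class_precision_recall(cm):
    """Compute precision and recall per class from a (true x predicted) matrix."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    precision = true_pos / cm.sum(axis=0)  # column sums: how often each class was predicted
    recall = true_pos / cm.sum(axis=1)     # row sums: how often each class truly occurred
    return precision, recall

# Made-up counts for two classes
toy_cm = [[5, 1],
          [2, 4]]
precision, recall = per_class_precision_recall(toy_cm)
print(precision, recall)
```

Comparing the two vectors shows whether a class is over-predicted (low precision) or frequently missed (low recall), which overall accuracy alone does not reveal.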


## 11. Inference and Prediction

We implement a function for predicting the emotion of a single audio file using the trained model. This function will be used for both standalone inference scripts and integration into the web application.


In [23]:

def predict_emotion(file_path, model, le, max_len=200):
    mfcc_seq = extract_mfcc_sequence(file_path, max_len=max_len)
    mfcc_seq = np.expand_dims(mfcc_seq, axis=0)
    pred = model.predict(mfcc_seq)
    predicted_class = np.argmax(pred)
    return le.classes_[predicted_class]

# Example usage
test_file = '/Users/dushyantyadav/Downloads/Crema/1082_IEO_FEA_MD.wav'
print("Predicted emotion:", predict_emotion(test_file, model, le))

Predicted emotion: fearful
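
`predict_emotion` returns only the top label. For a web application it can be useful to expose the model's confidence as well. The sketch below ranks the softmax output and returns the `k` most likely labels; `top_k_emotions` is a hypothetical helper, and the probabilities are made up rather than taken from the trained model.

```python
import numpy as np

def top_k_emotions(probs, class_names, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    probs = np.asarray(probs).ravel()        # accepts the (1, n_classes) predict output
    order = np.argsort(probs)[::-1][:k]      # indices sorted by descending probability
    return [(class_names[i], float(probs[i])) for i in order]

# Alphabetical order, as produced by LabelEncoder
toy_classes = ['angry', 'calm', 'disgust', 'fearful',
               'happy', 'neutral', 'sad', 'surprised']
# Made-up softmax output for a single clip
toy_probs = [[0.05, 0.02, 0.03, 0.60, 0.15, 0.05, 0.08, 0.02]]

print(top_k_emotions(toy_probs, toy_classes))
```

In the notebook, `model.predict(mfcc_seq)` and `le.classes_` would take the place of the toy inputs.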


In [31]:
# After training your model
model.save("model.keras")  # Recommended Keras v3 format
