# 🎧 Audio Emotion Classification – Detailed Step-by-Step Explanation

This section documents how we train an **LSTM-based neural network** to classify
emotions from raw speech using MFCC features.  
Dataset used: **TESS – Toronto Emotional Speech Set**.

---

## 1 · Imports & Setup
- `librosa` – audio loading & MFCC extraction
- `LSTM` – sequence model capturing temporal patterns in MFCCs

In [1]:
import pandas as pd
import numpy as np
import os
import seaborn  as sns 
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio
import warnings 
warnings.filterwarnings('ignore')


### 📂 Collecting File Paths & Emotion Labels

The block below **discovers every audio file** in the TESS dataset directory and
builds two aligned Python lists:

| List | Contents | Example element |
|------|----------|-----------------|
| `path`   | Full file-path to the `.wav` file | `C:/…/TESS/.../OAF_back_angry.wav` |
| `labels` | Ground-truth emotion derived from the filename | `angry` |



In [2]:
path = []
labels = []
# Walk through all sub-folders in the TESS root directory
for dirname, _, filenames in os.walk('C:/Users/uvais/Downloads/TESS Toronto emotional speech set data'):
    for filename in filenames: # loop each file
        # 1️⃣  Build the absolute file path
        path.append(os.path.join(dirname,filename))
        # 2️⃣  Extract the emotion label from the filename
        #     TESS naming pattern:  <Speaker>_<Sentence>_<EMOTION>.wav
        #     e.g. OAF_back_angry.wav   →   label = "angry"
        label = filename.split('_')[-1]     # "angry.wav"
        label = label.split('.')[0]         # "angry"
        labels.append(label.lower())        # lower-case for consistency
print("Data set is loaded")

Data set is loaded
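
The filename-to-label rule used above can be checked in isolation. The sketch below is a hypothetical helper (`label_from_filename` is not part of the notebook) that applies the same split logic and, as an added assumption, skips files that are not `.wav`, which `os.walk` would otherwise also collect.

```python
import os

def label_from_filename(filename):
    """Return the lower-cased emotion label, or None for non-.wav files.

    Hypothetical helper; mirrors the split('_') / split('.') logic above.
    """
    stem, ext = os.path.splitext(filename)
    if ext.lower() != ".wav":
        return None
    # The emotion is the last underscore-separated token of the stem
    return stem.split("_")[-1].lower()

print(label_from_filename("OAF_back_angry.wav"))  # angry
print(label_from_filename("YAF_back_ps.wav"))     # ps
print(label_from_filename("README.txt"))          # None
```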


In [3]:
#Create a DataFrame for clean handling:
df = pd.DataFrame()
df['speech'] = path
df['labels'] = labels
df.head()

|   | speech | labels |
|---|--------|--------|
| 0 | C:/Users/uvais/Downloads/TESS Toronto emotiona... | angry |
| 1 | C:/Users/uvais/Downloads/TESS Toronto emotiona... | angry |
| 2 | C:/Users/uvais/Downloads/TESS Toronto emotiona... | angry |
| 3 | C:/Users/uvais/Downloads/TESS Toronto emotiona... | angry |
| 4 | C:/Users/uvais/Downloads/TESS Toronto emotiona... | angry |


### 🔊 Extracting Audio Features using MFCC – `extract_mfcc()` Function

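To train a machine learning model for **audio emotion classification**, raw audio signals must first be transformed into a **numerical representation**. **MFCCs (Mel-Frequency Cepstral Coefficients)** are a widely used choice for speech tasks. The `extract_mfcc()` function below loads up to 3 seconds of each clip (skipping the first 0.5 s), computes 40 MFCCs per frame, and averages them over time, so every clip becomes one 40-element vector.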

---

                 


In [4]:
def extract_mfcc(filename):
    y, sr = librosa.load(filename, duration=3,offset=0.5)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T,axis=0)
    return mfcc
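
The time-averaging step in `extract_mfcc` is what turns clips of different lengths into fixed-size vectors. The NumPy-only sketch below uses random arrays as stand-ins for the `(n_mfcc, n_frames)` matrix that `librosa.feature.mfcc` returns; the frame counts and the helper name `pool_over_time` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for MFCC matrices of shape (n_mfcc, n_frames); frame counts are invented
short_clip = rng.normal(size=(40, 87))
long_clip = rng.normal(size=(40, 130))

def pool_over_time(mfcc_matrix):
    # Same expression as in extract_mfcc: transpose to (n_frames, n_mfcc),
    # then average over the frames
    return np.mean(mfcc_matrix.T, axis=0)

print(pool_over_time(short_clip).shape)  # (40,)
print(pool_over_time(long_clip).shape)   # (40,)
```

One consequence of this design is that the ordering of frames within a clip is discarded: the 40 positions the LSTM later iterates over are coefficient indices, not points in time.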

In [None]:
#apply to the speech column
X_mfcc = df['speech'].apply(lambda x: extract_mfcc(x))


In [None]:
# X = [x for x in X_mfcc]
# X = np.array(X) 
# X.shape
X = np.stack(X_mfcc.to_numpy())          # shape ⇒ (N_samples, 40)
X = np.expand_dims(X, -1)  # LSTM expects 3-D: (N, 40, 1)


## Label Encoding
- `LabelEncoder` converts emotion strings → integers 0-6.
- `to_categorical` converts integers → one-hot vectors, e.g. `angry → [1 0 0 0 0 0 0]`.

In [14]:
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# LabelEncoder expects a 1-D array of labels, so pass the column as a Series
y = to_categorical(labelencoder.fit_transform(df['labels']))

In [15]:
y[0]

array([1., 0., 0., 0., 0., 0., 0.], dtype=float32)
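
To make the two encoding steps concrete, here is a NumPy-only sketch of what they amount to: `LabelEncoder` assigns integer codes to the sorted unique labels, and `to_categorical` turns each code into a one-hot row. The `toy_labels` array is invented; the notebook itself uses `df['labels']`.

```python
import numpy as np

# Toy labels; the real notebook uses df['labels']
toy_labels = np.array(["sad", "angry", "ps", "angry", "happy"])

# LabelEncoder-style step: sorted unique classes, then integer codes
classes, codes = np.unique(toy_labels, return_inverse=True)

# to_categorical-style step: pick rows of an identity matrix
one_hot = np.eye(len(classes), dtype=np.float32)[codes]

print(classes)     # ['angry' 'happy' 'ps' 'sad']
print(codes)       # [3 0 2 0 1]
print(one_hot[1])  # [1. 0. 0. 0.]
```

Because the classes are sorted alphabetically, `angry` always receives index 0, which matches the `y[0]` output above.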

### 🧠 Building the Audio Emotion Classification Model (LSTM-Based)

After extracting fixed-length MFCC features from speech samples, we build a **deep learning model** to classify them into one of 7 emotion categories.

---

### 🧩 Model Architecture Details

| Layer Type | Parameters | Description |
|------------|------------|-------------|
| **LSTM (123)** | `input_shape=(40, 1)`<br>`return_sequences=False` | First and only recurrent layer. Reads the 40 time-averaged MFCC coefficients as a sequence of **40 steps × 1 feature** and outputs a **123-dimensional** feature vector. |
| **Dense (64)** | `activation='relu'` | Fully-connected hidden layer with **64 neurons** and ReLU activation. |
| **Dropout (0.2)** | *20 % dropout* | Randomly zeroes 20 % of the incoming activations on each training step to mitigate over-fitting; inactive at inference. |
| **Dense (32)** | `activation='relu'` | Another dense layer with **32 units**, adding non-linear transformation capacity. |
| **Dropout (0.2)** | *20 % dropout* | Second dropout layer for additional regularisation. |
| **Dense (7)** | `activation='softmax'` | Output layer with **7 neurons** (one per emotion class); softmax converts logits into a probability distribution. |
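
The size of each layer in the table can be worked out by hand. The sketch below uses the standard parameter-count formulas for an LSTM layer with biases (four gates, each with input weights, recurrent weights and a bias) and for a Dense layer; the helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(units, input_dim):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, units):
    # Weight matrix plus one bias per unit
    return inputs * units + units

counts = {
    "lstm":    lstm_params(123, 1),
    "dense":   dense_params(123, 64),
    "dense_1": dense_params(64, 32),
    "dense_2": dense_params(32, 7),
}
print(counts)
# {'lstm': 61500, 'dense': 7936, 'dense_1': 2080, 'dense_2': 231}
print(sum(counts.values()))  # 71747
```

These values agree with the per-layer and total counts that `model.summary()` prints for this architecture.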


In [11]:
from keras.models import Sequential
from keras.layers import Dense,LSTM,Dropout
model = Sequential([
    LSTM(123,return_sequences=False, input_shape=(40,1)),
    Dense(64,activation='relu'),
    Dropout(0.2),
    Dense(32,activation='relu'),
    Dropout(0.2),
    Dense(7,activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 123)               61500     
                                                                 
 dense (Dense)               (None, 64)                7936      
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 7)                 231       
                                                                 
Total params: 71,747
Trainable params: 71,747
Non-trainable params: 0

### ⚙️ `model.fit()` — Parameter Breakdown

| Argument | Value | What it Means |
|----------|-------|---------------|
| **`X`** | NumPy array, shape **`(N_samples, 40, 1)`** | Input feature matrix &nbsp;→&nbsp; 40-dimensional MFCC sequence per audio clip. (The extra **`1`** dimension lets the LSTM treat it as **40 timesteps × 1 feature**). |
| **`y`** | NumPy array, shape **`(N_samples, 7)`** | Target labels, one-hot encoded for **7 emotion classes**. |
| **`validation_split`** | `0.20` | Keras holds out the **last 20 %** of `X` and `y` (taken once, before any shuffling) as a validation set and evaluates on it after every epoch, so no separate `X_val` / `y_val` is needed. Because the files were collected folder by folder, this tail is not a random sample of the classes. |
| **`epochs`** | `100` | Model performs **100 complete passes** over the training data.<br>More epochs help convergence but may over-fit; add early stopping if necessary. |
| **`batch_size`** | `512` | Weights are updated once per batch of **512 samples**. Such a large batch is feasible because each MFCC vector is small, and it shortens each epoch. |
| **`shuffle`** | `True` | Shuffles training data **before every epoch** to avoid order-based biases. |
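
Keras takes the `validation_split` fraction from the end of the arrays before any shuffling, and `shuffle=True` only reorders the remaining training portion. A minimal sketch, assuming the samples are ordered by class as they are when gathered folder by folder, of why shuffling once before `model.fit` matters; `X_demo` and `y_demo` are toy arrays, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data ordered by class, like files gathered folder by folder
y_demo = np.repeat(np.arange(7), 10)            # 70 labels: 0,0,...,6,6
X_demo = rng.normal(size=(70, 40, 1))

# Without shuffling, the last 20 % contains only the final classes
tail = y_demo[int(0.8 * len(y_demo)):]
print(np.unique(tail))                          # [5 6]

# One shared permutation keeps features and labels aligned
perm = rng.permutation(len(y_demo))
X_shuf, y_shuf = X_demo[perm], y_demo[perm]
tail_shuf = y_shuf[int(0.8 * len(y_shuf)):]
print(np.unique(tail_shuf))                     # typically a mix of classes
```

Applying the same permutation to the real `X` and `y` before training would make the validation metrics reflect a mix of emotions rather than whichever folders happened to be read last.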

---

### 📈 `history` — What It Stores

`history` is a `keras.callbacks.History` object returned by `model.fit()`.  
After training it contains a dictionary of per-epoch metrics:

```python
history.history.keys()
# dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
```


In [12]:
#Train the model
history = model.fit(X,y, validation_split=0.2,epochs=100, batch_size = 512,shuffle=True)

Epoch 1/100
...
Epoch 77/100
Epoch 78
[output truncated; per-epoch loss and accuracy values were not preserved in this export]
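
Once training finishes, `history.history` is a plain dict of lists, so it can be inspected without any Keras calls. The sketch below uses an invented `fake_history` dict (the numbers are not from this notebook's run) to find the epoch with the lowest validation loss, which is the quantity an early-stopping rule would typically monitor.

```python
# Invented values for illustration only; not results from this notebook
fake_history = {
    "loss":         [1.90, 1.20, 0.70, 0.40, 0.25],
    "accuracy":     [0.20, 0.55, 0.75, 0.88, 0.93],
    "val_loss":     [1.85, 1.30, 0.90, 0.95, 1.10],
    "val_accuracy": [0.22, 0.50, 0.68, 0.67, 0.64],
}

val_loss = fake_history["val_loss"]
# Index of the smallest validation loss (0-based)
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

print(f"best epoch: {best_epoch + 1}, val_loss: {val_loss[best_epoch]:.2f}")
# best epoch: 3, val_loss: 0.90
```

In this made-up run the training loss keeps falling after epoch 3 while the validation loss rises, which is the usual sign of over-fitting.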

In [13]:
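# Accuracy on the full dataset. X and y include the samples used for training,
# so this is not a held-out test score.
loss, accuracy = model.evaluate(X, y, verbose=0)
print(accuracy)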

1.0
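
The score above is computed on the same `X` and `y` that were passed to `model.fit`, so it says little about unseen speech. A minimal NumPy sketch of a stratified hold-out split follows; `stratified_split` is a hypothetical helper and the labels are toy data. In practice `sklearn.model_selection.train_test_split` with its `stratify` argument does the same job.

```python
import numpy as np

def stratified_split(class_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class represented in both parts."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(class_ids):
        # Shuffle the positions belonging to this class
        idx = rng.permutation(np.flatnonzero(class_ids == cls))
        n_test = max(1, int(round(test_fraction * len(idx))))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Toy integer labels: 7 classes with 10 clips each
class_ids = np.repeat(np.arange(7), 10)
train_idx, test_idx = stratified_split(class_ids)
print(len(train_idx), len(test_idx))  # 56 14
```

With one-hot targets, the integer class ids can be recovered via `y.argmax(axis=1)`; the model would then be trained on `X[train_idx]` and evaluated only on `X[test_idx]`.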


## Testing and Model Saving

In [22]:
#filename = "C:/Machine Learning/Toronto _emo_ speech/OAF_happy/OAF_back_happy.wav"
filename = "C:/Machine Learning/Toronto _emo_ speech/YAF_pleasant_surprised/YAF_back_ps.wav"
prediction_feature = extract_mfcc(filename)
# Match the training input shape: (1 sample, 40 steps, 1 feature)
prediction_feature = prediction_feature.reshape(1, -1, 1)
y_predict = np.argmax(model.predict(prediction_feature), axis=-1)
# Map the predicted class index back to its emotion string
predicted_label = labelencoder.inverse_transform(y_predict)
predicted_label

array(['ps'], dtype=object)
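
`model.predict` returns one softmax probability per class, and `np.argmax` keeps only the largest. The sketch below shows how the full row can be read; the probability values are made up, and `class_names` assumes the seven labels seen in this notebook in the alphabetical order that `LabelEncoder` produces.

```python
import numpy as np

# Alphabetical order, as LabelEncoder sorts its classes (assumed label set)
class_names = np.array(["angry", "disgust", "fear", "happy", "neutral", "ps", "sad"])

# Made-up softmax output for a single clip; not a real prediction
probs = np.array([[0.02, 0.01, 0.03, 0.10, 0.04, 0.75, 0.05]])

# Indices of the three largest probabilities, highest first
top = np.argsort(probs[0])[::-1][:3]
for i in top:
    print(f"{class_names[i]}: {probs[0, i]:.2f}")
# ps: 0.75
# happy: 0.10
# sad: 0.05
```

Looking at the runner-up classes gives a rough sense of how confident the model is, which a bare `argmax` hides.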

In [23]:
from keras.models import load_model
model.save("emotion_audio_model.h5")
loaded_model = load_model("emotion_audio_model.h5")

In [7]:
from keras.models import load_model
loaded_model = load_model("emotion_audio_model.h5")

In [None]:
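# Index → emotion, in the alphabetical order used by LabelEncoder
labels = {0: "angry",
          1: "disgust",
          2: "fear",
          3: "happy",
          4: "neutral",
          5: "pleasant surprise",   # encoded as "ps" in the TESS filenames
          6: "sad"}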

In [95]:
filename = "C:/Machine Learning/uv_r.wav"
prediction_feature = extract_mfcc(filename)
# Match the training input shape: (1 sample, 40 steps, 1 feature)
prediction_feature = prediction_feature.reshape(1, -1, 1)
y_predict = np.argmax(loaded_model.predict(prediction_feature), axis=-1)
# y_predict is a length-1 array; take its single element as a plain int
print(labels[int(y_predict[0])])

#model2 = labelencoder.inverse_transform(y_predict)
#model2

neutral


