

# RAVDESS: Ryerson Audio-Visual Database of Emotional Speech and Song

---

The dataset contains **7,356 files** with a total size of **24.8 GB**.

RAVDESS features **24 professional actors** (12 female, 12 male), so the dataset is balanced across gender.

Each actor vocalizes two lexically matched statements in a **neutral North American accent**.

---

## Key Highlights:

- Balanced across gender (**12F, 12M**).
- Includes both **speech** and **song** modalities.
- High-quality recordings designed for research in emotion recognition.
- Total size: **~24.8 GB**

---

## Speech Emotions:

- Neutral
- Calm
- Happy
- Sad
- Angry
- Fearful
- Surprised
- Disgust

---

## Song Emotions:

- Neutral
- Calm
- Happy
- Sad
- Angry
- Fearful

---

## Access:
The dataset can be accessed through the Deep Lake API, which streams it directly into ML pipelines.

---

## Possible Use Cases:

- Call Center Analytics

- Mental Health Monitoring

- Music Emotion Recognition

---

## Goal:

- A mental health support tool that flags distress from speech tone.

---


Install the dependencies, then mount Google Drive and set up the folders that hold the raw data and the extracted features.

In [1]:
# Installing Dependencies

!pip install librosa numpy pandas scikit-learn tensorflow matplotlib seaborn audiomentations tqdm

Collecting audiomentations
  Downloading audiomentations-0.43.1-py3-none-any.whl.metadata (11 kB)
Collecting numpy-minmax<1,>=0.3.0 (from audiomentations)
  Downloading numpy_minmax-0.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting numpy-rms<1,>=0.4.2 (from audiomentations)
  Downloading numpy_rms-0.6.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.5 kB)
Collecting python-stretch<1,>=0.3.1 (from audiomentations)
  Downloading python_stretch-0.3.1-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Collecting soxr>=0.3.2 (from librosa)
  Downloading soxr-0.5.0.post1-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Downloading audiomentations-0.43.1-py3-none-any.whl (86 kB)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

data_path = '/content/drive/MyDrive/VoiceEmotionDetection/RavdessData/FullDataUnzipped'
features_dir = '/content/drive/MyDrive/VoiceEmotionDetection/RavdessData/Features'
processed_data_path = '/content/drive/MyDrive/VoiceEmotionDetection/processed_ravdess_data.pkl'

import os
os.makedirs(features_dir, exist_ok=True)

Mounted at /content/drive


In [3]:
import glob
import pandas as pd
from tqdm import tqdm

# Files sit one level below data_path; recursive=True only matters with a '**' pattern
audio_files = glob.glob(f"{data_path}/Actor_*/*.wav")
print(len(audio_files))

1440


In [4]:
df = pd.DataFrame({'filename': audio_files})
df['actor'] = df['filename'].apply(lambda x: int(os.path.basename(x).split('-')[-1].split('.')[0]))
df['emotion_code'] = df['filename'].apply(lambda x: int(os.path.basename(x).split('-')[2]))

emotion_map = {
    1: 'neutral', 2: 'calm', 3: 'happy', 4: 'sad',
    5: 'angry', 6: 'fearful', 7: 'disgust', 8: 'surprised'
}
df['emotion'] = df['emotion_code'].map(emotion_map)
df = df[df['emotion'].notnull()]
print(f'DF ready with {len(df)} rows')

DF ready with 1440 rows
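The cell above reads the actor and the emotion code from fixed positions in the file name. RAVDESS file names consist of seven hyphen-separated numeric fields (modality, vocal channel, emotion, intensity, statement, repetition, actor). The sketch below parses all of them at once; the helper name `parse_ravdess_name` and the example path are illustrative assumptions, not part of the notebook.

```python
import os

# Field order of a RAVDESS file name such as "03-01-06-01-02-01-12.wav"
FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")

EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}


def parse_ravdess_name(path):
    """Split a RAVDESS file name into its seven integer fields (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"Unexpected file name: {path}")
    info = dict(zip(FIELDS, (int(p) for p in parts)))
    info["emotion_label"] = EMOTIONS[info["emotion"]]
    return info


# Illustrative path, not taken from the notebook output
example = parse_ravdess_name("Actor_12/03-01-06-01-02-01-12.wav")
print(example["actor"], example["emotion_label"])
```

Parsing every field up front makes it easy to filter later, for example by intensity or by statement, without re-splitting the file name.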


In [5]:
df.head()

| | filename | actor | emotion_code | emotion |
|---|---|---|---|---|
| 0 | /content/drive/MyDrive/VoiceEmotionDetection/R... | 8 | 1 | neutral |
| 1 | /content/drive/MyDrive/VoiceEmotionDetection/R... | 8 | 1 | neutral |
| 2 | /content/drive/MyDrive/VoiceEmotionDetection/R... | 8 | 1 | neutral |
| 3 | /content/drive/MyDrive/VoiceEmotionDetection/R... | 8 | 1 | neutral |
| 4 | /content/drive/MyDrive/VoiceEmotionDetection/R... | 8 | 2 | calm |


In [6]:
import librosa
import numpy as np

def preprocessing_pipeline(audio_path, target_sr=22050, top_db=25, duration=3.0):
    y, sr = librosa.load(audio_path, sr=target_sr)
    if y.ndim > 1:
        y = librosa.to_mono(y)
    y, _ = librosa.effects.trim(y, top_db=top_db)
    y = librosa.util.normalize(y)
    target_length = int(duration * sr)
    if len(y) < target_length:
        y = np.pad(y, (0, target_length - len(y)))
    else:
        y = y[:target_length]
    return y, sr
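The last step of the preprocessing function forces every clip to the same number of samples, so that all feature matrices end up with the same width. The sketch below isolates that pad-or-truncate step on synthetic arrays; `fix_length` and the two test clips are hypothetical names used only for illustration.

```python
import numpy as np


def fix_length(y, target_length):
    """Zero-pad at the end, or truncate, so the result has exactly target_length samples."""
    if len(y) < target_length:
        return np.pad(y, (0, target_length - len(y)))
    return y[:target_length]


sr = 22050
target_length = int(3.0 * sr)      # 66150 samples for a 3-second window

short_clip = np.ones(1 * sr)       # hypothetical 1 s clip
long_clip = np.ones(5 * sr)        # hypothetical 5 s clip

padded = fix_length(short_clip, target_length)
cut = fix_length(long_clip, target_length)
print(padded.shape, cut.shape)
```

Note that truncation discards everything after the first 3 seconds of the trimmed signal, so clips that are still longer than that after silence removal lose their tail.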

In [7]:
def extract_mfcc(y, sr, n_mfcc=20):
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=2048, hop_length=512)
    delta = librosa.feature.delta(mfccs)
    delta2 = librosa.feature.delta(mfccs, order=2)
    combined = np.concatenate([mfccs, delta, delta2])
    # Shape: (60, n_frames); 3.0 s at 22050 Hz with hop_length=512 gives 130 frames
    return librosa.util.normalize(combined)
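The width of the feature matrix follows from the clip length and the hop size. A small arithmetic sketch, assuming librosa's default centered framing (which yields `1 + n_samples // hop_length` frames) and the parameter values used in this notebook:

```python
def n_frames(n_samples, hop_length=512):
    # With centered framing, the signal is padded so that frames start at multiples of hop_length
    return 1 + n_samples // hop_length


n_samples = int(3.0 * 22050)            # fixed clip length after padding/truncation
frames = n_frames(n_samples)
feature_shape = (3 * 20, frames)        # 20 MFCCs + 20 deltas + 20 delta-deltas
print(feature_shape)                    # (60, 130)
```

This agrees with the `(1440, 60, 130)` array shape reported when the features are loaded later in the notebook.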

In [8]:
def process_and_extract(row):
    try:
        y, sr = preprocessing_pipeline(row['filename'])
        mfccs = extract_mfcc(y, sr)
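        # Name the feature file after the source file: actor and emotion code alone
        # are shared by several recordings, which would overwrite each other
        stem = os.path.splitext(os.path.basename(row['filename']))[0]
        out_path = os.path.join(features_dir, f"{stem}.npy")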
        np.save(out_path, mfccs)
        return out_path
    except Exception as e:
        print(f"Error processing {row['filename']}: {e}")
        return None

df['feature_file'] = [process_and_extract(row) for _, row in tqdm(df.iterrows(), total=len(df))]
valid_df = df[df['feature_file'].notnull()].copy()
print(f'Processed {len(valid_df)} valid files')

100%|██████████| 1440/1440 [07:51<00:00,  3.06it/s]

Processed 1440 valid files
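Each feature array is written to a path derived from its row, so it is worth confirming that no two rows map to the same path; otherwise a later file silently overwrites an earlier one and several rows end up pointing at identical features. A minimal sketch of such a check, comparing a name built from actor and emotion code with one built from the full file stem. The helper names and the two example paths are hypothetical.

```python
import os
from collections import Counter


def actor_emotion_name(path):
    """Output name built only from the actor and emotion fields."""
    parts = os.path.splitext(os.path.basename(path))[0].split("-")
    return f"{int(parts[-1])}_{int(parts[2])}.npy"


def stem_name(path):
    """Output name built from the complete source file name."""
    return os.path.splitext(os.path.basename(path))[0] + ".npy"


def find_collisions(names):
    """Return every output name that is produced more than once."""
    return {name: n for name, n in Counter(names).items() if n > 1}


# Two hypothetical recordings: same actor and emotion, different statement
sources = ["Actor_08/03-01-02-01-01-01-08.wav",
           "Actor_08/03-01-02-01-02-01-08.wav"]

collisions_short = find_collisions([actor_emotion_name(s) for s in sources])
collisions_stem = find_collisions([stem_name(s) for s in sources])
print(collisions_short, collisions_stem)
```

With 24 actors and 8 emotion codes there are at most 192 distinct actor/emotion pairs, far fewer than the 1440 audio files, so a name based on those two fields alone cannot be unique.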





In [9]:
!pip install tensorflow



In [10]:
import tensorflow as tf

In [11]:
X = []
for path in tqdm(valid_df['feature_file'], desc='Loading features'):
    try:
        mfcc = np.load(path)
        X.append(mfcc)
    except Exception as e:
        print(f"Error loading {path}: {e}")

X = np.array(X)
print(f'Loaded X shape: {X.shape}')

y_labels = valid_df['emotion_code'].values - 1  # 0-indexed
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y_labels)
y_onehot = tf.keras.utils.to_categorical(y_encoded, num_classes=8)

X = X.astype(np.float32)
y_onehot = y_onehot.astype(np.float32)

from sklearn.model_selection import train_test_split
X_temp, X_test, y_temp, y_test = train_test_split(X, y_onehot, test_size=0.1, random_state=42, stratify=np.argmax(y_onehot, axis=1))
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.111, random_state=42, stratify=np.argmax(y_temp, axis=1))

X_train = np.expand_dims(X_train, -1)
X_val = np.expand_dims(X_val, -1)
X_test = np.expand_dims(X_test, -1)

print(f'Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}')

import pickle
with open(processed_data_path, 'wb') as f:
    pickle.dump({'X_train': X_train, 'y_train': y_train, 'X_val': X_val, 'y_val': y_val, 'X_test': X_test, 'y_test': y_test, 'label_encoder': le}, f)
print('Processed data saved')

Loading features: 100%|██████████| 1440/1440 [00:09<00:00, 159.33it/s]


Loaded X shape: (1440, 60, 130)
Train: (1152, 60, 130, 1), Val: (144, 60, 130, 1), Test: (144, 60, 130, 1)
Processed data saved
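The two-stage split above is meant to produce roughly 80/10/10 partitions: 10% is held out for testing, then 11.1% of the remainder becomes the validation set. The arithmetic can be sketched as follows, assuming a fractional test size is rounded up to a whole number of samples, which is consistent with the shapes printed above. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_frac=0.1, val_frac_of_rest=0.111):
    """Return (n_train, n_val, n_test) for a two-stage hold-out split."""
    n_test = math.ceil(test_frac * n_samples)
    n_rest = n_samples - n_test
    n_val = math.ceil(val_frac_of_rest * n_rest)
    return n_rest - n_val, n_val, n_test


sizes = split_sizes(1440)
print(sizes)   # (1152, 144, 144)
```

The second fraction is 0.111 rather than 0.1 because it applies to the 1296 samples left after the test split, not to the full 1440.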


In [12]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

class AudioAugGenerator(ImageDataGenerator):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def flow(self, x, y, **kwargs):
        batches = super().flow(x, y, **kwargs)
        while True:
            batch_x, batch_y = next(batches)
            for i in range(batch_x.shape[0]):
                if np.random.rand() < 0.5:  # Freq mask
                    f = np.random.randint(0, batch_x.shape[1] // 10)
                    f0 = np.random.randint(0, batch_x.shape[1] - f)
                    batch_x[i, f0:f0+f, :, :] = 0
                if np.random.rand() < 0.5:  # Time mask
                    t = np.random.randint(0, batch_x.shape[2] // 10)
                    t0 = np.random.randint(0, batch_x.shape[2] - t)
                    batch_x[i, :, t0:t0+t, :] = 0
            yield batch_x, batch_y

datagen = AudioAugGenerator(rotation_range=5, width_shift_range=0.05, height_shift_range=0.05, zoom_range=0.05, fill_mode='constant', cval=0)
train_generator = datagen.flow(X_train, y_train, batch_size=32)
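The frequency and time masking inside `flow` can also be expressed as a standalone function on a single feature matrix, which makes it easier to inspect and test in isolation. The sketch below is an assumption-laden variant rather than the notebook's code: `mask_spectrogram` is a hypothetical name, it works on a copy instead of in place, and it always masks at least one row or column when a mask is applied.

```python
import numpy as np


def mask_spectrogram(x, rng, max_frac=0.1, p=0.5):
    """Zero one random band of rows (frequency) and one of columns (time).

    x has shape (n_features, n_frames); each mask is applied with probability p.
    """
    x = x.copy()                                   # leave the caller's array untouched
    n_feat, n_time = x.shape

    if rng.random() < p:                           # frequency mask
        max_f = max(1, int(n_feat * max_frac))
        f = int(rng.integers(1, max_f + 1))        # band height in [1, max_f]
        f0 = int(rng.integers(0, n_feat - f + 1))  # first masked row
        x[f0:f0 + f, :] = 0.0

    if rng.random() < p:                           # time mask
        max_t = max(1, int(n_time * max_frac))
        t = int(rng.integers(1, max_t + 1))        # band width in [1, max_t]
        t0 = int(rng.integers(0, n_time - t + 1))  # first masked column
        x[:, t0:t0 + t] = 0.0

    return x


rng = np.random.default_rng(0)
features = np.ones((60, 130), dtype=np.float32)    # stand-in for one MFCC matrix
masked = mask_spectrogram(features, rng, p=1.0)    # p=1.0 forces both masks
print(masked.shape, int((masked == 0).sum()))
```

Passing an explicit `rng` keeps the augmentation reproducible, which helps when comparing training runs.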

In [13]:
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, BatchNormalization, MaxPooling2D, Dropout, GlobalAveragePooling2D, Dense
from tensorflow.keras.regularizers import l2

def create_model(input_shape=X_train.shape[1:], num_classes=8):
    model = tf.keras.Sequential([
        Conv2D(32, (3,3), activation='relu', input_shape=input_shape, kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Conv2D(32, (3,3), activation='relu', kernel_regularizer=l2(0.001)),
        MaxPooling2D((2,2)),
        Dropout(0.3),

        Conv2D(64, (3,3), activation='relu', kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Conv2D(64, (3,3), activation='relu', kernel_regularizer=l2(0.001)),
        MaxPooling2D((2,2)),
        Dropout(0.3),

        GlobalAveragePooling2D(),
        Dense(128, activation='relu', kernel_regularizer=l2(0.001)),
        Dropout(0.5),
        Dense(num_classes, activation='softmax')
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

model = create_model()
model.summary()
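Since the summary output is not shown, the tensor shapes through the network can be traced by hand. The sketch below assumes the Keras defaults used in the model definition: `'valid'` padding and stride 1 for `Conv2D`, and a stride equal to the pool size for `MaxPooling2D`. Batch normalization and dropout leave shapes unchanged, so they are omitted.

```python
def conv_valid(shape, filters, kernel=3):
    """Output shape of a stride-1 convolution without padding."""
    h, w, _ = shape
    return (h - kernel + 1, w - kernel + 1, filters)


def max_pool(shape, pool=2):
    """Output shape of non-overlapping max pooling (partial windows are dropped)."""
    h, w, c = shape
    return (h // pool, w // pool, c)


shape = (60, 130, 1)                 # one MFCC matrix with a channel axis
trace = [shape]
for filters in (32, 64):             # the two convolutional blocks
    shape = conv_valid(shape, filters)
    trace.append(shape)
    shape = conv_valid(shape, filters)
    trace.append(shape)
    shape = max_pool(shape)
    trace.append(shape)

pooled_features = shape[-1]          # GlobalAveragePooling2D keeps only the channel axis
print(trace)
print(pooled_features)
```

After global average pooling, the dense head therefore sees a 64-dimensional vector per clip, whatever the clip length.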



In [14]:
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

class_weights = compute_class_weight('balanced', classes=np.unique(np.argmax(y_train, axis=1)), y=np.argmax(y_train, axis=1))
class_weight_dict = dict(enumerate(class_weights))

early_stop = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True, min_delta=0.001)
lr_reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6)

history = model.fit(
    train_generator,
    steps_per_epoch=len(X_train) // 32,
    validation_data=(X_val, y_val),
    epochs=100,
    class_weight=class_weight_dict,
    callbacks=[early_stop, lr_reduce],
    verbose=1
)

ValueError: Argument `class_weight` is not supported for Python generator inputs. Received: class_weight={0: np.float64(1.894736842105263), 1: np.float64(0.935064935064935), 2: np.float64(0.935064935064935), 3: np.float64(0.935064935064935), 4: np.float64(0.9411764705882353), 5: np.float64(0.935064935064935), 6: np.float64(0.935064935064935), 7: np.float64(0.9411764705882353)}
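The `ValueError` arises because `class_weight` cannot be combined with a Python generator as training input. One possible workaround, sketched here under assumptions and not run against this notebook, is to convert the class weights into per-sample weights and let the generator yield `(x, y, sample_weight)` triples, a batch format that Keras `fit` accepts. `balanced_class_weights` and `with_sample_weights` are hypothetical helpers, and the toy batch stands in for the real training data.

```python
import numpy as np


def balanced_class_weights(labels):
    """The 'balanced' heuristic: n_samples / (n_classes * samples_in_class)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


def with_sample_weights(batches, class_weight):
    """Wrap an (x, y_onehot) batch iterator so it yields (x, y_onehot, sample_weight)."""
    for batch_x, batch_y in batches:
        labels = np.argmax(batch_y, axis=1)
        weights = np.array([class_weight[int(k)] for k in labels], dtype=np.float32)
        yield batch_x, batch_y, weights


# Toy data: class 0 is under-represented, as 'neutral' is in the weights printed above
toy_labels = np.array([0, 1, 1, 1])
toy_weights = balanced_class_weights(toy_labels)


def toy_batches():
    x = np.zeros((4, 60, 130, 1), dtype=np.float32)
    y = np.eye(2, dtype=np.float32)[toy_labels]
    yield x, y


batch_x, batch_y, batch_w = next(with_sample_weights(toy_batches(), toy_weights))
print(toy_weights, batch_w)
```

In the notebook, this would amount to passing `with_sample_weights(train_generator, class_weight_dict)` to `model.fit` and dropping the `class_weight` argument. The weights listed in the error message are consistent with the same heuristic: for example, 1152 / (8 × 76) ≈ 1.895 for the class that has about half as many training samples as the others.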