<a id="top2"></a>
# Data cleaning and features extraction

In this notebook, the audio samples will be processed and some features will be extracted

- [1. Data cleaning](#2_a)
- [2. Feature extraction](#2_b)
    * [Fundamental frequencies](#2_b1)
    * [Mixed features](#2_b2)
    * [MFCCs](#2_b3)
- [3. Merge and export data](#3_a)

In [1]:
# import libraries
import pandas as pd
import numpy as np
import random
from glob import glob
import os
import librosa
import noisereduce as nr
from scipy.io import wavfile
import warnings
warnings.filterwarnings("ignore")

## <a id='2_a'>1. Data cleaning</a>

First, let's import the dataset previously exported:

In [2]:
# Import the dataset and verify that the upload is successful
df = pd.read_csv('emotions_data.csv')
print('The dataset has {} audio files'.format(df.shape[0]))
df.sample()

The dataset has 16783 audio files


Unnamed: 0,path,filename,dataset,duration,sample_rate,gender,age,emotion
3043,../Audio files/CREMA-D/1038_ITH_ANG_XX.wav,1038_ITH_ANG_XX,CREMA-D,2.869,16000,male,21,angry


Audio data can be processed in many ways. As noted in the preliminary exploration of the data, the main technical problems with our files are:
- The files have **different sample rates**
- All files tested **start and end with some silence**
- **Some samples are noisier** than others

For the first problem I can resample the files, and they will all be at 16000 Hz, which is the lowest frequency among the samples, the one that is most present in their total number, and an acceptable value to represent the human voice.

In the second case I will implement a `trim` function, which in effect cuts initial and final silence from an audio signal.

For the noise the problem is more difficult, since the noise removal operation also affects the quality of the features to be extracted. For this reason I will use the module `noisereduce` ([link](https://pypi.org/project/noisereduce/)) and apply a light reduction of 10% for the stationary noise on all the samples. The reason for such reduction is due to the fact that this operation can also sensibly lower the quality of the signal thus removing also meaningful information of the audio for the next steps.

In [3]:
# Folder where to store the cleaned audio files
path = '/Users/greips/Audio files/cleaned_samples/'

# Function to clean the samples based on dataset
def clean_files(dataset): # Select files linked to a dataset
    
    df_clean = df[df['dataset'] == dataset].reset_index(drop=True)
    
    for i in range(0, df_clean.shape[0]):
        
        # load audio file at a sample rate of 16000 Hz:
        y, sr = librosa.load(df_clean.path[i], sr=16000)
    
        # Trim signal at a level of 20 db
        y_trim, _ = librosa.effects.trim(y, top_db=20)
        
        # Remove 25% noise from audio samples
        y_noise_rem = nr.reduce_noise(y=y_trim, sr=sr, prop_decrease=0.1, stationary=True)
        
        # Rename new file adding '_cleaned.wav' and put it in the new folder
        name = (os.path.join(path)+df_clean.filename[i]+'_cleaned.wav')
    
        #  Save output in a wav file in a new folder
        wavfile.write(name, sr, y_noise_rem)

In [4]:
%%time
clean_files('CREMA-D')

CPU times: user 3min 26s, sys: 46.5 s, total: 4min 12s
Wall time: 4min 19s


In [5]:
%%time
clean_files('RAVDESS')

CPU times: user 2min 42s, sys: 8.34 s, total: 2min 51s
Wall time: 2min 54s


In [6]:
%%time
clean_files('SAVEE')

CPU times: user 1min 14s, sys: 4.53 s, total: 1min 19s
Wall time: 1min 20s


In [7]:
%%time
clean_files('TESS')

CPU times: user 3min 8s, sys: 20.7 s, total: 3min 29s
Wall time: 3min 35s


In [8]:
%%time
clean_files('EmoV_DB')

CPU times: user 4min 26s, sys: 35.2 s, total: 5min 1s
Wall time: 5min 11s


In [9]:
%%time
clean_files('JL-Corpus')

CPU times: user 1min 30s, sys: 7.04 s, total: 1min 37s
Wall time: 1min 39s


In [10]:
# saving the new cleaned files path in a variable
cleaned_files = sorted(filter(os.path.isfile, glob('../Audio files/cleaned_samples/*.wav')),key=os.path.getmtime)

## <a id='2_b'>2. Feature extraction </a>

In sound analysis, there is a lot of data that can be extracted from audio signals. For this task I exctracted some information from the samples using `librosa`, and analyse them in relationship with the target feature. I obtained the mean and variance for each value. Let's see a brief description of the attributes chosen:

- **Fundamental frequency (f0)**: The lowest frequency produced by the voice. In this case I extracted the mean (`f0_mean`), median (`f0_median`), standard deviation (`f0_std`), min and max frequency (`f0_0`, `f0_100`), plus 25% and 75%-percentile value of the fundamental frequency (`f0_25`, `f0_75`) [link for details](https://en.wikipedia.org/wiki/Fundamental_frequency)
- **Zero-crossing rate (zcr)**: is the rate at which a signal changes from positive to zero to negative or from negative to zero to positive, normally useful to detect percussive sounds [link for details](https://en.wikipedia.org/wiki/Zero-crossing_rate)
- **Spectral centroid**: is a measure used to characterise a spectrum. It indicates where the center of mass of the spectrum is located. Perceptually, it has a robust connection with the impression of brightness of a sound [link for details](https://en.wikipedia.org/wiki/Spectral_centroid)
- **Spectral contrast**: measure the clarity versus the broad noise of each band in the spectrum [link for details](https://librosa.org/doc/main/generated/librosa.feature.spectral_contrast.html)
- **Spectral flatness**: is a measure to quantify how much noise-like a sound is [link for details](https://en.wikipedia.org/wiki/Spectral_flatness)
- **Harmony**: measure harmonic elements from an audio time-series [link for details](https://en.wikipedia.org/wiki/Harmonic_series_(music))
- **Root-Mean-Square (rms)**: is a value that measure the energy of the signal [link for details](https://en.wikipedia.org/wiki/Audio_power#Continuous_power_and_%22RMS_power%22)
- **Chroma Feature**: compute a chromagram from a waveform or power spectrogram [link for details](https://en.wikipedia.org/wiki/Chroma_feature)
- **Chroma Cqt**: compute the constant-Q chromagram [link for details](https://en.wikipedia.org/wiki/Constant-Q_transform)
- **Chroma Cens**: computes the chroma variant “Chroma Energy Normalized” (CENS) [link for details](https://librosa.org/doc/main/generated/librosa.feature.chroma_cens.html)
- **Rolloff**: the spectral roll-off point is the fraction of bins in the power spectrum at which 85% of the power is at lower frequencies [link for details](https://en.wikipedia.org/wiki/Roll-off)
- **Mel-Frequency Cepstrum Components (MFCCs)**: they are the coefficients of the Mel cepstrum which is computed on the Mel-bands (scaled to human ear) instead of the Fourier spectrum. In particular I got 60 features, the mean and variance of each of the 30 extracted components [link for details](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)

### <a id='2_b1'>Fundamental frequencies extraction</a>

`librosa` offers the probabilistic YIN algorithm for extracting the fundamental frequency, which is a modification of the YIN algorithm for estimating F0. Although this method is much slower than the original, I have found it to be more accurate in finding right values.

In [11]:
f0_mean, f0_median, f0_std, f0_0, f0_25, f0_75, f0_100 = [], [], [], [], [], [], []

# Finction to extract f0 values
def get_f0(file):
    # load audio file:
    y, sr = librosa.load(file, sr=16000)
    
    # Extract fundamental frequency using a probabilistic approach
    f_zero, _, _ = librosa.pyin(y, sr=sr, fmin=50, fmax=1500, frame_length=1024)

    f0_mean.append(np.nanmean(f_zero))             # f0 mean
    f0_median.append(np.nanmedian(f_zero))         # f0 median
    f0_std.append(np.nanstd(f_zero))               # f0 standar deviation
    f0_0.append(np.nanpercentile(f_zero, 0))       # f0 min
    f0_25.append(np.nanpercentile(f_zero, 25))     # f0 25%-percentile value
    f0_75.append(np.nanpercentile(f_zero, 75))     # f0 75%-percentile value
    f0_100.append(np.nanpercentile(f_zero, 100))   # f0 max

In [12]:
%%time

# Apply function to cleaned files
for file in cleaned_files:
    get_f0(file)

CPU times: user 7h 28min 24s, sys: 17min 47s, total: 7h 46min 12s
Wall time: 7h 46min 24s


In [13]:
df_f0 = pd.DataFrame()
df_f0['f0_mean'] = f0_mean
df_f0['f0_median'] = f0_median
df_f0['f0_std'] = f0_std
df_f0['f0_0'] = f0_0
df_f0['f0_25'] = f0_25
df_f0['f0_75'] = f0_75
df_f0['f0_100'] = f0_100
df_f0.sample()

Unnamed: 0,f0_mean,f0_median,f0_std,f0_0,f0_25,f0_75,f0_100
12430,295.935383,228.416468,273.725123,155.114476,215.596437,237.841423,1475.70591


### <a id='2_b2'>Other variables</a>

In [14]:
%%time

# declare variables where to store the extracted feaures
zcr_mean, zcr_var, spectral_centroid_mean, spectral_centroid_var, rms_mean, rms_var, chroma_stft_mean, chroma_stft_var = [], [], [], [], [], [], [], []
rolloff_mean, rolloff_var, spectral_bandwidth_mean, spectral_bandwidth_var, harmony_mean, harmony_var, chroma_cqt_mean, chroma_cqt_var = [], [], [], [], [], [], [], []
spectral_contrast_mean, spectral_contrast_var, spectral_flatness_mean, spectral_flatness_var, chroma_cens_mean, chroma_cens_var = [], [], [], [], [], []

# extract features
for file in cleaned_files:
    # load audio file:
    y, sr = librosa.load(file, sr=16000)
    
    # Zero crossing rate
    zcr_mean.append(np.mean(librosa.feature.zero_crossing_rate(y)))
    zcr_var.append(np.var(librosa.feature.zero_crossing_rate(y)))
    
    # Spectral centroid
    spectral_centroid_mean.append(np.mean(librosa.feature.spectral_centroid(y)))
    spectral_centroid_var.append(np.var(librosa.feature.spectral_centroid(y)))
    
    # Spectral contrast
    spectral_contrast_mean.append(np.mean(librosa.feature.spectral_contrast(y)))
    spectral_contrast_var.append(np.var(librosa.feature.spectral_contrast(y)))
    
    # Spectral flatness
    spectral_flatness_mean.append(np.mean(librosa.feature.spectral_flatness(y)))
    spectral_flatness_var.append(np.var(librosa.feature.spectral_flatness(y)))
    
    # Rms mean and variance
    rms_mean.append(np.mean(librosa.feature.rms(y=y)))               
    rms_var.append(np.var(librosa.feature.rms(y=y)))
    
    # Chroma mean and variance
    chroma_stft_mean.append(np.mean(librosa.feature.chroma_stft(y, sr)))                   
    chroma_stft_var.append(np.var(librosa.feature.chroma_stft(y, sr)))
    
    #Chroma cqt mean and variance
    chroma_cqt_mean.append(np.mean(librosa.feature.chroma_cqt(y, sr)))                   
    chroma_cqt_var.append(np.var(librosa.feature.chroma_cqt(y, sr)))
    
    #Chroma cens mean and variance
    chroma_cens_mean.append(np.mean(librosa.feature.chroma_cens(y, sr)))                   
    chroma_cens_var.append(np.var(librosa.feature.chroma_cens(y, sr)))
    
    # Rolloff mean and variance
    rolloff_mean.append(np.mean(librosa.feature.spectral_rolloff(y,sr)))
    rolloff_var.append(np.var(librosa.feature.spectral_rolloff(y,sr)))
    
    # Spectral bandwidth mean and variance
    spectral_bandwidth_mean.append(np.mean(librosa.feature.spectral_bandwidth(y,sr)))
    spectral_bandwidth_var.append(np.var(librosa.feature.spectral_bandwidth(y,sr)))
    
    # Harmony mean and variance
    harmony_mean.append(np.mean(librosa.effects.harmonic(y)))
    harmony_var.append(np.var(librosa.effects.harmonic(y)))

CPU times: user 5h 3min 52s, sys: 33min 3s, total: 5h 36min 56s
Wall time: 2h 47min 13s


In [15]:
# add new features to a new dataframe
df_features = pd.DataFrame()
df_features['zcr_mean'] = zcr_mean
df_features['zcr_var'] = zcr_var
df_features['spectral_centroid_mean'] = spectral_centroid_mean
df_features['spectral_centroid_var'] = spectral_centroid_var
df_features['spectral_contrast_mean'] = spectral_contrast_mean
df_features['spectral_contrast_var'] = spectral_contrast_var
df_features['spectral_flatness_mean'] = spectral_flatness_mean
df_features['spectral_flatness_var'] = spectral_flatness_var
df_features['rms_mean'] = rms_mean
df_features['rms_var'] = rms_var
df_features['chroma_stft_mean'] = chroma_stft_mean
df_features['chroma_stft_var'] = chroma_stft_var
df_features['chroma_cqt_mean'] = chroma_cqt_mean
df_features['chroma_cqt_var'] = chroma_cqt_var
df_features['chroma_cens_mean'] = chroma_cens_mean
df_features['chroma_cens_var'] = chroma_cens_var
df_features['spectral_bandwidth_mean'] = spectral_bandwidth_mean
df_features['spectral_bandwidth_var'] = spectral_bandwidth_var
df_features['rolloff_mean'] = rolloff_mean
df_features['rolloff_var'] = rolloff_var
df_features['harmony_mean'] = harmony_mean
df_features['harmony_var'] = harmony_var
# Check a random row
df_features.sample()

Unnamed: 0,zcr_mean,zcr_var,spectral_centroid_mean,spectral_centroid_var,spectral_contrast_mean,spectral_contrast_var,spectral_flatness_mean,spectral_flatness_var,rms_mean,rms_var,...,chroma_cqt_mean,chroma_cqt_var,chroma_cens_mean,chroma_cens_var,spectral_bandwidth_mean,spectral_bandwidth_var,rolloff_mean,rolloff_var,harmony_mean,harmony_var
15443,0.100789,0.007527,2038.675889,1702130.0,20.540552,37.336241,0.027502,0.003225,0.107086,0.006086,...,0.359874,0.076113,0.244829,0.023392,1490.425241,300715.401892,2864.853896,3368736.0,8.1e-05,0.006406


### <a id='2_b3'>MFCCs</a>

In [16]:
# function to extract mean and variance of Mel-Frequency Cepstrum Components (MFCCs)
def extract_mfcc(file):
    
    mfcc_mean, mfcc_var = [], []
    
    # load audio file:
    y, sr = librosa.load(file, sr=16000)
    
    # get 20 mean values
    mfcc_mean.append(np.mean(librosa.feature.mfcc(y=y, sr=sr, fmin=50, n_mfcc=30).T, axis=0))
    
    # get 20 variance values
    mfcc_var.append(np.var(librosa.feature.mfcc(y=y, sr=sr, fmin=50, n_mfcc=30).T, axis=0))
    
    return np.hstack((mfcc_mean, mfcc_var))[0]

# apply function above to all files and store results in a list
extracted_mfcc = []

for file in cleaned_files:
    extracted_mfcc.append(extract_mfcc(file))

# create column names for a dataframe
name_mfcc_mean, name_mfcc_var = [], []

for i in range(0, 30):
    name_mfcc_mean.append('mfcc'+str(i+1)+'_mean')   # mfcc1_mean, mfcc2_mean, ...
    name_mfcc_var.append('mfcc'+str(i+1)+'_var')     # mfcc1_var, mfcc2_var, ...

name_mfcc = name_mfcc_mean + name_mfcc_var           # concatenate mfcc mean and variance names

# finally put all mfcc values in a dataframe
df_mfcc = pd.DataFrame(extracted_mfcc, columns = name_mfcc)
df_mfcc.sample()      # print one sample

Unnamed: 0,mfcc1_mean,mfcc2_mean,mfcc3_mean,mfcc4_mean,mfcc5_mean,mfcc6_mean,mfcc7_mean,mfcc8_mean,mfcc9_mean,mfcc10_mean,...,mfcc21_var,mfcc22_var,mfcc23_var,mfcc24_var,mfcc25_var,mfcc26_var,mfcc27_var,mfcc28_var,mfcc29_var,mfcc30_var
12577,-220.369965,81.935516,-23.836346,-0.899617,0.663118,-2.636038,-16.423683,-13.724349,-8.89509,-11.408977,...,58.321266,46.76601,31.269527,34.102951,34.067581,79.207108,79.191185,109.316628,123.586655,142.036484


## <a id='3_a'>Merge and export all the extracted features<a>

As it turned out, some operations used to take a long time to be done.
    
Finally, I can merge all the data into one dataframe and export it to perform the EDA. I can also remove the `sample_rate` column, now that the samples frequency is the same for each cleaned audio file (16000 Hz).

In [17]:
# Merge dataframes and print few samples
data_features = pd.concat([df, df_f0, df_features, df_mfcc], axis=1)

# Remove `sample_rate` column
data_features = data_features.drop(['sample_rate'], axis=1)

# Check one random row
data_features.sample()

Unnamed: 0,path,filename,dataset,duration,gender,age,emotion,f0_mean,f0_median,f0_std,...,mfcc21_var,mfcc22_var,mfcc23_var,mfcc24_var,mfcc25_var,mfcc26_var,mfcc27_var,mfcc28_var,mfcc29_var,mfcc30_var
793,../Audio files/CREMA-D/1010_TIE_NEU_XX.wav,1010_TIE_NEU_XX,CREMA-D,2.636,female,27,neutral,190.633147,196.56412,41.089922,...,20.663019,18.28289,22.221407,24.661129,40.810772,72.538109,50.762707,33.290585,70.890846,104.620483


In [18]:
# write csv
data_features.to_csv('emotions_data_features.csv', index=False)

<br>[Back to top](#top2)