# Capstone Project: Audio Based Emotion Classifier - RNN with Attention Models

After the first iteration of this project, I wanted to see how recurrent neural networks could be used to make predictions from audio data. I came across a repository that details the architecture of a model combining convolutional layers with recurrent layers and attention for the purpose of recognizing speech commands. The attention layer allows the model to focus on certain parts of the audio, such as the transitions between phonemes. After modifying the architecture to suit the structure of my data, I trained this model on four sets of features: 80 mel-frequency cepstral coefficients (MFCCs); 80 gammatone-frequency cepstral coefficients (GFCCs); 60 linear-frequency cepstral coefficients (LFCCs) together with the zero-crossing rate, spectral centroid, spectral bandwidth and RMS for each time step; and finally a fusion of all of these features concatenated at every time step. The idea of using LFCC features and a fusion of features came from a research paper on using RNNs for audio scene classification, which has been a very important resource for me recently.

These two resources can be found here:

__https://github.com/douglas125/SpeechCmdRecognition__

__https://arxiv.org/pdf/1703.04770.pdf__

## Set Path to Project Folder

In [40]:
# Set PATH to 'Classifying Emotions With Audio' directory
# Make sure this folder contains 'Data Sets' with TESS and SAVEE inside, as well as the 'Models' Folder
PATH = '/Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Classifying Emotions With Audio'

## Import libraries

In [3]:
# General Imports (standard library and third-party)
import csv
import os
import numpy as np
import pandas as pd
import re
import librosa
import librosa.display  # imported as a submodule so the name 'display' is not shadowed by IPython's display below
import IPython.display as ipd
from IPython.display import display
import time
import pickle
import seaborn as sn
import matplotlib.pyplot as plt
import sounddevice as sd
import math
from scipy.io.wavfile import write

#Keras Imports
from tensorflow.keras.models import Sequential, load_model, Model
from tensorflow.keras.layers import (Dense, 
                                     Conv2D, 
                                     MaxPooling1D, 
                                     MaxPooling2D, 
                                     Flatten, 
                                     Dense, 
                                     Dropout,
                                     Conv1D, 
                                     Activation, 
                                     BatchNormalization, 
                                     AveragePooling1D,
                                     MaxPool1D,
                                     GlobalMaxPool2D,
                                     LSTM,
                                     Bidirectional,
                                     Embedding,
                                     GRU,
                                     SimpleRNN
                                    )
from tensorflow.keras import layers as L
from tensorflow.keras import backend as K

from tensorflow.keras.regularizers import l2, L1L2

from tensorflow.keras.callbacks import (EarlyStopping, 
                                        LearningRateScheduler,
                                        ModelCheckpoint, 
                                        TensorBoard, 
                                        ReduceLROnPlateau)

from tensorflow.keras import optimizers
from tensorflow.keras import callbacks

from tensorflow.keras.utils import to_categorical

# SKLearn Imports
from sklearn.metrics import (confusion_matrix, 
                             accuracy_score, 
                             classification_report)

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.utils import resample
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, RandomTreesEmbedding
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, plot_confusion_matrix

from kapre.time_frequency import Melspectrogram, Spectrogram
from kapre.utils import Normalization2D




## Feature Extraction

### Extracting 80 MFCCs

In [3]:
# Loading the balanced dataframe containing information about each audio file.
os.chdir(os.path.join(PATH,'DataFrames'))
audiodata_balanced = pd.read_csv("Balanced Audio Data Iteration #2.csv", index_col = False)
audiodata_balanced

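| Unnamed: 0 | Filename | Dataset | Sex | Speaker | Duration | Emotion |
|---|---|---|---|---|---|---|
| 0 | h05.wav | SAVEE | M | KL | 3.38 | 5 |
| 1 | sa04.wav | SAVEE | M | KL | 4.73 | 3 |
| 2 | h11.wav | SAVEE | M | KL | 5.53 | 5 |
| 3 | sa10.wav | SAVEE | M | KL | 4.15 | 3 |
| 4 | d08.wav | SAVEE | M | KL | 3.29 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 835 | YAF_tough_happy.wav | TESS | F | YAF | 1.73 | 5 |
| 836 | OAF_late_ps.wav | TESS | F | OAF | 1.96 | 6 |
| 837 | OAF_fat_ps.wav | TESS | F | OAF | 1.98 | 6 |
| 838 | OAF_life_sad.wav | TESS | F | OAF | 2.56 | 3 |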


In [12]:
# This script will create a 3D array of MFCCs for each audio file.

os.chdir(os.path.join(PATH,'Data Sets'))
df = audiodata_balanced

# Specify number of MFCCs, hop length and sampling rate here.

n_mfcc = 80
hop_length = 512
sr = 44100


# A function to access the file structure of the TESS dataset.
# It maps a numeric emotion label back to the folder-name suffix used by TESS.

EMOTION_NAMES = {0: 'angry', 1: 'disgust', 2: 'fear', 3: 'sad',
                 4: 'neutral', 5: 'happy', 6: 'pleasant_surprised'}

def emotion_reverse(emotion):
    # Raises a KeyError for labels outside 0-6 instead of silently returning a placeholder.
    return EMOTION_NAMES[emotion]

# Our array will be sized so that each observation corresponds to 1 second of audio. The number of time frames per observation is determined by the specified hop length.
# We'll start by creating an empty array of the appropriate size.

df = audiodata_balanced

# Loading a sample audiofile.
i=0

filename = df.iloc[i]['Filename']
dataset = df.iloc[i]['Dataset']
speaker = df.iloc[i]['Speaker']
emotion = df.iloc[i]['Emotion']

if (dataset=='SAVEE'):
    y, sr = librosa.load(os.path.join(dataset, speaker, filename), sr = sr)
elif(dataset=='TESS'):
    emotion = emotion_reverse(emotion)
    speaker_emotion = (f'{speaker}_{emotion}')
    y, sr = librosa.load(os.path.join(dataset, speaker_emotion, filename), sr = sr)

# Getting the duration of the sample and padding with zeros so that the length is an integer.  

duration = librosa.get_duration(y=y, sr=sr)
pieces = math.ceil(duration)

length = sr*pieces

pad_length = length - len(y)
y = np.pad(y,((0,pad_length)),'constant')

# Taking the first 1 second chunk from the audiofile and extracting MFCCs from that chunk.

start=0
end=sr

chunk = y[start:end]

mfccs = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)

# Using the MFCCs from the 1 second chunk to create an empty array.

chunk_MFCC_array = np.empty(shape =(1,mfccs.shape[0],mfccs.shape[1]))
chunk_MFCC_array[i]=mfccs

c=0
p=1

# Creating a new dataframe to track unique identifiers, and the number of 1 second pieces generated from each audiofile.

index_df = pd.DataFrame(columns = df.columns)
index_df.insert(0,'ID',0)
index_df.insert(1,'Pieces',0)
index_df = index_df.append(df.iloc[i])
index_df.iloc[c,0] = int(i)
index_df.iloc[c,1] = (f'{p}/{pieces}')

c+=1
p+=1

# Adding the remaining 1 second pieces from the sample audiofile to the two new dataframes.

for n in range(pieces-1):
    
    start +=sr
    end +=sr
    
    chunk = y[start:end]
    mfccs = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    
    mfccs = mfccs.reshape((1,mfccs.shape[0],mfccs.shape[1]))
    
    chunk_MFCC_array = np.append(chunk_MFCC_array,mfccs,axis=0)
    
    index_df = index_df.append(df.iloc[i])
    index_df.iloc[c,0] = int(i)
    index_df.iloc[c,1] = (f'{p}/{pieces}')
    
    c+=1
    p+=1
    
# Slicing all remaining audiofiles into 1 second pieces, extracting each one's set of MFCCs, then adding each set to our array of MFCCs, as well as updating our index dataframe.

for i in range(1, len(df)):
    
    filename = df.iloc[i]['Filename']
    dataset = df.iloc[i]['Dataset']
    speaker = df.iloc[i]['Speaker']
    emotion = df.iloc[i]['Emotion']
    
    if (dataset=='SAVEE'):
        
        y, sr = librosa.load(os.path.join(dataset, speaker, filename), sr = sr)

    elif(dataset=='TESS'):
        emotion = emotion_reverse(emotion)
        speaker_emotion = (f'{speaker}_{emotion}')
        
        y, sr = librosa.load(os.path.join(dataset, speaker_emotion, filename), sr = sr)
        
    duration = librosa.get_duration(y=y, sr=sr)
        
    pieces = math.ceil(duration)
    
    length = sr*pieces

    pad_length = length - len(y)
    
    y = np.pad(y,((0,pad_length)),'constant')

    start=0
    end=sr
    p=1

    for n in range(pieces):

        chunk = y[start:end]
        mfccs = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)

        mfccs = mfccs.reshape((1,mfccs.shape[0],mfccs.shape[1]))

        chunk_MFCC_array = np.append(chunk_MFCC_array,mfccs,axis=0)

        index_df = index_df.append(df.iloc[i])
        index_df.iloc[c,0] = int(i)
        index_df.iloc[c,1] = (f'{p}/{pieces}')

        c+=1
        p+=1
        start +=sr
        end +=sr
        
        
        print('\r',f'{c} arrays added',end='')
    

# Saving the new dataframes.

os.chdir(os.path.join(PATH,'DataFrames'))
np.save('80_MFCC_array_1s_Chunks.npy', chunk_MFCC_array)
index_df.to_csv('1s_chunk_index_df.csv', index=False)


 2913 arrays added
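
The padding-and-slicing step used in the cell above can be isolated into a small helper. This is a minimal sketch rather than the notebook's exact code: `split_into_chunks` is a hypothetical name, it assumes a mono signal stored as a 1-D NumPy array, and it produces all pieces with a single reshape instead of growing an array with `np.append`. The last two lines show where the 87 time frames per chunk come from, assuming centered framing of a 1-second chunk.

```python
import math
import numpy as np

def split_into_chunks(y, sr):
    """Zero-pad a 1-D signal to a whole number of seconds and split it into 1-second rows."""
    pieces = max(1, math.ceil(len(y) / sr))            # number of 1-second pieces
    padded = np.pad(y, (0, pieces * sr - len(y)), mode='constant')
    return padded.reshape(pieces, sr)                  # shape: (pieces, sr)

# Synthetic signal of roughly 3.38 seconds at the sampling rate used in this notebook
sr = 44100
y = np.ones(int(3.38 * sr), dtype=np.float32)
chunks = split_into_chunks(y, sr)
print(chunks.shape)

# Frames per 1-second chunk with centered framing: 1 + sr // hop_length
hop_length = 512
frames_per_chunk = 1 + sr // hop_length
print(frames_per_chunk)
```

Collecting the per-chunk feature matrices in a Python list and calling `np.stack` once at the end would be a similar way to avoid repeated array copies.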

### Extracting 80 GFCCs

In [5]:
# This function will extract GFCCs from an audio file.
# Its structure mirrors librosa's MFCC implementation, with the mel filter bank replaced by spafe's gammatone filter bank.

import numpy as np
import scipy
import scipy.signal
import scipy.fftpack

from librosa import util
from librosa import filters
from librosa.util.exceptions import ParameterError

from librosa.core.time_frequency import fft_frequencies
from librosa.core.audio import zero_crossings, to_mono
from librosa.core.spectrum import power_to_db, _spectrogram
from librosa.core.constantq import cqt, hybrid_cqt
from librosa.core.pitch import estimate_tuning
from spafe.fbanks.gammatone_fbanks import gammatone_filter_banks
from spafe.fbanks.linear_fbanks import linear_filter_banks

def gfcc_test(y=None, sr=22050, S=None, n_gfcc=20, dct_type=2, norm='ortho', lifter=0, **kwargs):


    if S is None:
        S = power_to_db(gammaspectrogram(y=y, sr=sr, **kwargs))

    M = scipy.fftpack.dct(S, axis=0, type=dct_type, norm=norm)[:n_gfcc]

    if lifter > 0:
        M *= 1 + (lifter / 2) * np.sin(np.pi * np.arange(1, 1 + n_gfcc, dtype=M.dtype) / lifter)[:, np.newaxis]
        return M
    elif lifter == 0:
        return M
    else:
        raise ParameterError('GFCC lifter={} must be a non-negative number'.format(lifter))

        
        
def gammaspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,
                   win_length=None, window='hann', center=True, pad_mode='reflect',
                   power=1.0, **kwargs):


    S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power,
                            win_length=win_length, window=window, center=center,
                            pad_mode=pad_mode)


    gamma_basis = gammatone_filter_banks(fs=sr,nfft=n_fft,nfilts=128)

    return np.dot(gamma_basis, S)
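
Both `gfcc_test` and `lin_test` follow the same recipe: project a spectrogram onto a filter bank, move to a log scale, then apply a DCT along the filter axis and keep the first few coefficients. The sketch below shows only that last stage in plain NumPy on synthetic data. The filter-bank energies are random numbers, `dct_ii_ortho` is a hypothetical helper written for illustration, and the decibel conversion is a simplified stand-in for `power_to_db` (no reference level or clipping).

```python
import numpy as np

def dct_ii_ortho(S):
    """Orthonormal type-2 DCT along axis 0 of a (n_filters, n_frames) array."""
    n = S.shape[0]
    k = np.arange(n)[:, None]             # output coefficient index
    m = np.arange(n)[None, :]             # input (filter) index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)              # rescale the constant row so the basis is orthonormal
    return basis @ S

rng = np.random.default_rng(0)
energies = rng.uniform(0.1, 1.0, size=(128, 87))   # synthetic filter-bank energies
log_energies = 10.0 * np.log10(energies)            # simplified decibel scale
coeffs = dct_ii_ortho(log_energies)[:80]            # keep the first 80 coefficients
print(coeffs.shape)
```

Keeping only the leading coefficients retains the smooth shape of the log spectrum and discards fine detail, which is why the number of filters (128) can exceed the number of coefficients kept (80).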

In [6]:
# This script will create a 3D array of GFCCs for each audio file. It functions in almost exactly the same way as the code used to extract MFCCs per each 1 second slice of audio.

os.chdir(os.path.join(PATH,'DataFrames'))
audiodata_balanced = pd.read_csv("Balanced Audio Data Iteration #2.csv", index_col = False)
df = audiodata_balanced

# Specify number of GFCCs, hop length and sampling rate here.

n_gfcc = 80
hop_length = 512
sr = 44100

# A function to access the file structure of the TESS dataset.
# It maps a numeric emotion label back to the folder-name suffix used by TESS.

EMOTION_NAMES = {0: 'angry', 1: 'disgust', 2: 'fear', 3: 'sad',
                 4: 'neutral', 5: 'happy', 6: 'pleasant_surprised'}

def emotion_reverse(emotion):
    # Raises a KeyError for labels outside 0-6 instead of silently returning a placeholder.
    return EMOTION_NAMES[emotion]

# Our array will be sized so that each observation corresponds to 1 second of audio. The number of time frames per observation is determined by the specified hop length.
# We'll start by creating an empty array of the appropriate size.

df = audiodata_balanced

os.chdir(os.path.join(PATH,'Data Sets'))

# Loading a sample audiofile.

i=0
c=1 

filename = df.iloc[i]['Filename']
dataset = df.iloc[i]['Dataset']
speaker = df.iloc[i]['Speaker']
emotion = df.iloc[i]['Emotion']

if (dataset=='SAVEE'):
    y, sr = librosa.load(os.path.join(dataset, speaker, filename), sr = sr)
elif(dataset=='TESS'):
    emotion = emotion_reverse(emotion)
    speaker_emotion = (f'{speaker}_{emotion}')
    y, sr = librosa.load(os.path.join(dataset, speaker_emotion, filename), sr = sr)
    
# Getting the duration of the sample and padding with zeros so that the length is an integer. 
    
duration = librosa.get_duration(y=y, sr=sr)
pieces = math.ceil(duration)

length = sr*pieces

pad_length = length - len(y)
y = np.pad(y,((0,pad_length)),'constant')

# Taking the first 1 second chunk from the audiofile and extracting GFCCs from that chunk.

start=0
end=sr

chunk = y[start:end]

gfccs = gfcc_test(chunk, sr = sr, n_gfcc=n_gfcc, hop_length=hop_length)

# Using the GFCCs from the 1 second chunk to create an empty array.

chunk_GFCC_array = np.empty(shape =(1,gfccs.shape[0],gfccs.shape[1]))
chunk_GFCC_array[i]=gfccs

# Adding the remaining 1 second pieces from the sample audiofile to the two new dataframes.

for n in range(pieces-1):
    c+=1
    start +=sr
    end +=sr
    
    chunk = y[start:end]
    gfccs = gfcc_test(chunk, sr = sr, n_gfcc=n_gfcc, hop_length=hop_length)
    
    gfccs = gfccs.reshape((1,gfccs.shape[0],gfccs.shape[1]))
    
    chunk_GFCC_array = np.append(chunk_GFCC_array,gfccs,axis=0)

# Adding the remaining 1 second slices of audio to our dataframe.
    
for i in range(1, len(df)):
    
    filename = df.iloc[i]['Filename']
    dataset = df.iloc[i]['Dataset']
    speaker = df.iloc[i]['Speaker']
    emotion = df.iloc[i]['Emotion']
    
    if (dataset=='SAVEE'):
        
        y, sr = librosa.load(os.path.join(dataset, speaker, filename), sr = sr)

    elif(dataset=='TESS'):
        emotion = emotion_reverse(emotion)
        speaker_emotion = (f'{speaker}_{emotion}')
        
        y, sr = librosa.load(os.path.join(dataset, speaker_emotion, filename), sr = sr)
        
    duration = librosa.get_duration(y=y, sr=sr)
    pieces = math.ceil(duration)
    length = sr*pieces
    pad_length = length - len(y)
    y = np.pad(y,((0,pad_length)),'constant')

    start=0
    end=sr
    p=1

    for n in range(pieces):

        chunk = y[start:end]
        gfccs = gfcc_test(chunk, sr = sr, n_gfcc=n_gfcc, hop_length=hop_length)

        gfccs = gfccs.reshape((1,gfccs.shape[0],gfccs.shape[1]))

        chunk_GFCC_array = np.append(chunk_GFCC_array,gfccs,axis=0)

        start +=sr
        end +=sr
        c+=1
        
        print('\r',f'{c} arrays added',end='')
        
# Saving our new dataframe       
    
os.chdir(os.path.join(PATH,'DataFrames'))
np.save('80_GFCC_array_1s_Chunks.npy', chunk_GFCC_array)

 2913 arrays added

### Extracting Linear Filterbank Cepstral Coefficients

In [7]:
# This function will extract LFCCs from an audio file.

import numpy as np
import scipy
import scipy.signal
import scipy.fftpack

from librosa import util
from librosa import filters
from librosa.util.exceptions import ParameterError

from librosa.core.time_frequency import fft_frequencies
from librosa.core.audio import zero_crossings, to_mono
from librosa.core.spectrum import power_to_db, _spectrogram
from librosa.core.constantq import cqt, hybrid_cqt
from librosa.core.pitch import estimate_tuning
from spafe.fbanks.gammatone_fbanks import gammatone_filter_banks
from spafe.fbanks.linear_fbanks import linear_filter_banks
import math

def lin_test(y=None, sr=22050, S=None, n_lfcc=20, dct_type=2, norm='ortho', lifter=0, **kwargs):


    if S is None:
        S = power_to_db(linspectrogram(y=y, sr=sr, **kwargs))

    M = scipy.fftpack.dct(S, axis=0, type=dct_type, norm=norm)[:n_lfcc]

    if lifter > 0:
        M *= 1 + (lifter / 2) * np.sin(np.pi * np.arange(1, 1 + n_lfcc, dtype=M.dtype) / lifter)[:, np.newaxis]
        return M
    elif lifter == 0:
        return M
    else:
        raise ParameterError('LFCC lifter={} must be a non-negative number'.format(lifter))

        
        
def linspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,
                   win_length=None, window='hann', center=True, pad_mode='reflect',
                   power=1.0, **kwargs):


    S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power,
                            win_length=win_length, window=window, center=center,
                            pad_mode=pad_mode)


    linear_basis = linear_filter_banks(fs=sr,nfft=n_fft,nfilts=128)
    

    return np.dot(linear_basis, S)

In [8]:
# This script will create a 3D array of LFCCs for each audio file, plus the zero-crossing rate, spectral centroid, spectral bandwidth, and short-time energy with RMS. 
# It functions in almost exactly the same way as the code used to extract MFCCs & GFCCs. 

os.chdir(os.path.join(PATH,'DataFrames'))
audiodata_balanced = pd.read_csv("Balanced Audio Data Iteration #2.csv", index_col = False)

start_time = time.time()
df = audiodata_balanced

n_lfcc = 60
hop_length = 512
sr = 44100

# Maps a numeric emotion label back to the folder-name suffix used by the TESS dataset.

EMOTION_NAMES = {0: 'angry', 1: 'disgust', 2: 'fear', 3: 'sad',
                 4: 'neutral', 5: 'happy', 6: 'pleasant_surprised'}

def emotion_reverse(emotion):
    # Raises a KeyError for labels outside 0-6 instead of silently returning a placeholder.
    return EMOTION_NAMES[emotion]


df = audiodata_balanced

os.chdir(os.path.join(PATH,'Data Sets'))

i=0

filename = df.iloc[i]['Filename']
dataset = df.iloc[i]['Dataset']
speaker = df.iloc[i]['Speaker']
emotion = df.iloc[i]['Emotion']


if (dataset=='SAVEE'):
    y, sr = librosa.load(os.path.join(dataset, speaker, filename), sr = sr)
elif(dataset=='TESS'):
    emotion = emotion_reverse(emotion)
    speaker_emotion = (f'{speaker}_{emotion}')
    y, sr = librosa.load(os.path.join(dataset, speaker_emotion, filename), sr = sr)
    
duration = librosa.get_duration(y=y, sr=sr)
pieces = math.ceil(duration)
length = sr*pieces
pad_length = length - len(y)
y = np.pad(y,((0,pad_length)),'constant')

start=0
end=sr

chunk = y[start:end]

lfccs = lin_test(chunk, sr = sr, n_lfcc=n_lfcc)

z0 = librosa.feature.zero_crossing_rate(y=chunk)
sc = librosa.feature.spectral_centroid(y=chunk, sr=sr)
sbw = librosa.feature.spectral_bandwidth(y=chunk, sr=sr)
rms = librosa.feature.rms(y=chunk)

lfccs = np.append(lfccs,z0,axis=0)
lfccs = np.append(lfccs,sc,axis=0)
lfccs = np.append(lfccs,sbw,axis=0)
lfccs = np.append(lfccs,rms,axis=0)


chunk_LFCC_array = np.empty(shape =(1,lfccs.shape[0],lfccs.shape[1]))
chunk_LFCC_array[i]=lfccs

c=1


for n in range(pieces-1):
    
    start +=sr
    end +=sr
    
    chunk = y[start:end]
    lfccs = lin_test(chunk, sr = sr, n_lfcc=n_lfcc)
    
    z0 = librosa.feature.zero_crossing_rate(y=chunk)
    sc = librosa.feature.spectral_centroid(y=chunk, sr=sr)
    sbw = librosa.feature.spectral_bandwidth(y=chunk, sr=sr)
    rms = librosa.feature.rms(y=chunk)
    
    lfccs = np.append(lfccs,z0,axis=0)
    lfccs = np.append(lfccs,sc,axis=0)
    lfccs = np.append(lfccs,sbw,axis=0)
    lfccs = np.append(lfccs,rms,axis=0)
    
    lfccs = lfccs.reshape((1,lfccs.shape[0],lfccs.shape[1]))
    
    chunk_LFCC_array = np.append(chunk_LFCC_array,lfccs,axis=0)
    
    c+=1
    
for i in range(1, len(df)):
    
    filename = df.iloc[i]['Filename']
    dataset = df.iloc[i]['Dataset']
    speaker = df.iloc[i]['Speaker']
    emotion = df.iloc[i]['Emotion']
    
    if (dataset=='SAVEE'):
        
        y, sr = librosa.load(os.path.join(dataset, speaker, filename), sr = sr)

    elif(dataset=='TESS'):
        emotion = emotion_reverse(emotion)
        speaker_emotion = (f'{speaker}_{emotion}')
        
        y, sr = librosa.load(os.path.join(dataset, speaker_emotion, filename), sr = sr)
        
    duration = librosa.get_duration(y=y, sr=sr)
    pieces = math.ceil(duration)
    length = sr*pieces
    pad_length = length - len(y)
    y = np.pad(y,((0,pad_length)),'constant')

    start=0
    end=sr

    for n in range(pieces):

        chunk = y[start:end]
        lfccs = lin_test(chunk, sr = sr, n_lfcc=n_lfcc)
        
        z0 = librosa.feature.zero_crossing_rate(y=chunk)
        sc = librosa.feature.spectral_centroid(y=chunk, sr=sr)
        sbw = librosa.feature.spectral_bandwidth(y=chunk, sr=sr)
        rms = librosa.feature.rms(y=chunk)

        lfccs = np.append(lfccs,z0,axis=0)
        lfccs = np.append(lfccs,sc,axis=0)
        lfccs = np.append(lfccs,sbw,axis=0)
        lfccs = np.append(lfccs,rms,axis=0)

        lfccs = lfccs.reshape((1,lfccs.shape[0],lfccs.shape[1]))

        chunk_LFCC_array = np.append(chunk_LFCC_array,lfccs,axis=0)

        c+=1
        start +=sr
        end +=sr
        
        
        print('\r',f'{c} arrays added',end='')
    
os.chdir(os.path.join(PATH,'DataFrames'))
np.save('60_LFCC_array.npy', chunk_LFCC_array)
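# Note: index_df is the dataframe built in the MFCC cell; this cell does not rebuild it.
# Saving it without index=False writes the row index as an extra 'Unnamed: 0' column when the file is read back.
index_df.to_csv('1s_chunk_index_df.csv')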

 2913 arrays added
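
Stacking the four frame-level descriptors beneath the 60 LFCC rows is what yields 64 feature rows per chunk. A minimal sketch with placeholder arrays follows; the values are random and only the shapes mirror the notebook, on the assumption that each librosa descriptor comes back with shape `(1, n_frames)`.

```python
import numpy as np

n_lfcc, n_frames = 60, 87
rng = np.random.default_rng(0)

lfccs = rng.normal(size=(n_lfcc, n_frames))   # stand-in for the lin_test output
zcr = rng.uniform(size=(1, n_frames))         # stand-in for the zero-crossing rate
sc = rng.uniform(size=(1, n_frames))          # stand-in for the spectral centroid
sbw = rng.uniform(size=(1, n_frames))         # stand-in for the spectral bandwidth
rms = rng.uniform(size=(1, n_frames))         # stand-in for the RMS energy

# One vstack call is equivalent to the four successive np.append(..., axis=0) calls
features = np.vstack([lfccs, zcr, sc, sbw, rms])
print(features.shape)
```

Rows 0-59 hold the cepstral coefficients and rows 60-63 hold the four descriptors, in the order they were stacked.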

In [11]:
index_df=pd.read_csv('1s_chunk_index_df.csv')
index_df

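| Unnamed: 0.1 | Unnamed: 0 | ID | Pieces | Filename | Dataset | Sex | Speaker | Duration | Emotion |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 1/4 | h05.wav | SAVEE | M | KL | 3.38 | 5 |
| 1 | 0 | 0.0 | 2/4 | h05.wav | SAVEE | M | KL | 3.38 | 5 |
| 2 | 0 | 0.0 | 3/4 | h05.wav | SAVEE | M | KL | 3.38 | 5 |
| 3 | 0 | 0.0 | 4/4 | h05.wav | SAVEE | M | KL | 3.38 | 5 |
| 4 | 1 | 1.0 | 1/5 | sa04.wav | SAVEE | M | KL | 4.73 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2908 | 838 | 838.0 | 1/3 | OAF_life_sad.wav | TESS | F | OAF | 2.56 | 3 |
| 2909 | 838 | 838.0 | 2/3 | OAF_life_sad.wav | TESS | F | OAF | 2.56 | 3 |
| 2910 | 838 | 838.0 | 3/3 | OAF_life_sad.wav | TESS | F | OAF | 2.56 | 3 |
| 2911 | 839 | 839.0 | 1/2 | OAF_check_angry.wav | TESS | F | OAF | 1.65 | 0 |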


## Preprocessing

In [4]:
# From here we can load the dataframes containing MFCCs, GFCCs, and LFCCs, as well as the index_df containing necessary information about each.
os.chdir(os.path.join(PATH,'DataFrames'))

audiodata_balanced = pd.read_csv("Balanced Audio Data Iteration #2.csv", index_col = False)

index_df=pd.read_csv('1s_chunk_index_df.csv')

chunk_MFCC_array = np.load('80_MFCC_array_1s_Chunks.npy')
chunk_GFCC_array = np.load('80_GFCC_array_1s_Chunks.npy')
chunk_LFCC_array = np.load('60_LFCC_array.npy')

print(f'MFCC array shape:{chunk_MFCC_array.shape}, GFCC array shape:{chunk_GFCC_array.shape}, LFCC array shape:{chunk_LFCC_array.shape}')

display(index_df)


MFCC array shape:(2913, 80, 87), GFCC array shape:(2913, 80, 87), LFCC array shape:(2913, 64, 87)


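| Unnamed: 0 | ID | Pieces | Filename | Dataset | Sex | Speaker | Duration | Emotion |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1/4 | h05.wav | SAVEE | M | KL | 3.38 | 5 |
| 1 | 0.0 | 2/4 | h05.wav | SAVEE | M | KL | 3.38 | 5 |
| 2 | 0.0 | 3/4 | h05.wav | SAVEE | M | KL | 3.38 | 5 |
| 3 | 0.0 | 4/4 | h05.wav | SAVEE | M | KL | 3.38 | 5 |
| 4 | 1.0 | 1/5 | sa04.wav | SAVEE | M | KL | 4.73 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2908 | 838.0 | 1/3 | OAF_life_sad.wav | TESS | F | OAF | 2.56 | 3 |
| 2909 | 838.0 | 2/3 | OAF_life_sad.wav | TESS | F | OAF | 2.56 | 3 |
| 2910 | 838.0 | 3/3 | OAF_life_sad.wav | TESS | F | OAF | 2.56 | 3 |
| 2911 | 839.0 | 1/2 | OAF_check_angry.wav | TESS | F | OAF | 1.65 | 0 |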


In [5]:
# We'll train/test split our data, stratifying to maintain a balance of classes across 'emotion'

X = audiodata_balanced.iloc[:,:5]
y = audiodata_balanced.iloc[:,5]

X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size = 0.3, 
                                                    random_state = 18, 
                                                    stratify=y)

# We'll use the train/test split to pull out corresponding indices from index_df, which tells us which 1s clips belong to single files.

Train_index = index_df[index_df.ID.isin((X_train.index))]
Test_index = index_df[index_df.ID.isin((X_test.index))]

# From these indices we'll create equivalent arrays for each set of features.

MFCC_X_train = chunk_MFCC_array[Train_index.index]
MFCC_X_test = chunk_MFCC_array[Test_index.index]
GFCC_X_train = chunk_GFCC_array[Train_index.index]
GFCC_X_test = chunk_GFCC_array[Test_index.index]
LFCC_X_train = chunk_LFCC_array[Train_index.index]
LFCC_X_test = chunk_LFCC_array[Test_index.index]

# Declaring y_train and y_test to correspond to the dataframes of sliced audio.

y_train = Train_index.loc[:,'Emotion']
y_test = Test_index.loc[:,'Emotion']

# Resetting indices and converting y_train/y_test to numpy for future use.

Train_index = Train_index.reset_index()
Test_index = Test_index.reset_index()

y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

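# Standardizing every feature set using statistics from the training data only.
# Each scaler is fit on a (chunks, time steps) slice for one coefficient, so every (coefficient, time step) cell is scaled across chunks.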

for i in range(MFCC_X_train.shape[1]):
    
    scaler = StandardScaler()
    
    scaler.fit(MFCC_X_train[:,i,:])
    
    MFCC_X_train[:,i,:] = scaler.transform(MFCC_X_train[:,i,:])
    MFCC_X_test[:,i,:] = scaler.transform(MFCC_X_test[:,i,:])
    
for i in range(GFCC_X_train.shape[1]):
    
    scaler = StandardScaler()
    scaler.fit(GFCC_X_train[:,i,:])
    
    GFCC_X_train[:,i,:] = scaler.transform(GFCC_X_train[:,i,:])
    GFCC_X_test[:,i,:] = scaler.transform(GFCC_X_test[:,i,:])

for i in range(LFCC_X_train.shape[1]):
    
    scaler = StandardScaler()
    scaler.fit(LFCC_X_train[:,i,:])
    
    LFCC_X_train[:,i,:] = scaler.transform(LFCC_X_train[:,i,:])
    LFCC_X_test[:,i,:] = scaler.transform(LFCC_X_test[:,i,:])
    
# Expanding array dimensions so that they're in the appropriate format for a neural network.

MFCC_X_train = np.expand_dims(MFCC_X_train,-1)
MFCC_X_test = np.expand_dims(MFCC_X_test,-1)
GFCC_X_train = np.expand_dims(GFCC_X_train,-1)
GFCC_X_test = np.expand_dims(GFCC_X_test,-1)
LFCC_X_train = np.expand_dims(LFCC_X_train,-1)
LFCC_X_test = np.expand_dims(LFCC_X_test,-1)

# Creating a feature fusion array from each independent feature set.
    
Fusion_X_train = np.concatenate([MFCC_X_train, GFCC_X_train, LFCC_X_train],axis=1)
Fusion_X_test = np.concatenate([MFCC_X_test, GFCC_X_test, LFCC_X_test],axis=1)

# Transposing all arrays so that they're in the appropriate format for RNNs.

MFCC_X_train = MFCC_X_train.transpose(0,2,1,3)
MFCC_X_test = MFCC_X_test.transpose(0,2,1,3)
GFCC_X_train = GFCC_X_train.transpose(0,2,1,3)
GFCC_X_test = GFCC_X_test.transpose(0,2,1,3)
LFCC_X_train = LFCC_X_train.transpose(0,2,1,3)
LFCC_X_test = LFCC_X_test.transpose(0,2,1,3)
Fusion_X_train = Fusion_X_train.transpose(0,2,1,3)
Fusion_X_test = Fusion_X_test.transpose(0,2,1,3)
    
print(f'MFCC X train shape: {MFCC_X_train.shape}')
print(f'MFCC X test shape: {MFCC_X_test.shape}')
print(f'GFCC X train shape: {GFCC_X_train.shape}')
print(f'GFCC X test shape: {GFCC_X_test.shape}')
print(f'LFCC X train shape: {LFCC_X_train.shape}')
print(f'LFCC X test shape: {LFCC_X_test.shape}')
print(f'Fusion X train shape: {Fusion_X_train.shape}')
print(f'Fusion X test shape: {Fusion_X_test.shape}')
    

MFCC X train shape: (2073, 87, 80, 1)
MFCC X test shape: (840, 87, 80, 1)
GFCC X train shape: (2073, 87, 80, 1)
GFCC X test shape: (840, 87, 80, 1)
LFCC X train shape: (2073, 87, 64, 1)
LFCC X test shape: (840, 87, 64, 1)
Fusion X train shape: (2073, 87, 224, 1)
Fusion X test shape: (840, 87, 224, 1)
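
Because each `StandardScaler` above is fit on a `(chunks, time steps)` slice, every (coefficient, time step) cell ends up standardized across the training chunks. The sketch below reproduces that reading of the code with NumPy broadcasting on synthetic data, then applies the same expand-and-transpose steps. It is an illustration under that assumption, not a replacement for the cell above; the array sizes are made up apart from the 80 x 87 feature layout.

```python
import numpy as np

rng = np.random.default_rng(18)
X_train = rng.normal(loc=5.0, scale=3.0, size=(20, 80, 87))  # (chunks, coefficients, frames)
X_test = rng.normal(loc=5.0, scale=3.0, size=(8, 80, 87))

# Statistics come from the training set only: one mean/std per (coefficient, frame) cell
mean = X_train.mean(axis=0, keepdims=True)
std = X_train.std(axis=0, keepdims=True)
std[std == 0] = 1.0                       # guard against constant cells

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std     # the test set reuses the training statistics

# Add a channel axis, then put time before coefficients: (chunks, frames, coefficients, 1)
X_train_rnn = np.expand_dims(X_train_scaled, -1).transpose(0, 2, 1, 3)
print(X_train_rnn.shape)
```

Reusing the training mean and standard deviation on the test set keeps information about the test distribution out of the preprocessing step.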


## LFCC Kernel SVM Test

Before moving on to the RNN model, we will test the predictive power of our LFCC features using an RBF kernel SVM.

In [36]:
# This takes the mean of each feature over time in our LFCC array, then standardizes the result with a StandardScaler.

LFCC_X_train_2D = LFCC_X_train[:,:,:,0]
LFCC_X_test_2D = LFCC_X_test[:,:,:,0]

LFCC_X_train_2D = LFCC_X_train_2D.mean(axis=1)
LFCC_X_test_2D = LFCC_X_test_2D.mean(axis=1)

scaler = StandardScaler()
scaler.fit(LFCC_X_train_2D)
LFCC_X_train_2D = scaler.transform(LFCC_X_train_2D)
LFCC_X_test_2D = scaler.transform(LFCC_X_test_2D)

LFCC_X_train_2D.shape

(2073, 64)

In [37]:
# Testing an RBF kernel SVM with a manually optimized C value

my_RBF_kernel_SVM = SVC(C=100, max_iter = 10000, probability=True)

my_RBF_kernel_SVM.fit(LFCC_X_train_2D, y_train)

train_score =  my_RBF_kernel_SVM.score(LFCC_X_train_2D, y_train)
test_score =  my_RBF_kernel_SVM.score(LFCC_X_test_2D, y_test)

print('Train score:', train_score)
print('Test score:', test_score)                                     
                                    

Train score: 0.9604438012542209
Test score: 0.594047619047619


In [38]:
# This will combine the predictions of every 1 second chunk of audio from each audiofile. This is done through multiplicative probabilistic voting.

# Taking the prediction probabilities of the test set from our SVM.
X_test_proba = my_RBF_kernel_SVM.predict_proba(LFCC_X_test_2D)

# Creating empty lists to record the predicted class and actual class of each audio file in the test set.
combined_preds = []
y_test_true = []

# For each audio file, we gather the probabilities of all of its chunks and multiply them class by class; the class with the largest product wins the vote.
# We also regenerate y_test so that it contains one actual class per audio file.
for file_id in Test_index.ID.unique():
    chunk_rows = Test_index[Test_index['ID'] == file_id].index
    probs = X_test_proba[chunk_rows]
    combined_preds.append(probs.prod(axis=0).argmax())
    y_test_true.append(y_test[chunk_rows][0])

# We score how many of the combined predictions are correct.
count = 0
for i in range(len(y_test_true)):
    if combined_preds[i] == y_test_true[i]:
        count += 1
LFCC_SVM_Score = count/len(y_test_true)
print(f'Final Accuracy Score: {LFCC_SVM_Score}')

Final Accuracy Score: 0.8174603174603174
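
Multiplying many small probabilities can underflow for files with many chunks, so one common variation is to sum log-probabilities instead; the winning class is the same as long as no probability falls below the clipping threshold. A small sketch with made-up probabilities for one file split into three chunks is shown here. `vote` is a hypothetical helper and the `eps` value is an assumption chosen only to avoid taking the log of zero.

```python
import numpy as np

def vote(chunk_probs, eps=1e-12):
    """Combine per-chunk class probabilities by summing their logs; return the winning class."""
    log_probs = np.log(np.clip(chunk_probs, eps, 1.0))
    return int(log_probs.sum(axis=0).argmax())

# Three chunks from one file, three classes (made-up numbers)
chunk_probs = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.5, 0.1],
])

log_vote = vote(chunk_probs)                             # vote via summed log-probabilities
product_vote = int(chunk_probs.prod(axis=0).argmax())    # vote via the direct product
print(log_vote, product_vote)
```

In this example the first chunk on its own would pick class 0, but the combined evidence across all three chunks favours class 1, which is the kind of correction that lifts the file-level score above the chunk-level score.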


## RNN With Attention Layer

In [11]:
# From https://github.com/douglas125/SpeechCmdRecognition

def AttRNNSpeechModel(nCategories,
                      inputLength=None, rnn_func=L.LSTM):
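    # inputLength is the full input shape, e.g. (time steps, features, 1).
    # Conv2D front end -> two bidirectional recurrent layers -> dot-product attention -> dense classifier.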

    inputs = L.Input(shape = inputLength, name='input')


    x = L.Conv2D(10, (5, 1), activation='relu', padding='same')(inputs)
    x = L.BatchNormalization()(x)
    x = L.Conv2D(1, (5, 1), activation='relu', padding='same')(x)
    x = L.BatchNormalization()(x)

    x = L.Lambda(lambda q: K.squeeze(q, -1), name='squeeze_last_dim')(x)

    x = L.Bidirectional(rnn_func(64, return_sequences=True,
                                 kernel_regularizer = L1L2(l1=0.01, l2=0.01))
                        )(x)  # [b_s, seq_len, vec_dim]
    x = L.Bidirectional(rnn_func(64, return_sequences=True, 
                                 kernel_regularizer = L1L2(l1=0.01, l2=0.01))
                        )(x)  # [b_s, seq_len, vec_dim]

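    # Despite its name, xFirst holds the output of the last time step; it is projected into the attention query.
    xFirst = L.Lambda(lambda q: q[:, -1])(x)  # [b_s, vec_dim]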
    query = L.Dense(128)(xFirst)

    # dot product attention
    attScores = L.Dot(axes=[1, 2])([query, x])
    attScores = L.Softmax(name='attSoftmax')(attScores)  # [b_s, seq_len]

    # rescale sequence
    attVector = L.Dot(axes=[1, 1])([attScores, x])  # [b_s, vec_dim]

    x = L.Dense(64, activation='relu')(attVector)
    x = L.Dense(32)(x)

    output = L.Dense(nCategories, activation='softmax', name='output')(x)

    model = Model(inputs=[inputs], outputs=[output])

    return model
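
To make the attention step concrete, here is a NumPy sketch of what the two `Dot` layers and the softmax compute for a small batch. The weights are random, and the names `W`, `b` and `softmax` are placeholders introduced for illustration; the sketch only mirrors the tensor shapes of the model above (two bidirectional layers of 64 units give 128-dimensional vectors).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, seq_len, vec_dim = 2, 87, 128

x = rng.normal(size=(batch, seq_len, vec_dim))    # stand-in for the recurrent layer outputs
W = rng.normal(size=(vec_dim, vec_dim)) * 0.1     # stand-in for the Dense(128) query weights
b = np.zeros(vec_dim)

query = x[:, -1, :] @ W + b                        # [batch, vec_dim], built from the last time step
scores = np.einsum('bd,btd->bt', query, x)         # dot product of the query with every time step
weights = softmax(scores, axis=1)                  # [batch, seq_len], sums to 1 over time
context = np.einsum('bt,btd->bd', weights, x)      # weighted sum over time: [batch, vec_dim]

print(weights.shape, context.shape)
```

The `weights` array is what the `attSoftmax` layer outputs, so inspecting it for a given clip would show which time steps the model relied on most.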

### MFCC RNN

In [24]:
# If a pretrained model exists, we'll load and compile that model. If not, we'll instantiate and compile a new model.
if os.path.exists(os.path.join(PATH,'Models/RNN_model_with_Attention_MFCC_chunks')):
    print('Pretrained model found')
    model=load_model(os.path.join(PATH,'Models/RNN_model_with_Attention_MFCC_chunks'))
    model.summary()
    train = 'no'
else:
    print('No model found, compiling new model')
    model = AttRNNSpeechModel(nCategories=7, inputLength = (MFCC_X_train.shape[1:4]))
    model.compile(optimizer='adam', loss=['sparse_categorical_crossentropy'], metrics=['sparse_categorical_accuracy'])
    model.summary()
    train = 'yes'

Pretrained model found
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input (InputLayer)              [(None, 87, 80, 1)]  0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 87, 80, 10)   60          input[0][0]                      
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 87, 80, 10)   40          conv2d[0][0]                     
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 87, 80, 1)    51          batch_normalization[0][0]        
_______________________________________________________________________
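
The load-or-build cell above is repeated for each feature set, so it could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original notebook: `load_or_build`, `load_fn` and `build_fn` are hypothetical names, and the helper returns a boolean instead of the `'yes'`/`'no'` strings used in the cells. The callables are passed in so the helper itself does not depend on TensorFlow.

```python
import os

def load_or_build(model_dir, load_fn, build_fn):
    """Return (model, needs_training).

    load_fn(model_dir) is called if model_dir exists, otherwise build_fn() is called.
    """
    if os.path.exists(model_dir):
        print('Pretrained model found')
        return load_fn(model_dir), False
    print('No model found, compiling new model')
    return build_fn(), True

# Demo with stand-in callables; in the notebook these would be load_model
# and a function that builds and compiles AttRNNSpeechModel.
model, needs_training = load_or_build('/nonexistent/path/for/demo',
                                      load_fn=lambda path: 'loaded model',
                                      build_fn=lambda: 'new model')
print(model, needs_training)
```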

In [20]:
# # Uncomment this to continue training the loaded model
# train = 'yes'

In [21]:
# Train the model with EarlyStopping and ModelCheckpoint so that the best version is saved.
# The model is saved under PATH so that the loading cell above can find it again.
if train == 'yes':
    model_path = os.path.join(PATH, 'Models/RNN_model_with_Attention_MFCC_chunks')
    earlystopper = EarlyStopping(monitor='val_sparse_categorical_accuracy', patience=100,
                                 verbose=1, restore_best_weights=True)
    checkpointer = ModelCheckpoint(model_path,
                                   monitor='val_sparse_categorical_accuracy',
                                   verbose=1,
                                   save_best_only=True)

    history = model.fit(MFCC_X_train, y_train,
                        epochs=500,
                        verbose=1,
                        validation_data=(MFCC_X_test, y_test),
                        shuffle=True,
                        callbacks=[earlystopper, checkpointer])
    model.save(model_path)

Epoch 1/500
Epoch 00001: val_sparse_categorical_accuracy improved from -inf to 0.13333, saving model to /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_MFCC_chunks
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_MFCC_chunks/assets
Epoch 2/500
Epoch 00002: val_sparse_categorical_accuracy did not improve from 0.13333
Epoch 3/500
Epoch 00003: val_sparse_categorical_accuracy improved from 0.13333 to 0.18333, saving model to /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_MFCC_chunks
INFO:tensorflow:Assets written to: /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_MFCC_chunks/assets
Epoch 4/500
Epoch 00004: val_sparse_categorica
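
As a rough illustration of what `patience` does in the training cell above, here is a simplified pure-Python sketch of early stopping on a metric where higher is better. It is not the Keras implementation (which also handles `min_delta`, `mode` and weight restoration); `early_stopping_epoch` is a hypothetical name and the scores are made-up values.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without a new best score.
    """
    best_score = float('-inf')
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # New best: remember it and reset the waiting period.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    # Ran out of epochs before the patience was exhausted.
    return len(val_scores), best_epoch

# Illustrative validation accuracies, one per epoch.
scores = [0.13, 0.13, 0.18, 0.17, 0.16, 0.15]
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=3)
print(stop_epoch, best_epoch)
```

With `patience=100` and `epochs=500`, as in the cell above, training would continue for 100 epochs past the best validation accuracy before stopping, and `restore_best_weights=True` then rolls the model back to that best epoch.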

In [25]:
# Combine the predictions for every 1-second chunk of audio from each audio file using multiplicative probabilistic voting.
# This works the same way as the code used to combine prediction probabilities for the SVM.
X_test_proba = model.predict(MFCC_X_test)

combined_preds = []
y_test_true=[]
for i in Test_index.ID.unique():
    probs = X_test_proba[Test_index[Test_index['ID']==i].index]
    test = y_test[Test_index[Test_index['ID']==i].index][0]
    total_probs=1
    for j in range(probs.shape[0]):
        total_probs = probs[j]*total_probs

    total_probs = total_probs.argmax()
    combined_preds.append(total_probs)
    y_test_true.append(test)

count = 0
for i in range(len(y_test_true)):
    if combined_preds[i]==y_test_true[i]:
        count +=1
MFCC_RNN_score = count/len(y_test_true)

print(f'Final Accuracy Score: {MFCC_RNN_score}')

Final Accuracy Score: 0.7976190476190477
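
The voting step above is reused for every model, so it may be worth sketching it as a single function. The version below is an assumption-laden sketch rather than the code that produced the scores in this notebook: `combine_chunk_predictions` and its argument names are hypothetical, and it sums log-probabilities instead of multiplying probabilities. The predicted class is the same because the logarithm is monotonic, but the sum is less prone to underflow when a file has many chunks. The small `eps` is an added guard against `log(0)`.

```python
import numpy as np

def combine_chunk_predictions(chunk_probs, file_ids, chunk_labels, eps=1e-12):
    """Combine per-chunk class probabilities into one prediction per audio file.

    chunk_probs:  [n_chunks, n_classes] predicted probabilities
    file_ids:     [n_chunks] ID of the file each chunk came from
    chunk_labels: [n_chunks] true label of each chunk (constant within a file)
    Returns (predictions, true_labels, accuracy), one entry per file.
    """
    chunk_probs = np.asarray(chunk_probs, dtype=float)
    file_ids = np.asarray(file_ids)
    chunk_labels = np.asarray(chunk_labels)

    preds, truths = [], []
    for file_id in np.unique(file_ids):
        mask = file_ids == file_id
        # Sum of logs over chunks is the log of the product of probabilities.
        log_votes = np.log(chunk_probs[mask] + eps).sum(axis=0)
        preds.append(int(log_votes.argmax()))
        truths.append(int(chunk_labels[mask][0]))

    preds, truths = np.array(preds), np.array(truths)
    return preds, truths, float((preds == truths).mean())

# Toy example: file 'a' has three chunks, file 'b' has one.
probs = np.array([[0.6, 0.4],
                  [0.2, 0.8],
                  [0.3, 0.7],
                  [0.9, 0.1]])
ids = np.array(['a', 'a', 'a', 'b'])
labels = np.array([1, 1, 1, 0])

preds, truths, accuracy = combine_chunk_predictions(probs, ids, labels)
print(preds, truths, accuracy)
```

In the toy example the first chunk of file `a` favours class 0, but the product over all three chunks (0.036 versus 0.224) favours class 1, which is the point of voting across chunks.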


### GFCC RNN

In [26]:
# If a pretrained model exists, we'll load and compile that model. If not, we'll instantiate and compile a new model.
if os.path.exists(os.path.join(PATH,'Models/RNN_model_with_Attention_GFCC_chunks')):
    print('Pretrained model found')
    model=load_model(os.path.join(PATH,'Models/RNN_model_with_Attention_GFCC_chunks'))
    model.summary()
    train = 'no'
else:
    print('No model found, compiling new model')
    model = AttRNNSpeechModel(nCategories=7, inputLength = (GFCC_X_train.shape[1:4]))
    model.compile(optimizer='adam', loss=['sparse_categorical_crossentropy'], metrics=['sparse_categorical_accuracy'])
    model.summary()
    train = 'yes'

Pretrained model found
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input (InputLayer)              [(None, 87, 80, 1)]  0                                            
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 87, 80, 10)   60          input[0][0]                      
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 87, 80, 10)   40          conv2d_2[0][0]                   
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 87, 80, 1)    51          batch_normalization_2[0][0]      
_____________________________________________________________________

In [24]:
# # Uncomment this to continue training the loaded model
# train = 'yes'

In [25]:
# The model is saved under PATH so that the loading cell above can find it again.
if train == 'yes':
    model_path = os.path.join(PATH, 'Models/RNN_model_with_Attention_GFCC_chunks')
    earlystopper = EarlyStopping(monitor='val_sparse_categorical_accuracy', patience=100,
                                 verbose=1, restore_best_weights=True)
    checkpointer = ModelCheckpoint(model_path,
                                   monitor='val_sparse_categorical_accuracy',
                                   verbose=1,
                                   save_best_only=True)

    model.fit(GFCC_X_train, y_train,
              epochs=500,
              verbose=1,
              validation_data=(GFCC_X_test, y_test),
              shuffle=True,
              callbacks=[earlystopper, checkpointer])
    model.save(model_path)

Epoch 1/500
Epoch 00001: val_sparse_categorical_accuracy improved from -inf to 0.33929, saving model to /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_GFCC_chunks
INFO:tensorflow:Assets written to: /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_GFCC_chunks/assets
Epoch 2/500
Epoch 00002: val_sparse_categorical_accuracy improved from 0.33929 to 0.38095, saving model to /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_GFCC_chunks
INFO:tensorflow:Assets written to: /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_GFCC_chunks/assets
Epoch 3/500
Epoch 00003: val_sparse_categorical_accuracy did not improve from 0.38095
Epoch 4/500
Epoch 00004: val_sparse_categorical_accuracy improved from 0.38095 to 0.42262, saving model to /Users/coleslatt/Doc

In [27]:
# Combine the predictions for every 1-second chunk of audio from each audio file using multiplicative probabilistic voting.
# This works the same way as the code used to combine prediction probabilities for the SVM.

X_test_proba = model.predict(GFCC_X_test)

combined_preds = []
y_test_true=[]
for i in Test_index.ID.unique():
    probs = X_test_proba[Test_index[Test_index['ID']==i].index]
    test = y_test[Test_index[Test_index['ID']==i].index][0]
    total_probs=1
    for j in range(probs.shape[0]):
        total_probs = probs[j]*total_probs

    total_probs = total_probs.argmax()
    combined_preds.append(total_probs)
    y_test_true.append(test)

count = 0
for i in range(len(y_test_true)):
    if combined_preds[i]==y_test_true[i]:
        count +=1
GFCC_RNN_score = count/len(y_test_true)
print(f'Final Accuracy Score: {GFCC_RNN_score}')

Final Accuracy Score: 0.8134920634920635


### LFCC RNN

In [31]:
# If a pretrained model exists, we'll load and compile that model. If not, we'll instantiate and compile a new model.
if os.path.exists(os.path.join(PATH,'Models/RNN_model_with_Attention_LFCC_chunks')):
    print('Pretrained model found')
    model=load_model(os.path.join(PATH,'Models/RNN_model_with_Attention_LFCC_chunks'))
    model.summary()
    train = 'no'
else:
    print('No model found, compiling new model')
    model = AttRNNSpeechModel(nCategories=7, inputLength = (LFCC_X_train.shape[1:4]))
    model.compile(optimizer='adam', loss=['sparse_categorical_crossentropy'], metrics=['sparse_categorical_accuracy'])
    model.summary()
    train = 'yes'

Pretrained model found
Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input (InputLayer)              [(None, 87, 64, 1)]  0                                            
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 87, 64, 10)   60          input[0][0]                      
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 87, 64, 10)   40          conv2d_4[0][0]                   
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 87, 64, 1)    51          batch_normalization_4[0][0]      
_____________________________________________________________________

In [7]:
# # Uncomment this to continue training the loaded model
# train = 'yes'

In [29]:
# The model is saved under PATH so that the loading cell above can find it again.
if train == 'yes':
    model_path = os.path.join(PATH, 'Models/RNN_model_with_Attention_LFCC_chunks')
    earlystopper = EarlyStopping(monitor='val_sparse_categorical_accuracy', patience=100,
                                 verbose=1, restore_best_weights=True)
    checkpointer = ModelCheckpoint(model_path,
                                   monitor='val_sparse_categorical_accuracy',
                                   verbose=1,
                                   save_best_only=True)

    model.fit(LFCC_X_train, y_train,
              epochs=500,
              verbose=1,
              validation_data=(LFCC_X_test, y_test),
              shuffle=True,
              callbacks=[earlystopper, checkpointer])
    model.save(model_path)

In [32]:
# Combine the predictions for every 1-second chunk of audio from each audio file using multiplicative probabilistic voting.
# This works the same way as the code used to combine prediction probabilities for the SVM.
X_test_proba = model.predict(LFCC_X_test)

combined_preds = []
y_test_true=[]
for i in Test_index.ID.unique():
    probs = X_test_proba[Test_index[Test_index['ID']==i].index]
    test = y_test[Test_index[Test_index['ID']==i].index][0]
    total_probs=1
    for j in range(probs.shape[0]):
        total_probs = probs[j]*total_probs

    total_probs = total_probs.argmax()
    combined_preds.append(total_probs)
    y_test_true.append(test)

count = 0
for i in range(len(y_test_true)):
    if combined_preds[i]==y_test_true[i]:
        count +=1
LFCC_RNN_score = count/len(y_test_true)
print(f'Final Accuracy Score: {LFCC_RNN_score}')

Final Accuracy Score: 0.7182539682539683


### Fusion RNN

In [33]:
# If a pretrained model exists, we'll load and compile that model. If not, we'll instantiate and compile a new model.
if os.path.exists(os.path.join(PATH,'Models/RNN_model_with_Attention_Fusion')):
    print('Pretrained model found')
    model=load_model(os.path.join(PATH,'Models/RNN_model_with_Attention_Fusion'))
    model.summary()
    train = 'no'
else:
    print('No model found, compiling new model')
    model = AttRNNSpeechModel(nCategories=7, inputLength = (Fusion_X_train.shape[1:4]))
    model.compile(optimizer='adam', loss=['sparse_categorical_crossentropy'], metrics=['sparse_categorical_accuracy'])
    model.summary()
    train = 'yes'

Pretrained model found
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input (InputLayer)              [(None, 87, 224, 1)] 0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 87, 224, 10)  60          input[0][0]                      
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 87, 224, 10)  40          conv2d[0][0]                     
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 87, 224, 1)   51          batch_normalization[0][0]        
_______________________________________________________________________

In [None]:
# # Uncomment this to continue training the loaded model
# train = 'yes'

In [17]:
# The model is saved under PATH so that the loading cell above can find it again.
if train == 'yes':
    model_path = os.path.join(PATH, 'Models/RNN_model_with_Attention_Fusion')
    earlystopper = EarlyStopping(monitor='val_sparse_categorical_accuracy', patience=100,
                                 verbose=1, restore_best_weights=True)
    checkpointer = ModelCheckpoint(model_path,
                                   monitor='val_sparse_categorical_accuracy',
                                   verbose=1,
                                   save_best_only=True)

    model.fit(Fusion_X_train, y_train,
              epochs=500,
              verbose=1,
              validation_data=(Fusion_X_test, y_test),
              shuffle=True,
              callbacks=[earlystopper, checkpointer])
    model.save(model_path)

Epoch 1/500
Epoch 00001: val_sparse_categorical_accuracy improved from -inf to 0.57262, saving model to /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_Fusion
INFO:tensorflow:Assets written to: /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_Fusion/assets
Epoch 2/500
Epoch 00002: val_sparse_categorical_accuracy improved from 0.57262 to 0.60476, saving model to /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_Fusion
INFO:tensorflow:Assets written to: /Users/coleslatt/Documents/Data Science/Brainstation/Projects/Capstone Project/Models/RNN_model_with_Attention_Fusion/assets
Epoch 3/500
Epoch 00003: val_sparse_categorical_accuracy did not improve from 0.60476
Epoch 4/500
Epoch 00004: val_sparse_categorical_accuracy did not improve from 0.60476
Epoch 5/500
Epoch 00005: val_sparse_categorical_accuracy did 

In [34]:
# Combine the predictions for every 1-second chunk of audio from each audio file using multiplicative probabilistic voting.
# This works the same way as the code used to combine prediction probabilities for the SVM.
X_test_proba = model.predict(Fusion_X_test)

combined_preds = []
y_test_true=[]
for i in Test_index.ID.unique():
    probs = X_test_proba[Test_index[Test_index['ID']==i].index]
    test = y_test[Test_index[Test_index['ID']==i].index][0]
    total_probs=1
    for j in range(probs.shape[0]):
        total_probs = probs[j]*total_probs

    total_probs = total_probs.argmax()
    combined_preds.append(total_probs)
    y_test_true.append(test)

count = 0
for i in range(len(y_test_true)):
    if combined_preds[i]==y_test_true[i]:
        count +=1
FUSION_RNN_score = count/len(y_test_true)
print(f'Final Accuracy Score: {FUSION_RNN_score}')

Final Accuracy Score: 0.8452380952380952


## Final Results

In [39]:
print(f'Kernel SVM with LFCC score: {round(LFCC_SVM_Score*100,2)}%')
print(f'RNN with MFCC score: {round(MFCC_RNN_score*100,2)}%')
print(f'RNN with GFCC score: {round(GFCC_RNN_score*100,2)}%')
print(f'RNN with LFCC score: {round(LFCC_RNN_score*100,2)}%')
print(f'RNN with Feature Fusion score: {round(FUSION_RNN_score*100,2)}%')


Kernel SVM with LFCC score: 81.75%
RNN with MFCC score: 79.76%
RNN with GFCC score: 81.35%
RNN with LFCC score: 71.83%
RNN with Feature Fusion score: 84.52%
