# Final Project, Machine Learning Fall 2021, John Harrington
## Introduction and References

For my project I have used the audio-only files of the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset.
The dataset includes 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent.
 The classes of emotion include calm, happy, sad, angry, fearful, surprise, and disgust.
 Each expression is produced at two levels of emotional intensity (normal and strong), with an additional neutral performance.

The most accurate model I have been able to implement (in terms of testing accuracy) is a neural netowrk with two hidden layers, performs optimization with Adam, uses cross entropy to calculate loss, and uses ReLU as an activation function.  

Extracted features include:

https://librosa.org/doc/latest/feature.html

1) Chroma_stft - Compute a chromagram from a waveform or power spectrogram. A spectrogram is a representation of frequency over time with the addition of amplitude as a third dimension, denoting the intensity or volume of the signal at a frequency and a time.

2) MFCC - Mel-frequency cepstral coefficients (MFCCs). MFCCs are commonly derived as follows: First, take the Fourier transform of a signal, second map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows or alternatively, cosine overlapping windows, third take the logs of the powers at each of the mel frequencies, fourth take the discrete cosine transform of the list of mel log powers, as if it were a signal, lastly the MFCCs are the amplitudes of the resulting spectrum.

3) Mel Spectogram - Compute a mel-scaled spectrogram. Studies have shown that humans do not perceive frequencies on a linear scale. We are better at detecting differences in lower frequencies than higher frequencies. For example, we can easily tell the difference between 500 and 1000 Hz, but we will hardly be able to tell a difference between 10,000 and 10,500 Hz, even though the distance between the two pairs are the same. In 1937, Stevens, Volkmann, and Newmann proposed a unit of pitch such that equal distances in pitch sounded equally distant to the listener. This is called the mel scale. We perform a mathematical operation on frequencies to convert them to the mel scale

 Feature extraction is performed using the Python library
 'Librosa' which is intended for music and audio analysis. Librosa
 will read each file as a 1D Numpy array (1D since I have converted
 all files to be mono channel audio, as opposed to stereo.
 I did this because almost all RAVDESS files were already in mono channel,
 and so it made most sense to convert the few stereo recordings to mono
 using the CLI tool 'ffmpeg.')



In [1]:
import soundfile
import librosa
import os, glob
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.io import wavfile
import sounddevice as sd 
import matplotlib.pyplot as plt

## Examining our Data: 

In [2]:
fs, data = wavfile.read("RAVDESS/Actor_01/03-01-01-01-01-01-01.wav")
sd.play(data, fs)

  fs, data = wavfile.read("RAVDESS/Actor_01/03-01-01-01-01-01-01.wav")


In [3]:
fs, data = wavfile.read("RAVDESS/Actor_01/03-01-01-01-02-01-01.wav")
sd.play(data, fs)

  fs, data = wavfile.read("RAVDESS/Actor_01/03-01-01-01-02-01-01.wav")


In [4]:
fs, data = wavfile.read("RAVDESS/Actor_17/03-01-01-01-01-01-17.wav")
sd.play(data, fs)

  fs, data = wavfile.read("RAVDESS/Actor_17/03-01-01-01-01-01-17.wav")


In [5]:
fs, data = wavfile.read("RAVDESS/Actor_16/03-01-01-01-01-01-16.wav")
sd.play(data, fs)

  fs, data = wavfile.read("RAVDESS/Actor_16/03-01-01-01-01-01-16.wav")


In [6]:
fs, data = wavfile.read("RAVDESS/Actor_02/03-01-01-01-01-01-02.wav")
sd.play(data, fs)

  fs, data = wavfile.read("RAVDESS/Actor_02/03-01-01-01-01-01-02.wav")


In [7]:
# Step One: Prepare to extract features from audio samples
def extract_feature(file):
    with soundfile.SoundFile(file) as audio:
        sample = audio.read(dtype="float32")
        sample_rate = audio.samplerate
        stft = np.abs(librosa.stft(sample))
        result = np.array([])
        mfccs = np.mean(librosa.feature.mfcc(y=sample, sr = sample_rate, n_mfcc=40).T, axis = 0)
        result = np.hstack((result,mfccs))
        chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma))
        mel = np.mean(librosa.feature.melspectrogram(sample, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel))

    return result

In [8]:
# Step Two: Audio files in the RAVDESS dataset are labeled by emotion, store these in a dict
emotions = {
    '01' : 'neutral',
    '02' : 'calm',
    '03' : 'happy',
    '04' : 'sad',
    '05' : 'angry',
    '06' : 'fearful',
    '07' : 'disgust',
    '08' : 'surprised'
}

emotions_int = {
    1 : 'neutral', 
    2 : 'calm', 
    3 : 'happy',
    4 : 'sad', 
    5 : 'angry', 
    6 : 'fearful', 
    7 : 'disgust', 
    8: 'surprised'
}

emotion_list = list(emotions.values())
print(len(emotions), "Classes: ", emotions.values())

8 Classes:  dict_values(['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised'])


In [9]:
# Define a load function to use all RAVDESS data for training the model 
def load_training_data():
    x, y = [], []
    file_count = 0
    for file in glob.glob("RAVDESS/Actor_*/*_mono.wav"):
        file_name = os.path.basename(file)
        # print(file)
        emotion = emotions[file_name.split("-")[2]]
        feature = extract_feature(file)
        x.append(feature)
        y.append(emotion)
        file_count += 1
    print(file_count, " Files Loaded")
    return train_test_split(np.array(x), y, train_size=0.5, random_state=None)

x_train, x_test, y_train, y_test = load_training_data()
print(x_train.shape[1], "Features Extracted")


1440  Files Loaded
180 Features Extracted


In [10]:
# Define a function to extract features from our audio samples below 
def load_test_data(file_name):
    x = []
    file_count = 0
    for file in glob.glob("clips/" + file_name):
        file_name = os.path.basename(file)
        # print(file)
        feature = extract_feature(file)
        x.append(feature)
        file_count += 1
    print("Emotion Classified As:")
    return x

## Implementing a Multi-Layer Perception Classifier (using PyTorch)

In [11]:
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

# Class to represent data for feeding into neural network 
class MakeDataset(Dataset):
    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def __len__(self):
        return self.x_train.shape[0]
    def __getitem__(self, ind):
        x = self.x_train[ind]
        y = emotion_list.index(y_train[ind]) # Get emotion as integer value, not String 
        return x, y


train_set = MakeDataset(x_train, y_train)
test_set = MakeDataset(x_test, y_test)

batch_size = 512 # Optimal size for my machine

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
  
class TwoLayerMLP(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerMLP, self).__init__()
        self.relu = torch.nn.ReLU() # Activation function 
        self.linear1 = torch.nn.Linear(D_in, H) # Hidden layer 
        self.linear2 = torch.nn.Linear(H, D_out) # Output: 8 classes 

    def forward(self, x):
        first = self.relu(x) 
        h_relu = self.linear1(first).clamp(min=0) # Clamp used to normalize data 
        second = self.relu(h_relu) 
        y_pred = self.linear2(second)
        return y_pred


model = TwoLayerMLP(180, 90, 8) # Model receives 180 features, outputs most likely membership among 8 classes 
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

epochs = 5001

total_train = 0
correct_train = 0
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_num, data in enumerate(train_loader):
        audio , label = data
        optimizer.zero_grad()
        outputs = model(audio.float())
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()
        
        pred_values, pred_indices = torch.max(outputs.data,1)
        total_train += float(label.size(0))
        correct_train += (sum(pred_indices == label)).item()
        
    model.eval()
    total = 0
    correct = 0
    for batch_num, data in enumerate(test_loader):
        audio , label = data
        outputs = model(audio.float())
        pred_values, pred_indices = torch.max(outputs.data,1)
        total += float(label.size(0))
        correct += (sum(pred_indices == label)).item()
    train_accuracy = 100.*correct_train/total_train
    test_accuracy = 100.*correct/total
    if (epoch % 1000 == 0):
        print("Epoch: %.0f, Training Loss: %.3f, Training Accuracy: %.3f, Testing Accuracy: %.3f" % (epoch, loss.item(), train_accuracy, test_accuracy))
    

Epoch: 0, Training Loss: 3.242, Training Accuracy: 14.167, Testing Accuracy: 13.611
Epoch: 1000, Training Loss: 0.147, Training Accuracy: 82.086, Testing Accuracy: 12.639
Epoch: 2000, Training Loss: 0.018, Training Accuracy: 90.826, Testing Accuracy: 12.083
Epoch: 3000, Training Loss: 0.004, Training Accuracy: 93.883, Testing Accuracy: 12.222
Epoch: 4000, Training Loss: 0.001, Training Accuracy: 95.412, Testing Accuracy: 12.361
Epoch: 5000, Training Loss: 0.000, Training Accuracy: 96.329, Testing Accuracy: 12.639


# Observe the high training accuracy and the low testing accuracy, this will be addressed but first let's see how our model classifies samples outside of the training dataset: 

## The Matrix (1999)

### "Copper Top"

In [12]:
fs, data = wavfile.read("clips/coppertop_mono.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("coppertop_mono.wav"))


with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
calm


  test = torch.FloatTensor(load_test_data("coppertop_mono.wav"))


### "Just Doing My Job" 

In [13]:
fs, data = wavfile.read("clips/doing_job_mono.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("doing_job_mono.wav"))


with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
happy


### "Two Lives"

In [14]:
fs, data = wavfile.read("clips/garbage_mono.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("garbage_mono.wav"))


with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
happy


### "Mr. Rhinehart" 

In [15]:
fs, data = wavfile.read("clips/mr_rhinehart_mono.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("mr_rhinehart_mono.wav"))


with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
sad


### "Perfect World" 

In [16]:
fs, data = wavfile.read("clips/perfect_world_mono.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("perfect_world_mono.wav"))


with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
sad


### "Red Pill" 

In [17]:
fs, data = wavfile.read("clips/red_pill_mono.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("red_pill_mono.wav"))


with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
sad


### "Remember Nothing"

In [18]:
fs, data = wavfile.read("clips/rem_nothing_mono.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("rem_nothing_mono.wav"))


with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
sad


### "Why You're Here" 

In [19]:
fs, data = wavfile.read("clips/neo_morph.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("neo_morph.wav"))

with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
sad



### "What You've Been Doing"

In [20]:
s, data = wavfile.read("clips/neo_trin.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("neo_trin.wav"))

with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
happy


### "Chicken"

In [21]:
fs, data = wavfile.read("clips/chicken.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("chicken.wav"))

with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))



Emotion Classified As:
sad


## Trying test samples from outside of 'The Matrix'

## Bowling, The Big Lewbowski, 1998 

In [22]:
fs, data = wavfile.read("clips/bowling.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("bowling.wav"))


with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))


Emotion Classified As:
happy


## Bill Clinton's Denial (1998) 


In [23]:
fs, data = wavfile.read("clips/bill_1.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("bill_1.wav"))


with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
surprised


## Bill Clinton's Apology (1998)

In [24]:
fs, data = wavfile.read("clips/bill_2.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("bill_2.wav"))

with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
surprised


## Zed's Dead, Pulp Fiction (1994) 

In [25]:
fs, data = wavfile.read("clips/zeds_dead.wav")
sd.play(data, fs)

test = torch.FloatTensor(load_test_data("zeds_dead.wav"))

with torch.no_grad():
    prediction = model(test)
    predicted_class = np.argmax(prediction).tolist()
    print(emotions_int.get(predicted_class + 1))

Emotion Classified As:
disgust


# Striving for higher testing accuracy

### Normalize input data before it meets the neural net:

https://pytorch.org/docs/stable/generated/torch.nn.functional.normalize.html

In [26]:
class MakeDataset(Dataset):
    def __init__(self, x_train, y_train):
        self.x_train = nn.functional.normalize(torch.from_numpy(x_train)) # Normalize 
        self.y_train = y_train
    def __len__(self):
        return self.x_train.shape[0]
    def __getitem__(self, ind):
        x = self.x_train[ind]
        y = emotion_list.index(y_train[ind])
        return x, y


train_set = MakeDataset(x_train, y_train)
test_set = MakeDataset(x_test, y_test)

batch_size = 512

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
  
class TwoLayerMLP(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerMLP, self).__init__()
        self.relu = torch.nn.ReLU()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        first = self.relu(x)
        h_relu = self.linear1(first).clamp(min=0) 
        second = self.relu(h_relu)
        y_pred = self.linear2(second)
        return y_pred


model = TwoLayerMLP(180, 90, 8)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

epochs = 5001

total_train = 0
correct_train = 0
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_num, data in enumerate(train_loader):
        audio , label = data
        optimizer.zero_grad()
        outputs = model(audio.float())
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()
        
        pred_values, pred_indices = torch.max(outputs.data,1)
        total_train += float(label.size(0))
        correct_train += (sum(pred_indices == label)).item()
        
    model.eval()
    total = 0
    correct = 0
    for batch_num, data in enumerate(test_loader):
        audio , label = data
        outputs = model(audio.float())
        pred_values, pred_indices = torch.max(outputs.data,1)
        total += float(label.size(0))
        correct += (sum(pred_indices == label)).item()
    train_accuracy = 100.*correct_train/total_train
    test_accuracy = 100.*correct/total
    if (epoch % 1000 == 0):
        print("Epoch: %.0f, Training Loss: %.3f, Training Accuracy: %.3f, Testing Accuracy: %.3f" % (epoch, loss.item(), train_accuracy, test_accuracy))
    

Epoch: 0, Training Loss: 2.086, Training Accuracy: 13.194, Testing Accuracy: 13.194
Epoch: 1000, Training Loss: 1.680, Training Accuracy: 31.846, Testing Accuracy: 14.028
Epoch: 2000, Training Loss: 1.572, Training Accuracy: 36.622, Testing Accuracy: 14.583
Epoch: 3000, Training Loss: 1.424, Training Accuracy: 40.052, Testing Accuracy: 13.750
Epoch: 4000, Training Loss: 1.320, Training Accuracy: 42.787, Testing Accuracy: 12.500
Epoch: 5000, Training Loss: 1.187, Training Accuracy: 45.242, Testing Accuracy: 12.639


### This hasn't improved our testing accuracy, but diminished training accuracy.

###  Next, remove the normalization implemented above and instead use batch normalizaton:

https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html

In [27]:
class MakeDataset(Dataset):
    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def __len__(self):
        return self.x_train.shape[0]
    def __getitem__(self, ind):
        x = self.x_train[ind]
        y = emotion_list.index(y_train[ind])
        return x, y


train_set = MakeDataset(x_train, y_train)
test_set = MakeDataset(x_test, y_test)

batch_size = 512

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
  
class TwoLayerMLP(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerMLP, self).__init__()
        self.relu = torch.nn.ReLU()
        self.batch_norm = torch.nn.BatchNorm1d(D_in) # Batch normalization 
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        first = self.batch_norm(x)
        second = self.relu(first)
        hidden = self.linear1(second).clamp(min=0)
        h_relu = self.relu(hidden)
        y_pred = self.linear2(h_relu)
        return y_pred


model = TwoLayerMLP(180, 90, 8)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

epochs = 5001

total_train = 0
correct_train = 0
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_num, data in enumerate(train_loader):
        audio , label = data
        optimizer.zero_grad()
        outputs = model(audio.float())
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()
        
        pred_values, pred_indices = torch.max(outputs.data,1)
        total_train += float(label.size(0))
        correct_train += (sum(pred_indices == label)).item()
        
    model.eval()
    total = 0
    correct = 0
    for batch_num, data in enumerate(test_loader):
        audio , label = data
        outputs = model(audio.float())
        pred_values, pred_indices = torch.max(outputs.data,1)
        total += float(label.size(0))
        correct += (sum(pred_indices == label)).item()
    train_accuracy = 100.*correct_train/total_train
    test_accuracy = 100.*correct/total
    if (epoch % 1000 == 0):
        print("Epoch: %.0f, Training Loss: %.3f, Training Accuracy: %.3f, Testing Accuracy: %.3f" % (epoch, loss.item(), train_accuracy, test_accuracy))
    

Epoch: 0, Training Loss: 2.088, Training Accuracy: 7.361, Testing Accuracy: 12.778
Epoch: 1000, Training Loss: 0.036, Training Accuracy: 89.963, Testing Accuracy: 12.500
Epoch: 2000, Training Loss: 0.008, Training Accuracy: 94.971, Testing Accuracy: 12.778
Epoch: 3000, Training Loss: 0.001, Training Accuracy: 96.645, Testing Accuracy: 12.361
Epoch: 4000, Training Loss: 0.001, Training Accuracy: 97.481, Testing Accuracy: 12.222
Epoch: 5000, Training Loss: 0.000, Training Accuracy: 97.984, Testing Accuracy: 12.361


### This yields higher training accuracy, but testing accuracy is unaffected 

### Next, add dropout, which will zero out every Pth element on forward calls, preventing neuronal co-adaptation:

https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html

In [28]:
class MakeDataset(Dataset):
    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def __len__(self):
        return self.x_train.shape[0]
    def __getitem__(self, ind):
        x = self.x_train[ind]
        y = emotion_list.index(y_train[ind])
        return x, y


train_set = MakeDataset(x_train, y_train)
test_set = MakeDataset(x_test, y_test)

batch_size = 512

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
  
class TwoLayerMLP(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerMLP, self).__init__()
        self.relu = torch.nn.ReLU(inplace=True)
        self.batch_norm = torch.nn.BatchNorm1d(D_in) # Batch normalization 
        self.dropout = torch.nn.Dropout(0.2)         # Dropout 
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        first = self.batch_norm(x)
        dropped = self.dropout(first)
        first_relu = self.relu(dropped)
        hidden = self.linear1(first_relu).clamp(min=0)
        h_relu = self.relu(hidden)
        y_pred = self.linear2(hidden)
        return y_pred


model = TwoLayerMLP(180, 90, 8)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

epochs = 5001

total_train = 0
correct_train = 0
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_num, data in enumerate(train_loader):
        audio , label = data
        optimizer.zero_grad()
        outputs = model(audio.float())
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()
        
        pred_values, pred_indices = torch.max(outputs.data,1)
        total_train += float(label.size(0))
        correct_train += (sum(pred_indices == label)).item()
        
    model.eval()
    total = 0
    correct = 0
    for batch_num, data in enumerate(test_loader):
        audio , label = data
        outputs = model(audio.float())
        pred_values, pred_indices = torch.max(outputs.data,1)
        total += float(label.size(0))
        correct += (sum(pred_indices == label)).item()
    train_accuracy = 100.*correct_train/total_train
    test_accuracy = 100.*correct/total
    if (epoch % 1000 == 0):
        print("Epoch: %.0f, Training Loss: %.3f, Training Accuracy: %.3f, Testing Accuracy: %.3f" % (epoch, loss.item(), train_accuracy, test_accuracy))
    

Epoch: 0, Training Loss: 2.050, Training Accuracy: 14.583, Testing Accuracy: 13.194
Epoch: 1000, Training Loss: 0.315, Training Accuracy: 74.615, Testing Accuracy: 13.194
Epoch: 2000, Training Loss: 0.215, Training Accuracy: 82.750, Testing Accuracy: 11.944
Epoch: 3000, Training Loss: 0.187, Training Accuracy: 86.462, Testing Accuracy: 11.250
Epoch: 4000, Training Loss: 0.194, Training Accuracy: 88.639, Testing Accuracy: 11.806
Epoch: 5000, Training Loss: 0.227, Training Accuracy: 90.085, Testing Accuracy: 11.389


### This has not helped, and if anything has slightly worsened our train and test accuracy, it is safe to say that overfitting was not our problem. 

P was set to 0.2, which 'dropped' every fifth input element. 

### Return to original model and use a different optimizer (RMSProp): 

In [29]:
class MakeDataset(Dataset):
    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def __len__(self):
        return self.x_train.shape[0]
    def __getitem__(self, ind):
        x = self.x_train[ind]
        y = emotion_list.index(y_train[ind])
        return x, y


train_set = MakeDataset(x_train, y_train)
test_set = MakeDataset(x_test, y_test)

batch_size = 512

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
  
class TwoLayerMLP(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerMLP, self).__init__()
        self.relu = torch.nn.ReLU()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        first = self.relu(x)
        h_relu = self.linear1(first).clamp(min=0)
        second = self.relu(h_relu)
        y_pred = self.linear2(second)
        return y_pred


model = TwoLayerMLP(180, 90, 8)
optimizer = torch.optim.RMSprop(model.parameters())
criterion = nn.CrossEntropyLoss()

epochs = 5001

total_train = 0
correct_train = 0
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_num, data in enumerate(train_loader):
        audio , label = data
        optimizer.zero_grad()
        outputs = model(audio.float())
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()
        
        pred_values, pred_indices = torch.max(outputs.data,1)
        total_train += float(label.size(0))
        correct_train += (sum(pred_indices == label)).item()
        
    model.eval()
    total = 0
    correct = 0
    for batch_num, data in enumerate(test_loader):
        audio , label = data
        outputs = model(audio.float())
        pred_values, pred_indices = torch.max(outputs.data,1)
        total += float(label.size(0))
        correct += (sum(pred_indices == label)).item()
    train_accuracy = 100.*correct_train/total_train
    test_accuracy = 100.*correct/total
    if (epoch % 1000 == 0):
        print("Epoch: %.0f, Training Loss: %.3f, Training Accuracy: %.3f, Testing Accuracy: %.3f" % (epoch, loss.item(), train_accuracy, test_accuracy))
    

Epoch: 0, Training Loss: 32.152, Training Accuracy: 13.056, Testing Accuracy: 13.194
Epoch: 1000, Training Loss: 0.674, Training Accuracy: 69.598, Testing Accuracy: 13.750
Epoch: 2000, Training Loss: 0.639, Training Accuracy: 76.933, Testing Accuracy: 13.750
Epoch: 3000, Training Loss: 0.255, Training Accuracy: 80.574, Testing Accuracy: 15.000
Epoch: 4000, Training Loss: 0.237, Training Accuracy: 82.884, Testing Accuracy: 13.611
Epoch: 5000, Training Loss: 0.379, Training Accuracy: 84.573, Testing Accuracy: 12.778


### Using RMSProp instead of Adam did yield slightly greater testing accuracy, but could this be because RMSProp uses a default learning rate of 1e-2 while Adam uses a default learning rate of 1e-3? 

### Next, use RMSProp again but set learning rate = 1e-1: 

In [30]:
class MakeDataset(Dataset):
    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def __len__(self):
        return self.x_train.shape[0]
    def __getitem__(self, ind):
        x = self.x_train[ind]
        y = emotion_list.index(y_train[ind])
        return x, y


train_set = MakeDataset(x_train, y_train)
test_set = MakeDataset(x_test, y_test)

batch_size = 512

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)

class TwoLayerMLP(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerMLP, self).__init__()
        self.relu = torch.nn.ReLU()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        first = self.relu(x)
        h_relu = self.linear1(first).clamp(min=0)
        second = self.relu(h_relu)
        y_pred = self.linear2(second)
        return y_pred


model = TwoLayerMLP(180, 90, 8)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

epochs = 5001

total_train = 0
correct_train = 0
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_num, data in enumerate(train_loader):
        audio , label = data
        optimizer.zero_grad()
        outputs = model(audio.float())
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()
        
        pred_values, pred_indices = torch.max(outputs.data,1)
        total_train += float(label.size(0))
        correct_train += (sum(pred_indices == label)).item()
        
    model.eval()
    total = 0
    correct = 0
    for batch_num, data in enumerate(test_loader):
        audio , label = data
        outputs = model(audio.float())
        pred_values, pred_indices = torch.max(outputs.data,1)
        total += float(label.size(0))
        correct += (sum(pred_indices == label)).item()
    train_accuracy = 100.*correct_train/total_train
    test_accuracy = 100.*correct/total
    if (epoch % 1000 == 0):
        print("Epoch: %.0f, Training Loss: %.3f, Training Accuracy: %.3f, Testing Accuracy: %.3f" % (epoch, loss.item(), train_accuracy, test_accuracy))
    

Epoch: 0, Training Loss: 3421.895, Training Accuracy: 8.472, Testing Accuracy: 13.750
Epoch: 1000, Training Loss: 2.069, Training Accuracy: 15.452, Testing Accuracy: 13.333
Epoch: 2000, Training Loss: 2.051, Training Accuracy: 14.545, Testing Accuracy: 13.611
Epoch: 3000, Training Loss: 2.059, Training Accuracy: 14.241, Testing Accuracy: 13.333
Epoch: 4000, Training Loss: 2.090, Training Accuracy: 14.090, Testing Accuracy: 13.889
Epoch: 5000, Training Loss: 2.044, Training Accuracy: 14.003, Testing Accuracy: 13.333


### Not much improvement, next try SGD with learning rate = 0.1

In [31]:
class MakeDataset(Dataset):
    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def __len__(self):
        return self.x_train.shape[0]
    def __getitem__(self, ind):
        x = self.x_train[ind]
        y = emotion_list.index(y_train[ind])
        return x, y


train_set = MakeDataset(x_train, y_train)
test_set = MakeDataset(x_test, y_test)

batch_size = 512

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
  
class TwoLayerMLP(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerMLP, self).__init__()
        self.relu = torch.nn.ReLU()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        first = self.relu(x)
        h_relu = self.linear1(first).clamp(min=0)
        second = self.relu(h_relu)
        y_pred = self.linear2(second)
        return y_pred


model = TwoLayerMLP(180, 90, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

epochs = 5001

total_train = 0
correct_train = 0
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_num, data in enumerate(train_loader):
        audio , label = data
        optimizer.zero_grad()
        outputs = model(audio.float())
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()
        
        pred_values, pred_indices = torch.max(outputs.data,1)
        total_train += float(label.size(0))
        correct_train += (sum(pred_indices == label)).item()
        
    model.eval()
    total = 0
    correct = 0
    for batch_num, data in enumerate(test_loader):
        audio , label = data
        outputs = model(audio.float())
        pred_values, pred_indices = torch.max(outputs.data,1)
        total += float(label.size(0))
        correct += (sum(pred_indices == label)).item()
    train_accuracy = 100.*correct_train/total_train
    test_accuracy = 100.*correct/total
    if (epoch % 1000 == 0):
        print("Epoch: %.0f, Training Loss: %.3f, Training Accuracy: %.3f, Testing Accuracy: %.3f" % (epoch, loss.item(), train_accuracy, test_accuracy))
    

Epoch: 0, Training Loss: 8.799, Training Accuracy: 12.778, Testing Accuracy: 13.472
Epoch: 1000, Training Loss: 1.796, Training Accuracy: 25.619, Testing Accuracy: 13.611
Epoch: 2000, Training Loss: 1.548, Training Accuracy: 29.010, Testing Accuracy: 12.361
Epoch: 3000, Training Loss: 1.707, Training Accuracy: 30.761, Testing Accuracy: 15.139
Epoch: 4000, Training Loss: 1.688, Training Accuracy: 31.987, Testing Accuracy: 14.722
Epoch: 5000, Training Loss: 1.728, Training Accuracy: 32.775, Testing Accuracy: 14.028


### SGD was not helpful, next let's return to the orginal model and experment with adding hidden layers: 

### Our original model contained only one hidden layer: 180 features > 90 (HL) > 8 classes 

### Try 180 features > 90 > 45 > 8 

In [32]:
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

class MakeDataset(Dataset):
    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def __len__(self):
        return self.x_train.shape[0]
    def __getitem__(self, ind):
        x = self.x_train[ind]
        y = emotion_list.index(y_train[ind])
        return x, y


train_set = MakeDataset(x_train, y_train)
test_set = MakeDataset(x_test, y_test)

batch_size = 512

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
  
class TwoLayerMLP(torch.nn.Module):
    def __init__(self):
        super(TwoLayerMLP, self).__init__()
        self.relu = torch.nn.ReLU()
        self.linear1 = torch.nn.Linear(180, 90)
        self.linear2 = torch.nn.Linear(90, 45)
        self.linear3 = torch.nn.Linear(45, 8)

    def forward(self, x):
        first = self.relu(x)
        h_relu = self.linear1(first).clamp(min=0)
        second_relu = self.relu(h_relu)
        h_2 = self.linear2(second_relu)
        third_relu = self.relu(h_2)
        y_pred = self.linear3(third_relu)
        return y_pred


model = TwoLayerMLP()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

epochs = 5001

total_train = 0
correct_train = 0
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_num, data in enumerate(train_loader):
        audio , label = data
        optimizer.zero_grad()
        outputs = model(audio.float())
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()
        
        pred_values, pred_indices = torch.max(outputs.data,1)
        total_train += float(label.size(0))
        correct_train += (sum(pred_indices == label)).item()
        
    model.eval()
    total = 0
    correct = 0
    for batch_num, data in enumerate(test_loader):
        audio , label = data
        outputs = model(audio.float())
        pred_values, pred_indices = torch.max(outputs.data,1)
        total += float(label.size(0))
        correct += (sum(pred_indices == label)).item()
    train_accuracy = 100.*correct_train/total_train
    test_accuracy = 100.*correct/total
    if (epoch % 1000 == 0):
        print("Epoch: %.0f, Training Loss: %.3f, Training Accuracy: %.3f, Testing Accuracy: %.3f" % (epoch, loss.item(), train_accuracy, test_accuracy))
    

Epoch: 0, Training Loss: 2.217, Training Accuracy: 13.472, Testing Accuracy: 13.611
Epoch: 1000, Training Loss: 0.015, Training Accuracy: 87.950, Testing Accuracy: 10.278
Epoch: 2000, Training Loss: 0.001, Training Accuracy: 93.972, Testing Accuracy: 9.583
Epoch: 3000, Training Loss: 0.000, Training Accuracy: 95.981, Testing Accuracy: 10.139
Epoch: 4000, Training Loss: 0.000, Training Accuracy: 96.985, Testing Accuracy: 10.278
Epoch: 5000, Training Loss: 0.000, Training Accuracy: 97.588, Testing Accuracy: 10.000


### This is trending in the right direction, so let's add more hidden layers: 

### 180 > 150 > 120 > 90 > 60 > 30 > 15 > 8

In [33]:
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

class MakeDataset(Dataset):
    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def __len__(self):
        return self.x_train.shape[0]
    def __getitem__(self, ind):
        x = self.x_train[ind]
        y = emotion_list.index(y_train[ind])
        return x, y


train_set = MakeDataset(x_train, y_train)
test_set = MakeDataset(x_test, y_test)

batch_size = 512

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
  
class TwoLayerMLP(torch.nn.Module):
    def __init__(self):
        super(TwoLayerMLP, self).__init__()
        self.relu = torch.nn.ReLU()
        self.linear1 = torch.nn.Linear(180, 150)
        self.linear2 = torch.nn.Linear(150, 120)
        self.linear3 = torch.nn.Linear(120, 90)
        self.linear4 = torch.nn.Linear(90, 60)
        self.linear5 = torch.nn.Linear(60, 30)
        self.linear6 = torch.nn.Linear(30, 15)
        self.linear7 = torch.nn.Linear(15, 8)

    def forward(self, x):
        first = self.relu(x)
        h_relu = self.linear1(first).clamp(min=0)
        h_2 = self.linear2(h_relu)
        rel_2 = self.relu(h_2)
        h_3 = self.linear3(rel_2)
        rel_3 = self.relu(h_3)
        h_4 = self.linear4(rel_3)
        rel_4 = self.relu(h_4)
        h_5 = self.linear5(rel_4)
        rel_5 = self.relu(h_5)
        h_6 = self.linear6(h_5)
        rel_6 = self.relu(h_6)
        y_pred = self.linear7(rel_6)
        return y_pred


model = TwoLayerMLP()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

epochs = 5001

total_train = 0
correct_train = 0
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_num, data in enumerate(train_loader):
        audio , label = data
        optimizer.zero_grad()
        outputs = model(audio.float())
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()
        
        pred_values, pred_indices = torch.max(outputs.data,1)
        total_train += float(label.size(0))
        correct_train += (sum(pred_indices == label)).item()
        
    model.eval()
    total = 0
    correct = 0
    for batch_num, data in enumerate(test_loader):
        audio , label = data
        outputs = model(audio.float())
        pred_values, pred_indices = torch.max(outputs.data,1)
        total += float(label.size(0))
        correct += (sum(pred_indices == label)).item()
    train_accuracy = 100.*correct_train/total_train
    test_accuracy = 100.*correct/total
    if (epoch % 1000 == 0):
        print("Epoch: %.0f, Training Loss: %.3f, Training Accuracy: %.3f, Testing Accuracy: %.3f" % (epoch, loss.item(), train_accuracy, test_accuracy))
    

Epoch: 0, Training Loss: 2.096, Training Accuracy: 13.333, Testing Accuracy: 13.611
Epoch: 1000, Training Loss: 0.003, Training Accuracy: 87.215, Testing Accuracy: 13.056
Epoch: 2000, Training Loss: 0.000, Training Accuracy: 93.604, Testing Accuracy: 12.500
Epoch: 3000, Training Loss: 0.000, Training Accuracy: 95.736, Testing Accuracy: 11.944
Epoch: 4000, Training Loss: 0.000, Training Accuracy: 96.801, Testing Accuracy: 11.806
Epoch: 5000, Training Loss: 0.000, Training Accuracy: 97.441, Testing Accuracy: 11.667


### This appears to have slightly increased training accuracy, and slightly worsened testing accuracy 