# LAB 3: Speech-to-text with RNN



![alt text](https://miro.medium.com/max/556/1*NhOH4X9wKWfO6o8faYFf-w.png)

# Introduction

In this lab, we are going to use Recurrent Neural Networks to do some Speech Recognition. 

Nowadays, speech recognition is a common task present in smart home assistants (Amazon Echo, Google Home), phones, TVs... Most of the time, it is done using deep learning.

## What you will learn

- The different kinds of RNN (RNN, LSTM, GRU...)

- How to load and process audio data in PyTorch

- How to implement an RNN in PyTorch

- How to create a confusion matrix


## RNN

Recurrent Neural Networks are a kind of Neural Network used to process **sequences** of data.
These sequences can be of varying length and usually have some context information.

For example, sentences (text), audio, videos have some temporal logic. In a sentence, one word depends on the word that comes before it. In a video, one frame probably looks a lot like the previous frames.

RNNs have some kind of **persistence** of information during the processing of a sequence. Thus, RNNs are used for lots of things: sentiment analysis, text completion, speech recognition, etc.

![alt text](https://www.researchgate.net/profile/Weijiang_Feng/publication/318332317/figure/fig1/AS:614309562437664@1523474221928/The-standard-RNN-and-unfolded-RNN.png)
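
The recurrence pictured above can be sketched in a few lines. This is a minimal illustration with scalar weights, not the PyTorch implementation; the function name `rnn_forward` and the weight values are made up for the example.

```python
import math

def rnn_forward(sequence, w_x, w_h, b, h0=0.0):
    """Run a scalar vanilla RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in sequence:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# The same weights handle sequences of different lengths.
short = rnn_forward([1.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
long = rnn_forward([1.0, 0.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
print(len(short), len(long))

# The first input still influences the last state through h.
print(long[-1])
```

Only the first input is non-zero here, yet the last hidden state is not zero: this is the persistence of information described above.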

## LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit)

**RNN Short-term memory problem:**

*   Recurrent Neural Networks suffer from short-term memory. If a sequence is long enough, they have a hard time carrying information from earlier time steps to later ones. So if you are trying to process a paragraph of text to make predictions, RNNs may leave out important information from the beginning.

*   During backpropagation, recurrent neural networks suffer from the vanishing gradient problem. Gradients are the values used to update a neural network's weights. The vanishing gradient problem occurs when the gradient shrinks as it backpropagates through time. If a gradient value becomes extremely small, it contributes very little to learning.
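
To make the vanishing gradient concrete, here is a small sketch using a scalar RNN `h_t = tanh(w_h * h_{t-1})`. The function name and the weight value 0.9 are assumptions chosen for illustration only.

```python
import math

def gradient_through_time(w_h, steps, h0=0.5):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w_h * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h)
        # Chain rule: d h_t / d h_{t-1} = w_h * (1 - h_t^2)
        grad *= w_h * (1.0 - h * h)
    return abs(grad)

for steps in (1, 10, 50):
    print(steps, gradient_through_time(w_h=0.9, steps=steps))
```

Each step multiplies the gradient by a factor smaller than 1, so the contribution of early time steps shrinks quickly as the sequence gets longer.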

**As a solution to short-term memory, LSTM and GRU were created:**

*   LSTM was introduced in this [article](https://www.bioinf.jku.at/publications/older/2604.pdf).
*   GRU was introduced in this [article](https://arxiv.org/pdf/1412.3555.pdf).

Both are **Recurrent Neural Network (RNN)** architectures designed to address this short-term memory problem. They have internal mechanisms called gates that regulate the flow of information.

![alt text](http://dprogrammer.org/wp-content/uploads/2019/04/RNN-vs-LSTM-vs-GRU-1200x361.png)

These gates learn which data in a sequence is important to keep and which to throw away. This lets the network pass relevant information down a long sequence to make predictions.

**LSTM**

LSTMs can remove information from, or add information to, the cell state. These changes are carefully regulated by structures called gates.

*   the **cell state** carries information through the sequence; it changes only through the gates' decisions, which makes it easy for information to pass through the cell
*   the **forget gate** decides what information should be thrown away from the cell state
*   the **input gate** uses a sigmoid to decide which values to update; it is combined with a tanh layer that proposes candidate values, and together they form the update to the state
*   the **output gate** outputs a filtered version of the cell state
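
The four components above are usually written as the following update equations. This is the standard textbook formulation; the symbol names ($W$, $U$, $b$ for weights and biases, $\sigma$ for the sigmoid, $\odot$ for element-wise product) are notation chosen here, not taken from the lab.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```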

**GRU**

The GRU is a modified version of the LSTM. It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
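
For comparison with the LSTM, a common way to write the GRU update is shown below. The notation is again chosen here for illustration, and conventions differ: some implementations swap the roles of $z_t$ and $1 - z_t$.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(hidden state)}
\end{aligned}
```

There is no separate cell state: the hidden state $h_t$ plays both roles, which is why the GRU has fewer parameters than an LSTM of the same size.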



## Libraries

Since we are working with PyTorch and sounds, we are going to use *torchaudio* instead of *torchvision*, this time.

Make sure you are using a GPU Runtime! (Runtime -> Change Runtime type)

In [None]:
!pip install torchaudio==0.9.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchaudio==0.9.0
  Downloading torchaudio-0.9.0-cp38-cp38-manylinux1_x86_64.whl (1.9 MB)
Collecting torch==1.9.0
  Downloading torch-1.9.0-cp38-cp38-manylinux1_x86_64.whl (831.4 MB)
Installing collected packages: torch, torchaudio
  Attempting uninstall: torch
    Found existing installation: torch 1.13.1+cu116
    Uninstalling torch-1.13.1+cu116:
      Successfully uninstalled torch-1.13.1+cu116
  Attempting uninstall: torchaudio
    Found existing installation: torchaudio 0.13.1+cu116
    Uninstalling torchaudio-0.13.1+cu116:
      Successfully uninstalled torchaudio-0.13.1+cu116
ERROR: pip's dependency resolver does not current

In [None]:
from IPython.display import Audio

## PyTorch things
import torch
import torchaudio
import torch.nn.functional as F

## Other libs
import matplotlib.pyplot as plt
import glob
import os
import random
from tqdm import tqdm_notebook
import torchsummary
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import normalize
import pandas as pd
import seaborn as sn

In [None]:
import os
import random
import numpy as np
import tensorflow as tf
import torch

DEFAULT_RANDOM_SEED = 2021

# Python, hash and NumPy seeds
def seedBasic(seed=DEFAULT_RANDOM_SEED):
  random.seed(seed)
  os.environ['PYTHONHASHSEED'] = str(seed)
  np.random.seed(seed)

# tensorflow random seed
def seedTF(seed=DEFAULT_RANDOM_SEED):
  tf.random.set_seed(seed)

# torch random seed
def seedTorch(seed=DEFAULT_RANDOM_SEED):
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False

# basic + tensorflow + torch
def seedEverything(seed=DEFAULT_RANDOM_SEED):
  seedBasic(seed)
  seedTF(seed)
  seedTorch(seed)

seedEverything(1004)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# Part 1: Working with audio data

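The dataset we are using is Google's Speech Commands Dataset (https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html).

The original release is described as **"65,000 one-second long utterances of 30 short words, by thousands of different people"**. We download version 0.02, which is larger: as the outputs below show, it contains 35 words and more than 105,000 utterances.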

Let's download the dataset:

In [None]:
!rm -rf ./*
!wget -O speech_commands_v0.02.tar.gz http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
!tar xzf speech_commands_v0.02.tar.gz 
!ls

--2023-02-17 16:51:48--  http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 74.125.68.128, 2404:6800:4003:c02::80
Connecting to download.tensorflow.org (download.tensorflow.org)|74.125.68.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2428923189 (2.3G) [application/gzip]
Saving to: ‘speech_commands_v0.02.tar.gz’


2023-02-17 16:51:57 (246 MB/s) - ‘speech_commands_v0.02.tar.gz’ saved [2428923189/2428923189]

_background_noise_  four     on				   tree
backward	    go	     one			   two
bed		    happy    README.md			   up
bird		    house    right			   validation_list.txt
cat		    learn    seven			   visual
dog		    left     sheila			   wow
down		    LICENSE  six			   yes
eight		    marvin   speech_commands_v0.02.tar.gz  zero
five		    nine     stop
follow		    no	     testing_list.txt
forward		    off      three


Let's print the different classes (words) that are part of this dataset.

We can see there are 35 different words.

In [None]:
# Every class is a directory: skip regular files, hidden directories (.config)
# and the _background_noise_ directory
classes = [d for d in os.listdir() if os.path.isdir(d) and not d.startswith((".", "_"))]
print(classes)
print("Number of classes", len(classes))

['wow', 'right', 'stop', 'six', 'cat', 'go', 'nine', 'learn', 'yes', 'forward', 'bed', 'left', 'two', 'marvin', 'follow', 'happy', 'dog', 'down', 'sheila', 'house', 'five', 'tree', 'no', 'four', 'zero', 'one', 'eight', 'visual', 'seven', 'bird', 'three', 'up', 'on', 'backward', 'off']
Number of classes 35


## Q1: Listen to some samples

Using **Audio(filename)** from IPython.display, you can listen to an audio file directly in Colab.

Try it on some samples!

## Q2: Displaying a waveform

Use **torchaudio.load** to load an audio file. Then, use matplotlib to display it.

HINT: you may have to transpose the waveform with **.t()** in order to display it

## Computing MFCC features

Extracting MFCC (**Mel Frequency Cepstral Coefficients**) features is a well-known signal processing technique, especially used in **ASR** (Automatic Speech Recognition). These features are meant to represent the way humans perceive sound (see https://en.wikipedia.org/wiki/Mel-frequency_cepstrum).

*Torchaudio* has transforms (just like the ones in *torchvision*) that allow you to compute these features:

Here, we are only keeping 12 MFCC features because it is enough for our purposes.

As you can see, we are getting a Tensor of shape [1, 12, 81], because we have one audio channel (mono) with 12 coefficients over 81 time windows.
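
The 81 time windows can be recovered with a little arithmetic. The sketch below assumes a one-second clip sampled at 16 kHz and the default framing of torchaudio's MFCC transform (a 400-sample window, a hop of 200 samples, centered frames). These defaults are an assumption about the library version in use, and `num_frames` is a helper written for this example.

```python
def num_frames(num_samples, hop_length, center=True, n_fft=400):
    """Number of STFT frames for a signal of num_samples samples."""
    if center:
        # The signal is padded by n_fft // 2 on both sides before framing.
        return 1 + num_samples // hop_length
    return 1 + (num_samples - n_fft) // hop_length

NUM_SAMPLES = 16000  # one second of audio at 16 kHz (assumed)
HOP_LENGTH = 200     # assumed default: half of the 400-sample window

print(num_frames(NUM_SAMPLES, HOP_LENGTH))  # 81
```

Each of the 81 frames is summarized by 12 coefficients, which gives the [1, 12, 81] shape for a mono clip.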

## Creating a custom audio Dataset

You may have noticed that in this dataset, the test and validation datasets are given in testing_list.txt and validation_list.txt files. 

With that, we can infer a training list as well:

In [None]:
## Read the test list
with open("testing_list.txt") as testing_f:
  testing_list = [x.strip() for x in testing_f.readlines()]

## Read the val list
with open("validation_list.txt") as val_f:
  validation_list = [x.strip() for x in val_f.readlines()]  

print("Number of testing samples", len(testing_list))
print("Number of validation samples", len(validation_list))

## Construct a train list
training_list = []
for c in classes:
  training_list += glob.glob(c + "/*")

# Use a set for fast membership tests when excluding the test and validation files
held_out = set(testing_list) | set(validation_list)
training_list = [x for x in training_list if x not in held_out]
print("Number of training samples", len(training_list))

Number of testing samples 11005
Number of validation samples 9981
Number of training samples 84843


Now, we can create a custom SpeechDataset class that takes a file list as input.

In [None]:
from sklearn.preprocessing import normalize

class SpeechDataset(torch.utils.data.Dataset):

  def __init__(self, classes, file_list):

    self.classes = classes

    # create a map from class name to integer
    self.class_to_int = dict(zip(classes, range(len(classes))))

    # store the file names
    self.samples = file_list

    # store our MFCC transform
    self.mfcc_transform = torchaudio.transforms.MFCC(n_mfcc=12, log_mels=True)

  def __len__(self):
    return len(self.samples)

  def __getitem__(self, i):
    with torch.no_grad():
      # load a normalized waveform
      waveform, _ = torchaudio.load(self.samples[i], normalize=True)

      # if the waveform is too short (less than 1 second) we pad it with zeros
      if waveform.shape[1] < 16000:
        waveform = F.pad(input=waveform, pad=(0, 16000 - waveform.shape[1]), mode='constant', value=0)

      # apply the transform and reorder the result to (time, n_mfcc);
      # sklearn's normalize then scales each time frame to unit L2 norm
      # and returns a NumPy array
      mfcc = normalize(self.mfcc_transform(waveform).squeeze(0).transpose(0, 1))

    # get the label from the file name ("<class>/<file>.wav")
    label = self.samples[i].split("/")[0]

    # return the MFCC coefficients with the sample label
    return mfcc, self.class_to_int[label]

   

## Q3: Create instances of the SpeechDataset for the train and val sets

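Fill in the code below to create your Dataset objects for training.

In [None]:
# train_list.txt and test_list.txt hold one 1-based index per line,
# selecting a subset of training_list and testing_list
with open("train_list.txt") as train_f:
  train_indices = [int(line) for line in train_f]
with open("test_list.txt") as test_f:
  test_indices = [int(line) for line in test_f]

train_files = [training_list[i - 1] for i in train_indices]
# note: the validation subset is drawn from testing_list
val_files = [testing_list[i - 1] for i in test_indices]

train_set = SpeechDataset(classes, train_files)
val_set = SpeechDataset(classes, val_files)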




## Q4: Create Dataloaders for training and validation

Fill in the code below to create DataLoaders with the Datasets you just created.

Do not forget to add shuffling to the training DataLoader.

Print a batch of data to make sure everything works.

In [None]:
train_dl = torch.utils.data.DataLoader(
    train_set,
    batch_size=32,
    shuffle=True)

# validation data does not need to be shuffled
val_dl = torch.utils.data.DataLoader(
    val_set,
    batch_size=32,
    shuffle=False)

# print the shapes of one batch to make sure everything works
features, labels = next(iter(train_dl))
print(features.shape, labels.shape)


# Part 2: Implementing a simple Recurrent Neural Network

For our network, we are going to use an **RNN module** from torch.nn (which can have multiple layers, or cells).

This module has an **input size**, which in our case will be equal to **the number of MFCC features (12)**. The input size is the number of dimensions of **x** in the image below.

It also has a **hidden size**, which is the size of the output of each layer as well as the size of the internal representation of the features. We are going to choose **256** to start, but feel free to change that. This is the dimension of **h** in the image below.

PyTorch RNN modules have a **number of layers**, which is simply the number of stacked **RNN Cells**. We are going to use 2 cells here, but feel free to change that as well. This is the **depth** in the image below.

Then, in order to get as many outputs as there are classes in our dataset, we need a **Linear layer** that goes from **256 inputs (the hidden size) to 35 outputs (the number of classes).**

Finally, to output categorical probabilities, we use a **Softmax layer.** Because the loss used in Part 3 (NLLLoss) expects log-probabilities, the code uses **LogSoftmax**.

![alt text](https://i.stack.imgur.com/SjnTl.png)
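
To see how these sizes fit together, here is a small shape check on random data. It is a sketch, not the solution to the question below: the batch size of 4 is arbitrary, 81 is the number of MFCC time windows seen earlier, and 35 is the class count printed earlier in this notebook.

```python
import torch

rnn = torch.nn.RNN(input_size=12, hidden_size=256, num_layers=2, batch_first=True)
linear = torch.nn.Linear(in_features=256, out_features=35)

x = torch.randn(4, 81, 12)   # (batch, time steps, MFCC features)
out, h_n = rnn(x)

print(out.shape)   # output of the last layer at every time step
print(h_n.shape)   # final hidden state of each of the 2 layers

# Classify from the output at the last time step.
log_probs = torch.log_softmax(linear(out[:, -1, :]), dim=1)
print(log_probs.shape)
```

With `batch_first=True` the batch dimension comes first in `x` and `out`, which matches what a DataLoader produces.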

## Q4: Implement the network

Fill in the code below to implement the network.

In [None]:
class SpeechRNN(torch.nn.Module):

  def __init__(self):
    super(SpeechRNN, self).__init__()
    # recurrent layer: 12 MFCC features in, 48 hidden units out
    self.gru = torch.nn.GRU(input_size=12, hidden_size=48, num_layers=1, batch_first=True)
    self.out_layer1 = torch.nn.Linear(in_features=48, out_features=48)
    # one output per class (35 words)
    self.out_layer2 = torch.nn.Linear(in_features=48, out_features=35)
    # log-probabilities, as expected by NLLLoss
    self.softmax = torch.nn.LogSoftmax(dim=1)

  def forward(self, x):
    # x: (batch, time, 12) -> out: (batch, time, 48)
    out, _ = self.gru(x)
    # keep only the output at the last time step
    x = self.out_layer1(out[:, -1, :])
    x = self.out_layer2(x)
    return self.softmax(x)

Use this code to check that your implementation is working.

In [None]:
net = SpeechRNN()
# cast the batch to float32 to match the network's parameters
batch = next(iter(train_dl))[0].float()
y = net(batch)
print(batch.shape, y.shape)

# Part 3: Training the network

As usual, we need to define a loss and an optimizer. Since we have a categorical classification problem, we use cross-entropy (negative log likelihood).

We can use the Adam optimizer; feel free to change it or the learning rate.
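
The link between cross-entropy and negative log likelihood can be checked by hand. The sketch below uses plain Python lists instead of tensors; `log_softmax` and `nll_loss` are helper names written for this example, and the logits are made-up values.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def nll_loss(log_probs, target):
    """Negative log likelihood of the target class."""
    return -log_probs[target]

logits = [2.0, 0.5, -1.0]
log_probs = log_softmax(logits)

# Exponentiating the log-probabilities gives probabilities that sum to 1.
print(sum(math.exp(lp) for lp in log_probs))

# The loss is small when the target is the highest-scoring class.
print(nll_loss(log_probs, target=0), nll_loss(log_probs, target=2))
```

Applying the negative log likelihood to log-softmax outputs gives the cross-entropy of the predicted distribution against the true class.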

In [None]:
##RE-RUN THIS CODE TO GET A "NEW" NETWORK

## Create an instance of our network
net = SpeechRNN()

## Move it to the GPU (or keep it on the CPU if no GPU is available)
net = net.to(device)

# Negative log likelihood loss
criterion = torch.nn.NLLLoss()

# Adam optimizer
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
# Divide the learning rate by 10 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

## Q5: Training loop

We also need to write a training loop. Fill in the code below to create it:

In [None]:
## NUMBER OF EPOCHS TO TRAIN
N_EPOCHS = 100


epoch_loss, epoch_acc, epoch_val_loss, epoch_val_acc = [], [], [], []

for e in range(N_EPOCHS):

  print("EPOCH:",e)
  running_loss = 0
  running_accuracy = 0
  net.train()

  for batch_idx, (data, target) in enumerate(tqdm_notebook(train_dl)):
    # move the batch to the device and cast the features to float32
    data = data.to(device).float()
    target = target.to(device)

    optimizer.zero_grad()
    output = net(data)

    loss = criterion(output, target)

    loss.backward()
    optimizer.step()

    ## Compute some statistics
    with torch.no_grad():
      running_loss += loss.item()
      running_accuracy += (output.argmax(dim=1) == target).sum().item()

  # decay the learning rate once per epoch
  scheduler.step()

  # note: running_loss sums per-batch mean losses, so dividing by the number
  # of samples reports a value scaled down by roughly the batch size
  epoch_loss.append(running_loss/len(train_set))
  epoch_acc.append(running_accuracy/len(train_set))
  print("Training accuracy:", epoch_acc[-1],
        "Training loss:", epoch_loss[-1])

EPOCH: 0


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for batch_idx, (data, target) in enumerate(tqdm_notebook(train_dl)):


  0%|          | 0/375 [00:00<?, ?it/s]

Training accuracy: 0.15808333333333333 Training loss: 0.09205277930696805
EPOCH: 1
Training accuracy: 0.594 Training loss: 0.04202479535341263
EPOCH: 2
Training accuracy: 0.7286666666666667 Training loss: 0.028282379359006883
EPOCH: 3
Training accuracy: 0.7829166666666667 Training loss: 0.02296293906370799
EPOCH: 4
Training accuracy: 0.805 Training loss: 0.020145206773032744
EPOCH: 5
Training accuracy: 0.82825 Training loss: 0.01817012776620686
EPOCH: 6
Training accuracy: 0.8324166666666667 Training loss: 0.017399176536748807
EPOCH: 7
Training accuracy: 0.8415833333333333 Training loss: 0.0162699853470549
EPOCH: 8
Training accuracy: 0.8461666666666666 Training loss: 0.016073119500031072
EPOCH: 9
Training accuracy: 0.8529166666666667 Training loss: 0.01519863156788051
EPOCH: 10
Training accuracy: 0.9106666666666666 Training loss: 0.009233340898839136
EPOCH: 11
Training accuracy: 0.9269166666666667 Training loss: 0.007819235851522536
EPOCH: 12
Training accuracy: 0.9311666666666667 Training loss: 0.007259913827447842
EPOCH: 13
Training accuracy: 0.9360833333333334 Training loss: 0.006842654859181493
EPOCH: 14
Training accuracy: 0.9385833333333333 Training loss: 0.006509525277341405
EPOCH: 15
Training accuracy: 0.9399166666666666 Training loss: 0.0062653591361207265
EPOCH: 16
Training accuracy: 0.9440833333333334 Training loss: 0.005964045756496489
EPOCH: 17
Training accuracy: 0.9464166666666667 Training loss: 0.005680474691481019
EPOCH: 18
Training accuracy: 0.9506666666666667 Training loss: 0.005418772018747404
EPOCH: 19
Training accuracy: 0.9535 Training loss: 0.005193641852820292
EPOCH: 20
Training accuracy: 0.9605 Training loss: 0.004555320284018914
EPOCH: 21
Training accuracy: 0.9628333333333333 Training loss: 0.004437152363049487
EPOCH: 22
Training accuracy: 0.9634166666666667 Training loss: 0.0043898881431669
EPOCH: 23
Training accuracy: 0.9634166666666667 Training loss: 0.004341840752710898
EPOCH: 24
Training accuracy: 0.96475 Training loss: 0.0043048144804779444
EPOCH: 25
Training accuracy: 0.9646666666666667 Training loss: 0.004274178873902808
EPOCH: 26
Training accuracy: 0.96525 Training loss: 0.0042414506564770514
EPOCH: 27
Training accuracy: 0.9655 Training loss: 0.004209632202672462
EPOCH: 28
Training accuracy: 0.9650833333333333 Training loss: 0.004180840730162648
EPOCH: 29
Training accuracy: 0.966 Training loss: 0.004151700911810621
EPOCH: 30
Training accuracy: 0.9668333333333333 Training loss: 0.004061378840201845
EPOCH: 31
Training accuracy: 0.967 Training loss: 0.0040568541252675155
EPOCH: 32
Training accuracy: 0.9669166666666666 Training loss: 0.004052901551049823
EPOCH: 33
Training accuracy: 0.96725 Training loss: 0.00404923354795513
EPOCH: 34
Training accuracy: 0.9673333333333334 Training loss: 0.004045558640966192
EPOCH: 35
Training accuracy: 0.9673333333333334 Training loss: 0.004042716754367575
EPOCH: 36
Training accuracy: 0.9673333333333334 Training loss: 0.004039532097095313
EPOCH: 37
Training accuracy: 0.9674166666666667 Training loss: 0.004036222468595952
EPOCH: 38
Training accuracy: 0.9675 Training loss: 0.004033134864487996
EPOCH: 39
Training accuracy: 0.9675 Training loss: 0.004030204468561957
EPOCH: 40
Training accuracy: 0.9675833333333334 Training loss: 0.004020159310855282
EPOCH: 41
Training accuracy: 0.9675833333333334 Training loss: 0.004019865439583858
EPOCH: 42
Training accuracy: 0.9675833333333334 Training loss: 0.004019552567430461
EPOCH: 43
Training accuracy: 0.9675833333333334 Training loss: 0.004019266507627132
EPOCH: 44
Training accuracy: 0.9675833333333334 Training loss: 0.004018942814242716
EPOCH: 45
Training accuracy: 0.9676666666666667 Training loss: 0.004018631088625019
EPOCH: 46
Training accuracy: 0.9675833333333334 Training loss: 0.004018343918685181
EPOCH: 47
Training accuracy: 0.9676666666666667 Training loss: 0.004018005833029747
EPOCH: 48
Training accuracy: 0.9675833333333334 Training loss: 0.0040177153008213885
EPOCH: 49
Training accuracy: 0.9675833333333334 Training loss: 0.004017413373221643
EPOCH: 50

KeyboardInterrupt: ignored

## Q6: From RNN to LSTM/GRU

As you can see, the accuracy is pretty bad when we are only using "regular" RNNs. These are not used very much in practice nowadays because they do not have long-term memory. This means that by the time the network is done processing the whole audio sample, it probably has already forgotten the important parts of it. **Replace the RNN module in your network (Q4) with an LSTM or a GRU module (as you want). Train a new network and watch that accuracy go up!**
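
When swapping modules, keep in mind that the return values differ slightly. The sketch below runs a GRU and an LSTM on random data; the hidden size of 48 and the batch of 4 sequences are arbitrary choices for the example.

```python
import torch

x = torch.randn(4, 81, 12)  # (batch, time steps, MFCC features)

gru = torch.nn.GRU(input_size=12, hidden_size=48, batch_first=True)
out_gru, h_gru = gru(x)               # GRU returns (output, h_n), like nn.RNN

lstm = torch.nn.LSTM(input_size=12, hidden_size=48, batch_first=True)
out_lstm, (h_lstm, c_lstm) = lstm(x)  # LSTM also returns the cell state c_n

# In both cases the output at the last time step can feed the Linear layer.
print(out_gru[:, -1, :].shape, out_lstm[:, -1, :].shape)
```

Code that unpacks the result as `out, _ = module(x)` works unchanged for all three module types.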

# Part 4: Evaluation

Now, we need to evaluate our network on the test set.

Use the code below to do that:

In [None]:
# Create a test dataset instance
test_dataset = SpeechDataset(classes, testing_list)

# Create a DataLoader
test_dl = torch.utils.data.DataLoader(test_dataset, batch_size=256)

net.eval()

test_loss = 0
test_accuracy = 0

preds, y_test = np.array([]), np.array([])

for batch_idx, (batch) in enumerate(tqdm_notebook(test_dl)):

  with torch.no_grad():
    # Get a batch from the dataloader
    x = batch[0]
    labels = batch[1]

    # move the batch to the same device as the network
    x = x.to(device)
    labels = labels.to(device)

    # Compute the network output
    y = net(x.float())

    # Compute the loss
    loss = criterion(y, labels)
    
    ## Store all the predictions and labels for later
    preds = np.hstack([preds, y.max(1)[1].cpu().numpy()])
    y_test = np.hstack([y_test, labels.cpu().numpy()])

    test_loss += loss.item()
    test_accuracy += (y.max(1)[1] == labels).sum().item()

print("Test accuracy:", test_accuracy/float(len(test_dataset)),
      "Test loss:", test_loss/float(len(test_dataset)))

Test accuracy: 0.8547932757837347 Test loss: 0.002449480250227294
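
The stored predictions and labels are what a confusion matrix is built from. Here is a minimal sketch in plain Python on toy labels; the values of `toy_true` and `toy_pred` are hypothetical stand-ins for `y_test` and `preds`, and `confusion_matrix_counts` is a helper written for this example. In the notebook, `confusion_matrix(y_test, preds)` from sklearn.metrics (imported at the top) computes the same table.

```python
def confusion_matrix_counts(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[int(t)][int(p)] += 1
    return matrix

# Toy labels standing in for y_test and preds (hypothetical values).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix_counts(toy_true, toy_pred, num_classes=3)
for row in cm:
    print(row)

# Correct predictions lie on the diagonal.
accuracy = sum(cm[i][i] for i in range(3)) / len(toy_true)
print(accuracy)
```

Off-diagonal entries show which words get confused with each other, which the overall accuracy alone does not reveal.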
