# Using DIGITS with audio data

First we'll import some Python dependencies.  Most are standard and will be familiar.  The aifc module may not be: it is a library for reading and writing .aiff audio files.

In [None]:
%matplotlib inline

import os
import sys
import numpy as np
import aifc
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    import matplotlib as mpl
    mpl.use('Agg')
    import matplotlib.pyplot as plt
    import matplotlib.image as mpimg
from skimage import io
from scipy.misc import imsave

## Overview

In this notebook we will work through an example of how to use DIGITS and Caffe with audio data.  The example is inspired by the Kaggle competition (https://www.kaggle.com/c/whale-detection-challenge) to detect whales from Sonar recordings.  Whales emit sounds underwater for functions including communication or echo-location (normally called Sonar).  The sounds emitted by the whales are difficult to recognize by the untrained ear and easily confused with other sources of underwater noise.  Marine biologists would like to have reliable, automated systems capable of classifying an underwater audio recording as being either a whale or something else.

## Preparing audio data for a convolutional neural network

The whale sounds are recorded as analogue audio signals and then digitized for storage and analysis.  Even though DIGITS and Caffe are most commonly used to classify images, we can still apply them to audio data by first converting these digitized audio files to a representation called a spectrogram, and then classifying it as an image.

The audio data is stored in the data folder alongside this notebook.  Each audio file is 2 seconds long, sampled at 2000 Hz, 16-bit, and mono.  You can retrieve this information from the header of an audio file with the following Python code:

In [None]:
ifn="data/train/train4.aiff"
sf=aifc.open(ifn)
# Read every frame as raw bytes; AIFF samples are big-endian, hence the byteswap
str_frames=sf.readframes(sf.getnframes())
data = np.fromstring(str_frames, np.short).byteswap()
 
# Report the header fields, then close the file
print "Filename: %s " % ifn
print "Framerate: %d " % sf.getframerate()
print "Num Channels: %d " % sf.getnchannels()
print "Sample Width (bytes): %d " % sf.getsampwidth()
print "Number of Samples: %d " % sf.getnframes()
sf.close()
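
The byteswap step deserves a short aside.  AIFF stores its samples in big-endian byte order, while np.short uses the byte order of the machine running the notebook, which is little-endian on most PCs.  The sketch below uses three made-up sample values rather than a real file, and shows that asking NumPy for a big-endian 16-bit type decodes the bytes correctly without a separate swap.  It also works out the clip duration from the header fields, assuming the 2000 Hz and 2 second figures quoted above.

```python
import struct
import numpy as np

# Three made-up 16-bit samples packed big-endian, the byte order AIFF uses.
raw = struct.pack(">3h", 1, -2, 300)

# An explicit big-endian dtype decodes correctly on any machine.
samples = np.frombuffer(raw, dtype=">i2")
print(samples.tolist())

# Duration follows from the header fields: frames / frame rate.
framerate = 2000   # Hz, as stated in the text
nframes = 4000     # assumed: 2 seconds at 2000 Hz
duration = nframes / float(framerate)
print(duration)
```

On a little-endian machine this gives the same values as reading with np.short and then calling byteswap(), which is what the cell above does.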

To visualize an audio file, we can convert it into a spectrogram.  A spectrogram is a visual representation of how the frequency spectrum of a signal, in this case audio, changes over time.  It is calculated as a series of Fourier transforms over short, overlapping windows of the signal (http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.signal.spectrogram.html).  Use the code below to visualize what the spectrogram of the same audio file looks like:

In [None]:
ifn="data/train/train4.aiff"
sf=aifc.open(ifn)
# Read every frame as raw bytes; AIFF samples are big-endian, hence the byteswap
str_frames=sf.readframes(sf.getnframes())
data = np.fromstring(str_frames, np.short).byteswap()
 
# Report the header fields, then close the file
print "Filename: %s " % ifn
print "Framerate: %d " % sf.getframerate()
print "Num Channels: %d " % sf.getnchannels()
print "Sample Width (bytes): %d " % sf.getsampwidth()
print "Number of Samples: %d " % sf.getnframes()
sf.close()
 
# Use specgram to plot the spectrogram of the data
fig,ax=plt.subplots()
pxx,freq,bins,im=ax.specgram(data,NFFT=256,noverlap=128,cmap='Greys_r')
cb=fig.colorbar(im)
cb.set_label('power (dB)')
 
plt.xlabel('time')
plt.ylabel('normalized frequency')
 
plt.show()

While the figure above is a typical spectrogram, it is not a good form with which to train a convolutional neural network (CNN).  The colour scale of the figure is in dB, and plotting in dB compresses the dynamic range of the spectrogram so that the whale signal barely stands out from the background.  Instead, the raw values from the spectrogram should be used.  These are contained in the pxx array returned by the call to specgram().

We have preprocessed the raw audio files in the dataset (stored in .aiff format) into image files stored in the .png format with a more suitable dynamic range for a CNN to work with.  We did this in advance because the spectrogram formation process takes a while to complete.  If you want to see how this conversion was done you can look in the script preprocess-aiff.py in the same folder as this notebook.
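
To give a feel for what such a conversion involves, here is a minimal sketch.  It is not the contents of preprocess-aiff.py.  It uses a synthetic 200 Hz tone in place of a real clip, computes the power spectrogram by hand with the same window length (256) and overlap (128) as the specgram() call, and then maps the raw power values to 8-bit pixels.  The Hann window, the unnormalised power values and the 99th-percentile clipping are all assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-in for one clip: 2 s at 2000 Hz = 4000 samples.
rate = 2000
t = np.arange(2 * rate) / float(rate)
clip = np.sin(2 * np.pi * 200 * t)

# Window settings matching the specgram() call: 256 samples, 128 overlap.
nfft, hop = 256, 128
window = np.hanning(nfft)
starts = range(0, len(clip) - nfft + 1, hop)

# One Fourier transform per window; rows are frequency bins, columns are time steps.
pxx = np.array([np.abs(np.fft.rfft(window * clip[s:s + nfft])) ** 2
                for s in starts]).T
print(pxx.shape)

# Assumed scaling: clip at the 99th percentile, then stretch to 0..255.
top = np.percentile(pxx, 99)
scaled = np.clip(pxx / top, 0.0, 1.0)
image = (scaled * 255).astype(np.uint8)
```

A 256-sample window gives 129 frequency bins, and stepping 128 samples at a time across 4000 samples gives 30 windows, so the array has 129 rows and 30 columns.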

The output files are 30x129, 8-bit greyscale images.  Let's look at the same spectrogram as before in this new format.

In [None]:
img=mpimg.imread('data/train/train4.png')
plt.figure(figsize=(20,10))
plt.imshow(np.flipud(img),cmap='Greys_r')
plt.xlabel('time')
plt.ylabel('normalized frequency')
plt.tick_params(labelleft='off',labelbottom='off')
plt.show()

Notice that the image is mostly black, with the bright regions standing out at much higher contrast than before.  This contrast is necessary to detect the whale sounds.

So is there a whale in the image above?  No, there is not.  Let's look at an image where there is a whale sound.

In [None]:
img=mpimg.imread('data/train/train6.png')
plt.figure(figsize=(20,10))
plt.imshow(np.flipud(img),cmap='Greys_r')
plt.xlabel('time')
plt.ylabel('normalized frequency')
plt.tick_params(labelleft='off',labelbottom='off')
plt.show()

It is hard to see a real difference between the two pictures.  The curve on the right side of the images is slightly different, but it would be difficult even for a whale expert to recognize this.  If there is a whale upcall, it will look like the single upward curve shown in the second image.  Let's see if a CNN is able to learn to distinguish whale from non-whale sounds using these subtle differences in features.

## Creating training, validation and label files



Before we start using DIGITS to classify the images we will split the labelled training dataset into a training and a validation dataset.  We will also create a file listing the names of the two classes, i.e. not-whale and whale.  We will use 90% of the training data (the first 27,000 entries in train.csv) for training, and the remaining 10% (the last 3,000 entries) for validation.

We use a few UNIX command line tools to create the text files.  Don't worry if the details are unfamiliar.  The pipeline removes the header from the train.csv file provided with the dataset, prefixes each file name with the path to our image folder, and changes the file extensions from .aiff to .png.

In [None]:
%%bash
# Create training image list
tail -n +2 data/train.csv | sed 's/,/ /g' | awk -v dir=data/train '{printf("%s/%s %s\n",dir,$1,$2);}' | sed 's/.aiff /.png /g' | head -n 27000 > train.txt

# Create validation image list
tail -n +2 data/train.csv | sed 's/,/ /g' | awk -v dir=data/train '{printf("%s/%s %s\n",dir,$1,$2);}' | sed 's/.aiff /.png /g' | tail -n -3000 > validate.txt

# Create labels file
rm -f labels.txt
echo "not-whale" >> labels.txt
echo "whale" >> labels.txt

Let's check what we have in train.txt to see the format that DIGITS expects for defining datasets.

In [None]:
!head -25 train.txt
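
The shell pipeline above splits the data by position: the first 27,000 lines go to training and the last 3,000 to validation.  If a shuffled split is preferred, the same file lists could be produced in Python.  The sketch below is one possible approach, not part of the original workflow.  It uses a tiny in-memory stand-in for train.csv, and the header names and file names in it are hypothetical.

```python
import csv
import io
import random

# Hypothetical stand-in for data/train.csv: a header, then file name and label.
csv_text = "clip_name,label\n" + "".join(
    "train%d.aiff,%d\n" % (i, i % 2) for i in range(1, 11))

rows = list(csv.reader(io.StringIO(csv_text)))[1:]   # drop the header row

random.seed(0)          # fixed seed so the split is repeatable
random.shuffle(rows)

n_train = len(rows) * 9 // 10   # 90% for training

def to_line(name, label):
    # Same output format as train.txt: "<path to .png> <label>"
    return "data/train/%s %s" % (name.replace(".aiff", ".png"), label)

train_lines = [to_line(*r) for r in rows[:n_train]]
val_lines = [to_line(*r) for r in rows[n_train:]]
print(len(train_lines), len(val_lines))
```

Writing train_lines and val_lines to train.txt and validate.txt, one entry per line, would give files in the same format as the ones created above.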

## Training the model in DIGITS

Now that our datasets are ready, we can start training.  <a href="/digits/" target="_blank">Click here</a> to launch DIGITS in a second tab.

First we need to import a dataset into DIGITS.  Use the Datasets->Images dropdown and select "Classification" dataset.  When the "New Image Classification Dataset" panel opens, use the following preprocessing options:

![DIGITS New Image Classification Dataset panel](whales_digits_dataset.png)

Once you click "Create" the dataset will be imported into DIGITS, which takes a couple of minutes.  While the dataset is importing, inspect the histograms showing how many samples of each class we have in the training and validation datasets.  You can also click "Explore the db" to see some sample images from each class.

<a id='question1'></a>
### Question 1

What potential training difficulties can we expect based on the histograms?

Answer: [Click here](#answer1)

Now that we have created the dataset, we are ready to train a model.  Return to the DIGITS main screen and use the Models->Images dropdown and select "classification" model.  On the "New Image Classification Model" panel that opens we will leave most options as default.  You just need to customize the following:

* Select the whale_sounds dataset you just created 
* Choose the Standard Network "AlexNet"
* Set the number of training epochs to 5
* Choose a name for the model, say "whale_sounds_baseline"

The panel should look like this:

![DIGITS New Image Classification Model panel](whales_digits_model.png)

Now click "Create" to start training the model.

After about a minute you should see a live updating graph displaying the model training loss and the validation loss and accuracy.  The losses should decrease as training progresses and the accuracy should increase.  It will take a few minutes for training to complete.  In the end you should see something like this:

![Completed baseline model training](whales_baseline_accuracy.png)

Over 95% accuracy against the validation dataset in only 5 epochs.  Your graphs will not be exactly the same because the initial model weights are generated stochastically.  Also, 5 epochs is probably not quite long enough to converge to the optimal solution, so you may have a slightly higher or lower final accuracy.  

But you have built a whale detector from sound using Deep Learning!

## Improving model performance

There are many ways in which we could improve performance over our baseline model.  For example, we left most of the model training parameters at default values. Let's start there: was the default learning rate of 0.01 the best value?  

### Exercise 1:

From the main DIGITS screen, create several new models with all settings the same as in the baseline model but varying the learning rate to see how the final model accuracy changes.  Try training models with learning rates of 0.001, 0.005, 0.05.

On the main DIGITS screen you can click "View details" in the Models pane to see a listing of the models you have trained and their accuracies.  Here is an example with two of the models changed.

![Model details](whales_model_list.png)

<a id='question2'></a>
### Question 2

What is the effect of changing the learning rate on the model accuracy?

Answer: [Click here](#answer2)

The process of modifying individual model and algorithm parameters to find the best performing model is often called "hyperparameter search" or "hyperparameter optimization".  In this case it was a manual process, but there are also methods for automating this search process.
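
As a rough illustration of what automating the search means, the sketch below loops over a grid of candidate learning rates and keeps the one with the best score.  The train_and_score function is a toy stand-in (gradient descent on a simple quadratic), not a DIGITS or Caffe call.  In a real search it would be replaced by a function that trains a model and returns its validation accuracy.

```python
def train_and_score(learning_rate, steps=50):
    """Toy stand-in for training: gradient descent on (w - 3)^2."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)
    return -(w - 3.0) ** 2   # higher is better, like accuracy

# Candidate values: the default rate plus the three from Exercise 1.
candidates = [0.001, 0.005, 0.01, 0.05]
scores = {lr: train_and_score(lr) for lr in candidates}

# Keep the learning rate with the highest score.
best_lr = max(scores, key=scores.get)
print(best_lr)
```

The ranking on this toy problem says nothing about which learning rate is best for the whale model.  Only the shape of the loop carries over.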

### Exercise 2:

When we trained our baseline model we chose the Standard Network architecture "AlexNet".  Try choosing the GoogLeNet model instead to see the effect on model accuracy.  It will take a little longer to train because it is a much more complicated model.

<a id='question3'></a>
### Question 3

Does GoogLeNet achieve better validation accuracy?  If so, why?  If not, why not?

Answer: [Click here](#answer3)

### Additional optional exercises

* (Easy) Try varying some of the other model training parameters to see if you can find an even better performing model.  You could try changing the batch size, the solver type or the learning rate decay policy listed under "advanced learning rate options".
* (Moderate) Try modifying the dataset creation scripts to balance the classes in the training and validation datasets.
* (Moderate) Try creating and training your own model architecture that does not require the input images to be resized to 256x256.
* (Hard) Try creating additional training data by augmenting the original training images so that the model is more robust to noise in the data.  Hint:  you might try adding random noise to the image pixels or randomly perturbing the contrast of the images.
* (Really hard) Try using the Python Layers interface to make your data augmentation part of the online training process.
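
As a starting point for the augmentation exercise, here is a minimal sketch of the two ideas in the hint: perturbing the contrast and adding random noise.  The augment function, its parameter names and its default values are illustrative choices, and a random array stands in for a real spectrogram image.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the example is repeatable

def augment(image, noise_std=5.0, contrast_range=(0.8, 1.2)):
    """Return a noisy, contrast-perturbed copy of an 8-bit greyscale image."""
    pixels = image.astype(np.float64)
    # Scale the contrast about the image mean by a random factor.
    factor = rng.uniform(*contrast_range)
    mean = pixels.mean()
    pixels = (pixels - mean) * factor + mean
    # Add zero-mean Gaussian noise to every pixel.
    pixels = pixels + rng.normal(0.0, noise_std, size=pixels.shape)
    # Round and clip back into the valid 8-bit range.
    return np.clip(np.rint(pixels), 0, 255).astype(np.uint8)

# Random stand-in for one 129-row by 30-column spectrogram image.
original = rng.randint(0, 256, size=(129, 30)).astype(np.uint8)
augmented = augment(original)
print(augmented.shape, augmented.dtype)
```

Each augmented copy keeps the label of the image it came from, so it can be written out as a new .png and added to train.txt alongside the original.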

## Answers to questions:

<a id='answer1'></a>
### Answer 1

The datasets are imbalanced, i.e. there are many more "not-whale" samples than "whale" samples.  This can cause a machine learning algorithm to learn a model that looks accurate overall simply by labelling everything as "not-whale", but that misses many true detections of whales in the process.
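
The arithmetic behind this is easy to check.  The sketch below uses made-up class counts (nine not-whale clips for every whale clip), not the real counts from the DIGITS histograms, and scores a "model" that always predicts the majority class.

```python
from collections import Counter

# Hypothetical labels: 900 not-whale (0) clips and 100 whale (1) clips.
labels = [0] * 900 + [1] * 100
counts = Counter(labels)

# A "model" that always predicts the majority class.
majority = counts.most_common(1)[0][0]
predictions = [majority] * len(labels)

# Overall accuracy looks good...
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / float(len(labels))

# ...but no whale is ever detected.
whales_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = whales_found / float(counts[1])
print(accuracy, recall)
```

With these counts the accuracy is 90% even though not a single whale is found, which is why accuracy alone can be misleading on imbalanced data.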

[Click here](#question1) to return to question 1

<a id='answer2'></a>
### Answer 2

Decreasing the learning rate reduced the accuracy.  Increasing the learning rate increased the accuracy to over 96%.

[Click here](#question2) to return to question 2

<a id='answer3'></a>
### Answer 3

GoogLeNet achieves a higher validation accuracy, around 97%.  Essentially this is because GoogLeNet is a much deeper model than AlexNet, even though it has fewer trainable parameters.  GoogLeNet also contains a special building block, called an Inception module, that applies filters of several sizes in parallel and so improves the scale invariance of the model.

[Click here](#question3) to return to question 3