# Right whale detection from hydrophone data
This project implements and analyzes a Convolutional Neural Network (CNN) for the following Kaggle challenge:

## [The Marinexplore and Cornell University Whale Detection Challenge](https://www.kaggle.com/c/whale-detection-challenge)
The goal of the project is to "Create an algorithm to detect North Atlantic right whale calls from audio recordings, prevent collisions with shipping traffic" (from Kaggle).

### The data
- 2-second .aiff sound clips
- Sample rate of 2 kHz
- 30,000 labeled training clips
- 54,503 test clips
- Training labels given in a separate CSV file

### The implementation
Because the clips have a uniform length and the training data are labeled, we can take a supervised learning approach. Converting each clip to a spectrogram lets us treat the task much like an image classification problem. We build a Convolutional Neural Network with PyTorch (for the dataset, data loader, and model classes) and use librosa to open and transform the sound files. pandas and NumPy are used to read the training metadata CSV.
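
To illustrate how a sound clip becomes an image-like array, here is a minimal sketch that uses only NumPy. It is not the mel spectrogram used in this project (that comes from librosa and adds a mel filter bank). It is a plain short-time Fourier transform applied to a hypothetical rising tone, and the names `power_spectrogram`, `n_fft`, and `hop_length` values are illustrative choices.

```python
import numpy as np

def power_spectrogram(signal, n_fft=256, hop_length=64):
    """Return a (frequency_bins, frames) array of short-time power spectra."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Cut the signal into overlapping, windowed frames
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # One FFT per frame, then transpose so that rows are frequency bins
    return (np.abs(np.fft.rfft(frames, axis=1)) ** 2).T

sample_rate = 2000   # Hz, as in the challenge data
duration_s = 2       # seconds per clip
n_fft = 256
t = np.arange(sample_rate * duration_s) / sample_rate

# Hypothetical test signal: a tone rising linearly from 100 Hz to 200 Hz
f_start, f_end = 100.0, 200.0
sweep_rate = (f_end - f_start) / duration_s
signal = np.sin(2 * np.pi * (f_start * t + 0.5 * sweep_rate * t ** 2))

spec = power_spectrogram(signal, n_fft=n_fft)
peak_bins = spec.argmax(axis=0)        # strongest frequency bin in each frame
bin_width_hz = sample_rate / n_fft
first_peak_hz = peak_bins[0] * bin_width_hz
last_peak_hz = peak_bins[-1] * bin_width_hz
print(spec.shape)                      # (129, 59)
print(first_peak_hz, last_peak_hz)
```

The strongest bin moves upward from the first frame to the last, which is the kind of time-frequency pattern a CNN can learn to recognize.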

To 

In [1]:
import PIL
import io

import librosa
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split, Subset
from torch.utils.tensorboard import SummaryWriter

import torchaudio
from torchvision.transforms import ToTensor
from torchvision import models

from sklearn.metrics import roc_curve, roc_auc_score




The data was downloaded to my system and placed in the Downloads folder.

In [2]:
path_to_whale_data = "/Users/adambuderi/Downloads/whale-detection-challenge"
path_to_whale_data_csv = path_to_whale_data + "/data/train.csv"
path_to_whale_data_audio = path_to_whale_data + "/data/train/"

In [3]:
import sys
import os
sys.path.append(os.getcwd())
import whale_sound_dataset
import audio_classifier

This dictionary sets the number of epochs, the batch size, the audio sample rate, and the audio duration.

In [4]:
configuration_dict = {'number_of_epochs': 6, 'batch_size': 8, 
                      'audio_sample_rate': 2000, 'audio_duration': 2000}
print(configuration_dict)  # show the configuration in use



{'number_of_epochs': 6, 'batch_size': 8, 'audio_sample_rate': 2000, 'audio_duration': 2000}


Implement the PyTorch [Dataset class](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) for loading the audio files, converting them to mel spectrograms, and loading their labels.

<b>Note</b>: This implementation is included in the notebook for clarity, but the version the notebook actually uses is imported from the file whale_sound_dataset.py.
This is needed because of the [constraints of multiprocessing in Jupyter notebooks](https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror) I encountered: DataLoader worker processes may fail to find classes that are defined only inside the notebook.



In [5]:
# ----------------------------
# Sound Dataset
# ----------------------------
class WhaleSoundDataSet(Dataset):
    def __init__(self, csv_path, data_path, configuration_dict):
        self.data_path = str(data_path)
        # Clip duration and sample rate, with defaults matching the challenge data
        self.duration = configuration_dict.get('audio_duration', 2000)
        self.sr = configuration_dict.get('audio_sample_rate', 2000)

        # Metadata table with one row per clip (columns 'clip_name' and 'label')
        self.meta_df = pd.read_csv(csv_path)
        
    # ----------------------------
    # Number of items in dataset
    # ----------------------------
    def __len__(self):
        return len(self.meta_df)

    # ----------------------------
    # Get i'th item in dataset
    # ----------------------------
    def __getitem__(self, idx):
        # Path of the audio file: the audio directory (which must end with a
        # path separator) concatenated with the clip's filename.
        audio_file_path = self.data_path + self.meta_df.loc[idx, 'clip_name']
        # Load up to 2 seconds of audio, resampled to the configured sample rate
        soundData, sr = librosa.load(audio_file_path, sr=self.sr, duration=2)
        # Get the class ID, either 0 (no whale) or 1 (whale)
        class_id = self.meta_df.loc[idx, 'label']

        # Convert multi-channel audio to mono (a no-op for audio that is already mono)
        soundData_mono = librosa.to_mono(soundData)

        # Convert audio to a Mel spectrogram (power scale; no log scaling is applied here)
        mel_spectrogram = librosa.feature.melspectrogram(y=soundData_mono, sr=sr)

        return mel_spectrogram, class_id
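
The spectrogram size produced by this class can be worked out by hand. The sketch below assumes librosa's default `melspectrogram` settings (`n_mels=128`, `hop_length=512`, and a centered STFT, for which the frame count is `1 + n_samples // hop_length`); the helper name `stft_frame_count` is hypothetical.

```python
def stft_frame_count(n_samples, hop_length):
    """Frame count of a centered STFT (signal padded on both sides)."""
    return 1 + n_samples // hop_length

sample_rate = 2000                       # Hz
clip_seconds = 2
n_samples = sample_rate * clip_seconds   # 4000 samples per clip

# Assumed librosa defaults for melspectrogram
n_mels = 128
hop_length = 512

spectrogram_shape = (n_mels, stft_frame_count(n_samples, hop_length))
print(spectrogram_shape)  # (128, 8)
```

Under these assumptions each clip becomes a 128 x 8 array, which agrees with the `(8, 1, 128, 8)` batch shape used in the training and testing loops.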

    

Instantiate the WhaleSoundDataSet class, then split that dataset between a training set and a testing set at an 80:20 ratio.

Then, instantiate a DataLoader iterable around the datasets with a batch size of 8.

In [6]:
train_set = whale_sound_dataset.WhaleSoundDataSet(path_to_whale_data_csv, path_to_whale_data_audio, configuration_dict)

# Random split of 80:20 between training and testing
num_items = len(train_set)
num_train = round(num_items * 0.8)
num_val = num_items - num_train
train_ds, test_ds = random_split(train_set, [num_train, num_val])
print("Train set size: " + str(len(train_ds)))
print("Test set size: " + str(len(test_ds)))

train_loader = torch.utils.data.DataLoader(train_ds, batch_size = configuration_dict.get('batch_size', 8), 
                                           shuffle = True, pin_memory=False, num_workers=1)
test_loader = torch.utils.data.DataLoader(test_ds, batch_size = configuration_dict.get('batch_size', 8), 
                                          shuffle = False, pin_memory=False, num_workers=1)


Train set size: 24000
Test set size: 6000
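
The split above draws a new random permutation on every run. If a repeatable split is wanted, `random_split` accepts a seeded `generator`. The following is a minimal sketch on a hypothetical 30-item stand-in dataset; `seeded_split` and the seed value are illustrative and not part of this project.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in dataset: 30 items whose only feature is their index
toy_dataset = TensorDataset(torch.arange(30))

def seeded_split(dataset, train_fraction=0.8, seed=0):
    """Split a dataset into two parts with a reproducible permutation."""
    n_train = round(len(dataset) * train_fraction)
    n_val = len(dataset) - n_train
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val], generator=generator)

train_part, val_part = seeded_split(toy_dataset)
train_again, val_again = seeded_split(toy_dataset)

print(len(train_part), len(val_part))
# The same seed yields the same indices on every call
print(train_part.indices == train_again.indices)
```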


Define a custom Convolutional Neural Network

The input has the shape [batch_size, num_channels, img_height, img_width]; here each batch is [8, 1, 128, 8].

<b>Note</b>: This implementation is included in the notebook for clarity, but the version the notebook actually uses is imported from the file audio_classifier.py.
This is needed because of the [constraints of multiprocessing in Jupyter notebooks](https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror) I encountered.

In [7]:
# ----------------------------
# Audio Classification Model
# ----------------------------
class AudioClassifier (nn.Module):
    # ----------------------------
    # Build the model architecture
    # ----------------------------
    def __init__(self):
        super().__init__()
        conv_layers = []

        # First Convolution Block with Relu and Batch Norm. Use Kaiming Initialization
        self.conv1 = nn.Conv2d(1, 8, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
        self.relu1 = nn.ReLU()
        self.bn1 = nn.BatchNorm2d(8)
        nn.init.kaiming_normal_(self.conv1.weight, a=0.1)
        self.conv1.bias.data.zero_()
        conv_layers += [self.conv1, self.relu1, self.bn1]

        # Second Convolution Block
        self.conv2 = nn.Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
        self.relu2 = nn.ReLU()
        self.bn2 = nn.BatchNorm2d(16)
        nn.init.kaiming_normal_(self.conv2.weight, a=0.1)
        self.conv2.bias.data.zero_()
        conv_layers += [self.conv2, self.relu2, self.bn2]

        # Third Convolution Block
        self.conv3 = nn.Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
        self.relu3 = nn.ReLU()
        self.bn3 = nn.BatchNorm2d(32)
        nn.init.kaiming_normal_(self.conv3.weight, a=0.1)
        self.conv3.bias.data.zero_()
        conv_layers += [self.conv3, self.relu3, self.bn3]

        # Fourth Convolution Block
        self.conv4 = nn.Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
        self.relu4 = nn.ReLU()
        self.bn4 = nn.BatchNorm2d(64)
        nn.init.kaiming_normal_(self.conv4.weight, a=0.1)
        self.conv4.bias.data.zero_()
        conv_layers += [self.conv4, self.relu4, self.bn4]

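        # Linear Classifier
        # The task is binary (labels 0 and 1), so out_features=2 would suffice;
        # with 10 outputs the extra logits are simply never the target class.
        self.ap = nn.AdaptiveAvgPool2d(output_size=1)
        self.lin = nn.Linear(in_features=64, out_features=10)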

        # Wrap the Convolutional Blocks
        self.conv = nn.Sequential(*conv_layers)
 
    # ----------------------------
    # Forward pass computations
    # ----------------------------
    def forward(self, x):
        # Run the convolutional blocks
        x = self.conv(x)

        # Adaptive pool and flatten for input to linear layer
        x = self.ap(x)
        x = x.view(x.shape[0], -1)

        # Linear layer
        x = self.lin(x)

        # Final output
        return x
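
The spatial size of the feature map after each strided convolution can be checked with the standard output-size formula, `(size + 2 * padding - kernel) // stride + 1`. This is a small sketch that assumes a 128 x 8 input spectrogram; `conv_output_size` is a hypothetical helper, and the layer tuples mirror the kernel, stride, and padding values in the class above.

```python
def conv_output_size(size, kernel, stride, padding):
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) for conv1 through conv4
layers = [(5, 2, 2), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

height, width = 128, 8   # assumed input spectrogram size
sizes = [(height, width)]
for kernel, stride, padding in layers:
    height = conv_output_size(height, kernel, stride, padding)
    width = conv_output_size(width, kernel, stride, padding)
    sizes.append((height, width))

print(sizes)  # [(128, 8), (64, 4), (32, 2), (16, 1), (8, 1)]
```

The last block therefore outputs 64 channels of size 8 x 1, which the adaptive average pool reduces to a 64-element vector for the linear layer.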

Instantiate the model and move it to the GPU if one is available; otherwise it stays on the CPU.

In [8]:
myModel = audio_classifier.AudioClassifier()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
myModel = myModel.to(device)

The training loop for the model. It uses cross-entropy loss, the Adam optimizer, and a one-cycle learning-rate schedule.

In [9]:
# ----------------------------
# Training Loop
# ----------------------------
def training(model, train_dl, num_epochs):
    # Loss Function, Optimizer and Scheduler
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(),lr=0.001)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.001,
                                                steps_per_epoch=int(len(train_dl)),
                                                epochs=num_epochs,
                                                anneal_strategy='linear')

    # Repeat for each epoch
    for epoch in range(num_epochs):
        running_loss = 0.0
        correct_prediction = 0
        total_prediction = 0

        # Repeat for each batch in the training set; each batch is a tuple of
        # (sounds, sample_rate, inputs, labels)
        for i, data in enumerate(train_dl):
            # Get the input features and target labels, and move them to the selected device
            inputs, labels = data[2].to(device), data[3].to(device)

            # Reshape to the 4-D layout [batch, channel, height, width] that Conv2d expects, see
            # https://stackoverflow.com/questions/72808402/pytorch-identifying-batch-size-as-number-of-channels-in-conv2d-layer
            # Using the actual batch size also handles a smaller final batch.
            inputs = torch.reshape(inputs, (inputs.shape[0], 1, 128, 8))

            # Normalize the inputs
            inputs_m, inputs_s = inputs.mean(), inputs.std()
            inputs = (inputs - inputs_m) / inputs_s

            # Zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            scheduler.step()

            # Keep stats for Loss and Accuracy
            running_loss += loss.item()

            # Get the predicted class with the highest score
            _, prediction = torch.max(outputs,1)
            # Count of predictions that matched the target label
            correct_prediction += (prediction == labels).sum().item()
            total_prediction += prediction.shape[0]

            if i % 1000 == 0:    # every 1000 mini-batches, print the cumulative epoch loss / 1000
                print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 1000))
    
        # Print stats at the end of the epoch
        num_batches = len(train_dl)
        avg_loss = running_loss / num_batches
        acc = correct_prediction/total_prediction
        print(f'Epoch: {epoch}, Loss: {avg_loss:.2f}, Accuracy: {acc:.2f}')

    print('Finished Training')
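
The loop above accumulates `running_loss` over the whole epoch, so the intermediate printouts are cumulative sums divided by 1000 rather than averages. One possible alternative is to reset the accumulator at every report so that each printed value is the mean over the most recent window. The sketch below uses hypothetical loss values, and `windowed_mean_losses` is an illustrative helper rather than part of this project.

```python
def windowed_mean_losses(batch_losses, window):
    """Return (batch_number, mean loss over the last `window` batches) pairs."""
    running = 0.0
    report = []
    for i, loss in enumerate(batch_losses, start=1):
        running += loss
        if i % window == 0:
            report.append((i, running / window))
            running = 0.0   # start a fresh window
    return report

# Hypothetical per-batch losses
losses = [0.9, 0.7, 0.5, 0.3, 0.4, 0.2]
report = windowed_mean_losses(losses, window=3)
for batch_number, mean_loss in report:
    print(f"batch {batch_number}: mean loss {mean_loss:.3f}")
```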


Testing loop for the model

In [10]:
# ----------------------------
# Inference
# ----------------------------
def testing (model, val_dl, y_pred, y_true):
    correct_prediction = 0
    total_prediction = 0

    # Disable gradient tracking during inference
    with torch.no_grad():
        for data in val_dl:
            # Get the input features and target labels, and move them to the selected device
            inputs, labels = data[2].to(device), data[3].to(device)
            # Reshape to [batch, channel, height, width] using the actual batch size
            inputs = torch.reshape(inputs, (inputs.shape[0], 1, 128, 8))

            # Normalize the inputs
            inputs_m, inputs_s = inputs.mean(), inputs.std()
            inputs = (inputs - inputs_m) / inputs_s

            # Get predictions
            outputs = model(inputs)

            # Store the true labels for calculating the ROC AUC
            # (.cpu() is needed before .numpy() when running on a GPU)
            y_true.append(labels.cpu().numpy())

            # Store the predicted probability of class 1 (whale) for each clip
            y_pred.append(nn.functional.softmax(outputs, dim=1).cpu().numpy()[:, 1])

            # Get the predicted class with the highest score
            _, prediction = torch.max(outputs,1)
            # Count of predictions that matched the target label
            correct_prediction += (prediction == labels).sum().item()
            total_prediction += prediction.shape[0]
    
    acc = correct_prediction/total_prediction
    print(f'Accuracy: {acc:.2f}, Total items: {total_prediction}')
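
The testing loop turns the model's raw outputs (logits) into a per-clip whale score by applying a softmax and keeping the column for class 1. Here is a minimal NumPy sketch of that step, using hypothetical two-class logits; `softmax` here is a hand-written stand-in for `nn.functional.softmax`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for three clips; columns are [no whale, whale]
logits = np.array([[2.0, 0.0],
                   [0.0, 0.0],
                   [-1.0, 3.0]])

probs = softmax(logits)
whale_scores = probs[:, 1]   # probability assigned to class 1 for each clip
print(whale_scores)
```

Each row of `probs` sums to 1, and equal logits give a score of exactly 0.5.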


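Run the training and testing loops. Two lists collect the model's predicted probabilities and the ground-truth labels. In the log, the `[epoch, batch] loss` lines show the cumulative epoch loss divided by 1000, so they grow within each epoch; the per-epoch `Loss` value is the average over all batches.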

In [11]:
y_true = []
y_pred = []
training(myModel, train_loader, configuration_dict.get('number_of_epochs', 6))
# Run inference on the trained model with the held-out test set
testing(myModel, test_loader, y_pred, y_true)

[1,     1] loss: 0.002
[1,  1001] loss: 1.735
[1,  2001] loss: 2.296
Epoch: 0, Loss: 0.92, Accuracy: 0.70
[2,     1] loss: 0.000
[2,  1001] loss: 0.434
[2,  2001] loss: 0.850
Epoch: 1, Loss: 0.42, Accuracy: 0.78
[3,     1] loss: 0.001
[3,  1001] loss: 0.412
[3,  2001] loss: 0.808
Epoch: 2, Loss: 0.40, Accuracy: 0.80
[4,     1] loss: 0.000
[4,  1001] loss: 0.387
[4,  2001] loss: 0.758
Epoch: 3, Loss: 0.38, Accuracy: 0.81
[5,     1] loss: 0.000
[5,  1001] loss: 0.376
[5,  2001] loss: 0.733
Epoch: 4, Loss: 0.37, Accuracy: 0.82
[6,     1] loss: 0.000
[6,  1001] loss: 0.358
[6,  2001] loss: 0.710
Epoch: 5, Loss: 0.35, Accuracy: 0.83
Finished Training
Accuracy: 0.83, Total items: 6000


Below we calculate the Area Under the Receiver Operating Characteristic curve (ROC AUC).

This is a more robust measure of model quality than accuracy in a binary classification problem because it does not depend on a single decision threshold. The ROC curve plots the true positive rate against the false positive rate as the threshold varies. The area under it equals the probability that the model gives a randomly chosen positive clip (one labeled as containing a whale call) a higher score than a randomly chosen negative clip.

There are several functions for calculating ROC AUC scores in different libraries. Here we use the [sklearn.metrics roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) function. The closer the printed value is to 1, the better the model is. A value of 0.5 means the model is essentially guessing at random.

In [12]:
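# Calculate the AUC score. Each list holds one array per batch, so flatten
# them into single 1-D arrays of per-clip labels and scores first.
auc = roc_auc_score(np.concatenate(y_true), np.concatenate(y_pred))
print("AUC: {:.4f}".format(auc))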

AUC: 0.8845
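
To make the meaning of this number concrete, ROC AUC can also be computed directly as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below does this with NumPy on a hypothetical four-clip example; `pairwise_auc` is an illustrative helper, not the scikit-learn implementation.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores for four clips
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = pairwise_auc(labels, scores)
print(auc_value)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75.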
