## Constructing the datasets

In [1]:
import librosa
import pandas as pd
import numpy as np
from IPython.lib.display import Audio
from matplotlib import pyplot as plt
import multiprocessing
import scipy.signal

import torch

The audio was recorded at a sampling rate of 44,100 Hz. Let's load all of the data into a pandas DataFrame so that we don't have to read the audio from disk every time we want to train on an example.

In [2]:
%%time

anno = pd.read_csv('data/annotations.csv')

audio = []

for _, row in anno.iterrows():
    recording, _ = librosa.load(f'data/audio/{row.fn}', sr=None)
    audio.append(recording)
    
anno['audio'] = audio

CPU times: user 604 ms, sys: 792 ms, total: 1.4 s
Wall time: 39 s


The only annotations we have available for this dataset are whether the recording contains an orca call or not.

In [6]:
anno.head()

| Unnamed: 0 | fn | label | audio |
|---|---|---|---|
| 0 | 0.wav | call | [0.07357788, 0.09561157, 0.072021484, 0.019134... |
| 1 | 1.wav | call | [-0.051696777, -0.023529053, 0.014511108, 0.02... |
| 2 | 2.wav | call | [-0.0035552979, -6.1035156e-05, -0.011566162, ... |
| 3 | 3.wav | call | [0.0016479492, 0.024032593, 0.032470703, -0.00... |
| 4 | 4.wav | call | [-0.030548096, -0.046203613, -0.029937744, 0.0... |


All recordings are 4 seconds long, but the classes are not balanced: there are roughly twice as many `call` examples as `no_call` examples. We provide an option to balance the classes for training.

In [7]:
anno.label.value_counts()

call       398
no_call    196
Name: label, dtype: int64

We provide two versions of the dataset:
* **unbalanced** (examples represented in line with counts in the raw dataset)
* **balanced** (examples of the less frequent class upsampled to the count of the most frequently occurring class)

First, we create a mapping from labels to class indices.

In [9]:
labels = anno['label'].unique().tolist()

label2idx = {label: idx for idx, label in enumerate(labels)}
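
Note that `unique()` returns labels in order of first appearance, so the resulting indices depend on the row order of the annotations. Below is a minimal sketch of a deterministic variant, assuming a sorted label order is acceptable. The `labels_in_order` list is a stand-in for `anno['label'].unique().tolist()`, and `sorted_label2idx` and `idx2label` are hypothetical names that are not used elsewhere in this notebook.

```python
# Stand-in for anno['label'].unique().tolist(); order of first appearance may vary.
labels_in_order = ['no_call', 'call']

# Sorting makes the mapping independent of the row order in the annotations.
sorted_label2idx = {label: idx for idx, label in enumerate(sorted(labels_in_order))}

# Hypothetical inverse mapping, handy for decoding predicted class indices.
idx2label = {idx: label for label, idx in sorted_label2idx.items()}

print(sorted_label2idx)
```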

In [22]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, df, classes='unbalanced'):
        self.examples = df

        assert classes in ['balanced', 'unbalanced']
        if classes == 'balanced':
            # Upsample every class to the size of the most frequent one.
            counts = self.examples['label'].value_counts()
            max_examples = counts.max()
            for label, example_count in counts.items():
                group = self.examples[self.examples['label'] == label]
                while example_count < max_examples:
                    # Repeat rows from the start of the group until the counts match.
                    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
                    extra = group[:max_examples - example_count]
                    self.examples = pd.concat([self.examples, extra])
                    example_count += extra.shape[0]

    def __getitem__(self, index):
        example = self.examples.iloc[index]
        x = example.audio  # raw waveform as a NumPy array
        y = label2idx[example['label']]  # integer class index
        return x, y

    def __len__(self):
        return self.examples.shape[0]
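
The balancing step can also be sketched outside the class. The example below is an alternative, not the method used above: it draws the extra rows at random with replacement via `DataFrame.sample` instead of repeating rows from the start of each group. The `upsample_to_majority` helper and the toy DataFrame are illustrative assumptions.

```python
import pandas as pd

def upsample_to_majority(df, label_col='label', seed=0):
    """Return a DataFrame in which every class has as many rows as the largest class."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = [df]
    for label, count in counts.items():
        if count < target:
            group = df[df[label_col] == label]
            # Draw the missing rows at random, with replacement.
            parts.append(group.sample(n=target - count, replace=True, random_state=seed))
    return pd.concat(parts)

# Toy annotations: 4 'call' rows and 2 'no_call' rows.
toy = pd.DataFrame({'fn': [f'{i}.wav' for i in range(6)],
                    'label': ['call'] * 4 + ['no_call'] * 2})
balanced = upsample_to_majority(toy)
print(balanced['label'].value_counts())
```

Random sampling avoids always duplicating the same leading rows, at the cost of needing a fixed seed for reproducibility.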

Now that we have the components in place, let's work on our datasets.

In [23]:
anno = anno.sample(frac=1)

In [30]:
train_ds = Dataset(anno[:int(0.8*anno.shape[0])], classes='balanced')
valid_ds = Dataset(anno[int(0.8*anno.shape[0]):])
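
The shuffle above is unseeded, so the split changes between runs, and the class ratio in each split is left to chance. The following is a minimal sketch of a seeded, per-class (stratified) 80/20 split, offered as one possible alternative rather than the approach used in this notebook. The `stratified_split` helper and the toy DataFrame are hypothetical.

```python
import pandas as pd

def stratified_split(df, label_col='label', train_frac=0.8, seed=0):
    """Shuffle with a fixed seed, then split each class separately."""
    shuffled = df.sample(frac=1, random_state=seed)
    train_parts, valid_parts = [], []
    for _, group in shuffled.groupby(label_col):
        cut = int(train_frac * len(group))
        train_parts.append(group.iloc[:cut])
        valid_parts.append(group.iloc[cut:])
    return pd.concat(train_parts), pd.concat(valid_parts)

# Toy annotations: 10 'call' rows and 5 'no_call' rows.
toy = pd.DataFrame({'fn': [f'{i}.wav' for i in range(15)],
                    'label': ['call'] * 10 + ['no_call'] * 5})
train_df, valid_df = stratified_split(toy)
print(len(train_df), len(valid_df))
```

As in the cell above, any balancing would then be applied to the training split only, so that the validation split keeps the class ratio of the raw data.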

In [31]:
len(train_ds)

636

In [32]:
train_ds.examples['label'].value_counts()

no_call    318
call       318
Name: label, dtype: int64

Let's now construct the dataloaders to ensure everything works as expected.

In [33]:
train_dl = torch.utils.data.DataLoader(
    dataset=train_ds,
    batch_size=32,
    shuffle=True,
    num_workers=multiprocessing.cpu_count()-1
)

valid_dl = torch.utils.data.DataLoader(
    dataset=valid_ds,
    batch_size=32,
    shuffle=False,
    num_workers=multiprocessing.cpu_count()-1
)

In [34]:
for batch in train_dl: pass
for batch in valid_dl: pass

In [35]:
batch[0].shape, batch[1].shape # `batch` is the final batch of valid_dl; there were not enough examples to fill it

(torch.Size([23, 176400]), torch.Size([23]))
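
These shapes can be checked by hand. Below is a minimal sketch of the arithmetic, assuming the 594 annotated recordings, the 80/20 split, the batch size of 32, and the 4-second clips at 44,100 Hz described earlier. The variable names are illustrative.

```python
import math

n_total = 594          # 398 call + 196 no_call recordings
batch_size = 32
sample_rate = 44100    # Hz
clip_seconds = 4

n_train = int(0.8 * n_total)        # size of the training split before balancing
n_valid = n_total - n_train         # remaining rows form the validation split
samples_per_clip = sample_rate * clip_seconds

# Number of validation batches and the size of the last, partially filled one.
n_valid_batches = math.ceil(n_valid / batch_size)
last_batch = n_valid - (n_valid_batches - 1) * batch_size

print(n_valid, samples_per_clip, n_valid_batches, last_batch)
```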

## CPP suitability

This dataset contains a relatively small number of calls, and the callers are not identified. Additionally, there is a high level of background noise.

This dataset would not lend itself well to CPP analysis.