## Constructing the datasets

In [1]:
import librosa
import pandas as pd
import numpy as np
import multiprocessing

import torch

The audio was recorded at two sampling rates, 24414 Hz and 44100 Hz. We normalize all recordings to a sample rate of 24414 Hz by resampling them on load.

Let's also load the audio into the pandas dataframe alongside the annotations. The dataset is small enough to fit into our RAM, which means we don't have to read every example from disk each time we iterate over the dataset.

In [240]:
%%time

anno = pd.read_csv('data/annotations.csv')

audio = []

for _, row in anno.iterrows():
    recording, _ = librosa.load(f'data/{row.split}/{row.filename}', sr=24414)
    audio.append(recording)
    
anno['audio'] = audio

CPU times: user 22.7 s, sys: 812 ms, total: 23.5 s
Wall time: 23.5 s


The only annotations we have available for this dataset are the individual codenames. We will use these as our labels.

In [25]:
anno.head()

Unnamed: 0,class,split,filename,audio
0,TH,train,TH28.wav,"[0.006072998, 0.0051574707, 0.004119873, 0.002..."
1,TH,train,TH22.wav,"[0.00894165, 0.009246826, 0.0093688965, 0.0093..."
2,TH,train,TH928.wav,"[-0.006164551, -0.0059814453, -0.0056762695, -..."
3,TH,train,TH1145.wav,"[0.005554199, 0.005859375, 0.0061035156, 0.006..."
4,TH,valid,TH470.wav,"[-0.006164551, -0.0057678223, -0.005218506, -0..."


The shortest recording is 992 frames long. At a sample rate of 24414, this translates to 0.04 seconds of audio.

The longest recording is 41307 frames, which translates to 1.69 seconds of audio.
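
As a quick check of the arithmetic above, here is a minimal sketch that converts frame counts to durations. The helper name `frames_to_seconds` is hypothetical and not part of the notebook.

```python
SAMPLE_RATE = 24414  # sample rate the recordings are normalized to

def frames_to_seconds(n_frames, sample_rate=SAMPLE_RATE):
    """Convert a number of audio frames (samples) to a duration in seconds."""
    return n_frames / sample_rate

shortest = frames_to_seconds(992)    # shortest recording in the dataset
longest = frames_to_seconds(41307)   # longest recording in the dataset
print(f"shortest: {shortest:.2f} s, longest: {longest:.2f} s")
```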

Given this, we will provide four options for this dataset:
* sample a random 0.04 seconds from each call for each example (the **sample** option)
* cut each recording into examples of 0.04 seconds duration (the **cut** option); the number of new examples this produces depends on the total length of the recordings
* pad each example to the length of the longest example in the dataset (the **pad** option)
* take just the first 0.04 seconds of each call (the **first** option)
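
To make the **cut** option concrete, here is a minimal sketch that splits a recording into non-overlapping 992-frame windows and drops any shorter remainder. The function name `cut_into_windows` and the synthetic input are assumptions made for illustration; the notebook's own implementation lives inside the `Dataset` class.

```python
import numpy as np

WINDOW = 992  # length of the shortest recording, in frames

def cut_into_windows(recording, window=WINDOW):
    """Split a 1-D recording into non-overlapping windows, dropping the remainder."""
    n_windows = recording.shape[0] // window
    return recording[: n_windows * window].reshape(n_windows, window)

# Synthetic stand-in for a recording of 2500 frames.
demo = np.arange(2500, dtype=np.float32)
windows = cut_into_windows(demo)
print(windows.shape)  # two full windows; the last 516 frames are discarded
```

Longer calls contribute more examples under this scheme, so the class balance after cutting can differ from the balance of the raw recordings.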

Additionally, the classes are unbalanced.

In [31]:
anno['class'].value_counts()

TH    1345
MU    1017
IO    1002
SN    1001
AL     999
QU     975
BE     478
TW     468
Name: class, dtype: int64

We provide two versions of the dataset:
* **unbalanced** (examples represented in line with counts in the raw dataset)
* **balanced** (examples upsampled to the count of the most frequently occurring class)
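
One way to picture the upsampling is to repeat the examples of each smaller class until it matches the largest one. The sketch below does this on a toy list of labels using index arrays; `upsample_indices` is a hypothetical helper, and it is only comparable to, not identical with, the approach used in the notebook's `Dataset` class.

```python
import numpy as np

def upsample_indices(labels):
    """Return indices that repeat each class cyclically up to the largest class size."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    # np.resize repeats the index array as many times as needed to reach `target`.
    balanced = [np.resize(np.flatnonzero(labels == c), target) for c in classes]
    return np.concatenate(balanced)

toy_labels = np.array(['TH', 'TH', 'TH', 'BE', 'MU', 'MU'])  # toy data, not the real counts
idx = upsample_indices(toy_labels)
print(np.unique(toy_labels[idx], return_counts=True))
```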

In [141]:
class ExampleProcessor():
    def __init__(self, example_length):
        assert example_length in options.keys()
        self.example_length = example_length
    def __call__(self, example):
        return options[self.example_length](example)
    
    
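def first(example):
    # Keep the first 992 frames (the length of the shortest recording).
    return example[:992]

def sample(example):
    # Pick a random 992-frame window; randint's upper bound is exclusive,
    # so the last valid start frame is example.shape[0] - 992.
    start_frame = np.random.randint(example.shape[0] - 991)
    return example[start_frame:start_frame+992]

def pad(example):
    # Zero-pad on the right to the length of the longest recording (41307 frames),
    # keeping the input dtype so that all options return the same dtype.
    out = np.zeros(41307, dtype=example.dtype)
    out[:example.shape[0]] = example
    return out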

options = {
    'first': first,
    'sample': sample,
    'pad': pad
}

Next, we create a mapping from codename to class index.

In [142]:
codenames = anno['class'].unique().tolist()

codename2idx = {codename: idx for idx, codename in enumerate(codenames)}
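
For reading model outputs back as codenames, the inverse mapping can be handy. This is a small sketch on a hypothetical list of codenames; the names prefixed with `demo_` are assumptions, and the actual index order in the notebook depends on the order in which classes first appear in the annotations.

```python
# Hypothetical codename order, for illustration only.
demo_codenames = ['TH', 'MU', 'IO', 'SN']

demo_codename2idx = {codename: idx for idx, codename in enumerate(demo_codenames)}
# Invert the mapping so that a class index can be turned back into a codename.
demo_idx2codename = {idx: codename for codename, idx in demo_codename2idx.items()}

print(demo_idx2codename[demo_codename2idx['IO']])
```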

In [254]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, df, example='sample', classes='unbalanced'):
        if example == 'cut':
            cls = []
            audio = []

            for _, row in df.iterrows():
                recording = row.audio
                # Slice off consecutive 992-frame examples; a shorter trailing remainder is dropped.
                while recording.shape[0] >= 992:
                    cls.append(row['class'])
                    audio.append(recording[:992])
                    recording = recording[992:]
            df = pd.DataFrame({'class': cls, 'audio': audio})
            example = 'first'
            
        self.examples = df
        self.example_processor = ExampleProcessor(example)
        
        assert classes in ['balanced', 'unbalanced']
        if classes == 'balanced':
            # Upsample every class to the size of the largest one by repeating its rows.
            # pd.concat is used because DataFrame.append was removed in pandas 2.0.
            max_examples = self.examples['class'].value_counts().max()
            for codename in self.examples['class'].unique():
                class_examples = self.examples[self.examples['class'] == codename]
                while class_examples.shape[0] < max_examples:
                    shortfall = max_examples - class_examples.shape[0]
                    self.examples = pd.concat([self.examples, class_examples[:shortfall]])
                    class_examples = self.examples[self.examples['class'] == codename]
        
    def __getitem__(self, index):
        example = self.examples.iloc[index]
        x = self.example_processor(example.audio)
        y = codename2idx[example['class']]
        return x, y

    def __len__(self):
        return self.examples.shape[0]

Now that we have the components in place, let's work on our datasets.

In [255]:
train_ds = Dataset(anno[anno.split == 'train'], example='cut', classes='balanced')
valid_ds = Dataset(anno[anno.split == 'valid'], example='cut')

In [256]:
len(train_ds)

114904

In [257]:
train_ds.examples['class'].value_counts()

AL    14363
TH    14363
IO    14363
SN    14363
QU    14363
TW    14363
MU    14363
BE    14363
Name: class, dtype: int64

Let's now construct the dataloaders to ensure everything works as expected.

In [258]:
train_dl = torch.utils.data.DataLoader(
    dataset=train_ds,
    batch_size=32,
    shuffle=True,
    num_workers=multiprocessing.cpu_count()-1
)

valid_dl = torch.utils.data.DataLoader(
    dataset=valid_ds,
    batch_size=32,
    shuffle=False,
    num_workers=multiprocessing.cpu_count()-1
)

In [259]:
for batch in train_dl: pass
for batch in valid_dl: pass

In [260]:
batch[0].shape, batch[1].shape # we are on the final batch, there were not enough examples to fill it

(torch.Size([24, 992]), torch.Size([24]))
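
The size of a partial final batch follows directly from the dataset length and the batch size. The sketch below uses a hypothetical helper, `last_batch_size`, and assumes a non-empty dataset. The training set length reported above (114904) leaves a remainder of 24 with a batch size of 32; the validation set length is not shown in this notebook, so that figure is used here purely as an example input.

```python
def last_batch_size(n_examples, batch_size):
    """Size of the final batch when the last, possibly partial, batch is kept."""
    remainder = n_examples % batch_size
    # A remainder of zero means the final batch is a full one.
    return remainder if remainder else batch_size

print(last_batch_size(114904, 32))
```

If a partial batch is undesirable, `torch.utils.data.DataLoader` accepts `drop_last=True`, which discards it.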