In this competition, we don't really have a validation set. The competition organizers were open about the fact that the data we can use to evaluate our model comes from a different distribution than the test set!

As a side note, it is really great that the competition organizers chose transparency. Not being open about how you sample what looks like a validation set, or going about it in a 'tricky' way, is a sure route to hurting the competition (or any ML project in general).

So we know we don't have a validation set. We do understand where the train set comes from, which gives us some idea of what we can do to improve the performance of our model.

Still, the data provided is the closest thing we have to a validation set, at least for now. Let's put our model to the test and see what we can learn.

Also, we could import `AudioDataset`, the metrics, and so forth from our library. But we will not have access to it on Kaggle, so the rule of engagement is that everything we need must be defined in this notebook.

In [66]:
import pandas as pd
import torch
from fastai2.vision.all import *
import soundfile as sf

In [2]:
mean, std = (-6.132126808166504e-05, 0.04304003225515465)

In [16]:
classes = pd.read_pickle('data/classes.pkl')

In [17]:
get_arch = lambda: nn.Sequential(*[
    Lambda(lambda x: x.unsqueeze(1)),
    ConvLayer(1, 16, ks=64, stride=2, ndim=1),
    ConvLayer(16, 16, ks=8, stride=8, ndim=1),
    ConvLayer(16, 32, ks=32, stride=2, ndim=1),
    ConvLayer(32, 32, ks=8, stride=8, ndim=1),
    ConvLayer(32, 64, ks=16, stride=2, ndim=1),
    ConvLayer(64, 128, ks=8, stride=2, ndim=1),
    ConvLayer(128, 256, ks=4, stride=2, ndim=1),
    ConvLayer(256, 256, ks=4, stride=4, ndim=1),
    Flatten(),
    LinBnDrop(5120, 512, p=0.25, act=nn.ReLU()),
    LinBnDrop(512, 512, p=0.25, act=nn.ReLU()),
    LinBnDrop(512, 256, p=0.25, act=nn.ReLU()),
    LinBnDrop(256, len(classes)),
    nn.Sigmoid()
])
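Where does the `5120` in the first `LinBnDrop` come from? Here is a rough sketch that traces the sequence length through the conv stack. It assumes 5-second clips at 32 kHz (the `SAMPLE_RATE` used further down) and that `ConvLayer` pads with `(ks - 1) // 2` by default; `conv_out_len` is a helper made up for this illustration.

```python
# Sketch: follow the time dimension through the eight ConvLayers.
# Assumptions: input is 5 s at 32 kHz, default padding is (ks - 1) // 2.
def conv_out_len(length, ks, stride):
    padding = (ks - 1) // 2
    return (length + 2 * padding - ks) // stride + 1

# (kernel size, stride, output channels) for each ConvLayer in get_arch
layers = [(64, 2, 16), (8, 8, 16), (32, 2, 32), (8, 8, 32),
          (16, 2, 64), (8, 2, 128), (4, 2, 256), (4, 4, 256)]

length = 32_000 * 5  # samples in one 5-second clip
for ks, stride, channels in layers:
    length = conv_out_len(length, ks, stride)

flattened = channels * length  # what Flatten() hands to the first linear layer
print(length, flattened)  # 20 5120
```

If the clip length or sample rate changes, this number changes with it, and the first linear layer has to be resized accordingly.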

In [177]:
model = get_arch()
model.load_state_dict(torch.load('data/models/first_model.pth'))
model.cuda()
model.eval();

In [94]:
valid_df = pd.read_csv('data/example_test_audio_summary.csv')
test_df = pd.read_csv('data/test.csv')

In [95]:
valid_df.head()

Unnamed: 0,filename_seconds,birds,filename,seconds
0,BLKFR-10-CPL_20190611_093000_5,gockin mouchi westan,BLKFR-10-CPL,5
1,BLKFR-10-CPL_20190611_093000_10,gockin westan,BLKFR-10-CPL,10
2,BLKFR-10-CPL_20190611_093000_15,gockin westan,BLKFR-10-CPL,15
3,BLKFR-10-CPL_20190611_093000_20,mouchi,BLKFR-10-CPL,20
4,BLKFR-10-CPL_20190611_093000_25,mouchi,BLKFR-10-CPL,25


In [96]:
valid_df.filename = valid_df.filename_seconds.apply(lambda fn: '_'.join(fn.split('_')[:-1]))
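The line above relies on the clip offset being the last underscore-separated field of `filename_seconds`. A minimal sketch of the same idea, using `rsplit` so that both the recording stem and the offset come out in one step (`split_row_id` is a hypothetical helper, not part of the notebook):

```python
def split_row_id(filename_seconds):
    # Split only on the last underscore: everything before it is the recording stem,
    # the final field is the clip offset in seconds.
    stem, seconds = filename_seconds.rsplit('_', 1)
    return stem, int(seconds)

print(split_row_id('BLKFR-10-CPL_20190611_093000_5'))
# ('BLKFR-10-CPL_20190611_093000', 5)
```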

In [98]:
# assign through .loc[rows, column] so the change lands on valid_df itself rather than on a temporary copy
valid_df.loc[valid_df.filename == 'BLKFR-10-CPL_20190611_093000', 'filename'] = 'BLKFR-10-CPL_20190611_093000.pt540'
valid_df.loc[valid_df.filename == 'ORANGE-7-CAP_20190606_093000', 'filename'] = 'ORANGE-7-CAP_20190606_093000.pt623'

In [99]:
test_df.head()

Unnamed: 0,site,row_id,seconds,audio_id
0,site_1,site_1_0a997dff022e3ad9744d4e7bbf923288_5,5,0a997dff022e3ad9744d4e7bbf923288
1,site_1,site_1_0a997dff022e3ad9744d4e7bbf923288_10,10,0a997dff022e3ad9744d4e7bbf923288
2,site_1,site_1_0a997dff022e3ad9744d4e7bbf923288_15,15,0a997dff022e3ad9744d4e7bbf923288


In [100]:
valid_df.loc[valid_df.birds.isna(), 'birds'] = ''

In [101]:
valid_df.shape

(153, 4)

In [108]:
items = [(row.birds.split(), f'data/example_test_audio/{row.filename}.mp3', row.seconds) for idx, row in valid_df.iterrows()]

In [125]:
SAMPLE_RATE = 32_000

from torch.utils.data import Dataset

class AudioDataset(Dataset):
    def __init__(self, items, classes, mean=None, std=None):
        self.items = items
        self.vocab = classes
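        # compare against None so that a mean of exactly 0.0 does not silently disable normalization
        self.do_norm = mean is not None and std is not None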
        self.mean = mean
        self.std = std
    def __getitem__(self, idx):
        cls, path, offset = self.items[idx]
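        # soundfile counts in frames, not seconds (this assumes the file is stored at SAMPLE_RATE)
        x, _ = sf.read(path, frames=SAMPLE_RATE*5, start=offset*SAMPLE_RATE)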
        if self.do_norm: x = self.normalize(x)
        return x.astype(np.float32), self.one_hot_encode(cls)
    def normalize(self, x):
        return (x - self.mean) / self.std
    def one_hot_encode(self, cls):
        one_hot = np.zeros((len(self.vocab)))
        for bird in cls:
            if bird == 'unk': continue
            bird_idx = self.vocab.index(bird)
            one_hot[bird_idx] = 1
        return one_hot
    def __len__(self):
        return len(self.items)

In [126]:
valid_ds = AudioDataset(items, classes, mean=mean, std=std)

In [111]:
len(valid_ds)

153

In [112]:
from torch.utils.data import DataLoader

dl = DataLoader(valid_ds, batch_size=128)

for batch in dl:
    break

RuntimeError: Error opening 'data/example_test_audio/BLKFR-10-CPL_20190611_093000.pt540.mp3': File contains data in an unknown format.

Oh no! Turns out `soundfile` is unable to read mp3 files! Switching to librosa... But that means we need to go back and use librosa everywhere now, and also save the train files in an mp3 format. Or even better, figure out how to use the train files without concatenating them together.

In [113]:
import librosa

In [115]:
librosa.load('data/example_test_audio/BLKFR-10-CPL_20190611_093000.pt540.mp3', sr=SAMPLE_RATE)[0]



array([0.01870877, 0.02787147, 0.03086149, ..., 0.0399581 , 0.03838276,
       0.0199724 ], dtype=float32)

In [156]:
SAMPLE_RATE = 32_000

from torch.utils.data import Dataset

class AudioDataset(Dataset):
    def __init__(self, items, classes, mean=None, std=None):
        self.items = items
        self.vocab = classes
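        # compare against None so that a mean of exactly 0.0 does not silently disable normalization
        self.do_norm = mean is not None and std is not None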
        self.mean = mean
        self.std = std
    def __getitem__(self, idx):
        cls, path, offset = self.items[idx]
        x, _ = librosa.load(path, sr=SAMPLE_RATE, duration=5, offset=offset)
        if self.do_norm: x = self.normalize(x)
        return x.astype(np.float32), self.one_hot_encode(cls)
    def normalize(self, x):
        return (x - self.mean) / self.std
    def one_hot_encode(self, cls):
        one_hot = np.zeros((len(self.vocab)))
        for bird in cls:
            if bird in ['unk', 'whhwoo', 'squirrel', 'mouqua', 'hawo']: continue
            bird_idx = self.vocab.index(bird)
            one_hot[bird_idx] = 1
        return one_hot
    def __len__(self):
        return len(self.items)

In [157]:
valid_ds = AudioDataset(items, classes, mean=mean, std=std)
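The hard-coded skip list in `one_hot_encode` covers the labels we happened to run into. A possible alternative, sketched below under the assumption that any label missing from the vocabulary should simply be ignored, is to skip unknown labels generically and report them. `multi_hot` and `toy_vocab` are made up for this illustration.

```python
def multi_hot(labels, vocab):
    # Map each class name to its position once, instead of calling vocab.index per label.
    index = {name: i for i, name in enumerate(vocab)}
    encoded = [0.0] * len(vocab)
    skipped = []
    for label in labels:
        if label in index:
            encoded[index[label]] = 1.0
        else:
            skipped.append(label)  # not in the vocabulary, so it cannot be a target
    return encoded, skipped

toy_vocab = ['gockin', 'mouchi', 'westan']  # hypothetical 3-class vocabulary
encoded, skipped = multi_hot(['gockin', 'westan', 'squirrel'], toy_vocab)
print(encoded, skipped)  # [1.0, 0.0, 1.0] ['squirrel']
```

Returning the skipped labels makes it easy to see how much of the evaluation data the model could never get right in the first place.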

Unfortunately, the files in test_audio have a sampling rate other than 32_000... AFAICT, neither librosa nor soundfile can write mp3 files. This means every time we load an example, librosa will perform resampling...

I guess we can live with this bit of slowness here and hope this will be much faster on Kaggle, where hopefully we will not need to perform resampling on the fly.
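If the slowness ever becomes a problem locally, one option (not what this notebook does) would be to cache each decoded clip on disk the first time it is loaded. A minimal sketch, where `cached_clip` and its `loader` argument are hypothetical; in this notebook the loader could be something like `lambda p, o, sr: librosa.load(p, sr=sr, offset=o, duration=5)[0]`.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def cached_clip(path, offset, sample_rate, loader, cache_dir):
    """Return the decoded clip, calling `loader` only on a cache miss."""
    key = hashlib.md5(f'{path}|{offset}|{sample_rate}'.encode()).hexdigest()
    cache_file = Path(cache_dir) / f'{key}.pkl'
    if cache_file.exists():
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    clip = loader(path, offset, sample_rate)  # the slow decode + resample step
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(clip, f)
    return clip

# Demo with a stand-in loader so the sketch runs without any audio files.
calls = []
def fake_loader(path, offset, sample_rate):
    calls.append((path, offset))
    return [0.0] * 4  # stand-in for decoded samples

with tempfile.TemporaryDirectory() as tmp:
    first = cached_clip('a.mp3', 5, 32_000, fake_loader, tmp)
    second = cached_clip('a.mp3', 5, 32_000, fake_loader, tmp)

print(len(calls))  # 1
```

This sidesteps the need to write mp3 files at all, at the cost of extra disk space for the cached clips.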

In [230]:
from torch.utils.data import DataLoader
import multiprocessing
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

dl = DataLoader(valid_ds, batch_size=128, num_workers=multiprocessing.cpu_count() // 4)

preds = []
targs = []

for batch in dl:
    with torch.no_grad():
        preds.append(model(batch[0].cuda()).cpu().detach())
    targs.append(batch[1])

In [231]:
preds = torch.cat(preds)
targs = torch.cat(targs)

In [310]:
def preds_to_tp_fp_fn(preds, targs):
    positives = (preds >= 0.5)
    true_positives = positives[targs == 1]
    false_positives = positives[targs != 1]
    negatives = ~positives
    false_negatives = negatives[targs == 1]
    return true_positives.sum(), false_positives.sum(), false_negatives.sum()

def precision(preds, targs):
    tp, fp, fn = preds_to_tp_fp_fn(preds, targs)
    return (tp.float() / (tp + fp)).item()

def recall(preds, targs):
    tp, fp, fn = preds_to_tp_fp_fn(preds, targs)
    return (tp.float() / (tp + fn)).item()

def f1(preds, targs, eps=1e-8):
    prec = precision(preds, targs)
    rec = recall(preds, targs)
    return 2 * (prec * rec) / (prec + rec + eps)

In [311]:
recall(preds, targs)

0.0

In [312]:
precision(preds, targs)

0.0

In [313]:
f1(preds, targs)

0.0

In [314]:
preds_to_tp_fp_fn(preds, targs)

(tensor(0), tensor(48), tensor(106))
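To spell out what these counts mean: the model made 48 positive predictions, none of them correct, and missed all 106 true labels. A small sketch that recomputes the three metrics from the raw counts (`prf_from_counts` is a helper made up here; returning 0.0 for an empty denominator is a convention chosen for this sketch, whereas the functions above would produce `nan` in that case):

```python
def prf_from_counts(tp, fp, fn):
    # Guard each denominator so a model that predicts nothing does not divide by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

print(prf_from_counts(0, 48, 106))  # (0.0, 0.0, 0.0)
```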

This doesn't seem like stellar performance from our model!

We could have some bug in our training code. We could have an issue in how we calculate our metrics. But my guess is that the model cheated during training through how we sampled the validation set. Maybe it was not fitting to the calls at all but was learning some other characteristics of the audio files?
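The metrics hypothesis is the cheapest one to rule out: run the counting logic on a tiny hand-built case where the answer is known. The sketch below is a pure-Python stand-in for `preds_to_tp_fp_fn` (same 0.5 threshold, nested lists instead of tensors); `count_tp_fp_fn` and the toy data are made up for this check.

```python
def count_tp_fp_fn(preds, targs, threshold=0.5):
    tp = fp = fn = 0
    for pred_row, targ_row in zip(preds, targs):
        for p, t in zip(pred_row, targ_row):
            predicted = p >= threshold
            if predicted and t == 1:
                tp += 1  # predicted and present
            elif predicted:
                fp += 1  # predicted but absent
            elif t == 1:
                fn += 1  # present but not predicted
    return tp, fp, fn

toy_preds = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]]
toy_targs = [[1, 0, 0], [0, 1, 1]]
print(count_tp_fp_fn(toy_preds, toy_targs))  # (2, 1, 1)
```

Feeding the same toy values, wrapped in tensors, to `preds_to_tp_fp_fn` should give the same three counts; if it does, the zeros above are more likely coming from the model or the data than from the metric code.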

Nonetheless, this is good. By making a pass through the entire pipeline, all the way to predictions, we have identified quite a few things that were not apparent from the start, and hopefully we will be able to use this information to iterate on the design.

Time to move the weights to Kaggle and create the piece for outputting the predictions there.